Tutorial: Localized RAG Chatbot with ACCESS HPC
This tutorial shows how to set up an open-source, customizable RAG chatbot that answers questions about documents you choose. It uses Indiana University's Jetstream 2 system, but should work on any major ACCESS HPC.
This is a tutorial on how to set up an open-source, fully customizable RAG chatbot that can answer questions about any documents you choose. This could be useful for many applications; for example, it could answer technical questions for new members of a research lab by pulling from the lab's funding applications and research design documents, or answer questions about a class or subject by pulling from a textbook.
This tutorial is meant for people with little to no experience with HPCs or coding. If you are experienced and want to skip the explanations, feel free to jump right to the Quickstart section below.
Why Use a Localized RAG Chatbot?
This type of chatbot offers a variety of advantages over a SaaS solution like ChatGPT, including:
Data Privacy: Keeps sensitive documents (e.g., research, technical papers) secure within your own system.
Vector Storage and Retrieval: The vector store lets the chatbot answer very specific, detailed questions grounded in the documents you provide, instead of relying only on whatever happens to be in the model's training data.
Customization: You can modify the prompts and logic flow of this bot to tailor responses to specific datasets or specialized fields.
No External Limitations: Bypasses restrictions imposed by cloud services (e.g., content moderation).
High-Security Applications: Ideal for research labs or industries with strict confidentiality needs.
Step 1: Provision HPC Resources
Of course, you will need some compute resources to run this chatbot on. ACCESS compute resources are a great fit for a project like this because of their accessibility and ease of use: you can pick from a large variety of university-run HPC systems and easily provision the resources you need from anywhere in the world!
This tutorial uses Jetstream 2, but any major ACCESS HPC should work.
Jetstream 2 Provision
Log in to Jetstream 2 Exosphere and provision a new Instance. The m3.medium configuration should suffice for this project.
Select the m3.medium Instance configuration, leave all other settings at their defaults, and click the Create button at the bottom.
Back on your Instances page, you should see your instance building and starting up, and after about a minute it should look like this:
Click Connect to -> Web Shell and continue on to the next step.
Step 2: Chatbot Setup
The code for this chatbot uses Llamafile to host the LLM and embedding models, and FAISS as the vector store. It is a modified version of a llamafile RAG example published by Mozilla, with some changes for ease of use.
First, clone the GitHub repository to your Instance:
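The exact repository URL depends on where the code lives; the one below is a placeholder, so substitute the real URL:

```bash
# clone the chatbot code onto your instance (placeholder URL) and move into it
git clone https://github.com/<your-username>/<rag-chatbot-repo>.git
cd <rag-chatbot-repo>
```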
Then, run the setup script:
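Assuming the script is the setup.sh referenced later in the Troubleshooting section, that looks like:

```bash
# run the setup script from the repository root
bash setup.sh
```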
This should set up a Python virtual environment and install the required packages:
Then, it should download the required embedding and generation model files:
Next, we need to give the model some documents we want it to reference when we talk to it. You can upload anything you want; for this example I will use an informational PDF on SLAC National Laboratory's FACET-II facility, which should be a good demonstration of how vector search can help a chatbot answer questions it would normally not know the answer to.
To do this, I move into the local_data directory and download the PDF with wget:
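For example (the URL below is a placeholder; point wget at whichever document you want the chatbot to index):

```bash
# move into the directory the chatbot reads from and fetch a document
cd local_data
wget https://example.org/facet-ii-overview.pdf   # placeholder URL
cd ..
```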
And that's it! You are now ready to use your chatbot. Start it with the app.sh script and enjoy:
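From the repository root, that is simply:

```bash
# start the llamafile servers and the chatbot
bash app.sh
```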
It should take around 40 seconds for the llamafile servers to start up, and then you will be greeted with a prompt in the terminal:
Let's ask it about something specific to FACET-II, like plasma wakefield acceleration:
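For example, a question along these lines (the exact wording is up to you):

```
> What is plasma wakefield acceleration, and how is it used at FACET-II?
```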
You'll see the chunks that the vector search returns, as well as the full prompt given to the chatbot, with the chatbot's response at the bottom. In this case, the chatbot responded with a succinct summary of the plasma wakefield technology used at FACET-II, along with the analogy of particles "surfing" on a wave.
Plasma wakefield acceleration is a method for accelerating charged particles, such as electrons, to very high energies in very short distances. This is achieved by making the charged particles "surf" on waves of plasma, a hot, ionized gas. The plasma is created by applying a high-power laser pulse or RF (radio frequency) wave to a gas. The plasma then forms a wakefield, which is a wave of electric and magnetic fields. The charged particles are then accelerated by this wakefield, gaining very high energies over a given distance. This approach has the potential to significantly reduce the size and cost of particle accelerators compared to current technologies. Research at SLAC has demonstrated that a plasma can accelerate electrons to 1,000 times greater energies over a given distance than current technologies can manage.
Using the chatbot
Now that the chatbot is up and running, feel free to play around with it and modify it to your specific requirements. Any time you update the local_data directory, the app.sh script will detect the change when the chatbot starts and rebuild the vector store with the new information. If you want to change the prompt the chatbot uses, it can be found on line 171 of app.py.
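For instance, to add another document and have it picked up (the file name here is just an example):

```bash
# drop a new document into local_data, then restart the chatbot;
# app.sh should notice the change and rebuild the vector store
cp ~/my_new_doc.pdf local_data/
bash app.sh
```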
Quickstart
If you already have experience with coding and/or HPCs, here are all the steps you need to get up and running in one place on your machine (a consolidated command sketch follows the steps):
Clone the GitHub repo & run setup
Add your data to local_data
Run the chatbot
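As one consolidated sketch (the repository and document URLs are placeholders):

```bash
# clone and set up
git clone https://github.com/<your-username>/<rag-chatbot-repo>.git
cd <rag-chatbot-repo>
bash setup.sh

# add your documents to local_data (placeholder URL)
wget -P local_data https://example.org/your-document.pdf

# start the chatbot
bash app.sh
```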
That's it! It really is that easy!
Troubleshooting
If you're running into any issues, feel free to open an issue on the GitHub repository or send me an email at benjamin.chauhan@duke.edu and I'd love to help you out!
Some common issues you might run into:
If you're running this on your own machine or a smaller instance, you might run out of storage or RAM when downloading and running these models. Make sure you have enough space on your machine, and if you can't free up enough space for these models, consider replacing them with smaller models in setup.sh. (This is why I recommend using an ACCESS HPC.)
If the llamafile servers are not running properly, you might have port conflicts. Consider changing the ports in your .env file to ports that are currently free on your local system.
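As a rough sketch, that change would look something like this; the variable names are illustrative, so use whatever names your .env file actually defines:

```bash
# .env (example only): point the llamafile servers at ports that are free on your system
LLM_PORT=8081
EMBEDDING_PORT=8082
```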