
Using LLMs on private data, all locally

This project is a learning exercise on using large language models (LLMs) to retrieve information from private data, running all pieces (including the LLM) locally. The goal is to run an LLM locally and use it to answer questions about a set of files on your computer. The files can be any type of document, such as PDF, Word, or text files.

This method of combining LLMs and private data is known as retrieval-augmented generation (RAG). It was introduced in this paper.

Credit where credit is due: I based this project on the original privateGPT (what they now call the primordial version). I reimplemented the pieces to understand how they work. See more in the sources section.

What we are trying to achieve: given a set of files on a computer (A), we want a large language model (B) running on that computer to answer questions (C) about them.

What we are trying to achieve

However, we cannot feed the files directly to the model. Large language models (LLMs) have a context window (their working memory) that limits how much information we can feed into them. To overcome that limitation, we split the files into smaller pieces, called chunks, and feed only the relevant ones to the model (D).
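
As an illustration, chunking can be as simple as slicing the text into fixed-size, overlapping windows. The sketch below is plain Python for illustration only; the project itself uses a library document splitter, described in the ingestion section.

# Minimal chunking sketch: fixed-size windows with a small overlap so that
# sentences cut at a boundary still appear intact in at least one chunk.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Hypothetical usage with a file from the data folder.
document = open("data/example.txt", encoding="utf-8").read()
chunks = chunk_text(document)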

Solution part 1

But then the question becomes: how do we find the relevant chunks? We use similarity search (E) to match the question and the chunks. Similarity search, in turn, requires vector embeddings (F), a representation of words as vectors that encode semantic relationships (technically, a dense vector embedding, not to be confused with sparse vector representations such as bag-of-words and TF-IDF). Once we have the relevant chunks, we combine them with the question to create a prompt (G) that instructs the LLM to answer the question.
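
To make similarity search concrete, here is a minimal sketch using the sentence-transformers package. The model name is only an example, not necessarily the embedding model this project uses.

import numpy as np
from sentence_transformers import SentenceTransformer

# Example embedding model; the project may use a different one.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "The ingestion step splits documents into chunks.",
    "Vector embeddings encode the meaning of each chunk.",
    "The weather report calls for rain tomorrow.",
]
question = "How are documents prepared for retrieval?"

# Normalized embeddings turn cosine similarity into a plain dot product.
chunk_vecs = model.encode(chunks, normalize_embeddings=True)
question_vec = model.encode(question, normalize_embeddings=True)

scores = chunk_vecs @ question_vec
top = np.argsort(scores)[::-1][:2]  # indices of the two most similar chunks
print([chunks[i] for i in top])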

Solution part 2

We need one last piece: persistent storage. Creating embeddings for the chunks takes time. We don't want to do that every time we ask a question. Therefore, we need to save the embeddings and the original text (the chunks) in a vector store (or database) (H). The vector store can grow large because it stores the original text chunks and their vector embeddings. We use a vector index (I) to find relevant chunks efficiently.
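
As a sketch of the storage step, the snippet below persists chunks and their embeddings with Chroma, shown here only as an example of a local vector store; the collection name is made up, and the vector_store folder name matches the one mentioned in the usage instructions.

import chromadb

# Persist chunks and embeddings to disk so ingestion runs only once.
client = chromadb.PersistentClient(path="vector_store")
collection = client.get_or_create_collection("documents")

# chunks and chunk_vecs are the outputs of the chunking and embedding steps above.
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=[vec.tolist() for vec in chunk_vecs],
)

# At question time: return the most similar chunks and their original text.
results = collection.query(query_embeddings=[question_vec.tolist()], n_results=2)
print(results["documents"])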

Solution part 3

Now we have all the pieces we need.

We can divide the implementation into three distinct steps, organized as a pipeline:

┌────────────────────────────────────────────────────────────────┐
│                              RAG Pipeline                      │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  ┌─────────────┐      ┌──────────────┐      ┌──────────────┐   │
│  │   INGEST    │      │   RETRIEVE   │      │   GENERATE   │   │
│  │ (ingest.py) │      │(retrieve.py) │      │(generate.py) │   │
│  ├─────────────┤      ├──────────────┤      ├──────────────┤   │
│  │ Load docs   │      │ Similarity   │      │ Build prompt │   │
│  │ Chunk text  │─────▶│ search in    │─────▶│ Query LLM    │   │
│  │ Embed       │      │ vector store │      │ Return answer│   │
│  │ Store       │      │              │      │              │   │
│  └─────────────┘      └──────────────┘      └──────────────┘   │
│        │                     │                     │           │
│        ▼                     ▼                     ▼           │
│  ┌───────────┐         ┌───────────┐         ┌───────────┐     │
│  │  Vector   │         │ Relevant  │         │  Answer   │     │
│  │   Store   │         │  Chunks   │         │           │     │
│  └───────────┘         └───────────┘         └───────────┘     │
│                                                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                      pipeline.py                         │  │
│  │           Orchestrates retrieve.py + generate.py         │  │
│  │                    via pipeline.run()                    │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                │
└────────────────────────────────────────────────────────────────┘
  1. Ingestion (ingest.py): Divide local files into smaller chunks that fit into the LLM input size (context window). Create vector embeddings for each chunk. Save the results in a vector store (database). This is a one-time step per document.
  2. Retrieval (retrieve.py): Given a user prompt, use similarity search to find the most relevant chunks from the vector store.
  3. Generation (generate.py): Combine the user prompt with the relevant chunks into an LLM prompt and query the model to generate an answer.

The pipeline (pipeline.py) orchestrates the retrieval and generation steps. It provides a simple interface via pipeline.run(prompt) that handles the full flow: find relevant chunks, then generate an answer.
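
A hypothetical call into the pipeline might look like this; the exact return value is an assumption, based on the design section's note that it includes the answer, the source chunks, and timing information.

import pipeline

# Assumed return shape, for illustration only.
answer, sources, elapsed = pipeline.run("What is retrieval-augmented generation?")
print(answer)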

This separation allows each component to be tested and modified independently. For example, you can swap the LLM without touching the retrieval logic, or change the vector store without affecting generation.

These steps are illustrated in the following diagram.

Ingestion and retrieval

How to use this project

There are two ways to use this project:

  1. Command line interface: use this one to see more logs and understand what is going on (see the --verbose flag below).
  2. Streamlit app: use this one for a more user-friendly experience.

The first time you run the commands, they may take a while to complete because some pieces, such as the embedding model, are downloaded. Subsequent runs will be faster.

Command-line interface

If you haven't done so yet, prepare the environment. If you have already prepared the environment, activate it with source venv/bin/activate.

  1. Copy the files you want to use into the data folder.
  2. Run python main.py ingest --verbose to ingest the files into the vector store.
    1. Review the PDF parsing section if you get an error when ingesting PDF files.
  3. Run python main.py ask --verbose to ask questions on the documents. It will prompt you for a question.

The --verbose flag shows more details on what the program is doing behind the scenes.

To re-ingest the data, delete the vector_store folder and run python main.py ingest again.

Review the PDF parsing section if you get an error when ingesting PDF files.

Streamlit app

If you haven't done so yet, prepare the environment. If you have already prepared the environment, activate it with source venv/bin/activate.

Run streamlit run app.py. It will open the app in a browser window.

This command may fail the first time you run it. There is a glitch somewhere in how the Python environment works together with pyenv. If Streamlit shows a "cannot import module" message, deactivate the Python environment with deactivate, activate it again with source venv/bin/activate, and run streamlit run app.py again.

It will take a few minutes to show the UI the first time you run it because it will download the embedding model. Subsequent runs will be faster.

Review the PDF parsing section if you get an error when ingesting PDF files.

Design

Ingesting data

If you haven't done so yet, prepare the environment. If you have already prepared the environment, activate it with source venv/bin/activate.

Command: python main.py ingest [--verbose]

The goal of this stage is to make the data searchable. However, the user's question and the data contents may not match exactly. Therefore, we cannot use a simple search engine. We must perform a similarity search supported by vector embeddings. The vector embedding is the most important part of this stage.

Ingesting data has the following steps:

  1. Load the file: a document reader that matches the document type is used to load the file. At this point, we have an array of characters with the file contents (a "document" from now on). Metadata, pictures, etc., are ignored.
  2. Split the document into chunks: a document splitter divides the document into chunks of the specified size. We need to split the document to fit the context size of the model (and to send fewer tokens when using a paid model). The exact size of each chunk depends on the document splitter. For example, a sentence splitter attempts to split at the sentence level, making some chunks smaller than the specified size.
  3. Create vector embeddings for each chunk: an embedding model creates a vector embedding for each chunk. This is the crucial step that allows us to find the most relevant chunks to help answer the question.
  4. Save the embeddings into the vector database (store): persist all the work we did above so we don't have to repeat it in the future.
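
The load-and-split steps above might look like the sketch below, assuming LangChain-style loaders and splitters (the splitter classes mentioned under future improvements are LangChain classes). The imports, file name, and chunk sizes are illustrative and version dependent.

from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load one document; in practice this runs for every file in the data folder.
docs = UnstructuredPDFLoader("data/example.pdf").load()  # hypothetical file name

# Split into chunks small enough for the LLM context window (example sizes).
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

texts = [chunk.page_content for chunk in chunks]  # input to the embedding step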

Future improvements:

  • More intelligent document parsing. For example, do not mix figure captions with the section text; do not parse the reference section (alternatively, replace the inline references with the actual reference text).
  • Improve parallelism. Ideally, we want to run the entire workflow (load document, chunk, embed, persist) in parallel for each file. This requires a solution that parallelizes not only I/O-bound but also CPU-bound tasks. The vector store must also support multiple writers.
  • Try different chunking strategies, e.g. check if sentence splitters (NLTKTextSplitter or SpacyTextSplitter) improve the answers.
  • Choose chunking size based on the LLM input (context) size. It is currently hardcoded to a small number, which may affect the quality of the results. On the other hand, it saves costs on the LLM API. We need to find a balance.
  • Automate the ingestion process: detect if there are new or changed files and ingest them.

Retrieving and generating (the pipeline)

If you haven't done so yet, prepare the environment. If you have already prepared the environment, activate it with source venv/bin/activate.

Command: python main.py ask [--verbose]

The goal of this stage is to retrieve information from the local data and generate an answer. This is handled by the pipeline, which orchestrates two separate modules:

Retrieval (retrieve.py): Find the most relevant chunks from the vector store using similarity search.

Generation (generate.py): Build a prompt combining the user's question with the retrieved chunks, then query the LLM to generate an answer.
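
A sketch of this step, assuming the GPT4All Python bindings and the TinyLlama model suggested later in this README; the prompt template, variable names, and parameters are illustrative, not the project's actual ones.

from gpt4all import GPT4All

def build_prompt(question: str, chunks: list[str]) -> str:
    # Pack the retrieved chunks into the prompt as context for the model.
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Placeholder inputs; in the real pipeline they come from the retrieval step.
question = "What does the ingestion step do?"
retrieved_chunks = ["Ingestion splits documents into chunks and embeds them."]

llm = GPT4All("tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf", model_path="models")
print(llm.generate(build_prompt(question, retrieved_chunks), max_tokens=256))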

The pipeline (pipeline.py) coordinates these steps:

  1. Call retrieve.retrieve(prompt) to find the most relevant chunks from the vector store.
  2. Call generate.generate(prompt, chunks) to build the LLM prompt and get the answer.
  3. Return the answer, source documents, and timing information to the caller.
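
Put together, pipeline.py amounts to something like the sketch below; the timing and return structure are simplified guesses, not the module's exact code.

import time

import generate
import retrieve

def run(prompt: str):
    start = time.perf_counter()
    chunks = retrieve.retrieve(prompt)          # step 1: similarity search
    answer = generate.generate(prompt, chunks)  # step 2: build prompt, query LLM
    elapsed = time.perf_counter() - start
    return answer, chunks, elapsed              # answer, sources, timing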

Future improvements:

  • Add moderation to filter out offensive answers.
  • Improve the answers with reranking: "over-fetch our search results, and then deterministically rerank based on a modifier or set of modifiers."
  • Try different chain types (related to the previous point).

Improving results

We had to make some compromises to make it run on a local machine in a reasonable amount of time.

  • We use a small model. This one is hard to change. The model has to run on a CPU and fit in memory.
  • We use a small embedding size. We can increase it if we are willing to wait longer for the ingestion process.

Sources

Most of the ingest/retrieve code is based on the original privateGPT, the version they now call primordial.

What is different:

  • Streamlit app for the UI.
  • Use newer embeddings and large language model versions.
  • Modernized the Python code. For example, it uses pathlib instead of os.path and has proper logging instead of print statements.
  • Added more logging to understand what is going on. Use the --verbose flag to see the details.
  • Added a main program to run the ingest/ask steps.
  • Filled in requirements.txt with the indirect dependencies, for example, for HuggingFace transformers and LangChain document loaders.
  • A modular pipeline architecture with separate modules for ingestion (ingest.py), retrieval (retrieve.py), and generation (generate.py), orchestrated by pipeline.py.

See this file for more notes collected during the development of this project.

Preparing the environment

This is a one-time step. If you have already done this, just activate the virtual environment with source venv/bin/activate (Windows: venv\Scripts\activate.bat).

Python environment

Some packages require Python 3.11 or lower. Ensure that python3 --version shows 3.11.

Run the following commands to create a virtual environment and install the required packages.

python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate.bat
pip install --upgrade pip
pip install -r requirements.txt

PDF parsing

The PDF parser in unstructured is a layer on top of the actual parser packages. Follow the instructions in the unstructured README, under the "Install the following system dependencies" bullets. The poppler and tesseract packages are required (ignore the others).

On Mac OS with Homebrew: brew install poppler tesseract.

Model

I suggest starting with a small model that runs on CPU. GPT4All has a list of compatible models. I tested with TinyLlama 1.1B Chat. It requires only 2 GB of RAM to run and is very fast, though answer quality is lower than larger models. Note that some models have restrictive licenses. Check the licenses before using them in commercial projects.

  1. Create a folder named models.
  2. Download TinyLlama 1.1B (637 MB download, 2 GB RAM).
  3. Copy the model to the models folder.

On Mac OS, Git Bash, or Linux:

mkdir -p models
cd models

curl -L -O https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

ls -lh tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
