RAG

Transformer-based language models like GPT-4 are limited in the number of tokens they can process at once. This limit is known as the context length, and even the most capable models at the time of writing, such as GPT-4 Turbo, Mixtral-8x7B or Claude 2, have context sizes in the tens to hundreds of thousands of tokens rather than millions or billions. GPT-4 Turbo handles 128k tokens (roughly 96k words), Mixtral-8x7B 32k tokens (~24k words). Anthropic pushed this boundary to 200k tokens (~150k words) with their Claude 2 model, but transformer-based models typically degrade considerably at larger input sizes.

While that may sound like a lot, it is actually not that much! Assuming around 500 words per A4 page, a 128k-token context window only lets you process roughly 192 pages.
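To make the numbers concrete, here is a quick back-of-envelope calculation in Python. The 0.75 words-per-token ratio is a rough rule of thumb for English text, not an exact conversion.

    # Rough estimate: how many A4 pages fit into a 128k-token context window?
    context_tokens = 128_000
    words_per_token = 0.75   # rule of thumb for English text
    words_per_page = 500     # typical A4 page

    words = context_tokens * words_per_token   # ~96,000 words
    pages = words / words_per_page             # ~192 pages
    print(f"{words:,.0f} words, about {pages:.0f} A4 pages")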

Asking an LLM questions about a large document, a set of documents or the entire corporate wiki by simply pasting everything into the prompt is therefore not possible.

The rise of RAG

While there is active research into alternatives to the transformer architecture that cope better with longer sequence lengths (see e.g. selective state space models), the last two years have seen the rise of retrieval augmented generation (RAG) as another solution to this problem.

In RAG, the prompt to the LLM is enriched with just the context needed to answer the question rather than the whole document(s). This avoids the need for large context windows, as usually only a few short sentences or paragraphs are relevant for answering the question. Research indicates that this can even produce better results than fine-tuning a model for the specific task!
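As a minimal sketch of what such an enriched prompt can look like (the template, the question and the retrieved snippets below are purely illustrative):

    # Illustrative RAG prompt: retrieved snippets are pasted in ahead of the question.
    retrieved_snippets = [
        "The warranty period for all hardware products is 24 months.",
        "Warranty claims must be filed through the support portal.",
    ]
    question = "How long is the hardware warranty?"

    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n"
        + "\n".join(f"- {snippet}" for snippet in retrieved_snippets)
        + f"\n\nQuestion: {question}\nAnswer:"
    )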

How RAG works

The question remains how to get the relevant context information. This is a multi-step process (a minimal end-to-end sketch follows the list):

  1. The content of each document (e.g. .pdf, .doc, .md) is encoded into a vector that captures its semantic meaning, called an embedding. This is done using an encoder model such as BERT and is both cheap and fast.
  2. These embeddings are stored in a database together with a mapping to the original document. New, specialized vector databases such as Qdrant or Pinecone have emerged, but most traditional databases now also support vectors, e.g. Postgres via the pgvector extension.
  3. When the user asks a question, it is first encoded into an embedding just like the documents. A similarity search then finds the document embeddings that are closest to the question embedding. This search is extremely fast and cheap, even across millions of embeddings.
  4. For the most similar document embeddings, the corresponding document content is looked up and added to the LLM's context.
  5. Using this enriched context, the LLM can now provide high-quality answers.
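The following sketch walks through steps 1 to 4 using the sentence-transformers library, with a plain Python list standing in for the vector database. The model name and the example documents are placeholders for whatever fits your use case.

    # Minimal RAG retrieval sketch: embed documents, embed the question,
    # find the most similar documents and hand their text to the LLM.
    # Assumes `pip install sentence-transformers numpy`.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    documents = [
        "The warranty period for all hardware products is 24 months.",
        "Support tickets are answered within two business days.",
        "Firmware updates are released quarterly.",
    ]

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    # Steps 1 & 2: encode the documents and keep the embeddings next to the text
    # (a real system would store them in a vector database such as Qdrant or pgvector).
    doc_embeddings = encoder.encode(documents, normalize_embeddings=True)

    # Step 3: encode the question and run a similarity search (cosine similarity
    # is just a dot product on normalized vectors).
    question = "How long is the hardware warranty?"
    q_embedding = encoder.encode([question], normalize_embeddings=True)[0]
    scores = doc_embeddings @ q_embedding
    top_k = np.argsort(scores)[::-1][:2]

    # Step 4: look up the corresponding document content; it would then be pasted
    # into a prompt like the one in the earlier sketch (step 5: call the LLM).
    for i in top_k:
        print(f"{scores[i]:.2f}  {documents[i]}")

Here the two best-matching snippets are simply printed; in a real pipeline they would be inserted into the prompt template shown above and sent to the LLM.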

How we can help you

We offer support for the whole process from design to deployment:

  • Design & development of a RAG pipeline tailored to your use-case
  • Selection and, if necessary, fine-tuning of embedding models
  • Integration of custom data sources
  • Deployment into your cloud