While there is active research into alternatives to the transformer architecture that cope better with longer sequence lengths (see e.g. selective state space models), the last two years have seen the rise of retrieval-augmented generation (RAG) as another solution to this problem.
In RAG, the prompt to the LLM is enriched with just the context needed to answer the question, rather than the whole document(s). This avoids the need for large context windows, since usually only a few short sentences or paragraphs are relevant to the question. Research indicates that this can even produce better results than fine-tuning a model for the specific task!
How RAG works
The question remains: how do we get the relevant context information? This is a multi-step process:
- The documents (like .pdf, .doc, .md) are typically split into smaller chunks, and the meaning of each chunk is encoded into a vector that conveys its semantic information, called an embedding. This is done using an encoder model such as BERT and is both cheap and fast (see the indexing sketch after this list).
- These embeddings are stored in a database with a mapping to the original document. New, specialized vector databases such as Qdrant or Pinecone have emerged, but most traditional databases now also support vectors, e.g. Postgres via the pgvector extension.
- When the user asks the LLM a question, the question is first encoded into an embedding, just like the documents. Then the document embeddings most similar to the question embedding are searched for. This search is extremely fast and cheap, even across millions of embeddings.
- The document content corresponding to the most similar embeddings is looked up and added as context to the LLM's prompt (see the retrieval sketch after this list).
- Using this enriched context, the LLM can now provide high-quality answers.
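
To make the indexing side concrete, here is a minimal Python sketch. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 model (both just illustrative choices), and uses a plain NumPy array as an in-memory stand-in for a vector database; the example chunks are made up.

```python
# pip install sentence-transformers numpy
from sentence_transformers import SentenceTransformer
import numpy as np

# Any sentence-embedding model works; all-MiniLM-L6-v2 is a small, fast default.
model = SentenceTransformer("all-MiniLM-L6-v2")

# In practice, these chunks come from splitting your .pdf/.doc/.md files.
chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm CET.",
    "The premium plan includes priority support and a 99.9% SLA.",
]

# Encode every chunk into an embedding vector. Normalizing the vectors
# makes the dot product equal to cosine similarity later on.
index = model.encode(chunks, normalize_embeddings=True)  # shape: (n_chunks, dim)
```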
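
The retrieval side reuses the same model and index from the sketch above. Because the embeddings are normalized, cosine similarity reduces to a dot product, which is why the search stays fast and cheap even at scale; the prompt template at the end is just one simple way to assemble the enriched context.

```python
# Encode the user question with the same model as the documents.
question = "How long do I have to return a product?"
q_emb = model.encode([question], normalize_embeddings=True)[0]

# Similarity search: one dot product per stored chunk.
scores = index @ q_emb
top_k = np.argsort(scores)[::-1][:2]  # indices of the 2 most similar chunks

# Assemble the enriched prompt for the LLM.
context = "\n".join(chunks[i] for i in top_k)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(prompt)  # send this to the LLM of your choice
```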
How we can help you
We provide end-to-end RAG solutions tailored to your needs:
- Custom Pipeline Development: Design and implement RAG systems for your specific use case
- Model Optimization: Select and fine-tune embedding models for optimal performance
- Data Integration: Connect multiple data sources seamlessly
- Cloud Deployment: Professional implementation in your preferred cloud environment