RAG
Transformer-based language models like GPT-4 are limited in the number of tokens they can process at once. This limit is known as the context length, and even the most capable models at the time of writing, like GPT-4 Turbo, Mixtral-8x7B or Claude 2, have context sizes in the tens or hundreds of thousands of tokens rather than millions or billions. GPT-4 Turbo is at 128k tokens (roughly 96k words) and Mixtral-8x7B at 32k tokens (~24k words). Anthropic pushed this boundary to 200k tokens (~150k words) with Claude 2.1, but transformer-based models typically degrade considerably at larger input sizes.
While that may sound like a lot, it is actually not that much! Assuming roughly 500 words per A4 page, a 128k-token context window lets you process only around 192 pages.
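As a rough sanity check of the token-to-word ratio, you can count tokens yourself, for example with OpenAI's tiktoken library (assuming it is installed); the sample text below is arbitrary:

```python
# Rough check of the token-to-word ratio using OpenAI's tiktoken library.
import tiktoken

# cl100k_base is the encoding used by the GPT-4 family of models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Retrieval augmented generation enriches the prompt with relevant context."
tokens = enc.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
# On English prose this typically comes out to roughly 3 words per 4 tokens,
# which is where the "128k tokens ~ 96k words" rule of thumb comes from.
```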
Asking an LLM questions about a larger document, a whole set of documents or the corporate wiki by simply pasting everything into the prompt is therefore not possible.
While there is active research into alternatives to the transformer architecture that cope better with longer sequence lengths (see e.g. selective state space models), the last two years have seen the rise of retrieval augmented generation (RAG) as another solution to this problem.
In RAG, the prompt to the LLM is enriched with the context information necessary to answer the question, not with the whole document(s). This avoids the need for a large context window, as usually only a few short sentences or paragraphs are relevant for answering the question. Research indicates that this can even produce better results than a model fine-tuned for the specific task!
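To make this concrete, here is a minimal sketch of how such an enriched prompt could be assembled; the helper name `build_rag_prompt` and the prompt wording are illustrative choices, not a fixed recipe:

```python
# Minimal sketch of RAG prompt assembly: the retrieved chunks are placed into
# the prompt as context, so the model only sees the few passages that are
# actually relevant to the question. Retrieval itself is sketched further below.

def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n"
        "If the answer is not contained in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "What is the notice period for employees?",
    ["Section 4.2: The notice period for employees is three months ..."],
)
# `prompt` is then sent to the LLM as usual, e.g. via a chat completion API.
```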
The question remains how to get the relevant context information in the first place. This is a multi-step process:

1. Split the documents into smaller chunks, e.g. paragraphs or sections.
2. Convert each chunk into a numerical representation (an embedding) using an embedding model.
3. Store the embeddings in a vector database.
4. At query time, embed the user's question and retrieve the chunks that are most similar to it.
5. Insert the retrieved chunks into the prompt, as sketched below.
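Put together, these steps might look like the following sketch, using the sentence-transformers library and a brute-force similarity search in place of a real vector database; the model name and example chunks are placeholders:

```python
# Illustrative sketch of the retrieval steps above. A production system would
# use a vector database instead of in-memory brute-force search.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Steps 1 + 2: chunk the documents and embed the chunks (done once, offline).
chunks = [
    "The notice period for employees is three months.",
    "Travel expenses are reimbursed within 30 days.",
    "The office is closed on all public holidays.",
]
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)

# Steps 3 + 4: embed the question and retrieve the most similar chunks.
question = "How long is the notice period?"
question_embedding = model.encode(question, convert_to_tensor=True)
scores = util.cos_sim(question_embedding, chunk_embeddings)[0]
top_chunks = [chunks[int(i)] for i in scores.argsort(descending=True)[:2]]

# Step 5: the top chunks go into the prompt (see the prompt sketch above).
print(top_chunks)
```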
We offer support for the whole process from design to deployment: