Data Ingestion
This section provides an overview of the data ingestion process in Retrieval-Augmented Generation (RAG): loading data from sources such as databases, files, and APIs into a vector store.
Overview
The data ingestion process in RAG involves the following steps:
- Data Collection: Collect data from the source and store it locally.
- Data Loading: Load a file, or multiple files, into Document objects.
- Data Chunking: Split the data into smaller chunks. This is important for two reasons: (1) it makes the data easier to index, and (2) it makes the data easier to query. Furthermore, since most LLMs have a finite context window (or input size), smaller chunks ensure relevant context information isn't lost.
- Embedding Generation: Generate an embedding for each chunk using an embedding model. An embedding model converts text into a numerical representation, and the distance between these representations is used to determine the similarity between two pieces of text.
- Storage: The generated vector embeddings are then stored in an efficient vector store.
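To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap. The `chunkText` function and its parameters are illustrative, not QvikChat's actual chunker; overlapping chunks keep sentences cut at a boundary intact in at least one chunk.

```typescript
// Split text into fixed-size chunks with a small overlap.
// Illustrative sketch only -- not QvikChat's internal implementation.
function chunkText(text: string, chunkSize = 200, overlap = 20): string[] {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    start += chunkSize - overlap; // step forward, keeping `overlap` chars of context
  }
  return chunks;
}
```

Production splitters are usually smarter, preferring to break on paragraph or sentence boundaries rather than at a fixed character offset.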
Data Loaders
QvikChat provides built-in support for loading data from text, PDF, JSON, CSV, or a code file of a supported programming language. To learn more about data loading or how you can use custom data not supported by QvikChat, refer to the Data Loaders page.
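As a sketch of what "loading into Document objects" looks like, the snippet below uses a LangChain-style Document shape (page content plus metadata). The `loadTextAsDocuments` helper and its paragraph-splitting behavior are hypothetical, shown only to illustrate the idea:

```typescript
// A minimal LangChain-style Document shape: page content plus metadata.
interface Document {
  pageContent: string;
  metadata: Record<string, string>;
}

// Hypothetical loader: wrap raw text into Document objects, one per
// paragraph, tagging each with its source file for later attribution.
function loadTextAsDocuments(raw: string, source: string): Document[] {
  return raw
    .split(/\n\s*\n/) // split on blank lines (paragraph boundaries)
    .map((p) => p.trim())
    .filter((p) => p.length > 0)
    .map((pageContent) => ({ pageContent, metadata: { source } }));
}
```

Keeping the source in metadata is what lets a RAG chat endpoint later cite where a retrieved chunk came from.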
Embedding Models
You can provide your own embedding model to generate embeddings for the data. More than 20 embedding models are supported by QvikChat through LangChain. To learn more about embedding models, refer to the Embedding Models page.
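The "distance between numerical representations" mentioned above is commonly measured with cosine similarity. The sketch below computes it directly on plain number arrays standing in for embedding vectors:

```typescript
// Cosine similarity between two embedding vectors: 1 means same
// direction (very similar text), 0 means orthogonal (unrelated),
// -1 means opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```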
Vector Stores
The vector store holds the generated embeddings and supports efficient similarity queries over them. QvikChat supports more than 30 vector stores, such as Faiss, Pinecone, and Chroma, through LangChain. To learn more about vector stores, refer to the Vector Stores page.
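To show what a vector store does conceptually, here is a toy in-memory store that ranks entries by cosine similarity. This is a teaching sketch, not how Faiss, Pinecone, or Chroma work internally; real stores use approximate nearest-neighbor indexes to scale to millions of embeddings.

```typescript
// A toy in-memory vector store: holds (embedding, text) pairs and
// returns the top-k most similar texts for a query embedding.
type Entry = { embedding: number[]; text: string };

class NaiveVectorStore {
  private entries: Entry[] = [];

  add(embedding: number[], text: string): void {
    this.entries.push({ embedding, text });
  }

  query(queryEmbedding: number[], k = 1): string[] {
    // Cosine similarity, inlined for self-containment.
    const score = (a: number[], b: number[]): number => {
      let dot = 0;
      let na = 0;
      let nb = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
      }
      return dot / (Math.sqrt(na) * Math.sqrt(nb));
    };
    return [...this.entries]
      .sort(
        (x, y) =>
          score(y.embedding, queryEmbedding) - score(x.embedding, queryEmbedding)
      )
      .slice(0, k)
      .map((e) => e.text);
  }
}
```

At query time, a RAG server embeds the user's question with the same embedding model used at ingestion, then asks the store for the top-k nearest chunks to use as context.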