Data Ingestion
This section provides an overview of the data ingestion process in Retrieval-Augmented Generation (RAG): loading data from sources such as databases, files, and APIs into a vector store.
Overview
The data ingestion process in RAG involves the following steps:
- Data Collection: Collect data from the source and store it locally.
- Data Loading: Load a file, or multiple files, into Document objects.
- Data Chunking: Split the data into smaller chunks. This is important for two reasons: (1) it makes the data easier to index, and (2) it makes the data easier to query. Furthermore, since most LLMs have a finite context window (or input size), smaller chunks ensure relevant context information isn't lost.
- Embedding Generation: Generate an embedding for each chunk using an embedding model. An embedding model converts text into a numerical representation, and the distance between these representations is used to determine the similarity between two pieces of text.
- Storage: The generated vector embeddings are then stored in an efficient vector store.
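To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap. The `chunkText` function and its parameters are illustrative, not QvikChat's actual chunker; overlapping chunks keep sentences cut at a boundary intact in at least one chunk.

```typescript
// Split text into fixed-size chunks with a small overlap.
// Illustrative sketch only -- not QvikChat's internal implementation.
function chunkText(text: string, chunkSize = 200, overlap = 20): string[] {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    start += chunkSize - overlap; // step forward, keeping `overlap` chars of context
  }
  return chunks;
}
```

Production splitters are usually smarter, preferring to break on paragraph or sentence boundaries rather than at a fixed character offset.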
Data Loaders
QvikChat provides built-in support for loading data from text, PDF, JSON, CSV, or a code file of a supported programming language. To learn more about data loading or how you can use custom data not supported by QvikChat, refer to the Data Loaders page.
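As a sketch of what "loading into Document objects" looks like, the snippet below uses a LangChain-style Document shape (page content plus metadata). The `loadTextAsDocuments` helper and its paragraph-splitting behavior are hypothetical, shown only to illustrate the idea:

```typescript
// A minimal LangChain-style Document shape: page content plus metadata.
interface Document {
  pageContent: string;
  metadata: Record<string, string>;
}

// Hypothetical loader: wrap raw text into Document objects, one per
// paragraph, tagging each with its source file for later attribution.
function loadTextAsDocuments(raw: string, source: string): Document[] {
  return raw
    .split(/\n\s*\n/) // split on blank lines (paragraph boundaries)
    .map((p) => p.trim())
    .filter((p) => p.length > 0)
    .map((pageContent) => ({ pageContent, metadata: { source } }));
}
```

Keeping the source in metadata is what lets a RAG chat endpoint later cite where a retrieved chunk came from.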
Embedding Models
You can provide your own embedding model to generate embeddings for the data. More than 20 embedding models are supported by QvikChat through LangChain. To learn more about embedding models, refer to the Embedding Models page.
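The "distance between numerical representations" mentioned above is commonly measured with cosine similarity. The sketch below computes it directly on plain number arrays standing in for embedding vectors:

```typescript
// Cosine similarity between two embedding vectors: 1 means same
// direction (very similar text), 0 means orthogonal (unrelated),
// -1 means opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```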
Vector Stores
The vector store holds the generated embeddings and supports efficient similarity queries over them. QvikChat supports more than 30 vector stores, such as Faiss, Pinecone, and Chroma, through LangChain. To learn more about vector stores, refer to the Vector Stores page.
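To show what a vector store does conceptually, here is a toy in-memory store that ranks entries by cosine similarity. This is a teaching sketch, not how Faiss, Pinecone, or Chroma work internally; real stores use approximate nearest-neighbor indexes to scale to millions of embeddings.

```typescript
// A toy in-memory vector store: holds (embedding, text) pairs and
// returns the top-k most similar texts for a query embedding.
type Entry = { embedding: number[]; text: string };

class NaiveVectorStore {
  private entries: Entry[] = [];

  add(embedding: number[], text: string): void {
    this.entries.push({ embedding, text });
  }

  query(queryEmbedding: number[], k = 1): string[] {
    // Cosine similarity, inlined for self-containment.
    const score = (a: number[], b: number[]): number => {
      let dot = 0;
      let na = 0;
      let nb = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
      }
      return dot / (Math.sqrt(na) * Math.sqrt(nb));
    };
    return [...this.entries]
      .sort(
        (x, y) =>
          score(y.embedding, queryEmbedding) - score(x.embedding, queryEmbedding)
      )
      .slice(0, k)
      .map((e) => e.text);
  }
}
```

At query time, a RAG server embeds the user's question with the same embedding model used at ingestion, then asks the store for the top-k nearest chunks to use as context.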