Data Loaders
QvikChat provides built-in support for loading data from text, PDF, JSON, CSV, and code files written in a supported programming language. To load a file type that QvikChat does not support by default, you can use any LangChain-supported data loader to load the data and provide the resulting documents as the docs property when configuring the retriever. See Loading Custom Data for more information.
Built-in Data Loaders
QvikChat provides built-in support for loading data from the following file types:
- Text files: Any documents containing text data, plus any code files of supported programming languages.
- PDF files: Can use the pdfLoaderOptions property to specify additional options.
- JSON and JSONLines files: Can use the jsonLoaderKeysToInclude property to specify the keys containing relevant data.
- CSV files: Can use the csvLoaderOptions property to specify the delimiter and other options.
The data type is inferred from the file extension of the filePath value that you provide when configuring the retriever. For information on configuring the retriever, see the Retriever Configuration page.
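As a rough illustration of extension-based inference, the sketch below maps a file path to a data type. The inferDataType helper and its extension table are hypothetical and written only for this example; they are not QvikChat's actual implementation, and the set of extensions QvikChat recognizes may differ.

```typescript
// Hypothetical helper for illustration only; not part of the QvikChat API.
type InferredDataType = "text" | "pdf" | "json" | "csv";

// Assumed mapping from file extension to data type (example values only).
const EXTENSION_TO_TYPE: Record<string, InferredDataType> = {
  ".txt": "text",
  ".pdf": "pdf",
  ".json": "json",
  ".jsonl": "json",
  ".csv": "csv",
};

// Returns the inferred data type, or undefined when the extension is unknown.
function inferDataType(filePath: string): InferredDataType | undefined {
  // Take the last path segment, handling both "/" and "\" separators.
  const fileName = filePath.split(/[\\/]/).pop() ?? "";
  const dotIndex = fileName.lastIndexOf(".");
  // No dot, or a leading dot only (e.g. ".env"), means no usable extension.
  if (dotIndex <= 0) return undefined;
  const extension = fileName.slice(dotIndex).toLowerCase();
  return EXTENSION_TO_TYPE[extension];
}
```

With this sketch, a path such as "src/data/knowledge-bases/test-data.csv" resolves to "csv", while a path without an extension resolves to undefined, which is the case where providing the documents yourself becomes useful.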
Data Splitting
Data in the given file is first loaded into processable Document objects, and is then split into smaller chunks using a chunking strategy. This is important for two reasons: (1) it makes it easier to index data, and (2) it makes it easier to query data. Furthermore, since most LLMs have a finite context window (or input size), smaller chunks make it more likely that the relevant context fits within that window and isn't lost.
The chunking strategy used can have a significant impact on the performance of a chat service that responds to queries based on the data. Since data comes in all shapes and sizes, it is recommended to experiment with different configurations to find the best one for your use case. Some sensible defaults exist; for example, for CSV data each row (or line) can be a chunk. If you are not sure, start with the default configuration and adjust from there.
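To make the "one row per chunk" idea for CSV data concrete, here is a minimal sketch. The chunkCsvByRow function is hypothetical and is not how QvikChat necessarily splits CSV files. It assumes the first non-empty line is a header and that no field contains an embedded line break; repeating the header in every chunk is a design choice made here so that each chunk stays self-describing.

```typescript
// Hypothetical helper for illustration only; not part of the QvikChat API.
// Assumes: first non-empty line is the header, no quoted fields with newlines.
function chunkCsvByRow(csvText: string): string[] {
  // Split on Unix or Windows line endings and drop blank lines.
  const lines = csvText.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length === 0) return [];
  const [header, ...rows] = lines;
  // Each chunk is the header followed by exactly one data row.
  return rows.map((row) => `${header}\n${row}`);
}

// Example usage with a tiny in-memory CSV.
const exampleCsvChunks = chunkCsvByRow("id,name\n1,Ada\n2,Linus\n");
console.log(exampleCsvChunks.length); // 2
```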
You can use the chunkingConfig property to specify the data chunking configuration when configuring the retriever. The chunkingConfig object can contain the following properties:
- chunkSize: The size of each chunk in the data. The default value is 1000.
- chunkOverlap: The number of tokens to overlap between chunks. The default value is 200.
Here is an example of how you can specify the data chunking configuration:
import { defineChatEndpoint } from "@oconva/qvikchat/endpoints";

// endpoint with custom chunking configurations
defineChatEndpoint({
  endpoint: "rag",
  enableRAG: true,
  topic: "Test",
  retrieverConfig: {
    filePath: "src/data/knowledge-bases/test-data.csv",
    generateEmbeddings: true,
    chunkingConfig: {
      chunkSize: 500,
      chunkOverlap: 50,
    },
  },
});
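To show how a chunk size and an overlap interact, the following sketch splits a string into fixed-size windows that share some content with their neighbors. It is a simplified stand-in, not the splitter QvikChat uses: the chunkText function is hypothetical, it measures size in characters rather than tokens, and it cuts at fixed offsets instead of looking for natural boundaries such as paragraphs or sentences.

```typescript
// Hypothetical helper for illustration only; not part of the QvikChat API.
// Sizes are counted in characters here, which is an assumption of this sketch.
function chunkText(text: string, chunkSize = 1000, chunkOverlap = 200): string[] {
  if (chunkSize <= 0) {
    throw new Error("chunkSize must be greater than 0");
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be >= 0 and smaller than chunkSize");
  }
  // Each new chunk starts this many characters after the previous one.
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// With chunkSize 4 and chunkOverlap 2, neighbors share two characters.
console.log(chunkText("abcdefghij", 4, 2)); // [ 'abcd', 'cdef', 'efgh', 'ghij' ]
```

A larger overlap repeats more content across chunks, which increases the number of chunks to index but reduces the chance that a relevant sentence is cut in half at a chunk boundary.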
Data Loading Options
You can specify data loading options through the retriever configuration object, either when calling the getDataRetriever method or when specifying the retrieverConfig property in the chat endpoint configuration. For more information, check Retriever Config.
Loading Custom Data
When you want to load data from a file that is not supported by QvikChat out of the box, you can use any LangChain-supported data loader to load the data and provide the documents as the docs property when configuring the retriever.
Example - Loading Data from Webpage
Below is an example of how you can load data from a webpage.
QvikChat doesn't provide a data loader for web pages by default. So, in this example, we are going to use a custom web loader from LangChain to load data from a webpage.
In this example, we're going to use the Cheerio web loader. Cheerio is a fast and lightweight library that can help you extract data from web pages, without the need for a full browser environment.
import { setupGenkit, runServer } from "@oconva/qvikchat/genkit";
import { defineChatEndpoint } from "@oconva/qvikchat/endpoints";
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";

// Setup Genkit
setupGenkit();

// Method to define all endpoints of the project and run the server
const defineEndpointsRunServer = async () => {
  // configure and instantiate data loader
  const loader = new CheerioWebBaseLoader(
    "https://qvikchat.pkural.ca/rag-guide"
  );

  // load data and get docs
  const docs = await loader.load();

  // define RAG chat endpoint with docs
  defineChatEndpoint({
    endpoint: "chat",
    enableRAG: true,
    topic: "QvikChat - RAG chat endpoint",
    retrieverConfig: {
      docs: docs,
      dataType: "text",
      generateEmbeddings: true,
    },
  });

  // Run server
  runServer();
};

// execute method to define endpoints and run server
defineEndpointsRunServer();
For the above example, the full source code and the instructions to run it can be found here.
GIGO (Garbage In, Garbage Out):
For best performance, it is highly recommended that you spend some extra time preparing a data collection strategy for Retrieval Augmented Generation (RAG). Without proper data cleaning and preprocessing, the data may contain a lot of noise, which can lead to poor response quality. For example, when crawling data from a webpage without proper planning, the data may contain a lot of irrelevant information such as ads, navigation links, etc. This can lead to poor performance of the chat endpoints. Moreover, data comes in all shapes and sizes. A single webpage or a Word document can contain information of various types, like code, text, tables, etc. It is important to handle and process these different types of information using an appropriate strategy.
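As one small example of such preprocessing, the sketch below removes obvious noise from text scraped from a webpage before it is turned into documents. The cleanPageText function and the NOISE_PATTERNS list are hypothetical and deliberately simplistic; real pages usually need site-specific rules, and the patterns here are placeholders rather than a recommended set.

```typescript
// Hypothetical helper for illustration only; not part of the QvikChat API.
// Placeholder patterns for lines that are typically navigation or ad residue.
const NOISE_PATTERNS: RegExp[] = [
  /^(home|about|contact|login|sign up|subscribe)$/i,
  /^advertisement$/i,
];

// Normalizes whitespace, then drops very short lines and lines matching noise.
function cleanPageText(raw: string, minLineLength = 3): string {
  return raw
    .split(/\r?\n/)
    // Collapse runs of whitespace and trim each line.
    .map((line) => line.replace(/\s+/g, " ").trim())
    // Drop empty or very short lines, which rarely carry useful context.
    .filter((line) => line.length >= minLineLength)
    // Drop lines that match any known noise pattern.
    .filter((line) => !NOISE_PATTERNS.some((pattern) => pattern.test(line)))
    .join("\n");
}

const scraped = "Home\n\n  Advertisement \nQvikChat   supports RAG.\nOK\nContact";
console.log(cleanPageText(scraped)); // QvikChat supports RAG.
```

The cleaned text could then be wrapped in documents and passed through the docs property described in Loading Custom Data.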