Data Loaders

Data Loaders

QvikChat provides built-in support for loading data from text, PDF, JSON, CSV, or a code file of a supported programming language. However, if you want to load a file not supported by QvikChat by default, you can simply use any LangChain-supported data loader (opens in a new tab) to load the data and provide the documents as the docs property when configuring the retriever. Check Loading Custom Data for more information.

Built-in Data Loaders

QvikChat provides built-in support for loading data from the following file types:

  • Text files: Any documents containing text data, plus any code files of supported programming languages.
  • PDF files: Can use the pdfLoaderOptions to specify additional options.
  • JSON and JSONLines files: Can use the jsonLoaderKeysToInclude property to specify the keys containing relevant data.
  • CSV files: Can use the csvLoaderOptions property to specify the delimiter and other options.

The type of data you are providing is inferred from the file extension given in the filePath value that you provide when configuring the retriever. For information on configuring the retriever, see the Retriever Configuration page.

Data Splitting

Data in the given file is first loaded into processable Document objects, and is then split into smaller chunks using a chunking strategy. This is important for two reasons: (1) it makes it easier to index data, and (2) it makes it easier to query data. Furthermore, since most LLM models have a finite context window (or input size), having smaller chunks of data ensures relevant context information isn't lost.

The chunking strategy used can have a signigicant impact on the performance of a chat service that responds to queries based on the data. However, since data comes in all shapes and sizes, it is recommended to experiment with different configurations to find the best one for your use case. There are, however, some default configurations that can be used, for example, for CSV data each row (or line) can be a chunk. If not sure, start with the default configurations and then experiment with different configurations.

You can use the chunkingConfig property to specify the data chunking configuration when configuring the retriever. The chunkingConfig object can contain the following properties:

  • chunkSize: The size of each chunk in the data. The default value is 1000.
  • overlap: The number of tokens to overlap between chunks. The default value is 200.

Here is an example of how you can specify the data chunking configuration:

import { defineChatEndpoint } from "@oconva/qvikchat/endpoints";
 
// endpoint with custom chunking configurations
defineChatEndpoint({
  endpoint: "rag",
  enableRAG: true,
  topic: "Test",
  retrieverConfig: {
    filePath: "src/data/knowledge-bases/test-data.csv",
    generateEmbeddings: true,
    chunkingConfig: {
      chunkSize: 500,
      chunkOverlap: 50,
    },
  },
});

Data Loading Options

You can specify data loading options through the retriever configurations object when called the getDataRetriever method or the when specifying the retrieverConfig property in the chat endpoint configuration. For more information, check Retriever Config.

Loading Custom Data

When you want to load data from a file that is not supported by QvikChat out of the box, you can use any LangChain-supported data loader (opens in a new tab) to load the data and provide the documents as the docs property when configuring the retriever.

Example - Loading Data from Webpage

Below is an example of how you can load data from a webpage.

QvikChat by default doesn't provide a data loader for web pages. So, in this example, we are going to a custom web loader from LangChain to load data from a webpage.

In this example, we're going to use the Cheerio (opens in a new tab) web loader. Cheerio is a fast and lightweight library that can help you extract data from web pages, without the need for a full browser environment.

src/index.ts
import { setupGenkit, runServer } from "@oconva/qvikchat/genkit";
import { defineChatEndpoint } from "@oconva/qvikchat/endpoints";
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";
 
// Setup Genkit
setupGenkit();
 
// Method to define all endpoints of the project and run the server
const defineEndpointsRunServer = async () => {
  // configure and instantiate data loader
  const loader = new CheerioWebBaseLoader(
    "https://qvikchat.pkural.ca/rag-guide"
  );
 
  // load data and get docs
  const docs = await loader.load();
 
  // define RAG chat endpoint with docs
  defineChatEndpoint({
    endpoint: "chat",
    enableRAG: true,
    topic: "QvikChat - RAG chat endpoint",
    retrieverConfig: {
      docs: docs,
      dataType: "text",
      generateEmbeddings: true,
    },
  });
 
  // Run server
  runServer();
};
 
// execute method to define endpoints and run server
defineEndpointsRunServer();

For the above example, the full source code and the instructions to run it can be found here (opens in a new tab).

GIGO (Garbage In, Garbage Out):

For best performance, it is highly recommended that you spend some extra time preparing a strategy for data collection for Retrieval Augmented Generation (RAG). Without proper data cleaning and preprocessing, the data may contain a lot of noise, which can also lead to poor response quality. For example, when crawling data from a webpage, if done without proper planning, the data may contain a lot of irrelevant information such as ads, navigation links, etc. This can lead to poor performance of the chat endpoints. Moreover, data comes in all shapes and sizes. A single webpage or a word document can contain information of various types, like code, text, tables, etc. It is important to handle and process these different types of information using an appropriate strategy.