A document loader that extends the BufferLoader class and loads documents from PDF files.

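// Assumption: the import path depends on the installed LangChain package and version.
import { PDFLoader } from "langchain/document_loaders/fs/pdf";
const loader = new PDFLoader("path/to/bitcoin.pdf");
// load() resolves to an array of Document instances.
const docs = await loader.load();
console.log({ docs });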

Hierarchy

  • BufferLoader
    • PDFLoader

Constructors

  • Parameters

    • filePathOrBlob: string | Blob
    • __namedParameters: {
          parsedItemSeparator: undefined | string;
          pdfjs: undefined | (() => Promise<{
              getDocument: {
                  (src:
                      | string
                      | ArrayBuffer
                      | URL
                      | TypedArray
                      | DocumentInitParameters): PDFDocumentLoadingTask;
              };
              version: string;
          }>);
          splitPages: undefined | boolean;
      } = {}
      • parsedItemSeparator: undefined | string
      • pdfjs: undefined | (() => Promise<{
            getDocument: {
                (src:
                    | string
                    | ArrayBuffer
                    | URL
                    | TypedArray
                    | DocumentInitParameters): PDFDocumentLoadingTask;
            };
            version: string;
        }>)
      • splitPages: undefined | boolean

    Returns PDFLoader

Properties

parsedItemSeparator: string

Methods

  • Parses a raw buffer and its metadata into an array of Document instances. The method loads the PDF from the buffer with the getDocument function from the PDF.js library, then iterates over the pages. For each page it retrieves the text content with getTextContent, joins the text items to form the page content, and adds a new Document with that content and the metadata to the documents array. If splitPages is true, it returns one Document per page. If splitPages is not true, it returns an empty array when no documents were produced; otherwise it concatenates the page content of all documents into a single Document and returns an array containing only that Document.

    Parameters

    • raw: Buffer

      The buffer to be parsed.

    • metadata: Document

      The metadata of the document.

    Returns Promise<Document[]>

    A promise that resolves to an array of Document instances.
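
The parsing steps described above can be sketched without the real libraries. The block below is a minimal illustration, not the loader's actual implementation. The names Doc, PdfJsLike, ParseOptions, parsePdf, makeStubPdfJs, and pageJoiner are hypothetical. The interfaces only mirror the parts of PDF.js that the description mentions. Two things are assumptions: that parsedItemSeparator is the string placed between text items, and that pageNumber is added to each page's metadata.

```typescript
// Hypothetical minimal shapes; not the library's or PDF.js's real declarations.
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

interface PdfPage {
  getTextContent(): Promise<{ items: { str: string }[] }>;
}

interface PdfDocument {
  numPages: number;
  getPage(pageNumber: number): Promise<PdfPage>; // pages are 1-indexed
}

interface PdfJsLike {
  getDocument(src: Uint8Array): { promise: Promise<PdfDocument> };
  version: string;
}

interface ParseOptions {
  splitPages: boolean;
  parsedItemSeparator: string; // assumption: placed between text items of a page
  pageJoiner: string; // assumption: placed between pages when they are merged
}

async function parsePdf(
  raw: Uint8Array,
  metadata: Record<string, unknown>,
  pdfjs: PdfJsLike,
  options: ParseOptions
): Promise<Doc[]> {
  const pdf = await pdfjs.getDocument(raw).promise;
  const docs: Doc[] = [];

  for (let i = 1; i <= pdf.numPages; i += 1) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();
    // Join the text items of this page to form the page content.
    const text = content.items
      .map((item) => item.str)
      .join(options.parsedItemSeparator);
    // pageNumber is an illustrative addition to the metadata.
    docs.push({ pageContent: text, metadata: { ...metadata, pageNumber: i } });
  }

  if (options.splitPages) {
    return docs; // one Doc per page
  }
  if (docs.length === 0) {
    return [];
  }
  // Merge every page into a single Doc.
  const merged = docs.map((doc) => doc.pageContent).join(options.pageJoiner);
  return [{ pageContent: merged, metadata }];
}

// In-memory stand-in for PDF.js: each inner array holds one page's text items.
function makeStubPdfJs(pages: string[][]): PdfJsLike {
  return {
    version: "stub",
    getDocument: () => ({
      promise: Promise.resolve({
        numPages: pages.length,
        getPage: async (n: number) => ({
          getTextContent: async () => ({
            items: (pages[n - 1] ?? []).map((str) => ({ str })),
          }),
        }),
      }),
    }),
  };
}
```

With a stub built from makeStubPdfJs([["Hello", "world"], ["Page", "two"]]) and a single space as parsedItemSeparator, parsePdf returns two Doc objects when splitPages is true and one merged Doc when it is false. Passing the PDF.js-like object in as an argument echoes the pdfjs constructor option, which lets a caller supply its own build of the library.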