Overview
Integration details
Class | Package | Local | Serializable | JS support |
---|---|---|---|---|
PyMuPDF4LLMLoader | langchain-pymupdf4llm | ✅ | ❌ | ❌ |
Loader features
Source | Document Lazy Loading | Native Async Support | Extract Images | Extract Tables |
---|---|---|---|---|
PyMuPDF4LLMLoader | ✅ | ❌ | ✅ | ✅ |
Setup
To access the PyMuPDF4LLM document loader, you’ll need to install the langchain-pymupdf4llm integration package.
Credentials
No credentials are required to use PyMuPDF4LLMLoader. To enable automated tracing of your model calls, set your LangSmith API key.
Installation
Install langchain-community and langchain-pymupdf4llm.
Initialization
Now we can instantiate our loader object and load documents.
Load
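A minimal sketch of the instantiate-and-load step, assuming the langchain-pymupdf4llm package is installed. The helper name load_pdf is hypothetical; the import is deferred into the function so the snippet can be defined even where the package is missing.

```python
from pathlib import Path


def load_pdf(file_path: str):
    """Load a PDF and return a list of LangChain Document objects.

    load_pdf is a hypothetical helper name used only for this sketch.
    """
    # Fail early with a clear error if the file does not exist.
    if not Path(file_path).is_file():
        raise FileNotFoundError(file_path)

    # Deferred import: requires the langchain-pymupdf4llm package.
    from langchain_pymupdf4llm import PyMuPDF4LLMLoader

    loader = PyMuPDF4LLMLoader(file_path)
    # load() reads the whole file and returns all Documents at once.
    return loader.load()
```

Calling load_pdf with the path to a local PDF should return one Document per page under the default page mode; the first Document's metadata can then be inspected through its metadata attribute.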
Lazy Load
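Lazy loading yields Documents one at a time, which helps when a PDF is too large to hold in memory comfortably. The sketch below groups the lazily loaded Documents into fixed-size batches; the helper names batched and process_in_batches are hypothetical, and the per-batch work is left as a placeholder.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items without materializing `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls at most `size` items per step; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch


def process_in_batches(file_path: str, size: int = 10) -> int:
    """Stream a PDF through lazy_load() and return the number of Documents seen."""
    # Deferred import: requires the langchain-pymupdf4llm package.
    from langchain_pymupdf4llm import PyMuPDF4LLMLoader

    loader = PyMuPDF4LLMLoader(file_path)
    total = 0
    for batch in batched(loader.lazy_load(), size):
        # Placeholder for a per-batch operation, e.g. writing to an index.
        total += len(batch)
    return total
```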
Each loaded Document's metadata contains at least the following keys:
- source
- page (in page mode only)
- total_page
- creationdate
- creator
- producer
Splitting mode & custom pages delimiter
When loading the PDF file you can split it in two different ways:
- By page
- As a single text flow
Extract the PDF by page. Each page is extracted as a LangChain Document object
In this mode, each Document's metadata includes page (the page number). In some cases, however, you may want to process the PDF as a single text flow so that paragraphs are not cut in half at page boundaries. In that case, use the single mode.
Extract the whole PDF as a single LangChain Document object
In this mode, the page (page number) metadata disappears. A custom delimiter can be used to mark clearly where pages end in the text flow.
Add a custom pages_delimiter to identify where pages end in single mode
The default pages_delimiter is \n-----\n\n. It could instead be simply \n, or \f to clearly indicate a page change, or <!-- PAGE BREAK --> for seamless injection in a Markdown viewer without a visual effect.
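A sketch of single mode with a custom delimiter, followed by splitting the text flow back into pages. The mode and pages_delimiter arguments come from the text above; the helper names are hypothetical, and the code assumes the delimiter is placed between pages (a trailing empty segment is dropped in case it also follows the last page).

```python
from typing import List


def load_single_flow(file_path: str, pages_delimiter: str = "\n\f"):
    """Load the whole PDF as one Document, marking page ends with a delimiter."""
    # Deferred import: requires the langchain-pymupdf4llm package.
    from langchain_pymupdf4llm import PyMuPDF4LLMLoader

    loader = PyMuPDF4LLMLoader(
        file_path,
        mode="single",
        pages_delimiter=pages_delimiter,
    )
    return loader.load()


def split_pages(text: str, pages_delimiter: str) -> List[str]:
    """Recover per-page text from a single text flow."""
    if not pages_delimiter:
        raise ValueError("pages_delimiter must be a non-empty string")
    pages = text.split(pages_delimiter)
    # Drop an empty tail left by a delimiter after the final page, if any.
    if pages and pages[-1] == "":
        pages.pop()
    return pages
```

A distinctive delimiter such as \f makes this kind of round trip reliable; a bare \n cannot be told apart from ordinary line breaks in the extracted text.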
Extract images from the PDF
You can extract images from your PDFs (in text form) with a choice of three different solutions:
- RapidOCR (lightweight Optical Character Recognition tool)
- Tesseract (OCR tool with high precision)
- Multimodal language model
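The three options above can be sketched as one helper that picks an image parser and passes it to the loader. This assumes the image parsers shipped in langchain-community (RapidOCRBlobParser, TesseractBlobParser, LLMImageBlobParser) and their optional dependencies are installed. The helper names, the use of ChatOpenAI, and the model name are assumptions for illustration, not something the text prescribes.

```python
def make_images_parser(kind: str):
    """Return an image parser for one of: 'rapidocr', 'tesseract', 'llm'."""
    kind = kind.lower()
    if kind not in {"rapidocr", "tesseract", "llm"}:
        raise ValueError(f"unknown images parser kind: {kind!r}")

    # Deferred import: requires langchain-community plus the OCR backend chosen.
    from langchain_community.document_loaders import parsers

    if kind == "rapidocr":
        return parsers.RapidOCRBlobParser()
    if kind == "tesseract":
        return parsers.TesseractBlobParser()

    # Assumption: any multimodal chat model could be used here;
    # ChatOpenAI and the model name are illustrative choices.
    from langchain_openai import ChatOpenAI

    return parsers.LLMImageBlobParser(model=ChatOpenAI(model="gpt-4o-mini"))


def load_with_images(file_path: str, kind: str = "rapidocr"):
    """Load a PDF by page, converting embedded images to text."""
    images_parser = make_images_parser(kind)

    # Deferred import: requires the langchain-pymupdf4llm package.
    from langchain_pymupdf4llm import PyMuPDF4LLMLoader

    loader = PyMuPDF4LLMLoader(
        file_path,
        mode="page",
        extract_images=True,
        images_parser=images_parser,
    )
    return loader.load()
```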
Extract images from the PDF with rapidOCR
Extract images from the PDF with Tesseract
Extract images from the PDF with multimodal model
Extract tables from the PDF
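A sketch of table extraction, assuming the loader accepts a table_strategy argument that is forwarded to the underlying PyMuPDF table finder (for example "lines_strict" or "lines"); treat the argument values as assumptions to check against the package documentation. The second helper is a rough heuristic, not part of any library, for spotting Markdown table rows in the extracted text.

```python
from typing import List, Optional


def load_with_tables(file_path: str, table_strategy: Optional[str] = "lines"):
    """Load a PDF by page with table extraction enabled."""
    # Deferred import: requires the langchain-pymupdf4llm package.
    from langchain_pymupdf4llm import PyMuPDF4LLMLoader

    loader = PyMuPDF4LLMLoader(
        file_path,
        mode="page",
        table_strategy=table_strategy,
    )
    return loader.load()


def markdown_table_lines(text: str) -> List[str]:
    """Return the lines of `text` that look like Markdown table rows."""
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        # Heuristic: a table row starts and ends with a pipe character.
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            rows.append(stripped)
    return rows
```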
With PyMuPDF4LLM you can extract tables from your PDFs in Markdown format.
Working with Files
Many document loaders involve parsing files. The difference between such loaders usually stems from how the file is parsed, rather than how the file is loaded. For example, you can use open to read the binary content of either a PDF or a Markdown file, but you need different parsing logic to convert that binary data into text.
As a result, it can be helpful to decouple the parsing logic from the loading logic, which makes it easier to re-use a given parser regardless of how the data was loaded.
You can use this strategy to analyze different files, with the same parsing parameters.
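One way to sketch this decoupling is to pair a file-system blob loader (the loading logic) with a PyMuPDF4LLM parser (the parsing logic) through GenericLoader. This assumes langchain-community and langchain-pymupdf4llm are installed and that the package exposes PyMuPDF4LLMParser; the helper name and the default glob pattern are hypothetical.

```python
from pathlib import Path


def build_directory_loader(directory: str, glob: str = "*.pdf"):
    """Build a loader that parses every matching file with the same parser."""
    if not Path(directory).is_dir():
        raise NotADirectoryError(directory)

    # Deferred imports: require langchain-community and langchain-pymupdf4llm.
    from langchain_community.document_loaders import FileSystemBlobLoader
    from langchain_community.document_loaders.generic import GenericLoader
    from langchain_pymupdf4llm import PyMuPDF4LLMParser

    return GenericLoader(
        # Loading: find files on disk and expose them as binary blobs.
        blob_loader=FileSystemBlobLoader(path=directory, glob=glob),
        # Parsing: turn each blob into Documents with one set of parameters.
        blob_parser=PyMuPDF4LLMParser(),
    )
```

The returned loader supports the same load and lazy_load calls as before, so swapping the blob loader changes where the files come from without touching the parsing parameters.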