PyMuPDF
This page covers the PyMuPDF document loader. For detailed documentation of all PyMuPDFLoader features and configurations, head to the API reference.
Overview
Integration details
| Class | Package | Local | Serializable | JS support |
|---|---|---|---|---|
| PyMuPDFLoader | langchain-community | ✅ | ❌ | ❌ |
Loader features
| Source | Document Lazy Loading | Native Async Support | Extract Images | Extract Tables |
|---|---|---|---|---|
| PyMuPDFLoader | ✅ | ❌ | ✅ | ✅ |
Setup
Credentials
No credentials are required to use PyMuPDFLoader.
If you want automated, best-in-class tracing of your model calls, you can also set your LangSmith API key, for example through the LANGSMITH_API_KEY environment variable together with LANGSMITH_TRACING=true.
Installation
Install langchain-community and pymupdf, for example with pip install -qU langchain-community pymupdf.
Initialization
Now we can instantiate our loader object and load documents.
Load
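A minimal sketch of instantiating the loader and loading a local PDF is shown here. The `load_pdf` helper is an assumption made for illustration and is not part of the library. The import sits inside the function so the helper can be defined even where langchain-community or pymupdf is not installed.

```python
from pathlib import Path


def load_pdf(path: str):
    """Load a local PDF with PyMuPDFLoader and return a list of Documents.

    `load_pdf` is a hypothetical helper written for this example.
    """
    if not Path(path).is_file():
        # Fail early with a clear error. This sketch handles local files only.
        raise FileNotFoundError(f"No such PDF file: {path}")

    # Imported lazily so this module can be imported without the
    # optional dependencies installed.
    from langchain_community.document_loaders import PyMuPDFLoader

    loader = PyMuPDFLoader(path)
    return loader.load()
```

Calling `load_pdf("example.pdf")` (a placeholder path) should return Document objects that each expose `page_content` and `metadata`.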
Lazy Load
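Lazy loading yields documents one at a time, which helps when a PDF is too large to hold comfortably in memory. The sketch below groups the output of `lazy_load()` into fixed-size batches. `batched` and `count_pages_in_batches` are hypothetical helpers, and the per-batch step is a placeholder for whatever operation you need (for example, an index upsert).

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_pages_in_batches(path: str, batch_size: int = 10) -> int:
    """Stream a PDF through lazy_load() and process it batch by batch."""
    # Imported lazily so the pure-Python helper above works on its own.
    from langchain_community.document_loaders import PyMuPDFLoader

    loader = PyMuPDFLoader(path)
    total = 0
    for batch in batched(loader.lazy_load(), batch_size):
        # Placeholder: replace with a real per-batch operation.
        total += len(batch)
    return total
```

Because `batched` consumes any iterable lazily, only one batch of documents is held in memory at a time.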
The metadata attribute contains at least the following keys:
- source
- page (if in page mode)
- total_page
- creationdate
- creator
- producer
Splitting mode & custom pages delimiter
When loading the PDF file you can split it in two different ways:
- By page
- As a single text flow
Extract the PDF by page. Each page is extracted as a langchain Document object.
Extract the whole PDF as a single langchain Document object.
Add a custom pages_delimiter to mark where pages end in single mode.
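The two splitting modes and the custom delimiter can be sketched as follows, assuming a recent langchain-community release in which PyMuPDFLoader accepts the `mode` and `pages_delimiter` keyword arguments. The helper names and the delimiter string are illustrative, and `split_on_delimiter` assumes the delimiter is inserted between consecutive pages.

```python
from typing import List

# Illustrative delimiter; any string unlikely to occur in the text works.
CUSTOM_DELIMITER = "\n-------THIS IS A CUSTOM END OF PAGE-------\n"


def load_by_page(path: str):
    """Return one Document per page (mode="page")."""
    from langchain_community.document_loaders import PyMuPDFLoader

    return PyMuPDFLoader(path, mode="page").load()


def load_as_single_flow(path: str, pages_delimiter: str = CUSTOM_DELIMITER):
    """Return the whole PDF as one Document (mode="single")."""
    from langchain_community.document_loaders import PyMuPDFLoader

    loader = PyMuPDFLoader(path, mode="single", pages_delimiter=pages_delimiter)
    return loader.load()[0]


def split_on_delimiter(text: str, pages_delimiter: str = CUSTOM_DELIMITER) -> List[str]:
    """Recover per-page text from a single-flow string.

    Assumes the delimiter appears between pages and nowhere else.
    """
    return text.split(pages_delimiter)
```

With these helpers, `split_on_delimiter(load_as_single_flow("example.pdf").page_content)` (placeholder path) would give back a list of page texts from the single-flow Document.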
Extract images from the PDF
You can extract images from your PDFs with a choice of three different solutions:
- rapidOCR (lightweight Optical Character Recognition tool)
- Tesseract (OCR tool with high precision)
- Multimodal language model
Extract images from the PDF with rapidOCR
Extract images from the PDF with Tesseract
Extract images from the PDF with a multimodal model
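The three options above differ only in the image parser passed to the loader. The sketch below assumes a recent langchain-community release that exposes `RapidOCRBlobParser`, `TesseractBlobParser`, and `LLMImageBlobParser` in `langchain_community.document_loaders.parsers`, and that the matching OCR packages are installed. `make_images_parser` and `load_with_images` are hypothetical helpers, and `model` stands for any chat model with image support that you supply yourself.

```python
def make_images_parser(kind: str, model=None):
    """Return an image parser for "rapidocr", "tesseract", or "llm"."""
    if kind == "rapidocr":
        from langchain_community.document_loaders.parsers import RapidOCRBlobParser

        return RapidOCRBlobParser()
    if kind == "tesseract":
        from langchain_community.document_loaders.parsers import TesseractBlobParser

        return TesseractBlobParser()
    if kind == "llm":
        if model is None:
            raise ValueError("kind='llm' requires a multimodal chat model")
        from langchain_community.document_loaders.parsers import LLMImageBlobParser

        return LLMImageBlobParser(model=model)
    raise ValueError(f"Unknown image parser kind: {kind!r}")


def load_with_images(path: str, kind: str = "rapidocr", model=None):
    """Load a PDF page by page, embedding text recovered from images."""
    parser = make_images_parser(kind, model)

    from langchain_community.document_loaders import PyMuPDFLoader

    loader = PyMuPDFLoader(
        path,
        mode="page",
        # "markdown-img" wraps the recovered text in markdown image syntax;
        # "html-img" and "text" are the other accepted values.
        images_inner_format="markdown-img",
        images_parser=parser,
    )
    return loader.load()
```

Keeping the parser choice in one function makes it straightforward to compare OCR quality across the three approaches on the same file.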
Extract tables from the PDF
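As a minimal sketch of table extraction, the helper below passes the desired output format through the loader's `extract_tables` keyword argument, assuming a recent langchain-community release that supports it. `load_with_tables` and `TABLE_FORMATS` are names invented for this example.

```python
# Formats named on this page; anything else is rejected up front.
TABLE_FORMATS = ("html", "markdown", "csv")


def load_with_tables(path: str, table_format: str = "markdown"):
    """Load a PDF page by page, rendering detected tables in `table_format`."""
    if table_format not in TABLE_FORMATS:
        raise ValueError(
            f"table_format must be one of {TABLE_FORMATS}, got {table_format!r}"
        )

    # Imported lazily so the validation above works without the dependency.
    from langchain_community.document_loaders import PyMuPDFLoader

    loader = PyMuPDFLoader(path, mode="page", extract_tables=table_format)
    return loader.load()
```

The extracted tables end up inside each Document's `page_content`, so the choice of format mostly depends on what the downstream consumer parses best.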
With PyMuPDF you can extract tables from your PDFs in html, markdown, or csv format.
Working with Files
Many document loaders involve parsing files. The difference between such loaders usually stems from how the file is parsed, rather than how the file is loaded. For example, you can use open to read the binary content of either a PDF or a markdown file, but you need different parsing logic to convert that binary data into text.
As a result, it can be helpful to decouple the parsing logic from the loading logic, which makes it easier to re-use a given parser regardless of how the data was loaded.
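The idea can be illustrated without any LangChain classes. In this conceptual sketch, the loading step is the same for every file type, while the parsing step is chosen by file suffix. All function names are hypothetical, and the PDF parser assumes a recent PyMuPDF release that can be imported as `pymupdf`.

```python
from pathlib import Path
from typing import Callable, Dict


def read_bytes(path: str) -> bytes:
    """Loading step: identical for every file type."""
    with open(path, "rb") as handle:
        return handle.read()


def parse_markdown(data: bytes) -> str:
    """Parsing step for markdown: decode the bytes as UTF-8 text."""
    return data.decode("utf-8")


def parse_pdf(data: bytes) -> str:
    """Parsing step for PDF: extract the text of each page with PyMuPDF."""
    import pymupdf  # older PyMuPDF releases expose this module as `fitz`

    doc = pymupdf.open(stream=data, filetype="pdf")
    try:
        return "\n".join(page.get_text() for page in doc)
    finally:
        doc.close()


# The same loader feeds whichever parser matches the file suffix.
PARSERS: Dict[str, Callable[[bytes], str]] = {
    ".md": parse_markdown,
    ".pdf": parse_pdf,
}


def load_text(path: str) -> str:
    """Load the raw bytes, then dispatch to the parser for this file type."""
    suffix = Path(path).suffix.lower()
    if suffix not in PARSERS:
        raise ValueError(f"No parser registered for {suffix!r}")
    return PARSERS[suffix](read_bytes(path))
```

Swapping in a different source (a network fetch, a cloud bucket) would only change `read_bytes`; the parsers stay untouched.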
You can use this strategy to analyze different files, with the same parsing parameters.
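In LangChain terms, one way to apply this is to pair a blob loader with a blob parser through `GenericLoader`. The sketch below assumes `FileSystemBlobLoader`, `GenericLoader`, and `PyMuPDFParser` are available from langchain-community at the import paths shown. `build_pdf_directory_loader` is a hypothetical helper, and the directory and glob pattern are placeholders.

```python
from pathlib import Path


def build_pdf_directory_loader(directory: str, glob: str = "*.pdf"):
    """Return a loader that parses every matching file with one parser."""
    if not Path(directory).is_dir():
        raise NotADirectoryError(f"Not a directory: {directory}")

    # Imported lazily so the check above works without the dependencies.
    from langchain_community.document_loaders import FileSystemBlobLoader
    from langchain_community.document_loaders.generic import GenericLoader
    from langchain_community.document_loaders.parsers import PyMuPDFParser

    return GenericLoader(
        # Loading: find files on disk and expose them as blobs.
        blob_loader=FileSystemBlobLoader(path=directory, glob=glob),
        # Parsing: one parser instance, so every file gets the same settings.
        blob_parser=PyMuPDFParser(),
    )
```

Calling `.load()` or `.lazy_load()` on the returned loader would then parse each matching file with identical parser settings.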
API reference
For detailed documentation of all PyMuPDFLoader features and configurations, head to the API reference: python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PyMuPDFLoader.html