LLM Sherpa
LLM Sherpa can be used to load files of many types. It supports several file formats, including DOCX, PPTX, HTML, TXT, and XML.
LLMSherpaFileLoader uses LayoutPDFReader, which is part of the LLMSherpa library. This tool is designed to parse PDFs while preserving their layout information, which most PDF-to-text parsers lose.
Here are some key features of LayoutPDFReader:
- It can identify and extract sections and subsections along with their levels.
- It combines lines to form paragraphs.
- It can identify links between sections and paragraphs.
- It can extract tables along with the section the tables are found in.
- It can identify and extract lists and nested lists.
- It can join content spread across pages.
- It can remove repeating headers and footers.
- It can remove watermarks.
INFO: This library fails on some PDF files, so use it with caution.
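To make the section and subsection levels listed above more concrete, here is a minimal sketch that prints an indented outline of a parsed document. The helper names `outline` and `read_layout` are hypothetical, and the parser URL in the commented usage is an assumption about a locally hosted nlm-ingestor service. The sketch relies on `LayoutPDFReader.read_pdf` and on the `sections()`, `tables()`, and `chunks()` methods of the returned document, plus each section's `level` and `title` attributes.

```python
from typing import Any, List


def outline(doc: Any) -> List[str]:
    """Return one line per section, indented by the section's nesting level."""
    return [f"{'  ' * section.level}{section.title}" for section in doc.sections()]


def read_layout(pdf_path: str, api_url: str) -> Any:
    """Parse a PDF with LayoutPDFReader.

    Requires `pip install llmsherpa` and a reachable parser service.
    The import is done lazily so that `outline` has no third-party dependency.
    """
    from llmsherpa.readers import LayoutPDFReader

    return LayoutPDFReader(api_url).read_pdf(pdf_path)


# Example usage (assumed local parser URL; not executed here):
# doc = read_layout("example.pdf", "http://localhost:5010/api/parseDocument?renderFormat=all")
# print("\n".join(outline(doc)))
# print(len(doc.tables()), "tables,", len(doc.chunks()), "chunks")
```

Given the INFO note above, it may be worth wrapping the `read_layout` call in a try/except when processing PDFs in bulk, so that one failing file does not stop the whole run.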
LLMSherpaFileLoader
Under the hood, LLMSherpaFileLoader defines several strategies for loading file content: ["sections", "chunks", "html", "text"]. Set up nlm-ingestor to get an llmsherpa_api_url, or use the default.
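As a rough illustration of choosing a strategy and optionally pointing the loader at a self-hosted service, the sketch below validates the strategy name before constructing the loader. The helpers `loader_kwargs` and `load_documents` are hypothetical names introduced here. The `file_path`, `strategy`, and `llmsherpa_api_url` keyword arguments, and the `langchain_community` import path, are stated as I understand the loader's constructor, so check them against your installed version.

```python
from typing import Any, Dict, List, Optional

# The four strategy names come from the text above.
STRATEGIES = ("sections", "chunks", "html", "text")


def loader_kwargs(
    file_path: str, strategy: str = "chunks", api_url: Optional[str] = None
) -> Dict[str, Any]:
    """Validate the strategy and build keyword arguments for LLMSherpaFileLoader."""
    if strategy not in STRATEGIES:
        raise ValueError(f"strategy must be one of {STRATEGIES}, got {strategy!r}")
    kwargs: Dict[str, Any] = {"file_path": file_path, "strategy": strategy}
    if api_url is not None:
        # Override only when a self-hosted nlm-ingestor URL is supplied;
        # otherwise the loader falls back to its own default URL.
        kwargs["llmsherpa_api_url"] = api_url
    return kwargs


def load_documents(
    file_path: str, strategy: str = "chunks", api_url: Optional[str] = None
) -> List[Any]:
    """Load a file into documents (requires langchain-community and llmsherpa)."""
    # Imported lazily so that loader_kwargs can be used without the dependency.
    from langchain_community.document_loaders.llmsherpa import LLMSherpaFileLoader

    return LLMSherpaFileLoader(**loader_kwargs(file_path, strategy, api_url)).load()


# Example usage (not executed here):
# docs = load_documents("example.pdf", strategy="sections")
```

Checking the strategy up front gives a clear error before any network request is made, which is convenient when the strategy name comes from user configuration.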