arXiv is an open-access archive for 2 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.
This notebook shows how to retrieve scientific articles from arXiv.org into the Document format that is used downstream.
For detailed documentation of all ArxivRetriever features and configurations, head to the API reference.
Integration details
Setup
If you want to get automated tracing from individual queries, you can also set your LangSmith API key by uncommenting below:
Installation
This retriever lives in the langchain-community package. We will also need the arxiv dependency:
Instantiation
ArxivRetriever parameters include:
- optional load_max_docs: default=100. Use it to limit the number of downloaded documents. Downloading all 100 documents takes time, so use a small number for experiments. There is currently a hard limit of 300.
- optional load_all_available_meta: default=False. By default only the most important fields are downloaded: Published (date when the document was published/last updated), Title, Authors, Summary. If True, the other fields are downloaded as well.
- get_full_documents: boolean, default False. Determines whether to fetch the full text of documents.
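A minimal instantiation sketch using two of the parameters listed above. The values are illustrative, and `RETRIEVER_KWARGS` and `make_retriever` are hypothetical names introduced here. The import path assumes the langchain-community package is installed.

```python
# Illustrative settings: a small load_max_docs keeps experiments fast,
# and get_full_documents=True asks for full text instead of summaries.
RETRIEVER_KWARGS = {
    "load_max_docs": 2,
    "get_full_documents": True,
}


def make_retriever(**overrides):
    """Build an ArxivRetriever from the defaults above plus any overrides."""
    # Imported here so this module still loads when the optional
    # langchain-community and arxiv packages are not installed.
    from langchain_community.retrievers import ArxivRetriever

    return ArxivRetriever(**{**RETRIEVER_KWARGS, **overrides})
```

Calling `make_retriever(load_max_docs=5)` would keep `get_full_documents=True` and raise only the document limit.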
Usage
ArxivRetriever supports retrieval by article identifier:
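A short sketch of identifier-based retrieval, assuming the retriever exposes the standard `invoke` method and returns documents with a `page_content` attribute. The helper names `fetch_by_id` and `preview` are hypothetical, and the identifier in the comment is only an example.

```python
def fetch_by_id(retriever, arxiv_id):
    """Return the documents the retriever finds for one arXiv identifier."""
    # The identifier is passed to invoke() as a plain query string.
    return retriever.invoke(arxiv_id)


def preview(doc, max_chars=400):
    """Return the first max_chars characters of a document's text."""
    return doc.page_content[:max_chars]


# Example use (needs network access plus the langchain-community and arxiv packages):
# from langchain_community.retrievers import ArxivRetriever
# docs = fetch_by_id(ArxivRetriever(load_max_docs=2), "1605.08386")
# print(docs[0].metadata)
# print(preview(docs[0]))
```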
ArxivRetriever also supports retrieval based on natural language text:
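A sketch of a free-text query, under the same assumption that the retriever exposes `invoke`. It reads the `Title` metadata field named in the parameter description. `search_titles` is a hypothetical helper, and the question in the comment is only an example.

```python
def search_titles(retriever, question):
    """Run a natural-language query and return the title of each hit."""
    docs = retriever.invoke(question)
    # .get() avoids a KeyError if a document has no "Title" entry.
    return [doc.metadata.get("Title", "<untitled>") for doc in docs]


# Example use (needs network access plus the langchain-community and arxiv packages):
# from langchain_community.retrievers import ArxivRetriever
# print(search_titles(ArxivRetriever(load_max_docs=2), "What is the ImageBind model?"))
```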
Use within a chain
Like other retrievers, ArxivRetriever can be incorporated into LLM applications via chains.
We will need an LLM or chat model:
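One way such a chain could be assembled, sketched with the langchain-core building blocks. The prompt wording and the names `format_docs`, `PROMPT_TEMPLATE`, and `build_chain` are illustrative assumptions. The `llm` argument stands for whichever chat model you choose; none is prescribed here.

```python
PROMPT_TEMPLATE = """Answer the question based only on the context provided.

Context: {context}

Question: {question}"""


def format_docs(docs):
    """Join the retrieved document texts into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_chain(retriever, llm):
    """Compose retriever -> prompt -> model -> plain-string output."""
    # Imported here so the helpers above load without langchain-core installed.
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnablePassthrough

    prompt = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
    return (
        # The question is sent to the retriever for context and also
        # passed through unchanged to fill the {question} slot.
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )


# Example use, given a retriever and a chat model instance named llm:
# chain = build_chain(retriever, llm)
# print(chain.invoke("What is the ImageBind model?"))
```

Keeping `format_docs` separate makes it easy to change how much of each document reaches the prompt, which matters when `get_full_documents` is True and the texts are long.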
API reference
For detailed documentation of all ArxivRetriever features and configurations, head to the API reference.