Wikipedia is a multilingual free online encyclopedia written and maintained by a community of volunteers, known as Wikipedians, through open collaboration and using a wiki-based editing system called MediaWiki. Wikipedia
is the largest and most-read reference work in history.
This notebook shows how to retrieve wiki pages from wikipedia.org into the `Document` format that is used downstream.
Integration details
Setup
To enable automated tracing of individual tools, set your LangSmith API key:
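A sketch of that setup, assuming you have a LangSmith account (both environment variables are read by LangChain at runtime):

```python
import getpass
import os

# Optional: enable LangSmith tracing (assumes a LangSmith account and API key).
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
```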
Installation

The integration lives in the `langchain-community` package. We also need to install the `wikipedia` Python package itself.
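In a notebook, for example:

```python
%pip install -qU langchain-community wikipedia
```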
Instantiation
Now we can instantiate our retriever; a minimal example follows the parameter list. `WikipediaRetriever` parameters include:

- optional `lang`: default="en". Use it to search in a specific language part of Wikipedia.
- optional `load_max_docs`: default=100. Use it to limit the number of downloaded documents. It takes time to download all 100 documents, so use a small number for experiments; there is a hard limit of 300 for now.
- optional `load_all_available_meta`: default=False. By default only the most important fields are downloaded: `Published` (the date the document was published or last updated), `title`, and `Summary`. If True, other fields are also downloaded.

`get_relevant_documents()` has one argument, `query`: free text used to find documents in Wikipedia.
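A minimal instantiation sketch, using the parameters described above (the small `load_max_docs` just keeps experiments fast):

```python
from langchain_community.retrievers import WikipediaRetriever

# English Wikipedia, limited to two documents per query for quick experiments.
retriever = WikipediaRetriever(lang="en", load_max_docs=2, load_all_available_meta=False)
```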
Usage
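For example, pass a free-text query to `invoke` (the standard runnable entry point, which wraps `get_relevant_documents`); the query and printed slice here are illustrative:

```python
docs = retriever.invoke("TOKYO GHOUL")

# Each result is a Document carrying Wikipedia metadata and the page text.
print(docs[0].metadata["title"])
print(docs[0].page_content[:400])
```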
Use within a chain
Like other retrievers, `WikipediaRetriever` can be incorporated into LLM applications via chains.
We will need an LLM or chat model:
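A sketch of such a chain, assuming an OpenAI chat model is available (any chat model works; the model name, prompt wording, and question are illustrative):

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# Assumption: OPENAI_API_KEY is set and langchain-openai is installed.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the context provided.

Context: {context}

Question: {question}"""
)


def format_docs(docs):
    # Join the retrieved Wikipedia pages into a single context string.
    return "\n\n".join(doc.page_content for doc in docs)


chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

chain.invoke("Who is the main character in Tokyo Ghoul?")
```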
API reference

For detailed documentation of all `WikipediaRetriever` features and configurations, head to the API reference.