- C (*)
- C++ (*)
- C# (*)
- COBOL
- Elixir
- Go (*)
- Java (*)
- JavaScript (requires package esprima)
- Kotlin (*)
- Lua (*)
- Perl (*)
- Python
- Ruby (*)
- Rust (*)
- Scala (*)
- TypeScript (*)
Items marked with (*) require the packages tree_sitter and tree_sitter_languages.
It is straightforward to add support for additional languages using tree_sitter, although this currently requires modifying LangChain.
The language used for parsing can be configured, along with the minimum number of
lines required to activate the splitting based on syntax.
If a language is not explicitly specified, LanguageParser will infer one from filename extensions, if present.
parser_threshold indicates the minimum number of lines that the source code file must have to be segmented using the parser.
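The two configuration rules above (explicit language first, extension inference as a fallback, and a line-count gate before syntax-based splitting) can be illustrated with a minimal standard-library sketch. It is not LangChain's implementation: the extension table, the function names `resolve_language` and `should_segment`, and the inclusive treatment of the threshold are all assumptions made for illustration.

```python
from pathlib import PurePath
from typing import Optional

# Hypothetical extension table, for illustration only; the real mapping
# lives inside LangChain and may differ.
EXTENSION_TO_LANGUAGE = {"py": "python", "js": "js", "cpp": "cpp", "go": "go"}


def resolve_language(filename: str, explicit: Optional[str] = None) -> Optional[str]:
    """Prefer an explicitly configured language, else infer one from the extension."""
    if explicit is not None:
        return explicit
    suffix = PurePath(filename).suffix.lstrip(".").lower()
    # Returns None when there is no extension or it is not in the table.
    return EXTENSION_TO_LANGUAGE.get(suffix)


def should_segment(source: str, parser_threshold: int = 0) -> bool:
    """Segment only files with at least `parser_threshold` lines.

    Treating the threshold as inclusive is an assumption of this sketch.
    """
    return len(source.splitlines()) >= parser_threshold
```

With this sketch, a file that falls below the threshold would be kept as a single document rather than being split into functions and classes.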
Splitting
Additional splitting could be needed for those functions, classes, or scripts that are too big.
Adding Languages using Tree-sitter Template
Expanding language support using the Tree-Sitter template involves a few essential steps:
- Creating a New Language File:
  - Begin by creating a new file in the designated directory (langchain/libs/community/langchain_community/document_loaders/parsers/language).
  - Model this file on the structure and parsing logic of existing language files like cpp.py.
  - You will also need to create a file in the langchain directory (langchain/libs/langchain/langchain/document_loaders/parsers/language).
- Parsing Language Specifics:
  - Mimic the structure used in the cpp.py file, adapting it to suit the language you are incorporating.
  - The primary alteration involves adjusting the chunk query array to suit the syntax and structure of the language you are parsing.
- Testing the Language Parser:
  - For thorough validation, generate a test file specific to the new language. Create test_language.py in the designated directory (langchain/libs/community/tests/unit_tests/document_loaders/parsers/language).
  - Follow the example set by test_cpp.py to establish fundamental tests for the parsed elements in the new language.
- Integration into the Parser and Text Splitter:
  - Incorporate your new language within the language_parser.py file. Be sure to update LANGUAGE_EXTENSIONS and LANGUAGE_SEGMENTERS, along with the docstring for LanguageParser, so that the added language is recognized and handled.
  - Also, confirm that your language is included in text_splitter.py in class Language for proper parsing.
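The steps above can be sketched in miniature as follows. This is a rough, self-contained illustration of the shape of the work, not the actual LangChain code: `SegmenterSketch` is a stand-in for the tree-sitter segmenter base class, `MyLangSegmenter` and the `mylang` extension are hypothetical, the method names are assumptions, and the node names in the chunk query are placeholders rather than a real grammar. Only the dictionary names LANGUAGE_EXTENSIONS and LANGUAGE_SEGMENTERS come from the text.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class SegmenterSketch(ABC):
    """Stand-in for the tree-sitter segmenter base class (not the real one)."""

    @abstractmethod
    def get_chunk_query(self) -> str:
        """Return the tree-sitter query that selects top-level chunks."""

    @abstractmethod
    def make_line_comment(self, text: str) -> str:
        """Render `text` as a line comment in the target language."""


# Parsing specifics: the chunk query is the main language-specific piece.
# The node names below are placeholders, not taken from a real grammar.
CHUNK_QUERY = """
[
    (function_definition) @function
    (class_definition) @class
]
""".strip()


class MyLangSegmenter(SegmenterSketch):
    """Hypothetical segmenter for a hypothetical language."""

    def get_chunk_query(self) -> str:
        return CHUNK_QUERY

    def make_line_comment(self, text: str) -> str:
        return f"// {text}"


# Integration: register the extension and the segmenter.
# In LangChain these dictionaries live in language_parser.py; here they
# are local stand-ins seeded with a single existing entry.
LANGUAGE_EXTENSIONS: Dict[str, str] = {"cpp": "cpp"}
LANGUAGE_SEGMENTERS: Dict[str, Type[SegmenterSketch]] = {}

LANGUAGE_EXTENSIONS["mylang"] = "mylang"
LANGUAGE_SEGMENTERS["mylang"] = MyLangSegmenter
```

A unit test modeled on test_cpp.py would then look up the segmenter for the new extension and assert on the chunks it extracts from a small sample source string.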