Diffbot is a suite of ML-based products that make it easy to structure web data.
Diffbot’s Extract API is a service that structures and normalizes data from web pages.
Unlike traditional web scraping tools, Diffbot Extract doesn’t require any rules to read the content on a page. It uses a computer vision model to classify a page into one of 20 possible types, and then transforms raw HTML markup into JSON. The resulting structured JSON follows a consistent type-based ontology, which makes it easy to extract data from multiple web sources with the same schema.
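To make the "HTML in, typed JSON out" idea concrete, here is a minimal standard-library sketch. It assumes the Analyze endpoint at `https://api.diffbot.com/v3/analyze`, which takes `token` and `url` query parameters, and a response whose `objects` list carries `type`, `title`, and `text` fields. The helper names are hypothetical, and the exact fields returned depend on the page type, so treat this as an illustration and not as the loader's actual implementation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for automatic page-type classification plus extraction.
ANALYZE_ENDPOINT = "https://api.diffbot.com/v3/analyze"


def build_request_url(page_url, token):
    """Build the GET URL for one page; query values are percent-encoded."""
    return f"{ANALYZE_ENDPOINT}?{urlencode({'token': token, 'url': page_url})}"


def fetch(page_url, token, timeout=30):
    """Call the API and return the decoded JSON body (requires network access)."""
    with urlopen(build_request_url(page_url, token), timeout=timeout) as resp:
        return json.load(resp)


def summarize(response):
    """Reduce a response to the fields shared across page types (assumed names)."""
    return [
        {
            "type": obj.get("type"),
            "title": obj.get("title"),
            "text": obj.get("text", ""),
        }
        for obj in response.get("objects", [])
    ]
```

Because every extracted object is reduced to the same keys, downstream code can treat pages from different sites uniformly, which is the point of the shared schema.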
Overview
This guide covers how to use the Diffbot Extract API to turn a list of URLs into structured JSON that we can use downstream.
Setting up
Start by installing the required packages.
Using the Document Loader
Import the DiffbotLoader class and instantiate it with a list of URLs and your Diffbot token.
With the .load() method, you can see the documents loaded.
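The loader step might look like the sketch below. It assumes the `langchain-community` package is installed and that `DiffbotLoader` accepts `urls` and `api_token` keyword arguments. The `load_with_diffbot` and `preview` helpers are hypothetical names introduced here for illustration, and the `source` metadata key is an assumption that may differ between versions.

```python
def load_with_diffbot(urls, api_token):
    """Fetch each URL through Diffbot and return LangChain Document objects."""
    # Imported lazily so this module can be imported without the package installed.
    from langchain_community.document_loaders import DiffbotLoader

    loader = DiffbotLoader(urls=urls, api_token=api_token)
    return loader.load()  # performs one network request per URL


def preview(docs, width=80):
    """Return one short line per document: its source and the start of its text."""
    lines = []
    for doc in docs:
        # "source" is an assumed metadata key; fall back if it is absent.
        source = doc.metadata.get("source", "<unknown>")
        # Collapse runs of whitespace so the snippet fits on one line.
        snippet = " ".join(doc.page_content.split())[:width]
        lines.append(f"{source}: {snippet}")
    return lines
```

With a token stored in an environment variable (the name `DIFFBOT_API_TOKEN` is only a suggestion), calling `preview(load_with_diffbot(urls, os.environ["DIFFBOT_API_TOKEN"]))` gives a quick look at what was loaded.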
Transform Extracted Text to a Graph Document
Structured page content can be further processed with DiffbotGraphTransformer to extract entities and relationships into a graph.
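A minimal sketch of that step, assuming the `langchain-experimental` package is installed and that `DiffbotGraphTransformer` takes a `diffbot_api_key` argument and exposes `convert_to_graph_documents`. The `to_graph_documents` and `triples` helpers are hypothetical names used only for this example.

```python
def to_graph_documents(docs, diffbot_api_key):
    """Convert loaded documents into graph documents (nodes and relationships)."""
    # Imported lazily so this module can be imported without the package installed.
    from langchain_experimental.graph_transformers.diffbot import (
        DiffbotGraphTransformer,
    )

    transformer = DiffbotGraphTransformer(diffbot_api_key=diffbot_api_key)
    return transformer.convert_to_graph_documents(docs)  # calls the Diffbot API


def triples(graph_docs):
    """Flatten graph documents into (source id, relationship type, target id) tuples."""
    result = []
    for graph_doc in graph_docs:
        for rel in graph_doc.relationships:
            result.append((rel.source.id, rel.type, rel.target.id))
    return result
```

Flattening to triples is just one convenient way to inspect the result; the graph documents themselves can also be written to a graph store.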
For details, see the DiffbotGraphTransformer guide.