Scraping
Classify maintains a continuously updated corpus of scraped web content. When you submit a URL, Classify fetches and indexes its content — extracting the full page text, title, language, metadata, and supply chain information. This data powers both classification and direct content retrieval.
What scraping gives you
| Data | Description |
|---|---|
| Full page text | The complete readable text of the page, stripped of HTML |
| Title | The page title as published |
| Language | Detected content language (e.g. en, fr) |
| Ads.txt supply paths | Authorized seller data for the domain |
| Header metadata | HTTP and HTML metadata associated with the page |
| Published / updated timestamps | When the content was originally published and last modified |
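
As a rough illustration of how these fields might be held in client code, the sketch below models a scraped artifact as a Python dataclass. The class name `ScrapedArtifact`, the helper `artifact_from_dict`, and every field and key name are assumptions made for this example; the actual response schema is defined in the Scraping API reference and may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ScrapedArtifact:
    """Hypothetical client-side container for the data listed in the table above."""

    text: str                       # full readable page text, HTML removed
    title: str                      # page title as published
    language: str                   # detected language code, e.g. "en"
    ads_txt_supply_paths: List[str] = field(default_factory=list)
    header_metadata: Dict[str, str] = field(default_factory=dict)
    published_at: Optional[str] = None  # timestamp string, format not assumed
    updated_at: Optional[str] = None


def artifact_from_dict(raw: dict) -> ScrapedArtifact:
    """Build a ScrapedArtifact from a decoded JSON object.

    The key names read here are assumptions, not the documented schema.
    """
    return ScrapedArtifact(
        text=raw.get("text", ""),
        title=raw.get("title", ""),
        language=raw.get("language", ""),
        ads_txt_supply_paths=list(raw.get("ads_txt_supply_paths", [])),
        header_metadata=dict(raw.get("header_metadata", {})),
        published_at=raw.get("published_at"),
        updated_at=raw.get("updated_at"),
    )
```

Optional fields default to empty values, so a page without ads.txt data or timestamps can still be represented.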
How it works
Scraping is asynchronous. If a URL hasn't been indexed yet, you submit it for scraping and Classify crawls it in the background. Once indexed, the full artifact is available for retrieval.
A typical flow:
- Check whether the URL has already been scraped (POST /v1/scraping/search)
- Request scraping if it hasn't (POST /v1/scraping/jobs)
- Retrieve the artifact once processing is complete
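
The flow above could be sketched in Python roughly as follows. Only the two endpoint paths come from this page. The host in `BASE_URL`, the `{"url": ...}` request body, the `results` key in the search response, and the omitted authentication header are all placeholders assumed for illustration. Because this page does not name a retrieval endpoint, the sketch simply re-runs the search until the artifact appears, which may not match the real API.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

BASE_URL = "https://api.example.com"  # assumption: placeholder host

# A transport takes (path, payload) and returns the decoded JSON response.
Transport = Callable[[str, dict], dict]


def http_post(path: str, payload: dict) -> dict:
    """POST a JSON payload to BASE_URL + path and decode the JSON reply."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # auth omitted (assumption)
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def find_artifact(url: str, post: Transport) -> Optional[dict]:
    """Step 1: check whether the URL has already been scraped."""
    reply = post("/v1/scraping/search", {"url": url})  # body shape is assumed
    matches = reply.get("results", [])                 # key name is assumed
    return matches[0] if matches else None


def ensure_scraped(
    url: str,
    post: Transport = http_post,
    attempts: int = 5,
    delay_seconds: float = 2.0,
) -> Optional[dict]:
    """Return the artifact for url, requesting a scrape first if needed."""
    artifact = find_artifact(url, post)
    if artifact is not None:
        return artifact

    # Step 2: request scraping; Classify crawls in the background.
    post("/v1/scraping/jobs", {"url": url})  # body shape is assumed

    # Step 3: scraping is asynchronous, so poll until the artifact is indexed.
    for _ in range(attempts):
        time.sleep(delay_seconds)
        artifact = find_artifact(url, post)
        if artifact is not None:
            return artifact
    return None  # still processing after all attempts
```

Passing the transport as a parameter keeps the flow testable without network access; a caller could swap in a function that adds authentication or retries.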
See the Scraping API reference for full endpoint documentation.
Use cases
- Content research — retrieve full page text from any URL for analysis or enrichment
- Brand safety — inspect page content before allowing ad placement
- Audience building — provide URLs as seeds when creating contextual segments
- Supply chain transparency — access ads.txt data to verify authorized sellers