Scraping

Classify maintains a continuously updated corpus of scraped web content. When you submit a URL, Classify fetches and indexes its content — extracting the full page text, title, language, metadata, and supply chain information. This data powers both classification and direct content retrieval.

What scraping gives you

Data                             Description
Full page text                   The complete readable text of the page, stripped of HTML
Title                            The page title as published
Language                         Detected content language (e.g. en, fr)
Ads.txt supply paths             Authorized seller data for the domain
Header metadata                  HTTP and HTML metadata associated with the page
Published / updated timestamps   When the content was originally published and last modified
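
As a rough illustration, the fields listed above could be modeled on the client side as follows. This is only a sketch: the class name ScrapeArtifact, every field name, and the ISO 8601 timestamp format are assumptions made for the example, not the documented response schema. Check the Scraping API reference for the actual field names.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class ScrapeArtifact:
    """Hypothetical client-side model of a scraped page (field names assumed)."""

    url: str
    text: str                                   # full page text, stripped of HTML
    title: Optional[str] = None                 # page title as published
    language: Optional[str] = None              # detected language, e.g. "en"
    ads_txt_supply_paths: List[str] = field(default_factory=list)
    header_metadata: Dict[str, str] = field(default_factory=dict)
    published_at: Optional[datetime] = None
    updated_at: Optional[datetime] = None

    @classmethod
    def from_dict(cls, raw: dict) -> "ScrapeArtifact":
        """Build an artifact from a decoded JSON object (assumed key names)."""

        def parse_ts(value: Optional[str]) -> Optional[datetime]:
            # Assumes ISO 8601 strings with an explicit offset.
            return datetime.fromisoformat(value) if value else None

        return cls(
            url=raw["url"],
            text=raw.get("text", ""),
            title=raw.get("title"),
            language=raw.get("language"),
            ads_txt_supply_paths=list(raw.get("ads_txt_supply_paths", [])),
            header_metadata=dict(raw.get("header_metadata", {})),
            published_at=parse_ts(raw.get("published_at")),
            updated_at=parse_ts(raw.get("updated_at")),
        )
```

Optional fields default to empty values so that a partially populated artifact can still be represented.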

How it works

Scraping is asynchronous. If a URL hasn't been indexed yet, you submit it for scraping and Classify crawls it in the background. Once indexed, the full artifact is available for retrieval.

A typical flow:

  1. Check whether the URL has already been scraped (POST /v1/scraping/search)
  2. Request scraping if it hasn't (POST /v1/scraping/jobs)
  3. Retrieve the artifact once processing is complete
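
The flow above can be sketched in Python as follows. Only the two endpoint paths come from this page. The base URL, the bearer-token header, the "urls" request field, the "artifacts" response field, and the idea of polling the search endpoint to retrieve the finished artifact are all assumptions made for illustration, so treat this as a sketch rather than the documented client behavior.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

# Endpoint paths as listed in the flow above.
SEARCH_PATH = "/v1/scraping/search"
JOBS_PATH = "/v1/scraping/jobs"

# A transport is any callable that POSTs a JSON payload and returns decoded JSON.
Post = Callable[[str, dict], dict]


def make_http_post(base_url: str, api_key: str) -> Post:
    """Build a JSON POST helper. Base URL and bearer auth are assumptions."""

    def post(path: str, payload: dict) -> dict:
        request = urllib.request.Request(
            base_url.rstrip("/") + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            },
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    return post


def find_artifact(post: Post, url: str) -> Optional[dict]:
    """Step 1: look up an existing artifact (payload field names are hypothetical)."""
    result = post(SEARCH_PATH, {"urls": [url]})
    artifacts = result.get("artifacts", [])
    return artifacts[0] if artifacts else None


def scrape(
    post: Post,
    url: str,
    attempts: int = 5,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Return the artifact for `url`, requesting a scrape if it is not indexed."""
    artifact = find_artifact(post, url)
    if artifact is not None:
        return artifact

    # Step 2: request scraping; the crawl runs in the background.
    post(JOBS_PATH, {"urls": [url]})

    # Step 3: poll until the artifact appears (assumed retrieval mechanism).
    for _ in range(attempts):
        sleep(delay_s)
        artifact = find_artifact(post, url)
        if artifact is not None:
            return artifact
    return None  # still processing after all attempts
```

With a real transport this might be called as scrape(make_http_post(base_url, api_key), "https://example.com/article"), where base_url and api_key are placeholders you supply. Passing the transport in as a parameter keeps the flow testable without network access.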

See the Scraping API reference for full endpoint documentation.

Use cases

  • Content research — retrieve full page text from any URL for analysis or enrichment
  • Brand safety — inspect page content before allowing ad placement
  • Audience building — provide URLs as seeds when creating contextual segments
  • Supply chain transparency — access ads.txt data to verify authorized sellers