Scraping
Classify maintains a continuously updated index of scraped web content. Submit URLs to retrieve full page text, metadata, and ads.txt supply chain data. Already-indexed URLs are returned instantly; new URLs are scraped asynchronously.
Two ways to use scraping:
- Search — look up a single URL. If it's already indexed, you get the result immediately.
- Jobs — submit a batch of up to 1,000,000 URLs for scraping. Poll for results.
The scrape result object
Each scraped URL produces a result with the following structure:
{
  "url": "https://www.thegamer.com/skyrim-every-daedric-prince-realm/",
  "tld": "thegamer.com",
  "errors": [],
  "published_date": "2026-02-10T08:02:11Z",
  "last_updated_date": "2026-02-11T18:10:13Z",
  "content": {
    "title": "Skyrim: Every Daedric Prince and Their Realm of Oblivion",
    "full_text": "While the Nine Divines and several other deities...",
    "language": "en",
    "word_count": 2450
  },
  "metadata": {
    "http_status": 200,
    "content_type": "text/html; charset=utf-8",
    "canonical_url": "https://www.thegamer.com/skyrim-every-daedric-prince-realm/",
    "og_title": "Skyrim: Every Daedric Prince and Their Realm of Oblivion",
    "og_description": "A guide to every Daedric Prince and their plane of Oblivion.",
    "og_image": "https://static0.thegamerimages.com/wordpress/wp-content/uploads/daedric-princes.jpg",
    "og_type": "article",
    "author": "John Smith",
    "meta_description": "A guide to every Daedric Prince and their plane of Oblivion.",
    "robots": "index, follow"
  },
  "ads_txt": {
    "status": "found",
    "seller_count": 48,
    "entries": [
      {
        "domain": "google.com",
        "account_id": "pub-1234567890123456",
        "relationship": "DIRECT",
        "certification_id": "f08c47fec0942fa0"
      },
      {
        "domain": "rubiconproject.com",
        "account_id": "12345",
        "relationship": "RESELLER",
        "certification_id": "0bfd66d529a55807"
      }
    ]
  }
}
Top-level fields
| Field | Type | Description |
|---|---|---|
| url | string | The URL that was scraped |
| tld | string | Registrable domain extracted from the URL (e.g. "thegamer.com") |
| errors | array[string] | Error codes for this URL (empty if successful). See Error codes. |
| published_date | string (ISO 8601) \| null | When the content was originally published (if detectable) |
| last_updated_date | string (ISO 8601) \| null | When the content was last modified (if detectable) |
| content | object \| null | Extracted page content. null if scraping failed. |
| metadata | object \| null | HTTP and HTML metadata. null if scraping failed. |
| ads_txt | object \| null | Ads.txt supply chain data for the domain. null if not found. |
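Since content, metadata, and ads_txt are all nullable, client code should check errors before dereferencing nested fields. A minimal sketch of defensive access (summarize_result is a hypothetical helper, not part of the API):

```python
def summarize_result(result: dict) -> str:
    """One-line summary of a scrape result, tolerating null fields."""
    if result.get("errors"):
        # content/metadata/ads_txt may be null when errors is non-empty
        return f"{result['url']}: failed ({', '.join(result['errors'])})"
    content = result.get("content") or {}
    title = content.get("title", "(no title)")
    words = content.get("word_count", 0)
    return f"{result['url']}: {title} ({words} words)"

failed = {"url": "https://example.com/a", "errors": ["E002"], "content": None}
print(summarize_result(failed))  # https://example.com/a: failed (E002)
```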
content object
| Field | Type | Description |
|---|---|---|
| title | string | The page title |
| full_text | string | Complete readable text, stripped of HTML |
| language | string | Detected language (ISO 639-1 code, e.g. "en", "es") |
| word_count | integer | Word count of the extracted text |
metadata object
| Field | Type | Description |
|---|---|---|
| http_status | integer | HTTP status code returned by the origin server |
| content_type | string | Content-Type header value |
| canonical_url | string \| null | Canonical URL specified by the page's `<link rel="canonical">` tag |
| og_title | string \| null | OpenGraph title |
| og_description | string \| null | OpenGraph description |
| og_image | string \| null | OpenGraph image URL |
| og_type | string \| null | OpenGraph type (e.g. "article", "website") |
| author | string \| null | Author name from meta tags |
| meta_description | string \| null | HTML meta description |
| robots | string \| null | Robots meta directive (e.g. "index, follow") |
ads_txt object
Ads.txt data is resolved at the domain level — all URLs on the same domain share the same ads.txt entries.
| Field | Type | Description |
|---|---|---|
| status | string | "found" if the domain has an ads.txt file, "not_found" if it doesn't, "error" if it couldn't be fetched |
| seller_count | integer | Total number of seller entries in the ads.txt file |
| entries | array[object] | Authorized seller entries (see below) |
Each entry in the entries array:
| Field | Type | Description |
|---|---|---|
| domain | string | The advertising system domain (e.g. "google.com") |
| account_id | string | Publisher's account ID in that system |
| relationship | string | "DIRECT" or "RESELLER" |
| certification_id | string \| null | TAG certification authority ID, if present |
The scrape job object
When you submit URLs for scraping, you receive a job object that tracks the overall progress.
{
  "id": 701,
  "status": "complete",
  "url_count": 3,
  "created_date": "2026-02-20T11:00:00Z",
  "processed_date": "2026-02-20T11:03:45Z",
  "results": [...]
}
| Field | Type | Description |
|---|---|---|
| id | integer | Unique identifier for this scrape job |
| status | string | pending → processing → complete or failed |
| url_count | integer | Number of URLs submitted in the job |
| created_date | string (ISO 8601) | When the job was submitted |
| processed_date | string (ISO 8601) \| null | When scraping completed. null until complete. |
| results | array[object] \| null | Array of scrape result objects. Present only when status is complete. |
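Since complete and failed are the only terminal statuses, a poller can stop as soon as either appears. A minimal check (is_done is illustrative, not an API field):

```python
TERMINAL_STATUSES = {"complete", "failed"}

def is_done(job: dict) -> bool:
    """True once a scrape job has reached a terminal status."""
    return job["status"] in TERMINAL_STATUSES

print(is_done({"status": "processing"}))  # False
print(is_done({"status": "failed"}))      # True
```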
Search for a URL
Check whether a URL has already been scraped and retrieve its data immediately. This endpoint is synchronous — no polling required.
POST https://api.clsfy.me/v1/scraping/search
Also available as GET /v1/scraping/search?url={url}.
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Required | The URL to look up. Accepts url or u as the field name. |
Return codes
| Status | Meaning |
|---|---|
| 200 OK | URL found — scrape result returned in the response body |
| 202 Accepted | URL is valid and has been submitted for scraping, but results are not ready yet |
| 404 Not Found | URL has not been scraped — submit it via a scrape job |
| 422 Unprocessable Content | URL is malformed or invalid |
- curl
- Python
curl -X POST "https://api.clsfy.me/v1/scraping/search" \
  -H "X-API-Key: <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.thegamer.com/skyrim-every-daedric-prince-realm/"}'
import requests

response = requests.post(
    "https://api.clsfy.me/v1/scraping/search",
    headers={
        "X-API-Key": "<your_api_key>",
        "Content-Type": "application/json",
    },
    json={"url": "https://www.thegamer.com/skyrim-every-daedric-prince-realm/"},
)

if response.status_code == 200:
    result = response.json()
    print(result["content"]["title"])
elif response.status_code == 202:
    print("URL accepted — results not ready yet, retry shortly")
elif response.status_code == 404:
    print("URL not in index — submit a scrape job")
Response (200 OK)
Returns the full scrape result object:
{
  "url": "https://www.thegamer.com/skyrim-every-daedric-prince-realm/",
  "tld": "thegamer.com",
  "errors": [],
  "published_date": "2026-02-10T08:02:11Z",
  "last_updated_date": "2026-02-11T18:10:13Z",
  "content": {
    "title": "Skyrim: Every Daedric Prince and Their Realm of Oblivion",
    "full_text": "While the Nine Divines and several other deities...",
    "language": "en",
    "word_count": 2450
  },
  "metadata": {
    "http_status": 200,
    "content_type": "text/html; charset=utf-8",
    "canonical_url": "https://www.thegamer.com/skyrim-every-daedric-prince-realm/",
    "og_title": "Skyrim: Every Daedric Prince and Their Realm of Oblivion",
    "og_description": "A guide to every Daedric Prince and their plane of Oblivion.",
    "og_image": "https://static0.thegamerimages.com/wordpress/wp-content/uploads/daedric-princes.jpg",
    "og_type": "article",
    "author": "John Smith",
    "meta_description": "A guide to every Daedric Prince and their plane of Oblivion.",
    "robots": "index, follow"
  },
  "ads_txt": {
    "status": "found",
    "seller_count": 48,
    "entries": [
      {
        "domain": "google.com",
        "account_id": "pub-1234567890123456",
        "relationship": "DIRECT",
        "certification_id": "f08c47fec0942fa0"
      }
    ]
  }
}
Submit a scrape job
Submit one or more URLs for scraping. Returns a job object immediately; poll Get job results for completion.
POST https://api.clsfy.me/v1/scraping/jobs
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| urls | array[string] | Required | URLs to scrape. Maximum 1,000,000 per request. |
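For lists larger than the 1,000,000-URL cap, the batch must be split client-side and submitted as multiple jobs. A sketch of the chunking step (chunk_urls is a hypothetical helper; the demo uses a reduced cap so the batching is visible):

```python
MAX_URLS_PER_JOB = 1_000_000  # per-request cap from the parameter table

def chunk_urls(urls: list[str], size: int = MAX_URLS_PER_JOB) -> list[list[str]]:
    """Split a URL list into batches no larger than the per-job cap."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

demo = [f"https://example.com/{i}" for i in range(7)]
print([len(b) for b in chunk_urls(demo, size=3)])  # [3, 3, 1]
```

Each batch would then be submitted to POST /v1/scraping/jobs as its own job, keeping the concurrent-job limit in mind.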
Request
- curl
- Python
curl -X POST "https://api.clsfy.me/v1/scraping/jobs" \
  -H "X-API-Key: <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://www.nytimes.com/2026/01/15/technology/ai-chips.html",
      "https://www.bbc.com/sport/football/premier-league",
      "https://www.allrecipes.com/recipe/24074/almond-crescent-cookies/"
    ]
  }'
import requests

response = requests.post(
    "https://api.clsfy.me/v1/scraping/jobs",
    headers={
        "X-API-Key": "<your_api_key>",
        "Content-Type": "application/json",
    },
    json={
        "urls": [
            "https://www.nytimes.com/2026/01/15/technology/ai-chips.html",
            "https://www.bbc.com/sport/football/premier-league",
            "https://www.allrecipes.com/recipe/24074/almond-crescent-cookies/",
        ],
    },
)
job = response.json()
print(job["id"])  # e.g. 701
Response
{
  "id": 701,
  "status": "pending",
  "url_count": 3,
  "created_date": "2026-02-20T11:00:00Z",
  "processed_date": null,
  "results": null
}
Get job results
Retrieve a scrape job by ID. Use this to poll for completion and retrieve results.
GET https://api.clsfy.me/v1/scraping/jobs/{id}
| Parameter | Type | Description |
|---|---|---|
| id | integer (path) | The job ID returned at creation |
| limit | integer (query) | Results per page. Default 1000, max 10000. |
| offset | integer (query) | Results to skip. Default 0. |
- curl
- Python
curl "https://api.clsfy.me/v1/scraping/jobs/701" \
  -H "X-API-Key: <your_api_key>"
import requests

response = requests.get(
    "https://api.clsfy.me/v1/scraping/jobs/701",
    headers={"X-API-Key": "<your_api_key>"},
)
job = response.json()
if job["status"] == "complete":
    for result in job["results"]:
        print(result["url"], result["content"]["title"])
Completed response
{
  "id": 701,
  "status": "complete",
  "url_count": 3,
  "created_date": "2026-02-20T11:00:00Z",
  "processed_date": "2026-02-20T11:03:45Z",
  "results": [
    {
      "url": "https://www.nytimes.com/2026/01/15/technology/ai-chips.html",
      "tld": "nytimes.com",
      "errors": [],
      "published_date": "2026-01-15T06:00:00Z",
      "last_updated_date": "2026-01-15T14:22:00Z",
      "content": {
        "title": "The Race to Build the Next Generation of AI Chips",
        "full_text": "The semiconductor industry is entering a new phase...",
        "language": "en",
        "word_count": 1830
      },
      "metadata": {
        "http_status": 200,
        "content_type": "text/html; charset=utf-8",
        "canonical_url": "https://www.nytimes.com/2026/01/15/technology/ai-chips.html",
        "og_title": "The Race to Build the Next Generation of AI Chips",
        "og_description": "Companies are investing billions in custom silicon...",
        "og_image": "https://static01.nyt.com/images/2026/01/15/ai-chips.jpg",
        "og_type": "article",
        "author": "Jane Doe",
        "meta_description": "Companies are investing billions in custom silicon...",
        "robots": "index, follow"
      },
      "ads_txt": {
        "status": "found",
        "seller_count": 112,
        "entries": [
          {
            "domain": "google.com",
            "account_id": "pub-4853049608284556",
            "relationship": "DIRECT",
            "certification_id": "f08c47fec0942fa0"
          },
          {
            "domain": "appnexus.com",
            "account_id": "7459",
            "relationship": "RESELLER",
            "certification_id": null
          }
        ]
      }
    }
  ]
}
Polling for completion
Poll GET /v1/scraping/jobs/{id} until status is "complete".
- Python
import requests
import time

def wait_for_scrape(job_id: int, api_key: str, poll_interval: int = 30):
    """Poll until a scrape job is ready. Returns the completed job object."""
    url = f"https://api.clsfy.me/v1/scraping/jobs/{job_id}"
    headers = {"X-API-Key": api_key}
    while True:
        job = requests.get(url, headers=headers).json()
        if job["status"] == "complete":
            print(f"Done — {job['url_count']} URLs scraped")
            return job
        elif job["status"] == "failed":
            raise RuntimeError(f"Scrape job {job_id} failed.")
        print(f"Status: {job['status']} — retrying in {poll_interval}s")
        time.sleep(poll_interval)
Paginating large results
For jobs with many URLs, use limit and offset to page through results:
- Python
import requests

def get_all_scrape_results(job_id: int, api_key: str, page_size: int = 5000):
    """Retrieve all scrape results, paginating automatically."""
    url = f"https://api.clsfy.me/v1/scraping/jobs/{job_id}"
    headers = {"X-API-Key": api_key}
    all_results = []
    offset = 0
    while True:
        response = requests.get(
            url,
            headers=headers,
            params={"limit": page_size, "offset": offset},
        )
        job = response.json()
        results = job.get("results") or []
        if not results:
            break
        all_results.extend(results)
        offset += len(results)
        if len(results) < page_size:
            break
    return all_results
Error responses
When a request fails, the API returns a JSON object with an error code and a human-readable message:
{
  "error": "not_found",
  "message": "Scrape job with ID 999 not found"
}
HTTP status codes
| Status | Meaning |
|---|---|
| 200 OK | Success (or URL found, for the search endpoint) |
| 201 Created | Scrape job created |
| 202 Accepted | URL submitted for scraping (search endpoint) |
| 400 Bad Request | Invalid or missing parameters |
| 401 Unauthorized | Missing or invalid API key |
| 404 Not Found | Resource not found (or URL not yet scraped, for the search endpoint) |
| 422 Unprocessable Content | Validation error (e.g. malformed URL) |
| 429 Too Many Requests | Rate limit exceeded |
Rate limits
| Limit | Value |
|---|---|
| API requests | 120 per minute |
| URLs per job | 1,000,000 |
| Concurrent jobs | 5 |
Exceeding rate limits returns 429 Too Many Requests. Retry after the Retry-After header value.
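One way to pace retries is to prefer the Retry-After value and fall back to exponential backoff when the header is absent. A sketch (retry_delay is a hypothetical client-side helper, not part of the API):

```python
def retry_delay(headers: dict, attempt: int, base: float = 1.0) -> float:
    """Seconds to wait after a 429: Retry-After if present, else exponential backoff."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        return float(retry_after)
    return base * (2 ** attempt)

print(retry_delay({"Retry-After": "30"}, attempt=0))  # 30.0
print(retry_delay({}, attempt=3))                     # 8.0
```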
Error codes
The errors array on each result URL contains error codes indicating issues encountered during scraping. An empty array means the URL was processed successfully. For classification-specific errors, see the Classification API — Error tags.
| Code | Name | Description |
|---|---|---|
| E001 | URL Not Fully Processed | The URL did not reach an end state — processing is still in progress |
| E002 | URL Not Accessible | The URL could not be accessed (HTTP error, timeout, blocked, etc.) |
| E003 | Content Not Parseable | The page was fetched but its content could not be properly extracted |
| E004 | Insufficient Content | The page has too little content to process accurately |
| E005 | Unable to Classify Language | The text is not sufficient to determine the language |
| E006 | Language Not Supported | The content is in a language Classify does not currently support |
| E007 | Classification Failure | Text was extracted but could not be sufficiently classified |
| E008 | Incomplete Extraction | Classification succeeded but content extraction was incomplete |
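To triage a finished job, the per-URL errors arrays can be tallied across all results. A sketch (error_summary is a hypothetical helper; successes are counted under "ok"):

```python
from collections import Counter

def error_summary(results: list[dict]) -> Counter:
    """Tally error codes across a job's results."""
    counts = Counter()
    for result in results:
        if result["errors"]:
            counts.update(result["errors"])
        else:
            counts["ok"] += 1
    return counts

results = [
    {"url": "https://example.com/a", "errors": []},
    {"url": "https://example.com/b", "errors": ["E002"]},
    {"url": "https://example.com/c", "errors": ["E002", "E004"]},
]
print(dict(error_summary(results)))  # {'ok': 1, 'E002': 2, 'E004': 1}
```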