
Scraping

Classify maintains a continuously updated index of scraped web content. Submit URLs to retrieve full page text, metadata, and ads.txt supply chain data. Already-indexed URLs are returned instantly; new URLs are scraped asynchronously.

Two ways to use scraping:

  • Search — look up a single URL. If it's already indexed, you get the result immediately.
  • Jobs — submit a batch of up to 1,000,000 URLs for scraping. Poll for results.
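A typical integration tries the synchronous search first and falls back to a job submission when the URL is not yet indexed. A minimal sketch using the two endpoints documented below (the function name `lookup_or_submit` is illustrative, and error handling is elided):

```python
import requests

API = "https://api.clsfy.me/v1/scraping"

def lookup_or_submit(url: str, api_key: str):
    """Return a scrape result if the URL is indexed, else submit a job.

    Returns a tuple (kind, payload): kind is "result" for an already
    indexed URL and "job" for a freshly submitted scrape job.
    """
    headers = {"X-API-Key": api_key}
    resp = requests.post(f"{API}/search", headers=headers, json={"url": url})

    if resp.status_code == 200:
        return "result", resp.json()  # already indexed, data returned inline

    # 202 (queued) or 404 (never seen): submit a one-URL job and poll later
    job = requests.post(f"{API}/jobs", headers=headers, json={"urls": [url]})
    return "job", job.json()
```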

The scrape result object

Each scraped URL produces a result with the following structure:

{
  "url": "https://www.thegamer.com/skyrim-every-daedric-prince-realm/",
  "tld": "thegamer.com",
  "errors": [],
  "published_date": "2026-02-10T08:02:11Z",
  "last_updated_date": "2026-02-11T18:10:13Z",
  "content": {
    "title": "Skyrim: Every Daedric Prince and Their Realm of Oblivion",
    "full_text": "While the Nine Divines and several other deities...",
    "language": "en",
    "word_count": 2450
  },
  "metadata": {
    "http_status": 200,
    "content_type": "text/html; charset=utf-8",
    "canonical_url": "https://www.thegamer.com/skyrim-every-daedric-prince-realm/",
    "og_title": "Skyrim: Every Daedric Prince and Their Realm of Oblivion",
    "og_description": "A guide to every Daedric Prince and their plane of Oblivion.",
    "og_image": "https://static0.thegamerimages.com/wordpress/wp-content/uploads/daedric-princes.jpg",
    "og_type": "article",
    "author": "John Smith",
    "meta_description": "A guide to every Daedric Prince and their plane of Oblivion.",
    "robots": "index, follow"
  },
  "ads_txt": {
    "status": "found",
    "seller_count": 48,
    "entries": [
      {
        "domain": "google.com",
        "account_id": "pub-1234567890123456",
        "relationship": "DIRECT",
        "certification_id": "f08c47fec0942fa0"
      },
      {
        "domain": "rubiconproject.com",
        "account_id": "12345",
        "relationship": "RESELLER",
        "certification_id": "0bfd66d529a55807"
      }
    ]
  }
}

Top-level fields

| Field | Type | Description |
| --- | --- | --- |
| url | string | The URL that was scraped |
| tld | string | Domain extracted from the URL (e.g. "thegamer.com") |
| errors | array[string] | Error codes for this URL (empty if successful). See Error codes. |
| published_date | string (ISO 8601) \| null | When the content was originally published (if detectable) |
| last_updated_date | string (ISO 8601) \| null | When the content was last modified (if detectable) |
| content | object \| null | Extracted page content. null if scraping failed. |
| metadata | object \| null | HTTP and HTML metadata. null if scraping failed. |
| ads_txt | object \| null | Ads.txt supply chain data for the domain. null if not found. |

content object

| Field | Type | Description |
| --- | --- | --- |
| title | string | The page title |
| full_text | string | Complete readable text, stripped of HTML |
| language | string | Detected language (ISO 639-1 code, e.g. "en", "es") |
| word_count | integer | Word count of the extracted text |

metadata object

| Field | Type | Description |
| --- | --- | --- |
| http_status | integer | HTTP status code returned by the origin server |
| content_type | string | Content-Type header value |
| canonical_url | string \| null | Canonical URL specified by the page (`<link rel="canonical">`) |
| og_title | string \| null | OpenGraph title |
| og_description | string \| null | OpenGraph description |
| og_image | string \| null | OpenGraph image URL |
| og_type | string \| null | OpenGraph type (e.g. "article", "website") |
| author | string \| null | Author name from meta tags |
| meta_description | string \| null | HTML meta description |
| robots | string \| null | Robots meta directive (e.g. "index, follow") |

ads_txt object

Ads.txt data is resolved at the domain level — all URLs on the same domain share the same ads.txt entries.

| Field | Type | Description |
| --- | --- | --- |
| status | string | "found" if the domain has an ads.txt file, "not_found" if it doesn't, "error" if it couldn't be fetched |
| seller_count | integer | Total number of seller entries in the ads.txt file |
| entries | array[object] | Authorized seller entries (see below) |

Each entry in the entries array:

| Field | Type | Description |
| --- | --- | --- |
| domain | string | The advertising system domain (e.g. "google.com") |
| account_id | string | Publisher's account ID in that system |
| relationship | string | "DIRECT" or "RESELLER" |
| certification_id | string \| null | TAG certification authority ID, if present |
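Because entries is a plain list of objects, downstream checks are straightforward. As an example, a small helper (hypothetical, operating on the ads_txt shape shown above) that groups a result's seller domains by relationship:

```python
def sellers_by_relationship(ads_txt: dict) -> dict:
    """Group ads.txt seller domains by DIRECT vs RESELLER relationship."""
    groups = {"DIRECT": [], "RESELLER": []}
    for entry in ads_txt.get("entries", []):
        groups.setdefault(entry["relationship"], []).append(entry["domain"])
    return groups
```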

The scrape job object

When you submit URLs for scraping, you receive a job object that tracks the overall progress.

{
  "id": 701,
  "status": "complete",
  "url_count": 3,
  "created_date": "2026-02-20T11:00:00Z",
  "processed_date": "2026-02-20T11:03:45Z",
  "results": [...]
}

| Field | Type | Description |
| --- | --- | --- |
| id | integer | Unique identifier for this scrape job |
| status | string | pending, processing, complete, or failed |
| url_count | integer | Number of URLs submitted in the job |
| created_date | string (ISO 8601) | When the job was submitted |
| processed_date | string (ISO 8601) \| null | When scraping completed. null until complete. |
| results | array[object] \| null | Array of scrape result objects. Present only when status is complete. |

Search for a URL

Check whether a URL has already been scraped and retrieve its data immediately. This endpoint is synchronous — no polling required.

POST https://api.clsfy.me/v1/scraping/search

Also available as GET /v1/scraping/search?url={url}.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Required | The URL to look up. Accepts url or u as the field name. |

Return codes

| Status | Meaning |
| --- | --- |
| 200 OK | URL found — scrape result returned in the response body |
| 202 Accepted | URL is valid and has been submitted for scraping, but results are not ready yet |
| 404 Not Found | URL has not been scraped — submit it via a scrape job |
| 422 Unprocessable Content | URL is malformed or invalid |

curl -X POST "https://api.clsfy.me/v1/scraping/search" \
  -H "X-API-Key: <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.thegamer.com/skyrim-every-daedric-prince-realm/"}'

Response (200 OK)

Returns the full scrape result object:

{
  "url": "https://www.thegamer.com/skyrim-every-daedric-prince-realm/",
  "tld": "thegamer.com",
  "errors": [],
  "published_date": "2026-02-10T08:02:11Z",
  "last_updated_date": "2026-02-11T18:10:13Z",
  "content": {
    "title": "Skyrim: Every Daedric Prince and Their Realm of Oblivion",
    "full_text": "While the Nine Divines and several other deities...",
    "language": "en",
    "word_count": 2450
  },
  "metadata": {
    "http_status": 200,
    "content_type": "text/html; charset=utf-8",
    "canonical_url": "https://www.thegamer.com/skyrim-every-daedric-prince-realm/",
    "og_title": "Skyrim: Every Daedric Prince and Their Realm of Oblivion",
    "og_description": "A guide to every Daedric Prince and their plane of Oblivion.",
    "og_image": "https://static0.thegamerimages.com/wordpress/wp-content/uploads/daedric-princes.jpg",
    "og_type": "article",
    "author": "John Smith",
    "meta_description": "A guide to every Daedric Prince and their plane of Oblivion.",
    "robots": "index, follow"
  },
  "ads_txt": {
    "status": "found",
    "seller_count": 48,
    "entries": [
      {
        "domain": "google.com",
        "account_id": "pub-1234567890123456",
        "relationship": "DIRECT",
        "certification_id": "f08c47fec0942fa0"
      }
    ]
  }
}
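The three non-error outcomes of a search (200, 202, 404) each call for different handling. One way to dispatch on them (a sketch assuming a requests-style response object; `handle_search_response` is an illustrative name):

```python
def handle_search_response(resp):
    """Map a search response to (state, payload).

    state is one of "found", "queued", or "unknown" — "unknown" means
    the URL should be submitted via a scrape job.
    """
    if resp.status_code == 200:
        return "found", resp.json()  # full scrape result in the body
    if resp.status_code == 202:
        return "queued", None        # scraping in progress; retry later
    if resp.status_code == 404:
        return "unknown", None       # never seen; submit a scrape job
    resp.raise_for_status()          # other 4xx/5xx: surface the error
```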

Submit a scrape job

Submit one or more URLs for scraping. Returns a job ID immediately. Poll for results.

POST https://api.clsfy.me/v1/scraping/jobs

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| urls | array[string] | Required | URLs to scrape. Maximum 1,000,000 per request. |

Request

curl -X POST "https://api.clsfy.me/v1/scraping/jobs" \
  -H "X-API-Key: <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://www.nytimes.com/2026/01/15/technology/ai-chips.html",
      "https://www.bbc.com/sport/football/premier-league",
      "https://www.allrecipes.com/recipe/24074/almond-crescent-cookies/"
    ]
  }'

Response

{
  "id": 701,
  "status": "pending",
  "url_count": 3,
  "created_date": "2026-02-20T11:00:00Z",
  "processed_date": null,
  "results": null
}

Get job results

Retrieve a scrape job by ID. Use this to poll for completion and retrieve results.

GET https://api.clsfy.me/v1/scraping/jobs/{id}

| Parameter | Type | Description |
| --- | --- | --- |
| id | integer | The job ID returned at creation |
| limit | integer (query) | Results per page. Default 1000, max 10000. |
| offset | integer (query) | Results to skip. Default 0. |

curl "https://api.clsfy.me/v1/scraping/jobs/701" \
  -H "X-API-Key: <your_api_key>"

Completed response

{
  "id": 701,
  "status": "complete",
  "url_count": 3,
  "created_date": "2026-02-20T11:00:00Z",
  "processed_date": "2026-02-20T11:03:45Z",
  "results": [
    {
      "url": "https://www.nytimes.com/2026/01/15/technology/ai-chips.html",
      "tld": "nytimes.com",
      "errors": [],
      "published_date": "2026-01-15T06:00:00Z",
      "last_updated_date": "2026-01-15T14:22:00Z",
      "content": {
        "title": "The Race to Build the Next Generation of AI Chips",
        "full_text": "The semiconductor industry is entering a new phase...",
        "language": "en",
        "word_count": 1830
      },
      "metadata": {
        "http_status": 200,
        "content_type": "text/html; charset=utf-8",
        "canonical_url": "https://www.nytimes.com/2026/01/15/technology/ai-chips.html",
        "og_title": "The Race to Build the Next Generation of AI Chips",
        "og_description": "Companies are investing billions in custom silicon...",
        "og_image": "https://static01.nyt.com/images/2026/01/15/ai-chips.jpg",
        "og_type": "article",
        "author": "Jane Doe",
        "meta_description": "Companies are investing billions in custom silicon...",
        "robots": "index, follow"
      },
      "ads_txt": {
        "status": "found",
        "seller_count": 112,
        "entries": [
          {
            "domain": "google.com",
            "account_id": "pub-4853049608284556",
            "relationship": "DIRECT",
            "certification_id": "f08c47fec0942fa0"
          },
          {
            "domain": "appnexus.com",
            "account_id": "7459",
            "relationship": "RESELLER",
            "certification_id": null
          }
        ]
      }
    }
  ]
}

Polling for completion

Poll GET /v1/scraping/jobs/{id} until status is "complete".

import requests
import time

def wait_for_scrape(job_id: int, api_key: str, poll_interval: int = 30):
    """Poll until a scrape job is ready. Returns the completed job object."""
    url = f"https://api.clsfy.me/v1/scraping/jobs/{job_id}"
    headers = {"X-API-Key": api_key}

    while True:
        job = requests.get(url, headers=headers).json()

        if job["status"] == "complete":
            print(f"Done — {job['url_count']} URLs scraped")
            return job
        elif job["status"] == "failed":
            raise RuntimeError(f"Scrape job {job_id} failed.")

        print(f"Status: {job['status']} — retrying in {poll_interval}s")
        time.sleep(poll_interval)

Paginating large results

For jobs with many URLs, use limit and offset to page through results:

import requests

def get_all_scrape_results(job_id: int, api_key: str, page_size: int = 5000):
    """Retrieve all scrape results, paginating automatically."""
    url = f"https://api.clsfy.me/v1/scraping/jobs/{job_id}"
    headers = {"X-API-Key": api_key}
    all_results = []
    offset = 0

    while True:
        response = requests.get(
            url,
            headers=headers,
            params={"limit": page_size, "offset": offset},
        )
        job = response.json()
        results = job.get("results", [])

        if not results:
            break

        all_results.extend(results)
        offset += len(results)

        if len(results) < page_size:
            break

    return all_results

Error responses

When a request fails, the API returns a JSON object with an error code and a human-readable message:

{
  "error": "not_found",
  "message": "Scrape job with ID 999 not found"
}

HTTP status codes

| Status | Meaning |
| --- | --- |
| 200 OK | Success (or URL found, for search endpoint) |
| 201 Created | Scrape job created |
| 202 Accepted | URL submitted for scraping (search endpoint) |
| 400 Bad Request | Invalid or missing parameters |
| 401 Unauthorized | Missing or invalid API key |
| 404 Not Found | Resource not found (or URL not yet scraped, for search endpoint) |
| 422 Unprocessable Content | Validation error (e.g. malformed URL) |
| 429 Too Many Requests | Rate limit exceeded |

Rate limits

| Limit | Value |
| --- | --- |
| API requests | 120 per minute |
| URLs per job | 1,000,000 |
| Concurrent jobs | 5 |

Exceeding rate limits returns 429 Too Many Requests. Retry after the Retry-After header value.
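A simple retry wrapper that honors the Retry-After header might look like this (a sketch; it assumes the header carries a delay in seconds, and `get_with_retry` is an illustrative name):

```python
import time
import requests

def get_with_retry(url: str, headers: dict, max_attempts: int = 5):
    """GET a URL, sleeping for Retry-After seconds on each 429."""
    for attempt in range(max_attempts):
        resp = requests.get(url, headers=headers)
        if resp.status_code != 429:
            return resp
        delay = int(resp.headers.get("Retry-After", 1))
        time.sleep(delay)
    raise RuntimeError(f"Still rate-limited after {max_attempts} attempts")
```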


Error codes

The errors array on each result URL contains error codes indicating issues encountered during scraping. An empty array means the URL was processed successfully. For classification-specific errors, see the Classification API — Error tags.

| Code | Name | Description |
| --- | --- | --- |
| E001 | URL Not Fully Processed | The URL did not reach an end state — processing is still in progress |
| E002 | URL Not Accessible | The URL could not be accessed (HTTP error, timeout, blocked, etc.) |
| E003 | Content Not Parseable | The page was fetched but its content could not be properly extracted |
| E004 | Insufficient Content | The page has too little content to process accurately |
| E005 | Unable to Classify Language | The text is not sufficient to determine the language |
| E006 | Language Not Supported | The content is in a language Classify does not currently support |
| E007 | Classification Failure | Text was extracted but could not be sufficiently classified |
| E008 | Incomplete Extraction | Classification succeeded but content extraction was incomplete |
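
Because errors is empty on success, splitting a job's results into clean URLs and per-code failure buckets takes only a few lines (a sketch over the result shape above; `partition_results` is an illustrative name):

```python
def partition_results(results: list[dict]):
    """Split scrape results into (successful, failures grouped by error code)."""
    ok = [r for r in results if not r["errors"]]
    failed: dict[str, list[str]] = {}
    for r in results:
        for code in r["errors"]:
            failed.setdefault(code, []).append(r["url"])
    return ok, failed
```

A URL can carry several error codes at once (e.g. E002 alongside E004), so the same URL may appear in more than one bucket.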