Scraping
Classify maintains a continuously updated index of scraped web content. Submit URLs to retrieve full page text, metadata, and ads.txt supply chain data. Already-indexed URLs are returned instantly; new URLs are scraped asynchronously.
Two ways to use scraping:
- Search — look up a single URL. If it's already indexed, you get the result immediately.
- Jobs — submit a batch of up to 1,000,000 URLs for scraping. Poll for results.
The scrape result object
Each scraped URL produces a result with the following structure:
{
  "url": "https://www.thegamer.com/skyrim-every-daedric-prince-realm/",
  "tld": "thegamer.com",
  "errors": [],
  "published_date": "2026-02-10T08:02:11Z",
  "last_updated_date": "2026-02-11T18:10:13Z",
  "content": {
    "title": "Skyrim: Every Daedric Prince and Their Realm of Oblivion",
    "full_text": "While the Nine Divines and several other deities...",
    "language": "en",
    "word_count": 2450
  },
  "metadata": {
    "http_status": 200,
    "content_type": "text/html; charset=utf-8",
    "canonical_url": "https://www.thegamer.com/skyrim-every-daedric-prince-realm/",
    "og_title": "Skyrim: Every Daedric Prince and Their Realm of Oblivion",
    "og_description": "A guide to every Daedric Prince and their plane of Oblivion.",
    "og_image": "https://static0.thegamerimages.com/wordpress/wp-content/uploads/daedric-princes.jpg",
    "og_type": "article",
    "author": "John Smith",
    "meta_description": "A guide to every Daedric Prince and their plane of Oblivion.",
    "robots": "index, follow"
  },
  "ads_txt": {
    "status": "found",
    "seller_count": 48,
    "entries": [
      {
        "domain": "google.com",
        "account_id": "pub-1234567890123456",
        "relationship": "DIRECT",
        "certification_id": "f08c47fec0942fa0"
      },
      {
        "domain": "rubiconproject.com",
        "account_id": "12345",
        "relationship": "RESELLER",
        "certification_id": "0bfd66d529a55807"
      }
    ]
  }
}
Top-level fields
| Field | Type | Description |
|---|---|---|
| url | string | The URL that was scraped |
| tld | string | Registrable domain extracted from the URL (e.g. "thegamer.com") |
| errors | array[string] | Error codes for this URL (empty if successful). See Error codes. |
| published_date | string (ISO 8601) \| null | When the content was originally published (if detectable) |
| last_updated_date | string (ISO 8601) \| null | When the content was last modified (if detectable) |
| content | object \| null | Extracted page content. null if scraping failed. |
| metadata | object \| null | HTTP and HTML metadata. null if scraping failed. |
| ads_txt | object \| null | Ads.txt supply chain data for the domain. null if not found. |
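Since content, metadata, and ads_txt are all nullable, client code should check errors before dereferencing nested fields. A minimal sketch of defensive access (summarize_result is a hypothetical helper, not part of the API):

```python
def summarize_result(result: dict) -> str:
    """One-line summary of a scrape result, tolerating null fields."""
    if result.get("errors"):
        # content/metadata/ads_txt may be null when errors is non-empty
        return f"{result['url']}: failed ({', '.join(result['errors'])})"
    content = result.get("content") or {}
    title = content.get("title", "(no title)")
    words = content.get("word_count", 0)
    return f"{result['url']}: {title} ({words} words)"

failed = {"url": "https://example.com/a", "errors": ["E002"], "content": None}
print(summarize_result(failed))  # https://example.com/a: failed (E002)
```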
content object
| Field | Type | Description |
|---|---|---|
| title | string | The page title |
| full_text | string | Complete readable text, stripped of HTML |
| language | string | Detected language (ISO 639-1 code, e.g. "en", "es") |
| word_count | integer | Word count of the extracted text |
metadata object
| Field | Type | Description |
|---|---|---|
| http_status | integer | HTTP status code returned by the origin server |
| content_type | string | Content-Type header value |
| canonical_url | string \| null | Canonical URL specified by the page's `<link rel="canonical">` tag |
| og_title | string \| null | OpenGraph title |
| og_description | string \| null | OpenGraph description |
| og_image | string \| null | OpenGraph image URL |
| og_type | string \| null | OpenGraph type (e.g. "article", "website") |
| author | string \| null | Author name from meta tags |
| meta_description | string \| null | HTML meta description |
| robots | string \| null | Robots meta directive (e.g. "index, follow") |
ads_txt object
Ads.txt data is resolved at the domain level — all URLs on the same domain share the same ads.txt entries.
| Field | Type | Description |
|---|---|---|
| status | string | "found" if the domain has an ads.txt file, "not_found" if it doesn't, "error" if it couldn't be fetched |
| seller_count | integer | Total number of seller entries in the ads.txt file |
| entries | array[object] | Authorized seller entries (see below) |
Each entry in the entries array:
| Field | Type | Description |
|---|---|---|
| domain | string | The advertising system domain (e.g. "google.com") |
| account_id | string | Publisher's account ID in that system |
| relationship | string | "DIRECT" or "RESELLER" |
| certification_id | string \| null | TAG certification authority ID, if present |
The scrape job object
When you submit URLs for scraping, you receive a job object that tracks the overall progress.
{
  "id": 701,
  "status": "complete",
  "url_count": 3,
  "created_date": "2026-02-20T11:00:00Z",
  "processed_date": "2026-02-20T11:03:45Z",
  "results": [...]
}
| Field | Type | Description |
|---|---|---|
| id | integer | Unique identifier for this scrape job |
| status | string | pending → processing → complete or failed |
| url_count | integer | Number of URLs submitted in the job |
| created_date | string (ISO 8601) | When the job was submitted |
| processed_date | string (ISO 8601) \| null | When scraping completed. null until complete. |
| results | array[object] \| null | Array of scrape result objects. Present only when status is complete. |
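Since complete and failed are the only terminal statuses, a poller can stop as soon as either appears. A minimal check (is_done is illustrative, not an API field):

```python
TERMINAL_STATUSES = {"complete", "failed"}

def is_done(job: dict) -> bool:
    """True once a scrape job has reached a terminal status."""
    return job["status"] in TERMINAL_STATUSES

print(is_done({"status": "processing"}))  # False
print(is_done({"status": "failed"}))      # True
```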
Search for a URL
Check whether a URL has already been scraped and retrieve its data immediately. This endpoint is synchronous — no polling required.
POST https://api.clsfy.me/v1/scraping/search
Also available as GET /v1/scraping/search?url={url}.
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Required | The URL to look up. Accepts url or u as the field name. |
Return codes
| Status | Meaning |
|---|---|
| 200 OK | URL found — scrape result returned in the response body |
| 202 Accepted | URL is valid and has been submitted for scraping, but results are not ready yet |
| 404 Not Found | URL has not been scraped — submit it via a scrape job |
| 422 Unprocessable Content | URL is malformed or invalid |
- curl
- Python
curl -X POST "https://api.clsfy.me/v1/scraping/search" \
  -H "X-API-Key: <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.thegamer.com/skyrim-every-daedric-prince-realm/"}'
import requests

response = requests.post(
    "https://api.clsfy.me/v1/scraping/search",
    headers={
        "X-API-Key": "<your_api_key>",
        "Content-Type": "application/json",
    },
    json={"url": "https://www.thegamer.com/skyrim-every-daedric-prince-realm/"},
)

if response.status_code == 200:
    result = response.json()
    print(result["content"]["title"])
elif response.status_code == 202:
    print("URL accepted — results not ready yet, retry shortly")
elif response.status_code == 404:
    print("URL not in index — submit a scrape job")
Response (200 OK)
Returns the full scrape result object:
{
  "url": "https://www.thegamer.com/skyrim-every-daedric-prince-realm/",
  "tld": "thegamer.com",
  "errors": [],
  "published_date": "2026-02-10T08:02:11Z",
  "last_updated_date": "2026-02-11T18:10:13Z",
  "content": {
    "title": "Skyrim: Every Daedric Prince and Their Realm of Oblivion",
    "full_text": "While the Nine Divines and several other deities...",
    "language": "en",
    "word_count": 2450
  },
  "metadata": {
    "http_status": 200,
    "content_type": "text/html; charset=utf-8",
    "canonical_url": "https://www.thegamer.com/skyrim-every-daedric-prince-realm/",
    "og_title": "Skyrim: Every Daedric Prince and Their Realm of Oblivion",
    "og_description": "A guide to every Daedric Prince and their plane of Oblivion.",
    "og_image": "https://static0.thegamerimages.com/wordpress/wp-content/uploads/daedric-princes.jpg",
    "og_type": "article",
    "author": "John Smith",
    "meta_description": "A guide to every Daedric Prince and their plane of Oblivion.",
    "robots": "index, follow"
  },
  "ads_txt": {
    "status": "found",
    "seller_count": 48,
    "entries": [
      {
        "domain": "google.com",
        "account_id": "pub-1234567890123456",
        "relationship": "DIRECT",
        "certification_id": "f08c47fec0942fa0"
      }
    ]
  }
}
Submit a scrape job
Submit one or more URLs for scraping. Returns a job object immediately; poll Get job results for completion.
POST https://api.clsfy.me/v1/scraping/jobs
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| urls | array[string] | Required | URLs to scrape. Maximum 1,000,000 per request. |
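For lists larger than the 1,000,000-URL cap, the batch must be split client-side and submitted as multiple jobs. A sketch of the chunking step (chunk_urls is a hypothetical helper; the demo uses a reduced cap so the batching is visible):

```python
MAX_URLS_PER_JOB = 1_000_000  # per-request cap from the parameter table

def chunk_urls(urls: list[str], size: int = MAX_URLS_PER_JOB) -> list[list[str]]:
    """Split a URL list into batches no larger than the per-job cap."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

demo = [f"https://example.com/{i}" for i in range(7)]
print([len(b) for b in chunk_urls(demo, size=3)])  # [3, 3, 1]
```

Each batch would then be submitted to POST /v1/scraping/jobs as its own job, keeping the concurrent-job limit in mind.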
Request
- curl
- Python
curl -X POST "https://api.clsfy.me/v1/scraping/jobs" \
  -H "X-API-Key: <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://www.nytimes.com/2026/01/15/technology/ai-chips.html",
      "https://www.bbc.com/sport/football/premier-league",
      "https://www.allrecipes.com/recipe/24074/almond-crescent-cookies/"
    ]
  }'
import requests

response = requests.post(
    "https://api.clsfy.me/v1/scraping/jobs",
    headers={
        "X-API-Key": "<your_api_key>",
        "Content-Type": "application/json",
    },
    json={
        "urls": [
            "https://www.nytimes.com/2026/01/15/technology/ai-chips.html",
            "https://www.bbc.com/sport/football/premier-league",
            "https://www.allrecipes.com/recipe/24074/almond-crescent-cookies/",
        ],
    },
)
job = response.json()
print(job["id"])  # e.g. 701
Response
{
  "id": 701,
  "status": "pending",
  "url_count": 3,
  "created_date": "2026-02-20T11:00:00Z",
  "processed_date": null,
  "results": null
}
Get job results
Retrieve a scrape job by ID. Use this to poll for completion and retrieve results.
GET https://api.clsfy.me/v1/scraping/jobs/{id}
| Parameter | Type | Description |
|---|---|---|
| id | integer (path) | The job ID returned at creation |
| limit | integer (query) | Results per page. Default 1000, max 10000. |
| offset | integer (query) | Results to skip. Default 0. |
- curl
- Python
curl "https://api.clsfy.me/v1/scraping/jobs/701" \
  -H "X-API-Key: <your_api_key>"
import requests

response = requests.get(
    "https://api.clsfy.me/v1/scraping/jobs/701",
    headers={"X-API-Key": "<your_api_key>"},
)
job = response.json()
if job["status"] == "complete":
    for result in job["results"]:
        print(result["url"], result["content"]["title"])
Completed response
{
  "id": 701,
  "status": "complete",
  "url_count": 3,
  "created_date": "2026-02-20T11:00:00Z",
  "processed_date": "2026-02-20T11:03:45Z",
  "results": [
    {
      "url": "https://www.nytimes.com/2026/01/15/technology/ai-chips.html",
      "tld": "nytimes.com",
      "errors": [],
      "published_date": "2026-01-15T06:00:00Z",
      "last_updated_date": "2026-01-15T14:22:00Z",
      "content": {
        "title": "The Race to Build the Next Generation of AI Chips",
        "full_text": "The semiconductor industry is entering a new phase...",
        "language": "en",
        "word_count": 1830
      },
      "metadata": {
        "http_status": 200,
        "content_type": "text/html; charset=utf-8",
        "canonical_url": "https://www.nytimes.com/2026/01/15/technology/ai-chips.html",
        "og_title": "The Race to Build the Next Generation of AI Chips",
        "og_description": "Companies are investing billions in custom silicon...",
        "og_image": "https://static01.nyt.com/images/2026/01/15/ai-chips.jpg",
        "og_type": "article",
        "author": "Jane Doe",
        "meta_description": "Companies are investing billions in custom silicon...",
        "robots": "index, follow"
      },
      "ads_txt": {
        "status": "found",
        "seller_count": 112,
        "entries": [
          {
            "domain": "google.com",
            "account_id": "pub-4853049608284556",
            "relationship": "DIRECT",
            "certification_id": "f08c47fec0942fa0"
          },
          {
            "domain": "appnexus.com",
            "account_id": "7459",
            "relationship": "RESELLER",
            "certification_id": null
          }
        ]
      }
    }
  ]
}
Polling for completion
Poll GET /v1/scraping/jobs/{id} until status is "complete".
- Python
import requests
import time

def wait_for_scrape(job_id: int, api_key: str, poll_interval: int = 30):
    """Poll until a scrape job is ready. Returns the completed job object."""
    url = f"https://api.clsfy.me/v1/scraping/jobs/{job_id}"
    headers = {"X-API-Key": api_key}
    while True:
        job = requests.get(url, headers=headers).json()
        if job["status"] == "complete":
            print(f"Done — {job['url_count']} URLs scraped")
            return job
        elif job["status"] == "failed":
            raise RuntimeError(f"Scrape job {job_id} failed.")
        print(f"Status: {job['status']} — retrying in {poll_interval}s")
        time.sleep(poll_interval)
Paginating large results
For jobs with many URLs, use limit and offset to page through results:
- Python
import requests

def get_all_scrape_results(job_id: int, api_key: str, page_size: int = 5000):
    """Retrieve all scrape results, paginating automatically."""
    url = f"https://api.clsfy.me/v1/scraping/jobs/{job_id}"
    headers = {"X-API-Key": api_key}
    all_results = []
    offset = 0
    while True:
        response = requests.get(
            url,
            headers=headers,
            params={"limit": page_size, "offset": offset},
        )
        job = response.json()
        results = job.get("results") or []
        if not results:
            break
        all_results.extend(results)
        offset += len(results)
        if len(results) < page_size:
            break
    return all_results
Error responses
When a request fails, the API returns a JSON object with an error code and a human-readable message:
{
  "error": "not_found",
  "message": "Scrape job with ID 999 not found"
}
HTTP status codes
| Status | Meaning |
|---|---|
| 200 OK | Success (or URL found, for the search endpoint) |
| 201 Created | Scrape job created |
| 202 Accepted | URL submitted for scraping (search endpoint) |
| 400 Bad Request | Invalid or missing parameters |
| 401 Unauthorized | Missing or invalid API key |
| 404 Not Found | Resource not found (or URL not yet scraped, for the search endpoint) |
| 422 Unprocessable Content | Validation error (e.g. malformed URL) |
| 429 Too Many Requests | Rate limit exceeded |
Rate limits
| Limit | Value |
|---|---|
| API requests | 120 per minute |
| URLs per job | 1,000,000 |
| Concurrent jobs | 5 |
Exceeding rate limits returns 429 Too Many Requests. Retry after the Retry-After header value.
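One way to pace retries is to prefer the Retry-After value and fall back to exponential backoff when the header is absent. A sketch (retry_delay is a hypothetical client-side helper, not part of the API):

```python
def retry_delay(headers: dict, attempt: int, base: float = 1.0) -> float:
    """Seconds to wait after a 429: Retry-After if present, else exponential backoff."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        return float(retry_after)
    return base * (2 ** attempt)

print(retry_delay({"Retry-After": "30"}, attempt=0))  # 30.0
print(retry_delay({}, attempt=3))                     # 8.0
```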
Error codes
The errors array on each result URL contains error codes indicating issues encountered during scraping. An empty array means the URL was processed successfully. For classification-specific errors, see the Classification API — Error tags.
| Code | Name | Description |
|---|---|---|
| E001 | URL Not Fully Processed | The URL did not reach an end state — processing is still in progress |
| E002 | URL Not Accessible | The URL could not be accessed (HTTP error, timeout, blocked, etc.) |
| E003 | Content Not Parseable | The page was fetched but its content could not be properly extracted |
| E004 | Insufficient Content | The page has too little content to process accurately |
| E005 | Unable to Classify Language | The text is not sufficient to determine the language |
| E006 | Language Not Supported | The content is in a language Classify does not currently support |
| E007 | Classification Failure | Text was extracted but could not be sufficiently classified |
| E008 | Incomplete Extraction | Classification succeeded but content extraction was incomplete |
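To triage a finished job, the per-URL errors arrays can be tallied across all results. A sketch (error_summary is a hypothetical helper; successes are counted under "ok"):

```python
from collections import Counter

def error_summary(results: list[dict]) -> Counter:
    """Tally error codes across a job's results."""
    counts = Counter()
    for result in results:
        if result["errors"]:
            counts.update(result["errors"])
        else:
            counts["ok"] += 1
    return counts

results = [
    {"url": "https://example.com/a", "errors": []},
    {"url": "https://example.com/b", "errors": ["E002"]},
    {"url": "https://example.com/c", "errors": ["E002", "E004"]},
]
print(dict(error_summary(results)))  # {'ok': 1, 'E002': 2, 'E004': 1}
```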