Welcome to the updated documentation for the Robots.txt Bulk Validator API! This new version offers improved performance, enhanced features, and better scalability. Before we dive into the details, let's review what a robots.txt
file is and why it's crucial for web development and SEO.
A robots.txt
file is a standard used by websites to communicate with web crawlers and other automated agents. It indicates which areas of a website these agents are allowed or disallowed from accessing and interacting with. These rules help website administrators control the behavior of web crawlers and ensure that they index the site in a way that is conducive to the site's SEO strategy.
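For illustration, a small robots.txt file might look like the following; the directives and paths here are purely hypothetical:

User-agent: *
Disallow: /admin/
Allow: /admin/public/

User-agent: Googlebot
Disallow: /staging/

Sitemap: https://example.com/sitemap.xml

Each User-agent group declares which paths matching crawlers may or may not fetch; these are the rules a validator checks each URL against.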
The Robots.txt Bulk Validator API is designed to facilitate the process of validating multiple URLs against the robots.txt file of a given host. It helps you ascertain whether certain URLs are allowed or disallowed by the robots.txt file of a host, thus automating and simplifying a task that can be quite cumbersome when done manually, especially when dealing with a large number of URLs.
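For comparison, this is roughly what the manual, one-URL-at-a-time check looks like using Python's standard-library urllib.robotparser (the URLs are illustrative); the API performs this kind of check in bulk, for up to 1,000 URLs per request:

from urllib.robotparser import RobotFileParser

# Fetch and parse the robots.txt file, then test URLs one at a time.
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

print(parser.can_fetch("Googlebot", "https://example.com/page1"))  # True if crawlable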
The new version of the API can be accessed via the following endpoint:
https://robots-txt-parser-api.estevecastells.workers.dev/
Note: The previous endpoint (https://tools.estevecastells.com/api/bulk-robots-txt/v1) is now considered legacy. While it will continue to work for the foreseeable future, we recommend migrating to the new endpoint for improved performance and features.
This is a POST endpoint. If you're less familiar with POST endpoints, you can find more information here.
Send a POST request with a JSON payload in the following format:
{
  "robots_txt_url": "https://example.com/robots.txt",
  "links": [
    "https://example.com/page1",
    "https://example.com/page2",
    ...
  ],
  "user_agent": "CustomBot/1.0"
}
- robots_txt_url: The URL of the robots.txt file you want to check against.
- links: An array of up to 1,000 URLs you want to validate.
- user_agent: (Optional) The User-Agent to use for parsing the robots.txt rules. If not provided, defaults to 'Googlebot'.
The API returns a JSON response with the following structure:
{
  "results": {
    "https://example.com/page1": true,
    "https://example.com/page2": false,
    ...
  },
  "metadata": {
    "robots_txt_url": "https://example.com/robots.txt",
    "parsed_size": 1024,
    "truncated": false,
    "rules_count": 10,
    "sitemaps_count": 2,
    "user_agent": "CustomBot/1.0"
  },
  "errors": [
    {
      "link": "https://example.com/invalid",
      "error": "Invalid URL format"
    }
  ]
}
- results: A dictionary where keys are the input URLs and values are boolean:
  - true: URL is not blocked by robots.txt (can be crawled)
  - false: URL is blocked by robots.txt (cannot be crawled)
- metadata: Information about the parsing process, including the User-Agent used
- errors: Any errors encountered during processing of specific links
cURL:
curl -X POST "https://robots-txt-parser-api.estevecastells.workers.dev/" \
  -H "Content-Type: application/json" \
  -d '{"robots_txt_url": "https://example.com/robots.txt", "links": ["https://example.com/page1", "https://example.com/page2"], "user_agent": "CustomBot/1.0"}'
Python:
import requests
url = "https://robots-txt-parser-api.estevecastells.workers.dev/"
payload = {
    "robots_txt_url": "https://example.com/robots.txt",
    "links": ["https://example.com/page1", "https://example.com/page2"],
    "user_agent": "CustomBot/1.0"
}
response = requests.post(url, json=payload)
print(response.json())
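Building on the example above, the sketch below shows one way to consume the response and to split larger link sets into batches that respect the documented 1,000-URL limit per request. The helper function and variable names are illustrative, not part of the API:

import requests

API_URL = "https://robots-txt-parser-api.estevecastells.workers.dev/"
BATCH_SIZE = 1000  # the API accepts up to 1,000 links per request

def validate_links(robots_txt_url, links, user_agent="Googlebot"):
    """Validate links in batches and merge the per-batch responses."""
    allowed, blocked, errors = [], [], []
    for start in range(0, len(links), BATCH_SIZE):
        payload = {
            "robots_txt_url": robots_txt_url,
            "links": links[start:start + BATCH_SIZE],
            "user_agent": user_agent,
        }
        response = requests.post(API_URL, json=payload)
        response.raise_for_status()
        data = response.json()
        # "results" maps each URL to true (crawlable) or false (blocked)
        for url, is_allowed in data.get("results", {}).items():
            (allowed if is_allowed else blocked).append(url)
        errors.extend(data.get("errors", []))
    return allowed, blocked, errors

allowed, blocked, errors = validate_links(
    "https://example.com/robots.txt",
    ["https://example.com/page1", "https://example.com/page2"],
    user_agent="CustomBot/1.0",
)
print(f"Crawlable: {len(allowed)}, blocked: {len(blocked)}, errors: {len(errors)}")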
We're constantly working to enhance the API, and further improvements are planned. Rest assured, all improvements will be made backwards-compatible to prevent breaking changes.
For any questions, suggestions, or to report issues, please reach out through the contact information provided on the website.