Robots.txt Bulk Validator API

Welcome to the updated documentation for the Robots.txt Bulk Validator API! This new version offers improved performance, enhanced features, and better scalability. Before we dive into the details, let's review what a robots.txt file is and why it's crucial for web development and SEO.

What is a robots.txt file?

A robots.txt file is a standard used by websites to communicate with web crawlers and other automated agents. It indicates which areas of a website these agents are allowed to access and which they are disallowed from crawling. These rules help website administrators control crawler behavior and ensure that the site is crawled and indexed in a way that supports its SEO strategy.

About the API

The Robots.txt Bulk Validator API is designed to validate multiple URLs against the robots.txt file of a given host. It tells you whether each URL is allowed or disallowed by that file, automating a task that is cumbersome to do manually, especially when dealing with a large number of URLs.

New API Endpoint

The new version of the API can be accessed via the following endpoint:

https://robots-txt-parser-api.estevecastells.workers.dev/

Note: The previous endpoint (https://tools.estevecastells.com/api/bulk-robots-txt/v1) is now considered legacy. While it will continue to work for the foreseeable future, we recommend migrating to the new endpoint for improved performance and features.

Features and Benefits

  1. Bulk Checking: Validate up to 1,000 URLs in a single API call, saving time and resources.
  2. Enhanced Performance: Responses are typically returned in less than 500 milliseconds, allowing for seamless integration into your workflows.
  3. Improved Scalability: The new version can handle a higher volume of requests with better efficiency.
  4. User-Agent Customization: You can now specify a custom User-Agent for parsing the robots.txt rules. If not specified, it defaults to Googlebot (see the minimal payload sketch after this list).
  5. Comprehensive Parsing: Support for wildcard matching, URL decoding, and handling of nonstandard directives.
  6. Sophisticated Path Normalization: Proper handling of relative paths, including './' and '../'.
  7. Size Limit Handling: Automatically handles robots.txt files up to 500 KiB, truncating larger files.
  8. Detailed Metadata: Provides information about the parsing process, including file size, truncation status, rule counts, and the User-Agent used for parsing.
  9. Enhanced Error Handling: Detailed error messages and per-link error reporting for easier debugging.
  10. Ease of Use: Straightforward request payload structure, user-friendly even for those less acquainted with POST endpoints.
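
For example, this minimal payload omits the user_agent field entirely, so the rules would be evaluated as Googlebot. The domain and paths are placeholders, not real pages:

{
  "robots_txt_url": "https://example.com/robots.txt",
  "links": [
    "https://example.com/page1",
    "https://example.com/page2"
  ]
}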

How to Use the API

This is a POST endpoint, meaning requests are sent with a JSON body rather than as plain GET requests. If you're less familiar with POST endpoints, the examples further down walk through complete requests.

Request Format

Send a POST request with a JSON payload in the following format:

{ "robots_txt_url": "https://example.com/robots.txt", "links": [ "https://example.com/page1", "https://example.com/page2", ... ], "user_agent": "CustomBot/1.0" }

Response Format

The API returns a JSON response with the following structure:

{ "results": { "https://example.com/page1": true, "https://example.com/page2": false, ... }, "metadata": { "robots_txt_url": "https://example.com/robots.txt", "parsed_size": 1024, "truncated": false, "rules_count": 10, "sitemaps_count": 2, "user_agent": "CustomBot/1.0" }, "errors": [ { "link": "https://example.com/invalid", "error": "Invalid URL format" } ] }

Limitations and Considerations

  1. Single Domain per Request: You can only check URLs against a single robots.txt file per request. For multiple domains, send separate requests.
  2. Maximum URLs: Up to 1,000 URLs can be checked in a single request. Larger sets should be split into batches (see the sketch after this list).
  3. File Size Limit: The API processes up to 500 KiB of a robots.txt file. Larger files are truncated.
  4. Rate Limiting: While there's currently no strict rate limit, we recommend adding a 1-second delay between requests to ensure platform stability.
  5. Blocking: Some domains might block the API's requests if they consider it a bad bot. This is rare but possible.
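
To work within the URL limit and the recommended delay, larger jobs can be split into batches. The sketch below is illustrative rather than an official client: the check_in_batches helper and its parameters are assumptions built on the request format documented above.

import time
import requests

API_URL = "https://robots-txt-parser-api.estevecastells.workers.dev/"

def check_in_batches(robots_txt_url, links, user_agent="Googlebot", batch_size=1000):
    """Check a large list of URLs in chunks of up to 1,000, pausing 1 second between requests."""
    results = {}
    for i in range(0, len(links), batch_size):
        payload = {
            "robots_txt_url": robots_txt_url,
            "links": links[i:i + batch_size],
            "user_agent": user_agent,
        }
        response = requests.post(API_URL, json=payload)
        response.raise_for_status()
        results.update(response.json()["results"])
        time.sleep(1)  # recommended delay between requests
    return results

The 1-second pause mirrors the delay recommended above; reduce the batch size if you prefer smaller individual requests.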

Examples

cURL:

curl -X POST "https://robots-txt-parser-api.estevecastells.workers.dev/" \
  -H "Content-Type: application/json" \
  -d '{"robots_txt_url": "https://example.com/robots.txt", "links": ["https://example.com/page1", "https://example.com/page2"], "user_agent": "CustomBot/1.0"}'

Python:

import requests

url = "https://robots-txt-parser-api.estevecastells.workers.dev/"
payload = {
    "robots_txt_url": "https://example.com/robots.txt",
    "links": ["https://example.com/page1", "https://example.com/page2"],
    "user_agent": "CustomBot/1.0"
}

response = requests.post(url, json=payload)
print(response.json())

Future Improvements

We're constantly working to enhance the API, and further improvements are planned.

Rest assured, all improvements will be made in a backward-compatible way to prevent breaking changes.

For any questions, suggestions, or to report issues, please reach out through the contact information provided on the website.