⚙️ Implementation

```kotlin
val parser = RobotsTxtParser()

// Check if crawling is allowed
val canCrawl = parser.isAllowed(
    url = "https://example.com/admin/logs",
    userAgent = "CognotikBot"
)

// Respect site-specific delays
val delay = parser.getCrawlDelay("https://example.com")
```
👁️ Parser Logic Output
```text
Target URL: /admin/logs
✖ Access Disallowed

# robots.txt match found
User-agent: *
Disallow: /admin/   <-- Matched

Cache Status: HIT (TTL 10m)
```
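
For context, a robots.txt file along the following lines would produce the decision shown above. The Allow, Crawl-delay, and Sitemap lines are illustrative additions (not taken from the output); they are included only to show the directives the parser handles:

```text
# https://example.com/robots.txt (illustrative example)
User-agent: *
Allow: /admin/public/
Disallow: /admin/
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
```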

Core Methods

| Method | Parameters | Returns | Description |
|---|---|---|---|
| `isAllowed` | `url: String`, `userAgent: String` | `Boolean` | Checks if the given URL is accessible based on Allow/Disallow rules. |
| `getCrawlDelay` | `url: String` | `Long?` | Returns the crawl delay in milliseconds if specified for the domain. |

Data Structure: RobotsTxt

| Field | Type | Description |
|---|---|---|
| `disallowedPaths` | `List<String>` | Patterns that the bot is forbidden from visiting. |
| `allowedPaths` | `List<String>` | Patterns that take precedence over disallow rules. |
| `crawlDelay` | `Long?` | Parsed delay in milliseconds. |
| `sitemaps` | `List<String>` | Discovered XML sitemap URLs. |
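
A minimal sketch of how this structure might be declared in Kotlin; the field names and types follow the table above, but the data-class form and the default values are assumptions, not the library's actual declaration:

```kotlin
// Sketch of the parsed robots.txt representation described above.
// The data-class form and the defaults are illustrative assumptions.
data class RobotsTxt(
    val disallowedPaths: List<String> = emptyList(), // patterns the bot must not visit
    val allowedPaths: List<String> = emptyList(),    // patterns that override disallow rules
    val crawlDelay: Long? = null,                    // delay in milliseconds, if declared
    val sitemaps: List<String> = emptyList()         // discovered sitemap URLs
)
```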

Execution Lifecycle

  1. Normalization: Extracts the base URL (scheme, host, port) from the target URI.
  2. Cache Lookup: Checks a ConcurrentHashMap for an existing RobotsTxt object.
  3. Fetch (on Miss): Executes a GET request to /robots.txt with the User-Agent CognotikBot/1.0.
  4. Parsing:
     - Filters directives by User-Agent (matches * or cognotik).
     - Converts robots.txt patterns (*, $) into standard Regex.
     - Extracts Crawl-delay and Sitemap links.
  5. Matching: Evaluates Allow rules first (they take precedence), then Disallow; a sketch of the normalization, caching, pattern conversion, and matching steps follows this list.
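
The following is a minimal sketch of steps 1, 2, 4, and 5, assuming a ConcurrentHashMap cache keyed by base URL, a ~10-minute TTL (suggested by the Cache Status line in the sample output), and the Allow-before-Disallow precedence described above. The helper names and the exact regex translation are illustrative assumptions, not the library's implementation; it reuses the RobotsTxt fields from the table above.

```kotlin
import java.net.URI
import java.util.concurrent.ConcurrentHashMap

// Illustrative sketch only: names and TTL handling are assumptions.
private val cache = ConcurrentHashMap<String, Pair<RobotsTxt, Long>>() // baseUrl -> (rules, fetchedAtMillis)
private const val CACHE_TTL_MS = 10 * 60 * 1000L                      // assumed ~10 minute TTL

// Step 1: Normalization - reduce a target URL to its scheme://host[:port] base.
fun baseUrlOf(url: String): String {
    val uri = URI(url)
    val port = if (uri.port == -1) "" else ":${uri.port}"
    return "${uri.scheme}://${uri.host}$port"
}

// Step 2: Cache lookup with a TTL check; a miss would trigger the fetch in step 3.
fun cachedRules(baseUrl: String): RobotsTxt? {
    val (rules, fetchedAt) = cache[baseUrl] ?: return null
    return if (System.currentTimeMillis() - fetchedAt < CACHE_TTL_MS) rules else null
}

// Step 4: Convert a robots.txt path pattern to a Regex.
// '*' matches any run of characters; a trailing '$' anchors the end of the path.
fun patternToRegex(pattern: String): Regex {
    val anchored = pattern.endsWith("$")
    val body = (if (anchored) pattern.dropLast(1) else pattern)
        .split("*")
        .joinToString(".*") { Regex.escape(it) }
    return Regex("^" + body + if (anchored) "$" else ".*")
}

// Step 5: Matching - Allow rules take precedence, then Disallow; unmatched paths are allowed.
fun isPathAllowed(path: String, rules: RobotsTxt): Boolean = when {
    rules.allowedPaths.any { patternToRegex(it).matches(path) } -> true
    rules.disallowedPaths.any { patternToRegex(it).matches(path) } -> false
    else -> true
}
```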

Boilerplate


```kotlin
import com.simiacryptus.cognotik.plan.tools.online.RobotsTxtParser

class WebScraper(private val parser: RobotsTxtParser = RobotsTxtParser()) {
    fun scrape(url: String) {
        if (parser.isAllowed(url)) {
            // Honor the site's declared crawl delay; fall back to no delay if none is set.
            val delay = parser.getCrawlDelay(url) ?: 0L
            Thread.sleep(delay)
            // Proceed with fetch...
        }
    }
}
```
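
As a usage illustration (the target URL is arbitrary), the boilerplate above could be driven like this:

```kotlin
fun main() {
    val scraper = WebScraper()
    // The parser fetches and caches robots.txt for the host on first use.
    scraper.scrape("https://example.com/articles/index.html")
}
```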