⚙️ Implementation

```kotlin
val parser = RobotsTxtParser()

// Check if crawling is allowed
val canCrawl = parser.isAllowed(
    url = "https://example.com/admin/logs",
    userAgent = "CognotikBot"
)

// Respect site-specific delays
val delay = parser.getCrawlDelay("https://example.com")
```
👁️ Parser Logic Output
```text
Target URL: /admin/logs
✖ Access Disallowed

# robots.txt match found
User-agent: *
Disallow: /admin/   <-- Matched

Cache Status: HIT (TTL 10m)
```
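
For context, a robots.txt file along the following lines would produce the decision shown above. The Allow, Crawl-delay, and Sitemap lines are illustrative additions (not taken from the output); they are included only to show the directives the parser handles:

```text
# https://example.com/robots.txt (illustrative example)
User-agent: *
Allow: /admin/public/
Disallow: /admin/
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
```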

Core Methods

| Method | Parameters | Returns | Description |
|---|---|---|---|
| `isAllowed` | `url: String`, `userAgent: String` | `Boolean` | Checks if the given URL is accessible based on Allow/Disallow rules. |
| `getCrawlDelay` | `url: String` | `Long?` | Returns the crawl delay in milliseconds if specified for the domain. |

Data Structure: RobotsTxt

| Field | Type | Description |
|---|---|---|
| `disallowedPaths` | `List<String>` | Patterns that the bot is forbidden from visiting. |
| `allowedPaths` | `List<String>` | Patterns that take precedence over disallow rules. |
| `crawlDelay` | `Long?` | Parsed delay in milliseconds. |
| `sitemaps` | `List<String>` | Discovered XML sitemap URLs. |
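
A minimal sketch of how this structure might be declared in Kotlin; the field names and types follow the table above, but the data-class form and the default values are assumptions, not the library's actual declaration:

```kotlin
// Sketch of the parsed robots.txt representation described above.
// The data-class form and the defaults are illustrative assumptions.
data class RobotsTxt(
    val disallowedPaths: List<String> = emptyList(), // patterns the bot must not visit
    val allowedPaths: List<String> = emptyList(),    // patterns that override disallow rules
    val crawlDelay: Long? = null,                    // delay in milliseconds, if declared
    val sitemaps: List<String> = emptyList()         // discovered sitemap URLs
)
```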

Execution Lifecycle

  1. Normalization: Extracts the base URL (scheme, host, port) from the target URI.
  2. Cache Lookup: Checks a ConcurrentHashMap for an existing RobotsTxt object.
  3. Fetch (on Miss): Executes a GET request to /robots.txt with the User-Agent CognotikBot/1.0.
  4. Parsing:
     - Filters directives by User-Agent (matches * or cognotik).
     - Converts robots.txt patterns (*, $) into standard Regex.
     - Extracts Crawl-delay and Sitemap links.
  5. Matching: Evaluates Allow rules first (they take precedence), then Disallow; a sketch of the normalization, caching, pattern conversion, and matching steps follows this list.
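
The following is a minimal sketch of steps 1, 2, 4, and 5, assuming a ConcurrentHashMap cache keyed by base URL, a ~10-minute TTL (suggested by the Cache Status line in the sample output), and the Allow-before-Disallow precedence described above. The helper names and the exact regex translation are illustrative assumptions, not the library's implementation; it reuses the RobotsTxt fields from the table above.

```kotlin
import java.net.URI
import java.util.concurrent.ConcurrentHashMap

// Illustrative sketch only: names and TTL handling are assumptions.
private val cache = ConcurrentHashMap<String, Pair<RobotsTxt, Long>>() // baseUrl -> (rules, fetchedAtMillis)
private const val CACHE_TTL_MS = 10 * 60 * 1000L                      // assumed ~10 minute TTL

// Step 1: Normalization - reduce a target URL to its scheme://host[:port] base.
fun baseUrlOf(url: String): String {
    val uri = URI(url)
    val port = if (uri.port == -1) "" else ":${uri.port}"
    return "${uri.scheme}://${uri.host}$port"
}

// Step 2: Cache lookup with a TTL check; a miss would trigger the fetch in step 3.
fun cachedRules(baseUrl: String): RobotsTxt? {
    val (rules, fetchedAt) = cache[baseUrl] ?: return null
    return if (System.currentTimeMillis() - fetchedAt < CACHE_TTL_MS) rules else null
}

// Step 4: Convert a robots.txt path pattern to a Regex.
// '*' matches any run of characters; a trailing '$' anchors the end of the path.
fun patternToRegex(pattern: String): Regex {
    val anchored = pattern.endsWith("$")
    val body = (if (anchored) pattern.dropLast(1) else pattern)
        .split("*")
        .joinToString(".*") { Regex.escape(it) }
    return Regex("^" + body + if (anchored) "$" else ".*")
}

// Step 5: Matching - Allow rules take precedence, then Disallow; unmatched paths are allowed.
fun isPathAllowed(path: String, rules: RobotsTxt): Boolean = when {
    rules.allowedPaths.any { patternToRegex(it).matches(path) } -> true
    rules.disallowedPaths.any { patternToRegex(it).matches(path) } -> false
    else -> true
}
```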

Boilerplate


```kotlin
import com.simiacryptus.cognotik.plan.tools.online.RobotsTxtParser

class WebScraper(private val parser: RobotsTxtParser = RobotsTxtParser()) {
    fun scrape(url: String) {
        if (parser.isAllowed(url)) {
            // Honor the site's declared crawl delay; fall back to no delay if none is set.
            val delay = parser.getCrawlDelay(url) ?: 0L
            Thread.sleep(delay)
            // Proceed with fetch...
        }
    }
}
```
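
As a usage illustration (the target URL is arbitrary), the boilerplate above could be driven like this:

```kotlin
fun main() {
    val scraper = WebScraper()
    // The parser fetches and caches robots.txt for the host on first use.
    scraper.scrape("https://example.com/articles/index.html")
}
```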