RobotsTxtParser
High-performance, thread-safe robots.txt compliance engine. Handles wildcard matching, crawl-delay directives, and sitemap discovery with built-in caching.
Side-Effect: Safe
Network: Required
Thread-Safe
⚙️ Implementation
```kotlin
val parser = RobotsTxtParser()

// Check if crawling is allowed
val canCrawl = parser.isAllowed(
    url = "https://example.com/admin/logs",
    userAgent = "CognotikBot"
)

// Respect site-specific delays
val delay = parser.getCrawlDelay("https://example.com")
```
👁️ Parser Logic Output
For the target URL `/admin/logs`, access is disallowed:

```text
# robots.txt match found
User-agent: *
Disallow: /admin/   <-- Matched
```

Cache Status: HIT (TTL 10m)
Core Methods
| Method | Parameters | Returns | Description |
|---|---|---|---|
| `isAllowed` | `url: String, userAgent: String` | `Boolean` | Checks whether the given URL is accessible under the site's Allow/Disallow rules. |
| `getCrawlDelay` | `url: String` | `Long?` | Returns the crawl delay in milliseconds if one is specified for the domain. |
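For instance, the same URL can be checked on behalf of different user agents. A brief illustrative example; `Googlebot` here is just an arbitrary second agent string:

```kotlin
val parser = RobotsTxtParser()
for (agent in listOf("CognotikBot", "Googlebot")) {
    val ok = parser.isAllowed(url = "https://example.com/admin/logs", userAgent = agent)
    println("$agent -> ${if (ok) "allowed" else "blocked"}")
}
```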
Data Structure: RobotsTxt
| Field | Type | Description |
|---|---|---|
| `disallowedPaths` | `List<String>` | Patterns that the bot is forbidden from visiting. |
| `allowedPaths` | `List<String>` | Patterns that take precedence over disallow rules. |
| `crawlDelay` | `Long?` | Parsed delay in milliseconds. |
| `sitemaps` | `List<String>` | Discovered XML sitemap URLs. |
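For reference, the fields above map naturally onto a Kotlin data class. This is a sketch only; the field names come from the table, while the defaults are assumptions:

```kotlin
// Sketch of the RobotsTxt structure from the table above; the real
// class in com.simiacryptus.cognotik may declare additional members.
data class RobotsTxt(
    val disallowedPaths: List<String> = emptyList(), // patterns the bot must not visit
    val allowedPaths: List<String> = emptyList(),    // patterns overriding disallow rules
    val crawlDelay: Long? = null,                    // delay in milliseconds, if specified
    val sitemaps: List<String> = emptyList()         // discovered XML sitemap URLs
)
```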
Execution Lifecycle
- Normalization: Extracts the base URL (scheme, host, port) from the target URI.
- Cache Lookup: Checks a `ConcurrentHashMap` for an existing `RobotsTxt` object (a TTL-cache sketch follows this list).
- Fetch (on Miss): Executes a GET request to `/robots.txt` with User-Agent `CognotikBot/1.0`.
- Parsing:
  - Filters directives by User-Agent (matches `*` or `cognotik`).
  - Converts robots.txt patterns (`*`, `$`) into standard `Regex`.
  - Extracts `Crawl-delay` and `Sitemap` links.
- Matching: Evaluates `Allow` rules first (precedence), then `Disallow` (see the second sketch below).
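The cache step can be illustrated with a minimal TTL cache built on `ConcurrentHashMap`. This is a sketch, not the library's implementation; `TtlCache`, `Entry`, and `getOrLoad` are invented names, and the 10-minute TTL mirrors the example output above:

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Minimal TTL cache sketch; not part of the Cognotik API.
class TtlCache<V>(private val ttlMs: Long) {
    private data class Entry<T>(val value: T, val storedAt: Long)
    private val map = ConcurrentHashMap<String, Entry<V>>()

    fun getOrLoad(key: String, load: (String) -> V): V {
        val now = System.currentTimeMillis()
        // compute() is atomic per key, so concurrent callers for the
        // same host do not trigger duplicate fetches.
        return map.compute(key) { k, entry ->
            if (entry != null && now - entry.storedAt < ttlMs) entry
            else Entry(load(k), now)
        }!!.value
    }
}

fun main() {
    val cache = TtlCache<String>(ttlMs = 10 * 60 * 1000L) // 10-minute TTL, as in the output above
    val robots = cache.getOrLoad("https://example.com") { base ->
        "stub robots.txt for $base" // a real parser would GET $base/robots.txt here
    }
    println(robots)
}
```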
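The pattern-conversion and precedence steps might look roughly like the following. Again a sketch under assumed semantics: `*` matches any character run, a trailing `$` anchors the end of the path, and `Allow` rules are checked before `Disallow`. The object and function names are hypothetical:

```kotlin
import java.util.regex.Pattern

// Hypothetical sketch of pattern conversion and rule precedence;
// the real Cognotik implementation may differ in detail.
object RobotsMatchSketch {

    // Convert a robots.txt path pattern to a Regex: '*' becomes '.*',
    // a trailing '$' anchors the match, everything else is literal.
    fun toRegex(pattern: String): Regex {
        val anchored = pattern.endsWith("$")
        val literal = if (anchored) pattern.dropLast(1) else pattern
        val body = literal.split("*").joinToString(".*") { Pattern.quote(it) }
        return Regex("^" + body + if (anchored) "$" else ".*")
    }

    // Allow rules take precedence, so a specific Allow can carve an
    // exception out of a broader Disallow.
    fun isAllowed(path: String, allowed: List<String>, disallowed: List<String>): Boolean {
        if (allowed.any { toRegex(it).matches(path) }) return true
        return disallowed.none { toRegex(it).matches(path) }
    }
}

fun main() {
    val disallow = listOf("/admin/", "/*.php$")
    val allow = listOf("/admin/public/")
    println(RobotsMatchSketch.isAllowed("/admin/logs", allow, disallow))       // false: Disallow /admin/
    println(RobotsMatchSketch.isAllowed("/admin/public/faq", allow, disallow)) // true: Allow overrides
    println(RobotsMatchSketch.isAllowed("/index.php", allow, disallow))        // false: /*.php$ matches
}
```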
Boilerplate
```kotlin
import com.simiacryptus.cognotik.plan.tools.online.RobotsTxtParser

class WebScraper(private val parser: RobotsTxtParser = RobotsTxtParser()) {
    fun scrape(url: String) {
        if (parser.isAllowed(url)) {
            val delay = parser.getCrawlDelay(url) ?: 0L
            Thread.sleep(delay)
            // Proceed with fetch...
        }
    }
}
```