# CrawlerAgentTask

An autonomous research agent that performs multi-threaded web crawling, content synthesis, and link discovery to answer complex queries with live data.

- **Category:** Online
- **Model:** GPT-4 preferred
- **Side-Effect:** Safe
**ExecutionConfig.json**

```json
{
  "search_query": "Latest LLM benchmarks 2024",
  "content_queries": "Extract performance metrics for Llama-3 and GPT-4o.",
  "direct_urls": ["https://arxiv.org/abs/..."],
  "max_pages_per_task": 15
}
```
## Session UI (TabbedDisplay)

The session renders results in four tabs: **Live Render**, **Final Output**, **Crawl Details**, and **Queue Details**. An example final output:

```markdown
# Research Summary
Based on 12 analyzed pages, Llama-3 70B shows a 15% improvement in MMLU over previous iterations...
- Source: HuggingFace Blog (Relevance: 98%)
- Source: Arxiv (Relevance: 92%)
```
## Execution Configuration

| Field | Type | Description |
|---|---|---|
| `search_query` | `String?` | The primary query used to seed the crawl via Google Search. |
| `direct_urls` | `List<String>?` | Explicit URLs to analyze alongside or instead of search results. |
| `content_queries` \* | `Any` | Detailed instructions for the LLM on how to transform page content into insights. |

\* Required field.
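Assuming the execution config deserializes into the `CrawlerTaskExecutionConfigData` class shown in the Kotlin example at the end of this page, the optionality rules in the table can be sketched as a data class (field types follow the table; this is an illustrative sketch, not the actual source):

```kotlin
// Hypothetical sketch of the execution-config payload. Field names mirror
// the table above; content_queries is left as Any because that is its
// documented type.
data class CrawlerTaskExecutionConfigData(
    val search_query: String? = null,      // optional: seeds the crawl via search
    val direct_urls: List<String>? = null, // optional: explicit seed URLs
    val content_queries: Any               // required: LLM extraction instructions
)
```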
## Type Configuration (Static)

| Field | Default | Description |
|---|---|---|
| `processing_strategy` | `DefaultSummarizer` | Strategy for page analysis (e.g., `FactChecking`, `JobMatching`). |
| `max_pages_per_task` | `30` | Hard limit on the number of unique pages to process. |
| `concurrent_page_processing` | `3` | Number of parallel threads for fetching and analysis. |
| `respect_robots_txt` | `true` | Ensures compliance with site crawling policies. |
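The interplay of `concurrent_page_processing` and `max_pages_per_task` can be sketched as a bounded worker pool. This is a minimal illustration, not the framework's implementation; `processPages` and `fetchAndAnalyze` are hypothetical names:

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicInteger

// Minimal sketch: a fixed thread pool sized by concurrent_page_processing
// works through at most max_pages_per_task URLs. fetchAndAnalyze stands in
// for the real fetch + LLM analysis pipeline.
fun processPages(
    urls: List<String>,
    concurrentPageProcessing: Int = 3,
    maxPagesPerTask: Int = 30,
    fetchAndAnalyze: (String) -> Unit,
): Int {
    val pool = Executors.newFixedThreadPool(concurrentPageProcessing)
    val processed = AtomicInteger(0)
    urls.take(maxPagesPerTask).forEach { url ->
        pool.submit {
            fetchAndAnalyze(url)
            processed.incrementAndGet()
        }
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
    return processed.get()
}
```

Capping the input with `take(maxPagesPerTask)` enforces the hard page limit before any work is submitted.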
## Lifecycle of a Crawler Task

1. **Seeding:** The agent uses the `SeedMethod` (Google Proxy or Direct URLs) to populate the initial `PriorityQueue`.
2. **Priority Discovery:** Discovered links are scored by `relevance_score`; links with higher relevance and lower depth are prioritized.
3. **Concurrent Execution:** The `ExecutorService` manages parallel `crawlPage` operations. Each page is fetched via `HttpClient` or Selenium.
4. **LLM Synthesis:** Each page is processed by the selected `PageProcessingStrategy`, which extracts data and discovers new links.
5. **Final Aggregation:** Once limits are reached or the queue is exhausted, a `ChatAgent` synthesizes all individual reports into a final comprehensive summary.
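The priority-discovery step above can be sketched with a comparator over a standard `PriorityQueue`: higher `relevance_score` first, ties broken by shallower depth. The `CrawlLink` type and `newCrawlQueue` helper are hypothetical names for illustration:

```kotlin
import java.util.PriorityQueue

// Hypothetical link record; fields mirror the scoring inputs described above.
data class CrawlLink(val url: String, val relevanceScore: Double, val depth: Int)

// Higher relevance dequeued first; ties broken by lower depth.
val crawlOrder = compareByDescending<CrawlLink> { it.relevanceScore }
    .thenBy { it.depth }

fun newCrawlQueue(seeds: List<CrawlLink>): PriorityQueue<CrawlLink> =
    PriorityQueue(crawlOrder).apply { addAll(seeds) }
```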
## Kotlin Implementation

**OrchestrationConfig.kt**

```kotlin
val crawlerTask = CrawlerAgentTask(
    orchestrationConfig = config,
    planTask = CrawlerTaskExecutionConfigData(
        search_query = "Cognotik AI Framework",
        content_queries = "Summarize core features and architecture."
    )
).apply {
    typeConfig = CrawlerTaskTypeConfig(
        max_pages_per_task = 10,
        processing_strategy = ProcessingStrategyType.DefaultSummarizer
    )
}
```