SchemaExtractionStrategy
Extracts structured data from web pages according to a user-defined JSON schema and aggregates results into a comprehensive dataset.
```json
{
  "task_type": "CrawlerAgentTask",
  "strategy": "SchemaExtractionStrategy",
  "schema_definition": "{\"type\": \"object\", \"properties\": {\"product_name\": {\"type\": \"string\"}, \"price\": {\"type\": \"number\"}}}",
  "extraction_instructions": "Extract all listed products",
  "min_confidence": 0.8,
  "aggregate_results": true,
  "deduplicate": true,
  "deduplication_keys": "product_name"
}
```
## Data Extraction Results
Confidence: 94% | Items Extracted: 1 | Total: 12

Sample from aggregated_data.json:

```json
[
  {
    "product_name": "Industrial Sensor A1",
    "price": 299.99
  },
  ...
]
```
Configuration Parameters
| Field | Type | Description |
|---|---|---|
| schema_definition* | String | JSON schema definition describing the data structure to extract. Default: "{}". |
| extraction_instructions | String | Natural language guidance for the LLM on what specific data to target. |
| min_confidence | Double | Threshold (0.0-1.0) for including results. Default: 0.7. |
| aggregate_results | Boolean | If true, combines all page results into a single JSON array. Default: true. |
| max_items_per_page | Integer | Maximum number of items to extract per page. Default: null (unlimited). |
| validate_schema | Boolean | Whether to validate extracted data against the schema. Default: true. |
| deduplicate | Boolean | Whether to remove duplicate items based on keys. Default: true. |
| deduplication_keys | String | Comma-separated field names used to identify unique items. |
Note: This strategy inherits from DefaultSummarizerStrategy and supports standard link crawling.
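For illustration, the parameters above could be modeled roughly as follows. This is a minimal sketch assuming a kotlinx.serialization setup; the actual SchemaExtractionConfig class in the codebase may declare these fields differently.

```kotlin
import kotlinx.serialization.Serializable
import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json

// Minimal sketch of a configuration holder mirroring the documented parameters.
// Defaults follow the table above; the real class may differ.
@Serializable
data class SchemaExtractionConfig(
    val schema_definition: String = "{}",
    val extraction_instructions: String = "",
    val min_confidence: Double = 0.7,
    val aggregate_results: Boolean = true,
    val max_items_per_page: Int? = null,
    val validate_schema: Boolean = true,
    val deduplicate: Boolean = true,
    val deduplication_keys: String = ""
) {
    // Convenience serializer used in the integration example below (assumed shape).
    fun toJson(): String = Json.encodeToString(this)
}
```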
Task Execution Lifecycle
1. Initialization & Validation
The strategy parses the SchemaExtractionConfig. It validates that the schema_definition is present and that min_confidence is within the valid range (0.0-1.0).
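A rough sketch of that check (the actual implementation is not shown in this document):

```kotlin
// Illustrative validation only; error handling in the real strategy may differ.
fun validateConfig(schemaDefinition: String?, minConfidence: Double) {
    require(!schemaDefinition.isNullOrBlank()) {
        "schema_definition is required and must describe the structure to extract"
    }
    require(minConfidence in 0.0..1.0) {
        "min_confidence must be between 0.0 and 1.0"
    }
}
```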
2. Page Processing (LLM Extraction)
For each crawled page, the content is sent to a ParsedAgent. The LLM is prompted with the schema and instructions to return a structured ExtractedData object containing the data, a confidence score, and validation notes. Content is capped at 50,000 characters.
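The per-page result could be shaped roughly like this; the exact fields of ExtractedData and the ParsedAgent call are not documented here, so the following is an assumption-labeled sketch:

```kotlin
// Hypothetical shape of the structured result returned by the LLM for one page.
data class ExtractedData(
    val data: List<Map<String, Any?>>, // items matching the user-defined schema
    val confidence: Double,            // 0.0-1.0 confidence score
    val validationNotes: String        // notes on schema conformance
)

// Sketch of the 50,000-character cap applied before prompting.
const val MAX_CONTENT_CHARS = 50_000

fun capPageContent(content: String): String = content.take(MAX_CONTENT_CHARS)
```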
3. Filtering & Deduplication
Extracted items are filtered against the min_confidence threshold. If deduplicate is enabled, items are checked against a ConcurrentHashMap of seen keys generated from the deduplication_keys.
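A sketch of that filter, assuming items are plain maps and that the dedup key is built by joining the configured field values (the key format is an assumption):

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Concurrent set of keys already seen across pages; add() returns true only for new keys.
val seenKeys: MutableSet<String> = ConcurrentHashMap.newKeySet()

// Build a dedup key from the comma-separated deduplication_keys (e.g. "product_name").
fun dedupKey(item: Map<String, Any?>, deduplicationKeys: String): String =
    deduplicationKeys.split(",").joinToString("|") { field -> item[field.trim()]?.toString() ?: "" }

fun shouldKeep(item: Map<String, Any?>, confidence: Double, minConfidence: Double, keys: String): Boolean {
    if (confidence < minConfidence) return false  // drop low-confidence items
    return seenKeys.add(dedupKey(item, keys))     // drop items whose key was already seen
}
```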
4. Final Aggregation
Upon completion, the strategy generates aggregated_data.json, aggregated_data_pretty.json, and extraction_metadata.json in the workspace. It also produces a comprehensive Markdown report summarizing statistics and sample data.
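As a minimal sketch of the output step (the file names match those listed above; everything else, including the metadata content, is illustrative only):

```kotlin
import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json
import kotlinx.serialization.json.JsonArray
import kotlinx.serialization.json.JsonElement
import java.io.File

// Write compact and pretty-printed aggregates plus a small metadata file.
fun writeAggregatedOutputs(workspace: File, items: List<JsonElement>) {
    val array = JsonArray(items)
    val pretty = Json { prettyPrint = true }

    File(workspace, "aggregated_data.json").writeText(array.toString())
    File(workspace, "aggregated_data_pretty.json").writeText(pretty.encodeToString(array))

    // Illustrative metadata only; the real file likely records richer statistics.
    File(workspace, "extraction_metadata.json").writeText("""{"total_items": ${items.size}}""")
}
```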
Kotlin Integration
Add this strategy to your CrawlerAgentTask configuration:
```kotlin
val task = CrawlerAgentTask(
    processingStrategy = SchemaExtractionStrategy(),
    executionConfig = CrawlerAgentTask.CrawlerConfig(
        content_queries = SchemaExtractionConfig(
            schema_definition = """{ "type": "object", ... }""",
            extraction_instructions = "Extract pricing tables",
            min_confidence = 0.85
        ).toJson()
    )
)
```
Prompt Segment
The internal prompt template injected into the LLM request:

```text
Extract structured data from the following web page content according to the schema provided.

SCHEMA DEFINITION:
${config.schema_definition}

EXTRACTION INSTRUCTIONS:
${config.extraction_instructions}

Provide:
1. The extracted data...
2. A confidence score...
3. Validation notes...
```