SchemaExtractionStrategy
Extracts structured data from web pages according to a user-defined JSON schema and aggregates results into a comprehensive dataset.
```json
{
  "task_type": "CrawlerAgentTask",
  "strategy": "SchemaExtractionStrategy",
  "schema_definition": "{\"type\": \"object\", \"properties\": {\"product_name\": {\"type\": \"string\"}, \"price\": {\"type\": \"number\"}}}",
  "extraction_instructions": "Extract all listed products",
  "min_confidence": 0.8,
  "aggregate_results": true,
  "deduplicate": true,
  "deduplication_keys": "product_name"
}
```
## Data Extraction Results
Confidence: 94% | Items Extracted: 1 | Total: 12

Sample from aggregated_data.json:

```json
[
  {
    "product_name": "Industrial Sensor A1",
    "price": 299.99
  },
  ...
]
```
Configuration Parameters
| Field | Type | Description |
|---|---|---|
| schema_definition* | String | JSON schema definition describing the data structure to extract. Default: "{}". |
| extraction_instructions | String | Natural language guidance for the LLM on what specific data to target. |
| min_confidence | Double | Threshold (0.0-1.0) for including results. Default: 0.7. |
| aggregate_results | Boolean | If true, combines all page results into a single JSON array. Default: true. |
| max_items_per_page | Integer | Maximum number of items to extract per page. Default: null (unlimited). |
| validate_schema | Boolean | Whether to validate extracted data against the schema. Default: true. |
| deduplicate | Boolean | Whether to remove duplicate items based on keys. Default: true. |
| deduplication_keys | String | Comma-separated field names used to identify unique items. |
Note: This strategy inherits from DefaultSummarizerStrategy and supports standard link crawling.
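For illustration, the parameters above could be modeled roughly as follows. This is a minimal sketch assuming a kotlinx.serialization setup; the actual SchemaExtractionConfig class in the codebase may declare these fields differently.

```kotlin
import kotlinx.serialization.Serializable
import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json

// Minimal sketch of a configuration holder mirroring the documented parameters.
// Defaults follow the table above; the real class may differ.
@Serializable
data class SchemaExtractionConfig(
    val schema_definition: String = "{}",
    val extraction_instructions: String = "",
    val min_confidence: Double = 0.7,
    val aggregate_results: Boolean = true,
    val max_items_per_page: Int? = null,
    val validate_schema: Boolean = true,
    val deduplicate: Boolean = true,
    val deduplication_keys: String = ""
) {
    // Convenience serializer used in the integration example below (assumed shape).
    fun toJson(): String = Json.encodeToString(this)
}
```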
Task Execution Lifecycle
1. Initialization & Validation
The strategy parses the SchemaExtractionConfig. It validates that the schema_definition is present and that min_confidence is within the valid range (0.0-1.0).
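A rough sketch of that check (the actual implementation is not shown in this document):

```kotlin
// Illustrative validation only; error handling in the real strategy may differ.
fun validateConfig(schemaDefinition: String?, minConfidence: Double) {
    require(!schemaDefinition.isNullOrBlank()) {
        "schema_definition is required and must describe the structure to extract"
    }
    require(minConfidence in 0.0..1.0) {
        "min_confidence must be between 0.0 and 1.0"
    }
}
```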
2. Page Processing (LLM Extraction)
For each crawled page, the content is sent to a ParsedAgent. The LLM is prompted with the schema and instructions to return a structured ExtractedData object containing the data, a confidence score, and validation notes. Content is capped at 50,000 characters.
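The per-page result could be shaped roughly like this; the exact fields of ExtractedData and the ParsedAgent call are not documented here, so the following is an assumption-labeled sketch:

```kotlin
// Hypothetical shape of the structured result returned by the LLM for one page.
data class ExtractedData(
    val data: List<Map<String, Any?>>, // items matching the user-defined schema
    val confidence: Double,            // 0.0-1.0 confidence score
    val validationNotes: String        // notes on schema conformance
)

// Sketch of the 50,000-character cap applied before prompting.
const val MAX_CONTENT_CHARS = 50_000

fun capPageContent(content: String): String = content.take(MAX_CONTENT_CHARS)
```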
3. Filtering & Deduplication
Extracted items are filtered against the min_confidence threshold. If deduplicate is enabled, items are checked against a ConcurrentHashMap of seen keys generated from the deduplication_keys.
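A sketch of that filter, assuming items are plain maps and that the dedup key is built by joining the configured field values (the key format is an assumption):

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Concurrent set of keys already seen across pages; add() returns true only for new keys.
val seenKeys: MutableSet<String> = ConcurrentHashMap.newKeySet()

// Build a dedup key from the comma-separated deduplication_keys (e.g. "product_name").
fun dedupKey(item: Map<String, Any?>, deduplicationKeys: String): String =
    deduplicationKeys.split(",").joinToString("|") { field -> item[field.trim()]?.toString() ?: "" }

fun shouldKeep(item: Map<String, Any?>, confidence: Double, minConfidence: Double, keys: String): Boolean {
    if (confidence < minConfidence) return false  // drop low-confidence items
    return seenKeys.add(dedupKey(item, keys))     // drop items whose key was already seen
}
```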
4. Final Aggregation
Upon completion, the strategy generates aggregated_data.json, aggregated_data_pretty.json, and extraction_metadata.json in the workspace. It also produces a comprehensive Markdown report summarizing statistics and sample data.
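As a minimal sketch of the output step (the file names match those listed above; everything else, including the metadata content, is illustrative only):

```kotlin
import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json
import kotlinx.serialization.json.JsonArray
import kotlinx.serialization.json.JsonElement
import java.io.File

// Write compact and pretty-printed aggregates plus a small metadata file.
fun writeAggregatedOutputs(workspace: File, items: List<JsonElement>) {
    val array = JsonArray(items)
    val pretty = Json { prettyPrint = true }

    File(workspace, "aggregated_data.json").writeText(array.toString())
    File(workspace, "aggregated_data_pretty.json").writeText(pretty.encodeToString(array))

    // Illustrative metadata only; the real file likely records richer statistics.
    File(workspace, "extraction_metadata.json").writeText("""{"total_items": ${items.size}}""")
}
```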
Kotlin Integration
Add this strategy to your CrawlerAgentTask configuration:
```kotlin
val task = CrawlerAgentTask(
    processingStrategy = SchemaExtractionStrategy(),
    executionConfig = CrawlerAgentTask.CrawlerConfig(
        content_queries = SchemaExtractionConfig(
            schema_definition = """{ "type": "object", ... }""",
            extraction_instructions = "Extract pricing tables",
            min_confidence = 0.85
        ).toJson()
    )
)
```
Prompt Segment
The internal prompt template injected into the LLM request:

```text
Extract structured data from the following web page content according to the schema provided.

SCHEMA DEFINITION:
${config.schema_definition}

EXTRACTION INSTRUCTIONS:
${config.extraction_instructions}

Provide:
1. The extracted data...
2. A confidence score...
3. Validation notes...
```