Home > Tasks > Web Crawling > Strategies

DataTableAccumulationStrategy

Extracts and accumulates structured tabular data across multiple web pages with automated normalization, deduplication, and multi-format export.

Stateful: Accumulative Model: GPT-4o Preferred Output: CSV/JSON/Markdown

⚙️ ExecutionConfig.json

{
  "column_names": "Model, Price, Range",
  "column_types": { "Price": "number" },
  "deduplicate": true,
  "key_columns": "Model",
  "export_format": "csv",
  "include_source_urls": true
}

→

📊 Session UI Output

✔ Extraction Successful

Confidence: 98% | Rows: 12 | Total: 45

Model	Price	Range
Model 3	38990	333 mi
Ioniq 6	42450	361 mi
Air Pure	69900	419 mi

⚠️ Normalized 'Price' from "$38,990" to "38990"

DataTableConfig Fields

Field	Type	Description
`column_names` *	`String`	Comma-separated list of columns to extract.
`column_descriptions`	`Map<String, String>`	Guidance for the LLM on what each column represents.
`column_types`	`Map<String, String>`	Mapping of column names to types (`string`, `number`, `boolean`, `date`).
`extraction_instructions`	`String`	Natural language guidance for the LLM extraction engine.
`auto_detect_tables`	`Boolean`	If true, prioritizes existing HTML table structures. Default: `true`.
`min_rows`	`Int`	Minimum rows required to consider a table valid. Default: `1`.
`max_rows_per_page`	`Int?`	Limit extraction per page. Default: `null` (unlimited).
`deduplicate`	`Boolean`	Prevents duplicate rows across multiple pages. Default: `true`.
`key_columns`	`String?`	Comma-separated columns used for deduplication identity.
`validate_types`	`Boolean`	Enforce type checking on extracted values. Default: `true`.
`normalize_data`	`Boolean`	Clean values (e.g. strip currency symbols). Default: `true`.
`export_format`	`String`	Final output format: `csv`, `json`, or `markdown`. Default: `csv`.

Orchestration Setup

val strategy = DataTableAccumulationStrategy()
val config = DataTableAccumulationStrategy.DataTableConfig(
    column_names = "Product, Price, Rating",
    column_types = mapOf("Price" to "number", "Rating" to "number"),
    export_format = "csv",
    deduplicate = true,
    key_columns = "Product"
)

// Add to CrawlerAgentTask
val task = CrawlerAgentTask(
    processingStrategy = strategy,
    executionConfig = CrawlerAgentTask.ExecutionConfig(
        content_queries = config.toJson(),
        max_pages = 50
    )
)

Prompt Segment

The following logic is injected into the LLM context:

Extract tabular data... REQUIRED COLUMNS: [columns]. 
Look for HTML tables, lists, or structured data. 
Return a list of rows and a confidence score.

DataTableAccumulationStrategy

✔ Extraction Successful

DataTableConfig Fields

1. LLM Extraction

2. Normalization & Validation

3. Deduplication

4. Final Export

Orchestration Setup

Prompt Segment