DataIngestTask
Iteratively parse unstructured logs and text into structured data. Uses LLM-driven pattern discovery to handle residual (unmatched) lines and streams results into structured output artifacts.
Category: File
Side-Effect: Safe
Output: Multi-Artifact
⚙️ DataIngestConfig.json
```json
{
  "task_type": "DataIngest",
  "input_files": ["logs/*.log"],
  "sample_size": 1000,
  "coverage_threshold": 0.95,
  "task_description": "Parse application access logs"
}
```
👁️ Session UI (Discovery Tab)
```
Iteration 2 | Coverage: 82% | Residuals: 180
✅ Discovered Pattern:
(?<timestamp>[^ ]+) (?<level>[A-Z]+) (?<msg>.*)
Matched 450 lines in sample.
Generating artifacts:
data.jsonl, index.csv...
```
Execution Configuration
| Field | Type | Default | Description |
|---|---|---|---|
| input_files* | List<String> | - | Glob patterns for files to ingest (e.g., **/*.log). |
| sample_size | Int | 1000 | Number of lines to sample for initial pattern discovery. |
| max_iterations | Int | 10 | Maximum discovery loops to run before finalizing. |
| coverage_threshold | Double | 0.95 | Stop discovery when this fraction of the sample is covered (0.0 - 1.0). |
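For reference, the File Resolution and Sampling steps (described under Task Lifecycle below) can be pictured with a short Kotlin sketch. This is illustrative only: the workspace-relative glob matching and head-of-file sampling shown here are assumptions, not the task's actual implementation.

```kotlin
import java.io.File
import java.nio.file.FileSystems
import java.nio.file.Files
import java.nio.file.Path
import java.util.stream.Collectors

// Resolve glob patterns such as "logs/*.log" against the workspace root.
fun resolveInputFiles(workspace: File, patterns: List<String>): List<Path> {
    val root = workspace.toPath()
    val matchers = patterns.map { FileSystems.getDefault().getPathMatcher("glob:$it") }
    Files.walk(root).use { stream ->
        return stream
            .filter { Files.isRegularFile(it) }
            // Match against the workspace-relative path so "logs/*.log"
            // behaves as expected regardless of where the workspace lives.
            .filter { p -> matchers.any { m -> m.matches(root.relativize(p)) } }
            .collect(Collectors.toList())
    }
}

// Take up to sampleSize lines from the resolved files for pattern discovery.
// (A real implementation might interleave or randomize for representativeness.)
fun sampleLines(files: List<Path>, sampleSize: Int): List<String> {
    val sample = ArrayList<String>(sampleSize)
    for (path in files) {
        path.toFile().useLines { lines ->
            for (line in lines) {
                if (sample.size >= sampleSize) return sample
                sample += line
            }
        }
    }
    return sample
}
```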
Generated Artifacts
- data.jsonl: The primary structured output in JSON Lines format.
- data.csv: Flattened CSV version of the extracted data.
- patterns.json: Registry of all discovered Regex patterns and field names.
- index.csv: Mapping of extracted records back to source file offsets.
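To make the relationship between these artifacts concrete, here is a minimal Kotlin sketch of a streaming extraction pass that writes data.jsonl and index.csv. The PatternEntry shape, index columns, and hand-rolled JSON escaping are simplifications for illustration, not the task's actual implementation (which would also emit data.csv and patterns.json from the same registry and records):

```kotlin
import java.io.File
import java.util.regex.Pattern

// Simplified registry entry; the real patterns.json carries more metadata.
data class PatternEntry(val regex: Pattern, val fields: List<String>)

fun extract(inputs: List<File>, registry: List<PatternEntry>, outDir: File) {
    outDir.resolve("data.jsonl").bufferedWriter().use { data ->
        outDir.resolve("index.csv").bufferedWriter().use { index ->
            // Assumed columns; the real index may store byte offsets instead.
            index.write("record,source_file,line_number\n")
            var record = 0
            for (file in inputs) {
                file.bufferedReader().useLines { lines ->
                    lines.forEachIndexed { lineNo, line ->
                        // Apply patterns in priority order; first match wins.
                        for (entry in registry) {
                            val m = entry.regex.matcher(line)
                            if (!m.matches()) continue
                            val json = entry.fields.joinToString(",", "{", "}") {
                                "\"$it\":\"${(m.group(it) ?: "").replace("\"", "\\\"")}\""
                            }
                            data.write(json)
                            data.newLine()
                            index.write("${record++},${file.path},${lineNo + 1}\n")
                            break
                        }
                    }
                }
            }
        }
    }
}
```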
Task Lifecycle
- File Resolution: Scans the workspace using provided glob patterns to identify target files.
- Sampling: Reads a subset of lines from the identified files to build a representative dataset for the LLM.
- Discovery Loop (see the sketch after this list):
  - Identifies "residual" lines (those not matched by current patterns).
  - Prompts the LLM to generate a Java-compatible Regex with named capture groups.
  - Validates and tests the Regex against the sample.
- Bulk Extraction: Streams through all files, applying the pattern registry in priority order, and writes structured artifacts to the workspace (sketched under Generated Artifacts above).
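As a rough picture of the discovery loop, here is a minimal Kotlin sketch. proposePattern is a hypothetical stand-in for the LLM call, and the coverage bookkeeping is simplified; this is not the task's actual implementation:

```kotlin
import java.util.regex.Pattern
import java.util.regex.PatternSyntaxException

// Hypothetical stand-in for the LLM call that proposes a Java-compatible
// regex with named capture groups covering the residual lines.
fun proposePattern(residuals: List<String>, taskDescription: String): String = TODO()

fun discoverPatterns(
    sample: List<String>,
    taskDescription: String,
    maxIterations: Int = 10,
    coverageThreshold: Double = 0.95,
): List<Pattern> {
    val registry = mutableListOf<Pattern>()
    repeat(maxIterations) {
        // Residual lines: those not matched by any pattern discovered so far.
        val residuals = sample.filter { line -> registry.none { it.matcher(line).matches() } }
        val coverage = 1.0 - residuals.size.toDouble() / sample.size
        if (coverage >= coverageThreshold) return registry
        // Validate the proposed regex: it must compile and must actually
        // match at least one residual line before it joins the registry.
        val candidate = try {
            Pattern.compile(proposePattern(residuals, taskDescription))
        } catch (e: PatternSyntaxException) {
            return@repeat // invalid regex; try again next iteration
        }
        if (residuals.any { candidate.matcher(it).matches() }) registry += candidate
    }
    return registry
}
```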
Kotlin Boilerplate
Invoke via the UnifiedHarness for embedded or headless execution:
```kotlin
// 1. Initialize the Harness
val harness = UnifiedHarness(serverless = true)
harness.start()

// 2. Define Execution Configuration
val config = DataIngestTask.DataIngestTaskExecutionConfigData(
    input_files = listOf("logs/prod-*.log"),
    sample_size = 2000,
    coverage_threshold = 0.98,
    task_description = "Extract error codes and stack traces"
)

// 3. Run the Task
harness.runTask(
    taskType = DataIngestTask.DataIngest,
    executionConfig = config,
    workspace = File("./my-project"),
    autoFix = true
)
```
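Note that the field names on DataIngestTaskExecutionConfigData mirror the JSON keys in DataIngestConfig.json above (input_files, sample_size, coverage_threshold, task_description), so configurations translate directly between the two entry points.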
CLI / GitHub Actions
Run as a standalone tool in CI/CD pipelines:
```yaml
- name: Run Data Ingestion
  run: |
    java -jar cognotik-cli.jar \
      --task DataIngest \
      --input_files "logs/*.log"
```
Prompt Segment
The following context is injected into the LLM when this task is available:
```
DataIngest - Iteratively parse unstructured logs/text into structured data
** Specify input_files patterns (glob) to process
** Iteratively discovers Regex patterns using LLM for residual data
** Generates structured artifacts: data.jsonl, data.csv, patterns.json, and index.csv
** Efficiently handles large files via streaming extraction
```