DataIngestTask
Iteratively parse unstructured logs and text into structured data. Uses LLM-driven pattern discovery to handle residual (unmatched) lines and streams results into structured output artifacts.
Category: File
Side-Effect: Safe
Output: Multi-Artifact
⚙️ DataIngestConfig.json
```json
{
  "task_type": "DataIngest",
  "input_files": ["logs/*.log"],
  "sample_size": 1000,
  "coverage_threshold": 0.95,
  "task_description": "Parse application access logs"
}
```
👁️ Session UI (Discovery Tab)
```
Iteration 2 | Coverage: 82% | Residuals: 180
✅ Discovered Pattern:
(?<timestamp>[^ ]+) (?<level>[A-Z]+) (?<msg>.*)
Matched 450 lines in sample.
Generating artifacts:
data.jsonl, index.csv...
```
Execution Configuration
| Field | Type | Default | Description |
|---|---|---|---|
| input_files* | List<String> | - | Glob patterns for files to ingest (e.g., **/*.log). |
| sample_size | Int | 1000 | Number of lines to sample for initial pattern discovery. |
| max_iterations | Int | 10 | Maximum discovery loops to run before finalizing. |
| coverage_threshold | Double | 0.95 | Stop discovery when this fraction of the sample is covered (0.0 - 1.0). |
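For reference, the File Resolution and Sampling steps (described under Task Lifecycle below) can be pictured with a short Kotlin sketch. This is illustrative only: the workspace-relative glob matching and head-of-file sampling shown here are assumptions, not the task's actual implementation.

```kotlin
import java.io.File
import java.nio.file.FileSystems
import java.nio.file.Files
import java.nio.file.Path
import java.util.stream.Collectors

// Resolve glob patterns such as "logs/*.log" against the workspace root.
fun resolveInputFiles(workspace: File, patterns: List<String>): List<Path> {
    val root = workspace.toPath()
    val matchers = patterns.map { FileSystems.getDefault().getPathMatcher("glob:$it") }
    Files.walk(root).use { stream ->
        return stream
            .filter { Files.isRegularFile(it) }
            // Match against the workspace-relative path so "logs/*.log"
            // behaves as expected regardless of where the workspace lives.
            .filter { p -> matchers.any { m -> m.matches(root.relativize(p)) } }
            .collect(Collectors.toList())
    }
}

// Take up to sampleSize lines from the resolved files for pattern discovery.
// (A real implementation might interleave or randomize for representativeness.)
fun sampleLines(files: List<Path>, sampleSize: Int): List<String> {
    val sample = ArrayList<String>(sampleSize)
    for (path in files) {
        path.toFile().useLines { lines ->
            for (line in lines) {
                if (sample.size >= sampleSize) return sample
                sample += line
            }
        }
    }
    return sample
}
```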
Generated Artifacts
- data.jsonl: The primary structured output in JSON Lines format.
- data.csv: Flattened CSV version of the extracted data.
- patterns.json: Registry of all discovered Regex patterns and field names.
- index.csv: Mapping of extracted records back to source file offsets.
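To make the relationship between these artifacts concrete, here is a minimal Kotlin sketch of a streaming extraction pass that writes data.jsonl and index.csv. The PatternEntry shape, index columns, and hand-rolled JSON escaping are simplifications for illustration, not the task's actual implementation (which would also emit data.csv and patterns.json from the same registry and records):

```kotlin
import java.io.File
import java.util.regex.Pattern

// Simplified registry entry; the real patterns.json carries more metadata.
data class PatternEntry(val regex: Pattern, val fields: List<String>)

fun extract(inputs: List<File>, registry: List<PatternEntry>, outDir: File) {
    outDir.resolve("data.jsonl").bufferedWriter().use { data ->
        outDir.resolve("index.csv").bufferedWriter().use { index ->
            // Assumed columns; the real index may store byte offsets instead.
            index.write("record,source_file,line_number\n")
            var record = 0
            for (file in inputs) {
                file.bufferedReader().useLines { lines ->
                    lines.forEachIndexed { lineNo, line ->
                        // Apply patterns in priority order; first match wins.
                        for (entry in registry) {
                            val m = entry.regex.matcher(line)
                            if (!m.matches()) continue
                            val json = entry.fields.joinToString(",", "{", "}") {
                                "\"$it\":\"${(m.group(it) ?: "").replace("\"", "\\\"")}\""
                            }
                            data.write(json)
                            data.newLine()
                            index.write("${record++},${file.path},${lineNo + 1}\n")
                            break
                        }
                    }
                }
            }
        }
    }
}
```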
Task Lifecycle
- File Resolution: Scans the workspace using provided glob patterns to identify target files.
- Sampling: Reads a subset of lines from the identified files to build a representative dataset for the LLM.
- Discovery Loop (see the sketch after this list):
  - Identifies "residual" lines (those not matched by current patterns).
  - Prompts the LLM to generate a Java-compatible Regex with named capture groups.
  - Validates and tests the Regex against the sample.
- Bulk Extraction: Streams through all files, applying the pattern registry in priority order, and writes structured artifacts to the workspace (sketched under Generated Artifacts above).
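As a rough picture of the discovery loop, here is a minimal Kotlin sketch. proposePattern is a hypothetical stand-in for the LLM call, and the coverage bookkeeping is simplified; this is not the task's actual implementation:

```kotlin
import java.util.regex.Pattern
import java.util.regex.PatternSyntaxException

// Hypothetical stand-in for the LLM call that proposes a Java-compatible
// regex with named capture groups covering the residual lines.
fun proposePattern(residuals: List<String>, taskDescription: String): String = TODO()

fun discoverPatterns(
    sample: List<String>,
    taskDescription: String,
    maxIterations: Int = 10,
    coverageThreshold: Double = 0.95,
): List<Pattern> {
    val registry = mutableListOf<Pattern>()
    repeat(maxIterations) {
        // Residual lines: those not matched by any pattern discovered so far.
        val residuals = sample.filter { line -> registry.none { it.matcher(line).matches() } }
        val coverage = 1.0 - residuals.size.toDouble() / sample.size
        if (coverage >= coverageThreshold) return registry
        // Validate the proposed regex: it must compile and must actually
        // match at least one residual line before it joins the registry.
        val candidate = try {
            Pattern.compile(proposePattern(residuals, taskDescription))
        } catch (e: PatternSyntaxException) {
            return@repeat // invalid regex; try again next iteration
        }
        if (residuals.any { candidate.matcher(it).matches() }) registry += candidate
    }
    return registry
}
```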
Kotlin Boilerplate
Invoke via the UnifiedHarness for embedded or headless execution:
```kotlin
// 1. Initialize the Harness
val harness = UnifiedHarness(serverless = true)
harness.start()

// 2. Define Execution Configuration
val config = DataIngestTask.DataIngestTaskExecutionConfigData(
    input_files = listOf("logs/prod-*.log"),
    sample_size = 2000,
    coverage_threshold = 0.98,
    task_description = "Extract error codes and stack traces"
)

// 3. Run the Task
harness.runTask(
    taskType = DataIngestTask.DataIngest,
    executionConfig = config,
    workspace = File("./my-project"),
    autoFix = true
)
```
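Note that the field names on DataIngestTaskExecutionConfigData mirror the JSON keys in DataIngestConfig.json above (input_files, sample_size, coverage_threshold, task_description), so configurations translate directly between the two entry points.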
CLI / GitHub Actions
Run as a standalone tool in CI/CD pipelines:
```yaml
- name: Run Data Ingestion
  run: |
    java -jar cognotik-cli.jar \
      --task DataIngest \
      --input_files "logs/*.log"
```
Prompt Segment
The following context is injected into the LLM when this task is available:
```
DataIngest - Iteratively parse unstructured logs/text into structured data
** Specify input_files patterns (glob) to process
** Iteratively discovers Regex patterns using LLM for residual data
** Generates structured artifacts: data.jsonl, data.csv, patterns.json, and index.csv
** Efficiently handles large files via streaming extraction
```