⚙️ DataIngestConfig.json
{
  "task_type": "DataIngest",
  "input_files": ["logs/*.log"],
  "sample_size": 1000,
  "coverage_threshold": 0.95,
  "task_description": "Parse application access logs"
}
👁️ Session UI (Discovery Tab)
Iteration 2 | Coverage: 82% | Residuals: 180
✅ Discovered Pattern:
(?<timestamp>[^ ]+) (?<level>[A-Z]+) (?<msg>.*)

Matched 450 lines in sample.
Generating artifacts: data.jsonl, index.csv...
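
The discovered pattern shown in the Discovery Tab is an ordinary Java-compatible regex with named capture groups, so it can be sanity-checked outside the task. A minimal Kotlin check against a made-up log line (the sample line is illustrative):

import java.util.regex.Pattern

fun main() {
    // Pattern exactly as discovered in the session above.
    val p = Pattern.compile("(?<timestamp>[^ ]+) (?<level>[A-Z]+) (?<msg>.*)")
    // Illustrative log line; any "<token> <LEVEL> <rest>" shape matches.
    val m = p.matcher("2024-05-01T12:00:00Z INFO request handled in 12ms")
    if (m.matches()) {
        println(m.group("timestamp"))  // 2024-05-01T12:00:00Z
        println(m.group("level"))      // INFO
        println(m.group("msg"))        // request handled in 12ms
    }
}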

Live Results Showcase

Explore actual artifacts generated by the DataIngestTask in the test workspace.

Execution Configuration

Field              | Type         | Default | Description
-------------------|--------------|---------|--------------------------------------------------------------
input_files*       | List<String> | -       | Glob patterns for files to ingest (e.g., **/*.log).
sample_size        | Int          | 1000    | Number of lines to sample for initial pattern discovery.
max_iterations     | Int          | 10      | Maximum discovery loops to run before finalizing.
coverage_threshold | Double       | 0.95    | Stop discovery once this fraction of the sample is covered (0.0 - 1.0).

Fields marked * are required.
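
For reference, a configuration that sets every field above (the values themselves are illustrative):

{
  "task_type": "DataIngest",
  "input_files": ["logs/*.log"],
  "sample_size": 2000,
  "max_iterations": 5,
  "coverage_threshold": 0.9,
  "task_description": "Parse application access logs"
}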

Generated Artifacts

  • data.jsonl: The primary structured output in JSON Lines format.
  • data.csv: Flattened CSV version of the extracted data.
  • patterns.json: Registry of all discovered Regex patterns and field names.
  • index.csv: Mapping of extracted records back to source file offsets.
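
The exact artifact schemas are best confirmed against a real run; as rough orientation, a patterns.json entry presumably records each regex together with its field names, and an index.csv row ties a record back to its source location. A hypothetical sketch (field names here are assumptions, not the actual schema):

patterns.json (hypothetical entry):
[
  {
    "pattern": "(?<timestamp>[^ ]+) (?<level>[A-Z]+) (?<msg>.*)",
    "fields": ["timestamp", "level", "msg"]
  }
]

index.csv (hypothetical rows):
record_id,source_file,offset
0,logs/app.log,0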

Task Lifecycle

  1. File Resolution: Scans the workspace using provided glob patterns to identify target files.
  2. Sampling: Reads a subset of lines from the identified files to build a representative dataset for the LLM.
  3. Discovery Loop:
    • Identifies "residual" lines (those not matched by current patterns).
    • Prompts the LLM to generate a Java-compatible Regex with named capture groups.
    • Validates and tests the Regex against the sample.
  4. Bulk Extraction: Streams through all files, applying the pattern registry in priority order, and writes structured artifacts to the workspace.
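
To make step 4 concrete, here is a condensed Kotlin sketch of priority-ordered matching during streaming extraction (the types and names are illustrative, not the task's actual internals):

import java.io.File
import java.util.regex.Pattern

// One registry entry: a compiled pattern plus the names of its capture groups.
data class RegisteredPattern(val regex: Pattern, val fields: List<String>)

// Streams every file line by line, tries patterns in priority order, and
// writes one naive JSON object per matched line (illustration only: values
// are not JSON-escaped here).
fun extract(files: List<File>, registry: List<RegisteredPattern>, out: File) {
    out.bufferedWriter().use { writer ->
        for (file in files) {
            file.useLines { lines ->
                for (line in lines) {
                    // First match wins: the registry is already priority-ordered.
                    for (p in registry) {
                        val m = p.regex.matcher(line)
                        if (m.matches()) {
                            writer.appendLine(p.fields.joinToString(", ", "{", "}") {
                                "\"$it\": \"${m.group(it)}\""
                            })
                            break
                        }
                    }
                }
            }
        }
    }
}

Taking the first matching pattern keeps extraction deterministic when patterns overlap, which is why the registry's priority order matters.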

Kotlin Boilerplate

Invoke via the UnifiedHarness for embedded or headless execution:

// 1. Initialize the Harness
val harness = UnifiedHarness(serverless = true)
harness.start()

// 2. Define Execution Configuration
val config = DataIngestTask.DataIngestTaskExecutionConfigData(
    input_files = listOf("logs/prod-*.log"),
    sample_size = 2000,
    coverage_threshold = 0.98,
    task_description = "Extract error codes and stack traces"
)

// 3. Run the Task
harness.runTask(
    taskType = DataIngestTask.DataIngest,
    executionConfig = config,
    workspace = File("./my-project"),
    autoFix = true
)

CLI / GitHub Actions

Run as a standalone tool in CI/CD pipelines:

- name: Run Data Ingestion
  run: |
    java -jar cognotik-cli.jar \
      --task DataIngest \
      --input_files "logs/*.log"

Prompt Segment

The following context is injected into the LLM when this task is available:

DataIngest - Iteratively parse unstructured logs/text into structured data
  ** Specify input_files patterns (glob) to process
  ** Iteratively discovers Regex patterns using LLM for residual data
  ** Generates structured artifacts: data.jsonl, data.csv, patterns.json, and index.csv
  ** Efficiently handles large files via streaming extraction