# LLMExperimentTask
Conduct rigorous, multi-condition experiments to characterize LLM behaviors, biases, and performance metrics with automated statistical validation and insight generation.
Category: Social · Side-Effect: Safe · Multi-Condition

Example ExecutionConfig.json:
{
"prompt_templates": [
"Explain {topic} to a {audience}"
],
"prompt_variables": {
"topic": ["Quantum Computing"],
"audience": ["5-year old", "PhD"]
},
"temperature_values": [0.1, 0.8],
"repetitions": 3,
"metrics": ["clarity", "accuracy"],
"significance_level": 0.05
}
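This example expands to 1 template × 1 topic × 2 audiences × 2 temperatures = 4 conditions; with repetitions set to 3, the task runs 12 trials in total.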
Results stream live into the SessionTask UI, which organizes the report into Overview, Statistical Tables, and Insights views. Sample output:
### Table 2: Pairwise Temperature Comparisons
| Metric | p-value | Sig |
|---|---|---|
| Response Length | 0.0042 | ✓ |
| Clarity | 0.1250 | ✗ |
Insight: Higher temperatures significantly increased response length variance (Cohen's d: 0.82).
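Reading the table: with the configured significance_level of 0.05, the response-length difference is flagged as significant (0.0042 < 0.05), while the clarity difference is not (0.1250 > 0.05).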
## Execution Configuration

| Field | Type | Description |
|---|---|---|
| `prompt_templates`* | `List<String>` | Base templates using `{variable}` syntax for substitution. |
| `prompt_variables` | `Map<String, List<String>>` | Keys matching template placeholders and their possible values. |
| `temperature_values` | `List<Double>` | LLM temperatures to test (e.g., `[0.0, 0.7, 1.0]`). |
| `repetitions` | `Int` | Number of times to repeat each condition (default: 3). |
| `metrics` | `List<String>` | Custom attributes for the LLM to rate in responses. |
| `statistical_analysis` | `Boolean` | Whether to compute t-tests and variance (default: true). |
| `significance_level` | `Double` | Alpha level for statistical tests (default: 0.05). |
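`prompt_templates` (marked *) is the only field flagged as required; note that any `{placeholder}` used in a template needs a matching `prompt_variables` entry in order to expand.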
## Task Execution Lifecycle
- Condition Generation: Computes the Cartesian product of all templates, variables, and temperatures (sketched after this list).
- Concurrent Execution: Trials are submitted to a thread pool for parallel processing across the configured model.
- Metric Calculation: Uses a secondary ParsedAgent to objectively rate responses against the metrics list.
- Statistical Synthesis: Calculates mean, SD, Cohen's d, and p-values for all metric pairs using an internal math engine (also sketched below).
- Insight Generation: A specialized "Analyst" agent reviews the data to provide natural language interpretations of the findings.
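To make the first and fourth steps concrete, here is a minimal, self-contained Kotlin sketch of condition enumeration and a pooled-SD Cohen's d calculation. It is illustrative only: Condition, enumerateConditions, and cohensD are hypothetical names, not the task's internal API.

// Illustrative sketch only; these names are hypothetical, not the task's internal API.
data class Condition(val template: String, val vars: Map<String, String>, val temperature: Double)

fun enumerateConditions(
    templates: List<String>,
    variables: Map<String, List<String>>,
    temperatures: List<Double>
): List<Condition> {
    // Cartesian product of variable values: start from one empty binding,
    // then extend it with every value of each variable in turn.
    val bindings = variables.entries.fold(listOf(mapOf<String, String>())) { acc, (key, values) ->
        acc.flatMap { partial -> values.map { value -> partial + (key to value) } }
    }
    return templates.flatMap { template ->
        bindings.flatMap { binding ->
            temperatures.map { temp -> Condition(template, binding, temp) }
        }
    }
}

// Cohen's d for two independent samples, using the pooled standard deviation.
fun cohensD(a: List<Double>, b: List<Double>): Double {
    fun mean(xs: List<Double>) = xs.sum() / xs.size
    fun variance(xs: List<Double>): Double {
        val m = mean(xs)
        return xs.sumOf { (it - m) * (it - m) } / (xs.size - 1)
    }
    val pooledSd = kotlin.math.sqrt(
        ((a.size - 1) * variance(a) + (b.size - 1) * variance(b)) / (a.size + b.size - 2)
    )
    return (mean(a) - mean(b)) / pooledSd
}

Applying enumerateConditions to the example configuration at the top of this page yields the same 4 conditions counted earlier.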
## Embedded Execution (Headless)
Invoke this task directly within your CI/CD pipeline or automated scripts using the UnifiedHarness.
// 1. Initialize the Harness
val harness = UnifiedHarness(serverless = true, openBrowser = false)
harness.start()
// 2. Define the Experiment Configuration
val executionConfig = LLMExperimentTask.LLMExperimentTaskExecutionConfigData(
prompt_templates = listOf("Explain {topic} to a {audience}"),
prompt_variables = mapOf(
"topic" to listOf("Quantum Computing"),
"audience" to listOf("5-year old", "PhD")
),
temperature_values = listOf(0.1, 0.8),
repetitions = 3,
metrics = listOf("clarity", "accuracy")
)
// 3. Run the Task
harness.runTask(
taskType = LLMExperimentTask.LLMExperiment,
typeConfig = TaskTypeConfig(), // Default static config
executionConfig = executionConfig,
workspace = File("./experiments"),
autoFix = true
)
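After the run completes, the generated report and statistical artifacts should land under the configured workspace (./experiments in this example).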
## CLI Usage
Run via the Cognotik CLI for quick characterization studies.
java -jar cognotik-cli.jar \
--task LLMExperiment \
--prompt_templates "Summarize: {text}" \
--prompt_variables '{"text": ["Long article A", "Long article B"]}' \
--repetitions 5 \
--workspace ./results
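Assuming a single default temperature value, this command expands to 2 conditions (one per text value) × 5 repetitions = 10 trials, with the report written under ./results.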
## Prompt Segment
The following capabilities are injected into the orchestrator's context:
LLMExperiment - Conduct controlled experiments on LLM behavior
  - Specify one or more prompt templates with variables for substitution
  - Define experimental conditions (temperature(s), prompt variations)
  - Configure number of repetitions for statistical validity
  - Rate custom attributes in responses
  - Analyze statistical significance of results