### ExecutionConfig.json

```json
{
  "prompt_templates": [
    "Explain {topic} to a {audience}"
  ],
  "prompt_variables": {
    "topic": ["Quantum Computing"],
    "audience": ["5-year old", "PhD"]
  },
  "temperature_values": [0.1, 0.8],
  "repetitions": 3,
  "metrics": ["clarity", "accuracy"],
  "significance_level": 0.05
}
```
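
With this configuration, the task expands 1 template × 1 topic × 2 audiences × 2 temperatures into 4 distinct conditions; at 3 repetitions each, that is 12 total trials.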

## Live Results Showcase

Explore actual experiment reports and statistical artifacts generated by this task. For example, a pairwise comparison table from a live report:

### Table 2: Pairwise Temperature Comparisons

| Metric          | p-value | Significant (α = 0.05) |
|-----------------|---------|------------------------|
| Response Length | 0.0042  | Yes                    |
| Clarity         | 0.1250  | No                     |

Insight: Higher temperatures significantly increased response length variance (Cohen's d: 0.82).

## Execution Configuration

| Field | Type | Description |
|-------|------|-------------|
| `prompt_templates`* | `List<String>` | Base templates using `{variable}` syntax for substitution. |
| `prompt_variables` | `Map<String, List<String>>` | Keys matching template placeholders and their possible values. |
| `temperature_values` | `List<Double>` | LLM temperatures to test (e.g., `[0.0, 0.7, 1.0]`). |
| `repetitions` | `Int` | Number of times to repeat each condition (default: 3). |
| `metrics` | `List<String>` | Custom attributes for the LLM to rate in responses. |
| `statistical_analysis` | `Boolean` | Whether to compute t-tests and variance (default: `true`). |
| `significance_level` | `Double` | Alpha level for statistical tests (default: 0.05). |

\* Required.
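
For readers who prefer code to tables, these fields map onto the Kotlin execution-config class used later on this page. The declaration below is an illustrative sketch only: field names, types, and the documented defaults come from the table above, while the defaults marked "assumed" are guesses, and the real declaration in the codebase may differ.

```kotlin
// Illustrative sketch of LLMExperimentTask.LLMExperimentTaskExecutionConfigData.
// Field names/types and the documented defaults follow the table above;
// defaults marked "assumed" are not confirmed by the source.
data class LLMExperimentTaskExecutionConfigData(
    val prompt_templates: List<String>,                           // required
    val prompt_variables: Map<String, List<String>> = emptyMap(), // assumed default
    val temperature_values: List<Double> = listOf(0.7),           // assumed default
    val repetitions: Int = 3,                                     // documented default
    val metrics: List<String> = emptyList(),                      // assumed default
    val statistical_analysis: Boolean = true,                     // documented default
    val significance_level: Double = 0.05                         // documented default
)
```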

## Task Execution Lifecycle

  1. Condition Generation: Computes the Cartesian product of all templates, variables, and temperatures (see the sketch after this list).
  2. Concurrent Execution: Trials are submitted to a thread pool and run in parallel against the configured model.
  3. Metric Calculation: Uses a secondary ParsedAgent to objectively rate responses against the configured metrics list.
  4. Statistical Synthesis: Calculates the mean, standard deviation, Cohen's d, and p-values for all metric pairs using an internal math engine (see the statistics sketch below).
  5. Insight Generation: A specialized "Analyst" agent reviews the data and provides natural-language interpretations of the findings.
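
A minimal sketch of step 1, using hypothetical names (`Condition`, `generateConditions`) that are not part of the Cognotik API; it reproduces the Cartesian-product expansion described above.

```kotlin
// Hypothetical illustration of condition generation, not the task's actual code.
data class Condition(val prompt: String, val temperature: Double)

fun generateConditions(
    templates: List<String>,
    variables: Map<String, List<String>>,
    temperatures: List<Double>
): List<Condition> {
    // Expand each template against every combination of variable values.
    var prompts = templates
    for ((name, values) in variables) {
        prompts = prompts.flatMap { p -> values.map { v -> p.replace("{$name}", v) } }
    }
    // Cross the expanded prompts with every temperature.
    return prompts.flatMap { p -> temperatures.map { t -> Condition(p, t) } }
}
```

For the ExecutionConfig.json at the top of this page, this yields 4 conditions, which the 3 repetitions then expand to 12 trials.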

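Step 4's internal math engine is not exposed, but the reported statistics follow standard definitions. The sketch below is one way to reproduce them, computing Cohen's d by hand and borrowing Apache Commons Math (an assumption, not a stated project dependency) for the Welch t-test p-value.

```kotlin
import org.apache.commons.math3.stat.inference.TTest
import kotlin.math.sqrt

// Cohen's d: difference in means divided by the pooled standard deviation.
fun cohensD(a: DoubleArray, b: DoubleArray): Double {
    fun variance(x: DoubleArray): Double {
        val m = x.average()
        return x.sumOf { (it - m) * (it - m) } / (x.size - 1)
    }
    val pooled = sqrt(((a.size - 1) * variance(a) + (b.size - 1) * variance(b)) /
        (a.size + b.size - 2))
    return (a.average() - b.average()) / pooled
}

fun compare(a: DoubleArray, b: DoubleArray, alpha: Double = 0.05) {
    val p = TTest().tTest(a, b) // two-sided Welch t-test p-value
    val d = cohensD(a, b)
    println("p=%.4f, d=%.2f, significant=${p < alpha}".format(p, d))
}
```
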
## Embedded Execution (Headless)

Invoke this task directly within your CI/CD pipeline or automated scripts using the UnifiedHarness.

```kotlin
import java.io.File

// 1. Initialize the Harness
val harness = UnifiedHarness(serverless = true, openBrowser = false)
harness.start()

// 2. Define the Experiment Configuration
val executionConfig = LLMExperimentTask.LLMExperimentTaskExecutionConfigData(
    prompt_templates = listOf("Explain {topic} to a {audience}"),
    prompt_variables = mapOf(
        "topic" to listOf("Quantum Computing"),
        "audience" to listOf("5-year old", "PhD")
    ),
    temperature_values = listOf(0.1, 0.8),
    repetitions = 3,
    metrics = listOf("clarity", "accuracy")
)

// 3. Run the Task
harness.runTask(
    taskType = LLMExperimentTask.LLMExperiment,
    typeConfig = TaskTypeConfig(), // Default static config
    executionConfig = executionConfig,
    workspace = File("./experiments"),
    autoFix = true
)
```

## CLI Usage

Run via the Cognotik CLI for quick characterization studies.

```bash
java -jar cognotik-cli.jar \
  --task LLMExperiment \
  --prompt_templates "Summarize: {text}" \
  --prompt_variables '{"text": ["Long article A", "Long article B"]}' \
  --repetitions 5 \
  --workspace ./results
```

## Prompt Segment

The following capabilities are injected into the orchestrator's context:

LLMExperiment - Conduct controlled experiments on LLM behavior
  - Specify one or more prompt templates with variables for substitution
  - Define experimental conditions (temperatures, prompt variations)
  - Configure the number of repetitions for statistical validity
  - Rate custom attributes in responses
  - Analyze statistical significance of results