# LLMExperimentTask
Conduct rigorous, multi-condition experiments to characterize LLM behaviors, biases, and performance metrics with automated statistical validation and insight generation.
Category: Social · Side-Effect: Safe · Multi-Condition

Example ExecutionConfig.json:
{
"prompt_templates": [
"Explain {topic} to a {audience}"
],
"prompt_variables": {
"topic": ["Quantum Computing"],
"audience": ["5-year old", "PhD"]
},
"temperature_values": [0.1, 0.8],
"repetitions": 3,
"metrics": ["clarity", "accuracy"],
"significance_level": 0.05
}
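This example expands to 1 template × 1 topic × 2 audiences × 2 temperatures = 4 conditions; with repetitions set to 3, the task runs 12 trials in total.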
Results stream live into the SessionTask UI, which organizes the report into Overview, Statistical Tables, and Insights views. Sample output:
### Table 2: Pairwise Temperature Comparisons
| Metric | p-value | Sig |
|---|---|---|
| Response Length | 0.0042 | ✓ |
| Clarity | 0.1250 | ✗ |
Insight: Higher temperatures significantly increased response length variance (Cohen's d: 0.82).
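Reading the table: with the configured significance_level of 0.05, the response-length difference is flagged as significant (0.0042 < 0.05), while the clarity difference is not (0.1250 > 0.05).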
## Execution Configuration

| Field | Type | Description |
|---|---|---|
| `prompt_templates`* | `List<String>` | Base templates using `{variable}` syntax for substitution. |
| `prompt_variables` | `Map<String, List<String>>` | Keys matching template placeholders and their possible values. |
| `temperature_values` | `List<Double>` | LLM temperatures to test (e.g., `[0.0, 0.7, 1.0]`). |
| `repetitions` | `Int` | Number of times to repeat each condition (default: 3). |
| `metrics` | `List<String>` | Custom attributes for the LLM to rate in responses. |
| `statistical_analysis` | `Boolean` | Whether to compute t-tests and variance (default: true). |
| `significance_level` | `Double` | Alpha level for statistical tests (default: 0.05). |
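`prompt_templates` (marked *) is the only field flagged as required; note that any `{placeholder}` used in a template needs a matching `prompt_variables` entry in order to expand.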
## Task Execution Lifecycle
- Condition Generation: Computes the Cartesian product of all templates, variables, and temperatures (sketched after this list).
- Concurrent Execution: Trials are submitted to a thread pool for parallel processing across the configured model.
- Metric Calculation: Uses a secondary ParsedAgent to objectively rate responses against the metrics list.
- Statistical Synthesis: Calculates mean, SD, Cohen's d, and p-values for all metric pairs using an internal math engine (also sketched below).
- Insight Generation: A specialized "Analyst" agent reviews the data to provide natural language interpretations of the findings.
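To make the first and fourth steps concrete, here is a minimal, self-contained Kotlin sketch of condition enumeration and a pooled-SD Cohen's d calculation. It is illustrative only: Condition, enumerateConditions, and cohensD are hypothetical names, not the task's internal API.

// Illustrative sketch only; these names are hypothetical, not the task's internal API.
data class Condition(val template: String, val vars: Map<String, String>, val temperature: Double)

fun enumerateConditions(
    templates: List<String>,
    variables: Map<String, List<String>>,
    temperatures: List<Double>
): List<Condition> {
    // Cartesian product of variable values: start from one empty binding,
    // then extend it with every value of each variable in turn.
    val bindings = variables.entries.fold(listOf(mapOf<String, String>())) { acc, (key, values) ->
        acc.flatMap { partial -> values.map { value -> partial + (key to value) } }
    }
    return templates.flatMap { template ->
        bindings.flatMap { binding ->
            temperatures.map { temp -> Condition(template, binding, temp) }
        }
    }
}

// Cohen's d for two independent samples, using the pooled standard deviation.
fun cohensD(a: List<Double>, b: List<Double>): Double {
    fun mean(xs: List<Double>) = xs.sum() / xs.size
    fun variance(xs: List<Double>): Double {
        val m = mean(xs)
        return xs.sumOf { (it - m) * (it - m) } / (xs.size - 1)
    }
    val pooledSd = kotlin.math.sqrt(
        ((a.size - 1) * variance(a) + (b.size - 1) * variance(b)) / (a.size + b.size - 2)
    )
    return (mean(a) - mean(b)) / pooledSd
}

Applying enumerateConditions to the example configuration at the top of this page yields the same 4 conditions counted earlier.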
## Embedded Execution (Headless)
Invoke this task directly within your CI/CD pipeline or automated scripts using the UnifiedHarness.
// 1. Initialize the Harness
val harness = UnifiedHarness(serverless = true, openBrowser = false)
harness.start()
// 2. Define the Experiment Configuration
val executionConfig = LLMExperimentTask.LLMExperimentTaskExecutionConfigData(
prompt_templates = listOf("Explain {topic} to a {audience}"),
prompt_variables = mapOf(
"topic" to listOf("Quantum Computing"),
"audience" to listOf("5-year old", "PhD")
),
temperature_values = listOf(0.1, 0.8),
repetitions = 3,
metrics = listOf("clarity", "accuracy")
)
// 3. Run the Task
harness.runTask(
taskType = LLMExperimentTask.LLMExperiment,
typeConfig = TaskTypeConfig(), // Default static config
executionConfig = executionConfig,
workspace = File("./experiments"),
autoFix = true
)
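After the run completes, the generated report and statistical artifacts should land under the configured workspace (./experiments in this example).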
## CLI Usage
Run via the Cognotik CLI for quick characterization studies.
java -jar cognotik-cli.jar \
--task LLMExperiment \
--prompt_templates "Summarize: {text}" \
--prompt_variables '{"text": ["Long article A", "Long article B"]}' \
--repetitions 5 \
--workspace ./results
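Assuming a single default temperature value, this command expands to 2 conditions (one per text value) × 5 repetitions = 10 trials, with the report written under ./results.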
## Prompt Segment
The following capabilities are injected into the orchestrator's context:
LLMExperiment - Conduct controlled experiments on LLM behavior
  - Specify one or more prompt templates with variables for substitution
  - Define experimental conditions (temperature(s), prompt variations)
  - Configure number of repetitions for statistical validity
  - Rate custom attributes in responses
  - Analyze statistical significance of results