Significance
This guide covers the configuration features of Kubit's Significance report: Comparison Criteria, Metrics, Test Settings, and how to interpret results.
When to Use Significance
A Significance report is the right tool whenever you need to determine whether an observed difference between two groups is statistically real or just random noise.
Use Case | Example |
|---|---|
A/B testing | Compare conversion rates between a control prompt template and a variant to see if the new template performs significantly better. |
Segment comparison | Test whether power users have a statistically different error rate than casual users. |
Model evaluation | Determine whether switching from one LLM to another produces a significant change in latency or cost. |
Feature rollout | Validate that a new tool call implementation actually improves response quality compared to the previous version. |
Experiment analysis | Analyze a formal experiment with named control and treatment arms. |
Choosing the right criteria mode:
Use Breakdown when you want to compare values of a single field (e.g., model A vs model B).
Use Segment when you want to compare predefined cohorts (e.g., power users vs casual users).
Use Experiment when you have a formal experiment with named variants.
Comparison Criteria
The criteria define what you're comparing — the control group and one or two variants.
Breakdown Mode
Compare specific values of a field. Select a dimension field, then choose which value is the control and which values are the variants.
Example: Field = "Model Name", Control = "gpt-4", Variant A = "claude-sonnet"
Segment Mode
Compare predefined user cohorts. Select a saved cohort as the control and one or two other cohorts as variants.
Example: Control = "Free tier users" cohort, Variant A = "Premium users" cohort
Experiment Mode
Compare named variants from a formal experiment. Select an experiment ID, then choose which variant is the control and which are the treatment arms.
Constraints
Constraint | Details |
|---|---|
Control group | Exactly 1 required |
Variants | 1 or 2 (Variant A required, Variant B optional) |
Metrics
The metric defines what you're measuring — the performance indicator used to compare control and variant groups.
Metric Types
Type | Configuration | Use Case |
|---|---|---|
Measure | Select a measure function + field directly | Quick metric comparison without a pre-built report |
Query | Reference a saved Query report + select a measure from it | Reuse an existing metric definition |
Funnel | Reference a saved Funnel report + specify which step to measure | Test significance of conversion at a specific funnel step |
Retention | Reference a saved Retention report | Test significance of retention differences |
For Funnel metrics, you specify which step's conversion to test and can optionally include time-to-convert as an additional metric.
Test Settings
Configure the statistical parameters for the hypothesis test.
Test Type
Type | Description |
|---|---|
Two-Tailed | Tests whether the variant is significantly different from the control (higher or lower). The default. |
One-Tailed | Tests whether the variant is significantly better than the control (in one direction only). |
Use two-tailed when you don't have a directional hypothesis. Use one-tailed when you specifically expect an improvement.
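To make the difference between the two test types concrete, here is a minimal Python sketch. It assumes a large-sample normal approximation for the test score and treats "better" as "higher"; the function name `p_values` is hypothetical and this is not Kubit's documented computation.

```python
from statistics import NormalDist


def p_values(test_score: float) -> dict[str, float]:
    """Return two-tailed and one-tailed (upper) p-values for a test score.

    Assumption: the score is approximately standard normal under the
    null hypothesis, which is reasonable only for large samples.
    """
    std_normal = NormalDist()
    # One-tailed: probability of a score at least this high.
    one_tailed = 1.0 - std_normal.cdf(test_score)
    # Two-tailed: probability of a score at least this extreme in either direction.
    two_tailed = 2.0 * (1.0 - std_normal.cdf(abs(test_score)))
    return {"one_tailed": one_tailed, "two_tailed": two_tailed}


result = p_values(1.8)
print(result)
```

For a positive score the two-tailed p-value is twice the one-tailed one, so a score of 1.8 clears a 0.05 threshold in a one-tailed test but not in a two-tailed test.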
Significance Level (P-Value Threshold)
The maximum p-value at which you consider the result statistically significant.
Setting | Details |
|---|---|
Range | 0.01 to 0.10 |
Default | 0.05 |
With a threshold of 0.05, a result counts as significant only if a difference at least as large as the one observed would occur no more than 5% of the time when there is no true difference. Lower thresholds are more conservative.
Confidence Level
The long-run coverage of the reported confidence interval: at 95%, about 95 of every 100 intervals built this way would contain the true effect.
Setting | Details |
|---|---|
Range | 90% to 99% |
Default | 95% |
Subject
Select the entity type for the analysis (User, Trace, Session, or Span). This determines how sample size is counted.
Constraints
Constraint | Details |
|---|---|
Time granularity | All Time only — Significance tests aggregate across the full date range. |
Sampling | Not supported — all data is used. |
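The settings and their ranges can be summarized as a small validation sketch in Python. The class and field names (`SignificanceSettings`, `test_type`, and so on) are hypothetical illustrations, not Kubit's actual configuration schema; only the ranges, defaults, and subject types come from this guide.

```python
from dataclasses import dataclass

SUBJECTS = ("User", "Trace", "Session", "Span")
TEST_TYPES = ("two-tailed", "one-tailed")


@dataclass(frozen=True)
class SignificanceSettings:
    # Hypothetical container; the guide states no default subject.
    subject: str
    test_type: str = "two-tailed"      # documented default
    significance_level: float = 0.05   # documented default
    confidence_level: float = 0.95     # documented default

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the settings are valid."""
        problems = []
        if self.subject not in SUBJECTS:
            problems.append(f"subject must be one of {SUBJECTS}")
        if self.test_type not in TEST_TYPES:
            problems.append(f"test_type must be one of {TEST_TYPES}")
        if not 0.01 <= self.significance_level <= 0.10:
            problems.append("significance_level must be between 0.01 and 0.10")
        if not 0.90 <= self.confidence_level <= 0.99:
            problems.append("confidence_level must be between 0.90 and 0.99")
        return problems


print(SignificanceSettings(subject="Trace").validate())
print(SignificanceSettings(subject="Trace", significance_level=0.2).validate())
```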
Interpreting Results
Statistical Method
Kubit uses different statistical tests depending on the metric type:
Metric Type | Test Used | Description |
|---|---|---|
Measure, Query, Retention | Welch's t-test | Compares means of two groups without assuming equal variances. |
Funnel | Chi-Square test | Tests independence in conversion (converted vs. not converted) contingency tables. |
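As a rough illustration of the two tests, the Python sketch below computes the textbook Welch's t statistic (with Welch-Satterthwaite degrees of freedom) from summary statistics, and a Pearson chi-square statistic for a 2x2 conversion table. The function names are hypothetical, and the absence of a continuity correction is an assumption; the guide does not say how Kubit handles these details.

```python
import math


def welch_t(mean_c, sd_c, n_c, mean_v, sd_v, n_v):
    """Welch's t statistic and degrees of freedom from group summaries."""
    var_c = sd_c ** 2 / n_c  # squared standard error of the control mean
    var_v = sd_v ** 2 / n_v  # squared standard error of the variant mean
    t_stat = (mean_v - mean_c) / math.sqrt(var_c + var_v)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (var_c + var_v) ** 2 / (var_c ** 2 / (n_c - 1) + var_v ** 2 / (n_v - 1))
    return t_stat, df


def chi_square_2x2(conv_c, n_c, conv_v, n_v):
    """Pearson chi-square for converted vs. not converted (assumes no continuity correction)."""
    observed = [[conv_c, n_c - conv_c], [conv_v, n_v - conv_v]]
    row_totals = [n_c, n_v]
    total = n_c + n_v
    col_totals = [conv_c + conv_v, total - conv_c - conv_v]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


t_stat, df = welch_t(100.0, 20.0, 50, 110.0, 30.0, 50)
chi2, p_value = chi_square_2x2(100, 1000, 130, 1000)
print(t_stat, df)
print(chi2, p_value)
```

Turning the t statistic into a p-value requires the t-distribution CDF, which is not in Python's standard library, so the sketch stops at the statistic and degrees of freedom.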
Results Summary
The results table shows one row per metric with these columns:
Column | Description |
|---|---|
Metric | The metric being tested |
Control | Mean, standard deviation, and sample size for the control group |
Variant A / B | Lift, confidence interval, statistical significance, and sample size for each variant |
A star marks the leading (higher-performing) variant.
Variant Detail Metrics
Hover over a variant to see the full statistical detail:
Metric | Description |
|---|---|
Lift | Percentage change relative to control: (Variant Mean − Control Mean) / Control Mean × 100% |
Lift Confidence Interval | Range within which the true lift likely falls (lower% to upper%) |
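Stat Sig | True or False, with the p-value. True means the p-value is at or below the significance level, so the difference is unlikely to be due to chance alone. |
Power | Statistical power of the test (0–1): the probability that the test detects a real difference when one exists. Higher power means a real difference is less likely to be missed. |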
Test Score | The t-statistic (Welch's t-test) or χ² statistic (Chi-Square test) |
Delta | Absolute difference in means (Variant Mean − Control Mean) |
Mean | Average metric value for the variant |
Std Deviation | Spread of metric values |
Sample Size | Number of subjects in the variant |
Std Error | Standard error of the difference (not shown for Funnel metrics) |
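The Lift and Delta formulas in the table can be sketched in a few lines of Python. The function name `variant_detail` and the latency figures are hypothetical; only the two formulas come from this guide.

```python
def variant_detail(control_mean: float, variant_mean: float) -> dict[str, float]:
    """Compute Delta and Lift as defined in the table above."""
    if control_mean == 0:
        # Lift is a ratio to the control mean, so it is undefined here.
        raise ValueError("lift is undefined when the control mean is zero")
    delta = variant_mean - control_mean
    lift_pct = delta / control_mean * 100.0
    return {"delta": delta, "lift_pct": lift_pct}


# Hypothetical example: control averages 200 ms, variant averages 180 ms.
detail = variant_detail(control_mean=200.0, variant_mean=180.0)
print(detail)  # delta is -20 ms, lift is about -10%
```

A negative lift is not necessarily bad: for metrics such as latency or error rate, lower values are the improvement.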
Performance Chart
A visual confidence interval chart shows each variant's lift as a point with error bars. The horizontal axis represents lift percentage. If a variant's confidence interval doesn't cross zero, its result is statistically significant at the corresponding level.
Prompting Kubit Through MCP
When using Kubit through MCP, you create Significance reports by describing what you're comparing and what metric to test. The MCP server translates your request into the appropriate statistical test configuration.
Effective Prompts
A good Significance prompt specifies:
Control and variant — the two (or three) groups being compared, with the control identified
Metric — what to measure (e.g., average latency, conversion rate, retention)
Criteria mode (if not Breakdown) — breakdown field, cohort segments, or experiment
Date range — the time window
Test settings (if not defaults) — confidence level, p-value threshold, one-tailed vs two-tailed
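If you generate prompts programmatically, the elements listed above can be assembled with a small helper. This is a sketch only: `build_significance_prompt` is a hypothetical function, not part of Kubit or its MCP server, and the wording it produces is just one phrasing that follows this checklist.

```python
from typing import Optional


def build_significance_prompt(
    control: str,
    variants: list[str],
    metric: str,
    date_range: Optional[str] = None,
    settings: Optional[str] = None,
) -> str:
    """Assemble a natural-language Significance prompt from its parts."""
    if not 1 <= len(variants) <= 2:
        raise ValueError("a Significance report takes 1 or 2 variants")
    groups = f"{control} (control) vs " + " and ".join(variants)
    prompt = f"Run a significance test comparing {groups} on {metric}"
    if date_range:
        prompt += f" over {date_range}"
    if settings:
        prompt += f" {settings}"
    return prompt


prompt = build_significance_prompt(
    control="model gpt-4",
    variants=["claude-sonnet"],
    metric="average latency",
    date_range="the last 30 days",
    settings="at 99% confidence",
)
print(prompt)
```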
Examples by Complexity
Simple — compare two values of a field:
"Is the difference in average latency between gpt-4 and claude-sonnet statistically significant?"
Medium — with explicit control and settings:
"Run a significance test comparing model gpt-4 (control) vs claude-sonnet on average latency over the last 30 days at 99% confidence"
Advanced — cohort-based comparison:
"Test whether premium users have significantly different error rates than free users, using the 'Premium' and 'Free Tier' cohorts"
Specialized — funnel metric with direction:
"Run a one-tailed significance test on step 3 conversion of the checkout funnel, comparing Android vs iOS users — testing whether Android is better"
Tips
Say "control" or "baseline" to identify which group is the reference — the other is automatically the variant.
Defaults are two-tailed test, p=0.05, 95% confidence. Say "one-tailed" if you expect a specific direction of improvement.
Say "at p=0.01" or "with 99% confidence" to tighten the significance threshold.
Significance always aggregates across the full date range (All Time) — time granularity is not configurable.
You can test up to 2 variants against 1 control. Say "variant A" and "variant B" to compare both.