Significance

This guide covers the configuration features of Kubit's Significance report: Comparison Criteria, Metrics, Test Settings, and how to interpret results.

When to Use Significance

A Significance report is the right tool whenever you need to determine whether an observed difference between a control group and its variants is statistically significant or could plausibly be random noise.

Use Case	Example
A/B testing	Compare conversion rates between a control prompt template and a variant to see if the new template performs significantly better.
Segment comparison	Test whether power users have a statistically different error rate than casual users.
Model evaluation	Determine whether switching from one LLM to another produces a significant change in latency or cost.
Feature rollout	Validate that a new tool call implementation actually improves response quality compared to the previous version.
Experiment analysis	Analyze a formal experiment with named control and treatment arms.

Choosing the right criteria mode:

Use Breakdown when you want to compare values of a single field (e.g., model A vs model B).
Use Segment when you want to compare predefined cohorts (e.g., power users vs casual users).
Use Experiment when you have a formal experiment with named variants.

Comparison Criteria

The criteria define what you're comparing — the control group and one or two variants.

Breakdown Mode

Compare specific values of a field. Select a dimension field, then choose which value is the control and which values are the variants.

Example: Field = "Model Name", Control = "gpt-4", Variant A = "claude-sonnet"

Segment Mode

Compare predefined user cohorts. Select a saved cohort as the control and one or two other cohorts as variants.

Example: Control = "Free tier users" cohort, Variant A = "Premium users" cohort

Experiment Mode

Compare named variants from a formal experiment. Select an experiment ID, then choose which variant is the control and which are the treatment arms.

Constraints

Constraint	Details
Control group	Exactly 1 required
Variants	1 or 2 (Variant A required, Variant B optional)
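As a rough illustration, the constraints above can be written as a small validation routine. The ComparisonCriteria class and its field names are hypothetical and are not part of Kubit's API; they only mirror the rules in this section.

```python
from dataclasses import dataclass
from typing import List, Optional

VALID_MODES = {"breakdown", "segment", "experiment"}


@dataclass
class ComparisonCriteria:
    """Hypothetical container for comparison criteria (not Kubit's API)."""
    mode: str                         # "breakdown", "segment", or "experiment"
    control: str                      # exactly one control group
    variant_a: str                    # Variant A is required
    variant_b: Optional[str] = None   # Variant B is optional

    def variants(self) -> List[str]:
        """Return the one or two configured variants."""
        return [v for v in (self.variant_a, self.variant_b) if v]

    def validate(self) -> None:
        """Raise ValueError if the criteria break the documented constraints."""
        if self.mode not in VALID_MODES:
            raise ValueError(f"unknown criteria mode: {self.mode!r}")
        if not self.control:
            raise ValueError("exactly one control group is required")
        if not self.variant_a:
            raise ValueError("Variant A is required")


criteria = ComparisonCriteria(mode="breakdown", control="gpt-4", variant_a="claude-sonnet")
criteria.validate()
print(criteria.variants())  # ['claude-sonnet']
```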

Metrics

The metric defines what you're measuring — the performance indicator used to compare control and variant groups.

Metric Types

Type	Configuration	Use Case
Measure	Select a measure function + field directly	Quick metric comparison without a pre-built report
Query	Reference a saved Query report + select a measure from it	Reuse an existing metric definition
Funnel	Reference a saved Funnel report + specify which step to measure	Test significance of conversion at a specific funnel step
Retention	Reference a saved Retention report	Test significance of retention differences

For Funnel metrics, you specify which step's conversion to test and can optionally include time-to-convert as an additional metric.

Test Settings

Configure the statistical parameters for the hypothesis test.

Test Type

Type	Description
Two-Tailed	Tests whether the variant is significantly different from the control (higher or lower). The default.
One-Tailed	Tests whether the variant is significantly better than the control (in one direction only).

Use two-tailed when you don't have a directional hypothesis. Use one-tailed when you specifically expect an improvement.
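To make the difference concrete, here is a minimal sketch of how the two p-values relate for the same test score. It assumes a normal approximation (reasonable for large samples) and assumes "better" means a higher metric value; it is not necessarily how Kubit computes its p-values.

```python
from statistics import NormalDist


def p_values_from_z(z: float) -> dict:
    """Two-tailed and one-tailed p-values for a z-like test score.

    Assumption: the one-tailed hypothesis is "variant > control".
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return {
        # Probability of a score at least this extreme in either direction
        "two_tailed": 2.0 * (1.0 - std_normal.cdf(abs(z))),
        # Probability of a score at least this large in the expected direction
        "one_tailed_greater": 1.0 - std_normal.cdf(z),
    }


result = p_values_from_z(1.8)
print(round(result["two_tailed"], 3), round(result["one_tailed_greater"], 3))
```

With a score of 1.8 the two-tailed p-value is roughly 0.072 and the one-tailed p-value roughly 0.036, so the same data can pass a 0.05 threshold one-tailed and fail it two-tailed. This is why the direction should be chosen before looking at the results.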

Significance Level (P-Value Threshold)

The maximum p-value at which you consider the result statistically significant.

Setting	Details
Range	0.01 to 0.10
Default	0.05

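A threshold of 0.05 means that, if there were no true difference between the groups, a difference at least as large as the one observed would appear through random variation about 5% of the time. Lower thresholds are more conservative: they require stronger evidence before a result is called significant.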
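One way to build intuition for the threshold is to simulate comparisons in which control and variant are drawn from the same distribution, so any "significant" result is a false positive. The sketch below uses synthetic data and a simple z-test with known variance; the function name and parameters are illustrative assumptions, not Kubit behavior.

```python
import random
from statistics import NormalDist


def false_positive_rate(alpha=0.05, trials=2000, n=50, seed=7):
    """Share of no-difference comparisons that are flagged as significant."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    flagged = 0
    for _ in range(trials):
        # Both groups come from the same distribution: there is no real effect.
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        variant = [rng.gauss(0.0, 1.0) for _ in range(n)]
        diff = sum(variant) / n - sum(control) / n
        # With unit variance in each group, the standard error of the
        # difference in means is sqrt(1/n + 1/n).
        z = diff / (2.0 / n) ** 0.5
        p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
        if p_value <= alpha:
            flagged += 1
    return flagged / trials


print(false_positive_rate(alpha=0.05))
```

The printed rate should land close to 0.05, and lowering alpha to 0.01 should bring it close to 0.01: the threshold controls how often a non-existent difference is reported as significant.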

Confidence Level

The coverage level of the reported confidence interval: at 95%, intervals constructed this way would contain the true effect in about 95% of repeated experiments.

Setting	Details
Range	90% to 99%
Default	95%
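As a sketch of what the confidence level controls, the snippet below converts a level into the multiplier applied to a standard error when building a two-sided interval. It assumes a normal approximation; the helper names are hypothetical, and Kubit's exact interval construction may differ.

```python
from statistics import NormalDist


def critical_z(confidence_level: float) -> float:
    """Two-sided critical value for a confidence level such as 0.95."""
    tail = (1.0 - confidence_level) / 2.0   # probability left in each tail
    return NormalDist().inv_cdf(1.0 - tail)


def normal_interval(estimate: float, std_error: float, confidence_level: float = 0.95):
    """Interval estimate +/- z * standard error (normal approximation)."""
    z = critical_z(confidence_level)
    return estimate - z * std_error, estimate + z * std_error


for level in (0.90, 0.95, 0.99):
    print(level, round(critical_z(level), 3))
```

The multipliers are about 1.645, 1.960, and 2.576 for 90%, 95%, and 99%, so raising the confidence level widens the interval for the same data.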

Subject

Select the entity type for the analysis (User, Trace, Session, or Span). This determines how sample size is counted.
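The effect of the subject on sample size can be sketched with a few event records. The record layout and the user_id, session_id, and trace_id field names are hypothetical, chosen only to show that the same data yields a different sample size for each subject.

```python
# Hypothetical event records; field names are illustrative, not Kubit's schema.
events = [
    {"user_id": "u1", "session_id": "s1", "trace_id": "t1"},
    {"user_id": "u1", "session_id": "s1", "trace_id": "t2"},
    {"user_id": "u1", "session_id": "s2", "trace_id": "t3"},
    {"user_id": "u2", "session_id": "s3", "trace_id": "t4"},
]


def sample_size(records, subject):
    """Count distinct subjects (e.g. "user", "session", "trace") in the records."""
    key = f"{subject}_id"
    return len({record[key] for record in records})


for subject in ("user", "session", "trace"):
    print(subject, sample_size(events, subject))
```

Here the same four events count as 2 users, 3 sessions, or 4 traces. A larger sample size generally narrows the confidence interval, so the subject should match the unit the hypothesis is actually about.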

Constraints

Constraint	Details
Time granularity	All Time only — Significance tests aggregate across the full date range.
Sampling	Not supported — all data is used.

Interpreting Results

Statistical Method

Kubit uses different statistical tests depending on the metric type:

Metric Type	Test Used	Description
Measure, Query, Retention	Welch's t-test	Compares means of two groups without assuming equal variances.
Funnel	Chi-Square test	Tests independence in conversion (converted vs. not converted) contingency tables.
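For readers who want to see what these tests compute, here is a minimal standard-library sketch of both statistics. It uses the textbook formulas (Welch-Satterthwaite degrees of freedom, Pearson chi-square on a 2x2 table without continuity correction); whether Kubit applies a continuity correction or other adjustments is not stated here, so treat the details as assumptions.

```python
import math
from statistics import mean, variance


def welch_t(control, variant):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_c, n_v = len(control), len(variant)
    se2_c = variance(control) / n_c   # squared standard error of each mean
    se2_v = variance(variant) / n_v
    t = (mean(variant) - mean(control)) / math.sqrt(se2_c + se2_v)
    df = (se2_c + se2_v) ** 2 / (se2_c ** 2 / (n_c - 1) + se2_v ** 2 / (n_v - 1))
    return t, df


def chi_square_2x2(conv_c, total_c, conv_v, total_v):
    """Pearson chi-square and p-value for converted vs. not converted."""
    observed = [[conv_c, total_c - conv_c], [conv_v, total_v - conv_v]]
    row_totals = [total_c, total_v]
    col_totals = [conv_c + conv_v, (total_c - conv_c) + (total_v - conv_v)]
    grand = total_c + total_v
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Exact upper-tail probability for a chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


print(welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))   # (1.0, 8.0)
print(chi_square_2x2(100, 1000, 130, 1000))
```

Turning the t statistic into a p-value needs the Student's t distribution, which the standard library does not provide; scipy.stats.ttest_ind with equal_var=False is one commonly used option.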

Results Summary

The results table shows one row per metric with these columns:

Column	Description
Metric	The metric being tested
Control	Mean, standard deviation, and sample size for the control group
Variant A / B	Lift, confidence interval, statistical significance, and sample size for each variant

A star indicator marks the leading variant (higher performing).

Variant Detail Metrics

Hover over a variant to see the full statistical detail:

Metric	Description
Lift	Percentage change relative to control: (Variant Mean − Control Mean) / Control Mean × 100%
Lift Confidence Interval	Range within which the true lift likely falls (lower% to upper%)
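Stat Sig	True or False, with the p-value. True means the p-value is at or below the significance level, so the difference is unlikely to be due to chance.
Power	Statistical power of the test (0–1): the probability of detecting a real effect when one exists. Higher power means a real difference is less likely to be missed.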
Test Score	The t-statistic (Welch's t-test) or χ² statistic (Chi-Square test)
Delta	Absolute difference in means (Variant Mean − Control Mean)
Mean	Average metric value for the variant
Std Deviation	Spread of metric values
Sample Size	Number of subjects in the variant
Std Error	Standard error of the difference (not shown for Funnel metrics)
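The relationships among several of these values can be sketched from raw samples. The function below is illustrative only: the sample values are made up, and the standard error uses the unequal-variance (Welch) form, which is an assumption about how Kubit computes it.

```python
import math
from statistics import mean, stdev


def variant_detail(control, variant):
    """Compute Lift, Delta, Mean, Std Deviation, Sample Size, and Std Error."""
    control_mean = mean(control)
    variant_mean = mean(variant)
    if control_mean == 0:
        raise ValueError("lift is undefined when the control mean is zero")
    delta = variant_mean - control_mean
    # Assumed form: standard error of the difference with unequal variances
    std_error = math.sqrt(
        stdev(control) ** 2 / len(control) + stdev(variant) ** 2 / len(variant)
    )
    return {
        "lift_pct": delta / control_mean * 100.0,
        "delta": delta,
        "mean": variant_mean,
        "std_deviation": stdev(variant),
        "sample_size": len(variant),
        "std_error": std_error,
    }


# Made-up latency samples in milliseconds
detail = variant_detail([100, 110, 90, 100], [120, 130, 110, 120])
print(detail["lift_pct"], detail["delta"], round(detail["std_error"], 3))  # 20.0 20 5.774
```

Lift expresses the change relative to the control, while Delta keeps the metric's original units; a 20 ms Delta on a 100 ms control mean is a 20% Lift.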

Performance Chart

A visual confidence interval chart shows each variant's lift as a point with error bars. The horizontal axis represents lift percentage. If the confidence interval doesn't cross zero, the result is statistically significant at the corresponding level (for example, a 95% interval corresponds to a two-tailed test at p = 0.05).
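The reading rule for the chart can be written as a one-line check. The interval bounds below are made-up numbers for illustration, and the helper name is hypothetical.

```python
def excludes_zero(lower_pct: float, upper_pct: float) -> bool:
    """True when the whole lift interval lies on one side of zero."""
    return lower_pct > 0.0 or upper_pct < 0.0


# Illustrative lift confidence intervals (lower%, upper%)
intervals = {"Variant A": (1.2, 8.5), "Variant B": (-2.0, 4.1)}
for name, (lower, upper) in intervals.items():
    verdict = "excludes zero" if excludes_zero(lower, upper) else "crosses zero"
    print(f"{name}: [{lower}%, {upper}%] {verdict}")
```

An interval entirely below zero also excludes zero; in that case the variant performs measurably worse than the control rather than better.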

Prompting Kubit Through MCP

When using Kubit through MCP, you create Significance reports by describing what you're comparing and what metric to test. The MCP server translates your request into the appropriate statistical test configuration.

Effective Prompts

A good Significance prompt specifies:

Control and variant — the two (or three) groups being compared, with the control identified
Metric — what to measure (e.g., average latency, conversion rate, retention)
Criteria mode — the breakdown field or, if not using Breakdown, the cohort segments or experiment to compare
Date range — the time window
Test settings (if not defaults) — confidence level, p-value threshold, one-tailed vs two-tailed

Examples by Complexity

Simple — compare two values of a field:

"Is the difference in average latency between gpt-4 and claude-sonnet statistically significant?"

Medium — with explicit control and settings:

"Run a significance test comparing model gpt-4 (control) vs claude-sonnet on average latency over the last 30 days at 99% confidence"

Advanced — cohort-based comparison:

"Test whether premium users have significantly different error rates than free users, using the 'Premium' and 'Free Tier' cohorts"

Specialized — funnel metric with direction:

"Run a one-tailed significance test on step 3 conversion of the checkout funnel, comparing Android vs iOS users — testing whether Android is better"

Tips

Say "control" or "baseline" to identify which group is the reference — the other is automatically the variant.
Defaults are two-tailed test, p=0.05, 95% confidence. Say "one-tailed" if you expect a specific direction of improvement.
Say "at p=0.01" to tighten the significance threshold, or "with 99% confidence" to raise the confidence level. These are separate settings.
Significance always aggregates across the full date range (All Time) — time granularity is not configurable.
You can test up to 2 variants against 1 control. Say "variant A" and "variant B" to compare both.