
Significance

This guide covers the configuration features of Kubit's Significance report: Comparison Criteria, Metrics, Test Settings, and how to interpret results.


When to Use Significance

A Significance report is the right tool whenever you need to determine whether an observed difference between two groups is statistically significant or could plausibly be random noise.

| Use Case | Example |
| --- | --- |
| A/B testing | Compare conversion rates between a control prompt template and a variant to see if the new template performs significantly better. |
| Segment comparison | Test whether power users have a statistically different error rate than casual users. |
| Model evaluation | Determine whether switching from one LLM model to another produces a significant change in latency or cost. |
| Feature rollout | Validate that a new tool call implementation actually improves response quality compared to the previous version. |
| Experiment analysis | Analyze a formal experiment with named control and treatment arms. |

Choosing the right criteria mode:

  • Use Breakdown when you want to compare values of a single field (e.g., model A vs model B).

  • Use Segment when you want to compare predefined cohorts (e.g., power users vs casual users).

  • Use Experiment when you have a formal experiment with named variants.


Comparison Criteria

The criteria define what you're comparing — the control group and one or two variants.

Breakdown Mode

Compare specific values of a field. Select a dimension field, then choose which value is the control and which values are the variants.

Example: Field = "Model Name", Control = "gpt-4", Variant A = "claude-sonnet"

Segment Mode

Compare predefined user cohorts. Select a saved cohort as the control and one or two other cohorts as variants.

Example: Control = "Free tier users" cohort, Variant A = "Premium users" cohort

Experiment Mode

Compare named variants from a formal experiment. Select an experiment ID, then choose which variant is the control and which are the treatment arms.

Constraints

| Constraint | Details |
| --- | --- |
| Control group | Exactly 1 required |
| Variants | 1 or 2 (Variant A required, Variant B optional) |


Metrics

The metric defines what you're measuring — the performance indicator used to compare control and variant groups.

Metric Types

| Type | Configuration | Use Case |
| --- | --- | --- |
| Measure | Select a measure function + field directly | Quick metric comparison without a pre-built report |
| Query | Reference a saved Query report + select a measure from it | Reuse an existing metric definition |
| Funnel | Reference a saved Funnel report + specify which step to measure | Test significance of conversion at a specific funnel step |
| Retention | Reference a saved Retention report | Test significance of retention differences |

For Funnel metrics, you specify which step's conversion to test and can optionally include time-to-convert as an additional metric.


Test Settings

Configure the statistical parameters for the hypothesis test.

Test Type

| Type | Description |
| --- | --- |
| Two-Tailed | Tests whether the variant is significantly different from the control (higher or lower). The default. |
| One-Tailed | Tests whether the variant is significantly better than the control (in one direction only). |

Use two-tailed when you don't have a directional hypothesis. Use one-tailed when you specifically expect an improvement.
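The practical difference between the two test types can be illustrated with a small sketch. It assumes a hypothetical test statistic `z` and uses a normal approximation from Python's standard library; it is not Kubit's implementation, only a way to see why a one-tailed test can flag a result that a two-tailed test does not.

```python
from statistics import NormalDist


def p_values(z: float) -> dict[str, float]:
    """Return two-tailed and one-tailed (variant > control) p-values for z."""
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    # One-tailed: probability of a statistic at least this large in the
    # hypothesized direction (variant better than control).
    one_tailed = 1.0 - std_normal.cdf(z)
    # Two-tailed: probability of a statistic at least this extreme in
    # either direction.
    two_tailed = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    return {"one_tailed": one_tailed, "two_tailed": two_tailed}


# Hypothetical statistic: the variant looks moderately better than control.
result = p_values(1.8)
threshold = 0.05  # the documented default significance level
print(result["one_tailed"] <= threshold)  # one-tailed verdict
print(result["two_tailed"] <= threshold)  # two-tailed verdict
```

For the same data and an effect in the hypothesized direction, the one-tailed p-value is half the two-tailed one, so choose one-tailed before looking at results, not after.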

Significance Level (P-Value Threshold)

The maximum p-value at which you consider the result statistically significant.

| Setting | Details |
| --- | --- |
| Range | 0.01 to 0.10 |
| Default | 0.05 |

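A p-value is the probability of seeing a difference at least as large as the one observed if there were no true difference between the groups. With the default threshold of 0.05, a result is flagged as significant only when that probability is 5% or less. Lower thresholds are more conservative.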

Confidence Level

The coverage level of the reported confidence interval: across repeated experiments, intervals built at a 95% confidence level would contain the true effect about 95% of the time.

| Setting | Details |
| --- | --- |
| Range | 90% to 99% |
| Default | 95% |
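To see how the confidence level changes the width of an interval, here is a minimal sketch that builds a normal-approximation interval for the difference in means. The function name, the summary statistics, and the use of a normal critical value are all assumptions for illustration; Kubit's exact interval construction is not described in this guide.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(mean_c, sd_c, n_c, mean_v, sd_v, n_v,
                             confidence=0.95):
    """Normal-approximation interval for (variant mean - control mean)."""
    delta = mean_v - mean_c
    # Standard error of the difference, without assuming equal variances.
    std_error = sqrt(sd_c ** 2 / n_c + sd_v ** 2 / n_v)
    # Two-sided critical value for the requested confidence level.
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return delta - z_crit * std_error, delta + z_crit * std_error


# Hypothetical latency data: control mean 120 ms, variant mean 125 ms.
ci_90 = diff_confidence_interval(120.0, 30.0, 400, 125.0, 32.0, 400, 0.90)
ci_99 = diff_confidence_interval(120.0, 30.0, 400, 125.0, 32.0, 400, 0.99)
print(ci_90)  # narrower interval
print(ci_99)  # wider interval, same data
```

With these made-up numbers the 90% interval stays above zero while the 99% interval includes it, which is why raising the confidence level makes a result harder to call.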

Subject

Select the entity type for the analysis (User, Trace, Session, or Span). This determines how sample size is counted.

Constraints

| Constraint | Details |
| --- | --- |
| Time granularity | All Time only. Significance tests aggregate across the full date range. |
| Sampling | Not supported. All data is used. |
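The documented limits can be collected into a small validation sketch. The `SignificanceSettings` class and its field names are hypothetical and are not part of any Kubit API; only the ranges and allowed values come from this guide.

```python
from dataclasses import dataclass

# Allowed values taken from this guide; the names below are illustrative.
SUBJECTS = {"User", "Trace", "Session", "Span"}
TEST_TYPES = {"two-tailed", "one-tailed"}


@dataclass
class SignificanceSettings:
    control: str
    variants: list[str]
    subject: str
    test_type: str = "two-tailed"     # documented default
    p_value_threshold: float = 0.05   # documented default
    confidence_level: float = 0.95    # documented default

    def validate(self) -> list[str]:
        """Return a list of violated constraints (empty when valid)."""
        problems = []
        if not self.control:
            problems.append("exactly one control group is required")
        if not 1 <= len(self.variants) <= 2:
            problems.append("one or two variants are required")
        if self.subject not in SUBJECTS:
            problems.append("subject must be User, Trace, Session, or Span")
        if self.test_type not in TEST_TYPES:
            problems.append("test type must be two-tailed or one-tailed")
        if not 0.01 <= self.p_value_threshold <= 0.10:
            problems.append("p-value threshold must be between 0.01 and 0.10")
        if not 0.90 <= self.confidence_level <= 0.99:
            problems.append("confidence level must be between 90% and 99%")
        return problems


ok = SignificanceSettings("gpt-4", ["claude-sonnet"], "Trace")
bad = SignificanceSettings("gpt-4", ["a", "b", "c"], "Trace",
                           p_value_threshold=0.2)
print(ok.validate())
print(bad.validate())
```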


Interpreting Results

Statistical Method

Kubit uses different statistical tests depending on the metric type:

| Metric Type | Test Used | Description |
| --- | --- | --- |
| Measure, Query, Retention | Welch's t-test | Compares means of two groups without assuming equal variances. |
| Funnel | Chi-Square test | Tests independence in conversion (converted vs. not converted) contingency tables. |
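As a rough sketch of what these two tests compute, the following standard-library Python calculates the Welch t-statistic with its Welch-Satterthwaite degrees of freedom, and the Pearson chi-square statistic for a 2x2 conversion table. The function names and sample numbers are hypothetical, and details such as whether Kubit applies a continuity correction are not stated in this guide, so treat this as the textbook form rather than Kubit's exact computation.

```python
from math import erfc, sqrt


def welch_t(mean_c, sd_c, n_c, mean_v, sd_v, n_v):
    """Welch's t-statistic and degrees of freedom from summary statistics."""
    var_c = sd_c ** 2 / n_c  # squared standard error of the control mean
    var_v = sd_v ** 2 / n_v  # squared standard error of the variant mean
    t_stat = (mean_v - mean_c) / sqrt(var_c + var_v)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (var_c + var_v) ** 2 / (
        var_c ** 2 / (n_c - 1) + var_v ** 2 / (n_v - 1)
    )
    return t_stat, df


def chi_square_2x2(conv_c, total_c, conv_v, total_v):
    """Pearson chi-square (no continuity correction) for converted vs. not."""
    observed = [[conv_c, total_c - conv_c], [conv_v, total_v - conv_v]]
    row_totals = [total_c, total_v]
    grand = total_c + total_v
    col_totals = [conv_c + conv_v, grand - conv_c - conv_v]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the upper-tail probability has a closed form.
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


t_stat, df = welch_t(120.0, 30.0, 400, 125.0, 32.0, 400)
chi2, p_value = chi_square_2x2(200, 1000, 250, 1000)
print(t_stat, df)
print(chi2, p_value)
```

Turning the t-statistic into a p-value requires the t-distribution, which is outside the standard library; a statistics package would normally handle that step.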

Results Summary

The results table shows one row per metric with these columns:

| Column | Description |
| --- | --- |
| Metric | The metric being tested |
| Control | Mean, standard deviation, and sample size for the control group |
| Variant A / B | Lift, confidence interval, statistical significance, and sample size for each variant |

A star indicator marks the leading variant (higher performing).

Variant Detail Metrics

Hover over a variant to see the full statistical detail:

| Metric | Description |
| --- | --- |
| Lift | Percentage change relative to control: (Variant Mean − Control Mean) / Control Mean × 100% |
| Lift Confidence Interval | Range within which the true lift likely falls (lower% to upper%) |
| Stat Sig | True or False, with the p-value. True means the difference is unlikely to be due to chance at the configured significance level. |
| Power | Statistical power of the test (0–1): the probability of detecting a real difference when one exists. Higher power means a real effect is less likely to be missed. |
| Test Score | The t-statistic (Welch's t-test) or χ² statistic (Chi-Square test) |
| Delta | Absolute difference in means (Variant Mean − Control Mean) |
| Mean | Average metric value for the variant |
| Std Deviation | Standard deviation of the metric values for the variant |
| Sample Size | Number of subjects in the variant |
| Std Error | Standard error of the difference (not shown for Funnel metrics) |
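The Lift and Delta definitions above translate directly into code. This is a small sketch with hypothetical function names and sample values, following the formulas as written in the table.

```python
def delta(control_mean: float, variant_mean: float) -> float:
    """Absolute difference in means: Variant Mean - Control Mean."""
    return variant_mean - control_mean


def lift_percent(control_mean: float, variant_mean: float) -> float:
    """Percentage change relative to control."""
    if control_mean == 0:
        # Lift is undefined when the control mean is zero.
        raise ValueError("control mean must be non-zero to compute lift")
    return (variant_mean - control_mean) / control_mean * 100.0


# Hypothetical example: control averages 120 ms, variant averages 125 ms.
example_delta = delta(120.0, 125.0)
example_lift = lift_percent(120.0, 125.0)
print(example_delta, example_lift)
```

Note that a positive lift is only "better" when higher values are desirable; for metrics such as latency or error rate, a negative lift is the improvement.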

Performance Chart

A confidence interval chart plots each variant's lift as a point with error bars. The horizontal axis represents lift percentage. If a variant's confidence interval does not cross zero, the difference from control is statistically significant at the chosen confidence level.


Prompting Kubit Through MCP

When using Kubit through MCP, you create Significance reports by describing what you're comparing and what metric to test. The MCP server translates your request into the appropriate statistical test configuration.

Effective Prompts

A good Significance prompt specifies:

  1. Control and variant — the two (or three) groups being compared, with the control identified

  2. Metric — what to measure (e.g., average latency, conversion rate, retention)

  3. Criteria mode (if not Breakdown) — breakdown field, cohort segments, or experiment

  4. Date range — the time window

  5. Test settings (if not defaults) — confidence level, p-value threshold, one-tailed vs two-tailed
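If you generate prompts programmatically, the five elements above can be assembled with a helper like the one below. The function and its parameters are purely hypothetical conveniences, since the MCP server takes natural-language requests; nothing here is a Kubit API.

```python
def build_significance_prompt(metric, control, variants, date_range,
                              test_type="two-tailed", confidence=None,
                              p_value=None):
    """Compose a natural-language Significance request from its parts."""
    if not 1 <= len(variants) <= 2:
        raise ValueError("a Significance report takes one or two variants")
    parts = [
        f"Run a {test_type} significance test on {metric},",
        f"comparing {control} (control) vs {' and '.join(variants)}",
        f"over {date_range}",
    ]
    # Only mention settings that differ from the defaults.
    if confidence is not None:
        parts.append(f"at {confidence}% confidence")
    if p_value is not None:
        parts.append(f"with p={p_value}")
    return " ".join(parts)


prompt = build_significance_prompt(
    metric="average latency",
    control="gpt-4",
    variants=["claude-sonnet"],
    date_range="the last 30 days",
    confidence=99,
)
print(prompt)
```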

Examples by Complexity

Simple — compare two values of a field:

"Is the difference in average latency between gpt-4 and claude-sonnet statistically significant?"

Medium — with explicit control and settings:

"Run a significance test comparing model gpt-4 (control) vs claude-sonnet on average latency over the last 30 days at 99% confidence"

Advanced — cohort-based comparison:

"Test whether premium users have significantly different error rates than free users, using the 'Premium' and 'Free Tier' cohorts"

Specialized — funnel metric with direction:

"Run a one-tailed significance test on step 3 conversion of the checkout funnel, comparing Android vs iOS users — testing whether Android is better"

Tips

  • Say "control" or "baseline" to identify which group is the reference — the other is automatically the variant.

  • Defaults are two-tailed test, p=0.05, 95% confidence. Say "one-tailed" if you expect a specific direction of improvement.

  • Say "at p=0.01" or "with 99% confidence" to tighten the significance threshold.

  • Significance always aggregates across the full date range (All Time) — time granularity is not configurable.

  • You can test up to 2 variants against 1 control. Say "variant A" and "variant B" to compare both.