How to Measure AI Model Upgrades Without a Control Group: Synthetic Control in Python

Learn how to use synthetic control in Python to measure the causal impact of AI model upgrades when no control group exists, with step-by-step implementation and validation techniques.

The Global Rollout Problem in AI Product Experimentation

Every product team working with large language models (LLMs) eventually faces a frustrating measurement challenge: when an AI provider ships a new model version, there is no holdout group. Your infrastructure team upgrades every workspace from Claude 4.5 to Claude 4.6 overnight. All fifty production workspaces get the new model simultaneously. A week later, task completion has climbed across the board, and the head of product declares it a success.

But if you are the data scientist, you know something is wrong. No group remained on the old version during the upgrade week. A naïve before-and-after comparison picks up whatever else changed that week: a new onboarding flow, a seasonal uptick, or the arrival of a high-profile customer. This is the global rollout problem. It occurs whenever a team ships a model upgrade to the entire user base at the same time.

For product teams running generative AI features, this is one of the most common measurement traps. Staged rollouts buy you a control group; global rollouts eliminate it. In 2026, global model upgrades are the norm: every API provider pushes new versions, and every team using Claude, GPT, or Gemini has experienced the sudden jump from one version to the next with no opt-out.

What Synthetic Control Does

Synthetic control is the tool data scientists use when the control group is missing. You build a weighted combination of untreated units — other workspaces or regions that were not upgraded at the same time — whose pre-upgrade behavior matches that of the treated unit. After the upgrade, you compare the treated unit to its synthetic twin, and the gap becomes the causal estimate.
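
In symbols (standard synthetic control notation, not anything specific to this tutorial's dataset): call the upgraded workspace unit 1, index the donors 2 through J+1, write Y for the outcome, and let T0 be the last pre-upgrade period. The per-period effect estimate is the post-upgrade gap

$$
\hat{\tau}_t = Y_{1t} - \sum_{j=2}^{J+1} w_j^{*}\, Y_{jt}, \qquad t > T_0,
$$

where the weights w_j* are non-negative, sum to one, and are chosen so that the weighted donor average tracks Y_1t as closely as possible for t ≤ T0.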

This approach relies on three identification assumptions: there is no interference between units, the donor pool is unaffected by the treatment, and the treated unit can be approximated by a weighted combination of donors. When these hold, synthetic control provides a principled method for counterfactual inference.

Prerequisites

To follow along, you will need Python with scipy, numpy, pandas, and matplotlib. All code in this tutorial runs end-to-end in the companion notebook, which you can find at the GitHub repository. The notebook has all outputs pre-executed, so you can read along before running anything locally.
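
If you prefer to run the snippets in this article outside the notebook, a minimal local setup looks something like this (exact package versions are up to you):

```python
# Install the dependencies, for example with:
#   pip install numpy pandas scipy matplotlib
# Each snippet below repeats its own imports so it can run in isolation.
import numpy as np
import pandas as pd
from scipy.optimize import minimize
import matplotlib.pyplot as plt
```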

Building a Synthetic Control from Scratch

Step 0: Set Up the Working Example

We use a synthetic SaaS dataset with 50,000 users across multiple workspaces. The data includes a pre-treatment period, a treatment event (the model upgrade), and a post-treatment period. One workspace receives the global upgrade; the others serve as potential donors. The dataset mimics real product telemetry: daily task completion rates, user counts, and timestamps.
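
The companion notebook ships with its own data, so the exact schema may differ, but a simulated panel along these lines is enough to follow the remaining steps. The workspace count, column names, and the injected effect size below are illustrative assumptions, not values from the tutorial's dataset:

```python
# Illustrative sketch only: simulate a daily panel of task-completion rates
# for 50 workspaces, where workspace 0 receives the model upgrade on day 60.
# Column names, sizes, and the injected +0.04 effect are assumptions made for
# this sketch, not values from the tutorial's dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_workspaces, n_days, upgrade_day = 50, 90, 60
treated_id = 0

records = []
for w in range(n_workspaces):
    baseline = rng.uniform(0.55, 0.75)       # workspace-specific level
    trend = rng.normal(0.0005, 0.0002)       # mild drift over time
    for t in range(n_days):
        rate = baseline + trend * t + rng.normal(0, 0.01)
        if w == treated_id and t >= upgrade_day:
            rate += 0.04                     # simulated upgrade effect
        records.append({"workspace": w, "day": t, "completion_rate": rate})

panel = pd.DataFrame(records)

# Wide layout used in the later steps: rows are days, columns are workspaces.
wide = panel.pivot(index="day", columns="workspace", values="completion_rate")
y_treated = wide[treated_id].to_numpy()
Y_donors = wide.drop(columns=treated_id).to_numpy()
```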

Step 1: Fit Donor Weights with SLSQP

We use scipy.optimize.minimize with the SLSQP method to find donor weights that minimize the pre-treatment difference between the treated unit and a weighted sum of donor units. The optimizer constrains weights to sum to one and be non-negative. The result is a vector of weights that creates the best possible synthetic pre-treatment path.
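
Here is a minimal sketch of that optimization, assuming the `y_treated`, `Y_donors`, and `upgrade_day` objects from the setup sketch above (the notebook's variable names may differ):

```python
# Minimal sketch of the weight fit: constrain the weights to be non-negative
# and to sum to one, and minimise the squared pre-upgrade gap between the
# treated workspace and the weighted donor average.
import numpy as np
from scipy.optimize import minimize

y_pre = y_treated[:upgrade_day]              # treated unit, pre-treatment only
Y_pre = Y_donors[:upgrade_day, :]            # donor units, pre-treatment only

def pre_treatment_loss(w):
    # Sum of squared differences over the pre-upgrade window
    return np.sum((y_pre - Y_pre @ w) ** 2)

n_donors = Y_pre.shape[1]
w0 = np.full(n_donors, 1.0 / n_donors)       # start from equal weights

result = minimize(
    pre_treatment_loss,
    w0,
    method="SLSQP",
    bounds=[(0.0, 1.0)] * n_donors,
    constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
)
weights = result.x

# Synthetic trajectory over the full period (pre and post)
synthetic = Y_donors @ weights
```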

Step 2: Plot Treated vs Synthetic Control Trajectories

Once we have the weights, we compute the synthetic outcome for both pre- and post-treatment periods. Plotting the treated unit alongside its synthetic control reveals the divergence after the upgrade. The post-treatment gap is the estimated treatment effect.
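
A plotting sketch, again assuming the variables from the previous snippets:

```python
# Sketch of the trajectory plot and the per-day gap, assuming y_treated,
# synthetic, and upgrade_day from the previous sketches.
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(9, 4))
ax.plot(y_treated, label="Treated workspace")
ax.plot(synthetic, linestyle="--", label="Synthetic control")
ax.axvline(upgrade_day, color="grey", linestyle=":", label="Model upgrade")
ax.set_xlabel("Day")
ax.set_ylabel("Task completion rate")
ax.legend()
plt.show()

# The post-upgrade gap is the estimated treatment effect per day
effect = y_treated[upgrade_day:] - synthetic[upgrade_day:]
print(f"Average post-upgrade effect: {np.mean(effect):.4f}")
```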

Step 3: In-Space Placebo Permutation Test

To assess statistical significance, we run a placebo permutation test. We re-assign the treatment label to each donor unit in turn, pretending it received the upgrade, and compute the synthetic control gap. If the actual treated unit’s gap is larger than most placebo gaps, we have evidence of a real effect. This is a robust non-parametric test.
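
A sketch of the placebo loop, reusing the `wide` matrix from the setup snippet. The test statistic here is the mean post-upgrade gap, which is the simplest variant; scaling it by pre-treatment fit (a post/pre RMSPE ratio) is a common refinement:

```python
# Sketch of the in-space placebo test, reusing wide, treated_id, y_treated,
# synthetic, and upgrade_day from the earlier sketches.
import numpy as np
from scipy.optimize import minimize

def fit_synthetic(y, Y, cutoff):
    """Fit donor weights on the pre-period and return the full synthetic path."""
    n = Y.shape[1]
    res = minimize(
        lambda w: np.sum((y[:cutoff] - Y[:cutoff] @ w) ** 2),
        np.full(n, 1.0 / n),
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n,
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    return Y @ res.x

def post_gap(y, synth, cutoff):
    """Mean treated-minus-synthetic gap in the post-treatment window."""
    return float(np.mean(y[cutoff:] - synth[cutoff:]))

actual_gap = post_gap(y_treated, synthetic, upgrade_day)

placebo_gaps = []
for unit in wide.columns.drop(treated_id):
    y_p = wide[unit].to_numpy()
    Y_p = wide.drop(columns=[treated_id, unit]).to_numpy()
    placebo_gaps.append(post_gap(y_p, fit_synthetic(y_p, Y_p, upgrade_day), upgrade_day))

# Pseudo p-value: share of placebo gaps at least as extreme as the real one
p_value = float(np.mean(np.abs(placebo_gaps) >= abs(actual_gap)))
print(f"Actual gap: {actual_gap:.4f}, placebo p-value: {p_value:.3f}")
```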

Step 4: Leave-One-Out Donor Sensitivity

We check whether the result depends heavily on any single donor by removing each donor one at a time and re-running the synthetic control. Stable effects suggest a robust estimate; large swings indicate a fragile inference.
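
A sketch of that loop, reusing the helpers defined in the placebo-test snippet:

```python
# Leave-one-out donor check, reusing fit_synthetic, post_gap, actual_gap,
# wide, y_treated, treated_id, and upgrade_day from the sketches above.
loo_effects = {}
for dropped in wide.columns.drop(treated_id):
    Y_loo = wide.drop(columns=[treated_id, dropped]).to_numpy()
    synth_loo = fit_synthetic(y_treated, Y_loo, upgrade_day)
    loo_effects[dropped] = post_gap(y_treated, synth_loo, upgrade_day)

print(f"Full-pool effect:    {actual_gap:.4f}")
print(f"Leave-one-out range: {min(loo_effects.values()):.4f} "
      f"to {max(loo_effects.values()):.4f}")
```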

Step 5: Cluster Bootstrap 95% Confidence Intervals

Finally, we construct a cluster bootstrap confidence interval. Drawing clustered resamples of the donor workspaces, we repeat the weight optimization and effect estimation. The 2.5th and 97.5th percentiles from the bootstrap distribution give us a 95% confidence interval for the causal effect.
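
A sketch of the bootstrap loop; the number of replications here is an arbitrary choice to keep the runtime modest, and the helpers again come from the placebo-test snippet:

```python
# Cluster bootstrap sketch: resample donor workspaces (the clusters) with
# replacement, refit the weights each time, and take the 2.5th and 97.5th
# percentiles of the resulting effect estimates. Reuses fit_synthetic,
# post_gap, wide, y_treated, treated_id, and upgrade_day from above.
import numpy as np

rng = np.random.default_rng(0)
donor_ids = list(wide.columns.drop(treated_id))
boot_effects = []

for _ in range(200):
    sample = list(rng.choice(donor_ids, size=len(donor_ids), replace=True))
    Y_boot = wide[sample].to_numpy()         # duplicated donor columns are fine
    synth_boot = fit_synthetic(y_treated, Y_boot, upgrade_day)
    boot_effects.append(post_gap(y_treated, synth_boot, upgrade_day))

lo, hi = np.percentile(boot_effects, [2.5, 97.5])
print(f"95% bootstrap CI for the effect: [{lo:.4f}, {hi:.4f}]")
```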

When Synthetic Control Fails

Synthetic control is not a silver bullet. It fails when the pre-treatment fit is poor, meaning no weighted combination of donors can approximate the treated unit. It also fails if there is interference between units (e.g., the upgrade affects donor workspaces through network effects) or if the treatment is correlated with time-varying confounders that differ between the treated unit and the donors. Always test the key assumptions with placebo tests and sensitivity analyses.

What to Do Next

This tutorial gives you a complete, end-to-end implementation of synthetic control for evaluating global model rollouts. The same method applies to any product change that affects all users at once — feature launches, pricing changes, or policy updates. Next steps: apply the code to your own data, experiment with different donor pools, and compare results to other causal inference methods like difference-in-differences or interrupted time series.

For further reading, explore the companion notebook on GitHub, which includes pre-run outputs and adjustable hyperparameters.