Phase 11a-2: Pennant criteria A/B test (Variant 2)
Re-run of the Phase 11a A/B test against a less-aggressive variant.
Baseline events and outcomes from Phase 11a are reused unchanged. V1
numbers are pulled from reports/pennant_criteria_ab_test_2026-05-11.md
and included in side-by-side tables where they fit.
The override mechanism, outcome formula, and universe are identical to Phase 11a. No production code, config, or data was modified.
1. Configuration
| Parameter | Baseline | Variant 2 | (Variant 1, ref) |
|---|---|---|---|
| pennant.min_duration_bars | 5 | 7 | 10 |
| pennant.max_duration_bars | 15 | 17 | 20 |
| flagpole.min_duration_bars | 1 | 1 | 1 |
| flagpole.max_duration_bars | 10 | 5 | 5 |
| pennant.max_retrace_pct | 0.382 | 0.382 | 0.382 |
| flagpole.min_magnitude_pct | 12.0 | 12.0 | 12.0 |
| flagpole.min_atr_multiple | 4.0 | 4.0 | 4.0 |
| flagpole.volume_ratio_min | 1.5 | 1.5 | 1.5 |
| trend_filter (EMA_55 ≥ 10d prior) | on | on | on |
Date range scanned: 2007-02-15 → 2026-05-08. Universe: 2,974 active tickers (2,413 with ≥300 bars to qualify). Earliest v2 anchor: 2007-02-15.
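The parameter sets in the table can be expressed as overrides on top of the baseline. The sketch below is illustrative only: the dotted keys and values come from the table, but the flat-dict layout and the `apply_overrides` helper are assumptions made for this example and are not the Phase 11a override mechanism itself.

```python
# Baseline values copied from the configuration table.
BASELINE = {
    "pennant.min_duration_bars": 5,
    "pennant.max_duration_bars": 15,
    "flagpole.min_duration_bars": 1,
    "flagpole.max_duration_bars": 10,
    "pennant.max_retrace_pct": 0.382,
    "flagpole.min_magnitude_pct": 12.0,
    "flagpole.min_atr_multiple": 4.0,
    "flagpole.volume_ratio_min": 1.5,
}

# Each variant lists only the parameters that differ from baseline.
VARIANT_2 = {
    "pennant.min_duration_bars": 7,
    "pennant.max_duration_bars": 17,
    "flagpole.max_duration_bars": 5,
}
VARIANT_1 = {
    "pennant.min_duration_bars": 10,
    "pennant.max_duration_bars": 20,
    "flagpole.max_duration_bars": 5,
}


def apply_overrides(base, overrides):
    """Return a new config dict (hypothetical helper); rejects unknown keys."""
    unknown = set(overrides) - set(base)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    return {**base, **overrides}


v2_config = apply_overrides(BASELINE, VARIANT_2)
v1_config = apply_overrides(BASELINE, VARIANT_1)

# Parameters on which V1 and V2 disagree.
v1_v2_diff = sorted(k for k in BASELINE if v1_config[k] != v2_config[k])
print(v1_v2_diff)
```

Listing the V1/V2 difference this way makes it explicit that the two variants share the flagpole change and differ only in the pennant duration bounds.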
2. Detection counts
| Population | Events | vs Baseline |
|---|---|---|
| Baseline | 15,534 | 1.00× |
| Variant 2 | 7,428 | 0.478× |
| Variant 1 (ref) | 5,155 | 0.332× |
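V2 keeps about 48% of baseline volume. In count terms it sits closer to V1 than to Baseline: the gap between V1 and V2 (2,273 events) is about 22% of the gap between V1 and Baseline (10,379 events).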
Per-year counts
| Year | Baseline | V2 | V1 | V2 / Baseline |
|---|---|---|---|---|
| 2007 | 400 | 173 | 127 | 0.43 |
| 2008 | 226 | 108 | 69 | 0.48 |
| 2009 | 702 | 311 | 238 | 0.44 |
| 2010 | 619 | 281 | 172 | 0.45 |
| 2011 | 471 | 218 | 158 | 0.46 |
| 2012 | 564 | 270 | 187 | 0.48 |
| 2013 | 875 | 336 | 233 | 0.38 |
| 2014 | 498 | 258 | 178 | 0.52 |
| 2015 | 533 | 255 | 186 | 0.48 |
| 2016 | 806 | 343 | 249 | 0.43 |
| 2017 | 871 | 413 | 301 | 0.47 |
| 2018 | 719 | 356 | 253 | 0.50 |
| 2019 | 748 | 392 | 260 | 0.52 |
| 2020 | 1,168 | 552 | 357 | 0.47 |
| 2021 | 1,336 | 644 | 455 | 0.48 |
| 2022 | 533 | 261 | 189 | 0.49 |
| 2023 | 1,024 | 540 | 349 | 0.53 |
| 2024 | 1,780 | 908 | 662 | 0.51 |
| 2025 | 1,232 | 589 | 397 | 0.48 |
| 2026 (YTD) | 429 | 220 | 135 | 0.51 |
The V2 / Baseline ratio is stable across regimes (0.38 – 0.53). No year shows severely disproportionate filtering; 2013 (0.38) is the only year below 0.43.
Per-sector counts
| Sector | Baseline | V2 | V2 / Baseline |
|---|---|---|---|
| Healthcare | 3,206 | 1,601 | 0.50 |
| Technology | 2,694 | 1,338 | 0.50 |
| Industrials | 2,538 | 1,238 | 0.49 |
| Financial Services | 2,067 | 877 | 0.42 |
| Consumer Cyclical | 1,916 | 934 | 0.49 |
| Consumer Defensive | 744 | 359 | 0.48 |
| Energy | 677 | 295 | 0.44 |
| Basic Materials | 672 | 311 | 0.46 |
| Communication Services | 590 | 289 | 0.49 |
| Real Estate | 328 | 141 | 0.43 |
| Utilities | 102 | 45 | 0.44 |
Sector mix is preserved: V2 / Baseline ratios range from 0.42 (Financial Services) to 0.50 (Healthcare, Technology).
3. Overlap analysis (Baseline vs V2)
| Measure | Count |
|---|---|
| Exact-anchor match (same symbol + same event_date) | 1,995 |
| Fuzzy match within ±1 calendar day (baseline events with ≥1 nearby V2) | 3,397 |
| Fuzzy match within ±1 calendar day (V2 events with ≥1 nearby baseline) | 3,397 |
| Baseline-only events (no V2 within ±1 day) | 12,137 |
| V2-only events (no baseline within ±1 day) | 4,031 |
Interpretation. Of the 7,428 V2 events, 4,031 (54%) are not present
in the baseline population within a ±1-day window. As with V1, this is
driven by (a) V2's higher pennant maximum (17 vs 15 bars) admitting
some longer consolidations, and (b) the different min_duration_bars
causing the inner-loop dedup to pick different anchors for the same
symbol. V2 is not a strict subset of baseline.
For reference: V1 had 2,965 V1-only events (57% of the V1 population was net new). V2's net-new fraction is similar (54%), which suggests that a duration-window change reshuffles anchors rather than simply filtering the baseline set.
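The exact and ±1-day fuzzy counts in the table can be reproduced with a per-symbol date index. The following is a minimal sketch using only the standard library; representing each event as a `(symbol, event_date)` tuple and the function name `overlap_counts` are assumptions for illustration, and the real analyze_v2.py may be organised differently.

```python
from collections import defaultdict
from datetime import date, timedelta


def overlap_counts(baseline, variant, tolerance_days=1):
    """Count exact and fuzzy anchor matches between two event lists.

    Each event is a (symbol, event_date) tuple (assumed layout).
    """
    def index(events):
        by_symbol = defaultdict(set)
        for symbol, day in events:
            by_symbol[symbol].add(day)
        return by_symbol

    offsets = [timedelta(days=d)
               for d in range(-tolerance_days, tolerance_days + 1)]

    def matched(events, other_index):
        # An event matches if the other population has the same symbol
        # within +/- tolerance_days calendar days.
        return sum(
            1 for symbol, day in events
            if any(day + off in other_index.get(symbol, ()) for off in offsets)
        )

    base_matched = matched(baseline, index(variant))
    var_matched = matched(variant, index(baseline))
    return {
        "exact": len(set(baseline) & set(variant)),
        "baseline_matched": base_matched,
        "variant_matched": var_matched,
        "baseline_only": len(baseline) - base_matched,
        "variant_only": len(variant) - var_matched,
    }


# Toy data, not taken from the study.
toy_baseline = [("AAA", date(2020, 1, 6)), ("AAA", date(2020, 3, 2)),
                ("BBB", date(2021, 5, 10))]
toy_variant = [("AAA", date(2020, 1, 7)), ("BBB", date(2021, 5, 10)),
               ("CCC", date(2022, 8, 1))]
result = overlap_counts(toy_baseline, toy_variant)
print(result)
```

The two fuzzy counts are computed separately because they need not be equal in general: one event can sit within a day of several events on the other side.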
4. MFE distribution comparison
| Stat (MFE %) | Baseline | V2 | V1 |
|---|---|---|---|
| Mean | 13.92 | 14.55 | 14.03 |
| Median | 7.50 | 7.64 | 7.47 |
| P25 | 2.67 | 2.74 | 2.63 |
| P75 | 16.01 | 15.95 | 15.97 |
| P90 | 30.87 | 31.58 | 30.54 |
Hit-rate at common MFE thresholds
| Threshold | Baseline | V2 | V1 |
|---|---|---|---|
| ≥ 5 % | 61.5 % | 62.2 % | 62.0 % |
| ≥ 10 % | 40.5 % | 40.9 % | 40.1 % |
| ≥ 15 % | 27.0 % | 27.1 % | 27.0 % |
| ≥ 20 % | 18.9 % | 18.7 % | 18.7 % |
| ≥ 30 % | 10.5 % | 10.9 % | 10.5 % |
| ≥ 50 % | 4.3 % | 5.0 % | 4.4 % |
V2 produces a small uplift on MFE, concentrated in the mean and the right tail: mean +0.63 pp, median +0.14 pp, P90 +0.71 pp, and a notable bump at the ≥ 50% threshold (4.3% → 5.0%, a 16% relative increase). The uplift is not uniform: P75 (15.95 vs 16.01) and the ≥ 20% hit rate (18.7% vs 18.9%) sit marginally below baseline. V1 showed no such shift; its MFE distribution was statistically indistinguishable from baseline.
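The hit-rate rows and the relative-increase figure follow from two small calculations. The sketch below assumes MFE values are held as a plain list of percentages; `hit_rates` and `relative_change_pct` are illustrative names, and the sample list is toy data rather than study output.

```python
def hit_rates(mfe_pct, thresholds=(5, 10, 15, 20, 30, 50)):
    """Share of events (in %) whose MFE reaches each threshold."""
    n = len(mfe_pct)
    if n == 0:
        raise ValueError("no events")
    return {t: 100.0 * sum(1 for x in mfe_pct if x >= t) / n
            for t in thresholds}


def relative_change_pct(baseline, variant):
    """Relative change of variant versus baseline, in percent."""
    return 100.0 * (variant - baseline) / baseline


# Toy MFE sample (percent), not taken from the study.
sample = [2.0, 6.0, 12.0, 18.0, 55.0]
rates = hit_rates(sample)
print(rates)

# Relative increase of the >= 50% hit rate, using the table values.
tail_uplift = relative_change_pct(4.3, 5.0)
print(f"{tail_uplift:.1f}% relative")
```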
5. MAE distribution comparison
| Stat (MAE %) | Baseline | V2 | V1 |
|---|---|---|---|
| Mean | −9.54 | −9.60 | −9.68 |
| Median | −6.59 | −6.65 | −6.74 |
| P25 | −13.64 | −13.85 | −14.05 |
| P75 | −2.23 | −2.17 | −2.15 |
| P10 | −23.21 | −23.27 | −23.48 |
Stop-loss-relevant loss rates
| MAE worse than… | Baseline | V2 | V1 |
|---|---|---|---|
| −5 % | 58.5 % | 58.3 % | 58.7 % |
| −7 % | 47.9 % | 48.3 % | 48.7 % |
| −10 % | 35.9 % | 36.0 % | 36.4 % |
| −15 % | 21.7 % | 22.0 % | 22.4 % |
V2's downside is essentially indistinguishable from baseline — within 0.4 pp at every threshold. The MFE uplift in §4 is therefore not bought with deeper drawdowns. V1 was marginally worse on the downside; V2 is not.
6. Time-to-MFE-peak comparison
| Stat (days to MFE) | Baseline | V2 | V1 |
|---|---|---|---|
| Mean | 15.7 | 15.6 | 15.5 |
| Median | 16 | 16 | 16 |
| P25 | 5 | 5 | 5 |
| P75 | 26 | 26 | 26 |
| P90 | 30 | 29 | 30 |
Bucket distribution
| Days-to-peak | Baseline | V2 | V1 |
|---|---|---|---|
| 1 – 5 | 25.4 % | 25.2 % | 25.7 % |
| 6 – 10 | 12.8 % | 13.4 % | 13.3 % |
| 11 – 15 | 11.0 % | 11.1 % | 10.7 % |
| 16 – 20 | 10.8 % | 11.9 % | 11.4 % |
| 21 – 30 | 39.9 % | 38.4 % | 38.8 % |
Indistinguishable from baseline; the same U-shape (heavy mass at 1–5 and 21–30 days) is preserved.
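One way to build the bucket table, and to put a number on how close the distributions are, is sketched below. It assumes days-to-peak is an integer in the range 1–30 per event; the function names are illustrative, and the first sample is toy data, while the second comparison uses the shares from the table.

```python
BUCKETS = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 30)]


def bucket_shares(days_to_peak):
    """Percentage of events whose MFE peak falls in each day bucket."""
    counts = {b: 0 for b in BUCKETS}
    for d in days_to_peak:
        for lo, hi in BUCKETS:
            if lo <= d <= hi:
                counts[(lo, hi)] += 1
                break  # values outside 1-30 are ignored
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no events inside the bucket range")
    return {f"{lo}-{hi}": 100.0 * c / total for (lo, hi), c in counts.items()}


def max_gap_pp(shares_a, shares_b):
    """Bucket with the largest absolute difference, in percentage points."""
    label = max(shares_a, key=lambda k: abs(shares_a[k] - shares_b[k]))
    return label, abs(shares_a[label] - shares_b[label])


# Toy sample, not taken from the study.
toy_shares = bucket_shares([1, 3, 5, 8, 12, 18, 22, 25, 29, 30])
print(toy_shares)

# Shares copied from the bucket table.
baseline_tbl = {"1-5": 25.4, "6-10": 12.8, "11-15": 11.0,
                "16-20": 10.8, "21-30": 39.9}
v2_tbl = {"1-5": 25.2, "6-10": 13.4, "11-15": 11.1,
          "16-20": 11.9, "21-30": 38.4}
worst_bucket, worst_gap = max_gap_pp(baseline_tbl, v2_tbl)
print(worst_bucket, round(worst_gap, 1))
```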
Endpoint returns (means, %)
| Horizon | Baseline | V2 | V1 |
|---|---|---|---|
| 5d | 0.50 | 0.73 | 0.54 |
| 10d | 0.95 | 1.24 | 1.02 |
| 20d | 1.66 | 2.18 | 1.76 |
| 30d | 2.79 | 2.94 | 2.64 |
V2 means are higher than baseline at every horizon, with the largest lift at 5–20 days (+0.23 to +0.52 pp, or +31% to +46% relative) and a much smaller one at 30 days (+0.15 pp, about +5% relative). V2 medians at 5/10/20d are also positive and above baseline (0.21 vs 0.08, 0.31 vs 0.12, 0.50 vs 0.34); the 30d median runs slightly below baseline (0.65 vs 0.88), so the mid-window lift partially attenuates by day 30.
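The absolute and relative lifts can be recomputed directly from the endpoint table. This short sketch uses only the rounded means shown above, so results may differ slightly from figures derived from unrounded data; the variable names are illustrative.

```python
# Horizon -> (baseline mean, V2 mean), in % return, copied from the table.
ENDPOINT_MEANS = {
    "5d": (0.50, 0.73),
    "10d": (0.95, 1.24),
    "20d": (1.66, 2.18),
    "30d": (2.79, 2.94),
}

lifts = {}
for horizon, (base, v2) in ENDPOINT_MEANS.items():
    abs_lift_pp = v2 - base                   # percentage points
    rel_lift_pct = 100.0 * abs_lift_pp / base  # relative to baseline
    lifts[horizon] = (abs_lift_pp, rel_lift_pct)
    print(f"{horizon}: {abs_lift_pp:+.2f} pp, {rel_lift_pct:+.0f}% relative")
```

Reporting both forms side by side helps because the 5-day baseline mean is small, so a modest absolute gain there looks large in relative terms.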
7. Quality vs quantity tradeoff
Expectancy proxy = mean(MFE) × P(MFE ≥ 15 %).
| Metric | Baseline | V2 | V1 |
|---|---|---|---|
| n (with outcomes) | 15,528 | 7,425 | 5,154 |
| mean MFE % | 13.92 | 14.55 | 14.03 |
| P(MFE ≥ 15 %) | 27.0 % | 27.1 % | 27.0 % |
| Per-pattern proxy | 3.76 | 3.94 | 3.79 |
| Population-total proxy (per-pattern × n) | 58,370 | 29,266 | 19,561 |
Interpretation. V2 shows a real per-pattern quality lift, modest in size: +4.9% on the expectancy proxy (3.94 vs 3.76), driven by mean MFE moving from 13.92% to 14.55% while the hit-rate at +15% MFE is unchanged. The lift extends across all 5/10/20/30-day endpoint means (§6). Importantly, downside is not degraded: MAE distribution and stop-loss-trigger rates are within 0.4 pp of baseline.
So V2 finds about half as many patterns at slightly better per-pattern quality, in contrast to V1 which found a third as many at the same per-pattern quality. V2's population-total expectancy proxy (29,266) is ~50% of baseline — fewer-but-better, not just fewer.
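The expectancy proxy defined at the top of this section can be recomputed from the table rows as a check. The sketch below uses the rounded table inputs, so the population totals it prints land close to, but not exactly on, the tabulated values, which were presumably computed from unrounded data. The function and variable names are illustrative.

```python
def expectancy_proxy(mean_mfe_pct, hit_rate_pct):
    """mean(MFE) x P(MFE >= 15%), with the probability given in percent."""
    return mean_mfe_pct * hit_rate_pct / 100.0


# name -> (n with outcomes, mean MFE %, P(MFE >= 15%) in %), from the table.
ROWS = {
    "Baseline": (15528, 13.92, 27.0),
    "V2": (7425, 14.55, 27.1),
    "V1": (5154, 14.03, 27.0),
}

results = {}
for name, (n, mean_mfe, hit_rate) in ROWS.items():
    per_pattern = expectancy_proxy(mean_mfe, hit_rate)
    results[name] = (per_pattern, per_pattern * n)
    print(f"{name}: per-pattern {per_pattern:.2f}, total {per_pattern * n:,.0f}")

# Per-pattern lift and population-total share of V2 relative to baseline.
v2_lift_pct = 100.0 * (results["V2"][0] / results["Baseline"][0] - 1.0)
v2_total_share = results["V2"][1] / results["Baseline"][1]
print(f"V2 per-pattern lift: {v2_lift_pct:+.1f}%")
print(f"V2 total vs baseline: {v2_total_share:.0%}")
```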
8. Recommendation summary
Variant 2 would be a clear improvement over Baseline if the consumer of
the detector values a smaller, slightly higher-quality candidate set
with equivalent downside and accepts a halving of detection volume.
The MFE uplift is small in absolute terms (+0.63 pp on the mean, +0.7
pp at the 90th percentile, +0.7 pp at the ≥50% MFE tail) but consistent
across horizons and not bought with worse drawdowns. Per-pattern
expectancy is ~5% higher than baseline. The V1 comparison does not
isolate the gain to the tightened flagpole window
(flagpole.max_duration_bars = 5): V1 and V2 share that change and
differ only in the pennant window (5–15 → 7–17 for V2, 5–15 → 10–20
for V1), yet only V2 shows the lift. The evidence is therefore
consistent with the milder pennant shift preserving a lift that the
more aggressive shift removes; a flagpole-only variant would be needed
to measure the flagpole effect on its own. El Don decides.
Artifacts preserved under ab_test/:
run_v2.py, analyze_v2.py, variant_v2_events.parquet,
variant_v2_outcomes.parquet, summary_v2.json, run_v2.log,
plus all Phase 11a v1 artifacts.