Phase 11a-2 — Pennant criteria A/B test (Variant 2)

Re-run of the Phase 11a A/B test against a less-aggressive variant. Baseline events and outcomes from Phase 11a are reused unchanged. V1 numbers are pulled from reports/pennant_criteria_ab_test_2026-05-11.md and included in side-by-side tables where they fit.

Override mechanism, outcome formula, and universe are identical to Phase 11a. No production code, config, or data modified.

1. Configuration

| Parameter | Baseline | Variant 2 | Variant 1 (ref) |
| --- | --- | --- | --- |
| pennant.min_duration_bars | 5 | 7 | 10 |
| pennant.max_duration_bars | 15 | 17 | 20 |
| flagpole.min_duration_bars | 1 | 1 | 1 |
| flagpole.max_duration_bars | 10 | 5 | 5 |
| pennant.max_retrace_pct | 0.382 | 0.382 | 0.382 |
| flagpole.min_magnitude_pct | 12.0 | 12.0 | 12.0 |
| flagpole.min_atr_multiple | 4.0 | 4.0 | 4.0 |
| flagpole.volume_ratio_min | 1.5 | 1.5 | 1.5 |
| trend_filter (EMA_55 ≥ 10d prior) | on | on | on |

Date range scanned: 2007-02-15 → 2026-05-08. Universe: 2,974 active tickers, of which 2,413 have the ≥300 bars needed to qualify. Earliest V2 anchor: 2007-02-15.
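
As a rough illustration of how the Variant 2 column differs from Baseline, the sketch below layers the three changed parameters over the baseline values. The parameter names and values come from the table above; the dictionary layout and the `apply_overrides` helper are assumptions made for illustration and are not the Phase 11a override mechanism itself.

```python
# Hypothetical sketch: names and values mirror the configuration table,
# but the dict layout and apply_overrides() are illustrative assumptions.
BASELINE = {
    "pennant.min_duration_bars": 5,
    "pennant.max_duration_bars": 15,
    "flagpole.min_duration_bars": 1,
    "flagpole.max_duration_bars": 10,
    "pennant.max_retrace_pct": 0.382,
    "flagpole.min_magnitude_pct": 12.0,
    "flagpole.min_atr_multiple": 4.0,
    "flagpole.volume_ratio_min": 1.5,
}

# Only the parameters that differ from Baseline in Variant 2.
VARIANT_2_OVERRIDES = {
    "pennant.min_duration_bars": 7,
    "pennant.max_duration_bars": 17,
    "flagpole.max_duration_bars": 5,
}


def apply_overrides(base, overrides):
    """Return a new config with overrides applied; reject unknown keys."""
    unknown = set(overrides) - set(base)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    merged = dict(base)  # copy so the baseline dict is never mutated
    merged.update(overrides)
    return merged


variant_2 = apply_overrides(BASELINE, VARIANT_2_OVERRIDES)

# Map each changed parameter to its (baseline, variant) pair.
changed = {k: (BASELINE[k], v) for k, v in variant_2.items() if BASELINE[k] != v}
```

Building the variant as a copy keeps the baseline untouched, which matches the report's statement that no production config was modified.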

2. Detection counts

| Population | Events | vs Baseline |
| --- | --- | --- |
| Baseline | 15,534 | 1.00× |
| Variant 2 | 7,428 | 0.478× |
| Variant 1 (ref) | 5,155 | 0.332× |

V2 retains roughly half of the Baseline event count (0.478×) and sits between Baseline and V1, closer to V1 (0.332×) than to Baseline.

Per-year counts

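| Year | Baseline | V2 | V1 | V2 / Baseline |
| --- | --- | --- | --- | --- |
| 2007 | 400 | 173 | 127 | 0.43 |
| 2008 | 226 | 108 | 69 | 0.48 |
| 2009 | 702 | 311 | 238 | 0.44 |
| 2010 | 619 | 281 | 172 | 0.45 |
| 2011 | 471 | 218 | 158 | 0.46 |
| 2012 | 564 | 270 | 187 | 0.48 |
| 2013 | 875 | 336 | 233 | 0.38 |
| 2014 | 498 | 258 | 178 | 0.52 |
| 2015 | 533 | 255 | 186 | 0.48 |
| 2016 | 806 | 343 | 249 | 0.43 |
| 2017 | 871 | 413 | 301 | 0.47 |
| 2018 | 719 | 356 | 253 | 0.50 |
| 2019 | 748 | 392 | 260 | 0.52 |
| 2020 | 1,168 | 552 | 357 | 0.47 |
| 2021 | 1,336 | 644 | 455 | 0.48 |
| 2022 | 533 | 261 | 189 | 0.49 |
| 2023 | 1,024 | 540 | 349 | 0.53 |
| 2024 | 1,780 | 908 | 662 | 0.51 |
| 2025 | 1,232 | 589 | 397 | 0.48 |
| 2026 (YTD) | 429 | 220 | 135 | 0.51 |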

The V2 / Baseline ratio is stable across regimes, ranging from 0.38 (2013) to 0.53 (2023), with every year except 2013 falling between 0.43 and 0.53. No year shows disproportionate filtering.
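
The ratio column and its range can be recomputed directly from the per-year counts. The sketch below does so with the Baseline and V2 counts copied from the table above; the variable names are illustrative only.

```python
# Per-year event counts copied from the table: year -> (baseline, v2).
COUNTS = {
    2007: (400, 173), 2008: (226, 108), 2009: (702, 311), 2010: (619, 281),
    2011: (471, 218), 2012: (564, 270), 2013: (875, 336), 2014: (498, 258),
    2015: (533, 255), 2016: (806, 343), 2017: (871, 413), 2018: (719, 356),
    2019: (748, 392), 2020: (1168, 552), 2021: (1336, 644), 2022: (533, 261),
    2023: (1024, 540), 2024: (1780, 908), 2025: (1232, 589), 2026: (429, 220),
}

# V2 / Baseline ratio for each year.
ratios = {year: v2 / base for year, (base, v2) in COUNTS.items()}

# Years with the strongest and weakest filtering.
low_year = min(ratios, key=ratios.get)
high_year = max(ratios, key=ratios.get)

# Totals should reproduce the headline detection counts.
total_base = sum(base for base, _ in COUNTS.values())
total_v2 = sum(v2 for _, v2 in COUNTS.values())
overall_ratio = total_v2 / total_base

for year in sorted(ratios):
    print(f"{year}: {ratios[year]:.2f}")
print(f"range: {ratios[low_year]:.2f} ({low_year}) to {ratios[high_year]:.2f} ({high_year})")
print(f"overall: {overall_ratio:.3f}")
```

The yearly totals sum to the headline counts in §2, so the per-year table and the overall 0.478× figure are mutually consistent.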

Per-sector counts

| Sector | Baseline | V2 | V2 / Baseline |
| --- | --- | --- | --- |
| Healthcare | 3,206 | 1,601 | 0.50 |
| Technology | 2,694 | 1,338 | 0.50 |
| Industrials | 2,538 | 1,238 | 0.49 |
| Financial Services | 2,067 | 877 | 0.42 |
| Consumer Cyclical | 1,916 | 934 | 0.49 |
| Consumer Defensive | 744 | 359 | 0.48 |
| Energy | 677 | 295 | 0.44 |
| Basic Materials | 672 | 311 | 0.46 |
| Communication Services | 590 | 289 | 0.49 |
| Real Estate | 328 | 141 | 0.43 |
| Utilities | 102 | 45 | 0.44 |

Sector mix is preserved: the V2 / Baseline ratio ranges from 0.42 (Financial Services) to 0.50 (Healthcare, Technology).

3. Overlap analysis (Baseline vs V2)

| Measure | Count |
| --- | --- |
| Exact-anchor match (same symbol + same event_date) | 1,995 |
| Fuzzy match within ±1 calendar day (baseline events with ≥1 nearby V2) | 3,397 |
| Fuzzy match within ±1 calendar day (V2 events with ≥1 nearby baseline) | 3,397 |
| Baseline-only events (no V2 within ±1 day) | 12,137 |
| V2-only events (no baseline within ±1 day) | 4,031 |

Interpretation. Of the 7,428 V2 events, 4,031 (54%) have no baseline event within a ±1-day window. As with V1, this is driven by (a) V2's shifted pennant window (7–17 vs 5–15 bars: the same width, but a higher upper bound) admitting some longer consolidations, and (b) the different min_duration_bars causing the inner-loop dedup to pick different anchors for the same symbol. V2 is therefore not a strict subset of baseline.

For reference: V1 had 2,965 V1-only events (57% of the V1 population was net new). V2's net-new fraction is similar (54%), which suggests that shifting the duration window produces a substantially different event set, not an incremental filter of the baseline.
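
The overlap measures above can be sketched as a per-symbol date lookup. This is a minimal illustration of exact and ±1-calendar-day matching, not the code in analyze_v2.py; the function names and the toy events are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def index_by_symbol(events):
    """Group event dates by symbol: {symbol: set of dates}."""
    index = defaultdict(set)
    for symbol, event_date in events:
        index[symbol].add(event_date)
    return index


def has_nearby(symbol, event_date, other_index, tolerance_days=1):
    """True if other_index holds the same symbol within +/- tolerance_days."""
    other_dates = other_index.get(symbol, ())
    return any(
        event_date + timedelta(days=offset) in other_dates
        for offset in range(-tolerance_days, tolerance_days + 1)
    )


def overlap_summary(baseline, variant, tolerance_days=1):
    """Count exact matches and fuzzy matches in both directions."""
    base_idx = index_by_symbol(baseline)
    var_idx = index_by_symbol(variant)
    base_matched = sum(has_nearby(s, d, var_idx, tolerance_days) for s, d in baseline)
    var_matched = sum(has_nearby(s, d, base_idx, tolerance_days) for s, d in variant)
    return {
        "exact": len(set(baseline) & set(variant)),
        "baseline_matched": base_matched,
        "variant_matched": var_matched,
        "baseline_only": len(baseline) - base_matched,
        "variant_only": len(variant) - var_matched,
    }


# Toy events (made up): one exact match, one 1-day-offset match, one miss each.
baseline = [("AAA", date(2020, 3, 2)), ("AAA", date(2020, 6, 10)), ("BBB", date(2021, 1, 5))]
variant = [("AAA", date(2020, 3, 2)), ("BBB", date(2021, 1, 6)), ("CCC", date(2022, 7, 1))]
summary = overlap_summary(baseline, variant)
```

The two fuzzy counts are computed separately because they need not be equal in general: one event on one side can sit within a day of two events on the other.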

4. MFE distribution comparison

| Stat (MFE %) | Baseline | V2 | V1 |
| --- | --- | --- | --- |
| Mean | 13.92 | 14.55 | 14.03 |
| Median | 7.50 | 7.64 | 7.47 |
| P25 | 2.67 | 2.74 | 2.63 |
| P75 | 16.01 | 15.95 | 15.97 |
| P90 | 30.87 | 31.58 | 30.54 |

Hit-rate at common MFE thresholds

| Threshold | Baseline | V2 | V1 |
| --- | --- | --- | --- |
| ≥ 5 % | 61.5 % | 62.2 % | 62.0 % |
| ≥ 10 % | 40.5 % | 40.9 % | 40.1 % |
| ≥ 15 % | 27.0 % | 27.1 % | 27.0 % |
| ≥ 20 % | 18.9 % | 18.7 % | 18.7 % |
| ≥ 30 % | 10.5 % | 10.9 % | 10.5 % |
| ≥ 50 % | 4.3 % | 5.0 % | 4.4 % |

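V2 produces a small uplift on MFE across most of the distribution: mean +0.63 pp, median +0.14 pp, P90 +0.71 pp, and a notable bump at the right tail (the share of events with ≥ 50% MFE rises from 4.3% to 5.0%, a 16% relative increase). The uplift is not uniform: P75 (15.95 vs 16.01) and the ≥ 20% hit-rate (18.7% vs 18.9%) are marginally below baseline. V1 showed no such shift; its MFE distribution was statistically indistinguishable from baseline.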
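
For readers who want to reproduce the hit-rate table from raw outcomes, the following is a minimal sketch of the calculation. It assumes MFE values are available as plain percentages per event; the helper names and the sample values are made up for illustration and are not taken from the outcome files.

```python
def hit_rates(mfe_pct, thresholds=(5, 10, 15, 20, 30, 50)):
    """Share of events (in %) whose MFE % is at or above each threshold."""
    n = len(mfe_pct)
    if n == 0:
        raise ValueError("need at least one outcome")
    return {t: 100.0 * sum(m >= t for m in mfe_pct) / n for t in thresholds}


def relative_change_pct(new, old):
    """Relative change from old to new, expressed in percent."""
    return 100.0 * (new - old) / old


# Made-up MFE values for ten hypothetical events.
sample_mfe = [1.2, 4.9, 5.0, 7.5, 12.0, 16.0, 22.5, 31.0, 55.0, 3.3]
rates = hit_rates(sample_mfe)

# Relative increase of the >= 50% tail, using the two table values (4.3% -> 5.0%).
tail_lift = relative_change_pct(5.0, 4.3)
print(f"tail lift: {tail_lift:.1f}%")
```

Thresholds are inclusive (`>=`), matching the "≥" notation in the table.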

5. MAE distribution comparison

| Stat (MAE %) | Baseline | V2 | V1 |
| --- | --- | --- | --- |
| Mean | −9.54 | −9.60 | −9.68 |
| Median | −6.59 | −6.65 | −6.74 |
| P25 | −13.64 | −13.85 | −14.05 |
| P75 | −2.23 | −2.17 | −2.15 |
| P10 | −23.21 | −23.27 | −23.48 |

Stop-loss-relevant loss rates

| MAE worse than… | Baseline | V2 | V1 |
| --- | --- | --- | --- |
| −5 % | 58.5 % | 58.3 % | 58.7 % |
| −7 % | 47.9 % | 48.3 % | 48.7 % |
| −10 % | 35.9 % | 36.0 % | 36.4 % |
| −15 % | 21.7 % | 22.0 % | 22.4 % |

V2's downside is essentially indistinguishable from baseline — within 0.4 pp at every threshold. The MFE uplift in §4 is therefore not bought with deeper drawdowns. V1 was marginally worse on the downside; V2 is not.

6. Time-to-MFE-peak comparison

| Stat (days to MFE) | Baseline | V2 | V1 |
| --- | --- | --- | --- |
| Mean | 15.7 | 15.6 | 15.5 |
| Median | 16 | 16 | 16 |
| P25 | 5 | 5 | 5 |
| P75 | 26 | 26 | 26 |
| P90 | 30 | 29 | 30 |

Bucket distribution

| Days-to-peak | Baseline | V2 | V1 |
| --- | --- | --- | --- |
| 1 – 5 | 25.4 % | 25.2 % | 25.7 % |
| 6 – 10 | 12.8 % | 13.4 % | 13.3 % |
| 11 – 15 | 11.0 % | 11.1 % | 10.7 % |
| 16 – 20 | 10.8 % | 11.9 % | 11.4 % |
| 21 – 30 | 39.9 % | 38.4 % | 38.8 % |

Indistinguishable from baseline; the same U-shape (heavy mass at 1–5 and 21–30 days) is preserved.
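
The bucket shares above can be derived from per-event days-to-peak values roughly as follows. The bucket edges are taken from the table; the function names and sample values are hypothetical, and the sketch assumes days are counted from 1 within a 30-day outcome window.

```python
from collections import Counter

# Bucket edges from the table, as inclusive (low, high) day ranges.
BUCKETS = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 30)]


def bucket_label(days):
    """Return the label of the bucket containing days, or None if outside 1-30."""
    for low, high in BUCKETS:
        if low <= days <= high:
            return f"{low}-{high}"
    return None


def bucket_shares(days_to_peak):
    """Percentage of in-range events that peak inside each bucket."""
    counts = Counter(bucket_label(d) for d in days_to_peak)
    counts.pop(None, None)  # drop events outside the bucketed range
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations fall inside the buckets")
    return {
        f"{low}-{high}": 100.0 * counts[f"{low}-{high}"] / total
        for low, high in BUCKETS
    }


# Made-up days-to-peak values for ten hypothetical events.
sample_days = [1, 3, 5, 8, 12, 18, 22, 25, 30, 30]
shares = bucket_shares(sample_days)
```

The last bucket spans ten days while the others span five, which contributes to the heavy mass reported at 21–30 days.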

Endpoint returns (means, %)

| Horizon | Baseline | V2 | V1 |
| --- | --- | --- | --- |
| 5d | 0.50 | 0.73 | 0.54 |
| 10d | 0.95 | 1.24 | 1.02 |
| 20d | 1.66 | 2.18 | 1.76 |
| 30d | 2.79 | 2.94 | 2.64 |

V2 means are higher than baseline at every horizon, with the cleanest lift at 5–20 days (+0.23 to +0.52 pp, or +31% to +46% relative) and a much smaller one at 30 days (+0.15 pp, about +5%). V2 medians at 5/10/20d are also positive and above baseline (0.21 vs 0.08, 0.31 vs 0.12, 0.50 vs 0.34); the 30d median runs slightly below baseline (0.65 vs 0.88), so the mid-window lift partially attenuates by day 30.
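
The absolute and relative lifts in endpoint means can be recomputed from the table. The sketch below uses the Baseline and V2 mean returns exactly as printed there; the variable names are illustrative.

```python
# Mean endpoint returns (%) copied from the table: horizon -> (baseline, v2).
ENDPOINT_MEANS = {
    "5d": (0.50, 0.73),
    "10d": (0.95, 1.24),
    "20d": (1.66, 2.18),
    "30d": (2.79, 2.94),
}

# Absolute lift in percentage points and relative lift in percent per horizon.
lift = {
    horizon: {
        "abs_pp": v2 - base,
        "rel_pct": 100.0 * (v2 - base) / base,
    }
    for horizon, (base, v2) in ENDPOINT_MEANS.items()
}

for horizon, values in lift.items():
    print(f"{horizon}: {values['abs_pp']:+.2f} pp, {values['rel_pct']:+.0f}% relative")
```

Because the inputs are rounded to two decimals, the recomputed percentages are approximate, and relative lifts on small bases such as the 5d mean are sensitive to that rounding.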

7. Quality vs quantity tradeoff

Expectancy proxy = mean(MFE) × P(MFE ≥ 15 %).

| Metric | Baseline | V2 | V1 |
| --- | --- | --- | --- |
| n (with outcomes) | 15,528 | 7,425 | 5,154 |
| mean MFE % | 13.92 | 14.55 | 14.03 |
| P(MFE ≥ 15 %) | 27.0 % | 27.1 % | 27.0 % |
| Per-pattern proxy | 3.76 | 3.94 | 3.79 |
| Population-total proxy (per-pattern × n) | 58,370 | 29,266 | 19,561 |

Interpretation. V2 shows a real per-pattern quality lift, modest in size: +4.9% on the expectancy proxy (3.94 vs 3.76), driven by mean MFE moving from 13.92% to 14.55% while the hit-rate at +15% MFE is unchanged. The lift extends across all 5/10/20/30-day endpoint means (§6). Importantly, downside is not degraded: MAE distribution and stop-loss-trigger rates are within 0.4 pp of baseline.

So V2 finds about half as many patterns at slightly better per-pattern quality, in contrast to V1 which found a third as many at the same per-pattern quality. V2's population-total expectancy proxy (29,266) is ~50% of baseline — fewer-but-better, not just fewer.
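
The expectancy proxy defined at the top of this section can be recomputed from the table's inputs. The sketch below uses the rounded values shown there, so the population totals it produces differ slightly from the reported ones, which were presumably computed from unrounded inputs; the names are illustrative.

```python
# Inputs copied from the table: population -> (n, mean MFE %, P(MFE >= 15%)).
ROWS = {
    "Baseline": (15528, 13.92, 0.270),
    "V2": (7425, 14.55, 0.271),
    "V1": (5154, 14.03, 0.270),
}


def expectancy_proxy(mean_mfe_pct, hit_rate):
    """Per-pattern proxy = mean(MFE) x P(MFE >= 15%)."""
    return mean_mfe_pct * hit_rate


per_pattern = {
    name: expectancy_proxy(mean_mfe, hit_rate)
    for name, (_, mean_mfe, hit_rate) in ROWS.items()
}

# Population-total proxy = per-pattern proxy x number of events with outcomes.
population_total = {name: per_pattern[name] * ROWS[name][0] for name in ROWS}

# V2 relative to Baseline, per pattern and in aggregate.
per_pattern_lift_pct = 100.0 * (per_pattern["V2"] / per_pattern["Baseline"] - 1.0)
total_share = population_total["V2"] / population_total["Baseline"]

for name in ROWS:
    print(f"{name}: per-pattern {per_pattern[name]:.2f}, total {population_total[name]:,.0f}")
```

The per-pattern values round to the figures in the table, and V2's aggregate proxy comes out at about half of Baseline's, in line with the halving of detection volume.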

8. Recommendation summary

Variant 2 would be a clear improvement over Baseline if the consumer of the detector values a smaller, slightly higher-quality candidate set with equivalent downside and accepts a halving of detection volume. The MFE uplift is small in absolute terms (+0.63 pp on the mean, +0.7 pp at the 90th percentile, +0.7 pp in the share of events reaching ≥50% MFE) but consistent across horizons and not bought with worse drawdowns. Per-pattern expectancy is ~5% higher than baseline. V1 and V2 share the tightened flagpole window (flagpole.max_duration_bars = 5) and differ only in the pennant window, where V2 uses a much milder shift (5–15 → 7–17 rather than 5–15 → 10–20). Since V1 showed no quality lift and V2 does, the gain appears to depend on pairing the tightened flagpole window with the milder pennant shift; the flagpole change was not tested on its own, so its individual contribution cannot be isolated from these two variants. El Don decides.


Artifacts preserved under ab_test/: run_v2.py, analyze_v2.py, variant_v2_events.parquet, variant_v2_outcomes.parquet, summary_v2.json, run_v2.log, plus all Phase 11a v1 artifacts.