Phase 11a-2: Pennant criteria A/B test (Variant 2)

Re-run of the Phase 11a A/B test against a less-aggressive variant. Baseline events and outcomes from Phase 11a are reused unchanged. V1 numbers are pulled from reports/pennant_criteria_ab_test_2026-05-11.md and included in side-by-side tables where they fit.

Override mechanism, outcome formula, and universe are identical to Phase 11a. No production code, config, or data modified.

1. Configuration

Parameter	Baseline	Variant 2	(Variant 1, ref)
pennant.min_duration_bars	5	7	10
pennant.max_duration_bars	15	17	20
flagpole.min_duration_bars	1	1	1
flagpole.max_duration_bars	10	5	5
pennant.max_retrace_pct	0.382	0.382	0.382
flagpole.min_magnitude_pct	12.0	12.0	12.0
flagpole.min_atr_multiple	4.0	4.0	4.0
flagpole.volume_ratio_min	1.5	1.5	1.5
trend_filter (EMA_55 ≥ 10d prior)	on	on	on

Date range scanned: 2007-02-15 → 2026-05-08. Universe: 2,974 active tickers (2,413 with ≥300 bars to qualify). Earliest v2 anchor: 2007-02-15.
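The override mechanism itself is defined in Phase 11a and is not reproduced in this report. As a rough sketch only, the three parameters that differ from baseline could be expressed as dotted-key overrides applied to a copy of the baseline config. The nested-dict layout and the `apply_overrides` helper are assumptions for illustration; only the parameter names and values come from the table above.

```python
from copy import deepcopy

# Baseline values from the configuration table (dict layout is hypothetical).
BASELINE = {
    "pennant": {"min_duration_bars": 5, "max_duration_bars": 15, "max_retrace_pct": 0.382},
    "flagpole": {
        "min_duration_bars": 1,
        "max_duration_bars": 10,
        "min_magnitude_pct": 12.0,
        "min_atr_multiple": 4.0,
        "volume_ratio_min": 1.5,
    },
}

# Only the parameters that Variant 2 changes relative to baseline.
VARIANT_2 = {
    "pennant.min_duration_bars": 7,
    "pennant.max_duration_bars": 17,
    "flagpole.max_duration_bars": 5,
}


def apply_overrides(base, overrides):
    """Return a copy of `base` with dotted-key overrides applied; `base` is untouched."""
    cfg = deepcopy(base)
    for dotted, value in overrides.items():
        section, key = dotted.split(".", 1)
        if key not in cfg[section]:
            raise KeyError(dotted)  # refuse to invent parameters by typo
        cfg[section][key] = value
    return cfg


v2_config = apply_overrides(BASELINE, VARIANT_2)
```

Working on a deep copy mirrors the report's constraint that no production config is modified by the test run.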

2. Detection counts

Population	Events	vs Baseline
Baseline	15,534	1.00×
Variant 2	7,428	0.478×
Variant 1 (ref)	5,155	0.332×

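V2 sits between Baseline and V1 in count terms but closer to V1: it removes 8,106 of the 10,379 events that V1 removes (about 78% of the Baseline-to-V1 reduction).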
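The ratios in the table, and where V2 falls between the other two populations, can be checked directly from the event counts. This is plain arithmetic on the reported numbers; the variable names are illustrative.

```python
# Event counts from the detection-count table.
counts = {"Baseline": 15534, "Variant 2": 7428, "Variant 1": 5155}

# Each population as a fraction of the baseline count.
ratios = {name: round(n / counts["Baseline"], 3) for name, n in counts.items()}

# Share of the Baseline -> V1 reduction that V2 already incurs
# (0 would mean "same as Baseline", 1 would mean "same as V1").
v2_position = (counts["Baseline"] - counts["Variant 2"]) / (
    counts["Baseline"] - counts["Variant 1"]
)

print(ratios)
print(round(v2_position, 2))
```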

Per-year counts

Year	Baseline	V2	V1	V2 / Baseline
2007	400	173	127	0.43
2008	226	108	69	0.48
2009	702	311	238	0.44
2010	619	281	172	0.45
2011	471	218	158	0.46
2012	564	270	187	0.48
2013	875	336	233	0.38
2014	498	258	178	0.52
2015	533	255	186	0.48
2016	806	343	249	0.43
2017	871	413	301	0.47
2018	719	356	253	0.50
2019	748	392	260	0.52
2020	1,168	552	357	0.47
2021	1,336	644	455	0.48
2022	533	261	189	0.49
2023	1,024	540	349	0.53
2024	1,780	908	662	0.51
2025	1,232	589	397	0.48
2026 (YTD)	429	220	135	0.51

The V2 / Baseline ratio stays within 0.38 to 0.53 across all years. 2013 is the low outlier at 0.38; every other year falls between 0.43 and 0.53, so no market regime is filtered disproportionately.

Per-sector counts

Sector	Baseline	V2	V2 / Baseline
Healthcare	3,206	1,601	0.50
Technology	2,694	1,338	0.50
Industrials	2,538	1,238	0.49
Financial Services	2,067	877	0.42
Consumer Cyclical	1,916	934	0.49
Consumer Defensive	744	359	0.48
Energy	677	295	0.44
Basic Materials	672	311	0.46
Communication Services	590	289	0.49
Real Estate	328	141	0.43
Utilities	102	45	0.44

Sector mix is broadly preserved: ratios range from 0.42 (Financial Services) to 0.50 (Healthcare, Technology).

3. Overlap analysis (Baseline vs V2)

Measure	Count
Exact-anchor match (same symbol + same `event_date`)	1,995
Fuzzy match within ±1 calendar day (baseline events with ≥1 nearby V2)	3,397
Fuzzy match within ±1 calendar day (V2 events with ≥1 nearby baseline)	3,397
Baseline-only events (no V2 within ±1 day)	12,137
V2-only events (no baseline within ±1 day)	4,031

Interpretation. Of the 7,428 V2 events, 4,031 (54%) have no baseline event within a ±1-day window. As with V1, this is driven by (a) V2's higher pennant maximum (17 vs 15 bars), which admits some longer consolidations, and (b) the different min_duration_bars, which causes the inner-loop dedup to pick different anchors at the same symbol. V2 is not a strict subset of baseline.

For reference: V1 had 2,965 unique events (57% of the V1 population was net new). V2's net-new fraction is similar (54%), which is consistent with a duration-window change reshuffling the event set rather than simply filtering it.
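The overlap measures above can be sketched as follows, assuming each event is a `(symbol, event_date)` pair. The function names and the toy events are hypothetical; the actual logic lives in analyze_v2.py and may differ, for example in how multiple nearby events are handled.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import date, timedelta


def index_by_symbol(events):
    """Map symbol -> sorted list of event dates."""
    by_symbol = defaultdict(list)
    for symbol, event_date in events:
        by_symbol[symbol].append(event_date)
    for dates in by_symbol.values():
        dates.sort()
    return by_symbol


def exact_matches(a, b):
    """Events sharing both symbol and event_date."""
    return len(set(a) & set(b))


def count_with_neighbor(events, other, tol_days=1):
    """Count events with at least one same-symbol event in `other` within ±tol_days."""
    other_idx = index_by_symbol(other)
    tol = timedelta(days=tol_days)
    matched = 0
    for symbol, event_date in events:
        dates = other_idx.get(symbol, [])
        # First date in `other` that is not earlier than the window start.
        i = bisect_left(dates, event_date - tol)
        if i < len(dates) and dates[i] <= event_date + tol:
            matched += 1
    return matched


# Toy data, not taken from the report.
baseline = [("AAA", date(2024, 1, 2)), ("AAA", date(2024, 1, 10)), ("BBB", date(2024, 3, 5))]
v2 = [("AAA", date(2024, 1, 3)), ("BBB", date(2024, 3, 5)), ("CCC", date(2024, 6, 1))]

exact = exact_matches(baseline, v2)
baseline_near = count_with_neighbor(baseline, v2)
v2_near = count_with_neighbor(v2, baseline)
v2_only = len(v2) - v2_near
```

The "only" rows in the table are then the population size minus the matched count, which is how 7,428 − 3,397 = 4,031 arises for V2.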

4. MFE distribution comparison

Stat (MFE %)	Baseline	V2	V1
Mean	13.92	14.55	14.03
Median	7.50	7.64	7.47
P25	2.67	2.74	2.63
P75	16.01	15.95	15.97
P90	30.87	31.58	30.54

Hit-rate at common MFE thresholds

Threshold	Baseline	V2	V1
≥ 5 %	61.5 %	62.2 %	62.0 %
≥ 10 %	40.5 %	40.9 %	40.1 %
≥ 15 %	27.0 %	27.1 %	27.0 %
≥ 20 %	18.9 %	18.7 %	18.7 %
≥ 30 %	10.5 %	10.9 %	10.5 %
≥ 50 %	4.3 %	5.0 %	4.4 %

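V2 produces a small MFE uplift concentrated in the mean and the right tail: mean +0.63 pp, median +0.14 pp, P90 +0.71 pp, and the ≥ 50% hit rate rises from 4.3% to 5.0% (a 16% relative increase). The middle of the distribution barely moves: P75 (15.95 vs 16.01) and the ≥ 20% hit rate (18.7% vs 18.9%) are marginally below baseline. V1 showed no comparable shift; every V1 row above sits within 0.5 pp of baseline. No significance test was run for either variant.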

5. MAE distribution comparison

Stat (MAE %)	Baseline	V2	V1
Mean	−9.54	−9.60	−9.68
Median	−6.59	−6.65	−6.74
P25	−13.64	−13.85	−14.05
P75	−2.23	−2.17	−2.15
P10	−23.21	−23.27	−23.48

Stop-loss-relevant loss rates

MAE worse than…	Baseline	V2	V1
−5 %	58.5 %	58.3 %	58.7 %
−7 %	47.9 %	48.3 %	48.7 %
−10 %	35.9 %	36.0 %	36.4 %
−15 %	21.7 %	22.0 %	22.4 %

V2's downside is essentially indistinguishable from baseline — within 0.4 pp at every threshold. The MFE uplift in §4 is therefore not bought with deeper drawdowns. V1 was marginally worse on the downside; V2 is not.
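Both threshold tables (MFE hit rates in §4 and MAE loss rates here) are exceedance fractions over the per-event outcomes. A minimal sketch, assuming MFE and MAE are stored as percentages per event; whether the report uses strict or inclusive comparison for "worse than" is not stated, so the strict `<` below is an assumption, as are the function names and toy values.

```python
def hit_rates(mfe_pct, thresholds=(5, 10, 15, 20, 30, 50)):
    """Fraction of events whose MFE reached at least each threshold (in %)."""
    n = len(mfe_pct)
    return {t: sum(x >= t for x in mfe_pct) / n for t in thresholds}


def loss_rates(mae_pct, levels=(-5, -7, -10, -15)):
    """Fraction of events whose MAE was strictly worse (more negative) than each level."""
    n = len(mae_pct)
    return {lvl: sum(x < lvl for x in mae_pct) / n for lvl in levels}


# Toy outcomes, not taken from the report.
mfe = [2.0, 6.0, 12.0, 18.0, 55.0]
mae = [-1.0, -6.0, -8.0, -12.0, -20.0]

mfe_hits = hit_rates(mfe)
mae_losses = loss_rates(mae)
```

Comparing two populations then reduces to differencing the returned dicts key by key, which is how the "within 0.4 pp at every threshold" statement can be verified.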

6. Time-to-MFE-peak comparison

Stat (days to MFE)	Baseline	V2	V1
Mean	15.7	15.6	15.5
Median	16	16	16
P25	5	5	5
P75	26	26	26
P90	30	29	30

Bucket distribution

Days-to-peak	Baseline	V2	V1
1 – 5	25.4 %	25.2 %	25.7 %
6 – 10	12.8 %	13.4 %	13.3 %
11 – 15	11.0 %	11.1 %	10.7 %
16 – 20	10.8 %	11.9 %	11.4 %
21 – 30	39.9 %	38.4 %	38.8 %

The timing profile is close to baseline: the largest bucket shifts are +1.1 pp at 16–20 days and −1.5 pp at 21–30 days, and the same U-shape (heavy mass at 1–5 and 21–30 days) is preserved.
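The bucket shares can be reproduced from per-event days-to-peak values. A small sketch, assuming days are integers in 1..30 and that shares are taken over all events; the helper name and toy values are illustrative.

```python
# Inclusive day ranges used in the bucket table.
BUCKETS = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 30)]


def bucket_shares(days_to_peak):
    """Share of events whose MFE peak falls in each inclusive day range."""
    counts = {bucket: 0 for bucket in BUCKETS}
    for d in days_to_peak:
        for lo, hi in BUCKETS:
            if lo <= d <= hi:
                counts[(lo, hi)] += 1
                break  # buckets do not overlap, so stop at the first hit
    n = len(days_to_peak)
    return {f"{lo}-{hi}": c / n for (lo, hi), c in counts.items()}


# Toy values, not taken from the report.
shares = bucket_shares([1, 3, 7, 16, 22, 25, 29, 30])
```

Note that the last bucket spans ten days while the others span five, so its share is not directly comparable on a per-day basis.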

Endpoint returns (means, %)

Horizon	Baseline	V2	V1
5d	0.50	0.73	0.54
10d	0.95	1.24	1.02
20d	1.66	2.18	1.76
30d	2.79	2.94	2.64

V2 means are higher than baseline at every horizon, with the largest lift at 5–20 days (+0.23 to +0.52 pp, or +31% to +46% relative) and a much smaller one at 30 days (+0.15 pp, about +5%). V2 medians at 5/10/20d are also positive and above baseline (0.21 vs 0.08, 0.31 vs 0.12, 0.50 vs 0.34); the 30d median runs slightly below baseline (0.65 vs 0.88), so the mid-window lift partially attenuates by day 30.
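The absolute and relative lifts per horizon follow from the endpoint-return table. The snippet below is just that arithmetic on the reported means; because the inputs are rounded to two decimals, the relative figures are approximate.

```python
# Mean endpoint returns (%) by horizon in days, from the table above.
baseline = {5: 0.50, 10: 0.95, 20: 1.66, 30: 2.79}
v2 = {5: 0.73, 10: 1.24, 20: 2.18, 30: 2.94}

# Absolute lift in percentage points and relative lift in percent.
abs_lift_pp = {h: round(v2[h] - baseline[h], 2) for h in baseline}
rel_lift_pct = {h: round((v2[h] / baseline[h] - 1) * 100, 1) for h in baseline}

for h in baseline:
    print(f"{h}d: {abs_lift_pp[h]:+.2f} pp, {rel_lift_pct[h]:+.1f}% relative")
```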

7. Quality vs quantity tradeoff

Expectancy proxy = mean(MFE) × P(MFE ≥ 15 %).

Metric	Baseline	V2	V1
n (with outcomes)	15,528	7,425	5,154
mean MFE %	13.92	14.55	14.03
P(MFE ≥ 15 %)	27.0 %	27.1 %	27.0 %
Per-pattern proxy	3.76	3.94	3.79
Population-total proxy (per-pattern × n)	58,370	29,266	19,561

Interpretation. V2 shows a real per-pattern quality lift, modest in size: +4.9% on the expectancy proxy (3.94 vs 3.76), driven by mean MFE moving from 13.92% to 14.55% while the hit-rate at +15% MFE is unchanged. The lift extends across all 5/10/20/30-day endpoint means (§6). Importantly, downside is not degraded: MAE distribution and stop-loss-trigger rates are within 0.4 pp of baseline.

So V2 finds about half as many patterns at slightly better per-pattern quality, in contrast to V1 which found a third as many at the same per-pattern quality. V2's population-total expectancy proxy (29,266) is ~50% of baseline — fewer-but-better, not just fewer.
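The expectancy proxy can be recomputed from the table inputs as a sanity check. The sketch below uses the rounded mean MFE and hit-rate values shown in the table, so the population totals land close to, but not exactly on, the reported figures, which were presumably computed from unrounded values.

```python
# Inputs from the tradeoff table (hit rate as a fraction).
rows = {
    "Baseline": {"n": 15528, "mean_mfe": 13.92, "p_hit": 0.270, "reported_total": 58370},
    "V2": {"n": 7425, "mean_mfe": 14.55, "p_hit": 0.271, "reported_total": 29266},
    "V1": {"n": 5154, "mean_mfe": 14.03, "p_hit": 0.270, "reported_total": 19561},
}


def expectancy_proxy(mean_mfe, p_hit):
    """Per-pattern proxy: mean(MFE) x P(MFE >= 15 %)."""
    return mean_mfe * p_hit


per_pattern = {
    name: round(expectancy_proxy(r["mean_mfe"], r["p_hit"]), 2) for name, r in rows.items()
}
population_total = {
    name: expectancy_proxy(r["mean_mfe"], r["p_hit"]) * r["n"] for name, r in rows.items()
}
# Relative gap between the recomputed and reported totals (rounding effect).
rounding_gap = {
    name: abs(population_total[name] - r["reported_total"]) / r["reported_total"]
    for name, r in rows.items()
}
```

Because the hit rate is essentially flat across populations, the per-pattern differences come almost entirely from mean MFE, and the population totals scale with n.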

8. Recommendation summary

Variant 2 would be a clear improvement over Baseline if the consumer of the detector values a smaller, slightly higher-quality candidate set with equivalent downside and accepts a halving of detection volume. The MFE uplift is small in absolute terms (+0.63 pp on the mean, +0.7 pp at the 90th percentile, +0.7 pp at the ≥50% MFE tail), but endpoint means are above baseline at every horizon and the uplift is not bought with worse drawdowns. Per-pattern expectancy is ~5% higher than baseline. The V1 comparison does not isolate this gain to the tightened flagpole window (flagpole.max_duration_bars = 5): V1 applies the same flagpole change and shows no lift. The only configuration difference between V1 and V2 is the pennant window (7–17 vs 10–20), so the lift appears when the flagpole change is paired with the milder pennant shift and disappears under the larger one. Whether the flagpole change alone would produce it is not tested here. El Don decides.

Artifacts preserved under ab_test/: run_v2.py, analyze_v2.py, variant_v2_events.parquet, variant_v2_outcomes.parquet, summary_v2.json, run_v2.log, plus all Phase 11a v1 artifacts.