Phase 11a-3: Pennant criteria A/B test (Variants 3 & 4)

Continuation of the Phase 11a/11a-2 sweep, pushing flagpole.max_duration_bars tighter (5 → 3 → 2). Baseline events and outcomes are reused unchanged from Phase 11a. V1/V2 numbers are pulled from prior reports for context.

Override mechanism, outcome formula, and universe are identical to Phase 11a. No production code, config, or data modified.

1. Configuration

Main configurations under test (Baseline / V3 / V4):

Parameter	Baseline	Variant 3	Variant 4
pennant.min_duration_bars	5	6	6
pennant.max_duration_bars	15	17	17
flagpole.min_duration_bars	1	1	1
flagpole.max_duration_bars	10	3	2
All other criteria	—	unchanged	unchanged

For reference (prior variants):

Parameter	V1 (Phase 11a)	V2 (Phase 11a-2)
pennant.{min,max}	10 – 20	7 – 17
flagpole.max	5	5
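
The two tables above can be read as per-variant overrides on top of the baseline. The sketch below is illustrative only: the nested-dict layout, the dotted-key convention, and the `apply_overrides` helper are assumptions, not the Phase 11a override mechanism, and V1/V2 are assumed to keep the baseline `flagpole.min_duration_bars`.

```python
import copy

# Baseline values as listed in the configuration table.
BASELINE = {
    "pennant": {"min_duration_bars": 5, "max_duration_bars": 15},
    "flagpole": {"min_duration_bars": 1, "max_duration_bars": 10},
}

# Only the keys that differ from baseline (dotted keys are a hypothetical convention).
OVERRIDES = {
    "V1": {"pennant.min_duration_bars": 10, "pennant.max_duration_bars": 20,
           "flagpole.max_duration_bars": 5},
    "V2": {"pennant.min_duration_bars": 7, "pennant.max_duration_bars": 17,
           "flagpole.max_duration_bars": 5},
    "V3": {"pennant.min_duration_bars": 6, "pennant.max_duration_bars": 17,
           "flagpole.max_duration_bars": 3},
    "V4": {"pennant.min_duration_bars": 6, "pennant.max_duration_bars": 17,
           "flagpole.max_duration_bars": 2},
}


def apply_overrides(base, overrides):
    """Return a deep copy of `base` with dotted-key overrides applied."""
    cfg = copy.deepcopy(base)
    for dotted, value in overrides.items():
        section, key = dotted.split(".")
        if key not in cfg[section]:
            raise KeyError(f"unknown parameter: {dotted}")
        cfg[section][key] = value
    return cfg


configs = {name: apply_overrides(BASELINE, ov) for name, ov in OVERRIDES.items()}
```

Working on a deep copy keeps the baseline dict untouched, which mirrors the report's statement that no production config was modified.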

Date range scanned: 2007-02-15 → 2026-05-08. Universe: 2,974 active tickers (2,413 with ≥300 bars).

2. Detection counts

Population	Events	vs Baseline
Baseline	15,534	1.000×
Variant 3	6,304	0.406×
Variant 4	4,637	0.299×
V1 (ref)	5,155	0.332×
V2 (ref)	7,428	0.478×

V3 sits between V1 and V2 in count (5,155 < 6,304 < 7,428); V4, at 4,637, falls below V1.
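
As a quick check, the "vs Baseline" column can be recomputed from the event counts in the table. This is plain arithmetic on the reported numbers; the dict layout is just for illustration.

```python
# Event counts copied from the detection-count table.
counts = {
    "Baseline": 15534,
    "Variant 3": 6304,
    "Variant 4": 4637,
    "V1 (ref)": 5155,
    "V2 (ref)": 7428,
}

# Ratio of each population to the baseline count, to three decimals.
ratios = {name: round(n / counts["Baseline"], 3) for name, n in counts.items()}

# Populations ordered from fewest to most events.
ranked = sorted(counts, key=counts.get)

for name in ranked:
    print(f"{name:10s} {counts[name]:>6,d}  {ratios[name]:.3f}x")
```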

Per-year counts (Baseline / V3 / V4)

Year	Baseline	V3	V4	V3/B	V4/B
2007	400	161	122	0.40	0.31
2008	226	90	63	0.40	0.28
2009	702	235	162	0.33	0.23
2010	619	220	156	0.36	0.25
2011	471	196	146	0.42	0.31
2012	564	211	154	0.37	0.27
2013	875	302	215	0.35	0.25
2014	498	214	164	0.43	0.33
2015	533	224	175	0.42	0.33
2016	806	267	195	0.33	0.24
2017	871	334	247	0.38	0.28
2018	719	325	229	0.45	0.32
2019	748	371	290	0.50	0.39
2020	1,168	400	286	0.34	0.24
2021	1,336	509	342	0.38	0.26
2022	533	200	130	0.38	0.24
2023	1,024	500	367	0.49	0.36
2024	1,780	817	646	0.46	0.36
2025	1,232	525	404	0.43	0.33
2026 (YTD)	429	203	144	0.47	0.34

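Both ratios stay within a fairly narrow band across regimes (V3/B 0.33–0.50, V4/B 0.23–0.39), with retention somewhat higher in 2019 and 2023–2026; no year is disproportionately filtered.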
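
One way to put a number on the spread is to recompute the per-year retention ratios and look at their extremes. The sketch below uses the counts from the per-year table; the tuple layout is an assumption made for the example.

```python
# year -> (baseline, V3, V4) event counts, copied from the per-year table.
per_year = {
    2007: (400, 161, 122), 2008: (226, 90, 63), 2009: (702, 235, 162),
    2010: (619, 220, 156), 2011: (471, 196, 146), 2012: (564, 211, 154),
    2013: (875, 302, 215), 2014: (498, 214, 164), 2015: (533, 224, 175),
    2016: (806, 267, 195), 2017: (871, 334, 247), 2018: (719, 325, 229),
    2019: (748, 371, 290), 2020: (1168, 400, 286), 2021: (1336, 509, 342),
    2022: (533, 200, 130), 2023: (1024, 500, 367), 2024: (1780, 817, 646),
    2025: (1232, 525, 404), 2026: (429, 203, 144),
}

# Retention ratio per year for each variant.
v3_ratio = {y: v3 / b for y, (b, v3, v4) in per_year.items()}
v4_ratio = {y: v4 / b for y, (b, v3, v4) in per_year.items()}


def extremes(ratio_by_year):
    """Return (year of lowest retention, year of highest retention)."""
    lo = min(ratio_by_year, key=ratio_by_year.get)
    hi = max(ratio_by_year, key=ratio_by_year.get)
    return lo, hi


v3_lo, v3_hi = extremes(v3_ratio)
v4_lo, v4_hi = extremes(v4_ratio)
print(f"V3/B: {v3_ratio[v3_lo]:.2f} ({v3_lo}) to {v3_ratio[v3_hi]:.2f} ({v3_hi})")
print(f"V4/B: {v4_ratio[v4_lo]:.2f} ({v4_lo}) to {v4_ratio[v4_hi]:.2f} ({v4_hi})")
```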

Per-sector counts

Sector	Baseline	V3	V4	V3/B	V4/B
Healthcare	3,206	1,367	1,000	0.43	0.31
Technology	2,694	1,170	920	0.43	0.34
Industrials	2,538	1,092	800	0.43	0.32
Financial Services	2,067	726	523	0.35	0.25
Consumer Cyclical	1,916	791	565	0.41	0.29
Consumer Defensive	744	315	258	0.42	0.35
Energy	677	220	134	0.32	0.20
Basic Materials	672	252	183	0.38	0.27
Communication Services	590	233	158	0.39	0.27
Real Estate	328	107	77	0.33	0.23
Utilities	102	31	19	0.30	0.19

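The sector mix is broadly preserved. Energy, Utilities, and Real Estate are filtered somewhat harder (V3/B of 0.30–0.33, against 0.35–0.43 for the other sectors), and the pattern is consistent across V3 and V4.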

3. Overlap analysis

Baseline vs V3

Measure	Count
Exact-anchor match	1,749
Fuzzy ±1d (baseline events with ≥1 nearby V3)	3,088
Fuzzy ±1d (V3 events with ≥1 nearby baseline)	3,088
Baseline-only	12,446
V3-only	3,216

Baseline vs V4

Measure	Count
Exact-anchor match	1,073
Fuzzy ±1d (baseline events with ≥1 nearby V4)	1,975
Fuzzy ±1d (V4 events with ≥1 nearby baseline)	1,975
Baseline-only	13,559
V4-only	2,662

V3 has 51% net-new events vs baseline; V4 has 57% net-new. Both populations are substantially distinct from baseline, not subsets of it.
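
The overlap measures in these tables can be sketched as follows. This is a minimal illustration, not the code in `analyze_v3_v4.py`: it assumes each event is a `(symbol, anchor_date)` tuple and that "±1d" means calendar days (it could equally mean trading bars), and `overlap_counts` is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import date, timedelta


def overlap_counts(a_events, b_events, tolerance_days=1):
    """Compare two collections of (symbol, anchor_date) events."""
    a_set, b_set = set(a_events), set(b_events)

    def index_by_symbol(events):
        index = defaultdict(set)
        for symbol, anchor in events:
            index[symbol].add(anchor)
        return index

    offsets = [timedelta(days=k) for k in range(-tolerance_days, tolerance_days + 1)]

    def matched(events, other_index):
        # An event is matched if the other set has the same symbol with an
        # anchor within +/- tolerance_days calendar days.
        return sum(
            1
            for symbol, anchor in events
            if any(anchor + off in other_index.get(symbol, ()) for off in offsets)
        )

    a_matched = matched(a_set, index_by_symbol(b_set))
    b_matched = matched(b_set, index_by_symbol(a_set))
    return {
        "exact": len(a_set & b_set),
        "a_fuzzy": a_matched,
        "b_fuzzy": b_matched,
        "a_only": len(a_set) - a_matched,
        "b_only": len(b_set) - b_matched,
    }


# Toy data, not taken from the report.
a = [("AAA", date(2024, 1, 2)), ("AAA", date(2024, 1, 10)), ("BBB", date(2024, 3, 5))]
b = [("AAA", date(2024, 1, 3)), ("BBB", date(2024, 3, 5)), ("CCC", date(2024, 2, 1))]
result = overlap_counts(a, b)
```

The two fuzzy counts are computed separately because, in general, several events on one side can match a single event on the other, so the directions need not agree.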

V3 vs V4

Measure	Count
V3 total	6,304
V4 total	4,637
Exact-anchor match	3,402
Fuzzy ±1d match	3,800
V3-only (no V4 within ±1d)	2,504
V4-only (no V3 within ±1d)	837

Interpretation. V4 is much closer to a subset of V3 than V3 is to baseline: 82 % of V4 events have a matching V3 event within ±1 day. Tightening flagpole.max from 3 → 2 mostly removes patterns rather than shifting the population. The remaining 837 V4-only events (18 %) are most plausibly cases where the inner-loop dedup picks a different anchor on the same symbol once the longer flagpoles are excluded, so the qualitative population is essentially a culled V3.
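
The percentages quoted in this section follow directly from the overlap tables. The snippet below recomputes them from the reported counts; the `share` helper is only a convenience for the example.

```python
def share(part, total):
    """Percentage of `total` represented by `part`, rounded to a whole percent."""
    return round(100 * part / total)


v3_total, v4_total = 6304, 4637

# Events with no baseline event within +/-1 day ("net-new" vs baseline).
v3_net_new = share(3216, v3_total)
v4_net_new = share(2662, v4_total)

# V4 events with / without a V3 event within +/-1 day.
v4_in_v3 = share(3800, v4_total)
v4_only = share(837, v4_total)

print(v3_net_new, v4_net_new, v4_in_v3, v4_only)
```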

4. MFE distribution comparison

Stat (MFE %)	Baseline	V3	V4	V2 (ref)	V1 (ref)
Mean	13.92	14.62	14.64	14.55	14.03
Median	7.50	7.66	7.64	7.64	7.47
P25	2.67	2.70	2.66	2.74	2.63
P75	16.01	16.07	15.78	15.95	15.97
P90	30.87	32.15	31.70	31.58	30.54

Hit rate at MFE thresholds

Threshold	Baseline	V3	V4	V2 (ref)	V1 (ref)
≥ 5 %	61.5 %	62.6 %	62.0 %	62.2 %	62.0 %
≥ 10 %	40.5 %	40.8 %	40.7 %	40.9 %	40.1 %
≥ 15 %	27.0 %	27.0 %	26.5 %	27.1 %	27.0 %
≥ 20 %	18.9 %	18.5 %	18.6 %	18.7 %	18.7 %
≥ 30 %	10.5 %	11.1 %	11.0 %	10.9 %	10.5 %
≥ 50 %	4.3 %	4.9 %	4.9 %	5.0 %	4.4 %

V3 and V4 carry the same mean-MFE lift as V2 (≈ +0.7 pp on the mean, ~+0.6 pp on the 50% tail), but V4 shows a small drop in the ≥15% hit rate (26.5% vs baseline 27.0%, vs V3's 27.0% and V2's 27.1%). This is the first metric in the sweep where tightening further actively underperforms a less-tightened sibling.
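
For reference, a hit-rate table of this kind can be produced from a list of per-event MFE values as sketched below. The function name and the toy sample are assumptions; the report's own outcome formula is defined in Phase 11a and is not reproduced here.

```python
def hit_rates(mfe_pct, thresholds=(5, 10, 15, 20, 30, 50)):
    """Percentage of events whose MFE (in %) reaches each threshold."""
    if not mfe_pct:
        raise ValueError("need at least one event")
    n = len(mfe_pct)
    return {t: 100 * sum(1 for x in mfe_pct if x >= t) / n for t in thresholds}


# Toy sample of MFE percentages, not taken from the report.
sample = [1.0, 6.0, 12.0, 18.0, 55.0]
rates = hit_rates(sample)
for threshold, pct in rates.items():
    print(f">= {threshold:>2d} %: {pct:.1f} %")
```

Because every threshold is evaluated against the same denominator, the rates are non-increasing as the threshold rises, which is a cheap sanity check on any recomputed table.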

5. MAE distribution comparison

Stat (MAE %)	Baseline	V3	V4
Mean	−9.54	−9.63	−9.50
Median	−6.59	−6.49	−6.35
P25	−13.64	−13.98	−13.77
P75	−2.23	−2.29	−2.18
P10	−23.21	−23.65	−23.30

Stop-loss-relevant loss rates

MAE worse than…	Baseline	V3	V4	V2 (ref)
−5 %	58.5 %	58.1 %	57.7 %	58.3 %
−7 %	47.9 %	47.8 %	47.3 %	48.3 %
−10 %	35.9 %	36.3 %	35.6 %	36.0 %
−15 %	21.7 %	22.2 %	21.6 %	22.0 %

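V4 shows the largest downside improvement at the −5 % and −7 % thresholds (−0.8 pp and −0.6 pp vs baseline, respectively) and is the only variant in the table that sits below baseline at all four thresholds. V3 also edges out baseline at −5 % and −7 % (−0.4 pp and −0.1 pp) but is slightly worse at −10 % and −15 %; it stays within 0.5 pp of baseline at every threshold.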
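
The percentage-point differences can be recomputed from the loss-rate table. The snippet below only restates the reported numbers; negative deltas mean fewer events breached the threshold than in the baseline.

```python
# Share of events (in %) whose MAE is worse than each threshold, from the table.
loss_rates = {
    -5: {"Baseline": 58.5, "V3": 58.1, "V4": 57.7, "V2": 58.3},
    -7: {"Baseline": 47.9, "V3": 47.8, "V4": 47.3, "V2": 48.3},
    -10: {"Baseline": 35.9, "V3": 36.3, "V4": 35.6, "V2": 36.0},
    -15: {"Baseline": 21.7, "V3": 22.2, "V4": 21.6, "V2": 22.0},
}

# Delta vs baseline in percentage points, per threshold and variant.
deltas = {
    thr: {v: round(rate - row["Baseline"], 1) for v, rate in row.items() if v != "Baseline"}
    for thr, row in loss_rates.items()
}

for thr, row in deltas.items():
    print(f"MAE worse than {thr} %:", row)
```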

6. Time-to-MFE-peak + endpoint returns

Stat (days to MFE)	Baseline	V3	V4
Mean	15.7	15.6	15.4
Median	16	16	15
P90	30	29	29

Bucket distribution

Days-to-peak	Baseline	V3	V4
1 – 5	25.4 %	25.3 %	25.5 %
6 – 10	12.8 %	13.0 %	13.1 %
11 – 15	11.0 %	11.5 %	11.6 %
16 – 20	10.8 %	11.6 %	11.7 %
21 – 30	39.9 %	38.7 %	38.0 %

V4's median peak day shifts from 16 → 15; combined with the U-shape attenuating slightly at the 21–30 bucket, V4 patterns resolve modestly faster on average than baseline.
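
A bucket distribution like the one above can be derived from per-event days-to-peak values as in the sketch below. It assumes every value lies in 1 to 30 (the bucket edges suggest a 30-day outcome window, but that is an inference), and the sample data is made up.

```python
# Inclusive bucket edges matching the table rows.
BUCKETS = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 30)]


def bucket_shares(days_to_peak):
    """Percentage of events falling into each days-to-peak bucket."""
    if not days_to_peak:
        raise ValueError("need at least one event")
    counts = {bucket: 0 for bucket in BUCKETS}
    for d in days_to_peak:
        for lo, hi in BUCKETS:
            if lo <= d <= hi:
                counts[(lo, hi)] += 1
                break
        else:
            # No bucket matched: value is outside the assumed 1..30 window.
            raise ValueError(f"days-to-peak out of range: {d}")
    n = len(days_to_peak)
    return {f"{lo}-{hi}": 100 * c / n for (lo, hi), c in counts.items()}


# Toy sample, not taken from the report.
shares = bucket_shares([1, 3, 7, 12, 18, 25, 30, 30])
print(shares)
```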

Endpoint returns (means, %)

Horizon	Baseline	V3	V4	V2 (ref)
5d	0.50	0.69	0.67	0.73
10d	0.95	1.17	1.27	1.24
20d	1.66	2.19	2.45	2.18
30d	2.79	2.87	2.71	2.94

V4 has the strongest 20-day lift in the entire sweep (+0.79 pp on mean), but is the only variant whose 30-day endpoint mean falls below baseline (2.71 vs 2.79). Combined with the faster median peak day, V4 patterns appear to reach their peak earlier and then mean-revert.
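
The lifts quoted above are simple differences against the baseline column. The snippet below recomputes them from the endpoint table for the variants listed there.

```python
# Mean endpoint returns (%) by horizon, copied from the endpoint table.
endpoint_means = {
    "5d": {"Baseline": 0.50, "V3": 0.69, "V4": 0.67, "V2": 0.73},
    "10d": {"Baseline": 0.95, "V3": 1.17, "V4": 1.27, "V2": 1.24},
    "20d": {"Baseline": 1.66, "V3": 2.19, "V4": 2.45, "V2": 2.18},
    "30d": {"Baseline": 2.79, "V3": 2.87, "V4": 2.71, "V2": 2.94},
}

# Lift vs baseline in percentage points.
lift = {
    horizon: {v: round(m - row["Baseline"], 2) for v, m in row.items() if v != "Baseline"}
    for horizon, row in endpoint_means.items()
}

best_20d = max(lift["20d"], key=lift["20d"].get)
below_baseline_30d = [v for v, d in lift["30d"].items() if d < 0]

print(lift["20d"], best_20d)
print(lift["30d"], below_baseline_30d)
```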

7. Quality vs quantity tradeoff

Expectancy proxy = mean(MFE) × P(MFE ≥ 15 %).

Metric	Baseline	V3	V4	V2 (ref)	V1 (ref)
n (with outcomes)	15,528	6,303	4,636	7,425	5,154
mean MFE %	13.92	14.62	14.64	14.55	14.03
P(MFE ≥ 15 %)	27.0 %	27.0 %	26.5 %	27.1 %	27.0 %
Per-pattern proxy	3.76	3.95	3.87	3.94	3.79
Population-total proxy (per-pattern × n)	58,370	24,879	17,967	29,266	19,561

Per-pattern expectancy:

- Baseline → V1: +0.8 % (3.76 → 3.79)
- Baseline → V2: +4.9 % (3.76 → 3.94)
- Baseline → V3: +5.1 % (3.76 → 3.95)
- Baseline → V4: +2.9 % (3.76 → 3.87)

The lift saturates between V2 and V3 (3.94 vs 3.95 — within rounding), and regresses from V3 to V4 (3.95 → 3.87). The regression is driven by the ≥15% hit rate dropping from 27.0 % to 26.5 % rather than by mean MFE — V4 patterns produce slightly fatter right and left tails without making the middle-distance targets more reliable.
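
The per-pattern proxy can be approximately reproduced from the rounded table inputs, as sketched below. Because the inputs are already rounded, the last digit can differ from the reported value (V4 comes out at 3.88 here versus the reported 3.87); the report's figures were presumably computed from unrounded data.

```python
# variant -> (mean MFE %, P(MFE >= 15 %) in %), copied from the tradeoff table.
inputs = {
    "Baseline": (13.92, 27.0),
    "V1": (14.03, 27.0),
    "V2": (14.55, 27.1),
    "V3": (14.62, 27.0),
    "V4": (14.64, 26.5),
}


def expectancy_proxy(mean_mfe_pct, hit_rate_pct):
    """Expectancy proxy = mean(MFE) x P(MFE >= 15 %), with P as a fraction."""
    return mean_mfe_pct * hit_rate_pct / 100


proxy = {name: expectancy_proxy(m, p) for name, (m, p) in inputs.items()}

# Relative lift over baseline, in percent.
rel_lift = {
    name: 100 * (value / proxy["Baseline"] - 1)
    for name, value in proxy.items()
    if name != "Baseline"
}

# Variants ordered from highest to lowest per-pattern proxy.
ranking = sorted(proxy, key=proxy.get, reverse=True)
for name in ranking:
    print(f"{name:8s} {proxy[name]:.2f}")
```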

8. Recommendation summary

This sweep was designed to answer two questions; the numbers answer both factually.

(a) Is the quality lift monotonic in flagpole tightness? No. The per-pattern expectancy proxy goes Baseline 3.76 → V2 3.94 → V3 3.95 → V4 3.87. The lever saturates between flagpole.max = 5 and 3 and reverses between 3 and 2. The mean-MFE component continues to climb very slightly (14.55 → 14.62 → 14.64), but the hit rate at +15 % MFE drops at V4 (27.1 % → 27.0 % → 26.5 %).

(b) Does the answer change between V3 (max = 3) and V4 (max = 2)? Yes. V4 has gone past the useful point in two ways:

1. ≥ 15 % MFE hit rate falls below baseline (26.5 % vs 27.0 %), and per-pattern expectancy proxy drops (3.95 → 3.87).
2. 30-day endpoint mean falls below baseline (2.71 % vs 2.79 %), the only variant in the sweep where this happens. V4 patterns peak earlier (median day 15 vs 16) and then fade.

V4 does produce the strongest 20-day endpoint mean in the sweep (+0.79 pp vs baseline) and the cleanest downside profile (−0.8 pp at the −5 % MAE threshold). These could matter if the consumer holds for a fixed 20-day window with a tight stop, but the overall expectancy proxy is lower than V3's.

V3 is the local optimum of the lever as currently parameterised. Per-pattern quality matches V2 within rounding, with a tighter flagpole window and ~15 % fewer events than V2 (6,304 vs 7,428). The gain over baseline appears to come mainly from the flagpole window, although it is not fully isolated: V3 also carries a mild pennant shift (5–15 → 6–17), similar in spirit to V2's (7–17), and achieves the same per-pattern lift. Going from V3 to V4 trades real expectancy for a marginally cleaner downside.

El Don decides.

Artifacts preserved under ab_test/: run_v3_v4.py, analyze_v3_v4.py, variant_v3_events.parquet, variant_v3_outcomes.parquet, variant_v4_events.parquet, variant_v4_outcomes.parquet, summary_v3.json, summary_v4.json, summary_v3_v4_overlap.json, run_v3_v4.log, plus all baseline + v1 + v2 artifacts from prior phases.