Phase 11a-3: Pennant criteria A/B test (Variants 3 & 4)
Continuation of the Phase 11a/11a-2 sweep, pushing flagpole.max_duration_bars
tighter (5 → 3 → 2). Baseline events and outcomes are reused unchanged
from Phase 11a. V1/V2 numbers are pulled from prior reports for context.
Override mechanism, outcome formula, and universe are identical to Phase 11a. No production code, config, or data modified.
1. Configuration
Main configurations under test (Baseline / V3 / V4):
| Parameter | Baseline | Variant 3 | Variant 4 |
|---|---|---|---|
| pennant.min_duration_bars | 5 | 6 | 6 |
| pennant.max_duration_bars | 15 | 17 | 17 |
| flagpole.min_duration_bars | 1 | 1 | 1 |
| flagpole.max_duration_bars | 10 | 3 | 2 |
| All other criteria | — | unchanged | unchanged |
For reference (prior variants):
| Parameter | V1 (Phase 11a) | V2 (Phase 11a-2) |
|---|---|---|
| pennant.{min,max} | 10 – 20 | 7 – 17 |
| flagpole.max | 5 | 5 |
Date range scanned: 2007-02-15 → 2026-05-08. Universe: 2,974 active tickers (2,413 with ≥300 bars).
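The variant definitions in the tables above amount to a small set of overrides applied to the baseline criteria. A minimal sketch of that idea follows. The nested-dict layout, the dotted-key convention, and the `apply_overrides` helper are hypothetical illustrations, not the Phase 11a override mechanism itself; only the parameter names and values come from the table.

```python
from copy import deepcopy

# Baseline criteria, restricted to the parameters listed in the table above.
# The nested-dict layout is an assumption made for illustration.
BASELINE = {
    "pennant": {"min_duration_bars": 5, "max_duration_bars": 15},
    "flagpole": {"min_duration_bars": 1, "max_duration_bars": 10},
}

# Per-variant overrides as dotted keys (hypothetical convention).
VARIANTS = {
    "v3": {
        "pennant.min_duration_bars": 6,
        "pennant.max_duration_bars": 17,
        "flagpole.max_duration_bars": 3,
    },
    "v4": {
        "pennant.min_duration_bars": 6,
        "pennant.max_duration_bars": 17,
        "flagpole.max_duration_bars": 2,
    },
}


def apply_overrides(base, overrides):
    """Return a copy of `base` with dotted-key overrides applied.

    The baseline dict is never mutated, mirroring the report's statement
    that no production config is modified.
    """
    cfg = deepcopy(base)
    for dotted_key, value in overrides.items():
        section, name = dotted_key.split(".")
        if name not in cfg[section]:
            raise KeyError(f"unknown parameter: {dotted_key}")
        cfg[section][name] = value
    return cfg


v3 = apply_overrides(BASELINE, VARIANTS["v3"])
v4 = apply_overrides(BASELINE, VARIANTS["v4"])
print(v3["flagpole"], v4["flagpole"])
```

Rejecting unknown keys keeps a typo in an override from silently running the baseline configuration under a variant label.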
2. Detection counts
| Population | Events | vs Baseline |
|---|---|---|
| Baseline | 15,534 | 1.000× |
| Variant 3 | 6,304 | 0.406× |
| Variant 4 | 4,637 | 0.299× |
| V1 (ref) | 5,155 | 0.332× |
| V2 (ref) | 7,428 | 0.478× |
V3 sits between V1 and V2 in count; V4 falls below V1 (4,637 vs 5,155).
Per-year counts (Baseline / V3 / V4)
| Year | Baseline | V3 | V4 | V3/B | V4/B |
|---|---|---|---|---|---|
| 2007 | 400 | 161 | 122 | 0.40 | 0.31 |
| 2008 | 226 | 90 | 63 | 0.40 | 0.28 |
| 2009 | 702 | 235 | 162 | 0.33 | 0.23 |
| 2010 | 619 | 220 | 156 | 0.36 | 0.25 |
| 2011 | 471 | 196 | 146 | 0.42 | 0.31 |
| 2012 | 564 | 211 | 154 | 0.37 | 0.27 |
| 2013 | 875 | 302 | 215 | 0.35 | 0.25 |
| 2014 | 498 | 214 | 164 | 0.43 | 0.33 |
| 2015 | 533 | 224 | 175 | 0.42 | 0.33 |
| 2016 | 806 | 267 | 195 | 0.33 | 0.24 |
| 2017 | 871 | 334 | 247 | 0.38 | 0.28 |
| 2018 | 719 | 325 | 229 | 0.45 | 0.32 |
| 2019 | 748 | 371 | 290 | 0.50 | 0.39 |
| 2020 | 1,168 | 400 | 286 | 0.34 | 0.24 |
| 2021 | 1,336 | 509 | 342 | 0.38 | 0.26 |
| 2022 | 533 | 200 | 130 | 0.38 | 0.24 |
| 2023 | 1,024 | 500 | 367 | 0.49 | 0.36 |
| 2024 | 1,780 | 817 | 646 | 0.46 | 0.36 |
| 2025 | 1,232 | 525 | 404 | 0.43 | 0.33 |
| 2026 (YTD) | 429 | 203 | 144 | 0.47 | 0.34 |
Both ratios are reasonably stable across regimes (V3/B between 0.33 and 0.50, V4/B between 0.23 and 0.39); no year is disproportionately filtered.
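The spread of the per-year ratios can be re-derived directly from the counts in the table. The sketch below copies those counts and reports the years with the lowest and highest variant-to-baseline ratio; the `ratio_range` helper is an illustrative name, not part of the analysis scripts.

```python
# Per-year event counts copied from the table above: (baseline, V3, V4).
COUNTS = {
    2007: (400, 161, 122), 2008: (226, 90, 63), 2009: (702, 235, 162),
    2010: (619, 220, 156), 2011: (471, 196, 146), 2012: (564, 211, 154),
    2013: (875, 302, 215), 2014: (498, 214, 164), 2015: (533, 224, 175),
    2016: (806, 267, 195), 2017: (871, 334, 247), 2018: (719, 325, 229),
    2019: (748, 371, 290), 2020: (1168, 400, 286), 2021: (1336, 509, 342),
    2022: (533, 200, 130), 2023: (1024, 500, 367), 2024: (1780, 817, 646),
    2025: (1232, 525, 404), 2026: (429, 203, 144),
}


def ratio_range(counts, idx):
    """Return ((year, ratio) at minimum, (year, ratio) at maximum).

    `idx` selects the variant column: 1 for V3, 2 for V4.
    """
    ratios = {year: row[idx] / row[0] for year, row in counts.items()}
    lo = min(ratios, key=ratios.get)
    hi = max(ratios, key=ratios.get)
    return (lo, ratios[lo]), (hi, ratios[hi])


v3_lo, v3_hi = ratio_range(COUNTS, 1)
v4_lo, v4_hi = ratio_range(COUNTS, 2)
print(f"V3/B: min {v3_lo[1]:.2f} ({v3_lo[0]}), max {v3_hi[1]:.2f} ({v3_hi[0]})")
print(f"V4/B: min {v4_lo[1]:.2f} ({v4_lo[0]}), max {v4_hi[1]:.2f} ({v4_hi[0]})")
```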
Per-sector counts
| Sector | Baseline | V3 | V4 | V3/B | V4/B |
|---|---|---|---|---|---|
| Healthcare | 3,206 | 1,367 | 1,000 | 0.43 | 0.31 |
| Technology | 2,694 | 1,170 | 920 | 0.43 | 0.34 |
| Industrials | 2,538 | 1,092 | 800 | 0.43 | 0.32 |
| Financial Services | 2,067 | 726 | 523 | 0.35 | 0.25 |
| Consumer Cyclical | 1,916 | 791 | 565 | 0.41 | 0.29 |
| Consumer Defensive | 744 | 315 | 258 | 0.42 | 0.35 |
| Energy | 677 | 220 | 134 | 0.32 | 0.20 |
| Basic Materials | 672 | 252 | 183 | 0.38 | 0.27 |
| Communication Services | 590 | 233 | 158 | 0.39 | 0.27 |
| Real Estate | 328 | 107 | 77 | 0.33 | 0.23 |
| Utilities | 102 | 31 | 19 | 0.30 | 0.19 |
The sector mix is broadly preserved. Energy, Utilities, and Real Estate are filtered somewhat harder (V3/B of 0.30 to 0.33 against 0.41 overall), consistently across V3 and V4.
3. Overlap analysis
Baseline vs V3
| Measure | Count |
|---|---|
| Exact-anchor match | 1,749 |
| Fuzzy ±1d (baseline events with ≥1 nearby V3) | 3,088 |
| Fuzzy ±1d (V3 events with ≥1 nearby baseline) | 3,088 |
| Baseline-only | 12,446 |
| V3-only | 3,216 |
Baseline vs V4
| Measure | Count |
|---|---|
| Exact-anchor match | 1,073 |
| Fuzzy ±1d (baseline events with ≥1 nearby V4) | 1,975 |
| Fuzzy ±1d (V4 events with ≥1 nearby baseline) | 1,975 |
| Baseline-only | 13,559 |
| V4-only | 2,662 |
51 % of V3 events (3,216 of 6,304) and 57 % of V4 events (2,662 of 4,637) have no baseline event within ±1 day. Both populations are substantially distinct from baseline, not subsets of it.
V3 vs V4
| Measure | Count |
|---|---|
| V3 total | 6,304 |
| V4 total | 4,637 |
| Exact-anchor match | 3,402 |
| Fuzzy ±1d match | 3,800 |
| V3-only (no V4 within ±1d) | 2,504 |
| V4-only (no V3 within ±1d) | 837 |
Interpretation. V4 is much closer to a subset of V3 than V3 is to baseline: 82 % of V4 events have a matching V3 event within ±1 day. Tightening flagpole.max from 3 → 2 mostly removes patterns rather than shifting the population. The remaining 837 V4-only events (18 %) indicate that the inner-loop dedup picks slightly different anchors on the same symbol, but the qualitative population is essentially a culled V3.
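The fuzzy ±1d overlap counts in this section can be illustrated with a short sketch. It assumes each event is a `(symbol, anchor_date)` pair, that "±1d" means calendar days, and that each direction is counted independently; the report does not spell out these details, so all three are assumptions, and `fuzzy_overlap` is a hypothetical helper rather than code from `analyze_v3_v4.py`.

```python
from collections import defaultdict
from datetime import date


def fuzzy_overlap(events_a, events_b, tol_days=1):
    """Count events with at least one same-symbol counterpart within tol_days.

    Events are (symbol, anchor_date) tuples. Tolerance is measured in
    calendar days (an assumption; trading bars would need a bar index).
    """
    def index(events):
        by_symbol = defaultdict(list)
        for symbol, anchor in events:
            by_symbol[symbol].append(anchor)
        return by_symbol

    def count_matched(src, other_index):
        matched = 0
        for symbol, anchor in src:
            candidates = other_index.get(symbol, ())
            if any(abs((anchor - d).days) <= tol_days for d in candidates):
                matched += 1
        return matched

    idx_a, idx_b = index(events_a), index(events_b)
    a_matched = count_matched(events_a, idx_b)
    b_matched = count_matched(events_b, idx_a)
    return {
        "a_matched": a_matched,
        "b_matched": b_matched,
        "a_only": len(events_a) - a_matched,
        "b_only": len(events_b) - b_matched,
    }


# Made-up events for illustration only.
v3_events = [("AAA", date(2024, 1, 10)), ("AAA", date(2024, 3, 5)), ("BBB", date(2024, 2, 1))]
v4_events = [("AAA", date(2024, 1, 11)), ("BBB", date(2024, 2, 9))]
result = fuzzy_overlap(v3_events, v4_events)
print(result)
```

With this definition the "only" rows are simply each total minus its matched count, which is how the tables above reconcile (for example 4,637 − 3,800 = 837 V4-only events).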
4. MFE distribution comparison
| Stat (MFE %) | Baseline | V3 | V4 | V2 (ref) | V1 (ref) |
|---|---|---|---|---|---|
| Mean | 13.92 | 14.62 | 14.64 | 14.55 | 14.03 |
| Median | 7.50 | 7.66 | 7.64 | 7.64 | 7.47 |
| P25 | 2.67 | 2.70 | 2.66 | 2.74 | 2.63 |
| P75 | 16.01 | 16.07 | 15.78 | 15.95 | 15.97 |
| P90 | 30.87 | 32.15 | 31.70 | 31.58 | 30.54 |
Hit rate at MFE thresholds
| Threshold | Baseline | V3 | V4 | V2 (ref) | V1 (ref) |
|---|---|---|---|---|---|
| ≥ 5 % | 61.5 % | 62.6 % | 62.0 % | 62.2 % | 62.0 % |
| ≥ 10 % | 40.5 % | 40.8 % | 40.7 % | 40.9 % | 40.1 % |
| ≥ 15 % | 27.0 % | 27.0 % | 26.5 % | 27.1 % | 27.0 % |
| ≥ 20 % | 18.9 % | 18.5 % | 18.6 % | 18.7 % | 18.7 % |
| ≥ 30 % | 10.5 % | 11.1 % | 11.0 % | 10.9 % | 10.5 % |
| ≥ 50 % | 4.3 % | 4.9 % | 4.9 % | 5.0 % | 4.4 % |
V3 and V4 carry roughly the same mean-MFE lift as V2 (≈ +0.7 pp on the mean, ≈ +0.6 pp on the ≥ 50 % hit rate), but V4 shows a small drop in the ≥ 15 % hit rate (26.5 %, against 27.0 % for both baseline and V3 and 27.1 % for V2). This is the first metric in the sweep where tightening further actively underperforms a less-tightened sibling.
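For reference, a hit rate at an MFE threshold is just the share of events whose MFE reaches that level. A minimal sketch, assuming MFE is already available as a list of percentages per event (the sample values are made up and `hit_rates` is an illustrative name):

```python
def hit_rates(mfe_pct, thresholds=(5, 10, 15, 20, 30, 50)):
    """Share of events (in %) whose MFE is at or above each threshold."""
    n = len(mfe_pct)
    if n == 0:
        raise ValueError("no events")
    return {t: 100.0 * sum(1 for m in mfe_pct if m >= t) / n for t in thresholds}


# Made-up MFE values (percent), for illustration only.
sample = [1.2, 4.9, 5.0, 7.5, 12.0, 16.0, 22.0, 55.0]
rates = hit_rates(sample)
for threshold, rate in rates.items():
    print(f">= {threshold:>2} %: {rate:.1f} %")
```

The thresholds are inclusive, matching the "≥" in the table headers.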
5. MAE distribution comparison
| Stat (MAE %) | Baseline | V3 | V4 |
|---|---|---|---|
| Mean | −9.54 | −9.63 | −9.50 |
| Median | −6.59 | −6.49 | −6.35 |
| P25 | −13.64 | −13.98 | −13.77 |
| P75 | −2.23 | −2.29 | −2.18 |
| P10 | −23.21 | −23.65 | −23.30 |
Stop-loss-relevant loss rates
| MAE worse than… | Baseline | V3 | V4 | V2 (ref) |
|---|---|---|---|---|
| −5 % | 58.5 % | 58.1 % | 57.7 % | 58.3 % |
| −7 % | 47.9 % | 47.8 % | 47.3 % | 48.3 % |
| −10 % | 35.9 % | 36.3 % | 35.6 % | 36.0 % |
| −15 % | 21.7 % | 22.2 % | 21.6 % | 22.0 % |
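V4 has the lowest loss rate at every threshold and is the only variant shown that beats baseline at all four (−0.8 pp at −5 %, −0.6 pp at −7 %, −0.3 pp at −10 %, −0.1 pp at −15 %). V3 also edges baseline at −5 % and −7 % (−0.4 pp and −0.1 pp) and stays within 0.5 pp of baseline at every threshold.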
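The threshold-by-threshold differences can be recomputed from the loss-rate table. The sketch below copies the table values and reports each variant's difference from baseline in percentage points; the helper name is illustrative.

```python
# Loss rates (% of events with MAE worse than the threshold), from the table above.
LOSS_RATES = {
    -5: {"baseline": 58.5, "v3": 58.1, "v4": 57.7, "v2": 58.3},
    -7: {"baseline": 47.9, "v3": 47.8, "v4": 47.3, "v2": 48.3},
    -10: {"baseline": 35.9, "v3": 36.3, "v4": 35.6, "v2": 36.0},
    -15: {"baseline": 21.7, "v3": 22.2, "v4": 21.6, "v2": 22.0},
}


def deltas_vs_baseline(table):
    """Variant minus baseline in pp; negative means fewer events breach the threshold."""
    return {
        thr: {
            name: round(rate - row["baseline"], 1)
            for name, rate in row.items()
            if name != "baseline"
        }
        for thr, row in table.items()
    }


deltas = deltas_vs_baseline(LOSS_RATES)
beats_everywhere = [
    v for v in ("v2", "v3", "v4") if all(deltas[thr][v] < 0 for thr in deltas)
]
for thr, row in deltas.items():
    print(thr, row)
print("below baseline at every threshold:", beats_everywhere)
```

In this table V4 is the only column with a negative difference at all four thresholds.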
6. Time-to-MFE-peak + endpoint returns
| Stat (days to MFE) | Baseline | V3 | V4 |
|---|---|---|---|
| Mean | 15.7 | 15.6 | 15.4 |
| Median | 16 | 16 | 15 |
| P90 | 30 | 29 | 29 |
Bucket distribution
| Days-to-peak | Baseline | V3 | V4 |
|---|---|---|---|
| 1 – 5 | 25.4 % | 25.3 % | 25.5 % |
| 6 – 10 | 12.8 % | 13.0 % | 13.1 % |
| 11 – 15 | 11.0 % | 11.5 % | 11.6 % |
| 16 – 20 | 10.8 % | 11.6 % | 11.7 % |
| 21 – 30 | 39.9 % | 38.7 % | 38.0 % |
V4's median peak day shifts from 16 to 15; combined with the U-shape attenuating slightly at the 21–30 bucket (39.9 % → 38.0 %), V4 patterns resolve modestly faster on average than baseline.
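The bucket shares above can be produced by binning each event's days-to-peak value. A minimal sketch, assuming days-to-peak is an integer from 1 to 30 (the 30-day window is inferred from the bucket edges, and `bucket_shares` is an illustrative name):

```python
from bisect import bisect_left

BUCKET_EDGES = [5, 10, 15, 20, 30]  # inclusive upper edge of each bucket
BUCKET_LABELS = ["1-5", "6-10", "11-15", "16-20", "21-30"]


def bucket_shares(days_to_peak):
    """Percent of events whose MFE peak day falls in each bucket."""
    if not days_to_peak:
        raise ValueError("no events")
    counts = dict.fromkeys(BUCKET_LABELS, 0)
    for d in days_to_peak:
        if not 1 <= d <= BUCKET_EDGES[-1]:
            raise ValueError(f"day {d} is outside the 1-30 window")
        # bisect_left returns the first bucket whose upper edge is >= d.
        counts[BUCKET_LABELS[bisect_left(BUCKET_EDGES, d)]] += 1
    n = len(days_to_peak)
    return {label: 100.0 * c / n for label, c in counts.items()}


# Made-up days-to-peak values, for illustration only.
shares = bucket_shares([1, 5, 6, 15, 16, 21, 30, 30])
print(shares)
```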
Endpoint returns (means, %)
| Horizon | Baseline | V3 | V4 | V2 (ref) |
|---|---|---|---|---|
| 5d | 0.50 | 0.69 | 0.67 | 0.73 |
| 10d | 0.95 | 1.17 | 1.27 | 1.24 |
| 20d | 1.66 | 2.19 | 2.45 | 2.18 |
| 30d | 2.79 | 2.87 | 2.71 | 2.94 |
V4 has the strongest 20-day lift in the entire sweep (+0.79 pp on mean), but is the only variant whose 30-day endpoint mean falls below baseline (2.71 vs 2.79). Combined with the faster median peak day, V4 patterns appear to reach their peak earlier and then mean-revert.
7. Quality vs quantity tradeoff
Expectancy proxy = mean(MFE) × P(MFE ≥ 15 %).
| Metric | Baseline | V3 | V4 | V2 (ref) | V1 (ref) |
|---|---|---|---|---|---|
| n (with outcomes) | 15,528 | 6,303 | 4,636 | 7,425 | 5,154 |
| mean MFE % | 13.92 | 14.62 | 14.64 | 14.55 | 14.03 |
| P(MFE ≥ 15 %) | 27.0 % | 27.0 % | 26.5 % | 27.1 % | 27.0 % |
| Per-pattern proxy | 3.76 | 3.95 | 3.87 | 3.94 | 3.79 |
| Population-total proxy (per-pattern × n) | 58,370 | 24,879 | 17,967 | 29,266 | 19,561 |
Per-pattern expectancy:
- Baseline → V1: +0.8 % (3.76 → 3.79)
- Baseline → V2: +4.9 % (3.76 → 3.94)
- Baseline → V3: +5.1 % (3.76 → 3.95)
- Baseline → V4: +2.9 % (3.76 → 3.87)
The lift saturates between V2 and V3 (3.94 vs 3.95 — within rounding), and regresses from V3 to V4 (3.95 → 3.87). The regression is driven by the ≥15% hit rate dropping from 27.0 % to 26.5 % rather than by mean MFE — V4 patterns produce slightly fatter right and left tails without making the middle-distance targets more reliable.
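The proxy defined at the top of this section, mean(MFE) × P(MFE ≥ 15 %), can be recomputed from the table inputs. The sketch below uses the rounded values printed in the table, so its results can differ slightly from the reported figures (which were presumably computed from unrounded inputs); the ordering of the variants is what it is meant to show.

```python
# (mean MFE %, P(MFE >= 15 %), n with outcomes), copied from the table above.
INPUTS = {
    "baseline": (13.92, 0.270, 15528),
    "v1": (14.03, 0.270, 5154),
    "v2": (14.55, 0.271, 7425),
    "v3": (14.62, 0.270, 6303),
    "v4": (14.64, 0.265, 4636),
}


def expectancy_proxy(mean_mfe, p_hit):
    """Per-pattern proxy = mean(MFE) x P(MFE >= 15 %)."""
    return mean_mfe * p_hit


proxy = {name: expectancy_proxy(m, p) for name, (m, p, _) in INPUTS.items()}
# Relative lift over baseline, in percent.
lift = {name: 100.0 * (value / proxy["baseline"] - 1.0) for name, value in proxy.items()}
# Population-total proxy = per-pattern proxy x number of events with outcomes.
total = {name: proxy[name] * INPUTS[name][2] for name in INPUTS}

for name in INPUTS:
    print(f"{name:8s} proxy={proxy[name]:.2f} lift={lift[name]:+.1f}% total={total[name]:,.0f}")
```

Even with rounded inputs the ranking is V3 ≥ V2 > V4 > V1 > baseline per pattern, while the population-total proxy is dominated by event count and so ranks baseline first.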
8. Recommendation summary
This sweep was designed to answer two questions; the numbers answer both factually.
(a) Is the quality lift monotonic in flagpole tightness? No. The per-pattern expectancy proxy goes Baseline 3.76 → V2 3.94 → V3 3.95 → V4 3.87. The lever saturates between flagpole.max = 5 and 3 and reverses between 3 and 2. The mean-MFE component continues to climb very slightly (14.55 → 14.62 → 14.64), but the hit rate at +15 % MFE drops at V4 (27.1 % → 27.0 % → 26.5 %).
(b) Does the answer change between V3 (max = 3) and V4 (max = 2)? Yes. V4 has gone past the useful point in two ways:
1. ≥ 15 % MFE hit rate falls below baseline (26.5 % vs 27.0 %), and per-pattern expectancy proxy drops (3.95 → 3.87).
2. 30-day endpoint mean falls below baseline (2.71 % vs 2.79 %), the only variant in the sweep where this happens. V4 patterns peak earlier (median day 15 vs 16) and then fade.
V4 does produce the strongest 20-day endpoint mean in the sweep (+0.79 pp vs baseline) and the cleanest downside profile (−0.8 pp at the −5 % MAE threshold). These could matter if the consumer holds for a fixed 20-day window with a tight stop, but the overall expectancy proxy is lower than V3's.
V3 is the local optimum of the lever as currently parameterised. Per-pattern quality matches V2 within rounding, with a tighter flagpole window (max 3 vs 5) and ~15 % fewer events than V2 (6,304 vs 7,428). The gain over baseline appears to come mainly from the flagpole window: V3 keeps only a mild pennant shift (5–15 → 6–17), similar in spirit to V2's, and still achieves the same per-pattern lift, although no variant in the sweep changes the flagpole window alone. Going from V3 to V4 trades real expectancy for a marginally cleaner downside.
El Don decides.
Artifacts preserved under ab_test/:
run_v3_v4.py, analyze_v3_v4.py, variant_v3_events.parquet,
variant_v3_outcomes.parquet, variant_v4_events.parquet,
variant_v4_outcomes.parquet, summary_v3.json, summary_v4.json,
summary_v3_v4_overlap.json, run_v3_v4.log, plus all baseline +
v1 + v2 artifacts from prior phases.