Phase 11a-3 — Pennant criteria A/B test (Variants 3 & 4)

Continuation of the Phase 11a/11a-2 sweep, pushing flagpole.max_duration_bars tighter (5 → 3 → 2). Baseline events and outcomes are reused unchanged from Phase 11a. V1/V2 numbers are pulled from prior reports for context.

Override mechanism, outcome formula, and universe are identical to Phase 11a. No production code, config, or data modified.

1. Configuration

Main configurations under test (Baseline / V3 / V4):

| Parameter | Baseline | Variant 3 | Variant 4 |
|---|---|---|---|
| pennant.min_duration_bars | 5 | 6 | 6 |
| pennant.max_duration_bars | 15 | 17 | 17 |
| flagpole.min_duration_bars | 1 | 1 | 1 |
| flagpole.max_duration_bars | 10 | 3 | 2 |
| All other criteria | | unchanged | unchanged |
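The variant definitions above can be sketched as overrides applied to a copy of the baseline criteria. This is only an illustration: the dotted-key convention and the `apply_overrides` helper are assumptions made for this sketch, not the Phase 11a override mechanism, which this report does not describe.

```python
from copy import deepcopy

# Baseline values for the four swept parameters, taken from the table above.
BASELINE = {
    "pennant": {"min_duration_bars": 5, "max_duration_bars": 15},
    "flagpole": {"min_duration_bars": 1, "max_duration_bars": 10},
}

# Variant overrides as dotted keys (hypothetical convention for this sketch).
OVERRIDES = {
    "V3": {
        "pennant.min_duration_bars": 6,
        "pennant.max_duration_bars": 17,
        "flagpole.max_duration_bars": 3,
    },
    "V4": {
        "pennant.min_duration_bars": 6,
        "pennant.max_duration_bars": 17,
        "flagpole.max_duration_bars": 2,
    },
}


def apply_overrides(base, overrides):
    """Return a modified copy of `base`; the baseline dict is never mutated."""
    cfg = deepcopy(base)
    for dotted_key, value in overrides.items():
        section, key = dotted_key.split(".")
        if key not in cfg.get(section, {}):
            raise KeyError(f"unknown criterion: {dotted_key}")
        cfg[section][key] = value
    return cfg


v3 = apply_overrides(BASELINE, OVERRIDES["V3"])
v4 = apply_overrides(BASELINE, OVERRIDES["V4"])
print(v3["flagpole"], v4["flagpole"])
```

Working on a deep copy mirrors the constraint stated earlier that no production config is modified during the test.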

For reference (prior variants):

| Parameter | V1 (Phase 11a) | V2 (Phase 11a-2) |
|---|---|---|
| pennant.{min,max} | 10 – 20 | 7 – 17 |
| flagpole.max | 5 | 5 |

Date range scanned: 2007-02-15 → 2026-05-08. Universe: 2,974 active tickers (2,413 with ≥300 bars).

2. Detection counts

| Population | Events | vs Baseline |
|---|---|---|
| Baseline | 15,534 | 1.000× |
| Variant 3 | 6,304 | 0.406× |
| Variant 4 | 4,637 | 0.299× |
| V1 (ref) | 5,155 | 0.332× |
| V2 (ref) | 7,428 | 0.478× |

V3 sits between V1 and V2 in count; V4 falls below V1 and is the smallest population in the sweep.
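The "vs Baseline" column and the relative ordering of the populations can be recomputed directly from the event counts in the table. A minimal sketch, using only those counts:

```python
# Event counts from the detection-count table.
counts = {
    "Baseline": 15534,
    "Variant 3": 6304,
    "Variant 4": 4637,
    "V1 (ref)": 5155,
    "V2 (ref)": 7428,
}

baseline = counts["Baseline"]
ratios = {name: n / baseline for name, n in counts.items()}

# Populations ordered from fewest to most events.
ranked = sorted(counts, key=counts.get)

for name in ranked:
    print(f"{name:10s} {counts[name]:>6,d} {ratios[name]:.3f}x")
```

Sorting by count makes the position of each variant relative to V1 and V2 explicit.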

Per-year counts (Baseline / V3 / V4)

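| Year | Baseline | V3 | V4 | V3/B | V4/B |
|---|---|---|---|---|---|
| 2007 | 400 | 161 | 122 | 0.40 | 0.31 |
| 2008 | 226 | 90 | 63 | 0.40 | 0.28 |
| 2009 | 702 | 235 | 162 | 0.33 | 0.23 |
| 2010 | 619 | 220 | 156 | 0.36 | 0.25 |
| 2011 | 471 | 196 | 146 | 0.42 | 0.31 |
| 2012 | 564 | 211 | 154 | 0.37 | 0.27 |
| 2013 | 875 | 302 | 215 | 0.35 | 0.25 |
| 2014 | 498 | 214 | 164 | 0.43 | 0.33 |
| 2015 | 533 | 224 | 175 | 0.42 | 0.33 |
| 2016 | 806 | 267 | 195 | 0.33 | 0.24 |
| 2017 | 871 | 334 | 247 | 0.38 | 0.28 |
| 2018 | 719 | 325 | 229 | 0.45 | 0.32 |
| 2019 | 748 | 371 | 290 | 0.50 | 0.39 |
| 2020 | 1,168 | 400 | 286 | 0.34 | 0.24 |
| 2021 | 1,336 | 509 | 342 | 0.38 | 0.26 |
| 2022 | 533 | 200 | 130 | 0.38 | 0.24 |
| 2023 | 1,024 | 500 | 367 | 0.49 | 0.36 |
| 2024 | 1,780 | 817 | 646 | 0.46 | 0.36 |
| 2025 | 1,232 | 525 | 404 | 0.43 | 0.33 |
| 2026 (YTD) | 429 | 203 | 144 | 0.47 | 0.34 |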

Both ratios stay within a moderate band across regimes (V3/B 0.33 to 0.50, V4/B 0.23 to 0.39); no year is disproportionately filtered.

Per-sector counts

| Sector | Baseline | V3 | V4 | V3/B | V4/B |
|---|---|---|---|---|---|
| Healthcare | 3,206 | 1,367 | 1,000 | 0.43 | 0.31 |
| Technology | 2,694 | 1,170 | 920 | 0.43 | 0.34 |
| Industrials | 2,538 | 1,092 | 800 | 0.43 | 0.32 |
| Financial Services | 2,067 | 726 | 523 | 0.35 | 0.25 |
| Consumer Cyclical | 1,916 | 791 | 565 | 0.41 | 0.29 |
| Consumer Defensive | 744 | 315 | 258 | 0.42 | 0.35 |
| Energy | 677 | 220 | 134 | 0.32 | 0.20 |
| Basic Materials | 672 | 252 | 183 | 0.38 | 0.27 |
| Communication Services | 590 | 233 | 158 | 0.39 | 0.27 |
| Real Estate | 328 | 107 | 77 | 0.33 | 0.23 |
| Utilities | 102 | 31 | 19 | 0.30 | 0.19 |

Mix is preserved. Energy/Utilities/Real Estate filter slightly harder, consistent across V3 and V4.
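One way to make "filter slightly harder" concrete is to compare each sector's retention ratio with the overall ratio. The sketch below uses the counts from the sector table; the 0.05 margin is an arbitrary illustrative cut-off chosen for this example, not a threshold used in the analysis.

```python
# (baseline, V3, V4) event counts per sector, from the per-sector table.
sectors = {
    "Healthcare": (3206, 1367, 1000),
    "Technology": (2694, 1170, 920),
    "Industrials": (2538, 1092, 800),
    "Financial Services": (2067, 726, 523),
    "Consumer Cyclical": (1916, 791, 565),
    "Consumer Defensive": (744, 315, 258),
    "Energy": (677, 220, 134),
    "Basic Materials": (672, 252, 183),
    "Communication Services": (590, 233, 158),
    "Real Estate": (328, 107, 77),
    "Utilities": (102, 31, 19),
}

total_b = sum(b for b, _, _ in sectors.values())
overall_v3 = sum(v3 for _, v3, _ in sectors.values()) / total_b
overall_v4 = sum(v4 for _, _, v4 in sectors.values()) / total_b

MARGIN = 0.05  # arbitrary cut-off for this sketch

# Sectors retained at least MARGIN below the overall ratio under BOTH variants.
harder = [
    name
    for name, (b, v3, v4) in sectors.items()
    if v3 / b < overall_v3 - MARGIN and v4 / b < overall_v4 - MARGIN
]
print(harder)
```

With this margin, Financial Services clears the V3 condition but narrowly misses the V4 one, so the result is sensitive to the chosen cut-off.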

3. Overlap analysis

Baseline vs V3

| Measure | Count |
|---|---|
| Exact-anchor match | 1,749 |
| Fuzzy ±1d (baseline events with ≥1 nearby V3) | 3,088 |
| Fuzzy ±1d (V3 events with ≥1 nearby baseline) | 3,088 |
| Baseline-only | 12,446 |
| V3-only | 3,216 |

Baseline vs V4

| Measure | Count |
|---|---|
| Exact-anchor match | 1,073 |
| Fuzzy ±1d (baseline events with ≥1 nearby V4) | 1,975 |
| Fuzzy ±1d (V4 events with ≥1 nearby baseline) | 1,975 |
| Baseline-only | 13,559 |
| V4-only | 2,662 |

Net-new share (variant events with no baseline event within ±1 day): V3 51 % (3,216 of 6,304), V4 57 % (2,662 of 4,637). Both populations are substantially distinct from baseline, not subsets of it.
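The net-new percentages follow from the overlap tables by simple arithmetic. A short sketch that also checks the counts are internally consistent (fuzzy-matched plus "only" must equal each population total):

```python
# Counts copied from the two overlap tables above.
baseline_total = 15534
totals = {"V3": 6304, "V4": 4637}
fuzzy_matched = {"V3": 3088, "V4": 1975}
variant_only = {"V3": 3216, "V4": 2662}
baseline_only = {"V3": 12446, "V4": 13559}

net_new_share = {}
for v in totals:
    # Every variant event is either fuzzy-matched to baseline or variant-only.
    assert fuzzy_matched[v] + variant_only[v] == totals[v]
    # Same partition on the baseline side.
    assert fuzzy_matched[v] + baseline_only[v] == baseline_total
    net_new_share[v] = variant_only[v] / totals[v]

for v, share in net_new_share.items():
    print(f"{v}: {share:.1%} net-new")
```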

V3 vs V4

| Measure | Count |
|---|---|
| V3 total | 6,304 |
| V4 total | 4,637 |
| Exact-anchor match | 3,402 |
| Fuzzy ±1d match | 3,800 |
| V3-only (no V4 within ±1d) | 2,504 |
| V4-only (no V3 within ±1d) | 837 |

Interpretation. V4 is much closer to a subset of V3 than V3 is to baseline: 82 % of V4 events have a matching V3 event within ±1 day. Tightening flagpole.max from 3 → 2 mostly removes patterns rather than shifting the population. There are still 837 V4-only events (18 %), so the inner-loop dedup picks slightly different anchors at the same symbol, but the qualitative population is essentially a culled V3.
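The exact and fuzzy ±1d measures used in this section can be sketched as below. This is not the code in analyze_v3_v4.py; it assumes an event is identified by a (symbol, anchor date) pair and that the ±1d tolerance is measured in calendar days, neither of which the report states. The tickers in the toy data are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import date, timedelta


def index_by_symbol(events):
    """Map symbol -> sorted list of anchor dates."""
    idx = defaultdict(list)
    for symbol, anchor in events:
        idx[symbol].append(anchor)
    for anchors in idx.values():
        anchors.sort()
    return idx


def has_nearby(idx, symbol, anchor, tol_days=1):
    """True if `idx` holds an anchor for `symbol` within +/- tol_days."""
    anchors = idx.get(symbol, [])
    tol = timedelta(days=tol_days)
    i = bisect_left(anchors, anchor - tol)  # first anchor >= lower bound
    return i < len(anchors) and anchors[i] <= anchor + tol


def overlap_summary(a, b, tol_days=1):
    idx_a, idx_b = index_by_symbol(a), index_by_symbol(b)
    a_matched = sum(has_nearby(idx_b, s, d, tol_days) for s, d in a)
    b_matched = sum(has_nearby(idx_a, s, d, tol_days) for s, d in b)
    return {
        "exact": len(set(a) & set(b)),
        "a_fuzzy_matched": a_matched,
        "b_fuzzy_matched": b_matched,
        "a_only": len(a) - a_matched,
        "b_only": len(b) - b_matched,
    }


# Toy data with hypothetical tickers.
a = [("AAA", date(2024, 1, 2)), ("AAA", date(2024, 3, 1)), ("BBB", date(2024, 5, 10))]
b = [("AAA", date(2024, 1, 3)), ("BBB", date(2024, 5, 10)), ("CCC", date(2024, 6, 1))]
summary = overlap_summary(a, b)
print(summary)
```

The two fuzzy counts are computed separately because they need not be equal in general: one event on one side can sit within a day of two events on the other.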

4. MFE distribution comparison

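| Stat (MFE %) | Baseline | V3 | V4 | V2 (ref) | V1 (ref) |
|---|---|---|---|---|---|
| Mean | 13.92 | 14.62 | 14.64 | 14.55 | 14.03 |
| Median | 7.50 | 7.66 | 7.64 | 7.64 | 7.47 |
| P25 | 2.67 | 2.70 | 2.66 | 2.74 | 2.63 |
| P75 | 16.01 | 16.07 | 15.78 | 15.95 | 15.97 |
| P90 | 30.87 | 32.15 | 31.70 | 31.58 | 30.54 |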

Hit rate at MFE thresholds

| Threshold | Baseline | V3 | V4 | V2 (ref) | V1 (ref) |
|---|---|---|---|---|---|
| ≥ 5 % | 61.5 % | 62.6 % | 62.0 % | 62.2 % | 62.0 % |
| ≥ 10 % | 40.5 % | 40.8 % | 40.7 % | 40.9 % | 40.1 % |
| ≥ 15 % | 27.0 % | 27.0 % | 26.5 % | 27.1 % | 27.0 % |
| ≥ 20 % | 18.9 % | 18.5 % | 18.6 % | 18.7 % | 18.7 % |
| ≥ 30 % | 10.5 % | 11.1 % | 11.0 % | 10.9 % | 10.5 % |
| ≥ 50 % | 4.3 % | 4.9 % | 4.9 % | 5.0 % | 4.4 % |

V3 and V4 carry the same mean-MFE lift as V2 (≈ +0.7 pp on the mean, ≈ +0.6 pp on the ≥ 50 % hit rate), but V4 shows a small drop in the ≥15% hit rate (26.5% vs baseline 27.0%, vs V3's 27.0% and V2's 27.1%). This is the first metric in the sweep where tightening further actively underperforms a less-tightened sibling.
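For reference, a hit rate at an MFE threshold is the share of events whose MFE reaches that threshold. A minimal sketch, assuming MFE is stored per event as a percentage; the sample values are made up purely to exercise the function and are not drawn from the outcome files.

```python
def hit_rates(mfe_pct, thresholds=(5, 10, 15, 20, 30, 50)):
    """Percentage of events with MFE >= each threshold (both in percent)."""
    n = len(mfe_pct)
    if n == 0:
        raise ValueError("no outcomes to summarise")
    return {t: 100.0 * sum(x >= t for x in mfe_pct) / n for t in thresholds}


# Made-up MFE values (percent) for illustration only.
sample = [1.0, 6.0, 12.0, 18.0, 55.0]
rates = hit_rates(sample)
print(rates)
```

Because the thresholds are nested, the rates are non-increasing by construction; a variant can therefore gain at the tail thresholds while losing at a middle threshold only if mass shifts from the middle of the distribution outward.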

5. MAE distribution comparison

| Stat (MAE %) | Baseline | V3 | V4 |
|---|---|---|---|
| Mean | −9.54 | −9.63 | −9.50 |
| Median | −6.59 | −6.49 | −6.35 |
| P25 | −13.64 | −13.98 | −13.77 |
| P75 | −2.23 | −2.29 | −2.18 |
| P10 | −23.21 | −23.65 | −23.30 |

Stop-loss-relevant loss rates

| MAE worse than… | Baseline | V3 | V4 | V2 (ref) |
|---|---|---|---|---|
| −5 % | 58.5 % | 58.1 % | 57.7 % | 58.3 % |
| −7 % | 47.9 % | 47.8 % | 47.3 % | 48.3 % |
| −10 % | 35.9 % | 36.3 % | 35.6 % | 36.0 % |
| −15 % | 21.7 % | 22.2 % | 21.6 % | 22.0 % |

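V4 is the only variant shown whose loss rate is below baseline at all four thresholds, most clearly at −5 % and −7 % (−0.8 pp and −0.6 pp respectively). V3 is also marginally below baseline at those two thresholds (−0.4 pp and −0.1 pp) but above it at −10 % and −15 %; it stays within 0.5 pp of baseline at every threshold.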
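The per-threshold differences against baseline can be recomputed from the loss-rate table. A small sketch using only the tabulated percentages:

```python
# Share of events (percent) whose MAE is worse than each threshold.
loss_rate = {
    -5: {"Baseline": 58.5, "V3": 58.1, "V4": 57.7, "V2": 58.3},
    -7: {"Baseline": 47.9, "V3": 47.8, "V4": 47.3, "V2": 48.3},
    -10: {"Baseline": 35.9, "V3": 36.3, "V4": 35.6, "V2": 36.0},
    -15: {"Baseline": 21.7, "V3": 22.2, "V4": 21.6, "V2": 22.0},
}

# Delta in percentage points vs baseline; negative means fewer deep drawdowns.
delta = {
    v: {t: round(row[v] - row["Baseline"], 1) for t, row in loss_rate.items()}
    for v in ("V3", "V4", "V2")
}

below_at_all = [v for v, d in delta.items() if all(x < 0 for x in d.values())]

for v, d in delta.items():
    print(v, d)
print("below baseline at every threshold:", below_at_all)
```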

6. Time-to-MFE-peak + endpoint returns

| Stat (days to MFE) | Baseline | V3 | V4 |
|---|---|---|---|
| Mean | 15.7 | 15.6 | 15.4 |
| Median | 16 | 16 | 15 |
| P90 | 30 | 29 | 29 |

Bucket distribution

| Days-to-peak | Baseline | V3 | V4 |
|---|---|---|---|
| 1 – 5 | 25.4 % | 25.3 % | 25.5 % |
| 6 – 10 | 12.8 % | 13.0 % | 13.1 % |
| 11 – 15 | 11.0 % | 11.5 % | 11.6 % |
| 16 – 20 | 10.8 % | 11.6 % | 11.7 % |
| 21 – 30 | 39.9 % | 38.7 % | 38.0 % |

V4's median peak day shifts from 16 → 15; combined with the U-shape attenuating slightly at the 21–30 bucket, V4 patterns resolve modestly faster on average than baseline.
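The bucket distribution can be reproduced from per-event days-to-peak values along the following lines. The sketch assumes the peak day is an integer between 1 and 30, as the bucket edges in the table suggest; the sample values are invented for illustration.

```python
# Inclusive bucket edges matching the table rows.
BUCKETS = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 30)]


def bucket_shares(days_to_peak):
    """Percentage of events whose MFE peak day falls in each bucket."""
    if not days_to_peak:
        raise ValueError("no outcomes to summarise")
    counts = {bucket: 0 for bucket in BUCKETS}
    for d in days_to_peak:
        for lo, hi in BUCKETS:
            if lo <= d <= hi:
                counts[(lo, hi)] += 1
                break
        else:
            raise ValueError(f"day {d} is outside the 1-30 window")
    n = len(days_to_peak)
    return {f"{lo}-{hi}": 100.0 * c / n for (lo, hi), c in counts.items()}


# Invented peak days, for illustration only.
shares = bucket_shares([1, 3, 7, 12, 18, 25, 30, 30])
print(shares)
```

Note that the last bucket spans ten days while the others span five, which exaggerates the right arm of the U-shape when the rows are compared directly.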

Endpoint returns (means, %)

| Horizon | Baseline | V3 | V4 | V2 (ref) |
|---|---|---|---|---|
| 5d | 0.50 | 0.69 | 0.67 | 0.73 |
| 10d | 0.95 | 1.17 | 1.27 | 1.24 |
| 20d | 1.66 | 2.19 | 2.45 | 2.18 |
| 30d | 2.79 | 2.87 | 2.71 | 2.94 |

V4 has the strongest 20-day lift in the entire sweep (+0.79 pp on mean), but is the only variant whose 30-day endpoint mean falls below baseline (2.71 vs 2.79). Combined with the faster median peak day, V4 patterns appear to reach their peak earlier and then mean-revert.
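The lifts quoted above are differences between each variant's mean endpoint return and the baseline's at the same horizon. A short sketch recomputing them from the table:

```python
# Mean endpoint returns (percent) from the table above.
endpoint = {
    "5d": {"Baseline": 0.50, "V3": 0.69, "V4": 0.67, "V2": 0.73},
    "10d": {"Baseline": 0.95, "V3": 1.17, "V4": 1.27, "V2": 1.24},
    "20d": {"Baseline": 1.66, "V3": 2.19, "V4": 2.45, "V2": 2.18},
    "30d": {"Baseline": 2.79, "V3": 2.87, "V4": 2.71, "V2": 2.94},
}

# Lift in percentage points vs baseline, per horizon and variant.
lift = {
    horizon: {
        v: round(row[v] - row["Baseline"], 2) for v in row if v != "Baseline"
    }
    for horizon, row in endpoint.items()
}

best_20d = max(lift["20d"], key=lift["20d"].get)
below_baseline_30d = [v for v, x in lift["30d"].items() if x < 0]

for horizon, row in lift.items():
    print(horizon, row)
```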

7. Quality vs quantity tradeoff

Expectancy proxy = mean(MFE) × P(MFE ≥ 15 %).

| Metric | Baseline | V3 | V4 | V2 (ref) | V1 (ref) |
|---|---|---|---|---|---|
| n (with outcomes) | 15,528 | 6,303 | 4,636 | 7,425 | 5,154 |
| mean MFE % | 13.92 | 14.62 | 14.64 | 14.55 | 14.03 |
| P(MFE ≥ 15 %) | 27.0 % | 27.0 % | 26.5 % | 27.1 % | 27.0 % |
| Per-pattern proxy | 3.76 | 3.95 | 3.87 | 3.94 | 3.79 |
| Population-total proxy (per-pattern × n) | 58,370 | 24,879 | 17,967 | 29,266 | 19,561 |

Per-pattern expectancy:

- Baseline → V1: +0.8 % (3.76 → 3.79)
- Baseline → V2: +4.9 % (3.76 → 3.94)
- Baseline → V3: +5.1 % (3.76 → 3.95)
- Baseline → V4: +2.9 % (3.76 → 3.87)

The lift saturates between V2 and V3 (3.94 vs 3.95, within rounding) and regresses from V3 to V4 (3.95 → 3.87). The regression is driven by the ≥15% hit rate dropping from 27.0 % to 26.5 % rather than by mean MFE: V4 keeps the fatter right tail (≥ 30 % and ≥ 50 % hit rates above baseline) without making the middle-distance targets more reliable.
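The proxy defined at the top of this section can be recomputed from the rounded table inputs as a sanity check. Because the inputs here are rounded to two and one decimal places, the results can differ from the tabulated proxies in the last digit; the tabulated values were presumably computed from unrounded data.

```python
# (n with outcomes, mean MFE %, P(MFE >= 15 %) in percent), from the table.
rows = {
    "Baseline": (15528, 13.92, 27.0),
    "V3": (6303, 14.62, 27.0),
    "V4": (4636, 14.64, 26.5),
    "V2": (7425, 14.55, 27.1),
    "V1": (5154, 14.03, 27.0),
}


def expectancy_proxy(mean_mfe_pct, hit_rate_pct):
    """mean(MFE) x P(MFE >= 15 %), with the hit rate given in percent."""
    return mean_mfe_pct * hit_rate_pct / 100.0


per_pattern = {k: expectancy_proxy(m, h) for k, (_, m, h) in rows.items()}
population_total = {k: per_pattern[k] * rows[k][0] for k in rows}

# Variants ordered from highest to lowest per-pattern proxy.
ranking = sorted(per_pattern, key=per_pattern.get, reverse=True)

for k in ranking:
    print(f"{k:8s} per-pattern {per_pattern[k]:.2f}  total {population_total[k]:,.0f}")
```

The ordering V3, V2, V4, V1, Baseline is the same whether rounded or tabulated proxies are used, so the saturation-then-regression pattern does not hinge on rounding.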

8. Recommendation summary

This sweep was designed to answer two questions; the numbers answer both factually.

(a) Is the quality lift monotonic in flagpole tightness? No. The per-pattern expectancy proxy goes Baseline 3.76 → V2 3.94 → V3 3.95 → V4 3.87. The lever saturates between flagpole.max = 5 and 3 and reverses between 3 and 2. The mean-MFE component continues to climb very slightly (14.55 → 14.62 → 14.64), but the hit rate at +15 % MFE drops at V4 (27.1 % → 27.0 % → 26.5 %).

(b) Does the answer change between V3 (max = 3) and V4 (max = 2)? Yes. V4 has gone past the useful point in two ways:

1. ≥ 15 % MFE hit rate falls below baseline (26.5 % vs 27.0 %), and per-pattern expectancy proxy drops (3.95 → 3.87).
2. 30-day endpoint mean falls below baseline (2.71 % vs 2.79 %), the only variant in the sweep where this happens. V4 patterns peak earlier (median day 15 vs 16) and then fade.

V4 does produce the strongest 20-day endpoint mean in the sweep (+0.79 pp vs baseline) and the cleanest downside profile (−0.8 pp at the −5 % MAE threshold). These could matter if the consumer holds for a fixed 20-day window with a tight stop, but the overall expectancy proxy is lower than V3's.

V3 is the local optimum of the lever as currently parameterised. Per-pattern quality matches V2 within rounding, with a slightly tighter flagpole window and ~15 % fewer events than V2 (6,304 vs 7,428). The gain over baseline isolates cleanly to the flagpole window — V3 keeps a mild pennant shift (5–15 → 6–17) similar in spirit to V2's, yet achieves the same per-pattern lift. Going from V3 to V4 trades real expectancy for a marginally cleaner downside.

El Don decides.


Artifacts preserved under ab_test/: run_v3_v4.py, analyze_v3_v4.py, variant_v3_events.parquet, variant_v3_outcomes.parquet, variant_v4_events.parquet, variant_v4_outcomes.parquet, summary_v3.json, summary_v4.json, summary_v3_v4_overlap.json, run_v3_v4.log, plus all baseline + v1 + v2 artifacts from prior phases.