Phase 11a — Pennant criteria A/B test

Side-by-side comparison of the production bull pennant detector under current criteria (Baseline) vs proposed tightened criteria (Variant) over the full 2007–2026 historical dataset. Analytical only — no production code, config, or tables were modified.

Override mechanism (Approach a): the production pennant detector reads its thresholds from uriel.config.get_config(), which returns a mutable Pydantic v2 model. The harness at ab_test/run_ab.py calls get_config() once per run and sets the four duration fields (three of which differ between Baseline and Variant; see §1) before invoking uriel.detect.pennant._detect_for_ticker(...) directly. Events are returned in memory and written to parquet under ab_test/; the production pattern_events table is untouched.

Outcomes (maximum favourable excursion, MFE; maximum adverse excursion, MAE; endpoint returns) are computed inline using the same anchor-relative formula as uriel.outcomes.profiler._profile_one_event (forward 30 trading days, percent change vs the anchor close), and are likewise written to parquet rather than into pattern_events. Note that the v1.4/v1.5 Charter §7.5 Q2 minimum (5 bars) is itself a parameter under test here, so no implicit charter override is required.
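The override step can be illustrated with a minimal sketch. It does not import uriel: the dataclasses below are hypothetical stand-ins for the Pydantic model returned by get_config(), and apply_overrides is an illustrative helper, not a function from run_ab.py. Only the field names and values are taken from the configuration table in §1.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins for the production config model (assumption:
# the real object is a nested, mutable Pydantic v2 model).
@dataclass
class PennantCfg:
    min_duration_bars: int = 5
    max_duration_bars: int = 15


@dataclass
class FlagpoleCfg:
    min_duration_bars: int = 1
    max_duration_bars: int = 10


@dataclass
class Config:
    pennant: PennantCfg = field(default_factory=PennantCfg)
    flagpole: FlagpoleCfg = field(default_factory=FlagpoleCfg)


# The four duration fields, with the Variant values from the table in section 1.
VARIANT_OVERRIDES = {
    "pennant.min_duration_bars": 10,
    "pennant.max_duration_bars": 20,
    "flagpole.min_duration_bars": 1,
    "flagpole.max_duration_bars": 5,
}


def apply_overrides(cfg, overrides):
    """Mutate cfg in place; dotted keys address nested fields."""
    for dotted, value in overrides.items():
        *parents, leaf = dotted.split(".")
        target = cfg
        for name in parents:
            target = getattr(target, name)
        if not hasattr(target, leaf):
            # Fail loudly on typos instead of silently adding a new attribute.
            raise AttributeError(f"unknown config field: {dotted}")
        setattr(target, leaf, value)
    return cfg


baseline_cfg = Config()  # defaults mirror the Baseline column
variant_cfg = apply_overrides(Config(), VARIANT_OVERRIDES)
```

Building a fresh config object per run, as above, keeps the Baseline values from leaking into the Variant run or vice versa.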

1. Configuration

| Parameter | Baseline | Variant |
|---|---|---|
| pennant.min_duration_bars | 5 | 10 |
| pennant.max_duration_bars | 15 | 20 |
| flagpole.min_duration_bars | 1 | 1 |
| flagpole.max_duration_bars | 10 | 5 |
| pennant.max_retrace_pct | 0.382 | 0.382 |
| flagpole.min_magnitude_pct | 12.0 | 12.0 |
| flagpole.min_atr_multiple | 4.0 | 4.0 |
| flagpole.volume_ratio_min | 1.5 | 1.5 |
| trend_filter (EMA_55 ≥ 10d prior) | on | on |

Date range scanned: 2007-02-15 → 2026-05-08 (≈19 years). Universe: 2,974 active tickers, of which 2,413 had the ≥300 bars required for scanning. Earliest variant anchor: 2007-02-16; both runs span the full window.

2. Detection counts

  • Baseline: 15,534 events. This matches the production pattern_events count exactly, which supports the claim that the harness reproduces production output.
  • Variant: 5,155 events
  • Variant / Baseline ratio: 0.332. The variant emits roughly one event for every three the baseline emits; as §3 shows, these are not simply a subset of the baseline events.

Per-year counts

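| Year | Baseline | Variant | Variant/Baseline |
|---|---|---|---|
| 2007 | 400 | 127 | 0.32 |
| 2008 | 226 | 69 | 0.31 |
| 2009 | 702 | 238 | 0.34 |
| 2010 | 619 | 172 | 0.28 |
| 2011 | 471 | 158 | 0.34 |
| 2012 | 564 | 187 | 0.33 |
| 2013 | 875 | 233 | 0.27 |
| 2014 | 498 | 178 | 0.36 |
| 2015 | 533 | 186 | 0.35 |
| 2016 | 806 | 249 | 0.31 |
| 2017 | 871 | 301 | 0.35 |
| 2018 | 719 | 253 | 0.35 |
| 2019 | 748 | 260 | 0.35 |
| 2020 | 1,168 | 357 | 0.31 |
| 2021 | 1,336 | 455 | 0.34 |
| 2022 | 533 | 189 | 0.35 |
| 2023 | 1,024 | 349 | 0.34 |
| 2024 | 1,780 | 662 | 0.37 |
| 2025 | 1,232 | 397 | 0.32 |
| 2026 (YTD) | 429 | 135 | 0.31 |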

The ratio is stable across market regimes, ranging from 0.27 (2013) to 0.37 (2024). The variant does not disproportionately favour or penalise any single year.

Per-sector counts

| Sector | Baseline | Variant | V/B |
|---|---|---|---|
| Healthcare | 3,206 | 1,091 | 0.34 |
| Technology | 2,694 | 943 | 0.35 |
| Industrials | 2,538 | 869 | 0.34 |
| Financial Services | 2,067 | 600 | 0.29 |
| Consumer Cyclical | 1,916 | 652 | 0.34 |
| Consumer Defensive | 744 | 235 | 0.32 |
| Energy | 677 | 216 | 0.32 |
| Basic Materials | 672 | 213 | 0.32 |
| Communication Services | 590 | 202 | 0.34 |
| Real Estate | 328 | 106 | 0.32 |
| Utilities | 102 | 28 | 0.27 |

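Sector mix is broadly preserved: V/B ranges from 0.27 (Utilities, on only 102 baseline events) to 0.35 (Technology). Financial Services (0.29) is the only large sector noticeably below the overall 0.33 ratio.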
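As a cross-check on the table, the V/B column and the overall ratio can be recomputed from the counts alone. The sketch below only restates the per-sector counts reported above; the dictionary layout and function name are illustrative choices, not part of analyze.py.

```python
# (baseline_count, variant_count) per sector, copied from the table above.
SECTOR_COUNTS = {
    "Healthcare": (3206, 1091),
    "Technology": (2694, 943),
    "Industrials": (2538, 869),
    "Financial Services": (2067, 600),
    "Consumer Cyclical": (1916, 652),
    "Consumer Defensive": (744, 235),
    "Energy": (677, 216),
    "Basic Materials": (672, 213),
    "Communication Services": (590, 202),
    "Real Estate": (328, 106),
    "Utilities": (102, 28),
}


def variant_to_baseline(counts):
    """Return {group: variant/baseline ratio rounded to 2 decimals}."""
    return {name: round(v / b, 2) for name, (b, v) in counts.items()}


ratios = variant_to_baseline(SECTOR_COUNTS)
total_baseline = sum(b for b, _ in SECTOR_COUNTS.values())
total_variant = sum(v for _, v in SECTOR_COUNTS.values())
overall_ratio = total_variant / total_baseline

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:<24} {ratio:.2f}")
print(f"{'All sectors':<24} {overall_ratio:.3f}")
```

The sector totals reproduce the headline event counts in §2, so the sector breakdown covers every detected event in both runs.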

3. Overlap analysis

| Measure | Count |
|---|---|
| Exact-anchor match (same symbol + same event_date) | 1,339 |
| Fuzzy match within ±1 calendar day (baseline events with ≥1 nearby variant) | 2,190 |
| Fuzzy match within ±1 calendar day (variant events with ≥1 nearby baseline) | 2,190 |
| Baseline-only events (no variant within ±1 day) | 13,344 |
| Variant-only events (no baseline within ±1 day) | 2,965 |
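The measures in this table can be computed with a few set operations. The following is a minimal sketch, assuming each event is identified by a (symbol, event_date) pair; the function name and the toy events are hypothetical and are not taken from analyze.py.

```python
from datetime import date, timedelta


def overlap_counts(baseline, variant, tolerance_days=1):
    """Count exact and fuzzy anchor matches between two event lists.

    Each event is a (symbol, event_date) tuple. The tolerance is measured
    in calendar days, mirroring the table's definition.
    """
    b, v = set(baseline), set(variant)

    def has_neighbor(event, others):
        symbol, day = event
        return any(
            (symbol, day + timedelta(days=k)) in others
            for k in range(-tolerance_days, tolerance_days + 1)
        )

    b_fuzzy = sum(has_neighbor(e, v) for e in b)
    v_fuzzy = sum(has_neighbor(e, b) for e in v)
    return {
        "exact": len(b & v),
        "baseline_fuzzy": b_fuzzy,
        "variant_fuzzy": v_fuzzy,
        "baseline_only": len(b) - b_fuzzy,
        "variant_only": len(v) - v_fuzzy,
    }


# Toy events, for illustration only.
toy_baseline = [("AAA", date(2024, 1, 2)), ("AAA", date(2024, 1, 10)), ("BBB", date(2024, 3, 5))]
toy_variant = [("AAA", date(2024, 1, 2)), ("AAA", date(2024, 1, 11)), ("CCC", date(2024, 3, 5))]
toy_result = overlap_counts(toy_baseline, toy_variant)
print(toy_result)
```

Because the tolerance is in calendar days, a Friday anchor and the following Monday anchor are three days apart and do not count as a fuzzy match, even though they are adjacent trading days.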

Interpretation. Only ≈8.6 % of baseline events (1,339 of 15,534) have an exact variant counterpart, and only ≈14 % (2,190) match even with a ±1-day tolerance. Of the 5,155 variant events, 2,965 (≈58 %) have no baseline event within ±1 calendar day. By that matching rule they are net-new anchors, although some may be the same underlying pattern anchored on a different date. This holds even though the variant's pennant.min_duration_bars and flagpole.max_duration_bars are both stricter than the baseline's. Two mechanisms explain it:

  • The variant's wider pennant window (max 20 vs 15) admits longer consolidations baseline rejects.
  • The detector's inner-loop dedup sets skip_until_idx = end_idx + min_dur and breaks on the first qualifying win_len at each end_idx. Both steps depend on min_duration_bars, so changing it changes which anchor "wins" for each symbol, and the two configurations can land on different anchor dates for the same underlying pattern.

The change is therefore not incremental: the variant produces a qualitatively different event population, not a strict subset of the baseline's.
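Both mechanisms can be reproduced in a toy scanner. The sketch below is not the production detector: the qualifying rule (a pennant must start on the bar after a single flagpole and end before the consolidation breaks) and the skip comparison (end_idx < skip_until_idx) are simplifying assumptions. Only the skip_until_idx = end_idx + min_dur update and the break on the first qualifying win_len follow the description above.

```python
def scan(pole_end, quiet_until, n_bars, min_dur, max_dur):
    """Toy pennant scan; returns the anchor index (end_idx) of each event."""
    anchors = []
    skip_until_idx = 0
    for end_idx in range(n_bars):
        if end_idx < skip_until_idx:  # assumed form of the dedup skip
            continue
        for win_len in range(min_dur, max_dur + 1):
            start_idx = end_idx - win_len + 1
            # Toy criterion: the window starts right after the flagpole
            # and ends while the consolidation is still intact.
            if start_idx == pole_end + 1 and end_idx <= quiet_until:
                anchors.append(end_idx)
                skip_until_idx = end_idx + min_dur
                break  # first qualifying win_len wins at this end_idx
    return anchors


# One flagpole ending at bar 9, consolidation intact through bar 30.
baseline_anchors = scan(pole_end=9, quiet_until=30, n_bars=40, min_dur=5, max_dur=15)
variant_anchors = scan(pole_end=9, quiet_until=30, n_bars=40, min_dur=10, max_dur=20)

print("baseline:", baseline_anchors)
print("variant: ", variant_anchors)
print("shared:  ", sorted(set(baseline_anchors) & set(variant_anchors)))
```

In this toy case the same consolidation yields three baseline anchors and two variant anchors with only one in common. The variant's extra anchor comes from the wider window (a 20-bar pennant), and the baseline's extra anchors come from its shorter minimum duration and shorter skip.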

4. MFE distribution comparison

| Stat (MFE %) | Baseline | Variant |
|---|---|---|
| Mean | 13.92 | 14.03 |
| Median | 7.50 | 7.47 |
| P25 | 2.67 | 2.63 |
| P75 | 16.01 | 15.97 |
| P90 | 30.87 | 30.54 |

Hit-rate at common MFE thresholds

| MFE threshold | Baseline | Variant |
|---|---|---|
| ≥ 5 % | 61.5 % | 62.0 % |
| ≥ 10 % | 40.5 % | 40.1 % |
| ≥ 15 % | 27.0 % | 27.0 % |
| ≥ 20 % | 18.9 % | 18.7 % |
| ≥ 30 % | 10.5 % | 10.5 % |
| ≥ 50 % | 4.3 % | 4.4 % |

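The MFE distributions are practically indistinguishable: no reported statistic differs by more than 0.33 percentage points (P90), and no hit rate by more than 0.5 percentage points (the ≥ 5 % threshold).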
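For reference, the anchor-relative outcome measures can be sketched as follows. This is an illustration of the formula described in the overview (forward 30 trading days, percent vs the anchor close), not the code of _profile_one_event. Using bar highs for MFE and bar lows for MAE, and counting days from 1, are assumptions; the production profiler may use closes or a different day convention.

```python
def profile_event(anchor_close, highs, lows, horizon=30):
    """Return (mfe_pct, mae_pct, days_to_mfe) for one event.

    highs/lows are the forward bars after the anchor, oldest first.
    Assumption: excursions are measured on highs and lows, and the
    first forward bar counts as day 1.
    """
    highs, lows = list(highs[:horizon]), list(lows[:horizon])
    if not highs or len(highs) != len(lows):
        raise ValueError("need equal-length, non-empty forward highs and lows")
    peak = max(highs)
    mfe_pct = (peak / anchor_close - 1.0) * 100.0
    mae_pct = (min(lows) / anchor_close - 1.0) * 100.0
    days_to_mfe = highs.index(peak) + 1  # first bar that reaches the peak
    return mfe_pct, mae_pct, days_to_mfe


# Toy path: anchor close 100, three forward bars.
mfe, mae, days = profile_event(100.0, highs=[102.0, 110.0, 108.0], lows=[99.0, 101.0, 95.0])
print(f"MFE {mfe:.2f} %, MAE {mae:.2f} %, peak on day {days}")
```

Note that this sketch does not floor MFE at zero or cap MAE at zero; whether the production formula does so is not stated in the report.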

5. MAE distribution comparison

| Stat (MAE %) | Baseline | Variant |
|---|---|---|
| Mean | −9.54 | −9.68 |
| Median | −6.59 | −6.74 |
| P25 | −13.64 | −14.05 |
| P75 | −2.23 | −2.15 |
| P10 | −23.21 | −23.48 |

Stop-loss-relevant loss rates

| MAE worse than… | Baseline | Variant |
|---|---|---|
| −5 % | 58.5 % | 58.7 % |
| −7 % | 47.9 % | 48.7 % |
| −10 % | 35.9 % | 36.4 % |
| −15 % | 21.7 % | 22.4 % |

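Variant patterns have marginally higher loss rates at every stop-loss threshold (differences of 0.2 – 0.8 percentage points) and deeper mean, median, P25, and P10 drawdowns; only P75 is slightly shallower (−2.15 vs −2.23). The effect is small but consistent: the variant does not improve the downside profile.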
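The percentile and threshold tables in this report can be produced by one small helper. The sketch below uses only the Python standard library; the function name, the choice of the "inclusive" percentile method, and the treatment of "worse than" as less-than-or-equal are assumptions, so small differences from the figures produced by analyze.py would be expected.

```python
from statistics import mean, median, quantiles


def summarize(values, thresholds, worse_is_lower=False):
    """Return (stats, rates) for a list of per-event percentages.

    rates maps each threshold to the share of events (in percent) that
    reach it: v >= t for MFE, v <= t for MAE (worse_is_lower=True).
    """
    cuts = quantiles(values, n=100, method="inclusive")  # 99 percentile cut points
    stats = {
        "mean": mean(values),
        "median": median(values),
        "p10": cuts[9],
        "p25": cuts[24],
        "p75": cuts[74],
        "p90": cuts[89],
    }
    n = len(values)
    if worse_is_lower:
        rates = {t: 100.0 * sum(v <= t for v in values) / n for t in thresholds}
    else:
        rates = {t: 100.0 * sum(v >= t for v in values) / n for t in thresholds}
    return stats, rates


# Toy inputs, for illustration only.
mfe_stats, mfe_rates = summarize([1.0, 2.0, 3.0, 4.0, 5.0], thresholds=[3.0])
mae_stats, mae_rates = summarize([-1.0, -6.0, -8.0, -20.0], thresholds=[-7.0], worse_is_lower=True)
print(mfe_stats["p25"], mfe_rates)
print(mae_stats["median"], mae_rates)
```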

6. Time-to-MFE-peak comparison

| Stat (days to MFE) | Baseline | Variant |
|---|---|---|
| Mean | 15.7 | 15.5 |
| Median | 16 | 16 |
| P25 | 5 | 5 |
| P75 | 26 | 26 |
| P90 | 30 | 30 |

Bucket distribution

| Days-to-peak | Baseline | Variant |
|---|---|---|
| 1 – 5 | 25.4 % | 25.7 % |
| 6 – 10 | 12.8 % | 13.3 % |
| 11 – 15 | 11.0 % | 10.7 % |
| 16 – 20 | 10.8 % | 11.4 % |
| 21 – 30 | 39.9 % | 38.8 % |

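The two populations resolve at indistinguishable speeds. The U-shaped profile (high mass at 1–5 and 21–30 trading days) is present in both: patterns that work tend to work fast or take the full window. Because P90 equals the 30-day horizon, at least a tenth of events peak on the final bar, so part of the late mass likely reflects truncation by the window rather than slow resolution.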
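The bucket distribution is a simple binning of days-to-peak. A minimal sketch, assuming days are counted from 1 to 30 as in the table; the helper name and the toy input are illustrative.

```python
BUCKETS = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 30)]


def bucket_shares(days_to_peak):
    """Return {bucket label: share of events in percent}."""
    counts = {f"{lo}-{hi}": 0 for lo, hi in BUCKETS}
    for d in days_to_peak:
        for lo, hi in BUCKETS:
            if lo <= d <= hi:
                counts[f"{lo}-{hi}"] += 1
                break
        else:
            # No bucket matched: the value lies outside the 30-day window.
            raise ValueError(f"days-to-peak outside 1..30: {d}")
    n = len(days_to_peak)
    return {label: 100.0 * c / n for label, c in counts.items()}


# Toy input, for illustration only.
toy_shares = bucket_shares([1, 3, 8, 16, 22, 29, 30, 30])
print(toy_shares)
```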

7. Quality vs quantity tradeoff

Expectancy proxy = mean(MFE) × P(MFE ≥ 15 %).

| Metric | Baseline | Variant |
|---|---|---|
| n (with outcomes) | 15,528 | 5,154 |
| mean MFE % | 13.92 | 14.03 |
| P(MFE ≥ 15 %) | 27.0 % | 27.0 % |
| Per-pattern proxy | 3.76 | 3.79 |
| Population-total proxy (per-pattern × n) | 58,370 | 19,561 |

Interpretation. Per-pattern quality is essentially identical: mean MFE is 0.8 % higher in relative terms (14.03 vs 13.92) and the hit rate at ≥ 15 % is the same. But the variant detects only 33 % as many patterns, so the total expected-return contribution from the variant population is about one-third of the baseline's. The variant is fewer patterns at the same per-pattern quality, not "fewer but better." Endpoint returns at 5, 10, 20, and 30 days tell the same story: means and medians within 0.2 percentage points, with P75/P90 essentially overlapping.
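The proxy arithmetic can be reproduced from the rounded table values. This is a worked check rather than the analysis code; the function name is illustrative.

```python
def expectancy_proxy(mean_mfe_pct, hit_rate):
    """Expectancy proxy = mean(MFE) x P(MFE >= 15 %), hit_rate as a fraction."""
    return mean_mfe_pct * hit_rate


# Rounded inputs from the table above.
baseline_proxy = expectancy_proxy(13.92, 0.270)
variant_proxy = expectancy_proxy(14.03, 0.270)

baseline_total = baseline_proxy * 15528
variant_total = variant_proxy * 5154
total_ratio = variant_total / baseline_total

print(f"per-pattern: {baseline_proxy:.2f} vs {variant_proxy:.2f}")
print(f"population total: {baseline_total:,.0f} vs {variant_total:,.0f}")
print(f"variant / baseline total: {total_ratio:.3f}")
```

Recomputing from the rounded inputs reproduces the per-pattern values (3.76 and 3.79) but lands slightly below the population totals in the table, presumably because the report multiplied unrounded means and hit rates. The variant-to-baseline ratio of the totals is about one-third either way.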

The downside picture is marginally worse for the variant (0.2 – 0.8 pp higher loss rates at every threshold), suggesting the wider pennant window (up to 20 bars) admits some consolidations that decay rather than coil.

8. Recommendation summary

The variant is preferable to the baseline only if the consumer of the detector values a smaller candidate set over total coverage and is willing to accept marginally deeper drawdowns in exchange. On the quality dimensions actually measured (forward MFE, MAE, time-to-peak, endpoint returns) the two populations are practically equivalent; the variant produces no measurable lift in any of them, so the smaller set is not a cleaner one. The decision is therefore not a "better detector" question but a "right population size" question: fewer-but-equivalent patterns vs more-but-equivalent patterns. El Don decides.


Artifacts preserved under ab_test/: run_ab.py, analyze.py, baseline_events.parquet, variant_events.parquet, baseline_outcomes.parquet, variant_outcomes.parquet, summary.json, run.log.