Sentry PR Review Friction

Abstract

This study measures where code review friction concentrates in getsentry/sentry and identifies which Architecture Decision Records (ADRs) could reduce repeated discussion cycles. Beyond aggregate metrics, we read the comment threads of high-friction PRs, classify recurring discussion themes with evidence, and propose ADRs grounded in real examples.

Scope

Parameter	Value
Repository	`getsentry/sentry`
Window	90 days (ending April 2026)
Merged PRs analyzed	500
Closed-unmerged PRs analyzed	500
Open PRs sampled	251
Deep comment analysis	60 high-friction PRs (50 merged + 10 closed-unmerged)
Total comments analyzed	965 (604 non-bot)

Key Findings

Median time-to-merge is 4.98 hours, but P90 reaches 70.54 hours — a 14x multiplier. Large PRs (≥10 files or ≥400 churn) have a median TTM of 22.52 hours vs 1.66 hours for tiny PRs.
PR size is the strongest friction predictor. Large PRs land in the high-friction quartile 57.4% of the time; tiny PRs land there only 9.8% of the time. Feature PRs (feat) have the highest friction rate at 38.6%, more than double the rate for fix PRs (17.6%).
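The size segmentation and percentile figures above can be sketched as follows. This is a minimal illustration, not the study's actual implementation: the field names (`files`, `churn`, `ttm_hours`), the sample records, and the nearest-rank percentile method are all assumptions. Only the "large" threshold (≥10 files or ≥400 churn) comes from the text; the cutoff for "tiny" is not stated here, so the sketch separates large PRs from everything else.

```python
from statistics import median

# Hypothetical records; the real pipeline's schema is not shown in this summary.
SAMPLE_PRS = [
    {"files": 1, "churn": 10, "ttm_hours": 1.0},
    {"files": 12, "churn": 50, "ttm_hours": 20.0},
    {"files": 3, "churn": 450, "ttm_hours": 30.0},
    {"files": 2, "churn": 20, "ttm_hours": 2.0},
]


def is_large(files_changed, churn):
    """Large-PR rule quoted in the findings: >=10 files or >=400 churn."""
    return files_changed >= 10 or churn >= 400


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100 (assumed method)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    # Integer ceiling of p * n / 100, so no floating-point rounding is involved.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(prs):
    """Median and P90 time-to-merge overall, plus medians by size bucket."""
    all_ttm = [pr["ttm_hours"] for pr in prs]
    large = [pr["ttm_hours"] for pr in prs if is_large(pr["files"], pr["churn"])]
    other = [pr["ttm_hours"] for pr in prs if not is_large(pr["files"], pr["churn"])]
    return {
        "median_ttm": median(all_ttm),
        "p90_ttm": percentile(all_ttm, 90),
        "median_ttm_large": median(large) if large else None,
        "median_ttm_other": median(other) if other else None,
    }


summary = summarize(SAMPLE_PRS)
```

Different percentile conventions (nearest-rank versus interpolated) give slightly different P90 values on small samples, so reproducing the reported 70.54 hours exactly would require the method used by analyze_sentry_prs.py.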
The top 9 discussion themes in high-friction PRs, identified from 604 non-bot comments across 60 PRs:
- API design and defaults — 38.3% of high-friction PRs
- Test evidence and coverage — 38.3%
- Component patterns and styling — 35.0%
- State management and data flow — 35.0%
- Code documentation — 31.7%
- Type safety and error handling — 30.0%
- Follow-up and scope creep — 25.0%
- Security and permissions — 20.0%
- Naming and consistency — 11.7%
Automated reviewers find real bugs, but the same bug categories repeat across PRs. Bot reviewers (sentry[bot], sentry-warden[bot], cursor[bot]) account for 23.7% of substantive review comments and appear in 68.3% of high-friction PRs. Reading the actual findings reveals at least 10 recurring patterns, including missing DoesNotExist handlers, .filter().first() followed by an unreachable except clause, direct dict access on API responses, option key mismatches, and companion list misses. These patterns should be promoted from expensive agentic review to cheap deterministic checks (Ruff/Semgrep rules, mypy strict mode, typed registries). One PR (#111522) had the same pattern flagged 4 times in a single review pass, which is exactly the case where one lint rule replaces repeated review comments on every later PR.
Abandoned PRs signal unresolved decision ambiguity. 12 of the 500 closed-unmerged PRs (2.4%) had ≥10 discussion items. These abandoned PRs had 2x the median TTM and 2.5x the median review events compared to merged high-friction PRs.
92 open PRs are stale 14+ days, with 33 stale 30+ days. The most-discussed stale PR has 55 review events and has been open for over 760 hours.
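The stale-PR counts above are nested buckets: every PR stale for 30+ days is also stale for 14+ days. A minimal sketch of that bucketing follows. It assumes staleness is measured from a PR's last activity timestamp, which this summary does not confirm, and the function name and sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_counts(last_activity, now, thresholds_days=(14, 30)):
    """Count PRs idle for at least each threshold; the buckets are cumulative."""
    counts = {}
    for days in thresholds_days:
        cutoff = timedelta(days=days)
        counts[days] = sum(1 for ts in last_activity if now - ts >= cutoff)
    return counts


# Hypothetical timestamps; 760 hours is about 31.7 days, so it lands in both buckets.
NOW = datetime(2026, 4, 30, tzinfo=timezone.utc)
SAMPLE_ACTIVITY = [
    NOW - timedelta(days=5),
    NOW - timedelta(days=14),
    NOW - timedelta(days=20),
    NOW - timedelta(days=31),
    NOW - timedelta(hours=760),
]

counts = stale_counts(SAMPLE_ACTIVITY, NOW)
```

Using timezone-aware datetimes avoids off-by-hours errors when the timestamps come from an API that reports UTC.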

Study Structure

Methodology — Data collection, tools, sample sizes, and limitations
Baseline Metrics — Aggregate metrics, percentiles, and size segmentation
Friction Map — Domain and area friction breakdown
Discussion Themes — Evidence-backed theme analysis from real comment threads
Automated Review Friction — Bot reviewers as a first-class friction source
Abandoned PRs — Patterns in closed-unmerged and stale PRs
ADR Proposals — Detailed proposals grounded in evidence

Reproducibility

All data and scripts are published for audit:

Study folder: studies/sentry-pr-review-friction
Method script: analyze_sentry_prs.py
Data artifacts: output/

# Reproduce the full pipeline
python analyze_sentry_prs.py collect --repo getsentry/sentry --days 90 --limit 500
python analyze_sentry_prs.py analyze
python analyze_sentry_prs.py report