This PR adds a new Python CLI tool for automating GitHub repository
management tasks.
## Overview
The initial implementation provides duplicate issue detection using text
similarity analysis. This is the first step toward automating repository
triage tasks.
## Features
- **Click-based CLI** with subcommands for future extensibility
- **find-duplicates command** for detecting duplicate issues using text
similarity
- Uses **gh CLI** for GitHub API access (no token management needed)
- Text similarity using `difflib.SequenceMatcher` (ratio-based
algorithm)
- Configurable similarity threshold (default: 0.6)
- Progress bar for long-running comparisons
- Age filtering support (`--min-age` parameter)
- Standard Python src-layout with **uv** for dependency management
- **Comprehensive test suite** with pytest (integrated into CI)
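As a rough illustration of the similarity check listed above, the following sketch compares two issue titles with `difflib.SequenceMatcher` and applies the 0.6 default threshold. The function names and the lowercase/strip normalisation are assumptions for illustration, not necessarily what `duplicate_finder.py` does.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio between two issue titles."""
    # Lowercasing and stripping are assumptions; the real tool may
    # normalise text differently (or not at all).
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_probable_duplicate(a: str, b: str, threshold: float = 0.6) -> bool:
    """Flag a pair when its ratio meets the configurable threshold."""
    return title_similarity(a, b) >= threshold
```

Raising the threshold (for example to 0.85) trades recall for precision: fewer pairs are reported, but those that are tend to be near-identical.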
## Project Structure
```
etc/scripts/gh_tool/
├── src/gh_tool/ # Main package
│ ├── cli.py # Click-based CLI interface
│ └── duplicate_finder.py # Core duplicate detection logic
├── tests/ # Test suite
│ └── test_duplicate_finder.py
├── docs/ # Documentation
│ ├── TRIAGE-CRITERIA.md # Triage guidelines from manual review
│ └── PHASE1-FINDINGS.md # Historical analysis of 855 issues
├── pyproject.toml # Package configuration
└── README.md # Usage documentation
```
## Usage
```bash
cd etc/scripts/gh_tool
uv sync
uv run gh_tool find-duplicates /tmp/report.md
```
**Options:**
- `--threshold FLOAT` - Similarity threshold 0-1 (default: 0.6)
- `--state {all,open,closed}` - Issue state to check (default: open)
- `--min-age DAYS` - Only check issues older than N days (default: 0)
- `--limit INTEGER` - Maximum number of issues to fetch (default: 1000)
- `--repo TEXT` - GitHub repository in owner/repo format (default:
compiler-explorer/compiler-explorer)
**Example:**
```bash
# Find high-confidence duplicates in open issues
uv run gh_tool find-duplicates /tmp/report.md --threshold 0.85
# Check all issues older than 30 days
uv run gh_tool find-duplicates /tmp/report.md --state all --min-age 30
```
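The options above map fairly directly onto a `gh issue list` invocation. A minimal sketch of how the fetch step might look, assuming the tool shells out to `gh` and asks for JSON output; `build_issue_list_command` and `fetch_issues` are hypothetical names, and the selected JSON fields are an assumption:

```python
import json
import subprocess


def build_issue_list_command(repo: str, state: str = "open", limit: int = 1000) -> list[str]:
    """Build the gh command line for fetching issues as JSON."""
    return [
        "gh", "issue", "list",
        "--repo", repo,
        "--state", state,        # open, closed, or all
        "--limit", str(limit),
        "--json", "number,title,createdAt",  # assumed field selection
    ]


def fetch_issues(repo: str, state: str = "open", limit: int = 1000) -> list[dict]:
    """Run gh and parse its JSON output; requires an authenticated gh CLI."""
    result = subprocess.run(
        build_issue_list_command(repo, state, limit),
        capture_output=True,
        text=True,
        check=True,  # raise if gh exits with a non-zero status
    )
    return json.loads(result.stdout)
```

Because `gh` handles authentication itself, a sketch like this never needs to see a token.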
## Testing
The tool includes comprehensive test coverage:
- Unit tests for similarity calculation
- Integration tests for duplicate detection
- Edge case handling (transitive grouping, age filtering, threshold
sensitivity)
- Report generation validation
**Run tests:**
```bash
cd etc/scripts/gh_tool
uv run pytest -v
```
Tests are integrated into CI and run on every push.
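The transitive grouping edge case mentioned above means that if issue A matches B and B matches C, all three should land in one group even when A and C fall below the threshold. One way to express that is a small union-find over the matched pairs; this is a sketch under that assumption, and `group_transitively` is a hypothetical name rather than the tool's actual function.

```python
def group_transitively(pairs):
    """Merge matched (issue_a, issue_b) pairs into connected groups."""
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk to the root,
        # shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two groups

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    # Sort members and groups so the result is deterministic.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


print(group_transitively([(1, 2), (2, 3), (10, 11)]))
# → [[1, 2, 3], [10, 11]]
```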
## Documentation
- **`README.md`**: Complete usage guide with examples
- **`docs/TRIAGE-CRITERIA.md`**: Comprehensive triage guidelines
developed during manual review of 22+ issues
- **`docs/PHASE1-FINDINGS.md`**: Historical analysis context from
initial 855 issue review
## CI Integration
The tool is integrated into the GitHub Actions workflow:
- `uv` is installed via `astral-sh/setup-uv@v6`
- Tests run automatically on every push
- Ensures tool remains functional as codebase evolves
## Next Steps
Future enhancements planned for follow-up PRs:
- GitHub Action for automatic duplicate detection on new issues
- Additional automation tools (upstream health checker, label validator,
etc.)
- Automated triage reports
## Changes in this PR
- ✅ Core duplicate detection implementation
- ✅ Comprehensive test suite (192 lines)
- ✅ CI integration
- ✅ Complete documentation
- ✅ Example triage criteria and findings
---------
Co-authored-by: Claude <noreply@anthropic.com>
Compiler Explorer Issue Analysis - Phase 1 Findings
Date: 2025-10-06
Total Open Issues: 855
Executive Summary
The issue backlog reveals clear patterns that can guide systematic triage:
- 46% (398) are stale (>2 years old, <5 comments) - prime candidates for review
- 35% (297) have zero engagement - likely duplicates, invalid, or need clarification
- 48% (412) are poorly labeled - only generic labels, impeding organization
- 51% (439) are generic "requests" - mostly compiler/library additions
- 28% (239) are bugs - actual functionality issues
- 5% (42) marked "probably-wont-happen" but still open - should be closed kindly
Key Statistics
Age Distribution
>5 years old: 137 (16.0%) ← Ancient, likely obsolete or philosophical
3-5 years old: 189 (22.1%) ← Old, likely stale
2-3 years old: 181 (21.2%) ← Getting stale
1-2 years old: 151 (17.7%)
6-12 months: 102 (11.9%)
3-6 months: 53 (6.2%)
<3 months: 42 (4.9%) ← Recent
Key Insight: 59% of issues are >2 years old. This suggests either low prioritization or lack of contributor bandwidth.
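The 59% figure follows directly from the three oldest buckets in the age distribution. A short check of the arithmetic, using the counts listed above (the bucket names are just labels chosen for this sketch):

```python
# Counts copied from the age distribution above.
age_buckets = {
    ">5 years": 137,
    "3-5 years": 189,
    "2-3 years": 181,
    "1-2 years": 151,
    "6-12 months": 102,
    "3-6 months": 53,
    "<3 months": 42,
}

total = sum(age_buckets.values())
older_than_two_years = sum(
    age_buckets[name] for name in (">5 years", "3-5 years", "2-3 years")
)
share = older_than_two_years / total

print(f"{older_than_two_years}/{total} = {share:.1%}")
# → 507/855 = 59.3%
```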
Label Distribution (Top 10)
request 439 (51.3%) ← Mostly compiler/library requests
bug 239 (28.0%)
new-compilers 121 (14.2%)
new-libs 93 (10.9%)
enhancement 79 (9.2%)
ui 51 (6.0%)
help wanted 44 (5.1%)
probably-wont-happen 42 (4.9%)
Status: triaged 35 (4.1%)
lang-c++ 29 (3.4%)
Engagement Patterns
Top 3 Most-Commented:
- #82 (39 comments) - "C(++) Compiler Master List" - META-ISSUE tracking compilers
- #891 (37 comments) - "Add pgroup compiler"
- #264 (35 comments) - "Add D8 JavaScript assembly output"
Zero Comments: 297 issues (35%) have NO engagement whatsoever
- Many are recent (<6 months) and may just need time
- Many are 5+ years old and likely forgotten/obsolete
Critical Findings
1. Tracking/Meta Issues
Issue #82 (10 years old) is a "Master List" with checkboxes tracking compiler support. This was useful historically but now creates confusion:
- 39 comments of various requests
- Acts as a catch-all for "add this compiler" requests
- Should probably be closed with pointer to proper issue templates
2. "Probably Won't Happen" Issues (42 total)
These are already flagged but remain open. Examples:
- #187 (8.9y) - "Add support for MMIX" - niche educational architecture
- #264 (8.7y) - "Add D8 JavaScript assembly" - 35 comments, out of scope
- #341 (8.5y) - "classic GCC versions" - limited value vs maintenance cost
- #514 (8.1y) - "ARM GCC 6.3.0 standard libraries missing" - old compiler version
Action: These should be kindly closed with explanations.
3. Stale Issues (398 total)
Criteria: >2 years old AND <5 comments
These show minimal community interest and are likely:
- Overtaken by events (e.g., new compiler versions available)
- Niche requests with no PR momentum
- Questions that were never answered
Sample oldest stale issues:
- #187 (8.9y, 0 comments) - MMIX support
- #297 (8.6y, 0 comments) - GCC7 verbose-asm format
- #341 (8.5y, 0 comments) - Classic GCC versions
- #425 (8.4y, 0 comments) - libfirm/cparser
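The stale criteria above (older than 2 years AND fewer than 5 comments) can be written as a single predicate. This is a minimal sketch; treating 2 years as 730 days is an assumption, and `is_stale` is a hypothetical helper, not part of the analysis scripts.

```python
from datetime import datetime, timedelta, timezone


def is_stale(
    created_at: datetime,
    comment_count: int,
    now: datetime,
    min_age_days: int = 730,  # assumption: 2 years approximated as 730 days
    max_comments: int = 5,
) -> bool:
    """True when an issue is older than the age cutoff AND has few comments."""
    old_enough = (now - created_at) > timedelta(days=min_age_days)
    return old_enough and comment_count < max_comments


# Hypothetical usage, pinned to the analysis date for reproducibility.
analysis_date = datetime(2025, 10, 6, tzinfo=timezone.utc)
print(is_stale(datetime(2016, 11, 1, tzinfo=timezone.utc), 0, analysis_date))
# → True
```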
4. Duplicate/Similar Requests
- GCC versions: 11 issues requesting various GCC versions/configurations
- Clang variants: 9 issues for different Clang forks/versions
- MSVC versions: 3+ issues for recent MSVC versions (could be consolidated)
- ROCm/AMD: 3 issues about ROCm compiler versions
These could be consolidated into tracking issues or closed as duplicates.
5. Poorly Labeled Issues (412 total)
Many issues have only generic labels like "request" or "bug" without specifics:
- No language tag (lang-c++, lang-rust, etc.)
- No area tag (ui, compiler-issue, etc.)
- No priority indication
This makes filtering and prioritization difficult.
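A "poorly labeled" check along these lines could be as simple as testing whether an issue's labels are a subset of the generic ones. The exact set of generic labels used in the analysis is not stated, so `GENERIC_LABELS` below is an assumption, as is counting unlabeled issues as poorly labeled.

```python
# Assumption: only "request" and "bug" are treated as generic here.
GENERIC_LABELS = frozenset({"request", "bug"})


def is_poorly_labeled(labels) -> bool:
    """True when an issue carries nothing beyond generic labels."""
    # An empty label list is a subset too, so unlabeled issues also match.
    return set(labels) <= GENERIC_LABELS


print(is_poorly_labeled(["request"]))
# → True
print(is_poorly_labeled(["request", "new-compilers", "lang-c++"]))
# → False
```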
Common Themes in Titles
Top keywords:
- "request" (437) - the label shows up in titles too
- "compiler" (183)
- "support" (97)
- "clang" (62)
- "library" (35)
- "msvc" (27)
Pattern: Most issues are straightforward "add X compiler" or "add Y library" requests.
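Keyword counts like these can be produced with `collections.Counter`. How the original analysis tokenised titles is not documented, so the regex and the choice to count each word once per title are assumptions in this sketch, and `top_title_keywords` is a hypothetical name.

```python
import re
from collections import Counter


def top_title_keywords(titles, n=6):
    """Return the n most common words across issue titles."""
    counts = Counter()
    for title in titles:
        # Assumption: lowercase, and count each word at most once per title.
        counts.update(set(re.findall(r"[a-z0-9+#]+", title.lower())))
    return counts.most_common(n)


# Hypothetical titles, for illustration only.
sample = ["[REQUEST] Add clang fork", "Request: clang 18", "Bug in UI"]
print(dict(top_title_keywords(sample, n=2)))
```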
Recommendations for Phase 2
Immediate Actions
- Close "Probably Won't Happen" Issues (42)
  - Draft kind, clear explanation template
  - Close with rationale and link to contribution docs if applicable
  - Expected impact: -5% issue count
- Close Ancient Zero-Comment Issues (estimate ~50-100)
  - Issues >5 years old with 0 comments and no activity
  - Use "stale bot" type reasoning: "no activity, assumed no longer relevant"
  - Allow 2 weeks for objections before closing
  - Expected impact: -6% to -12% issue count
- Close Obvious Duplicates (estimate ~20-30)
  - Multiple GCC/Clang version requests → consolidate or close with "use latest"
  - Expected impact: -2% to -4% issue count
Medium-Term Actions
- Improve Labeling (412 issues)
  - Add language-specific labels (lang-*)
  - Add area labels (ui, execution, api, etc.)
  - Add priority labels where clear
  - This enables better filtering and assignment
- Create "Good First Issue" Pipeline (~50 issues)
  - Review "help wanted" (44 issues) for suitable ones
  - Look for well-defined, isolated tasks
  - Add good-first-issue label and mentorship notes
- Consolidate Request Tracking
  - Create wiki/doc page for "Compiler Request Guidelines"
  - Explain: how to request, how to contribute, what makes requests likely
  - Close #82 meta-issue with pointer to new docs
Long-Term Strategy
- Implement Stale Bot
  - Auto-label issues >1 year old with no activity
  - Auto-close after warning period if still no activity
  - Keep it friendly: "seems resolved, please reopen if still relevant"
- Improve Issue Templates
  - Separate templates for: compiler requests, library requests, bug reports, feature requests
  - Require more details (version numbers, error messages, why needed)
  - Auto-label based on template used
- Regular Triage Cadence
  - Weekly review of new issues (label, prioritize, close if needed)
  - Monthly review of old issues (close stale, consolidate duplicates)
  - Quarterly review of "help wanted" (ensure still relevant)
Estimated Impact
Conservative cleanup estimate:
- Close "probably-wont-happen": -42 issues
- Close ancient zero-comment: -50 issues
- Close duplicates: -20 issues
- Total: -112 issues (13% reduction)
Aggressive cleanup estimate:
- Above plus stale >3 years: -200+ additional
- Total: -300+ issues (35% reduction)
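The percentages in both estimates are relative to the 855 open issues. A quick check of the arithmetic, using the figures above (the aggressive case uses the rounded floor of 300 stated in the text):

```python
TOTAL_OPEN = 855

# Conservative estimate: the three closure categories listed above.
conservative = {
    "probably-wont-happen": 42,
    "ancient zero-comment": 50,
    "duplicates": 20,
}
conservative_closed = sum(conservative.values())
print(f"conservative: -{conservative_closed} ({conservative_closed / TOTAL_OPEN:.0%})")
# → conservative: -112 (13%)

# Aggressive estimate: lower bound of "300+" from the text.
aggressive_closed = 300
print(f"aggressive: -{aggressive_closed}+ ({aggressive_closed / TOTAL_OPEN:.0%})")
# → aggressive: -300+ (35%)
```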
Next Steps
Proposed Plan:
- Review sample of "probably-wont-happen" issues together
- Draft closure message templates (kind but clear)
- Review sample of stale issues together
- Implement Phase 2: Systematic triage with your approval
- Consider automation (stale bot, better templates)
Files Generated
- `all-issues-raw.json` - Full issue data from GitHub
- `issues-summary.json` - Simplified format
- `full-analysis.txt` - Complete statistical analysis
- `PHASE1-FINDINGS.md` - This document