mirror of
https://github.com/compiler-explorer/compiler-explorer.git
synced 2026-05-16 14:53:04 -04:00
This PR adds a new Python CLI tool for automating GitHub repository
management tasks.
## Overview
The initial implementation provides duplicate issue detection using text
similarity analysis. This is the first step toward automating repository
triage tasks.
## Features
- **Click-based CLI** with subcommands for future extensibility
- **find-duplicates command** for detecting duplicate issues using text
similarity
- Uses **gh CLI** for GitHub API access (no token management needed)
- Text similarity using `difflib.SequenceMatcher` (ratio-based
algorithm)
- Configurable similarity threshold (default: 0.6)
- Progress bar for long-running comparisons
- Age filtering support (`--min-age` parameter)
- Standard Python src-layout with **uv** for dependency management
- **Comprehensive test suite** with pytest (integrated into CI)
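
The similarity check named in the list above can be sketched as follows. This is a minimal illustration, assuming titles are lowercased and stripped before comparison and that a pair is flagged when its ratio is at or above the threshold. The helper names `title_similarity` and `is_potential_duplicate` are hypothetical and not taken from the PR.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] for two issue titles."""
    # Assumption: normalise case and surrounding whitespace first.
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_potential_duplicate(a: str, b: str, threshold: float = 0.6) -> bool:
    """Flag a pair whose ratio meets the threshold (0.6 mirrors the stated default)."""
    # Assumption: the comparison is inclusive (>=) rather than strict (>).
    return title_similarity(a, b) >= threshold
```

`SequenceMatcher.ratio()` returns `2 * M / T`, where `M` is the number of matching characters and `T` is the combined length of both strings, so identical titles score 1.0 and titles with no characters in common score 0.0.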
## Project Structure
```
etc/scripts/gh_tool/
├── src/gh_tool/ # Main package
│ ├── cli.py # Click-based CLI interface
│ └── duplicate_finder.py # Core duplicate detection logic
├── tests/ # Test suite
│ └── test_duplicate_finder.py
├── docs/ # Documentation
│ ├── TRIAGE-CRITERIA.md # Triage guidelines from manual review
│ └── PHASE1-FINDINGS.md # Historical analysis of 855 issues
├── pyproject.toml # Package configuration
└── README.md # Usage documentation
```
## Usage
```bash
cd etc/scripts/gh_tool
uv sync
uv run gh_tool find-duplicates /tmp/report.md
```
**Options:**
- `--threshold FLOAT` - Similarity threshold 0-1 (default: 0.6)
- `--state {all,open,closed}` - Issue state to check (default: open)
- `--min-age DAYS` - Only check issues older than N days (default: 0)
- `--limit INTEGER` - Maximum number of issues to fetch (default: 1000)
- `--repo TEXT` - GitHub repository in owner/repo format (default:
compiler-explorer/compiler-explorer)
**Example:**
```bash
# Find high-confidence duplicates in open issues
uv run gh_tool find-duplicates /tmp/report.md --threshold 0.85
# Check all issues older than 30 days
uv run gh_tool find-duplicates /tmp/report.md --state all --min-age 30
```
## Testing
The tool includes comprehensive test coverage:
- Unit tests for similarity calculation
- Integration tests for duplicate detection
- Edge case handling (transitive grouping, age filtering, threshold
sensitivity)
- Report generation validation
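
Transitive grouping means that if issue A matches B and B matches C, all three belong to one group even when A and C fall below the threshold. One way to get that behaviour is a small union-find. The sketch below is an assumption about the approach, not the code in `duplicate_finder.py`, and `group_transitively` is a hypothetical name.

```python
def group_transitively(pairs: list[tuple[int, int]]) -> list[list[int]]:
    """Merge matched issue-number pairs into connected groups."""
    parent: dict[int, int] = {}

    def find(x: int) -> int:
        # Register unseen issues as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups: dict[int, list[int]] = {}
    for issue in list(parent):
        groups.setdefault(find(issue), []).append(issue)
    # Sort for deterministic output.
    return sorted(sorted(group) for group in groups.values())
```

A test for this case would pass the pairs `(1, 2)` and `(2, 3)` and expect a single group containing all three issues.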
**Run tests:**
```bash
cd etc/scripts/gh_tool
uv run pytest -v
```
Tests are integrated into CI and run on every push.
## Documentation
- **`README.md`**: Complete usage guide with examples
- **`docs/TRIAGE-CRITERIA.md`**: Comprehensive triage guidelines
developed during manual review of 22+ issues
- **`docs/PHASE1-FINDINGS.md`**: Historical analysis context from
initial 855 issue review
## CI Integration
The tool is integrated into the GitHub Actions workflow:
- `uv` is installed via `astral-sh/setup-uv@v6`
- Tests run automatically on every push
- Ensures tool remains functional as codebase evolves
## Next Steps
Future enhancements planned for follow-up PRs:
- GitHub Action for automatic duplicate detection on new issues
- Additional automation tools (upstream health checker, label validator,
etc.)
- Automated triage reports
## Changes in this PR
- ✅ Core duplicate detection implementation
- ✅ Comprehensive test suite (192 lines)
- ✅ CI integration
- ✅ Complete documentation
- ✅ Example triage criteria and findings
---------
Co-authored-by: Claude <noreply@anthropic.com>
GitHub Tools for Compiler Explorer
CLI tools for automating GitHub repository management tasks.
Setup
This project uses uv for Python version and dependency management.
Install dependencies:
cd etc/scripts/gh_tool
uv sync
Usage
Run from the gh_tool directory:
cd etc/scripts/gh_tool
# Get help
uv run gh_tool --help
# Get help for a specific command
uv run gh_tool find-duplicates --help
Commands
find-duplicates
Finds potential duplicate issues in the compiler-explorer repository using text similarity analysis (difflib.SequenceMatcher).
Usage:
# Basic usage (checks all open issues)
uv run gh_tool find-duplicates /tmp/duplicates-report.md
# Check all issues (including closed)
uv run gh_tool find-duplicates /tmp/all-duplicates.md --state all
# Adjust similarity threshold for higher confidence matches
uv run gh_tool find-duplicates /tmp/high-confidence.md --threshold 0.85
# Combine options
uv run gh_tool find-duplicates /tmp/report.md --threshold 0.7 --state all --min-age 30
# Use with a different repository
uv run gh_tool find-duplicates /tmp/other-repo.md --repo owner/repository
Arguments:
- `OUTPUT_FILE` (required) - Path to output markdown file
Options:
- `--threshold FLOAT` - Similarity threshold between 0 and 1 (default: 0.6)
  - 0.6 = 60% similar titles
  - Higher values = fewer, more confident matches
- `--state {all,open,closed}` - Which issues to check (default: open)
- `--min-age DAYS` - Only check issues older than N days (default: 0)
- `--limit INTEGER` - Maximum number of issues to fetch (default: 1000)
- `--repo TEXT` - GitHub repository in owner/repo format (default: compiler-explorer/compiler-explorer)
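
As an illustration of how `--min-age` could be applied, the sketch below keeps only issues created at least N days before the current time. It assumes each issue is a dict with an ISO 8601 `createdAt` string ending in `Z`, which is the shape `gh` emits for that JSON field. `filter_by_min_age` is a hypothetical helper, and the inclusive cutoff is an assumption.

```python
from datetime import datetime, timedelta, timezone


def filter_by_min_age(issues, min_age_days, now=None):
    """Keep issues whose creation time is at least min_age_days in the past."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    kept = []
    for issue in issues:
        # Replace the trailing "Z" so fromisoformat also works before Python 3.11.
        created = datetime.fromisoformat(issue["createdAt"].replace("Z", "+00:00"))
        if created <= cutoff:
            kept.append(issue)
    return kept
```

With the default of 0 the cutoff equals the current time, so no issue is filtered out.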
Example Output:
# Potential Duplicate Issues
Found 5 potential duplicate groups:
## Group 1 (85% similar)
- #3201 [LIB REQUEST] numpy (12 comments, created 2021-03-15)
- #7778 [LIB REQUEST] numpy (0 comments, created 2024-01-10)
## Group 2 (72% similar)
- #4336 [COMPILER REQUEST]: Groovy (3 comments, created 2022-05-20)
- #6526 [COMPILER REQUEST]: Groovy (1 comments, created 2023-08-15)
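
A report in the layout shown above could be produced along these lines. This is a sketch only: `render_report` and the dict keys (`number`, `title`, `comments`, `created`) are hypothetical, chosen to match the example output rather than the tool's real data model.

```python
def render_report(groups):
    """Render (similarity, issues) tuples as a markdown report string."""
    lines = [
        "# Potential Duplicate Issues",
        "",
        f"Found {len(groups)} potential duplicate groups:",
        "",
    ]
    for index, (similarity, issues) in enumerate(groups, start=1):
        # The ".0%" format turns 0.85 into "85%".
        lines.append(f"## Group {index} ({similarity:.0%} similar)")
        for issue in issues:
            lines.append(
                f"- #{issue['number']} {issue['title']} "
                f"({issue['comments']} comments, created {issue['created']})"
            )
        lines.append("")
    return "\n".join(lines)
```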
Performance:
The duplicate detection algorithm uses O(n²) pairwise comparisons. For reference:
- ~850 issues: ~362,000 comparisons (~1-2 minutes)
- ~1,000 issues: ~500,000 comparisons (~2-3 minutes)
A progress bar shows real-time progress during the comparison phase.
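
The comparison counts above follow from the number of unordered pairs, n(n-1)/2. The short sketch below shows the arithmetic and one way to enumerate each pair exactly once. The function names are illustrative and not taken from the tool.

```python
from itertools import combinations
from math import comb


def comparison_count(issue_count: int) -> int:
    """Number of unordered pairs among issue_count issues: n * (n - 1) / 2."""
    return comb(issue_count, 2)


def iter_pairs(titles: list[str]):
    """Yield each unordered pair of titles exactly once."""
    return combinations(titles, 2)


print(comparison_count(850))   # 360825
print(comparison_count(1000))  # 499500
```

Because the count grows quadratically, doubling the number of issues roughly quadruples the work, which is why `--limit` and `--min-age` are useful for keeping runs short.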
Requirements:
- `gh` CLI must be installed and authenticated
- Read access to the compiler-explorer/compiler-explorer repository
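
Because the tool relies on `gh` rather than a token, fetching issues amounts to shelling out to `gh issue list` and parsing its JSON. The sketch below is an assumption about how that could look. The helper names and the choice of JSON fields are hypothetical, while `--repo`, `--state`, `--limit` and `--json` are existing `gh issue list` flags.

```python
import json
import subprocess


def build_gh_command(repo: str, state: str = "open", limit: int = 1000) -> list[str]:
    """Build the gh invocation as an argument list (no shell quoting needed)."""
    return [
        "gh", "issue", "list",
        "--repo", repo,
        "--state", state,
        "--limit", str(limit),
        # Assumption: these four fields are enough for duplicate detection.
        "--json", "number,title,createdAt,comments",
    ]


def fetch_issues(repo: str, state: str = "open", limit: int = 1000) -> list[dict]:
    """Run gh and return the parsed issue list; raises if gh exits non-zero."""
    result = subprocess.run(
        build_gh_command(repo, state, limit),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```

If `gh` is missing, `subprocess.run` raises `FileNotFoundError`. If it is not authenticated, `check=True` turns the failure into a `CalledProcessError`.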
Future Tools
This directory is intended to house additional GitHub automation scripts such as:
- Upstream project health checker (detect abandoned compiler/library projects)
- Label consistency validator
- Issue template compliance checker
- Automated triage reports
Development
Run tests:
uv run pytest -v
Run linting:
uv run ruff check .
Format code:
uv run ruff format .