Commit Graph

4 Commits

Author SHA1 Message Date
Matt Godbolt
3fc36bb322 Tweaks 2025-10-07 16:50:14 -05:00
Matt Godbolt
fd39af7be8 Add AI-powered duplicate detection with strict filtering (#8176)
## Summary

Adds optional AI-powered duplicate detection using the Claude API to
filter out false positives from string similarity matching. The AI
analyzes candidate groups with strict rules and only confirms true
duplicates.

## Problem

The tag-stripping improvement (#8175) reduced false positives
significantly, but string similarity still produces noise:
- Different assemblers grouped together: "fasm", "YASM", "AsmX" (all
different tools)
- Unrelated features: "language tooltips" vs "language detection"
(different features)
- Specific vs general requests: "EWARM" vs "ARM execution" (specific
toolchain vs general support)

**Before AI filtering:** 63 groups with ~60% false positive rate

## Solution

### Two-phase detection:
1. **Broadphase:** String similarity (fast, high recall) creates
candidate groups
2. **AI Refinement:** Claude Sonnet 4 applies strict rules to confirm
duplicates

### Strict AI rules:
- ✓ **Duplicates:** Same tool with spelling/version variants ("NumPy" =
"numpy", "GCC 13" = "GCC 13.1")
- ✗ **NOT duplicates:** Different named tools ("fasm" ≠ "YASM"), related
features ("tooltips" ≠ "detection")

### Features:
- Optional `--use-ai` flag (requires `ANTHROPIC_API_KEY` in `.env` file)
- Adjustable confidence threshold with `--ai-confidence` (default: 0.7)
- AI reasoning included in markdown reports for transparency
- Graceful fallback if API key not available
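
A minimal sketch of the refinement step, assuming the AI call is wrapped in a callable that returns a verdict with a confidence score. The names `Verdict`, `refine_groups`, and `stub_judge` are hypothetical and do not come from the PR. The stub stands in for the Claude call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    is_duplicate: bool
    confidence: float
    reasoning: str  # kept so a report can show why a group was confirmed


# Hypothetical judge signature: receives the titles of one candidate group.
Judge = Callable[[list[str]], Verdict]


def refine_groups(candidates, judge: Optional[Judge], min_confidence: float = 0.7):
    """Keep candidate groups the judge confirms at or above min_confidence.

    With no judge (for example, no API key), return the broadphase
    groups unchanged so the tool still produces a report.
    """
    if judge is None:
        return [(group, None) for group in candidates]
    confirmed = []
    for group in candidates:
        verdict = judge(group)
        if verdict.is_duplicate and verdict.confidence >= min_confidence:
            confirmed.append((group, verdict))
    return confirmed


def stub_judge(titles):
    # Stand-in for the AI call: confirms only titles that are identical
    # once case is ignored.
    if len({t.casefold() for t in titles}) == 1:
        return Verdict(True, 0.95, "same name, differing only in case")
    return Verdict(False, 0.9, "different named tools")
```

Passing `judge=None` models the graceful fallback: the broadphase groups come back unfiltered rather than the run failing.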

## Results

Testing on 843 open CE issues:

| Phase | Groups | Quality |
|-------|--------|---------|
| Broadphase (string similarity) | 63 | ~40% accurate |
| AI filtering | 5 | **100% accurate** |

### Confirmed duplicates found:
1. Forth language requests (identical)
2. Documentation out-of-date reports (#5937 + #4906)
3. objdump tool requests (#4633 + #3139)
4. Haskell vector library requests
5. Make/webpack build issues (same bug)

### False positives eliminated:
- ✗ fasm/YASM/AsmX (different assemblers)
- ✗ Language tooltips vs detection (different features)
- ✗ ARM vs EWARM execution (general vs specific)
- ✗ Lua vs LUAU (related but different languages)
- ✗ OpenBLAS vs OpenSSL (different libraries)

## Cost Analysis

- 63 groups × ~400-500 tokens/group ≈ 25-32k tokens per run
- Cost: **~$0.15 per run** with Sonnet 4 (based on actual usage)
- Runs only when `--use-ai` flag is used
- Very affordable for occasional duplicate detection
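
The token estimate can be checked with a few lines of arithmetic. This sketch uses only the group count and per-group range quoted above. No per-token price is assumed, since the cost figure comes from actual usage.

```python
GROUPS = 63
TOKENS_PER_GROUP = (400, 500)  # rough per-group range quoted above

# Multiply the group count by each end of the range.
low, high = (GROUPS * tokens for tokens in TOKENS_PER_GROUP)
print(f"{low:,} to {high:,} tokens per run")  # 25,200 to 31,500 tokens per run
```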

## Configuration

Create `.env` file in `etc/scripts/gh_tool/`:
```bash
ANTHROPIC_API_KEY=sk-ant-...
```

The `.env` file is gitignored for security.
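
One way the key lookup and fallback could work is sketched below. `get_api_key` is a hypothetical helper, not a function from the PR. The `try`/`except` around the import is only there so the sketch runs without `python-dotenv` installed.

```python
import os

try:
    from dotenv import load_dotenv  # provided by python-dotenv
except ImportError:  # keep the sketch runnable without the dependency
    load_dotenv = None


def get_api_key(env=None):
    """Return the API key, or None so the caller can skip AI filtering."""
    if env is None:
        if load_dotenv is not None:
            # Reads a .env file if one is found; by default, variables
            # already set in the environment are not overridden.
            load_dotenv()
        env = os.environ
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    return key or None
```

A caller can treat a `None` result as the signal to run string-similarity detection only.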

## Example Usage

```bash
# Standard detection (no AI)
uv run gh_tool find-duplicates /tmp/report.md

# AI-powered detection
uv run gh_tool find-duplicates /tmp/report.md --use-ai

# Adjust AI confidence threshold
uv run gh_tool find-duplicates /tmp/report.md --use-ai --ai-confidence 0.8
```

## Dependencies Added

- `anthropic>=0.40.0` - Claude API SDK
- `python-dotenv>=1.0.0` - Environment variable management

## Test Plan

- [x] All existing tests pass
- [x] Tested on 843 real CE issues; all 5 AI-confirmed groups were true duplicates
- [x] Graceful fallback when API key not available
- [x] AI reasoning included in markdown reports
- [x] Code passes ruff linting

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude <noreply@anthropic.com>
2025-10-07 16:29:12 -05:00
Matt Godbolt
70df51df29 Improve duplicate issue detection by stripping category tags (#8175)
## Summary

Fixes the duplicate issue detection algorithm to strip `[TAGS]` from
issue titles before calculating similarity. This eliminates massive
false-positive groups caused by shared tag prefixes like `[LIB REQUEST]`
or `[COMPILER REQUEST]`.

## Problem

The previous implementation would create groups of 98+ completely
unrelated issues just because they shared common tag prefixes. For
example:
- `[LIB REQUEST] Add ULib Library` 
- `[LIB REQUEST] musl vs glibc`
- `[REQUEST] Float explorer support`
- `[REQUEST] Support logging in`

These would all be grouped together despite being completely different
requests.

## Solution

- Strip `[TAGS]` before calculating text similarity using a compiled
regex pattern
- Compare only the actual content: "Add ULib Library" vs "musl vs glibc"
→ low similarity ✓
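
The idea can be sketched as follows. The regex is an assumption: it strips only leading `[...]` tags, and the pattern the tool actually uses may differ. `strip_tags` and `similarity` are hypothetical names.

```python
import re
from difflib import SequenceMatcher

# Assumed pattern: one or more leading "[...]" tags plus surrounding whitespace.
TAG_PREFIX = re.compile(r"^\s*(?:\[[^\]]*\]\s*)+")


def strip_tags(title: str) -> str:
    """Remove leading category tags such as "[LIB REQUEST]"."""
    return TAG_PREFIX.sub("", title).strip()


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio of two titles with tags removed."""
    return SequenceMatcher(None, strip_tags(a).lower(), strip_tags(b).lower()).ratio()


print(strip_tags("[LIB REQUEST] Add ULib Library"))  # Add ULib Library
```

Because the pattern is anchored at the start of the title, brackets later in a title (as in "Problem with [opcode]") are left alone in this sketch.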

## Additional Changes

- Added `ruff` as a project dependency for consistent code quality
- Fixed linting issues (unused imports, updated to `datetime.UTC`)
- Updated tests to reflect new tag-stripping behavior

## Results

Testing on actual CE issues shows dramatic improvement:
- **Before**: 83 groups, with Group 1 containing 98 unrelated issues
(98% false positives)
- **After**: 63 groups, with Group 1 containing 2 genuine "Forth"
duplicates

Most groups are now legitimate duplicates like:
- Three "Problem with [opcode]" bugs
- Two TI ARM compiler requests  
- Multiple MSVC version requests

## Test Plan

- [x] All existing tests pass
- [x] Tested on real CE issue data showing 20+ group reduction
- [x] Code passes ruff linting and formatting

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude <noreply@anthropic.com>
2025-10-07 15:44:56 -05:00
Matt Godbolt
4cb1416c2a Add gh_tool CLI for GitHub repository automation (#8170)
This PR adds a new Python CLI tool for automating GitHub repository
management tasks.

## Overview

The initial implementation provides duplicate issue detection using text
similarity analysis. This is the first step toward automating repository
triage tasks.

## Features

- **Click-based CLI** with subcommands for future extensibility
- **find-duplicates command** for detecting duplicate issues using text
similarity
- Uses **gh CLI** for GitHub API access (no token management needed)
- Text similarity using `difflib.SequenceMatcher` (ratio-based
algorithm)
- Configurable similarity threshold (default: 0.6)
- Progress bar for long-running comparisons
- Age filtering support (`--min-age` parameter)
- Standard Python src-layout with **uv** for dependency management
- **Comprehensive test suite** with pytest (integrated into CI)
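
A minimal sketch of threshold-based grouping with `difflib.SequenceMatcher`. It assumes groups are formed transitively and uses a small union-find to merge them. That is one way to get transitive grouping and not necessarily how `duplicate_finder.py` does it. `group_duplicates` is a hypothetical name.

```python
from difflib import SequenceMatcher
from itertools import combinations


def group_duplicates(titles: dict[int, str], threshold: float = 0.6) -> list[list[int]]:
    """Group issue numbers whose titles are directly or transitively similar."""
    parent = {number: number for number in titles}

    def find(n: int) -> int:
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Compare every pair once and merge the groups of any pair that
    # meets the threshold.
    for a, b in combinations(titles, 2):
        ratio = SequenceMatcher(None, titles[a].lower(), titles[b].lower()).ratio()
        if ratio >= threshold:
            parent[find(a)] = find(b)

    groups: dict[int, list[int]] = {}
    for number in titles:
        groups.setdefault(find(number), []).append(number)
    # Singletons are not duplicates, so only groups of two or more are returned.
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

The pairwise loop is quadratic in the number of issues, which is why a progress bar is useful on a repository with hundreds of open issues.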

## Project Structure

```
etc/scripts/gh_tool/
├── src/gh_tool/          # Main package
│   ├── cli.py            # Click-based CLI interface
│   └── duplicate_finder.py  # Core duplicate detection logic
├── tests/                # Test suite
│   └── test_duplicate_finder.py
├── docs/                 # Documentation
│   ├── TRIAGE-CRITERIA.md    # Triage guidelines from manual review
│   └── PHASE1-FINDINGS.md    # Historical analysis of 855 issues
├── pyproject.toml        # Package configuration
└── README.md             # Usage documentation
```

## Usage

```bash
cd etc/scripts/gh_tool
uv sync
uv run gh_tool find-duplicates /tmp/report.md
```

**Options:**
- `--threshold FLOAT` - Similarity threshold 0-1 (default: 0.6)
- `--state {all,open,closed}` - Issue state to check (default: open)
- `--min-age DAYS` - Only check issues older than N days (default: 0)
- `--limit INTEGER` - Maximum number of issues to fetch (default: 1000)
- `--repo TEXT` - GitHub repository in owner/repo format (default:
compiler-explorer/compiler-explorer)
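
These options map naturally onto `gh issue list`. The sketch below is an assumption about how the issues could be fetched and age-filtered, not the tool's actual code. `build_issue_list_cmd`, `fetch_issues`, and `older_than` are hypothetical helpers, while the `gh` flags and JSON field names shown are real.

```python
import json
import subprocess
from datetime import datetime, timedelta, timezone


def build_issue_list_cmd(repo="compiler-explorer/compiler-explorer",
                         state="open", limit=1000):
    """Build the gh invocation; returned as a list so no shell quoting is needed."""
    return ["gh", "issue", "list", "--repo", repo, "--state", state,
            "--limit", str(limit), "--json", "number,title,createdAt"]


def fetch_issues(**kwargs):
    """Run gh and parse its JSON output (requires gh installed and authenticated)."""
    result = subprocess.run(build_issue_list_cmd(**kwargs),
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def older_than(issues, min_age_days, now=None):
    """Keep issues created at least min_age_days before now."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    return [
        issue for issue in issues
        # gh emits timestamps like "2025-10-07T21:50:14Z"; normalise the
        # trailing "Z" so fromisoformat accepts it on older Pythons.
        if datetime.fromisoformat(issue["createdAt"].replace("Z", "+00:00")) <= cutoff
    ]
```

Delegating to `gh` means authentication is whatever the local CLI already has configured, which matches the "no token management" goal.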

**Example:**
```bash
# Find high-confidence duplicates in open issues
uv run gh_tool find-duplicates /tmp/report.md --threshold 0.85

# Check all issues older than 30 days
uv run gh_tool find-duplicates /tmp/report.md --state all --min-age 30
```

## Testing

The tool includes comprehensive test coverage:
- Unit tests for similarity calculation
- Integration tests for duplicate detection
- Edge case handling (transitive grouping, age filtering, threshold
sensitivity)
- Report generation validation

**Run tests:**
```bash
cd etc/scripts/gh_tool
uv run pytest -v
```

Tests are integrated into CI and run on every push.

## Documentation

- **`README.md`**: Complete usage guide with examples
- **`docs/TRIAGE-CRITERIA.md`**: Comprehensive triage guidelines
developed during manual review of 22+ issues
- **`docs/PHASE1-FINDINGS.md`**: Historical analysis context from
initial 855 issue review

## CI Integration

The tool is integrated into the GitHub Actions workflow:
- `uv` is installed via `astral-sh/setup-uv@v6`
- Tests run automatically on every push
- Ensures tool remains functional as codebase evolves

## Next Steps

Future enhancements planned for follow-up PRs:
- GitHub Action for automatic duplicate detection on new issues
- Additional automation tools (upstream health checker, label validator,
etc.)
- Automated triage reports

## Changes in this PR

- Core duplicate detection implementation
- Comprehensive test suite (192 lines)
- CI integration
- Complete documentation
- Example triage criteria and findings

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-10-07 14:50:22 -05:00