Matt Godbolt 70df51df29 Improve duplicate issue detection by stripping category tags (#8175)
## Summary

Fixes the duplicate issue detection algorithm to strip `[TAGS]` from
issue titles before calculating similarity. This eliminates massive
false-positive groups caused by shared tag prefixes like `[LIB REQUEST]`
or `[COMPILER REQUEST]`.

## Problem

The previous implementation would create groups of 98+ completely
unrelated issues just because they shared common tag prefixes. For
example:
- `[LIB REQUEST] Add ULib Library`
- `[LIB REQUEST] musl vs glibc`
- `[REQUEST] Float explorer support`
- `[REQUEST] Support logging in`

These would all be grouped together despite being completely different
requests.

## Solution

- Strip `[TAGS]` before calculating text similarity using a compiled
regex pattern
- Compare only the actual content: "Add ULib Library" vs "musl vs glibc"
→ low similarity ✓
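
The idea can be sketched as follows. This is a minimal illustration, not the code from this change: the names `TAG_PREFIX`, `strip_tags`, and `similarity` are hypothetical, `difflib.SequenceMatcher` stands in for whatever similarity measure the tool actually uses, and the sketch assumes only leading tags are stripped so that titles like "Problem with [opcode]" keep their bracketed content.

```python
import re
from difflib import SequenceMatcher

# Hypothetical pattern: one or more leading "[...]" tags plus trailing whitespace.
TAG_PREFIX = re.compile(r"^\s*(?:\[[^\]]*\]\s*)+")


def strip_tags(title: str) -> str:
    """Remove leading category tags such as "[LIB REQUEST]" from a title."""
    return TAG_PREFIX.sub("", title).strip()


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] (stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


if __name__ == "__main__":
    first = "[LIB REQUEST] Add ULib Library"
    second = "[LIB REQUEST] musl vs glibc"

    # With tags, the shared prefix inflates the score.
    print(f"with tags:    {similarity(first, second):.2f}")
    # Without tags, only the actual request text is compared.
    print(f"without tags: {similarity(strip_tags(first), strip_tags(second)):.2f}")
```

Compiling the pattern once at module level avoids recompiling it for every title when many issue pairs are compared.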

## Additional Changes

- Added `ruff` as a project dependency for consistent code quality
- Fixed linting issues (unused imports, updated to `datetime.UTC`)
- Updated tests to reflect new tag-stripping behavior

## Results

Testing on actual CE issues shows dramatic improvement:
- **Before**: 83 groups, with Group 1 containing 98 unrelated issues
(98% false positives)
- **After**: 63 groups, with Group 1 containing 2 legitimate "Forth"
duplicates

Most groups are now legitimate duplicates like:
- Three "Problem with [opcode]" bugs
- Two TI ARM compiler requests
- Multiple MSVC version requests

## Test Plan

- [x] All existing tests pass
- [x] Tested on real CE issue data, showing a reduction of 20 groups (83 → 63)
- [x] Code passes ruff linting and formatting

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude <noreply@anthropic.com>
2025-10-07 15:44:56 -05:00