compiler-explorer

mirror of https://github.com/compiler-explorer/compiler-explorer.git synced 2026-07-22 01:46:48 -04:00


Matt Godbolt 70df51df29 Improve duplicate issue detection by stripping category tags (#8175)

## Summary

Fixes the duplicate issue detection algorithm to strip `[TAGS]` from
issue titles before calculating similarity. This eliminates massive
false-positive groups caused by shared tag prefixes like `[LIB REQUEST]`
or `[COMPILER REQUEST]`.

## Problem

The previous implementation would create groups of 98+ completely
unrelated issues just because they shared common tag prefixes. For
example:
- `[LIB REQUEST] Add ULib Library` 
- `[LIB REQUEST] musl vs glibc`
- `[REQUEST] Float explorer support`
- `[REQUEST] Support logging in`

These would all be grouped together despite being completely different
requests.

## Solution

- Strip `[TAGS]` before calculating text similarity using a compiled
regex pattern
- Compare only the actual content: "Add ULib Library" vs "musl vs glibc"
→ low similarity ✓
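
The effect of tag stripping can be illustrated with a minimal sketch. It is not the code from this change: the pattern `LEADING_TAGS`, the helpers `strip_tags` and `similarity`, and the choice of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration, and the sketch assumes only leading tags are removed.

```python
import re
from difflib import SequenceMatcher

# Hypothetical compiled pattern: one or more leading "[...]" tags,
# together with the whitespace that follows them.
LEADING_TAGS = re.compile(r"^\s*(?:\[[^\]]*\]\s*)+")


def strip_tags(title: str) -> str:
    """Remove leading [TAG] prefixes from an issue title."""
    return LEADING_TAGS.sub("", title).strip()


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (assumed metric, not the project's)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


raw_a = "[LIB REQUEST] Add ULib Library"
raw_b = "[LIB REQUEST] musl vs glibc"

# The shared prefix inflates the score; stripping it leaves only the content.
with_tags = similarity(raw_a, raw_b)
without_tags = similarity(strip_tags(raw_a), strip_tags(raw_b))
print(f"with tags: {with_tags:.2f}, without tags: {without_tags:.2f}")
```

Anchoring the pattern at the start of the title is a design choice in this sketch: it leaves bracketed text inside a title, such as `Problem with [opcode]`, untouched.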

## Additional Changes

- Added `ruff` as a project dependency for consistent code quality
- Fixed linting issues (unused imports, updated to `datetime.UTC`)
- Updated tests to reflect new tag-stripping behavior

## Results

Testing on actual CE issues shows dramatic improvement:
- **Before**: 83 groups, with Group 1 containing 98 unrelated issues
(98% false positives)
- **After**: 63 groups, with Group 1 containing 2 genuine "Forth"
duplicates

Most groups are now legitimate duplicates like:
- Three "Problem with [opcode]" bugs
- Two TI ARM compiler requests  
- Multiple MSVC version requests

## Test Plan

- [x] All existing tests pass
- [x] Tested on real CE issue data, showing a reduction of 20 groups (83 to 63)
- [x] Code passes ruff linting and formatting

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude <noreply@anthropic.com>

2025-10-07 15:44:56 -05:00

test_duplicate_finder.py
