Complete toolkit for cleaning up and maintaining consistent tags in Linkwarden instances, especially those using LLM-based auto-tagging.
Auto-tagging with small LLMs (like gemma3b) creates severe tag inconsistencies:
- Case duplicates: "Music" vs "music", "AI" vs "ai" (43 found)
- Semantic overlaps: "AI", "Machine Learning", "ML", "LLM" all meaning similar things
- Tag proliferation: 84% of tags used on ≤3 links
- Junk tags: Non-substantive tags like "Avoid", "Sign Up", "Room", "Feel"
- No reuse: the LLM creates new tags instead of reusing existing ones
Result: 4,866 tags where only ~200 are actually useful.
This toolkit provides four complementary tools:
- Tag Analyzer - Identify duplicates, overlaps, and low-usage tags
- Tag Consolidator - Merge duplicates and clean up existing tags
- Tag Normalizer - Prevent future tag proliferation (ongoing service)
- Junk Remover - Remove non-substantive tags that provide no value
# Clone the repository
git clone https://github.com/roelven/linkwarden-tag-cleanup.git
cd linkwarden-tag-cleanup
# Install dependencies
pip3 install -r requirements.txt
# Configure
cp .env.example .env
nano .env # Add your Linkwarden API URL and token

# 1. Analyze your tags
bin/run_analysis.sh
# 2. Consolidate duplicates (dry-run first)
bin/run_consolidation.sh
bin/run_consolidation.sh --no-dry-run
# 3. Remove junk tags
bin/run_junk_removal.sh --analyze
bin/run_junk_removal.sh
# 4. Set up ongoing normalization (cron)
crontab -e
# Add: */5 * * * * cd /path/to/linkwarden-cleanup && bin/run_normalization.sh >> normalization.log 2>&1

See QUICKSTART.md for detailed setup instructions.
Tag Analyzer (see the sketch below):
- ✅ Identify case-insensitive duplicates
- ✅ Find semantic overlaps
- ✅ Detect low-usage tags
- ✅ Generate consolidation mappings
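For intuition, here is a minimal sketch of the kind of check the analyzer performs, using the same `GET /tags` endpoint shown in the troubleshooting section below. The response shape assumed here (a JSON list of objects with `name` and `count` fields) is for illustration only and may not match what `analyze_tags.py` does internally.

```python
import os
from collections import defaultdict

import requests

API_URL = os.environ["LINKWARDEN_API_URL"]   # e.g. https://your-linkwarden.example.com/api/v1
TOKEN = os.environ["LINKWARDEN_TOKEN"]
LOW_USE_THRESHOLD = int(os.environ.get("LOW_USE_THRESHOLD", 3))


def fetch_tags():
    """Fetch all tags; assumes a JSON list of {"id", "name", "count"} objects."""
    resp = requests.get(
        f"{API_URL}/tags",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


def report(tags):
    # Group names case-insensitively to surface duplicates like "Music" vs "music".
    by_lower = defaultdict(list)
    for tag in tags:
        by_lower[tag["name"].casefold()].append(tag)

    case_dupes = {k: v for k, v in by_lower.items() if len(v) > 1}
    low_use = [t for t in tags if t.get("count", 0) <= LOW_USE_THRESHOLD]

    print(f"Case-insensitive duplicate groups: {len(case_dupes)}")
    print(f"Tags used on <= {LOW_USE_THRESHOLD} links: {len(low_use)}")


if __name__ == "__main__":
    report(fetch_tags())
```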
Tag Consolidator (see the sketch below):
- ✅ Automatic backup before changes
- ✅ Merge case variants (music → Music)
- ✅ Merge semantic duplicates (AI/ML/LLM → AI)
- ✅ Delete low-usage tags (≤3 uses)
- ✅ Update all affected links
- ✅ Dry-run mode for safety
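As a rough illustration of the case-merge step only: the sketch below picks the most-used spelling in each case-insensitive group as the canonical tag and prints the resulting plan. The real script additionally applies semantic merges, updates the affected links through the API, and writes a backup first; the `tags` input shape here is hypothetical.

```python
from collections import defaultdict


def build_case_merge_plan(tags, dry_run=True):
    """Map each case variant onto the most-used spelling in its group.

    `tags` is a list of {"name": str, "count": int} dicts (illustrative shape).
    """
    groups = defaultdict(list)
    for tag in tags:
        groups[tag["name"].casefold()].append(tag)

    plan = {}
    for variants in groups.values():
        if len(variants) < 2:
            continue
        canonical = max(variants, key=lambda t: t.get("count", 0))["name"]
        for variant in variants:
            if variant["name"] != canonical:
                plan[variant["name"]] = canonical

    prefix = "[dry-run] " if dry_run else ""
    for old, new in sorted(plan.items()):
        print(f"{prefix}{old} -> {new}")
    return plan


# build_case_merge_plan([{"name": "Music", "count": 12}, {"name": "music", "count": 2}])
# prints "[dry-run] music -> Music".
```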
Tag Normalizer (see the sketch below):
- ✅ Fuzzy matching (85% similarity)
- ✅ Automatic case normalization
- ✅ Reuse existing tags
- ✅ Runs via cron/systemd
- ✅ Configurable thresholds
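The 85% similarity check needs nothing beyond the standard library. The sketch below uses `difflib.SequenceMatcher` as one plausible way to decide whether a newly created tag should be folded into an existing one; the actual matching in `normalize_new_tags.py` may differ.

```python
from difflib import SequenceMatcher
from typing import List, Optional

SIMILARITY_THRESHOLD = 0.85  # mirrors SIMILARITY_THRESHOLD in .env


def best_existing_match(new_tag: str, existing_tags: List[str]) -> Optional[str]:
    """Return the existing tag the new tag should be folded into, or None to keep it."""
    best, best_score = None, 0.0
    for candidate in existing_tags:
        score = SequenceMatcher(None, new_tag.casefold(), candidate.casefold()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= SIMILARITY_THRESHOLD else None


# "machine-learning" vs. an existing "Machine Learning" scores well above 0.85,
# so the new tag is mapped onto the existing one instead of being created.
```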
Junk Remover (see the sketch below):
- ✅ Remove non-substantive tags
- ✅ 200+ built-in junk patterns
- ✅ Custom blocklist support
- ✅ Smart acronym detection
- ✅ Usage-based filtering
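A hedged sketch of how the blocklist, acronym detection, and usage filter could combine; the shipped script has 200+ built-in patterns and its own heuristics, so treat these rules as illustrative only.

```python
import re


def load_blocklist(path="config/junk_tags_blocklist.txt"):
    """One junk tag per line (assumed file format); blank lines are skipped."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().casefold() for line in f if line.strip()}


def is_junk(tag_name: str, usage_count: int, blocklist: set, low_use_threshold: int = 3) -> bool:
    # Smart acronym detection: short all-caps tags such as "AI" or "LLM" are kept,
    # even if a blocklist entry would otherwise match them.
    if re.fullmatch(r"[A-Z]{2,5}", tag_name):
        return False
    # Usage-based filtering: only flag tags that are both non-substantive and rarely used,
    # so a heavily used tag survives an overeager blocklist entry.
    return tag_name.casefold() in blocklist and usage_count <= low_use_threshold
```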
linkwarden-cleanup/
├── bin/                        # Wrapper scripts (run these)
│   ├── run_analysis.sh
│   ├── run_consolidation.sh
│   ├── run_normalization.sh
│   └── run_junk_removal.sh
├── scripts/                    # Core Python scripts
│   ├── analyze_tags.py
│   ├── consolidate_tags.py
│   ├── normalize_new_tags.py
│   └── remove_junk_tags.py
├── config/                     # Configuration files
│   ├── config.example.json
│   └── junk_tags_blocklist.txt
├── docs/                       # Documentation
│   ├── QUICKSTART.md
│   ├── JUNK_TAGS_GUIDE.md
│   ├── TESTING.md
│   ├── IMPLEMENTATION_SUMMARY.md
│   └── deployment/             # Systemd setup
└── examples/                   # Example configs and debug scripts
Before cleanup:
- Total tags: 4,866
- Single-use: 3,041 (62.5%)
- Low-use (≤3): 4,089 (84%)
- Junk tags: ~1,000
- Average usage: 1.5 links/tag
After cleanup (expected):
- Total tags: 150-250
- Tag reduction: ~95%
- Case consistency: 100%
- Average usage: 15+ links/tag
- Ongoing prevention: 80-90% of future duplicates
- [Quick Start Guide](docs/QUICKSTART.md) - Get up and running in 5 minutes
- [Junk Tags Guide](docs/JUNK_TAGS_GUIDE.md) - Remove non-substantive tags
- [Testing Guide](docs/TESTING.md) - Comprehensive test suite
- [Deployment Guide](docs/deployment/) - Systemd setup
LINKWARDEN_API_URL=https://your-linkwarden.example.com/api/v1
LINKWARDEN_TOKEN=your_api_token_here
LOW_USE_THRESHOLD=3
SIMILARITY_THRESHOLD=0.85
LOOKBACK_MINUTES=15

Getting an API token:
- Log in to Linkwarden
- Go to Settings → API Tokens
- Create a new token with read/write permissions
- Copy the token into your `.env` file (the sketch below shows a quick way to verify it works)
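If you want to double-check the values from `.env` before running anything, a few lines of Python with the same `requests` library the scripts depend on will do; this assumes the variables are exported into the environment (by your shell or a dotenv loader):

```python
import os

import requests

session = requests.Session()
session.headers["Authorization"] = f"Bearer {os.environ['LINKWARDEN_TOKEN']}"

# LINKWARDEN_API_URL already includes the /api/v1 prefix, so endpoints are appended directly.
resp = session.get(f"{os.environ['LINKWARDEN_API_URL']}/tags", timeout=30)
resp.raise_for_status()
print(f"Token OK, HTTP {resp.status_code}")
```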
Edit the consolidation mapping before applying:
bin/run_analysis.sh
nano consolidation_mapping.json # Review and customize
bin/run_consolidation.sh --no-dry-run

Add your own junk tags:
echo "placeholder" >> config/junk_tags_blocklist.txt
echo "example" >> config/junk_tags_blocklist.txt
bin/run_junk_removal.sh --analyze

# Only case normalizations
python3 scripts/consolidate_tags.py --skip-semantic --skip-delete
# Only semantic consolidations
python3 scripts/consolidate_tags.py --skip-case --skip-delete

Requirements:
- Python 3.7+
- `requests` library
- Linkwarden instance with API access
- API token with read/write permissions
Safety features (illustrated by the sketch below):
- Automatic backups - Tags saved before changes
- Dry-run mode - Preview changes before applying
- Confirmation prompts - Prevents accidental deletions
- Rate limiting - Avoids API throttling
- Error handling - Graceful failure recovery
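As an illustration of the rate-limiting and dry-run ideas (not the toolkit's actual implementation), every write can be funneled through a small helper that pauses between calls and skips the request entirely in dry-run mode:

```python
import time

import requests


class SafeWriter:
    """Route all write requests through one place: dry-run, pacing, basic error handling."""

    def __init__(self, session: requests.Session, dry_run: bool = True, delay_seconds: float = 0.5):
        self.session = session
        self.dry_run = dry_run
        self.delay_seconds = delay_seconds

    def request(self, method: str, url: str, **kwargs):
        if self.dry_run:
            print(f"[dry-run] {method} {url}")
            return None
        time.sleep(self.delay_seconds)  # crude rate limiting to avoid API throttling
        try:
            resp = self.session.request(method, url, timeout=30, **kwargs)
            resp.raise_for_status()
            return resp
        except requests.RequestException as exc:
            # Graceful failure: log the error and let the caller continue with the next tag.
            print(f"Request failed, skipping: {exc}")
            return None
```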
# Verify your token works
curl -H "Authorization: Bearer YOUR_TOKEN" \
  https://your-linkwarden.example.com/api/v1/tags

If the normalizer reports nothing to do: this is normal when no links were added recently; the service will pick up new links on its next run.
If a run complains that a tag no longer exists: the tag was already deleted or renamed, and the message is safe to ignore.
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
MIT License - see LICENSE file for details.
Created to solve tag proliferation issues with LLM-based auto-tagging in Linkwarden.
- Linkwarden - Self-hosted bookmark manager
- Linkwarden Docs - Official documentation
For issues, questions, or feature requests:
- Open an issue on GitHub
- Check the documentation
- Review the testing guide
Note: This toolkit works with Linkwarden v2.x. Always backup your data before running cleanup operations.