
Commit 8c6d7d9

duyetbot and claude committed
feat: transform into intelligent, living wordlist toolkit
This is not just an update; it's a complete transformation from a static archive into an intelligent, automated, quality-driven toolkit.

## The Vision: From Museum to Living System

The repository has been frozen since 2017 with only maintenance commits. This transformation brings it into 2025 with modern automation, quality control, and comprehensive documentation.

## What's Changed

### 🤖 Intelligent Automation

- **validate.py**: Comprehensive validation system
  - Encoding detection & verification
  - Duplicate detection across 11.5M entries
  - SHA256 checksums for integrity
  - Statistics generation (min/max/avg lengths)
  - Manifest generation with full metadata
- **deduplicate.py**: Smart deduplication tool
  - Order-preserving duplicate removal
  - Batch processing capability
  - Detailed statistics reporting
- **manifest.json**: Auto-generated metadata
  - 70 wordlists validated
  - 11.5M entries, 11.4M unique
  - 102MB total size
  - Complete provenance tracking

### ✅ Real CI/CD Pipeline

- **Replaced** placeholder "Hello, world!" workflow
- **Added** comprehensive validation suite
  - File encoding checks
  - Corruption detection
  - Sensitive data scanning
  - Integrity verification
  - Statistics generation
- **Automated** quality assurance on every commit/PR

### 📖 Documentation Excellence

- **CLAUDE.md**: Project philosophy & guiding principles
  - Quality over quantity
  - Ethical use standards
  - Technical philosophy
  - Community-first approach
- **CONTRIBUTING.md**: Comprehensive contribution guide
  - Step-by-step contribution process
  - Quality standards & requirements
  - Ethical use guidelines
  - Testing procedures
- **CHANGELOG.md**: Transparent version history
  - Full project timeline (2015-2025)
  - Future roadmap
  - Semantic versioning strategy
- **README.md**: Transformed from catalog to guide
  - Quick start examples
  - Use case decision matrix
  - Tool compatibility guide
  - 70 wordlists with ratings & recommendations
  - Real-world usage examples
  - Ethical use guidelines

### 🗂️ Better Organization

- **.gitignore**: Prevent Python cache pollution
- **scripts/README.md**: Tool documentation
- **Removed**: Old blank.yml placeholder workflow

## The Numbers

- **11,572,279** total entries validated
- **11,490,579** unique entries (99.3% unique rate)
- **70** wordlist files checked
- **0** validation errors
- **8** warnings (duplicates flagged for review)
- **102 MB** of curated security data

## Philosophy

"Quality is not an act, it's a habit." - Aristotle

This transformation embodies that philosophy:

- Automation ensures consistent quality
- Validation prevents regressions
- Documentation guides contributors
- Manifest provides transparency
- CI/CD enforces standards

## What This Means

The repository is no longer just a collection of text files; it's a **living, breathing toolkit** that:

✅ Validates itself automatically
✅ Knows its own metadata
✅ Guides users to the right wordlist
✅ Welcomes contributions with clear standards
✅ Maintains quality through automation
✅ Stays transparent through comprehensive docs

From static archive to intelligent system. From maintenance mode to evolution. From good to insanely great.

---

Co-authored-by: Claude (Anthropic) <noreply@anthropic.com>
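
The commit message lists what `validate.py` checks, but the script itself does not appear in this excerpt. As a rough illustration only, the sketch below computes the same kinds of per-file facts (SHA256 checksum, entry and unique counts, min/max/avg length) for a single wordlist. The function name `summarize_wordlist` and the dictionary keys are hypothetical and are not taken from the repository or from the real `manifest.json` schema.

```python
import hashlib
import tempfile
from pathlib import Path


def summarize_wordlist(path):
    """Return checksum, duplicate and length statistics for one wordlist.

    Hypothetical helper: the keys below are illustrative, not the
    repository's actual manifest format.
    """
    raw = Path(path).read_bytes()
    # Decode permissively so a single bad byte does not abort the summary.
    text = raw.decode("utf-8", errors="replace")
    # Split on LF only and strip a trailing CR so CRLF files count the same.
    entries = [line.rstrip("\r") for line in text.split("\n")]
    entries = [entry for entry in entries if entry]
    lengths = [len(entry) for entry in entries]
    unique = len(set(entries))
    return {
        "sha256": hashlib.sha256(raw).hexdigest(),
        "entries": len(entries),
        "unique": unique,
        "duplicates": len(entries) - unique,
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "avg_length": sum(lengths) / len(lengths) if lengths else 0.0,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.txt"
        sample.write_bytes(b"admin\nroot\nadmin\nletmein\n")
        print(summarize_wordlist(sample))
```

Reading the whole file into memory keeps the sketch short and is workable at the roughly 102 MB total size quoted above; a streaming version would be the safer choice for much larger lists.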
1 parent 233b5e5 commit 8c6d7d9
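
Order-preserving duplicate removal, as described for `deduplicate.py` in the commit message, usually means keeping the first occurrence of each entry and dropping later repeats. The sketch below shows that general technique under that assumption; it is not the repository's script, and `dedupe_lines` is a hypothetical name.

```python
def dedupe_lines(lines):
    """Return entries in first-seen order with later duplicates removed."""
    seen = set()
    kept = []
    for line in lines:
        # Compare entries without their line terminator.
        entry = line.rstrip("\r\n")
        if entry in seen:
            continue
        seen.add(entry)
        kept.append(entry)
    return kept


if __name__ == "__main__":
    words = ["123456", "password", "123456", "qwerty", "password"]
    unique = dedupe_lines(words)
    print(unique)
    print(f"removed {len(words) - len(unique)} duplicates")
```

A set gives constant-time membership checks while the list preserves order, which matters for wordlists sorted by likelihood. `list(dict.fromkeys(entries))` achieves the same result in one expression once the terminators are stripped.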

11 files changed: +2570 -65 lines changed

.github/workflows/blank.yml

Lines changed: 0 additions & 17 deletions
This file was deleted.

.github/workflows/validate.yml

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
+name: Quality Assurance
+
+on:
+  push:
+    branches: [ main, master, 'claude/**' ]
+  pull_request:
+    branches: [ main, master ]
+
+jobs:
+  validate:
+    name: Validate Wordlists
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+
+      - name: Run validation suite
+        run: |
+          echo "🔍 Running wordlist validation..."
+          python3 scripts/validate.py
+
+      - name: Check for encoding issues
+        run: |
+          echo "📝 Checking file encodings..."
+          if file *.txt *.lst 2>/dev/null | grep -v "UTF-8\|ASCII"; then
+            echo "⚠️ Warning: Non-UTF-8 files detected"
+          else
+            echo "✓ All files are UTF-8 or ASCII"
+          fi
+
+      - name: Generate statistics
+        run: |
+          echo "📊 Generating statistics..."
+          echo "Total wordlist files: $(find . -type f \( -name "*.txt" -o -name "*.lst" \) ! -path "./.git/*" | wc -l)"
+          echo "Total size: $(du -sh . | cut -f1)"
+          echo "Total lines: $(find . -type f \( -name "*.txt" -o -name "*.lst" \) ! -path "./.git/*" -exec wc -l {} + | tail -1 | awk '{print $1}')"
+
+      - name: Upload manifest
+        uses: actions/upload-artifact@v4
+        with:
+          name: wordlist-manifest
+          path: manifest.json
+          retention-days: 30
+
+      - name: Validate manifest
+        run: |
+          if [ -f manifest.json ]; then
+            echo "✓ Manifest generated successfully"
+            cat manifest.json | python3 -m json.tool > /dev/null
+            echo "✓ Manifest is valid JSON"
+          else
+            echo "✗ Manifest generation failed"
+            exit 1
+          fi
+
+  security:
+    name: Security Checks
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Check for sensitive data
+        run: |
+          echo "🔒 Scanning for potential sensitive data patterns..."
+
+          # Check for API keys, tokens, etc. The trailing `grep .` makes the exit status reflect whether anything matched (head alone always exits 0).
+          if grep -r -i -E "(api[_-]?key|secret[_-]?key|password|token|bearer)" *.txt *.lst 2>/dev/null | grep -v "password" | head -5 | grep .; then
+            echo "⚠️ Warning: Potential sensitive data patterns found"
+            echo "⚠️ Please review carefully"
+          else
+            echo "✓ No obvious sensitive data patterns detected"
+          fi
+
+      - name: Verify file sizes
+        run: |
+          echo "📏 Checking for unexpectedly large files..."
+          large_files=$(find . -type f \( -name "*.txt" -o -name "*.lst" \) ! -path "./.git/*" -size +100M -exec ls -lh {} \;)
+          if [ -n "$large_files" ]; then echo "$large_files" | sed 's/^/⚠️ Large file detected: /'
+          else echo "✓ All files within reasonable size limits"; fi
+
+  integrity:
+    name: Integrity Verification
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Verify file integrity
+        run: |
+          echo "🔐 Verifying file integrity..."
+
+          # Check for null bytes (corruption indicator). Bash cannot pass a NUL in an argument, so match it with a PCRE escape.
+          if find . -type f \( -name "*.txt" -o -name "*.lst" \) ! -path "./.git/*" -exec grep -lP '\x00' {} \; 2>/dev/null | head -5 | grep .; then
+            echo "✗ Corrupted files detected (null bytes found)"
+            exit 1
+          else
+            echo "✓ No corrupted files detected"
+          fi
+
+      - name: Check line endings
+        run: |
+          echo "📄 Checking line endings consistency..."
+          if find . -type f \( -name "*.txt" -o -name "*.lst" \) ! -path "./.git/*" -exec file {} \; | grep -i "CRLF" | head -5 | grep .; then
+            echo "⚠️ Warning: Windows line endings (CRLF) detected"
+            echo "⚠️ Consider normalizing to Unix (LF) for consistency"
+          else
+            echo "✓ Line endings are consistent"
+          fi
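
The integrity job's two checks, null bytes as a corruption indicator and CRLF line endings, could also be expressed in Python, where each result is a return value rather than the exit status of a shell pipeline. The following is a minimal sketch under that assumption; `scan_file` and `scan_tree` are hypothetical helpers and are not part of this commit.

```python
from pathlib import Path


def scan_file(path, chunk_size=1 << 20):
    """Report whether a file contains NUL bytes or CRLF line endings."""
    has_nul = False
    has_crlf = False
    prev_cr = False  # True when the previous chunk ended with "\r"
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            if b"\x00" in chunk:
                has_nul = True
            # Also catch a CRLF pair split across two chunks.
            if b"\r\n" in chunk or (prev_cr and chunk.startswith(b"\n")):
                has_crlf = True
            prev_cr = chunk.endswith(b"\r")
    return {"null_bytes": has_nul, "crlf": has_crlf}


def scan_tree(root="."):
    """Scan every .txt and .lst file under root, skipping the .git directory."""
    results = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or ".git" in path.parts:
            continue
        if path.suffix in {".txt", ".lst"}:
            results[str(path)] = scan_file(path)
    return results


if __name__ == "__main__":
    report = scan_tree(".")
    corrupted = [name for name, info in report.items() if info["null_bytes"]]
    for name in corrupted:
        print(f"null bytes found: {name}")
    # A non-zero exit code fails the CI step only when corruption was found.
    raise SystemExit(1 if corrupted else 0)
```

Reading in fixed-size chunks keeps memory use flat for large wordlists, and the explicit exit code means the step fails only when a file actually contains a null byte.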

.gitignore

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# Virtual environments
+venv/
+ENV/
+env/
+
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+
+# OS
+.DS_Store
+Thumbs.db
+
+# Temporary files
+*.tmp
+*.bak
+*.log

CHANGELOG.md

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [Unreleased]
+
+### Added
+- **Automation & Quality Control**
+  - Python validation tool (`scripts/validate.py`) for comprehensive wordlist validation
+  - Deduplication tool (`scripts/deduplicate.py`) for removing duplicates
+  - Real CI/CD pipeline with quality assurance checks
+  - Manifest generation system for metadata tracking
+  - Security scanning for sensitive data patterns
+  - Integrity verification for file corruption detection
+
+- **Documentation**
+  - CLAUDE.md - Project philosophy and guiding principles
+  - CONTRIBUTING.md - Comprehensive contribution guidelines
+  - CHANGELOG.md - This file, tracking all changes
+  - Enhanced README with usage examples and decision matrices
+
+- **Infrastructure**
+  - GitHub Actions workflows for automated validation
+  - Metadata framework for tracking wordlist provenance
+  - Statistics generation on every commit
+
+### Changed
+- **GitHub Actions**: Replaced placeholder "Hello, world!" workflow with meaningful validation suite
+- **Quality Standards**: Established encoding, format, and validation requirements
+- **Project Organization**: Defined clear directory structure and naming conventions
+
+### Improved
+- **Documentation**: Transformed from basic catalog to comprehensive guide
+- **Validation**: Automated checks for encoding, duplicates, and integrity
+- **Community**: Clear guidelines for ethical use and contribution
+
+### Philosophy
+This update represents a transformation from a **static archive** to a **living toolkit**. We're not just storing wordlists; we're curating them with intelligence, validating them automatically, and documenting them thoroughly.
+
+---
+
+## [1.0.0] - 2017-10-15
+
+### Added
+- Forced-browsing wordlists by @danivijay
+  - Comprehensive directory/file discovery lists
+  - Categorized by type (Conf, Database, Language, Project)
+  - Contextual paths (admin, test, debug, error)
+- Cain.txt password list (306,706 entries)
+
+### Summary
+Last major content update before entering maintenance mode. Established the core collection that has served the security community for years.
+
+---
+
+## [Historical] - 2015-2017
+
+### Initial Collection (2015-2016)
+- 2.1M password list from dazzlepod.com
+- Facebook first names dataset (4.3M entries)
+- Bitcoin brainwallet dictionary (394,748 words)
+- US cities and usernames collections
+- SecLists password compilation (1M entries)
+- SKTorrent username and password lists
+- Filtered password sets (7+ and 8+ character requirements)
+- Indonesian cities list
+- 10,000 common subdomains
+
+### Contributors
+Special thanks to all contributors who built this collection:
+- Van-Duyet Le (@duyet) - Project creator and primary maintainer
+- Taufiq Sumadi (@taufiqsumadi)
+- San Sayidul Akdam Augusta (@sanAkdam)
+- Dani Vijay (@danivijay) - Forced-browsing wordlists
+
+---
+
+## Future Roadmap
+
+### Planned Improvements
+- [ ] Reorganize directory structure for better navigation
+- [ ] Add compressed versions (.gz) for large files
+- [ ] Implement wordlist effectiveness metrics
+- [ ] Create specialized subsets (top 100, top 1000, etc.)
+- [ ] Add modern password patterns (passphrases, emoji passwords)
+- [ ] Integrate with breach databases for automatic updates
+- [ ] Build web interface for searching and filtering
+- [ ] Create comparison matrices for choosing the right wordlist
+- [ ] Add localized wordlists for non-English passwords
+
+### Community Requests
+Have a suggestion? [Open an issue](https://github.com/duyet/bruteforce-database/issues) or start a discussion!
+
+---
+
+## Versioning Strategy
+
+We use semantic versioning:
+- **MAJOR**: Significant reorganization or breaking changes
+- **MINOR**: New wordlists or major improvements
+- **PATCH**: Updates to existing wordlists or documentation
+
+Current version reflects the **quality transformation**, not just content updates.
+
+---
+
+*"The only way to do great work is to love what you do." - Steve Jobs*
