A document analysis tool that extracts text from PDF, DOCX, and TXT files, classifies the document's topic, and scores its relevance to a chosen topic through keyword matching.
- Multi-format Document Support: Extract text from PDF, DOCX, and TXT files
- Intelligent Topic Classification: AI-powered topic classification using Sentence Transformers
- Keyword Matching Analysis: Calculate topic relevance based on predefined keyword sets
- Interactive Web Interface: User-friendly Streamlit-based web application
- Real-time Analysis: Get instant results with visual progress indicators
- Multiple Analysis Pages: Main analysis page and advanced features
- Python 3.8 or higher
- pip package manager
- Clone the repository:

      git clone https://github.com/NhanPhamThanh-IT/Scan-PDF-Paper.git
      cd Scan-PDF-Paper

- Install dependencies:

      pip install -r requirements.txt

- Run the application:

      streamlit run app/main.py

- Open your browser and navigate to http://localhost:8501
- Select a Topic: Choose from the predefined topics, including:
  - AI & Technology
  - Healthcare
  - Finance
  - Environment
  - Cybersecurity
  - Software Development
  - And more...
- Upload Document: The following file formats are supported:
  - PDF files (.pdf)
  - Microsoft Word documents (.docx)
  - Plain text files (.txt)
- Get Analysis Results: View a detailed analysis including:
  - Total word count
  - Keyword matches found
  - Topic relevance percentage
  - Detailed breakdown of the analysis
Access the Advanced page for additional functionality and enhanced analysis options.
Scan-PDF-Paper/
├── app/
│   ├── main.py                        # Main Streamlit application
│   ├── assets/
│   │   └── themes.css                 # CSS styling
│   ├── core/
│   │   ├── AI/
│   │   │   └── TopicClassifier.py     # AI-powered topic classification
│   │   └── Utils/
│   │       ├── DataHandling.py        # Data processing utilities
│   │       ├── FileHandling.py        # File extraction utilities
│   │       └── TextHandling.py        # Text processing utilities
│   ├── dataset/
│   │   ├── metadata/
│   │   │   └── topics.json            # Available topics list
│   │   └── topics_keywords/           # Keyword datasets for each topic
│   │       ├── AI.json
│   │       ├── Healthcare.json
│   │       ├── Finance.json
│   │       └── ...
│   ├── pages/
│   │   ├── MainPage.py                # Main analysis interface
│   │   ├── AdvancesPage.py            # Advanced features
│   │   └── HelpsPage.py               # Help and documentation
│   ├── settings/
│   │   └── ThemeManager.py            # Theme management
│   └── ui/
│       ├── PageHeaderComponent.py     # Reusable header component
│       ├── ResultComponent.py         # Results display component
│       └── TabsComponent.py           # Navigation tabs component
├── requirements.txt                   # Python dependencies
└── README.md                          # Project documentation
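
The per-topic keyword files under `app/dataset/topics_keywords/` could be read with a small loader like the one below. This is only a sketch: the README does not document the JSON schema, so the code assumes each file holds a flat JSON array of keyword strings. The function name `load_keywords` and its parameters are hypothetical and are not taken from `DataHandling.py`.

```python
import json
from pathlib import Path
from typing import List


def load_keywords(dataset_dir: str, topic: str) -> List[str]:
    """Load the keyword list for one topic.

    ASSUMPTION: <dataset_dir>/topics_keywords/<topic>.json contains a JSON
    array of strings, e.g. ["bank", "loan"]. The real schema may differ.
    """
    path = Path(dataset_dir) / "topics_keywords" / f"{topic}.json"
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError(f"{path} does not contain a JSON array")
    # Normalize to lowercase and drop entries that are empty after trimming.
    return [str(item).strip().lower() for item in data if str(item).strip()]
```

Normalizing keywords at load time means later matching code can compare lowercase tokens directly.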
- Streamlit: Web framework for the user interface
- PyMuPDF (fitz): PDF text extraction
- python-docx: Microsoft Word document processing
- Sentence Transformers: AI-powered text analysis
- spaCy: Natural language processing and stop words removal
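
As a rough illustration of how the extraction libraries listed above might be combined, the sketch below dispatches on the file extension. It is not the project's `FileHandling.py`: the function name `extract_text` is hypothetical, and it takes a filesystem path, whereas a Streamlit upload arrives as a file-like object and would need adapting.

```python
from pathlib import Path


def extract_text(path: str) -> str:
    """Return the plain text of a .txt, .pdf, or .docx file (illustrative sketch)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".txt":
        # Replace undecodable bytes instead of failing on odd encodings.
        return Path(path).read_text(encoding="utf-8", errors="replace")
    if suffix == ".pdf":
        import fitz  # PyMuPDF, imported lazily so TXT handling works without it

        with fitz.open(path) as doc:
            return "\n".join(page.get_text() for page in doc)
    if suffix == ".docx":
        from docx import Document  # python-docx, also imported lazily

        return "\n".join(p.text for p in Document(path).paragraphs)
    raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
```

The lazy imports keep the heavier libraries off the import path until a matching file is opened.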
The application uses the all-MiniLM-L6-v2 model from Sentence Transformers to:
- Generate embeddings for input text
- Compare against predefined topic embeddings
- Calculate cosine similarity scores
- Provide confidence percentages for topic classification
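
The embedding comparison described above can be sketched as follows. This is a minimal illustration, not the project's `TopicClassifier.py`: `cosine_similarity`, `rank_topics`, the `encode` parameter, and the topic description strings are hypothetical names introduced here. Clamping negative similarities to zero before converting to a percentage is also an assumption.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_topics(
    text: str,
    topic_descriptions: Dict[str, str],
    encode: Callable[[List[str]], Sequence[Vector]],
) -> List[Tuple[str, float]]:
    """Score each topic against the text and return (topic, percent), best first."""
    names = list(topic_descriptions)
    # Encode the document and all topic descriptions in one batch.
    vectors = encode([text] + [topic_descriptions[name] for name in names])
    text_vec, topic_vecs = vectors[0], vectors[1:]
    scores = [
        # ASSUMPTION: negative similarities are clamped to 0 before scaling to percent.
        (name, round(max(0.0, cosine_similarity(text_vec, vec)) * 100, 1))
        for name, vec in zip(names, topic_vecs)
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# With the real model (weights are downloaded on first use), one might call:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   rank_topics(document_text, {"Finance": "banking, markets, investment"}, model.encode)
```

Passing the encoder in as a parameter keeps the scoring logic testable without downloading the model.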
- Document Parsing: Extract raw text from uploaded files
- Text Preprocessing: Remove stop words and normalize text
- Keyword Analysis: Match against topic-specific keyword sets
- AI Classification: Use machine learning for intelligent topic detection
- Results Generation: Calculate relevance scores and generate insights
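
The preprocessing and keyword-analysis steps above might look like the sketch below. The README does not state the relevance formula, so this code assumes relevance is the share of non-stop-word tokens that appear in the topic's keyword set. The names `preprocess` and `keyword_relevance` and the tiny stop-word list are hypothetical stand-ins; the project lists spaCy for stop-word removal.

```python
import re
from typing import Dict, Iterable, List

# Tiny illustrative stop-word list; a real run would use a fuller list (e.g. spaCy's).
STOP_WORDS = frozenset(
    {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to", "with"}
)


def preprocess(text: str) -> List[str]:
    """Lowercase the text, split it into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def keyword_relevance(text: str, keywords: Iterable[str]) -> Dict[str, object]:
    """Count keyword hits and express them as a percentage of all kept tokens."""
    tokens = preprocess(text)
    keyword_set = {keyword.lower() for keyword in keywords}
    matches = [token for token in tokens if token in keyword_set]
    # ASSUMPTION: relevance = matched tokens / total tokens after stop-word removal.
    relevance = 100.0 * len(matches) / len(tokens) if tokens else 0.0
    return {
        "total_words": len(tokens),
        "keyword_matches": len(matches),
        "matched_keywords": sorted(set(matches)),
        "relevance_percent": round(relevance, 1),
    }
```

For example, `keyword_relevance("The bank approved the loan and the investment in bonds.", ["Bank", "loan", "stock"])` keeps five tokens, matches two of them, and reports a relevance of 40.0 percent.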
The application supports analysis across 21+ topic categories:
- Technology: AI, Software, Cybersecurity
- Sciences: Healthcare, Environment, Science
- Business: Finance, Economy, Business
- Society: Education, Politics, Law, Culture
- Lifestyle: Sports, Travel, Food, Art
- Others: Media, Religion, Agriculture, Energy, Security
Run the test suite using pytest:

    pytest

Test configuration is available in pytest.ini.
- streamlit>=1.47.0: Web application framework
- PyMuPDF: PDF processing
- python-docx: Word document processing
- sentence-transformers>=2.6.1: AI text analysis
- torch>=2.0.0: Machine learning backend
- spacy: Natural language processing

- pytest: Testing framework
- pytest-mock: Testing utilities
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Documentation: Visit the Help & Documentation page in the application
- Issues: Report bugs or request features via GitHub Issues
- Discussions: Join project discussions on GitHub
- Support for additional file formats (RTF, ODT)
- Batch processing capabilities
- Export results to various formats
- Custom topic creation
- Advanced visualization features
- REST API integration
- First-time loading may take longer due to AI model initialization
- Large documents (>10MB) may require additional processing time
- Recommended RAM: 4GB+ for optimal performance
Built with ❤️ using Python and Streamlit