📄 Scan-PDF-Paper

A document analysis tool that extracts text from PDF, DOCX, and TXT files, then scores how well the document fits a chosen topic using keyword matching and AI-based topic classification.

🌟 Features

Multi-format Document Support: Extract text from PDF, DOCX, and TXT files
Intelligent Topic Classification: AI-powered topic classification using Sentence Transformers
Keyword Matching Analysis: Calculate topic relevance based on predefined keyword sets
Interactive Web Interface: User-friendly Streamlit-based web application
Real-time Analysis: Get instant results with visual progress indicators
Multiple Pages: A main analysis page, an advanced features page, and a help page

🚀 Quick Start

Prerequisites

Python 3.8 or higher
pip package manager

Installation

Clone the repository

```
git clone https://github.com/NhanPhamThanh-IT/Scan-PDF-Paper.git
cd Scan-PDF-Paper
```

Install dependencies
```
pip install -r requirements.txt
```
Run the application
```
streamlit run app/main.py
```
Open your browser and navigate to http://localhost:8501

📖 Usage

Main Analysis Page

Select a Topic: Choose from predefined topics including:
- AI & Technology
- Healthcare
- Finance
- Environment
- Cybersecurity
- Software Development
- And more...
Upload a Document: The following file formats are supported:
- PDF files
- Microsoft Word documents (.docx)
- Plain text files (.txt)
Get Analysis Results: View detailed analysis including:
- Total word count
- Keyword matches found
- Topic relevance percentage
- Detailed breakdown of analysis

Advanced Features

Access the Advanced page for additional functionality and enhanced analysis options.

🏗️ Project Structure

```
Scan-PDF-Paper/
├── app/
│   ├── main.py                 # Main Streamlit application
│   ├── assets/
│   │   └── themes.css          # CSS styling
│   ├── core/
│   │   ├── AI/
│   │   │   └── TopicClassifier.py  # AI-powered topic classification
│   │   └── Utils/
│   │       ├── DataHandling.py     # Data processing utilities
│   │       ├── FileHandling.py     # File extraction utilities
│   │       └── TextHandling.py     # Text processing utilities
│   ├── dataset/
│   │   ├── metadata/
│   │   │   └── topics.json         # Available topics list
│   │   └── topics_keywords/        # Keyword datasets for each topic
│   │       ├── AI.json
│   │       ├── Healthcare.json
│   │       ├── Finance.json
│   │       └── ...
│   ├── pages/
│   │   ├── MainPage.py             # Main analysis interface
│   │   ├── AdvancesPage.py         # Advanced features
│   │   └── HelpsPage.py            # Help and documentation
│   ├── settings/
│   │   └── ThemeManager.py         # Theme management
│   └── ui/
│       ├── PageHeaderComponent.py  # Reusable header component
│       ├── ResultComponent.py      # Results display component
│       └── TabsComponent.py        # Navigation tabs component
├── requirements.txt                # Python dependencies
└── README.md                       # Project documentation
```

🛠️ Technical Details

Core Technologies

Streamlit: Web framework for the user interface
PyMuPDF (fitz): PDF text extraction
python-docx: Microsoft Word document processing
Sentence Transformers: AI-powered text analysis
spaCy: Natural language processing and stop words removal
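
To illustrate how the extraction libraries above fit together, here is a minimal sketch of a per-format dispatcher. The function name `extract_text` and the path-based interface are assumptions made for this example and are not taken from `FileHandling.py`; the app itself works with files uploaded through Streamlit, which it may handle differently.

```python
from pathlib import Path


def extract_text(path):
    """Return the plain text of a PDF, DOCX, or TXT file (hypothetical helper)."""
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix == ".txt":
        return path.read_text(encoding="utf-8", errors="replace")

    if suffix == ".pdf":
        # PyMuPDF, imported lazily so TXT-only use needs no extra packages
        import fitz

        with fitz.open(str(path)) as doc:
            return "\n".join(page.get_text() for page in doc)

    if suffix == ".docx":
        # python-docx; only paragraph text is read here, tables are skipped
        from docx import Document

        return "\n".join(p.text for p in Document(str(path)).paragraphs)

    raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
```

Keeping the third-party imports inside their branches means a missing optional package only affects the format that needs it.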

AI-Powered Classification

The application uses the all-MiniLM-L6-v2 model from Sentence Transformers to:

Generate embeddings for input text
Compare against predefined topic embeddings
Calculate cosine similarity scores
Provide confidence percentages for topic classification
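
The comparison step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code in `TopicClassifier.py`: the helper names (`cosine_similarity`, `rank_topics`, `embed`) are hypothetical, the toy vectors stand in for real embeddings, and turning a similarity into a percentage by multiplying by 100 is an assumed convention.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank_topics(doc_embedding, topic_embeddings):
    """Return (topic, score in percent) pairs, best match first."""
    scores = {
        topic: cosine_similarity(doc_embedding, emb) * 100
        for topic, emb in topic_embeddings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def embed(texts):
    """Encode texts with all-MiniLM-L6-v2 (downloads the model on first use)."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(texts)


# Toy 3-dimensional vectors stand in for the model's 384-dimensional output.
topics = {"Finance": [1.0, 0.0, 0.0], "Healthcare": [0.0, 1.0, 0.0]}
ranking = rank_topics([0.9, 0.1, 0.0], topics)
print(ranking[0][0])  # Finance
```

With real data, `embed` would supply both the document vector and one vector per topic description; it is not called above so the example runs without downloading the model.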

Text Processing Pipeline

Document Parsing: Extract raw text from uploaded files
Text Preprocessing: Remove stop words and normalize text
Keyword Analysis: Match against topic-specific keyword sets
AI Classification: Use machine learning for intelligent topic detection
Results Generation: Calculate relevance scores and generate insights
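
The preprocessing and keyword steps above can be sketched like this. The scoring formula is an assumption for illustration (share of non-stop-word tokens that appear in the keyword set); the project's actual formula is not documented here. The function names and the short stop word set are likewise hypothetical, with the set standing in for spaCy's list.

```python
import re

# Tiny stand-in stop word list; the project lists spaCy for this step.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def preprocess(text):
    """Lowercase, split into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def keyword_relevance(text, keywords):
    """Return word count, keyword matches, and relevance as a percentage."""
    tokens = preprocess(text)
    keyword_set = {k.lower() for k in keywords}
    matches = [t for t in tokens if t in keyword_set]
    # Assumed formula: matched tokens as a share of all remaining tokens
    relevance = 100 * len(matches) / len(tokens) if tokens else 0.0
    return {
        "total_words": len(tokens),
        "keyword_matches": len(matches),
        "relevance_percent": round(relevance, 1),
    }


result = keyword_relevance(
    "The bank raised interest rates to curb inflation in the economy.",
    ["bank", "interest", "inflation", "loan"],
)
print(result)
```

This sketch matches single-word keywords only; multi-word phrases such as "machine learning" would need phrase matching on the token sequence.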

📊 Supported Topics

The application supports analysis across 21+ topic categories:

Technology: AI, Software, Cybersecurity
Sciences: Healthcare, Environment, Science
Business: Finance, Economy, Business
Society: Education, Politics, Law, Culture
Lifestyle: Sports, Travel, Food, Art
Others: Media, Religion, Agriculture, Energy, Security

🧪 Testing

Run the test suite using pytest:

```
pytest
```

Test configuration is available in pytest.ini.

📋 Requirements

Core Dependencies

streamlit>=1.47.0 - Web application framework
PyMuPDF - PDF processing
python-docx - Word document processing
sentence-transformers>=2.6.1 - AI text analysis
torch>=2.0.0 - Machine learning backend
spacy - Natural language processing

Development Dependencies

pytest - Testing framework
pytest-mock - Testing utilities

🤝 Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

Documentation: Visit the Help & Documentation page in the application
Issues: Report bugs or request features via GitHub Issues
Discussions: Join project discussions on GitHub

🔮 Future Enhancements

Support for additional file formats (RTF, ODT)
Batch processing capabilities
Export results to various formats
Custom topic creation
Advanced visualization features
REST API integration

⚡ Performance Notes

First-time loading may take longer due to AI model initialization
Large documents (>10MB) may require additional processing time
Recommended RAM: 4GB+ for optimal performance
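
The one-time model initialization cost is usually paid only once per process if the loaded model is cached. The sketch below shows the idea with the standard library's `lru_cache`; in a Streamlit app the `st.cache_resource` decorator serves a similar purpose. Whether this project caches its model this way is not stated here, and `get_model` with its placeholder return value is hypothetical.

```python
from functools import lru_cache

load_count = 0  # counts how often the expensive loader actually runs


@lru_cache(maxsize=1)
def get_model(name="all-MiniLM-L6-v2"):
    """Load the model once and reuse it on later calls."""
    global load_count
    load_count += 1
    # Placeholder for the real, slow call, e.g.
    #   from sentence_transformers import SentenceTransformer
    #   return SentenceTransformer(name)
    return {"model_name": name}


first = get_model()
second = get_model()
print(first is second, load_count)  # True 1
```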

Built with ❤️ using Python and Streamlit
