This repository was archived by the owner on Jul 21, 2025. It is now read-only.

Advanced document analysis platform that extracts text from PDF, DOCX, and TXT files with AI-powered topic classification using Sentence Transformers. Features keyword matching, real-time analysis, interactive Streamlit web interface, and multi-topic support.

NhanPhamThanh-IT/Scan-PDF-Paper

📄 Scan-PDF-Paper

A powerful document analysis tool that extracts text from various document formats (PDF, DOCX, TXT) and performs intelligent topic classification and keyword matching analysis.

🌟 Features

  • Multi-format Document Support: Extract text from PDF, DOCX, and TXT files
  • Intelligent Topic Classification: AI-powered topic classification using Sentence Transformers
  • Keyword Matching Analysis: Calculate topic relevance based on predefined keyword sets
  • Interactive Web Interface: User-friendly Streamlit-based web application
  • Real-time Analysis: Get instant results with visual progress indicators
  • Multiple Pages: A main analysis page, an advanced features page, and a help page

🚀 Quick Start

Prerequisites

  • Python 3.9 or higher (the minimum Streamlit version listed under Requirements, 1.47.0, no longer supports Python 3.8)
  • pip package manager

Installation

  1. Clone the repository

    git clone https://github.com/NhanPhamThanh-IT/Scan-PDF-Paper.git
    cd Scan-PDF-Paper
  2. Install dependencies

    pip install -r requirements.txt
  3. Run the application

    streamlit run app/main.py
  4. Open your browser and navigate to http://localhost:8501

📖 Usage

Main Analysis Page

  1. Select a Topic: Choose from predefined topics including:

    • AI & Technology
    • Healthcare
    • Finance
    • Environment
    • Cybersecurity
    • Software Development
    • And more...
  2. Upload a Document: The following file formats are supported:

    • PDF files
    • Microsoft Word documents (.docx)
    • Plain text files (.txt)
  3. Get Analysis Results: View detailed analysis including:

    • Total word count
    • Keyword matches found
    • Topic relevance percentage
    • Detailed breakdown of analysis

Advanced Features

Access the Advanced page for additional functionality and enhanced analysis options.

๐Ÿ—๏ธ Project Structure

Scan-PDF-Paper/
├── app/
│   ├── main.py                 # Main Streamlit application
│   ├── assets/
│   │   └── themes.css          # CSS styling
│   ├── core/
│   │   ├── AI/
│   │   │   └── TopicClassifier.py  # AI-powered topic classification
│   │   └── Utils/
│   │       ├── DataHandling.py     # Data processing utilities
│   │       ├── FileHandling.py     # File extraction utilities
│   │       └── TextHandling.py     # Text processing utilities
│   ├── dataset/
│   │   ├── metadata/
│   │   │   └── topics.json         # Available topics list
│   │   └── topics_keywords/        # Keyword datasets for each topic
│   │       ├── AI.json
│   │       ├── Healthcare.json
│   │       ├── Finance.json
│   │       └── ...
│   ├── pages/
│   │   ├── MainPage.py             # Main analysis interface
│   │   ├── AdvancesPage.py         # Advanced features
│   │   └── HelpsPage.py            # Help and documentation
│   ├── settings/
│   │   └── ThemeManager.py         # Theme management
│   └── ui/
│       ├── PageHeaderComponent.py  # Reusable header component
│       ├── ResultComponent.py      # Results display component
│       └── TabsComponent.py        # Navigation tabs component
├── requirements.txt                # Python dependencies
└── README.md                       # Project documentation
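
The layout above suggests that topic names are listed in dataset/metadata/topics.json and that each topic has a keyword file under dataset/topics_keywords/. The following is a minimal sketch of loading them. It assumes that topics.json is a JSON array of topic names and that each keyword file is a JSON array of strings; the real schema is not documented here, and load_topic_keywords is a hypothetical helper, not a function from the project.

```python
import json
from pathlib import Path


def load_topic_keywords(dataset_dir):
    """Return {topic: set of lowercase keywords} for every listed topic.

    Assumed schema (not confirmed by this README):
      metadata/topics.json          -> ["AI", "Healthcare", ...]
      topics_keywords/<topic>.json  -> ["keyword one", "keyword two", ...]
    """
    dataset_dir = Path(dataset_dir)
    topics_file = dataset_dir / "metadata" / "topics.json"
    topics = json.loads(topics_file.read_text(encoding="utf-8"))

    keywords = {}
    for topic in topics:
        keyword_file = dataset_dir / "topics_keywords" / f"{topic}.json"
        if not keyword_file.exists():
            # Skip topics that have no keyword file instead of failing.
            continue
        words = json.loads(keyword_file.read_text(encoding="utf-8"))
        keywords[topic] = {w.strip().lower() for w in words}
    return keywords
```

Under these assumptions, calling load_topic_keywords("app/dataset") from the repository root would return one keyword set per topic that has a matching file.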

🛠️ Technical Details

Core Technologies

  • Streamlit: Web framework for the user interface
  • PyMuPDF (fitz): PDF text extraction
  • python-docx: Microsoft Word document processing
  • Sentence Transformers: AI-powered text analysis
  • spaCy: Natural language processing and stop words removal
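
As an illustration of how the extraction libraries listed above fit together, here is a minimal sketch of a dispatcher that picks an extractor by file extension. The function name extract_text is hypothetical, and the project's FileHandling.py may work differently, for example by reading Streamlit's in-memory uploads instead of paths on disk. The PyMuPDF and python-docx imports are deferred so that plain-text handling works even when those packages are not installed.

```python
from pathlib import Path


def extract_text(path):
    """Return the plain text of a .pdf, .docx, or .txt file.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix == ".txt":
        # Replace undecodable bytes instead of raising on odd encodings.
        return path.read_text(encoding="utf-8", errors="replace")

    if suffix == ".pdf":
        import fitz  # PyMuPDF

        with fitz.open(str(path)) as doc:
            # get_text() returns the text of a single page.
            return "\n".join(page.get_text() for page in doc)

    if suffix == ".docx":
        import docx  # python-docx

        document = docx.Document(str(path))
        # Body paragraphs only; text inside tables is not included here.
        return "\n".join(p.text for p in document.paragraphs)

    raise ValueError(f"Unsupported file type: {suffix or path.name}")
```

Raising ValueError for unknown extensions keeps unsupported uploads from silently producing an empty analysis.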

AI-Powered Classification

The application uses the all-MiniLM-L6-v2 model from Sentence Transformers to:

  • Generate embeddings for input text
  • Compare against predefined topic embeddings
  • Calculate cosine similarity scores
  • Provide confidence percentages for topic classification
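
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's TopicClassifier.py: the helper names are hypothetical, and the mapping from cosine similarity to a "confidence percentage" (clamping to the range 0 to 1 and scaling to 0 to 100) is an assumption. Toy 3-dimensional vectors stand in for the 384-dimensional embeddings that all-MiniLM-L6-v2 produces, so the ranking logic runs without downloading the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def rank_topics(doc_embedding, topic_embeddings):
    """Return [(topic, percent)] sorted from most to least similar.

    Assumption: percent = cosine similarity clamped to [0, 1], times 100.
    """
    scores = []
    for topic, embedding in topic_embeddings.items():
        similarity = cosine_similarity(doc_embedding, embedding)
        percent = round(max(0.0, min(1.0, similarity)) * 100, 1)
        scores.append((topic, percent))
    return sorted(scores, key=lambda item: item[1], reverse=True)


def embed(texts):
    """Embed texts with all-MiniLM-L6-v2 (needs sentence-transformers)."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(texts)


# Toy vectors in place of real embeddings; embed() is not called here.
toy_topics = {"AI": [1.0, 0.0, 0.0], "Finance": [0.0, 1.0, 0.0]}
ranking = rank_topics([0.9, 0.1, 0.0], toy_topics)
```

With real data, the document text and one description per topic would be passed through embed(), and the resulting vectors handed to rank_topics().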

Text Processing Pipeline

  1. Document Parsing: Extract raw text from uploaded files
  2. Text Preprocessing: Remove stop words and normalize text
  3. Keyword Analysis: Match against topic-specific keyword sets
  4. AI Classification: Use machine learning for intelligent topic detection
  5. Results Generation: Calculate relevance scores and generate insights
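
Steps 2, 3, and 5 of this pipeline can be sketched with the standard library alone. The stop word list below is a tiny stand-in for spaCy's, the function names are hypothetical, and the relevance formula (matched tokens divided by the tokens left after stop word removal) is an assumption, since the README does not give the project's actual formula. The sketch matches single-word keywords only.

```python
import re

# Tiny stand-in stop word list; the project uses spaCy's list instead.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "with"}


def preprocess(text):
    """Lowercase, split into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def keyword_relevance(text, keywords):
    """Return word count, keyword matches, and relevance percentage.

    Assumption: relevance = matched tokens / remaining tokens * 100.
    """
    tokens = preprocess(text)
    keyword_set = {k.lower() for k in keywords}
    matches = [t for t in tokens if t in keyword_set]
    relevance = 100.0 * len(matches) / len(tokens) if tokens else 0.0
    return {
        "total_words": len(tokens),
        "keyword_matches": len(matches),
        "relevance_percent": round(relevance, 1),
    }


report = keyword_relevance(
    "The model is trained with a neural network.",
    ["model", "neural", "network"],
)
```

Here the word count is taken after stop word removal; counting before removal would be an equally plausible reading of "total word count".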

📊 Supported Topics

The application supports analysis across 21+ topic categories:

  • Technology: AI, Software, Cybersecurity
  • Sciences: Healthcare, Environment, Science
  • Business: Finance, Economy, Business
  • Society: Education, Politics, Law, Culture
  • Lifestyle: Sports, Travel, Food, Art
  • Others: Media, Religion, Agriculture, Energy, Security

🧪 Testing

Run the test suite using pytest:

    pytest

Test configuration is available in pytest.ini.

📋 Requirements

Core Dependencies

  • streamlit>=1.47.0 - Web application framework
  • PyMuPDF - PDF processing
  • python-docx - Word document processing
  • sentence-transformers>=2.6.1 - AI text analysis
  • torch>=2.0.0 - Machine learning backend
  • spacy - Natural language processing

Development Dependencies

  • pytest - Testing framework
  • pytest-mock - Testing utilities

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

  • Documentation: Visit the Help & Documentation page in the application
  • Issues: Report bugs or request features via GitHub Issues
  • Discussions: Join project discussions on GitHub

🔮 Future Enhancements

  • Support for additional file formats (RTF, ODT)
  • Batch processing capabilities
  • Export results to various formats
  • Custom topic creation
  • Advanced visualization features
  • REST API integration

⚡ Performance Notes

  • First-time loading may take longer due to AI model initialization
  • Large documents (>10 MB) may require additional processing time
  • Recommended RAM: 4 GB+ for optimal performance
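
The first-load cost mentioned above is usually paid once and then avoided by caching the loaded model. The following sketch shows the idea with functools.lru_cache and a placeholder loader; slow_load and get_model are hypothetical names, and the placeholder returns a dictionary where a real loader would return the SentenceTransformer object. In a Streamlit app, the st.cache_resource decorator serves the same purpose across script reruns.

```python
import functools

load_calls = []  # records each real (uncached) load, for illustration


def slow_load(name):
    """Placeholder for an expensive load such as SentenceTransformer(name)."""
    load_calls.append(name)
    return {"model_name": name}


@functools.lru_cache(maxsize=None)
def get_model(name="all-MiniLM-L6-v2"):
    """Return the model for `name`, loading it only on the first call."""
    return slow_load(name)


first = get_model()
second = get_model()  # served from the cache; slow_load is not called again
```

Both calls return the same object, so only the first request in a process waits for initialization.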

Built with ❤️ using Python and Streamlit