A robust, modular, and high-performance Python system for downloading, extracting, analyzing, and searching text from PDF documents.
## Features

- Robust Ingestion: Strict validation of PDF signatures, size limits, and encryption detection (a minimal signature-check sketch follows this list).
- Modular Architecture: Clean separation of concerns (Processing, Analysis, Models, Caching, Search).
- Advanced Analysis:
  - Language detection (`langdetect`).
  - Keyword extraction and readability scoring (`textstat`-equivalent logic).
  - Stopword removal using NLTK.
- Performance:
  - Asynchronous I/O (`aiohttp`) for downloads.
  - Multiprocessing for text extraction (`fitz` / PyMuPDF).
  - In-memory caching for repeated requests.
- Scalability:
  - Batch Processing: Concurrent processing of multiple PDFs.
  - Search Engine: TF-IDF based indexing and searching of processed documents.
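For context on the ingestion checks above, here is a minimal sketch of what PDF signature and size validation typically involves. It is illustrative only: the project's actual logic lives in `validators.py`, and the limit shown is an assumed placeholder for the real `MAX_PDF_SIZE` in `config.py`.

```python
MAX_PDF_SIZE = 50 * 1024 * 1024  # assumed example; the real limit lives in config.py

def validate_pdf_bytes(data: bytes) -> None:
    # Genuine PDF files begin with the "%PDF-" magic bytes.
    if data[:5] != b"%PDF-":
        raise ValueError("Invalid file: missing PDF signature")
    # Reject oversized downloads before any further processing.
    if len(data) > MAX_PDF_SIZE:
        raise ValueError("PDF exceeds the configured size limit")
```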
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/cortega26/PDF-Text-Analizer.git
  cd PDF-Text-Analizer
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

  For development and testing:

  ```bash
  pip install -r requirements-dev.txt
  ```
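Stopword removal relies on NLTK's stopword corpus. If that corpus is not already present in your environment (an assumption; skip this step if it is), it can be fetched once with NLTK's standard data installer:

```python
# One-time download of the NLTK stopword corpus (no-op if already installed).
import nltk
nltk.download("stopwords")
```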
## Usage

### Single PDF

```python
import asyncio

from pdf_processor import PdfProcessor

async def main():
    processor = PdfProcessor()
    url = "https://example.com/document.pdf"

    # Process a single PDF
    results = await processor.process_url(url, "search phrase")

    print(f"Status: {results['metadata']['extraction_status']}")
    print(f"Language: {results['analysis']['language']}")
    print(f"Word Count: {results['analysis']['word_count']}")

if __name__ == "__main__":
    asyncio.run(main())
```

### Batch Processing

```python
import asyncio

from batch import PdfBatch
from pdf_processor import PdfProcessor

async def process_batch():
    processor = PdfProcessor()
    batch_processor = PdfBatch(processor)

    urls = [
        "https://example.com/doc1.pdf",
        "https://example.com/doc2.pdf",
    ]

    results = await batch_processor.process_urls(urls, "keyword")
    print(f"Processed {results['summary']['total_processed']} files.")

if __name__ == "__main__":
    asyncio.run(process_batch())
```
### Batch Processing (Streaming)
For memory-efficient processing of huge batches, use the new `process_stream` API:
```python
from batch import PdfBatch

async def process_many(processor, urls):
    batch = PdfBatch(processor)
    async for url, result, error in batch.process_stream(urls, "search term"):
        if error:
            print(f"Failed {url}: {error}")
        else:
            print(f"Success {url}: Found {result['analysis']['search_term_count']} matches")
            # Save result to DB immediately...
```
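Because each `(url, result, error)` tuple is yielded as soon as that document finishes, a failure on one URL does not abort the rest of the batch, and results can be persisted immediately instead of accumulating in memory.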
### Search Engine
```python
from search import PdfSearchEngine

# Add processed results to the index
engine = PdfSearchEngine()
engine.add_document(url="...", analysis_result=..., metadata=...)

# Search
matches = engine.search("important concept")
for match in matches:
    print(f"Found in {match['url']} (Score: {match['relevance_score']})")
```
## Architecture

The project has been refactored into single-responsibility modules:

- `pdf_processor.py`: Main facade/coordinator.
- `models.py`: Data classes (`PdfMetadata`, `ProcessingStatistics`, `ExtractionStatus`).
- `validators.py`: Security and file validation logic.
- `text_analysis.py`: NLP and content analysis logic.
- `cache.py`: Caching protocols and implementations.
- `search.py`: Vector-based search engine functionality.
- `batch.py`: Orchestration for multiple files.
- `config.py`: Centralized configuration.
- `exceptions.py`: Custom error hierarchy.
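As a loose illustration of the "caching protocols and implementations" split in `cache.py` (the class and method names here are assumptions, not the project's actual API), the pattern typically looks like:

```python
from typing import Optional, Protocol

class Cache(Protocol):
    """Interface any cache backend must satisfy (hypothetical names)."""
    def get(self, key: str) -> Optional[bytes]: ...
    def set(self, key: str, value: bytes) -> None: ...

class InMemoryCache:
    """Dict-backed implementation, enough to serve repeated requests in one run."""
    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._store.get(key)

    def set(self, key: str, value: bytes) -> None:
        self._store[key] = value
```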
## Error Handling

The system now distinguishes between different failure modes (a handling sketch follows this list):

- Encrypted PDFs: Raises `EncryptedPdfError` immediately.
- Invalid Files: Rejects non-PDFs (even with a `.pdf` extension) via `InvalidFileError`.
- Scanned/Empty: Returns `ExtractionStatus.SCANNED_OCR_REQUIRED` rather than failing silently.
- Size Limits: Enforced via `MAX_PDF_SIZE` in `config.py`.
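Here is a minimal sketch of how calling code might branch on these failure modes. The exception and status names come from the list above; the surrounding `try`/`except` shape and dictionary access are assumptions based on the Usage examples.

```python
import asyncio

from exceptions import EncryptedPdfError, InvalidFileError
from models import ExtractionStatus
from pdf_processor import PdfProcessor

async def safe_process(url: str):
    processor = PdfProcessor()
    try:
        result = await processor.process_url(url, "query")
    except EncryptedPdfError:
        print(f"{url}: encrypted PDF, skipping")
        return None
    except InvalidFileError:
        print(f"{url}: rejected, not a valid PDF")
        return None
    # Assumes the status is stored as the enum member; adjust if serialized as a string.
    if result["metadata"]["extraction_status"] == ExtractionStatus.SCANNED_OCR_REQUIRED:
        print(f"{url}: scanned document, OCR required")
    return result
```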
## Testing

A full regression suite is available using pytest:

```bash
# Run all tests
python -m pytest

# Run with coverage report
python -m pytest --cov=.
```

## License

This project is licensed under the MIT License. See the LICENSE file for details.