Over time, I've collected a large number of lecture notes, technical documents, and study files. Existing note tools like Notion help with organization, but they still make it hard to quickly locate specific knowledge, especially when notes are long and cover multiple topics.
This project aims to build a personal note assistant powered by Retrieval-Augmented Generation (RAG) to solve this problem. By semantically understanding your documents and queries, it provides more relevant answers than traditional keyword search.
- PDF Processing: Extract text from PDF documents with support for multilingual content
- Smart Text Chunking: Break documents into semantic chunks with configurable size and overlap
- Vector Storage: Store document embeddings in Pinecone for efficient similarity search
- Multilingual Support: Process and understand content in multiple languages
- Interactive UI: Simple Streamlit interface for uploading documents and asking questions
- Fast Retrieval: Quickly find relevant information across all your documents
- Answer Evaluation: LLM-based evaluation component that judges the relevance, completeness, and accuracy of generated answers
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file in the project root with your API keys:

  ```
  PINECONE_API_KEY=your_pinecone_api_key
  PINECONE_INDEX_NAME=your_index_name
  DEEPSEEK_API_KEY=your_deepseek_api_key
  ```
The link is AskmyNote_Link
- Start the application:

  ```bash
  streamlit run app.py
  ```

  Then open your browser at http://localhost:8501 to access the interface.
- Ask Questions: Enter your question in the text box and click "Submit"
- View Results: The system will display the AI's answer along with reference chunks from your documents
- Upload Documents: Use the file uploader to add more PDF files to the knowledge base
- `app.py` - Streamlit application entry point
- `parse.py` - PDF text extraction utilities
- `chunk.py` - Text chunking algorithms
- `embedding.py` - Text embedding functions
- `vector.py` - Vector database interface
- `pipeline.py` - End-to-end pipeline (extraction, chunking, embedding)
- `llm.py` - LLM interface for answering questions
- `judge.py` - Answer evaluation script
- `evaluation.py` - Runner for the judge process
Extracts and cleans text from PDF documents:
- Handles multi-page documents
- Cleans and normalizes text
- Preserves paragraph structure
- Supports multilingual content including Chinese
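A rough sketch of this extraction step, assuming `pypdf` (the project's `parse.py` may use a different library and function names):

```python
import re

def clean_text(raw: str) -> str:
    """Collapse runs of spaces/tabs but keep paragraph breaks."""
    raw = raw.replace("\r\n", "\n")
    raw = re.sub(r"[ \t]+", " ", raw)     # normalize inline whitespace
    raw = re.sub(r"\n{3,}", "\n\n", raw)  # cap blank runs at one empty line
    return raw.strip()

def extract_pdf_text(path: str) -> str:
    """Extract text page by page, then clean it."""
    from pypdf import PdfReader  # assumption: any PDF library with per-page text works
    reader = PdfReader(path)
    pages = [page.extract_text() or "" for page in reader.pages]
    return clean_text("\n\n".join(pages))
```

Because `pypdf` extracts Unicode text, Chinese and other non-Latin content passes through unchanged; only whitespace is normalized.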
Splits documents into semantic chunks:
- Paragraph-aware chunking where possible
- Configurable chunk size and overlap
- Support for recursive chunking based on document structure
- Special handling for Chinese text
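The strategy above could be sketched as follows — a simplified, character-based version; `chunk.py`'s actual recursive and Chinese-aware logic is more involved:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Greedy paragraph-aware chunking: pack whole paragraphs up to
    chunk_size characters; fall back to a sliding window with `overlap`
    characters of shared context for oversized paragraphs."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if len(para) > chunk_size:
            # oversized paragraph: split with overlapping windows
            if current:
                chunks.append(current)
                current = ""
            step = chunk_size - overlap
            for i in range(0, len(para), step):
                chunks.append(para[i:i + chunk_size])
        elif len(current) + len(para) + 2 <= chunk_size:
            current = f"{current}\n\n{para}" if current else para
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

The overlap ensures that a sentence cut at a window boundary still appears whole in one of the two neighboring chunks.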
Interface to Pinecone vector database:
- Creates and manages Pinecone indexes
- Stores document chunks with metadata
- Performs similarity search with configurable parameters
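A minimal sketch of the storage and query path, assuming the Pinecone v3 Python client; the record IDs and metadata schema here are illustrative, not necessarily what `vector.py` uses:

```python
def build_records(chunks, embeddings, source):
    """Pair each chunk with its embedding as an upsert record.
    ID format and metadata keys are assumptions for this sketch."""
    return [
        {
            "id": f"{source}-{i}",
            "values": list(map(float, vec)),
            "metadata": {"text": text, "source": source},
        }
        for i, (text, vec) in enumerate(zip(chunks, embeddings))
    ]

def upsert_and_query(records, query_vec, index_name, api_key, top_k=5):
    """Hypothetical wiring for the Pinecone client."""
    from pinecone import Pinecone
    index = Pinecone(api_key=api_key).Index(index_name)
    index.upsert(vectors=records)
    return index.query(vector=query_vec, top_k=top_k, include_metadata=True)
```

Storing the chunk text in metadata lets a query return the passages themselves, not just their IDs.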
- Text Extraction: PDF documents are processed to extract clean, normalized text
- Chunking: Text is divided into semantic chunks with context preservation
- Embedding: Chunks are converted to vector embeddings using SentenceTransformer
- Storage: Embeddings and metadata are stored in Pinecone
- Retrieval: User queries are converted to embeddings and matched against stored vectors
- Answer Generation: Retrieved context is sent to LLM to generate relevant answers
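The retrieval and answer-generation steps might look like this sketch, assuming a SentenceTransformer model, a Pinecone index handle, and an OpenAI-compatible DeepSeek client; `build_prompt` and `answer` are illustrative names, not the project's API:

```python
def build_prompt(question, contexts):
    """Assemble retrieved chunks into a grounded prompt (format is illustrative)."""
    joined = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"
    )

def answer(question, index, model, client, top_k=5):
    """Sketch of the query path: embed, retrieve, generate."""
    qvec = model.encode(question).tolist()  # SentenceTransformer embedding
    hits = index.query(vector=qvec, top_k=top_k, include_metadata=True)
    contexts = [m["metadata"]["text"] for m in hits["matches"]]
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": build_prompt(question, contexts)}],
    )
    return resp.choices[0].message.content
```

Numbering the context chunks in the prompt makes it easy for the UI to display which reference chunks the answer drew on.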
- Accuracy (with LLM Judge): we implemented an optional evaluation step that uses DeepSeek-LLM to grade answers generated by the pipeline.
Example grading result for the question "What is F1 Score?":
Relevance: High - Directly answered the question with relevant context.
Completeness: High - Covered definition, formula, precision/recall, and use cases.
Factual Accuracy: High - No factual errors detected.
Suggested Improvement: Mention the Precision-Recall Curve for extra context.
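The judge step could be sketched like this, assuming an OpenAI-compatible DeepSeek client; the prompt wording and JSON keys are illustrative, and `judge.py` may format both differently:

```python
import json

JUDGE_PROMPT = """You are grading an answer produced by a RAG pipeline.
Rate each criterion as High, Medium, or Low and return JSON with keys
relevance, completeness, factual_accuracy, suggested_improvement.

Question: {question}
Answer: {answer}
"""

def judge(question, answer_text, client):
    """Hypothetical wiring: ask the LLM to grade an answer and parse its JSON."""
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer_text)}],
        response_format={"type": "json_object"},  # JSON mode, if the API supports it
    )
    return json.loads(resp.choices[0].message.content)
```

Asking for structured JSON rather than free text makes the grades easy to aggregate across a whole evaluation run.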