Over time, I've collected a large number of lecture notes, technical documents, and study files. Existing note tools like Notion help with organization, but they still make it hard to quickly locate specific knowledge, especially when notes are long and cover multiple topics.
This project aims to build a personal note assistant powered by Retrieval-Augmented Generation (RAG) to solve this problem. By semantically understanding your documents and queries, it provides more relevant answers than traditional keyword search.
- PDF Processing: Extract text from PDF documents with support for multilingual content
- Smart Text Chunking: Break documents into semantic chunks with configurable size and overlap
- Vector Storage: Store document embeddings in Pinecone for efficient similarity search
- Multilingual Support: Process and understand content in multiple languages
- Interactive UI: Simple Streamlit interface for uploading documents and asking questions
- Fast Retrieval: Quickly find relevant information across all your documents
- Answer Evaluation: LLM-based evaluation component that judges the relevance, completeness, and accuracy of generated answers
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file in the project root with your API keys:

  ```
  PINECONE_API_KEY=your_pinecone_api_key
  PINECONE_INDEX_NAME=your_index_name
  DEEPSEEK_API_KEY=your_deepseek_api_key
  ```
The link is AskmyNote_Link
- Start the application:

  ```bash
  streamlit run app.py
  ```

  Then open your browser at http://localhost:8501 to access the interface.
- Ask Questions: Enter your question in the text box and click "Submit"
- View Results: The system will display the AI's answer along with reference chunks from your documents
- Upload Documents: Use the file uploader to add more PDF files to the knowledge base
- `app.py` - Streamlit application entry point
- `parse.py` - PDF text extraction utilities
- `chunk.py` - Text chunking algorithms
- `embedding.py` - Text embedding functions
- `vector.py` - Vector database interface
- `pipeline.py` - End-to-end pipeline (extraction, chunking, embedding)
- `llm.py` - LLM interface for answering questions
- `judge.py` - Answer evaluation script
- `evaluation.py` - Runner for the judge process
Extracts and cleans text from PDF documents:
- Handles multi-page documents
- Cleans and normalizes text
- Preserves paragraph structure
- Supports multilingual content including Chinese
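A rough sketch of this extraction step, assuming `pypdf` (the project's `parse.py` may use a different library and function names):

```python
import re

def clean_text(raw: str) -> str:
    """Collapse runs of spaces/tabs but keep paragraph breaks."""
    raw = raw.replace("\r\n", "\n")
    raw = re.sub(r"[ \t]+", " ", raw)     # normalize inline whitespace
    raw = re.sub(r"\n{3,}", "\n\n", raw)  # cap blank runs at one empty line
    return raw.strip()

def extract_pdf_text(path: str) -> str:
    """Extract text page by page, then clean it."""
    from pypdf import PdfReader  # assumption: any PDF library with per-page text works
    reader = PdfReader(path)
    pages = [page.extract_text() or "" for page in reader.pages]
    return clean_text("\n\n".join(pages))
```

Because `pypdf` extracts Unicode text, Chinese and other non-Latin content passes through unchanged; only whitespace is normalized.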
Splits documents into semantic chunks:
- Paragraph-aware chunking where possible
- Configurable chunk size and overlap
- Support for recursive chunking based on document structure
- Special handling for Chinese text
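The strategy above could be sketched as follows — a simplified, character-based version; `chunk.py`'s actual recursive and Chinese-aware logic is more involved:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Greedy paragraph-aware chunking: pack whole paragraphs up to
    chunk_size characters; fall back to a sliding window with `overlap`
    characters of shared context for oversized paragraphs."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if len(para) > chunk_size:
            # oversized paragraph: split with overlapping windows
            if current:
                chunks.append(current)
                current = ""
            step = chunk_size - overlap
            for i in range(0, len(para), step):
                chunks.append(para[i:i + chunk_size])
        elif len(current) + len(para) + 2 <= chunk_size:
            current = f"{current}\n\n{para}" if current else para
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

The overlap ensures that a sentence cut at a window boundary still appears whole in one of the two neighboring chunks.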
Interface to Pinecone vector database:
- Creates and manages Pinecone indexes
- Stores document chunks with metadata
- Performs similarity search with configurable parameters
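A minimal sketch of the storage and query path, assuming the Pinecone v3 Python client; the record IDs and metadata schema here are illustrative, not necessarily what `vector.py` uses:

```python
def build_records(chunks, embeddings, source):
    """Pair each chunk with its embedding as an upsert record.
    ID format and metadata keys are assumptions for this sketch."""
    return [
        {
            "id": f"{source}-{i}",
            "values": list(map(float, vec)),
            "metadata": {"text": text, "source": source},
        }
        for i, (text, vec) in enumerate(zip(chunks, embeddings))
    ]

def upsert_and_query(records, query_vec, index_name, api_key, top_k=5):
    """Hypothetical wiring for the Pinecone client."""
    from pinecone import Pinecone
    index = Pinecone(api_key=api_key).Index(index_name)
    index.upsert(vectors=records)
    return index.query(vector=query_vec, top_k=top_k, include_metadata=True)
```

Storing the chunk text in metadata lets a query return the passages themselves, not just their IDs.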
- Text Extraction: PDF documents are processed to extract clean, normalized text
- Chunking: Text is divided into semantic chunks with context preservation
- Embedding: Chunks are converted to vector embeddings using SentenceTransformer
- Storage: Embeddings and metadata are stored in Pinecone
- Retrieval: User queries are converted to embeddings and matched against stored vectors
- Answer Generation: Retrieved context is sent to LLM to generate relevant answers
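The retrieval and answer-generation steps might look like this sketch, assuming a SentenceTransformer model, a Pinecone index handle, and an OpenAI-compatible DeepSeek client; `build_prompt` and `answer` are illustrative names, not the project's API:

```python
def build_prompt(question, contexts):
    """Assemble retrieved chunks into a grounded prompt (format is illustrative)."""
    joined = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"
    )

def answer(question, index, model, client, top_k=5):
    """Sketch of the query path: embed, retrieve, generate."""
    qvec = model.encode(question).tolist()  # SentenceTransformer embedding
    hits = index.query(vector=qvec, top_k=top_k, include_metadata=True)
    contexts = [m["metadata"]["text"] for m in hits["matches"]]
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": build_prompt(question, contexts)}],
    )
    return resp.choices[0].message.content
```

Numbering the context chunks in the prompt makes it easy for the UI to display which reference chunks the answer drew on.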
- Accuracy (with LLM Judge): we implemented an optional evaluation step that uses DeepSeek-LLM to grade answers generated by the pipeline.
Example grading result for the question "What is F1 Score?":
Relevance: High - Directly answered the question with relevant context.
Completeness: High - Covered definition, formula, precision/recall, and use cases.
Factual Accuracy: High - No factual errors detected.
Suggested Improvement: Mention the Precision-Recall Curve for extra context.
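The judge step could be sketched like this, assuming an OpenAI-compatible DeepSeek client; the prompt wording and JSON keys are illustrative, and `judge.py` may format both differently:

```python
import json

JUDGE_PROMPT = """You are grading an answer produced by a RAG pipeline.
Rate each criterion as High, Medium, or Low and return JSON with keys
relevance, completeness, factual_accuracy, suggested_improvement.

Question: {question}
Answer: {answer}
"""

def judge(question, answer_text, client):
    """Hypothetical wiring: ask the LLM to grade an answer and parse its JSON."""
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer_text)}],
        response_format={"type": "json_object"},  # JSON mode, if the API supports it
    )
    return json.loads(resp.choices[0].message.content)
```

Asking for structured JSON rather than free text makes the grades easy to aggregate across a whole evaluation run.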