This project is designed to crawl, process, and query oncology-related articles from the web. It includes components for web scraping, data storage, summarization, keyword extraction, and vector-based search.
- Clone the repository:
git clone https://github.com/suryansh2207/Carer-project.git
cd Carer-project
To run the entire pipeline, execute the run_all.sh script:
./run_all.sh
The script will:
- Start the necessary services.
- Set up the vector store.
- Process and store articles.
- Run a query interface for searching articles.
Crawler
- The crawler.py script crawls oncology-related articles from the web and stores them in the MySQL database.
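
As a rough illustration of the parse-and-store step, here is a minimal sketch that uses only the Python standard library. It is not the code in crawler.py: the names `ArticleParser`, `parse_article`, and `store_article` and the `articles` table layout are hypothetical, and `sqlite3` stands in for MySQL so the example runs without a database server.

```python
import sqlite3
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current = None  # tag currently being captured, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            elif text:
                self.paragraphs.append(text)
            self._current = None


def parse_article(html):
    """Return (title, body) extracted from one HTML page."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return parser.title, "\n".join(parser.paragraphs)


def store_article(conn, url, title, body):
    """Store one article; sqlite3 is a stand-in for MySQL (assumption)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    # INSERT OR REPLACE keeps a single row per URL when a page is re-crawled.
    conn.execute(
        "INSERT OR REPLACE INTO articles (url, title, body) VALUES (?, ?, ?)",
        (url, title, body),
    )
    conn.commit()
```

Fetching the pages themselves (HTTP requests, link discovery, rate limiting) is left out here, since the README does not say how crawler.py does it.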
Summarizer
- The summarizer.py script processes articles to generate summaries and extract keywords using pre-trained models.
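
To give a feel for what summarization and keyword extraction produce, the sketch below uses a simple word-frequency heuristic. This is a deliberately swapped-in technique: summarizer.py is described as using pre-trained models, which are not reproduced here. The function names and the small stopword list are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption, not the project's list).
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "on", "for",
    "is", "are", "was", "were", "with", "that", "this", "it", "as", "by",
}


def _tokens(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stopword terms longer than 2 letters."""
    counts = Counter(w for w in _tokens(text) if len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


def summarize(text, max_sentences=2):
    """Extractive summary: keep the sentences whose words are most frequent."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(_tokens(text))
    # Rank sentence indices by the summed frequency of their words.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: -sum(freq[w] for w in _tokens(sentences[i])),
    )
    # Restore document order for the sentences that were kept.
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

A frequency heuristic like this is much weaker than a pre-trained model, but it has the same inputs and outputs: article text in, a short summary and a keyword list out.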
Query
- The query.py script provides functionality to search articles based on query text, including vector-based similarity search.
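
Vector-based similarity search can be sketched as ranking stored article vectors by cosine similarity to a query vector. The example below is a pure-Python illustration under that assumption. It does not call Milvus or reflect the actual query.py interface, and the names `cosine_similarity` and `search` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def search(query_vec, index, top_k=3):
    """Return up to top_k (article_id, score) pairs, best match first.

    `index` maps article ids to their vectors (hypothetical in-memory layout).
    """
    scored = [
        (cosine_similarity(query_vec, vec), article_id)
        for article_id, vec in index.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(article_id, score) for score, article_id in scored[:top_k]]


if __name__ == "__main__":
    toy_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    for article_id, score in search([1.0, 0.0], toy_index, top_k=2):
        print(article_id, round(score, 3))
```

This brute-force scan is linear in the number of articles. A dedicated vector database is normally used to avoid that cost on large collections.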
Vector Store
- The vector_store.py script initializes the vector store and processes articles for vector-based search.
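
Processing articles for vector search means turning each article's text into a fixed-length vector. The sketch below shows the idea with a hashing-trick bag-of-words embedding. This is a stand-in chosen so the example runs without any model. The README does not say which embedding vector_store.py uses, and `embed` and `build_index` are hypothetical names.

```python
import hashlib
import math
import re


def embed(text, dim=64):
    """Map text to a unit-length vector by hashing each token into a bucket."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # hashlib gives a stable hash across runs, unlike the built-in hash().
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # Normalize so that a dot product equals cosine similarity.
    return [v / norm for v in vec] if norm else vec


def build_index(articles, dim=64):
    """Embed every article; `articles` maps article ids to their text."""
    return {article_id: embed(text, dim) for article_id, text in articles.items()}
```

In the real project the resulting vectors would be written to the vector store rather than kept in a Python dictionary.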
The config.in file contains configuration settings for the project, including database connection details and Milvus settings.
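
The layout of the configuration file is not shown here, so the following is only a sketch that assumes an INI-style file readable by Python's `configparser`. The section names (`mysql`, `milvus`), the key names, and all sample values are hypothetical placeholders, not the project's real settings.

```python
import configparser

# Hypothetical example of what the settings might look like (assumption).
SAMPLE_CONFIG = """
[mysql]
host = localhost
port = 3306
user = carer
password = change-me
database = oncology

[milvus]
host = localhost
port = 19530
collection = articles
"""


def load_settings(text):
    """Parse INI-style text into plain dictionaries with typed ports."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "mysql": {
            "host": parser.get("mysql", "host"),
            "port": parser.getint("mysql", "port"),
            "user": parser.get("mysql", "user"),
            "password": parser.get("mysql", "password"),
            "database": parser.get("mysql", "database"),
        },
        "milvus": {
            "host": parser.get("milvus", "host"),
            "port": parser.getint("milvus", "port"),
            "collection": parser.get("milvus", "collection"),
        },
    }


if __name__ == "__main__":
    settings = load_settings(SAMPLE_CONFIG)
    print(settings["mysql"]["host"], settings["milvus"]["port"])
```

To read from disk instead of a string, `parser.read(path)` can replace `parser.read_string(text)`.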