This project introduces a geography-aware news aggregation system that classifies news articles by geographic regions and generates region-specific summaries. It integrates Named Entity Recognition (NER), machine learning classifiers, and both extractive and abstractive summarization techniques to enhance the user experience of personalized news delivery.
Develop a system that:
- Classifies news articles by geographic regions using NER and machine learning.
- Generates concise summaries tailored to each region using advanced summarization techniques.
- Data Collection:
  - Dataset: CNN/DailyMail Dataset
  - Contains over 300,000 news articles with corresponding highlights for summarization.
- Data Preprocessing:
  - Tokenized and cleaned text (e.g., removed stopwords, URLs, and punctuation).
  - Utilized `spaCy` for tokenization and stopword filtering.
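
The cleaning step can be sketched without any dependencies. This is only a stand-in for the spaCy pipeline the project used: the regular expressions, the tiny `STOPWORDS` set, and the `preprocess` function name are assumptions made for illustration, and spaCy's tokenizer and stopword list are considerably more thorough.

```python
import re

# Tiny illustrative stopword set (assumption); spaCy's English list is much larger.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "for", "at", "more"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def preprocess(text: str) -> list[str]:
    """Lowercase the text, strip URLs and punctuation, and drop stopwords."""
    text = URL_RE.sub(" ", text.lower())   # remove URLs before tokenizing
    tokens = TOKEN_RE.findall(text)        # keeps word characters, drops punctuation
    return [t for t in tokens if t not in STOPWORDS]


print(preprocess("The summit in Paris opened today. Read more at https://example.com/news!"))
# ['summit', 'paris', 'opened', 'today', 'read']
```
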
- Named Entity Recognition (NER):
  - Extracted geopolitical entities (GPE) using `spaCy`.
  - Mapped entities to predefined regions (e.g., North America, Europe, Asia).
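
One way the entity-to-region mapping might look is a lookup table plus a majority vote over the GPE mentions in an article. The table contents, the `assign_region` name, the vote rule, and the `"Unknown"` fallback are all assumptions for this sketch; the project's actual mapping is not described in more detail than the bullets above.

```python
from collections import Counter

# Hypothetical lookup table; a real one would cover far more place names.
PLACE_TO_REGION = {
    "united states": "North America",
    "canada": "North America",
    "mexico": "North America",
    "france": "Europe",
    "germany": "Europe",
    "london": "Europe",
    "china": "Asia",
    "japan": "Asia",
    "india": "Asia",
}


def assign_region(gpe_mentions: list[str], default: str = "Unknown") -> str:
    """Return the region that the most GPE mentions map to (assumed vote rule)."""
    votes = Counter(
        PLACE_TO_REGION[m.strip().lower()]
        for m in gpe_mentions
        if m.strip().lower() in PLACE_TO_REGION
    )
    if not votes:
        return default  # no mention matched the lookup table
    return votes.most_common(1)[0][0]


# With spaCy, the mentions would typically be collected like this:
#   gpe_mentions = [ent.text for ent in nlp(text).ents if ent.label_ == "GPE"]
print(assign_region(["France", "Germany", "Japan"]))  # Europe
```
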
- Classification:
  - Vectorized text using TF-IDF.
  - Applied machine learning classifiers:
    - Logistic Regression (Baseline)
    - Support Vector Machines (SVM)
    - XGBoost (Best Performance with 83% Accuracy)
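
To make the TF-IDF step concrete, the sketch below computes the weights by hand with raw term counts, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which matches the defaults of scikit-learn's `TfidfVectorizer`. The `tfidf_vectors` helper and the toy documents are illustrative assumptions. In the project itself the vectors would come from the library and then be passed to the classifiers listed above.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """Compute L2-normalized TF-IDF weights for pre-tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF, as if one extra document contained every term.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = [
    ["election", "paris", "vote"],
    ["election", "tokyo", "market"],
]
for vec in tfidf_vectors(docs):
    print({t: round(w, 3) for t, w in vec.items()})
```

Terms that occur in only one document ("paris", "tokyo") receive a higher weight than the shared term "election", which is what lets a classifier pick up on region-specific vocabulary.
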
- Summarization:
  - Extractive: TextRank, LexRank, BERTSum.
  - Abstractive: BART, T5, PEGASUS.
  - Evaluation metrics: ROUGE, BLEU.
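
As a rough illustration of what ROUGE measures, the sketch below computes ROUGE-1 (unigram overlap) precision, recall, and F1 between a candidate summary and a reference. It assumes plain whitespace tokenization and lowercasing with no stemming, so its numbers will differ somewhat from those of standard ROUGE tooling; the `rouge1` function name is hypothetical.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict[str, float]:
    """ROUGE-1 precision, recall, and F1 using clipped unigram counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat", "the cat sat on the mat")
print({k: round(v, 3) for k, v in scores.items()})
# {'precision': 1.0, 'recall': 0.5, 'f1': 0.667}
```
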
- Integration:
  - Combined classification outputs with summarization tasks to ensure region-specific summaries.
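
A minimal sketch of how the two stages could be joined, assuming each article already carries a predicted region: group the texts by region, then summarize each group. The `summarize_by_region` function and the `lead_sentence` placeholder summarizer are hypothetical; in the project, the summarizer slot would be filled by one of the extractive or abstractive models listed above.

```python
from collections import defaultdict
from typing import Callable, Iterable


def lead_sentence(text: str) -> str:
    """Placeholder summarizer (assumption): return the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."


def summarize_by_region(
    articles: Iterable[tuple[str, str]],
    summarize: Callable[[str], str] = lead_sentence,
) -> dict[str, str]:
    """Group (region, text) pairs by region and summarize each group."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for region, text in articles:
        grouped[region].append(text)
    return {region: summarize(" ".join(texts)) for region, texts in grouped.items()}


articles = [
    ("Europe", "Paris hosts a climate summit. Leaders arrive Monday."),
    ("Asia", "Tokyo markets rallied. Tech stocks led gains."),
    ("Europe", "Berlin approves a rail plan. Work starts in May."),
]
print(summarize_by_region(articles))
# {'Europe': 'Paris hosts a climate summit.', 'Asia': 'Tokyo markets rallied.'}
```

Passing the summarizer as a callable keeps the grouping logic independent of the model, so swapping the placeholder for a real summarization model does not require changing the rest of the function.
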
- Classification:
  - Best accuracy achieved: 83% (XGBoost).
- Summarization:
  - Best extractive model: BERTSum.
  - Best abstractive model: BART.
- Programming Language: Python
- Libraries: `spaCy`, `scikit-learn`, `XGBoost`, `Hugging Face Transformers`
- Platform: Google Colab
- Dataset: CNN/DailyMail
- NER Tool: spaCy
- Summarization Models: Hugging Face Transformers