Real-Time News Data Pipeline

A real-time data engineering project that fetches live news headlines, streams them through Apache Kafka and Apache Spark, labels each headline with a sentiment, and stores the results in a PostgreSQL database. A Streamlit dashboard visualizes the stored headlines and their sentiment.

Project Architecture

Tech Stack

Data Source: NewsAPI – for fetching live headlines by country and category
Messaging/Streaming: Apache Kafka – for real-time data ingestion and message queueing (hosted on EC2)
Raw Storage: Amazon S3 – for temporarily storing raw news data before processing
Stream Processing: Apache Spark Structured Streaming – for real-time parsing, transformation, and enrichment
Sentiment Analysis: TextBlob – to classify news headlines as Positive, Negative, or Neutral
Structured Storage: PostgreSQL (AWS RDS) – to persist cleaned and labeled data for querying
Visualization: Streamlit – for building an interactive UI to display news and sentiment
Deployment:
- Docker – to containerize the Streamlit application
- Amazon ECR (Elastic Container Registry) – to store container images
- Amazon ECS (Fargate) – for serverless deployment of the Streamlit app
Programming Language: Python
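
The Sentiment Analysis entry above maps each headline to one of three labels. A minimal sketch of how that mapping might look with TextBlob follows. The function names and the `threshold` parameter are assumptions made for illustration and are not taken from sentiment_analysis.py; the only library call used is TextBlob's `sentiment.polarity`, which returns a score between -1.0 and 1.0.

```python
def label_from_polarity(polarity: float, threshold: float = 0.0) -> str:
    """Map a polarity score in [-1.0, 1.0] to one of the three labels.

    `threshold` is a hypothetical dead zone around zero: scores whose
    absolute value does not exceed it are treated as Neutral.
    """
    if polarity > threshold:
        return "Positive"
    if polarity < -threshold:
        return "Negative"
    return "Neutral"


def classify_headline(headline: str) -> str:
    """Score a headline with TextBlob and return its label."""
    # Imported lazily so the labeling rule above can be used and tested
    # without TextBlob installed.
    from textblob import TextBlob

    return label_from_polarity(TextBlob(headline).sentiment.polarity)
```

Keeping the labeling rule separate from the TextBlob call makes the rule easy to unit test and to tune, for example by widening the neutral band if too many near-zero headlines end up labeled Positive or Negative.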

Pipeline Flow

News API
→ Live news headlines are fetched from NewsAPI, filtered by country and category.
Kafka Producer (producer.py)
→ Sends the fetched news data as JSON messages to a Kafka topic (news-topic) running on an EC2 instance.
Kafka Consumer (Consumer.py)
→ Reads messages from the Kafka topic and:
- Stores the raw news data temporarily in an S3 bucket
- Prepares data for stream processing
Apache Spark Streaming (streaming.py)
→ Reads data from Kafka or S3
- Parses and cleans the data
- Performs sentiment analysis using sentiment_analysis.py
- Writes the transformed, enriched news data into a PostgreSQL database hosted on AWS RDS
PostgreSQL RDS
→ Stores the cleaned, structured news articles, including title, description, timestamp, and sentiment.
Streamlit Dashboard
→ Connects to the PostgreSQL database
- Visualizes the latest news headlines along with their sentiment labels
- Provides a real-time UI to browse and analyze news content
Deployment using ECS + ECR
→ The Streamlit app is containerized using Docker
- Pushed to Amazon ECR (Elastic Container Registry)
- Deployed on Amazon ECS Fargate for serverless, scalable hosting
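
The first hop of the flow above, from fetched articles to JSON messages on news-topic, could be sketched as follows. This is an illustration under stated assumptions rather than the contents of producer.py: it assumes the kafka-python client, assumes each article is a dict carrying the `title`, `description`, and `publishedAt` fields, and uses the hypothetical helper names `to_message` and `publish`.

```python
import json

TOPIC = "news-topic"  # topic name taken from the pipeline description


def to_message(article: dict) -> bytes:
    """Keep only the fields the pipeline stores and encode them as UTF-8 JSON."""
    record = {
        "title": article.get("title"),
        "description": article.get("description"),
        "publishedAt": article.get("publishedAt"),  # assumed timestamp field
    }
    return json.dumps(record).encode("utf-8")


def publish(articles: list[dict], bootstrap_servers: str) -> int:
    """Send each article to the Kafka topic and return how many were sent."""
    # Imported lazily so to_message() works without kafka-python installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    for article in articles:
        producer.send(TOPIC, value=to_message(article))
    producer.flush()  # block until buffered messages are delivered
    producer.close()
    return len(articles)
```

A call such as `publish(articles, "<ec2-host>:9092")` would then push one message per article; the broker address is a placeholder for the EC2-hosted Kafka instance.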

Code Overview

The core logic of this project lives inside the Src/ folder. Here’s a quick look at what each file does:

fetch.py: Utility module that fetches data from NewsAPI; used by the producer.
producer.py: Pulls live news headlines from NewsAPI and streams them to a Kafka topic.
Consumer.py: Listens to the Kafka topic and stores the raw news data in an S3 bucket.
streaming.py: Spark job that reads from Kafka/S3, cleans the data, runs sentiment analysis, and stores results in PostgreSQL.
sentiment_analysis.py: Adds sentiment labels (Positive, Negative, Neutral) using TextBlob.
table.sql: PostgreSQL schema for creating the news_articles table.
Dockerfile: Docker setup to containerize and deploy the Streamlit frontend.
requirements.txt: List of Python packages required to run the pipeline.
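
As a rough illustration of what the streaming.py entry describes, the sketch below reads JSON messages from Kafka with Spark Structured Streaming and appends each micro-batch to PostgreSQL over JDBC. It is written under assumptions, not copied from the repository: the column names in `ARTICLE_SCHEMA`, the helper names `jdbc_options` and `run_stream`, and port 5432 are all hypothetical, and the sentiment step is left out for brevity. Running it also requires the Spark Kafka connector and the PostgreSQL JDBC driver to be available to Spark.

```python
# Assumed message layout, written as a Spark DDL schema string.
ARTICLE_SCHEMA = "title STRING, description STRING, publishedAt STRING"


def jdbc_options(host: str, database: str, user: str, password: str,
                 table: str = "news_articles") -> dict:
    """Build the option map for Spark's JDBC writer (port 5432 assumed)."""
    return {
        "url": f"jdbc:postgresql://{host}:5432/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def run_stream(bootstrap_servers: str, options: dict) -> None:
    """Read news-topic, parse the JSON payload, and append rows to PostgreSQL."""
    # Imported lazily so jdbc_options() can be used without PySpark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json

    spark = SparkSession.builder.appName("news-stream").getOrCreate()
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", "news-topic")
        .load()
    )
    # Kafka delivers the payload as bytes in the "value" column.
    articles = raw.select(
        from_json(col("value").cast("string"), ARTICLE_SCHEMA).alias("article")
    ).select("article.*")

    def write_batch(batch_df, batch_id):
        # Each micro-batch is a regular DataFrame, so the batch JDBC writer applies.
        batch_df.write.format("jdbc").options(**options).mode("append").save()

    articles.writeStream.foreachBatch(write_batch).start().awaitTermination()
```

`foreachBatch` is used here because Structured Streaming has no built-in JDBC sink. The DataFrame column names must match the columns defined in table.sql for the append to succeed.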

Features

Real-time data ingestion using Apache Kafka
Stream processing and enrichment using Apache Spark Structured Streaming
Optional sentiment analysis using TextBlob
Analytics-ready data stored in PostgreSQL (AWS RDS)
Temporary raw data storage in Amazon S3
Cloud-native design: runs on AWS EC2, RDS, and S3
Dockerized setup for local development and deployment
Containerized Streamlit app: the Docker image is stored in Amazon ECR and deployed on Amazon ECS (Fargate) for scalable, serverless hosting

License

This project is licensed under the MIT License – feel free to use, modify, and share!

Hridya2001/real-time-news-analysis

Folders and files

Latest commit

History

Repository files navigation

Real-Time News Data Pipeline

Project Architecture

Tech Stack

Pipeline Flow

Code Overview

Features

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Languages