This project is a data engineering assignment focused on processing historical weather observations (temperature, pressure, humidity and weather conditions) for multiple cities. The goal is to design, implement and test a data processing pipeline using PySpark, with particular attention to data quality, transformations and project structure.
- Task 1: for each year available in the dataset, identify the cities that recorded at least 15 clear days per month for all spring months (March, April, May).
- Task 2: for each country, compute the mean, standard deviation, minimum and maximum of temperature, pressure and humidity, aggregated by year and month.
- Task 3: for each country, identify the top 3 cities that recorded the highest difference between average temperatures during the warm period (June-September) and the cold period (January-April), considering only local-time observations between 12:00 and 15:00.
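To give a concrete idea of the transformations involved, below is a minimal PySpark sketch of the Task 2 aggregation. It assumes the wide-format CSVs have already been reshaped into a long DataFrame with one row per (country, city, datetime, metric, value); the column names and sample row are illustrative, not the project's actual schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("task2-sketch").getOrCreate()

# Illustrative long-format input: one row per (country, city, datetime, metric, value).
long_df = spark.createDataFrame(
    [("United States", "Portland", "2015-06-01 13:00:00", "temperature", 291.5)],
    ["country", "city", "datetime", "metric", "value"],
)

task2 = (
    long_df
    .withColumn("datetime", F.to_timestamp("datetime"))
    .withColumn("year", F.year("datetime"))
    .withColumn("month", F.month("datetime"))
    .groupBy("country", "year", "month", "metric")
    .agg(
        F.mean("value").alias("mean"),
        F.stddev("value").alias("std"),
        F.min("value").alias("min"),
        F.max("value").alias("max"),
    )
)
```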
.
├── data/
│ ├── city_attributes.csv
│ ├── humidity.csv
│ ├── pressure.csv
│ ├── temperature.csv
│ └── weather_description.csv
│
├── preanalysis/
│ ├── data_quality_checks.py
│ └── data_quality.txt
│
├── preprocessing/
│ ├── __init__.py
│ ├── generate_city_offsets.py
│ └── city_offsets.json
│
├── results/
│ ├── task1.csv
│ ├── task2.csv
│ └── task3.csv
│
├── tasks/
│ ├── __init__.py
│ ├── task1.py
│ ├── task2.py
│ └── task3.py
│
├── tests/
│ ├── conftest.py
│ ├── test_task1.py
│ ├── test_task2.py
│ ├── test_task3.py
│ └── test_utils.py
│
├── utils/
│ ├── __init__.py
│ └── shared_utils.py
│
├── .dockerignore
├── Dockerfile
├── main.py
├── pytest.ini
├── README.md
└── requirements.txt
data/
Contains the original datasets provided for the assignment (weather measurements and city attributes).
preanalysis/
Contains the script data_quality_checks.py, which can be run once to check for malformed timestamps, out-of-domain weather description values and null values. Results are saved in data_quality.txt.
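The following is a minimal sketch of the kind of checks performed, assuming the wide CSV layout with a datetime column followed by one column per city; the sample city column is hypothetical and the real script's checks and output format may differ.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-quality-sketch").getOrCreate()
df = spark.read.csv("data/weather_description.csv", header=True)

# Null ratio per column, computed in a single aggregation.
total = df.count()
null_counts = df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
).first().asDict()
null_ratios = {c: null_counts[c] / total for c in df.columns}

# Malformed timestamps: rows whose datetime column cannot be parsed.
malformed = df.filter(F.to_timestamp("datetime").isNull()).count()

# Observed domain of weather descriptions for one city column (hypothetical name).
domain = [row[0] for row in df.select("Portland").distinct().collect()]
```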
preprocessing/
Contains the script generate_city_offsets.py, used to generate the mapping between cities and their UTC offsets, required for Task 3. The result is saved in city_offsets.json; the script is run automatically if the file is not yet present.
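One possible way to build such a mapping, shown purely as a sketch: it assumes city_attributes.csv provides coordinates per city (the column names City, Latitude, Longitude are assumptions) and uses the third-party timezonefinder package together with the standard-library zoneinfo module. The actual script may derive the offsets differently.

```python
import csv
import json
from datetime import datetime
from zoneinfo import ZoneInfo

from timezonefinder import TimezoneFinder

tf = TimezoneFinder()
offsets = {}

with open("data/city_attributes.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Column names (City, Latitude, Longitude) are assumptions about the input file.
        tz_name = tf.timezone_at(lat=float(row["Latitude"]), lng=float(row["Longitude"]))
        # Fixed offset in hours from UTC on a reference date (DST transitions ignored).
        offset = ZoneInfo(tz_name).utcoffset(datetime(2015, 1, 1)).total_seconds() / 3600
        offsets[row["City"]] = offset

with open("preprocessing/city_offsets.json", "w") as f:
    json.dump(offsets, f, indent=2)
```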
results/
Contains the CSV outputs of the three tasks. They are included for convenience, so results can be inspected without running the full pipeline.
tasks/
Contains the implementation of the three required tasks, each encapsulated in its own module.
tests/
Contains tests for both tasks and utilities, implemented with pytest. The file conftest.py provides a SparkSession fixture used across all tests. Test configuration is defined in pytest.ini.
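The fixture is likely along these lines; this is a sketch, not the project's exact conftest.py.

```python
# conftest.py (sketch)
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    """Session-scoped SparkSession shared by all tests."""
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("weather-analytics-tests")
        .getOrCreate()
    )
    yield session
    session.stop()
```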
utils/
Contains shared_utils.py, which defines utility functions shared across the project (e.g. Spark session initialization, wide-to-long conversion, saving results).
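One common way to express the wide-to-long conversion in the DataFrame API is the SQL stack generator; the sketch below assumes each wide file has a datetime column followed by one column per city, and the function name is illustrative rather than the actual shared_utils.py API.

```python
from pyspark.sql import DataFrame


def wide_to_long(df: DataFrame, value_name: str) -> DataFrame:
    """Reshape (datetime, <city_1>, ..., <city_n>) into (datetime, city, <value_name>).

    Sketch implementation based on the SQL `stack` generator.
    """
    city_cols = [c for c in df.columns if c != "datetime"]
    stack_expr = ", ".join(f"'{c}', `{c}`" for c in city_cols)
    return df.selectExpr(
        "datetime",
        f"stack({len(city_cols)}, {stack_expr}) as (city, {value_name})",
    )
```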
Two execution modes are supported: manual and Docker-based.
- Python 3.10.11
- Java 11
- Ensure Python and Java are correctly installed
- Open a terminal and move to the project root
- Create and activate a virtual environment:
python -m venv .venv
./.venv/Scripts/activate
- Install the dependencies:
pip install -r requirements.txt
- Install Docker and verify it is running correctly
- Open a terminal and move to the project root
- Build the Docker image:
docker build -t weather-analytics .
- Run the container:
docker run -it --rm weather-analytics
After setting up the environment, you can:
- Run the full pipeline:
python main.py
- Run the preliminary data quality analysis (optional):
python preanalysis/data_quality_checks.py
- Run all tests:
pytest
- Run tests for a specific module:
pytest tests/test_<module>.py
- The Spark DataFrame API is used throughout the project to keep transformations declarative and let Spark optimize execution through its logical and physical planning.
- A dedicated data quality analysis step is included to inspect schema consistency, temporal coverage, null ratios and domain values before implementing business logic.
- Weather datasets are provided in a wide format; they are converted to a long format where needed within the tasks to simplify grouping logic, reduce code duplication and apply uniform transformations across cities.
- Thresholds and business rules are defined as constants at module level to make assumptions explicit and easy to adjust.
- For Task 1, a day is considered "clear" if at least 16 hourly observations are available and at least 50% of them report the value "sky is clear", in order to ensure sufficient temporal coverage and reduce the impact of missing data (see the sketch after this list).
- Timezone handling in Task 3 is implemented by precomputing UTC offsets per city in a one-off preprocessing step, avoiding additional runtime dependencies and expensive per-row operations.
- Tests are divided by task and focus on validating the correctness of the transformation logic and the produced outputs on controlled input datasets.
- Docker support is provided to ensure reproducibility and simplify execution in heterogeneous environments, while keeping a manual setup available.
- In a production scenario, the pipeline could be extended with orchestration, partitioned outputs and externalized configuration.
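As referenced in the Task 1 note above, here is a sketch of how the clear-day rule can be expressed with the DataFrame API; the input layout, sample row and column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("task1-clear-day-sketch").getOrCreate()

# Illustrative long-format input: one row per (city, datetime, weather description).
long_df = spark.createDataFrame(
    [("Portland", "2015-06-01 13:00:00", "sky is clear")],
    ["city", "datetime", "weather"],
)

MIN_HOURLY_OBS = 16          # minimum hourly observations for a day to be evaluated
CLEAR_RATIO_THRESHOLD = 0.5  # required share of "sky is clear" observations

clear_days = (
    long_df
    .withColumn("date", F.to_date("datetime"))
    .groupBy("city", "date")
    .agg(
        F.count("weather").alias("n_obs"),
        F.sum((F.col("weather") == "sky is clear").cast("int")).alias("n_clear"),
    )
    .filter(
        (F.col("n_obs") >= MIN_HOURLY_OBS)
        & (F.col("n_clear") / F.col("n_obs") >= CLEAR_RATIO_THRESHOLD)
    )
)
```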