Weather Analytics

This project is a data engineering assignment focused on processing historical weather observations (temperature, pressure, humidity and weather conditions) for multiple cities. The goal is to design, implement and test a data processing pipeline using PySpark, with particular attention to data quality, transformations and project structure.

Tasks

  • Task 1: for each year available in the dataset, identify the cities that recorded at least 15 clear days per month for all spring months (March, April, May).
  • Task 2: for each country, compute the mean, standard deviation, minimum and maximum of temperature, pressure and humidity, aggregated by year and month (see the sketch after this list).
  • Task 3: for each country, identify the top 3 cities that recorded the highest difference between average temperatures during the warm period (June-September) and the cold period (January-April), considering only local-time observations between 12:00 and 15:00.
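
As an illustration of the aggregation behind Task 2, a minimal PySpark sketch is shown below. It assumes the measurements have already been joined with the city-to-country mapping into a long-format DataFrame df with columns country, datetime, temperature, pressure and humidity; the actual column names in the project may differ.

from pyspark.sql import functions as F

# Sketch of the Task 2 aggregation: one grouped pass per country/year/month.
stats = (
    df.withColumn("year", F.year("datetime"))
      .withColumn("month", F.month("datetime"))
      .groupBy("country", "year", "month")
      .agg(
          F.mean("temperature").alias("temperature_mean"),
          F.stddev("temperature").alias("temperature_std"),
          F.min("temperature").alias("temperature_min"),
          F.max("temperature").alias("temperature_max"),
          # the same four statistics would be repeated for pressure and humidity
      )
)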

Project structure

.
├── data/
│   ├── city_attributes.csv
│   ├── humidity.csv
│   ├── pressure.csv
│   ├── temperature.csv
│   └── weather_description.csv
│
├── preanalysis/
│   ├── data_quality_checks.py
│   └── data_quality.txt
│
├── preprocessing/
│   ├── __init__.py
│   ├── generate_city_offsets.py
│   └── city_offsets.json
│
├── results/
│   ├── task1.csv
│   ├── task2.csv
│   └── task3.csv
│
├── tasks/
│   ├── __init__.py
│   ├── task1.py
│   ├── task2.py
│   └── task3.py
│
├── tests/
│   ├── conftest.py
│   ├── test_task1.py
│   ├── test_task2.py
│   ├── test_task3.py
│   └── test_utils.py
│
├── utils/
│   ├── __init__.py
│   └── shared_utils.py
│
├── .dockerignore
├── Dockerfile
├── main.py
├── pytest.ini
├── README.md
└── requirements.txt

Breakdown

data/
Contains the original datasets provided for the assignment (weather measurements and city attributes).

preanalysis/
Contains the script data_quality_checks.py, which can be run once to check for malformed timestamps, out-of-domain weather descriptions and null values. Results are saved in data_quality.txt.
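
As an example of this kind of check, null ratios per column can be computed in a single pass; the following is only a sketch of the idea, not the exact contents of data_quality_checks.py.

from pyspark.sql import functions as F

# Sketch: fraction of null values per column of a wide measurement file
# (e.g. temperature.csv, where every column besides datetime is a city).
def null_ratios(df):
    total = df.count()
    return df.select([
        (F.sum(df[c].isNull().cast("int")) / total).alias(c)
        for c in df.columns
    ])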

preprocessing/
Contains the script generate_city_offsets.py, used to generate the mapping between cities and their UTC offsets, required for Task 3. The result is saved in city_offsets.json and the script is automatically executed if the file is not present yet.
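
How such a mapping might be produced is sketched below. It assumes city_attributes.csv provides Latitude/Longitude columns and that the timezonefinder package and the standard-library zoneinfo module are available; the actual script may derive the offsets differently.

import csv
import json
from datetime import datetime
from zoneinfo import ZoneInfo

from timezonefinder import TimezoneFinder

# Hypothetical offset generation: resolve each city's IANA timezone from its
# coordinates, then store its UTC offset (in hours) at a fixed reference date.
tf = TimezoneFinder()
offsets = {}
with open("data/city_attributes.csv", newline="") as f:
    for row in csv.DictReader(f):
        tz = tf.timezone_at(lat=float(row["Latitude"]), lng=float(row["Longitude"]))
        delta = datetime(2015, 1, 1, tzinfo=ZoneInfo(tz)).utcoffset()
        offsets[row["City"]] = delta.total_seconds() / 3600

with open("preprocessing/city_offsets.json", "w") as f:
    json.dump(offsets, f, indent=2)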

results/
Contains the CSV outputs of the three tasks. They are included for convenience, so results can be inspected without running the full pipeline.

tasks/
Contains the implementation of the three required tasks, each encapsulated in its own module.

tests/
Contains tests for both tasks and utilities, implemented with pytest. The file conftest.py provides a SparkSession fixture used across all tests. Test configuration is defined in pytest.ini.
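
A session-scoped SparkSession fixture of this kind usually looks roughly like the following (a sketch, not necessarily the exact contents of conftest.py).

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Local SparkSession shared by every test; stopped when the test session ends.
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("weather-analytics-tests")
        .getOrCreate()
    )
    yield session
    session.stop()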

utils/
Contains shared_utils.py, which defines utility functions shared across the project (e.g. Spark session initialization, wide-to-long conversion, saving results).
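
For instance, the wide-to-long conversion can be expressed with Spark's stack expression; the helper below is a sketch with an assumed name and signature, not the actual code in shared_utils.py.

from pyspark.sql import DataFrame, functions as F

def wide_to_long(df: DataFrame, value_name: str) -> DataFrame:
    # Turn a wide frame (datetime + one column per city) into a long frame
    # with columns: datetime, city, <value_name>.
    cities = [c for c in df.columns if c != "datetime"]
    stack_expr = ", ".join(f"'{c}', `{c}`" for c in cities)
    return df.select(
        "datetime",
        F.expr(f"stack({len(cities)}, {stack_expr}) as (city, {value_name})"),
    )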

How to run the project

Two execution modes are supported: manual and Docker-based.

Prerequisites (manual mode only)

  • Python 3.10.11
  • Java 11

Manual execution

  • Ensure Python and Java are correctly installed
  • Open a terminal and move to the project root
  • Create and activate a virtual environment:
python -m venv .venv
./.venv/Scripts/activate        # Windows
source .venv/bin/activate       # Linux/macOS
  • Install the dependencies:
pip install -r requirements.txt

Docker execution

  • Install Docker and verify it is running correctly
  • Open a terminal and move to the project root
  • Build the Docker image:
docker build -t weather-analytics .
  • Run the container:
docker run -it --rm weather-analytics

After setting up the environment, you can:

  • Run the full pipeline:
python main.py
  • Run the preliminary data quality analysis (optional):
python preanalysis/data_quality_checks.py
  • Run all tests:
pytest
  • Run tests for a specific module:
pytest tests/test_<module>.py

Design choices & Extra

  • The Spark DataFrame API is used throughout the project to keep transformations declarative and to let Spark optimize execution through its logical and physical query planning.

  • A dedicated data quality analysis step is included to inspect schema consistency, temporal coverage, null ratios and domain values before implementing business logic.

  • Weather datasets are provided in a wide format; they are converted to a long format where needed within the tasks to simplify grouping logic, reduce code duplication and apply uniform transformations across cities.

  • Thresholds and business rules are defined as constants at module level to make assumptions explicit and easy to adjust.

  • For Task 1, a day is considered "clear" if at least 16 hourly observations are available and at least 50% of them report the value "sky is clear", in order to ensure sufficient temporal coverage and reduce the impact of missing data (a sketch of this rule is given at the end of this section).

  • Timezone handling in Task 3 is implemented by precomputing UTC offsets per city in a one-off preprocessing step, avoiding additional runtime dependencies and expensive per-row operations (see the second sketch at the end of this section).

  • Tests are divided by task and focus on validating the correctness of the transformation logic and the produced outputs on controlled input datasets.

  • Docker support is provided to ensure reproducibility and simplify execution in heterogeneous environments, while keeping a manual setup available.

  • In a production scenario, the pipeline could be extended with orchestration, partitioned outputs and externalized configuration.
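
The "clear day" rule from Task 1 can be sketched as follows; the thresholds are the ones listed above, while the DataFrame and column names (weather, weather_description) are assumptions about the long-format schema.

from pyspark.sql import functions as F

# Thresholds as module-level constants (values taken from the rule above).
MIN_HOURLY_OBSERVATIONS = 16
CLEAR_FRACTION_THRESHOLD = 0.5

# One row per (city, date) with a boolean flag marking clear days.
daily = (
    weather.withColumn("date", F.to_date("datetime"))
           .groupBy("city", "date")
           .agg(
               F.count("*").alias("n_obs"),
               F.sum((F.col("weather_description") == "sky is clear")
                     .cast("int")).alias("n_clear"),
           )
           .withColumn(
               "is_clear_day",
               (F.col("n_obs") >= MIN_HOURLY_OBSERVATIONS)
               & (F.col("n_clear") / F.col("n_obs") >= CLEAR_FRACTION_THRESHOLD),
           )
)

Similarly, the precomputed offsets for Task 3 can be applied by joining them in as a small DataFrame and shifting the UTC timestamps; again a sketch under assumed names, with city_offsets.json taken to map city names to offsets in hours.

import json
from pyspark.sql import functions as F

with open("preprocessing/city_offsets.json") as f:
    offsets = json.load(f)

offsets_df = spark.createDataFrame(list(offsets.items()), ["city", "utc_offset_hours"])

# Shift UTC timestamps by the per-city offset and keep 12:00-15:00 local time.
local = (
    df.join(offsets_df, on="city")
      .withColumn(
          "local_datetime",
          (F.unix_timestamp("datetime") + F.col("utc_offset_hours") * 3600)
          .cast("timestamp"),
      )
      .filter(F.hour("local_datetime").between(12, 15))
)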
