This project is a data engineering assignment focused on processing historical weather observations (temperature, pressure, humidity and weather conditions) for multiple cities. The goal is to design, implement and test a data processing pipeline using PySpark, with particular attention to data quality, transformations and project structure.
- Task 1: for each year available in the dataset, identify the cities that recorded at least 15 clear days per month for all spring months (March, April, May).
- Task 2: for each country, compute the mean, standard deviation, minimum and maximum of temperature, pressure and humidity, aggregated by year and month.
- Task 3: for each country, identify the top 3 cities that recorded the highest difference between average temperatures during the warm period (June-September) and the cold period (January-April), considering only local-time observations between 12:00 and 15:00.
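To give a concrete idea of the transformations involved, below is a minimal PySpark sketch of the Task 2 aggregation. It assumes the wide-format CSVs have already been reshaped into a long DataFrame with one row per (country, city, datetime, metric, value); the column names and sample row are illustrative, not the project's actual schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("task2-sketch").getOrCreate()

# Illustrative long-format input: one row per (country, city, datetime, metric, value).
long_df = spark.createDataFrame(
    [("United States", "Portland", "2015-06-01 13:00:00", "temperature", 291.5)],
    ["country", "city", "datetime", "metric", "value"],
)

task2 = (
    long_df
    .withColumn("datetime", F.to_timestamp("datetime"))
    .withColumn("year", F.year("datetime"))
    .withColumn("month", F.month("datetime"))
    .groupBy("country", "year", "month", "metric")
    .agg(
        F.mean("value").alias("mean"),
        F.stddev("value").alias("std"),
        F.min("value").alias("min"),
        F.max("value").alias("max"),
    )
)
```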
.
├── data/
│ ├── city_attributes.csv
│ ├── humidity.csv
│ ├── pressure.csv
│ ├── temperature.csv
│ └── weather_description.csv
│
├── preanalysis/
│ ├── data_quality_checks.py
│ └── data_quality.txt
│
├── preprocessing/
│ ├── __init__.py
│ ├── generate_city_offsets.py
│ └── city_offsets.json
│
├── results/
│ ├── task1.csv
│ ├── task2.csv
│ └── task3.csv
│
├── tasks/
│ ├── __init__.py
│ ├── task1.py
│ ├── task2.py
│ └── task3.py
│
├── tests/
│ ├── conftest.py
│ ├── test_task1.py
│ ├── test_task2.py
│ ├── test_task3.py
│ └── test_utils.py
│
├── utils/
│ ├── __init__.py
│ └── shared_utils.py
│
├── .dockerignore
├── Dockerfile
├── main.py
├── pytest.ini
├── README.md
└── requirements.txt
data/
Contains the original datasets provided for the assignment (weather measurements and city attributes).
preanalysis/
Contains the script data_quality_checks.py, which can be run once to check for malformed timestamps, out-of-domain weather description values and null values. Results are saved in data_quality.txt.
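The following is a minimal sketch of the kind of checks performed, assuming the wide CSV layout with a datetime column followed by one column per city; the sample city column is hypothetical and the real script's checks and output format may differ.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-quality-sketch").getOrCreate()
df = spark.read.csv("data/weather_description.csv", header=True)

# Null ratio per column, computed in a single aggregation.
total = df.count()
null_counts = df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
).first().asDict()
null_ratios = {c: null_counts[c] / total for c in df.columns}

# Malformed timestamps: rows whose datetime column cannot be parsed.
malformed = df.filter(F.to_timestamp("datetime").isNull()).count()

# Observed domain of weather descriptions for one city column (hypothetical name).
domain = [row[0] for row in df.select("Portland").distinct().collect()]
```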
preprocessing/
Contains the script generate_city_offsets.py, used to generate the mapping between cities and their UTC offsets, required for Task 3. The result is saved in city_offsets.json; the script is run automatically if the file is not yet present.
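One possible way to build such a mapping, shown purely as a sketch: it assumes city_attributes.csv provides coordinates per city (the column names City, Latitude, Longitude are assumptions) and uses the third-party timezonefinder package together with the standard-library zoneinfo module. The actual script may derive the offsets differently.

```python
import csv
import json
from datetime import datetime
from zoneinfo import ZoneInfo

from timezonefinder import TimezoneFinder

tf = TimezoneFinder()
offsets = {}

with open("data/city_attributes.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Column names (City, Latitude, Longitude) are assumptions about the input file.
        tz_name = tf.timezone_at(lat=float(row["Latitude"]), lng=float(row["Longitude"]))
        # Fixed offset in hours from UTC on a reference date (DST transitions ignored).
        offset = ZoneInfo(tz_name).utcoffset(datetime(2015, 1, 1)).total_seconds() / 3600
        offsets[row["City"]] = offset

with open("preprocessing/city_offsets.json", "w") as f:
    json.dump(offsets, f, indent=2)
```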
results/
Contains the CSV outputs of the three tasks. They are included for convenience, so results can be inspected without running the full pipeline.
tasks/
Contains the implementation of the three required tasks, each encapsulated in its own module.
tests/
Contains tests for both tasks and utilities, implemented with pytest. The file conftest.py provides a SparkSession fixture used across all tests. Test configuration is defined in pytest.ini.
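The fixture is likely along these lines; this is a sketch, not the project's exact conftest.py.

```python
# conftest.py (sketch)
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    """Session-scoped SparkSession shared by all tests."""
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("weather-analytics-tests")
        .getOrCreate()
    )
    yield session
    session.stop()
```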
utils/
Contains shared_utils.py, which defines utility functions shared across the project (e.g. Spark session initialization, wide-to-long conversion, saving results).
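One common way to express the wide-to-long conversion in the DataFrame API is the SQL stack generator; the sketch below assumes each wide file has a datetime column followed by one column per city, and the function name is illustrative rather than the actual shared_utils.py API.

```python
from pyspark.sql import DataFrame


def wide_to_long(df: DataFrame, value_name: str) -> DataFrame:
    """Reshape (datetime, <city_1>, ..., <city_n>) into (datetime, city, <value_name>).

    Sketch implementation based on the SQL `stack` generator.
    """
    city_cols = [c for c in df.columns if c != "datetime"]
    stack_expr = ", ".join(f"'{c}', `{c}`" for c in city_cols)
    return df.selectExpr(
        "datetime",
        f"stack({len(city_cols)}, {stack_expr}) as (city, {value_name})",
    )
```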
Two execution modes are supported: manual and Docker-based.
- Python 3.10.11
- Java 11
- Ensure Python and Java are correctly installed
- Open a terminal and move to the project root
- Create and activate a virtual environment:
python -m venv .venv
./.venv/Scripts/activate
- Install the dependencies:
pip install -r requirements.txt
- Install Docker and verify it is running correctly
- Open a terminal and move to the project root
- Build the Docker image:
docker build -t weather-analytics .
- Run the container:
docker run -it --rm weather-analytics
After setting up the environment, you can:
- Run the full pipeline:
python main.py
- Run the preliminary data quality analysis (optional):
python preanalysis/data_quality_checks.py
- Run all tests:
pytest
- Run tests for a specific module:
pytest tests/test_<module>.py
- The Spark DataFrame API is used throughout the project to keep transformations declarative and let Spark optimize execution through its logical and physical planning.
- A dedicated data quality analysis step is included to inspect schema consistency, temporal coverage, null ratios and domain values before implementing business logic.
- Weather datasets are provided in a wide format; they are converted to a long format where needed within the tasks to simplify grouping logic, reduce code duplication and apply uniform transformations across cities.
- Thresholds and business rules are defined as constants at module level to make assumptions explicit and easy to adjust.
- For Task 1, a day is considered "clear" if at least 16 hourly observations are available and at least 50% of them report the value "sky is clear", in order to ensure sufficient temporal coverage and reduce the impact of missing data (see the sketch after this list).
- Timezone handling in Task 3 is implemented by precomputing UTC offsets per city in a one-off preprocessing step, avoiding additional runtime dependencies and expensive per-row operations.
- Tests are divided by task and focus on validating the correctness of the transformation logic and the produced outputs on controlled input datasets.
- Docker support is provided to ensure reproducibility and simplify execution in heterogeneous environments, while keeping a manual setup available.
- In a production scenario, the pipeline could be extended with orchestration, partitioned outputs and externalized configuration.
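As referenced in the Task 1 note above, here is a sketch of how the clear-day rule can be expressed with the DataFrame API; the input layout, sample row and column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("task1-clear-day-sketch").getOrCreate()

# Illustrative long-format input: one row per (city, datetime, weather description).
long_df = spark.createDataFrame(
    [("Portland", "2015-06-01 13:00:00", "sky is clear")],
    ["city", "datetime", "weather"],
)

MIN_HOURLY_OBS = 16          # minimum hourly observations for a day to be evaluated
CLEAR_RATIO_THRESHOLD = 0.5  # required share of "sky is clear" observations

clear_days = (
    long_df
    .withColumn("date", F.to_date("datetime"))
    .groupBy("city", "date")
    .agg(
        F.count("weather").alias("n_obs"),
        F.sum((F.col("weather") == "sky is clear").cast("int")).alias("n_clear"),
    )
    .filter(
        (F.col("n_obs") >= MIN_HOURLY_OBS)
        & (F.col("n_clear") / F.col("n_obs") >= CLEAR_RATIO_THRESHOLD)
    )
)
```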