Julian Metias
Back to projects
Fuel food correlation pipe line

Fuel food correlation pipe line

Spark SQLData bricks

📊 NZ Fuel & Food Price Correlation Pipeline

🚀 Project Overview

This project is a Data Engineering Pipeline built in Databricks to analyze the economic relationship between gasoline price fluctuations and the Food Price Index (FPI) in New Zealand.

The goal was to determine if spikes in fuel (specifically Diesel and Regular Petrol) directly correlate with rising food costs, and to identify any "time-lags" in this relationship.

🏗️ Architecture: The Medallion Approach

The pipeline follows the Medallion Architecture to ensure data quality and scalability:

Bronze Layer: Raw ingestion of Stats NZ and Fuel Price CSV files into Delta tables.

Silver Layer: * Data cleaning and type casting.

Feature Engineering: Created a custom dim_date table using the Python holidays library to flag New Zealand public holidays.

Aggregation: Downsampled daily fuel prices to a monthly mean to match the Food Price Index grain.

Gold Layer: * Implemented a Star Schema design.

Created a fact_fuel_food_correlation table by joining processed fuel, food, and date dimension data.

🛠️ Tech Stack

Platform: Databricks (Community Edition)

Language: PySpark (Python) and Spark SQL

Storage: Delta Lake

Orchestration: Logic designed for integration with Airflow/Databricks Workflows

Libraries: pandas, pyspark.sql.functions, holidays

📈 Key Features & Logic

Movable Holiday Logic: Automated detection of NZ-specific holidays (like Matariki and Easter) using the holidays library to account for seasonal price anomalies.

Data Normalization: Transformed the "1960M01" date format into standard SQL timestamps for seamless joining.

Business Intelligence Ready: The Gold layer is optimized for PowerBI/Tableau, pivoting fuel types into individual columns for immediate correlation analysis.

📂 Repository Structure

Bronze/: Notebooks for raw data ingestion.

Silver/: Transformation logic and Dimension table creation.

Gold/: Final Star Schema modeling and join logic.

SourceData/: Sample CSV datasets used in the pipeline.