Julian Metias

Superstore ETL Pipeline

This project is a Python-based ETL (Extract, Transform, Load) pipeline for processing Superstore sales data. It takes raw sales data in CSV format, cleans and standardizes it, and outputs the processed data to both a cleaned CSV file and an SQLite database.

Project Structure

superstore_etl/
├── main.py          # Main execution script to run the ETL pipeline
├── config.yaml      # Configuration file with source and database paths
├── notebooks/       # Directory for Jupyter notebooks (e.g., data exploration)
├── data/
│   ├── source/      # Raw source data (e.g., Sample - Superstore.csv)
│   └── processed/   # Processed outputs (cleaned CSV, SQLite DB)
└── src/
    ├── extract.py   # Data extraction logic (reads CSV)
    ├── transform.py # Data cleaning, imputation, and transformation logic
    └── load.py      # Logic to load data into CSV and SQLite

How It Works

  1. Extract (src/extract.py): Reads the raw CSV data from the path specified in config.yaml using the configured encoding.
  2. Transform (src/transform.py): Cleans the data by:
    • Imputing missing values with "Unknown" or 0.
    • Removing duplicate records.
    • Casting date columns (Order Date, Ship Date) to correct datetime formats.
    • Trimming whitespace and standardizing string formats (Title Casing).
    • Validating numeric columns (Sales, Quantity) by filtering out rows with non-positive values.
    • Standardizing column names (lowercase with underscores) to prepare them for database insertion.
  3. Load (src/load.py): Exports the cleaned pandas DataFrame to a processed CSV file and loads it into an SQLite database (cleaned_superstore table) specified in the config.
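
The transformation step is the bulk of the pipeline. A minimal sketch of what src/transform.py does in pandas (the function name and exact rules are assumptions; column names come from the Superstore dataset):

```python
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the cleaning steps described above."""
    df = df.copy()

    # Impute missing values: "Unknown" for text columns, 0 for numeric columns
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(0)
        else:
            df[col] = df[col].fillna("Unknown")

    # Remove duplicate records
    df = df.drop_duplicates()

    # Cast date columns to datetime (unparseable values become NaT)
    for col in ("Order Date", "Ship Date"):
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce")

    # Trim whitespace and Title Case the remaining string columns
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip().str.title()

    # Keep only rows with strictly positive Sales and Quantity
    for col in ("Sales", "Quantity"):
        if col in df.columns:
            df = df[df[col] > 0]

    # Standardize column names: lowercase with underscores
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df
```

Note that the numeric validation runs before the column rename, so it filters on the original Sales/Quantity headers.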

Setup and Installation

  1. Ensure you have Python installed.
  2. Install the necessary dependencies:
    pip install pandas pyyaml
    
  3. Place the raw data file into the appropriate directory as specified in config.yaml (e.g., data/source/Sample - Superstore.csv).
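
For orientation, a minimal config.yaml could look like the following. The key names and values here are illustrative, not the project's actual schema:

```yaml
# Illustrative config.yaml; actual key names may differ
source:
  path: data/source/Sample - Superstore.csv
  encoding: latin-1
processed:
  csv_path: data/processed/cleaned_superstore.csv
  db_path: data/processed/superstore.db
```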

Usage

To run the full ETL pipeline, simply execute the main.py script:

python main.py

This will read the configuration, perform the data extraction, transformation, and load processes, and output success and completion messages to the console.
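
The extract and load ends of the pipeline that main.py orchestrates are thin wrappers around pandas I/O. A sketch, assuming the signatures shown and a latin-1 default encoding (in the project these live in src/extract.py and src/load.py):

```python
import sqlite3

import pandas as pd

def extract(csv_path: str, encoding: str = "latin-1") -> pd.DataFrame:
    # Read the raw CSV using the encoding configured in config.yaml
    return pd.read_csv(csv_path, encoding=encoding)

def load(df: pd.DataFrame, csv_path: str, db_path: str,
         table: str = "cleaned_superstore") -> None:
    # Export the cleaned DataFrame to CSV, then to an SQLite table
    df.to_csv(csv_path, index=False)
    conn = sqlite3.connect(db_path)
    try:
        df.to_sql(table, conn, if_exists="replace", index=False)
    finally:
        conn.close()
```

Using if_exists="replace" makes reruns of the pipeline idempotent: each run overwrites the cleaned_superstore table rather than appending duplicate rows.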