# ETL-framework 🏗️

A demo ETL framework designed to showcase high-quality data engineering principles and practices.

This project isn’t production-grade - it’s a reference implementation that brings together key patterns like medallion architecture, data quality enforcement, CI/CD with GitHub Actions, Infrastructure as Code (IaC) using Databricks Asset Bundles (DABs), automated documentation, and more. It's intended as a learning tool and best practice guide, demonstrating how to build reliable, maintainable, and secure data pipelines using modern tooling and engineering standards.

The pipeline simulates a real-world online grocery analytics scenario: the business has asked for a dashboard to help category managers and supply chain analysts make better decisions around product placement, replenishment, and promotional effectiveness. They want to answer questions like:

- Which products drive the most repeat purchases and should be prioritised?
- Which promotions are actually increasing reorder rates or basket sizes?
- How do sales patterns vary across departments and days of the week?

We use the Kaggle [Instacart Online Grocery dataset](https://www.kaggle.com/datasets/yasserh/instacart-online-grocery-basket-analysis-dataset) as a proxy for real transactional source data from an online grocer. See [Use Case](https://tomoscorbin.github.io/ETL-framework/use_case.html) for more info.

## Principles & Practices 🚦

### Medallion Architecture (Bronze → Silver → Gold) 🪙

- Structured layering from raw ingestion to clean, consumable datasets: Bronze stores raw data, Silver applies transformations and joins, and Gold produces analytics-ready outputs. This enables traceability and maintainability.

### Data Quality 🧪

- Automated DQ framework that warns on or blocks bad data before it is written to tables.
- Leverages Databricks DQX to define expectation-based rules (e.g. completeness, uniqueness, patterns, ranges) at both row and column levels, enabling flexible validation pipelines.
- Table health is continuously monitored using Lakehouse Monitoring, with support for profiling and drift detection over time.
- Failures trigger Databricks SQL alerts, ensuring awareness of data quality issues.

### Data Modeling 🧩

- Organizes the gold layer into relational star schemas using Kimball's dimensional modeling approach.
- Separates fact tables from dimension tables and promotes conformed dimensions for consistent analytics.

### CI / CD with GitHub Actions & Databricks Asset Bundles 🔁

- Run unit tests & linting on pull requests.
- Validate Databricks Asset Bundle YAML to catch infra/configuration errors early.
- Use GitHub Actions to deploy bundles to a dev environment automatically on pushes to main.
- Support manual, SHA-controlled deployment to prod environments.
- Isolate environments (dev, prod) using environment-specific DAB targets and variable substitution.
- Trigger post-deployment actions like running DDL operations.
- Auto-build and deploy Sphinx documentation to GitHub Pages on pushes to main, ensuring public docs stay current.
- Leverage Git commit SHAs or semantic version tags for traceability and rollback.
- Promote to prod upon successful dev deployments.

### Testing & Code Quality 🧪

- Includes unit tests and integration tests, validating transforms, quality logic, and infrastructure using Pytest (see the sketch below).
- Enforces linting with Ruff and static typing with MyPy.
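To give a flavour of the testing style, here is a minimal Pytest sketch for a silver-style transform. The function, column names, and test data are illustrative assumptions rather than the project's actual code; a local SparkSession is all that's needed, so no Databricks cluster is required.

```python
import pytest
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def basket_sizes(order_products: DataFrame) -> DataFrame:
    """Silver-style transform: count how many products each order contains."""
    return order_products.groupBy("order_id").agg(
        F.count("product_id").alias("basket_size")
    )


@pytest.fixture(scope="session")
def spark() -> SparkSession:
    # A local SparkSession is enough for transform-level tests.
    return (
        SparkSession.builder.master("local[1]")
        .appName("etl-framework-tests")
        .getOrCreate()
    )


def test_basket_sizes_counts_products_per_order(spark: SparkSession) -> None:
    order_products = spark.createDataFrame(
        [(1, 101), (1, 102), (2, 103)],
        ["order_id", "product_id"],
    )

    result = {
        row["order_id"]: row["basket_size"]
        for row in basket_sizes(order_products).collect()
    }

    assert result == {1: 2, 2: 1}
```

Keeping transforms as pure DataFrame-in/DataFrame-out functions is what makes this kind of fast, cluster-free unit testing possible.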
### Documentation 📚

- Fully Sphinx-documented (`docs/`) with autodoc configuration.
- CI pipeline auto-deploys docs to GitHub Pages, ensuring public-facing documentation reflects the codebase.

### Infrastructure as Code (DABs) 🧱

- Databricks infrastructure is managed using Databricks Asset Bundles (DABs) for repeatable, declarative deployments.
- Configurations define clusters, secrets, jobs, and workspace assets using YAML.
- DABs integrate with GitHub Actions for automated validation and deployment as part of the CI/CD workflow.

### Software Engineering Standards 💻

- Configuration, constants, and logging are structured and centralized to promote maintainability, clarity, and consistent behavior across the codebase.
- Abstractions like `DeltaTable` and `DeltaWriter` isolate complexity and support reusable logic across pipelines (a minimal sketch appears at the end of this README).

### Security & Governance 🔐

- Sensitive credentials (e.g. API keys, database passwords) are never hardcoded and are securely managed via GitHub secrets. Secrets are injected at runtime using environment variables or cluster scopes.
- Tables are defined and registered within Unity Catalog, enabling centralized data governance across workspaces. This allows for fine-grained access controls, lineage tracking, and auditability.
- Permissions are set at the catalog, schema, and table level to restrict access based on least-privilege principles (e.g. consumers only have read access to Gold; only SPNs or authorised users have write access).
- Unity Catalog automatically tracks table lineage and access history, supporting traceability for all medallion-layered tables.

*Note: This is an ongoing demo framework. Some elements are partially implemented or stubbed for demonstration purposes.*

*While ingestion is included to enable end-to-end flow, this project focuses more on downstream practices.*
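As a concrete illustration of the `DeltaTable` / `DeltaWriter` abstractions mentioned under Software Engineering Standards, a minimal sketch might look like the following. The class shapes, fields, and method names here are assumptions for illustration; the project's actual interfaces may differ.

```python
from dataclasses import dataclass

from pyspark.sql import DataFrame


@dataclass(frozen=True)
class DeltaTable:
    """Declarative description of a governed Unity Catalog table."""

    catalog: str
    schema: str
    name: str

    @property
    def full_name(self) -> str:
        return f"{self.catalog}.{self.schema}.{self.name}"


class DeltaWriter:
    """Centralises how pipelines persist data, so write behaviour stays consistent."""

    def append(self, df: DataFrame, table: DeltaTable) -> None:
        df.write.format("delta").mode("append").saveAsTable(table.full_name)

    def overwrite(self, df: DataFrame, table: DeltaTable) -> None:
        (
            df.write.format("delta")
            .mode("overwrite")
            .option("overwriteSchema", "true")
            .saveAsTable(table.full_name)
        )
```

A gold-layer job can then persist its output with something like `DeltaWriter().overwrite(fact_orders_df, DeltaTable("main", "gold", "fact_orders"))` (catalog and table names illustrative), rather than repeating Delta write options in every pipeline.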