ETL-framework 🏗️¶
A demo ETL framework designed to showcase high-quality data engineering principles and practices. This project isn't production-grade; it's a reference implementation that brings together key patterns such as medallion architecture, data quality enforcement, CI/CD with GitHub Actions, Infrastructure as Code (IaC) using Databricks Asset Bundles (DABs), automated documentation, and more. It's intended as a learning tool and best-practice guide, demonstrating how to build reliable, maintainable, and secure data pipelines using modern tooling and engineering standards.
The pipeline simulates a real-world online grocery analytics scenario: the business has asked for a dashboard to help category managers and supply chain analysts make better decisions around product placement, replenishment, and promotional effectiveness. They want to answer questions like:
Which products drive the most repeat purchases and should be prioritised?
Which promotions are actually increasing reorder rates or basket sizes?
How do sales patterns vary across departments and days of the week?
We use the Kaggle Instacart Online Grocery dataset as a proxy for real transactional source data from an online grocer. See Use Case for more info.
Principles & Practices🚦¶
Medallion Architecture (Bronze → Silver → Gold) 🪙¶
Structured layering from raw ingestion to clean, consumable datasets: Bronze stores raw data, Silver applies transformations and joins, and Gold produces analytics-ready outputs. This enables traceability and maintainability.
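As a rough illustration, the sketch below shows the layering pattern in PySpark. Table names, columns, and paths are illustrative placeholders rather than the project's actual schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land the raw source data as-is, adding only ingestion metadata.
raw = spark.read.format("csv").option("header", True).load("/landing/orders/")
(raw.withColumn("_ingested_at", F.current_timestamp())
    .write.format("delta").mode("append").saveAsTable("bronze.orders"))

# Silver: de-duplicate, conform types, and join reference data.
orders = spark.read.table("bronze.orders").dropDuplicates(["order_id", "product_id"])
products = spark.read.table("bronze.products")
(orders.join(products, "product_id", "left")
    .write.format("delta").mode("overwrite").saveAsTable("silver.order_items"))

# Gold: aggregate into analytics-ready outputs for the dashboard.
(spark.read.table("silver.order_items")
    .groupBy("department", "order_dow")
    .agg(F.countDistinct("order_id").alias("orders"),
         F.avg("reordered").alias("reorder_rate"))
    .write.format("delta").mode("overwrite").saveAsTable("gold.department_daily_sales"))
```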
Data Quality 🧪¶
Automated DQ framework that warns on or blocks bad data before it is written to tables (see the sketch after this list).
Leverages Databricks DQX to define expectation-based rules (e.g. completeness, uniqueness, patterns, ranges) at both row and column levels, enabling flexible validation pipelines.
Table health is continuously monitored using Lakehouse Monitoring, with support for profiling and drift detection over time.
Failures trigger Databricks SQL alerts, ensuring awareness of data quality issues.
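The project delegates rule evaluation to DQX, but the warn/block behaviour can be pictured with a hand-rolled check like the one below. The helper is hypothetical and is not DQX's API; it only illustrates the pattern of flagging rows versus refusing to write.

```python
from pyspark.sql import DataFrame, functions as F


def check_not_null(df: DataFrame, column: str, criticality: str = "error") -> DataFrame:
    """Hypothetical expectation-style check illustrating warn vs. block semantics."""
    n_failed = df.filter(F.col(column).isNull()).count()
    if n_failed == 0:
        return df
    if criticality == "error":
        # Block: refuse to write the batch; the raised error feeds the alerting path.
        raise ValueError(f"DQ check failed: {n_failed} null values in '{column}'")
    # Warn: keep the rows but tag them so downstream consumers can filter or monitor.
    return df.withColumn(f"_dq_warn_{column}_is_null", F.col(column).isNull())
```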
Data Modeling 🧩¶
Organizes the gold layer into relational star schemas using Kimball’s dimensional modeling approach.
Separates fact tables from dimension tables and promotes conformed dimensions for consistent analytics.
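A sketch of how a gold-layer star schema might be queried is shown below; the fact/dimension table names and surrogate keys (`product_key`, `date_key`) are hypothetical, chosen to illustrate the Kimball layout rather than mirror the project's exact model.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Fact table at the grain of one row per product per order.
fact_order_items = spark.read.table("gold.fact_order_items")

# Conformed dimensions shared across marts.
dim_product = spark.read.table("gold.dim_product")
dim_date = spark.read.table("gold.dim_date")

# A typical star join the dashboard might issue: reorder rate by department and weekday.
reorder_by_dept = (
    fact_order_items
    .join(dim_product, "product_key")
    .join(dim_date, "date_key")
    .groupBy("department", "day_of_week")
    .agg(F.avg("reordered").alias("reorder_rate"))
)
```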
CI / CD with GitHub Actions & Databricks Asset Bundles 🔁¶
Run unit tests & linting on pull requests.
Validate Databricks Asset Bundle YAML to catch infra/configuration errors early.
Use GitHub Actions to deploy bundles to a dev environment automatically on pushes to main.
Support manual, SHA-controlled deployment to prod environments.
Isolate environments (dev, prod) using environment-specific DAB targets and variable substitution.
Trigger post-deployment actions like running DDL operations (see the sketch after this list).
Auto-build and deploy Sphinx documentation to GitHub Pages on pushes to main, ensuring public docs stay current.
Leverage Git commit SHA or semantic version tags for traceability and rollback capabilities.
Promote to prod upon successful dev deployments.
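For the post-deployment DDL step referenced above, one possible shape is a small script run from the workflow using the `databricks-sql-connector`; the environment variable names and the statement here are placeholders, and the project may equally trigger DDL through a bundle-defined job.

```python
import os

from databricks import sql  # databricks-sql-connector

# Credentials come from the CI environment (e.g. GitHub Actions secrets), never the repo.
with sql.connect(
    server_hostname=os.environ["DATABRICKS_HOST"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
) as connection:
    with connection.cursor() as cursor:
        # Example post-deployment DDL: ensure the gold schema exists before jobs run.
        cursor.execute("CREATE SCHEMA IF NOT EXISTS gold")
```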
Testing & Code Quality 🧪¶
Includes unit tests and integration tests, validating transforms, quality logic, and infrastructure using Pytest.
Enforces linting with Ruff and static typing with MyPy.
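A minimal sketch of the kind of Pytest unit test described above, using a local SparkSession; the `dedupe_orders` transform is a stand-in defined inline here, whereas the real tests import transforms from the package.

```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # Local SparkSession so transform logic can be tested without a Databricks cluster.
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()


def dedupe_orders(df):
    # Hypothetical transform under test.
    return df.dropDuplicates(["order_id"])


def test_dedupe_orders_removes_duplicates(spark):
    df = spark.createDataFrame([(1, "milk"), (1, "milk"), (2, "bread")], ["order_id", "product"])
    assert dedupe_orders(df).count() == 2
```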
Documentation 📚¶
Fully Sphinx-documented (docs/) with autodoc configuration.
CI pipeline auto-deploys docs to GitHub Pages, ensuring public-facing documentation reflects the codebase.
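The autodoc setup lives in `docs/conf.py`; an illustrative excerpt is shown below. The extension list, path, and theme are typical choices and may differ from the project's actual configuration.

```python
# docs/conf.py (illustrative excerpt)
import os
import sys

# Make the package importable so autodoc can introspect docstrings.
sys.path.insert(0, os.path.abspath(".."))

project = "ETL-framework"

extensions = [
    "sphinx.ext.autodoc",   # pull API docs from docstrings
    "sphinx.ext.napoleon",  # support Google/NumPy-style docstrings
    "sphinx.ext.viewcode",  # link documented objects to their source
]

html_theme = "furo"  # assumed theme; the project's may differ
```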
Infrastructure as Code (DABs) 🧱¶
Databricks infrastructure is managed using Databricks Asset Bundles (DABs) for repeatable, declarative deployments.
Configurations define clusters, secrets, jobs, and workspace assets using YAML.
DABs integrate with GitHub Actions for automated validation and deployment as part of the CI/CD workflow.
Software Engineering Standards 💻¶
Configuration, constants, and logging are structured and centralized to promote maintainability, clarity, and consistent behavior across the codebase.
Abstractions like DeltaTable and DeltaWriter isolate complexity and support reusable logic across pipelines.
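As a rough sketch of what such an abstraction can look like (the project's actual `DeltaWriter`/`DeltaTable` interfaces may differ):

```python
from dataclasses import dataclass

from pyspark.sql import DataFrame


@dataclass
class DeltaWriter:
    """Simplified sketch of a writer abstraction; not the project's exact class."""

    catalog: str
    schema: str

    def write(self, df: DataFrame, table: str, mode: str = "append") -> None:
        # Centralising writes means options like schema evolution or partitioning
        # can be changed in one place and applied consistently across pipelines.
        (
            df.write.format("delta")
            .mode(mode)
            .option("mergeSchema", "true")
            .saveAsTable(f"{self.catalog}.{self.schema}.{table}")
        )


# Usage: DeltaWriter(catalog="dev", schema="silver").write(order_items_df, "order_items")
```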
Security & Governance 🔐¶
Sensitive credentials (e.g. API keys, database passwords) are never hardcoded and are securely managed via GitHub secrets. Secrets are injected at runtime using environment variables or Databricks secret scopes (see the sketch after this list).
Tables are defined and registered within Unity Catalog, enabling centralized data governance across workspaces. This allows for fine-grained access controls, lineage tracking, and auditability.
Permissions are set at the catalog, schema, and table level to restrict access based on least privilege principles (e.g. consumers only have read access to Gold; only SPNs or authorised users have write access).
Unity Catalog automatically tracks table lineage and access history, supporting traceability for all medallion-layered tables.
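The sketch below illustrates the secrets pattern referenced above: resolve credentials from a Databricks secret scope when running in the workspace, or from an environment variable injected by GitHub Actions otherwise. The scope, key, and variable names are placeholders.

```python
import os


def get_kaggle_api_key(dbutils=None) -> str:
    """Resolve a credential without hardcoding it anywhere in the codebase."""
    if dbutils is not None:
        # dbutils.secrets.get reads from a Databricks secret scope at runtime.
        return dbutils.secrets.get(scope="etl-framework", key="kaggle-api-key")
    # CI / local runs: the value is injected as an environment variable from GitHub secrets.
    return os.environ["KAGGLE_API_KEY"]


# Least-privilege grants are applied as SQL against Unity Catalog objects, e.g.:
# spark.sql("GRANT SELECT ON SCHEMA main.gold TO `dashboard_consumers`")
```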
Note: This is an ongoing demo framework. Some elements are partially implemented or stubbed for demonstration purposes. While ingestion is included to enable end-to-end flow, this project focuses more on downstream practices.