Tooling Overview

This project combines several tools to maintain code quality, automate deployments and process data efficiently.

Data Processing

  • PySpark & Delta Lake – the ETL jobs are written in PySpark and persist data in Delta tables for reliable, ACID-compliant storage (see the first sketch after this list).

  • Databricks Labs DQX – expectation-based data quality checks stop bad data from progressing through the pipeline (see the second sketch after this list).
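
The first sketch shows the write pattern the PySpark & Delta Lake bullet describes: a small PySpark transform persisted to a Delta table. The source path, table name and columns are hypothetical, not the project's actual schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source path and schema -- illustrative only.
raw = spark.read.json("/Volumes/raw/events")
cleaned = raw.dropDuplicates(["event_id"]).filter("event_ts IS NOT NULL")

# Writing in Delta format gives the ACID guarantees mentioned above:
# concurrent readers see a consistent snapshot, and a failed write
# never leaves partial data behind.
cleaned.write.format("delta").mode("append").saveAsTable("bronze.events")
```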
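
The second sketch shows DQX's expectation-based flow under stated assumptions: it follows the DQEngine metadata API from the DQX documentation, but the exact check fields vary between DQX releases, so treat the rule definition as an approximation rather than this project's actual checks.

```python
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a1", 10), (None, 20)], "event_id string, amount int"
)

# Metadata-style check definitions; the field names here are assumptions
# based on the DQX docs and may differ between DQX versions.
checks = [
    {
        "criticality": "error",
        "check": {"function": "is_not_null", "arguments": {"col_name": "event_id"}},
    }
]

dq_engine = DQEngine(WorkspaceClient())

# Rows failing an "error" check land in quarantine_df instead of
# progressing down the pipeline; valid_df continues to the next stage.
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(df, checks)
```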

Infrastructure & Deployment

  • Databricks Asset Bundles (DABs) – clusters, jobs and other workspace assets are defined as code; the databricks.yml bundle is validated and deployed through CI/CD (a bundle sketch follows this list).

  • GitHub Actions – CI workflows run tests, linting and bundle validation on every pull request and push to main (a workflow sketch follows this list).
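
A minimal databricks.yml sketch of the kind of bundle the first bullet describes; the bundle name, job, file path and workspace host are placeholders, not this project's actual configuration.

```yaml
# Illustrative bundle only -- names, paths and host are placeholders,
# and cluster settings are omitted for brevity.
bundle:
  name: my-etl-project

resources:
  jobs:
    nightly_etl:
      name: nightly-etl
      tasks:
        - task_key: run_pipeline
          spark_python_task:
            python_file: src/pipeline.py

targets:
  dev:
    mode: development
    workspace:
      host: https://example.cloud.databricks.com
```

In CI, databricks bundle validate checks the configuration and databricks bundle deploy -t <target> pushes it to a workspace.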
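
And a sketch of the kind of workflow the second bullet describes; the file name, runner, Python version and install step are assumptions rather than the project's actual workflow.

```yaml
# .github/workflows/ci.yml -- illustrative sketch; versions and commands
# are assumptions, not the project's actual workflow.
name: CI

on:
  pull_request:
  push:
    branches: [main]

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt  # hypothetical dependency file
      - run: ./lint.sh                        # Ruff + MyPy
      - run: pytest tests/
      - run: databricks bundle validate       # assumes the Databricks CLI is installed
```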

Testing & Code Quality

  • Pytest – unit and integration tests live under the tests/ directory (an example test follows this list).

  • Ruff – enforces style and formatting rules; it runs via the lint.sh script.

  • MyPy – performs static type checking, also invoked from lint.sh.
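
A hypothetical example of the kind of test pytest collects from tests/; dedupe_events is a stand-in defined inline, not real project code, which would normally be imported from the source tree.

```python
# tests/test_transforms.py -- hypothetical; dedupe_events is a stand-in
# for a real transform imported from the project.


def dedupe_events(rows: list[dict]) -> list[dict]:
    """Keep the first occurrence of each event_id."""
    seen: set[str] = set()
    out: list[dict] = []
    for row in rows:
        if row["event_id"] not in seen:
            seen.add(row["event_id"])
            out.append(row)
    return out


def test_dedupe_events_drops_duplicates() -> None:
    rows = [{"event_id": "a"}, {"event_id": "a"}, {"event_id": "b"}]
    assert dedupe_events(rows) == [{"event_id": "a"}, {"event_id": "b"}]
```

The type annotations double as targets for the MyPy checks that lint.sh runs.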

Documentation

  • Sphinx – together with the MyST parser, it renders the Markdown files in docs/ and builds the public documentation site (a minimal conf.py sketch follows).
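
A minimal docs/conf.py sketch showing how MyST is typically enabled; the project name and theme are placeholders, not this project's actual settings.

```python
# docs/conf.py -- minimal sketch; project metadata and theme are placeholders.
project = "my-etl-project"

# myst_parser lets Sphinx read the Markdown files in docs/ alongside
# any reStructuredText sources.
extensions = ["myst_parser"]

source_suffix = {
    ".rst": "restructuredtext",
    ".md": "markdown",
}

html_theme = "alabaster"
```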