Tooling Overview

This project combines several tools to maintain code quality, automate deployments and process data efficiently.

Data Processing

  • PySpark & Delta Lake – the ETL jobs are written in PySpark and persist data in Delta tables for reliable, ACID-compliant storage (see the first sketch after this list).

  • Databricks Labs DQX – expectation-based data quality checks stop bad data from progressing through the pipeline (see the second sketch after this list).
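
The first sketch shows the write pattern the PySpark & Delta Lake bullet describes: a small PySpark transform persisted to a Delta table. The source path, table name and columns are hypothetical, not the project's actual schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source path and schema -- illustrative only.
raw = spark.read.json("/Volumes/raw/events")
cleaned = raw.dropDuplicates(["event_id"]).filter("event_ts IS NOT NULL")

# Writing in Delta format gives the ACID guarantees mentioned above:
# concurrent readers see a consistent snapshot, and a failed write
# never leaves partial data behind.
cleaned.write.format("delta").mode("append").saveAsTable("bronze.events")
```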
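
The second sketch shows DQX's expectation-based flow under stated assumptions: it follows the DQEngine metadata API from the DQX documentation, but the exact check fields vary between DQX releases, so treat the rule definition as an approximation rather than this project's actual checks.

```python
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a1", 10), (None, 20)], "event_id string, amount int"
)

# Metadata-style check definitions; the field names here are assumptions
# based on the DQX docs and may differ between DQX versions.
checks = [
    {
        "criticality": "error",
        "check": {"function": "is_not_null", "arguments": {"col_name": "event_id"}},
    }
]

dq_engine = DQEngine(WorkspaceClient())

# Rows failing an "error" check land in quarantine_df instead of
# progressing down the pipeline; valid_df continues to the next stage.
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(df, checks)
```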

Infrastructure & Deployment

  • Databricks Asset Bundles (DABs) – clusters, jobs and other workspace assets are defined as code; the databricks.yml bundle is validated and deployed through CI/CD (a bundle sketch follows this list).

  • GitHub Actions – CI workflows run tests, linting and bundle validation on every pull request and push to main (a workflow sketch follows this list).
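
A minimal databricks.yml sketch of the kind of bundle the first bullet describes; the bundle name, job, file path and workspace host are placeholders, not this project's actual configuration.

```yaml
# Illustrative bundle only -- names, paths and host are placeholders,
# and cluster settings are omitted for brevity.
bundle:
  name: my-etl-project

resources:
  jobs:
    nightly_etl:
      name: nightly-etl
      tasks:
        - task_key: run_pipeline
          spark_python_task:
            python_file: src/pipeline.py

targets:
  dev:
    mode: development
    workspace:
      host: https://example.cloud.databricks.com
```

In CI, databricks bundle validate checks the configuration and databricks bundle deploy -t <target> pushes it to a workspace.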
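
And a sketch of the kind of workflow the second bullet describes; the file name, runner, Python version and install step are assumptions rather than the project's actual workflow.

```yaml
# .github/workflows/ci.yml -- illustrative sketch; versions and commands
# are assumptions, not the project's actual workflow.
name: CI

on:
  pull_request:
  push:
    branches: [main]

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt  # hypothetical dependency file
      - run: ./lint.sh                        # Ruff + MyPy
      - run: pytest tests/
      - run: databricks bundle validate       # assumes the Databricks CLI is installed
```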

Testing & Code Quality

  • Pytest – unit and integration tests live under the tests/ directory (an example test follows this list).

  • Ruff – enforces style and formatting rules; it runs via the lint.sh script.

  • MyPy – performs static type checking, also invoked from lint.sh.
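
A hypothetical example of the kind of test pytest collects from tests/; dedupe_events is a stand-in defined inline, not real project code, which would normally be imported from the source tree.

```python
# tests/test_transforms.py -- hypothetical; dedupe_events is a stand-in
# for a real transform imported from the project.


def dedupe_events(rows: list[dict]) -> list[dict]:
    """Keep the first occurrence of each event_id."""
    seen: set[str] = set()
    out: list[dict] = []
    for row in rows:
        if row["event_id"] not in seen:
            seen.add(row["event_id"])
            out.append(row)
    return out


def test_dedupe_events_drops_duplicates() -> None:
    rows = [{"event_id": "a"}, {"event_id": "a"}, {"event_id": "b"}]
    assert dedupe_events(rows) == [{"event_id": "a"}, {"event_id": "b"}]
```

The type annotations double as targets for the MyPy checks that lint.sh runs.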

Documentation

  • Sphinx – together with the MyST parser, it renders the Markdown files in docs/ and builds the public documentation site (a minimal conf.py sketch follows).
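
A minimal docs/conf.py sketch showing how MyST is typically enabled; the project name and theme are placeholders, not this project's actual settings.

```python
# docs/conf.py -- minimal sketch; project metadata and theme are placeholders.
project = "my-etl-project"

# myst_parser lets Sphinx read the Markdown files in docs/ alongside
# any reStructuredText sources.
extensions = ["myst_parser"]

source_suffix = {
    ".rst": "restructuredtext",
    ".md": "markdown",
}

html_theme = "alabaster"
```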