Brickflow Projects
Prerequisites¶
- Install locally (optional):
  - Python >= 3.8
- Configure the databricks cli cfg file: run `pip install databricks-cli` and then `databricks configure -t`, which will configure the databricks cli with a token.
- Install the brickflow cli.
Confirming the installation¶
- To confirm the setup, run the following command:
- Also confirm the connectivity to databricks:
- Or, if you have a specific profile:
Brickflow Projects Setup¶
Brickflow introduced projects in version 0.9.2 for managing monorepos with multiple projects or workflows that need to be deployed in groups. Projects help with the following:
- Managing state files and simplifying deployment.
- Cleaning up state.
- Resolving imports for Python modules in your repo.
Concepts¶
- Project - A project is a collection of workflows that are deployed together. A project is a folder with an entrypoint and a set of workflows.
- Workflow - A workflow is a collection of tasks that are deployed together which may be DLT pipelines, notebooks, wheels, jars, etc.
Monorepo Style¶
A monorepo style project is a repository with multiple folders and modules that can contain multiple brickflow projects.
Folder structure:
repo-root/
├── .git
├── projects/
│   ├── project_abc/
│   │   ├── lib/
│   │   │   ├── __init__.py
│   │   │   └── shared_functions.py
│   │   ├── workflows/
│   │   │   ├── __init__.py
│   │   │   ├── entrypoint.py
│   │   │   └── workflow_abc.py
│   │   ├── setup.py
│   │   └── .brickflow-project-root.yml
│   └── project_xyz/
│       ├── workflows_geo_b/
│       │   ├── entrypoint.py
│       │   └── workflow_xyz.py
│       ├── workflows_geo_a/
│       │   ├── entrypoint.py
│       │   └── workflow_xyz.py
│       └── .brickflow-project-root.yml
├── .gitignore
├── brickflow-multi-project.yml
└── README.md
- entrypoint.py: This is the entrypoint for your project. It is the file that will be used to identify all the workflows to be deployed.
- brickflow-multi-project.yml: This is the project file that will be generated by brickflow. It will contain the list of projects and a path to the project root config. This will be created in the git repository root (where your .git folder is).
Example for monorepo with multiple projects:
```yaml
project_roots:
  project_abc:
    root_yaml_rel_path: projects/project_abc
  project_xyz_geo_a:
    root_yaml_rel_path: projects/project_xyz
  project_xyz_geo_b:
    root_yaml_rel_path: projects/project_xyz
version: v1
```
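To make the relationship between project names and project root configs concrete, here is a minimal Python sketch. It assumes the YAML above has already been loaded into a plain dict; the function name `projects_by_root` is hypothetical and not part of Brickflow.

```python
from collections import defaultdict

# Mirrors the brickflow-multi-project.yml example above as a plain dict.
# In practice this would come from parsing the YAML file.
multi_project = {
    "project_roots": {
        "project_abc": {"root_yaml_rel_path": "projects/project_abc"},
        "project_xyz_geo_a": {"root_yaml_rel_path": "projects/project_xyz"},
        "project_xyz_geo_b": {"root_yaml_rel_path": "projects/project_xyz"},
    },
    "version": "v1",
}


def projects_by_root(config: dict) -> dict:
    """Group project names by the folder that holds their project root config."""
    grouped = defaultdict(list)
    for name, entry in config["project_roots"].items():
        grouped[entry["root_yaml_rel_path"]].append(name)
    return dict(grouped)


grouped = projects_by_root(multi_project)
for root, names in grouped.items():
    print(f"{root}/.brickflow-project-root.yml -> {names}")
```

The grouping shows that project_xyz_geo_a and project_xyz_geo_b share a single project root config, while project_abc has its own.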
- .brickflow-project-root.yml: This is the project root config file. It lists the projects under this project root and, for each one, the path to its workflows directory.
Example for monorepo with multiple projects for repo-root/projects/project_xyz/.brickflow-project-root.yml:
```yaml
# DO NOT MODIFY THIS FILE - IT IS AUTO GENERATED BY BRICKFLOW AND RESERVED FOR FUTURE USAGE
projects:
  project_xyz_geo_a:
    brickflow_version: auto # automatically determine the brickflow version based on cli version
    deployment_mode: bundle
    name: project_xyz_geo_a
    path_from_repo_root_to_project_root: projects/project_xyz # path from the repo root (where your .git folder is) to the project root
    path_project_root_to_workflows_dir: workflows_geo_a
  project_xyz_geo_b:
    brickflow_version: auto # automatically determine the brickflow version based on cli version
    deployment_mode: bundle
    name: project_xyz_geo_b
    path_from_repo_root_to_project_root: projects/project_xyz
    path_project_root_to_workflows_dir: workflows_geo_b
version: v1
```
The important fields are:
- path_from_repo_root_to_project_root: This is the path from the repo root (where your .git folder is) to the project root. It is used to find the entrypoint file.
- path_project_root_to_workflows_dir: This is the path from the project root to the workflows directory. It is used to find and load modules into Python, which is what makes imports work in your notebooks.
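As an illustration of how the two path fields compose, the following sketch joins them to locate an entrypoint file. This is one reading of the field names under the monorepo layout shown earlier, not Brickflow's actual lookup code; `entrypoint_path` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# One entry from the .brickflow-project-root.yml example, as a plain dict.
project = {
    "name": "project_xyz_geo_a",
    "path_from_repo_root_to_project_root": "projects/project_xyz",
    "path_project_root_to_workflows_dir": "workflows_geo_a",
}


def entrypoint_path(repo_root: str, cfg: dict) -> PurePosixPath:
    """Join repo root -> project root -> workflows dir -> entrypoint.py."""
    return (
        PurePosixPath(repo_root)
        / cfg["path_from_repo_root_to_project_root"]
        / cfg["path_project_root_to_workflows_dir"]
        / "entrypoint.py"
    )


resolved = entrypoint_path("repo-root", project)
print(resolved)
```

The resulting path matches the location of entrypoint.py under workflows_geo_a in the folder structure above.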
Polyrepo Style¶
In a polyrepo style setup, code is split across multiple repositories, and each repository can contain multiple brickflow projects.
Folder structure:
repo-root/
├── .git
├── src/
│   ├── lib/
│   │   ├── __init__.py
│   │   └── shared_functions.py
│   ├── workflows_a/
│   │   ├── __init__.py
│   │   ├── entrypoint.py
│   │   └── workflow_a.py
│   ├── workflows_b/
│   │   ├── __init__.py
│   │   ├── entrypoint.py
│   │   └── workflow_b.py
│   └── __init__.py
├── .gitignore
├── .brickflow-project-root.yml
├── brickflow-multi-project.yml
└── README.md
- entrypoint.py: This is the entrypoint for your project. It is the file that will be used to identify all the workflows to be deployed.
- brickflow-multi-project.yml: This is the project file that will be generated by brickflow. It will contain the list of projects and a path to the project root config. This will be created in the git repository root (where your .git folder is).
Example for polyrepo with multiple projects:
```yaml
project_roots:
  project_abc:
    root_yaml_rel_path: .
  project_abc_workflows_2:
    root_yaml_rel_path: .
  project_xyz:
    root_yaml_rel_path: .
version: v1
```
- .brickflow-project-root.yml: This is the project root config file. It lists the projects under this project root and, for each one, the path to its workflows directory.
Example for polyrepo with multiple projects:
```yaml
# DO NOT MODIFY THIS FILE - IT IS AUTO GENERATED BY BRICKFLOW AND RESERVED FOR FUTURE USAGE
projects:
  project_abc:
    brickflow_version: auto # automatically determine the brickflow version based on cli version
    deployment_mode: bundle
    name: project_abc
    path_from_repo_root_to_project_root: . # path from the repo root (where your .git folder is) to the project root
    path_project_root_to_workflows_dir: workflows
  project_abc_workflows_2:
    brickflow_version: auto # automatically determine the brickflow version based on cli version
    deployment_mode: bundle
    name: project_abc_workflows_2
    path_from_repo_root_to_project_root: .
    path_project_root_to_workflows_dir: workflows2
version: v1
```
The important fields are:
* path_from_repo_root_to_project_root: This is the path from the repo root (where your .git folder is) to the project root. It is used to find the entrypoint file.
* path_project_root_to_workflows_dir: This is the path from the project root to the workflows directory. It is used to find and load modules into Python, which is what makes imports work in your notebooks.
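To show why a known project root matters for imports, here is a small self-contained sketch. It is not how Brickflow itself resolves modules; `add_project_root_to_sys_path`, the `src` folder and the `demo_lib` package are hypothetical names used only for this demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def add_project_root_to_sys_path(repo_root: Path, rel_project_root: str) -> Path:
    """Hypothetical helper: make packages under the project root importable."""
    project_root = (repo_root / rel_project_root).resolve()
    if str(project_root) not in sys.path:
        sys.path.insert(0, str(project_root))
    return project_root


# Build a throwaway layout: <repo_root>/src/demo_lib/shared_functions.py
repo_root = Path(tempfile.mkdtemp())
pkg_dir = repo_root / "src" / "demo_lib"
pkg_dir.mkdir(parents=True)
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "shared_functions.py").write_text(
    "def greet():\n    return 'hello from demo_lib'\n"
)

# Once the project root is on sys.path, the package imports by name.
add_project_root_to_sys_path(repo_root, "src")
shared = importlib.import_module("demo_lib.shared_functions")
print(shared.greet())
```

Without the project root on the module search path, the same import would fail with ModuleNotFoundError, which is the kind of problem the path settings are meant to avoid.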
Initialize Project¶
The first step is to create a new project.
Warning
Make sure you are in the repository root (where your .git folder is) to do this! Otherwise you will run into validation issues.
Note
Please note that if you are an advanced user and understand the concepts of both files described above, you can manually create the files that brickflow projects add creates.
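Since the repository root is identified by the presence of the .git folder, a quick pre-flight check can be sketched in Python. The helpers `is_repo_root` and `find_repo_root` are hypothetical and are not part of the brickflow cli.

```python
import tempfile
from pathlib import Path
from typing import Optional


def is_repo_root(path: Path) -> bool:
    """Return True when `path` directly contains a .git entry."""
    return (path / ".git").exists()


def find_repo_root(start: Path) -> Optional[Path]:
    """Walk upwards from `start` until a folder containing .git is found."""
    for candidate in [start, *start.parents]:
        if is_repo_root(candidate):
            return candidate
    return None


# Demonstration on a throwaway directory tree.
repo = Path(tempfile.mkdtemp()) / "repo-root"
nested = repo / "projects" / "project_abc"
nested.mkdir(parents=True)
(repo / ".git").mkdir()

print(is_repo_root(repo))      # the folder that holds .git
print(is_repo_root(nested))    # a nested project folder
print(find_repo_root(nested))  # walks back up to repo-root
```

Calling `is_repo_root(Path.cwd())` before initializing a project is one way to catch the wrong working directory early.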
- Run the following command: `brickflow projects add`
- Update your .gitignore file with the correct directories to ignore. `.databricks` and `bundle.yml` should be ignored.
- It will prompt you for the following:
Project Name: # (1)
Path from repo root to project root (optional) [.]: # (2)
Path from project root to workflows dir: # (3)
Git https url: # (4)
Brickflow version [auto]: # (5)
Spark expectations version [0.8.0]: # (6)
Skip entrypoint [y/N]: # (7)
1. A name that is not already used. Please use only alphanumeric characters.
2. If you have a polyrepo, leave this as `.`. See the polyrepo and monorepo sections above for guidance.
3. See the polyrepo and monorepo sections above for guidance.
4. Used to populate the entrypoint and used for deployment to higher environments.
5. Auto, or hard code a specific version to be shipped with the project during deployment.
6. Only relevant if you want to use spark expectations. Visit spark-expectations for more information.
7. If you already have an entrypoint in that folder, you can skip this step.
Validating your project¶
- To test your configuration, run the following command:
- This will generate the following output at the end:
- This should create a bundle.yml file in your project root and it should contain all the information for your workflow.
- Anything else would indicate an error.
gitignore¶
- For now all the bundle.yml files will be code generated, so you can add bundle.yml (along with .databricks, as noted during project initialization) to your .gitignore file.
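The two entries this guide asks you to ignore, .databricks and bundle.yml, can be checked with a short script. This is a minimal sketch: `missing_ignore_entries` is a hypothetical helper, and it only looks for exact line matches rather than evaluating gitignore glob patterns.

```python
REQUIRED_ENTRIES = (".databricks", "bundle.yml")


def missing_ignore_entries(gitignore_text: str, required=REQUIRED_ENTRIES) -> list:
    """Return required entries that do not appear as their own .gitignore line."""
    lines = {
        line.strip()
        for line in gitignore_text.splitlines()
        if line.strip() and not line.strip().startswith("#")
    }
    return [entry for entry in required if entry not in lines]


# Sample .gitignore content that is missing the bundle.yml entry.
sample = "# python\n__pycache__/\n.databricks\n"
missing = missing_ignore_entries(sample)
print(missing)
```

An empty result means both entries are present. A pattern that covers the same files in a different form would still be reported as missing by this simple check.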
Deploying your Project¶
- To deploy the workflow, run the following command:
By default this will deploy to local.
Important
Keep in mind that environments are logical: your profile controls where the workflows are deployed, and your code may have business logic based on which environment you are on.
If you want to deploy to a higher environment you can use the following command:
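- dev: `bf projects deploy --project <project> -p <profile> -e dev`
- test: `bf projects deploy --project <project> -p <profile> -e test`
- prod: `bf projects deploy --project <project> -p <profile> -e prod`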
Deployments By Release Candidates or PRs¶
Sometimes you may want to deploy multiple RC branches into the same "test" environment. Your objective will be to:
- Deploy the workflows
- Run and test the workflows
- Destroy the workflows after confirming the tests pass
To do this you can use the BRICKFLOW_WORKFLOW_PREFIX and BRICKFLOW_WORKFLOW_SUFFIX environment variables.
- Doing it based on release candidates
BRICKFLOW_WORKFLOW_SUFFIX="0.1.0-rc1" bf projects deploy --project <project> -p <profile> -e test --force-acquire-lock # force acquire lock is optional
- Doing it based on PRs
BRICKFLOW_WORKFLOW_SUFFIX="0.1.0-pr34" bf projects deploy --project <project> -p <profile> -e test --force-acquire-lock # force acquire lock is optional
Make sure you destroy any deployments made with a prefix or suffix. They are considered independent deployments and have their own state.
- Doing it based on release candidates
BRICKFLOW_WORKFLOW_SUFFIX="0.1.0-rc1" bf projects destroy --project <project> -p <profile> -e test --force-acquire-lock # force acquire lock is optional
- Doing it based on PRs
BRICKFLOW_WORKFLOW_SUFFIX="0.1.0-pr34" bf projects destroy --project <project> -p <profile> -e test --force-acquire-lock # force acquire lock is optional
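For CI pipelines, the deploy and destroy pair can be assembled programmatically so that both always use the same suffix. The sketch below only builds the argument list and environment, using the flags shown in the commands above; `build_projects_command`, `my_project` and `my_profile` are hypothetical names, and nothing is executed.

```python
import os
import shlex


def build_projects_command(action, project, profile, env, suffix=None, force_lock=False):
    """Return (argv, environment) for `bf projects deploy` or `bf projects destroy`."""
    if action not in ("deploy", "destroy"):
        raise ValueError(f"unsupported action: {action}")
    argv = ["bf", "projects", action, "--project", project, "-p", profile, "-e", env]
    if force_lock:
        argv.append("--force-acquire-lock")  # optional, as noted above
    environment = dict(os.environ)
    if suffix is not None:
        environment["BRICKFLOW_WORKFLOW_SUFFIX"] = suffix
    return argv, environment


# The same suffix is used for both steps so destroy targets the same deployment.
suffix = "0.1.0-rc1"
deploy_argv, deploy_env = build_projects_command(
    "deploy", "my_project", "my_profile", "test", suffix=suffix
)
destroy_argv, destroy_env = build_projects_command(
    "destroy", "my_project", "my_profile", "test", suffix=suffix
)
print(shlex.join(deploy_argv))
print(shlex.join(destroy_argv))
```

The returned pair can be handed to `subprocess.run(argv, env=environment, check=True)` once the tests for the release candidate have passed or failed.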
Destroying your project¶
- To destroy the workflow, run the following command: