WIP...
Materials for the Deploy and Monitor ML Pipelines with Python, Docker, and GitHub Actions workshop at the PyData NYC 2024 Conference.
When 📆: Wednesday, November 6th, 10:50 AM EST
The workshop is based on the LinkedIn Learning course - Data Pipeline Automation with GitHub Actions; the code is available here.
Description: The workshop will focus on different deployment designs for machine learning pipelines using open-source applications and free-tier tools. Working with live data on the US hourly demand for electricity from the EIA API, we will learn how to deploy a data and ML pipeline with Python, GitHub Actions, and Docker. This includes open-source tools such as Nixtla's mlforecast library to create the forecast, and MLflow and pytest to monitor the health of the data and the model's performance. Last but not least, we will use a Quarto doc to set up the monitoring dashboard and deploy it on GitHub Pages.
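As a taste of the monitoring piece, data health checks with pytest can be as lightweight as asserting a few properties of the refreshed series. The sketch below is illustrative only: the column names follow the EIA API response fields, and the CSV path is a placeholder, not the workshop's actual test suite.

```python
# Illustrative pytest data health checks; the CSV path is a placeholder and the
# checks/thresholds are assumptions, not the workshop's actual tests.
import pandas as pd
import pytest

EXPECTED_SUBREGIONS = {"ZONA", "ZONB", "ZONC", "ZOND", "ZONE", "ZONF",
                       "ZONG", "ZONH", "ZONI", "ZONJ", "ZONK"}

@pytest.fixture
def data() -> pd.DataFrame:
    # Placeholder path - point this at wherever the data pipeline writes its output
    return pd.read_csv("data/ny_demand.csv")

def test_no_missing_values(data):
    assert data["value"].notna().all(), "Missing demand values in the latest refresh"

def test_all_subregions_present(data):
    assert EXPECTED_SUBREGIONS.issubset(set(data["subba"])), "One or more subregions are missing"

def test_values_are_non_negative(data):
    assert (data["value"] >= 0).all(), "Demand values should not be negative"
```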
(Image: New York City, created with Midjourney)
To organize and track the project requirements, we will set up a GitHub Project, create general milestones, and use issues to define sub-milestones. For setting up a data/ML pipeline, we will define the following milestones:
- Define Scope and Requirements:
  - General project scope
  - Pipeline scope:
    - Data pipeline
    - Forecasting pipeline
  - Set up a development environment:
    - Set up a Docker image
    - Update the Dev Containers settings
- Data Backfill:
  - Set up a data backfill process
  - Deploy the process in a Quarto doc
  - Productionize - add unit tests
- Data Pipeline:
  - Create a pipeline schema/draft
  - Build a prototype
  - Test deployment on GitHub Actions
  - Productionize the pipeline - add unit tests
- ML Pipeline:
  - Set up the experiment
  - Set up a backtesting function
  - Define forecasting models
  - Test and evaluate the models' performance
  - Select the best model for deployment
  - Deploy the pipeline with a Quarto doc to GitHub Actions
- Dashboard:
  - Define the dashboard scope
  - Set up a Quarto/Shinylive dashboard
- Documentation:
  - Set up project documentation
  - Slides
The milestones are available in the repository issues section, and you can track them on the project tracker.
(Screenshot: the project tracker)
- Goal: Forecast the hourly demand for electricity in the New York Independent System Operator subregion (NYIS). This includes the following 11 providers:
  - West - NYIS (ZONA)
  - Genesee - NYIS (ZONB)
  - Central - NYIS (ZONC)
  - North - NYIS (ZOND)
  - Mohawk Valley - NYIS (ZONE)
  - Capital - NYIS (ZONF)
  - Hudson Valley - NYIS (ZONG)
  - Millwood - NYIS (ZONH)
  - Dunwoodie - NYIS (ZONI)
  - New York City - NYIS (ZONJ)
  - Long Island - NYIS (ZONK)
- Forecast Horizon: 24 hours
- Refresh: Every 24 hours
The data is available on the EIA API, and the API dashboard provides the GET settings for pulling the above series.
(Screenshot: the GET request details from the EIA API dashboard)
Below is the header of the GET request:
```json
{
    "frequency": "hourly",
    "data": [
        "value"
    ],
    "facets": {
        "parent": [
            "NYIS"
        ],
        "subba": [
            "ZONA",
            "ZONB",
            "ZONC",
            "ZOND",
            "ZONE",
            "ZONF",
            "ZONG",
            "ZONH",
            "ZONI",
            "ZONJ",
            "ZONK"
        ]
    },
    "start": null,
    "end": null,
    "sort": [
        {
            "column": "period",
            "direction": "desc"
        }
    ],
    "offset": 0,
    "length": 5000
}
```
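For reference, pulling this series with Python could look like the sketch below. The endpoint path and the query-parameter encoding are assumptions based on the EIA API v2 conventions (copy the exact URL that the API dashboard generates if it differs), and an EIA_API_KEY environment variable is expected.

```python
# Sketch of a GET request for the NYIS hourly demand series from the EIA API v2.
# The endpoint path and parameter names mirror the request settings above and
# may need adjusting to match the URL generated by the API dashboard.
import os
import pandas as pd
import requests

API_PATH = "https://api.eia.gov/v2/electricity/rto/region-sub-ba-data/data/"

params = {
    "api_key": os.environ["EIA_API_KEY"],
    "frequency": "hourly",
    "data[]": "value",
    "facets[parent][]": "NYIS",
    "facets[subba][]": ["ZONA", "ZONB", "ZONC", "ZOND", "ZONE", "ZONF",
                        "ZONG", "ZONH", "ZONI", "ZONJ", "ZONK"],
    "sort[0][column]": "period",
    "sort[0][direction]": "desc",
    "offset": 0,
    "length": 5000,
}

response = requests.get(API_PATH, params=params)
response.raise_for_status()
data = pd.DataFrame(response.json()["response"]["data"])
print(data.head())
```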
To support this workflow, the project requires the following components:
- Functions:
  - Data backfill function
  - Data refresh function
  - Forecast function (see the sketch after this list)
  - Metadata function
- Docker image
- EIA API key
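For instance, the forecast function could be built around Nixtla's mlforecast library along the lines of the sketch below; the model choice, lags, and date features are illustrative rather than the workshop's final configuration.

```python
# Sketch of a forecast function using Nixtla's mlforecast library.
# The regressor, lags, and date features are illustrative choices.
import pandas as pd
from sklearn.linear_model import LinearRegression
from mlforecast import MLForecast

def create_forecast(data: pd.DataFrame, h: int = 24) -> pd.DataFrame:
    """Fit a model per subregion and return an h-hour-ahead forecast.

    Expects `data` with the mlforecast default columns:
    unique_id (subregion), ds (timestamp), and y (demand).
    """
    fcst = MLForecast(
        models=[LinearRegression()],        # illustrative model choice
        freq="h",                           # hourly series
        lags=[1, 24, 48],                   # illustrative lag features
        date_features=["hour", "dayofweek"],
    )
    fcst.fit(data)
    return fcst.predict(h=h)
```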
To make the deployment to GitHub Actions seamless, we will use Docker. This will enable us to ship our code to GitHub Actions using the same environment we used to develop and test it. In addition, we will set up a local development environment with VS Code and the Dev Containers extension. We will use the below Dockerfile to set up the environment:
```Dockerfile
FROM python:3.10-slim AS builder

ARG QUARTO_VER="1.5.56"
ARG VENV_NAME="my_project"
ENV QUARTO_VER=$QUARTO_VER
ENV VENV_NAME=$VENV_NAME

RUN mkdir requirements
COPY install_requirements.sh requirements/
COPY requirements.txt requirements/
RUN bash ./requirements/install_requirements.sh $VENV_NAME

FROM python:3.10-slim

ARG QUARTO_VER="1.5.56"
ARG VENV_NAME="my_project"
ENV QUARTO_VER=$QUARTO_VER
ENV VENV_NAME=$VENV_NAME

COPY --from=builder /opt/$VENV_NAME /opt/$VENV_NAME
COPY install_requirements.sh install_quarto.sh install_dependencies.sh requirements/
RUN bash ./requirements/install_dependencies.sh
RUN bash ./requirements/install_quarto.sh $QUARTO_VER
RUN echo "source /opt/$VENV_NAME/bin/activate" >> ~/.bashrc
```
To keep the image size as minimal as possible, we use the Python slim image as our baseline along with a multi-stage build approach. More details about multi-stage builds are available in the Docker documentation and this tutorial.
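The actual helper scripts live under the docker folder; conceptually, install_requirements.sh creates the virtual environment under /opt and installs the Python dependencies, roughly along the lines of this minimal sketch (not the exact script used in the workshop):

```bash
#!/bin/bash
# Minimal sketch of install_requirements.sh: create a virtual environment under
# /opt and install the dependencies listed in requirements.txt (paths mirror the
# COPY commands in the Dockerfile above).
VENV_NAME=$1

python -m venv /opt/$VENV_NAME \
    && . /opt/$VENV_NAME/bin/activate \
    && pip install --no-cache-dir -r ./requirements/requirements.txt
```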
We will use the below Bash script (build_image.sh) to build and push the image to Docker Hub:
```bash
#!/bin/bash

# Image settings
user_name=rkrispin
image_label=pydata_ny_workshop
image_tag=0.0.2
image_name=$user_name/$image_label:$image_tag
venv_name="pydata-ny-workshop"
quarto_ver="1.5.56"   # Quarto version passed to the Dockerfile build argument

echo "Build the docker"

docker build . -f Dockerfile \
    --progress=plain \
    --build-arg QUARTO_VER=$quarto_ver \
    --build-arg VENV_NAME=$venv_name \
    -t $image_name

if [[ $? = 0 ]] ; then
    echo "Pushing docker..."
    # docker push $image_name
else
    echo "Docker build failed"
fi
```
The Dockerfile and its supporting files are under the docker folder.
Note: GitHub Actions, by default, does not support ARM64 processors, only AMD64 (e.g., Intel). Therefore, if you are using Apple Silicon (M1/M2/M3) or any other ARM64-based machine, you will have to use Docker Buildx or a similar tool to build the image for the AMD64 architecture.
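For example, a Buildx command along the following lines (reusing the variables from the build script above) can target AMD64 from an ARM64 machine; the flags shown are standard Buildx options, and whether you push directly is up to you:

```bash
# Build (and optionally push) an AMD64 image from an ARM64 machine with Docker Buildx,
# reusing the variables defined in build_image.sh above.
docker buildx build . -f Dockerfile \
    --platform linux/amd64 \
    --build-arg QUARTO_VER=$quarto_ver \
    --build-arg VENV_NAME=$venv_name \
    -t $image_name \
    --push
```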
We will use the following devcontainer.json file to set up the development environment:
```json
{
    "name": "PyData NY 2024 Workshop",
    "image": "docker.io/rkrispin/pydata_ny_workshop:0.0.2",
    "customizations": {
        "vscode": {
            "settings": {
                "python.defaultInterpreterPath": "/opt/pydata-ny-workshop/bin/python3",
                "python.selectInterpreter": "/opt/pydata-ny-workshop/bin/python3"
            },
            "extensions": [
                // Container
                "ms-azuretools.vscode-docker",
                "ms-vscode-remote.remote-containers",
                // Python
                "ms-python.python",
                "ms-python.vscode-pylance",
                // HTML
                "ms-vscode.live-server",
                // Documents
                "yzhang.markdown-all-in-one",
                "quarto.quarto",
                "redhat.vscode-yaml",
                "ms-toolsai.jupyter",
                // Code
                "oderwat.indent-rainbow",
                // Data
                "mechatroner.rainbow-csv",
                "ms-toolsai.datawrangler"
            ]
        }
    },
    "remoteEnv": {
        "EIA_API_KEY": "${localEnv:EIA_API_KEY}"
    }
}
```
If you want to learn more about setting up a dockerized development environment with the Dev Containers extension, please check the Python and R tutorials.
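For context, a scheduled GitHub Actions workflow that runs a pipeline step inside this Docker image could look roughly like the sketch below. The workflow name, schedule, and script path are placeholders rather than the workshop's actual workflow files; the EIA API key is read from the repository secrets, mirroring the remoteEnv setting above.

```yaml
# Illustrative workflow sketch: run a pipeline step inside the workshop image on a
# daily schedule. The script path is a placeholder.
name: data_refresh
on:
  schedule:
    - cron: "0 0 * * *"   # every 24 hours, matching the refresh requirement
  workflow_dispatch:

jobs:
  refresh:
    runs-on: ubuntu-latest
    container:
      image: docker.io/rkrispin/pydata_ny_workshop:0.0.2
    env:
      EIA_API_KEY: ${{ secrets.EIA_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - name: Refresh the data
        run: |
          source /opt/pydata-ny-workshop/bin/activate
          python ./scripts/data_refresh.py   # placeholder script path
```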
- Docker documentation: https://docs.docker.com/
- Dev Containers Extension: https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers
- GitHub Actions documentation: https://docs.github.com/en/actions
This tutorial is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.