MLFlow is a powerful, open-source platform for managing the entire machine learning (ML) lifecycle. It provides tools for tracking experiments, packaging code, and deploying models. By deploying MLFlow on a Kubernetes cluster, you gain scalability, reliability, and GPU support for ML workloads.

This guide provides a detailed, step-by-step walkthrough for setting up MLFlow on a Kubernetes cluster.

Prerequisites

Before starting, ensure you have the following:

  • An Ubuntu 22.04 Cloud GPU Server.
  • The CUDA Toolkit, cuDNN, and Helm installed.
  • Root or sudo privileges.
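
Since the cluster will schedule GPU workloads later, it is worth confirming up front that the NVIDIA driver is working on the host:

nvidia-smi

The command should list the installed GPU(s); if it fails, revisit the driver and CUDA Toolkit installation before continuing.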

Step 1: Install and Configure K3s

K3s is a lightweight Kubernetes distribution that is ideal for quick setups. Install it using the following command:

curl -sfL https://get.k3s.io | sh -

After installation, create a kubeconfig directory and copy the K3s configuration file into it so kubectl can interact with the cluster:

mkdir -p ~/.kube
cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
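
If you plan to run kubectl as a non-root user, you may also need to point kubectl at this file and take ownership of it; a minimal sketch:

export KUBECONFIG=~/.kube/config
sudo chown $(id -u):$(id -g) ~/.kube/config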

Confirm the Kubernetes cluster is up and running by checking the node status:

kubectl get nodes

Expected output:

NAME     STATUS   ROLES                  AGE   VERSION
ubuntu   Ready    control-plane,master   9s    v1.31.3+k3s1

Step 2: Install NVIDIA Container Toolkit

To enable GPU support in your Kubernetes cluster, install the NVIDIA container toolkit:

1. Add the NVIDIA repository and import its GPG key:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

2. Update and install the toolkit:

apt-get update
apt-get install -y nvidia-container-toolkit

3. Verify the installation:

nvidia-container-cli --version

Output.

cli-version: 1.17.3
lib-version: 1.17.3
build date: 2024-12-04T09:47+00:00

4. Deploy the NVIDIA device plugin, which allows Kubernetes to schedule GPU workloads in your cluster:

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml
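
To confirm that the device plugin has registered the GPU with Kubernetes, check the node's allocatable resources:

kubectl describe nodes | grep nvidia.com/gpu

Optionally, run a short test pod that requests a GPU. The manifest below is a minimal sketch; the CUDA image tag is only an example and may need to match your driver version:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

Save this as gpu-test.yaml, apply it with kubectl apply -f gpu-test.yaml, and check kubectl logs gpu-test; the log should show the same nvidia-smi table as on the host.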

Step 3: Configure Persistent Storage for MLFlow

MLFlow requires persistent storage to save experiment data and models. Create and configure a Persistent Volume (PV) and Persistent Volume Claim (PVC):

1. Create a YAML file for the PV and PVC configuration:

nano mlflow-pv-pvc.yaml

Add the following content:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: mlflow-pv
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt/data"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mlflow-pvc
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

2. Apply the configuration:

kubectl apply -f mlflow-pv-pvc.yaml

3. Verify the PV and PVC:

kubectl get pv

Output.

NAME        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
mlflow-pv   10Gi       RWO            Retain           Available           manual                                   6s

kubectl get pvc

Output.

NAME         STATUS   VOLUME      CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
mlflow-pvc   Bound    mlflow-pv   10Gi       RWO            manual                          15
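
Note: The hostPath volume above stores data under /mnt/data on the node. If that directory does not already exist, create it before any pod mounts the volume:

mkdir -p /mnt/data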

Step 4: Deploy MLFlow Using Helm

1. Add the Helm chart repository for MLFlow:

helm repo add community-charts https://community-charts.github.io/helm-charts

2. Update the Helm repository.

helm repo update

3. Install MLFlow using Helm:

helm install atlantic community-charts/mlflow

Output.

NAME: atlantic
LAST DEPLOYED: Wed Dec 18 09:43:10 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
  Get the application URL by running these commands:
  export POD_NAME=$(kubectl get pods --namespace default -l "app.kubernetes.io/name=mlflow,app.kubernetes.io/instance=atlantic" -o jsonpath="{.items[0].metadata.name}")
  export CONTAINER_PORT=$(kubectl get pod --namespace default $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
  echo "Visit http://127.0.0.1:8080 to use your application"
  kubectl --namespace default port-forward $POD_NAME 8080:$CONTAINER_PORT

4. Verify that MLFlow has been successfully deployed:

kubectl get deployments

Output.

NAME              READY   UP-TO-DATE   AVAILABLE   AGE
atlantic-mlflow   1/1     1            1           70s
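
If the deployment does not report READY 1/1, inspect the MLFlow pod and its logs; the label selector below matches the release name used above:

kubectl get pods -l app.kubernetes.io/instance=atlantic
kubectl logs deployment/atlantic-mlflow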

Step 5: Expose MLFlow Service

1. To make MLFlow accessible, create a service:

nano mlflow-service.yaml

Add the following configuration:

apiVersion: v1
kind: Service
metadata:
  name: mlflow-service
spec:
  selector:
    app: mlflow
  ports:
    - protocol: TCP
      port: 80
      targetPort: 5000
      nodePort: 30001  # Optional, or let Kubernetes assign a random NodePort
  type: NodePort

2. Deploy the service.

kubectl apply -f mlflow-service.yaml

3. Check the services running in your cluster.

kubectl get services

Output.

NAME              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
atlantic-mlflow   ClusterIP   10.43.183.234           5000/TCP       28m
kubernetes        ClusterIP   10.43.0.1               443/TCP        35m
mlflow-service    NodePort    10.43.158.142           80:30001/TCP   10m

Note: Record the CLUSTER-IP of mlflow-service (10.43.158.142 in this example); you will need it in the next step.
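
The Service above assumes the MLFlow pods carry the label app: mlflow. If requests to the CLUSTER-IP go unanswered, compare the pod labels with the selector and confirm the service has endpoints:

kubectl get pods --show-labels
kubectl get endpoints mlflow-service

If no endpoints are listed, adjust the selector in mlflow-service.yaml to match the labels reported on the MLFlow pod and re-apply the file.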

Step 6: Run the MLFlow Experiment

1. Create a directory for your models.

mkdir Models
cd Models

2. Install the required Python packages.

pip install mlflow scikit-learn shap matplotlib

3. Set environment variables.

export MLFLOW_EXPERIMENT_NAME='my-sample-experiment'
export MLFLOW_TRACKING_URI='http://10.43.158.142'

Note: Replace 10.43.158.142 with the CLUSTER-IP of mlflow-service shown in the previous step.

4. Create a Python script (main.py) and add your ML experiment code.

nano main.py

Add the following code:

# Import Libraries

import os

import numpy as np
import shap
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

import mlflow
from mlflow.artifacts import download_artifacts
from mlflow.tracking import MlflowClient

# Prepare the Training Data

X, y = load_diabetes(return_X_y=True, as_frame=True)
X = X.iloc[:50, :4]
y = y.iloc[:50]

# Train a model

model = LinearRegression()
model.fit(X, y)

# Log an explanation

with mlflow.start_run() as run:
    mlflow.shap.log_explanation(model.predict, X)

# List Artifacts

client = MlflowClient()
artifact_path = "model_explanations_shap"
artifacts = [x.path for x in client.list_artifacts(run.info.run_id, artifact_path)]
print("# artifacts:")
print(artifacts)

# Load the logged explanation

dst_path = download_artifacts(run_id=run.info.run_id, artifact_path=artifact_path)
base_values = np.load(os.path.join(dst_path, "base_values.npy"))
shap_values = np.load(os.path.join(dst_path, "shap_values.npy"))

# Show a Force Plot

shap.force_plot(float(base_values), shap_values[0, :], X.iloc[0, :], matplotlib=True)

5. Run the script.

python3 main.py

Output.

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 249.00it/s]
🏃 View run persistent-frog-660 at: http://10.43.158.142/#/experiments/1/runs/e19dbaac59fe452bab13217fcc9aac49
🧪 View experiment at: http://10.43.158.142/#/experiments/1
# artifacts:
['model_explanations_shap/base_values.npy', 'model_explanations_shap/shap_values.npy', 'model_explanations_shap/summary_bar_plot.png']
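
The script above logs only SHAP explanation artifacts. The same tracking server can also record parameters, metrics, and models; the snippet below is a minimal, self-contained sketch in which the parameter and metric names are illustrative:

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Reuses the MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_NAME environment variables set earlier
X, y = load_diabetes(return_X_y=True, as_frame=True)

with mlflow.start_run():
    alpha = 0.5  # illustrative hyperparameter
    model = Ridge(alpha=alpha)
    model.fit(X, y)

    # Record the hyperparameter, a training metric, and the fitted model
    mlflow.log_param("alpha", alpha)
    mlflow.log_metric("train_mse", mean_squared_error(y, model.predict(X)))
    mlflow.sklearn.log_model(model, "model")

After running it, the new run appears under the same experiment in the MLFlow UI.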

Step 7: Copy Kubeconfig to the Local Machine

To manage your Kubernetes cluster remotely, copy the kubeconfig file from the server to your local machine.

1. Create a .kube directory on your local machine.

mkdir -p ~/.kube

2. Copy the kubeconfig file from your server.

scp root@server-ip:/root/.kube/config ~/.kube/config

3. Edit the kubeconfig file.

nano .kube/config

Find the following line:

server: https://127.0.0.1:6443

And replace it with the following:

server: https://your-server-ip:6443

4. Verify connectivity.

kubectl get services

Output.

NAME              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
atlantic-mlflow   ClusterIP   10.43.183.234           5000/TCP       28m
kubernetes        ClusterIP   10.43.0.1               443/TCP        35m
mlflow-service    NodePort    10.43.158.142           80:30001/TCP   10m
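
If the command times out, confirm that the Kubernetes API port is reachable from your local machine, for example:

nc -vz your-server-ip 6443

You may need to open port 6443 in the server's firewall or security group.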

Step 8: Access MLFlow UI

To access the MLFlow UI from your local machine, forward the MLFlow service port locally:

kubectl port-forward svc/mlflow-service 8880:80

Output.

Forwarding from 127.0.0.1:8880 -> 5000
Forwarding from [::1]:8880 -> 5000

Now, open your web browser and access the MLFlow UI at http://127.0.0.1:8880/#/experiments/1
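
Because mlflow-service was created with type NodePort on port 30001, the UI should also be reachable directly at http://your-server-ip:30001 without port forwarding, provided that port is open in your firewall. A quick check from your local machine:

curl -I http://your-server-ip:30001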

Conclusion

You’ve successfully deployed MLFlow on a Kubernetes cluster, configured it for GPU support, and run a sample experiment. This setup can be extended to manage and track ML experiments for production-scale applications. Try it today on GPU hosting from Atlantic.Net!