If you are in a workshop, start here.
You will need an OpenShift cluster with OpenShift AI and NVIDIA GPUs.
- For Red Hatters, provision a demo environment from the demo catalog, e.g. RHOAI on OCP on AWS with NVIDIA GPUs.
- For attendees, this is already set up for you - start here.
- Deploy a generative AI model with a serving endpoint
  - Option 1: Modelcar - Pre-packaged LLM in a container
  - Option 2: Object store - Any model from HuggingFace
- Create and use a custom workbench with the deployed inference server
- Q&A with Retrieval-Augmented Generation
- Agentic AI and MCP Server
- Monitor the deployed model
The purpose of this guide is to offer the simplest steps for deploying a privately hosted AI model on Red Hat OpenShift AI. It covers deploying a Red Hat certified Qwen3 model using the vLLM ServingRuntime for KServe on NVIDIA GPUs. It also covers setting up an MCP server and agentic AI, along with a simple observability stack so that you can collect and visualize metrics related to AI model performance and GPU utilization.
Option 1, Modelcar, explores the method of using a pre-built container with a serving endpoint. This is the easiest way to get an LLM running on Red Hat OpenShift AI.
- Create a workspace by going to your OpenShift AI portal. Go to Data Science Projects and create a project.

- Now, in a new browser tab, navigate to https://quay.io/repository/redhat-ai-services/modelcar-catalog to get the modelcar models.
- Go to the container tags page and select a model you want to use. In this example, we will use qwen3-4b.
- Click the download/save button to reveal the tags, then select any tag to reveal the URI.

- Copy the URL from quay.io onwards.
- Next, deploy the model. Go to the Models tab within your Data Science Project on Red Hat OpenShift AI and select single-model serving:

- Fill in a name, make sure to choose the NVIDIA GPU serving runtime, and set the deployment mode to Standard.

- Remember to check the box to secure the LLM model endpoint that you are about to deploy.

- Select connection type URI - v1 and give it a name. A good practice is to name it after the model you are about to deploy.
- Next, prefix the URI with oci://

oci://quay.io/redhat-ai-services/modelcar-catalog:qwen3-4b

Note: If you face resource problems, try selecting a smaller model, for example qwen2.5-0.5b:

oci://quay.io/redhat-ai-services/modelcar-catalog:qwen2.5-0.5b-instruct

- Your deployment will look something like this.
- Model deployment name: Name of the deployed model
- Serving runtime: vLLM NVIDIA GPU ServingRuntime for KServe
- Model server size: You can select whatever size you wish; for this guide we will keep the small size
- Accelerator: Select NVIDIA GPU
- Model route: Check the box for "Make deployed models available through an external route". This enables sending requests to the model endpoint from outside the cluster.
- Token authentication: Check the box for "Require token authentication". This means sending requests to the model endpoint requires a token, which is important for security. You can leave the service account name as default-name.
- Wait for the model to finish deploying and the status to turn green.
Note: It may take up to a few minutes depending on the model size.
- In a production deployment, you might want to adjust the vLLM arguments to fit your use case, for example to increase the context length or apply a certain quantization. Every use case is different and there is no single configuration that fits all; a sketch of some commonly tuned flags follows below.
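For illustration only, the line below shows the kind of flags you might add to the vLLM arguments field of the deployment (the same field we edit later in this guide to enable tool calling). The values are placeholders to adapt, not recommendations: --max-model-len raises the context window, and --gpu-memory-utilization caps how much GPU memory vLLM claims.

--max-model-len 8192 --gpu-memory-utilization 0.90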
- Congratulations, you have just deployed your very first LLM!
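Before moving on, you can optionally sanity-check the endpoint from a terminal. The sketch below assumes a hypothetical external route, the token you created during deployment, and the deployment name used in this example as the model id; replace all three with your own values.

# Hypothetical values - substitute your own route, token, and model name
export LLM_URL="https://<your-model-route>"
export LLM_TOKEN="<your-model-token>"
# List the served model(s)
curl -sk -H "Authorization: Bearer $LLM_TOKEN" "$LLM_URL/v1/models"
# Send a simple chat completion request
curl -sk -H "Authorization: Bearer $LLM_TOKEN" -H "Content-Type: application/json" \
  -d '{"model": "qwen3-4b", "messages": [{"role": "user", "content": "Hello!"}]}' \
  "$LLM_URL/v1/chat/completions"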
If you have completed the option above, you can skip this section. If you want to learn another method, carry on.
This option explores deploying any HuggingFace model with an object store. Here, an object store (MinIO) is deployed to act as model storage. You will upload a model of your choice to MinIO and have OpenShift AI pull the model from that storage.
MinIO is a high-performance, S3-compatible object store. It can be deployed on a wide variety of platforms.
- We will be deploying MinIO in a project namespace.
- From the console, click the CLI icon and start a terminal session.
Note: Configure the terminal timeout option to be at least 2 hours, otherwise the terminal will be killed before your model download finishes.
Note: If you are using your own computer, this is not required. On a Mac, install the OpenShift CLI via brew; look up how to do so.
git clone https://github.com/cbtham/rhoai-genai-workshop.git && cd rhoai-genai-workshop
oc new-project minio
oc apply -f minio-setup.yaml -n minio

- By default, the size of the storage is 150 GB (see line 11 of minio-setup.yaml). Change it if you need to; it is not necessary for this guide.
- If you want to, edit lines 21-22 of minio-setup.yaml to change the default user/password.
- Give it a minute, and there should now be a running minio pod:
- Two routes will be created in the Networking tab:
Navigate to https://huggingface.co/ and find the model you would like to deploy.
In this guide, we will be deploying a Red Hat certified model from https://huggingface.co/RedHatAI. We will use the Qwen3-4B-quantized.w4a16 model.
First you need to generate an access token:
- Navigate to settings -> access tokens
- Select create new token
- For token type, select Read and then give it a name
- Copy the token
Now that you have a token, you can download the model.
git clone https://<user_name>:<token>@huggingface.co/<repo_path>
For me this looks like
git clone https://cbtham:<token>@huggingface.co/RedHatAI/Qwen3-4B-quantized.w4a16
This will take some time depending on the model you are downloading. Some models, like gpt-oss-120b, are nearing 100 GB!
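Note that cloning large model repositories with git relies on git-lfs being available in your terminal. If it is not, a rough alternative using the Hugging Face CLI is sketched below (it assumes pip is available in your terminal; the repo id matches the example above):

# Alternative download via the Hugging Face CLI (assumes pip is available)
pip install -U huggingface_hub
huggingface-cli download RedHatAI/Qwen3-4B-quantized.w4a16 \
  --local-dir Qwen3-4B-quantized.w4a16 --token <your_hf_token>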
- Assuming you followed the steps above to deploy MinIO storage, navigate to the MinIO UI, found under Networking -> Routes in the OpenShift console.
- Log in with the credentials you specified in the MinIO deployment.
- Create a bucket, give it a name such as "models", and select Create Bucket.
- Select your bucket from the object browser.
- From within the bucket, select Upload -> Upload Folder, and select the folder where the model was downloaded from HuggingFace. Wait for the upload to finish; this will take a while. (A CLI alternative using the MinIO client is sketched below.)
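If you prefer the command line over the web UI, a rough alternative using the MinIO client (mc) is sketched below. It assumes mc is installed in your terminal; the alias name is arbitrary, and the route, credentials, and folder name should be replaced with your own values.

# Hypothetical values - substitute your MinIO API route, credentials, and model folder
mc alias set myminio https://<your-minio-api-route> minio minio123
mc mb myminio/models
mc cp --recursive ./Qwen3-4B-quantized.w4a16 myminio/models/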
- Next, create an S3 storage data connection. We will need to create a link to the deployed S3 storage. You can also bring your own storage by entering the connection information for any S3-compatible storage provider you prefer.
- Within your data science project, navigate to Connections and select Create connection.
- Fill in the following values:
- Connection name: name of your data connection [MinIO]
- Access key: username of minio deployment [minio]
- Secret key: password for minio deployment [minio123]
- Endpoint: API endpoint of the minio deployment [YOUR_OWN_MINIO_API_URL_GET_FROM_ROUTES]
- Bucket: name of the minio bucket you created [models]
- Next, deploy your model. Go to the Models tab within your Data Science Project and select single-model serving:
- After selecting single-model serving, select Deploy model. Use deployment mode: RawDeployment
- Fill in the following values:
- Model deployment name: Name of the deployed model
- Serving runtime: vLLM NVIDIA GPU ServingRuntime for KServe
- Model server size: Select small. Adjust as you need.
- Accelerator: Select NVIDIA GPU
- Deployment Mode: Select RawDeployment
- Model route: Check the box for "Make deployed models available through an external route". This enables sending requests to the model endpoint from outside the cluster.
- Token authentication: Check the box for "Require token authentication". This means sending requests to the model endpoint requires a token, which is important for security. You can leave the service account name as default-name
- Source model location: Select the data connection that you set up in step 4.1, then provide the path to your model. If you're following this guide, the path will be qwen3-4b-quantizedw4a16. If you're unsure of the path, you can go to the MinIO UI, navigate to the models bucket you created, and see the name of the directory where the model is stored.
- Once that is done, you are all set to hit Deploy!
- It will take some time for the model to be deployed.
- Congratulations! You have now successfully deployed an LLM on Red Hat OpenShift AI using the vLLM ServingRuntime for KServe.
A workbench is a containerized development environment for data scientists and AI engineers to build, train, test, and iterate within the OpenShift AI platform.
AnythingLLM is a full-stack workbench that enables you to turn any document, resource, or piece of content into context that any LLM can use as a reference during chatting. This application allows you to pick and choose which LLM or vector database you want to use, and supports multi-user management and permissions.
To get started quickly, we will use a custom workbench - a feature offered by Red Hat OpenShift AI to easily host compatible containerized applications.
- We will add an image by providing the details of the hosted container registry. Navigate to https://quay.io/rh-aiservices-bu/anythingllm-workbench:1.8.5, copy the URL, and paste it into Settings > Workbench Images > Image location.
- Create a new workbench and pick the image you named in the previous step. If you are participating in a workshop, your admin has already set this up for you - choose "AnythingLLM".
Remember to make your storage name unique to avoid name clashes.
- Wait for the workbench to start. You should see a green status showing it is running. Click on the name to navigate to the AnythingLLM UI.
AnythingLLM can consume inference endpoints from multiple AI providers. In this exercise, we will connect it to the privately hosted LLM inference endpoint set up in the previous steps.
- Select OpenAI Compatible API
- Paste the base URL from your deployed model (external endpoint) and append /v1 to it. It will look like this example:
https://qwen3-4b-llm.apps.cluster-q2cbm.q2cbm.sandbox1007.opentlc.com/v1
- Paste the token copied in the steps above as the API key. The token starts with ey....
- Use the name of the model you deployed. In this example, we use qwen3-4b-quantizedw4a16 as the model name. Key in the name of the model you want to use; you can find it in the deployed model's URL on the Red Hat OpenShift AI Models tab. You can also confirm the exact model id by querying the endpoint, as sketched at the end of this section.
- Set the context window and max tokens to 4096.
- Once it is saved, navigate back to the main page of AnythingLLM and start a chat. If everything is set up properly, you will be greeted with a response from the LLM.
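If the chat does not respond, a quick way to double-check the base URL, token, and model name is to list the models served by the endpoint. This is only a sketch; the route below is the example from the earlier step, so substitute your own route and token.

# Substitute your own external route and token
curl -sk -H "Authorization: Bearer <your-model-token>" \
  https://qwen3-4b-llm.apps.cluster-q2cbm.q2cbm.sandbox1007.opentlc.com/v1/models
# The "id" field in the response is the model name AnythingLLM expects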
RAG, or Retrieval-Augmented Generation, is an AI framework that combines the strengths of traditional information retrieval systems with generative large language models (LLMs). It allows LLMs to access and reference information outside their training data to provide more accurate, up-to-date, and relevant responses. Essentially, RAG enhances LLMs by enabling them to tap into external knowledge sources, like documents, databases, or even the internet, before generating text.
For the purpose of demonstration, we will use a local vector database - LanceDB.
LanceDB is deployed as part of AnythingLLM. You may explore the settings page of AnythingLLM to provide your own vector database.
You may insert your own PDF, CSV, or any digestible format for RAG. In this guide, we will step it up a notch by scraping a website and using its data for RAG. We will use the built-in scraper from AnythingLLM; after getting the data, it will chunk it and store it in the LanceDB vector database for retrieval.
We first ask a question and capture the default response. We can see the response is short and very generic.
- Now let's implement RAG by attaching a PDF. Click on the upload button beside your user workspace. Upload the PDF rag-demo.pdf from this repository.

- After that, move it to the workspace and click Save and Embed.

- We can see the answer after RAG is more detailed with reference to the data we uploaded.

- Now let's try another way to implement RAG by scraping a website. The website has a section with a better answer to our previous question.
- Instead of uploading a document, select Data Connector and click Bulk Link Scraper.

- Input the link, set the depth to 1, and click Submit.
- Web scraping will take some time, especially with the depth set to a higher value. If you are an admin, you can tail the anythingllm pod logs to watch the scraping, chunking, and embedding progress, as shown below.
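A minimal sketch for following those logs, assuming AnythingLLM runs as the anythingllm-0 pod in your project namespace (substitute your own namespace):

# Follow the AnythingLLM logs while scraping, chunking and embedding run
oc logs -f anythingllm-0 -c anythingllm -n <your-project-namespace>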
- Once this step is done, you will see the data available. Move it to the workspace, then save and embed.

- After that, ask a question again and you will see the answer is much more detailed, with references to the scraped website.
- Behind the scenes, AnythingLLM scraped the website, chunked it, and embedded it into the workspace.

Prerequisite: The following section requires you to use Terminal or CLI commands. We will deploy Llama Stack, an open-source developer framework and library by Meta for building agentic AI.
- Go to your OpenShift AI cluster and select the app icon to go to the console.

- From the console, click the CLI icon and start a terminal session.

- Git clone this repository
git clone https://github.com/cbtham/rhoai-genai-workshop.git && cd rhoai-genai-workshop
Llama Stack is a developer framework for building generative AI applications. Its components are set up and connected to create a production-ready environment across various environments such as on-prem, air-gapped, or the cloud.
We will need a few components:
- Llama Stack Kubernetes Operator: We will be using llama-stack-k8s-operator. This operator orchestrates and automates the Llama Stack deployment (servers, resource management, deployment in the underlying cluster).
- Llama Stack configuration: The Llama Stack configuration defines which components, such as models, RAG providers, inference engines, and other tools, are used to build and deploy the AI application. Llama Stack also provides a unified API layer, allowing developers to switch providers for different components without changing core application code.
- An MCP server: Model Context Protocol (MCP) is an open standard for AI agents and LLMs to connect with external data sources, tools, and services. Like a USB-C port for AI, it standardizes communication, allowing LLM-powered agents to access real-world information and functionality beyond their training data.
- To deploy llama-stack-operator, run:
oc apply -k obs/experimental/llama-stack-operator
- Next, we will deploy an MCP server. The MCP server can be anything from Spotify or Uber to Datadog or GitHub; you may also build your own MCP server. In this example, we will deploy an OpenShift MCP server.
oc new-project llama-stack
oc apply -k obs/experimental/openshift-mcp -n llama-stack
- To utilize Llama Stack, we will configure it with a Llama Stack configuration deployment. For this step, you will need to provide details from earlier steps:
export MODEL_NAME="qwen3-4b" # Your LLM Model
export MODEL_NAMESPACE="admin-workshop" # Your datascience project name
export LLM_MODEL_TOKEN="YOUR_TOKEN" # Your LLM Model token
export LLM_MODEL_URL="https://${MODEL_NAME}-predictor.${MODEL_NAMESPACE}.svc.cluster.local:8443/v1"
After that, run
perl -pe 's/\$\{([^}]+)\}/$ENV{$1}/g' obs/experimental/llama-stack-with-config/configmap.yaml | oc apply -f - -n llama-stack
and
perl -pe 's/\$\{([^}]+)\}/$ENV{$1}/g' obs/experimental/llama-stack-with-config/llama-stack-with-config.yaml | oc apply -f - -n llama-stack
The first command deploys the ConfigMap; the second deploys the server.
- Ensure that there are no errors from llama-stack and the MCP server:
oc get pods -n llama-stack
and
oc get pods -n llama-stack-k8s-operator-controller-manager
To enable tool calling, we will need to go back to the OpenShift AI portal and modify the model deployment.
- In your data science project's Models tab, select the LLM that you have deployed and choose Edit.
- Scroll down to the vLLM arguments. We will need to enable a few flags to allow tool calling:
--enable-auto-tool-choice --tool-call-parser hermes
Note: Qwen3 models use the hermes parser. If you are using other LLM models, you may need to change the parser. Check the foundation model provider's docs for more information.
- When the LLM model finishes redeploying, we will be able to test.
- Go to AI Providers > LLM and change the context length to 8192, as tool calling via MCP consumes a lot more tokens.

- Add the configuration for the OpenShift MCP server. Change YOUR-PROJECT-NAMESPACE to your namespace!
export MODEL_NAMESPACE="YOUR-PROJECT-NAMESPACE"
perl -pe 's/\$\{([^}]+)\}/$ENV{$1}/g' obs/experimental/anythingllm-mcp-config/anythingllm_mcp_servers.json > /tmp/anythingllm_mcp_servers.json && oc cp /tmp/anythingllm_mcp_servers.json anythingllm-0:/app/server/storage/plugins/anythingllm_mcp_servers.json -c anythingllm
- Restart/refresh AnythingLLM.
- Go to Agent Skills, scroll down to MCP Servers, and hit refresh.

- After that, you will be able to see the OpenShift MCP Server.

- To test, go to chat or agent chat. Make sure to type @agent before your question.

Your MCP server can currently only access its own namespace.
To implement cluster-wide read-only access, apply the following:
oc apply -f obs/experimental/openshift-mcp/cluster-read-serviceaccount.yaml
In any organization, you may have developers who would like to build and experiment with other tools, libraries, or frameworks. The section below showcases the Llama Stack playground, which has more robust debugging and logging interfaces.
To deploy the Llama Stack playground, follow along. The playground is a Streamlit-based UI to test the LLM model, with options to enable capabilities on demand.
- In your OpenShift terminal or local CLI, run:
oc apply -k obs/experimental/llama-stack-playground -n llama-stack
- Ensure that there are no errors and all the pods are running:
oc get pods -n llama-stack
- Next we will get the route URL to the playground UI.
oc get route llama-stack-playground -n llama-stack -o jsonpath='https://{.spec.host}{"\n"}'
- Open a new tab in your browser and visit the URL to access the playground UI.
Make sure to select agent-based and the OpenShift MCP server in the right panel.
The following section requires you to run code in a terminal. You can run this directly in the Red Hat OpenShift console or through your local terminal connected to the OpenShift cluster. When you are ready, git clone this repository.
Prometheus is used to collect and aggregate metrics. Prometheus is installed by default with OpenShift; however, the default monitoring stack only collects metrics related to core OpenShift platform components. Therefore we need to enable User Workload Monitoring in order to collect metrics from the model we have deployed. More about configuring Prometheus can be found in the documentation - Enable monitoring for user-defined projects.
- In order to enable monitoring for user-defined projects, we need to set enableUserWorkload: true in the cluster monitoring ConfigMap object. You can do this by applying the following yaml:
oc apply -f obs/cluster-monitoring-config.yaml -n openshift-monitoring
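As an optional sanity check (not a required step in this guide), you can confirm the user workload monitoring stack came up by listing the pods it creates:

# The openshift-user-workload-monitoring namespace appears once the ConfigMap is applied
oc get pods -n openshift-user-workload-monitoring

After a minute or two you should see prometheus-user-workload pods running.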
Now that we have enabled user workload monitoring, we just need to add vLLM to the list of metrics we want Prometheus to gather. We can do this by adding vllm:.* to the metrics_allowlist.yaml in the project namespace. Before applying the yaml, make sure to CHANGE the value of namespace to the namespace your model has been deployed in. Once you've changed the namespace value, deploy with the following command.
oc apply -f obs/metrics_allowlist.yaml
Note: If you face Error from server (NotFound): error when creating "obs/metrics_allowlist.yaml": namespaces "YOUR_PROJECT_NAMESPACE" not found, modify metrics_allowlist.yaml to reflect your project namespace.
5.2.1 Grafana is used for dashboarding. It will be used to display key metrics from the data collected.
- Back at the command line terminal, create a new Grafana namespace:
oc create namespace grafana
- Deploy the Grafana PVC, Service, and Deployment:
oc apply -f obs/grafana-setup.yaml -n grafana
- Apply a route to expose the Grafana UI externally:
oc apply -f obs/expose-grafana.yaml -n grafana
- Get the Grafana route URL:
oc get route grafana -n grafana -o jsonpath='https://{.spec.host}{"\n"}'
- Create the Grafana secret token. This is used so that Grafana can access the Prometheus data source.
oc apply -f obs/grafana-prometheus-token.yaml -n grafana --namespace=openshift-monitoring
- Get the token by running the following command:
oc get secret grafana-prometheus-token \
  -n openshift-monitoring \
  -o jsonpath='{.data.token}' | base64 -d && echo
- Add the data source in the Grafana UI. Navigate to Data sources -> Add data source.
Select Prometheus as the data source, then fill in the following values:
- Prometheus server URL: https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
- Skip TLS certificate validation: Check this box
- HTTP headers:
  - Header: Authorization
  - Value: Bearer [Space] [Token created in step 6.2.2]
The Prometheus server URL is the same for everyone.
The "Value" field should contain "Bearer<SPACE>YOUR_TOKEN".
Once the above is filled out, hit save and test at the bottom. You should then see the following:
- Verify that vLLM and DCGM metrics can be read from the data source.
We want to make sure Grafana is actually getting the vLLM and DCGM metrics from the data source.
Go to Explore -> Metrics explorer, type vllm as the metric value, and verify that you can see the different vLLM metrics. Then type DCGM and verify you can see the different DCGM metrics.
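If nothing shows up, you can also query Thanos directly from the terminal to check whether the metrics are being scraped at all. This is only a sketch: it reuses the token created earlier, and the metric names are common vLLM and DCGM examples whose exact names can vary between exporter versions.

# Reuse the Grafana Prometheus token created earlier
TOKEN=$(oc get secret grafana-prometheus-token -n openshift-monitoring -o jsonpath='{.data.token}' | base64 -d)
THANOS=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
# Example queries - adjust metric names to what your exporters expose
curl -sk -H "Authorization: Bearer $TOKEN" "https://$THANOS/api/v1/query?query=vllm:num_requests_running"
curl -sk -H "Authorization: Bearer $TOKEN" "https://$THANOS/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL"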
The vLLM dashboard used by Emerging Tech and Red Hat Research can be found here: https://github.com/redhat-et/ai-observability/blob/main/vllm-dashboards/vllm-grafana-openshift.json. This dashboard is based on the upstream vLLM dashboard. It gives you insight into the key metrics of your deployed LLM(s).
- Go to Dashboards -> Create dashboard
- Select Import a dashboard. Then either upload the vLLM dashboard JSON or copy and paste it into the box provided.
- Then hit Load, then Import.
This dashboard is meant to provide high-level metrics, key to setting SLOs and to monitoring and improving performance.
To add it, select Import a dashboard, then copy and paste the contents of the vLLM Advanced Performance Dashboard to import.
The DCGM Grafana Dashboard can be found here: https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/.
- Go back to Dashboards in the Grafana UI and select New -> Import. Copy the following dashboard ID: 12239. Paste that dashboard ID on the Import dashboard page, then hit Load.
- Select the Prometheus data source, then select Import.
- You should now have successfully imported the NVIDIA DCGM Exporter Dashboard, useful for GPU monitoring.
After deploying your model, the model pod may terminate after some time. You have to manually go into the console and scale it back up to 1; after that, the model pod should be created successfully and should not terminate again. This is a known bug triggered by deploying large models: because a large model takes a while to be pulled onto the node or into the cluster, the deployment is not given enough breathing room to actually start up. The fix is currently in the backlog, so for now, with bigger models like Granite, you will have to manually scale the pod back up.
After some minutes, you may see that the pod was terminated and the deployment scaled to 0. You can manually scale it back to 1 in the UI, or run the following command from the CLI:
oc scale deployment/[deployment_name] --replicas=1 -n [namespace] --as system:admin
For me this looks like
oc scale deployment/demo-granite-predictor-00001-deployment --replicas=1 -n sandbox --as system:admin
After this it will take some time for the model pod to spin back up.
oc rollout restart statefulset anythingllm -n admin-workshop

export ADMIN_NAMESPACE="admin-workshop"
# Get the token and create secret in your admin namespace
TOKEN=$(oc get secret grafana-prometheus-token \
-n openshift-monitoring \
-o jsonpath='{.data.token}' | base64 -d)
oc create secret generic prometheus-token \
--from-literal=token="$TOKEN" \
-n admin-workshop
# Apply RBAC so participants can query
perl -pe 's/\$\{([^}]+)\}/$ENV{$1}/g' obs/prometheus-token-access-setup.yaml | oc apply -f -






































