If you are in a workshop, start here.
You will need an OpenShift cluster with OpenShift AI and NVIDIA GPUs.
- For Red Hatters, provision a demo environment from the demo catalog, e.g. RHOAI on OCP on AWS with NVIDIA GPUs.
- For attendees, this is already set up for you - start here.
- Deploy a generative AI model with a serving endpoint
  - Option 1: Modelcar - Pre-packaged LLM in a container
  - Option 2: Object store - Any model from HuggingFace
- Create and use a custom workbench with the deployed inference server
- Q&A with Retrieval-Augmented Generation
- Agentic AI and MCP Server
- Monitor the deployed model
The purpose of this guide is to offer the simplest steps for deploying a privately hosted AI model on Red Hat OpenShift AI. It covers deploying a Red Hat certified Qwen3 model using the vLLM ServingRuntime for KServe on NVIDIA GPUs. It also covers setting up an MCP server and agentic AI, along with a simple observability stack so that you can collect and visualize metrics related to AI model performance and GPU utilization.
Option 1, Modelcar, explores the method of using a pre-built container with a serving endpoint. This is the easiest way to get an LLM running on Red Hat OpenShift AI.
- Create a workspace by going to your OpenShift AI portal. Go to Data Science Projects and create a project.

- Now, in a new browser tab, navigate to https://quay.io/repository/redhat-ai-services/modelcar-catalog to get the modelcar models.
- Go to the container tags page and select a model you want to use. In this example, we will use qwen3-4b.
- Click the download/save button to reveal the tags, then select any tag to reveal the URI.

- Copy the URL from quay.io onwards.
- Next, deploy the model. Go to the Models tab within your Data Science Project on Red Hat OpenShift AI and select single-model serving:

- Fill in a name, make sure to choose the NVIDIA GPU serving runtime, and set the deployment mode to Standard.

- Remember to check the box to secure the LLM model endpoint that you are about to deploy.

- Select connection type URI - v1 and give it a name. A good practice is to name it after the model you are about to deploy.
- Next, prefix the URI with oci://

oci://quay.io/redhat-ai-services/modelcar-catalog:qwen3-4b

Note: If you face resource problems, try selecting a smaller model, for example qwen2.5-0.5b:

oci://quay.io/redhat-ai-services/modelcar-catalog:qwen2.5-0.5b-instruct

- Your deployment will look something like this.
- Model deployment name: Name of the deployed model
- Serving runtime: vLLM NVIDIA GPU ServingRuntime for KServe
- Model server size: You can select whatever size you wish; for this guide we will keep the small size
- Accelerator: Select NVIDIA GPU
- Model route: Check the box for "Make deployed models available through an external route". This enables sending requests to the model endpoint from outside the cluster.
- Token authentication: Check the box for "Require token authentication". This means sending requests to the model endpoint requires a token, which is important for security. You can leave the service account name as default-name.
- Wait for the model to finish deploying and the status to turn green.
Note: It may take up to a few minutes depending on the model size.
- In a production deployment, you might want to adjust the vLLM arguments to fit your use case, for example to increase the context length or apply a certain quantization. Every use case is different and there is no single configuration that fits all; a sketch of some commonly tuned flags follows below.
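For illustration only, the line below shows the kind of flags you might add to the vLLM arguments field of the deployment (the same field we edit later in this guide to enable tool calling). The values are placeholders to adapt, not recommendations: --max-model-len raises the context window, and --gpu-memory-utilization caps how much GPU memory vLLM claims.

--max-model-len 8192 --gpu-memory-utilization 0.90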
- Congratulations, you have just deployed your very first LLM!
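Before moving on, you can optionally sanity-check the endpoint from a terminal. The sketch below assumes a hypothetical external route, the token you created during deployment, and the deployment name used in this example as the model id; replace all three with your own values.

# Hypothetical values - substitute your own route, token, and model name
export LLM_URL="https://<your-model-route>"
export LLM_TOKEN="<your-model-token>"
# List the served model(s)
curl -sk -H "Authorization: Bearer $LLM_TOKEN" "$LLM_URL/v1/models"
# Send a simple chat completion request
curl -sk -H "Authorization: Bearer $LLM_TOKEN" -H "Content-Type: application/json" \
  -d '{"model": "qwen3-4b", "messages": [{"role": "user", "content": "Hello!"}]}' \
  "$LLM_URL/v1/chat/completions"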
If you have completed the option above, you can skip this section. If you want to learn another method, carry on.
This option explores deploying any HuggingFace model with an object store. Here, an object store (MinIO) is deployed to act as model storage. You will upload a model of your choice to MinIO and have OpenShift AI pull the model from that storage.
MinIO is a high-performance, S3-compatible object store. It can be deployed on a wide variety of platforms.
- We will be deploying MinIO in a project namespace.
- From the console, click the CLI icon and start a terminal session.
Note: Configure the terminal timeout option to be at least 2 hours, otherwise the terminal will be killed before your model download finishes.
Note: If you are using your own computer, this is not required. On a Mac, install the OpenShift CLI via brew; look up how to do so.
git clone https://github.com/cbtham/rhoai-genai-workshop.git && cd rhoai-genai-workshop
oc new-project minio
oc apply -f minio-setup.yaml -n minio

- By default, the size of the storage is 150 GB (see line 11 of minio-setup.yaml). Change it if you need to; it is not necessary for this guide.
- If you want to, edit lines 21-22 of minio-setup.yaml to change the default user/password.
- Give it a minute, and there should now be a running minio pod:
- Two routes will be created in the Networking tab:
Navigate to https://huggingface.co/ and find the model you would like to deploy.
In this guide, we will be deploying a Red Hat certified model from https://huggingface.co/RedHatAI. We will use the Qwen3-4B-quantized.w4a16 model.
First you need to generate an access token:
- Navigate to settings -> access tokens
- Select create new token
- For token type, select Read and then give it a name
- Copy the token
Now that you have a token, you can download the model.
git clone https://<user_name>:<token>@huggingface.co/<repo_path>
For me this looks like
git clone https://cbtham:<token>@huggingface.co/RedHatAI/Qwen3-4B-quantized.w4a16
This will take some time depending on the model you are downloading. Some models, like gpt-oss-120b, are nearing 100 GB!
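Note that cloning large model repositories with git relies on git-lfs being available in your terminal. If it is not, a rough alternative using the Hugging Face CLI is sketched below (it assumes pip is available in your terminal; the repo id matches the example above):

# Alternative download via the Hugging Face CLI (assumes pip is available)
pip install -U huggingface_hub
huggingface-cli download RedHatAI/Qwen3-4B-quantized.w4a16 \
  --local-dir Qwen3-4B-quantized.w4a16 --token <your_hf_token>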
- Assuming you followed the steps above to deploy MinIO storage, navigate to the MinIO UI, found under Networking -> Routes in the OpenShift console.
- Log in with the credentials you specified in the MinIO deployment.
- Create a bucket, give it a name such as "models", and select Create Bucket.
- Select your bucket from the object browser.
- From within the bucket, select Upload -> Upload Folder, and select the folder where the model was downloaded from HuggingFace. Wait for the upload to finish; this will take a while. (A CLI alternative using the MinIO client is sketched below.)
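If you prefer the command line over the web UI, a rough alternative using the MinIO client (mc) is sketched below. It assumes mc is installed in your terminal; the alias name is arbitrary, and the route, credentials, and folder name should be replaced with your own values.

# Hypothetical values - substitute your MinIO API route, credentials, and model folder
mc alias set myminio https://<your-minio-api-route> minio minio123
mc mb myminio/models
mc cp --recursive ./Qwen3-4B-quantized.w4a16 myminio/models/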
- Next, create an S3 storage data connection. We will need to create a link to the deployed S3 storage. You can also bring your own storage by entering the connection information for any S3-compatible storage provider you prefer.
- Within your data science project, navigate to Connections and select Create connection.
- Fill in the following values:
- Connection name: name of your data connection [MinIO]
- Access key: username of minio deployment [minio]
- Secret key: password for minio deployment [minio123]
- Endpoint: API endpoint of the minio deployment [YOUR_OWN_MINIO_API_URL_GET_FROM_ROUTES]
- Bucket: name of the minio bucket you created [models]
- Next, deploy your model. Go to the Models tab within your Data Science Project and select single-model serving:
- After selecting single-model serving, select Deploy model. Use deployment mode: RawDeployment
- Fill in the following values:
- Model deployment name: Name of the deployed model
- Serving runtime: vLLM NVIDIA GPU ServingRuntime for KServe
- Model server size: Select small. Adjust as you need.
- Accelerator: Select NVIDIA GPU
- Deployment Mode: Select RawDeployment
- Model route: Check the box for "Make deployed models available through an external route". This enables sending requests to the model endpoint from outside the cluster.
- Token authentication: Check the box for "Require token authentication". This means sending requests to the model endpoint requires a token, which is important for security. You can leave the service account name as default-name
- Source model location: Select the data connection that you set up in step 4.1, then provide the path to your model. If you're following this guide, the path will be qwen3-4b-quantizedw4a16. If you're unsure of the path, you can go to the MinIO UI, navigate to the models bucket you created, and see the name of the directory where the model is stored.
- Once that is done, you are all set to hit Deploy!
- It will take some time for the model to be deployed.
- Congratulations! You have now successfully deployed an LLM on Red Hat OpenShift AI using the vLLM ServingRuntime for KServe.
A workbench is a containerized development environment for data scientists and AI engineers to build, train, test, and iterate within the OpenShift AI platform.
AnythingLLM is a full-stack workbench that enables you to turn any document, resource, or piece of content into context that any LLM can use as a reference during chatting. This application allows you to pick and choose which LLM or vector database you want to use, and supports multi-user management and permissions.
To get started quickly, we will use a custom workbench - a feature offered by Red Hat OpenShift AI to easily host compatible containerized applications.
- We will add an image by providing the details of the hosted container registry. Navigate to https://quay.io/rh-aiservices-bu/anythingllm-workbench:1.8.5, copy the URL, and paste it into Settings > Workbench Images > Image location.
- Create a new workbench and pick the image you named in the previous step. If you are participating in a workshop, your admin has already set this up for you - choose "AnythingLLM".
Remember to make your storage name unique to avoid name clashes.
- Wait for the workbench to start. You should see a green status showing it is running. Click on the name to navigate to the AnythingLLM UI.
AnythingLLM can consume inference endpoints from multiple AI providers. In this exercise, we will connect it to the privately hosted LLM inference endpoint set up in the previous steps.
- Select OpenAI Compatible API
- Paste the base URL from your deployed model (external endpoint) and append /v1 to it. It will look like this example:
https://qwen3-4b-llm.apps.cluster-q2cbm.q2cbm.sandbox1007.opentlc.com/v1
- Paste the token copied in the steps above as the API key. The token starts with ey....
- Use the name of the model you deployed. In this example, we use qwen3-4b-quantizedw4a16 as the model name. Key in the name of the model you want to use; you can find it in the deployed model's URL on the Red Hat OpenShift AI Models tab. You can also confirm the exact model id by querying the endpoint, as sketched at the end of this section.
- Set the context window and max tokens to 4096.
- Once it is saved, navigate back to the main page of AnythingLLM and start a chat. If everything is set up properly, you will be greeted with a response from the LLM.
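If the chat does not respond, a quick way to double-check the base URL, token, and model name is to list the models served by the endpoint. This is only a sketch; the route below is the example from the earlier step, so substitute your own route and token.

# Substitute your own external route and token
curl -sk -H "Authorization: Bearer <your-model-token>" \
  https://qwen3-4b-llm.apps.cluster-q2cbm.q2cbm.sandbox1007.opentlc.com/v1/models
# The "id" field in the response is the model name AnythingLLM expects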
RAG, or Retrieval-Augmented Generation, is an AI framework that combines the strengths of traditional information retrieval systems with generative large language models (LLMs). It allows LLMs to access and reference information outside their training data to provide more accurate, up-to-date, and relevant responses. Essentially, RAG enhances LLMs by enabling them to tap into external knowledge sources, like documents, databases, or even the internet, before generating text.
For the purpose of demonstration, we will use a local vector database - LanceDB.
LanceDB is deployed as part of AnythingLLM. You may explore the settings page of AnythingLLM to provide your own vector database.
You may insert your own PDF, CSV, or any digestible format for RAG. In this guide, we will step it up a notch by scraping a website and using its data for RAG. We will use the built-in scraper from AnythingLLM; after getting the data, it will chunk it and store it in the LanceDB vector database for retrieval.
We first ask a question and capture the default response. We can see the response is short and very generic.
- Now let's implement RAG by attaching a PDF. Click on the upload button beside your user workspace. Upload the PDF rag-demo.pdf from this repository.

- After that, move it to the workspace and click Save and Embed.

- We can see the answer after RAG is more detailed with reference to the data we uploaded.

- Now let's try another way to implement RAG by scraping a website. The website has a section with a better answer to our previous question.
- Instead of uploading a document, select Data Connector and click Bulk Link Scraper.

- Input the link, set the depth to 1, and click Submit.
- Web scraping will take some time, especially with the depth set to a higher value. If you are an admin, you can tail the anythingllm pod logs to watch the scraping, chunking, and embedding progress, as shown below.
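A minimal sketch for following those logs, assuming AnythingLLM runs as the anythingllm-0 pod in your project namespace (substitute your own namespace):

# Follow the AnythingLLM logs while scraping, chunking and embedding run
oc logs -f anythingllm-0 -c anythingllm -n <your-project-namespace>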
- Once this step is done, you will see the data available. Move it to the workspace, then save and embed.

- After that, ask a question again and you will see the answer is much more detailed, with references to the scraped website.
- Behind the scenes, AnythingLLM scraped the website, chunked it, and embedded it into the workspace.

Prerequisite: The following section requires you to use Terminal or CLI commands. We will deploy Llama Stack, an open-source developer framework and library by Meta for building agentic AI.
- Go to your OpenShift AI cluster and select the app icon to go to the console.

- From the console, click the CLI icon and start a terminal session.

- Git clone this repository
git clone https://github.com/cbtham/rhoai-genai-workshop.git && cd rhoai-genai-workshop
Llama Stack is a developer framework for building generative AI applications. Its components are set up and connected to create a production-ready environment across various environments such as on-prem, air-gapped, or the cloud.
We will need a few components:
- Llama Stack Kubernetes Operator: We will be using llama-stack-k8s-operator. This operator orchestrates and automates the Llama Stack deployment (servers, resource management, deployment in the underlying cluster).
- Llama Stack configuration: The Llama Stack configuration defines which components, such as models, RAG providers, inference engines, and other tools, are used to build and deploy the AI application. Llama Stack also provides a unified API layer, allowing developers to switch providers for different components without changing core application code.
- An MCP server: Model Context Protocol (MCP) is an open standard for AI agents and LLMs to connect with external data sources, tools, and services. Like a USB-C port for AI, it standardizes communication, allowing LLM-powered agents to access real-world information and functionality beyond their training data.
- To deploy llama-stack-operator, run:
oc apply -k obs/experimental/llama-stack-operator
- Next, we will deploy an MCP server. The MCP server can be anything from Spotify or Uber to Datadog or GitHub; you may also build your own MCP server. In this example, we will deploy an OpenShift MCP server.
oc new-project llama-stack
oc apply -k obs/experimental/openshift-mcp -n llama-stack
- To utilize Llama Stack, we will configure it with a Llama Stack configuration deployment. For this step, you will need to provide details from earlier steps:
export MODEL_NAME="qwen3-4b" # Your LLM Model
export MODEL_NAMESPACE="admin-workshop" # Your datascience project name
export LLM_MODEL_TOKEN="YOUR_TOKEN" # Your LLM Model token
export LLM_MODEL_URL="https://${MODEL_NAME}-predictor.${MODEL_NAMESPACE}.svc.cluster.local:8443/v1"
After that, run
perl -pe 's/\$\{([^}]+)\}/$ENV{$1}/g' obs/experimental/llama-stack-with-config/configmap.yaml | oc apply -f - -n llama-stack
and
perl -pe 's/\$\{([^}]+)\}/$ENV{$1}/g' obs/experimental/llama-stack-with-config/llama-stack-with-config.yaml | oc apply -f - -n llama-stack
The first command deploys the ConfigMap; the second deploys the server.
- Ensure that there are no errors from llama-stack and the MCP server:
oc get pods -n llama-stack
and
oc get pods -n llama-stack-k8s-operator-controller-manager
To enable tool calling, we will need to go back to the OpenShift AI portal and modify the model deployment.
- In your data science project's Models tab, select the LLM that you have deployed and choose Edit.
- Scroll down to the vLLM arguments. We will need to enable a few flags to allow tool calling:
--enable-auto-tool-choice --tool-call-parser hermes
Note: Qwen3 models use the hermes parser. If you are using other LLM models, you may need to change the parser. Check the foundation model provider's docs for more information.
- When the LLM model finishes redeploying, we will be able to test.
- Go to AI Providers > LLM and change the context length to 8192, as tool calling via MCP consumes a lot more tokens.

- Add the configuration for the OpenShift MCP server. Change YOUR-PROJECT-NAMESPACE to your namespace!
export MODEL_NAMESPACE="YOUR-PROJECT-NAMESPACE"
perl -pe 's/\$\{([^}]+)\}/$ENV{$1}/g' obs/experimental/anythingllm-mcp-config/anythingllm_mcp_servers.json > /tmp/anythingllm_mcp_servers.json && oc cp /tmp/anythingllm_mcp_servers.json anythingllm-0:/app/server/storage/plugins/anythingllm_mcp_servers.json -c anythingllm
- Restart/refresh AnythingLLM.
- Go to Agent Skills, scroll down to MCP Servers, and hit refresh.

- After that, you will be able to see the OpenShift MCP Server.

- To test, go to chat or agent chat. Make sure to type @agent before your question.

Your MCP server can currently only access its own namespace.
To implement cluster-wide read-only access, apply the following:
oc apply -f obs/experimental/openshift-mcp/cluster-read-serviceaccount.yaml
In any organization, you may have developers who would like to build and experiment with other tools, libraries, or frameworks. The section below showcases the Llama Stack playground, which has more robust debugging and logging interfaces.
To deploy the Llama Stack playground, follow along. The playground is a Streamlit-based UI to test the LLM model, with options to enable capabilities on demand.
- In your OpenShift terminal or local CLI, run:
oc apply -k obs/experimental/llama-stack-playground -n llama-stack
- Ensure that there are no errors and all the pods are running:
oc get pods -n llama-stack
- Next we will get the route URL to the playground UI.
oc get route llama-stack-playground -n llama-stack -o jsonpath='https://{.spec.host}{"\n"}'
- Open a new tab in your browser and visit the URL to access the playground UI.
Make sure to select agent-based and the OpenShift MCP server in the right panel.
The following section requires you to run code in a terminal. You can run this directly in the Red Hat OpenShift console or through your local terminal connected to the OpenShift cluster. When you are ready, git clone this repository.
Prometheus is used to collect and aggregate metrics. Prometheus is installed by default with OpenShift; however, the default monitoring stack only collects metrics related to core OpenShift platform components. Therefore we need to enable User Workload Monitoring in order to collect metrics from the model we have deployed. More about configuring Prometheus can be found in the documentation - Enable monitoring for user-defined projects.
- In order to enable monitoring for user-defined projects, we need to set enableUserWorkload: true in the cluster monitoring ConfigMap object. You can do this by applying the following yaml:
oc apply -f obs/cluster-monitoring-config.yaml -n openshift-monitoring
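As an optional sanity check (not a required step in this guide), you can confirm the user workload monitoring stack came up by listing the pods it creates:

# The openshift-user-workload-monitoring namespace appears once the ConfigMap is applied
oc get pods -n openshift-user-workload-monitoring

After a minute or two you should see prometheus-user-workload pods running.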
Now that we have enabled user workload monitoring, we just need to add vLLM to the list of metrics we want Prometheus to gather. We can do this by adding vllm:.* to the metrics_allowlist.yaml in the project namespace. Before applying the yaml, make sure to CHANGE the value of namespace to the namespace your model has been deployed in. Once you've changed the namespace value, deploy with the following command.
oc apply -f obs/metrics_allowlist.yaml
Note: If you face Error from server (NotFound): error when creating "obs/metrics_allowlist.yaml": namespaces "YOUR_PROJECT_NAMESPACE" not found, modify metrics_allowlist.yaml to reflect your project namespace.
5.2.1 Grafana is used for dashboarding. It will be used to display key metrics from the data collected.
- Back at the command line terminal, create a new Grafana namespace:
oc create namespace grafana
- Deploy the Grafana PVC, Service, and Deployment:
oc apply -f obs/grafana-setup.yaml -n grafana
- Apply a route to expose the Grafana UI externally:
oc apply -f obs/expose-grafana.yaml -n grafana
- Get the Grafana route URL:
oc get route grafana -n grafana -o jsonpath='https://{.spec.host}{"\n"}'
- Create the Grafana secret token. This is used so that Grafana can access the Prometheus data source.
oc apply -f obs/grafana-prometheus-token.yaml -n grafana --namespace=openshift-monitoring
- Get the token by running the following command:
oc get secret grafana-prometheus-token \
  -n openshift-monitoring \
  -o jsonpath='{.data.token}' | base64 -d && echo
- Add the data source in the Grafana UI. Navigate to Data sources -> Add data source.
Select Prometheus as the data source, then fill in the following values:
- Prometheus server URL: https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
- Skip TLS certificate validation: Check this box
- HTTP headers:
  - Header: Authorization
  - Value: Bearer [Space] [Token created in step 6.2.2]
The Prometheus server URL is the same for everyone.
The "Value" field should contain "Bearer<SPACE>YOUR_TOKEN".
Once the above is filled out, hit save and test at the bottom. You should then see the following:
- Verify that vLLM and DCGM metrics can be read from the data source.
We want to make sure Grafana is actually getting the vLLM and DCGM metrics from the data source.
Go to Explore -> Metrics explorer, type vllm as the metric value, and verify that you can see the different vLLM metrics. Then type DCGM and verify you can see the different DCGM metrics.
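If nothing shows up, you can also query Thanos directly from the terminal to check whether the metrics are being scraped at all. This is only a sketch: it reuses the token created earlier, and the metric names are common vLLM and DCGM examples whose exact names can vary between exporter versions.

# Reuse the Grafana Prometheus token created earlier
TOKEN=$(oc get secret grafana-prometheus-token -n openshift-monitoring -o jsonpath='{.data.token}' | base64 -d)
THANOS=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
# Example queries - adjust metric names to what your exporters expose
curl -sk -H "Authorization: Bearer $TOKEN" "https://$THANOS/api/v1/query?query=vllm:num_requests_running"
curl -sk -H "Authorization: Bearer $TOKEN" "https://$THANOS/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL"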
The vLLM dashboard used by Emerging Tech and Red Hat Research can be found here: https://github.com/redhat-et/ai-observability/blob/main/vllm-dashboards/vllm-grafana-openshift.json. This dashboard is based on the upstream vLLM dashboard. It gives you insight into the key metrics of your deployed LLM(s).
- Go to Dashboards -> Create dashboard
- Select Import a dashboard. Then either upload the vLLM dashboard JSON or copy and paste it into the box provided.
- Then hit Load, then Import.
This dashboard is meant to provide high-level metrics, key to setting SLOs and to monitoring and improving performance.
To add it, select Import a dashboard, then copy and paste the contents of the vLLM Advanced Performance Dashboard to import.
The DCGM Grafana Dashboard can be found here: https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/.
- Go back to Dashboards in the Grafana UI and select New -> Import. Copy the following dashboard ID: 12239. Paste that dashboard ID on the Import dashboard page, then hit Load.
- Select the Prometheus data source, then select Import.
- You should now have successfully imported the NVIDIA DCGM Exporter Dashboard, useful for GPU monitoring.
After deploying your model, the model pod may terminate after some time. You have to manually go into the console and scale it back up to 1; after that, the model pod should be created successfully and should not terminate again. This is a known bug triggered by deploying large models: because a large model takes a while to be pulled onto the node or into the cluster, the deployment is not given enough breathing room to actually start up. The fix is currently in the backlog, so for now, with bigger models like Granite, you will have to manually scale the pod back up.
After some minutes, you may see that the pod was terminated and the deployment scaled to 0. You can manually scale it back to 1 in the UI, or run the following command from the CLI:
oc scale deployment/[deployment_name] --replicas=1 -n [namespace] --as system:admin
For me this looks like
oc scale deployment/demo-granite-predictor-00001-deployment --replicas=1 -n sandbox --as system:admin
After this it will take some time for the model pod to spin back up.
oc rollout restart statefulset anythingllm -n admin-workshop

export ADMIN_NAMESPACE="admin-workshop"
# Get the token and create secret in your admin namespace
TOKEN=$(oc get secret grafana-prometheus-token \
-n openshift-monitoring \
-o jsonpath='{.data.token}' | base64 -d)
oc create secret generic prometheus-token \
--from-literal=token="$TOKEN" \
-n admin-workshop
# Apply RBAC so participants can query
perl -pe 's/\$\{([^}]+)\}/$ENV{$1}/g' obs/prometheus-token-access-setup.yaml | oc apply -f -






































