Open
Labels
bug: Something isn't working
Description
Version
v2.0.0
Describe the bug.
Hey,
I want to launch the NIM with 6x A100 GPUs.
NVIDIA GRID driver version: 570.124.06, CUDA version: 12.8
The container fails at startup with an NVML error:
$ docker logs nemoretriever-ranking-ms
===================================
== NVIDIA NIM for Text Reranking ==
===================================
NVIDIA Release 1.3.1
Model: nvidia/llama-3.2-nv-rerankqa-1b-v2
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement/).
Third Party Software Attributions and Licenses can be found under /opt/nim/NOTICE
Overriding NIM_LOG_LEVEL: replacing NIM_LOG_LEVEL=unset with NIM_LOG_LEVEL=INFO
Traceback (most recent call last):
File "/opt/nim/start_server.d/nim_manifest_profile.py", line 166, in <module>
system = get_info()
Exception: Failed to query NVML device info: an internal driver error occured
As a check, nvidia-smi run in a plain container with the same runtime works and returns the expected table:
$ docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Sat Apr 19 18:42:27 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 GRID A100D-2-20C On | 00000000:02:00.0 Off | On |
| N/A N/A P0 N/A / N/A | N/A | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
| 1 GRID A100D-2-20C On | 00000000:02:02.0 Off | On |
| N/A N/A P0 N/A / N/A | N/A | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
| 2 GRID A100D-2-20C On | 00000000:02:03.0 Off | On |
| N/A N/A P0 N/A / N/A | N/A | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
| 3 GRID A100D-2-20C On | 00000000:02:04.0 Off | On |
| N/A N/A P0 N/A / N/A | N/A | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
| 4 GRID A100D-2-20C On | 00000000:02:05.0 Off | On |
| N/A N/A P0 N/A / N/A | N/A | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
| 5 GRID A100D-2-20C On | 00000000:02:06.0 Off | On |
| N/A N/A P0 N/A / N/A | N/A | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| 0 0 0 0 | 1MiB / 18412MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 4096MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 1 0 0 0 | 1MiB / 18412MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 4096MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 2 0 0 0 | 1MiB / 18412MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 4096MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 3 0 0 0 | 1MiB / 18412MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 4096MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 4 0 0 0 | 1MiB / 18412MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 4096MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 5 0 0 0 | 1MiB / 18412MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 4096MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
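Since nvidia-smi works but the NIM startup script fails, it may help to find out which NVML query returns the internal driver error on these GRID/MIG devices. The following is a minimal sketch, assuming the nvidia-ml-py package (imported as pynvml) is available inside the container. The traceback does not show which NVML calls get_info() makes, so the three calls probed here are only a guess at typical device-info queries, and probe_nvml is a hypothetical helper name.

```python
def probe_nvml(nvml):
    """Run a few common NVML device queries and record which ones fail.

    `nvml` is the pynvml module (passed in so the function can also be
    exercised with a stand-in object). The list of calls below is an
    assumption: the traceback does not reveal what get_info() queries.
    Returns a list of (device_index, call_name, status, value_or_error).
    """
    calls = (
        "nvmlDeviceGetName",
        "nvmlDeviceGetMemoryInfo",
        "nvmlDeviceGetCudaComputeCapability",
    )
    results = []
    nvml.nvmlInit()
    try:
        count = nvml.nvmlDeviceGetCount()
        for index in range(count):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            for name in calls:
                try:
                    value = getattr(nvml, name)(handle)
                    results.append((index, name, "ok", value))
                except nvml.NVMLError as err:
                    # Keep going so every failing query is reported,
                    # not just the first one.
                    results.append((index, name, "error", str(err)))
    finally:
        nvml.nvmlShutdown()
    return results


# Usage inside the container (requires pynvml and a visible GPU):
#   import pynvml
#   for row in probe_nvml(pynvml):
#       print(row)
```

If one of the rows reports an error while the others succeed, that would point at the specific query that is not supported on this vGPU/MIG configuration.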
Full env printout
MODEL_DIRECTORY=/data/nvidia/.cache/model-cache
NVIDIA_API_KEY=nvapi-lx..........jcOILb
- I agree to follow THIS PROJECT's Code of Conduct
- I have searched the open bugs and have found no duplicates for this bug report