Description
Hi, I am running into an exit status 139 (SIGSEGV) error when trying to run my NVTabular + Transformers4Rec training script in a container built on the base image nvcr.io/nvidia/merlin/merlin-pytorch:23.12.
I believe it's a potential clash between the RAPIDS version and the NVIDIA drivers on this GCP Vertex AI instance. I am unsure whether the only option is to use older drivers, because the RAPIDS version and the rest of the CUDA toolkit are pinned by the base image.
Attaching some logs below, as well as my training and data_loading code. I would really appreciate some help with how to resolve this:
Package Versioning Logs:
INFO 2025-11-26T19:34:58.551575363Z [resource.labels.taskName: workerpool0-0] === CUDA/RAPIDS DIAGNOSTIC INFORMATION ===
INFO 2025-11-26T19:34:58.551605883Z [resource.labels.taskName: workerpool0-0] 🔍 NVIDIA Driver & CUDA Runtime:
INFO 2025-11-26T19:34:58.707598777Z [resource.labels.taskName: workerpool0-0] - | NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
INFO 2025-11-26T19:34:58.707617237Z [resource.labels.taskName: workerpool0-0] - GPU Memory: | 0 N/A N/A 1 C python 102MiB |
INFO 2025-11-26T19:34:58.707641170Z [resource.labels.taskName: workerpool0-0] 🔍 CUDA Compiler (nvcc):
INFO 2025-11-26T19:34:58.731705831Z [resource.labels.taskName: workerpool0-0] - Cuda compilation tools, release 12.1, V12.1.105
INFO 2025-11-26T19:34:58.731755794Z [resource.labels.taskName: workerpool0-0] 🔍 PyTorch CUDA:
INFO 2025-11-26T19:34:58.731771375Z [resource.labels.taskName: workerpool0-0] - PyTorch version: 2.1.0a0+4136153
INFO 2025-11-26T19:34:58.733012810Z [resource.labels.taskName: workerpool0-0] - CUDA available: True
INFO 2025-11-26T19:34:58.733028479Z [resource.labels.taskName: workerpool0-0] - CUDA version: 12.1
INFO 2025-11-26T19:34:58.738117139Z [resource.labels.taskName: workerpool0-0] - cuDNN version: 8902
INFO 2025-11-26T19:34:58.747059985Z [resource.labels.taskName: workerpool0-0] - GPU count: 1
INFO 2025-11-26T19:34:58.775557686Z [resource.labels.taskName: workerpool0-0] - GPU 0: Tesla T4
INFO 2025-11-26T19:34:58.775582246Z [resource.labels.taskName: workerpool0-0] - GPU 0 memory: 14GB
INFO 2025-11-26T19:34:58.776206953Z [resource.labels.taskName: workerpool0-0] 🔍 CuPy:
INFO 2025-11-26T19:34:58.776226962Z [resource.labels.taskName: workerpool0-0] - CuPy version: 12.0.0b3
INFO 2025-11-26T19:34:58.779242981Z [resource.labels.taskName: workerpool0-0] - CUDA available: True
INFO 2025-11-26T19:34:58.779258026Z [resource.labels.taskName: workerpool0-0] - CUDA version: 12010
INFO 2025-11-26T19:34:58.779278965Z [resource.labels.taskName: workerpool0-0] 🔍 RAPIDS/cuDF:
INFO 2025-11-26T19:34:58.779293088Z [resource.labels.taskName: workerpool0-0] - cuDF version: 23.04.00
INFO 2025-11-26T19:34:58.779303703Z [resource.labels.taskName: workerpool0-0] - CUDA available: True
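For anyone comparing the versions above: CuPy reports the CUDA runtime version as a single integer (12010), while PyTorch and nvcc print it as 12.1. A minimal sketch of the conversion, assuming the usual CUDA encoding of 1000 * major + 10 * minor (the helper name `decode_cuda_version` is mine, not part of any library):

```python
def decode_cuda_version(encoded: int) -> tuple[int, int]:
    """Split a CUDA integer version (1000 * major + 10 * minor) into (major, minor)."""
    major = encoded // 1000
    minor = (encoded % 1000) // 10
    return major, minor


# Value taken from the CuPy line in the log above.
major, minor = decode_cuda_version(12010)
print(f"{major}.{minor}")
```

Read this way, CuPy, PyTorch, and nvcc all report CUDA 12.1 inside the container, while the driver on the host advertises support up to CUDA 13.0.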
Error Logs:
INFO 2025-11-26T19:34:58.779353621Z [resource.labels.taskName: workerpool0-0] Configuration and utilities loaded successfully
INFO 2025-11-26T19:34:58.779363008Z [resource.labels.taskName: workerpool0-0] Loading SQL query template
INFO 2025-11-26T19:34:58.779372378Z [resource.labels.taskName: workerpool0-0] Query prepared - date range: 15 days
INFO 2025-11-26T19:34:58.779381671Z [resource.labels.taskName: workerpool0-0] Columns selected: 26 columns
INFO 2025-11-26T19:34:58.779390326Z [resource.labels.taskName: workerpool0-0] Setting up Google Cloud credentials
INFO 2025-11-26T19:34:58.876559353Z [resource.labels.taskName: workerpool0-0] Loading data from BigQuery...
ERROR 2025-11-26T19:35:02.470379094Z [resource.labels.taskName: service] The replica workerpool0-0 exited with a non-zero status of 139(SIGSEGV)
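For reference, the 139 in the last line follows the common shell convention of 128 + signal number, so it maps to signal 11. A small sketch of that mapping using only the standard library (the helper name `describe_exit_status` is hypothetical, and the 128 + N convention is an assumption about how the container runtime reports the status):

```python
import signal


def describe_exit_status(status: int) -> str:
    """Map an exit status to a signal name, assuming the 128 + N convention."""
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            # Status is above 128 but does not match a known signal number.
            return f"unknown signal {status - 128}"
    return f"exit code {status}"


print(describe_exit_status(139))
```

Under the same convention, a status of 130 would instead indicate SIGINT, so 139 is the value consistent with a segmentation fault.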