Hello!
I have noticed that the chrombpnet train job hangs after generating the h5 model files, but I can't understand what's going on. ChatGPT suggests it has something to do with plot generation, and in fact the report files are not generated. I'm running the job on an A100 GPU with 64GB.
2025-09-09 05:59:24.723665: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.736034: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.760498: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.772885: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.785249: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.810316: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.822529: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.834770: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.859310: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.871873: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.884284: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.909351: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.922039: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.934375: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.958576: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.970789: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.983114: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:25.007647: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:25.019993: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:25.032344: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:25.057171: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:25.069919: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:25.082545: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
2025-09-09 06:18:42.665144: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0636s vs `on_train_batch_end` time: 0.0726s). Check your callbacks.
WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.
No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
2025-09-09 07:52:51.715645: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38337 MB memory: -> device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:2f:00.0, compute capability: 8.0
WARNING:tensorflow:From /sc/arion/work/giottb01/conda/envs/chrombpnet/lib/python3.8/site-packages/shap/explainers/deep/deep_tf.py:140: The name tf.keras.backend.get_session is deprecated. Please use tf.compat.v1.keras.backend.get_session instead.
Any input is much appreciated!
Thanks so much!!