chrombpnet train hanging #258

@Brawni

Hello!

I have noticed that the chrombpnet train job hangs after generating the h5 model files, but I can't work out what is going on. ChatGPT suggests it has something to do with plot generation, and in fact the report files are not generated. I'm running the job on an A100 GPU with 64 GB.

2025-09-09 05:59:24.723665: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.736034: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.760498: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.772885: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.785249: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.810316: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.822529: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.834770: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.859310: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.871873: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.884284: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.909351: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.922039: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.934375: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.958576: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.970789: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:24.983114: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:25.007647: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:25.019993: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:25.032344: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:25.057171: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:25.069919: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
2025-09-09 05:59:25.082545: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled
No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
2025-09-09 06:18:42.665144: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0636s vs `on_train_batch_end` time: 0.0726s). Check your callbacks.
WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.
No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
2025-09-09 07:52:51.715645: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38337 MB memory: -> device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:2f:00.0, compute capability: 8.0
WARNING:tensorflow:From /sc/arion/work/giottb01/conda/envs/chrombpnet/lib/python3.8/site-packages/shap/explainers/deep/deep_tf.py:140: The name tf.keras.backend.get_session is deprecated. Please use tf.compat.v1.keras.backend.get_session instead.
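Because the log mixes output from several pipeline stages, one way to see where the wall-clock time goes is to measure the gaps between consecutive timestamped lines. The following is a minimal sketch, assuming every timestamped line starts with the `YYYY-MM-DD HH:MM:SS.ffffff:` prefix seen above. The names `stage_gaps` and `min_gap_s` are hypothetical helpers for illustration and are not part of chrombpnet or TensorFlow.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the log excerpt above (assumed format).
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}): ")


def stage_gaps(lines, min_gap_s=60.0):
    """Return (gap_seconds, previous_line, next_line) for every pair of
    consecutive timestamped lines separated by at least min_gap_s seconds.
    Lines without a timestamp prefix are skipped."""
    gaps = []
    prev_ts, prev_line = None, None
    for line in lines:
        match = TS.match(line)
        if not match:
            continue
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if prev_ts is not None:
            delta = (ts - prev_ts).total_seconds()
            if delta >= min_gap_s:
                gaps.append((delta, prev_line, line))
        prev_ts, prev_line = ts, line
    return gaps


# Three timestamped lines taken from the log above (the last one is shortened).
sample = [
    "2025-09-09 05:59:25.082545: W tensorflow/core/data/root_dataset.cc:200] "
    "Optimization loop failed: CANCELLED: Operation was cancelled",
    "2025-09-09 06:18:42.665144: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] "
    "TensorFloat-32 will be used for the matrix multiplication.",
    "2025-09-09 07:52:51.715645: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] "
    "Created device /job:localhost/replica:0/task:0/device:GPU:0",
]

for seconds, _before, after in stage_gaps(sample):
    print(f"{seconds / 60:.1f} min of silence before: {after[:70]}")
```

On these three lines the sketch reports two gaps, of roughly 19 and 94 minutes. Run over the complete log file, the same function would show how long the job has been silent since the last timestamped message.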

Any input is much appreciated!

Thanks so much!!
