NanLossDuringTrainingError: NaN loss during training #105

@Wuxinxiaoshifu

Description

INFO:tensorflow:Using config: {'_model_dir': '/home/yzh/v3plus/tensorflow-deeplab-v3-plus/dataset/test2/model/new', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 1000000000.0, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff57ace0c88>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Start training.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2021-11-02 23:17:44.611893: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2021-11-02 23:17:44.828950: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.755
pciBusID: 0000:02:00.0
totalMemory: 23.70GiB freeMemory: 23.44GiB
2021-11-02 23:17:44.967547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: NVIDIA GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.755
pciBusID: 0000:81:00.0
totalMemory: 23.69GiB freeMemory: 23.27GiB
2021-11-02 23:17:44.967602: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2021-11-02 23:21:49.062821: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-11-02 23:21:49.062859: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1
2021-11-02 23:21:49.062867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N N
2021-11-02 23:21:49.062871: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N N
2021-11-02 23:21:49.063044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22724 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:02:00.0, compute capability: 8.6)
2021-11-02 23:21:49.063425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 22555 MB memory) -> physical GPU (device: 1, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:81:00.0, compute capability: 8.6)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /home/yzh/v3plus/tensorflow-deeplab-v3-plus/dataset/test2/model/new/model.ckpt.
INFO:tensorflow:cross_entropy = 1.9338539, learning_rate = 0.007, train_mean_iou = 0.014417753, train_px_accuracy = 0.086506516
INFO:tensorflow:loss = 24.278753, step = 0
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
File "train.py", line 285, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "train.py", line 267, in main
hooks=train_hooks,
File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
saving_listeners)
File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1471, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
run_metadata=run_metadata)
File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
run_metadata=run_metadata)
File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
raise six.reraise(*original_exc_info)
File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/six.py", line 719, in reraise
raise value
File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
return self._sess.run(*args, **kwargs)
File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1320, in run
run_metadata=run_metadata))
File "/home/anaconda3/envs/tf-dpv3plus/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 753, in after_run
raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
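
For readers tracing the failure: the last frame in the traceback is the `after_run` method of a session hook in `basic_session_run_hooks.py`, which inspects the loss fetched at each step and raises once that value is NaN. The sketch below is a simplified pure-Python stand-in for that kind of check, not TensorFlow's actual code. `NanLossError` and `check_loss` are hypothetical names used only for illustration, and the step number passed for the NaN case is made up, since the log does not show the exact step at which the loss diverged.

```python
import math


class NanLossError(RuntimeError):
    """Hypothetical stand-in for NanLossDuringTrainingError."""


def check_loss(step, loss, fail_on_nan=True):
    """Mimic a NaN guard that runs after each training step.

    Returns True if training may continue, False if the caller should
    stop gracefully, and raises NanLossError when fail_on_nan is set.
    """
    if not math.isnan(loss):
        return True
    message = "Model diverged with loss = NaN."
    if fail_on_nan:
        raise NanLossError("step %d: %s" % (step, message))
    return False


if __name__ == "__main__":
    # Loss value logged at step 0 in the output above.
    print(check_loss(0, 24.278753))
    try:
        # Illustrative step number; the real step is not in the log.
        check_loss(1, float("nan"))
    except NanLossError as err:
        print("stopped:", err)
```

A guard like this only reports that the loss has already become NaN. It does not identify which operation or input produced it, so the traceback alone does not point at the root cause.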
