Description
Hi... I am getting the following error while resuming training from a checkpoint on a single-GPU system. Training runs fine when started from iteration 0, but it exits immediately after loading a checkpoint. The relevant excerpt from the run script that I modified for this purpose is also shown below. Is this a bug, or is there a mistake somewhere on my side?
(command used)
sh scripts/cityscapes/ocrnet/run_r_101_d_8_ocrnet_train.sh resume x3
(modifications in the run script)
elif [ "$1"x == "resume"x ]; then
${PYTHON} -u main.py --configs ${CONFIGS} \
--drop_last y \
--phase train \
--gathered n \
--loss_balance y \
--log_to_file n \
--backbone ${BACKBONE} \
--model_name ${MODEL_NAME} \
--max_iters ${MAX_ITERS} \
--data_dir ${DATA_DIR} \
--loss_type ${LOSS_TYPE} \
--resume_continue y \
--resume ${CHECKPOINTS_ROOT}/checkpoints/bottle/${CHECKPOINTS_NAME}_latest.pth \
--checkpoints_name ${CHECKPOINTS_NAME} \
--distributed False \
2>&1 | tee -a ${LOG_FILE}
#--gpu 0 1 2 3
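
In case it helps, this is the quick sanity check I can run on the checkpoint file itself, just to rule out a corrupted file. The path is the one printed in the log below; the snippet only prints top-level keys and types and does not assume any particular layout of the saved dict:

import torch

# Path copied from the resume log below.
ckpt_path = ("/workspace/data/defGen/graphics/Pre_CL_x3//..//checkpoints/bottle/"
             "spatial_ocrnet_deepbase_resnet101_dilated8_x3_latest.pth")

ckpt = torch.load(ckpt_path, map_location="cpu")

# Inspect the top-level structure without assuming specific key names.
print(type(ckpt))
if isinstance(ckpt, dict):
    for key, value in ckpt.items():
        if torch.is_tensor(value):
            print(key, tuple(value.shape))
        elif isinstance(value, dict):
            print(key, "dict with", len(value), "entries")
        else:
            print(key, type(value))

(log and traceback when resuming)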
2022-11-16 11:30:47,097 INFO [module_runner.py, 87] Loading checkpoint from /workspace/data/defGen/graphics/Pre_CL_x3//..//checkpoints/bottle/spatial_ocrnet_deepbase_resnet101_dilated8_x3_latest.pth...
2022-11-16 11:30:47,283 INFO [trainer.py, 90] Params Group Method: None
2022-11-16 11:30:47,285 INFO [optim_scheduler.py, 96] Use lambda_poly policy with default power 0.9
2022-11-16 11:30:47,285 INFO [data_loader.py, 132] use the DefaultLoader for train...
2022-11-16 11:30:47,773 INFO [default_loader.py, 38] train 501
2022-11-16 11:30:47,774 INFO [data_loader.py, 164] use DefaultLoader for val ...
2022-11-16 11:30:47,873 INFO [default_loader.py, 38] val 126
2022-11-16 11:30:47,873 INFO [loss_manager.py, 66] use loss: fs_auxce_loss.
2022-11-16 11:30:47,874 INFO [loss_manager.py, 55] use DataParallelCriterion loss
2022-11-16 11:30:48,996 INFO [data_helper.py, 126] Input keys: ['img']
2022-11-16 11:30:48,996 INFO [data_helper.py, 127] Target keys: ['labelmap']
Traceback (most recent call last):
File "main.py", line 227, in
model.train()
File "/workspace/defGen/External/ContrastiveSeg-main/segmentor/trainer.py", line 390, in train
self.__train()
File "/workspace/defGen/External/ContrastiveSeg-main/segmentor/trainer.py", line 196, in __train
backward_loss = display_loss = self.pixel_loss(outputs, targets,
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/defGen/External/ContrastiveSeg-main/lib/extensions/parallel/data_parallel.py", line 125, in forward
return self.module(inputs[0], *targets[0], **kwargs[0])
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/defGen/External/ContrastiveSeg-main/lib/loss/loss_helper.py", line 309, in forward
seg_loss = self.ce_loss(seg_out, targets)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/defGen/External/ContrastiveSeg-main/lib/loss/loss_helper.py", line 203, in forward
target = self._scale_target(targets[0], (inputs.size(2), inputs.size(3)))
IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
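
If I read the traceback correctly, the exception comes from inputs.size(3) in the call at loss_helper.py line 203: this is exactly the message PyTorch produces when .size(3) is asked of a 3-D tensor, which would mean the segmentation output reaching the loss has only three dimensions (no batch dimension). A minimal snippet, independent of this repo, that reproduces the same message:

import torch

# A 3-D tensor, e.g. a segmentation output (C, H, W) without a batch dimension.
seg_out = torch.randn(19, 128, 256)

print(seg_out.size(2))  # fine: valid dims for a 3-D tensor are -3..2

try:
    seg_out.size(3)     # out of range for a 3-D tensor
except IndexError as err:
    # IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
    print(err)

So my guess (not verified) is that with --distributed False and --gathered n on a single GPU, the outputs or targets get unwrapped one level too much somewhere between DataParallelCriterion and the loss, although I don't understand why this would only happen when resuming. Does that sound plausible?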