Custom model training fails, need to downgrade torch (and setuptools) #15

@glemoine62

Hi,

I am using the deepquestai/deepstack:gpu-2022.01.1 container for custom training. It ships with torch built for CUDA 11.3, but train.py fails shortly after startup (see error below). Downgrading to torch for CUDA 11.0 resolves this (pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html, as in the Colab notebook).

docker run --gpus all -it --rm -v /home/eouser/deepstack:/deepstack/code -w /deepstack/code/deepstack-trainer deepquestai/deepstack_updated:gpu python3 train.py --dataset-path /deepstack/code/data
Traceback (most recent call last):
  File "train.py", line 530, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 90, in train
    model = Model(opt.cfg or ckpt['model'].yaml, ch=3, nc=nc).to(device) # create
  File "/deepstack/code/deepstack-trainer/models/yolo.py", line 96, in __init__
    self._initialize_biases() # only run once
  File "/deepstack/code/deepstack-trainer/models/yolo.py", line 151, in _initialize_biases
    b[:, 4] += math.log(8 / (640 / s) ** 2) # obj (8 objects per 640 image)
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.
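
For anyone who would rather keep the newer torch, the failing line could probably be adapted instead of downgrading. The following is only a minimal sketch, not the trainer's actual code: `init_objectness_bias`, the bias shape, and the stride value are hypothetical stand-ins for what `_initialize_biases` in models/yolo.py does. It assumes the in-place update is meant to set initial values only, so it can run under `torch.no_grad()`, where autograd does not reject in-place changes to a view of a leaf tensor.

```python
import math

import torch


def init_objectness_bias(bias: torch.nn.Parameter, num_anchors: int, stride: float) -> None:
    """Add the objectness prior to column 4 of the bias, viewed as (num_anchors, -1).

    Hypothetical helper mirroring the expression from the traceback.
    """
    # Without no_grad(), newer torch releases raise
    # "a view of a leaf Variable that requires grad is being used in an
    # in-place operation" on the += below.
    with torch.no_grad():
        b = bias.view(num_anchors, -1)  # shares storage with `bias`
        b[:, 4] += math.log(8 / (640 / stride) ** 2)  # same formula as yolo.py line 151


# Hypothetical layer: 3 anchors with 6 outputs each (x, y, w, h, obj, 1 class).
bias = torch.nn.Parameter(torch.zeros(3 * 6))
init_objectness_bias(bias, num_anchors=3, stride=8.0)
```

The parameter keeps `requires_grad=True` afterwards, so training should be unaffected. I have not tested this against the trainer itself.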

Btw, before any of that I also had to downgrade setuptools inside the container, because otherwise train.py already fails on import:

Traceback (most recent call last):
  File "train.py", line 21, in <module>
    from torch.utils.tensorboard import SummaryWriter
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/tensorboard/__init__.py", line 4, in <module>
    LooseVersion = distutils.version.LooseVersion
AttributeError: module 'setuptools._distutils' has no attribute 'version'

(resolved with: pip install setuptools==59.5.0)
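
To check a container against the combination that works for me, something like the sketch below might help. It is only an illustration: `KNOWN_GOOD` lists the versions from this issue, while `release_tuple` and `mismatches` are hypothetical helper names. The comparison looks at the numeric release only and ignores local tags such as `+cu110`. The "installed" versions in the usage line are made up.

```python
import re

# Versions reported as working in this issue.
KNOWN_GOOD = {
    "torch": "1.7.0+cu110",
    "torchvision": "0.8.1+cu110",
    "torchaudio": "0.7.0",
    "setuptools": "59.5.0",
}


def release_tuple(version):
    """Parse the leading numeric release, e.g. '1.7.0+cu110' -> (1, 7, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def mismatches(installed):
    """Map each package that is missing or differs to (found, wanted)."""
    report = {}
    for name, wanted in KNOWN_GOOD.items():
        found = installed.get(name)
        if found is None or release_tuple(found) != release_tuple(wanted):
            report[name] = (found, wanted)
    return report


# Hypothetical example: everything matches except a newer setuptools.
print(mismatches({"torch": "1.7.0+cu110", "torchvision": "0.8.1+cu110",
                  "torchaudio": "0.7.0", "setuptools": "60.0.0"}))
```

The `installed` mapping could be filled in by hand from `pip list` inside the container.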

I am now happily training with the revised setup, so nothing too urgent, but maybe worth checking out.

Thx for this wonderful framework!

Guido
