I have a training set that includes a large system (around 2000 atoms).
I suspect that having such a large system in the training set causes a memory-related error during multihead finetuning.
I first tried without changing r_max, but the run failed even after changing r_max to 5.0.
Is there any solution to this error? Does it mean I should use multi-GPU training?
I would appreciate any comments. Many thanks in advance.
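For context, the finetuning invocation looks roughly like the sketch below. The paths and values are placeholders, and the flag names are taken from `mace_run_train --help` as I understand them, so this is not my exact command line:

```bash
# Rough sketch of the multihead finetuning run (placeholder paths/values;
# flag names assumed from the MACE CLI, not my exact command line).
mace_run_train \
    --name="finetune_large_system" \
    --foundation_model="medium" \
    --multiheads_finetuning=True \
    --train_file="train.xyz" \
    --valid_fraction=0.05 \
    --r_max=5.0 \
    --batch_size=2 \
    --valid_batch_size=2 \
    --max_num_epochs=100 \
    --device=cuda
```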
```
Traceback (most recent call last):
  File "/home/hjung/.conda/envs/mace/bin/mace_run_train", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/hjung/.conda/envs/mace/lib/python3.12/site-packages/mace/cli/run_train.py", line 77, in main
    run(args)
  File "/home/hjung/.conda/envs/mace/lib/python3.12/site-packages/mace/cli/run_train.py", line 837, in run
    tools.train(
  File "/home/hjung/.conda/envs/mace/lib/python3.12/site-packages/mace/tools/train.py", line 195, in train
    valid_loss_head, eval_metrics = evaluate(
                                    ^^^^^^^^^
  File "/home/hjung/.conda/envs/mace/lib/python3.12/site-packages/mace/tools/train.py", line 550, in evaluate
    output = model(
             ^^^^^^
  File "/home/hjung/.conda/envs/mace/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/.conda/envs/mace/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/.conda/envs/mace/lib/python3.12/site-packages/mace/modules/models.py", line 550, in forward
    node_feats = product(
                 ^^^^^^^^
  File "/home/hjung/.conda/envs/mace/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/.conda/envs/mace/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/.conda/envs/mace/lib/python3.12/site-packages/mace/modules/blocks.py", line 440, in forward
    index_attrs = torch.nonzero(node_attrs)[:, 1].int()
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
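Following the suggestion in the error message, the run can be repeated with synchronous kernel launches so the illegal access is reported at the actual failing call. A minimal sketch, reusing the placeholder command from above:

```bash
# Force synchronous CUDA kernel launches so the stack trace points at the
# offending call (placeholder command, as sketched above).
export CUDA_LAUNCH_BLOCKING=1
mace_run_train \
    --name="finetune_large_system" \
    --foundation_model="medium" \
    --train_file="train.xyz" \
    --r_max=5.0 \
    --device=cuda
```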
The stdout file is also attached: stdout.txt