Memory problem when trying to train with a large system #1337

Description

@jungsdao

I have a training set that includes a large system (around 2000 atoms), and I suspect that having this large system in the set causes a memory-related error during multihead finetuning. The run failed with the default r_max, and it still fails after reducing r_max to 5.0.
Is there any solution to this error? Does it mean I should use multi-GPU training?
I'd appreciate any comments; many thanks in advance.
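For context, a back-of-the-envelope check shows why one ~2000-atom frame is so much heavier than the rest of the set: the edge count of the graph grows with the atom count times the number of neighbours inside r_max, and the neighbour count grows roughly with r_max cubed. A minimal sketch using ASE (the file name `large_system.xyz` is a placeholder for the large frame in your own training set):

```python
from ase.io import read
from ase.neighborlist import neighbor_list

# Placeholder path: point this at the large frame in your training set.
atoms = read("large_system.xyz")

for r_max in (6.0, 5.0, 4.0):
    # neighbor_list("i", ...) returns one entry per directed pair within
    # the cutoff, so len(i) approximates the edge count of the graph.
    i = neighbor_list("i", atoms, cutoff=r_max)
    print(f"r_max={r_max}: {len(i)} edges, "
          f"{len(i) / len(atoms):.1f} neighbours/atom")
```

Going from r_max 6.0 to 5.0 only cuts the edge count by roughly (5/6)³ ≈ 0.6, so if the 2000-atom frame barely doesn't fit, a modest cutoff reduction may not be enough on its own.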

Traceback (most recent call last):
  File "/home/hjung/.conda/envs/mace/bin/mace_run_train", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/hjung/.conda/envs/mace/lib/python3.12/site-packages/mace/cli/run_train.py", line 77, in main
    run(args)
  File "/home/hjung/.conda/envs/mace/lib/python3.12/site-packages/mace/cli/run_train.py", line 837, in run
    tools.train(
  File "/home/hjung/.conda/envs/mace/lib/python3.12/site-packages/mace/tools/train.py", line 195, in train
    valid_loss_head, eval_metrics = evaluate(
                                    ^^^^^^^^^
  File "/home/hjung/.conda/envs/mace/lib/python3.12/site-packages/mace/tools/train.py", line 550, in evaluate
    output = model(
             ^^^^^^
  File "/home/hjung/.conda/envs/mace/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/.conda/envs/mace/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/.conda/envs/mace/lib/python3.12/site-packages/mace/modules/models.py", line 550, in forward
    node_feats = product(
                 ^^^^^^^^
  File "/home/hjung/.conda/envs/mace/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/.conda/envs/mace/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hjung/.conda/envs/mace/lib/python3.12/site-packages/mace/modules/blocks.py", line 440, in forward
    index_attrs = torch.nonzero(node_attrs)[:, 1].int()
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
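Since the crash is raised inside evaluate() (the validation pass), one way to narrow it down is to re-run with synchronous CUDA kernels, as the error message itself suggests, so the traceback points at the kernel that actually faults, and to put the large frame in a validation batch of its own. A hedged sketch; `config.yaml` stands in for your existing arguments, and the `--config` and `--valid_batch_size` options are taken from my reading of the MACE CLI, so please check them against your installed version:

```python
import os
import subprocess

# CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the Python
# frame in the traceback matches the kernel that really failed.
env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")

subprocess.run(
    [
        "mace_run_train",
        "--config", "config.yaml",   # placeholder for your existing arguments
        "--valid_batch_size", "1",   # isolate the 2000-atom frame in its own batch
    ],
    env=env,
    check=True,
)
```

If the error disappears with a validation batch size of 1, it was plain GPU memory pressure from batching the large frame with others; if it persists, the synchronous traceback should identify the real failing op.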

The stdout file is also attached:
stdout.txt
