NaN temperature and inf pair energy with mliap potential in LAMMPS #1353

@park2971

Description

Hello,

I am using the MACE develop branch at the Oct 10, 2025 checkpoint (commit 7372ad7),
due to the compatibility issue (#1287).

I use CUDA 12.1.1 with the following Python packages:

Package                       Version
----------------------------- ------------
annotated-types               0.7.0
ase                           3.27.0
certifi                       2026.1.4
charset-normalizer            3.4.4
click                         8.3.1
ConfigArgParse                1.7.1
contourpy                     1.3.3
cuequivariance                0.4.0
cuequivariance-ops-cu12       0.4.0
cuequivariance-ops-torch-cu12 0.4.0
cuequivariance-torch          0.4.0
cupy-cuda12x                  13.6.0
cycler                        0.12.1
e3nn                          0.4.4
fastrlock                     0.8.3
filelock                      3.20.0
fonttools                     4.61.1
fsspec                        2025.12.0
gitdb                         4.0.12
GitPython                     3.1.46
h5py                          3.15.1
idna                          3.11
Jinja2                        3.1.6
kiwisolver                    1.4.9
lammps                        2025.12.10
lightning-utilities           0.15.2
lmdb                          1.7.5
mace-torch                    0.3.15
MarkupSafe                    2.1.5
matplotlib                    3.10.8
matscipy                      1.2.0
mpmath                        1.3.0
networkx                      3.6.1
numpy                         2.4.1
nvidia-cublas-cu12            12.1.3.1
nvidia-cuda-cupti-cu12        12.1.105
nvidia-cuda-nvrtc-cu12        12.1.105
nvidia-cuda-runtime-cu12      12.1.105
nvidia-cudnn-cu12             9.1.0.70
nvidia-cufft-cu12             11.0.2.54
nvidia-curand-cu12            10.3.2.106
nvidia-cusolver-cu12          11.4.5.107
nvidia-cusparse-cu12          12.1.0.106
nvidia-ml-py                  13.590.48
nvidia-nccl-cu12              2.21.5
nvidia-nvjitlink-cu12         12.9.86
nvidia-nvtx-cu12              12.1.105
opt_einsum                    3.4.0
opt-einsum-fx                 0.1.4
orjson                        3.11.5
packaging                     26.0
pandas                        3.0.0
pillow                        12.1.0
pip                           25.3
platformdirs                  4.5.1
prettytable                   3.17.0
protobuf                      6.33.4
pydantic                      2.12.5
pydantic_core                 2.41.5
pynvml                        13.0.1
pyparsing                     3.3.2
python-dateutil               2.9.0.post0
python_hostlist               2.3.0
PyYAML                        6.0.3
requests                      2.32.5
scipy                         1.17.0
sentry-sdk                    2.50.0
setuptools                    80.10.2
six                           1.17.0
smmap                         5.0.2
sympy                         1.13.1
torch                         2.5.1+cu121
torch-ema                     0.3
torchaudio                    2.5.1+cu121
torchmetrics                  1.8.2
torchvision                   0.20.1+cu121
tqdm                          4.67.1
triton                        3.1.0
typing_extensions             4.15.0
typing-inspection             0.4.2
urllib3                       2.6.3
wandb                         0.24.0
wcwidth                       0.5.0
wheel                         0.46.3

I obtained a potential with the following command (multihead fine-tuning):

mace_run_train \
    --name="$model_name" \
    --energy_key="energy"\
    --forces_key="forces"\
    --stress_key="stress"\
    --E0s='{38:-0.03202952, 50:-0.07540552, 8:-0.06285466}' \
    --foundation_model="$HOME/MDkit/foundations/mace_matpes_0/MACE-matpes-r2scan-omat-ft.model" \
    --pt_train_file="../multihead/prep2/selected_configs.xyz" \
    --multiheads_finetuning=True \
    --train_file="../multihead/train.xyz" \
    --valid_fraction=0.05 \
    --num_samples_pt=10000 \
    --config_type_weights '{"Default": 10.0}'\
    --weight_pt_head=1.0 \
    --loss "universal" \
    --energy_weight=1 \
    --forces_weight=20 \
    --compute_stress=True \
    --stress_weight=5 \
    --lr=0.001 \
    --scaling="rms_forces_scaling" \
    --batch_size=10 \
    --max_num_epochs=10 \
    --swa \
    --swa_energy_weight=20 \
    --swa_forces_weight=5 \
    --swa_stress_weight=1 \
    --swa_lr=0.0001 \
    --start_swa=8 \
    --ema \
    --ema_decay=0.999 \
    --amsgrad \
    --default_dtype="float64" \
    --device=cuda \
    --enable_cueq=True\
    --wandb \
    --wandb_project="$WANDB_PROJECT"\
    --wandb_name="$WANDB_NAME"\
    --restart_latest \
    --seed=21045228    >> log_mace_train

I believe there is no significant error here, but these are the warnings I obtained:

/users/7/park2971/env_mace/lib/python3.12/site-packages/torch/cuda/__init__.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/users/7/park2971/env_mace/lib/python3.12/site-packages/e3nn/o3/_wigner.py:10: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  _Jd, _W3j_flat, _W3j_indices = torch.load(os.path.join(os.path.dirname(__file__), 'constants.pt'))
/users/7/park2971/env_mace/lib/python3.12/site-packages/mace/cli/run_train.py:152: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  model_foundation = torch.load(
/users/7/park2971/env_mace/lib/python3.12/site-packages/torch/jit/_check.py:178: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn(
/users/7/park2971/env_mace/lib/python3.12/site-packages/torch/jit/_check.py:178: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.

...
/users/7/park2971/env_mace/lib/python3.12/site-packages/mace/modules/models.py:84: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  "atomic_numbers", torch.tensor(atomic_numbers, dtype=torch.int64)
/users/7/park2971/env_mace/lib/python3.12/site-packages/mace/tools/checkpoint.py:187: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  torch.load(f=checkpoint_info.path, map_location=device),
wandb: [wandb.login()] Loaded credentials for https://api.wandb.ai from WANDB_API_KEY.
wandb: Currently logged in as: park2971 (park2971-university-of-minnesota) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: setting up run fm5v7v67
wandb: Tracking run with wandb version 0.24.0
wandb: Run data is saved locally in /projects/standard/birolt/park2971/research/SrSnO3/MACE/test4-new_env/wandb/run-20260130_134053-fm5v7v67
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run SrSnO3_test_mace_lammps
wandb: ⭐️ View project at https://wandb.ai/park2971-university-of-minnesota/Test_setup
wandb: 🚀 View run at https://wandb.ai/park2971-university-of-minnesota/Test_setup/runs/fm5v7v67
/users/7/park2971/env_mace/lib/python3.12/site-packages/mace/tools/checkpoint.py:187: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  torch.load(f=checkpoint_info.path, map_location=device),
/users/7/park2971/env_mace/lib/python3.12/site-packages/mace/modules/models.py:84: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  "atomic_numbers", torch.tensor(atomic_numbers, dtype=torch.int64)
/users/7/park2971/env_mace/lib/python3.12/site-packages/torch/jit/_check.py:178: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn(
/users/7/park2971/env_mace/lib/python3.12/site-packages/torch/jit/_check.py:178: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
...

Now I wanted to convert this model to ML-IAP format and run it in LAMMPS.
I installed the release version of LAMMPS and followed the instructions in the mace-docs (https://mace-docs.readthedocs.io/en/latest/guide/lammps_mliap.html#preparing-your-model-for-ml-iap).

Then, I tried:

mace_create_lammps_model test.model --format=mliap --dtype=float64 <<< "2"
lmp -k on g 1 -sf kk -pk kokkos newton on neigh half -in in.lammps
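For context, the pair setup in my input follows the docs' ML-IAP pattern; the following is only a minimal sketch, not my exact in.lammps (the converted file name, element order, data file, and thermostat settings here are placeholders I filled in for illustration):

```
# Minimal ML-IAP pair setup sketch; file and element names are placeholders
units        metal
atom_style   atomic
read_data    data.SrSnO3

# "unified" ML-IAP interface; trailing 0 is the ghostneigh flag
pair_style   mliap unified test.model-mliap_lammps.pt 0
pair_coeff   * * Sr Sn O

timestep     0.001
thermo       100
fix          1 all nvt temp 300.0 300.0 0.1
run          10000
```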

which gives:

Loading mkl version 2022.2.1
/users/7/park2971/env_mace/lib/python3.12/site-packages/torch/cuda/__init__.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/users/7/park2971/env_mace/lib/python3.12/site-packages/e3nn/o3/_wigner.py:10: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  _Jd, _W3j_flat, _W3j_indices = torch.load(os.path.join(os.path.dirname(__file__), 'constants.pt'))
/users/7/park2971/env_mace/lib/python3.12/site-packages/mace/cli/create_lammps_model.py:79: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  model = torch.load(
/users/7/park2971/env_mace/lib/python3.12/site-packages/mace/modules/models.py:84: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  "atomic_numbers", torch.tensor(atomic_numbers, dtype=torch.int64)
Available heads in the model:
1: pt_head
2: Default
Select a head by number (Defaulting to head: 2, press Enter to accept): LAMMPS (10 Dec 2025)
KOKKOS mode with Kokkos version 4.7.1 is enabled
  using double precision
  using view layout = legacy
  will use up to 1 GPU(s) per node
Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 8.0 on device with compute capability 8.6 , this will likely reduce potential performance.
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (0 0 0) to (30.472629 30.472629 30.472629)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  2560 atoms
  read_data CPU = 0.008 seconds
/users/7/park2971/env_mace/lib/python3.12/site-packages/torch/cuda/__init__.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/users/7/park2971/env_mace/lib/python3.12/site-packages/e3nn/o3/_wigner.py:10: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  _Jd, _W3j_flat, _W3j_indices = torch.load(os.path.join(os.path.dirname(__file__), 'constants.pt'))

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE
Your simulation uses code contributions which should be cited:
- KOKKOS package: https://doi.org/10.1145/3731599.3767498
The log file lists these citations in BibTeX format.

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Neighbor list info ...
  update: every = 1 steps, delay = 0 steps, check = yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 8
  ghost atom cutoff = 8
  binsize = 8, bins = 4 4 4
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair mliap/kk, perpetual
      attributes: full, newton on, kokkos_device
      pair build: full/bin/kk/device
      stencil: full/bin/3d
      bin: kk/device
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.001
Per MPI rank memory allocation (min/avg/max) = 1902 | 1902 | 1902 Mbytes
   Step          Temp          E_pair         E_mol          TotEng         Press
         0   0              inf            0              inf           -nan
       100  -nan           -11944.646      0             -nan           -nan
       200  -nan           -11944.646      0             -nan           -nan
       300  -nan           -11944.646      0             -nan           -nan
       400  -nan           -11944.646      0             -nan           -nan
       500  -nan           -11944.646      0             -nan           -nan
       600  -nan           -11944.646      0             -nan           -nan
       700  -nan           -11944.646      0             -nan           -nan
       800  -nan           -11944.646      0             -nan           -nan
       900  -nan           -11944.646      0             -nan           -nan
...

Loop time of 5.55154 on 1 procs for 10000 steps with 2560 atoms

Performance: 155.632 ns/day, 0.154 hours/ns, 1801.301 timesteps/s, 4.611 Matom-step/s
99.4% CPU use with 1 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 3.3301     | 3.3301     | 3.3301     |   0.0 | 59.98
Neigh   | 0          | 0          | 0          |   0.0 |  0.00
Comm    | 0.92416    | 0.92416    | 0.92416    |   0.0 | 16.65
Output  | 0.0042258  | 0.0042258  | 0.0042258  |   0.0 |  0.08
Modify  | 0.91684    | 0.91684    | 0.91684    |   0.0 | 16.52
Other   |            | 0.3762     |            |       |  6.78

Nlocal:           2560 ave        2560 max        2560 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:           6080 ave        6080 max        6080 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:              0 ave           0 max           0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs:       495742 ave      495742 max      495742 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 495742
Ave neighs/atom = 193.64922
Neighbor list builds = 0
Dangerous builds = 0
Total wall time: 0:00:15

The run finishes in a suspiciously short time, and I believe the forces and energies are never actually evaluated.
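The non-finite thermo values are easy to flag programmatically; a small pure-Python sketch (the helper name is mine, not from LAMMPS or MACE) that scans thermo lines for nan/inf:

```python
import math

def thermo_has_nonfinite(lines):
    """Return True if any numeric field in LAMMPS thermo lines is nan or inf."""
    for line in lines:
        for tok in line.split():
            try:
                val = float(tok)  # LAMMPS prints "inf" / "-nan", which float() parses
            except ValueError:
                continue  # skip non-numeric tokens such as column headers
            if not math.isfinite(val):
                return True
    return False

sample = [
    "0    0      inf         0    inf      -nan",
    "100  -nan   -11944.646  0    -nan     -nan",
]
print(thermo_has_nonfinite(sample))  # → True
```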

Has anyone faced a similar issue and resolved it?
