[2025-12-25 02:50:14,762][INFO] Restoring checkpoint from /mnt/data/deploy/home_flow_m12_feed_ctrcvr_sorter_v1/20251217/model.ckpt-73979...
[2025-12-25 02:50:14,818][INFO] Restoring model state from /mnt/data/deploy/home_flow_m12_feed_ctrcvr_sorter_v1/20251217/model.ckpt-73979/model...
[2025-12-25 02:50:14,933][WARNING] /opt/conda/lib/python3.11/site-packages/torch/distributed/checkpoint/planner_helpers.py:418: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
device = getattr(value, "device", None)
[2025-12-25 02:50:15,334][WARNING] /opt/conda/lib/python3.11/site-packages/torch/distributed/checkpoint/planner_helpers.py:418: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
device = getattr(value, "device", None)
[rank0]: Traceback (most recent call last):
[rank0]: File "", line 198, in _run_module_as_main
[rank0]: File "", line 88, in _run_code
[rank0]: File "/opt/conda/lib/python3.11/site-packages/tzrec/train_eval.py", line 57, in
[rank0]: train_and_evaluate(
[rank0]: File "/opt/conda/lib/python3.11/site-packages/tzrec/main.py", line 696, in train_and_evaluate
[rank0]: _train_and_evaluate(
[rank0]: File "/opt/conda/lib/python3.11/site-packages/tzrec/main.py", line 397, in _train_and_evaluate
[rank0]: checkpoint_util.restore_model(
[rank0]: File "/opt/conda/lib/python3.11/site-packages/tzrec/utils/checkpoint_util.py", line 282, in restore_model
[rank0]: model.load_state_dict(state_dict)
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torchrec/distributed/model_parallel.py", line 525, in load_state_dict
[rank0]: return self._load_state_dict(self, state_dict, prefix, strict)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torchrec/distributed/model_parallel.py", line 549, in _load_state_dict
[rank0]: m_keys, u_keys = self._load_state_dict(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torchrec/distributed/model_parallel.py", line 549, in _load_state_dict
[rank0]: m_keys, u_keys = self._load_state_dict(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torchrec/distributed/model_parallel.py", line 549, in _load_state_dict
[rank0]: m_keys, u_keys = self._load_state_dict(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: [Previous line repeated 3 more times]
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torchrec/distributed/model_parallel.py", line 543, in _load_state_dict
[rank0]: return module.load_state_dict(state_dict, strict=strict)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2604, in load_state_dict
[rank0]: load(self, state_dict)
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2592, in load
[rank0]: load(child, child_state_dict, child_prefix) # noqa: F821
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2592, in load
[rank0]: load(child, child_state_dict, child_prefix) # noqa: F821
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2592, in load
[rank0]: load(child, child_state_dict, child_prefix) # noqa: F821
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2597, in load
[rank0]: out = hook(module, incompatible_keys)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torchrec/modules/mc_modules.py", line 231, in _load_state_dict_post_hook
[rank0]: module.validate_state()
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torchrec/modules/mc_modules.py", line 1398, in validate_state
[rank0]: start in self._output_segments_tensor
[rank0]: AssertionError: shard within range [0, 189401] cannot be built out of segements tensor([ 0, 94701, 189402, ..., -1, -1, -1], device='cuda:0')
[rank1]: Traceback (most recent call last):
[rank1]: File "", line 198, in _run_module_as_main
[rank1]: File "", line 88, in _run_code
[rank1]: File "/opt/conda/lib/python3.11/site-packages/tzrec/train_eval.py", line 57, in
[rank1]: train_and_evaluate(
[rank1]: File "/opt/conda/lib/python3.11/site-packages/tzrec/main.py", line 696, in train_and_evaluate
[rank1]: _train_and_evaluate(
[rank1]: File "/opt/conda/lib/python3.11/site-packages/tzrec/main.py", line 397, in _train_and_evaluate
[rank1]: checkpoint_util.restore_model(
[rank1]: File "/opt/conda/lib/python3.11/site-packages/tzrec/utils/checkpoint_util.py", line 282, in restore_model
[rank1]: model.load_state_dict(state_dict)
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torchrec/distributed/model_parallel.py", line 525, in load_state_dict
[rank1]: return self._load_state_dict(self, state_dict, prefix, strict)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torchrec/distributed/model_parallel.py", line 549, in _load_state_dict
[rank1]: m_keys, u_keys = self._load_state_dict(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torchrec/distributed/model_parallel.py", line 549, in _load_state_dict
[rank1]: m_keys, u_keys = self._load_state_dict(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torchrec/distributed/model_parallel.py", line 549, in _load_state_dict
[rank1]: m_keys, u_keys = self._load_state_dict(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^
[rank1]: [Previous line repeated 3 more times]
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torchrec/distributed/model_parallel.py", line 543, in _load_state_dict
[rank1]: return module.load_state_dict(state_dict, strict=strict)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2604, in load_state_dict
[rank1]: load(self, state_dict)
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2592, in load
[rank1]: load(child, child_state_dict, child_prefix) # noqa: F821
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2592, in load
[rank1]: load(child, child_state_dict, child_prefix) # noqa: F821
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2592, in load
[rank1]: load(child, child_state_dict, child_prefix) # noqa: F821
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2597, in load
[rank1]: out = hook(module, incompatible_keys)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torchrec/modules/mc_modules.py", line 231, in _load_state_dict_post_hook
[rank1]: module.validate_state()
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torchrec/modules/mc_modules.py", line 1398, in validate_state
[rank1]: start in self._output_segments_tensor
[rank1]: AssertionError: shard within range [189401, 378801] cannot be built out of segements tensor([ 0, 94701, 189402, ..., -1, -1, -1], device='cuda:1')
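Both ranks fail the same post-load check: torchrec's managed-collision (ZCH) module validates, after load_state_dict, that the current run's row-wise shard boundaries can be rebuilt from the _output_segments_tensor stored in the checkpoint. Here the current job uses two shards, [0, 189401] and [189401, 378801], while the saved segments begin 0, 94701, 189402, ... (boundaries from a different sharding, most likely a run with more ranks), so the boundaries do not line up. Below is a minimal sketch of that kind of boundary check, not the torchrec source, using only the values visible in the log (the elided middle of the saved tensor is left out; the trailing -1 padding is kept):

```python
import torch

# Values copied from the AssertionError messages above. The "..." in the logged
# tensor is elided; only the visible entries and the -1 padding are used here.
saved_segments = torch.tensor([0, 94701, 189402, -1, -1, -1])

# Row-wise shard ranges reported by rank0 and rank1 in the current 2-rank run.
current_shards = [(0, 189401), (189401, 378801)]

# Sketch of the validate_state() idea: both boundaries of a shard must appear
# among the saved segment offsets for the shard to be reconstructable.
for start, end in current_shards:
    ok = bool((saved_segments == start).any() and (saved_segments == end).any())
    print(f"shard [{start}, {end}] representable from saved segments: {ok}")
# Both shards print False, which matches the two AssertionErrors above.
```

If that reading is right, the mismatch comes from restoring the checkpoint under a different sharding plan (e.g. a different number of GPUs) than the one it was saved with, rather than from a corrupted checkpoint.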
I20251225 02:51:52.839771 65 fg_handler.cc:1416] Destroy FgHandler (1)
I20251225 02:51:52.863746 64 fg_handler.cc:1416] Destroy FgHandler (1)
[rank0]:[W1225 02:51:53.946349289 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
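The ProcessGroupNCCL warning above is a side effect of the crash rather than a separate problem: the worker exits through the exception before any shutdown code runs. In an orderly exit the process group is torn down explicitly; a generic sketch (plain PyTorch, not tzrec code):

```python
import torch.distributed as dist

# Generic end-of-job cleanup. The warning above appears because the worker
# exited through the AssertionError before reaching a teardown like this.
if dist.is_available() and dist.is_initialized():
    dist.destroy_process_group()
```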
E1225 02:51:58.249000 15 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 64) of binary: /opt/conda/bin/python3.11
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 357, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 143, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
tzrec.train_eval FAILED
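One way to confirm that the checkpoint carries segment offsets from a different sharding is to consolidate it offline and inspect the saved managed-collision tensors. This is a hypothetical sketch, assuming the model subdirectory referenced in the restore log is a torch.distributed.checkpoint (DCP) save and that dcp_to_torch_save is available in the installed torch; the "output_segments" key filter is an assumption based on the attribute name in the traceback.

```python
import torch
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

# Assumption: this directory (from the restore log above) is a DCP save.
CKPT = "/mnt/data/deploy/home_flow_m12_feed_ctrcvr_sorter_v1/20251217/model.ckpt-73979/model"
OUT = "/tmp/model_state.pt"

# Consolidate the sharded checkpoint into a single torch.save file, then load
# it on CPU. For a large embedding table this needs enough host memory.
dcp_to_torch_save(CKPT, OUT)
state = torch.load(OUT, map_location="cpu")

def walk(obj, prefix=""):
    # The consolidated state dict may be flat or nested; print any tensor whose
    # key mentions "output_segments" so it can be compared with the shard
    # boundaries reported by the failing run.
    if isinstance(obj, dict):
        for key, value in obj.items():
            walk(value, f"{prefix}{key}.")
    elif torch.is_tensor(obj) and "output_segments" in prefix:
        print(prefix.rstrip("."), obj)

walk(state)
```

If the printed segments show the 94701-spaced boundaries rather than 189401-spaced ones, the restore is being attempted under a different sharding plan than the one that produced the checkpoint.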