
Features use ZCH: full training on a single machine with 4 GPUs works, but incremental training on 2 GPUs errors out; incremental training on 1 GPU or 4 GPUs can continue #360


Description

@chengaofei

[2025-12-25 02:50:14,762][INFO] Restoring checkpoint from /mnt/data/deploy/home_flow_m12_feed_ctrcvr_sorter_v1/20251217/model.ckpt-73979...
[2025-12-25 02:50:14,818][INFO] Restoring model state from /mnt/data/deploy/home_flow_m12_feed_ctrcvr_sorter_v1/20251217/model.ckpt-73979/model...
[2025-12-25 02:50:14,933][WARNING] /opt/conda/lib/python3.11/site-packages/torch/distributed/checkpoint/planner_helpers.py:418: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
device = getattr(value, "device", None)

[2025-12-25 02:50:15,334][WARNING] /opt/conda/lib/python3.11/site-packages/torch/distributed/checkpoint/planner_helpers.py:418: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
device = getattr(value, "device", None)

[rank0]: Traceback (most recent call last):
[rank0]: File "", line 198, in _run_module_as_main
[rank0]: File "", line 88, in _run_code
[rank0]: File "/opt/conda/lib/python3.11/site-packages/tzrec/train_eval.py", line 57, in
[rank0]: train_and_evaluate(
[rank0]: File "/opt/conda/lib/python3.11/site-packages/tzrec/main.py", line 696, in train_and_evaluate
[rank0]: _train_and_evaluate(
[rank0]: File "/opt/conda/lib/python3.11/site-packages/tzrec/main.py", line 397, in _train_and_evaluate
[rank0]: checkpoint_util.restore_model(
[rank0]: File "/opt/conda/lib/python3.11/site-packages/tzrec/utils/checkpoint_util.py", line 282, in restore_model
[rank0]: model.load_state_dict(state_dict)
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torchrec/distributed/model_parallel.py", line 525, in load_state_dict
[rank0]: return self._load_state_dict(self, state_dict, prefix, strict)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torchrec/distributed/model_parallel.py", line 549, in _load_state_dict
[rank0]: m_keys, u_keys = self._load_state_dict(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torchrec/distributed/model_parallel.py", line 549, in _load_state_dict
[rank0]: m_keys, u_keys = self._load_state_dict(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torchrec/distributed/model_parallel.py", line 549, in _load_state_dict
[rank0]: m_keys, u_keys = self._load_state_dict(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: [Previous line repeated 3 more times]
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torchrec/distributed/model_parallel.py", line 543, in _load_state_dict
[rank0]: return module.load_state_dict(state_dict, strict=strict)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2604, in load_state_dict
[rank0]: load(self, state_dict)
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2592, in load
[rank0]: load(child, child_state_dict, child_prefix) # noqa: F821
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2592, in load
[rank0]: load(child, child_state_dict, child_prefix) # noqa: F821
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2592, in load
[rank0]: load(child, child_state_dict, child_prefix) # noqa: F821
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2597, in load
[rank0]: out = hook(module, incompatible_keys)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torchrec/modules/mc_modules.py", line 231, in _load_state_dict_post_hook
[rank0]: module.validate_state()
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torchrec/modules/mc_modules.py", line 1398, in validate_state
[rank0]: start in self._output_segments_tensor
[rank0]: AssertionError: shard within range [0, 189401] cannot be built out of segements tensor([ 0, 94701, 189402, ..., -1, -1, -1], device='cuda:0')
[rank1]: Traceback (most recent call last):
[rank1]: (identical to the rank0 traceback above up to the final frames:)
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torchrec/modules/mc_modules.py", line 1398, in validate_state
[rank1]: start in self._output_segments_tensor
[rank1]: AssertionError: shard within range [189401, 378801] cannot be built out of segements tensor([ 0, 94701, 189402, ..., -1, -1, -1], device='cuda:1')
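The two assertions together are informative: rank0 needs shard boundaries [0, 189401], rank1 needs [189401, 378801], while the checkpoint's segment tensor starts 0, 94701, 189402, .... Assuming block-wise sharding with a block size of ceil(hash_size / world_size) and a total ZCH size of 378801 (both inferred from the shard ranges above, not stated in the logs), the 4-GPU and 2-GPU runs round their shard boundaries differently, so the 2-GPU boundary 189401 never appears among the 4-GPU segment boundaries. A minimal sketch of that arithmetic:

```python
# Sketch (not tzrec/torchrec API): reproduce the shard-boundary arithmetic.
# hash_size = 378801 is inferred from the shard ranges in the assertions.
import math

def block_boundaries(hash_size: int, world_size: int) -> list[int]:
    # Block-wise sharding: each shard except possibly the last gets
    # ceil(hash_size / world_size) rows.
    block = math.ceil(hash_size / world_size)
    return [min(i * block, hash_size) for i in range(world_size + 1)]

print(block_boundaries(378801, 4))  # [0, 94701, 189402, 284103, 378801]
print(block_boundaries(378801, 2))  # [0, 189401, 378801]
# 189401 is not a 4-way boundary, so restoring the 4-GPU ZCH checkpoint
# onto 2 shards fails validate_state(). The 1-way boundaries [0, 378801]
# and the 4-way boundaries are all contained in the 4-way set, which
# would explain why 1-GPU and 4-GPU incremental training load fine.
```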
I20251225 02:51:52.839771 65 fg_handler.cc:1416] Destroy FgHandler (1)
I20251225 02:51:52.863746 64 fg_handler.cc:1416] Destroy FgHandler (1)
[rank0]:[W1225 02:51:53.946349289 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E1225 02:51:58.249000 15 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 64) of binary: /opt/conda/bin/python3.11
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 357, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 143, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

tzrec.train_eval FAILED
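Under the same assumption, a quick pre-flight check (a sketch, not part of tzrec) that tells whether a target world size can restore a ZCH checkpoint written at another world size:

```python
import math

def boundaries(hash_size: int, world_size: int) -> set[int]:
    block = math.ceil(hash_size / world_size)
    return {min(i * block, hash_size) for i in range(world_size + 1)}

def can_restore(hash_size: int, src_ws: int, dst_ws: int) -> bool:
    # Restore works when every boundary the destination needs already
    # exists as a segment boundary in the source checkpoint.
    return boundaries(hash_size, dst_ws) <= boundaries(hash_size, src_ws)

print(can_restore(378801, src_ws=4, dst_ws=2))  # False -> this issue
print(can_restore(378801, src_ws=4, dst_ws=1))  # True
print(can_restore(378801, src_ws=4, dst_ws=4))  # True
```

If this reading is right, choosing a ZCH size divisible by every world size you plan to train with (here, a multiple of 4) would make all of these checks pass, though that is a workaround, not a fix for the resharding itself.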
