Snapshot for preemption by knshnb · Pull Request #155 · pfnet/pfrl

knshnb · 2021-08-30T10:57:11Z

pfrl currently does not support snapshotting a training run, which is important on job systems that can preempt jobs, such as Kubernetes.
This PR adds support for saving and loading snapshots, including the replay buffer.

Done

save & load snapshot
- agent, replay buffer, step_offset, max_score
tested locally with: python examples/gym/train_dqn_gym.py --env CartPole-v0 --steps=5000 --eval-n-runs=10 --eval-interval=1000 --load_snapshot --checkpoint-freq=1000
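
To make the list above concrete, here is a minimal sketch of saving and loading a snapshot that bundles the agent, the replay buffer, step_offset, and max_score. It is not the code in this PR. StubAgent and StubReplayBuffer are stand-ins that only mimic a save/load interface, and the directory layout, the meta.pkl file, and the write-to-temp-then-rename step are assumptions added for illustration.

```python
import os
import pickle
import shutil
import tempfile


class StubAgent:
    """Hypothetical stand-in for an agent exposing save(dirname)/load(dirname)."""

    def __init__(self, weights=None):
        self.weights = weights or [0.0]

    def save(self, dirname):
        os.makedirs(dirname, exist_ok=True)
        with open(os.path.join(dirname, "model.pkl"), "wb") as f:
            pickle.dump(self.weights, f)

    def load(self, dirname):
        with open(os.path.join(dirname, "model.pkl"), "rb") as f:
            self.weights = pickle.load(f)


class StubReplayBuffer:
    """Hypothetical stand-in for a buffer exposing save(filename)/load(filename)."""

    def __init__(self):
        self.memory = []

    def save(self, filename):
        with open(filename, "wb") as f:
            pickle.dump(self.memory, f)

    def load(self, filename):
        with open(filename, "rb") as f:
            self.memory = pickle.load(f)


def save_snapshot(outdir, agent, replay_buffer, step, max_score):
    """Write everything into a temp dir, then rename it into place.

    The rename keeps a half-written snapshot from being mistaken for a
    complete one if the job is preempted while saving (an assumption of
    this sketch, not something stated in the PR).
    """
    os.makedirs(outdir, exist_ok=True)
    tmp_dir = tempfile.mkdtemp(dir=outdir, prefix="tmp_snapshot_")
    agent.save(os.path.join(tmp_dir, "agent"))
    replay_buffer.save(os.path.join(tmp_dir, "replay_buffer.pkl"))
    with open(os.path.join(tmp_dir, "meta.pkl"), "wb") as f:
        pickle.dump({"step_offset": step, "max_score": max_score}, f)
    final_dir = os.path.join(outdir, f"{step}_snapshot")  # hypothetical naming
    if os.path.exists(final_dir):
        shutil.rmtree(final_dir)
    os.rename(tmp_dir, final_dir)
    return final_dir


def load_snapshot(snapshot_dir, agent, replay_buffer):
    """Restore agent and buffer in place; return (step_offset, max_score)."""
    agent.load(os.path.join(snapshot_dir, "agent"))
    replay_buffer.load(os.path.join(snapshot_dir, "replay_buffer.pkl"))
    with open(os.path.join(snapshot_dir, "meta.pkl"), "rb") as f:
        meta = pickle.load(f)
    return meta["step_offset"], meta["max_score"]
```

On resume, the returned step_offset and max_score would be passed back into the training loop so that step counting and best-score tracking continue from where the snapshot left off.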

Not Done

apply the change to examples/* (or is it enough to create one separate example for snapshots?)
log (scores.txt)
test script

Could you check the current implementation strategy and give some ideas on how to implement the above points?

muupan · 2021-09-03T05:50:27Z

knshnb · 2021-09-14T10:54:00Z

Thank you for the detailed comments!
Below is a memo of my discussion with @muupan san.

What I skip in this PR

rng-related things
recurrent states of the model and the env's internal state (needed only when resuming in the middle of a training episode)
other internal states of Agent
Agent statistics

What I implement

save a snapshot instead of calling save_agent, but only when take_resumable_snapshot is True
save steps and episodes in a file (such as checkpoint.txt)
- Agent statistics should be included here in the future
restore max_score from scores.txt
- include scores.txt in the snapshot to handle the case where eval_interval != checkpoint_freq
Add test in pfrl/examples_tests/atari/reproduction/test_dqn.sh
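
As an illustration of the "restore max_score from scores.txt" item, here is a minimal sketch. It assumes scores.txt is a tab-separated file with a header row and a "mean" column that holds the evaluation score to maximize. The function name read_max_score and the choice of column are assumptions, not the code in this PR. A missing file or a file with only a header returns None.

```python
import csv
import os


def read_max_score(scores_path, column="mean"):
    """Return the largest value in `column`, or None if there are no data rows.

    Handles a missing file, an empty file, and a file that has only a header.
    """
    if not os.path.exists(scores_path):
        return None
    with open(scores_path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        # Skip rows where the column is absent or empty.
        values = [float(row[column]) for row in reader if row.get(column)]
    return max(values) if values else None
```

A caller would fall back to its usual initial value (for example negative infinity) when None is returned, so a run that was preempted before its first evaluation still resumes correctly.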

knshnb · 2021-09-14T11:15:20Z

I ran the experiment that you suggested with the following command:
python examples/atari/reproduction/dqn/train_dqn.py --env SpaceInvadersNoFrameskip-v4 --steps 10000000 --checkpoint-freq 2000000 --save-snapshot --load-snapshot --seed ${SEED} --exp-id ${SEED}

For each seed, I ran a second training run that resumed from the 6000000-step snapshot. As shown in the graph below, the score curves after resuming from the snapshot were roughly the same as those of the runs without resumption.

In this experiment, each snapshot was about 6.8 GB and took around 60-100 s to save to an NFS server in my environment. The time taken to save each snapshot is recorded in snapshot_history.txt.
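
For readers who want to collect similar timings, here is a minimal sketch of measuring how long a snapshot save takes and appending the result to a history file. The actual format of snapshot_history.txt is not shown in this thread, so the two-column tab-separated layout and the function name timed_snapshot are hypothetical.

```python
import os
import time


def timed_snapshot(save_fn, history_path, step):
    """Run save_fn(), then append (step, elapsed seconds) to history_path."""
    start = time.perf_counter()
    save_fn()
    elapsed = time.perf_counter() - start

    # Write the header only once, so a resumed run keeps appending to the
    # same file without duplicating it.
    needs_header = (
        not os.path.exists(history_path) or os.path.getsize(history_path) == 0
    )
    with open(history_path, "a") as f:
        if needs_header:
            f.write("step\tsave_seconds\n")
        f.write(f"{step}\t{elapsed:.3f}\n")
    return elapsed
```

Here save_fn is any zero-argument callable that performs the save, for example a lambda wrapping the snapshot-writing function.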

muupan · 2021-09-14T11:34:10Z

/test

pfn-ci-bot · 2021-09-14T11:34:15Z

Successfully created a job for commit dde7ebf:

Dashboard for commit dde7ebf

knshnb · 2021-09-14T12:17:05Z

Sorry, I fixed the linter problem

knshnb · 2021-09-15T09:27:04Z

(I forgot to write this)
Memo: Saving snapshots requires about twice as much CPU memory (~30 GB in the above experiment).

knshnb · 2022-02-09T06:38:36Z

Hi! Is there any action required for this PR to be merged?

knshnb added 4 commits August 30, 2021 10:21

Save agent and replay_buffer as snapshot

78f5ef7

Add load_snapshot option to train_dqn_gym.py

5070ddc

Temporarily fix output directory

27bf497

Format

7cf0dd9

github-actions bot requested a review from ummavi August 30, 2021 10:57

knshnb requested a review from muupan August 31, 2021 09:41

muupan removed the request for review from ummavi September 3, 2021 03:53

knshnb added 11 commits September 9, 2021 10:27

Revert examples/gym/train_dqn_gym.py

e6bdd04

Snapshot only when take_resumable_snapshot is True

6200083

Refactor

2a4b12c

Do not write header in scores.txt if exists

ab9abb0

Add episode_offset to train_agent_with_evaluation

aa48894

Add comment of take_resumable_snapshot

8e9fcaf

Save scores.txt and snapshot_history in snapshot

e72a424

Add save & load snapshot option in atari example

c975ffc

Fix bug when scores.txt has only header

c4ed3df

Add snapshot test

ca5c450

Add warnings of agent analytics in docstring

dde7ebf

knshnb changed the title ~~[WIP] Snapshot for preemption~~ Snapshot for preemption Sep 14, 2021

Apply isort

2d4e67d

knshnb wants to merge 16 commits into pfnet:master from knshnb:snapshot

muupan commented Sep 3, 2021 • edited by knshnb

General comments on resumability

Things that need to be snapshotted for resumability except randomness:

RNG-related things that need to be snapshotted for complete resumability:

Specific comments on this PR
