[rrfs-mpas-jedi] Prep ic para #1279
base: rrfs-mpas-jedi
Conversation
datadep_prod = f'''\n <datadep age="00:05:00"><cyclestr offset="-{cyc_interval}:00:00">&COMROOT;/&NET;/&rrfs_ver;/&RUN;.@Y@m@d/@H/fcst/&WGF;/</cyclestr><cyclestr>mpasout.@Y-@m-@d_@H.00.00.nc</cyclestr></datadep>'''

datadep_spinup = f'''\n <taskdep task="fcst_spinup" cycle_offset="-1:00:00"/>'''
if spinup_mode == 0: # no parallel spinup cycles
Do you know what this is for?
@hu5970 We use one prep_ic task for the following spin-up situations:
cold, spinup_mode = 0 # regular prod cycles, i.e. no spin-up cycles in an experiment
cold, spinup_mode = 1 # spin up, cold start
warm, spinup_mode = -1 # spin up, continue cycling
warm, spinup_mode = -1, prod_switch # prod switching from spinup
warm, spinup_mode = -1, regular # prod parallel to spinup, continue cycling
Check this slide for more details:
https://docs.google.com/presentation/d/1HPx2LzX8Hf9Imztl4OpyXdyTTWoN4hBOId4AP7bhyYM/edit?slide=id.g332010f5a43_4_374#slide=id.g332010f5a43_4_374
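To make the list above concrete, here is a minimal hand-written sketch of how one task could branch on these combinations; the start_type variable and the echo messages are illustrative only, not the actual prep_ic logic:
case "${start_type:-cold},${spinup_mode:-0}" in
  cold,0)  echo "regular prod cycle, no spin-up cycles in this experiment" ;;
  cold,1)  echo "spin-up cycle, cold start" ;;
  warm,-1) echo "continue cycling: spin-up, prod switched from spin-up, or prod parallel to spin-up" ;;
  *)       echo "unsupported start_type/spinup_mode combination" >&2; exit 1 ;;
esac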
spinup_mode needs more discussion. We learned from RRFSv1 that the cycle mode can be very complex, and we need a good parameter to control all the possible cycle tasks: det spin-up, det prod, enkf prod, enkf spin-up, blending, etc. Could we define a cycle_mode (0-99) to help?
This workflow is different from RRFSv1, as it will always support coldstart-only forecast experiments and non-spinup experiments. I think we are good for now. We can definitely talk more when we run into any issues.
scripts/exrrfs_prep_ic.sh
Outdated
#
echo "===== CMDFILE ====="
cat "$CMDFILE"
xargs -I {} -P "${SLURM_NTASKS}" sh -c '{}' < "${CMDFILE}"
SLURM_NTASKS is a SLURM-specific variable. Do we have a better way to set the number of parallel cores?
@chunhuazhou NTASKS is expected to be a defined env variable from the job card.
In the rocoto workflow, it is already defined in launch.sh
rrfs-workflow/workflow/sideload/launch.sh
Line 15 in 3ce4c70
export NTASKS=${SLURM_NTASKS}
Also, if we want to move forward with running the serial copies in parallel, we would want to be generic and not bind all the logic to SLURM only.
We already have a generic rank_run tool in rrfs-workflow that could be used. I can help with this.
But at this moment, I think 30 copies may work well if we use --exclusive or request sufficient memory for each task.
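As a hedged sketch of a scheduler-agnostic setting (the fallback chain below is an assumption, not existing workflow code), the launch script could do:
# Prefer an already-exported NTASKS, then SLURM's task count, then the local core count.
export NTASKS="${NTASKS:-${SLURM_NTASKS:-$(nproc)}}"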
Should use "NTASKS" here.
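That change would just swap the variable in the xargs line quoted above, roughly:
xargs -I {} -P "${NTASKS}" sh -c '{}' < "${CMDFILE}"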
rank_run needs an extra library. Using xargs should be good enough for now.
@hu5970 What do you mean rank_run needs an extra library?
We should not develop a script that works only with SLURM.
xargs is a Linux command, not a SLURM one. It works on WCOSS2. Also, WCOSS2 has its own way to run parallel commands within one node.
Ah, thanks for the reminder. I did not pay much attention to this part. I had thought it was similar to this:
rrfs-workflow/scripts/exrrfs_ensmean.sh
Line 89 in 3ce4c70
srun --multi-prog "${CMDFILE}".multi
With that said, xargs cannot distribute tasks across multiple nodes.
For example, when we run NA3km or global-15-3km domains or more ensemble members with few cores, we may need to use 2 or more nodes to copy files in parallel, because the files are much larger and each copy needs more memory.
rank_run is a simple replacement for NCO's CFP on non-NCO machines. It does real parallelism, can use multiple nodes, and has no other library dependencies.
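For comparison, a hedged sketch of the srun --multi-prog route mentioned above (the file contents and copy paths are made up): each line of the config maps one SLURM task rank to a command, and srun can place those ranks on any node in the allocation.
# Build a --multi-prog config, one "rank command" line per member copy.
cat > "${CMDFILE}.multi" << 'EOF'
0 cp /source/mem001/init.nc /dest/mem001/
1 cp /source/mem002/init.nc /dest/mem002/
2 cp /source/mem003/init.nc /dest/mem003/
EOF
srun --ntasks=3 --multi-prog "${CMDFILE}.multi"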
scripts/exrrfs_prep_ic.sh
Outdated
exit 0
done
#
This runs after the surface cycle, right? If so, the surface cycle will fail.
@chunhuazhou Currently, the ensemble workflow launches 30 prep_ic tasks, right? If so, I think that should be as fast as the changes proposed in this PR. I guess the current issue (including the dead tasks) is that the copy process needs sufficient memory to complete successfully, but prep_ic does not request sufficient memory or run exclusively on Ursa (Ursa is very aggressive about sharing a node among different tasks).
Using one node to copy the files is a waste of resources. We need to avoid such a setup.
@hu5970 we don't have to use one node to do the copy. Also, I think initially we plan to let …
There is one drawback to "manually" doing 30 parallel copies on 1 or 2 nodes: different HPCs have different cores/memory per node, so 30 parallel "manual" copies may work on Ursa (192 cores) but may NOT work on Hera (40 cores).
We do not need to worry about this. xargs runs commands in parallel up to the core count given after -P, and then runs the rest of the commands once the first set finishes. So just set the -P value to the number of cores per node.
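A small hedged illustration of that behavior (the source and destination paths are made up): each line of CMDFILE is one copy command, and xargs keeps at most NTASKS of them running at once on the local node, starting the next command as soon as a slot frees up.
for i in $(seq -f "%03g" 1 30); do
  echo "cp /source/mem${i}/init.nc /dest/mem${i}/init.nc"
done > "${CMDFILE}"
xargs -I {} -P "${NTASKS}" sh -c '{}' < "${CMDFILE}"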
The surface update and soil surgery steps are all small single-core programs, and most of them actually run for the deterministic cycle only.
scripts/exrrfs_prep_ic.sh
Outdated
export CMDFILE="${DATA}/script_prep_ic_0.sh"
fi

mkdir -p "$(dirname "$CMDFILE")"
${DATA} should always be available in the ex-script, so there is no need to mkdir -p here.
scripts/exrrfs_prep_ic.sh
Outdated
fi

exit 0
done
We need to add two spaces to indent lines 42-171 correctly.
for memdir in "${mem_list[@]}"; do
# Determine path
if [[ ${#memdir} -gt 1 ]]; then
umbrella_prep_ic_data="${UMBRELLA_PREP_IC_DATA}${memdir}"
Suggest changing umbrella_prep_ic_data to umbrella_prep_ic_mem to better distinguish it from UMBRELLA_PREP_IC_DATA
| if [[ "${ENS_SIZE:-0}" -gt 2 ]]; then | ||
| mapfile -t mem_list < <(printf "/mem%03d\n" $(seq 1 "$ENS_SIZE")) | ||
| else | ||
| mem_list=("/") # if determinitic |
This will create a double / situation in line 64:
thisfile=${COMINrrfs}/${RUN}.${PDY}/${cyc}/ic/${WGF}${memdir}/init.nc
It will generate something like ..../ic/det//init.nc, which should be avoided per the NCO standards.
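A quick hand-traced illustration (the WGF value is just for demonstration):
WGF=det
memdir="/"   # the deterministic entry from mem_list=("/")
echo "ic/${WGF}${memdir}/init.nc"   # prints ic/det//init.nc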
scripts/exrrfs_prep_ic.sh
Outdated
if [[ ${#memdir} -gt 1 ]]; then
umbrella_prep_ic_data="${UMBRELLA_PREP_IC_DATA}${memdir}"
mkdir -p "${COMOUT}/prep_ic/${WGF}${memdir}"
pid=$((10#${memdir: -2}-1))
I would think it is more straightforward to generate a list of member numbers first and add "/" only when it is needed.
Let me try a case and post my example here.
# Determine path
if [[ ${#memdir} -gt 1 ]]; then
umbrella_prep_ic_data="${UMBRELLA_PREP_IC_DATA}${memdir}"
mkdir -p "${COMOUT}/prep_ic/${WGF}${memdir}"
Why do we create COMOUT directories for each member's prep_ic? Will we save data there?
: > "$CMDFILE"

# Create directory safely
mkdir -p "${umbrella_prep_ic_data}"
@chunhuazhou FYI, here is an alternate example for lines 34-57:
if (( "${ENS_SIZE:-0}" > 1 )); then
mapfile -t mem_list < <(printf "%03d\n" $(seq 1 "$ENS_SIZE"))
else
mem_list=("000") # if determinitic
fi
for index in "${mem_list[@]}"; do
# Determine path
if (( 10#${index} == 0 )); then
memdir=""
umbrella_prep_ic_mem="${UMBRELLA_PREP_IC_DATA}"
export CMDFILE="${DATA}/script_prep_ic_0.sh"
else
memdir="/mem${index}"
umbrella_prep_ic_mem="${UMBRELLA_PREP_IC_DATA}${memdir}"
mkdir -p "${umbrella_prep_ic_mem}"
pid=$((10#${index}-1))
export CMDFILE="${DATA}/script_prep_ic_${pid}.sh"
fi
echo $CMDFILE, $memdir
done
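Tracing the example by hand (not output from an actual run), with ENS_SIZE=3 the final echo prints one command file per member:
${DATA}/script_prep_ic_0.sh, /mem001
${DATA}/script_prep_ic_1.sh, /mem002
${DATA}/script_prep_ic_2.sh, /mem003
and in the deterministic case (ENS_SIZE unset or 1) it prints only ${DATA}/script_prep_ic_0.sh with an empty memdir.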
@@ -81,6 +109,7 @@ fi
#
The surface cycle has no relation to the background copy. Better to separate them into two sections?
DESCRIPTION OF CHANGES:
The ensemble prep_ic takes too much time in the real-time runs on Ursa (as long as ~10 min per member in some cases) and even leaves dead prep_ic jobs after finishing some members. This PR adds the capability to run prep_ic in parallel for the ensemble members and significantly reduces the run time, to less than 5 min for all members. This has been tested in retro mode and is now in the real-time run.