Conversation

@chunhuazhou
Collaborator

DESCRIPTION OF CHANGES:

The ensemble prep_ic is taking too much time for the real-time runs on Ursa (as long as ~10 min per member in some cases), and it even causes dead prep_ic jobs after finishing some members. This PR adds the capability to run prep_ic in parallel across ensemble members, which significantly reduces the run time to less than 5 min for all members. This has been tested in retro mode and is now in the real-time run.
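For reference, a minimal sketch of this approach, assuming one copy command is written per member into a command file and then throttled with xargs (the variable and path names here are illustrative, not necessarily the exact ones used in this PR):

#!/bin/bash
# Sketch only: build one command file with one prep_ic copy command per
# ensemble member, then let xargs keep up to NTASKS copies running at once.
CMDFILE="${DATA}/cmdfile_prep_ic"     # illustrative name
: > "${CMDFILE}"
for imem in $(seq 1 "${ENS_SIZE}"); do
  memdir=$(printf "mem%03d" "${imem}")
  echo "cp -p ${source_ic_dir}/${memdir}/init.nc ${umbrella_dir}/${memdir}/init.nc" >> "${CMDFILE}"
done
xargs -I {} -P "${NTASKS}" sh -c '{}' < "${CMDFILE}"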

@chunhuazhou chunhuazhou requested a review from hu5970 January 21, 2026 18:12
datadep_prod = f'''\n <datadep age="00:05:00"><cyclestr offset="-{cyc_interval}:00:00">&COMROOT;/&NET;/&rrfs_ver;/&RUN;.@Y@m@d/@H/fcst/&WGF;/</cyclestr><cyclestr>mpasout.@Y-@m-@d_@H.00.00.nc</cyclestr></datadep>'''

datadep_spinup = f'''\n <taskdep task="fcst_spinup" cycle_offset="-1:00:00"/>'''
if spinup_mode == 0: # no parallel spinup cycles
Contributor
Do you know what this is for?

Contributor

@guoqing-noaa guoqing-noaa Jan 22, 2026
@hu5970 We use one prep_ic task for the following spin-up situations:

cold, spinup_mode = 0    # regular prod cycles, i.e., no spin-up cycles in an experiment

cold, spinup_mode = 1    # spin-up, cold start
warm, spinup_mode = -1   # spin-up, continue cycling

warm, spinup_mode = -1, prod_switch   # prod switching from spinup
warm, spinup_mode = -1, regular       # prod parallel to spinup, continue cycling

Check this slide for more details:
https://docs.google.com/presentation/d/1HPx2LzX8Hf9Imztl4OpyXdyTTWoN4hBOId4AP7bhyYM/edit?slide=id.g332010f5a43_4_374#slide=id.g332010f5a43_4_374

Contributor

spinup_mode needs more discussion. We learned from RRFSv1 that the cycle mode could be very complex, and we need a good parameter to control all the possible cycle tasks: det spin-up, det prod, enkf prod, enkf spin-up, blending, etc. Let's think about whether we can define a cycle_mode (0-99) to help.

Contributor

This workflow is different from RRFSv1, as it will always support coldstart-only forecast experiments and non-spinup experiments. I think we are good for now. We can definitely talk more when we run into any issues.

#
echo "===== CMDFILE ====="
cat "$CMDFILE"
xargs -I {} -P "${SLURM_NTASKS}" sh -c '{}' < "${CMDFILE}"
Contributor

SLURM_NTASKS is a SLURM-specific variable. Is there a better way to set the number of parallel cores?

Contributor

@guoqing-noaa guoqing-noaa Jan 22, 2026

@chunhuazhou NTASKS is expected to be an environment variable defined by the job card.
In the Rocoto workflow, it is already defined in launch.sh:

export NTASKS=${SLURM_NTASKS}
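A hedged sketch of a scheduler-agnostic fallback for NTASKS (the PBS branch is an assumption, not from launch.sh or this PR):

# Sketch: derive NTASKS from whichever scheduler populated the environment.
if [[ -n "${SLURM_NTASKS:-}" ]]; then
  export NTASKS="${SLURM_NTASKS}"
elif [[ -n "${PBS_NODEFILE:-}" ]]; then
  export NTASKS=$(wc -l < "${PBS_NODEFILE}")   # PBS: typically one line per allocated task slot
else
  export NTASKS=1                              # serial fallback
fi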

Contributor

Also, if we want to move forward with running the serial copies in parallel, we would want to be generic, not binding all the logic to SLURM only. We already have a generic rank_run tool in rrfs-workflow that can be used. I can help with this.

But at this moment, I think 30 copies may work well if we use --exclusive or request sufficient memory for each task.
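For illustration, the kind of job-card settings this suggests (hypothetical values, not taken from this PR):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive       # option 1: take the whole node for the parallel copies
##SBATCH --mem=64G        # option 2: or request an explicit memory amount instead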

Contributor

Should use "NTASKS" here.

Contributor

rank_run needs an extra library. Using xargs should be good enough for now.

Contributor

@hu5970 What do you mean rank_run needs an extra library?

We should not develop a script that works only with SLURM.

Contributor

xargs is a Linux command, not tied to SLURM. It works on WCOSS2. Also, WCOSS2 has its own way to run parallel commands within one node.

Contributor

Ah, thanks for the reminder. I did not pay much attention to this part. I had thought it was similar to this:

srun --multi-prog "${CMDFILE}".multi

With that said, xargs cannot distribute tasks across multiple nodes. For example, when we run NA3km or global-15-3km or more ensemble members with few cores, we may need 2 or more nodes to copy files in parallel, because the files are much larger and each copy needs more memory.

rank_run is a simple replacement for NCO's CFP on non-NCO machines. It does real parallelism, can use multiple nodes, and has no other library dependencies.
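For comparison, a hedged SLURM-only sketch that would let the same command file span all allocated nodes (assumes a recent SLURM where srun supports --exact; rank_run remains the scheduler-agnostic option):

# Sketch: run each line of CMDFILE as its own 1-task job step; SLURM may place
# the steps on any allocated node, and -P limits how many run concurrently.
xargs -I {} -P "${NTASKS}" srun --exact -N1 -n1 sh -c '{}' < "${CMDFILE}"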


exit 0
done
#
Contributor

This runs after the surface cycle, right? If so, the surface cycle will fail.

@guoqing-noaa
Contributor

guoqing-noaa commented Jan 22, 2026

@chunhuazhou Currently, the ensemble workflow will launch 30 prep_ic tasks, right? If so, I think it should be as fast as the changes proposed in this PR.

I guess the current issue (including the dead tasks) is that the copy process needs sufficient memory to complete successfully, but prep_ic does not request sufficient memory or exclusive access on Ursa (Ursa is very aggressive about sharing a node among different tasks as much as possible).

@chunhuazhou chunhuazhou marked this pull request as draft January 22, 2026 14:36
@hu5970
Contributor

hu5970 commented Jan 22, 2026

Using one node to copy the files is a waste of resources. We need to avoid such a setup.

@guoqing-noaa
Contributor

guoqing-noaa commented Jan 22, 2026

Using one node to copy the files is a waste of resources. We need to avoid such a setup.

@hu5970 We don't have to use one node to do the copy. We can just request enough memory, as the changes in this PR do, and then let SLURM/PBS manage how to allocate resources efficiently.

Also, I think we initially planned to let prep_ic do more tasks, such as surface updating/soilSurgery, etc. So using one node is not that bad in practice, especially if it runs fast.

@guoqing-noaa
Contributor

There is one drawback to "manually" doing 30 parallel copies on 1 or 2 nodes. Different HPCs have different core and memory counts per node. 30 parallel "manual" copies may work on Ursa (192 cores) but may NOT work on Hera (40 cores).

@hu5970
Contributor

hu5970 commented Jan 22, 2026

There is one drawback to "manually" doing 30 parallel copies on 1 or 2 nodes. Different HPCs have different core and memory counts per node. 30 parallel "manual" copies may work on Ursa (192 cores) but may NOT work on Hera (40 cores).

We do not need to worry about this. xargs runs commands in parallel up to the number given after -P, and then runs the remaining commands as the first set finishes. So just set the number after -P to the number of cores per node.
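A toy demo of that throttling behavior (illustrative commands only, not the actual prep_ic command file):

# With -P 3, xargs keeps at most 3 jobs running at once and starts the
# remaining ones as earlier jobs finish.
seq 1 8 | xargs -I {} -P 3 sh -c 'echo "start member {}"; sleep 1; echo "done member {}"'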

@hu5970
Contributor

hu5970 commented Jan 22, 2026

Using one node to copy the files is a waste of resources. We need to avoid such a setup.

@hu5970 We don't have to use one node to do the copy. We can just request enough memory, as the changes in this PR do, and then let SLURM/PBS manage how to allocate resources efficiently.

Also, I think we initially planned to let prep_ic do more tasks, such as surface updating/soilSurgery, etc. So using one node is not that bad in practice, especially if it runs fast.

Those surface update and soil surgery steps are all small single-core programs, and most of them actually run for the deterministic cycle only.

export CMDFILE="${DATA}/script_prep_ic_0.sh"
fi

mkdir -p "$(dirname "$CMDFILE")"
Contributor

${DATA} should always be available in the ex-script; no need to mkdir -p here.

fi

exit 0
done
Contributor

We need to add two spaces to indent lines 42-171 correctly.

for memdir in "${mem_list[@]}"; do
# Determine path
if [[ ${#memdir} -gt 1 ]]; then
umbrella_prep_ic_data="${UMBRELLA_PREP_IC_DATA}${memdir}"
Contributor

Suggest changing umbrella_prep_ic_data to umbrella_prep_ic_mem to better distinguish it from UMBRELLA_PREP_IC_DATA

if [[ "${ENS_SIZE:-0}" -gt 2 ]]; then
mapfile -t mem_list < <(printf "/mem%03d\n" $(seq 1 "$ENS_SIZE"))
else
mem_list=("/") # if deterministic
Contributor

@guoqing-noaa guoqing-noaa Jan 22, 2026

This will create a double-slash situation in line 64:
thisfile=${COMINrrfs}/${RUN}.${PDY}/${cyc}/ic/${WGF}${memdir}/init.nc
It will generate something like ..../ic/det//init.nc, which should be avoided per the NCO standard.

if [[ ${#memdir} -gt 1 ]]; then
umbrella_prep_ic_data="${UMBRELLA_PREP_IC_DATA}${memdir}"
mkdir -p "${COMOUT}/prep_ic/${WGF}${memdir}"
pid=$((10#${memdir: -2}-1))
Contributor

I would think it is more straightforward to generate a list of member numbers first and add the "/" when it is needed.
Let me try a case and post my example here.

# Determine path
if [[ ${#memdir} -gt 1 ]]; then
umbrella_prep_ic_data="${UMBRELLA_PREP_IC_DATA}${memdir}"
mkdir -p "${COMOUT}/prep_ic/${WGF}${memdir}"
Contributor

Why do we create COMOUT directories for each member's prep_ic? Will we save data there?

: > "$CMDFILE"

# Create directory safely
mkdir -p "${umbrella_prep_ic_data}"
Contributor

@guoqing-noaa guoqing-noaa Jan 22, 2026

@chunhuazhou FYI, here is an alternate example for lines 34-57:

if (( "${ENS_SIZE:-0}" > 1 )); then
  mapfile -t mem_list < <(printf "%03d\n" $(seq 1 "$ENS_SIZE"))
else
  mem_list=("000") # if determinitic
fi

for index in "${mem_list[@]}"; do
  # Determine path
  if (( 10#${index} == 0 )); then
    memdir=""
    umbrella_prep_ic_mem="${UMBRELLA_PREP_IC_DATA}"
    export CMDFILE="${DATA}/script_prep_ic_0.sh"
  else
    memdir="/mem${index}"
    umbrella_prep_ic_mem="${UMBRELLA_PREP_IC_DATA}${memdir}"
    mkdir -p "${umbrella_prep_ic_mem}"
    pid=$((10#${index}-1))
    export CMDFILE="${DATA}/script_prep_ic_${pid}.sh"
  fi
  echo $CMDFILE, $memdir
done

@@ -81,6 +109,7 @@ fi
#
Contributor

The surface cycle has no relation to the background copy. Better to separate them into two sections?
