Skip to content

b4b: Move the changes to initialize the task decomposition from mpi_scan to main development#3666

Draft
ekluzek wants to merge 60 commits intoESCOMP:b4b-devfrom
ekluzek:decomp_mpi_scan_move_to_b4b
Draft

b4b: Move the changes to initialize the task decomposition from mpi_scan to main development#3666
ekluzek wants to merge 60 commits intoESCOMP:b4b-devfrom
ekluzek:decomp_mpi_scan_move_to_b4b

Conversation

@ekluzek
Copy link
Collaborator

@ekluzek ekluzek commented Dec 16, 2025

Description of changes

This moves the core code changes to initialize the processor decomposition from the mpi_scan branch in #3469 to b4b-dev. This removes some of the changes for memory checking and additional self testing as well as some of the additional timers that don't look useful now.

I created two previous branches before I created the process in #3665 where I worked out the details to NOT make this have too many commits and be hard to do. I also figured out how to remove merge commits as they need special handling, and usually aren't wanted in a case like this. Another way to do this would be to do this outside of git, which might have been similar length as the final version, but could've missed some important changes.

Specific notes

Contributors other than yourself, if any: John Dennis

CTSM Issues Fixed (include github issue #):
Fixes #3370
Fixes #3368
Fixes #3672
Some work on #3448

Are answers expected to change (and if so in what way)? No (the determination of the decomposition is identical as well)

Any User Interface Changes (namelist or namelist defaults changes)? No

Does this create a need to change or add documentation? Did you do so? No No

Testing performed, if any: Will run standard testing
The mpi_scan testing branch has had all the test lists run for it: aux_clm, ctsm_sci, decomp_init, decomp_init_uhr, and fates

 Conflicts:
	src/cpl/share_esmf/lnd_set_decomp_and_domain.F90
…e destroyed so remove the destroy for the distgrid, and the two meshes, this runs but doesn't seem to lower memory
…e about leaving the distgrid around, and also delete the meshes as it seems to work with this in place
 Conflicts:
	src/main/decompInitMod.F90
 Conflicts:
	src/cpl/share_esmf/lnd_set_decomp_and_domain.F90
…ted error messaging

 Conflicts:
	src/main/decompInitMod.F90
…he subname, create new internal subroutines in decompInit_lnd for allocate, clean, and check errors, move the check errors part to the first thing done

 Conflicts:
	src/main/decompInitMod.F90
…array sizes are set before allocates, initialize some decompMod values to invalid for error checking, add error checking to get_proc_bounds/get_proc_clumps, seperate out allocate for gindex to own allocate method, as it has be be done later after decomp is done, these are all improvements in ESCOMP#3448

 Conflicts:
	src/main/decompInitMod.F90
… add error handling of nsegspc, don't check endCohort in get_proc_bounds and get_clump_bounds as doesn't seem to be set

 Conflicts:
	src/main/decompInitMod.F90
…etup/clean for each DecompInit test, move the decomp_mod_clean to decompMod and use it for the decompInit tests

 Conflicts:
	src/main/decompInitMod.F90
	src/self_tests/TestDecompInit.F90 --- removed
…re to the regular operation

 Conflicts:
	src/main/decompInitMod.F90
…mpi-serial

 Conflicts:
	src/main/decompInitMod.F90
 Conflicts:
	src/main/decompInitMod.F90
…sor_type structure, and start adding a couple methods to help get them set
…for mpiscan and verify it, allocate the new procinfo gi and gj indices, make sure they are set, compiles but fails at run

 Conflicts:
	src/main/decompInitMod.F90
…the call expects all subgrid levels to be set
…ocate for the local task, this works for the serial case

 Conflicts:
	src/main/decompInitMod.F90
… clump_pproc for the setting of ggidx

 Conflicts:
	src/main/decompInitMod.F90
…moved for the final version

 Conflicts:
	src/main/decompInitMod.F90
…s for serial mode

 Conflicts:
	src/main/decompInitMod.F90
…global clumps

 Conflicts:
	src/self_tests/TestDecompInit.F90
… the log that aren't needed anymore

 Conflicts:
	src/main/decompInitMod.F90
@ekluzek
Copy link
Collaborator Author

ekluzek commented Dec 19, 2025

The unit-tests are failing because I added error checking to get_proc_bounds. Here's how the code goes

unittestSubGridMod.F90

set_decomp_info subroutine does the following:

.
.
.

I talked this over with @billsacks. We realized that the setting of the decomp is OK like it is. It's just that unit tests don't always create the subgrid hierarchy beneath the level it needs. So one way to fix it would be to make sure that happens for the all the cases where it might need it. Another option is to just add the optional argument of "only_grid" to bypass the subgrid level checking in the unittestSubGridMod get_proc_bounds call.

I tried doing the latter and there's still one unit test that failed because there is a call to get_proc_bounds in the setup of the accumulators, which is in the main model code rather than just the unit test code. Note, as an aside this really should be getting the bounds from the higher level where you know what type of loop you are in.

So for this one case, I'll make sure the lower level hierarchy is setup by doing this sort of thing:

diff --git a/src/dyn_subgrid/test/dynInitColumns_test/test_init_columns.pf b/src/dyn_subgrid/test/dynInitColumns_test/test_init_columns.pf
index ade2e6d95..79ec506f9 100644
--- a/src/dyn_subgrid/test/dynInitColumns_test/test_init_columns.pf
+++ b/src/dyn_subgrid/test/dynInitColumns_test/test_init_columns.pf
@@ -66,24 +66,24 @@ contains
 
     ! The first landunit is neither natural veg nor crop
     call unittest_add_landunit(my_gi=gi, ltype=istwet, wtgcell=0.25_r8)
-    call unittest_add_column(my_li=li, ctype=1, wtlunit=0.5_r8)
-    call unittest_add_column(my_li=li, ctype=1, wtlunit=0.5_r8)
+    call unittest_add_column(my_li=li, ctype=1, wtlunit=0.5_r8, add_simple_patch=.true.)
+    call unittest_add_column(my_li=li, ctype=1, wtlunit=0.5_r8, add_simple_patch=.true.)
 
     call unittest_add_landunit(my_gi=gi, ltype=1, wtgcell=0.5_r8)
     this%l1 = li
-    call unittest_add_column(my_li=li, ctype=1, wtlunit=0.25_r8)
-    call unittest_add_column(my_li=li, ctype=1, wtlunit=0.25_r8)
+    call unittest_add_column(my_li=li, ctype=1, wtlunit=0.25_r8, add_simple_patch=.true.)
+    call unittest_add_column(my_li=li, ctype=1, wtlunit=0.25_r8, add_simple_patch=.true.)
     ! This column (the second column on the landunit with ltype=1) will be the target for
     ! some tests of initialization of a new column
     this%c_new = ci
-    call unittest_add_column(my_li=li, ctype=1, wtlunit=0.25_r8)
-    call unittest_add_column(my_li=li, ctype=1, wtlunit=0.25_r8)
+    call unittest_add_column(my_li=li, ctype=1, wtlunit=0.25_r8, add_simple_patch=.true.)
+    call unittest_add_column(my_li=li, ctype=1, wtlunit=0.25_r8, add_simple_patch=.true.)
 
     call unittest_add_landunit(my_gi=gi, ltype=2, wtgcell=0.25_r8)
     this%l2 = li
-    call unittest_add_column(my_li=li, ctype=1, wtlunit=0.25_r8)
-    call unittest_add_column(my_li=li, ctype=1, wtlunit=0.25_r8)
-    call unittest_add_column(my_li=li, ctype=1, wtlunit=0.5_r8)
+    call unittest_add_column(my_li=li, ctype=1, wtlunit=0.25_r8, add_simple_patch=.true.)
+    call unittest_add_column(my_li=li, ctype=1, wtlunit=0.25_r8, add_simple_patch=.true.)
+    call unittest_add_column(my_li=li, ctype=1, wtlunit=0.5_r8, add_simple_patch=.true.)
 
     call unittest_subgrid_setup_end()
 

The new optional "add_simple_patch" argument means add a single simple patch below the column created. This could be done in other unittests to make sure the patch level is set when columns are added. The same sort of thing could be done for landunit and grid level as well.

… the one failing test, so that now the PF unit tests work
@ekluzek ekluzek moved this from Todo to In Progress in LMWG: Sprint Planning Board Jan 7, 2026
@ekluzek ekluzek moved this from In Progress to Done in LMWG: Sprint Planning Board Jan 7, 2026
@wwieder wwieder moved this from Done to In Progress in LMWG: Sprint Planning Board Jan 7, 2026
@ekluzek ekluzek marked this pull request as draft January 14, 2026 20:59
@ekluzek ekluzek requested review from samsrabin and slevis-lmwg and removed request for slevis-lmwg January 14, 2026 20:59
@samsrabin
Copy link
Member

@ekluzek I see that this is still a draft and has several comments from you in it, including several suggested changes. I'll unsubscribe for now; please @ me when it's fully ready (or now, if it actually is).

@ekluzek
Copy link
Collaborator Author

ekluzek commented Jan 15, 2026

@samsrabin ok that makes sense. The one thing that I wanted input on is if all of the conversations I have need to be handled before we meet to go over it.

But I really think the answer is yes on that, so I'll try to complete them and see if we could meet next week to go over it. If something ends up taking too long I'll make sure we meet so we can discuss.

Thanks for the feedback.

@ekluzek ekluzek mentioned this pull request Jan 20, 2026
…f there is also an endrun in the it, so calc_globalxy_indices will need to be changed to a pure function that returns values that can be checked at the call level
… remove pure from the calc_ routines in decompMod, and write out information, the unit test does work now
…r returns as nglob_x/nglob_y should be set before used
…en't covered in the more public calc_globalxy_indices
…in each test, and add some tests that notes suggested
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bfb bit-for-bit code health improving internal code structure to make easier to maintain (sustainability) enhancement new capability or improved behavior of existing capability performance idea or PR to improve performance (e.g. throughput, memory)

Projects

Status: In progress - b4b-dev
Status: In Progress

Development

Successfully merging this pull request may close these issues.

2 participants