enable building stochastic_physics using CMake #3

Open
guoqing-noaa wants to merge 8 commits into dtcenter:gsl/MPAS_stoch_physics from guoqing-noaa:stochastic

@guoqing-noaa commented Feb 28, 2026

This PR enables building stochastic_physics using CMake.

  1. update core_atmosphere/CMakeLists.txt to build stochastic_physics as a module of MPAS-Model.
  2. update the modulefiles to provide the required MKL and NetCDF libraries.
  3. update the stochastic_physics submodule to use mpi instead of mpi_f08, since MPAS-Model compiles with mpi by default.
    This is the first step toward getting stochastic_physics to compile successfully with CMake. A future PR will introduce #ifdef MPAS_USE_MPI_F08, as has been done in MPAS-Model.
  4. NOTE: this PR does not compile cellular automata (CA). CA requires the extrafms library, and this capability may be added in future PRs (so halo_exchange.fv3.F90, cellular_automata_global.F90, cellular_automata_sgs.F90, and update_ca.F90 are excluded in CMakeLists.txt for now).
  5. stochastic_physics needs to check whether CCPP_32BIT is defined. The parent MPAS-Model defines it or not depending on whether the build uses double precision.
  6. core_atmosphere/stochastic_physics/CMakeLists.txt is NOT used by the parent MPAS-Model, so changes to that file will not affect MPAS-Model. This is consistent with current MPAS-Model practice.
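
To see which stochastic_physics sources remain once the four CA files from item 4 are excluded, a small shell helper along these lines may be useful. It is only a sketch: `list_stoch_sources` is a hypothetical name, not part of MPAS-Model or this PR, and the real selection is made in core_atmosphere/CMakeLists.txt.

```shell
#!/bin/bash
# CA sources that the PR description says are excluded from the CMake build for now.
EXCLUDED_CA_SOURCES="halo_exchange.fv3.F90 cellular_automata_global.F90 cellular_automata_sgs.F90 update_ca.F90"

# Print the *.F90 files in directory $1 that are not on the exclusion list.
# NOTE: list_stoch_sources is a hypothetical helper for illustration only.
list_stoch_sources() {
    local dir="$1" path name ex skip
    for path in "$dir"/*.F90; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$path" ] || continue
        name="$(basename "$path")"
        skip=no
        for ex in $EXCLUDED_CA_SOURCES; do
            if [ "$name" = "$ex" ]; then
                skip=yes
            fi
        done
        if [ "$skip" = no ]; then
            echo "$name"
        fi
    done
    return 0
}
```

For example, `list_stoch_sources core_atmosphere/stochastic_physics` run from the MPAS-Model source tree would print the remaining .F90 files, assuming the submodule is checked out at that path.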

If needed, a -DSTOCHASTIC_PHYSICS=ON option may be added to turn on/off the building of stochastic_physics in a future PR.

@guoqing-noaa
Author

Note: it would be better to merge dtcenter/stochastic_physics#3 first.

@willmayfield

I ran several 3km CONUS tests on Hera and Ursa using the code in Guoqing's PRs and the "build.sh" script as-is, meaning the build was done with CMake. Each test requested 1200 cores (Ursa therefore ended up being partially undersubscribed). These were the results:

-Hera, sppt on, spptint=0, dt=15, succeeded through 36h forecast
/scratch4/BMC/wrfruc/mayfield/mpas_stoch/cost_tests/expt_dirs/stoch_cost/conus_3km/mpas_atm_spptint0_cmake_hera

-Ursa, nosppt, dt=15, succeeded through 36h forecast
/scratch4/BMC/wrfruc/mayfield/mpas_stoch/cost_tests/expt_dirs/stoch_cost/conus_3km/mpas_atm_nosppt_cmake_ursa

-Hera, nosppt, dt=15, succeeded through 36h forecast
/scratch4/BMC/wrfruc/mayfield/mpas_stoch/cost_tests/expt_dirs/stoch_cost/conus_3km/mpas_atm_nosppt_cmake_hera

-Hera, sppt on, spptint=600, dt=15, failed after 1h 23m. I repeated this test 3 times and all runs failed at the same point; I have no clue as to why.
/scratch4/BMC/wrfruc/mayfield/mpas_stoch/cost_tests/expt_dirs/stoch_cost/conus_3km/mpas_atm_spptint600_cmake_hera_fail
Error:
[h10c29:4030808:0:4030808] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x154796a3ea9c)
==== backtrace (tid:3558242) ====
0 0x0000000000053519 ucs_debug_print_backtrace() ???:0
1 0x0000000000012990 funlockfile() :0
2 0x0000000000579ef4 module_mp_thompson_aerosols_mp_gt_aod() ???:0
3 0x00000000003c0eb8 mpas_atmphys_driver_radiation_sw_mp_radiation_sw_from_mpas() ???:0
4 0x00000000003b59a7 mpas_atmphys_driver_radiation_sw_mp_driver_radiation_sw_() ???:0
5 0x00000000002f9d3f mpas_atmphys_driver_mp_physics_driver_() ???:0
6 0x0000000000081b68 atm_core_mp_atm_do_timestep_() ???:0
7 0x0000000000080d5c atm_core_mp_atm_core_run_() ???:0
8 0x00000000009efde7 mpas_subdriver_mp_mpas_run_() ???:0
9 0x000000000040fef7 MAIN__() ???:0
10 0x000000000040fe7d main() ???:0
11 0x000000000003a865 __libc_start_main() ???:0
12 0x000000000040fd9e _start() ???:0
=================================

-Hera, sppt on, spptint=600, dt=10, failed after 35h 9m of forecast because the job time limit was reached
/scratch4/BMC/wrfruc/mayfield/mpas_stoch/cost_tests/expt_dirs/stoch_cost/conus_3km/mpas_atm_spptint600_cmake_hera

-Ursa, sppt on, spptint=600, dt=15, failed after 30h 9m 15s of forecast with “cancelled for time limit”. However, it reached that point in the forecast well within the time limit (in fact, timesteps were progressing 20-30% faster than in the same test on Hera) and then appeared to hang for the remainder of the requested time. This may be the random node drops others have been seeing.
/scratch4/BMC/wrfruc/mayfield/mpas_stoch/cost_tests/expt_dirs/stoch_cost/conus_3km/mpas_atm_spptint600_cmake_ursa

From these tests, I am satisfied that the issues are not related to Guoqing's code or the CMake build. @gsketefian @NingWang325 @JeffBeck-NOAA Is there any reason to wait for Ning's PR, or are we ready to merge this?

@JeffBeck-NOAA

@willmayfield, thanks for doing this extensive testing! Did you try rerunning spptint=600 on Ursa with dt=15, or trying dt=10, to see if the hang is resolved? It does sound like an HPC-specific issue unrelated to the code. I'm comfortable merging these changes as is.

@guoqing-noaa
Author

@willmayfield Thank you very much for all the testing!
Is it possible to test it on Derecho?

MPAS-Model is known to hang on Ursa at times (this issue predates the stochastic physics work).
Hera has limited memory per node. You could check whether you can use the bigmem partition.

Also, we can test it on Gaea if possible.

With that said, I agree with you and @JeffBeck-NOAA that the issue is NOT related to this PR. Thanks!

@guoqing-noaa
Author

@willmayfield For model hangs on Ursa, please add the following line to your job card:

export I_MPI_COLL_INTRANODE=pt2pt

It was just suggested by Ursa Admin and it did solve one of my model hangs.
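
A minimal sketch of where that line might sit in a Slurm job card, assuming the run uses Intel MPI (the I_MPI_ prefix belongs to Intel MPI). The `#SBATCH` values and the commented-out launch line are placeholders, not taken from any actual job card in this thread; only the `export` line comes from the suggestion above.

```shell
#!/bin/bash
#SBATCH --ntasks=1200       # core count used in the tests above; adjust as needed
#SBATCH --time=04:00:00     # placeholder wall-clock limit (assumption)

# Workaround suggested for model hangs on Ursa. As I understand the Intel MPI
# documentation, this selects point-to-point based intranode collectives.
export I_MPI_COLL_INTRANODE=pt2pt

# Record the setting in the job log so it is easy to confirm later.
echo "I_MPI_COLL_INTRANODE=${I_MPI_COLL_INTRANODE}"

# The launch line is an assumption; keep whatever your existing job card runs.
# srun ./atmosphere_model
```

The export has to come before the MPI launch so that the launched ranks inherit the variable.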
