I am encountering the same (or at least a similar) issue with the gap_fit executable compiled with Meson, with MPI turned on:
SYSTEM ABORT: proc=0 Traceback (most recent call last)
File "../src/libAtoms/linearalgebra.F90", line 2348 kind unspecified
LA_Matrix_Factorise: cannot factorise, error: 8
However, when I build gap_fit with QUIP_ARCH=linux_x86_64_gfortran_openmpi and make config (linking with -lopenblas -lscalapack and no extra link options), the resulting gap_fit runs the fit to the end without any problem, using the same training file and config_file.
I am compiling quip inside a conda environment with openblas, scalapack and compilers downloaded from conda itself.
Originally posted by @lormio in #715
To the issue quoted above I am adding the list of packages installed in the conda environment and some further information.
conda packages: scalapack, openblas, gxx, gcc, gfortran, openmpi, meson, ninja.
The issue seems to be the following: the old build system finds and links ScaLAPACK correctly, while Meson does not.
OLD BUILD
$ ldd build/linux_x86_64_gfortran_openmpi/gap_fit | grep scalapack
libscalapack.so => /home/miolalor/.conda/envs/quip_comp/lib/libscalapack.so (0x0000742212600000)
MESON
$ ldd builddir/src/Programs/gap_fit | grep scalapack
No output
Moreover, looking at the ldd output, I noticed that the linked MPI libraries differ: the old build has two extra ones (I don't know whether this is related, nor why they should differ):
OLD BUILD
$ ldd build/linux_x86_64_gfortran_openmpi/gap_fit | grep mpi
libmpi_usempif08.so.40 => /home/miolalor/.conda/envs/quip_comp/lib/libmpi_usempif08.so.40 (0x00007747a44ad000)
libmpi_usempi_ignore_tkr.so.40 => /home/miolalor/.conda/envs/quip_comp/lib/libmpi_usempi_ignore_tkr.so.40 (0x00007747a4499000)
libmpi_mpifh.so.40 => /home/miolalor/.conda/envs/quip_comp/lib/libmpi_mpifh.so.40 (0x00007747a4422000)
libmpi.so.40 => /home/miolalor/.conda/envs/quip_comp/lib/libmpi.so.40 (0x0000774799400000)
MESON
$ ldd builddir/src/Programs/gap_fit | grep mpi
libmpi_mpifh.so.40 => /home/miolalor/.conda/envs/quip_comp/lib/libmpi_mpifh.so.40 (0x00007f57742b9000)
libmpi.so.40 => /home/miolalor/.conda/envs/quip_comp/lib/libmpi.so.40 (0x00007f5769e00000)
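For anyone who wants to repeat this comparison without reading the two listings by eye, here is a minimal Python sketch that diffs the library names reported by `ldd` for the two builds. The function names (`parse_ldd`, `ldd_libs`, `compare`) are made up for illustration and are not part of QUIP or Meson; the sketch assumes the usual `name => path (address)` line format of `ldd` on Linux.

```python
import subprocess


def parse_ldd(text):
    """Return the set of library names listed in `ldd` output."""
    libs = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Typical line: "libfoo.so.1 => /path/libfoo.so.1 (0x...)".
        # Entries without "=>" (e.g. linux-vdso) keep their bare name or path.
        name = line.split("=>")[0].split("(")[0].strip()
        if name:
            libs.add(name)
    return libs


def ldd_libs(binary):
    """Run `ldd` on a binary and parse its output (hypothetical helper)."""
    out = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=True
    )
    return parse_ldd(out.stdout)


def compare(old_libs, new_libs):
    """Return (only in old build, only in new build) as sorted lists."""
    return sorted(old_libs - new_libs), sorted(new_libs - old_libs)
```

Calling `compare(ldd_libs("build/linux_x86_64_gfortran_openmpi/gap_fit"), ldd_libs("builddir/src/Programs/gap_fit"))` should list every library that only one of the two executables depends on, not just the ones matched by `grep`.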
However, I do not understand why Meson fails to link the ScaLAPACK library, given that with the old build system I only specify -lscalapack, without having to give any -I or -L paths.
The executable produced with Meson runs without mpiexec until it hits the error reported in the previous issue, and complains about undefined symbols (blacs_gridinit_) when run in parallel. The gap_fit from the old build system, by contrast, works without issue, with or without multiple MPI processes.
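The undefined-symbol complaint can be cross-checked against the dynamic symbol table. The sketch below lists the undefined dynamic symbols of a binary using GNU `nm -D --undefined-only` and filters those starting with `blacs_`. The helper names are hypothetical, and the `blacs_` prefix is only an assumption based on the one symbol named in the error. An undefined entry is normal by itself; it only becomes a problem when none of the libraries shown by `ldd` provides it, which would be consistent with libscalapack missing from the Meson build.

```python
import subprocess


def parse_undefined(nm_output):
    """Extract symbol names from `nm -D --undefined-only` output."""
    symbols = set()
    for line in nm_output.splitlines():
        fields = line.split()
        # Undefined entries look like "U name"; weak ones use "w" or "v".
        if len(fields) >= 2 and fields[-2] in ("U", "w", "v"):
            # Drop a symbol-version suffix such as "@GLIBC_2.2.5".
            symbols.add(fields[-1].split("@")[0])
    return symbols


def undefined_symbols(binary):
    """Run GNU nm on a binary and parse the result (hypothetical helper)."""
    out = subprocess.run(
        ["nm", "-D", "--undefined-only", binary],
        capture_output=True, text=True, check=True,
    )
    return parse_undefined(out.stdout)


def with_prefix(symbols, prefix="blacs_"):
    """Sorted symbols starting with `prefix` (prefix is an assumption)."""
    return sorted(s for s in symbols if s.startswith(prefix))
```

For example, `with_prefix(undefined_symbols("builddir/src/Programs/gap_fit"))` would show which BLACS routines the Meson-built executable still expects some shared library to supply at run time.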