joezuntz/cosmosis

Out of memory on supercomputer

audita-nimas opened this issue · 6 comments

Hi, Joe!

I tried to run an MCMC in CosmoSIS with this configuration:

[runtime]
sampler = emcee

[test]
save_dir = output/gammat_des_case5
fatal_errors=T

[output]
format=text
filename=output/output_gammat_des_case5.txt

[emcee]
walkers = 50
samples = 10000
nsteps = 2


; This parameter is used several times in this file, so it is
; put in the DEFAULT section and is referenced below as %(2PT_FILE)s
[DEFAULT]
;2PT_FILE = likelihood/des-y3/2pt_NG_final_2ptunblind_02_24_21_wnz_covupdate.v2.fits
;2PT_FILE = /Users/nemas/Documents/PhD/task5_shear_ratio_module/make_input/cosmosis_input_desy3.fits
2PT_FILE = /fred/oz073/nimas/cosmosis1/cosmosis-standard-library/phd/data/cosmosis_input_desy3.fits

[pipeline]
modules =  consistency 
           camb 
           sigma8_rescale 
           fast_pt
           fits_nz 
           source_photoz_bias
           IA 
           pk_to_cl
           add_intrinsic
           2pt_gal_shear
           shear_m_bias 
           2pt_like

quiet=F
timing=F
debug=F
;priors = examples/des-y3-priors.ini
priors = phd/prior_shear_des_case4.ini
values = phd/values_shear_des_case5.ini
extra_output = cosmological_parameters/Omega_m cosmological_parameters/S8 cosmological_parameters/sigma_8 cosmological_parameters/sigma_12 data_vector/2pt_chi2


; It's worth switching this to T when sampling using multinest, polychord,
; or other samplers that can take advantage of differences in calculation speeds between
; different parameters.
fast_slow = F
first_fast_module = shear_m_bias
; For some use cases this might be faster:
;first_fast_module=lens_photoz_width


[consistency]
file = utility/consistency/consistency_interface.py

[camb]
file = boltzmann/camb/camb_interface.py
mode = all
lmax = 2500          ;max ell to use for cmb calculation
feedback=3         ;amount of output to print
AccuracyBoost=1.1 ;CAMB accuracy boost parameter
do_tensors = T
do_lensing = T
NonLinear = pk
halofit_version = takahashi
zmin_background = 0.
zmax_background = 4.
nz_background = 401
kmin=1e-4
kmax = 50.0
kmax_extrapolate = 500.0
nk=700

[sigma8_rescale]
file = utility/sample_sigma8/sigma8_rescale.py


[fits_nz]
file = number_density/load_nz_fits/load_nz_fits.py
nz_file = %(2PT_FILE)s
data_sets = lens source
prefix_section = T
prefix_extension = T



[source_photoz_bias]
file = number_density/photoz_bias/photoz_bias.py
mode = additive
sample = nz_source
bias_section = wl_photoz_errors
interpolation = linear

[fast_pt]
file = structure/fast_pt/fast_pt_interface.py
do_ia = T
k_res_fac = 0.5
verbose = F

[IA]
file = intrinsic_alignments/tatt/tatt_interface.py
sub_lowk = F
do_galaxy_intrinsic = F
ia_model = tatt

[pk_to_cl_gg]
file = structure/projection/project_2d.py
lingal-lingal = lens-lens
do_exact = lingal-lingal
do_rsd = True
ell_min_linspaced = 1
ell_max_linspaced = 4
n_ell_linspaced = 5
ell_min_logspaced = 5.
n_ell_logspaced = 80
limber_ell_start = 200
ell_max_logspaced=1.e5
auto_only=lingal-lingal
sig_over_dchi_exact = 3.5

[pk_to_cl]
file = structure/projection/project_2d.py
ell_min_logspaced = 0.1
ell_max_logspaced = 5.0e5
n_ell_logspaced = 100 
shear-shear = source-source  ;uncomment
shear-intrinsic = source-source
intrinsic-intrinsic = source-source
intrinsicb-intrinsicb=source-source
lingal-shear = lens-source
lingal-intrinsic = lens-source
; lingal-magnification = lens-lens
; magnification-shear = lens-source
; magnification-magnification = lens-lens
; magnification-intrinsic = lens-source 
verbose = F
get_kernel_peaks = F
sig_over_dchi = 20. 
shear_kernel_dchi = 10. 


[add_intrinsic]
file=shear/add_intrinsic/add_intrinsic.py
;shear-shear=T
position-shear=T
perbin=F


[2pt_gal_shear]
file = shear/cl_to_xi_fullsky/cl_to_xi_interface.py
ell_max = 40000
xi_type='02'
theta_file=%(2PT_FILE)s
bin_avg = T

[shear_m_bias]
file = shear/shear_bias/shear_m_bias.py
m_per_bin = True
cl_section = shear_xi_plus shear_xi_minus
cross_section = galaxy_shear_xi
verbose = F

[add_point_mass]
file=shear/point_mass/add_gammat_point_mass.py
add_togammat = False
use_fiducial = True
sigcrit_inv_section = sigma_crit_inv_lens_source

[2pt_like]
file = likelihood/2pt/2pt_point_mass/2pt_point_mass.py
;do_pm_marg = True
;do_pm_sigcritinv = True
;sigma_a = 10000.0
;no_det_fac = False
include_norm = False
data_file = %(2PT_FILE)s
data_sets = gammat
make_covariance=F
covmat_name=COVMAT

angle_range_gammat_1_1 = 30.00 300.0
angle_range_gammat_1_2 = 30.00 300.0
angle_range_gammat_1_3 = 30.00 300.0
angle_range_gammat_1_4 = 30.00 300.0
;angle_range_gammat_1_5 = 30.00 300.0
angle_range_gammat_2_1 = 15.76 300.0
angle_range_gammat_2_2 = 15.76 300.0
angle_range_gammat_2_3 = 15.76 300.0
angle_range_gammat_2_4 = 15.76 300.0
;angle_range_gammat_2_5 = 15.76 300.0
angle_range_gammat_3_1 = 11.07 300.0
angle_range_gammat_3_2 = 11.07 300.0
angle_range_gammat_3_3 = 11.07 300.0
angle_range_gammat_3_4 = 11.07 300.0
;angle_range_gammat_3_5 = 11.07 300.0
angle_range_gammat_4_1 = 8.75 300.0
angle_range_gammat_4_2 = 8.75 300.0
angle_range_gammat_4_3 = 8.75 300.0
angle_range_gammat_4_4 = 8.75 300.0
;angle_range_gammat_4_5 = 8.75 300.0

; we put these in a separate file because they are long
;%include examples/des-y3-scale-cuts.ini


[shear_ratio_like]
file = likelihood/des-y3/shear_ratio/shear_ratio_likelihood.py
;data_file = /Users/nemas/Documents/PhD/task5_shear_ratio_module/make_input/shear_ratio_desy3.pkl
data_file = /fred/oz073/nimas/cosmosis1/cosmosis-standard-library/phd/data/shear_ratio_desy3.pkl
theta_min_1 = 6.0 3.2 2.2 1.7
theta_min_2 = 6.0 3.2 2.2 1.7
theta_min_3 = 6.0 3.2 2.2 1.7
theta_max = 60.0 31.5 22.1 17.5
include_norm = F
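For anyone reading along: the %(2PT_FILE)s references in the file above use the same interpolation mechanism as Python's configparser, where keys defined in [DEFAULT] are substituted into every other section. A minimal sketch with a hypothetical path:

```python
import configparser

# CosmoSIS ini files use %(name)s interpolation in the same way as
# Python's configparser: a key defined in [DEFAULT] is visible in
# every section. The path below is purely illustrative.
cfg = configparser.ConfigParser()
cfg.read_string("""\
[DEFAULT]
2PT_FILE = /path/to/cosmosis_input_desy3.fits

[fits_nz]
nz_file = %(2PT_FILE)s
""")

# The DEFAULT value is substituted into the section's key.
print(cfg["fits_nz"]["nz_file"])  # /path/to/cosmosis_input_desy3.fits
```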

and on the supercomputer I set up the sbatch file like this:

#!/bin/bash
#
#SBATCH --job-name=cosmosis_des_gammat_case5
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=168:00:00
#SBATCH --mem=100g
#SBATCH --mail-type=FAIL --mail-user=nemas@swin.edu.au
#SBATCH --chdir=/fred/oz073/nimas/cosmosis1/


source ./env1/bin/activate
source cosmosis-configure
cd cosmosis-standard-library



srun time cosmosis phd/gammat_des_case5.ini



but after 3 or 4 days I got an "out of memory" notification. What actually happened there? Did I get something wrong in my settings?

Hi @audita-nimas

I haven't seen this before - it suggests there is a memory leak somewhere but I don't know where. Have you modified any standard library code? This might just be a bug I haven't found yet.
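If you want to check whether memory really is growing without bound, you could log the process's peak memory from inside the pipeline or a wrapper script. A minimal sketch using Python's standard resource module (not part of CosmoSIS):

```python
import resource

def peak_rss_mb():
    """Peak resident set size of this process so far, in MB.

    On Linux, ru_maxrss is reported in kilobytes (on macOS it is
    bytes), so this conversion assumes a Linux cluster node.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

# Printing this once per sampler iteration shows whether memory grows
# steadily (suggesting a leak) or plateaus (normal behaviour).
print(f"peak RSS: {peak_rss_mb():.1f} MB")
```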

As a solution you can set resume=T in the [runtime] section and then just start the job again and it will continue where it left off.
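In your configuration that would look like this (a sketch; the rest of the file stays as it is):

```ini
[runtime]
sampler = emcee
; continue from the existing output file rather than starting from scratch
resume = T
```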

How many CPUs are there on your nodes? You are only running a single MPI process; depending on how many CPUs you have, you could run several and speed things up greatly.

If you're using OzSTAR, as the path in your batch script implies, then there are 36 cores per node. So you might try something like:

#SBATCH --ntasks=9
#SBATCH --cpus-per-task=4

...

export OMP_NUM_THREADS=4
srun -n 9 -c 4 cosmosis --mpi phd/gammat_des_case5.ini

That should be nearly 9 times faster.

Hi Joe,

We have changed this file:

[shear_ratio_like]
file = likelihood/des-y3/shear_ratio/shear_ratio_likelihood.py

with these details:

Lines 32-35: changed "nbin_lens" to "nbin_source - 1"
Lines 41 and 44: swapped the index variables "s" and "l"
Line 95: changed self.x to self.theta

and also, I have tried your suggestion and I got:

sbatch: error: ERROR: multi-node jobs must be multiples of 32 cores
sbatch: error: Batch job submission failed: Requested operation not supported on this system

Should I install something else?

A little note: in the .ini file I gave as an example I did not use SR, but I used SR in another case and still got the same error.

Okay, they must be 32-core nodes, not 36 as the documentation says. You could change the 9s to 8s.
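Concretely, the adjusted lines from the earlier suggestion would be (8 tasks x 4 cores = 32 cores, matching the node size):

```shell
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=4

export OMP_NUM_THREADS=4
srun -n 8 -c 4 cosmosis --mpi phd/gammat_des_case5.ini
```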

Your SR change shouldn't affect the memory, so I don't have any more ideas there.

I see. I have run it and it looks good. I will let you know if there is a problem. Thank you so much, Joe!

Great - I'll close this for now but please do re-open if you have more issues.