Segfault when loading index from multiple processes
scharch opened this issue · 4 comments
scharch commented
I am dispatching several hundred jobs on an HPC that all begin by loading a previously-created pyfastx index file. Around a third of the jobs will immediately segfault:
(gdb) bt
#0 0x00002aaaaaf5e580 in fileno_unlocked () from /lib64/libc.so.6
#1 0x00002aaab35de26e in is_readonly (fd=fd@entry=0x0) at src/zran.c:2639
#2 zran_import_index (index=index@entry=0x5555565f4b20, fd=fd@entry=0x0) at src/zran.c:2639
#3 0x00002aaab35d76be in pyfastx_load_gzip_index (index_file=<optimized out>, gzip_index=0x5555565f4b20, index_db=0x5555566a0508) at src/util.c:513
#4 0x00002aaab35db42c in pyfastx_fastq_load_index (self=self@entry=0x2aaadfe79938) at src/fastq.c:223
#5 0x00002aaab35dc210 in pyfastx_fastq_new (type=<optimized out>, args=<optimized out>, kwargs=<optimized out>) at src/fastq.c:322
#6 0x00005555556f78e5 in type_call () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Objects/typeobject.c:895
#7 0x000055555566787b in _PyObject_FastCallDict () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Objects/abstract.c:2331
#8 0x00005555556f774e in call_function () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:4861
#9 0x000055555571a2ba in _PyEval_EvalFrameDefault () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:3335
#10 0x00005555556c60ab in _PyFunction_FastCall (globals=<optimized out>, nargs=0, args=<optimized out>, co=<optimized out>)
at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:4919
#11 fast_function () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:4954
#12 0x00005555556f76d5 in call_function () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:4858
#13 0x000055555571a2ba in _PyEval_EvalFrameDefault () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:3335
#14 0x00005555556ca669 in _PyEval_EvalCodeWithName (qualname=0x0, name=0x0, closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwstep=2,
kwcount=<optimized out>, kwargs=0x0, kwnames=0x0, argcount=0, args=0x0, locals=0x2aaaaaba35e8, globals=0x2aaaaaba35e8, _co=0x2aaab2374ae0)
at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:4166
#15 PyEval_EvalCodeEx () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:4187
#16 0x00005555556cb3fc in PyEval_EvalCode (co=co@entry=0x2aaab2374ae0, globals=globals@entry=0x2aaaaaba35e8, locals=locals@entry=0x2aaaaaba35e8)
at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/ceval.c:731
#17 0x000055555576cc94 in run_mod () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/pythonrun.c:1025
#18 0x000055555576d091 in PyRun_FileExFlags () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/pythonrun.c:978
#19 0x000055555576d293 in PyRun_SimpleFileExFlags () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/pythonrun.c:419
#20 0x000055555576d39d in PyRun_AnyFileExFlags () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Python/pythonrun.c:81
#21 0x0000555555770dc9 in run_file (p_cf=0x7fffffff3f9c, filename=0x5555558ab340 L"/nethome/schrammca/SONAR/annotate/find_umis.py", fp=0x5555559243f0)
at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Modules/main.c:340
#22 Py_Main () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Modules/main.c:810
#23 0x000055555563844e in main () at /home/conda/feedstock_root/build_artifacts/python_1551342612670/work/Programs/python.c:69
#24 0x00002aaaaaf0cb15 in __libc_start_main () from /lib64/libc.so.6
#25 0x0000555555720d7f in _start () at ../sysdeps/x86_64/elf/start.S:103
This happens even if I only start 20 worker jobs concurrently, although the frequency of segfaults is lower (~2/20 jobs vs ~70/200 jobs vs ~450/650 jobs).
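In case it helps reproduce this outside the cluster, a minimal sketch of the access pattern (assuming the crash is triggered purely by concurrent loads of the same pre-built index, and not by anything SGE-specific; the file name here is made up):

import multiprocessing
import pyfastx

FASTQ = "reads.fastq.gz"  #hypothetical file whose .fxi index already exists

def load_index(_):
    #every worker opens the same pre-built index at roughly the same time
    return len(pyfastx.Fastq(FASTQ))

if __name__ == "__main__":
    #with enough concurrent loads, some workers die in zran_import_index()
    with multiprocessing.Pool(20) as pool:
        print(pool.map(load_index, range(20)))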
lmdu commented
Could you show me your code? This error was caused by the zran.c module.
scharch commented
In the master script:
import sys, math, subprocess
import pyfastx

fileID = sys.argv[1]

try:
    seqFile = pyfastx.Fastq(fileID)
except RuntimeError:
    seqFile = pyfastx.Fasta(fileID)

#how many jobs is it?
numJobs = math.ceil(len(seqFile) / 500_000)

#call find_umis either on cluster or locally
with open("featuresjob.sh", 'w') as jobHandle:
    jobHandle.write(f"#!/bin/bash\n#$ -N featureUMIs\n#$ -cwd\n\nfind_umis.py {fileID} 500000 $(($SGE_TASK_ID-1)) \n\n")
subprocess.call(['qsub', '-sync', 'y', '-t', "1-%d" % numJobs, "featuresjob.sh"])
And in each worker script:
import sys, random, pyfastx
from time import sleep

fileID = sys.argv[1]
chunksize = int(sys.argv[2])
chunknum = int(sys.argv[3])

#introducing a small random delay prevents most segfaults
sleeptime = 15 * random.uniform(0, 10)
sleep(sleeptime)

try:
    #this is where the segfault occurs
    seqFile = pyfastx.Fastq(fileID)
except RuntimeError:
    seqFile = pyfastx.Fasta(fileID)

start = chunknum * chunksize
stop = min(len(seqFile), start + chunksize - 1)
for seqNum in range(start, stop):
    seq = seqFile[seqNum]
    #do stuff...
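The random sleep is only a crude stagger. An untested alternative would be to serialize the index load behind an advisory lock file (assuming the shared filesystem honours fcntl locks); a sketch that would replace the try/except above:

import fcntl
from contextlib import contextmanager

@contextmanager
def index_lock(lock_path):
    #exclusive advisory lock so only one job imports the gzip index at a time
    with open(lock_path, "w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)

with index_lock(fileID + ".lock"):
    try:
        seqFile = pyfastx.Fastq(fileID)
    except RuntimeError:
        seqFile = pyfastx.Fasta(fileID)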
lmdu commented
I have released a new version, 0.8.2. You can try it. This issue may be caused by the same random index file name being generated for the subprocess workers. If it does not work, please let me know.
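To illustrate the kind of collision described above (illustrative Python only, not pyfastx's actual C implementation; the file names are made up):

import os
import random
import time
import uuid

#Collision-prone: if the temporary name comes from an RNG seeded with the
#current time, workers started in the same second all draw the same name.
random.seed(int(time.time()))
risky_name = "pyfastx_tmp_%04d.idx" % random.randint(0, 9999)

#Collision-resistant: mix in something unique to the process, such as the PID
#or a UUID, or create the file atomically with tempfile.mkstemp().
safe_name = "pyfastx_tmp_%d_%s.idx" % (os.getpid(), uuid.uuid4().hex)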
scharch commented
This appears to have solved the issue, thanks!