heterodb/pg-strom

Can't start database "PG-Strom fatbin image is not valid now"

Opened this issue · 17 comments

Discussed in #742

Originally posted by alefcs23 March 22, 2024
2024-03-22 13:43:43 -03 [85441]: [6-1] user=,db=,app=,client=LOG: PG-Strom fatbin image is not valid now, so rebuild in progress...
sh: 1: Syntax error: Bad fd number
sh: 1: Syntax error: Bad fd number
sh: 1: Syntax error: Bad fd number
sh: 1: Syntax error: Bad fd number
sh: 1: Syntax error: Bad fd number
sh: 1: Syntax error: Bad fd number
sh: 1: Syntax error: Bad fd number
sh: 1: Syntax error: Bad fd number
sh: 1: Syntax error: Bad fd number
sh: 1: sh: 1: Syntax error: Bad fd numberSyntax error: Bad fd number

sh: 1: Syntax error: Bad fd number
2024-03-22 13:43:43 -03 [85441]: [7-1] user=,db=,app=,client=FATAL: failed on the build process at [/tmp/.pgstrom_fatbin_build_bQbEj8]
2024-03-22 13:43:43 -03 [85441]: [8-1] user=,db=,app=,client=LOG: database system is shut down
pg_ctl: could not start server
Examine the log output.

Which shell does the user that runs the PostgreSQL server process use?
PG-Strom launches nvcc using the system(3) function, so we expect /bin/bash to be available.

just double checked and it uses /bin/bash, is there any way that i could specify the shell or edit the syntax myself?
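
For context on the shell question: system(3) hands its command string to `/bin/sh -c`, so the account's login shell is not what parses the nvcc command line. Below is a minimal sketch for checking what /bin/sh actually is on a given machine. It assumes `readlink -f` is available (as with GNU coreutils) and treats `getent` as optional.

```shell
# system(3) runs "/bin/sh -c <command>", so the login shell is not what matters.
# Resolve /bin/sh to the real binary behind it (assumes readlink -f is available).
sh_target=$(readlink -f /bin/sh)
echo "/bin/sh resolves to: $sh_target"

# For comparison, the login shell recorded for the current user, if getent exists.
login_shell=$(getent passwd "$(id -un)" 2>/dev/null | cut -d: -f7)
echo "login shell: ${login_shell:-unknown}"
```

On a stock Ubuntu install the first line typically reports dash, even when the login shell is /bin/bash.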

The __rebuild_gpu_fatbin_file() function in src/gpu_device.c constructs the command line.
Can you try to print cmd.data using elog(LOG, ...)?

sry idk how to, gonna look into __rebuild_gpu_fatbin_file() and see what happens

I met the same problem too.

What OS are you using?

CUDA 12.0, Ubuntu 22.04, PostgreSQL 16, PG-Strom 5.
I wish that one day I could install PG-Strom on Ubuntu smoothly with just:

sudo apt-get install pg-strom-PG16

like TimescaleDB or pg-vector.

if i change to a rpm based OS, will my problem be solved?

commit 593da4ec873e8096f11b6bbc0ff2aa3194edd29d will fix the problem.

sh: 1: Syntax error: Bad fd number

This is a typical error message when a command is run with both stdout and stderr redirected into one file using:

% COMMAND >& logfile

But that shorthand is a bash extension; a plain POSIX sh such as dash, which is /bin/sh on Ubuntu, does not accept it.
So PG-Strom's code-builder routine now builds the shell command that launches nvcc in the portable form:

% COMMAND > logfile 2>&1
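
The difference between the two redirections can be tried without PG-Strom. This is a minimal sketch: the file names and the `workdir` variable are made up for illustration, and the second half only reports whether the local /bin/sh accepts the shorthand, since that depends on which shell /bin/sh is.

```shell
# Portable form: send stdout to the file, then point stderr at the same place.
workdir=$(mktemp -d)
sh -c 'echo "to stdout"; echo "to stderr" 1>&2' > "$workdir/build.log" 2>&1
line_count=$(wc -l < "$workdir/build.log" | tr -d ' ')
echo "captured $line_count lines"

# The shorthand form, run under /bin/sh. dash rejects it with "Bad fd number";
# where /bin/sh is bash it is accepted, so only the exit status is reported.
if ( cd "$workdir" && sh -c 'echo test >& shorthand.log' ) 2>/dev/null; then
    echo "/bin/sh accepted the >& shorthand"
else
    echo "/bin/sh rejected the >& shorthand"
fi
rm -rf "$workdir"
```

The portable form captures both streams under any POSIX shell, which is why it is the safer choice for a command string passed to system(3).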

@wuxianliang

Oh, it needs CUDA 12.2 or later.

the cluster now starts but when running a query this pops up:

2024-04-13 15:36:14.252 -03 [60537] LOG: PG-Strom fatbin image is not valid now, so rebuild in progress...
2024-04-13 15:36:14.252 -03 [60537] LOG: rebuild fatbin command: cd '/tmp/.pgstrom_fatbin_build_X0Lgmg' && ( /bin/sh -x -c '/usr/local/cuda/bin/nvcc --maxrregcount=128 --source-in-ptx -lineinfo -I. -I/usr/include/postgresql/16/server -DHAVE_FLOAT2 -arch=native --threads 4 --device-c -o xpu_common.o /usr/share/postgresql/16/pg_strom/xpu_common.cu' > xpu_common.log 2>&1 & /bin/sh -x -c '/usr/local/cuda/bin/nvcc --maxrregcount=128 --source-in-ptx -lineinfo -I. -I/usr/include/postgresql/16/server -DHAVE_FLOAT2 -arch=native --threads 4 --device-c -o cuda_gpuscan.o /usr/share/postgresql/16/pg_strom/cuda_gpuscan.cu' > cuda_gpuscan.log 2>&1 & /bin/sh -x -c '/usr/local/cuda/bin/nvcc --maxrregcount=128 --source-in-ptx -lineinfo -I. -I/usr/include/postgresql/16/server -DHAVE_FLOAT2 -arch=native --threads 4 --device-c -o cuda_gpujoin.o /usr/share/postgresql/16/pg_strom/cuda_gpujoin.cu' > cuda_gpujoin.log 2>&1 & /bin/sh -x -c '/usr/local/cuda/bin/nvcc --maxrregcount=128 --source-in-ptx -lineinfo -I. -I/usr/include/postgresql/16/server -DHAVE_FLOAT2 -arch=native --threads 4 --device-c -o cuda_gpupreagg.o /usr/share/postgresql/16/pg_strom/cuda_gpupreagg.cu' > cuda_gpupreagg.log 2>&1 & /bin/sh -x -c '/usr/local/cuda/bin/nvcc --maxrregcount=128 --source-in-ptx -lineinfo -I. -I/usr/include/postgresql/16/server -DHAVE_FLOAT2 -arch=native --threads 4 --device-c -o xpu_basetype.o /usr/share/postgresql/16/pg_strom/xpu_basetype.cu' > xpu_basetype.log 2>&1 & /bin/sh -x -c '/usr/local/cuda/bin/nvcc --maxrregcount=128 --source-in-ptx -lineinfo -I. -I/usr/include/postgresql/16/server -DHAVE_FLOAT2 -arch=native --threads 4 --device-c -o xpu_numeric.o /usr/share/postgresql/16/pg_strom/xpu_numeric.cu' > xpu_numeric.log 2>&1 & /bin/sh -x -c '/usr/local/cuda/bin/nvcc --maxrregcount=128 --source-in-ptx -lineinfo -I. 
-I/usr/include/postgresql/16/server -DHAVE_FLOAT2 -arch=native --threads 4 --device-c -o xpu_timelib.o /usr/share/postgresql/16/pg_strom/xpu_timelib.cu' > xpu_timelib.log 2>&1 & /bin/sh -x -c '/usr/local/cuda/bin/nvcc --maxrregcount=128 --source-in-ptx -lineinfo -I. -I/usr/include/postgresql/16/server -DHAVE_FLOAT2 -arch=native --threads 4 --device-c -o xpu_textlib.o /usr/share/postgresql/16/pg_strom/xpu_textlib.cu' > xpu_textlib.log 2>&1 & /bin/sh -x -c '/usr/local/cuda/bin/nvcc --maxrregcount=128 --source-in-ptx -lineinfo -I. -I/usr/include/postgresql/16/server -DHAVE_FLOAT2 -arch=native --threads 4 --device-c -o xpu_misclib.o /usr/share/postgresql/16/pg_strom/xpu_misclib.cu' > xpu_misclib.log 2>&1 & /bin/sh -x -c '/usr/local/cuda/bin/nvcc --maxrregcount=128 --source-in-ptx -lineinfo -I. -I/usr/include/postgresql/16/server -DHAVE_FLOAT2 -arch=native --threads 4 --device-c -o xpu_jsonlib.o /usr/share/postgresql/16/pg_strom/xpu_jsonlib.cu' > xpu_jsonlib.log 2>&1 & /bin/sh -x -c '/usr/local/cuda/bin/nvcc --maxrregcount=128 --source-in-ptx -lineinfo -I. -I/usr/include/postgresql/16/server -DHAVE_FLOAT2 -arch=native --threads 4 --device-c -o xpu_postgis.o /usr/share/postgresql/16/pg_strom/xpu_postgis.cu' > xpu_postgis.log 2>&1) && wait; /bin/sh -x -c '/usr/local/cuda/bin/nvcc -Xnvlink --suppress-stack-size-warning -arch=native --threads 4 --device-link --fatbin -o 'pgstrom-gpucode-V012040-ff9a6c27933d7a7d7e539ebd9b2ab4a0.fatbin' xpu_common.o cuda_gpuscan.o cuda_gpujoin.o cuda_gpupreagg.o xpu_basetype.o xpu_numeric.o xpu_timelib.o xpu_textlib.o xpu_misclib.o xpu_jsonlib.o xpu_postgis.o' > pgstrom-gpucode-V012040-ff9a6c27933d7a7d7e539ebd9b2ab4a0.fatbin.log 2>&1

That is the expected behavior of the updated revision.
The background worker process (GPU Service) launches nvcc, and PG-Strom functionality becomes available once the fatbin (GPU binary) image is ready.

The problem is that it is never ready (it just loops on the rebuild), and the GPU shows no sign of activity while this happens.

You may see compilation error logs in $PGDATA/.pgstrom_fatbin/.

Or in /tmp/.pgstrom_fatbin_build_X0Lgmg, according to your logs.
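
Each nvcc invocation in the logged command writes its own `.log` file in the build directory, so the failing step can be found by printing the tail of each one. A small sketch follows; `show_build_logs` is a hypothetical helper written for this thread, not part of PG-Strom.

```shell
# Print the last lines of every nvcc log in a fatbin build directory.
# Usage: show_build_logs DIR [LINES]   (hypothetical helper, not part of PG-Strom)
show_build_logs() {
    dir=$1
    lines=${2:-5}
    for log in "$dir"/*.log; do
        # Skip the unexpanded pattern when the directory holds no logs.
        [ -f "$log" ] || continue
        echo "==> $log"
        tail -n "$lines" "$log"
    done
}

# Example with the build directory named in the log above. It may already
# have been cleaned up, in which case nothing is printed.
show_build_logs /tmp/.pgstrom_fatbin_build_X0Lgmg 3
```

The same helper can be pointed at `$PGDATA/.pgstrom_fatbin/` if the logs were left there instead.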

got it,

Every log there ends with "gcc: No such file or directory", but gcc seems to be in the PATH and working. Any tips?

Is gcc really visible from the PostgreSQL server process? Please check.
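
One way to check is to look at the PATH the running server actually has, which can differ from an interactive shell's PATH when the server is started by an init system. nvcc looks up its host compiler through PATH unless told otherwise. The sketch below is Linux-only and rests on assumptions: `PGDATA` must point at the data directory, /proc/PID/environ must be readable (run it as the postgres user or as root), and `find_in_path` and `path_of_pid` are hypothetical helper names.

```shell
# Report where TOOL is found as an executable in a colon-separated PATH string.
# Usage: find_in_path TOOL PATH_STRING   (hypothetical helper)
find_in_path() {
    tool=$1
    old_ifs=$IFS
    IFS=:
    for d in $2; do
        if [ -x "${d:-.}/$tool" ]; then
            IFS=$old_ifs
            echo "${d:-.}/$tool"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# PATH as seen by a running process (Linux only; needs read access to
# /proc/PID/environ).
path_of_pid() {
    tr '\0' '\n' < "/proc/$1/environ" | sed -n 's/^PATH=//p'
}

# Assumption: PGDATA is set; the first line of postmaster.pid is the server PID.
if [ -r "${PGDATA:-}/postmaster.pid" ]; then
    pid=$(head -n 1 "$PGDATA/postmaster.pid")
    find_in_path gcc "$(path_of_pid "$pid")" || echo "gcc is not on the server's PATH"
fi
```

If gcc turns out to be missing from that PATH, adjusting the environment of the service that starts PostgreSQL is a likely next step.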