postgres-plr/plr

Fatal error: unable to initialize the JIT

jfyne opened this issue · 13 comments

We are trying to debug some issues in our PG16 environment. They manifest as crashes of postgres, and they all appear to be related to PL/R. Often the crash is so abrupt that we get no logging from it at all; sometimes we see a log like

{"level":"info","ts":"2024-01-11T23:30:05Z","logger":"postgres","msg":"Fatal error: unable to initialize the JIT","source":"/controller/log/postgres","logging_pod":"prod16-5"}

Sometimes we see a log with *** stack smashing detected ***

This appears to come from PL/R or R itself. Other projects' issues have mentioned that perhaps it can't find libR.so. I am unsure. Can you point me in any direction to try and figure out what is causing this?

As an aside, we have a problem in which we call pg.spi.exec and, if the data returned is over a certain size, postgres crashes.

Any sort of way to diagnose this would be fantastic.

Thanks

Pretty light on details here? What else is happening before this?

In particular, what OS are we talking about here?

FWIW I don't find the error message string "unable to initialize the JIT" anywhere in either PostgreSQL or PL/R sources.

But in any case if the OS is Linux, perhaps you are getting hit by the OOM killer? Look for references to "signal 9" in the postgres logs and/or "sigkill" in dmesg output (I seem to recall dmesg uses the name rather than the number for the signal, but I think it also explicitly says OOM kill or something similar too).

Also this link points to Flask and R interaction: https://stackoverflow.com/questions/62928973/rpy2-in-a-flask-app-fatal-error-unable-to-initialize-the-jit

Hope this helps

This one seems to indicate "path to libR.so not known to the linker (e.g., not in ldconf or in LD_LIBRARY_PATH)"
rpy2/rpy2#870

In that case the solution might be to create "/etc/ld.so.conf.d/libR.conf" with contents something like:
<path-to>/R/lib
and then run
ldconfig

Again, this assumes your OS is Linux of course.

Hi, sorry, this is Linux (Debian) in a container on Kubernetes. I don't see the containers being restarted with the usual OOM code.

It looks like it's exit code 2:

{"level":"info","ts":"2024-01-11T23:30:05Z","logger":"postgres","msg":"record","logging_pod":"prod16-5","record":{"log_time":"2024-01-11 23:30:05.821 UTC","process_id":"26","session_id":"65a06e16.1a","session_line_num":"7","session_start_time":"2024-01-11 22:39:18 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"server process (PID 520) exited with exit code 2","backend_type":"postmaster","query_id":"0"}}

As for the LD_LIBRARY_PATH, I have had a look. When I run R in the container

postgres@prod16-5:/hmm$ R

R version 4.3.2 (2023-10-31) -- "Eye Holes"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> Sys.getenv(c("LD_LIBRARY_PATH"))
[1] "/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/lib/server:/usr/local/lib"
> 

Whereas in PL/R, if I run the following:

create or replace function plr_path()
returns text
language plr
as $function$
	Sys.getenv(c("LD_LIBRARY_PATH"))
$function$;
hmm=# select * from plr_path();
 plr_path 
----------
 
(1 row)

I have tried setting the LD_LIBRARY_PATH environment variable manually to what R itself reports (I am not sure where it gets it from), but we are still seeing the same problem.

Sounds like you need to take this up with whoever is providing the container. This is not a PL/R issue as far as I can tell; it is more of a configuration problem. With a container there is not much you can do unless you have the ability to modify the container itself.

FWIW, you can use plr_environ() to see the environment from the Postgres point of view.
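
For example, something like this (a sketch, assuming the standard plr_environ() signature that returns name/value pairs) would show whether the relevant paths are set for the backend:

-- Compare with what interactive R reports; empty values suggest the
-- backend environment is missing the library paths.
SELECT name, value
FROM plr_environ()
WHERE name IN ('LD_LIBRARY_PATH', 'R_HOME');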

The OOM stuff I mention is in reference to "we have a problem in which we call pg.spi.exec and if the data returned is over a certain size postgres crashes." You need to check the logs to confirm that though. I don't see any evidence in your reply that you have looked at the actual postgres logs. The "exit 2" is not coming from postgres and seems to be a client exit code.

Is this happening every time, i.e. PL/R never runs, or is it sporadic, as in PL/R usually works but sometimes you get a crash with this error? If the latter, it still makes me suspect OOM.

Do you know what cgroup version is being used (v1 or v2)?
Do you know if there is a memory limit set on the postgres process cgroup (memory.limit under cgroup v1, memory.max under cgroup v2)?

	In R support function pg.spi.exec
	In PL/R function test_plr2
2024-01-15 15:40:32.308 UTC [605] DEBUG:  LLVMJIT detected CPU "broadwell", with features "+sse2,-tsxldtrk,+cx16,+sahf,-tbm,-avx512ifma,-sha,-gfni,-fma4,-vpclmulqdq,+prfchw,+bmi2,-cldemote,+fsgsbase,-ptwrite,-amx-tile,-avx512bf16,+popcnt,+aes,-avx512bitalg,-movdiri,-xsaves,-avx512er,-xsavec,-avx512vnni,-amx-bf16,-avx512vpopcntdq,-pconfig,-clwb,-avx512f,-clzero,-pku,+mmx,-lwp,-rdpid,-xop,+rdseed,-waitpkg,-movdir64b,-sse4a,-avx512bw,-clflushopt,+xsave,-avx512vbmi2,+64bit,-avx512vl,-serialize,+invpcid,-avx512cd,+avx,-vaes,+cx8,+fma,+rtm,+bmi,-enqcmd,+rdrnd,-mwaitx,+sse4.1,+sse4.2,+avx2,+fxsr,-wbnoinvd,+sse,+lzcnt,+pclmul,-prefetchwt1,+f16c,+ssse3,-sgx,-shstk,+cmov,-avx512vbmi,-amx-int8,+movbe,-avx512vp2intersect,+xsaveopt,-avx512dq,+adx,-avx512pf,+sse3"
2024-01-15 15:40:32.308 UTC [605] CONTEXT:  SQL statement "SELECT 
						   s.subject_id,
						   jsonb_array_elements(response_answer)#>>'{}' as gender
						FROM adex.subject s
						join adex.response r using (subject_id, survey_id)
						WHERE response_category = 'profile-gender'
						AND s.study_id = 106313
						AND s.subject_status = 'complete'"
	In R support function pg.spi.exec
	In PL/R function test_plr2
2024-01-15 15:40:32.310 UTC [571] DEBUG:  postmaster received pmsignal signal
2024-01-15 15:40:32.311 UTC [571] DEBUG:  registering background worker "parallel worker for PID 605"
2024-01-15 15:40:32.311 UTC [571] DEBUG:  registering background worker "parallel worker for PID 605"
2024-01-15 15:40:32.311 UTC [571] DEBUG:  registering background worker "parallel worker for PID 605"
2024-01-15 15:40:32.311 UTC [571] DEBUG:  registering background worker "parallel worker for PID 605"
2024-01-15 15:40:32.311 UTC [571] DEBUG:  registering background worker "parallel worker for PID 605"
2024-01-15 15:40:32.311 UTC [571] DEBUG:  registering background worker "parallel worker for PID 605"
2024-01-15 15:40:32.311 UTC [571] DEBUG:  registering background worker "parallel worker for PID 605"
2024-01-15 15:40:32.311 UTC [571] DEBUG:  starting background worker process "parallel worker for PID 605"
postgres: dev16: postgres hmm [local] SELECT: malloc.c:2379: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.
2024-01-15 15:40:32.311 UTC [571] DEBUG:  starting background worker process "parallel worker for PID 605"
2024-01-15 15:40:32.312 UTC [571] DEBUG:  starting background worker process "parallel worker for PID 605"
2024-01-15 15:40:32.312 UTC [616] DEBUG:  InitPostgres
2024-01-15 15:40:32.313 UTC [571] DEBUG:  starting background worker process "parallel worker for PID 605"
2024-01-15 15:40:32.313 UTC [617] DEBUG:  InitPostgres
2024-01-15 15:40:32.313 UTC [571] DEBUG:  starting background worker process "parallel worker for PID 605"
2024-01-15 15:40:32.314 UTC [618] DEBUG:  InitPostgres
2024-01-15 15:40:32.314 UTC [571] DEBUG:  starting background worker process "parallel worker for PID 605"
2024-01-15 15:40:32.314 UTC [619] DEBUG:  InitPostgres
2024-01-15 15:40:32.315 UTC [571] DEBUG:  starting background worker process "parallel worker for PID 605"
2024-01-15 15:40:32.315 UTC [620] DEBUG:  InitPostgres
2024-01-15 15:40:32.315 UTC [621] DEBUG:  InitPostgres
2024-01-15 15:40:32.316 UTC [622] DEBUG:  InitPostgres
2024-01-15 15:40:32.318 UTC [571] DEBUG:  server process (PID 605) was terminated by signal 6: Aborted
2024-01-15 15:40:32.318 UTC [571] DETAIL:  Failed process was running: select *
	from analytics.test_plr2() as subs(subject_id bigint, age text);
2024-01-15 15:40:32.318 UTC [571] LOG:  server process (PID 605) was terminated by signal 6: Aborted
2024-01-15 15:40:32.318 UTC [571] DETAIL:  Failed process was running: select *
	from analytics.test_plr2() as subs(subject_id bigint, age text);
2024-01-15 15:40:32.318 UTC [571] LOG:  terminating any other active server processes
2024-01-15 15:40:32.318 UTC [571] DEBUG:  sending SIGQUIT to process 616
2024-01-15 15:40:32.318 UTC [571] DEBUG:  sending SIGQUIT to process 617
2024-01-15 15:40:32.318 UTC [571] DEBUG:  sending SIGQUIT to process 618
2024-01-15 15:40:32.318 UTC [571] DEBUG:  sending SIGQUIT to process 619
2024-01-15 15:40:32.318 UTC [571] DEBUG:  sending SIGQUIT to process 620
2024-01-15 15:40:32.318 UTC [571] DEBUG:  sending SIGQUIT to process 621
2024-01-15 15:40:32.318 UTC [571] DEBUG:  sending SIGQUIT to process 622
2024-01-15 15:40:32.318 UTC [571] DEBUG:  sending SIGQUIT to process 578
2024-01-15 15:40:32.318 UTC [571] DEBUG:  sending SIGQUIT to process 573
2024-01-15 15:40:32.318 UTC [571] DEBUG:  sending SIGQUIT to process 572
2024-01-15 15:40:32.318 UTC [571] DEBUG:  sending SIGQUIT to process 575
2024-01-15 15:40:32.318 UTC [571] DEBUG:  sending SIGQUIT to process 576
2024-01-15 15:40:32.318 UTC [571] DEBUG:  sending SIGQUIT to process 577
2024-01-15 15:40:32.319 UTC [571] DEBUG:  forked new backend, pid=623 socket=10
2024-01-15 15:40:32.319 UTC [623] LOG:  connection received: host=[local]
2024-01-15 15:40:32.320 UTC [623] FATAL:  the database system is in recovery mode
2024-01-15 15:40:32.320 UTC [623] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
2024-01-15 15:40:32.320 UTC [623] DEBUG:  shmem_exit(1): 0 on_shmem_exit callbacks to make
2024-01-15 15:40:32.320 UTC [623] DEBUG:  proc_exit(1): 1 callbacks to make
2024-01-15 15:40:32.320 UTC [623] DEBUG:  exit(1)
2024-01-15 15:40:32.320 UTC [623] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2024-01-15 15:40:32.320 UTC [623] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2024-01-15 15:40:32.320 UTC [623] DEBUG:  proc_exit(-1): 0 callbacks to make
2024-01-15 15:40:32.322 UTC [571] DEBUG:  server process (PID 623) exited with exit code 1
2024-01-15 15:40:32.322 UTC [571] LOG:  shutting down because restart_after_crash is off
2024-01-15 15:40:32.322 UTC [571] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
2024-01-15 15:40:32.322 UTC [571] DEBUG:  shmem_exit(1): 4 on_shmem_exit callbacks to make
2024-01-15 15:40:32.322 UTC [571] DEBUG:  cleaning up orphaned dynamic shared memory with ID 3863052026
2024-01-15 15:40:32.322 UTC [571] DEBUG:  cleaning up orphaned dynamic shared memory with ID 2175441618
2024-01-15 15:40:32.322 UTC [571] DEBUG:  cleaning up orphaned dynamic shared memory with ID 3663836102
2024-01-15 15:40:32.322 UTC [571] DEBUG:  cleaning up orphaned dynamic shared memory with ID 1285479218
2024-01-15 15:40:32.322 UTC [571] DEBUG:  cleaning up dynamic shared memory control segment with ID 1994213952
2024-01-15 15:40:32.341 UTC [571] DEBUG:  proc_exit(1): 2 callbacks to make
2024-01-15 15:40:32.341 UTC [571] LOG:  database system is shut down
2024-01-15 15:40:32.341 UTC [571] DEBUG:  exit(1)
2024-01-15 15:40:32.341 UTC [571] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2024-01-15 15:40:32.341 UTC [571] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2024-01-15 15:40:32.341 UTC [571] DEBUG:  proc_exit(-1): 0 callbacks to make

Here is a pg.spi.exec-caused crash. It looks like malloc:

SELECT: malloc.c:2379: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.

The function and query I am running looks like this

CREATE OR REPLACE FUNCTION analytics.test_plr2()
RETURNS SETOF record AS $$
  # Execute the query
  res <- pg.spi.exec("SELECT 
					    s.subject_id,
					    jsonb_array_elements(response_answer)#>>'{}' as gender
					FROM adex.subject s
					join adex.response r using (subject_id, survey_id)
					WHERE response_category = 'profile-gender'
					AND s.study_id = 106313
					AND s.subject_status = 'complete'")
  # Check if the result is not null and has data
  if (!is.null(res) && length(res) > 0) {
    # Convert the list to a data frame
    data_frame <- data.frame(subject_id=as.integer(res[[1]]), age=as.character(res[[2]]), stringsAsFactors=FALSE)
    return(data_frame)
  } else {
    return(data.frame(subject_id=integer(), age=character()))
  }
$$ LANGUAGE 'plr';

If I run the query outside PL/R it runs fine. I can also run the query in plpgsql and plpython without issue.

The error occurs inside postgres when the parallel worker is started. We have never tried to make PL/R parallel safe -- is the PL/R function explicitly created with "PARALLEL SAFE"? The default (as in not specified) would be "PARALLEL UNSAFE", but it isn't clear to me what happens to the query being executed via SPI. Another thing to try would be to add "SET max_parallel_workers_per_gather = 0" to the PL/R function definition and see if the problem goes away.
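
One way to do that, as a sketch (using the function name from above; adjust to your schema):

-- Disable parallel query for anything this function runs via SPI:
ALTER FUNCTION analytics.test_plr2() SET max_parallel_workers_per_gather = 0;

-- And/or state the parallel safety explicitly (PARALLEL UNSAFE is the default):
ALTER FUNCTION analytics.test_plr2() PARALLEL UNSAFE;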

Any chance you can capture a core file from the process that gets the abort (signal 6)?

BTW, I assume you built postgres with "--enable-cassert" and turned on debug logging specifically for the troubleshooting, correct? You would not ordinarily want either of those for a prod system.
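
A quick way to check both from psql, for what it's worth (these are ordinary GUCs):

SHOW debug_assertions;   -- 'on' only for an --enable-cassert build
SHOW log_min_messages;   -- DEBUG1..DEBUG5 means debug logging is active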

This postgres is from this Docker image:

https://github.com/cloudnative-pg/postgres-containers/blob/main/Debian/16/Dockerfile which in turn is from
https://github.com/docker-library/postgres/blob/d416768b1a7f03919b9cf0fef6adc9dcad937888/16/bullseye/Dockerfile

I assume this is just the official postgres install from apt.

I will attempt to extract a core file.

Thanks for the PARALLEL SAFE and max_parallel_workers_per_gather pointers, I will implement that and see if that helps with our issues.

The functions were not defined with PARALLEL SAFE. I tested:

PARALLEL UNSAFE: Crash still happens
SET max_parallel_workers_per_gather = 0: Crash doesn't happen

I will try deploying with the workers setting and see if our intermittent issues are resolved.

I've been trying to get a core dump, but unfortunately the root filesystem is set to read-only and I think that it's preventing the core dumps from being written.

Ok based on the above I am guessing that even though the PL/R function is PARALLEL UNSAFE (the default), the SPI query execution is done with parallel query, and somehow that tickles a bug. What is not clear to me is whether the bug is in PL/R or in Postgres itself.

If you cannot get a core file, can you create a self contained example (table def, sample data -- perhaps generated with generate_series -- and PL/R function) that I can use to duplicate the crash?
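
Something shaped like this would be ideal (a rough sketch only; the table, column, and function names are placeholders, and the row count may need tuning so the planner actually chooses a parallel plan):

-- Hypothetical table with enough rows to make parallel query attractive
CREATE TABLE repro_subject AS
SELECT g AS subject_id,
       md5(g::text) AS gender
FROM generate_series(1, 1000000) AS g;

ANALYZE repro_subject;

-- PL/R function following the same pattern as test_plr2
CREATE OR REPLACE FUNCTION repro_plr()
RETURNS SETOF record AS $$
  res <- pg.spi.exec("SELECT subject_id, gender FROM repro_subject")
  data.frame(subject_id = as.integer(res[[1]]),
             gender     = as.character(res[[2]]),
             stringsAsFactors = FALSE)
$$ LANGUAGE plr;

SELECT * FROM repro_plr() AS t(subject_id int, gender text);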

I'm still trying to create a self-contained example.

However, I did deploy the fixes for pg.spi.exec, but we are still seeing the "Fatal error: unable to initialize the JIT" in the postgres logs, which is definitely coming from R:

https://github.com/wch/r-source/blob/trunk/src/main/main.c#L1193

I am now trying our postgres deployment with the environment variable R_ENABLE_JIT=0 to see if this stops the crashes.
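
As a sanity check, a small sketch along the lines of the earlier plr_path() function (the function name here is just illustrative) can confirm whether the variable actually reaches the R embedded in the backend:

CREATE OR REPLACE FUNCTION plr_jit_env()
RETURNS text
LANGUAGE plr
AS $function$
	-- Empty output would mean R_ENABLE_JIT never made it into the backend environment
	Sys.getenv("R_ENABLE_JIT")
$function$;

SELECT plr_jit_env();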

Hmmm, is JIT support in R new? I don't ever remember R doing JIT before. What version of R are you using? Perhaps there is something PL/R ought to be doing to play nice with a new R feature?

(looks) Seems like R has had the ability to do JIT since 2010, so not exactly new, but perhaps no one typically enables it? I wonder if maybe R was compiled to use a different LLVM version than Postgres and that is causing the issue.

Anyway, I think I would need a way to reproduce the issue in order to get to the bottom of it.

We are using

R version 4.3.2 (2023-10-31) -- "Eye Holes"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

I believe the default is JIT enabled. After disabling the JIT we are no longer seeing "unable to initialize the JIT"; it has now been replaced by *** stack smashing detected *** every time.

We have just deployed a new way to hopefully get a core dump, so just waiting on a crash now.