zonkyio/embedded-postgres

java.io.IOException: Gave up waiting for server to start after 10000ms

lwitzani opened this issue · 9 comments

Hi,

i have a similar issue as in zonkyio/embedded-database-spring-test#1

I am currently in the process of updating the postgres related dependencies in our project.
Before we had

<groupId>io.zonky.test</groupId>
<artifactId>embedded-postgres</artifactId>
<version>1.2.7</version>

<groupId>io.zonky.test.postgres</groupId>
<artifactId>embedded-postgres-binaries-bom</artifactId>
<version>13.2.0</version>

and i tried to update to

<groupId>io.zonky.test</groupId>
<artifactId>embedded-postgres</artifactId>
<version>2.0.1</version>

<groupId>io.zonky.test.postgres</groupId>
<artifactId>embedded-postgres-binaries-bom</artifactId>
<version>14.5.0</version>

Locally all tests work when starting them via intellij and also directly via mvn test.
The problem now occurs on our build system.
We use TeamCity together with Windows Agents to execute the builds.
The build step is basically mvn clean deploy.
With the old configuration, everything worked fine but with the new one the builds have failed tests.
Unfortunately, the only log output for failed tests is

[09:51:16][SystemModelServiceTest] java.io.IOException: Gave up waiting for server to start after 10000ms
[09:51:16][SystemModelServiceTest] java.io.IOException: Gave up waiting for server to start after 10000ms
Caused by: java.sql.SQLException: connect failed
Caused by: java.net.SocketTimeoutException: connect timed out

and the strange part is that not all tests fail like this.
In fact, 35 fail and 91 pass.
So it seems that it should actually work, otherwise all tests would fail.

Do you have any idea what the problem could be in this case?

As an additional note, i have to mention that i modified the used embedded-postgres-binaries-windows-amd64-14.5.0.jar to include a postgres environment that has the postgis extension installed.
It contains the original environment + manually installed postgis extension files.
This is the case for both versions 13.2.0 and 14.5.0 and it worked perfectly on 13.2.0. I don't think that this is the problem.

Thanks so much!
Cheers,
Lukas

Hi @lwitzani. In my opinion, it is very likely the error is related to the modified binaries. In that case I can't give you any assistance. However, here are some ideas to deal with the problem. You can try to prepare a test that will not require the postgis extension to eliminate the cause due to modified files. You can also try to start the binaries manually like this pg_ctl -D ./data start. Then you will get much more information in the console to solve the problem.

Hey thanks for your answer.

I think i need to correct my initial statement:

So it seems that it should actually work, otherwise all tests would fail.

This was not correct. Only the tests that need the embedded database fail. The ones that do not need it succeed. Guess i did not see that initially.

We have a second project that also uses the embedded database for tests but it does not require the postgis extension. I tried the vanilla dependency again

<groupId>io.zonky.test</groupId>
<artifactId>embedded-postgres</artifactId>
<version>2.0.1</version>

<groupId>io.zonky.test.postgres</groupId>
<artifactId>embedded-postgres-binaries-bom</artifactId>
<version>14.5.0</version>

while removing all files that i modified. I also made sure that the original files are really used.
It did not change any behaviour at all. So i can say that it is not due to the modification.
I'll try to use your second proposal but i don't think that will help me because the extracting of the binary files is done each time the build agent runs since everything is cleaned before each run. So i guess there is no point of time where i could trigger your command.

On the build agents, multiple of those epg folders are created. Locally i only have like one or two or them. Maybe that could indicate some issues?
image

Is there some configuration i can do to increase the timeout time? Since the error message says

java.io.IOException: Gave up waiting for server to start after 10000ms

maybe it would work if it just would wait longer for the embedded environment to be started?

Thanks so much

I am currently starting the postgres server with this line
pg = EmbeddedPostgres.builder().setPort(5432).setPGStartupWait(Duration.ofSeconds(30)).start();
but also the 30sec wait time does not help.

I logged into the build agent and tried to connect multiple times to the database server that should be started through this command (right when the build is running) but i really cannot connect to it, so it seems that the server is just not started. I also tried your provided command pg_ctl -D ./data start with the "epg" folders as data directory and that command boots the database server with the unpacked binaries just fine and i can connect to it just fine. So the unpacking of the binaries and the creation of the data directory seems to be working but the booting of the server seems to have an issue. I also tried using another port 5454 but that did also not help

i set the log level to DEBUG and defined a output redirector as follows

File file = new File("C:\\BuildAgent\\temp\\embeddedLog.txt");
pg = EmbeddedPostgres.builder().setPort(5454)
      .setOutputRedirector(ProcessBuilder.Redirect.appendTo(file))
      .setErrorRedirector(ProcessBuilder.Redirect.appendTo(file))
      .setPGStartupWait(Duration.ofSeconds(30)).start();

So running locally, in the redirector file i have

<multiple lines that are not interesting>
führe Post-Bootstrap-Initialisierung durch ... ok
synchronisiere Daten auf Festplatte ... ok

Erfolg. Sie können den Datenbankserver jetzt mit

    ^"C^:^\Users^\L58B0^~1^.WIT^\AppData^\Local^\Temp^\embedded^-pg^\PG^-fffe4b7d9e59a120ea9590802be69f29^\bin^\pg^_ctl^" -D ^"C^:^\Users^\L58B0^~1^.WIT^\AppData^\Local^\Temp^\epg14977322756008770021^" -l logdatei start

starten.

2022-11-25 10:35:43.037 CET [20320] LOG:  PostgreSQL 14.5, compiled by Visual C++ build 1914, 64-bit startet
2022-11-25 10:35:43.057 CET [20320] LOG:  erwarte Verbindungen auf IPv6-Adresse »::1«, Port 5454
2022-11-25 10:35:43.057 CET [20320] LOG:  erwarte Verbindungen auf IPv4-Adresse »127.0.0.1«, Port 5454
2022-11-25 10:35:43.107 CET [15616] LOG:  Datenbanksystem wurde am 2022-11-25 10:35:42 CET heruntergefahren
2022-11-25 10:35:43.272 CET [20320] LOG:  Datenbanksystem ist bereit, um Verbindungen anzunehmen

with the last line saying that the database system is now ready to accept connections. The printed command in the middle tells that everything is now ready to be startet using this command.

additionally via log level DEBUG i get these outputs

main] i.z.t.d.p.embedded.EmbeddedPostgres      : Detected a Windows x86_64 system
main] .z.t.d.p.e.DefaultPostgresBinaryResolver : Detected distribution: 'Unknown'
main] .z.t.d.p.e.DefaultPostgresBinaryResolver : System specific postgres binaries found: 'postgres-windows-x86_64.txz'
main] i.z.t.d.p.embedded.EmbeddedPostgres      : Postgres binaries at C:\Users\L58B0~1.WIT\AppData\Local\Temp\embedded-pg\PG-fffe4b7d9e59a120ea9590802be69f29
main] i.z.t.d.p.embedded.EmbeddedPostgres      : faaa279d-513a-465e-a9fe-542ef9b78e5b initdb completed in 00:00:03.205
main] i.z.t.d.p.embedded.EmbeddedPostgres      : faaa279d-513a-465e-a9fe-542ef9b78e5b postmaster started as Process[pid=12196, exitValue="not exited"] on port 5454.  Waiting up to PT30S for server startup to finish.
main] i.z.t.d.p.embedded.EmbeddedPostgres      : faaa279d-513a-465e-a9fe-542ef9b78e5b postmaster startup finished in 00:00:00.834

with the last line saying it finished in 834ms to start the server which is pretty fast i guess.

Now on contrast to this, on the build agent i have in the redirect file:

<multiple lines that are not interesting>
synchronisiere Daten auf Festplatte ... ok

Erfolg. Sie können den Datenbankserver jetzt mit

    ^"C^:^\BuildAgent^\temp^\buildTmp^\embedded^-pg^\PG^-fffe4b7d9e59a120ea9590802be69f29^\bin^\pg^_ctl^" -D ^"C^:^\BuildAgent^\temp^\buildTmp^\epg10422145599213142958^" -l logdatei start

starten.

but after that it does not continue with the log lines that state that the server is starting.

Also in the build log that is now visible via log level DEBUG:

main] i.z.t.d.p.embedded.EmbeddedPostgres      : Detected a Windows x86_64 system
main] .z.t.d.p.e.DefaultPostgresBinaryResolver : Detected distribution: 'Unknown'
main] .z.t.d.p.e.DefaultPostgresBinaryResolver : System specific postgres binaries found: 'postgres-windows-x86_64.txz'
main] i.z.t.d.p.embedded.EmbeddedPostgres      : Extracting Postgres...
main] i.z.t.d.p.embedded.EmbeddedPostgres      : Postgres binaries at C:\BuildAgent\temp\buildTmp\embedded-pg\PG-fffe4b7d9e59a120ea9590802be69f29
main] i.z.t.d.p.embedded.EmbeddedPostgres      : f4a6172b-6ed5-4a47-bfe6-0538debd282b initdb completed in 00:00:05.986
main] i.z.t.d.p.embedded.EmbeddedPostgres      : f4a6172b-6ed5-4a47-bfe6-0538debd282b postmaster started as Process[pid=11100, exitValue="not exited"] on port 5454.  Waiting up to PT30S for server startup to finish.

but after that it does not state that it has been started (the 834ms from before).

I also tried to use the exact same command to start the database server on the build agent which is used locally on my machine. It was like

<pathToTemp>/embedded-pg/PG-<randomUUID>/bin/postgres.exe -D <pathToTemp>/epg<randomNumber> -p 5454 -F -c synchronous_commit=off -c max_connections=300 -c timezone=UTC"

and it successfully and instantly started the database server, so no problem there. I have absolutely no clue why this would not work on the build agent

now i tried to find the version that introduced this behaviour.
so i tested again with

<groupId>io.zonky.test</groupId>
<artifactId>embedded-postgres</artifactId>
<version>1.2.7</version>

<groupId>io.zonky.test.postgres</groupId>
<artifactId>embedded-postgres-binaries-bom</artifactId>
<version>13.2.0</version>

and it works fine.
I increased the version 1.2.7 to 1.2.8 and i directly had the same problems again. So it seems some changes made in this version lead to my issues...

image
the main difference between version 1.2.7 and 1.2.8 that could lead to this issue seems to be the change of the command that starts the database server. before on the left it was pg_ctl start which basically uses the pg_ctl.exe of course, and afterwards it is the new POSTGRES command mechanism which uses the postgres.exe.

Do you happen to know why this was changed to the postgres.exe? Or maybe what this .exe does different?

Update:

I now tried

<groupId>io.zonky.test</groupId>
<artifactId>embedded-postgres</artifactId>
<version>1.2.7</version>

<groupId>io.zonky.test.postgres</groupId>
<artifactId>embedded-postgres-binaries-bom</artifactId>
<version>14.5.0</version>

on the build agent and it indeed does work. So next step is to try 1.2.7 together with my modified 14.5.0 ;-)

Update: it works with my modified version of

<dependency>
            <groupId>io.zonky.test.postgres</groupId>
            <artifactId>embedded-postgres-binaries-windows-amd64</artifactId>
            <version>14.5.0postgis</version>
</dependency>

so i guess there really might be some problem with the changes from 1.2.7 to 1.2.8. Would love to hear some feedback from your side.

Thanks so much!

Thank you for your comprehensive investigation! Based on your findings, I'm guessing that the problem could be related to the following issues: #66 #67. The previous pg_ctl.exe binary has been replaced with the postgres.exe one to improve logging of the process. I'll consider reverting this change back and finding another way to access the logs.

hey, just a short update. I switched to version 2.0.2 and now everything works again as intended. Thank you!

Ok, perfect! Thanks for the information.