will/crystal-pg

Error handling for connect_listen

straight-shoota opened this issue · 15 comments

When closing the connection returned by connect_listen, an exception is raised inside the listen fiber:

require "pg"
connection = PG.connect_listen ENV["DATABASE_URL"], "foo" {  }
sleep 1
connection.close
Unhandled exception in spawn: Closed stream (IO::Error)
  from /usr/share/crystal/src/io.cr:128:5 in 'check_open'
  from /usr/share/crystal/src/io/buffered.cr:69:5 in 'read'
  from /usr/share/crystal/src/io.cr:530:20 in 'read_fully?'
  from /usr/share/crystal/src/io.cr:513:5 in 'read_fully'
  from /usr/share/crystal/src/io/byte_format.cr:131:3 in 'decode'
  from /usr/share/crystal/src/int.cr:520:5 in 'from_io'
  from /usr/share/crystal/src/io.cr:889:5 in 'read_bytes'
  from lib/pg/src/pq/connection.cr:115:7 in 'read_i32'
  from lib/pg/src/pq/connection.cr:177:7 in 'read_one_frame'
  from lib/pg/src/pq/connection.cr:169:31 in 'read_async_frame_loop'
  from lib/pg/src/pg/connection.cr:50:15 in '->'
  from /usr/share/crystal/src/fiber.cr:255:3 in 'run'
  from /usr/share/crystal/src/fiber.cr:48:34 in '->'
  from ???

PQ::Connection#read_async_frame_loop only swallows Errno, it should probably be extended to cover more errors, including IO::Error. Probably any error could be ignored if the connection is closed?

m-o-e commented

+1

I'm now using this little monkey-patch to
be able to handle errors in connect_listen:

class PG::Connection
  protected def listen
    @connection.read_async_frame_loop
  end
end

But needless to say, patching low-level
shards in this way doesn't feel good. ;)

Perhaps connect_listen could just be
a blocking call to begin with?

We can always wrap it in spawn
ourselves where needed and it gives much
better control over the error handling.

I'm currently struggling a bit to actually make connect_listen blocking. I was somewhat surprised to learn it didn't block. But perhaps that makes sense too.

Agree, PG.connect_listen should be blocking, currently the fiber will die if eg. the server is restarted, and there's no obovious way to rescue it without monkey patching

will commented

Thanks to @carlhoerberg for the patch. Making it block does make sense I think, but I don’t myself use listen/notify. Since this will be a breaking change to anyone using it now, I'd appreciate it if anyone using listen/notify could comment on making it blocking.

I think the change makes sense.

But I'd strongly advise to avoid breaking implicit behaviour without breaking the explicit API. If you're using PG.connect_listen right now, your code continues to compile with this change. But it won't work anymore because the call blocks.

This new behaviour should be introduced with a different method signature.
The easiest solution is probably to rename the methods. PG.connect_listen could become .listen_connect - or .connect_listen! (the exclamation mark isn't very self-explanatory, though).
PG::Connection#listen is more complicated. #listen! might be an acceptable option.
Maybe .blocking_connect_listen and #blocking_listen could be a last solution.

The old methods could either be deprecated or even kept alongside the new, blocking ones. Not sure if that makes much sense though, when the difference is just wrapping spawn.

This doesn't resolve the original issue, though. It's an improvement for handling the error, but not an actual fix.

The example would become something like this:

require "pg"
connection = spawn { PG.blocking_connect_listen ENV["DATABASE_URL"], "foo" {  } }
sleep 1
connection.close

Closing the connection would still raise IO::Error in Connection#blocking_listen. Of course, the situation is better because I can rescue in user code. But I'd argue that this exception shouldn't be leaking here in the first place. Closing the connection should not be an error in the listening fiber.

m-o-e commented

I'd appreciate it if anyone using listen/notify could comment on making it blocking.

As one of these users I approve of the change.
Monkey-patching should not be required to use listen/notify or deal with PG connections in a reliable fashion in general.

Closing the connection should not be an error in the listening fiber.

I agree with this as well, but not sure how much of that
can/should be abstracted away from the user.

I think the interface that most users probably want is a
blocking call that raises all connection-exceptions except
when #close was previously called.

In that case it should either just return, or raise a
known Exception such as ConnectionClosed.

@m-o-e Yeah, I'd expect a blocking method to simply return when the connection is closed.

I, too, think it makes sense making this blocking.

FYI I've worked around it currently with:

channel = Channel(Exception).new
PG.connect_listen() do
 ... 
rescue e : Exception
  channel.send(e)
end
channel.receive
will commented

Sorry everyone, I got very busy right around the last comments here, then this issue slipped out of my mind. I'm having trouble gathering context on it again. There are a good number of unreleased changes, so I'm planning on doing a release soon. Can someone make a PR with the best way forward for this please so we can get this solved in the next release?

I think the way forward is twofold:

  • Avoid raising an error from .connect_listen when the connection was closed. It should just return.
  • Add a blocking alternative to .connect_listen which delegates error handling to the users. Maybe even deprecate .connect_listen then. But I think that should be a dedicated discussion.
will commented

@straight-shoota does the code you gave in the first post error for you every time? I haven't yet been able to reproduce getting any sort of error on closing the connection. I haven't tried linux yet, maybe it's a mac/linux difference?

crystal eval 'require "./src/pg"; 1000.times { c = PG.connect_listen("postgres:///", "foo") {}; c.close }; puts "done"'
done

Strangely, the error doesn't reproduce anymore for me either. I even tried with 0.22.1, which I believe was the version used in the initial report.
Maybe it was really a platform thing on WSL1 🤷 Or some issue in an older stdlib version.

will commented

Ok, I think #243 covers everything, let me know.

will commented

The above pr has been merged and released with v0.25.0