Health check stays "healthy" even if Event Handler throws an exception
Closed this issue · 5 comments
Describe the bug
I am trying to get the health checks working, as described here: https://eventuous.dev/docs/subscriptions/subs-diagnostics/#health-checks.
I have added the required code but when an exception occurs, the health status remains "healthy".
To Reproduce
I have modified the dotnet-sample repo to always throw an exception on the RoomBooked event.
Steps to reproduce the behavior:
- Clone my fork https://github.com/PehrGit/dotnet-sample-healthchecks
- Run docker-compose up
- Run the application
- Navigate to https://localhost:12110/hc
- Notice the state is "Healthy"
- Navigate to https://localhost:12110/swagger/index.html
- Execute the POST /booking/book to add a new booking
- Notice in the console output of the application, that an exception occurs.
- Go back to https://localhost:12110/hc
- Note the state is still "Healthy"
Expected behavior
I expect the state to become "Unhealthy" because the subscription has been dropped. I have set breakpoints in the SubscriptionHealthCheck class and I see that ReportHealthy is called at the time of Subscribe, but the ReportUnhealthy is never called.
Desktop (please complete the following information):
- OS: Windows 10
- Version 0.15.0-beta.7
Additional context
- We are using the SqlServer implementation in our project and are running into the same problem.
Log output:
[16:27:34 DBG] ChannelInfo being produced...
[16:27:34 INF] Selected Unspecified/localhost:12113.
[16:27:34 DBG] ChannelInfo being produced...
[16:27:34 INF] Selected Unspecified/localhost:12113.
[16:27:34 WRN] OpenTelemetry LoggerProvider was not found in application services. Logging will remain disabled.
[16:27:34 DBG] Starting subscription BookingsProjections
[16:27:34 INF] [BookingsProjections] Loaded checkpoint BookingsProjections from MongoCheckpointStore: Checkpoint { Id = BookingsProjections, Position = 39899, IsEmpty = False }
[16:27:35 DBG] ChannelInfo produced!
[16:27:35 DBG] ChannelInfo produced!
[16:27:35 INF] [BookingsProjections] Started
[16:27:35 DBG] Subscription 076f3591-d346-4195-a761-997fe92aaae6 confirmed.
[16:27:35 INF] Started subscription BookingsProjections
[16:27:35 DBG] Starting subscription PaymentIntegration
[16:27:35 VRB] Subscription 076f3591-d346-4195-a761-997fe92aaae6 received event Booking-string13@0 C:40731/P:40731
[16:27:35 INF] [PaymentIntegration] Started
[16:27:35 DBG] Persistent Subscription PaymentsIntegration::PaymentIntegration confirmed.
[16:27:35 INF] Started subscription PaymentIntegration
[16:27:35 VRB] [BookingsProjections] Received V1.RoomBooked from Booking-string13:40731 seq 0
[16:27:35 INF] Now listening on: https://localhost:12110
[16:27:35 INF] Now listening on: http://localhost:12111
[16:27:35 INF] Application started. Press Ctrl+C to shut down.
[16:27:35 INF] Hosting environment: Development
[16:27:35 INF] Content root path: C:\tfs\eventuous-dotnet-sample\Bookings
[16:27:35 ERR] [BookingsProjections] Message handling failed at BookingStateProjection for message 300fa451-412f-4bfa-bbda-1f9b7edec298
System.Exception: Demonstration exception
at Bookings.Application.Queries.BookingStateProjection.HandleRoomBooked(IMessageConsumeContext`1 ctx, UpdateDefinitionBuilder`1 update) in C:\tfs\eventuous-dotnet-sample\Bookings\Application\Queries\BookingStateProjection.cs:line 33
at Eventuous.Projections.MongoDB.MongoOperationBuilder`2.UpdateBuilder`1.<>c__DisplayClass11_0.<UpdateFromContext>b__0(IMessageConsumeContext`1 ctx, UpdateDefinitionBuilder`1 update)
at Eventuous.Projections.MongoDB.MongoOperationBuilder`2.UpdateBuilder`1.GetUpdateWithOptions(IMessageConsumeContext`1 ctx)
at Eventuous.Projections.MongoDB.MongoOperationBuilder`2.UpdateOneBuilder.<Eventuous.Projections.MongoDB.MongoOperationBuilder<TEvent,T>.IMongoProjectorBuilder.Build>b__3_0(MessageConsumeContext`1 ctx, IMongoCollection`1 collection, CancellationToken token)
at Eventuous.Projections.MongoDB.MongoProjector`1.HandleEvent(IMessageConsumeContext context)
at Eventuous.Subscriptions.TracedEventHandler.HandleEvent(IMessageConsumeContext context)
at Eventuous.Subscriptions.Consumers.DefaultConsumer.<>c__DisplayClass2_0.<<Consume>g__Handle|0>d.MoveNext()
[16:27:35 VRB] [BookingsProjections] MyBookingsProjection handled V1.RoomBooked Booking-string13:40731 seq 0
[16:27:35 ERR] [BookingsProjections] Message handling failed at AsyncHandlingFilter for message 300fa451-412f-4bfa-bbda-1f9b7edec298
System.Exception: Demonstration exception
at Eventuous.Subscriptions.Filters.AsyncHandlingFilter.DelayedConsume(WorkerTask workerTask, CancellationToken ct)
[16:27:35 WRN] [BookingsProjections] Message V1.RoomBooked not acknowledged at 40731
System.Exception: Demonstration exception
at Eventuous.Subscriptions.Filters.AsyncHandlingFilter.DelayedConsume(WorkerTask workerTask, CancellationToken ct)
To make matters even stranger, I noticed that the breakpoint in SubscriptionHealthCheck.ReportUnhealthy()
, which I am expecting to be hit, will be hit when I leave the debugger hanging on the exception for 20 seconds or more, but not before that. Videos showing the behaviour:
msedge_xz6IzXYVZ9.mp4
rider64_zLU0qHb7Ns.mp4
If you let the app run after throwing, it will report unhealthy right away. Because you hold the execution, it drops on timeout (20 sec) and then the function to report unhealthy sub is called. When you hold there, the exception is not bubbled up to the sub, and it doesn't crash the event handling callback, so it just waits. Then, it times out and drops.
I'd expect you to see log messages from this:
Log.WarnLog?.Log(exception, "Dropped: {Reason}", reason);
It's the second line of the Dropped
function, which is called on subscription drop.
I was also writing wrong about resubscribe. Until it actually resubscribes, the subscription should remain dropped.
@PehrGit shall we conclude this as resolved since I merged your PR?
Yes let's do that. My initial assumption, that an exception thrown during handling should make the check unhealthy, was incorrect. As you pointed out in the Discord, the health check is intended to ensure connectivity to the database. To verify that events are being handled, we should use the metrics (see #309).
Thanks for assisting and clarifying that!