commercialhaskell/stackage-server

Hoogle Search Fails

JavariaGhafoor opened this issue · 8 comments

For whatever I search, I get this error:

Screenshot 2023-12-29 at 5 48 20 PM

@JavariaGhafoor this is fixed now - sorry for the wait!

See commercialhaskell/stackage#7258.

It's happening again. (I didn't find any snapshot that claims to have Hoogle database available.)

Arg. I will investigate.

Because this has become a recurring complaint maybe an automated check would be worthwhile?

I think one lesson learned is that I should have put that stuff in first. As it is, it's high on the list of my "important + urgent" followup tasks for the handover.

@tomjaguarpaw I've discovered how to manually repair a missing hoogle database by throwing files around on the server. Search should be working again. Unfortunately there is an underlying problem that will need to be fixed before the next LTS is released, or it will happen again.

There are three underlying problems that may have conspired to create this outage.

  1. R2 is performing badly for PutObject and ListObjects commands.
  2. I had misconfigured the cache in front of the bucket such that it was caching 404's.
  3. Stackage decides a new snapshot is "ready" before the Hoogle database has finished being prepared.

I believe what happened is that Stackage decided lts-22.14 was available (3) before the database finished getting uploaded (1), leading to a window of time where Hoogle searches would come up empty. This time window was long enough that someone (or some thing) did, in fact, perform a Hoogle search. Stackage hit a 404 looking for the database, and it was dutifully cached by Cloudflare (2). Thus, even after the PutObject finished, Stackage kept getting a 404 for the database.

I have fixed (2). I was already working with Cloudflare staff to resolve (1). I'm still digging into (3), because it's honestly a problem even on its own. I would like to ensure that a snapshot doesn't "go live" on Stackage until the Hoogle database is already in place.

Note that I have not definitively identified this as the actual root cause. But all three problems do definitely exist, and they could cause an outage in the way outlined. Occam's razor etc.

I also discovered a mystery: Stackage wasn't getting a 404 for the database, but it wasn't finding it, either. The logs said nothing. But every non-200 status code had logging associated with it! Where was the trail of execution going? Was some exception happening before the log lines could fire?

But I think I figured it out.

case responseStatus res of
    status
        | status == status200 -> do
            createDirectoryIfMissing True $ takeDirectory fp
            withBinaryFileDurableAtomic fp WriteMode $ \h ->
                runConduitRes $
                bodyReaderSource (responseBody res) .| ungzip .|
                sinkHandle h
            return $ Just fp
        | status == status404 -> do
            logWarn $ "NotFound: " <> display (hoogleUrl name bucketUrl)
            return Nothing
        | otherwise -> do
            body <- liftIO $ brConsume $ responseBody res
            mapM_ (logWarn . displayBytesUtf8) body
-- here     ^^^^^^
            return Nothing

If the response body is empty, the mapM_ results in no messages sent! :D