bblfsh/bblfshd

Log file hash when parsing files

dennwc opened this issue · 9 comments

When parsing a file, it would be nice to log the content hash.

This will help us find specific files that fail during large parse jobs.

bzz commented

Nice idea. If only it were possible to search GH by hash, that would be the ideal solution.

creachadair commented

We have source hashes in the dependency graph. 😉

kuba-- commented

A couple of questions (thinking out loud):

  • Do you want to hash every sent snippet?
  • Do you want to store/cache hash -> content?
  • Without storage, how do you want to search for failing contents?
  • If we have a cache/storage, then maybe clients could attach a header (something like If-Match, depending on the protocol), so we can return immediately if we find a matching hash.
bzz commented

@kuba-- The proposal is to (optionally) hash every parsed file and include a new field with this hash in our logs.

This way, the logs of the bblfshd process are the only "storage"; it's merely a way to simplify the analysis of bblfshd logs, so that one has reproducible examples for the tests. The hash is only used to verify that the file found through some manual process (i.e. GH search by filename) is indeed the same as the one that triggered the error.
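
A minimal sketch of the idea, assuming a plain SHA1 over the received bytes; the `contentHash` and `logParseError` helpers, the field names, and the use of the standard `log` package are all illustrative, not actual bblfshd code:

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"errors"
	"log"
)

// contentHash returns the hex-encoded SHA1 of the bytes bblfshd received.
func contentHash(content []byte) string {
	sum := sha1.Sum(content)
	return hex.EncodeToString(sum[:])
}

// logParseError is a hypothetical hook in the parse path: on failure, it
// emits the content hash next to the usual fields, so the offending file
// can be tracked down after a large parse job.
func logParseError(language string, content []byte, err error) {
	log.Printf("parse failed: language=%s content_hash=%s err=%v",
		language, contentHash(content), err)
}

func main() {
	logParseError("go", []byte("func broken( {"), errors.New("driver error"))
}
```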

@creachadair that would be lovely, but the dependency graph would also have to be available for all the languages first.

Both questions seem to aim at a smart way to immediately identify the offending file, which could be an interesting design question if we want this type of complexity on our side.

A very simple way to achieve the same: at some verbosity level, just include the whole file content in the error logs (compressed and base64-encoded).
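
For reference, a sketch of that simpler option (the `encodeForLog` name is illustrative): gzip the content and base64-encode the result, so a whole file fits on a single log line.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/base64"
	"fmt"
)

// encodeForLog compresses content with gzip and base64-encodes the result,
// producing a single printable token suitable for embedding in a log line.
func encodeForLog(content []byte) (string, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(content); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

func main() {
	enc, err := encodeForLog([]byte("some file content that failed to parse"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("parse failed: content_gz_b64=%s\n", enc)
}
```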

But from experience analyzing deployment logs after a large-scale parse, just having those hashes in the logs will already be a useful improvement to the last step of manual filtering.

creachadair commented

> @creachadair that would be lovely, but the dependency graph would also have to be available for all the languages first.

I wasn't being all that serious, but just to clarify: that wasn't meant as a replacement for hashes in the log. I think those are useful for debugging even if we don't (otherwise) store them. Rather, I meant it as a way to correlate the logged hashes back to file contents. Of course, if we stick to SHA1 we could also use Gitbase to answer the same query (and probably more completely).

kuba-- commented

Next question: it's nice to have the file's sha1 and be able to find it on a git service (like GitHub) by sha1, but the git sha1 is calculated on the original content, while in our case bblfshd gets already-trimmed content.
So how can it be useful? Maybe we can calculate it in the clients (which have the original file) and pass it down?

I wouldn't call what Git hashes "the original content" (the file content itself is), but it definitely makes sense to calculate the Git hash. Maybe not as a replacement, but as a second hash? We don't really know if the file comes from Gitbase or just from a client sending it from the FS directly, for example. It would be strange if the hashes in the Babelfish logs didn't match what sha1sum returns.

It does not matter if it's git, GitHub, or Gitbase.
My point was: what the client gets is different from what bblfshd gets (because, e.g., the content is tailored).
So you will log a different hash (based on the tailored content, since most likely it will come from some bblfsh client); how can you then find the non-tailored content? How is that helpful?

My point is: if a regular client sends a file and checks the logs, they will expect hash(content), as from sha1sum.
If some Git-related batch processor sends a file, we may find hash(size, content) more useful, since that's the way Git addresses blobs.
bblfshd has no way to know which use case is in play, so I think we can calculate both hashes: one for us (Git-specific, so we can search for it later) and one for the user (SHA1 of the file content).
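
A sketch of the two hashes side by side (function names are illustrative): plain SHA1 matches what `sha1sum` prints, while Git's blob hash is SHA1 over a `blob <size>\0` header followed by the content, which is what `git hash-object` computes.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// plainSHA1 matches `sha1sum`: SHA1 over the raw file content.
func plainSHA1(content []byte) string {
	sum := sha1.Sum(content)
	return hex.EncodeToString(sum[:])
}

// gitBlobSHA1 matches `git hash-object`: SHA1 over "blob <size>\x00" + content.
func gitBlobSHA1(content []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "blob %d\x00", len(content))
	h.Write(content)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	content := []byte("hello\n")
	fmt.Println("sha1sum:        ", plainSHA1(content))   // f572d396fae9206628714fb2ce00f72e94f2258f
	fmt.Println("git hash-object:", gitBlobSHA1(content)) // ce013625030ba8dba906f756967f9e9ca394464a
}
```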