codersrank-org/repo_info_extractor

Rewrite repo_info_extractor

peti2001 opened this issue ยท 13 comments

The current solution has quite a few problems. But most importantly it is hard to use. It is written in Python and for developers who have no Python experience, it is not very convenient to use.

Problems with the current solution

  • hard to install, requires Python
  • too big output, JSON takes too much memory and CPU to process, GRPCGateway and filechanges use a lot of memory.
  • it is slow, because of that we have to skip large repos

Requirements:

Nice to have

  • Merge with the multi_repo_info_extractor, being able to extract multiple repos by passing tokens, credentials.
  • Serverless compatibility (Go is available in Google Cloud)
  • parse the code instead of using regex, to improve the accuracy of the import detection
  • Minimalize disk IO, don't check out the code, do it in memory
  • Support multiple outputs. Easily extendable by the community.
  • Recognize squashed commits, not just merges
  • GUI

If you have any suggestions, problems with the current implementation, please share.

JJ commented

I don't really have a big problem with Python. I'll look for ways to speed it up, if that's the blocker. You can only get so far doing that, however. WRT to current requirements.

  1. Create a test suite that you will use to test that it's really got the same features. Create a set of model inputs (languages), and model outputs. Coverage should be 100%
  2. If the reason is that it makes it faster, so be it. If it's that it should be easy to run anywhere, maybe a Docker container (with different architecture) will be a better solution. If it's that it should be easily installable anywhere, you should be looking at packaging options, maybe.
  3. You seem to be looking at low memory footprint, but that might clash with the next requirement. Mind you, memory need not be "in the same place". It can be in some other VM, or node, and accessible via a socket. So instead of presenting a solution, it might be better to look at the actual problem and propose several solutions for it.
  4. Again, the problem might be the overall duration of processing, not the individual duration. Maybe instead of individual limits, it would be better to go at overall limits in a machine with several cores. That might restrict the choice of languages, or frameworks to process stuff. To be clear, it doesnot really matter if those take 100 minutes (which I guess is what they take) as long as they can be processed at the same time in a 16-core machine. Anyway, making the overall process faster is never a bad thing, but by limiting ourselves to sequentially faster languages we might drawing ourselves into a corner that includes only C and maybe Java and Go.
  5. Don't get this one. Similar emails?
ferki commented

@peti2001: thanks for inviting me to give input!

Personally, didn't have many problems with the Python implementation, even though I don't deal with Python code and dependencies on a daily basis. I'm not sure how big of a factor the chosen language plays in the perceived slowness.

Thumbs up for the initial definition of "fast" ๐Ÿ‘

As far as recognizing "similar" emails go, I think git's built-in mailmap feature would be pretty much on-topic for that (and even it if needs to be extended to non-git platforms, the idea and the format is already established, so that might be one less wheel to reinvent). /cc: @JJ

What I'd be happy to see is that the extractor should be able make a distinction between canonical repos and its forks. I often get credited highly for the forks of my main repos where I almost never contribute, while I get low points for the repo where 99%+ of my actual work is. I guess overall it's the same total results due to being credited to each commit ID only once, but the ratios are often weird between repos. I believe this might be a case on #132.

I suggest selecting the language after all other questions are clarified. Theoretically, you could choose any language. Golang might be the preferable choice for the use case. But if you can't find the appropriate libraries for the model, or you find poorly maintained libraries, or libraries that don't meet all criteria, it unnecessarily complicates things late in the process. I suggest beginning with the model, then thoroughly screening which libraries are available for which language and which libraries might have to be implemented. This allows for an educated guess of which language might not only be performant but also cost-efficient for implementation and maintenance. Btw, Cython is also a low-level-performant choice.

I think you're looking for a streamable format. It doesn't matter if it's text-based, binary, or compressed. It only has to be streamable. So the server won't have to cache it in memory before processing. Theoretically, this also applies client-side if you can post it to the server during the extraction. There are many streamable formats and there will be different library choices for each language. The next-best text-based format would be yaml, which allows better streaming due to line separation and a guarantee of completion mid-stream after every line. Afaik, JS has the most streamable libraries. Golang has native streaming support as well but not as many libraries.

I'd be interested in how you would like to process the git history without a checkout. I agree that a checkout is everything but performant. But a manual emulation means more implementation effort and also maintenance when the git standard changes. Golang also has a library for a virtual file system (vfs) that uses the same interface as the normal file system. But this doesn't work with external binaries. Only golang code can use the interface. Fortunately, there is a native git implementation for golang (go-git). Theoretically, git could also be processed in a stream. There is an open issue regarding their stream implementation, though.

Another optimization point I see is caching. The first extraction might take a lot of time but keeping it up to date shouldn't take the same amount of time. Theoretically, you could cache the stream and limit the checkout history or stream if the history up to that point of time has identical commit metadata. But how to recognize force-commits and other changes in the history that trace back before the date of the last extraction? If done properly, this could save a lot of processing time.

According to the @peti2001 invite:

hard to install, requires Python

I disagree - as a user, don't have any problems with Python (except that i, probably, would not participate in development with this environment). "install" part of the readme is great - i'm always trying to use language-agnostic containers and they are working fine for me here.

too big output, JSON takes too much memory and CPU to process

If i was making a similar app (repo info extractor; and a scoring algo itself too), i would consider at least 4 possible ways for refactoring:

1. Put a docker compose/stack configuration with a few replicas for the Go/ReactPHP app, with some changes to the output data: it must be rendered in a raw format without json decode / encode / object-tree building, just file offsets, for example. It opens a way to use concurrent processing by offsets with no need to perform an additional JSON parsing stage each time. We can also safely cut this data into pieces for the further processing. The main problem i see here is to properly handle a shared context, if there are some requirements/rules for the score calculator (your private algorithm), which affects different repository "pieces" within a single iteration, i don't know the implementation and can't tell much.

All of these replicas also must take only a part of the input data (commit range?).

So, you can convert output JSON to the less-readable, but faster format - offsets, for example, to be able to parallelize them later, using a cluster of low-profile Go/ReactPHP microservices (1 container = 1 instance = 1 thread), and then combine results to a report for the end-user.

2. A single Go app. Just input data, goroutines & channels and let the runtime decide, which execution flow to use for the different application parts (for a powerful server i guess).

3. Socket streaming (for your private environment only) - instead of building a fat JSON file, an "adapter" will start to emit data to your other services. They will read and buffer it as they need (memory threshold). ReactPHP example: https://github.com/reactphp/http#requeststreaming (something similar in Go).

4. Do not rewrite the code base, create a script that converts a JSON file to the text file :) to just met the "parsable line by line" requirement.

About overall processing time - i would make a "real" score calculator (which takes time to do its job) and a "fallback", to make a quick preview / estimate score changes before the main one completes the full logic pass.

So, i would start with:

1. An MVP - single script with "ported" logic from the python code (maybe some parts can be already parallelized without any trouble).
2. A benchmark, how things are faster than the current python implementation (personally, i love the built-in testing capabilities in Go, they are really cool).

Thank you for the great suggestions, I'll try to answer all of them :).

@JJ
2. The current solution has a docker implementation too. But it is another dependency. There are many developers who have no docker. Packaging is a good idea, however, we have no experience with it. But it might worth looking into it.
3. I am afraid it also increases the complexity and would be harder to set up for developers.
4. It is already multi-threaded but it was a nightmare to have a cross-platform implementation in Python. This is one of the reasons why we want to use go to have better parallelism.
5. I added a better explanation. :)

@ferki

.mailmap we use this already. We get the list of emails of the repo by git shortlog -se which considers .mailmap. When we have this list we find similar emails and names. So if you have .mailmap it will be more accurate.

Forks are out of the scope. The repo_info_extractor just extracts the data. It doesn't know whether it is a fork or not. This logic is on the server-side. That is the next refactoring after repo_info_extractor.

Many of you mentioned streaming. It is a great idea. we will definitely consider it.

Now one of the biggest bottlenecks is the disk I/O. We go through all the commits, and all the changes are written to the disk. For large repos it is a lot of disk I/O. Keeping everything in memory helps a lot. For the script output, we have to know how many lines changed and detect the imported libraries. For this, we don't have to construct the whole file. This is just an idea yet, but I hope we can put together a working prototype.

Which one would be more important for you?
Faster script ๐Ÿš€
GUI ๐Ÿ‘€

JJ commented

i think detect_language.py should be improved, because i never used CoffeeScript and im on the top 50 from Brazil.

@EduApps-CDG

i think detect_language.py should be improved, because i never used CoffeeScript and im on the top 50 from Brazil.

It looks like in this commit https://github.com/ArttiDev/php-node-project/commit/0b96956ae13ac0c914e81713fb2d179ea4b0fe46 a bunch of files have been updated and some of them were coffee script, this is why you got score for it.

I know these are just libraries but it is hard to detect which code is written by you or just copied from a library.

I know these are just libraries but it is hard to detect which code is written by you or just copied from a library.

I understood, in this commit node_modules was not in my .gitignore file. so thats why i earn score...