Add options to scan remote Git repos
samsmithnz opened this issue · 13 comments
- GitHub - split into two pieces - first to look for specific files, and then to extract the file content
- Get file list
- Get file contents
- #130
- #131
The way backstage.io scans repos is by using the APIs, as they perform better than just using git commands. It may be worth doing something similar.
@blueboxes I was planning to use the REST APIs - is that what you mean?
Yes, it was. I guess it is the obvious choice, though I have seen other approaches before.
Absolutely. Appreciate the input and engagement! Very helpful to confirm I'm moving in the right direction.
That's a great idea. Any plans to support scanning multiple repos? Azure DevOps uses a structure of projects and repos, where a project may contain multiple repos. In that case, an option to "scan all repos in this project" would be useful.
Need any help? Just let us know.
@jedjohan I'm looking into it right now - starting with repos, and then expanding to organizations/projects. The challenge I'm seeing so far is that listing all files is relatively expensive from an API perspective - I'm scared of rate limits being hit for large organizations/projects. I will try to build that into the error handling, but I'm not sure how it will scale.
Hmm, true. It will put some load on the process, and it also means introducing the ability to fail individual repos without crashing the rest, I guess? Are you planning to run parallel scans after fetching the list of repos? A simple Polly retry on the HttpClient is a good start, I guess. Something like this, maybe:
```csharp
using System.Net;
using Polly;
using Polly.Extensions.Http;
public IAsyncPolicy<HttpResponseMessage> GetRetryPolicy()
{
    // Retry transient failures (network errors, 5xx, 408) and HTTP 429 rate-limit responses
    return HttpPolicyExtensions
        .HandleTransientHttpError()
        .OrResult(msg => msg.StatusCode == HttpStatusCode.TooManyRequests)
        // Exponential backoff (2s, 4s, 8s) plus up to 2s of random jitter; a base of 1 would never grow
        .WaitAndRetryAsync(3, retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)) + TimeSpan.FromMilliseconds(Random.Shared.Next(0, 2000)),
            onRetry: (message, timespan, attempt, _) => Logger?.LogInformation($"Retrying request to {message?.Result?.RequestMessage?.RequestUri} in {timespan.TotalSeconds} seconds. Retry attempt {attempt}."));
}
```
Yes - definitely want to add parallelism as I go - so far performance is very quick - ms times, unless the directory is massive, and even then it's still seconds.
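For the parallel scans and the "fail separate repos without crashing the rest" idea discussed above, a minimal sketch might look like the following. It is not the project's implementation: `ParallelRepoScanner`, `RepoScanResult`, and the `scanRepo` delegate are hypothetical names used only for illustration, and the concurrency limit of 4 is an arbitrary assumption.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Outcome of scanning one repo: a project count, or the error that stopped the scan.
public record RepoScanResult(string Repo, int ProjectCount, Exception? Error);

public static class ParallelRepoScanner
{
    // scanRepo is a hypothetical stand-in for the real per-repo scan.
    public static async Task<List<RepoScanResult>> ScanAllAsync(
        IEnumerable<string> repos,
        Func<string, Task<int>> scanRepo,
        int maxConcurrency = 4)
    {
        // Limit how many scans run at once so the API is not flooded.
        using SemaphoreSlim throttle = new(maxConcurrency);
        IEnumerable<Task<RepoScanResult>> tasks = repos.Select(async repo =>
        {
            await throttle.WaitAsync();
            try
            {
                int count = await scanRepo(repo);
                return new RepoScanResult(repo, count, null);
            }
            catch (Exception ex)
            {
                // Record the failure and let the other repos continue.
                return new RepoScanResult(repo, 0, ex);
            }
            finally
            {
                throttle.Release();
            }
        });
        // Task.WhenAll returns results in the same order as the input repos.
        RepoScanResult[] results = await Task.WhenAll(tasks);
        return results.ToList();
    }
}
```

Because each failure is captured in its own `RepoScanResult`, a caller can report the repos that failed (for example, after retries were exhausted) alongside the ones that succeeded.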
Do you need to scan all the files in each repo?
If you are not already considering it, you could use the search API across the whole DevOps project. Since you only want to look at files with certain extensions, you could search for those and then fetch just those individual files.
@blueboxes I'm open to ideas. Here are the current challenges with GitHub - and while I haven't jumped into issues with other DevOps solutions, I'm guessing the other Git repos are similar:
- The search API is rate limited to 10 requests per minute, and 1000 results (https://docs.github.com/en/rest/search).
- Because of the search restrictions, I'm using a tree call (see first post above).
- Even then I need to call a contents API to get the text for each individual file - no matter the method to find the project files.
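The tree-then-contents approach in the list above can be sketched roughly as follows. This only covers the filtering step, on the assumption that the JSON body of GitHub's recursive tree call (`GET /repos/{owner}/{repo}/git/trees/{sha}?recursive=1`) has already been downloaded. `TreeFilter` and its methods are hypothetical names, and the two extensions are an illustrative subset, not the actual list of project files the tool scans for.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

public static class TreeFilter
{
    // Illustrative subset only; the real list of project file types is not given in this thread.
    private static readonly string[] ProjectExtensions = { ".csproj", ".vbproj" };

    // treeJson is the response body of the recursive Git tree call.
    public static List<string> GetProjectFilePaths(string treeJson)
    {
        using JsonDocument doc = JsonDocument.Parse(treeJson);
        List<string> paths = new();
        foreach (JsonElement entry in doc.RootElement.GetProperty("tree").EnumerateArray())
        {
            // "blob" entries are files; "tree" entries are directories.
            if (entry.GetProperty("type").GetString() != "blob") continue;
            string path = entry.GetProperty("path").GetString() ?? "";
            if (ProjectExtensions.Any(ext => path.EndsWith(ext, StringComparison.OrdinalIgnoreCase)))
            {
                paths.Add(path);
            }
        }
        return paths;
    }

    // GitHub sets "truncated" to true when the tree was too large to return in full.
    public static bool IsTruncated(string treeJson)
    {
        using JsonDocument doc = JsonDocument.Parse(treeJson);
        return doc.RootElement.TryGetProperty("truncated", out JsonElement flag)
            && flag.ValueKind == JsonValueKind.True;
    }
}
```

Each path returned would then still need one contents call (`GET /repos/{owner}/{repo}/contents/{path}`) to fetch the file text, which is the per-file cost described in the last bullet.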
Just out of interest I have had a look at backstage.io which stores config files in each repo and pulls them into a software catalog. This uses a code search and then downloads each file from the results just as you have described.
The results can number a few hundred, not sure what would happen if they hit the thousands. It does not do a retry for rate limiting but does use pages.
Good to know this is how others have done this.
I'm finding it quite different processing a list of files from a REST API. When I'm scanning directories, I look in a folder for project files, and if I don't find it, I recursively jump into the new sub-directory until I find a project file (or don't! :)). With a list of all files, I have to reconstruct that directory structure manually. If I don't, I would start to scan web.config files when I don't need to. But I also don't want to search for individual files - there are 8 different project files I scan for today, and it seems silly to search for each one (that is 8 search calls, plus a get-content call for each file found - vs 1 list files call, plus a get-content for each file found).
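One way to recover the "stop descending once a project file is found" behaviour from a flat file list is sketched below, under some assumptions: paths use `/` separators as in GitHub's tree response, the input has already been filtered down to project files, and `ProjectFilePruner` is a hypothetical name. It keeps only project files that have no project file in any ancestor directory, which mirrors the recursive directory scan described above rather than reproducing its actual code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ProjectFilePruner
{
    // Keeps project files whose ancestor directories contain no other project file.
    public static List<string> TopMostProjectFiles(IEnumerable<string> projectFilePaths)
    {
        List<string> all = projectFilePaths.ToList();
        // Directories that directly contain a project file; "" is the repo root.
        HashSet<string> projectDirs = all.Select(DirectoryOf).ToHashSet(StringComparer.Ordinal);
        return all.Where(p => !HasProjectAncestor(DirectoryOf(p), projectDirs)).ToList();
    }

    private static string DirectoryOf(string path)
    {
        int slash = path.LastIndexOf('/');
        return slash < 0 ? "" : path.Substring(0, slash);
    }

    private static bool HasProjectAncestor(string dir, HashSet<string> projectDirs)
    {
        // Walk upwards: "src/App/Legacy" -> "src/App" -> "src" -> "" (root).
        while (dir.Length > 0)
        {
            int slash = dir.LastIndexOf('/');
            dir = slash < 0 ? "" : dir.Substring(0, slash);
            if (projectDirs.Contains(dir)) return true;
        }
        return false;
    }
}
```

For example, given `src/App/App.csproj` and `src/App/Legacy/Old.csproj`, only the first is kept, because the directory scan would have stopped at `src/App` and never entered `Legacy`.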
GitHub scanning is done! I'm going to experiment with what happens when I add orgs (and iterate through every repo), but I suspect for the really large orgs some customers have, this will always time out and I need a solution that allows you to segment the load.