jgehrcke/github-repo-stats

Action fails when too many jobs try to track different repos in the same data repo

ChameleonTartu opened this issue · 10 comments

This project looks amazing!

My idea was to track all of my public repos and analyze them once in a while. It looks like the action fails when I have too many jobs running, for instance when one job pushes its results before another one. My GitHub repo.

Also, there is another issue with amazon-mws-subscriptions-maven:

210411-19:09:08.177 INFO:MainThread: union-merge views and clones
Traceback (most recent call last):
  File "/fetch.py", line 314, in <module>
    main()
  File "/fetch.py", line 73, in main
    ) = fetch_all_traffic_api_endpoints(repo)
  File "/fetch.py", line 122, in fetch_all_traffic_api_endpoints
    df_views_clones = pd.concat([df_clones, df_views], axis=1, join="outer")
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 285, in concat
    op = _Concatenator(
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 467, in __init__
    self.new_axes = self._get_new_axes()
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 537, in _get_new_axes
    return [
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 538, in <listcomp>
    self._get_concat_axis() if i == self.bm_axis else self._get_comb_axis(i)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 544, in _get_comb_axis
    return get_objs_combined_axis(
  File "/usr/local/lib/python3.8/site-packages/pandas/core/indexes/api.py", line 92, in get_objs_combined_axis
    return _get_combined_index(obs_idxes, intersect=intersect, sort=sort, copy=copy)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/indexes/api.py", line 145, in _get_combined_index
    index = union_indexes(indexes, sort=sort)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/indexes/api.py", line 214, in union_indexes
    return result.union_many(indexes[1:])
  File "/usr/local/lib/python3.8/site-packages/pandas/core/indexes/datetimes.py", line 395, in union_many
    this, other = this._maybe_utc_convert(other)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/indexes/datetimes.py", line 413, in _maybe_utc_convert
    raise TypeError("Cannot join tz-naive with tz-aware DatetimeIndex")
TypeError: Cannot join tz-naive with tz-aware DatetimeIndex

Another data frame issue:

210411-19:09:18.943 INFO: parsed timestamp from path: 2021-04-11 19:09:15+00:00
Traceback (most recent call last):
  File "/analyze.py", line 1398, in <module>
    main()
  File "/analyze.py", line 82, in main
    analyse_view_clones_ts_fragments()
  File "/analyze.py", line 691, in analyse_view_clones_ts_fragments
    if df.index.max() > snapshot_time:
TypeError: '>' not supported between instances of 'float' and 'datetime.datetime'
+ ANALYZE_ECODE=1
error: analyze.py returned with code 1 -- exit.

Git clone issue:

GHRS entrypoint.sh: pwd: /github/workspace
+ git clone 'https://ghactions:${' secrets.ACCESS_GITHUB_API_TOKEN '}@github.com/ChameleonTartu/buymeacoffee-repo-stats.git' .
length of API TOKEN: 36
fatal: Too many arguments.

All other failures show the same errors as those mentioned above.

@jgehrcke Let me know if I can help with more than just reporting this. It would be great to fix all of this so I can use this tool more extensively, as I am planning to grow beyond the current 34 repos over time. It is the most valuable tool I could find for tracking repo development over time. Thank you again!

Traceback (most recent call last):
  File "/fetch.py", line 314, in <module>
    main()
  File "/fetch.py", line 73, in main
    ) = fetch_all_traffic_api_endpoints(repo)
  File "/fetch.py", line 122, in fetch_all_traffic_api_endpoints
    df_views_clones = pd.concat([df_clones, df_views], axis=1, join="outer")
[...]
TypeError: Cannot join tz-naive with tz-aware DatetimeIndex

I could not quite make sense of this one. Both df_clones and df_views are created by the same code path. I thought that maybe when one of the two is empty this might be the fallout, with a misleading error, but no:

± python
Python 3.8.6 (default, Nov 22 2020, 17:14:35)
[GCC 10.2.1 20201016 (Red Hat 10.2.1-6)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> tz_naive = pd.date_range('2018-03-01 09:00', periods=3)
>>> tz_aware = tz_naive.tz_localize(tz='US/Eastern')
>>> df_aware = pd.DataFrame(data={'lol': [1, 2, 3]}, index=tz_aware)
>>> df_aware
                           lol
2018-03-01 09:00:00-05:00    1
2018-03-02 09:00:00-05:00    2
2018-03-03 09:00:00-05:00    3
>>> df_empty = pd.DataFrame(data={}, index=[])
>>> pd.concat([df_aware, df_empty], axis=1, join="outer")
                           lol
2018-03-01 09:00:00-05:00    1
2018-03-02 09:00:00-05:00    2
2018-03-03 09:00:00-05:00    3

I am adding a patch that changes the way the DatetimeIndex is translated to a tz-aware object, which hopefully addresses this problem. It's a little disappointing not to understand it precisely.
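For illustration, a sketch of such a normalization (the column names and frames here are made up, not the actual fetch.py code): localize a tz-naive index to UTC, convert an already-aware one, and the outer-join concat then succeeds regardless of which flavor each frame started with.

```python
import pandas as pd

def to_utc(index: pd.DatetimeIndex) -> pd.DatetimeIndex:
    # Normalize any DatetimeIndex to tz-aware UTC, whether it starts
    # out tz-naive or tz-aware in some other timezone.
    if index.tz is None:
        return index.tz_localize("UTC")
    return index.tz_convert("UTC")

# Illustrative frames: one with a tz-naive index, one tz-aware.
df_clones = pd.DataFrame({"clones": [1, 2]}, index=pd.date_range("2021-04-10", periods=2))
df_views = pd.DataFrame({"views": [3, 4]}, index=pd.date_range("2021-04-10", periods=2, tz="UTC"))

df_clones.index = to_utc(df_clones.index)
df_views.index = to_utc(df_views.index)

# Both indexes are now tz-aware UTC; the outer join no longer raises.
merged = pd.concat([df_clones, df_views], axis=1, join="outer")
```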

TypeError: '>' not supported between instances of 'float' and 'datetime.datetime'

That somewhat suggests that df_clones and df_views were structured rather differently from what's expected.

Update: an empty index explains that error message:

>>> from datetime import datetime
>>> df_empty.index.max() > datetime(year=2012, month=3, day=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '>' not supported between instances of 'float' and 'datetime.datetime'
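One way to guard against this (a sketch only; the variable names are illustrative, not the actual analyze.py code): `.max()` on an empty index returns NaN, a float, so checking for emptiness before comparing avoids the TypeError.

```python
import pandas as pd
from datetime import datetime, timezone

df = pd.DataFrame(data={}, index=[])
snapshot_time = datetime(2021, 4, 11, 19, 9, 15, tzinfo=timezone.utc)

# On an empty index, .max() returns NaN (a float), so comparing it to a
# datetime raises TypeError. Check for emptiness before comparing:
if not df.empty and df.index.max() > snapshot_time:
    newest_after_snapshot = True
else:
    newest_after_snapshot = False
```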

GHRS entrypoint.sh: pwd: /github/workspace
+ git clone 'https://ghactions:${' secrets.ACCESS_GITHUB_API_TOKEN '}@github.com/ChameleonTartu/buymeacoffee-repo-stats.git' .
length of API TOKEN: 36
fatal: Too many arguments.

Could it be that this token was actually truncated and/or maybe this is related to one of your code changes?

I notice secrets.ACCESS_GITHUB_API_TOKEN, but with the current code this should actually look very different:

git clone https://ghactions:${GHRS_GITHUB_API_TOKEN}@github.com/${DATA_REPOSPEC}.git 

When things work as expected, this is the log pattern you should see:

GHRS entrypoint.sh: pwd: /github/workspace
+ git clone ***github.com/jgehrcke/ghrs-test.git .
length of API TOKEN: 40
Cloning into '.'...

It's likely that the error message fatal: Too many arguments. was a result of the misconstructed git clone ... command.
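A quick way to see why git complains, approximating the shell's word-splitting with Python's shlex: the unexpanded expression contains spaces, so the single URL argument falls apart into several.

```python
import shlex

# The unexpanded `${ secrets.ACCESS_GITHUB_API_TOKEN }` contains spaces,
# so shell word-splitting breaks the single URL argument into three.
broken = (
    "git clone https://ghactions:${ secrets.ACCESS_GITHUB_API_TOKEN "
    "}@github.com/ChameleonTartu/buymeacoffee-repo-stats.git ."
)
words = shlex.split(broken)
# git ends up seeing four arguments after `clone` instead of two,
# hence "fatal: Too many arguments.".
```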

@ChameleonTartu would you mind retrying things with the current head of main? I think I've addressed all issues reported to date (maybe have a look at the changelog). Happy to cut a release, but ideally only after getting your confirmation that things indeed work.

@jgehrcke I made a run: https://github.com/ChameleonTartu/buymeacoffee-repo-stats/actions/runs/748508227

The only use-case that doesn't work is:

GHRS entrypoint.sh: pwd: /github/workspace
+ git clone 'https://ghactions:${' secrets.ACCESS_GITHUB_API_TOKEN '}@github.com/ChameleonTartu/buymeacoffee-repo-stats.git' .
length of API TOKEN: 36
fatal: Too many arguments.

And all jobs failed with the same message: https://github.com/ChameleonTartu/buymeacoffee-repo-stats/runs/2343584927?check_suite_focus=true

I suspect that some repos were created a long time ago, so they may use a different API token format. Could that be the cause? Any ideas?

The only use-case that doesn't work is:

OK, your workflow file is bad in a subtle way! A mean trap: https://github.com/ChameleonTartu/buymeacoffee-repo-stats/blob/b6d089f2bc01462e05fe8100ce1f27cfd3a24909/.github/workflows/stats.yml#L138

@ChameleonTartu you have ghtoken: ${ secrets.ACCESS_GITHUB_API_TOKEN }, but the curly braces need to come in pairs: ${{ ... }}. In most jobs you already have that.

@jgehrcke Thank you! I didn't notice these nuances.

I auto-generated some of the jobs, so it looks like I got some of them wrong. Cool-cool-cool!

@ChameleonTartu ok : ) Please leave feedback again once the current head of main works for all your jobs : )