[Feature] `activity` of each user in a repo by month
tyn1998 opened this issue · 11 comments
Description
Hi OpenDigger community,
In Hypercrx we have a feature called the repo's developer network, which consumes a data file like https://oss.x-lab.info/open_digger/github/X-lab2017/open-digger/developer_network.json. With the data in that file, we can know each user's `activity` in that repo over a time span of 90 days.
Hypercrx is looking forward to data that contains every user's `activity` in every month for a repo. Maybe the data file is organized in this way:
and a possible JSON schema for a file might look like this:
{
"2020-08": [["frank-zsy",43.85],["xgdyp",22.36],["longyanz",13.09],["birdflyi",9.83]],
"2020-09": [["frank-zsy",23.85],["xgdyp",22.36],["longyanz",13.09],["birdflyi",5.83]],
...
}
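To illustrate how a consumer might read a file in this proposed shape, here is a minimal Python sketch that pivots the per-month lists into a per-user timeline. The function name `activity_by_user` is hypothetical, and the sample values are copied from the example above; the real file may contain more months and developers.

```python
import json
from collections import defaultdict

# Sample data in the proposed shape (values copied from the example above).
raw = """{
  "2020-08": [["frank-zsy", 43.85], ["xgdyp", 22.36], ["longyanz", 13.09], ["birdflyi", 9.83]],
  "2020-09": [["frank-zsy", 23.85], ["xgdyp", 22.36], ["longyanz", 13.09], ["birdflyi", 5.83]]
}"""


def activity_by_user(details):
    """Pivot {month: [[login, activity], ...]} into {login: {month: activity}}."""
    timeline = defaultdict(dict)
    for month, entries in details.items():
        for login, activity in entries:
            timeline[login][month] = activity
    return dict(timeline)


details = json.loads(raw)
per_user = activity_by_user(details)
print(per_user["birdflyi"])  # {'2020-08': 9.83, '2020-09': 5.83}
```

A per-user view like this is what a month-by-month trend chart would need as input.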
With these data files, Hypercrx can implement features like:
This data can be generated when the repo's `activity` is computed, right? It would not cost too much?
This issue has not been replied to for 24 hours, please pay attention to this issue: @gymgym1212 @xiaoya-yaya @xgdyp
@tyn1998 The network data is generated from the Neo4j database, where relationship data can be extracted more easily. I think the details can be retrieved from ClickHouse when the activity metric is generated.
Right now the repo activity metric function does not return any details about the developers, but we can add an option to the query options so that the query returns the details about developers. So the data can be generated while the activity metric is generated. It can certainly be done.
/self-assign
@tyn1998 Could you take a look at this file: https://oss.x-lab.info/open_digger/github/X-lab2017/open-digger/activity_details.json ? Does it fit your requirements?
That is exactly what we want, thank you!
@tyn1998 All the data has been uploaded, but the data size is quite large compared with the former data. For vscode, for example, the activity details contain 3MB of data.
I think we should limit how many developers are included in a single month; this will save a lot of storage.
Hi @frank-zsy, I have two ideas:
- use `[index, activity]` instead of `[user_name, activity]` to decrease the file size
{
"participants": ["frank-zsy", "xgdyp", "longyanz", "birdflyi", "xxx"],
"2020-08": [[0, 43.85],[1, 22.36],[2, 13.09],[3, 9.83]],
"2020-09": [[1, 23.85],[0, 22.36],[2, 13.09],[4, 5.83]],
...
}
- (similar to yours) if the number of developers (participants) is greater than a certain threshold, then remove those whose total activity (i.e. the sum over all months) is relatively small.
Both methods require that you first get the data of all months and then do the processing, rather than processing month by month.
@tyn1998 I tried the first solution, but it does not seem to work, since most developers in the vscode community may be active only once. So when I use an index to replace the login and add a login array, the size of the output grows from 3.1MB to 3.3MB.
And why do you think we should remove developers by total activity rather than in every month? How is that different from, say, just returning the top 100 for each month?
> So when I use an index to replace the login and add a login array, the size of the output grows from 3.1MB to 3.3MB.
That was not taken into consideration... Thank you for your experiment.
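The index-encoding idea, and the reason it can make the file larger, can be sketched roughly as follows. This is an illustration under assumptions, not the code used in the experiment: `to_indexed` and `size` are hypothetical helpers, the logins in the second data set are made up, and size is measured as the length of compact JSON.

```python
import json


def to_indexed(details):
    """Replace each login with its position in a shared "participants" list."""
    participants = []
    index_of = {}
    indexed = {}
    for month, entries in details.items():
        rows = []
        for login, activity in entries:
            if login not in index_of:
                index_of[login] = len(participants)
                participants.append(login)
            rows.append([index_of[login], activity])
        indexed[month] = rows
    return {"participants": participants, **indexed}


def size(obj):
    """Length in characters of the compact JSON encoding."""
    return len(json.dumps(obj, separators=(",", ":")))


# Case 1: the same developers appear every month, so each login is stored once.
recurring = {f"2020-{m:02d}": [["frank-zsy", 10.0], ["xgdyp", 5.0]] for m in range(1, 13)}

# Case 2: every developer appears exactly once (made-up logins), so each login
# is still stored once, and the index plus the extra array are pure overhead.
one_off = {f"2020-{m:02d}": [[f"drive-by-user-{m}", 1.0]] for m in range(1, 13)}

print(size(recurring), size(to_indexed(recurring)))  # indexed form is smaller
print(size(one_off), size(to_indexed(one_off)))      # indexed form is larger
```

The encoding only pays off when logins repeat across months, which matches the observation that a community of mostly one-time contributors gains nothing from it.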
> And why do you think we should remove developers by total activity rather than in every month? How is that different from, say, just returning the top 100 for each month?
Because I don't want to miss the growth path (i.e. how a developer becomes active or inactive in a repo) of any outstanding contributor (i.e. one whose total activity is large, whether or not he/she is active now) in a certain repo. However, this strategy might cause currently active developers to be excluded because their total score is not big enough to displace those who were active in the past.
Now I'm having second thoughts. What is your idea?
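The trade-off between the two strategies discussed here can be shown with a small Python sketch. Everything in it is hypothetical: the helper names `keep_by_total` and `keep_top_n_per_month`, the logins, and the activity values are made up for illustration only.

```python
from collections import defaultdict


def keep_by_total(details, max_developers):
    """Keep only the developers with the largest activity summed over all months."""
    totals = defaultdict(float)
    for entries in details.values():
        for login, activity in entries:
            totals[login] += activity
    kept = set(sorted(totals, key=totals.get, reverse=True)[:max_developers])
    return {
        month: [[login, activity] for login, activity in entries if login in kept]
        for month, entries in details.items()
    }


def keep_top_n_per_month(details, n):
    """Keep the n most active developers of each month independently."""
    return {
        month: sorted(entries, key=lambda entry: entry[1], reverse=True)[:n]
        for month, entries in details.items()
    }


# Made-up data: "veteran" was very active early, "newcomer" appears only recently.
details = {
    "2020-08": [["veteran", 40.0], ["regular", 12.0]],
    "2020-09": [["veteran", 35.0], ["regular", 11.0]],
    "2020-10": [["newcomer", 20.0], ["regular", 10.0]],
}

by_total = keep_by_total(details, max_developers=2)
top_per_month = keep_top_n_per_month(details, n=1)

print(by_total["2020-10"])       # [['regular', 10.0]]
print(top_per_month["2020-10"])  # [['newcomer', 20.0]]
```

Filtering by total keeps the full history of "veteran" but drops "newcomer", even though "newcomer" is the most active developer in the latest month. Filtering per month has the opposite bias.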
I think we can use a threshold to filter out low-activity developers for every month; the threshold could be very low, like 2. I think this should not affect the goal of finding the trend.
Agree +1
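The per-month threshold agreed on above could look roughly like the following Python sketch. The function name `filter_by_threshold` is hypothetical, the sample logins and values are made up, and two details are assumptions because the thread does not specify them: entries exactly at the threshold are kept, and a month whose entries are all filtered out stays in the output as an empty list.

```python
def filter_by_threshold(details, threshold=2.0):
    """Drop entries whose monthly activity is below `threshold`.

    Assumption: activity equal to the threshold is kept (>=).
    Assumption: months left without entries remain as empty lists.
    """
    return {
        month: [[login, activity] for login, activity in entries if activity >= threshold]
        for month, entries in details.items()
    }


# Made-up sample: two regular contributors and two low-activity visitors.
details = {
    "2020-08": [["frank-zsy", 43.85], ["xgdyp", 22.36], ["visitor-a", 1.0]],
    "2020-09": [["frank-zsy", 23.85], ["visitor-b", 1.41]],
}

filtered = filter_by_threshold(details, threshold=2.0)
print(filtered["2020-08"])  # [['frank-zsy', 43.85], ['xgdyp', 22.36]]
print(filtered["2020-09"])  # [['frank-zsy', 23.85]]
```

Unlike filtering by total activity, this works one month at a time, so it does not need the data of all months before processing.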