X-lab2017/open-digger

[Data] Add more label data for the Database technical area, labeled from the DB-Engines Ranking in Jan, 2023.

Closed this issue · 8 comments

I want to add some labeled data to OpenDigger to help with our community analysis.
The data is based on DB-Engines in Jan, 2023. It is an incremental version of the labeled data submitted in #1093, which is based on DB-Engines Open Source DBMS in Dec, 2022.

Filter conditions: the DBMS is ranked in the DB-Engines Rankings table in Jan, 2023, has a repository on GitHub, and is not labeled yet.
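The three filter conditions can be sketched roughly as follows. This is only an illustration, not the actual generator script: the `RankedDbms` record, the `select_new_repos` helper, and the sample entries are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class RankedDbms:
    """Hypothetical record for one row of the ranking table."""
    name: str
    github_repo: Optional[str]  # "owner/repo", or None if not on GitHub


def select_new_repos(ranked: Iterable[RankedDbms],
                     already_labeled: Set[str]) -> List[str]:
    """Keep repos that are ranked, hosted on GitHub, and not labeled yet."""
    labeled = {repo.lower() for repo in already_labeled}
    seen: Set[str] = set()
    selected: List[str] = []
    for entry in ranked:
        repo = entry.github_repo
        if repo is None:
            continue  # no repository on GitHub
        key = repo.lower()  # GitHub repo names are case-insensitive
        if key in labeled or key in seen:
            continue  # labeled in an earlier submission, or a duplicate row
        seen.add(key)
        selected.append(repo)
    return selected


# Made-up sample rows; only skytable/skytable comes from this issue.
sample = [
    RankedDbms("Skytable", "skytable/skytable"),
    RankedDbms("Closed DB", None),
    RankedDbms("Old DB", "example-org/old-db"),
]
new_repos = select_new_repos(sample, {"example-org/old-db"})
```

Comparing lowercased names avoids re-submitting a repo whose casing differs between sources.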

Notes: The DBMS labeled dataset will continue to be updated at birdflyi/db_engines_ranking_table_crawling. The list below is auto-generated by wiget_autogen_issue_body_for_opendigger_submiting_labeled_data_issue.

Label: Key-value

Type: Tech-1

Repos:

  • skytable/skytable

/parse-github-id

Get repo and org/user ids done.

"

Label: Key-value


Type: Tech-1


Repos:

- 276042304 # repo:skytable/skytable
"

/self-assign

xgdyp commented

Hi @birdflyi, do we need to update every time DB-Engines is updated?

DB-Engines is updated monthly, but it seems that few new databases are added to the ranking table each month. Maybe a semiannual update (month: [1, 7]) or a quarterly update (month: [1, 4, 7, 10]) is enough. : )
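As a small sketch of the two proposed schedules (the constant and function names below are hypothetical, not part of any existing script), the check for whether a given date falls in an update month could look like this:

```python
from datetime import date
from typing import Iterable

# Month lists taken from the proposal above.
SEMIANNUAL_MONTHS = (1, 7)
QUARTERLY_MONTHS = (1, 4, 7, 10)


def is_update_month(day: date,
                    months: Iterable[int] = QUARTERLY_MONTHS) -> bool:
    """Return True if the labeled data should be refreshed in this month."""
    return day.month in tuple(months)
```

For example, `is_update_month(date(2023, 4, 1))` is true under the quarterly schedule but false when `SEMIANNUAL_MONTHS` is passed.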

Another solution: I noticed that almost all of the databases on the DB-Engines ranking list can also be found on https://dbdb.io. I can check all the databases listed there and filter out those that have GitHub repos, even though they may not appear in the DB-Engines tables. However, it will take a lot of time to finish the work.

The https://dbdb.io project under CMU is also an excellent reference~

Yes, actually they've classified the Project Type for databases in Leaderboards.

The extra work is crawling the information for each database, such as Data Model, Source Code, and so on. If the Source Code link of an Open Source database does not point to a GitHub site, we may need to search GitHub for the corresponding repo to check whether it exists, which is the most time-consuming part. (Current number of Open Source databases in dbdb.io: 541)
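The first part of that check, deciding whether a Source Code link already points to a GitHub repo, can be sketched as below. This is a minimal illustration using only the standard library; `github_repo_from_url` is a hypothetical helper, and it assumes links include a scheme such as `https://`. Links that return `None` would still need the manual search described above.

```python
from typing import Optional
from urllib.parse import urlparse


def github_repo_from_url(url: str) -> Optional[str]:
    """Return "owner/repo" if the URL points to a GitHub repo, else None."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host not in ("github.com", "www.github.com"):
        return None  # not a GitHub link: needs a manual search
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        return None  # only an org/user page, no specific repo
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[:-4]  # strip the clone-URL suffix
    return f"{owner}/{repo}"
```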

All of the Open Source DBMSs that had a GitHub repo link as of 2023-03-31 are available here: dbfeatfusion_records_202303_automerged_manulabeled.csv.
The data is fused from dbdb.io and DB-Engines, and has been partially labeled by hand. Feel free to open issues in db_feature_data_fusion.