SWHGD

A repository for a research project exploring the Software Heritage Dataset.


We focus only on git projects.

Attention: this guidance is distilled from our process; in the authors' opinion, it is better than what we actually did in our thesis.

  1. Import revisions into Neo4j:
  • Export revisions from PostgreSQL to a file:

COPY (
  SELECT encode(id, 'hex') AS id, author, date FROM revision
) TO '/tmp/revision.csv' delimiter ',' CSV HEADER;

  • Import revisions from the file into Neo4j (remember to place the file in the import directory set in Neo4j's config file):

USING PERIODIC COMMIT X
LOAD CSV WITH HEADERS FROM 'file:///revision.csv' AS row
WITH row WHERE row.id IS NOT NULL
// MERGE on the key only, then set the remaining properties; the format
// pattern must match the exported date format (fractional seconds are ignored)
MERGE (r:Revision {id: row.id})
ON CREATE SET r.author = row.author,
              r.date = apoc.date.parse(row.date, 's', 'yyyy-MM-dd HH:mm:ss', '+07');

X can be 100,000 or 1,000,000 depending on your RAM; we used 100,000 with 64 GB of RAM.
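Before running the load above, it pays to create a uniqueness constraint on :Revision(id): without the index that backs it, every MERGE scans the existing nodes and the import will not finish at this scale. A minimal sketch using the official neo4j Python driver (the bolt URI and credentials are assumptions; the constraint syntax is the pre-Neo4j-5 form matching the USING PERIODIC COMMIT used in this guide):

from neo4j import GraphDatabase

# Assumed connection settings; adjust to your installation.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # The constraint is backed by an index, so MERGE/MATCH on id stays fast.
    session.run("CREATE CONSTRAINT ON (r:Revision) ASSERT r.id IS UNIQUE")
driver.close()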

  2. Import relationships between revisions into Neo4j:

Since the revision_history table holds more than 1 billion relationships (it is loaded from revision_history.csv.gz, available at https://annex.softwareheritage.org/public/dataset/graph/latest/sql/), we split the export into chunks of 100,000,000 rows each, giving 14 files.

COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 0) TO '/tmp/revision_history_1.csv' delimiter ',' CSV HEADER;

COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 100000000) TO '/tmp/revision_history_2.csv' delimiter ',' CSV HEADER;

COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 200000000) TO '/tmp/revision_history_3.csv' delimiter ',' CSV HEADER;

COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 300000000) TO '/tmp/revision_history_4.csv' delimiter ',' CSV HEADER;

COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 400000000) TO '/tmp/revision_history_5.csv' delimiter ',' CSV HEADER;

COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 500000000) TO '/tmp/revision_history_6.csv' delimiter ',' CSV HEADER;

COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 600000000) TO '/tmp/revision_history_7.csv' delimiter ',' CSV HEADER;

COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 700000000) TO '/tmp/revision_history_8.csv' delimiter ',' CSV HEADER;

COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 800000000) TO '/tmp/revision_history_9.csv' delimiter ',' CSV HEADER;

COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 900000000) TO '/tmp/revision_history_10.csv' delimiter ',' CSV HEADER;

COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 1000000000) TO '/tmp/revision_history_11.csv' delimiter ',' CSV HEADER;

COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 1100000000) TO '/tmp/revision_history_12.csv' delimiter ',' CSV HEADER;

COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 1200000000) TO '/tmp/revision_history_13.csv' delimiter ',' CSV HEADER;

COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 2422013 OFFSET 1300000000) TO '/tmp/revision_history_14.csv' delimiter ',' CSV HEADER;
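The 14 statements above can also be generated instead of maintained by hand. A small Python sketch (pure string generation; since a LIMIT past the end of the table simply returns the remaining rows, the last chunk needs no special-cased count):

CHUNK = 100_000_000
N_FILES = 14

TEMPLATE = (
    "COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id "
    "FROM revision_history ORDER BY id, parent_id "
    "LIMIT {limit} OFFSET {offset}) "
    "TO '/tmp/revision_history_{n}.csv' delimiter ',' CSV HEADER;"
)

for n in range(1, N_FILES + 1):
    print(TEMPLATE.format(limit=CHUNK, offset=(n - 1) * CHUNK, n=n))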

Then we import the relationships into Neo4j, running the following once for each x from 1 to 14 (a runner sketch follows):

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///revision_history_[x].csv" AS row
MATCH (r:Revision {id: row.id})
MATCH (p:Revision {id: row.parent_id})
MERGE (r)<-[:PARENT]-(p);
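A sketch that runs all 14 imports back to back with the neo4j Python driver (the bolt URI and credentials are assumptions):

from neo4j import GraphDatabase

# Assumed connection settings; adjust to your installation.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///revision_history_%d.csv" AS row
MATCH (r:Revision {id: row.id})
MATCH (p:Revision {id: row.parent_id})
MERGE (r)<-[:PARENT]-(p)
"""

with driver.session() as session:
    for x in range(1, 15):
        # The file URL cannot be a query parameter, hence string formatting.
        # session.run uses an auto-commit transaction, which
        # USING PERIODIC COMMIT requires.
        session.run(QUERY % x)
        print("imported revision_history_%d.csv" % x)

driver.close()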

Notice: this step took us about a month, so any improvement is welcome.

  3. Choose a subset of repositories from the dataset:

In the thesis we selected repositories based on the visit date, which was a mistake; we suggest that future researchers use the branch date instead to find active repositories. Moreover, we only consider the master or HEAD branch, i.e., the branch the pointer currently points to: https://stackoverflow.com/questions/2304087/what-is-head-in-git.

  • Find the latest date among all master/HEAD branches:

COPY (
  SELECT MAX(r.date) AS max_date
  FROM origin AS o
  JOIN origin_visit AS ov ON o.id = ov.origin
  JOIN snapshot AS ss ON ov.snapshot_id = ss.object_id
  JOIN snapshot_branches AS sbs ON ss.object_id = sbs.snapshot_id
  JOIN snapshot_branch AS sb ON sbs.branch_id = sb.object_id
  JOIN revision AS r ON sb.target = r.id
  WHERE ov.origin IS NOT NULL AND ss.object_id IS NOT NULL AND sbs.branch_id IS NOT NULL
    AND r.id IS NOT NULL AND o.type = 'git'
    AND (encode(sb.name::bytea, 'escape') ~ '.*HEAD.*' OR encode(sb.name::bytea, 'escape') ~* '.*master.*')
) TO '/tmp/max_date.csv' delimiter ',' CSV HEADER;  -- output path is an example

After finding the latest date among the branches, we find repositories whose master/HEAD branch has a revision within the three months before that latest date:

COPY (
  -- hex-encode the revision id so it matches the Revision ids in Neo4j
  SELECT ov.origin AS origin_id, encode(r.id, 'hex') AS revision_id
  FROM origin AS o
  JOIN origin_visit AS ov ON o.id = ov.origin
  JOIN snapshot AS ss ON ov.snapshot_id = ss.object_id
  JOIN snapshot_branches AS sbs ON ss.object_id = sbs.snapshot_id
  JOIN snapshot_branch AS sb ON sbs.branch_id = sb.object_id
  JOIN revision AS r ON sb.target = r.id
  WHERE ov.origin IS NOT NULL AND ss.object_id IS NOT NULL AND sbs.branch_id IS NOT NULL
    AND r.id IS NOT NULL AND o.type = 'git'
    AND (encode(sb.name::bytea, 'escape') ~ '.*HEAD.*' OR encode(sb.name::bytea, 'escape') ~* '.*master.*')
    -- an interval comparison stays correct across year boundaries
    AND r.date >= '[MAX_DATE]'::date - INTERVAL '3 months'
) TO '/tmp/latest_repository.csv' delimiter ',' CSV HEADER;

Replace [MAX_DATE] with the latest date found above, in 'yyyy-mm-dd' format.
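A small Python sketch that reads the exported max date (assuming it was written to /tmp/max_date.csv as in the example above) and prints the filled-in filter clause:

import csv

# Read the max_date exported by the previous COPY and print the filter to
# splice into the query in place of [MAX_DATE].
with open('/tmp/max_date.csv') as f:
    max_date = next(csv.DictReader(f))['max_date']

print("AND r.date >= '%s'::date - INTERVAL '3 months'" % max_date[:10])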

Then we load the file back into a PostgreSQL table called latest_repository:

COPY latest_repository(origin_id, revision_id)
FROM '/tmp/latest_repository.csv' DELIMITER ',' CSV HEADER;
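COPY ... FROM requires the target table to exist. A minimal sketch creating this staging table and the oldest_repository table used below, via psycopg2 (connection settings and column types are assumptions; adjust to your schema):

import psycopg2

# Assumed connection settings; adjust to your installation.
conn = psycopg2.connect(host='localhost', dbname='softwareheritage', user='postgres')
with conn, conn.cursor() as cur:
    # Staging table for the 'active' repositories and their branch revisions.
    cur.execute("CREATE TABLE IF NOT EXISTS latest_repository ("
                "origin_id bigint, revision_id text)")
    # Staging table for the oldest revision date per active repository.
    cur.execute("CREATE TABLE IF NOT EXISTS oldest_repository ("
                "origin_id bigint, min_date timestamptz)")
conn.close()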

Then, we find the date of the oldest revision of those 'active' repositories:

COPY (
  SELECT lr.origin_id AS origin_id, MIN(r.date) AS min_date
  FROM latest_repository AS lr
  JOIN origin_visit AS ov ON lr.origin_id = ov.origin
  -- revision_id is stored hex-encoded, so decode it to join on the bytea id
  JOIN revision AS r ON r.id = decode(lr.revision_id, 'hex')
  WHERE ov.origin IS NOT NULL AND r.id IS NOT NULL
  GROUP BY lr.origin_id
) TO '/tmp/oldest_repository_date.csv' delimiter ',' CSV HEADER;

Import the result into the table oldest_repository in PostgreSQL:

COPY oldest_repository
FROM '/tmp/oldest_repository_date.csv' delimiter ',' CSV HEADER;

Finally, we find the id of the revision that the oldest version of each active repository points to:

COPY (
  -- 'or' is a reserved word in SQL, so the alias is odr; the revision is
  -- matched via the branch target, with the date narrowing it to the oldest
  SELECT DISTINCT ov.origin AS origin_id, encode(r.id, 'hex') AS revision_id
  FROM oldest_repository AS odr
  JOIN origin_visit AS ov ON odr.origin_id = ov.origin
  JOIN snapshot AS ss ON ov.snapshot_id = ss.object_id
  JOIN snapshot_branches AS sbs ON ss.object_id = sbs.snapshot_id
  JOIN snapshot_branch AS sb ON sbs.branch_id = sb.object_id
  JOIN revision AS r ON sb.target = r.id AND r.date = odr.min_date
  WHERE ov.origin IS NOT NULL AND ss.object_id IS NOT NULL AND sbs.branch_id IS NOT NULL AND r.id IS NOT NULL
) TO '/tmp/oldest_repository.csv' delimiter ',' CSV HEADER;

  4. Revision traversal:

We create a relationship called 'BRANCH' from each repository to the particular revision its branch points to, first for latest_repository:

USING PERIODIC COMMIT 100000
LOAD CSV WITH HEADERS FROM "file:///latest_repository.csv" AS row
MATCH (r:Revision {id: row.revision_id})
// MERGE the Repository on its key alone so an existing node is reused
MERGE (s:Repository {id: row.origin_id})
MERGE (r)<-[:BRANCH]-(s);
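For the same reason as in step 1, a uniqueness constraint on :Repository(id) should exist before these loads, so the MERGE on Repository reuses nodes instead of scanning. A minimal sketch, with the same assumed connection settings as before:

from neo4j import GraphDatabase

# Assumed connection settings; adjust to your installation.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run("CREATE CONSTRAINT ON (s:Repository) ASSERT s.id IS UNIQUE")
driver.close()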

With oldest_repository, we set one more property, called "oldest":

USING PERIODIC COMMIT 100000
LOAD CSV WITH HEADERS FROM "file:///oldest_repository.csv" AS row
WITH row.origin_id AS origin_id, row.revision_id AS revision_id
WHERE revision_id IS NOT NULL
MATCH (r:Revision {id: revision_id})
// MERGE on the key only and SET the flag; merging on {id, oldest} would
// duplicate repositories already created above without the property
MERGE (s:Repository {id: origin_id})
SET s.oldest = "true"
MERGE (r)<-[:BRANCH]-(s);

  5. Export releases to a file:

COPY (
  -- release.target is the revision a release points to; the release's own id
  -- never matches a Revision node in Neo4j
  SELECT DISTINCT encode(target, 'hex') AS revision_id, date
  FROM release
) TO '/tmp/release.csv' delimiter ',' CSV HEADER;

  6. Import releases into Neo4j:

USING PERIODIC COMMIT 100000
LOAD CSV WITH HEADERS FROM "file:///release.csv" AS row
MATCH (r:Revision {id: row.revision_id})
// the format pattern must match the exported date; fractional seconds are ignored
SET r.release_date = apoc.date.parse(row.date, 's', 'yyyy-MM-dd HH:mm:ss', '+07');

  7. Extract each git repository's revisions and releases:

CALL apoc.export.csv.query(
  "MATCH (s:Repository)-[:BRANCH]->(r:Revision)
   WITH DISTINCT r, s.id AS origin_id
   CALL apoc.path.subgraphNodes(r, {relationshipFilter: 'PARENT>'}) YIELD node
   RETURN origin_id, node.id AS id, node.author AS author, node.date AS date, node.release_date AS release_date",
  "/mnt/17volume/data/snapshot_revision_git.csv", {});

Conclusion: we end up with two files, snapshot_revision_git.csv and snapshot_release_git.csv, from which we extract the metrics.

  • Extract metrics: commit frequency, total authors, total commits, and daily/weekly/monthly contributor counts

Since snapshot_revision_git.csv is really big, we split it into 8 files, one per CPU core, to speed up metric extraction.

The code: extract_metrics.py (remember to change the origin_id)

Notice: when splitting, the last origin_id of file 1 and the first origin_id of file 2 can belong to the same origin, so the metrics cannot be extracted from the split files directly. In the thesis we combined the last origin of file 1 with the first origin of file 2 and handled that origin separately. A boundary-safe splitting sketch follows.
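A sketch that only starts a new chunk when the origin_id changes, so no origin ever straddles two files (it assumes the export keeps each origin's rows contiguous and that origin_id is the first column; file names and chunk size are illustrative):

import csv

CHUNK_ROWS = 5_000_000  # rough target rows per chunk; tune to your core count

with open('snapshot_revision_git.csv', newline='') as src:
    reader = csv.reader(src)
    header = next(reader)
    chunk, rows, prev_origin = 1, 0, None
    out = open('snapshot_revision_git_1.csv', 'w', newline='')
    writer = csv.writer(out)
    writer.writerow(header)
    for row in reader:
        origin_id = row[0]  # assumes origin_id is the first column
        # Only cut over to a new file on an origin boundary.
        if rows >= CHUNK_ROWS and origin_id != prev_origin:
            out.close()
            chunk += 1
            rows = 0
            out = open('snapshot_revision_git_%d.csv' % chunk, 'w', newline='')
            writer = csv.writer(out)
            writer.writerow(header)
        writer.writerow(row)
        rows += 1
        prev_origin = origin_id
    out.close()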

  • Extract metrics: release cycle time

The code: cycle_time.py

  • Extract fork information within one year of the latest date: fork_detection.cpp in the folder fork-detection (output columns: origin_id, fork_id, date)
  • Extract fork metrics: extract_fork_metric.py in the folder fork-detection
  • For duplication information, do the same after adapting the fork code: fork_detection.cpp and extract_fork_metric.py
  • Extract all metrics to one file: extract_metrics_to_one_file.py

Please contact us if you have any questions or concerns.