We focus only on git projects.
Attention: this guidance is derived from our process. In the authors' opinion, it is better than what we actually did in our thesis.
- Import revision to Neo4j:
- Export revision from PostgreSQL to file:
COPY (
SELECT encode(id,'hex') as id, author, date FROM revision
)
TO '/tmp/revision.csv' delimiter ',' CSV HEADER;
- Import revision from file to Neo4j (remember to put the file in the import directory set in Neo4j's config file):
USING PERIODIC COMMIT X
LOAD CSV WITH HEADERS FROM 'file:///revision.csv' AS row
WITH row WHERE row.id IS NOT NULL
MERGE (r:Revision {id: row.id, author: row.author, date: apoc.date.parse(row.date, 's', 'yyyy-MM-dd HH:mm:ss.SSSSSS', '+07')});
X can be 100,000 or 1,000,000 depending on your available RAM. We used 100,000 with 64 GB of RAM.
- Import relationship between revisions to Neo4j:
The file revision_history.csv.gz (download from this link: https://annex.softwareheritage.org/public/dataset/graph/latest/sql/) contains more than 1 billion relationships, so we split it into chunks of 100,000,000 relationships each, giving 14 files.
COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 0) TO '/tmp/revision_history_1.csv' delimiter ',' CSV HEADER;
COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 100000000) TO '/tmp/revision_history_2.csv' delimiter ',' CSV HEADER;
COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 200000000) TO '/tmp/revision_history_3.csv' delimiter ',' CSV HEADER;
COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 300000000) TO '/tmp/revision_history_4.csv' delimiter ',' CSV HEADER;
COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 400000000) TO '/tmp/revision_history_5.csv' delimiter ',' CSV HEADER;
COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 500000000) TO '/tmp/revision_history_6.csv' delimiter ',' CSV HEADER;
COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 600000000) TO '/tmp/revision_history_7.csv' delimiter ',' CSV HEADER;
COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 700000000) TO '/tmp/revision_history_8.csv' delimiter ',' CSV HEADER;
COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 800000000) TO '/tmp/revision_history_9.csv' delimiter ',' CSV HEADER;
COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 900000000) TO '/tmp/revision_history_10.csv' delimiter ',' CSV HEADER;
COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 1000000000) TO '/tmp/revision_history_11.csv' delimiter ',' CSV HEADER;
COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 1100000000) TO '/tmp/revision_history_12.csv' delimiter ',' CSV HEADER;
COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 100000000 OFFSET 1200000000) TO '/tmp/revision_history_13.csv' delimiter ',' CSV HEADER;
COPY (SELECT encode(id,'hex') as id, encode(parent_id, 'hex') as parent_id FROM revision_history ORDER BY id, parent_id LIMIT 2422013 OFFSET 1300000000) TO '/tmp/revision_history_14.csv' delimiter ',' CSV HEADER;
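Rather than writing the 14 COPY statements by hand, they can be generated. A minimal Python sketch, assuming the total row count 1,302,422,013 implied by the final LIMIT/OFFSET pair above:

```python
# Sketch: generate the chunked COPY statements above.
# TOTAL is an assumption derived from 13 * 100,000,000 + 2,422,013.
CHUNK = 100_000_000
TOTAL = 1_302_422_013

def copy_statements(total=TOTAL, chunk=CHUNK):
    stmts = []
    for i, offset in enumerate(range(0, total, chunk), start=1):
        limit = min(chunk, total - offset)  # last chunk is smaller
        stmts.append(
            "COPY (SELECT encode(id,'hex') as id, "
            "encode(parent_id, 'hex') as parent_id "
            "FROM revision_history ORDER BY id, parent_id "
            f"LIMIT {limit} OFFSET {offset}) "
            f"TO '/tmp/revision_history_{i}.csv' delimiter ',' CSV HEADER;"
        )
    return stmts

statements = copy_statements()
print(len(statements))  # 14
```

Each statement can then be pasted into psql or fed to it via a driver.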
Then, we import relationship into Neo4j:
Run the following for each x, 1 <= x <= 14:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///revision_history_[x].csv" AS row
MATCH (r:Revision{id: row.id})
MATCH (p:Revision{id: row.parent_id})
MERGE (r)<-[:PARENT]-(p);
Notice: this step took us one month, so any improvement is welcome.
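The 14 import statements can likewise be generated by substituting x into a template. A sketch, assuming the batch size 100,000 used for the revision import earlier (each generated query would then be run via cypher-shell or a driver, not shown):

```python
# Sketch: emit the 14 Cypher import statements for revision_history_[x].csv.
TEMPLATE = """USING PERIODIC COMMIT 100000
LOAD CSV WITH HEADERS FROM "file:///revision_history_{x}.csv" AS row
MATCH (r:Revision{{id: row.id}})
MATCH (p:Revision{{id: row.parent_id}})
MERGE (r)<-[:PARENT]-(p);"""

queries = [TEMPLATE.format(x=x) for x in range(1, 15)]
print(len(queries))  # 14
```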
- Choose subset of repositories from the dataset:
In the thesis, we chose repositories based on the visit date, which was a mistake. We suggest that future researchers instead use the date of the branch to find active repositories. Moreover, we only consider the master or HEAD branch, i.e. the branch the pointer currently points to: https://stackoverflow.com/questions/2304087/what-is-head-in-git.
- Find the latest date from all of MASTER/HEAD branches
COPY (
SELECT MAX(r.date) as max_date
FROM origin as o
JOIN origin_visit as ov ON o.id = ov.origin
JOIN snapshot as ss ON ov.snapshot_id = ss.object_id
JOIN snapshot_branches as sbs ON ss.object_id = sbs.snapshot_id
JOIN snapshot_branch as sb ON sbs.branch_id = sb.object_id
JOIN revision as r ON sb.target = r.id
WHERE ov.origin IS NOT NULL AND ss.object_id IS NOT NULL AND sbs.branch_id IS NOT NULL AND r.id IS NOT NULL AND o.type = 'git' AND (encode(sb.name::bytea, 'escape') ~ '.*HEAD.*' OR encode(sb.name::bytea, 'escape') ~* '.*master.*')
)
TO '/tmp/max_date.csv' delimiter ',' CSV HEADER;
After finding the latest date among branches, we find repositories whose branches were updated within three months of that latest date:
COPY(
SELECT ov.origin as origin_id, r.id as revision_id
FROM origin as o
JOIN origin_visit as ov ON o.id = ov.origin
JOIN snapshot as ss ON ov.snapshot_id = ss.object_id
JOIN snapshot_branches as sbs ON ss.object_id = sbs.snapshot_id
JOIN snapshot_branch as sb ON sbs.branch_id = sb.object_id
JOIN revision as r ON sb.target = r.id
WHERE ov.origin IS NOT NULL AND ss.object_id IS NOT NULL AND sbs.branch_id IS NOT NULL AND r.id IS NOT NULL AND o.type = 'git' AND (encode(sb.name::bytea, 'escape') ~ '.*HEAD.*' OR encode(sb.name::bytea, 'escape') ~* '.*master.*') AND
r.date >= '[MAX_DATE]'::date - INTERVAL '3 months'
)
to '/tmp/latest_repository.csv' delimiter ',' CSV HEADER;
MAX_DATE: substitute the latest date found above, in 'yyyy-mm-dd' format. (Compare full dates rather than month numbers, so the three-month window also works across year boundaries.)
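The three-month cutoff can also be computed outside SQL before substituting it into the query. A minimal sketch, assuming plain calendar-month subtraction with the day clamped for short months (the function name is illustrative):

```python
from datetime import date

def three_months_before(max_date: date) -> date:
    """Return the date three calendar months before max_date."""
    month = max_date.month - 3
    year = max_date.year
    if month < 1:          # wrapped past January: borrow from the year
        month += 12
        year -= 1
    day = min(max_date.day, 28)  # clamp conservatively for short months
    return date(year, month, day)

print(three_months_before(date(2018, 2, 13)))  # 2017-11-13
```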
Then we import the file into a table called latest_repository (create the table first if it does not exist):
COPY
latest_repository(origin_id, revision_id)
FROM '/tmp/latest_repository.csv' DELIMITER ',' CSV HEADER;
Then, we find the oldest revision date of those 'active' repositories:
COPY(
SELECT lr.origin_id as origin_id, min(r.date) as min_date
FROM latest_repository as lr
JOIN origin_visit as ov ON lr.origin_id = ov.origin
JOIN revision as r ON lr.revision_id = r.id
WHERE ov.origin IS NOT NULL AND r.id IS NOT NULL
GROUP BY lr.origin_id
)
to '/tmp/oldest_repository_date.csv' delimiter ',' CSV HEADER;
Import the result into a table called oldest_repository in PostgreSQL:
COPY oldest_repository
FROM '/tmp/oldest_repository_date.csv' delimiter ',' CSV HEADER;
Finally, we find the id of the revision that the oldest version of each active repository points to:
COPY(
SELECT DISTINCT ov.origin as origin_id, r.id as revision_id
FROM oldest_repository as odr
JOIN origin_visit as ov ON odr.origin_id = ov.origin
JOIN snapshot as ss ON ov.snapshot_id = ss.object_id
JOIN snapshot_branches as sbs ON ss.object_id = sbs.snapshot_id
JOIN snapshot_branch as sb ON sbs.branch_id = sb.object_id
JOIN revision as r ON sb.target = r.id AND r.date = odr.min_date
WHERE ov.origin IS NOT NULL AND ss.object_id IS NOT NULL AND sbs.branch_id IS NOT NULL AND r.id IS NOT NULL
)
to '/tmp/oldest_repository.csv' delimiter ',' CSV HEADER;
- Revision traversal
We create a relationship called 'BRANCH' from each repository to the revision its branch points to, starting with the latest repositories:
USING PERIODIC COMMIT 100000
LOAD CSV WITH HEADERS FROM "file:///latest_repository.csv" AS row
MATCH (r:Revision{id: row.revision_id})
MERGE (r)<-[:BRANCH]-(s:Repository{id: row.origin_id});
With oldest_repository, the Repository node gets one more property, "oldest":
USING PERIODIC COMMIT 100000
LOAD CSV WITH HEADERS FROM "file:///oldest_repository.csv" AS row
WITH row.origin_id as origin_id, row.revision_id as revision_id
WHERE revision_id is not null
MATCH(r:Revision{id:revision_id})
MERGE (r)<-[:BRANCH]-(s:Repository{id:origin_id, oldest: "true"});
- Export release to file:
COPY(
SELECT DISTINCT encode(id, 'hex') as release_id, date as date
FROM release)
to '/tmp/release.csv' delimiter ',' CSV HEADER;
- Import release to Neo4j:
USING PERIODIC COMMIT 100000
LOAD CSV WITH HEADERS FROM "file:///release.csv" AS row
MATCH (r:Revision{id: row.release_id})
SET r.release_date = apoc.date.parse(row.date, 's', 'yyyy-MM-dd HH:mm:ss.SSSSSS', '+07');
- Extract git repository's revisions and releases:
CALL apoc.export.csv.query(
"MATCH (s:Repository)-[:BRANCH]->(r:Revision)
WITH DISTINCT r, s.id as origin_id
CALL apoc.path.subgraphNodes(r, {relationshipFilter: 'PARENT>'})
YIELD node
RETURN origin_id, node.id as id, node.author as author, node.date as date, node.release_date as release_date", "/mnt/17volume/data/snapshot_revision_git.csv", {});
Conclusion: we now have two files, snapshot_revision_git.csv and snapshot_release_git.csv, from which to extract metrics.
- Extract metrics: commit frequency, total author, total commit, daily/weekly/monthly contributor count
Since the file snapshot_revision_git.csv is very large, we split it into 8 files, one per CPU core, to parallelize metric extraction.
The code: extract_metrics.py (remember to change the origin_id)
Notice: when splitting the big file, say into file 1 and file 2, the last origin_id of file 1 and the first origin_id of file 2 may belong to the same origin. Therefore we cannot extract these metrics directly with the code above. In the thesis, we concatenated the last origin of file 1 with the first origin of file 2 and handled that origin separately.
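The boundary problem can be avoided entirely by splitting only at origin_id group boundaries. A minimal sketch, assuming rows are already sorted by origin_id (the function name and sample data are illustrative, not from the thesis code):

```python
# Sketch: split a sorted list of rows into roughly n_parts chunks
# without ever cutting an origin_id group across two chunks.
def split_on_origin(rows, n_parts):
    """rows: list of tuples whose first field is origin_id, sorted by it."""
    target = max(1, len(rows) // n_parts)  # rough rows-per-part target
    parts, current = [], []
    for row in rows:
        # Close the current part only when it is full AND the origin changes.
        if len(current) >= target and current[-1][0] != row[0]:
            parts.append(current)
            current = []
        current.append(row)
    if current:
        parts.append(current)
    return parts

rows = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("c", 5), ("c", 6)]
parts = split_on_origin(rows, 3)
# Every origin now lies entirely within a single part.
```

Each part can then be written to its own CSV and processed by extract_metrics.py independently, with no special-casing of file boundaries.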
- Extract metrics: cycle time release
The code: cycle_time.py
- Extract fork information within one year of the latest date: fork_detection.cpp in folder fork-detection (output columns: origin_id, fork_id, date)
- Extract fork metrics: extract_fork_metric.py in folder fork-detection
- For duplication information, do the same by adapting the fork code: fork_detection.cpp and extract_fork_metric.py
- Extract all metrics to one file: extract_metrics_to_one_file.py