Link dedupe result with GRID back to original NIH and NSF grants
Opened this issue · 3 comments
titipata commented
@daniel-acuna put the database on Amazon S3. It can be downloaded to the dedupe_output
folder as follows:
aws s3 cp s3://grant-dataset/dedupe/ dedupe_output/ --recursive
We want to create a script that merges these datasets together.
titipata commented
@daniel-acuna, here is a snippet to do the linkage.
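import pandas as pd
# Load the application-to-affiliation mapping and the disambiguated institutions
affil_df = pd.read_csv('../dedupe_output/application_vs_affiliation.csv')
institution_disambiguated = pd.read_csv('../dedupe_output/institutions_disambiguated.csv')
# Use the row position in institutions_disambiguated.csv as affiliation_id
institution_disambiguated['affiliation_id'] = range(len(institution_disambiguated))
# Attach dedupe_id to each application (affil_df replaces the undefined affil_merge_df)
institution_dedupe = affil_df.merge(institution_disambiguated[['affiliation_id', 'dedupe_id']], on='affiliation_id')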
However, the number of unique affiliation_id values in application_vs_affiliation.csv does not match the number of rows in institutions_disambiguated.csv. I'm not sure why this happens.
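One way to narrow down a mismatch like this is to list which IDs differ, not just how many. The sketch below is only an illustration. It assumes, as the snippet above does, that affiliation_id is the row position in institutions_disambiguated.csv. The function name `compare_affiliation_ids`, the `application_id` column, and the toy data are made up for the example and are not taken from the real files.

```python
import pandas as pd


def compare_affiliation_ids(affil_df, institutions_df):
    """Compare the affiliation_id values used by applications with the
    row positions available in the institutions table.

    Assumption: affiliation_id is the 0-based row position of an
    institution, mirroring range(len(institution_disambiguated)).
    """
    used_ids = set(affil_df['affiliation_id'].dropna().astype(int))
    assigned_ids = set(range(len(institutions_df)))
    return {
        'n_unique_used': len(used_ids),
        'n_institution_rows': len(institutions_df),
        # IDs referenced by applications with no matching institution row
        'used_but_missing': sorted(used_ids - assigned_ids),
        # Institution rows that no application refers to
        'never_used': sorted(assigned_ids - used_ids),
    }


# Toy data standing in for the two CSV files (hypothetical values).
affil_df = pd.DataFrame({
    'application_id': ['a1', 'a2', 'a3', 'a4'],
    'affiliation_id': [0, 1, 1, 4],
})
institutions_df = pd.DataFrame({'dedupe_id': [10, 10, 11, 12]})

report = compare_affiliation_ids(affil_df, institutions_df)
print(report)
```

With the real data, the two DataFrames would come from `pd.read_csv` on the files in dedupe_output. A non-empty `used_but_missing` list would suggest the IDs are not simple row positions, while a non-empty `never_used` list would suggest some institutions have no linked application.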
daniel-acuna commented
OK, thanks. I'll take a look
daniel-acuna commented
Didn't we find that the shapes now match? Should we close this issue?