titipata/grant_database

Link dedupe result with GRID back to original NIH and NSF grants

Opened this issue · 3 comments

@daniel-acuna put the database on Amazon S3. It can be downloaded to the dedupe_output folder as follows:

aws s3 cp s3://grant-dataset/dedupe/ dedupe_output/ --recursive

We want to create a script that merges these datasets together.

@daniel-acuna, here is a snippet to do the linkage.

import pandas as pd

affil_df = pd.read_csv('../dedupe_output/application_vs_affiliation.csv')
institution_disambiguated = pd.read_csv('../dedupe_output/institutions_disambiguated.csv')
# Use the row position as affiliation_id so it can be joined with affil_df
institution_disambiguated['affiliation_id'] = range(len(institution_disambiguated))
# Attach dedupe_id to each application/affiliation pair
institution_dedupe = affil_df.merge(institution_disambiguated[['affiliation_id', 'dedupe_id']], on='affiliation_id')

However, the number of unique affiliation_id values in application_vs_affiliation.csv does not match the number of rows in institutions_disambiguated.csv. I'm not sure why this happens.
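
One way to narrow this down might be to list which affiliation_id values appear in only one of the two tables. Below is a minimal sketch using toy data frames in place of the real CSVs. The values and the application_id column are made up for illustration; only affiliation_id and dedupe_id come from the snippet above.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (invented values, for illustration only).
affil_df = pd.DataFrame({
    'application_id': [101, 102, 103, 104],  # hypothetical column
    'affiliation_id': [0, 1, 1, 5],
})
institution_disambiguated = pd.DataFrame({'dedupe_id': [10, 10, 11, 12]})
institution_disambiguated['affiliation_id'] = range(len(institution_disambiguated))

# Compare the two counts that are expected to match.
n_unique = affil_df['affiliation_id'].nunique()
n_rows = len(institution_disambiguated)
print('unique affiliation_id:', n_unique, '| institution rows:', n_rows)

# An outer merge with indicator=True labels each id as
# 'left_only', 'right_only', or 'both'.
check = affil_df[['affiliation_id']].drop_duplicates().merge(
    institution_disambiguated[['affiliation_id', 'dedupe_id']],
    on='affiliation_id', how='outer', indicator=True)

only_in_applications = check.loc[check['_merge'] == 'left_only', 'affiliation_id'].tolist()
only_in_institutions = check.loc[check['_merge'] == 'right_only', 'affiliation_id'].tolist()
print('ids only in application_vs_affiliation:', only_in_applications)
print('ids only in institutions_disambiguated:', only_in_institutions)
```

If both lists come back empty on the real files, the ids line up and the merge should be safe. Otherwise the listed ids show where the two files diverge.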

OK, thanks. I'll take a look

Didn't we find that the shapes now match? Should we close this issue?