Link dedupe result with GRID back to original NIH and NSF grant

Question

Link dedupe result with GRID back to original NIH and NSF grant

Opened this issue 8 years ago · 3 comments

@daniel-acuna put the database on Amazon S3. It can be downloaded to the dedupe_output folder as follows:

aws s3 cp s3://grant-dataset/dedupe/ dedupe_output/  --recursive

We want to create a script that merges these datasets together.

Answer 1 · 2016-04-23T00:23:01.000Z

@daniel-acuna, here is a snippet to do the linkage.

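import pandas as pd
affil_df = pd.read_csv('../dedupe_output/application_vs_affiliation.csv')
institution_disambiguated = pd.read_csv('../dedupe_output/institutions_disambiguated.csv')
# Assign affiliation_id from row order in the disambiguated institutions file.
institution_disambiguated['affiliation_id'] = range(len(institution_disambiguated))
# Attach dedupe_id to each application row; merges on the shared column(s), expected to be affiliation_id.
institution_dedupe = affil_df.merge(institution_disambiguated[['affiliation_id', 'dedupe_id']])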

However, the number of unique affiliation_id values in application_vs_affiliation.csv does not match the number of rows in institutions_disambiguated.csv. I'm not sure why this happens.
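One way to narrow down the mismatch is to list which affiliation_id values appear in only one of the two tables. The following is a minimal sketch, not a tested part of the pipeline: the helper name compare_affiliation_ids is hypothetical, and it assumes both DataFrames already carry an affiliation_id column as in the snippet above.

```python
import pandas as pd


def compare_affiliation_ids(affil_df, institutions, key="affiliation_id"):
    """Report how the key values of the two tables line up (hypothetical helper)."""
    # Reduce each table to its distinct key values.
    left_ids = affil_df[[key]].drop_duplicates()
    right_ids = institutions[[key]].drop_duplicates()
    # indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
    merged = left_ids.merge(right_ids, on=key, how="outer", indicator=True)
    return {
        "n_unique_in_applications": len(left_ids),
        "n_rows_in_institutions": len(institutions),
        "only_in_applications": sorted(merged.loc[merged["_merge"] == "left_only", key]),
        "only_in_institutions": sorted(merged.loc[merged["_merge"] == "right_only", key]),
    }
```

Calling compare_affiliation_ids(affil_df, institution_disambiguated) after the range(...) assignment should show whether IDs are missing on one side or whether the row-order assumption behind affiliation_id does not hold.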

Answer 2 · 2016-04-23T17:53:21.000Z

OK, thanks. I'll take a look

Answer 3 · 2016-04-28T15:48:31.000Z

Didn't we find that the shapes now match? Should we close this issue?