ecomplexity output does not conform with atlas dataverse results using R reticulate
hamgamb opened this issue · 5 comments
Apologies in advance for the not-so-reproducible example. I couldn't find a way around the name/email requirements of the dataverse. I am using reticulate in R to run ecomplexity.
Data published on the Harvard Economic Complexity Dataverse has pre-calculated complexity indicators. The `country_hsproduct4digit_year` data from https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/T4CHWJ/4RG21Y&version=3.0 has columns: `location_id`, `product_id`, `year`, `export_value`, `import_value`, `export_rca`, `product_status`, `cog`, `distance`, `normalized_distance`, `normalized_cog`, `normalized_pci`, `export_rpop`, `is_new`, `hs_eci`, `hs_coi`, `pci`, `location_code`, `hs_product_code`.
Using only the `location_code`, `hs_product_code`, `export_value` and `year` columns from that data as input to ecomplexity yields different values for all of the calculated indicators. As an example, the atlas data has an `hs_eci` for `ABW` in `1995` of `-0.468138129`. When calculating complexity indicators from the atlas data, the `eci` for `ABW` in `1995` is calculated as `-0.1471911`.
Is the data published on the Harvard Dataverse created using a different method?
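To make the comparison concrete, here is a minimal sketch of how the published and recalculated values could be lined up by location and year. The helper name `eci_differences` and the dictionary layout are assumptions for illustration only; the two numbers are the ABW 1995 values quoted above.

```python
def eci_differences(published, calculated):
    """Return {(location_code, year): calculated - published} for shared keys.

    Hypothetical helper, not part of ecomplexity or the dataverse tooling.
    """
    shared = published.keys() & calculated.keys()
    return {key: calculated[key] - published[key] for key in sorted(shared)}


# The two ABW values quoted in this issue.
published = {("ABW", 1995): -0.468138129}  # hs_eci from the dataverse file
calculated = {("ABW", 1995): -0.1471911}   # eci returned by ecomplexity

diffs = eci_differences(published, calculated)
for (loc, year), diff in diffs.items():
    print(loc, year, round(diff, 6))
```

With the full data, the same dictionaries would hold one entry per location and year, which makes it easy to see whether the gap is constant, proportional, or varies by country.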
Thanks Matias for your reply.
The indicators calculated by this package are normalized by ECI. I can confirm that the mean of the calculated ECI is 0 and its standard deviation is 1.
The pre-calculated ECI from the dataverse data is not exactly normalized, but it is close:
year | mean ECI | sd ECI |
---|---|---|
1995 | -0.0138702 | 0.9869372 |
1996 | -0.0172142 | 0.9702815 |
1997 | -0.0013751 | 1.0293460 |
1998 | -0.0184518 | 1.0017776 |
1999 | -0.0009565 | 0.9855839 |
2000 | 0.0250194 | 0.9951708 |
2001 | 0.0522257 | 0.9701210 |
2002 | 0.0458795 | 0.9585416 |
2003 | 0.0377355 | 0.9721496 |
2004 | 0.0383981 | 0.9639836 |
2005 | 0.0233821 | 0.9737128 |
2006 | 0.0611096 | 0.9824983 |
2007 | 0.0534595 | 0.9712855 |
2008 | 0.0498620 | 0.9636734 |
2009 | 0.0881478 | 0.9712107 |
2010 | 0.0712081 | 0.9720490 |
2011 | 0.0646669 | 0.9753816 |
2012 | 0.0383435 | 1.0044073 |
2013 | 0.0426753 | 1.0085014 |
2014 | 0.0353947 | 0.9829397 |
2015 | 0.0392669 | 0.9958343 |
2016 | 0.0422366 | 0.9964626 |
2017 | 0.0275077 | 0.9895860 |
2018 | 0.0539169 | 0.9958367 |
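For anyone wanting to reproduce this kind of check, the following is a rough sketch of computing the per-year mean and standard deviation of ECI and then rescaling each year to mean 0 and standard deviation 1. It assumes the ECI is first collapsed to one value per location and year (the dataverse file repeats it on every product row), and it uses the population standard deviation, which is an assumption; R's `sd()` uses the sample version, so results can differ slightly. The function names and the sample rows are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def eci_by_year(rows):
    """Collapse (location, year, eci) rows to one ECI value per location and year."""
    per_year = defaultdict(dict)
    for loc, year, eci in rows:
        per_year[year][loc] = eci  # repeated product rows collapse to one entry
    return per_year


def standardize(per_year):
    """Z-score ECI within each year (population standard deviation: an assumption)."""
    out = {}
    for year, by_loc in per_year.items():
        mu = mean(by_loc.values())
        sigma = pstdev(by_loc.values())
        out[year] = {loc: (value - mu) / sigma for loc, value in by_loc.items()}
    return out


# Made-up illustrative values, not taken from the dataverse.
rows = [
    ("AAA", 1995, -1.2), ("AAA", 1995, -1.2),  # same country, two product rows
    ("BBB", 1995, 0.3),
    ("CCC", 1995, 1.1),
]
per_year = eci_by_year(rows)
z = standardize(per_year)
for year, by_loc in z.items():
    print(year, round(mean(by_loc.values()), 6), round(pstdev(by_loc.values()), 6))
```

Skipping the collapse step would weight each country by its number of product rows, which changes the mean and standard deviation.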
After normalizing, it's close, but not identical.
With regard to any other data cleaning, I'm simply using the trade data that comes as part of the pre-calculated data from the dataverse, so I don't see how the inputs could differ between the two. The number of locations and products by year in the indicators calculated by this package is identical to the number of locations and products by year in the dataverse data.
year | locs.calculated | prods.calculated | locs.source | prods.source |
---|---|---|---|---|
1995 | 231 | 1247 | 231 | 1247 |
1996 | 227 | 1247 | 227 | 1247 |
1997 | 227 | 1247 | 227 | 1247 |
1998 | 226 | 1247 | 226 | 1247 |
1999 | 226 | 1247 | 226 | 1247 |
2000 | 231 | 1248 | 231 | 1248 |
2001 | 233 | 1248 | 233 | 1248 |
2002 | 234 | 1248 | 234 | 1248 |
2003 | 233 | 1248 | 233 | 1248 |
2004 | 234 | 1248 | 234 | 1248 |
2005 | 233 | 1247 | 233 | 1247 |
2006 | 232 | 1248 | 232 | 1248 |
2007 | 233 | 1247 | 233 | 1247 |
2008 | 233 | 1247 | 233 | 1247 |
2009 | 233 | 1247 | 233 | 1247 |
2010 | 233 | 1245 | 233 | 1245 |
2011 | 235 | 1245 | 235 | 1245 |
2012 | 235 | 1246 | 235 | 1246 |
2013 | 237 | 1243 | 237 | 1243 |
2014 | 236 | 1242 | 236 | 1242 |
2015 | 235 | 1241 | 235 | 1241 |
2016 | 234 | 1240 | 234 | 1240 |
2017 | 236 | 1227 | 236 | 1227 |
2018 | 236 | 1225 | 236 | 1225 |
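A count comparison like the one in the table can be sketched as follows. The function name `counts_by_year`, the row layout, and the sample rows are hypothetical; in practice the rows would come from the dataverse file and from the ecomplexity output.

```python
from collections import defaultdict


def counts_by_year(rows):
    """Return {year: (n_locations, n_products)} from (location, product, year) rows."""
    locs = defaultdict(set)
    prods = defaultdict(set)
    for loc, prod, year in rows:
        locs[year].add(loc)
        prods[year].add(prod)
    return {year: (len(locs[year]), len(prods[year])) for year in sorted(locs)}


# Made-up illustrative rows, not taken from the dataverse.
source_rows = [
    ("AAA", "0101", 1995),
    ("AAA", "0102", 1995),
    ("BBB", "0101", 1995),
    ("AAA", "0101", 1996),
]
calculated_rows = list(source_rows)

source_counts = counts_by_year(source_rows)
calculated_counts = counts_by_year(calculated_rows)

# Years where the two datasets disagree on either count.
mismatched_years = [
    year for year in source_counts
    if source_counts[year] != calculated_counts.get(year)
]
print(source_counts)
print(mismatched_years)
```

Matching counts only show that both datasets cover the same number of locations and products per year; they say nothing about whether the underlying values were transformed before the published indicators were computed.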
I'll just add that the ECI calculated using the R package referenced in #11 does agree with the ECI calculated using this Python package. So perhaps something different is being done to the data on the dataverse?
Sorry for the super-late response @hamgamb, but if anyone else is looking for answers here, the short but possibly unsatisfying answer is that more data pre-processing goes into the dataverse data. The underlying algorithms used to generate the PCI / ECI values are the same, and the differences you rightly call out are a result of that pre-processing. If you reach out to the team that manages the data uploaded to the dataverse (atlas.cid.harvard.edu), they might be able to give you exact details of the pre-processing.