
ecomplexity output does not conform with atlas dataverse results using R reticulate

hamgamb opened this issue · 5 comments

Apologies in advance for the not-so reproducible example. I couldn't find a way around the name/email requirements of the dataverse. I am using reticulate in R to run ecomplexity.

Data published on the Harvard Economic Complexity Dataverse has pre-calculated complexity indicators. The country_hsproduct4digit_year data has columns: location_id, product_id, year, export_value, import_value, export_rca, product_status, cog, distance, normalized_distance, normalized_cog, normalized_pci, export_rpop, is_new, hs_eci, hs_coi, pci, location_code, hs_product_code.

Using only the location_code, hs_product_code, export_value and year columns from that data as input to ecomplexity yields different values for all of the calculated indicators. As an example, the atlas data gives an hs_eci of -0.468138129 for ABW in 1995, whereas ecomplexity, run on that same atlas data, gives an eci of -0.1471911 for ABW in 1995.

Is the data published on the Harvard Dataverse created using a different method?

Thanks Matias for your reply.

The ECI values calculated by this package are normalized: I can confirm that their mean is 0 and their standard deviation is 1.

The pre-calculated ECI from the dataverse data is not normalized but is close:

year mean_ECI sd_ECI
1995 -0.0138702 0.9869372
1996 -0.0172142 0.9702815
1997 -0.0013751 1.0293460
1998 -0.0184518 1.0017776
1999 -0.0009565 0.9855839
2000 0.0250194 0.9951708
2001 0.0522257 0.9701210
2002 0.0458795 0.9585416
2003 0.0377355 0.9721496
2004 0.0383981 0.9639836
2005 0.0233821 0.9737128
2006 0.0611096 0.9824983
2007 0.0534595 0.9712855
2008 0.0498620 0.9636734
2009 0.0881478 0.9712107
2010 0.0712081 0.9720490
2011 0.0646669 0.9753816
2012 0.0383435 1.0044073
2013 0.0426753 1.0085014
2014 0.0353947 0.9829397
2015 0.0392669 0.9958343
2016 0.0422366 0.9964626
2017 0.0275077 0.9895860
2018 0.0539169 0.9958367
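For anyone who wants to repeat this check, here is a minimal sketch of the per-year summary and re-normalization in plain Python. The function names and the toy rows are hypothetical, and it assumes the population standard deviation; swapping in `statistics.stdev` gives the sample convention, which shifts the result slightly.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize_by_year(rows):
    """Return {year: (mean, sd)} for rows of (year, location, eci)."""
    by_year = defaultdict(list)
    for year, _location, eci in rows:
        by_year[year].append(eci)
    # Assumption: population standard deviation (divide by n).
    return {year: (mean(vals), pstdev(vals)) for year, vals in by_year.items()}


def normalize_by_year(rows):
    """Z-score the eci values separately within each year."""
    stats = summarize_by_year(rows)
    out = []
    for year, location, eci in rows:
        mu, sigma = stats[year]
        out.append((year, location, (eci - mu) / sigma))
    return out


# Toy values for illustration only; NOT taken from the dataverse.
toy_rows = [
    (1995, "AAA", -0.5),
    (1995, "BBB", 0.1),
    (1995, "CCC", 0.9),
    (1996, "AAA", -0.4),
    (1996, "BBB", 0.3),
    (1996, "CCC", 1.0),
]

normalized = normalize_by_year(toy_rows)
check = summarize_by_year(normalized)  # per-year mean ~0 and sd ~1
```

Normalization of this kind only shifts and rescales the values within a year, so it cannot change how locations rank against each other in that year.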

After normalizing, the dataverse ECI is close to the ECI calculated by this package, but not identical.


As for any other data cleaning, I'm simply using the trade data that comes as part of the pre-calculated data from the dataverse, so I don't see how there can be any differences between the two. The number of locations and products by year in the indicators calculated by this package is identical to the number of locations and products by year in the dataverse data.

year locs.calculated prods.calculated locs.source prods.source
1995 231 1247 231 1247
1996 227 1247 227 1247
1997 227 1247 227 1247
1998 226 1247 226 1247
1999 226 1247 226 1247
2000 231 1248 231 1248
2001 233 1248 233 1248
2002 234 1248 234 1248
2003 233 1248 233 1248
2004 234 1248 234 1248
2005 233 1247 233 1247
2006 232 1248 232 1248
2007 233 1247 233 1247
2008 233 1247 233 1247
2009 233 1247 233 1247
2010 233 1245 233 1245
2011 235 1245 235 1245
2012 235 1246 235 1246
2013 237 1243 237 1243
2014 236 1242 236 1242
2015 235 1241 235 1241
2016 234 1240 234 1240
2017 236 1227 236 1227
2018 236 1225 236 1225
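The coverage comparison in the table above could be reproduced roughly as follows. This is a sketch with hypothetical function names and toy rows, assuming each dataset is available as (year, location, product) tuples.

```python
from collections import defaultdict


def coverage_by_year(rows):
    """Return {year: (n_locations, n_products)} for (year, location, product) rows."""
    locations = defaultdict(set)
    products = defaultdict(set)
    for year, location, product in rows:
        locations[year].add(location)
        products[year].add(product)
    return {year: (len(locations[year]), len(products[year])) for year in locations}


def coverage_mismatches(calculated_rows, source_rows):
    """Return {year: (calculated_counts, source_counts)} for years whose counts differ."""
    calculated = coverage_by_year(calculated_rows)
    source = coverage_by_year(source_rows)
    years = sorted(set(calculated) | set(source))
    return {
        year: (calculated.get(year), source.get(year))
        for year in years
        if calculated.get(year) != source.get(year)
    }


# Toy rows for illustration only; NOT taken from the dataverse.
toy_source = [
    (1995, "AAA", "0101"),
    (1995, "BBB", "0101"),
    (1995, "BBB", "0102"),
    (1996, "AAA", "0101"),
]
toy_calculated = list(toy_source)

same = coverage_mismatches(toy_calculated, toy_source)
# Dropping one row removes product "0102" from 1995 in the calculated set.
different = coverage_mismatches(toy_calculated[:2] + toy_calculated[3:], toy_source)
```

Matching counts do not by themselves show that the same locations and products are present in both datasets; comparing the underlying sets per year would be a stricter check.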

I'll just add that the ECI calculated using the R package referenced in #11 does agree with the ECI calculated using this Python package. So perhaps something different is being done to the data on the dataverse?

Sorry for the super-late response @hamgamb, but if anyone else is looking for some answers here, the short but possibly unsatisfying answer is that there is more data pre-processing that goes into the dataverse. The ultimate algorithms used to generate the PCI / ECI values are the same, and the differences you rightly call out are a result of the data preprocessing. If you reach out to the team that manages the data uploaded on the dataverse, they might be able to offer you exact details of the pre-processing.
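As a rough illustration of how preprocessing alone can move the numbers while the algorithm stays the same, the sketch below computes the standard Balassa RCA on a toy export table before and after a filter. The filter (dropping one location) is purely hypothetical and is not a description of what the dataverse pipeline actually does; the function name and data are made up for the example.

```python
def balassa_rca(exports):
    """Return {(location, product): rca} for a {(location, product): value} table.

    RCA = (location's share of its exports in the product)
          / (product's share of all exports).
    """
    location_total = {}
    product_total = {}
    grand_total = 0.0
    for (location, product), value in exports.items():
        location_total[location] = location_total.get(location, 0.0) + value
        product_total[product] = product_total.get(product, 0.0) + value
        grand_total += value
    return {
        (location, product): (value / location_total[location])
        / (product_total[product] / grand_total)
        for (location, product), value in exports.items()
    }


# Toy export values for illustration only; NOT taken from the dataverse.
toy_exports = {
    ("AAA", "p1"): 80.0,
    ("AAA", "p2"): 20.0,
    ("BBB", "p1"): 20.0,
    ("BBB", "p2"): 80.0,
    ("CCC", "p1"): 90.0,
    ("CCC", "p2"): 10.0,
}

# Hypothetical preprocessing step: drop location "CCC" before computing RCA.
filtered = {key: value for key, value in toy_exports.items() if key[0] != "CCC"}

rca_before = balassa_rca(toy_exports)[("AAA", "p1")]
rca_after = balassa_rca(filtered)[("AAA", "p1")]
```

Here AAA's own exports are untouched, yet its RCA in p1 changes because the world totals it is compared against have changed. Any indicator built on top of RCA would inherit that shift, which is one way identical algorithms can give different results on differently prepared inputs.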