Calculate distance values based on Proximity-output from this package
fabianscheifele opened this issue · 10 comments
Dear all,
in addition to the metrics provided by this package (ECI, PCI, outlook gain and outlook index) I would like to calculate the distance value for each product-year-country instance. Those familiar with economic complexity know the definition:
Distance (value between 0 and 1, for product p to country c): the sum of the proximities connecting a new good p to all the products that country c is not currently exporting. We normalize distance by dividing it by the sum of proximities between all products and product p. In other words, distance is the weighted proportion of products connected to good p that country c is not exporting.
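Written out as a formula, the definition above reads as follows. The symbols are assumed notation, not taken from the package: $M_{cp'}$ is 1 if country $c$ exports product $p'$ with RCA and 0 otherwise, and $\phi_{pp'}$ is the proximity between products $p$ and $p'$.

```latex
d_{cp} = \frac{\sum_{p'} \left(1 - M_{cp'}\right) \phi_{pp'}}{\sum_{p'} \phi_{pp'}}
```

The numerator adds up the proximities from $p$ to the products that $c$ does not export, and the denominator adds up the proximities from $p$ to all products, so the ratio stays between 0 and 1.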
I am able to transform the product-proximity matrix resulting from the package into a 3-column dataframe (HS code p1, HS code p2, proximity) and I have also created a dummy variable (0 or 1) based on whether the country has an RCA in this product. But I am now struggling to code a command/function to calculate the distance.
I attached my data again (it's for the year 1995, but the logic is the same) and the code that I started for this calculation.
How to calculate distance value.zip
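For illustration, here is a minimal base R sketch of how the distance could be computed from the two long-format tables described above. The data, the column names (`p1`, `p2`, `proximity`, `country`, `has_rca`) and the product codes are all made up for this example and are not taken from the attached files or from the package.

```r
# Hypothetical long-format proximity table: one row per product pair
proximity_long <- data.frame(
  p1 = rep(c("0101", "0102", "0103"), each = 3),
  p2 = rep(c("0101", "0102", "0103"), times = 3),
  proximity = c(1.0, 0.5, 0.2,
                0.5, 1.0, 0.4,
                0.2, 0.4, 1.0)
)

# Hypothetical RCA dummy: 1 if the country exports the product with RCA
rca_long <- data.frame(
  country = rep(c("arg", "alb"), each = 3),
  p2 = rep(c("0101", "0102", "0103"), times = 2),
  has_rca = c(1, 0, 1,
              0, 1, 0)
)

# Pair every (country, p2) row with every p1 that p2 is connected to
pairs <- merge(rca_long, proximity_long, by = "p2")

# Keep only the proximity to products the country does NOT export
pairs$weighted <- (1 - pairs$has_rca) * pairs$proximity

# Numerator: sum over p2 for each (country, p1)
numerator <- aggregate(weighted ~ country + p1, data = pairs, FUN = sum)

# Denominator: sum of all proximities of p1
denominator <- aggregate(proximity ~ p1, data = proximity_long, FUN = sum)

distance_long <- merge(numerator, denominator, by = "p1")
distance_long$distance <- distance_long$weighted / distance_long$proximity
distance_long <- distance_long[, c("country", "p1", "distance")]
```

The result has one row per country-product pair, which matches the product-country layout asked for above; adding a year column to both `merge` and `aggregate` keys would extend it to several years.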
@pachamaltese is there maybe a specific spot in the source code where I can find this? Because I think in order to calculate the opportunity gain index in your package you already need to make use of the distance values as an intermediary outcome?
So you are going to create a new command that calculates the distance value?
I am not yet a very experienced R user nor a programmer. Hence, I think I will not be able to follow the code fully, but I know the concept of the distance value and can check whether the code calculates what it is supposed to...
@fabianscheifele "I know the concept of the distance value and can check whether the code calculates what it is supposed to." this is exactly what I need to double check my results
@fabianscheifele Hi! I've separated the outlook from the distance function. You can install the package with this new change by running remotes::install_github("pachamaltese/economiccomplexity"). If everything makes sense, please let me know so I can increase the version number and send the new version to CRAN.
@pachamaltese I checked the function with the attached dataset (HS92 tradedata from oec website for the year 2017) and I have the following observations:
There are very minimal differences between the value calculated by the command and my manual re-calculation (based on the proximity matrix created by the proximity command):
Country | Product | Value calculated by command | Manually calculated
Argentina | 10111 | 0.8514521 | 0.85309528
Armenia | 10111 | 0.943737 | 0.943629
Albania | 10111 | 0.929481 | 0.9293449
Albania | 10410 | 0.8495048 | 0.8491
These differences may be due to rounding in intermediate calculation steps. The command appears to be working fine and returns the distance value for all product-country combinations previously identified by the balassa_index command and by the product proximity command (hence all products and countries with at least one positive export value recorded in the database that you are using). If you have a continuous database with the same product codes for multiple years and some of those product codes do not appear in a particular year (but maybe in a previous or a later year), they will be filtered out by the balassa command when you calculate it for this particular year, probably to make the matrix leaner.
distance value check.zip
I highly appreciate this additional functionality because it now allows you to calculate all key complexity metrics used in the Atlas.
@fabianscheifele Thanks a lot! Well, your function is quite long!!
It might be rounding since I'm using matrices. You can try tradestatistics.io, it has a quite decent API and there's https://github.com/rpensci/tradestatiistics to use that API from R, it shall ease data extraction a lot.
When I have completely checked your manual calculation, I shall increase the version number and send to CRAN.
@fabianscheifele there are a few things I don't get from your script. For example, where does all_comb17 come from?
@fabianscheifele Hi! After reading the Atlas again, the distance equation can be correctly expressed as a matrix operation with this line of code:
tcrossprod(1 - balassa_index, proximity_product / rowSums(proximity_product))
So, the distance function is correct, and the difference is because of rounding (your code mixes data.frame, tibble and data.table).
Okay, yes, this makes sense. The rowSums is basically summing up all the proximities of product p, while the tcrossprod with (1 - balassa_index) is only summing up the proximities to those products where the Balassa index = 0.
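To make that reading concrete, the sketch below compares the one-line matrix form quoted in this thread with an explicit loop over countries and products. The two small matrices are invented for illustration and assume a binary (0/1) Balassa index with countries in rows and products in columns; they are not output from the package.

```r
# Made-up binary Balassa index: 2 countries x 3 products
balassa_index <- matrix(
  c(1, 0, 1,
    0, 1, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("A", "B"), c("p1", "p2", "p3"))
)

# Made-up symmetric product proximity matrix
proximity_product <- matrix(
  c(1.0, 0.5, 0.2,
    0.5, 1.0, 0.4,
    0.2, 0.4, 1.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("p1", "p2", "p3"), c("p1", "p2", "p3"))
)

# Matrix form quoted in the thread: each proximity row is divided by its
# row sum, then multiplied with the "not exported" indicator
distance_matrix <- tcrossprod(
  1 - balassa_index,
  proximity_product / rowSums(proximity_product)
)

# Same quantity written as an explicit loop over country and product
distance_loop <- matrix(
  0,
  nrow = nrow(balassa_index), ncol = ncol(balassa_index),
  dimnames = dimnames(balassa_index)
)
for (cc in rownames(balassa_index)) {
  for (p in colnames(balassa_index)) {
    not_exported <- 1 - balassa_index[cc, ]
    distance_loop[cc, p] <-
      sum(not_exported * proximity_product[p, ]) / sum(proximity_product[p, ])
  }
}

# Both versions agree up to floating point tolerance
stopifnot(isTRUE(all.equal(distance_matrix, distance_loop)))
```

Comparing with `all.equal` rather than `==` allows for the small floating point differences discussed above.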