pachadotdev/economiccomplexity

balassa_index: Size of Matrix different to original data

fabianscheifele opened this issue · 6 comments

Dear community,

I have a dataset with 5018 unique product codes and 238 countries, and when I use balassa_index it returns a matrix of only 221 (countries) x 4902 (products).
I have checked that there is at least one export value greater than zero for each of the 5018 products, and I have also converted all NAs in the value column to zeros. Hence I do not understand why the function filters out some products. Do you have an idea what the problem could be?
I left all the other arguments as default (discrete and cutoff).
Thank you.

Hi @fabianscheifele
Thanks for asking! Do you have a minimal reproducible example? If you cannot share the data, I'll create a dummy example later and we can use it as an example.

Dear @pachamaltese,
Please find the CSV file attached (OEC Data.zip). I used the following code:

balassa_matrix1995 <- balassa_index(reduced_form_1995, country = "origin", product = "hs92", value = "export_val")

I actually expanded the BACI dataset that is published on the OEC website (https://legacy.oec.world/en/resources/data/) with empty country-product-year combinations, because BACI only contains rows for positive import or export values, so the panel data is not balanced. Maybe this is not even required, in case your code fills any missing country-product-year combinations with zeros in the matrix? (I did it anyway, just to be sure.)
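For readers who want to see what this kind of padding does, here is a minimal sketch using `tidyr::complete()`. The column names follow the thread; the rows and values are made up for illustration, and this is not necessarily how the original expansion was done.

```r
library(dplyr)
library(tidyr)

# Hypothetical sparse table: only positive flows are recorded, as in BACI
sparse <- tibble(
  origin     = c("deu", "deu", "fra"),
  hs92       = c("8703", "2204", "2204"),
  export_val = c(100, 20, 80)
)

# Add every missing origin-product combination with an explicit zero
balanced <- sparse %>%
  complete(origin, hs92, fill = list(export_val = 0))

nrow(sparse)    # 3 recorded flows
nrow(balanced)  # 4 rows: 2 origins x 2 products

# Dropping the zeros again recovers the recorded flows
positive_again <- balanced %>% filter(export_val > 0)
nrow(positive_again)  # 3
```

The padded rows carry no information beyond "no recorded flow", so filtering them out returns the original set of positive flows.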

@pachamaltese: Thank you so much for the quick reply, but I think I solved it myself. I noticed that the dataset "world_trade_avg_1998_to_2000", which is used as an example for the package, contains no zero export values. My matrix was reduced because there were positive export values for only 4902 products and 221 countries. Hence the command filters out all zero/NA values when building the matrix. This makes sense, as they are not necessary for calculating the RCAs. Is my reasoning correct?
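To illustrate why all-zero products cannot contribute to the RCAs, here is a small base R sketch of the standard Balassa formula, RCA = (x_cp / sum_p x_cp) / (sum_c x_cp / sum_cp x_cp). The matrix and its values are hypothetical, and this is a hand-rolled calculation, not the package's implementation.

```r
# Hypothetical country x product export matrix; product p3 is never exported
x <- matrix(
  c(10, 0, 0,
     5, 5, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("A", "B"), c("p1", "p2", "p3"))
)

# Share of each product in a country's exports (rows divided by row sums)
country_share <- x / rowSums(x)

# Share of each product in world exports
world_share <- colSums(x) / sum(x)

# Balassa index: divide each column by that product's world share
rca <- sweep(country_share, 2, world_share, "/")

rca[, "p3"]  # 0 / 0, so NaN for both countries

# Dropping products with zero world exports leaves a well-defined matrix
rca_clean <- rca[, colSums(x) > 0, drop = FALSE]
dim(rca_clean)  # 2 countries x 2 products
```

A product with zero total exports has a world share of zero, so its RCA is undefined rather than zero, which is one reason to leave it out of the matrix.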

@fabianscheifele Hi, let me dig into your example a bit. The world_trade_avg_1998_to_2000 dataset is the result of adding 3 matrices and dividing the coefficients by 3.
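As a rough illustration of that averaging step, the sketch below adds three small matrices and divides by 3. The object names and values are hypothetical; the actual script used to build the package dataset is not shown in this thread.

```r
# Hypothetical yearly country x product export matrices
dims <- list(c("A", "B"), c("p1", "p2"))
m1998 <- matrix(c(3, 0, 6, 9), nrow = 2, dimnames = dims)
m1999 <- matrix(c(6, 0, 3, 0), nrow = 2, dimnames = dims)
m2000 <- matrix(c(0, 0, 3, 0), nrow = 2, dimnames = dims)

# Element-wise sum of the three years, then divide by the number of years
yearly <- list(m1998, m1999, m2000)
m_avg <- Reduce(`+`, yearly) / length(yearly)

m_avg
```

In this toy case a cell of the average is zero only when the flow is zero in all three years.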

@fabianscheifele After inspecting the data, the RCA matrix is correct:

library(readr)
library(dplyr)
library(economiccomplexity)

# data ----

hs92_data_1995only <- read_csv("hs92_data_1995only.csv")

# unique countries/products ----

hs92_data_1995only %>% 
  select(origin) %>% 
  distinct() %>% 
  count()

hs92_data_1995only %>% 
  select(hs92) %>% 
  distinct() %>% 
  count()

# balassa index ----

m <- balassa_index(hs92_data_1995only,
                   discrete = FALSE,
                   country = "origin",
                   product = "hs92",
                   value = "export_val")

dim(m)
#> [1]  221 4902

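What happens is that an internal function (see https://github.com/pachamaltese/economiccomplexity/blob/master/R/economiccomplexity-internals.R) removes the zero flows after summation (i.e. when you pass an origin-destination-export table, it is first summed to origin-product totals). A country or product whose totals are all zero therefore has no entries left and does not appear in the matrix.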
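The aggregate-then-filter behaviour described here can be sketched with dplyr as follows. This is only an approximation for illustration, not the package's internal code; the `dest` column and all values are assumptions, and the other column names follow the thread.

```r
library(dplyr)

# Hypothetical origin-destination-product flows, including explicit zeros
flows <- tibble(
  origin     = c("A", "A", "B", "B", "C"),
  dest       = c("B", "C", "A", "C", "A"),
  hs92       = c("0101", "0101", "0102", "0103", "0103"),
  export_val = c(10, 5, 0, 7, 0)
)

# Sum over destinations, then drop origin-product pairs whose total is zero
totals <- flows %>%
  group_by(origin, hs92) %>%
  summarise(export_val = sum(export_val, na.rm = TRUE), .groups = "drop") %>%
  filter(export_val > 0)

# Countries and products that would be missing from the resulting matrix
dropped_countries <- setdiff(unique(flows$origin), unique(totals$origin))
dropped_products  <- setdiff(unique(flows$hs92), unique(totals$hs92))

dropped_countries  # "C"
dropped_products   # "0102"
```

Running the same two `setdiff()` calls on the real data is one way to list exactly which countries and products account for the gap between the input counts and the matrix dimensions.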

Thank you @pachamaltese for the quick help!