Summary:

I am encountering a perplexing issue with a data.table join operation in R, where I intend to combine information from two datasets based on specific habitat and species criteria. The join is expected to yield additional rows, but on a large dataset, it results in a “Join results in more rows” error. To illustrate the problem, I’ve provided a minimal example with sample datasets. Initially, a smaller dataset demonstrates successful execution of the join, but the same operation on a larger dataset triggers an error. I’ve included the relevant code snippets, sample data, and error messages. Seeking guidance on resolving this issue, particularly when dealing with duplicate key values and large datasets, I am reaching out for insights and potential solutions.

This is a small dataset, here it works

library(data.table)

I have a problem with a join that should produce extra rows in data table, I have a subresult which is this one:

subresult <- structure(list(cell = c(1447386L, 1447386L, 1447387L), species = c(
  "Abies nordmanniana",
  "Abies nordmanniana", "Abies nordmanniana"
), Habitat = c("WetRich", "DryRich", "DryRich")), class = c(
  "data.table",
  "data.frame"
), row.names = c(NA, -3L))

this looks like this:

cell	species	Habitat
1447386	Abies nordmanniana	WetRich
1447386	Abies nordmanniana	DryRich
1447387	Abies nordmanniana	DryRich

Here we have a species that can inhabit in 2 different habitats in one cell, and 1 in another, all of which are available in this cell, now those habitats can become different landuses, so we have this other dataset

Feasible_Landuses <- structure(list(species = c(
  "Abies nordmanniana", "Abies nordmanniana",
  "Abies nordmanniana", "Abies nordmanniana", "Abies nordmanniana",
  "Abies nordmanniana", "Abies nordmanniana"
), Landuse = c(
  "ForestDryRich",
  "ForestDryPoor", "ForestWetRich", "ForestWetPoor", "OpenDryRich",
  "OpenDryPoor", "OpenWetPoor"
), Habitat = c(
  "DryRich", "DryPoor",
  "WetRich", "WetPoor", "DryRich", "DryPoor", "WetPoor"
)), row.names = c(
  NA,
  -7L
), class = c("data.table", "data.frame"), sorted = "species")

which can be seen here

species	Landuse	Habitat
Abies nordmanniana	ForestDryRich	DryRich
Abies nordmanniana	ForestDryPoor	DryPoor
Abies nordmanniana	ForestWetRich	WetRich
Abies nordmanniana	ForestWetPoor	WetPoor
Abies nordmanniana	OpenDryRich	DryRich
Abies nordmanniana	OpenDryPoor	DryPoor
Abies nordmanniana	OpenWetPoor	WetPoor

Now the goal would be to make a join resulting in this:

Final_Result <- structure(list(cell = c(1447386L, 1447386L, 1447386L, 1447387L, 1447387L), species = c(
  "Abies nordmanniana", "Abies nordmanniana",
  "Abies nordmanniana", "Abies nordmanniana",
  "Abies nordmanniana"
), Landuse = c(
  "ForestDryRich", "ForestWetRich",
  "OpenDryRich", "ForestDryRich", "OpenDryRich"
), Habitat = c("DryRich", "WetRich", "DryRich", "DryRich", "DryRich")), row.names = c(
  NA,
  -5L
), class = c("data.table", "data.frame"))

Which should look like this:

cell	species	Landuse	Habitat
1447386	Abies nordmanniana	ForestDryRich	DryRich
1447386	Abies nordmanniana	ForestWetRich	WetRich
1447386	Abies nordmanniana	OpenDryRich	DryRich
1447387	Abies nordmanniana	ForestDryRich	DryRich
1447387	Abies nordmanniana	OpenDryRich	DryRich

When I try this this works fine

Final_Result <- subresult[Feasible_Landuses, on = .(Habitat, species), nomatch = 0]

as can be seen here

cell	species	Habitat	Landuse
1447386	Abies nordmanniana	DryRich	ForestDryRich
1447387	Abies nordmanniana	DryRich	ForestDryRich
1447386	Abies nordmanniana	WetRich	ForestWetRich
1447386	Abies nordmanniana	DryRich	OpenDryRich
1447387	Abies nordmanniana	DryRich	OpenDryRich

Error on large dataset

However on a larget dataset I get this error:

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in 19648 rows; more than 11143 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

Now with a big dataset

this dataset is available at this repo

But I will add the code so that I read it from there:

result <- data.table::fread("https://raw.githubusercontent.com/derek-corcoran-barrios/DataTableJoinQuestion/master/result.csv")

you can see the structure here

str(result)
#;-) Classes 'data.table' and 'data.frame': 11136 obs. of  3 variables:
#;-)  $ cell   : int  253745 253746 256044 256045 256046 256047 258344 258345 258346 262944 ...
#;-)  $ species: chr  "Abies nordmanniana" "Abies nordmanniana" "Abies nordmanniana" "Abies nordmanniana" ...
#;-)  $ Habitat: chr  "DryPoor" "DryPoor" "DryPoor" "DryPoor" ...
#;-)  - attr(*, ".internal.selfref")=<externalptr>

and now the feasible landuses

Feasible_Landuses <- data.table::fread("https://raw.githubusercontent.com/derek-corcoran-barrios/DataTableJoinQuestion/master/Feasible_Landuses.csv")

Again we check the structure

str(Feasible_Landuses)
#;-) Classes 'data.table' and 'data.frame': 7 obs. of  3 variables:
#;-)  $ species: chr  "Abies nordmanniana" "Abies nordmanniana" "Abies nordmanniana" "Abies nordmanniana" ...
#;-)  $ Landuse: chr  "ForestDryRich" "ForestDryPoor" "ForestWetRich" "ForestWetPoor" ...
#;-)  $ Habitat: chr  "DryRich" "DryPoor" "WetRich" "WetPoor" ...
#;-)  - attr(*, ".internal.selfref")=<externalptr>

and now we finally check the join:

Final_Result <- result[Feasible_Landuses, on = .(Habitat, species), nomatch = 0]
#;-) Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 19648 rows; more than 11143 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

Why this one gets an error and this first one does not?

Session info

Standard output and standard error

Warning messages:
1: In normalizePath(Sys.getenv("TMPDIR", Sys.getenv("TMP"))) :
  path[1]="": No such file or directory
2: In normalizePath(Sys.getenv("TMPDIR", Sys.getenv("TMP"))) :
  path[1]="": No such file or directory
3: In normalizePath(Sys.getenv("TMPDIR", Sys.getenv("TMP"))) :
  path[1]="": No such file or directory
4: In eng_r(options) :
  Failed to tidy R code in chunk 'unnamed-chunk-10'. Reason:
Error in parse_safely(text, keep.source = TRUE) : 
  <text>:1:7: unexpected 'in'
1: Error in
          ^

Session info

sessioninfo::session_info()
#;-) ─ Session info ───────────────────────────────────────────────────────────────
#;-)  setting  value
#;-)  version  R version 4.3.2 (2023-10-31)
#;-)  os       Ubuntu 20.04.6 LTS
#;-)  system   x86_64, linux-gnu
#;-)  ui       X11
#;-)  language en_US:en
#;-)  collate  en_US.UTF-8
#;-)  ctype    en_US.UTF-8
#;-)  tz       Europe/Copenhagen
#;-)  date     2023-12-07
#;-)  pandoc   2.19.2 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#;-) 
#;-) ─ Packages ───────────────────────────────────────────────────────────────────
#;-)  package     * version date (UTC) lib source
#;-)  cli           3.6.1   2023-03-23 [3] CRAN (R 4.2.3)
#;-)  curl          5.1.0   2023-10-02 [1] CRAN (R 4.3.1)
#;-)  data.table  * 1.14.8  2023-02-17 [1] CRAN (R 4.3.0)
#;-)  digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.1)
#;-)  evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.2)
#;-)  fansi         1.0.5   2023-10-08 [1] CRAN (R 4.3.1)
#;-)  fastmap       1.1.1   2023-02-24 [3] CRAN (R 4.2.2)
#;-)  fs            1.6.3   2023-07-20 [1] CRAN (R 4.3.1)
#;-)  glue          1.6.2   2022-02-24 [3] CRAN (R 4.1.2)
#;-)  htmltools     0.5.7   2023-11-03 [1] CRAN (R 4.3.2)
#;-)  knitr         1.45    2023-10-30 [1] CRAN (R 4.3.2)
#;-)  lifecycle     1.0.4   2023-11-07 [3] CRAN (R 4.3.2)
#;-)  magrittr      2.0.3   2022-03-30 [3] CRAN (R 4.1.3)
#;-)  pillar        1.9.0   2023-03-22 [3] CRAN (R 4.2.3)
#;-)  purrr         1.0.2   2023-08-10 [3] CRAN (R 4.3.1)
#;-)  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.3.0)
#;-)  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.3.0)
#;-)  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.3.0)
#;-)  R.utils       2.12.2  2022-11-11 [1] CRAN (R 4.3.0)
#;-)  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.3.0)
#;-)  rlang         1.1.2   2023-11-04 [1] CRAN (R 4.3.2)
#;-)  rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.3.1)
#;-)  rstudioapi    0.15.0  2023-07-07 [3] CRAN (R 4.3.1)
#;-)  sessioninfo   1.2.2   2021-12-06 [3] CRAN (R 4.1.2)
#;-)  styler        1.10.0  2023-05-24 [1] CRAN (R 4.3.0)
#;-)  utf8          1.2.4   2023-10-22 [1] CRAN (R 4.3.1)
#;-)  vctrs         0.6.4   2023-10-12 [1] CRAN (R 4.3.1)
#;-)  withr         2.5.2   2023-10-30 [1] CRAN (R 4.3.2)
#;-)  xfun          0.41    2023-11-01 [1] CRAN (R 4.3.2)
#;-)  yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.1)
#;-) 
#;-)  [1] /home/au687614/R/x86_64-pc-linux-gnu-library/4.3
#;-)  [2] /usr/local/lib/R/site-library
#;-)  [3] /usr/lib/R/site-library
#;-)  [4] /usr/lib/R/library
#;-) 
#;-) ──────────────────────────────────────────────────────────────────────────────