duttashi/pysqlr

How to join (merge) data frames (inner, outer, left, right)?

duttashi opened this issue · 1 comments

Given two data frames:

df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))

df1
#  CustomerId Product
#           1 Toaster
#           2 Toaster
#           3 Toaster
#           4   Radio
#           5   Radio
#           6   Radio

df2
#  CustomerId   State
#           2 Alabama
#           4 Alabama
#           6    Ohio

How can I do database style, i.e., sql style, joins? That is, how do I get:

  • An inner join of df1 and df2:
    Return only the rows in which the left table have matching keys in the right table.

  • An outer join of df1 and df2:
    Returns all rows from both tables, join records from the left which have matching keys in the right table.

  • A left outer join (or simply left join) of df1 and df2
    Return all rows from the left table, and any rows with matching keys from the right table.

  • A right outer join of df1 and df2
    Return all rows from the right table, and any rows with matching keys from the left table.

Extra credit:
How can I do a SQL style select statement?

Reference

This question was originally asked on stack overflow. I'm reproducing it here for my future reference

Using the dplyr package, this can be done as follows;

library(dplyr)
> inner_join(df1, df2)
Joining, by = "CustomerId"
  CustomerId Product   State
1          2 Toaster Alabama
2          4   Radio Alabama
3          6   Radio    Ohio
> #left outer
> left_join(df1, df2)
Joining, by = "CustomerId"
  CustomerId Product   State
1          1 Toaster    <NA>
2          2 Toaster Alabama
3          3 Toaster    <NA>
4          4   Radio Alabama
5          5   Radio    <NA>
6          6   Radio    Ohio
> #right outer
> right_join(df1, df2)
Joining, by = "CustomerId"
  CustomerId Product   State
1          2 Toaster Alabama
2          4   Radio Alabama
3          6   Radio    Ohio
> #alternate right outer
> left_join(df2, df1)
Joining, by = "CustomerId"
  CustomerId   State Product
1          2 Alabama Toaster
2          4 Alabama   Radio
3          6    Ohio   Radio
> #full join
> full_join(df1, df2)
Joining, by = "CustomerId"
  CustomerId Product   State
1          1 Toaster    <NA>
2          2 Toaster Alabama
3          3 Toaster    <NA>
4          4   Radio Alabama
5          5   Radio    <NA>
6          6   Radio    Ohio