Joining of Dataframes in R Programming
Last Updated :
23 May, 2022
rreIn R Language, dataframes are generic data objects which are used to store the tabular data. Dataframes are considered to be the most popular data objects in R programming because it is more comfortable to analyze the data in the tabular form. Dataframes can also be taught as mattresses where each column of a matrix can be of the different data types. Dataframe is made up of three principal components, the data, rows, and columns. In R we use merge() function to merge two dataframes in R. This function is present inside join() function of dplyr package. The most important condition for joining two dataframes is that the column type should be the same on which the merging happens. merge() function works similarly like join in DBMS. Types of Merging Available in R are,
- Natural Join or Inner Join
- Left Outer Join
- Right Outer Join
- Full Outer Join
- Cross Join
- Semi Join
- Anti Join
Basic Syntax of merge() function in R:
Syntax: merge(df1, df2, by.df1, by.df2, all.df1, all.df2, sort = TRUE) Parameters: df1: one dataframe df2: another dataframe by.df1, by.df2: The names of the columns that are common to both df1 and df2. all, all.df1, all.df2: Logical values that actually specify the type of merging happens.
Now let’s try to understand all types of merging one by one. First of all, we will create two dataframes that will help us to understand each join easily. # Data frame 1
Python3
df1 = data.frame(StudentId = c( 101 : 106 ),
Product = c("Hindi", "English",
"Maths", "Science",
"Political Science",
"Physics"))
df1
|
Output:
StudentId Product
1 101 Hindi
2 102 English
3 103 Maths
4 104 Science
5 105 Political Science
6 106 Physics
# Data frame 2
Python3
df2 = data.frame(StudentId = c( 102 , 104 , 106 ,
107 , 108 ),
State = c("Mangalore", "Mysore",
"Pune", "Dehradun", "Delhi"))
df2
|
Output:
StudentId State
1 102 Mangalore
2 104 Mysore
3 106 Pune
4 107 Dehradun
5 108 Delhi
Natural Join or Inner Join
Inner join is used to keep only those rows that are matched from the dataframes, in this, we actually specify the argument all = FALSE. If we try to understand this using set theory then we can say here we are actually performing the intersection operation. For example:
A = [1, 2, 3, 4, 5]
B = [2, 3, 5, 6]
Then the output of natural join will be (2, 3, 5)
It is the most simplest and common type of joins available in R. Now let us try to understand this using R program:
Example:
Python3
df = merge(x = df1, y = df2, by = "StudentId")
df
|
Output:
StudentId Product State
1 102 English Mangalore
2 104 Science Mysore
3 106 Physics Pune
Left Outer Join
Left Outer Join is basically to include all the rows of your dataframe x and only those from y that match, in this, we actually specify the argument x = TRUE. If we try to understand this using a basic set theory then we can say here we are actually displaying complete set x. Now let us try to understand this using R program:
Example:
Python3
df = merge(x = df1, y = df2, by = "StudentId",
all .x = TRUE)
df
|
Output:
StudentId Product State
1 101 Hindi NA
2 102 English Mangalore
3 103 Maths NA
4 104 Science Mysore
5 105 Political Science NA
6 106 Physics Pune
Right Outer Join
Right, Outer Join is basically to include all the rows of your dataframe y and only those from x that match, in this, we actually specify the argument y = TRUE. If we try to understand this using a basic set theory then we can say here we are actually displaying a complete set y. Now let us try to understand this using R program: Example:
Python3
df = merge(x = df1, y = df2, by = "StudentId",
all .y = TRUE)
df
|
Output:
StudentId Product State
1 102 English Mangalore
2 104 Science Mysore
3 106 Physics Pune
4 107 NA Dehradun
5 108 NA Delhi
Full Outer Join
Outer Join is basically used to keep all rows from both dataframes, in this, we actually specify the arguments all = TRUE. If we try to understand this using a basic set theory then we can say here we are actually performing the union option. Now let us try to understand this using R program:
Example:
Python3
df = merge(x = df1, y = df2, by = "StudentId",
all = TRUE)
df
|
Output:
StudentId Product State
1 101 Hindi NA
2 102 English Mangalore
3 103 Maths NA
4 104 Science Mysore
5 105 Political Science NA
6 106 Physics Pune
7 107 NA Dehradun
8 108 NA Delhi
Cross Join
A Cross Join also known as cartesian join results in every row of one dataframe is being joined to every other row of another dataframe. In set theory, this type of joins is known as the cartesian product between two sets. Now let us try to understand this using R program:
Example:
Python3
df = merge(x = df1, y = df2, by = NULL)
df
|
Output:
StudentId.x Product StudentId.y State
1 101 Hindi 102 Mangalore
2 102 English 102 Mangalore
3 103 Maths 102 Mangalore
4 104 Science 102 Mangalore
5 105 Political Science 102 Mangalore
6 106 Physics 102 Mangalore
7 101 Hindi 104 Mysore
8 102 English 104 Mysore
9 103 Maths 104 Mysore
10 104 Science 104 Mysore
11 105 Political Science 104 Mysore
12 106 Physics 104 Mysore
13 101 Hindi 106 Pune
14 102 English 106 Pune
15 103 Maths 106 Pune
16 104 Science 106 Pune
17 105 Political Science 106 Pune
18 106 Physics 106 Pune
19 101 Hindi 107 Dehradun
20 102 English 107 Dehradun
21 103 Maths 107 Dehradun
22 104 Science 107 Dehradun
23 105 Political Science 107 Dehradun
24 106 Physics 107 Dehradun
25 101 Hindi 108 Delhi
26 102 English 108 Delhi
27 103 Maths 108 Delhi
28 104 Science 108 Delhi
29 105 Political Science 108 Delhi
30 106 Physics 108 Delhi
Semi Join
This join is somewhat like inner join, with only the left dataframe columns and values are selected. Now let us try to understand this using R program:
Example:
Python3
library(dplyr)
df = df1 % > % semi_join(df2, by = "StudentId")
df
|
Output:
StudentId Product
1 102 English
2 104 Science
3 106 Physics
Anti Join
In terms of set theory, we can say anti-join as set difference operation, for example, A = (1, 2, 3, 4) B = (2, 3, 5) then the output of A-B will be set (1, 4). This join is somewhat like df1 – df2, as it basically selects all rows from df1 that are actually not present in df2. Now let us try to understand this using R program:
Example:
Python3
library(dplyr)
df = df1 % > % anti_join(df2, by = "StudentId")
df
|
Output:
StudentId Product
1 101 Hindi
2 103 Maths
3 105 Political Science
Similar Reads
DataFrame Operations in R
DataFrames are generic data objects of R which are used to store the tabular data. Data frames are considered to be the most popular data objects in R programming because it is more comfortable to analyze the data in the tabular form. Data frames can also be taught as mattresses where each column of
10 min read
Reading the CSV file into Dataframes in R
In this article, we will learn how to import or read a CSV file into a dataframe in R Programming Language. Data set in use: Step 1: Set or change the working directory In order to import or read the given CSV file into our data frame, we first need to check our current working directory, and make s
3 min read
List of Dataframes in R
DataFrames are generic data objects of R which are used to store the tabular data. They are two-dimensional, heterogeneous data structures. A list in R, however, comprises of elements, vectors, data frames, variables, or lists that may belong to different data types. In this article, we will study h
8 min read
data.table vs data.frame in R Programming
data.table in R is an enhanced version of the data.frame. Due to its speed of execution and the less code to type it became popular in R. The purpose of data.table is to create tabular data same as a data frame but the syntax varies. In the below example let we can see the syntax for the data table:
3 min read
Intersection of dataframes using Dplyr in R
In this article, we will discuss how to find the Intersection of two dataframes using the Dplyr package in R programming language. Dplyr provides intersect() method to get the common data in two dataframes. Syntax: intersect(dataframe1,dataframe2,dataframe3,........,dataframe n) We can perform this
1 min read
Indexing and Slicing Data Frames in R
Indexing and Slicing are use for accessing and manipulating data. Indexing: Accessing specific elements (rows or columns) in data structures.Slicing: Extracting subsets of data based on conditions or indices.In R, indexing a data frame allows you to retrieve specific columns by their names: datafram
3 min read
Create Subsets of a Data frame in R Programming - subset() Function
subset() function in R Programming Language is used to create subsets of a Data frame. This can also be used to drop columns from a data frame. Syntax: subset(df, expr) Parameters: df: Data frame usedexpr: Condition for subsetCreate Subsets of Data Frames in R Programming LanguageHere we will make s
3 min read
Rename Columns of a Data Frame in R Programming - rename() Function
The rename() function in R Programming Language is used to rename the column names of a data frame, based on the older names. Syntax: rename(x, names) Parameters: x: Data frame names: Old name and new name 1. Rename a Data Frame using rename function in RWe are using the plyr package to rename the c
2 min read
dplyr Package in R Programming
The dplyr package for R offers efficient data manipulation functions. It makes data transformation and summarization simple with concise, readable syntax. Key Features of dplyrData Frame and TibbleData frames in dplyr in R is organized tables where each column stores specific types of information, l
4 min read
Working with Databases in R Programming
Prerequisite: Database Connectivity with R Programming In R programming Language, a number of datasets are passed to the functions to visualize them using statistical computing. So, rather than creating datasets again and again in the console, we can pass those normalized datasets from relational da
4 min read