Take random sample based on groups in R

Last Updated : 18 Jul, 2021

R offers several packages for drawing random samples from data frames or data tables on a per-group basis, so that a fixed number of rows is picked from every group.

Method 1: Using plyr library

The "plyr" package provides tools for splitting data, applying a function to each piece, and combining the results. It must be installed and then loaded into the workspace. Its ddply() function applies a function to each subset of the specified data frame and combines the results into a single data frame.

Syntax:

ddply( .data, .variables, .fun = NULL)

Parameters:

.data - the data frame to process

.variables - the variables to group by

.fun - the function to apply to each group. Here, function(x) x[sample(nrow(x), y), ] is used, which returns y randomly chosen rows from each group defined by .variables. Sampling is without replacement, so every group must contain at least y rows.
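
To see what the function passed to .fun does on its own, the sketch below applies the same row-sampling idea to each group using only base R (split(), lapply(), and rbind()). It illustrates the split-apply-combine pattern rather than how ddply() works internally; the names toy and sample_rows are hypothetical and not part of plyr.

```r
# Toy data: two groups of unequal size (hypothetical example data)
toy <- data.frame(grp = c(rep("A", 4), rep("B", 6)),
                  val = 1:10)

# Function applied to each group: keep y randomly chosen rows
sample_rows <- function(x, y) x[sample(nrow(x), y), , drop = FALSE]

set.seed(1)  # fix the seed so the draw is repeatable

# Split by group, sample within each piece, then bind the pieces together
pieces <- lapply(split(toy, toy$grp), sample_rows, y = 2)
result <- do.call(rbind, pieces)
print(result)
```

The result has two rows from group A and two from group B; ddply() performs the same three steps in a single call.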

Example:

# importing required libraries
library("plyr")

# create dataframe
data_frame<-data.frame(col1=c(rep('G1',50),rep('G2',50),rep('G3',50)), 
                col2=rep(letters[1:5],30)
                )

print("Original DataFrame")
head(data_frame)

# pick 5 random rows from each group of the data frame
data_mod <- ddply(data_frame,.(col1),function(x) x[sample(nrow(x),5),])
print("Modified DataFrame")
print (data_mod)

Output

[1] "Original DataFrame" 
  col1 col2 
1   G1    a 
2   G1    b 
3   G1    c 
4   G1    d 
5   G1    e 
6   G1    a 
[1] "Modified DataFrame" 
   col1 col2 
1    G1    d 
2    G1    e 
3    G1    d 
4    G1    a 
5    G1    a 
6    G2    b 
7    G2    c 
8    G2    d 
9    G2    d 
10   G2    e 
11   G3    c 
12   G3    e 
13   G3    b 
14   G3    b 
15   G3    d
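
The sampled rows change on every run because sample() draws at random. A minimal sketch, assuming repeatable results are wanted: fix the random seed with set.seed() before sampling. The helper name sample_by_group and the seed value 42 are hypothetical choices for illustration.

```r
library("plyr")

data_frame <- data.frame(col1 = c(rep("G1", 50), rep("G2", 50), rep("G3", 50)),
                         col2 = rep(letters[1:5], 30))

# Hypothetical helper: fix the seed, then sample n rows from each col1 group
sample_by_group <- function(df, n, seed) {
  set.seed(seed)
  ddply(df, .(col1), function(x) x[sample(nrow(x), n), ])
}

run1 <- sample_by_group(data_frame, n = 5, seed = 42)
run2 <- sample_by_group(data_frame, n = 5, seed = 42)

# Both runs use the same seed, so they select the same rows
print(identical(run1, run2))
```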

Method 2: Using dplyr library

The "dplyr" package provides a wide range of functions to filter, subset, and extract data based on conditions. It must be installed and then loaded into the workspace. Several operations can be chained on a data frame with the pipe operator (%>%).

The group_by() function groups the data by the values in one or more columns. The columns to group by are passed as arguments, and more than one column name may be given.

Syntax:

group_by(col1, col2, ...)
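
As a brief illustration of grouping by more than one column, the sketch below groups the article's example data by both col1 and col2 and counts the rows in each group. n_groups() and tally() are dplyr functions; the variable name grouped is only illustrative.

```r
library("dplyr")

data_frame <- data.frame(col1 = c(rep("G1", 50), rep("G2", 50), rep("G3", 50)),
                         col2 = rep(letters[1:5], 30))

# Group by every combination of col1 and col2
grouped <- data_frame %>% group_by(col1, col2)

# 3 values of col1 x 5 values of col2 = 15 groups
print(n_groups(grouped))

# Number of rows in each (col1, col2) group
group_sizes <- grouped %>% tally()
print(group_sizes)
```

Any sampling step applied after this group_by() call would draw rows from each of the 15 combinations rather than from the 3 values of col1.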

Next, the sample_n() function selects random rows from the grouped data frame; its argument gives the number of rows to draw from each group.
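
In dplyr 1.0.0 and later, sample_n() is marked as superseded by slice_sample(), although it still works. The sketch below shows the same per-group sampling with slice_sample(), assuming a recent dplyr version is installed; the seed value is an arbitrary choice for repeatability.

```r
library("dplyr")

data_frame <- data.frame(col1 = c(rep("G1", 50), rep("G2", 50), rep("G3", 50)),
                         col2 = rep(letters[1:5], 30))

set.seed(123)  # arbitrary seed so the draw is repeatable

# Draw 3 random rows from each col1 group, then drop the grouping
data_mod <- data_frame %>%
  group_by(col1) %>%
  slice_sample(n = 3) %>%
  ungroup()

print(data_mod)
```

Calling ungroup() at the end is optional; it returns a plain tibble so that later operations are not applied per group by accident.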

Example:

# importing required libraries
library("dplyr")

# create dataframe
data_frame<-data.frame(col1=c(rep('G1',50),rep('G2',50),
                              rep('G3',50)), 
                col2=rep(letters[1:5],30)
                )

print("Original DataFrame")
head(data_frame)

# pick 3 random rows from each group of the data frame
data_mod <- data_frame %>% group_by(col1) %>% sample_n(3)
print("Modified DataFrame")
print (data_mod)

Output

[1] "Original DataFrame" 
  col1 col2 
1   G1    a 
2   G1    b 
3   G1    c 
4   G1    d 
5   G1    e 
6   G1    a 
[1] "Modified DataFrame" 
# A tibble: 9 x 2 
# Groups:   col1 [3]   
  col1  col2 
  <chr> <chr>
1 G1    d     
2 G1    e     
3 G1    c     
4 G2    a     
5 G2    a     
6 G2    c     
7 G3    b     
8 G3    a     
9 G3    a

Method 3: Using data.table

The data.table package provides fast aggregation of large data organized in tabular structures. It must be installed and then loaded into the workspace.

Inside the square brackets of a data.table, .SD refers to the subset of data belonging to each group defined by the "by" argument, and .N is the number of rows in that group. Indexing .SD with sample(x = .N, size = n) therefore picks random rows within each group, with the size argument setting how many rows are drawn. The output is returned as a data.table.

Syntax:

data_frame[ , .SD[sample(x = .N, size = n)], by = <grouping column>]
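
This syntax assumes every group has at least n rows; sample() raises an error when asked for more rows than a group contains. A hedged sketch of one way to handle small groups is to cap the sample size at min(.N, n). The table dt and its columns grp and val are hypothetical example data.

```r
library("data.table")

# Hypothetical data: group "B" has only 2 rows
dt <- data.table(grp = c(rep("A", 10), rep("B", 2)),
                 val = 1:12)

n <- 5         # rows requested per group
set.seed(7)    # arbitrary seed so the draw is repeatable

# Take n rows where possible, otherwise all rows of the group
res <- dt[, .SD[sample(x = .N, size = min(.N, n))], by = grp]
print(res)
```

Group A contributes 5 rows and group B contributes its 2 available rows, so no error is raised.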

Example:

# importing required libraries
library("data.table")

# create data.table
data_frame<-data.table(col1=c(rep('G1',50),rep('G2',50),
                              rep('G3',50)), 
                col2=rep(letters[1:5],30)
                )

print("Original DataFrame")
head(data_frame)

# pick 5 random rows from each group of the data table
data_mod <- data_frame[, .SD[sample(x = .N, size = 5)], by = col1]
print("Modified DataFrame")
print (data_mod)

Output

[1] "Original DataFrame" 
   col1 col2
1:   G1    a 
2:   G1    b 
3:   G1    c 
4:   G1    d 
5:   G1    e 
6:   G1    a 
[1] "Modified DataFrame" 
    col1 col2
 1:   G1    a
 2:   G1    e
 3:   G1    d
 4:   G1    e
 5:   G1    a
 6:   G2    c
 7:   G2    c
 8:   G2    c
 9:   G2    d
10:   G2    e
11:   G3    b
12:   G3    e
13:   G3    d
14:   G3    d
15:   G3    d

yippeee25
