Take random sample based on groups in R
Last Updated: 18 Jul, 2021
The R programming language provides several packages for drawing random samples from data frames or data tables, with the sampling performed separately within each group.
Method 1: Using plyr library
The "plyr" package, which is used for data manipulation and summary statistics, must be installed and loaded into the workspace. Its ddply() function splits a data frame into subsets by the grouping variables, applies a function to each subset, and combines the results into a single data frame.
Syntax:
ddply( .data, .variables, .fun = NULL)
Parameters:
.data - the data frame to use
.variables - the grouping variables
.fun - the function applied to each group. Here, sample(nrow(x), y) draws y random row indices from a group x, and those indices are used to extract y rows from each group defined by the second argument of ddply().
Example:
R
# importing required libraries
library("plyr")
# create dataframe
data_frame<-data.frame(col1=c(rep('G1',50),rep('G2',50),rep('G3',50)),
col2=rep(letters[1:5],30)
)
print("Original DataFrame")
head(data_frame)
# pick 5 random rows from each group of the data frame
data_mod <- ddply(data_frame,.(col1),function(x) x[sample(nrow(x),5),])
print("Modified DataFrame")
print (data_mod)
Output
[1] "Original DataFrame"
col1 col2
1 G1 a
2 G1 b
3 G1 c
4 G1 d
5 G1 e
6 G1 a
[1] "Modified DataFrame"
col1 col2
1 G1 d
2 G1 e
3 G1 d
4 G1 a
5 G1 a
6 G2 b
7 G2 c
8 G2 d
9 G2 d
10 G2 e
11 G3 c
12 G3 e
13 G3 b
14 G3 b
15 G3 d
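The example above assumes every group has at least 5 rows; sample() raises an error when asked for more rows than a group contains. The following is a minimal sketch of one way to guard against that, and to make the draw reproducible with set.seed(). The data frame df, the sample size n, and the seed value are assumptions chosen for illustration, not part of the original example.

```r
library(plyr)

set.seed(42)  # illustrative seed so the same rows are drawn on every run

# hypothetical data with unequal group sizes: G1 has 50 rows, G2 has only 2
df <- data.frame(col1 = c(rep("G1", 50), rep("G2", 2)),
                 col2 = seq_len(52))

n <- 5  # requested rows per group

sampled <- ddply(df, .(col1), function(x) {
  k <- min(nrow(x), n)                        # never ask for more rows than exist
  x[sample.int(nrow(x), k), , drop = FALSE]   # draw k row indices without replacement
})

# G1 contributes 5 rows; G2 contributes both of its rows
print(table(sampled$col1))
```

sample.int() is used here instead of sample() only to make it explicit that row indices are being drawn.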
Method 2: Using dplyr library
The "dplyr" package, which is used for data manipulation, must be installed and loaded into the workspace. It offers a wide range of functions to filter, subset, and extract data based on conditions. Several operations can be chained on a data frame using the pipe operator (%>%).
The group_by() function divides the data into groups based on the values in the specified columns. The columns to group by are passed as arguments, and more than one column name may be given.
Syntax:
group_by(col1, col2, ...)
The sample_n() function is then applied to select random rows from the grouped data frame; its argument gives the number of rows to sample from each group.
Example:
R
# importing required libraries
library("dplyr")
# create dataframe
data_frame<-data.frame(col1=c(rep('G1',50),rep('G2',50),
rep('G3',50)),
col2=rep(letters[1:5],30)
)
print("Original DataFrame")
head(data_frame)
# pick 3 samples of each from data frame
data_mod <- data_frame %>% group_by(col1) %>% sample_n(3)
print("Modified DataFrame")
print (data_mod)
Output
[1] "Original DataFrame"
col1 col2
1 G1 a
2 G1 b
3 G1 c
4 G1 d
5 G1 e
6 G1 a
[1] "Modified DataFrame"
# A tibble: 9 x 2
# Groups: col1 [3]
col1 col2
<chr> <chr>
1 G1 d
2 G1 e
3 G1 c
4 G2 a
5 G2 a
6 G2 c
7 G3 b
8 G3 a
9 G3 a
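In dplyr 1.0.0 and later, sample_n() is marked as superseded by slice_sample(). The sketch below repeats the grouped sampling with slice_sample(), assuming a dplyr version that provides it; the seed value and the names by_count and by_share are illustrative.

```r
library(dplyr)

set.seed(42)  # illustrative seed for a reproducible draw

data_frame <- data.frame(col1 = rep(c("G1", "G2", "G3"), each = 50),
                         col2 = rep(letters[1:5], 30))

# fixed number of rows per group
by_count <- data_frame %>%
  group_by(col1) %>%
  slice_sample(n = 3) %>%
  ungroup()            # drop the grouping once sampling is done

# fixed proportion of rows per group (10% of 50 rows = 5 rows each)
by_share <- data_frame %>%
  group_by(col1) %>%
  slice_sample(prop = 0.1) %>%
  ungroup()

print(by_count)
print(by_share)
```

When sampling without replacement, slice_sample() returns all rows of a group that has fewer rows than requested, rather than raising an error.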
Method 3: Using data.table
The data.table package provides fast aggregation of large tabular data. It must be installed and loaded into the workspace.
Within each group defined by the "by" argument, .SD holds the subset of rows belonging to that group and .N holds the number of rows in it. Indexing .SD with sample(x = .N, size = n) therefore selects n random rows from each group. The result is returned as a data.table.
Syntax:
data_table[ , .SD[sample(x = .N, size = n)], by = group_col]
Example:
R
# importing required libraries
library("data.table")
# create dataframe
data_frame<-data.table(col1=c(rep('G1',50),rep('G2',50),
rep('G3',50)),
col2=rep(letters[1:5],30)
)
print("Original DataFrame")
head(data_frame)
# pick 5 random rows from each group of the data table
data_mod <- data_frame[, .SD[sample(x = .N, size = 5)], by = col1]
print("Modified DataFrame")
print (data_mod)
Output
[1] "Original DataFrame"
col1 col2
1: G1 a
2: G1 b
3: G1 c
4: G1 d
5: G1 e
6: G1 a
[1] "Modified DataFrame"
col1 col2
1: G1 a
2: G1 e
3: G1 d
4: G1 e
5: G1 a
6: G2 c
7: G2 c
8: G2 c
9: G2 d
10: G2 e
11: G3 b
12: G3 e
13: G3 d
14: G3 d
15: G3 d
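As with the plyr example, sample() fails here if a group has fewer rows than the requested size. A minimal sketch of one way to cap the sample at the group size follows; the table dt, the sample size n, and the seed value are assumptions made for illustration.

```r
library(data.table)

set.seed(42)  # illustrative seed for a reproducible draw

# hypothetical table with unequal group sizes: G1 has 50 rows, G2 has only 2
dt <- data.table(col1 = c(rep("G1", 50), rep("G2", 2)),
                 col2 = seq_len(52))

n <- 5  # requested rows per group

# min(.N, n) caps the sample at the number of rows available in the group
sampled <- dt[, .SD[sample.int(.N, min(.N, n))], by = col1]

# G1 contributes 5 rows; G2 contributes both of its rows
print(sampled[, .N, by = col1])
```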