Data Manipulation in R with data.table

Last Updated : 24 Jun, 2025

data.table is an R package for handling and manipulating large datasets. It supports fast creation, modification, grouping and summarizing of data, and is often faster than tools such as dplyr on large datasets.

1. Creating and Subsetting Data

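We can either convert an existing data frame with as.data.table() or setDT(), or create a new data.table object directly with the data.table() function. Rows are then subset by writing a condition inside the square brackets.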

R
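library(data.table)

# Create a data.table directly from column vectors
DT <- data.table(x = c(1, 2, 3, 4),
                 y = c("A", "B", "C", "D"),
                 z = c(TRUE, FALSE, TRUE, FALSE))
print(DT)

# Keep only the rows where x is greater than 2
subset_DT <- DT[x > 2]
print(subset_DT)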

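To illustrate the conversion route mentioned above, here is a minimal sketch. The data frame `df` and its columns `id` and `score` are hypothetical and not part of the article's data. It assumes only the documented data.table functions `as.data.table()` and `setDT()`.

```r
library(data.table)

# Hypothetical example data frame (not from the article)
df <- data.frame(id = 1:4, score = c(90, 75, 60, 85))

# as.data.table() returns a new data.table and leaves df unchanged
dt_copy <- as.data.table(df)

# setDT() converts df itself into a data.table by reference (no copy is made)
setDT(df)

# Filter rows in i and select columns in j with .()
high_scores <- dt_copy[score > 70, .(id, score)]
print(high_scores)
```

Choosing `setDT()` avoids copying a large data frame, while `as.data.table()` is the safer option when the original object must stay a plain data frame.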

2. Grouping the Data

We can group data by one or more columns with the by argument and compute summaries such as sums or averages for each group.

R
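# Sum x within each value of y; naming the result avoids the default column name V1
# (every y value is unique here, so each group contains a single row)
grouped_DT <- DT[, .(sum_x = sum(x)), by = y]
print(grouped_DT)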

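Grouping is easier to see when group values repeat. The following sketch uses a hypothetical `sales` table (the names `region` and `amount` are assumptions for illustration) and computes several summaries per group, including the row count through the special symbol `.N`.

```r
library(data.table)

# Hypothetical data with repeated group values (not from the article)
sales <- data.table(region = c("East", "East", "West", "West", "West"),
                    amount = c(10, 20, 5, 15, 25))

# Several named aggregates per group; .N is the number of rows in each group
summary_dt <- sales[, .(total = sum(amount),
                        average = mean(amount),
                        n = .N),
                    by = region]
print(summary_dt)
```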

3. Joining the Data

We can join two data.table objects on a common column using the X[Y, on = ...] syntax, for example to perform an inner join.

R
DT2 <- data.table(y = c("A", "B", "C", "D"), v = c("alpha", "beta", "gamma", "delta"))

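# DT[DT2, on = "y"] alone keeps every row of DT2 (a right join);
# nomatch = NULL drops rows without a match, which makes it an inner join
inner_join_DT <- DT[DT2, on = "y", nomatch = NULL]
print(inner_join_DT)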

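The difference between join types only shows up when some keys do not match. The sketch below uses two hypothetical tables, `left` and `right`, that share only part of their keys. It assumes the documented `nomatch` argument of `[.data.table` and the `merge()` method for data.table objects.

```r
library(data.table)

# Hypothetical tables that share only the keys "B" and "C"
left  <- data.table(y = c("A", "B", "C"), x = 1:3)
right <- data.table(y = c("B", "C", "D"), v = c("beta", "gamma", "delta"))

# Default behaviour: keep every row of `right`; x is NA where `left` has no match
right_join <- left[right, on = "y"]

# nomatch = NULL drops the unmatched rows, giving an inner join
inner_join <- left[right, on = "y", nomatch = NULL]

# merge() is an alternative; its default (all = FALSE) is also an inner join
inner_merge <- merge(left, right, by = "y")

print(right_join)
print(inner_join)
```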

4. Modifying the Data

We can add, update or remove columns with the := operator, which modifies the data.table in place (by reference) instead of creating a copy.

R
# Add a new column x_squared to DT by reference; no reassignment is needed
DT[, x_squared := x^2]
print(DT)

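Updating and removing columns follow the same pattern as adding one. The sketch below uses a hypothetical `prices` table; the tax factor of 1.1 and the discount of 1 are arbitrary values chosen for illustration. It also uses `copy()` to keep an untouched version, since `:=` changes the table in place.

```r
library(data.table)

# Hypothetical data (not from the article)
prices <- data.table(item = c("pen", "book", "bag"), price = c(2, 12, 30))

# copy() makes an independent copy; plain assignment would share the same data
original <- copy(prices)

# Add a column (the factor 1.1 is an arbitrary illustrative value)
prices[, price_with_tax := price * 1.1]

# Update only the rows that match the condition in i
prices[price > 10, price := price - 1]

# Remove a column by assigning NULL
prices[, price_with_tax := NULL]

print(prices)
print(original)
```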

5. Comparison with dplyr Package

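While dplyr is widely used, data.table is often faster on large datasets. We can use the microbenchmark package to compare the execution times of equivalent operations. Note that DT has only four rows here, so these timings mostly reflect per-call overhead rather than performance on large data.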

R
# Install microbenchmark if it is not already available
if (!requireNamespace("microbenchmark", quietly = TRUE)) {
  install.packages("microbenchmark")
}

library(microbenchmark)
library(dplyr)

# Time the same filter-and-aggregate task with each package, 10 runs each
dplyr_time <- microbenchmark(
  DT %>% filter(x > 2) %>% group_by(y) %>% summarise(sum_x = sum(x)),
  times = 10
)
print(dplyr_time)

data.table_time <- microbenchmark(
  DT[x > 2, .(sum_x = sum(x)), by = y],
  times = 10
)
print(data.table_time)


The output shows the distribution of execution times for the dplyr and data.table operations across 10 runs, including the minimum, median and maximum.
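A comparison on a larger table is more informative than one on four rows. The following is a minimal sketch, assuming a hypothetical one-million-row table `big_dt` of random values. The benchmark only runs if both microbenchmark and dplyr are installed, and the timings depend on the machine and package versions, so no result is claimed here.

```r
library(data.table)

set.seed(42)  # make the random data reproducible
n <- 1e6      # hypothetical table size chosen for illustration

big_dt <- data.table(x = sample(1:100, n, replace = TRUE),
                     y = sample(letters, n, replace = TRUE))

# Run the comparison only when both optional packages are available
if (requireNamespace("microbenchmark", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  timings <- microbenchmark::microbenchmark(
    dplyr = big_dt %>% filter(x > 50) %>% group_by(y) %>% summarise(sum_x = sum(x)),
    data.table = big_dt[x > 50, .(sum_x = sum(x)), by = y],
    times = 10
  )
  print(timings)
}
```

Putting both expressions in a single microbenchmark() call prints them side by side in the same units, which makes the two rows directly comparable.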

