data.table is an R package for handling and manipulating large datasets. It provides fast operations for creating, modifying, grouping and summarizing data, and it is often faster than alternatives such as dplyr on large datasets.
1. Creating and Sub-Setting Data
We can either convert an existing data frame or create a new data.table object directly with the data.table() function. Rows are then subset by placing a condition inside the square brackets.
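library(data.table)
# Create a data.table with a numeric, a character and a logical column
DT <- data.table(x = c(1, 2, 3, 4),
                 y = c("A", "B", "C", "D"),
                 z = c(TRUE, FALSE, TRUE, FALSE))
print(DT)
# Keep only the rows where x is greater than 2
subset_DT <- DT[x > 2]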
print(subset_DT)
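The text above mentions converting an existing data frame, which the example does not show. The following is a minimal sketch of that step; the data frame df and its columns are hypothetical and exist only for this illustration.

```r
library(data.table)

# Hypothetical data frame used only for this illustration
df <- data.frame(id = 1:3, score = c(90, 75, 60))

# as.data.table() returns a new data.table and leaves df unchanged
dt_copy <- as.data.table(df)

# setDT() converts df itself to a data.table by reference (no copy is made)
setDT(df)

# i selects rows, j selects columns: rows with score > 70, columns id and score
high <- dt_copy[score > 70, .(id, score)]
print(high)
```

setDT() is usually the better choice for large objects because it avoids copying, while as.data.table() is safer when the original data frame must stay untouched.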

2. Grouping the Data
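We can group data by one or more columns and compute sums, averages and other summaries for each group. In this table every value of y occurs once, so each group contains a single row.
# Sum x within each group of y and name the result column sum_x
grouped_DT <- DT[, .(sum_x = sum(x)), by = y]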
print(grouped_DT)
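Grouping is easier to see when groups contain more than one row. The sketch below uses a hypothetical sales table (not part of the original example) to compute several summaries per group in one call.

```r
library(data.table)

# Hypothetical sales table with repeated groups
sales <- data.table(
  region = c("East", "East", "West", "West", "West"),
  amount = c(10, 20, 5, 15, 40)
)

# .() names the result columns; .N is the number of rows in each group
summary_DT <- sales[, .(total = sum(amount), average = mean(amount), n = .N),
                    by = region]
print(summary_DT)
```

Groups appear in the order in which they first occur in the data, so East comes before West here.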

3. Joining the Data
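We can merge datasets on a common column. By default, DT[DT2, on = "y"] keeps every row of DT2 (a right join); adding nomatch = NULL drops unmatched rows and makes it an inner join.
DT2 <- data.table(y = c("A", "B", "C", "D"), v = c("alpha", "beta", "gamma", "delta"))
# Inner join on y: keep only the rows whose y appears in both tables
inner_join_DT <- DT[DT2, on = "y", nomatch = NULL]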
print(inner_join_DT)
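Because every key matches in the example above, the join type makes no visible difference there. The sketch below uses two hypothetical tables with partly overlapping keys to show how the variants differ.

```r
library(data.table)

# Hypothetical tables whose keys only partly overlap
left  <- data.table(y = c("A", "B", "C"), x = 1:3)
right <- data.table(y = c("B", "C", "D"), v = c("beta", "gamma", "delta"))

# X[Y, on = ...] keeps every row of Y; unmatched rows get NA in X's columns
right_join <- left[right, on = "y"]

# nomatch = NULL drops the unmatched rows, giving an inner join
inner_join <- left[right, on = "y", nomatch = NULL]

# merge() offers the all / all.x / all.y interface; its default is an inner join
merged <- merge(left, right, by = "y")

print(right_join)
print(inner_join)
```

The bracket syntax is convenient when the join is combined with further filtering or aggregation, while merge() may read more naturally to users coming from base R.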

4. Modifying the Data
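We can modify data by adding, updating or removing columns. The := operator changes the data.table in place (by reference), so the result does not need to be assigned back to DT.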
DT[, x_squared := x^2]
print(DT)
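The same operator also covers updating and removing columns. The sketch below illustrates this on a hypothetical prices table; the 1.2 multiplier is an arbitrary value chosen for the example.

```r
library(data.table)

# Hypothetical table for illustration
prices <- data.table(item = c("pen", "book", "bag"), price = c(2, 12, 30))

# Keep an independent copy first, because := changes prices in place
original <- copy(prices)

# Add two columns in a single call (1.2 is an arbitrary example multiplier)
prices[, `:=`(with_tax = price * 1.2, expensive = price > 10)]

# Update only the rows selected by i
prices[item == "pen", price := 3]

# Assigning NULL removes a column
prices[, expensive := NULL]

print(prices)
print(original)
```

copy() matters here: a plain assignment such as original <- prices would point to the same table, and later := calls would change both.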

5. Comparison with dplyr Package
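While dplyr is widely used, data.table is often faster on large datasets. We can use the microbenchmark package to compare execution times. Note that DT has only four rows here, so these timings demonstrate the method rather than large-data performance.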
# Install microbenchmark only if it is not already available
if (!requireNamespace("microbenchmark", quietly = TRUE)) {
  install.packages("microbenchmark")
}
library(microbenchmark)
library(dplyr)
# Time the dplyr pipeline over 10 runs
dplyr_time <- microbenchmark(
  DT %>% filter(x > 2) %>% group_by(y) %>% summarise(sum_x = sum(x)),
  times = 10
)
print(dplyr_time)
# Time the equivalent data.table expression over 10 runs
data.table_time <- microbenchmark(
  DT[x > 2, .(sum_x = sum(x)), by = y],
  times = 10
)
print(data.table_time)

The output displays the execution time of the dplyr and data.table operations, including the minimum, median and maximum times across 10 runs.
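To get timings that say something about larger inputs, the same comparison can be run on synthetic data. The sketch below is one possible setup, not a definitive benchmark: the row count, the column contents and the names big_DT and big_df are assumptions made for illustration, and results will vary with hardware and package versions.

```r
library(data.table)
library(dplyr)
library(microbenchmark)

set.seed(42)                       # make the synthetic data reproducible
n <- 1e6                           # assumed size; adjust to available memory
big_DT <- data.table(
  x = runif(n, min = 0, max = 4),
  y = sample(LETTERS, n, replace = TRUE)
)
big_df <- as.data.frame(big_DT)    # plain data frame for the dplyr pipeline

# Time both approaches in one call so the results share a summary table
timings <- microbenchmark(
  dplyr      = big_df %>% filter(x > 2) %>% group_by(y) %>% summarise(sum_x = sum(x)),
  data.table = big_DT[x > 2, .(sum_x = sum(x)), by = y],
  times = 5
)
print(timings)
```

Naming the expressions inside a single microbenchmark() call labels each row of the summary, which makes the two approaches easier to compare side by side.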