
Writing Efficient R Code
Writing efficient code is important because it speeds up development and makes our programs easier to understand, debug, and maintain. We will discuss techniques such as benchmarking, vectorization, and parallel programming to make our R code faster. These techniques are worth learning if you aspire to be a data scientist. So, let's get started.
Benchmarking
One of the easiest optimizations is to use the latest version of R. A new version does not modify our existing code, but it usually ships with improved library functions that can reduce execution time.
The following command displays the version information of the R installation:
print(version)
Output
               _
platform       x86_64-pc-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
status         Patched
major          4
minor          2.2
year           2022
month          11
day            10
svn rev        83330
language       R
version.string R version 4.2.2 Patched (2022-11-10 r83330)
nickname       Innocent and Trusting
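To check the version from inside a script rather than by eye, a minimal sketch along the following lines may help. It relies on the base R function getRversion() and the R.version.string variable; the threshold "4.0.0" is only an illustrative assumption, not a requirement stated in this article.

```r
# Human-readable version string of the running R installation
print(R.version.string)

# getRversion() returns a version object that compares correctly
# against a version written as a string
minimumVersion <- "4.0.0"   # illustrative threshold (assumption)
isRecent <- getRversion() >= minimumVersion

if (isRecent) {
   message("R is at least version ", minimumVersion)
} else {
   message("Consider upgrading R; found ", as.character(getRversion()))
}
```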
Reading a CSV file as an RDS file
Loading a large file with read.csv() can take a long time, because the text has to be parsed on every read. A more efficient approach is to read the .csv file once, save the resulting data in the binary .rds format, and read the .rds file from then on. R provides the saveRDS() function to save an R object in .rds format and the readRDS() function to read it back.
Example
Consider the following program, which benchmarks the time taken to read the same data from the two formats:
# URL of the sample CSV file
csvFile <- "https://people.sc.fsu.edu/~jburkardt/data/csv/snakes_count_10000.csv"

# Display the time taken to read the file using read.csv()
print(system.time(myData <- read.csv(csvFile)))

# Save the data (not the URL string) in .rds format
saveRDS(myData, "myFile.rds")

# Display the time taken to read the binary file
print(system.time(readRDS("myFile.rds")))
Output
   user  system elapsed
  0.017   0.002   0.603
   user  system elapsed
      0       0       0
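Notice the difference between the two timings. In this run, reading the .rds file took a negligible amount of time compared with read.csv(), whose elapsed time here also includes fetching the file from the URL. Exact figures vary between machines, but reading an RDS file is generally more efficient than parsing the same data from a CSV file.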
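To take the network out of the comparison, the same idea can be sketched with local temporary files. The data frame below is hypothetical sample data invented for illustration, and the row count is an arbitrary assumption; timings will differ from machine to machine.

```r
# Build a sample data frame (hypothetical data, for illustration only)
set.seed(1)
sampleData <- data.frame(
   id    = 1:50000,
   group = sample(letters, 50000, replace = TRUE),
   count = sample(1:100, 50000, replace = TRUE)
)

# Write the same data in both formats to temporary files
csvPath <- tempfile(fileext = ".csv")
rdsPath <- tempfile(fileext = ".rds")
write.csv(sampleData, csvPath, row.names = FALSE)
saveRDS(sampleData, rdsPath)

# Time both reads and keep the results for comparison
print(system.time(fromCsv <- read.csv(csvPath)))
print(system.time(fromRds <- readRDS(rdsPath)))

# Both reads should recover the same data
print(all.equal(fromCsv, fromRds))

# Remove the temporary files
unlink(c(csvPath, rdsPath))
```

Because both files live on the local disk, any remaining gap between the two timings comes from parsing text versus reading the binary format.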
Assigning using "<-" and "=" operators
R provides several assignment operators; the two most widely used are "<-" and "=". They behave differently inside a function call: there, "=" is read as naming an argument, whereas "<-" creates a new object or overwrites an existing one in the calling environment. Since we usually want to keep the result of the expression being timed, the "<-" operator is the one to use inside the system.time() function.
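The difference can be seen in a short sketch. The variable names total and total2 are placeholders chosen for illustration; the tryCatch() wrapper is only there so that the failing call does not stop the script.

```r
# "<-" inside the call assigns in the calling environment,
# so the timed result is still available afterwards
timing <- system.time(total <- sum(sqrt(1:1e6)))
print(timing)
print(total)

# "=" inside the call is treated as an argument name; system.time()
# has no argument called 'total2', so the call is expected to fail
outcome <- tryCatch({
   system.time(total2 = sum(sqrt(1:1e6)))
   "no error"
}, error = function(e) "error")
print(outcome)

# No object named 'total2' was created by the failed call
print(exists("total2"))
```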
Comparing elapsed time with the microbenchmark() function
The system.time() function is reliable for measuring the time taken by a single operation, but it runs the operation only once and cannot compare several operations in one call.
The microbenchmark package provides the microbenchmark() function, which runs each expression repeatedly and lets us compare the time taken by two or more functions or operations.
Example
Consider the following program, which uses the microbenchmark() function to compare reading the same data from two different formats, CSV and RDS:
library(microbenchmark)

# URL of the sample CSV file
csvFile <- "https://people.sc.fsu.edu/~jburkardt/data/csv/snakes_count_10000.csv"

# Save the data (not the URL string) in .rds format
saveRDS(read.csv(csvFile), "myFile.rds")

# Compare both reads using the microbenchmark() function
difference <- microbenchmark(read.csv(csvFile),
                             readRDS("myFile.rds"),
                             times = 2)

# Display the time difference
print(difference)
Output
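        min         lq       mean     median         uq        max neval
 405062.028 405062.028 409947.146 409947.146 414832.264 414832.264     2
     41.151     41.151    102.355    102.355    163.559    163.559     2
Notice the difference between the execution times of the two methods. The rows follow the order of the expressions passed to microbenchmark(), so the first row belongs to read.csv() and the second to readRDS().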
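When the expressions are long, naming them makes the result easier to read. The sketch below assumes the microbenchmark package may or may not be installed and therefore guards the call; the labels sqrtFunction and powerOperator and the input vector are illustrative choices, not part of the examples above.

```r
# Sample input (hypothetical data, for illustration only)
x <- runif(1000)

if (requireNamespace("microbenchmark", quietly = TRUE)) {
   # Named arguments become the labels in the 'expr' column
   comparison <- microbenchmark::microbenchmark(
      sqrtFunction  = sqrt(x),
      powerOperator = x^0.5,
      times = 100
   )
   print(comparison)

   # summary() returns a data frame; 'unit' fixes the reporting unit
   print(summary(comparison, unit = "us")[, c("expr", "median")])
} else {
   message("The microbenchmark package is not installed")
}
```

A larger value of times gives steadier estimates than the two repetitions used for the slow file reads.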
Efficient Vectorisation
Growing a vector step by step while the code runs is undesirable and should be avoided wherever possible. Each time the vector grows, R has to allocate a new, larger vector and copy the old elements into it, which wastes time and makes the program inefficient.
Example
For example, the following source code grows a vector by one element in every iteration:
# expand() function: grows the vector inside the loop
expand <- function(n) {
   myVector <- NULL
   for(currentNumber in 1:n)
      myVector <- c(myVector, currentNumber)
   myVector
}

# Using system.time() function
system.time(res_grow <- expand(1000))
Output
   user  system elapsed
  0.003   0.000   0.003
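As you can see in the output, the expand() function needs a measurable amount of time even for only 1,000 elements, and the cost rises quickly for larger n because every iteration copies the whole vector.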
Example
We can optimize the above code by preallocating the vector to its final size and filling it in place. For example, consider the following program:
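# expand() function: preallocates the vector, then fills it
expand <- function(n) {
   myVector <- numeric(n)
   for(currentNumber in 1:n)
      myVector[currentNumber] <- currentNumber
   # Return the filled vector (the for loop itself returns NULL)
   myVector
}

# Using system.time() function
system.time(res_grow <- expand(10000))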
Output
   user  system elapsed
  0.001   0.000   0.001
As you can see in the output, the execution time has dropped drastically, even though this call fills 10,000 elements, ten times as many as the previous example.
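For a like-for-like comparison, both approaches can be timed on the same input size. The following is a minimal sketch; the function names growVector and fillVector and the size n are illustrative assumptions.

```r
# Grow the vector one element at a time (copies on every iteration)
growVector <- function(n) {
   result <- NULL
   for (i in seq_len(n)) result <- c(result, i)
   result
}

# Preallocate the full length, then fill in place
fillVector <- function(n) {
   result <- integer(n)
   for (i in seq_len(n)) result[i] <- i
   result
}

n <- 20000   # illustrative size (assumption)
print(system.time(grown  <- growVector(n)))
print(system.time(filled <- fillVector(n)))

# Both approaches should build the same vector as seq_len(n)
print(identical(grown, filled))
print(identical(filled, seq_len(n)))
```

For a simple sequence like this one, seq_len(n) alone produces the result without any loop.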
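We should vectorize our code whenever possible: a vectorized operation works on the whole vector at once instead of processing one element per loop iteration.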
Example
For example, consider the following program, which adds each value in a vector to itself using a simple loop:
# Initialize a vector with random values
myVector1 <- rnorm(20)

# Declare another vector
myVector2 <- numeric(length(myVector1))

# Compute the sum
for(index in 1:20)
   myVector2[index] <- myVector1[index] + myVector1[index]

# Display
print(myVector2)
Output
 [1]  1.31044581 -1.98035551  0.14009657 -1.62789103  1.23248277  0.49893302
 [7] -0.53349928 -0.02553238 -0.06886832  1.16296981  0.90072271  0.20713797
[13] -1.72293906  0.62083278  2.77900829  4.15732558  1.71227621  2.09531955
[19] -0.06520153  0.62591177
Each value in the output is the sum of the corresponding element of myVector1 with itself.
Example
The following program does the same thing, but this time we use vectorization, which reduces the code size and decreases the execution time:
myVector1 <- rnorm(20)
myVector2 <- numeric(length(myVector1))

# Add using vectorization
myVector2 <- myVector1 + myVector1

# Display
print(myVector2)
Output
 [1] -1.0100098  3.2932186 -3.5650312 -3.2800819  0.1513545 -1.5786916
 [7]  2.0485566  2.6009810 -0.8015987 -0.6965471 -1.4298714  1.1251865
[13]  1.2536663  2.6258258  1.1093443 -1.7895628  0.3472878 -1.4783578
[19] -0.7717328 -2.2734743
The output again holds the sum of each vector value with itself, this time computed with vectorization. The numbers differ from the previous output only because rnorm() draws new random values on every run.
Note that we can apply vectorization with R's built-in functions as well.
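With only 20 values the speed difference is too small to see, so the sketch below repeats the comparison on a larger vector. The vector length and the fixed seed are illustrative assumptions; the seed ensures that both versions work on the same input.

```r
set.seed(42)             # fixed seed so both versions see the same input
values <- rnorm(1e6)     # illustrative size (assumption)

# Loop version: one element per iteration
loopTime <- system.time({
   doubledLoop <- numeric(length(values))
   for (index in seq_along(values))
      doubledLoop[index] <- values[index] + values[index]
})

# Vectorized version: the whole vector in one operation
vectorTime <- system.time(doubledVector <- values + values)

print(loopTime)
print(vectorTime)

# Both versions should produce exactly the same numbers
print(identical(doubledLoop, doubledVector))
```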
Example
For example, consider the following program, which computes the logarithm of each value in a vector using a loop:
myVector1 <- c(8, 10, 13, 16, 32, 64, 57, 88, 100, 110)
myVector2 <- numeric(length(myVector1))

# Compute the logarithm of each element
for(index in 1:10)
   myVector2[index] <- log(myVector1[index])

# Display the vector
print(myVector2)
Output
 [1] 2.079442 2.302585 2.564949 2.772589 3.465736 4.158883 4.043051 4.477337
 [9] 4.605170 4.700480
As you can see in the output, the logarithm of each vector value has been displayed.
Example
Now let us achieve the same result using vectorization:
myVector1 <- c(8, 10, 13, 16, 32, 64, 57, 88, 100, 110)
myVector2 <- numeric(length(myVector1))

# Compute the logarithm using vectorization
myVector2 <- log(myVector1)

# Display
print(myVector2)
Output
 [1] 2.079442 2.302585 2.564949 2.772589 3.465736 4.158883 4.043051 4.477337
 [9] 4.605170 4.700480
As you can see in the output, the logarithm of each vector value has been displayed again and matches the loop version, but this time we have used vectorization.
Example
A matrix, whose elements all share the same data type, offers faster column access than a dataframe. For example, consider the following program:
library(microbenchmark)

# Create a matrix
myMatrix <- matrix(c(1:12), nrow = 4, byrow = TRUE)

# Display
print(myMatrix)

# Create columns
data1 <- c(1, 4, 7, 10)
data2 <- c(2, 5, 8, 11)
data3 <- c(3, 6, 9, 12)

# Create a dataframe
myDataframe <- data.frame(data1, data2, data3)

# Benchmark column access for the matrix and the dataframe
print(microbenchmark(myMatrix[,1], myDataframe[,1]))
Output
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
Unit: nanoseconds
             expr  min     lq    mean median   uq   max neval
    myMatrix[, 1]  493  525.0  669.64  595.5  661  5038   100
 myDataframe[, 1] 6880 7110.5 8003.56 7247.0 7437 53752   100
You can spot the difference in execution time: in this run the median column access takes about 0.6 microseconds for the matrix and about 7.2 microseconds for the dataframe, roughly twelve times longer.
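When a dataframe holds only numeric columns, it can be converted once with as.matrix() and then accessed as a matrix. The sketch below uses base R only and repeats each access many times so that system.time() can measure it; the data and the repetition count are illustrative assumptions.

```r
# Hypothetical numeric dataframe with three columns
set.seed(7)
measurements <- data.frame(a = rnorm(1000), b = rnorm(1000), c = rnorm(1000))

# All columns are numeric, so the values are stored unchanged in a matrix
measurementMatrix <- as.matrix(measurements)

repetitions <- 50000   # illustrative repetition count (assumption)

# Repeat the column access many times so system.time() can measure it
print(system.time(for (i in seq_len(repetitions)) measurements[, 1]))
print(system.time(for (i in seq_len(repetitions)) measurementMatrix[, 1]))

# The extracted column holds the same values either way
print(identical(unname(measurementMatrix[, 1]), measurements[, 1]))
```

The conversion only makes sense for columns of a single type; mixing numbers and text would coerce every element of the matrix to character.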
Parallel Programming for efficient R code
R provides the parallel package, with which we can write efficient R code. Parallelism often helps to get work done in less time and to make proper use of the system resources. The parallel package provides the parApply() function, and a program that uses it runs in parallel through the following steps:
Make a cluster using the makeCluster() function.
Run the statements that should execute in parallel, for example a call to parApply().
Eventually, stop the cluster using the stopCluster() function.
Example
The following source code calculates the mean of all the columns using the parApply() function in R:
library(parallel)

# Create columns
data1 <- c(1, 4, 7, 10)
data2 <- c(2, 5, 8, 11)
data3 <- c(3, 6, 9, 12)

# Create a dataframe
myDataframe <- data.frame(data1, data2, data3)

# Create a cluster
cluster <- makeCluster(2)

# Apply parApply() function
print(parApply(cluster, myDataframe, 2, mean))

# Stop the cluster
stopCluster(cluster)
Output
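data1 data2 data3
  5.5   6.5   7.5
As you can see in the output, the mean of each column has been computed in parallel. For a dataframe this small the cost of starting the cluster outweighs any gain; parallel execution becomes faster when each task involves substantially more work.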
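The three steps can also be sketched on a workload where parallelism has a chance to pay off. The function slowSquare below is a hypothetical stand-in for real work, with an artificial delay added as an assumption; parLapply() is another function from the same parallel package. The tryCatch() wrapper makes sure the cluster is stopped even if the parallel call fails.

```r
library(parallel)

# Hypothetical task: a deliberately slow function applied to many inputs
slowSquare <- function(x) {
   Sys.sleep(0.01)   # stand-in for real work (assumption)
   x^2
}
inputs <- 1:40

# Sequential baseline
print(system.time(sequentialResult <- lapply(inputs, slowSquare)))

# Step 1: make a cluster with two workers
cluster <- makeCluster(2)

# Step 2: run the work in parallel; 'finally' guarantees step 3
parallelResult <- tryCatch({
   print(system.time(result <- parLapply(cluster, inputs, slowSquare)))
   result
}, finally = stopCluster(cluster))   # Step 3: stop the cluster

# Both runs should return the same values
print(identical(sequentialResult, parallelResult))
```

The time needed to start the workers is not included in the second measurement, so it should be weighed against the saving for short jobs.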
Conclusion
In this article, we briefly discussed how to write efficient code in R. We covered benchmarking, different vectorization techniques, and parallel programming. I hope this tutorial has helped you expand your knowledge in the field of data science.