Writing Efficient R Code


Writing efficient code is important: it shortens the time we spend waiting for results and, done well, keeps our programs easy to understand, debug and maintain. In this article we discuss techniques such as benchmarking, vectorization and parallel programming that can make R code faster. These techniques are worth learning if you aspire to be a data scientist. So, let's get started.

Benchmarking

One of the easiest optimizations is to run the latest version of R. A new version does not change our existing code, but it often ships with improved library functions that run faster.

The following command displays the version information of the installed R:


print(version)

Output

               _                                          
platform       x86_64-pc-linux-gnu                        
arch           x86_64                                     
os             linux-gnu                                  
system         x86_64, linux-gnu                          
status         Patched                                    
major          4                                          
minor          2.2                                        
year           2022                                       
month          11                                         
day            10                                         
svn rev        83330                                      
language       R                                          
version.string R version 4.2.2 Patched (2022-11-10 r83330)
nickname       Innocent and Trusting
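
The version can also be checked from code. The following is a minimal sketch, assuming a minimum version of "4.0.0", a threshold picked only for illustration and not taken from the output above:

```r
# Compare the running R version against a minimum version.
# The threshold "4.0.0" is an arbitrary value chosen for illustration.
minimum_version <- "4.0.0"
current_version <- getRversion()

if (current_version < minimum_version) {
   message("Consider upgrading: running R ", as.character(current_version))
} else {
   message("R ", as.character(current_version),
           " meets the minimum of ", minimum_version)
}
```

getRversion() returns a version object, so the comparison is made component by component rather than as plain text.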

Reading a CSV file as RDS file

Loading large files with read.csv() can take a lot of time. A more efficient approach is to read the .csv file once, save the resulting object in .rds format, and read that binary file from then on. R provides the saveRDS() function to save an R object in .rds format and the readRDS() function to read it back.

Example

Consider the following program that benchmarks the difference between the reading time of the same data stored in two different formats:

# Display the time taken to read the file using read.csv()
csvUrl <- "https://people.sc.fsu.edu/~jburkardt/data/csv/snakes_count_10000.csv"
print(system.time(myData <- read.csv(csvUrl)))

# Save the data frame (not the URL string) in .rds format
saveRDS(myData, "myFile.rds")

# Display the time taken to read in binary format
print(system.time(readRDS("myFile.rds")))

Output

 user   system  elapsed 
0.017    0.002    0.603 
 user   system  elapsed 
    0        0        0

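Notice the difference between the two timings. Reading the .rds file takes a negligible amount of time, whereas the elapsed time for read.csv() is far larger. Note that the CSV file is read from a URL here, so its elapsed time also includes the download. Exact timings vary between machines, but reading a local RDS file is generally faster than parsing a CSV file.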
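
Because the benchmark above reads the CSV file over the network, a comparison on local files isolates the cost of parsing. The following is a minimal sketch, assuming a made-up data frame (sampleData) written to temporary files; the names and sizes are illustrative only:

```r
# Build a sample data frame (hypothetical data, for illustration only)
set.seed(1)
sampleData <- data.frame(id = 1:100000, value = rnorm(100000))

# Write the same data to temporary .csv and .rds files
csvFile <- tempfile(fileext = ".csv")
rdsFile <- tempfile(fileext = ".rds")
write.csv(sampleData, csvFile, row.names = FALSE)
saveRDS(sampleData, rdsFile)

# Time both reads from local disk
print(system.time(fromCsv <- read.csv(csvFile)))
print(system.time(fromRds <- readRDS(rdsFile)))

# Check that the .rds round trip preserved the data
print(all.equal(fromRds, sampleData))

# Remove the temporary files
unlink(c(csvFile, rdsFile))
```

Timings depend on the machine and disk, so no figures are shown here.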

Assigning using "<-" and "=" operators

R provides several ways to assign values to objects. Two operators are widely used for this purpose: "<-" and "=". The difference matters inside a function call: there, "<-" still performs an assignment, creating a new object or overwriting an existing one in the calling environment, whereas "=" is interpreted as naming an argument. Since we usually want to keep the result of the operation being timed, "<-" is the operator to use inside system.time().
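
A minimal sketch of the difference follows; the names timing and total are chosen here for illustration:

```r
# "<-" inside the call: the expression is timed and its value is kept
timing <- system.time(total <- sum(sqrt(1:1000000)))
print(timing)
print(total)

# "=" inside the call would be parsed as a named argument instead:
# system.time(total = sum(sqrt(1:1000000)))   # fails: unused argument
```

After the call, total is available in the workspace even though it was assigned inside system.time().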

Comparing elapsed time with the microbenchmark() function

The system.time() function is reliable for timing a single operation, but it runs the operation only once and cannot compare several operations side by side.

The microbenchmark package, which is installed separately from CRAN, provides the microbenchmark() function. It runs each expression several times and lets us compare the time taken by two or more functions or operations.

Example

Consider the following program that uses the microbenchmark() function to compare reading the same data stored in two different formats, CSV and RDS:


library(microbenchmark)

# Read the data once and save it in .rds format
csvUrl <- "https://people.sc.fsu.edu/~jburkardt/data/csv/snakes_count_10000.csv"
saveRDS(read.csv(csvUrl), "myFile.rds")

# Compare using microbenchmark() function
difference <- microbenchmark(read.csv(csvUrl),
                             readRDS("myFile.rds"),
                             times = 2)

# Display the time difference
print(difference)

Output

        min         lq       mean     median         uq        max neval
 405062.028 405062.028 409947.146 409947.146 414832.264 414832.264     2
     41.151     41.151    102.355    102.355    163.559    163.559     2

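Notice the difference between the two rows: the first corresponds to read.csv() and the second to readRDS(). With times = 2 each expression is evaluated only twice, so treat the figures as a rough indication; a larger value of times gives more stable results.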

Efficient Vectorisation

Growing a vector step by step as the code runs is undesirable and should be avoided whenever possible. Each time the vector grows, R has to allocate a new, larger vector and copy the existing elements into it, which wastes time and makes the program inefficient.

Example

For example, the following source code grows a vector one element at a time:

# expand() function
expand <- function(n) {
   myVector <- NULL
   for(currentNumber in 1:n)
      myVector <- c(myVector, currentNumber)
    
   myVector
}

# Using system.time() function
system.time(res_grow <- expand(1000))

Output

 user  system elapsed 
0.003   0.000   0.003

The time is small here because n is only 1000, but the cost of growing a vector rises quickly as n increases.

Example

We can optimize the above code by preallocating the vector. For example, consider the following program:

# expand() function with a preallocated vector
expand <- function(n) {
   myVector <- numeric(n)
   for(currentNumber in 1:n)
      myVector[currentNumber] <- currentNumber

   # Return the filled vector
   myVector
}

# Using system.time() function
system.time(res_grow <- expand(10000))

Output

  user  system elapsed 
 0.001   0.000   0.001

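As you can see in the output, the preallocated version takes less time even though it handles ten times as many elements (10000 instead of 1000).

We should also vectorize our code whenever possible, that is, operate on whole vectors at once instead of looping over individual elements.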
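
To compare the approaches on equal terms, the sketch below times them with the same n. The function names grow(), prealloc() and vectorized() and the value of n are assumptions made for this illustration:

```r
n <- 20000

# Approach 1: grow the vector one element at a time
grow <- function(n) {
   v <- NULL
   for (i in seq_len(n)) v <- c(v, i)
   v
}

# Approach 2: preallocate the vector, then fill it
prealloc <- function(n) {
   v <- integer(n)
   for (i in seq_len(n)) v[i] <- i
   v
}

# Approach 3: no loop at all
vectorized <- function(n) seq_len(n)

print(system.time(r1 <- grow(n)))
print(system.time(r2 <- prealloc(n)))
print(system.time(r3 <- vectorized(n)))

# All three approaches should produce the same vector
print(identical(r1, r2) && identical(r2, r3))
```

Increasing n widens the gap, since the growing version copies the whole vector on every iteration.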

Example

For example, consider the following program that adds the values in a vector using a simple loop:

# Initialize a vector with random values
myVector1 <- rnorm(20)

# Declare another vector 
myVector2 <- numeric(length(myVector1))

# Add each element to itself
for(index in seq_along(myVector1))
   myVector2[index] <- myVector1[index] + myVector1[index]

# Display
print(myVector2)

Output

 [1]   1.31044581 -1.98035551  0.14009657 -1.62789103  1.23248277  0.49893302
 [7]  -0.53349928 -0.02553238 -0.06886832  1.16296981  0.90072271  0.20713797
[13]  -1.72293906  0.62083278  2.77900829  4.15732558  1.71227621  2.09531955
[19]  -0.06520153  0.62591177

The output shows each element of the vector added to itself.

Example

The following program does the same thing, but this time we use vectorization, which reduces both the size of our code and its execution time:

myVector1 <- rnorm(20)

# Add using vectorization (no preallocation or loop is needed)
myVector2 <- myVector1 + myVector1

# Display
print(myVector2)

Output

 [1] -1.0100098  3.2932186 -3.5650312 -3.2800819  0.1513545 -1.5786916
 [7]  2.0485566  2.6009810 -0.8015987 -0.6965471 -1.4298714  1.1251865
[13]  1.2536663  2.6258258  1.1093443 -1.7895628  0.3472878 -1.4783578
[19] -0.7717328 -2.2734743

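The output again shows each element added to itself, this time computed with a single vectorized operation. The values differ from the previous output only because rnorm() generates new random numbers on every call.

Note that vectorization also applies to R's built-in functions, many of which accept a whole vector as input.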
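
With only 20 elements the speed difference is hard to see. The following sketch, which assumes a vector of one million values and a fixed seed (both chosen for illustration), times the loop and the vectorized form on the same data:

```r
# Fix the seed so that both approaches work on the same values
set.seed(42)
x <- rnorm(1000000)

# Loop version: add each element to itself
loopResult <- numeric(length(x))
loopTime <- system.time(
   for (i in seq_along(x)) loopResult[i] <- x[i] + x[i]
)

# Vectorized version: one operation on the whole vector
vecTime <- system.time(vecResult <- x + x)

print(loopTime)
print(vecTime)

# Both approaches should give the same result
print(identical(loopResult, vecResult))
```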

Example

For example, consider the following program that computes the log of the individual values present in a vector:

myVector1 <- c(8, 10, 13, 16, 32, 64, 57, 88, 100, 110)

myVector2 <- numeric(length(myVector1))

# Compute the natural logarithm of each element
for(index in seq_along(myVector1))
   myVector2[index] <- log(myVector1[index])

# Display the vector
print(myVector2)

Output

[1] 2.079442 2.302585 2.564949 2.772589 3.465736 4.158883 4.043051 4.477337
[9] 4.605170 4.700480

As you can see in the output, the natural logarithm of each value in the vector has been displayed.

Example

Now let us achieve the same thing using vectorization:

myVector1 <- c(8, 10, 13, 16, 32, 64, 57, 88, 100, 110)

# Compute the logarithm of every element in one call
myVector2 <- log(myVector1)

# Display
print(myVector2)

Output

[1] 2.079442 2.302585 2.564949 2.772589 3.465736 4.158883 4.043051 4.477337
[9] 4.605170 4.700480

As you can see in the output, the same logarithms have been displayed, but this time they were computed with a single vectorized call.

Example

A matrix, whose elements all share the same data type, offers faster column access than a dataframe. For example, consider the following program:


library(microbenchmark)

# Create a matrix
myMatrix <- matrix(c(1:12), nrow = 4, byrow = TRUE)  

# Display
print(myMatrix)  

# Create columns
data1 <- c(1, 4, 7, 10)
data2 <- c(2, 5, 8, 11)
data3 <- c(3, 6, 9, 12)

# Create a dataframe
myDataframe <- data.frame(data1, data2, data3)

# Compare the column access time of the matrix and the dataframe
print(microbenchmark(myMatrix[,1], myDataframe[,1]))

Output

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
Unit: nanoseconds
             expr  min     lq    mean median   uq   max neval
    myMatrix[, 1]  493  525.0  669.64  595.5  661  5038   100
 myDataframe[, 1] 6880 7110.5 8003.56 7247.0 7437 53752   100

The output shows that, in this run, accessing a column of the matrix is about an order of magnitude faster than accessing a column of the dataframe.
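
When every column of a dataframe is numeric, it can be converted to a matrix once and then used for repeated column access. The following is a minimal sketch, assuming a small made-up dataframe named scores:

```r
# Hypothetical all-numeric dataframe
scores <- data.frame(a = c(1, 4, 7, 10),
                     b = c(2, 5, 8, 11),
                     c = c(3, 6, 9, 12))

# as.matrix() yields a numeric matrix when every column is numeric
scoresMatrix <- as.matrix(scores)
print(is.numeric(scoresMatrix))

# Column access returns the same values as in the dataframe
print(all.equal(unname(scoresMatrix[, "a"]), scores[, "a"]))

# Matrix-oriented helpers such as colMeans() work directly
print(colMeans(scoresMatrix))
```

If the columns have mixed types, as.matrix() converts everything to a common type (often character), so this conversion suits homogeneous data only.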

Parallel Programming for efficient R code

R ships with the parallel package, which we can use to spread work across several processor cores. Parallelism often helps to get things done in less time and to make better use of system resources, although starting workers and moving data between them adds overhead. The package provides functions such as parApply(), which are used with the following steps:

Make a cluster using the makeCluster() function.

Run the computation on the cluster, for example with parApply().

Finally, stop the cluster using the stopCluster() function.

Example

The following source code calculates the mean of all the columns using the parApply() function in R:


library(parallel)
library(microbenchmark)

# Create columns
data1 <- c(1, 4, 7, 10)
data2 <- c(2, 5, 8, 11)
data3 <- c(3, 6, 9, 12)

# Create a dataframe
myDataframe <- data.frame(data1, data2, data3)

# Create a cluster
cluster <- makeCluster(2)

# Apply parApply() function
print(parApply(cluster, myDataframe, 2, mean))

# Stop the cluster
stopCluster(cluster)

Output

data1 data2 data3 
  5.5   6.5   7.5

As you can see in the output, the mean of each column has been computed in parallel. For a dataset this small, the overhead of creating the cluster outweighs any gain; parallel execution pays off when each task is expensive.
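
To see when parallel execution helps, the sketch below applies a deliberately slow function to a few inputs, first sequentially and then on a cluster. The function slowSquare(), the Sys.sleep() delay and the worker count of 2 are assumptions made for this illustration:

```r
library(parallel)

# Hypothetical workload: a slow function applied to each input
slowSquare <- function(x) {
   Sys.sleep(0.2)   # stand-in for an expensive computation
   x^2
}
inputs <- 1:8

# Sequential baseline
seqTime <- system.time(seqResult <- sapply(inputs, slowSquare))

# Parallel version on two workers (the worker count is an assumption)
cluster <- makeCluster(2)
parTime <- system.time(parResult <- parSapply(cluster, inputs, slowSquare))
stopCluster(cluster)

print(seqTime)
print(parTime)

# Both versions should return the same values
print(all.equal(seqResult, parResult))
```

The parallel run is expected to finish sooner here because each task spends most of its time waiting; the exact figures depend on the machine.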

Conclusion

In this article, we briefly discussed how you can write efficient code in R. We covered benchmarking, several vectorization techniques, and parallel programming. I hope this tutorial has helped you expand your knowledge in the field of data science.

Bhuwanesh Nainwal

Updated on: 2023-01-17T16:05:04+05:30
