R-Guru.com Cheat Sheet for Practical R Tasks

This guide contains basic best-practice examples for creating and updating
tibble data frames from vectors of the same length (number of records).
Examples show common R tasks: importing data, creating data frames, direct
variable referencing, piping, conditional and group processing, SQL-style
components, character and date operations, variable type conversions,
transposing, joining, and appending data frames, deriving summary variables,
and creating graphs and output files. When possible, sample data frames
supplied with base R are used in examples.
mutate() has five features: case_when(), simple expressions, summary
functions, rowwise(), and group_by()/ungroup() with summary functions.
Data utility functions describe and view data frames: View(df), str(df),
summary(df), table(vr), print(df, n=), head(df), tail(df), row_number(),
nrow(), ncol() and ls(). The tidyverse, dplyr, stringr, readxl, haven,
lubridate and ggplot2 packages are required. df# denotes data frame names
and vr# denotes variable names. Whether a variable is character or numeric
depends on the function and values.

Direct variable reference to filter records and derive variables by condition
df2 <- df1[df1$vr1 == 'male', ]                   # [row, column] reference: subset rows, keep all vars
print(df1[df1$vr1 == 'male', c('vr1', 'vr2')])    # print vr1 & vr2 for males
mydata$vr2 <- ifelse(mydata$vr1 >= 4, 1, 0)       # derive variable by condition

Piping and multiple conditional processing to derive variables
df2 <- df1 %>%                   # copy df1 data frame to df2
  mutate(vr2 = case_when(        # derive vr2 based on condition
    vr1 > 20 ~ "label 1",        # ifelse() is the two-option alternative, e.g.
    vr1 > 10 ~ "label 2",        # ifelse(vr1 < 51, "50 or Below", "Above 50")
    TRUE ~ "label 3"))           # otherwise, last case for label 3

SQL components (dplyr) for selecting, filtering, mutating, and arranging
myquery <- ToothGrowth %>%              # 1. source df
  select(len, supp, dose) %>%           # 2. select variables, use - to drop
  filter(tolower(supp) == 'vc') %>%     # 3. subset records, & (and), | (or), ! (not)
  # filter(year %in% c(2010, 2011))     #    to subset multiple values
  mutate(dose2 = (dose * 2)) %>%        # 4. derive variables w/ simple expressions
  arrange(supp, dose)                   # 5. sort records, desc() for descending
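The piping, case_when() and SQL-style steps can be tried together on the ToothGrowth sample data frame supplied with base R. This is a minimal sketch rather than part of the original sheet: the result name tooth_vc, the derived variable len_group and its cut-offs (20 and 10) are illustrative assumptions.

```r
library(dplyr)

# ToothGrowth ships with base R: len (numeric), supp (factor: OJ, VC), dose (numeric)
tooth_vc <- ToothGrowth %>%
  select(len, supp, dose) %>%              # keep three variables
  filter(tolower(supp) == "vc") %>%        # keep the VC supplement records only
  mutate(
    dose2 = dose * 2,                      # simple expression
    len_group = case_when(                 # hypothetical cut-offs, for illustration only
      len > 20 ~ "long",
      len > 10 ~ "medium",
      TRUE     ~ "short"                   # everything else
    )
  ) %>%
  arrange(dose, desc(len))                 # sort by dose, then longest first

head(tooth_vc)
```

case_when() evaluates its conditions in order and uses the first one that is true, so the cut-offs are listed from highest to lowest.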
Import data into data frames: Data frames, CSV, Excel and SAS Datasets
install.packages('tidyverse')                  # install package (once)
library(tidyverse)                             # load popular data management packages
library(readxl)                                # for read_excel(); installed with tidyverse, loaded separately
library(haven)                                 # for read_sas(); installed with tidyverse, loaded separately
readRDS("df.RDS")                              # read R data frame
read.csv("C:/mydata/my_csv.csv")               # read csv, use forward slash '/' in paths
read_excel("C:/mydata/my_excel.xlsx")          # read excel, missing shown as NA
sdtm <- "c:/my_sdtm"                           # create full path reference
read_sas(file.path(sdtm, "adsl.sas7bdat"))     # read SAS dataset, missing ., ''

Character String Operations to combine, remove, subset, and substring
vr3 <- str_c(vr1, vr2, sep = "-")                     # combine two variables as vr1-vr2
vr2 <- str_replace_all(vr1, "Street", "St")           # replace 'Street' with 'St'
vr2 <- str_trim(vr1, side = 'both')                   # remove blanks from left and right sides
vr2 <- str_extract(vr1, "\\d+")                       # from char vr1, extract the first run of digits
df2 <- df1 %>% filter(str_detect(vr1, "Health"))      # subset records by finding text
vr2 <- str_sub(vr1, 3, 6)                             # substring vr1 text from 3rd to 6th position
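The character string operations listed in this sheet can be checked on a small vector. The sketch below uses made-up address values and hypothetical variable names (address, unit and the derived vectors). It uses the pattern "\\d+" for digit extraction, so values without digits give NA rather than an empty string.

```r
library(stringr)

# Hypothetical sample values, for illustration only
address <- c("  12 Main Street ", "450 Health Street", "Oak Avenue")
unit    <- c("A", "B", "C")

clean      <- str_trim(address, side = "both")          # remove leading and trailing blanks
short      <- str_replace_all(clean, "Street", "St")    # replace every 'Street' with 'St'
number     <- str_extract(clean, "\\d+")                # first run of digits, NA when none
label      <- str_c(short, unit, sep = "-")             # combine as short-unit
has_health <- str_detect(clean, "Health")               # TRUE where the text is found
part       <- str_sub(clean, 3, 6)                      # characters 3 through 6

data.frame(clean, short, number, label, has_health, part)
```

str_detect() returns a logical vector, which is why it can be passed directly to filter() to subset records.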
Variable Type Conversion to switch between Numeric & Character Variables
vr2 <- as.character(vr1)               # number in numeric variable to character variable
vr2 <- as.numeric(vr1)                 # number in character variable to numeric variable

Environmental Setup and Workspace
ls()                                   # list all objects
remove(list = ls())                    # remove all objects
# names(adsl) <- tolower(names(adsl))  # lower case all variable names

Create data frames by combining variables
df <- data.frame(vr1, vr2, vr3)        # variable order

Add numeric and character constants to a data frame
df2 <- cbind(df1, vr1 = 1, vr2 = 'Drug A')   # to df1, add vr1 and vr2

Direct variable reference to select variables and filter records
df2 <- df1[c('vr1', 'vr2')]            # keep selected variables by name

Date Operations: Assignment, Periods, Durations, Intervals, and Formats
Durations - # of seconds, Periods - # of days, weeks, months and years,
Intervals - duration between start and end points
vr1 <- as.Date("2021-01-25")           # assign date in yyyy-mm-dd format
format(vr1, format = "%m/%d/%y")       # format as mm/dd/yy
vr1 + ddays(1)                         # add 1 day; also dweeks(1), dmonths(1), dyears(1) for 1 wk, mth or yr
interval(dtvr1, dtvr2) %/% months(1)   # number of months between dates
# dates are stored as the number of days since 1970-01-01

Transpose data frames to switch between long (records) & wide (variables)
Long (records) to wide (variables) structure
df2 <- df1 %>%                                       # vr1 contains new variable name values
  pivot_wider(names_from = vr1, values_from = vr2)   # vr2 contains numbers

Wide (variables) to long (records) structure
df2 <- df1 %>%                        # all other variables in df1 are group by variables
  pivot_longer(c("vr1", "vr2"))       # list all variables to be transposed

Group processing to derive summary variables
• Summary variables
• First and Last Group By variables
• Descriptive Statistics

Derive summary variables
mtcars_cyl_summary <- mtcars %>%                    # final and source data frames
  group_by(cyl) %>%                                 # omit group_by() for an overall summary
  summarize(mean_mpg = mean(mpg, na.rm = TRUE))     # na.rm = TRUE ignores NA

Derive First and Last group by variables
first_mpg <- mtcars %>%
  group_by(mpg) %>%     # group by mpg
  slice(1)              # keep first record in each group, distinct() for unique records
# slice(n())            # keep last record in each group
# lead(), lag()         # next and previous record values

Derive Descriptive Statistics Variable, Left Join to Add to Data Frame
- Across one variable
vr1 <- c(2, 4, 6)       # combine 3 values into one variable
vr2 <- min(vr1)         # one value, max(), sum()

- Across variables using rowMeans(), rowSums() (rowMins(), rowMaxs() need the matrixStats package)
df2 <- subset(df1, select = c(vr1, vr2))    # select variables vr1 and vr2
df1$vr3 <- rowMeans(df2, na.rm = TRUE)      # derive mean of all df2 vars as vr3 in df1

- Across variables using rowwise() with min(), max()
df2 <- df1 %>%                              # derive min, max variables of vr1 and vr2
  rowwise() %>% mutate(vr3 = min(vr1, vr2), vr4 = max(vr1, vr2))

- Across records using mutate(), min(), max(), mean(), sum(), percent()
df2 <- df1 %>%                              # 1. source df
  filter(!is.na(vr1), vr1 != '.') %>%       # 2. subset non-missing records (NA or '.')
  group_by(vr1) %>%                         # 3. group by vr1, else by overall
  mutate(vr3 = round(mean(vr2), 1)) %>%     # 4. derive mean vr2 with rounding
  ungroup()                                 # 5. ungroup to add back all original variables with vr3

Left join data frames to add derived variables
df3 <- left_join(df1, df2, by = 'vr1')               # join by the same by variables
df3 <- left_join(df1, df2, by = c('vr1' = 'vr2'))    # join by different by variables
# other joins: right_join(), inner_join(), full_join()
df3 <- crossing(df1, df2)                            # many-to-many join without by variables

Subquery condition in df1 to filter df2 records
df3 <- df2 %>%                        # 5. final df3 data frame
  filter(vr1 %in% (                   # 4. filter vr1 values in df2 data frame
    df1 %>%                           # 1. lookup df1 data frame
      filter(vr2 == 'male') %>%       # 2. filter vr2 = males in df1
      pull(vr1) %>% unique()))        # 3. unique vr1 values for df2 filter

Append data frame records to end of first data frame records
df3 <- bind_rows(df1, df2)            # append data frames with different variables
df4 <- rbind(df1, df2, df3)           # append two or more data frames with the same variables

Graphs: Scatterplots, Lines, Boxplots, Bars and Histograms
ggplot(df,                            # data frame name
  aes(x = vr1, y = vr2))              # vr1 for x and vr2 for y axis variables
  # optional aes() arguments: fill=, color= (or col=), size= with valid values
# one or more required options below, defaults unless options are specified
+ geom_point()                        # scatterplot two quantitative variables
+ geom_line()                         # trend lines over time
+ geom_boxplot()                      # boxplot one continuous and one categorical variable
+ geom_bar()                          # bar of variables, options: stat=
+ geom_histogram()                    # histogram of x-axis for counts
+ geom_smooth(method = 'lm', formula = y ~ x, se = FALSE)   # smooth option
# one or more format options below, defaults unless options are specified
+ theme(), + ggtitle(), + xlab(), + ylab()

Output files: Data frames, Text, Excel and SAS Datasets
setwd("C:/mydataframes")                                # change default working folder
getwd()                                                 # confirm working folder
saveRDS(df, file = "df.RDS")                            # save as permanent data frame
write.table(df, "C:/myoutput/mydm.txt", sep = "\t")     # save as text file
write_sas(df, "df.sas7bdat")                            # save as SAS dataset

Created by Sunil Gupta, Gupta Programming, Copyright © 2023, Practical R Programming, R-Guru.com
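
The group-processing steps (filter, group_by(), mutate() with a summary function, ungroup()) and the summarize-then-left-join route can be compared on the mtcars sample data frame. This is a sketch under assumptions: the names mtcars_with_mean, cyl_summary, mtcars_joined and mean_mpg are illustrative, and cyl is used as the group by variable.

```r
library(dplyr)

# Route 1: add the group mean to every record with mutate() inside group_by()
mtcars_with_mean <- mtcars %>%
  filter(!is.na(cyl)) %>%                                  # drop records with a missing group value
  group_by(cyl) %>%                                        # one group per cylinder count
  mutate(mean_mpg = round(mean(mpg, na.rm = TRUE), 1)) %>% # group mean, rounded to 1 decimal
  ungroup()                                                # back to an ungrouped data frame

# Route 2: one summary record per group, then left join it back by the group variable
cyl_summary <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = round(mean(mpg, na.rm = TRUE), 1))

mtcars_joined <- left_join(mtcars, cyl_summary, by = "cyl")

head(mtcars_joined)
```

round() is applied explicitly because a second positional argument to mean() is trim (the fraction of observations trimmed from each end), not a number of decimal places.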