Hosted by Virginia Education Science Training (VEST) Program at UVA
This short module is meant to introduce you to some of R’s programming features. In the first section, we’ll go through some common control flow functions. In the second section, we’ll go over building our own R functions.
## libraries
library(tidyverse)
By control flow, I just mean the functions that help you change how your script is read. Scripts in R are read from top to bottom unless specific commands tell R to skip some lines or repeat a set of commands if certain conditions are met.
Repeating commands often involves a loop. Loops have a bad reputation in R, mostly for being slow, but they aren’t that slow and they are easy to write and understand.[1]
The for() function allows you to build loops. There are a few ways to use for(), but its construction is the same: for(variable in sequence).
Reading it backwards, the sequence is just the set of numbers or objects that we’re going to work through. The variable is a new variable that will temporarily hold a value from the sequence in each run through the loop. When the sequence is finished, so is the loop.
First, let’s loop through a sequence of 10 numbers, printing each one at a time as we work through the loop.
## make vector of numbers between 1 and 10
num_sequence <- 1:10
## loop through, printing each num_sequence value, one at a time
for (i in num_sequence) {
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
Notice the braces {} that come after for(). The code inside them is what gets repeated on each pass through the loop. With each loop, i takes on the next value in num_sequence. This is why we see 1 through 10 printed to the console.
Let’s do it again, but this time with characters.
## character vector using letters object from R base
chr_sequence <- letters[1:10]
## loop through, printing each chr_sequence value, one at a time
for (i in chr_sequence) {
  print(i)
}
[1] "a"
[1] "b"
[1] "c"
[1] "d"
[1] "e"
[1] "f"
[1] "g"
[1] "h"
[1] "i"
[1] "j"
Once more, with each loop, i takes on each chr_sequence value in turn and print() prints it to the console.
Quick exercise
Can you modify the above loop so that it works through both num_sequence and chr_sequence in the same loop? (HINT: how might you combine the two sequences?)
Another way to make a for loop is to work through a vector by its indices. Let’s break the code into pieces to make it clearer.
Inside the for() parentheses, we have i in 1:length(chr_sequence).
We know what i in means since it’s like what we’ve seen before. What’s 1:length(chr_sequence)? Starting at the end, we know that length() will return the number of items in the vector. Since we know that there are ten letters in chr_sequence, then we know that length(chr_sequence) == 10. That means that 1:length(chr_sequence) is the same thing as saying 1:10, which is what we’ve seen before. It’s just another, more flexible way to get the end number of our sequence.
Inside the braces ({}), we have print(chr_sequence[i]). From the first module, we know that brackets ([]) are a way of pulling out specific values from a vector. We’ve only used numbers before, but we can also use variables that represent numbers. Since we know that i is going to take on values 1 through 10 in the loop, that means the print() function will get chr_sequence[1], chr_sequence[2], and so on. Because of the brackets, these will turn into…a, b, and so on.
We should get the same thing as before!
## for loop by indices
for (i in 1:length(chr_sequence)) {
  print(chr_sequence[i])
}
[1] "a"
[1] "b"
[1] "c"
[1] "d"
[1] "e"
[1] "f"
[1] "g"
[1] "h"
[1] "i"
[1] "j"
And we do!
Whether you decide to loop using actual values from the sequence or indices will usually depend on the code you want to run in the loop. Sometimes one way works better and other times the other. Just do whatever works best for you at that time.
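One caveat with index loops, offered as a side note rather than as part of the example above: 1:length(x) misbehaves when x is empty, because 1:0 counts down and gives the two values 1 and 0. Base R’s seq_along() returns an empty sequence in that case, so the loop body is simply skipped. Below is a minimal sketch; empty_sequence, bad_index, and good_index are made-up names for illustration.

```r
## a vector with nothing in it (hypothetical example)
empty_sequence <- character(0)

## 1:length() counts down from 1 to 0, so a loop over it would run twice
bad_index <- 1:length(empty_sequence)

## seq_along() gives an empty sequence, so a loop over it would run zero times
good_index <- seq_along(empty_sequence)

## for non-empty vectors, seq_along() behaves just like 1:length()
chr_sequence <- letters[1:10]
for (i in seq_along(chr_sequence)) {
  print(chr_sequence[i])
}
```

If there is any chance the vector you loop over could be empty, seq_along() is the safer habit.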
Quick exercise
Add another print statement to the last loop that shows the value of i with each loop.
The while() function is similar to for() except that it doesn’t have a predetermined stopping point. As long as the expression inside the parentheses is TRUE, the loop will keep going. Only when it becomes FALSE will it stop.
One way to use a while() loop is to set up a counter. When the counter reaches some value, the expression inside the while() parentheses is no longer true and the loop stops.
## set up a counter
i <- 1
## with each loop, add one to i
while (i < 11) {
  print(i)
  i <- i + 1
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
Using a while() loop with a counter is often the same as using a for() loop with a sequence. If that’s the case, it’s probably better just to use a for() loop.
while() loops are most useful when it’s not clear, from the start, when the loop should stop. Imagine you have an algorithm that should only stop when a certain number is reached. If the time it takes to reach the number changes depending on the input, then a for() loop probably won’t work, but a while() loop will.
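As a rough sketch of that idea, the loop below keeps halving a number until it drops below 1. The starting value x and the counter name steps are made up for illustration; the point is that the number of passes depends on the input rather than on a sequence fixed in advance.

```r
## starting value (hypothetical); try changing it
x <- 100

## count how many halvings it takes
steps <- 0

## keep going until x falls below 1
while (x >= 1) {
  x <- x / 2
  steps <- steps + 1
}

## number of halvings that were needed
print(steps)
```

Change the starting value and the loop runs a different number of times without any other edits.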
You have to be careful, however, with while() loops. If you forget to increment the counter (like I did the first time I set up this example), the loop won’t ever stop because i will never get larger and will always be less than 11! If your while() loop will only stop when a certain condition is met, it’s still a good idea to build in a pre-specified number of trials. If your loop has tried, let’s say, 1000 times to meet the condition and still hasn’t done so, it should stop with an error or return what it has so far (depending on your needs).
You have been warned!
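One way to build in that safety valve is to count the passes and stop once a cap is hit. This is only a sketch: max_tries, tries, and the condition being chased (drawing a random number above 0.99) are placeholders for whatever your real loop does.

```r
## cap on the number of attempts (hypothetical value)
max_tries <- 1000
tries <- 0
draw <- 0

## stop when the condition is met OR when we run out of tries
while (draw <= 0.99 && tries < max_tries) {
  draw <- runif(1)
  tries <- tries + 1
}

## report what happened
if (draw > 0.99) {
  print(paste("Condition met after", tries, "tries"))
} else {
  warning("Stopped after ", max_tries, " tries without meeting the condition")
}
```

Swap warning() for stop() if you would rather halt with an error than carry on with what you have so far.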
We’ve already used a version of if, ifelse(), quite a bit. We can also use if() in a for() loop to set a condition that changes what the loop does on some passes.
## only print if number is not 5
for (i in num_sequence) {
  if (i != 5) {
    print(i)
  }
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
Notice how 5 wasn’t printed to the console. It worked!
Quick exercise
Change the condition to print only numbers below 3 and above 7.
We can add one or more else if () / else partners to if() if we need, for example, option B to happen if option A does not.
## if/else loop
for (i in num_sequence) {
  if (i != 3 && i != 5) {
    print(i)
  } else if (i == 3) {
    print('three')
  } else {
    print('five')
  }
}
[1] 1
[1] 2
[1] "three"
[1] 4
[1] "five"
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
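A note on how if() relates to ifelse(), since both have come up: if() checks a single TRUE/FALSE value, which is why it pairs naturally with a loop, while ifelse() checks every element of a vector at once. The sketch below recodes the same numbers both ways; loop_result and vec_result are names made up for illustration.

```r
num_sequence <- 1:10

## loop version: if() tests one value per pass
## (i doubles as the position here because num_sequence is 1:10)
loop_result <- character(length(num_sequence))
for (i in num_sequence) {
  if (i < 6) {
    loop_result[i] <- "low"
  } else {
    loop_result[i] <- "high"
  }
}

## vectorized version: ifelse() tests all values in one call
vec_result <- ifelse(num_sequence < 6, "low", "high")

## check whether both approaches give the same answer
identical(loop_result, vec_result)
```

When the goal is simply to recode a whole vector, the ifelse() version is usually shorter; the loop with if() earns its keep when each pass needs to do more than pick between two values.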
You can, and should, write your own functions in R! They don’t need to be complex. In fact, they tend to be best when kept simple. Mostly, you want a function to do one thing really well.
To make a function, you use the function() function. Put the code that you want your function to run in the braces {}. Any arguments that you want your function to take should be in the parentheses () right after the word function. The name of your function is the name of the object you assign it to.
Let’s make one. The function below, my_function(), doesn’t take any arguments and prints a simple string when called. After you’ve built it, call your function using its name, not forgetting to include the parentheses.
## function to say hi!
my_function <- function() {
  print('Hi!')
}
## call it
my_function()
[1] "Hi!"
Let’s make another one with an argument so that it’s more flexible. This time, we want it to print out a sequence of numbers, but we want to be able to change the numbers each time we call it.
Notice how the variable num_vector is repeated in both the main function argument and inside the for parentheses. The for() function sees num_vector and looks for it in the main function. It finds it because the num_vector you give the main function, print_nums(), is passed through to the code inside. Now for() can see it and use it!
## new function to print sequence of numbers
print_nums <- function(num_vector) {
  ## this code looks familiar...
  for (i in num_vector) {
    print(i)
  }
}
## try it out!
print_nums(1:10)
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
Quick exercise
What happens if you forget to put an argument in your new function? How do you think you might set a default argument for num_vector? Could you set it equal to something?
Moving to a more realistic example, we could make a function that filled in missing values, a common task we’ve had. First, we’ll generate some fake data with missing values.
Note that since we’re using R’s sample() function, your data will look a little different from mine due to randomness in the sample, but everything will work the same.
## create tibble with around 10% missing values (-97,-98,-99) in three columns
## (as_tibble() replaces the deprecated tbl_df())
df <- data.frame('id' = 1:100,
                 'age' = sample(c(seq(11,20,1), -97),
                                size = 100,
                                replace = TRUE,
                                prob = c(rep(.09, 10), .1)),
                 'sibage' = sample(c(seq(5,12,1), -98),
                                   size = 100,
                                   replace = TRUE,
                                   prob = c(rep(.115, 8), .08)),
                 'parage' = sample(c(seq(45,55,1), -99),
                                   size = 100,
                                   replace = TRUE,
                                   prob = c(rep(.085, 11), .12))
                 ) %>%
    as_tibble()
## show
df
# A tibble: 100 x 4
      id   age sibage parage
   <int> <dbl>  <dbl>  <dbl>
 1     1    16      6     49
 2     2   -97     11     50
 3     3    14      5     53
 4     4    15     11     51
 5     5    18      8    -99
 6     6    17    -98     45
 7     7    15      6     51
 8     8    14     12     52
 9     9    14    -98     48
10    10    15      9     53
# ... with 90 more rows
We could fix these manually like we have been, but it would be nice to have a shorthand function. The function needs to be flexible, though, because the missing data values are coded differently in each column.
## function to fix missing values
fix_missing <- function(x, miss_val) {
  ## in the vector, wherever the value equals miss_val, make it NA
  x[x == miss_val] <- NA
  ## return instead of print because we want to store it
  return(x)
}
Our fix_missing() function should work. It takes the same bracket code we used in past modules, but instead of using the name of the object (like df), uses a variable name x that we can set each time. It does the same for miss_val. Instead of choosing a hard-coded value (a magic number), we can change it each time we call the function. Let’s try it out.
## check
table(df$age, useNA = 'ifany')
-97  11  12  13  14  15  16  17  18  19  20
  8   9   7   5  14  11   6  12  11  11   6
## missing values in age are coded as -97
df$age <- fix_missing(df$age, -97)
## recheck
table(df$age, useNA = 'ifany')
  11   12   13   14   15   16   17   18   19   20 <NA>
   9    7    5   14   11    6   12   11   11    6    8
It worked! All the values that were -97 before are now in the NA table column. Importantly, none of the other values changed.
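Because the missing code is an argument, the same function handles any coding scheme. Here is a quick sanity check on two toy vectors (toy_a and toy_b are made-up values, not columns from df) that also shows the function leaves its input alone until you reassign the result.

```r
## same function as above, repeated so this chunk runs on its own
fix_missing <- function(x, miss_val) {
  x[x == miss_val] <- NA
  return(x)
}

## toy vectors with different missing codes (made-up values)
toy_a <- c(12, -97, 15, -97)
toy_b <- c(-98, 7, 9)

## same function, different miss_val each time
fixed_a <- fix_missing(toy_a, -97)
fixed_b <- fix_missing(toy_b, -98)

print(fixed_a)
print(fixed_b)

## the original vector is untouched until you reassign it
print(toy_a)
```

This is why the call above was written as df$age <- fix_missing(df$age, -97): without the assignment, the cleaned values would be printed and then thrown away.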
Quick exercise
Our new function should work in the tidyverse framework. Load tidyverse and see if you can fix the other two columns with our new command. (HINT: look back at a past module if you need to)