Hosted by Virginia Education Science Training (VEST) Program at UVA
It’s hard to know where to start when teaching a new programming
language. This page is meant to give some background about R that
hopefully
1. explains a little about how it is put together, and
2. puts it in context with other programming languages you might know.
That said, revisiting this page after working through the other modules might be useful.
R is an open-source implementation of the S language, which was developed at Bell Labs. As a GNU project, R is open source and free (as in freedom) to use and distribute. It can be installed and used on most major operating systems.
R is best thought of as an integrated language and environment that was designed with statistical computing and data analysis in mind. To that end, its structure represents a compromise between a code base optimized for mathematical procedures and one with high-level functionality that can be used interactively (unlike compiled code). In other words, it’s a great tool for working interactively with quantitative data.
R is probably best known for its graphing capabilities, but it has continued to grow in popularity among data scientists,[1] who are increasingly extending R’s functionality through user-contributed packages.[2] We will use a number of packages in this workshop.
RStudio does most everything R-related well and with little fuss, so it’s a great all-around program for using R. We will use it in this workshop.
Quick exercise
If you haven’t already, open up RStudio and poke around. First, try entering an equation in the console (like 1 + 1). Next, open the script associated with this module and run the first line.
R thinks of things as objects. Objects are like boxes in which we can put things: data, functions, and even other objects.
Before discussing data types and structures, the first lesson in R is how to assign values to objects. In R (for quirky reasons), the primary means of assignment is the arrow, <-, which is a less than symbol, <, followed by a hyphen, -.
## assign value to object x using <-
x <- 1
## show
x
[1] 1
You can also assign using a single equals sign, =:
## assign value to object y using =
y = 'a'
## show
y
[1] "a"
Keep in mind, however, that since = sometimes has other meanings in R (it’s how functions set argument options), it may be clearer to use <-.
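To make the distinction concrete, here is a minimal sketch (the object name val is made up for illustration). Inside a function call, = matches a value to a named argument, while the arrow still creates an object:

```r
## inside a call, = matches values to the arguments named x and digits
round(x = 3.14159, digits = 2)

## the arrow assigns even inside a call, so val now exists as an object
round(val <- 3.14159, digits = 2)

## show
val
```

Both calls to round() return 3.14, but only the second one leaves behind an object you can use later.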
Quick exercise
Using the arrow, assign the output of 1 + 1 to x. Next subtract 1 from x and reassign the result to x.
There are three primary data types in R that you will regularly use:
- logical
- numeric (integer & double)
- character
Logical vectors can be TRUE, FALSE, or NA. They can be assigned to objects or returned by logical operators (e.g., ==, !=, <, >, etc.), which makes them useful for control flow in loops and functions.
NB In R, you can shorten TRUE to T and FALSE to F, but both the short and long versions must be capitalized.
## assignment
x <- TRUE
x
[1] TRUE
## ! == NOT
!x
[1] FALSE
## check
is.logical(x)
[1] TRUE
## evaluate
1 + 1 == 2
[1] TRUE
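As a rough sketch of how logical values drive control flow, the example below compares some made-up scores against a cutoff and then branches on each result. The names scores, passed, and p are hypothetical, chosen only for this illustration:

```r
## compare each value to a cutoff; the result is a logical vector
scores <- c(55, 72, NA, 90)
passed <- scores >= 60

## show (a comparison with NA returns NA)
passed

## branch on each logical value, handling NA first
for (p in passed) {
  if (is.na(p)) {
    print('missing')
  } else if (p) {
    print('pass')
  } else {
    print('fail')
  }
}
```

Checking is.na() first matters here because if () stops with an error when given NA.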
Numeric values can be either integers or double-precision floating-point values (doubles for short). R automatically converts between the two data types for you, so knowing the difference between the two isn’t really important for most analyses.
If you want to use an integer, place a capital L after the number like 1L. If a number is stored as an integer, some R output will place an L behind the digits to let you know that. Mostly, R defaults to using doubles, but if you see a number with an L behind it, know that it’s still a number.
## use 'L' after digit to store as integer
x <- 1L
is.integer(x)
[1] TRUE
## R stores as double by default
y <- 1
is.double(y)
[1] TRUE
## both are numeric
is.numeric(x)
[1] TRUE
is.numeric(y)
[1] TRUE
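The automatic conversion between integers and doubles can be seen in a short sketch. It uses typeof(), a base R function not otherwise covered on this page, which reports how a value is stored:

```r
## start with an integer
x <- 1L
typeof(x)          # "integer"

## integer plus integer stays an integer
typeof(x + 1L)     # "integer"

## mixing in a double converts the result to a double
typeof(x + 0.5)    # "double"

## division returns a double, even with two integers
typeof(x / 2L)     # "double"
```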
Character values are stored as strings, which means you need to place either single ' or double " quotes around them. Numeric values can also be stored as strings (sometimes useful if you must store leading zeroes), but they have to be converted back to numbers before you can perform numeric operations on them (like adding or subtracting) or use them in a statistical model.
## store a string using quotation marks
x <- 'The quick brown fox jumps over the lazy dog.'
x
[1] "The quick brown fox jumps over the lazy dog."
## store a number with leading zeros
x <- '00001'
x
[1] "00001"
Quick exercise
Try to add a string digit to a numeric value. What happens? Can you convert the string version on the fly so that the equation works? (HINT: in R, you can change a vector type using as.<type>(), where <type> is the name of what you want.)
Building on these data types, R relies on four primary data structures:
- vector
- matrix[3]
- list
- data frame
A vector in R is just a collection of the data types discussed. In fact, a single value is a vector of one. Vectors do not have dimensions (dim()), but do have length(), which is good to remember when inspecting your data or writing loops and functions.
You combine multiple values using the concatenate, c(), function. We will use c() a lot.
## create vector
x <- 1
## check
is.vector(x)
[1] TRUE
## add to vector (can do so recursively meaning old x can help make new x)
x <- c(x, 5, 8)
x
[1] 1 5 8
## no dim...
dim(x)
NULL
## ...but length
length(x)
[1] 3
You can access the elements of a vector using brackets, [], after the object name. If you think of each element in the vector as having an address, that is, a way to access it specifically, then its address is its position number in the vector. This position number is called its index, and in R, the index always starts with 1.
In our current vector, we have three items, 1, 5, and 8, which in turn have indices of 1, 2, and 3. To access 5 specifically, we can call it using the brackets and its index: x[2].
## get the second element
x[2]
[1] 5
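Brackets accept more than a single index. The sketch below uses a made-up vector v (not part of the module script) to show a few common patterns, including how length() helps when looping over positions:

```r
## a made-up vector for illustration
v <- c(10, 20, 30, 40)

## several elements at once: pass a vector of indices
v[c(1, 3)]         # 10 30

## a negative index drops that element
v[-1]              # 20 30 40

## length() gives the index of the last element
v[length(v)]       # 40

## loop over each index in turn
for (i in seq_along(v)) {
  print(v[i] * 2)
}
```

Here seq_along(v) produces the indices 1 through length(v), so the loop visits every element once.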
Quick exercise
Since you know how to access a specific element in a vector and how to assign new values, try to change the 3rd element of the x vector to 4.
All values in a vector must be of the same type. If you concatenate values of different data types, R will automatically coerce all values to the most flexible type among them (in order: logical, integer, double, character). We can check this with class().
## check class of x
class(x)
[1] "numeric"
## add character
x <- c(x, 'a')
x
[1] "1" "5" "8" "a"
## check class
class(x)
[1] "character"
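R’s coercion follows a fixed order, from logical to integer to double to character. A minimal sketch with made-up values shows each step:

```r
## logical + integer: TRUE becomes 1L
c(TRUE, 2L)
class(c(TRUE, 2L))      # "integer"

## integer + double: stored as doubles
class(c(2L, 2.5))       # "numeric"

## anything + character: everything becomes a string
class(c(2.5, 'a'))      # "character"
class(c(TRUE, 'a'))     # "character"
```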
A matrix is a 2D arrangement of values. In addition to a length, it has dimensions. Like vectors, all data elements must be of the same type.
## create 3 x 3 matrix that is the sequence of numbers between 1 and 9
x <- matrix(1:9, nrow = 3, ncol = 3)
x
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
## ...fill by row this time
y <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
y
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
## a matrix has dimension
dim(x)
[1] 3 3
Use nrow() and ncol() to get the number of rows and columns, respectively.
## # of rows
nrow(x)
[1] 3
## # of columns
ncol(x)
[1] 3
Like a vector, you can access parts of a matrix. Since it has two dimensions, use a comma in the bracket to separate row indices from column indices.
When using brackets with objects that have two dimensions, a good rule of thumb is to add your comma first: x[ , ]. Numbers or objects you put between the first bracket and the comma will affect the rows; numbers between the comma and the closing bracket will affect the columns.
If you don’t put anything in either of those spaces (a blank space doesn’t count), R will assume you want all rows or columns, depending on which side of the comma is blank.
## show the values in the first row
x[1, ]
[1] 1 4 7
## show the values in the third column
x[ ,3]
[1] 7 8 9
## this is the same as just calling x by itself
x[ , ]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
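The same row-comma-column pattern works for single cells and for blocks of cells. Below is a small sketch using a made-up 2 x 3 matrix m. The drop = FALSE argument is a standard option to [ ] that is not otherwise covered here:

```r
## made-up 2 x 3 matrix, filled by column
m <- matrix(1:6, nrow = 2, ncol = 3)

## a single cell: row 2, column 3
m[2, 3]                 # 6

## a block: all rows, columns 2 and 3 (still a matrix)
m[ , 2:3]

## one row comes back as a plain vector with no dim...
dim(m[1, ])             # NULL

## ...unless you ask R to keep the matrix structure
dim(m[1, , drop = FALSE])   # 1 3
```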
Quick exercise
Return the middle value of the x matrix. Next assign the middle value the character value ‘a’. What happens to the rest of the values in the matrix?
Lists are catch-all objects that can hold an assortment of other objects of different data types. They can be flat, meaning that all values are at the same level, or nested, with lists holding other lists.
## create single-level list
x <- list(1, 'a', TRUE)
## show
x
[[1]]
[1] 1
[[2]]
[1] "a"
[[3]]
[1] TRUE
## check
is.list(x)
[1] TRUE
## create blank list
y <- list()
## add to first list, creating nested list
z <- list(x, y)
## show
z
[[1]]
[[1]][[1]]
[1] 1
[[1]][[2]]
[1] "a"
[[1]][[3]]
[1] TRUE
[[2]]
list()
You access items in lists like you do vectors and matrices. You may, however, need to use double brackets, [[]], and multiple pairs, [[]][[]], to reach the item you need.
## the first item in list z is list x
z[[1]]
[[1]]
[1] 1
[[2]]
[1] "a"
[[3]]
[1] TRUE
## to get to 'a' in list x, need to add more brackets
z[[1]][[2]]
[1] "a"
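The difference between single and double brackets is easy to miss, so here is a minimal sketch. The lists item and named, and the element names id and label, are made up for illustration:

```r
## made-up flat list
item <- list(1, 'a', TRUE)

## single brackets return a smaller list...
class(item[2])          # "list"

## ...double brackets return the element itself
class(item[[2]])        # "character"

## list elements can also have names
named <- list(id = 1, label = 'a')

## named elements can be reached with $ or with [['name']]
named$label
named[['label']]
```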
Data frames are really just an organized collection of lists / vectors that are the same length. That quick description, however, belies the importance of data frames: you will use them all the time in your data work.
Most of the time, you will be reading in data frames, but you can also create them.
## create data frame where col_* are the column (variable) names
df <- data.frame(col_a = c(1,2,3),
                 col_b = c(4,5,6),
                 col_c = c(7,8,9))
## show
df
  col_a col_b col_c
1     1     4     7
2     2     5     8
3     3     6     9
## check
is.data.frame(df)
[1] TRUE
Like matrices, data frames have a dim(), and the number of rows and columns can be recovered using nrow() and ncol(). The column names, which are needed when estimating models and making graphics, are accessed using names().
## get column names
names(df)
[1] "col_a" "col_b" "col_c"
To access a column, you need to give R the data frame’s name followed by a $ and then the variable name.
## get col_a
df$col_a
[1] 1 2 3
You can also use the df[['<var name>']] construction, which comes in handy in loops and functions.
## get col_a (note the quotation marks this time)
df[['col_a']]
[1] 1 2 3
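One way to see why the double-bracket form helps in loops: the column name can be held in another object, which $ does not allow. The sketch below rebuilds the same data frame so it runs on its own. The names v and col_means are hypothetical:

```r
## same data frame as above
df <- data.frame(col_a = c(1,2,3),
                 col_b = c(4,5,6),
                 col_c = c(7,8,9))

## empty vector to hold the results
col_means <- numeric(0)

## v takes each column name in turn; df[[v]] pulls that column
for (v in names(df)) {
  col_means[v] <- mean(df[[v]])
}

## show (means are 2, 5, and 8)
col_means
```

Writing df$v inside the loop would look for a column literally named v, which is why the bracket form is used here.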
Quick exercise
Create two or three equal length vectors. Next, combine to create a data frame. Finally, change one value in the data frame (HINT: think about how you changed vector and matrix values before).
User-submitted packages are a huge part of what makes R great. Most of your scripts will make use of one or more packages.
As you’ve seen on the getting started page, packages can be installed from the official CRAN repository using:
install.packages('<package name>')
The default option also installs the dependencies the package needs (other packages that the package you want relies on to work properly). By default, R will check how you installed R and download the right file type for your operating system.
Quick exercise
Install the survey package, which we will use in a later module. Don’t forget to use single or double quotation marks around the package name.
Recently, people have begun sharing the source code for their R packages on GitHub. If you want to download a package from GitHub, either because it isn’t hosted on CRAN or because you want the newest development version, you can use the devtools package to get it (you will need git on your system, too):
library(devtools)
install_github('<github handle>/<repo name>')
Package libraries[4] can be loaded in a number of ways, but the easiest is to write:
library('<library name>')
where '<library name>' is the name of the package/library. You will need to load these before you can use their functions in your scripts. Typically, they are placed at the top of the script file.
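Two related patterns are worth knowing, shown here as a sketch rather than a required workflow. The package::function form calls a single function without loading the whole library, and requireNamespace() (a base R function) checks whether a package is installed before you try to load it. The survey package is used only because it comes up elsewhere in the workshop:

```r
## call one function directly: <package>::<function>
stats::median(c(1, 5, 8))

## load survey only if it is installed; otherwise print a reminder
if (requireNamespace('survey', quietly = TRUE)) {
  library('survey')
} else {
  message('survey is not installed; run install.packages("survey") first')
}
```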
Quick exercise
Load the tidyverse package, which you should have already installed. This will be a good test of the installation since we will use tidyverse libraries throughout the rest of the workshop.
Even I don’t have every R function and nuance memorized. With all the user-written packages, it would be difficult to keep up if I tried! When stuck, there are a few ways to get help.
In the console, typing a function name immediately after a question mark will bring up that function’s help file:
## get help file for function
?sum
Two question marks will search the help files of the packages you have installed for the term:
## search installed help files for term
??sum
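The question-mark shortcuts have longer function equivalents in base R, which can be handy inside scripts. A brief sketch:

```r
## same as ?sum
help('sum')

## same as ??sum
help.search('sum')
```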
Google is a coder’s best friend. If you are having a problem, odds are 1,000 other people have too and at least one of them has been brave (or foolhardy!) enough to ask about it in a forum like StackOverflow, CrossValidated, or the R-help mailing list. Google it!
Like all computing languages, R has its own structure and quirks. The idiomatic R approach to data analysis can be especially challenging at first for those who come to R from other common statistical packages or scripting languages, like SPSS, Python, and Stata.[5]
I came to R after learning Stata first, which I think is common for many researchers trained in econometric methods. For me and others who’ve made the same Stata-to-R transition, I think the root of many problems is the fundamental difference between how Stata and R operate. Whereas Stata is more of a procedural language in which commands do things in an environment (your data), R is more object-oriented in that data and functions are stored in variables or objects and await instructions that pertain to them.[6]
As pointed out by my friend and colleague Richard Blissett, users can see this difference in the command/function names in each language. Stata commands tend to be verbs: summarize, tabulate, and regress; on the other hand, R functions are often nouns: summary, table, and lm (for linear model). And so, common problems in the Stata to R switch such as
- I ran a model and didn’t get any output…
- How do I create local/global macros in R?
- Which of these data objects is the actual data?
- Why isn’t R doing anything?
may be due to misunderstanding this difference.[7]
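The first question in that list makes a good illustration. In the sketch below, which uses mtcars (a data set that ships with R) and a made-up object name fit, running the model prints nothing because the results are stored in an object. You then ask that object for what you want:

```r
## fit a linear model; nothing prints because the result is stored in fit
fit <- lm(mpg ~ wt, data = mtcars)

## the stored result is an object of class "lm"
class(fit)

## ask the object for the pieces you want
coef(fit)
summary(fit)
```

The model choice (mpg on wt) is arbitrary; any formula would show the same behavior.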
Like learning a new spoken language, constantly translating between your native tongue and the new language will only get you so far. To that end, I encourage native-Stata users to try to approach R without Stata procedures in mind (easier said than done, I know). That said, this document that shows the same analysis done in Stata and R side-by-side may be useful in the initial transition.
There are many other ways besides RStudio to run R. Below are just a few that, depending on your personal preferences and project needs, may be better or worse than RStudio.
- .r or .R script files, run from the command line with R CMD BATCH or Rscript
[1] The “data scientist” as a person/title, like “big data,” has probably become a little played out, but for lack of a better catch-all term, I think everyone knows what I mean.
[2] For a little more history on R, particularly its success as an open source project, see Fox (2009).
[3] R also supports arrays, which can take on more than two dimensions.
[4] For clarity, I’ll call them packages when talking about what is downloaded and libraries when discussing what is loaded into memory. Since the names are the same, it’s really a semantic difference.
[5] If you come to R knowing C/C++, Fortran, or Java, see Rcpp, rFortran, rJava for some cool interactivity.
[6] Stata has some object-oriented features and R some procedural programming behaviors, so the assigned labels aren’t perfect. They are mostly right, though.
[7] Full disclosure: all questions I asked when learning R.