First, download and install R from https://cran.r-project.org/.
Second, download and install RStudio Desktop (Open Source License) from https://www.rstudio.com/products/rstudio/download/ (scroll down the web page to see download links).
R is a high-level programming language and environment for statistical computing. RStudio is a very useful and commonly used development interface built on top of R. Both are free. Both are available on Windows, Mac OS X, and Ubuntu. This tutorial uses R v3.4.2 and RStudio Desktop v1.1.383, but you can use more recent versions if available.
If you are having trouble with installation, search online for the error messages you are getting; very likely someone has run into similar issues and a solution is posted online. If you are still having trouble, reach out to the creators of the software.
After installation, both R and RStudio will be available in your installation directory. However, you do not need to run R directly yourself: RStudio uses it on your behalf, and you only interact with RStudio.
Now, proceed and launch RStudio.
Once you have launched RStudio, you will see the following interface.
Create a project: File > New Project > Existing Directory > Browse … > Create Project (the directory you pick will be your working directory)
Now, open new R script file: File > New File > R script
Understand the layout
Alternatively, we could use a special R notebook (an R Markdown file) to mix code with text.
You can open one as follows: File > New File > R Notebook
An R notebook allows you to mix LaTeX equations, Markdown notation, and R code, which is very useful when creating reports. In fact, this tutorial is written in R notebook format. The only peculiarity is that you need to mark R code in a special way, as shown in the example. You can then compile the notebook into one of several file formats, with the code output embedded in the text.
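Concretely, in an R Markdown / notebook file an R code chunk is delimited by backtick fences with an `{r}` marker (a minimal illustrative sketch; the code inside is arbitrary):

```{r}
a <- 3/2
print(a)
```

When the notebook is compiled, the chunk is executed and its output is embedded directly below it in the rendered document.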
Below I provide a review of some key functionality. However, there is much more to R, and many useful commands are contained within specific packages. Use Google and Stack Overflow to your advantage: R has a great community, and if you have a question, it has very likely already been asked and answered online. Learning how to find such information is an important skill for any R user.
# hello world
print('Hello world')
> [1] "Hello world"
# variable assignments (note, R convention for assignment is '<-'; but '=' will also work)
# also note that by default the numbers are class 'numeric'
a <- 3/2
a
> [1] 1.5
class(a) # tells us the class of the object
> [1] "numeric"
# this is how you would create an integer -- rarely used
b <- as.integer(3/2)
b
> [1] 1
class(b)
> [1] "integer"
# what if we round a number down to the nearest integer?
k <- floor(3/2)
k
> [1] 1
class(k)
> [1] "numeric"
# and now up to the nearest integer
ceiling(3/2)
> [1] 2
round(3/2)
> [1] 2
# class character
class("s")
> [1] "character"
# turning numeric to character
as.character(3)
> [1] "3"
# and back
as.numeric('3')
> [1] 3
# creating a vector of numbers
d <- c(1,2,4,5)
d
> [1] 1 2 4 5
# notice, vector of numeric is class numeric too
class(d)
> [1] "numeric"
# creating a vector of strings
k <- c(1,'dog',2,'a',4,6)
k
> [1] "1" "dog" "2" "a" "4" "6"
class(k)
> [1] "character"
# converting character vector to numeric - NA missing values introduced
as.numeric(k)
> Warning: NAs introduced by coercion
> [1] 1 NA 2 NA 4 6
# checking for NA
is.na(as.numeric(k))
> Warning: NAs introduced by coercion
> [1] FALSE TRUE FALSE TRUE FALSE FALSE
# logical values
k <- c(TRUE, FALSE, T, F)
k
> [1] TRUE FALSE TRUE FALSE
class(k)
> [1] "logical"
# sequences
1:10
> [1] 1 2 3 4 5 6 7 8 9 10
seq(1,10)
> [1] 1 2 3 4 5 6 7 8 9 10
seq(1,10,2) # odd numbers
> [1] 1 3 5 7 9
# first element in a vector
d[1]
> [1] 1
# second element in a vector
d[2]
> [1] 2
# two ways to get the last element in a vector
d[length(d)]
> [1] 5
tail(d,1)
> [1] 5
# repeat vector three times
rep(d,3)
> [1] 1 2 4 5 1 2 4 5 1 2 4 5
# repeat every element of a vector three times
rep(d,each=3)
> [1] 1 1 1 2 2 2 4 4 4 5 5 5
# get unique elements
unique(rep(d,each=3))
> [1] 1 2 4 5
sum(d) # sum of elements
> [1] 12
prod(d) # product of elements
> [1] 40
mean(d) # mean
> [1] 3
sd(d) # standard deviation
> [1] 1.825742
var(d) # variance
> [1] 3.333333
sqrt(var(d)) # standard deviation
> [1] 1.825742
# log
log(d) # base e, also known as ln
> [1] 0.0000000 0.6931472 1.3862944 1.6094379
log10(d) # base 10
> [1] 0.00000 0.30103 0.60206 0.69897
# max value and index
max(d)
> [1] 5
which.max(d)
> [1] 4
# exp
exp(d)
> [1] 2.718282 7.389056 54.598150 148.413159
# taking to power
d^2
> [1] 1 4 16 25
# applying computation to a vector
sapply(d, function(x) x^2)
> [1] 1 4 16 25
# factor - vector that assumes a discrete set of values (levels)
v <- rep(c('1990','2000','2010'),100)
v
> [1] "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990"
> [11] "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000"
> ... (the pattern repeats; 300 elements in total)
v <- factor(v)
v
> [1] 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000
> ... (the pattern repeats; 300 elements in total)
> Levels: 1990 2000 2010
levels(v) # levels of a factor
> [1] "1990" "2000" "2010"
as.numeric(v) # reveals numeric index corresponding to level ordering
> [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2
> ... (the pattern repeats; 300 elements in total)
as.numeric(as.character(v)) # recovers level labels
> [1] 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000
> ... (the pattern repeats; 300 elements in total)
# matrix of zeros
matrix(0,4,4)
> [,1] [,2] [,3] [,4]
> [1,] 0 0 0 0
> [2,] 0 0 0 0
> [3,] 0 0 0 0
> [4,] 0 0 0 0
# diagonal identity matrix
diag(5)
> [,1] [,2] [,3] [,4] [,5]
> [1,] 1 0 0 0 0
> [2,] 0 1 0 0 0
> [3,] 0 0 1 0 0
> [4,] 0 0 0 1 0
> [5,] 0 0 0 0 1
# matrix of zeros
m <- matrix(0,1,4)
v <- rep(0,5)
class(m)
> [1] "matrix"
class(v)
> [1] "numeric"
# conversion between matrices and vectors
as.matrix(v)
> [,1]
> [1,] 0
> [2,] 0
> [3,] 0
> [4,] 0
> [5,] 0
as.vector(m)
> [1] 0 0 0 0
# set seed
set.seed(999)
# generate uniform [0,1] random variables
runif(5)
> [1] 0.38907138 0.58306072 0.09466569 0.85263123 0.78674676
# generate standard normal random variables
rnorm(5)
> [1] -1.1782812 -1.3986638 0.3040954 0.1284985 2.2420250
# create a vector of 12 uniform[0,1] random numbers and reshape it into a 4x3 matrix
v <- runif(12)
v
> [1] 0.78721294 0.16658469 0.82125595 0.13114195 0.06998445 0.90749130
> [7] 0.63380102 0.55328977 0.34208638 0.82607030 0.81951407 0.56849274
class(v)
> [1] "numeric"
matrix(v,4,3)
> [,1] [,2] [,3]
> [1,] 0.7872129 0.06998445 0.3420864
> [2,] 0.1665847 0.90749130 0.8260703
> [3,] 0.8212559 0.63380102 0.8195141
> [4,] 0.1311419 0.55328977 0.5684927
# or filling by row
m <- matrix(v,4,3,byrow=TRUE)
m
> [,1] [,2] [,3]
> [1,] 0.7872129 0.16658469 0.8212559
> [2,] 0.1311419 0.06998445 0.9074913
> [3,] 0.6338010 0.55328977 0.3420864
> [4,] 0.8260703 0.81951407 0.5684927
class(m)
> [1] "matrix"
# mean of a matrix
mean(m)
> [1] 0.5522438
# matrix transpose
t(m)
> [,1] [,2] [,3] [,4]
> [1,] 0.7872129 0.13114195 0.6338010 0.8260703
> [2,] 0.1665847 0.06998445 0.5532898 0.8195141
> [3,] 0.8212559 0.90749130 0.3420864 0.5684927
# matrix multiplication
md <- m%*%t(m)
md
> [,1] [,2] [,3] [,4]
> [1,] 1.3219160 0.8601776 0.8720464 1.2536898
> [2,] 0.8601776 0.8456365 0.4322800 0.6815879
> [3,] 0.8720464 0.4322800 0.8248564 1.1714666
> [4,] 1.2536898 0.6815879 1.1714666 1.6771795
# matrix element-wise product
m*m
> [,1] [,2] [,3]
> [1,] 0.61970422 0.027750460 0.6744613
> [2,] 0.01719821 0.004897823 0.8235405
> [3,] 0.40170373 0.306129572 0.1170231
> [4,] 0.68239214 0.671603316 0.3231840
# multiplication by a scalar
10*m
> [,1] [,2] [,3]
> [1,] 7.872129 1.6658469 8.212559
> [2,] 1.311419 0.6998445 9.074913
> [3,] 6.338010 5.5328977 3.420864
> [4,] 8.260703 8.1951407 5.684927
# size of a matrix
dim(m)
> [1] 4 3
# number of rows
dim(m)[1]
> [1] 4
# number of columns
dim(m)[2]
> [1] 3
# mean of each row
apply(m,1,mean)
> [1] 0.5916845 0.3695392 0.5097257 0.7380257
# mean of each column
apply(m,2,mean)
> [1] 0.5945566 0.4023432 0.6598316
# sum of each row
rowSums(m)
> [1] 1.775054 1.108618 1.529177 2.214077
apply(m,1,sum)
> [1] 1.775054 1.108618 1.529177 2.214077
# sum of each column
colSums(m)
> [1] 2.378226 1.609373 2.639326
apply(m,2,sum)
> [1] 2.378226 1.609373 2.639326
# first row of a matrix
m[1,]
> [1] 0.7872129 0.1665847 0.8212559
# second column of a matrix
m[,2]
> [1] 0.16658469 0.06998445 0.55328977 0.81951407
# specific element of a matrix
m[1,2]
> [1] 0.1665847
# matrix inverse
solve(m[1:3,1:3])
> [,1] [,2] [,3]
> [1,] 1.8049785 -1.5001301 -0.3536953
> [2,] -2.0018047 0.9482986 2.2901270
> [3,] -0.1064618 1.2455919 -0.1254987
# list of objects - unnamed
l1 <- list(5, "b", 12)
l1
> [[1]]
> [1] 5
>
> [[2]]
> [1] "b"
>
> [[3]]
> [1] 12
# getting back a vector of elements in a list
unlist(l1)
> [1] "5" "b" "12"
# accessing element in a list - note special format
l1[[1]]
> [1] 5
# named list
l2 <- list(a1=5, a2=c(2983, 1890), a3=c(3, 118))
l2
> $a1
> [1] 5
>
> $a2
> [1] 2983 1890
>
> $a3
> [1] 3 118
# getting named list value
l2$a1
> [1] 5
# applying power to each element of a list
lapply(l2, function(x) x^2)
> $a1
> [1] 25
>
> $a2
> [1] 8898289 3572100
>
> $a3
> [1] 9 13924
# unlisting named list
unlist(l2)
> a1 a21 a22 a31 a32
> 5 2983 1890 3 118
# taking mean over a list
mean(unlist(l2))
> [1] 999.8
# creating a data frame - list of vectors of the same length,
# it is structured like a table or matrix,
# individual variables are accessed via column names,
# individual observations - via row index
b1 <- c(NA, 22, 50)
b2 <- c("john", "rose", "john")
b3 <- c(TRUE, FALSE, TRUE)
df <- data.frame(b1, b2, b3)
# here is how we access columns
df$b1
> [1] NA 22 50
df$b2
> [1] john rose john
> Levels: john rose
# note that a vector of strings was automatically converted into a factor
# when data frame was created
class(df$b2)
> [1] "factor"
# if we do not want to convert string vectors to factors when creating a data frame,
# we simply add stringsAsFactors=FALSE option
df <- data.frame(b1, b2, b3, stringsAsFactors=FALSE)
class(df$b2) # now it is character
> [1] "character"
# renaming columns (variables) of a data frame
colnames(df)
> [1] "b1" "b2" "b3"
colnames(df) <- c('age','name','indicator')
colnames(df)
> [1] "age" "name" "indicator"
# second row
df[2,]
> age name indicator
> 2 22 rose FALSE
# second element of the indicator variable
df$indicator[2]
> [1] FALSE
df[2, "indicator"]
> [1] FALSE
df[2, 3]
> [1] FALSE
# slicing - first two rows and two named columns
df[1:2, c("age","indicator")]
> age indicator
> 1 NA TRUE
> 2 22 FALSE
# picking out elements of data frame based on two vectors of coordinates
df[cbind(1:2, c(1,3))]
> [1] NA "FALSE"
# filtering for rows that contain name "john" in name column
df[df$name=="john",]
> age name indicator
> 1 NA john TRUE
> 3 50 john TRUE
# keeping rows without missing values in age column
df[!is.na(df$age),]
> age name indicator
> 2 22 rose FALSE
> 3 50 john TRUE
# binding data frames together
rbind(df,df) # vertically (stacking rows)
> age name indicator
> 1 NA john TRUE
> 2 22 rose FALSE
> 3 50 john TRUE
> 4 NA john TRUE
> 5 22 rose FALSE
> 6 50 john TRUE
cbind(df,df) # horizontally (binding columns)
> age name indicator age name indicator
> 1 NA john TRUE NA john TRUE
> 2 22 rose FALSE 22 rose FALSE
> 3 50 john TRUE 50 john TRUE
# descriptive statistics
df <- data.frame(a=runif(10), b=runif(10), c=runif(10))
# mean
apply(df,2,mean)
> a b c
> 0.5557822 0.4620113 0.3673768
# variance
apply(df,2,var)
> a b c
> 0.06514680 0.07555668 0.09733104
# standard deviation
apply(df,2,sd)
> a b c
> 0.2552387 0.2748758 0.3119792
# loop
for (i in 1:5) {
print(i)
}
> [1] 1
> [1] 2
> [1] 3
> [1] 4
> [1] 5
# conditional statements
if (a > 1) {
print('hi')
} else if (a > 0) {
print('hello')
} else {
print('bye')
}
> [1] "hi"
# function
custom_mean <- function(v) {
# taking mean over a vector via a for-loop
temp <- 0
for (i in v) {
temp <- temp + i
}
return(temp/length(v))
}
v <- runif(1000000)
custom_mean(v)
> [1] 0.5001688
mean(v)
> [1] 0.5001688
# which one is faster?
s <- Sys.time()
custom_mean(v)
> [1] 0.5001688
e <- Sys.time()
e-s
> Time difference of 0.01963902 secs
s <- Sys.time()
mean(v)
> [1] 0.5001688
e <- Sys.time()
e-s
> Time difference of 0.002686977 secs
# as we see, native mean() function, which applies to an array as a whole, is much faster -
# loops, which access arrays elementwise, are very slow in R - avoid them at all costs!
# using mean() function instead of looping over an array is an example of vectorization -
# vectorize everything using R expressions - e.g., apply(), linear algebra notation, etc.
# vectorization is fast because R functions are often just an interface to fast low-level language code -
# for example, fast fourier transform calls C code
# (we can print the body of a function by typing its name without parentheses)
fft
> function (z, inverse = FALSE)
> .Call(C_fft, z, inverse)
> <bytecode: 0x7fed91845a40>
> <environment: namespace:stats>
# and here is how you can get help on the function
?fft
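As an aside, instead of taking differences of Sys.time() as above, the built-in system.time() reports the timing of an expression directly; here is a minimal self-contained sketch comparing a vectorized built-in against an explicit loop:

```r
# time a vectorized built-in versus an elementwise loop
v <- runif(1000000)
system.time(mean(v))        # vectorized - fast
system.time({
  s <- 0
  for (x in v) s <- s + x   # elementwise loop - slow
  s / length(v)
})
```

The returned object contains user, system, and elapsed times; the elapsed entry corresponds to the wall-clock difference measured manually above.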
Note, you need to install the following packages before you can load and use them:
One way to do this is to go to the Packages tab in the bottom-right pane of RStudio, click Install, enter the package names, and press Enter (you can also update packages from this tab). Alternatively, you can use the install.packages(c("gdata","ggplot2","ggthemes","scales","arm")) command.
# Note, you need to install packages below before loading them
# install.packages(c("gdata","ggplot2","ggthemes","scales","arm"))
suppressMessages(library("gdata")) # loads library and suppresses unnecessary output
library("ggplot2")
library("ggthemes")
library("scales")
suppressMessages(library("arm"))
# the project is based on rolling sales update provided by NYC
# http://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page
# data for Manhattan in excel format is available via this link
# http://www1.nyc.gov/assets/finance/downloads/pdf/rolling_sales/rollingsales_manhattan.xls
# download the data (5.3MB)
download.file("http://www1.nyc.gov/assets/finance/downloads/pdf/rolling_sales/rollingsales_manhattan.xls", destfile="manhattan_real_estate.xls")
# load the xls file into a data frame - this will take a moment
data <- read.xls("manhattan_real_estate.xls", header=TRUE, pattern="BOROUGH")
# run ?read.xls and you will find that 'pattern' tells read.xls to skip all lines until a line matching the requested pattern is found
# NOTE: the exact numerical results shown below will change over time, as sales log gets updated
# data frame
class(data)
> [1] "data.frame"
# size of data
dim(data)
> [1] 17223 21
# names of columns - could be overwritten if desired
colnames(data)
> [1] "BOROUGH" "NEIGHBORHOOD"
> [3] "BUILDING.CLASS.CATEGORY" "TAX.CLASS.AT.PRESENT"
> [5] "BLOCK" "LOT"
> [7] "EASE.MENT" "BUILDING.CLASS.AT.PRESENT"
> [9] "ADDRESS" "APARTMENT.NUMBER"
> [11] "ZIP.CODE" "RESIDENTIAL.UNITS"
> [13] "COMMERCIAL.UNITS" "TOTAL.UNITS"
> [15] "LAND.SQUARE.FEET" "GROSS.SQUARE.FEET"
> [17] "YEAR.BUILT" "TAX.CLASS.AT.TIME.OF.SALE"
> [19] "BUILDING.CLASS.AT.TIME.OF.SALE" "SALE.PRICE"
> [21] "SALE.DATE"
# see top 5 rows of the data
head(data, 5)
> BOROUGH NEIGHBORHOOD BUILDING.CLASS.CATEGORY
> 1 1 ALPHABET CITY 01 ONE FAMILY DWELLINGS
> 2 1 ALPHABET CITY 01 ONE FAMILY DWELLINGS
> 3 1 ALPHABET CITY 02 TWO FAMILY DWELLINGS
> 4 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS
> 5 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS
> TAX.CLASS.AT.PRESENT BLOCK LOT EASE.MENT BUILDING.CLASS.AT.PRESENT
> 1 1 390 61 NA A4
> 2 1 390 61 NA A4
> 3 1 390 35 NA B1
> 4 2A 390 54 NA C3
> 5 2B 390 64 NA C4
> ADDRESS APARTMENT.NUMBER ZIP.CODE RESIDENTIAL.UNITS
> 1 189 EAST 7TH STREET 10009 1
> 2 189 EAST 7TH STREET 10009 1
> 3 113 AVENUE C 10009 2
> 4 203 EAST 7TH STREET 10009 4
> 5 187 EAST 7TH STREET 10009 8
> COMMERCIAL.UNITS TOTAL.UNITS LAND.SQUARE.FEET GROSS.SQUARE.FEET
> 1 0 1 987 2,183
> 2 0 1 987 2,183
> 3 0 2 1,218 4,764
> 4 0 4 1,950 5,446
> 5 2 10 1,642 5,220
> YEAR.BUILT TAX.CLASS.AT.TIME.OF.SALE BUILDING.CLASS.AT.TIME.OF.SALE
> 1 1860 1 A4
> 2 1860 1 A4
> 3 1899 1 B1
> 4 2001 2 C3
> 5 1910 2 C4
> SALE.PRICE SALE.DATE
> 1 $ 4,844,809 2018-05-22
> 2 $ 0 2018-05-23
> 3 $ 0 2018-04-25
> 4 $ 6,250,000 2018-05-09
> 5 $ 2,400,000 2018-05-18
# what class?
class(data$SALE.PRICE)
> [1] "factor"
class(data$YEAR.BUILT)
> [1] "integer"
# transform factor into numeric
data$YEAR.BUILT <- as.numeric(as.character(data$YEAR.BUILT))
# transform factor into numeric, removing commas and dollar signs on the way
data$SALE.PRICE <- as.numeric(gsub("\\$ ","",gsub(",","",as.character(data$SALE.PRICE))))
# transform into factor to make it a categorical variable
data$TAX.CLASS.AT.TIME.OF.SALE <- as.factor(data$TAX.CLASS.AT.TIME.OF.SALE)
data <- data[!is.na(data$YEAR.BUILT),] # remove observations with a missing year of construction
data <- data[data$YEAR.BUILT>=1900,] # only keep observations after 1900
data <- data[data$SALE.PRICE>0,] # filter out zero prices (e.g., due to inheritance/gifts)
# check size of the data again
dim(data)
> [1] 11851 21
# we could now save the data, dropping row index
write.csv(data,file='clean_data.csv',row.names=FALSE)
# and here is how we could read it back in
head(read.csv(file = "clean_data.csv", head=TRUE))
> BOROUGH NEIGHBORHOOD BUILDING.CLASS.CATEGORY
> 1 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS
> 2 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS
> 3 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS
> 4 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS
> 5 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS
> 6 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS
> TAX.CLASS.AT.PRESENT BLOCK LOT EASE.MENT BUILDING.CLASS.AT.PRESENT
> 1 2A 390 54 NA C3
> 2 2B 390 64 NA C4
> 3 2 393 47 NA C4
> 4 2 398 33 NA C7
> 5 2 399 26 NA C7
> 6 2 400 54 NA C7
> ADDRESS APARTMENT.NUMBER ZIP.CODE RESIDENTIAL.UNITS
> 1 203 EAST 7TH STREET 10009 4
> 2 187 EAST 7TH STREET 10009 8
> 3 377 EAST 10TH 10009 12
> 4 28 AVENUE B 10009 16
> 5 234 EAST 4TH STREET 10009 28
> 6 207 EAST 4TH STREET 10009 8
> COMMERCIAL.UNITS TOTAL.UNITS LAND.SQUARE.FEET GROSS.SQUARE.FEET
> 1 0 4 1,950 5,446
> 2 2 10 1,642 5,220
> 3 0 12 2,370 10,715
> 4 1 17 1,933 8,699
> 5 3 31 4,616 18,690
> 6 3 11 2,364 7,166
> YEAR.BUILT TAX.CLASS.AT.TIME.OF.SALE BUILDING.CLASS.AT.TIME.OF.SALE
> 1 2001 2 C3
> 2 1910 2 C4
> 3 1900 2 C4
> 4 1900 2 C7
> 5 1900 2 C7
> 6 1900 2 C7
> SALE.PRICE SALE.DATE
> 1 6250000 2018-05-09
> 2 2400000 2018-05-18
> 3 954623 2018-06-04
> 4 5000000 2018-03-05
> 5 7906412 2018-03-05
> 6 3000000 2018-05-18
median(data$SALE.PRICE) # median sales price
> [1] 1110000
mean(data$SALE.PRICE) # mean
> [1] 3624782
max(data$SALE.PRICE) # max
> [1] 2397501899
aggregate(cbind(YEAR.BUILT, SALE.PRICE) ~ TAX.CLASS.AT.TIME.OF.SALE, data=data, FUN=median) # get median of variables by group
> TAX.CLASS.AT.TIME.OF.SALE YEAR.BUILT SALE.PRICE
> 1 1 1910 3911000
> 2 2 1956 1070000
> 3 4 1932 6000000
aggregate(cbind(YEAR.BUILT, SALE.PRICE) ~ TAX.CLASS.AT.TIME.OF.SALE, data=data, FUN=length) # count of rows within each group
> TAX.CLASS.AT.TIME.OF.SALE YEAR.BUILT SALE.PRICE
> 1 1 156 156
> 2 2 11316 11316
> 3 4 379 379
# other libraries to explore on your own - plyr, dplyr, reshape2, lubridate
# simple sales price histogram (log-transformed)
hist(log10(data$SALE.PRICE))
# ggplot2 sales prices histogram
ggplot(data, aes(SALE.PRICE)) + geom_histogram(aes(fill=..count..),alpha=0.6,bins=50) +
theme_bw() + ggtitle("Histogram of Property Sales, Manhattan", subtitle = NULL) +
scale_fill_gradient("Count",low="orange",high="red") + scale_y_continuous("Count") +
scale_x_log10(breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x)) ) +
xlab("Sales Price (log-scaled)")
# pearson (linear) correlation between logged prices and year when the building was constructed
cor(log10(data$SALE.PRICE),data$YEAR.BUILT)
> [1] 0.07713556
# spearman rank-order correlation (here a log transformation would not change the result, because the correlation is measured in terms of rank orderings, which log preserves)
cor(data$SALE.PRICE,data$YEAR.BUILT, method="spearman")
> [1] 0.1138932
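The rank-invariance claim is easy to check directly: applying a strictly increasing transformation such as log10 to a positive variable leaves its ranks, and hence the Spearman correlation, unchanged (a minimal self-contained sketch):

```r
set.seed(123)
x <- runif(100, 1, 100)   # positive values, so log10 is defined
y <- runif(100, 1, 100)
s1 <- cor(x, y, method = "spearman")
s2 <- cor(log10(x), y, method = "spearman")
all.equal(s1, s2)  # TRUE - log preserves the rank ordering
```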
# run simple regression
model <- lm(data=data, log10(SALE.PRICE) ~ YEAR.BUILT)
# plot summary of estimates
summary(model)
>
> Call:
> lm(formula = log10(SALE.PRICE) ~ YEAR.BUILT, data = data)
>
> Residuals:
> Min 1Q Median 3Q Max
> -6.1606 -0.2570 -0.0381 0.2822 3.3743
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 3.2678142 0.3334360 9.800 <2e-16 ***
> YEAR.BUILT 0.0014371 0.0001706 8.422 <2e-16 ***
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 0.6665 on 11849 degrees of freedom
> Multiple R-squared: 0.00595, Adjusted R-squared: 0.005866
> F-statistic: 70.92 on 1 and 11849 DF, p-value: < 2.2e-16
# we can interpret the regression results as follows:
# increasing construction year by 1, e.g. from 2000 to 2001,
# increases log10(sales price) by the YEAR.BUILT coefficient: 0.0014371 (the exact value may change as the data gets updated),
# that is, it increases the sales price by 100*(10^0.0014371-1) = 0.33%
# see details on interpretation here: http://www.cazaar.com/ta/econ113/interpreting-beta
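The percentage effect quoted above can be computed directly from the fitted model object (a sketch; the coefficient, and hence the result, will vary as the sales log gets updated):

```r
b1 <- coef(model)["YEAR.BUILT"]  # slope on the log10 scale
100 * (10^b1 - 1)                # percent change in sale price per extra year
```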
# plot regression line - relationship between log10(sales price) and year built
# color points by tax class
p <- ggplot(data, aes(YEAR.BUILT,SALE.PRICE,color=TAX.CLASS.AT.TIME.OF.SALE)) +
geom_point(alpha=0.3) +
theme_bw() + xlab("Year of Construction") + scale_y_log10(
breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x)) )
# add regression line
# setting intercept of the line based on beta0, slope of the line based on beta1
p + geom_abline(intercept = coef(model)[1], slope = coef(model)[2]) +
# annotating
annotate(label = sprintf("log10(y) = %.5f + %.5fx\nR² = %.3f", coef(model)[1],coef(model)[2], summary(model)$r.squared), geom = "text", x = 1975, y = 10^8, size = 4)
# regression diagnostic plots -
# notice, for example, that qqplot shows deviations from assumption that
# residuals are normally distributed - so we should take care interpreting
# the p-values and statistical significance results
plot(model)
# finally, let us build a large model with dummy variables and
# visualize its coefficients and confidence intervals
model <- lm(data=data, log10(SALE.PRICE) ~ YEAR.BUILT + NEIGHBORHOOD)
names(model$coefficients)<-gsub('NEIGHBORHOOD','',names(model$coefficients))
summary(model)
>
> Call:
> lm(formula = log10(SALE.PRICE) ~ YEAR.BUILT + NEIGHBORHOOD, data = data)
>
> Residuals:
> Min 1Q Median 3Q Max
> -6.4125 -0.2366 -0.0131 0.2565 3.3917
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 3.3134277 0.3459731 9.577 < 2e-16 ***
> YEAR.BUILT 0.0013217 0.0001761 7.505 6.60e-14 ***
> CHELSEA 0.1567136 0.0721440 2.172 0.029858 *
> CHINATOWN 0.2311274 0.1017332 2.272 0.023111 *
> CIVIC CENTER 0.5817345 0.0903577 6.438 1.26e-10 ***
> CLINTON 0.1627447 0.0774525 2.101 0.035643 *
> EAST VILLAGE 0.3221223 0.0814833 3.953 7.76e-05 ***
> FASHION 0.5702639 0.0880361 6.478 9.69e-11 ***
> FINANCIAL 0.0979759 0.0764937 1.281 0.200276
> FLATIRON 0.4002358 0.0755453 5.298 1.19e-07 ***
> GRAMERCY 0.1175131 0.0742874 1.582 0.113705
> GREENWICH VILLAGE-CENTRAL 0.2969894 0.0728642 4.076 4.61e-05 ***
> GREENWICH VILLAGE-WEST 0.3787139 0.0726018 5.216 1.86e-07 ***
> HARLEM-CENTRAL -0.1115705 0.0724697 -1.540 0.123698
> HARLEM-EAST 0.0535489 0.0945828 0.566 0.571297
> HARLEM-UPPER -0.0094921 0.0889054 -0.107 0.914976
> HARLEM-WEST -0.0744420 0.1337394 -0.557 0.577798
> INWOOD -0.1748163 0.0926124 -1.888 0.059103 .
> JAVITS CENTER -0.0914308 0.1504075 -0.608 0.543274
> KIPS BAY -0.0022053 0.0782166 -0.028 0.977508
> LITTLE ITALY 0.8010468 0.1478375 5.418 6.13e-08 ***
> LOWER EAST SIDE 0.1800874 0.0770071 2.339 0.019374 *
> MANHATTAN VALLEY 0.1205662 0.0823851 1.463 0.143372
> MIDTOWN CBD 0.4733586 0.0902934 5.242 1.61e-07 ***
> MIDTOWN EAST 0.0503657 0.0689212 0.731 0.464932
> MIDTOWN WEST 0.0270263 0.0717714 0.377 0.706506
> MORNINGSIDE HEIGHTS -0.0870946 0.1063159 -0.819 0.412685
> MURRAY HILL -0.0086453 0.0732270 -0.118 0.906020
> ROOSEVELT ISLAND 0.0882947 0.1534173 0.576 0.564951
> SOHO 0.4462584 0.0812644 5.491 4.07e-08 ***
> SOUTHBRIDGE 0.1130535 0.0908043 1.245 0.213147
> TRIBECA 0.4450179 0.0769254 5.785 7.43e-09 ***
> UPPER EAST SIDE (59-79) 0.1971894 0.0679977 2.900 0.003739 **
> UPPER EAST SIDE (79-96) 0.2311823 0.0680871 3.395 0.000688 ***
> UPPER EAST SIDE (96-110) 0.2685674 0.1264986 2.123 0.033767 *
> UPPER WEST SIDE (59-79) 0.2592869 0.0684896 3.786 0.000154 ***
> UPPER WEST SIDE (79-96) 0.2428071 0.0702259 3.458 0.000547 ***
> UPPER WEST SIDE (96-116) 0.1963852 0.0753441 2.607 0.009159 **
> WASHINGTON HEIGHTS LOWER 0.0558094 0.0946432 0.590 0.555415
> WASHINGTON HEIGHTS UPPER -0.0904123 0.0788283 -1.147 0.251424
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 0.6498 on 11811 degrees of freedom
> Multiple R-squared: 0.05828, Adjusted R-squared: 0.05517
> F-statistic: 18.74 on 39 and 11811 DF, p-value: < 2.2e-16
coefplot(model,mar=c(0.5,20,2,5), main="") # bottom, left, top, and right margins