First, download and install R from https://cran.r-project.org/.
Second, download and install RStudio Desktop (Open Source License) from https://www.rstudio.com/products/rstudio/download/ (scroll down the web page to see download links).
R is a high-level programming language and environment for statistical computing. RStudio is a very useful and commonly used development interface built on top of R. Both are free. Both are available on Windows, Mac OS X, and Ubuntu. This tutorial uses R v3.4.2 and RStudio Desktop v1.1.383, but you can use more recent versions if available.
If you are having trouble with installation, search online for the error messages you are getting; very likely someone has run into similar issues and a solution is posted online. If you are still having trouble, reach out to the creators of the software.
After installation, both R and RStudio will be available in your installation directory. However, you do not need to run R directly yourself: RStudio uses it on your behalf, and you only interact with RStudio.
Now, proceed and launch RStudio.
Once you have launched RStudio, you will see the following interface.
Create a project: File > New Project > Existing Directory > Browse … > Create Project (the directory you pick will be your working directory)
Now, open new R script file: File > New File > R script
Understand the layout
Alternatively, we could use a special R notebook (an R Markdown file) to mix code with text.
You can open one as follows: File > New File > R Notebook
An R notebook allows you to mix LaTeX equations, Markdown notation, and R code, which is very useful when creating reports. In fact, this tutorial is written in R notebook format. The only peculiarity is that you need to mark R code in a special way, as shown in the example. You can then compile the notebook into one of several file formats, with the code output embedded in the text.
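Concretely, in an R Markdown / notebook file an R code chunk is delimited by backtick fences with an `{r}` marker (a minimal illustrative sketch; the code inside is arbitrary):

```{r}
a <- 3/2
print(a)
```

When the notebook is compiled, the chunk is executed and its output is embedded directly below it in the rendered document.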
Below I provide a review of some key functionality. However, there is much more to R, and many useful commands are contained within specific packages. Use Google and Stack Overflow to your advantage: R has a great community, and if you have a question, it has very likely already been asked and answered online. Learning how to find such information is an important skill for any R user.
# hello world
print('Hello world')
> [1] "Hello world"
# variable assignments (note, R convention for assignment is '<-'; but '=' will also work)
# also note that by default the numbers are class 'numeric'
a <- 3/2
a
> [1] 1.5
class(a) # tells us the class of the object
> [1] "numeric"
# this is how you would create an integer -- rarely used
b <- as.integer(3/2)
b
> [1] 1
class(b)
> [1] "integer"
# what if we round a number down to the nearest integer?
k <- floor(3/2)
k
> [1] 1
class(k)
> [1] "numeric"
# and now up to the nearest integer
ceiling(3/2)
> [1] 2
round(3/2)
> [1] 2
# class character
class("s")
> [1] "character"
# turning numeric to character
as.character(3)
> [1] "3"
# and back
as.numeric('3')
> [1] 3
# creating a vector of numbers
d <- c(1,2,4,5)
d
> [1] 1 2 4 5
# notice, vector of numeric is class numeric too
class(d)
> [1] "numeric"
# creating a vector of strings
k <- c(1,'dog',2,'a',4,6)
k
> [1] "1" "dog" "2" "a" "4" "6"
class(k)
> [1] "character"
# converting character vector to numeric - NA missing values introduced
as.numeric(k)
> Warning: NAs introduced by coercion
> [1] 1 NA 2 NA 4 6
# checking for NA
is.na(as.numeric(k))
> Warning: NAs introduced by coercion
> [1] FALSE TRUE FALSE TRUE FALSE FALSE
# logical values
k <- c(TRUE, FALSE, T, F)
k
> [1] TRUE FALSE TRUE FALSE
class(k)
> [1] "logical"
# sequences
1:10
> [1] 1 2 3 4 5 6 7 8 9 10
seq(1,10)
> [1] 1 2 3 4 5 6 7 8 9 10
seq(1,10,2) # odd numbers
> [1] 1 3 5 7 9
# first element in a vector
d[1]
> [1] 1
# second element in a vector
d[2]
> [1] 2
# two ways to get the last element in a vector
d[length(d)]
> [1] 5
tail(d,1)
> [1] 5
# repeat vector three times
rep(d,3)
> [1] 1 2 4 5 1 2 4 5 1 2 4 5
# repeat every element of a vector three times
rep(d,each=3)
> [1] 1 1 1 2 2 2 4 4 4 5 5 5
# get unique elements
unique(rep(d,each=3))
> [1] 1 2 4 5
sum(d) # sum of elements
> [1] 12
prod(d) # product of elements
> [1] 40
mean(d) # mean
> [1] 3
sd(d) # standard deviation
> [1] 1.825742
var(d) # variance
> [1] 3.333333
sqrt(var(d)) # standard deviation
> [1] 1.825742
# log
log(d) # base e, also known as ln
> [1] 0.0000000 0.6931472 1.3862944 1.6094379
log10(d) # base 10
> [1] 0.00000 0.30103 0.60206 0.69897
# max value and index
max(d)
> [1] 5
which.max(d)
> [1] 4
# exp
exp(d)
> [1] 2.718282 7.389056 54.598150 148.413159
# taking to power
d^2
> [1] 1 4 16 25
# applying computation to a vector
sapply(d, function(x) x^2)
> [1] 1 4 16 25
# factor - vector that assumes a discrete set of values (levels)
v <- rep(c('1990','2000','2010'),100)
v
> [1] "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990"
> [11] "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000"
> ... (the pattern repeats; 300 elements in total)
v <- factor(v)
v
> [1] 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000
> ... (the pattern repeats; 300 elements in total)
> Levels: 1990 2000 2010
levels(v) # levels of a factor
> [1] "1990" "2000" "2010"
as.numeric(v) # reveals numeric index corresponding to level ordering
> [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2
> ... (the pattern repeats; 300 elements in total)
as.numeric(as.character(v)) # recovers level labels
> [1] 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000
> ... (the pattern repeats; 300 elements in total)
# matrix of zeros
matrix(0,4,4)
> [,1] [,2] [,3] [,4]
> [1,] 0 0 0 0
> [2,] 0 0 0 0
> [3,] 0 0 0 0
> [4,] 0 0 0 0
# diagonal identity matrix
diag(5)
> [,1] [,2] [,3] [,4] [,5]
> [1,] 1 0 0 0 0
> [2,] 0 1 0 0 0
> [3,] 0 0 1 0 0
> [4,] 0 0 0 1 0
> [5,] 0 0 0 0 1
# matrix of zeros
m <- matrix(0,1,4)
v <- rep(0,5)
class(m)
> [1] "matrix"
class(v)
> [1] "numeric"
# conversion between matrices and vectors
as.matrix(v)
> [,1]
> [1,] 0
> [2,] 0
> [3,] 0
> [4,] 0
> [5,] 0
as.vector(m)
> [1] 0 0 0 0
# set seed
set.seed(999)
# generate uniform [0,1] random variables
runif(5)
> [1] 0.38907138 0.58306072 0.09466569 0.85263123 0.78674676
# generate standard normal random variables
rnorm(5)
> [1] -1.1782812 -1.3986638 0.3040954 0.1284985 2.2420250
# create a vector of 12 uniform[0,1] random numbers and reshape it into a 4x3 matrix
v <- runif(12)
v
> [1] 0.78721294 0.16658469 0.82125595 0.13114195 0.06998445 0.90749130
> [7] 0.63380102 0.55328977 0.34208638 0.82607030 0.81951407 0.56849274
class(v)
> [1] "numeric"
matrix(v,4,3)
> [,1] [,2] [,3]
> [1,] 0.7872129 0.06998445 0.3420864
> [2,] 0.1665847 0.90749130 0.8260703
> [3,] 0.8212559 0.63380102 0.8195141
> [4,] 0.1311419 0.55328977 0.5684927
# or filling by row
m <- matrix(v,4,3,byrow=TRUE)
m
> [,1] [,2] [,3]
> [1,] 0.7872129 0.16658469 0.8212559
> [2,] 0.1311419 0.06998445 0.9074913
> [3,] 0.6338010 0.55328977 0.3420864
> [4,] 0.8260703 0.81951407 0.5684927
class(m)
> [1] "matrix"
# mean of a matrix
mean(m)
> [1] 0.5522438
# matrix transpose
t(m)
> [,1] [,2] [,3] [,4]
> [1,] 0.7872129 0.13114195 0.6338010 0.8260703
> [2,] 0.1665847 0.06998445 0.5532898 0.8195141
> [3,] 0.8212559 0.90749130 0.3420864 0.5684927
# matrix multiplication
md <- m%*%t(m)
md
> [,1] [,2] [,3] [,4]
> [1,] 1.3219160 0.8601776 0.8720464 1.2536898
> [2,] 0.8601776 0.8456365 0.4322800 0.6815879
> [3,] 0.8720464 0.4322800 0.8248564 1.1714666
> [4,] 1.2536898 0.6815879 1.1714666 1.6771795
# matrix element-wise product
m*m
> [,1] [,2] [,3]
> [1,] 0.61970422 0.027750460 0.6744613
> [2,] 0.01719821 0.004897823 0.8235405
> [3,] 0.40170373 0.306129572 0.1170231
> [4,] 0.68239214 0.671603316 0.3231840
# multiplication by a scalar
10*m
> [,1] [,2] [,3]
> [1,] 7.872129 1.6658469 8.212559
> [2,] 1.311419 0.6998445 9.074913
> [3,] 6.338010 5.5328977 3.420864
> [4,] 8.260703 8.1951407 5.684927
# size of a matrix
dim(m)
> [1] 4 3
# number of rows
dim(m)[1]
> [1] 4
# number of columns
dim(m)[2]
> [1] 3
# mean of each row
apply(m,1,mean)
> [1] 0.5916845 0.3695392 0.5097257 0.7380257
# mean of each column
apply(m,2,mean)
> [1] 0.5945566 0.4023432 0.6598316
# sum of each row
rowSums(m)
> [1] 1.775054 1.108618 1.529177 2.214077
apply(m,1,sum)
> [1] 1.775054 1.108618 1.529177 2.214077
# sum of each column
colSums(m)
> [1] 2.378226 1.609373 2.639326
apply(m,2,sum)
> [1] 2.378226 1.609373 2.639326
# first row of a matrix
m[1,]
> [1] 0.7872129 0.1665847 0.8212559
# second column of a matrix
m[,2]
> [1] 0.16658469 0.06998445 0.55328977 0.81951407
# specific element of a matrix
m[1,2]
> [1] 0.1665847
# matrix inverse
solve(m[1:3,1:3])
> [,1] [,2] [,3]
> [1,] 1.8049785 -1.5001301 -0.3536953
> [2,] -2.0018047 0.9482986 2.2901270
> [3,] -0.1064618 1.2455919 -0.1254987
# list of objects - unnamed
l1 <- list(5, "b", 12)
l1
> [[1]]
> [1] 5
>
> [[2]]
> [1] "b"
>
> [[3]]
> [1] 12
# getting back a vector of elements in a list
unlist(l1)
> [1] "5" "b" "12"
# accessing element in a list - note special format
l1[[1]]
> [1] 5
# named list
l2 <- list(a1=5, a2=c(2983, 1890), a3=c(3, 118))
l2
> $a1
> [1] 5
>
> $a2
> [1] 2983 1890
>
> $a3
> [1] 3 118
# getting named list value
l2$a1
> [1] 5
# applying power to each element of a list
lapply(l2, function(x) x^2)
> $a1
> [1] 25
>
> $a2
> [1] 8898289 3572100
>
> $a3
> [1] 9 13924
# unlisting named list
unlist(l2)
> a1 a21 a22 a31 a32
> 5 2983 1890 3 118
# taking mean over a list
mean(unlist(l2))
> [1] 999.8
# creating a data frame - list of vectors of the same length,
# it is structured like a table or matrix,
# individual variables are accessed via column names,
# individual observations - via row index
b1 <- c(NA, 22, 50)
b2 <- c("john", "rose", "john")
b3 <- c(TRUE, FALSE, TRUE)
df <- data.frame(b1, b2, b3)
# here is how we access columns
df$b1
> [1] NA 22 50
df$b2
> [1] john rose john
> Levels: john rose
# note that a vector of strings was automatically converted into a factor
# when data frame was created
class(df$b2)
> [1] "factor"
# if we do not want to convert string vectors to factors when creating a data frame,
# we simply add stringsAsFactors=FALSE option
df <- data.frame(b1, b2, b3, stringsAsFactors=FALSE)
class(df$b2) # now it is character
> [1] "character"
# renaming columns (variables) of a data frame
colnames(df)
> [1] "b1" "b2" "b3"
colnames(df) <- c('age','name','indicator')
colnames(df)
> [1] "age" "name" "indicator"
# second row
df[2,]
> age name indicator
> 2 22 rose FALSE
# second element of the indicator variable
df$indicator[2]
> [1] FALSE
df[2, "indicator"]
> [1] FALSE
df[2, 3]
> [1] FALSE
# slicing - first two rows and two named columns
df[1:2, c("age","indicator")]
> age indicator
> 1 NA TRUE
> 2 22 FALSE
# picking out elements of data frame based on two vectors of coordinates
df[cbind(1:2, c(1,3))]
> [1] NA "FALSE"
# filtering for rows that contain name "john" in name column
df[df$name=="john",]
> age name indicator
> 1 NA john TRUE
> 3 50 john TRUE
# keeping rows without missing values in age column
df[!is.na(df$age),]
> age name indicator
> 2 22 rose FALSE
> 3 50 john TRUE
# binding data frames together
rbind(df,df) # vertically (stacking rows)
> age name indicator
> 1 NA john TRUE
> 2 22 rose FALSE
> 3 50 john TRUE
> 4 NA john TRUE
> 5 22 rose FALSE
> 6 50 john TRUE
cbind(df,df) # horizontally (binding columns)
> age name indicator age name indicator
> 1 NA john TRUE NA john TRUE
> 2 22 rose FALSE 22 rose FALSE
> 3 50 john TRUE 50 john TRUE
# descriptive statistics
df <- data.frame(a=runif(10), b=runif(10), c=runif(10))
# mean
apply(df,2,mean)
> a b c
> 0.5557822 0.4620113 0.3673768
# variance
apply(df,2,var)
> a b c
> 0.06514680 0.07555668 0.09733104
# standard deviation
apply(df,2,sd)
> a b c
> 0.2552387 0.2748758 0.3119792
# loop
for (i in 1:5) {
print(i)
}
> [1] 1
> [1] 2
> [1] 3
> [1] 4
> [1] 5
# conditional statements
if (a > 1) {
print('hi')
} else if (a > 0) {
print('hello')
} else {
print('bye')
}
> [1] "hi"
# function
custom_mean <- function(v) {
# taking mean over a vector via a for-loop
temp <- 0
for (i in v) {
temp <- temp + i
}
return(temp/length(v))
}
v <- runif(1000000)
custom_mean(v)
> [1] 0.5001688
mean(v)
> [1] 0.5001688
# which one is faster?
s <- Sys.time()
custom_mean(v)
> [1] 0.5001688
e <- Sys.time()
e-s
> Time difference of 0.01963902 secs
s <- Sys.time()
mean(v)
> [1] 0.5001688
e <- Sys.time()
e-s
> Time difference of 0.002686977 secs
# as we see, native mean() function, which applies to an array as a whole, is much faster -
# loops, which access arrays elementwise, are very slow in R - avoid them at all costs!
# using mean() function instead of looping over an array is an example of vectorization -
# vectorize everything using R expressions - e.g., apply(), linear algebra notation, etc.
# vectorization is fast because R functions are often just an interface to fast low-level language code -
# for example, fast fourier transform calls C code
# (we can print the body of a function by typing its name without parentheses)
fft
> function (z, inverse = FALSE)
> .Call(C_fft, z, inverse)
> <bytecode: 0x7fed91845a40>
> <environment: namespace:stats>
# and here is how you can get help on the function
?fft
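As an aside, instead of taking differences of Sys.time() as above, the built-in system.time() reports the timing of an expression directly; here is a minimal self-contained sketch comparing a vectorized built-in against an explicit loop:

```r
# time a vectorized built-in versus an elementwise loop
v <- runif(1000000)
system.time(mean(v))        # vectorized - fast
system.time({
  s <- 0
  for (x in v) s <- s + x   # elementwise loop - slow
  s / length(v)
})
```

The returned object contains user, system, and elapsed times; the elapsed entry corresponds to the wall-clock difference measured manually above.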
Note, you need to install the following packages before you can load and use them:
One way to do this is to go to the Packages tab in the bottom-right pane of RStudio, click Install, enter the package names, and press Enter (you can also update packages from this tab). Alternatively, you can use the install.packages(c("gdata","ggplot2","ggthemes","scales","arm")) command.
# Note, you need to install packages below before loading them
# install.packages(c("gdata","ggplot2","ggthemes","scales","arm"))
suppressMessages(library("gdata")) # loads library and suppresses unnecessary output
library("ggplot2")
library("ggthemes")
library("scales")
suppressMessages(library("arm"))
# the project is based on rolling sales update provided by NYC
# http://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page
# data for Manhattan in excel format is available via this link
# http://www1.nyc.gov/assets/finance/downloads/pdf/rolling_sales/rollingsales_manhattan.xls
# download the data (5.3MB)
download.file("http://www1.nyc.gov/assets/finance/downloads/pdf/rolling_sales/rollingsales_manhattan.xls", destfile="manhattan_real_estate.xls")
# load the xls file into a data frame - this will take a moment
data <- read.xls("manhattan_real_estate.xls", header=TRUE, pattern="BOROUGH")
# run ?read.xls and you will find that 'pattern' tells read.xls to skip all lines until a line matching the requested pattern is found
# NOTE: the exact numerical results shown below will change over time, as sales log gets updated
# data frame
class(data)
> [1] "data.frame"
# size of data
dim(data)
> [1] 17223 21
# names of columns - could be overwritten if desired
colnames(data)
> [1] "BOROUGH" "NEIGHBORHOOD"
> [3] "BUILDING.CLASS.CATEGORY" "TAX.CLASS.AT.PRESENT"
> [5] "BLOCK" "LOT"
> [7] "EASE.MENT" "BUILDING.CLASS.AT.PRESENT"
> [9] "ADDRESS" "APARTMENT.NUMBER"
> [11] "ZIP.CODE" "RESIDENTIAL.UNITS"
> [13] "COMMERCIAL.UNITS" "TOTAL.UNITS"
> [15] "LAND.SQUARE.FEET" "GROSS.SQUARE.FEET"
> [17] "YEAR.BUILT" "TAX.CLASS.AT.TIME.OF.SALE"
> [19] "BUILDING.CLASS.AT.TIME.OF.SALE" "SALE.PRICE"
> [21] "SALE.DATE"
# see top 5 rows of the data
head(data, 5)
> BOROUGH NEIGHBORHOOD BUILDING.CLASS.CATEGORY
> 1 1 ALPHABET CITY 01 ONE FAMILY DWELLINGS
> 2 1 ALPHABET CITY 01 ONE FAMILY DWELLINGS
> 3 1 ALPHABET CITY 02 TWO FAMILY DWELLINGS
> 4 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS
> 5 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS
> TAX.CLASS.AT.PRESENT BLOCK LOT EASE.MENT BUILDING.CLASS.AT.PRESENT
> 1 1 390 61 NA A4
> 2 1 390 61 NA A4
> 3 1 390 35 NA B1
> 4 2A 390 54 NA C3
> 5 2B 390 64 NA C4
> ADDRESS APARTMENT.NUMBER ZIP.CODE RESIDENTIAL.UNITS
> 1 189 EAST 7TH STREET 10009 1
> 2 189 EAST 7TH STREET 10009 1
> 3 113 AVENUE C 10009 2
> 4 203 EAST 7TH STREET 10009 4
> 5 187 EAST 7TH STREET 10009 8
> COMMERCIAL.UNITS TOTAL.UNITS LAND.SQUARE.FEET GROSS.SQUARE.FEET
> 1 0 1 987 2,183
> 2 0 1 987 2,183
> 3 0 2 1,218 4,764
> 4 0 4 1,950 5,446
> 5 2 10 1,642 5,220
> YEAR.BUILT TAX.CLASS.AT.TIME.OF.SALE BUILDING.CLASS.AT.TIME.OF.SALE
> 1 1860 1 A4
> 2 1860 1 A4
> 3 1899 1 B1
> 4 2001 2 C3
> 5 1910 2 C4
> SALE.PRICE SALE.DATE
> 1 $ 4,844,809 2018-05-22
> 2 $ 0 2018-05-23
> 3 $ 0 2018-04-25
> 4 $ 6,250,000 2018-05-09
> 5 $ 2,400,000 2018-05-18
# what class?
class(data$SALE.PRICE)
> [1] "factor"
class(data$YEAR.BUILT)
> [1] "integer"
# transform factor into numeric
data$YEAR.BUILT <- as.numeric(as.character(data$YEAR.BUILT))
# transform factor into numeric, removing commas and dollar signs on the way
data$SALE.PRICE <- as.numeric(gsub("\\$ ","",gsub(",","",as.character(data$SALE.PRICE))))
# transform into factor to make it a categorical variable
data$TAX.CLASS.AT.TIME.OF.SALE <- as.factor(data$TAX.CLASS.AT.TIME.OF.SALE)
data <- data[!is.na(data$YEAR.BUILT),] # remove observations with a missing year of construction
data <- data[data$YEAR.BUILT>=1900,] # only keep observations after 1900
data <- data[data$SALE.PRICE>0,] # filter out zero prices (e.g., due to inheritance/gifts)
# check size of the data again
dim(data)
> [1] 11851 21
# we could now save the data, dropping row index
write.csv(data,file='clean_data.csv',row.names=FALSE)
# and here is how we could read it back in
head(read.csv(file = "clean_data.csv", head=TRUE))
> BOROUGH NEIGHBORHOOD BUILDING.CLASS.CATEGORY
> 1 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS
> 2 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS
> 3 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS
> 4 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS
> 5 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS
> 6 1 ALPHABET CITY 07 RENTALS - WALKUP APARTMENTS
> TAX.CLASS.AT.PRESENT BLOCK LOT EASE.MENT BUILDING.CLASS.AT.PRESENT
> 1 2A 390 54 NA C3
> 2 2B 390 64 NA C4
> 3 2 393 47 NA C4
> 4 2 398 33 NA C7
> 5 2 399 26 NA C7
> 6 2 400 54 NA C7
> ADDRESS APARTMENT.NUMBER ZIP.CODE RESIDENTIAL.UNITS
> 1 203 EAST 7TH STREET 10009 4
> 2 187 EAST 7TH STREET 10009 8
> 3 377 EAST 10TH 10009 12
> 4 28 AVENUE B 10009 16
> 5 234 EAST 4TH STREET 10009 28
> 6 207 EAST 4TH STREET 10009 8
> COMMERCIAL.UNITS TOTAL.UNITS LAND.SQUARE.FEET GROSS.SQUARE.FEET
> 1 0 4 1,950 5,446
> 2 2 10 1,642 5,220
> 3 0 12 2,370 10,715
> 4 1 17 1,933 8,699
> 5 3 31 4,616 18,690
> 6 3 11 2,364 7,166
> YEAR.BUILT TAX.CLASS.AT.TIME.OF.SALE BUILDING.CLASS.AT.TIME.OF.SALE
> 1 2001 2 C3
> 2 1910 2 C4
> 3 1900 2 C4
> 4 1900 2 C7
> 5 1900 2 C7
> 6 1900 2 C7
> SALE.PRICE SALE.DATE
> 1 6250000 2018-05-09
> 2 2400000 2018-05-18
> 3 954623 2018-06-04
> 4 5000000 2018-03-05
> 5 7906412 2018-03-05
> 6 3000000 2018-05-18
median(data$SALE.PRICE) # median sales price
> [1] 1110000
mean(data$SALE.PRICE) # mean
> [1] 3624782
max(data$SALE.PRICE) # max
> [1] 2397501899
aggregate(cbind(YEAR.BUILT, SALE.PRICE) ~ TAX.CLASS.AT.TIME.OF.SALE, data=data, FUN=median) # get median of variables by group
> TAX.CLASS.AT.TIME.OF.SALE YEAR.BUILT SALE.PRICE
> 1 1 1910 3911000
> 2 2 1956 1070000
> 3 4 1932 6000000
aggregate(cbind(YEAR.BUILT, SALE.PRICE) ~ TAX.CLASS.AT.TIME.OF.SALE, data=data, FUN=length) # count of rows within each group
> TAX.CLASS.AT.TIME.OF.SALE YEAR.BUILT SALE.PRICE
> 1 1 156 156
> 2 2 11316 11316
> 3 4 379 379
# other libraries to explore on your own - plyr, dplyr, reshape2, lubridate
# simple sales price histogram (log-transformed)
hist(log10(data$SALE.PRICE))
# ggplot2 sales prices histogram
ggplot(data, aes(SALE.PRICE)) + geom_histogram(aes(fill=..count..),alpha=0.6,bins=50) +
theme_bw() + ggtitle("Histogram of Property Sales, Manhattan", subtitle = NULL) +
scale_fill_gradient("Count",low="orange",high="red") + scale_y_continuous("Count") +
scale_x_log10(breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x)) ) +
xlab("Sales Price (log-scaled)")
# pearson (linear) correlation between logged prices and year when the building was constructed
cor(log10(data$SALE.PRICE),data$YEAR.BUILT)
> [1] 0.07713556
# spearman rank-order correlation (here a log transformation would not change the result, because the correlation is measured in terms of rank orderings, which log preserves)
cor(data$SALE.PRICE,data$YEAR.BUILT, method="spearman")
> [1] 0.1138932
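The rank-invariance claim is easy to check directly: applying a strictly increasing transformation such as log10 to a positive variable leaves its ranks, and hence the Spearman correlation, unchanged (a minimal self-contained sketch):

```r
set.seed(123)
x <- runif(100, 1, 100)   # positive values, so log10 is defined
y <- runif(100, 1, 100)
s1 <- cor(x, y, method = "spearman")
s2 <- cor(log10(x), y, method = "spearman")
all.equal(s1, s2)  # TRUE - log preserves the rank ordering
```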
# run simple regression
model <- lm(data=data, log10(SALE.PRICE) ~ YEAR.BUILT)
# plot summary of estimates
summary(model)
>
> Call:
> lm(formula = log10(SALE.PRICE) ~ YEAR.BUILT, data = data)
>
> Residuals:
> Min 1Q Median 3Q Max
> -6.1606 -0.2570 -0.0381 0.2822 3.3743
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 3.2678142 0.3334360 9.800 <2e-16 ***
> YEAR.BUILT 0.0014371 0.0001706 8.422 <2e-16 ***
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 0.6665 on 11849 degrees of freedom
> Multiple R-squared: 0.00595, Adjusted R-squared: 0.005866
> F-statistic: 70.92 on 1 and 11849 DF, p-value: < 2.2e-16
# we can interpret the regression results as follows:
# increasing construction year by 1, e.g. from 2000 to 2001,
# increases log10(sales price) by the YEAR.BUILT coefficient: 0.0014371 (the exact value may change as the data gets updated),
# that is, it increases the sales price by 100*(10^0.0014371-1) = 0.33%
# see details on interpretation here: http://www.cazaar.com/ta/econ113/interpreting-beta
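The percentage effect quoted above can be computed directly from the fitted model object (a sketch; the coefficient, and hence the result, will vary as the sales log gets updated):

```r
b1 <- coef(model)["YEAR.BUILT"]  # slope on the log10 scale
100 * (10^b1 - 1)                # percent change in sale price per extra year
```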
# plot regression line - relationship between log10(sales price) and year built
# color points by tax class
p <- ggplot(data, aes(YEAR.BUILT,SALE.PRICE,color=TAX.CLASS.AT.TIME.OF.SALE)) +
geom_point(alpha=0.3) +
theme_bw() + xlab("Year of Construction") + scale_y_log10(
breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x)) )
# add regression line
# setting intercept of the line based on beta0, slope of the line based on beta1
p + geom_abline(intercept = coef(model)[1], slope = coef(model)[2]) +
# annotating
annotate(label = sprintf("log10(y) = %.5f + %.5fx\nR² = %.3f", coef(model)[1],coef(model)[2], summary(model)$r.squared), geom = "text", x = 1975, y = 10^8, size = 4)
# regression diagnostic plots -
# notice, for example, that qqplot shows deviations from assumption that
# residuals are normally distributed - so we should take care interpreting
# the p-values and statistical significance results
plot(model)
# finally, let us build a large model with dummy variables and
# visualize its coefficients and confidence intervals
model <- lm(data=data, log10(SALE.PRICE) ~ YEAR.BUILT + NEIGHBORHOOD)
names(model$coefficients)<-gsub('NEIGHBORHOOD','',names(model$coefficients))
summary(model)
>
> Call:
> lm(formula = log10(SALE.PRICE) ~ YEAR.BUILT + NEIGHBORHOOD, data = data)
>
> Residuals:
> Min 1Q Median 3Q Max
> -6.4125 -0.2366 -0.0131 0.2565 3.3917
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 3.3134277 0.3459731 9.577 < 2e-16 ***
> YEAR.BUILT 0.0013217 0.0001761 7.505 6.60e-14 ***
> CHELSEA 0.1567136 0.0721440 2.172 0.029858 *
> CHINATOWN 0.2311274 0.1017332 2.272 0.023111 *
> CIVIC CENTER 0.5817345 0.0903577 6.438 1.26e-10 ***
> CLINTON 0.1627447 0.0774525 2.101 0.035643 *
> EAST VILLAGE 0.3221223 0.0814833 3.953 7.76e-05 ***
> FASHION 0.5702639 0.0880361 6.478 9.69e-11 ***
> FINANCIAL 0.0979759 0.0764937 1.281 0.200276
> FLATIRON 0.4002358 0.0755453 5.298 1.19e-07 ***
> GRAMERCY 0.1175131 0.0742874 1.582 0.113705
> GREENWICH VILLAGE-CENTRAL 0.2969894 0.0728642 4.076 4.61e-05 ***
> GREENWICH VILLAGE-WEST 0.3787139 0.0726018 5.216 1.86e-07 ***
> HARLEM-CENTRAL -0.1115705 0.0724697 -1.540 0.123698
> HARLEM-EAST 0.0535489 0.0945828 0.566 0.571297
> HARLEM-UPPER -0.0094921 0.0889054 -0.107 0.914976
> HARLEM-WEST -0.0744420 0.1337394 -0.557 0.577798
> INWOOD -0.1748163 0.0926124 -1.888 0.059103 .
> JAVITS CENTER -0.0914308 0.1504075 -0.608 0.543274
> KIPS BAY -0.0022053 0.0782166 -0.028 0.977508
> LITTLE ITALY 0.8010468 0.1478375 5.418 6.13e-08 ***
> LOWER EAST SIDE 0.1800874 0.0770071 2.339 0.019374 *
> MANHATTAN VALLEY 0.1205662 0.0823851 1.463 0.143372
> MIDTOWN CBD 0.4733586 0.0902934 5.242 1.61e-07 ***
> MIDTOWN EAST 0.0503657 0.0689212 0.731 0.464932
> MIDTOWN WEST 0.0270263 0.0717714 0.377 0.706506
> MORNINGSIDE HEIGHTS -0.0870946 0.1063159 -0.819 0.412685
> MURRAY HILL -0.0086453 0.0732270 -0.118 0.906020
> ROOSEVELT ISLAND 0.0882947 0.1534173 0.576 0.564951
> SOHO 0.4462584 0.0812644 5.491 4.07e-08 ***
> SOUTHBRIDGE 0.1130535 0.0908043 1.245 0.213147
> TRIBECA 0.4450179 0.0769254 5.785 7.43e-09 ***
> UPPER EAST SIDE (59-79) 0.1971894 0.0679977 2.900 0.003739 **
> UPPER EAST SIDE (79-96) 0.2311823 0.0680871 3.395 0.000688 ***
> UPPER EAST SIDE (96-110) 0.2685674 0.1264986 2.123 0.033767 *
> UPPER WEST SIDE (59-79) 0.2592869 0.0684896 3.786 0.000154 ***
> UPPER WEST SIDE (79-96) 0.2428071 0.0702259 3.458 0.000547 ***
> UPPER WEST SIDE (96-116) 0.1963852 0.0753441 2.607 0.009159 **
> WASHINGTON HEIGHTS LOWER 0.0558094 0.0946432 0.590 0.555415
> WASHINGTON HEIGHTS UPPER -0.0904123 0.0788283 -1.147 0.251424
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 0.6498 on 11811 degrees of freedom
> Multiple R-squared: 0.05828, Adjusted R-squared: 0.05517
> F-statistic: 18.74 on 39 and 11811 DF, p-value: < 2.2e-16
coefplot(model,mar=c(0.5,20,2,5), main="") # bottom, left, top, and right margins