Working with Data

We always need to clean data before working. Even data downloaded from the database is not perfect. Sometimes there are missing data. Moreover, when data are from database, we need to combine them before using.

Learning Objectives:

Five basic data types (logical, integer, numeric, complex, character)
Two extended data types (date, factor)
Five basic data structure (vector, matrix, array, data.frame, list)
Basic data management
Advanced data management with dplyr and reshape
Download world bank data with wbstats
Import and export external data through csv (comma-delimited) files

Data Type

There are five basic data types:

logical
integer
numeric
complex
character.

Logical

Logical data type can either be TRUE and FALSE. It corresponds to integer 0 and 1.

x <- TRUE
x

## [1] TRUE

y <- FALSE
y

## [1] FALSE

class(x)

## [1] "logical"

Note that We use the function class() to see the data type.

Integer

Integer data type covers all whole numbers. It is represented by number and letter L: 1L, 2L, 3L.

x <- 1L
x

## [1] 1

class(x)

## [1] "integer"

Numeric

Numeric data type covers all real numbers.

x <- 2.6
x

## [1] 2.6

class(x)

## [1] "numeric"

Complex

Complex data type covers all complex number.

x <- 2 +5i
x

## [1] 2+5i

class(x)

## [1] "complex"

Character

Character data type covers all text strings.

x <- "Hello World"
x

## [1] "Hello World"

class(x)

## [1] "character"

Data type conversion

One common operation on data is to convert from one data type to another.

One frequently used conversion is between character and numeric. It is because numeric data may contain comma.

To convert string to number, we use as.numeric():

as.numeric("12000")

## [1] 12000

# When converting ``12,000'', need to replace comma
as.numeric(gsub(",", "", "12,000"))

## [1] 12000

Here, the function gsub is to find the comma and replace with null string where gsub stands for general substitution

To convert number to character, we may use as.character():

as.character(1200)

## [1] "1200"

Date

Date data type covers standard calendar date. It is converted from character data type. One complication is that we need to specify the format since there are a lot of possible way to express it.

For day, we use %d for day number (01-31).

For month, we use %m for month number (00-12), %b for abbreviated month (e.g. Jan), and %B for unabbreviated month (e.g. January)

For year, we use %y for two-digit year (e.g., 14) and %Y for four-digit year (e.g., 2014)

as.Date("21Jan2004", "%d%b%Y")

## [1] "2004-01-21"

as.Date("21/01/04", "%d/%m/%y")

## [1] "2004-01-21"

as.Date("21-01-04", "%d-%m-%y")

## [1] "2004-01-21"

Data Structure

In R, you can use different types of data structures, depending on your needs:

Vector,
Matrix,
Array,
Dataframe, and
List.

The type of data structure used generally differs by how information is stored and displayed, and also accessed or edited.

Vector

Vector is the most common and basic data structure in R.

Vector is simply data of the same data type ordered in a list.

The simplest vector is an integer vector. For a vector of integers from 1 to 5, we just need to use 1:5.

x <- 1:5
x

## [1] 1 2 3 4 5

Sometimes, we want to repeat the same number several times. We use the function rep(). The following code repeat 0 five times.

x <- rep(0, 5)
x

## [1] 0 0 0 0 0

This is usually a good way to initialize a vector. Slightly more general way to create a vector of integer is to use seq().

For integers from 4 to 10 with each step being 2, we have ``seq(4,10,by=2)’’. If the step is negative, then it is decreasing.

x <- seq(4,10,by=2)
x

## [1]  4  6  8 10

x <- seq(21,6,by=-3)
x

## [1] 21 18 15 12  9  6

To create a generic vector, we use the function c(). We input elements inside are separated by commas.

We first create a generic integer vector.

x <- c(1L, -1L, 3L, 2L)
x

## [1]  1 -1  3  2

Then we create a generic numeric vector.

x <- c(1, -1, 3.5, 2)
x

## [1]  1.0 -1.0  3.5  2.0

Finally, we create a generic character vector.

x <- c("Apple", "Banana")
x

## [1] "Apple"  "Banana"

We can also create a vector of logical.

x <- c(TRUE,FALSE)
x

## [1]  TRUE FALSE

It is easy to add new element to a vector.

x <- c(1, -1, 3.5, 2)
x <-c(x, 5) # add new element
x

## [1]  1.0 -1.0  3.5  2.0  5.0

Since R is a vectorized program, applying mathematical operators to the vector will take effect on all elements inside the vector.

x <- c(1, -1, 3.5, 2)
x + 2   #every element plus 2

## [1] 3.0 1.0 5.5 4.0

x - 2   #every element minus 2

## [1] -1.0 -3.0  1.5  0.0

Extracting elements is using the operator []. To get the third element, we have x[3]. To get first three elements, we use x[1:3]. To get the first, third and forth elements, we use x[c(1,3,4)].

x <- c(1, -1, 3.5, 2)
x[3]  #Pick out the 3rd element

## [1] 3.5

x[1:3] #Pick out the first three elements

## [1]  1.0 -1.0  3.5

x[c(1,3,4)] #Pick out 1st, 3rd and 4th elements

## [1] 1.0 3.5 2.0

Size of a vector can be found by using length()

x <- 1:10
length(x)

## [1] 10

Factor

A factor vector isan integer vector converted from character vectors or numeric vector. It was created to save memory space because long strings converted to numbers, and only mapping is needed.

A factor vector is created from character vector using factor().

x <- c("Apple", "Banana","Apple")
x <- factor(x)
x

## [1] Apple  Banana Apple 
## Levels: Apple Banana

While memory is usually not an issue these days, factor vector is sometimes converted from numeric vector to construct as categorical variable.

The following divides data into three groups with the same length.

x <- c(1,3,7,10)
x <- cut(x,3)
x

## [1] (0.991,4] (0.991,4] (4,7]     (7,10]   
## Levels: (0.991,4] (4,7] (7,10]

Sometimes we want to have label to avoid make it easier.

x <- c(1,3,7,10)
x <- cut(x,2,labels=c("L","H"))
x

## [1] L L H H
## Levels: L H

Sometimes we may use quartile to cut instead. However, in this case, we need to use the option include.lowest=TRUE to avoid data missing.

x <- c(1,3,7,10)
x <- cut(x, quantile(x, c(0, 1/2, 1)),
         include.lowest = TRUE)
x

## [1] [1,5]  [1,5]  (5,10] (5,10]
## Levels: [1,5] (5,10]

We can add label to each group.

x <- c(1,3,7,10)
x <- cut(x, quantile(x, c(0, 1/2, 1)),
         include.lowest = TRUE,
         labels=c("L","H"))
x

## [1] L L H H
## Levels: L H

Matrix

Martrix is a two-dimensional array.

To define a matrix from a vector, the syntax is matrix(vector, nrow, ncol, byrow). byrow is the way we fill the array. It is either TRUE or FALSE.

The following code fills the matrix by column.

x<- matrix(1:20, nrow=5, ncol=4, byrow=FALSE)
x

##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20

The following code fills the matrix by row.

x<- matrix(1:20, nrow=5, ncol=4, byrow=TRUE)
x

##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
## [4,]   13   14   15   16
## [5,]   17   18   19   20

Extracting elements from matrix is similar to extraction in vector.

x[2,] #the second row

## [1] 5 6 7 8

x[,1] #the first column

## [1]  1  5  9 13 17

x[1,2] #first row, second column

## [1] 2

One useful operation on matrix is to swap columns and row by t(), which means transpose.

y <- t(x)
y

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9   13   17
## [2,]    2    6   10   14   18
## [3,]    3    7   11   15   19
## [4,]    4    8   12   16   20

Size of matrix is rather complicated since it has two dimensions. There are three basics operations:

length(): total number of elements
ncol(): total number of columns
nrow(): total number of rows

z <- matrix(1:6,ncol=2,byrow=T)
length(z)

## [1] 6

ncol(z)

## [1] 2

nrow(z)

## [1] 3

Array

Array behaves like matrix but it is multi-dimensional (more than 2). To define arrary from vector, the syntax is array(vector/input, c(nrow, ncol, nmatrix))

x<- array(1:12, c(2,3,2))
x

## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

Dataframe

Dataframe is most useful form of data type in R. It behaves like matrix but can contain strings (letters/words). To visualize a dataframe, one may consider a standard Excel spreadsheet. Each column is a vector. Each spreadsheet is a dataframe – it is a collection of columns of cells.

df <- data.frame(c(1,2),
                 c("Good", "Bad"))
df

##   c.1..2. c..Good....Bad..
## 1       1             Good
## 2       2              Bad

Note that first rows are the columns names. R will automatically name the column based on the elements inside that vector. However, it looks ugly. We should give names to rows and columns to improve readability of the data.

colnames(df) <-c("GPA","outcomes")
rownames(df) <-c("John", "Mary")
df

##      GPA outcomes
## John   1     Good
## Mary   2      Bad

To skip having to rename the columns, we can simply specify the column name when creating the dataframe.

df <- data.frame(GPA = c(1,2), 
                 outcomes = c("Good", "Bad")) 
df

##   GPA outcomes
## 1   1     Good
## 2   2      Bad

One can first define vectors and then define dataframe based on the vectors.

 x <- c(2,4)
 y <- c(1,3)
df <- data.frame(x,y) 
df

##   x y
## 1 2 1
## 2 4 3

A more compact code can be done by defining the vectors and the dataframe at the same time.

df <- data.frame(x=c(2,4),y=c(1,3))
df

##   x y
## 1 2 1
## 2 4 3

Call particular elements in dataframe share the same syntax as in matrix or array.

df$x     #vector x

## [1] 2 4

df[1]    #1st column

##   x
## 1 2
## 2 4

df[1,]   #1st row

##   x y
## 1 2 1

df[,1]   #1st column

## [1] 2 4

df[2,1]  #2nd row, 1st column

## [1] 4

To remove particular vector from dataframe, simplify assume NULL to it.

df$x <- NULL
df

##   y
## 1 1
## 2 3

New columns can be created directly.

df <- data.frame(x=c(2,4),y=c(1,3))
df$z <- df$x + df$y
df

##   x y z
## 1 2 1 3
## 2 4 3 7

Exercise

Create two new numeric vectors x and y.
Create a dataframe from two newly created vectors.
Create a new column in the dataframe that \(w = x - 2 \times y\).

List

List is the most comprehensive data type. It can contain anything: vector, array, matrix and even dataframe.

x <- c(2, 3, 5) 
df <- data.frame(y=c(2,3,4),z=c(1,3,5)) 
name <- c("NUS", "NTU", "SMU") 
x <- list(x,df,name)

To access the first element in the list x. Use x[[1]]. Similarly for [[2]].

x[[1]]

## [1] 2 3 5

x[[2]]

##   y z
## 1 2 1
## 2 3 3
## 3 4 5

Functions

Functions take in inputs, execute commands, and return output. They are extremely useful because if we have to make a particular calculation multiple times, we can just program a function and reuse it in the future. For example, sum is a function that takes in inputs of numbers, and returns the value of all the numbers added together.

Defining a function

The most basic function is a function that has does not return any value.

Hello<- function (...) {
       cat("Hello World")
}
Hello()

## Hello World

Now, we define a function called myadd that adds 2 input numbers together and returns the total value.

myadd <- function (x, y) {
       return(x+y)
}
myadd(3, 4)

## [1] 7

From the example, we can see that a function consists of 2 parts, the 1st being the inputs, and the 2nd being the body of the function. When you define a function, you need to specify how many inputs the function requires. How you name them is up to you, but it is likely you will need to use the names of the inputs in the body of the function. The body of the function is where you write the code for what you want the function to do with the inputs. Do not forget to include a return statement, otherwise there will be no output shown in the console for the function.

We can also set default value of a function:

myadd <- function (x, y=1) {
       return(x+y)
}
myadd(3)

## [1] 4

myadd(3, 4)

## [1] 7

Now also that since R is a vectorized program, function can take vector as imputs:

myadd <- function (x, y) {
return(x+y)
}
myadd(1:3, 3:5)

## [1] 4 6 8

Exercise

Write the following function:

take three numerical numbers
calculate the product of three numerical inputs
display the value
return the value as function output

Assignment of function outputs

You can also store the output of a function into another variable, by assigning the output to that variable.

For example, result stores the value of add(3,4), which is 7. When you call result, you will then get the value 7.

result <- myadd(3, 4)
result

## [1] 7

Help on function

Type a function’s name at the console, you can see the code behind. If you Type “? function’s name”, you will see its help file as in Figure ??.

? sum

Source code of a function

If you type the function name without bracket, then it will show the code of the function.

myadd <- function (x, y) {
       return(x+y)
}
myadd

## function (x, y) {
##        return(x+y)
## }

Financial Data

Financial data are usually time series. Data are often indexed by date and even time. Date are usually converted from character data type using functionas.Date() whiel date-time data are converted using function as.POSIXct().

Without time, the program will put in hours.

as.POSIXct("03-12-2014",format="%d-%m-%Y")

## [1] "2014-12-03 +08"

The package xts is useful to handle financial data.

install.packages("xts")
library(xts)

Create an xts object using the xts() function:

dates<-as.Date(c("2016-01-01","2016-01-02",
                 "2016-01-03"))
prices <- c(1.1,2.2,3.3)
x <- xts(prices, order.by=dates)

Each observation has time timestamp.

x[1]

##            [,1]
## 2016-01-01  1.1

It is often more useful to be more specific about timing instead of number of row. Then we just need to supply the time stamp directly. Note that we will a slash to separate starting time and ending time:

x['2016-01-02/2016-01-03']

##            [,1]
## 2016-01-02  2.2
## 2016-01-03  3.3

With timestamp, we can find the first and last times:

first(x)

##            [,1]
## 2016-01-01  1.1

last(x)

##            [,1]
## 2016-01-03  3.3

To get time stamp, we use time() and to get the value, we use as.numeric():

time(x[1])

## [1] "2016-01-01"

as.numeric(x[1])

## [1] 1.1

The most common time series operation is lag(). It moves your data ahead.

lag1_x <- lag(x,1)
lag1_x

##            [,1]
## 2016-01-01   NA
## 2016-01-02  1.1
## 2016-01-03  2.2

lag2_x <- lag(x,2)
lag2_x

##            [,1]
## 2016-01-01   NA
## 2016-01-02   NA
## 2016-01-03  1.1

The reason that you need to use lag is that first difference cannot be calculated by x[i]-x[i-1]. Instead we need to use x-lag(x)

To transfer timestamp to another vector, we use reclass(). The following code copy the timestamp from x to y:

y <- c(1,0,-1)
y <- reclass(y,x)
y

##            [,1]
## 2016-01-01    1
## 2016-01-02    0
## 2016-01-03   -1

Two xts objects with the same timestamps can be combined using cbind()

z <- cbind(x,y)
z

##              x  y
## 2016-01-01 1.1  1
## 2016-01-02 2.2  0
## 2016-01-03 3.3 -1

Sometimes, some entries in the data are missing and there will be calculation problems for time series calculation (e.g. lag operator Lag()). There are using two ways to deal with the problem:

na.omit() to take away those data points, and
na.approx() to to take linear approximation.

Note that na.approx() is only available (and also only meaningful) for time series data.

Data Management

We first cover data format conversion since sometimes the data is not in the right format. Second, we will learn how to load a built-in dataset to illustrate how to take a snapshot of the data, sorting the data, selecting observation, selecting variable, subseting the dataset, and merging dataframes. Then we will learn two useful packages: dplyr and reshape. Finally, we apply it to world bank data.

Loading built-in dataset

Load a standard dataset and store it as a dataframe

df <- data.frame(mtcars)

Peeping Data

We may use the head to checking the top rows and tail to check bottom rows.

x <- 1:5
y <- seq(5,1,-1)
z <- c(1,1,2,2,3)
df <- data.frame(x,y,z)
head(df, 3) #first 3 row

##   x y z
## 1 1 5 1
## 2 2 4 1
## 3 3 3 2

tail(df, 3) #last 3 row

##   x y z
## 3 3 3 2
## 4 4 2 2
## 5 5 1 3

Sorting Data

Sorting data by giving multiple criteria using order.

The following code first sort the data by y (in an ascending order) and then break tie using disp if they have the same z.

newdf <- df[order(df$y,df$z),]
newdf

##   x y z
## 5 5 1 3
## 4 4 2 2
## 3 3 3 2
## 2 2 4 1
## 1 1 5 1

Joining dataframes

To merge two dataframes horizontally by joining through an unique identifier, one may use merge().

df1<-data.frame(ID=c("a","b"), x=c(1,2))
df2<-data.frame(ID=c("a","b"), y=c(3,4))
df3 <-merge(df1,df2,by="ID")
df3

##   ID x y
## 1  a 1 3
## 2  b 2 4

If two dataframes are just to join horizontally without an unique identifier, then use cbind().

df1<-data.frame(ID=c("a","b"), x=c(1,2))
df2<-data.frame(ID=c("a","b"), y=c(3,4))
df3 <-cbind(df1,df2)
df3

##   ID x ID y
## 1  a 1  a 3
## 2  b 2  b 4

Stacking Dataframe

If dataframes are just to join vertically, then use rbind(). Note that rbind requires dataframes have the same columns names.

df1<-data.frame(ID=c("a","b"),
                x=c(1,2), y=c(1,2))
df2<-data.frame(ID=c("c","d"), 
                x=c(3,4), y=c(1,2))
df3<-data.frame(ID=c("e","f"),
                x=c(5,6), y=c(1,2))
df4 <-rbind(df1,df2,df3)
df4

##   ID x y
## 1  a 1 1
## 2  b 2 2
## 3  c 3 1
## 4  d 4 2
## 5  e 5 1
## 6  f 6 2

Selecting Variables

The following code selecting columns 1 to 3.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[,c(1:3)]
head(newdf,3)

##   ID x y
## 1  a 1 1
## 2  b 2 2
## 3  c 3 1

The following code drops columns 1 to 2.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[,-c(1:2)]
head(newdf,3)

##   y z
## 1 1 6
## 2 2 5
## 3 1 4

We can delete columns by name.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
df$z<- NULL
head(df,3)

##   ID x y
## 1  a 1 1
## 2  b 2 2
## 3  c 3 1

Selecting Observation based on row number

The functions head and tail allows as to directly obtain rows from the top and bottom.

The following code selects the top 3 observations.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf<-head(df,3)
newdf

##   ID x y z
## 1  a 1 1 6
## 2  b 2 2 5
## 3  c 3 1 4

The following code drops the last 2 observations.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf<-head(df,-2)
newdf

##   ID x y z
## 1  a 1 1 6
## 2  b 2 2 5
## 3  c 3 1 4
## 4  d 4 2 3

The following code selects the bottom 3 observations.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- tail(df,3)
newdf

##   ID x y z
## 4  d 4 2 3
## 5  e 5 1 2
## 6  f 6 2 1

The following code drops the top 2 observations.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- tail(df,-2)
newdf

##   ID x y z
## 3  c 3 1 4
## 4  d 4 2 3
## 5  e 5 1 2
## 6  f 6 2 1

The following code selects observations from row 2 to row 4.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[c(2:4),]
head(newdf,3)

##   ID x y z
## 2  b 2 2 5
## 3  c 3 1 4
## 4  d 4 2 3

The following code excludes observations from row 2 to row 4.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[-c(2:4),]
head(newdf,3)

##   ID x y z
## 1  a 1 1 6
## 5  e 5 1 2
## 6  f 6 2 1

Selecting Observation based on condition

To select observations based on conditions, we may use ``which’’. The following code selects observations such that \(y=1\).

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[which(df$y==1), ]
newdf

##   ID x y z
## 1  a 1 1 6
## 3  c 3 1 4
## 5  e 5 1 2

The following code chooses observation that \(y=1\) and \(x>1\).

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- df[which(df$y==1 & df$x>1), ]
newdf

##   ID x y z
## 3  c 3 1 4
## 5  e 5 1 2

Selecting Observation and Columns

To select observations based on conditions restricting to some columns, we use subset. To choose which column to include, we use ``select’’.

The following code selects columns of y and z when \(y=1\) and \(x>1\).

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
newdf <- subset(df, y== 2 & x>1, select= c(y,z))
newdf

##   y z
## 2 2 5
## 4 2 3
## 6 2 1

Create New Column

If a new column is simple transformation of existing column, then we can just write the expression directly because R is vectorized.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
df$a <- 2*df$x + 3*df$y -df$z
df

##   ID x y z  a
## 1  a 1 1 6 -1
## 2  b 2 2 5  5
## 3  c 3 1 4  5
## 4  d 4 2 3 11
## 5  e 5 1 2 11
## 6  f 6 2 1 17

df

##   ID x y z  a
## 1  a 1 1 6 -1
## 2  b 2 2 5  5
## 3  c 3 1 4  5
## 4  d 4 2 3 11
## 5  e 5 1 2 11
## 6  f 6 2 1 17

When we want to create a new column where each row depends on values of existing column, we can use ifelse().

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
df$a <- ifelse(df$x>3, 1,0)
df

##   ID x y z a
## 1  a 1 1 6 0
## 2  b 2 2 5 0
## 3  c 3 1 4 0
## 4  d 4 2 3 1
## 5  e 5 1 2 1
## 6  f 6 2 1 1

Remove Duplicated Observation

To remove duplicate, the function is unique().

df <- data.frame(ID=c("a","b","a","b","a","b"), 
                 x=c(1,2,3,4,1,2), 
                 y=c(1,2,1,2,1,2))
newdf <- unique(df)
newdf

##   ID x y
## 1  a 1 1
## 2  b 2 2
## 3  a 3 1
## 4  b 4 2

Collapse Data by Group

There are two ways to collapse data by applying function to group: (1) aggregate and (2) by.

The following calculate the mean of data by group.

df <- data.frame(ID=c("a","b","a","b","a","b"), 
                 x=c(1,2,3,4,1,2), 
                 y=c(1,2,1,2,1,2))
newdf <- aggregate(df$x, by=data.frame(df$ID), mean)
newdf

##   df.ID        x
## 1     a 1.666667
## 2     b 2.666667

The following is similar by use by()

df <- data.frame(ID=c("a","b","a","b","a","b"), 
                 x=c(1,2,3,4,1,2), 
                 y=c(1,2,1,2,1,2))
newdf <-by(df$x, df$ID, mean)
newdf

## df$ID: a
## [1] 1.666667
## ------------------------------------------------------------ 
## df$ID: b
## [1] 2.666667

Packages

Everything which is done in R is done by functions. Commonly used functions are grouped in packages. Installing different packages expand the functionality of R.

To install a package, say dplyr, for the first time, type

install.packages("dplyr")

To load the package, type the package name without quotation

library(dplyr)

Note that we only need to install each package once but we need to load the package each time to access the functions inside the package.

dplyr package

It is very convenient data management package. It can filter data, sort data, select data and generate new columns.

The filter() function is similar to Excel filter function.

df <- data.frame(Price=c(1.2,2.4,3.6,4.8), 
                 month=c(1,2,1,2), 
                 day=c(1,2,3,4))
dplyr::filter(df, month ==1 , day ==3)

##   Price month day
## 1   3.6     1   3

Note that we have added ``dplyr::’’ is to avoid name conflict of filter

The arrange() is like Excel sorting function. The syntax is arrange(dataframe, variables):

df <- data.frame(school=c("NTU","SMU","NUS"), 
                 rank=c(2,1,3), 
                 size=c(1,3,2))
dplyr::arrange(df, rank, size)

##   school rank size
## 1    SMU    1    3
## 2    NTU    2    1
## 3    NUS    3    2

The default order is ascending. To sort by a descending order, one can simply use desc():

df <- data.frame(school=c("NTU","SMU","NUS"),
                 rank=c(2,1,3),
                 size=c(1,3,2))
dplyr::arrange(df, desc(rank), size)

##   school rank size
## 1    NUS    3    2
## 2    NTU    2    1
## 3    SMU    1    3

The select() is to select columns. It behaves as if Excel hide column function or delete column function.

df <- data.frame(x=c("NTU","SMU","NUS"),
                 rank=c(2,1,3),
                 size=c(1,3,2))
dplyr::select(df, x, rank)

##     x rank
## 1 NTU    2
## 2 SMU    1
## 3 NUS    3

The mutate() function add new columns in the dataframe. It works like *``gen function in Stata.

df <- data.frame(product=c("chicken rice", "laska"),
                 revenue=c(10,5), 
                 cost=c(5,4))
mutate(df, profit = revenue-cost, 
                 profit_margin = profit/revenue)

##        product revenue cost profit profit_margin
## 1 chicken rice      10    5      5           0.5
## 2        laska       5    4      1           0.2

To create columns based on some simple condition, we can use ifelse()

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
mutate(df, a = ifelse(x>3, 1,0))

##   ID x y z a
## 1  a 1 1 6 0
## 2  b 2 2 5 0
## 3  c 3 1 4 0
## 4  d 4 2 3 1
## 5  e 5 1 2 1
## 6  f 6 2 1 1

To create columns based on some simple condition, we can use case_when.

df <- data.frame(ID=c("a","b","c","d","e","f"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,2,1,2,1,2),
                 z=c(6,5,4,3,2,1))
mutate(df, a = case_when(x==1 | x==2 ~0,
                         x==3 | x==4 ~1,
                         x==5 | x==6 ~2)
       )

##   ID x y z a
## 1  a 1 1 6 0
## 2  b 2 2 5 0
## 3  c 3 1 4 1
## 4  d 4 2 3 1
## 5  e 5 1 2 2
## 6  f 6 2 1 2

To create summary by group, we first use group_by() to group the data, and then use summarize() to apply functions.

df <- data.frame(ID=c("a","b","a","b","a","b"), 
                 x=c(1,2,3,4,5,6), 
                 y=c(1,4,2,3,6,2))
newdf <- group_by(df, ID)
summarize(newdf, xbar=mean(x), yvar=var(y))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 3
##   ID     xbar  yvar
##   <fct> <dbl> <dbl>
## 1 a         3     7
## 2 b         4     1

reshape package

Reshape package is useful to transform data from long format to wide format. The two main functions for reshape is melt and cast.

We first install and load the reshape package.

 install.packages("reshape")
 library(reshape)

The melt() function collapses all columns into rows

df1 <- data.frame(id=c(1,1,2,2),
                  time = c(1,2,1,2),
                  bid=c(1,2,3,4),
                  ask=c(5,6,7,8))
df2 <- melt(df1, id=c("id","time"))
df2

##   id time variable value
## 1  1    1      bid     1
## 2  1    2      bid     2
## 3  2    1      bid     3
## 4  2    2      bid     4
## 5  1    1      ask     5
## 6  1    2      ask     6
## 7  2    1      ask     7
## 8  2    2      ask     8

The cast() function rebuild new table after it is melt. The casting formula: row variable 1 + row variable 2 ~ column variable 1 + column variable 2.

The following code will return original table. It is because the original table has id and time as row variables while bid and ask are column variables.

df3 <- cast(df2, id+time~variable)
df3

##   id time bid ask
## 1  1    1   1   5
## 2  1    2   2   6
## 3  2    1   3   7
## 4  2    2   4   8

The following code will make id and variable as row variable and time as column variables.

df4 <- cast(df2, id+variable~time)
df4

##   id variable 1 2
## 1  1      bid 1 2
## 2  1      ask 5 6
## 3  2      bid 3 4
## 4  2      ask 7 8

The following code will make id as row variable, and time and variable as column variables.

df5 <- cast(df2, id~variable+time)
df5

##   id bid_1 bid_2 ask_1 ask_2
## 1  1     1     2     5     6
## 2  2     3     4     7     8

Application to World Bank Data

To access on World Bank data in R, we first install the wbstats packages:

install.packages("wbstats")
library(wbstats,warn.conflicts = FALSE)

The following code download data description from World bank database. Since it is too large, we just download the description first. Then we will download only the variables we need.

new_cache <- wbcache()

Since the database is huge, we need to search for relevant data. We will need to obtain GDP and population data.

gdp <- wbsearch("gdp.*capita.*US\\$")
pop <- wbsearch("population, total")

After checking the gdp and pop, we decide to download two data series: “SP.POP.TOTL” and “NY.GDP.PCAP.KD”.

download<- wb(country=c("US", "SG"),
              indicator = c("SP.POP.TOTL",
                            "NY.GDP.PCAP.KD"), 
              startdate = 2000, enddate = 2017)

We will use dplyr to clean up the data.

library(dplyr,warn.conflicts = FALSE)

Since the data contains a lot of redundant columns, we first clean the data.

long <- dplyr::select(download, date, 
                      indicator, country,value)
head(long,3)

##   date         indicator   country   value
## 1 2017 Population, total Singapore 5612253
## 2 2016 Population, total Singapore 5607283
## 3 2015 Population, total Singapore 5535002

The data is in long format. We need to use reshape package to change it to wide format.

library(reshape,warn.conflicts = FALSE)

To start, we first melt it and then cast it to the desired format.

temp <- melt(long, id=c("date","indicator","country"))
data <- cast(temp, country + date~indicator)

Finally, we make some decoration.

colnames(data) <- c("country", "year", "gdp", "pop")
data$year <- as.numeric(data$year)
head(data,3)

##     country year      gdp     pop
## 1 Singapore 2000 33850.76 4027887
## 2 Singapore 2001 32597.62 4138012
## 3 Singapore 2002 33565.97 4175950

External Datasets

We often want to import from CSV, stata data and excel files. Sometimes, we also want to export to CSV and excel files. Finally, we cover how to convert your R code into Word document through knit and export your table using the package stargazer.

Working Directory

To work with external file, we need to set the folder containing the file as working directory. Otherwise, we need to write the full path of the file.

To set working directory, we use setwd().

setwd("C:/user/folder/")

Import from CSV

To import csv, we can use read.csv().

The program will find a file in the working directory. If the file is stored in a sub directory, we will say ``./folder/file.csv’’ instead.

If the data has header, then set header = TRUE so that the column name is given by the header row. Otherwise, set header to FALSE.

# If the file has header, then type
df <- read.csv("file.csv", header= TRUE)
# If the file has no header, then type
df <- read.csv("file.csv", header= FALSE)

Import from Stata

To import from Stata data file, we need the package foreign.

library(foreign)
df <-read.dta("datafile.dta")

Import from Excel

To import from xlsx file, we need the package xlsx

library(xlsx)
workbook <- "path/file.xlsx"
df <-read.xlsx(workbook, n)
# n is n-th worksheet to import

Export to CSV

To export dataframe to a csve file, we can use write.csv(). The file will be saved in your current working directory, which can be changed if you wish.

write.csv(df,file="filename.csv")

Note that a csv file that can be directly read by Excel.

Export to xlsx

To export to xlss file, we need to install package XLConnect.

install.packages("XLConnect")
library(XLConnect)
# Load and Excel workbook (create if not exist)
wb = loadWorkbook("test.xlsx",create=T)
# Create Worksheet in the workbook
createSheet(wb,"AAPL"); createSheet(wb,"IBM")
AAPL<-data.frame(P=c(1,2,4),Q=c(2,3,5))
IBM <-data.frame(P=c(2,4,6),Q=c(7,8,9))  
# Write things into worksheet
writeWorksheet(wb, AAPL,sheet = "AAPL", 
               startRow = 1, startCol = 1)
writeWorksheet(wb, IBM,sheet = "IBM", 
               startRow = 1, startCol = 1)
# Save to workbook
saveWorkbook(wb)

Export to Word: Knitr

Sometimes, we want to export our working in RStudio as printable or pdf/word format. The easiest way to do it is through the knitr package. Install the package and then go to File > Knit document > MS Word in RStudio.

Export Table: stargazer

The package ``stargazer’’ is useful to export table to latex, text or html (open in word and copy and paste). This saves considerable time to get the formatting right.

install.packages("stargazer");
library(stargazer);
stargazer(df)  #latex
stargazer(df,  type="text", out="test.txt") 
stargazer(df,  type="html", out="test.html")

Last updated on Sep 9, 2018

Edit this page