Scraping
We will cover two common ways to extract data from files that are not read directly from a database:
- pdf files, and
- html files.
PDF files
We will cover how to scrape data in three settings:
- extracting from pdf files stored offline,
- downloading a pdf file and extracting from it, and
- massive download and extraction.
Offline pdf file
We need to install and load the pdftools package to do the extraction.
install.packages("pdftools")
library(pdftools)
To read a pdf as a text file, use pdf_text().
txt <- pdf_text("path/file.pdf")
Then we can extract a particular page.
test <- txt[49] #page 49
The pdf file contains a table.
To split the page into rows, we use the function scan().
rows <- scan(textConnection(test),
             what = "character", sep = "\n")
Then we can split a row into cells.
row <- unlist(strsplit(rows[1], " \\s+ "))
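To see what the splitting does, here is a small offline illustration with a made-up table row (the string below is hypothetical, not taken from the actual pdf):

```r
# A hypothetical row of table text: cells separated by runs of spaces
line <- "1   Total Residents    3,771,721"

# The pattern " \\s+ " matches a run of at least three whitespace
# characters, so single spaces inside a cell are preserved
cells <- unlist(strsplit(line, " \\s+ "))
```

The call returns the three cells "1", "Total Residents", and "3,771,721"; note that the single space inside "Total Residents" does not trigger a split.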
Online pdf file
First, we download a pdf file from the web using the function download.file().
Import the pdf file and then extract p.49, which contains a table. Then we scan to separate the text into rows.
Then we loop over the rows (starting from row 7) and perform the following operations: (1) split each row on runs of whitespace (\\s+) using strsplit, (2) unlist the result to make it a vector, and (3) store the second and third cells if the third cell is not empty.
link <- paste0(
"http://www.singstat.gov.sg/docs/",
"default-source/default-document-library/",
"publications/publications_and_papers/",
"cop2010/census_2010_release3/",
"cop2010sr3.pdf")
download.file(link, "census2010_3.pdf", mode = "wb")
txt <- pdf_text("census2010_3.pdf")
test <- txt[49] #P.49
rows <- scan(textConnection(test), what = "character",
             sep = "\n")
name <- c()
total <- c()
for (i in 7:length(rows)){
  row <- unlist(strsplit(rows[i], " \\s+ "))
  if (!is.na(row[3])){
    name <- c(name, row[2])
    total <- c(total,
               as.numeric(gsub(",", "", row[3])))
  }
}
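The two vectors collected by the loop can then be combined into a data frame. A minimal sketch with hypothetical values (not the actual census figures):

```r
# Hypothetical values mirroring what the loop above collects
name  <- c("Total Residents", "Chinese", "Malays")
total <- c("3,771,721", "2,793,980", "503,868")

# gsub() strips the thousands separators before numeric conversion
df <- data.frame(name  = name,
                 total = as.numeric(gsub(",", "", total)),
                 stringsAsFactors = FALSE)
```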
Scraping through massive download
We will use the RCurl package to download a large number of csv files. Very often we need to download many csv files from a website. Luckily, such csv files are usually stored on the website with structured url paths.
For example, suppose that we want to download all the historical weather data of the Singapore airport. We go to the website http://www.weather.gov.sg/climate-historical-daily/. At the bottom of the page, we can see that the download link for a csv file is http://www.weather.gov.sg/files/dailydata/DAILYDATA_S24_201712.csv.
Hence, we will use getURL() to fetch the file and then use textConnection() to read the csv file directly.
install.packages("RCurl")
library(RCurl)
link<-paste0("http://www.weather.gov.sg/files/",
"dailydata/DAILYDATA_S24_201712.csv")
x <- getURL(link)
df<-read.csv(textConnection(x))
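textConnection() treats a character string as if it were a file, which is why read.csv() can parse the downloaded text directly. A minimal offline illustration with a toy csv string (the data are made up):

```r
# A toy csv held in a string, standing in for the downloaded text
x <- "station,rainfall\nChangi,0.2\nChangi,1.4"
df <- read.csv(textConnection(x))
# df now has 2 rows and the columns station and rainfall
```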
However, very often we want to download more months, and then we can use a loop. By guessing and checking, we know that S24 refers to Changi airport, 2017 is the year, and 12 is December. To download the whole year of data, we have to download all 12 months; at each iteration the link changes dynamically and the data is appended:
site<-"http://www.weather.gov.sg/files/dailydata/"
months <- c("01","02","03","04","05","06",
"07","08","09","10","11","12")
df <- data.frame()
for (month in months){
  filename <- paste0("DAILYDATA_S24_2017", month, ".csv")
  link <- paste0(site, filename)
  x <- getURL(link)
  temp <- read.csv(textConnection(x))
  df <- rbind(df, temp)
}
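The hand-typed month vector above can also be generated with sprintf(), which zero-pads the numbers and scales better if more months or years are needed:

```r
# "%02d" pads single digits with a leading zero: 1 -> "01"
months <- sprintf("%02d", 1:12)
filenames <- paste0("DAILYDATA_S24_2017", months, ".csv")
```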
Alternatively, we can download each month as a separate csv file into a single folder and then combine all the csv files at the end. This is particularly useful when the csv files are huge.
The following code first downloads all the csv files into a temp folder and then combines all csv files in that folder. To combine them, we obtain the paths of all files using list.files(), where the option full.names is set to TRUE to include the directory path. Then we read the csv files into a list using lapply() with the import function fread(). Finally, we use rbindlist() to combine all the data in the list.
site<-"http://www.weather.gov.sg/files/dailydata/"
months <- c("01","02","03","04","05","06",
"07","08","09","10","11","12")
df <-data.frame()
# Download data (create the temp folder first)
dir.create("./temp", showWarnings = FALSE)
for (month in months){
  filename <- paste0("DAILYDATA_S24_2017", month, ".csv")
  link <- paste0(site, filename)
  x <- getURL(link)
  temp <- read.csv(textConnection(x))
  write.csv(temp, paste0("./temp/", filename),
            row.names = FALSE)
}
# Combine data
library(data.table)
folder<-"./temp/"
csv.list <- list.files(folder, pattern = "\\.csv$",
                       full.names = TRUE)
lst <- lapply(csv.list, fread)
df <- rbindlist(lst, fill = TRUE)
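If the csv files are known to share the same columns, the same combination can be done in base R with do.call() and rbind(), without data.table (though rbindlist() with fill = TRUE is safer when columns differ). A sketch with toy stand-ins for the monthly data frames:

```r
# Toy stand-ins for the monthly data frames read from the temp folder
lst <- list(data.frame(month = "01", rainfall = c(0.2, 1.4)),
            data.frame(month = "02", rainfall = c(0.0, 3.1)))

# Base-R equivalent of rbindlist() for same-column data frames
df <- do.call(rbind, lst)
```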
Scraping from the Web
We will use the rvest package to scrape directly from the web. However, it is often convenient to first identify what to extract using a small helper tool. We will use the SelectorGadget extension for the Chrome browser.
Search the web for the keyword SelectorGadget to download and install the extension. The tool is easy to use: the first click selects an area, and subsequent clicks include or exclude elements.
To install and load the rvest package, we use the following code:
install.packages("rvest")
library(rvest)
Wikipedia Table
We will do two scraping exercises:
- scrape from a Wikipedia table, and
- scrape from an unfriendly website.
The following code extracts the Student's t-distribution table from Wikipedia. Using SelectorGadget, we can see that the table has the class .wikitable. We extract it using html_nodes() and then parse the html data into a data frame using html_table().
link <-paste0("https://en.wikipedia.org/wiki/",
"Student%27s_t-distribution")
webpage <- read_html(link)
data <- html_nodes(webpage,".wikitable")
table<- html_table(data[[1]],header = FALSE)
Other Websites
To scrape unstructured data, we first need to find the right selector using SelectorGadget. Then we can read the data as text.
link<-paste0("http://www.fas.nus.edu.sg/ecs/",
"people/staff.html")
webpage <- read_html(link)
data <- html_nodes(webpage,"br+ table td")
content <-html_text(data)
Then we can transform the text vector into a data frame.
df <- data.frame(matrix(content, ncol = 5, byrow = TRUE),
                 stringsAsFactors = FALSE)
colnames(df)<-df[1,]
df[-1,]
## Title Name
## 2 Satoru TAKAHASHI\r\n 6516 6259
## 3 6516 6020 ecscnc
## 4 ecszjl Director (Graduate Program)
## 5 Title Name
## 6 Management Assistant Officer Ms Fatimah AHMAD\r\n\t\t\t\t
## 7 Management Assistant Officer Ms CHEE Lee Kuen
## 8 Management Assistant Officer Ms Diana ISMAIL
## 9 Manager Ms Nicky KHEH
## 10 Assistant Manager Ms NEO Seok Min
## 11 Manager Ms PAK Ming Foon, Ginny
## 12 Management Assistant Officer Mdm TAN Leng Choo
## 13 Executive Ms TAN Pei Ying
## 14 Executive Ms TANG Yuchen
## 15 Manager Ms WEI Qing
## 16 Manager Ms WOON Swee Yoke
## 17 Management Assistant Officer Ms Salinah ZUBER
## Tel Email
## 2 ecsst Director (Undergraduate Program)
## 3 Director (Master Program) ZENG Jinli \r\n
## 4 LUO Xiao\r\n 6516 6231
## 5 Tel Email
## 6 6516 3950 ecsfa
## 7 6516 3942 ecsclk
## 8 6516 6013 ecsdi
## 9 6516 4878 ecsklc
## 10 6516 3941 ecssec
## 11 6516 3956 ecspmfg
## 12 6516 1304 ecstlc
## 13 6601 3508 pei.ying
## 14 6601 4922 yuchen07
## 15 6516 8909 weiq
## 16 6516 6027 ecswsy
## 17 6516 3958 ecssz
## Head of Department
## 2 CHIA Ngee Choon\r\n
## 3 6516 5177
## 4 ecslx
## 5 Main Area
## 6 Undergraduate (levels 1000-2000)
## 7 Timetabling
## 8 Graduate (Coursework)
## 9 Graduate
## 10 Head's Personal Assistant
## 11 Undergraduate
## 12 Graduate (Research)
## 13 Department seminars
## 14 Graduate (Master of Economics)
## 15 Graduate (Master of Economics)
## 16 Undergraduate
## 17 Undergraduate (levels 3000-4000)
row.names(df) <- NULL
head(df[2:3], n=3)
## Name Tel
## 1 Name Tel
## 2 6516 6259 ecsst
## 3 ecscnc Director (Master Program)
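The matrix(..., byrow = TRUE) reshaping used above can be checked offline with toy data (the names and numbers below are made up):

```r
# A flat text vector as html_text() would return it, three cells per row
content <- c("Title",   "Name", "Tel",
             "Manager", "Ms A", "6516 0000")

# byrow = TRUE fills the matrix row by row, recreating the table layout
df <- data.frame(matrix(content, ncol = 3, byrow = TRUE),
                 stringsAsFactors = FALSE)
colnames(df) <- df[1, ]   # the first scraped row holds the headers
df <- df[-1, ]            # drop the header row from the data
row.names(df) <- NULL
```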