OK. A “small” web scraping project. Part one in a series on getting some historical data on the chemical literature.
I’m going to take a look at Acta Chemica Scandinavica. Ideally I would like to harvest the entire journal, including all the PDFs. But for now I’m going to harvest the metadata, and ask nicely about getting access to the PDFs rather than downloading them automatically.
So. First up, getting the metadata.
Data analysis progresses in five steps:
- Harvest the raw data
- Force the data to be technically correct
- Coerce the data to be consistent
- Analyse the data
- Present the interesting results
Step 1, getting the raw data. I am going to cheat a bit, and make sure that the data is technically correct while harvesting.
ActaChemScand, as presented on the webpage, gives us access to one volume at a time. We choose a decade, the forties, the fifties, etc., and then the volume in that decade we want to look at.
Choosing volume 1 brings us to this URL: http://actachemscand.dk/volume.php?select1=1&vol=1
Choosing volume 2 gives us: http://actachemscand.dk/volume.php?select1=1&vol=2
And volume 3: http://actachemscand.dk/volume.php?select1=1&vol=3
Volume 4 is in the fifties, so the URL changes: http://actachemscand.dk/volume.php?select1=2&vol=4
That seems pretty straightforward. Change the select1 part of the URL based on the decade, set vol to the volume we are interested in, and it should be simple to generate a list of URLs to harvest. There is a small trick that makes things even easier: the select1 variable is actually not needed. It just assists the navigation on the site. Volume 5 can therefore be accessed through this URL: http://actachemscand.dk/volume.php?vol=5 Even easier!
But wait. There are supplements. And in the seventies, two series are introduced, A and B. Let’s take a look: choosing the supplemental volume 17 (from the sixties) leads us to this URL: http://actachemscand.dk/volume.php?select1=3&vol=1017
29A from 1975 is: http://actachemscand.dk/volume.php?select1=4&vol=290 And 29B is: http://actachemscand.dk/volume.php?select1=4&vol=291
So. Not that simple after all. There is a system, though: volumes 1 through 27 are just vol=1, vol=2, …, vol=27, and the supplement to volume 17 is vol=1017. After that, volume xxA is xx0 and volume xxB is xx1; vol 34A is 340 and vol 34B is 341. When the journal returns to a single series, the trailing zero stays: vol 43 is 430, and so on up to vol 53 at 530.
One should be able to write a script that handles that (a sketch of how that might look follows after the data frame below). However, I figured it was easier to list it all in Excel and construct a dataframe with the needed data manually. Maybe because I did this rather late in the day, and I was out of coffee. So:
url_id <- c(1, 2, 3,4,5,6,7,8,9,10, 11, 12, 13, 14, 15, 16, 17, 1017, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 280, 281, 290, 291, 300, 301,
310, 311, 320, 321, 330, 331, 340, 341, 350, 351, 360, 361, 370, 371, 380,
381, 390, 391, 400, 401, 410, 411, 420, 421, 430, 440, 450, 460, 470, 480,
490, 500, 510, 520, 530)
year <- c(1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957,
1958, 1959, 1960, 1961, 1962, 1963, 1963,1964, 1965, 1966, 1967, 1968,
1969, 1970, 1971, 1972, 1973, 1974, 1974, 1975, 1975, 1976, 1976, 1977, 1977,
1978, 1978, 1979, 1979, 1980, 1980, 1981, 1981, 1982, 1982, 1983, 1983, 1984,
1984 ,1985 ,1985, 1986, 1986, 1987, 1987, 1988, 1988, 1989, 1990, 1991, 1992,
1993, 1994, 1995, 1996, 1997 ,1998, 1999 )
vol <- c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17",
"17s","18","19","20","21","22","23","24","25","26","27","28a","28b","29a",
"29b","30a","30b","31a","31b","32a","32b","33a","33b","34a","34b","35a","35b",
"36a","36b","37a","37b","38a","38b","39a","39b","40a","40b","41a","41b","42a",
"42b","43","44","45","46","47","48","49","50","51","52","53")
url_id <- c(1, 2)
year <- c(1947, 1948)
vol <- c("1", "2")
acta <- data.frame(url_id, year, vol, stringsAsFactors=FALSE)
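For the record, here is a sketch of how the same table could be generated programmatically instead of by hand. It assumes the url_id pattern described above holds for every volume, and the helper name build_acta is just something I made up for illustration:

build_acta <- function() {
  # Volumes 1-17 plus the supplement to volume 17 (url_id 1017)
  early <- data.frame(url_id = c(1:17, 1017),
                      year   = c(1947:1963, 1963),
                      vol    = c(as.character(1:17), "17s"),
                      stringsAsFactors = FALSE)
  # Volumes 18-27: one volume per year, url_id equals the volume number
  mid <- data.frame(url_id = 18:27,
                    year   = 1964:1973,
                    vol    = as.character(18:27),
                    stringsAsFactors = FALSE)
  # Volumes 28-42 come in series A and B: xxA -> xx0, xxB -> xx1
  ab <- data.frame(url_id = as.vector(rbind(28:42 * 10, 28:42 * 10 + 1)),
                   year   = rep(1974:1988, each = 2),
                   vol    = as.vector(rbind(paste0(28:42, "a"), paste0(28:42, "b"))),
                   stringsAsFactors = FALSE)
  # Volumes 43-53 are single volumes again, but keep the trailing zero
  late <- data.frame(url_id = 43:53 * 10,
                     year   = 1989:1999,
                     vol    = as.character(43:53),
                     stringsAsFactors = FALSE)
  rbind(early, mid, ab, late)
}

acta <- build_acta() should then reproduce the 69-row table typed in above. Whether that is less work than Excel is debatable.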
Okay. Now I have a dataframe with three columns: url_id, which is the number I need to attach to a base URL to get to the relevant webpage; year, the publication year – not just a continuous list of years, as some years had more than one volume; and vol, the volume number, stored as strings, since we get to vol 28A after a few years.
Oh, and instead of harvesting everything, I modify it to only look at the first two years – that is what the second set of vectors above does. It takes some time to harvest everything.
We’re going to need some libraries. And we’re going to completely ignore all the warnings:
library(RCurl)
## Warning: package 'RCurl' was built under R version 3.2.3
## Loading required package: bitops
## Warning: package 'bitops' was built under R version 3.2.3
library(XML)
## Warning: package 'XML' was built under R version 3.2.3
library(rvest)
## Warning: package 'rvest' was built under R version 3.2.5
## Loading required package: xml2
## Warning: package 'xml2' was built under R version 3.2.5
##
## Attaching package: 'rvest'
## The following object is masked from 'package:XML':
##
## xml
library(textcat)
## Warning: package 'textcat' was built under R version 3.2.5
- RCurl is used for getting my greasy hands on the webpages.
- XML & rvest are used for parsing and accessing the interesting parts of the data. rvest is the reason I’m writing this up in R Markdown rather than Jupyter; I could not get rvest to run on the Jupyter platform without crashing the kernel.
- textcat. Well, we’ll get back to that in detail, but it’s used to determine, or rather guess, the language of the papers.
I’m not going to show screenshots here. But we need to take a look at the HTML markup on the pages we’re going to harvest. All the papers are listed in a table with the CSS class “resulttable” (selected with “.resulttable”). Within it, we have individual table cells, tagged “td”. There is some formatting within those cells that we are going to use later.
But the point is that we need to get a list into R of all the “fields” within the “.resulttable” table.
Let’s try that with the first volume.
adress <- "http://actachemscand.dk/volume.php?vol=1"
url <- read_html(adress)
papers <- html_nodes(url, ".resulttable") %>%
html_nodes("td")
head(papers)
## {xml_nodeset (6)}
## [1] <td>\n<select name="select1" onchange="ldMenu(this.selectedIndex);" ...
## [2] <td>\n<select name="vol" size="1" style="width: 120px"><option selec ...
## [3] <td>\n<input type="submit" value="Go" alt="Go" /></td>
## [4] <td>\n <b>Some Improvements in Electrophoresis.</b>\n <p><a href=" ...
## [5] <td>\n <b>Di-O-alkylmonothiophosphates and Di-O-alkylmonoselenophos ...
## [6] <td>\n <b>On the determination of Reducing Sugars.</b>\n <p><a hre ...
Cool. read_html grabs the content of the page at that address and stores it in the variable “url”. Then we pass it through html_nodes twice: first we ask for the content matching “.resulttable”, and then we pass the result (that’s what the %>% pipe does) to html_nodes again, this time asking for the contents of the “td” tags.
The result is a set of HTML nodes, one for each instance of the td tag on the page. Those can be addressed almost like a list.
Looking at the head of the result, we see that we actually have to go to result number 4 before we get to a paper. For some reason, the navigation is embedded in the .resulttable. A bit annoying, but as long as it’s the same way on every page, it’s not a problem.
Let me just check.
…
Yep, it’s the same way on every page. Or at least on the five pages I tested. No problem, we’ll just start the loop at 4 instead of 1.
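A check along those lines does not need to be fancy. Something like the following would do – the volume ids here are just a sample I picked for illustration:

for (id in c(1, 10, 1017, 280, 530)) {
  page <- read_html(paste0("http://actachemscand.dk/volume.php?vol=", id))
  nodes <- html_nodes(page, ".resulttable") %>%
    html_nodes("td")
  print(nodes[1:4])  # the first three cells should be navigation, the fourth a paper
}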
Nice. Now we can look at papers[4] and get the data out of it.
The print function truncates the output. That’s usually OK, but here I would like to take a look at all the data. Enter the paste function. But! I’m going to want to output this as a PDF. RStudio does this by running everything through LaTeX, and as anyone who has ever done anything with LaTeX knows, you often run into very long lines that do not break. This little helper function solves that problem, sort of. It’s shamelessly stolen from http://stackoverflow.com/questions/24020617/textwrapping-long-string-in-knitr-output-rstudio
str_break = function(x, width = 60L) {
n = nchar(x)
if (n <= width) return(x)
n1 = seq(1L, n, by = width)
n2 = seq(width, n, by = width)
if (n %% width != 0) n2 = c(n2, n)
substring(x, n1, n2)
}
And then:
str_break(paste(papers[4]))
## [1] "<td>\n <b>Some Improvements in Electrophoresis.</b>\n <p><a "
## [2] "href=\"/author.php?aid=81\">Astrup, Tage</a>; <a href=\"/author"
## [3] ".php?aid=36\">Brodersen, Rolf</a></p>\n <div id=\"pages\"><b>Pa"
## [4] "ges:</b> 1-7.</div>\n <br />\n <div id=\"doi\"><b>DOI number:<"
## [5] "/b> 10.3891/acta.chem.scand.01-0001</div>\n <div id=\"downloa"
## [6] "d\"><b>Download as:</b> <a target=\"_blank\" href=\"/pdf/acta_vo"
## [7] "l_01_p0001-0007.pdf\">PDF</a> <a href=\"/djvu/acta_vol_01_p000"
## [8] "1-0007.djvu\">DjVu</a></div>\n</td>"
Much better.
The title is bold – that’s the “b” tag. So are the labels “Pages:”, “DOI number:” and “Download as:”, but the title always comes first. The author list is the only part of the reference that is enclosed in “p” tags. Page numbers and DOI are the only parts of the reference enclosed in “div” tags.
bolds <- html_nodes(papers[4], "b") %>%
html_text()
title <- bolds[1]
print(title)
## [1] "Some Improvements in Electrophoresis."
authors <- html_nodes(papers[4], "p") %>%
html_text()
print(authors)
## [1] "Astrup, Tage; Brodersen, Rolf"
pagesAndDOI <- html_nodes(papers[4], "div") %>%
html_text()
pages <- pagesAndDOI[1]
pages <- substring(pages, 8, nchar(pages)-1)
print(pages)
## [1] "1-7"
doi <- pagesAndDOI[2]
doi <- substring(doi, 13, nchar(doi))
print(doi)
## [1] "10.3891/acta.chem.scand.01-0001"
What did I do there?
I passed papers[4] through the html_nodes function, looking for “b” tags, and passed the result to html_text, which gave me the actual textual content of what was in those tags. That gave me a list (sort of) of all the bold text in papers[4]. The first entry is the title, so title <- bolds[1].
The authors are the only text enclosed in “p” tags. Same procedure, only this time there is only one result, so authors contains the right information directly.
Page numbers and DOI are in “div” tags. The same procedure again. The first result is always the page numbers: pages <- pagesAndDOI[1]. The result, however, is “Pages:” followed by the actual page numbers and a trailing full stop, so I strip off the leading label and the final character with substring to get only the page numbers. The DOI is the second result. It is the same story as with the page numbers, just starting the substring at character 13 instead of 8.
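Counting characters like that works, but it is a bit fragile if the labels ever change. A regex-based alternative (just a sketch of the same idea) could be:

pages <- gsub("^Pages:\\s*|\\.$", "", pagesAndDOI[1])
doi <- sub("^DOI number:\\s*", "", pagesAndDOI[2])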
So far so good.
I also want the year and the volume. I know the url_id used, that is 1. I’m just going to get the year and volume number from the original dataframe:
year <- acta[1,]$year
vol <- acta[1,]$vol
print(year)
## [1] 1947
print(vol)
## [1] "1"
Any other interesting information? The link to the PDF would be nice. I’m still hoping to gain access to the PDFs in a non-harvesting way, but even if I do, I’m going to need the filename.
url_list <- papers[4] %>%
html_nodes("a") %>%
html_attr("href")
urls <- lapply(url_list, function(ch) grep("pdf", url_list))
url <- url_list[urls[[1]]]
print(url)
## [1] "/pdf/acta_vol_01_p0001-0007.pdf"
That took some googling. I pass papers[4] to html_nodes and extract all the nodes in the HTML that are enclosed in “a” tags. I’m looking for links, and those are marked with that tag. Then I extract the part of those nodes that has the attribute “href”, which is the actual link.
That gives me a character vector of the URLs in the paper. There are several, but I’m only interested in the one that contains the string “pdf”. The other links point to a DjVu version of the paper, and to the authors. I’m counting on not encountering authors with the string “pdf” in their names. Therefore I’m grepping (google it) the list of URLs for “pdf”. grep does not return the matching URL itself but its position in url_list, and because I wrapped the call in lapply, that position sits inside a list. So I pull it out with [[1]] and use it to index url_list.
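Looking at it again, the lapply is not strictly necessary; grep on its own does the same job. A simpler equivalent (assuming there is always exactly one PDF link per paper) would be:

url <- grep("pdf", url_list, value = TRUE)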
Almost done. I am rather interested in the language of the paper. Remember the library textcat from earlier? The textcat() function takes a string and (tries to) determine the language. That is going to give some interesting results. But I’m gonna do it anyway.
language <- textcat(title)
print(language)
## [1] "english"
I’ll get back to it, but there are some shortcuts built into textcat. For now I’m going to do it quick and dirty.
To sum up:
- I can collect all the pages on www.actachemscand.dk
- I can extract all the papers on each of those pages
- I can extract all the relevant information for each of those papers.
Now I just need to bundle it all up, and get all the information for all the papers on all the pages.
Let’s begin by defining a dataframe to keep all the information.
masterDF <- NULL
masterDF <- data.frame(title = character(0), authors = character(0),
                       language = character(0), pages = character(0),
                       doi = character(0), volume = character(0),
                       year = character(0), url = character(0),
                       stringsAsFactors = F)
Done. Now a function to extract the information when I have a paper. I want it to take a paper (e.g. papers[4]) and a volume id, which I’ll get from looping through the list of years, volume ids etc. And I would like it to return a list with all the information:
getPaperDetails <- function(paper, volid){
  # The title is the first piece of bold text in the cell
  title <- html_nodes(paper, "b") %>%
    html_text()
  title <- title[1]
  # The authors are the only text in "p" tags
  authors <- html_nodes(paper, "p") %>%
    html_text()
  # Page numbers and DOI live in "div" tags; strip the leading labels
  pagesAndDOI <- html_nodes(paper, "div") %>%
    html_text()
  pages <- pagesAndDOI[1]
  pages <- substring(pages, 8, nchar(pages)-1)
  doi <- pagesAndDOI[2]
  doi <- substring(doi, 13, nchar(doi))
  # Guess the language from the title
  language <- textcat(title)
  # Look up year and volume number in the acta dataframe
  year <- acta[which(acta$url_id==volid),]$year
  volume <- acta[which(acta$url_id==volid),]$vol
  # The PDF link is the href containing "pdf"
  url_list <- paper %>% html_nodes("a") %>% html_attr("href")
  urls <- lapply(url_list, function(ch) grep("pdf", url_list))
  url <- url_list[urls[[1]]]
  # Return everything as a named list
  result <- list(title=title, authors=authors, language=language, pages=pages,
                 doi=doi, volume=volume, year=year, url=url)
  result
}
Almost done. I still need to define the base-url, to which the volume id should be attached to get the actual webpages I need to scrape:
base_url <- "http://actachemscand.dk/volume.php?vol="
All that is needed now is to loop through the dataframe acta from the beginning, extract the papers from those pages, plug them into the getPaperDetails function from above, and add the results to the dataframe.
So: for i = 1 to the number of url_ids in acta, get the URL for the page with url_id number i and grab the page. Get the list of papers, defined as whatever is enclosed first in the .resulttable selector and then in a td tag. For each of those results – beginning with number 4, since the navigation takes up the first three – pass that paper to getPaperDetails and save the result in masterDF.
for(i in 1:length(acta$url_id)){
weburl <- paste(base_url,acta$url_id[i], sep="")
webpage <- read_html(weburl)
papers <- html_nodes(webpage, ".resulttable") %>%
html_nodes("td")
for(j in 4:length(papers)){
masterDF[nrow(masterDF)+1,] <- getPaperDetails(papers[j],acta$url_id[i])
}
}
head(masterDF)
## title
## 1 Some Improvements in Electrophoresis.
## 2 Di-O-alkylmonothiophosphates and Di-O-alkylmonoselenophosphates and the Corresponding Pseudohalogens.
## 3 On the determination of Reducing Sugars.
## 4 On the Application of p-Carboxyphenylhydrazones in the Identification of Carbonyl Compounds.
## 5 A Note on the Growth Promoting Properties of an Enzymatic Hydrolysate of Casein.
## 6 Die Konstitution der Harzphenole und ihre biogenetischen Zusammenhänge. X. Herausspaltung des "MIttelstückes" des Pinoresinols
## authors language pages
## 1 Astrup, Tage; Brodersen, Rolf english 1-7
## 2 Foss, Olav english 8-31
## 3 Blom, Jakob; Rosted, Carl Olof english 32-53
## 4 Veibel, Stig english 54-68
## 5 Ågren, Gunnar english 69-70
## 6 Erdtman, Holger; Gripenberg, Jarl german 71-75
## doi volume year
## 1 10.3891/acta.chem.scand.01-0001 1 1947
## 2 10.3891/acta.chem.scand.01-0008 1 1947
## 3 10.3891/acta.chem.scand.01-0032 1 1947
## 4 10.3891/acta.chem.scand.01-0054 1 1947
## 5 10.3891/acta.chem.scand.01-0069 1 1947
## 6 10.3891/acta.chem.scand.01-0071 1 1947
## url
## 1 /pdf/acta_vol_01_p0001-0007.pdf
## 2 /pdf/acta_vol_01_p0008-0031.pdf
## 3 /pdf/acta_vol_01_p0032-0053.pdf
## 4 /pdf/acta_vol_01_p0054-0068.pdf
## 5 /pdf/acta_vol_01_p0069-0070.pdf
## 6 /pdf/acta_vol_01_p0071-0075.pdf
That looks about right.
Someone is going to be annoyed if I do this on a regular basis, so I’m going to save the results rather than scraping the pages again and again:
saveRDS(masterDF, file="data.Rda")
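Next time I can just load the saved data back with readRDS instead of hitting the site again:

masterDF <- readRDS("data.Rda")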
Done! The data is scraped and saved. Next up: making sure that it is consistent.