Previously on…
All (well, almost all) relevant metadata on the papers in Acta Chemica Scandinavica was harvested. I now have a dataframe, saved to a file. But the data is not quite ready. I had textcat guesstimate the language of each paper, since I'm interested in the decline of German (and French) as scientific languages.
Let's take a look. The data is saved on my E drive, in the folder "acta":
setwd("e:\\acta")
rawdata <- readRDS(file="data.Rda")
head(rawdata)
## title
## 1 Some Improvements in Electrophoresis.
## 2 Di-O-alkylmonothiophosphates and Di-O-alkylmonoselenophosphates and the Corresponding Pseudohalogens.
## 3 On the determination of Reducing Sugars.
## 4 On the Application of p-Carboxyphenylhydrazones in the Identification of Carbonyl Compounds.
## 5 A Note on the Growth Promoting Properties of an Enzymatic Hydrolysate of Casein.
## 6 Die Konstitution der Harzphenole und ihre biogenetischen Zusammenhänge. X. Herausspaltung des "MIttelstückes" des Pinoresinols
## authors language pages
## 1 Astrup, Tage; Brodersen, Rolf english 1-7
## 2 Foss, Olav english 8-31
## 3 Blom, Jakob; Rosted, Carl Olof english 32-53
## 4 Veibel, Stig english 54-68
## 5 Ågren, Gunnar english 69-70
## 6 Erdtman, Holger; Gripenberg, Jarl german 71-75
## doi volume year
## 1 10.3891/acta.chem.scand.01-0001 1 1947
## 2 10.3891/acta.chem.scand.01-0008 1 1947
## 3 10.3891/acta.chem.scand.01-0032 1 1947
## 4 10.3891/acta.chem.scand.01-0054 1 1947
## 5 10.3891/acta.chem.scand.01-0069 1 1947
## 6 10.3891/acta.chem.scand.01-0071 1 1947
## url
## 1 /pdf/acta_vol_01_p0001-0007.pdf
## 2 /pdf/acta_vol_01_p0008-0031.pdf
## 3 /pdf/acta_vol_01_p0032-0053.pdf
## 4 /pdf/acta_vol_01_p0054-0068.pdf
## 5 /pdf/acta_vol_01_p0069-0070.pdf
## 6 /pdf/acta_vol_01_p0071-0075.pdf
Nice. Now, let's take a look at the languages:
table(rawdata$language)
##
## afrikaans albanian basque
## 13 1 11
## catalan croatian-ascii czech-iso8859_2
## 401 2 2
## danish dutch english
## 141 14 14765
## esperanto estonian finnish
## 19 2 4
## french frisian german
## 67 28 311
## indonesian irish italian
## 4 4 2
## latin lithuanian manx
## 162 2 4
## middle_frisian portuguese romanian
## 53 6 118
## rumantsch scots scots_gaelic
## 38 78 12
## serbian-ascii slovak-ascii slovak-windows1250
## 1 150 1
## slovenian-iso8859_2 spanish swedish
## 9 20 8
## tagalog welsh
## 3 36
Interesting. Acta was owned by the Nordic chemical societies, but it was international in scope. Probably not that international, though.
A closer look at the papers reveals that something is indeed wrong:
rawdata[which(rawdata$language=="finnish"),]$title
## [1] "Some Remarks on Chromatography."
## [2] "Gustav Komppa."
## [3] "Note on Vitamin B12b."
## [4] "Semisynthetic Penicillins. V. alpha-(Ylideneimino-oxy)penicillins."
The second paper might actually be in Finnish. But the other three are not.
There is only one way forward: take a look at all the papers, and change the language to the correct one.
rawdata[which(rawdata$language=="finnish"),]$doi
## [1] "10.3891/acta.chem.scand.03-0401" "10.3891/acta.chem.scand.04-1155"
## [3] "10.3891/acta.chem.scand.07-0703" "10.3891/acta.chem.scand.19-0352"
It was the second paper there were doubts about. That is in volume 4, and the paper starts at page 1155. Checking… It's in English.
Repeat for all the strange languages. Then check the German and French papers. And finally, go through all the English ones.
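This review step can be scripted. A small sketch (not from the original post) that prints the title and DOI for every paper in each non-English category, so each batch can be checked by hand:

```r
# List every non-English language category and print the papers in it,
# so the titles can be eyeballed and misclassifications corrected.
suspects <- setdiff(unique(rawdata$language), "english")
for (lang in suspects) {
  cat("\n==", lang, "==\n")
  print(rawdata[rawdata$language == lang, c("title", "doi")])
}
```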
There is a shortcut in textcat: it is possible to define a "profile", telling textcat that the only possible languages are English, French and German.
my.profiles <- TC_byte_profiles[names(TC_byte_profiles) %in% c("english", "french", "german")]
textcat(rawdata$title, p = my.profiles)
However, I'm not actually sure those are the only languages. It is entirely possible that there are papers in Danish, Swedish, Norwegian or Finnish. Also, textcat misclassifies some German-language papers as English. I want as short a list of English-language papers to go through as possible: it is much easier to scan through 20 papers classified as Spanish and determine the correct language.
I won't drag anyone through all the work. But this is the code needed to end up with correct languages.
We begin with the easy parts. Note that this is the final result, not the way I actually did it: I went through all the papers classified as English first, before changing anything to English:
languages <- c("french","afrikaans", "albanian", "finnish", "tagalog", "serbian-ascii", "slovak-windows1250",
"irish", "czech-iso8859_2", "croatian-ascii", "italian", "indonesian", "lithuanian",
"manx", "estonian", "slovenian-iso8859_2", "scots_gaelic", "dutch", "portuguese","basque", "welsh", "esperanto", "spanish", "frisian", "rumantsch", "middle_frisian", "scots", "danish", "swedish", "romanian", "slovak-ascii", "latin", "catalan")
for(i in 1:length(languages)){
rawdata[which(rawdata$language==languages[i]),]$language <- "english"
}
And the result:
table(rawdata$language)
##
## english german
## 16181 311
That is the most compact way I could figure out.
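For the record, the loop can also be written as one vectorised assignment (same effect, assuming the `languages` vector defined above):

```r
# Vectorised alternative to the for loop: every title whose detected
# language is in `languages` is relabelled as English in one assignment.
rawdata$language[rawdata$language %in% languages] <- "english"
```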
But! Some of those German papers are actually in English. Some of the English ones are in German. And some are in French. That was the time-consuming part. The result is this:
english_papers <- c("32a-0653", "33b-0006", "35b-0225", "35b-0513", "05-0227", "36b-0291", "38b-0183",
"39b-0469", "41a-0573", "42b-0242", "44-0527", "45-0105", "45-0529", "46-0147", "52-1034", "52-1171", "52-0001",
"01-0770", "03-0090", "03-0093", "03-0321", "03-1201", "04-0550", "04-0997", "04-1314", "05-0616", "06-0311",
"07-1036", "08-0523a", "08-1727", "09-1350", "09-1425", "10-0397", "52-1285", "53-0263", "10-0481", "11-0854",
"11-0906", "13-1366", "17-1189", "17-1826", "17-2351a", "19-1381", "19-1677", "21-1293", "22-2161", "22-3321",
"23-0455", "24-1301", "24-2093", "24-3420", "25-1142", "25-1695", "25-3886", "27-1519", "28b-0579", "29a-0895",
"30b-0970", "31b-0219")
french_papers <- c("06-0189", "20-2304", "26-0059", "32a-0415", "31b-0077", "31a-0088", "27-1450", "27-1039", "26-2703", "27-0708", "25-1849", "23-2949", "22-2388",
"22-3191", "22-0070", "21-2807", "20-0159", "11-1428", "11-1473", "09-1674", "10-1199", "04-0806",
"04-1393", "02-0193", "03-0036", "03-0554", "03-1220")
german_papers <- c("15-2064", "21-1293", "16-0275", "20-1035", "05-0227", "22-3332", "16-0522", "22-2685", "13-1240",
"06-0468", "28a-0116", "19-1012", "19-1987", "11-1622", "19-1993", "16-0297", "17-0272", "15-0849", "04-1155",
"15-0218", "04-0589", "02-0883", "05-0241", "02-0034", "20-1064")
for(i in 1:length(french_papers)){
doi <- paste("10.3891/acta.chem.scand.", french_papers[i], sep="")
rawdata[which(rawdata$doi==doi),]$language <- "french"
}
for(i in 1:length(german_papers)){
doi <- paste("10.3891/acta.chem.scand.", german_papers[i], sep="")
rawdata[which(rawdata$doi==doi),]$language <- "german"
}
for(i in 1:length(english_papers)){
doi <- paste("10.3891/acta.chem.scand.", english_papers[i], sep="")
rawdata[which(rawdata$doi==doi),]$language <- "english"
}
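The three loops differ only in the DOI list and the target language, so they could be folded into one small helper; `set_language` is my name for it, not something from the original code:

```r
# Hypothetical helper: relabel all papers whose DOI suffix is in `suffixes`.
set_language <- function(df, suffixes, lang) {
  dois <- paste0("10.3891/acta.chem.scand.", suffixes)
  df$language[df$doi %in% dois] <- lang
  df
}

# Same order as the loops above: a DOI appearing in more than one list
# keeps the language assigned last.
rawdata <- set_language(rawdata, french_papers,  "french")
rawdata <- set_language(rawdata, german_papers,  "german")
rawdata <- set_language(rawdata, english_papers, "english")
```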
Damn, that took some time:
table(rawdata$language)
##
## english french german
## 16189 27 276
Now you see why I changed all the French papers to English at the beginning: this way, I only needed to change the language of 27 papers from English to French. The other way around, I would have had to change 37 papers from French to English. Not much faster, but a bit.
Result: the data is not only technically correct, it is also consistent. Now we're ready to analyse. Almost. There is one more thing I would like, though I'm not sure I can get it: journals did not only come in volumes, they came in issues. It would be nice to know which issue a given paper belonged to. That would give me more granular data for the time series I want to plot.
Hm. Scopus does have some information, but only for the years 1974 to 1988, which makes that database bloody useless for a chemist. Also note: Scopus classifies the B series of Acta Chem. Scand. under the subject area "medicine". The series actually covered organic chemistry and biochemistry. Absolutely useless!
Google Scholar appears to have something, but how to get at it? Web of Science has it, but it will take forever to get them all: I can only export 500 references at a time.
Hm… This might be difficult to automate. Vol 1:
- number 1: 1-132
- number 2: 133-267
- number 3: 269-352
- number 4: 353-448
- number 5: 449-528
- number 6: 529-618
- number 7: 619-684
- number 8: 685-780
- number 9: 781-860
- number 10: 861-952
Repeat for all volumes.
That took some time. Of course I could have made the harvest semi-automatically from Web of Science, but then I would have missed the link to the PDF, which I really want to have for later.
The numbers above I got by exporting all references for the first volume from Web of Science, importing them into JabRef, sorting by page number, and extracting the issue numbers by hand. Not efficient.
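Once the first page of each issue is known, the issue assignment itself is easy to automate. A sketch for volume 1, using the start pages from the list above; parsing `first_page` out of the `pages` column is my assumption, not code from the original:

```r
# First page of each of the ten issues of volume 1, from the list above.
issue_starts <- c(1, 133, 269, 353, 449, 529, 619, 685, 781, 861)

# A paper belongs to the issue whose page range contains its first page;
# findInterval() does exactly that lookup.
# first_page could be parsed from the pages column, e.g.:
#   first_page <- as.integer(sub("-.*", "", rawdata$pages))
first_page <- c(1, 8, 140, 300, 900)
findInterval(first_page, issue_starts)
# → 1 1 2 3 10
```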
SciFinder? Not much better. At least I can search by volume and issue number, and get, for vol. 2: number 1 is pages 1-93.
But apparently there was no issue 2. Issue 3, however, is pages 193-294.
Damn.
Anyway, let's save the now-consistent data:
saveRDS(rawdata, file="consistentdata.Rda")