Acta Chem Scand – consistent data!

Previously on…

Almost all relevant metadata on the papers in Acta Chemica Scandinavica has been harvested, and I now have a dataframe saved in a file. But the data is not quite ready. I got textcat to guesstimate the language of the papers, as I’m interested in the decline of German (and French) as scientific languages.

Let’s take a look. The data is now saved on my E: drive in the folder “acta”:

setwd("e:\\acta")
rawdata <- readRDS(file="data.Rda")
head(rawdata)

##                                                                                                                            title
## 1                                                                                          Some Improvements in Electrophoresis.
## 2                          Di-O-alkylmonothiophosphates and Di-O-alkylmonoselenophosphates and the Corresponding Pseudohalogens.
## 3                                                                                       On the determination of Reducing Sugars.
## 4                                   On the Application of p-Carboxyphenylhydrazones in the Identification of Carbonyl Compounds.
## 5                                               A Note on the Growth Promoting Properties of an Enzymatic Hydrolysate of Casein.
## 6 Die Konstitution der Harzphenole und ihre biogenetischen Zusammenhänge. X. Herausspaltung des "MIttelstückes" des Pinoresinols
##                             authors language pages
## 1     Astrup, Tage; Brodersen, Rolf  english   1-7
## 2                        Foss, Olav  english  8-31
## 3    Blom, Jakob; Rosted, Carl Olof  english 32-53
## 4                      Veibel, Stig  english 54-68
## 5                     Ågren, Gunnar  english 69-70
## 6 Erdtman, Holger; Gripenberg, Jarl   german 71-75
##                               doi volume year
## 1 10.3891/acta.chem.scand.01-0001      1 1947
## 2 10.3891/acta.chem.scand.01-0008      1 1947
## 3 10.3891/acta.chem.scand.01-0032      1 1947
## 4 10.3891/acta.chem.scand.01-0054      1 1947
## 5 10.3891/acta.chem.scand.01-0069      1 1947
## 6 10.3891/acta.chem.scand.01-0071      1 1947
##                               url
## 1 /pdf/acta_vol_01_p0001-0007.pdf
## 2 /pdf/acta_vol_01_p0008-0031.pdf
## 3 /pdf/acta_vol_01_p0032-0053.pdf
## 4 /pdf/acta_vol_01_p0054-0068.pdf
## 5 /pdf/acta_vol_01_p0069-0070.pdf
## 6 /pdf/acta_vol_01_p0071-0075.pdf

Nice. Now, let’s take a look at the languages:

table(rawdata$language)
## 
##           afrikaans            albanian              basque 
##                  13                   1                  11 
##             catalan      croatian-ascii     czech-iso8859_2 
##                 401                   2                   2 
##              danish               dutch             english 
##                 141                  14               14765 
##           esperanto            estonian             finnish 
##                  19                   2                   4 
##              french             frisian              german 
##                  67                  28                 311 
##          indonesian               irish             italian 
##                   4                   4                   2 
##               latin          lithuanian                manx 
##                 162                   2                   4 
##      middle_frisian          portuguese            romanian 
##                  53                   6                 118 
##           rumantsch               scots        scots_gaelic 
##                  38                  78                  12 
##       serbian-ascii        slovak-ascii  slovak-windows1250 
##                   1                 150                   1 
## slovenian-iso8859_2             spanish             swedish 
##                   9                  20                   8 
##             tagalog               welsh 
##                   3                  36

Interesting. Acta was owned by the Nordic chemical societies, but it was international in scope. Probably not that international, though.

Taking a closer look at the papers reveals that something actually is wrong:

rawdata[which(rawdata$language=="finnish"),]$title
## [1] "Some Remarks on Chromatography."                                   
## [2] "Gustav Komppa."                                                    
## [3] "Note on Vitamin B12b."                                             
## [4] "Semisynthetic Penicillins. V. alpha-(Ylideneimino-oxy)penicillins."

The second paper might actually be in Finnish. But the other three are not.

There is only one way forward: taking a look at all the papers, and correcting the language where it is wrong.

rawdata[which(rawdata$language=="finnish"),]$doi
## [1] "10.3891/acta.chem.scand.03-0401" "10.3891/acta.chem.scand.04-1155"
## [3] "10.3891/acta.chem.scand.07-0703" "10.3891/acta.chem.scand.19-0352"

It was the second one there were doubts about. It is in volume 4, and the paper starts at page 1155. Checking. It’s in English.
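
Reading volume and start page off a DOI can be sketched in code. This is a minimal sketch, assuming every DOI follows the suffix pattern seen in this post ("04-1155", "32a-0653", "08-0523a"); the helper name parse_acta_doi is made up for illustration.

```r
# Hypothetical helper: split an Acta Chem. Scand. DOI into volume, series and start page.
# Assumes suffixes of the form <volume><optional series letter>-<page><optional letter>.
parse_acta_doi <- function(doi) {
  # Drop the fixed DOI prefix, leaving e.g. "04-1155"
  suffix <- sub("^10\\.3891/acta\\.chem\\.scand\\.", "", doi)
  m <- regmatches(suffix, regexec("^([0-9]+)([a-z]?)-([0-9]+)([a-z]?)$", suffix))
  # One row per DOI; NA where the pattern does not match
  parts <- do.call(rbind, lapply(m, function(x) {
    if (length(x) == 5) x[2:5] else rep(NA_character_, 4)
  }))
  data.frame(
    volume     = as.integer(parts[, 1]),
    series     = parts[, 2],
    start_page = as.integer(parts[, 3]),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_acta_doi(c("10.3891/acta.chem.scand.04-1155",
                           "10.3891/acta.chem.scand.32a-0653"))
print(parsed)
```

DOIs that do not fit the assumed pattern come back as NA rather than raising an error, so they can be checked by hand.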

Repeat for all the strange languages. Then check the German and French papers. And finally, go through all the English ones.
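
The "repeat for all the strange languages" step could be collected into one listing instead of one query per language. The following is only a sketch on a toy stand-in for rawdata (four rows taken from the output above); the names toy, suspects and review are made up for illustration.

```r
# Toy stand-in for rawdata with only the columns needed here
toy <- data.frame(
  title = c("Some Remarks on Chromatography.",
            "Gustav Komppa.",
            "Note on Vitamin B12b.",
            "On the determination of Reducing Sugars."),
  language = c("finnish", "finnish", "finnish", "english"),
  doi = paste0("10.3891/acta.chem.scand.",
               c("03-0401", "04-1155", "07-0703", "01-0032")),
  stringsAsFactors = FALSE
)

# Everything textcat guessed that is not one of the three plausible languages
suspects <- setdiff(unique(toy$language), c("english", "french", "german"))

# One small data frame of titles and DOIs per suspicious language
review <- lapply(suspects, function(lang) {
  toy[toy$language == lang, c("title", "doi")]
})
names(review) <- suspects
print(review)
```

Each element of review is short enough to scan by eye, and the DOI column says where to look up the paper when the title alone is not enough.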

There is a shortcut in textcat. It is possible to define a “profile”, in which we tell textcat that the only possible languages are English, French and German.

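library(textcat)
# Keep only the English, French and German profiles, then classify the titles against them
my.profiles <- TC_byte_profiles[names(TC_byte_profiles) %in% c("english", "french", "german")]
textcat_medhelp <- textcat(rawdata$title, p = my.profiles)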

However, I’m not actually sure that those are the only languages. It is entirely possible that there are papers in Danish, Swedish, Norwegian or Finnish. Also, textcat misclassifies some German-language papers as English. I want as short a list of English-language papers to go through as possible. It’s much easier to scan through 20 papers classified as Spanish and determine the correct language.

I won’t drag anyone through all the work. But this is the code needed to end up with correct languages.

We begin with the easy part. Note that this is the final result, not the way I actually did it. I went through all the papers classified as English first, before I changed anything to English:

languages <- c("french","afrikaans", "albanian", "finnish", "tagalog", "serbian-ascii", "slovak-windows1250",
"irish", "czech-iso8859_2", "croatian-ascii", "italian", "indonesian", "lithuanian",
"manx", "estonian", "slovenian-iso8859_2", "scots_gaelic", "dutch", "portuguese","basque", "welsh", "esperanto", "spanish", "frisian", "rumantsch", "middle_frisian", "scots", "danish", "swedish", "romanian", "slovak-ascii", "latin", "catalan")


for(i in 1:length(languages)){
  rawdata[which(rawdata$language==languages[i]),]$language <- "english"
}

And the result:

table(rawdata$language)
## 
## english  german 
##   16181     311

That is the most compact way I can figure out.
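
A possibly more compact alternative is a single vectorised assignment with %in%. This is a minimal sketch on a toy data frame, not the code that was actually run; toy and to_english are made-up names.

```r
# Toy stand-in for rawdata
toy <- data.frame(
  language = c("english", "latin", "german", "catalan", "french"),
  stringsAsFactors = FALSE
)

# Languages to fold into English; "welsh" has no rows in the toy data
to_english <- c("french", "latin", "catalan", "welsh")

# Recode every matching row in one step, no loop needed
toy$language[toy$language %in% to_english] <- "english"
print(table(toy$language))
```

Languages in the list that have no rows in the data are simply ignored by this form.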

But! Some of those German papers are actually in English. Some of the English ones are in German. And some are in French. That was the time-consuming part. But the result is this:

english_papers <- c("32a-0653", "33b-0006", "35b-0225", "35b-0513", "05-0227", "36b-0291", "38b-0183",
"39b-0469", "41a-0573", "42b-0242", "44-0527", "45-0105", "45-0529", "46-0147", "52-1034", "52-1171", "52-0001",
"01-0770", "03-0090", "03-0093", "03-0321", "03-1201", "04-0550", "04-0997", "04-1314", "05-0616", "06-0311",
"07-1036", "08-0523a", "08-1727", "09-1350", "09-1425", "10-0397", "52-1285", "53-0263", "10-0481", "11-0854",
"11-0906", "13-1366", "17-1189", "17-1826", "17-2351a", "19-1381", "19-1677", "21-1293", "22-2161", "22-3321",
"23-0455", "24-1301", "24-2093", "24-3420", "25-1142", "25-1695", "25-3886", "27-1519", "28b-0579", "29a-0895",
"30b-0970", "31b-0219")

french_papers <- c("06-0189", "20-2304", "26-0059", "32a-0415", "31b-0077", "31a-0088", "27-1450", "27-1039", "26-2703", "27-0708", "25-1849", "23-2949", "22-2388",
"22-3191", "22-0070", "21-2807", "20-0159", "11-1428", "11-1473", "09-1674", "10-1199", "04-0806",
"04-1393", "02-0193", "03-0036", "03-0554", "03-1220")

german_papers <- c("15-2064", "21-1293", "16-0275", "20-1035", "05-0227", "22-3332", "16-0522", "22-2685", "13-1240", 
    "06-0468", "28a-0116", "19-1012", "19-1987", "11-1622", "19-1993", "16-0297", "17-0272", "15-0849", "04-1155",
    "15-0218", "04-0589", "02-0883", "05-0241", "02-0034", "20-1064")

for(i in 1:length(french_papers)){
    doi <- paste("10.3891/acta.chem.scand.", french_papers[i], sep="")
    rawdata[which(rawdata$doi==doi),]$language <- "french"
}

for(i in 1:length(german_papers)){
    doi <- paste("10.3891/acta.chem.scand.", german_papers[i], sep="")
    rawdata[which(rawdata$doi==doi),]$language <- "german"
}

for(i in 1:length(english_papers)){
    doi <- paste("10.3891/acta.chem.scand.", english_papers[i], sep="")
    rawdata[which(rawdata$doi==doi),]$language <- "english"
}
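
The three loops share one pattern, so they could be folded into a small helper. This is a sketch under assumptions: set_language is a hypothetical function, the toy data frame stands in for rawdata, and the warning for DOIs not found is an addition that the loops above do not have.

```r
# Hypothetical helper: set the language for all papers whose DOI suffix is listed
set_language <- function(df, suffixes, language,
                         prefix = "10.3891/acta.chem.scand.") {
  dois <- paste0(prefix, suffixes)
  # Flag suffixes that match no paper, e.g. because of a typo in the list
  not_found <- setdiff(dois, df$doi)
  if (length(not_found) > 0) {
    warning("DOIs not found: ", paste(not_found, collapse = ", "))
  }
  df$language[df$doi %in% dois] <- language
  df
}

# Toy stand-in for rawdata
toy <- data.frame(
  doi = paste0("10.3891/acta.chem.scand.", c("06-0189", "04-0589", "01-0001")),
  language = "english",
  stringsAsFactors = FALSE
)

toy <- set_language(toy, "06-0189", "french")
toy <- set_language(toy, "04-0589", "german")
print(toy)
```

As with the loops, the order of the calls matters: a suffix that appears in two lists (here "21-1293" and "05-0227" are in both the English and the German list) ends up with the language of the last call.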

Damn, that took some time:

table(rawdata$language)
## 
## english  french  german 
##   16189      27     276

Now you see why I changed all the French papers to English at the beginning. This way, I needed to change the language of 27 papers from English to French. The other way around, I would have had to change the language of 37 papers from French to English. Not much faster. But a bit.

The result: now the data is not only technically correct, it is also consistent. And now we’re ready to analyse. Almost. Journals did not only come in volumes. They came in issues. It would be nice to know which issue a given paper belonged to, though I’m not sure I can get that information. It would give me more granular data for the time series I want to plot.

Hm… Scopus does have some information. But only for the years 1974 to 1988. Which makes that database bloody useless for a chemist. Also please note: Scopus classifies the B series of Acta Chem. Scand. under the subject area “medicine”. The series actually covered organic chemistry and biochemistry. Absolutely useless!

Google Scholar appears to have something. But how to get at it? Web of Science has it. But it will take forever to get them all… I can only output 500 references at a time.

Hm… This might be difficult to automate. Vol 1:

  • number 1: 1-132
  • number 2: 133-267
  • number 3: 269-352
  • number 4: 353-448
  • number 5: 449-528
  • number 6: 529-618
  • number 7: 619-684
  • number 8: 685-780
  • number 9: 781-860
  • number 10: 861-952

Repeat for all volumes.
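
Once the page ranges are known, assigning an issue to each paper could be done from the start page. This is a minimal sketch for volume 1 only, assuming the pages column always has the form "start-end"; the last two rows of the toy data are invented page ranges used purely for illustration.

```r
# First page of each issue in volume 1, taken from the list above
issue_starts <- c(1, 133, 269, 353, 449, 529, 619, 685, 781, 861)

# Toy stand-in for the volume 1 rows of rawdata
toy <- data.frame(
  pages = c("1-7", "8-31", "32-53", "140-150", "861-870"),
  stringsAsFactors = FALSE
)

# Take the start page from "start-end", then find the issue whose range contains it
start_page <- as.integer(sub("-.*$", "", toy$pages))
toy$issue <- findInterval(start_page, issue_starts)
print(toy)
```

findInterval only looks at where each issue begins, so a start page beyond the end of the last issue would still be assigned to issue 10. A real version would need one vector of start pages per volume, plus a check against the last page of each volume.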

That took some time. Of course I could have made the harvest semi-automatically from Web of Science. But then I would have missed the link to the pdf, which I would really like to have for later.

I got the numbers above by exporting all references for the first volume from Web of Science, importing them into JabRef, sorting by page number, and extracting the issue numbers by hand. Not efficient.

SciFinder. Not much better. At least I can search by volume and issue number, and get: vol 2, number 1: 1-93

But apparently there was no issue 2. Issue 3, however, is 193-294.

Damn.

Anyway, let’s save the now consistent data:

saveRDS(rawdata, file="consistentdata.Rda")