Acta Chem. Scand – now with plots!

I got through the first three steps of data analysis. The data is harvested. It is technically correct. And it is consistent.

And I made a start on the analysis part. But figuring out how well textcat recognizes languages is not that interesting.

So it is time to actually do some analysis. Let's begin by loading the data and refamiliarizing ourselves with its structure:

data <- readRDS(file="d:\\acta\\consistentdata.rda")
str(data)

## 'data.frame':    16494 obs. of  8 variables:
##  $ title   : chr  "Some Improvements in Electrophoresis." "Di-O-alkylmonothiophosphates and Di-O-alkylmonoselenophosphates and the Corresponding Pseudohalogens." "On the determination of Reducing Sugars." "On the Application of p-Carboxyphenylhydrazones in the Identification of Carbonyl Compounds." ...
##  $ authors : chr  "Astrup, Tage; Brodersen, Rolf" "Foss, Olav" "Blom, Jakob; Rosted, Carl Olof" "Veibel, Stig" ...
##  $ language: chr  "english" "english" "english" "english" ...
##  $ pages   : chr  "1-7" "8-31" "32-53" "54-68" ...
##  $ doi     : chr  "10.3891/acta.chem.scand.01-0001" "10.3891/acta.chem.scand.01-0008" "10.3891/acta.chem.scand.01-0032" "10.3891/acta.chem.scand.01-0054" ...
##  $ volume  : chr  "1" "1" "1" "1" ...
##  $ year    : chr  "1947" "1947" "1947" "1947" ...
##  $ url     : chr  "/pdf/acta_vol_01_p0001-0007.pdf" "/pdf/acta_vol_01_p0008-0031.pdf" "/pdf/acta_vol_01_p0032-0053.pdf" "/pdf/acta_vol_01_p0054-0068.pdf" ...

Let's take a look at the languages over the years. Begin by making the table:

languagetable <- table(data$year, data$language)
languagetable
##       
##        english french german
##   1947      92      0     13
##   1948     104      1     11
##   1949     192      3     11
##   1950     219      2     16
##   1951     189      0     13
##   1952     225      1     11
##   1953     222      0      7
##   1954     340      0     21
##   1955     338      1      7
##   1956     325      1      5
##   1957     339      2     11
##   1958     345      0      8
##   1959     371      0      8
##   1960     348      0     10
##   1961     333      0     14
##   1962     371      0     22
##   1963     551      0     13
##   1964     379      0      7
##   1965     413      0     10
##   1966     432      2     11
##   1967     451      1      5
##   1968     475      3      4
##   1969     517      1     10
##   1970     566      0      4
##   1971     554      1      5
##   1972     584      2     10
##   1973     542      3      6
##   1974     432      0      2
##   1975     370      0      0
##   1976     369      0      0
##   1977     351      2      0
##   1978     304      1      1
##   1979     303      0      0
##   1980     286      0      0
##   1981     253      0      0
##   1982     262      0      0
##   1983     275      0      0
##   1984     255      0      0
##   1985     218      0      0
##   1986     237      0      0
##   1987     215      0      0
##   1988     212      0      0
##   1989     180      0      0
##   1990     189      0      0
##   1991     187      0      0
##   1992     200      0      0
##   1993     202      0      0
##   1994     171      0      0
##   1995     153      0      0
##   1996     183      0      0
##   1997     191      0      0
##   1998     205      0      0
##   1999     169      0      0
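
To see what table() does with two columns, here is a minimal sketch on a made-up five-row data frame (toy and toytable are names I invented for the illustration, not part of the real data). The rows are the years, the columns are the languages in alphabetical order, which is why german ends up in column 3.

```r
# A made-up miniature of the real data: one row per paper.
toy <- data.frame(
  year     = c("1947", "1947", "1947", "1948", "1948"),
  language = c("english", "english", "german", "english", "french"),
  stringsAsFactors = FALSE
)

# Cross-tabulate: first argument gives the rows, second the columns.
toytable <- table(toy$year, toy$language)
toytable

# A single cell can be picked out by name...
toytable["1947", "german"]

# ...and a whole column by position. Columns are sorted alphabetically,
# so column 3 is "german" here, just as in the real table.
toytable[, 3]
```

Indexing by name, as in toytable[, "german"], gives the same column and keeps working if another language ever shows up in the data.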

A quick plot of the number of German papers over the years:

plot(languagetable[,3], type="l")

There’s a clear decline, but with some peaks.

Let's get the proportions. R has a useful function for that, prop.table. The second argument, 1, asks for proportions within each row, that is, within each year:

prop.table(languagetable,1)
##       
##            english      french      german
##   1947 0.876190476 0.000000000 0.123809524
##   1948 0.896551724 0.008620690 0.094827586
##   1949 0.932038835 0.014563107 0.053398058
##   1950 0.924050633 0.008438819 0.067510549
##   1951 0.935643564 0.000000000 0.064356436
##   1952 0.949367089 0.004219409 0.046413502
##   1953 0.969432314 0.000000000 0.030567686
##   1954 0.941828255 0.000000000 0.058171745
##   1955 0.976878613 0.002890173 0.020231214
##   1956 0.981873112 0.003021148 0.015105740
##   1957 0.963068182 0.005681818 0.031250000
##   1958 0.977337110 0.000000000 0.022662890
##   1959 0.978891821 0.000000000 0.021108179
##   1960 0.972067039 0.000000000 0.027932961
##   1961 0.959654179 0.000000000 0.040345821
##   1962 0.944020356 0.000000000 0.055979644
##   1963 0.976950355 0.000000000 0.023049645
##   1964 0.981865285 0.000000000 0.018134715
##   1965 0.976359338 0.000000000 0.023640662
##   1966 0.970786517 0.004494382 0.024719101
##   1967 0.986870897 0.002188184 0.010940919
##   1968 0.985477178 0.006224066 0.008298755
##   1969 0.979166667 0.001893939 0.018939394
##   1970 0.992982456 0.000000000 0.007017544
##   1971 0.989285714 0.001785714 0.008928571
##   1972 0.979865772 0.003355705 0.016778523
##   1973 0.983666062 0.005444646 0.010889292
##   1974 0.995391705 0.000000000 0.004608295
##   1975 1.000000000 0.000000000 0.000000000
##   1976 1.000000000 0.000000000 0.000000000
##   1977 0.994334278 0.005665722 0.000000000
##   1978 0.993464052 0.003267974 0.003267974
##   1979 1.000000000 0.000000000 0.000000000
##   1980 1.000000000 0.000000000 0.000000000
##   1981 1.000000000 0.000000000 0.000000000
##   1982 1.000000000 0.000000000 0.000000000
##   1983 1.000000000 0.000000000 0.000000000
##   1984 1.000000000 0.000000000 0.000000000
##   1985 1.000000000 0.000000000 0.000000000
##   1986 1.000000000 0.000000000 0.000000000
##   1987 1.000000000 0.000000000 0.000000000
##   1988 1.000000000 0.000000000 0.000000000
##   1989 1.000000000 0.000000000 0.000000000
##   1990 1.000000000 0.000000000 0.000000000
##   1991 1.000000000 0.000000000 0.000000000
##   1992 1.000000000 0.000000000 0.000000000
##   1993 1.000000000 0.000000000 0.000000000
##   1994 1.000000000 0.000000000 0.000000000
##   1995 1.000000000 0.000000000 0.000000000
##   1996 1.000000000 0.000000000 0.000000000
##   1997 1.000000000 0.000000000 0.000000000
##   1998 1.000000000 0.000000000 0.000000000
##   1999 1.000000000 0.000000000 0.000000000
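
The second argument to prop.table is easy to get wrong, so here is a small sketch using the counts for 1947 to 1949 copied from the table further up (counts and byyear are just names I picked for the example). With 1 every row sums to 1, with 2 every column does.

```r
# Counts for the first three years, copied from the language table above.
counts <- as.table(matrix(
  c( 92, 0, 13,
    104, 1, 11,
    192, 3, 11),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1947", "1948", "1949"),
                  c("english", "french", "german"))
))

# Margin 1: proportions within each row (year).
byyear <- prop.table(counts, 1)
rowSums(byyear)

# 13 German papers out of 105 in 1947.
byyear["1947", "german"]

# Margin 2: proportions within each column (language) instead.
# This answers a different question: how a language's papers
# are spread over the years.
prop.table(counts, 2)
```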

If we want to plot the proportions, we just use:

plot(row.names(languagetable), prop.table(languagetable,1)[,2]*100, type="l", xlab="Year", ylab="% french language papers", main="The decline of French in Acta.Chem.Scand.")

There never were many. So what are we doing here? plot takes a number of arguments. x is row.names(languagetable), the years. y is the proportion: we choose column 2, which corresponds to French, and multiply by 100 to get nice percentages. type is “l” for a line plot. xlab is the label for the x-axis, and ylab is the same for the y-axis. main is the main title of the plot.

What about German? Same story, we just need to change a single digit, from 2 to 3:

plot(row.names(languagetable), prop.table(languagetable,1)[,3]*100, type="l", xlab="Year", ylab="% german language papers", main="The decline of German in Acta.Chem.Scand.")

There’s a lot less noise.
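
To compare the two languages directly, both lines can go into one plot. This is only a sketch: it uses the percentages for 1947 to 1956, rounded from the proportions table above, and the variable names are my own. The point is that plot() starts the figure and lines() adds to it.

```r
# German and French percentages for 1947-1956, rounded from the proportions table.
years      <- 1947:1956
german_pct <- c(12.38, 9.48, 5.34, 6.75, 6.44, 4.64, 3.06, 5.82, 2.02, 1.51)
french_pct <- c( 0.00, 0.86, 1.46, 0.84, 0.00, 0.42, 0.00, 0.00, 0.29, 0.30)

# Draw the first series; ylim covers both series so the second one fits too.
plot(years, german_pct, type = "l", col = "black",
     ylim = range(c(german_pct, french_pct)),
     xlab = "Year", ylab = "% of papers",
     main = "German and French in Acta Chem. Scand., 1947-1956")

# lines() adds to the existing plot instead of starting a new one.
lines(years, french_pct, col = "red")

# A legend so the two lines can be told apart.
legend("topright", legend = c("German", "French"),
       col = c("black", "red"), lty = 1)
```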

Let's try to add a trendline. We need x and y values. The absolute numbers are probably not that interesting, so I’ll look at the proportions:

x <- as.integer(row.names(languagetable))
y <- prop.table(languagetable,1)[,3]*100
y <- unname(y)

prop.table returns the proportions (here the German proportions) as a named numeric vector, with the years as names. Here I need an unnamed one, which is what unname is for.
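
A quick sketch of the difference between the named and the unnamed version, using three German percentages rounded from the table above (p and q are throwaway names for the example):

```r
# A named numeric vector, like one column of the proportions table.
p <- c("1947" = 12.38, "1948" = 9.48, "1949" = 5.34)

class(p)        # "numeric"
is.integer(p)   # FALSE: these are doubles, not integers
names(p)        # the years, as character strings

# unname() drops the names and leaves the values untouched.
q <- unname(p)
names(q)        # NULL
q

# The years themselves have to be converted separately to be usable as x values.
as.integer(names(p))
```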

There are several possible trendlines. I’ve stolen the code from answer 27 on http://stackoverflow.com/questions/15102254/how-do-i-add-different-trend-lines-in-r

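# basic straight line of fit
# glm() with its default gaussian family is an ordinary least-squares fit
fit <- glm(y~x)
co <- coef(fit)
# plot() starts a new figure on its own, so no plot.new() is needed first
plot(x, y, type="l")
# abline() reads intercept and slope from the fitted model
abline(fit, col="blue", lwd=2)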

# exponential
#f <- function(x,a,b) {a * exp(b * x)}
#fit <- nls(y ~ f(x,a,b), start = c(a=1, b=1)) 
#co <- coef(fit)
#curve(f(x, a=co[1], b=co[2]), add = TRUE, col="green", lwd=2) 

# logarithmic
#f <- function(x,a,b) {a * log(x) + b}
#fit <- nls(y ~ f(x,a,b), start = c(a=1, b=1)) 
#co <- coef(fit)
#curve(f(x, a=co[1], b=co[2]), add = TRUE, col="orange", lwd=2) 

# polynomial
#f <- function(x,a,b,d) {(a*x^2) + (b*x) + d}
#fit <- nls(y ~ f(x,a,b,d), start = c(a=1, b=1, d=1)) 
#co <- coef(fit)
#curve(f(x, a=co[1], b=co[2], d=co[3]), add = TRUE, col="pink", lwd=2) 

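The last three fits are commented out: they give infinite results, something goes under a threshold, and other stuff. For the exponential one that is no surprise, since x is a year between 1947 and 1999 and the start value b=1 makes exp(b * x) overflow to infinity. Basically, the fits don’t fit.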
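
For the exponential fit, one possible way around the overflow is sketched here. This is not from the Stack Overflow answer, and I have not run it on the full data set: it uses only the German percentages for 1947 to 1956, rounded from the table above. The assumptions are mine: count years from the first year instead of using the calendar year, and pick start values in the neighbourhood of the data. The call is wrapped in try() because nls may still refuse.

```r
# German percentages for 1947-1956, rounded from the proportions table.
x <- 1947:1956
y <- c(12.38, 9.48, 5.34, 6.75, 6.44, 4.64, 3.06, 5.82, 2.02, 1.51)

# The problem: with b = 1 and x a calendar year, exp() overflows.
exp(1 * x[1])

# Assumption: measure time as years since the first year, so x starts at 0.
elapsed <- x - min(x)

f <- function(elapsed, a, b) a * exp(b * elapsed)

# With these (guessed) start values the model is finite everywhere.
f(elapsed, a = 10, b = -0.1)

# try() keeps the script going if nls still fails to converge.
fit_exp <- try(nls(y ~ f(elapsed, a, b), start = c(a = 10, b = -0.1)),
               silent = TRUE)
if (!inherits(fit_exp, "try-error")) {
  print(coef(fit_exp))
}
```

Whether this behaves on all 53 years, where the percentage is zero for the last couple of decades, is a separate question.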

But the basic straight line does. Or rather, we are able to get a result.
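
What the result actually is can be read from the coefficients. A small sketch, again on the German percentages for 1947 to 1956 rounded from the table above rather than the full series: glm with its default settings gives the same coefficients as lm, and the slope is the change in percentage points per year.

```r
# German percentages for 1947-1956, rounded from the proportions table.
x <- 1947:1956
y <- c(12.38, 9.48, 5.34, 6.75, 6.44, 4.64, 3.06, 5.82, 2.02, 1.51)

# glm() defaults to the gaussian family, which is ordinary least squares.
fit_glm <- glm(y ~ x)
fit_lm  <- lm(y ~ x)

# The two sets of coefficients agree.
all.equal(coef(fit_glm), coef(fit_lm))

# The slope: percentage points gained (or here, lost) per year.
coef(fit_lm)["x"]

# summary() on the lm fit also reports R-squared, a rough measure of
# how well a straight line describes the decline.
summary(fit_lm)$r.squared

# Same plot as before, with the fitted line on top.
plot(x, y, type = "l")
abline(fit_lm, col = "blue", lwd = 2)
```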

Neato!