I got through the first three steps of data analysis. The data is harvested. It is technically correct. And it is consistent.
I also made a start on the analysis part. But figuring out how well textcat recognizes languages is not that interesting.
So it is time to actually do some analysis. Let's begin by loading the data and refamiliarizing ourselves with the structure:
data <- readRDS(file="d:\\acta\\consistentdata.rda")
str(data)
## 'data.frame': 16494 obs. of 8 variables:
## $ title : chr "Some Improvements in Electrophoresis." "Di-O-alkylmonothiophosphates and Di-O-alkylmonoselenophosphates and the Corresponding Pseudohalogens." "On the determination of Reducing Sugars." "On the Application of p-Carboxyphenylhydrazones in the Identification of Carbonyl Compounds." ...
## $ authors : chr "Astrup, Tage; Brodersen, Rolf" "Foss, Olav" "Blom, Jakob; Rosted, Carl Olof" "Veibel, Stig" ...
## $ language: chr "english" "english" "english" "english" ...
## $ pages : chr "1-7" "8-31" "32-53" "54-68" ...
## $ doi : chr "10.3891/acta.chem.scand.01-0001" "10.3891/acta.chem.scand.01-0008" "10.3891/acta.chem.scand.01-0032" "10.3891/acta.chem.scand.01-0054" ...
## $ volume : chr "1" "1" "1" "1" ...
## $ year : chr "1947" "1947" "1947" "1947" ...
## $ url : chr "/pdf/acta_vol_01_p0001-0007.pdf" "/pdf/acta_vol_01_p0008-0031.pdf" "/pdf/acta_vol_01_p0032-0053.pdf" "/pdf/acta_vol_01_p0054-0068.pdf" ...
Let's take a look at the languages over the years. Begin by making the table:
languagetable <- table(data$year, data$language)
languagetable
##
## english french german
## 1947 92 0 13
## 1948 104 1 11
## 1949 192 3 11
## 1950 219 2 16
## 1951 189 0 13
## 1952 225 1 11
## 1953 222 0 7
## 1954 340 0 21
## 1955 338 1 7
## 1956 325 1 5
## 1957 339 2 11
## 1958 345 0 8
## 1959 371 0 8
## 1960 348 0 10
## 1961 333 0 14
## 1962 371 0 22
## 1963 551 0 13
## 1964 379 0 7
## 1965 413 0 10
## 1966 432 2 11
## 1967 451 1 5
## 1968 475 3 4
## 1969 517 1 10
## 1970 566 0 4
## 1971 554 1 5
## 1972 584 2 10
## 1973 542 3 6
## 1974 432 0 2
## 1975 370 0 0
## 1976 369 0 0
## 1977 351 2 0
## 1978 304 1 1
## 1979 303 0 0
## 1980 286 0 0
## 1981 253 0 0
## 1982 262 0 0
## 1983 275 0 0
## 1984 255 0 0
## 1985 218 0 0
## 1986 237 0 0
## 1987 215 0 0
## 1988 212 0 0
## 1989 180 0 0
## 1990 189 0 0
## 1991 187 0 0
## 1992 200 0 0
## 1993 202 0 0
## 1994 171 0 0
## 1995 153 0 0
## 1996 183 0 0
## 1997 191 0 0
## 1998 205 0 0
## 1999 169 0 0
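To see what table() does with two vectors, here is a minimal sketch on a tiny made-up data frame. The names toy and toytable, and the five rows, are invented for illustration and are not part of the real data set.

```r
# Toy data: five made-up papers, not taken from the real data set
toy <- data.frame(
  year     = c("1947", "1947", "1947", "1948", "1948"),
  language = c("english", "german", "english", "english", "french"),
  stringsAsFactors = FALSE
)

# The first argument becomes the rows, the second becomes the columns.
# Combinations that never occur are counted as 0.
toytable <- table(toy$year, toy$language)
toytable

# A single cell can be looked up by row name and column name
toytable["1947", "english"]
```

Looking cells up by name is a bit more robust than by position, since the column order depends on which languages happen to be in the data.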
A quick plot of the number of German papers over the years:
plot(languagetable[,3], type="l")
There’s a clear decline, but with some peaks.
Let's get the proportions. R has a useful function for that, prop.table. The second argument, 1, asks for proportions within each row, that is, within each year:
prop.table(languagetable,1)
##
## english french german
## 1947 0.876190476 0.000000000 0.123809524
## 1948 0.896551724 0.008620690 0.094827586
## 1949 0.932038835 0.014563107 0.053398058
## 1950 0.924050633 0.008438819 0.067510549
## 1951 0.935643564 0.000000000 0.064356436
## 1952 0.949367089 0.004219409 0.046413502
## 1953 0.969432314 0.000000000 0.030567686
## 1954 0.941828255 0.000000000 0.058171745
## 1955 0.976878613 0.002890173 0.020231214
## 1956 0.981873112 0.003021148 0.015105740
## 1957 0.963068182 0.005681818 0.031250000
## 1958 0.977337110 0.000000000 0.022662890
## 1959 0.978891821 0.000000000 0.021108179
## 1960 0.972067039 0.000000000 0.027932961
## 1961 0.959654179 0.000000000 0.040345821
## 1962 0.944020356 0.000000000 0.055979644
## 1963 0.976950355 0.000000000 0.023049645
## 1964 0.981865285 0.000000000 0.018134715
## 1965 0.976359338 0.000000000 0.023640662
## 1966 0.970786517 0.004494382 0.024719101
## 1967 0.986870897 0.002188184 0.010940919
## 1968 0.985477178 0.006224066 0.008298755
## 1969 0.979166667 0.001893939 0.018939394
## 1970 0.992982456 0.000000000 0.007017544
## 1971 0.989285714 0.001785714 0.008928571
## 1972 0.979865772 0.003355705 0.016778523
## 1973 0.983666062 0.005444646 0.010889292
## 1974 0.995391705 0.000000000 0.004608295
## 1975 1.000000000 0.000000000 0.000000000
## 1976 1.000000000 0.000000000 0.000000000
## 1977 0.994334278 0.005665722 0.000000000
## 1978 0.993464052 0.003267974 0.003267974
## 1979 1.000000000 0.000000000 0.000000000
## 1980 1.000000000 0.000000000 0.000000000
## 1981 1.000000000 0.000000000 0.000000000
## 1982 1.000000000 0.000000000 0.000000000
## 1983 1.000000000 0.000000000 0.000000000
## 1984 1.000000000 0.000000000 0.000000000
## 1985 1.000000000 0.000000000 0.000000000
## 1986 1.000000000 0.000000000 0.000000000
## 1987 1.000000000 0.000000000 0.000000000
## 1988 1.000000000 0.000000000 0.000000000
## 1989 1.000000000 0.000000000 0.000000000
## 1990 1.000000000 0.000000000 0.000000000
## 1991 1.000000000 0.000000000 0.000000000
## 1992 1.000000000 0.000000000 0.000000000
## 1993 1.000000000 0.000000000 0.000000000
## 1994 1.000000000 0.000000000 0.000000000
## 1995 1.000000000 0.000000000 0.000000000
## 1996 1.000000000 0.000000000 0.000000000
## 1997 1.000000000 0.000000000 0.000000000
## 1998 1.000000000 0.000000000 0.000000000
## 1999 1.000000000 0.000000000 0.000000000
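The second argument to prop.table decides what the proportions are taken of. A small sketch, using the first three years of the counts above typed in by hand (the names counts, by_year and by_language are mine):

```r
# Counts for 1947-1949, copied from the table of languages per year
counts <- matrix(
  c( 92, 0, 13,
    104, 1, 11,
    192, 3, 11),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1947", "1948", "1949"),
                  c("english", "french", "german"))
)

by_year     <- prop.table(counts, 1)  # margin 1: each row sums to 1
by_language <- prop.table(counts, 2)  # margin 2: each column sums to 1

rowSums(by_year)
colSums(by_language)

# Share of German papers in 1947, as a percentage: 13 out of 105
by_year["1947", "german"] * 100
```

The full table above is the by-row version, so the three proportions for each year add up to 1.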
If we want to plot that, we just use:
plot(row.names(languagetable), prop.table(languagetable,1)[,2]*100, type="l", xlab="Year", ylab="% french language papers", main="The decline of French in Acta.Chem.Scand.")
There never were many. What are we doing? plot() takes a number of arguments. x is row.names(languagetable), the years. y is the proportion: we choose column 2, which corresponds to French, and multiply by 100 to get nice percentages. type is “l” for a line plot. xlab is the label for the x-axis, and ylab is self-explanatory. main is the main title of the plot.
What about German? Same story, we just need to change a single digit, from 2 to 3:
plot.new()
plot(row.names(languagetable), prop.table(languagetable,1)[,3]*100, type="l", xlab="Year", ylab="% german language papers", main="The decline of German in Acta.Chem.Scand.")
There’s a lot less noise.
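The two plots could also share one figure, which makes the comparison easier. This is only a sketch: it uses the first five years of the counts, typed in by hand, and the colours, title and legend position are my own choices.

```r
# Counts for 1947-1951, copied from the table of languages per year
counts <- matrix(
  c( 92, 0, 13,
    104, 1, 11,
    192, 3, 11,
    219, 2, 16,
    189, 0, 13),
  ncol = 3, byrow = TRUE,
  dimnames = list(as.character(1947:1951),
                  c("english", "french", "german"))
)

pct   <- prop.table(counts, 1) * 100   # percentages within each year
years <- as.integer(row.names(pct))    # row names are text, so convert

# Draw German first, with a y-axis wide enough for both series
plot(years, pct[, "german"], type = "l", col = "blue",
     ylim = c(0, max(pct[, c("french", "german")])),
     xlab = "Year", ylab = "% of papers",
     main = "French and German in Acta.Chem.Scand.")

# lines() adds to the existing plot instead of starting a new one
lines(years, pct[, "french"], col = "red")
legend("topright", legend = c("German", "French"),
       col = c("blue", "red"), lty = 1)
```

Picking the columns by name ("german", "french") instead of by number avoids having to remember which digit is which language.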
Let's try to add a trendline. We need x and y values. The absolute numbers are probably not that interesting, so I’ll look at the proportions:
x <- as.integer(row.names(languagetable))
y <- prop.table(languagetable,1)[,3]*100
y <- unname(y)
prop.table returns the proportions (here the German proportions) as a named numeric vector. Here I need an unnamed one.
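A quick way to check what unname() actually changes, sketched on two years of the counts typed in by hand (y_named and y_plain are names I made up for the illustration):

```r
# Counts for 1947 and 1948, copied from the table of languages per year
counts <- matrix(
  c( 92, 0, 13,
    104, 1, 11),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("1947", "1948"),
                  c("english", "french", "german"))
)

y_named <- prop.table(counts, 1)[, "german"] * 100
names(y_named)    # the years, carried over from the row names
typeof(y_named)   # "double": proportions are not integers

y_plain <- unname(y_named)
names(y_plain)    # NULL: same values, names stripped

x <- as.integer(row.names(counts))
typeof(x)         # "integer": the years, converted from text
```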
There are several possible trendlines. I’ve stolen the code from answer 27 on http://stackoverflow.com/questions/15102254/how-do-i-add-different-trend-lines-in-r
# basic straight line of fit
fit <- glm(y~x)
co <- coef(fit)
plot.new()
plot(x, y, type="l")
abline(fit, col="blue", lwd=2)
# exponential
#f <- function(x,a,b) {a * exp(b * x)}
#fit <- nls(y ~ f(x,a,b), start = c(a=1, b=1))
#co <- coef(fit)
#curve(f(x, a=co[1], b=co[2]), add = TRUE, col="green", lwd=2)
# logarithmic
#f <- function(x,a,b) {a * log(x) + b}
#fit <- nls(y ~ f(x,a,b), start = c(a=1, b=1))
#co <- coef(fit)
#curve(f(x, a=co[1], b=co[2]), add = TRUE, col="orange", lwd=2)
# polynomial
#f <- function(x,a,b,d) {(a*x^2) + (b*x) + d}
#fit <- nls(y ~ f(x,a,b,d), start = c(a=1, b=1, d=1))
#co <- coef(fit)
#curve(f(x, a=co[1], b=co[2], d=co[3]), add = TRUE, col="pink", lwd=2)
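The last three fits are commented out. They give infinite results, a step size drops below a threshold, and so on. A likely culprit is the starting values: x holds the raw years, so with b=1 the exponential model starts out at exp(1947), which overflows to Inf. Basically, the fits don’t fit.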
But the basic straight line does. Or rather, we are able to get a result.
Neato!
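One way the curved fits might be made to work is to shift the years so they start at zero and to give nls() starting values that are somewhere near the data. The sketch below uses made-up data that only imitates a declining percentage, not the real series, and x0, fit_lin, fit_quad and fit_exp are names I invented. I have not run this on the real data.

```r
# Made-up data: an exponential decline plus a little noise.
# This is NOT the real German series, just something with a similar shape.
set.seed(1)
x <- 1947:1999
y <- 10 * exp(-0.08 * (x - 1947)) + rnorm(length(x), sd = 0.2)

# Shift the years to 0, 1, 2, ... so that exp(b * x0) stays finite
x0 <- x - min(x)

# Straight line
fit_lin <- lm(y ~ x0)

# Quadratic: still linear in its coefficients, so lm() can fit it
# directly and no starting values are needed
fit_quad <- lm(y ~ poly(x0, 2))

# Exponential: needs nls(), with starting values close to the data
# (a near the first observations, b negative because y declines)
fit_exp <- nls(y ~ a * exp(b * x0), start = list(a = max(y), b = -0.1))

plot(x, y, type = "l")
lines(x, fitted(fit_lin),  col = "blue",  lwd = 2)
lines(x, fitted(fit_quad), col = "pink",  lwd = 2)
lines(x, fitted(fit_exp),  col = "green", lwd = 2)
```

The lines are drawn with fitted() rather than curve(), so the shifted x0 used in the models never has to be translated back to years by hand.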