We saw the rapid decline of German as a scientific language in the last installment. But what else can we learn, given that I do not have access to the full text of the papers?
Well, what about the number of authors on a given paper?
I have a hypothesis. In the beginning of time, chemistry was a lonely science, where individual scientists worked and published alone.
As the years went by, chemistry became a more collaborative science, with more scientists working together, and also publishing together.
So the average number of authors on a paper should rise over time.
There is only one way to see if I’m right.
Let’s again begin by reading the data:
data <- readRDS(file="d:\\acta\\consistentdata.rda")
head(data)
## title
## 1 Some Improvements in Electrophoresis.
## 2 Di-O-alkylmonothiophosphates and Di-O-alkylmonoselenophosphates and the Corresponding Pseudohalogens.
## 3 On the determination of Reducing Sugars.
## 4 On the Application of p-Carboxyphenylhydrazones in the Identification of Carbonyl Compounds.
## 5 A Note on the Growth Promoting Properties of an Enzymatic Hydrolysate of Casein.
## 6 Die Konstitution der Harzphenole und ihre biogenetischen Zusammenhänge. X. Herausspaltung des "MIttelstückes" des Pinoresinols
## authors language pages
## 1 Astrup, Tage; Brodersen, Rolf english 1-7
## 2 Foss, Olav english 8-31
## 3 Blom, Jakob; Rosted, Carl Olof english 32-53
## 4 Veibel, Stig english 54-68
## 5 Ågren, Gunnar english 69-70
## 6 Erdtman, Holger; Gripenberg, Jarl german 71-75
## doi volume year
## 1 10.3891/acta.chem.scand.01-0001 1 1947
## 2 10.3891/acta.chem.scand.01-0008 1 1947
## 3 10.3891/acta.chem.scand.01-0032 1 1947
## 4 10.3891/acta.chem.scand.01-0054 1 1947
## 5 10.3891/acta.chem.scand.01-0069 1 1947
## 6 10.3891/acta.chem.scand.01-0071 1 1947
## url
## 1 /pdf/acta_vol_01_p0001-0007.pdf
## 2 /pdf/acta_vol_01_p0008-0031.pdf
## 3 /pdf/acta_vol_01_p0032-0053.pdf
## 4 /pdf/acta_vol_01_p0054-0068.pdf
## 5 /pdf/acta_vol_01_p0069-0070.pdf
## 6 /pdf/acta_vol_01_p0071-0075.pdf
The authors are found in the dataframe, in the column “authors”. And looking at the first record:
testauthor <- data$authors[1]
str(testauthor)
## chr "Astrup, Tage; Brodersen, Rolf"
we find that the author names are in a single string, separated by “;”.
That string can be split:
testauthors <- strsplit(testauthor, ";")[[1]]
str(testauthors)
## chr [1:2] "Astrup, Tage" " Brodersen, Rolf"
length(testauthors)
## [1] 2
At first it was not entirely clear to me why I don’t get the two names directly. The reason is that strsplit returns a list with one element per input string, and each element is a character vector holding the pieces. That’s the reason for the [[1]].
Probably for the same reason, I’m having problems with a seemingly simple task: applying a function to all rows in the dataframe and making a new column. Basically, I would like to wrap the above in a function and get a new column in the dataframe with the number of authors on a given paper.
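A small sketch of what strsplit returns when it is given more than one string, using two author strings copied from the head(data) output above. The variable names authors, pieces and first are only illustrative. It also shows trimws, which removes the leading space that the split leaves in front of the second name.

```r
# Two author strings copied from the head(data) output above
authors <- c("Astrup, Tage; Brodersen, Rolf", "Foss, Olav")

# strsplit returns a list: one character vector per input string
pieces <- strsplit(authors, ";")
str(pieces)

# [[1]] picks the character vector belonging to the first string;
# trimws strips the leading space left behind by the split
first <- trimws(pieces[[1]])
first
```

With a single input string the list has just one element, which is why [[1]] is enough in the example above.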
Annoying. But there is a way around.
for(i in 1:nrow(data)){
data$number[i] <- length(unlist(strsplit(data$authors[i],";")))
}
head(data[,c("authors", "number")])
## authors number
## 1 Astrup, Tage; Brodersen, Rolf 2
## 2 Foss, Olav 1
## 3 Blom, Jakob; Rosted, Carl Olof 2
## 4 Veibel, Stig 1
## 5 Ågren, Gunnar 1
## 6 Erdtman, Holger; Gripenberg, Jarl 2
Now I have a new column in data, with the name number, which contains the number of authors on a given paper.
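For what it is worth, the loop can probably be avoided. A minimal sketch, assuming the authors column is a character vector as the str output above suggests: strsplit accepts the whole vector at once, and lengths counts the pieces of each element. The authors vector below is just the six rows from head(data), typed in by hand so the example runs on its own.

```r
# The six author strings from head(data), entered by hand for illustration
authors <- c("Astrup, Tage; Brodersen, Rolf",
             "Foss, Olav",
             "Blom, Jakob; Rosted, Carl Olof",
             "Veibel, Stig",
             "Ågren, Gunnar",
             "Erdtman, Holger; Gripenberg, Jarl")

# Split every string at ";" and count the pieces per paper
number <- lengths(strsplit(authors, ";"))
number

# The same count with sapply, applying length to each list element
number_sapply <- sapply(strsplit(authors, ";"), length)
```

On the real dataframe this would amount to data$number <- lengths(strsplit(data$authors, ";")).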
Again I run into some problems with factors. No problem, I have a hammer. Two vectors are extracted and combined into a new dataframe (numbermat). Then the function aggregate, applied to column 2, grouped by year, with the function mean, gives me the result.
numbers <- data$number
year <- data$year
numbermat <- data.frame(year, numbers, stringsAsFactors = FALSE)
avNumAut <- aggregate(numbermat[,2], list(numbermat$year), mean)
avNumAut
## Group.1 x
## 1 1947 1.638095
## 2 1948 1.534483
## 3 1949 1.514563
## 4 1950 1.586498
## 5 1951 1.623762
## 6 1952 1.793249
## 7 1953 1.659389
## 8 1954 1.678670
## 9 1955 1.618497
## 10 1956 1.631420
## 11 1957 1.659091
## 12 1958 1.648725
## 13 1959 1.633245
## 14 1960 1.670391
## 15 1961 1.740634
## 16 1962 1.735369
## 17 1963 1.815603
## 18 1964 1.764249
## 19 1965 1.898345
## 20 1966 1.865169
## 21 1967 1.838074
## 22 1968 1.900415
## 23 1969 2.053030
## 24 1970 1.982456
## 25 1971 1.967857
## 26 1972 2.033557
## 27 1973 2.065336
## 28 1974 2.064516
## 29 1975 2.194595
## 30 1976 2.173442
## 31 1977 2.144476
## 32 1978 2.349673
## 33 1979 2.320132
## 34 1980 2.328671
## 35 1981 2.284585
## 36 1982 2.561069
## 37 1983 2.436364
## 38 1984 2.309804
## 39 1985 2.440367
## 40 1986 2.645570
## 41 1987 2.506977
## 42 1988 2.849057
## 43 1989 2.805556
## 44 1990 2.740741
## 45 1991 3.021390
## 46 1992 2.915000
## 47 1993 3.113861
## 48 1994 3.052632
## 49 1995 3.313725
## 50 1996 3.224044
## 51 1997 3.188482
## 52 1998 3.302439
## 53 1999 3.432749
sum(numbermat[numbermat['year'] == "1947", 'numbers'])/length(numbermat[numbermat['year'] == "1947", 'numbers'])
## [1] 1.638095
I’m just checking: will the sum of the number of authors in 1947, divided by the number of papers, give me the same result as aggregate?
It does.
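The same grouping and the same kind of check can be sketched a little more compactly with the formula interface of aggregate and with tapply. The toy dataframe below is made up purely for illustration and is not taken from the real data. One side effect of the formula interface is that the result keeps the column names year and number instead of Group.1 and x.

```r
# Hypothetical toy data, NOT the real Acta Chemica Scandinavica records
toy <- data.frame(year   = c(1947, 1947, 1947, 1948, 1948),
                  number = c(2, 1, 2, 1, 3))

# Mean number of authors per year; result columns are named year and number
by_year <- aggregate(number ~ year, data = toy, FUN = mean)
by_year

# Independent check: tapply computes the same per-year means
check <- tapply(toy$number, toy$year, mean)
all.equal(as.numeric(check), by_year$number)
```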
plot(avNumAut)
Looking at the plot, it appears that the rise is almost linear from 1960 or so.
Looking at the structure of the result, and then passing the two columns to a simple linear model:
str(avNumAut)
## 'data.frame': 53 obs. of 2 variables:
## $ Group.1: chr "1947" "1948" "1949" "1950" ...
## $ x : num 1.64 1.53 1.51 1.59 1.62 ...
x <- as.integer(unlist(avNumAut["Group.1"]))
y <- unlist(avNumAut["x"])
fit <- glm(y~x)
co <- coef(fit)
summary(fit)
##
## Call:
## glm(formula = y ~ x)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.28535 -0.12739 -0.03684 0.13649 0.33606
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -66.395759 2.870939 -23.13 <2e-16 ***
## x 0.034774 0.001455 23.90 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.02625785)
##
## Null deviance: 16.3357 on 52 degrees of freedom
## Residual deviance: 1.3392 on 51 degrees of freedom
## AIC: -38.54
##
## Number of Fisher Scoring iterations: 2
That is actually not that bad: the slope of about 0.035 additional authors per year is clearly different from zero.
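To put a number on the quality of the fit, here is a bit of arithmetic on the figures printed in the summary. For a Gaussian glm, the residual deviance is the residual sum of squares and the null deviance is the total sum of squares, so one minus their ratio is the R-squared of the corresponding linear model. The variable names are my own; the numbers are copied from the summary output.

```r
# Figures copied from the summary(fit) output above
null_deviance     <- 16.3357
residual_deviance <- 1.3392
slope             <- 0.034774

# Share of the variance in the yearly averages explained by the straight line
r_squared <- 1 - residual_deviance / null_deviance
round(r_squared, 3)

# Number of years it takes, on average, to gain one additional author
years_per_author <- 1 / slope
round(years_per_author, 1)
```

So the straight line accounts for roughly 92 % of the variation in the yearly averages, and the average paper gains one extra author about every 29 years.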
Let’s plot it again, this time with the linear fit:
plot(x,y, xlab="Year", ylab="Avg. # of authors", main="Average number of authors on papers\n in Acta.Chem.Scand. increases over time")
abline(fit,col="blue")
Conclusion: My hypothesis is correct. Or rather, there actually is an increase in the average number of authors on a paper over time, from about 1.6 in 1947 to about 3.4 in 1999.