Those issue numbers are still frustrating me. Until I get over that, let's take a look at guessing languages.
We’ll begin by importing the data (and a library):
data <- readRDS(file="d:\\acta\\consistentdata.rda")
library(textcat)
## Warning: package 'textcat' was built under R version 3.2.5
As I mentioned, it is possible to help textcat by telling it that only certain languages are possible.
After the close analysis I did earlier, I now have the correct languages in the dataset. I have no idea how long textcat took to guess the languages, since that happened on the fly. Let's find out:
timer <- proc.time()
textcat_guess <- textcat(data$title)
time_textcat <- proc.time() - timer
time_textcat
## user system elapsed
## 91.67 0.07 92.15
That actually takes some time! There are:
nrow(data)
## [1] 16494
rows in the dataset, which means that textcat processes
paste((nrow(data)/time_textcat[3]),"rows/second", sep=" ")
## [1] "178.990775908844 rows/second"
on this computer, while I'm doing other stuff. It still takes more than a minute to get through them all.
Well, that can be done faster!
my.profiles <- TC_byte_profiles[names(TC_byte_profiles) %in% c("english", "french", "german")]
timer2 <- proc.time()
textcat_guess_withhint <- textcat(data$title, p=my.profiles)
time_textcat_withhint <- proc.time() - timer2
time_textcat_withhint
## user system elapsed
## 19.01 0.00 19.05
Nice! That was fast, at least by comparison.
paste(round(time_textcat_withhint[3]/time_textcat[3]*100,1), " % of the time it took without hints.", sep=" ")
## [1] "20.7 % of the time it took without hints."
That's about five times as fast.
But how good are the guesses? Let's find out:
list1 <- data$language == textcat_guess
list2 <- data$language == textcat_guess_withhint
So. Without hints, we get:
paste(round(sum(list1, na.rm=TRUE)/length(data$language)*100,1),"%")
## [1] "91.2 %"
Correct guesses.
And with hints, we get:
paste(round(sum(list2, na.rm=TRUE)/length(data$language)*100,1),"%")
## [1] "97.7 %"
Correct guesses.
That is quite an improvement.
Where are the mistakes? Basically: when textcat guesses that the language is romanian, how many of those papers are actually english, german or french? Does textcat think that romanian lies closer to german than to english?
Please note that I’m not trying to bash textcat. I’m asking it to guess the language, based on no more than a single sentence, riddled with chemical terms. I’m actually quite impressed that it does it that well.
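Before tabulating everything, one quick way to eyeball the mistakes is to pull out the rows where the guess disagrees with the known language. A minimal sketch, with short toy vectors standing in for `data$language` and `textcat_guess` (the real vectors are of course much longer, and textcat can also return NA):

```r
# Toy stand-ins for data$language (known) and textcat_guess (guessed):
correct <- c("english", "english", "french", "german")
guessed <- c("english", "romanian", "french", "english")

# Indices of the rows where the guess disagrees with the known language;
# with the real data, data$title[mistakes] would show the offending titles.
mistakes <- which(correct != guessed)
mistakes
## [1] 2 4
```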
I have two lists: data$language, which contains the correct language, and textcat_guess, which contains the guessed language.
So, let's define a new dataframe with those two lists:
correct <- data$language
guessed <- textcat_guess
confmat <- data.frame(correct, guessed, stringsAsFactors = FALSE)
I'm calling it a confusion matrix, which happens to be the standard term for this kind of table. Anyway, this is the result:
table(confmat$guessed, confmat$correct)
##
## english french german
## afrikaans 13 0 0
## albanian 1 0 0
## basque 11 0 0
## catalan 397 4 0
## croatian-ascii 2 0 0
## czech-iso8859_2 2 0 0
## danish 139 0 2
## dutch 12 0 2
## english 14763 0 2
## esperanto 18 0 1
## estonian 2 0 0
## finnish 3 0 1
## french 39 23 5
## frisian 26 0 2
## german 58 0 253
## indonesian 3 0 1
## irish 4 0 0
## italian 2 0 0
## latin 162 0 0
## lithuanian 2 0 0
## manx 4 0 0
## middle_frisian 52 0 1
## portuguese 6 0 0
## romanian 117 0 1
## rumantsch 38 0 0
## scots 78 0 0
## scots_gaelic 12 0 0
## serbian-ascii 1 0 0
## slovak-ascii 148 0 2
## slovak-windows1250 1 0 0
## slovenian-iso8859_2 9 0 0
## spanish 20 0 0
## swedish 7 0 1
## tagalog 3 0 0
## welsh 34 0 2
Neat. Mostly, it is english-language papers that are assigned a wrong language. Of course it is; there are far more english papers than anything else. But it is interesting that textcat labels 39 papers that are actually english as french, more than the 23 actually french papers it recognizes correctly. Four french papers are assigned the language catalan.
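As an aside, the overall percentage of correct guesses computed earlier can also be read straight off a table like this one: sum the cells where the row name and column name agree, and divide by the grand total. A sketch on toy data (the real guessed-by-correct table above works the same way):

```r
# Toy stand-ins for the known and guessed languages:
correct <- c("english", "english", "french", "german")
guessed <- c("english", "romanian", "french", "german")
cm <- table(guessed, correct)

# Only rows and columns that share a name can contribute correct guesses:
common <- intersect(rownames(cm), colnames(cm))
accuracy <- sum(diag(cm[common, common, drop = FALSE])) / sum(cm)
accuracy
## [1] 0.75
```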
Anyway, this is not that interesting. It would be more interesting if I had another language-detection library to compare textcat with.
And a meta-result: from the moment I began thinking about making this table to the moment I actually found a way to do it, 20 hours passed. I wasn't thinking about it the whole time. But still.
Ten minutes after I figured out a very complicated way to do it, I realised that there was a far simpler way. It might not even be the simplest. But it is a lot simpler than what I came up with first. I'm obviously a long way from being good at R.
Back to the data: how well does textcat do when we help it by telling it that there are only papers in english, german and french?
hinted <- textcat_guess_withhint
conf2mat <- data.frame(correct, hinted, stringsAsFactors = FALSE)
table(conf2mat$hinted, conf2mat$correct)
##
## english french german
## english 15825 0 3
## french 136 26 9
## german 228 1 264
The correct language runs horizontally, and the guessed language vertically. Of those papers that are actually french, 26 are recognized as such, and one is determined to be german.
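To put numbers on that reading, the share of each true language that is guessed correctly (the recall) can be computed from the column totals. A sketch, re-entering the hinted confusion matrix above by hand:

```r
# The hinted confusion matrix from above, entered by hand
# (rows = guessed language, columns = correct language):
cm2 <- matrix(c(15825, 136, 228,
                    0,  26,   1,
                    3,   9, 264),
              nrow = 3,
              dimnames = list(guessed = c("english", "french", "german"),
                              correct = c("english", "french", "german")))

# Recall per true language: correct guesses over column totals.
recall <- unname(diag(cm2) / colSums(cm2))
round(recall * 100, 1)
## [1] 97.8 96.3 95.7
```

So the hints lift english recall the most, simply because english dominates the dataset.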
I still think textcat does a good job.
That was that. Next, let's take a look at the historical decline of french and german.