One thing I learned early on in building tweet aggregation sites for clients is that they expect to see only tweets in English. After all, if Google can do it, why can’t I? In theory there is a lang=en argument in the search API, but it doesn’t help much, because it only uses the language setting users enter in their profile. Since English is the default, and hardly anyone changes it, almost all tweets are labelled as English. I seem to remember the streaming API having a lang argument as well, but it isn’t in the docs now. Either way, I gave up and found my own solution a long time ago. The good thing is that it doesn’t just work for English. It also does a remarkably good job on the dozen-plus languages I have tested it with, and it claims to handle many more. Best of all, it is free and open source.
The library I use is called Text_LanguageDetect, and it is available as a PEAR package, which makes installation very easy for PHP. You can download the code here, and get docs here. It requires PHP 5.3 and PEAR 1.9. You don’t have to download and install it manually; you can just use the pear install command:
pear install Text_LanguageDetect-0.3.0
Using the library only takes a few lines of code. It is a class, so you have to create an instance of the class, and then you can call its functions.
$oLang = new Text_LanguageDetect();
The simplest function you can call is detectSimple(), which returns the most likely language for the text it is passed. Here is a basic test script.
<?php
// language_detect1.php
require_once 'Text/LanguageDetect.php';

$oLang = new Text_LanguageDetect();
$text = 'La OMS advierte sobre el aumento de casos de hipertensión y diabetes en el mundo';
print "Text: $text<br/>";
print "Language: " . $oLang->detectSimple($text);
?>
Running this script through a browser shows that the language detection library correctly identified the text as Spanish.
Text: La OMS advierte sobre el aumento de casos de hipertensión y diabetes en el mundo
Language: spanish
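Since the original goal was to show clients only English tweets, the same call can be turned into a filter. What follows is a minimal sketch, not a definitive implementation. The function name filterByLanguage(), the file name, and the sample tweets are hypothetical. It assumes detectSimple() returns a lowercase language name such as 'english', or null when it cannot decide. The detector is passed in as a callable so the filter itself does not depend on the library, and the demo part only runs if the PEAR package can be found on the include path.

```php
<?php
// filter_tweets.php (hypothetical file name)

/**
 * Keep only the texts whose detected language equals $wanted.
 *
 * $detect is any callable that takes a string and returns a language
 * name, or null when no language matches.
 */
function filterByLanguage(array $tweets, $detect, $wanted)
{
    $kept = array();
    foreach ($tweets as $tweet) {
        // Strict comparison, so a null result never matches $wanted
        if (call_user_func($detect, $tweet) === $wanted) {
            $kept[] = $tweet;
        }
    }
    return $kept;
}

// Demo: only runs when Text_LanguageDetect is installed
if (stream_resolve_include_path('Text/LanguageDetect.php') !== false) {
    require_once 'Text/LanguageDetect.php';
    $oLang = new Text_LanguageDetect();

    // Made-up sample tweets for illustration
    $tweets = array(
        'The WHO warns about the rise in cases of hypertension and diabetes around the world',
        'La OMS advierte sobre el aumento de casos de hipertensión y diabetes en el mundo',
    );

    // Assumption: the library reports English as the lowercase string 'english'
    $english = filterByLanguage($tweets, array($oLang, 'detectSimple'), 'english');
    foreach ($english as $tweet) {
        print "English: $tweet<br/>";
    }
}
```

Passing the detector in as a callable also makes it easy to swap in a different detection method later without touching the filter.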
Tomorrow we’ll dig deeper into this library, and see how to handle tweets that are more borderline as to their language.