There are two problems you typically want to solve with language detection for tweets. First, you need to analyse which languages turn up for a specific set of keywords, and determine the minimum confidence level needed to get a clean result. Then, once you have that data, you can process a […]
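As a rough sketch of the "minimum confidence level" idea, here is one way the cutoff could work. It is written in Python rather than PHP, and it assumes the detector hands back a mapping from language name to a score between 0 and 1. The function name `pick_language`, the sample scores, and the 0.4 cutoff are all made up for illustration and are not part of Text_LanguageDetect.

```python
from typing import Dict, Optional


def pick_language(scores: Dict[str, float], min_confidence: float) -> Optional[str]:
    """Return the best-scoring language, or None if nothing clears the cutoff.

    `scores` is a hypothetical language -> score mapping; the real
    library's output format may differ.
    """
    if not scores:
        # Nothing detected at all (for example, an empty tweet).
        return None
    best = max(scores, key=lambda lang: scores[lang])
    # Only trust the top match when its score reaches the threshold.
    return best if scores[best] >= min_confidence else None


# Illustrative scores only, not real detector output.
sample = {"english": 0.42, "dutch": 0.31, "german": 0.18}
accepted = pick_language(sample, 0.4)   # top score clears the cutoff
rejected = pick_language(sample, 0.5)   # top score is too low
```

Raising the cutoff throws away more tweets but leaves fewer wrong matches, which is the trade-off you tune per keyword set.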
Twitter Language Detection
In yesterday’s installment we learned how to get the most likely language for a tweet with the detectSimple() function. We also discovered that this library sometimes fails when you get down to just 2 or 3 words. The Text_LanguageDetect library has a more advanced function, called detect(), that delivers an array of possible language matches […]
The docs for the Text_LanguageDetect library say that you need to pass it 4-5 sentences to get an accurate language identification, but as we saw in part 1 of this tutorial, even a single sentence seems to work. This is great, since we will need this to work with tweets that average 5-6 words. So […]
One thing I learned early on in building tweet aggregation sites for clients is that they expect to see only tweets in English. After all, if Google can do it, why can’t I? In theory there is a lang=en argument in the search API, but it doesn’t help much, because it only uses the language setting […]