140dev » Twitter Language Detection

Language detection for tweets: Part 4

Adam Green — Thu, 24 May 2012 15:50:46 +0000

There are two problems you typically want to solve with language detection for tweets. First you need to analyse the types of languages you end up with for a specific set of keywords, and determine the minimum confidence level needed to get a clean result. Then when you have that data, you can process a tweet stream and pull out just the tweets that meet your goals. This simple language library will address both of these issues.

language_lib.php

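<?php
// Assumes $oLang is a Text_LanguageDetect instance, created as shown in Part 1

// Return the most likely language, its confidence level, and the word count
function language_info($text) {
	global $oLang;
	
	// Split out the key and value of the first array element
	$matches = $oLang->detect($text);
	$confidence = reset($matches);
	$language = key($matches);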
	
	// Convert the confidence level to a 2 digit integer for convenience
	$confidence = round($confidence*100,0);
	
	// Get the number of words in this string
	$string = preg_replace('/ +/', ' ', $text);
	$array = explode(" ", $string);
	$word_count = sizeof($array);
	
	return array( 'language' => $language,
		'confidence' => $confidence, 
		'word_count' => $word_count); 	
}

// Return 1 if the text meets your requirements, and 0 if not
function is_language($text, $target_language, $min_confidence, $min_words) {
 	global $oLang;
	
	// Get the number of words in this string
	$string = preg_replace('/ +/', ' ', $text);
	$array = explode(" ", $string);
	// Exit if there aren't enough words
	if (sizeof($array) < $min_words) {return 0;}
	
 	// Test all the possible languages returned by detect()
	foreach($oLang->detect($text) as $language => $confidence) {
		$confidence = round($confidence*100,0);
		
		// We have a good tweet
		if ((strtolower($language) == strtolower($target_language)) && 
			($confidence >= $min_confidence)) {
				
			return 1;
		}
	}

	// No acceptable languages were found
	return 0;	
}

?>

Let’s use the first library function, language_info(), to examine all the tweets in the sample database. In a real application, I would typically store the results in a database, so I could run some queries to find things like the average confidence level and number of words for tweets in different languages. Based on that data, I could build a quality control routine to pick out just the best tweets. For now you can test this idea by running the next script in a browser.

language_detect6.php

select($query);
while ($row=mysqli_fetch_assoc($result)) {

	// Print the detected language info
	$text = $row['tweet_text'];
	print "Text: $text
";
	print_r( language_info($text));
	print "

";
}
?>

I won’t bother including all the results from this script, but you can see from the first few tweets that we now have a way of extracting what we need for all the tweets in a stream.

Text: Neugründung von “Deutsche Diabetes-Hilfe – Menschen mit Diabetes” http://t.co/mRRAhvPh
Array ( [language] => german [confidence] => 26 [word_count] => 9 )

Text: RT @minihex: BMI denkt sich wiedermal:ein bissl rassistischer gehts noch-gesetzesentwurf sieht neue schikanen f asylwerberInnen vor htt ...
Array ( [language] => german [confidence] => 32 [word_count] => 18 )

Text: @myMONK_de naja, da wäre noch die üble Bronchitis, die ich seit über 2 Wochen habe, aber Magen-Darm ist wenigstens wieder okay endlich
Array ( [language] => german [confidence] => 32 [word_count] => 23 )

Text: COPD - eine Gefahr für die Lunge nicht nur bei Rauchern: http://t.co/8B2UnJng
Array ( [language] => german [confidence] => 41 [word_count] => 12 )
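
The per-language averages mentioned earlier could be worked out along these lines. This is only a sketch: language_stats() is a hypothetical helper, not part of language_lib.php, and it assumes each row carries the same three keys that language_info() returns. The sample rows are the four German results shown above.

```php
<?php
// Hypothetical helper: average confidence and word count per language.
// Each row is assumed to have the keys returned by language_info().
function language_stats(array $rows) {
	$totals = array();
	foreach ($rows as $row) {
		$lang = $row['language'];
		if (!isset($totals[$lang])) {
			$totals[$lang] = array('tweets' => 0, 'confidence' => 0, 'word_count' => 0);
		}
		$totals[$lang]['tweets']++;
		$totals[$lang]['confidence'] += $row['confidence'];
		$totals[$lang]['word_count'] += $row['word_count'];
	}

	// Turn the running totals into averages, rounded to one decimal place
	$stats = array();
	foreach ($totals as $lang => $t) {
		$stats[$lang] = array(
			'tweets' => $t['tweets'],
			'avg_confidence' => round($t['confidence'] / $t['tweets'], 1),
			'avg_word_count' => round($t['word_count'] / $t['tweets'], 1),
		);
	}
	return $stats;
}

// The four German results shown above
$rows = array(
	array('language' => 'german', 'confidence' => 26, 'word_count' => 9),
	array('language' => 'german', 'confidence' => 32, 'word_count' => 18),
	array('language' => 'german', 'confidence' => 32, 'word_count' => 23),
	array('language' => 'german', 'confidence' => 41, 'word_count' => 12),
);
print_r(language_stats($rows));
```

For these four tweets the average confidence comes out at 32.8 and the average word count at 15.5. With a larger sample, numbers like these give a starting point for choosing a minimum confidence level per language.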

The next example uses the library’s is_language() function to select just the tweets for a specific language. In this case, I’ve tested for English, but the function will work with any of the languages that the Text_LanguageDetect code returns. We saw yesterday that there is a large number of possible languages.

language_detect7.php

select($query);
while ($row=mysqli_fetch_assoc($result)) {

	// Print the detected language info
	$text = $row['tweet_text'];
	
	// Only display tweets in English 
	// with a confidence level of at least 30%,
	// and at least 5 words
	if (is_language($text,'english',30,5)) {
		print "Text: $text
";
		print_r( language_info($text));
		print "

";
	}
}
?>

If you run this script in your browser, you’ll see just the English tweets that have a confidence level of at least 30% and 5 words or more. I chose to reject tweets that didn’t meet the minimum word count, but another option would have been to set the minimum word count to 0, so that all English tweets that met the confidence level were displayed.

Text: RT @StephenAtHome: A study predicts nearly half of all Americans will be obese by 2030. But with a little American ingenuity I bet we ca ...
Array ( [language] => english [confidence] => 32 [word_count] => 26 )

Text: I was looking for some weight loss computer support in my area, but there's no low-cal IT in my locality.
Array ( [language] => english [confidence] => 35 [word_count] => 20 )
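
To experiment with different confidence thresholds without the library installed, the comparison inside is_language() can be tried on a plain array of scores. The sketch below is an assumption-laden stand-in: matches_language() is a hypothetical function, and the scores are the first three entries of the detect() output for the English sample in Part 3.

```php
<?php
// Hypothetical variant of the check in is_language() that takes the
// scores array directly instead of calling $oLang->detect().
function matches_language(array $scores, $target_language, $min_confidence) {
	foreach ($scores as $language => $confidence) {
		if (strtolower($language) == strtolower($target_language)) {
			// Same scaling as language_lib.php: 0.264... becomes 26
			return round($confidence * 100, 0) >= $min_confidence;
		}
	}
	// The target language was not in the list at all
	return false;
}

// First three entries of the detect() output for
// 'the latest episode of american idol sucks'
$scores = array(
	'english' => 0.26414634146341,
	'pidgin'  => 0.20056910569106,
	'spanish' => 0.17081300813008,
);

var_dump(matches_language($scores, 'english', 30)); // bool(false)
var_dump(matches_language($scores, 'english', 25)); // bool(true)
var_dump(matches_language($scores, 'pidgin', 20));  // bool(true)
```

The last call is worth noticing. Like is_language(), this check looks at every language in the list, not just the top one, so a low threshold can accept a tweet for a language that was only the runner-up.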

Language detection for tweets: Part 3

Adam Green — Wed, 23 May 2012 15:02:21 +0000

In yesterday’s installment we learned how to get the most likely language for a tweet with the detectSimple() function. We also discovered that this library sometimes fails when you get down to just 2 or 3 words. The Text_LanguageDetect library has a more advanced function, called detect(), that delivers an array of possible language matches and a numeric confidence level for each. The higher the confidence level, the more likely the language is a match.

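language_detect4.php

<?php
require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

$long_french = 'qui propose une école maternelle bilingue français';
print "Long French: $long_french
";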
print "Language: 
";
print_r($oLang->detect($long_french));

$long_english = 'the latest episode of american idol sucks';
print "

Long English: $long_english
";
print "Language: 
";
print_r($oLang->detect($long_english));

?>

If you run this script in a browser, you will see that there are many possible languages to choose from, sorted from the highest confidence level to the lowest.

Long French: qui propose une école maternelle bilingue français Language: Array ( [french] => 0.32340136054422 [romanian] => 0.25102040816327 [slovene] => 0.24061224489796 [danish] => 0.23877551020408 [latin] => 0.21857142857143 [italian] => 0.21761904761905 [english] => 0.21040816326531 [norwegian] => 0.20884353741497 [portuguese] => 0.20047619047619 [estonian] => 0.18700680272109 [spanish] => 0.18503401360544 [croatian] => 0.18428571428571 [pidgin] => 0.17292517006803 [slovak] => 0.16809523809524 [dutch] => 0.16224489795918 [czech] => 0.14707482993197 [german] => 0.14544217687075 [tagalog] => 0.14510204081633 [cebuano] => 0.11734693877551 [finnish] => 0.1147619047619 [swedish] => 0.11469387755102 [lithuanian] => 0.11333333333333 [latvian] => 0.10857142857143 [polish] => 0.1069387755102 [swahili] => 0.10551020408163 [turkish] => 0.094149659863946 [hawaiian] => 0.09204081632653 [indonesian] => 0.089727891156463 [albanian] => 0.080544217687075 [hausa] => 0.077142857142857 [azeri] => 0.067074829931973 [hungarian] => 0.052517006802721 [icelandic] => 0.052448979591837 [vietnamese] => 0.051768707482993 [welsh] => 0.051700680272109 [somali] => 0.037142857142857 [bengali] => 0 [mongolian] => 0 )

Long English: the latest episode of american idol sucks Language: Array ( [english] => 0.26414634146341 [pidgin] => 0.20056910569106 [spanish] => 0.17081300813008 [slovak] => 0.16130081300813 [estonian] => 0.15845528455285 [italian] => 0.15471544715447 [welsh] => 0.14829268292683 [latin] => 0.14739837398374 [danish] => 0.14585365853659 [romanian] => 0.14268292682927 [french] => 0.1409756097561 [norwegian] => 0.14048780487805 [dutch] => 0.12666666666667 [portuguese] => 0.12065040650406 [german] => 0.1130081300813 [indonesian] => 0.1079674796748 [slovene] => 0.090487804878049 [swahili] => 0.09 [latvian] => 0.086991869918699 [turkish] => 0.08 [azeri] => 0.079512195121951 [swedish] => 0.075447154471545 [albanian] => 0.07479674796748 [hungarian] => 0.074065040650407 [hawaiian] => 0.072926829268293 [finnish] => 0.07260162601626 [tagalog] => 0.072113821138211 [cebuano] => 0.060894308943089 [hausa] => 0.059105691056911 [croatian] => 0.057967479674797 [lithuanian] => 0.055528455284553 [somali] => 0.053170731707317 [polish] => 0.043170731707317 [czech] => 0.041219512195122 [vietnamese] => 0.040975609756098 [icelandic] => 0.034146341463415 [mongolian] => 0 [bengali] => 0 )
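
Since the full list is long, it can help to look at only the first few candidates. The following is a minimal sketch, assuming the array is already sorted best-first as in the output above; top_matches() is a hypothetical helper and not part of Text_LanguageDetect.

```php
<?php
// Hypothetical helper: keep the first $n entries of a best-first scores
// array and convert each score to a whole-number percentage.
function top_matches(array $scores, $n = 3) {
	// array_slice() keeps the string keys (the language names)
	$top = array_slice($scores, 0, $n, true);
	$percentages = array();
	foreach ($top as $language => $confidence) {
		$percentages[$language] = (int) round($confidence * 100, 0);
	}
	return $percentages;
}

// First four entries of the French detect() output shown above
$scores = array(
	'french'   => 0.32340136054422,
	'romanian' => 0.25102040816327,
	'slovene'  => 0.24061224489796,
	'danish'   => 0.23877551020408,
);
print_r(top_matches($scores));
```

For the French sample this leaves french at 32, romanian at 25 and slovene at 24, which makes it easier to see how close the runner-up languages are to the winner.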

Manipulating arrays is sometimes tricky, so here is an extension of this script that delivers the most likely language for a string, along with its confidence level and number of words.

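language_detect5.php

<?php
require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

$long_french = 'qui propose une école maternelle bilingue français';
print "Long French: $long_french
";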
language_info($long_french);

$long_english = 'the latest episode of american idol sucks';
print "

Long English: $long_english
";
language_info($long_english);

function language_info($text) {
	global $oLang;
	
	// Split out the key and value of the first array element
	// (each() was removed in PHP 8, so use reset() and key() instead)
	$matches = $oLang->detect($text);
	$confidence = reset($matches);
	$language = key($matches);
	
	// Convert the confidence level to a 2 digit integer for convenience
	$confidence = round($confidence*100,0);
	
	// Get the number of words in this string
	$string = preg_replace('/ +/', ' ', $text);
	$array = explode(" ", $string);
	$word_count = sizeof($array);
	
	print "Language: $language
";
	print "Confidence: $confidence%
"; 
	print "Words: $word_count
"; 	
}

?>

Long French: qui propose une école maternelle bilingue français
Language: french
Confidence: 33%
Words: 7

Long English: the latest episode of american idol sucks
Language: english
Confidence: 26%
Words: 7

We now have the basic tools to create a library of language functions that can be used when processing tweets from the Twitter API. Come back tomorrow and we’ll work out the details of such a library.

Language detection for tweets: Part 2

Adam Green — Tue, 22 May 2012 13:32:11 +0000

The docs for the Text_LanguageDetect library say that you need to pass it 4-5 sentences to get an accurate language identification, but as we saw in part 1 of this tutorial, even a single sentence seems to work. This is great, since we will need this to work with tweets that average 5-6 words. So how small a string will give you accurate results? It varies with each language, but from my tests you need at least 3-4 words in most languages.

This sample script demonstrates the problem.

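language_detect2.php

<?php
require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

$long_french = 'qui propose une école maternelle bilingue français';
print "Long French: $long_french
";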
print "Language: " . $oLang->detectSimple($long_french) . "
";

$short_french = 'école maternelle';
print "
Short French: $short_french
";
print "Language: " . $oLang->detectSimple($short_french) . "
";

$long_english = 'the latest episode of american idol sucks';
print "
Long English: $long_english
";
print "Language: " . $oLang->detectSimple($long_english) . "
";

$short_english = 'american idol';
print "
Short English: $short_english
";
print "Language: " . $oLang->detectSimple($short_english) . "
";

?>

Running this example in a browser shows that with just 2 words, the language returned by the library can’t be trusted.

Long French: qui propose une école maternelle bilingue français
Language: french

Short French: école maternelle
Language: danish

Long English: the latest episode of american idol sucks
Language: english

Short English: american idol
Language: welsh
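
One way to avoid trusting results like these is to check the word count before calling the library at all. The sketch below uses two hypothetical helpers, count_words() and long_enough(). The default minimum of 4 words is an assumption taken from the 3-4 word estimate above, and it should be tuned for each language.

```php
<?php
// Hypothetical helper: count words separated by runs of whitespace
function count_words($text) {
	$words = preg_split('/[ \t\r\n]+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
	return count($words);
}

// Hypothetical guard: only trust detection for strings with enough words.
// The default of 4 is an assumption, not a value from the library.
function long_enough($text, $min_words = 4) {
	return count_words($text) >= $min_words;
}

var_dump(long_enough('école maternelle'));  // bool(false)
var_dump(long_enough('qui propose une école maternelle bilingue français'));  // bool(true)
```

A script could call long_enough() first and skip detectSimple() for anything that fails the check, rather than accepting an answer such as Danish for a two-word French string.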

I’ve found that the best way to test the accuracy of this language detection method is to process a sample set of tweets with it, and examine the results for different languages. The next script will do this with a list of 16 tweets I pulled out of a database I built for a firm that consults for drug companies in Europe. They need to collect tweets about different diseases, and separate the results by language. The sample table we’ll process here has 4 tweets in each of 4 languages. The code uses my standard db_lib.php database library to read the tweets from the database.

language_detect3.php

select($query);
while ($row=mysqli_fetch_assoc($result)) {

	// Print the detected language	
	$text = $row['tweet_text'];
	print "Text: $text
";
	print "Language: " . $oLang->detectSimple($text) . "

";
}
?>

Running this script in a browser shows that with a reasonably long tweet, the language identification is really good, especially for a free library.

Text: Neugründung von “Deutsche Diabetes-Hilfe – Menschen mit Diabetes” http://t.co/mRRAhvPh
Language: german

Text: RT @minihex: BMI denkt sich wiedermal:ein bissl rassistischer gehts noch-gesetzesentwurf sieht neue schikanen f asylwerberInnen vor htt ...
Language: german

Text: @myMONK_de naja, da wäre noch die üble Bronchitis, die ich seit über 2 Wochen habe, aber Magen-Darm ist wenigstens wieder okay endlich
Language: german

Text: COPD - eine Gefahr für die Lunge nicht nur bei Rauchern: http://t.co/8B2UnJng
Language: german

Text: Mierda de profesor que no supo explicar nada de las columnas de dominancia ocular y ahora no entiendo nada
Language: spanish

Text: #diabetesla Hace poco puse el enlace a 1foro de diabetes. Se comenta que insulina Lantus provoca depresión.¿Algo de cierto? 10 días con ella
Language: spanish

Text: Queridos padres, tengo casi 16 años, creerme ya he aprendido a vivir con la diabetes, me acompaña desde los 3 años, así que por favor +
Language: spanish

Text: La OMS advierte sobre el aumento de casos de hipertensión y diabetes en el mundo - http://t.co/074JHwBh http://t.co/PKUyPLuG
Language: spanish

Text: Les problemes ou sa fait maigrir ou sa fait grossir. Personnellemnt je suis devenue obese. Fais chié!
Language: french

Text: @Mangeunepomme C'est ce que je compte faire
Language: french

Text: Genre c'est une grosse limite obese et elle fait la meuf genre c'est une salope
Language: french

Text: @GlodieGabrielle hehehehehe. A kelke kilos detr obese, u va mettre ta tente dans une salle de gym de la place
Language: french

Text: Oh shoot looks like I've got hay fever... This is bad :/
Language: english

Text: RT @StephenAtHome: A study predicts nearly half of all Americans will be obese by 2030. But with a little American ingenuity I bet we ca ...
Language: english

Text: Vital Signs: Options for weight loss: In addition to dietitians, counselors and life coaches who can walk you th... http://t.co/hBam9XiF
Language: english

Text: I was looking for some weight loss computer support in my area, but there's no low-cal IT in my locality.
Language: english

You can now see how easy it is to get the language for a series of tweets. We’ll dig deeper tomorrow and learn how to use this library’s confidence level results. That will let you select only the tweets that have a high chance of being in the language you need. Then later in the week I’ll create a standard language detection function that you can call whenever you need to process tweets.

Language detection for tweets: Part 1

Adam Green — Tue, 22 May 2012 01:48:06 +0000

One thing I learned early on in building tweet aggregation sites for clients is that they expect to only see tweets in English. After all, Google can do it, why can’t I? In theory there is a lang=en argument in the search API, but it doesn’t help much, because it only uses the language setting entered by users in their profile. Since English is the default, and hardly anyone changes it, almost all tweets are labelled as English. I seem to remember the streaming API having a lang argument also, but it isn’t in the docs now. Either way, I gave up and found my own solution a long time ago. The good thing is that it doesn’t just work for English. It also does a remarkably good job for over a dozen languages I have tested it for, and claims to do a lot more. Best of all, it is free and open source.

The library I use is called Text_LanguageDetect, and it is available as a PEAR package, which makes installation very easy for PHP. You can download the code here, and get docs here. It requires PHP 5.3 and PEAR 1.9. You don’t have to download and install it manually. You can just use the PEAR install command:
pear install Text_LanguageDetect-0.3.0

Using the library only takes a few lines of code. It is a class, so you have to create an instance of the class, and then you can call its functions.
require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

The simplest function you can call is detectSimple(), which returns the most likely language for the text it is passed. Here is a basic test script.

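language_detect1.php

<?php
require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

$text = 'La OMS advierte sobre el aumento de casos de hipertensión y diabetes en el mundo';
print "Text: $text
";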
print "Language: " . $oLang->detectSimple($text);

?>

Running this script through a browser shows that the language detection library correctly identified the text as Spanish.

Text: La OMS advierte sobre el aumento de casos de hipertensión y diabetes en el mundo
Language: spanish

Tomorrow we’ll dig deeper into this library, and see how to handle tweets that are more borderline as to their language.