Language detection for tweets: Part 4

by Adam Green on May 24, 2012

There are two problems you typically want to solve with language detection for tweets. First, you need to analyse which languages turn up for a specific set of keywords, and determine the minimum confidence level needed to get a clean result. Once you have that data, you can process a tweet stream and pull out just the tweets that meet your goals. This simple language library addresses both problems.

language_lib.php

<?php
// language_lib.php

require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

// Return an array with the language data for any text
function language_info($text) {
 	global $oLang;
	
 	// Split out the key and value of the first array element
	$results = $oLang->detect($text);
	$confidence = reset($results);
	$language = key($results);
	
	// Convert the confidence level to a 2 digit integer for convenience
	$confidence = round($confidence*100,0);
	
	// Get the number of words in this string
	$string = preg_replace('/ +/', ' ', $text);
	$array = explode(" ", $string);
	$word_count = sizeof($array);
	
	return array( 'language' => $language,
		'confidence' => $confidence, 
		'word_count' => $word_count); 	
}

// Return 1 if the text meets your requirements, and 0 if not
function is_language($text, $target_language, $min_confidence, $min_words) {
 	global $oLang;
	
	// Get the number of words in this string
	$string = preg_replace('/ +/', ' ', $text);
	$array = explode(" ", $string);
	// Exit if there aren't enough words
	if (sizeof($array) < $min_words) {return 0;}
	
 	// Test all the possible languages returned by detect()
	foreach($oLang->detect($text) as $language => $confidence) {
		$confidence = round($confidence*100,0);
		
		// We have a good tweet
		if ((strtolower($language) == strtolower($target_language)) && 
			($confidence >= $min_confidence)) {
				
			return 1;
		}
	}

	// No acceptable languages were found
	return 0;	
}

?>
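
Both library functions repeat the same word-counting lines. As a minimal sketch, that step could be pulled into a helper of its own. The name count_words() is hypothetical and not part of the library. It uses preg_replace(), which is available in current PHP versions, and it adds two small assumptions of my own: leading and trailing spaces are trimmed, and an empty string counts as 0 words.

```php
<?php
// Hypothetical helper, not part of language_lib.php
function count_words($text) {
	// Collapse runs of spaces into one, then trim the ends (an added assumption)
	$string = trim(preg_replace('/ +/', ' ', $text));

	// explode() on an empty string returns one empty element, so handle it here
	if ($string === '') {
		return 0;
	}

	return count(explode(' ', $string));
}

// Prints 12 for this sample tweet
print count_words('COPD - eine Gefahr für die Lunge nicht nur bei Rauchern: http://t.co/8B2UnJng');
```

Anything separated by spaces counts as a word here, including URLs, @mentions, and stray punctuation, so treat the count as a rough measure of tweet length.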

Let’s use the first library function, language_info(), to examine all the tweets in the sample database. In a real application, I would typically store the results in a database, so I could run some queries to find things like the average confidence level and number of words for tweets in different languages. Based on that data, I could build a quality control routine to pick out just the best tweets. For now you can test this idea by running the next script in a browser.

language_detect6.php

<?php
// language_detect6.php

require_once 'language_lib.php';

// Connect to the database with the sample tweet table
require_once('db_lib.php');
$oDB = new db;

// Loop through the sample tweets
$query = "SELECT tweet_text FROM language";
$result = $oDB->select($query);
while ($row=mysqli_fetch_assoc($result)) {

	// Print the detected language info
	$text = $row['tweet_text'];
	print "Text: $text<br/>";
	print_r( language_info($text));
	print "<br/><br/>";
}
?>

I won’t bother including all the results from this script, but you can see from the first few tweets that we now have a way of extracting what we need for all the tweets in a stream.

Text: Neugründung von “Deutsche Diabetes-Hilfe – Menschen mit Diabetes” http://t.co/mRRAhvPh
Array ( [language] => german [confidence] => 26 [word_count] => 9 )

Text: RT @minihex: BMI denkt sich wiedermal:ein bissl rassistischer gehts noch-gesetzesentwurf sieht neue schikanen f asylwerberInnen vor htt ...
Array ( [language] => german [confidence] => 32 [word_count] => 18 )

Text: @myMONK_de naja, da wäre noch die üble Bronchitis, die ich seit über 2 Wochen habe, aber Magen-Darm ist wenigstens wieder okay endlich
Array ( [language] => german [confidence] => 32 [word_count] => 23 )

Text: COPD - eine Gefahr für die Lunge nicht nur bei Rauchern: http://t.co/8B2UnJng
Array ( [language] => german [confidence] => 41 [word_count] => 12 )
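
The averages mentioned earlier could also be computed in PHP rather than with database queries. This is only a sketch: summarize_by_language() is a hypothetical helper, and the input rows are the four results shown above, hard-coded so the example runs without the database or the detection library.

```php
<?php
// Hypothetical helper: average confidence and word count per language
function summarize_by_language(array $rows) {
	$totals = array();
	foreach ($rows as $row) {
		$lang = $row['language'];
		if (!isset($totals[$lang])) {
			$totals[$lang] = array('tweets' => 0, 'confidence' => 0, 'word_count' => 0);
		}
		$totals[$lang]['tweets']++;
		$totals[$lang]['confidence'] += $row['confidence'];
		$totals[$lang]['word_count'] += $row['word_count'];
	}

	// Turn the running totals into averages
	$summary = array();
	foreach ($totals as $lang => $t) {
		$summary[$lang] = array(
			'tweets' => $t['tweets'],
			'avg_confidence' => round($t['confidence'] / $t['tweets'], 1),
			'avg_word_count' => round($t['word_count'] / $t['tweets'], 1),
		);
	}
	return $summary;
}

// The four language_info() results shown above
$rows = array(
	array('language' => 'german', 'confidence' => 26, 'word_count' => 9),
	array('language' => 'german', 'confidence' => 32, 'word_count' => 18),
	array('language' => 'german', 'confidence' => 32, 'word_count' => 23),
	array('language' => 'german', 'confidence' => 41, 'word_count' => 12),
);

$summary = summarize_by_language($rows);
print_r($summary);
```

In a real run you would build $rows by calling language_info() on each tweet in the loop, instead of hard-coding them.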

The next example uses the library’s is_language() function to select just the tweets for a specific language. In this case, I’ve tested for English, but the function will work with any of the languages that the Text_LanguageDetect code returns. We saw yesterday that there is a large number of possible languages.

language_detect7.php

<?php
// language_detect7.php

require_once 'language_lib.php';

// Connect to the database with the sample tweet table
require_once('db_lib.php');
$oDB = new db;

// Loop through the sample tweets
$query = "SELECT tweet_text FROM language";
$result = $oDB->select($query);
while ($row=mysqli_fetch_assoc($result)) {

	// Get the text of the next tweet
	$text = $row['tweet_text'];
	
	// Only display tweets in English 
	// with a confidence level of at least 30%,
	// and at least 5 words
	if (is_language($text,'english',30,5)) {
		print "Text: $text<br/>";
		print_r( language_info($text));
		print "<br/><br/>";
	}
}
?>

If you run this script in your browser, you’ll see just the English tweets that have a confidence level of at least 30% and 5 words or more. I chose to reject tweets that didn’t meet the minimum word count, but another option would have been to set the minimum word count to 0, so that all English tweets that met the confidence level were displayed.

Text: RT @StephenAtHome: A study predicts nearly half of all Americans will be obese by 2030. But with a little American ingenuity I bet we ca ...
Array ( [language] => english [confidence] => 32 [word_count] => 26 )

Text: I was looking for some weight loss computer support in my area, but there's no low-cal IT in my locality.
Array ( [language] => english [confidence] => 35 [word_count] => 20 )
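
To choose a minimum confidence level, it can help to see how many tweets survive at several thresholds. The sketch below assumes you have already collected language_info() results. The count_passing() function is hypothetical, and the rows are the six results shown in this post. Note that it only looks at the top language for each tweet, whereas is_language() tests every language that detect() returns.

```php
<?php
// Hypothetical helper: count stored results that meet the requirements
function count_passing(array $rows, $target_language, $min_confidence, $min_words) {
	$count = 0;
	foreach ($rows as $row) {
		if (strtolower($row['language']) == strtolower($target_language)
			&& $row['confidence'] >= $min_confidence
			&& $row['word_count'] >= $min_words) {
			$count++;
		}
	}
	return $count;
}

// The language_info() results shown in this post
$rows = array(
	array('language' => 'german', 'confidence' => 26, 'word_count' => 9),
	array('language' => 'german', 'confidence' => 32, 'word_count' => 18),
	array('language' => 'german', 'confidence' => 32, 'word_count' => 23),
	array('language' => 'german', 'confidence' => 41, 'word_count' => 12),
	array('language' => 'english', 'confidence' => 32, 'word_count' => 26),
	array('language' => 'english', 'confidence' => 35, 'word_count' => 20),
);

// Try several confidence levels with a 5 word minimum
foreach (array(25, 30, 35, 40) as $min_confidence) {
	$passed = count_passing($rows, 'english', $min_confidence, 5);
	print "Confidence >= $min_confidence: $passed tweets<br/>";
}
```

With a larger sample, the point where the count drops sharply is a reasonable place to start looking for your cutoff.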
