There are two problems you typically want to solve with language detection for tweets. First, you need to analyse which languages turn up for a specific set of keywords, and determine the minimum confidence level needed to get a clean result. Once you have that data, you can process a tweet stream and pull out just the tweets that meet your goals. This simple language library addresses both problems.
language_lib.php
<?php
// language_lib.php
require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

// Return an array with the language data for any text
function language_info($text) {
  global $oLang;

  // Split out the key and value of the first array element.
  // each() was removed in PHP 8, so use reset() and key() instead.
  $results = $oLang->detect($text);
  $confidence = reset($results);
  $language = key($results);

  // Convert the confidence level to a 2 digit integer for convenience
  $confidence = round($confidence*100,0);

  // Get the number of words in this string.
  // eregi_replace() was removed in PHP 7, so use preg_replace() instead.
  $string = preg_replace('/ +/', ' ', $text);
  $array = explode(" ", $string);
  $word_count = sizeof($array);

  return array(
    'language' => $language,
    'confidence' => $confidence,
    'word_count' => $word_count);
}

// Return 1 if the text meets your requirements, and 0 if not
function is_language($text, $target_language, $min_confidence, $min_words) {
  global $oLang;

  // Get the number of words in this string
  $string = preg_replace('/ +/', ' ', $text);
  $array = explode(" ", $string);

  // Exit if there aren't enough words
  if (sizeof($array) < $min_words) {return 0;}

  // Test all the possible languages returned by detect()
  foreach($oLang->detect($text) as $language => $confidence) {
    $confidence = round($confidence*100,0);

    // We have a good tweet
    if ((strtolower($language) == strtolower($target_language))
      && ($confidence >= $min_confidence)) {
      return 1;
    }
  }

  // No acceptable languages were found
  return 0;
}
?>
Let’s use the first library function, language_info(), to examine all the tweets in the sample database. In a real application, I would typically store the results in a database, so I could run some queries to find things like the average confidence level and number of words for tweets in different languages. Based on that data, I could build a quality control routine to pick out just the best tweets. For now you can test this idea by running the next script in a browser.
<?php
// language_detect6.php
require_once 'language_lib.php';

// Connect to the database with the sample tweet table
require_once('db_lib.php');
$oDB = new db;

// Loop through the sample tweets
$query = "SELECT tweet_text FROM language";
$result = $oDB->select($query);
while ($row=mysqli_fetch_assoc($result)) {

  // Print the detected language info
  $text = $row['tweet_text'];
  print "Text: $text<br/>";
  print_r( language_info($text));
  print "<br/><br/>";
}
?>
I won’t bother including all the results from this script, but you can see from the first few tweets that we now have a way of extracting what we need for all the tweets in a stream.
Text: Neugründung von “Deutsche Diabetes-Hilfe – Menschen mit Diabetes” http://t.co/mRRAhvPh
Array ( [language] => german [confidence] => 26 [word_count] => 9 )
Text: RT @minihex: BMI denkt sich wiedermal:ein bissl rassistischer gehts noch-gesetzesentwurf sieht neue schikanen f asylwerberInnen vor htt ...
Array ( [language] => german [confidence] => 32 [word_count] => 18 )
Text: @myMONK_de naja, da wäre noch die üble Bronchitis, die ich seit über 2 Wochen habe, aber Magen-Darm ist wenigstens wieder okay endlich
Array ( [language] => german [confidence] => 32 [word_count] => 23 )
Text: COPD - eine Gefahr für die Lunge nicht nur bei Rauchern: http://t.co/8B2UnJng
Array ( [language] => german [confidence] => 41 [word_count] => 12 )
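The analysis mentioned earlier, finding the average confidence level and word count for each language, would normally be a database query. As a rough sketch of the same idea in plain PHP, the hypothetical helper below (language_averages() is my own name and not part of language_lib.php) takes an array of results shaped like the output of language_info() and averages them per language. The sample rows are the four results printed above.

```php
<?php
// Hypothetical helper, not part of language_lib.php.
// Summarise language_info() style results for each language.
function language_averages(array $rows) {
  // Add up the tweet count, confidence and word count per language
  $totals = array();
  foreach ($rows as $row) {
    $language = $row['language'];
    if (!isset($totals[$language])) {
      $totals[$language] = array(
        'tweets' => 0, 'confidence' => 0, 'word_count' => 0);
    }
    $totals[$language]['tweets']++;
    $totals[$language]['confidence'] += $row['confidence'];
    $totals[$language]['word_count'] += $row['word_count'];
  }

  // Turn the totals into averages
  $averages = array();
  foreach ($totals as $language => $total) {
    $averages[$language] = array(
      'tweets' => $total['tweets'],
      'avg_confidence' => $total['confidence'] / $total['tweets'],
      'avg_word_count' => $total['word_count'] / $total['tweets']);
  }
  return $averages;
}

// The four sample results printed above
$rows = array(
  array('language' => 'german', 'confidence' => 26, 'word_count' => 9),
  array('language' => 'german', 'confidence' => 32, 'word_count' => 18),
  array('language' => 'german', 'confidence' => 32, 'word_count' => 23),
  array('language' => 'german', 'confidence' => 41, 'word_count' => 12));

print_r(language_averages($rows));
```

In a real application you would fill $rows from language_info() inside the loop over the tweet table, or skip this helper and let the database do the averaging.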
The next example uses the library’s is_language() function to select just the tweets for a specific language. In this case, I’ve tested for English, but the function will work with any of the languages that the Text_LanguageDetect code returns. We saw yesterday that there is a large number of possible languages.
<?php
// language_detect7.php
require_once 'language_lib.php';

// Connect to the database with the sample tweet table
require_once('db_lib.php');
$oDB = new db;

// Loop through the sample tweets
$query = "SELECT tweet_text FROM language";
$result = $oDB->select($query);
while ($row=mysqli_fetch_assoc($result)) {

  // Print the detected language info
  $text = $row['tweet_text'];

  // Only display tweets in English
  // with a confidence level of at least 30%,
  // and at least 5 words
  if (is_language($text,'english',30,5)) {
    print "Text: $text<br/>";
    print_r( language_info($text));
    print "<br/><br/>";
  }
}
?>
If you run this script in your browser, you’ll see just the English tweets that have a confidence level of at least 30% and 5 words or more. I chose to reject tweets that didn’t meet the minimum word count, but another option would be to set the minimum word count to 0, so that every English tweet that meets the confidence level is displayed.
Text: RT @StephenAtHome: A study predicts nearly half of all Americans will be obese by 2030. But with a little American ingenuity I bet we ca ...
Array ( [language] => english [confidence] => 32 [word_count] => 26 )
Text: I was looking for some weight loss computer support in my area, but there's no low-cal IT in my locality.
Array ( [language] => english [confidence] => 35 [word_count] => 20 )
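One way to choose the minimum confidence level is to count how many tweets would pass at several settings before you commit to one. The sketch below does that with a hypothetical helper, count_passing(), which is my own name and not part of language_lib.php. It works on results shaped like the output of language_info(), using the six sample results shown in this post. Note that it is a simplification: it only looks at the top language for each tweet, while is_language() tests every language returned by detect().

```php
<?php
// Hypothetical helper, not part of language_lib.php.
// Count the rows whose top language matches $target and that meet
// both minimums. Rows are shaped like language_info() output.
function count_passing(array $rows, $target, $min_confidence, $min_words) {
  $count = 0;
  foreach ($rows as $row) {
    if (strtolower($row['language']) == strtolower($target)
      && $row['confidence'] >= $min_confidence
      && $row['word_count'] >= $min_words) {
      $count++;
    }
  }
  return $count;
}

// The six sample results printed in this post
$rows = array(
  array('language' => 'german', 'confidence' => 26, 'word_count' => 9),
  array('language' => 'german', 'confidence' => 32, 'word_count' => 18),
  array('language' => 'german', 'confidence' => 32, 'word_count' => 23),
  array('language' => 'german', 'confidence' => 41, 'word_count' => 12),
  array('language' => 'english', 'confidence' => 32, 'word_count' => 26),
  array('language' => 'english', 'confidence' => 35, 'word_count' => 20));

// Try a range of confidence levels with a 5 word minimum
foreach (array(25, 30, 35, 40) as $min_confidence) {
  $english = count_passing($rows, 'english', $min_confidence, 5);
  $german = count_passing($rows, 'german', $min_confidence, 5);
  print "min confidence $min_confidence: $english english, $german german\n";
}
```

With only six rows the counts don't tell you much, but run against a full table of results they show how quickly tweets drop out as you raise the threshold.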