<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>140dev &#187; Streaming API</title>
	<atom:link href="http://140dev.com/twitter-api-programming-blog/category/streaming-api/feed/" rel="self" type="application/rss+xml" />
	<link>http://140dev.com</link>
	<description>Twitter API Programming Tips, Tutorials, Source Code Libraries and Consulting</description>
	<lastBuildDate>Wed, 31 Jul 2019 10:03:15 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.6</generator>
		<item>
		<title>Streaming API: Multi-level tweet collection databases</title>
		<link>http://140dev.com/twitter-api-programming-blog/streaming-api-multi-level-tweet-collection-databases/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/streaming-api-multi-level-tweet-collection-databases/#comments</comments>
		<pubDate>Fri, 14 Feb 2014 01:13:40 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[Streaming API]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=2927</guid>
		<description><![CDATA[Yesterday&#8217;s streaming API post described a multiple server model for handling high rate tweet collection. Today I&#8217;d like to cover a different architecture that addresses this problem with a single server running multiple databases. Let&#8217;s say you want to display tweets for the most active stocks each day. The streaming API lets you collect tweets [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Yesterday&#8217;s <a href="http://140dev.com/twitter-api-programming-blog/streaming-api-multiple-server-collection-architecture/">streaming API post</a> described a multiple server model for handling high rate tweet collection. Today I&#8217;d like to cover a different architecture that addresses this problem with a single server running multiple databases. </p>
<p>Let&#8217;s say you want to display tweets for the most active stocks each day. The streaming API lets you collect tweets for 400 keywords, or in this case, the 400 most active stock symbols. That will be a high flow rate, and a large database to query if your site only needs to display tweets for 20 or 30 stocks at any one time. </p>
<p>A solution is to store all the tweets, users, and related data you receive for all 400 stocks in one database; we&#8217;ll call it tweet_collect. You can then create a separate database, which we&#8217;ll call tweet_serve, and have your code copy just the tweets for active stocks to it as they arrive. Your website only needs to read from tweet_serve, which will be much smaller and therefore deliver query results faster. </p>
<p>When a new stock becomes active, you will already have its tweets available in tweet_collect, so you can quickly copy its tweets to tweet_serve and be ready to display on the site. When the stock is no longer active, you can delete its data from tweet_serve. </p>
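<p>As a rough sketch of that copy step, assuming both databases live on the same MySQL server and use the framework schema (including the tweets and tweet_words tables from the recent keyword enhancements), a cross-database INSERT ... SELECT can move one stock&#8217;s tweets:</p>
<pre>-- Copy a newly active stock's tweets into the serving database
-- ('$aapl' is a placeholder for the stock symbol used as a collection word)
INSERT IGNORE INTO tweet_serve.tweets
SELECT t.*
FROM tweet_collect.tweets t
JOIN tweet_collect.tweet_words w ON w.tweet_id = t.tweet_id
WHERE w.words = '$aapl';</pre>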
<p>The limitation of this technique is that you are restricted to topics that can be covered adequately within the limit of 400 keywords. As long as this fits your application&#8217;s needs, this model will produce a much faster website display. </p>
<p>When a keyword becomes active that isn&#8217;t in your normal collection list, you can fill in the data for this with the search API as needed. Search isn&#8217;t as powerful as streaming for large amounts of data, but if you need ad hoc collection of tweets for a few extra keywords, it does a good job. You can query it up to 720 times an hour and request tweets for about 10 to 15 keywords each time. These tweets would also go into the tweet_serve database. </p>
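<p>As a quick sketch of how such a request could be formed, the v1.1 search endpoint accepts an OR&#8217;d query string. The symbols below are placeholders, and the OAuth signing of the actual request is omitted:</p>
<pre>&lt;?php
// Build an ad hoc search API request for a few extra stock symbols
// (the symbols here are placeholders)
$symbols = array('$AAPL', '$GOOG', '$MSFT');
$q = urlencode(implode(' OR ', $symbols));
print 'https://api.twitter.com/1.1/search/tweets.json?q=' . $q . '&#038;count=100';
?&gt;</pre>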
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/streaming-api-multi-level-tweet-collection-databases/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Streaming API: Multiple server collection architecture</title>
		<link>http://140dev.com/twitter-api-programming-blog/streaming-api-multiple-server-collection-architecture/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/streaming-api-multiple-server-collection-architecture/#comments</comments>
		<pubDate>Wed, 12 Feb 2014 21:33:58 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[Server configuration]]></category>
		<category><![CDATA[Streaming API]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=2912</guid>
		<description><![CDATA[Now that I&#8217;ve upgraded the streaming API framework to make it easier to manage keyword tweet collection, the next step is handling the increased data flow that results from more keywords. One simple solution is to upgrade your server. MySQL loves as much RAM as it can be given, and switching to a solid state [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Now that I&#8217;ve <a href="http://140dev.com/twitter-api-programming-blog/streaming-api-keyword-collection-enhancements-part-1/">upgraded</a> the <a href="http://140dev.com/free-twitter-api-source-code-library/">streaming API framework</a> to make it easier to manage keyword tweet collection, the next step is handling the increased data flow that results from more keywords. One simple solution is to upgrade your server. MySQL loves as much RAM as it can be given, and switching to a solid state drive is another fix that I highly recommend. But building one monstrous server may not be the most cost effective solution, especially if you are operating &#8220;in the cloud&#8221;. Cloud servers get really expensive when you try to load up lots of RAM. </p>
<p>An alternative worth considering is to distribute your tweet collection across more than one server, none of which needs to be especially powerful. The result is often more bang for the buck. I&#8217;m going to cover some of the multiple server architectures I&#8217;ve built for various projects over the past few years. </p>
<p>One solution is to dedicate one server to tweet collection, and another to data mining and data processing. I tend to call the first one the collection server, and the second the db server. In terms of my streaming API code, I would put a database with just the json_cache table on the collection server. The only code running on this machine would be get_tweets.php, which writes new tweets to its copy of json_cache. The db server would have the complete database schema, including its own copy of json_cache. It would run parse_tweets.php and any other database code you need, such as queries for a web interface to display the tweets.</p>
<p>The goal is to only give the db server as many new tweets as it can handle while maintaining good parsing and query performance. This can be done by a script that copies new tweets from json_cache on the collection server to json_cache on the db server, then deletes these tweets from the collection server. The db server would parse the new tweets it finds in its copy of json_cache, just the way it normally does. The nice thing is that other than the code to transfer tweets between servers, none of the other code changes. </p>
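<p>Here is a minimal sketch of such a transfer script. The host addresses, credentials, script name, and batch size are all placeholders, and both databases are assumed to use the framework&#8217;s json_cache schema:</p>
<pre>&lt;?php
// transfer_tweets.php (hypothetical): move a batch of cached tweets
// from the collection server to the db server, then delete them
$collect = new mysqli('10.0.0.1', 'user', 'pass', 'tweet_collect');
$db = new mysqli('10.0.0.2', 'user', 'pass', 'tweet_db');

// Batch size can be passed on the command line to throttle the flow
$batch = isset($argv[1]) ? (int)$argv[1] : 500;

$result = $collect->query("SELECT cache_id, tweet_id, raw_tweet
    FROM json_cache ORDER BY cache_id LIMIT $batch");
while ($row = $result->fetch_assoc()) {
  $raw = $db->real_escape_string($row['raw_tweet']);
  $db->query("INSERT INTO json_cache (tweet_id, raw_tweet)
      VALUES ({$row['tweet_id']}, '$raw')");
  $collect->query('DELETE FROM json_cache WHERE cache_id = ' . $row['cache_id']);
}
?&gt;</pre>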
<p>In effect the collection server is now a buffer, holding new tweets as they arrive from the streaming API and protecting the db server from being crushed by too high a flow or a sudden burst. The tweet transfer rate from collection server to db server can be managed by a timetable that transfers more tweets at night, when the db server is unlikely to be handling user requests. During the day the number of tweets stored on the collection server would rise if the flow were too fast to parse; at night the higher transfer rate would draw down the buffer. </p>
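<p>One way to implement such a timetable, assuming a transfer script like the sketch above that takes a batch size argument, is a pair of cron entries (the times and batch sizes are only illustrative):</p>
<pre># Small batches during the day, a larger draw-down overnight
*/5 8-23 * * * php /path/to/transfer_tweets.php 200
*/5 0-7 * * * php /path/to/transfer_tweets.php 2000</pre>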
<p>For maximum performance and minimum cost, you have to make sure the two servers can communicate through the webhost&#8217;s internal network. You don&#8217;t want to pay for bandwidth costs to move this data across the public internet, which would also be a lot slower. </p>
<p>The benefit of this model is that as long as you only transfer new tweets to the db server at a rate it can handle, you are guaranteed an acceptable level of performance. A sudden trending topic or other increase in flow would impact the collection server, but have no effect on the db server. You don&#8217;t have to build up the db server&#8217;s hardware to handle the largest possible burst. That can save money, even with the addition of the collection server. The collection server can be kept small, since all it does is grab tweets from the API and insert them into json_cache. </p>
<p>The obvious downside of this architecture is that there is a lag between the time tweets arrive from the API and when they are available for queries on the db server. This is fine for an application that does long-term analysis, but it may not be acceptable for a site that needs to display new tweets in real time. </p>
<p>I&#8217;ll cover other possible server architectures in future posts that can fit different application requirements. </p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/streaming-api-multiple-server-collection-architecture/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Future directions for streaming API code</title>
		<link>http://140dev.com/twitter-api-programming-blog/future-directions-for-streaming-api-code/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/future-directions-for-streaming-api-code/#comments</comments>
		<pubDate>Sun, 09 Feb 2014 14:27:53 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[140dev Source Code]]></category>
		<category><![CDATA[Streaming API]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=2884</guid>
		<description><![CDATA[My latest set of enhancements to the streaming API framework is moving along nicely towards my goal of making this code a true production level tweet collection system. While I&#8217;m waiting for feedback on the new code, I wanted to take a minute to think about where this system can go. I see several possible [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>My latest <a href="http://140dev.com/twitter-api-programming-blog/streaming-api-keyword-collection-enhancements-part-1/">set of enhancements</a> to the <a href="http://140dev.com/free-twitter-api-source-code-library/">streaming API framework</a> is moving along nicely towards my goal of making this code a true production level tweet collection system. While I&#8217;m waiting for <a href="https://groups.google.com/forum/?fromgroups#!forum/140dev-twitter-framework">feedback</a> on the new code, I wanted to take a minute to think about where this system can go. I see several possible directions. </p>
<p><strong>Training examples</strong><br />
This was my original goal when I wrote the first version of the framework, and that seems to have worked out. I&#8217;ve received plenty of feedback showing that this code served as the starting point for lots of development. I&#8217;d like to continue this approach by writing my next Twitter API book based on the framework&#8217;s code. </p>
<p><strong>Application sets </strong><br />
What I mean by this is a combination of code, sample data, and docs (either free ebook or paid book) that apply the streaming API code to specific applications. I&#8217;ve been doing this type of consulting for years, and the database issues involved in modeling the tweet activity in a specific subject area are always fascinating. I can see building sample applications for such areas as social TV, stock market investing, sports, music, food, and many others. </p>
<p><strong>WordPress for Twitter </strong><br />
I still think of WordPress as the best implementation of an open source system that can be used by non-programmers. I&#8217;m typing into it right now. While the streaming API code currently requires a fair degree of PHP, MySQL and Linux expertise, it would be exciting to put a front-end system on top of it that was as easy to install as WordPress. You would just copy it onto a server, run an install program through a browser, and manage it with a web-based dashboard. The application sets mentioned above could then evolve into the equivalent of themes or plugins. </p>
<p>This will take several years of development to achieve, but I can see working toward this goal for that long. At age 57 I can still expect/hope to code for at least another 10 years. </p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/future-directions-for-streaming-api-code/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Streaming API enhancements, part 5: Purging old tweets and related data</title>
		<link>http://140dev.com/twitter-api-programming-blog/streaming-api-enhancements-part-5-purging-old-tweets-and-related-data/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/streaming-api-enhancements-part-5-purging-old-tweets-and-related-data/#comments</comments>
		<pubDate>Sat, 08 Feb 2014 13:44:26 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[140dev Source Code]]></category>
		<category><![CDATA[Streaming API]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=2864</guid>
		<description><![CDATA[I wanted to fit one more enhancement into this new version of the streaming API framework. The major limit on performance of a tweet collection system is the number of rows in each table. This is most important during the parsing phase, when lots of insertions are done. MySQL slows down dramatically when it has [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>I wanted to fit one more enhancement into this new version of the streaming API framework. The major limit on the performance of a tweet collection system is the number of rows in each table. This is most important during the parsing phase, when lots of insertions are done. MySQL slows down dramatically when it has to insert rows into a large table, especially if the table has multiple indices. If you only need recent tweets for your application, you should set up a regular purge routine to clean out all data over a specified age. </p>
<p>The new version of the code will include a new <strong>purge_tables.php</strong> script to keep the data to a manageable size. The maximum age of tweets and related data is set in <strong>140dev_config.php</strong> with a new <strong>PURGE_INTERVAL</strong> constant. </p>
<p><strong>140dev_config.php</strong></p>
<pre>&lt;?php
// OAuth settings for connecting to the Twitter streaming API
// Fill in the values for a valid Twitter app
define('TWITTER_CONSUMER_KEY','*****');
define('TWITTER_CONSUMER_SECRET','*****');
define('OAUTH_TOKEN','*****');
define('OAUTH_SECRET','*****');

// Settings for monitor_tweets.php
// Set the number of minutes before a restart is triggered
define('TWEET_ERROR_INTERVAL',10);
// Fill in the email address for error messages
define('TWEET_ERROR_ADDRESS','*****');

<strong>// Settings for purge_tables.php
// Set the number of days before tweets and related data are deleted
// A setting of 0 will leave all data permanently in the database
define ('PURGE_INTERVAL',0);</strong>
?&gt;</pre>
<p>A PURGE_INTERVAL of 0 will prevent any deletions when purge_tables.php is run. A setting of 7 will delete any data more than 7 days old. </p>
<p><strong>purge_tables.php</strong><br />
<strong>Make sure to test this script on a backup copy of your data. </strong></p>
<pre>&lt;?php 

require_once('140dev_config.php');
require_once('db_lib.php');
$oDB = new db;

if (PURGE_INTERVAL == 0) {
  print "PURGE_INTERVAL is set to 0 days in 140dev_config.php";
  exit;
}

// Delete old tweets
$query = 'DELETE FROM tweets 
	WHERE created_at < now() - interval ' . PURGE_INTERVAL . ' day';
$oDB->select($query);

// Delete all related data that no longer has a matching tweet
$query = 'DELETE FROM tweet_mentions 
	WHERE NOT EXISTS (
		SELECT 1 
		FROM tweets
		WHERE tweets.tweet_id = tweet_mentions.tweet_id)';
$oDB->select($query);

$query = 'DELETE FROM tweet_tags 
	WHERE NOT EXISTS (
		SELECT 1 FROM tweets
	    WHERE tweets.tweet_id = tweet_tags.tweet_id)';
$oDB->select($query);

$query = 'DELETE FROM tweet_urls 
	WHERE NOT EXISTS (
		SELECT 1 FROM tweets
	    WHERE tweets.tweet_id = tweet_urls.tweet_id)';
$oDB->select($query);

$query = 'DELETE FROM tweet_words
	WHERE NOT EXISTS (	
		SELECT 1 
		FROM tweets
		WHERE tweets.tweet_id = tweet_words.tweet_id)';
$oDB->select($query);

$query = 'DELETE FROM users
		WHERE NOT EXISTS 
        (SELECT 1 FROM tweets
        WHERE tweets.user_id = users.user_id)';
$oDB->select($query);

?&gt;</pre>
<p>If you want to use this script, you can set up a cron job that runs it every day. It can put a heavy load on the server if the tables are large, so I&#8217;d schedule it for late at night. </p>
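<p>For example, a crontab entry like this (the path is a placeholder for your own install) would run the purge at 3am every day:</p>
<pre>0 3 * * * php /path/to/db/purge_tables.php</pre>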
<p>Do I have to warn everyone again to test this on a backup copy of your tweet database the first time you run it? Maybe I do. </p>
<p><strong>Make sure to test this script on a backup copy of your data. </strong></p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/streaming-api-enhancements-part-5-purging-old-tweets-and-related-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Streaming API enhancements, part 4: Parsing tweets for keywords</title>
		<link>http://140dev.com/twitter-api-programming-blog/streaming-api-enhancements-part-4-parsing-tweets-for-keywords/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/streaming-api-enhancements-part-4-parsing-tweets-for-keywords/#comments</comments>
		<pubDate>Sat, 08 Feb 2014 13:04:43 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[140dev Source Code]]></category>
		<category><![CDATA[Streaming API]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=2854</guid>
		<description><![CDATA[This set of changes is a significant improvement to parse_tweets.php. To allow you to use this separately from the production version, I&#8217;m calling it parse_tweets_keyword.php. You can place it in the same /db directory as the rest of the streaming API framework code. The only changes you will have to make is adding the following [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>This set of changes is a significant improvement to parse_tweets.php. To allow you to use this separately from the production version, I&#8217;m calling it parse_tweets_keyword.php. You can place it in the same /db directory as the rest of the streaming API framework code. The only change you will have to make is adding the following tables to the database. None of the existing tables need to be changed. </p>
<pre>CREATE TABLE IF NOT EXISTS `collection_words` (
  `words` varchar(60) NOT NULL,
  `type` enum('words','phrase') NOT NULL DEFAULT 'words',
  `out_words` varchar(100) DEFAULT NULL,
  KEY `words` (`words`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8;

CREATE TABLE IF NOT EXISTS `exclusion_words` (
  `words` varchar(60) NOT NULL,
  `type` enum('partial','exact') NOT NULL DEFAULT 'partial',
  KEY `words` (`words`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

CREATE TABLE IF NOT EXISTS `tweet_words` (
  `tweet_id` bigint(20) unsigned NOT NULL,
  `words` varchar(60) NOT NULL,
  KEY `tweet_id` (`tweet_id`),
  KEY `words` (`words`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;</pre>
<p>To prepare the data for the new code, you need to enter collection words and phrases in the <strong>words</strong> field of the <strong>collection_words</strong> table. If you want to find matches for the words anywhere in the tweet, leave the <strong>type</strong> field at its default setting of <strong>&#8216;words&#8217;</strong>. To only save tweets that match an exact phrase, you can set the type field to <strong>&#8216;phrase&#8217;</strong>. For example, to match the words &#8216;fruit&#8217; and &#8216;pie&#8217; anywhere in a tweet, put &#8216;fruit pie&#8217; in the words field. If you only want tweets about apple pie, put &#8216;apple pie&#8217; in this field, and set type to phrase. </p>
<p>The collection_words table also has an optional <strong>out_words</strong> field to control false positives. Let&#8217;s say you want tweets about legal patents, but not ones about patent leather, purses, or shoes. You would set the words field to &#8216;patent&#8217; and enter &#8216;leather,purse,bag,shoe&#8217; in the out_words field. There are several details worth noting here. The out_words are entered as a comma delimited string, and each of the words will be matched separately. The code does partial string matches for out_words, which means that purse will also match purses, bag matches bags, etc. </p>
<p>The <strong>exclusion_words</strong> table allows you to block tweets with objectionable words. This is very useful for public tweet aggregation sites. You can set these words to a <strong>type</strong> of either <strong>partial</strong> or <strong>exact</strong>. For example, &#8216;fuck&#8217; entered as partial will also exclude tweets with fucks, fucker, or fucking. If you only want to exclude a specific spelling of a word, then set the type to exact. </p>
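<p>To make the rules above concrete, here are sample rows matching the patent, apple pie, and curse word examples:</p>
<pre>INSERT INTO collection_words (words, type, out_words)
VALUES ('patent', 'words', 'leather,purse,bag,shoe');

INSERT INTO collection_words (words, type)
VALUES ('apple pie', 'phrase');

INSERT INTO exclusion_words (words, type)
VALUES ('fuck', 'partial');</pre>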
<p>At least one entry in the collection_words table is required. Without it, no tweets will be collected, and the new version of <a href="http://140dev.com/twitter-api-programming-blog/streaming-api-enhancements-part-3-collecting-tweets-based-on-table-of-keywords/">get_tweets.php</a> will not run. You can choose to leave the out_words fields blank, or add out_words only where needed. </p>
<p>The exclusion_words table is optional. The code will run fine with no entries here. This is a table that you will tend to fill up as you become aware of the problem words you usually see in tweets based on your keywords. </p>
<p>Finally, we have the new <strong>tweet_words</strong> table. This will automatically be filled by the parsing code. There will be an entry for each match within each tweet. If a tweet matches 3 collection words, 3 rows will be added to this table. This lets you find tweets that contain specific keywords very quickly by joining this table to the tweets table on the matching tweet_id field. </p>
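<p>For example, once the parsing code has filled this table, a simple join (using the apple pie entry from above) pulls out the matching tweets:</p>
<pre>SELECT tweets.*
FROM tweets
JOIN tweet_words ON tweet_words.tweet_id = tweets.tweet_id
WHERE tweet_words.words = 'apple pie';</pre>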
<p>Now we can review the new version of parse_tweets.php. Here is the complete code with new lines shown in bold face. Then I will break out the changes and explain them separately. </p>
<p><strong>parse_tweets_keyword.php</strong></p>
<pre>&lt;?php

require_once('140dev_config.php');
require_once('db_lib.php');
$oDB = new db;

// This should run continuously as a background process
while (true) {
	
  <strong>// Gather exclusion words into an array once per parsing cycle
  $query = "SELECT words, type
      FROM exclusion_words
      WHERE words <> ''";
  $result = $oDB->select($query);
  $exclusion_words = array();
  while($row = mysqli_fetch_assoc($result)) {
    $exclusion_words[strtolower($row['words'])] = $row['type'];
  }
	
  // Gather collection words into an array 
  $query = "SELECT words, type, out_words
      FROM collection_words
      WHERE words <> ''";
  $result = $oDB->select($query);
  $collection_words = array();
  while($row = mysqli_fetch_assoc($result)) {
    $collection_words[strtolower($row['words'])] = array( 'type' => $row['type'],
      'out_words' => strtolower($row['out_words']));
  }</strong>
	
  // Process all new tweets
  $query = 'SELECT cache_id, raw_tweet ' .
    'FROM json_cache';
  $result = $oDB->select($query);
  while($row = mysqli_fetch_assoc($result)) {
		
    $cache_id = $row['cache_id'];

    // Each JSON payload for a tweet from the API was stored in the database  
    // by serializing it as text and saving it as base64 raw data
    $tweet_object = unserialize(base64_decode($row['raw_tweet']));
		
    // Delete cached copy of tweet
    $oDB->select("DELETE FROM json_cache WHERE cache_id = $cache_id");
		
    // Limit tweets to a single language,
    // such as 'en' for English
    if ($tweet_object->lang <> 'en') {continue;}
		
    // The streaming API sometimes sends duplicates, 
    // Test the tweet_id before inserting
    $tweet_id = $tweet_object->id_str;
    if ($oDB->in_table('tweets','tweet_id=' . $tweet_id )) {continue;}
		
    <strong>// Get the tweet text for collection and exclusion words testing
    if (isset($tweet_object->retweeted_status)) {
      // This is a retweet, so we need the original tweet text for testing
      // Retweet text may be clipped to allow for RT @[screen_name]:
      $test_text = $tweet_object->retweeted_status->text;
    } else {
      $test_text = $tweet_object->text;
    }</strong>
		
   <strong> // Reject tweets that don't match any collection words rules
    // Record details of tweets that do match any of them
    $match_collection_words = array();
    foreach($collection_words as $words => $rules) {
      // If valid collection words are found
      if (find_collection_words($words,$test_text,$rules['type'],$rules['out_words'])) {
        // Record the words for insertion into tweet_words table
        $match_collection_words[] = $words;
      }
    }
    // Skip this tweet if no valid matches found
    if (!$match_collection_words) {continue;}	</strong>	
		
<strong>    // Reject tweets that contain exclusion words
    foreach($exclusion_words as $words => $type) {
      // if a match is found, use continue 2 to 
      // exit foreach loop and jump to top of while loop
      if (find_exclusion_words($words,$test_text,$type)) {continue 2;}
    }</strong>

    // Gather tweet data from the JSON object
    // $oDB->escape() escapes ' and " characters, and blocks characters that
    // could be used in a SQL injection attempt
   
    if (isset($tweet_object->retweeted_status)) {
      // This is a retweet
      // Use the original tweet's entities, they are more complete
      $entities = $tweet_object->retweeted_status->entities;
      $is_rt = 1;
    } else {
      $entities = $tweet_object->entities;
      $is_rt = 0;
    }
    $tweet_text = $oDB->escape($tweet_object->text);	
    $created_at = $oDB->date($tweet_object->created_at);
    if (isset($tweet_object->geo)) {
      $geo_lat = $tweet_object->geo->coordinates[0];
      $geo_long = $tweet_object->geo->coordinates[1];
    } else {
      $geo_lat = $geo_long = 0;
    } 
    $user_object = $tweet_object->user;
    $user_id = $user_object->id_str;
    $screen_name = $oDB->escape($user_object->screen_name);
    $name = $oDB->escape($user_object->name);
    $profile_image_url = $user_object->profile_image_url;
		
    // Add a new user row or update an existing one
    $field_values = 'screen_name = "' . $screen_name . '", ' .
      'profile_image_url = "' . $profile_image_url . '", ' .
      'user_id = ' . $user_id . ', ' .
      'name = "' . $name . '", ' .
      'location = "' . $oDB->escape($user_object->location) . '", ' . 
      'url = "' . $user_object->url . '", ' .
      'description = "' . $oDB->escape($user_object->description) . '", ' .
      'created_at = "' . $oDB->date($user_object->created_at) . '", ' .
      'followers_count = ' . $user_object->followers_count . ', ' .
      'friends_count = ' . $user_object->friends_count . ', ' .
      'statuses_count = ' . $user_object->statuses_count . ', ' . 
      'time_zone = "' . $user_object->time_zone . '", ' .
      'last_update = "' . $oDB->date($tweet_object->created_at) . '"' ;			

    if ($oDB->in_table('users','user_id="' . $user_id . '"')) {
      $oDB->update('users',$field_values,'user_id = "' .$user_id . '"');
    } else {			
      $oDB->insert('users',$field_values);
    }
		
    // Add the new tweet
    $field_values = 'tweet_id = ' . $tweet_id . ', ' .
        'tweet_text = "' . $tweet_text . '", ' .
        'created_at = "' . $created_at . '", ' .
        'geo_lat = ' . $geo_lat . ', ' .
        'geo_long = ' . $geo_long . ', ' .
        'user_id = ' . $user_id . ', ' .			
        'screen_name = "' . $screen_name . '", ' .
        'name = "' . $name . '", ' .
        'profile_image_url = "' . $profile_image_url . '", ' .
        'is_rt = ' . $is_rt;
			
    $oDB->insert('tweets',$field_values);
		
<strong>    // Record all collection words found in this tweet
    foreach ($match_collection_words as $words) {
    			
      $where = 'tweet_id=' . $tweet_id . ' ' .
        'AND words ="' . $words .'"';		
				
      if(! $oDB->in_table('tweet_words',$where)) {
			
        $field_values = 'tweet_id=' . $tweet_id . ', ' .
        'words="' . $words . '"';	

        $oDB->insert('tweet_words',$field_values);
      }
    }</strong>
		
    // The mentions, tags, and URLs from the entities object are also
    // parsed into separate tables so they can be data mined later
    foreach ($entities->user_mentions as $user_mention) {
		
      $where = 'tweet_id=' . $tweet_id . ' ' .
        'AND source_user_id=' . $user_id . ' ' .
        'AND target_user_id=' . $user_mention->id;		
					 
      if(! $oDB->in_table('tweet_mentions',$where)) {
			
        $field_values = 'tweet_id=' . $tweet_id . ', ' .
        'source_user_id=' . $user_id . ', ' .
        'target_user_id=' . $user_mention->id;	
				
        $oDB->insert('tweet_mentions',$field_values);
      }
    }
    foreach ($entities->hashtags as $hashtag) {
			
      $where = 'tweet_id=' . $tweet_id . ' ' .
        'AND tag="' . $hashtag->text . '"';		
					
      if(! $oDB->in_table('tweet_tags',$where)) {
			
        $field_values = 'tweet_id=' . $tweet_id . ', ' .
          'tag="' . $hashtag->text . '"';	
				
        $oDB->insert('tweet_tags',$field_values);
      }
    }
    foreach ($entities->urls as $url) {
		
      if (empty($url->expanded_url)) {
        $url = $url->url;
      } else {
        $url = $url->expanded_url;
      }
			
      $where = 'tweet_id=' . $tweet_id . ' ' .
        'AND url="' . $url . '"';		
					
      if(! $oDB->in_table('tweet_urls',$where)) {
        $field_values = 'tweet_id=' . $tweet_id . ', ' .
          'url="' . $url . '"';	
				
        $oDB->insert('tweet_urls',$field_values);
      }
    }		
  } 
		
  // You can adjust the sleep interval to handle the tweet flow and 
  // server load you experience
  sleep(1);
}

<strong>// Return 1 if match is found
// Return 0 if no match, or match containing out word
function find_collection_words($words,$tweet_text,$type,$out_words) {
  // Remove extra spaces from words and tweet text
  $words = trim(preg_replace('/\s+/',' ', $words));
  $tweet_text = trim(preg_replace('/\s+/',' ', $tweet_text));
  $out_words = trim(preg_replace('/\s+/',' ', $out_words));
	
  // Escape any characters in collection words that may 
  // conflict with a regex pattern used by preg_match
  $words = preg_quote($words, '/');	

  $match = 0;
  if ($type=='phrase') {
    // Exact match of collection phrase is required
    $match = preg_match('/\b' . $words . '\b/i',$tweet_text);
  } else {
    // Break apart the words on space boundaries 
    // and check for each of them separately
    $words_array = explode(' ',$words);
    foreach($words_array as $word) {
      if (!preg_match('/' . $word . '/i',$tweet_text)) {
        // One of the words is missing, so we're done
        return 0;
      } 
    }
    $match = 1;
  }

  if($match &#038;&#038; !empty($out_words)) {
    // Check for out words
    // Break apart the out words on comma boundaries 
    // and check for each of them separately
		
    $out_words_array = explode(',',$out_words);
    foreach($out_words_array as $out_word) {

      // Escape any characters in out_word that may 
      // conflict with a regex pattern used by preg_match
      $out_word = preg_quote($out_word, '/');

      if (preg_match('/' . $out_word . '/i',$tweet_text)) {
        // One of the out_words is found, so we're done
        return 0;
      } 
    }
  }
	
  return $match;
}

// Return 1 if match is found, 0 if not
function find_exclusion_words($words,$tweet_text,$type) {
  // Remove extra spaces from words and tweet text
  $words = trim(preg_replace('/\s+/',' ', $words));
  $tweet_text = trim(preg_replace('/\s+/',' ', $tweet_text));

  // Escape any characters in the exclusion word that may 
  // conflict with a regex pattern used by preg_match
  $words = preg_quote($words, '/');
	
  if ($type == 'partial') {
    return preg_match('/' . $words . '/i',$tweet_text);
  } elseif ($type == 'exact') {
    return preg_match('/\b' . $words . '\b/i',$tweet_text);
  }
}</strong>
?&gt;</pre>
<p>It&#8217;s a lot of code, but it does a lot more than the old version. Let&#8217;s go through the changes one section at a time. The first change is reading the contents of the collection_words and exclusion_words tables into arrays, so they can be tested quickly. This is done at the start of the main while loop, so the arrays are refreshed on every pass; any changes to these tables are picked up without having to restart the script. </p>
<pre>// Gather exclusion words into an array once per parsing cycle
  $query = "SELECT words, type
      FROM exclusion_words";
  $result = $oDB->select($query);
  $exclusion_words = array();
  while($row = mysqli_fetch_assoc($result)) {
    $exclusion_words[strtolower($row['words'])] = $row['type'];
  }
	
  // Gather collection words into an array 
  $query = "SELECT words, type, out_words
      FROM collection_words";
  $result = $oDB->select($query);
  $collection_words = array();
  while($row = mysqli_fetch_assoc($result)) {
    $collection_words[strtolower($row['words'])] = array( 'type' => $row['type'],
      'out_words' => strtolower($row['out_words']));
  }</pre>
<p>Once we have the collection and exclusion words in memory, we are ready to compare them to the current tweet being parsed. First we pull out the text and make sure we allow for retweets. If it is a retweet, we want to test against the original&#8217;s text.</p>
<pre>// Get the tweet text for collection and exclusion words testing
    if (isset($tweet_object->retweeted_status)) {
      // This is a retweet, so we need the original tweet text for testing
      // Retweet text may be clipped to allow for RT @[screen_name]:
      $test_text = $tweet_object->retweeted_status->text;
    } else {
      $test_text = $tweet_object->text;
    }</pre>
<p>Then we can test each tweet against the collection words and reject it if it doesn&#8217;t meet any of the collection rules. This is done with a <strong>find_collection_words()</strong> function found at the end of the script. If the tweet is good, we add the matching words to the <strong>$match_collection_words</strong> array. This will be used later to fill in the tweet_words table. </p>
<pre>// Reject tweets that don't match any collection words rules
    // Record details of tweets that do match any of them
    $match_collection_words = array();
    foreach($collection_words as $words => $rules) {
      // If valid collection words are found
      if (find_collection_words($words,$test_text,$rules['type'],$rules['out_words'])) {
        // Record the words for insertion into tweet_words table
        $match_collection_words[] = $words;
      }
    }
    // Skip this tweet if no valid matches found
    if (!$match_collection_words) {continue;}	</pre>
<p>After the collection words processing is done, we test the tweet against the exclusion words, if any are found in the database. The <strong>find_exclusion_words()</strong> function is at the end of the script. </p>
<pre>    // Reject tweets that contain exclusion words
    foreach($exclusion_words as $words => $type) {
      // if a match is found, use continue 2 to 
      // exit foreach loop and jump to top of while loop
      if (find_exclusion_words($words,$test_text,$type)) {continue 2;}
    }</pre>
<p>After the tweet and user are added to the database, we can insert a row in the tweet_words table for each collection word found.</p>
<pre>    // Record all collection words found in this tweet
    foreach ($match_collection_words as $words) {
    			
      $where = 'tweet_id=' . $tweet_id . ' ' .
        'AND words ="' . $words .'"';		
				
      if(! $oDB->in_table('tweet_words',$where)) {
			
        $field_values = 'tweet_id=' . $tweet_id . ', ' .
        'words="' . $words . '"';	

        $oDB->insert('tweet_words',$field_values);
      }
    }</pre>
<p>The actual work of testing collection and exclusion words is done with these two new functions. Parsing is inherently ugly code, at least to me, but I&#8217;ve tried to make the logic as clear as possible. </p>
<pre>// Return 1 if match is found
// Return 0 if no match, or match containing out word
function find_collection_words($words,$tweet_text,$type,$out_words) {
  // Remove extra spaces from words and tweet text
  $words = trim(preg_replace('/\s+/',' ', $words));
  $tweet_text = trim(preg_replace('/\s+/',' ', $tweet_text));
  $out_words = trim(preg_replace('/\s+/',' ', $out_words));

  // Escape any characters in collection words that may 
  // conflict with a regex pattern used by preg_match
  $words = preg_quote($words, '/');	
	
  $match = 0;
  if ($type=='phrase') {
    // Exact match of collection phrase is required
    $match = preg_match('/\b' . $words . '\b/i',$tweet_text);
  } else {
    // Break apart the words on space boundaries 
    // and check for each of them separately
    $words_array = explode(' ',$words);
    foreach($words_array as $word) {
      if (!preg_match('/' . $word . '/i',$tweet_text)) {
        // One of the words is missing, so we're done
        return 0;
      } 
    }
    $match = 1;
  }

  if($match &#038;&#038; !empty($out_words)) {
    // Check for out words
    // Break apart the out words on comma boundaries 
    // and check for each of them separately
		
    $out_words_array = explode(',',$out_words);
    foreach($out_words_array as $out_word) {

      // Escape any characters in out_word that may 
      // conflict with a regex pattern used by preg_match
      $out_word = preg_quote($out_word, '/');

      if (preg_match('/' . $out_word . '/i',$tweet_text)) {
        // One of the out_words is found, so we're done
        return 0;
      } 
    }
  }
	
  return $match;
}

// Return 1 if match is found, 0 if not
function find_exclusion_words($words,$tweet_text,$type) {
  // Remove extra spaces from words and tweet text
  $words = trim(preg_replace('/\s+/',' ', $words));
  $tweet_text = trim(preg_replace('/\s+/',' ', $tweet_text));

  // Escape any characters in the exclusion word that may 
  // conflict with a regex pattern used by preg_match
  $words = preg_quote($words, '/');
	
  if ($type == 'partial') {
    return preg_match('/' . $words . '/i',$tweet_text);
  } elseif ($type == 'exact') {
    return preg_match('/\b' . $words . '\b/i',$tweet_text);
  }
}</pre>
<p>I&#8217;ve tested these enhancements for a couple of days and they seem solid. I&#8217;d appreciate any comments or problems you find. I&#8217;ll assemble all the new framework code into a beta 0.40 version next week, and give people more time to test after that. I hope to release this code as the new production version of the streaming API framework in a couple of weeks. </p>
<p>To finish off this series of enhancements, I&#8217;ve <a href="http://140dev.com/twitter-api-programming-blog/streaming-api-enhancements-part-5-purging-old-tweets-and-related-data/">posted</a> a script to purge old data. This can result in a significant performance boost. </p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/streaming-api-enhancements-part-4-parsing-tweets-for-keywords/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Streaming API enhancements, part 3: Collecting tweets based on table of keywords</title>
		<link>http://140dev.com/twitter-api-programming-blog/streaming-api-enhancements-part-3-collecting-tweets-based-on-table-of-keywords/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/streaming-api-enhancements-part-3-collecting-tweets-based-on-table-of-keywords/#comments</comments>
		<pubDate>Wed, 05 Feb 2014 14:12:11 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[140dev Source Code]]></category>
		<category><![CDATA[Streaming API]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=2831</guid>
		<description><![CDATA[Yesterday I introduced the database changes needed to control tweet collection with a table of keywords. Now we can modify get_tweets.php to collect tweets based on the contents of the collection_words table. To avoid confusion with the production version of get_tweets.php, I&#8217;m calling this new version get_tweets_keyword.php. This will let you place it in the [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Yesterday I introduced the <a href="http://140dev.com/twitter-api-programming-blog/streaming-api-enhancements-part-2-keyword-collection-database-changes/">database changes</a> needed to control tweet collection with a table of keywords. Now we can modify get_tweets.php to collect tweets based on the contents of the <strong>collection_words</strong> table. To avoid confusion with the production version of get_tweets.php, I&#8217;m calling this new version get_tweets_keyword.php. This will let you place it in the same directory as the rest of the streaming API framework code. </p>
<p>Remember, <strong>you are never allowed to make more than one connection to the streaming API with the same Twitter account</strong>. If you want to test this new code, you MUST kill any copy of get_tweets.php that may already be running. If you need to keep your existing tweet collection code running, you should create a new copy of the framework code in a different directory, and configure it with the OAuth tokens from an app owned by a different Twitter account. Twitter permits you to use multiple accounts this way for testing and development purposes. </p>
<p>Let&#8217;s start with a complete copy of the new collection code, and then I&#8217;ll break apart each change to explain the goals of the new code. I&#8217;ve highlighted the changes in boldface to make them easier to spot. </p>
<p><strong>get_tweets_keyword.php</strong></p>
<pre>&lt;?php
require_once('140dev_config.php');

require_once('../libraries/phirehose/Phirehose.php');
require_once('../libraries/phirehose/OauthPhirehose.php');
class Consumer extends OauthPhirehose
{
  // A database connection is established at launch and kept open permanently
  public $oDB;
  public function db_connect() {
    require_once('db_lib.php');
    $this->oDB = new db;
  }
	
  // This function is called automatically by the Phirehose class
  // when a new tweet is received with the JSON data in $status
  public function enqueueStatus($status) {
    $tweet_object = json_decode($status);
		
    // Ignore tweets without a properly formed tweet id value
    if (!(isset($tweet_object->id_str))) { return;}
		
    $tweet_id = $tweet_object->id_str;

    // If there's a ", ', :, or ; in object elements, serialize() gets corrupted 
    // You should also use base64_encode() before saving this
    $raw_tweet = base64_encode(serialize($tweet_object));
		
    $field_values = 'raw_tweet = "' . $raw_tweet . '", ' .
      'tweet_id = ' . $tweet_id;
    $this->oDB->insert('json_cache',$field_values);
  }
	
 <strong> // This function is called automatically by the Phirehose class
  // every 5 seconds. It can be used to reset the collection array	
  public function checkFilterPredicates() {
    $this->setTrack($this->get_keywords());
  }
		
  // Build an array of keywords for tweet collection
  public function get_keywords() {
    $query = "SELECT words
  	FROM collection_words
        WHERE words <> ''";
    $result = $this->oDB->select($query);
		
    if (mysqli_num_rows($result)==0) {
      // Exit if no collection words found	
      print "ERROR: No keywords found in collection_words table";
      exit;
    } else if (mysqli_num_rows($result)>400) {
      // Exit if keyword count exceeds API limit of 400
      print "ERROR: More than 400 keywords in collection_words table";
      exit;
    }
		
    // Create a keyword list
    $keyword_array = array();
    while ($row=mysqli_fetch_assoc($result)) {
       array_push($keyword_array, $row['words']);
    }		
    return $keyword_array;
  }</strong>
}

// Open a persistent connection to the Twitter streaming API
$stream = new Consumer(OAUTH_TOKEN, OAUTH_SECRET, Phirehose::METHOD_FILTER);

// Establish a MySQL database connection
$stream->db_connect();

<strong>// The keywords for tweet collection are 
// set by reading them from the collection_words table
$stream->setTrack($stream->get_keywords());
</strong>
// Start collecting tweets
// Automatically call enqueueStatus($status) with each tweet's JSON data
$stream->consume();

?&gt;</pre>
<p>The core of the new code is the <strong>get_keywords()</strong> function. This is run when the script starts and again every 5 seconds. It collects all the keywords from the collection_words table, and creates an array for use with the Phirehose <strong>setTrack()</strong> function. It has two tests to prevent a streaming API error: the script exits if no keywords are found in the table, or if more than 400 are found. Either condition would cause a failed connection. </p>
<pre>  // Build an array of keywords for tweet collection
  public function get_keywords() {
    $query = "SELECT words
  	FROM collection_words";
    $result = $this->oDB->select($query);
		
    if (mysqli_num_rows($result)==0) {
      // Exit if no collection words found	
      print "ERROR: No keywords found in collection_words table";
      exit;
    } else if (mysqli_num_rows($result)>400) {
      // Exit if keyword count exceeds API limit of 400
      print "ERROR: More than 400 keywords in collection_words table";
      exit;
    }
		
    // Create a keyword list
    $keyword_array = array();
    while ($row=mysqli_fetch_assoc($result)) {
       array_push($keyword_array, $row['words']);
    }		
    return $keyword_array;
  }</pre>
<p>With the get_keywords() function now available, we call it from two parts of this script. When the script starts, its result is passed to the setTrack() function to set the initial collection array. </p>
<pre>// The keywords for tweet collection are 
// set by reading them from the collection_words table
$stream->setTrack($stream->get_keywords());</pre>
<p>We also want the collection list to be updated as soon as changes are made to the collection_words table. This is done with a Phirehose function called checkFilterPredicates(), which is called every 5 seconds. </p>
<pre>  // This function is called automatically by the Phirehose class
  // every 5 seconds. It can be used to reset the collection array	
  public function checkFilterPredicates() {
    $this->setTrack($this->get_keywords());
  }</pre>
<p>Once you have the code installed as described above, and you&#8217;ve made sure there is no running version of get_tweets.php with the same OAuth tokens, all you have to do to test this is fill in some keywords in the collection_words table. You can ignore the type and out_words fields in this table for now. They will be used by the new version of parse_tweets.php found in my next blog post on this subject. To test the get_tweets_keyword.php script, you can run it directly from the command line as a foreground process. This will let you see any error messages as you debug the new code. This would be done with the command:</p>
<pre>php get_tweets_keyword.php</pre>
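<p>Once the script runs cleanly, you can keep it going as a background process. On Linux one common approach is nohup (the log file name is just an example):</p>
<pre>nohup php get_tweets_keyword.php > get_tweets.log 2>&#038;1 &#038;</pre>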
<p>As always, please share any questions or comments with me. I want to make sure this works for everyone before building it into the release version of the framework. </p>
<p>The new version of parse_tweets.php that processes this data is <a href="http://140dev.com/twitter-api-programming-blog/streaming-api-enhancements-part-4-parsing-tweets-for-keywords/">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/streaming-api-enhancements-part-3-collecting-tweets-based-on-table-of-keywords/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Streaming API enhancements, part 2: Keyword collection database changes</title>
		<link>http://140dev.com/twitter-api-programming-blog/streaming-api-enhancements-part-2-keyword-collection-database-changes/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/streaming-api-enhancements-part-2-keyword-collection-database-changes/#comments</comments>
		<pubDate>Tue, 04 Feb 2014 20:48:48 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[140dev Source Code]]></category>
		<category><![CDATA[Streaming API]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=2816</guid>
		<description><![CDATA[The previous post had an overview of my planned enhancements to the streaming API framework. The first step is adding some new tables to the 140dev tweet collection database: collection_words The words in this table will be used to collect matching tweets. We&#8217;ll see how to add this to the get_tweets.php script so that the [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>The <a href="http://140dev.com/twitter-api-programming-blog/streaming-api-keyword-collection-enhancements-part-1/">previous post</a> had an overview of my planned enhancements to the <a href="http://140dev.com/free-twitter-api-source-code-library/">streaming API framework</a>. The first step is adding some new tables to the 140dev tweet collection database:</p>
<p><strong>collection_words</strong><br />
The <strong>words</strong> in this table will be used to collect matching tweets. We&#8217;ll see how to add this to the get_tweets.php script so that the collection list is automatically updated for the streaming API when the table changes. This means that you can add and remove words from the table, and have the new list collected without having to restart the get_tweets.php script.</p>
<pre>CREATE TABLE IF NOT EXISTS `collection_words` (
  `words` varchar(60) NOT NULL,
  `type` enum('words','phrase') NOT NULL DEFAULT 'words',
  `out_words` varchar(100) DEFAULT NULL,
  KEY `words` (`words`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;</pre>
<p>The streaming API automatically ANDs multi-word entries, and ORs the full set of keywords. Let&#8217;s assume this table is given the following entries:<br />
pizza recipe<br />
cookbook</p>
<p>This will deliver tweets that contain:<br />
pizza recipe OR cookbook</p>
<p>Multi-word entries return tweets that contain all the words, even if they are not next to each other. An entry of <strong>pizza recipe</strong> will return tweets that contain both of these words, no matter where they appear in the tweet. The <a href="https://dev.twitter.com/docs/streaming-apis/parameters#track">Twitter docs</a> have more examples. </p>
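<p>In terms of the framework&#8217;s code, these two table entries produce the same result as hard-coding the track list with the Phirehose setTrack() function:</p>
<pre>$stream->setTrack(array('pizza recipe', 'cookbook'));</pre>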
<p>The table also contains an enum <strong>type</strong> field, so you can restrict results to a complete phrase. Let&#8217;s say you want to match &#8220;I love apple pie&#8221;, but not &#8220;I don&#8217;t want an apple in my blueberry pie.&#8221; You can set the words field to <strong>apple pie</strong> and the <strong>type</strong> to <strong>phrase</strong>. </p>
<p>False positives can be a real problem with keyword collection. One of the first tweet aggregation systems I built was for an intellectual property lawyer, and I ran into the problem of searching for the word patent and getting matches for patent leather. The <strong>out_words</strong> field is included to handle these types of false positives. A collection word of <strong>patent</strong>, could have the word <strong>leather</strong> placed in the out_words field to block this false positive. The out_words field is optional. It is only needed for collection words that may return unintended tweets. </p>
<p>There is a one-to-many relationship between collection words and out words, and I&#8217;m a strong relational database guy, as you may have noticed. The rule I follow is to normalize relationships into linked tables, except when it makes life harder and queries slower. In this case, creating a separate out_words table would be a pain. Looking up each out word for the matching collection words would be slow, and there is the risk of deleting the collection word while leaving the matching out word in the database. To simplify this data structure, I&#8217;ve made the out_words field large, and expect to put all the words into the field with comma delimiters. You&#8217;ll see how this is used when we put this table into use in the new version of parse_tweets.php. </p>
<p><strong>exclusion_words</strong><br />
This table contains <strong>words</strong> that cause tweets to be rejected. It will be used by parse_tweets.php to test each tweet before adding it to the database. Because this is done in our code, rather than by the API, we can add logic to do partial or exact matches based on the <strong>type</strong> field. For example, if <strong>fuck</strong> is added along with the type of <strong>partial</strong>, parse_tweets.php can exclude tweets with: fuck, fucks, fucker, and fucking. You can use this table to make sure that your tweet display system doesn&#8217;t display tweets that embarrass you or your client. </p>
<pre>CREATE TABLE IF NOT EXISTS `exclusion_words` (
  `words` varchar(60) NOT NULL,
  `type` enum('partial','exact') NOT NULL DEFAULT 'partial',
  KEY `words` (`words`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;</pre>
<p><strong>tweet_words</strong><br />
Once you have collected tweets based on keywords, you&#8217;ll want a fast way of retrieving just the tweets that match specific keywords. You&#8217;ll also probably want to report on which keywords are used the most in the tweets you collect. Parse_tweets.php can accomplish this by recording the tweet_id and keywords found in this table. In effect, you are creating an index of all the tweets based on keywords. This is much faster than searching within the text of each tweet, especially when the number of tweets gets large.</p>
<pre>CREATE TABLE IF NOT EXISTS `tweet_words` (
  `tweet_id` bigint(20) unsigned NOT NULL,
  `words` varchar(60) NOT NULL,
  KEY `tweet_id` (`tweet_id`),
  KEY `words` (`words`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;</pre>
<p>This will let you run the following MySQL queries:</p>
<pre>SELECT tweets.*
FROM tweets, tweet_words
WHERE tweets.tweet_id = tweet_words.tweet_id
AND tweet_words.words = "pizza recipe"</pre>
<pre>SELECT count(*) as cnt, words
FROM tweet_words
GROUP BY words
ORDER BY cnt DESC</pre>
<p>Part 3 of this series with a new version of get_tweets.php is <a href="http://140dev.com/twitter-api-programming-blog/streaming-api-enhancements-part-3-collecting-tweets-based-on-table-of-keywords/">here</a>. </p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/streaming-api-enhancements-part-2-keyword-collection-database-changes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Streaming API enhancements, part 1: Keyword collection enhancements</title>
		<link>http://140dev.com/twitter-api-programming-blog/streaming-api-keyword-collection-enhancements-part-1/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/streaming-api-keyword-collection-enhancements-part-1/#comments</comments>
		<pubDate>Tue, 04 Feb 2014 18:04:54 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[140dev Source Code]]></category>
		<category><![CDATA[Streaming API]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=2812</guid>
		<description><![CDATA[The next few posts will describe a set of enhancements to the streaming API framework that will greatly expand the capabilities of the code for collecting tweets based on keywords. I thought I&#8217;d start with an overview of what I want to accomplish: Add a collection_keywords table to the database to hold keywords to be [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>The next few posts will describe a set of enhancements to the <a href="http://140dev.com/free-twitter-api-source-code-library/">streaming API framework</a> that will greatly expand the capabilities of the code for collecting tweets based on keywords. I thought I&#8217;d start with an overview of what I want to accomplish:</p>
<ul>
<li>Add a collection_keywords table to the database to hold keywords to be used for collection.</li>
<li>Add an exclusion_keywords table to the database to hold words (typically curse words) that identify tweets to be rejected.</li>
<li>Add a tweet_keywords table to the database to record the tweet_id of any tweet with a collection keyword. This will greatly speed up queries that get tweets for specific keywords.</li>
<li>Modify get_tweets.php to collect tweets that contain the collection_keywords.</li>
<li>Modify parse_tweets.php to test each tweet and reject it if an exclusion_keyword is found.</li>
<li>Modify parse_tweets.php to record any keywords found in the tweet_keywords table.</li>
</ul>
<p>I&#8217;m going to leave the current version of the framework code unchanged, so the enhanced scripts will be called get_tweets_keyword.php and parse_tweets_keyword.php. Once people have had a chance to test this code, I will integrate it into a new release version of the framework.</p>
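<p>As a preview of the direction, here is a rough sketch of how get_tweets_keyword.php could build its track list from the collection_keywords table. I&#8217;m assuming a PDO connection, a keyword column name, and the Phirehose-style setTrack() call that the framework&#8217;s streaming connection is built on, so treat the details as placeholders until the real post. </p>
<pre>// Hypothetical sketch: load the streaming API track list from the
// collection_keywords table instead of hard-coding it
$keywords = array();
foreach ($db->query('SELECT keyword FROM collection_keywords') as $row) {
    $keywords[] = $row['keyword'];
}
$stream->setTrack($keywords);  // the streaming API allows up to 400 keywords</pre>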
<p>The next post in this series is available <a href="http://140dev.com/twitter-api-programming-blog/streaming-api-enhancements-part-2-keyword-collection-database-changes/">here</a>. </p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/streaming-api-keyword-collection-enhancements-part-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Google Hangouts for streaming API code?</title>
		<link>http://140dev.com/twitter-api-programming-blog/google-hangouts-for-streaming-api-code/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/google-hangouts-for-streaming-api-code/#comments</comments>
		<pubDate>Fri, 31 Jan 2014 14:11:40 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[Streaming API]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=2782</guid>
		<description><![CDATA[I&#8217;d like to start working on a book on the streaming API, but I&#8217;ve learned that actually walking people through code and installs is best before writing it all down. I&#8217;m learning a lot about where the sticking points are by helping people install the new version of the framework. It seems like running processes [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>I&#8217;d like to start working on a book on the streaming API, but I&#8217;ve learned that it&#8217;s best to walk people through code and installs in person before writing it all down. I&#8217;m learning a lot about where the sticking points are by helping people install the new version of the framework. It seems like running processes in Linux may be the biggest hurdle for most. I have always assumed that this was the starting point for PHP coding, but apparently not. </p>
<p>Most of my recent support work has been done through Twitter, which sucks as a support medium. Getting Twitter people to use email is surprisingly difficult. I&#8217;ve known this for a long time, but it still confuses me. I think my best bet is to start running Google Hangouts on the streaming API code. That will let me share my screen and get real-time feedback and questions. I&#8217;m a big believer in this format. I taught dBASE for about a dozen years, and nothing beats trying things in front of a live audience when it comes to learning what does and doesn&#8217;t work pedagogically. Google Hangouts will also let me save each session and post it on my site. </p>
<p>I&#8217;ll look into this and see if I can start next week. Watch my Twitter feed for details. </p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/google-hangouts-for-streaming-api-code/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Three most important rules for running the streaming API framework</title>
		<link>http://140dev.com/twitter-api-programming-blog/three-most-important-rules-for-running-the-streaming-api-framework/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/three-most-important-rules-for-running-the-streaming-api-framework/#comments</comments>
		<pubDate>Fri, 31 Jan 2014 11:56:28 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[140dev Source Code]]></category>
		<category><![CDATA[Streaming API]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=2767</guid>
		<description><![CDATA[I&#8217;m a typical programmer when it comes to reading documentation before I try to run someone else&#8217;s code. First I try to run the code, then I read the docs. Hey, if I needed clear documentation to get things working, would I have become a Twitter API programmer? Unfortunately, Twitter&#8217;s streaming API requires a unique [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>I&#8217;m a typical programmer when it comes to reading documentation before I try to run someone else&#8217;s code. First I try to run the code, then I read the docs. Hey, if I needed clear documentation to get things working, would I have become a Twitter API programmer? </p>
<p>Unfortunately, Twitter&#8217;s streaming API requires a unique model of coding that is probably unlike any you&#8217;ve used before. Trust me, your first instincts will not work here. So I&#8217;ve compiled 3 basic rules that solve 90% of all support questions I receive when people first try to run my <a href="http://140dev.com/free-twitter-api-source-code-library/">140dev Streaming API Framework</a>. </p>
<p><strong>Rule #1: Do not run <a href="http://140dev.com/free-twitter-api-source-code-library/twitter-database-server/get-tweets-php/">get_tweets.php</a> as a cronjob</strong><br />
Cron is the backbone of Linux programming, so when I see a script that I need to run over a long period of time, I automatically create a cronjob for it. That is what many coders do when they start using my streaming API code. In the case of get_tweets.php, that will not work. This script must be run as a background process, which means it is started once and then runs forever, or until you explicitly kill it. I do have a detailed <a href="http://140dev.com/free-twitter-api-source-code-library/twitter-database-server/install/">install page</a> that warns against this, but I wouldn&#8217;t read it myself. I&#8217;d just plunge ahead. So I understand why this is missed. </p>
<p>Here&#8217;s the problem: when you start get_tweets.php, it makes a connection to the streaming API on Twitter and keeps it open. Tweets then start flowing into the database. It&#8217;s really cool. A cron job causes trouble because each time it runs, another copy of get_tweets.php is started, even though the first one is still running, and each new copy opens an additional connection to the streaming API. Twitter hates that, and breaks the second connection. If your cron job keeps starting get_tweets.php over and over again, you can end up getting suspended, because Twitter thinks you are abusing its servers, which you are. </p>
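<p>If you want cron in the picture at all, say, to restart the collector after a reboot, one common pattern (my suggestion, not part of the framework) is a non-blocking lock file, so a cron-launched copy exits immediately when another copy is already running: </p>
<pre>// Hypothetical sketch: allow only one running copy of get_tweets.php
$lock = fopen('/tmp/get_tweets.lock', 'c');
if (!flock($lock, LOCK_EX | LOCK_NB)) {
    exit;  // another copy already holds the streaming API connection
}
// ... normal get_tweets.php startup continues here ...</pre>
<p>With a guard like that, a cron entry becomes a cheap watchdog: each run either finds the lock held and exits, or restarts a collector that has died. </p>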
<p><strong>Rule #2: Read the <a href="http://140dev.com/free-twitter-api-source-code-library/twitter-database-server/install/">install instructions</a> and follow them line by line</strong><br />
I know you don&#8217;t want to do this. I wouldn&#8217;t want to do this either. But the reason I wrote this code and then documented it so thoroughly is that streaming API coding just isn&#8217;t like other PHP tasks. </p>
<p><strong>Rule #3: Do not run get_tweets.php as a cronjob</strong><br />
Yeah, I&#8217;m being obnoxious here. My kids point this out all the time. But after explaining this issue countless times over the years, I know it is hard to absorb. I&#8217;m not being paid to do this; I just want to help people get started with the streaming API, which is a totally amazing service that is completely free. Since this is where people get caught most often, repeating the rule is the best way I can think of to drive it home. </p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/three-most-important-rules-for-running-the-streaming-api-framework/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
