Streaming API enhancements, part 3: Collecting tweets based on table of keywords

by Adam Green on February 5, 2014

in 140dev Source Code, Streaming API

Yesterday I introduced the database changes needed to control tweet collection with a table of keywords. Now we can modify get_tweets.php to collect tweets based on the contents of the collection_words table. To avoid confusion with the production version of get_tweets.php, I’m calling this new version get_tweets_keyword.php. This will let you place it in the same directory as the rest of the streaming API framework code.

Remember, you are never allowed to make more than one connection to the streaming API with the same Twitter account. If you want to test this new code, you MUST kill any copy of get_tweets.php that may already be running. If you need to keep your existing tweet collection code running, you should create a new copy of the framework code in a different directory, and configure it with the OAuth tokens from an app owned by a different Twitter account. Twitter permits you to use multiple accounts this way for testing and development purposes.
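If you want a guard against launching two collectors by accident, one option is a lock file. The following is only a minimal sketch, not part of the framework: the acquire_collector_lock() function and the lock file location are names I made up for illustration. It only protects against two copies on the same machine that share the same lock file. It knows nothing about connections made from other servers.

```php
<?php
// Hypothetical helper (not part of the 140dev framework): take an exclusive,
// non-blocking lock on a file so a second collector cannot start by accident.
// Returns the open file handle on success, or false if the lock is unavailable.
function acquire_collector_lock(string $lock_path)
{
    // Mode 'c' creates the file if it is missing and never truncates it
    $handle = fopen($lock_path, 'c');
    if ($handle === false) {
        return false;
    }
    // LOCK_NB makes flock() return immediately instead of waiting for the lock
    if (!flock($handle, LOCK_EX | LOCK_NB)) {
        fclose($handle);
        return false;
    }
    // The caller must keep this handle open for the life of the script
    return $handle;
}

// Assumed lock file location; any path writable by the script would do
$lock_path = sys_get_temp_dir() . '/get_tweets.lock';
$lock = acquire_collector_lock($lock_path);
if ($lock === false) {
    print "ERROR: another collector appears to be running\n";
    exit(1);
}
```

If both get_tweets.php and get_tweets_keyword.php started with this check and used the same lock file, the second one launched would exit instead of opening a second connection.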

Let’s start with a complete copy of the new collection code, and then I’ll break apart each change to explain the goals of the new code. The changes are the checkFilterPredicates() and get_keywords() functions in the Consumer class, and the setTrack() call near the end of the script.

get_tweets_keyword.php

<?php
require_once('140dev_config.php');

require_once('../libraries/phirehose/Phirehose.php');
require_once('../libraries/phirehose/OauthPhirehose.php');
class Consumer extends OauthPhirehose
{
  // A database connection is established at launch and kept open permanently
  public $oDB;
  public function db_connect() {
    require_once('db_lib.php');
    $this->oDB = new db;
  }
	
  // This function is called automatically by the Phirehose class
  // when a new tweet is received with the JSON data in $status
  public function enqueueStatus($status) {
    $tweet_object = json_decode($status);
		
    // Ignore tweets without a properly formed tweet id value
    if (!(isset($tweet_object->id_str))) { return;}
		
    $tweet_id = $tweet_object->id_str;

    // If there's a ", ', :, or ; in object elements, the serialize() output
    // can get corrupted when stored, so base64_encode() it before saving
    $raw_tweet = base64_encode(serialize($tweet_object));
		
    $field_values = 'raw_tweet = "' . $raw_tweet . '", ' .
      'tweet_id = ' . $tweet_id;
    $this->oDB->insert('json_cache',$field_values);
  }
	
  // This function is called automatically by the Phirehose class
  // every 5 seconds. It can be used to reset the collection array	
  public function checkFilterPredicates() {
    $this->setTrack($this->get_keywords());
  }
		
  // Build an array of keywords for tweet collection
  public function get_keywords() {
    $query = "SELECT words
  	FROM collection_words
        WHERE words <> ''";
    $result = $this->oDB->select($query);
		
    if (mysqli_num_rows($result)==0) {
      // Exit if no collection words found	
      print "ERROR: No keywords found in collection_words table";
      exit;
    } else if (mysqli_num_rows($result)>400) {
      // Exit if keyword count exceeds API limit of 400
      print "ERROR: More than 400 keywords in collection_words table";
      exit;
    }
		
    // Create a keyword list
    $keyword_array = array();
    while ($row=mysqli_fetch_assoc($result)) {
       array_push($keyword_array, $row['words']);
    }		
    return $keyword_array;
  }
}

// Open a persistent connection to the Twitter streaming API
$stream = new Consumer(OAUTH_TOKEN, OAUTH_SECRET, Phirehose::METHOD_FILTER);

// Establish a MySQL database connection
$stream->db_connect();

// The keywords for tweet collection are 
// set by reading them from the collection_words table
$stream->setTrack($stream->get_keywords());

// Start collecting tweets
// Automatically call enqueueStatus($status) with each tweet's JSON data
$stream->consume();

?>

The core of the new code is the get_keywords() function. It runs when the script starts and again every 5 seconds. It collects all the keywords from the collection_words table, and creates an array for use with the Phirehose setTrack() function. It has two tests to prevent a streaming API error: it exits the script if no keywords are found in the table, or if there are more than 400 keywords. Either of these conditions would cause a failed connection.

  // Build an array of keywords for tweet collection
  public function get_keywords() {
    // Skip rows where the keyword was left blank
    $query = "SELECT words
  	FROM collection_words
        WHERE words <> ''";
    $result = $this->oDB->select($query);
		
    if (mysqli_num_rows($result)==0) {
      // Exit if no collection words found	
      print "ERROR: No keywords found in collection_words table";
      exit;
    } else if (mysqli_num_rows($result)>400) {
      // Exit if keyword count exceeds API limit of 400
      print "ERROR: More than 400 keywords in collection_words table";
      exit;
    }
		
    // Create a keyword list, one array element per table row
    $keyword_array = array();
    while ($row=mysqli_fetch_assoc($result)) {
       array_push($keyword_array, $row['words']);
    }		
    return $keyword_array;
  }
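The same checks can be tried out without a database or a streaming connection. This is a rough sketch rather than framework code: clean_keywords() is a hypothetical helper that takes a plain array, trims each keyword, drops blanks and exact duplicates, and applies the same empty and 400-keyword tests. It throws an exception instead of calling exit, which is my own choice to make it easier to test.

```php
<?php
// Hypothetical helper, not part of the 140dev framework or Phirehose.
// Trims each keyword, drops blanks and exact duplicates, and enforces
// the 400-keyword limit described in the post.
function clean_keywords(array $words, int $limit = 400): array
{
    $cleaned = array();
    foreach ($words as $word) {
        $word = trim((string) $word);
        if ($word !== '') {
            $cleaned[] = $word;
        }
    }
    // array_unique() keeps the first occurrence (case-sensitive comparison);
    // array_values() reindexes the result from 0
    $cleaned = array_values(array_unique($cleaned));

    if (count($cleaned) === 0) {
        throw new RuntimeException('No keywords found');
    }
    if (count($cleaned) > $limit) {
        throw new RuntimeException("More than $limit keywords");
    }
    return $cleaned;
}

$keywords = clean_keywords(array(' php ', '', 'twitter api', 'php'));
print implode(',', $keywords) . "\n";   // prints "php,twitter api"
```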

With the get_keywords() function now available, we need to call it from two parts of this script. First, when the script starts, its result is passed to setTrack() to set the initial collection array.

// The keywords for tweet collection are 
// set by reading them from the collection_words table
$stream->setTrack($stream->get_keywords());

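We also want the collection list to pick up changes made to the collection_words table while the script is running. This is done with a Phirehose function called checkFilterPredicates(), which is called every 5 seconds. Phirehose itself limits how often it reconnects with a changed keyword list (its filterUpdMin setting, 120 seconds by default), so a change may take up to a couple of minutes to reach the stream.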

  // This function is called automatically by the Phirehose class
  // every 5 seconds. It can be used to reset the collection array	
  public function checkFilterPredicates() {
    $this->setTrack($this->get_keywords());
  }
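Because get_keywords() exits the script when the table is empty or over the limit, a bad edit to the table will stop collection at the next 5 second check. A possible alternative, shown here only as a sketch, is to keep tracking the previous list when a refresh is unusable. The choose_track_list() function is a hypothetical name, not something in Phirehose or the framework.

```php
<?php
// Hypothetical helper: decide which keyword list to track after a refresh.
// $current is the list already in use; $fresh is what was just read
// from the collection_words table.
function choose_track_list(array $current, array $fresh, int $limit = 400): array
{
    $count = count($fresh);
    if ($count === 0 || $count > $limit) {
        // Unusable refresh: log it and keep collecting with the old list
        error_log("WARNING: ignoring keyword refresh with $count keywords");
        return $current;
    }
    return $fresh;
}

$current = array('php', 'mysql');
// An empty refresh is ignored, so the old list stays in place
$current = choose_track_list($current, array());
// A valid refresh replaces the old list
$current = choose_track_list($current, array('php', 'oauth'));
print implode(',', $current) . "\n";   // prints "php,oauth"
```

With something like this, checkFilterPredicates() could pass the chosen list to setTrack() and leave the exit behavior for the initial launch only.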

Once you have the code installed as described above, and you’ve made sure there is no running version of get_tweets.php with the same OAuth tokens, all you have to do to test this is fill in some keywords in the collection_words table. You can ignore the boolean field called phrase in this table. It will be used by the new version of parse_tweets.php found in my next blog post on this subject. To test the get_tweets_keyword.php script, you can run it directly from the command line as a foreground process. This will let you see any error messages as you debug the new code. This would be done with the command:

php get_tweets_keyword.php

As always, please share any questions or comments with me. I want to make sure this works for everyone before building it into the release version of the framework.

The new version of parse_tweets.php that processes this data is here.

mustafa February 6, 2014 at 11:58 am

Yes, this is what I need.
I wait new parse_tweets unpatiently
