<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>140dev &#187; Twitter Database Programming</title>
	<atom:link href="http://140dev.com/twitter-api-programming-blog/category/twitter-database-programming/feed/" rel="self" type="application/rss+xml" />
	<link>http://140dev.com</link>
	<description>Twitter API Programming Tips, Tutorials, Source Code Libraries and Consulting</description>
	<lastBuildDate>Wed, 31 Jul 2019 10:03:15 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.6</generator>
		<item>
		<title>Go with the flow when creating a tweet collection database</title>
		<link>http://140dev.com/twitter-api-programming-blog/go-with-the-flow-when-creating-a-tweet-collection-database/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/go-with-the-flow-when-creating-a-tweet-collection-database/#comments</comments>
		<pubDate>Thu, 07 Jun 2012 13:15:31 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[Twitter consultant]]></category>
		<category><![CDATA[Twitter Database Programming]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=1530</guid>
		<description><![CDATA[When new Twitter consulting clients ask me to plan a tweet collection database, the first question they always ask is how much it will cost. I can give them a rough estimate for the cost of my programming time based on their desired features, but it is impossible to know how much server power they [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>When new Twitter consulting clients ask me to plan a tweet collection database, the first question they always ask is how much it will cost. I can give them a rough estimate for the cost of my programming time based on their desired features, but it is impossible to know how much server power they will have to pay for without testing first. </p>
<p>Calling the REST API or the Search API is predictable, because there is a one-to-one correspondence between what you ask for and what you receive. The Streaming API, on the other hand, is completely unpredictable. The only thing you can be sure of is that the maximum you will receive is 1% of the total tweet flow, or 3.5 million tweets a day. Exactly how many you will receive from the Streaming API up to that limit depends on the keywords and accounts you choose to follow. </p>
<p>The average Twitter account in our various tweet databases has sent about 6 tweets a day since it was created, but each account is allowed to send up to 1,000 tweets a day, and the Streaming API also delivers retweets. @JustinBieber, for example, can get 10,000 to 20,000 retweets for a single tweet, and @BarackObama has gotten as many as 40,000 retweets. So if you follow the maximum of 5,000 accounts with the Streaming API, the flow could range from an average of 30,000 tweets a day up to the streaming limit of 3.5 million tweets. </p>
<p>The truly variable flow comes when tracking keywords with the Streaming API. You can get tweets for up to 400 keywords or phrases, but there is no reliable way to predict the amount. You have to collect tweets for a week or two and see what you get. One way to speed up this evaluation process is to use the Search API to see what the daily average has been for the last few days. The Search API only handles about 10 keywords at a time, so you will have to break up your queries into pieces of that size. </p>
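<p>That batching step is simple to code. The sketch below is a hypothetical helper, not part of the 140dev library: it splits a keyword list into groups of 10 and joins each group with the OR operator to produce one query string per group.</p>

```php
<?php
// build_search_queries.php
// Hypothetical helper: split a keyword list into groups small
// enough for a single search query, joining each group with the
// OR operator. The group size of 10 matches the practical limit
// described above.
function build_search_queries(array $keywords, $group_size = 10) {
  $queries = array();
  foreach (array_chunk($keywords, $group_size) as $group) {
    $queries[] = implode(' OR ', $group);
  }
  return $queries;
}

// 23 keywords become 3 queries: 10 + 10 + 3 terms
$keywords = array();
for ($i = 1; $i <= 23; $i++) {
  $keywords[] = 'keyword' . $i;
}
$queries = build_search_queries($keywords);
print count($queries) . "\n";
?>
```

<p>Run each resulting query once a day for a few days and total the result counts, and you get a rough daily average without waiting weeks for streaming data.</p>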
<p>Even when you have some data on the normal flow for keywords, you have to be prepared for bursts. I&#8217;ve <a href="http://140dev.com/twitter-api-programming-blog/dealing-with-tweet-bursts/">written about bursts</a> before. There are lots of techniques for handling them, ranging from getting the biggest server you can afford to dropping any tweets that exceed a predetermined hourly limit. </p>
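<p>At the simple end of that range, a predetermined hourly limit can be enforced with nothing more than a counter that resets each hour. This is a minimal hypothetical sketch, not code from the 140dev framework; burst_limiter and should_store_tweet() are invented names.</p>

```php
<?php
// Hypothetical burst limiter: store tweets until a predetermined
// hourly limit is reached, then drop the rest until the next hour.
class burst_limiter {
  private $hourly_limit;
  private $count = 0;
  private $hour_started = 0;

  function __construct($hourly_limit) {
    $this->hourly_limit = $hourly_limit;
  }

  // Return true if this tweet should be stored, false to drop it
  public function should_store_tweet($now) {
    if ($now - $this->hour_started >= 3600) {
      // A new hour has started, so reset the counter
      $this->hour_started = $now;
      $this->count = 0;
    }
    $this->count++;
    return $this->count <= $this->hourly_limit;
  }
}

// Toy example: allow at most 2 stored tweets per hour
$limiter = new burst_limiter(2);
var_dump($limiter->should_store_tweet(time()));
?>
```

<p>A real collection server would call this once per incoming tweet, passing time(), and could log how many tweets were dropped each hour to feed back into the capacity planning described above.</p>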
<p>So how do I synthesize all these ideas to tell a client what they will need to spend on servers for their Twitter application? My general approach is to use a cloud service, like Rackspace, and start with the smallest server instance possible. Then I build a first version of the tweet collection code and start collecting stats on the flow from each user and keyword. Once I have a good handle on the average, I upsize the server to an amount of memory, disk, and CPU that I know will handle that average. Then I add an initial set of burst control techniques until I get a better idea of the long-term variability. If the flow is high enough to require more than 4GB of RAM, I find that a dedicated server is more cost effective, but starting with the cloud server is a good way to ramp up slowly. </p>
<p>The important takeaway for Twitter consultants is that you cannot know what you will need to handle a tweet collection project until you do real-world testing. </p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/go-with-the-flow-when-creating-a-tweet-collection-database/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Twitter Consultant Tip: Tweet data is priceless</title>
		<link>http://140dev.com/twitter-api-programming-blog/twitter-consultant-tip-tweet-data-is-priceless/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/twitter-consultant-tip-tweet-data-is-priceless/#comments</comments>
		<pubDate>Thu, 31 May 2012 20:12:52 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[Data Mining Tweets]]></category>
		<category><![CDATA[Database Cache]]></category>
		<category><![CDATA[Twitter consultant]]></category>
		<category><![CDATA[Twitter Database Programming]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=1489</guid>
		<description><![CDATA[Most of the Twitter consulting I do involves some form of tweet collection and storage in a database. Even when clients approach me with this in mind, they hardly ever realize just how valuable tweet data can be. In fact, it is priceless in the truest sense of the word, because there is no way [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Most of the Twitter consulting I do involves some form of tweet collection and storage in a database. Even when clients approach me with this in mind, they hardly ever realize just how valuable tweet data can be. In fact, it is priceless in the truest sense of the word, because there is no way to buy tweets after they are sent. You either capture them in real-time, or they are gone forever. Anyone who wants to work as a Twitter consultant needs to be able to explain that value-added message to potential clients. Here are the key selling points to keep in mind. </p>
<p>The Twitter search API only goes back in time 5 to 6 days, and will only return up to 1,500 tweets for any query. If you want old tweets from the API, that is an absolute limit. The streaming API is much more responsive, and will return up to 1% of the total stream, meaning that you can get up to 3 million tweets a day on any query, but these tweets are returned in real-time, not after the fact. So if you want to get all the tweets for a query, you must set up the streaming API connection <em>before you need the results</em>.  Then you must store them in a database for later retrieval. </p>
<p>The <a href="https://twitter.com/tos">Twitter terms of service</a> (TOS) allow you to store tweets for use on your own server, either for display or analysis, but there are strict limitations on reselling this data. You can sell it in discrete data sets as a file, such as a PDF or Excel file, but you cannot resell it as an API or real-time service. This means that if someone has already collected tweets that you need, you are forbidden from buying them as a continuous stream for display on your site. If you haven&#8217;t collected them yourself, you can&#8217;t have a real-time display of tweets on your site, even if you are willing to pay for them. </p>
<p>But what about Twitter&#8217;s data partners, Gnip and Datasift? These services don&#8217;t publicize the limitation on their sites, but they are also forbidden by Twitter&#8217;s license from selling tweets for display on other sites. The tweets you buy from them may only be used for analysis, such as in a product like Radian 6. </p>
<p>All of this means that once a client has built up a long-term database of tweets, they have a priceless resource. There is no price at which these tweets can be bought and sold for continuous display. That makes a tweet database an incredibly valuable resource, and it means that you have to start collecting tweets and saving them in advance. There is no going back for them. </p>
<p>Once clients understand this, they suddenly become very acquisitive. They can collect all the tweets about politicians, celebrities, athletes, TV shows, etc., and have an iron-clad barrier to entry against any competitor coming along later. That is a valuable selling tool for any Twitter consultant who can do this type of database programming. My free, <a href="http://140dev.com/free-twitter-api-source-code-library/">open source library</a> is a good starting point for this type of coding. </p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/twitter-consultant-tip-tweet-data-is-priceless/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Simple PHP/MySQL database library source code: db_lib.php</title>
		<link>http://140dev.com/twitter-api-programming-blog/simple-php-mysql-database-library-source-code/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/simple-php-mysql-database-library-source-code/#comments</comments>
		<pubDate>Tue, 15 May 2012 14:28:07 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[Database Cache]]></category>
		<category><![CDATA[Twitter Database Programming]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=1378</guid>
		<description><![CDATA[There seems to be a good amount of interest in the new set of tutorials I&#8217;ve started writing, and most of the code I produce interacts with a MySQL database, so I&#8217;m going to post the code for my standard database library here. This makes it easy for me to link to this post multiple [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>There seems to be a good amount of interest in the new set of tutorials I&#8217;ve started writing, and most of the code I produce interacts with a MySQL database, so I&#8217;m going to post the code for my standard database library here. This makes it easy for me to link to this post multiple times, rather than include the source of this library in multiple posts. This is a simplified version of the library included in the <a href="http://140dev.com/free-twitter-api-source-code-library/twitter-database-server/db-lib-php/">140dev Framework</a>. </p>
<p>The login info for this library is kept in a separate script called db_config.php. For the sample code shown on this blog, this configuration file is kept in the same directory as the db_lib.php script. Security-minded programmers will probably want to keep it in a different location on their server, preferably outside the web-accessible directories. </p>
<p><strong>db_config.php</strong><br />
<pre class="code">&lt;?php
// db_config.php
$db_host = 'localhost';
$db_user = 'ENTER USER NAME HERE';
$db_password = 'ENTER USER PASSWORD HERE';
$db_name = 'ENTER DATABASE NAME HERE';
?&gt;</pre></p>
<p>The actual library code is written as a PHP class. This allows the code to open a MySQL connection once, and then keep it open for the entire time the script using the library is running. The library contains simple functions for preparing data for insertion, running any SQL query, checking to see if a value already exists in a table, and table insertion and update functions. </p>
<p><strong>db_lib.php</strong><br />
<pre class="code">&lt;?php
// db_lib.php

class db
{
  public $dbh;

  // Create a database connection for use by all functions in this class
  function __construct() {
    require_once('db_config.php');

    $this-&gt;dbh = mysqli_connect($db_host, $db_user, $db_password, $db_name);
    if (!$this-&gt;dbh) {
      exit('Unable to connect to DB');
    }

    // Set every possible option to utf-8
    mysqli_query($this-&gt;dbh, 'SET NAMES &quot;utf8&quot;');
    mysqli_query($this-&gt;dbh, 'SET CHARACTER SET &quot;utf8&quot;');
    mysqli_query($this-&gt;dbh, 'SET character_set_results = &quot;utf8&quot;,' .
      'character_set_client = &quot;utf8&quot;, character_set_connection = &quot;utf8&quot;,' .
      'character_set_database = &quot;utf8&quot;, character_set_server = &quot;utf8&quot;');
  }

  // Create a standard data format for insertion of PHP dates into MySQL
  public function date($php_date) {
    return date('Y-m-d H:i:s', strtotime($php_date));
  }

  // All text added to the DB should be cleaned with mysqli_real_escape_string
  // to block attempted SQL injection exploits
  public function escape($str) {
    return mysqli_real_escape_string($this-&gt;dbh, $str);
  }

  // Test to see if a specific field value is already in the DB
  // Return false if no, true if yes
  public function in_table($table, $where) {
    $query = 'SELECT * FROM ' . $table . ' WHERE ' . $where;
    $result = mysqli_query($this-&gt;dbh, $query);
    return mysqli_num_rows($result) &gt; 0;
  }

  // Perform a generic select and return a pointer to the result
  public function select($query) {
    return mysqli_query($this-&gt;dbh, $query);
  }

  // Add a row to any table
  public function insert($table, $field_values) {
    $query = 'INSERT INTO ' . $table . ' SET ' . $field_values;
    mysqli_query($this-&gt;dbh, $query);
  }

  // Update any row that matches a WHERE clause
  public function update($table, $field_values, $where) {
    $query = 'UPDATE ' . $table . ' SET ' . $field_values .
      ' WHERE ' . $where;
    mysqli_query($this-&gt;dbh, $query);
  }
}
?&gt;</pre></p>
<p>There will be practical examples of using this library throughout the tutorials coming up in the blog. Just to show the simplest example possible, here is a script that makes a database connection and then runs a &#8220;SHOW TABLES&#8221; MySQL query. </p>
<p><strong>db_lib_demo.php</strong><br />
<pre class="code">&lt;?php
// db_lib_demo.php

// Create a database connection
require_once('db_lib.php');
$oDB = new db;

// Run a MySQL query
$query = &quot;SHOW TABLES&quot;;
$result = $oDB-&gt;select($query);

// Retrieve the first row of results as an array
$row = mysqli_fetch_assoc($result);
print_r($row);
?&gt;</pre></p>
<p>This script can be run from the command line of an SSH or Telnet session, or however you normally connect to your server.</p>
<p><code># <b>php db_lib_demo.php</b></code><br />
<pre class="code">Array
(
    [Tables_in_140dev_tutorials] =&gt; rss_feed
)</pre></p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/simple-php-mysql-database-library-source-code/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
