I wanted to fit one more enhancement into this new version of the streaming API framework. The major limit on performance of a tweet collection system is the number of rows in each table. This is most important during the parsing phase, when lots of insertions are done. MySQL slows down dramatically when it has to insert rows into a large table, especially if the tables has multiple indices. If you only need recent tweets for your application, you should set up a regular purge routine to clean out all data over a specified age.
The new version of the code will include a new purge_tables.php script to keep the data to a manageable size. The maximum age of tweets and related data is set in 140dev_config.php with a new PURGE_INTERVAL constant.
140dev_config.php
<?php // OAuth settings for connecting to the Twitter streaming API // Fill in the values for a valid Twitter app define('TWITTER_CONSUMER_KEY','*****'); define('TWITTER_CONSUMER_SECRET','*****'); define('OAUTH_TOKEN','*****'); define('OAUTH_SECRET','*****'); // Settings for monitor_tweets.php // Set the number of minutes before a restart is triggered define('TWEET_ERROR_INTERVAL',10); // Fill in the email address for error messages define('TWEET_ERROR_ADDRESS','*****'); // Settings for purge_tables.php // Set the number of days before tweets and related data are deleted // A setting of 0 will leave all data permanently in the database define ('PURGE_INTERVAL',0); ?>
A PURGE_INTERVAL of 0 will prevent any deletions when purge_tables.php is run. A setting of 7 will delete any data more than 7 days old.
purge_tables.php
Make sure to test this script on a backup copy of your data.
<?php require_once('140dev_config.php'); require_once('db_lib.php'); $oDB = new db; if (PURGE_INTERVAL == 0) { print "PURGE_INTERVAL is set to 0 days in 140dev_config.php"; exit; } // Delete old tweets $query = 'DELETE FROM tweets WHERE created_at < now() - interval ' . PURGE_INTERVAL . ' day'; $oDB->select($query); // Delete all related data that no longer has a matching tweet $query = 'DELETE FROM tweet_mentions WHERE NOT EXISTS ( SELECT 1 FROM tweets WHERE tweets.tweet_id = tweet_mentions.tweet_id)'; $oDB->select($query); $query = 'DELETE FROM tweet_tags WHERE NOT EXISTS ( SELECT 1 FROM tweets WHERE tweets.tweet_id = tweet_tags.tweet_id)'; $oDB->select($query); $query = 'DELETE FROM tweet_urls WHERE NOT EXISTS ( SELECT 1 FROM tweets WHERE tweets.tweet_id = tweet_urls.tweet_id)'; $oDB->select($query); $query = 'DELETE FROM tweet_words WHERE NOT EXISTS ( SELECT 1 FROM tweets WHERE tweets.tweet_id = tweet_words.tweet_id)'; $oDB->select($query); $query = 'DELETE FROM users WHERE NOT EXISTS (SELECT 1 FROM tweets WHERE tweets.user_id = users.user_id)'; $oDB->select($query); ?>
If you want to use this script, you can set up a cron job that runs every day. It can put a heavy load on the server, if the table is large, so I’d schedule it for late at night.
Do I have to warn everyone again to test this on a backup copy of your tweet database the first time you run it? Maybe I do.
Make sure to test this script on a backup copy of your data.
{ 1 comment… read it below or add one }
Purging is good idea but it can be also according to amount. For example purge tweets more than one milion.