Adam Green
Twitter API Consultant
adam@140dev.com
781-879-2960
@140dev

Streaming API enhancements, part 5: Purging old tweets and related data

by Adam Green on February 8, 2014

in 140dev Source Code,Streaming API

I wanted to fit one more enhancement into this new version of the streaming API framework. The major limit on performance of a tweet collection system is the number of rows in each table. This is most important during the parsing phase, when lots of insertions are done. MySQL slows down dramatically when it has to insert rows into a large table, especially if the tables has multiple indices. If you only need recent tweets for your application, you should set up a regular purge routine to clean out all data over a specified age.

The new version of the code will include a new purge_tables.php script to keep the data to a manageable size. The maximum age of tweets and related data is set in 140dev_config.php with a new PURGE_INTERVAL constant.

140dev_config.php

<?php
// OAuth settings for connecting to the Twitter streaming API
// Fill in the values for a valid Twitter app
define('TWITTER_CONSUMER_KEY','*****');
define('TWITTER_CONSUMER_SECRET','*****');
define('OAUTH_TOKEN','*****');
define('OAUTH_SECRET','*****');

// Settings for monitor_tweets.php
// Set the number of minutes before a restart is triggered
define('TWEET_ERROR_INTERVAL',10);
// Fill in the email address for error messages
define('TWEET_ERROR_ADDRESS','*****');

// Settings for purge_tables.php
// Set the number of days before tweets and related data are deleted
// A setting of 0 will leave all data permanently in the database
define ('PURGE_INTERVAL',0);
?>

A PURGE_INTERVAL of 0 will prevent any deletions when purge_tables.php is run. A setting of 7 will delete any data more than 7 days old.

purge_tables.php
Make sure to test this script on a backup copy of your data.

<?php 

require_once('140dev_config.php');
require_once('db_lib.php');
$oDB = new db;

if (PURGE_INTERVAL == 0) {
  print "PURGE_INTERVAL is set to 0 days in 140dev_config.php";
  exit;
}

// Delete old tweets
$query = 'DELETE FROM tweets 
	WHERE created_at < now() - interval ' . PURGE_INTERVAL . ' day';
$oDB->select($query);

// Delete all related data that no longer has a matching tweet
$query = 'DELETE FROM tweet_mentions 
	WHERE NOT EXISTS (
		SELECT 1 
		FROM tweets
		WHERE tweets.tweet_id = tweet_mentions.tweet_id)';
$oDB->select($query);

$query = 'DELETE FROM tweet_tags 
	WHERE NOT EXISTS (
		SELECT 1 FROM tweets
	    WHERE tweets.tweet_id = tweet_tags.tweet_id)';
$oDB->select($query);

$query = 'DELETE FROM tweet_urls 
	WHERE NOT EXISTS (
		SELECT 1 FROM tweets
	    WHERE tweets.tweet_id = tweet_urls.tweet_id)';
$oDB->select($query);

$query = 'DELETE FROM tweet_words
	WHERE NOT EXISTS (	
		SELECT 1 
		FROM tweets
		WHERE tweets.tweet_id = tweet_words.tweet_id)';
$oDB->select($query);

$query = 'DELETE FROM users
		WHERE NOT EXISTS 
        (SELECT 1 FROM tweets
        WHERE tweets.user_id = users.user_id)';
$oDB->select($query);

?>

If you want to use this script, you can set up a cron job that runs every day. It can put a heavy load on the server, if the table is large, so I’d schedule it for late at night.

Do I have to warn everyone again to test this on a backup copy of your tweet database the first time you run it? Maybe I do.

Make sure to test this script on a backup copy of your data.

{ 1 comment… read it below or add one }

mustafa February 11, 2014 at 9:40 am

Purging is good idea but it can be also according to amount. For example purge tweets more than one milion.

Reply

Leave a Comment

Previous post:

Next post: