The streaming API is pretty amazing when you start seeing all that free data pouring into a database, but sadly the connection is not completely stable. I typically see a failure at least every couple of months. It’s hard to be sure what causes the failure. It could be the Phirehose library that talks to the API or the API itself. Either way, the result is a gap in data collection.
My streaming API framework handles these failures in a pretty simplistic way through the monitor_tweets.php script. This is run as a cron job, and it sends an error email when tweets haven’t been collected for a specified number of minutes.
A more robust solution to these API failures is to automatically restart the get_tweets.php and parse_tweets.php scripts. The method I use is based on the fact that every running Linux process is assigned a process id on startup. I have both get_tweets.php and parse_tweets.php write their process ids into text files when they start. If tweet collection fails, monitor_tweets.php will read these process ids, kill the running scripts, and then run fresh copies of them. I’ve found that when tweet collection fails, the parse_tweets.php script sometimes stalls, so I find it safer to restart both it and get_tweets.php.
Here are the code snippets you need to implement this model. I’m going to let people test this for a week or so, and then I’ll add them as an upgrade to the framework code.
Add this to the start of get_tweets.php. It can be the first executable code after the starting comments.
$fp = fopen('process_id_get_tweets.txt','w');
fwrite($fp,getmypid());
fclose($fp);
The same code should be added to the start of parse_tweets.php. Note that each text file needs its own unique name.
$fp = fopen('process_id_parse_tweets.txt','w');
fwrite($fp,getmypid());
fclose($fp);
With the process ids recorded, you can now write a script, restart.php, to restart both collection scripts. This can be run by monitor_tweets.php on a failure. I also find this useful as a quick way to reload get_tweets.php and parse_tweets.php when I make any changes to the code.
<?php
// Kill the running get_tweets.php using its recorded process id, then start a fresh copy
$process_id = file_get_contents('process_id_get_tweets.txt');
exec('kill -9 ' . $process_id);
exec('nohup php get_tweets.php > /dev/null &');

// Do the same for parse_tweets.php
$process_id = file_get_contents('process_id_parse_tweets.txt');
exec('kill -9 ' . $process_id);
exec('nohup php parse_tweets.php > /dev/null &');
?>
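If a process id file is missing or holds something unexpected, the kill command above would be run with a bad argument. Here is a minimal sketch of a guard. The helper names read_pid() and kill_command() are hypothetical and are not part of the framework code.

```php
<?php
// Hypothetical helper: return the process id stored in $file, or null
// if the file is unreadable or does not hold a plain positive integer.
function read_pid(string $file): ?int {
    if (!is_readable($file)) {
        return null;
    }
    $raw = trim((string) file_get_contents($file));
    // Accept only 1 to 10 digits with no leading zero
    if (!preg_match('/^[1-9][0-9]{0,9}$/', $raw)) {
        return null;
    }
    return (int) $raw;
}

// Hypothetical helper: build the kill command for a validated process id.
function kill_command(int $pid): string {
    return 'kill -9 ' . $pid;
}

// Example use inside restart.php:
// $pid = read_pid('process_id_get_tweets.txt');
// if ($pid !== null) {
//     exec(kill_command($pid));
// }
```

Because only a validated integer ever reaches exec(), a corrupted text file can at worst cause the kill step to be skipped.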
Finally you need to modify monitor_tweets.php to run restart.php after a failure.
The call to restart.php should be added inside the if() on line 29 that checks for new tweets.
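One way to make that call is a single exec(). This is only a sketch: it assumes restart.php sits in the same directory as monitor_tweets.php, and restart_command() is a hypothetical helper name. The cd step is there because restart.php reads the process id files by relative path, and a cron job may start in a different working directory.

```php
<?php
// Hypothetical helper: build a shell command that switches to the
// directory holding the scripts and then runs restart.php there.
function restart_command(string $dir): string {
    return 'cd ' . escapeshellarg($dir) . ' && php restart.php';
}

// Inside the if() in monitor_tweets.php that detects missing tweets:
// exec(restart_command(__DIR__));
```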
One thing to be careful of is not restarting your streaming API connection too frequently. The monitor_tweets.php script now triggers a restart if no new tweets have been collected in the time interval stored in the TWEET_ERROR_INTERVAL constant, which is set in 140dev_config.php. By default this is set to 10 minutes. If the API fails for a while, and the interval is too short, you will end up restarting over and over again. This can make Twitter’s suspension algorithm cranky.
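Another way to avoid restarting over and over is to record when the last restart happened and skip the next one if it comes too soon. Here is a minimal sketch. The restart_allowed() helper and the last_restart.txt file are hypothetical and are not part of the framework code.

```php
<?php
// Hypothetical helper: decide whether enough time has passed since the
// last restart. $lastRestart and $now are Unix timestamps, $cooldown is
// in seconds. A $lastRestart of null means no restart has been recorded.
function restart_allowed(?int $lastRestart, int $now, int $cooldown): bool {
    if ($lastRestart === null) {
        return true;
    }
    return ($now - $lastRestart) >= $cooldown;
}

// Example use at the top of restart.php (600 seconds is an arbitrary choice):
// $last = is_readable('last_restart.txt')
//     ? (int) file_get_contents('last_restart.txt')
//     : null;
// if (!restart_allowed($last, time(), 600)) {
//     exit;
// }
// file_put_contents('last_restart.txt', (string) time());
```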
I have found this restart model reliable across many sites for several years, but it has failed to do a restart a few times, which is why it is still a good idea to have monitor_tweets.php email you when it does a restart. You can make a quick check of your database to make sure tweets have resumed flowing.
If you try this enhancement out, please let me know of any problems. Thanks.