This week we got crushed by the State of the Union speech. We normally get about 30,000 to 50,000 tweets per day in the 2012twit.com database, and our largest server can handle that without showing any appreciable load. During the SOTU, tweet volume exploded: we got 500,000 tweets in about 4 hours. I was able to keep the server going by shutting down some processes that weren’t needed, but it was a challenge. This problem of tweet bursts seems to be getting worse. In the case of Twitter and politics, people are getting used to talking back to the TV through Twitter. With 9 months left until the election, I needed to find some solutions.
I spent a lot of time over the last 2 days tracking down the problem, and discovered that it wasn’t parsing the tweets that was killing us, but inserting the raw tweet data into the json_cache table. I use a two-phase processing system: the raw tweet delivered by the streaming API gets inserted as fast as possible into a cache table, and then a separate parsing phase breaks it out into a normalized schema. You can get the basic code for this as open source.
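To make the two-phase idea concrete, here is a minimal sketch in Python. The real system runs on MySQL with its own schema, so the table and column names below (json_cache, tweets, raw_tweet, and so on) are assumptions for illustration only, and SQLite stands in for the database:

```python
import json
import sqlite3

# Illustrative schema only; the actual framework uses MySQL and
# its own table definitions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE json_cache (
    cache_id INTEGER PRIMARY KEY AUTOINCREMENT,
    raw_tweet TEXT NOT NULL
);
CREATE TABLE tweets (
    tweet_id INTEGER PRIMARY KEY,
    screen_name TEXT,
    tweet_text TEXT
);
""")

def cache_raw_tweet(raw):
    """Phase 1: store the raw streaming payload as fast as possible,
    with no parsing work on the insert path."""
    conn.execute("INSERT INTO json_cache (raw_tweet) VALUES (?)", (raw,))
    conn.commit()

def parse_cached_tweets():
    """Phase 2: a separate pass breaks each cached payload out
    into a normalized table."""
    rows = conn.execute(
        "SELECT cache_id, raw_tweet FROM json_cache").fetchall()
    for cache_id, raw in rows:
        tweet = json.loads(raw)
        conn.execute(
            "INSERT OR IGNORE INTO tweets "
            "(tweet_id, screen_name, tweet_text) VALUES (?, ?, ?)",
            (tweet["id"], tweet["user"]["screen_name"], tweet["text"]),
        )
    conn.commit()

# Simulated payload arriving from the streaming API.
cache_raw_tweet(json.dumps(
    {"id": 1, "user": {"screen_name": "example"}, "text": "hello"}))
parse_cached_tweets()
```

The point of the split is that phase 1 stays cheap during a burst, while phase 2 can lag behind and catch up later without dropping tweets.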
It looks like Twitter has been steadily increasing the size of the basic payload that it sends for each tweet in the streaming API. That makes sense, since people are demanding more data. Yesterday they announced some insane scheme where every tweet will include data about countries that don’t want tweets with specific words to be displayed. This will only get worse.
I realized that I have never actually needed to go back and reparse the contents of json_cache, and I had long ago added purging code to my 2012twit system to delete anything in that table older than 7 days. I tried clearing out the json_cache table on my server and modifying the code to delete each tweet as soon as it was parsed. That cut the table from an average of several hundred thousand rows to about 50. The load on that server dropped right away, and during the GOP debate last night it stayed very low.
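The change amounts to adding a delete inside the parsing loop instead of relying on a 7-day purge. A hedged sketch of that modification, using the same illustrative SQLite schema as above (table and column names are assumptions, not the framework's actual code):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE json_cache (
    cache_id INTEGER PRIMARY KEY AUTOINCREMENT,
    raw_tweet TEXT NOT NULL
);
CREATE TABLE tweets (
    tweet_id INTEGER PRIMARY KEY,
    tweet_text TEXT
);
""")

def parse_and_purge():
    """Parse each cached tweet, then delete its row immediately,
    so json_cache holds only the current backlog instead of
    accumulating up to 7 days of raw JSON."""
    rows = conn.execute(
        "SELECT cache_id, raw_tweet FROM json_cache").fetchall()
    for cache_id, raw in rows:
        tweet = json.loads(raw)
        conn.execute(
            "INSERT OR IGNORE INTO tweets (tweet_id, tweet_text) "
            "VALUES (?, ?)",
            (tweet["id"], tweet["text"]),
        )
        # Delete as soon as it's parsed, replacing the old
        # delete-anything-older-than-7-days purge.
        conn.execute("DELETE FROM json_cache WHERE cache_id = ?",
                     (cache_id,))
    conn.commit()

# Simulate a small backlog of raw payloads.
for i in range(3):
    conn.execute("INSERT INTO json_cache (raw_tweet) VALUES (?)",
                 (json.dumps({"id": i, "text": "tweet %d" % i}),))
parse_and_purge()
```

With this in place the cache table stays near its steady-state backlog (about 50 rows in my case) rather than growing into the hundreds of thousands.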