I’m collecting all the tweets for possible 2012 candidates with the Streaming API, and I wanted to make sure I was getting every one of their tweets. I built a backfilling script that goes through every tweet in each of these accounts and adds any that aren’t already in the database. It uses the /statuses/user_timeline call to get the past tweets. I ran into a problem with tons of 502 errors from the Twitter API, as many as one every two or three API calls.
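Here is a minimal sketch of the "add only what's missing" step, written in Python with SQLite purely for illustration. The table name, the column names, and the shape of the tweet dictionaries are all assumptions and not my actual schema. The idea is to make the tweet ID the primary key and let the database skip anything it already has.

```python
import sqlite3

def open_db(path=":memory:"):
    """Open a database with a hypothetical tweets table keyed on tweet ID."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "tweet_id INTEGER PRIMARY KEY, screen_name TEXT, text TEXT)"
    )
    return conn

def add_missing(conn, tweets):
    """Insert tweets that are not stored yet; return how many were new."""
    def stored():
        return conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]

    before = stored()
    # INSERT OR IGNORE skips rows whose primary key already exists.
    conn.executemany(
        "INSERT OR IGNORE INTO tweets (tweet_id, screen_name, text) "
        "VALUES (?, ?, ?)",
        [(t["id"], t["screen_name"], t["text"]) for t in tweets],
    )
    conn.commit()
    return stored() - before
```

The return value is the number of tweets that streaming missed, which is a handy figure to log after each backfill run.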
Taylor Singletary on the Dev mailing list suggested lowering the count parameter to avoid timeout errors, and this has helped a lot. I was using a count of 200 tweets per call to keep the number of calls low. Without errors, this would have given me all the data in about 100 calls, but with the errors I wasn’t able to complete the process before hitting the rate limit. I tried dropping the count to 100, and this allowed the script to finish with a total of 298 calls.
So now I have the catch-22 of needing to do more API calls to avoid the errors that cause too many API calls. The only solution I see is to cut the count parameter to a level that is low enough to avoid errors, and then spread the backfilling out over multiple hours to stay within the rate limit.
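The approach can be sketched roughly like this in Python. This is not my actual script: fetch_page is a hypothetical function standing in for a /statuses/user_timeline request that takes count and max_id, and TransientError is a made-up exception standing in for a 502 or timeout. The loop pages backwards through a timeline with a smaller count, retries failed calls with a growing delay, and pauses between calls.

```python
import time

class TransientError(Exception):
    """Hypothetical stand-in for a 502 or timeout from the API."""

def backfill(fetch_page, count=100, max_retries=3, pause=0.0, sleep=time.sleep):
    """Page backwards through a timeline, retrying transient failures.

    fetch_page(count, max_id) is an assumed callable that returns a list of
    tweet dicts, newest first, each with an integer "id".
    Returns (tweets, calls) where calls includes the failed attempts.
    """
    tweets, max_id, calls = [], None, 0
    while True:
        for attempt in range(max_retries + 1):
            calls += 1  # failed calls still count against the rate limit
            try:
                page = fetch_page(count, max_id)
                break
            except TransientError:
                if attempt == max_retries:
                    raise
                sleep(pause * (2 ** attempt))  # wait longer after each failure
        if not page:
            return tweets, calls
        tweets.extend(page)
        max_id = page[-1]["id"] - 1  # next page holds strictly older tweets
        sleep(pause)  # pace successful calls as well
```

Counting failed attempts in calls matters, because the errors are exactly what pushed the total from about 100 calls toward the rate limit.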
I think the ultimate solution is to do a steady level of backfilling spread over the entire day. I haven’t had to do backfilling in the past, because I was treating the Streaming API tweet collection as a high-volume sampling mechanism. As long as I got lots of tweets on a particular subject, it was good. Now that I want to maintain a database of every tweet made by the candidates, I have to backfill to make sure nothing was missed by streaming. This seems to be necessary, since every time I run the backfill I get two to three tweets that didn’t get sent by streaming.
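To get a feel for what "steady" means in practice, here is a small back-of-the-envelope sketch in Python. The function names are made up for illustration, and the only number taken from my own run is the 298 calls. It spreads a known number of calls evenly over a day and reports the resulting hourly rate, which can then be compared against whatever rate limit applies to the account.

```python
def call_interval(total_calls, window_seconds=24 * 60 * 60):
    """Seconds to wait between calls so they fill the window evenly."""
    if total_calls <= 0:
        raise ValueError("total_calls must be positive")
    return window_seconds / total_calls

def calls_per_hour(interval_seconds):
    """Hourly call rate implied by a fixed interval between calls."""
    return 3600 / interval_seconds

interval = call_interval(298)    # roughly 290 seconds between calls
hourly = calls_per_hour(interval)  # roughly 12.4 calls per hour
```

At that pace the backfill becomes a trickle of about a dozen calls an hour, which leaves most of the hourly budget free for retries after errors.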