<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>140dev &#187; Twitter consultant</title>
	<atom:link href="http://140dev.com/twitter-api-programming-blog/category/twitter-consultant/feed/" rel="self" type="application/rss+xml" />
	<link>http://140dev.com</link>
	<description>Twitter API Programming Tips, Tutorials, Source Code Libraries and Consulting</description>
	<lastBuildDate>Wed, 31 Jul 2019 10:03:15 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.6</generator>
		<item>
		<title>Go with the flow when creating a tweet collection database</title>
		<link>http://140dev.com/twitter-api-programming-blog/go-with-the-flow-when-creating-a-tweet-collection-database/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/go-with-the-flow-when-creating-a-tweet-collection-database/#comments</comments>
		<pubDate>Thu, 07 Jun 2012 13:15:31 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[Twitter consultant]]></category>
		<category><![CDATA[Twitter Database Programming]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=1530</guid>
		<description><![CDATA[When new Twitter consulting clients ask me to plan a tweet collection database, the first question they always ask is how much it will cost. I can give them a rough estimate for the cost of my programming time based on their desired features, but it is impossible to know how much server power they [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>When new Twitter consulting clients ask me to plan a tweet collection database, the first question they always ask is how much it will cost. I can give them a rough estimate for the cost of my programming time based on their desired features, but it is impossible to know how much server power they will have to pay for without testing first. </p>
<p>Calling the REST API or the Search API is predictable, because there is a one-to-one correspondence between what you ask for and what you receive. The Streaming API, on the other hand, is completely unpredictable. The only thing you can be sure of is that the maximum you will receive is 1% of the total tweet flow, or 3.5 million tweets a day. Exactly how many you will receive from the Streaming API up to that limit depends on the keywords and accounts you choose to follow. </p>
<p>The average Twitter account in our various tweet databases has sent about 6 tweets a day since it was created, but each account is allowed to send up to 1,000 tweets a day, and the Streaming API also delivers retweets. @JustinBieber, for example, can get 10,000 to 20,000 retweets for a single tweet, and @BarackObama has gotten as many as 40,000 retweets. So if you follow the maximum of 5,000 accounts with the Streaming API, the flow could range from an average of 30,000 tweets a day up to the streaming limit of 3.5 million tweets.  </p>
<p>The truly variable flow comes when tracking keywords with the Streaming API. You can get tweets for up to 400 keywords or phrases, but there is no reliable way to predict the volume. You have to collect tweets for a week or two and see what you get. One way to speed up this evaluation process is to use the search API to see what the daily average has been for the last few days. The search API only handles about 10 keywords at a time, so you will have to break up your queries into pieces of that size. </p>
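<p>The chunking itself is mechanical. Here is a minimal Python sketch (the function names are illustrative, not part of any library) of splitting a tracked keyword list into search-API-sized queries:</p>

```python
from urllib.parse import quote

def chunk_keywords(keywords, size=10):
    """Split the tracked keywords into groups small enough for one search query."""
    return [keywords[i:i + size] for i in range(0, len(keywords), size)]

def build_search_query(chunk):
    """OR the keywords in one chunk together and URL-encode the result."""
    return quote(" OR ".join(chunk))

# Each chunk becomes one search query; running these daily and summing
# the result counts gives a quick estimate of the total keyword flow.
queries = [build_search_query(c) for c in chunk_keywords(["obama", "romney", "election"])]
```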
<p>Even when you have some data on the normal flow for keywords, you have to be prepared for bursts. I&#8217;ve <a href="http://140dev.com/twitter-api-programming-blog/dealing-with-tweet-bursts/">written about bursts</a> before. There are lots of techniques for handling them, ranging from getting the biggest server you can afford to dropping any tweets that exceed a predetermined hourly limit. </p>
<p>So how do I synthesize all these ideas to tell a client what they will need to spend on servers for their Twitter application? My general approach is to use a cloud service, like Rackspace, and start with the smallest server instance possible. Then I build a first version of the tweet collection code and start collecting stats on the flow from each user and keyword. Once I have a good handle on the average, I upsize the server to an amount of memory, disk, and CPU that I know will handle that average. Then I add an initial set of burst control techniques until I get a better idea of the long-term variability. If the flow is high enough to require more than 4GB of RAM, I find that a dedicated server is more cost effective, but starting with a cloud server is a good way to ramp up slowly. </p>
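<p>The flow stats in that first version can be as simple as a per-keyword tally. A rough Python sketch, with illustrative names, assuming each collected tweet's text is checked against the tracked keywords:</p>

```python
from collections import Counter

class FlowStats:
    """Tally how many collected tweets match each tracked keyword."""
    def __init__(self, keywords):
        self.keywords = [k.lower() for k in keywords]
        self.counts = Counter()
        self.total = 0

    def record(self, tweet_text):
        """Called once per tweet as it arrives from the stream."""
        self.total += 1
        text = tweet_text.lower()
        for kw in self.keywords:
            if kw in text:
                self.counts[kw] += 1

    def daily_average(self, keyword, days_collected):
        """Average matching tweets per day over the test period."""
        return self.counts[keyword.lower()] / days_collected
```

<p>A week or two of these numbers is usually enough to pick the server size for the average load, leaving only the bursts to worry about.</p>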
<p>The important takeaway for Twitter consultants is that you cannot know what you will need to handle a tweet collection project until you do the real-world testing. </p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/go-with-the-flow-when-creating-a-tweet-collection-database/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Twitter Consulting Tip: Twitter is people</title>
		<link>http://140dev.com/twitter-api-programming-blog/twitter-consulting-tip-twitter-is-people/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/twitter-consulting-tip-twitter-is-people/#comments</comments>
		<pubDate>Sat, 02 Jun 2012 11:47:03 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[Data Mining Tweets]]></category>
		<category><![CDATA[Twitter consultant]]></category>
		<category><![CDATA[Twitter Marketing]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=1503</guid>
		<description><![CDATA[Lots of people ask us to build databases of tweets, but they seem to miss the fact that along with the tweets you can also collect an amazing database of people. Data about the people who tweet is the proverbial low hanging fruit. The Twitter API gives you the complete profile of the author of [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Lots of people ask us to build <a href="http://140dev.com/twitter-api-programming-tutorials/twitter-api-database-cache/">databases of tweets</a>, but they seem to miss the fact that along with the tweets you can also collect an amazing database of people. Data about the people who tweet is the proverbial low hanging fruit. The Twitter API gives you the complete profile of the author of each tweet it delivers. You don&#8217;t have to make an extra API call. Twitter is basically saying, &#8220;Here is a fresh set of data about this person, please take it and build something useful.&#8221; The Twitter Terms of Service has strict limits on the reselling of tweet text, but lets you do whatever you want with user profiles. These are strong signs that Twitter looks favorably on applications based on their users.</p>
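<p>Because the profile rides along inside every tweet, harvesting it is just a dictionary lookup once the JSON is decoded. A Python sketch (the field names below are ones the API delivered in its tweet payloads; the function itself is illustrative):</p>

```python
def extract_user(tweet):
    """Pull the embedded author profile out of one decoded tweet payload."""
    u = tweet["user"]
    return {
        "user_id": u["id"],
        "screen_name": u["screen_name"],
        "name": u["name"],
        "followers_count": u["followers_count"],
        "friends_count": u["friends_count"],
    }
```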
<p>There are several ways a good Twitter consultant can help their clients understand the value of user data. My favorite technique is to make the marketing case that a tweet database is a great source of leads. Along with knowing what is being said, you know who is saying it. You also know everything else that user is saying. Excuse me for being crass, but the best way to describe this is that it is like email marketing, only you get to read the email of everyone you want to communicate with. That is a huge advantage.</p>
<p>Twitter lets you fly at 30,000 feet over the general landscape of discussion about your client&#8217;s product or market segment, and then zoom down and focus on a single individual. That is completely unprecedented. Even better, you can gather solid metrics about the influence of each user through values like follower count and frequency of mentions by others. Some of these values, like follower count, are readily available by looking at a user&#8217;s profile, but others require programming. That is where a Twitter consultant can add value.</p>
<p>My pitch is generally that while you can get influence measurements from tools like Klout, those are generic measurements of influence against all Twitter users and areas of interest. If you use the Twitter API and collect only tweets about a specific set of keywords, you can identify the most influential people for this area. I&#8217;ve written a detailed <a href="http://140dev.com/twitter-api-programming-tutorials/identifying-influential-twitter-users/">tutorial</a> on this subject.</p>
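<p>The heart of that approach can be sketched in a few lines of Python. The scoring formula here is an illustrative choice, not a standard metric: mentions within the keyword-specific tweet set, weighted by log follower count so raw audience size does not swamp topical engagement.</p>

```python
import math

def rank_influencers(followers, mentions, top_n=3):
    """Rank users seen in a keyword-specific tweet database.

    followers: screen_name -> follower count (from the stored profiles)
    mentions:  screen_name -> times @mentioned within this keyword set
    """
    scores = {}
    for name in set(followers) | set(mentions):
        scores[name] = (1 + mentions.get(name, 0)) * math.log1p(followers.get(name, 0))
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

<p>A user with a modest follower count who is mentioned constantly in your niche will outrank a celebrity who never touches the topic, which is exactly the ranking a generic tool cannot give you.</p>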
<p>The best Twitter consultants make sure that they go beyond just building what the client asks for based on a limited knowledge of what is possible with Twitter data. By opening up the marketing benefits of a database of Twitter users, a whole new set of features are possible, and both the client and consultant profit.</p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/twitter-consulting-tip-twitter-is-people/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Twitter Consultant Tip: Start with Twitter API rate limits</title>
		<link>http://140dev.com/twitter-api-programming-blog/twitter-consultant-tip-start-with-twitter-api-rate-limits/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/twitter-consultant-tip-start-with-twitter-api-rate-limits/#comments</comments>
		<pubDate>Fri, 01 Jun 2012 14:32:24 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[Rate Limits]]></category>
		<category><![CDATA[Twitter consultant]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=1497</guid>
		<description><![CDATA[A good Twitter consultant should start any discussion with a potential client by reviewing the Twitter API rate limits on the features they want. This is really a case of form follows function. Twitter has defined what developers should be doing through their wide range of rate limits, and you better pay attention to them [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>A good Twitter consultant should start any discussion with a potential client by reviewing the Twitter API rate limits on the features they want. This is really a case of form follows function. Twitter has defined what developers should be doing through their wide range of rate limits, and you better pay attention to them before promising to deliver on a client&#8217;s dream app. One rule I start with is: tweets are easy, followers are hard. You can get tons of tweets with the streaming API, which has no rate limits. Getting followers, on the other hand, is something Twitter clearly does not want you to do in bulk. </p>
<p>For example, you can request up to 5,000 followers of a specific account at a time, but you can only get user profiles for these followers at a rate of 100 per API call. This means that there is no way to get the details on all the followers of a major celebrity in anything like real-time. </p>
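<p>The arithmetic makes the point concrete. Assuming the limits above (5,000 follower IDs per call, 100 profiles per call) and a single key's 350 REST calls per hour, a quick Python sketch of the fetch cost:</p>

```python
import math

def follower_fetch_cost(follower_count, calls_per_hour=350):
    """Estimate the API calls and hours needed to profile every follower."""
    id_calls = math.ceil(follower_count / 5000)       # follower ID pages
    profile_calls = math.ceil(follower_count / 100)   # profile lookups
    total = id_calls + profile_calls
    return total, total / calls_per_hour

calls, hours = follower_fetch_cost(1000000)  # a million-follower celebrity
```

<p>Even a mid-sized celebrity takes days at a single key&#8217;s rate limit, which is why bulk follower analysis has to be planned for, not casually promised.</p>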
<p>Another aspect of rate limits that has a big impact on code architecture is deciding which entity is making the API call. If you do everything from a single server with the OAuth keys for the app itself, you only get 350 calls per hour with the REST API. But if you let users log in through OAuth, you can use their keys and get 350 calls per hour with each set of keys. Just 100 users give you 35,000 calls per hour. Pretty powerful reason to build OAuth login into a site, right? This is perfectly kosher. You are not really taking anything from the users. Each app they authorize gets 350 calls per hour with a different set of keys based on the same person. </p>
<p>You can also play with rate limits by offloading some of the functionality into the user&#8217;s browser. When you make an API call that isn&#8217;t using OAuth keys, such as the search API, the rate limit is charged to the IP of the server that connects to Twitter. A way around this limit is to call the search API with JavaScript from the web page. In that case, the IP of the user&#8217;s browser absorbs the rate limit. That can scale up to any number of users. We take advantage of this technique in the <a href="http://thisrth.at/">ThisrThat</a> app we built for a client. </p>
<p>There are lots of other rate limit tricks you can use, but this is enough to make the point that a Twitter consultant needs to make new clients aware of the limits they will be facing before a complete feature list is created. </p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/twitter-consultant-tip-start-with-twitter-api-rate-limits/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Twitter Consultant Tip: Tweet data is priceless</title>
		<link>http://140dev.com/twitter-api-programming-blog/twitter-consultant-tip-tweet-data-is-priceless/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/twitter-consultant-tip-tweet-data-is-priceless/#comments</comments>
		<pubDate>Thu, 31 May 2012 20:12:52 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[Data Mining Tweets]]></category>
		<category><![CDATA[Database Cache]]></category>
		<category><![CDATA[Twitter consultant]]></category>
		<category><![CDATA[Twitter Database Programming]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=1489</guid>
		<description><![CDATA[Most of the Twitter consulting I do involves some form of tweet collection and storage in a database. Even when clients approach me with this in mind, they hardly ever realize just how valuable tweet data can be. In fact, it is priceless in the truest sense of the word, because there is no way [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Most of the Twitter consulting I do involves some form of tweet collection and storage in a database. Even when clients approach me with this in mind, they hardly ever realize just how valuable tweet data can be. In fact, it is priceless in the truest sense of the word, because there is no way to buy tweets after they are sent. You either capture them in real-time, or they are gone forever. Anyone who wants to work as a Twitter consultant needs to be able to explain that value-added message to potential clients. Here are the key selling points to keep in mind. </p>
<p>The Twitter search API only goes back in time 5 to 6 days, and will only return up to 1,500 tweets for any query. If you want old tweets from the API, that is an absolute limit. The streaming API is much more responsive, and will return up to 1% of the total stream, meaning that you can get up to 3 million tweets a day on any query, but these tweets are returned in real-time, not after the fact. So if you want to get all the tweets for a query, you must set up the streaming API connection <em>before you need the results</em>.  Then you must store them in a database for later retrieval. </p>
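<p>The storage side can start very small. A self-contained Python sketch using SQLite (a production collector would more likely use MySQL; the table layout is illustrative) that archives decoded streaming API tweets for later retrieval:</p>

```python
import sqlite3

def open_tweet_db(path=":memory:"):
    """Create the archive table if needed and return a connection."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS tweets (
        tweet_id    INTEGER PRIMARY KEY,
        screen_name TEXT,
        tweet_text  TEXT,
        created_at  TEXT)""")
    return conn

def store_tweet(conn, tweet):
    """Insert one decoded tweet; re-delivered duplicates are ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)",
        (tweet["id"], tweet["user"]["screen_name"],
         tweet["text"], tweet["created_at"]))
    conn.commit()
```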
<p>The <a href="https://twitter.com/tos">Twitter terms of service</a> (TOS) allow you to store tweets for use on your own server, either for display or analysis, but there are strict limitations on reselling this data. You can sell it in discrete data sets as a file, such as a PDF or Excel file, but you cannot resell it as an API or real-time service. This means that if someone has already collected tweets that you need, you are forbidden from buying them as a continuous stream for display on your site. If you haven&#8217;t collected them yourself, you can&#8217;t have a real-time display of tweets on your site, even if you are willing to pay for them. </p>
<p>But what about Twitter&#8217;s data partners, Gnip and DataSift? These companies don&#8217;t publicize the limitation on their sites, but they are also forbidden by Twitter&#8217;s license from selling tweets for display on other sites. The tweets you buy from them may only be used for analysis, such as in a product like Radian6. </p>
<p>All of this means that once a client has built up a long-term database of tweets, they have a priceless resource. There is no price at which these tweets can be bought and sold for continuous display. That makes a tweet database an incredibly valuable resource, and it means that you have to start collecting tweets and saving them in advance. There is no going back for them. </p>
<p>Once clients understand this, they suddenly become very acquisitive. They can collect all the tweets about politicians, celebrities, athletes, TV shows, etc., and have an iron-clad barrier to entry against any competitor coming along later. That is a valuable selling tool for any Twitter consultant who can do this type of database programming. My free, <a href="http://140dev.com/free-twitter-api-source-code-library/">open source library</a> is a good starting point for this type of coding. </p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/twitter-consultant-tip-tweet-data-is-priceless/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
