
How is Twitter programming better than Twitter search?

Adam Green — Fri, 08 Jun 2012 12:51:28 +0000

This is a question I frequently get asked by new clients. They know there is a Twitter API available to collect tweets, but they have no idea how the results differ from just asking for tweets with Search.Twitter.com. I’ve recently explained the fact that a tweet database lets you create a long-term store that cannot be reproduced or purchased any other way. That is just the starting point. The real advantage of Twitter API programming is the way it allows you to add value to a collection of tweets:

You can apply quality control rules that let you filter out false positives for the keywords you are using in your collection query.
I also like to apply simple “filth controls” to all tweet streams that get displayed on sites. This starts with a list of George Carlin’s 7 words you can’t say on television, and grows into a list of the more creative racist and misogynist words so popular on Twitter. Excluding tweets with these words makes Twitter seem much more civilized.
A simple language detection algorithm will let you select tweets for a specific language and exclude all other languages.
By checking the tweets you receive for spammy words, like free, coupon, buy now, or sale, you can clean out a high percentage of spam tweets, and if you check new tweets for duplicates, you can identify spammers and blacklist them.
If you screen the user account data for each tweet’s author, you can exclude accounts that have a spammy profile, such as a default avatar, no followers, or an account that has only been in existence a few days.
Or you can come up with an influence algorithm, such as follower count or frequency of mentions, to select tweets from the most influential users.
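As a rough illustration, several of the rules above can be chained into a single filtering pass. This is only a sketch under stated assumptions: the `Tweet` fields, the word lists, the 7-day account age threshold, and the function names are all hypothetical, and treating any repeated text as a sign of spam is a simplification that would also catch innocent copies.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical word lists; a real site would tune these per client.
SPAM_PHRASES = ("free", "coupon", "buy now", "sale")
BLOCKED_WORDS = ("badword1", "badword2")  # placeholder for a "filth control" list


@dataclass
class Tweet:
    text: str
    screen_name: str
    followers_count: int
    default_avatar: bool
    account_created: datetime


def contains_any(text, phrases):
    """True if any phrase appears as whole words in the text."""
    words = " " + " ".join(re.findall(r"[a-z0-9']+", text.lower())) + " "
    return any(f" {p} " in words for p in phrases)


def spammy_profile(tweet, now, min_age_days=7):
    """Default avatar, no followers, or a very new account (assumed threshold)."""
    too_new = now - tweet.account_created < timedelta(days=min_age_days)
    return tweet.default_avatar or tweet.followers_count == 0 or too_new


def filter_tweets(tweets, now):
    """Return (kept tweets, blacklisted screen names)."""
    seen = set()
    blacklist = set()
    kept = []
    for t in tweets:
        key = " ".join(t.text.lower().split())  # normalize case and spacing
        if key in seen:
            # Simplification: the author of a repeated text is blacklisted.
            blacklist.add(t.screen_name)
            continue
        seen.add(key)
        if t.screen_name in blacklist:
            continue
        if contains_any(t.text, SPAM_PHRASES + BLOCKED_WORDS):
            continue
        if spammy_profile(t, now):
            continue
        kept.append(t)
    return kept, blacklist
```

Matching on whole words keeps a tweet about "wholesale" prices from being dropped by the "sale" rule, which is the kind of false positive these filters are meant to avoid.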

These are just the generic ways to add value to a tweet aggregation site. Once you start working with a client with specific application needs, there are many ways to add value to Twitter. This is an iterative process that keeps improving the quality of your tweet collection.

So the simple answer to the question is that Twitter programming produces much higher quality results than Twitter search.

Twitter Consulting Tip: Twitter is people

Adam Green — Sat, 02 Jun 2012 11:47:03 +0000

Lots of people ask us to build databases of tweets, but they seem to miss the fact that along with the tweets you can also collect an amazing database of people. Data about the people who tweet is the proverbial low hanging fruit. The Twitter API gives you the complete profile of the author of each tweet it delivers. You don’t have to make an extra API call. Twitter is basically saying, “Here is a fresh set of data about this person, please take it and build something useful.” The Twitter Terms of Service has strict limits on the reselling of tweet text, but lets you do whatever you want with user profiles. These are strong signs that Twitter looks favorably on applications based on their users.

There are several ways a good Twitter consultant can help their clients understand the value of user data. My favorite technique is to make the marketing case that a tweet database is a great source of leads. Along with knowing what is being said, you know who is saying it. You also know everything else that user is saying. Excuse me for being crass, but the best way to describe this is that it is like email marketing, only you get to read the email of everyone you want to communicate with. That is a huge advantage.

Twitter lets you fly at 30,000 feet over the general landscape of discussion about your client’s product or market segment, and then zoom down and focus on a single individual. That is completely unprecedented. Even better, you can gather solid metrics about the influence of each user through values like follower count and frequency of mentions by others. Some of these values, like follower count, are readily available by looking at a user’s profile, but others require programming. That is where a Twitter consultant can add value.

My pitch is generally that while you can get influence measurements from tools like Klout, those are generic measurements of influence against all Twitter users and areas of interest. If you use the Twitter API and collect only tweets about a specific set of keywords, you can identify the most influential people for this area. I’ve written a detailed tutorial on this subject.
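One simple way to approximate this kind of topic-specific influence is to count how often each account is mentioned within the collected tweets. The sketch below is an assumption for illustration, not the method from the tutorial: the function name, the input shape of `(author, text)` pairs, and the choice to ignore self-mentions are all hypothetical.

```python
import re
from collections import Counter

# An @ followed by 1 to 15 name characters, not preceded by a name character
# (so email addresses such as dave@example.com are not counted as mentions).
MENTION_RE = re.compile(r"(?<![A-Za-z0-9_])@([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])")


def rank_by_mentions(tweets, top_n=10):
    """tweets: iterable of (author_screen_name, text) pairs.

    Returns a list of (screen_name, mention_count), most mentioned first.
    Each tweet counts at most once per mentioned account.
    """
    counts = Counter()
    for author, text in tweets:
        mentioned = {name.lower() for name in MENTION_RE.findall(text)}
        mentioned.discard(author.lower())  # ignore self-mentions
        counts.update(mentioned)
    return counts.most_common(top_n)
```

Because the input contains only tweets collected for the client's keywords, the ranking reflects influence within that topic rather than across all of Twitter.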

The best Twitter consultants make sure that they go beyond just building what the client asks for based on a limited knowledge of what is possible with Twitter data. By opening up the marketing benefits of a database of Twitter users, a whole new set of features are possible, and both the client and consultant profit.

Twitter Consultant Tip: Tweet data is priceless

Adam Green — Thu, 31 May 2012 20:12:52 +0000

Most of the Twitter consulting I do involves some form of tweet collection and storage in a database. Even when clients approach me with this in mind, they hardly ever realize just how valuable tweet data can be. In fact, it is priceless in the truest sense of the word, because there is no way to buy tweets after they are sent. You either capture them in real-time, or they are gone forever. Anyone who wants to work as a Twitter consultant needs to be able to explain that value-added message to potential clients. Here are the key selling points to keep in mind.

The Twitter search API only goes back in time 5 to 6 days, and will only return up to 1,500 tweets for any query. If you want old tweets from the API, that is an absolute limit. The streaming API is much more responsive, and will return up to 1% of the total stream, meaning that you can get up to 3 million tweets a day on any query, but these tweets are returned in real-time, not after the fact. So if you want to get all the tweets for a query, you must set up the streaming API connection before you need the results. Then you must store them in a database for later retrieval.
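The "capture now, query later" pattern can be sketched roughly as follows. This assumes the stream arrives as one JSON object per line, with blank keep-alive lines and occasional non-tweet messages mixed in. SQLite stands in for whatever database you actually use, and the table layout and function names are hypothetical.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a minimal tweet store. The schema is an assumption."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS tweets (
               tweet_id    INTEGER PRIMARY KEY,
               user_id     INTEGER NOT NULL,
               screen_name TEXT NOT NULL,
               created_at  TEXT NOT NULL,
               tweet_text  TEXT NOT NULL)"""
    )
    return db


def store_stream(db, lines):
    """Store tweets from an iterable of JSON lines; return how many were new."""
    stored = 0
    for line in lines:
        line = line.strip()
        if not line:  # blank keep-alive line
            continue
        message = json.loads(line)
        if "text" not in message:  # not a tweet (e.g. a delete notice)
            continue
        cur = db.execute(
            "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?, ?)",
            (
                message["id"],
                message["user"]["id"],
                message["user"]["screen_name"],
                message["created_at"],
                message["text"],
            ),
        )
        stored += cur.rowcount  # 0 when the tweet_id was already stored
    db.commit()
    return stored
```

Using the tweet ID as the primary key makes the insert safe to repeat, so a reconnect that replays a few tweets does not create duplicates.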

The Twitter terms of service (TOS) allow you to store tweets for use on your own server, either for display or analysis, but there are strict limitations on reselling this data. You can sell it in discrete data sets as a file, such as a PDF or Excel file, but you cannot resell it as an API or real-time service. This means that if someone has already collected tweets that you need, you are forbidden from buying them as a continuous stream for display on your site. If you haven’t collected them yourself, you can’t have a real-time display of tweets on your site, even if you are willing to pay for them.

But what about Twitter’s data partners, Gnip and Datasift? These companies don’t publicize the limitation on their sites, but they are also forbidden by Twitter’s license from selling tweets for display on other sites. The tweets you buy from them may only be used for analysis, such as in a product like Radian 6.

All of this means that once a client has built up a long-term database of tweets, they have a priceless resource. There is no price at which these tweets can be bought and sold for continuous display. That makes a tweet database an incredibly valuable resource, and it means that you have to start collecting tweets and saving them in advance. There is no going back for them.

Once clients understand this, they suddenly become very acquisitive. They can collect all the tweets about politicians, celebrities, athletes, TV shows, etc., and have an iron-clad barrier to entry against any competitor coming along later. That is a valuable selling tool for any Twitter consultant who can do this type of database programming. My free, open source library is a good starting point for this type of coding.

Twitter consultant tip: Creating a sales lead spreadsheet

Adam Green — Mon, 15 Nov 2010 13:52:34 +0000

Part of the sales process for Twitter consulting is convincing a new client that Twitter is more than just another way to broadcast their message. You have to show them that what appears to be a random stream of tweets is really a collection of highly qualified sales prospects. By aggregating Twitter users as well as their tweets, you can extract a great set of sales leads along with their contact info. One way to quickly demonstrate the value of tweet aggregation is to deliver an Excel spreadsheet of sales prospects that meet the client’s needs.

When you aggregate tweets from the Twitter streaming API, it also returns the complete account profile for each user. You can data mine this collection of users to extract highly targeted lists of users, along with their geographical location and home page URL.

The free 140dev Twitter framework is an example of the code you will need to do the tweet aggregation. The schema for the MySQL database it creates shows you it has a table for all the aggregated tweets, which links to the list of tweeting users. Since all of this data is collected for a specific set of keywords, you can then extract personal details on the users who tweet these keywords the most with a simple SQL statement:

SELECT count(*) AS cnt, users.screen_name, users.name, users.location, users.url
FROM tweets, users
WHERE tweets.user_id = users.user_id
  AND users.location != ''
  AND users.url != ''
GROUP BY tweets.user_id
ORDER BY cnt DESC
LIMIT 1000

The 140dev framework’s example database collects tweets for the keyword “recipe”, so this query gives us the most active tweeters in the food world. Here are the results in phpMyAdmin:

You can then export the results from phpMyAdmin to an Excel spreadsheet, and email it to your client. This gives them solid data in a familiar form. Twitter doesn’t deliver email addresses, and doesn’t even collect phone numbers, but you do get each user’s home page URL. This can be used to gather other contact info, a task that is easily farmed out to people on freelance sites like Mechanical Turk.
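If you would rather script the export than click through phpMyAdmin, the same query can be run from code and written to a CSV file, which Excel opens directly. The sketch below uses an SQLite connection as a stand-in for the MySQL database, and the function name and column headers are hypothetical. It assumes only the table and column names that appear in the query above.

```python
import csv
import sqlite3

LEADS_SQL = """
SELECT count(*) AS cnt, users.screen_name, users.name, users.location, users.url
FROM tweets, users
WHERE tweets.user_id = users.user_id
  AND users.location != ''
  AND users.url != ''
GROUP BY tweets.user_id
ORDER BY cnt DESC
LIMIT 1000
"""


def export_leads(db, out):
    """Run the leads query on db and write the rows as CSV to the file-like out.

    Returns the number of leads written (not counting the header row).
    """
    writer = csv.writer(out)
    writer.writerow(["tweets", "screen_name", "name", "location", "url"])
    rows = db.execute(LEADS_SQL).fetchall()
    writer.writerows(rows)
    return len(rows)
```

A typical call would open the output with `open("leads.csv", "w", newline="")` and pass that file object as `out`.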

So the next time you want to convince a client that Twitter is not just a bunch of kids talking to each other, you can just create a tweet aggregation database for the client’s industry keywords, let it collect data for a few days, and pull out a list of targeted users.

Twitter consultant tip: Top 5 ways to monetize Twitter

Adam Green — Fri, 12 Nov 2010 15:29:19 +0000

I went to the OpenCoffee meetup in Cambridge the other day. The attendees all recognized the importance of Twitter, but didn’t understand how to make money from it. We are exactly where we were in 1996 with the World Wide Web when I helped start Andover.net. Great point in the cycle.

So here is my quick 5 point pitch on how clients can benefit from integrating Twitter into business and marketing models. But first keep in mind that you don’t make money “from Twitter”, you make money “with Twitter”. Meaning that Twitter is a lever for improving your other efforts, but you don’t get cash handed to you directly by Twitter users on Twitter. Anyway, here’s my list:

1. Putting keyword targeted tweets on pages in the right way is great for SEO. Google loves tweets. This will increase the page’s search rank, getting a lot more first time visitors.

2. Datamining of tweets lets you find the right people to follow in Twitter for your market. This can be used very effectively to build a big follower list. This list becomes profitable when you tweet messages with URLs you want people to click on. Think of it as free Adwords.

3. Follower lists are also essential if you want to make people do something in the real world, like contribute money, or go to an event. Twitter will be huge in the 2012 election.

4. If you have a database of tweets, you can datamine it for sales leads. You can give sales people the Twitter accounts and home page URLs of people who tweet a lot about the products the salesperson is selling. The best part is that the salesperson can see exactly what prospects say about their products and competitors before contacting them.

5. You can also datamine a tweet database for sentiment trends. This is valuable for PR and customer service. It gives you a real-time read on how effective the rest of your communication program is.
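To make the last point concrete, a sentiment trend can be as simple as an average score per day. The sketch below is deliberately naive and rests on assumptions: the tiny word lists are hypothetical placeholders, and counting positive and negative words is only a stand-in for whatever sentiment method you actually choose.

```python
import re
from collections import defaultdict

# Hypothetical mini-lexicons; a real project would use a much larger list.
POSITIVE = {"love", "great", "awesome", "happy"}
NEGATIVE = {"hate", "awful", "broken", "angry"}


def daily_sentiment(tweets):
    """tweets: iterable of (day, text) pairs, with day as a 'YYYY-MM-DD' string.

    Returns {day: average score}, where a tweet's score is its count of
    positive words minus its count of negative words.
    """
    totals = defaultdict(lambda: [0, 0])  # day -> [score sum, tweet count]
    for day, text in tweets:
        words = re.findall(r"[a-z']+", text.lower())
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        totals[day][0] += score
        totals[day][1] += 1
    return {day: total / count for day, (total, count) in sorted(totals.items())}
```

Plotting these daily averages over a few weeks gives the kind of trend line that PR and customer service teams can act on.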

Your spammer is my information source

Adam Green — Sun, 17 Oct 2010 13:36:53 +0000

There is an interesting thread on the Twitter development list about the need for a “good citizen” rank. This problem is approached in a literal engineering way, which says there are good and bad users. There are plenty of Twitter behaviors that could be seen as “bad,” but the beauty of Twitter is that it is totally opt-in. I only see the people I choose to follow. So the only meaningful criterion is whether I or my client wants to read an account’s tweets. Any automation of that selection must work within the context of a specific area of interest. I believe that my mention algorithm is a good way of solving this problem.