How is Twitter programming better than Twitter search?

Adam Green — Fri, 08 Jun 2012 12:51:28 +0000

This is a question I frequently get asked by new clients. They know there is a Twitter API available to collect tweets, but they have no idea how the results differ from just asking for tweets with Search.Twitter.com. I’ve recently explained that a tweet database lets you create a long-term store that cannot be reproduced or purchased any other way. That is just the starting point. The real advantage of Twitter API programming is the way it allows you to add value to a collection of tweets:

You can apply quality control rules that let you filter out false positives for the keywords you are using in your collection query.
I also like to apply simple “filth controls” to all tweet streams that get displayed on sites. This starts with a list of George Carlin’s 7 words you can’t say on television, and grows into a list of the more creative racist and misogynist words so popular on Twitter. Excluding tweets with these words makes Twitter seem much more civilized.
A simple language detection algorithm will let you select tweets in a specific language and exclude all other languages.
By checking the tweets you receive for spammy words, like free, coupon, buy now, or sale, you can clean out a high percentage of spam tweets, and if you check new tweets for duplicates, you can identify spammers and blacklist them.
If you screen the user account data for each tweet’s author, you can exclude accounts that have a spammy profile, such as a default avatar, no followers, or an account that has only been in existence a few days.
Or you can come up with an influence algorithm, such as follower count or frequency of mentions, to select tweets from the most influential users.
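
As a rough illustration, here is a minimal Python sketch of three of the checks in the list above: blocked words, spammy words, and spammy profiles, plus duplicate-based blacklisting. Everything named here is an assumption made for the example: the `Tweet` fields, the placeholder `BLOCKED_WORDS` list, the seven-day account age threshold, and the rule that one repeated text is enough to blacklist an author. It does not call the Twitter API; it assumes the tweets have already been collected.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta

# Placeholder stand-ins for a real "filth control" list (assumption).
BLOCKED_WORDS = {"darn", "heck"}
# Spammy words taken from the list above.
SPAM_PHRASES = ("free", "coupon", "buy now", "sale")


@dataclass
class Tweet:
    # Hypothetical record; field names are illustrative, not API names.
    text: str
    screen_name: str
    followers: int
    default_avatar: bool
    account_created: datetime


def contains_any(text, phrases):
    """True if any phrase appears in text as a whole word or phrase."""
    lowered = text.lower()
    return any(re.search(r"\b" + re.escape(p) + r"\b", lowered) for p in phrases)


def has_spammy_profile(tweet, now, min_age_days=7):
    """Default avatar, no followers, or an account only a few days old."""
    too_new = now - tweet.account_created < timedelta(days=min_age_days)
    return tweet.default_avatar or tweet.followers == 0 or too_new


def screen(tweets, now):
    """Return the tweets that pass every check, in their original order."""
    seen_texts = set()
    blacklist = set()
    kept = []
    for tweet in tweets:
        # Collapse case and whitespace so trivial variations still match.
        normalized = " ".join(tweet.text.lower().split())
        if normalized in seen_texts:
            blacklist.add(tweet.screen_name)  # repeated text: treat author as a spammer
        seen_texts.add(normalized)
        if tweet.screen_name in blacklist:
            continue
        if contains_any(tweet.text, BLOCKED_WORDS) or contains_any(tweet.text, SPAM_PHRASES):
            continue
        if has_spammy_profile(tweet, now):
            continue
        kept.append(tweet)
    return kept
```

In a real collection you would probably want to exempt retweets from the duplicate check and tune the thresholds per site, which is part of the iterative process described below.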

These are just the generic ways to add value to a tweet aggregation site. Once you start working with a client with specific application needs, there are many ways to add value to Twitter. This is an iterative process that keeps improving the quality of your tweet collection.

So the simple answer to the question is that Twitter programming produces much higher quality results than Twitter search.

Screening a tweet stream for quality control

Adam Green — Fri, 20 Jan 2012 12:05:37 +0000

We’ve been working on a college football recruiting site called DirectSnap.com for a couple of months, and the most interesting aspect of the technology behind this site is the quality control algorithm I had to develop. Most of the tweet streams we work on, such as 2012twit.com, are based on collecting tweets for either a set of screen names or real names that are distinctive, such as those of politicians. When you find a match for Newt Gingrich or Mitt Romney, you can be fairly sure you have the right person.

In the case of DirectSnap, the tweet collection is based on the first and last name of 250 high school football players. Right away I knew I would have a problem when I found Michael Moore in the list of potential recruits. Randy Johnson was going to be even trickier, since the baseball player with this name was likely to be tweeted about by the same sports fans as the football recruit we were tracking. Identifying college teams is also tricky. For example, the word ‘Florida’ in a tweet with a player’s name could refer to the University of Florida or Florida State University.

The solution I came up with was to create a list of exclusion keywords for each player and team. If a tweet contains ‘Michael Moore’, but it also has words like fat, hypocrite, film, or liberal, it probably is not about the football player. A tweet with a player’s name is assigned to the University of Florida if it contains ‘Florida’, but not ‘Florida State’. This first level of screening did a good job of filtering out false positives, such as the wrong Michael Moore, but we also wanted to curate the tweets automatically to select the highest-quality ones. The goal was to end up with a tweet stream that was much more interesting than what you could get with Twitter’s search.
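
A minimal Python sketch of this first level of screening might look like the following. The `PLAYER_EXCLUSIONS` table, the function names, and the decision to assign a ‘Florida State’ tweet to Florida State University are assumptions made for the example, not the actual DirectSnap code.

```python
import re

# Hypothetical exclusion list keyed by lowercase player name.
PLAYER_EXCLUSIONS = {
    "michael moore": {"fat", "hypocrite", "film", "liberal"},
}


def tokens(text):
    """Lowercase word tokens found in the tweet text."""
    return set(re.findall(r"[a-z']+", text.lower()))


def matches_player(text, player):
    """True if the tweet names the player and has no exclusion keyword."""
    if player not in text.lower():
        return False
    return not (tokens(text) & PLAYER_EXCLUSIONS.get(player, set()))


def assign_team(text):
    """Check the longer team name first so 'Florida State' is not read as 'Florida'."""
    lowered = text.lower()
    if "florida state" in lowered:
        return "Florida State University"  # assumed handling for this case
    if "florida" in lowered:
        return "University of Florida"
    return None
```

For example, `matches_player("Michael Moore's new film is out", "michael moore")` is rejected because of the word film, while a tweet about Michael Moore visiting Florida passes and is assigned to the University of Florida.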

To do this we added a set of high-quality words to the quality screen, like the team position or hometown name of each player. We found that tweets with this extra information were generally from users who were serious about reporting details, not just random fans chanting a player’s name repeatedly. We used these quality words in two ways. Each time a quality word was found in a tweet, 1 point was added to a quality score for the tweet and for the user who sent the tweet. This allows us to select tweets for display that have a minimum quality score, and that are from a user with a minimum quality score.
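
The scoring step could be sketched in Python roughly as follows. The quality words, the thresholds, and the function names are hypothetical. The sketch also assumes single-word quality terms and counts each distinct quality word once per tweet, which is one possible reading of the rule above.

```python
import re
from collections import defaultdict

# Hypothetical quality words per player: a position and a hometown.
QUALITY_WORDS = {
    "michael moore": {"quarterback", "springfield"},
}


def score_tweet(text, player, screen_name, user_scores):
    """Add 1 point per quality word found, to the tweet and to its author."""
    found = set(re.findall(r"[a-z']+", text.lower())) & QUALITY_WORDS.get(player, set())
    points = len(found)
    user_scores[screen_name] += points  # running total for the user
    return points


def select_for_display(scored, user_scores, min_tweet=1, min_user=2):
    """Keep tweets that meet both the tweet minimum and the user minimum."""
    return [
        text
        for text, screen_name, points in scored
        if points >= min_tweet and user_scores[screen_name] >= min_user
    ]


# Usage: score a small batch, then pick what to display.
user_scores = defaultdict(int)
batch = [
    ("Michael Moore, quarterback from Springfield, visits campus", "reporter"),
    ("Michael Moore Michael Moore Michael Moore!!!", "fan"),
    ("Quarterback Michael Moore looked sharp", "scout"),
]
scored = [
    (text, name, score_tweet(text, "michael moore", name, user_scores))
    for text, name in batch
]
display = select_for_display(scored, user_scores)
```

Because user scores accumulate over time, a user who keeps sending detailed tweets will eventually pass the user threshold even if their first few tweets were held back.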

To see how well this system works, try comparing the DirectSnap page for Michael Moore, and Twitter search for the same words. My experience is that users find false positives very upsetting. They think computers actually understand what they are searching for, and when they see a false positive, the reaction is always that the website is “stupid”. My favorite example of this is when people complain about Google Alerts for their own name returning blog posts or tweets they have written. The reaction is usually “How stupid can Google be? Doesn’t it know that I don’t want to be alerted about my own writing?” On the other hand, they never seem to be upset about missing results. So ending up with a subset of all possible matches, but with no visible false positives is always the best goal.
