Bias in Twitter API measurements

I’m working on an analysis of tweets about the Dutch general elections of 2010. In the five weeks prior to Election Day there was a considerable number of tweets to analyse: 4,585,614 to be precise. Because of this large volume I use SPSS to organize the data and create variables. While working on the data, what struck me was that Twitter’s identification of, for instance, retweets is quite sloppy, to say the least.

As you may know, Twitter provides an API that allows access to their data for re-using tweets, locations, etc. in mashups. A few of these variables are of interest to researchers, for instance user characteristics such as name, location, and follower and following network size. I use these in my research as well, combined with data from other sources (see the upcoming issue of Party Politics on Twitter use by candidates in the EU elections of 2009).

As I was working on the actual tweet content I found some curious discrepancies between Twitter’s measurements and mine. For instance, retweets, identified as “RT” in the tweet text, are only flagged as retweets by Twitter when they are positioned right at the beginning of the tweet. Even a single blank space before the “RT” causes Twitter to fail to flag it as a retweet. Furthermore, “RT” codes placed somewhere in the middle of the tweet are not identified correctly either. You could consider these occurrences false negatives.
There are also false positives: tweets flagged as retweets that actually are not. Consider this: in the Netherlands, there’s a broadcasting organisation called RTL. Indeed, the first two characters are identified as a retweet by Twitter. Similarly, tweets starting with the text RTV (often used as an abbreviation for Radio and Television) are also identified as retweets.
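A word-boundary regular expression avoids both problems: it matches “RT” anywhere in the tweet, but not as a prefix of longer tokens like “RTL” or “RTV”. A minimal sketch in Python (the pattern and function name are my own, not Twitter’s):

```python
import re

# Match "RT" as a standalone token anywhere in the tweet:
# the \b boundaries prevent matching the "RT" inside "RTL" or "RTV".
RT_PATTERN = re.compile(r"\bRT\b")

def is_retweet(text: str) -> bool:
    """Classify a tweet as a retweet if it contains a standalone 'RT'."""
    return bool(RT_PATTERN.search(text))

# False negatives under Twitter's scheme, caught here:
print(is_retweet(" RT @user: leading space"))         # True
print(is_retweet("Great point RT @user: mid-tweet"))  # True

# False positives under Twitter's scheme, rejected here:
print(is_retweet("RTL Nieuws bericht"))               # False
print(is_retweet("RTV Oost meldt"))                   # False
```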

So, to what extent does this influence the findings? In the table below I cross-tabulated the original classification against my corrected version.

[Table: cross-tabulation of Twitter’s retweet classification against the corrected classification]
What we see here is that there are 787 false positives and 127,994 (= 127,987 + 7) false negatives. That’s 2.8% of tweets incorrectly classified. This small fraction seems not too disturbing. Or does it? In terms of descriptive analysis it might be negligible, as long as the misclassifications are random (which I haven’t checked yet).
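The percentage follows directly from the cell counts and the total of 4,585,614 tweets; a quick check of the arithmetic:

```python
# Cell counts from the cross-tabulation above.
false_positives = 787
false_negatives = 127_987 + 7
total_tweets = 4_585_614

misclassified = false_positives + false_negatives
print(f"{misclassified / total_tweets:.1%}")  # 2.8%
```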

At the same time, if one wants to use these retweets to construct a social network of people retweeting each other (yes, you can do that), things might be different. Even if the misclassifications are random, they might seriously affect network structure indicators.
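To illustrate, the “RT @username” convention can be turned into directed edges (retweeter → original author), from which such a network is built; a sketch with hypothetical example data:

```python
import re

# "RT @name" anywhere in the text marks whose message is being passed on.
RT_AT = re.compile(r"\bRT @(\w+)")

def retweet_edges(tweets):
    """Build a directed edge list of (retweeter, retweeted user) pairs."""
    edges = []
    for author, text in tweets:
        for source in RT_AT.findall(text):
            edges.append((author, source))
    return edges

# Hypothetical data: (author, tweet text) pairs.
tweets = [
    ("alice", "RT @bob: debate tonight"),
    ("carol", "interesting RT @alice great analysis"),
    ("dave",  "RTL covers the debate"),  # not a retweet
]
print(retweet_edges(tweets))
# [('alice', 'bob'), ('carol', 'alice')]
```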

Similar classification issues are at play for mentions and replies. Only the first mentioned name is identified, whereas many tweets mention multiple names. Furthermore, a reply only counts as a reply when the Twitter name (including the @-sign) begins at the very first position of the tweet.
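Extracting every mention, not just the first, and checking the first character for a reply is equally straightforward with regular expressions; again a sketch of my own, not Twitter’s logic:

```python
import re

MENTION = re.compile(r"@(\w+)")

def mentions(text: str):
    """Return every @-mentioned name, not only the first one."""
    return MENTION.findall(text)

def is_reply(text: str) -> bool:
    """A reply only counts as such when the tweet starts with @name."""
    return bool(re.match(r"@\w+", text))

print(mentions("@anne @bert what do you think, @carla?"))
# ['anne', 'bert', 'carla']
print(is_reply("@anne I agree"))   # True
print(is_reply("I agree @anne"))   # False
```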

In my opinion, it’s surprising that the programmers at Twitter Inc. haven’t used something like regular expressions to classify the data correctly. To be fair to Twitter, they acknowledge that their retweet count is an approximation. In the meantime it’s better to be safe than sorry: classify them yourself if you can.

Still, if there are tools available – I already mentioned regular expressions, or think of the string functions available in SPSS – researchers studying the actual tweet content might consider using these to get more accurate results.

Published by

Maurice Vergeer

I am Maurice Vergeer, working at the Communication Science department of Radboud University Nijmegen, in the Netherlands.