The transformation from data journalism to computational journalism
May 15th, 2011 by Maurice

For a long time journalism was a traditional profession: people investigating issues, talking to sources, writing it down and publishing it. But then the Internet came and journalists went onto the Net, emailing and surfing the Web as a new way to contact sources, gather information and disseminate the news.
Ultimately, this evolved into a new type of journalist, one who collects data freely available on the Net and aggregates it in such a way that it reveals new insights. This is called data journalism. Now yet another type of journalist appears, one who computes: computational journalism. I would suggest that computational journalism is an extension of data(-driven) journalism. Of course, data as such are meaningless; only through some filtering (aggregation, comparisons, etc.) can sense be made of the large amounts of data, possibly made easier through the use of visualizations.
Data and computational journalism, especially as used in investigative journalism, have been around for quite some time already. However, they received a great push through the use of APIs and the increased accessibility of databases through the Internet in general. The data repository of the Guardian is a good example of the latter. Still, analyzing data and visualizing the findings to convey the journalist's message can be quite tricky. Flowing Data is a good source on creative data visualisation, and on visualisations gone wrong.
The video below is a lecture on computational journalism’s research agenda from the Journalism and Media Studies Centre of Hong Kong University.

Media Research Seminar: Computational Journalism: Mapping the Research Agenda from JMSC HKU on Vimeo.

Bias in Twitter API measurements
Apr 18th, 2011 by Maurice

I’m working on the analysis of the tweets on the Dutch general elections of 2010. In the five weeks prior to Election Day there was a considerable number of tweets to analyse: 4,585,614 to be precise. Because of this large number of tweets I use SPSS to organize the data and create variables. While working on the data, what struck me was that Twitter’s identification of, for instance, retweets is quite sloppy, to say the least.

As you may know, Twitter provides an API that gives access to their data, so that tweets, locations etc. can be re-used in mashups. A few of these variables are of interest to researchers, for instance user characteristics such as name, location, and follower and following network size. I use these in my research as well, combined with data from other sources (see the upcoming issue of Party Politics on Twitter use by candidates in the EU elections of 2009).

As I was working on the actual tweet content I found some curious discrepancies between Twitter’s measurements and mine. For instance, retweets, identified by “RT” in the tweet text, are only flagged as retweets by Twitter when the “RT” is positioned right at the beginning of the tweet. Even if there is just a blank space before the “RT”, Twitter fails to flag the tweet as a retweet. Furthermore, “RT” codes put somewhere in the middle of the tweet are not identified correctly either. You could consider these occurrences as false negatives.
There are also false positives: tweets identified as retweets that actually are not. Consider this: in the Netherlands, there’s a broadcasting organisation called RTL. Indeed, when a tweet starts with RTL, the first two characters are identified as a retweet code by Twitter. Similarly, tweets starting with the text RTV (often used as an abbreviation for Radio and Television) are also identified as retweets.
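
The difference between the two classifications can be illustrated with a small Python sketch. This is not Twitter’s code, nor the SPSS syntax used for the analysis: the "naive" rule simply mimics the behaviour described above, and the "corrected" rule rests on the assumption that “RT” only counts when it appears as a standalone, upper-case token. The function names and example tweets are made up for illustration.

```python
import re

# Assumption: "RT" is a retweet code only when it stands alone as a token,
# so "RTL" and "RTV" do not qualify, but " RT @user" and "... RT @user" do.
RT_TOKEN = re.compile(r"\bRT\b")

def is_retweet_naive(text):
    # Mimics the behaviour described above: only the first two characters count.
    return text.startswith("RT")

def is_retweet_corrected(text):
    # Looks for a standalone "RT" anywhere in the tweet.
    return RT_TOKEN.search(text) is not None

examples = [
    "RT @anna: ga stemmen",          # plain retweet
    " RT @anna: ga stemmen",         # leading blank: missed by the naive rule
    "Mee eens! RT @anna: ga stemmen",  # RT in the middle: missed by the naive rule
    "RTL brengt het debat vanavond",   # false positive for the naive rule
    "RTV Oost over de verkiezingen",   # false positive for the naive rule
]

for text in examples:
    print(repr(text), is_retweet_naive(text), is_retweet_corrected(text))
```

A standalone “RT” can of course still mean something other than a retweet, so a rule like this reduces the misclassifications rather than eliminating them.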

So, to what extent does this influence the findings? In the table below I cross-tabulated the original classification against my corrected version.

 

What we see here is that there are 787 false positives and 127,994 (= 127,987 + 7) false negatives. That’s 2.8% of tweets incorrectly classified. Well, this small fraction seems not too disturbing. Or does it? In terms of descriptive analysis it might be negligible, as long as the misclassifications occur at random (which I haven’t checked yet).
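
For those who want to check the arithmetic, here it is in a few lines of Python. The counts come from the text above; the denominator is an assumption, namely that the percentage is taken over all 4,585,614 tweets in the data set.

```python
false_positives = 787
false_negatives = 127987 + 7

# Assumption: the share is computed over the full set of collected tweets.
total_tweets = 4585614

misclassified = false_positives + false_negatives
share = 100 * misclassified / total_tweets

print(misclassified)    # total number of misclassified tweets
print(round(share, 1))  # percentage of all tweets, rounded to one decimal
```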

At the same time, if one wants to use these retweets to construct a social network of people retweeting each other (yes, you can do that), things might be different. Even if the misclassifications occur at random, they might seriously affect network structure indicators.
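
To give an idea of how such a retweet network could be built, here is a minimal Python sketch. It assumes each tweet is available as an (author, text) pair and that a retweet credits its source as “RT @username”; the pattern, the function name and the example tweets are all hypothetical. Every missed retweet is a missing tie in the network, and every false positive is a tie that shouldn’t be there.

```python
import re
from collections import Counter

# Assumption: the retweeted user is the @username directly following a
# standalone "RT", optionally separated by whitespace or a colon.
RT_SOURCE = re.compile(r"\bRT\b\s*:?\s*@(\w+)")

def retweet_edges(tweets):
    """Count directed ties (retweeter, retweeted user) over all tweets."""
    edges = Counter()
    for author, text in tweets:
        for source in RT_SOURCE.findall(text):
            edges[(author, source)] += 1
    return edges

tweets = [
    ("anna", "RT @bert: ga stemmen"),
    ("anna", " RT @bert nog een keer"),                   # leading blank
    ("carl", "mee eens RT @anna: RT @bert: ga stemmen"),  # chained retweet
    ("dirk", "RTL @bert nieuws"),                         # not a retweet
]

for (retweeter, source), weight in sorted(retweet_edges(tweets).items()):
    print(retweeter, "->", source, weight)
```

The resulting weighted edge list could then be exported to network software such as Pajek or Ucinet.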

Similar classification issues are at play for mentions and replies. Only the first mentioned names are identified, whereas many tweets mention multiple names. Furthermore, a reply is only a reply when the twittername begins at the first position (i.e. when one includes the @-sign).
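
The same approach works for mentions and replies. The Python sketch below picks up every mentioned name rather than only the first, and treats a tweet as a reply when it opens with an @-name. Two assumptions are mine and not Twitter’s rules: leading blanks before the @-name are tolerated, and an @ preceded by a letter or digit (as in an email address) is not counted as a mention.

```python
import re

# Assumption: a mention is "@" plus a username, not preceded by a word
# character (which excludes email addresses such as jan@example.com).
MENTION = re.compile(r"(?<!\w)@(\w+)")

# Assumption: a reply opens with an @-name, leading whitespace allowed.
REPLY = re.compile(r"\s*@\w+")

def mentions(text):
    # All mentioned usernames, in order of appearance.
    return MENTION.findall(text)

def is_reply(text):
    return REPLY.match(text) is not None

examples = [
    "@anna @bert wat vinden jullie?",
    " @anna helemaal mee eens",
    "Goed debat van @carl en @dirk",
    "mail me op jan@example.com",
]

for text in examples:
    print(repr(text), mentions(text), is_reply(text))
```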

In my opinion, it’s surprising that the programmers at Twitter Inc. haven’t used something like regular expressions to classify the data correctly. To be fair to Twitter, they acknowledge that their retweet count is an approximation. In the meantime it’s better to be safe than sorry: classify the tweets yourself if you can.

Still, if there are tools available – I already mentioned regular expressions or think of the string functions available in SPSS – researchers studying the actual tweets might consider these to get more accurate results.

Online political widgetry and gadgetry
Mar 27th, 2010 by Maurice

In my research on politicians’ use of Twitter, I came across these widgets showing live updates of MPs tweeting throughout the day. To me these are a quick (although not always reliable) way to collect the usernames of tweeting politicians.

Here are some links to pages aggregating political tweets:
Tweetcongress is a basic page showing who in the US congress is tweeting, how often and what about.

Kamertweets is a Dutch version, also showing basic data. It lets you embed the latest tweets on your own webpage:

Then there is of course the British Tweetminster. Of these three, this seems to be the most elaborate one. Not only does Tweetminster provide the basic data, it also allows you to use html-code to embed tweets onto your page, as you can see below:

Not only that, they also provide html-code to embed the Tweetometer (a spin-off of the famous Swingometer in the UK):

This is all nice, and although these widgets were not intended for analysis purposes, it would be very nice to see some more elaborate analysis of the role Twitter plays in political communication. Here at Yeungnam University’s WCU Webometrics Institute, we are developing a number of tools that allow us to collect and visualize data (yes, yes, shameless self-promotion). Analysis takes place with regular software tools (SPSS, Pajek, Ucinet). Papers will be available soon at a conference near you.
