Harvard’s TweetMap (ALPHA): Explore 125 Million Tweets

Ben Lewis of the Harvard Center for Geographic Analysis shared that TweetMap, built on MapD (a general-purpose SQL database) and Harvard's WorldMap, is up and running in ALPHA.

Officially, "TweetMap ALPHA is an instance of the MapD big data platform developed through a collaboration between Todd Mostak and Harvard CGA." I corresponded with Mostak to learn a bit about the project and its future.

TweetMap allows the exploration of some 125 million tweets from 12/10/2012 to 12/31/2012. Visitors can query them by time, space, and keyword. The hope is to increase the size of the database, perhaps to billions of tweets. Real-time streaming, from tweet-tweeted to tweet-on-the-map in under a second, has been implemented. MapD can run on any number of commodity Graphics Processing Units (GPUs), using whatever hardware is available to it. Todd Mostak notes, "it runs equally well on my laptop with 1 GPU as our demo server with 4 as a Dell GPU server with 16 (of course the more GPUs you have, the faster things will run and the more data you can store)." GPUs, and their role in geospatial, are covered in this Directions Magazine article.
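To make the idea of querying by time, space, and keyword concrete, here is a minimal Python sketch. It is not MapD's API or TweetMap's schema, neither of which is described here; the `Tweet` fields, the `query` function, the bounding box, and the sample tweets are all hypothetical and only illustrate the three filters.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Tweet:
    # Hypothetical record layout; TweetMap's real schema is not given here.
    text: str
    time: datetime
    lat: float
    lon: float


def query(tweets, start, end, bbox, keyword):
    """Return tweets posted in [start, end], located inside bbox,
    whose text contains keyword (case-insensitive).

    bbox is (min_lon, min_lat, max_lon, max_lat).
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    kw = keyword.lower()
    return [
        t for t in tweets
        if start <= t.time <= end                  # time filter
        and min_lon <= t.lon <= max_lon            # space filter (longitude)
        and min_lat <= t.lat <= max_lat            # space filter (latitude)
        and kw in t.text.lower()                   # keyword filter
    ]


# Made-up sample data for illustration only.
tweets = [
    Tweet("Obama speaks in Boston tonight", datetime(2012, 12, 15, 20, 0), 42.36, -71.06),
    Tweet("Snow day!", datetime(2012, 12, 16, 9, 0), 42.37, -71.11),
    Tweet("obama on TV", datetime(2012, 12, 20, 12, 0), 34.05, -118.24),
]

# Rough bounding box around Massachusetts (an assumption for the example).
matches = query(
    tweets,
    start=datetime(2012, 12, 10),
    end=datetime(2012, 12, 31, 23, 59, 59),
    bbox=(-73.5, 41.2, -69.9, 42.9),
    keyword="Obama",
)
```

A linear scan like this is fine for a handful of records; the point of a GPU-backed database is to apply the same kind of filter across hundreds of millions of rows in parallel.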

Harvard users (with a log-in) can even download the tweets found by their queries. The rest of us can see the results as individual "dots" (with details of the tweet content, date, lat/long, etc.) and/or see a heat map. The one at right is a query for "Obama" across the entire time frame. I also searched for "adena" and found but a handful, many around a geography with that name.

What's next? Mostak shares:

...we will soon allow for spatial joins/intersections of points to polygons.  This means that the user could upload an arbitrary shapefile of say census districts and basically find the average sentiment of tweets containing the word "Obama" in each district and then regress that against attribute data, such as income or education level for the district.  On 4 GPUs we should be able to do around 4 billion such joins per second, as opposed to PostGIS or ArcGIS which seem to top out at 10,000-20,000 such operations per second, allowing real-time choroplething and regression analysis of spatial data for datasets which might take PostGIS or ArcGIS many days to do the same thing.
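The point-to-polygon join Mostak describes can be sketched in plain Python. This is a CPU-only illustration under stated assumptions, not MapD's implementation: the ray-casting test, the function names, the two square "districts", and the sentiment scores are all hypothetical, and the regression step is left out.

```python
from collections import defaultdict


def point_in_polygon(x, y, polygon):
    """Ray-casting test: polygon is a list of (x, y) vertices.

    Casts a ray from the point toward +x and counts edge crossings;
    an odd count means the point is inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross the ray.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def average_by_district(points, districts):
    """Join (x, y, score) points to named polygons and average the scores.

    Each point is assigned to the first district that contains it;
    points outside every district are ignored.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y, score in points:
        for name, polygon in districts.items():
            if point_in_polygon(x, y, polygon):
                sums[name] += score
                counts[name] += 1
                break
    return {name: sums[name] / counts[name] for name in counts}


# Two made-up square districts and four made-up scored points.
districts = {
    "west": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "east": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
points = [
    (0.5, 0.5, 1.0),
    (0.25, 0.75, 0.0),
    (1.5, 0.5, -1.0),
    (3.0, 3.0, 1.0),   # falls outside both districts
]

averages = average_by_district(points, districts)
```

Each point-against-polygon test is independent of the others, which is what makes this kind of join a natural fit for GPUs. Taking the quoted figures at face value, 4 billion joins per second against 10,000-20,000 per second works out to a ratio of roughly 200,000 to 400,000.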

Published Monday, February 25th, 2013

Written by Adena Schutzberg

Published in Cloud Computing