Media Cloud(s) On the Horizon

The Berkman Center for Internet & Society launched Media Cloud in early March, though it had been quietly available for a few months before that. It’s an exciting concept, limited in its current implementation but sure to grow in utility as more features get added.


In essence, Media Cloud monitors a set of sources and semantically processes the news items from those sources, creating a rich, structured dataset that enables various queries and visualizations.

Media Cloud Summary (Image from

The project also relies on a partnership with Calais to provide the term extraction and entity identification capability.

Currently, the visualizations are rather limited. You can create a comparative graphic across any three media sources in the system, of one of the following types:

  • Top 10 most mentioned terms
  • Top 10 Term Pivot
  • World Map
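To make the first of these concrete, here is a minimal sketch of how a "top 10 most mentioned terms" comparison could be computed. It is not Media Cloud's actual code (which is written in Perl); the `mentions` data, the source names, and the `top_terms` and `compare_sources` helpers are all hypothetical, and the sketch assumes an entity-extraction step such as Calais has already reduced each story to (source, term) pairs.

```python
from collections import Counter

# Hypothetical sample data: (source, extracted term) pairs, standing in for
# the output of a term-extraction step. Not real Media Cloud data.
mentions = [
    ("Source A", "economy"), ("Source A", "economy"), ("Source A", "economy"),
    ("Source A", "Iraq"), ("Source A", "Iraq"),
    ("Source A", "Congress"),
    ("Source B", "Iraq"), ("Source B", "Iraq"),
    ("Source B", "economy"),
]


def top_terms(mentions, source, n=10):
    """Return the n most frequently mentioned terms for one source."""
    counts = Counter(term for src, term in mentions if src == source)
    return counts.most_common(n)


def compare_sources(mentions, sources, n=10):
    """Build a side-by-side comparison: one top-n term list per source."""
    return {source: top_terms(mentions, source, n) for source in sources}


if __name__ == "__main__":
    comparison = compare_sources(mentions, ["Source A", "Source B"])
    for source, terms in comparison.items():
        print(source, terms)
```

A real system would presumably also normalize term variants and restrict the count to a date range, but the core of the comparison is just a per-source frequency count.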

Unfortunately there’s no easy way to identify what sources are in the database, other than starting to type and seeing if the autocomplete finds what you’re hoping to use. There’s also no way to tell what “terms” are considered significant, though the error message notes:

The available terms that you can currently search for are focused on prominent people, places, and events. This will broaden considerably in the future.

It’s the long-term plans, not the current visualizations, that make Media Cloud worth watching. Ultimately, the Media Cloud project describes itself as becoming:

A platform for open, collaborative research by scholars around the world . . . [which] does the heavy lifting in the “cloud” and provides the results as a web service

It isn’t clear at this point what specifically is meant by “in the ‘cloud’”, except in the limited sense that all remote web services could be said to be in the cloud. (See my colleague Andrew Webb’s The Open Cloud for a good overview of the various things “cloud” might mean in today’s environment.) Similarly, I believe the only current access to the “web service” is via the front-end site; no programmatic APIs are exposed yet.

Assuming, however, that the project can reach its goal of an infinitely scalable, cloud-hosted web service which would semantically index a great portion of the relevant media stream, and could be accessed by researchers at low or no cost – that would be a very powerful tool for understanding how media operates online.

Media Cloud is also a free and open source software project, licensed under the GNU Affero General Public License and built in Perl using the Catalyst web framework and a PostgreSQL database. (Get code here).

Calais for Drupal