Tools for Transparency: Google Refine

Originally posted as a guest post on the Sunlight Foundation blog.

For the past six months, I’ve served as the co-director of the Technology for Transparency Network, an organization that documents the use of online and mobile technology to promote transparency and accountability around the world. One of the most common challenges facing the project leaders we’ve interviewed is making sense of large amounts of data.

Even in countries where governments keep detailed digital records of lobbying data and education expenditures, data wrangling is a time-consuming, labor-intensive task. In countries where these records are poorly maintained, the task becomes even harder: everything from inconsistent data entry practices to simple typos can derail data analysis.

Google Refine (formerly Freebase Gridworks) is a free, open-source tool for cleaning up, combining, and connecting messy data sets. Rather than acting like a traditional spreadsheet program, Google Refine exists “for applying transformations over many existing cells in bulk, for the purpose of cleaning up the data, extending it with more data from other sources, and getting it to some form that other tools can consume.”

At its most basic level, Google Refine helps users summarize, filter and edit data sets by allowing them to view patterns and to spot and correct errors quickly. More advanced features include reconciling data sets with the data repository Freebase (i.e., matching text in the set with existing database IDs), geocoding, and fetching additional information from the Web based on existing data.
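To make the "spot and correct errors" idea concrete, here is a rough Python sketch of one way inconsistent entries can be grouped: reduce each value to a normalized key and collect the values whose keys collide. It is loosely modeled on the key-collision style of clustering that Refine offers, but it is only an approximation written for illustration, not Refine's own code. The function names and the sample entries are invented for this example.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a string to a normalized key so near-duplicate spellings collide."""
    # Decompose accented characters and drop anything that is not ASCII.
    text = unicodedata.normalize("NFKD", value)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Lowercase, turn punctuation into spaces, then sort the unique tokens.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a fingerprint; keep only groups with variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical sample entries, invented for illustration.
entries = [
    "Ministry of Education",
    "ministry of education ",
    "Education, Ministry of",
    "Ministry of Health",
]
clusters = cluster(entries)
print(clusters)
```

In this sample, the three spellings of the education ministry end up in one group, while "Ministry of Health" has no variants and is left out. A person would still need to review each group and choose the spelling to keep, since a key this coarse can also merge entries that are genuinely different.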

Though it runs through an Internet browser, Google Refine operates offline, making it attractive for those with limited bandwidth or privacy concerns — a group that includes many of the projects listed on the Technology for Transparency Network.

Google Refine isn’t going to solve the problem of poor data availability, but for those who manage to gain access to existing records, it can be a powerful tool for transparency.

For more information, check out the links and video below:

  • Erin Kaplan

    Interesting. Do you do most of your data analysis in Excel-type spreadsheets? Or do you use a statistical package (SAS, STATA, gretl…)? I don’t know how the Berkman Center operates with regard to data sharing, but I would love to collaborate with you. Do you have economists there? Anyway, your work sounds really interesting.

    • I like STATA (it’s what I was taught at SIPA), personally. Berkman’s data-sharing policies vary by project, but a good place to start looking is the Publications section of the website.

      E-mail me if there’s something specific you’re interested in!
