Lunch at Berkman: Culturomics

Liveblogging Erez Lieberman Aiden and Jean-Baptiste Michel’s presentation on Culturomics: Quantitative Analysis of Culture Using Millions of Digitized Books at the Berkman Center. Please excuse misrepresentation, misinterpretation, typos, and general stupidity.

Erez Lieberman Aiden and Jean-Baptiste Michel have assembled a digital collection comprising approximately 4 percent of all published books — or around 5 million titles printed since 1800 — and are analyzing it to reveal trends about everything from technological adoption to the pursuit of fame. (Interested in checking it out for yourself? The data is available via Google.) They call this field of study “culturomics,” which they define as research that “extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.”

There are two basic ways to approach a library, Erez and JB say. You can read a few books very carefully, or you can read all the books “very not carefully.” Their hope is to give people a way to pull useful information from all the books without having to read all the books carefully.

Awesome

They start with an example of culturomics: quantifying the usage of irregular and regular verbs to help track how language changes over time. In an earlier project, they scoured 11 early English texts and manually counted the instances of different verbs. They found that verbs with a higher frequency in these texts (to be, to have) have remained irregular, while verbs with a lower frequency regularized more quickly — if a verb was 100 times less frequent, it regularized 10 times as fast. In other words, when English speakers rarely use a word, they tend to fall back to the standard pattern of conjugation rather than preserve irregular forms.

While this research is awesome, it is not practical, Erez says. (Particularly given that they had hoped for 1000 undergraduate students to assist them in the counting but were only able to entice one.) So they set out to make something both awesome and practical.

Awesome + Practical

JB says that the ideal way to begin a project like this would be for Google, which has digitized millions of books, to simply release all of this data to the world. “But 5 million books is 5 million authors, which is 5 million plaintiffs,” he says. So instead of using the full text, they convinced Google to release statistics: n-grams. A 1-gram is a single word, a 2-gram is a two-word phrase, a 4-gram is a four-word phrase (e.g., “United States of America”), and so on. They worked with Google to publish this data for approximately 5 million books printed in the last two centuries.
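
(For the curious, here is roughly what that means in code. This is a minimal sketch of the concept, not Google’s actual pipeline: an n-gram is just every run of n consecutive words in a text, and the released dataset consists of yearly counts of those runs rather than the books’ full text.)

```python
# A minimal sketch of what an "n-gram" is: every run of n consecutive words.
def ngrams(text, n):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("the United States of America", 4))
# -> ['the United States of', 'United States of America']
```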

This data has enormous potential to help track cultural evolution, they argue. Six months ago, JB says, if you wanted to know about the history of the past tense of the verb “to thrive,” you’d “ask two distinguished scholars with fantastic hair”: the scholar in 2000 would say that people “thrived,” while the scholar from 1800 would say that people “throve,” and that would be that. You would know the past tense had changed, but you wouldn’t know when, or how quickly.

With n-grams, however, you can more precisely track the usage of both “thrived” and “throve” in 5 million books published over the past 200 years, showing that usage of “throve” has been in steady decline since 1900, while “thrived” has been on the rise. This kind of analytical power is “1 billion times more awesome” than anything you could do before, JB says.
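
(Here is a hedged sketch of what that comparison looks like if you download the raw 1-gram counts yourself. The file name and column layout below are assumptions made for illustration, not Google’s exact published schema.)

```python
# Hedged sketch: tally yearly counts for a word from a tab-separated file
# assumed to contain rows of (ngram, year, count, ...). The file name and
# column order are illustrative assumptions, not Google's exact format.
from collections import defaultdict

def yearly_counts(path, word):
    counts = defaultdict(int)
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            ngram, year, count = fields[0], fields[1], fields[2]
            if ngram == word:
                counts[int(year)] += int(count)
    return counts

thrived = yearly_counts("eng-1grams.tsv", "thrived")  # hypothetical file
throve = yearly_counts("eng-1grams.tsv", "throve")
for year in range(1800, 2001, 50):
    print(year, thrived.get(year, 0), throve.get(year, 0))
```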

Incredibly Awesome

Verb usage and linguistic change aren’t the only things that can be tracked with n-grams. Erez and JB plugged the years 1883, 1910, and 1950 into the system to track mentions of these years over time. The usage of “1950” in published texts spiked in 1950 (“Nothing made 1950 interesting like 1950,” Erez says), but by 1954 it was declining. The same thing happened with 1883 and 1910, but with two interesting factors: overall, printed books talked more about 1950 than they did about 1883 or 1910, suggesting that we’re more interested in time than we used to be. However, the half-lives of these terms grew shorter — we talked more about 1950 than we did about 1883, but we stopped talking about it faster than we stopped talking about 1883, suggesting that we’re less interested in the past than we used to be.

Another example is tracking fame: Erez and JB tracked the names of the most famous people born in each year since the 1800s and were able to figure out how old each year’s “class” of celebrities was when it achieved fame, how quickly its members shot to stardom, and how long it took society to forget them. Over time, people have become famous earlier, shot to stardom faster, and been forgotten sooner. Erez and JB were also able to determine that actors become famous at the youngest age, while politicians and authors, who take longer to become famous, end up the most famous. (They recommend avoiding becoming a mathematician, as, historically speaking, mathematicians aren’t very famous at all.)

To explore censorship, Erez and JB tracked mentions of Jewish painter Marc Chagall in English (a steady rise over time) and German (a dip to zero during World War II). English-language mentions of African-American track star Jesse Owens have been high since the 1936 Berlin Olympics, at which he won four medals, but didn’t rise in German until the 1950s. In Russian, mentions of Trotsky were artificially low between the time he was assassinated and the advent of perestroika. In Chinese, “Tiananmen Square” stays more or less even between the 1970s and today, while mentions shoot up in English after the massacre in 1989.

In another exploration of censorship, Erez and JB took Nazi blacklists, which were separated systematically into fields (politics, literature, etc.), and entered them into the system, comparing mentions of the blacklisted intellectuals’ names against the names of prominent Nazis. Between the mid-1930s and mid-1940s, mentions of Nazis went up 500%, while mentions of political scholars dropped by 60%, philosophers by 76%, and so on. The pattern held for individuals as well (for example, Henri Matisse). JB notes that every name on Wikipedia could be entered into the system to create a “distribution of suppression indices” in different languages. He’s careful to explain that this does not replace the work of historians, but rather complements it.
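
(JB didn’t walk through the math, but one plausible way to compute something like a suppression index, reconstructed here as an illustration rather than the formula from their paper, is to compare a name’s frequency during the censorship period with a baseline taken from the surrounding years.)

```python
# Hedged reconstruction of a "suppression index": the ratio of a name's
# observed frequency during a suppression period to a baseline taken from
# the surrounding years. The paper's exact definition may differ.
def suppression_index(freq, start, end):
    """freq maps year -> the name's relative frequency in that year's books."""
    outside = [f for y, f in freq.items() if y < start or y > end]
    during = [f for y, f in freq.items() if start <= y <= end]
    baseline = sum(outside) / len(outside)
    observed = sum(during) / len(during)
    return observed / baseline  # values well below 1 suggest suppression
```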

Next Steps

Erez and JB approached Google with their prototype and asked them to create a web-based version: Ngrams. In addition to running simple comparative analyses of phrases and words, the system lets you view examples of these phrases and words in context. The system has proven quite popular: within the first 24 hours, over a million queries were run.

They suggest that Ngrams could be part of a front end for a digital library — in a digital environment it’s important to think beyond card catalogs, toward new interfaces. Their research started with books, but newspapers, maps, manuscripts, art, and other cultural works are increasingly being digitized. Culturomics can be applied to these sources as well. JB argues that we don’t need to wait for copyright law to change in order to conduct this research — many of the books in their corpus are still under copyright, but because they didn’t release the entire text, they were still able to share important and useful data with the public.

Q+A

Q: Could this data be used for forecasting? To project culture?
A (Erez): You have to be careful, but one ought to be able to make some sorts of predictions based on observable linear trends in the data. We should push our boundaries.
A (JB): There’s a lot to be done with aggregates, but it’s harder for individuals.

Q: Have your data proven any “small-n historians” wrong?
A (JB): Not yet, but we’re hoping that this can be a tool to help historians generate hypotheses as starting points for discovery, and perhaps to prove or disprove these hypotheses. The problem is that these need to be quantitative hypotheses, which is not how most hypotheses tend to be formulated.

Q: What are the intellectual property and copyright implications of your work?
A (Erez): Many of these digitization projects take the approach of pushing for copyright reform to make the work possible. That’s valid to some extent, but the Ngrams system allows Google to make use of in-copyright books it has digitized but can’t display in full. That’s an argument for digitizing even in-copyright books.

Q: Does the corpus have enough structural information in it so that you know where words appear (for example, book titles vs. chapter titles vs. subheads, etc.)?
A (JB): It was a challenge to make sure that the book was actually written in the year in which it said it was written. Tracking other types of data is even more challenging. Possible, theoretically, but challenging.

Why I’d Like a Digital Public Library of America

Next month I’ll be spending a few quiet days in Amsterdam at the end of a work-related trip. It’s been ages since I’ve traveled alone — years, even — and I’m looking forward to wandering from café to café with a stack of dense novels and several long stretches of spring afternoon during which I can read them.

I’m a fast reader, and during vacations I often find myself at the end of the last book I’ve brought with me before I’m finished traveling. I try to plan ahead by bringing thick, dense bricks that weigh down my suitcase but keep me entertained for as long as possible — fruitcake books, some call them. Today I posted a question to Ask Metafilter in search of these kinds of books to take with me on my upcoming trip.

A few moments ago, attempting to cross-reference Ask Mefi (among other things, my recommendation engine of choice) with Goodreads (where I keep track of what I’ve read and would like to read) and AbeBooks (used bookseller extraordinaire), I was overwhelmed with the number of tabs open in Chrome and the frustration of trying to use three different websites to get me to one final goal: finding books I want to read. Why isn’t there a website where I can do all three of these things — recommend, track, obtain — in one place?

And then I remembered, oh yeah, I’m currently working on a project that could eventually do just that. The Digital Public Library of America (DPLA), currently housed at the Berkman Center and led by an amazing Steering Committee of top-notch librarians, techies, and government and foundation representatives, is bringing together stakeholders from public and research libraries, the publishing industry, government, cultural organizations, and the academic community to figure out how best to “make the cultural and scientific heritage of humanity available, free of charge, to all.”

The project is still very much in the planning stages, and I don’t know what form it will take. If I get my way*, and if the DPLA turns out to be as awesome as I hope it will, I won’t have to open as many different tabs or log into as many different websites to find recommendations, track books I want to read, and obtain them, whether by checking them out from my local library branch, buying them, or downloading an ebook. If that concept (or some variant thereof) interests you — or if you wholeheartedly disagree about what a DPLA should look like and want to make your voice heard — you should check out our wiki and maybe add your name to our list of supporters.

*Note: As always, everything on this blog represents nothing more than the author’s own opinion, experience, and predilection for referring to herself in the third person. I’m not speaking for my employer in the above; just sharing a nice little “oh, right” moment that, no matter how hard I tried, wouldn’t fit in a tweet. If it piques your interest in the DPLA, all the better!

Lunch at Berkman: DDoS Attacks Against Independent Media and Human Rights Sites

Liveblogging Hal Roberts, Ethan Zuckerman and Jillian York’s presentation on Distributed Denial of Service Attacks Against Independent Media and Human Rights Sites at the Berkman Center. Please excuse misrepresentation, misinterpretation, typos and general stupidity.

*****

Hal begins by outlining the history of denial of service attacks, which “have been around as long as the Internet.” The rise of botnets allowed for distributed denial of service (DDoS) attacks, in which the attacks are coming from multiple places at the same time. Early botnets were controlled by IRC; these days, many are operated through Twitter accounts.

Ethan points out that we’re seeing a rise in botnets being used to attack each other. One of the largest Internet outages of all time — 9 hours long, in China — was caused by a botnet-fueled “turf war” between two online gaming providers.

(Interesting factoid: early DDoS defense systems grew out of the needs of online gambling sites that were being attacked; such sites operate in a gray area and may not want to ask the authorities for help defending themselves.)

Arbor’s ATLAS, which tracks DDoS attacks worldwide, estimates that 500-1500 attacks happen per day. Hal & Ethan believe that ATLAS “only sees the big ones,” meaning the 500-1500 number is a gross underestimate.

DDoS attacks comprise a wide variety of approaches: slowloris attacks tie up a server by holding many connections open with deliberately slow, incomplete requests, while random incessant searches force a server to repeatedly execute database calls, using up all available resources. These two examples are application attacks that essentially “crash the box” (overwhelm a single server). Network attacks that involve volunteers, bots, and/or amplifiers work by “clogging the pipe,” or choking the flow of traffic, for example by requesting huge amounts of data that flood a server’s connection.

People who face DDoS attacks have several options. One is to obtain a better machine with a higher capacity to handle requests. Another is to rent servers online in order to add resources only when they’re needed. Packet filtering can block malicious traffic (assuming it can be identified); scrubbing involves having a data center filter packets for you. Source mitigation and dynamic rerouting are used when the network is flooded, at which point packet filtering and scrubbing are impractical. Both tactics involve preventing that flood of traffic from arriving, whether by stopping it in its tracks or by sending it somewhere else.
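
(To make the packet-filtering idea concrete, here is a toy sketch of per-client rate limiting. In practice this kind of filtering happens at the network edge or in a scrubbing center rather than in a few lines of application code, and the threshold below is purely illustrative.)

```python
# Toy sketch of the "packet filtering" idea: drop requests from any client
# that exceeds a per-minute budget. Real filtering happens at the network
# edge (routers, scrubbing centers); this only illustrates the concept.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120  # illustrative threshold, not a recommendation

recent_requests = defaultdict(deque)  # client IP -> timestamps of recent requests

def allow_request(client_ip):
    now = time.time()
    timestamps = recent_requests[client_ip]
    # Discard timestamps that have fallen outside the window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    if len(timestamps) >= MAX_REQUESTS_PER_WINDOW:
        return False  # treat as part of the flood and drop it
    timestamps.append(now)
    return True
```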

All of these tactics are problematic in some way: they’re expensive (scrubbing can cost $40,000-50,000 per month), they require considerable advance planning or high-level connections, or they’re tricky to execute (the “dark arts” of DDoS defense).

“All of this is background,” Hal says. Their specific research question involves independent media and human rights sites — what kinds of DDoS attacks are used against them, and how often? How can they defend themselves?

Hal describes a “paradox” of DDoS attacks: overall, the defenses are working pretty well. Huge sites — Google, the New York Times, Facebook — are attacked often, but they manage to stay online. This is because these sites sit close to the core of the network, where around 75% of ISPs are able to respond to DDoS attacks in less than an hour, making DDoS a “manageable problem.” The sites at the edge of the network are much more vulnerable, and they’re also much more likely to be attacked.

Ethan describes the case of Viet Tan, which is under DDoS attacks almost constantly — to the extent that when they put up a new web service, it is attacked within hours. As a result, Viet Tan has shifted many of their new campaigns to Blogger (blogspot.com) blogs.

Viet Tan is struggling in particular because they’re not only experiencing DDoS attacks. They also face filtering at the national level, from a government that wants to prevent people in Vietnam from accessing their site. Ethan says that 81% of sites in the study that had experienced a DDoS attack have also experienced intrusion, filtering, or another form of attack. In the case of Viet Tan, the site was being attacked, unknowingly, by its own target audience, many of whom were using a corrupted Vietnamese keyboard driver that allowed their computers to be used as part of a botnet attack.

One of the big problems for sites that are DDoS-ed is that their ISPs may jettison them in order to protect other sites on the same server. Of the attacked sites in the study, 55% were shut down by their ISP, while only 36% were successfully defended by their ISP.

An attack against Irrawaddy, a Burmese activist site hosted in Thailand, essentially knocked all of Thailand offline. In response, Irrawaddy’s ISP asked it to move elsewhere. This year, Irrawaddy was hit by an even larger attack. It was on a stronger ISP that might have been able to protect it, but it hadn’t paid for the necessary level of protection and was again shut down.

Hal and Ethan suggest that a system of social insurance is emerging online, at least for larger sites — everything is starting to cost a little bit more, with the extra cost subsidizing the sites that are attacked. The problem is that small Internet sites aren’t protected, because they’re not in the core.

Hal and Ethan wonder whether someone should build dedicated human rights hosting to protect these sites from attacks. The problem with this is that it collects all these sites into a single location, meaning any company that hosted a group of these sites would be a major target for DDoS attacks. Devising a fair pricing system in this case is tricky.

Ethan raises the issue of intermediary censorship — the constant threat that your hosting company may shut your site down for any reason (e.g., when Amazon shut down Wikileaks). This is a problem of Internet architecture, he says, and there are two solutions: building an alternative, peer-based architecture, or creating a consumer movement that puts sufficient pressure on hosting companies not to take sites down.

What Hal and Ethan ended up recommending to these sites is to have a back-up plan; to minimize dynamic pages; to have robust mirroring, monitoring and failover; to consider hosting on Blogger or a similar large site; and to avoid using the cheapest hosting provider.

Within some communities, Ethan says, a person or group emerges as the technical contact, advocating for sites that are under attack. These “tech leaders” are connected to one another and to companies in the core that want to help. The problem is that this isn’t a particularly scalable model — a better chain needs to be established, so that problems can escalate through a team of local experts up to larger entities. In the meantime, it’s essential to increase organized public pressure on private companies not to act as intermediary censors, but rather to support these sites.

Juliet Schor on “Post-Industrial Peasants”

Liveblogging Juliet Schor’s presentation “Using the Internet to ‘Save the Planet'” at the Berkman Center. Please excuse misrepresentation, misinterpretation, typos and general stupidity.

*****

Sociology professor Juliet Schor is at the Berkman Center today to talk about how the sustainability community — both activists and practitioners — is increasingly using the Internet to “foster new lifestyles, consumption patterns and ways of producing.” Her presentation is based on her recent book Plenitude: The Economics of True Wealth, in which Schor argues that by shifting to a more sustainable way of life, we can improve both the environment and our economic situation. While writing the book, Schor says, she came to believe that the sustainability and technology communities should have a much closer relationship.

It sounds crazy — “post-industrial peasants” — but there are some important features to the idea: diversity of activities and income streams is key. Putting all your eggs in one employer’s basket is riskier and riskier in times of economic uncertainty. The single-income-stream strategy is becoming less attractive, and diversification is smart. The reason this makes sense now, in a way it wouldn’t have 50 years ago, is technology: access to the network and to information allows a single individual or a small company to be productive in ways they couldn’t have been before. This is the next stage after “big.” Large economies of scale will matter less going forward, and small-scale efforts will matter more.

Schor starts out by describing a “dramatic collapse” in biodiversity since the 1970s, the growing ecological footprints of different countries (hint: the United States is at the top, using more than four times the world average biocapacity per person), and our collective failure to reduce global carbon dioxide emissions. (She points out that recent data shows the best way to reduce emissions is economic collapse, though that’s not practical as a long-term strategy.)

Schor argues that a purely technological approach won’t halt climate change — this is also a problem of scale. According to a recent paper in Nature, we have already exceeded two of nine different “planetary boundaries” (in categories such as climate change, ocean acidification, biodiversity loss, and others), and we’re close to hitting the sustainable boundaries on a number of others. The strategy of de-materialization (reducing the “material intensity” of our energy use) has had some success, but our economic growth has “more than outweighed the decline” in material intensity. On a worldwide basis, Schor says, our material intensity has actually increased by around 45%, while North America has been a particularly egregious user of materials — our material extraction has increased by about 66% since 1980. This is largely due to our use of fossil fuels and the construction boom.

The Challenges

Schor argues that the world needs to cut its ecological impact rapidly. The problem, she says, is that we’re in the midst of an unemployment crisis. This is a disaster both economically and environmentally speaking. We also shouldn’t take any paths that worsen the distribution of wealth — there’s a negative correlation between income inequality and certain environmental indicators — or decrease human development (i.e., wealth and well-being) overall.

Plenitude: The Economic Model

Switching to green technology (a clean consumption and production system) will help, Schor says. So will improving eco-knowledge, which she defines as “open source transmission and ecological skill diffusion.” We’re “centuries behind” in terms of developing both an understanding of nature as a scarce resource and technology that would allow us to increase the productivity of that resource.

Schor points to working hours, which declined dramatically between 1870 and the 1970s (from around 3,000 to around 2,000 per year). Since 1973, however, annual hours worked have been increasing in the United States. A country’s ecological footprint rises with its average annual hours worked, even when income is held constant. Schor says that as we move forward, we need to achieve productivity growth in fewer working hours rather than by adding new hours. She wants to move hours from the “business as usual” economy to “self-providing” and green entrepreneurship. This would reduce market dependence and reliance on large corporations and give people more time to increase their skills, build local resilience, and help create a small-scale, low-impact sector of enterprises.

Schor provides an example in the form of permaculture (a high-productivity approach to agriculture) and urban agriculture. This form of micro-generation, which applies not only to farmers’ markets and fruits and vegetables but also to energy and homes (DIY yurts, anyone?), is low-cash and low-footprint in comparison to more market-driven methods and mechanisms. Schor is currently working on a number of other case studies, including a permaculture farm in the Netherlands and a converted soybean farm in Kansas built with fab lab technology. The Kansas farm is also trying to build a blueprint for other communities to follow. What’s cool about this, Schor says, is the low financial barrier to entry: communities can purchase the machines, and the costs of materials are low.

Schor’s also interested in the principle of sharing: couches, homes, cars, tools, etc. She says the recession has “changed the calculus of time and money,” creating an environment that fosters these sorts of sharing schemes. Another initiative that has sprung up in this environment is the transition movement, which focuses on helping communities build local resilience.

Overall, Schor says, our constraint is much more about time — we work long hours in formal jobs, which we need in order to have access to health insurance, housing, and education. We need to find ways to allow people to “delink” from these high-footprint jobs so they can do more of this kind of small-scale activity.

Jumble.

Two weekends ago I moved out of my summer sublet and into an attic apartment the new roommate and I have dubbed the Sky Parlor. I’ve yet to unpack, partially because I’m overwhelmed by all the boxes and partially because roommate and I have half-formed plans to build this in our kitchen:

[Image: DIY shelves, via Hindsvik]

Want.

What’s the use of unpacking and putting things on shelves, really, if we’re just going to have to move them all again?

The Sky Parlor is lovely, full of light and breeze, except I can’t figure out how to turn the oven on. I grew up with electric kitchen appliances, and I’m terrified that too much messing around might result in a flaming ball of natural gas.

What I’m trying to say is that if I don’t show up to my new job tomorrow, you should probably call the gas company and see if there have been any explosions.