By Brandon Keim
Google’s massive trove of scanned books could be useful for researchers studying the evolution of culture.
In a paper published December 16 in Science, researchers turned part of that vast textual corpus into a 500-billion-word database in which the frequency of words can be measured over time and space.
Their initial subjects of analysis, including cultural trajectories of popular modern thinkers and the conjugation of irregular verbs, hint at what might be done.
“There are many more questions, that we could never think of, that this data makes possible,” said Harvard University evolutionary dynamicist Jean-Baptiste Michel. “What we present in the paper is our first explorations of what becomes possible when you have this dataset.”
The new research is part of an emerging approach to applying rigorous statistical analyses, traditionally known from the study of biological evolution, to cultural evolution.
Unlike biological evolution, however, which can be studied through the fossil record and in genomic comparisons, cultural evolution has proved difficult to study.
Researchers have used archaeological documentation of Polynesian canoe shapes and records painstakingly assembled by comparative linguists, but rich and rigorously compiled datasets are rare.
One potential source is Google, which has scanned some 15 million books, or roughly 12 percent of all books ever published. Michel and his colleagues turned one-third of these, selected for legibility and fully documented origins, into a massive word database.
Patterns that can be queried from its cloud are not necessarily answers unto themselves, they say, but a way of illuminating subjects of further investigation.
“It’s not just an answer machine. It’s a question machine,” said study co-author Erez Lieberman-Aiden, a computational biologist at Harvard University. “Think of this as a hypothesis-generating machine.”
In the new study, the researchers restricted their queries to single words and names, as more sophisticated querying raised the possibility of copyright violation. (Google and book publishers are currently negotiating terms of access to copyrighted material, putting scientific accessibility and legal restrictions at odds.)