What’s Not in a Datamine

I think that quantitative research on extremely large textual datasets, whether it focuses on word usage, genre classification, the spatial mapping of publication and circulation, or other kinds of information that we can now collect and analyze, is potentially great stuff. I’ve expressed my enthusiasm in the past for the work of Franco Moretti and scholars like him, for example.

That said, don’t bring statistics to a hermeneutics fight. Or maybe to indulge in some more tedious territoriality, if you’re setting out to interpret the recurrence of words or phrases in very large textual datasets, it wouldn’t hurt to either have a humanist as a collaborating researcher or to at least seriously and repeatedly consider some of the fundamental challenges involved in interpreting texts, media and communication as humanists might.

A good example of this problem is the book A Billion Wicked Thoughts, which is getting the same weirdly uncritical reception by lifestyle journalists that evolutionary psychology written for general audiences often does, regardless of its quality. (I think it’s because evo-psych findings, both well-researched and silly, are so readily parsed out as little USA-Today-style social info-nuggets.)

A Billion Wicked Thoughts has drawn a lot of criticism throughout its research process, starting from the early moment when the authors embarked on an ill-considered research strategy using slash-fiction forums, the kind of strategy most of us would expect to see from an undergraduate who got started too late on a term paper. The final data they’re working with is more interesting and promising, but like many I’m really underwhelmed by what they do with it. Leaving aside the many detached essentialist pronouncements on the intrinsic nature of men, women, gays, straights, and so on, and leaving aside numerous more social-scientific questions about matching up queries and verifiable social data, there are some basic cautions, unconsidered here, that almost any humanist would employ in interpreting what a choice of words in a textually dense and fast-evolving communicative technology might mean, including being open to multiple meanings, to slippages in meaning, and to gaps between intention and cause of communication.

I don’t want to obsess too much over this one book, because I think this is an emerging problem at the intersection between digital humanities and other disciplines taking an interest in data-mining very large bodies of textual and communicative information. There’s no question, for example, that finding out what the top twenty words are in a given genre of literature, a given interval of publication, a particular text, the oeuvre of one author, and so on, is really interesting. This kind of data has the potential to force humanists to confront the systematicity of texts and intertextuality all at once, the kinds of complexity that we often push to the side in hand-crafting an interpretation of a theme, mood or tone. This kind of research can disrupt and confound interpretations whose authority derives from the cultural capital or institutional power of the interpreter, or from the degree to which an interpretation conforms to disciplinary trends and expectations. Looking for patterns and structures in very large textual datasets is a discovery tool, a way to inject strangeness and surprise into the work of interpretation. That potentially includes making some new disciplinary partners welcome at the table: cognitive explanations for the repetition of words or phrases, for example, make an interesting new kind of sense at this scale of reference.

What’s aggravating are the moments where those new dinner guests choose to make wild leaps (or inspire the budding digital humanists to follow) that would otherwise seem naive. Say, that the recurrence of a particular word or phrase in a group of texts or communications is a very straightforward indication of author intent, consciousness or behavior. Here the history of digital media is so spectacularly interesting for how it cautions against that kind of reading, for the way that words and phrases repeat and appear at times quite independently of any intent or practice by authors and readers. The author isn’t dead, exactly, but neither is the author always the agent of what is said or represented. You can see this with visuality as well, if you search Flickr or Tumblr. The images that appear in response to a search for a particular tag, or just in looking for patterns across very large displays of information, aren’t the pure expressions of the visual interests or imagination of the people who’ve posted them. They’re determined by what pictures could be taken: there are images (pornographic and otherwise) that audiences might look for which are never taken because they can’t be, not because they are unimagined or undesired. I can’t take a picture of my dreams, I can’t take a self-portrait while I’m asleep, and so on. There are images that exist in profusion because there’s a social cue to take them (birthdays, weddings), and images that exist in profusion because there are external conditions that make it more likely that the picture will be judged worth sharing (good outdoor lighting). There are images which exist in profusion because they conform to a genre of image, to understood rules of how that picture should be staged or framed or cropped or performed, and images which counter or comment upon genre expectations. All of this and much more goes for text. Search in particular is a textual act which is mapped into many feedback loops, not merely the one between textual expression and the psychology of the searcher.

A Billion Wicked Thoughts may be the worst-case scenario as far as methodological sloppiness goes, but it’s not the only recent project to show a tendency towards saying way too much about this kind of very large dataset. The problem isn’t insufficient data: it’s that there are things you’ll never be able to say in a fixed, final or scientific fashion about what text and representation mean.


As a coda, let me flip this point around in another direction. Certain kinds of conventional policy formation backed by a particular kind of social science could do worse than invite humanists into the mix when they’re dealing with complex systems. That is, if humanists were prepared to rethink what it is that they might bring to that kind of work, as opposed to just hassling policy-makers and quantitative social scientists about problems with their discursive practices.

Case in point: the New York Times Magazine had a piece last Sunday on how reforms aimed at breaking up long shifts that caused sleep-deprivation among internists at hospitals don’t seem to have produced the expected drop in medical errors. Some of the article reviewed the original research findings that associated sleep deprivation with errors, which were especially persuasive in the light of general work on sleep deprivation that conclusively demonstrated its impact on cognition and ability. The interesting idea that researchers are now focused on is that breaking up long shifts has caused problems with the hand-off of patients, which has increased errors and erased any gains from improved restfulness.

This is a really familiar kind of story. When it’s about policy or institutional reform, it often involves very real, tangible unintended consequences that follow from dealing with one identifiable problem in isolation, which pulls hard at a string that has all sorts of unseen connections. There’s a lot of social science that follows the same pattern, isolating a variable from a complex real-world system, ignoring its temporal and systemic complexities, modeling it in isolation, and then forgetting to put it all back together. I can pull a spark plug out of a car engine and properly conclude through study that it is necessary for the car to run, but I shouldn’t then run around flogging my sparkplug-centric theory of traffic conditions on Interstate 405.

The thing is that we don’t really have very good social science capable of dealing with complex systems that involve human beings without that kind of reduction. There are some good beginnings out there, but I think there are big problems with those alternatives that hope to eventually provide the same epistemological swagger that an economist who is armed with some variables that have been regressed into warm gruel can sport. I keep feeling that most complex systems are best appreciated as processes over time, best understood with as many simultaneous interactions and dimensions as the human mind can take in, and best examined with intuition, empathy and experience. Cultural anthropology’s preference for “participant-observation” analysis, for example, might be as good as it’s going to get when it comes to having an intellectually rigorous understanding of complex systems.

In the case of trying to reform institutions or set policies, I wouldn’t quite say that we should go with our guts. But I’d give a humanistically-inclined medical professional with experience in several hospitals at least as good odds of intuitively noticing a connection between sleep, length of shift, patient hand-off and medical error before a new policy gets set as I’d give an extensive and conventional study carried out by quantitative social scientists.

Patterns and structures that we find by methods that are not intuitive are terribly important for making a better world and understanding ourselves. There are clearly things about the physical universe, biological systems and human societies and psychologies that do not at all conform to our cognitive intuitions or our cultural and historical expectations. At the same time, I think we sometimes know a lot about the always-fuzzy, always-unresolved messiness of systemic interactions and the meaning of ideas and words, and this is a knowledge that can never be expressed in clean, empirical terms.

This entry was posted in Digital Humanities, Generalist's Work, Oh Not Again He's Going to Tell Us It's a Complex System.

4 Responses to What’s Not in a Datamine

  1. Dorothea says:

    When the authors of A Billion Wicked Thoughts started their ill-advised “research,” fanfiction communities took them to the cleaners right away, for reasons that you include here as well as other stunning methodology flaws. A pity the publisher didn’t check up on these two; they should never have been published, and they should emphatically not be taken as representative of the text-mining or digital-humanities communities.

  2. DannyScL says:

    The issue of residents’ hours, sleep deprivation, and hand-offs is one that I’ve thought about a lot in the past few years as my wife has transitioned from medical school to residency. Perhaps what’s been most interesting to me is the steady socialization that convinces them of the benefits of long shifts and the problems with excessive hand-offs (the short version: as a resident you just need a certain volume of patient cases in order to develop the knowledge and skills you need to be a good doctor; handing off patients to other doctors inevitably means that some information, some of which could be crucial to effective care, is lost). At this point I’m pretty convinced that they’re right (up to a point – there’s surely a level of sleep deprivation that would be disastrous for all involved). But for a long time I looked on these claims pretty skeptically, as nothing more than justification for maintaining the long hours that everyone goes through as part of medical training. It’s only after I’ve become aware of the complexity of medical decision-making and the mass of information required for that process that I’ve been persuaded.

    All of which is to say that I think you’re right – when we look at complex human systems, we need to take seriously the ideas of the participants in those systems to get a grasp on how they work. Relying solely on quantifiable inputs and outputs is going to leave important stuff out.

  3. john theibault says:

    Semi-related to Dorothea’s point: I don’t quite understand how the A Billion Wicked Thoughts website hoax mentioned in the Chronicle was supposed to expose the perfidy of their fan-fic critics. Just makes the authors seem like even bigger jerks.

    Your coda reminds me of the news a few weeks ago that Google was targeting humanities grads for jobs. Humanities “ways of knowing” can provide insights that isolating a variable and highlighting its effects does not. I found the response of the culturomics folks to Tony Grafton’s observation that no historians were included in their paper telling. They claimed that they had tried to recruit historians to the group, but in the end could not find any who had the time and the quantitative understanding to contribute effectively. I wondered at the time if the perceived lack of quantitative understanding of potential collaborators actually revealed an unwillingness to frame the issues in terms of what variables could be isolated. Even with encouragement from Robert Darnton and Mike McCormick, their launch of culturomics showed little sense of what historians would find interesting about their tool. I was pleased to hear that Ben Schmidt, a historian from Princeton, will be a fellow at the Harvard Cultural Observatory, working with the culturomics folks. I hope something interesting results.

  4. lisa nakamura says:

    “don’t bring statistics to a hermeneutics fight.” That’s a good one and I’m going to think of some reasons to use it. It reminds me of something a scrappy old guy would say in a movie about the South.
