From Big Data to Boutique Data


This blog post is a small excerpt from a chapter on open data co-authored by Ball, Tarez Samra Graban, and Michelle Sidler and submitted to Networked Humanities (eds. McNely & Rice).

While authors of the other posts in this data carnival offer fantastic how-tos, tips, and techniques for working with your data, this post takes a step back and urges humanities researchers to collect and publish their data for others to potentially use. The phrase “big data” has become synonymous in the digital humanities with funding-agency projects such as the National Endowment for the Humanities’ Digging Into Data challenge. Big data as a concept speaks to the breadth and depth of computational data researchers have access to:

Now that we have massive databases of materials used by scholars in the humanities and social sciences — ranging from digitized books, newspapers, and music to transactional data like web searches, sensor data, or cell phone records — what new, computationally-based research methods might we apply?

The Digging Into Data challenge brings collaborative, international researchers together to dig into an ever-growing set of databases that house massive digital collections (e.g., JSTOR, Project MUSE, HathiTrust, the Internet Archive, the New York Public Library, the Library of Congress, and the National Archives). One project from the first round of funding was “Digging Into Image Data,” led in the U.S. by digital humanities and rhetoric scholar Dean Rehberger, in which his team analyzed authorship in manuscripts, maps, and quilts from the 15th through the 20th centuries. This is just one small example of the huge corpora—big data in every sense—that these Digging Into Data challenge projects use.

It is important to note the resources required for these projects: economically, they cost hundreds of thousands of dollars in funding supplied by two or more international funding agencies, coupled with the human resources of having international partnerships already in place; and physically, they require big computing—supercomputing, in fact—to algorithmically sort and analyze all that big data.

In the rush to embrace big data as a funding stream for large, networked humanities projects, researchers in rhetoric and composition and the digital humanities might miss the value in the thousands, perhaps hundreds of thousands, of uncounted data sets we already have. These data sets come from what seem to be small-scale projects in relation to big data—small data sets such as assessments from our writing programs, student writing and projects from a class study, interviews from writers, scholarship collected under a single topic, discourse analyses from dissertation projects, data mining from job lists or journals in our field, and even data sets generated by historical questions of rhetorical velocity.

Library and information science professor P. Bryan Heidorn (2008) calls this kind of data dark data: “Like dark matter, this dark data on the basis of volume may be more important than that which can be easily seen” (p. 281). It’s the kind of data that rhetoric and composition scholars may already be interested in and already have access to. Heidorn notes that “[t]here may only be a few scientists worldwide that would want to see a particular boutique data set but there are many thousands of these data sets” (p. 282), and access to these sets can have a huge impact on research in science—or, we contend, in the humanities. Thus the phrase boutique data seems apropos to describe the plethora of currently inaccessible sets of qualitative and quantitative data that exist behind much humanities scholarship. These data sets are often small and built from local contexts, but when combined with other such sets, they are rich in potential new sources of inquiry and knowledge.

Several online sites have started to collect summaries of research studies and data pertaining to writing studies:

  • REx (Research Exchange), which annually publishes peer-reviewed research reports about boutique writing studies research

  • WritingPro: Knowledge Center for Writing Process Research, which built on the idea of REx for an audience of international/EU writing researchers

  •, which is an institutionally independent, centralized site for writing researchers to make their own datasets publicly accessible

A primary hurdle to this work succeeding is writing studies’ slow adoption of research as data-driven and generalizable in the sense that Institutional Review Boards define research (see Banks & Eble, 2006; McKee, 2004). But with a few minor changes in terminology on those seemingly onerous IRB forms—from “students” to “research participants,” from “student assignments” to “participant data,” from “classroom” to “study,” for instance—our research fits easily within the forms of those oversight committees. As well, rather than defaulting to the assumption that destroying our data after X number of years is always the appropriate choice, we should approach our research as scientists do, understanding that data are as important as our analysis of them and that others may—and will—find value in them.
