The very designation of “Big” Data suggests that the size of datasets is the dividing line distinguishing them from “Small” Data (the surveys and questionnaires traditionally used in social science and statistics). But is that all, or are there other, and perhaps more profound, differences?
Let’s start from a widely accepted, size-based definition. In its influential 2011 report, the McKinsey Global Institute describes Big Data as:
“datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze”.
Similarly, O’Reilly Media (2012) defines it as:
“data that exceeds the processing capacity of conventional database systems”.
The literature goes on to discuss how to quantify this size, typically measured in bytes. McKinsey estimates that:
“big data in many sectors today will range from a few dozen terabytes to multiple petabytes (thousands of terabytes)”.
This threshold is not set in stone, though: it depends both on technological advances over time and on the characteristics of specific industries.
What the literature does not always say explicitly is that, at such sizes, quantity affects quality: Big Data not only differ in size from Small Data, but have to be handled in completely different ways. Even their capture and storage require different computing technologies because, by definition, conventional database systems cannot process them. Analysis is more complex still.
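To make the contrast concrete, here is a rough, hypothetical Python sketch (the file and column names are invented for illustration, not taken from the report): an average that Small Data lets you compute in one pass over an in-memory table has to be computed incrementally, chunk by chunk, once the file no longer fits in memory.

import pandas as pd

# Small Data: the whole survey fits in memory, so one call suffices.
survey = pd.read_csv("survey_sample.csv")   # hypothetical file
print(survey["earnings"].mean())

# Big Data: the file is far larger than RAM, so it is streamed in chunks
# and the mean is accumulated from running totals instead.
total, count = 0.0, 0
for chunk in pd.read_csv("big_extract.csv", chunksize=1_000_000):
    total += chunk["earnings"].sum()
    count += len(chunk)
print(total / count)

At petabyte scale even chunked reading on a single machine gives way to distributed systems, but the principle is the same: the computation has to be reorganised around the data rather than the data loaded into the computation.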
Big Data do not offer us more detail about, say, the long-studied (though still largely elusive) relationship between schooling and earnings, about which social scientists and policymakers would so much like to know more: they do not cover a larger number of individuals, longer periods of time, or more frequent waves of observation. Rather, Big Data include a lot of different information and typically contain a lot more noise. O’Reilly captures the experience of many when it says that:
“Big data practitioners consistently report that 80% of the effort involved in dealing with data is cleaning it up in the first place”.
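What that clean-up looks like in practice varies, but a hypothetical pandas sketch gives the flavour (again, the file and column names are invented): before any analysis, duplicated records are dropped, rows missing key fields are removed, timestamps are parsed and unparseable ones discarded, and implausible values are filtered out.

import pandas as pd

raw = pd.read_csv("raw_events.csv")   # hypothetical raw extract

cleaned = (
    raw
    .drop_duplicates()                            # repeated records
    .dropna(subset=["user_id", "timestamp"])      # rows missing key fields
    .assign(ts=lambda d: pd.to_datetime(d["timestamp"], errors="coerce"))
    .dropna(subset=["ts"])                        # unparseable dates
    .query("(amount > 0) & (amount < 1000000)")   # implausible outliers
)

None of these steps is analysis in any interesting sense, yet each embodies a judgement about what counts as noise, which is exactly why this stage deserves more scrutiny than it usually gets.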
That is why the quality of Big Data is an issue. Their analysis requires more thought than has been devoted to it so far. And this is also why, after all, Small Data still retain much of their appeal.