Posts Tagged ‘ Data analysis ’

Science XXL: digital data and social science

Last week I attended (unfortunately only in part) an interesting workshop on how today’s abundance and diversity of digital data affect social science practices, aptly called “Science XXL”. A variety of topics were discussed and different research experiences were shared, but I’ll just summarize here a few lessons learned that I find interesting.

  • Digital data are archive data. Data retrieved automatically from the digital traces of individual actions, for example those mined from the APIs of platforms such as Twitter, are unlike survey data in that they were not originally recorded for research purposes. The researcher must select relevant records on the basis of some understanding of the conditions under which these data were produced. Perhaps ironically, digital data share this characteristic with data from historical or literary archives.
  • Digital data are not necessarily “big”, in the sense that their volume is often small (at least in social science research so far!), even though they may share other characteristics of big data such as velocity (being generated on the fly as people use digital platforms) or variety (having little or no structure).
  • Digital data can help fill gaps in survey data, for example when survey sampling is not statistically representative: detail and volume can provide extra information that supports general conclusions.
  • Unclean data, outliers and aberrant observations may be very informative, revealing details that would escape attention if researchers focused only on the average or center of the distribution (the normal distribution cherished in classical statistical approaches). Special cases are no longer a prerogative of qualitative research.
  • Data analysis is a key ingredient of “computational social science”, a field that is growing in importance after an initial phase in which it was largely confined to agent-based simulation and complexity theory.
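The point about outliers can be made concrete with a minimal sketch. The numbers below are invented for illustration, and the Tukey-fence rule (1.5 times the interquartile range) is just one common convention for flagging unusual values, not something discussed at the workshop.

```python
import statistics

# Hypothetical daily message counts for ten platform users (made-up numbers).
messages_per_day = [3, 4, 5, 4, 6, 5, 3, 4, 5, 250]

# The mean is pulled up by a single heavy user; the median is not.
mean = statistics.mean(messages_per_day)
median = statistics.median(messages_per_day)


def flag_outliers(values, k=1.5):
    """Return the values lying outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


outliers = flag_outliers(messages_per_day)
print(f"mean = {mean}, median = {median}, outliers = {outliers}")
```

Looking only at the mean would hide the fact that one user behaves very differently from the rest. Flagging that user, rather than discarding the record as noise, is what opens the door to studying the special case.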

International Program in Survey and Data Science

A new, master’s level programme of study in Survey and Data Science is to be offered jointly by the University of Mannheim, the University of Maryland, the University of Michigan, and Westat. Applications for the first delivery are accepted until 3 January, for a start in Spring 2016. Prospective students are professionals with a first degree, at least one year of work experience, and some background in statistics or applied mathematics. All courses are delivered in English, fully online, to small classes (it’s not a MOOC!). Tuition is free, thanks to support from German public funds, at least for the first few cohorts.

What is most interesting about this master’s programme is its twofold core, involving both more classical survey methodology and today’s trendy data science. Fundamental changes in the nature of data, their availability, and the way in which they are collected, integrated, and disseminated have found many professionals unprepared. These changes are partly due to “big” data from the internet and digital devices becoming increasingly predominant relative to “small” data from surveys. Big data offer the benefit of fast, low-cost access to an unprecedented wealth of informational resources, but also bring challenges as these are “found” rather than “designed” data: less structured, less representative, less well documented (if at all…). In part, these changes are also due to the world of surveys changing internally, with new technical challenges (regarding for example data preservation, in a world of pre-programmed digital obsolescence), legislative issues (such as those triggered by greater awareness of privacy protection), increased demand by multiple users, and a growing need to merge surveys and data from other (such as business and administrative) sources. It is therefore necessary, as the promoters of this new study programme rightly recognize, to prepare students for the challenges of working both with designed data from surveys and with big data.

It will be interesting to see how data science, statistics, and social science / survey methodology feed into each other and support each other (or fail to do so…). There is still work to be done to develop techniques for analyzing data that allow us to gain insights more thoroughly, not just more quickly, and help us develop solid theories, rather than just uncovering new relationships that might eventually turn out to be spurious.


Databeers now in London

In the midst of the chaos and sadness of the past week, a more leisurely note: the first of a new “Databeers” series of events took place in London yesterday evening, following a format that has enjoyed huge success in Spain, Italy and other countries. The event is very informal, and getting to know other data enthusiasts is the main goal. There are a few flash talks, with free beers and networking time.


The next Data Beers London event is on 25 February 2016.

 

Big data and history


A paper archive – more and more often replaced by digitised versions today.

Yesterday at the Bibliothèque nationale de France, I took part in a panel discussion on longue durée in history, organised by the Revue Annales – Histoire et Sciences Sociales. Of course I am not a historian, and I wouldn’t be able to tell whether one interpretation of longue durée is better than another. But historians are now raising questions that are common to the social sciences and humanities more generally: how to benefit from big data and how to re-think the political engagement of the researcher. So I was there to talk about big data and how they change not just research practices and methods, but also researchers’ position relative to power, politics, and industry. These questions cross disciplinary boundaries, and all may benefit from dialogue.


Collections of older sources are now often online, which enables the application of new methods.

What ignited the historians’ debate was an attempt by two leading scholars, David Armitage and Jo Guldi, to restore history’s place as a critical social science, based on (among other things) increased availability of large amounts of historical data and the digital tools necessary to analyze them. Before their article in Annales, they published a full book in open access, The History Manifesto, where they develop their argument in more detail. Their writing is deliberately provocative, and indeed triggered strong (and sometimes very negative) reactions. Yet the sheer fact that so many people took the trouble to reply proves that they struck a chord.

What do they say about big data? They highlight the opportunity to access large and rich archives and to expand research beyond any previous limitations. Their enthusiasm may seem excessive, but it is entirely understandable insofar as their goal is to shake up their colleagues. My approach was to take their suggestion seriously and ask: what opportunities and challenges do data bring about? How would they affect research, especially for historians?


The fuzziness of Big Data

Fascinating as they may be, Big Data are not without problems. Size does not eliminate the problem of quality: because of the very way they are collected, Big Data are unstructured and unsystematized, the sampling criteria are fuzzy, and the classical statistical analyses do not apply very well. The more you zoom in (the more detail you have), the more noise you find, so that you need to aggregate data (that is, to reduce a “big” micro-level dataset to a “smaller” macro one) to detect any meaningful tendency. Analyzing Big Data as they are, without any caution, increases the likelihood of finding spurious correlations – a statistician’s nightmare! In short, processing Big Data is problematic: although we do have sufficient computational capacity today, we still need to refine appropriate analytical techniques to produce reliable results.
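Both effects can be reproduced on simulated data. The sketch below is only an illustration under invented assumptions: the slope, noise level, group sizes and seed are arbitrary choices, and `pearson` is a small helper written for this example rather than a library function.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rng = random.Random(42)  # fixed seed so that the sketch is reproducible

# 1. Aggregation: a weak trend (slope 0.5) buried in heavy noise (sd 10).
groups = range(10)
micro_x, micro_y = [], []
for g in groups:
    for _ in range(1000):
        micro_x.append(g)
        micro_y.append(0.5 * g + rng.gauss(0, 10))
micro_r = pearson(micro_x, micro_y)

# Reduce the "big" micro-level dataset to one average per group.
macro_x = list(groups)
macro_y = [
    sum(y for x, y in zip(micro_x, micro_y) if x == g) / 1000 for g in groups
]
macro_r = pearson(macro_x, macro_y)

# 2. Spurious correlations: 200 pure-noise variables, 20 observations each,
#    all generated independently of the target.
target = [rng.gauss(0, 1) for _ in range(20)]
noise_vars = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(200)]
best_spurious_r = max(abs(pearson(v, target)) for v in noise_vars)

print(f"micro-level r = {micro_r:.2f}, aggregated r = {macro_r:.2f}")
print(f"largest |r| among unrelated variables = {best_spurious_r:.2f}")
```

In the first part, the trend is barely visible at the level of individual observations but stands out once the data are averaged by group. In the second part, none of the 200 variables has anything to do with the target, yet the best of them shows a sizeable correlation purely by chance, which is what happens when many variables are screened without caution.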

In a sense, the enthusiasm for Big Data is diametrically opposed to another highly fashionable trend in socioeconomic research: that of using randomized controlled trials (RCTs), as in medicine, or at least quasi-experiments (often called “natural experiments”), which make it possible to collect data under controlled conditions and to detect causal relationships much more clearly and precisely than in traditional, non-experimental social research. These data have a lot more structure and scientific rigor than old-fashioned surveys – just the opposite of Big Data!

This is just anecdotal evidence, but try a quick Google image search for RCTs vs. Big Data. Here are the first two examples I came across: on the left are RCTs (from a dentistry course), on the right are Big Data (from a business consultancy website). The former conveys order, structure and control, the latter a sense of being somewhat lost, or of not knowing where all this is heading… Look for other images: I’m sure the great majority won’t be that different from these two.

[Image: RCTs (left) vs. Big Data (right)]


What is data?

All the hype today is about Data and Big Data, but this notion may seem a bit elusive. My students sometimes struggle to understand the difference between “data” and “literature”, perhaps because of the unfortunate habit of calling library portals “databases”. Even colleagues are sometimes uncomfortable with the notion of data (whether “big” or “small”) and the breadth it is now taking. So, a definition can be helpful.

Data are pieces of unprocessed information – more precisely raw indicators, or basic markers, from which information is to be extracted. Untreated, they hardly reveal anything; subject to proper analysis, they can disclose the inner workings of some relevant aspects of reality.

The “typical” example of socioeconomic data is the observations/variables matrix, where each row represents an observation – an individual in a population – and each column represents a variable – a particular indicator about that individual, for example age, gender, or geographical location. (In truth data types are more varied and may also include unstructured text, images, audio and video; but for the sake of simplicity, let’s stick to the matrix here.)
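As a minimal sketch, here is what such a matrix might look like in code. The three individuals and their values are entirely hypothetical, and `column` is a helper defined only for this example.

```python
# Hypothetical survey records: one row per individual, one column per variable.
variables = ["age", "gender", "location"]
observations = [
    [34, "F", "Paris"],
    [51, "M", "London"],
    [27, "F", "Mannheim"],
]

# A row is everything recorded about one individual.
first_individual = dict(zip(variables, observations[0]))


def column(name):
    """Return one variable (column) across all individuals (rows)."""
    idx = variables.index(name)
    return [row[idx] for row in observations]


# A column is one indicator observed across the whole population.
ages = column("age")
mean_age = sum(ages) / len(ages)

print(first_individual)
print(f"ages = {ages}, mean age = {mean_age:.1f}")
```

Reading the matrix by row tells you about an individual; reading it by column tells you about the population, which is where the analysis usually starts.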



Big data: Quantity or quality?

The very designation of “Big” Data suggests that size of datasets is the dividing line, distinguishing them from “Small” Data (the surveys and questionnaires traditionally used in social science and statistics). But is that all – or are there other, and perhaps more profound, differences?

Let’s start from a well-accepted, size-based definition. In its influential 2011 report, the McKinsey Global Institute depicts Big Data as:

“datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze”.

Similarly, O’Reilly Media (2012) defines it as:

“data that exceeds the processing capacity of conventional database systems”.

The literature goes on to discuss how to quantify this size, typically measured in terms of bytes. McKinsey estimates that:

“big data in many sectors today will range from a few dozen terabytes to multiple petabytes (thousands of terabytes)”

This is not set in stone, though, depending on both technological advances over time and specific industry characteristics.
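To get a feel for these orders of magnitude, a bit of arithmetic helps. The sketch below assumes decimal (SI) units, and the survey dimensions are hypothetical figures chosen for illustration, not taken from the McKinsey report.

```python
TB = 10**12  # one decimal (SI) terabyte, in bytes
PB = 10**15  # one decimal (SI) petabyte, in bytes

# "Thousands of terabytes": how many terabytes make up one petabyte.
terabytes_per_petabyte = PB // TB

# Hypothetical survey: 10,000 respondents x 200 numeric variables x 8 bytes.
survey_bytes = 10_000 * 200 * 8

# How many such surveys would fit into a single petabyte.
surveys_per_petabyte = PB // survey_bytes

print(f"1 PB = {terabytes_per_petabyte:,} TB")
print(f"one survey = {survey_bytes:,} bytes")
print(f"surveys per PB = {surveys_per_petabyte:,}")
```

Under these assumptions a fairly large survey takes up about 16 megabytes, so a single petabyte could hold tens of millions of them.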
