World Statistics Day 2015

This week was World Statistics Day, celebrated at the UN and in individual countries around the world. While the day celebrated the successes of official statistics throughout its history of producing vital information for governments and citizens, this time much of the debate focused on its – more uncertain – future. The landscape is rapidly changing, swiftly shifting from a data-scarce to a data-rich world, from structured to unstructured data, from the quasi-monopoly of official statisticians on the production of information to fierce competition, from pure statistics to multi-disciplinarity and the rise of so-called “data science”. There are obvious opportunities, but also formidable challenges, and it is always difficult for large organisations (such as statistical institutes) to adapt.

The President of the IAOS urged official statisticians to stick to the UN-backed Fundamental Principles of Official Statistics as a guide. She focused on the efficiency and ethics of engaging with users and the private sector, combined with the rigour of methods, to deliver “better data for better lives” (the slogan of the day).

Research ethics in secondary data: what issues?

It is often believed that use of secondary data relieves the researcher from the burden of applying for ethical approval – and sometimes, from thinking about ethics altogether. But the whole process of research involves ethical considerations, whether or not any primary data collection is involved. This starts from the initial design of the study, which should aim at the public good (and at the very least should do no harm), and continues until communication of results, which should ensure transparency, openness and replicability. More specifically, what ethical issues will the data collection and analysis stages involve, when secondary data are used?

Secondary data are usually defined as those that were collected as part of a different research project, with purposes other than those of the present study. They may be official statistical data (census data for example, but also, increasingly, administrative data), data gathered by commercial operators (time series of stock prices for example), and researchers’ data from past projects. They are more often quantitative, although secondary analysis of qualitative data is becoming more and more common.

Weighing risks and benefits

Use of secondary data is, in itself, a highly ethical practice: it maximizes the value of any (public) investment in data collection, it reduces the burden on respondents, and it ensures replicability of study findings and therefore greater transparency of research procedures and integrity of research work. But the value of secondary data is only fully realized if these benefits outweigh the risks, notably in terms of re-identification of individuals and disclosure of sensitive information.

For this to happen, use of secondary data must meet some key ethical conditions:

  • Data must be de-identified before release to the researcher
  • Consent of study subjects can be reasonably presumed
  • Outcomes of the analysis must not allow re-identifying participants
  • Use of the data must not result in any damage or distress
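As a purely illustrative sketch of the re-identification condition above, one simple check counts how many records share each combination of indirectly identifying attributes and flags the combinations that are too rare. The function name `small_cells`, the threshold `k`, the attribute names and the toy records are all hypothetical and not taken from any particular dataset or official guideline.

```python
from collections import Counter


def small_cells(records, quasi_identifiers, k=5):
    """Return combinations of quasi-identifier values shared by fewer than k records.

    Rare combinations are the ones most likely to single out an individual.
    The default threshold k=5 is an arbitrary illustrative choice.
    """
    counts = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical toy records, for illustration only.
records = [
    {"age_band": "30-39", "region": "North", "income": 31000},
    {"age_band": "30-39", "region": "North", "income": 28000},
    {"age_band": "30-39", "region": "North", "income": 35000},
    {"age_band": "70-79", "region": "South", "income": 52000},
]

# Flag (age_band, region) combinations held by fewer than 3 records.
risky = small_cells(records, ["age_band", "region"], k=3)
print(risky)
```

A combination flagged here would typically call for coarser categories or suppression before results are published; which threshold is appropriate depends on the data provider's own rules.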

New publications on big data and official statistics

National Statistical Institutes (NSIs) have long been the recognised repositories of all socio-economic information, mandated by governments to collect and analyse data on their behalf. The development of big data is shaking this world. New actors are coming in and commercially-oriented, privately-produced information challenges the monopoly of NSIs. At the same time, NSIs themselves can tap into digital technologies and produce “big” data. More generally, these new sources offer a range of opportunities, challenges and risks to the work of NSIs.

The Statistical Journal of the IAOS, the flagship journal of the International Association for Official Statistics, has published a special section on big data – all the more interesting because it is free of charge!

Fride Eeg-Henriksen and Peter Hackl introduce this special section by defining big data and emphasising its interest for official statistics. But it is crucial, albeit admittedly not easy, to separate the hype around big data from its actual importance.

The other papers are concrete examples of how big data may be integrated into official statistics:

The power of survey data: Eurostat Users’ Conference

In the age of big data, social surveys haven’t lost their appeal and interest. Surveys are the instrument through which governments, for a long time, have gathered information on their population and economy to inform their choices. Interestingly, surveys conducted by, or for, governments are the best in terms of quality and coverage: because significant resources are invested in their design and realization, and especially because participation can be made compulsory by law (they are “official”), their sampling strategies are excellent and their response rates are extremely high. (Indeed, official government surveys are practically the only case in which the “random sampling” principles taught in theoretical statistics courses are actually applied.) In short, these are the best “small data” available, and their qualities make them superior to many a (usually messy) big data collection. It is for this reason that surveys from official statistics have always been in high demand by social researchers.

Training in European data: EU-SILC

Official statistical surveys are still the best sources of data in terms of quality. In practice, they are the only ones that apply random sampling, and the legal obligation to respond makes the actual sample very close to the targeted one. No other approach to data collection can hope to do as well.

The European Union Statistics on Income and Living Conditions (EU-SILC) is an instrument aiming at collecting timely and comparable cross-sectional and longitudinal multidimensional microdata on income, poverty, social exclusion and living conditions. It started in 2003 with a small group of participant countries, and was enlarged in 2004. It is one of the richest sources of information on the daily life conditions of Europeans.

EU-SILC data are available for research use, but many barriers exist and these data are actually underutilized. On the one hand, the fact that access is legally authorised does not make it practically straightforward – the application process can be lengthy and costly. On the other hand, the very handling of data requires some specific knowledge and skills.

The Data without Boundaries European initiative, aimed at moving forward research access to official data, organises a training programme on EU-SILC with a specific focus on the longitudinal component. Local organisation lies with Réseau Quetelet, and the training course is hosted by GENES (Groupe des Écoles Nationales d’Économie et Statistique); both are in Paris (France).

Big Data and social research

Data are not a new ingredient of socio-economic research. Surveys have long served the social sciences; some of them, like the European Social Survey, are (relatively) large-scale initiatives, with multiple waves of observation in several countries; others are much smaller. Some of the data collected were quantitative, others qualitative or mixed-methods. Data from official and governmental statistics (censuses, surveys, registers) have also been used a lot in social research, owing to their large coverage and good quality. These data are ever more in demand today.

Now, big data are shaking this world. The digital traces of our activities can be retrieved, saved, coded and processed much faster, much more easily and in much larger amounts than surveys and questionnaires. Big data are primarily a business phenomenon, and the hype is about the potential gains they offer to companies (and allegedly to society as a whole). But, as researcher Emma Uprichard rightly says in a recent post, big data are essentially social data. They are about people, what they do, how they interact together, how they form part of groups and social circles. A social scientist, she says, must necessarily feel concerned.

It is good, for example, that the British Sociological Association is organizing a one-day event on The Challenge of Big Data. Its point that members must engage with big data is well taken. This challenge goes beyond the traditional qualitative/quantitative divide and the underrepresentation of the latter in British sociology. Big data, and the techniques to handle them, are not statistics, and professional statisticians have trouble with them too. (The figure below is just anecdotal, but clearly suggests how a simple search on the Internet identifies Statistics and Big Data as unconnected sets of actors and ties.) The challenge has more to do with the a-theoretical stance that big data seem to involve.


New data opportunity for socioeconomic research

If you are a researcher in economics, demography, sociology, geography or political science, you may have experienced the frustration of discovering a relevant data resource and being denied access to it – typically on the grounds that data release would violate the confidentiality of data subjects. Or you may have heard of fantastic analyses – with all the fancy new statistical and econometric tools and software that are increasingly in fashion today – done with large amounts of very detailed microdata, but you have no clue how to do anything like that yourself. Maybe you have tried to look at the website of some public administration that likely holds the data you want – like labor market or business data – but could not figure out how to ask for these data in the first place. And if you ever tried to access data from two or more different countries, you probably found even the task of finding out how to apply in different systems daunting.

Now, there is a great opportunity for you to get closer to your goal. The European project “Data without Boundaries” (DwB) offers social scientists from across Europe funding, information and support to access household surveys and business data from public-sector records in special Research Data Centers in France, Germany, the Netherlands and the UK. These are highly detailed microdata at individual level; they cannot be publicly released, but access can be legally given for scientific and statistical research purposes.

Both established researchers and PhD students are welcome to apply, and should do so in a country different from the one where they reside. There is a preference for comparative, cross-country projects. The deadline is 15th October 2013.

For more information, see the call for proposals on the DwB website.

This is part of a broader policy effort to improve researchers’ access to data in Europe, to enhance capacity to produce science-based understanding of society across the continent.