Research ethics in secondary data: what issues?

It is often believed that using secondary data relieves the researcher of the burden of applying for ethical approval – and sometimes of thinking about ethics altogether. But the whole research process involves ethical considerations, whether or not any primary data collection is involved. This starts with the initial design of the study, which should aim at the public good (and at the very least do no harm), and continues through to the communication of results, which should ensure transparency, openness and replicability. More specifically, what ethical issues do the data collection and analysis stages involve when secondary data are used?

Secondary data are usually defined as those that were collected as part of a different research project, for purposes other than those of the present study. They may be official statistical data (census data for example, but also, increasingly, administrative data), data gathered by commercial operators (time series of stock prices for example), or researchers’ data from past projects. They are more often quantitative, although secondary analysis of qualitative data is becoming increasingly common.

Weighing risks and benefits

Use of secondary data is, in itself, a highly ethical practice: it maximizes the value of any (public) investment in data collection, it reduces the burden on respondents, and it supports replicability of study findings and therefore greater transparency of research procedures and integrity of research work. But the value of secondary data is only fully realized if these benefits outweigh the risks, notably in terms of re-identification of individuals and disclosure of sensitive information.

For this to happen, use of secondary data must meet some key ethical conditions:

  • Data must be de-identified before release to the researcher
  • It must be reasonable to presume the consent of study subjects
  • Outcomes of the analysis must not allow re-identifying participants
  • Use of the data must not result in any damage or distress
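
As an illustration of the third condition, one common safeguard is to mask small cells in a frequency table before it is published, since a count of one or two people in a category can point to identifiable individuals. The following is a minimal sketch only: the function name, the example records and the threshold of 5 are assumptions made for illustration, and actual data services set their own disclosure rules.

```python
from collections import Counter

def suppress_small_cells(records, keys, threshold=5):
    """Count records per combination of `keys` and mask counts below `threshold`.

    The threshold is a hypothetical choice; masked cells are returned as None.
    """
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}

# Invented example data: two respondents aged 90+ in one region form a risky cell.
records = (
    [{"region": "North", "age_band": "30-39"}] * 12
    + [{"region": "North", "age_band": "90+"}] * 2
    + [{"region": "South", "age_band": "30-39"}] * 7
)

table = suppress_small_cells(records, ["region", "age_band"], threshold=5)
for cell, n in table.items():
    print(cell, "suppressed" if n is None else n)
```

Note that masking a cell is not always sufficient on its own: if row or column totals are also published, a suppressed count can sometimes be recovered by subtraction, so totals need to be checked as well.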

The value and expertise of data service organisations

Major public and not-for-profit data producers such as national statistical institutes (Britain’s ONS, France’s INSEE, etc.) and large research-led data collection enterprises (such as the European Social Survey, ESS) are aware of these aspects and have set up services and infrastructures that archive, manage and release data for secondary analysis, fully in line with the above principles. Examples are the UK Data Service, France’s Quetelet, and members of the CESSDA consortium of data archives in other European countries, with funding from the EU as well as national research councils or ministries.

The work done by these bodies goes a long way towards ensuring that all ethical conditions are met. In most cases today, the very design of data collection (whether based on surveys or administrative records) anticipates the release of the data for secondary use, in that informed consent includes provisions for sharing and future use of data. Both data archives and public statistical agencies also put in place specific, sometimes very strict frameworks for access to data under controlled conditions, in ways that minimize re-identification and disclosure risks. These typically involve either strong anonymization or release in “safe” settings.

Strong anonymization goes beyond the removal of direct identifiers such as names and addresses. It draws on a variety of techniques including re-coding, perturbation (for example, by adding random data points, data swapping, and micro-aggregation), and even production of “synthetic” data files that mimic the key properties of the original ones; as a result, re-identification of individuals becomes extremely unlikely.

Safe settings, by contrast, are used when release of more detailed data is necessary for the research to be done: access is then given through secure technological solutions that, among other things, authenticate authorized users through biometric identification, allow analysis but not download of data, and check outputs before release. The UK’s Secure Lab, France’s CASD, and Germany’s FDZ are examples of such solutions, which usually also include training for researchers in legal and ethical matters related to data. Access through such systems usually requires a rather detailed application procedure and the signature of contracts, terms of use or some pledge of secrecy by the researcher and their institution.
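
To give a concrete sense of two of the anonymization techniques named above, re-coding and micro-aggregation, here is a minimal sketch in Python. It is not the procedure of any particular data service: the function names, the band width of 10 years, the group size k=3 and the income figures are all assumptions chosen for illustration.

```python
def recode_age(age, width=10):
    """Re-coding: replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def microaggregate(values, k=3):
    """Micro-aggregation: sort the values, form groups of at least k,
    and replace every value with the mean of its group."""
    if len(values) < k:
        raise ValueError("need at least k values")
    # Indices of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    # Fold a too-small final group into the previous one.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    masked = [0.0] * len(values)
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean
    return masked

# Invented incomes; the last one is an outlier that could single someone out.
incomes = [21000, 54000, 23000, 22000, 50000, 52000, 250000]
masked = microaggregate(incomes, k=3)

print(recode_age(34))
print(masked)
```

In this sketch each released value is shared by at least three records, so the outlier no longer stands out, while the overall total (and hence the mean) of the variable is unchanged. The price is a loss of detail, which is one reason why more detailed data are sometimes released only in safe settings.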

So after all, it is relatively easy for a researcher to comply with the basic principles of ethical use of secondary data if the data are accessed through one of these infrastructure services: the burden is shifted from the researcher onto the data service organisation, so to speak, and the researcher should just follow the guidance provided. Under these conditions, one might argue that the need for ethical approval can safely be waived (though there is no uncontroversial view on this point).

Remaining issues

Issues are more likely to arise with data that were collected outside of such frameworks, namely without ethical approval, or with ethical approval not including provisions for later researchers to engage in secondary analysis. In such cases, and especially if the data are at micro level, researchers should be particularly careful in their consideration of possible risks – and ethical approval may well be needed.

Highly aggregated data (such as macroeconomic data and financial time series) are by their very nature less likely to involve risks of re-identification of individuals or disclosure of sensitive information. But especially if data come from private-sector sources, intellectual property issues might arise, as well as potential conflicts of interest; and there is a risk of misinterpretation if the data were not appropriately documented by the original collector.

Primary data collections that open the way to secondary analyses

These considerations are not only important for researchers who engage in secondary research themselves, but also for those who do primary data collection, and aim to archive their data and make them available for future re-use. Archiving can, and should, be done by all – not just the largest data collection consortia. In such cases, it is essential to take advice from a professional data service (like a CESSDA member organisation) from the early stages of the research, to plan the whole data lifecycle in a way that is ethical, legal, and value-maximising.

Some resources

  UPDATE. On research with internet data (neither primary nor secondary according to traditional categories) see this post.

