The concept of “data inclusion” is new and still slowly finding its way into our linguistic habits, but it is gaining ground in the minds of those who care for disadvantaged, low-income, or otherwise underserved segments of society. A recent report by the US Federal Trade Commission (FTC) does precisely this. Looking at the commercial use of big data analytics, it considers cases in which big data analytics leads companies to make choices that are detrimental to the most vulnerable segments of society, for example by excluding them from credit or from employment opportunities. It then asks how big data may instead be used in inclusive ways.
The report’s first set of recommendations is for companies to be well aware of the relevant regulations: on financial and credit reporting, equal opportunity, and consumer protection. The second set, though specifically aimed at research done in (or for) companies, is relevant for public research as well, and consists of asking key questions about the quality of data and models, and about the reliability and validity of results:
- How representative is your data set? In popular discourse, big data carry a promise of exhaustiveness that is rarely fulfilled in practice (see this great FT article by Tim Harford). In fact, big data sets are not necessarily statistically representative of the population they refer to, and information may be disproportionately missing about specific, possibly disadvantaged, populations.
- Does your data model account for biases? Selection effects, which occur whenever some members of the population are less likely than others to be included in the sample, must be controlled for if results are to be generalizable.
- How accurate are your predictions based on big data? The issue is that most research with big data is predictive without uncovering the social or economic mechanisms underlying the observed correlations, so that interpretation of the results can be misleading. The report does not mention, though, that recent developments in machine learning that support causal reasoning may alleviate this problem in the not-so-distant future.
- Does your reliance on big data raise ethical or fairness concerns? In all honesty, this is not specifically a question for research on big data, but for research in general. If a company’s analysis of employees’ behavior leads to solutions that involve, say, racial or gender-based discrimination, then that analysis shouldn’t be used – whether it’s done with “big” or “small” data.
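The first two questions – representativeness and selection effects – can be made concrete with a small simulation. The sketch below (in Python, with entirely invented numbers, not from the FTC report) builds a hypothetical population in which an underserved group is half as likely to appear in the data set, shows how a naive average over the biased sample goes wrong, and applies the standard inverse-probability-weighting correction:

```python
import random

random.seed(0)

# Hypothetical population of 10,000 people; 30% belong to an underserved
# group with a lower average score. All numbers are invented for illustration.
population = [
    {"underserved": True, "score": random.gauss(600, 50)}
    if random.random() < 0.3
    else {"underserved": False, "score": random.gauss(700, 50)}
    for _ in range(10_000)
]

# Selection effect: underserved individuals are only half as likely
# to be included in the collected data set.
P_INCLUSION = {True: 0.25, False: 0.5}
sample = [p for p in population if random.random() < P_INCLUSION[p["underserved"]]]

# Naive estimate: treat the biased sample as if it were representative.
naive_mean = sum(p["score"] for p in sample) / len(sample)

# Correction: weight each record by the inverse of its inclusion
# probability, so under-sampled people count proportionally more.
weights = [1 / P_INCLUSION[p["underserved"]] for p in sample]
weighted_mean = sum(w * p["score"] for w, p in zip(weights, sample)) / sum(weights)

true_mean = sum(p["score"] for p in population) / len(population)
print(f"true mean:     {true_mean:.1f}")
print(f"naive mean:    {naive_mean:.1f}")   # biased upward
print(f"weighted mean: {weighted_mean:.1f}")
```

The catch, of course, is that the correction only works because the inclusion probabilities are known here by construction; in real big data sets they usually are not, which is exactly why the report’s question matters.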
It is important that major regulators like the FTC are taking notice. Big data open the way to major improvements in our living conditions, but not because data-driven analysis will simply take the lead over current best practices in research. Regulation, awareness of statistical issues and potential pitfalls, and ethics are ever more necessary for big data to fulfill their potential.