The power of dataviz (150 years ago)

Science, like the rest of human life, is subject to fashions. Data visualisation is the latest trend: policy-makers and the public are under its spell, and researchers magically suspend their disbelief. Give me a fancy image, and I won’t look too closely at your p-values. So I was intrigued to discover, at a talk a few days ago by Paul Jackson of the Office for National Statistics, that there are precedents, and that they have a long history behind them.

The story is that of John Snow, an epidemiologist who was persuaded, against the received wisdom of the mid-nineteenth century, that cholera does not propagate through air but through contaminated water or food. But how to convince others? When cholera struck London in 1854, Snow began plotting the location of deaths on a map of Soho: he represented each death with a line parallel to the front of the building in which the person died.

[Figure: John Snow’s cholera map of Soho]

Snow soon realised that there was a concentration of “death lines” around Broad Street, and more specifically around a water pump at the corner of Broad Street and Cambridge Street.

[Figure: detail of John Snow’s cholera map]

He managed to convince the authorities to remove the handle of the pump, so that people could no longer use it: within a few days, the number of deaths in the area plummeted. Snow had proven his point and saved lives, with no medical trials and no sophisticated chemistry, just some basic count statistics and a clever dataviz.
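To make the idea of "basic count statistics" concrete, here is a minimal sketch of the kind of tally that Snow's map invites: assign each death to its nearest pump and count. Everything in it is an assumption made for illustration. The coordinates are invented toy values, not Snow's data; "Pump B" and "Pump C" are hypothetical labels; and straight-line distance is a simplification, since walking distance along the streets would be closer to how people actually chose a pump.

```python
from collections import Counter
from math import hypot

# Toy pump positions in arbitrary units. Invented for illustration only;
# "Pump B" and "Pump C" are hypothetical labels, not real pump names.
PUMPS = {
    "Broad Street": (0.0, 0.0),
    "Pump B": (5.0, 1.0),
    "Pump C": (-4.0, 4.0),
}

# Toy death locations, also invented (not Snow's records).
DEATHS = [
    (0.5, 0.2), (-0.3, 0.8), (1.0, -0.5),
    (0.2, -1.1), (4.6, 1.3), (-3.5, 3.0),
]


def nearest_pump(point, pumps):
    """Return the name of the pump closest to `point` (straight-line distance)."""
    x, y = point
    return min(pumps, key=lambda name: hypot(x - pumps[name][0], y - pumps[name][1]))


def deaths_per_pump(deaths, pumps):
    """Count how many deaths fall nearest to each pump."""
    return Counter(nearest_pump(d, pumps) for d in deaths)


counts = deaths_per_pump(DEATHS, PUMPS)
for pump, n in counts.most_common():
    print(f"{pump}: {n}")
```

With these toy points, most deaths fall nearest to the Broad Street pump, which is the pattern the map made visible at a glance.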

Sharing medical data for research: Why we should all care

A major health data plan is on the verge of being called off, never to get another chance. It is supposed to anonymise all patient records in the UK’s National Health Service (NHS), link them together into one single, giant database, and make them available under controlled conditions of use to health researchers and (controversially) to commercial companies too. Public outcry has led to the plan being delayed for six months.

In an article published in The Guardian last week, Ben Goldacre, a medical doctor and high-profile media commentator on science matters, rightly identifies the crux: in principle, the public accepts the release of data for scientific purposes, but resists commercial exploitation. And with good reason: medical knowledge results from the study of many cases, and the more cases are available, the more accurate the results; in the era of big data, it is also clear that aggregating and sharing a wealth of data such as those held by the NHS is a unique opportunity for medical science to discover ways of saving lives. On the other hand, use of data for any other purpose looks much more opaque, and people understandably feel it might lead to discrimination and negative consequences for individuals, for example if disclosure of a person’s health history results in higher insurance premiums or rejected job applications.


Small data and big models: Sunbelt 2014

Uh, it’s been a while… I should have written more regularly! All the more so as many things have happened this month, not least the publication of our book on the End-of-Privacy hypothesis. Well, I promise, I’ll catch up!

Meanwhile, a short update from St Pete Beach, FL, where the XXXIV Sunbelt conference is just about to end. This is the annual conference of the International Network for Social Network Analysis. In the last few years, I had noticed some tension between what I will call (no offense!) the old school, people using data from classical sources such as surveys and fieldwork, and the big data people, usually from computer science departments and quite disconnected from the core of top social network analysts, who are mostly from the social sciences. This year, though, the tension was much less apparent, or at least I did not find it so overwhelming. There weren’t many sessions on big data this time, but there was a lot of progress from the old school, which is in fact renewing its range of methods and tools very fast. No more tiny descriptives of small datasets, as in the early days of social network analysis, but ever more powerful tools allowing statistical inference (very difficult with network data; I’ll come back to that in a future post), hypothesis testing, and very advanced forms of regression and survival analysis. In this sense, a highly interesting conference indeed. We can now do theory-building and modeling of networks at a level never reached before, and we don’t even need big data to do so.

The keynote speech by Jeff Johnson, interestingly, was focused on the contrast between big and small data. Johnson has strong ethnographic experience with small data, including in very exotic settings such as scientific research labs at the South Pole and fisheries in Alaska. He combined social network analysis techniques, sometimes using highly sophisticated mathematical tools, with fieldwork observation to gain insight into, among other things, the emergence of informal roles in communities. His key question here was: can we bring ethnographic knowing to big data? And how can we do so?

My own presentation (apart from a one-day workshop I offered on the first day, where I taught the basics of social network analysis) took place this afternoon. I realize, and I am pleased to report, that it was in line with the small-data-but-sophisticated-modeling mood of the conference. It draws on our research project Anamia, using data from an online survey of persons with eating disorders to understand how the body image disturbances that affect them are related to the structure of their social networks. The data were small, because they were collected as part of a questionnaire; but the survey technique used was advanced, and the modeling strategy is quite complex. For those who are interested in the results, our slides are here:

New book: Against the Hypothesis of the End of Privacy

Our new book Against the Hypothesis of the End of Privacy is out now! It has been published by Springer and co-authored with Antonio A. Casilli (Telecom ParisTech) and Yasaman Sarabi (University of Greenwich). Please check here for regular updates about the book.

Network data, new and old: from informal ties to formal networks

Network data are among those that are changing fastest these days. When I say I study social networks, people almost automatically think of Facebook or Twitter, without necessarily realizing that networks have been around for, well, the whole history of humanity, long before the internet. Networks are just systems of social relationships, and as such, they can exist in any social context: the family, school, workplace, village, church, leisure club, and so forth. Social scientists started mapping and analysing networks as early as the 1930s. But people didn’t think of their social relationships as “networks” and didn’t always see themselves as “networkers” even if they did invest a lot in their relationships, were aware of them, and cared about them. The term, and the systemic configuration, were just not familiar. There was something inherently informal and implicit about social ties.

What has changed with Facebook and its counterparts is that the network metaphor has become explicit. People are now accustomed to talking about “networks”, and think in systemic terms, seeing their own relationships as part of a more global structure. Network ties have become formal: you have to make a clear choice and take action when you add a “friend” on Facebook, or “follow” someone on Twitter; you will have a list of your friends/followers/followees (whatever the specific terminology is) and can monitor changes in this list. You know who the friends of your friends are, and can keep track of how many people viewed your profile, included you in their “lists”, or mentioned you in their Tweets. Now, everyone knows what networks are, so if you are a social network researcher and conduct a survey like in the old days, you need not fear that your respondents may misunderstand. In fact, you may not even need to do a survey at all: the formal nature of online ties, digitally recorded and stored, makes it possible to retrieve network information automatically. You can just mine network tie data from Facebook, Twitter, or whatever service your target populations happen to be using.
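As a small illustration of what "formal" ties make possible, here is a minimal sketch that turns a recorded list of ties into a network and reads off the friends of someone's friends. It is a toy under stated assumptions: the names and ties are hypothetical, ties are treated as mutual (as with a Facebook "friend", unlike a Twitter "follow"), and no real service's API is involved.

```python
from collections import defaultdict

# Hypothetical recorded ties; each pair is a mutual "friend" link.
TIES = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "dave"),
    ("alice", "erin"),
]


def build_network(ties):
    """Build an adjacency mapping (person -> set of friends) from tie pairs."""
    network = defaultdict(set)
    for a, b in ties:
        network[a].add(b)
        network[b].add(a)
    return network


def friends_of_friends(network, person):
    """People exactly two steps away: friends of friends who are not
    already direct friends, excluding the person themselves."""
    direct = network.get(person, set())
    reachable = set()
    for friend in direct:
        reachable |= network.get(friend, set())
    return reachable - direct - {person}


network = build_network(TIES)
print(sorted(network["alice"]))                     # alice's direct friends
print(sorted(friends_of_friends(network, "alice")))  # alice's friends of friends
```

A survey-based study would have to ask each respondent for this information; here it falls out of the stored list of ties.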


Training in European data: EU-SILC

Official statistical surveys are still the best sources of data in terms of quality. In practice, they are the only ones that apply random sampling, and the legal obligation to respond makes the achieved sample very close to the targeted one. No other approach to data collection can hope to do as well.

The European Union Statistics on Income and Living Conditions (EU-SILC) is an instrument that aims to collect timely and comparable cross-sectional and longitudinal multidimensional microdata on income, poverty, social exclusion and living conditions. It started in 2003 with a small group of participating countries, and was enlarged in 2004. It is one of the richest sources of information on the daily life conditions of Europeans.

EU-SILC data are available for research use, but many barriers exist and these data are actually underutilized. First, the fact that access is legally authorised does not make it straightforward in practice: the application process can be lengthy and costly. Second, the very handling of the data requires specific knowledge and skills.

The Data without Boundaries European initiative, which aims to advance research access to official data, is organising a training programme on EU-SILC with a specific focus on the longitudinal component. The course is organised locally by Réseau Quetelet and hosted by GENES (Groupe des Écoles Nationales d’Économie et Statistique), both in Paris, France.


Small Data to study the Web: The ANAMIA project

We have just published the results of our research project ANAMIA, which studied the personal networks and online interactions of persons with eating disorders (“ana” and “mia” in web jargon). The report and an infographic are listed below:

Documents

Report: Young internet users and eating disorder websites: beyond the notion of “pro-ana” (pdf, 92 pp, in French)

Infographic: results and recommendations of the ANAMIA project (pdf, in French)

Summary (in English!)

The ana-mia webosphere had long remained opaque, with little data available for a science-based understanding of it. As a result, misconceptions proliferated and policy-makers hesitated, threatening censorship without devising solutions to reach out to and support a population in distress. Our study has been the first to overcome these limitations and reveal the social environment, actual eating practices and digital usages of persons with eating disorders on the English- and French-language web.

Visualization of the personal networks of four individuals with, respectively, EDNOS (Eating Disorders Not Otherwise Specified; top, left), anorexia nervosa (top, right), bulimia nervosa (bottom, left), and binge eating (bottom, right). Hollow circles represent their face-to-face acquaintances, filled circles their online ones. Colours indicate relational proximity to the subject (green: intimate, blue: very close, yellow: close, red: somewhat close). Source: ANAMIA project report.


#bigdataBL

On Friday last week, the British Sociological Association (BSA) held an event on “The Challenge of Big Data” at the British Library. It was interesting, stimulating and relevant. I was particularly impressed by the involvement of participants and the very intense live-tweeting, never so lively at a BSA event! And people were particularly friendly and talkative, both on their keyboards and at the coffee tables… so in honour of all this, I am choosing the hashtag of the day, #bigdataBL, as the title here.

(Visualisation: http://www.digitalcoeliac.com/)

Some highlights:

  • The designation of “big data” is from industry, not (social) science, said a speaker at the very beginning. And it is known to be fuzzy. Yet it becomes a relevant object of scientific inquiry in that it is bound to affect society, democracy, the economy and, well, social science.
  • Big-data practices change people’s perception of data production and use. Ordinary people are now increasingly aware that a growing range of their actions and activities are being digitally recorded and stored. Data are now a recognized social object.
  • Big data needs to be understood in the context of new forms of value production.
  • So, social scientists need to take note (and this was the intended motivation of the whole event). The complication is that big data matters for social science in two different ways. First, it is an object of study in itself: what are its implications for, say, inequalities, democratic participation, or the distribution of wealth? Second, it offers new methods that can be exploited to gain insight into a wide range of (traditional and new) social phenomena, such as consumer behaviours (think of Tesco supermarket sales data).
  • Put differently, if you want to understand the world as it is now, you need to understand how information is created, used and stored – that’s what the Big Data business is all about, both for social scientists and for industry actors.


Three tools to visualize personal network data – continued

Yesterday, Antonio Casilli and I gave our promised talk on network data visualization. It was an opportunity to discuss the extension of the tools we developed within a given research project to other network studies, and to reflect on the contribution as well as the limitations of data visualizations. Here are our slides:

Three tools to visualize personal networks

Data visualization techniques are enjoying ever greater popularity, notably thanks to the recent boom of Big Data and our increased capacity to handle large datasets. Network data visualization techniques are no exception. In fact, appealing diagrams of social connections (sociograms) have been at the heart of the field of social network analysis since the 1930s, and have contributed a lot to its success. Today, all this is evolving at an unprecedented pace.

In line with these trends, the research team of the ANAMIA project (a study of the networks and online sociability of persons with eating disorders, funded by the French ANR), in which I was one of the investigators, has developed new software tools for the visualization of personal network data, with different solutions for the three stages of data collection, analysis, and dissemination of results.

Specifically:

– ANAMIA EGOCENTER is a graphical version of a name generator, to be embedded in a computer-based survey to collect personal network data. It has turned out to be a user-friendly, highly effective interface for interacting and engaging with survey respondents;
