Archive for the ‘Data & methods’ Category

Small data and big models: Sunbelt 2014

Uh, it’s been a while… I should have written more regularly! All the more so as many things have happened this month, not least the publication of our book on the End-of-Privacy hypothesis. Well, I promise, I’ll catch up!

Meanwhile, a short update from St Pete Beach, FL, where the XXXIV Sunbelt conference is just about to end. This is the annual conference of the International Network for Social Network Analysis. In the last few years, I had noticed some tension between the old school (let’s call it that, no offense!) of people using data from classical sources such as surveys and fieldwork, and the big data people, usually from computer science departments and quite disconnected from the core of top social network analysts, who are mostly from the social sciences. This year, though, this tension was much less apparent, or at least I did not find it so overwhelming. There weren’t many sessions on big data this time, but there was a lot of progress from the old school, which in fact is renewing its range of methods and tools very fast. No more tiny descriptives of small datasets, as was the case in the early days of social network analysis, but ever more powerful statistical tools allowing statistical inference (very difficult with network data; I’ll come back to that in a future post), hypothesis testing, and very advanced forms of regression and survival analysis. In this sense, a highly interesting conference indeed. We can now do theory-building and modeling of networks at a level never experienced before, and we don’t even need big data to do so.

The keynote speech by Jeff Johnson, interestingly, was focused on the contrast between big and small data. Johnson has strong ethnographic experience with small data, including in very exotic settings such as scientific research labs at the South Pole and fisheries in Alaska. He combined social network analysis techniques, sometimes using highly sophisticated mathematical tools, with fieldwork observation to gain insight into, among other things, the emergence of informal roles in communities. His key question here was, can we bring ethnographic knowing to big data? And how can we do so?

My own presentation (apart from a one-day workshop I offered on the first day, where I taught the basics of social network analysis) took place this afternoon. I realize, and I am pleased to report, that it was in line with the small-data-but-sophisticated-modeling mood of the conference. It is work derived from our research project Anamia, using data from an online survey of persons with eating disorders to understand how the body image disturbances that affect them are related to the structure of their social networks. The data were small, because they were collected as part of a questionnaire; but the survey technique used was advanced, and the modeling strategy is quite complex. For those who are interested in the results, our slides are here:


Training in European data: EU-SILC

Official statistical surveys are still the best sources of data in terms of quality. In practice, they are the only ones that apply random sampling, and the legal obligation to respond makes the actual sample very close to the targeted one. No other approach to data collection can hope to do as well.

The European Union Statistics on Income and Living Conditions (EU-SILC) is an instrument aiming at collecting timely and comparable cross-sectional and longitudinal multidimensional microdata on income, poverty, social exclusion and living conditions. It started in 2003 with a small group of participating countries, and was enlarged in 2004. It is one of the richest sources of information on the daily life conditions of Europeans.

EU-SILC data are available for research use, but many barriers exist and these data are actually underutilized. On the one hand, the fact that access is legally authorised does not make it practically straightforward – the application process can be lengthy and costly. On the other hand, the very handling of data requires some specific knowledge and skills.

The Data without Boundaries European initiative, which aims to advance research access to official data, is organising a training programme on EU-SILC with a specific focus on the longitudinal component. Local organization lies with Réseau Quetelet, and the training course is hosted by GENES (Groupe des Écoles Nationales d’Économie et Statistique); both are in Paris, France.


Small Data to study the Web: The ANAMIA project

We have just published the results of our research project ANAMIA, studying the personal networks and online interactions of persons with eating disorders (“ana” and “mia” in web jargon). The report has just come out:

Documents

Report: Young internet users and eating disorder websites: beyond the notion of “pro-ana” (pdf, 92 pp, in French)

Infographic: results and recommendations of the ANAMIA project (pdf, in French)

Summary (in English!)

The ana-mia webosphere had long remained opaque, with little data available for a science-based understanding of it. As a result, misconceptions proliferated and policy-makers hesitated, threatening censorship but without devising solutions to reach out to and support a population in distress. Our study has been the first to overcome these limitations and reveal the social environment, actual eating practices and digital usages of persons with eating disorders on the English- and French-language web.

[Figure 1]

Visualization of the personal networks of four individuals with, respectively, EDNOS (Eating Disorders Not Otherwise Specified; top, left), anorexia nervosa (top, right), bulimia nervosa (bottom, left), and binge eating (bottom, right). Hollow circles represent their face-to-face acquaintances, filled circles their online ones. Colours indicate relational proximity to the subject (green: intimate, blue: very close, yellow: close, red: somewhat close). Source: ANAMIA project report.


Three tools to visualize personal network data – continued

Yesterday, Antonio Casilli and I gave our promised talk on network data visualization. It was an opportunity to discuss the extension of the tools we developed within a given research project to other network studies, and to reflect on the contribution as well as the limitations of data visualizations. Here are our slides:

Three tools to visualize personal networks

Data visualization techniques are enjoying ever greater popularity, notably thanks to the recent boom of Big Data and our increased capacity to handle large datasets. Network data visualization techniques are no exception. In fact, appealing diagrams of social connections (sociograms) have been at the heart of the field of social network analysis since the 1930s, and have contributed a lot to its success. Today, all this is evolving at an unprecedented pace.

In line with these tendencies, the research team of the ANAMIA project (a study of the networks and online sociability of persons with eating disorders, funded by the French ANR), of which I was one of the investigators, has developed new software tools for the visualization of personal network data, with different solutions for the three stages of data collection, analysis, and dissemination of results.

Specifically:

– ANAMIA EGOCENTER is a graphical version of a name generator, to be embedded in a computer-based survey to collect personal network data. It has turned out to be a user-friendly, highly effective interface for interacting and engaging with survey respondents;


Big Data and social research

Data are not a new ingredient of socio-economic research. Surveys have long served the social sciences; some of them, like the European Social Survey, are (relatively) large-scale initiatives, with multiple waves of observation in several countries; others are much smaller. Some of the data collected were quantitative, others qualitative or mixed-methods. Data from official and governmental statistics (censuses, surveys, registers) have also been used a lot in social research, owing to their large coverage and good quality. These data are ever more in demand today.

Now, big data are shaking this world. The digital traces of our activities can be retrieved, saved, coded and processed much faster, much more easily and in much larger amounts than surveys and questionnaires. Big data are primarily a business phenomenon, and the hype is about the potential gains they offer to companies (and allegedly to society as a whole). But, as researcher Emma Uprichard rightly says in a recent post, big data are essentially social data. They are about people, what they do, how they interact, how they form part of groups and social circles. A social scientist, she says, must necessarily feel concerned.

It is good, for example, that the British Sociological Association is organizing a one-day event on The Challenge of Big Data, and it is right that its members should engage with that challenge. It goes beyond the traditional qualitative/quantitative divide and the underrepresentation of the latter in British sociology. Big data, and the techniques to handle them, are not statistics, and professional statisticians have trouble with them too. (The figure below is just anecdotal, but clearly suggests how a simple search on the Internet identifies Statistics and Big Data as unconnected sets of actors and ties.) The challenge has more to do with the a-theoretical stance that big data seem to involve.

[Figure: TouchGraph visualization of Internet search results for Statistics and Big Data]


The fuzziness of Big Data

Fascinating as they may be, Big Data are not without problems. Size does not eliminate the problem of quality: because of the very way they are collected, Big Data are unstructured and unsystematized, the sampling criteria are fuzzy, and the classical statistical analyses do not apply very well. The more you zoom in (the more detail you have), the more noise you find, so that you need to aggregate data (that is, to reduce a “big” micro-level dataset to a “smaller” macro one) to detect any meaningful tendency. Analyzing Big Data as they are, without any caution, increases the likelihood of finding spurious correlations – a statistician’s nightmare! In short, processing Big Data is problematic: although we do have sufficient computational capacity today, we still need to refine appropriate analytical techniques to produce reliable results.
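To make the point about spurious correlations concrete, here is a minimal simulation sketch. It is only an illustration under stated assumptions, not an analysis of any real dataset: every variable is pure random noise, and the function names and parameter values (50 observations, up to 1000 candidate variables) are hypothetical choices of mine. It reports the strongest correlation found between a noise "outcome" and an increasing number of equally meaningless candidate variables.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def max_spurious_correlation(n_variables, n_observations, seed=0):
    """Largest |r| between a noise outcome and n_variables noise variables.

    Everything is drawn independently from a standard normal distribution,
    so any correlation that shows up is spurious by construction.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_observations)]
    best = 0.0
    for _ in range(n_variables):
        candidate = [rng.gauss(0, 1) for _ in range(n_observations)]
        best = max(best, abs(pearson(candidate, outcome)))
    return best


# The more variables we screen, the stronger the best "finding" looks.
for k in (1, 10, 100, 1000):
    print(k, round(max_spurious_correlation(k, 50), 2))
```

Since nothing here is related to anything else, any seemingly strong correlation among the 1000 candidates is an artefact of searching widely, which is why screening many variables without a prior hypothesis calls for caution.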

In a sense, the enthusiasm for Big Data is diametrically opposed to another highly fashionable trend in socioeconomic research: that of using randomized controlled trials (RCTs), as in medicine, or at least quasi-experiments (often called “natural experiments”), which make it possible to collect data under controlled conditions and to detect causal relationships much more clearly and precisely than in traditional, non-experimental social research. These data have a lot more structure and scientific rigor than old-fashioned surveys – just the opposite of Big Data!

This is just anecdotal evidence, but do a quick Google search for images on RCTs vs. Big Data. Here are the first two examples I came across: on the left are RCTs (from a dentistry course), on the right are Big Data (from a business consultancy website). The former conveys order, structure and control, the latter a sense of being somewhat lost, or of not knowing where all this is heading… Look for other images, I’m sure the great majority won’t be that different from these two.

[Image: RCTs (left) vs. Big Data (right)]

