Data Big and Small

Twitter networks at the OuiShare Fest 2016

Twitter conversations are one way through which participants in an event engage with the programme, comment on and discuss the talks they attend, and prolong question-and-answer sessions. Twitter feeds have become part of the official communication strategy of major events and serve documentation and information purposes, both for attendees and for outsiders. While tweeting is becoming more and more a prerogative of “official” accounts in charge of event communication, it is also a potential tool in the hands of each participant, allowing anyone to join the conversation, at least in principle. Participants may become aware of each other, perhaps using the opportunity of the event to meet face-to-face, start relationships and even collaborations. A Nesta study insisted on the potential for using social media data to attain a quantitative understanding of events and their impacts on participants’ networks.

The OuiShare Fest 2016, a major gathering of the collaborative economy community that took place last week in Paris, was one opportunity to see such mechanisms in place. Tweeting was easy – with an official hashtag, #OSFEST16, although related hashtags were also widely used. I mined a total of 12440 tweets over the four days of the event. Do Twitter conversations related to the Fest bring to light the emergence of a community? While it’s too early for any deep analysis, some descriptive results can already be shown.

First, when did people tweet? Mostly at the beginning of each day’s programme (9am on the first two days, 2pm on the third day). Tweeting was more intense on the first day and declined over time (Figure 1). The comparatively low participation on the fourth day is due to the different format: an open day in French (rather than an international conference in English), where local people were free to come and go. Online activity is not independent of what happens on the ground; quite the contrary, it follows the timings of physical activity.
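As an illustration of how such a timeline can be obtained, here is a minimal sketch in Python (not the R code used for this post). The timestamps are invented placeholders, and the function name `tweets_per_hour` is hypothetical; the idea is simply to bin each tweet by day and hour.

```python
from collections import Counter
from datetime import datetime

# Invented placeholder timestamps (NOT the Fest data).
timestamps = [
    "2016-01-01 09:05:12",
    "2016-01-01 09:40:03",
    "2016-01-01 14:15:47",
    "2016-01-02 09:02:30",
]

def tweets_per_hour(stamps):
    """Count tweets in each (day, hour) bin."""
    counts = Counter()
    for stamp in stamps:
        t = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        counts[(t.date().isoformat(), t.hour)] += 1
    return counts

hourly = tweets_per_hour(timestamps)
for (day, hour), n in sorted(hourly.items()):
    print(f"{day} {hour:02d}:00  {n}")
```

Plotting these counts against time would give a curve of the kind shown in Figure 1.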

Figure 1: Tweets over time.

Who tweeted most? Our dataset has a predictable outlier, the official @OuiShareFest Twitter account, which published 727 tweets – twice as many as the second in the ranking. But let’s look at the people who had no obligation to tweet, and still did so: who among them contributed most to documenting the Fest? Figure 2 shows the presence of some other institutional accounts among the top 10, but the most active include a few individual participants. Ironically, one of them was not even physically present at the Fest, and followed the live video streaming from home. In this sense, Twitter served as an interface between event participants and interested people who couldn’t make it to Paris.
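A ranking of this kind, with the outlier set aside, can be sketched as follows. This is only an illustration in Python: the handles are invented, and `top_tweeters` is a hypothetical helper, not a function from any Twitter library.

```python
from collections import Counter

# Invented author handles, one entry per tweet (NOT the Fest data).
authors = ["@official", "@official", "@official", "@alice", "@alice", "@bob"]

def top_tweeters(authors, exclude=(), n=10):
    """Rank accounts by number of tweets, skipping the excluded handles."""
    counts = Counter(a for a in authors if a not in exclude)
    return counts.most_common(n)

ranking = top_tweeters(authors, exclude={"@official"})
for handle, n_tweets in ranking:
    print(handle, n_tweets)
```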

Figure 2: Ten most active tweeters (excluding @OuiShareFest).

What was the proportion of tweets, replies and retweets? Original tweets are interesting for their unique content (what are people talking about?), while replies and retweets are interesting because they reveal social interactions – dialogue, endorsement or criticism between users. Figure 3 shows that the number of replies is small compared to tweets and retweets.
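To make the three categories concrete, here is a small Python sketch. The field names `reply_to` and `is_retweet` are assumptions chosen for the example (real Twitter exports name these fields differently), and the three tweets are invented.

```python
from collections import Counter

# Invented tweets; the field names are hypothetical.
tweets = [
    {"author": "@alice", "text": "Great opening talk",
     "reply_to": None, "is_retweet": False},
    {"author": "@bob", "text": "@alice agreed!",
     "reply_to": "@alice", "is_retweet": False},
    {"author": "@carol", "text": "RT @alice: Great opening talk",
     "reply_to": None, "is_retweet": True},
]

def tweet_type(tweet):
    """Classify a tweet as 'retweet', 'reply' or original 'tweet'."""
    if tweet["is_retweet"]:
        return "retweet"
    if tweet["reply_to"] is not None:
        return "reply"
    return "tweet"

type_counts = Counter(tweet_type(t) for t in tweets)
print(type_counts)
```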

Figure 3: Tweets, replies and retweets.

Let’s now look closer at the replies. By taking who replied to whom, we can build a social network of conversations between a group of tweeters. It’s a relatively small network of 311 tweeters (the coloured points in Figure 4), with 321 ties among them (the lines in Figure 4). The size of points depends on the number of their incoming ties, that is, the number of replies received: even if the points haven’t been labelled, I am sure you can tell immediately which one represents the official @OuiShareFest account… the usual suspect! But let’s look at the network structure more closely. Some ties are self-loops, that is, people replying to themselves. (Let’s be clear, it’s not a sign of social isolation, but simply a consequence of the 140-character limit imposed on Twitter: self-replies are meant to deliver longer messages). A lot of other participants are involved in just simple dyads or small chains (A replies to B who replies to C, but then C does not reply to A), unconnected to the rest. There is a larger cluster formed around the most replied-to users: here, some closure becomes apparent (A replies to B who replies to C who replies to A) and enables this sub-network to grow.
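The structural notions used here (incoming ties, self-loops, closure) can be sketched on a toy example. The Python code below is an illustration only, with invented users, and uses plain sets rather than igraph; `closed_triads` is a hypothetical helper that looks for directed cycles A → B → C → A.

```python
from collections import Counter

# Invented reply pairs (replier, replied_to); NOT the Fest data.
replies = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "d"), ("e", "f")]

edges = set(replies)                             # distinct directed ties
nodes = {user for edge in edges for user in edge}
in_degree = Counter(dst for _, dst in replies)   # replies received per user
self_loops = [(s, d) for s, d in edges if s == d]

def closed_triads(edges):
    """Return the sets of three distinct users forming a cycle A->B->C->A."""
    found = set()
    for a, b in edges:
        for b2, c in edges:
            if b2 != b or len({a, b, c}) < 3:
                continue
            if (c, a) in edges:
                found.add(frozenset((a, b, c)))
    return found

triads = closed_triads(edges)
print(len(nodes), "users,", len(edges), "ties,", len(triads), "closed triad(s)")
```

In this toy network, `d` replies only to itself, `e` and `f` form an isolated dyad, and `a`, `b`, `c` close a triad. Note that the helper records each group of three users once, whatever the direction of the cycle.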

Figure 4: The network of replies.

Now, my own experience of tweeting at the Fest suggested that tweets were multilingual. Apart from the fourth day, there seemed to be a large number of French-speaking participants. A quick-and-dirty (for now) language detection exercise revealed that roughly 60% of tweets were in English and 25% in French, the rest being split between other languages, especially German, Spanish, and Catalan. So, did people reply to each other based on the language of their tweets? It turns out that quite a few tweeters were involved in conversations in multiple languages. Figure 5 is a variant of Figure 4, colouring nodes and ties differently depending on language. A nice mix: interestingly, the central cluster is not monolingual and is in fact kept together by a few, albeit small, multilingual tweeters.

Figure 5: The network of replies, by language.

Let’s turn now to mentions: who are the most mentioned tweeters? Again, I’ll take out of the analysis @OuiShareFest, hugely ahead of anyone else with 832 mentions received. Below, Figure 6 ranks the most mentioned: mostly companies (partners or sponsors of the event such as MAIF), speakers (such as Nathan Schneider, Nilofer Merchant), and key OuiShare personalities (such as Antonin Léonard). Mentions follow the programme of the event, and most mentioned are people and organizations that play a role in shaping it.

Figure 6: Most mentioned tweeters.

Mentions are also a basis to build another social network – of who mentions whom in a tweet. This will be a larger network compared to the network of replies, as mentions can be of many types and also include retweets (which, as we saw above, are very numerous here). There are 17248 mentions (some of which are repeated more than once) in the network. They involve 796 users who mention others and are mentioned in turn; 550 users who are mentioned, but do not themselves mention anyone; and 1680 users who mention others, but are not themselves mentioned.
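The three groups of users can be obtained with simple set operations. The sketch below is in Python with invented mention pairs; it only illustrates the logic of the breakdown, not the code actually used.

```python
# Invented (mentioner, mentioned) pairs; repeats are allowed.
mentions = [("a", "b"), ("b", "a"), ("a", "c"), ("d", "a"), ("a", "b")]

senders = {src for src, _ in mentions}      # users who mention someone
receivers = {dst for _, dst in mentions}    # users who are mentioned

both = senders & receivers                  # mention others and are mentioned
only_mentioned = receivers - senders        # mentioned, never mention anyone
only_mentioning = senders - receivers       # mention others, never mentioned

print(len(both), len(only_mentioned), len(only_mentioning))
```

The three sets are disjoint by construction, so the figures above add up to 796 + 550 + 1680 = 3026 users in the full mention network.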

A large network such as this is more difficult to visualize meaningfully, and I had to introduce some simplifications to do so. I have included only pairs in which one had mentioned the other at least twice: this makes a network of 778 nodes with 2222 ties. The colour of nodes depends on their modularity class (a group of nodes that are more connected with one another than with any other nodes in the network) and their size depends on the number of mentions received. At the centre of the network, you will clearly recognize the official @OuiShareFest account, which structures the bulk of the conversations. But even intuitively, other actors seem central as well, and their role deserves to be examined more thoroughly (in some future, less preliminary analysis).
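The simplification step, keeping only pairs with at least two mentions, can be sketched as follows in Python. The data are invented and the threshold variable `MIN_MENTIONS` is a name chosen for this example; the modularity classes themselves are not computed here.

```python
from collections import Counter

# Invented (mentioner, mentioned) pairs; NOT the Fest data.
mentions = [("a", "b"), ("a", "b"), ("b", "c"),
            ("c", "a"), ("c", "a"), ("c", "a")]

MIN_MENTIONS = 2  # keep a tie only if it occurs at least this many times

weights = Counter(mentions)                     # times each pair occurs
kept = {pair: w for pair, w in weights.items() if w >= MIN_MENTIONS}
kept_nodes = {user for pair in kept for user in pair}

# Node size in the figure: mentions received, over the retained ties.
received = Counter()
for (_, dst), w in kept.items():
    received[dst] += w

print(len(kept_nodes), "nodes,", len(kept), "ties")
```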

Figure 7: Network of Twitter mentions.

This analysis is part of a larger research project, “Sharing Networks“, led by Antonio A. Casilli and myself, and dedicated to the study of the emergence of communities of values and interest at the OuiShare Fest 2016. Twitter networks will be combined with other data on networking – including informal networking which we are capturing through a (perhaps old-fashioned, but still useful!) survey.

The analyses and visualizations above were done with the packages twitteR and igraph in R; Figure 7 was produced with Gephi.

Ethical issues in research with online data

Some time ago, I wrote a post on ethical issues in research with secondary data – a somewhat grey area, where students and scholars often feel guidance is insufficient. Even more complex is research with internet data – neither primary nor secondary strictly speaking, but “big” data. A recent case fuelled an international debate on how researchers should deal with data that are, apparently, accessible to all on the web: a Danish graduate student published a large dataset of users of the online dating site OkCupid (he apparently did so without any institutional backing, and Aarhus University, where he studies, is now on the case). Michael Zimmer, a specialist in information studies and the policy and ethics of online research, properly summarizes the issues in a recent Wired article:

Don’t say that “the data are already public”. The fact that OkCupid users knowingly share some personal information does not mean they consent to it being used for purposes other than interactions with other users on that site. By scraping data, one may be able to put together the whole history of users’ presence on that platform, revealing more of their life or personality than they themselves are aware of. More dangerously, data extracted in this way might in some cases be matched with other information, thereby potentially becoming much more disclosive than what the persons concerned ever intended or agreed. And the disclosure may be aggravated by releasing the data outside the platform.


First steps toward “Data Inclusion”

The concept of “data inclusion” is new and still slowly finding its way into our linguistic habits, but it is gaining ground in the minds of those who care for disadvantaged, low-income, or otherwise underserved segments of society. A recent report of the US Federal Trade Commission (FTC) does precisely this. Looking at the commercial use of big data analytics, it considers cases in which big data analytics lead companies to make choices that are detrimental to the most vulnerable segments of society, for example by excluding them from credit or from employment opportunities. Instead, it asks how big data may be used in inclusive ways.

A first set of recommendations they make is for companies to be well aware of the regulations: on financial and credit reporting, equal opportunities, consumer protection. The second set of recommendations, though specifically aimed at research done in (or for) companies, is of relevance for public research as well, and consists in asking key questions about the quality of data and models, and about the reliability and validity of results:

How representative is your data set? In popular discourse, big data carry a promise of exhaustivity, which however is rarely fulfilled in practice (see this great FT article by Tim Harford). In fact, big data sets are not necessarily statistically representative of the population they refer to, and information may be disproportionately missing about specific, possibly disadvantaged, populations.
Does your data model account for biases? Selection effects, which occur whenever some members of the population are less likely to be included in the sample than others, must be controlled for in order for results to be generalizable.
How accurate are your predictions based on big data? The issue is that most research with big data is predictive without being able to uncover the social or economic mechanisms underlying observed correlations, so that interpretation of results is potentially misleading. The report does not say, though, that recent developments in machine learning that support causal reasoning may alleviate this problem in the not-so-distant future.
Does your reliance on big data raise ethical or fairness concerns? In all honesty, this is not specifically a question for research on big data, but for research in general. If a company’s analysis of employees’ behavior leads to solutions that involve forms of, say, racial or gender-based discrimination, then that analysis shouldn’t be used – whether it’s done with “big” or “small” data.
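To illustrate the point about selection effects with a toy example: suppose one group is over-represented in a sample. A naive average is then pulled towards that group, while reweighting each observation by its group's population share (post-stratification, one standard correction among others, and not something the FTC report prescribes) brings the estimate back. The Python sketch below uses invented numbers and assumes the population shares are known.

```python
from collections import Counter

# Invented sample of (group, outcome); group "A" is over-represented.
sample = [("A", 1.0)] * 8 + [("B", 0.0)] * 2

# Assumed (hypothetical) true shares of each group in the population.
population_share = {"A": 0.5, "B": 0.5}

def naive_mean(sample):
    """Unweighted average of the outcomes."""
    return sum(value for _, value in sample) / len(sample)

def poststratified_mean(sample, population_share):
    """Average with weights = population share / sample share of the group."""
    n = len(sample)
    group_n = Counter(group for group, _ in sample)
    weighted_sum = 0.0
    weight_total = 0.0
    for group, value in sample:
        weight = population_share[group] / (group_n[group] / n)
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

print(naive_mean(sample), poststratified_mean(sample, population_share))
```

Here the naive average is 0.8, while the reweighted one is 0.5, which is what a sample with equal shares of the two groups would give.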

It is important that major regulators like the FTC are taking notice. Big data open the way to major improvements in our life conditions, but not because data-driven analysis will take the lead over current best practices in research. Regulations, awareness of statistical issues and potential pitfalls, and ethics are ever more necessary for big data to fulfill their potential.

Hierarchy, market or network? The disruptive world of the digital platform

Economics traditionally considered firms and markets as two alternative ways of coordinating economic activities. Nobel prize winner Ronald H. Coase (1937) demonstrated that it all hinges on “transaction costs”, such as the need to search for a trade partner, the time needed to negotiate a contract, the legal expenses to draw it up and, if necessary, to enforce it. When these costs are high, then hiring people in a firm is the right solution. When they are low, then a harmonious state will emerge spontaneously from the choices of independent, self-employed individuals. The difference, further emphasized by the work of Oliver Williamson, another Nobel laureate, is between the world of bureaucracy, hierarchy and salaried work, and the world of the market and myriad micro-entrepreneurs.

This dichotomous description seemed reductive to economic sociologists, and Mark Granovetter (1985) pointed to social networks as coordination devices. Networks enable circulation of knowledge, formation of trust, emergence of shared norms in informal ways, thereby lowering costs and smoothing economic transactions. Walter W. Powell (1990) saw networks as an alternative to market and hierarchy, while others thought of them as a complement rather than a substitute. In some cases, the relevance of networks is flagrant: think of “collegial“, horizontal organizations such as legal partnerships, which are clearly not markets, and which have no vertical hierarchy either.

The rise of online platforms challenges these older views today. Powered by digital data and matching algorithms, platforms are meeting places for actors on the two sides of a market: riders and drivers (Uber, Lyft, BlaBlaCar), guests and hosts (Airbnb), buyers and sellers (eBay), and so on. Officially, platforms are intermediaries only, able to put in touch, say, those who need a lift and those who have a car, so that they can share the ride. Platforms don’t employ drivers and don’t own cars.


Second European Social Networks Conference (EUSN 2016)

I am lucky enough to be part of the organizing committee of the second European Social Networks Conference, which will take place at Sciences Po Paris on 14-17 June 2016. The EUSN conferences have been created to offer a single place for the European community of social networks researchers to gather, in place of previous national annual conferences; they have been endorsed as regional conferences by INSNA, the international association of network researchers. A first, successful EUSN conference was held in Barcelona in 2014.

Somehow, the European social networks crowd seems more diverse than the US-based core of scholars who gave life to INSNA and drove its development over time. While remaining attached to the INSNA format and philosophy (for example, by selecting proposals only on the basis of an abstract, to be maximally inclusive), the European conferences can afford to explore new ideas, and variants on classical schemes. In particular, this year, we are trying to enlarge participation and attract delegates from a wider variety of disciplines, beyond those traditionally most represented – the social sciences, mathematics, and more recently statistics. Hence, for example, the keynote speakers will give a sense of continuity – we will have social anthropologists Miranda Lubbers and José Luis Molina, the organisers of the first EUSN in Barcelona, on “Ethnography and multilevel networks in the study of migration and transnationalism”. But the plenary speech is an opening to recent, relevant developments in computer science: Jean-Daniel Fekete of INRIA will talk about “Challenges in social network visualization: bigger, dynamic, multivariate”.

Submissions are now invited for paper and poster proposals (abstract only – deadline 16 February 2016). There are special thematic sessions and general sessions, and all fields are welcome. A prize will be awarded for the best poster, for which all participants will be able to vote.

The day before the conference, 15 training workshops are offered on the theory, data collection, methods of analysis and visualization of social networks.

IMPORTANT DATES:
16 February: Deadline for abstract/poster proposals, and pre-registration opening
1 March: Registration opening
16 March: Notification to authors
18 April: Early registration closure
14 June: Workshops
15-17 June: Conference

New year, new job, new life…

Yes, I must admit it: I didn’t keep my new-year-2015 promise of posting more often on my blog… and the annual report I received yesterday from WordPress, showing a couple of peaks of activity and frightening silence the rest of the year, isn’t something I would be proud to share… but I have a justification! Seriously, it’s not just an excuse – it’s that I’ve been busy trying to change my life… and yes, I managed. On Monday 4 January, I’ll start an exciting new position as senior research scientist at the National Center of Scientific Research (CNRS, or in French, Centre national de la recherche scientifique) in Paris. CNRS can be loosely compared to what is, in other countries, a National Research Council, but there’s more to it than international comparisons might vaguely suggest: this is probably the single most desired job in French academia, with a mission “to contribute to the development of knowledge… in all fields that contribute to the advancement of society“. In plain words, that’s basically pure research with almost no teaching apart from some PhD supervision… a dream that would hardly be possible in the UK, where I was before.

I’ll be at the Lab for Computer Science (LRI, Laboratoire de Recherche en Informatique, UMR8623) on the Saclay campus, and I’ll work with the A&O (Learning and Optimization) research team. The interesting thing is that mine is an interdisciplinary position, designed to facilitate dialogue and collaboration between the social sciences and computer science around big data and their use for the advancement of knowledge, policy, and more generally society. I have been especially selected by the sociology section of CNRS to work in a computer science research centre. There, I am asked to develop my personal, long-term research project on the “sharing economy” of digital platforms and how they create value from the social ties in which economic action is embedded: this will require blending my research on data, social networks and the digital economy with machine learning and optimization approaches (more on this later … yes on this blog! promise!).

What else will I do this year at LRI? I am on the organising committee of the Second European Social Networks Conference which will take place in Paris next June, I am finishing a book on so-called “pro-anorexia” websites as the conclusion of my past project ANAMIA, and I am on the Editorial Board of Revue Française de Sociologie.

I won’t entirely forget England though… I’ll keep my doctoral students at Greenwich and continue my engagement at UCL’s Institute of Education as external examiner. Come on, you can’t just disappear after six years! Indeed, I’ll always remember those six years as most productive and fulfilling ones. And however happy I am now to join CNRS, I’ll never forget the expressions of love, sympathy and friendliness I received from colleagues and students when I left Greenwich in December. The cards, the presents, the parties… all beyond any expectations I might have had before! Thank you Greenwich. And well, yes, a big thank you to all those who made it possible – both those in London who made me have a great time far from home for so long, and those in Paris who helped me come back, not without effort, and have welcomed me now.

A great new year is about to start, and I promise I’ll document it more… 😉

Discussing platform cooperativism

On Monday, 7 December 2015 at Telecom ParisTech, I was discussant at a seminar by New School scholar Trebor Scholz on “Unpacking Platform Cooperativism“.

Internet platforms carry an unprecedented potential of value creation, exploiting the extraordinary power of data and algorithms to extract and distribute information to an extent never seen before. Information, as we have known since Hayek’s time, is the fuel that keeps markets going, that eliminates “lemons” and ensures an ever-better coordination between buyers and sellers, borrowers and lenders, or landlords and tenants. At the same time, the internet has channeled the dream of a viable non-market society, since Rheingold’s 1993 revival of the “community” and Barbrook’s 1998 “hi-tech gift economy“. So, can we put this informational efficiency to the service of a more humane economy, based on relationships, solidarity and reciprocation, rather than on the sheer market system?

The so-called “sharing economy” suggests answers, but also displays a tension: the efforts of myriad grassroots associations to develop collaboration as a value and a practice contrast sharply with the spectacular growth of firms like Airbnb and Uber, now large multinationals, and their alleged cavalier attitude to anti-trust regulations and workers’ rights. Even if some say Uber is not really about sharing and collaboration, it is difficult to draw the line.

This ambiguity is fostered by a public discourse that focuses on the sharing of assets – the spare room in your home, or a seat in your car – that digital platforms enable. Asset-sharing has economic and social appeal: it increases efficiency by preventing assets from lying idle, while reducing waste, shifting emphasis away from consumerist values (“access is better than ownership“), and facilitating sociality beyond mere consumption.

But it is often forgotten that asset-sharing does not produce value by itself: it involves extra labour. In economic jargon, capital and labour are complementary production factors. In practice, if you want to put your spare room on Airbnb, you must produce an ad, monitor your message inbox and reply swiftly. You must clean the room and do the laundry before and after a guest’s visit. You must show your guests around when they arrive.

More importantly, the very opportunity of asset-sharing changes the incentives that shape labour supply – people’s willingness to sell their time and effort against a payment. Because of the expected compensation, some people will renounce use of a (non-spare) room to accommodate visitors instead, and others will do more journeys to drive passengers around – so it’s not really about sharing unused assets, it is about self-employment and starting a micro-business. A work opportunity as a complement to (and sometimes a substitute for) a main job.

This is where debates on internet platforms and the sharing economy rejoin the growing literature on digital labour – and where the contribution of Trebor Scholz is illuminating. Where others see assets (i.e., capital), he sees labour. He shows us that the bottlenecks here are about labour, not capital, and that the success – be it economic or social – of the sharing economy is closely tied to the destiny of labour. Whether it appears on the surface as self-employment, micro-entrepreneurship or salaried work doesn’t really matter. Trebor reminds us of Marx’s fundamental principle that production relations are central to our (capitalist) society, and value generation rests ultimately on labour. If this very crucial part of the human experience goes wrong, even the best side of the sharing economy – the one that endorses trust, reciprocity, and zero-waste – may fail to have any transformative effect on society.


International Program in Survey and Data Science

A new, master’s level programme of study in Survey and Data Science is to be offered jointly by the University of Mannheim, the University of Maryland, the University of Michigan, and Westat. Applications for the first delivery are accepted until 3 January, for a start in Spring 2016. Prospective students are professionals with a first degree, at least one year of work experience, and some background in statistics or applied mathematics. All courses are delivered in English, fully online, to small classes (it’s not a MOOC!). Tuition is free, thanks to support from German public funds, at least for the first few cohorts.

What is most interesting about this master is its twofold core, involving both more classical survey methodology and today’s trendy data science. Fundamental changes in the nature of data, their availability, the way in which they are collected, integrated, and disseminated, have found many professionals unprepared. These changes are partly due to “big” data from the internet and digital devices becoming increasingly predominant relative to “small” data from surveys. Big data offer the benefit of fast, low-cost access to an unprecedented wealth of informational resources, but also bring challenges as these are “found” rather than “designed” data: less structured, less representative, less well documented (if at all…). In part, these changes are also due to the world of surveys changing internally, with new technical challenges (regarding for example data preservation, in a world of pre-programmed digital obsolescence), legislative issues (such as those triggered by greater awareness of privacy protection), increased demand by multiple users, and a growing need to merge surveys and data from other (such as business and administrative) sources. It is therefore necessary, as the promoters of this new study programme rightly recognize, to prepare students for the challenges of working both with designed data from surveys and with big data.

It will be interesting to see how data science, statistics, and social science / survey methodology feed into each other and support each other (or fail to do so…). There is still work to be done to develop techniques for analyzing data that allow us to gain insights more thoroughly, not just more quickly, and help us develop solid theories, rather than just uncovering new relationships that might eventually turn out to be spurious.

Databeers now in London

In the midst of the chaos and sadness of the past week, a more leisurely note: the first of a new “Databeers” series of events in London yesterday evening, following a format that has been experiencing a huge success in Spain, Italy and other countries. The event is very informal, and getting to know other data enthusiasts is the main goal. There are a few flash talks with free beers and networking time.

The next Data Beers London event is on 25 February 2016.
