Data Big and Small

A social scientist's venture into big data, while still learning much from surveys and fieldwork


Category Archives: Computational social science

Sociology of AI, Sociology with AI (2)

There are two main ways in which a discipline like sociology engages with artificial intelligence (AI) and is affected by it. In a previous post, I discussed how the sociology of AI understands technology as embedded in socio-economic systems and takes it as an object of research. Here, the focus is on sociology with AI, that is, on how the discipline is integrating AI into its methodological toolbox.

AI technologies were designed for purposes other than research, but they may be repurposed. The editors of a special issue of Sociologica last year stress that, ten years ago, digital methods presented similar challenges: for example, using tweets to make claims about the social world required us to understand how people used Twitter in the first place. We also needed to understand which people used Twitter at all, and what spaces of action the architecture and Terms of Use of the platform allowed. Likewise, AI technologies can serve sociologists insofar as efforts are made to understand the technological (and socio-economic) conditions that produced them. Because AI systems are typically black-boxed, this requires, to begin with, developing exploration techniques to navigate them. For example, the plot below is from a recent paper that asked five different LLMs to generate religious sermons and compared the results (readability scores) by religious tradition and race. It finds that synthetic Jewish and Muslim sermons were written at significantly more difficult reading levels than those for evangelical Protestants, and that Asian religious leaders were assigned more difficult texts than other groups. This is a way to uncover how models treat religious and racial groups differently, although it remains hard to pinpoint exactly which factors drive the result.

Robust regression coefficients predicting readability scores by religion, race, and model. Source: Tom J.C., Ferguson T.W. & Martinez B.C. 2025. Religion and racial bias in Artificial Intelligence Large Language Models. Socius: Sociological Research for a Dynamic World, 11. https://doi.org/10.1177/23780231251377210
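To give a concrete sense of this kind of exploratory work, here is a minimal sketch of how one might compute a standard readability metric for model-generated sermons and compare it across groups. It is a hypothetical illustration in Python (using the textstat package), not the authors' pipeline: they estimate robust regressions, and the data layout below is invented.

```python
# Minimal sketch: compare the readability of synthetic sermons across groups.
# Invented data layout; the cited study uses robust regressions, not raw means.
import pandas as pd
import textstat  # standard readability formulas

# One row per LLM-generated sermon (placeholder texts)
sermons = pd.DataFrame([
    {"model": "model_A", "tradition": "Evangelical Protestant", "race": "White",
     "text": "God loves you. Trust in him and do good every day."},
    {"model": "model_A", "tradition": "Jewish", "race": "White",
     "text": "The covenantal obligations articulated in the weekly portion demand sustained ethical deliberation."},
    {"model": "model_B", "tradition": "Muslim", "race": "Asian",
     "text": "Contemplation of divine attributes requires disciplined, continual remembrance and scholarly reflection."},
])

# Flesch-Kincaid grade level: higher scores indicate harder texts
sermons["fk_grade"] = sermons["text"].apply(textstat.flesch_kincaid_grade)

# Average readability by religious tradition and by assigned race
print(sermons.groupby("tradition")["fk_grade"].mean())
print(sermons.groupby("race")["fk_grade"].mean())
```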

That said, how can AI help us methodologically? To answer this question, it is useful to look at qualitative and quantitative approaches separately. Qualitative research, traditionally viewed and practiced as an intensely human-centred method, may seem at first sight incompatible with AI. However, computer-assisted qualitative data analysis (with tools such as NVivo) is now common among qualitative researchers, even though it faced some scepticism at the beginning. Attempts to leverage AI push this agenda further, and the most common application so far is the automated transcription of interviews through speech recognition technologies. AI-powered tools make this otherwise tedious task more efficient and scalable. A recent special issue of the International Journal of Qualitative Methods maps a variety of other, less common and more experimental usages: for example, considering that even the best audio transcription models are not as accurate as humans, LLMs appear as tools to facilitate and speed up transcription cleaning. There are also attempts at using LLMs as instruments for coding and thematic analysis: some authors have examined inter-coder reliability between ‘human only’ and ‘human-AI’ approaches. Others have used AI-generated images – much like vignettes – as a tool to help interview participants articulate their experiences. Overall, the use of AI remains experimental and marginal in the qualitative research community. Those who have undertaken these experiments find the results encouraging, but not perfect.
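As an illustration of the inter-coder reliability idea, the sketch below compares human-only and human-AI codes for the same interview segments with Cohen's kappa; the segments, codes, and agreement measure are illustrative assumptions, not the procedure of any specific study cited here.

```python
# Minimal sketch: agreement between human-only and human-AI thematic coding.
# Segments and codes are invented for illustration.
from sklearn.metrics import cohen_kappa_score

# Codes assigned to the same six interview segments by the two approaches
human_only = ["care", "work", "care", "family", "work", "care"]
human_ai   = ["care", "work", "family", "family", "work", "care"]

kappa = cohen_kappa_score(human_only, human_ai)
print(f"Cohen's kappa (human-only vs human-AI coding): {kappa:.2f}")
```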

Wes Cockx & Google DeepMind / AI large language models (detail) / https://betterimagesofai.org / https://creativecommons.org/licenses/by/4.0/

In quantitative research, some AI tools are already (relatively) widely used: in particular, natural language processing (NLP) to process textual data such as press or media corpora. More recent applications leverage generative AI, especially large language models (LLMs). Beyond practices like literature review and code generation/debugging, which are common to many disciplines, three applications are specifically sociological and worth mentioning. First, in experimental research, there are attempts to examine how introducing AI agents into interaction games shapes the behaviour of the humans they play with. As discussed by C. Bail in an insightful PNAS article, the extent to which generative AI is capable of impersonating humans is nevertheless subject to debate, and it will probably evolve over time. Second, L. Rossi and co-authors outline that, in agent-based models (ABM), one idea is to use LLMs to create agents capable of capturing a larger spectrum of human behaviours. While this approach may provide more realistic descriptions of agents, it re-ignites a long-standing debate in the field: many believe that increasing the complexity of agents is undesirable when collective dynamics can emerge from more parsimonious models. It is also unclear how the performance of LLMs within ABMs should be evaluated: should we call them good if they reproduce known collective dynamics within ABMs, or should they be assessed on their capacity to predict real-world outcomes? Third, in survey research, the question has arisen whether LLMs can simulate human populations for opinion research. Some studies have tested this possibility, with mixed results. The most recent available evidence, in an article by J. Boelaert and co-authors, is that: 1) current LLMs fail to accurately predict human responses; 2) they do so in unpredictable ways, as they do not systematically favour specific social groups; and 3) their answers exhibit substantially lower variance between subpopulations than is found in real-world human data. In passing, this is evidence that so-called ‘bias’ does not necessarily operate as expected – LLM errors do not stem primarily from unbalanced training data.
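A toy example of the third finding (compressed differences between subpopulations) is sketched below: it compares the spread of group-level mean responses in real versus LLM-simulated survey answers. All numbers, groups, and scales are invented for illustration.

```python
# Minimal sketch: between-group variance of survey responses,
# human respondents vs LLM-simulated "respondents". All numbers invented.
import pandas as pd

data = pd.DataFrame({
    "source":   ["human"] * 6 + ["llm"] * 6,
    "group":    ["A", "A", "B", "B", "C", "C"] * 2,
    "response": [1, 2, 4, 5, 3, 5,    # humans: answers spread across a 1-5 scale
                 3, 3, 3, 4, 3, 3],   # LLM: answers cluster around the midpoint
})

# Mean response per group, then the variance of those group means by source
group_means = data.groupby(["source", "group"])["response"].mean()
print(group_means.groupby("source").var())  # lower for the LLM in this toy case
```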

These applications face specific ethical challenges. First, studies that require humans to interact with AI may expose them to offensive or inaccurate information, the so-called ‘AI hallucinations’. Second, there are new concerns about privacy and confidentiality. Most generative AI models are the private property of companies: if we use them to code in-depth interviews on a sensitive topic, the full content of those interviews may be shared with companies that are often not bound by the same standards and regulations for personal data protection. Third, the environmental impact of these models is high, in terms of the energy needed to run the systems, the water used to cool servers in data centres, the metal extraction required to build devices, and the production of e-waste. The literature on AI-as-object also warns that there is a cost in terms of the unprotected human work of data annotators.

Another limitation is that research with generative AI is difficult to replicate. These models are probabilistic in nature: even identical prompts may produce different outputs, in ways that are not well understood today. Also, models are constantly being fine-tuned by their producers in ways that we, as users, do not control. Finally, different LLMs have been found to produce substantially different results in some cases. Many of these issues stem from the proprietary nature of most models – so much so that some authors, like C. Bail, believe that open-source models devoted to, and controlled by, researchers could help address some of these challenges.
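One simple way to document this instability is to repeat an identical prompt and count how many distinct outputs come back, as in the sketch below. The `generate` function is a placeholder for whatever model call is under study (a local open-source model or a commercial API), not a real library function.

```python
# Minimal sketch: probe output stability by repeating the same prompt.
# `generate` is a placeholder for the model call being audited.
from collections import Counter

def generate(prompt: str) -> str:
    """Placeholder: call the LLM under study and return its text output."""
    raise NotImplementedError

def stability_check(prompt: str, n_runs: int = 20) -> Counter:
    """Distribution of distinct outputs over n_runs identical calls."""
    return Counter(generate(prompt) for _ in range(n_runs))

# counts = stability_check("Summarise this interview excerpt in one sentence: ...")
# One dominant output suggests near-deterministic behaviour; many distinct outputs
# quantify how much results would vary across replication attempts.
```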

Overall, AI has slowly entered the sociologist's toolbox, and except for some applications that are now commonplace (from automated transcription to NLP), its use has not completely revolutionised practices. This pattern is not exclusive to sociology. My own study of the diffusion of AI in science until 2021, as part of the ScientIA project led by F. Gargiulo, showed limited penetration in almost all disciplines, although the last two years have seen a major acceleration. The opportunities that AI offers are promising, although many are still more hypothetical than real at the moment. We keep seeing calls that invite sociologists and social scientists to embrace AI, but the number of actual realisations is still small. Almost all applications devote time to considering the epistemological, methodological, and substantive implications. A question that often emerges concerns the nature of bias. The AI-as-object perspective challenges the language of bias, and we see the same challenge here, though from a different angle. There is still no shared definition of bias (or any substitute for this term). More generally, patterns are similar in qualitative and quantitative studies. The guest editors of last year's special issue of Sociologica suggest that, like digital methods 10-15 years ago, generative AI is supporting a move beyond the traditional qualitative/quantitative divide.

To conclude, both the sociology of AI and sociology with AI exist and are important, but they are not well integrated. This is one of the bottlenecks for the development of the methodological toolbox of sociology, but also for the development of an AI that is useful and positive for people and societies. In part, this may be due to a lack of adequate (technical) training for part of the profession, or to the absence of guidelines (for ethics and/or scientific integrity). But perhaps the real obstacles are less immediately visible. One of them is the difficulty of judging the uptake of AI in our discipline: are we just feeding the hype if we use it? Or are we missing a major opportunity to make sociology more relevant and stronger if we don't? The other concerns questions and issues that go beyond the specificities of sociology: how can we continue interacting with other disciplines while upholding sociology's distinctive contribution?

Posted by paolatubaro on 20/12/2025 in Artificial intelligence, Computational social science, Data & methods. Tags: Artificial intelligence, Automated transcriptions, Big data, Large Language Models (LLM), NLP, Research ethics, Social science research, sociology, Surveys.

Is Paris a 15-minute city?

A few years ago, when hopes of leveraging technology to build a more humane "sharing" economy had not yet completely vanished, it was often believed that the most interesting policy experiments were to be found at the local level of cities, not states. One of those was the urban-planning concept of the 15-minute city, which aims to make essential amenities such as schools and shops accessible within a 15-minute walk or bike ride. Launched in Paris before receiving enthusiastic support worldwide, it was part of the current mayor's latest re-election campaign.

Fast forward to today: how far, actually, is Paris from the 15-minute goal? Sarah J. Berkemer and I have endeavoured to answer this question in a just-published article (available in open access!) with three brilliant ENSAE students (Marie-Olive Thaury, Simon Genet and Léopold Maurice). We harness open map data from the large participatory project OpenStreetMap and geo-localized socio-economic data from official statistics (Insee) to fill this gap.
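For readers curious about the mechanics, here is a minimal sketch of how 15-minute walking accessibility can be approximated from OpenStreetMap data with the osmnx library (version 1.3 or later assumed). The coordinates, amenity tags, and walking speed are illustrative; the published analysis uses a fuller pipeline combined with Insee socio-economic data.

```python
# Minimal sketch: amenities around a point and street nodes reachable within
# a 15-minute walk, from OpenStreetMap data via osmnx (>= 1.3 assumed).
import networkx as nx
import osmnx as ox

WALK_SPEED_M_PER_MIN = 80                 # ~4.8 km/h
CUTOFF_M = 15 * WALK_SPEED_M_PER_MIN      # 15-minute walking budget, in metres

center = (48.8566, 2.3522)                # illustrative point in central Paris (lat, lon)

# Pedestrian street network around the point, and the network node closest to it
G = ox.graph_from_point(center, dist=CUTOFF_M + 200, network_type="walk")
origin = ox.distance.nearest_nodes(G, X=center[1], Y=center[0])

# Street nodes reachable within the walking budget (edge lengths are in metres)
reachable = nx.ego_graph(G, origin, radius=CUTOFF_M, distance="length")

# A few illustrative amenity types mapped around the same point
amenities = ox.features_from_point(
    center, tags={"amenity": ["school", "pharmacy", "library"]}, dist=CUTOFF_M
)

print(f"{len(reachable)} reachable street nodes, {len(amenities)} nearby amenities")
```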

While the city of Paris is rather homogeneous, we show that it is nonetheless characterized by remarkable inequalities between a highly accessible city centre (though with some internal differences in terms of types of amenities) and a less equipped periphery, where lower-income neighborhoods are more often found. Heterogeneity increases if we consider Paris together with its immediate surroundings, the “Petite Couronne,” where large numbers of daily commuters and other users of city facilities live.

We find that this ambitious urban-planning objective cannot be achieved without addressing existing socio-economic inequalities, and that, especially in a big city like Paris, it cannot be confined to the narrow boundaries of the municipality itself but must also take in the city's immediate surroundings.

One reason why I am particularly proud of this work is that it demonstrates how far research-informed teaching can go. Most higher education is about familiarizing students with generally accepted and confirmed knowledge, without going beyond the state of the art. This is certainly important and in all cases a "safe" bet, but it does not give students a sense of what it means to push the boundaries further. This project was an opportunity to do so. It gave students the role of researchers – our peers – letting them play that role in full and showing them all the back-office work that lies behind publications (from drafting to responding to reviews and copy-editing), which is (too) often hidden from students' view. There is probably more to experiment with around this model.

To read the full article: Thaury M.-O., Genet S., Maurice L., Tubaro P. & Berkemer S.J., 2024, ‘City composition and accessibility statistics in and around Paris’, Frontiers in Big Data, 7, DOI=10.3389/fdata.2024.1354007

Posted by paolatubaro on 09/03/2024 in Computational social science. Tags: Big data, Geo-localized data, Higher education, Participatory mapping, Public policy, Sharing economy, Social science research.

Where does AI come from, and where is it heading?

Most of my current research aims to unpack artificial intelligence (AI) from the viewpoint of its commercial production, looking in particular at the human resources needed to prepare the data it needs – whence my studies on the data work and annotation market. However, for once, I am focusing on AI as a set of scientific theories and tools, regardless of their market positioning; indeed, I have joined a team of science-of-science specialists to study the disciplinary origins and subsequent spread of AI over time.

In a newly published, open-access article, we unveil the disciplinary composition of AI and the links between its various sub-fields. We question a common distinction between ‘native’ and ‘applicative’ disciplines, whereby only the former (typically confined to statistics, mathematics, and computer science) produce foundational algorithms and theorems for AI. In fact, we find that the origins of the field are rather multi-disciplinary and benefit, among others, from insights from cognitive science, psychology, and philosophy. These intersecting contributions were most evident in the historical practices commonly known as ‘symbolic systems’. Later, different scientific fields became, in turn, the central originators and appliers of AI knowledge: operations research, for example, was for a long time one of the core actors in AI applications related to expert systems.

While the notion of statistics, mathematics and computer science as native disciplines has become more relevant in recent times, the spread of AI throughout the scientific ecosystem is uneven. In particular, only a small number of AI tools, such as dimensionality-reduction techniques, are widely adopted (variants of these techniques have been in use in sociology for decades, for example). And although the transfer of AI is largely ascribable to multi-disciplinary interactions, very few of these exist: we observe very limited collaboration between researchers in disciplines that create AI and researchers in disciplines that only (or mainly) apply it. A small core of multi-disciplinary champions who interact with both sides, together with a few multi-disciplinary journals, sustains the whole system.

Inter- and multi-disciplinary interactions are essential for AI to thrive and to adequately support scientific research in all fields, but disciplinary boundaries are notoriously hard to break. Strategies to better reward inter-disciplinary training, publications, and careers are thus essential. Of course, the potential for AI to significantly advance knowledge is still (largely) to be proven, and there have been disappointing experiences – for example, the comparatively limited effectiveness of these tools in research on Covid-19. In any case, the status quo is not ideal, and important steps forward are now needed.

We establish these results by analyzing a large corpus of scientific papers published between 1970 and 2017, extracted from Microsoft Academic Graph through the AI keywords used by the authors, and explored through different relational structures in the scientometric data (a keyword co-occurrence network and an authors' collaboration network).
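As a rough illustration of the first of these relational structures, the sketch below builds a keyword co-occurrence network from a handful of paper records with networkx. The records are invented placeholders; the actual study works on the full Microsoft Academic Graph extract at a much larger scale.

```python
# Minimal sketch: keyword co-occurrence network from paper records.
# Records are invented; the real corpus is extracted from Microsoft Academic Graph.
from itertools import combinations
import networkx as nx

papers = [
    {"id": "p1", "keywords": ["neural network", "computer vision", "deep learning"]},
    {"id": "p2", "keywords": ["expert system", "operations research"]},
    {"id": "p3", "keywords": ["neural network", "natural language processing"]},
]

G = nx.Graph()
for paper in papers:
    # Every pair of keywords appearing in the same paper adds weight to an edge
    for k1, k2 in combinations(sorted(set(paper["keywords"])), 2):
        if G.has_edge(k1, k2):
            G[k1][k2]["weight"] += 1
        else:
            G.add_edge(k1, k2, weight=1)

print(G.number_of_nodes(), "keywords,", G.number_of_edges(), "co-occurrence edges")
```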

Full citation: Floriana Gargiulo, Sylvain Fontaine, Michel Dubois, Paola Tubaro. A meso-scale cartography of the AI ecosystem. Quantitative Science Studies, 2023; doi: https://doi.org/10.1162/qss_a_00267

Posted by paolatubaro on 04/10/2023 in Artificial intelligence, Computational social science. Tags: Artificial intelligence, Science cartography, Science of science, Social networks, Social science research.

Voices from Online Labour: Inequalities in digital earning activities across countries

What shapes differences in how people get paid, are deemed productive, or receive respect? Alongside traditional explanations of social inequalities such as class, gender, age, disability, race, migration status, rural vs. urban residence, and others, a recent literature highlights the effects of digital divides. The digitally resourced have more opportunities across all life spheres, from consumption to education, work, and health. Ironically, though, digital technologies also generate new vulnerabilities by generalizing low-paid and contingent work. Digital labour platforms like Uber, Deliveroo and Upwork use data and algorithms to match clients with workers, construed as independent contractors, for one-off ‘gigs’ without any long-term commitment. These workers are largely exposed to the vagaries of the market and have limited or no social protection, although increasing efforts aim to bring labour law to bear on platforms.

Growing concerns that platform workers compare unfavourably to conventional employees have already attracted significant research and policy attention. But more remains to be done to fully understand how the recent rise of labour platforms has reshaped the relationship between digitization and inequalities, adding a layer of complexity. Scattered but growing evidence suggests that platforms may be accelerating the transmission of ‘legacy’ inequalities – for example around race and gender – to digital worlds, while also fostering the proliferation of ‘emerging’ inequalities that diminish users' agency and augment the power of technology creators and big-tech multinationals. Platforms for remote, online-only labour especially change the geographical scale at which these questions arise, projecting workers into a competitive planetary market that relentlessly selects winners and losers.

To tackle these questions, I’m happy and honoured to announce that I have just been awarded a major grant (almost 570k euros, at marginal cost) by the French National Agency for Research (ANR) for a new 4-year study called VOLI: Voices from Online Labour. As a team effort that builds on a solid record of interdisciplinary collaborations, VOLI innovatively combines hypotheses and methods from sociology and neighbouring disciplines, notably large-scale corpus linguistics (I’ll explain why below), and relies on speech technology and artificial intelligence to tackle the rising economic risks that coalesce around the nexus between online platform labour, digitization, and social inequalities. The project leverages the power and potential of the very digital tools whose societal effects it studies, to develop an original and potentially transferable methodology.

The innovative idea that underpins the project is to tackle the problem through language, benefiting from recent advances in linguistics research and its capacity to recast methods and tools from artificial intelligence broadly understood – including speech and language technology and machine learning techniques – to capture features and processes that used to escape its traditional methods. Despite the importance of linguistic tasks (such as translation, transcription, writing, and editing) on online labour platforms, linguistic methods have never before been applied to the study of these workers, and are thus well positioned to bring fresh insight. To this end, we have assembled a team of speech technology scientists, computational linguists specialized in multilingual and large-scale corpus analysis, and computational, digital, and labour sociologists. The expected results sustain our ambition to devise policy solutions that mitigate the effects of inequalities, and to support the individuals and groups who accumulate multiple sources of disadvantage.

To harness our previous research experience and ensure continuity, we focus on so-called ‘micro-work’, the necessary but inconspicuous contribution of low-paid masses who annotate, tag, label, correct and sort data to fuel the digital economy, especially artificial intelligence. Because it is performed remotely and can be allocated to providers worldwide, micro-work differs from location-based platform ‘gigs’ such as delivery and transport. It also differs from online-only freelance jobs, for example in computer programming and design, insofar as its extreme segmentation and standardization allow tasks to be dispersed to an undefined crowd rather than to a selected individual (whence the alternative denomination of ‘crowdwork’). Micro-tasks include, for example, recording one's voice while reading a sentence aloud, labelling files, translating short bits of text, or classifying the contents of an image or webpage. They perform essential functions in the development of machine learning and artificial intelligence, from data generation and enrichment to quality control of automated outputs. We give voice to these workers, often invisibilized by the automation narratives popular in the technology industry, by interviewing them about their lived experience, aspirations, motivations and perhaps regrets; and we rely on their voices as data for the simultaneous development of sociology, linguistics, and artificial intelligence (specifically, speech recognition) itself.

Indeed, while bringing our sociological knowledge of the linkages between micro-work and digital inequalities to the next level, the methods developed within this highly interdisciplinary project will also advance the study of the factors driving speech variation within linguistics, augmenting language corpora with rich sets of metadata from sociological surveys, and will build and test new and improved tools for automated transcription, with potential commercial applications.

I am the PI of the VOLI project, which involves four research centres in France:

  • CREST (S. Coavoux, E. Ollion, P. Präg)
  • LISN (I. Vasilescu, L. Lamel, M. Evrard)
  • CRISCO (Y. Wu)
  • SES Department of Telecom Paris (A.A. Casilli, J. Torres Cierpe),

plus a company, Vocapia Research (V.B. Le, J. Despres, I. Swiecicki), and three international partners:

  • Weizenbaum Institut Berlin (M. Miceli),
  • Universitat Autònoma de Barcelona (J.L. Molina)
  • Universitat de València (J.A. Santos Ortega).
Posted by paolatubaro on 03/09/2023 in Artificial intelligence, Computational social science, Digital labour. Tags: Computational linguistics, Digital labour, Digital platforms, Micro-work, Socio-linguistics, Speech technology.

Thematic school on SNA & Complexity

Just back from a stimulating summer school on social network analysis and complexity, organised in sunny Cargèse (Corsica). Lots of exciting talks on communities, network dynamics, and complex social structures, with a touch of genuine interdisciplinarity.

There, I gave a talk on agent-based models and simulation, and co-led a workshop on statistical models for social network data.

The summer school was organized by GDR Analyse des réseaux en sciences humaines et sociales and a short report is available here (in French), with a few more blog posts (also in French) here.

Posted by paolatubaro on 30/09/2018 (updated 19/10/2018) in Computational social science, Data & methods, Events. Tags: Complexity, Social networks, Social science.

Programme now online, “More than complex: large and rich network structures”

As part of the upcoming NetSci2018 conference in Paris, I co-organize a satellite event that aims to foster interdisciplinary reflection on how methods from social science can be upscaled to large network structures and on how methods from complex systems can be downscaled to deal with small heterogeneous structures.

The idea is to reconcile two traditions of research that have so far remained separate. Sociology typically handles small but rich networks, where a wealth of node and edge attributes results from the complexity of the data collection design. Differences across nodes and edges make it possible to capture the social processes underlying network structures and their dynamics. By contrast, the complex systems tradition handles large but poorly specified networks: assuming statistical equivalence of graph entities, a mean-field treatment suffices to describe the aggregate properties of the network. Today's network datasets contain an unprecedented quantity of relational information within and between all possible levels: individuals, social groups, organizations, and macro entities. Such large and rich network structures expose the implicit limitations of the two approaches: classical sociological methods cannot be upscaled because of their computationally heavy algorithms, while those from complex systems lose track of the multi-faceted nature of social actors, their relationships and their processes.

Our satellite event aims to move forward, inviting inter-disciplinary reflection and exploring ways in which these limitations can be overcome.

The program of the satellite is now online.

I am most pleased to co-organize this satellite event with Floriana Gargiulo, Sylvie Huet, and Emmanuel Lazega.

We are honored to count, among our invited speakers, Camille Roth, Matthieu Latapy, Fariba Karimi, and Noshir Contractor.

Posted by paolatubaro on 07/04/2018 in Computational social science, Data & methods, Events. Tags: Big data, Complex systems, Events, interdisciplinary collaboration, Network science, Online social networks, Social network data, Social networks.
