What you can learn by reading all the world's open-source code at once

December 10, 2025

Open-source repositories hold a record of how software is actually written, by millions of developers over many years. Reading that record at scale can answer questions no single project could, but it demands infrastructure: a way to express an analysis once and run it across hundreds of thousands of projects without managing a cluster by hand. For more than a decade, our lab has built that infrastructure, used it to study software, and worked to make large-scale analysis faster and more useful. This page collects that work.

The foundation is Boa (ICSE 2013), a language and infrastructure that lets a researcher write a software-mining task once and run it across an enormous corpus. On top of that foundation we worked on making the analysis itself scale: Collective Program Analysis (ICSE 2018) shares redundant work across similar artifacts, BCFA (ICSE 2020) tailors control-flow traversal to each analysis, and an earlier new-ideas paper (ICSE 2017) first proposed clustering artifacts to cut redundant work.
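The idea of sharing redundant work across similar artifacts can be illustrated with a small sketch. This is a toy illustration only, not the algorithm from the Collective Program Analysis paper and not Boa code: it assumes Python source files as the artifacts, uses a deliberately crude notion of similarity (the same sequence of syntax-tree node types), and all function names (`structural_key`, `count_branches`, `analyze_corpus`) are hypothetical.

```python
import ast
import hashlib
from typing import Callable, Dict, Tuple


def structural_key(source: str) -> str:
    """Hash the shape of the syntax tree, ignoring names and literal values.

    Assumption: two artifacts count as "similar" when their AST node types
    appear in the same order. Real systems use richer similarity notions.
    """
    tree = ast.parse(source)
    shape = " ".join(type(node).__name__ for node in ast.walk(tree))
    return hashlib.sha256(shape.encode("utf-8")).hexdigest()


def count_branches(source: str) -> int:
    """Toy analysis: count if/for/while nodes.

    The result depends only on node types, so reusing it across artifacts
    with the same structural key is safe for this particular analysis.
    """
    tree = ast.parse(source)
    return sum(isinstance(n, (ast.If, ast.For, ast.While)) for n in ast.walk(tree))


def analyze_corpus(
    artifacts: Dict[str, str], analysis: Callable[[str], int]
) -> Tuple[Dict[str, int], int]:
    """Run the analysis once per distinct structure and reuse the result."""
    cache: Dict[str, int] = {}
    results: Dict[str, int] = {}
    runs = 0  # how many times the analysis actually executed
    for name, source in artifacts.items():
        key = structural_key(source)
        if key not in cache:
            cache[key] = analysis(source)
            runs += 1
        results[name] = cache[key]
    return results, runs


if __name__ == "__main__":
    corpus = {
        "f.py": "def f(x):\n    if x:\n        return 1\n    return 0\n",
        "g.py": "def g(y):\n    if y:\n        return 2\n    return 3\n",
    }
    print(analyze_corpus(corpus, count_branches))
```

The two files in the usage example differ only in names and constants, so the analysis runs once and its result is reused. The reuse is only sound when the analysis result depends on nothing beyond what the key captures, which is the central design constraint in any scheme of this kind.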

We also used analysis at scale to answer concrete questions about how software is written and used. Our study of API misuse on Stack Overflow (ICSE 2018) measured how reliable popular code examples are, our work on substitutability in the presence of effects (ESEC/FSE 2018) asked when one component can safely replace another, and data-driven syntactic sugar design (ICSE 2024) turned language evolution into an evidence-based decision. This direction has been supported by NSF awards, including "Boa: Enhancing Infrastructure for Studying Software at Scale" and its successor, Boa 2.0.

This work is part of Analyzing Software at Scale, with Boa. For the complete record, see our list of papers.