GitGeo: Discover the Geography of Software

John Speed Meyers, Data Scientist / Kinga Dobolyi, Data Scientist / Bentz Tozer, Senior Member, Technical Staff / George Sieniawski, Data Engineer / Mona Gogia, Senior Engineer / Morgan Mahlock, Senior Associate / Paulo Dutra, Member, Technical Staff / Murali Kannan, Member, Technical Staff
Explores the global origins of open source software with GitGeo, a tool revealing the geographical distribution of developers behind GitHub repositories, offering insights into the diverse landscape of software development.

Figure 1. Geography of the top 400 contributors to the Python package Requests, which has over 128 million downloads per month.

Do you know where in the world your software comes from? Although open source software projects often mention imports and exports, borrowing the language of international trade, the code repositories in which they reside often seem placeless, like much of the Web. To reveal the global landscape of open source software for interested users, we recently built GitGeo, a tool that analyzes the geography of software developers associated with a GitHub repository. GitGeo can help users:

  • Understand high-level social trends in the context of open source software;
  • Understand and organize software maintainer communities around the globe; and
  • Build context around software supply chains as part of a larger open source software risk assessment process.

More broadly, we view GitGeo as one initial prototype for a tool suite that helps someone better understand the people behind an open source software project. If you're interested in helping build related tools, please contact us!

Mapping Open Source Software for Research

Perhaps you're in the public sector and evaluating direct funding for open source software and basic digital infrastructure, or maybe you're a researcher trying to discern broad patterns in open source software development. GitGeo can provide the geographic data associated with GitHub repositories, along with basic analysis of it, to enable your work. Imagine, for instance, that you would like to know where in the world the top 100 contributors to each of the 100 most critical Python packages are located. (For one possible definition of open source "criticality," consider this quantitative metric proposed by programming guru Rob Pike as part of the efforts of the Open Source Security Foundation.) By running GitGeo, you can generate the figure below to get an answer.

Figure 2. Geography of the top 100 contributors to the 100 most critical Python packages
(based on the OpenSSF criticality measure)

GitGeo reveals that critical Python software packages are maintained by software developers the world over. Over 450 of these contributors are in Germany, 370 are in India, 220 are in China, and nearly 2,300 are in the United States. To be sure, not all of these software developers list a location, but over 60% of developers in this example do. We should also acknowledge that the data underlying this figure was cleaned manually in the optimistic hope that this example might interest a broader community. GitGeo nonetheless identified the correct country for 94% of software developers in this analysis. A later section further explains GitGeo's limitations.

Furthermore, GitGeo data can help identify developers associated with multiple critical Python packages. See the table below for the GitHub username, the number of critical Python packages on which that user is among the top 100 contributors (measured by Git commits), and the developer's country.

Figure 3. Top 10 Contributors to the 100 Most Critical Python Packages (based on the OpenSSF criticality measure)

Notably, several of these individuals are among the 100 most active developers on more than 10 critical Python packages. This analysis also provides some tentative support for the validity of the OpenSSF criticality score since several of these users are known prolific contributors to the Python ecosystem.
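The cross-package tally behind a table like Figure 3 can be sketched in a few lines of Python. This is a minimal illustration rather than GitGeo's own code: the function name packages_per_contributor and the package and user names are made up, and the input is assumed to be a mapping from each package to the usernames of its top contributors.

```python
from collections import Counter


def packages_per_contributor(top_contributors):
    """Count, for each username, how many packages list them as a top contributor."""
    counts = Counter()
    for usernames in top_contributors.values():
        # set() guards against a username appearing twice for one package
        counts.update(set(usernames))
    return counts


# Made-up package and user names, for illustration only
example = {
    "package-a": ["alice", "bob", "carol"],
    "package-b": ["alice", "carol"],
    "package-c": ["alice", "dave"],
}

# The two users who appear on the most packages
ranking = packages_per_contributor(example).most_common(2)
print(ranking)
```

Sorting by this count surfaces the developers who appear across the most packages, which is the ordering a table like the one above relies on.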

It may also be possible to use GitGeo to research economic trends related to open source software. As a precedent, Harvard Business School professor Frank Nagle has studied the effect of a change in French technology procurement policy on open source software production. Tools like GitGeo allow analysts to identify the location of developers, making possible geographic analysis of the economics of open source software. Future research could also use GitGeo to examine the intersection of development economics and open source software: what is the worldwide distribution of open source software over time and why?

Managing Software Communities with GitGeo

Open source software maintainers and advocates can also use GitGeo as a tool for helping to manage a worldwide community of software developers. One could, for instance, imagine a GitGeo-like feature built into new open source community management platforms. Potential users include software conference planners looking to minimize travel distance for a given community as well as technology standards bodies trying to determine the most convenient time zone for developer calls. Consider the map below displaying the geography of the approximately top 400 GitHub contributors associated with Kubernetes, the popular container management system.

Figure 4. Geography of top ≈400 contributors to Kubernetes, which has over 3 million downloads per month

While many (126) of these developers are in the United States, sizable subgroups are based in Europe (>40) and China (38). This type of information can help the Kubernetes community understand itself and create events and coordination mechanisms more narrowly tailored to its constituents.

How GitGeo Works

GitGeo asks the user to specify one or more GitHub repositories -- think websites for open source code associated with a particular project. It then creates a choropleth (color-coded) map, as shown in the example above, displaying the count of software developers associated with each country.

GitGeo relies on the GitHub application programming interface (or API), a set of instructions for requesting data from the world's largest code hosting platform. After retrieving user profile data for each software developer associated with a repository, GitGeo parses the user-provided location data to associate each user with a country. For instance, a GitHub user reporting that they are in "Paris, France" is associated with France while another user in "Paris, Kentucky" is associated with the United States. Because GitHub's user-provided location data is free-form (rather than chosen from a drop-down menu, for example), GitGeo uses various heuristics to disambiguate and clean this information. To be sure, GitGeo is far from perfect when it comes to predicting country location. Free-form text creates ambiguities that can lead to incorrect predictions. For instance, if the user lists only "Paris," what should GitGeo do? Currently, GitGeo will default to the more populous locality, favoring "Paris, France" over "Paris, Kentucky" in this example. But, we admit, this type of shortcut doesn't always work.
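To make the "Paris" example concrete, here is a minimal sketch of this kind of heuristic. It is not GitGeo's actual implementation: the lookup tables, the population figures (rough, for illustration only), and the function names are all assumptions. The one real-world detail it relies on is that the GitHub REST API returns a free-form location field for a user profile at /users/{username}.

```python
import json
import urllib.request
from typing import Optional

GITHUB_API = "https://api.github.com"

# Tiny hypothetical lookup tables; a real tool would need far larger ones.
COUNTRIES = {"france": "France", "germany": "Germany", "india": "India",
             "china": "China", "united states": "United States",
             "usa": "United States"}
US_STATES = {"kentucky", "texas", "california"}
# city -> candidate (country, rough population) pairs, illustrative only
CITIES = {
    "paris": [("France", 2_100_000), ("United States", 10_000)],
    "berlin": [("Germany", 3_600_000)],
}


def fetch_profile_location(username: str) -> Optional[str]:
    """Return the free-form location from a GitHub profile (network call;
    unauthenticated requests are rate-limited). Not invoked in this sketch."""
    with urllib.request.urlopen(f"{GITHUB_API}/users/{username}") as response:
        return json.load(response).get("location")


def guess_country(location: Optional[str]) -> Optional[str]:
    """Map a free-form location string to a country, or None if unknown."""
    if not location or not location.strip():
        return None
    parts = [p.strip().lower() for p in location.split(",") if p.strip()]
    # An explicit country or U.S. state settles the question.
    for part in reversed(parts):
        if part in COUNTRIES:
            return COUNTRIES[part]
        if part in US_STATES:
            return "United States"
    # Otherwise fall back to the most populous city with that name.
    for part in parts:
        if part in CITIES:
            return max(CITIES[part], key=lambda candidate: candidate[1])[0]
    return None


for text in ["Paris, France", "Paris, Kentucky", "Paris"]:
    print(text, "->", guess_country(text))
```

The population fallback is exactly the shortcut described above, so it inherits the same weakness: a developer who really does live in Paris, Kentucky, and writes only "Paris" will be placed in France.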

Finally, GitGeo uses the Python package Folium to build an interactive choropleth viewable in a web browser. Each country is shaded in direct proportion to the count of software developers who work on that project.
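The mapping step can be sketched as follows. This is an illustrative outline, not GitGeo's code: the function names, the geojson_path argument, and the key_on value are assumptions, and you would need to supply a country-boundary GeoJSON file whose country names match the names in your counts. The folium and pandas imports sit inside the plotting function so the counting helper can be used without either package installed.

```python
from collections import Counter


def count_by_country(countries):
    """Tally developers per country, skipping developers with no known country."""
    return Counter(country for country in countries if country)


def build_choropleth(counts, geojson_path,
                     key_on="feature.properties.name", out_path="map.html"):
    """Render per-country counts as an interactive choropleth in an HTML file.

    geojson_path and key_on are assumptions: they must match the structure
    of whichever country-boundary GeoJSON file you provide.
    """
    import folium        # third-party: pip install folium
    import pandas as pd  # third-party: pip install pandas

    data = pd.DataFrame(sorted(counts.items()), columns=["country", "developers"])
    world = folium.Map(location=[20, 0], zoom_start=2)
    folium.Choropleth(
        geo_data=geojson_path,
        data=data,
        columns=["country", "developers"],
        key_on=key_on,
        fill_color="YlGn",
        legend_name="Developers per country",
    ).add_to(world)
    world.save(out_path)
    return out_path


counts = count_by_country(["France", "United States", None, "France"])
print(counts)
```

Calling build_choropleth(counts, "countries.geojson") with a suitable boundary file (the file name here is hypothetical) would write map.html, which can be opened in any web browser.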

GitGeo (or an Improved Open Source Software Explorer Tool) Could Aid IT Security

Open source users are also increasingly interested in the provenance of open source software and the "digital identity attestation" of software developers. Both stem from concerns about the security of open source software, a highly contentious topic given legitimate fears about privacy, security, and the potential abuse of rules of origin for open source software. At the same time, Ken Thompson's oft-cited Turing Award lecture "Reflections on Trusting Trust" -- perhaps the seminal statement on software security -- bears repeating. In this piece Thompson muses that "perhaps it is more important to trust the people who wrote the software" than to trust technical assertions about software security.

Accordingly, we propose that GitGeo (or similar tools that build context around developer identity) may be of interest to organizations that want to know more about the people behind a piece of open source software. And while we applaud efforts to make software security an engineering discipline focused on assessing code -- not people -- we also observe that current technical approaches to assessing software security have their limits. Viewed in this light, GitGeo can help build a more holistic basis for researching the relationship between developers and software security.

GitGeo

GitGeo is only an initial prototype. Our real interest is bigger: a suite of tools to help someone better understand the people behind a piece of software. GitGeo helps with geographic understanding, but other tools could help with organizational affiliation, related open source activities, and coding specialties, for example. If you want to help build such a larger open source software exploration suite, please contact us.

If you're interested in GitGeo, whether that means contributing code or documentation, using the tool, or reporting a bug or suggestion, drop in at the project's GitHub and create a pull request or open an issue. And if you're interested in open source software, the federal government, and national security, please send an email to jmeyers@iqt.org. We'd love to talk!

Related content

To learn more about related IQT Labs research, please explore the following articles:

"Toward Secure Code Reuse," IQT Blog, Feb. 2021

"Counting Broken Links: A Quant's View of Software Supply Chain Security," USENIX ;login:, Dec. 2020

"pypi-scan: A Tool for Scanning the Python Package Index for Typosquatters," IQT Blog, Oct. 2020

"Bewear: Python Typosquatting is about More than Typos," IQT Blog, Sept. 2020

"Who Will Pay the Piper for Open Source Software Maintenance? Can We Increase Reliability as We Increase Reliance?," USENIX ;login:, Jun. 2020