Can Code Steal Your Genome?

Ray LeClair, IQT Labs Contractor / John Speed Meyers, Engineer / George Sieniawski, Senior Technologist / Mona Gogia, Senior Engineer / Bentz Tozer, Vice President, Technology
Ever wondered about the security of the code that analyzes humanity's genetic code? Check out how IQT Labs built a prototype security tool for securing bioinformatics packages.

An Initial Security Assessment of Bioinformatics Software Packages

Given the growing threat of software supply chain attacks, in which attackers compromise software packages, we wondered about the security of the bioinformatics ecosystem. Bioinformatics packages provide the tools required to analyze genetic sequences and are increasingly important to modern scientific advances. The National Security Commission on Artificial Intelligence has forecasted that advances in biology and bioinformatics will underpin future major scientific breakthroughs related to human health, agriculture, and climate science. The security of bioinformatics software packages is therefore critical not only for U.S. competitiveness but also for national security.

We therefore turned our attention to Bioconda, a popular channel for bioinformatics software packages used to store, organize, analyze, visualize, and understand biological sequences. Bioconda provides over 8,000 software packages ready to install with Conda, a software package management system. Crucially, Bioconda depends on contributors to voluntarily add, update, and maintain packages. This means Bioconda users place their trust, even if implicitly, in the people and organizations who write this code.

Our research asked this question: Could defenders detect attackers inserting malicious code into the bioinformatics packages that deal with some of humanity's most sensitive data? We decided to build a prototype security analysis pipeline to keep attackers out of the bioinformatics software supply chain. Here's a summary of our findings:

  • Our tools did not detect any currently compromised bioinformatics packages.
  • There are nonetheless vectors of compromise for these bioinformatics packages.
  • Future teams should build security tools that make ecosystem scanning efforts like this easier and less time-consuming.

Those interested in the details should read on. If you're interested in further work on this topic or collaboration, please contact jmeyers@iqt.org.

An Initial Security Assessment Pipeline for Bioinformatics Packages

We prototyped a set of tools to determine if attackers have already inserted a malicious package or compromised an existing Bioconda package. We hope these tools will benefit the Bioconda maintainers and also companies and researchers with an interest in the security of the bioinformatics software supply chain. This work is part of a larger effort at IQT Labs to address secure code reuse.

Building on the work of a Georgia Tech software security dissertation, we created an approach with three components:

  1. Searching Bioconda recipes and BioContainers Dockerfiles for exfiltration commands.
  2. Using static analysis to scan Python-based bioinformatics repositories.
  3. Employing dynamic analysis to identify system calls during Bioconda package installs, BioContainers Dockerfile builds, and bioinformatics pipeline runs.

We used Dask to distribute the tasks to a cluster. Details and code are available in IQT Labs' secure-bioinformatics-reuse GitHub repository. We encourage anyone interested in bioinformatics software security to borrow, build on, and provide feedback on this code and approach.

Metadata Analysis

Commands: ssh, sftp, scp, wget, curl
Bioconda Recipes: 1005561442
BioContainers Dockerfiles: 00216521
Table 1: Occurrence of potential exfiltration commands in Bioconda recipes and BioContainers Dockerfiles

We first searched for commands in Bioconda recipes and BioContainers Dockerfiles that could be used for data exfiltration attacks. Table 1 displays the results.
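A search like this can be sketched in a few lines of Python. The following is only an illustration, not the code used for Table 1: the function names `count_commands` and `count_corpus` are hypothetical, and the token-boundary rule (so that `curl` is not counted inside `libcurl`) is our assumption about what a reasonable match looks like.

```python
import re
from collections import Counter

# Command names from Table 1 that could be used for data exfiltration.
COMMANDS = ("ssh", "sftp", "scp", "wget", "curl")

# Match a command only as a standalone token, so that "curl" is not counted
# inside "libcurl" and "ssh" is not counted inside "openssh-client".
_PATTERN = re.compile(r"(?<![\w.-])(" + "|".join(COMMANDS) + r")(?![\w.-])")


def count_commands(text: str) -> Counter:
    """Count occurrences of each command in one recipe script or Dockerfile."""
    counts = Counter({name: 0 for name in COMMANDS})
    counts.update(_PATTERN.findall(text))
    return counts


def count_corpus(texts) -> Counter:
    """Sum the per-file counts over an iterable of file contents."""
    total = Counter({name: 0 for name in COMMANDS})
    for text in texts:
        total.update(count_commands(text))
    return total
```

In practice the `texts` argument would come from reading every recipe script and Dockerfile in a local checkout. A plain text match like this also counts commands that appear in comments, which is one reason manual inspection of the hits is still needed.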

Inspection of the ssh and scp commands revealed nothing suspicious. The wget commands often obtained software from code repositories such as GitHub or sourceforge.net, or data from reputable sources such as ftp.ncbi.nih.gov. The results were similar for the wget and curl commands in the BioContainers Dockerfiles, except that these commands often obtained code and data from sources other than code repositories. The majority (1,357) of the curl commands in the Bioconda recipes were concentrated in the post-link.sh scripts of Bioconductor recipes, where they appear to obtain data required by the tool. While manual inspection did not reveal anything suspicious, a useful next step would be to evaluate all contacted domains via a threat intelligence service.
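As a starting point for that next step, the contacted domains first have to be collected. The sketch below is a hypothetical helper (`contacted_domains` is not part of the published pipeline) that pulls host names out of lines invoking wget or curl. It assumes URLs are written literally on the same line as the command, so URLs assembled from shell variables or split across continuation lines would be missed.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Literal http(s) or ftp URLs, stopping at whitespace, quotes, or brackets.
_URL = re.compile(r"(?:https?|ftp)://[^\s\"'<>)]+")
# wget or curl as a standalone token (not "libcurl", not "pycurl").
_FETCH = re.compile(r"(?<![\w.-])(?:wget|curl)(?![\w.-])")


def contacted_domains(script_text: str) -> Counter:
    """Count host names in URLs found on lines that invoke wget or curl."""
    domains = Counter()
    for line in script_text.splitlines():
        if not _FETCH.search(line):
            continue  # ignore URLs that are not fetched by wget or curl
        for url in _URL.findall(line):
            host = urlparse(url).hostname  # lowercased host, or None
            if host:
                domains[host] += 1
    return domains
```

The resulting list of unique host names is what one would submit to a threat intelligence service. Which service to use, and how to query it, is left open here.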

Static Analysis

We identified 495 bioinformatics repositories from papers published in the journal Bioinformatics. These repositories were written primarily in Python, enabling us to use the Python static analysis tool Aura. Aura finds indicators (e.g., suspicious code in a setup script) and assigns a score to each indicator. We identified the unique match types (see documentation here) and counted occurrences of each score. The results appear in Figure 1.

Since the scan results are voluminous, we manually inspected a few results, focusing on function calls, SQL injections, and anomalous setup scripts. We found no malicious function calls and only a few SQL injection opportunities. Of more concern, we did find opportunities to execute arbitrary Python code during installation of Luigi and snakePipes, both packages related to building workflows. This vulnerability seems worth noting given the generality of these packages.
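To give a feel for what a static check of a setup script involves, here is a deliberately small sketch built on Python's standard `ast` module. It is not how Aura works internally and it covers far less: the list of call names in `SUSPICIOUS_CALLS` and the function `flag_setup_calls` are our own illustrative assumptions.

```python
import ast

# Illustrative (not exhaustive) set of calls that run code or shell commands.
SUSPICIOUS_CALLS = {
    "exec", "eval", "__import__",
    "os.system", "os.popen",
    "subprocess.run", "subprocess.call", "subprocess.Popen",
    "subprocess.check_call", "subprocess.check_output",
}


def _dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def flag_setup_calls(source: str):
    """Return sorted (line number, call name) pairs for flagged calls."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = _dotted_name(node.func)
            if name in SUSPICIOUS_CALLS:
                findings.append((node.lineno, name))
    return sorted(findings)
```

Because the source is parsed rather than executed, the check is safe to run on untrusted code. It is also easy to evade, for example through aliased imports such as `from os import system`, which is part of why a dedicated scanner and manual review remain necessary.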

Because these repositories were originally included in a prestigious peer-reviewed journal, we suspected it was unlikely that we would find malicious code, and we did not. However, given the limited time available for manual inspection and evident opportunities for attack, future research should run security scans on the long tail of less closely scrutinized bioinformatics packages. Assessing the results was also time-consuming, and our experience highlights the value of projects that make it easier to filter, sort, and view the results of static analysis scans.

Figure 1: Match score by match type produced during an Aura scan of 495 primarily Python git repositories identified in papers from Bioinformatics.

Dynamic Analysis

We used strace, a dynamic analysis tool that monitors running code, to identify system calls during the install of 1,028 Bioconda packages, a build of 969 BioContainers Dockerfiles, and a run of 35 bioinformatics pipelines. We examined the output and concluded that we should focus on executed files and IP addresses. We did not find any executed files of concern.
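The sketch below shows one way to reduce strace output to those two signals. It assumes the trace was captured with something like `strace -f -e trace=execve,connect`, which is our assumption rather than the documented invocation, and it handles only IPv4 `connect` calls. The function name `summarize_strace` is hypothetical.

```python
import re
from collections import Counter

# execve("/path/to/binary", [...], ...) -> capture the path being executed.
_EXECVE = re.compile(r'\bexecve\("([^"]+)"')
# connect(fd, {sa_family=AF_INET, ..., sin_addr=inet_addr("a.b.c.d")}, len)
_CONNECT_V4 = re.compile(
    r'\bconnect\(\d+, \{sa_family=AF_INET, .*?sin_addr=inet_addr\("([0-9.]+)"\)'
)


def summarize_strace(lines):
    """Return (set of exec'd paths, Counter of IPv4 connect targets).

    Failed calls are included, since an attempted exec or connection
    can be as informative as a successful one.
    """
    executed = set()
    ip_counts = Counter()
    for line in lines:
        match = _EXECVE.search(line)
        if match:
            executed.add(match.group(1))
            continue
        match = _CONNECT_V4.search(line)
        if match:
            ip_counts[match.group(1)] += 1
    return executed, ip_counts
```

Connections to Unix domain sockets (`AF_UNIX`) are ignored on purpose, because they never leave the machine.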

We did, however, identify 61 IP addresses to which a connection was made during the installations. Figure 2 shows the 20 least and most frequently occurring IP addresses. We treated the least frequently occurring public addresses as the most suspicious. However, assessing the security risk of connections to these IP addresses proved difficult. In some cases, the IP address corresponded to a content delivery network, and so the ultimate endpoint was not readily identified. Future versions of a bioinformatics security pipeline will need a method for assessing IP addresses.
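The first triage step, keeping only public addresses and ranking them from rarest to most common, can be sketched with Python's standard `ipaddress` module. The helper name `rarest_public_ips` and the choice of `is_global` as the definition of "public" are assumptions made for illustration.

```python
import ipaddress
from collections import Counter


def rarest_public_ips(ip_counts, n=20):
    """Return up to n (ip, count) pairs for public IPs, rarest first.

    Private, loopback, and other non-global addresses are dropped.
    Ties are broken by the address string so the order is stable.
    """
    public = [
        (ip, count)
        for ip, count in ip_counts.items()
        if ipaddress.ip_address(ip).is_global
    ]
    return sorted(public, key=lambda item: (item[1], item[0]))[:n]


# Usage with made-up counts:
counts = Counter({"8.8.8.8": 5, "10.0.0.5": 1, "1.1.1.1": 2, "127.0.0.1": 3})
shortlist = rarest_public_ips(counts, n=2)
```

This only narrows the list. It does nothing to resolve the content delivery network problem described above, where the address seen on the wire does not identify the ultimate endpoint.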

Figure 2: Count of occurrences of IP addresses identified during an install of 1,028 Bioconda packages, a build of 969 BioContainers Dockerfiles, and a run of 35 bioinformatics pipelines: (a) 20 least frequently occurring, and (b) 20 most frequently occurring IP addresses.

Trust and Verify: No Malicious Bioinformatics Packages Found, For Now.

What did we learn? Users of bioinformatics packages place their trust, even if implicitly, in the people and organizations that write the code. Simple attacks on the Bioconda build and Conda install process are theoretically possible (and have been observed elsewhere in the Python ecosystem) and could lead to the loss of sensitive clinical or proprietary commercial data. While our pilot security assessment pipeline produced no suspicious results, future researchers will have to find ways of reducing false positives, sifting more easily through the reams of data, and handling assessment of IP addresses.

This security assessment of bioinformatics packages should be viewed as preliminary and only a first step. Others should consider building upon this work to help ensure the future security of bioinformatics software. Please email jmeyers@iqt.org if you have further interest in this topic and would like to discuss this research or associated tools. You can also find the code used for the security analysis in IQT Labs' secure-bioinformatics-reuse GitHub repository.

Further Reading

"Bewear! Python Typosquatting Is About More Than Typos," IQT Blog, September 2020.

"Breaking Trust: Shades of Crisis across an Insecure Software Supply Chain," The Atlantic Council, July 2020.

"Bioviz Under the Microscope," Part 1 and 2, IQT Labs Human-Machine Interfaces Team Blog, February 2020.

Carnogursky, Martin, "Attacks on Package Managers," Bachelor's Thesis, Masaryk University, May 2019.

Geer, Dan, Bentz Tozer, and John Speed Meyers, "Counting Broken Links: A Quant's View of Software Supply Chain Security," USENIX ;login:, December 2020.

Ohm, Marc, Henrik Plate, Arnold Sykosch, and Michael Meier, "Backstabber's Knife Collection: A Review of Software Supply Chain Attacks," arXiv, May 2020.

Acknowledgements

Thank you to Luke Berndt, Michael Chadwick, Zigfried Hampel-Arias, and Adam Van Etten for thoughtful review and critique.