The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) is a scalable and interoperable resource for the genomic scientific community that leverages cloud-based infrastructure to democratize access to, sharing of, and computing across large genomic and genomic-related data sets. The AnVIL will facilitate integration and computing on and across large data sets generated by NHGRI programs, as well as by initiatives funded by the National Institutes of Health (NIH) or by other agencies that support human genomics research.

In addition, the AnVIL will be a component of the emerging NIH Data Commons, and is expected to collaborate and integrate with other genomic data resources through the adoption of the FAIR (Findable, Accessible, Interoperable, Reusable) principles, as their specifications emerge from the scientific community. The AnVIL will provide a collaborative environment where datasets and analysis workflows can be shared within a consortium, and be prepared for public release to the broad scientific community through AnVIL user interfaces.



The DataSTAGE (Storage, Toolspace, Access and analytics for biG data Empowerment) project aims to create a community of practice that is motivated to collaboratively solve technical challenges to enable NHLBI investigators to find, access, share, store, cross-link, and compute on large-scale data sets. Though the primary goal of the DataSTAGE Consortium is to build a data science platform, at its core this is a people-centric endeavor.

For NHLBI research investigators who need to find, access, share, store, cross-link, and compute on large-scale data sets, NHLBI DataSTAGE will serve as a cloud-based platform providing tools, applications, and workflows to enable these capabilities in secure workspaces. DataSTAGE is a rationally organized digital environment that will accelerate efficient biomedical research and maximize community engagement and productivity through increased access to NHLBI data sets and innovative data analysis capabilities. By making these data sets accessible and usable to varied users, DataSTAGE will drive discovery and scientific advancement, leading to novel diagnostic tools, therapeutic options, and prevention strategies for heart, lung, blood, and sleep disorders.


The Global Alliance for Genomics and Health (GA4GH) is an international effort to promote, foster, and standardize the secure, ethical, privacy-preserving sharing of genomic information for the betterment of global health outcomes. Many organizations are involved, all recognizing the importance of its mission; however, it is largely a volunteer effort. To help ensure that it can meet its technical goals, we are contributing engineering effort to a GA4GH integration group tasked with defining and implementing the necessary software standards.

Key to this is the creation of genomics Application Programming Interfaces (APIs) that allow genomic information to be exchanged across multiple organizations. The GA4GH standard is a freely available open standard for interoperability, which uses common web protocols to support the serving and sharing of data about nucleic acid sequences, variation, expression, and other forms of genomic data. The API is implemented as a web service, creating a data source that can be integrated downstream into visualization software or web-based genomics portals, or processed as part of genomic analysis pipelines. Its goal is to overcome the barriers of incompatible infrastructure between organizations and institutions so that DNA data providers and consumers can better share genomic data and work together on a global scale, advancing genome research and clinical application.
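As a rough illustration of what "common web protocols" can mean in practice, the sketch below builds a JSON-over-HTTP search request for variants in a genomic region and parses one page of a response, using only the Python standard library. It is a minimal sketch, not the GA4GH specification: the server URL, the `/variants/search` path, the request field names (`variantSetId`, `referenceName`, `start`, `end`, `pageToken`), and the response fields (`variants`, `nextPageToken`) are all assumptions chosen for illustration, as are the helper names `build_variant_search` and `parse_variant_page`.

```python
import json
import urllib.request

# Hypothetical server address, used only for illustration.
BASE_URL = "https://ga4gh.example.org"


def build_variant_search(base_url, variant_set_id, reference_name,
                         start, end, page_token=None):
    """Build (but do not send) a JSON POST request for variants in a region.

    The path and field names are assumptions, not taken from the standard.
    """
    body = {
        "variantSetId": variant_set_id,
        "referenceName": reference_name,
        "start": start,
        "end": end,
    }
    if page_token is not None:
        # Ask for the page that follows a previous response.
        body["pageToken"] = page_token
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/variants/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_variant_page(payload):
    """Split a JSON response body into (variants, next_page_token)."""
    doc = json.loads(payload)
    return doc.get("variants", []), doc.get("nextPageToken")


# Build a request and inspect it without touching the network.
request = build_variant_search(BASE_URL, "example-set", "1", 10000, 20000)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Against a real server, the request could be sent with `urllib.request.urlopen(request)` and the response body passed to `parse_variant_page`, repeating with the returned token until no further page is offered. Because both sides only need HTTP and JSON, a visualization tool, a web portal, or an analysis pipeline could consume the same service without sharing any infrastructure with the data provider.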


Toil is a scalable, efficient, dynamic cross-platform pipeline management system written entirely in Python. Toil is being used by projects like the Treehouse childhood cancer initiative to create portable, scalable, and reproducible analyses.

Toil was used to process over 20,000 RNA-seq samples in under four days using a commercial cloud cluster of 32,000 preemptable compute cores. Toil can also run workflows described in Common Workflow Language and supports a variety of schedulers, such as Mesos, GridEngine, LSF, Parasol, and Slurm. Toil runs in the cloud on Amazon Web Services, Microsoft Azure, and Google Cloud. See the Toil website and Toil GitHub for details.


Dockstore, developed by the Cancer Genome Collaboratory, is an open platform for sharing Docker-based tools described with the Common Workflow Language used by the GA4GH.
