Nextflow is a scientific workflow system predominantly used for bioinformatics data analysis. It establishes standards for programmatically creating a series of dependent computational steps and facilitates their execution on various local and cloud computing resources.
Scientific workflow systems like Nextflow allow an analysis to be formalized as a data analysis pipeline. Pipelines, also known as workflows, specify the order and conditions of computing steps. They are run by special-purpose programs, so-called workflow executors, which ensure predictable and reproducible behavior in various computing environments.
Workflow systems also provide built-in solutions to common challenges of workflow development, such as the application to multiple samples, the validation of input and intermediate results, conditional execution of steps, error handling, and report generation. Advanced features of workflow systems may also include scheduling capabilities, graphical user interfaces for monitoring workflow executions, and the management of dependencies by containerizing the whole workflow or its components.
Scientific workflow systems typically present a steep learning curve at first, because their features and complexities come on top of the actual analysis. However, the standards and abstraction imposed by workflow systems ultimately improve the traceability of analysis steps, which is particularly relevant when collaborating on pipeline development, as is customary in scientific settings.
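In Nextflow, a process starts executing as soon as its declared inputs are available, rather than in a fixed sequence. This reactive implementation is a key design pattern of Nextflow and is also known as the functional dataflow model.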
Processes and entire workflows are programmed in a domain-specific language (DSL) that is provided by Nextflow and based on Apache Groovy. While Nextflow's DSL is used to declare the workflow logic, developers can use their scripting language of choice within a process and mix multiple languages in a workflow. It is also possible to port existing scripts and workflows to Nextflow. Supported scripting languages include bash, csh, ksh, Python, Ruby, and R. Any scripting language that uses the standard Unix shebang declaration (for example, #!/bin/bash) is compatible with Nextflow.
Below is an example of a workflow consisting of only one process:
process hello_world {
    input:
    val greeting

    output:
    path "${greeting}.txt"

    script:
    """
    echo "${greeting} World!" > ${greeting}.txt
    """
}

workflow {
    // Each greeting is emitted on a channel and triggers one hello_world task.
    Channel.of("Hello", "Ciao", "Hola", "Bonjour") | hello_world
}
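The dataflow model and the mixing of scripting languages can be illustrated together with a minimal sketch that chains two processes. The second process, count_chars, is a hypothetical addition that does not appear in the original example. It is written in Python via a shebang line and starts for each file as soon as hello_world has produced it. This is a sketch under those assumptions, not a canonical pipeline.

```nextflow
// First step: a shell script that writes one greeting file per input value.
process hello_world {
    input:
    val greeting

    output:
    path "${greeting}.txt"

    script:
    """
    echo "${greeting} World!" > ${greeting}.txt
    """
}

// Second step (hypothetical): a Python script selected by the shebang line.
// It reads the file produced by hello_world and reports its length.
process count_chars {
    input:
    path greeting_file

    output:
    stdout

    script:
    """
    #!/usr/bin/env python3
    with open("${greeting_file}") as handle:
        text = handle.read().strip()
    print(f"{text}: {len(text)} characters")
    """
}

workflow {
    // The output channel of hello_world is the input channel of count_chars,
    // so each file is processed when it becomes available; view prints results.
    Channel.of("Hello", "Ciao") | hello_world | count_chars | view
}
```

Because tasks run in parallel, the order in which the results are printed is not guaranteed to match the order of the input values.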
To enable easy collaboration on workflows, Nextflow natively supports version control and DevOps platforms including GitHub, GitLab, and others.
Nextflow supports container frameworks such as Docker, Singularity, Charliecloud, Podman, and Shifter. These containers can be automatically retrieved from external repositories when the pipeline is executed. Additionally, it was revealed at Nextflow Summit 2022 that future versions of Nextflow will support a dedicated container provisioning service for better integration of customized containers into workflows.
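As a rough sketch of how container support is typically switched on, the configuration fragment below enables Docker and assigns one image to all processes. The image name is a placeholder chosen for illustration, not one taken from any particular pipeline, and the fragment assumes it is placed in a nextflow.config file next to the workflow script.

```nextflow
// nextflow.config (illustrative fragment)

// Run every process inside a Docker container.
docker.enabled = true

// Default image for all processes; 'ubuntu:22.04' is only a placeholder.
// The image is pulled from the registry if it is not available locally.
process.container = 'ubuntu:22.04'
```

An individual process can declare its own image with the container directive, which allows different steps of one workflow to use different software environments.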
In July 2018, Seqera Labs was launched as a spin-off from the Centre for Genomic Regulation. The company employs many of Nextflow's core developers and maintainers and provides commercial services and consulting with a focus on Nextflow.
In July 2020, a major extension and revision of Nextflow's domain-specific language was introduced to allow for sub-workflows and additional improvements. In the same year, monthly downloads of Nextflow reached approximately 55,000.
One notable use case of Nextflow is pathogen surveillance during the COVID-19 pandemic. Swift and highly automated processing of raw data, variant analysis, and lineage designation were essential for monitoring the emergence of new virus variants and tracing their global spread. Nextflow-enabled pipelines played a crucial role in this effort.
The non-profit plasmid repository Addgene also relies on Nextflow, using it to confirm the integrity of all deposited plasmids.
In addition to genomics, Nextflow is gaining popularity in other domains of biomedical data processing, where complex workflows on large amounts of primary data are required. These domains include drug screening, diffusion magnetic resonance imaging (dMRI) in radiology, and mass spectrometry data processing, the latter with a particular focus on proteomics.