SNAKEMAKE – WORKFLOW MANAGEMENT SYSTEM

Analyzing large amounts of data often means running the same specific tasks and scripts over and over. The possibilities offered by workflow management systems are therefore invaluable in bioinformatics work. This article describes how the Snakemake system works.

What are workflow management systems?

Workflow management systems allow you to automate processes and create repeatable, scalable data analyses. They provide the infrastructure to configure, run, and monitor a defined sequence of tasks. With their help, we can create ready-made rule schemes that make it easy to start selected processes at any time.

Snakemake – is it worth it?

An example of a workflow management system that is often used in bioinformatics departments is Snakemake. It is distinguished primarily by:

  • intuitive use based on rules and the dependencies between them (input/output),
  • readable code, based on the Python scripting language,
  • support for parallel computing on multiple servers (e.g. Slurm, Kubernetes) and in the cloud (e.g. Google Cloud),
  • integration with the Conda package manager (rules can declare the dependencies of their execution environment),
  • active development (developer support).
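To illustrate the Conda integration mentioned above, a rule can point at its own environment file, which Snakemake creates and activates when invoked with the real "--use-conda" flag. This is only a sketch: the rule name, file paths, environment file, and the bwa/samtools commands are illustrative, not taken from the article's example.

```python
# Hypothetical Snakefile fragment: a rule with a per-rule Conda environment.
rule map_reads:
    input:
        "reads.fastq"
    output:
        "mapped.bam"
    conda:
        "envs/mapping.yaml"   # illustrative path; built by Snakemake with --use-conda
    shell:
        "bwa mem ref.fa {input} | samtools sort -o {output}"
```

The "envs/mapping.yaml" file would list the packages (here, presumably bwa and samtools) so the rule runs in an isolated, reproducible environment.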

How does the Snakemake workflow management system work?

Everything is based on rules that describe how specific output files are created from input files by carrying out specific actions.

An example of a simple rule:

rule example:
    input:
        infile = "input_{custom}.tsv"
    output:
        outfile = "output_{custom}.tsv"
    shell:
        """
        cat "{input.infile}" | process.sh > "{output.outfile}"
        """

Invoking such a rule is done by specifying the target file, in this case:

snakemake output_genXone.tsv -n

Snakemake will check whether any of the available rules matches the requested target file. In our rule "example", the input and output file names contain a so-called wildcard, written in curly braces. This mechanism assigns to the wildcard variable, here called "custom", whatever string appears between "input_"/"output_" and ".tsv". In the call above, the string "genXone" is assigned to the "custom" wildcard.
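Under the hood, wildcard matching is essentially pattern matching with regular expressions. The following is a simplified Python sketch of the idea, not Snakemake's actual implementation: each {name} placeholder becomes a named regex group, and matching the target file name against the pattern yields the wildcard values.

```python
import re

def match_wildcards(pattern, target):
    """Return a dict of wildcard values if `target` matches `pattern`, else None."""
    # Split the pattern on {name} placeholders; odd elements are wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)          # literal text between wildcards
        else:
            regex += f"(?P<{part}>.+)"        # one named group per wildcard
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None

# The article's example: "genXone" is captured by the {custom} wildcard.
print(match_wildcards("output_{custom}.tsv", "output_genXone.tsv"))
# → {'custom': 'genXone'}
```

A target that does not fit the pattern (e.g. "input_genXone.tsv" against "output_{custom}.tsv") simply yields None, which mirrors Snakemake rejecting rules whose output pattern does not match the requested file.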

Moving on: to create the file "output_genXone.tsv", we need the input file "input_genXone.tsv". If that file is available, the rule can run, and Snakemake will tell us so. Thanks to the "-n" switch (a dry run), we can preview in text form which rules would be run and the dependencies between them, without actually executing anything.

If creating the file requires running other dependent rules, Snakemake will also inform us about it.
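As a sketch of such a dependency chain (rule names, file patterns, and shell commands here are illustrative, not from the article's example), the output of one rule can serve as the input of another:

```python
# Hypothetical Snakefile fragment: summarize depends on filter's output.
rule filter:
    input:
        "raw_{custom}.tsv"
    output:
        "filtered_{custom}.tsv"
    shell:
        "grep -v '^#' {input} > {output}"

rule summarize:
    input:
        "filtered_{custom}.tsv"
    output:
        "summary_{custom}.tsv"
    shell:
        "cut -f1,2 {input} > {output}"
```

Requesting "summary_genXone.tsv" makes Snakemake notice that "filtered_genXone.tsv" is missing, so it schedules the "filter" rule first, then "summarize".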

We can create a visual dependency graph with the command:

snakemake output_genXone.tsv --dag | dot -Tsvg > dag.svg

Invoking Snakemake without the “-n” switch will start the job pipeline and produce the expected files.

After configuring Slurm (the Slurm Workload Manager) to work with Snakemake, the same rule can be called with the "ssnakemake" command (double "s").

From Make to Snakemake

Snakemake is based on the GNU Make paradigm: workflows are defined by rules that describe how to create output files from input files. The dependencies between the rules are determined automatically, creating a DAG (directed acyclic graph) of tasks that can be automatically parallelized.

Workflow management systems in bioinformatics

Various workflow management systems are used in bioinformatics analyses. In addition to Snakemake there are, for example, Nextflow, which uses software containers, or Python Celery, which easily processes large numbers of messages. All of them allow structuring cyclical analyses (pipelines), significantly accelerating the work of bioinformaticians. Thanks to them, we do not have to wait for each rule to finish before starting the next one: the chosen system does it for us. Thanks to its grounding in Python and its simplicity of operation, Snakemake is easy to pick up, and its support for parallel computing and cloud solutions makes it one of the most widely used workflow management systems in bioinformatics and beyond.