Analyzing large amounts of data often means running the same specific tasks and scripts over and over. This is why the possibilities offered by workflow management systems are invaluable in bioinformatics work. This article describes how the Snakemake system works.
What are workflow management systems?
Workflow management systems allow you to automate processes and create repeatable, scalable data analyses. They provide the ideal infrastructure to configure, run, and monitor a defined sequence of tasks. With their help, we can create ready-made rule schemes that make it easy to launch selected processes at any time.
Snakemake – is it worth it?
An example of a workflow management system that is often used in bioinformatics departments is Snakemake. It is distinguished primarily by:
- intuitive use based on rules and dependencies between them (input/output),
- readable code, based on the Python scripting language,
- support for parallel computing on multiple servers (SLURM, Kubernetes) and in the cloud (e.g. Google Cloud),
- integration with the Conda environment manager (the ability to declare dependencies for the environment in which a rule runs; see the sketch after this list),
- active development (developer support).
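As an illustration of the Conda integration, a rule can declare its own environment file. A minimal sketch, assuming a hypothetical environment definition “envs/samtools.yaml” that provides samtools:
# Hypothetical rule with its own per-rule Conda environment
rule sort_bam:
    input:
        "mapped.bam"
    output:
        "sorted.bam"
    conda:
        "envs/samtools.yaml"
    shell:
        "samtools sort -o {output} {input}"
The declared environment is created and activated automatically when Snakemake is invoked with the “--use-conda” option.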
How does the Snakemake workflow management system work?
Everything is based on rules that describe how specific output files are created by carrying out specific actions.
An example of a simple rule:
rule example:
    input:
        infile = "input_{custom}.tsv"
    output:
        outfile = "output_{custom}.tsv"
    shell: """
        cat "{input.infile}" | process.sh > "{output.outfile}"
    """
Invoking such a rule is done by specifying the target file, in this case:
snakemake output_genXone.tsv -n
Snakemake will check whether any of the available rules match the requested target file. In our rule “example”, the input and output file names contain so-called wildcards, marked with curly brackets. A wildcard, in this case named “custom”, captures any string that appears between “input_”/“output_” and “.tsv”. In the call above, the string “genXone” is therefore assigned to the “custom” wildcard.
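The captured value can also be reused inside the rule body through the wildcards object. A minimal, hypothetical sketch (the rule name and file pattern are illustrative only):
# Hypothetical rule: {wildcards.custom} expands to the captured string,
# e.g. "genXone" when requesting report_genXone.txt
rule report:
    output:
        "report_{custom}.txt"
    shell:
        "echo 'Sample: {wildcards.custom}' > {output}"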
Moving on: to create the file “output_genXone.tsv”, we need the input file “input_genXone.tsv”. If such a file is available, the rule can run, and Snakemake will tell us so. Thanks to the “-n” switch (a dry run), we can preview in text form which rules would be run and what the dependencies between them are, without executing the rules themselves.
If creating the file requires running other dependent rules, Snakemake will also inform us about it.
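For example, if the input file “input_{custom}.tsv” were itself produced by another rule, Snakemake would schedule that rule first. A minimal sketch, assuming a hypothetical preprocessing script “prepare.sh” and raw files named “raw_{custom}.csv”:
# Hypothetical upstream rule that produces the input required by rule "example"
rule prepare:
    input:
        raw = "raw_{custom}.csv"
    output:
        prepared = "input_{custom}.tsv"
    shell:
        'prepare.sh "{input.raw}" > "{output.prepared}"'
Requesting “output_genXone.tsv” would then run “prepare” followed by “example” automatically.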
We can create a visual dependency graph with the command:
snakemake output_genXone.tsv --dag | dot -Tsvg > dag.svg
Invoking Snakemake without the “-n” switch will start the job pipeline and produce the expected files.
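Depending on the Snakemake version, the number of available CPU cores may also have to be stated explicitly; a typical call could look like this (the core count is only an example):
snakemake output_genXone.tsv --cores 4
With more than one core, independent jobs in the dependency graph are executed in parallel.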
After configuring SLURM (Slurm Workload Manager) to work with the Snakemake system, the same rule can be called with the “ssnakemake” directive (double “s”).
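“ssnakemake” appears to be a local wrapper rather than part of Snakemake itself; a generic equivalent using Snakemake’s built-in cluster support might look like this (the sbatch resource values are only placeholders):
snakemake output_genXone.tsv --jobs 10 --cluster "sbatch --cpus-per-task=1 --mem=4G"
Here “--jobs” limits how many cluster jobs may run at the same time, and “--cluster” tells Snakemake how to submit each job.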
From Make to Snakemake
Snakemake is based on the GNU Make paradigm: workflows are defined by rules that describe how to create output files from input files. The dependencies between the rules are determined automatically, creating a DAG (Directed Acyclic Graph) of tasks that can be parallelized automatically.
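To illustrate, a hypothetical target rule that requests the outputs for several samples at once lets Snakemake build the DAG itself and run the independent branches in parallel (the sample names are illustrative):
# Hypothetical list of samples; each output_{custom}.tsv is independent,
# so the corresponding jobs can run in parallel
SAMPLES = ["genXone", "sampleA", "sampleB"]

rule all:
    input:
        expand("output_{custom}.tsv", custom=SAMPLES)
Running “snakemake --cores 3” would then build all three files, executing independent jobs side by side.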
Workflow management systems in bioinformatics
Various workflow management systems are used in bioinformatics analyses. In addition to Snakemake there are, for example, Nextflow, which relies on software containers, or Python Celery, which easily processes large numbers of messages. All of them make it possible to structure recurring analyses (pipelines), significantly accelerating the work of bioinformaticians. Thanks to them, we do not have to wait for each rule to finish before starting the next one: the chosen system does it for us. Thanks to its compatibility with the Python language and its simplicity of use, Snakemake is easy to pick up, and its support for parallel computing and cloud solutions makes it one of the most widely used workflow management systems in bioinformatics and beyond.