STREAM DATA PROCESSING IN BIOINFORMATICS

Bioinformatic analysis is an inseparable part of studying a DNA sequence, and it is therefore crucial to delivering a sequencing service. Conducting it effectively requires appropriate tools and solutions; one of these solutions is data-stream processing.

What is pipelining?

Data processing pipelines are a way to process information sequentially. Tasks such as sorting and finding duplicates are split into separate blocks connected in series, which makes the order in which the tasks are performed explicit.
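
For instance, two such blocks, sorting and counting duplicates, can be combined as follows (a minimal sketch with a hypothetical names.txt file):

sort names.txt | uniq -c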

What is pipelining for?

Pipelining in the daily work of a bioinformatician is primarily used to save time and increase efficiency. For example, if we want to display only those lines of a file that start with ">" and then sort them, we do not need one command to find the matching lines and save them to a separate file, and a second command to sort that file. Instead, we can search for the lines once and pipe them directly to the sort command.

How do I use pipelines in the console?

In UNIX-like systems we build a pipeline in the console with the "|" character, which is the link between two command blocks. Instead of creating an intermediate file:

grep "^>" file_in.txt > file_tmp.txt

sort file_tmp.txt > file_out.txt

we can pass data in a pipe:

grep "^>" file_in.txt | sort > file_out.txt

This way the whole operation is a single one-line command that clearly shows the data flow, without intermediate files that we would have to remember to delete when they are no longer needed.

A chain of commands connected by pipes can be of any length:

command1 | command2 | command3 | command4 | … | commandX

Advantages of data-stream processing

  • In a pipeline the data stays in fast RAM, so we avoid creating intermediate files; writing them to disk and re-reading them in another program slows processing down because of hard-drive I/O, especially with large volumes of data (see the sketch after this list).
  • Entire scripts and commands become quicker and more convenient to write; they often take only one line of code.
  • It also gives more control over the files that are created and over the data flow, and keeps the code clean and transparent.
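
A rough illustration of the first point (a sketch with hypothetical file names; actual timings depend on the data and the disk): the two-step variant writes an intermediate file to disk and reads it back, while the piped variant passes the data between the two programs in memory.

time ( grep "^>" big_file.txt > file_tmp.txt && sort file_tmp.txt > file_out.txt && rm file_tmp.txt )

time ( grep "^>" big_file.txt | sort > file_out.txt )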

Of course, there are also situations where a data pipeline is not possible, for example when a program needs to read its input data several times. In a pipeline, data is passed to the program only once, as a stream.
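
A pipe can be read only once and cannot be rewound. A familiar example is the ZIP format used for the taxcat archive analysed below: the zip index sits at the end of the archive, so the standard unzip tool needs a seekable file and cannot take the archive from a pipe, while zcat decompresses the stream sequentially as it arrives. In such a case the download has to be saved to a file first (a sketch; the file name is arbitrary):

URL="https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxcat.zip"

curl -s "$URL" -o taxcat.zip && unzip -o taxcat.zip

curl -s "$URL" | zcat | head -n 3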

An example of the use of pipeline processing

As an example, we present an analysis of the taxonomy of entries in the NCBI genomic sequence database. From the file https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxcat_readme.txt we can read that the database stores genomes of organisms belonging to the following categories/kingdoms:

A = Archaea

B = Bacteria

E = Eukaryote

V = Viruses and Viroids

U = Unclassified

O = Other

Let us first look at the data: we download the file with the taxonomic IDs, unpack it and select 5 random lines from among the first 5000.

URL="https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxcat.zip"

curl -s "$URL" | zcat | head -n 5000 | shuf | head -n 5

E 3095 3095

E 8737 8737

B 1066 1066

B 447 66973

E 8420 8420

The data has the form Category SpeciesTaxonomicID TaxonomicID. If SpeciesTaxonomicID equals TaxonomicID, the entry describes a species; otherwise it describes an organism that is a subspecies. For example, the line B 447 66973 refers to Legionella bozemanii serogroup 1, a subspecies of the species Fluoribacter bozemanae.
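
Species-level rows can be picked out directly by comparing the second and third columns, for example with awk (a quick sketch; the article uses an equivalent grep filter further on):

curl -s "$URL" | zcat | awk '$2 == $3' | head -n 5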

We now count how many organisms are taxonomically recorded in each category/kingdom. As before, we download the file and unpack it; then we keep only the first column, sort it, count how many consecutive rows are identical and finally sort numerically in descending order, so the listing starts with the category that has the most organisms. With the tee command we send the result both to the screen and to a file.

curl -s "$URL" | zcat \
| awk '{print $1}' | sort -T /dev/shm | uniq -c | sort -rn \
| tee all_stats.txt

1338805 E

495917 B

203340 V

15530 O

12701 A

958 U

On Linux systems, the -T /dev/shm option makes sort place the temporary files it creates while sorting large inputs in RAM. This reduces disk load and speeds up sorting at the expense of RAM consumption.
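
If the sort step itself ever becomes the bottleneck, the same per-category counts can be computed in a single awk pass without sorting the full file at all; only the six-line summary is sorted at the end (a sketch, not part of the original workflow):

curl -s "$URL" | zcat | awk '{count[$1]++} END {for (c in count) print count[c], c}' | sort -rn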

We now know the combined counts of species and subspecies; in the next step we would like to count how many species are present in each category. For this we can apply the filter grep -e '\s\([0-9]\+\)\s\+\1', which keeps only the lines where the second and third columns are identical.
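
Applied to two of the sample lines shown earlier, the filter keeps the species entry and drops the subspecies entry (the rows are fed in with printf just to check the pattern):

printf 'B 447 66973\nE 3095 3095\n' | grep -e '\s\([0-9]\+\)\s\+\1'

E 3095 3095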

Additionally, in the following command we use tee (stream duplication) together with bash process substitution >( pipeline | … ), so that we perform two calculations at once: the previous one and a new one with the species-only filter. In this way, with large files, several calculations can run at the same time while the large file is read only once.

curl -s "$URL" | zcat \
| tee >( awk '{print $1}' | sort -T /dev/shm | uniq -c | sort -rn \
| tee all_stats.txt >&2 ) \
| grep -e '\s\([0-9]\+\)\s\+\1' \
| awk '{print $1}' | sort -T /dev/shm | uniq -c | sort -rn \
| tee species_stats.txt

1338805 E

495917 B

203340 V

15530 O

12701 A

958 U

1302432 E

451723 B

34725 V

15530 O

12341 A

958 U

The results were also saved to the files all_stats.txt and species_stats.txt. We can observe that the greatest difference between the number of recorded entries and the number of recorded species, that is the largest number of subspecies-level entries, is in the viruses category. The tail -n +1 command is useful here: given several files, it prints the contents of each along with its name.

tail -n +1 *stats.txt

==> all_stats.txt <==

1338805 E

495917 B

203340 V

15530 O

12701 A

958 U

==> species_stats.txt <==

1302432 E

451723 B

34725 V

15530 O

12341 A

958 U
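
The difference can also be computed directly by joining the two result files on the category column and subtracting the species counts from the totals (a sketch; join needs both inputs sorted on the join field, hence the sort -k2 steps):

join -1 2 -2 2 <(sort -k2 all_stats.txt) <(sort -k2 species_stats.txt) | awk '{print $1, $2 - $3}' | sort -k2 -rn

With the numbers above, this puts the viruses at the top with 168615 subspecies-level entries, followed by the bacteria (44194) and the eukaryotes (36373).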