STREAM DATA PROCESSING IN BIOINFORMATICS

Bioinformatic analysis is an inseparable part of studying a DNA sequence, and it is therefore crucial to delivering a sequencing service. Conducting it effectively requires appropriate tools and solutions; one of these solutions is data-stream processing.

What is pipelining?

Data processing pipelines are a way to process information sequentially. Tasks such as sorting and finding duplicates are split into separate blocks connected in series, which makes the order in which the tasks are performed explicit.
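
For instance, two such blocks, sorting and counting duplicates, can be combined as follows (a minimal sketch with a hypothetical names.txt file):

sort names.txt | uniq -c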

What is pipelining for?

Pipelining in the daily work of a bioinformatician is primarily used to save time and increase efficiency. For example, if we want to display only those lines of a file that start with ">" and then sort them, we do not need one command to find the matching lines and save them to a separate file, and a second command to sort that file. Instead, we can search for the lines once and pipe them directly to the sort command.

How do I use pipelines in the console?

In UNIX-like systems we build a pipeline in the console with the "|" character, which is the link between two command blocks. Instead of creating an intermediate file:

grep "^>" file_in.txt > file_tmp.txt

sort file_tmp.txt > file_out.txt

we can pass data in a pipe:

grep "^>" file_in.txt | sort > file_out.txt

This way the whole operation is a single one-line command that clearly shows the data flow, without intermediate files that we would have to remember to delete when they are no longer needed.

A chain of commands connected by pipes can be of any length:

command1 | command2 | command3 | command4 | … | commandX

Advantages of data-stream processing

  • In a pipeline the data stays in fast RAM, so we avoid creating intermediate files; writing them to disk and re-reading them in another program slows processing down because of hard-drive I/O, especially with large volumes of data (see the sketch after this list).
  • Entire scripts and commands become quicker and more convenient to write; they often take only one line of code.
  • It also gives more control over the files that are created and over the data flow, and keeps the code clean and transparent.
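
A rough illustration of the first point (a sketch with hypothetical file names; actual timings depend on the data and the disk): the two-step variant writes an intermediate file to disk and reads it back, while the piped variant passes the data between the two programs in memory.

time ( grep "^>" big_file.txt > file_tmp.txt && sort file_tmp.txt > file_out.txt && rm file_tmp.txt )

time ( grep "^>" big_file.txt | sort > file_out.txt )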

Of course, there are also situations where a data pipeline is not possible, for example when a program needs to read its input data several times. In a pipeline, data is passed to the program only once, as a stream.
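
A pipe can be read only once and cannot be rewound. A familiar example is the ZIP format used for the taxcat archive analysed below: the zip index sits at the end of the archive, so the standard unzip tool needs a seekable file and cannot take the archive from a pipe, while zcat decompresses the stream sequentially as it arrives. In such a case the download has to be saved to a file first (a sketch; the file name is arbitrary):

URL="https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxcat.zip"

curl -s "$URL" -o taxcat.zip && unzip -o taxcat.zip

curl -s "$URL" | zcat | head -n 3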

An example of the use of pipeline processing

As an example, we present an analysis of the taxonomy of entries in the NCBI genomic sequence database. From the file https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxcat_readme.txt we can read that the database stores genomes of organisms belonging to the following categories/kingdoms:

A = Archaea

B = Bacteria

E = Eukaryote

V = Viruses and Viroids

U = Unclassified

O = Other

Let us first look at the data: we download the file with the taxonomic IDs, unpack it and select 5 random lines from among the first 5000.

URL="https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxcat.zip"

curl -s "$URL" | zcat | head -n 5000 | shuf | head -n 5

E 3095 3095

E 8737 8737

B 1066 1066

B 447 66973

E 8420 8420

The data has the form Category SpeciesTaxonomicID TaxonomicID. If SpeciesTaxonomicID equals TaxonomicID, the entry describes a species; otherwise it describes an organism that is a subspecies. For example, the line B 447 66973 refers to Legionella bozemanii serogroup 1, a subspecies of the species Fluoribacter bozemanae.
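
Species-level rows can be picked out directly by comparing the second and third columns, for example with awk (a quick sketch; the article uses an equivalent grep filter further on):

curl -s "$URL" | zcat | awk '$2 == $3' | head -n 5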

We now count how many organisms are taxonomically recorded in each category/kingdom. As before, we download the file and unpack it; then we keep only the first column, sort it, count how many consecutive rows are identical and finally sort numerically in descending order, so the listing starts with the category that has the most organisms. With the tee command we send the result both to the screen and to a file.

curl -s "$URL" | zcat \
| awk '{print $1}' | sort -T /dev/shm | uniq -c | sort -rn \
| tee all_stats.txt

1338805 E

495917 B

203340 V

15530 O

12701 A

958 U

On Linux systems, the -T /dev/shm option makes sort place the temporary files it creates while sorting large inputs in RAM. This reduces disk load and speeds up sorting at the expense of RAM consumption.
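
If the sort step itself ever becomes the bottleneck, the same per-category counts can be computed in a single awk pass without sorting the full file at all; only the six-line summary is sorted at the end (a sketch, not part of the original workflow):

curl -s "$URL" | zcat | awk '{count[$1]++} END {for (c in count) print count[c], c}' | sort -rn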

We now know the combined counts of species and subspecies; in the next step we would like to count how many species are present in each category. For this we can apply the filter grep -e '\s\([0-9]\+\)\s\+\1', which keeps only the lines where the second and third columns are identical.
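
Applied to two of the sample lines shown earlier, the filter keeps the species entry and drops the subspecies entry (the rows are fed in with printf just to check the pattern):

printf 'B 447 66973\nE 3095 3095\n' | grep -e '\s\([0-9]\+\)\s\+\1'

E 3095 3095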

Additionally, in the following command we use tee (stream duplication) together with bash process substitution >( pipeline | … ), so that we perform two calculations at once: the previous one and a new one with the species-only filter. In this way, with large files, several calculations can run at the same time while the large file is read only once.

curl -s "$URL" | zcat \
| tee >( awk '{print $1}' | sort -T /dev/shm | uniq -c | sort -rn \
| tee all_stats.txt >&2 ) \
| grep -e '\s\([0-9]\+\)\s\+\1' \
| awk '{print $1}' | sort -T /dev/shm | uniq -c | sort -rn \
| tee species_stats.txt

1338805 E

495917 B

203340 V

15530 O

12701 A

958 U

1302432 E

451723 B

34725 V

15530 O

12341 A

958 U

The results were also saved to the files all_stats.txt and species_stats.txt. We can observe that the greatest difference between the number of recorded entries and the number of recorded species, that is the largest number of subspecies-level entries, is in the viruses category. The tail -n +1 command is useful here: given several files, it prints the contents of each along with its name.

tail -n +1 *stats.txt

==> all_stats.txt <==

1338805 E

495917 B

203340 V

15530 O

12701 A

958 U

==> species_stats.txt <==

1302432 E

451723 B

34725 V

15530 O

12341 A

958 U
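
The difference can also be computed directly by joining the two result files on the category column and subtracting the species counts from the totals (a sketch; join needs both inputs sorted on the join field, hence the sort -k2 steps):

join -1 2 -2 2 <(sort -k2 all_stats.txt) <(sort -k2 species_stats.txt) | awk '{print $1, $2 - $3}' | sort -k2 -rn

With the numbers above, this puts the viruses at the top with 168615 subspecies-level entries, followed by the bacteria (44194) and the eukaryotes (36373).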