NEURAL NETWORKS IN BIOINFORMATICS – BASECALLING

Sequencing is a technique for reading the order of nucleotides in a DNA molecule under study. But how does it work in practice? How is the information contained in biological material converted into huge databases? It is largely thanks to the statistical analyses carried out in bioinformatics. Let's take a closer look at one of them: what is basecalling?

Sequencing chromatogram

Nanopore sequencing involves passing a properly prepared DNA strand through nanopores, tiny channels embedded in a membrane. A voltage is applied across the membrane, which causes a current to flow through the pores. As nucleotides pass through a pore they change its resistance, and these changes show up as fluctuations in the current. The fluctuations are recorded as a signal trace, referred to here as a chromatogram, which is the raw result of nanopore sequencing. The next stage of the analysis is to interpret this chromatogram correctly, so that the sequence of the tested DNA can be read. This is where statistical bioinformatics methods come in handy.

What is basecalling?

Basecalling is the process of translating a chromatogram into the corresponding DNA sequence. It is not a simple task, however. There is no universal, fixed relationship between the value of the current and a specific nucleotide. For example, it cannot be stated that a current of 1 pA always means adenine and a current of 2 pA always means cytosine.
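
The point above can be sketched in a few lines of Python. This is a toy illustration only: the current levels are invented, and the sketch assumes that the signal at any moment reflects a short window of neighbouring bases (three here) rather than a single base. The table `TOY_LEVELS_PA` and both helper functions are hypothetical names introduced for this example.

```python
# Toy illustration only: these current levels are invented, not measured.
# Assumption: the signal depends on a window of 3 bases, not on one base.
TOY_LEVELS_PA = {
    "AAA": 82.0, "AAC": 75.5, "CAG": 91.2, "TAT": 68.4,
    "GCG": 82.3, "TCA": 75.1,
}


def levels_for_center_base(base):
    """Return the toy levels of every 3-mer whose middle base is `base`."""
    return sorted(
        level for kmer, level in TOY_LEVELS_PA.items() if kmer[1] == base
    )


def bases_near_level(level_pa, tolerance_pa=0.5):
    """Return the middle bases of all 3-mers whose level is within tolerance."""
    return sorted(
        {
            kmer[1]
            for kmer, level in TOY_LEVELS_PA.items()
            if abs(level - level_pa) <= tolerance_pa
        }
    )


# The same base (A) appears at several different current levels...
print(levels_for_center_base("A"))
# ...and one current level is compatible with more than one base.
print(bases_near_level(82.0))
```

In this toy table, adenine in the middle position shows up at four different levels, while a reading of about 82 pA fits both adenine and cytosine, so a simple lookup from current to base cannot work.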

These difficulties are due to three main factors:

  1. The measured currents are very low, and both the differences between individual readings and the overall amplitude of the signal are relatively small.
  2. Nucleotides pass through the pores very quickly, which can reduce the sensitivity with which changes in resistance are detected.
  3. The speed at which the strand moves through the pore also varies when parts of the sequence are repeated. This is known as DNA strand slippage. It makes it very difficult to determine the exact number of nucleotides in long repeats: it is hard to tell whether a given read is, for example, AAAA or AAAAA.
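
The third factor can be illustrated with a small calculation. The sketch below assumes that within a run of identical bases the current stays roughly flat, so the main clue to the length of the run is how long the flat stretch lasts. The function name and the sample counts are hypothetical and chosen only for illustration.

```python
import math


def plausible_run_lengths(total_samples, min_per_base, max_per_base):
    """List every repeat count consistent with the duration of a flat stretch.

    total_samples: number of signal samples in the flat stretch
    min_per_base / max_per_base: assumed fastest and slowest dwell time
    of a single base, in samples (hypothetical values)
    """
    shortest = math.ceil(total_samples / max_per_base)
    longest = math.floor(total_samples / min_per_base)
    return list(range(shortest, longest + 1))


# Constant speed: 36 samples at exactly 9 samples per base is unambiguous.
print(plausible_run_lengths(36, 9, 9))    # [4]

# Variable speed: 40 samples at 8 to 10 samples per base fits 4 or 5 bases,
# which is the AAAA versus AAAAA ambiguity described above.
print(plausible_run_lengths(40, 8, 10))   # [4, 5]
```

Even a modest spread in speed is enough to make two neighbouring repeat counts equally plausible, and the range widens as the repeat gets longer.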

That is why advanced machine learning techniques are used, which can pick up recurring mathematical patterns in the processed data. The general name for this type of model is a neural network.

Neural networks

Oxford Nanopore Technologies (ONT), the developer of the nanopore sequencing method and supplier of tools based on it, also provides basecalling software adapted to the long reads produced by nanopore sequencers: Guppy Basecaller. genXone carries out its analyses using this tool.

Guppy is a tool that uses neural networks, i.e. algorithms built as a graph of interconnected units whose structure is inspired by the human brain. In its basic form, a neural network is a supervised learning algorithm: it is trained on a prepared data set together with the expected results for that data (machine learning).
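
To make the idea of supervised learning concrete, here is a minimal sketch of the smallest possible "network": a single artificial neuron trained on a prepared data set with known expected results. It is not how Guppy is built (its models are far larger); the signal values, labels, and function names are invented for illustration.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(samples, labels, learning_rate=0.5, epochs=200):
    """Fit one weight and one bias by stochastic gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            prediction = sigmoid(w * x + b)
            error = prediction - y  # gradient of the cross-entropy loss
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


def predict(w, b, x):
    """Return class 1 if the neuron's output is at least 0.5, else class 0."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Prepared data set: toy signal values with their expected results (labels).
samples = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 1, 1, 1]

w, b = train_neuron(samples, labels)
print([predict(w, b, x) for x in samples])
```

After training, the neuron reproduces the expected labels for the training data. A basecalling model follows the same principle, but learns from very many signal traces paired with known sequences.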

Guppy Basecaller is delivered as a tool together with models matched to the sequencing parameters and trained on a very large number of sequencing runs. These models are constantly evolving, and it is hoped that their accuracy will improve further over time. Because the analyses are so demanding, Guppy can already make use of additional graphics cards that support NVIDIA CUDA, which allow data to be processed several or even several hundred times faster. Thanks to this efficiency, basecalling can be performed in real time, while sequencing is still in progress, so the final result is available much sooner.
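
As a rough sketch of what a GPU-enabled run might look like, the snippet below only assembles a `guppy_basecaller` command line and prints it; it does not execute anything. The directory names are hypothetical, the helper `build_guppy_command` is introduced for this example, and the options and configuration file name should be treated as assumptions to check against the documentation of the installed Guppy version.

```python
import shlex


def build_guppy_command(input_dir, save_dir, config, device="cuda:0"):
    """Assemble an argument list for guppy_basecaller (does not run it)."""
    return [
        "guppy_basecaller",
        "--input_path", input_dir,   # folder with raw signal files
        "--save_path", save_dir,     # folder for the basecalled output
        "--config", config,          # model matched to the sequencing setup
        "--device", device,          # which CUDA graphics card to use
    ]


# Hypothetical paths; the config name is an assumed example of a model file.
command = build_guppy_command(
    "raw_reads/", "basecalled/", "dna_r9.4.1_450bps_hac.cfg"
)
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run` on a machine where Guppy and a CUDA-capable graphics card are available.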