The application of deep learning for the classification of correct and incorrect SNP genotypes from whole-genome DNA sequencing pipelines
Krzysztof Kotlarz, Magda Mielczarek, Tomasz Suchocki, Bartosz Czech, B. Guldbrandtsen, Joanna Szyda
Abstract
A downside of next-generation sequencing (NGS) technology is its high technical error rate. We built a tool that uses array-based genotype information to classify NGS-based SNPs into correct and incorrect calls. The deep learning algorithms were implemented via Keras. Several algorithms were tested: (i) the basic, naïve algorithm, (ii) the naïve algorithm modified by pre-imposing different weights on the incorrect and correct SNP classes when calculating the loss metric, and (iii)–(v) the naïve algorithm modified by random re-sampling (with replacement) of the incorrect SNPs to match 30%/60%/100% of the number of correct SNPs. The training data set was composed of data from three bulls and consisted of 2,227,995 correct (97.94%) and 46,920 incorrect SNPs, while the validation data set consisted of data from one bull with 749,506 correct (98.05%) and 14,908 incorrect SNPs. The results showed that for a rare event classification problem, such as incorrect SNP detection in NGS data, the most parsimonious naïve model and the model with weighted SNP classes provided the best classification of the validation data set. Both classified 19% of truly incorrect SNPs as incorrect and 99% of truly correct SNPs as correct, and both resulted in an F1 score of 0.21, the highest among the compared algorithms. We conclude that the basic models were less adapted to the specifics of the training data set and therefore classified the independent validation data set better than the other tested models.
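The class weighting in (ii), the re-sampling in (iii)–(v) and the F1 score can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the label coding (1 = incorrect SNP, 0 = correct SNP), the inverse-frequency weighting rule and the toy data are all assumptions made for illustration, since the abstract gives neither the weights used nor the details of the re-sampling.

```python
import numpy as np


def oversample_minority(labels, ratio, rng):
    """Return shuffled row indices in which the minority class (label 1)
    is drawn with replacement until it numbers `ratio` times the majority
    class (label 0). Assumption: the drawn sample replaces the original
    minority rows rather than being appended to them."""
    labels = np.asarray(labels)
    majority_idx = np.flatnonzero(labels == 0)
    minority_idx = np.flatnonzero(labels == 1)
    target = int(round(ratio * majority_idx.size))
    resampled = rng.choice(minority_idx, size=target, replace=True)
    idx = np.concatenate([majority_idx, resampled])
    rng.shuffle(idx)
    return idx


def inverse_frequency_weights(labels):
    """Class weights inversely proportional to class frequency.
    Assumption: a common heuristic, not the weights used in the paper."""
    labels = np.asarray(labels)
    counts = np.bincount(labels, minlength=2)
    return {cls: float(labels.size / (2.0 * counts[cls])) for cls in (0, 1)}


def f1_score(y_true, y_pred):
    """F1 score for the positive class (label 1, the incorrect SNPs)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    rng = np.random.default_rng(seed=0)
    # Toy labels with roughly 2% incorrect SNPs, similar in spirit to the
    # class imbalance reported for the training data.
    toy_labels = (rng.random(10_000) < 0.02).astype(int)

    for ratio in (0.3, 0.6, 1.0):
        idx = oversample_minority(toy_labels, ratio, rng)
        n_incorrect = int(toy_labels[idx].sum())
        print(f"ratio={ratio}: {n_incorrect} incorrect vs "
              f"{idx.size - n_incorrect} correct SNPs")

    print("class weights:", inverse_frequency_weights(toy_labels))
```

With Keras, a dictionary like the one returned by `inverse_frequency_weights` could be passed to the `class_weight` argument of `Model.fit`, and the index array from `oversample_minority` could be used to select rows of the feature matrix before training.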
|Journal series||Journal of Applied Genetics, ISSN 1234-1983, e-ISSN 2190-3883, (N/A 100 pts)|
|Publication size in sheets||0.5|
|Keywords in English||Classification, Keras, Next-generation sequencing, Python, SNP calling, SNP microarray, TensorFlow|
|License||Journal (articles only); published final; ; with publication|
|Score||= 100.0, 11-02-2021, ArticleFromJournal|
|Publication indicators||= 1; = 0; : 2018 = 0.689; : 2019 = 2.027 (2) - 2019=1.954 (5)|
* The presented citation count is obtained through Internet information analysis and is close to the number calculated by the Publish or Perish system.