Unleashing genomic insights with AB learning: A self-supervised whole-genome language model
Naidenov, Bryan
Abstract
The language of genetic code embodies a complex grammar and rich syntax of interacting molecular elements. In this regard, the standard additive marker encoding scheme's inability to adequately represent context-dependent genetic information and variable-length base-pair qualities limits its applicability in describing higher-order genomic phenomena. While there have been numerous efforts to accommodate this information more holistically, no single encoding regimen has been able to fully incorporate the vast genomic corpus. Recent advances in self-supervision and feature learning suggest that statistical learning techniques can identify high-quality quantitative representations from unprocessed data. This dissertation, inspired by innovations in natural language modeling, introduces a reimagined genomic representation framework that leverages the contrastive characteristics found in the genomes of natural populations to resolve the genomic latent space. In this study, we first considered a gene-based language model that generates whole-genome vector representations from a population of 16 disease-causing bacterial species (Chapter 5). To achieve this, we developed a set-based learning objective, AB learning, that compares the annotated gene content of two population subsets for use in optimization. Using this foundational objective, we trained a deep-learning Transformer model to backpropagate information into dense genome vector representations. The resulting bacterial representations, or embeddings, captured important population structure characteristics, such as delineations across serotypes and host-specificity preferences. Their vector quantities encoded the relevant functional information necessary to achieve state-of-the-art genomic prediction accuracy in 11 out of 12 antibiotic resistance phenotypes.
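To make the set-based comparison concrete, the sketch below illustrates one way two population subsets could be compared by their aggregate annotated gene content. The gene identifiers and the Jaccard-overlap scoring are hypothetical stand-ins chosen for illustration; they are not the dissertation's actual AB learning objective, which is defined in Chapter 5.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two gene-content sets."""
    return len(a & b) / len(a | b)

def ab_overlap_score(subset_a: list, subset_b: list) -> float:
    """Toy set-based comparison of the pooled annotated gene content
    of two population subsets (an illustrative stand-in for the
    AB learning objective, not its actual formulation)."""
    genes_a = set().union(*subset_a)  # pool gene sets across subset A
    genes_b = set().union(*subset_b)  # pool gene sets across subset B
    return jaccard(genes_a, genes_b)

# Hypothetical genomes, each represented by its set of annotated genes
g1 = {"gyrA", "parC", "blaTEM"}
g2 = {"gyrA", "parC", "tetA"}

score = ab_overlap_score([g1], [g2])  # 2 shared genes of 4 total -> 0.5
```

In a training loop, a score of this kind would be compared against the model's embedding-space similarity for the same two subsets, giving the Transformer a self-supervised signal without phenotype labels.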
In Chapter 6, we expanded our Transformer model to encode variation from genomic k-mers, eliminating the need to annotate the population prior to training. In a lodgepole pine study system, we demonstrated that the resulting genome embeddings form clusters that reflect the family structure of the population, and that pairwise embedding distances can reconstruct the pedigree. In supervised genomic prediction tasks, the pine embeddings outperformed previous marker-based models. Additionally, we showed that genomic k-mers can be jointly embedded during training, facilitating the identification of k-mer token association clusters for important agronomic traits.
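The annotation-free tokenization described above can be sketched minimally: a genome sequence is decomposed into overlapping fixed-length substrings that serve as the model's token vocabulary. The function name and the choice of k here are illustrative assumptions, not the dissertation's exact tokenizer.

```python
def kmer_tokens(seq: str, k: int = 4) -> list:
    """Slide a window of length k along a sequence to produce
    overlapping k-mer tokens, requiring no gene annotation."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

tokens = kmer_tokens("ATGCGT", k=4)  # ['ATGC', 'TGCG', 'GCGT']
```

Because every genome maps to a sequence of such tokens, the same Transformer machinery used for annotated genes applies directly, and the learned token embeddings can be clustered to surface k-mer associations with traits.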