Machine Learning for Novel RNA Targets

Written by TKS Boston student, Mukundh Murthy (email:

I’m sure you’ve heard of DNA, or deoxyribonucleic acid, the “recipe book” which codes for all the proteins, the macromolecules that do all your cellular chores and make your unique self.

But you also probably know that if I gave you a recipe book, the recipe on the pages wouldn’t magically turn into that lasagna that you want for dinner. Instead, there has to be the intermediary, the cook, which will transform the instructions from the book into an actual product. In fact, I’d argue that the recipe book is completely useless without a cook to transform all of its valuable information into a delicious meal.

What’s the intermediary in our cells that transforms DNA into protein? That’s right, it’s RNA (or ribonucleic acid). It’s the essential cellular cook without which our DNA would be completely useless.

RNA molecules are single-stranded and made of the four bases A, G, C, and U. These bases can also pair up with each other to form complex three-dimensional structures.

So far we have this basic model, the central dogma as it’s called (as most of you who’ve taken high school biology would know).

The central dogma of biology as most students are taught. DNA → RNA → Proteins

But nature can’t be that simple, right? Nature is never that simple. So after decades of research, we have…

A more detailed, accurate version of the central dogma. (Don’t worry too much about the biology jargon if you’re not familiar with it.)

The central dogma was never wrong. It just ignored a lot of useful information and brushed over a lot of genetic regulatory elements that are very important. I’d even argue that the diagram above is a general simplification of the actual mechanisms going on in our cells, but it should suffice for our purposes 🙂

Ok, don’t worry too much about the terms on the infographic. Let’s go back to the cook/recipe analogy. Initially, we had a straight chain: recipe → cook → dish. Now we know that, in reality, the cook doesn’t just play one role (transforming the recipe into a meal). In addition, the cook must edit the recipe book based on the occasion and ingredients available, shop for and buy ingredients, decide what recipes to make at what times, and ultimately decide the final destination of a batch of delicacies. This is exactly what our new model for RNA accounts for! It chooses which genes to transcribe based on the time period and cellular location! It conducts subcellular localization and binds to miRNAs, deciding exactly when and where RNAs are translated into proteins.

This is exactly why it’s important to understand RNAs at a very precise level of detail! In general, it’s good to understand the three-dimensional structure of most macromolecules that play complex roles in cellular function, and so far we’ve focused mostly on proteins. If we were to think of cellular activities as a theatrical production, proteins would be the main actors, while RNAs would be doing all the backstage work. But imagine if you removed the lighting crew. You wouldn’t even be able to see the play! This is why it’s so important to devote at least partial attention to the three-dimensional structures of RNA molecules.

So far in drug development, we’ve identified proteins that are altered by mutations or SNPs (single nucleotide polymorphisms) in disease-afflicted individuals. We then treat the protein as a lock and look for a small molecule drug, or key, that opens the lock and effectively treats those affected by a particular disease. Numerous exponential technologies (such as quantum computing and artificial intelligence) are being integrated with this traditional lock and key approach and are revolutionizing the way that we think about drug discovery. But our approach is incomplete: we aren’t addressing all the potential molecules and pathways that can be altered by chance mutations. Yep, that’s right. We aren’t exploring the vast diversity of RNA molecules as potential targets for hard-to-cure diseases.

Ok… Now the next question that’s likely to come up is why? If docking small molecule ligands and protein targets is so easy, why can’t the same approach be taken for RNA targets?

I’d like to explain this problem using the analogy of a ball rolling down a hill. Imagine that the elevation of the ball represents the free energy of a macromolecule (be it RNA or protein).

Here’s what the free energy landscape looks like for proteins (of course intrinsically unstructured proteins are an exception to this rule, but most aren’t used as drug targets) and RNA molecules. Notice that whether I push the protein free energy ball with little force or large force, it always seems to roll back to the bottom-most point in the single minimum free energy (MFE) valley. On the other hand, the RNA free energy landscape has many local valleys. Thus, if the ball is pushed with small variations in the amount of force applied, it will end up in different valleys that are not all MFE positions.

Each valley represents a low energy or stable conformation for the macromolecule. Proteins often decide on one singular stable conformation, while RNAs fall into multiple local valleys based on small changes in external conditions (such as the amount of metal cation, presence of RNA binding proteins nearby, etc.). And this is where the problem comes in.
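If you’d like to play with the ball-and-hill picture yourself, here’s a minimal Python sketch. The two landscapes are toy curves I made up for illustration (they are not real free energies), and `roll_downhill` is just plain gradient descent standing in for the ball.

```python
import math

def protein_like_slope(x):
    # Slope of the toy single-valley landscape E(x) = x**2.
    return 2 * x

def rna_like_slope(x):
    # Slope of the toy rugged landscape E(x) = x**2 / 10 + sin(3 * x).
    return x / 5 + 3 * math.cos(3 * x)

def roll_downhill(slope, start, step=0.01, n_steps=5000):
    # Plain gradient descent: the "ball" keeps moving downhill until it settles.
    x = start
    for _ in range(n_steps):
        x -= step * slope(x)
    return round(x, 2)

# Different starting pushes (assumed values, purely illustrative).
starts = [-4.0, -1.0, 1.0, 4.0]
protein_valleys = {roll_downhill(protein_like_slope, s) for s in starts}
rna_valleys = {roll_downhill(rna_like_slope, s) for s in starts}

print("protein-like valleys:", protein_valleys)
print("RNA-like valleys:", rna_valleys)
```

On the single-valley curve every start ends up in the same place, while on the rugged curve different starts settle in different valleys, and not all of those valleys are the lowest one.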

Let’s stick to the lock and key analogy to understand this problem. Imagine trying to open a door with a lock that constantly changes. Every nanosecond, the ridges and bumps that make the signature of the lock fluctuate without a single period of stability. This is kind of like what scientists face when trying to develop small molecule drugs for RNA targets. Uncertainty is inevitable in problems involving biological systems, but scientists can’t be sure of the efficacy of an RNA-targeting drug without isolating a single target structure based on cellular environmental conditions and other personal and biological factors. Researchers often try to sample clusters of viable conformations from a statistical model called the Boltzmann distribution. These clusters are then placed onto a two-dimensional graph called a Principal Component Analysis (PCA) plot. The problem is that these PCA plots only help to visualize the diversity of the ensemble of structures rather than pinpoint a specific 3D structure.
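To give a feel for what sampling from the Boltzmann distribution means, here’s a small sketch. Each structure gets a probability proportional to exp(−E/RT), so lower free energy means the structure shows up more often. The three fold names and their energies are hypothetical numbers I picked for illustration, not measurements.

```python
import math
import random

RT = 0.616  # kcal/mol, gas constant times temperature at roughly 37 °C

def boltzmann_probabilities(free_energies):
    # p_i is proportional to exp(-E_i / RT); lower energy means higher probability.
    weights = [math.exp(-e / RT) for e in free_energies]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical free energies (kcal/mol) for three competing folds of one RNA.
energies = {"fold_A": -12.0, "fold_B": -11.5, "fold_C": -10.0}
probs = boltzmann_probabilities(list(energies.values()))

# Draw 1000 structures from the ensemble with a fixed seed for repeatability.
rng = random.Random(0)
sample = rng.choices(list(energies), weights=probs, k=1000)

for name, p in zip(energies, probs):
    print(name, round(p, 3), sample.count(name))
```

Notice that even a 0.5 kcal/mol gap leaves the second fold with a sizable share of the ensemble, which is exactly why a single “lock” is so hard to pin down.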

Novel Case Study: Evaluating the Viability of TRAF1-C5 Transcripts as a Drug Target

In order to identify a potential RNA drug target, I searched through a library of 2000 riboSNitches (mutations that change the conformation of an RNA molecule) and through the mutations in the PharmGKB database (a pharmacogenomic database containing info on how specific variants affect drug response). I then identified the mutations contained in both databases (which therefore act as a riboSNitch AND affect pathology/drug response). One such mutation was rs3761847, a mutation that supposedly contributes to the severity of rheumatoid arthritis.
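The cross-referencing step boils down to a set intersection on variant IDs. Here’s a minimal sketch of the idea. The two sets are toy stand-ins, not the real database contents: only rs3761847 comes from this article, and the `rs_example_*` IDs are placeholders.

```python
# Toy stand-ins for the two variant lists; only rs3761847 comes from the article.
ribosnitch_ids = {"rs3761847", "rs_example_1", "rs_example_2"}
pharmgkb_ids = {"rs3761847", "rs_example_3"}

# Variants that are both riboSNitches and annotated for drug response.
candidates = sorted(ribosnitch_ids & pharmgkb_ids)
print(candidates)
```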

I then identified the ~180 base pair windows flanking either side of the mutation via the RNAstructurome Database and entered the entire sequence into the RNAsnp web server along with the desired rs3761847 mutation.

A shift in conformation in the TRAF1-C5 transcript as visualized by the RNAsnp Web Server. The positions of major loops and stems towards the left side of the transcript structure shift between the two structures.
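Here’s roughly what preparing that input looks like in code: cut out the flanking window around the variant, then build the mutant version of the same window. This is only a sketch. The toy transcript, the SNP position, and the G to A substitution are all hypothetical, and the helper names are mine (they are not part of RNAsnp).

```python
def flanking_window(sequence, snp_index, flank):
    # Return the window around snp_index plus the SNP's position inside it.
    start = max(0, snp_index - flank)
    end = min(len(sequence), snp_index + flank + 1)
    return sequence[start:end], snp_index - start

def apply_snp(window, position, ref, alt):
    # Refuse to mutate if the reference base does not match the sequence.
    if window[position] != ref:
        raise ValueError(f"expected {ref} at {position}, found {window[position]}")
    return window[:position] + alt + window[position + 1:]

toy_transcript = "GGCAUCGAUAGCUAGGCUAACG"  # hypothetical sequence
window, pos = flanking_window(toy_transcript, snp_index=10, flank=5)
mutant = apply_snp(window, pos, ref="G", alt="A")  # hypothetical alleles

print(window)
print(mutant)
```

For the real analysis the flank would be about 180 nucleotides on each side rather than 5.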

Based on the obvious shift in conformation characterized by the outputted visuals, I assumed that this mutation must be correlated with an increase in risk/severity associated with rheumatoid arthritis.

However, after I searched the PubMed literature, I found exactly the opposite of what I expected. The rs3761847 mutation seemed to be in almost no way correlated to rheumatoid arthritis (a very low Pearson correlation coefficient was observed in some cases).

Although an obvious shift in conformation is observed based on MFE structures calculated by base pairing probability algorithms, when researchers attempt to generalize the effects of the mutation to larger populations, their attempts don’t seem to work.

Here’s the most likely reason for this result:

In silico (computer-based) folding algorithms often deviate from the structures observed in vivo (in cellular conditions).

So how can we bridge the wide gap between in-silico and in-vivo structures in order to increase the precision of RNA-based therapeutics and potentially save millions of lives by combating multifactorial diseases such as cancer and rheumatoid arthritis?

That’s where machine learning comes in.

What if, in the future, doctors could almost instantaneously sequence your transcriptome and identify RNA-based errors that lead to your unique disease symptoms?

This could become a reality. All we’d need is an ML algorithm that could bridge the gap between in-silico and in-vivo structures, and that’s the problem I intend to work on.

Proposed Architecture

In order to bridge the gap between in silico and in vivo structures, I plan to have my machine learning algorithm characterize an unprocessed RNA transcript by sequence motifs, secondary structure motifs, and characteristics of its MFE structure.

A proposed CNN based architecture for generating RNA structures that are more accurate (more closely resemble in vivo structures)

1. Fully Connected Layer with MFE Data from RNA Structurome

The RNA structurome is a database from Iowa State University which contains predicted structure data for RNA transcripts across the 3 billion nucleotides in our genome.

The 3 billion nucleotide sequence is broken into overlapping 120-nucleotide (nt) fragments, and each fragment is then folded into a two-dimensional structure using RNA folding algorithms like RNAfold from the ViennaRNA package.
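The windowing step can be sketched in a few lines. The 120 nt window size comes from the description above, but the 40 nt step between windows is my own assumption for illustration, and the placeholder sequence is obviously not a real genome. In a real pipeline each fragment would then be handed to a folding program such as RNAfold.

```python
def overlapping_windows(sequence, size=120, step=40):
    # Yield (start, fragment) pairs; step < size makes neighbours overlap.
    last_start = max(len(sequence) - size, 0)
    for start in range(0, last_start + 1, step):
        yield start, sequence[start:start + size]

toy_genome = "ACGU" * 100  # 400 nt placeholder sequence
windows = list(overlapping_windows(toy_genome))

print(len(windows), [start for start, _ in windows])
```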

In addition to the MFE value for each of these 120 nucleotide sequences, the RNA structurome database contains various metrics for MFE structure comparison including the p-value, the z-score, and the Ensemble Diversity.


p-value:

It evaluates the likelihood of a particular z-score based on the fraction of the 30 randomly scrambled sequences that are more stable (have a lower MFE) than the actual genome sequence.

$$p\text{-value} = \frac{\#\{\text{scrambled versions with } \mathrm{MFE} < \mathrm{MFE}(\text{genome sequence})\}}{30}$$


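z-score:

It shows us how stable a particular 120 nt sequence is as compared to 30 randomly scrambled sequences. Here the scrambled MFE is the mean over the 30 scrambled versions and σ is their standard deviation, so a negative z-score means the genome sequence is more stable than its scrambled counterparts.

$$z\text{-score} = \frac{\mathrm{MFE}(\text{genome sequence}) - \overline{\mathrm{MFE}}(\text{30 scrambled versions})}{\sigma}$$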
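Both metrics are easy to compute once you have the MFE of the genome window and the MFEs of its scrambled versions. Here’s a minimal sketch using made-up energies (all the numbers below are hypothetical). I’m assuming the population standard deviation for σ; the database may use a different convention.

```python
import statistics

def mfe_z_score(native_mfe, scrambled_mfes):
    # (native MFE - mean scrambled MFE) / standard deviation of scrambled MFEs.
    mean = statistics.mean(scrambled_mfes)
    sigma = statistics.pstdev(scrambled_mfes)
    return (native_mfe - mean) / sigma

def mfe_p_value(native_mfe, scrambled_mfes):
    # Fraction of scrambled sequences that are more stable than the native one.
    more_stable = sum(1 for mfe in scrambled_mfes if mfe < native_mfe)
    return more_stable / len(scrambled_mfes)

native = -35.0  # hypothetical MFE of the genome window, kcal/mol
# 30 hypothetical scrambled MFEs; exactly one is more stable than the native.
scrambled = [-36.0] + [-30.0 + 0.5 * (i % 7) for i in range(29)]

z = mfe_z_score(native, scrambled)
p = mfe_p_value(native, scrambled)
print(round(z, 2), round(p, 3))
```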

Ensemble Diversity:

As I’ve explained previously in this article, an RNA molecule can have multiple stable structures. This range of stable conformations is called the “ensemble.” The Ensemble Diversity (or ED) measures the diversity of this conformational space.
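One common way to put a number on this is the expected base-pair distance between two structures drawn from the ensemble. The sketch below takes that definition as an assumption (it is not necessarily the exact formula the database uses), and the three dot-bracket structures and their probabilities are hypothetical.

```python
def base_pairs(dot_bracket):
    # Convert dot-bracket notation into a set of (i, j) paired positions.
    stack, pairs = [], set()
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            pairs.add((stack.pop(), i))
    return pairs

def ensemble_diversity(structures, probabilities):
    # Expected base-pair distance between two structures drawn from the ensemble.
    pair_sets = [base_pairs(s) for s in structures]
    total = 0.0
    for pa, a in zip(probabilities, pair_sets):
        for pb, b in zip(probabilities, pair_sets):
            total += pa * pb * len(a ^ b)  # pairs present in only one structure
    return total

# Hypothetical ensemble of three folds for one 12 nt RNA.
structures = ["((((....))))", "((......))..", "............"]
ed = ensemble_diversity(structures, [0.6, 0.3, 0.1])
single = ensemble_diversity(["((((....))))"], [1.0])

print(round(ed, 2), single)
```

An RNA that always adopts one structure scores 0, and the score grows as the probability spreads over folds that share fewer base pairs.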

All of the metrics explained above are visualized in the RNA structurome database. Each of the yellow windows at the top of the image represents a 120 nt window used for RNA folding.

2. Convolutional Filter Layer for Structure Motif Detection

Here, convolutional filter kernels will be used to detect various secondary structure motifs, such as loops, stems, and pseudoknots.

3. Convolutional Filter Layer for Sequence Motif Detection

A similar idea to #2. Convolutional kernels will be used to detect feature commonalities in the sequence itself.
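To show what a convolutional kernel actually does on a sequence, here’s a tiny sketch: the RNA is one-hot encoded, and a single kernel slides along it and scores every position. In a trained CNN the kernel weights would be learned; here I hand-write a kernel for a made-up "GGA" motif, and the sequence is hypothetical. The same sliding idea would apply to an encoded dot-bracket string for structure motifs.

```python
BASES = "ACGU"

def one_hot(sequence):
    # Each nucleotide becomes a 4-channel vector with a single 1.0.
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence]

def conv1d(encoded, kernel):
    # Valid (no padding) 1D cross-correlation of one kernel over the sequence.
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(len(BASES)):
                score += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(score)
    return scores

kernel = one_hot("GGA")  # hand-written kernel for a hypothetical motif
scores = conv1d(one_hot("ACGGAUGGAC"), kernel)
hits = [i for i, s in enumerate(scores) if s == len(kernel)]

print(scores)
print(hits)
```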

4. Final Fully Connected Layer for Synthesis of Different Modes of Information

5. Calculate Training Loss (Using Mean Squared Error Function)

This is where the magic will happen. Instead of using the stability of the MFE structure as the main metric for success in structure generation, I will use in-vivo SHAPE data collected from various databases in order to evaluate the real-life accuracy of the generated RNA structure.

The goal is for the loss to decrease as the model trains on various RNA structures that have complementary SHAPE data, so that the model becomes more accurate at predicting actual in-vivo RNA structures!
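Here’s a rough sketch of what that loss could look like. SHAPE reactivity tends to be high at flexible, mostly unpaired nucleotides, so I compare a crude "unpaired" profile of a predicted structure against the reactivities using mean squared error. The reactivity values and both candidate structures are hypothetical, and a real model would output continuous predictions rather than a hard 0/1 profile.

```python
def unpaired_profile(dot_bracket):
    # 1.0 where the nucleotide is unpaired, 0.0 where it is paired.
    return [1.0 if symbol == "." else 0.0 for symbol in dot_bracket]

def mean_squared_error(predicted, observed):
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)

# Hypothetical normalized SHAPE reactivities for a 9 nt hairpin.
shape = [0.1, 0.0, 0.2, 0.9, 1.0, 0.8, 0.1, 0.0, 0.1]

loss_hairpin = mean_squared_error(unpaired_profile("(((...)))"), shape)
loss_unfolded = mean_squared_error(unpaired_profile("........."), shape)

print(round(loss_hairpin, 4), round(loss_unfolded, 4))
```

The structure that agrees with the SHAPE signal gets the lower loss, which is the signal training would push the model towards.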

Key Takeaways and Summary:

  • RNAs are key players in cellular biology. The central dogma is an oversimplification of the complexity of RNA structure and function.
  • A traditional lock and key approach is often taken to drug discovery. This model works well, but right now we’re constraining ourselves to protein “locks” or targets.
  • In order to increase the viability of RNA molecules as small molecule drug targets, we’ll need to isolate single fixed structures of RNA molecules in vivo (to conduct accurate small molecule docking).
  • RNA conformational changes as visualized by the multitude of online RNA folding servers and packages are not matching clinical studies presented in the literature (as shown with rs3761847).
  • Machine learning (CNNs) can be used to create models that predict in-vivo structures more accurately.