T-cell Receptor Sequencing

Written by TKS Ottawa Student Vansh Sethi (email:

The Greatest Threat to Humanity…

What is the greatest threat to humanity? Some might say that it’s nuclear fallout. Others might say it’s a large asteroid hitting our planet. In fact, most people believe that the biggest threat to humanity is some out-of-this-world, unimaginable thing that will wipe out humans in one sweep. Well… the greatest threat to humanity is right under our noses.

The greatest threat to humanity is a pandemic that ravages the world before we are prepared for it. Don’t believe me? Just look at history. The Spanish Flu infected roughly a third of the world’s population and killed at least 50 million people, with some estimates as high as 100 million. Smallpox is estimated to have killed 300 to 500 million people in the 20th century alone. Infectious disease has been one of the leading causes of death throughout human history. Even billionaire philanthropist Bill Gates believes that the next great catastrophe for humanity is most likely to be a virus.

“If anything kills over 10 million people in the next few decades, it’s most likely to be a highly infectious virus rather than a war — not missiles, but microbes”

These viruses attack our bodies from the inside, and our immune system is what fights back. So far, vaccines have been our main weapon against them. A vaccine works by exposing a person’s immune system to a weakened or inactivated form of the virus, or to some of its proteins, so that the immune system learns how to fight the real virus.

The fact is that creating a vaccine is costly and takes a lot of time. Instead of having to expose our immune system to the virus in a controlled manner, what if we could prompt our immune system to fight off these diseases without a vaccine?

Welcome to T-Cells

T-cells are one of the immune system’s natural defenses against disease. They are immune cells that identify dangerous cells (such as cancerous or virus-infected cells) and kill them by triggering apoptosis, or programmed cell death. Here’s how they work:

Every diseased cell displays antigens on its surface, molecular fragments that mark the cell as abnormal or infected. T-cells have receptors (TCRs) on their surface that can bind to specific antigens, which allows the T-cell to kill the diseased cell. Your body does this all the time. Sometimes, however, a disease evades your body’s natural TCRs, so they can’t bind to its antigens. This becomes a problem because the disease can then spread throughout your body and, in the worst case, kill you. CAR T-cell therapies address this by engineering T-cells with receptors designed to bind these antigens.

CAR T-Cell Therapies

In a CAR T-cell therapy, a patient’s T-cells are extracted from a blood sample and genetic information is injected into the nucleus of the T-Cell to instruct the cell to produce certain T-cell receptors. These new CAR T-cells are then put back into the body to bind with the antigens to kill diseased cells.

When designing the TCR of the new T-cell, there are a few key components to consider so that the receptor can bind to the antigen. They include:

The Antigen: Also referred to as the peptide, this is the short protein fragment that the diseased cell displays on its MHC (the molecule that presents it on the cell surface). The TCR binds to this peptide-MHC complex, which allows apoptosis to be triggered.

V-Segment: The V-segment, or variable segment, is the outermost part of the T-cell receptor. There are many different variants of the V-segment.

J-Segment: The J-segment, or joining segment, connects the V-segment to the rest of the TCR. There are many different variants of the J-segment.

CDR3: The third complementarity-determining region of the T-cell receptor is also important when fighting diseases. It is represented as a sequence of amino acids.

Epitope: The epitope is the part of the antigen that the receptor recognizes and binds to. It’s represented as a sequence of amino acids.

In this article, finding these parts of the T-cell receptor is referred to as T-Cell Receptor Sequencing, or TCR sequencing. Currently, finding the various parts of the TCR requires a lot of research and testing. It can be costly and time-consuming, which is not ideal when dealing with a new disease, and this is where CAR T-cell therapies become inefficient. The process is hard because it involves many variables. I introduce an approach to TCR sequencing that uses a variety of deep learning methods.

T-Cell Receptor Sequencing with Machine Learning

Given the epitope protein sequence of an antigen, can we predict the V-segment, J-segment and CDR3 protein sequence of a TCR that will bind to the antigen? This is the fundamental question I tried to answer using machine learning. To make this easier, I broke the problem up into three manageable goals: one model to predict the V-segment, one model to predict the J-segment and one model to predict the CDR3 sequence.

I used the VDJdb database, which contains information on epitopes and the corresponding T-cell receptor parts that would trigger apoptosis. It contains more than 75,000 data points, which is what the models were trained on. You can find the database here.

Deep Learning Model for V-Segment and J-Segment

The methods for predicting the V-segment and J-segment are very similar because each segment can only be one of a fixed set of classes, and a model can be trained to predict which class a segment belongs to. This is a multiclass classification problem.

Input Data Representation

As mentioned, the input of the neural networks is the epitope protein sequence of the antigen. However, the epitope protein sequence is represented as letters and does not have a fixed length. This is problematic because neural networks work only with numerical values in a fixed-size input.

Examples of epitope protein sequences: LLWNGPMAV (Yellow Fever Virus), CPSQEPMSIYVY (Cytomegalovirus), CTPYDINQM (Simian Immunodeficiency Virus)

Luckily, assigning each letter an ID solves the first problem. We can map each letter in the protein sequence to a number. For example, the letter “A” becomes 1, the letter “B” becomes 2 and so on.

So the yellow fever virus sequence becomes:

LLWNGPMAV → 12 12 23 14 7 16 13 1 22

This new encoded sequence of numbers has a length of 9. However, other sequences will have a length of 8, 10, 11 or up to 20 amino acids. A neural network needs a fixed-size input, so we pad every sequence with 0’s up to the maximum length of 20. So our encoded protein sequence of the yellow fever virus becomes:

12 12 23 14 7 16 13 1 22 → 12 12 23 14 7 16 13 1 22 0 0 0 0 0 0 0 0 0 0 0
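
The encoding and padding steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's actual preprocessing code; the function name `encode_epitope` and the constant `MAX_LEN` are made up for this example.

```python
MAX_LEN = 20  # longest epitope length mentioned in the article


def encode_epitope(sequence, max_len=MAX_LEN):
    """Map each letter to its alphabet position (A=1, B=2, ...) and pad with 0s."""
    ids = [ord(letter) - ord("A") + 1 for letter in sequence.upper()]
    if len(ids) > max_len:
        raise ValueError(f"sequence longer than {max_len}: {sequence}")
    # Append zeros so every encoded sequence has exactly max_len entries.
    return ids + [0] * (max_len - len(ids))


encoded = encode_epitope("LLWNGPMAV")
print(encoded)
```

Using 0 only for padding keeps it distinct from every real letter, since the letter IDs start at 1.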

These inputs are fed into our neural network in the form of arrays. So with the input handled, how does the neural network decide which segment is appropriate for the sequence?

Model Architecture

A neural network is just a mathematical function that takes an input “x” and produces an output “y”. It has weights, or learned parameters, that transform x into y. Different neural network architectures use their parameters in different ways. These parameters are optimized to produce the best output possible. This is done by calculating a loss function, which measures how far the model’s output is from the desired output. The lower the loss, the better the model. An optimizer (an algorithm based on calculus, typically a variant of gradient descent) adjusts the parameters to lower the loss. This process is called “training the model,” and it is the “learning” in machine learning.
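
To make the idea of a loss and an optimizer concrete, here is a toy sketch that is not the article's model: a "network" with a single weight `w` that computes `y = w * x`, a mean squared error loss, and plain gradient descent. All names and numbers are illustrative assumptions.

```python
def mse_loss(w, xs, ys):
    """Mean squared error of the one-weight model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(w, xs, ys):
    """Derivative of the loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, learning_rate=0.05, steps=200):
    """Repeatedly nudge w in the direction that lowers the loss."""
    w = 0.0
    history = []
    for _ in range(steps):
        history.append(mse_loss(w, xs, ys))
        w -= learning_rate * mse_gradient(w, xs, ys)
    return w, history


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # toy data generated by y = 2x
w, history = train(xs, ys)
print(w, history[0], history[-1])
```

A real network has many weights and a more elaborate optimizer, but the loop is the same: measure the loss, compute its gradient, update the parameters.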

Output Data

The last dense layer in the neural network has 126 neurons, which represent the 126 classes the V-segment of the T-cell receptor can be. The target output is a one-hot vector, which means every position is 0 except for one, which is 1. The position of the 1 determines which class the V-segment is. A trained network typically outputs a score for each class, and the position with the highest score is taken as the prediction.

For the J-segment model, there are 68 neurons that represent the 68 classes the J-segment can be. It’s the same model with a different number of neurons in the last dense layer.

So for example, if we are predicting the V-segment class and the output of the model is 1 0 0 0…, the V-segment is the 1st class, the TRBV6-8 variable segment.
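
Reading a class off the output layer can be sketched as follows. This assumes the last dense layer produces one raw score per class and that a softmax turns those scores into probabilities, which is a common setup but is not confirmed by the article. Only the class name TRBV6-8 comes from the article; the other names and all scores are placeholders.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(scores, class_names):
    """Return the name and probability of the highest-scoring class."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


# Only the first name comes from the article; the others are placeholders.
class_names = ["TRBV6-8", "CLASS_2", "CLASS_3", "CLASS_4"]
scores = [3.2, 0.1, -1.0, 0.5]  # made-up raw outputs of the last dense layer
name, prob = predict_class(scores, class_names)
print(name, round(prob, 3))
```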

Training / Results

After training both models, similar results were achieved. The loss went down to 3 on the training data, yet the validation loss was much worse. The models had a tough time generalizing to new data. However, the loss overstates the problem: many classes can work for a given epitope, so a prediction that counts as wrong against a single label may still be a valid segment, and the practical error is lower than the loss suggests.
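
To put a loss of 3 in context, the sketch below assumes the loss is categorical cross-entropy measured with the natural logarithm (the article does not name the loss function). Under that assumption, a model that guesses uniformly among the classes has a loss of ln(number of classes), and a given loss corresponds to a geometric-mean probability of exp(-loss) on the correct class.

```python
import math


def uniform_baseline_loss(num_classes):
    """Cross-entropy of a model that gives every class the same probability."""
    return -math.log(1.0 / num_classes)


def implied_probability(loss):
    """Geometric-mean probability assigned to the correct class at this loss."""
    return math.exp(-loss)


v_baseline = uniform_baseline_loss(126)  # V-segment: 126 classes
j_baseline = uniform_baseline_loss(68)   # J-segment: 68 classes
p_at_loss_3 = implied_probability(3.0)

print(round(v_baseline, 2), round(j_baseline, 2), round(p_at_loss_3, 3))
```

If the assumption holds, a loss of 3 is clearly better than guessing at random among 126 or 68 classes, while still leaving a lot of probability on other classes.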

Figures: loss for the V-segment model and loss for the J-segment model.

After training for 30 epochs each, the models work decently well on new, unseen data. They predict one of the right classes up to 80% of the time.
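
The article does not say exactly how "one of the right classes" was measured. One plausible reading, sketched here with made-up data, is to count a prediction as correct when it matches any segment class recorded for that epitope. The class labels and the function name `hit_rate` are placeholders, not VDJdb records.

```python
def hit_rate(predictions, valid_classes):
    """Fraction of epitopes whose predicted class is among the known valid ones."""
    hits = sum(
        1
        for epitope, predicted in predictions.items()
        if predicted in valid_classes.get(epitope, set())
    )
    return hits / len(predictions)


# Made-up example: several classes may be valid for the same epitope.
valid_classes = {
    "LLWNGPMAV": {"V_A", "V_B"},
    "CPSQEPMSIYVY": {"V_C"},
    "CTPYDINQM": {"V_A", "V_D"},
}
predictions = {"LLWNGPMAV": "V_B", "CPSQEPMSIYVY": "V_A", "CTPYDINQM": "V_D"}

rate = hit_rate(predictions, valid_classes)
print(rate)
```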

Sequence 2 Sequence for CDR3 Protein Sequence

Since the CDR3 part of the T-cell receptor is a protein sequence of variable length, a multiclass approach wouldn’t work. Instead, we need an approach that takes in the epitope and produces an output of variable length. A Sequence 2 Sequence model can be used for this. It takes in a sequence (the epitope protein sequence) and produces another sequence (the CDR3 protein sequence) using an encoder-decoder architecture.

Input Data Representation

The way we represent the data for this model’s input is similar to the V-segment and J-segment models. Each letter in a sequence is mapped to a number. “A” is mapped to 1, “B” is mapped to 2 and so on. Where this model’s input differs is that, after the letters are encoded as numbers, each number is encoded as a one-hot vector. A one-hot vector is an array of 0’s with a single 1. The position of the 1 in the array represents the class of that one-hot vector. For example, if the number 1 is encoded as a one-hot vector, the first position of the array is 1 and the rest are 0. Since there are 26 letters, there are 26 classes in the one-hot vector. This makes the information easier for the neural network to handle.
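
A one-hot encoding of a sequence can be sketched like this. It follows the article's convention of 26 classes, one per letter; how padding and the start and end tokens are represented is not specified in the article, so they are left out here. The function names are illustrative.

```python
NUM_CLASSES = 26  # one class per letter, as in the article


def one_hot(index, num_classes=NUM_CLASSES):
    """Return a vector of zeros with a single 1 at position index - 1."""
    if not 1 <= index <= num_classes:
        raise ValueError(f"index must be in 1..{num_classes}")
    vector = [0] * num_classes
    vector[index - 1] = 1
    return vector


def one_hot_sequence(sequence):
    """Encode each letter (A=1, B=2, ...) as its own one-hot vector."""
    return [one_hot(ord(letter) - ord("A") + 1) for letter in sequence.upper()]


encoded = one_hot_sequence("LLWNGPMAV")
print(len(encoded), len(encoded[0]))
```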

Also, a start token is added at the beginning of the protein sequence and an end token at the end. These tokens do what they sound like: they indicate where a sequence starts and where it ends. In addition, the desired output is fed into the model explicitly during training, a technique known as teacher forcing. I know, using the output as input? Yes, because at each step the decoder sees the correct previous letters rather than its own guesses, which makes training the seq2seq model more stable and accurate.
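
Feeding the desired output into the decoder during training (often called teacher forcing) usually means building two shifted copies of the target sequence. The sketch below shows that shift; the token spellings and the example string are placeholders, not values from the project.

```python
START, END = "<start>", "<end>"  # token spellings are placeholders


def make_decoder_pair(target_sequence):
    """Build the decoder input and the expected output for one training example."""
    tokens = [START] + list(target_sequence) + [END]
    decoder_input = tokens[:-1]   # what the decoder sees at each step
    decoder_target = tokens[1:]   # what it should predict at that step
    return decoder_input, decoder_target


# "CASSLG" is a made-up short string used only to show the shift.
decoder_input, decoder_target = make_decoder_pair("CASSLG")
print(decoder_input)
print(decoder_target)
```

At every position the decoder is given the correct previous token and asked for the next one, so the final prediction it has to make is the end token.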

Model Architecture

Embeddings: First, the input is put through an embedding layer, which converts each element of the input into a vector of values of a fixed length. It does this by multiplying the input by weights that are learned during the training process. This is important because we had to pad our input to keep it at a constant length, and the embedding lets the model learn to treat the padded positions as unimportant. Embeddings also map similar words, or in our case similar amino acids, to similar vectors, which makes it easier for the model to generalize. Similar items end up closer to each other.
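
An embedding layer can be thought of as a lookup table with one row of weights per input ID. The sketch below uses random weights in place of learned ones and shows that looking up a row gives the same result as multiplying a one-hot vector by the table. The vocabulary size of 27 (26 letters plus 0 for padding) and the dimension of 4 are assumptions for illustration.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """Randomly initialised table; in a real model these weights are learned."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]


def embed(ids, table):
    """Look up one row of the table per ID."""
    return [table[i] for i in ids]


def one_hot_times_table(index, table):
    """Same result via an explicit one-hot vector multiplied by the table."""
    one_hot = [1.0 if row == index else 0.0 for row in range(len(table))]
    dim = len(table[0])
    return [sum(one_hot[r] * table[r][d] for r in range(len(table))) for d in range(dim)]


table = make_embedding_table(vocab_size=27, dim=4)  # 26 letters + ID 0 for padding
vectors = embed([12, 12, 23, 0], table)
print(vectors[0])
```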

Embeddings Map

Encoder: The encoder is an LSTM layer that produces the state vectors for the decoder. Essentially, the encoder processes the input and extracts its most important features. An LSTM maintains two state vectors as it reads a sequence: a cell state that acts as long-term memory and a hidden state that acts as short-term memory and also serves as its output. Propagating our input through the encoder extracts the important features of the input, which the decoder later uses to construct a new sequence.
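
The long-term and short-term memory of an LSTM can be seen in a single step of the standard LSTM equations. The sketch below is a scalar version (one input value, one hidden unit) with made-up weights; a real encoder uses vectors and learned weight matrices, and this is not the project's code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell (one input, one hidden unit)."""
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much old long-term memory to keep
    i = gate("input", sigmoid)         # how much new information to let in
    g = gate("candidate", math.tanh)   # the new information itself
    o = gate("output", sigmoid)        # how much memory to expose
    c = f * c_prev + i * g             # long-term memory (cell state)
    h = o * math.tanh(c)               # short-term memory (hidden state), also the output
    return h, c


def encode(sequence, params):
    """Run the cell over a sequence and return the final state pair."""
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
    return h, c


# Made-up weights: (input weight, recurrent weight, bias) for each gate.
params = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.4, -0.2, 0.1),
    "candidate": (0.3, 0.2, 0.0),
    "output": (0.6, 0.1, -0.1),
}
h, c = encode([0.2, -0.1, 0.4], params)
print(h, c)
```

The pair `(h, c)` returned by `encode` plays the role of the state vectors that the encoder hands to the decoder.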

Decoder: The decoder is also an LSTM, but it works in the opposite direction and builds the output sequence from the encoder’s state vectors. It produces a representation of the new sequence from the encoder features. The decoder outputs are then run through a dense layer to produce the final sequence, which is encoded as numbers. This encoder-decoder structure is what allows the Sequence 2 Sequence model to be so effective. The parameters can be learned with enough data.
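
At prediction time there is no desired output to feed in, so a decoder is usually run one token at a time, feeding each prediction back in as the next input. The sketch below shows that loop with greedy selection. The `step` function is a hypothetical stand-in for the trained decoder LSTM plus dense layer, and `toy_step` simply spells out a fixed made-up string so the loop can be run.

```python
START, END = "<start>", "<end>"  # token spellings are placeholders


def greedy_decode(step, initial_state, max_len=30):
    """Generate tokens one at a time until the end token or max_len is reached.

    step(token, state) stands in for the trained decoder: it returns a dict
    of token -> score and the updated state.
    """
    token, state, output = START, initial_state, []
    for _ in range(max_len):
        scores, state = step(token, state)
        token = max(scores, key=scores.get)  # pick the highest-scoring token
        if token == END:
            break
        output.append(token)
    return "".join(output)


def toy_step(token, state):
    """Toy decoder that spells out a fixed string; the state is just a position."""
    script = "CASS"
    best = script[state] if state < len(script) else END
    scores = {t: 0.0 for t in list("ACDEFGHIKLMNPQRSTVWY") + [END]}
    scores[best] = 1.0
    return scores, state + 1


result = greedy_decode(toy_step, initial_state=0)
print(result)
```

In a real model, `initial_state` would be the state vectors produced by the encoder for a given epitope.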

Training / Results

This model trained very well and achieved strong results. It has only around 20,000 parameters (small for a neural network), but it was able to generate protein sequences with 90% accuracy and reached a loss of 0.4.

There were no signs of overfitting over the 20 epochs it was run for, which means it still has room for improvement! Overall, this works really well.


Figures: Melanoma Antigen, Herpes Virus Antigen, Yellow Fever Virus Antigen


I envision a future where pandemics, diseases and viruses can be dealt with in a safe, swift manner. We are inevitably moving toward a more personalized medical system that favours those with money over the general public. We as a society need to address large-scale disease and make treatment accessible to everyone, and I believe that CAR T-cell therapy is something we need to be working to advance. Breakthroughs in research can expedite the process of getting it to market widely, but we will need the full cooperation of world governments to achieve accessibility for everyone. The solution I’ve proposed helps make CAR T-cell therapy better, but not more accessible. As accuracy goes up, we need to ensure usability goes up as well.

This project also helps cement the use of NLP techniques in the medical field (where they are becoming more reliable). A lot of people don’t trust AI to help with medical discovery, but the fact is that it makes our current medical technology better.

Code for project:

Key Takeaways:

  • Problem: Predict the CDR3 protein sequence, variable segment & joining segment of a T-cell given the epitope of an antigen
  • Epitope data can be represented as numbers through word encodings and embeddings
  • Variable segment & joining segment can be predicted using classical neural networks
  • CDR3 protein sequence can be predicted using a Sequence 2 Sequence model