Deep learning has proven to be one of the state-of-the-art tools in object recognition and natural language processing, and has been applied to many fields including life sciences. Deep learning refers to the use of multiple layers of neural network which is one of the traditional machine learning algorithms. Deep neural networks are able to solve problems that are difficult to solve with conventional machine learning algorithms. At present, various architectures of deep neural networks have been published, especially convolutional neural networks (CNN) and recurrent neural networks (RNN) are well-known. The former is known to be excellent in computer visions such as object recognition and object detection, while the latter, RNN, is known to be excellent in time-series data analysis and natural language processing. Considering many Bioconductor users are familiar with genome sequence analysis, in this workshop, we will focus on RNN, and introduce the fundamental algorithm of RNN and procedures to build RNN models using R.
The workshop introduces fundamental concepts of RNN algorithms and long short-term memory (LSTM), through developing a model to predict CRISPR guide RNA activity with a nucleotide sequence of guide RNA. By the end of the workshop, participants should be able to:
The workshop will proceed on the assumption that participants have the basic knowledge:
for
sentence, functions)Since the workshop focuses on explaining of basic knowledge about deep learning, it is not intended for those who are familiar with deep learning or who can build models with Python or other programming languages.
In this workshop, we mainly use torch (PBC 2021) and coro (Henry 2021) packages to perform deep learning, and use tidyverse package for visualization. To install these packages, run the following scripts.
install.packages('tidyverse')
install.packages('coro')
install.packages('torch')
library('torch')
install_torch(timeout = 1200)
Then, restart R session and load these packages.
Dataset used in this workshop can be downloaded from Supplementary Data in Wang D (2019)’s paper. Due to time restraints, we will use the dataset that has already been arranged. The arranged dataset is packaged in Intro2DNN package and also can be downloaded from GitHub. The dataset contains train.tsv and valid.tsv files, which are the training dataset and validation dataset, respectively.
At first, we will show an example of loading and preprocessing of training dataset train.tsv.
train_fpath <- system.file('train.tsv', package = 'Intro2DNN')
crisprcas9_train_dataset <- read.table(train_fpath, header = FALSE, sep = '\t')
head(crisprcas9_train_dataset)
As shown above, train.tsv consists of two columns: the first column is the editing efficiencies of guide RNA, and the second is the nucleotide sequence of guide RNA.
In this workshop, our goal is to create a model to predict the activity by inputting a nucleotide sequence. For convenience, we store the activities into y_train
as labels (response variable) and the nucleotide sequences into x_train
as features (explanatory variables).
x_train <- crisprcas9_train_dataset[, -1]
y_train <- crisprcas9_train_dataset[, 1]
Since most machine learning packages including torch package require numeric inputs, it is necessary to convert a string (i.e., nucleotide sequence) into numeric values. Based on the requirements of torch package, we convert A, T, C, and G into integers 1, 2, 3, and 4, respectively. Here, as an example, we try to covert the first sequence of x_train
with chartr
and strsplit
functions.
chartr('ATCG', '1234', x_train[1])
as.integer(unlist(strsplit(chartr('ATCG', '1234', x_train[1]), '')))
We can see that the 21 nucleotides was successfully converted into 21 integers with chartr
and strsplit
functions. Next, we use the two functions to convert all sequences in x_train
object.
x_train <- matrix(as.integer(unlist(strsplit(chartr('ATCG', '1234', x_train), ''))),
ncol = 21, byrow = TRUE)
head(x_train)
With the above steps, we successfully loaded data from a file into R object, and converted all sequences into numeric values. Unfortunately, torch package cannot directly handle this type of R object. It is required to create a torch-readable class (a collection of datasets and functions) following torch requirements.
set_dataset <- dataset(
'guidRNA_dataset',
initialize = function(x, y) {
self$x <- x
self$y <- y
},
.getitem = function(i) {
x_tensor <- torch_tensor(as.integer(self$x[i, ]))
y_tensor <- torch_tensor(as.numeric(self$y[i]))
list(x = x_tensor, y = y_tensor)
},
.length = function() {
nrow(self$x)
}
)
Then, we assign the training data x_train
and y_train
to set_dataset
class; and we pass set_dataset
class object to dataloader
to manage datasets during model training.
dataset_train <- set_dataset(x_train, y_train)
dataloader_train <- dataloader(dataset_train, batch_size = 1024, shuffle = TRUE)
The first step of model construction with torch package is to design architecture of neural networks. Here, we design a neural network with multiple layers that receives 21 integers (features) and outputs one real value (label). Specifically, we design an architecture with three main parts: (i) the first part is composed of an embedding layer that converts 21 integer values to 64 dimension values, (ii) the second part is composed of an LSTM layer which receives 64 values and outputs 256 features, and (iii) the third part is composed of three fully connected layers which receives 256 values and outputs one real value.
To design the network architecture, we follow the definitions of torch to create a class with initialize
and forward
functions. The initialize
function declares the components of the network, and the forward
function defines the order in which the components are connected.
GenomicNet <- nn_module(
"GenomicNet",
initialize = function() {
self$embedding <- nn_embedding(num_embeddings = 4, embedding_dim = 64)
self$lstm <- nn_lstm(input_size = 64, hidden_size = 256, batch_first = TRUE)
self$dropout <- nn_dropout(p = 0.5)
self$fc1 <- nn_linear(in_features = 256, out_features = 512)
self$fc2 <- nn_linear(in_features = 512, out_features = 512)
self$fc3 <- nn_linear(in_features = 512, out_features = 1)
},
forward = function(x) {
x <- self$embedding(x)
x <- self$lstm(x)
x <- x[[2]][[1]][1, , ]
x <- nnf_relu(x)
x <- self$fc1(x)
x <- self$dropout(x)
x <- nnf_relu(x)
x <- self$fc2(x)
x <- self$dropout(x)
x <- nnf_relu(x)
x <- self$fc3(x)
}
)
In this subsection, we will create an instance from the model architecture and assign the dataset for model training. Here is an example for creating an instance from the GenomicNet
class.
model <- GenomicNet()
To train the model, we specify a training algorithm and a loss function in advance. Since our goal is to predict a single real number, we will solve it as a regression problem. Here we will use the mean squared error (MSE) as the loss function. In addition, we will use Adam’s algorithm to optimize the model, which is one of the most popular algorithms in most situations.
criterion <- nnf_mse_loss
optimizer <- optim_adam(model$parameters)
Next, we send the prepared datasets and the model to a device (CPU or GPU) for training. Here, we use for
statement to train 5 epochs with the same dataset. At each epoch, we train the model with each of the minibatches defined by dataloader
.
model$to(device = 'cpu')
model$train()
loss_train <- c()
for (epoch in 1:5) {
loss_running <- 0
n_train_samples <- 0
# loop for minibatches
coro::loop(for (b in dataloader_train) {
optimizer$zero_grad()
output <- model(b$x$to(device = 'cpu'))
loss <- criterion(output, b$y$to(device = 'cpu'))
loss$backward()
optimizer$step()
loss_running <- loss_running + loss$item() * nrow(b$x)
n_train_samples <- n_train_samples + nrow(b$x)
})
loss_train <- c(loss_train, loss_running / n_train_samples)
cat(sprintf("epoch %d loss: %3f\n", epoch, loss_running / n_train_samples))
}
The training loss decreased during training. Further, even at the 5th epoch, the downward trend of the training loss has not weakened. Thus, it is expected that the training loss can be further reduced by setting a larger number of epochs.
data.frame(epoch = 1:length(loss_train), loss = loss_train) %>%
ggplot(aes(x = epoch, y = loss)) +
geom_line()
Let us train the model more 5 epochs.
for (epoch in 6:10) {
loss_running <- 0
n_train_samples <- 0
coro::loop(for (b in dataloader_train) {
optimizer$zero_grad()
output <- model(b$x$to(device = 'cpu'))
loss <- criterion(output, b$y$to(device = 'cpu'))
loss$backward()
optimizer$step()
loss_running <- loss_running + loss$item() * nrow(b$x)
n_train_samples <- n_train_samples + nrow(b$x)
})
loss_train <- c(loss_train, loss_running / n_train_samples)
cat(sprintf("epoch %d loss: %3f\n", epoch, loss_running / n_train_samples))
}
data.frame(epoch = 1:length(loss_train), loss = loss_train) %>%
ggplot(aes(x = epoch, y = loss)) +
geom_line()
Here we use the validation dataset to validate the model performance. The procedures of data preprocessing for validation is the same as that for training.
valid_fpath <- system.file('valid.tsv', package = 'Intro2DNN')
crisprcas9_valid_dataset <- read.table(valid_fpath, header = FALSE, sep = '\t')
x_valid <- crisprcas9_valid_dataset[, -1]
y_valid <- crisprcas9_valid_dataset[, 1]
x_valid <- matrix(as.integer(unlist(strsplit(chartr('ATCG', '1234', x_valid), ''))),
ncol = 21, byrow = TRUE)
dataset_valid <- set_dataset(x_valid, y_valid)
dataloader_valid <- dataloader(dataset_valid, batch_size = 1024, shuffle = FALSE)
Then, same as the training steps, we use for
statement to assign the validation dataset to the model, and retrieve the prediction results. Note that, switching the model to validation mode (evaluation mode) enables to improve the calculation speed during validation.
model$eval()
y_true <- y_valid
y_pred <- c()
loss_valid <- 0
n_valid_samples <- 0
coro::loop(for (b in dataloader_valid) {
output <- model(b$x$to(device = 'cpu'))
y_pred <- c(y_pred, as.numeric(output))
loss <- criterion(output, b$y$to(device = 'cpu'))
loss_valid <- loss_valid + loss$item() * nrow(b$x)
n_valid_samples <- n_valid_samples + nrow(b$x)
})
loss_valid <- loss_valid / n_valid_samples
loss_valid
Then we plot a scatter chart to visualize the correlation between predicted values and the true values.
data.frame(label = y_true, predicted = y_pred) %>%
ggplot(aes(x = label, y = predicted)) +
geom_point() +
coord_fixed() +
xlim(0, 1) + ylim(0, 1)
The mean squared error (MSE) was calculated during the validation step, and stored in loss_valid
object. In addition, including MSE, the evaluation metrics such as Spearman’s rank correlation coefficient also can be calculated following the definition.
In this subsection, we show an example to infer activities of guide RNA with the trained model. To perform inference, the scripts used for validation also can be used as-is. Here, we show another method to infer activity with a single nucleotide sequence of guide RNA.
x <- 'GAGTGATGATGGTCTGCACAC'
x <- matrix(as.integer(unlist(strsplit(chartr('ATCG', '1234', x), ''))),
ncol = 21, byrow = TRUE)
x
x_tensor <- torch_tensor(x)
x_tensor
y <- model(x_tensor$to(device = 'cpu'))
y <- y$item()
y
The trained model can be saved with torch_save
function. Note that models saved with the standard save
function will be environment-dependent, and results in that you are not able to call the model in other environments (computers).
torch_save(model, 'my_model.pth')
The model can be loaded with torch_load
function from a file. Models loaded by torch_load
function can be used for inference or retraining.
mymodel <- torch_load('my_model.pth')
mymodel$eval()
x <- 'GAGTGATGATGGTCTGCACAC'
x <- matrix(as.integer(unlist(strsplit(chartr('ATCG', '1234', x), ''))),
ncol = 21, byrow = TRUE)
x_tensor <- torch_tensor(x)
x_tensor
y <- mymodel(x_tensor$to(device = 'cpu'))
y <- y$item()
y