Achieved Top 12.5% on Kaggle Competition, Mechanisms of Action (MoA) Prediction

To develop the algorithm that classifies drugs based on their biological activity

Bozhong Liu
13 min read · May 27, 2021

Team Members: LIU BOZHONG, ZHOU JIANAN, ZHU ZHICHENG

Time: November 2020

Homepage of Competition: https://www.kaggle.com/c/lish-moa/overview

GitHub repository: https://github.com/bozliu/Mechanisms-of-Action-Prediction

1 Introduction

1.1 Problem Statement

This is a multi-label classification problem to determine the Mechanism of Action (MoA) of a drug. A new technology can simultaneously measure human cells’ responses to drugs in a pool of 100 different cell types within the same samples, so that cell types better suited to a given drug can be identified ex ante. The dataset combines gene expression and cell viability data, split into training and testing subsets.

The objective is to use the training dataset to develop an algorithm that automatically labels each case in the test set with one or more MoA classes. Based on the MoA annotations, solutions are evaluated by the average value of the logarithmic loss function applied to each drug-MoA annotation pair.

1.2 Challenges of the problem

There are 23,814 instances in total in the training dataset, and for each instance there are 206 labels to predict. This is a multi-label problem, and we think its biggest challenge is label imbalance (Figure 1). The largest proportion of positive samples for any single label is only 3.5%, most labels account for less than 0.5%, and the mean value of the target labels is 0.00343.

Figure 1. Percentage of positive samples for each target label in the training dataset.

This competition allows us to submit a predicted probability for each label, and the final submission score is calculated by the evaluation function in Equation 1; the lower the score, the better the result.

Equation 1: score = −(1/M) Σ_{m=1}^{M} (1/N) Σ_{i=1}^{N} [ y_{i,m} log(ŷ_{i,m}) + (1 − y_{i,m}) log(1 − ŷ_{i,m}) ]

where N is the number of samples, M is the number of scored MoA targets, y_{i,m} is the ground-truth label, and ŷ_{i,m} is the predicted probability.
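
For concreteness, the metric can be computed in a few lines of NumPy. This is a sketch assuming y_true and y_pred are arrays of shape (N, M); the 10^−15 clipping follows the competition's metric description:

```python
import numpy as np

def moa_log_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean binary log loss over all (sample, target) pairs."""
    p = np.clip(y_pred, 1e-15, 1 - 1e-15)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))
```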

The second challenge is that this evaluation function is unfriendly to some traditional machine learning methods, which can only predict a discrete label instead of a probability. Although logistic regression can predict a probability between 0 and 1, it is very sensitive to label imbalance. SVMs can only give 0-or-1 predictions, which are penalized heavily when wrong. A decision tree can predict a probability by counting proportions in its leaf nodes, but it also has many limitations and performs poorly on this task. Consequently, we use deep learning as the main method for this problem.

1.3 Datasets

1.3.1 File description

In this competition, multiple targets of the MoA responses of different samples are predicted given various inputs such as gene expression data and cell viability data.

“train_features.csv” is the file of features for the training set. Features prefixed g- signify gene expression data, and features prefixed c- signify cell viability data. cp_type indicates samples treated with a compound (trt_cp) or with a control perturbation (ctl_vehicle); control perturbations have no MoAs. cp_time and cp_dose indicate treatment duration (24, 48, or 72 hours) and dose (high or low).

“train_drug.csv” is the file that contains an anonymous drug_id for the training set only.

“train_targets_scored.csv” contains the binary MoA targets that are scored. “train_targets_nonscored.csv” contains additional binary MoA responses for the training data; these are neither predicted nor scored.

“test_features.csv” contains features for the test data. The probability of each scored MoA is predicted for each row in the test data.

“sample_submission.csv” is a submission file in the correct format.

1.3.2 Categorical features

Features g- signify gene expression data.

Features c- signify cell viability data.

cp_type indicates whether a sample was treated with a compound (trt_cp) or with a control perturbation (ctl_vehicle); control perturbations have no MoAs.

cp_time and cp_dose indicate treatment duration (24, 48, 72 hours) and dose (high or low).

1.3.3 Cell Viability

A viability assay is an assay created to determine the ability of organs, cells, or tissues to maintain or recover a state of survival. Viability can be distinguished from the all-or-nothing states of life and death by the use of a quantifiable index that ranges between 0 and 1 (or, equivalently, between 0% and 100%). Viability can be observed through the physical properties of cells, tissues, and organs. Some of these include mechanical activity; motility, such as with spermatozoa and granulocytes; the contraction of muscle tissue or cells; mitotic activity in cellular functions; and more. Viability assays provide a more precise basis for measuring an organism’s level of vitality.

1.3.4 Gene Expression

Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. There are both negative and positive correlations between genes here. Some drugs upregulate certain genes and others downregulate them: for example, drug-A could reduce gene-X’s expression level while drug-B could elevate gene-Y’s expression level. Some gene pairs have a high negative correlation, so if gene-X has a high positive expression value, gene-Y will have a strongly negative expression value.

2 Methodology

2.1 Feature Transformation and Normalization

Transformation: In this competition, the raw feature dimension is 875, consisting of 3 categorical features, 772 numerical gene features, and 100 numerical cell features. The categorical features are transformed into numerical ones, since we do not use classifiers such as decision trees that can handle categorical features naturally. We treat them as nominal features and transform them using get_dummies in the Pandas API. These transformed features are not passed through the subsequent steps, such as normalization, dimensionality reduction, and feature selection.
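
As a rough sketch of this transformation (column names are as in the competition files; the exact handling in our pipeline may differ slightly):

```python
import pandas as pd

# One-hot encode the three categorical columns and keep them aside from
# the numerical pipeline, as described above.
train = pd.read_csv("train_features.csv")
cat_cols = ["cp_type", "cp_time", "cp_dose"]
dummies = pd.get_dummies(train[cat_cols].astype(str), columns=cat_cols)
train = pd.concat([train.drop(columns=cat_cols), dummies], axis=1)
```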

Normalization: Our motivation for normalization is that the shape of each feature’s distribution is similar to a Gaussian, so it does no harm to normalize it to a Gaussian distribution in order to benefit the following steps, such as PCA and training. We use quantile normalization to normalize the features such that each feature X ~ N(0, 1). Figure 2 shows the data distribution of randomly selected features before and after normalization.
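
A minimal sketch of this step with scikit-learn's QuantileTransformer (the quantile count here is an illustrative assumption):

```python
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

train = pd.read_csv("train_features.csv")
# Gene (g-) and cell (c-) columns are the numerical features to normalize.
num_cols = [c for c in train.columns if c.startswith(("g-", "c-"))]

# output_distribution="normal" maps each feature onto N(0, 1).
qt = QuantileTransformer(n_quantiles=100, output_distribution="normal",
                         random_state=0)
train[num_cols] = qt.fit_transform(train[num_cols])
```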

Figure 2. The data distribution of randomly selected gene features ((a) and (c)) and cell features ((b) and (d)). Before normalization ((a) and (b)), the shape of the distribution is slightly skewed. After quantile normalization ((c) and (d)), the shape is close to normal.

2.2 Principal Component Analysis (PCA)

Given the notable number of correlations among the gene features and, especially, the cell features, PCA can be applied to reduce the feature space so that the principal features can be extracted. PCA is essentially a rotation of the parameter space so that the new axes (the “principal components”, or PCs) are orthogonal and aligned with the directions of maximum variance.

To select the reduced dimension k, we consider the average squared projection error (Equation 2) and the total variation in the data (Equation 3), from which Equation 4 is derived as the selection criterion for k. The threshold t can be chosen freely; for example, t = 0.01 means the PCA algorithm retains 99% of the main information.

Equation 2 (average squared projection error): (1/m) Σ_{i=1}^{m} ‖x^{(i)} − x̂^{(i)}‖²

Equation 3 (total variation): (1/m) Σ_{i=1}^{m} ‖x^{(i)}‖²

Equation 4: choose the smallest k such that (Equation 2) / (Equation 3) ≤ t
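
Under this criterion, a minimal sketch with scikit-learn (the helper name and interface are ours, for illustration; retaining a 1 − t fraction of variance is equivalent to the ratio test above):

```python
import numpy as np
from sklearn.decomposition import PCA

def select_k(X: np.ndarray, t: float = 0.01) -> int:
    """Smallest k whose cumulative explained variance is at least 1 - t."""
    pca = PCA().fit(X)
    retained = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(retained, 1.0 - t) + 1)
```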

2.3 Feature Selection using Variance Encoding

Using PCA, we obtain features that capture as much as possible of the variation among data instances, and we combine them with the original features. Then we use variance encoding to select the features whose variance is larger than a specific threshold (0.9).

After quantile normalization, the variance of each feature is near 1; specifically, variance ∈ [0.96, 1.20]. PCA is then applied to further construct features that capture the largest amount of variation in the data; after PCA, variance ∈ [0.56, 9.66]. Therefore, we use variance encoding to filter out features that may be less important for the task. Figure 3 shows part of the feature variances after normalization and PCA.

In fact, the feature selection step is, in a sense, another way to decide how many principal components to keep: in our case, setting a variance threshold amounts to lowering the number of PCA components.
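
A minimal sketch of this selection step with scikit-learn, where X stands in for the combined matrix of original and PCA features (random data is used only so the sketch runs end to end):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.random.randn(1000, 50)  # placeholder for the combined feature matrix

selector = VarianceThreshold(threshold=0.9)  # threshold from Section 2.3
X_selected = selector.fit_transform(X)
print(f"kept {X_selected.shape[1]} of {X.shape[1]} features")
```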

Figure 3. Feature variance after normalization and dimensionality reduction. We only plot features whose variance lies in [0.5, 1.5], since the variance of most features falls in this range.

2.4 Split Data into K-Fold

The competition is a multi-label classification problem, and it provides us with some extra data, e.g., the drug used for each data instance. To utilize this information to further boost performance, we stratify the data into k folds based on both drug and multi-label. This kind of stratification also gives us a more reliable correlation between cross-validation and leaderboard scores.

Table 1 shows the drug distribution. There are 9 drugs that appear very frequently in the training data. We assume these drugs also appear in the test data, and therefore we split them evenly among folds. Specifically, all drugs that appear 19 times or fewer in the dataset are each assigned to a single fold, and the other drugs are split evenly across folds, as sketched below.
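
The sketch below illustrates this split with the iterative-stratification library, following the public-notebook approach; the K = 5 fold count and variable names are illustrative assumptions:

```python
import pandas as pd
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

targets = pd.read_csv("train_targets_scored.csv")
drugs = pd.read_csv("train_drug.csv")
df = targets.merge(drugs, on="sig_id")

K = 5
target_cols = [c for c in targets.columns if c != "sig_id"]
vc = df["drug_id"].value_counts()
rare = vc[vc <= 19].index  # rare drugs: keep each drug inside one fold

df["fold"] = -1
mskf = MultilabelStratifiedKFold(n_splits=K, shuffle=True, random_state=0)

# Rare drugs: stratify at drug level so a drug never crosses folds.
rare_mean = df[df["drug_id"].isin(rare)].groupby("drug_id")[target_cols].mean()
for fold, (_, idx) in enumerate(mskf.split(rare_mean, rare_mean)):
    df.loc[df["drug_id"].isin(rare_mean.index[idx]), "fold"] = fold

# Frequent drugs: stratify at row level so they are spread evenly.
freq = df[~df["drug_id"].isin(rare)]
for fold, (_, idx) in enumerate(mskf.split(freq, freq[target_cols])):
    df.loc[freq.index[idx], "fold"] = fold
```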

Table 1. The drug distribution. There are 9 drugs (drug_num) that appear frequently (count > 19).

2.5 Label Smoothing

Label smoothing is a frequently used trick in classification to prevent overfitting and to prevent the sudden appearance of large gradients during the gradient-descent weight updates. By taking a weighted sum of the labels, better results can be achieved than with one-hot labels. Label smoothing transforms the label from y_k to y_k^{LS}; the transformation formula is shown in Equation 5.

Equation 5: y_k^{LS} = y_k (1 − α) + α/K, where K is the number of classes and α is the smoothing strength.

The effect of the “smoothed” label can be seen in the loss. Take cross entropy as an example; its original form is shown in Equation 6, where p_k is the output probability of class k (for a one-hot label, only the correct class contributes).

Equation 6: H(y, p) = −Σ_{k=1}^{K} y_k log(p_k)

Cross entropy under label smoothing is shown in Equation 7. It can be seen that even the p_k of the incorrect classes appear in the loss function, and the correct class and the incorrect classes carry different weights. Training a network with label smoothing encourages the difference between the logit of the correct class and the logits of the incorrect classes to be a constant dependent on α.

Equation 7: H(y^{LS}, p) = −Σ_{k=1}^{K} [ y_k (1 − α) + α/K ] log(p_k)
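
Since the competition loss is a per-label binary cross entropy, a common way to apply label smoothing in practice (seen in many public notebooks) is to smooth the binary targets before the BCE loss. A minimal PyTorch sketch, where the α/2 term for the two-class case is our assumption:

```python
import torch
import torch.nn.functional as F

def smoothed_bce(logits: torch.Tensor, targets: torch.Tensor,
                 alpha: float = 0.001) -> torch.Tensor:
    """BCE with smoothed binary targets: y -> y * (1 - alpha) + alpha / 2.

    alpha = 0.001 matches the value reported in Section 3.1.3.
    """
    smoothed = targets * (1.0 - alpha) + 0.5 * alpha
    return F.binary_cross_entropy_with_logits(logits, smoothed)
```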

2.6 Deep Learning Model

At the beginning, we designed a complex network structure that divides the input features into three parts: the first for gene information extraction, the second for cell information extraction, and the third for the gene and cell PCA features together with other meta information (cp_time, cp_dose). Each of the three parts consists of fully connected (FC) layers and batch normalization layers followed by leaky ReLU activation layers. The output vectors of the three parts are concatenated and fed to further FC layers followed by a sigmoid function, finally outputting 206 probability predictions.

Figure 4. Deep learning network model structure

After experiments, we found this structure overfit the data severely: the training loss was very low while the validation loss was very high. We also found that some methods proposed in public notebooks could reach a relatively better score using a very simple network structure. So we changed our network to a simple structure consisting of weight-normalized FC layers and BN layers. Weight normalization is a reparameterization of the given module that helps the network learn faster and reduces data noise. Figure 4 shows our final deep network structure.
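
A minimal PyTorch sketch of this final structure (the hidden width is an illustrative assumption; the three linear layers and 0.25 dropout follow Section 3.1.4):

```python
import torch.nn as nn
from torch.nn.utils import weight_norm

class MoANet(nn.Module):
    """Simple stack of BN + dropout + weight-normalized linear layers."""

    def __init__(self, n_features: int, n_targets: int = 206, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(n_features),
            nn.Dropout(0.25),
            weight_norm(nn.Linear(n_features, hidden)),
            nn.LeakyReLU(),
            nn.BatchNorm1d(hidden),
            nn.Dropout(0.25),
            weight_norm(nn.Linear(hidden, hidden)),
            nn.LeakyReLU(),
            nn.BatchNorm1d(hidden),
            nn.Dropout(0.25),
            weight_norm(nn.Linear(hidden, n_targets)),
        )

    def forward(self, x):
        # Returns raw logits; apply a sigmoid (or a with-logits loss) outside.
        return self.net(x)
```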

2.7 Ensemble method

Ensemble learning is a very useful technique in many tasks, and the winners of Kaggle competitions almost always use an ensemble model. An ensemble is useful because it can effectively balance the overfitting of each individual model and combine the predictions made by each model.

In this problem, as mentioned previously, we have 5 models, each trained on one CV fold’s training split. Each of them produces a prediction of the target labels, and we calculate the mean value of the predictions as the ensemble prediction. We also followed other notebooks in building an ensemble by aggregating models trained with different seeds.
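
The averaging step itself is a one-liner; here fold_preds stands in for the per-model probability matrices (placeholder data is used only so the sketch runs):

```python
import numpy as np

# Each entry is one model's predictions, shape (n_samples, 206).
fold_preds = [np.random.rand(10, 206) for _ in range(5)]  # placeholder
ensemble = np.mean(fold_preds, axis=0)
```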

Although the ensemble method is very useful, unlike the classification problems we studied in class, the ensemble model may not perform better than the best of its sub-models. In this problem, the loss function for one prediction ŷ (of a specific target label y) is:

Equation 8: L(y, ŷ) = −[ y log(ŷ) + (1 − y) log(1 − ŷ) ]

Then the final test score for this model is shown in Equation 9. L_k is a function of m, n, and k; we do not show m and n in L and y, for simplicity.

Equation 9: S_k = (1/(M·N)) Σ_{m=1}^{M} Σ_{n=1}^{N} L_k

Suppose we have T different sub-models, and their predictions of a target label value y are ŷ_1, ŷ_2, …, ŷ_T respectively. Then the ensemble prediction for this value is:

ŷ_ens = (1/T) Σ_{t=1}^{T} ŷ_t

Then, the ensemble model’s prediction loss for this target label is:

L_ens = −[ y log(ŷ_ens) + (1 − y) log(1 − ŷ_ens) ]

Since both −log(x) and −log(1 − x) are convex functions, by Jensen’s inequality we have:

L_ens = L(y, (1/T) Σ_t ŷ_t) ≤ (1/T) Σ_{t=1}^{T} L(y, ŷ_t)

a) Equality holds only when all sub-models’ predictions are identical for every prediction. This means the ensemble test score will be smaller than the average of the sub-models’ scores, and we think this is a very important property of this problem.

b) If we merge two models that have the same test score, the ensemble test score will be smaller than their common score.

c) If this problem did not use the −log(x) function to calculate the final score, and instead used a concave function such as √|y − ŷ| to compute it, then the ensemble submission score would be bigger than at least one of the sub-models’ scores (the lower the score, the better the result), because the direction of the inequality would flip. This situation would reduce the value of the ensemble method.

3 Experiment

This section discusses the experiments and our final result on the competition leaderboard. In Section 3.1, we conduct several experiments to verify the motivations behind our proposed methodology. The final result is presented in Section 3.2.

3.1 Experimental Detail

Table 2. Experiment settings and results.

Table 2 shows the settings and results of the experiments, which are used to fine-tune hyper-parameters and to verify our motivations for the various methods proposed in Section 2. Due to the limit on the number of submissions to Kaggle, only some of the experiments have a leaderboard (LB) score. We use the cross-validation (CV) score to compare the settings.

3.1.1 Principal Component Analysis (PCA)

Different reduced dimensions are selected. The number of cell dimensions is reduced from 100 to 15, 30, and 60 respectively, and the number of gene dimensions from 772 to 100, 200, 400, and 500 respectively. We found that the cross-validation score is at its minimum when the reduced dimensions of cells and genes are 60 and 400 respectively.

3.1.2 Feature Selection (VE-threshold)

As discussed in Section 2.3, feature selection helps us filter out the features that are less important, or even troublesome, for the MoA prediction task. From the CV scores of experiments 7–9, we observe an improvement when we choose a relatively large variance-encoding threshold (0.9).

In our final submission, we apply feature normalization before feature selection. Therefore, this step mainly removes some of the features obtained from PCA; it can be viewed as a way to further help PCA decide the number of principal components.

3.1.3 Label Smoothing

Based on the observations of experiments 9 and 10, we found that label smoothing achieves a better cross-validation score. The hyperparameter α is 0.001.

3.1.4 Layer and dropout

Experiments 10, 11, and 12 aim at choosing the number of linear layers and the dropout-rate hyperparameters. More layers give a more complex model, and a bigger dropout rate can alleviate overfitting. In our experiments, we found that 3 linear layers and a 0.25 dropout rate give the best validation score.

3.1.5 Feature Normalization (Gauss_norm)

We normalize the features to a Gaussian distribution because of the shape of the original features’ distributions, as discussed in Section 2.1. From experiments 16 and 17, we can see a further improvement in both the CV and LB scores.

3.1.6 Split Data (Drug_FOLD)

We split the data into different folds based on both drug and multi-label. This manner of splitting further improves our leaderboard result and moves us into the top 15%. However, as we can see from experiments 17 and 18, the CV score gets worse while the leaderboard score gets better. This can be explained by the CV-LB correlation, which plagues many competitors: this method gives a similar data distribution between the training and test datasets, and hence a better CV-LB correlation, which can be observed from the difference between the CV and LB scores.

3.1.7 CV fold and seed

In our experiments, we find that different seeds lead to quite different CV scores, and different numbers of folds also give different CV scores. The numbers of folds and seeds determine how many sub-models our ensemble contains. We think more sub-models generally lead to better results, but we also need to balance the running time. In the end, we set both the number of folds and the number of seeds to 7.

3.2 Leaderboard Ranking Result

As Figure 5 shows, by November 20th, 2020, there were 4,035 teams participating in this competition. Our final submission (Exp_id = 18 in Table 2) ranked 505th (top 12.5%).

Figure 5. The leaderboard rank of our final submission.

4 Conclusion

Deep learning excels at handling unstructured data such as text and images. In this competition, however, we went through a comprehensive pipeline for handling structured data with machine learning techniques, including feature engineering, normalization, data transformation, dimensionality reduction, feature selection, label smoothing, cross validation, and ensemble learning. All of these methods contributed to improving the multi-label classification. The combination of classical machine learning, the PCA method, and a deep learning model achieved the best performance. In addition, data processing techniques such as feature transformation and normalization, feature selection using variance encoding, and splitting the data into folds further helped select appropriate features. Furthermore, the ensemble model proved effective in balancing the overfitting of each individual model and combining their predictions.


Bozhong Liu

Senior AI Engineer | Master of Science in Artificial Intelligence at Nanyang Technological University | LinkedIn: linkedin.com/in/bozliu