Tuesday 23 April 2013

Imputing Missing-Values in DNA Microarray Data For Cancer Survival Analysis Using K-Nearest Neighbour Method


Abstract
Microarray technology is widely used in molecular biology, and one of its applications is cancer survival analysis. DNA microarray data from cancer patients are obtained through gene expression profiling. Gene expression analysis then proceeds through imputation of missing values, gene selection, gene clustering and gene classification. Microarray data are often treated as precise, but in reality they contain many missing values. Each missing value is valuable: if recovered, it could give scientists a better understanding of the DNA microarray data, which in turn can be used to develop new treatments for diseases such as cancer. A number of techniques have been developed or adapted for this purpose. This research focuses on using the K-Nearest Neighbour (KNN) algorithm to solve the missing-value problem. The datasets used in the experiments are diffuse large B-cell lymphoma (DLBCL) and carcinoma. Experiments are carried out to assess the KNN algorithm's performance in cancer survival analysis, specifically its effect on the p-values and on the final Kaplan-Meier survival plot. KNN is then compared with other existing techniques, namely the INI algorithm, Bayesian Principal Component Analysis (BPCA) and the Partial Least Squares (PLS) method, to determine the best possible solution. The experiments showed that, for both datasets, the INI method performed best, followed by the PLS method. For the carcinoma dataset, KNN outperformed BPCA, while for the DLBCL dataset, BPCA outperformed KNN. Overall, the experiments indicated that INI is the best of the four methods for DNA microarray data imputation in terms of cancer survival analysis.
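To give a rough idea of what KNN imputation does, here is a minimal sketch in Python with NumPy. It is not the implementation used in the experiments. It assumes a matrix with genes as rows and samples as columns, missing entries stored as NaN, Euclidean distance over the columns two genes have in common, and an unweighted mean of the k nearest genes. The function name `knn_impute` and the default `k=3` are illustrative choices only.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill NaNs in X (genes x samples) using the k most similar genes.

    Illustrative sketch: similarity is the root-mean-square difference
    over the samples that both genes have observed.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)

    # Visit every gene (row) that has at least one missing value
    for i in np.where(missing.any(axis=1))[0]:
        # Columns observed in both gene i and each other gene
        shared = ~missing[i] & ~missing
        diff = np.where(shared, X - X[i], 0.0)
        counts = shared.sum(axis=1)
        with np.errstate(invalid="ignore", divide="ignore"):
            dist = np.sqrt((diff ** 2).sum(axis=1) / counts)
        dist[i] = np.inf              # a gene is not its own neighbour
        dist[counts == 0] = np.inf    # no shared samples, so not comparable

        for j in np.where(missing[i])[0]:
            # Candidate neighbours must have an observed value in sample j
            cand = np.where(~missing[:, j] & np.isfinite(dist))[0]
            if cand.size == 0:
                continue              # nothing to borrow from; leave as NaN
            nearest = cand[np.argsort(dist[cand])[:k]]
            filled[i, j] = X[nearest, j].mean()
    return filled

# Toy example: the second gene is missing its third sample
expr = np.array([
    [1.0,  2.0,  3.0],
    [1.0,  2.0,  np.nan],
    [10.0, 20.0, 30.0],
    [1.1,  2.1,  3.2],
])
print(knn_impute(expr, k=2))
```

In the toy matrix, the two genes closest to the incomplete one are the first and the last, so the missing entry is filled with the average of their third-sample values. Published KNN imputation variants often weight neighbours by distance instead of taking a plain mean; that refinement is left out here for brevity.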
