Filter-Based Gene Selection Method for Tissues Classification on Large Scale Gene Expression Data

Kabir Ahmad, Farzana and Yusof, Yuhanis and Yusoff, Nooraini (2018) Filter-Based Gene Selection Method for Tissues Classification on Large Scale Gene Expression Data. International Journal of Engineering & Technology, 7 (2.15). pp. 68-71. ISSN 2227-524X

Full text not available from this repository.

Official URL: http://doi.org/10.14419/ijet.v7i2.15.11216

Abstract

DNA microarray technology is an innovative tool that offers new insight into cellular systems by measuring a large number of gene expressions at once. Despite the novelty of DNA microarrays, most of their results rely on computational intelligence, which is used to interpret the large volume of data. At present, interpreting large-scale gene expression data remains a challenging problem because of its inherent “high dimensional low sample size” nature. Microarray data typically involve thousands of genes, n, measured on a very small number of samples, p. In addition, analysis of such data is often hampered by overfitting and by the complexity of the data. Because of the nature of microarray data, it is also common that a large number of genes are not informative for classification purposes. For this reason, many studies have used feature selection methods to select significant genes that offer the maximum discriminative power between cancerous and normal tissues. In this study, we investigate and compare the effectiveness of four popular filter-based gene selection methods, namely Signal-to-Noise ratio (SNR), Fisher Criterion (FC), Information Gain (IG) and t-Test, in selecting informative genes that can distinguish cancerous from normal tissues. Two common classifiers, Support Vector Machine (SVM) and Decision Tree (C4.5), are trained on the selected genes. The gene selection methods are tested on three large-scale gene expression datasets: a breast cancer dataset, a colon dataset, and a lung dataset. The study found that IG and SNR are better suited to SVM, while IG is suited to C4.5. On the colon dataset, SVM achieved a specificity of 86% with SNR and 80% with IG. In contrast, C4.5 obtained a specificity of 78% with IG on the same dataset. These results indicate that SVM performed slightly better than C4.5 on IG pre-processed data from the same dataset.
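
To illustrate how filter-based gene selection of the kind named in the abstract can work, the following is a minimal Python sketch that scores each gene independently and keeps the top-ranked ones. The formulas are the commonly cited two-class definitions of SNR, the Fisher criterion and the (Welch) t-statistic; the abstract does not state which variants the authors used, so these are assumptions. The function names `filter_scores` and `rank_genes`, the gene names and the expression values are all hypothetical, and Information Gain is omitted because it needs a discretisation step that the abstract does not describe.

```python
from math import sqrt
from statistics import mean, stdev


def filter_scores(class_a, class_b):
    """Score one gene from its expression values in two tissue classes.

    Assumed textbook definitions (not confirmed by the source):
      SNR    = (mu_a - mu_b) / (sd_a + sd_b)
      Fisher = (mu_a - mu_b)^2 / (sd_a^2 + sd_b^2)
      t      = (mu_a - mu_b) / sqrt(sd_a^2 / n_a + sd_b^2 / n_b)
    A gene with zero variance in both classes would divide by zero here.
    """
    mu_a, mu_b = mean(class_a), mean(class_b)
    sd_a, sd_b = stdev(class_a), stdev(class_b)  # sample standard deviations
    n_a, n_b = len(class_a), len(class_b)
    diff = mu_a - mu_b
    return {
        "snr": diff / (sd_a + sd_b),
        "fisher": diff ** 2 / (sd_a ** 2 + sd_b ** 2),
        "t": diff / sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b),
    }


def rank_genes(expression, labels, method="snr", top_k=2):
    """Return the top_k gene names ranked by absolute filter score.

    expression: dict mapping gene name -> list of values, one per sample.
    labels: list of 0/1 class labels (1 = cancerous, 0 = normal), same order.
    """
    scored = []
    for gene, values in expression.items():
        cancer = [v for v, y in zip(values, labels) if y == 1]
        normal = [v for v, y in zip(values, labels) if y == 0]
        scored.append((abs(filter_scores(cancer, normal)[method]), gene))
    scored.sort(reverse=True)  # highest score first
    return [gene for _, gene in scored[:top_k]]


# Hypothetical toy data: three genes measured on six samples.
labels = [1, 1, 1, 0, 0, 0]
expression = {
    "gene_a": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],
    "gene_b": [2.0, 3.0, 4.0, 2.5, 3.5, 4.5],
    "gene_c": [9.0, 9.5, 10.0, 4.0, 4.5, 5.0],
}
top = rank_genes(expression, labels, method="snr", top_k=2)
print(top)
```

In a workflow like the one the abstract describes, the selected genes would then be passed to a classifier such as an SVM or a decision tree. Because each gene is scored on its own, filter methods stay cheap even when genes number in the thousands, at the cost of ignoring interactions between genes.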

Item Type:	Article
Uncontrolled Keywords:	Bioinformatics; Feature Selection; High Dimensional Data; Support Vector Machine.
Subjects:	Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions:	School of Computing
Depositing User:	Mrs. Norazmilah Yaakub
Date Deposited:	12 Dec 2018 06:04
Last Modified:	12 Dec 2018 06:04
URI:	https://repo.uum.edu.my/id/eprint/25273