10.48336/ER6M-R932
Marvikhorasani, Hanieh
Memorial University of Newfoundland
Scalable feature selection methods by augmenting sparse least squares
Memorial University of Newfoundland
2019
en
Feature selection has been used widely for selecting a subset of genes (features) from
microarray datasets, which help discriminate healthy samples from those with a particular
disease. However, most feature selection methods suffer from high computational
complexity when applied to these datasets due to the large number of genes present.
Usually, only a small subset of these genes contributes to the disease, and the
rest of the genes are irrelevant to the condition. This study proposes a sparse method,
Sparse Least Squares (SLS), based on singular value decomposition and least squares
to filter out irrelevant features. In this thesis, we shall also consider reducing the number
of features by clustering genes and selecting representative genes from each cluster
based on two different metrics. These dataset size-reduction methods are incorporated
into three state-of-the-art feature selection methods, namely, mRMR, SVM-RFE, and
HSIC-Lasso. These methods are applied to three Inflammatory Bowel Disease (IBD)
datasets and combined with support vector machines and random forest classifiers.
Experimental results show that the proposed SLS method significantly reduces the
running time of feature selection algorithms and improves the prediction power of
the machine learning models. SLS is integrated into a novel feature selection method
(DRPT), which, when combined with Support Vector Machine (SVM), is able to
generate models to discriminate between healthy subjects and subjects with Ulcerative
Colitis (UC) based on the expression values of genes in colon samples. The best models
were validated on two validation datasets and achieved higher predictive performance
than a model generated by a recently published biomarker discovery tool
(BioDiscML).
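
The abstract does not spell out the SLS procedure, so the following is only a minimal sketch of the general idea it names: solve a least-squares problem through a singular value decomposition and discard features whose weights are small. The helper name `sls_like_filter`, the `keep` parameter, and the synthetic data are assumptions made for illustration and are not taken from the thesis.

```python
import numpy as np


def sls_like_filter(X, y, keep):
    """Hypothetical helper: keep the `keep` features with the largest
    absolute weight in the minimum-norm least-squares solution of X w = y."""
    # np.linalg.pinv builds the pseudoinverse from an SVD of X, so this is
    # an SVD-based least-squares solution (minimum norm when genes >> samples).
    w = np.linalg.pinv(X) @ y
    # Indices of the `keep` largest |w_j|, returned in ascending index order.
    top = np.argsort(-np.abs(w))[:keep]
    return np.sort(top), w


# Synthetic stand-in for a microarray matrix: 40 samples, 500 genes.
rng = np.random.default_rng(0)
n_samples, n_genes = 40, 500
X = rng.normal(size=(n_samples, n_genes))
# Assumed ground truth for illustration: only genes 7 and 42 drive the target.
y = 3.0 * X[:, 7] - 2.0 * X[:, 42]

selected, w = sls_like_filter(X, y, keep=50)
X_reduced = X[:, selected]  # smaller matrix to pass to a downstream selector
print(X_reduced.shape)
```

In a pipeline like the one described, the reduced matrix would then be passed to a slower feature selection method such as mRMR, SVM-RFE, or HSIC-Lasso, which is where the running-time savings would be expected to come from.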