Fig. 5. Comparison of our proposed model (nsNMF + SVM) with the state-of-the-art methods.
Table 2. Biological processes most associated with each cancer type.
GO term                                     Cancer
Microtubule-based process                   BRCA
Axon guidance                               GBM
Morphogenesis of a polarized epithelium     PRAD
Chemical synaptic transmission              LUSC
gene. A high score indicates that the mutated gene plays a significant role in the disease. The genes for each cancer type were identified and sorted by association score (Table S1). The top 300 genes associated with each cancer type were derived and analyzed for pathway enrichment. Interestingly, distinct biological processes were found to be significantly associated (p < 0.05) with each type of cancer. Microtubule processes, axon guidance, cell morphogenesis, and synaptic transmission were found to be associated with BRCA, GBM, PRAD, and LUSC, respectively (Table 2). Additional pathways associated with each cancer type can be found in Table S2. Many of these pathways are known to be associated with their corresponding cancer type.
Journal of Biomedical Informatics 96 (2019) 103247
For instance, a majority of breast cancer drugs involve targeting microtubules. In glioblastoma, axon guidance is known to play a role in glioma progression. To confirm the robustness of our findings, we performed and evaluated pathway enrichment and discovery using multiple methods [33–37]. We also recovered axon guidance in the Reactome pathway database and synaptic transmission in the KEGG pathway database. Together, these results corroborated the relevant pathways discovered by our method.
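The text does not say which statistic underlies the enrichment p-values. A common choice for this kind of over-representation analysis is the one-sided hypergeometric test, sketched below purely as an illustration. The function name and the toy numbers (a background of 20,000 genes, a 200-gene pathway, 12 overlapping genes) are assumptions and do not come from the study; only the list size of 300 mirrors the text.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """One-sided over-representation p-value, P(X >= n_overlap).

    X is the number of pathway genes found when n_selected genes are drawn
    without replacement from n_background genes, of which n_pathway
    belong to the pathway.
    """
    total = comb(n_background, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 12 of the top 300 genes fall in a 200-gene pathway,
# against an assumed background of 20,000 genes (about 3 expected by chance).
p_value = hypergeom_enrichment_p(20000, 200, 300, 12)
is_enriched = p_value < 0.05
```

In practice the p-values would also be corrected for the number of pathways tested. Whether and how that was done here is not stated in this section.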
We have proposed a novel method that uses somatic mutations to classify cancer type and to derive relevant genes and pathways. In this study, we applied nsNMF and SVM to train a classifier that assigns a tumor to one of four types: Glioblastoma Multiforme (GBM), Breast Invasive Carcinoma (BRCA), Lung Squamous Cell Carcinoma (LUSC), or Prostate Adenocarcinoma (PRAD). Products of the basis matrix and coefficient matrix derived from nsNMF were both retained to construct the feature matrix. Subsequently, the constructed features were used as input variables to train the classifier. We compared functional scores from CADD, SIFT, and PP2 with simple mutation counts and found that mutation counts yielded the best performance (accuracy = 80.0%, SEM = 0.1%). Finally, regularized logistic regression was applied to study each gene's association with cancer type. Using the associated features, we derived relevant genes and pathways for each cancer.
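To make the factorization step concrete, here is a minimal sketch of non-smooth NMF, which approximates a non-negative matrix V as W S H, where S = (1 - theta) I + (theta / k) 11^T is a smoothing matrix. The Frobenius-norm multiplicative updates, the default theta = 0.5, the iteration count, and the toy input are all assumptions for illustration. The solver and settings used in the study are not described in this section.

```python
import numpy as np


def nsnmf(V, k, theta=0.5, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (m x n) as V ~ W @ S @ H.

    S = (1 - theta) * I + (theta / k) * ones((k, k)) is the nsNMF smoothing
    matrix; theta = 0 reduces to plain NMF. The updates are the standard
    multiplicative rules for the Frobenius norm (an assumption; the
    column normalization of W used in some formulations is omitted).
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    S = (1.0 - theta) * np.eye(k) + (theta / k) * np.ones((k, k))
    for _ in range(n_iter):
        WS = W @ S  # effective basis seen when updating H
        H *= (WS.T @ V) / (WS.T @ WS @ H + eps)
        SH = S @ H  # effective coefficients seen when updating W
        W *= (V @ SH.T) / (W @ SH @ SH.T + eps)
    return W, S, H


# Toy non-negative matrix of rank at most 4 (stand-in for a mutation matrix).
rng = np.random.default_rng(1)
V = rng.random((30, 4)) @ rng.random((4, 20))
W, S, H = nsnmf(V, k=4)
rel_err = np.linalg.norm(V - W @ S @ H) / np.linalg.norm(V)
```

The resulting factors can then be turned into features for any classifier, for example an SVM as in the study.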
When training the classifier, we used an alternative method: multiplying the matrix A_score by the matrix W to obtain the feature matrix F. Information from the basis component W was retained, providing information about the weights in each gene group. This information was then used as features to train the classifier. Another benefit of this alternative method is its ease of use at the testing stage. With the trained W matrix, we only need to multiply the testing A_score matrix by W to get the test feature matrix. In addition to improving cancer type classification, each gene's association with the cancers was of interest and was also studied. The p-value cutoff for gene pre-selection served to limit the number of features included. One of the challenges for genomics studies is the large number of genes accompanied by small sample sizes, resulting in a wide and flat matrix, which in turn impairs the performance of matrix decomposition. In this study, we used a p-value cutoff to pre-select genes while trying to introduce only minimal influence on model performance. We therefore tested multiple p-value cutoffs and selected 0.5, at which we observed no significant difference in model performance compared with the no-selection scenario. The genes filtered out were those with only one or two mutations that appeared in only one or two subjects in the cohort. Removing these genes has minimal influence and has the potential to remove noise from model training. Tuning NMF + SVM (our model) can yield even better results, but our purpose here was to assess the improvement contributed by NMF. In addition, even with the default parameters for NMF + SVM, our model still outperformed the state-of-the-art methods that were parameter-tuned.
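The projection step described above is a single matrix product. The sketch below assumes the score matrix is laid out as samples x genes and W as genes x k, so that F is samples x k. The text does not state this orientation explicitly, and the random matrices are hypothetical stand-ins for real data and for a W learned by NMF on the training set.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_genes, k = 8, 3, 12, 4

# Hypothetical stand-ins for mutation-count score matrices (samples x genes).
A_train = rng.poisson(0.5, size=(n_train, n_genes)).astype(float)
A_test = rng.poisson(0.5, size=(n_test, n_genes)).astype(float)

# Stand-in for the basis matrix W (genes x k); in the real pipeline it would
# be learned from the training data only.
W = rng.random((n_genes, k))

# Each feature is a W-weighted sum of a sample's per-gene scores.
F_train = A_train @ W  # (n_train x k) features used to fit the classifier
F_test = A_test @ W    # (n_test x k) features; no refactorization is needed
```

Because W is fixed after training, test samples never influence the learned gene groups, and new samples can be scored one at a time.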
The development of high-throughput sequencing technology has enabled the cataloging of large-scale mutation information. Somatic mutations are relatively stable and lead to the initiation and progression of many sporadic cancers. Hence, in this study, we used mutations in protein-coding genes as input data. We acknowledge that non-protein-coding genes, including mutations in intronic areas [38,39], long non-coding RNAs, and miRNAs, are also important for cancer development. In future work, we will incorporate these multiple dimensions of genetic data to increase model performance. Traditionally, mutations derived from sequence data were examined as single variables using regression models [30,42]. Unfortunately, the large number of variables limits the power of such studies. To reduce the number of variables, studies have proposed aggregating mutations at the gene level as input to a regression model [24,43,44]. In other