    Table 6
    Relevant features according to state-of-the-art knowledge. Features highlighted in bold are those that have also been found relevant in our study.
    • Fiber
    • Level of education
    • Red meat
    • Ethanol in the past decade
    • Legume
    • Fruit
    • Dairy
    • Energy
    Table 7
    AUC for different classifiers with different feature sets.
    Feature Set Cardinality LR k-NN NN SVM BT
    Experts’ set
    thy that performance never decreases in spite of removing features that were considered relevant in the literature. If we extend the feature set to a cardinality of 64 by adding the next most relevant features (Top-40 union), performance shows the same behavior, that is, it either increases or holds its value. For example, the AUC of the SVM classifier increases from 0.667 with 28 features to 0.686 with the Top-40 union, compared with an AUC of 0.652 achieved with the Experts’ set. Likewise, the AUC for the BT classifier is 0.676 compared with 0.660 when using fewer features or the experts’ feature set. The Top-40 union set improves performance with respect to both the full feature set and the experts’ feature subset and, on average, yields the best performance results.
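The SVM figures quoted above can be compared with a few lines of Python. This is only an arithmetic sketch: the helper names are illustrative, and since the text does not state whether its percentages are absolute or relative, both conventions are computed.

```python
# AUC values for the SVM classifier as quoted in the paragraph above.
svm_auc = {"experts": 0.652, "reduced_28": 0.667, "top40_union": 0.686}


def absolute_gain(auc_new, auc_ref):
    """Difference in AUC expressed in percentage points."""
    return (auc_new - auc_ref) * 100.0


def relative_gain(auc_new, auc_ref):
    """Change in AUC expressed as a percentage of the reference AUC."""
    return (auc_new - auc_ref) / auc_ref * 100.0


for name in ("reduced_28", "top40_union"):
    abs_g = absolute_gain(svm_auc[name], svm_auc["experts"])
    rel_g = relative_gain(svm_auc[name], svm_auc["experts"])
    print(f"{name}: {abs_g:+.1f} points, {rel_g:+.1f}% relative to the experts' set")
```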
    This analysis suggests that some of the features proposed by the experts are either redundant or irrelevant, since performance is not affected by removing them. This is also confirmed by the fact that these features do not hold top positions in the ranking lists obtained in our experimental setting. This is the case for dairy consumption (which could be redundant since VitD is also in the list) and for SNPs such as rs6983267, rs10411210 or rs7758229.
    Note that using the experts’ feature set instead of the full set increases the AUC by 0.4% for LR and 1.9% for k-NN. This improvement, though, is smaller than that achieved with feature selection strategies, which is 1.9% for LR and 7.8% for k-NN. With the expert set, however, performance drops by 2.2% and 1.5% for the SVM and BT models, respectively, whereas it always increases with feature selection algorithms. This could indicate that some features excluded from the list should be given more relevance in this context.
    When comparing our results with the features provided by the experts according to the state-of-the-art knowledge, it turns out that both lists have many features in common. The features highlighted in bold in Table 6 are those that also appear in either the selected top-41 SVM-wrapper list or the top-40 Pearson list. Note that almost two out of three variables are also in our reduced set of the most relevant features.
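The comparison between the experts' list and the selected ranking lists amounts to a set-overlap computation, which can be sketched as follows. The function name and the short placeholder lists are hypothetical and do not reproduce the actual contents of Table 6 or of the ranking lists.

```python
def coverage(expert_features, *ranked_lists):
    """Return the expert features found in any ranked list, and their fraction."""
    selected = set().union(*ranked_lists)
    shared = [f for f in expert_features if f in selected]
    return shared, len(shared) / len(expert_features)


# Placeholder feature names, for illustration only.
experts = ["a", "b", "c"]
wrapper_top = ["a", "x", "y"]   # stands in for the top-41 SVM-wrapper list
pearson_top = ["b", "x", "z"]   # stands in for the top-40 Pearson list

shared, fraction = coverage(experts, wrapper_top, pearson_top)
print(shared, round(fraction, 2))
```

Features in `shared` would be the ones highlighted in bold, and `fraction` corresponds to the "almost two out of three" proportion mentioned above.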
    Other features suggested by the experts (some SNPs), though, do not seem to affect the predictive power of the model, while some features such as Zinc, Carotenoids and Niacin that were found relevant in our experimental setting are not considered relevant in the literature, which deserves further study.
    We acknowledge that this study is focused on the Spanish population and that our findings may not directly translate to individuals of other ethnicities. The study included six feature ranking algorithms selected from the three main categories (filter, wrapper and embedded), but there are many other widely used feature selection algorithms that could be tested. However, the results may still be widely applicable, as they reconfirm previous findings and point out new factors to be considered in further studies.
    Future work includes the study of ensemble strategies to increase the stability of feature selection techniques, in particular those that have a high margin of improvement. Since ranking algorithms may have a high computational cost, our aim is also to explore new hybrid ranking approaches based on two steps: (1) a first, simple step based on filters that can quickly remove the most irrelevant features and (2) a second phase with wrapper or embedded ranking algorithms focused on the subset of features selected in the first step. We plan to test these techniques on a global dataset with thousands of SNPs and more instances. With access to a bigger dataset, we also aim to assess deep learning approaches, which have shown outstanding performance in many fields.
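The two-step hybrid idea can be illustrated with a minimal sketch in plain Python. It assumes a Pearson-correlation filter for the first step and a greedy forward-selection wrapper around a nearest-centroid classifier for the second. These choices, the function names and the synthetic data are placeholders for illustration, not the algorithms or data used in this study.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def filter_stage(X, y, keep):
    """Step 1: keep the `keep` features with the largest |r| against the label."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: -scores[j])[:keep]


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on `features`."""
    hits = 0
    for i in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [X[k] for k in range(len(X)) if k != i and y[k] == label]
            centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in features]
        point = [X[i][j] for j in features]
        pred = min(centroids, key=lambda c: math.dist(point, centroids[c]))
        hits += pred == y[i]
    return hits / len(X)


def wrapper_stage(X, y, candidates):
    """Step 2: greedy forward selection restricted to the filtered candidates.

    Returns the candidates in the order in which they were added.
    """
    ranking, remaining = [], list(candidates)
    while remaining:
        best = max(remaining, key=lambda j: loo_accuracy(X, y, ranking + [j]))
        ranking.append(best)
        remaining.remove(best)
    return ranking


# Synthetic data: features 0 and 1 carry signal, features 2..9 are pure noise.
random.seed(0)
y = [i % 2 for i in range(40)]
X = [[label + random.gauss(0, 0.5), -label + random.gauss(0, 0.8)]
     + [random.gauss(0, 1) for _ in range(8)] for label in y]

candidates = filter_stage(X, y, keep=4)    # cheap pass over all features
ranking = wrapper_stage(X, y, candidates)  # costly pass over the survivors only
print("filtered:", candidates)
print("wrapper ranking:", ranking)
```

The point of the design is that the expensive wrapper evaluates only the features that survive the filter, so its cost depends on `keep` rather than on the full number of features.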