Are there substantially more cancer-driver missense mutations yet to be discovered? Prior attempts to address this question have focused on the identification of cancer-driver genes (Davoli et al., 2013; Lawrence et al., 2014; Martincorena et al., 2017), which can contain a confounding mixture of both passenger and driver mutations. Here, we examined the trajectory of driver discovery from CHASMplus at the mutation level as the number of tumor samples analyzed is gradually increased by random subsampling. Subsampling analysis showed all cancer types had a linear increase in the number of unique driver missense mutations identified (R² > 0.5) with no evidence of saturation at current sample sizes (Figure S7A). However, discovery of driver missense mutations, which occur in aggregate at a given prevalence (average number per cancer sample), varied substantially among cancer types (Figure 6B). For SARC, adrenocortical carcinoma (ACC), and prostate adenocarcinoma (PRAD), as sample size increased, there was a minimal increase in the prevalence of driver missense mutations. As a case in point, we extended our analysis to data from a recently released PRAD study (Armenia et al., 2018), which augmented the 477 TCGA PRAD samples with 536 additional samples. This resulted in only marginal increases in the overall prevalence of identified driver missense mutations, consistent with our predicted trajectory based only on TCGA samples (STAR Methods; Table S7; Figure S7B). In contrast, THYM, UVM, and PAAD contained common driver missense mutations that could be detected based on only a few samples from the cohort, e.g., GTF2I L424H in THYM. Prevalence of driver missense mutations abruptly saturated for THYM and UVM as sample size increased, and nearly all of these mutations were common. In PAAD, the overall driver prevalence exhibited a diminishing rate of discovery, but the prevalence of intermediate or rare driver missense mutations increased with greater sample size.
In contrast, the prevalence of rare driver missense mutations increased substantially with sample size in breast cancer (BRCA), HNSC, and COAD.
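The logic of the subsampling analysis can be illustrated with a minimal sketch. It is not the CHASMplus pipeline: the driver caller below (`toy_caller`, which calls any mutation recurring in at least `min_recurrence` samples) is a hypothetical stand-in, and the function names, the number of repeats, and the input layout (one list of mutation labels per tumor sample) are assumptions made for illustration. The sketch draws random subsamples of increasing size, records the number of unique called drivers and their prevalence (average number per sample), and reports the R² of a straight-line fit.

```python
import random
from collections import Counter


def toy_caller(subsample, min_recurrence=2):
    """Hypothetical stand-in for a driver caller (NOT CHASMplus):
    call a mutation a driver if it recurs in >= min_recurrence samples."""
    counts = Counter(m for muts in subsample for m in set(muts))
    return {m for m, c in counts.items() if c >= min_recurrence}


def discovery_trajectory(cohort, sizes, n_repeats=20, seed=0, caller=toy_caller):
    """Return [(size, mean unique drivers, mean prevalence)] over random subsamples.

    cohort: list of samples, each a list of mutation labels (assumed layout).
    """
    rng = random.Random(seed)
    out = []
    for n in sizes:
        uniq, prev = 0.0, 0.0
        for _ in range(n_repeats):
            sub = rng.sample(cohort, n)  # subsample without replacement
            drivers = caller(sub)
            uniq += len(drivers)
            # Prevalence: average number of called driver mutations per sample.
            prev += sum(len(drivers & set(muts)) for muts in sub) / n
        out.append((n, uniq / n_repeats, prev / n_repeats))
    return out


def r_squared(xs, ys):
    """R^2 of an ordinary least-squares straight-line fit of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy) if sxx > 0 and syy > 0 else 0.0
```

Under these assumptions, a unique-driver count that keeps rising linearly with subsample size while prevalence flattens would correspond to the pattern described for cancer types in which newly discovered drivers are mostly rare.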
These results suggest cancer types can be clustered by patterns of driver missense mutation diversity and prevalence (Figure 6A), in addition to well-established approaches to define cancer subtypes, such as by the cell of origin (Hoadley et al., 2018). Moreover, a statistical power analysis suggests that an alternative approach based only on mutation hotspot detection would be underpowered to detect such results (Figure 7; STAR Methods).
(E and F) Distribution of (E) PTEN lipid phosphatase activity or (F) protein abundance in predicted driver missense mutations from TCGA (common: >5% of tumor samples; intermediate: 1%–5%; and rare: <1%), all other missense mutations and truncating mutations. Box plots show quartiles, with whiskers defined according to Tukey’s criterion.
(G) Comparison of CHASMplus to the 2nd and 3rd ranked methods in Figure 2E. Left: specificity of methods at identifying PTEN missense mutations that do not lower lipid phosphatase activity. Right: sensitivity (recall), precision, and F1 score for identifying missense mutations that lower lipid phosphatase activity. CHASMplus had the highest specificity, precision, and F1 score.
Figure 6. Characteristics and Trajectory of Missense Mutation Driver Discovery
See also Figure S7 and Table S7.
(A) Plot displaying normalized driver diversity and driver prevalence (fraction of tumor samples mutated) for driver missense mutations in 32 cancer types. K-means clustering identified 5 clusters with centroids shown as numerically designated circles.
(B) Prevalence of driver missense mutations identified by CHASMplus as a function of sample size. Lines represent LOWESS fit to different rarities of driver missense mutations. TCGA acronyms for cancer types are listed in the STAR Methods.
Figure 7. Hotspot Detection Alone Has Limited Statistical Power to Identify Driver Mutations
(A) Statistical power to detect a significantly elevated number of non-silent mutations in an individual codon, as a function of sample size and mutation rate. Circles represent each cancer type from the TCGA and are placed by sample size and median mutation rate. Curves are colored by the frequency of driver mutations (fraction of non-silent mutated cancer samples above background). If a circle is below a curve, then hotspot detection is not yet sufficiently powerful to detect driver mutations of that frequency.
(B) Bar graph comparing power (sensitivity) to detect labeled oncogenic driver missense mutations from OncoKB between CHASMplus and the cancer hotspots method (Chang et al., 2016). Stratification by TP53 suggests that the increased power provided by CHASMplus is not solely a result of high performance on oncogenic TP53 mutations.
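The power argument for hotspot detection in a single codon can be sketched with a simple binomial model. This is an illustrative approximation, not necessarily the calculation described in the STAR Methods: it assumes each sample independently carries a non-silent mutation in the codon with a background probability `background_p`, that a driver raises this probability by `driver_freq`, and that a codon is called a hotspot when its mutation count exceeds a threshold chosen to keep the background tail probability at or below `alpha`. All function names, parameter names, and the example significance threshold are hypothetical.

```python
import math


def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def critical_count(n_samples, background_p, alpha):
    """Smallest mutation count k with P(X >= k | background only) <= alpha."""
    for k in range(n_samples + 2):
        if binom_sf(k, n_samples, background_p) <= alpha:
            return k
    return n_samples + 1  # unreachable: k = n_samples + 1 has zero tail mass


def hotspot_power(n_samples, background_p, driver_freq, alpha=1e-6):
    """Probability that a codon mutated at background_p + driver_freq
    reaches the critical count (alpha is an assumed, illustrative threshold)."""
    k_crit = critical_count(n_samples, background_p, alpha)
    return binom_sf(k_crit, n_samples, background_p + driver_freq)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    for n in (100, 500, 1000):
        print(n, round(hotspot_power(n, 1e-4, 0.02), 3))
```

In this model, power rises with sample size and with driver frequency and falls as the background mutation rate increases, which is the qualitative relationship plotted in Figure 7A.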