# Framework for analysis of mutational signatures on cell line and PDX datasets

Mutational signatures were annotated on cell line and PDX datasets using SigProfiler (v.2.1) and SigProfilerSingleSample (v.1.2), modified as described below.

SigProfiler hierarchical de novo extraction of mutational signatures

SigProfiler was first used for de novo discovery of mutational signatures across five separate datasets, including 96-channel mutational catalogs (Table S3) from (1) exome sequences from 1,001 human cancer cell lines, (2) exome sequences from 577 PDX models and 25 of the available originating tumors, (3) exome sequences from 63 cell line clones, (4) whole-genome sequences from 136 cell line clones and (5) whole-genome sequences from 36 single cells.
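To make the term "96-channel mutational catalog" concrete, the following is a minimal sketch of how single base substitutions are commonly binned: six pyrimidine-centred substitution classes, each in 16 flanking-base contexts. The function names (`channel_labels`, `classify`, `build_catalog`) and the input format (a trinucleotide context plus the alternative base) are assumptions made for illustration; they are not taken from SigProfiler.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
# Six substitution classes, expressed with a pyrimidine (C or T) reference base.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def channel_labels():
    """Return the 96 channel labels in the form 5'[ref>alt]3', e.g. 'A[C>T]G'."""
    return [f"{five}[{sub}]{three}"
            for sub in SUBSTITUTIONS
            for five, three in product(BASES, repeat=2)]


def classify(context, alt):
    """Map a trinucleotide context (hypothetical input format) and alt base to a channel."""
    if context[1] in "AG":
        # Purine reference base: report the mutation on the opposite strand.
        context = context.translate(COMPLEMENT)[::-1]
        alt = alt.translate(COMPLEMENT)
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


def build_catalog(mutations):
    """Count mutations per channel; returns one column of the catalog matrix."""
    counts = Counter(classify(context, alt) for context, alt in mutations)
    return [counts.get(label, 0) for label in channel_labels()]
```

Stacking one such 96-element column per sample gives a matrix with one row per mutation type and one column per sample, which is the shape of input the extraction step operates on.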

For a given set of mutational catalogs, the previously developed algorithm (Alexandrov et al., 2013b) was applied in a hierarchical manner to an input matrix $M \in \mathbb{R}_+^{K \times G}$ of non-negative natural numbers with dimension $K \times G$, where $K$ reflects the number of mutation types and $G$ corresponds to the number of samples. The algorithm first deciphers the minimal set of mutational signatures that optimally explains the proportion of each mutation type and then estimates the contribution of each signature across the samples. More specifically, the algorithm makes use of a well-known blind source separation technique, termed nonnegative matrix factorization (NMF). NMF identifies the matrix of mutational signatures, $P \in \mathbb{R}_+^{K \times N}$, and the matrix of the activities of these signatures, $E \in \mathbb{R}_+^{N \times G}$. Identification of the unknown number of signatures, $N$, is based on the robustness of the overall solution; the methodology has been previously described (Alexandrov et al., 2013b). The identification of $P$ and $E$ is done by minimizing the generalized Kullback-Leibler divergence:

$$
\min_{P \in \mathbb{R}_+^{K \times N},\; E \in \mathbb{R}_+^{N \times G}} \; \sum_{i=1}^{K} \sum_{j=1}^{G} \left( M_{ij} \log \frac{M_{ij}}{\hat{M}_{ij}} - M_{ij} + \hat{M}_{ij} \right)
$$

where $\hat{M}$ is the unnormalized approximation of $M$, i.e., $\hat{M} = P \times E$.

The framework is applied in a hierarchical manner to increase its ability to find mutational signatures present in few samples as well as mutational signatures exhibiting a low mutational burden. More specifically, after application to a matrix $M$ containing the original samples, the accuracy for explaining the mutational spectra of each of the cancers with the extracted mutational signatures is evaluated. All samples that are well-explained by the extracted mutational signatures are removed and the framework is applied to the remaining sub-matrix of $M$.
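The factorization and the hierarchical filtering step can be sketched as follows. This is a minimal illustration, assuming the standard multiplicative update rules for KL-divergence NMF and a fixed number of signatures; it omits the robustness-based selection of the number of signatures and is not the SigProfiler implementation. The function names and the per-sample cosine threshold `min_cosine` (used here as a stand-in criterion for "well-explained") are hypothetical.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log(0)


def generalized_kl(M, M_hat):
    """Generalized Kullback-Leibler divergence between M and its approximation."""
    M = np.asarray(M, dtype=float)
    M_hat = np.maximum(M_hat, EPS)
    log_term = np.where(M > 0, M * np.log(np.maximum(M, EPS) / M_hat), 0.0)
    return float(np.sum(log_term - M + M_hat))


def nmf_kl(M, n_signatures, n_iter=500, seed=0):
    """Factor M (K x G) into P (K x N) and E (N x G) with multiplicative updates."""
    M = np.asarray(M, dtype=float)
    rng = np.random.default_rng(seed)
    K, G = M.shape
    P = rng.random((K, n_signatures)) + EPS
    E = rng.random((n_signatures, G)) + EPS
    for _ in range(n_iter):
        # Update activities E, holding signatures P fixed.
        ratio = M / np.maximum(P @ E, EPS)
        E *= (P.T @ ratio) / np.maximum(P.sum(axis=0)[:, None], EPS)
        # Update signatures P, holding activities E fixed.
        ratio = M / np.maximum(P @ E, EPS)
        P *= (ratio @ E.T) / np.maximum(E.sum(axis=1)[None, :], EPS)
    return P, E


def poorly_explained(M, P, E, min_cosine=0.95):
    """Indices of samples whose spectrum is not well reconstructed by P @ E.

    min_cosine is an assumed threshold chosen for illustration only.
    """
    M = np.asarray(M, dtype=float)
    M_hat = P @ E
    num = np.sum(M * M_hat, axis=0)
    den = np.linalg.norm(M, axis=0) * np.linalg.norm(M_hat, axis=0)
    cosine = num / np.maximum(den, EPS)
    return np.where(cosine < min_cosine)[0]
```

In a hierarchical run under these assumptions, one would call `nmf_kl` on the full matrix, pass the result to `poorly_explained`, and repeat the extraction on `M[:, indices]` until no samples remain or no further signatures are found.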

The extracted signatures were compared to the set of mutational signatures deciphered from the PCAWG Platinum release (Table S1). Given the high proportions of germline variants in mutational catalogs from 1,001 cell lines (dataset 1) and 602 PDX models and their originating tumors (dataset 2), due to the non-availability of normal reference samples, we only considered two newly extracted signatures: SBS25, discovered in Hodgkin’s lymphoma cell lines (Figure S1; Table S1), because mutational signature analysis was not available from primary Hodgkin’s lymphomas (Alexandrov et al., 2018); and a signature termed ‘SNP signature’ (Table S1), characterized by T>C mutations at NTG context and believed to reflect residual germline variants, which commonly present as C>T mutations at CpG islands but on rare occasions may also present as T>C variants at TpGs in the reference genome. All signatures extracted across all cell line clones (datasets 3 and 4) could be explained by a combination of signatures from the global set (cosine similarity > 0.75) and hence none were considered novel. Signatures extracted across complete mutational catalogs from single cells (dataset 5) revealed two novel mutational signatures, termed SBS scE and scF, likely associated with the process of single cell lysis and/or WGA of a single DNA molecule (see section Mutational catalogs from single cells).
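The cosine similarity used in this comparison can be illustrated with a short sketch. It compares an extracted signature against each reference signature individually; fitting a combination of reference signatures, as described above, would additionally require a non-negative least-squares step that is not shown. The function names and the dictionary layout of the reference set are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equally long non-negative vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no defined direction
    return dot / (norm_a * norm_b)


def best_match(extracted, reference_set):
    """Return (name, similarity) of the closest signature in a hypothetical
    reference_set dict mapping signature names to channel vectors."""
    scored = ((name, cosine_similarity(extracted, signature))
              for name, signature in reference_set.items())
    return max(scored, key=lambda pair: pair[1])


def is_explained(extracted, reference_set, threshold=0.75):
    """True if the closest reference exceeds the similarity threshold."""
    _, similarity = best_match(extracted, reference_set)
    return similarity > threshold
```

Because cosine similarity depends only on the direction of the vectors, the comparison is unaffected by whether signatures are stored as raw counts or as normalized proportions.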

Assignment of mutational signatures with SigProfilerSingleSample