### EDA Reference - Python vs R

A reference guide for exploratory data analysis (EDA) in Python and R. Genetic data from The Cancer Genome Atlas (TCGA) is used to walk through some routine tasks when you first get your hands on a new (semi-clean) dataset.

1. Intro
2. Read and clean
3. Counting and summaries
4. Visualizations

# Intro

## Motivation

R and Python both have their pros and cons in the data science world. Sometimes I find myself using one language exclusively for an extended period of time, and then switching back to the other often entails some re-learning of basic usage. This tutorial gives side-by-side comparisons of R and Python code to jump-start the relearning process or act as a quick reference guide for those unfamiliar with one language.

For comparisons, code snippets below are posted side-by-side with Python on the left and R on the right. Since the output tables and images are essentially identical by design, only Python output is displayed, but full R and Python code/output can be viewed in the Rmarkdown and Jupyter notebook, respectively, which are available on the GitHub repo.

## The Dataset: TCGA

TCGA was a monumental research endeavor to characterize the genetic landscapes of a variety of cancers. Essentially, hundreds of patients had tumor samples profiled for genetic mutations. Most of the data is published online, and we’ll be using some publicly available data for this guide. You don’t have to be too familiar in genetics to follow along with this analysis. We’ll just explore some basic details such as which genes had the most mutations and how many mutations existed per tumor. Simple explanations are given for some genetic terminology.

The data comes in the form of multiple verbose .maf files (detailed here and here), but after some processing we’ll simplify it to a single table looking something like

GENE SAMPLE MUTATION
A 1 $\alpha$
A 2 $\alpha$
A 3 $\alpha$
B 1 $\beta$
B 1 $\gamma$
B 2 $\beta$
B 2 $\gamma$
C 1 $\delta$
C 1 $\epsilon$
C 1 $\zeta$

The elements are just dummy fillers and there will be some other columns, but essentially we are dealing with long-format data and primarily interested in which mutations occurred in which genes and in which samples.

To download the data, first download the GDC Data Transfer Tool and a manifest file, then simply run

./gdc-client download -m MANIFEST.txt

from a terminal. The manifest file used for this analysis contains all the publicly available data from the breast cancer (BRCA) dataset. However, the analysis is quite general so you could use any dataset from TCGA that is in the .maf format.

Notice that the manifest contains multiple .maf files. Essentially, each refers to a different algorithm used to determine if a mutation exists or not given genetic sequencing data. The different algorithms and the software that implements them are typically referred to as callers.

## Dependencies

First, let’s load the dependencies.

For Python we make heavy use of the pandas library for data wrangling plus matplotlib and seaborn (built on top of matplotlib) for visualizations. The seaborn library changes the default plot styling, and we further tweak it with some preset themes.

Most of the R packages needed are contained in the tidyverse mega-package, but cowplot is also useful for creating a single figure with multiple plots. Further, like seaborn, cowplot changes the default ggplot2 (part of tidyverse) plot styling.

# Read and clean

## Read single file

First, define where the data lives. This is identical in each language (although you’ll often see people use <- for assignment in R instead of =).

If we were only interested in a single .maf, reading data would be quite easy. Just tell the readers where the file is and in this case that the comment character in .maf is a number symbol #. It is also worth noting that these files are tab-separated, so we may use the readers for .tsv files.

If you run this, you’ll actually get warnings in both languages that some columns have mixed types (e.g. strings and integers). We’ll handle that soon.

## Preview

Both of the above output dimensions of 120988 x 120 and a short 10-row preview of the columns (omit the 10 argument to use default values). Only a subset of the 120 columns are printed here.

Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type ... FILTER CONTEXT src_vcf_id tumor_bam_uuid normal_bam_uuid case_id GDC_FILTER COSMIC MC3_Overlap GDC_Validation_Status
0 USP24 23358 WUGSC GRCh38 chr1 55159655 55159655 + Missense_Mutation SNP ... panel_of_normals CTGGATTGTAG d083d669-6646-463b-853e-c58da8d06439 4374e19d-c5e7-49cf-8707-05ae5aeb7369 aadee87c-6a68-4580-bd10-64ac273b1e3d 0130d616-885e-4a6c-9d03-2f17dd692a05 common_in_exac;gdc_pon NaN True Unknown
1 ERICH3 127254 WUGSC GRCh38 chr1 74571494 74571494 + Missense_Mutation SNP ... PASS TTCCTCTACCA d083d669-6646-463b-853e-c58da8d06439 4374e19d-c5e7-49cf-8707-05ae5aeb7369 aadee87c-6a68-4580-bd10-64ac273b1e3d 0130d616-885e-4a6c-9d03-2f17dd692a05 NaN COSM1474194 True Unknown
2 KIF26B 55083 WUGSC GRCh38 chr1 245419680 245419680 + Silent SNP ... PASS GCCTCGCAGGG d083d669-6646-463b-853e-c58da8d06439 4374e19d-c5e7-49cf-8707-05ae5aeb7369 aadee87c-6a68-4580-bd10-64ac273b1e3d 0130d616-885e-4a6c-9d03-2f17dd692a05 NaN COSM1473725;COSM1473726 True Unknown
3 USP34 9736 WUGSC GRCh38 chr2 61189055 61189055 + Silent SNP ... PASS AAAGCGAGTGC d083d669-6646-463b-853e-c58da8d06439 4374e19d-c5e7-49cf-8707-05ae5aeb7369 aadee87c-6a68-4580-bd10-64ac273b1e3d 0130d616-885e-4a6c-9d03-2f17dd692a05 NaN COSM1483177 True Unknown
4 ANTXR1 84168 WUGSC GRCh38 chr2 69245305 69245305 + Silent SNP ... PASS TCCTCGCCGCC d083d669-6646-463b-853e-c58da8d06439 4374e19d-c5e7-49cf-8707-05ae5aeb7369 aadee87c-6a68-4580-bd10-64ac273b1e3d 0130d616-885e-4a6c-9d03-2f17dd692a05 NaN COSM1409122 True Unknown
5 SCN9A 6335 WUGSC GRCh38 chr2 166199365 166199365 + Silent SNP ... PASS AGTATGACTGC d083d669-6646-463b-853e-c58da8d06439 4374e19d-c5e7-49cf-8707-05ae5aeb7369 aadee87c-6a68-4580-bd10-64ac273b1e3d 0130d616-885e-4a6c-9d03-2f17dd692a05 NaN COSM1482144;COSM4814664 True Unknown
6 FN1 2335 WUGSC GRCh38 chr2 215397809 215397809 + Nonsense_Mutation SNP ... PASS CACTTCTCGTG d083d669-6646-463b-853e-c58da8d06439 4374e19d-c5e7-49cf-8707-05ae5aeb7369 aadee87c-6a68-4580-bd10-64ac273b1e3d 0130d616-885e-4a6c-9d03-2f17dd692a05 NaN COSM1482746;COSM1482747 True Unknown
7 SPHKAP 80309 WUGSC GRCh38 chr2 228016738 228016738 + Missense_Mutation SNP ... PASS TCTTTCCTCGG d083d669-6646-463b-853e-c58da8d06439 4374e19d-c5e7-49cf-8707-05ae5aeb7369 aadee87c-6a68-4580-bd10-64ac273b1e3d 0130d616-885e-4a6c-9d03-2f17dd692a05 NaN COSM1482832;COSM1482833 True Unknown
8 HRH1 3269 WUGSC GRCh38 chr3 11259653 11259653 + Missense_Mutation SNP ... PASS TGCTCATGCTC d083d669-6646-463b-853e-c58da8d06439 4374e19d-c5e7-49cf-8707-05ae5aeb7369 aadee87c-6a68-4580-bd10-64ac273b1e3d 0130d616-885e-4a6c-9d03-2f17dd692a05 NaN COSM1484451 True Unknown
9 LRRC2 79442 WUGSC GRCh38 chr3 46519054 46519054 + Missense_Mutation SNP ... panel_of_normals AGCTGGGAACA d083d669-6646-463b-853e-c58da8d06439 4374e19d-c5e7-49cf-8707-05ae5aeb7369 aadee87c-6a68-4580-bd10-64ac273b1e3d 0130d616-885e-4a6c-9d03-2f17dd692a05 gdc_pon COSM1485224 True Unknown

## Combine multiple files

Now let’s take the .maf from each caller and combine them into a single data frame. For both languages, the strategy is to create a list of the data frames for each caller and then concatenate them vertically (i.e. by stacking all the rows on top of each other). This creates duplicates if a mutation in one sample was found by multiple callers, so we de-duplicate later.

In most cases, one can probably get away with letting the input readers infer the data type of each column as they are good at guessing, but recall we actually had some warning messages about mixed type columns before. Plus, there was way more data than we need for this exercise. So, below we explicitly define both the columns to read in and their data types.

Note that in R it is often desirable to convert categorical variables/columns from character (string) type to the factor type. Here, the conversion is applied to every character column.

Chromosome Start_Position End_Position Variant_Classification Variant_Type Reference_Allele Tumor_Sample_Barcode Allele SYMBOL IMPACT
0 chr1 1916819 1916819 Missense_Mutation SNP C TCGA-A2-A3Y0-01A-11D-A23C-09 G CALML6 MODERATE
1 chr1 2172304 2172304 Missense_Mutation SNP G TCGA-A2-A3Y0-01A-11D-A23C-09 C PRKCZ MODERATE
2 chr1 3766586 3766586 Missense_Mutation SNP G TCGA-A2-A3Y0-01A-11D-A23C-09 A CCDC27 MODERATE
3 chr1 6040634 6040634 Silent SNP G TCGA-A2-A3Y0-01A-11D-A23C-09 C KCNAB2 LOW
4 chr1 23961791 23961791 Missense_Mutation SNP A TCGA-A2-A3Y0-01A-11D-A23C-09 G PNRC2 MODERATE

## Rename and reorder columns

We essentially create a map between the old and new column names and apply that to the data frames in both languages. In Python, notice that we’re using an OrderedDict collection to preserve the order of the columns as we want them to appear.

CHR START END GENE REF ALT CLASS IMPACT TYPE BARCODE
0 chr1 1916819 1916819 CALML6 C G Missense_Mutation MODERATE SNP TCGA-A2-A3Y0-01A-11D-A23C-09
1 chr1 2172304 2172304 PRKCZ G C Missense_Mutation MODERATE SNP TCGA-A2-A3Y0-01A-11D-A23C-09
2 chr1 3766586 3766586 CCDC27 G A Missense_Mutation MODERATE SNP TCGA-A2-A3Y0-01A-11D-A23C-09
3 chr1 6040634 6040634 KCNAB2 G C Silent LOW SNP TCGA-A2-A3Y0-01A-11D-A23C-09
4 chr1 23961791 23961791 PNRC2 A G Missense_Mutation MODERATE SNP TCGA-A2-A3Y0-01A-11D-A23C-09

## Fill Missing Data

Some of the GENE column values are set to the none/NA/missing value in each language. This is reasonably accurate as the mutations occur between genes, but there is a more descriptive technical term: “intergenic” mutations. Here, we reassign all the missing GENE values to “INTERGENIC”.

Since GENE is a factor in the R dataframe, we have to add the new value to the existing levels before reassignment.

## New columns from string operations

We create a new SAMPLE column by selecting the first 12 characters from the BARCODE column e.g. barcode TCGA-A2-A3Y0-01A-11D-A23C-09 refers to sample TCGA-A2-A3Y0. The rest of the barcode references other properties we are not concerned with here (such as if the sample is a metastatic or primary tumor).

Further, we create a unique identifier for each mutation by joining the CHR, GENE, START, END, REF, and ALT columns together with a colon delimiter. That is, if a mutation changes the C nucleotide at position 1916819 on chromosome 1 to a G, then the mutation is in the CALML6 gene, and we refer to it as chr1:CALML6:1916819:1916819:C:G. Note that the start and end positions will be different if we are dealing with an insertion or deletion of nucleotides rather than a single-nucleotide polymorphism (SNP).

Finally, a TYPE2 column is created which condenses the information from the TYPE column. TYPE tells if the mutation is a “SNP” (single nucleotide polymorphism e.g. one nucleotide changes to another), “INS” (insertion), or “DEL” (deletion). Insertions and deletions insert or delete one or more nucleotides, and they are often collectively referred to as “indels”. The TYPE2 column will contain only “SNP” and “INDEL” values.

Similar to when we read the data, if necessary we should convert character variable to factor variables in R.

Also in R, there are 2 common methods to accomplish the above tasks, both worth noting. One is to create the columns directly by assignment as done in Python, and the other is to use the mutate() function from the dplyr package. Both are basically the same, but mutate() has the advantage of not having to constantly reference columns by the \$ sign.

CHR START END GENE REF ALT CLASS IMPACT TYPE BARCODE SAMPLE MUTATION TYPE2
0 chr1 1916819 1916819 CALML6 C G Missense_Mutation MODERATE SNP TCGA-A2-A3Y0-01A-11D-A23C-09 TCGA-A2-A3Y0 chr1:CALML6:1916819:1916819:C:G SNP
1 chr1 2172304 2172304 PRKCZ G C Missense_Mutation MODERATE SNP TCGA-A2-A3Y0-01A-11D-A23C-09 TCGA-A2-A3Y0 chr1:PRKCZ:2172304:2172304:G:C SNP
2 chr1 3766586 3766586 CCDC27 G A Missense_Mutation MODERATE SNP TCGA-A2-A3Y0-01A-11D-A23C-09 TCGA-A2-A3Y0 chr1:CCDC27:3766586:3766586:G:A SNP
3 chr1 6040634 6040634 KCNAB2 G C Silent LOW SNP TCGA-A2-A3Y0-01A-11D-A23C-09 TCGA-A2-A3Y0 chr1:KCNAB2:6040634:6040634:G:C SNP
4 chr1 23961791 23961791 PNRC2 A G Missense_Mutation MODERATE SNP TCGA-A2-A3Y0-01A-11D-A23C-09 TCGA-A2-A3Y0 chr1:PNRC2:23961791:23961791:A:G SNP

## Remove duplicates

Since we generated duplicate rows when combining data frames for each caller, we remove them so they don’t interfere with counting and statistics. Since the rows would be exactly identical for each caller, we can just look for where entire rows are duplicated.

Further, it might be useful to have a dataframe with only mutation information and without reference to which samples were found. That is, one row per mutation instead of one row per-mutation per-sample. For this, the deduplication process should consider duplicate values in the MUTATION column only.

By checking the shape of each frame, we see that df has shape 132916 x 13 and df_mut is 130625 x 13.

## Remove columns

The df_mut dataframe above still contains sample information, but it should only have mutation information. Thus, the SAMPLE and BARCODE columns should be dropped.

Note that with pandas in Python you might encounter a warning about chaining operations (e.g. reading -> renaming -> dedupe -> drop columns) resulting in unintended effects. You can read about why the issue arises and suppress the warning if you feel comfortable about chaining operations.

CHR START END GENE REF ALT CLASS IMPACT TYPE MUTATION TYPE2
0 chr1 1916819 1916819 CALML6 C G Missense_Mutation MODERATE SNP chr1:CALML6:1916819:1916819:C:G SNP
1 chr1 2172304 2172304 PRKCZ G C Missense_Mutation MODERATE SNP chr1:PRKCZ:2172304:2172304:G:C SNP
2 chr1 3766586 3766586 CCDC27 G A Missense_Mutation MODERATE SNP chr1:CCDC27:3766586:3766586:G:A SNP
3 chr1 6040634 6040634 KCNAB2 G C Silent LOW SNP chr1:KCNAB2:6040634:6040634:G:C SNP
4 chr1 23961791 23961791 PNRC2 A G Missense_Mutation MODERATE SNP chr1:PNRC2:23961791:23961791:A:G SNP

## New columns from delimited column

Here we essentially reverse the process of making the mutation column. In other words, assume we start with a MUTATION column that has values like chr1:CALML6:1916819:1916819:C:G. The goal is to create six new columns with values like chr1, CALM6, etc.

Since we already have those columns in the existing df, first we make a dummy frame without them so we can verify that the procedure works.

CLASS IMPACT TYPE BARCODE SAMPLE MUTATION TYPE2 CHR GENE START END REF ALT
0 Missense_Mutation MODERATE SNP TCGA-A2-A3Y0-01A-11D-A23C-09 TCGA-A2-A3Y0 chr1:CALML6:1916819:1916819:C:G SNP chr1 CALML6 1916819 1916819 C G
1 Missense_Mutation MODERATE SNP TCGA-A2-A3Y0-01A-11D-A23C-09 TCGA-A2-A3Y0 chr1:PRKCZ:2172304:2172304:G:C SNP chr1 PRKCZ 2172304 2172304 G C
2 Missense_Mutation MODERATE SNP TCGA-A2-A3Y0-01A-11D-A23C-09 TCGA-A2-A3Y0 chr1:CCDC27:3766586:3766586:G:A SNP chr1 CCDC27 3766586 3766586 G A
3 Silent LOW SNP TCGA-A2-A3Y0-01A-11D-A23C-09 TCGA-A2-A3Y0 chr1:KCNAB2:6040634:6040634:G:C SNP chr1 KCNAB2 6040634 6040634 G C
4 Missense_Mutation MODERATE SNP TCGA-A2-A3Y0-01A-11D-A23C-09 TCGA-A2-A3Y0 chr1:PNRC2:23961791:23961791:A:G SNP chr1 PNRC2 23961791 23961791 A G

## Read and write binary files

Both R and Python have methods to read/write binary data for efficiency.

## Long to wide

The goal here is to go from

GENE SAMPLE MUTATION
A 1 $\alpha$
A 2 $\alpha$
A 3 $\alpha$
B 1 $\beta$
B 1 $\gamma$
B 2 $\beta$
B 2 $\gamma$
C 1 $\delta$
C 1 $\epsilon$
C 1 $\zeta$

to

GENE MUTATION 1 2 3
A $\alpha$ 1 1 1
B $\beta$ 1 1 0
B $\gamma$ 1 1 0
C $\delta$ 0 0 1
C $\epsilon$ 0 0 1
C $\zeta$ 0 0 1

That is, make each row correspond to a mutation and create columns for each sample. The sample columns are filled with 1 if the mutation exists and 0 otherwise. This transformation is also knowing as casting.

If you’re familiar with genetics / bioinformatics, the wide format is similar to how a VCF file is structured while MAF structure is long.

Note that df_wide in both cases does not contain columns like GENE. Additional columns are added below. Also, only a subset of the sample columns are shown below (and likely they are all filled with 0).

MUTATION TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK TCGA-4H-AAAK TCGA-5L-AAT0 TCGA-5L-AAT1 TCGA-5T-A9QA TCGA-A1-A0SD ... TCGA-UL-AAZ6 TCGA-UU-A93S TCGA-V7-A7HQ TCGA-W8-A86G TCGA-WT-AB41 TCGA-WT-AB44 TCGA-XX-A899 TCGA-XX-A89A TCGA-Z7-A8R5 TCGA-Z7-A8R6
0 chr10:A1CF:50813906:50813906:G:- 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 chr10:A1CF:50813932:50813932:G:T 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 chr10:A1CF:50828193:50828193:C:A 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 chr10:A1CF:50836094:50836094:G:A 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 chr10:A1CF:50836177:50836177:G:A 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

## Merge

Add the GENE and IMPACT columns that exist in one dataframe (df_mut) to the newly created df_wide which only has MUTATION and sample columns.

Again, only a subset of the sample columns are shown (probably all filled with 0).

GENE IMPACT MUTATION TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK TCGA-4H-AAAK TCGA-5L-AAT0 TCGA-5L-AAT1 ... TCGA-UL-AAZ6 TCGA-UU-A93S TCGA-V7-A7HQ TCGA-W8-A86G TCGA-WT-AB41 TCGA-WT-AB44 TCGA-XX-A899 TCGA-XX-A89A TCGA-Z7-A8R5 TCGA-Z7-A8R6
0 A1CF HIGH chr10:A1CF:50813906:50813906:G:- 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 A1CF MODERATE chr10:A1CF:50813932:50813932:G:T 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 A1CF MODERATE chr10:A1CF:50828193:50828193:C:A 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 A1CF MODERATE chr10:A1CF:50836094:50836094:G:A 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 A1CF LOW chr10:A1CF:50836177:50836177:G:A 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

# Wide to long

We invert the procedure from long to wide (this is also known as melting). Note that for this particular dataset, the operation is quite memory intensive.

GENE IMPACT MUTATION SAMPLE
967 CELF2 MODIFIER chr10:CELF2:11333805:11333805:A:- TCGA-3C-AAAU
1972 GATA3 HIGH chr10:GATA3:8073911:8073912:-:A TCGA-3C-AAAU
4782 WDR11 MODIFIER chr10:WDR11:120909443:120909443:G:A TCGA-3C-AAAU
6124 CD248 MODERATE chr11:CD248:66314996:66314996:C:T TCGA-3C-AAAU
8170 MALAT1 MODIFIER chr11:MALAT1:65505435:65505437:AAA:- TCGA-3C-AAAU

# Counting and summaries

Various descriptive calculations are done with the data now in the proper format. These calculations are often paired with graphic representations which are covered in the next section.

## Count unique elements

The values are

Number of unique
Samples: 986
Genes: 19168
Mutations: 130625
Mutation Classes: 18


# Count by factor levels

From the above, we saw that, for example, there are 18 mutation classes such as missense mutations, silent mutations, etc. What would probably be more useful is to know how many mutations are there of each class. Similarly, for the IMPACT column, it would be useful to know how many HIGH impact, LOW impact, etc. mutations were found.

A quick preview of the counts are

Counts per CLASS:
Missense_Mutation         66371
Silent                    23881
3'UTR                     11052
Intron                     6990
Nonsense_Mutation          6056
Frame_Shift_Del            3661
5'UTR                      3392
RNA                        2543
Frame_Shift_Ins            1975
Splice_Site                1921
Splice_Region              1537
3'Flank                    1132
5'Flank                     906
In_Frame_Del                869
In_Frame_Ins                433
Nonstop_Mutation             93
Translation_Start_Site       90
IGR                          14
Name: CLASS, dtype: int64

Counts per IMPACT:
MODERATE    67648
MODIFIER    26059
LOW         25413
HIGH        13796
Name: IMPACT, dtype: int64

Counts per TYPE:
SNP    121319
DEL      7316
INS      4281
Name: TYPE, dtype: int64

Top repeated (>10) mutations:
chr3:PIK3CA:179234297:179234297:A:G         121
chr3:PIK3CA:179218303:179218303:G:A          63
chr3:PIK3CA:179218294:179218294:G:A          43
chr1:ST6GALNAC3:76576946:76576947:-:AAAC     33
chr14:AKT1:104780214:104780214:C:T           25
chr3:MUC4:195783009:195783009:C:T            21
chr10:GATA3:8069470:8069471:CA:-             21
chr3:MUC4:195783008:195783008:A:G            20
chr17:TP53:7675088:7675088:C:T               20
chr3:PIK3CA:179203765:179203765:T:A          17
chr15:GOLGA6L6:20535018:20535018:C:T         17
chr6:OPRM1:154107953:154107958:TTTTTA:-      13
chr3:PIK3CA:179234297:179234297:A:T          13
chr17:TP53:7673802:7673802:C:T               12
chr1:NBPF12:146963189:146963189:G:C          11
chr16:PKD1P6:15104542:15104542:T:A           11
Name: MUTATION, dtype: int64

Samples with most mutations:
TCGA-AN-A046    7948
TCGA-AC-A23H    6711
TCGA-5L-AAT1    2117
TCGA-BH-A18G    2033
TCGA-AN-A0AK    2029
TCGA-A8-A09Z    1924
TCGA-BH-A0HF    1695
TCGA-AO-A128    1644
TCGA-D8-A1XK    1541
TCGA-BH-A0B6    1419
Name: SAMPLE, dtype: int64

Samples with least mutations:
TCGA-A2-A0ES    18
TCGA-AC-A2FB    17
TCGA-AR-A24W    16
TCGA-LL-A440    16
TCGA-AR-A252    16
TCGA-AO-A1KO    15
TCGA-A2-A1G6    12
TCGA-A8-A08C     9
TCGA-A2-A25F     7
TCGA-AC-A2FK     2
Name: SAMPLE, dtype: int64


# Summarize by group

Consider the toy example dataframe again:

GENE SAMPLE MUTATION
A 1 $\alpha$
A 2 $\alpha$
A 3 $\alpha$
B 1 $\beta$
B 1 $\gamma$
B 2 $\beta$
B 2 $\gamma$
C 1 $\delta$
C 1 $\epsilon$
C 1 $\zeta$

For quantifying the number of mutations, we might be interested in any of the following three metrics:

• Genes with the most samples mutated
• Genes with the largest number of aggregate mutations
• Genes with the largest number of unique mutations

The counts for the above example are as follows (bolded cells are the highest count for the column):

GENE # Samples Mutated # Aggregate Mutations # Unique Mutations
A 3 3 1
B 2 4 2
C 1 3 3

All three metrics are calculated below and saved into a dataframe (df_summary) like the above example.

Note that by default in the Python example, the dataframe columns will be a MultiIndex which is somewhat cumbersome for this simple task, so some extra work is done to simplify it to what we see above and get by default in R.

GENE N_MUTATIONS N_UNIQUE_MUTATIONS N_SAMPLES
0 A1BG 5 5 5
1 A1CF 10 10 10
2 A2M 15 15 14
3 A2ML1 15 15 13
4 A4GALT 2 2 2
5 A4GNT 3 3 3
6 AAAS 4 4 4
7 AACS 7 7 7
8 AACSP1 1 1 1
9 AADAC 6 6 6

## Sorting

Now that we have our summary dataframe, often we might be interested in sorting it by one or more columns.

GENE N_MUTATIONS N_UNIQUE_MUTATIONS N_SAMPLES
17100 TP53 377 240 360
12150 PIK3CA 384 82 339
17553 TTN 472 472 243
10153 MUC4 280 141 197
2831 CDH1 154 133 149

## Summarize by multiple groups

It is simple to further refine our grouping. Here, rather than just finding how many mutations are in each gene, we’ll investigate how many HIGH-impact, LOW-impact, etc. mutations are found in each gene. In other words, we group by both the GENE and IMPACT columns before counting. Doing so is a straightforward extension of grouping by a single column.

GENE IMPACT MUTATION
0 A1BG MODERATE 3
1 A1BG MODIFIER 2
2 A1CF HIGH 1
3 A1CF LOW 1
4 A1CF MODERATE 6

## Descriptive statistics

Here classic descriptive stats such as median, upper quartile, minimum, etc. number of mutations per sample are calculated. The stats are calculated per TYPE2 per IMPACT. So, before calculating, we first make a data frame (df_sample_counts) that is appropriately grouped and counts the number of mutations per grouping. The stats are then compiled in a second data frame (df_sample_stats).

Note that for visual appeal when creating plots with these numbers later on, we arbitrarily only consider groupings with between 20 and 200 mutations.

df_sample_counts looks like

SAMPLE TYPE2 IMPACT N_MUTATIONS
5 TCGA-3C-AAAU SNP MODIFIER 22
7 TCGA-3C-AALI INDEL MODIFIER 21
8 TCGA-3C-AALI SNP HIGH 43
9 TCGA-3C-AALI SNP LOW 153
17 TCGA-3C-AALJ SNP MODERATE 27

and df_sample_stats looks like

TYPE2 IMPACT MEDIAN UPPER LOWER MEAN STD MIN MAX
0 INDEL HIGH 57.0 93.00 26 69.764706 53.634794 20 196
1 INDEL MODERATE 49.0 68.00 36 61.800000 41.865260 25 131
2 INDEL MODIFIER 58.0 119.00 39 78.692308 55.741045 21 173
3 SNP HIGH 34.5 64.75 24 44.431818 27.942131 20 150
4 SNP LOW 31.0 44.00 23 39.250000 25.638295 20 157
5 SNP MODERATE 40.5 65.00 28 52.778364 34.457118 20 190
6 SNP MODIFIER 30.5 45.75 24 40.580292 28.512752 20 200

# Visualizations

Here we use some of the data summaries made in the last section to create visual descriptions of the dataset.

Note that ggplot2 is the main graphics package in R whereas Python primarily uses matplotlib. The two take different paradigms about creating graphics. Essentially, ggplot2 assumes the data you’re working with is in long-format whereas matplotlib is easier to work with if you have already taken the time to do the counting summaries as done in the last section. However, ggplot2 can handle summarized data with minimal extra effort, and the seaborn library in Python makes working with long-format data easier. So, in either language you can work with whichever format. Examples are presented covering all cases, and one example is devoted explicitly to showing the differences. We start in the matplotlib paradigm of plotting summarized data.

## Ranking

Recall that there were a couple samples that were seemingly much more mutated than others. A ranking plot will reveal that nicely with the added benefit of showing how the rest of the samples compare as well.

## Bar plot

Use the mutation class summary to display all the mutation classes plus how many mutation in each class were found.

To achieve the horizontal style, use barh() in Python and coord_flip in R. Note the plt.gca.invert_yaxis() call needed to get the Python plot to display in descending order.

## Subplots

Plot two (or more) graphs on the same figure. matplotlib in Python has a built-in method to handle this, but we’ll need plot_grid() from the cowplot package in R.

## Count vs identity

Thus far we’ve made “identity” plots wherein the numbers plotted were already calculated. This is the matplotlib paradigm but is easily accomplished in ggplot2 with the stat = 'identity' option. Without that option, ggplot2 assumes the data is long-format and will do the counting for us. The seaborn library will let us work with long-format data directly in Python. This example shows how both methods are used in each language to produce the same output when counting the number of mutation found in each gene.

## Plot with grouping

In a previous section, we refined our counting procedure to count not just the number of mutations per gene but rather the number of high-impact, low-impact, etc. mutations per gene. We can accomplish that graphically as well. We use the ggplot2 paradigm of passing in the long-format data and letting the plotting packages do the counting.

Note that here we use a specific color palette (Set1) which exists in both languages. We could also use the default, another pre-defined palette, or make our own.

## Faceted plot

Let’s do a plot like the one above, but instead look at specific samples rather than aggregated across all samples. This is a good case for a faceted plot.