Integrated Mapping and Profiling of Allelically-expressed Loci with Annotations

public 1yr ago Version: 1.0.0 0 bookmarks

View Workflow

Help improve this workflow!

This workflow has been published but could be further improved with some additional meta data:

Keyword(s) in categories input, output, operation, topic

You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .

Integrated Mapping and Profiling of Allelically-expressed Loci with Annotations

This Snakemake workflow calls allele-specific expression genes using short-read RNA-seq. Phasing information derived from long-read data by tools such as WhatsHap can be provided to increase the performance of the tool, and to link results to features of interest. Copy number variant data, allelic methylation data and somatic variant data can also be provided to analyze genes with allele specific expression.

Overall Workflow
Installation
- Dependencies
Input Files
- Optional input
Running Workflow
- Edit config file
- Running snakemake workflow
Output Files
- Summary Output
- Example Figures
Contributors
License

Overall Workflow

Installation

This will clone the repository. You can run the IMPALA within this directory.

git clone https://github.com/bcgsc/IMPALA.git

Dependencies

To run this workflow, you must have snakemake (v6.12.3) and singularity (v3.5.2-1.1.el7). You can install snakemake using this guide and singularity using this guide . The remaining dependencies will be downloaded automatically within the snakemake workflow.

Input Files

Method 1 †: RNA reads:

RNA paired end reads (R1 & R2 fastq file)

Method 2 §: RNA alignment:

RNA alignment alignment (bam file)
Expression Matrix
- Expression in RPKM/TPM
- Gene name must be in HGNC format
- Column name is "Gene" and sample names

Optional Inputs:

Phase VCF
- Can be obtained using WhatsHap with DNA long reads
- Significantly improves precision of ASE calling
- Adds TFBS mutation and stop gain/loss information
Copy Number Variant Data
- Can be optained using ploidetect
Allelic Methylation
- Can be optained using NanoMethPhase
Somatic mutations
- Finds somatic mutations in ASE gene and promoters
Tumor Content
- Used to calcualte the expected major allele frequency
- Assumes 1.0 if not specified
Tissue type
- Include data for average MAF in normal tissue in summary table
- Otained from GTex database which ran phASER to calcualte allelic expression

Running Workflow

Edit the config files

Example parameters.yaml:

Config files to specify parameters and paths needed for the workflow. The main parameter to include is the genome name, path to expression matrix, major allele frequency threshold and threads as well as settings for using phased vcf and doing cancer analysis.

# genome_name should match bams
genome_name: hg38/hg19/hg38_no_alt_TCGA_HTMCP_HPVs
# RPKM matrix of the samples
matrix: /path/to/expression/matrix.tsv
# Major allele frequency threshold for ASE (0.5 - 0.75)
maf_threshold: 0.65
# Threads for STAR, RSEM, Strelka and MBASED
threads: 72
# Use phased vcf (True or False)
# Uses pseudphasing algorithm if False
phased: True
# Perform cancer analysis 
# Intersect with optional input
cancer_analysis: True
# Paths for annotation
annotationPath:
 snpEff_config:
 /path/to/snpEff/config
 snpEff_datadir:
 /path/to/snpEff/binaries/data
 snpEff_genomeName:
 GRCh38.100
 snpEff_javaHeap:
 64g
# Paths for references
# Only needed if RNA read is provided instead of RNA bam
starReferencePath:
 /path/to/star/ref
rsemReferencePath:
 /path/to/rsem/ref

Example samples.yaml:

Main config file to specify input files. For input method 1 using R1 and R2 fastq file, use R1 and R2 tag. For input method 2 using RNA bam file, use rna tag. All other tags are optional.

samples: # Sample Name must match expression matrix sampleName_1: # Method 1 R1: /path/to/RNA/R1.fq R2: /path/to/RNA/R2.fq somatic_snv: /path/to/somatic/snv.vcf somatic_indel: /path/to/somatic/indel.vcf tissueType: Lung sampleName_2: # Method 2 rna: /path/to/RNA/alignment.bam phase: /path/to/phase.vcf.gz cnv: /path/to/cnv/data methyl: /path/to/methyl/data.bed tumorContent: 0.80

Example defaults.yaml:

Config file for specify path for reference genome, annotation bed file and centromere bed file. Annotation and centromere bed file for hg38 are included in the repository.

genome:
 hg19:
 /path/to/hg19/ref.fa
 hg38:
 /path/to/hg38/ref.fa
 hg38_no_alt_TCGA_HTMCP_HPVs:
 /path/to/hg38_no_alt_TCGA_HTMCP_HPVs/ref.fa
annotation:
 hg19:
 /path/to/hg19/annotation.fa
 hg38:
 annotation/biomart_ensembl100_GRCh38.sorted.bed 
 hg38_no_alt_TCGA_HTMCP_HPVs:
 annotation/biomart_ensembl100_GRCh38.sorted.bed
centromere:
 hg19:
 /path/to/hg19/centromere.bed
 hg38:
 annotation/hg38_centromere_positions.bed
 hg38_no_alt_TCGA_HTMCP_HPVs:
 annotation/hg38_centromere_positions.bed

Run snakemake

This is the command to run it with singularity. The -c parameter can be used to specify maximum number of threads. The -B parameter is used to speceify paths for the docker container to bind.

snakemake -c 30 --use-singularity --singularity-args "-B /projects,/home,/gsc"

Output Files

All output and intermediary files is found in output/{sample} directory. The workflow has four main section, alignment, variant calling, mbased and cancer analysis and their outputs can be found in the corrosponding directories. The key outputs from the workflow is located below

MBASED related outputs (found in output/{sample}/mbased )
- The tabular results of the output MBASED_expr_gene_results.txt
- The rds object of the MBASED raw output MBASEDresults.rds
Summary table of all outputs
- Found in output/{sample}/summaryTable.tsv
- Data of all phased genes with ASE information along potential causes based on optional inputs
Figures
- Found in output/{sample}/figures
- Example figure shown below

Summary Table Description

Column	Description
gene	HGNC gene symbol
Expression	Expression level
allele1IsMajor	T/F if allele 1 is the major allele (allele 1 = HP1)
majorAlleleFrequency	Major allele frequency
padj	Benjamini-Hochberg adjusted pvalue
aseResults	ASE result based on MAF threshold (and pval)
cnv.A1	Copy Number for allele 1
cnv.B1	Copy Number for allele 2
expectedMAF1	Expect Major Allele Frequency based on CNV
cnv_state1	Allelic CNV state (Loss of Heterozygosity, Allelic balance/imbalabnce)
methyl_state2	Methylation difference in promter region (Allele 1 - Allele 2)
tf_allele3	Allele where there is gain of transcription factor binding site
transcriptionFactor3	Transcription Factor for gain TFBS
stop_variant_allele3	Allele where stop gain/stop loss variant is found
somaticSNV4	Somatic SNV found in (or around) gene (T/F)
somaticIndel4	Somatic Indel found in (or around) gene (T/F)
normalMAF5	Add MAF for gene in normal tissue
cancer_gene	T/F if gene is a known cancer gene (based on `annotation/cancer_gene.txt` )
sample	Sample Name

Columns only included if optional input is included:

1 Copy number variant 2 Allelic methylation 3 Phased vcf 4 Somatic SNV and Indel 5 Tissue type

Example Figures

Several figures are automatically generate based on the optional inputs. They can be found in output/{sample}/figures . The main figure is karyogram.pdf which show co-locationzation of ASE genes with allelic methylation and somatic copy number alteration. Example figures can be found here .

Contributors

The contributors of this project are Glenn Chang, Vannessa Porter, and Kieran O'Neill.

License

IMPALA is licensed under the terms of the GNU GPL v3 .

Code Snippets

shell:
	"""
	date >> {output} 2> {log}
	printf "{wildcards.sample}\n------------------------\n\n" >> {output} 2> {log}
	echo {params.sample_info}| sed 's/{{//' | sed 's/}}//' | tr  , '\n'  >> {output} 2> {log}
	echo expression_matrix: {params.expressionMatrix} >> {output} 2> {log}
	printf "\n------------------------\n" >> {output} 2> {log}
	echo Major Allele Frequency Threshold: {params.maf} >> {output} 2> {log}
	echo Genome: {params.genome_name} >> {output} 2> {log}
	echo Phased: {params.phase}  >> {output} 2> {log}
	echo Cancer analysis: {params.cancer} >> {output} 2> {log}
	echo Threads: {params.threads} >> {output} 2> {log}
	"""

SnakeMake From line 66 of master/Snakefile

shell:
	"""
	mkdir -p output/{wildcards.sample}/0_alignment
	STAR \
		--genomeDir {params.ref} \
		--runThreadN {threads} \
		--readFilesIn  {input.r1} {input.r2} \
		--outFileNamePrefix output/{wildcards.sample}/0_alignment/star \
		--outSAMtype BAM SortedByCoordinate \
		--outSAMunmapped Within \
		--outSAMattributes Standard \
		--waspOutputMode SAMtag \
		--varVCFfile {input.vcf} \
		--quantMode TranscriptomeSAM \
		--twopassMode Basic \
		--twopass1readsN -1 &> {log}
	"""

SnakeMake STAR From line 98 of master/Snakefile

shell:
	"""
	samtools view -h {input} | grep -e '^@' -e 'vW:i:1' | samtools view -b -S > {output} 2> {log}
	samtools index {output} &> {log}
	"""

SnakeMake SAMtools From line 121 of master/Snakefile

shell:
	"""
	mkdir -p output/{wildcards.sample}/0_alignment
	STAR \
		--genomeDir {params.ref} \
		--runThreadN {threads} \
		--readFilesIn  {input.r1} {input.r2} \
		--outFileNamePrefix output/{wildcards.sample}/0_alignment/star \
		--outSAMtype BAM SortedByCoordinate \
		--outSAMunmapped Within \
		--outSAMattributes Standard \
		--quantMode TranscriptomeSAM \
		--twopassMode Basic \
		--twopass1readsN -1 &> {log}
	"""

SnakeMake STAR From line 141 of master/Snakefile

shell:
		"""
		rsem-calculate-expression \
				--alignments \
				-p {threads} \
				--paired-end \
				{input} \
				{params.ref} \
				output/{wildcards.sample}/0_alignment/star &> {log}
		"""

SnakeMake RSEM From line 166 of master/Snakefile

shell:
	"""
	Rscript src/generateExpressionMatrix.R \
		-i {input} \
		-o output/{wildcards.sample}/0_alignment \
		-a {params.annotation} \
		-s {wildcards.sample} &> {log}
	"""

SnakeMake From line 184 of master/Snakefile

    shell:
        """
		mkdir -p output/{wildcards.sample}/1_variant
		zcat {input.phase} | grep -E '(PASS|#)' | grep -E '(0/1|\||#)' | awk '/^#/||length($4)==1 && length($5)==1' | bgzip > {output} 2> {log}
		"""

SnakeMake From line 203 of master/Snakefile

shell:
    "tabix {input.vcf} &> {log}"

SnakeMake tabix From line 217 of master/Snakefile

		shell:
			"""
			configureStrelkaGermlineWorkflow.py \
			--bam={input.bam} \
			--referenceFasta={input.ref} \
			--forcedGT={input.vcf} \
			--rna \
			--runDir=output/{wildcards.sample}/StrelkaRNA &> {log}

			output/{wildcards.sample}/StrelkaRNA/runWorkflow.py -m local -j {threads} &> {log} 
        	"""

SnakeMake From line 237 of master/Snakefile

shell:
        """
        configureStrelkaGermlineWorkflow.py \
                --bam={input.bam} \
                --referenceFasta={input.ref} \
                --rna \
                --runDir=output/{wildcards.sample}/StrelkaRNA &> {log}
        output/{wildcards.sample}/StrelkaRNA/runWorkflow.py -m local -j {threads} &> {log} 
        """

SnakeMake From line 258 of master/Snakefile

shell:
    "zcat {input.vcf} | grep -E '(PASS|#)' | bgzip > {output} 2> {log} && rm -rf output/{wildcards.sample}/StrelkaRNA/"

SnakeMake From line 275 of master/Snakefile

shell:
    "tabix {input.vcf} &> {log}"

SnakeMake tabix From line 285 of master/Snakefile

    		shell:
        		"""
        		bcftools isec {input.vcf2} {input.vcf1} -p output/{wildcards.sample}/isec -n =2 -w 1 &> {log} 
        		mv output/{wildcards.sample}/isec/0000.vcf {output} 
				rm -rf output/{wildcards.sample}/isec
        		"""

SnakeMake BCFtools From line 301 of master/Snakefile

shell: "bgzip {input} &> {log}"

SnakeMake From line 313 of master/Snakefile

shell:
    """
    snpEff -Xmx{params.heapSize} \
        -v {params.genome} \
        -c {params.snpEff_config} \
        -dataDir {params.snpEff_datadir} \
        -noStats \
        {input} > {output} 2> {log}
    """

SnakeMake snpEff From line 334 of master/Snakefile

shell:
    """
    SnpSift -Xmx{params.heapSize} filter "( exists ANN[0].GENE )" {input} > {output.geneFilter} 2> {log}

    SnpSift -Xmx{params.heapSize} extractFields {output.geneFilter} \
        CHROM POS GEN[0].AD REF ALT ANN[0].GENE ANN[0].BIOTYPE > {output.tsv} 2> {log}
    """

SnakeMake SnpSift From line 352 of master/Snakefile

    		shell:
        		"""
				Rscript src/mbased.snpEff.R \
					--threads={threads} \
					--phase={input.phase} \
					--rna={input.tsv} \
					--outdir=output/{wildcards.sample}/2_mBASED &> {log}
			"""

SnakeMake From line 374 of master/Snakefile

shell:
	"""
	Rscript src/mbased.snpEff.R \
			--threads={threads} \
			--rna={input.tsv} \
			--outdir=output/{wildcards.sample}/2_mBASED &> {log}
	"""

SnakeMake From line 391 of master/Snakefile

shell:
	"""
	Rscript src/addExpression.R \
		--mbased={input.rds} \
		--sample={wildcards.sample} \
		--rpkm={input.rpkm} \
		--min=1 \
		--maf_threshold={params.maf} \
		--outdir=output/{wildcards.sample}/2_mBASED &> {log}
	"""

SnakeMake From line 409 of master/Snakefile

shell:
	"""
	mkdir -p output/{wildcards.sample}/figures

	Rscript src/figures.R \
		--mbased={input.txt} \
		--rpkm={input.rpkm} \
		--gene={input.bed} \
		--sample={wildcards.sample} \
		--maf_threshold={params.maf} \
		--outdir=output/{wildcards.sample}/figures &> {log}
	"""

SnakeMake From line 431 of master/Snakefile

shell:	
	"""
	mkdir -p output/{wildcards.sample}/3_cancer/raw

	cat {input} | cut -f1 > {output.gene} 2> {log}
	awk 'NR == FNR {{ keywords[$1]=1; next; }} {{ if ($4 in keywords) print; }}' {output.gene} {params.annotation} | \
		cut -f 1,2,3,4 | uniq > {output.bed} 2> {log}
	"""

SnakeMake From line 458 of master/Snakefile

shell:
	"""
	cut -f 1,2 {input} > {output} 2> {log}
	"""

SnakeMake From line 472 of master/Snakefile

shell:
	"""
	bedtools flank -l 2000 -r 500 -i {input.gene} -g {input.length} > {output} 2> {log}
	"""

SnakeMake BEDTools From line 486 of master/Snakefile

shell:
	"""
	Rscript src/cnv_preprocess.R \
		--cnv={input} \
		--tumorContent={params.tumor} \
		--outdir=output/{wildcards.sample}/3_cancer/raw &> {log}
	"""

SnakeMake From line 499 of master/Snakefile

shell:
	"""
	mkdir -p output/{wildcards.sample}/3_cancer/intersect
	bedtools intersect -loj -a {input.gene} -b {input.cnv} | awk '$10 != "." {{print $0}}' > {output} 2> {log}
	"""

SnakeMake BEDTools From line 514 of master/Snakefile

shell:
        """
        Rscript src/methyl_preprocess.R \
                --methyl={input} \
                --outdir=output/{wildcards.sample}/3_cancer/raw &> {log}
        """

SnakeMake From line 530 of master/Snakefile

shell:
	"""
	mkdir -p output/{wildcards.sample}/3_cancer/intersect
	bedtools intersect -loj -a {input.gene} -b {input.methyl} | awk '$10 != "." {{print $0}}' > {output} 2> {log}
	"""

SnakeMake BEDTools From line 545 of master/Snakefile

shell:
	"""
	mkdir -p output/{wildcards.sample}/3_cancer/tfbs
	awk 'BEGIN {{print "sequence_id\tgene"}}; {{print $1 ":" $2 "-" $3 "\t" $4}}' {input} > {output.id2gene} 2> {log}
	cut -f1 {output.id2gene} | tail -n +2 > {output.id}
	"""

SnakeMake From line 563 of master/Snakefile

shell:
	"samtools faidx -r {input} {params.ref} > {output} 2> {log}"

SnakeMake SAMtools From line 578 of master/Snakefile

shell:
	"""
	cat {input.ref_promoter} | bcftools consensus {input.phase_vcf} -H {wildcards.num}pIu > {output} 2> {log}
	"""

SnakeMake BCFtools Consensus From line 589 of master/Snakefile

shell:
	"""
	fimo --text {params.motifFile} {input} > {output} 2> {log}
	"""

SnakeMake FiMO From line 601 of master/Snakefile

shell:
	"""
	Rscript src/compareTFBS.R \
		--allele1={input.a1} \
		--allele2={input.a2} \
		--expression_matrix={input.expression_matrix} \
		--id2gene={input.id2gene} \
		--tf={params.tf_list} \
		--min=10 \
		--outdir=output/{wildcards.sample}/3_cancer/tfbs \
		--sample={wildcards.sample} 2> {log}
	"""

SnakeMake From line 616 of master/Snakefile

shell:
	"""
	mkdir -p output/{wildcards.sample}/3_cancer/stopVar

	snpEff -Xmx{params.heapSize} \
		-v {params.genome} \
		-c {params.snpEff_config} \
		-dataDir {params.snpEff_datadir} \
		-noStats \
		{input} > {output} 2> {log}

	"""	

SnakeMake snpEff From line 643 of master/Snakefile

shell:
	"""
	cat {input} | src/vcfEffOnePerLine.pl > {output} 2> {log}
	"""

SnakeMake From line 660 of master/Snakefile

shell:
	"""
	SnpSift -Xmx{params.heapSize} \
		extractFields {input} \
		ANN[0].GENE ANN[0].EFFECT GEN[0].GT FILTER > {output} 2> {log}
	"""

SnakeMake SnpSift From line 672 of master/Snakefile

shell:
	"""
	Rscript src/stopVariant.R \
		--annotation={input} \
		--outdir=output/{wildcards.sample}/3_cancer/stopVar &> {log}
	"""	

SnakeMake From line 684 of master/Snakefile

shell:
	"""
	mkdir -p output/{wildcards.sample}/3_cancer/somatic
	bedtools slop -l 5000 -r 1000 -i {input.gene} -g {input.length} > {output} 2> {log}
	"""

SnakeMake BEDTools From line 703 of master/Snakefile

shell:
	"""
	bedtools intersect -a {input.genes} -b {input.variants} > {output} 2> {log}
	"""

SnakeMake BEDTools From line 717 of master/Snakefile

shell:
	"""
	Rscript src/summaryTable.R \
		--cnv={input.cnv} \
		--methyl={input.methyl} \
		--tfbs={input.tfbs} \
		--stop={input.stopVar} \
		--snv={input.snv} \
		--indel={input.indel} \
		--ase={input.ase} \
		--sample={wildcards.sample} \
		--cancer={params.cancer} \
		--tissue={params.tissue} \
		--tumorContent={params.tumorContent} \
		--normal={params.normal} \
		--outdir=output/{wildcards.sample} 2> {log}
	"""

SnakeMake From line 744 of master/Snakefile

shell:
	"""
	mkdir -p output/{wildcards.sample}/figures/tables

	Rscript src/cancerFigures.R \
		--summary={input} \
		--outdir=output/{wildcards.sample}/figures \
		--sample={wildcards.sample} &> {log}
	"""

SnakeMake From line 772 of master/Snakefile

	shell:
		"""
		Rscript src/karyogramFigure.R \
        		--chromSize={input.chromSize} \
        		--centPos={input.centromere} \
        		--cna={input.cnv} \
        		--dmr={input.dmr} \
        		--ase={input.ase} \
        		--genes={input.annotation} \
        		--out=output/{wildcards.sample}/figures/karyogram &> {log}
		"""

SnakeMake From line 797 of master/Snakefile

suppressMessages(library(optparse))
suppressMessages(library(dplyr))

## ---------------------------------------------------------------------------
## OPTIONS
## ---------------------------------------------------------------------------

# Make help options
option_list = list(
  make_option(c("-b", "--mbased"), type="character", default=NULL,
              help="mbased rds file", metavar="character"),
  make_option(c("-s", "--sample"), type="character", default = NULL,
              help="Sample name from the RPKM matrix (HTMCP written like e.g. HTMCP.03.06.02109)", metavar="character"),
  make_option(c("-r", "--rpkm"), type="character", default = NULL,
              help="RPKM matrix", metavar="character"),
  make_option(c("-m", "--min"), type="numeric", default = 1,
              help="Minimum RPKM value", metavar="numeric"),
  make_option(c("-t", "--maf_threshold"), type="numeric", default = 0.60,
              help="Threshold for MAF to consider as ASE", metavar="numeric"),
  make_option(c("-o", "--outdir"), type="character", default = "mBASED",
              help="Output directory name", metavar="character")
)

## ---------------------------------------------------------------------------
## VARIABLES
## ---------------------------------------------------------------------------

opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)

out <- opt$outdir
sample <- opt$sample
rpkm <- read.delim(opt$rpkm, header = T, stringsAsFactors = F) 
results <- readRDS(opt$mbased)
min <- opt$min
maf_threshold <- opt$maf_threshold 

## ---------------------------------------------------------------------------
## VARIABLES
## ---------------------------------------------------------------------------

print("Adding expression")

# fix sample name
sample <- ifelse(length(grep("-", sample)) == 0, sample, gsub("-", ".", sample))

# select the RPKM of this sample
rpkm_sample <- rpkm[,c("gene", sample)] 

# expressed genes in the sample
results$geneOutput$RPKM <- rpkm_sample[match(results$geneOutput$gene, gsub(" ", "", rpkm_sample$gene, fixed = TRUE)), 2]
results_filt <- results$geneOutput[results$geneOutput$RPKM > min, ]

# filter for genes that have an RPKM calculated
results_filt <- results_filt[!is.na(results_filt$RPKM),] 

# MAF filter
results_filt$MAF <- as.factor(ifelse(results_filt$majorAlleleFrequency > maf_threshold, paste0("MAF > ", maf_threshold), paste0("MAF < ", maf_threshold)))
results_filt$aseResults <- as.factor(ifelse(results_filt$majorAlleleFrequency > maf_threshold & results_filt$padj < 0.05, "ASE", "BAE"))

# rearrange columns to a logical order
results_filt <- results_filt[,c("gene", "geneBiotype", "RPKM", "allele1IsMajor","majorAlleleFrequency", 
                                "pValueASE", "pValueHeterogeneity", "padj",
                                "significance", "MAF", "aseResults")]

# save the data frame as a table
write.table(results_filt, paste0(out, "/MBASED_expr_gene_results.txt"), quote = F, col.names = T, row.names = F, sep = "\t")

R dplyr optparse From line 6 of src/addExpression.R

suppressMessages(library(tidyverse))
suppressMessages(library(optparse))
suppressMessages(library("ggsci"))


# Make help options
option_list = list(
  make_option(c("-s", "--summary"), type="character", default=NULL,
              help="Summary Table file", metavar="character"),
  make_option(c("-o", "--outdir"), type="character", default = NULL,
              help="Output directory name", metavar="character"),
  make_option(c("-a", "--sample"), type="character", default = NULL,
              help="Sample name", metavar="character")
)


# load in options 
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
out <- opt$outdir
sample <- opt$sample

summary_path <- opt$summary
summary_table <- read.delim(summary_path, header = T, stringsAsFactors = F, sep = "\t")


column <- colnames(summary_table)

############
#CNV Figures
############
if ("cnv_state" %in% column) {
  print("Creating CNV figure...")
  cnvBar <- summary_table %>%
    dplyr::filter(!is.na(cnv_state)) %>%
    dplyr::filter(padj <= 0.05) %>%
    ggplot(aes(cnv_state)) + 
    geom_bar(aes(fill = aseResults), position = "dodge") +
    ggtitle(paste0("Gene frequency in each copy number variant state"), 
            subtitle = sample) +
    ylab("Gene Count") + 
    xlab("Copy number variant state") + 
    theme_bw() + 
    scale_color_npg() + 
    scale_fill_npg() + 
    theme(axis.line = element_line(colour = "black"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank()) 


  ggsave(filename = paste0(out, "/cnvBar.pdf"), plot = cnvBar, units = "in")


  expectMAF <- summary_table %>%
    dplyr::filter(!is.na(cnv_state)) %>%
    dplyr::filter(padj <= 0.05) %>%
    dplyr::mutate(cnvRatioDiff = majorAlleleFrequency - expectedMAF) %>%
    ggplot(aes(cnvRatioDiff, padj)) + 
    geom_point(aes(colour = cnv_state)) +
    geom_vline(xintercept = 0.10) +
    geom_hline(yintercept = 0.05) +
    xlab("MAF - Expectd MAF from CNV") + 
    ylab("Adjusted pvalue") + 
    ggtitle(paste0("Copy Number Variants adjusted for expected MAF"), 
            subtitle = sample) +
    theme_bw() + 
    scale_fill_npg() +
    scale_color_npg() + 
    theme(axis.line = element_line(colour = "black"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank()) 

  ggsave(filename = paste0(out, "/expectMAF.pdf"), plot = expectMAF, units = "in")

  #### TEST CORRELATION PLOT ####
  correlationPlot <- summary_table %>%
    dplyr::filter(cnv_state == "imbalance") %>%
    dplyr::filter(padj <= 0.05) %>%
    dplyr::filter(!is.na(cnv_state)) %>%
    dplyr::mutate(rawExpectedMAF = pmax(cnv.A, cnv.B)/(cnv.A + cnv.B)) %>%
    ggplot(aes(majorAlleleFrequency, rawExpectedMAF)) +
    geom_point(aes(color = aseResults), alpha = 0.5) +
    geom_smooth(method = "lm", se = FALSE) +
    ylab("CNV expected MAF") + 
    xlab("MBASED Major Allele Frequency") + 
    ggtitle("Correlation between CNV expected MAF and \nMBASED MAF for CNV Imbalanced genes",
            subtitle = sample) + 
    theme_bw() + 
    scale_fill_npg() +
    scale_color_npg() + 
    theme(axis.line = element_line(colour = "black"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank()) 
    ggsave(filename = paste0(out, "/cnvMAFcorrlation.pdf"), plot = correlationPlot, units = "in")
}


##############
# DMR Figures
##############

if ("methyl_state" %in% column) {
  print("Creating DMR figures...")

  if ("cnv_state" %in% column) {
  dmr <- summary_table %>%
    dplyr::filter(!is.na(methyl_state)) %>%
    dplyr::filter(cnv_state == "balance") %>%
    dplyr::filter(aseResults == "ASE") %>%
    dplyr::mutate(methylation = case_when(
      methyl_state < 0 ~ "allele2Methyl",
      TRUE ~ "allele1Methyl"
    )) %>%
    dplyr::mutate(expression = case_when(
      allele1IsMajor ~ "alelle1Expression",
      TRUE ~ "alelle2Expression"
    )) %>%
    dplyr::select(methylation, expression) %>%
    group_by(methylation, expression) %>%
    summarize(n=n())

  } else { 
  dmr <- summary_table %>%
    dplyr::filter(!is.na(methyl_state)) %>%
    dplyr::filter(aseResults == "ASE") %>%
    dplyr::mutate(methylation = case_when(
      methyl_state < 0 ~ "allele2Methyl",
      TRUE ~ "allele1Methyl"
    )) %>%
    dplyr::mutate(expression = case_when(
      allele1IsMajor ~ "alelle1Expression",
      TRUE ~ "alelle2Expression"
    )) %>%
    dplyr::select(methylation, expression) %>%
    group_by(methylation, expression) %>%
    summarize(n=n())
  } 

  dmr_contigency <- ggplot(data =  dmr, mapping = aes(x = methylation, y = expression)) +
    geom_tile(aes(fill = n), colour = "white") +
    geom_text(aes(label = sprintf("%1.0f", n)), vjust = 1, colour = "white") +
    scale_fill_gradient(low = "blue", high = "red") +
    ggtitle("Allelic Methylation and Expression Contigency Table", subtitle = sample) +
    theme_bw() + 
    theme(legend.position = "none", 
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(), 
          panel.border = element_blank(),
          axis.ticks.x = element_blank(),
          axis.ticks.y = element_blank(), 
          axis.title.x=element_blank(), 
          axis.title.y=element_blank())  

  ggsave(filename = paste0(out, "/dmrContingency.pdf"), plot = dmr_contigency, units = "in")

  dmr %>%
    dplyr::mutate(sample = sample) %>%
    write.table(file = paste0(out, "/tables/dmrContingency.tsv"),
                quote = F, sep = "\t", row.names = T, col.names = T)
}




##############
# Stop Var Figures
##############
if ("stop_variant_allele" %in% column) {
  print("Creating Stop gain/loss figures...")

  stop <- summary_table %>%
    dplyr::select(gene, allele1IsMajor, aseResults, stop_variant_allele) %>% 
    dplyr::filter(!is.na(stop_variant_allele)) %>%
    dplyr::filter(aseResults == "ASE") %>%
    dplyr::mutate(expression = case_when(
      allele1IsMajor ~ "alelle1Expression",
      TRUE ~ "alelle2Expression"
    )) %>%
    dplyr::mutate(stop = case_when(
      stop_variant_allele == 1 ~ "alelle1StopVar",
      TRUE ~ "alelle2StopVar"
    )) %>%
    group_by(expression, stop) %>%
    summarize(n=n()) 

  stop %>%
    dplyr::mutate(sample = sample) %>%
    write.table(file = paste0(out, "/tables/stopVarContingency.tsv"),
                quote = F, sep = "\t", row.names = T, col.names = T)

  stopContingency <- ggplot(data =  stop, mapping = aes(x = stop, y = expression)) +
      geom_tile(aes(fill = n), colour = "white") +
      geom_text(aes(label = sprintf("%1.0f", n)), vjust = 1, colour = "white") +
      scale_fill_gradient(low = "blue", high = "red") +
      theme_bw() +
      ggtitle("ASE vs Stop gain/loss contingency table", subtitle = sample) + 
      theme(legend.position = "none", 
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(), 
            panel.border = element_blank(),
            axis.ticks.x = element_blank(),
            axis.ticks.y = element_blank(), 
            axis.title.x=element_blank(), 
            axis.title.y=element_blank()) 
  ggsave(filename = paste0(out, "/stopVarContingency.pdf"), plot = stopContingency, units = "in")

  stopVarTable <- summary_table %>%
    dplyr::mutate(stopVar = !is.na(stop_variant_allele)) %>%
    dplyr::select(aseResults, stopVar) %>%
    group_by(aseResults, stopVar) %>%
    summarize(n=n()) %>%
    dplyr::mutate(percent = n/sum(n)) %>%
    dplyr::mutate(sample = sample)

  write.table(stopVarTable, file = paste0(out, "/tables/stopVarTable.tsv"),
              quote = F, sep = "\t", row.names = F, col.names = T)

  stopBar <-  stopVarTable %>%
    dplyr::filter(stopVar) %>%
    ggplot(aes(aseResults, percent)) +
    geom_bar(aes(fill = aseResults ), stat = "identity") +
    ylab("Proportion of genes") +
    coord_flip() +
    ggtitle("Proportion of genes with stop gain or loss variant", 
            subtitle = sample) + 
    theme_bw() + 
    scale_fill_npg() +
    scale_color_npg() + 
    theme(axis.line = element_line(colour = "black"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank(), 
          legend.position = "none") 
  ggsave(filename = paste0(out, "/stopVarBar.pdf"), plot = stopBar, units = "in")
}




##############
# Somatic SNV
##############
if ("somaticSNV" %in% column) {
  print("Creating somatic SNV figures...")

  snvTable <- summary_table %>%
    dplyr::select(gene, aseResults, somaticSNV) %>%
    group_by(aseResults, somaticSNV) %>%
    summarize(n=n()) %>%
    dplyr::mutate(percent = n/sum(n)) %>%
    dplyr::mutate(sample = sample)

  write.table(snvTable, file = paste0(out, "/tables/snvTable.tsv"),
              quote = F, sep = "\t", row.names = F, col.names = T)

  snv <- snvTable %>%
    dplyr::filter(somaticSNV) %>%
    ggplot(aes(aseResults, percent)) +
    geom_bar(aes(fill = aseResults), stat = "identity") +
    ylab("Proportion of genes") +
    coord_flip() +
    ggtitle("Proportion of genes with somatic SNV mutation", 
            subtitle = sample) + 
    theme_bw() + 
    scale_fill_npg() +
    scale_color_npg() + 
    theme(axis.line = element_line(colour = "black"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank(), 
          legend.position = "none") 
  ggsave(filename = paste0(out, "/somaticSNVbar.pdf"), plot = snv, units = "in")
}



##############
# Somatic Indel
##############
if ("somaticIndel" %in% column) {
  print("Creating somatic indel figures...")

  indelTable <- summary_table %>%
    dplyr::select(gene, aseResults, somaticIndel) %>%
    group_by(aseResults, somaticIndel) %>%
    summarize(n=n()) %>%
    dplyr::mutate(percent = n/sum(n)) %>%
    dplyr::mutate(sample = sample)

  write.table(indelTable, file = paste0(out, "/tables/indelTable.tsv"),
              quote = F, sep = "\t", row.names = F, col.names = T)

  indel <- indelTable %>%
    dplyr::filter(somaticIndel) %>%
    ggplot(aes(aseResults, percent)) +
    geom_bar(aes(fill = aseResults), stat = "identity") +
    ylab("Proportion of genes") +
    coord_flip() +
    ggtitle("Proportion of genes with somatic indel mutation", 
            subtitle = sample) + 
    theme_bw() + 
    scale_fill_npg() +
    scale_color_npg() + 
    theme(axis.line = element_line(colour = "black"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank(), 
          legend.position = "none") 
  ggsave(filename = paste0(out, "/somaticIndelBar.pdf"), plot = indel, units = "in")
}


##############
# TFBS Figures
##############
if ("tf_allele" %in% column) {
  print("Creating TFBS figures...")

  tfbsTable <- summary_table %>%
    dplyr::select(gene, aseResults, transcriptionFactor) %>%
    dplyr::mutate(tfbsVar = !is.na(transcriptionFactor)) %>%
    group_by(aseResults, tfbsVar) %>%
    summarize(n=n()) %>%
    dplyr::mutate(percent = n/sum(n)) %>%
    dplyr::mutate(sample = sample)
  write.table(tfbsTable, file = paste0(out, "/tables/tfbsTable.tsv"),
              quote = F, sep = "\t", row.names = F, col.names = T)

  tfbs <- tfbsTable %>%
    dplyr::filter(tfbsVar) %>%
    ggplot(aes(aseResults, percent)) +
    geom_bar(aes(fill = aseResults), stat = "identity")+
    ylab("Proportion of genes") +
    coord_flip() +
    ggtitle("Proportion of genes with mutation in \ntranscription factor binding site", 
            subtitle = sample) + 
    theme_bw() + 
    scale_fill_npg() +
    scale_color_npg() + 
    theme(axis.line = element_line(colour = "black"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank(), 
          legend.position = "none") 
  ggsave(filename = paste0(out, "/tfbsBar.pdf"), plot = tfbs, units = "in")
}


#######################
# Cancer Figures
#######################
cancer_bar <- summary_table %>%
  dplyr::filter(cancer_gene) %>%
  ggplot(aes(aseResults)) + 
  geom_bar(aes(fill = aseResults)) +
  theme_bw() +
  scale_fill_npg() +
  scale_color_npg() +
  ggtitle("MBASED result for cancer genes", subtitle=sample) +
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank(), 
        legend.position = "none") 
ggsave(filename = paste0(out, "/cancerBar.pdf"), plot = cancer_bar, units = "in")

#######################
# Mechanism function
#######################

causeFunction <- function(dataset) {
  summaryTableExplain <- dataset %>%
    dplyr::filter(aseResults == "ASE") 
  ASEunknown <- summaryTableExplain
  cause_final <- tibble()

  if ("cnv_state" %in% column) {

    loh <- summaryTableExplain %>%
      dplyr::filter(cnv_state == "LOH") %>%
      nrow()
    cause_final <- bind_rows(cause_final, tibble(cause = "LOH", num = loh))

    ASEunknown <- ASEunknown %>%
      dplyr::filter(cnv_state != "LOH")

    summaryTableExplain$cnvRatioDiff = summaryTableExplain$majorAlleleFrequency - summaryTableExplain$expectedMAF
    ASEunknown$cnvRatioDiff = ASEunknown$majorAlleleFrequency - ASEunknown$expectedMAF

    cnv <- summaryTableExplain %>%
      dplyr::filter(cnv_state != "LOH") %>%
      dplyr::filter(cnvRatioDiff< 0.10) %>%
      nrow()
    cause_final <- bind_rows(cause_final, tibble(cause = "CNV imbalance", num = cnv))
    ASEunknown <- ASEunknown %>%
      dplyr::filter(cnvRatioDiff <= 0.10)
  }

  if ("somaticSNV" %in% column) {
    somaticSNV <- summaryTableExplain %>%
      dplyr::filter(somaticSNV) %>%
      nrow()
    cause_final <- bind_rows(cause_final, tibble(cause = "Somatic SNV", num = somaticSNV))
    ASEunknown <- ASEunknown %>%
      dplyr::filter(!somaticSNV)
  }
  if ("somaticIndel" %in% column) {
    somaticIndel <- summaryTableExplain %>%
      dplyr::filter(somaticIndel) %>%
      nrow()
    cause_final <- bind_rows(cause_final, tibble(cause = "Somatic Indel", num = somaticIndel))
    ASEunknown <- ASEunknown %>%
      dplyr::filter(!somaticIndel)
  }

  if ("methyl_state" %in% column) {
    methyl <- summaryTableExplain %>%
      dplyr::filter((methyl_state < 0 & allele1IsMajor) | (methyl_state > 0 & !allele1IsMajor)) %>%
      nrow()
    cause_final <- bind_rows(cause_final, tibble(cause = "Allelic methylation", num = methyl))
    ASEunknown <- ASEunknown %>%
      dplyr::filter(is.na(methyl_state) | !((methyl_state < 0 & allele1IsMajor) | (methyl_state > 0 & !allele1IsMajor)))
  }

  if ("stop_variant_allele" %in% column) {
    stop <- summaryTableExplain %>% 
      dplyr::filter((allele1IsMajor & stop_variant_allele == 2) | (!allele1IsMajor & stop_variant_allele == 1)) %>%
      nrow()
    cause_final <- bind_rows(cause_final, tibble(cause = "Stop variant", num = stop))
    ASEunknown <- ASEunknown %>%
      dplyr::filter(is.na(stop_variant_allele) | !((allele1IsMajor & stop_variant_allele == 2) | (!allele1IsMajor & stop_variant_allele == 1)))
  }
  if ("tf_allele" %in% column) {
    tfbs <- summaryTableExplain %>%
      dplyr::filter(!is.na(transcriptionFactor)) %>%
      nrow()
    cause_final <- bind_rows(cause_final, tibble(cause = "TFBS variant", num = tfbs))
    ASEunknown <- ASEunknown %>%
      dplyr::filter(is.na(transcriptionFactor))
  }

  unknown <- nrow(ASEunknown)
  cause_final <- bind_rows(cause_final, tibble(cause = "Unknown", num = unknown))


   return(cause_final)
}

################
# ASE mechanism
################
aseMechanism <- causeFunction(summary_table)

aseCause <- aseMechanism %>%
  ggplot(aes(reorder(cause, -num), num)) + 
  geom_bar(aes(fill = cause), stat = "identity") +
  theme_bw() + 
  scale_fill_npg() +
  scale_color_npg() + 
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank(), 
        legend.position = "none") + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
  xlab("ASE mechanism") + 
  ylab("Gene frequency") + 
  ggtitle("Frequency of all ASE genes for each genetic mechanism", 
          subtitle = sample)

ggsave(filename = paste0(out, "/aseCause.pdf"), plot = aseCause, units = "in")

write.table(aseMechanism, file = paste0(out, "/tables/aseCause.tsv"),
            quote = F, sep = "\t", row.names = F, col.names = T)

################################
# Cancer ASE mechanism
################################
cancerMechanism <- summary_table %>%
  dplyr::filter(cancer_gene) %>%
  causeFunction()


cancerCause <- cancerMechanism %>%
  ggplot(aes(reorder(cause, -num), num)) + 
  geom_bar(aes(fill = cause), stat = "identity") +
  theme_bw() + 
  scale_fill_npg() +
  scale_color_npg() + 
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank(), 
        legend.position = "none") + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
  xlab("ASE mechanism") + 
  ylab("Gene frequency") + 
  ggtitle("Frequency of cancer ASE genes for each genetic mechanism", 
          subtitle = sample)
ggsave(filename = paste0(out, "/cancerCause.pdf"), plot = cancerCause, units = "in")


write.table(cancerMechanism, file = paste0(out, "/tables/cancerCause.tsv"),
            quote = F, sep = "\t", row.names = F, col.names = T)

R ggplot2 tidyverse optparse ggsci From line 1 of src/cancerFigures.R

suppressMessages(library(tidyverse))
suppressMessages(library(optparse))

# Make help options
option_list = list(
  make_option(c("-c", "--cnv"), type="character", default=NULL,
              help="CNV file (from Ploidetect)", metavar="character"),
  make_option(c("-o", "--outdir"), type="character", default = "mBASED",
              help="Output directory name", metavar="character"),
  make_option(c("-t", "--tumorContent"), type="double", default = 1.0,
              help="Tumor Content (0.0-1.0)", metavar="character")

)

# load in options 
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
out <- opt$outdir
TumorContent <- opt$tumorContent

cnv <- read.delim(opt$cnv, header = T, comment.char = "#", stringsAsFactors = F)

cnv %>%
  dplyr::mutate(type = case_when(
    zygosity == "HOM" ~ "LOH",
    A != B ~ "imbalance",
    A == B ~ "balance"
  )) %>%
  dplyr::select(chr, pos, end, A, B, type) %>%
  #dplyr::mutate(chr2 = case_when(
  #  grepl("^chr", chr) ~ chr,
  #  TRUE ~ paste0("chr", chr)
  #)) %>%
  dplyr::mutate(chr = ifelse(grepl("^chr", chr), chr, paste0("chr", chr))) %>%
  dplyr::mutate(rawExpectedMAF = pmax(A, B)/(A + B)) %>%
  dplyr::mutate(expectedMAF = (rawExpectedMAF * TumorContent) + (0.5 * (1 - TumorContent))) %>%
  dplyr::select(-rawExpectedMAF) %>%
  dplyr::mutate(pos = gsub(" ", "", format(pos, scientific=F), fixed = TRUE)) %>%
  dplyr::mutate(end = gsub(" ", "", format(end, scientific=F), fixed = TRUE)) %>%
  write.table(paste0(out, "/cnv.bed"), 
              quote = F, sep = "\t", row.names = F, col.names = F)

R tidyverse optparse From line 1 of src/cnv_preprocess.R

suppressMessages(library(optparse))
suppressMessages(library(tidyverse))

option_list = list(
  make_option(c("-e", "--expression_matrix"), type="character", default=NULL,
              help="Expression Matrix", metavar="character"),
  make_option(c("-t", "--tf"), type="character", default = NULL,
              help="Transcription Factor", metavar="character"),
  make_option(c("-a", "--allele1"), type="character", default = NULL,
              help="Allele 1", metavar="character"),
  make_option(c("-b", "--allele2"), type="character", default = NULL,
              help="Allele 2", metavar="character"),
  make_option(c("-i", "--id2gene"), type="character", default = NULL,
              help="id to gene tsv", metavar="character"),
  make_option(c("-m", "--min"), type="integer", default = 10,
              help="min expression level", metavar="character"),
  make_option(c("-o", "--outdir"), type="character", default = NULL,
              help="output directory", metavar="character"),
  make_option(c("-s", "--sample"), type="character", default = NULL,
              help="sample name", metavar="character")
)

# load in options
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
min <- opt$min
outdir <- opt$outdir
sample <- opt$sample

## GET EXPRESSED TF
expressionMatrix <- read.delim(opt$expression_matrix, sep = "\t", header = T)
transcriptionFactor <- read.delim(opt$tf, sep = "\t", header = T) %>%
  separate(Transcription.factor, sep = ":", into=c("human", "transcriptionFactor")) %>%
  dplyr::select(-human)

expressionMatrix <- expressionMatrix[,c("gene", sample)]
colnames(expressionMatrix) <- c("gene", "expression")

expressionMatrix_filt <- expressionMatrix %>%
  dplyr::filter(gene %in% transcriptionFactor$transcriptionFactor) %>%
  dplyr::filter(expression > min)

tf_expression <- transcriptionFactor %>%
  left_join(expressionMatrix_filt, by=c("transcriptionFactor"="gene")) %>%
  dplyr::filter(!is.na(expression))

# GET ALLELE 1 TFBS
allele1MEME <- read.delim(opt$allele1) %>%
  dplyr::mutate(id = paste0(motif_id, "::", sequence_name)) %>%
  pull(id)

# GET ALLELE 2 TFBS
allele2MEME <- read.delim(opt$allele2) %>%
  dplyr::mutate(id = paste0(motif_id, "::", sequence_name)) %>%
  pull(id)

# GET GAIN OR LOSS OF TFBS
Gene <- read.delim(opt$id2gene, header = T)

difference <- rbind(tibble(id = setdiff(allele1MEME, allele2MEME), allele = 1),
                    tibble(id = setdiff(allele2MEME, allele1MEME), allele = 2)) %>%
  separate(id, sep = "::", into = c("motif_id", "sequence_id")) %>%
  left_join(Gene) %>%
  dplyr::filter(motif_id %in% tf_expression$Model) %>%
  left_join(tf_expression, by = c("motif_id"="Model")) %>%
  dplyr::select(transcriptionFactor, gene, allele) %>%
  write_tsv(paste0(outdir, "/motifDiff.tsv"))

R tidyverse optparse From line 1 of src/compareTFBS.R

suppressMessages(library(optparse))
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
suppressMessages(library(RColorBrewer))
suppressMessages(library(tibble))
suppressMessages(library(chromPlot))
suppressMessages(library(networkD3))
suppressMessages(library(htmlwidgets))
suppressMessages(library(reshape2))
suppressMessages(library(ggrepel))
suppressMessages(library(ggsci))

## ---------------------------------------------------------------------------
## FIGURES
## ---------------------------------------------------------------------------


# Make help options
option_list = list(
  make_option(c("-b", "--mbased"), type="character", default=NULL,
              help="mbased dataframe output", metavar="character"),
  make_option(c("-r", "--rpkm"), type="character", default=NULL,
              help="RPKM matrix", metavar="character"),
  make_option(c("-g", "--gene"), type="character", default = NULL,
              help="Ensembl gene annotation", metavar="character"),
  make_option(c("-s", "--sample"), type="character", default = NULL,
              help="Sample name", metavar="character"),
  make_option(c("-m", "--min"), type="numeric", default = 1,
              help="Minimum RPKM value", metavar="numeric"),
  make_option(c("-t", "--maf_threshold"), type="numeric", default = 0.75,
              help="Minimum RPKM value", metavar="numeric"),
  make_option(c("-o", "--outdir"), type="character", default = NULL,
              help="Output directory", metavar="character")
)

### 
### SET UP THE DATAFRAME
###

# load in options 
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
df <- read.delim(opt$mbased, header = T, stringsAsFactors = F)
rpkm <- read.delim(opt$rpkm, header = T, stringsAsFactors = F)
all_genes <- read.delim(opt$gene, header = F, stringsAsFactors = F)
out <- opt$outdir
sample <- opt$sample
min <- opt$min
maf_threshold <- opt$maf_threshold

mafG_padjG <- paste0("MAF > ", maf_threshold, " & padj > 0.05")
mafL_padjG <- paste0("MAF < ", maf_threshold, " & padj > 0.05")
mafG_padjL <- paste0("MAF > ", maf_threshold, " & padj < 0.05")
mafL_padjL <- paste0("MAF < ", maf_threshold, " & padj < 0.05")


# fix sample name
sample <- ifelse(length(grep("-", sample)) == 0, sample, gsub("-", ".", sample))

# select sample
rpkm_sample <- rpkm[,c("gene", sample)] 
colnames(rpkm_sample) <- c("gene", "expr")

# make a colour filter
df$colour_filt <- ifelse(df$padj < 0.05 & df$majorAlleleFrequency > maf_threshold, mafG_padjL,
                         ifelse(df$padj > 0.05 & df$majorAlleleFrequency > maf_threshold, mafG_padjG,
                                ifelse(df$padj > 0.05 & df$majorAlleleFrequency < maf_threshold, mafL_padjG, mafL_padjL)))

# add the chromosome
df$chr <- all_genes$V1[match(df$gene, all_genes$V4)]

# set the factor levels
df$colour_filt <- factor(df$colour_filt, levels = c(mafG_padjL, mafL_padjL, 
                                                    mafG_padjG, mafL_padjG))
df$chr <- factor(df$chr, levels = c(paste0("chr", 1:22), "chrX"))
#df <- df[!is.na(df$chr),]

print("Beginning figures ...")

#### 
#### DOTPLOT
####

dotplot <- ggplot(df, aes(x = majorAlleleFrequency, y = padj, colour = colour_filt)) +
  geom_point(alpha = 0.5, size = 2) +
  scale_color_manual(values = c("#e74645", "black", "black", "grey")) +
  theme_bw() + 
  geom_hline(yintercept = 0.05, linetype = 2) +
  geom_vline(xintercept = maf_threshold, linetype = 2) +
  geom_text(aes(label = paste0(table(colour_filt)[mafG_padjL], " ASE genes"), x = 0.9, y = 0.75), size = 4.5, colour = "#e74645") +
  labs(x = "major allele frequency", y = "adjusted pvalue", colour = NULL) +
  theme(legend.position = "none", 
        axis.title = element_text(size = 12, face = "bold", colour = "black"),
        axis.text = element_text(size = 10, colour = "black"))

ggsave(filename = paste0(out, "/aseGenesDot.pdf"), plot = dotplot, width = 5, height = 4, units = "in")

####
#### BARPLOT
####
df_chr <- df[!is.na(df$chr),]
barplot <- ggplot(df_chr, aes(x = chr, fill = colour_filt)) +
  geom_bar() +
  scale_fill_manual(values = rev(c("#e0f0ea","#574f7d", "#95adbe", "#e74645"))) +
  theme_bw() +
  labs(x = "chromosome", y = "number of genes", fill = "ASE results")+
  theme(axis.title = element_text(size = 12, face = "bold", colour = "black"),
        axis.text.y = element_text(size = 10, colour = "black"),
        axis.text.x = element_text(size = 10, colour = "black"),
        legend.text = element_text(size = 10, colour = "black"),
        legend.title = element_text(size = 12, face = "bold", colour = "black"))

ggsave(filename = paste0(out, "/aseGenesBar.pdf"), plot = barplot, width = 12, height = 5, units = "in")


####
#### SANKEY PLOT
####

# set filters on the RPKM matrix
rpkm_sample$gene_biotype <- all_genes$V7[match(rpkm_sample$gene, all_genes$V4)]
rpkm_sample_filt1 <- rpkm_sample[rpkm_sample$gene_biotype %in% c("lincRNA", "miRNA", "protein_coding"),]
rpkm_sample_filt2 <- rpkm_sample_filt1[rpkm_sample_filt1$expr > min,] 

# get the input values for the plot
a <- nrow(rpkm_sample_filt1)
b <- c(nrow(rpkm_sample_filt2),nrow(rpkm_sample_filt1[rpkm_sample_filt1$expr <= min,] ))
c <- c(nrow(df),nrow(rpkm_sample_filt2[!rpkm_sample_filt2$gene %in% df$gene,]))
d <- c(nrow(df[df$padj < 0.05 & df$majorAlleleFrequency > maf_threshold,]),
       sum(nrow(df[df$padj >= 0.05 & df$majorAlleleFrequency <= maf_threshold,]),
           nrow(df[df$padj >= 0.05 & df$majorAlleleFrequency > maf_threshold,]),
           nrow(df[df$padj < 0.05 & df$majorAlleleFrequency <= maf_threshold,])))

# create a connection data frame
links <- data.frame(
  source=c(rep(paste0("All Genes (n=", a, ")"), 2), rep(paste0("Expressed (n=", b[1], ")"), 2), rep(paste0("Phased Genes (n=", c[1], ")"), 2)),
  target=c(paste0("Expressed (n=", b[1], ")"), paste0("Not Expressed (n=", b[2], ")"), 
           paste0("Phased Genes (n=", c[1], ")"), paste0("Unphased Genes (n=", c[2], ")"), 
           paste0("ASE Genes (n=", d[1], ")"), paste0("Biallelic Genes (n=", d[2], ")")), 
  value=c(b, c, d)
)

# create a node data frame: it lists every entities involved in the flow
nodes <- data.frame(
  name=c(as.character(links$source), 
         as.character(links$target)) %>% unique()
)

# Reformat the links
links$IDsource <- match(links$source, nodes$name)-1 
links$IDtarget <- match(links$target, nodes$name)-1

# Make the Network
sankey <- sankeyNetwork(Links = links, Nodes = nodes,
                        Source = "IDsource", Target = "IDtarget",
                        Value = "value", NodeID = "name", 
                        sinksRight=FALSE, fontSize = 18)

saveWidget(sankey, file=paste0(out, "/sankeyPlot.html"), selfcontained = F)


print("Figures completed")

R ggplot2 dplyr ggrepel optparse RColorBrewer reshape2 tibble ggsci htmlwidgets chromPlot From line 7 of src/figures.R

.libPaths("/home/glchang/R/x86_64-pc-linux-gnu-library/4.1")

suppressMessages(library("tidyverse"))
suppressMessages(library("optparse"))

option_list = list(
  make_option(c("-s", "--sample"), type="character", default=NULL,
              help="Sample name", metavar="character"),
  make_option(c("-i", "--input"), type="character", default=NULL,
              help="Gene expression data from RSEM", metavar="character"),
  make_option(c("-a", "--annotation"), type="character", default=NULL,
              help="Annotation file for ensembl id to hgnc name", metavar="character"),
  make_option(c("-o", "--outdir"), type="character", default =NULL,
              help="Output directory name", metavar="character")
)

# load in options 
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
out <- opt$outdir
input <- opt$input
sample <- opt$sample
annotation <- read_delim(opt$annotation, col_names = F, show_col_types = F) %>%
  dplyr::select(X9, X4)

val <- read_delim(input, show_col_types = F) %>%
  dplyr::select(gene_id, TPM) %>%
  left_join(annotation, by = c("gene_id" = "X9")) %>%
  dplyr::select(X4, TPM) %>%
  `colnames<-`(c("gene", sample))

write.table(val, paste0(out, "/expression_matrix.tsv"), 
            col.names = T, row.names = F, quote = F, sep = "\t")

R tidyverse optparse RSEM From line 2 of src/generateExpressionMatrix.R

library(optparse)

## ---------------------------------------------------------------------------
## OPT PARSE
## ---------------------------------------------------------------------------

# options
option_list = list(
  make_option(c("-c", "--chromSize"), type="character",  
              help="chromosome sizes", metavar="character"),
  make_option(c("-p", "--centPos"), type="character", 
              help="centred centromere positions", metavar="character"),
  make_option(c("-n", "--cna"), type="character", 
              help="copy number segment file (preferably condensed)",  metavar="character"),
  make_option(c("-d", "--dmr"), type="character", 
              help="DMR positions", metavar="character"),
  make_option(c("-a", "--ase"), type="character",
              help="ASE summary table file", metavar="character"),
  make_option(c("-g", "--genes"), type="character",
              help="Gene annotation file", metavar="character"),
  make_option(c("-o", "--out"), type="character", 
              help="output file prefix", metavar="character")
)

opt_parser = OptionParser(option_list=option_list)
opt = parse_args(opt_parser)

#Note: these packages need to be installed.
suppressMessages(library(dplyr))
suppressMessages(library(reshape2))
suppressMessages(library(ggplot2))
suppressMessages(library(tidyr))
suppressMessages(library(cowplot))

# chromosome size/position info
#chromSize <- read.delim("/projects/hpv_nanopore_prj/refs/hg38_no_alt_TCGA_HTMCP_HPVs_chromSizes.txt", header = F)
#centPos <-  read.delim("/projects/hpv_nanopore_prj/refs/hg38_centromere_positions_merged.bed", header = F)
#genes <- read.delim("/projects/hpv_nanopore_prj/htmcp/ase/pull_trial/vporter-allelespecificexpression/output/HTMCP.03.06.02058/3_cancer/raw/gene_annotation.bed", header = F)
chromSize <- read.delim(opt$chromSize, header = F)
centPos <-  read.delim(opt$centPos, header = F)
genes <- read.delim(opt$genes, header = F)

## ---------------------------------------------------------------------------
## POSITION CHROMOSOME INFO
## ---------------------------------------------------------------------------

# subset to the main chromosomes
chromSize <- chromSize[chromSize$V1 %in% c(paste0("chr", 1:22), "chrX"),]

# Rename columns to chromosome and size
colnames(chromSize) <- c("chr","size")

# Reorder levels for plotting
chromSize$chr <- factor(chromSize$chr,levels=c(paste0("chr", 1:22), "chrX"))
chromSize <- chromSize[chromSize$chr %in% c(paste0("chr", 1:22), "chrX"),]

# Divide by 1Mb to clean up axis
chromSize$size <- chromSize$size/1000000

# centromere mapping
colnames(centPos) <- c("chr", "start", "end")
centPos$chr <- factor(centPos$chr,levels=c(paste0("chr", 1:22), "chrX"))
centPos$centre <- centPos$start + ((centPos$end - centPos$start)/2)
centPos$centre <- centPos$centre/1000000

## ---------------------------------------------------------------------------
## COPY NUMBER
## ---------------------------------------------------------------------------
if (!is.null(opt$cna) & opt$cna != ""){
  #cna <- read.delim("/projects/hpv_nanopore_prj/htmcp/ploidetect/illumina/Ploidetect-pipeline/ploidetect_out/HTMCP-03-06-02058/A37261_A37189/cna_condensed.txt", header = T)
  cna <- read.delim(opt$cna, header = T)
  cna$chr <- paste0("chr", cna$chr)

  # rearrange to make a bed file
  cna_bed <- cna[,c("chr", "pos", "end", "CN", "zygosity", "A", "B")]

  # categorize copy number
  cna_bed <- cna_bed %>%
    mutate(CN.Status = 
             case_when(
               zygosity == "HOM" ~ "LOH",
               A > B ~ "imbalance",
               TRUE ~ "balance"
             ))

  # Divide by 1Mb for axis
  cna_bed$end <- cna_bed$end/1000000
  cna_bed$pos <- cna_bed$pos/1000000

  # Change to factor and reorder levels
  cna_bed$chr <- factor(cna_bed$chr,levels=c(paste0("chr", 1:22), "chrX"))

  cnaLOH <- cna_bed %>% filter(CN.Status == "LOH")
  cnaGAIN <- cna_bed %>% filter(CN.Status == "imbalance") 
}

## ---------------------------------------------------------------------------
## DIFFERENTIAL METHYLATION
## ---------------------------------------------------------------------------

if (!is.null(opt$dmr) & opt$dmr != ""){
  #dmr <- read.delim("/projects/hpv_nanopore_prj/htmcp/call_integration/output/HTMCP-03-06-02058/methylation/diff_meth.csv", header = T)
  dmr <- read.delim(opt$dmr, header = T)

  # Divide by 1Mb for axis
  dmr$start <- dmr$start/1000000
  dmr$end <- dmr$end/1000000
  dmr$middle <- (dmr$start + dmr$end) / 2

  # Change to factor and reorder levels
  dmr$chr <- factor(dmr$chr, levels=c(paste0("chr", 1:22), "chrX"))

  # count in 1Mb bins
  dmrCount <- data.frame(table(as.factor(paste0(dmr$chr, ":", as.integer(dmr$middle)))))

  # split the chromosome name and bin position
  dmrPlot <- separate(dmrCount, col = Var1, into = c("chr", "pos"), sep = ":", remove = T)

  # scale to fit the plot - i.e. make the maximum width 0.65
  maxDMR <- max(dmrPlot$Freq)
  dmrPlot$percMax <- dmrPlot$Freq/maxDMR
  dmrPlot$percMax <- dmrPlot$percMax * 0.65

  # Change to factor and reorder levels
  dmrPlot$chr <- factor(dmrPlot$chr, levels=c(paste0("chr", 1:22), "chrX"))
  dmrPlot$pos <- as.numeric(dmrPlot$pos)
}

## ---------------------------------------------------------------------------
## ASE GENE HISTOGRAM
## ---------------------------------------------------------------------------

#ase <- read.delim("/projects/hpv_nanopore_prj/htmcp/ase/pull_trial/vporter-allelespecificexpression/output/HTMCP.03.06.02058/summaryTable.tsv", header = T)
ase <- read.delim(opt$ase, header = T)

# filter for ASE genes
ase <- ase[ase$aseResults == "ASE",]

# get gene positions
ase$chr <- genes$V1[match(ase$gene, genes$V4)]
ase$start <- genes$V2[match(ase$gene, genes$V4)]
ase$end <- genes$V3[match(ase$gene, genes$V4)]

# get the middle of the gene for plotting
ase$middle <- (ase$start + ase$end)/2

# get the data frame ready for plotting 
ase <- ase[,c("chr", "start", "end","middle")]
ase <- ase[complete.cases(ase),]

# Divide by 1Mb for axis
ase$middle <- ase$middle/1000000

# count in 1Mb bins
aseCount <- data.frame(table(as.factor(paste0(ase$chr, ":", as.integer(ase$middle)))))

# split the chromosome name and bin position
asePlot <- separate(aseCount[aseCount$Var1 != "NA:NA",], col = Var1, into = c("chr", "pos"), sep = ":", remove = T)

# scale to fit the plot - i.e. make the maximum width 0.65
maxASE <- max(asePlot$Freq)
asePlot$percMax <- asePlot$Freq/maxASE
asePlot$percMax <- asePlot$percMax * 0.65

# Change to factor and reorder levels
asePlot$chr <- factor(asePlot$chr, levels=c(paste0("chr", 1:22), "chrX"))
asePlot$pos <- as.numeric(asePlot$pos)

## ---------------------------------------------------------------------------
## PLOT OPTIONS
## ---------------------------------------------------------------------------

##### CNV AND DMRs AVAILABLE

if (!is.null(opt$dmr) & opt$dmr != "" & !is.null(opt$cna) & opt$cna != ""){
  # legend
  adL <- data.frame(xmin = c(9.7, 9.7, 9.7, 10.3,9.7,10.3,7.1,7.1), xmax = c(10.35, 10.35,9.75,10.35,9.75,10.35,7.26,7.26), ymin = c(240,210,238,238,208,208,235,205), ymax = c(243,213,243,243,213,213,245,215),
                    fill = c("ase","dmr","ase","ase","dmr","dmr", "gain", "loh"))

  adW <- data.frame(x = c(11.5, 11.2,8.25,7.6,10,10), y = c(240,210,240,210,250,220),
                    label = c("ASE Gene Density","DMR Density", "Imbalanced CNV", "LOH", as.character(c(maxASE,maxDMR))))

  # plot
  # chromosomes 1 - 12
  p1 <- ggplot() +
    # chromosome bars
    geom_segment(data = chromSize %>% filter(chr %in% paste0("chr", 1:12)), aes(x = chr, xend = chr, y = 0, yend = size), 
                 lineend = "round", color = "lightgrey", size = 5) +
    # LOH
    geom_rect(data = cnaLOH %>% filter(chr %in% paste0("chr", 1:12)), 
              aes(xmin = as.integer(chr) - 0.08, xmax = as.integer(chr) + 0.08, ymin = pos, ymax = end),
              fill="#94d2bd",size = 0.2) +
    # Imbalanced CNV
    geom_rect(data = cnaGAIN %>% filter(chr %in% paste0("chr", 1:12)), 
              aes(xmin = as.integer(chr) - 0.08, xmax = as.integer(chr) + 0.08, ymin = pos, ymax = end),
              fill="#ee9b00",size = 0.2) +
    # ASE genes
    geom_rect(data = asePlot %>% filter(chr %in% paste0("chr", 1:12)), 
              aes(xmin = as.integer(chr) + 0.1, xmax = (as.integer(chr) + 0.1 + percMax), ymin = pos, ymax = pos+1),
              fill = "#005f73", size = 0.25) +
    # DMRs
    geom_rect(data = dmrPlot %>% filter(chr %in% paste0("chr", 1:12)), 
              aes(xmin = (as.integer(chr) - 0.1 - percMax), xmax = as.integer(chr) - 0.1, ymin = pos, ymax = pos+1),
              fill = "#ae2012", size = 0.25) +
    # centromeres
    geom_point(data = centPos %>% filter(chr %in% paste0("chr", 1:12)), aes(x = chr, y = centre), 
               size = 5, colour = "gray") +
    # legend bars
    geom_rect(data = adL, 
              aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = fill),
              size = 0.25) +
    # legend text
    geom_text(data = adW, 
              aes(x = x, y = y, label = label))+
    scale_fill_manual(values = c("#005f73","#ae2012","#ee9b00","#94d2bd")) +
    ylim(0, 250) +
    theme_classic() +
    theme(text = element_text(size=15),axis.line=element_blank(),
          axis.ticks.x=element_blank(),
          legend.position = "none")+
    labs(x=NULL,y="Chromosome Size (Mb)")

  # chromosomes 13 - 22 + X

  # very annoying but you have to filter all the dataframes or else the factor levels won't match the integer value
  chromSizeFilt <- chromSize %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  chromSizeFilt$chr <- factor(chromSizeFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  centPosFilt <- centPos %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  centPosFilt$chr <- factor(centPosFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  cnaLOHFilt <- cnaLOH %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  cnaLOHFilt$chr <- factor(cnaLOHFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  cnaGAINFilt <- cnaGAIN %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  cnaGAINFilt$chr <- factor(cnaGAINFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  asePlotFilt <- asePlot %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  asePlotFilt$chr <- factor(asePlotFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  dmrPlotFilt <- dmrPlot %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  dmrPlotFilt$chr <- factor(dmrPlotFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))

  # chromosomes 13-22+X
  p2 <- ggplot() +
    # chromosome bars
    geom_segment(data = chromSizeFilt, aes(x = chr, xend = chr, y = 0, yend = size), 
                 lineend = "round", color = "lightgrey", size = 5) +
    # LOH
    geom_rect(data = cnaLOHFilt, 
              aes(xmin = as.integer(chr) - 0.08, xmax = as.integer(chr) + 0.08, ymin = pos, ymax = end),
              fill="#94d2bd",size = 0.2) +
    # Imbalanced CNV
    geom_rect(data = cnaGAINFilt, 
              aes(xmin = as.integer(chr) - 0.08, xmax = as.integer(chr) + 0.08, ymin = pos, ymax = end),
              fill="#ee9b00",size = 0.2) +
    # ASE genes
    geom_rect(data = asePlotFilt, 
              aes(xmin = as.integer(chr) + 0.1, xmax = (as.integer(chr) + 0.1 + percMax), ymin = pos, ymax = pos+1),
              fill = "#005f73", size = 0.25) +
    # DMRs
    geom_rect(data = dmrPlotFilt, 
              aes(xmin = (as.integer(chr) - 0.1 - percMax), xmax = as.integer(chr) - 0.1, ymin = pos, ymax = pos+1),
              fill = "#ae2012", size = 0.25) +
    # centromeres
    geom_point(data = centPosFilt, aes(x = chr, y = centre), 
               size = 5, colour = "black") +
    ylim(0, 250) +
    theme_classic() +
    theme(text = element_text(size=15),axis.line=element_blank(),
          axis.ticks.x=element_blank(),
          legend.position = "none")+
    labs(x=NULL,y=NULL)

} else if (!is.null(opt$cna) & opt$cna != ""){ ### CNVs BUT NO DMRs
  # legend
  adL <- data.frame(xmin = c(9.7, 9.7, 10.3,7.1,7.1), xmax = c(10.35, 9.75,10.35,7.26,7.26), ymin = c(240,238,238,235,205), ymax = c(243,243,243,245,215),
                    fill = c("ase","ase","ase","gain", "loh"))

  adW <- data.frame(x = c(11.5,8.25,7.6,10), y = c(240,240,210,250),
                    label = c("ASE Gene Density", "Imbalanced CNV", "LOH", as.character(c(maxASE))))

  # plot
  # chromosomes 1 - 12
  p1 <- ggplot() +
    # chromosome bars
    geom_segment(data = chromSize %>% filter(chr %in% paste0("chr", 1:12)), aes(x = chr, xend = chr, y = 0, yend = size), 
                 lineend = "round", color = "lightgrey", size = 5) +
    # LOH
    geom_rect(data = cnaLOH %>% filter(chr %in% paste0("chr", 1:12)), 
              aes(xmin = as.integer(chr) - 0.08, xmax = as.integer(chr) + 0.08, ymin = pos, ymax = end),
              fill="#94d2bd",size = 0.2) +
    # Imbalanced CNV
    geom_rect(data = cnaGAIN %>% filter(chr %in% paste0("chr", 1:12)), 
              aes(xmin = as.integer(chr) - 0.08, xmax = as.integer(chr) + 0.08, ymin = pos, ymax = end),
              fill="#ee9b00",size = 0.2) +
    # ASE genes
    geom_rect(data = asePlot %>% filter(chr %in% paste0("chr", 1:12)), 
              aes(xmin = as.integer(chr) + 0.1, xmax = (as.integer(chr) + 0.1 + percMax), ymin = pos, ymax = pos+1),
              fill = "#005f73", size = 0.25) +
    # centromeres
    geom_point(data = centPos %>% filter(chr %in% paste0("chr", 1:12)), aes(x = chr, y = centre), 
               size = 5, colour = "gray") +
    # legend bars
    geom_rect(data = adL, 
              aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = fill),
              size = 0.25) +
    # legend text
    geom_text(data = adW, 
              aes(x = x, y = y, label = label))+
    scale_fill_manual(values = c("#005f73","#ee9b00","#94d2bd")) +
    ylim(0, 250) +
    theme_classic() +
    theme(text = element_text(size=15),axis.line=element_blank(),
          axis.ticks.x=element_blank(),
          legend.position = "none")+
    labs(x=NULL,y="Chromosome Size (Mb)")

  # chromosomes 13 - 22 + X

  # very annoying but you have to filter all the dataframes or else the factor levels won't match the integer value
  chromSizeFilt <- chromSize %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  chromSizeFilt$chr <- factor(chromSizeFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  centPosFilt <- centPos %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  centPosFilt$chr <- factor(centPosFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  cnaLOHFilt <- cnaLOH %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  cnaLOHFilt$chr <- factor(cnaLOHFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  cnaGAINFilt <- cnaGAIN %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  cnaGAINFilt$chr <- factor(cnaGAINFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  asePlotFilt <- asePlot %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  asePlotFilt$chr <- factor(asePlotFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))

  # chromosomes 13-22+X
  p2 <- ggplot() +
    # chromosome bars
    geom_segment(data = chromSizeFilt, aes(x = chr, xend = chr, y = 0, yend = size), 
                 lineend = "round", color = "lightgrey", size = 5) +
    # LOH
    geom_rect(data = cnaLOHFilt, 
              aes(xmin = as.integer(chr) - 0.08, xmax = as.integer(chr) + 0.08, ymin = pos, ymax = end),
              fill="#94d2bd",size = 0.2) +
    # Imbalanced CNV
    geom_rect(data = cnaGAINFilt, 
              aes(xmin = as.integer(chr) - 0.08, xmax = as.integer(chr) + 0.08, ymin = pos, ymax = end),
              fill="#ee9b00",size = 0.2) +
    # ASE genes
    geom_rect(data = asePlotFilt, 
              aes(xmin = as.integer(chr) + 0.1, xmax = (as.integer(chr) + 0.1 + percMax), ymin = pos, ymax = pos+1),
              fill = "#005f73", size = 0.25) +
    # centromeres
    geom_point(data = centPosFilt, aes(x = chr, y = centre), 
               size = 5, colour = "black") +
    ylim(0, 250) +
    theme_classic() +
    theme(text = element_text(size=15),axis.line=element_blank(),
          axis.ticks.x=element_blank(),
          legend.position = "none")+
    labs(x=NULL,y=NULL)

} else if (!is.null(opt$dmr) & opt$dmr != ""){ ## DMRs BUT NO CNV
  # legend
  adL <- data.frame(xmin = c(9.7, 9.7, 9.7, 10.3,9.7,10.3), xmax = c(10.35, 10.35,9.75,10.35,9.75,10.35), ymin = c(240,210,238,238,208,208), ymax = c(243,213,243,243,213,213),
                    fill = c("ase","dmr","ase","ase","dmr","dmr"))

  adW <- data.frame(x = c(11.5, 11.2,10,10), y = c(240,210,250,220),
                    label = c("ASE Gene Density","DMR Density", as.character(c(maxASE,maxDMR))))

  # plot
  # chromosomes 1 - 12
  p1 <- ggplot() +
    # chromosome bars
    geom_segment(data = chromSize %>% filter(chr %in% paste0("chr", 1:12)), aes(x = chr, xend = chr, y = 0, yend = size), 
                 lineend = "round", color = "lightgrey", size = 5) +
    # ASE genes
    geom_rect(data = asePlot %>% filter(chr %in% paste0("chr", 1:12)), 
              aes(xmin = as.integer(chr) + 0.1, xmax = (as.integer(chr) + 0.1 + percMax), ymin = pos, ymax = pos+1),
              fill = "#005f73", size = 0.25) +
    # DMRs
    geom_rect(data = dmrPlot %>% filter(chr %in% paste0("chr", 1:12)), 
              aes(xmin = (as.integer(chr) - 0.1 - percMax), xmax = as.integer(chr) - 0.1, ymin = pos, ymax = pos+1),
              fill = "#ae2012", size = 0.25) +
    # centromeres
    geom_point(data = centPos %>% filter(chr %in% paste0("chr", 1:12)), aes(x = chr, y = centre), 
               size = 5, colour = "black") +
    # legend bars
    geom_rect(data = adL, 
              aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = fill),
              size = 0.25) +
    # legend text
    geom_text(data = adW, 
              aes(x = x, y = y, label = label))+
    scale_fill_manual(values = c("#005f73","#ae2012")) +
    ylim(0, 250) +
    theme_classic() +
    theme(text = element_text(size=15),axis.line=element_blank(),
          axis.ticks.x=element_blank(),
          legend.position = "none")+
    labs(x=NULL,y="Chromosome Size (Mb)")

  # chromosomes 13 - 22 + X

  # very annoying but you have to filter all the dataframes or else the factor levels won't match the integer value
  chromSizeFilt <- chromSize %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  chromSizeFilt$chr <- factor(chromSizeFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  centPosFilt <- centPos %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  centPosFilt$chr <- factor(centPosFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  asePlotFilt <- asePlot %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  asePlotFilt$chr <- factor(asePlotFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  dmrPlotFilt <- dmrPlot %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  dmrPlotFilt$chr <- factor(dmrPlotFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))

  # chromosomes 13-22+X
  p2 <- ggplot() +
    # chromosome bars
    geom_segment(data = chromSizeFilt, aes(x = chr, xend = chr, y = 0, yend = size), 
                 lineend = "round", color = "lightgrey", size = 5) +
    # ASE genes
    geom_rect(data = asePlotFilt, 
              aes(xmin = as.integer(chr) + 0.1, xmax = (as.integer(chr) + 0.1 + percMax), ymin = pos, ymax = pos+1),
              fill = "#005f73", size = 0.25) +
    # DMRs
    geom_rect(data = dmrPlotFilt, 
              aes(xmin = (as.integer(chr) - 0.1 - percMax), xmax = as.integer(chr) - 0.1, ymin = pos, ymax = pos+1),
              fill = "#ae2012", size = 0.25) +
    # centromeres
    geom_point(data = centPosFilt, aes(x = chr, y = centre), 
               size = 5, colour = "black") +
    ylim(0, 250) +
    theme_classic() +
    theme(text = element_text(size=15),axis.line=element_blank(),
          axis.ticks.x=element_blank(),
          legend.position = "none")+
    labs(x=NULL,y=NULL)

} else{ ## NO DMRs OR CNVs
  # legend
  adL <- data.frame(xmin = c(9.7, 9.7, 10.3), xmax = c(10.35,9.75,10.35), ymin = c(240,238,238), ymax = c(243,243,243),
                    fill = c("ase","ase","ase"))

  adW <- data.frame(x = c(11.5,10), y = c(240,250),
                    label = c("ASE Gene Density", as.character(c(maxASE))))

  # plot
  # chromosomes 1 - 12
  p1 <- ggplot() +
    # chromosome bars
    geom_segment(data = chromSize %>% filter(chr %in% paste0("chr", 1:12)), aes(x = chr, xend = chr, y = 0, yend = size), 
                 lineend = "round", color = "lightgrey", size = 5) +
    # ASE genes
    geom_rect(data = asePlot %>% filter(chr %in% paste0("chr", 1:12)), 
              aes(xmin = as.integer(chr) + 0.1, xmax = (as.integer(chr) + 0.1 + percMax), ymin = pos, ymax = pos+1),
              fill = "#005f73", size = 0.25) +
    # centromeres
    geom_point(data = centPos %>% filter(chr %in% paste0("chr", 1:12)), aes(x = chr, y = centre), 
               size = 5, colour = "black") +
    # legend bars
    geom_rect(data = adL, 
              aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = fill),
              size = 0.25) +
    # legend text
    geom_text(data = adW, 
              aes(x = x, y = y, label = label))+
    scale_fill_manual(values = c("#005f73")) +
    ylim(0, 250) +
    theme_classic() +
    theme(text = element_text(size=15),axis.line=element_blank(),
          axis.ticks.x=element_blank(),
          legend.position = "none")+
    labs(x=NULL,y="Chromosome Size (Mb)")

  # chromosomes 13 - 22 + X

  # very annoying but you have to filter all the dataframes or else the factor levels won't match the integer value
  chromSizeFilt <- chromSize %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  chromSizeFilt$chr <- factor(chromSizeFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  centPosFilt <- centPos %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  centPosFilt$chr <- factor(centPosFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  asePlotFilt <- asePlot %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  asePlotFilt$chr <- factor(asePlotFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))

  # chromosomes 13-22+X
  p2 <- ggplot() +
    # chromosome bars
    geom_segment(data = chromSizeFilt, aes(x = chr, xend = chr, y = 0, yend = size), 
                 lineend = "round", color = "lightgrey", size = 5) +
    # ASE genes
    geom_rect(data = asePlotFilt, 
              aes(xmin = as.integer(chr) + 0.1, xmax = (as.integer(chr) + 0.1 + percMax), ymin = pos, ymax = pos+1),
              fill = "#005f73", size = 0.25) +
    # centromeres
    geom_point(data = centPosFilt, aes(x = chr, y = centre), 
               size = 5, colour = "black") +
    ylim(0, 250) +
    theme_classic() +
    theme(text = element_text(size=15),axis.line=element_blank(),
          axis.ticks.x=element_blank(),
          legend.position = "none")+
    labs(x=NULL,y=NULL)
}


## ---------------------------------------------------------------------------
## PLOT 
## ---------------------------------------------------------------------------

# put them together
plot <- plot_grid(p1, p2, align = "v", axis = "l", nrow = 2)

# save plot
ggsave(plot, filename = paste0(opt$out,".pdf"), width = 10, height = 7, units = "in")

R ggplot2 dplyr tidyr cowplot optparse reshape2 From line 6 of src/karyogramFigure.R

suppressMessages(library(optparse))
suppressMessages(library(dplyr))
suppressMessages(library(reshape2))
suppressMessages(library(prob))
suppressMessages(library(tidyr))
suppressMessages(library(MBASED))
suppressMessages(library(SummarizedExperiment))
suppressMessages(library(stats))
suppressMessages(library(tibble))

## ---------------------------------------------------------------------------
## LOAD INPUT 
## ---------------------------------------------------------------------------

# Make help options
option_list = list(
  make_option(c("-p", "--phase"), type="character", default=NULL,
              help="Phased VCF file (from WhatsHap)", metavar="character"),
  make_option(c("-r", "--rna"), type="character", default=NULL,
              help="Tumour RNA vcf file (from Strelka2)", metavar="character"),
  make_option(c("-o", "--outdir"), type="character", default = "mBASED",
              help="Output directory name", metavar="character"),
  make_option(c("-t", "--threads"), type="integer", default = "mBASED",
              help="Threads used for mbased", metavar="integer")
)

# load in options 
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
out <- opt$outdir
threads <- opt$threads

## ---------------------------------------------------------------------------
## USER FUNCTIONS
## ---------------------------------------------------------------------------

# extract info from a list
list_n_item <- function(list, n){
  sapply(list, `[`, n)
}

# define function to print out the summary of ASE results
summarizeASEResults_1s <- function(MBASEDOutput) {

  geneOutputDF <- data.frame(
    majorAlleleFrequency = assays(MBASEDOutput)$majorAlleleFrequency[,1],
    pValueASE = assays(MBASEDOutput)$pValueASE[,1],
    pValueHeterogeneity = assays(MBASEDOutput)$pValueHeterogeneity[,1])

  geneAllele <- as.data.frame(assays(metadata(MBASEDOutput)$locusSpecificResults)$allele1IsMajor) %>%
    rownames_to_column(var = "rowname") %>%
    dplyr::mutate(gene = unlist(lapply(strsplit(rowname, split = ":"),function(x){x = x[1]}))) %>%
    dplyr::group_by(gene) %>%
    summarise(allele1IsMajor = unique(mySample))

  geneOutputDF$allele1IsMajor <- geneAllele$allele1IsMajor[match(rownames(geneOutputDF), geneAllele$gene)]

  lociOutputGR <- rowRanges(metadata(MBASEDOutput)$locusSpecificResults)
  lociOutputGR$allele1IsMajor <- assays(metadata(MBASEDOutput)$locusSpecificResults)$allele1IsMajor[,1]
  lociOutputGR$MAF <- assays(metadata(MBASEDOutput)$locusSpecificResults)$MAF[,1]
  lociOutputList <- split(lociOutputGR, factor(lociOutputGR$aseID, levels=unique(lociOutputGR$aseID)))

  return(
    list(
      geneOutput=geneOutputDF,
      locusOutput=lociOutputList
    )
  )
}

## ---------------------------------------------------------------------------
## READ IN THE RNA SNV CALLS
## ---------------------------------------------------------------------------

# read in the RNA calls
rna_filt <- read.delim(opt$rna, header = T, comment.char = "#", stringsAsFactors = F)
colnames(rna_filt) <- c("CHROM", "POS", "AD","REF","ALT","gene", "gene_biotype") 
rna_filt$variant <- paste0(rna_filt$CHROM, ":", rna_filt$POS)


## ---------------------------------------------------------------------------
## EXTRACT REF/ALT READ COUNTS
## ---------------------------------------------------------------------------

# Extract and add the read counts
expr <- strsplit(rna_filt$AD, ",")
rna_filt$REF.COUNTS <- as.numeric(list_n_item(expr, 1))
rna_filt$ALT.COUNTS <- as.numeric(list_n_item(expr, 2))

## ---------------------------------------------------------------------------
## MBASED WITH OR WITHOUT PHASING
## ---------------------------------------------------------------------------

### WITH PHASING
if (!is.null(opt$phase)){

  ### 
  ### PHASING
  ###

  # WhatsHap phased VCF from ONT sequencing pipeline
  wh <- read.delim(opt$phase, header = F, comment.char = "#", stringsAsFactors = F)
  colnames(wh) <- c("CHROM",  "POS",  "ID",  "REF", "ALT", "QUAL", "FILTER",  "INFO", "FORMAT", "SAMPLE")
  wh$variant <- paste0(wh$CHROM, ":", wh$POS)

  # remove unphased variants - phased variants have the pipe "|" symbol in column 10 - and remove indels
  wh <- wh[grep("|", wh$SAMPLE, fixed=TRUE),]
  wh <- wh %>% dplyr::filter(nchar(REF) == 1 & nchar(ALT) == 1)

  # add genotype from the SAMPLE column as a new column
  info2 <- strsplit(wh$SAMPLE, ":")
  wh$GT <- list_n_item(info2, 1)

  # Add the genotype from WhatsHap 
  rna_filt$GT <- wh$GT[match(rna_filt$variant, wh$variant)]

  # Find unphased genes with one variant (test)
  singleUnphased <- rna_filt %>%
    mutate(phase = variant %in% wh$variant) %>%
    left_join(rna_filt %>% group_by(gene) %>% summarize(n=n())) %>%
    dplyr::filter(!phase & n == 1)

  # Add genotype to unphased gene with one variant (test)
  rna_filt$GT[which(rna_filt$variant %in% singleUnphased$variant)] <- "1|0"

  # annotate the phased variants as alleleA and alleleB
  rna_filt$alleleA <- ifelse(rna_filt$GT == "1|0", rna_filt$ALT, rna_filt$REF)
  rna_filt$alleleB <- ifelse(rna_filt$GT == "1|0", rna_filt$REF, rna_filt$ALT)

  # add the phased COUNTS variants as alleleA and alleleB
  rna_filt$alleleA.counts <- ifelse(rna_filt$GT == "1|0", rna_filt$ALT.COUNTS, rna_filt$REF.COUNTS)
  rna_filt$alleleB.counts <- ifelse(rna_filt$GT == "1|0", rna_filt$REF.COUNTS, rna_filt$ALT.COUNTS)

  # phased only variants
  rna_phased <- rna_filt[complete.cases(rna_filt),]

  # make SNV IDs
  rna_phased <- rna_phased %>%
    arrange(CHROM, POS) %>%
    group_by(gene) %>%
    mutate(label = paste0("SNV",1:n()))
  rna_phased$SNV.ID <- paste0(rna_phased$gene, ":", rna_phased$label)

  ### 
  ### MBASED
  ###

  print("Beginning MBASED ...")

  # make the GRanges object of the loci
  mySNVs <- GRanges(seqnames=rna_phased$CHROM,
                     ranges=IRanges(start=rna_phased$POS, width=1),
                     aseID=rna_phased$gene,
                     allele1=rna_phased$REF,
                     allele2=rna_phased$ALT)
  names(mySNVs) <- rna_phased$SNV.ID

  # create input RangedSummarizedExperiment object
  mySample <- SummarizedExperiment(
    assays=list(lociAllele1Counts=matrix(rna_phased$alleleA.counts,
                                         ncol=1,
                                         dimnames=list(names(mySNVs),'mySample')),
                lociAllele2Counts=matrix(rna_phased$alleleB.counts,
                                         ncol=1,
                                         dimnames=list(names(mySNVs),'mySample'))),
    rowRanges=mySNVs
  )

  # run MBASED
  ASEresults_1s_haplotypesKnown <- runMBASED(ASESummarizedExperiment=mySample,
                                             isPhased=TRUE,
                                             numSim=10^6,
                                             BPPARAM = MulticoreParam(workers = threads))

  saveRDS(ASEresults_1s_haplotypesKnown, file=paste0(out, "/ASEresults_1s_haplotypesKnown.rds"))
  # extract results
  results <- summarizeASEResults_1s(ASEresults_1s_haplotypesKnown)

  # adjust the pvalue with BH correction
  results$geneOutput$padj <- p.adjust(p = results$geneOutput$pValueASE, method = "BH")
  results$geneOutput$significance <- as.factor(ifelse(results$geneOutput$padj < 0.05, "padj < 0.05", "padj > 0.05"))
  results$geneOutput$gene <- rownames(results$geneOutput)

  results$geneOutput$allele1IsMajor[results$geneOutput$gene %in% singleUnphased$gene] = NA

  # add the locus
  results$geneOutput$geneBiotype <- rna_filt$gene_biotype[match(results$geneOutput$gene, rna_filt$gene)]

### WITHOUT PHASING
} else {

  # make SNV labels
  rna_filt <- rna_filt %>%
    arrange(CHROM, POS) %>%
    group_by(gene) %>%
    mutate(label = paste0("SNV",1:n()))
  rna_filt$SNV.ID <- paste0(rna_filt$gene, ":", rna_filt$label)

  ### 
  ### MBASED
  ###

  print("Beginning MBASED ...")

  # make the GRanges object of the loci
  mySNVs <- GRanges(seqnames=rna_filt$CHROM,
                    ranges=IRanges(start=rna_filt$POS, width=1),
                    aseID=rna_filt$gene,
                    allele1=rna_filt$REF,
                    allele2=rna_filt$ALT)
  names(mySNVs) <- rna_filt$SNV.ID

  ## create input RangedSummarizedExperiment object
  mySample <- SummarizedExperiment(
    assays=list(lociAllele1Counts=matrix(rna_filt$REF.COUNTS,
                                         ncol=1,
                                         dimnames=list(names(mySNVs),'mySample')),
                lociAllele2Counts=matrix(rna_filt$ALT.COUNTS,
                                         ncol=1,
                                         dimnames=list(names(mySNVs),'mySample'))),
    rowRanges=mySNVs
  )

  # run MBASED
  ASEresults_1s_haplotypesUnknown <- runMBASED(ASESummarizedExperiment=mySample,
                                               isPhased=FALSE,
                                               numSim=10^6,
                                               BPPARAM = MulticoreParam(workers = threads))
  saveRDS(ASEresults_1s_haplotypesUnknown, file=paste0(out, "/ASEresults_1s_haplotypesUnknown.rds"))

  # extract results
  results <- summarizeASEResults_1s(ASEresults_1s_haplotypesUnknown)

  # adjust the pvalue with BH correction
  results$geneOutput$padj <- p.adjust(p = results$geneOutput$pValueASE, method = "BH")
  results$geneOutput$significance <- as.factor(ifelse(results$geneOutput$padj < 0.05, "padj < 0.05", "padj > 0.05"))
  results$geneOutput$gene <- rownames(results$geneOutput)

  # add the locus
  results$geneOutput$geneBiotype <- rna_filt$gene_biotype[match(results$geneOutput$gene, rna_filt$gene)]

} 

# save the results 
saveRDS(results, file=paste0(out, "/MBASEDresults.rds"))
print("Finished MBASED")

R dplyr tidyr optparse reshape2 tibble SummarizedExperiment MBASED From line 7 of src/mbased.snpEff.R

suppressMessages(library(tidyverse))
suppressMessages(library(optparse))

# Make help options
option_list = list(
  make_option(c("-m", "--methyl"), type="character", default=NULL,
              help="Allelic methylation (from Nanomethphase)", metavar="character"),
  make_option(c("-o", "--outdir"), type="character", default = "mBASED",
              help="Output directory name", metavar="character")
)

# load in options 
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
out <- opt$outdir


dmr <- read.delim(opt$methyl, header = T, comment.char = "#", stringsAsFactors = F)

dmr %>%
  dplyr::mutate(chr = case_when(
    grepl("^chr", chr) ~ chr,
    TRUE ~ paste0("chr", chr)
  )) %>%
  dplyr::select(chr, start, end, meanMethy1, meanMethy2, diff.Methy) %>%
  dplyr::mutate(start = gsub(" ", "", format(start, scientific=F), fixed = TRUE)) %>%
  dplyr::mutate(end = gsub(" ", "", format(end, scientific=F), fixed = TRUE)) %>%
  dplyr::mutate(meanMethy1 = format(meanMethy1, scientific = FALSE)) %>%
  dplyr::mutate(meanMethy2 = format(meanMethy2, scientific = FALSE)) %>%
  dplyr::mutate(diff.Methy = format(diff.Methy, scientific = FALSE)) %>%
  write.table(paste0(out, "/methyl.bed"), 
              quote = F, sep = "\t", row.names = F, col.names = F)

R tidyverse optparse From line 1 of src/methyl_preprocess.R

suppressMessages(library(optparse))
suppressMessages(library(tidyverse))

option_list = list(
  make_option(c("-a", "--annotation"), type="character", default=NULL,
              help="Annotation file", metavar="character"),
  make_option(c("-o", "--outdir"), type="character", default = NULL,
              help="output directory", metavar="character")
)

# load in options
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
annotation_path <- opt$annotation
out <- opt$out

read.delim(annotation_path, sep = "\t", header = T) %>%
  `colnames<-`(c("gene", "effect", "genotype", "filter" )) %>%
  dplyr::filter(filter == "PASS") %>%
  dplyr::filter(genotype == c("1|0", "0|1")) %>%
  dplyr::filter(effect != "stop_retained_variant") %>%
  dplyr::filter(grepl("stop", effect)) %>%
  distinct(gene, genotype, .keep_all = TRUE) %>%
  dplyr::mutate(stop_variant_allele = case_when(
    genotype == "0|1" ~ 2,
    TRUE ~ 1
  )) %>%
  dplyr::select(gene, effect, stop_variant_allele) %>%
  write.table(paste0(out, "/stop_variant.tsv"), 
              sep = "\t", quote = F, row.names = F, col.names = T)

R tidyverse optparse From line 1 of src/stopVariant.R

suppressMessages(library(tidyverse))
suppressMessages(library(optparse))


# Make help options
option_list = list(
  make_option(c("-c", "--cnv"), type="character", default=NULL,
              help="CNV bed file", metavar="character"),
  make_option(c("-m", "--methyl"), type="character", default=NULL,
              help="allelic methylation bed file", metavar="character"),
  make_option(c("-a", "--ase"), type="character", default=NULL,
              help="mbased ASE result", metavar="character"),
  make_option(c("-g", "--cancer"), type="character", default=NULL,
              help="cancer gene", metavar="character"),
  make_option(c("-s", "--sample"), type="character", default=NULL,
              help="Sample name", metavar="character"),
  make_option(c("-o", "--outdir"), type="character", default = NULL,
              help="Output directory name", metavar="character"),
  make_option(c("-t", "--tfbs"), type="character", default = NULL,
              help="Transcription Factor binding site", metavar="character"),
  make_option(c("-p", "--stop"), type="character", default = NULL,
              help="Stop Variants", metavar="character"),
  make_option(c("-n", "--snv"), type="character", default = NULL,
              help="Somatic SNV", metavar="character"),
  make_option(c("-i", "--indel"), type="character", default = NULL,
              help="Somatic Indel", metavar="character"),
  make_option(c("r-", "--tissue"), type="character", default = "allTissue",
              help="tissue name", metavar="character"),
  make_option(c("-N", "--normal"), type="character", default = NULL,
              help="Normal ASE", metavar="character"),
  make_option(c("-T", "--tumorContent"), type="double", default = 1.0,
              help="Tumor Content", metavar="character")
)


# load in options 
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
out <- opt$outdir
sample <- opt$sample

cnv_path <- opt$cnv
methyl_path <- opt$methyl
tfbs_path <- opt$tfbs
stop_path <- opt$stop
snv_path <- opt$snv
indel_path <- opt$indel
cancer_path <- opt$cancer
normal_path <- opt$normal

tissue <- opt$tissue
tumorContent <- opt$tumorContent
ase <- read.delim(opt$ase, header = T, comment.char = "#", stringsAsFactors = F)

##########
# CNV
##########

if (is.null(cnv_path) | cnv_path == "") {
  cnv <- data.frame(gene = ase$gene)
} else {
  normal <- read.delim(normal_path, sep = "\t", header = F,
                       comment.char = "#") %>%
    `colnames<-`(.[1, ]) %>%
    .[-1, ] %>%
    dplyr::select(gene, tissue) %>%
    `colnames<-`(c("gene", "normalMAF"))
  cnv <- read.delim(cnv_path, header = F, comment.char = "#") %>%
    dplyr::select(V4, V8, V9, V11) %>%
    dplyr::mutate(V8 = as.numeric(V8)) %>%
    dplyr::mutate(V9 = as.numeric(V9)) %>%
    dplyr::mutate(V11 = as.numeric(V11)) %>%
    `colnames<-`(c("gene", "cnv.A", "cnv.B", "expectedMAF")) %>%
    group_by(gene) %>% 
    summarize(cnv.A = mean(cnv.A), cnv.B = mean(cnv.B), expectedMAF = mean(expectedMAF)) %>%
    dplyr::mutate(cnv_state = case_when(
      cnv.A == 0 | cnv.B == 0 ~ "LOH",
      abs(cnv.A - cnv.B) <= 1 ~ "balance",
      TRUE ~ "imbalance"
    )) %>%
    left_join(normal, by = "gene") %>%
    dplyr::mutate(normalMAF = as.numeric(normalMAF)) %>%
    dplyr::mutate(expectedMAF = ((pmax(cnv.A,cnv.B)/(cnv.A + cnv.B))*tumorContent) + ((1-tumorContent) * normalMAF)) %>%
    dplyr::select("gene", "cnv.A", "cnv.B", "cnv_state", "expectedMAF")
}

##########
# DMR
##########

if ( is.null(methyl_path) | methyl_path == "") {
  dmr <- data.frame(gene = ase$gene)
} else {
  dmr <- read.delim(methyl_path, header = F, comment.char = "#", stringsAsFactors = F) %>%
    dplyr::filter(V5 != ".") %>%
    dplyr::select(V4, V8, V9, V10) %>%
    `colnames<-`(c("gene", "methyl.A", "methyl.B", "diff.Methyl")) %>%
    dplyr::mutate(diff.Methyl = as.numeric(diff.Methyl)) %>%
    group_by(gene) %>%
    summarize(minimum = min(diff.Methyl), maximum = max(diff.Methyl)) %>%
    dplyr::mutate(methyl_state = case_when(
      (minimum < 0) & (maximum < 0) ~ minimum,
      (minimum > 0) & (maximum > 0) ~ maximum,
      TRUE ~ ifelse((abs(minimum) > maximum), minimum, maximum)
    )) %>%
    dplyr::select(gene, methyl_state)
}

########################################
# Transcription Factor Binding Site
########################################

if (is.null(tfbs_path) | tfbs_path == "") {
  tfbs <- data.frame(gene = ase$gene)
} else {
  tfbs <- read.delim(tfbs_path, header = T, comment.char = "#") %>%
    `colnames<-`(c("transcriptionFactor", "gene", "tf_allele")) %>%
    dplyr::select(gene, transcriptionFactor, tf_allele) %>%
    group_by(gene) %>% 
    mutate(transcriptionFactor = paste(transcriptionFactor, collapse=",")) %>%
    mutate(tf_allele = paste(tf_allele, collapse=",")) %>%
    distinct()
}

##########################
# Stop Variants
##########################

if (is.null(stop_path) | stop_path == "") {
  stopVar <- data.frame(gene = ase$gene)
} else {
  stopVar <- read.delim(stop_path, header = T, comment.char = "#") %>%
    dplyr::select(gene, stop_variant_allele)
}

##########################
# Somatic SNV
##########################

if (is.null(snv_path) | snv_path == "") {
  snv <- data.frame(gene = ase$gene)
} else {
  snv_gene <- read.delim(snv_path, header = F, comment.char = "#") %>%
    pull(V4) %>%
    unique()
  snv <- data.frame(gene = ase$gene) %>%
    dplyr::mutate(somaticSNV = gene %in% snv_gene)
}

##########################
# Somatic Indel
##########################

if (is.null(indel_path) | indel_path == "") {
  indel <- data.frame(gene = ase$gene)
} else {
  indel_gene <- read.delim(indel_path, header = F, comment.char = "#") %>%
    pull(V4) %>%
    unique()
  indel <- data.frame(gene = ase$gene) %>%
    dplyr::mutate(somaticIndel = gene %in% indel_gene)
}

##########################
# Cancer
##########################

if (is.null(cancer_path) | cancer_path == "") {
  cancer <- data.frame(gene = ase$gene)
} else {
  Can_genes <- read.delim(cancer_path, sep = "\t", header = F, comment.char = "#") %>% 
    pull()
  cancer <- data.frame(gene = ase$gene) %>%
    dplyr::mutate(cancer_gene = gene %in% Can_genes)
}

##########################
# Normal ASE
##########################

if (is.null(tissue) | tissue == "") {
  normal <- data.frame(gene = ase$gene)
} else {
  normal <- read.delim(normal_path, sep = "\t", header = F, 
                       comment.char = "#") %>% 
    `colnames<-`(.[1, ]) %>%
    .[-1, ] %>%
    dplyr::select(gene, tissue) %>%
    `colnames<-`(c("gene", "normalMAF")) 
}

##########################
# Summary Table 
##########################

summary_table <- ase %>%
  dplyr::select(gene, RPKM, allele1IsMajor, majorAlleleFrequency, padj, aseResults) %>%
  dplyr::rename("expression" = RPKM) %>%
  left_join(cnv, by = "gene") %>%
  left_join(dmr, by = "gene") %>%
  left_join(tfbs, by = "gene") %>%
  left_join(stopVar, by = "gene") %>%
  left_join(snv, by = "gene") %>%
  left_join(indel, by = "gene") %>%
  left_join(cancer, by = "gene") %>%
  left_join(normal, by = "gene") %>%
  dplyr::mutate(sample = sample) %>%
  dplyr::mutate(tumorContent = tumorContent) %>%
  write.table(paste0(out, "/summaryTable.tsv"), 
              sep = "\t", quote = F, row.names = F, col.names = T)