Integrated Mapping and Profiling of Allelically-expressed Loci with Annotations

public public 1yr ago Version: 1.0.0 0 bookmarks

Integrated Mapping and Profiling of Allelically-expressed Loci with Annotations

This Snakemake workflow calls allele-specific expression genes using short-read RNA-seq. Phasing information derived from long-read data by tools such as WhatsHap can be provided to increase the performance of the tool, and to link results to features of interest. Copy number variant data, allelic methylation data and somatic variant data can also be provided to analyze genes with allele specific expression.

Table of Contents

Overall Workflow


Installation

This will clone the repository. You can run the IMPALA within this directory.

git clone https://github.com/bcgsc/IMPALA.git

Dependencies

To run this workflow, you must have snakemake (v6.12.3) and singularity (v3.5.2-1.1.el7). You can install snakemake using this guide and singularity using this guide . The remaining dependencies will be downloaded automatically within the snakemake workflow.

Input Files

Method 1 †: RNA reads:

  • RNA paired end reads (R1 & R2 fastq file)

Method 2 §: RNA alignment:

  • RNA alignment alignment (bam file)

  • Expression Matrix

    • Expression in RPKM/TPM

    • Gene name must be in HGNC format

    • Column name is "Gene" and sample names

Optional Inputs:

  • Phase VCF

    • Can be obtained using WhatsHap with DNA long reads

    • Significantly improves precision of ASE calling

    • Adds TFBS mutation and stop gain/loss information

  • Copy Number Variant Data

  • Allelic Methylation

  • Somatic mutations

    • Finds somatic mutations in ASE gene and promoters
  • Tumor Content

    • Used to calcualte the expected major allele frequency

    • Assumes 1.0 if not specified

  • Tissue type

    • Include data for average MAF in normal tissue in summary table

    • Otained from GTex database which ran phASER to calcualte allelic expression

Running Workflow

Edit the config files

Example parameters.yaml:

Config files to specify parameters and paths needed for the workflow. The main parameter to include is the genome name, path to expression matrix, major allele frequency threshold and threads as well as settings for using phased vcf and doing cancer analysis.

# genome_name should match bams
genome_name: hg38/hg19/hg38_no_alt_TCGA_HTMCP_HPVs
# RPKM matrix of the samples
matrix: /path/to/expression/matrix.tsv
# Major allele frequency threshold for ASE (0.5 - 0.75)
maf_threshold: 0.65
# Threads for STAR, RSEM, Strelka and MBASED
threads: 72
# Use phased vcf (True or False)
# Uses pseudphasing algorithm if False
phased: True
# Perform cancer analysis 
# Intersect with optional input
cancer_analysis: True
# Paths for annotation
annotationPath:
 snpEff_config:
 /path/to/snpEff/config
 snpEff_datadir:
 /path/to/snpEff/binaries/data
 snpEff_genomeName:
 GRCh38.100
 snpEff_javaHeap:
 64g
# Paths for references
# Only needed if RNA read is provided instead of RNA bam
starReferencePath:
 /path/to/star/ref
rsemReferencePath:
 /path/to/rsem/ref

Example samples.yaml:

Main config file to specify input files. For input method 1 using R1 and R2 fastq file, use R1 and R2 tag. For input method 2 using RNA bam file, use rna tag. All other tags are optional.

samples: # Sample Name must match expression matrix sampleName_1: # Method 1 R1: /path/to/RNA/R1.fq R2: /path/to/RNA/R2.fq somatic_snv: /path/to/somatic/snv.vcf somatic_indel: /path/to/somatic/indel.vcf tissueType: Lung sampleName_2: # Method 2 rna: /path/to/RNA/alignment.bam phase: /path/to/phase.vcf.gz cnv: /path/to/cnv/data methyl: /path/to/methyl/data.bed tumorContent: 0.80

Example defaults.yaml:

Config file for specify path for reference genome, annotation bed file and centromere bed file. Annotation and centromere bed file for hg38 are included in the repository.

genome:
 hg19:
 /path/to/hg19/ref.fa
 hg38:
 /path/to/hg38/ref.fa
 hg38_no_alt_TCGA_HTMCP_HPVs:
 /path/to/hg38_no_alt_TCGA_HTMCP_HPVs/ref.fa
annotation:
 hg19:
 /path/to/hg19/annotation.fa
 hg38:
 annotation/biomart_ensembl100_GRCh38.sorted.bed 
 hg38_no_alt_TCGA_HTMCP_HPVs:
 annotation/biomart_ensembl100_GRCh38.sorted.bed
centromere:
 hg19:
 /path/to/hg19/centromere.bed
 hg38:
 annotation/hg38_centromere_positions.bed
 hg38_no_alt_TCGA_HTMCP_HPVs:
 annotation/hg38_centromere_positions.bed

Run snakemake

This is the command to run it with singularity. The -c parameter can be used to specify maximum number of threads. The -B parameter is used to speceify paths for the docker container to bind.

snakemake -c 30 --use-singularity --singularity-args "-B /projects,/home,/gsc"

Output Files

All output and intermediary files is found in output/{sample} directory. The workflow has four main section, alignment, variant calling, mbased and cancer analysis and their outputs can be found in the corrosponding directories. The key outputs from the workflow is located below

  1. MBASED related outputs (found in output/{sample}/mbased )

    • The tabular results of the output MBASED_expr_gene_results.txt

    • The rds object of the MBASED raw output MBASEDresults.rds

  2. Summary table of all outputs

    • Found in output/{sample}/summaryTable.tsv

    • Data of all phased genes with ASE information along potential causes based on optional inputs

  3. Figures

    • Found in output/{sample}/figures

    • Example figure shown below

Summary Table Description

Column Description
gene HGNC gene symbol
Expression Expression level
allele1IsMajor T/F if allele 1 is the major allele (allele 1 = HP1)
majorAlleleFrequency Major allele frequency
padj Benjamini-Hochberg adjusted pvalue
aseResults ASE result based on MAF threshold (and pval)
cnv.A1 Copy Number for allele 1
cnv.B1 Copy Number for allele 2
expectedMAF1 Expect Major Allele Frequency based on CNV
cnv_state1 Allelic CNV state (Loss of Heterozygosity, Allelic balance/imbalabnce)
methyl_state2 Methylation difference in promter region (Allele 1 - Allele 2)
tf_allele3 Allele where there is gain of transcription factor binding site
transcriptionFactor3 Transcription Factor for gain TFBS
stop_variant_allele3 Allele where stop gain/stop loss variant is found
somaticSNV4 Somatic SNV found in (or around) gene (T/F)
somaticIndel4 Somatic Indel found in (or around) gene (T/F)
normalMAF5 Add MAF for gene in normal tissue
cancer_gene T/F if gene is a known cancer gene (based on annotation/cancer_gene.txt )
sample Sample Name

Columns only included if optional input is included:

1 Copy number variant 2 Allelic methylation 3 Phased vcf 4 Somatic SNV and Indel 5 Tissue type

Example Figures

Several figures are automatically generate based on the optional inputs. They can be found in output/{sample}/figures . The main figure is karyogram.pdf which show co-locationzation of ASE genes with allelic methylation and somatic copy number alteration. Example figures can be found here .

Contributors

The contributors of this project are Glenn Chang, Vannessa Porter, and Kieran O'Neill.

License

IMPALA is licensed under the terms of the GNU GPL v3 .

Code Snippets

66
67
68
69
70
71
72
73
74
75
76
77
78
shell:
	"""
	date >> {output} 2> {log}
	printf "{wildcards.sample}\n------------------------\n\n" >> {output} 2> {log}
	echo {params.sample_info}| sed 's/{{//' | sed 's/}}//' | tr  , '\n'  >> {output} 2> {log}
	echo expression_matrix: {params.expressionMatrix} >> {output} 2> {log}
	printf "\n------------------------\n" >> {output} 2> {log}
	echo Major Allele Frequency Threshold: {params.maf} >> {output} 2> {log}
	echo Genome: {params.genome_name} >> {output} 2> {log}
	echo Phased: {params.phase}  >> {output} 2> {log}
	echo Cancer analysis: {params.cancer} >> {output} 2> {log}
	echo Threads: {params.threads} >> {output} 2> {log}
	"""
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
shell:
	"""
	mkdir -p output/{wildcards.sample}/0_alignment
	STAR \
		--genomeDir {params.ref} \
		--runThreadN {threads} \
		--readFilesIn  {input.r1} {input.r2} \
		--outFileNamePrefix output/{wildcards.sample}/0_alignment/star \
		--outSAMtype BAM SortedByCoordinate \
		--outSAMunmapped Within \
		--outSAMattributes Standard \
		--waspOutputMode SAMtag \
		--varVCFfile {input.vcf} \
		--quantMode TranscriptomeSAM \
		--twopassMode Basic \
		--twopass1readsN -1 &> {log}
	"""
121
122
123
124
125
shell:
	"""
	samtools view -h {input} | grep -e '^@' -e 'vW:i:1' | samtools view -b -S > {output} 2> {log}
	samtools index {output} &> {log}
	"""
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
shell:
	"""
	mkdir -p output/{wildcards.sample}/0_alignment
	STAR \
		--genomeDir {params.ref} \
		--runThreadN {threads} \
		--readFilesIn  {input.r1} {input.r2} \
		--outFileNamePrefix output/{wildcards.sample}/0_alignment/star \
		--outSAMtype BAM SortedByCoordinate \
		--outSAMunmapped Within \
		--outSAMattributes Standard \
		--quantMode TranscriptomeSAM \
		--twopassMode Basic \
		--twopass1readsN -1 &> {log}
	"""
166
167
168
169
170
171
172
173
174
175
shell:
		"""
		rsem-calculate-expression \
				--alignments \
				-p {threads} \
				--paired-end \
				{input} \
				{params.ref} \
				output/{wildcards.sample}/0_alignment/star &> {log}
		"""
184
185
186
187
188
189
190
191
shell:
	"""
	Rscript src/generateExpressionMatrix.R \
		-i {input} \
		-o output/{wildcards.sample}/0_alignment \
		-a {params.annotation} \
		-s {wildcards.sample} &> {log}
	"""
SnakeMake From line 184 of master/Snakefile
203
204
205
206
207
    shell:
        """
		mkdir -p output/{wildcards.sample}/1_variant
		zcat {input.phase} | grep -E '(PASS|#)' | grep -E '(0/1|\||#)' | awk '/^#/||length($4)==1 && length($5)==1' | bgzip > {output} 2> {log}
		"""
SnakeMake From line 203 of master/Snakefile
217
218
shell:
    "tabix {input.vcf} &> {log}"
237
238
239
240
241
242
243
244
245
246
247
		shell:
			"""
			configureStrelkaGermlineWorkflow.py \
			--bam={input.bam} \
			--referenceFasta={input.ref} \
			--forcedGT={input.vcf} \
			--rna \
			--runDir=output/{wildcards.sample}/StrelkaRNA &> {log}

			output/{wildcards.sample}/StrelkaRNA/runWorkflow.py -m local -j {threads} &> {log} 
        	"""
SnakeMake From line 237 of master/Snakefile
258
259
260
261
262
263
264
265
266
shell:
        """
        configureStrelkaGermlineWorkflow.py \
                --bam={input.bam} \
                --referenceFasta={input.ref} \
                --rna \
                --runDir=output/{wildcards.sample}/StrelkaRNA &> {log}
        output/{wildcards.sample}/StrelkaRNA/runWorkflow.py -m local -j {threads} &> {log} 
        """
SnakeMake From line 258 of master/Snakefile
275
276
shell:
    "zcat {input.vcf} | grep -E '(PASS|#)' | bgzip > {output} 2> {log} && rm -rf output/{wildcards.sample}/StrelkaRNA/"
SnakeMake From line 275 of master/Snakefile
285
286
shell:
    "tabix {input.vcf} &> {log}"
301
302
303
304
305
306
    		shell:
        		"""
        		bcftools isec {input.vcf2} {input.vcf1} -p output/{wildcards.sample}/isec -n =2 -w 1 &> {log} 
        		mv output/{wildcards.sample}/isec/0000.vcf {output} 
				rm -rf output/{wildcards.sample}/isec
        		"""
313
shell: "bgzip {input} &> {log}"
SnakeMake From line 313 of master/Snakefile
334
335
336
337
338
339
340
341
342
shell:
    """
    snpEff -Xmx{params.heapSize} \
        -v {params.genome} \
        -c {params.snpEff_config} \
        -dataDir {params.snpEff_datadir} \
        -noStats \
        {input} > {output} 2> {log}
    """
352
353
354
355
356
357
358
shell:
    """
    SnpSift -Xmx{params.heapSize} filter "( exists ANN[0].GENE )" {input} > {output.geneFilter} 2> {log}

    SnpSift -Xmx{params.heapSize} extractFields {output.geneFilter} \
        CHROM POS GEN[0].AD REF ALT ANN[0].GENE ANN[0].BIOTYPE > {output.tsv} 2> {log}
    """
374
375
376
377
378
379
380
381
    		shell:
        		"""
				Rscript src/mbased.snpEff.R \
					--threads={threads} \
					--phase={input.phase} \
					--rna={input.tsv} \
					--outdir=output/{wildcards.sample}/2_mBASED &> {log}
			"""
SnakeMake From line 374 of master/Snakefile
391
392
393
394
395
396
397
shell:
	"""
	Rscript src/mbased.snpEff.R \
			--threads={threads} \
			--rna={input.tsv} \
			--outdir=output/{wildcards.sample}/2_mBASED &> {log}
	"""
SnakeMake From line 391 of master/Snakefile
409
410
411
412
413
414
415
416
417
418
shell:
	"""
	Rscript src/addExpression.R \
		--mbased={input.rds} \
		--sample={wildcards.sample} \
		--rpkm={input.rpkm} \
		--min=1 \
		--maf_threshold={params.maf} \
		--outdir=output/{wildcards.sample}/2_mBASED &> {log}
	"""
SnakeMake From line 409 of master/Snakefile
431
432
433
434
435
436
437
438
439
440
441
442
shell:
	"""
	mkdir -p output/{wildcards.sample}/figures

	Rscript src/figures.R \
		--mbased={input.txt} \
		--rpkm={input.rpkm} \
		--gene={input.bed} \
		--sample={wildcards.sample} \
		--maf_threshold={params.maf} \
		--outdir=output/{wildcards.sample}/figures &> {log}
	"""
SnakeMake From line 431 of master/Snakefile
458
459
460
461
462
463
464
465
shell:	
	"""
	mkdir -p output/{wildcards.sample}/3_cancer/raw

	cat {input} | cut -f1 > {output.gene} 2> {log}
	awk 'NR == FNR {{ keywords[$1]=1; next; }} {{ if ($4 in keywords) print; }}' {output.gene} {params.annotation} | \
		cut -f 1,2,3,4 | uniq > {output.bed} 2> {log}
	"""
SnakeMake From line 458 of master/Snakefile
472
473
474
475
shell:
	"""
	cut -f 1,2 {input} > {output} 2> {log}
	"""
SnakeMake From line 472 of master/Snakefile
486
487
488
489
shell:
	"""
	bedtools flank -l 2000 -r 500 -i {input.gene} -g {input.length} > {output} 2> {log}
	"""
499
500
501
502
503
504
505
shell:
	"""
	Rscript src/cnv_preprocess.R \
		--cnv={input} \
		--tumorContent={params.tumor} \
		--outdir=output/{wildcards.sample}/3_cancer/raw &> {log}
	"""
SnakeMake From line 499 of master/Snakefile
514
515
516
517
518
shell:
	"""
	mkdir -p output/{wildcards.sample}/3_cancer/intersect
	bedtools intersect -loj -a {input.gene} -b {input.cnv} | awk '$10 != "." {{print $0}}' > {output} 2> {log}
	"""
530
531
532
533
534
535
shell:
        """
        Rscript src/methyl_preprocess.R \
                --methyl={input} \
                --outdir=output/{wildcards.sample}/3_cancer/raw &> {log}
        """
SnakeMake From line 530 of master/Snakefile
545
546
547
548
549
shell:
	"""
	mkdir -p output/{wildcards.sample}/3_cancer/intersect
	bedtools intersect -loj -a {input.gene} -b {input.methyl} | awk '$10 != "." {{print $0}}' > {output} 2> {log}
	"""
563
564
565
566
567
568
shell:
	"""
	mkdir -p output/{wildcards.sample}/3_cancer/tfbs
	awk 'BEGIN {{print "sequence_id\tgene"}}; {{print $1 ":" $2 "-" $3 "\t" $4}}' {input} > {output.id2gene} 2> {log}
	cut -f1 {output.id2gene} | tail -n +2 > {output.id}
	"""
SnakeMake From line 563 of master/Snakefile
578
579
shell:
	"samtools faidx -r {input} {params.ref} > {output} 2> {log}"
589
590
591
592
shell:
	"""
	cat {input.ref_promoter} | bcftools consensus {input.phase_vcf} -H {wildcards.num}pIu > {output} 2> {log}
	"""
601
602
603
604
shell:
	"""
	fimo --text {params.motifFile} {input} > {output} 2> {log}
	"""
616
617
618
619
620
621
622
623
624
625
626
627
shell:
	"""
	Rscript src/compareTFBS.R \
		--allele1={input.a1} \
		--allele2={input.a2} \
		--expression_matrix={input.expression_matrix} \
		--id2gene={input.id2gene} \
		--tf={params.tf_list} \
		--min=10 \
		--outdir=output/{wildcards.sample}/3_cancer/tfbs \
		--sample={wildcards.sample} 2> {log}
	"""
SnakeMake From line 616 of master/Snakefile
643
644
645
646
647
648
649
650
651
652
653
654
shell:
	"""
	mkdir -p output/{wildcards.sample}/3_cancer/stopVar

	snpEff -Xmx{params.heapSize} \
		-v {params.genome} \
		-c {params.snpEff_config} \
		-dataDir {params.snpEff_datadir} \
		-noStats \
		{input} > {output} 2> {log}

	"""	
660
661
662
663
shell:
	"""
	cat {input} | src/vcfEffOnePerLine.pl > {output} 2> {log}
	"""
SnakeMake From line 660 of master/Snakefile
672
673
674
675
676
677
shell:
	"""
	SnpSift -Xmx{params.heapSize} \
		extractFields {input} \
		ANN[0].GENE ANN[0].EFFECT GEN[0].GT FILTER > {output} 2> {log}
	"""
684
685
686
687
688
689
shell:
	"""
	Rscript src/stopVariant.R \
		--annotation={input} \
		--outdir=output/{wildcards.sample}/3_cancer/stopVar &> {log}
	"""	
SnakeMake From line 684 of master/Snakefile
703
704
705
706
707
shell:
	"""
	mkdir -p output/{wildcards.sample}/3_cancer/somatic
	bedtools slop -l 5000 -r 1000 -i {input.gene} -g {input.length} > {output} 2> {log}
	"""
717
718
719
720
shell:
	"""
	bedtools intersect -a {input.genes} -b {input.variants} > {output} 2> {log}
	"""
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
shell:
	"""
	Rscript src/summaryTable.R \
		--cnv={input.cnv} \
		--methyl={input.methyl} \
		--tfbs={input.tfbs} \
		--stop={input.stopVar} \
		--snv={input.snv} \
		--indel={input.indel} \
		--ase={input.ase} \
		--sample={wildcards.sample} \
		--cancer={params.cancer} \
		--tissue={params.tissue} \
		--tumorContent={params.tumorContent} \
		--normal={params.normal} \
		--outdir=output/{wildcards.sample} 2> {log}
	"""
SnakeMake From line 744 of master/Snakefile
772
773
774
775
776
777
778
779
780
shell:
	"""
	mkdir -p output/{wildcards.sample}/figures/tables

	Rscript src/cancerFigures.R \
		--summary={input} \
		--outdir=output/{wildcards.sample}/figures \
		--sample={wildcards.sample} &> {log}
	"""
SnakeMake From line 772 of master/Snakefile
797
798
799
800
801
802
803
804
805
806
807
	shell:
		"""
		Rscript src/karyogramFigure.R \
        		--chromSize={input.chromSize} \
        		--centPos={input.centromere} \
        		--cna={input.cnv} \
        		--dmr={input.dmr} \
        		--ase={input.ase} \
        		--genes={input.annotation} \
        		--out=output/{wildcards.sample}/figures/karyogram &> {log}
		"""
SnakeMake From line 797 of master/Snakefile
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
suppressMessages(library(optparse))
suppressMessages(library(dplyr))

## ---------------------------------------------------------------------------
## OPTIONS
## ---------------------------------------------------------------------------

# Make help options
option_list = list(
  make_option(c("-b", "--mbased"), type="character", default=NULL,
              help="mbased rds file", metavar="character"),
  make_option(c("-s", "--sample"), type="character", default = NULL,
              help="Sample name from the RPKM matrix (HTMCP written like e.g. HTMCP.03.06.02109)", metavar="character"),
  make_option(c("-r", "--rpkm"), type="character", default = NULL,
              help="RPKM matrix", metavar="character"),
  make_option(c("-m", "--min"), type="numeric", default = 1,
              help="Minimum RPKM value", metavar="numeric"),
  make_option(c("-t", "--maf_threshold"), type="numeric", default = 0.60,
              help="Threshold for MAF to consider as ASE", metavar="numeric"),
  make_option(c("-o", "--outdir"), type="character", default = "mBASED",
              help="Output directory name", metavar="character")
)

## ---------------------------------------------------------------------------
## VARIABLES
## ---------------------------------------------------------------------------

opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)

out <- opt$outdir
sample <- opt$sample
rpkm <- read.delim(opt$rpkm, header = T, stringsAsFactors = F) 
results <- readRDS(opt$mbased)
min <- opt$min
maf_threshold <- opt$maf_threshold 

## ---------------------------------------------------------------------------
## VARIABLES
## ---------------------------------------------------------------------------

print("Adding expression")

# fix sample name
sample <- ifelse(length(grep("-", sample)) == 0, sample, gsub("-", ".", sample))

# select the RPKM of this sample
rpkm_sample <- rpkm[,c("gene", sample)] 

# expressed genes in the sample
results$geneOutput$RPKM <- rpkm_sample[match(results$geneOutput$gene, gsub(" ", "", rpkm_sample$gene, fixed = TRUE)), 2]
results_filt <- results$geneOutput[results$geneOutput$RPKM > min, ]

# filter for genes that have an RPKM calculated
results_filt <- results_filt[!is.na(results_filt$RPKM),] 

# MAF filter
results_filt$MAF <- as.factor(ifelse(results_filt$majorAlleleFrequency > maf_threshold, paste0("MAF > ", maf_threshold), paste0("MAF < ", maf_threshold)))
results_filt$aseResults <- as.factor(ifelse(results_filt$majorAlleleFrequency > maf_threshold & results_filt$padj < 0.05, "ASE", "BAE"))

# rearrange columns to a logical order
results_filt <- results_filt[,c("gene", "geneBiotype", "RPKM", "allele1IsMajor","majorAlleleFrequency", 
                                "pValueASE", "pValueHeterogeneity", "padj",
                                "significance", "MAF", "aseResults")]

# save the data frame as a table
write.table(results_filt, paste0(out, "/MBASED_expr_gene_results.txt"), quote = F, col.names = T, row.names = F, sep = "\t")
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
suppressMessages(library(tidyverse))
suppressMessages(library(optparse))
suppressMessages(library("ggsci"))


# Make help options
option_list = list(
  make_option(c("-s", "--summary"), type="character", default=NULL,
              help="Summary Table file", metavar="character"),
  make_option(c("-o", "--outdir"), type="character", default = NULL,
              help="Output directory name", metavar="character"),
  make_option(c("-a", "--sample"), type="character", default = NULL,
              help="Sample name", metavar="character")
)


# load in options 
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
out <- opt$outdir
sample <- opt$sample

summary_path <- opt$summary
summary_table <- read.delim(summary_path, header = T, stringsAsFactors = F, sep = "\t")


column <- colnames(summary_table)

############
#CNV Figures
############
if ("cnv_state" %in% column) {
  print("Creating CNV figure...")
  cnvBar <- summary_table %>%
    dplyr::filter(!is.na(cnv_state)) %>%
    dplyr::filter(padj <= 0.05) %>%
    ggplot(aes(cnv_state)) + 
    geom_bar(aes(fill = aseResults), position = "dodge") +
    ggtitle(paste0("Gene frequency in each copy number variant state"), 
            subtitle = sample) +
    ylab("Gene Count") + 
    xlab("Copy number variant state") + 
    theme_bw() + 
    scale_color_npg() + 
    scale_fill_npg() + 
    theme(axis.line = element_line(colour = "black"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank()) 


  ggsave(filename = paste0(out, "/cnvBar.pdf"), plot = cnvBar, units = "in")


  expectMAF <- summary_table %>%
    dplyr::filter(!is.na(cnv_state)) %>%
    dplyr::filter(padj <= 0.05) %>%
    dplyr::mutate(cnvRatioDiff = majorAlleleFrequency - expectedMAF) %>%
    ggplot(aes(cnvRatioDiff, padj)) + 
    geom_point(aes(colour = cnv_state)) +
    geom_vline(xintercept = 0.10) +
    geom_hline(yintercept = 0.05) +
    xlab("MAF - Expectd MAF from CNV") + 
    ylab("Adjusted pvalue") + 
    ggtitle(paste0("Copy Number Variants adjusted for expected MAF"), 
            subtitle = sample) +
    theme_bw() + 
    scale_fill_npg() +
    scale_color_npg() + 
    theme(axis.line = element_line(colour = "black"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank()) 

  ggsave(filename = paste0(out, "/expectMAF.pdf"), plot = expectMAF, units = "in")

  #### TEST CORRELATION PLOT ####
  correlationPlot <- summary_table %>%
    dplyr::filter(cnv_state == "imbalance") %>%
    dplyr::filter(padj <= 0.05) %>%
    dplyr::filter(!is.na(cnv_state)) %>%
    dplyr::mutate(rawExpectedMAF = pmax(cnv.A, cnv.B)/(cnv.A + cnv.B)) %>%
    ggplot(aes(majorAlleleFrequency, rawExpectedMAF)) +
    geom_point(aes(color = aseResults), alpha = 0.5) +
    geom_smooth(method = "lm", se = FALSE) +
    ylab("CNV expected MAF") + 
    xlab("MBASED Major Allele Frequency") + 
    ggtitle("Correlation between CNV expected MAF and \nMBASED MAF for CNV Imbalanced genes",
            subtitle = sample) + 
    theme_bw() + 
    scale_fill_npg() +
    scale_color_npg() + 
    theme(axis.line = element_line(colour = "black"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank()) 
    ggsave(filename = paste0(out, "/cnvMAFcorrlation.pdf"), plot = correlationPlot, units = "in")
}


##############
# DMR Figures
##############

if ("methyl_state" %in% column) {
  print("Creating DMR figures...")

  if ("cnv_state" %in% column) {
  dmr <- summary_table %>%
    dplyr::filter(!is.na(methyl_state)) %>%
    dplyr::filter(cnv_state == "balance") %>%
    dplyr::filter(aseResults == "ASE") %>%
    dplyr::mutate(methylation = case_when(
      methyl_state < 0 ~ "allele2Methyl",
      TRUE ~ "allele1Methyl"
    )) %>%
    dplyr::mutate(expression = case_when(
      allele1IsMajor ~ "alelle1Expression",
      TRUE ~ "alelle2Expression"
    )) %>%
    dplyr::select(methylation, expression) %>%
    group_by(methylation, expression) %>%
    summarize(n=n())

  } else { 
  dmr <- summary_table %>%
    dplyr::filter(!is.na(methyl_state)) %>%
    dplyr::filter(aseResults == "ASE") %>%
    dplyr::mutate(methylation = case_when(
      methyl_state < 0 ~ "allele2Methyl",
      TRUE ~ "allele1Methyl"
    )) %>%
    dplyr::mutate(expression = case_when(
      allele1IsMajor ~ "alelle1Expression",
      TRUE ~ "alelle2Expression"
    )) %>%
    dplyr::select(methylation, expression) %>%
    group_by(methylation, expression) %>%
    summarize(n=n())
  } 

  dmr_contigency <- ggplot(data =  dmr, mapping = aes(x = methylation, y = expression)) +
    geom_tile(aes(fill = n), colour = "white") +
    geom_text(aes(label = sprintf("%1.0f", n)), vjust = 1, colour = "white") +
    scale_fill_gradient(low = "blue", high = "red") +
    ggtitle("Allelic Methylation and Expression Contigency Table", subtitle = sample) +
    theme_bw() + 
    theme(legend.position = "none", 
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(), 
          panel.border = element_blank(),
          axis.ticks.x = element_blank(),
          axis.ticks.y = element_blank(), 
          axis.title.x=element_blank(), 
          axis.title.y=element_blank())  

  ggsave(filename = paste0(out, "/dmrContingency.pdf"), plot = dmr_contigency, units = "in")

  dmr %>%
    dplyr::mutate(sample = sample) %>%
    write.table(file = paste0(out, "/tables/dmrContingency.tsv"),
                quote = F, sep = "\t", row.names = T, col.names = T)
}




##############
# Stop Var Figures
##############
if ("stop_variant_allele" %in% column) {
  print("Creating Stop gain/loss figures...")

  stop <- summary_table %>%
    dplyr::select(gene, allele1IsMajor, aseResults, stop_variant_allele) %>% 
    dplyr::filter(!is.na(stop_variant_allele)) %>%
    dplyr::filter(aseResults == "ASE") %>%
    dplyr::mutate(expression = case_when(
      allele1IsMajor ~ "alelle1Expression",
      TRUE ~ "alelle2Expression"
    )) %>%
    dplyr::mutate(stop = case_when(
      stop_variant_allele == 1 ~ "alelle1StopVar",
      TRUE ~ "alelle2StopVar"
    )) %>%
    group_by(expression, stop) %>%
    summarize(n=n()) 

  stop %>%
    dplyr::mutate(sample = sample) %>%
    write.table(file = paste0(out, "/tables/stopVarContingency.tsv"),
                quote = F, sep = "\t", row.names = T, col.names = T)

  stopContingency <- ggplot(data =  stop, mapping = aes(x = stop, y = expression)) +
      geom_tile(aes(fill = n), colour = "white") +
      geom_text(aes(label = sprintf("%1.0f", n)), vjust = 1, colour = "white") +
      scale_fill_gradient(low = "blue", high = "red") +
      theme_bw() +
      ggtitle("ASE vs Stop gain/loss contingency table", subtitle = sample) + 
      theme(legend.position = "none", 
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(), 
            panel.border = element_blank(),
            axis.ticks.x = element_blank(),
            axis.ticks.y = element_blank(), 
            axis.title.x=element_blank(), 
            axis.title.y=element_blank()) 
  ggsave(filename = paste0(out, "/stopVarContingency.pdf"), plot = stopContingency, units = "in")

  stopVarTable <- summary_table %>%
    dplyr::mutate(stopVar = !is.na(stop_variant_allele)) %>%
    dplyr::select(aseResults, stopVar) %>%
    group_by(aseResults, stopVar) %>%
    summarize(n=n()) %>%
    dplyr::mutate(percent = n/sum(n)) %>%
    dplyr::mutate(sample = sample)

  write.table(stopVarTable, file = paste0(out, "/tables/stopVarTable.tsv"),
              quote = F, sep = "\t", row.names = F, col.names = T)

  stopBar <-  stopVarTable %>%
    dplyr::filter(stopVar) %>%
    ggplot(aes(aseResults, percent)) +
    geom_bar(aes(fill = aseResults ), stat = "identity") +
    ylab("Proportion of genes") +
    coord_flip() +
    ggtitle("Proportion of genes with stop gain or loss variant", 
            subtitle = sample) + 
    theme_bw() + 
    scale_fill_npg() +
    scale_color_npg() + 
    theme(axis.line = element_line(colour = "black"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank(), 
          legend.position = "none") 
  ggsave(filename = paste0(out, "/stopVarBar.pdf"), plot = stopBar, units = "in")
}




##############
# Somatic SNV
##############
if ("somaticSNV" %in% column) {
  print("Creating somatic SNV figures...")

  snvTable <- summary_table %>%
    dplyr::select(gene, aseResults, somaticSNV) %>%
    group_by(aseResults, somaticSNV) %>%
    summarize(n=n()) %>%
    dplyr::mutate(percent = n/sum(n)) %>%
    dplyr::mutate(sample = sample)

  write.table(snvTable, file = paste0(out, "/tables/snvTable.tsv"),
              quote = F, sep = "\t", row.names = F, col.names = T)

  snv <- snvTable %>%
    dplyr::filter(somaticSNV) %>%
    ggplot(aes(aseResults, percent)) +
    geom_bar(aes(fill = aseResults), stat = "identity") +
    ylab("Proportion of genes") +
    coord_flip() +
    ggtitle("Proportion of genes with somatic SNV mutation", 
            subtitle = sample) + 
    theme_bw() + 
    scale_fill_npg() +
    scale_color_npg() + 
    theme(axis.line = element_line(colour = "black"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank(), 
          legend.position = "none") 
  ggsave(filename = paste0(out, "/somaticSNVbar.pdf"), plot = snv, units = "in")
}



##############
# Somatic Indel
##############
if ("somaticIndel" %in% column) {
  print("Creating somatic indel figures...")

  indelTable <- summary_table %>%
    dplyr::select(gene, aseResults, somaticIndel) %>%
    group_by(aseResults, somaticIndel) %>%
    summarize(n=n()) %>%
    dplyr::mutate(percent = n/sum(n)) %>%
    dplyr::mutate(sample = sample)

  write.table(indelTable, file = paste0(out, "/tables/indelTable.tsv"),
              quote = F, sep = "\t", row.names = F, col.names = T)

  indel <- indelTable %>%
    dplyr::filter(somaticIndel) %>%
    ggplot(aes(aseResults, percent)) +
    geom_bar(aes(fill = aseResults), stat = "identity") +
    ylab("Proportion of genes") +
    coord_flip() +
    ggtitle("Proportion of genes with somatic indel mutation", 
            subtitle = sample) + 
    theme_bw() + 
    scale_fill_npg() +
    scale_color_npg() + 
    theme(axis.line = element_line(colour = "black"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank(), 
          legend.position = "none") 
  ggsave(filename = paste0(out, "/somaticIndelBar.pdf"), plot = indel, units = "in")
}


##############
# TFBS Figures
##############
if ("tf_allele" %in% column) {
  print("Creating TFBS figures...")

  tfbsTable <- summary_table %>%
    dplyr::select(gene, aseResults, transcriptionFactor) %>%
    dplyr::mutate(tfbsVar = !is.na(transcriptionFactor)) %>%
    group_by(aseResults, tfbsVar) %>%
    summarize(n=n()) %>%
    dplyr::mutate(percent = n/sum(n)) %>%
    dplyr::mutate(sample = sample)
  write.table(tfbsTable, file = paste0(out, "/tables/tfbsTable.tsv"),
              quote = F, sep = "\t", row.names = F, col.names = T)

  tfbs <- tfbsTable %>%
    dplyr::filter(tfbsVar) %>%
    ggplot(aes(aseResults, percent)) +
    geom_bar(aes(fill = aseResults), stat = "identity")+
    ylab("Proportion of genes") +
    coord_flip() +
    ggtitle("Proportion of genes with mutation in \ntranscription factor binding site", 
            subtitle = sample) + 
    theme_bw() + 
    scale_fill_npg() +
    scale_color_npg() + 
    theme(axis.line = element_line(colour = "black"),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.border = element_blank(),
          panel.background = element_blank(), 
          legend.position = "none") 
  ggsave(filename = paste0(out, "/tfbsBar.pdf"), plot = tfbs, units = "in")
}


#######################
# Cancer Figures
#######################
cancer_bar <- summary_table %>%
  dplyr::filter(cancer_gene) %>%
  ggplot(aes(aseResults)) + 
  geom_bar(aes(fill = aseResults)) +
  theme_bw() +
  scale_fill_npg() +
  scale_color_npg() +
  ggtitle("MBASED result for cancer genes", subtitle=sample) +
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank(), 
        legend.position = "none") 
ggsave(filename = paste0(out, "/cancerBar.pdf"), plot = cancer_bar, units = "in")

#######################
# Mechanism function
#######################

causeFunction <- function(dataset) {
  summaryTableExplain <- dataset %>%
    dplyr::filter(aseResults == "ASE") 
  ASEunknown <- summaryTableExplain
  cause_final <- tibble()

  if ("cnv_state" %in% column) {

    loh <- summaryTableExplain %>%
      dplyr::filter(cnv_state == "LOH") %>%
      nrow()
    cause_final <- bind_rows(cause_final, tibble(cause = "LOH", num = loh))

    ASEunknown <- ASEunknown %>%
      dplyr::filter(cnv_state != "LOH")

    summaryTableExplain$cnvRatioDiff = summaryTableExplain$majorAlleleFrequency - summaryTableExplain$expectedMAF
    ASEunknown$cnvRatioDiff = ASEunknown$majorAlleleFrequency - ASEunknown$expectedMAF

    cnv <- summaryTableExplain %>%
      dplyr::filter(cnv_state != "LOH") %>%
      dplyr::filter(cnvRatioDiff< 0.10) %>%
      nrow()
    cause_final <- bind_rows(cause_final, tibble(cause = "CNV imbalance", num = cnv))
    ASEunknown <- ASEunknown %>%
      dplyr::filter(cnvRatioDiff <= 0.10)
  }

  if ("somaticSNV" %in% column) {
    somaticSNV <- summaryTableExplain %>%
      dplyr::filter(somaticSNV) %>%
      nrow()
    cause_final <- bind_rows(cause_final, tibble(cause = "Somatic SNV", num = somaticSNV))
    ASEunknown <- ASEunknown %>%
      dplyr::filter(!somaticSNV)
  }
  if ("somaticIndel" %in% column) {
    somaticIndel <- summaryTableExplain %>%
      dplyr::filter(somaticIndel) %>%
      nrow()
    cause_final <- bind_rows(cause_final, tibble(cause = "Somatic Indel", num = somaticIndel))
    ASEunknown <- ASEunknown %>%
      dplyr::filter(!somaticIndel)
  }

  if ("methyl_state" %in% column) {
    methyl <- summaryTableExplain %>%
      dplyr::filter((methyl_state < 0 & allele1IsMajor) | (methyl_state > 0 & !allele1IsMajor)) %>%
      nrow()
    cause_final <- bind_rows(cause_final, tibble(cause = "Allelic methylation", num = methyl))
    ASEunknown <- ASEunknown %>%
      dplyr::filter(is.na(methyl_state) | !((methyl_state < 0 & allele1IsMajor) | (methyl_state > 0 & !allele1IsMajor)))
  }

  if ("stop_variant_allele" %in% column) {
    stop <- summaryTableExplain %>% 
      dplyr::filter((allele1IsMajor & stop_variant_allele == 2) | (!allele1IsMajor & stop_variant_allele == 1)) %>%
      nrow()
    cause_final <- bind_rows(cause_final, tibble(cause = "Stop variant", num = stop))
    ASEunknown <- ASEunknown %>%
      dplyr::filter(is.na(stop_variant_allele) | !((allele1IsMajor & stop_variant_allele == 2) | (!allele1IsMajor & stop_variant_allele == 1)))
  }
  if ("tf_allele" %in% column) {
    tfbs <- summaryTableExplain %>%
      dplyr::filter(!is.na(transcriptionFactor)) %>%
      nrow()
    cause_final <- bind_rows(cause_final, tibble(cause = "TFBS variant", num = tfbs))
    ASEunknown <- ASEunknown %>%
      dplyr::filter(is.na(transcriptionFactor))
  }

  unknown <- nrow(ASEunknown)
  cause_final <- bind_rows(cause_final, tibble(cause = "Unknown", num = unknown))


   return(cause_final)
}

################
# ASE mechanism
################
aseMechanism <- causeFunction(summary_table)

aseCause <- aseMechanism %>%
  ggplot(aes(reorder(cause, -num), num)) + 
  geom_bar(aes(fill = cause), stat = "identity") +
  theme_bw() + 
  scale_fill_npg() +
  scale_color_npg() + 
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank(), 
        legend.position = "none") + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
  xlab("ASE mechanism") + 
  ylab("Gene frequency") + 
  ggtitle("Frequency of all ASE genes for each genetic mechanism", 
          subtitle = sample)

ggsave(filename = paste0(out, "/aseCause.pdf"), plot = aseCause, units = "in")

write.table(aseMechanism, file = paste0(out, "/tables/aseCause.tsv"),
            quote = F, sep = "\t", row.names = F, col.names = T)

################################
# Cancer ASE mechanism
################################
cancerMechanism <- summary_table %>%
  dplyr::filter(cancer_gene) %>%
  causeFunction()


cancerCause <- cancerMechanism %>%
  ggplot(aes(reorder(cause, -num), num)) + 
  geom_bar(aes(fill = cause), stat = "identity") +
  theme_bw() + 
  scale_fill_npg() +
  scale_color_npg() + 
  theme(axis.line = element_line(colour = "black"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank(), 
        legend.position = "none") + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
  xlab("ASE mechanism") + 
  ylab("Gene frequency") + 
  ggtitle("Frequency of cancer ASE genes for each genetic mechanism", 
          subtitle = sample)
ggsave(filename = paste0(out, "/cancerCause.pdf"), plot = cancerCause, units = "in")


write.table(cancerMechanism, file = paste0(out, "/tables/cancerCause.tsv"),
            quote = F, sep = "\t", row.names = F, col.names = T)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
suppressMessages(library(tidyverse))
suppressMessages(library(optparse))

# Make help options
option_list = list(
  make_option(c("-c", "--cnv"), type="character", default=NULL,
              help="CNV file (from Ploidetect)", metavar="character"),
  make_option(c("-o", "--outdir"), type="character", default = "mBASED",
              help="Output directory name", metavar="character"),
  make_option(c("-t", "--tumorContent"), type="double", default = 1.0,
              help="Tumor Content (0.0-1.0)", metavar="character")

)

# load in options 
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
out <- opt$outdir
TumorContent <- opt$tumorContent

cnv <- read.delim(opt$cnv, header = T, comment.char = "#", stringsAsFactors = F)

cnv %>%
  dplyr::mutate(type = case_when(
    zygosity == "HOM" ~ "LOH",
    A != B ~ "imbalance",
    A == B ~ "balance"
  )) %>%
  dplyr::select(chr, pos, end, A, B, type) %>%
  #dplyr::mutate(chr2 = case_when(
  #  grepl("^chr", chr) ~ chr,
  #  TRUE ~ paste0("chr", chr)
  #)) %>%
  dplyr::mutate(chr = ifelse(grepl("^chr", chr), chr, paste0("chr", chr))) %>%
  dplyr::mutate(rawExpectedMAF = pmax(A, B)/(A + B)) %>%
  dplyr::mutate(expectedMAF = (rawExpectedMAF * TumorContent) + (0.5 * (1 - TumorContent))) %>%
  dplyr::select(-rawExpectedMAF) %>%
  dplyr::mutate(pos = gsub(" ", "", format(pos, scientific=F), fixed = TRUE)) %>%
  dplyr::mutate(end = gsub(" ", "", format(end, scientific=F), fixed = TRUE)) %>%
  write.table(paste0(out, "/cnv.bed"), 
              quote = F, sep = "\t", row.names = F, col.names = F)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
suppressMessages(library(optparse))
suppressMessages(library(tidyverse))

option_list = list(
  make_option(c("-e", "--expression_matrix"), type="character", default=NULL,
              help="Expression Matrix", metavar="character"),
  make_option(c("-t", "--tf"), type="character", default = NULL,
              help="Transcription Factor", metavar="character"),
  make_option(c("-a", "--allele1"), type="character", default = NULL,
              help="Allele 1", metavar="character"),
  make_option(c("-b", "--allele2"), type="character", default = NULL,
              help="Allele 2", metavar="character"),
  make_option(c("-i", "--id2gene"), type="character", default = NULL,
              help="id to gene tsv", metavar="character"),
  make_option(c("-m", "--min"), type="integer", default = 10,
              help="min expression level", metavar="character"),
  make_option(c("-o", "--outdir"), type="character", default = NULL,
              help="output directory", metavar="character"),
  make_option(c("-s", "--sample"), type="character", default = NULL,
              help="sample name", metavar="character")
)

# load in options
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
min <- opt$min
outdir <- opt$outdir
sample <- opt$sample

## GET EXPRESSED TF
expressionMatrix <- read.delim(opt$expression_matrix, sep = "\t", header = T)
transcriptionFactor <- read.delim(opt$tf, sep = "\t", header = T) %>%
  separate(Transcription.factor, sep = ":", into=c("human", "transcriptionFactor")) %>%
  dplyr::select(-human)

expressionMatrix <- expressionMatrix[,c("gene", sample)]
colnames(expressionMatrix) <- c("gene", "expression")

expressionMatrix_filt <- expressionMatrix %>%
  dplyr::filter(gene %in% transcriptionFactor$transcriptionFactor) %>%
  dplyr::filter(expression > min)

tf_expression <- transcriptionFactor %>%
  left_join(expressionMatrix_filt, by=c("transcriptionFactor"="gene")) %>%
  dplyr::filter(!is.na(expression))

# GET ALLELE 1 TFBS
allele1MEME <- read.delim(opt$allele1) %>%
  dplyr::mutate(id = paste0(motif_id, "::", sequence_name)) %>%
  pull(id)

# GET ALLELE 2 TFBS
allele2MEME <- read.delim(opt$allele2) %>%
  dplyr::mutate(id = paste0(motif_id, "::", sequence_name)) %>%
  pull(id)

# GET GAIN OR LOSS OF TFBS
Gene <- read.delim(opt$id2gene, header = T)

difference <- rbind(tibble(id = setdiff(allele1MEME, allele2MEME), allele = 1),
                    tibble(id = setdiff(allele2MEME, allele1MEME), allele = 2)) %>%
  separate(id, sep = "::", into = c("motif_id", "sequence_id")) %>%
  left_join(Gene) %>%
  dplyr::filter(motif_id %in% tf_expression$Model) %>%
  left_join(tf_expression, by = c("motif_id"="Model")) %>%
  dplyr::select(transcriptionFactor, gene, allele) %>%
  write_tsv(paste0(outdir, "/motifDiff.tsv"))
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
suppressMessages(library(optparse))
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
suppressMessages(library(RColorBrewer))
suppressMessages(library(tibble))
suppressMessages(library(chromPlot))
suppressMessages(library(networkD3))
suppressMessages(library(htmlwidgets))
suppressMessages(library(reshape2))
suppressMessages(library(ggrepel))
suppressMessages(library(ggsci))

## ---------------------------------------------------------------------------
## FIGURES
## ---------------------------------------------------------------------------


# Make help options
option_list = list(
  make_option(c("-b", "--mbased"), type="character", default=NULL,
              help="mbased dataframe output", metavar="character"),
  make_option(c("-r", "--rpkm"), type="character", default=NULL,
              help="RPKM matrix", metavar="character"),
  make_option(c("-g", "--gene"), type="character", default = NULL,
              help="Ensembl gene annotation", metavar="character"),
  make_option(c("-s", "--sample"), type="character", default = NULL,
              help="Sample name", metavar="character"),
  make_option(c("-m", "--min"), type="numeric", default = 1,
              help="Minimum RPKM value", metavar="numeric"),
  make_option(c("-t", "--maf_threshold"), type="numeric", default = 0.75,
              help="Minimum RPKM value", metavar="numeric"),
  make_option(c("-o", "--outdir"), type="character", default = NULL,
              help="Output directory", metavar="character")
)

### 
### SET UP THE DATAFRAME
###

# load in options 
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
df <- read.delim(opt$mbased, header = T, stringsAsFactors = F)
rpkm <- read.delim(opt$rpkm, header = T, stringsAsFactors = F)
all_genes <- read.delim(opt$gene, header = F, stringsAsFactors = F)
out <- opt$outdir
sample <- opt$sample
min <- opt$min
maf_threshold <- opt$maf_threshold

mafG_padjG <- paste0("MAF > ", maf_threshold, " & padj > 0.05")
mafL_padjG <- paste0("MAF < ", maf_threshold, " & padj > 0.05")
mafG_padjL <- paste0("MAF > ", maf_threshold, " & padj < 0.05")
mafL_padjL <- paste0("MAF < ", maf_threshold, " & padj < 0.05")


# fix sample name
sample <- ifelse(length(grep("-", sample)) == 0, sample, gsub("-", ".", sample))

# select sample
rpkm_sample <- rpkm[,c("gene", sample)] 
colnames(rpkm_sample) <- c("gene", "expr")

# make a colour filter
df$colour_filt <- ifelse(df$padj < 0.05 & df$majorAlleleFrequency > maf_threshold, mafG_padjL,
                         ifelse(df$padj > 0.05 & df$majorAlleleFrequency > maf_threshold, mafG_padjG,
                                ifelse(df$padj > 0.05 & df$majorAlleleFrequency < maf_threshold, mafL_padjG, mafL_padjL)))

# add the chromosome
df$chr <- all_genes$V1[match(df$gene, all_genes$V4)]

# set the factor levels
df$colour_filt <- factor(df$colour_filt, levels = c(mafG_padjL, mafL_padjL, 
                                                    mafG_padjG, mafL_padjG))
df$chr <- factor(df$chr, levels = c(paste0("chr", 1:22), "chrX"))
#df <- df[!is.na(df$chr),]

print("Beginning figures ...")

#### 
#### DOTPLOT
####

dotplot <- ggplot(df, aes(x = majorAlleleFrequency, y = padj, colour = colour_filt)) +
  geom_point(alpha = 0.5, size = 2) +
  scale_color_manual(values = c("#e74645", "black", "black", "grey")) +
  theme_bw() + 
  geom_hline(yintercept = 0.05, linetype = 2) +
  geom_vline(xintercept = maf_threshold, linetype = 2) +
  geom_text(aes(label = paste0(table(colour_filt)[mafG_padjL], " ASE genes"), x = 0.9, y = 0.75), size = 4.5, colour = "#e74645") +
  labs(x = "major allele frequency", y = "adjusted pvalue", colour = NULL) +
  theme(legend.position = "none", 
        axis.title = element_text(size = 12, face = "bold", colour = "black"),
        axis.text = element_text(size = 10, colour = "black"))

ggsave(filename = paste0(out, "/aseGenesDot.pdf"), plot = dotplot, width = 5, height = 4, units = "in")

####
#### BARPLOT
####
df_chr <- df[!is.na(df$chr),]
barplot <- ggplot(df_chr, aes(x = chr, fill = colour_filt)) +
  geom_bar() +
  scale_fill_manual(values = rev(c("#e0f0ea","#574f7d", "#95adbe", "#e74645"))) +
  theme_bw() +
  labs(x = "chromosome", y = "number of genes", fill = "ASE results")+
  theme(axis.title = element_text(size = 12, face = "bold", colour = "black"),
        axis.text.y = element_text(size = 10, colour = "black"),
        axis.text.x = element_text(size = 10, colour = "black"),
        legend.text = element_text(size = 10, colour = "black"),
        legend.title = element_text(size = 12, face = "bold", colour = "black"))

ggsave(filename = paste0(out, "/aseGenesBar.pdf"), plot = barplot, width = 12, height = 5, units = "in")


####
#### SANKEY PLOT
####

# set filters on the RPKM matrix
rpkm_sample$gene_biotype <- all_genes$V7[match(rpkm_sample$gene, all_genes$V4)]
rpkm_sample_filt1 <- rpkm_sample[rpkm_sample$gene_biotype %in% c("lincRNA", "miRNA", "protein_coding"),]
rpkm_sample_filt2 <- rpkm_sample_filt1[rpkm_sample_filt1$expr > min,] 

# get the input values for the plot
a <- nrow(rpkm_sample_filt1)
b <- c(nrow(rpkm_sample_filt2),nrow(rpkm_sample_filt1[rpkm_sample_filt1$expr <= min,] ))
c <- c(nrow(df),nrow(rpkm_sample_filt2[!rpkm_sample_filt2$gene %in% df$gene,]))
d <- c(nrow(df[df$padj < 0.05 & df$majorAlleleFrequency > maf_threshold,]),
       sum(nrow(df[df$padj >= 0.05 & df$majorAlleleFrequency <= maf_threshold,]),
           nrow(df[df$padj >= 0.05 & df$majorAlleleFrequency > maf_threshold,]),
           nrow(df[df$padj < 0.05 & df$majorAlleleFrequency <= maf_threshold,])))

# create a connection data frame
links <- data.frame(
  source=c(rep(paste0("All Genes (n=", a, ")"), 2), rep(paste0("Expressed (n=", b[1], ")"), 2), rep(paste0("Phased Genes (n=", c[1], ")"), 2)),
  target=c(paste0("Expressed (n=", b[1], ")"), paste0("Not Expressed (n=", b[2], ")"), 
           paste0("Phased Genes (n=", c[1], ")"), paste0("Unphased Genes (n=", c[2], ")"), 
           paste0("ASE Genes (n=", d[1], ")"), paste0("Biallelic Genes (n=", d[2], ")")), 
  value=c(b, c, d)
)

# create a node data frame: it lists every entities involved in the flow
nodes <- data.frame(
  name=c(as.character(links$source), 
         as.character(links$target)) %>% unique()
)

# Reformat the links
links$IDsource <- match(links$source, nodes$name)-1 
links$IDtarget <- match(links$target, nodes$name)-1

# Make the Network
sankey <- sankeyNetwork(Links = links, Nodes = nodes,
                        Source = "IDsource", Target = "IDtarget",
                        Value = "value", NodeID = "name", 
                        sinksRight=FALSE, fontSize = 18)

saveWidget(sankey, file=paste0(out, "/sankeyPlot.html"), selfcontained = F)


print("Figures completed")
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
.libPaths("/home/glchang/R/x86_64-pc-linux-gnu-library/4.1")

suppressMessages(library("tidyverse"))
suppressMessages(library("optparse"))

option_list = list(
  make_option(c("-s", "--sample"), type="character", default=NULL,
              help="Sample name", metavar="character"),
  make_option(c("-i", "--input"), type="character", default=NULL,
              help="Gene expression data from RSEM", metavar="character"),
  make_option(c("-a", "--annotation"), type="character", default=NULL,
              help="Annotation file for ensembl id to hgnc name", metavar="character"),
  make_option(c("-o", "--outdir"), type="character", default =NULL,
              help="Output directory name", metavar="character")
)

# load in options 
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
out <- opt$outdir
input <- opt$input
sample <- opt$sample
annotation <- read_delim(opt$annotation, col_names = F, show_col_types = F) %>%
  dplyr::select(X9, X4)

val <- read_delim(input, show_col_types = F) %>%
  dplyr::select(gene_id, TPM) %>%
  left_join(annotation, by = c("gene_id" = "X9")) %>%
  dplyr::select(X4, TPM) %>%
  `colnames<-`(c("gene", sample))

write.table(val, paste0(out, "/expression_matrix.tsv"), 
            col.names = T, row.names = F, quote = F, sep = "\t")
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
library(optparse)

## ---------------------------------------------------------------------------
## OPT PARSE
## ---------------------------------------------------------------------------

# options
option_list = list(
  make_option(c("-c", "--chromSize"), type="character",  
              help="chromosome sizes", metavar="character"),
  make_option(c("-p", "--centPos"), type="character", 
              help="centred centromere positions", metavar="character"),
  make_option(c("-n", "--cna"), type="character", 
              help="copy number segment file (preferably condensed)",  metavar="character"),
  make_option(c("-d", "--dmr"), type="character", 
              help="DMR positions", metavar="character"),
  make_option(c("-a", "--ase"), type="character",
              help="ASE summary table file", metavar="character"),
  make_option(c("-g", "--genes"), type="character",
              help="Gene annotation file", metavar="character"),
  make_option(c("-o", "--out"), type="character", 
              help="output file prefix", metavar="character")
)

opt_parser = OptionParser(option_list=option_list)
opt = parse_args(opt_parser)

#Note: these packages need to be installed.
suppressMessages(library(dplyr))
suppressMessages(library(reshape2))
suppressMessages(library(ggplot2))
suppressMessages(library(tidyr))
suppressMessages(library(cowplot))

# chromosome size/position info
#chromSize <- read.delim("/projects/hpv_nanopore_prj/refs/hg38_no_alt_TCGA_HTMCP_HPVs_chromSizes.txt", header = F)
#centPos <-  read.delim("/projects/hpv_nanopore_prj/refs/hg38_centromere_positions_merged.bed", header = F)
#genes <- read.delim("/projects/hpv_nanopore_prj/htmcp/ase/pull_trial/vporter-allelespecificexpression/output/HTMCP.03.06.02058/3_cancer/raw/gene_annotation.bed", header = F)
chromSize <- read.delim(opt$chromSize, header = F)
centPos <-  read.delim(opt$centPos, header = F)
genes <- read.delim(opt$genes, header = F)

## ---------------------------------------------------------------------------
## POSITION CHROMOSOME INFO
## ---------------------------------------------------------------------------

# subset to the main chromosomes
chromSize <- chromSize[chromSize$V1 %in% c(paste0("chr", 1:22), "chrX"),]

# Rename columns to chromosome and size
colnames(chromSize) <- c("chr","size")

# Reorder levels for plotting
chromSize$chr <- factor(chromSize$chr,levels=c(paste0("chr", 1:22), "chrX"))
chromSize <- chromSize[chromSize$chr %in% c(paste0("chr", 1:22), "chrX"),]

# Divide by 1Mb to clean up axis
chromSize$size <- chromSize$size/1000000

# centromere mapping
colnames(centPos) <- c("chr", "start", "end")
centPos$chr <- factor(centPos$chr,levels=c(paste0("chr", 1:22), "chrX"))
centPos$centre <- centPos$start + ((centPos$end - centPos$start)/2)
centPos$centre <- centPos$centre/1000000

## ---------------------------------------------------------------------------
## COPY NUMBER
## ---------------------------------------------------------------------------
if (!is.null(opt$cna) & opt$cna != ""){
  #cna <- read.delim("/projects/hpv_nanopore_prj/htmcp/ploidetect/illumina/Ploidetect-pipeline/ploidetect_out/HTMCP-03-06-02058/A37261_A37189/cna_condensed.txt", header = T)
  cna <- read.delim(opt$cna, header = T)
  cna$chr <- paste0("chr", cna$chr)

  # rearrange to make a bed file
  cna_bed <- cna[,c("chr", "pos", "end", "CN", "zygosity", "A", "B")]

  # categorize copy number
  cna_bed <- cna_bed %>%
    mutate(CN.Status = 
             case_when(
               zygosity == "HOM" ~ "LOH",
               A > B ~ "imbalance",
               TRUE ~ "balance"
             ))

  # Divide by 1Mb for axis
  cna_bed$end <- cna_bed$end/1000000
  cna_bed$pos <- cna_bed$pos/1000000

  # Change to factor and reorder levels
  cna_bed$chr <- factor(cna_bed$chr,levels=c(paste0("chr", 1:22), "chrX"))

  cnaLOH <- cna_bed %>% filter(CN.Status == "LOH")
  cnaGAIN <- cna_bed %>% filter(CN.Status == "imbalance") 
}

## ---------------------------------------------------------------------------
## DIFFERENTIAL METHYLATION
## ---------------------------------------------------------------------------

if (!is.null(opt$dmr) & opt$dmr != ""){
  #dmr <- read.delim("/projects/hpv_nanopore_prj/htmcp/call_integration/output/HTMCP-03-06-02058/methylation/diff_meth.csv", header = T)
  dmr <- read.delim(opt$dmr, header = T)

  # Divide by 1Mb for axis
  dmr$start <- dmr$start/1000000
  dmr$end <- dmr$end/1000000
  dmr$middle <- (dmr$start + dmr$end) / 2

  # Change to factor and reorder levels
  dmr$chr <- factor(dmr$chr, levels=c(paste0("chr", 1:22), "chrX"))

  # count in 1Mb bins
  dmrCount <- data.frame(table(as.factor(paste0(dmr$chr, ":", as.integer(dmr$middle)))))

  # split the chromosome name and bin position
  dmrPlot <- separate(dmrCount, col = Var1, into = c("chr", "pos"), sep = ":", remove = T)

  # scale to fit the plot - i.e. make the maximum width 0.65
  maxDMR <- max(dmrPlot$Freq)
  dmrPlot$percMax <- dmrPlot$Freq/maxDMR
  dmrPlot$percMax <- dmrPlot$percMax * 0.65

  # Change to factor and reorder levels
  dmrPlot$chr <- factor(dmrPlot$chr, levels=c(paste0("chr", 1:22), "chrX"))
  dmrPlot$pos <- as.numeric(dmrPlot$pos)
}

## ---------------------------------------------------------------------------
## ASE GENE HISTOGRAM
## ---------------------------------------------------------------------------

#ase <- read.delim("/projects/hpv_nanopore_prj/htmcp/ase/pull_trial/vporter-allelespecificexpression/output/HTMCP.03.06.02058/summaryTable.tsv", header = T)
ase <- read.delim(opt$ase, header = T)

# filter for ASE genes
ase <- ase[ase$aseResults == "ASE",]

# get gene positions
ase$chr <- genes$V1[match(ase$gene, genes$V4)]
ase$start <- genes$V2[match(ase$gene, genes$V4)]
ase$end <- genes$V3[match(ase$gene, genes$V4)]

# get the middle of the gene for plotting
ase$middle <- (ase$start + ase$end)/2

# get the data frame ready for plotting 
ase <- ase[,c("chr", "start", "end","middle")]
ase <- ase[complete.cases(ase),]

# Divide by 1Mb for axis
ase$middle <- ase$middle/1000000

# count in 1Mb bins
aseCount <- data.frame(table(as.factor(paste0(ase$chr, ":", as.integer(ase$middle)))))

# split the chromosome name and bin position
asePlot <- separate(aseCount[aseCount$Var1 != "NA:NA",], col = Var1, into = c("chr", "pos"), sep = ":", remove = T)

# scale to fit the plot - i.e. make the maximum width 0.65
maxASE <- max(asePlot$Freq)
asePlot$percMax <- asePlot$Freq/maxASE
asePlot$percMax <- asePlot$percMax * 0.65

# Change to factor and reorder levels
asePlot$chr <- factor(asePlot$chr, levels=c(paste0("chr", 1:22), "chrX"))
asePlot$pos <- as.numeric(asePlot$pos)

## ---------------------------------------------------------------------------
## PLOT OPTIONS
## ---------------------------------------------------------------------------

##### CNV AND DMRs AVAILABLE

if (!is.null(opt$dmr) & opt$dmr != "" & !is.null(opt$cna) & opt$cna != ""){
  # legend
  adL <- data.frame(xmin = c(9.7, 9.7, 9.7, 10.3,9.7,10.3,7.1,7.1), xmax = c(10.35, 10.35,9.75,10.35,9.75,10.35,7.26,7.26), ymin = c(240,210,238,238,208,208,235,205), ymax = c(243,213,243,243,213,213,245,215),
                    fill = c("ase","dmr","ase","ase","dmr","dmr", "gain", "loh"))

  adW <- data.frame(x = c(11.5, 11.2,8.25,7.6,10,10), y = c(240,210,240,210,250,220),
                    label = c("ASE Gene Density","DMR Density", "Imbalanced CNV", "LOH", as.character(c(maxASE,maxDMR))))

  # plot
  # chromosomes 1 - 12
  p1 <- ggplot() +
    # chromosome bars
    geom_segment(data = chromSize %>% filter(chr %in% paste0("chr", 1:12)), aes(x = chr, xend = chr, y = 0, yend = size), 
                 lineend = "round", color = "lightgrey", size = 5) +
    # LOH
    geom_rect(data = cnaLOH %>% filter(chr %in% paste0("chr", 1:12)), 
              aes(xmin = as.integer(chr) - 0.08, xmax = as.integer(chr) + 0.08, ymin = pos, ymax = end),
              fill="#94d2bd",size = 0.2) +
    # Imbalanced CNV
    geom_rect(data = cnaGAIN %>% filter(chr %in% paste0("chr", 1:12)), 
              aes(xmin = as.integer(chr) - 0.08, xmax = as.integer(chr) + 0.08, ymin = pos, ymax = end),
              fill="#ee9b00",size = 0.2) +
    # ASE genes
    geom_rect(data = asePlot %>% filter(chr %in% paste0("chr", 1:12)), 
              aes(xmin = as.integer(chr) + 0.1, xmax = (as.integer(chr) + 0.1 + percMax), ymin = pos, ymax = pos+1),
              fill = "#005f73", size = 0.25) +
    # DMRs
    geom_rect(data = dmrPlot %>% filter(chr %in% paste0("chr", 1:12)), 
              aes(xmin = (as.integer(chr) - 0.1 - percMax), xmax = as.integer(chr) - 0.1, ymin = pos, ymax = pos+1),
              fill = "#ae2012", size = 0.25) +
    # centromeres
    geom_point(data = centPos %>% filter(chr %in% paste0("chr", 1:12)), aes(x = chr, y = centre), 
               size = 5, colour = "gray") +
    # legend bars
    geom_rect(data = adL, 
              aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = fill),
              size = 0.25) +
    # legend text
    geom_text(data = adW, 
              aes(x = x, y = y, label = label))+
    scale_fill_manual(values = c("#005f73","#ae2012","#ee9b00","#94d2bd")) +
    ylim(0, 250) +
    theme_classic() +
    theme(text = element_text(size=15),axis.line=element_blank(),
          axis.ticks.x=element_blank(),
          legend.position = "none")+
    labs(x=NULL,y="Chromosome Size (Mb)")

  # chromosomes 13 - 22 + X

  # very annoying but you have to filter all the dataframes or else the factor levels won't match the integer value
  chromSizeFilt <- chromSize %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  chromSizeFilt$chr <- factor(chromSizeFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  centPosFilt <- centPos %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  centPosFilt$chr <- factor(centPosFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  cnaLOHFilt <- cnaLOH %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  cnaLOHFilt$chr <- factor(cnaLOHFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  cnaGAINFilt <- cnaGAIN %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  cnaGAINFilt$chr <- factor(cnaGAINFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  asePlotFilt <- asePlot %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  asePlotFilt$chr <- factor(asePlotFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  dmrPlotFilt <- dmrPlot %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  dmrPlotFilt$chr <- factor(dmrPlotFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))

  # chromosomes 13-22+X
  p2 <- ggplot() +
    # chromosome bars
    geom_segment(data = chromSizeFilt, aes(x = chr, xend = chr, y = 0, yend = size), 
                 lineend = "round", color = "lightgrey", size = 5) +
    # LOH
    geom_rect(data = cnaLOHFilt, 
              aes(xmin = as.integer(chr) - 0.08, xmax = as.integer(chr) + 0.08, ymin = pos, ymax = end),
              fill="#94d2bd",size = 0.2) +
    # Imbalanced CNV
    geom_rect(data = cnaGAINFilt, 
              aes(xmin = as.integer(chr) - 0.08, xmax = as.integer(chr) + 0.08, ymin = pos, ymax = end),
              fill="#ee9b00",size = 0.2) +
    # ASE genes
    geom_rect(data = asePlotFilt, 
              aes(xmin = as.integer(chr) + 0.1, xmax = (as.integer(chr) + 0.1 + percMax), ymin = pos, ymax = pos+1),
              fill = "#005f73", size = 0.25) +
    # DMRs
    geom_rect(data = dmrPlotFilt, 
              aes(xmin = (as.integer(chr) - 0.1 - percMax), xmax = as.integer(chr) - 0.1, ymin = pos, ymax = pos+1),
              fill = "#ae2012", size = 0.25) +
    # centromeres
    geom_point(data = centPosFilt, aes(x = chr, y = centre), 
               size = 5, colour = "black") +
    ylim(0, 250) +
    theme_classic() +
    theme(text = element_text(size=15),axis.line=element_blank(),
          axis.ticks.x=element_blank(),
          legend.position = "none")+
    labs(x=NULL,y=NULL)

} else if (!is.null(opt$cna) & opt$cna != ""){ ### CNVs BUT NO DMRs
  # legend
  adL <- data.frame(xmin = c(9.7, 9.7, 10.3,7.1,7.1), xmax = c(10.35, 9.75,10.35,7.26,7.26), ymin = c(240,238,238,235,205), ymax = c(243,243,243,245,215),
                    fill = c("ase","ase","ase","gain", "loh"))

  adW <- data.frame(x = c(11.5,8.25,7.6,10), y = c(240,240,210,250),
                    label = c("ASE Gene Density", "Imbalanced CNV", "LOH", as.character(c(maxASE))))

  # plot
  # chromosomes 1 - 12
  p1 <- ggplot() +
    # chromosome bars
    geom_segment(data = chromSize %>% filter(chr %in% paste0("chr", 1:12)), aes(x = chr, xend = chr, y = 0, yend = size), 
                 lineend = "round", color = "lightgrey", size = 5) +
    # LOH
    geom_rect(data = cnaLOH %>% filter(chr %in% paste0("chr", 1:12)), 
              aes(xmin = as.integer(chr) - 0.08, xmax = as.integer(chr) + 0.08, ymin = pos, ymax = end),
              fill="#94d2bd",size = 0.2) +
    # Imbalanced CNV
    geom_rect(data = cnaGAIN %>% filter(chr %in% paste0("chr", 1:12)), 
              aes(xmin = as.integer(chr) - 0.08, xmax = as.integer(chr) + 0.08, ymin = pos, ymax = end),
              fill="#ee9b00",size = 0.2) +
    # ASE genes
    geom_rect(data = asePlot %>% filter(chr %in% paste0("chr", 1:12)), 
              aes(xmin = as.integer(chr) + 0.1, xmax = (as.integer(chr) + 0.1 + percMax), ymin = pos, ymax = pos+1),
              fill = "#005f73", size = 0.25) +
    # centromeres
    geom_point(data = centPos %>% filter(chr %in% paste0("chr", 1:12)), aes(x = chr, y = centre), 
               size = 5, colour = "gray") +
    # legend bars
    geom_rect(data = adL, 
              aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = fill),
              size = 0.25) +
    # legend text
    geom_text(data = adW, 
              aes(x = x, y = y, label = label))+
    scale_fill_manual(values = c("#005f73","#ee9b00","#94d2bd")) +
    ylim(0, 250) +
    theme_classic() +
    theme(text = element_text(size=15),axis.line=element_blank(),
          axis.ticks.x=element_blank(),
          legend.position = "none")+
    labs(x=NULL,y="Chromosome Size (Mb)")

  # chromosomes 13 - 22 + X

  # very annoying but you have to filter all the dataframes or else the factor levels won't match the integer value
  chromSizeFilt <- chromSize %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  chromSizeFilt$chr <- factor(chromSizeFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  centPosFilt <- centPos %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  centPosFilt$chr <- factor(centPosFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  cnaLOHFilt <- cnaLOH %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  cnaLOHFilt$chr <- factor(cnaLOHFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  cnaGAINFilt <- cnaGAIN %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  cnaGAINFilt$chr <- factor(cnaGAINFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  asePlotFilt <- asePlot %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  asePlotFilt$chr <- factor(asePlotFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))

  # chromosomes 13-22+X
  p2 <- ggplot() +
    # chromosome bars
    geom_segment(data = chromSizeFilt, aes(x = chr, xend = chr, y = 0, yend = size), 
                 lineend = "round", color = "lightgrey", size = 5) +
    # LOH
    geom_rect(data = cnaLOHFilt, 
              aes(xmin = as.integer(chr) - 0.08, xmax = as.integer(chr) + 0.08, ymin = pos, ymax = end),
              fill="#94d2bd",size = 0.2) +
    # Imbalanced CNV
    geom_rect(data = cnaGAINFilt, 
              aes(xmin = as.integer(chr) - 0.08, xmax = as.integer(chr) + 0.08, ymin = pos, ymax = end),
              fill="#ee9b00",size = 0.2) +
    # ASE genes
    geom_rect(data = asePlotFilt, 
              aes(xmin = as.integer(chr) + 0.1, xmax = (as.integer(chr) + 0.1 + percMax), ymin = pos, ymax = pos+1),
              fill = "#005f73", size = 0.25) +
    # centromeres
    geom_point(data = centPosFilt, aes(x = chr, y = centre), 
               size = 5, colour = "black") +
    ylim(0, 250) +
    theme_classic() +
    theme(text = element_text(size=15),axis.line=element_blank(),
          axis.ticks.x=element_blank(),
          legend.position = "none")+
    labs(x=NULL,y=NULL)

} else if (!is.null(opt$dmr) & opt$dmr != ""){ ## DMRs BUT NO CNV
  # legend
  adL <- data.frame(xmin = c(9.7, 9.7, 9.7, 10.3,9.7,10.3), xmax = c(10.35, 10.35,9.75,10.35,9.75,10.35), ymin = c(240,210,238,238,208,208), ymax = c(243,213,243,243,213,213),
                    fill = c("ase","dmr","ase","ase","dmr","dmr"))

  adW <- data.frame(x = c(11.5, 11.2,10,10), y = c(240,210,250,220),
                    label = c("ASE Gene Density","DMR Density", as.character(c(maxASE,maxDMR))))

  # plot
  # chromosomes 1 - 12
  p1 <- ggplot() +
    # chromosome bars
    geom_segment(data = chromSize %>% filter(chr %in% paste0("chr", 1:12)), aes(x = chr, xend = chr, y = 0, yend = size), 
                 lineend = "round", color = "lightgrey", size = 5) +
    # ASE genes
    geom_rect(data = asePlot %>% filter(chr %in% paste0("chr", 1:12)), 
              aes(xmin = as.integer(chr) + 0.1, xmax = (as.integer(chr) + 0.1 + percMax), ymin = pos, ymax = pos+1),
              fill = "#005f73", size = 0.25) +
    # DMRs
    geom_rect(data = dmrPlot %>% filter(chr %in% paste0("chr", 1:12)), 
              aes(xmin = (as.integer(chr) - 0.1 - percMax), xmax = as.integer(chr) - 0.1, ymin = pos, ymax = pos+1),
              fill = "#ae2012", size = 0.25) +
    # centromeres
    geom_point(data = centPos %>% filter(chr %in% paste0("chr", 1:12)), aes(x = chr, y = centre), 
               size = 5, colour = "black") +
    # legend bars
    geom_rect(data = adL, 
              aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = fill),
              size = 0.25) +
    # legend text
    geom_text(data = adW, 
              aes(x = x, y = y, label = label))+
    scale_fill_manual(values = c("#005f73","#ae2012")) +
    ylim(0, 250) +
    theme_classic() +
    theme(text = element_text(size=15),axis.line=element_blank(),
          axis.ticks.x=element_blank(),
          legend.position = "none")+
    labs(x=NULL,y="Chromosome Size (Mb)")

  # chromosomes 13 - 22 + X

  # very annoying but you have to filter all the dataframes or else the factor levels won't match the integer value
  chromSizeFilt <- chromSize %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  chromSizeFilt$chr <- factor(chromSizeFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  centPosFilt <- centPos %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  centPosFilt$chr <- factor(centPosFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  asePlotFilt <- asePlot %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  asePlotFilt$chr <- factor(asePlotFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  dmrPlotFilt <- dmrPlot %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  dmrPlotFilt$chr <- factor(dmrPlotFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))

  # chromosomes 13-22+X
  p2 <- ggplot() +
    # chromosome bars
    geom_segment(data = chromSizeFilt, aes(x = chr, xend = chr, y = 0, yend = size), 
                 lineend = "round", color = "lightgrey", size = 5) +
    # ASE genes
    geom_rect(data = asePlotFilt, 
              aes(xmin = as.integer(chr) + 0.1, xmax = (as.integer(chr) + 0.1 + percMax), ymin = pos, ymax = pos+1),
              fill = "#005f73", size = 0.25) +
    # DMRs
    geom_rect(data = dmrPlotFilt, 
              aes(xmin = (as.integer(chr) - 0.1 - percMax), xmax = as.integer(chr) - 0.1, ymin = pos, ymax = pos+1),
              fill = "#ae2012", size = 0.25) +
    # centromeres
    geom_point(data = centPosFilt, aes(x = chr, y = centre), 
               size = 5, colour = "black") +
    ylim(0, 250) +
    theme_classic() +
    theme(text = element_text(size=15),axis.line=element_blank(),
          axis.ticks.x=element_blank(),
          legend.position = "none")+
    labs(x=NULL,y=NULL)

} else{ ## NO DMRs OR CNVs
  # legend
  adL <- data.frame(xmin = c(9.7, 9.7, 10.3), xmax = c(10.35,9.75,10.35), ymin = c(240,238,238), ymax = c(243,243,243),
                    fill = c("ase","ase","ase"))

  adW <- data.frame(x = c(11.5,10), y = c(240,250),
                    label = c("ASE Gene Density", as.character(c(maxASE))))

  # plot
  # chromosomes 1 - 12
  p1 <- ggplot() +
    # chromosome bars
    geom_segment(data = chromSize %>% filter(chr %in% paste0("chr", 1:12)), aes(x = chr, xend = chr, y = 0, yend = size), 
                 lineend = "round", color = "lightgrey", size = 5) +
    # ASE genes
    geom_rect(data = asePlot %>% filter(chr %in% paste0("chr", 1:12)), 
              aes(xmin = as.integer(chr) + 0.1, xmax = (as.integer(chr) + 0.1 + percMax), ymin = pos, ymax = pos+1),
              fill = "#005f73", size = 0.25) +
    # centromeres
    geom_point(data = centPos %>% filter(chr %in% paste0("chr", 1:12)), aes(x = chr, y = centre), 
               size = 5, colour = "black") +
    # legend bars
    geom_rect(data = adL, 
              aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = fill),
              size = 0.25) +
    # legend text
    geom_text(data = adW, 
              aes(x = x, y = y, label = label))+
    scale_fill_manual(values = c("#005f73")) +
    ylim(0, 250) +
    theme_classic() +
    theme(text = element_text(size=15),axis.line=element_blank(),
          axis.ticks.x=element_blank(),
          legend.position = "none")+
    labs(x=NULL,y="Chromosome Size (Mb)")

  # chromosomes 13 - 22 + X

  # very annoying but you have to filter all the dataframes or else the factor levels won't match the integer value
  chromSizeFilt <- chromSize %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  chromSizeFilt$chr <- factor(chromSizeFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  centPosFilt <- centPos %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  centPosFilt$chr <- factor(centPosFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))
  asePlotFilt <- asePlot %>% filter(chr %in% c(paste0("chr", 13:22), "chrX"))
  asePlotFilt$chr <- factor(asePlotFilt$chr,levels=c(paste0("chr", 13:22), "chrX"))

  # chromosomes 13-22+X
  p2 <- ggplot() +
    # chromosome bars
    geom_segment(data = chromSizeFilt, aes(x = chr, xend = chr, y = 0, yend = size), 
                 lineend = "round", color = "lightgrey", size = 5) +
    # ASE genes
    geom_rect(data = asePlotFilt, 
              aes(xmin = as.integer(chr) + 0.1, xmax = (as.integer(chr) + 0.1 + percMax), ymin = pos, ymax = pos+1),
              fill = "#005f73", size = 0.25) +
    # centromeres
    geom_point(data = centPosFilt, aes(x = chr, y = centre), 
               size = 5, colour = "black") +
    ylim(0, 250) +
    theme_classic() +
    theme(text = element_text(size=15),axis.line=element_blank(),
          axis.ticks.x=element_blank(),
          legend.position = "none")+
    labs(x=NULL,y=NULL)
}


## ---------------------------------------------------------------------------
## PLOT 
## ---------------------------------------------------------------------------

# put them together
plot <- plot_grid(p1, p2, align = "v", axis = "l", nrow = 2)

# save plot
ggsave(plot, filename = paste0(opt$out,".pdf"), width = 10, height = 7, units = "in")
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
suppressMessages(library(optparse))
suppressMessages(library(dplyr))
suppressMessages(library(reshape2))
suppressMessages(library(prob))
suppressMessages(library(tidyr))
suppressMessages(library(MBASED))
suppressMessages(library(SummarizedExperiment))
suppressMessages(library(stats))
suppressMessages(library(tibble))

## ---------------------------------------------------------------------------
## LOAD INPUT 
## ---------------------------------------------------------------------------

# Make help options
option_list = list(
  make_option(c("-p", "--phase"), type="character", default=NULL,
              help="Phased VCF file (from WhatsHap)", metavar="character"),
  make_option(c("-r", "--rna"), type="character", default=NULL,
              help="Tumour RNA vcf file (from Strelka2)", metavar="character"),
  make_option(c("-o", "--outdir"), type="character", default = "mBASED",
              help="Output directory name", metavar="character"),
  make_option(c("-t", "--threads"), type="integer", default = "mBASED",
              help="Threads used for mbased", metavar="integer")
)

# load in options 
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
out <- opt$outdir
threads <- opt$threads

## ---------------------------------------------------------------------------
## USER FUNCTIONS
## ---------------------------------------------------------------------------

# extract info from a list
list_n_item <- function(list, n){
  sapply(list, `[`, n)
}

# define function to print out the summary of ASE results
summarizeASEResults_1s <- function(MBASEDOutput) {

  geneOutputDF <- data.frame(
    majorAlleleFrequency = assays(MBASEDOutput)$majorAlleleFrequency[,1],
    pValueASE = assays(MBASEDOutput)$pValueASE[,1],
    pValueHeterogeneity = assays(MBASEDOutput)$pValueHeterogeneity[,1])

  geneAllele <- as.data.frame(assays(metadata(MBASEDOutput)$locusSpecificResults)$allele1IsMajor) %>%
    rownames_to_column(var = "rowname") %>%
    dplyr::mutate(gene = unlist(lapply(strsplit(rowname, split = ":"),function(x){x = x[1]}))) %>%
    dplyr::group_by(gene) %>%
    summarise(allele1IsMajor = unique(mySample))

  geneOutputDF$allele1IsMajor <- geneAllele$allele1IsMajor[match(rownames(geneOutputDF), geneAllele$gene)]

  lociOutputGR <- rowRanges(metadata(MBASEDOutput)$locusSpecificResults)
  lociOutputGR$allele1IsMajor <- assays(metadata(MBASEDOutput)$locusSpecificResults)$allele1IsMajor[,1]
  lociOutputGR$MAF <- assays(metadata(MBASEDOutput)$locusSpecificResults)$MAF[,1]
  lociOutputList <- split(lociOutputGR, factor(lociOutputGR$aseID, levels=unique(lociOutputGR$aseID)))

  return(
    list(
      geneOutput=geneOutputDF,
      locusOutput=lociOutputList
    )
  )
}

## ---------------------------------------------------------------------------
## READ IN THE RNA SNV CALLS
## ---------------------------------------------------------------------------

# read in the RNA calls
rna_filt <- read.delim(opt$rna, header = T, comment.char = "#", stringsAsFactors = F)
colnames(rna_filt) <- c("CHROM", "POS", "AD","REF","ALT","gene", "gene_biotype") 
rna_filt$variant <- paste0(rna_filt$CHROM, ":", rna_filt$POS)


## ---------------------------------------------------------------------------
## EXTRACT REF/ALT READ COUNTS
## ---------------------------------------------------------------------------

# Extract and add the read counts
expr <- strsplit(rna_filt$AD, ",")
rna_filt$REF.COUNTS <- as.numeric(list_n_item(expr, 1))
rna_filt$ALT.COUNTS <- as.numeric(list_n_item(expr, 2))

## ---------------------------------------------------------------------------
## MBASED WITH OR WITHOUT PHASING
## ---------------------------------------------------------------------------

### WITH PHASING
if (!is.null(opt$phase)){

  ### 
  ### PHASING
  ###

  # WhatsHap phased VCF from ONT sequencing pipeline
  wh <- read.delim(opt$phase, header = F, comment.char = "#", stringsAsFactors = F)
  colnames(wh) <- c("CHROM",  "POS",  "ID",  "REF", "ALT", "QUAL", "FILTER",  "INFO", "FORMAT", "SAMPLE")
  wh$variant <- paste0(wh$CHROM, ":", wh$POS)

  # remove unphased variants - phased variants have the pipe "|" symbol in column 10 - and remove indels
  wh <- wh[grep("|", wh$SAMPLE, fixed=TRUE),]
  wh <- wh %>% dplyr::filter(nchar(REF) == 1 & nchar(ALT) == 1)

  # add genotype from the SAMPLE column as a new column
  info2 <- strsplit(wh$SAMPLE, ":")
  wh$GT <- list_n_item(info2, 1)

  # Add the genotype from WhatsHap 
  rna_filt$GT <- wh$GT[match(rna_filt$variant, wh$variant)]

  # Find unphased genes with one variant (test)
  singleUnphased <- rna_filt %>%
    mutate(phase = variant %in% wh$variant) %>%
    left_join(rna_filt %>% group_by(gene) %>% summarize(n=n())) %>%
    dplyr::filter(!phase & n == 1)

  # Add genotype to unphased gene with one variant (test)
  rna_filt$GT[which(rna_filt$variant %in% singleUnphased$variant)] <- "1|0"

  # annotate the phased variants as alleleA and alleleB
  rna_filt$alleleA <- ifelse(rna_filt$GT == "1|0", rna_filt$ALT, rna_filt$REF)
  rna_filt$alleleB <- ifelse(rna_filt$GT == "1|0", rna_filt$REF, rna_filt$ALT)

  # add the phased COUNTS variants as alleleA and alleleB
  rna_filt$alleleA.counts <- ifelse(rna_filt$GT == "1|0", rna_filt$ALT.COUNTS, rna_filt$REF.COUNTS)
  rna_filt$alleleB.counts <- ifelse(rna_filt$GT == "1|0", rna_filt$REF.COUNTS, rna_filt$ALT.COUNTS)

  # phased only variants
  rna_phased <- rna_filt[complete.cases(rna_filt),]

  # make SNV IDs
  rna_phased <- rna_phased %>%
    arrange(CHROM, POS) %>%
    group_by(gene) %>%
    mutate(label = paste0("SNV",1:n()))
  rna_phased$SNV.ID <- paste0(rna_phased$gene, ":", rna_phased$label)

  ### 
  ### MBASED
  ###

  print("Beginning MBASED ...")

  # make the GRanges object of the loci
  mySNVs <- GRanges(seqnames=rna_phased$CHROM,
                     ranges=IRanges(start=rna_phased$POS, width=1),
                     aseID=rna_phased$gene,
                     allele1=rna_phased$REF,
                     allele2=rna_phased$ALT)
  names(mySNVs) <- rna_phased$SNV.ID

  # create input RangedSummarizedExperiment object
  mySample <- SummarizedExperiment(
    assays=list(lociAllele1Counts=matrix(rna_phased$alleleA.counts,
                                         ncol=1,
                                         dimnames=list(names(mySNVs),'mySample')),
                lociAllele2Counts=matrix(rna_phased$alleleB.counts,
                                         ncol=1,
                                         dimnames=list(names(mySNVs),'mySample'))),
    rowRanges=mySNVs
  )

  # run MBASED
  ASEresults_1s_haplotypesKnown <- runMBASED(ASESummarizedExperiment=mySample,
                                             isPhased=TRUE,
                                             numSim=10^6,
                                             BPPARAM = MulticoreParam(workers = threads))

  saveRDS(ASEresults_1s_haplotypesKnown, file=paste0(out, "/ASEresults_1s_haplotypesKnown.rds"))
  # extract results
  results <- summarizeASEResults_1s(ASEresults_1s_haplotypesKnown)

  # adjust the pvalue with BH correction
  results$geneOutput$padj <- p.adjust(p = results$geneOutput$pValueASE, method = "BH")
  results$geneOutput$significance <- as.factor(ifelse(results$geneOutput$padj < 0.05, "padj < 0.05", "padj > 0.05"))
  results$geneOutput$gene <- rownames(results$geneOutput)

  results$geneOutput$allele1IsMajor[results$geneOutput$gene %in% singleUnphased$gene] = NA

  # add the locus
  results$geneOutput$geneBiotype <- rna_filt$gene_biotype[match(results$geneOutput$gene, rna_filt$gene)]

### WITHOUT PHASING
} else {

  # make SNV labels
  rna_filt <- rna_filt %>%
    arrange(CHROM, POS) %>%
    group_by(gene) %>%
    mutate(label = paste0("SNV",1:n()))
  rna_filt$SNV.ID <- paste0(rna_filt$gene, ":", rna_filt$label)

  ### 
  ### MBASED
  ###

  print("Beginning MBASED ...")

  # make the GRanges object of the loci
  mySNVs <- GRanges(seqnames=rna_filt$CHROM,
                    ranges=IRanges(start=rna_filt$POS, width=1),
                    aseID=rna_filt$gene,
                    allele1=rna_filt$REF,
                    allele2=rna_filt$ALT)
  names(mySNVs) <- rna_filt$SNV.ID

  ## create input RangedSummarizedExperiment object
  mySample <- SummarizedExperiment(
    assays=list(lociAllele1Counts=matrix(rna_filt$REF.COUNTS,
                                         ncol=1,
                                         dimnames=list(names(mySNVs),'mySample')),
                lociAllele2Counts=matrix(rna_filt$ALT.COUNTS,
                                         ncol=1,
                                         dimnames=list(names(mySNVs),'mySample'))),
    rowRanges=mySNVs
  )

  # run MBASED
  ASEresults_1s_haplotypesUnknown <- runMBASED(ASESummarizedExperiment=mySample,
                                               isPhased=FALSE,
                                               numSim=10^6,
                                               BPPARAM = MulticoreParam(workers = threads))
  saveRDS(ASEresults_1s_haplotypesUnknown, file=paste0(out, "/ASEresults_1s_haplotypesUnknown.rds"))

  # extract results
  results <- summarizeASEResults_1s(ASEresults_1s_haplotypesUnknown)

  # adjust the pvalue with BH correction
  results$geneOutput$padj <- p.adjust(p = results$geneOutput$pValueASE, method = "BH")
  results$geneOutput$significance <- as.factor(ifelse(results$geneOutput$padj < 0.05, "padj < 0.05", "padj > 0.05"))
  results$geneOutput$gene <- rownames(results$geneOutput)

  # add the locus
  results$geneOutput$geneBiotype <- rna_filt$gene_biotype[match(results$geneOutput$gene, rna_filt$gene)]

} 

# save the results 
saveRDS(results, file=paste0(out, "/MBASEDresults.rds"))
print("Finished MBASED")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
suppressMessages(library(tidyverse))
suppressMessages(library(optparse))

# Make help options
option_list = list(
  make_option(c("-m", "--methyl"), type="character", default=NULL,
              help="Allelic methylation (from Nanomethphase)", metavar="character"),
  make_option(c("-o", "--outdir"), type="character", default = "mBASED",
              help="Output directory name", metavar="character")
)

# load in options 
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
out <- opt$outdir


dmr <- read.delim(opt$methyl, header = T, comment.char = "#", stringsAsFactors = F)

dmr %>%
  dplyr::mutate(chr = case_when(
    grepl("^chr", chr) ~ chr,
    TRUE ~ paste0("chr", chr)
  )) %>%
  dplyr::select(chr, start, end, meanMethy1, meanMethy2, diff.Methy) %>%
  dplyr::mutate(start = gsub(" ", "", format(start, scientific=F), fixed = TRUE)) %>%
  dplyr::mutate(end = gsub(" ", "", format(end, scientific=F), fixed = TRUE)) %>%
  dplyr::mutate(meanMethy1 = format(meanMethy1, scientific = FALSE)) %>%
  dplyr::mutate(meanMethy2 = format(meanMethy2, scientific = FALSE)) %>%
  dplyr::mutate(diff.Methy = format(diff.Methy, scientific = FALSE)) %>%
  write.table(paste0(out, "/methyl.bed"), 
              quote = F, sep = "\t", row.names = F, col.names = F)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
suppressMessages(library(optparse))
suppressMessages(library(tidyverse))

option_list = list(
  make_option(c("-a", "--annotation"), type="character", default=NULL,
              help="Annotation file", metavar="character"),
  make_option(c("-o", "--outdir"), type="character", default = NULL,
              help="output directory", metavar="character")
)

# load in options
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
annotation_path <- opt$annotation
out <- opt$out

read.delim(annotation_path, sep = "\t", header = T) %>%
  `colnames<-`(c("gene", "effect", "genotype", "filter" )) %>%
  dplyr::filter(filter == "PASS") %>%
  dplyr::filter(genotype == c("1|0", "0|1")) %>%
  dplyr::filter(effect != "stop_retained_variant") %>%
  dplyr::filter(grepl("stop", effect)) %>%
  distinct(gene, genotype, .keep_all = TRUE) %>%
  dplyr::mutate(stop_variant_allele = case_when(
    genotype == "0|1" ~ 2,
    TRUE ~ 1
  )) %>%
  dplyr::select(gene, effect, stop_variant_allele) %>%
  write.table(paste0(out, "/stop_variant.tsv"), 
              sep = "\t", quote = F, row.names = F, col.names = T)
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
suppressMessages(library(tidyverse))
suppressMessages(library(optparse))


# Make help options
option_list = list(
  make_option(c("-c", "--cnv"), type="character", default=NULL,
              help="CNV bed file", metavar="character"),
  make_option(c("-m", "--methyl"), type="character", default=NULL,
              help="allelic methylation bed file", metavar="character"),
  make_option(c("-a", "--ase"), type="character", default=NULL,
              help="mbased ASE result", metavar="character"),
  make_option(c("-g", "--cancer"), type="character", default=NULL,
              help="cancer gene", metavar="character"),
  make_option(c("-s", "--sample"), type="character", default=NULL,
              help="Sample name", metavar="character"),
  make_option(c("-o", "--outdir"), type="character", default = NULL,
              help="Output directory name", metavar="character"),
  make_option(c("-t", "--tfbs"), type="character", default = NULL,
              help="Transcription Factor binding site", metavar="character"),
  make_option(c("-p", "--stop"), type="character", default = NULL,
              help="Stop Variants", metavar="character"),
  make_option(c("-n", "--snv"), type="character", default = NULL,
              help="Somatic SNV", metavar="character"),
  make_option(c("-i", "--indel"), type="character", default = NULL,
              help="Somatic Indel", metavar="character"),
  make_option(c("r-", "--tissue"), type="character", default = "allTissue",
              help="tissue name", metavar="character"),
  make_option(c("-N", "--normal"), type="character", default = NULL,
              help="Normal ASE", metavar="character"),
  make_option(c("-T", "--tumorContent"), type="double", default = 1.0,
              help="Tumor Content", metavar="character")
)


# load in options 
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
out <- opt$outdir
sample <- opt$sample

cnv_path <- opt$cnv
methyl_path <- opt$methyl
tfbs_path <- opt$tfbs
stop_path <- opt$stop
snv_path <- opt$snv
indel_path <- opt$indel
cancer_path <- opt$cancer
normal_path <- opt$normal

tissue <- opt$tissue
tumorContent <- opt$tumorContent
ase <- read.delim(opt$ase, header = T, comment.char = "#", stringsAsFactors = F)

##########
# CNV
##########

if (is.null(cnv_path) | cnv_path == "") {
  cnv <- data.frame(gene = ase$gene)
} else {
  normal <- read.delim(normal_path, sep = "\t", header = F,
                       comment.char = "#") %>%
    `colnames<-`(.[1, ]) %>%
    .[-1, ] %>%
    dplyr::select(gene, tissue) %>%
    `colnames<-`(c("gene", "normalMAF"))
  cnv <- read.delim(cnv_path, header = F, comment.char = "#") %>%
    dplyr::select(V4, V8, V9, V11) %>%
    dplyr::mutate(V8 = as.numeric(V8)) %>%
    dplyr::mutate(V9 = as.numeric(V9)) %>%
    dplyr::mutate(V11 = as.numeric(V11)) %>%
    `colnames<-`(c("gene", "cnv.A", "cnv.B", "expectedMAF")) %>%
    group_by(gene) %>% 
    summarize(cnv.A = mean(cnv.A), cnv.B = mean(cnv.B), expectedMAF = mean(expectedMAF)) %>%
    dplyr::mutate(cnv_state = case_when(
      cnv.A == 0 | cnv.B == 0 ~ "LOH",
      abs(cnv.A - cnv.B) <= 1 ~ "balance",
      TRUE ~ "imbalance"
    )) %>%
    left_join(normal, by = "gene") %>%
    dplyr::mutate(normalMAF = as.numeric(normalMAF)) %>%
    dplyr::mutate(expectedMAF = ((pmax(cnv.A,cnv.B)/(cnv.A + cnv.B))*tumorContent) + ((1-tumorContent) * normalMAF)) %>%
    dplyr::select("gene", "cnv.A", "cnv.B", "cnv_state", "expectedMAF")
}

##########
# DMR
##########

if ( is.null(methyl_path) | methyl_path == "") {
  dmr <- data.frame(gene = ase$gene)
} else {
  dmr <- read.delim(methyl_path, header = F, comment.char = "#", stringsAsFactors = F) %>%
    dplyr::filter(V5 != ".") %>%
    dplyr::select(V4, V8, V9, V10) %>%
    `colnames<-`(c("gene", "methyl.A", "methyl.B", "diff.Methyl")) %>%
    dplyr::mutate(diff.Methyl = as.numeric(diff.Methyl)) %>%
    group_by(gene) %>%
    summarize(minimum = min(diff.Methyl), maximum = max(diff.Methyl)) %>%
    dplyr::mutate(methyl_state = case_when(
      (minimum < 0) & (maximum < 0) ~ minimum,
      (minimum > 0) & (maximum > 0) ~ maximum,
      TRUE ~ ifelse((abs(minimum) > maximum), minimum, maximum)
    )) %>%
    dplyr::select(gene, methyl_state)
}

########################################
# Transcription Factor Binding Site
########################################

if (is.null(tfbs_path) | tfbs_path == "") {
  tfbs <- data.frame(gene = ase$gene)
} else {
  tfbs <- read.delim(tfbs_path, header = T, comment.char = "#") %>%
    `colnames<-`(c("transcriptionFactor", "gene", "tf_allele")) %>%
    dplyr::select(gene, transcriptionFactor, tf_allele) %>%
    group_by(gene) %>% 
    mutate(transcriptionFactor = paste(transcriptionFactor, collapse=",")) %>%
    mutate(tf_allele = paste(tf_allele, collapse=",")) %>%
    distinct()
}

##########################
# Stop Variants
##########################

if (is.null(stop_path) | stop_path == "") {
  stopVar <- data.frame(gene = ase$gene)
} else {
  stopVar <- read.delim(stop_path, header = T, comment.char = "#") %>%
    dplyr::select(gene, stop_variant_allele)
}

##########################
# Somatic SNV
##########################

if (is.null(snv_path) | snv_path == "") {
  snv <- data.frame(gene = ase$gene)
} else {
  snv_gene <- read.delim(snv_path, header = F, comment.char = "#") %>%
    pull(V4) %>%
    unique()
  snv <- data.frame(gene = ase$gene) %>%
    dplyr::mutate(somaticSNV = gene %in% snv_gene)
}

##########################
# Somatic Indel
##########################

if (is.null(indel_path) | indel_path == "") {
  indel <- data.frame(gene = ase$gene)
} else {
  indel_gene <- read.delim(indel_path, header = F, comment.char = "#") %>%
    pull(V4) %>%
    unique()
  indel <- data.frame(gene = ase$gene) %>%
    dplyr::mutate(somaticIndel = gene %in% indel_gene)
}

##########################
# Cancer
##########################

if (is.null(cancer_path) | cancer_path == "") {
  cancer <- data.frame(gene = ase$gene)
} else {
  Can_genes <- read.delim(cancer_path, sep = "\t", header = F, comment.char = "#") %>% 
    pull()
  cancer <- data.frame(gene = ase$gene) %>%
    dplyr::mutate(cancer_gene = gene %in% Can_genes)
}

##########################
# Normal ASE
##########################

if (is.null(tissue) | tissue == "") {
  normal <- data.frame(gene = ase$gene)
} else {
  normal <- read.delim(normal_path, sep = "\t", header = F, 
                       comment.char = "#") %>% 
    `colnames<-`(.[1, ]) %>%
    .[-1, ] %>%
    dplyr::select(gene, tissue) %>%
    `colnames<-`(c("gene", "normalMAF")) 
}

##########################
# Summary Table 
##########################

summary_table <- ase %>%
  dplyr::select(gene, RPKM, allele1IsMajor, majorAlleleFrequency, padj, aseResults) %>%
  dplyr::rename("expression" = RPKM) %>%
  left_join(cnv, by = "gene") %>%
  left_join(dmr, by = "gene") %>%
  left_join(tfbs, by = "gene") %>%
  left_join(stopVar, by = "gene") %>%
  left_join(snv, by = "gene") %>%
  left_join(indel, by = "gene") %>%
  left_join(cancer, by = "gene") %>%
  left_join(normal, by = "gene") %>%
  dplyr::mutate(sample = sample) %>%
  dplyr::mutate(tumorContent = tumorContent) %>%
  write.table(paste0(out, "/summaryTable.tsv"), 
              sep = "\t", quote = F, row.names = F, col.names = T)
ShowHide 37 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/bcgsc/IMPALA
Name: impala
Version: 1.0.0
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: GNU General Public License v3.0
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...