Analysis code for the TAP-seq manuscript.

public public 1yr ago 0 bookmarks

This repository contains code to reproduce the results presented in the article. It builds upon a snakemake workflow that handles all data processing from fastq files to transcript counts etc. It also performs some high-level analyses, in particular analyses underlying figure 3 of the manuscript and quality control steps.

High level analyses underlying figures 1, 2, 3b,h,i and 4 are described in detail in the vignettes subfolder. Input data required for running these vignettes can either be produced by running the relevant parts of the snakemake pipeline, or they can be downloaded as described in the vignettes.

The workflow can be downloaded by simply cloning the repository into a location of choice:

git clone https://github.com/argschwind/TAPseq_manuscript.git

Download data

Download of raw data from the SRA archive is handled by the workflow itself and necessary fastq files are generated automatically when executing the workflow. All fastq files are saved in the raw_data directory. Downloading sequencing reads from SRA requires that a recent version of the SRA toolkit is installed and added to the PATH.

Data processing

All data processing and analyses steps can be executed through snakemake and conda. This only requires that conda and snakemake are installed. All other required dependencies will be installed through conda.

Data processing steps (and part of the analyses) can be executed by running parts of the workflow. This will create the subdirectories data and results. Data contains all processed data in one subdirectory per sample. All executed analyses will be saved as .html output in results. Alignment reports for all samples will for instance be saved under results/alignment/'sample'_align_report.html.

However, some of these steps might take a very long time and use substantial free storage!

Reproducing analyses

Input data for the following thematic analyses can be procuced by executing following snakemake rules.

TAP-seq quality control (Figure 1)

Key plots of figure 1 can be reproduced by running the Vignette for Figure1e,f. This requires the following data processing step:

# align all TAP-seq data for figure 1 and create dge downsamplings (--jobs = number of threads to
# use in parallel, please adjust; -n = dryrun, remove it to execute)
snakemake --use-conda Figure1 --jobs 4 -n

To obtain all data used for quality control, i.e. also the experiments detailed in supplementary figure 2-4 and the supplementary note, the following data processing step is required:

# create all input data for supplementary figures 2-4
snakemake --use-conda tapseq_validation --jobs 4 -n

Evaluation of differential expression performance (Figure 2)

The analyses underlyng figure 2 can be reproduced using the corresponding vignette. This requires the following data processing step; furthermore, it requires the execution of a relatively compute-expensive sampling and DE testing workflow, documented in the vignette.

# create all input data for figure 2
snakemake --use-conda Figure2 --jobs 4 -n

Enhancer screen (Figure 3)

Data and most analyses for the chromosome 8 and 11 enhancer screen can be reproduced with this command.

snakemake --use-conda enhancer_screen --jobs 4 -n

Code to reproduce the enhancer prediction analyses can be found in the Vignette for Figure 3b,h,i. This requires generating chromatin annotated enhancer - target pairs. This can be achieved by running the following snakemake rule:

snakemake --use-conda chromatin_annotated_etps --jobs 4 -n

Bone marrow cell type identification (Figure 4)

Input data for mouse bone-marrow cell type identification analyses can be produced by this command. Downstream analyses to repoduce plots shown in the article are found in the Vignette for Figure 4.

snakemake --use-conda bone_marrow_cell_types --jobs 4 -n

Executing different steps of data processing

Following commands can be used to execute data processing steps for all samples .

# generate all required alignment references
snakemake --use-conda alignment_reference -n
# align reads for all samples
snakemake --use-conda align -n
# create digital gene expression data for all samples
snakemake --use-conda dge -n

Executing full workflow

If you feel brave you can also execute the entire workflow at once. Godspeed, this will take a long time!

# execute entire project (-k: don't abort independent jobs if a job fails)
snakemake --use-conda --jobs 4 -k -n

Code Snippets

16
17
18
19
20
21
22
23
shell:
  "picard FastqToSam "
  "FASTQ={input.fastq1} "
  "FASTQ2={input.fastq2} "
  "OUTPUT={output} "
  "SAMPLE_NAME={wildcards.sample} "
  "SORT_ORDER=queryname "
  "2> {log}"
38
39
40
41
42
43
44
45
46
47
48
49
shell:
  "TagBamWithReadSequenceExtended "
  "INPUT={input} "
  "OUTPUT={output.bam} "
  "SUMMARY={output.summary} "
  "BASE_QUALITY={params.base_qual} "
  "NUM_BASES_BELOW_QUALITY={params.bases_below_qual} "
  "BASE_RANGE={params.base_range} "
  "BARCODED_READ=1 "
  "DISCARD_READ=false "
  "TAG_NAME=XC "
  "2> {log}"
64
65
66
67
68
69
70
71
72
73
74
75
shell:
  "TagBamWithReadSequenceExtended "
  "INPUT={input} "
  "OUTPUT={output.bam} "
  "SUMMARY={output.summary} "
  "BASE_QUALITY={params.base_qual} "
  "NUM_BASES_BELOW_QUALITY={params.bases_below_qual} "
  "BASE_RANGE={params.base_range} "
  "BARCODED_READ=1 "
  "DISCARD_READ=true "
  "TAG_NAME=XM "
  "2> {log}"
85
86
87
88
89
90
shell:
  "FilterBAM "
  "INPUT={input} "
  "OUTPUT={output} "
  "TAG_REJECT=XQ "
  "2> {log}"
105
106
107
108
109
110
111
112
113
shell:
  "TrimStartingSequence "
  "INPUT={input} "
  "OUTPUT={output.bam} "
  "OUTPUT_SUMMARY={output.summary} "
  "SEQUENCE={params.adapter_sequence} "
  "MISMATCHES={params.mismatches} "
  "NUM_BASES={params.num_bases} "
  "2> {log}"
127
128
129
130
131
132
133
134
shell:
  "PolyATrimmer "
  "INPUT={input} "
  "OUTPUT={output.bam} "
  "OUTPUT_SUMMARY={output.summary} "
  "MISMATCHES={params.mismatches} "
  "NUM_BASES={params.num_bases} "
  "2> {log}"
144
145
146
147
148
shell:
  "picard SamToFastq "
  "INPUT={input} "
  "FASTQ={output} "
  "2> {log}"
162
163
164
165
166
167
168
shell:
  "STAR --runThreadN {threads} "
  "--genomeDir {input.genomedir} "
  "--readFilesIn {input.fastq} "
  "--outFileNamePrefix {params.outprefix} "
  "--readFilesCommand zcat "
  "--outSAMtype BAM Unsorted ; "
183
184
185
186
187
188
shell:
  "picard SortSam "
  "INPUT={input} "
  "OUTPUT={output} "
  "SORT_ORDER=queryname "
  "2> {log}"
202
203
204
205
206
207
208
209
210
shell:
  "picard MergeBamAlignment "
  "ALIGNED_BAM={input.aligned} "
  "UNMAPPED_BAM={input.unaligned} "
  "REFERENCE_SEQUENCE={input.reference} "
  "OUTPUT={output} "
  "INCLUDE_SECONDARY_ALIGNMENTS=false "
  "PAIRED_RUN=false "
  "2> {log}"
222
223
224
225
226
227
228
229
shell:
  "TagReadWithGeneExon "
  "INPUT={input.bam} "
  "OUTPUT={output.bam} "
  "ANNOTATIONS_FILE={input.annot} "
  "TAG=GE "
  "CREATE_INDEX=true "
  "2> {log}"
241
242
243
244
245
246
247
shell:
  "BAMTagHistogram "
  "INPUT={input} "
  "OUTPUT={output} "
  "TAG=XC "
  "READ_QUALITY={params.read_quality} "
  "2> {log}"
264
265
script:
  "../scripts/processing/align_report.Rmd"
46
47
48
49
shell:
  "python scripts/processing/extract_dge.py -i {input.umi_obs} {params.whitelist_arg}"
  "-o {output.dge} --sample {params.reads} --seed {params.seed} "
  "--tpt_threshold {params.tpt_threshold} 2> {log}"
68
69
script:
  "../scripts/analyses/advanced_dge_downsampling.R"
82
83
script:
  "../scripts/analyses/reads_on_target.R"
103
104
script:
  "../scripts/analyses/tapseq_vs_cropseq.Rmd"
114
115
script:
  "../scripts/analyses/filter_mmix_umi_obs.R"
134
135
136
137
shell:
  "python scripts/processing/extract_dge.py -i {input.umi_obs} -o {output.dge} "
  "{params.whitelist_arg} --sample {params.reads} --seed {params.seed} --tpt_threshold "
  "{params.tpt_threshold} 2> {log}"
158
159
script:
  "../scripts/analyses/screen_data_qc.Rmd"
170
171
172
shell:
  "python scripts/analyses/collapse_perturbations.py -i {input.pert_status} -t {input.grna_targets} "
  "-o {output} 2> {log}"
195
196
script:
  "../scripts/analyses/differential_expression.R"
212
213
script:
  "../scripts/analyses/compare_covariates.Rmd"
231
232
script:
  "../scripts/analyses/process_de_results.R"
251
252
script:
  "../scripts/analyses/map_enhancers.Rmd"
261
262
shell:
  "wget -O {output} {params.url}"
272
273
shell:
  "wget -O {output.bam} {params.url} ; picard BuildBamIndex I={output.bam}"
284
285
286
287
288
289
shell:
  "rm -r data/k562_chromatin_data/HiC/5kb_resolution_intrachromosomal; "
  "wget -c {params.url} -O - | tar -xz K562/5kb_resolution_intrachromosomal/chr*/MAPQG0/; "
  "mv K562/5kb_resolution_intrachromosomal "
  "data/k562_chromatin_data/HiC/5kb_resolution_intrachromosomal; "
  "rmdir K562"
300
301
script:
  "../scripts/analyses/chromatin_analyses_screen.Rmd"
314
315
script:
  "../scripts/analyses/hic_analysis.Rmd"
324
325
script:
  "../scripts/analyses/abc_analysis_screen.Rmd"
337
338
339
shell:
  "Rscript scripts/analyses/extract_cropseq_vectors.R -b {input.bam} -d {input.dge_summary} "
  "-p {params.vector_pattern} -o {output}"
349
350
script:
  "../scripts/analyses/screen_grna_synthesis_errors.Rmd"
18
19
script:
  "../scripts/chromatin_annotated_etps/enhancer_screen_etps.R"
35
36
script:
  "../scripts/chromatin_annotated_etps/genome_wide_etps.R"
52
53
script:
  "../scripts/chromatin_annotated_etps/gasperini_screen_etps.R"
69
70
script:
  "../scripts/chromatin_annotated_etps/gasperini_pilot_etps.R"
86
87
script:
  "../scripts/chromatin_annotated_etps/fulco_tiling_etps.R"
22
23
24
shell:
  "wget -O {output.gtf}.gz {params.gtf_url} 2> {log} && gzip -d {output.gtf}.gz ; "
  "wget -O {output.fasta}.gz {params.fasta_url} 2>> {log} && gzip -d {output.fasta}.gz"
38
39
40
shell:
  "Rscript scripts/processing/create_tapseq_annot.R -t {input} -b {params.BSgenome} "
  "-f {output.fasta} -o {output.gtf} 2> {log}"
53
54
55
shell:
  "python scripts/processing/cropseq_vector_reference.py -i {input} -o {params.output_bn} "
  "--prefix {params.vector_prefix}"
62
63
shell:
  "touch {output.fasta} {output.gtf}"
76
77
78
79
80
shell:
  "cat {input.genome_gtf} > {output.cropseq_ref_gtf} && "
  "cat {input.vectors_gtf} >> {output.cropseq_ref_gtf} ; "
  "cat {input.genome_fasta} > {output.cropseq_ref_fasta} && "
  "cat {input.vectors_fasta} >> {output.cropseq_ref_fasta} ; "
90
91
92
93
94
shell:
  "picard CreateSequenceDictionary "
  "REFERENCE={input} "
  "OUTPUT={output} "
  "2> {log}"
105
106
107
108
109
110
shell:
  "ConvertToRefFlat "
  "ANNOTATIONS_FILE={input.annot} "
  "SEQUENCE_DICTIONARY={input.seq_dict} "
  "OUTPUT={output} "
  "2> {log}"
124
125
126
127
128
129
130
131
132
133
shell:
  "mkdir -p {output} && "  # create genomeDir directory
  "STAR --runThreadN {threads} "
  "--runMode genomeGenerate "
  "--genomeDir {output} "
  "--genomeFastaFiles {input.fasta} "
  "--sjdbGTFfile {input.gtf} "
  "--sjdbOverhang {params.sjdb_overhang} "
  "--outFileNamePrefix {params.outprefix} ; "
  "mv data/alignment_references/{wildcards.align_ref}/star.Log.out {log}"
31
32
33
shell:
  "cmd='fastq-dump --split-files --gzip --outdir {params.outdir} {wildcards.run}' ; "
  "echo $cmd > {log} ; $cmd >> {log}"
47
48
49
shell:
  "mv -v {params.outdir}/{params.access}_1.fastq.gz {output.fwd} > {log} & "
  "mv -v {params.outdir}/{params.access}_2.fastq.gz {output.rev} >> {log}"
65
66
67
shell:
  "bash scripts/download_fastq/merge_fastq.sh {params.outdir} {output.fwd} {output.rev} "
  "> {log}"
84
85
86
shell:
  "bash scripts/download_fastq/merge_fastq_i7.sh {params.outdir} {output.fwd} {output.rev} "
  "> {log}"
103
104
105
106
shell:
  "bash scripts/download_fastq/merge_fastq_i7.sh {params.outdir} {output.fwd} {output.rev} "
  "> {log} ; trimmomatic SE {output.rev} {params.outdir}/tmp.txt.gz CROP:58 2>> {log} ; "
  "mv -v {params.outdir}/tmp.txt.gz {output.rev} >> {log}"
42
43
44
45
46
47
48
49
50
51
shell:
  "GatherMolecularBarcodeDistributionByGene "
  "INPUT={input} "
  "OUTPUT={output} "
  "NUM_CORE_BARCODES={params.ncells} "
  "EDIT_DISTANCE={params.edit_distance} "
  "READ_MQ={params.read_mq} "
  "MIN_BC_READ_THRESHOLD={params.min_umi_reads} "
  "RARE_UMI_FILTER_THRESHOLD={params.rare_umi_filter} "
  "2> {log}"
69
70
71
shell:
  "python scripts/processing/extract_dge.py -i {input.umi_obs} -o {output.dge} "
  "{params.whitelist_arg} --tpt_threshold {params.tpt_threshold} 2> {log}"
88
89
90
shell:
  "Rscript scripts/processing/perturbation_status.R --infile {input} --outfile {output} "
  "--vector_patter {params.vector_prefix} --min_txs {params.min_txs} --trim 2> {log}"
106
107
script:
  "../scripts/processing/dge_report.Rmd"
119
120
script:
  "../scripts/processing/dge_report_no_perts.Rmd"
19
20
21
22
/* Define a margin before every image element */
img {
  margin-top: 3em;
}
26
27
28
29
30
31
32
# set global chunk options
knitr::opts_chunk$set(echo = FALSE)

# attach required packages
library(here)
library(pROC)
library(tidyverse)
36
37
# load chromatin annotated ETPs generated by scripts/chromatin_annotated_etps/enhancer_screen_etps.R
pairs <- read.csv(here(snakemake@input), stringsAsFactors = FALSE)
47
48
49
50
51
52
53
54
55
56
57
# plot activity score vs interaction frequency
ggplot(pairs, aes(x = int_freq + 1, y = activity_score, color = prop_grna_hits)) +
  facet_wrap(~enh_chr, ncol = 2) +
  geom_point(data = filter(pairs, significant == 0), color = "gray69") +
  geom_point(data = filter(pairs, significant == 1)) +
  labs(x = "Hi-C interaction frequency", y = "Enhancer activity (DNAse + H3K27ac)",
       title = "Enhancer activity vs Hi-C contact frequency") +
  scale_color_gradient(low = "blue", high = "orangered", na.value = "gray69") +
  scale_x_log10() +
  scale_y_log10() +
  theme_bw()
67
68
69
70
71
72
73
74
75
76
# plot abc score vs distance to TSS
ggplot(pairs, aes(x = dist_to_tss, y = abc_score_norm, color = prop_grna_hits)) +
  facet_wrap(~enh_chr, ncol = 1) +
  geom_point(data = filter(pairs, significant == 0), color = "gray69", alpha = 0.25) +
  geom_point(data = filter(pairs, significant == 1)) +
  labs(x = "Distance to TSS", y = "ABC score", "ABC score vs distance to TSS") +
  scale_color_gradient(low = "blue", high = "orangered", na.value = "gray69") +
  scale_x_log10() +
  scale_y_log10() +
  theme_bw()
82
83
84
85
86
87
88
# plot abc score vs pvalue
ggplot(pairs, aes(x = abc_score_norm, y = -log10(pvalue), color = as.factor(significant))) +
  facet_wrap(~enh_chr, ncol = 2) +
  geom_point() +
  labs(x = "ABC score", color = "Significant") +
  scale_color_manual(values = c("1" = "steelblue", "0" = "gray69")) +
  theme_bw()
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
# function to calculate area under roc curve
calc_auroc <- function(category, prediction) {

  # order by prediction
  dat <- data.frame(category, prediction) %>%
    arrange(desc(prediction))

  # compute auc
  roc_obj <- roc(dat$category, dat$prediction)
  as.numeric(auc(roc_obj))

}

# only retain data on pairs that have at least one significant hit
sig_pairs <- pairs %>%
  group_by(sample, gene) %>%
  filter(sum(significant) > 0) %>%
  select(-c(2, 5:14, 23:29, 32))

# convert to long format and set pretty predictor names
sig_pairs_long <- sig_pairs %>%
  gather(key = "predictor", value = "predictor_value",
         -c(sample, perturbation, gene, significant)) %>%
  mutate(predictor = case_when(
    predictor == "int_freq" ~ "Hi-C int. freq.",
    predictor == "dist_to_tss" ~ "Distance to TSS",
    predictor == "activity_score" ~ "Activity score",
    predictor == "abc_score_norm" ~ "ABC score",
    TRUE ~ predictor))

# calculate auroc for all genes with significant enhancers
auroc <- sig_pairs_long %>%
  group_by(sample, predictor) %>%
  summarize(auc = calc_auroc(category = significant, prediction = predictor_value))

# plot auroc per gene and predictor
ggplot(auroc, aes(x = predictor, y = auc)) +
  facet_wrap(~sample, ncol = 2) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(y = "AUC", title = "AUROC") +
  coord_flip() +
  theme_bw() +
  theme(axis.title.x = element_blank())
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
# calculate mean and median auroc
auroc_stats <- auroc %>%
  group_by(predictor) %>%
  summarize(avg_auroc = mean(auc),
            med_auroc = median(auc))

# add predictor type for plot
auroc_stats <- auroc_stats %>%
  mutate(type = case_when(
    predictor %in% c("H3K27ac", "H3K27me3", "H3K4me1", "H3K4me3") ~ "Histone PTM",
    predictor %in% c("Hi-C int. freq.", "Distance to TSS") ~ "Proximity",
    predictor %in% c("Activity score", "ABC score") ~ "ABC model",
    predictor == "Dnase.seq" ~ "DNA accessibility",
    predictor == "POLR2A" ~ "Transcription"))

# plot average auroc per predictor
ggplot(auroc_stats, aes(x = fct_reorder(predictor, desc(avg_auroc)), y = avg_auroc, fill = type)) +
  geom_bar(stat = "identity") +
  labs(x = "Predictor", y = "Average AUROC", fill = "Type") +
  scale_fill_brewer(palette = "Set1") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), axis.title.x = element_blank())
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
# only retain data on pairs that have at least one significant hit within 5 Mb of the TSS
sig_pairs_1mb <- pairs %>%
  group_by(sample, gene) %>%
  filter(dist_to_tss <= 1e6) %>%
  filter(sum(significant) > 0) %>%
  select(-c(2, 5:14, 23:29, 32))

# compute number of enhancers per gene
enh_per_gene <- summarize(sig_pairs_1mb, enhancers = n())

# plot number of enhancers per gene
ggplot(enh_per_gene, aes(x = gene, y = enhancers)) +
  facet_wrap(~sample, scales = "free_x") +
  geom_bar(stat = "identity") +
  labs(x = "Gene", y = "Enhancers", title = "Enhancers per gene within 1Mb") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# convert to long format and set pretty predictor names
sig_pairs_1mb_long <- sig_pairs_1mb %>%
  gather(key = "predictor", value = "predictor_value",
         -c(sample, perturbation, gene, significant)) %>%
  mutate(predictor = case_when(
    predictor == "int_freq" ~ "Hi-C int. freq.",
    predictor == "dist_to_tss" ~ "Distance to TSS",
    predictor == "activity_score" ~ "Activity score",
    predictor == "abc_score_norm" ~ "ABC score",
    TRUE ~ predictor))

# calculate auroc for all genes with significant enhancers
auroc_1mb <- sig_pairs_1mb_long %>%
  group_by(sample, predictor) %>%
  summarize(auc = calc_auroc(category = significant, prediction = predictor_value))

# plot auroc per gene and predictor
ggplot(auroc_1mb, aes(x = predictor, y = auc)) +
  facet_wrap(~sample, ncol = 2) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(y = "AUC", title = "AUROC 1Mb") +
  coord_flip() +
  theme_bw() +
  theme(axis.title.x = element_blank())
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
# calculate mean and median auroc
auroc_stats_1mb <- auroc_1mb %>%
  group_by(predictor) %>%
  summarize(avg_auroc = mean(auc),
            med_auroc = median(auc))

# add predictor type for plot
auroc_stats_1mb <- auroc_stats_1mb %>%
  mutate(type = case_when(
    predictor %in% c("H3K27ac", "H3K27me3", "H3K4me1", "H3K4me3") ~ "Histone PTM",
    predictor %in% c("Hi-C int. freq.", "Distance to TSS") ~ "Proximity",
    predictor %in% c("Activity score", "ABC score") ~ "ABC model",
    predictor == "Dnase.seq" ~ "DNA accessibility",
    predictor == "POLR2A" ~ "Transcription"))

# plot average auroc per predictor
ggplot(auroc_stats_1mb, aes(x = fct_reorder(predictor, desc(avg_auroc)), y = avg_auroc,
                            fill = type)) +
  geom_bar(stat = "identity") +
  labs(x = "Predictor", y = "Average AUROC", fill = "Type") +
  scale_fill_brewer(palette = "Set1") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), axis.title.x = element_blank())
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
# function to calculate roc curve
calc_roc <- function(category, prediction) {

  # order by prediction
  dat <- data.frame(category, prediction) %>%
    arrange(desc(prediction))

  # compute roc
  roc_obj <- roc(dat$category, dat$prediction)

  # return tpr and fpr
  data.frame(TPR = rev(roc_obj$sensitivities), FPR = rev(1 - roc_obj$specificities))

}

# calculate roc for all genes with significant enhancers
roc_1mb <- sig_pairs_1mb_long %>%
  group_by(sample, predictor) %>%
  do(calc_roc(category = .$significant, prediction = .$predictor_value))

# order according to average auroc
order <- auroc_stats_1mb %>%
  arrange(desc(avg_auroc)) %>%
  pull(predictor)

roc_1mb <- roc_1mb %>%
  ungroup() %>%
  mutate(predictor = factor(predictor, levels = order))

# create labels to add auroc to plots
auroc_labels <- auroc_1mb %>%
  left_join(select(auroc_stats_1mb, predictor, avg_auroc), by = "predictor") %>%
  mutate(predictor = factor(predictor, levels = order)) %>%
  mutate(label = paste0("AUC: ", round(auc, digits = 2))) %>%
  mutate(x = 0.8, y = if_else(sample == "chr11", true = 0.25, false = 0.1))

# plot roc curves for every predictor
ggplot(roc_1mb, aes(x = FPR, y = TPR, color = sample)) +
  facet_wrap(~predictor, ncol = 2) +
  geom_line(size = 1) +
  geom_abline(lty = "dashed", color = "darkgray") +
  geom_text(data = auroc_labels, aes(x, y, label = label), show.legend = FALSE) +
  labs(color = "Region") +
  scale_color_manual(values = c("chr11" = "indianred3", "chr8" = "steelblue")) +
  theme_bw() +
  theme(legend.position = "top",
        legend.text = element_text(size = 12),
        legend.title = element_text(size = 13))
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
library(tidyverse)
library(here)

# define functions ---------------------------------------------------------------------------------

# calculate transcripts-per-transcript for umi observations to filter for chimeric reads
calc_tpt <- function(umi_obs) {

  # calculate total number of reads for every cbc - umi combination
  total_reads <- umi_obs %>%
    group_by(Cell.Barcode, Molecular_Barcode) %>%
    summarize(total_reads = sum(Num_Obs))

  # add total reads to umi_obs and calculate TPT
  umi_obs %>%
    left_join(total_reads, by = c("Cell.Barcode", "Molecular_Barcode")) %>%
    mutate(tpt = Num_Obs / total_reads) %>%
    select(-total_reads)

}

# filter according to a specified tpt threshold
filter_tpt <- function(umi_obs_tpt, tpt_threshold) {
  umi_obs_tpt %>%
    filter(tpt >= tpt_threshold) %>%
    select(-tpt)
}

# downsample umi_obs for a specifc number of genic reads
downsample_reads <- function(umi_obs, reads, level = c("cell", "sample")) {

  level <- match.arg(level)

  # "uncount" umi_observations so that each read is a row by repeating each line number-of-reads
  # times (i.e. num_obs)
  reps <- rep(seq_len(nrow(umi_obs)), times = umi_obs$Num_Obs)
  umi_obs_reads <- umi_obs[reps, -4]

  # randomly sample reads per cell or the whole sample, depending on level parameter
  if (level == "cell") {
    sampled_reads <- umi_obs_reads %>%
      group_by(Cell.Barcode) %>%
      sample_n(size = reads, replace = FALSE)
  }else{
    sampled_reads <- sample_n(umi_obs_reads, size = reads, replace = FALSE)
  }

  # count observations of each sample-cell-umi combination
  sampled_umi_obs <- sampled_reads %>%
    group_by(Cell.Barcode, Gene, Molecular_Barcode) %>%
    summarize(Num_Obs = n()) %>%
    ungroup()

}

# extract dge data for each sample by calculating number of umis per gene and  cell
extract_dge <- function(umi_obs){
  umi_obs %>%
    count(Cell.Barcode, Gene) %>%
    rename(txs = n)
}

# convert dge from long format to wide matrix
create_dge_matrix <- function(dge_long) {
  dge_long %>%
    spread(key = Cell.Barcode, value = txs, fill = 0) %>%
    as.data.frame(stringsAsFactors == FALSE) %>%
    rename(GENE = Gene)
}


# load input data ----------------------------------------------------------------------------------

# get parameters
sample_reads <- as.numeric(snakemake@wildcards$reads)
sampling <- match.arg(snakemake@wildcards$sampling, choices = c("fixed", "avg"))
panel <- match.arg(snakemake@wildcards$panel, choices = c("onGenes", "onTarget"))

# set sed for reproducible sampling
set.seed(snakemake@params$seed)

# load input data
umi_obs <- read.table(here(snakemake@input$umi_obs), header = TRUE, stringsAsFactors = FALSE,
                      sep = "\t")
target_genes <- read.csv(here(snakemake@input$target_genes), stringsAsFactors = FALSE)
cbc_whitelist <- readLines(here(snakemake@input$whitelist))

# filter for cell barcodes on CBC whitelist
umi_obs_filt <- filter(umi_obs, Cell.Barcode %in% cbc_whitelist)

# remove any vector transcripts if vector_pattern is provided
if (!is.null(snakemake@params$vector_pattern)) {

  umi_obs_filt <- filter(umi_obs_filt, !grepl(Gene, pattern = snakemake@params$vector_pattern))

}

# only retain target genes if panel is set to onTarget
if (panel == "onTarget") {

  umi_obs_filt <- filter(umi_obs_filt, Gene %in% target_genes$gene)

}


# downsample genic reads ---------------------------------------------------------------------------

if (sampling == "fixed") {

  # calculate number of reads per cell
  reads_per_cell <- umi_obs_filt %>%
    group_by(Cell.Barcode) %>%
    summarize(reads = sum(Num_Obs))

  # filter for cells that have at least the desired sampling size of reads
  cells_filt <- reads_per_cell %>%
    filter(reads >= sample_reads) %>%
    pull(Cell.Barcode)

  # retain umi observations ony for these cells
  umi_obs_filt <- filter(umi_obs_filt, Cell.Barcode %in% cells_filt)

  # downsample umi observations
  ds_umi_obs <- downsample_reads(umi_obs_filt, reads = sample_reads, level = "cell")

}else{

  # calculate number of reads to sample from umi_observation
  ncells <- n_distinct(umi_obs_filt$Cell.Barcode)
  sample_reads <- ncells * sample_reads

  # abort if scaled sampling size is larger than the number of available reads
  if (sample_reads > sum(umi_obs_filt$Num_Obs)) {
    stop("Not enough input reads for desired sampling size per sample!")
  }

  # downsample umi observations
  ds_umi_obs <- downsample_reads(umi_obs_filt, reads = sample_reads, level = "sample")

}

# extract dge --------------------------------------------------------------------------------------

# calculate tpt and filter for chimeric reads using the specified tpt threshold
umi_obs_tpt <- calc_tpt(ds_umi_obs)
umi_obs_tpt <- filter_tpt(umi_obs_tpt, tpt_threshold = snakemake@params$tpt_threshold)

# extract dge and create dge matrix
dge <- extract_dge(umi_obs_tpt)
dge_mat <- create_dge_matrix(dge)

# output file
if (tools::file_ext(snakemake@output[[1]]) == "gz") {
  outfile <- gzfile(here(snakemake@output[[1]]))
}else{
  outfile <- here(snakemake@output[[1]])
}

# save dge matrix to output file
write.table(dge_mat, file = outfile, sep = "\t", row.names = FALSE, quote = FALSE)
16
17
18
19
/* Define a margin before every image element */
img {
  margin-top: 3em;
}
23
24
25
26
27
28
29
30
31
32
33
34
# set global chunk options
knitr::opts_chunk$set(echo = FALSE)

# disable all dplyr progress bars
options(dplyr.show_progress = FALSE)

# attach required packages
library(ggpubr)
library(here)
library(kableExtra)
library(rtracklayer)
library(tidyverse)
38
39
40
41
42
43
44
45
46
# load processed de output (includes confidence level and distance to TSS)
results <- here(snakemake@input$processed_results) %>%
  read.csv(stringsAsFactors = FALSE) %>%
  mutate(sample = paste0("Chr", sub("iScreen1", "", sample)),
         significant = if_else(pval_adj_allTests < 0.05, true = "sig", false = "non_sig"))

# encode chip-seq files
chip_files <- here(snakemake@input$encode_chip)
names(chip_files) <- sub("_.*", "", basename(chip_files))
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
# extract cis enhancer - gene pairs, remove significant pairs that increase gene expression
cis_enh_perts <- results %>%
  filter(enh_type == "cis", abs(dist_to_tss) >= 1000) %>%
  filter(!(significant == "sig" & manual_lfc > 0))

# only retain hits within the same target region (discard out of region control enhancers) and add
# "chr" to enhancer chromosome for use with ENCODE data
cis_enh_perts <- cis_enh_perts %>%
  filter(enh_chr == sub("Chr", "", sample)) %>%
  mutate(enh_chr = paste0("chr", enh_chr),
         gene_chr = paste0("chr", gene_chr))

# get enhancer coordinates and convert to GRanges object
enh_coords <- cis_enh_perts %>%
  select(enh_chr, enh_start, enh_end, perturbation) %>%
  distinct() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE) %>%
  `names<-`(.$perturbation)  # set names to perturbation id

# load encode chip-seq data
chip_data <- lapply(chip_files, FUN = import, which = range(enh_coords), format = "bigWig")
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
# function to calculate average assay signal per enhancer
avg_assay_signal <- function(assay, enh_coords) {
  overlaps <- findOverlaps(enh_coords, assay)
  overlaps <- split(overlaps, from(overlaps))
  names(overlaps) <- names(enh_coords[as.numeric(names(overlaps))])
  sapply(overlaps, function(x) {
    sum(mcols(assay[to(x)])$score * width(assay[to(x)])) / sum(width(assay[to(x)]))
  })
}

# calculate average chromatin signal at enhancers
enh_chrom_signal <- chip_data %>%
  sapply(FUN = avg_assay_signal, enh_coords = enh_coords) %>%
  as.data.frame() %>%
  rownames_to_column(var = "perturbation")

# transform to data.frame and merge with cis enhancer - gene pairs
cis_enh_perts <- cis_enh_perts %>%
  select(sample, perturbation, enh_chr, gene, pvalue, prop_grna_hits, dist_to_tss, significant) %>%
  left_join(enh_chrom_signal, by = "perturbation")
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
# function to randomly draw n non-significant pairs for a significant association. any enhancers
# from drawn non-significant pairs are excluded from subsequent draws
draw_control_pairs <- function(sig, non_sig, n = 3) {

  # get distance to tss bins for which a control pair needs to be drawn (repeat n times)
  bins <- select(sig, enh_chr, dTSS_bin)
  bins <- bins[rep(seq_len(nrow(bins)), each = n), ]

  # iteratively select a random negative control for each bin and remove any pairs involving that
  # enhancer from pool of non-significant pairs for any subsequenct draws
  random_ctrls <- data.frame()
  for (i in seq_len(nrow(bins))) {

    # get pairs on same chromosome and same distance to TSS bin
    nonSig_bin <- filter(non_sig, enh_chr == bins[i, "enh_chr"],
                         dTSS_bin == bins[i, "dTSS_bin"])

    # randomly draw a control pair (if possible)
    if (nrow(nonSig_bin)) {

      ctrl <- sample_n(nonSig_bin, size = 1)
      random_ctrls <- rbind(random_ctrls, ctrl)

      # remove any pairs involving that enhancer from non_sig
      non_sig <- filter(non_sig, perturbation != ctrl$perturbation)

    }else{
      warning("Not enough unique enhancers to draw control!\n", call. = FALSE)
    }
  }

  return(random_ctrls)

}
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
# create distance to TSS bins
dtss_range <- range(cis_enh_perts$dist_to_tss)
bins <- seq(from = dtss_range[1], to = dtss_range[2], by = 1e4)

# bin distance to tss
cis_enh_perts <- cis_enh_perts %>%
  mutate(dTSS_bin = cut(dist_to_tss, bins))

# extract significant enhancer - gene associations
sig_pairs <- filter(cis_enh_perts, significant == "sig")

# get other tested enhancer - gene pairs but whose enhancers appear only in non-sigificant tests
nonSig_pairs <- cis_enh_perts %>%
  filter(significant == "non_sig", !perturbation %in% sig_pairs$perturbation,
         perturbation != "GATA1")  # no GATA1, because it's the only on chrX

# randomly draw 3 control pairs for each significant enhancer - gene association
set.seed(20190617)
nonSig_ctrls <- draw_control_pairs(sig_pairs, non_sig = nonSig_pairs, n = 3)

# combine significant and control pairs
all_pairs <- bind_rows(sig_pairs, nonSig_ctrls)
181
182
183
184
185
186
187
188
189
190
# plot distance to TSS for significant pairs and sampled non-significant control pairs
ggplot(all_pairs, aes(x = dist_to_tss, fill = significant)) +
  facet_wrap(~sample) +
  geom_histogram(bins = 30) +
  labs(x = "Distance to TSS", y = "pairs",
       title = "Distance to TSS of significant vs non-significant control pairs") +
  theme_bw()

# figure caption
cap <- "Distance to TSS distribution of significant enhancers and sampled non-significant controls"
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
# extract unique enhancers from gene-enhancer pairs and convert to long format
enh_plot <- all_pairs %>%
  select(-c(sample, gene, pvalue, prop_grna_hits, dist_to_tss, dTSS_bin)) %>%
  distinct() %>%
  gather(key = "assay", value = "fc_over_ctrl", -c(perturbation, enh_chr, significant))

# pairwise comparisons using wilcoxon test and format adjusted p-value for plotting
pw_tests <- compare_means(fc_over_ctrl ~ significant, group.by = "assay", data = enh_plot,
                          method = "wilcox.test", p.adjust.method = "holm") %>%
  mutate(p.adj.plot = if_else(p.adj < 0.001, true = "< 0.001",
                              false = paste("=", round(p.adj, digits = 2))))

# plot difference in enhancer relevant chromatin marks between significant and non-significant hits
ggplot(enh_plot, aes(x = significant, y = fc_over_ctrl + 1, fill = significant)) +
  facet_wrap(~assay, ncol = 2) +
  geom_boxplot(notch = TRUE) +
  stat_pvalue_manual(pw_tests, label = "P adj. {p.adj.plot}", y.position = log10(145),
                     inherit.aes = FALSE) +
  labs(y = "Chromatin FC over control") +
  scale_x_discrete(labels = c("Non significant", "Significant")) +
  scale_y_log10(limits = c(NA, 180)) +
  scale_fill_manual(values = c(non_sig = "darkgray", sig = "steelblue")) +
  theme_bw() + 
  theme(axis.title.x = element_blank(), legend.position = "none", text = element_text(size = 15),
        panel.grid = element_blank())

# figure caption
cap <- "Chromatin marks of significant enhancers vs non-significant control enhancers"
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
# add label for confidence level based on prop gRNA hits
all_pairs_conf <- all_pairs %>%
  filter(!is.na(prop_grna_hits)) %>%
  mutate(conf = if_else(significant == "sig" & prop_grna_hits >= 0.5,
                        true = "High confidence", false = "Non significant"),
         conf = if_else(significant == "sig" & prop_grna_hits < 0.5,
                        true = "Low confidence", false = conf))

# convert to long format for plot
all_pairs_conf <- all_pairs_conf %>%
  gather(key = "assay", value = "fc_over_ctrl", -c(1:8, dTSS_bin, conf))

# perform kruskal-wallis test (non-parametric one-way anova) per assay and reformat adjusted p-value
kw_tests <- compare_means(formula = fc_over_ctrl ~ conf, method = "kruskal.test",
                          data = all_pairs_conf, group.by = "assay") %>%
  mutate(p.adj.plot = if_else(p.adj < 0.001, true = "< 0.001",
                              false = paste("=", round(p.adj, digits = 2)))) %>%
  mutate(group1 = "High confidence", group2 = "Low confidence")  # add fake groups for text in plots

# plot chromatin signal for high-, low-confidence and non significant enhancer - gene pairs
all_pairs_conf %>%
  filter(assay != "H3K27me3") %>%
  ggplot(., aes(x = conf, y = fc_over_ctrl + 1, fill = conf)) +
  facet_wrap(~assay, ncol = 4) +
  geom_boxplot(notch = TRUE) +
  stat_pvalue_manual(data = filter(kw_tests, assay != "H3K27me3"), y.position = 130, size = 5.5,
                     label = "P adj. {p.adj.plot}", remove.bracket = TRUE, inherit.aes = FALSE) +
  labs(y = "FC over control") +
  scale_x_discrete(labels = c("High conf.", "Low conf.", "Non sig.")) +
  scale_fill_manual(values = c("High confidence" = "goldenrod2",
                               "Low confidence" = "darkslategray3", "Non significant" = "gray")) +
  scale_y_log10(limits = c(NA, 180)) +
  theme_bw() +
  theme(axis.title.x = element_blank(), legend.position = "none", text = element_text(size = 20),
        panel.grid = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1))

# figure caption
cap <- "Chromatin mark signal versus enhancer confidence level"

# save plot for manuscript
ggsave(here("results/plots", "chromatin_sig_vs_conf.pdf"), device = "pdf", width = 9,
       height = 3.5)
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
# plot H3K27me3 vs confidence levels
all_pairs_conf %>%
  filter(assay == "H3K27me3") %>%
  ggplot(., aes(x = conf, y = fc_over_ctrl + 1, fill = conf)) +
  facet_wrap(~assay) +
  geom_boxplot(notch = TRUE) +
  stat_pvalue_manual(data = filter(kw_tests, assay == "H3K27me3"), y.position = 8.7, size = 5.5,
                     label = "P adj. {p.adj.plot}", remove.bracket = TRUE, inherit.aes = FALSE) +
  labs(y = "FC over control") +
  scale_x_discrete(labels = c("High conf.", "Low conf.", "Non sig.")) +
  scale_fill_manual(values = c("High confidence" = "goldenrod2",
                               "Low confidence" = "darkslategray3", "Non significant" = "gray")) +
  scale_y_log10(limits = c(NA, 10)) +
  theme_bw() +
  theme(axis.title.x = element_blank(), legend.position = "none", text = element_text(size = 20),
        panel.grid = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1))

# figure caption
cap <- "H3k27me3 signal versus enhancer confidence level"
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
# perform individual pairwise tests
pw_tests <- compare_means(fc_over_ctrl ~ conf, group.by = "assay", data = all_pairs_conf,
                          method = "wilcox.test", p.adjust.method = "holm")

# number of etps per group
etps_per_group <- count(all_pairs_conf, assay, conf)

# add number of etps per group and reformat for printing
pw_tests <- pw_tests %>%
  left_join(etps_per_group, by = c("assay" = "assay", "group1" = "conf")) %>% 
  left_join(etps_per_group, by = c("assay" = "assay", "group2" = "conf")) %>% 
  select(assay, group1, group2, n.group1 = n.x, n.group2 = n.y, p.format, p.signif, p.adj) %>%
  rename(p.value = p.format)

# create pretty html table
pw_tests %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE,
                position = "left")
329
330
# save pairwise tests as .csv file
write.csv(pw_tests, file = here("results/chromatin_pairwise_tests.csv"), row.names = FALSE)
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
from __future__ import print_function
import sys, argparse
import pandas as pd

# define functions ---------------------------------------------------------------------------------

# helper function to print to stderr
def eprint(*args, **kwargs):
  print(*args, file = sys.stderr, **kwargs)

# collapse perturbations by target
def collapse_perts(perts, targets):

  # add targets to perturbation status and remove VECTOR from data.frame
  perts = pd.merge(targets, perts, how = 'right', on = 'VECTOR').drop(columns = 'VECTOR')

  # count number of perturbations per target
  perts = perts.groupby('TARGET').sum()

  return perts

# execute if run as main program -------------------------------------------------------------------

if __name__ == '__main__':

  # parse command line arguments
  parser = argparse.ArgumentParser(description = 'Collapse perturbations by gRNA targets.')

  parser.add_argument('-i', '--inputfile', help = 'File containing perturbation status data to \
    merge. Typically output of perturbation_status.R', required = True)
  parser.add_argument('-o', '--outputfile', help = 'Output file for collapsed perturbation data',
    required = True)
  parser.add_argument('-t', '--targets', help = 'Two column .csv file listing every gRNA vector \
    (column named VECTOR) and their targets (column named TARGET).', required = True)
  parser.add_argument('-c', '--count', help = 'Should perturbations be counted rather than \
    returned as binary matrix (default)', action='store_true')

  args = parser.parse_args()

  # load perturbation and vector target files
  eprint('Loading input files...')
  perts = pd.read_csv(args.inputfile, sep = '\t')
  targets = pd.read_csv(args.targets)

  # collapse pertubations by vector targets
  eprint('Collapsing perturbations...')
  perts = collapse_perts(perts, targets)

  # convert to binary matrix unless specified otherwise
  if not args.count:
    eprint('Converting collapsed perturbations to binary...')
    perts[perts > 0] = 1

  eprint('Writing collapsed perturbations to file...')
  perts.to_csv(args.outputfile, sep = '\t', index = True)

  eprint('Done!')
17
18
19
20
/* Define a margin before every image element */
img {
  margin-top: 3em;
}
24
25
26
27
28
29
30
31
32
33
34
# set global chunk options
knitr::opts_chunk$set(echo = FALSE)

# attach required packages
library(corrplot)
library(eulerr)
library(factoextra)
library(gridExtra)
library(here)
library(rtracklayer)
library(tidyverse)
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
# dge files
dge_files <- here(snakemake@input$dge)
names(dge_files) <- basename(dirname(dge_files))

# load dge data
dge <- lapply(dge_files, FUN = read.table, header = TRUE, row.names = "GENE")

# differential expression results
de_files <- here(snakemake@input$results)

# extract relevant information from filenames
de_files_info <- de_files %>%
  sub(".*data/", "", .) %>%
  sub("/diff_expr/output_MAST_", "_", .) %>%
  sub("Covar.csv", "", .) %>%
  sub("_no", "_none", .)

# set names and load all de output files
names(de_files) <- de_files_info
de_results <- lapply(de_files, FUN = read.csv, stringsAsFactor = FALSE)

# merge into one data.frame and reformat id variable
de_results <- de_results %>%
  bind_rows(.id = "id") %>%
  separate(id, into = c("sample", "strategy", "covars"), sep = "_")

# load target gene annotations
annot <- lapply(here(snakemake@input$annot), FUN = import, format = "gtf")

# merge into one GRanges object
annot <- annot %>%
  do.call("c", .) %>%
  unique()

# filter for exons of protein-coding genes (no processed transcripts etc)
annot <- annot[annot$type == "exon" &
                 annot$gene_biotype == "protein_coding" &
                 annot$transcript_biotype == "protein_coding"
               ]

# split by gene name into a GRangesList
genes <- split(annot, f = annot$gene_name)
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
# function to extract gene expression data from total dge
extract_genex <- function(x, vector_pattern) {
  gene_rows <- grep(rownames(x), pattern = vector_pattern, invert = TRUE)
  x[gene_rows, ]
}

# function to remove dge data of cells from certain 10x lanes
filter_lanes <- function(x, lanes) {
  cells <- colnames(x)
  cell_lanes <- substr(cells, start = 1, stop = 8)
  cells_filt <- cells[!cell_lanes %in% lanes]
  x[, cells_filt]
}

# 10x lanes excluded from DE tests
rm_lanes <- unlist(snakemake@config$map_enhancers$remove_lanes)
rm_lanes[rm_lanes == "none"] <- NA

# extract gene expression data and filter for 10x lanes
genex <- dge %>%
  lapply(FUN = extract_genex, vector_pattern = snakemake@params$vector_pattern) %>%
  mapply(FUN = filter_lanes, x = ., lanes = rm_lanes, SIMPLIFY = FALSE)

# perform pca on gene expression data
pca <- genex %>%
  lapply(FUN = t) %>%
  lapply(FUN = prcomp, scale. = TRUE)
119
120
121
122
123
124
# quickly plot cell scores for first 2 principal components
p1 <- fviz_pca_ind(pca$`11iScreen1`, geom = "point", title = "Chromosome 11", col.ind = "steelblue")
p2 <- fviz_pca_ind(pca$`8iScreen1`, geom = "point", title = "Chromosome 8", col.ind = "steelblue")

# print plots
grid.arrange(p1, p2, ncol = 2)
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
# extract eigenvalues
eig_val <- lapply(pca, FUN = get_eigenvalue)

# process eigenvalues (top 20)
eig_val_top20 <- eig_val %>%
  lapply(FUN = rownames_to_column, var = "Dim") %>%
  lapply(FUN = head, 20) %>%
  bind_rows(.id = "sample") %>%
  mutate(Dim = as.integer(sub("Dim\\.", "", Dim))) %>%
  mutate(gt1 = if_else(eigenvalue > 1, true = ">1", false = "<=1"))

# plot eigenvalues per PC
ggplot(eig_val_top20, aes(x = factor(Dim), y = eigenvalue, fill = gt1)) +
  facet_wrap(~sample, ncol = 1) +
  geom_bar(stat = "identity") +
  geom_hline(yintercept = 1, lty = "dashed", col = "firebrick2") +
  labs(x = "PC", title = "Eigenvalues of top 30 PCs", fill = "eigenvalue") +
  scale_fill_manual(values = c(">1" = "steelblue", "<=1" = "darkgray")) +
  theme_bw()
154
155
156
157
158
159
160
161
162
163
# scree plot showing percentage of variance explained per PC
p1 <- fviz_eig(pca$`11iScreen1`, ncp = 11, addlabels = TRUE, ylim = c(0, 40), main = "Chr11")
p2 <- fviz_eig(pca$`8iScreen1`, ncp = 10, addlabels = TRUE, ylim = c(0, 40), main = "Chr8")

# get cummulative variance explained of top 11, resp 10 PCs
cum_prop_chr11 <- filter(eig_val_top20, sample == "11iScreen1")[11, 5]
cum_prop_chr8 <- filter(eig_val_top20, sample == "8iScreen1")[10, 5]

# print plots
grid.arrange(p1, p2, top = "Scree plots")
172
173
174
175
176
177
178
# plot correlation circle
p1 <- fviz_pca_var(pca$`11iScreen1`, col.var = "contrib", alpha.var = 0.5, title = "Genes chr11",
                   gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))
p2 <- fviz_pca_var(pca$`8iScreen1`, col.var = "contrib", alpha.var = 0.5, title = "Genes chr8",
                   gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))

grid.arrange(p1, p2, ncol = 2)
185
186
187
188
189
190
191
# plot contribution of all genes to 
p1 <- fviz_contrib(pca$`11iScreen1`, choice = "var", axes = 1,
                   title = "Contribution of genes to PC1")
p2 <- fviz_contrib(pca$`11iScreen1`, choice = "var", axes = 2,
                   title = "Contribution of genes to PC2")

grid.arrange(p1, p2, top = "Chromosome 11")
195
196
197
198
199
200
201
# plot contribution of all genes to 
p1 <- fviz_contrib(pca$`8iScreen1`, choice = "var", axes = 1,
                   title = "Contribution of genes to PC1")
p2 <- fviz_contrib(pca$`8iScreen1`, choice = "var", axes = 2,
                   title = "Contribution of genes to PC2")

grid.arrange(p1, p2, top = "Chromosome 8")
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
# function to compute average expression, standard deviation and coefficient of variation
gene_expr_stats <- function(gene_expr) {
  avg_expr <- rowMeans(gene_expr)
  sd_expr <- apply(gene_expr, MARGIN = 1, FUN = sd)
  coeff_var <- sd_expr / avg_expr

  data.frame(avg_expr, sd_expr, coeff_var, check.rows = TRUE) %>%
    rownames_to_column(var = "gene")
}

# compute gene expression stats for every gene in both samples and reformat
gene_stats <- genex %>%
  lapply(gene_expr_stats) %>%
  bind_rows(.id = "sample") %>%
  gather(key = "stat", value = "value", -sample, -gene)

# extract contributions of genes to PCs 1 & 2 and reformat
contrib <- pca %>%
  lapply(FUN = function(x) get_pca_var(x)$contrib[,1:2]) %>%
  lapply(FUN = as.data.frame) %>%
  lapply(rownames_to_column, var = "gene") %>%
  bind_rows(.id = "sample") %>%
  rename(PC1 = Dim.1, PC2 = Dim.2) %>%
  gather(key = "PC", value = "contribution", -sample, -gene)

# add gene expression statistics to contributions
contrib <- left_join(contrib, gene_stats, by = c("sample", "gene"))

# plot correlation of contribution with gene stats for chromosome 11
p1 <- contrib %>%
  filter(sample == "11iScreen1", stat != "sd_expr") %>%
  ggplot(., aes(x = value, y = contribution / 100)) +
  facet_grid(PC~stat, scales = "free")+
  geom_point(color = "steelblue") +
  labs(x = "Value", y = "Contribution (%)", title = "Contribution vs. gene stats (chr11)") +
  scale_y_log10(labels = scales::percent) +
  scale_x_log10() +
  theme_bw()

# plot correlation of contribution with gene stats for chromosome 8
p2 <- contrib %>%
  filter(sample == "8iScreen1", stat != "sd_expr") %>%
  ggplot(., aes(x = value, y = contribution / 100)) +
  facet_grid(PC~stat, scales = "free")+
  geom_point(color = "steelblue") +
  labs(x = "Value", y = "Contribution (%)", title = "Contribution vs. gene stats (chr8)") +
  scale_y_log10(labels = scales::percent) +
  scale_x_log10() +
  theme_bw()

# print plots
grid.arrange(p1, p2, ncol = 1)
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
# function to compute interesting gene expression properties per cell
cell_dge_stats <- function(dge, vector_pattern) {

  # separate dge into gene and vector expression
  vctr_rows <- grep(rownames(dge), pattern = vector_pattern)
  gene_expr <- dge[-vctr_rows, ]
  vctr_expr <- dge[vctr_rows, ]

  # calculate stats
  ngenes <- colSums(gene_expr > 0)
  total_txs <- colSums(gene_expr)

  # calculate vector stats
  total_vctr <- colSums(vctr_expr)

  data.frame(ngenes, total_txs, total_vctr, check.rows = TRUE) %>%
    rownames_to_column(var = "cell") 
}

# calculate dge statistics per experiment
cell_stats <- dge %>%
  lapply(FUN = cell_dge_stats, vector_pattern = snakemake@params$vector_pattern) %>%
  bind_rows(.id = "sample")
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
# function to correlate with cell level stats
correlate_pcs <- function(x, method = "spearman") {

  # remove sample variable and make cell rownames
  x <- select(x, -sample) %>%
    `rownames<-`(NULL) %>%  # remove existing row names
    column_to_rownames(var = "cell")

  # separate x into PCs and meta variables
  pc_cols <- grep(colnames(x), pattern = "^PC")
  pc_rot <- x[, pc_cols]
  meta_vars <- x[, -pc_cols]

  # correlate pcas with meta vars
  cor(pc_rot, meta_vars, method = method)

}

# extract scores for first 5 PCs for both experiments
pc_scores <- pca %>%
  lapply(FUN = function(pca) as.data.frame(pca$x[, 1:5])) %>%
  lapply(FUN = rownames_to_column, var = "cell") %>%
  bind_rows(.id = "sample")

# add cell level stats to pc_scores
pc_scores_stats <- pc_scores %>%
  left_join(cell_stats, by = c("sample", "cell"))

# correlate PCs with cell-level stats
corr_mats <- pc_scores_stats %>%
  split(., f = .$sample) %>%
  map(., .f = correlate_pcs)

# create correlation plots
layout(matrix(1:2, ncol = 2))
corrplot(corr_mats$`11iScreen1`, title = "11iScreen1", mar=c(0,0,2,0), cl.align.text = "l",
         cl.ratio = 0.5)
corrplot(corr_mats$`8iScreen1`, title = "8iScreen1", mar=c(0,0,2,0), cl.align.text = "l",
         cl.ratio = 0.5)
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
# plot first and second PC colored according to number of expressed genes
p1 <- ggplot(pc_scores_stats, aes(x = PC1, y = PC2, color = ngenes)) +
  facet_wrap(~sample) +
  geom_point() +
  labs(title = "Number of detected genes", color = "Genes") +
  coord_fixed() +
  theme_bw()

# plot first and second PC colored according to number of gene transcripts
p2 <- ggplot(pc_scores_stats, aes(x = PC1, y = PC2, color = total_txs)) +
  facet_wrap(~sample) +
  geom_point() +
  labs(title = "Gene transcripts", color = "Transcripts") +
  coord_fixed() +
  theme_bw()

# plot first and second PC colored according to number of vector transcripts
p3 <- ggplot(pc_scores_stats, aes(x = PC1, y = PC2, color = log10(total_vctr))) +
  facet_wrap(~sample) +
  geom_point() +
  labs(title = "Vector transcripts", color = "log10(Transcripts)") +
  coord_fixed() +
  theme_bw()

# print plots
p1
p2
p3
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
# apply multiple testing correction per pc covar and strategy
de_results <- de_results %>%
  group_by(strategy, covars) %>%
  rename(pval_adj_sample = pval_adj_allTests) %>%
  mutate(pval_adj_allTests = p.adjust(pvalue, method = "fdr"))

# extract perEnh hits
enh_perts <- de_results %>%
  filter(strategy == "perEnh") %>%
  mutate(significant = if_else(pval_adj_allTests < 0.05, true = "sig", false = "not_sig"))

# function to collapse gRNA ids per targeted enhancer
collapse_grnas <- function(grnas) {
  grnas <- sub("_.+", "", grnas)
  sub("\\.[A|B|C|D]", "", grnas)
}

# count number of gRNA hits per enhancer-gene pair
gRNA_hits <- de_results %>%
  filter(strategy == "perGRNA") %>%
  mutate(perturbation = collapse_grnas(perturbation)) %>%
  group_by(sample, covars, perturbation, gene) %>%
  summarize(grna_hits = sum(pval_adj_allTests < 0.05),
            prop_grna_hits = mean(pval_adj_allTests < 0.05))

# add single gRNA hits to enhancer hits
enh_perts <- left_join(enh_perts, gRNA_hits, by = c("sample", "covars", "perturbation", "gene"))
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
# get all significant hits and reformat to wide format, so that each row is one enhancer - target
# gene link and "covariate columns" specify whether that associatiation was discovered or not
pairs <- enh_perts %>%
  ungroup() %>%
  filter(significant == "sig") %>%
  select(sample, covars, perturbation, gene) %>%
  mutate(sig = 1) %>%
  spread(key = covars, value = sig, fill = 0) %>%
  mutate(total = rowSums(.[, -c(1:3)]))

# compute euler diagrams
euler_chr11 <- pairs %>%
  filter(sample == "11iScreen1") %>%
  select(-sample, -perturbation, -gene, -total) %>%
  euler()

euler_chr8 <- pairs %>%
  filter(sample == "8iScreen1") %>%
  select(-sample, -perturbation, -gene, -total) %>%
  euler()

# plot diagrams
p1 <- plot(euler_chr11, quantities = TRUE, labels = list(font = 4), main = "Chromosome 11")
p2 <- plot(euler_chr8, quantities = TRUE, labels = list(font = 4), main = "Chromosome 8")

# print plots
grid.arrange(p1, p2, ncol = 2)
439
440
441
442
443
444
445
446
447
448
449
450
# calculate number of hits per confidence level
hits_conf_levels <- enh_perts %>%
  group_by(sample, covars, grna_hits) %>%
  summarize(hits = sum(pval_adj_allTests < 0.05))

# plot number of hits per confidence level
ggplot(hits_conf_levels, aes(x = as.factor(grna_hits), y = hits, fill = as.factor(covars))) +
  facet_wrap(~sample) +
  geom_bar(stat = "identity", position = "dodge") + 
  labs(x = "Individual gRNA hits", y = "Significant hits", title = "Confidence level",
       fill = "PC covariates") +
  theme_bw()
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
# get hits that are found only in one unique test
unique_pairs <- pairs %>%
  right_join(enh_perts, by = c("sample", "perturbation", "gene")) %>%
  filter(total == 1)

# calculate number of hits per confidence level
unique_hits <- unique_pairs %>%
  group_by(sample, covars, grna_hits) %>%
  summarize(hits = sum(pval_adj_allTests < 0.05))

# plot confidence levels for these hits
ggplot(unique_hits, aes(x = factor(grna_hits), y = hits, fill = covars)) +
  facet_wrap(~sample) +
  geom_bar(stat = "identity", position = "dodge") + 
  labs(x = "Individual gRNA hits", y = "Significant hits",
       title = "Confidence of hits unique to one model", fill = "PC covariates") +
  theme_bw()
478
479
480
481
482
483
484
485
# download chain file for hg19 to hg38 liftover
chain_url <- "http://hgdownload.soe.ucsc.edu/gbdb/hg19/liftOver/hg19ToHg38.over.chain.gz"
chain_file <- tempfile("hg19ToHg38", fileext = ".gz")
download.file(chain_url, chain_file)
system(paste("gunzip", chain_file))

# import chain file
hg19_to_hg38_chain <- import.chain(tools::file_path_sans_ext(chain_file))
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
# extract data on discovery perturbations
disc_perts <- enh_perts %>%
  filter(grepl(perturbation, pattern = "^chr.+$")) %>%
  as.data.frame()

# extract enhancer coordinates from perturbation id and create GRanges object
enh_coords_hg19 <- disc_perts %>%
  select(perturbation) %>%
  distinct() %>%
  separate(perturbation, into = c("chr", "start", "end"), remove = FALSE) %>%
  column_to_rownames(var = "perturbation") %>%
  makeGRangesFromDataFrame()

# liftover from hg19 to hg38
enh_coords_hg38 <- enh_coords_hg19 %>%
  liftOver(chain = hg19_to_hg38_chain) %>%
  unlist()

# compute center of every enhancer
enh_centers <- data.frame(perturbation = names(enh_coords_hg38),
                          enh_chr = sub("chr", "", as.character(seqnames(enh_coords_hg38))),
                          enh_center = round((start(enh_coords_hg38) + end(enh_coords_hg38)) / 2),
                          stringsAsFactors = FALSE)

# add enhancer centers to de results
disc_perts <- disc_perts %>%
  left_join(enh_centers, by = "perturbation")
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
# get chromosome and strand for each gene
gene_chr <- unlist(unique(seqnames(genes)))
gene_strand <- unlist(unique(strand(genes)))

# function to get a genes TSS
get_tss <- function(x) {
  if (all(strand(x) == "+")) {
    min(start(x))
  }else if (all(strand(x) == "-")) {
    max(end(x))
  }else{
    stop("Inconsistent strand information!")
  }
}

# get TSS for each gene
gene_tss <- sapply(genes, FUN = get_tss)

# create data.frame with strand and tss coordinates for every gene
tss_coords <- data.frame(gene_chr, gene_strand, gene_tss, check.rows = TRUE) %>%
  rownames_to_column(var = "gene") %>%
  mutate_if(is.factor, as.character)

# add to discovery perturbation results
disc_perts <- left_join(disc_perts, tss_coords, by = "gene")

# calculate distance to tss for every enhancer - gene pair
disc_perts <- disc_perts %>%
  mutate(dist_to_tss = if_else(gene_strand == "+",
                               true  = enh_center - gene_tss,
                               false = gene_tss - enh_center))

# only retain enhancer - gene pairs on the same chromosomes
disc_perts_cis <- filter(disc_perts, enh_chr == gene_chr)
556
557
558
559
560
561
562
563
564
565
566
567
568
# calculate -log10 p-value
disc_perts_cis <- mutate(disc_perts_cis, neg_log10_pvalue = -log10(pvalue))

# plot -log10 p-value as a function of distance to TSS
ggplot(disc_perts_cis,
             aes(x = dist_to_tss / 1e6, y = neg_log10_pvalue, color = prop_grna_hits)) +
  facet_grid(covars~sample) +
  geom_point(data = filter(disc_perts_cis, significant == "not_sig")) +
  geom_point(data = filter(disc_perts_cis, significant == "sig")) +
  labs(x = "Distance to TSS (Mb)", y = expression(-log[10] ~ p - value),
       title = "-Log10 p-value vs distance to TSS") +
  scale_color_gradient(low = "blue", high = "orangered") +
  theme_bw()
572
573
574
575
576
577
578
579
580
581
582
583
# only select significant enhancer - gene pairs
enh_gene_pairs <- filter(disc_perts_cis, significant == "sig")

# plot the distance to TSS vs number of gRNA hits
ggplot(enh_gene_pairs, aes(x = as.factor(grna_hits), y = abs(dist_to_tss) / 1000, color = sample)) +
  facet_wrap(~covars, ncol = 2) +
  geom_point(position = position_jitterdodge()) +
  geom_boxplot(outlier.shape = NA, fill = NA) +
  labs(x = "Confidence level (gRNA hits)", y = "Distance to TSS (kb)",
       title = "Distance to TSS vs. confidence level") +
  scale_y_log10() +
  theme_bw()
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

# required functions
source(file.path(snakemake@scriptdir, "SingleCellExperiment_tapseq.R"))
source(file.path(snakemake@scriptdir, "differential_expression_fun.R"))

# register parallel backend if specified (if more than 1 thread provided)
parallel <- FALSE
if (snakemake@threads > 1) {
  library(BiocParallel)
  register(MulticoreParam(workers = snakemake@threads))
  parallel <- TRUE
}

# set seed for reproducible results
if (snakemake@params$seed > 0) set.seed(snakemake@params$seed)

# prepare input data ===============================================================================

# load DGE data
dge <- read.table(snakemake@input$dge, header = TRUE, row.names = "GENE")

# load perturbation status data
perturb_status <- read.table(snakemake@input$perturb_status, header = TRUE, row.names = 1)

# remove negative controls, as these are not supposed to be tested
neg_ctrls <- grep(rownames(perturb_status), pattern = "^non-targeting.*$", perl = TRUE)
perturb_status <- perturb_status[-neg_ctrls, ]

# count cells per perturbation and save to file for later analyses
cells_per_pert <- rowSums(perturb_status > 0)
cells_per_pert <- data.frame(perturbation = rownames(perturb_status), cells = cells_per_pert)
write.csv(cells_per_pert, file = snakemake@output$ncells, row.names = FALSE)

# create tap-seq SingleCellExperiment object
sce <- create_tapseq_sce(dge = dge, perturb_status = perturb_status,
                         vector_pattern = snakemake@params$vector_pattern)

# remove cells from specified 10x lanes
if (!is.null(snakemake@params$exclude_lanes)) {

  # get cells that do not belong to filtered out lanes
  cells <- colnames(sce)
  cell_lanes <- substr(cells, start = 1, stop = 8)
  cells_filt <- cells[!cell_lanes %in% snakemake@params$exclude_lanes]

  # only retain data on filtered cells
  sce <- sce[, cells_filt]

}

# add covariates ===================================================================================

# covariates specified via output file name
covars <- snakemake@wildcards$covars

# compute covariates and add as colData to the SingleCellExperiment object
if (covars == "nGenesCovar") {

  # number of expressed genes per cell
  nGenes <- colSums(assay(sce) > 0)
  colData(sce) <- DataFrame(nGenes)

}else if (grepl(covars, pattern = "^\\d+pcCovar$", perl = TRUE)) {

  # perform PCA on gene expression data and extract PC scores of each cell
  pca_genex <- prcomp(t(assay(sce)), scale. = TRUE)
  pcs <- seq_len(as.integer(sub("pcCovar", "", covars)))
  pc_loadings <- DataFrame(pca_genex$x[, pcs])
  colData(sce) <- pc_loadings

}else if (covars != "noCovar") {
  stop("Incorrect covariate wildcard!", call. = FALSE)
}

# perform DE tests =================================================================================

# perform differential gene expression analysis
output <- test_differential_expression(sce, min_cells = snakemake@params$min_cells,
                                       method = snakemake@wildcards$method, parallel = parallel)

# save output
write.csv(output, file = snakemake@output$results, row.names = FALSE)
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
library(optparse)

# create arguments list
option_list = list(
  make_option(c("-b", "--bam_file"), type = "character", default = NULL,
              help = "Path to bam file containing aligned reads",
              metavar = "character"),
  make_option(c("-d", "--dge_summary_file"), type = "character", default = NULL,
              help = "Path to txt file containing dge summary", metavar = "character"),
  make_option(c("-p", "--vector_pattern"), type = "character", default = NULL,
              help = "Pattern for reads mapping to CROP-seq vectors", metavar = "character"),
  make_option(c("-o", "--output_file"), type = "character", default = NULL,
              help = "Output file", metavar = "character")

)

# parse arguments
opt_parser = OptionParser(option_list = option_list)
opt = parse_args(opt_parser)

# function to check for required arguments
check_required_args <- function(arg, opt, opt_parser) {
  if (is.null(opt[[arg]])) {
    print_help(opt_parser)
    stop(arg, " argument is required!", call. = FALSE)
  }
}

# check that all required parameters are provided
required_args <- c("bam_file", "dge_summary_file", "vector_pattern", "output_file")
for (i in required_args) {
  check_required_args(i, opt = opt, opt_parser = opt_parser)
}

# define functions =================================================================================

# function to extract reads mapping to CROP-seq vectors and extract number of mismatches etc.
extract_vectors <- function(bam, cell_barcodes, output_file, vector_pattern) {

  # create cell barcode patterns and write to temporary file
  cbc_patterns <- paste0("XC:Z:", cell_barcodes)
  cbc_file <- tempfile(pattern = "cbc_patterns_", fileext = ".txt")
  write(cbc_patterns, file = cbc_file)

  # write heder to file
  header = c("cell_barcode", "umi_barcode", "aligned_vector_id", "mapping_quality", "mismatches",
             "mismatch_positions", "sequence")
  gz_outfile <- gzfile(output_file, open = "w")
  write(paste(header, collapse = "\t"), file = gz_outfile)
  close(gz_outfile)

  # bash command to filter bam file for CROP-seq vector reads from specified cells and to extract
  # mismatches, barcodes, and read sequence
  command <- paste("samtools view", bam, "| fgrep -f", cbc_file,
    "| fgrep", paste0("GE:Z:", vector_pattern),
    "| awk 'BEGIN { FS = \" \" }; {print $12\"\\t\"$20\"\\t\"$14\"\\t\"$5\"\\t\"$19\"\\t\"$13\"\\t\"$10}'",
    "| gzip >> ", output_file)

  # execute command
  system(command)

}

# extract CROP-seq vectors from provided bam file ==================================================

# read dge summary file and extract cell barcodes
dge_summary <- read.table(opt$dge_summary_file, header = TRUE, stringsAsFactors = FALSE)
cells <- dge_summary$cell_barcode

# extract CROP-seq vectors
extract_vectors(opt$bam_file, cell_barcodes = cells, output_file = opt$output_file,
                vector_pattern = opt$vector_pattern)
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
umi_obs <- read.table(snakemake@input$umi_obs, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
target_genes <- read.csv(snakemake@input$target_genes, stringsAsFactors = FALSE)

# get genes that are expressed in all cell types
genes_all_cells <- target_genes[target_genes$marker == "all", "gene"]

# get umi observations for these genes
umi_obs_out <- umi_obs[umi_obs$Gene %in% genes_all_cells, ]

# save to output file
colnames(umi_obs_out)[1] <- "Cell Barcode"
write.table(umi_obs_out, file = snakemake@output[[1]], sep = "\t", row.names = FALSE, quote = FALSE)
13
14
15
16
/* Define a margin before every image element */
img {
  margin-top: 3em;
}
20
21
22
23
24
25
26
27
28
29
30
31
32
33
# set global chunk options
knitr::opts_chunk$set(echo = FALSE)

# disable dplyr progress
options(dplyr.show_progress = FALSE)

# attach required packages
library(ggpubr)
library(here)
library(HiTC)
library(Matrix)
library(tidyverse)
library(rtracklayer)
library(tools)
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
# load processed de output (includes confidence level and distance to TSS)
results <- here(snakemake@input$processed_results) %>%
  read.csv(stringsAsFactors = FALSE) %>%
  mutate(sample = paste0("chr", sub("iScreen1", "", sample)))

# extract cis-interactions on chromosome 8 and 11 and add column identifying significant hits
cis_enh_perts <- results %>%
  filter(enh_type == "cis", abs(dist_to_tss) >= 1000) %>%
  mutate(enh_chr = paste0("chr", enh_chr), gene_chr = paste0("chr", gene_chr),
         significant = if_else(pval_adj_allTests < 0.05, true = "sig", false = "non_sig"))

# remove validation controls on other chromosomes than the samples' target region and significant
# hits that increase gene expression
cis_enh_perts <- cis_enh_perts %>%
  filter(sample == enh_chr) %>%
  filter(!(significant == "sig" & manual_lfc > 0))

# download chain file for hg38 to hg19 liftover
chain_url <- "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz"
chain_file <- tempfile("hg38ToHg19", fileext = ".gz")
download.file(chain_url, chain_file)
system(paste("gunzip", chain_file))

# import chain file
hg38_to_hg19_chain <- import.chain(file_path_sans_ext(chain_file))
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
# extract hg38 enhancer coordinates
enh_coords_hg38 <- cis_enh_perts %>%
  select(perturbation, enh_chr, enh_start, enh_end) %>%
  distinct() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE, seqnames.field = "enh_chr",
                           start.field = "enh_start", end.field = "enh_end")

# liftover enhancers from hg38 to hg19 and convert to data.frame
enh_coords_hg19 <- enh_coords_hg38 %>%
  liftOver(chain = hg38_to_hg19_chain) %>%
  unlist() %>%
  as.data.frame() %>%
  select(seqnames, start, end, perturbation) %>%
  rename(enh_chr = seqnames, enh_start = start, enh_end = end)

# replace tested enhancer - gene pairs hg38 enhancer coordinates with hg19 coordinates
cis_enh_perts <- cis_enh_perts %>%
  select(-c(enh_chr, enh_start, enh_end)) %>%
  left_join(enh_coords_hg19, by = "perturbation")
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
# extract gene TSS coordinates
gene_tss_hg38 <- cis_enh_perts %>%
  select(gene, gene_chr, gene_tss, gene_strand) %>%
  distinct() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE, seqnames.field = "gene_chr",
                           start.field = "gene_tss", end.field = "gene_tss",
                           strand.field = "gene_strand")

# liftover tss coordinates to hg19
gene_tss_hg19 <- gene_tss_hg38 %>%
  liftOver(chain = hg38_to_hg19_chain) %>%
  unlist() %>%
  as.data.frame() %>%
  select(seqnames, start, strand, gene) %>%
  rename(gene_chr = seqnames, gene_tss = start, gene_strand = strand)

# replace gene tss hg38 coordinates with hg19 coordinates
cis_enh_perts <- cis_enh_perts %>%
  select(-c(gene_chr, gene_tss, gene_strand)) %>%
  left_join(gene_tss_hg19, by = "gene")

# recalculate distance to tss for hg19 coordinates
cis_enh_perts <- cis_enh_perts %>%
  mutate(enh_center = round((enh_start + enh_end) / 2)) %>%
  mutate(dist_to_tss = if_else(gene_strand == "+", true  = enh_center - gene_tss,
                                false = gene_tss - enh_center)) %>%
  select(-enh_center)
117
118
119
120
121
122
123
124
# add label for confidence level based on prop gRNA hits
cis_enh_perts <- cis_enh_perts %>%
  filter(!is.na(prop_grna_hits)) %>%
  mutate(conf = case_when(
    significant == "non_sig" ~ "Non significant",
    significant == "sig" & prop_grna_hits >= 0.5 ~ "High confidence",
    significant == "sig" & prop_grna_hits <  0.5 ~ "Low confidence")
  )
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
# function to import HiC data from Rao et al for one chromosome and create a HTCexp object
import_hic <- function(sparse_cm_file, chromosome, resolution, bins) {

  # load sparse contact matrix file (only observed contacts)
  obs_contacts <- read.table(sparse_cm_file, col.names = c("i", "j", "M_ij"), sep = "\t")

  # get starting coordinates of assessed genomic bins at 5kb resolution
  max_bin <- (bins - 1) * resolution
  bin_starts <- seq(from = 0, to = max_bin, by = resolution)

  # create GRanges object containing all assessed bins for that chromosome
  bin_coords <- GRanges(seqnames = chromosome,
                        ranges = IRanges(start = bin_starts, end = bin_starts + resolution - 1,
                                         names = paste0("bin_", seq_len(length(bin_starts))))
                        )

  # convert starting coordinates of bins in sparse matrix input to bin ids by dividing by the
  # resolution (and add 1 to get correct index)
  obs_contacts_bins <- data.frame(i = round(obs_contacts$i / resolution + 1),
                                  j = round(obs_contacts$j / resolution + 1), 
                                  M_ij = obs_contacts$M_ij)

  # create sparse contact matrix from observed contacts
  sparse_cm <- sparseMatrix(i = obs_contacts_bins$i, j = obs_contacts_bins$j,
                            x = obs_contacts_bins$M_ij, symmetric = TRUE, dims = c(bins, bins))

  # create HTCexp object containing data for the given chromosome
  HTCexp(intdata = sparse_cm, xgi = bin_coords, ygi = bin_coords)

}
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
# k562 intrachromosomal sparse matrix and krnorm vector files for chromosomes 8 and 11
scm_files <- here(snakemake@input$hic_raw)
names(scm_files) <- sub("_5kb.RAWobserved", "", basename(scm_files))

krnorm_files <- here(snakemake@input$hic_norm)
names(krnorm_files) <- sub("_5kb.KRnorm", "", basename(krnorm_files))

# import normalization vectors
chr8_KRnorm  <- as.numeric(readLines(krnorm_files["chr8"]))
chr11_KRnorm <- as.numeric(readLines(krnorm_files["chr11"]))

# infer number of bins per chromosome based on the normalization vectors
chr8_bins  <- length(chr8_KRnorm)
chr11_bins <- length(chr11_KRnorm)

# import hi-c data for these chromosomes
chr8_hic  <- import_hic(scm_files["chr8"],  chromosome = "chr8", resolution = 5000,
                        bins = chr8_bins)
chr11_hic <- import_hic(scm_files["chr11"], chromosome = "chr11", resolution = 5000,
                        bins = chr11_bins)
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
# function to normalize Hi-C data based on provided normalization vectors
normalize_hic <- function(htc_obj, norm_vector) {

  # extract raw observed interaction matrix
  raw_obs <- intdata(htc_obj)

  # create normalization matrix by computing the outer product of the normalization vector
  norm_mat <- outer(norm_vector, norm_vector)

  # multiply observed interactions by normalization matrix and add back to HTC object
  intdata(htc_obj) <- raw_obs / norm_mat
  return(htc_obj)

}

# normalize HiC data
chr8_hic_norm  <- normalize_hic(chr8_hic,  norm_vector = chr8_KRnorm)
chr11_hic_norm <- normalize_hic(chr11_hic, norm_vector = chr11_KRnorm)
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
# infer chromosomal region range
region_coords <- cis_enh_perts %>%
  select(sample, enh_start, enh_end, gene_tss) %>%
  gather(key = "key", value = "coord", -sample) %>%
  group_by(sample) %>%
  summarize(start = min(coord), end = max(coord))

# calculate bin coordinates that contain the entire regions
region_bins <- region_coords %>%
  mutate(start = floor(start / 5000) * 5000,
         end = ceiling(end / 5000) * 5000)

# extract data for assessed regions
chr8_region_hic <- extractRegion(chr8_hic_norm, MARGIN = c(1, 2), chr = "chr8",
                                 from = pull(filter(region_bins, sample == "chr8"), start),
                                 to = pull(filter(region_bins, sample == "chr8"), end))

chr11_region_hic <- extractRegion(chr11_hic_norm, MARGIN = c(1, 2), chr = "chr11",
                                  from = pull(filter(region_bins, sample == "chr11"), start),
                                  to = pull(filter(region_bins, sample == "chr11"), end))
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
# function to extract interaction frequency versus distance between genomic bins
intFreq_over_distance <- function(htc_obj, type = "background") {

  # extract interaction matrix and compute distances between bins (in bp)
  ints <- intdata(htc_obj)
  dists <- intervalsDist(htc_obj)

  # get unique interactions (upper triangle of matrix, including diagonal), because matrix is
  # symmetric
  ints_unique <- suppressMessages(ints[upper.tri(ints, diag = TRUE)])
  dists_unique <- suppressMessages(dists[upper.tri(dists, diag = TRUE)])

  # create data.frame containing interaction frequency across all in distances
  int_over_dist <- data.frame(dist = dists_unique, int_freq = ints_unique, type = type,
                              stringsAsFactors = FALSE)

  # sort according to distance
  int_over_dist[order(int_over_dist$dist), ]

}

# compute expected background interaction frequency by calculating the distance between all observed
# genomic bins
chr8_int_over_dist  <- intFreq_over_distance(chr8_region_hic)
chr11_int_over_dist <- intFreq_over_distance(chr11_region_hic)

# combine into one data.frame and remove any NaN values, which were introduced because the
# normalization failed for certain bins
expect_int <- bind_rows(chr8 = chr8_int_over_dist, chr11 = chr11_int_over_dist, .id = "sample") %>%
  filter(!is.nan(int_freq))
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
# function to compute the interaction frequency for enhancer - gene pairs, by finding the hi-c 
# genomic bins with overlap with the enhancer and the target gene tss. the interaction frequency of
# the pair is then defined as the interaction frequency of these bins
compute_int_freq <- function(pairs, htc_object) {

  # add pair identifier
  pairs$pair <- seq_len(nrow(pairs))

  # get coordinates of enhancer centers as GRanges object
  enh_coords <- pairs %>%
    mutate(enh_center = round((enh_start + enh_end) / 2)) %>%
    select(enh_chr, enh_center) %>%
    makeGRangesFromDataFrame(., seqnames.field = "enh_chr", start.field = "enh_center",
                             end.field = "enh_center")

  # get gene tss coordinates as GRanges object
  tss_coords <- pairs %>%
    select(gene_chr, gene_tss) %>%
    makeGRangesFromDataFrame(., seqnames.field = "gene_chr", start.field = "gene_tss",
                             end.field = "gene_tss")

  # get bins overlapping for all enhancers and gene tss
  hic_bins <- x_intervals(htc_object)
  enh_bins <- findOverlaps(query = enh_coords, subject = hic_bins)
  tss_bins <- findOverlaps(query = tss_coords, subject = hic_bins)

  # combine into one data.frame
  enh_bins <- data.frame(pair = from(enh_bins), enh_bin = to(enh_bins))
  tss_bins <- data.frame(pair = from(tss_bins), tss_bin = to(tss_bins))
  bins <- full_join(enh_bins, tss_bins, by = "pair")

  # extract distance matrix between bins from htc object
  dists <- intervalsDist(htc_object)

  # calculate distances between bins of enhancer gene pairs
  dist_pairs <- dists[as.matrix(bins[, 2:3])]
  dist_pairs <- data.frame(pair = bins$pair, dist_bins = dist_pairs)

  # extract hi-c interaction matrix from htc object
  intdata <- intdata(htc_object)

  # get interaction frequencies for all bins and add pair id
  int_freq_pairs <- intdata[as.matrix(bins[, 2:3])]
  int_freq_pairs <- data.frame(pair = bins$pair, int_freq = int_freq_pairs)

  # add interaction frequencies and bin distances to pairs to create output
  pairs %>%
    left_join(dist_pairs, by = "pair") %>%
    left_join(int_freq_pairs, by = "pair") %>%
    select(-pair)

}
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
# compute interaction frequencies for all tested enhancer - gene pairs
chr8_pairs <- cis_enh_perts %>%
  filter(enh_chr == "chr8", enh_chr == "chr8") %>%
  compute_int_freq(., htc_object = chr8_region_hic)

chr11_pairs <- cis_enh_perts %>%
  filter(enh_chr == "chr11", enh_chr == "chr11") %>%
  compute_int_freq(., htc_object = chr11_region_hic)

# combine into one data.frame and extract relevant columns
int_freq_pairs <- bind_rows(chr8_pairs, chr11_pairs) %>%
  select(sample, dist_bins, int_freq, significant, conf) %>%
  rename(dist = dist_bins, type = significant)

# add to expected background frequencies
int_freqs <- expect_int %>%
  mutate(conf = as.character(NA)) %>%
  bind_rows(int_freq_pairs)
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
# add an amount equal to bin size to distance for plotting on log10 scale and reformat labels
int_freqs_plot <- int_freqs %>%
  mutate(dist = dist + 5000) %>%
  mutate(sample = sub("c", "C", sample)) %>%
  mutate(conf = sub(" confidence", "", conf),
         conf = sub("Non significant", "NS", conf))

# plot interaction frequencies between
p1 <- ggplot(drop_na(int_freqs_plot), aes(x = conf, y = int_freq + 1, color = conf)) +
  facet_wrap(~sample, ncol = 1) +
  geom_jitter(width = 0.2, size = 2.5) +
  geom_boxplot(color = "black", outlier.shape = NA, fill = NA) +
  labs(y = "Hi-C Interaction frequency", color = "Confidence:") +
  scale_color_manual(values = c("High" = "goldenrod2",
                                "Low" = "darkslategray3", "NS" = "gray")) +
  scale_y_log10(limits = c(NA, 1200), breaks = c(1, 10, 100, 1000)) +
  theme_bw() +
  theme(text = element_text(size = 22), panel.grid = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 45, hjust = 1))

# plot interaction frequency as a function of distance
x_breaks <- c(0, 10^(4:10))  # breaks to draw on x axis (without added bin distance!)
p2 <- ggplot(int_freqs_plot, aes(x = dist, y = int_freq + 1)) +
  facet_wrap(~sample, ncol = 1) + 
  geom_point(data = filter(int_freqs_plot, type == "background"), pch = 4, size = 2,
             color = "black", alpha = 1) +
  geom_point(data = filter(int_freqs_plot, type == "non_sig"), aes(color = conf), alpha = 1,
             size = 2.5) +
  geom_point(data = filter(int_freqs_plot, type == "sig"), aes(color = conf), size = 2.5) +
  geom_smooth(data = filter(int_freqs_plot, type != "sig"), aes(linetype = type),
              color = "indianred3", se = FALSE, lwd = 2) +
  labs(x = "Distance (kb)", y = "Hi-C Interaction frequency", color = "Confidence",
       linetype = "Type") +
  scale_color_manual(values = c("High" = "goldenrod2", "Low" = "darkslategray3", "NS" = "gray")) +
  scale_x_log10(breaks = x_breaks + 5000, labels = scales::comma(x_breaks / 1000)) +
  scale_y_log10(limits = c(NA, 1200), breaks = c(1, 10, 100, 1000)) +
  theme_bw() +
  theme(text = element_text(size = 22), panel.grid = element_blank())

# arrange plots with common legend (props ggpubr!)
ggarrange(p1, p2 + theme(axis.title.y = element_blank()), ncol = 2, nrow = 1, common.legend = TRUE,
          legend = "top", widths = c(1, 2))
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
# plot interaction frequencies across confidence levels and significant vs non-significant
p1 <- ggplot(drop_na(int_freqs_plot), aes(x = conf, y = int_freq + 1, color = conf)) +
  geom_jitter(width = 0.2, size = 2.5) +
  geom_boxplot(color = "black", outlier.shape = NA, fill = NA) +
  labs(y = "Hi-C Interaction frequency", color = "Confidence:") +
  scale_color_manual(values = c("High" = "goldenrod2",
                                "Low" = "darkslategray3", "NS" = "gray")) +
  scale_y_log10(limits = c(NA, 1200), breaks = c(1, 10, 100, 1000)) +
  theme_bw() +
  theme(text = element_text(size = 23.5), panel.grid = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 45, hjust = 1))

# plot interaction frequency as a function of distance
x_breaks <- c(0, 10^(4:10))  # breaks to draw on x axis (without added bin distance!)
p2 <- ggplot(int_freqs_plot, aes(x = dist, y = int_freq + 1)) +
  geom_point(data = filter(int_freqs_plot, type == "background"), pch = 4, size = 2,
             color = "black", alpha = 1) +
  geom_point(data = filter(int_freqs_plot, type == "non_sig"), aes(color = conf), alpha = 1,
             size = 2.5) +
  geom_point(data = filter(int_freqs_plot, type == "sig"), aes(color = conf), size = 2.5) +
  geom_smooth(data = filter(int_freqs_plot, type != "sig"), aes(linetype = type),
              color = "indianred3", se = FALSE, lwd = 2) +
  labs(x = "Distance (kb)", y = "Hi-C Interaction frequency", color = "Confidence",
       linetype = "Type") +
  scale_color_manual(values = c("High" = "goldenrod2", "Low" = "darkslategray3", "NS" = "gray")) +
  scale_x_log10(breaks = x_breaks + 5000, labels = scales::comma(x_breaks / 1000)) +
  scale_y_log10(limits = c(NA, 1200), breaks = c(1, 10, 100, 1000)) +
  theme_bw() +
  theme(text = element_text(size = 22), panel.grid = element_blank())

# arrange plots with common legend (props ggpubr!)
ggarrange(p1, p2 + theme(axis.title.y = element_blank()), ncol = 2, nrow = 1, common.legend = TRUE,
          legend = "top", widths = c(1, 2))

# save plot for manuscript
ggsave(here("results/plots", "hic_interaction_frequencies_combined.png"), device = "png", width = 8,
       height = 4)
ggsave(here("results/plots", "hic_interaction_frequencies_combined.pdf"), device = "pdf", width = 8,
       height = 4)
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
# separate pair interaction frequencies into significant and non-significant interactions
int_freq_sig <- filter(int_freqs, type == "sig")
int_freq_nonSig <- filter(int_freqs, type == "non_sig")
int_freq_background <- filter(int_freqs, type == "background")

# function to randomly select non-significant controls with same distance
sample_ctrls <- function(pair, nonSig_pairs, n = 2) {

  # get non-significant controls from the same sample and with the same bin distance
  ctrl_pairs <- filter(nonSig_pairs, sample == pair$sample, dist == pair$dist)

  # randomly select n controls
  ctrls <- sample_n(ctrl_pairs, size = n, replace = FALSE)

  # add confidence level of pair for which the control was drawn
  ctrls$conf <- pair$conf

  return(ctrls)

}

# randomly sample non-significant controls
set.seed(20190626)
sampled_ctrls <- int_freq_sig %>%
  rowwise() %>%
  do(sample_ctrls(pair = ., nonSig_pairs = int_freq_nonSig, n = 2))

# randomly sample background controls
sampled_bckgrnd <- int_freq_sig %>%
  rowwise() %>%
  do(sample_ctrls(pair = ., nonSig_pairs = int_freq_background, n = 2))

# combine significant pairs and sampled controls
all_int_pairs <- rbind(int_freq_sig, sampled_ctrls) %>%
  mutate(conf_plot = case_when(
    type == "sig" ~ conf,
    type == "non_sig" ~ "non_sig",
    type == "background" ~ "background"
  ))

# perform pairwise tests
pw_tests <- compare_means(int_freq ~ type, group.by = "conf", data = all_int_pairs,
                          method = "wilcox.test", p.adjust.method = "holm")

# plot interaction frequencies
ggplot(all_int_pairs, aes(x = fct_relevel(type, "sig", "non_sig"), y = int_freq + 1, fill = conf_plot)) +
  facet_wrap(~conf, ncol = 1) +
  geom_boxplot() +
  suppressWarnings(stat_pvalue_manual(data = pw_tests, y.position = log10(450), inherit.aes = FALSE,
                                      label = "p = {p.format}", size = 7, lwd = 0.75)) +
  labs(y = "Hi-C interaction frequency") +
  scale_fill_manual(values = c("High confidence" = "goldenrod2", "non_sig" = "gray",
                               "Low confidence" = "darkslategray3", "background" = "black")) +
  scale_y_log10(limits = c(NA, 1000)) +
  scale_x_discrete(labels = c("Sig.", "Non sig.")) +
  theme_bw() +
  theme(text = element_text(size = 25), legend.position = "none", axis.title.x = element_blank())

# save plot for manuscript
ggsave(here("results/plots", "hic_sig_vs_nonsig.pdf"), device = "pdf", width = 4, height = 7)
16
17
18
19
/* Define a margin before every image element */
img {
  margin-top: 3em;
}
23
24
25
26
27
28
29
30
31
32
33
# set global chunk options
knitr::opts_chunk$set(echo = FALSE)

# attach required packages
library(ggrepel)
library(gridExtra)
library(here)
library(readxl)
library(rtracklayer)
library(tidyverse)
library(tools)
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
# files containing number of cell pert perturbation
ncell_files <- here(snakemake@input$ncells)

# set name for each file
samples <- basename(dirname(dirname(ncell_files)))
names(ncell_files) <- basename(ncell_files) %>%
  sub("ncells_MAST_", "", .) %>%
  sub("_.*", "", .) %>%
  paste(samples, ., sep = "_")

# files containing differential expression results
de_output_files <- here(snakemake@input$de_output)

# set name for each file
samples <- basename(dirname(dirname(de_output_files)))
names(de_output_files) <- basename(de_output_files) %>%
  sub("output_MAST_", "", .) %>%
  sub("_.*", "", .) %>%
  paste(samples, ., sep = "_")

# read files
ncells <- lapply(ncell_files, FUN = read.csv, stringsAsFactors = FALSE)
de_output <- lapply(de_output_files, FUN = read.csv, stringsAsFactors = FALSE)

# convert to one data.frame
ncells <- ncells %>%
  bind_rows(.id = "id") %>%
  separate(id, into = c("sample", "strategy"), sep = "_") %>%
  mutate(sample = paste0("Chr", sub("iScreen1", "", sample)))

de_output <- de_output %>%
  bind_rows(.id = "id") %>%
  separate(id, into = c("sample", "strategy"), sep = "_") %>%
  mutate(sample = paste0("Chr", sub("iScreen1", "", sample)))

# load processed de output (includes confidence level and distance to TSS)
results <- here(snakemake@input$processed_results) %>%
  read.csv(stringsAsFactors = FALSE) %>% 
  mutate(sample = paste0("Chr", sub("iScreen1", "", sample)))

# load perturbation status
perturb_status_files <- here(snakemake@input$perturb_status)
names(perturb_status_files) <- basename(dirname(perturb_status_files))
perturb_status <- lapply(perturb_status_files, FUN = read.table, header = TRUE,
                         row.names = "TARGET", stringsAsFactors = FALSE)

# load vector targets
vector_target_files <- here(snakemake@input$vector_targets)
names(vector_target_files) <- sub(".+_(chr\\d+)_.+", "\\1", basename(vector_target_files))
vector_targets <- vector_target_files %>%
  lapply(FUN = read.csv, stringsAsFactors = FALSE) %>%
  bind_rows(.id = "sample")
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
# minimum number of cells per perturbation used for DE tests
min_cells <- snakemake@config$map_enhancers$min_cells %>%
  unlist() %>%
  data.frame(strategy = names(.), min_cells = .)

# calculate percentage of perturbation with at least min_cells cells
cells_perc <- ncells %>%
  group_by(sample, strategy) %>%
  summarize(cells = sum(cells >= min_cells[strategy, "min_cells"]),
            cells_perc = cells / n() * 100)

# plot number of cells pert perturbation
ggplot(ncells, aes(cells)) +
  facet_grid(strategy~sample, scales = "free") +
  geom_histogram(bins = 45) +
  geom_vline(data = min_cells, aes(xintercept = min_cells), color = "firebrick") +
  labs(title = "Cells per perturbation") +
  theme_bw()
120
121
122
123
124
125
126
127
128
129
130
131
# correct for multiple testing across samples using FDR
de_output <- de_output %>%
  group_by(strategy) %>%
  rename(pval_adj_sample = pval_adj_allTests) %>%
  mutate(pval_adj_allTests = p.adjust(pvalue, method = "fdr"))

# plot p-value distribution
ggplot(de_output, aes(pvalue)) +
  facet_grid(strategy~sample, scales = "free") +
  geom_histogram(bins = 100) +
  labs(title = "P-value distribution") +
  theme_bw()
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
# count the number and proportion of hits across different FDR values
hits <- de_output %>%
  group_by(sample, strategy) %>%
  summarize("0.05" = sum(pval_adj_allTests < 0.05),
            "0.1" = sum(pval_adj_allTests < 0.1),
            "0.15" = sum(pval_adj_allTests < 0.15),
            "0.2" = sum(pval_adj_allTests < 0.2),
            tests = n()) %>%
  gather(key = "fdr", value = "hits", -sample, -strategy, -tests) %>%
  mutate(fdr = as.numeric(fdr)) %>%
  mutate(prop_hits = hits / tests)

# plot number of hits as function of FDR
p1 <- ggplot(hits, aes(x = fdr, y = hits, color = sample, lty = strategy)) +
  geom_point() +
  geom_line() +
  labs(x = "FDR cutoff", y = "Significant tests", title = "Number of significant tests") +
  scale_color_manual(values = c("Chr11" = "indianred3", "Chr8" = "steelblue")) +
  theme_bw()

# plot proportion of hits as function of FDR
p2 <- ggplot(hits, aes(x = fdr, y = prop_hits, color = sample, lty = strategy)) +
  geom_point() +
  geom_line() +
  labs(x = "FDR cutoff", y = "Significant tests", title = "Proportion of significant tests") +
  scale_y_continuous(labels = scales::percent) +
  scale_color_manual(values = c("Chr11" = "indianred3", "Chr8" = "steelblue")) +
  theme_bw()

# print plots
grid.arrange(p1, p2, ncol = 1)
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
# add variable for significant associations
results <- results %>%
  mutate(significant = if_else(pval_adj_allTests < 0.05, true = "sig", false = "non_sig"))

# calculate number of hits per confidence level
hits_conf_levels <- results %>%
  group_by(sample, grna_hits) %>%
  summarize(hits = sum(significant == "sig"))

# plot number of hits per confidence level
ggplot(hits_conf_levels, aes(x = as.factor(grna_hits), y = hits, fill = sample)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "gRNA hits (pvalue adj. < 0.05)", y = "Hits", title = "Hits per confidence level",
       fill = "Sample") +
  scale_fill_manual(values = c("Chr11" = "indianred3", "Chr8" = "steelblue")) +
  theme_bw()
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
# known enhancer perturbations
known_enh <- c("GATA1", "HS2", "MYC", "ZFPM2")

# get all transfected enhancer perturbations
enh_pert_targets <- vector_targets %>%
  select(sample, TARGET) %>%
  distinct() %>%
  filter(grepl(TARGET, pattern = "chr\\d+:.*") | TARGET %in% known_enh) %>%
  mutate(sample = paste0(sub("chr", "", sample), "iScreen1"))

# only retain perturbation status on enhancer perturbations
enh_perts <- perturb_status %>%
  lapply(FUN = function(x) {
    enh_rows <- rownames(x) %in% enh_pert_targets$TARGET
    x[enh_rows, ]
  })
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
# count number of cells per perturbation
cells_per_pert <- enh_perts %>%
  lapply(FUN = rowSums) %>%
  lapply(FUN = function(x) data.frame(pert = names(x), cells = x, stringsAsFactors = FALSE) ) %>%
  bind_rows(.id = "sample")

# add transfected, but not detected perturbations (in case there are any...)
cells_per_pert <- cells_per_pert %>%
  full_join(enh_pert_targets, by = c("sample" = "sample", "pert" = "TARGET")) %>%
  mutate(cells = replace_na(cells, replace = 0)) %>%
  mutate(sample = paste0("Chr", sub("iScreen1", "", sample)))

# plot number of cell per perturbation
ggplot(cells_per_pert, aes(x = sample, y = cells, fill = sample, color = sample)) +
  geom_violin() +
  labs(y = "Cells") +
  scale_fill_manual(values = c("Chr11" = "indianred3", "Chr8" = "steelblue")) +
  scale_color_manual(values = c("Chr11" = "indianred3", "Chr8" = "steelblue")) +
  scale_y_continuous(breaks = seq(0, 600, 100)) +
  theme_bw() +
  theme(text = element_text(size = 28), legend.position = "none", axis.title.x = element_blank(),
        panel.grid = element_blank())

# save plot for manuscript
ggsave(here("results/plots", "cells_per_pert.pdf"), device = "pdf", width = 5, height = 5)
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
# count number of perturbations per cell
perts_per_cell <- perturb_status %>%
  lapply(FUN = colSums) %>%
  lapply(FUN = function(x) data.frame(cell = names(x), perts = x, stringsAsFactors = FALSE) ) %>%
  bind_rows(.id = "sample") %>%
  mutate(sample = paste0("Chr", sub("iScreen1", "", sample)))

# compute histogram
perts_per_cell_hist <- perts_per_cell %>%
  group_by(sample, perts) %>%
  summarize(cells = n()) %>%
  mutate(perts = factor(perts, levels = seq(0, max(.$perts))))

# plot perturbations per cell histogram
ggplot(perts_per_cell_hist, aes(x = perts, y = cells / 1000, fill = sample)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Perturbations per cell", y = "Cells (x 1000)",
       fill = "Region:") +
  scale_fill_manual(values = c("Chr11" = "indianred3", "Chr8" = "steelblue")) +
  scale_x_discrete(drop = FALSE) +
  theme_bw() +
  theme(text = element_text(size = 22), legend.text = element_text(size = 25),
        legend.position = "top", panel.grid = element_blank())

# save plot for manuscript
ggsave(here("results/plots", "perts_per_cell.pdf"), device = "pdf", width = 7,  height = 5)
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
# extract cis enhancer - gene pairs and remove significant pairs that increase gene expression
cis_enh_perts <- results %>%
  filter(enh_type == "cis", abs(dist_to_tss) >= 1000) %>%
  filter(!(significant == "sig" & manual_lfc > 0))

# only retain hits within the same target region (discard out of region control enhancers)
cis_enh_perts <- cis_enh_perts %>%
  filter(enh_chr == sub("Chr", "", sample))

# count the number of enhancers for each gene
enh_per_gene <- cis_enh_perts %>%
  group_by(sample, gene) %>%
  summarize(enh = sum(significant == "sig"))

# compute proportion of genes with at least one discovered enhancer
prop_with_enh <- enh_per_gene %>%
  summarize(enh = sum(enh > 0),
            genes = n(),
            prop_enh = enh / genes)

# print table
knitr::kable(prop_with_enh)

# compute number of genes per enhancer count
enh_counts <- enh_per_gene %>%
  filter(enh > 0) %>%
  group_by(sample, enh) %>%
  summarize(genes = n()) %>%
  spread(key = "sample", value = "genes", fill = 0) %>%
  gather(key = "sample", value = "genes", -enh)

# plot number of genes per enhancer count
ggplot(enh_counts, aes(x = factor(enh, levels = seq_len(max(enh))), y = genes, fill = sample)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Enhancers", y = "Genes",
       fill = "Region:") +
  scale_fill_manual(values = c("Chr11" = "indianred3", "Chr8" = "steelblue")) +
  scale_x_discrete(drop = FALSE) +
  theme_bw() +
  theme(text = element_text(size = 25), panel.grid = element_blank(),
        legend.position = c(0.8, 0.85))

# save plot for manuscript
ggsave(here("results/plots", "enhancers_per_gene.pdf"), device = "pdf", width = 5, height = 5)
342
343
344
345
346
347
348
349
350
351
352
# calculate -log10 p-value
cis_enh_perts <- mutate(cis_enh_perts, neg_log10_pvalue = -log10(pvalue))

# plot -log10 p-value as a function of distance to TSS
ggplot(cis_enh_perts, aes(x = dist_to_tss / 1e6, y = neg_log10_pvalue, color = significant)) +
  facet_wrap(~sample, ncol = 1, scales = "free") +
  geom_point(data = filter(cis_enh_perts, significant == "non_sig")) +
  geom_point(data = filter(cis_enh_perts, significant == "sig")) +
  labs(x = "Distance to TSS (Mb)", y = expression(-log[10] ~ p - value)) +
  scale_color_manual(values = c(non_sig = "darkgray", sig = "firebrick3")) +
  theme_bw()
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
# set grna hits to NA for non-significant hits
cis_enh_perts_plot <- cis_enh_perts %>%
  mutate(grna_hits = if_else(significant != "sig", true = as.integer(NA), false = grna_hits),
         prop_grna_hits = if_else(significant != "sig", true = as.numeric(NA),
                                  false = prop_grna_hits))

# same plot, but color according to confidence level
ggplot(cis_enh_perts_plot,
             aes(x = dist_to_tss / 1e6, y = neg_log10_pvalue, color = prop_grna_hits)) +
  facet_wrap(~sample, ncol = 1, scales = "free") +
  geom_point(data = filter(cis_enh_perts_plot, significant == "non_sig")) +
  geom_point(data = filter(cis_enh_perts_plot, significant == "sig")) +
  labs(x = "Distance to TSS (Mb)", y = expression(-log[10] ~ p - value)) +
  scale_color_gradient(low = "blue", high = "orangered", na.value = "gray69") +
  theme_bw()
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
# get hit closest to TSS
closest_hit <- cis_enh_perts_plot %>%
  filter(significant == "sig") %>%
  slice(which.min(abs(dist_to_tss)))

# get hits further away than the closest hit for plotting
cis_enh_perts_plot <- cis_enh_perts_plot %>%
  filter(abs(dist_to_tss) >= abs(closest_hit$dist_to_tss))

# plot decay of association strength over distance within 1Mb of TSS
p1 <- ggplot(cis_enh_perts_plot, aes(x = abs(dist_to_tss) / 1e6, y = neg_log10_pvalue,
                                     color = prop_grna_hits)) +
    labs(x = "Distance to TSS (Mb)", y = expression(-log[10] ~ p - value),
         color = "Confidence") +
    geom_point(data = filter(cis_enh_perts_plot, significant == "non_sig"), size = 3.5) +
    geom_point(data = filter(cis_enh_perts_plot, significant == "sig"), size = 3.5) +
    scale_color_gradient(low = "blue", high = "orangered", na.value = "gray69") +
    theme_bw() +
    theme(text = element_text(size = 25), panel.grid = element_blank())

# same plot, but with logarithmic distance
p2 <- ggplot(cis_enh_perts_plot, aes(x = abs(dist_to_tss) / 1e6, y = neg_log10_pvalue,
                                     color = prop_grna_hits)) +
    geom_smooth(data = filter(cis_enh_perts_plot, significant == "sig"),
                method = "loess", se = FALSE, color = "black", lwd = 2) +
    labs(x = "Distance to TSS (Mb)", y = expression(-log[10] ~ p - value)) +
    geom_point(data = filter(cis_enh_perts_plot, significant == "non_sig"), size = 2.5) +
    geom_point(data = filter(cis_enh_perts_plot, significant == "sig"), size = 2.5) +
    scale_x_log10(breaks = c(0.01, 0.1, 1, 10), labels = c("0.01", "0.1", "1", "10")) +
    scale_color_gradient(low = "blue", high = "orangered", na.value = "gray69") +
    theme_bw() +
    theme(text = element_text(size = 20), axis.title = element_blank(), legend.position = "none",
          panel.grid = element_blank(), plot.background = element_rect(fill = NA))

# add logarithmic plot as inset of linear plot
p1 + 
  annotation_custom(
    grob = ggplotGrob(p2),
    xmin = 18,
    xmax = 56,
    ymin = 40,
    ymax = 122
  )

# save plot for manuscript
ggsave(here("results/plots", "enh_pval_vs_dist.pdf"), device = "pdf", width = 8, height = 5)
ggsave(here("results/plots", "enh_pval_vs_dist.png"), device = "png", width = 8, height = 5)
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
# only select significant enhancer - gene pairs
enh_gene_pairs <- filter(cis_enh_perts, significant == "sig")

# plot the distance to TSS vs number of gRNA hits
ggplot(drop_na(enh_gene_pairs),
       aes(x = as.factor(grna_hits), y = abs(dist_to_tss) / 1000,color = sample)) +
  geom_hline(yintercept = 1000, lty = "dashed", color = "gray42") +
  geom_jitter(width = 0.2) +
  geom_boxplot(color = "black", width = 0.2, fill = NA, outlier.shape = NA) +
  labs(x = "Confidence level (gRNA hits)", y = "Distance to TSS (kb)",
       title = "Distance to TSS vs confidence level", color = "Sample") +
  scale_color_manual(values = c("Chr11" = "indianred3", "Chr8" = "steelblue")) +
  scale_y_log10() +
  theme_bw() +
  theme(text = element_text(size = 20), panel.grid = element_blank())

# save plot for manuscript
ggsave(here("results/plots", "confidence_vs_dist.pdf"), device = "pdf", width = 7, height = 5)
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
# add label for confidence level based on prop gRNA hits
cis_enh_perts_plot <- cis_enh_perts %>%
  filter(!is.na(prop_grna_hits)) %>%
  mutate(conf = if_else(significant == "sig" & prop_grna_hits >= 0.5,
                        true = "High confidence", false = "Non significant"),
         conf = if_else(significant == "sig" & prop_grna_hits < 0.5,
                        true = "Low confidence", false = conf))

# plot distance to TSS of significant enhacner gene - pairs vs non-significant pairs
ggplot(cis_enh_perts_plot,
       aes(x = fct_relevel(conf, "High confidence", "Low confidence"),
           y = abs(dist_to_tss) / 1000, fill = conf)) +
  geom_hline(yintercept = 1000, lty = "dashed") +
  geom_violin() +
  labs(y = "Distance to TSS (kb)") +
  scale_fill_manual(values = c("High confidence" = "goldenrod2", "Low confidence" = "darkslategray3", 
                               "Non significant" = "gray")) +
  scale_y_log10() +
  theme_bw() +
  theme(legend.position = "none", axis.title.x = element_blank(), text = element_text(size = 20),
        panel.grid = element_blank())
485
486
487
488
489
490
491
492
493
494
495
496
# import gencode hg38 annotations
annot_url <- "ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/Homo_sapiens.GRCh38.89.chr.gtf.gz"
annot <- import(annot_url, format = "gtf")

# extract exon annotations for protein-coding genes
genes <- annot[annot$type == "exon" & annot$gene_biotype == "protein_coding" &
                 annot$transcript_biotype == "protein_coding"]

# only retain annotations on assessed chromosomes (maybe a little faster to overlap)
enh_chrs <- unique(cis_enh_perts$enh_chr)
genes <- genes[seqnames(genes) %in% enh_chrs]
genes <- split(genes, f = genes$gene_name)
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
# calculate enhancer centers
enh_gene_pairs <- enh_gene_pairs %>%
  mutate(enh_center = round((enh_start + enh_end) / 2))

# create (sorted) interval between enhancer center and target gene tss
coords <- enh_gene_pairs %>%
  select(sample, perturbation, gene, gene_tss, enh_center) %>%
  gather(key = "type", value = "coord", -c(sample, perturbation, gene)) %>%
  group_by(sample, perturbation, gene) %>%
  summarize(start = sort(coord)[1],
            end = sort(coord)[2])

# merge with additional data
coords <- enh_gene_pairs %>%
  select(sample, perturbation, gene, chr = enh_chr, dist_to_tss, neg_log10_pvalue,
         prop_grna_hits, grna_hits) %>%
  right_join(coords, by = c("sample", "perturbation", "gene"))

# create GRanges object and overlap with protein-coding gene annotations
overlaps <- coords %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE) %>%
  findOverlaps(query = ., subject = genes)

# convert to list per hit
overlaps_list <- split(overlaps, f = from(overlaps))

# get unique overlapping gene names (genes that are skipped by the enhancer)
skipped_genes <- lapply(overlaps_list, FUN = function(x) {
  target_gene <- unique(coords[from(x), "gene"])
  all_genes <- names(genes[to(x)])
  all_genes[all_genes != target_gene]
})

# count number of skipped genes per association and add that number to data.frame
n_skipped <- sapply(skipped_genes, FUN = length)
coords$skipped_genes <- n_skipped

# compute histogram
skipped_hist <- coords %>%
  count(skipped_genes) %>%
  mutate(perc = n / sum(n))

# plot the number of skipped genes
total_value = sum(skipped_hist$n)
ggplot(skipped_hist, aes(x = skipped_genes, y = n)) +
  geom_bar(stat = "identity") +
  labs(x = "Genes skipped") +
  scale_y_continuous(sec.axis = sec_axis(~(./total_value), labels = scales::percent,
                                         name = "Percentage"), name = "Counts") +
  theme_bw() +
  theme(text = element_text(size = 15), panel.grid = element_blank())

# save plot for manuscript
ggsave(here("results/plots", "skipped_genes.pdf"), device = "pdf", width = 7, height = 5)

# plot association strength vs skipped genes
ggplot(coords, aes(x = skipped_genes, y = neg_log10_pvalue, color = prop_grna_hits)) +
  geom_point() +
  scale_color_gradient(low = "blue", high = "orangered") +
  labs(x = "Genes skipped", y = expression(-log[10] ~ p - value),
       title = "Association strength vs skipped genes", color = "Confidence\n(prop. gRNA hits)") +
  theme_bw() +
  theme(text = element_text(size = 15), panel.grid = element_blank())

# save plot for manuscript
ggsave(here("results/plots", "pvalue_vs_skipped_genes.pdf"), device = "pdf", width = 7, height = 5)
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
# create GRanges object with enhancer center coordinates for every detected association
enh_coords <- enh_gene_pairs %>%
  select(sample, perturbation, gene, enh_chr, enh_center, dist_to_tss, neg_log10_pvalue,
         grna_hits, prop_grna_hits) %>%
  makeGRangesFromDataFrame(seqnames.field = "enh_chr", start.field = "enh_center",
                           end.field = "enh_center", keep.extra.columns = TRUE)

# function to get a genes TSS
get_tss <- function(x) {
  if (all(strand(x) == "+")) {
    tss <- min(start(x))
  }else if (all(strand(x) == "-")) {
    tss <- max(end(x))
  }else{
    warning("Inconsistent strand information for gene:", unique(x$gene_name), call. = FALSE)
    return(GRanges())
  }
  GRanges(seqnames = unique(seqnames(x)), ranges = IRanges(tss, tss), strand = unique(strand(x)))
}

# get TSS for all genes on chr8 and chr11
gene_tss <- unlist(GRangesList(lapply(genes, FUN = get_tss)))

# find nearest gene TSS to each enhancer
nearest_idx <- nearest(x = enh_coords, subject = gene_tss, ignore.strand = TRUE)
nearest_genes <- names(gene_tss[nearest_idx])

# add to associations
nearest_per_hit <- data.frame(mcols(enh_coords), nearest_gene = nearest_genes,
                              stringsAsFactors = FALSE)

# add variable to specify if closest gene ot the enhancer is the inferred target
nearest_per_hit <- nearest_per_hit %>%
  mutate(is_nearest = if_else(gene == nearest_gene, true = 1, false = 0))

# number and percentage of enhancers where the target gene is the nearest gene
n_nearest <- sum(nearest_per_hit$is_nearest)
perc_is_nearest <- mean(nearest_per_hit$is_nearest) * 100

# labels for x-axis for plot(s)
x_labels <- c(paste0("Nearest TSS (n = ", n_nearest,")"),
              paste0("Other TSS (n = ", nrow(nearest_per_hit) - n_nearest, ")"))

# title for plot
plot_title = paste0("Nearest TSS (", round(perc_is_nearest, digits = 1), "% of cases)")

# plot the number of pairs, where the target TSS is the closest TSS vs. not 
ggplot(nearest_per_hit, aes(x = fct_relevel(factor(is_nearest), "1"), fill = sample)) +
  geom_bar(position = "dodge") +
  labs(title = plot_title, y = "ETPs", fill = "Region") +
  scale_fill_manual(values = c("Chr11" = "indianred3", "Chr8" = "steelblue")) +
  scale_x_discrete(labels = x_labels) +
  theme_bw() +
  theme(axis.title.x = element_blank(), text = element_text(size = 13),
        panel.grid = element_blank())
632
633
634
635
636
637
638
639
640
641
642
# compare p-value and confidence level between hits where the target gene is the nearest gene and
# hits where this is not the case
nearest_per_hit %>%
  select(is_nearest, neg_log10_pvalue, prop_grna_hits) %>%
  gather(key = "stat", value = "value", -is_nearest) %>%
  ggplot(., aes(x = factor(is_nearest), y = value)) +
  facet_wrap(~stat, ncol = 1, scales = "free") +
  geom_boxplot(fill = "steelblue3") +
  scale_x_discrete(labels = x_labels) +
  theme_bw() +
  theme(axis.title.x = element_blank(), panel.grid = element_blank())
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
library(here)
library(rtracklayer)
library(tidyverse)

# prepare data -------------------------------------------------------------------------------------

# files containing differential expression results
result_files <- here(snakemake@input$results)

# set name for each file
samples <- basename(dirname(dirname(result_files)))
names(result_files) <- basename(result_files) %>%
  sub("output_MAST_", "", .) %>%
  sub("_.*", "", .) %>%
  paste(samples, ., sep = "_")

# read files and convert into one data.frame
results <- result_files %>%
  lapply(FUN = read.csv, stringsAsFactors = FALSE) %>%
  bind_rows(.id = "id") %>%
  separate(id, into = c("sample", "strategy"), sep = "_")

# files containing manually calculated log fold change
lfc_files <- here(snakemake@input$lfc)
names(lfc_files) <- basename(dirname(dirname(lfc_files)))

# load lfc files and convert into one data.frame
lfc <- lfc_files %>%
  lapply(FUN = read.csv, stringsAsFactors = FALSE) %>%
  bind_rows(.id = "sample")

# replace infinite values (pert or control is 0) by NA
lfc <- mutate(lfc, lfc = replace(lfc, is.infinite(lfc), NA))

# load target gene annotations
annot <- lapply(here(snakemake@input$annot), FUN = import, format = "gtf")

# merge into one GRanges object
annot <- annot %>%
  do.call("c", .) %>%
  unique()

# filter for exons of protein-coding genes (no processed transcripts etc)
annot <- annot[annot$type == "exon" &
                 annot$gene_biotype == "protein_coding" &
                 annot$transcript_biotype == "protein_coding"
               ]

# remove unlikely exons for HBE1 (extremely long introns, do not agree with RefSeq nor other HB*)
annot <- annot[!annot$exon_id %in% c("ENSE00001484269", "ENSE00001484268", "ENSE00001526635")]

# split by gene name into a GRangesList
genes <- split(annot, f = annot$gene_name)

# load enhancer coordinates files
enh_coords <- here(snakemake@input$enh_coords) %>%
  lapply( FUN = read.table, header = FALSE, stringsAsFactors = FALSE) %>%
  bind_rows() %>%
  distinct()

colnames(enh_coords) <- c("chr", "start", "end", "name", "score", "strand")


# FDR and confidence levels ------------------------------------------------------------------------

# correct for multiple testing across samples using FDR
results <- results %>%
  group_by(strategy) %>%
  rename(pval_adj_sample = pval_adj_allTests) %>%
  mutate(pval_adj_allTests = p.adjust(pvalue, method = "fdr"))

# extract results for 'per enhancer' tests
enh_perts <- filter(results, strategy == "perEnh")

# function to collapse gRNA ids per targeted enhancer
collapse_grnas <- function(grnas) {
  grnas <- sub("_.+", "", grnas)
  sub("\\.[A|B|C|D]", "", grnas)
}

# count number of gRNA hits per enhancer-gene pair
gRNA_hits <- results %>%
  filter(strategy == "perGRNA") %>%
  mutate(perturbation = collapse_grnas(perturbation)) %>%
  group_by(sample, perturbation, gene) %>%
  summarize(grna_hits = sum(pval_adj_allTests < snakemake@params$confidence_fdr),
            prop_grna_hits = mean(pval_adj_allTests < snakemake@params$confidence_fdr))

# add single gRNA hits to enhancer hits
enh_perts <- left_join(enh_perts, gRNA_hits, by = c("sample", "perturbation", "gene"))

# log fold change ----------------------------------------------------------------------------------

# add manually calculate log fold change to de results
enh_perts <- lfc %>%
  select(sample, perturbation, gene, manual_lfc = lfc) %>%
  left_join(enh_perts, ., by = c("sample", "perturbation", "gene"))

# add enhancer and gene TSS coordinates ------------------------------------------------------------

# compute center of every enhancer
enh_centers <- enh_coords %>%
  mutate(enh_center = round((start + end) / 2)) %>%
  select(name, chr, start, end, enh_center) %>%
  mutate(name = gsub(":|-", ".", name),
         chr = sub("chr", "", chr)) %>%
  rename(enh_chr = chr, enh_start = start, enh_end = end)

# add enhancer centers to de results
enh_perts <- enh_perts %>%
  left_join(enh_centers, by = c("perturbation" = "name"))

# get chromosome and strand for each gene
gene_chr <- unlist(unique(seqnames(genes)))
gene_strand <- unlist(unique(strand(genes)))

# function to get a genes TSS
get_tss <- function(x) {
  if (all(strand(x) == "+")) {
    min(start(x))
  }else if (all(strand(x) == "-")) {
    max(end(x))
  }else{
    stop("Inconsistent strand information!")
  }
}

# get TSS for each gene
gene_tss <- sapply(genes, FUN = get_tss)

# create data.frame with strand and tss coordinates for every gene
tss_coords <- data.frame(gene_chr, gene_tss, gene_strand, check.rows = TRUE) %>%
  rownames_to_column(var = "gene") %>%
  mutate_if(is.factor, as.character)

# add to discovery perturbation results
enh_perts <- left_join(enh_perts, tss_coords, by = "gene")


# add distance to TSS for cis associations ---------------------------------------------------------

# add type for cis or trans enhancer interactions
enh_perts <- enh_perts %>%
  mutate(enh_type = if_else(enh_chr == gene_chr, true = "cis", false = "trans"))

# calculate distance to tss for every cis enhancer - gene pair
enh_perts <- enh_perts %>%
  mutate(dist_to_tss = if_else(enh_type == "cis",
                               true = if_else(gene_strand == "+",
                                              true  = enh_center - gene_tss,
                                              false = gene_tss - enh_center),
                               false = as.numeric(NA)))


# label intronic ETPs ------------------------------------------------------------------------------

# create GRangesList object with gene locus coordinates (overlapping enhancers should be intronic)
gene_loci <- range(genes)

# create GRanges object with coordinates of all cis-enhancers (no promoter controls)
cis_enh_coords <- enh_perts %>% 
  ungroup() %>% 
  filter(enh_type == "cis") %>% 
  select(chr = enh_chr, start = enh_start, end = enh_end, perturbation) %>%
  distinct() %>% 
  makeGRangesFromDataFrame(keep.extra.columns = TRUE)

# overlap enhancers with gene loci
overlaps <- findOverlaps(cis_enh_coords, gene_loci, ignore.strand = TRUE)

# covert to data.frame with overlapping gene names for every enhancers
enh_gene_overlaps <- data.frame(perturbation = cis_enh_coords[queryHits(overlaps)]$perturbation,
                                gene = names(gene_loci[subjectHits(overlaps)]),
                                stringsAsFactors = FALSE)

# create nested data.frame with all overlapping genes per perturbation
enh_gene_overlaps <- enh_gene_overlaps %>% 
  group_by(perturbation) %>% 
  nest(enh_overlapping_genes = gene)

# add to results and add column indicating intronic ETPs
enh_perts <- enh_perts %>% 
  left_join(enh_gene_overlaps, by = "perturbation") %>% 
  rowwise() %>% 
  mutate(enh_in_intron = if_else(gene %in% enh_overlapping_genes$gene, true = 1, false = 0))


# export processed DE results ----------------------------------------------------------------------

# reformat for output
enh_perts <- select(enh_perts, -c(enh_center, enh_overlapping_genes))

# save processed output to file
write.csv(enh_perts, file = here(snakemake@output), row.names = FALSE)
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
library(tidyverse)
library(here)

# get tap-seq target genes -------------------------------------------------------------------------

# load tap-seq target genes
target_genes <- read.csv(here(snakemake@input$target_genes), stringsAsFactors = FALSE)

# known enhancer target genes
known_enh <- target_genes %>%
  filter(screen == "validation") %>%
  pull(gene)

# chromosome 11 inhibition panel
chr11_genes <- target_genes %>%
  filter(screen == "inhibition", panel == "chr11_hs2") %>%
  pull(gene) %>%
  c(known_enh)

# chromosome 8 inhibition panel
chr8_genes <- target_genes %>%
  filter(screen == "inhibition") %>%
  filter(panel %in% c("chr8_zfpm2", "chr8_myc")) %>%
  pull(gene) %>%
  c(known_enh)

# string patterns for all target genes (incl CROP-seq vectors)
vector_pattern <- snakemake@params$vector_pattern
targets_chr11 <- c(chr11_genes, vector_pattern)
targets_chr8 <- c(chr8_genes, vector_pattern)

# total number of reads ----------------------------------------------------------------------------

# fastq files containing genome reads of each sample
fastq_files <- snakemake@input$fastq

# infer sample names from fastq files and set as names
bn <- sub(".*lane1", "", basename(fastq_files))
names(fastq_files) <- sub("_2_sequence.txt.gz", "", bn)

# count total reads in fastq files
raw_reads <- sapply(X = fastq_files, FUN = function(x) {
  command <- paste("echo $(zcat", x, "| wc -l) / 4 | bc")
  as.numeric(system(command, intern = TRUE))
})

# convert to data.frame
raw_reads <- data.frame(sample = names(raw_reads), total_reads = raw_reads,
                        stringsAsFactors = FALSE)

# reads on target ----------------------------------------------------------------------------------

# function to count reads on target genes
reads_on_target <- function(bam_files, targets, bam_tag = "GE:Z:") {

  # add bam tag to target genes to create patterns for grep and write to text file
  targets_patterns <- paste0(bam_tag, targets)
  targets_file <- tempfile(pattern = "target_patterns_", fileext = ".txt")
  write(targets_patterns, file = targets_file)

  # count reads on targets
  target_reads <- vapply(bam_files, FUN = function(x) {
    command <- paste("samtools view", x, "| fgrep -f", targets_file, "| wc -l")
    as.numeric(system(command, intern = TRUE))
  }, FUN.VALUE = numeric(1))

  # delete target genes tempfile
  unlink(targets_file)

  # create final output
  data.frame(sample = names(bam_files), target_reads = target_reads, stringsAsFactors = FALSE)

}

# get bam files
bam_files <- here(snakemake@input$bam)
names(bam_files) <- basename(dirname(bam_files))

# bam files for chr11 and chr8 targets
chr11_bam <- c(bam_files[grep(names(bam_files), pattern = "^11")], bam_files["Sample10X"])
chr8_bam <- c(bam_files[grep(names(bam_files), pattern = "^8")], bam_files["Sample10X"])

# count reads mapping to target genes
chr11_target_reads <- reads_on_target(chr11_bam, targets = targets_chr11)
chr8_target_reads <- reads_on_target(chr8_bam, targets = targets_chr8)

# finalize output and save to file -----------------------------------------------------------------

# combine read counts into data frame and compute percentage of reads on target genes
read_counts <- bind_rows(chr11 = chr11_target_reads, chr8 = chr8_target_reads, .id = "panel") %>%
  left_join(raw_reads, by = "sample") %>%
  mutate(perc = target_reads / total_reads)

# save to output file
write.csv(read_counts, file = here(snakemake@output), row.names = FALSE)
16
17
18
19
/* Define a margin before every image element */
img {
  margin-top: 3em;
}
23
24
25
26
27
28
29
30
31
32
33
# set global chunk options
knitr::opts_chunk$set(echo = FALSE)

# attach required packages
library(factoextra)
library(grid)
library(gridExtra)
library(ggrepel)
library(GGally)
library(here)
library(tidyverse)
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
# load target genes
target_genes <- read.csv(here(snakemake@input$target_genes), stringsAsFactors = FALSE)

# load dge summary
dge_stats_files <- here(snakemake@input$dge_stats)
names(dge_stats_files) <- basename(dirname(dge_stats_files))
dge_stats <- lapply(dge_stats_files, FUN = read.table, header = TRUE, stringsAsFactors = FALSE)

# load dge data
dge_files <- here(snakemake@input$dge)
names(dge_files) <- basename(dirname(dge_files))
dge <- lapply(dge_files, FUN = read.table, header = TRUE, stringsAsFactors = FALSE,
              row.names = "GENE")

# load perturbation status
perturb_status_files <- here(snakemake@input$perturb_status)
names(perturb_status_files) <- basename(dirname(perturb_status_files))
perturb_status <- lapply(perturb_status_files, FUN = read.table, header = TRUE,
                         row.names = "VECTOR", stringsAsFactors = FALSE)

# dge input files from validation TAP-seq experiments
valid_dge_files <- grep(snakemake@input$valid_dge, pattern = "iv2", value = TRUE)
valid_dge_files <- here(valid_dge_files)
names(valid_dge_files) <- basename(dirname(dirname(valid_dge_files)))

# load validation dge data
valid_dge <- lapply(valid_dge_files, FUN = read.table, header = TRUE, stringsAsFactors = FALSE)

# load experimental meta data and add sample information
exp_data <- here(snakemake@input$exp_data) %>%
  read.csv(stringsAsFactors = FALSE) %>%
  mutate(sample = paste0(sub("chr", "", panel), "Screen1"))

# transfected vector files
vector_files <- here(snakemake@input$vctr_seqs)
names(vector_files) <- sub("cropseq_vectors_(chr.+)_screen.fasta", "\\1", basename(vector_files))

# load transfected vectors and extract sample ids for both chromosomal regions
vectors <- vector_files %>%
  lapply(FUN = readLines) %>%
  lapply(FUN = grep, pattern = "^>.+$", perl = TRUE, value = TRUE) %>%
  lapply(FUN = function(x) data.frame(vector = x, stringsAsFactors = FALSE) ) %>%
  bind_rows(.id = "sample") %>%
  mutate(vector = sub(vector, pattern = ">", replacement = ""),
         sample = paste0(sub("chr", "", sample), "iScreen1"))

# extract vector expression from dge data
vector_pattern <- snakemake@params$vector_pattern
vector_rows <- lapply(dge, FUN = function(x) grep(rownames(x), pattern = vector_pattern) )
vctr_expr <- mapply(FUN = function(x, y) x[y, ], x = dge, y = vector_rows)

# extract gene expression data
gene_expr <- mapply(FUN = function(x, y) x[-y, ], x = dge, y = vector_rows)

# remove dge to free up memory
rm(dge)
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
# combine dge_stats into one data.frame and split cell barcode into i7 and cell
dge_stats <- dge_stats %>%
  bind_rows(.id = "sample") %>%
  separate(col = cell_barcode, into = c("i7_index", "cell"), sep = 8)

# add variable for 10x lane
dge_stats <- left_join(dge_stats, select(exp_data, sample, name, i7_index),
                       by = c("sample", "i7_index"))

# plot number of cells per lane
p1 <- ggplot(dge_stats, aes(x = name, fill = name)) +
  facet_wrap(~sample, scales = "free_x") +
  geom_bar() +
  theme_bw() +
  labs(x = "10x lane", y = "Cells", title = "Number of cells per 10x lane") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")

# plot number of detected genes per cell
p2 <- ggplot(dge_stats, aes(x = name, y = genes, color = name)) +
  facet_wrap(~sample, scales = "free_x") +
  geom_jitter(width = 0.1, alpha = 0.2) +
  geom_boxplot(color = "black", fill = NA, outlier.shape = NA, notch = TRUE) +
  labs(x = "10x lane", y = "Detected genes", title = "Detected genes per cell") +
  scale_y_log10() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")

# print plots
grid.arrange(p1, p2, nrow = 2, ncol = 1)
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
# plot number of transcripts vs genic reads
p1 <- ggplot(dge_stats, aes(x = transcripts, y = genic_reads, color = name)) +
  facet_wrap(~sample) +
  geom_point() +
  labs(x = "Transcripts", y = "Genic reads", title = "Transcripts vs. genic reads",
       color = "10x lane") +
  scale_x_log10() +
  scale_y_log10() +
  theme_bw() +
  theme(legend.position = "none")

# plot number of genic reads per cell
p2 <- ggplot(dge_stats, aes(x = name, y = genic_reads, color = name)) +
  facet_wrap(~sample, scales = "free_x", ncol = 2) +
  geom_jitter(width = 0.1, alpha = 0.2) +
  geom_boxplot(color = "black", fill = NA, outlier.shape = NA, notch = TRUE) +
  labs(x = "10x lane", y = "Genic reads", title = "Genic reads per cell") +
  scale_y_log10() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")

# plot number of transcripts per cell
p3 <- ggplot(dge_stats, aes(x = name, y = transcripts, color = name)) +
  facet_wrap(~sample, scales = "free_x", ncol = 2) +
  geom_jitter(width = 0.1, alpha = 0.2) +
  geom_boxplot(color = "black", fill = NA, outlier.shape = NA, notch = TRUE) +
  labs(x = "10x lane", y = "Transcripts", title = "Transcripts per cell") +
  scale_y_log10() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")

# print plots
layout <- rbind(c(1, 1), c(2, 2), c(3, 3))
grid.arrange(p1, p2, p3, layout_matrix = layout)
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
# transpose gene expression, convert to long format and combine into one data.frame
gene_expr_l <- gene_expr %>%
  lapply(FUN = function(x) as.data.frame(t(x))) %>%
  lapply(FUN = rownames_to_column, var = "cell_barcode") %>%
  lapply(FUN = gather, key = "gene", value = "txs", -cell_barcode) %>%
  bind_rows(.id = "sample")

# split barcode into i7 index and cell barcode, and add variable for 10x lane
gene_expr_l <- gene_expr_l %>%
  separate(col = cell_barcode, into = c("i7_index", "cell"), sep = 8) %>%
  left_join(select(exp_data, sample, name, i7_index), by = c("sample", "i7_index"))

# compute average gene expression per 10x lane
avg_genex_lanes <- gene_expr_l %>%
  group_by(sample, name, gene) %>%
  summarize(avg_txs = mean(txs))

# plot average gene expression
ggplot(avg_genex_lanes, aes(x = name, y = avg_txs, color = name)) +
  facet_wrap(~sample, scales = "free_x", ncol = 2) +
  geom_jitter(width = 0.25, alpha = 0.5) +
  geom_boxplot(color = "black", fill = NA, notch = TRUE, outlier.shape = NA) +
  labs(x = "10x lane", y = "Average transcripts per gene", title = "Transcripts per gene") +
  scale_y_log10() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")
215
216
217
218
219
220
221
222
223
224
225
226
227
228
# calculate log10 average gene expression and reformat
avg_genex_log10 <- mutate(avg_genex_lanes, avg_txs = log10(avg_txs))

# split by sample and reformat
avg_genex_log10 <- avg_genex_log10 %>%
  split(f = .$sample) %>%
  lapply(spread, key = "name", value = "avg_txs")

# plot pairwise correlations between 10x lanes
plots <- lapply(avg_genex_log10, FUN = function(x) ggpairs(x, columns = 3:ncol(x),
           title = unique(x$sample), progress = FALSE) )

# print plots
for (plot in plots) print(plot)
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
# convert gene expression into data matrices and transpose for pca
genex_mat <- lapply(gene_expr, FUN = function(x) t(data.matrix(x)) )

# create experimental meta data for every cell
exp_data_cells <- genex_mat %>%
  lapply(FUN = function(x) data.frame(cell = rownames(x), stringsAsFactors = FALSE) ) %>%
  bind_rows(.id = "sample") %>%
  mutate(i7_index = substr(cell, start = 1, stop = 8)) %>%
  left_join(select(exp_data, sample, i7_index, chip_10x, lane_10x, bead_vial_10x, name),
            by = c("sample", "i7_index")) %>%
  rename(lane_number_10x = lane_10x, sample_id_10x = name) %>%
  split(f = .$sample)

# perform PCA on DGE data
pca_genex <- lapply(genex_mat, FUN = prcomp, scale. = TRUE)

# function to plot pca colored by experimental variable 
plot_pca <- function(var, pca_obj, title = NULL) {
  fviz_pca_ind(pca_obj, habillage = var, label = "none", invisible = "quali", title = title)
}

# plot pcs 1 & 2 colored by experimental meta data
pca_plots_chr11 <- lapply(colnames(exp_data_cells[["11iScreen1"]][, -c(1:3)]), FUN = function(x) {
  plot_pca(var = exp_data_cells[["11iScreen1"]][, x], pca_obj = pca_genex[["11iScreen1"]],
           title = x)
})

pca_plots_chr8 <- lapply(colnames(exp_data_cells[["8iScreen1"]][, -c(1:3)]), FUN = function(x) {
  plot_pca(var = exp_data_cells[["8iScreen1"]][, x], pca_obj = pca_genex[["8iScreen1"]], title = x)
})

# print plots
grid.arrange(grobs = pca_plots_chr11, top = textGrob("11iScreen1", gp = gpar(fontsize = 15)))
grid.arrange(grobs = pca_plots_chr8,  top = textGrob("8iScreen1", gp = gpar(fontsize = 15)))
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
# number of transcripts per vector
txs_per_vector <- vctr_expr %>%
  lapply(FUN = rowSums) %>%
  lapply(FUN = function(x) data.frame(vector = names(x), txs = x, stringsAsFactors = FALSE)) %>%
  bind_rows(.id = "sample") %>%
  mutate(vector = sub(vector_pattern, "", vector))

# compute the number of cells per perturbation
cells_per_vector <- perturb_status %>%
  lapply(FUN = rowSums) %>%
  lapply(FUN = function(x) data.frame(vector = names(x), cells = x, stringsAsFactors = FALSE)) %>%
  bind_rows(.id = "sample")

# merge transcripts and cells per vector and add vectors that were not detected
vector_stats <- full_join(txs_per_vector, cells_per_vector, by = c("sample", "vector")) %>%
  full_join(vectors, by = c("sample", "vector")) %>%
  replace_na(replace = list(txs = 0, cells = 0))

# calculate average vector txs per transfected cell (if possible)
vector_stats <- vector_stats %>%
  mutate(avg_txs = if_else(cells > 0, true = txs / cells, false = 0))

# add label for control vectors and convert to long format
vector_stats <- vector_stats %>%
  mutate(type = if_else(grepl(vector, pattern = "^chr.+$"), true = "screen", false = "ctrl"))
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
# reformat vector_stats for plotting
vector_stats_plot <- vector_stats %>%
  gather(key = "stat", value = "value", -c(sample, vector, type)) %>%
  mutate(vector = paste(sample, vector, stat, sep = "_")) %>%
  mutate(vector = fct_reorder(vector, value, .desc = TRUE)) %>%
  mutate(stat = case_when(stat == "txs" ~ "Transcripts per vector",
                          stat == "cells" ~ "Cells per vector",
                          stat == "avg_txs" ~ "Average transcripts per cell"))

# plot vector transcripts and cells for chromosome 11
p1 <- vector_stats_plot %>%
  filter(sample == "11iScreen1") %>%
  ggplot(., aes(x = value)) +
  facet_wrap(~stat, scales = "free", ncol = 1) +
  geom_histogram(bins = 50) +
  labs(x = "Transcripts / Cells", title = "11iScreen1") +
  theme_bw()

p2 <- vector_stats_plot %>%
  filter(sample == "11iScreen1") %>%
  ggplot(., aes(x = vector, y = value, color = type)) +
  facet_wrap(~stat, scales = "free", ncol = 1) +
  geom_point() +
  geom_rug(color = "black", alpha = 0.1, sides = "l") +
  scale_color_manual(values = c("ctrl" = "firebrick2", "screen" = "steelblue3")) +
  labs(x = "Vector", y = "Transcripts / Cells", title = "", color = "Vector type") +
  scale_x_discrete(expand = c(0.06, 0)) +
  theme_bw() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(), panel.grid = element_blank())

layout <- rbind(c(1, 1, 2, 2, 2))
grid.arrange(p1, p2, ncol = 2, layout_matrix = layout)
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
# plot vector transcripts and cells for chromosome 8
p1 <- vector_stats_plot %>%
  filter(sample == "8iScreen1") %>%
  ggplot(., aes(x = value)) +
  facet_wrap(~stat, scales = "free", ncol = 1) +
  geom_histogram(bins = 50) +
  labs(x = "Transcripts / Cells", title = "8iScreen1") +
  theme_bw()

p2 <- vector_stats_plot %>%
  filter(sample == "8iScreen1") %>%
  ggplot(., aes(x = vector, y = value, color = type)) +
  facet_wrap(~stat, scales = "free", ncol = 1) +
  geom_point() +
  geom_rug(color = "black", alpha = 0.1, sides = "l") +
  scale_color_manual(values = c("ctrl" = "firebrick2", "screen" = "steelblue3")) +
  labs(x = "Vector", y = "Transcripts / Cells", title = "", color = "Vector type") +
  scale_x_discrete(expand = c(0.06, 0)) +
  theme_bw() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(), panel.grid = element_blank())

layout <- rbind(c(1, 1, 2, 2, 2))
grid.arrange(p1, p2, ncol = 2, layout_matrix = layout)
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
# compute average gene expression per gene
avg_genex <- gene_expr_l %>%
  group_by(sample, gene) %>%
  summarize(avg_txs = mean(txs))

# get target genes for both samples
sample_targets_chr11 <- target_genes %>%
  filter((panel == "chr11_hs2" | screen == "validation"), !gene %in% c("HBG1", "HBG2")) %>%
  mutate(sample = "11iScreen1")

sample_targets_chr8 <- target_genes %>%
  filter((panel %in% c("chr8_myc", "chr8_zfpm2") | screen == "validation"),
         !gene %in% c("HBG1", "HBG2")) %>%
  mutate(sample = "8iScreen1")

sample_targets <- bind_rows(sample_targets_chr11, sample_targets_chr8) %>%
  select(gene, sample, type = screen) %>%
  mutate(type = if_else(type == "validation", true = "e-gene", false = "non_e-gene"))

# average gene expression in validation experiments
avg_genex_valid <- valid_dge %>%
  lapply(FUN = gather, key = "cell", value = "txs", -GENE) %>%
  bind_rows(.id = "sample") %>%
  mutate(sample = paste0(sub("iv2.*", "", sample), "iScreen1")) %>%
  rename(gene = GENE) %>%
  right_join(sample_targets, by = c("sample", "gene")) %>%
  group_by(sample, gene, type) %>%
  summarize(avg_txs_valid = mean(txs))

# merge with screen avg genex datasets
avg_genex_merged <- full_join(avg_genex, avg_genex_valid, by = c("sample", "gene"))

# correlation plot
ggplot(avg_genex_merged, aes(x = avg_txs_valid, y = avg_txs, color = type)) +
  facet_wrap(~sample) +
  geom_abline() +
  geom_point() +
  geom_text_repel(data = filter(avg_genex_merged, type == "e-gene" | gene %in% c("STK3", "PGAP2")),
                  aes(label = gene), box.padding = unit(3, "lines"), color = "gray50") +
  labs(title = "Average gene expression", x = "Transcripts validation", y = "Transcripts screen",
       color = "Type") +
  scale_color_manual(values = c("e-gene" = "firebrick2", "non_e-gene" = "steelblue3")) +
  scale_x_log10() +
  scale_y_log10() +
  coord_fixed() +
  theme_bw()
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
# only retain perturbation status on control perturbations
ctrl_perts <- perturb_status %>%
  lapply(FUN = rownames_to_column, var = "vector") %>%
  lapply(FUN = function(x) x[!grepl(x$vector, pattern = "^chr.+$"), ] ) %>%
  lapply(FUN = function(x) x[, c(TRUE, colSums(x[, -1]) > 0)] ) %>%  # remove cells without any pert
  lapply(FUN = gather, key = "cell_barcode", value = "pert", -vector) %>%
  bind_rows(.id = "sample") %>%
  filter(pert > 0)  # only retain detected perturbation events

# convert vector ids to grna targets
ctrl_perts <- ctrl_perts %>%
  mutate(vector = sub("_.+", "", vector),
         vector = sub("-[ABCD]", "", vector)) %>%
  rename(target = vector)

# collapse vector perturbations
ctrl_perts <- ctrl_perts %>%
  group_by(sample, target, cell_barcode) %>%
  summarize(pert = sum(pert))
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
# get gene expression data for cells carrying control perturbations
genex_pert <- gene_expr_l %>%
  unite(col = "cell_barcode", i7_index, cell, sep = "") %>%
  right_join(ctrl_perts, by = c("sample", "cell_barcode"))

# calculate average gene expression of each gene in every perturbation
avg_genex_pert <- genex_pert %>%
  group_by(sample, target, gene) %>%
  summarize(avg_txs = mean(txs)) %>%
  ungroup()

# extract negative control expression for every gene
neg_ctrl <- avg_genex_pert %>%
  filter(target == "non-targeting") %>%
  select(sample, gene, neg_ctrl = avg_txs)

# add to perturbed average expression levels
avg_genex_pert <- avg_genex_pert %>%
  filter(target != "non-targeting") %>%
  left_join(neg_ctrl, by = c("sample", "gene"))

# calculate log fold change for each gene in each perturbation
lfc_pert <- avg_genex_pert %>%
  filter(avg_txs > 0) %>%  # filter for genes with detected expression for the given perturbations
  mutate(fc = avg_txs / neg_ctrl,
         lfc = log2(fc))
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
# list of perturbations and their target genes
perturbations <- as.list(unique(lfc_pert$target))
names(perturbations) <- unlist(perturbations)
perturbations$HS2 <- c("HBB", "HBD", "HBE1", "HBG1", "HBG2")

# add variable indicating if a gene is a target of the given perturbation
lfc_pert <- lfc_pert %>%
  group_by(target) %>%
  mutate(type = if_else(gene %in% unlist(perturbations[target]),
                        true = "target", false = "non_target"))

# function to plot perturbation effects for one sample
plot_pert_effects <- function(lfc_pert, title = NULL) {
  ggplot(lfc_pert, aes(x = lfc, y = gene, color = type)) +
    facet_wrap(~target, ncol = 3) +
    geom_vline(xintercept = c(-1, 1), lty = "dashed", color = "darkgray") +
    geom_vline(xintercept = 0, color = "darkgray") +
    geom_point() +
    geom_text_repel(data = filter(lfc_pert, abs(lfc) > 0.75), aes(label = gene),
                    box.padding = unit(1.5, "lines"), color = "gray50") +
    labs(title = title, x = "Expression log fold change") +
    scale_color_manual(values = c("target" = "firebrick2", "non_target" = "steelblue3")) +
    theme_bw() +
    theme(axis.text.y = element_text(size = 5))
}

plot_pert_effects(filter(lfc_pert, sample == "11iScreen1"), title = "Perturbation LFC 11iScreen1")
plot_pert_effects(filter(lfc_pert, sample == "8iScreen1"), title = "Perturbation LFC 8iScreen1")
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
# get average gene expression of non_targets that show a |lfc| > 0.75
potential_fp <- filter(lfc_pert, type == "non_target", abs(lfc) > 0.75) %>%
  ungroup() %>%
  select(sample, gene) %>%
  distinct() %>%
  mutate(type = "False positive?")

# get average expression over all cells and all genes and mark potential false positives
avg_genex_fp <- avg_genex %>%
  left_join(potential_fp, by = c("sample", "gene")) %>%
  mutate(type = replace_na(type, "Other"))

# plot average expression of these genes vs other gene
p1 <- ggplot(avg_genex_fp, aes(x = type, y = avg_txs, fill = type, color = type)) +
  facet_wrap(~sample, scales = "free_y") +
  geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 0.15) +
  geom_text_repel(data = filter(avg_genex_fp, type == "False positive?"), aes(label = gene),
                  color = "gray50") +
  labs(title = "Average expression potential false positives", y = "Average expression") +
  scale_color_manual(values = c("False positive?" = "firebrick2", "Other" = "steelblue3")) +
  scale_fill_manual(values = c("False positive?" = "firebrick2", "Other" = "steelblue3")) +
  scale_y_log10() +
  theme_bw() +
  theme(axis.title.x = element_blank(), legend.position = "none")

# compute detection rate for each gene
detection_rate <- gene_expr_l %>%
  group_by(sample, gene) %>%
  summarize(cells = sum(txs > 0), 
            detect = cells / n()) %>%
  left_join(potential_fp, by = c("sample", "gene")) %>%
  mutate(type = replace_na(type, "Other"))

# plot proportion of cells with zero expression
p2 <- ggplot(detection_rate, aes(x = type, y = detect, fill = type, color = type)) +
  facet_wrap(~sample, scales = "free_y") +
  geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 0.034) +
  geom_text_repel(data = filter(detection_rate, type == "False positive?"), aes(label = gene),
                  color = "gray50") +
  labs(title = "Detection rate", y = " % cells with transcripts > 0") +
  scale_color_manual(values = c("False positive?" = "firebrick2", "Other" = "steelblue3")) +
  scale_fill_manual(values = c("False positive?" = "firebrick2", "Other" = "steelblue3")) +
  scale_y_continuous(labels = scales::percent) +
  theme_bw() +
  theme(axis.title.x = element_blank(), legend.position = "none")

# print plots
grid.arrange(p1, p2, ncol = 1)
17
18
19
20
/* Define a margin before every image element */
img {
  margin-top: 3em;
}
24
25
26
27
# attach required packages
library(here)
library(Biostrings)
library(tidyverse)
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
# perturbation status
pert_8iScreen1 <- read.table(here("data/8iScreen1/perturb_status.txt"), header = TRUE,
                             stringsAsFactors = FALSE)
pert_11iScreen1 <- read.table(here("data/11iScreen1/perturb_status.txt"), header = TRUE,
                             stringsAsFactors = FALSE)

# convert perturbation status to long format and only keep detected perturbations
pert_8iScreen1 <- pert_8iScreen1 %>% 
  pivot_longer(-VECTOR, names_to = "cell_barcode", values_to = "pert") %>% 
  filter(pert > 0) %>% 
  select(-pert) %>% 
  dplyr::rename(vector_id = VECTOR)

pert_11iScreen1 <- pert_11iScreen1 %>% 
  pivot_longer(-VECTOR, names_to = "cell_barcode", values_to = "pert") %>% 
  filter(pert > 0) %>% 
  select(-pert) %>% 
  dplyr::rename(vector_id = VECTOR)

# files containing parsed misatch data
outfile_8iScreen1 <- here("data/8iScreen1/cropseq_vector_seqs.txt.gz")
outfile_11iScreen1 <- here("data/11iScreen1/cropseq_vector_seqs.txt.gz")

col_types <- cols(
  "cell_barcode" = col_character(),
  "umi_barcode" = col_character(),
  "aligned_vector_id" = col_character(),
  "mapping_quality" = col_integer(),
  "mismatches" = col_character(),
  "mismatch_positions" = col_character(),
  "sequence" = col_character()
)

# read mismatch data
mm_8iScreen1 <- read_tsv(outfile_8iScreen1, col_types = col_types)
mm_11iScreen1 <- read_tsv(outfile_11iScreen1, col_types = col_types)
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
# only retain reads with a mapping quality >= 10
mm_8iScreen1  <- dplyr::filter(mm_8iScreen1, mapping_quality >= 10)
mm_11iScreen1 <- dplyr::filter(mm_11iScreen1, mapping_quality >= 10)

# filter for reads aligning to inferred perturbations for each cell
mm_perts8 <- mm_8iScreen1 %>% 
  rename(vector_id = aligned_vector_id) %>% 
  mutate(cell_barcode = sub("XC:Z:", "", cell_barcode),
         vector_id = sub("GE:Z:CROPseq_dCas9_DS_", "", vector_id),
         mismatches = as.integer(sub("NM:i:", "", mismatches))) %>% 
  inner_join(pert_8iScreen1, by = c("cell_barcode", "vector_id"))

mm_perts11 <- mm_11iScreen1 %>% 
  rename(vector_id = aligned_vector_id) %>% 
  mutate(cell_barcode = sub("XC:Z:", "", cell_barcode),
         vector_id = sub("GE:Z:CROPseq_dCas9_DS_", "", vector_id),
         mismatches = as.integer(sub("NM:i:", "", mismatches))) %>% 
  inner_join(pert_11iScreen1, by = c("cell_barcode", "vector_id"))
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
# load vector sequences for chr8 and chr11 screen datasets
vectors8 <- readDNAStringSet(here("meta_data/cropseq_vectors/cropseq_vectors_chr8_screen.fasta"))
vectors11 <- readDNAStringSet(here("meta_data/cropseq_vectors/cropseq_vectors_chr11_screen.fasta"))

# convert to data.frame 
gRNAs8 <- data.frame(vector_id = names(vectors8), vector_seq = as.character(vectors8),
                     row.names = NULL, stringsAsFactors = FALSE)

gRNAs11 <- data.frame(vector_id = names(vectors11), vector_seq = as.character(vectors11),
                     row.names = NULL, stringsAsFactors = FALSE)

# extract gRNA sequences (19bp) from vector sequences
gRNAs8 <- gRNAs8 %>% 
  mutate(grna_seq = sub(".*TGTGGAAAGGACGAAACACCG", "", vector_seq),
         grna_seq = substr(grna_seq, start = 1, stop = 19))

gRNAs11 <- gRNAs11 %>% 
  mutate(grna_seq = sub(".*TGTGGAAAGGACGAAACACCG", "", vector_seq),
         grna_seq = substr(grna_seq, start = 1, stop = 19))

# extract gRNA sequences from vector reads, assuming gRNAs are 19bp long
mm_perts8  <- mutate(mm_perts8, grna_seq = substr(sequence, start = 22, stop = 40))
mm_perts11 <- mutate(mm_perts11, grna_seq = substr(sequence, start = 22, stop = 40))
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
# create consensus gRNA sequence for every detected (perturbations) gRNA for each cell
gRNA_consensus_chr8 <- mm_perts8 %>% 
  group_by(cell_barcode, vector_id) %>% 
  summarize(grna_consensus_seq = consensusString(DNAStringSet(grna_seq)))

gRNA_consensus_chr11 <- mm_perts11 %>% 
  group_by(cell_barcode, vector_id) %>% 
  summarize(grna_consensus_seq = consensusString(DNAStringSet(grna_seq)))

# add expected gRNA sequence
gRNA_consensus_chr8 <- gRNA_consensus_chr8 %>% 
  left_join(select(gRNAs8, -vector_seq), by = "vector_id")

gRNA_consensus_chr11 <- gRNA_consensus_chr11 %>% 
  left_join(select(gRNAs11, -vector_seq), by = "vector_id")
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
# function to compute edit distance between consensus and expected gRNA sequences
calc_edit_dist <- function(x, y, method = "levenshtein") {
  as.integer(stringDist(c(x, y), method = method))
}

# calculate edit distance between sequenced consensus gRNA and expected gRNA sequence
gRNA_consensus_chr8 <- gRNA_consensus_chr8 %>% 
  rowwise() %>% 
  mutate(edit_dist = calc_edit_dist(grna_consensus_seq, grna_seq, method = "levenshtein"))

gRNA_consensus_chr11 <- gRNA_consensus_chr11 %>% 
  rowwise() %>% 
  mutate(edit_dist = calc_edit_dist(grna_consensus_seq, grna_seq, method = "levenshtein"))

noerr_chr8 <- mean(gRNA_consensus_chr8$edit_dist == 0)
noerr_chr11 <- mean(gRNA_consensus_chr11$edit_dist == 0)
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
# create one table
gRNA_cons <- bind_rows(Chr8 = gRNA_consensus_chr8, Chr11 = gRNA_consensus_chr11, .id = "Sample")

# create histogram of edit distance distribution
ggplot(gRNA_cons, aes(edit_dist, fill = Sample)) +
  geom_histogram(position = "dodge", bins = 20) +
  labs(x = "Edit distance\n(gRNA consensus sequence - template)",
       y = "gRNA - cell combinations") +
  scale_fill_manual(values = c(Chr11 = "indianred3", Chr8 = "steelblue")) +
  theme_bw() +
  theme(panel.grid = element_blank(), aspect.ratio = 1,
        text = element_text(size = 15))

# save plot to .pdf file
ggsave(filename = here("results/plots", "screen_gRNA_errors.pdf"), height = 4, width = 5)
15
16
17
18
/* Define a margin before every image element */
img {
  margin-top: 3em;
}
22
23
24
25
26
27
28
29
30
31
# set global chunk options
knitr::opts_chunk$set(echo = FALSE)

# attach required packages
library(tidyverse)
library(ggrepel)
library(here)

# create plots directory if needed
dir.create("results/plots", showWarnings = FALSE)
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
# load read counts
read_counts <- read.csv(here(snakemake@input$reads_on_target), stringsAsFactors = FALSE)

# add variable for approach and rename Sample10X to include assessed chromosome
read_counts_plot <- read_counts %>%
  mutate(approach = if_else(sample == "Sample10X", true = "cropseq", false = "tapseq")) %>%
  mutate(sample = if_else(sample == "Sample10X", true = paste0(sample, " (", panel, ")"),
                          false = sample)) %>%
  arrange(panel, desc(sample))

# plot percentage of total reads on target genes
ggplot(read_counts_plot, aes(x = fct_inorder(sample), y = perc, fill = approach)) +
  geom_bar(stat = "identity") +
  labs(y = "Percentage of reads mapping to target genes", x = NULL) +
  scale_y_continuous(labels = scales::percent) +
  coord_flip() +
  theme_bw() +
  theme(legend.position = "none") +
  scale_fill_manual(values =  c("tapseq" = "firebrick2", "cropseq" = "gray42"))

# figure caption
cap <- "Percentage of reads mapping to assessed target genes."

# save plot to file
ggsave(here("results/plots/reads_on_target.pdf"), device = "pdf", width = 5, height = 4)
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
# load tap-seq target genes
target_genes <- read.csv(here(snakemake@input$target_genes), stringsAsFactors = FALSE)

# known enhancer targets for validation
known_enh <- target_genes %>%
  filter(screen == "validation") %>%
  pull(gene)

# get downsampled dge files and set names
ds_dge_files <- here(snakemake@input$dge)
names(ds_dge_files) <- ds_dge_files %>%
  sub("_avg_reads_per_cell.txt", "", .) %>%
  sub(".*data/", "", .) %>%
  sub("/downsampled/dge_", "_", .)

# load downsampled dge data and only retain chr8 and chr11 target genes, convert to long format and
# make one data.frame
ds_dge <- ds_dge_files %>%
  lapply(FUN = read.table, header = TRUE, stringsAsFactors = FALSE) %>%
  lapply(FUN = filter, GENE %in% target_genes$gene) %>%
  lapply(FUN = gather, key = "cell",  value = "txs", -GENE) %>%
  bind_rows(.id = "sample") %>%
  rename(gene = GENE)

# split sample into sample and sequencing depth, and add gene panel and experiment information
ds_dge <- ds_dge %>%
  separate(sample, into = c("sample", "avg_rpc"), sep = "_") %>%
  mutate(avg_rpc = as.numeric(avg_rpc)) %>%
  left_join(target_genes[, c("gene", "panel")], by = "gene") %>%
  mutate(panel = sub("_.+", "", panel)) %>%
  mutate(exp = case_when(sample %in% c("11iv210ng", "11iv22ng") ~ "chr11",
                         sample %in% c("8iv210ng", "8iv22ng") ~ "chr8",
                         TRUE ~ "cropseq"))

# only retain intended target gene expression for tap-seq samples
ds_dge <- ds_dge %>%
  mutate(sample_panel = paste0("chr", sub("iv2.+" ,"", sample))) %>%
  filter(sample_panel == panel | gene %in% known_enh | sample == "Sample10X") %>%
  select(-sample_panel)
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
# only retain data on 3500 reads per cell downsampling
ds_dge_filt <- filter(ds_dge, avg_rpc == 3500)

# calculate average number transcripts per target gene and experiment
avg_txs <- ds_dge_filt %>%
  group_by(sample, exp, gene) %>%
  summarize(avg_txs = mean(txs)) %>%
  mutate(type = if_else(gene %in% known_enh, true = "e-gene", false = "non_e-gene")) %>%
  ungroup()

# extract average transcripts per gene for crop-seq
avg_txs_cropseq <- avg_txs %>%
  ungroup() %>%
  filter(exp == "cropseq") %>%
  select(gene, cropseq = avg_txs)

# add to tap-seq data and calculate log fold change
avg_txs_fc <- avg_txs %>%
  filter(exp != "cropseq") %>%
  left_join(avg_txs_cropseq, by = "gene") %>%
  mutate(fc = avg_txs / cropseq, lfc = log2(fc))

# axis limits
axis_limits <- range(select(avg_txs_fc, cropseq, avg_txs))

# create scatter plots
ggplot(avg_txs_fc, aes(x = cropseq, y = avg_txs, col = type)) + 
  facet_wrap(~sample, ncol = 2, nrow = 2) +
  geom_abline(slope = 1, color = "darkgray", size = 0.5) +
  geom_text_repel(data = filter(avg_txs_fc, type == "e-gene"), aes(label = gene),
                  box.padding = unit(1.5, "lines"), color = "gray50") +
  geom_point() +
  geom_smooth(method = lm,  se = FALSE, color = "steelblue") +
  scale_x_log10(limits = axis_limits) + 
  scale_y_log10(limits = axis_limits) +
  scale_color_manual(values = c("e-gene" = "firebrick2", "non_e-gene" = "gray42")) +
  labs(x = "Average transcripts per gene (CROP-seq)",
       y = "Average transcripts per gene (TAP-seq)") +
  coord_fixed() +
  theme_bw() +
  theme(legend.position = "none")

# figure caption
cap <- "Average transcripts per gene, TAP-seq vs. CROP-seq"

# save plot to file
ggsave(here("results/plots/avg_txs_per_gene.pdf"), device = "pdf", width = 7, height = 7)
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
# axis limits
lim <- max(abs(avg_txs_fc$lfc))

# plot lfc per gene between methods
ggplot(avg_txs_fc, aes(x = sample, y = lfc, color = type)) +
  geom_hline(yintercept = 0, color = "darkgray", lty = "dashed", size = 1) +
  geom_jitter(width = 0.2, size = 1) +
  geom_boxplot(outlier.shape = NA, fill = NA, color = "gray42", notch = TRUE) +
  labs(x = "TAP-seq sample", y = "Log fold change TAP-seq / CROP-seq") +
  scale_color_manual(values = c("e-gene" = "firebrick2", "non_e-gene" = "gray42")) +
  coord_flip() +
  theme_bw() +
  theme(legend.position = "none")

# figure caption
cap <- "Average LFC TAP-seq vs. CROP-seq"

# save plot to file
ggsave("results/plots/lfc_tapseq_vs_cropseq.pdf", device = "pdf", height = 4, width = 5)
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
# calculate number of cells in which a gene is detected in each experiment
cells_genes <- ds_dge %>%
  group_by(exp, avg_rpc, gene) %>%
  summarize(avg_txs = mean(txs),
            perc_cells = mean(txs > 0))

# compute expression quantiles for all target genes based on average expression in CROP-seq
expr_quantiles <- cells_genes %>%
  ungroup() %>%
  filter(exp == "cropseq", avg_rpc == 20000) %>%
  mutate(quantile = ntile(avg_txs, 4)) %>%
  select(gene, quantile)

# add expression quantile to each gene
cells_genes <- left_join(cells_genes, expr_quantiles, by = "gene")

# reformat for plot
cells_genes_plot <- cells_genes %>%
  ungroup() %>%
  select(-avg_txs) %>%
  spread(key = "exp", value = "perc_cells") %>%
  gather(key = "exp", value = "tapseq", -c(avg_rpc, gene, quantile, cropseq)) %>%
  gather(key = "approach", value = "perc_cells", -c(avg_rpc, gene, quantile, exp)) %>%
  drop_na() %>%
  mutate(avg_rpc = factor(avg_rpc, levels = sort(unique(avg_rpc))),
         approach = fct_relevel(approach, "tapseq"))

# create boxplot comparing detection sensitivity between tap and crop-seq across sequencing depths
ggplot(cells_genes_plot, aes(x = as.factor(quantile), y = perc_cells)) +
  facet_wrap(~exp, ncol = 1, nrow = 2) +
  geom_boxplot(aes(linetype = approach, fill = avg_rpc), outlier.shape = NA) +
  labs(y = "Cells in which genes are detected", x = "CROP-seq expression quantiles",
       fill = "Avg reads per cell", linetype = "Method") +
  scale_y_continuous(labels = scales::percent) +
  theme_bw()

# figure caption
cap <- "Detection sensitivity TAP-seq vs. CROP-seq (all genes should be expressed in K562)"

# save plot to file
ggsave(here("results/plots/detection_sensitivity.pdf"), device = "pdf", height = 5, width = 7)
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
library(GenomicAlignments)
library(GenomicFeatures)
library(here)
library(HiTC)
library(Matrix)
library(rtracklayer)
library(tidyverse)
library(BiocParallel)

# backend for parallel computing
register(MulticoreParam(workers = snakemake@threads))

# load processed differential expression output (includes confidence level and distance to TSS)
results <- here(snakemake@input$processed_results) %>%
  read.csv(stringsAsFactors = FALSE) %>%
  mutate(sample = paste0("chr", sub("iScreen1", "", sample)))

# extract cis-interactions on chromosome 8 and 11 and add column identifying significant hits
cis_enh_perts <- results %>%
  filter(enh_type == "cis", abs(dist_to_tss) >= 1000) %>%
  mutate(enh_chr = paste0("chr", enh_chr), gene_chr = paste0("chr", gene_chr),
         significant = if_else(pval_adj_allTests < 0.05, true = "sig", false = "non_sig"))

# remove validation controls on other chromosomes than the samples' target region and significant
# hits that increase gene expression
cis_enh_perts <- cis_enh_perts %>%
  filter(sample == enh_chr) %>%
  filter(!(significant == "sig" & manual_lfc > 0))

# Annotate chromatin ===============================================================================

# Every assessed enhancer is annotated for the number of chromatin assay ChIP-seq and DNAse-seq
# reads. The RPKM is calculated to normalize for both the sequencing depth and enhancer size.

# extract hg38 enhancer coordinates
enh_coords <- cis_enh_perts %>%
  select(enh_chr, enh_start, enh_end, perturbation) %>%
  distinct() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE)

# get coordinates of genomic regions and extend by 10 kb
region_coords <- range(enh_coords)
start(region_coords) <- start(region_coords) - 10000
end(region_coords) <- end(region_coords) + 10000

# bam files containing aligned reads
bam_files <- here(snakemake@input$encode_bam)
names(bam_files) <- sub("_encode_rep1_alignment.bam", "", basename(bam_files))

# load aligned reads within the regions from bam files
chrom_reads <- bplapply(bam_files, FUN = readGAlignments,
                        param = ScanBamParam(which = region_coords))

# count the number of reads within each enhancer
enh_chrom_reads <- chrom_reads %>%
  bplapply(FUN = countOverlaps, query = enh_coords) %>%
  bind_cols()

# combine with enhancer coordinates and convert to long format
enh_chrom <- enh_coords %>%
  data.frame(stringsAsFactors = FALSE) %>%
  select(-strand) %>%
  rename(enh_chr = seqnames, enh_start = start, enh_end = end) %>%
  bind_cols(enh_chrom_reads) %>%
  gather(key = assay, value = reads, -c(1:5))

# normalize for sequencing depth and enhancer size by calculating rpkm
enh_chrom_norm <- enh_chrom %>%
  group_by(assay) %>%
  mutate(rpkm = reads / (sum(reads) / 1e6) / (width / 1000))

# convert to wide format add to enhancer gene pair results
cis_enh_perts <- enh_chrom_norm %>%
  select(-c(enh_chr, enh_start, enh_end, width, reads)) %>%
  spread(key = "assay", value = "rpkm") %>%
  left_join(x = cis_enh_perts, y = ., by = "perturbation")

# Annotate Hi-C ====================================================================================

# Hi-C contact matrices from Rao et al., 2014 are imported and the contact frequency for each
# enhancer - gene pair is calculated.

# Liftover enhancer and gene TSS coordinates -------------------------------------------------------

# Hi-C contact matrices are in hg19 annotation, so enhancer and gene TSS coordinates need to be
# lifted over from hg38 to hg19.

# download chain file for hg38 to hg19 liftover
chain_url <- "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz"
chain_file <- tempfile("hg38ToHg19", fileext = ".gz")
download.file(chain_url, chain_file)
system(paste("gunzip", chain_file))

# import chain file
hg38_to_hg19_chain <- import.chain(tools::file_path_sans_ext(chain_file))
closeAllConnections()  # close any open file connection

# extract hg38 enhancer coordinates
enh_coords_hg38 <- cis_enh_perts %>%
  select(perturbation, enh_chr, enh_start, enh_end) %>%
  distinct() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE, seqnames.field = "enh_chr",
                           start.field = "enh_start", end.field = "enh_end")

# liftover enhancers from hg38 to hg19 and convert to data.frame
enh_coords_hg19 <- enh_coords_hg38 %>%
  liftOver(chain = hg38_to_hg19_chain) %>%
  unlist() %>%
  as.data.frame() %>%
  select(seqnames, start, end, perturbation) %>%
  rename(enh_chr = seqnames, enh_start = start, enh_end = end)

# replace tested enhancer - gene pairs hg38 enhancer coordinates with hg19 coordinates
cis_enh_perts <- cis_enh_perts %>%
  select(-c(enh_chr, enh_start, enh_end)) %>%
  left_join(enh_coords_hg19, by = "perturbation")

# extract gene TSS coordinates
gene_tss_hg38 <- cis_enh_perts %>%
  select(gene, gene_chr, gene_tss, gene_strand) %>%
  distinct() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE, seqnames.field = "gene_chr",
                           start.field = "gene_tss", end.field = "gene_tss",
                           strand.field = "gene_strand")

# liftover tss coordinates to hg19
gene_tss_hg19 <- gene_tss_hg38 %>%
  liftOver(chain = hg38_to_hg19_chain) %>%
  unlist() %>%
  as.data.frame() %>%
  select(seqnames, start, strand, gene) %>%
  rename(gene_chr = seqnames, gene_tss = start, gene_strand = strand)

# replace gene tss hg38 coordinates with hg19 coordinates
cis_enh_perts <- cis_enh_perts %>%
  select(-c(gene_chr, gene_tss, gene_strand)) %>%
  left_join(gene_tss_hg19, by = "gene")

# recalculate distance to tss for hg19 coordinates
cis_enh_perts <- cis_enh_perts %>%
  mutate(enh_center = round((enh_start + enh_end) / 2)) %>%
  mutate(dist_to_tss = if_else(gene_strand == "+", true  = enh_center - gene_tss,
                               false = gene_tss - enh_center)) %>%
  select(-enh_center)

# Prepare Hi-C data --------------------------------------------------------------------------------

# Contact matrices for K562 cells are imported (5kb resolution). The provided normalization vectors
# are used to normalize the observed contacts (read counts). Data from the two chromosonal regions
# on chromosome 8 and 11 is extracted for any further analyses.

# function to import HiC data from Rao et al for one chromosome and create a HTCexp object
import_hic <- function(sparse_cm_file, chromosome, resolution, bins) {

  # load sparse contact matrix file (only observed contacts)
  obs_contacts <- read.table(sparse_cm_file, col.names = c("i", "j", "M_ij"), sep = "\t")

  # get starting coordinates of assessed genomic bins at 5kb resolution
  max_bin <- (bins - 1) * resolution
  bin_starts <- seq(from = 0, to = max_bin, by = resolution)

  # create GRanges object containing all assessed bins for that chromosome
  bin_coords <- GRanges(seqnames = chromosome,
                        ranges = IRanges(start = bin_starts, end = bin_starts + resolution - 1,
                                         names = paste0("bin_", seq_len(length(bin_starts))))
  )

  # convert starting coordinates of bins in sparse matrix input to bin ids by dividing by the
  # resolution (and add 1 to get correct index)
  obs_contacts_bins <- data.frame(i = round(obs_contacts$i / resolution + 1),
                                  j = round(obs_contacts$j / resolution + 1), 
                                  M_ij = obs_contacts$M_ij)

  # create sparse contact matrix from observed contacts
  sparse_cm <- sparseMatrix(i = obs_contacts_bins$i, j = obs_contacts_bins$j,
                            x = obs_contacts_bins$M_ij, symmetric = TRUE, dims = c(bins, bins))

  # create HTCexp object containing data for the given chromosome
  HTCexp(intdata = sparse_cm, xgi = bin_coords, ygi = bin_coords)

}

# k562 intrachromosomal sparse matrix and krnorm vector files for chromosomes 8 and 11
scm_files <- here(snakemake@input$hic_raw)
names(scm_files) <- sub("_5kb.RAWobserved", "", basename(scm_files))

krnorm_files <- here(snakemake@input$hic_norm)
names(krnorm_files) <- sub("_5kb.KRnorm", "", basename(krnorm_files))

# import normalization vectors
chr8_KRnorm  <- as.numeric(readLines(krnorm_files["chr8"]))
chr11_KRnorm <- as.numeric(readLines(krnorm_files["chr11"]))

# infer number of bins per chromosome based on the normalization vectors
chr8_bins  <- length(chr8_KRnorm)
chr11_bins <- length(chr11_KRnorm)

# import hi-c data for these chromosomes
chr8_hic  <- import_hic(scm_files["chr8"],  chromosome = "chr8", resolution = 5000,
                        bins = chr8_bins)
chr11_hic <- import_hic(scm_files["chr11"], chromosome = "chr11", resolution = 5000,
                        bins = chr11_bins)

# function to normalize Hi-C data based on provided normalization vectors
normalize_hic <- function(htc_obj, norm_vector) {

  # extract raw observed interaction matrix
  raw_obs <- intdata(htc_obj)

  # create normalization matrix by computing the outer product of the normalization vector
  norm_mat <- outer(norm_vector, norm_vector)

  # multiply observed interactions by normalization matrix and add back to HTC object
  intdata(htc_obj) <- raw_obs / norm_mat
  return(htc_obj)

}

# normalize HiC data
chr8_hic_norm  <- normalize_hic(chr8_hic,  norm_vector = chr8_KRnorm)
chr11_hic_norm <- normalize_hic(chr11_hic, norm_vector = chr11_KRnorm)

# infer chromosomal region range
region_coords <- cis_enh_perts %>%
  select(sample, enh_start, enh_end, gene_tss) %>%
  gather(key = "key", value = "coord", -sample) %>%
  group_by(sample) %>%
  summarize(start = min(coord), end = max(coord))

# calculate bin coordinates that contain the entire regions
region_bins <- region_coords %>%
  mutate(start = floor(start / 5000) * 5000,
         end = ceiling(end / 5000) * 5000)

# extract data for assessed regions
chr8_region_hic <- extractRegion(chr8_hic_norm, MARGIN = c(1, 2), chr = "chr8",
                                 from = pull(filter(region_bins, sample == "chr8"), start),
                                 to = pull(filter(region_bins, sample == "chr8"), end))

chr11_region_hic <- extractRegion(chr11_hic_norm, MARGIN = c(1, 2), chr = "chr11",
                                  from = pull(filter(region_bins, sample == "chr11"), start),
                                  to = pull(filter(region_bins, sample == "chr11"), end))

# Enhancer - gene pairs ----------------------------------------------------------------------------

# The interaction frequency of all tested enhancer - gene pair computed, defined as the interaction
# frequency of the genomic bins overlapping the enhancer and the target genes TSS.

# function to compute the interaction frequency for enhancer - gene pairs, by finding the hi-c 
# genomic bins with overlap with the enhancer and the target gene tss. the interaction frequency of
# the pair is then defined as the interaction frequency of these bins
compute_int_freq <- function(pairs, htc_object) {

  # add pair identifier
  pairs$pair <- seq_len(nrow(pairs))

  # get coordinates of enhancer centers as GRanges object
  enh_coords <- pairs %>%
    mutate(enh_center = round((enh_start + enh_end) / 2)) %>%
    select(enh_chr, enh_center) %>%
    makeGRangesFromDataFrame(., seqnames.field = "enh_chr", start.field = "enh_center",
                             end.field = "enh_center")

  # get gene tss coordinates as GRanges object
  tss_coords <- pairs %>%
    select(gene_chr, gene_tss) %>%
    makeGRangesFromDataFrame(., seqnames.field = "gene_chr", start.field = "gene_tss",
                             end.field = "gene_tss")

  # get bins overlapping for all enhancers and gene tss
  hic_bins <- x_intervals(htc_object)
  enh_bins <- findOverlaps(query = enh_coords, subject = hic_bins)
  tss_bins <- findOverlaps(query = tss_coords, subject = hic_bins)

  # combine into one data.frame
  enh_bins <- data.frame(pair = from(enh_bins), enh_bin = to(enh_bins))
  tss_bins <- data.frame(pair = from(tss_bins), tss_bin = to(tss_bins))
  bins <- full_join(enh_bins, tss_bins, by = "pair")

  # extract distance matrix between bins from htc object
  dists <- intervalsDist(htc_object)

  # calculate distances between bins of enhancer gene pairs
  dist_pairs <- dists[as.matrix(bins[, 2:3])]
  dist_pairs <- data.frame(pair = bins$pair, dist_bins = dist_pairs)

  # extract hi-c interaction matrix from htc object
  intdata <- intdata(htc_object)

  # get interaction frequencies for all bins and add pair id
  int_freq_pairs <- intdata[as.matrix(bins[, 2:3])]
  int_freq_pairs <- data.frame(pair = bins$pair, int_freq = int_freq_pairs)

  # add interaction frequencies and bin distances to pairs to create output
  pairs %>%
    left_join(dist_pairs, by = "pair") %>%
    left_join(int_freq_pairs, by = "pair") %>%
    select(-c(pair))

}

# compute interaction frequencies for all tested enhancer - gene pairs
chr8_pairs <- cis_enh_perts %>%
  filter(enh_chr == "chr8", enh_chr == "chr8") %>%
  compute_int_freq(., htc_object = chr8_region_hic)

chr11_pairs <- cis_enh_perts %>%
  filter(enh_chr == "chr11", enh_chr == "chr11") %>%
  compute_int_freq(., htc_object = chr11_region_hic)

# combine into one data.frame
pairs <- rbind(chr8_pairs, chr11_pairs)

# Activity by contact ==============================================================================

# The activity by contact score is computed as proposed by Fulco et al., 2019

# function to calculate geometric mean
gm_mean <- function(x, na.rm = TRUE){
  exp(sum(log(x[x > 0]), na.rm = na.rm) / length(x))
}

# compute enhancer activity score
pairs <- pairs %>%
  rowwise() %>%
  mutate(activity_score = gm_mean(c(`Dnase-seq`, H3K27ac)))

# compute abc score
pairs <- pairs %>%
  ungroup() %>% 
  group_by(sample, gene) %>%
  mutate(abc_score = activity_score * (int_freq + 1),
         abc_score_norm = abc_score / sum(abc_score))

# convert distance to TSS to absolute value and change label of significant hits
pairs <- pairs %>%
  mutate(dist_to_tss = abs(dist_to_tss),
         significant = if_else(significant == "sig", true = 1, false = 0))

# write to output file
pairs <- select(pairs, -enh_in_intron)  # remove column not used in enhancer prediction analysis
write.csv(pairs, file = here(snakemake@output[[1]]), quote = FALSE, row.names = FALSE)
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
library(GenomicAlignments)
library(GenomicFeatures)
library(here)
library(HiTC)
library(Matrix)
library(rtracklayer)
library(tidyverse)
library(BiocParallel)

# backend for parallel computing
register(MulticoreParam(workers = snakemake@threads))

# Expressed genes ==================================================================================

# Whole-Tx CROP-seq data is used to compute expression in K562 cells.

# import gencode hg38 annotations
annot_url <- "ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/Homo_sapiens.GRCh38.89.chr.gtf.gz"
annot <- import(annot_url, format = "gtf")

# extract exon annotations for MYC and GATA1
genes <- annot[annot$type == "exon" & annot$gene_biotype == "protein_coding" &
                 annot$transcript_biotype == "protein_coding" &
                 annot$gene_name %in% c("MYC", "GATA1")]

# genome-wide K562 expression data
dge <- read.table(here(snakemake@input$wholeTx_dge), header = TRUE, row.names = "GENE",
                  stringsAsFactors = FALSE)

# get dge data on target genes
gene_expr <- dge[rownames(dge) %in% genes$gene_name, ]

# compute average gene expression
avg_expr <- data.frame(gene = rownames(gene_expr), avg_txs = rowSums(gene_expr),
                       stringsAsFactors = FALSE, row.names = NULL)

# split gene annotations by gene name
expr_genes <- split(genes, f = genes$gene_name)

# Select enhancers - gene pairs ====================================================================

# Tested enhancers from Fulco et al. 2018 are extracted and ETPs are formed with MYC and GATA1.

# import enhancer sites
enh_coords <- import(here(snakemake@input$enhancers))

# create enhancer id
enh_coords$perturbation <- paste0(seqnames(enh_coords), ":", start(enh_coords), "-",
                                  end(enh_coords))

# download chain file for hg19 to hg38 liftover
chain_url <- "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz"
chain_file <- tempfile("hg19ToHg38", fileext = ".gz")
download.file(chain_url, chain_file)
system(paste("gunzip", chain_file))

# import chain file
hg19_to_hg38_chain <- import.chain(tools::file_path_sans_ext(chain_file))

# liftover enhancers from hg38 to hg19
enh_coords <- enh_coords %>%
  liftOver(chain = hg19_to_hg38_chain)

# filter out enhancers that did not lift over correctly (split into 2 loci)
enh_loci <- sapply(enh_coords, length)
enh_loci_good <- which(enh_loci == 1)
enh_coords <- enh_coords[enh_loci_good]

# convert to GRanges object
enh_coords <- unlist(enh_coords)

# function to get a genes TSS
get_tss <- function(x) {
  if (all(strand(x) == "+")) {
    tss <- min(start(x))
  }else if (all(strand(x) == "-")) {
    tss <- max(end(x))
  }else{
    stop("Inconsistent strand information!")
  }
  GRanges(seqnames = unique(seqnames(x)), ranges = IRanges(start = tss, end = tss),
          strand = unique(strand(x)), gene_name = unique(x$gene_name))
}

# get tss for all genes
gene_tss <- endoapply(expr_genes, FUN = get_tss)
gene_tss <- unlist(gene_tss)

# create data.frame containing enhancer data
enh_df <- enh_coords %>% 
  data.frame(stringsAsFactors = FALSE) %>% 
  select(perturbation, enh_chr = seqnames, enh_start = start, enh_end = end) %>% 
  mutate(enh_chr = as.character(enh_chr))

# remove enhancers, where liftover moved enhancer to other chromosome
enh_df <- enh_df %>% 
  mutate(chr_id = sub(":.+", "", perturbation)) %>% 
  filter(chr_id == enh_chr) %>% 
  select(-chr_id)

# create data.frame containing gene tss information
tss_df <- gene_tss %>% 
  data.frame(stringsAsFactors = FALSE) %>% 
  select(gene = gene_name, gene_chr = seqnames, gene_start = start, gene_end = end,
         gene_strand = strand) %>% 
  mutate(gene_strand = as.character(gene_strand),
         gene_chr = paste0("chr", as.character(gene_chr)))

# combine enhancers and tss data to form etps
etps <- left_join(enh_df, tss_df, by = c("enh_chr" = "gene_chr")) %>% 
  mutate(gene_chr = enh_chr) %>% 
  select(perturbation, gene, enh_chr, enh_start, enh_end, gene_chr, gene_tss = gene_start,
         gene_strand)

# calculate distance to tss for every enhancer - gene pair
etps <- etps %>%
  mutate(enh_center = round((enh_start + enh_end) / 2)) %>% 
  mutate(dist_to_tss = if_else(gene_strand == "+", true  = enh_center - gene_tss,
                               false = gene_tss - enh_center)) %>% 
  select(-enh_center)

# Chromatin activity ===============================================================================

# Each ETP is annotated for chromatin activity at the involved enhancer. The RPKM is calculated to
# normalize for both the sequencing depth and enhancer size.

# chromatin assays and input files
chrom_infiles <- here(snakemake@input$encode_bam)
names(chrom_infiles) <- sub("_.+", "", basename(chrom_infiles))

# load all bam files
chrom_reads <- bplapply(chrom_infiles, FUN = readGAlignments)

# extract hg38 coordinates of all enhancers
enh_coords <- etps %>%
  select(enh_chr, enh_start, enh_end, perturbation) %>%
  distinct() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE)

# count the number of reads within each enhancer
enh_chrom_reads <- chrom_reads %>%
  bplapply(FUN = countOverlaps, query = enh_coords) %>%
  bind_cols()

# combine with enhancer coordinates and convert to long format
enh_chrom <- enh_coords %>%
  data.frame(stringsAsFactors = FALSE) %>%
  select(-strand) %>%
  rename(enh_chr = seqnames, enh_start = start, enh_end = end) %>%
  bind_cols(enh_chrom_reads) %>%
  gather(key = assay, value = reads, -c(1:5))

# normalize for sequencing depth and enhancer size by calculating rpkm
enh_chrom_norm <- enh_chrom %>%
  group_by(assay) %>%
  mutate(rpkm = reads / (sum(reads) / 1e6) / (width / 1000))

# convert to wide format add to enhancer gene pair results
etps <- enh_chrom_norm %>%
  select(-c(enh_chr, enh_start, enh_end, width, reads)) %>%
  spread(key = "assay", value = "rpkm") %>%
  left_join(x = etps, y = ., by = "perturbation")

# remove data to free up memory
rm(enh_chrom_reads, chrom_reads, enh_chrom, enh_chrom_norm)
rm(annot, dge, expr_genes, gene_expr, genes)
gc()

# Annotate Hi-C ====================================================================================

# Hi-C contact matrices from Rao et al., 2014 are imported and the contact frequency for each
# enhancer - gene pair is calculated.

# Liftover enhancer and gene TSS coordinates -------------------------------------------------------

# Hi-C contact matrices are in hg19 annotation, so enhancer and gene TSS coordinates need to be
# lifted over from hg38 to hg19.

# download chain file for hg38 to hg19 liftover
chain_url <- "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz"
chain_file <- tempfile("hg38ToHg19", fileext = ".gz")
download.file(chain_url, chain_file)
system(paste("gunzip", chain_file))

# import chain file
hg38_to_hg19_chain <- import.chain(tools::file_path_sans_ext(chain_file))
closeAllConnections()  # close any open file connection

# extract hg38 coordinates of all enhancers
enh_coords_hg38 <- etps %>%
  select(enh_chr, enh_start, enh_end, perturbation) %>%
  distinct() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE)

# liftover enhancers from hg38 to hg19 and convert to data.frame
enh_coords_hg19 <- enh_coords_hg38 %>%
  liftOver(chain = hg38_to_hg19_chain)

# filter out enhancers that did not lift over correctly (split into 2 loci)
enh_loci <- sapply(enh_coords_hg19, length)
enh_loci_good <- which(enh_loci == 1)
enh_coords_hg19 <- enh_coords_hg19[enh_loci_good]

# reformat
enh_coords_hg19 <- enh_coords_hg19 %>%
  unlist() %>%
  as.data.frame() %>%
  select(seqnames, start, end, perturbation) %>%
  rename(enh_chr = seqnames, enh_start = start, enh_end = end) %>%
  mutate(enh_chr = as.character(enh_chr))

# replace tested enhancer - gene pairs hg38 enhancer coordinates with hg19 coordinates
etps <- etps %>%
  select(-c(enh_chr, enh_start, enh_end)) %>%
  left_join(enh_coords_hg19, by = "perturbation")

# extract gene TSS coordinates
gene_tss_coords <- etps %>%
  select(gene, gene_chr, gene_tss, gene_strand) %>%
  distinct() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE, seqnames.field = "gene_chr",
                           start.field = "gene_tss", end.field = "gene_tss",
                           strand.field = "gene_strand")

# liftover tss coordinates to hg19
gene_tss_coords_hg19 <- gene_tss_coords %>%
  liftOver(chain = hg38_to_hg19_chain) %>%
  unlist() %>%
  as.data.frame() %>%
  select(seqnames, start, strand, gene) %>%
  rename(gene_chr = seqnames, gene_tss = start, gene_strand = strand) %>%
  mutate(gene_chr = as.character(gene_chr), gene_strand = as.character(gene_strand))

# replace gene tss hg38 coordinates with hg19 coordinates
etps <- etps %>%
  select(-c(gene_chr, gene_tss, gene_strand)) %>%
  left_join(gene_tss_coords_hg19, by = "gene")

# recalculate distance to tss for hg19 coordinates
etps <- etps %>%
  mutate(enh_center = round((enh_start + enh_end) / 2)) %>% 
  mutate(dist_to_tss = if_else(gene_strand == "+", true  = enh_center - gene_tss,
                               false = gene_tss - enh_center)) %>% 
  select(-enh_center)

# remove enhancers that were not lifted over or switched chromosomes...
etps <- filter(etps, enh_chr == gene_chr)

# Prepare Hi-C data --------------------------------------------------------------------------------

# Contact matrices for K562 cells from Rao et al. 2014 are imported (5kb resolution). The provided
# normalization vectors are used to normalize the observed contacts (read counts). Data from the two
# chromosonal regions on chromosome 8 and 11 is extracted for any further analyses.

# function to import HiC data from Rao et al for one chromosome and create a HTCexp object
import_hic <- function(sparse_cm_file, chromosome, resolution, bins) {

  # load sparse contact matrix file (only observed contacts)
  obs_contacts <- read.table(sparse_cm_file, col.names = c("i", "j", "M_ij"), sep = "\t")

  # get starting coordinates of assessed genomic bins at 5kb resolution
  max_bin <- (bins - 1) * resolution
  bin_starts <- seq(from = 0, to = max_bin, by = resolution)

  # create GRanges object containing all assessed bins for that chromosome
  bin_coords <- GRanges(seqnames = chromosome,
                        ranges = IRanges(start = bin_starts, end = bin_starts + resolution - 1,
                                         names = paste0("bin_", seq_len(length(bin_starts))))
  )

  # convert starting coordinates of bins in sparse matrix input to bin ids by dividing by the
  # resolution (and add 1 to get correct index)
  obs_contacts_bins <- data.frame(i = round(obs_contacts$i / resolution + 1),
                                  j = round(obs_contacts$j / resolution + 1), 
                                  M_ij = obs_contacts$M_ij)

  # create sparse contact matrix from observed contacts
  sparse_cm <- sparseMatrix(i = obs_contacts_bins$i, j = obs_contacts_bins$j,
                            x = obs_contacts_bins$M_ij, symmetric = TRUE, dims = c(bins, bins))

  # create HTCexp object containing data for the given chromosome
  HTCexp(intdata = sparse_cm, xgi = bin_coords, ygi = bin_coords)

}

# k562 intrachromosomal sparse matrix files
scm_files <- here(snakemake@input$hic_raw)
names(scm_files) <- sub("_5kb.RAWobserved", "", basename(scm_files))

# normalization vector files
krnorm_files <- here(snakemake@input$hic_norm)
names(krnorm_files) <- sub("_5kb.KRnorm", "", basename(krnorm_files))

# import normalization vectors
krnorm <- lapply(krnorm_files, FUN = function(x) as.numeric(readLines(x)) )

# infer number of bins per chromosome based on the normalization vectors
hic_bins <- lapply(krnorm, FUN = length)

# import hi-c data for these chromosomes
hic_data <- bpmapply(FUN = import_hic,
                     sparse_cm_file = scm_files,
                     chromosome = names(scm_files),
                     bins = hic_bins,
                     MoreArgs = list(resolution = 5000))

# Enhancer - gene pairs ----------------------------------------------------------------------------

# The interaction frequency of all tested enhancer - gene pair computed, defined as the interaction
# frequency of the genomic bins overlapping the enhancer and the target genes TSS.

# function to compute the interaction frequency for enhancer - gene pairs, by finding the hi-c 
# genomic bins with overlap with the enhancer and the target gene tss. the interaction frequency of
# the pair is then defined as the interaction frequency of these bins
compute_int_freq <- function(pairs, htc_object, norm_vector) {

  # add pair identifier
  pairs$pair <- seq_len(nrow(pairs))

  # get coordinates of enhancer centers as GRanges object
  enh_coords <- pairs %>%
    mutate(enh_center = round((enh_start + enh_end) / 2)) %>%
    select(enh_chr, enh_center) %>%
    makeGRangesFromDataFrame(., seqnames.field = "enh_chr", start.field = "enh_center",
                             end.field = "enh_center")

  # get gene tss coordinates as GRanges object
  tss_coords <- pairs %>%
    select(gene_chr, gene_tss) %>%
    makeGRangesFromDataFrame(., seqnames.field = "gene_chr", start.field = "gene_tss",
                             end.field = "gene_tss")

  # get bins overlapping for all enhancers and gene tss
  hic_bins <- x_intervals(htc_object)
  enh_bins <- findOverlaps(query = enh_coords, subject = hic_bins)
  tss_bins <- findOverlaps(query = tss_coords, subject = hic_bins)

  # combine into one data.frame
  enh_bins <- data.frame(pair = from(enh_bins), enh_bin = to(enh_bins))
  tss_bins <- data.frame(pair = from(tss_bins), tss_bin = to(tss_bins))
  bins <- full_join(enh_bins, tss_bins, by = "pair")

  # extract distance matrix between bins from htc object
  dists <- intervalsDist(htc_object)

  # calculate distances between bins of enhancer gene pairs
  dist_pairs <- dists[as.matrix(bins[, 2:3])]
  dist_pairs <- data.frame(pair = bins$pair, dist_bins = dist_pairs)

  # extract hi-c interaction matrix from htc object
  intdata <- intdata(htc_object)

  # get interaction frequencies and normalization factors for all bins
  int_freq_pairs <- intdata[as.matrix(bins[, 2:3])]

  # compute normalization factors
  norm_vector_values <- matrix(norm_vector[as.matrix(bins[, 2:3])], ncol = 2)
  norm_factors <- norm_vector_values[, 1] * norm_vector_values[, 2]

  # normalize data and add and add pair id
  int_freq_pairs <- int_freq_pairs / norm_factors
  int_freq_pairs <- data.frame(pair = bins$pair, int_freq = int_freq_pairs)

  # add interaction frequencies and bin distances to pairs to create output
  pairs %>%
    left_join(dist_pairs, by = "pair") %>%
    left_join(int_freq_pairs, by = "pair") %>%
    select(-c(pair))

}

# compute interaction frequencies for all tested enhancer - gene pairs
etps_chrs <- split(etps, f = etps$enh_chr)
hic_data <- hic_data[names(etps_chrs)]
krnorm <- krnorm[names(etps_chrs)]
int_freqs <- bpmapply(FUN = compute_int_freq, pairs = etps_chrs, htc_object = hic_data,
                      norm_vector = krnorm, SIMPLIFY = FALSE)

# combine into one data.frame
chrom_annot_etps <- bind_rows(int_freqs)

# Activity by contact ==============================================================================

# The activity by contact score is computed as proposed by Fulco et al. The geometric mean between
# H3K27ac and DNAse-seq is then calculatated for every site to compute an enhancer activity score
# proposed by Fulco et al. The ABC score is then computed by multiplying the activity score with the
# Hi-C interaction frequency.

# function to calculate geometric mean
gm_mean <- function(x, na.rm = TRUE){
  exp(sum(log(x[x > 0]), na.rm = na.rm) / length(x))
}

# compute enhancer activity score
chrom_annot_etps <- chrom_annot_etps %>%
  rowwise() %>%
  mutate(activity_score = gm_mean(c(`Dnase-seq`, H3K27ac))) %>%
  ungroup()

# compute abc score
chrom_annot_etps <- chrom_annot_etps %>%
  group_by(gene) %>%
  mutate(abc_score = activity_score * (int_freq + 1),
         abc_score_norm = abc_score / sum(abc_score))

# Finalize output ==================================================================================

# The output is finalized by adding the average gene expression, transforming the distance to TSS to
# absolute values and pruning pairs based on the distance to TSS.

# add average gene expression
chrom_annot_etps <- left_join(chrom_annot_etps, avg_expr, by = "gene")

# transform distance to TSS
chrom_annot_etps <- mutate(chrom_annot_etps, dist_to_tss = abs(dist_to_tss))

# rearrange columns
columns <- c("perturbation", "gene", "avg_txs", "enh_chr", "enh_start", "enh_end", "gene_chr",
             "gene_tss", "gene_strand", "dist_to_tss", "Dnase-seq", "H3K27ac", "H3K27me3",
             "H3K4me1", "H3K4me3", "POLR2A", "dist_bins", "int_freq", "activity_score",
             "abc_score", "abc_score_norm")
chrom_annot_etps <- chrom_annot_etps[, columns]

# write to output file
write.csv(chrom_annot_etps, file = here(snakemake@output[[1]]), quote = FALSE, row.names = FALSE)
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
library(GenomicAlignments)
library(GenomicFeatures)
library(here)
library(HiTC)
library(Matrix)
library(rtracklayer)
library(tidyverse)
library(BiocParallel)

# backend for parallel computing
register(MulticoreParam(workers = snakemake@threads))

# Expressed genes ==================================================================================

# Whole-Tx CROP-seq data is used to find expressed genes in K562 cells.

# import gencode hg38 annotations
annot_url <- "ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/Homo_sapiens.GRCh38.89.chr.gtf.gz"
annot <- import(annot_url, format = "gtf")

# extract exon annotations for protein-coding genes on autosomes and chromosome X
genes <- annot[annot$type == "exon" & annot$gene_biotype == "protein_coding" &
                 annot$transcript_biotype == "protein_coding" &
                 !seqnames(annot) %in% c("Y", "MT")]

# genome-wide K562 expression data
dge <- read.table(here(snakemake@input$wholeTx_dge), header = TRUE, row.names = "GENE",
                  stringsAsFactors = FALSE)

# only retain dge data on protein-coding genes
gene_expr <- dge[rownames(dge) %in% genes$gene_name, ]

# compute average gene expression
avg_expr <- data.frame(gene = rownames(gene_expr), avg_txs = rowSums(gene_expr),
                       stringsAsFactors = FALSE, row.names = NULL)

# extract genes with 100 or more transcripts on average
expr_gene_names <- avg_expr %>% 
  filter(avg_txs >= 100) %>% 
  pull(gene)

# get annotations for these genes and split by gene name
expr_genes <- genes[genes$gene_name %in% expr_gene_names]
expr_genes <- split(expr_genes, f = expr_genes$gene_name)

# Select enhancers - gene pairs ====================================================================

# Tested enhancers from the pilot experiments in Gasperini et al. 2019 are used to create ETPs with
# all expressed genes within 300kb.

# import enhancer sites
enh_coords <- import(snakemake@input$enhancers)

# remove duplicates
enh_coords <- unique(enh_coords)

# create enhancer id
enh_coords$perturbation <- paste0(seqnames(enh_coords), ":", start(enh_coords), "-",
                                  end(enh_coords))

# download chain file for hg19 to hg38 liftover
chain_url <- "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz"
chain_file <- tempfile("hg19ToHg38", fileext = ".gz")
download.file(chain_url, chain_file)
system(paste("gunzip", chain_file))

# import chain file
hg19_to_hg38_chain <- import.chain(tools::file_path_sans_ext(chain_file))

# liftover enhancers from hg38 to hg19
enh_coords <- enh_coords %>%
  liftOver(chain = hg19_to_hg38_chain)

# filter out enhancers that did not lift over correctly (split into 2 loci)
enh_loci <- sapply(enh_coords, length)
enh_loci_good <- which(enh_loci == 1)
enh_coords <- enh_coords[enh_loci_good]

# convert to GRanges object
enh_coords <- unlist(enh_coords)

# function to get a genes TSS
get_tss <- function(x) {
  if (all(strand(x) == "+")) {
    tss <- min(start(x))
  }else if (all(strand(x) == "-")) {
    tss <- max(end(x))
  }else{
    stop("Inconsistent strand information!")
  }
  GRanges(seqnames = unique(seqnames(x)), ranges = IRanges(start = tss, end = tss),
          strand = unique(strand(x)), gene_name = unique(x$gene_name))
}

# get tss for all genes
gene_tss <- endoapply(expr_genes, FUN = get_tss)
gene_tss <- unlist(gene_tss)

# adjust chromosome names of enhancer coordinates
seqlevels(enh_coords) <- sub("chr", "", seqlevels(enh_coords))

# find enhancers that overlap a 600kb window centered on each genes tss
tss_windows <- promoters(gene_tss, upstream = 300000, downstream = 300000)
overlaps <- findOverlaps(query = enh_coords, subject = tss_windows)

# get overlapping gene and gene tss
ol_enh <- enh_coords[from(overlaps)]
ol_tss <- gene_tss[to(overlaps)]

# create data.frame containing etps
enh_df <- data.frame(ol_enh)
tss_df <- data.frame(ol_tss)
etps <- data.frame(perturbation = enh_df$perturbation, gene = tss_df$gene_name,
                   enh_chr = as.character(enh_df$seqnames), enh_start = enh_df$start,
                   enh_end = enh_df$end, gene_chr = as.character(tss_df$seqnames),
                   gene_tss = tss_df$start, gene_strand = as.character(tss_df$strand),
                   stringsAsFactors = FALSE)

# calculate distance to tss for every enhancer - gene pair
etps <- etps %>%
  mutate(enh_center = round((enh_start + enh_end) / 2)) %>% 
  mutate(dist_to_tss = if_else(gene_strand == "+", true  = enh_center - gene_tss,
                               false = gene_tss - enh_center)) %>% 
  select(-enh_center)

# reformat chromosome ids
etps <- mutate(etps, enh_chr = paste0("chr", enh_chr), gene_chr = paste0("chr", gene_chr))

# Chromatin activity ===============================================================================

# Each ETP is annotated for chromatin activity at the involved enhancer. The RPKM is calculated to
# normalize for both the sequencing depth and enhancer size.

# chromatin assays and input files
chrom_infiles <- here(snakemake@input$encode_bam)
names(chrom_infiles) <- sub("_.+", "", basename(chrom_infiles))

# load all bam files
chrom_reads <- bplapply(chrom_infiles, FUN = readGAlignments)

# extract hg38 coordinates of all enhancers
enh_coords <- etps %>%
  select(enh_chr, enh_start, enh_end, perturbation) %>%
  distinct() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE)

# count the number of reads within each enhancer
enh_chrom_reads <- chrom_reads %>%
  bplapply(FUN = countOverlaps, query = enh_coords) %>%
  bind_cols()

# combine with enhancer coordinates and convert to long format
enh_chrom <- enh_coords %>%
  data.frame(stringsAsFactors = FALSE) %>%
  select(-strand) %>%
  rename(enh_chr = seqnames, enh_start = start, enh_end = end) %>%
  bind_cols(enh_chrom_reads) %>%
  gather(key = assay, value = reads, -c(1:5))

# normalize for sequencing depth and enhancer size by calculating rpkm
enh_chrom_norm <- enh_chrom %>%
  group_by(assay) %>%
  mutate(rpkm = reads / (sum(reads) / 1e6) / (width / 1000))

# convert to wide format add to enhancer gene pair results
etps <- enh_chrom_norm %>%
  select(-c(enh_chr, enh_start, enh_end, width, reads)) %>%
  spread(key = "assay", value = "rpkm") %>%
  left_join(x = etps, y = ., by = "perturbation")

# remove data to free up memory
rm(enh_chrom_reads, chrom_reads, enh_chrom, enh_chrom_norm)
rm(annot, dge, expr_genes, gene_expr, genes)
gc()

# Annotate Hi-C ====================================================================================

# Hi-C contact matrices from Rao et al., 2014 are imported and the contact frequency for each
# enhancer - gene pair is calculated.

# Liftover enhancer and gene TSS coordinates -------------------------------------------------------

# Hi-C contact matrices are in hg19 annotation, so enhancer and gene TSS coordinates need to be
# lifted over from hg38 to hg19.

# download chain file for hg38 to hg19 liftover
chain_url <- "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz"
chain_file <- tempfile("hg38ToHg19", fileext = ".gz")
download.file(chain_url, chain_file)
system(paste("gunzip", chain_file))

# import chain file
hg38_to_hg19_chain <- import.chain(tools::file_path_sans_ext(chain_file))
closeAllConnections()  # close any open file connection

# extract hg38 coordinates of all enhancers
enh_coords_hg38 <- etps %>%
  select(enh_chr, enh_start, enh_end, perturbation) %>%
  distinct() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE)

# liftover enhancers from hg38 to hg19 and convert to data.frame
enh_coords_hg19 <- enh_coords_hg38 %>%
  liftOver(chain = hg38_to_hg19_chain)

# filter out enhancers that did not lift over correctly (split into 2 loci)
enh_loci <- sapply(enh_coords_hg19, length)
enh_loci_good <- which(enh_loci == 1)
enh_coords_hg19 <- enh_coords_hg19[enh_loci_good]

# reformat
enh_coords_hg19 <- enh_coords_hg19 %>%
  unlist() %>%
  as.data.frame() %>%
  select(seqnames, start, end, perturbation) %>%
  rename(enh_chr = seqnames, enh_start = start, enh_end = end) %>%
  mutate(enh_chr = as.character(enh_chr))

# replace tested enhancer - gene pairs hg38 enhancer coordinates with hg19 coordinates
etps <- etps %>%
  select(-c(enh_chr, enh_start, enh_end)) %>%
  left_join(enh_coords_hg19, by = "perturbation")

# extract gene TSS coordinates
gene_tss_coords <- etps %>%
  select(gene, gene_chr, gene_tss, gene_strand) %>%
  distinct() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE, seqnames.field = "gene_chr",
                           start.field = "gene_tss", end.field = "gene_tss",
                           strand.field = "gene_strand")

# liftover tss coordinates to hg19
gene_tss_coords_hg19 <- gene_tss_coords %>%
  liftOver(chain = hg38_to_hg19_chain) %>%
  unlist() %>%
  as.data.frame() %>%
  select(seqnames, start, strand, gene) %>%
  rename(gene_chr = seqnames, gene_tss = start, gene_strand = strand) %>%
  mutate(gene_chr = as.character(gene_chr), gene_strand = as.character(gene_strand))

# replace gene tss hg38 coordinates with hg19 coordinates
etps <- etps %>%
  select(-c(gene_chr, gene_tss, gene_strand)) %>%
  left_join(gene_tss_coords_hg19, by = "gene")

# recalculate distance to tss for hg19 coordinates
etps <- etps %>%
  mutate(enh_center = round((enh_start + enh_end) / 2)) %>% 
  mutate(dist_to_tss = if_else(gene_strand == "+", true  = enh_center - gene_tss,
                               false = gene_tss - enh_center)) %>% 
  select(-enh_center)

# remove enhancers that were not lifted over or switched chromosomes...
etps <- filter(etps, enh_chr == gene_chr)

# Prepare Hi-C data --------------------------------------------------------------------------------

# Contact matrices for K562 cells from Rao et al. 2014 are imported (5kb resolution). The provided
# normalization vectors are used to normalize the observed contacts (read counts). Data from the two
# chromosonal regions on chromosome 8 and 11 is extracted for any further analyses.

# function to import HiC data from Rao et al for one chromosome and create a HTCexp object
import_hic <- function(sparse_cm_file, chromosome, resolution, bins) {

  # load sparse contact matrix file (only observed contacts)
  obs_contacts <- read.table(sparse_cm_file, col.names = c("i", "j", "M_ij"), sep = "\t")

  # get starting coordinates of assessed genomic bins at 5kb resolution
  max_bin <- (bins - 1) * resolution
  bin_starts <- seq(from = 0, to = max_bin, by = resolution)

  # create GRanges object containing all assessed bins for that chromosome
  bin_coords <- GRanges(seqnames = chromosome,
                        ranges = IRanges(start = bin_starts, end = bin_starts + resolution - 1,
                                         names = paste0("bin_", seq_len(length(bin_starts))))
  )

  # convert starting coordinates of bins in sparse matrix input to bin ids by dividing by the
  # resolution (and add 1 to get correct index)
  obs_contacts_bins <- data.frame(i = round(obs_contacts$i / resolution + 1),
                                  j = round(obs_contacts$j / resolution + 1), 
                                  M_ij = obs_contacts$M_ij)

  # create sparse contact matrix from observed contacts
  sparse_cm <- sparseMatrix(i = obs_contacts_bins$i, j = obs_contacts_bins$j,
                            x = obs_contacts_bins$M_ij, symmetric = TRUE, dims = c(bins, bins))

  # create HTCexp object containing data for the given chromosome
  HTCexp(intdata = sparse_cm, xgi = bin_coords, ygi = bin_coords)

}

# k562 intrachromosomal sparse matrix files
scm_files <- here(snakemake@input$hic_raw)
names(scm_files) <- sub("_5kb.RAWobserved", "", basename(scm_files))
scm_files <- scm_files[unique(etps$enh_chr)]

# normalization vector files
krnorm_files <- here(snakemake@input$hic_norm)
names(krnorm_files) <- sub("_5kb.KRnorm", "", basename(krnorm_files))
krnorm_files <- krnorm_files[unique(etps$enh_chr)]

# import normalization vectors
krnorm <- lapply(krnorm_files, FUN = function(x) as.numeric(readLines(x)) )

# infer number of bins per chromosome based on the normalization vectors
hic_bins <- lapply(krnorm, FUN = length)

# import hi-c data for these chromosomes
hic_data <- bpmapply(FUN = import_hic,
                     sparse_cm_file = scm_files,
                     chromosome = names(scm_files),
                     bins = hic_bins,
                     MoreArgs = list(resolution = 5000))

## Enhancer - gene pairs ---------------------------------------------------------------------------

# The interaction frequency of all tested enhancer - gene pair computed, defined as the interaction
# frequency of the genomic bins overlapping the enhancer and the target genes TSS.

# function to compute the interaction frequency for enhancer - gene pairs, by finding the hi-c 
# genomic bins with overlap with the enhancer and the target gene tss. the interaction frequency of
# the pair is then defined as the interaction frequency of these bins
compute_int_freq <- function(pairs, htc_object, norm_vector) {

  # add pair identifier
  pairs$pair <- seq_len(nrow(pairs))

  # get coordinates of enhancer centers as GRanges object
  enh_coords <- pairs %>%
    mutate(enh_center = round((enh_start + enh_end) / 2)) %>%
    select(enh_chr, enh_center) %>%
    makeGRangesFromDataFrame(., seqnames.field = "enh_chr", start.field = "enh_center",
                             end.field = "enh_center")

  # get gene tss coordinates as GRanges object
  tss_coords <- pairs %>%
    select(gene_chr, gene_tss) %>%
    makeGRangesFromDataFrame(., seqnames.field = "gene_chr", start.field = "gene_tss",
                             end.field = "gene_tss")

  # get bins overlapping for all enhancers and gene tss
  hic_bins <- x_intervals(htc_object)
  enh_bins <- findOverlaps(query = enh_coords, subject = hic_bins)
  tss_bins <- findOverlaps(query = tss_coords, subject = hic_bins)

  # combine into one data.frame
  enh_bins <- data.frame(pair = from(enh_bins), enh_bin = to(enh_bins))
  tss_bins <- data.frame(pair = from(tss_bins), tss_bin = to(tss_bins))
  bins <- full_join(enh_bins, tss_bins, by = "pair")

  # extract distance matrix between bins from htc object
  dists <- intervalsDist(htc_object)

  # calculate distances between bins of enhancer gene pairs
  dist_pairs <- dists[as.matrix(bins[, 2:3])]
  dist_pairs <- data.frame(pair = bins$pair, dist_bins = dist_pairs)

  # extract hi-c interaction matrix from htc object
  intdata <- intdata(htc_object)

  # get interaction frequencies and normalization factors for all bins
  int_freq_pairs <- intdata[as.matrix(bins[, 2:3])]

  # compute normalization factors
  norm_vector_values <- matrix(norm_vector[as.matrix(bins[, 2:3])], ncol = 2)
  norm_factors <- norm_vector_values[, 1] * norm_vector_values[, 2]

  # normalize data and add and add pair id
  int_freq_pairs <- int_freq_pairs / norm_factors
  int_freq_pairs <- data.frame(pair = bins$pair, int_freq = int_freq_pairs)

  # add interaction frequencies and bin distances to pairs to create output
  pairs %>%
    left_join(dist_pairs, by = "pair") %>%
    left_join(int_freq_pairs, by = "pair") %>%
    select(-c(pair))

}

# compute interaction frequencies for all tested enhancer - gene pairs
etps_chrs <- split(etps, f = etps$enh_chr)
hic_data <- hic_data[names(etps_chrs)]
krnorm <- krnorm[names(etps_chrs)]
int_freqs <- bpmapply(FUN = compute_int_freq, pairs = etps_chrs, htc_object = hic_data,
                      norm_vector = krnorm, SIMPLIFY = FALSE)

# combine into one data.frame
chrom_annot_etps <- bind_rows(int_freqs)

# Activity by contact ==============================================================================

# The activity by contact score is computed as proposed by Fulco et al. The geometric mean between
# H3K27ac and DNAse-seq is then calculatated for every site to compute an enhancer activity score
# proposed by Fulco et al. The ABC score is then computed by multiplying the activity score with the
# Hi-C interaction frequency.

# function to calculate geometric mean
gm_mean <- function(x, na.rm = TRUE){
  exp(sum(log(x[x > 0]), na.rm = na.rm) / length(x))
}

# compute enhancer activity score
chrom_annot_etps <- chrom_annot_etps %>%
  rowwise() %>%
  mutate(activity_score = gm_mean(c(`Dnase-seq`, H3K27ac))) %>%
  ungroup()

# compute abc score
chrom_annot_etps <- chrom_annot_etps %>%
  group_by(gene) %>%
  mutate(abc_score = activity_score * (int_freq + 1),
         abc_score_norm = abc_score / sum(abc_score))

# Finalize output ==================================================================================

# The output is finalized by adding the average gene expression, transforming the distance to TSS to
# absolute values and pruning pairs based on the distance to TSS.

# add average gene expression
chrom_annot_etps <- left_join(chrom_annot_etps, avg_expr, by = "gene")

# transform distance to TSS
chrom_annot_etps <- mutate(chrom_annot_etps, dist_to_tss = abs(dist_to_tss))

# only retain pairs with distance to TSS between 1 and 300 kb
chrom_annot_etps <- filter(chrom_annot_etps, dist_to_tss >= 1000, dist_to_tss <= 300000)

# rearrange columns
columns <- c("perturbation", "gene", "avg_txs", "enh_chr", "enh_start", "enh_end", "gene_chr",
             "gene_tss", "gene_strand", "dist_to_tss", "Dnase-seq", "H3K27ac", "H3K27me3",
             "H3K4me1", "H3K4me3", "POLR2A", "dist_bins", "int_freq", "activity_score",
             "abc_score", "abc_score_norm")
chrom_annot_etps <- chrom_annot_etps[, columns]

# write to output file
write.csv(chrom_annot_etps, file = here(snakemake@output[[1]]), quote = FALSE, row.names = FALSE)
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
library(GenomicAlignments)
library(GenomicFeatures)
library(here)
library(HiTC)
library(Matrix)
library(rtracklayer)
library(tidyverse)
library(BiocParallel)

# backend for parallel computing
register(MulticoreParam(workers = snakemake@threads))

# Expressed genes ==================================================================================

# Whole-Tx CROP-seq data is used to find expressed genes in K562 cells.

# import gencode hg38 annotations
annot_url <- "ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/Homo_sapiens.GRCh38.89.chr.gtf.gz"
annot <- import(annot_url, format = "gtf")

# extract exon annotations for protein-coding genes on autosomes and chromosome X
genes <- annot[annot$type == "exon" & annot$gene_biotype == "protein_coding" &
                 annot$transcript_biotype == "protein_coding" &
                 !seqnames(annot) %in% c("Y", "MT")]

# genome-wide K562 expression data
dge <- read.table(here(snakemake@input$wholeTx_dge), header = TRUE, row.names = "GENE",
                  stringsAsFactors = FALSE)

# only retain dge data on protein-coding genes
gene_expr <- dge[rownames(dge) %in% genes$gene_name, ]

# compute average gene expression
avg_expr <- data.frame(gene = rownames(gene_expr), avg_txs = rowSums(gene_expr),
                       stringsAsFactors = FALSE, row.names = NULL)

# extract genes with 100 or more transcripts on average
expr_gene_names <- avg_expr %>% 
  filter(avg_txs >= 100) %>% 
  pull(gene)

# get annotations for these genes and split by gene name
expr_genes <- genes[genes$gene_name %in% expr_gene_names]
expr_genes <- split(expr_genes, f = expr_genes$gene_name)

# Select enhancers - gene pairs ====================================================================

# Tested enhancers from the enhancer screen in Gasperini et al. 2019 are used to create ETPs with
# all expressed genes within 300kb.

# import enhancer sites
enh_coords <- import(snakemake@input$enhancers)

# remove duplicates
enh_coords <- unique(enh_coords)

# create enhancer id
enh_coords$perturbation <- paste0(seqnames(enh_coords), ":", start(enh_coords), "-",
                                  end(enh_coords))

# download chain file for hg19 to hg38 liftover
chain_url <- "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz"
chain_file <- tempfile("hg19ToHg38", fileext = ".gz")
download.file(chain_url, chain_file)
system(paste("gunzip", chain_file))

# import chain file
hg19_to_hg38_chain <- import.chain(tools::file_path_sans_ext(chain_file))

# liftover enhancers from hg38 to hg19
enh_coords <- enh_coords %>%
  liftOver(chain = hg19_to_hg38_chain)

# filter out enhancers that did not lift over correctly (split into 2 loci)
enh_loci <- sapply(enh_coords, length)
enh_loci_good <- which(enh_loci == 1)
enh_coords <- enh_coords[enh_loci_good]

# convert to GRanges object
enh_coords <- unlist(enh_coords)

# function to get a genes TSS
get_tss <- function(x) {
  if (all(strand(x) == "+")) {
    tss <- min(start(x))
  }else if (all(strand(x) == "-")) {
    tss <- max(end(x))
  }else{
    stop("Inconsistent strand information!")
  }
  GRanges(seqnames = unique(seqnames(x)), ranges = IRanges(start = tss, end = tss),
          strand = unique(strand(x)), gene_name = unique(x$gene_name))
}

# get tss for all genes
gene_tss <- endoapply(expr_genes, FUN = get_tss)
gene_tss <- unlist(gene_tss)

# adjust chromosome names of enhancer coordinates
seqlevels(enh_coords) <- sub("chr", "", seqlevels(enh_coords))

# find enhancers that overlap a 600kb window centered on each genes tss
tss_windows <- promoters(gene_tss, upstream = 300000, downstream = 300000)
overlaps <- findOverlaps(query = enh_coords, subject = tss_windows)

# get overlapping gene and gene tss
ol_enh <- enh_coords[from(overlaps)]
ol_tss <- gene_tss[to(overlaps)]

# create data.frame containing etps
enh_df <- data.frame(ol_enh)
tss_df <- data.frame(ol_tss)
etps <- data.frame(perturbation = enh_df$perturbation, gene = tss_df$gene_name,
                   enh_chr = as.character(enh_df$seqnames), enh_start = enh_df$start,
                   enh_end = enh_df$end, gene_chr = as.character(tss_df$seqnames),
                   gene_tss = tss_df$start, gene_strand = as.character(tss_df$strand),
                   stringsAsFactors = FALSE)

# calculate distance to tss for every enhancer - gene pair
etps <- etps %>%
  mutate(enh_center = round((enh_start + enh_end) / 2)) %>% 
  mutate(dist_to_tss = if_else(gene_strand == "+", true  = enh_center - gene_tss,
                               false = gene_tss - enh_center)) %>% 
  select(-enh_center)

# reformat chromosome ids
etps <- mutate(etps, enh_chr = paste0("chr", enh_chr), gene_chr = paste0("chr", gene_chr))

# Chromatin activity ===============================================================================

# Each ETP is annotated for chromatin activity at the involved enhancer. The RPKM is calculated to
# normalize for both the sequencing depth and enhancer size.

# chromatin assays and input files
chrom_infiles <- here(snakemake@input$encode_bam)
names(chrom_infiles) <- sub("_.+", "", basename(chrom_infiles))

# load all bam files
chrom_reads <- bplapply(chrom_infiles, FUN = readGAlignments)

# extract hg38 coordinates of all enhancers
enh_coords <- etps %>%
  select(enh_chr, enh_start, enh_end, perturbation) %>%
  distinct() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE)

# count the number of reads within each enhancer
enh_chrom_reads <- chrom_reads %>%
  bplapply(FUN = countOverlaps, query = enh_coords) %>%
  bind_cols()

# combine with enhancer coordinates and convert to long format
enh_chrom <- enh_coords %>%
  data.frame(stringsAsFactors = FALSE) %>%
  select(-strand) %>%
  rename(enh_chr = seqnames, enh_start = start, enh_end = end) %>%
  bind_cols(enh_chrom_reads) %>%
  gather(key = assay, value = reads, -c(1:5))

# normalize for sequencing depth and enhancer size by calculating rpkm
enh_chrom_norm <- enh_chrom %>%
  group_by(assay) %>%
  mutate(rpkm = reads / (sum(reads) / 1e6) / (width / 1000))

# convert to wide format add to enhancer gene pair results
etps <- enh_chrom_norm %>%
  select(-c(enh_chr, enh_start, enh_end, width, reads)) %>%
  spread(key = "assay", value = "rpkm") %>%
  left_join(x = etps, y = ., by = "perturbation")

# remove data to free up memory
rm(enh_chrom_reads, chrom_reads, enh_chrom, enh_chrom_norm)
rm(annot, dge, expr_genes, gene_expr, genes)
invisible(gc())

# Annotate Hi-C ====================================================================================

# Hi-C contact matrices from Rao et al., 2014 are imported and the contact frequency for each
# enhancer - gene pair is calculated.

# Liftover enhancer and gene TSS coordinates -------------------------------------------------------

# Hi-C contact matrices are in hg19 annotation, so enhancer and gene TSS coordinates need to be
# lifted over from hg38 to hg19.

# download chain file for hg38 to hg19 liftover
chain_url <- "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz"
chain_file <- tempfile("hg38ToHg19", fileext = ".gz")
download.file(chain_url, chain_file)
system(paste("gunzip", chain_file))

# import chain file
hg38_to_hg19_chain <- import.chain(tools::file_path_sans_ext(chain_file))
closeAllConnections()  # close any open file connection

# extract hg38 coordinates of all enhancers
enh_coords_hg38 <- etps %>%
  select(enh_chr, enh_start, enh_end, perturbation) %>%
  distinct() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE)

# liftover enhancers from hg38 to hg19 and convert to data.frame
enh_coords_hg19 <- enh_coords_hg38 %>%
  liftOver(chain = hg38_to_hg19_chain)

# filter out enhancers that did not lift over correctly (split into 2 loci)
enh_loci <- sapply(enh_coords_hg19, length)
enh_loci_good <- which(enh_loci == 1)
enh_coords_hg19 <- enh_coords_hg19[enh_loci_good]

# reformat
enh_coords_hg19 <- enh_coords_hg19 %>%
  unlist() %>%
  as.data.frame() %>%
  select(seqnames, start, end, perturbation) %>%
  rename(enh_chr = seqnames, enh_start = start, enh_end = end) %>%
  mutate(enh_chr = as.character(enh_chr))

# replace tested enhancer - gene pairs hg38 enhancer coordinates with hg19 coordinates
etps <- etps %>%
  select(-c(enh_chr, enh_start, enh_end)) %>%
  left_join(enh_coords_hg19, by = "perturbation")

# extract gene TSS coordinates
gene_tss_coords <- etps %>%
  select(gene, gene_chr, gene_tss, gene_strand) %>%
  distinct() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE, seqnames.field = "gene_chr",
                           start.field = "gene_tss", end.field = "gene_tss",
                           strand.field = "gene_strand")

# liftover tss coordinates to hg19
gene_tss_coords_hg19 <- gene_tss_coords %>%
  liftOver(chain = hg38_to_hg19_chain) %>%
  unlist() %>%
  as.data.frame() %>%
  select(seqnames, start, strand, gene) %>%
  rename(gene_chr = seqnames, gene_tss = start, gene_strand = strand) %>%
  mutate(gene_chr = as.character(gene_chr), gene_strand = as.character(gene_strand))

# replace gene tss hg38 coordinates with hg19 coordinates
etps <- etps %>%
  select(-c(gene_chr, gene_tss, gene_strand)) %>%
  left_join(gene_tss_coords_hg19, by = "gene")

# recalculate distance to tss for hg19 coordinates
etps <- etps %>%
  mutate(enh_center = round((enh_start + enh_end) / 2)) %>% 
  mutate(dist_to_tss = if_else(gene_strand == "+", true  = enh_center - gene_tss,
                               false = gene_tss - enh_center)) %>% 
  select(-enh_center)

# remove enhancers that were not lifted over or switched chromosomes...
etps <- filter(etps, enh_chr == gene_chr)

# Prepare Hi-C data --------------------------------------------------------------------------------

# Contact matrices for K562 cells from Rao et al. 2014 are imported (5kb resolution). The provided
# normalization vectors are used to normalize the observed contacts (read counts). Data from the two
# chromosonal regions on chromosome 8 and 11 is extracted for any further analyses.

# function to import HiC data from Rao et al for one chromosome and create a HTCexp object
import_hic <- function(sparse_cm_file, chromosome, resolution, bins) {

  # load sparse contact matrix file (only observed contacts)
  obs_contacts <- read.table(sparse_cm_file, col.names = c("i", "j", "M_ij"), sep = "\t")

  # get starting coordinates of assessed genomic bins at 5kb resolution
  max_bin <- (bins - 1) * resolution
  bin_starts <- seq(from = 0, to = max_bin, by = resolution)

  # create GRanges object containing all assessed bins for that chromosome
  bin_coords <- GRanges(seqnames = chromosome,
                        ranges = IRanges(start = bin_starts, end = bin_starts + resolution - 1,
                                         names = paste0("bin_", seq_len(length(bin_starts))))
  )

  # convert starting coordinates of bins in sparse matrix input to bin ids by dividing by the
  # resolution (and add 1 to get correct index)
  obs_contacts_bins <- data.frame(i = round(obs_contacts$i / resolution + 1),
                                  j = round(obs_contacts$j / resolution + 1), 
                                  M_ij = obs_contacts$M_ij)

  # create sparse contact matrix from observed contacts
  sparse_cm <- sparseMatrix(i = obs_contacts_bins$i, j = obs_contacts_bins$j,
                            x = obs_contacts_bins$M_ij, symmetric = TRUE, dims = c(bins, bins))

  # create HTCexp object containing data for the given chromosome
  HTCexp(intdata = sparse_cm, xgi = bin_coords, ygi = bin_coords)

}

# k562 intrachromosomal sparse matrix files
scm_files <- here(snakemake@input$hic_raw)
names(scm_files) <- sub("_5kb.RAWobserved", "", basename(scm_files))

# normalization vector files
krnorm_files <- here(snakemake@input$hic_norm)
names(krnorm_files) <- sub("_5kb.KRnorm", "", basename(krnorm_files))

# import normalization vectors
krnorm <- lapply(krnorm_files, FUN = function(x) as.numeric(readLines(x)) )

# infer number of bins per chromosome based on the normalization vectors
hic_bins <- lapply(krnorm, FUN = length)

# import hi-c data for these chromosomes
hic_data <- bpmapply(FUN = import_hic,
                     sparse_cm_file = scm_files,
                     chromosome = names(scm_files),
                     bins = hic_bins,
                     MoreArgs = list(resolution = 5000))

## Enhancer - gene pairs ---------------------------------------------------------------------------

# The interaction frequency of all tested enhancer - gene pair computed, defined as the interaction
# frequency of the genomic bins overlapping the enhancer and the target genes TSS.

# function to compute the interaction frequency for enhancer - gene pairs, by finding the hi-c 
# genomic bins with overlap with the enhancer and the target gene tss. the interaction frequency of
# the pair is then defined as the interaction frequency of these bins
compute_int_freq <- function(pairs, htc_object, norm_vector) {

  # add pair identifier
  pairs$pair <- seq_len(nrow(pairs))

  # get coordinates of enhancer centers as GRanges object
  enh_coords <- pairs %>%
    mutate(enh_center = round((enh_start + enh_end) / 2)) %>%
    select(enh_chr, enh_center) %>%
    makeGRangesFromDataFrame(., seqnames.field = "enh_chr", start.field = "enh_center",
                             end.field = "enh_center")

  # get gene tss coordinates as GRanges object
  tss_coords <- pairs %>%
    select(gene_chr, gene_tss) %>%
    makeGRangesFromDataFrame(., seqnames.field = "gene_chr", start.field = "gene_tss",
                             end.field = "gene_tss")

  # get bins overlapping for all enhancers and gene tss
  hic_bins <- x_intervals(htc_object)
  enh_bins <- findOverlaps(query = enh_coords, subject = hic_bins)
  tss_bins <- findOverlaps(query = tss_coords, subject = hic_bins)

  # combine into one data.frame
  enh_bins <- data.frame(pair = from(enh_bins), enh_bin = to(enh_bins))
  tss_bins <- data.frame(pair = from(tss_bins), tss_bin = to(tss_bins))
  bins <- full_join(enh_bins, tss_bins, by = "pair")

  # extract distance matrix between bins from htc object
  dists <- intervalsDist(htc_object)

  # calculate distances between bins of enhancer gene pairs
  dist_pairs <- dists[as.matrix(bins[, 2:3])]
  dist_pairs <- data.frame(pair = bins$pair, dist_bins = dist_pairs)

  # extract hi-c interaction matrix from htc object
  intdata <- intdata(htc_object)

  # get interaction frequencies and normalization factors for all bins
  int_freq_pairs <- intdata[as.matrix(bins[, 2:3])]

  # compute normalization factors
  norm_vector_values <- matrix(norm_vector[as.matrix(bins[, 2:3])], ncol = 2)
  norm_factors <- norm_vector_values[, 1] * norm_vector_values[, 2]

  # normalize data and add and add pair id
  int_freq_pairs <- int_freq_pairs / norm_factors
  int_freq_pairs <- data.frame(pair = bins$pair, int_freq = int_freq_pairs)

  # add interaction frequencies and bin distances to pairs to create output
  pairs %>%
    left_join(dist_pairs, by = "pair") %>%
    left_join(int_freq_pairs, by = "pair") %>%
    select(-c(pair))

}

# compute interaction frequencies for all tested enhancer - gene pairs
etps_chrs <- split(etps, f = etps$enh_chr)
hic_data <- hic_data[names(etps_chrs)]
krnorm <- krnorm[names(etps_chrs)]
int_freqs <- bpmapply(FUN = compute_int_freq, pairs = etps_chrs, htc_object = hic_data,
                      norm_vector = krnorm, SIMPLIFY = FALSE)

# combine into one data.frame
chrom_annot_etps <- bind_rows(int_freqs)

# Activity by contact ==============================================================================

# The activity by contact score is computed as proposed by Fulco et al. The geometric mean between
# H3K27ac and DNAse-seq is then calculatated for every site to compute an enhancer activity score
# proposed by Fulco et al. The ABC score is then computed by multiplying the activity score with the
# Hi-C interaction frequency.

# function to calculate geometric mean
gm_mean <- function(x, na.rm = TRUE){
  exp(sum(log(x[x > 0]), na.rm = na.rm) / length(x))
}

# compute enhancer activity score
chrom_annot_etps <- chrom_annot_etps %>%
  rowwise() %>%
  mutate(activity_score = gm_mean(c(`Dnase-seq`, H3K27ac))) %>%
  ungroup()

# compute abc score
chrom_annot_etps <- chrom_annot_etps %>%
  group_by(gene) %>%
  mutate(abc_score = activity_score * (int_freq + 1),
         abc_score_norm = abc_score / sum(abc_score))

# Finalize output ==================================================================================

# The output is finalized by adding the average gene expression, transforming the distance to TSS to
# absolute values and pruning pairs based on the distance to TSS.

# add average gene expression
chrom_annot_etps <- left_join(chrom_annot_etps, avg_expr, by = "gene")

# transform distance to TSS
chrom_annot_etps <- mutate(chrom_annot_etps, dist_to_tss = abs(dist_to_tss))

# only retain pairs with distance to TSS between 1 and 300 kb
chrom_annot_etps <- filter(chrom_annot_etps, dist_to_tss >= 1000, dist_to_tss <= 300000)

# rearrange columns
columns <- c("perturbation", "gene", "avg_txs", "enh_chr", "enh_start", "enh_end", "gene_chr",
             "gene_tss", "gene_strand", "dist_to_tss", "Dnase-seq", "H3K27ac", "H3K27me3",
             "H3K4me1", "H3K4me3", "POLR2A", "dist_bins", "int_freq", "activity_score",
             "abc_score", "abc_score_norm")
chrom_annot_etps <- chrom_annot_etps[, columns]

# write to output file
write.csv(chrom_annot_etps, file = here(snakemake@output[[1]]), quote = FALSE, row.names = FALSE)
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
library(GenomicAlignments)
library(GenomicFeatures)
library(here)
library(HiTC)
library(Matrix)
library(rtracklayer)
library(tidyverse)
library(BiocParallel)

# backend for parallel computing
register(MulticoreParam(workers = snakemake@threads))

# Expressed genes ==================================================================================

# Whole-Tx CROP-seq data is used to find expressed genes in K562 cells

# import gencode hg38 annotations
annot_url <- "ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/Homo_sapiens.GRCh38.89.chr.gtf.gz"
annot <- import(annot_url, format = "gtf")

# extract exon annotations for protein-coding genes on autosomes
genes <- annot[annot$type == "exon" & annot$gene_biotype == "protein_coding" &
                 annot$transcript_biotype == "protein_coding" &
                 !seqnames(annot) %in% c("X", "Y", "MT")]

# genome-wide K562 expression data
dge <- read.table(here(snakemake@input$wholeTx_dge), header = TRUE, row.names = "GENE",
                  stringsAsFactors = FALSE)

# only retain dge data on protein-coding genes
gene_expr <- dge[rownames(dge) %in% genes$gene_name, ]

# compute average gene expression
avg_expr <- data.frame(gene = rownames(gene_expr), avg_txs = rowSums(gene_expr),
                       stringsAsFactors = FALSE, row.names = NULL)

# extract genes with 100 or more transcripts on average
expr_gene_names <- avg_expr %>% 
  filter(avg_txs >= 100) %>% 
  pull(gene)

# get annotations for these genes and split by gene name
expr_genes <- genes[genes$gene_name %in% expr_gene_names]
expr_genes <- split(expr_genes, f = expr_genes$gene_name)

# Select enhancers - target pairs ==================================================================

# Predicted enhancer sites are selected that are within 300kb of the expressed genes TSS

# load genome-wide enhancers
enh <- read.table(here(snakemake@input$enhancers), col.names = "enh_id", stringsAsFactors = FALSE)

# split enhancer id into chromosome, start and end coordinates
enh <- enh %>% 
  separate(col = enh_id, into = c("chr", "start", "end"), convert = TRUE, remove = FALSE) %>% 
  mutate(chr = sub("chr", "", chr))

# create GRanges object containing enhancer coordinates
enh_coords <- makeGRangesFromDataFrame(enh, keep.extra.columns = TRUE)

# function to get a genes TSS
get_tss <- function(x) {
  if (all(strand(x) == "+")) {
    tss <- min(start(x))
  }else if (all(strand(x) == "-")) {
    tss <- max(end(x))
  }else{
    stop("Inconsistent strand information!")
  }
  GRanges(seqnames = unique(seqnames(x)), ranges = IRanges(start = tss, end = tss),
          strand = unique(strand(x)), gene_name = unique(x$gene_name))
}

# get tss for all genes
gene_tss <- endoapply(expr_genes, FUN = get_tss)
gene_tss <- unlist(gene_tss)

# find enhancers that overlap a 600kb window centered on each genes tss
tss_windows <- promoters(gene_tss, upstream = 300000, downstream = 300000)
overlaps <- findOverlaps(query = enh_coords, subject = tss_windows)

# get overlapping gene and gene tss
ol_enh <- enh_coords[from(overlaps)]
ol_tss <- gene_tss[to(overlaps)]

# create data.frame containing etps
enh_df <- data.frame(ol_enh)
tss_df <- data.frame(ol_tss)
etps <- data.frame(perturbation = enh_df$enh_id, gene = tss_df$gene_name,
                   enh_chr = as.character(enh_df$seqnames), enh_start = enh_df$start,
                   enh_end = enh_df$end, gene_chr = as.character(tss_df$seqnames),
                   gene_tss = tss_df$start, gene_strand = as.character(tss_df$strand),
                   stringsAsFactors = FALSE)

# calculate distance to tss for every enhancer - gene pair
etps <- etps %>%
  mutate(enh_center = round((enh_start + enh_end) / 2)) %>% 
  mutate(dist_to_tss = if_else(gene_strand == "+", true  = enh_center - gene_tss,
                               false = gene_tss - enh_center)) %>% 
  select(-enh_center)

# reformat chromosome ids
etps <- mutate(etps, enh_chr = paste0("chr", enh_chr), gene_chr = paste0("chr", gene_chr))

# Chromatin activity ===============================================================================

# Each ETP is annotated for chromatin activity at the involved enhancer. The RPKM is calculated to
# normalize for both the sequencing depth and enhancer size

# chromatin assays and input files
chrom_infiles <- here(snakemake@input$encode_bam)
names(chrom_infiles) <- sub("_.+", "", basename(chrom_infiles))

# load all bam files
chrom_reads <- bplapply(chrom_infiles, FUN = readGAlignments)

# extract hg38 coordinates of all enhancers
enh_coords <- etps %>%
  select(enh_chr, enh_start, enh_end, perturbation) %>%
  distinct() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE)

# count the number of reads within each enhancer
enh_chrom_reads <- chrom_reads %>%
  bplapply(FUN = countOverlaps, query = enh_coords) %>%
  bind_cols()

# combine with enhancer coordinates and convert to long format
enh_chrom <- enh_coords %>%
  data.frame(stringsAsFactors = FALSE) %>%
  select(-strand) %>%
  rename(enh_chr = seqnames, enh_start = start, enh_end = end) %>%
  bind_cols(enh_chrom_reads) %>%
  gather(key = assay, value = reads, -c(1:5))

# normalize for sequencing depth and enhancer size by calculating rpkm
enh_chrom_norm <- enh_chrom %>%
  group_by(assay) %>%
  mutate(rpkm = reads / (sum(reads) / 1e6) / (width / 1000))

# convert to wide format add to enhancer gene pair results
etps <- enh_chrom_norm %>%
  select(-c(enh_chr, enh_start, enh_end, width, reads)) %>%
  spread(key = "assay", value = "rpkm") %>%
  left_join(x = etps, y = ., by = "perturbation")

# remove data to free up memory
rm(enh_chrom_reads, chrom_reads, enh_chrom, enh_chrom_norm)
rm(annot, dge, expr_genes, gene_expr, genes)
invisible(gc())

# Annotate Hi-C ====================================================================================

# Hi-C contact matrices from Rao et al., 2014 are imported and the contact frequency for each
# enhancer - gene pair is calculated

# liftover ETPs ------------------------------------------------------------------------------------

# Hi-C contact matrices are in hg19 annotation, so enhancer and gene TSS coordinates need to be
# lifted over from hg38 to hg19.

# download chain file for hg38 to hg19 liftover
chain_url <- "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz"
chain_file <- tempfile("hg38ToHg19", fileext = ".gz")
download.file(chain_url, chain_file)
system(paste("gunzip", chain_file))

# import chain file
hg38_to_hg19_chain <- import.chain(tools::file_path_sans_ext(chain_file))

# extract hg38 coordinates of all enhancers
enh_coords_hg38 <- etps %>%
  select(enh_chr, enh_start, enh_end, perturbation) %>%
  distinct() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE)

# liftover enhancers from hg38 to hg19 and convert to data.frame
enh_coords_hg19 <- enh_coords_hg38 %>%
  liftOver(chain = hg38_to_hg19_chain) %>%
  unlist() %>%
  as.data.frame() %>%
  select(seqnames, start, end, perturbation) %>%
  rename(enh_chr = seqnames, enh_start = start, enh_end = end) %>%
  mutate(enh_chr = as.character(enh_chr))

# replace tested enhancer - gene pairs hg38 enhancer coordinates with hg19 coordinates
etps <- etps %>%
  select(-c(enh_chr, enh_start, enh_end)) %>%
  left_join(enh_coords_hg19, by = "perturbation")

# extract gene TSS coordinates
gene_tss_coords <- etps %>%
  select(gene, gene_chr, gene_tss, gene_strand) %>%
  distinct() %>%
  makeGRangesFromDataFrame(keep.extra.columns = TRUE, seqnames.field = "gene_chr",
                           start.field = "gene_tss", end.field = "gene_tss",
                           strand.field = "gene_strand")

# liftover tss coordinates to hg19
gene_tss_coords_hg19 <- gene_tss_coords %>%
  liftOver(chain = hg38_to_hg19_chain) %>%
  unlist() %>%
  as.data.frame() %>%
  select(seqnames, start, strand, gene) %>%
  rename(gene_chr = seqnames, gene_tss = start, gene_strand = strand) %>%
  mutate(gene_chr = as.character(gene_chr), gene_strand = as.character(gene_strand))

# replace gene tss hg38 coordinates with hg19 coordinates
etps <- etps %>%
  select(-c(gene_chr, gene_tss, gene_strand)) %>%
  left_join(gene_tss_coords_hg19, by = "gene")

# recalculate distance to tss for hg19 coordinates
etps <- etps %>%
  mutate(enh_center = round((enh_start + enh_end) / 2)) %>% 
  mutate(dist_to_tss = if_else(gene_strand == "+", true  = enh_center - gene_tss,
                               false = gene_tss - enh_center)) %>% 
  select(-enh_center)

# remove enhancers that were not lifted over or switched chromosomes...
etps <- filter(etps, enh_chr == gene_chr)

# Prepare Hi-C data --------------------------------------------------------------------------------

# Contact matrices for K562 cells from Rao et al. 2014 are imported (5kb resolution). The provided
# normalization vectors are used to normalize the observed contacts (read counts). Data from the two
# chromosonal regions on chromosome 8 and 11 is extracted for any further analyses.

# function to import HiC data from Rao et al for one chromosome and create a HTCexp object
import_hic <- function(sparse_cm_file, chromosome, resolution, bins) {

  # load sparse contact matrix file (only observed contacts)
  obs_contacts <- read.table(sparse_cm_file, col.names = c("i", "j", "M_ij"), sep = "\t")

  # get starting coordinates of assessed genomic bins at 5kb resolution
  max_bin <- (bins - 1) * resolution
  bin_starts <- seq(from = 0, to = max_bin, by = resolution)

  # create GRanges object containing all assessed bins for that chromosome
  bin_coords <- GRanges(seqnames = chromosome,
                        ranges = IRanges(start = bin_starts, end = bin_starts + resolution - 1,
                                         names = paste0("bin_", seq_len(length(bin_starts))))
  )

  # convert starting coordinates of bins in sparse matrix input to bin ids by dividing by the
  # resolution (and add 1 to get correct index)
  obs_contacts_bins <- data.frame(i = round(obs_contacts$i / resolution + 1),
                                  j = round(obs_contacts$j / resolution + 1), 
                                  M_ij = obs_contacts$M_ij)

  # create sparse contact matrix from observed contacts
  sparse_cm <- sparseMatrix(i = obs_contacts_bins$i, j = obs_contacts_bins$j,
                            x = obs_contacts_bins$M_ij, symmetric = TRUE, dims = c(bins, bins))

  # create HTCexp object containing data for the given chromosome
  HTCexp(intdata = sparse_cm, xgi = bin_coords, ygi = bin_coords)

}

# k562 intrachromosomal sparse matrix files
scm_files <- here(snakemake@input$hic_raw)
names(scm_files) <- sub("_5kb.RAWobserved", "", basename(scm_files))

# normalization vector files
krnorm_files <- here(snakemake@input$hic_norm)
names(krnorm_files) <- sub("_5kb.KRnorm", "", basename(krnorm_files))

# import normalization vectors
krnorm <- lapply(krnorm_files, FUN = function(x) as.numeric(readLines(x)) )

# infer number of bins per chromosome based on the normalization vectors
hic_bins <- lapply(krnorm, FUN = length)

# import hi-c data for these chromosomes
hic_data <- bpmapply(FUN = import_hic,
                     sparse_cm_file = scm_files,
                     chromosome = names(scm_files),
                     bins = hic_bins,
                     MoreArgs = list(resolution = 5000))

# Enhancer - gene pairs ----------------------------------------------------------------------------

# The interaction frequency of all tested enhancer - gene pair computed, defined as the interaction
# frequency of the genomic bins overlapping the enhancer and the target genes TSS.

# function to compute the interaction frequency for enhancer - gene pairs, by finding the hi-c 
# genomic bins with overlap with the enhancer and the target gene tss. the interaction frequency of
# the pair is then defined as the interaction frequency of these bins
compute_int_freq <- function(pairs, htc_object, norm_vector) {

  # add pair identifier
  pairs$pair <- seq_len(nrow(pairs))

  # get coordinates of enhancer centers as GRanges object
  enh_coords <- pairs %>%
    mutate(enh_center = round((enh_start + enh_end) / 2)) %>%
    select(enh_chr, enh_center) %>%
    makeGRangesFromDataFrame(., seqnames.field = "enh_chr", start.field = "enh_center",
                             end.field = "enh_center")

  # get gene tss coordinates as GRanges object
  tss_coords <- pairs %>%
    select(gene_chr, gene_tss) %>%
    makeGRangesFromDataFrame(., seqnames.field = "gene_chr", start.field = "gene_tss",
                             end.field = "gene_tss")

  # get bins overlapping for all enhancers and gene tss
  hic_bins <- x_intervals(htc_object)
  enh_bins <- findOverlaps(query = enh_coords, subject = hic_bins)
  tss_bins <- findOverlaps(query = tss_coords, subject = hic_bins)

  # combine into one data.frame
  enh_bins <- data.frame(pair = from(enh_bins), enh_bin = to(enh_bins))
  tss_bins <- data.frame(pair = from(tss_bins), tss_bin = to(tss_bins))
  bins <- full_join(enh_bins, tss_bins, by = "pair")

  # extract distance matrix between bins from htc object
  dists <- intervalsDist(htc_object)

  # calculate distances between bins of enhancer gene pairs
  dist_pairs <- dists[as.matrix(bins[, 2:3])]
  dist_pairs <- data.frame(pair = bins$pair, dist_bins = dist_pairs)

  # extract hi-c interaction matrix from htc object
  intdata <- intdata(htc_object)

  # get interaction frequencies and normalization factors for all bins
  int_freq_pairs <- intdata[as.matrix(bins[, 2:3])]

  # compute normalization factors
  norm_vector_values <- matrix(norm_vector[as.matrix(bins[, 2:3])], ncol = 2)
  norm_factors <- norm_vector_values[, 1] * norm_vector_values[, 2]

  # normalize data and add and add pair id
  int_freq_pairs <- int_freq_pairs / norm_factors
  int_freq_pairs <- data.frame(pair = bins$pair, int_freq = int_freq_pairs)

  # add interaction frequencies and bin distances to pairs to create output
  pairs %>%
    left_join(dist_pairs, by = "pair") %>%
    left_join(int_freq_pairs, by = "pair") %>%
    select(-c(pair))

}

# compute interaction frequencies for all tested enhancer - gene pairs
etps_chrs <- split(etps, f = etps$enh_chr)
etps_chrs <- etps_chrs[names(hic_data)]
krnorm <- krnorm[names(etps_chrs)]
int_freqs <- bpmapply(FUN = compute_int_freq, pairs = etps_chrs, htc_object = hic_data,
                      norm_vector = krnorm, SIMPLIFY = FALSE)

# combine into one data.frame
chrom_annot_etps <- bind_rows(int_freqs)

# Activity by contact ==============================================================================

# The activity by contact score is computed as proposed by Fulco et al. The geometric mean between
# H3K27ac and DNAse-seq is then calculatated for every site to compute an enhancer activity score
# proposed by Fulco et al. The ABC score is then computed by multiplying the activity score with the
# Hi-C interaction frequency.

# function to calculate geometric mean
gm_mean <- function(x, na.rm = TRUE){
  exp(sum(log(x[x > 0]), na.rm = na.rm) / length(x))
}

# compute enhancer activity score
chrom_annot_etps <- chrom_annot_etps %>%
  rowwise() %>%
  mutate(activity_score = gm_mean(c(`Dnase-seq`, H3K27ac))) %>%
  ungroup()

# compute abc score
chrom_annot_etps <- chrom_annot_etps %>%
  group_by(gene) %>%
  mutate(abc_score = activity_score * (int_freq + 1),
         abc_score_norm = abc_score / sum(abc_score))

# Finalize output ==================================================================================

# The output is finalized by adding the average gene expression, transforming the distance to TSS to
# absolute values and pruning pairs based on the distance to TSS.

# add average gene expression
chrom_annot_etps <- left_join(chrom_annot_etps, avg_expr, by = "gene")

# transform distance to TSS
chrom_annot_etps <- mutate(chrom_annot_etps, dist_to_tss = abs(dist_to_tss))

# only retain pairs with distance to TSS between 1 and 300 kb
chrom_annot_etps <- filter(chrom_annot_etps, dist_to_tss >= 1000, dist_to_tss <= 300000)

# rearrange columns
columns <- c("perturbation", "gene", "avg_txs", "enh_chr", "enh_start", "enh_end", "gene_chr",
             "gene_tss", "gene_strand", "dist_to_tss", "Dnase-seq", "H3K27ac", "H3K27me3",
             "H3K4me1", "H3K4me3", "POLR2A", "dist_bins", "int_freq", "activity_score",
             "abc_score", "abc_score_norm")
chrom_annot_etps <- chrom_annot_etps[, columns]

# write to output file
write.csv(chrom_annot_etps, file = here(snakemake@output[[1]]), quote = FALSE, row.names = FALSE)
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
sd=$(dirname $0)

# input directory passed as argument
indir=$1

# output files passed as arguments
r1_outfile=$2
r2_outfile=$3

# delete potentially existing output files to avoid attaching to them
if [ -f $r1_outfile ]; then
  rm $r1_outfile
fi

if [ -f $r2_outfile ]; then
  rm $r2_outfile
fi

# BARCODE READS ------------------------------------------------------------------------------------

# function to get i7 barcode from barcode table, add it to the beginning of every barcode read and
# append output to compressed fastq file
process_read1 () {
  local indir=$1
  local outfile=$2
  local i7_file=$3
  for i in ${indir}/*_1.fastq.gz; do
    bn=$(basename $i)
    sample=$(echo $bn | perl -pe 's/(\w+)_1.fastq.gz/$1/')
    i7=$(awk -v sample="$sample" '$1 == sample{print $2}' $i7_file)
    echo -e "\tprocessing ${i} (i7 = ${i7})..."
    zcat $i | sed "2~4s/^/${i7}/" | sed '4~4s/^/IIIIIIII/' | gzip >> $outfile
  done
}

# add all genome reads into one fastq file
echo "Merging barcode reads:"
process_read1 $indir $r1_outfile ${sd}/i7_barcodes.txt

# GENOME READS -------------------------------------------------------------------------------------

# function to simply append all genome reads to one compressed fastq file
process_read2() {
  local indir=$1
  local outfile=$2
  for i in ${indir}/*_2.fastq.gz; do
    echo -e "\tprocessing ${i}..."
    cat $i >> $outfile
  done
}

# add all genome reads into one fastq file
echo -e "\nMerging genome reads:"
process_read2 $indir $r2_outfile

echo -e "\nDone!"
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
sd=$(dirname $0)

# input directory passed as argument
indir=$1

# output files passed as arguments
r1_outfile=$2
r2_outfile=$3

# delete potentially existing output files to avoid attaching to them
if [ -f $r1_outfile ]; then
  rm $r1_outfile
fi

if [ -f $r2_outfile ]; then
  rm $r2_outfile
fi

# function to simply append all genome reads to one compressed fastq file
process_read() {
  local indir=$1
  local outfile=$2
  local read=$3
  for i in ${indir}/*_${read}.fastq.gz; do
    echo -e "\tprocessing ${i}..."
    cat $i >> $outfile
  done
}

echo "Merging barcode reads:"
process_read $indir $r1_outfile "1"

echo -e "\nMerging genome reads:"
process_read $indir $r2_outfile "2"

echo -e "\nDone!"
15
16
17
18
19
20
# set global chunk options
knitr::opts_chunk$set(echo = FALSE)

# load required functions and packages
library(here)
source(file.path(snakemake@scriptdir, "align_report_fun.R"))
32
33
34
35
36
# calculate summary statistics
bc_stats <- bc_summary(here(snakemake@input$cell_bcs), here(snakemake@input$mol_bcs))

# print table
knitr::kable(bc_stats)
51
52
53
54
55
56
57
58
# extract total number of reads from STAR log output
input_reads  <- readLines(here(snakemake@input$star_smry), n = 6)[6]
total_reads <- as.numeric(unlist(strsplit(input_reads, split = "\t"))[2])

# plot adapter trimming summary
adapter_trim_hist(adapter_trim_file = here(snakemake@input$adapt_trim),
                  sample = snakemake@wildcards$sample,
                  total_reads = total_reads)
64
65
66
67
# plot polyA trimming summary
polyA_trim_hist(polyA_trim_file = here(snakemake@input$polyA_trim),
                  sample = snakemake@wildcards$sample,
                  total_reads = total_reads)
79
80
81
82
83
84
# extract percentage of reads across mapping cats
perc_reads <- get_mapping_cats(here(snakemake@input$star_smry))

# plot percentage of reads
plot_mapping_cats(mapping_cats = perc_reads, sample = snakemake@wildcards$sample,
                  total_reads = total_reads)
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
# get sample id
sample <- snakemake@wildcards$sample

# extract expected cell number from snakemake object
expect_cells <- as.numeric(snakemake@config$cell_numbers[sample])

# compute the reads per cell barcode distribution for the top x cells, where
# x is the number of expected cells * 2 and compute local minimum in
# bimodal distribution
rpc_dens <- rpc_density(here(snakemake@input$reads_per_cell), expect_cells = expect_cells)

# define plot title
dens_title <- paste(sample, "reads per cell barcode (top", expect_cells * 2, "cells)")

# create interactive plot
plot_rpc_density(rpc_dens, title = dens_title)
135
136
137
138
139
140
141
142
143
144
145
# extract estimated number of cells
ncells <- rpc_dens$ncells

# calculate cumulative fraction
rpc <- cumfrac_rpc(here(snakemake@input$reads_per_cell))

# define plot title
title <- paste0(sample, " cumulative reads (total reads: ", total_reads, ")")

# plot cumulative fraction plot
plot_cumfrac_rpc(rpc, nbcs = expect_cells * 10, title = title, ncells = ncells)
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
library(optparse)

# create arguments list
option_list = list(
  make_option(c("-t", "--txs_file"), type = "character", default = NULL,
              help = "Path to transcripts annotation file (.gtf)",
              metavar = "character"),
  make_option(c("-b", "--BSgenome_id"), type = "character", default = NULL,
              help = "Identifier for used BSgenome", metavar = "character"),
  make_option(c("-f", "--fasta_outfile"), type = "character", default = NULL,
              help = "Path to .fasta output file", metavar = "character"),
  make_option(c("-o", "--gtf_outfile"), type = "character", default = NULL,
              help = "Path to .gtf output file", metavar = "character")
)

# parse arguments
opt_parser = OptionParser(option_list = option_list)
opt = parse_args(opt_parser)

# function to check for required arguments
check_required_args <- function(arg, opt, opt_parser) {
  if (is.null(opt[[arg]])) {
    print_help(opt_parser)
    stop(arg, " argument is required!", call. = FALSE)
  }
}

# check that all required parameters are provided
required_args <- c("txs_file", "BSgenome_id", "fasta_outfile", "gtf_outfile")
for (i in required_args) {
  check_required_args(i, opt = opt, opt_parser = opt_parser)
}

# define functions =================================================================================

# attach required packages
library(rtracklayer)
library(BSgenome)
library(GenomicRanges)
library(GenomicFeatures)

# main function to create gene loci tap-seq alignment reference based on input file paths
create_tapseq_ref <- function(txs_file, BSgenome_id, fasta_outfile, gtf_outfile) {

  message("\nLoading data...")

  # load transcripts
  txs <- import(txs_file, format ="gtf")

  # get BSgenome object containing the genome sequence
  genome <- getBSgenome(BSgenome_id)
  seqlevelsStyle(genome) <- seqlevelsStyle(txs)[1]  # set seqlevels to same style as txs input

  # create tap-seq alignment references
  message("Creating tap-seq alignment reference...")
  ref <- tapseq_alignment_ref(txs, genome = genome)

  # save alignment reference data
  writeXStringSet(ref$contig_seqs, filepath = fasta_outfile, format = "fasta")
  export_gtf(ref$annot, filepath = gtf_outfile)

  message("Done!")

}

# helper functions to create tap-seq alignment reference data ======================================

# create alignment reference data for gene loci
tapseq_alignment_ref <- function(txs, genome, gene_id = "gene_name") {

  # convert txs to GRanges if it's a GRangesList
  if (is(txs, "GRangesList")) {
    txs <- unlist(txs)
  }

  # abort if genome is not a BSgenome or DNAStringSet object
  if (!any(is(genome, "BSgenome") | is(genome, "DNAStringSet"))) {
    stop("genome must be of class BSgenome or DNAStringSet!", call. = FALSE)
  }

  # sort txs by coordinates
  txs <- sort(txs)

  # group transcripts by gene names
  genes <- split(txs, f = mcols(txs)[, gene_id])

  # create contigs for each locus ------------------------------------------------------------------

  # obtain ranges of each gene locus (start of gene to end of gene)
  loci <- range(genes)

  # merge loci to create non-overlapping contigs
  contigs <- reduce(unlist(loci), ignore.strand = TRUE)
  strand(contigs) <- "+"

  # split contigs into GRangesList
  contigs <- split(contigs, f = 1:length(contigs))

  # find genes overlapping with each contig
  overlaps <- as.data.frame(findOverlaps(contigs, genes, ignore.strand = TRUE))

  # split by query hit to have all overlapping genes per contig in one table
  overlaps <- split(overlaps, f = overlaps$queryHits)

  # create annotations for genes overlapping each contig
  annot <- lapply(overlaps, FUN = create_contig_annot, contigs = contigs, genes = genes)
  annot <- unlist(GRangesList(annot))
  names(annot) <- NULL

  # get contig names based on names of genes overlapping each contig
  names(contigs) <- vapply(overlaps, FUN = contig_names, genes = genes, FUN.VALUE = character(1))

  # get sequences for contigs
  contig_seqs <- extractTranscriptSeqs(genome, transcripts = contigs)

  # create and return output
  list(contig_seqs = contig_seqs, annot = annot)

}

# create annotations for one contig
create_contig_annot <- function(overlap, contigs, genes) {

  # get contig and overlapping genes
  contig <- contigs[[unique(overlap$queryHits)]]
  ol_genes <- unlist(genes[overlap$subjectHits])

  # convert genome coordinates of overlapping genes into coordinates relative
  # to the contig
  start(ol_genes) <- start(ol_genes) - start(contig) + 1
  end(ol_genes) <- end(ol_genes) - start(contig) + 1

  # create new contig name to use as seqnames of output annotations
  gene_names <- names(genes[overlap$subjectHits])
  contig_name <- paste(gene_names, collapse = "_")

  # create new ouptut with contig name as seqnames
  annot <- GRanges(seqnames = contig_name, ranges = ranges(ol_genes), strand = strand(ol_genes))

  # add metadata and return output
  mcols(annot) <- mcols(ol_genes)
  annot

}

# create contig names by exracting the names of all overlapping genes per contig
contig_names <- function(overlap, genes) {

  # get gene names of overlapping genes
  gene_names <- names(genes[overlap$subjectHits])

  # create contig name
  paste(gene_names, collapse = "_")

}

# export GRanges object to gtf file. rtracklayers 'export()' function causes an error with STAR 
export_gtf <- function(x, filepath) {

  # transform x into data.frame
  x <- data.frame(x)

  # transform all columns to character
  x[] <- lapply(x, FUN = as.character)

  # remove the width column (from converting GRanges to data.frame)
  x <- x[,-which(colnames(x) == "width")]

  # replace potential NA for score and phase by '.'
  x$score[is.na(x$score)] <- "."
  x$phase[is.na(x$phase)] <- "."

  # replace potential '*' for strand by '.'
  x$strand[x$strand == "*"] <- "."

  # get first 8 fields for gtf output
  gtf_fields <- x[,c("seqnames", "source", "type", "start",
                     "end", "score", "strand", "phase")]

  ## the remaining fields are processed into the attribute field

  # get remaining fields
  attr_cols <- x[,colnames(x)[!colnames(x) %in% colnames(gtf_fields)]]

  # get attribute names
  attr_names <- colnames(attr_cols)

  # get mandatory attributes "gene_id" and "transcript_id"
  mand <- c(which(attr_names == "gene_id"),
            which(attr_names == "transcript_id"))

  # abort if mandatory fields not found
  if (length(mand) != 2) {

    stop("Mandatory attributes \"gene_id\" and \"transcript_id\" not found!",
         call. = FALSE)

  }

  # order attributes so that the mandatory attributes come first
  attr_names <- c(attr_names[mand], attr_names[-mand])
  attr_cols <- attr_cols[attr_names]

  # create matrix with names
  attr_names <- matrix(rep(attr_names, nrow(attr_cols)),
                       ncol=length(attr_names), byrow = TRUE)

  # paste attribute names to values
  attr <- matrix(paste(attr_names, as.matrix(attr_cols)),
                 nrow = nrow(attr_names))

  # collapse all attributes into one string to create the attribute field
  # and add to gtf_fields
  gtf_fields$attributes <- do.call(paste, c(as.data.frame(attr), sep="; "))
  gtf_fields$attributes <- paste0(gtf_fields$attributes, ';')

  # write output to file
  write.table(gtf_fields, file = filepath, row.names = FALSE,
              col.names = FALSE, quote = FALSE, sep = "\t") 

}

# create tap-seq alignment references ==============================================================

create_tapseq_ref(txs_file = opt$txs_file, BSgenome_id = opt$BSgenome_id,
                  fasta_outfile = opt$fasta_outfile, gtf_outfile = opt$gtf_outfile)
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
import argparse
from Bio import SeqIO

# define functions to generate reference data ------------------------------------------------------

# function to generate alignment reference data from a fasta file containing vector sequences
def cropseq_alignment_reference(input_fasta, output_bn, fasta_ext = ".fasta", prefix = ""):

  # output files
  fasta_outfile = output_bn + fasta_ext
  gtf_outfile = output_bn + ".gtf"

  # open output files
  output_fasta = open(fasta_outfile, "w")
  output_gtf  = open(gtf_outfile,  "w")

  # process each input sequence
  for record in SeqIO.parse(input_fasta, "fasta"):

    # add prefix to sequence id
    record.id = prefix + record.id

    # remove name and description to remove them from header in fasta output
    record.name = ""
    record.description = ""

    # create gtf entry
    gtf = gtf_entry(record)

    # write gtf entry and modified fasta record to respective output files
    output_gtf.write("%s\n" % gtf)
    SeqIO.write(record, output_fasta, "fasta")

  # close open files
  output_fasta.close()
  output_gtf.close()

# function to create gtf entry from a crop-seq vector fasta record
def gtf_entry(fasta_record):

  # get sequence name and length
  seq_name = fasta_record.id
  seq_len = len(fasta_record.seq)

  # create gtf attribute field
  attr_names = ["gene_id", "transcript_id", "gene_name", "transcript_name"]
  attr = [s + " " + seq_name for s in attr_names]
  attr = "; ".join(attr) + ";"

  # create gtf entry
  gtf_fields = [seq_name, "VECTOR", "exon", "1", str(seq_len), ".", "+", ".", attr]
  gtf_line = "\t".join(gtf_fields)

  return gtf_line

# create reference files ---------------------------------------------------------------------------

# create crop-seq vector references based on command line arguments if the  is called as main
# program
if __name__ == "__main__":

  # parse command line arguments
  parser = argparse.ArgumentParser(description = ("Create CROP-seq vector "
                                    "alignment references"))
  parser.add_argument("-i", "--input_fasta", type = str, required = True,
                      help = ("Input fasta file containing sequences of CROP-seq vectors."))
  parser.add_argument("-o", "--output_bn", type = str, required = True,
                      help = "Basename for output files.")
  parser.add_argument("--fasta_ext", type = str, default = ".fasta",
                      help = "Filename extension of fasta files "
                      "(default: .fasta).")
  parser.add_argument("--prefix", type = str, default = "",
                      help = "Optional prefix to be added to sequence name.")
  args = parser.parse_args()

  # create crop-seq vector reference
  cropseq_alignment_reference(input_fasta = args.input_fasta,
                              output_bn = args.output_bn,
                              fasta_ext = args.fasta_ext,
                              prefix = args.prefix)
15
16
17
18
19
20
21
# set global chunk options
knitr::opts_chunk$set(echo = FALSE)

# load required packages
library(tidyverse)
library(here)
library(plotly)
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
# read tpt histogram data
tpt_hist <- read.table(here(snakemake@input$tpt_hist), header = TRUE, stringsAsFactors = FALSE)

# bin tpt values into better intervals for plotting
bins <- seq(0, 1, by = 0.1)
tpt_hist_plot <- tpt_hist %>%
  mutate(tpt_bin = cut(tpt, bins)) %>%
  group_by(tpt_bin) %>%
  summarize(transcripts = sum(transcripts))

# plot histogram
ggplot(tpt_hist_plot, aes(x = tpt_bin, y = transcripts)) +
  geom_bar(stat = "identity") +
  labs(x = "Transcripts-per-transcripts", y = "Transcripts", title = "TPT distribution") +
  theme_bw()
65
66
67
68
69
70
71
72
73
74
# read dge summary statistics
dge_stats <- read.table(here(snakemake@input$dge_stats), header = TRUE, stringsAsFactors = FALSE)

# sort dge_stats according to number of detected transcripts and transform cell barcode into factor
dge_stats <- arrange(dge_stats, desc(transcripts)) %>%
  mutate(cell_barcode = forcats::fct_inorder(cell_barcode))

# calculate number of cell barcodes lost because of cell barcode whitelist filtering (for text only)
ncells <- snakemake@config$cell_numbers[[snakemake@wildcards$sample]]
filt_cells <- ncells - nrow(dge_stats)
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
# number of transcripts
ggplot(dge_stats, aes(x = cell_barcode, y = transcripts)) +
  geom_bar(stat = "identity") + 
  labs(title = paste(snakemake@wildcards$sample, "number of transcripts")) +
  theme_bw() +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.grid = element_blank())

# number of genic reads
ggplot(dge_stats, aes(x = cell_barcode, y = genic_reads)) + 
  geom_bar(stat = "identity") + 
  labs(title = paste(snakemake@wildcards$sample, "number of genic reads")) +
  theme_bw() +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.grid = element_blank())

# number of detected genes
ggplot(dge_stats, aes(x = cell_barcode, y = genes)) + 
  geom_bar(stat = "identity") + 
  labs(title = paste(snakemake@wildcards$sample, "number of detected genes")) +
  theme_bw() +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.grid = element_blank())
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
# create scatter plot with plotly for small datasets or ggplot for larget datasets (>20k cells)
if (nrow(dge_stats) <= 20000) {

  # create scatter plot
  p <- plot_ly(dge_stats) %>%
    add_trace(x = ~transcripts, y = ~genic_reads, type = "scatter", mode = "markers",
              marker = list(opacity = 0.75, line = list(width = 2)), hoverinfo = "text",
              text = ~paste0("Cell barcode: ", cell_barcode, "\n",
                             "Genic reads: ", genic_reads, "\n",
                             "Transcripts: ", transcripts))

  # define axes and title
  xaxis <- list(title = "Number of detected transcripts (UMIs)", type = "log")
  yaxis <- list(title = "Number of genic reads", type = "log")
  title <- "Genic reads vs. transcripts"

  # add layout and print plot
layout(p, xaxis = xaxis, yaxis = yaxis, title = title)

}else{

  # create scatter plot
  ggplot(dge_stats, aes(x = transcripts, y = genic_reads)) +
    geom_abline() +
    geom_point(color = "#1f77b4") +
    labs(x = "Number of detected transcripts (UMIs)", y = "Number of genic reads",
         title = "Genic reads vs. transcripts") +
    scale_x_log10() +
    scale_y_log10() +
    theme_bw()

}
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
# add index for plotting
dge_stats$index <- 1:nrow(dge_stats)

# create scatter plot with plotly for small datasets or ggplot for larget datasets (>20k cells)
if (nrow(dge_stats) <= 20000) {

  # create plot
  p <- plot_ly(dge_stats) %>%
    add_trace(x = ~index, y = ~transcripts, type = "scatter", mode = "markers",
              marker = list(color = "darkgray"), name = "Cells", hoverinfo = "none") %>%
    add_trace(x ~index, y = ~transcripts, type = "scatter", mode = "lines",
              line = list(color = "#1f77b4"), name = "transcripts per cell",
              hoverinfo = "text", text = ~paste0("Cells: ", index, "\n",
                              "UMIs cell i: ", transcripts))

  # define axes and title
  xaxis <- list(title = "Cell barcodes sorted by number of transcripts [descending]")
  yaxis <- list(title = "Number of transcripts", type = "log")
  title <- "Number of transcripts per cell"

  # add layout and print plot
  layout(p, xaxis = xaxis, yaxis = yaxis, title = title)

}else{

  ggplot(dge_stats, aes(x = index, y = transcripts)) +
    geom_point(color = "darkgray") +
    geom_line(color = "#1f77b4") +
    labs(x = "Cell barcodes sorted by number of transcripts [descending]",
         y = "Number of transcripts", title = "Number of transcripts per cell") +
    scale_y_log10() +
    theme_bw()

}
15
16
17
18
19
20
21
# set global chunk options
knitr::opts_chunk$set(echo = FALSE)

# load required packages
library(tidyverse)
library(here)
library(plotly)
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
# read tpt histogram data
tpt_hist <- read.table(here(snakemake@input$tpt_hist), header = TRUE, stringsAsFactors = FALSE)

# bin tpt values into better intervals for plotting
bins <- seq(0, 1, by = 0.1)
tpt_hist_plot <- tpt_hist %>%
  mutate(tpt_bin = cut(tpt, bins)) %>%
  group_by(tpt_bin) %>%
  summarize(transcripts = sum(transcripts))

# plot histogram
ggplot(tpt_hist_plot, aes(x = tpt_bin, y = transcripts)) +
  geom_bar(stat = "identity") +
  labs(x = "Transcripts-per-transcripts", y = "Transcripts", title = "TPT distribution") +
  theme_bw()
65
66
67
68
69
70
71
72
73
74
# read dge summary statistics
dge_stats <- read.table(here(snakemake@input$dge_stats), header = TRUE, stringsAsFactors = FALSE)

# sort dge_stats according to number of detected transcripts and transform cell barcode into factor
dge_stats <- arrange(dge_stats, desc(transcripts)) %>%
  mutate(cell_barcode = forcats::fct_inorder(cell_barcode))

# calculate number of cell barcodes lost because of cell barcode whitelist filtering (for text only)
ncells <- snakemake@config$cell_numbers[[snakemake@wildcards$sample]]
filt_cells <- ncells - nrow(dge_stats)
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
# number of transcripts
ggplot(dge_stats, aes(x = cell_barcode, y = transcripts)) +
  geom_bar(stat = "identity") + 
  labs(title = paste(snakemake@wildcards$sample, "number of transcripts")) +
  theme_bw() +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.grid = element_blank())

# number of genic reads
ggplot(dge_stats, aes(x = cell_barcode, y = genic_reads)) + 
  geom_bar(stat = "identity") + 
  labs(title = paste(snakemake@wildcards$sample, "number of genic reads")) +
  theme_bw() +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.grid = element_blank())

# number of detected genes
ggplot(dge_stats, aes(x = cell_barcode, y = genes)) + 
  geom_bar(stat = "identity") + 
  labs(title = paste(snakemake@wildcards$sample, "number of detected genes")) +
  theme_bw() +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.grid = element_blank())
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
# create scatter plot with plotly for small datasets or ggplot for larget datasets (>20k cells)
if (nrow(dge_stats) <= 20000) {

  # create scatter plot
  p <- plot_ly(dge_stats) %>%
    add_trace(x = ~transcripts, y = ~genic_reads, type = "scatter", mode = "markers",
              marker = list(opacity = 0.75, line = list(width = 2)), hoverinfo = "text",
              text = ~paste0("Cell barcode: ", cell_barcode, "\n",
                             "Genic reads: ", genic_reads, "\n",
                             "Transcripts: ", transcripts))

  # define axes and title
  xaxis <- list(title = "Number of detected transcripts (UMIs)", type = "log")
  yaxis <- list(title = "Number of genic reads", type = "log")
  title <- "Genic reads vs. transcripts"

  # add layout and print plot
layout(p, xaxis = xaxis, yaxis = yaxis, title = title)

}else{

  # create scatter plot
  ggplot(dge_stats, aes(x = transcripts, y = genic_reads)) +
    geom_abline() +
    geom_point(color = "#1f77b4") +
    labs(x = "Number of detected transcripts (UMIs)", y = "Number of genic reads",
         title = "Genic reads vs. transcripts") +
    scale_x_log10() +
    scale_y_log10() +
    theme_bw()

}
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
# add index for plotting
dge_stats$index <- 1:nrow(dge_stats)

# create scatter plot with plotly for small datasets or ggplot for larget datasets (>20k cells)
if (nrow(dge_stats) <= 20000) {

  # create plot
  p <- plot_ly(dge_stats) %>%
    add_trace(x = ~index, y = ~transcripts, type = "scatter", mode = "markers",
              marker = list(color = "darkgray"), name = "Cells", hoverinfo = "none") %>%
    add_trace(x ~index, y = ~transcripts, type = "scatter", mode = "lines",
              line = list(color = "#1f77b4"), name = "transcripts per cell",
              hoverinfo = "text", text = ~paste0("Cells: ", index, "\n",
                              "UMIs cell i: ", transcripts))

  # define axes and title
  xaxis <- list(title = "Cell barcodes sorted by number of transcripts [descending]")
  yaxis <- list(title = "Number of transcripts", type = "log")
  title <- "Number of transcripts per cell"

  # add layout and print plot
  layout(p, xaxis = xaxis, yaxis = yaxis, title = title)

}else{

  ggplot(dge_stats, aes(x = index, y = transcripts)) +
    geom_point(color = "darkgray") +
    geom_line(color = "#1f77b4") +
    labs(x = "Cell barcodes sorted by number of transcripts [descending]",
         y = "Number of transcripts", title = "Number of transcripts per cell") +
    scale_y_log10() +
    theme_bw()

}
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
# load dge data
dge <- read.table(here(snakemake@input$dge), header = TRUE, stringsAsFactors = FALSE)

# extract vector transcripts
vctr_dge <- dge %>%
  filter(grepl(GENE, pattern = snakemake@params$vector_prefix)) %>%
  gather(key = "cell", value = "txs", -GENE)

# count number of total vector molecules per cell
vctr_txs_per_cell <- vctr_dge %>%
  group_by(cell) %>%
  summarize(vctr_txs = sum(txs))

# plot number of transcripts per cell distribution
ggplot(vctr_txs_per_cell, aes(vctr_txs)) +
  geom_histogram(bins = 25) +
  labs(x = "Vector transcripts", y = "Cells", title = "Vector transcripts per cell") +
  theme_bw()
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
# load perturbation matrix
perturb <- read.table(here(snakemake@input$perturb_stats), header = TRUE, stringsAsFactors = FALSE)

# convert to long format
perturb <- perturb %>%
  gather(key = "cell", value = "pert", -VECTOR)

# compute the number of perturbations per cell
pert_per_cell <- perturb %>%
  group_by(cell) %>%
  summarize(pert_per_cell = sum(pert)) %>%
  count(pert_per_cell)

# plot number of perturbations per cell
ggplot(pert_per_cell, aes(x = pert_per_cell, y = n)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = n), vjust = 0) +
  labs(x = "Perturbations", y = "Cells", title = "Perturbations per cell") +
  theme_bw()
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
# calculate number of cells per perturbation
cells_per_pert <- perturb %>%
  group_by(VECTOR) %>%
  summarize(cells = sum(pert))

# plot number of cell per perturbation
p <- ggplot(cells_per_pert,
            aes(x = forcats::fct_reorder(VECTOR, .x = cells, .desc = TRUE), y = cells)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  labs(x = "Perturbation", y = "Cells", title = "Number of cells per perturbation") +
  theme_bw()

# print plot with layout appropriate for number of perturbations
if (nrow(cells_per_pert) <= 30) {
  p + coord_flip()
}else{
  p + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.grid = element_blank())
}
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
from __future__ import print_function
import sys, argparse, os, gzip
import pandas as pd
import numpy as np

# define functions ---------------------------------------------------------------------------------

# helper function to print to stderr
def eprint(*args, **kwargs):
  print(*args, file = sys.stderr, **kwargs)

# filter for cell barcodes on whitelist
def filter_cell_barcodes(umi_obs, whitelist):

  # get umi observations on whitelist
  bc_filter = umi_obs['Cell Barcode'].isin(whitelist)
  filt_umi_obs = umi_obs[bc_filter]

  # get number of filtered out reads and barcodes
  n_filt_bc = len(umi_obs[~bc_filter]['Cell Barcode'].unique().tolist())
  n_filt_reads = umi_obs[~bc_filter]['Num_Obs'].sum()
  perc_filt_reads = n_filt_reads / umi_obs['Num_Obs'].sum() * 100

  # report filtered barcodes and reads to stderr
  eprint(' Removed ' + str(n_filt_reads) + ' reads (' + str(round(perc_filt_reads, 3)) + '%) ' +
         'from ' + str(n_filt_bc) + ' cell barcodes')

  return(filt_umi_obs)

# sample a specified number of total genic reads from the umi counts
def sample_reads(umi_obs, n, seed = None):

  # repeat each bc-umi-gene combination by the number of observations to recreate pool of reads
  reads = umi_obs.reindex(umi_obs.index.repeat(umi_obs.Num_Obs))

  # randomly sample n reads
  sample_reads = reads.sample(n = n, replace = False, random_state = seed)

  # count the number of times each read (bc-umi-gene combination) is found in the downsampled data
  read_counts = sample_reads.groupby(['Cell Barcode', 'Gene', 'Molecular_Barcode']).size()

  # reset index to "ungroup" data frame and rename counts column
  read_counts = read_counts.reset_index().rename(columns = {0:'Num_Obs'})

  return(read_counts)

# calculate transcripts per transcripts from umi observations (output from  Drop-seq tool:
# GatherMolecularBarcodeDistributionByGene)
def calculate_tpt(umi_obs):

  # calculate the total number of reads per cell-umi barcode combination
  reads = umi_obs.groupby(['Cell Barcode', 'Molecular_Barcode'])['Num_Obs'].sum()

  # convert to DataFrame with barcodes as columns and rename Num_obs column
  reads = pd.DataFrame(reads).reset_index()
  reads = reads.rename(columns = {'Num_Obs': 'total_reads'})

  # add total reads to umi_counts
  umi_obs = pd.merge(umi_obs, reads, how = 'left', on = ['Cell Barcode', 'Molecular_Barcode'])

  # calculate the fraction of total reads for each gene-cell-umi combination, where total reads is
  # the sum of reads for that cell-umi combination. this metric is also called transcripts per
  # transcript (tpt).
  umi_obs['tpt'] = np.divide(umi_obs['Num_Obs'], umi_obs['total_reads'])
  umi_tpt = umi_obs.drop(columns = 'total_reads')

  return(umi_tpt)

# compute tpt histogram from calculate_tpt() output
def compute_tpt_histogram(umi_tpt):

  # calculate number of bc-umi-gene combinations per tpt value
  tpt_hist = pd.value_counts(umi_tpt.tpt).to_frame().reset_index()

  # rename columns and sort according to tpt value
  tpt_hist.columns = ['tpt','transcripts']
  tpt_hist = tpt_hist.sort_values('tpt', ascending = False)

  return(tpt_hist)

# filter umi observations based on minimum tbt value
def filter_tpt(umi_tpt, tpt_threshold):

  # retain observations with tbt >= tbt_threshold
  filter_logical = umi_tpt['tpt'] >= tpt_threshold
  umi_filt = umi_tpt[filter_logical]

  eprint('  Removed ' + str(np.round(100 * (1 - np.mean(filter_logical)), 3)) + '% of transcripts')

  return(umi_filt)

# calculate dge summary based of umi observations. calculates number of genic reads, molecules and
# detected genes per cell
def dge_summary(umi_obs):

  # calculate number of genic reads, detected transcripts and genes per cell barcode
  agg_funs = {'Num_Obs': ['sum', 'count'], 'Gene' : 'nunique'}
  stats = umi_obs.groupby('Cell Barcode').agg(agg_funs)

  # sort according to number of genic reads
  stats = stats.sort_values([('Num_Obs', 'sum')], ascending = False)

  # make cell barcode a regular column and set new column names
  stats = stats.reset_index()
  stats.columns = ['cell_barcode', 'genic_reads', 'transcripts', 'genes']

  return(stats)

# convert umi observations 
def create_dge_matrix(umi_obs):

  # count number of umis per cell barcode and gene
  umi_counts = umi_obs.groupby(['Cell Barcode', 'Gene']).size()

  # convert to wide format ('unstack') with missing values set to 0 to create dge expression matrix
  dge_matrix = umi_counts.unstack(level = 0, fill_value = 0)

  # rename 'Gene' index to 'GENE' to mirror Drop-seq tools DGE output
  dge_matrix.index.rename('GENE', inplace = True)

  return(dge_matrix)

# execute if run as main program -------------------------------------------------------------------

# parse command line arguments
if __name__ == '__main__':

  # function to check if an argument contains None or a positive integer
  def none_or_int(arg):
    if arg == 'None':
      return(None)
    elif arg.isdigit():
      return(int(arg))
    else:
      msg = "invalid None or int value: %r" % arg
      raise argparse.ArgumentTypeError(msg)

  parser = argparse.ArgumentParser(description = 'Extract DGE data from UMI observations. This \
    program allows for downsampling of total genic reads and filtering for potential chimeric \
    reads. It returns a text file containing the gene expression matrix.')

  parser.add_argument('-i', '--inputfile', help = 'File containing UMI observations as produced by \
    GatherMolecularBarcodeDistributionByGene from Drop-seq tools.', required = True)
  parser.add_argument('-o', '--outputfile', help = 'Output file for filtered digital gene \
    expression matrix. TPT histogram and DGE summary files will be written into the same directory',
    required = True)
  parser.add_argument('-w', '--whitelist', help = 'File containing possible true cell barcodes \
    without header and one cell barcode per line. If provided only cell barcodes on this \
    whitelist will be considered. Default = None, which turns cell barcode filtering off.',
    default = None)
  parser.add_argument('--tpt_threshold', help = 'Minimum transcript per transcript value for \
    gene-cell-umi combinations to pass chimeric reads filtering. Default = 0, which does not \
    remove any reads', type = float, default = 0.0)
  parser.add_argument('--sample', help = 'Number of genic reads (int) to be drawn from input for \
    downsampling. Default = None, which turns sampling off.', type = none_or_int, default = 0)
  parser.add_argument('--seed', help = 'Seed (int) for the random number generator used when \
    downsampling. Default = None', type = none_or_int, default = None)

  args = parser.parse_args()

  # read umi observations input file
  eprint('Reading input file...')
  umi_obs = pd.read_csv(args.inputfile, sep = '\t')

  # filter for cell barcodes on whitelist (if whitelist file is provided)
  if args.whitelist:
    eprint('Filtering for cell barcodes on provided whitelist...')
    if args.whitelist.endswith('.gz'):
      with gzip.open(args.whitelist, 'rt') as f:
        whitelist = f.read().splitlines()
    else:
      with open(args.whitelist, 'r') as f:
        whitelist = f.read().splitlines()
    umi_obs = filter_cell_barcodes(umi_obs, whitelist)

  # downsample total genic reads if specified
  if args.sample:
    if args.sample <= umi_obs.Num_Obs.sum():
      eprint('Downsampling to ' + str(args.sample) + ' reads...')
      umi_obs = sample_reads(umi_obs, n = args.sample, seed = args.seed)
    else:
      eprint('Desired sampling size larger than number of input reads! Exiting.')
      quit()

  # calculate tpt
  eprint('Calculating TPT...')
  umi_tpt = calculate_tpt(umi_obs)

  # compute tpt histogram
  tpt_hist = compute_tpt_histogram(umi_tpt)

  # write histogram to text file
  hist_file = os.path.splitext(args.outputfile)[0] + '_tpt_histogram.txt'
  tpt_hist.to_csv(hist_file, sep = '\t', index = False)

  # filter transcripts based on minimum tbt
  eprint('Filtering for chimeric reads (min TPT = ' + str(args.tpt_threshold) + ')...')
  umi_filt = filter_tpt(umi_tpt, tpt_threshold = args.tpt_threshold)

  # calculate dge summary statistics
  dge_stats = dge_summary(umi_filt)

  # write dge summary to file
  dge_stats_file = os.path.splitext(args.outputfile)[0] + '_summary.txt'
  dge_stats.to_csv(dge_stats_file, sep = '\t', index = False)

  # create expression matrix and save to tab deliminted .txt file
  dge = create_dge_matrix(umi_filt)
  eprint('Writing DGE matrix to file...')
  dge.to_csv(args.outputfile, sep = '\t', index = True)

  eprint('Done!')
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
library("optparse")

# create arguments list
option_list = list(
  make_option(c("-i", "--infile"), type = "character", default = NULL,
              help = "Path to file containing digital expression (DGE) input data.",
              metavar = "character"),
  make_option(c("-o", "--outfile"), type = "character", default = NULL,
              help = "Path to output file for perturbation status matrix", metavar = "character"),
  make_option(c("-v", "--vector_pattern"), type = "character", default = NULL,
              help = paste(
                     "Identifier common to all CROP-seq vectors, for instance a prefix or suffix",
                     "to the unique vector identifier. For instance in ID 'CROPseq_dCas9_DS_01'",
                     "the common identifier is 'CROPseq_dCas9_DS_' while '01' is the unique vector",
                     "identifier. This parameter is needed to extract vector expression data from",
                     "DGE data.", sep = " \n\t\t"
                     ),
              metavar = "character"),
  make_option(c("-m", "--min_txs"), type = "integer", default = NULL,
              help = "Minimum number (integer) of transcripts per vector to define perturbation",
              metavar = "integer"),
  make_option(c("-t", "--trim_id"), action = "store_true", default = FALSE,
              help = "Triggers removal of the vector pattern output vector IDs.")
)

# parse arguments
opt_parser = OptionParser(option_list = option_list)
opt = parse_args(opt_parser)

# function to check for required arguments
check_required_args <- function(arg, opt, opt_parser) {

  if (is.null(opt[[arg]])) {

    print_help(opt_parser)
    stop(arg, " argument is required!", call. = FALSE)

  }

}

# check that all required parameters are provided
required_args <- c("infile", "outfile", "vector_pattern", "min_txs")
for (i in required_args) {

  check_required_args(i, opt = opt, opt_parser = opt_parser)

}

# define main function -----------------------------------------------------------------------------

# infer perturbation status of each cell based on simple transcript cutoff
infer_pert <- function(dge, vector_pattern, min_txs) {

  # filter for vector expression data
  vctrs <- grep(dge$GENE, pattern = vector_pattern)
  vctr_dge <- dge[vctrs, ]

  # move gene name to rownames
  rownames(vctr_dge) <- vctr_dge$GENE
  vctr_dge <- vctr_dge[, which(colnames(vctr_dge) != "GENE")]

  # infer perturbation status of each cell
  pert <- as.data.frame((vctr_dge >= min_txs) * 1)

  # add column for vector id
  pert <- cbind.data.frame("VECTOR" = rownames(pert), pert, stringsAsFactors = FALSE)
  rownames(pert) <- NULL

  return(pert)

}

# infer perturbation status of specified input file ------------------------------------------------

message("Loading DGE data...")
dge <- read.table(opt$infile, header = TRUE, stringsAsFactors = FALSE)

message("Inferring perturbation status...")
pert_status <- infer_pert(dge, vector_pattern = opt$vector_pattern, min_txs = opt$min_txs)

# remove vector pattern (common vector identifier) from output if specified
if (opt$trim_id == TRUE) {

  message("Trimming vector pattern from vector IDs...")
  pert_status$VECTOR <- sub(opt$vector_pattern, "", pert_status$VECTOR)

}

message("Writing to output file...")
write.table(pert_status, file = opt$outfile, quote = FALSE, sep = "\t", row.names = FALSE)

message("Done!")
ShowHide 188 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/argschwind/TAPseq_manuscript
Name: tapseq_manuscript
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...