A SingleCell RNASeq pre-processing snakemake workflow

public public 8mo ago Version: v0.32 0 bookmarks

Description

This pipeline is based on snakemake and the dropseq tools provided by the McCarroll Lab . It allows to go from raw data of your Single Cell RNA seq experiment until the final count matrix with QC plots along the way.

This is the tool we use in our lab to improve our wetlab protocol as well as provide an easy framework to reproduce and compare different experiments with different parameters.

It uses STAR to map the reads. It is usable for any single cell protocol using two reads where the first one holds the Cell and UMI barcodes and the second read holds the RNA. Here is a non-exhausitve list of compatible protocols/brands:

This package is trying to be as user friendly as possible. One of the hopes is that non-bioinformatician can make use of it without too much hassle. It will still require some command line execution, this is not going to be fully interactive package.

Authors

Usage

Step 1: Install workflow

If you simply want to use this workflow, download and extract the latest release . If you intend to modify and further develop this workflow, fork this reposity. Please consider providing any generally applicable modifications via a pull request.

In any case, if you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this repository and, once available, its DOI.

Step 2: Configure workflow

Configure the workflow according to your needs via editing the file config.yaml and the samples.tsv following those instructions

Step 3: Execute workflow

All you need to execute this workflow is to install Snakemake via the Conda package manager . Software needed by this workflow is automatically deployed into isolated environments by Snakemake.

Test your configuration by performing a dry-run via

snakemake --use-conda -n --directory $WORKING_DIR

Execute the workflow locally via

snakemake --use-conda --cores $N --directory $WORKING_DIR

using $N cores on the $WORKING_DIR . Alternatively, it can be run in cluster or cloud environments (see the docs for details).

If you not only want to fix the software stack but also the underlying OS, use

snakemake --use-conda --use-singularity

in combination with any of the modes above.

Step 4: Investigate results

After successful execution, you can create a self-contained report with all results via:

snakemake --report report.html

Documentation

You can find the documentation here

Future implementations

I'm actively seeking help to implement the points listed bellow. Don't hesitate to contact me if you wish to contribute.

  • Create a sharing platform where quality plots/logs can be discussed and troubleshooted.

  • Create a full html report for the whole pipeline

  • Multiqc module for drop-seq-tools

  • Implement an elegant "preview" mode where the pipeline would only run on a couple of millions of reads and allow you to have an approximated view before running all of the data. This would dramatically reduce the time needed to get an idea of what filters whould be used.

  • I hope it can help you out in your single cell experiments!

Feel free to comment and point out potential improvements via issues

TODO

  • Add a mixed reference reference for testing purposes

  • Finalize the parameters validation schema

  • Make the debug feature a bit "cleaner". Deal with automatic naming of the debug variables

  • Implement ddseq barcoding strategies

Code Snippets

19
20
script:
    '../scripts/generate_extended_ref.py'
32
33
34
35
36
37
38
shell:
    """umi_tools whitelist\
    --stdin {input}\
    --bc-pattern='(?P<cell_1>.{{{params.cell_barcode_length}}})(?P<umi_1>.{{{params.umi_barcode_length}}})'\
    --extract-method=regex\
    --set-cell-number={params.num_cells}\
    --log2stderr > {output}"""
45
46
shell:
    """cat {input} | cut -f 1 > {output}"""
56
57
script:
    '../scripts/umi_tools_extended_ref.py'
70
71
script:
    '../scripts/repair_barcodes.py'
36
37
shell:
    "gunzip -c -d {input} > {output}"
44
45
shell:
    "gunzip -d -c {input} > {output}"
55
56
shell:
    """sed -e 's/>/>{params.species}/g' {input} > {output}"""
77
78
shell:
    """cat {input.genome1} {input.genome2} > {output}"""
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
run:
    import datetime
    import re
    header1="#!Mixed reference of {} and {}\n".format(
        species_list[0],
        species_list[1])
    header2="#!genome-builds GRC{}{} GRC{}{}\n".format(
        species_list[0].lower()[0],
        build_list[0],
        species_list[1].lower()[0],
        build_list[1])
    header3="#!genome-releases {} {}\n".format(
        release_list[0],
        release_list[1])
    header4="#!genome-date {}\n".format(str(datetime.date.today()))
    header=[header1,header2,header3,header4]
    with open(input.annotation1[0]) as annotation1:
        with open(input.annotation2[0]) as annotation2:
            with open(output[0], 'w') as outfile:
                outfile.writelines(header)
                for line in annotation1:
                    if(not line.startswith('#!')):
                        outfile.write(re.sub('^',species_list[0],line))
                for line in annotation2:
                    if(not line.startswith('#!')):
                        outfile.write(re.sub('^',species_list[1],line))
33
34
shell:
    "gunzip -c -d {input} > {output}"
41
42
shell:
    "gunzip -d -c {input} > {output}"
25
26
27
28
29
30
31
32
33
34
35
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DigitalExpression -m {params.memory}\
    I={input.data}\
    O={output.dense}\
    EDIT_DISTANCE={params.umiBarcodeEditDistance}\
    OUTPUT_LONG_FORMAT={output.long}\
    STRAND_STRATEGY={params.strand_strategy}\
    OUTPUT_READS_INSTEAD=false\
    LOCUS_FUNCTION_LIST={{{params.locus_list}}}\
    MIN_BC_READ_THRESHOLD={params.count_per_umi}\
    CELL_BC_FILE={input.barcode_whitelist}"""
53
54
55
56
57
58
59
60
61
62
63
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DigitalExpression -m {params.memory}\
    I={input.data}\
    O={output.dense}\
    EDIT_DISTANCE={params.umiBarcodeEditDistance}\
    OUTPUT_LONG_FORMAT={output.long}\
    STRAND_STRATEGY={params.strand_strategy}\
    OUTPUT_READS_INSTEAD=true\
    LOCUS_FUNCTION_LIST={{{params.locus_list}}}\
    MIN_BC_READ_THRESHOLD={params.count_per_umi}\
    CELL_BC_FILE={input.barcode_whitelist}"""
86
87
88
89
90
91
92
93
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && SingleCellRnaSeqMetricsCollector -m {params.memory}\
    INPUT={input.data}\
    OUTPUT={output}\
    ANNOTATIONS_FILE={input.refFlat}\
    CELL_BC_FILE={input.barcode_whitelist}\
    RIBOSOMAL_INTERVALS={input.rRNA_intervals}
    """
102
103
script:
    '../scripts/plot_rna_metrics.R'
115
116
script:
    "../scripts/convert_mtx.py"
129
130
shell:
    """pigz -p {threads} {input.barcodes} {input.features} {input.mtx}"""
24
25
26
27
28
29
30
31
32
33
34
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DigitalExpression -m {params.memory}\
    I={input.data}\
    O={output.dense}\
    EDIT_DISTANCE={params.umiBarcodeEditDistance}\
    OUTPUT_LONG_FORMAT={output.long}\
    STRAND_STRATEGY={params.strand_strategy}\
    OUTPUT_READS_INSTEAD=false\
    LOCUS_FUNCTION_LIST={{{params.locus_list}}}\
    MIN_BC_READ_THRESHOLD={params.count_per_umi}\
    CELL_BC_FILE={input.barcode_whitelist}"""
53
54
55
56
57
58
59
60
61
62
63
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DigitalExpression -m {params.memory}\
    I={input.data}\
    O={output.dense}\
    EDIT_DISTANCE={params.umiBarcodeEditDistance}\
    OUTPUT_LONG_FORMAT={output.long}\
    STRAND_STRATEGY={params.strand_strategy}\
    OUTPUT_READS_INSTEAD=true\
    LOCUS_FUNCTION_LIST={{{params.locus_list}}}\
    MIN_BC_READ_THRESHOLD={params.count_per_umi}\
    CELL_BC_FILE={input.barcode_whitelist}"""
74
75
script:
    "../scripts/convert_mtx.py"
88
89
shell:
    """pigz -p {threads} {input.barcodes} {input.features} {input.mtx}"""
112
113
114
115
116
117
118
119
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && SingleCellRnaSeqMetricsCollector -m {params.memory}\
    INPUT={input.data}\
    OUTPUT={output}\
    ANNOTATIONS_FILE={input.refFlat}\
    CELL_BC_FILE={input.barcode_whitelist}\
    RIBOSOMAL_INTERVALS={input.rRNA_intervals}
    """
127
128
script:
    '../scripts/plot_rna_metrics.R'
19
20
wrapper:
    '0.36.0/bio/fastqc'
31
32
wrapper:
    '0.36.0/bio/fastqc'
41
42
wrapper:
    '0.36.0/bio/multiqc'
50
51
wrapper:
    '0.36.0/bio/multiqc'
59
60
script:
    '../scripts/fa2tsv.py'
29
30
31
32
33
34
35
36
37
38
39
shell:
    """cutadapt\
    --max-n {params.max_n}\
    -a file:{input.adapters}\
    -g file:{input.adapters}\
    -q {params.barcode_quality},{params.barcode_quality}\
    --minimum-length {params.barcode_length}\
    --cores={threads}\
    --overlap {params.cell_barcode_length}\
    -o {output.fastq} {input.R1}\
    {params.extra_params} > {log.qc}"""
56
57
58
59
60
61
62
63
64
65
shell:
    """cutadapt\
    -a file:{input.adapters}\
    -g file:{input.adapters}\
    -q {params.read_quality}\
    --minimum-length {params.minimum_length}\
    --cores={threads}\
    --overlap {params.adapters_minimum_overlap}\
    -o {output.fastq} {input.R2}\
    {params.extra_params} > {log.qc}"""
73
74
script:
    '../scripts/clean_cutadapt.py'
89
90
91
92
93
94
95
96
97
shell:
    """repair.sh\
    -Xmx{params.memory}\
    in={input.R1}\
    in2={input.R2}\
    out1={output.R1}\
    out2={output.R2}\
    repair=t\
    threads={threads} 2> {log}"""
105
106
script:
    '../scripts/detect_barcodes.py'
SnakeMake From line 105 of rules/filter.smk
119
120
script:
    '../scripts/plot_adapter_content.R'
SnakeMake From line 119 of rules/filter.smk
128
129
wrapper:
    '0.36.0/bio/multiqc'
137
138
wrapper:
    '0.36.0/bio/multiqc'
22
23
shell:
    """cat {input.annotation} | grep -E "{params.patterns}" > {output}"""
36
37
38
39
40
shell:
    """java -jar -Djava.io.tmpdir={params.temp_directory} {params.picard} CreateSequenceDictionary\
    REFERENCE={input}\
    OUTPUT={output}
    """
52
53
54
55
56
57
58
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && ReduceGtf -m {params.memory}\
    GTF={input.annotation}\
    OUTPUT={output}\
    SEQUENCE_DICTIONARY={input.reference_dict}\
    IGNORE_FUNC_TYPE='null'\
    ENHANCE_GTF='false'"""
70
71
72
73
74
75
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && ConvertToRefFlat -m {params.memory}\
    ANNOTATIONS_FILE={input.annotation}\
    OUTPUT={output}\
    SEQUENCE_DICTIONARY={input.reference_dict}
    """
89
90
91
92
93
94
95
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && CreateIntervalsFiles -m {params.memory}\
    REDUCED_GTF={input.annotation_reduced}\
    SEQUENCE_DICTIONARY={input.reference_dict}\
    O={params.reference_directory}\
    PREFIX={params.prefix}
    """
105
106
107
108
109
110
111
112
113
114
115
116
117
run:
    """
    from math import log2
    from platform import system
    if (system() == 'Darwin'):
        genomeLength = shell("wc -c {} | cut -d' ' -f2".format(snakemake.reference_file), iterable=True)
    else:
        genomeLength = shell("wc -c {} | cut -d' ' -f1".format(snakemake.reference_file), iterable=True)
    genomeLength = int(next(genomeLength))
    referenceNumber = shell('grep "^>" {} | wc -l'.format(snakemake.reference_file), iterable=True)
    referenceNumber = int(next(referenceNumber))
    value = min([18,int(log2(genomeLength/referenceNumber))])
    """
131
132
script:
    '../scripts/prep_star.py'
149
150
151
152
153
154
155
156
157
158
159
shell:
    """mkdir -p {params.genomeDir}; STAR\
    --runThreadN {threads}\
    --runMode genomeGenerate\
    --genomeDir {params.genomeDir}\
    --genomeFastaFiles {input.reference_file}\
    --sjdbGTFfile {input.annotation_file}\
    --sjdbOverhang {params.sjdbOverhang}\
    --genomeChrBinNbits {params.genomeChrBinNbits}\
    --genomeSAsparseD 2
    """
48
49
wrapper:
    "0.27.1/bio/star/align"
SnakeMake From line 48 of rules/map.smk
82
83
wrapper:
    '0.36.0/bio/multiqc'
92
93
shell:
    """pigz -p 4 {input}"""
SnakeMake From line 92 of rules/map.smk
108
109
script:
    '../scripts/merge_bam.py'
SnakeMake From line 108 of rules/map.smk
129
130
131
132
133
134
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && TagReadWithGeneFunction -m {params.memory}\
    INPUT={input.data}\
    OUTPUT={output}\
    ANNOTATIONS_FILE={input.refFlat}
    """
SnakeMake From line 129 of rules/map.smk
149
150
151
152
153
154
155
156
157
shell:
    """
    export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DetectBeadSubstitutionErrors -m {params.memory}\
    I={input}\
    O={output.data}\
    OUTPUT_REPORT={output.report}\
    OUTPUT_SUMMARY={output.summary}\
    NUM_THREADS={threads}
    """
SnakeMake From line 149 of rules/map.smk
173
174
175
176
177
178
179
180
181
182
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DetectBeadSynthesisErrors -m {params.memory}\
    INPUT={input}\
    OUTPUT={output}\
    OUTPUT_STATS={params.out_stats}\
    SUMMARY={params.summary}\
    NUM_BARCODES={params.barcodes}\
    PRIMER_SEQUENCE={params.SmartAdapter}\
    NUM_THREADS={threads}
    """
SnakeMake From line 173 of rules/map.smk
194
195
196
197
198
199
200
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && BamTagHistogram -m {params.memory}\
    TAG=XC\
    I={input}\
    READ_MQ=10\
    O={output}
    """
SnakeMake From line 194 of rules/map.smk
217
218
script:
    '../scripts/plot_yield.R'
SnakeMake From line 217 of rules/map.smk
230
231
script:
    '../scripts/plot_knee_plot.R'
SnakeMake From line 230 of rules/map.smk
17
18
script:
    "../scripts/convert_mtx.py"
46
47
script:
    '../scripts/plot_violine.R'
61
62
script:
    '../scripts/create_summary_stats.R'
16
17
script:
    "../scripts/publication_text.Rmd"
17
18
19
20
21
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && FilterBam -m {params.memory}\
    REF_SOFT_MATCHED_RETAINED={params.species}\
    INPUT={input}\
    OUTPUT={output}"""
38
39
40
41
42
43
44
45
46
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DigitalExpression -m {params.memory}\
    I={input.data}\
    O={output.umi_matrix}\
    SUMMARY={output.summary}\
    EDIT_DISTANCE={params.cellBarcodeEditDistance}\
    CELL_BC_FILE={input.barcode_whitelist}\
    LOCUS_FUNCTION_LIST={{{params.locus_list}}}\
    MIN_BC_READ_THRESHOLD={params.count_per_umi}"""
58
59
script: 
    '../scripts/plot_species_plot.R'
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
from collections import defaultdict
import re

def fill_results(snakemake,pair, adapter_results):
    adapter_pattern = re.compile(pattern="=== Adapter (.*) ===\n")
    with open(snakemake.input[pair], 'r') as logfile:
        line = logfile.readline()
        while(line):
            adapter_matched = re.findall(pattern=adapter_pattern,string=line)
            if(adapter_matched):
                logfile.readline()
                line = logfile.readline().rstrip('.\n')
                line_list = line.split(';')
                adapter_results[adapter_matched[0]]['Pair'] = pair
                adapter_results[adapter_matched[0]]['Sequence'] = line_list[0].split(':')[1].strip()
                if(adapter_matched[0] in adapter_results):
                    adapter_results[adapter_matched[0]]['Times'] += int(line_list[3].split(':')[1].split(' ')[1].strip())
                else:    
                    adapter_results[adapter_matched[0]]['Times'] += int(line_list[3].split(':')[1].split(' ')[1].strip())
            line = logfile.readline()
    return(adapter_results)

adapter_results_R1 = defaultdict(lambda :{'Pair':'','Sequence':'','Times':0})
adapter_results_R2 = defaultdict(lambda :{'Pair':'','Sequence':'','Times':0})


adapter_results_R1 = fill_results(snakemake, 'R1', adapter_results_R1)
adapter_results_R2 = fill_results(snakemake, 'R2', adapter_results_R2)

with open(snakemake.output[0],'w') as outfile:
    outfile.write('Adapter,Sequence,Pair,Count\n')
    for adapter in adapter_results_R1:
        outfile.write("{},{},{},{}\n".format(adapter, adapter_results_R1[adapter]['Sequence'],adapter_results_R1[adapter]['Pair'],adapter_results_R1[adapter]['Times']))
    for adapter in adapter_results_R2:
        outfile.write("{},{},{},{}\n".format(adapter, adapter_results_R2[adapter]['Sequence'],adapter_results_R2[adapter]['Pair'],adapter_results_R2[adapter]['Times']))
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
import os
import subprocess


samples = snakemake.params['samples']
barcodes = {}
features = {}

out_folder = os.path.dirname(snakemake.output['mtx'])
out_barcodes = snakemake.output.barcodes
out_features = snakemake.output.features
mtx = snakemake.output.mtx
temp_mtx = os.path.join(out_folder,'temp_umi.mtx')
header = os.path.join(out_folder,'header.mtx')
n_lines=0
barcode_index = 1
feature_index = 1

with open(temp_mtx,'w') as mtx_stream:
	for i,sample in enumerate(snakemake.input):
		if samples[i] not in sample:
			sys.exit("Sample name not found in file path")
		with open(sample,'r') as input_file:
			next(input_file) # skip first line
			for line in input_file:
				barcode,feature,count = line.strip().split('\t')
				if(not isinstance(samples,str)):
					barcode = samples[i] + '_' + barcode
				if(barcode not in barcodes):
					barcodes[barcode] = barcode_index
					barcode_index += 1
				if(feature not in features):
					features[feature] = feature_index
					feature_index += 1
				mtx_stream.write('{} {} {}\n'.format(features[feature],barcodes[barcode],count))
				n_lines +=1

with open(out_barcodes,'w') as barcodes_outfile:
	for barcode in barcodes:
		barcodes_outfile.write('{}\n'.format(barcode))

with open(out_features,'w') as features_outfile:
	for feature in features:
		features_outfile.write('{}\n'.format(feature))

with open(header,'w') as header_outfile:
	header_outfile.write("%%MatrixMarket matrix coordinate real general\n")
	header_outfile.write('{} {} {}\n'.format(len(features), len(barcodes), n_lines))

subprocess.call("cat {} {} > {}".format(header, temp_mtx, mtx), shell=True)

os.remove(temp_mtx)
os.remove(header)
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
debug_flag <- FALSE
# check if DEBUG flag is set
if (snakemake@config$DEBUG) {
  debug_flag <- TRUE
  message("In debug mode: saving R objects to inspect later")
  path_debug <- file.path(snakemake@config$LOCAL$results, "debug")
  dir.create(path_debug, showWarnings = FALSE)
  save(snakemake, file = file.path(path_debug, "create_summary_stats_snakemake.rdata"))
}
#### /debug

library(dplyr) # Dataframe manipulation
library(Matrix) # Sparse matrices
library(stringr)
library(RColorBrewer)
library(devtools)
library(Seurat)
library(plotly)

samples <- snakemake@params$sample_names
batches <- snakemake@params$batches


# importing Seurat object

seuratobj <- readRDS(file  = file.path(snakemake@input$R_objects))
meta.data <- seuratobj@meta.data

# subset only highest stamps as set in config.yaml
# this is necessary since there are more stamps selected as a safty margin which now have to be taken out again to calculate stats.

meta.data.sub <- meta.data %>%
  group_by(orig.ident) %>%
  arrange(desc(nCounts)) %>%
  slice(1:expected_cells[1]) %>% # makes sure only # expected cell are kept
  as.data.frame()

gini_index <- function (x, weights = rep(1, length = length(x))) {
    ox      <- order(x)
    x       <- x[ox]
    weights <- weights[ox] / sum(weights)
    p       <- cumsum(weights)
    nu      <- cumsum(weights * x)
    n       <- length(nu)
    nu      <- nu/nu[n]
    sum(nu[-1] * p[-n]) - sum(nu[-n] * p[-1])
}

#------------------------------------ post-filter-stats
# stats only based after keeping most abundant `expected-cells`
# taken out from seurat object generated in violine_plot rule.

# median calculator

stats_post <- meta.data.sub %>%
  group_by(orig.ident) %>%
  summarise(
    Total_nb_reads                 = sum(nCounts),
    Nb_STAMPS                      = mean(expected_cells), # should be all the same anyway..
    Median_reads_per_STAMP         = round(median(nCounts), 2),
    Mean_reads_per_STAMP           = round(mean(nCounts), 2),
    Total_nb_UMIs                  = sum(nUMI),
    Median_UMIs_per_STAMP          = round(median(nUMI), 2),
    Mean_UMIs_per_STAMP            = round(mean(nUMI), 2),
    Mean_UMIs_per_Gene             = round(mean(umi.per.gene), 2),
    Median_number_genes_per_STAMP  = round(median(nGene), 2),
    Mean_number_genes_per_STAMP    = round(mean(nGene), 2),
    Mean_Ribo_pct                  = round(100 * mean(pct.Ribo), 2),
    Mean_Mito_pct                  = round(100 * mean(pct.mito), 2),
    Mean_Count_per_UMI             = round(sum(nCounts) / sum(nUMI), 2),
    Read_length                    = mean(read_length), # should be all the same anyway..
    Number_barcodes_used_for_debug = n()
  ) %>%
  as.data.frame()

row.names(stats_post) <- stats_post$orig.ident



# highest, lowest count/UMI Stamp
# pre STAMP stats

# hist out goes into knee plots
# 'results/logs/{sample}_hist_out_cell.txt'
# """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && BAMTagHistogram -m {params.memory}\
# TAG=XC\
# https://hpc.nih.gov/apps/dropseq.html
# there is not hint in documentation on any filtering (only read quality)

#------------------------------------ pre-filter-stats
# calculating statistics based on barcodes before thresholding it (i.e. keeping the most abudant barcodes based on `expected cells`)
# This is based on 'logs/{sample}_hist_out_cell.txt' generated by `BAMTagHistogram` from dropseq-tools
# TOOO: The documentation only mentions duplicate and quality filter. But it seems do more filtering since most barcodes are expected to have only one read assinged but there are usually more. Find out.
# https://hpc.nih.gov/apps/dropseq.html

# TODO: Nr UMI for pre filter

stats_pre <- data.frame(matrix(nrow = length(samples), ncol = 10))
colnames(stats_pre) <- c(
  "Sample",
  "Batch",
  "Total_raw_reads",
  "Nr_barcodes_total",
  "Nr_barcodes_more_than_1_reads",
  "Nr_barcodes_more_than_10_reads",
  "percentile99",
  "percentile95",
  "percentile50",
  "Gini-index"
)

stats_pre[, "Sample"] <- samples
stats_pre[, "Batch"]  <- batches


for (i in 1:length(samples)) {
  # importing 'logs/{sample}_hist_out_cell.txt'
  hist_out <- read.table(
    file = snakemake@input$hist_cell[i],
    header = FALSE, stringsAsFactors = FALSE
  )
  mysample <- samples[i]
  reads <- hist_out$V1
  barcodes <- hist_out$V2
  # calculations on reads
  # total reads are not sum(reads)! Needs to be taken from
  # results/logs/cutadapt/sample1_R1.qc.txt
  #
  # read in full text file "sample_R2.qc.txt"
  filedump <- readLines(snakemake@input$R2qc[i])
  # subset line matching a pattern
  total_reads <- filedump[grep("Total reads processed:", filedump)] %>% #extract line
    str_extract("[0-9,]+") %>% # extract number from line
    str_replace_all(",", "") %>% # delete comma for subsequent numeric casting
    as.numeric()
  reads_cumsum      <- cumsum(reads)
  reads_cumsum_perc <- (reads_cumsum / sum(reads))
  # reporting stats
  stats_pre[i, "Total_raw_reads"]                <- total_reads
  stats_pre[i, "Reads_assigned_to_expected_STAMPs"] <- sum(reads[1:stats_post$Nb_STAMPS[i]])
  stats_pre[i, "Nr_barcodes_total"]              <- length(barcodes)
  stats_pre[i, "percentile99"]                   <- which.min(reads_cumsum_perc < 0.99)
  stats_pre[i, "percentile95"]                   <- which.min(reads_cumsum_perc < 0.95)
  stats_pre[i, "percentile50"]                   <- which.min(reads_cumsum_perc < 0.50)
  stats_pre[i, "Nr_barcodes_more_than_1_reads"]  <- sum(reads > 1)
  stats_pre[i, "Nr_barcodes_more_than_10_reads"] <- sum(reads > 10)
  stats_pre[i, "Gini-index"]                     <- round(gini_index(reads), 2)
  expected_cells <- as.numeric(filter(stats_post, orig.ident==mysample) %>% select(Nb_STAMPS))
  # % of reads left after applying expected_cells cuttoff
  stats_post[mysample, "Pct_reads_after_filter_expected_cells"] <-
    round(100 * (
                 reads_cumsum[expected_cells] / sum(reads)
    ), 2)
  # % of reads left after applying all filters including mapping etc.
  # Thats the effecive usable reads of the sequencing run
  stats_post[mysample, "Pct_reads_after_filter_everything"] <-
    round(100 * (
                 filter(stats_post, orig.ident==mysample) %>% select(Total_nb_reads) /
                 stats_pre [i, "Total_raw_reads"]), 2
         )
}

stats_pre <- stats_pre %>%
  arrange(Sample)

stats_post <- stats_post %>%
  arrange((orig.ident))

# output
write.csv(stats_pre,  file.path(snakemake@output$stats_pre))
write.csv(stats_post, file.path(snakemake@output$stats_post)) # writes table for excel

if (debug_flag) {
  save.image(file = file.path(path_debug, "create_summary_stats_workspace.rdata"))
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
from Bio import SeqIO
import gzip
from collections import Counter

fastq_parser = SeqIO.parse(gzip.open(snakemake.input.R1, "rt"), "fastq")
sequences = []
n=0
for fastq_R1 in fastq_parser:
		sequences.append(str(fastq_R1.seq))
		n+=1
		if(n==10000000):
			break
def parse_barcodes(fastq_parser):
	counts={}
	ranges = range(5,len(sequences[0]))

	for cell_bc_length in ranges:
		counts[cell_bc_length] = list()
		for fastq_R1 in sequences:
			counts[cell_bc_length].append(fastq_R1[0:cell_bc_length])
	return(counts)



counts = parse_barcodes(fastq_parser)

with open(snakemake.output[0], "w") as outfile:
	outfile.write('bc_length,first_counts\n')
	for cell_bc_length in counts:
		outfile.write('{},{}\n'.format(cell_bc_length,str(Counter(counts[cell_bc_length]).most_common(100)[0][1])))
 9
10
11
12
13
14
15
16
17
import sys
from Bio import SeqIO

number_bp = 12

with open( snakemake.output['tsv'], "w" ) as output:
    for seq_record in SeqIO.parse(snakemake.input['fa'], "fasta"):
        myline = (str(seq_record.id)) + "\t" + str(seq_record.seq[0:(number_bp + 1)]) + "\n"
        output.write(myline)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
from itertools import combinations, product
from collections import defaultdict
from copy import deepcopy
import pickle
from shutil import copyfile


def save_obj(obj, name):
    with open(name, 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def generate_all(barcode, reference, mapping, edit_distance):
    mutants = generate_mutants(barcode, edit_distance)
    for mutant in mutants:
        if(mutant not in reference):
            reference.add(mutant)
            mapping[edit_distance][mutant]['ref'] = barcode
            mapping[edit_distance][mutant]['count'] = 0
            mapping[edit_distance][mutant]['lanes'] = {'1':0,'2':0,'3':0,'4':0,'5':0,'6':0,'7':0,'8':0}

    mapping['unknown']=defaultdict()
    return(reference, mapping)


def generate_mutants(sequence, d=1):
    """Taken from stackoverflow: https://stackoverflow.com/a/19823295/9178565"""
    N = len(sequence)
    letters = 'ACGTN'
    pool = list(sequence)
    for indices in combinations(range(N), d):
        for replacements in product(letters, repeat=d):
            skip = False
            for i, a in zip(indices, replacements):
                if pool[i] == a: skip = True
            if skip: continue

            keys = dict(zip(indices, replacements))
            yield ''.join([pool[i] if i not in indices else keys[i] 
                           for i in range(N)])

# Create empty sets and defaultdicts
barcode_ref = set()
mapping=defaultdict(dict)

# Initiate ref and mapping with the given barcodes
with open(snakemake.input.whitelist,'r') as ref_file:
    for line in ref_file.readlines():
        barcode = line.strip()
        barcode_ref.add(barcode)
        mapping[0][barcode]=defaultdict(dict)
        mapping[0][barcode]['ref'] = barcode
        mapping[0][barcode]['count'] = 0
        mapping[0][barcode]['lanes'] = {'1':0,'2':0,'3':0,'4':0,'5':0,'6':0,'7':0,'8':0}

barcode_ext_ref = deepcopy(barcode_ref)
# For now edit distance is one, but can be extended to a higher number later on.
max_edit_distance = 1
for edit_distance in range(1,max_edit_distance+1):
    mapping[edit_distance]=defaultdict(dict)
    for barcode in mapping[0]:
        (barcode_ext_ref,mapping) = generate_all(barcode, barcode_ext_ref, mapping, edit_distance)

# Delete given barcodes out of new reference. This helps later on when running "repair_barcodes.py"
barcode_ref = set(mapping[0])
barcode_ext_ref.difference_update(barcode_ref)

# Save mapping and references to reuse later.
save_obj(obj=mapping, name=snakemake.output.barcode_mapping)
save_obj(obj=barcode_ref,name=snakemake.output.barcode_ref)
save_obj(obj=barcode_ext_ref,name=snakemake.output.barcode_ext_ref)


copyfile(snakemake.input['whitelist'], snakemake.output['barcodes'])
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
import pysam
import re
import csv
from Bio import SeqIO
import gzip
from collections import defaultdict
import sys

#This function fills in a dict with readids
#and their corresponding cell and umi barcodes until it finds
#a specific read id

discard_secondary_alignements = snakemake.params['discard_secondary_alignements']

barcodes_struct = {
	'BC_start':snakemake.params['BC_start'],
	'BC_end':snakemake.params['BC_end'],
	'UMI_start':snakemake.params['UMI_start'],
	'UMI_end':snakemake.params['UMI_end']
	}

def parse_barcodes(fastq_parser, query_name, read_barcodes, barcodes_struct):
	for fastq_R1 in fastq_parser:
		# Some sequencers give a /1 and /2 to R1 and R2 read ids respectively. This attempts to solve the issue #69.
		if '/' in fastq_R1.id:
			R1_id = fastq_R1.id[:fastq_R1.id.find("/")]
		else:
			R1_id = fastq_R1.id
		read_barcodes[R1_id]['XC'] = str(fastq_R1.seq)[barcodes_struct['BC_start']:barcodes_struct['BC_end']]
		read_barcodes[R1_id]['XM'] = str(fastq_R1.seq)[barcodes_struct['UMI_start']:barcodes_struct['UMI_end']]
		if(read_barcodes[R1_id]['XM']==''):
			sys.SystemExit('UMI empty for read {}.\n The barcode is: {}.\nWhole entry is:{}'.format(R1_id, fastq_R1.seq,fastq_R1))
		if (R1_id == query_name):
			return(fastq_parser,read_barcodes)
	return(fastq_parser,read_barcodes)

infile_bam = pysam.AlignmentFile(snakemake.input[0], "rb")

fastq_parser = SeqIO.parse(gzip.open(snakemake.input[1], "rt"), "fastq")

outfile = pysam.AlignmentFile(snakemake.output[0], "wb", template=infile_bam)

read_barcodes = defaultdict(lambda :{'XC':'','XM':''})

for bam_read in infile_bam:
	if(discard_secondary_alignements & bam_read.is_secondary):
		continue
	if (bam_read.query_name) in read_barcodes:
		current_barcodes = read_barcodes.pop(bam_read.query_name)
		tags = bam_read.get_tags()
		tags.extend([
			('XC', current_barcodes['XC'],'Z'),
			('XM', current_barcodes['XM'],'Z')])
		bam_read.set_tags(tags)
	else:
		fastq_parser,read_barcodes = parse_barcodes(fastq_parser, bam_read.query_name, read_barcodes, barcodes_struct)
		if (bam_read.query_name) not in read_barcodes:
			raise SystemExit('Read {} from mapped file is missing in reference fastq file!'.format(bam_read.query_name))
			os.remove(snakemake.output[0])
		current_barcodes = read_barcodes.pop(bam_read.query_name)
		tags = bam_read.get_tags()
		tags.extend([
			('XC', current_barcodes['XC'],'Z'),
			('XM', current_barcodes['XM'],'Z')])
		bam_read.set_tags(tags)
	outfile.write(bam_read)
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
debug_flag <- FALSE
# check if DEBUG flag is set
if (snakemake@config$DEBUG) {
  debug_flag <- TRUE
  message("In debug mode: saving R objects to inspect later")
  path_debug <- file.path(snakemake@config$LOCAL$results, "debug")
  dir.create(path_debug, showWarnings = FALSE)
  save(snakemake, file = file.path(path_debug, "plot_adapter_content_snakemake.rdata"))
}

#------------------------------------ debugging

library(ggplot2)
library(dplyr)
library(viridis)

samples <- snakemake@params$sample_names
batches <- snakemake@params$batches

#Read files into a list
cutadapt_clean_list <- list()
for (i in seq_along(samples)){
  cutadapt_clean           <- read.csv(snakemake@input[[i]][1], header = TRUE)
  cutadapt_clean$Sample    <- samples[i]
  cutadapt_clean$Batch     <- batches[i]
  cutadapt_clean_list[[i]] <- cutadapt_clean
}

# combining adaptors accross samples
cutadapt_counts <- Reduce(rbind, cutadapt_clean_list, NULL)

#Transform it into percentages
cutadapt_counts <- group_by(cutadapt_counts, Sample, Pair) %>%
  mutate(Percentages=Count/sum(Count))
#      Adapter                           Sequence Pair Count  Sample  Batch
# 1 PrefixNX/1                AGATGTGTATAAGAGACAG   R1     7 sample1 Batch1
# ...
# 6  Trans2_rc CTGTCTCTTATACACATCTCCGAGCCCACGAGAC   R2     5 sample2 Batch2

p1 <- ggplot(cutadapt_counts, aes(x=Sample, y = Percentages, fill = Adapter))  +
  geom_bar(stat = "identity") +
  facet_grid(Pair ~ Batch, scales = "free") +
  theme_minimal() +
  ggtitle("Comparison accross samples of adapter content") +
  scale_x_discrete(label=abbreviate) +
  scale_y_continuous(labels = scales::percent) +
  theme(axis.text.x=element_text(angle = 90, hjust = 0)) +
  scale_fill_viridis(discrete=TRUE)

ggsave(plot=p1, filename=snakemake@output$pdf)

if (debug_flag) {
  save.image(file = file.path(path_debug, "plot_adapter_content_workspace.rdata"))
}
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
debug_flag <- FALSE
if (snakemake@config$DEBUG) {
  debug_flag <- TRUE
  message("In debug mode: saving R objects to inspect later")
  path_debug <- file.path(snakemake@config$LOCAL$results, "debug")
  dir.create(path_debug, showWarnings = FALSE)
  save(snakemake, file = file.path(path_debug,
     paste0("plot_knee_plot_snakemake_",
            attr(snakemake, "wildcard")$sample, ".rdata"))
          )
}
#### /debug

library(ggplot2)
library(plyr)
# Create the cumulative plot
data=read.table(file = snakemake@input[[1]][1], header=FALSE, stringsAsFactors=FALSE)
barcodes = data$V2
total_reads = sum(data$V1)
y_raw=cumsum(data$V1)
y=(y_raw/total_reads)
x = 1:length(y)
plot_data = data.frame(rank = x,cum_sum=y, Barcode=data$V2)
x_scale = snakemake@params$cells * 4

knee_plot <- ggplot(plot_data, aes(x=rank, y=cum_sum)) +
 geom_point(size = 0.1) +
 xlim(0,x_scale) +
 geom_vline(xintercept=snakemake@params$cells, linetype="dashed", color = "red") +
 ggtitle(paste0(snakemake@wildcards$sample, '\nTotal reads: ', prettyNum(total_reads))) +
 theme(plot.title = element_text(size=10)) +
 labs(x='STAMPS', y='Cumulative fraction of reads') +
 scale_y_continuous(labels = scales::percent) +
 theme_classic()

if(!is.null(snakemake@input$barcodes))
{
  selected_cells <- read.csv(snakemake@input$barcodes, header=FALSE, stringsAsFactors=FALSE)
  knee_plot <- knee_plot +
    geom_point(data = plot_data[plot_data$Barcode %in% selected_cells$V1,],
               aes(x=rank, y=cum_sum, color='Selected'), size=0.1) +
    scale_color_manual(values=c('Selected'='green'))
}
ggsave(knee_plot, file=snakemake@output$pdf, width = 4, height = 3)


if (debug_flag) {
  library(gridExtra)
  library(grid)
  # potential usefull to change lag of diff calculation
  mylag <- 1
  # data <- read.table(file = file, header=FALSE, stringsAsFactors=FALSE)
  # head(data,3)
  #        V1           V2
  # 1 1145137 CCCTTCGTCTGC
  # 2 1039974 ATAGTTTTTTAA
  # 3  912199 GCATGAAACTTC

  # borrowed from https://stackoverflow.com/questions/6836409/finding-local-maxima-and-minima
  localMaxima <- function(x) {
    # Use -Inf instead if x is numeric (non-integer)
    y <- diff(c(-.Machine$integer.max, x)) > 0L
    rle(y)$lengths
    y <- cumsum(rle(y)$lengths)
    y <- y[seq.int(1L, length(y), 2L)]
    if (x[[1]] == x[[2]]) {
      y <- y[-1]
    }
    y
  }

  reads        <- data$V1
  barcodes     <- data$V2
  total_reads  <- sum(reads)
  reads_cumsum <- cumsum(reads)
  # 1st dervivative (diff) needs also to be padded to keep same vector length
  reads_diff   <- c(diff(reads,lag=mylag,differences = 1),rep(0,mylag))

  # 2nd derivative: twice as much padding:
  reads_diff_diff   <- c(diff(reads,lag=mylag,differences = 2),rep(0,mylag*2))
  reads_cumsum_perc <- (reads_cumsum/total_reads)
  x                 <- 1:length(reads_cumsum_perc)
  plot_data         <- data.frame(rank       = x,
                                  cum_sum    = reads_cumsum_perc,
                                  read_count = reads,
                                  Barcode    = data$V2,
                                  diff       = reads_diff,
                                  diffdiff   = reads_diff_diff)

  # head(plot_data,6)
  #   rank     cum_sum read_count      Barcode    diff diffdiff
  # 1    1 0.007192916    1145137 CCCTTCGTCTGC -105163   -22612
  # 2    2 0.013725274    1039974 ATAGTTTTTTAA -127775    14025
  # 3    3 0.019455043     912199 GCATGAAACTTC -113750    61486
  # 4    4 0.024470318     798449 GTGTGGGTCTCT  -52264    34985
  # 5    5 0.029157308     746185 CGTACTGACTAC  -17279   -60441
  # 6    6 0.033735764     728906 GTTCGTCCCGCC  -77720    69104
  mystats <- paste0("| Nr barcodes total: ", length(barcodes), ' \n ',
                    "Nr barcodes for 50% reads: ", which.min(reads_cumsum_perc<0.50)," | ",
                    "Nr barcodes for 95% reads: ", which.min(reads_cumsum_perc<0.95)," | ",
                    "Nr barcodes for 99% reads: ", which.min(reads_cumsum_perc<0.99)
                    )

  # x_scale        <- which(reads_cumsum_perc>0.99)[1]
  x_scale        <- snakemake@params$cells * 4
  plot_data_head <- head(plot_data, x_scale)
  # plot_data_head$reads_diff_smooth <- predict(loess(diff~rank,data=plot_data_head))
  plot_data_head$reads_diffdiff_smooth <- predict(loess(diffdiff~rank,span=0.2,data=plot_data_head))

  ## Finding knee in knee-plot:
  # Best approach so far is to calclate the 2nd derivative of the read counts per STAMP (reads_diff_diff), smooth it (loess) and find it's maxima. Since ther can be several maxima, they are just ploted and it's up to the user to visually asses and decide.
  # finding local maxima in 2nd derivative:
  #https://stackoverflow.com/questions/6836409/finding-local-maxima-and-minima
  loc.max <- localMaxima(plot_data_head$reads_diffdiff_smooth)
  # local maxima are then colored green as lines in plots
  plot_data_head_sub <- plot_data_head[loc.max,]
  # which.peaks(plot_data_head$reads_diff_smooth2)

  knee_plot_ext <- ggplot(plot_data_head, aes(x=rank, y=cum_sum)) +
    xlim(0,x_scale) +
    ylim(0,1) +
    geom_text(data=plot_data_head_sub,aes(label = round(cum_sum,2)),nudge_y=-0.05, vjust = "inward", hjust = "inward") +
    geom_vline(xintercept=snakemake@params$cells, linetype="dashed", color = "red") +
    #   geom_vline(xintercept=100, linetype="dashed", color = "red") +
    geom_vline(xintercept=loc.max, col="lightgreen") +
    geom_point(size = 0.1)  +
    ggtitle(paste0(snakemake@wildcards$sample, '\nTotal reads: ', prettyNum(total_reads), mystats)) +
    theme(plot.title = element_text(size=10)) +
    labs(x='STAMPS', y='Cumulative fraction of reads')
  read_count_plot <- ggplot(plot_data_head, aes(x=rank, y=read_count)) +
    geom_text(data=plot_data_head_sub,aes(label = read_count),nudge_y=-0.05, vjust = "inward", hjust = "inward", check_overlap = TRUE) +
    #   geom_smooth() +
    xlim(0,x_scale) +
    #   ylim(0,1000) +
    geom_vline(xintercept=snakemake@params$cells, linetype="dashed", color = "red") +
    #   geom_vline(xintercept=100, linetype="dashed", color = "red") +
    geom_vline(xintercept=loc.max, col="lightgreen") +
    geom_point(size = 0.1)  +
    theme(plot.title = element_text(size=10)) +
    labs(x='STAMPS', y='Read counts per STAMP')

  # knee_plot_ext = knee_plot_ext + scale_y_continuous(labels = scales::percent)
  diff_plot <- ggplot(plot_data_head, aes(x=rank, y=diff)) +
    geom_text(data=plot_data_head_sub,aes(label = diff),nudge_y=-0.05, vjust = "inward", hjust = "inward", check_overlap = TRUE) +
    #   geom_smooth() +
    xlim(0,x_scale) +
    #   ylim(0,1000) +
    geom_vline(xintercept=snakemake@params$cells, linetype="dashed", color = "red") +
    #   geom_vline(xintercept=100, linetype="dashed", color = "red") +
    geom_vline(xintercept=loc.max, col="lightgreen") +
    geom_point(size = 0.1)  +
    theme(plot.title = element_text(size=10)) +
    #1st derivative read count diff to next STAMP")
    labs(x='STAMPS', y="1st derivative of read counts")
  diff_diff_plot <- ggplot(plot_data_head,
                           aes(x=rank, y=reads_diffdiff_smooth)) +
    geom_text(data=plot_data_head_sub,
              aes(label = rank),nudge_y=-1,
              vjust = "inward",
              hjust = "inward",
              check_overlap = TRUE) +
    xlim(0,x_scale) +
    ylim(-30,30) +
    geom_vline(xintercept=snakemake@params$cells,
               linetype="dashed", color = "red") +
    #   geom_vline(xintercept=100, linetype="dashed", color = "red") +
    geom_vline(xintercept=loc.max, col="lightgreen") +
    geom_point(size = 0.1)  +
    theme(plot.title = element_text(size=10)) +
    labs(x='STAMPS', y='2nd derivative of read counts')

  if(!is.null(snakemake@input$barcodes))
  {
    selected_cells <- read.csv(snakemake@input$barcodes, header=FALSE, stringsAsFactors=FALSE)
    knee_plot_ext <- knee_plot_ext +
      geom_point(data = plot_data_head[plot_data_head$Barcode %in% selected_cells$V1,],
                 aes(x=rank, y=cum_sum, color='Selected'), size=0.1) +
      scale_color_manual(values=c('Selected'='green')) +
      theme(legend.position="none")
    diff_diff_plot <- diff_diff_plot +
      geom_point(data = plot_data_head[plot_data_head$Barcode %in% selected_cells$V1,],
                 aes(x=rank, y=reads_diffdiff_smooth, color='Selected'), size=0.1) +
      scale_color_manual(values=c('Selected'='green')) +
      theme(legend.position="none")
  }
  #   scale_y_continuous(position = "right")
  gp1 <- ggplotGrob(knee_plot_ext)
  gp2 <- ggplotGrob(read_count_plot)
  gp3 <- ggplotGrob(diff_plot)
  gp4 <- ggplotGrob(diff_diff_plot)
  # grid::grid.newpage()
  # gg <- grid::grid.draw(rbind(gp1, gp2, gp3, gp4, size = "last"))
  gg <- gridExtra::arrangeGrob(rbind(gp1, gp2, gp3, gp4, size = "last"))
  # gg <- gridExtra::arrangeGrob(gp1, gp2, gp3, gp4, ncol = 1)

  # if barcode.csv is present in base directory, only use barcodes in there (rule plot_knee_plot_whitelist in map.smk)
  ggsave(gg, file=paste0(snakemake@output$pdf, "_extended.pdf"), width = 9, height = 11)
}

if (debug_flag) {
  save.image(file = file.path(path_debug,
       paste0("plot_knee_plot_workspace_",
              attr(snakemake, "wildcard")$sample, ".rdata"))
            )
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
library(ggplot2)
library(tidyr)
library(gridExtra)
library(grid)
library(viridis)
debug_flag <- FALSE
if (snakemake@config$DEBUG) {
  debug_flag <- TRUE
  message("In debug mode: saving R objects to inspect later")
  path_debug <- file.path(snakemake@config$LOCAL$results, "debug")
  dir.create(path_debug, showWarnings = FALSE)
  save(snakemake, file = file.path(path_debug, "plot_rna_metrics_snakemake.rdata"))
}

#### /debug

mydata <- read.csv(file = snakemake@input$rna_metrics, header = T,
                   stringsAsFactors = F, skip = 6, sep = "\t")
mydata <- mydata[order(mydata$PF_ALIGNED_BASES, decreasing = T), ]
mydata_pct <- mydata[, c("READ_GROUP",
                         "PCT_INTERGENIC_BASES",
                         "PCT_UTR_BASES",
                         "PCT_RIBOSOMAL_BASES",
                         "PCT_INTRONIC_BASES",
                         "PCT_CODING_BASES")
                     ]
colnames(mydata_pct) = c('Cell Barcode', 'Intergenic', 'UTR', 'Ribosomial', 'Intronic', 'Coding')

mydata <- mydata[, c("READ_GROUP",
                     "INTERGENIC_BASES",
                     "UTR_BASES",
                     "RIBOSOMAL_BASES",
                     "INTRONIC_BASES",
                     "CODING_BASES")
                 ]
colnames(mydata) = c('Cell Barcode', 'Intergenic', 'UTR', 'Ribosomial', 'Intronic', 'Coding')

# converting into long format for ploting
mydata_long <- mydata %>% gather("Read Overlap", count, -"Cell Barcode")

# Keep the original order of the barcodes using factor and levels.
mydata_long$`Cell Barcode` <- factor(mydata_long$`Cell Barcode`,
                                 levels = factor(unique(mydata_long$`Cell Barcode`)))
mydata_long$`Read Overlap` <- factor(mydata_long$`Read Overlap`,
                                   levels = unique(mydata_long$`Read Overlap`))

p1 <- ggplot(mydata_long, aes(x = `Cell Barcode`, y = count, fill = `Read Overlap`)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 0), legend.position = "none")
p1 <- p1 + labs(title = paste(nrow(mydata),
                              "selected barcodes for",
                              snakemake@wildcards$sample),
                x = "Barcodes", y = "Bases")
p1 <- p1 + theme(axis.title.x = element_blank(),
                 axis.text.x = element_blank(),
                 axis.ticks.x = element_blank())
p1 <- p1 + scale_y_continuous(labels = scales::scientific)
p1 <- p1 + scale_fill_viridis(discrete = TRUE, option = "viridis")


mydata_long_pct <- mydata_pct %>% gather("Read Overlap", fraction, -"Cell Barcode")
# Keep the original order of the barcodes using factor and levels.
mydata_long_pct$`Cell Barcode` <- factor(mydata_long_pct$`Cell Barcode`,
                                     levels = factor(unique(mydata_long_pct$`Cell Barcode`)))
mydata_long_pct$`Read Overlap` <- factor(mydata_long_pct$`Read Overlap`,
                                       levels = unique(mydata_long_pct$`Read Overlap`))

p2 <- ggplot(mydata_long_pct, aes(x = `Cell Barcode`, y = fraction, fill = `Read Overlap`)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 0, size=8, vjust = 0.05), legend.position = "bottom") +
  labs(x = "Barcodes", y = "%Bases") +
  scale_y_continuous(labels = scales::percent) + scale_fill_viridis(discrete = TRUE, option = "viridis")
# This allows to align the main plots so that we can relate both directly with the label from the bottom one.
gp1 <- ggplotGrob(p1)
gp2 <- ggplotGrob(p2)
pdf(file = snakemake@output$pdf, width = 16, height = 13)
grid::grid.newpage()
grid::grid.draw(rbind(gp1, gp2, size = "last"))
dev.off()

if (debug_flag) {
  save.image(file = file.path(path_debug, "plot_rna_metrics_workspace.rdata"))
}
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
debug_flag <- FALSE
if (snakemake@config$DEBUG) {
  debug_flag <- TRUE
  message("In debug mode: saving R objects to inspect later")
  path_debug <- file.path(snakemake@config$LOCAL$results, "debug")
  dir.create(path_debug, showWarnings = FALSE)
  save(snakemake, file = file.path(
    path_debug,
    paste0("plot_species_plot_snakemake_", attr(snakemake, "wildcard")$sample, ".rdata")
  ))
}

#### /debug

categorizeCellsUsingKneeKnownNumCellsPaper <- function(
                                                   digitalExpressionFileO1,
                                                   digitalExpressionFileO2,
                                                   organismOne,
                                                   organismTwo,
                                                   pureRatio = 0.2,
                                                   numCells,
                                                   numBeads,
                                                   point.cex = 1.5,
                                                   xlim_range = NULL,
                                                   ylim_range = NULL,
                                                   category = "transcripts") {
  dfFull <- getNumTranscriptsPerCellBarcodeByOrganismPair(
                                                    digitalExpressionFileO1,
                                                    digitalExpressionFileO2,
                                                    organismOne,
                                                    organismTwo,
                                                    category)
  dfFull <- dfFull[order(dfFull$total, decreasing = T), ]
  dfFull$ratio_one <- dfFull[, 2] / dfFull[, 4]
  dfFull <- head(dfFull, n = numBeads)
  df <- head(dfFull, n = numCells)

  dfNoCall <- dfFull[-1:-numCells, ]
  if (dim(dfNoCall)[1] > 0) {
    dfNoCall$organism <- "No Call"
  }

  df$organism <- "Mixed"

  idx <- which(df$ratio_one >= (1 - pureRatio))
  # checks if the species is actually assigned at all
  if (length(idx) > 0) {
    df[idx, ]$organism <- organismOne
  }
  idx <- which(df$ratio_one <= (pureRatio))
  if (length(idx) > 0) {
    df[idx, ]$organism <- organismTwo
  }

  result <- rbind(df, dfNoCall)

  maxRange <- max(result[, 2], result[, 3])

  dforganismOne <- result[result$organism == organismOne, ]
  dforganismTwo <- result[result$organism == organismTwo, ]
  dfMixed <- result[result$organism == "Mixed", ]
  dfNoCall <- result[result$organism == "No Call", ]

  if (is.null(xlim_range)) {
    xlim_range <- c(0, maxRange)
  }

  if (is.null(ylim_range)) {
    ylim_range <- c(0, maxRange)
  }
  colors <- c("blue", "red", "purple", "grey")
  plot(dforganismOne[, 2], dforganismOne[, 3], col = colors[1],
       pch = 16, xlim = xlim_range, ylim = ylim_range,
       xlab = paste(organismOne, category),
       ylab = paste(organismTwo, category),
       cex = point.cex)
  points(dforganismTwo[, 2], dforganismTwo[, 3],
         col = colors[2], pch = 16, cex = point.cex)
  points(dfMixed[, 2], dfMixed[, 3],
         col = colors[3], pch = 16, cex = point.cex)
  points(dfNoCall[, 2], dfNoCall[, 3],
         col = colors[4], pch = 16, cex = point.cex)
  l <- c(paste(organismOne, dim(dforganismOne)[1]),
         paste(organismTwo, dim(dforganismTwo)[1]),
         paste("Mixed", dim(dfMixed)[1]),
         paste("No Call", dim(dfNoCall)[1]))
  legend("topright", legend = l, fill = colors)
  title(paste("Species plot based on", category))
  return(df)
}

getNumTranscriptsPerCellBarcodeByOrganismPair <- function(
                                                      digitalExpressionFileO1,
                                                      digitalExpressionFileO2,
                                                      organismOne,
                                                      organismTwo,
                                                      category) {
  if (is.null(organismOne) || is.null(organismTwo)) {
    return(NULL)
  }

  o1 <- getGenesAndTranscriptsPerCellBarcode(digitalExpressionFileO1)
  o2 <- getGenesAndTranscriptsPerCellBarcode(digitalExpressionFileO2)

  commonBC <- union(o1$cellBC, o2$cellBC)
  o1p <- o1[match(commonBC, o1$cellBC), ]
  o2p <- o2[match(commonBC, o2$cellBC), ]
  if (category == "genes") {
    df <- data.frame(tag = commonBC, o1Count = o1p$numGenes,
                     o2Count = o2p$numGenes, stringsAsFactors = F)
  }
  else {
    df <- data.frame(tag = commonBC, o1Count = o1p$numTranscripts,
                     o2Count = o2p$numTranscripts, stringsAsFactors = F)
  }

  idx1 <- which(is.na(df$o1Count))
  idx2 <- which(is.na(df$o2Count))
  if (length(idx1) > 0) df[idx1, ]$o1Count <- 0
  if (length(idx2) > 0) df[idx2, ]$o2Count <- 0

  df$total <- apply(df[, 2:3], 1, sum, na.rm = T)
  df <- df[order(df$total, decreasing = T), ]
  colnames(df)[2] <- organismOne
  colnames(df)[3] <- organismTwo
  return(df)
}


getGenesAndTranscriptsPerCellBarcode <- function(digitalExpressionFile) {
  a <- read.table(digitalExpressionFile, header = T, stringsAsFactors = F)
  colnames(a) <- c("cellBC", "numGenicReads", "numTranscripts", "numGenes")
  return(a)
}

digitalExpressionFileO1 <- snakemake@input[[1]][1]
digitalExpressionFileO2 <- snakemake@input[[2]][1]

num_cells <- snakemake@params$expected_cells

organismOne <- names(snakemake@config$META$species)[1]
organismTwo <- names(snakemake@config$META$species)[2]

par(mar = c(5, 4, 4, 2) + 0.5)

pdf(snakemake@output$genes_pdf, height = 8, width = 8)
df_temp <- categorizeCellsUsingKneeKnownNumCellsPaper(
  digitalExpressionFileO1,
  digitalExpressionFileO2,
  organismOne = organismOne,
  organismTwo = organismTwo,
  pureRatio = snakemake@config$META$ratio,
  numCells = num_cells,
  numBeads = num_cells * 2,
  point.cex = 1,
  category = "genes"
)
dev.off()



pdf(snakemake@output$transcripts_pdf, height = 8, width = 8)
df <- categorizeCellsUsingKneeKnownNumCellsPaper(
  digitalExpressionFileO1,
  digitalExpressionFileO2,
  organismOne = organismOne,
  organismTwo = organismTwo,
  pureRatio = snakemake@config$META$ratio,
  numCells = num_cells,
  numBeads = num_cells * 2,
  point.cex = 1,
  category = "transcripts"
)
dev.off()
organism1 <- subset(df, df$organism == organismOne)
organism2 <- subset(df, df$organism == organismTwo)

write.table(organism1$tag, snakemake@output$barcodes_species[1],
            row.names = F, col.names = F, quote = F)
write.table(organism2$tag, snakemake@output$barcodes_species[2],
            row.names = F, col.names = F, quote = F)

# save.image(paste0(snakemake@output$genes_pdf,".rdata"))
if (debug_flag) {
  save.image(file = file.path(
    path_debug,
    paste0(
      "plot_species_plot_workspace_",
      attr(snakemake, "wildcard")$sample, ".rdata"
    )
  ))
}
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
debug_flag <- FALSE
if (snakemake@config$DEBUG) {
  debug_flag <- TRUE
  message("In debug mode: saving R objects to inspect later")
  path_debug <- file.path(snakemake@config$LOCAL$results, "debug")
  dir.create(path_debug, showWarnings = FALSE)
  save(snakemake, file = file.path(path_debug, "plot_violin_snakemake.rdata"))
}


options(warn = -1)
library(plyr, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE) # Dataframe manipulation
library(Matrix, quietly = TRUE, warn.conflicts = FALSE) # Sparse matrices
library(stringr, quietly = TRUE, warn.conflicts = FALSE)
library(RColorBrewer, quietly = TRUE, warn.conflicts = FALSE)
library(devtools, quietly = TRUE, warn.conflicts = FALSE)
library(Seurat, quietly = TRUE, warn.conflicts = FALSE)
library(plotly, quietly = TRUE, warn.conflicts = FALSE)

# rule map in Snakefile
# rule map:
#     input:
#         'plots/violinplots_comparison_UMI.pdf',
#         ...

# importing UMI
# importing counts ( summary/counts_expression_matrix.tsv )

ReadMTX <- function(mtx_path) {
  data_dir <- dirname(mtx_path)
  files <- list.files(data_dir)
  # Find files
  barcodes_file <- grep("barcodes", files, value = TRUE)
  features_file <- grep(pattern = "genes|features", x = files, value = TRUE)
  mtx <- grep("mtx", files, value = TRUE)
  # load the data
  data <- readMM(file.path(data_dir, mtx))
  barcodes <- read.csv(file.path(data_dir, barcodes_file), header = FALSE)$V1
  features <- read.csv(file.path(data_dir, features_file), header = FALSE)$V1

  colnames(data) <- barcodes
  rownames(data) <- features
  return(data)
}

#count_matrix <- ReadMTX(snakemake@input$counts)
# importing UMIs ( summary/umi_expression_matrix.tsv )
#umi_matrix <- ReadMTX(snakemake@input$UMIs)

count_matrix <- Read10X(file.path(snakemake@wildcards$results_dir,'summary','read'))
umi_matrix <- Read10X(file.path(snakemake@wildcards$results_dir,'summary','umi'))

design <- read.csv(snakemake@input$design,
  stringsAsFactors = TRUE,
  header = TRUE,
  row.names = NULL
)
metaData <- data.frame(cellNames = colnames(umi_matrix)) %>%
  mutate(samples = factor(str_replace(cellNames, "_[^_]*$", ""))) %>%
  mutate(barcode = factor(str_replace(cellNames, ".+_", ""))) %>%
  left_join(design, by = "samples")
rownames(metaData) <- metaData$cellNames

# possible to set is.expr = -1 to avoid filtering whilst creating
# seuratobj <- CreateSeuratObject(raw.data = umi_matrix, meta.data = metaData, is.expr = -1)
seuratobj <- CreateSeuratObject(raw.data = umi_matrix, meta.data = metaData)
seuratobj <- SetAllIdent(object = seuratobj, id = "samples")
# relabel cell idenity (https://github.com/satijalab/seurat/issues/380)
seuratobj@meta.data$orig.ident <- seuratobj@meta.data$samples

mycount <- CreateSeuratObject(raw.data = count_matrix, meta.data = metaData)
mycount <- SetAllIdent(object = mycount, id = "samples")
mycount@meta.data$orig.ident <- mycount@meta.data$samples
# turn off filtering
# note, the @meta.data slot contains usefull summary stuff
# head([email protected],2)
#                              nGene nUMI expected_cells read_length      barcode
# dropseqLib1_ACTAACATTATT    15   33            400         100 ACTAACATTATT
# dropseqLib1_GAGTCTGAGGCG     5    9            400         100 GAGTCTGAGGCG
#                                       origin      origin
# dropseqLib1_ACTAACATTATT dropseqLib1 dropseqLib1
# dropseqLib1_GAGTCTGAGGCG dropseqLib1 dropseqLib1
meta.data <- seuratobj@meta.data
# combining UMIs and Counts in to one Seurat object
meta.data$nCounts <- mycount@meta.data$nUMI
seuratobj@meta.data <- meta.data
# delete since Counts have been added to seuratobj as nCounts column
rm(mycount)


# mytheme <- theme_bw(base_size = 9) +
mytheme <- theme_bw() +
  theme(
    legend.position = "right",
    axis.ticks = element_blank(),
    axis.text.x = element_text(angle = 300, hjust = 0)
  )
theme_set(mytheme)

# predefined ggplot layers for subsequent plots
gglayers <- list(
  geom_smooth(method = "loess"),
  geom_point(size = .5),
  scale_y_continuous(
    labels = scales::unit_format(unit = "", scale = 1e-3, digits = 2),
    breaks = scales::pretty_breaks(n = 8)
  ),
  scale_x_continuous(
    labels = scales::unit_format(unit = "", scale = 1e-3, digits = 2),
    breaks = scales::pretty_breaks(n = 8)
  )
)

gg <- ggplot(meta.data, aes(x = nUMI, y = nCounts, color = orig.ident)) +
  #   coord_trans(y="log10",x = "log10") +
  gglayers +
  geom_abline(intercept = 0, slope = 1) +
  labs(
    title = "UMI counts vs raw Counts",
    subtitle = "Number of UMIs and raw Counts for each Bead",
    x = "Number of UMIs per Bead [k]",
    y = "Number of Counts per Bead [k]"
  )

# dev.new()
# htmlwidgets::saveWidget(ggplotly(gg), file.path(getwd(),snakemake@output$html_umivscounts))
ggsave(gg, file = file.path(getwd(), snakemake@output$pdf_umivscounts), width = 12, height = 7)

# how about unaligned reads/UMI?
# Note(Seb): raw.data is actually filtered data i.e. nr of genes likely to be smaller than input data!
mito.gene.names <- grep("^mt-", rownames(seuratobj@raw.data), value = TRUE, ignore.case = TRUE)
sribo.gene.names <- grep("^Rps", rownames(seuratobj@raw.data), value = TRUE, ignore.case = TRUE)
lribo.gene.names <- grep("^Rpl", rownames(seuratobj@raw.data), value = TRUE, ignore.case = TRUE)

col.total <- Matrix::colSums(seuratobj@raw.data)
meta.data$col.total <- col.total

seuratobj.top_50 <- apply(seuratobj@raw.data, 2, function(x) sum(x[order(x, decreasing = TRUE)][1:50]) / sum(x))
# mycount.top_50 <- apply([email protected], 2, function(x) sum(x[order(x, decreasing = TRUE)][1:50])/sum(x))

seuratobj <- AddMetaData(seuratobj, Matrix::colSums(seuratobj@raw.data[sribo.gene.names, ]) / col.total, "pct.sribo")
seuratobj <- AddMetaData(seuratobj, Matrix::colSums(seuratobj@raw.data[lribo.gene.names, ]) / col.total, "pct.lribo")
seuratobj <- AddMetaData(seuratobj, Matrix::colSums(seuratobj@raw.data[unique(c(sribo.gene.names, lribo.gene.names)), ]) / col.total, "pct.Ribo")
seuratobj <- AddMetaData(seuratobj, Matrix::colSums(seuratobj@raw.data[mito.gene.names, ]) / col.total, "pct.mito")
seuratobj <- AddMetaData(seuratobj, seuratobj.top_50, "top50")
tmp <- seuratobj@meta.data$nUMI / seuratobj@meta.data$nGene
names(tmp) <- rownames(seuratobj@meta.data)
seuratobj <- AddMetaData(seuratobj, tmp, "umi.per.gene")


gg <- VlnPlot(seuratobj,
  c("nUMI", "nGene", "top50", "umi.per.gene", "pct.Ribo", "pct.mito"),
  x.lab.rot = TRUE, do.return = TRUE
)
# ggsave(gg,file=file.path("violinplots_comparison_UMI.pdf"),width=18,height=18)
ggsave(gg, file = snakemake@output$pdf_violine, width = 18, height = 18)
# gg <- VlnPlot(mycount,c("nUMI", "nGene", "top50", "count.per.gene","pct.Ribo", "pct.mito"), x.lab.rot = TRUE, do.return = TRUE)
# ggsave(gg,file=file.path("violinplots_comparison_count.pdf"),width=18,height=18)

# gg <- GenePlot(object = seuratobj, gene1 = "nUMI", gene2 = "nGene")
# ggsave(gg,file=file.path("violinplots_comparison.pdf"),width=18,height=18)


gg <- ggplot(meta.data, aes(x = nUMI, y = nGene, color = orig.ident)) +
  gglayers +
  labs(
    title = "Genes (pooled mouse and human set) vs UMIs for each bead",
    x = "Number of UMIs per Bead [k]",
    y = "Number of Genes per Bead [k]"
  )

# dev.new()
# htmlwidgets::saveWidget(ggplotly(gg),
# file.path(getwd(), snakemake@output$html_umi_vs_gene))
ggsave(gg, file = snakemake@output$pdf_umi_vs_gene, width = 12, height = 7)



################################################################################
## same for Counts instead UMIs (using mycount object)
gg <- ggplot(meta.data, aes(x = nCounts, y = nGene, color = orig.ident)) +
  gglayers +
  labs(
    title = "Genes (pooled mouse and human set) vs Counts for each bead",
    x = "Number of Counts per Bead [k]",
    y = "Number of Genes per Bead [k]"
  )

# dev.new()
# htmlwidgets::saveWidget(ggplotly(gg),
#                       file.path(getwd(), snakemake@output$html_count_vs_gene))

ggsave(gg, file = snakemake@output$pdf_count_vs_gene, width = 12, height = 7)


# head(meta.data,2)
#                              nGene nUMI                    cellNames         samples      barcode expected_cells read_length  batch      orig.ident pct.sribo  pct.lribo  pct.Ribo  pct.mito     top50 umi.per.gene
# sample1_GAGTCTGAGGCG     6    6 sample1_GAGTCTGAGGCG sample1 GAGTCTGAGGCG            100         100 batch1 sample1 0.0000000 0.00000000 0.0000000 0.0000000 1.0000000     1.000000
# sample1_CAGCCCTCAGTA   264  437 sample1_CAGCCCTCAGTA sample1 CAGCCCTCAGTA            100         100 batch1 sample1 0.0389016 0.07551487 0.1144165 0.0228833 0.5102975     1.655303

# saving snakemake meta information into misc slot so all can be exported as one object
seuratobj@misc <- snakemake
# exporting R Seurat objects into summary/R_Seurat_objects.rdata
saveRDS(seuratobj, file = file.path(snakemake@output$R_objects))

if (debug_flag) {
  save.image(file = file.path(path_debug, "plot_violin_workspace.rdata"))
}
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
debug_flag <- FALSE
if (snakemake@config$DEBUG) {
  debug_flag <- TRUE
  message("In debug mode: saving R objects to inspect later")
  path_debug <- file.path(snakemake@config$LOCAL$results, "debug")
  dir.create(path_debug)
  save(snakemake, file = file.path(path_debug, "plot_yield_snakemake.rdata"))
}
#------------------------------------ debugging

library(ggplot2)
library(tidyr)
library(grid)
library(gridExtra)
library(viridis)
library(stringr)

samples <- snakemake@params$sample_names
batches <- snakemake@params$batches
mydata  <- data.frame(matrix(nrow = length(samples), ncol = 7))
colnames(mydata) <- c("Sample", "Batch", "Cutadapt filtered", "Unmapped",
                      "Multi mapped", "Uniquely mapped", "Total reads")
mydata[, "Sample"] <- samples
mydata[, "Batch"]  <- batches
for (i in 1:length(samples)) {
  # Input files and variables
  STAR_output <- read.table(snakemake@input$STAR_output[i],
                            skip = 5, sep = "\t",
                            fill = TRUE, stringsAsFactors = FALSE)
  mysample <- samples[i]
  # Read files
  # bbmap_log = read.table(snakemake@input$repaired[i], sep=':', header=FALSE, skip=8, row.names=1, nrows=4)
  # reads_after_filtering = as.numeric(str_match(bbmap_log['Pairs',], pattern = "\t([0-9]{1,20}) reads.*")[,2])/2
  bbmap_log   <- readLines(snakemake@input$repaired[i])
  reads_after_filtering <- as.numeric(str_match(bbmap_log[grep("^Pairs:", bbmap_log)],
                                                pattern = "\t([0-9]{1,20}) reads.*")[, 2]) / 2
  R1_filtered <- read.table(snakemake@input$R1_filtered[i], header = FALSE, skip = 7, sep = ":", nrows = 7, row.names = 1)
  total_reads <- as.numeric(str_replace_all(R1_filtered["Total reads processed", ], pattern = (" |,"), ""))

  # R2_filtered = read.table(snakemake@input$R2_filtered[i], header = FALSE, skip=8, sep=':', nrows=7, row.names=1)

  # R1_adapters = as.numeric(str_remove_all(str_match(R1_filtered['Reads with adapters',], pattern = "(.*) \\(")[,2], pattern = (' |,')))
  # R1_too_short = as.numeric(str_remove_all(str_match(R1_filtered['Reads that were too short',], pattern = "(.*) \\(")[,2], pattern = (' |,')))
  # R1_passed = as.numeric(str_remove_all(str_match(R1_filtered['Reads written (passing filters)',], pattern = "(.*) \\(")[,2], pattern = (' |,')))
  # R1_filtered = total_reads - R1_passed

  # R2_adapters = as.numeric(str_remove_all(str_match(R2_filtered['Reads with adapters',], pattern = "(.*) \\(")[,2], pattern = (' |,')))
  # R2_too_short = as.numeric(str_remove_all(str_match(R2_filtered['Reads that were too short',], pattern = "(.*) \\(")[,2], pattern = (' |,')))
  # R2_passed = as.numeric(str_remove_all(str_match(R2_filtered['Reads written (passing filters)',], pattern = "(.*) \\(")[,2], pattern = (' |,')))
  # R2_filtered = total_reads - R2_passed

  mydata[which(mydata$Sample == mysample), "Cutadapt filtered"] <- total_reads - reads_after_filtering
  mydata[which(mydata$Sample == mysample), "Total reads"] <- total_reads

  # STAR output
  reads_in        <- as.numeric(STAR_output$V2[1])
  uniquely_mapped <- as.numeric(STAR_output$V2[4])
  multi_mapped    <- as.numeric(STAR_output$V2[19])
  unmapped        <- reads_in - uniquely_mapped - multi_mapped

  mydata[which(mydata$Sample == mysample), "Uniquely mapped"] <- uniquely_mapped
  mydata[which(mydata$Sample == mysample), "Multi mapped"]    <- multi_mapped
  mydata[which(mydata$Sample == mysample), "Unmapped"]        <- unmapped
}

# tidyr version
 mydata_long <- mydata %>% gather(variable, value, -Sample, -Batch)
# melt will be retired, use gather instead: https://github.com/hadley/reshape
#Force factor order.
mydata_long$variable = factor(mydata_long$variable, levels = c('Cutadapt filtered','Multi mapped','Total reads','Unmapped','Uniquely mapped'))
color_palette = c('#e88270','#cb7262','#ae6254','#70d6e8')


p1 <- ggplot(subset(mydata_long, mydata_long$variable != "Total reads"),
             aes(x = Sample, y = value, fill = variable)) +
  geom_histogram(stat = "identity", binwidth = 1 / length(samples)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 0)) +
  labs(title = paste("Yield of all the reads for each category"),
       x = "Samples",
       y = "Number of reads") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        legend.position = "none",
        plot.title = element_text(size = 20, face = "bold")) +
  facet_grid(~Batch, scales = "free") +
  scale_fill_viridis(discrete = TRUE, option = "viridis") +
  scale_y_continuous(labels = scales::scientific)

mydata_pct <- mydata[, -c(1, 2)] / mydata[, "Total reads"]
mydata_pct <- cbind(Sample = mydata[, "Sample"],
                    Batch = mydata[, "Batch"], mydata_pct)

mydata_long_pct <- mydata_pct %>% gather(variable, value, -Sample, -Batch)

mydata_long_pct$variable = factor(mydata_long$variable, levels = c('Cutadapt filtered','Multi mapped','Unmapped','Uniquely mapped'))

p2 <- ggplot(subset(mydata_long_pct, mydata_long_pct$variable != "Total reads"),
             aes(x = Sample, y = value, fill = variable)) +
  labs(fill = "Filters") +
  geom_histogram(stat = "identity", binwidth = 1 / length(samples)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 0),
        legend.position = "bottom",
        strip.background = element_blank(),
        strip.text.x = element_blank()) +
  labs(x = "Samples",
       y = "Percentage of reads") +
  facet_grid(~Batch, scales = "free") +
  scale_fill_viridis(discrete = TRUE, option = "viridis") +
  scale_y_continuous(labels = scales::percent)

gp1 <- ggplotGrob(p1)
gp2 <- ggplotGrob(p2)

pdf(file = snakemake@output$pdf, width = 16, height = 13)
grid::grid.newpage()
grid::grid.draw(rbind(gp1, gp2, size = "last"))
dev.off()

if (debug_flag) {
  save.image(file = file.path(path_debug, "plot_yield_workspace.rdata"))
}
8
library(yaml)
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
versions = list()
for (yaml_file in snakemake@input$yaml_files){
  current_env = yaml.load_file(yaml_file)
  for (package in current_env$dependencies){
    if(grepl(pattern = 'cutadapt', package)){
      versions[['cutadapt']] = strsplit(package,'=|==| ==| == ')[[1]][2]
    }
    else if(grepl(pattern = 'star', package)){
      versions[['star']] = strsplit(package,'=|==| ==| == ')[[1]][2]
    }
    else if(grepl(pattern = 'dropseq_tools', package)){
      versions[['dropseq_tools']] = strsplit(package,'=|==| ==| == ')[[1]][2]
    }
    else if(grepl(pattern = 'bbmap', package)){
      versions[['bbmap']] = strsplit(package,'=|==| ==| == ')[[1]][2]
    }
  }
}

umi_distance=snakemake@config$EXTRACTION$`UMI-edit-distance`
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
import pickle
import pysam

def load_obj(name):
    with open(name, 'rb') as f:
        return pickle.load(f)

def save_obj(obj, name):
    with open(name, 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)


infile_bam = pysam.AlignmentFile(snakemake.input.bam, "rb")
outfile = pysam.AlignmentFile(snakemake.output.bam, "wb", template=infile_bam)

mapping = load_obj(snakemake.input.barcode_mapping)
barcode_ref = load_obj(snakemake.input.barcode_ref)
barcode_ext_ref = load_obj(snakemake.input.barcode_ext_ref)
unknown_barcodes = set()

for bam_read in infile_bam:
	barcode = bam_read.get_tag('XC')
	#lane_number = bam_read.query_name.split(':')[3]
	if barcode in barcode_ref:
		mapping[0][barcode]['count'] += 1
		#mapping[0][barcode]['lanes'][lane_number] += 1
		outfile.write(bam_read)
		continue
	elif barcode in barcode_ext_ref:
		# The barcode is in our extended reference. Change the barcode to the original one
		reference_barcode = mapping[1][barcode]['ref']
		mapping[1][barcode]['count'] += 1
		#mapping[1][barcode]['lanes'][lane_number] += 1
		bam_read.set_tag('XC',reference_barcode,value_type='Z',replace=True)
		outfile.write(bam_read)
		continue
	else:
		# If the barcode is not found in the extended ref, then don't modify it.
		if barcode in unknown_barcodes:
			mapping['unknown'][barcode]['count'] += 1
			#mapping['unknown'][barcode]['lanes'][lane_number] += 1
		else:
			#mapping['unknown'][barcode] = {'count':1, 'lanes':{'1':0,'2':0,'3':0,'4':0,'5':0,'6':0,'7':0,'8':0}}
			mapping['unknown'][barcode] = {'count':1}
			#mapping['unknown'][barcode]['lanes'][lane_number] += 1
			unknown_barcodes.add(barcode)
		outfile.write(bam_read)

save_obj(obj=mapping, name=snakemake.output.barcode_mapping_counts)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
import os
import sys
from collections import defaultdict
import pickle


def save_obj(obj, name):
    with open(name, 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)


mapping=defaultdict(dict)
barcode_ref = set()
barcode_ext_ref = set()


with open(snakemake.input['whitelist'],'r') as whitelist:
	for line in whitelist:
		if len(line.strip().split()) == 2:  # This means we didn't find any other linked barcode
			(reference,counts_ref) = line.strip().split()
			mapping[0][reference]= defaultdict()
			mapping[0][reference]['ref'] = reference
			mapping[0][reference]['count'] = 0
			mapping[0][reference]['lanes'] = {'1':0,'2':0,'3':0,'4':0,'5':0,'6':0,'7':0,'8':0}
			barcode_ref.add(reference)
			continue
		(reference,extended_ref,counts_ref,counts_ext) = line.strip().split()
		mapping[0][reference]= defaultdict()
		mapping[0][reference]['ref'] = reference
		mapping[0][reference]['count'] = 0
		mapping[0][reference]['lanes'] = {'1':0,'2':0,'3':0,'4':0,'5':0,'6':0,'7':0,'8':0}
		barcode_ref.add(reference)
		for barcode in extended_ref.split(','):
			mapping[1][barcode] = defaultdict()
			mapping[1][barcode]['ref'] = reference
			mapping[1][barcode]['count'] = 0
			mapping[1][barcode]['lanes'] = {'1':0,'2':0,'3':0,'4':0,'5':0,'6':0,'7':0,'8':0}
		barcode_ext_ref.update(extended_ref.split(','))

# Save mapping and references to reuse later.
save_obj(obj=mapping, name=snakemake.output.barcode_mapping)
save_obj(obj=barcode_ref,name=snakemake.output.barcode_ref)
save_obj(obj=barcode_ext_ref,name=snakemake.output.barcode_ext_ref)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

fq1 = snakemake.input.get("fq1")
assert fq1 is not None, "input-> fq1 is a required input parameter"
fq1 = [snakemake.input.fq1] if isinstance(snakemake.input.fq1, str) else snakemake.input.fq1
fq2 =  snakemake.input.get("fq2")
if fq2:
    fq2 = [snakemake.input.fq2] if isinstance(snakemake.input.fq2, str) else snakemake.input.fq2
    assert len(fq1) == len(fq2), "input-> equal number of files required for fq1 and fq2"
input_str_fq1 = ",".join(fq1)
input_str_fq2 = ",".join(fq2) if fq2 is not None else ""
input_str =  " ".join([input_str_fq1, input_str_fq2])

if fq1[0].endswith(".gz"):
    readcmd = "--readFilesCommand zcat"
else:
    readcmd = ""

outprefix = os.path.dirname(snakemake.output[0]) + "/"

shell(
    "STAR "
    "{extra} "
    "--runThreadN {snakemake.threads} "
    "--genomeDir {snakemake.params.index} "
    "--readFilesIn {input_str} "
    "{readcmd} "
    "--outSAMtype BAM Unsorted "
    "--outFileNamePrefix {outprefix} "
    "--outStd Log "
    "{log}")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)

    split_ind = 2 if base.endswith(".gz") else 1
    base = ".".join(base.split(".")[:-split_ind])

    return base


# Run fastqc, since there can be race conditions if multiple jobs 
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell("fastqc {snakemake.params} --quiet "
          "--outdir {tempdir} {snakemake.input[0]}"
          " {log}")

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path} {snakemake.output.html}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path} {snakemake.output.zip}")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}")
ShowHide 69 more snippets with no or duplicated tags.

Free

Created: 8mo ago
Updated: 8mo ago
Maitainers: public
URL: https://github.com/Hoohm/dropSeqPipe
Name: dropseqpipe
Version: v0.32
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Accessed: 5
Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...