A single-cell RNA-seq pre-processing Snakemake workflow
Description
This pipeline is based on Snakemake and the Drop-seq tools provided by the McCarroll Lab. It takes you from the raw data of your single-cell RNA-seq experiment all the way to the final count matrix, with QC plots along the way.
This is the tool we use in our lab to improve our wet-lab protocol, as well as an easy framework for reproducing and comparing experiments run with different parameters.
It uses STAR to map the reads. It is usable for any single-cell protocol based on two reads, where the first read holds the cell and UMI barcodes and the second read holds the RNA (a minimal sketch of this layout follows the list). Here is a non-exhaustive list of compatible protocols/brands:
- Drop-Seq
- SCRB-Seq
- 10x Genomics
- DroNc-seq
- Dolomite Bio (Nadia Instrument)
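To make that read layout concrete, here is a minimal sketch assuming the Drop-Seq flavour of this design, where read 1 is a 12 bp cell barcode followed by an 8 bp UMI (other protocols use different lengths; the sequences are made up):

```bash
# Read 1 carries the barcodes, read 2 carries the cDNA sequence.
R1=ACGTACGTACGTAAGGCCTT
echo "cell barcode: ${R1:0:12}"   # ACGTACGTACGT -> identifies the cell/bead
echo "UMI:          ${R1:12:8}"   # AAGGCCTT -> identifies the original molecule
```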
This package tries to be as user-friendly as possible, in the hope that non-bioinformaticians can make use of it without too much hassle. It still requires some command-line execution; it is not going to be a fully interactive package.
Authors
- Patrick Roelli (@Hoohm)
- Sebastian Mueller (@seb-mueller)
- Charles Girardot (@cgirardot)
Usage
Step 1: Install workflow
If you simply want to use this workflow, download and extract the latest release. If you intend to modify and further develop this workflow, fork this repository. Please consider providing any generally applicable modifications via a pull request.
In any case, if you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this repository and, once available, its DOI.
Step 2: Configure workflow
Configure the workflow according to your needs by editing the files config.yaml and samples.tsv, following these instructions.
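For orientation only (the authoritative schema is in the instructions linked above), a minimal samples.tsv could look like the sketch below. The column names are inferred from the scripts further down this page; the values are made up, and the file is tab-separated:

```
samples    expected_cells    read_length    batch
sample1    100               100            batch1
sample2    100               100            batch2
```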
Step 3: Execute workflow
All you need to execute this workflow is to install Snakemake via the Conda package manager. Software needed by this workflow is automatically deployed into isolated environments by Snakemake.
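For example, with the conda-forge and bioconda channels configured, installing Snakemake is a one-liner:

```bash
conda install -c conda-forge -c bioconda snakemake
```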
Test your configuration by performing a dry-run via
snakemake --use-conda -n --directory $WORKING_DIR
Execute the workflow locally via
snakemake --use-conda --cores $N --directory $WORKING_DIR
using $N cores in $WORKING_DIR. Alternatively, it can be run in cluster or cloud environments (see the docs for details).
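As one possibility, on a SLURM cluster a run could be submitted like this (the sbatch resources are placeholders to adapt to your site):

```bash
snakemake --use-conda --directory $WORKING_DIR \
    --cluster "sbatch --mem=16g --cpus-per-task=8" --jobs 32
```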
If you not only want to fix the software stack but also the underlying OS, use
snakemake --use-conda --use-singularity
in combination with any of the modes above.
Step 4: Investigate results
After successful execution, you can create a self-contained report with all results via:
snakemake --report report.html
Documentation
You can find the documentation here.
Future implementations
I'm actively seeking help to implement the points listed below. Don't hesitate to contact me if you wish to contribute.
- Create a sharing platform where quality plots/logs can be discussed and troubleshot.
- Create a full HTML report for the whole pipeline.
- MultiQC module for Drop-seq tools.
- Implement an elegant "preview" mode in which the pipeline runs on only a couple of million reads, giving you an approximate view before running all of the data. This would dramatically reduce the time needed to get an idea of which filters should be used (see the sketch after this list).
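Until such a preview mode exists, a rough stand-in is to subsample the input yourself and point the workflow at the result, for example with seqtk (the file names here are hypothetical; using the same seed keeps R1/R2 pairs in sync):

```bash
seqtk sample -s100 sample1_R1.fastq.gz 2000000 | gzip > preview_R1.fastq.gz
seqtk sample -s100 sample1_R2.fastq.gz 2000000 | gzip > preview_R2.fastq.gz
```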
I hope it can help you out in your single-cell experiments! Feel free to comment and point out potential improvements via issues.
TODO
- Add a mixed reference for testing purposes
- Finalize the parameter validation schema
- Make the debug feature a bit "cleaner"; deal with automatic naming of the debug variables
- Implement ddSEQ barcoding strategies
Code Snippets
```python
script: '../scripts/generate_extended_ref.py'
```

```python
shell:
    """umi_tools whitelist\
    --stdin {input}\
    --bc-pattern='(?P<cell_1>.{{{params.cell_barcode_length}}})(?P<umi_1>.{{{params.umi_barcode_length}}})'\
    --extract-method=regex\
    --set-cell-number={params.num_cells}\
    --log2stderr > {output}"""
```

```python
shell:
    """cat {input} | cut -f 1 > {output}"""
```

```python
script: '../scripts/umi_tools_extended_ref.py'
```

```python
script: '../scripts/repair_barcodes.py'
```
```python
shell: "gunzip -c -d {input} > {output}"
```

```python
shell: "gunzip -d -c {input} > {output}"
```

```python
shell: """sed -e 's/>/>{params.species}/g' {input} > {output}"""
```

```python
shell: """cat {input.genome1} {input.genome2} > {output}"""
```
```python
run:
    import datetime
    import re
    header1 = "#!Mixed reference of {} and {}\n".format(
        species_list[0],
        species_list[1])
    header2 = "#!genome-builds GRC{}{} GRC{}{}\n".format(
        species_list[0].lower()[0],
        build_list[0],
        species_list[1].lower()[0],
        build_list[1])
    header3 = "#!genome-releases {} {}\n".format(
        release_list[0],
        release_list[1])
    header4 = "#!genome-date {}\n".format(str(datetime.date.today()))
    header = [header1, header2, header3, header4]
    with open(input.annotation1[0]) as annotation1:
        with open(input.annotation2[0]) as annotation2:
            with open(output[0], 'w') as outfile:
                outfile.writelines(header)
                for line in annotation1:
                    if(not line.startswith('#!')):
                        outfile.write(re.sub('^', species_list[0], line))
                for line in annotation2:
                    if(not line.startswith('#!')):
                        outfile.write(re.sub('^', species_list[1], line))
```
```python
shell: "gunzip -c -d {input} > {output}"
```

```python
shell: "gunzip -d -c {input} > {output}"
```
```python
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DigitalExpression -m {params.memory}\
    I={input.data}\
    O={output.dense}\
    EDIT_DISTANCE={params.umiBarcodeEditDistance}\
    OUTPUT_LONG_FORMAT={output.long}\
    STRAND_STRATEGY={params.strand_strategy}\
    OUTPUT_READS_INSTEAD=false\
    LOCUS_FUNCTION_LIST={{{params.locus_list}}}\
    MIN_BC_READ_THRESHOLD={params.count_per_umi}\
    CELL_BC_FILE={input.barcode_whitelist}"""
```

```python
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DigitalExpression -m {params.memory}\
    I={input.data}\
    O={output.dense}\
    EDIT_DISTANCE={params.umiBarcodeEditDistance}\
    OUTPUT_LONG_FORMAT={output.long}\
    STRAND_STRATEGY={params.strand_strategy}\
    OUTPUT_READS_INSTEAD=true\
    LOCUS_FUNCTION_LIST={{{params.locus_list}}}\
    MIN_BC_READ_THRESHOLD={params.count_per_umi}\
    CELL_BC_FILE={input.barcode_whitelist}"""
```

```python
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && SingleCellRnaSeqMetricsCollector -m {params.memory}\
    INPUT={input.data}\
    OUTPUT={output}\
    ANNOTATIONS_FILE={input.refFlat}\
    CELL_BC_FILE={input.barcode_whitelist}\
    RIBOSOMAL_INTERVALS={input.rRNA_intervals}
    """
```

```python
script: '../scripts/plot_rna_metrics.R'
```

```python
script: "../scripts/convert_mtx.py"
```

```python
shell: """pigz -p {threads} {input.barcodes} {input.features} {input.mtx}"""
```
```python
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DigitalExpression -m {params.memory}\
    I={input.data}\
    O={output.dense}\
    EDIT_DISTANCE={params.umiBarcodeEditDistance}\
    OUTPUT_LONG_FORMAT={output.long}\
    STRAND_STRATEGY={params.strand_strategy}\
    OUTPUT_READS_INSTEAD=false\
    LOCUS_FUNCTION_LIST={{{params.locus_list}}}\
    MIN_BC_READ_THRESHOLD={params.count_per_umi}\
    CELL_BC_FILE={input.barcode_whitelist}"""
```

```python
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DigitalExpression -m {params.memory}\
    I={input.data}\
    O={output.dense}\
    EDIT_DISTANCE={params.umiBarcodeEditDistance}\
    OUTPUT_LONG_FORMAT={output.long}\
    STRAND_STRATEGY={params.strand_strategy}\
    OUTPUT_READS_INSTEAD=true\
    LOCUS_FUNCTION_LIST={{{params.locus_list}}}\
    MIN_BC_READ_THRESHOLD={params.count_per_umi}\
    CELL_BC_FILE={input.barcode_whitelist}"""
```

```python
script: "../scripts/convert_mtx.py"
```

```python
shell: """pigz -p {threads} {input.barcodes} {input.features} {input.mtx}"""
```

```python
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && SingleCellRnaSeqMetricsCollector -m {params.memory}\
    INPUT={input.data}\
    OUTPUT={output}\
    ANNOTATIONS_FILE={input.refFlat}\
    CELL_BC_FILE={input.barcode_whitelist}\
    RIBOSOMAL_INTERVALS={input.rRNA_intervals}
    """
```

```python
script: '../scripts/plot_rna_metrics.R'
```
```python
wrapper: '0.36.0/bio/fastqc'
```

```python
wrapper: '0.36.0/bio/fastqc'
```

```python
wrapper: '0.36.0/bio/multiqc'
```

```python
wrapper: '0.36.0/bio/multiqc'
```

```python
script: '../scripts/fa2tsv.py'
```
```python
shell:
    """cutadapt\
    --max-n {params.max_n}\
    -a file:{input.adapters}\
    -g file:{input.adapters}\
    -q {params.barcode_quality},{params.barcode_quality}\
    --minimum-length {params.barcode_length}\
    --cores={threads}\
    --overlap {params.cell_barcode_length}\
    -o {output.fastq} {input.R1}\
    {params.extra_params} > {log.qc}"""
```

```python
shell:
    """cutadapt\
    -a file:{input.adapters}\
    -g file:{input.adapters}\
    -q {params.read_quality}\
    --minimum-length {params.minimum_length}\
    --cores={threads}\
    --overlap {params.adapters_minimum_overlap}\
    -o {output.fastq} {input.R2}\
    {params.extra_params} > {log.qc}"""
```

```python
script: '../scripts/clean_cutadapt.py'
```

```python
shell:
    """repair.sh\
    -Xmx{params.memory}\
    in={input.R1}\
    in2={input.R2}\
    out1={output.R1}\
    out2={output.R2}\
    repair=t\
    threads={threads} 2> {log}"""
```

```python
script: '../scripts/detect_barcodes.py'
```

```python
script: '../scripts/plot_adapter_content.R'
```

```python
wrapper: '0.36.0/bio/multiqc'
```

```python
wrapper: '0.36.0/bio/multiqc'
```
```python
shell: """cat {input.annotation} | grep -E "{params.patterns}" > {output}"""
```

```python
shell:
    """java -jar -Djava.io.tmpdir={params.temp_directory} {params.picard} CreateSequenceDictionary\
    REFERENCE={input}\
    OUTPUT={output}
    """
```

```python
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && ReduceGtf -m {params.memory}\
    GTF={input.annotation}\
    OUTPUT={output}\
    SEQUENCE_DICTIONARY={input.reference_dict}\
    IGNORE_FUNC_TYPE='null'\
    ENHANCE_GTF='false'"""
```

```python
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && ConvertToRefFlat -m {params.memory}\
    ANNOTATIONS_FILE={input.annotation}\
    OUTPUT={output}\
    SEQUENCE_DICTIONARY={input.reference_dict}
    """
```

```python
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && CreateIntervalsFiles -m {params.memory}\
    REDUCED_GTF={input.annotation_reduced}\
    SEQUENCE_DICTIONARY={input.reference_dict}\
    O={params.reference_directory}\
    PREFIX={params.prefix}
    """
```
```python
run:
    # NOTE: the body below is a single triple-quoted string literal,
    # so as written this run block is effectively a no-op.
    """
    from math import log2
    from platform import system
    if (system() == 'Darwin'):
        genomeLength = shell("wc -c {} | cut -d' ' -f2".format(snakemake.reference_file), iterable=True)
    else:
        genomeLength = shell("wc -c {} | cut -d' ' -f1".format(snakemake.reference_file), iterable=True)
    genomeLength = int(next(genomeLength))
    referenceNumber = shell('grep "^>" {} | wc -l'.format(snakemake.reference_file), iterable=True)
    referenceNumber = int(next(referenceNumber))
    value = min([18, int(log2(genomeLength / referenceNumber))])
    """
```
```python
script: '../scripts/prep_star.py'
```

```python
shell:
    """mkdir -p {params.genomeDir}; STAR\
    --runThreadN {threads}\
    --runMode genomeGenerate\
    --genomeDir {params.genomeDir}\
    --genomeFastaFiles {input.reference_file}\
    --sjdbGTFfile {input.annotation_file}\
    --sjdbOverhang {params.sjdbOverhang}\
    --genomeChrBinNbits {params.genomeChrBinNbits}\
    --genomeSAsparseD 2
    """
```
```python
wrapper: "0.27.1/bio/star/align"
```

```python
wrapper: '0.36.0/bio/multiqc'
```

```python
shell: """pigz -p 4 {input}"""
```

```python
script: '../scripts/merge_bam.py'
```

```python
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && TagReadWithGeneFunction -m {params.memory}\
    INPUT={input.data}\
    OUTPUT={output}\
    ANNOTATIONS_FILE={input.refFlat}
    """
```

```python
shell:
    """
    export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DetectBeadSubstitutionErrors -m {params.memory}\
    I={input}\
    O={output.data}\
    OUTPUT_REPORT={output.report}\
    OUTPUT_SUMMARY={output.summary}\
    NUM_THREADS={threads}
    """
```

```python
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DetectBeadSynthesisErrors -m {params.memory}\
    INPUT={input}\
    OUTPUT={output}\
    OUTPUT_STATS={params.out_stats}\
    SUMMARY={params.summary}\
    NUM_BARCODES={params.barcodes}\
    PRIMER_SEQUENCE={params.SmartAdapter}\
    NUM_THREADS={threads}
    """
```

```python
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && BamTagHistogram -m {params.memory}\
    TAG=XC\
    I={input}\
    READ_MQ=10\
    O={output}
    """
```

```python
script: '../scripts/plot_yield.R'
```

```python
script: '../scripts/plot_knee_plot.R'
```
```python
script: "../scripts/convert_mtx.py"
```

```python
script: '../scripts/plot_violine.R'
```

```python
script: '../scripts/create_summary_stats.R'
```

```python
script: "../scripts/publication_text.Rmd"
```
```python
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && FilterBam -m {params.memory}\
    REF_SOFT_MATCHED_RETAINED={params.species}\
    INPUT={input}\
    OUTPUT={output}"""
```

```python
shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DigitalExpression -m {params.memory}\
    I={input.data}\
    O={output.umi_matrix}\
    SUMMARY={output.summary}\
    EDIT_DISTANCE={params.cellBarcodeEditDistance}\
    CELL_BC_FILE={input.barcode_whitelist}\
    LOCUS_FUNCTION_LIST={{{params.locus_list}}}\
    MIN_BC_READ_THRESHOLD={params.count_per_umi}"""
```

```python
script: '../scripts/plot_species_plot.R'
```
```python
from collections import defaultdict
import re


def fill_results(snakemake, pair, adapter_results):
    adapter_pattern = re.compile(pattern="=== Adapter (.*) ===\n")
    with open(snakemake.input[pair], 'r') as logfile:
        line = logfile.readline()
        while(line):
            adapter_matched = re.findall(pattern=adapter_pattern, string=line)
            if(adapter_matched):
                logfile.readline()
                line = logfile.readline().rstrip('.\n')
                line_list = line.split(';')
                adapter_results[adapter_matched[0]]['Pair'] = pair
                adapter_results[adapter_matched[0]]['Sequence'] = line_list[0].split(':')[1].strip()
                # The defaultdict creates the entry on first access, so no
                # membership check is needed before accumulating the count.
                adapter_results[adapter_matched[0]]['Times'] += int(line_list[3].split(':')[1].split(' ')[1].strip())
            line = logfile.readline()
    return(adapter_results)


adapter_results_R1 = defaultdict(lambda: {'Pair': '', 'Sequence': '', 'Times': 0})
adapter_results_R2 = defaultdict(lambda: {'Pair': '', 'Sequence': '', 'Times': 0})

adapter_results_R1 = fill_results(snakemake, 'R1', adapter_results_R1)
adapter_results_R2 = fill_results(snakemake, 'R2', adapter_results_R2)

with open(snakemake.output[0], 'w') as outfile:
    outfile.write('Adapter,Sequence,Pair,Count\n')
    for adapter in adapter_results_R1:
        outfile.write("{},{},{},{}\n".format(adapter, adapter_results_R1[adapter]['Sequence'], adapter_results_R1[adapter]['Pair'], adapter_results_R1[adapter]['Times']))
    for adapter in adapter_results_R2:
        outfile.write("{},{},{},{}\n".format(adapter, adapter_results_R2[adapter]['Sequence'], adapter_results_R2[adapter]['Pair'], adapter_results_R2[adapter]['Times']))
```
```python
import os
import subprocess
import sys  # needed for sys.exit below

samples = snakemake.params['samples']
barcodes = {}
features = {}
out_folder = os.path.dirname(snakemake.output['mtx'])
out_barcodes = snakemake.output.barcodes
out_features = snakemake.output.features
mtx = snakemake.output.mtx
temp_mtx = os.path.join(out_folder, 'temp_umi.mtx')
header = os.path.join(out_folder, 'header.mtx')
n_lines = 0
barcode_index = 1
feature_index = 1

with open(temp_mtx, 'w') as mtx_stream:
    for i, sample in enumerate(snakemake.input):
        if samples[i] not in sample:
            sys.exit("Sample name not found in file path")
        with open(sample, 'r') as input_file:
            next(input_file)  # skip first line
            for line in input_file:
                barcode, feature, count = line.strip().split('\t')
                if(not isinstance(samples, str)):
                    barcode = samples[i] + '_' + barcode
                if(barcode not in barcodes):
                    barcodes[barcode] = barcode_index
                    barcode_index += 1
                if(feature not in features):
                    features[feature] = feature_index
                    feature_index += 1
                mtx_stream.write('{} {} {}\n'.format(features[feature], barcodes[barcode], count))
                n_lines += 1

with open(out_barcodes, 'w') as barcodes_outfile:
    for barcode in barcodes:
        barcodes_outfile.write('{}\n'.format(barcode))

with open(out_features, 'w') as features_outfile:
    for feature in features:
        features_outfile.write('{}\n'.format(feature))

with open(header, 'w') as header_outfile:
    header_outfile.write("%%MatrixMarket matrix coordinate real general\n")
    header_outfile.write('{} {} {}\n'.format(len(features), len(barcodes), n_lines))

subprocess.call("cat {} {} > {}".format(header, temp_mtx, mtx), shell=True)
os.remove(temp_mtx)
os.remove(header)
```
```r
debug_flag <- FALSE
# check if DEBUG flag is set
if (snakemake@config$DEBUG) {
  debug_flag <- TRUE
  message("In debug mode: saving R objects to inspect later")
  path_debug <- file.path(snakemake@config$LOCAL$results, "debug")
  dir.create(path_debug, showWarnings = FALSE)
  save(snakemake, file = file.path(path_debug, "create_summary_stats_snakemake.rdata"))
}
#### /debug

library(dplyr)         # Dataframe manipulation
library(Matrix)        # Sparse matrices
library(stringr)
library(RColorBrewer)
library(devtools)
library(Seurat)
library(plotly)

samples <- snakemake@params$sample_names
batches <- snakemake@params$batches

# importing Seurat object
seuratobj <- readRDS(file = file.path(snakemake@input$R_objects))
meta.data <- seuratobj@meta.data

# subset only the highest stamps as set in config.yaml
# this is necessary since more stamps are selected as a safety margin,
# which now have to be taken out again to calculate stats.
meta.data.sub <- meta.data %>%
  group_by(orig.ident) %>%
  arrange(desc(nCounts)) %>%
  slice(1:expected_cells[1]) %>%  # makes sure only # expected cells are kept
  as.data.frame()

gini_index <- function(x, weights = rep(1, length = length(x))) {
  ox <- order(x)
  x <- x[ox]
  weights <- weights[ox] / sum(weights)
  p <- cumsum(weights)
  nu <- cumsum(weights * x)
  n <- length(nu)
  nu <- nu / nu[n]
  sum(nu[-1] * p[-n]) - sum(nu[-n] * p[-1])
}

#------------------------------------ post-filter-stats
# stats based only on the most abundant `expected-cells` barcodes,
# taken from the Seurat object generated in the violine_plot rule.
# median calculator
stats_post <- meta.data.sub %>%
  group_by(orig.ident) %>%
  summarise(
    Total_nb_reads = sum(nCounts),
    Nb_STAMPS = mean(expected_cells),  # should be all the same anyway..
    Median_reads_per_STAMP = round(median(nCounts), 2),
    Mean_reads_per_STAMP = round(mean(nCounts), 2),
    Total_nb_UMIs = sum(nUMI),
    Median_UMIs_per_STAMP = round(median(nUMI), 2),
    Mean_UMIs_per_STAMP = round(mean(nUMI), 2),
    Mean_UMIs_per_Gene = round(mean(umi.per.gene), 2),
    Median_number_genes_per_STAMP = round(median(nGene), 2),
    Mean_number_genes_per_STAMP = round(mean(nGene), 2),
    Mean_Ribo_pct = round(100 * mean(pct.Ribo), 2),
    Mean_Mito_pct = round(100 * mean(pct.mito), 2),
    Mean_Count_per_UMI = round(sum(nCounts) / sum(nUMI), 2),
    Read_length = mean(read_length),  # should be all the same anyway..
    Number_barcodes_used_for_debug = n()
  ) %>%
  as.data.frame()
row.names(stats_post) <- stats_post$orig.ident

# highest, lowest count/UMI Stamp
# pre STAMP stats
# hist out goes into knee plots
# 'results/logs/{sample}_hist_out_cell.txt'
# """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && BAMTagHistogram -m {params.memory}\
#    TAG=XC\
# https://hpc.nih.gov/apps/dropseq.html
# there is no hint in the documentation on any filtering (only read quality)

#------------------------------------ pre-filter-stats
# calculating statistics based on barcodes before thresholding (i.e. before
# keeping the most abundant barcodes based on `expected cells`).
# This is based on 'logs/{sample}_hist_out_cell.txt' generated by
# `BAMTagHistogram` from dropseq-tools.
# TODO: The documentation only mentions duplicate and quality filters, but the
# tool seems to do more filtering: most barcodes are expected to have only one
# read assigned but there are usually more. Find out.
# https://hpc.nih.gov/apps/dropseq.html
# TODO: Nr UMI for pre filter
stats_pre <- data.frame(matrix(nrow = length(samples), ncol = 10))
colnames(stats_pre) <- c(
  "Sample",
  "Batch",
  "Total_raw_reads",
  "Nr_barcodes_total",
  "Nr_barcodes_more_than_1_reads",
  "Nr_barcodes_more_than_10_reads",
  "percentile99",
  "percentile95",
  "percentile50",
  "Gini-index"
)
stats_pre[, "Sample"] <- samples
stats_pre[, "Batch"] <- batches

for (i in 1:length(samples)) {
  # importing 'logs/{sample}_hist_out_cell.txt'
  hist_out <- read.table(
    file = snakemake@input$hist_cell[i],
    header = FALSE,
    stringsAsFactors = FALSE
  )
  mysample <- samples[i]
  reads <- hist_out$V1
  barcodes <- hist_out$V2
  # calculations on reads
  # total reads are not sum(reads)! They need to be taken from
  # results/logs/cutadapt/sample1_R1.qc.txt
  #
  # read in full text file "sample_R2.qc.txt"
  filedump <- readLines(snakemake@input$R2qc[i])
  # subset line matching a pattern
  total_reads <- filedump[grep("Total reads processed:", filedump)] %>%  # extract line
    str_extract("[0-9,]+") %>%     # extract number from line
    str_replace_all(",", "") %>%   # delete commas for subsequent numeric casting
    as.numeric()
  reads_cumsum <- cumsum(reads)
  reads_cumsum_perc <- (reads_cumsum / sum(reads))
  # reporting stats
  stats_pre[i, "Total_raw_reads"] <- total_reads
  stats_pre[i, "Reads_assigned_to_expected_STAMPs"] <- sum(reads[1:stats_post$Nb_STAMPS[i]])
  stats_pre[i, "Nr_barcodes_total"] <- length(barcodes)
  stats_pre[i, "percentile99"] <- which.min(reads_cumsum_perc < 0.99)
  stats_pre[i, "percentile95"] <- which.min(reads_cumsum_perc < 0.95)
  stats_pre[i, "percentile50"] <- which.min(reads_cumsum_perc < 0.50)
  stats_pre[i, "Nr_barcodes_more_than_1_reads"] <- sum(reads > 1)
  stats_pre[i, "Nr_barcodes_more_than_10_reads"] <- sum(reads > 10)
  stats_pre[i, "Gini-index"] <- round(gini_index(reads), 2)
  expected_cells <- as.numeric(filter(stats_post, orig.ident == mysample) %>% select(Nb_STAMPS))
  # % of reads left after applying the expected_cells cutoff
  stats_post[mysample, "Pct_reads_after_filter_expected_cells"] <-
    round(100 * (reads_cumsum[expected_cells] / sum(reads)), 2)
  # % of reads left after applying all filters including mapping etc.
  # That is the effective usable reads of the sequencing run.
  stats_post[mysample, "Pct_reads_after_filter_everything"] <-
    round(100 * (filter(stats_post, orig.ident == mysample) %>% select(Total_nb_reads) / stats_pre[i, "Total_raw_reads"]), 2)
}

stats_pre <- stats_pre %>% arrange(Sample)
stats_post <- stats_post %>% arrange((orig.ident))

# output
write.csv(stats_pre, file.path(snakemake@output$stats_pre))
write.csv(stats_post, file.path(snakemake@output$stats_post))  # writes table for excel

if (debug_flag) {
  save.image(file = file.path(path_debug, "create_summary_stats_workspace.rdata"))
}
```

(from scripts/create_summary_stats.R)
```python
from Bio import SeqIO
import gzip
from collections import Counter

fastq_parser = SeqIO.parse(gzip.open(snakemake.input.R1, "rt"), "fastq")

sequences = []
n = 0
for fastq_R1 in fastq_parser:
    sequences.append(str(fastq_R1.seq))
    n += 1
    if(n == 10000000):
        break


def parse_barcodes(fastq_parser):
    counts = {}
    ranges = range(5, len(sequences[0]))
    for cell_bc_length in ranges:
        counts[cell_bc_length] = list()
        for fastq_R1 in sequences:
            counts[cell_bc_length].append(fastq_R1[0:cell_bc_length])
    return(counts)


counts = parse_barcodes(fastq_parser)

with open(snakemake.output[0], "w") as outfile:
    outfile.write('bc_length,first_counts\n')
    for cell_bc_length in counts:
        outfile.write('{},{}\n'.format(cell_bc_length, str(Counter(counts[cell_bc_length]).most_common(100)[0][1])))
```
```python
import sys
from Bio import SeqIO

number_bp = 12

with open(snakemake.output['tsv'], "w") as output:
    for seq_record in SeqIO.parse(snakemake.input['fa'], "fasta"):
        myline = (str(seq_record.id)) + "\t" + str(seq_record.seq[0:(number_bp + 1)]) + "\n"
        output.write(myline)
```
```python
from itertools import combinations, product
from collections import defaultdict
from copy import deepcopy
import pickle
from shutil import copyfile


def save_obj(obj, name):
    with open(name, 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)


def generate_all(barcode, reference, mapping, edit_distance):
    mutants = generate_mutants(barcode, edit_distance)
    for mutant in mutants:
        if(mutant not in reference):
            reference.add(mutant)
            mapping[edit_distance][mutant]['ref'] = barcode
            mapping[edit_distance][mutant]['count'] = 0
            mapping[edit_distance][mutant]['lanes'] = {'1': 0, '2': 0, '3': 0, '4': 0, '5': 0, '6': 0, '7': 0, '8': 0}
    mapping['unknown'] = defaultdict()
    return(reference, mapping)


def generate_mutants(sequence, d=1):
    """Taken from stackoverflow: https://stackoverflow.com/a/19823295/9178565"""
    N = len(sequence)
    letters = 'ACGTN'
    pool = list(sequence)
    for indices in combinations(range(N), d):
        for replacements in product(letters, repeat=d):
            skip = False
            for i, a in zip(indices, replacements):
                if pool[i] == a:
                    skip = True
            if skip:
                continue
            keys = dict(zip(indices, replacements))
            yield ''.join([pool[i] if i not in indices else keys[i] for i in range(N)])


# Create empty sets and defaultdicts
barcode_ref = set()
mapping = defaultdict(dict)

# Initiate ref and mapping with the given barcodes
with open(snakemake.input.whitelist, 'r') as ref_file:
    for line in ref_file.readlines():
        barcode = line.strip()
        barcode_ref.add(barcode)
        mapping[0][barcode] = defaultdict(dict)
        mapping[0][barcode]['ref'] = barcode
        mapping[0][barcode]['count'] = 0
        mapping[0][barcode]['lanes'] = {'1': 0, '2': 0, '3': 0, '4': 0, '5': 0, '6': 0, '7': 0, '8': 0}

barcode_ext_ref = deepcopy(barcode_ref)

# For now edit distance is one, but can be extended to a higher number later on.
max_edit_distance = 1
for edit_distance in range(1, max_edit_distance + 1):
    mapping[edit_distance] = defaultdict(dict)
    for barcode in mapping[0]:
        (barcode_ext_ref, mapping) = generate_all(barcode, barcode_ext_ref, mapping, edit_distance)

# Delete given barcodes out of the new reference. This helps later on when running "repair_barcodes.py"
barcode_ref = set(mapping[0])
barcode_ext_ref.difference_update(barcode_ref)

# Save mapping and references to reuse later.
save_obj(obj=mapping, name=snakemake.output.barcode_mapping)
save_obj(obj=barcode_ref, name=snakemake.output.barcode_ref)
save_obj(obj=barcode_ext_ref, name=snakemake.output.barcode_ext_ref)

copyfile(snakemake.input['whitelist'], snakemake.output['barcodes'])
```
```python
import pysam
import re
import csv
from Bio import SeqIO
import gzip
from collections import defaultdict
import os
import sys

# This function fills in a dict with read ids
# and their corresponding cell and umi barcodes until it finds
# a specific read id
discard_secondary_alignements = snakemake.params['discard_secondary_alignements']

barcodes_struct = {
    'BC_start': snakemake.params['BC_start'],
    'BC_end': snakemake.params['BC_end'],
    'UMI_start': snakemake.params['UMI_start'],
    'UMI_end': snakemake.params['UMI_end']
}


def parse_barcodes(fastq_parser, query_name, read_barcodes, barcodes_struct):
    for fastq_R1 in fastq_parser:
        # Some sequencers append /1 and /2 to R1 and R2 read ids respectively. This attempts to solve issue #69.
        if '/' in fastq_R1.id:
            R1_id = fastq_R1.id[:fastq_R1.id.find("/")]
        else:
            R1_id = fastq_R1.id
        read_barcodes[R1_id]['XC'] = str(fastq_R1.seq)[barcodes_struct['BC_start']:barcodes_struct['BC_end']]
        read_barcodes[R1_id]['XM'] = str(fastq_R1.seq)[barcodes_struct['UMI_start']:barcodes_struct['UMI_end']]
        if(read_barcodes[R1_id]['XM'] == ''):
            raise SystemExit('UMI empty for read {}.\n The barcode is: {}.\nWhole entry is:{}'.format(R1_id, fastq_R1.seq, fastq_R1))
        if (R1_id == query_name):
            return(fastq_parser, read_barcodes)
    return(fastq_parser, read_barcodes)


infile_bam = pysam.AlignmentFile(snakemake.input[0], "rb")
fastq_parser = SeqIO.parse(gzip.open(snakemake.input[1], "rt"), "fastq")
outfile = pysam.AlignmentFile(snakemake.output[0], "wb", template=infile_bam)

read_barcodes = defaultdict(lambda: {'XC': '', 'XM': ''})

for bam_read in infile_bam:
    if(discard_secondary_alignements & bam_read.is_secondary):
        continue
    if (bam_read.query_name) in read_barcodes:
        current_barcodes = read_barcodes.pop(bam_read.query_name)
        tags = bam_read.get_tags()
        tags.extend([
            ('XC', current_barcodes['XC'], 'Z'),
            ('XM', current_barcodes['XM'], 'Z')])
        bam_read.set_tags(tags)
    else:
        fastq_parser, read_barcodes = parse_barcodes(fastq_parser, bam_read.query_name, read_barcodes, barcodes_struct)
        if (bam_read.query_name) not in read_barcodes:
            # Delete the partial output before aborting.
            os.remove(snakemake.output[0])
            raise SystemExit('Read {} from mapped file is missing in reference fastq file!'.format(bam_read.query_name))
        current_barcodes = read_barcodes.pop(bam_read.query_name)
        tags = bam_read.get_tags()
        tags.extend([
            ('XC', current_barcodes['XC'], 'Z'),
            ('XM', current_barcodes['XM'], 'Z')])
        bam_read.set_tags(tags)
    outfile.write(bam_read)
```
```r
debug_flag <- FALSE
# check if DEBUG flag is set
if (snakemake@config$DEBUG) {
  debug_flag <- TRUE
  message("In debug mode: saving R objects to inspect later")
  path_debug <- file.path(snakemake@config$LOCAL$results, "debug")
  dir.create(path_debug, showWarnings = FALSE)
  save(snakemake, file = file.path(path_debug, "plot_adapter_content_snakemake.rdata"))
}
#------------------------------------ debugging

library(ggplot2)
library(dplyr)
library(viridis)

samples <- snakemake@params$sample_names
batches <- snakemake@params$batches

# Read files into a list
cutadapt_clean_list <- list()
for (i in seq_along(samples)) {
  cutadapt_clean <- read.csv(snakemake@input[[i]][1], header = TRUE)
  cutadapt_clean$Sample <- samples[i]
  cutadapt_clean$Batch <- batches[i]
  cutadapt_clean_list[[i]] <- cutadapt_clean
}

# combining adapters across samples
cutadapt_counts <- Reduce(rbind, cutadapt_clean_list, NULL)

# Transform it into percentages
cutadapt_counts <- group_by(cutadapt_counts, Sample, Pair) %>%
  mutate(Percentages = Count / sum(Count))

#      Adapter                           Sequence Pair Count  Sample  Batch
# 1 PrefixNX/1                AGATGTGTATAAGAGACAG   R1     7 sample1 Batch1
# ...
# 6  Trans2_rc CTGTCTCTTATACACATCTCCGAGCCCACGAGAC   R2     5 sample2 Batch2

p1 <- ggplot(cutadapt_counts, aes(x = Sample, y = Percentages, fill = Adapter)) +
  geom_bar(stat = "identity") +
  facet_grid(Pair ~ Batch, scales = "free") +
  theme_minimal() +
  ggtitle("Comparison across samples of adapter content") +
  scale_x_discrete(label = abbreviate) +
  scale_y_continuous(labels = scales::percent) +
  theme(axis.text.x = element_text(angle = 90, hjust = 0)) +
  scale_fill_viridis(discrete = TRUE)

ggsave(plot = p1, filename = snakemake@output$pdf)

if (debug_flag) {
  save.image(file = file.path(path_debug, "plot_adapter_content_workspace.rdata"))
}
```
```r
debug_flag <- FALSE
if (snakemake@config$DEBUG) {
  debug_flag <- TRUE
  message("In debug mode: saving R objects to inspect later")
  path_debug <- file.path(snakemake@config$LOCAL$results, "debug")
  dir.create(path_debug, showWarnings = FALSE)
  save(snakemake,
       file = file.path(path_debug, paste0("plot_knee_plot_snakemake_", attr(snakemake, "wildcard")$sample, ".rdata")))
}
#### /debug

library(ggplot2)
library(plyr)

# Create the cumulative plot
data <- read.table(file = snakemake@input[[1]][1], header = FALSE, stringsAsFactors = FALSE)
barcodes <- data$V2
total_reads <- sum(data$V1)
y_raw <- cumsum(data$V1)
y <- (y_raw / total_reads)
x <- 1:length(y)
plot_data <- data.frame(rank = x, cum_sum = y, Barcode = data$V2)
x_scale <- snakemake@params$cells * 4

knee_plot <- ggplot(plot_data, aes(x = rank, y = cum_sum)) +
  geom_point(size = 0.1) +
  xlim(0, x_scale) +
  geom_vline(xintercept = snakemake@params$cells, linetype = "dashed", color = "red") +
  ggtitle(paste0(snakemake@wildcards$sample, '\nTotal reads: ', prettyNum(total_reads))) +
  theme(plot.title = element_text(size = 10)) +
  labs(x = 'STAMPS', y = 'Cumulative fraction of reads') +
  scale_y_continuous(labels = scales::percent) +
  theme_classic()

if (!is.null(snakemake@input$barcodes)) {
  selected_cells <- read.csv(snakemake@input$barcodes, header = FALSE, stringsAsFactors = FALSE)
  knee_plot <- knee_plot +
    geom_point(data = plot_data[plot_data$Barcode %in% selected_cells$V1, ],
               aes(x = rank, y = cum_sum, color = 'Selected'), size = 0.1) +
    scale_color_manual(values = c('Selected' = 'green'))
}

ggsave(knee_plot, file = snakemake@output$pdf, width = 4, height = 3)

if (debug_flag) {
  library(gridExtra)
  library(grid)
  # potentially useful to change lag of diff calculation
  mylag <- 1
  # data <- read.table(file = file, header=FALSE, stringsAsFactors=FALSE)
  # head(data,3)
  #        V1           V2
  # 1 1145137 CCCTTCGTCTGC
  # 2 1039974 ATAGTTTTTTAA
  # 3  912199 GCATGAAACTTC

  # borrowed from https://stackoverflow.com/questions/6836409/finding-local-maxima-and-minima
  localMaxima <- function(x) {
    # Use -Inf instead if x is numeric (non-integer)
    y <- diff(c(-.Machine$integer.max, x)) > 0L
    rle(y)$lengths
    y <- cumsum(rle(y)$lengths)
    y <- y[seq.int(1L, length(y), 2L)]
    if (x[[1]] == x[[2]]) {
      y <- y[-1]
    }
    y
  }

  reads <- data$V1
  barcodes <- data$V2
  total_reads <- sum(reads)
  reads_cumsum <- cumsum(reads)
  # 1st derivative (diff) also needs to be padded to keep the same vector length
  reads_diff <- c(diff(reads, lag = mylag, differences = 1), rep(0, mylag))
  # 2nd derivative: twice as much padding:
  reads_diff_diff <- c(diff(reads, lag = mylag, differences = 2), rep(0, mylag * 2))
  reads_cumsum_perc <- (reads_cumsum / total_reads)
  x <- 1:length(reads_cumsum_perc)
  plot_data <- data.frame(rank = x,
                          cum_sum = reads_cumsum_perc,
                          read_count = reads,
                          Barcode = data$V2,
                          diff = reads_diff,
                          diffdiff = reads_diff_diff)
  # head(plot_data,6)
  #   rank     cum_sum read_count      Barcode    diff diffdiff
  # 1    1 0.007192916    1145137 CCCTTCGTCTGC -105163   -22612
  # 2    2 0.013725274    1039974 ATAGTTTTTTAA -127775    14025
  # 3    3 0.019455043     912199 GCATGAAACTTC -113750    61486
  # 4    4 0.024470318     798449 GTGTGGGTCTCT  -52264    34985
  # 5    5 0.029157308     746185 CGTACTGACTAC  -17279   -60441
  # 6    6 0.033735764     728906 GTTCGTCCCGCC  -77720    69104

  mystats <- paste0("| Nr barcodes total: ", length(barcodes), ' \n ',
                    "Nr barcodes for 50% reads: ", which.min(reads_cumsum_perc < 0.50), " | ",
                    "Nr barcodes for 95% reads: ", which.min(reads_cumsum_perc < 0.95), " | ",
                    "Nr barcodes for 99% reads: ", which.min(reads_cumsum_perc < 0.99))

  # x_scale <- which(reads_cumsum_perc>0.99)[1]
  x_scale <- snakemake@params$cells * 4
  plot_data_head <- head(plot_data, x_scale)
  # plot_data_head$reads_diff_smooth <- predict(loess(diff~rank,data=plot_data_head))
  plot_data_head$reads_diffdiff_smooth <- predict(loess(diffdiff ~ rank, span = 0.2, data = plot_data_head))

  ## Finding the knee in the knee-plot:
  # The best approach so far is to calculate the 2nd derivative of the read
  # counts per STAMP (reads_diff_diff), smooth it (loess) and find its maxima.
  # Since there can be several maxima, they are just plotted and it's up to the
  # user to visually assess and decide.
  # finding local maxima in 2nd derivative:
  # https://stackoverflow.com/questions/6836409/finding-local-maxima-and-minima
  loc.max <- localMaxima(plot_data_head$reads_diffdiff_smooth)
  # local maxima are then colored green as lines in plots
  plot_data_head_sub <- plot_data_head[loc.max, ]
  # which.peaks(plot_data_head$reads_diff_smooth2)

  knee_plot_ext <- ggplot(plot_data_head, aes(x = rank, y = cum_sum)) +
    xlim(0, x_scale) +
    ylim(0, 1) +
    geom_text(data = plot_data_head_sub, aes(label = round(cum_sum, 2)), nudge_y = -0.05, vjust = "inward", hjust = "inward") +
    geom_vline(xintercept = snakemake@params$cells, linetype = "dashed", color = "red") +
    # geom_vline(xintercept=100, linetype="dashed", color = "red") +
    geom_vline(xintercept = loc.max, col = "lightgreen") +
    geom_point(size = 0.1) +
    ggtitle(paste0(snakemake@wildcards$sample, '\nTotal reads: ', prettyNum(total_reads), mystats)) +
    theme(plot.title = element_text(size = 10)) +
    labs(x = 'STAMPS', y = 'Cumulative fraction of reads')

  read_count_plot <- ggplot(plot_data_head, aes(x = rank, y = read_count)) +
    geom_text(data = plot_data_head_sub, aes(label = read_count), nudge_y = -0.05, vjust = "inward", hjust = "inward", check_overlap = TRUE) +
    # geom_smooth() +
    xlim(0, x_scale) +
    # ylim(0,1000) +
    geom_vline(xintercept = snakemake@params$cells, linetype = "dashed", color = "red") +
    # geom_vline(xintercept=100, linetype="dashed", color = "red") +
    geom_vline(xintercept = loc.max, col = "lightgreen") +
    geom_point(size = 0.1) +
    theme(plot.title = element_text(size = 10)) +
    labs(x = 'STAMPS', y = 'Read counts per STAMP')

  # knee_plot_ext = knee_plot_ext + scale_y_continuous(labels = scales::percent)

  diff_plot <- ggplot(plot_data_head, aes(x = rank, y = diff)) +
    geom_text(data = plot_data_head_sub, aes(label = diff), nudge_y = -0.05, vjust = "inward", hjust = "inward", check_overlap = TRUE) +
    # geom_smooth() +
    xlim(0, x_scale) +
    # ylim(0,1000) +
    geom_vline(xintercept = snakemake@params$cells, linetype = "dashed", color = "red") +
    # geom_vline(xintercept=100, linetype="dashed", color = "red") +
    geom_vline(xintercept = loc.max, col = "lightgreen") +
    geom_point(size = 0.1) +
    theme(plot.title = element_text(size = 10)) +
    # 1st derivative: read count diff to next STAMP
    labs(x = 'STAMPS', y = "1st derivative of read counts")

  diff_diff_plot <- ggplot(plot_data_head, aes(x = rank, y = reads_diffdiff_smooth)) +
    geom_text(data = plot_data_head_sub, aes(label = rank), nudge_y = -1, vjust = "inward", hjust = "inward", check_overlap = TRUE) +
    xlim(0, x_scale) +
    ylim(-30, 30) +
    geom_vline(xintercept = snakemake@params$cells, linetype = "dashed", color = "red") +
    # geom_vline(xintercept=100, linetype="dashed", color = "red") +
    geom_vline(xintercept = loc.max, col = "lightgreen") +
    geom_point(size = 0.1) +
    theme(plot.title = element_text(size = 10)) +
    labs(x = 'STAMPS', y = '2nd derivative of read counts')

  if (!is.null(snakemake@input$barcodes)) {
    selected_cells <- read.csv(snakemake@input$barcodes, header = FALSE, stringsAsFactors = FALSE)
    knee_plot_ext <- knee_plot_ext +
      geom_point(data = plot_data_head[plot_data_head$Barcode %in% selected_cells$V1, ],
                 aes(x = rank, y = cum_sum, color = 'Selected'), size = 0.1) +
      scale_color_manual(values = c('Selected' = 'green')) +
      theme(legend.position = "none")
    diff_diff_plot <- diff_diff_plot +
      geom_point(data = plot_data_head[plot_data_head$Barcode %in% selected_cells$V1, ],
                 aes(x = rank, y = reads_diffdiff_smooth, color = 'Selected'), size = 0.1) +
      scale_color_manual(values = c('Selected' = 'green')) +
      theme(legend.position = "none")
  }

  # scale_y_continuous(position = "right")
  gp1 <- ggplotGrob(knee_plot_ext)
  gp2 <- ggplotGrob(read_count_plot)
  gp3 <- ggplotGrob(diff_plot)
  gp4 <- ggplotGrob(diff_diff_plot)
  # grid::grid.newpage()
  # gg <- grid::grid.draw(rbind(gp1, gp2, gp3, gp4, size = "last"))
  gg <- gridExtra::arrangeGrob(rbind(gp1, gp2, gp3, gp4, size = "last"))
  # gg <- gridExtra::arrangeGrob(gp1, gp2, gp3, gp4, ncol = 1)
  # if barcode.csv is present in the base directory, only use barcodes in there (rule plot_knee_plot_whitelist in map.smk)
  ggsave(gg, file = paste0(snakemake@output$pdf, "_extended.pdf"), width = 9, height = 11)
}

if (debug_flag) {
  save.image(file = file.path(path_debug, paste0("plot_knee_plot_workspace_", attr(snakemake, "wildcard")$sample, ".rdata")))
}
```
```r
library(ggplot2)
library(tidyr)
library(gridExtra)
library(grid)
library(viridis)

debug_flag <- FALSE
if (snakemake@config$DEBUG) {
  debug_flag <- TRUE
  message("In debug mode: saving R objects to inspect later")
  path_debug <- file.path(snakemake@config$LOCAL$results, "debug")
  dir.create(path_debug, showWarnings = FALSE)
  save(snakemake, file = file.path(path_debug, "plot_rna_metrics_snakemake.rdata"))
}
#### /debug

mydata <- read.csv(file = snakemake@input$rna_metrics,
                   header = T, stringsAsFactors = F, skip = 6, sep = "\t")
mydata <- mydata[order(mydata$PF_ALIGNED_BASES, decreasing = T), ]

mydata_pct <- mydata[, c("READ_GROUP",
                         "PCT_INTERGENIC_BASES",
                         "PCT_UTR_BASES",
                         "PCT_RIBOSOMAL_BASES",
                         "PCT_INTRONIC_BASES",
                         "PCT_CODING_BASES")]
colnames(mydata_pct) <- c('Cell Barcode', 'Intergenic', 'UTR', 'Ribosomal', 'Intronic', 'Coding')

mydata <- mydata[, c("READ_GROUP",
                     "INTERGENIC_BASES",
                     "UTR_BASES",
                     "RIBOSOMAL_BASES",
                     "INTRONIC_BASES",
                     "CODING_BASES")]
colnames(mydata) <- c('Cell Barcode', 'Intergenic', 'UTR', 'Ribosomal', 'Intronic', 'Coding')

# converting into long format for plotting
mydata_long <- mydata %>% gather("Read Overlap", count, -"Cell Barcode")
# Keep the original order of the barcodes using factor and levels.
mydata_long$`Cell Barcode` <- factor(mydata_long$`Cell Barcode`, levels = factor(unique(mydata_long$`Cell Barcode`)))
mydata_long$`Read Overlap` <- factor(mydata_long$`Read Overlap`, levels = unique(mydata_long$`Read Overlap`))

p1 <- ggplot(mydata_long, aes(x = `Cell Barcode`, y = count, fill = `Read Overlap`)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 0), legend.position = "none")
p1 <- p1 + labs(title = paste(nrow(mydata), "selected barcodes for", snakemake@wildcards$sample),
                x = "Barcodes", y = "Bases")
p1 <- p1 + theme(axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank())
p1 <- p1 + scale_y_continuous(labels = scales::scientific)
p1 <- p1 + scale_fill_viridis(discrete = TRUE, option = "viridis")

mydata_long_pct <- mydata_pct %>% gather("Read Overlap", fraction, -"Cell Barcode")
# Keep the original order of the barcodes using factor and levels.
mydata_long_pct$`Cell Barcode` <- factor(mydata_long_pct$`Cell Barcode`, levels = factor(unique(mydata_long_pct$`Cell Barcode`)))
mydata_long_pct$`Read Overlap` <- factor(mydata_long_pct$`Read Overlap`, levels = unique(mydata_long_pct$`Read Overlap`))

p2 <- ggplot(mydata_long_pct, aes(x = `Cell Barcode`, y = fraction, fill = `Read Overlap`)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 0, size = 8, vjust = 0.05), legend.position = "bottom") +
  labs(x = "Barcodes", y = "%Bases") +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_viridis(discrete = TRUE, option = "viridis")

# This aligns the two main plots so that both can be related directly to the labels of the bottom one.
gp1 <- ggplotGrob(p1)
gp2 <- ggplotGrob(p2)

pdf(file = snakemake@output$pdf, width = 16, height = 13)
grid::grid.newpage()
grid::grid.draw(rbind(gp1, gp2, size = "last"))
dev.off()

if (debug_flag) {
  save.image(file = file.path(path_debug, "plot_rna_metrics_workspace.rdata"))
}
```
```r
debug_flag <- FALSE
if (snakemake@config$DEBUG) {
  debug_flag <- TRUE
  message("In debug mode: saving R objects to inspect later")
  path_debug <- file.path(snakemake@config$LOCAL$results, "debug")
  dir.create(path_debug, showWarnings = FALSE)
  save(snakemake, file = file.path(
    path_debug,
    paste0("plot_species_plot_snakemake_", attr(snakemake, "wildcard")$sample, ".rdata")
  ))
}
#### /debug

categorizeCellsUsingKneeKnownNumCellsPaper <- function(
    digitalExpressionFileO1,
    digitalExpressionFileO2,
    organismOne,
    organismTwo,
    pureRatio = 0.2,
    numCells,
    numBeads,
    point.cex = 1.5,
    xlim_range = NULL,
    ylim_range = NULL,
    category = "transcripts") {
  dfFull <- getNumTranscriptsPerCellBarcodeByOrganismPair(
    digitalExpressionFileO1, digitalExpressionFileO2, organismOne, organismTwo, category)
  dfFull <- dfFull[order(dfFull$total, decreasing = T), ]
  dfFull$ratio_one <- dfFull[, 2] / dfFull[, 4]
  dfFull <- head(dfFull, n = numBeads)
  df <- head(dfFull, n = numCells)
  dfNoCall <- dfFull[-1:-numCells, ]
  if (dim(dfNoCall)[1] > 0) {
    dfNoCall$organism <- "No Call"
  }
  df$organism <- "Mixed"
  idx <- which(df$ratio_one >= (1 - pureRatio))
  # checks if the species is actually assigned at all
  if (length(idx) > 0) {
    df[idx, ]$organism <- organismOne
  }
  idx <- which(df$ratio_one <= (pureRatio))
  if (length(idx) > 0) {
    df[idx, ]$organism <- organismTwo
  }
  result <- rbind(df, dfNoCall)
  maxRange <- max(result[, 2], result[, 3])

  dforganismOne <- result[result$organism == organismOne, ]
  dforganismTwo <- result[result$organism == organismTwo, ]
  dfMixed <- result[result$organism == "Mixed", ]
  dfNoCall <- result[result$organism == "No Call", ]

  if (is.null(xlim_range)) {
    xlim_range <- c(0, maxRange)
  }
  if (is.null(ylim_range)) {
    ylim_range <- c(0, maxRange)
  }

  colors <- c("blue", "red", "purple", "grey")
  plot(dforganismOne[, 2], dforganismOne[, 3], col = colors[1], pch = 16,
       xlim = xlim_range, ylim = ylim_range,
       xlab = paste(organismOne, category), ylab = paste(organismTwo, category),
       cex = point.cex)
  points(dforganismTwo[, 2], dforganismTwo[, 3], col = colors[2], pch = 16, cex = point.cex)
  points(dfMixed[, 2], dfMixed[, 3], col = colors[3], pch = 16, cex = point.cex)
  points(dfNoCall[, 2], dfNoCall[, 3], col = colors[4], pch = 16, cex = point.cex)
  l <- c(paste(organismOne, dim(dforganismOne)[1]),
         paste(organismTwo, dim(dforganismTwo)[1]),
         paste("Mixed", dim(dfMixed)[1]),
         paste("No Call", dim(dfNoCall)[1]))
  legend("topright", legend = l, fill = colors)
  title(paste("Species plot based on", category))
  return(df)
}

getNumTranscriptsPerCellBarcodeByOrganismPair <- function(
    digitalExpressionFileO1,
    digitalExpressionFileO2,
    organismOne,
    organismTwo,
    category) {
  if (is.null(organismOne) || is.null(organismTwo)) {
    return(NULL)
  }
  o1 <- getGenesAndTranscriptsPerCellBarcode(digitalExpressionFileO1)
  o2 <- getGenesAndTranscriptsPerCellBarcode(digitalExpressionFileO2)
  commonBC <- union(o1$cellBC, o2$cellBC)
  o1p <- o1[match(commonBC, o1$cellBC), ]
  o2p <- o2[match(commonBC, o2$cellBC), ]
  if (category == "genes") {
    df <- data.frame(tag = commonBC, o1Count = o1p$numGenes, o2Count = o2p$numGenes, stringsAsFactors = F)
  } else {
    df <- data.frame(tag = commonBC, o1Count = o1p$numTranscripts, o2Count = o2p$numTranscripts, stringsAsFactors = F)
  }
  idx1 <- which(is.na(df$o1Count))
  idx2 <- which(is.na(df$o2Count))
  if (length(idx1) > 0) df[idx1, ]$o1Count <- 0
  if (length(idx2) > 0) df[idx2, ]$o2Count <- 0
  df$total <- apply(df[, 2:3], 1, sum, na.rm = T)
  df <- df[order(df$total, decreasing = T), ]
  colnames(df)[2] <- organismOne
  colnames(df)[3] <- organismTwo
  return(df)
}

getGenesAndTranscriptsPerCellBarcode <- function(digitalExpressionFile) {
  a <- read.table(digitalExpressionFile, header = T, stringsAsFactors = F)
  colnames(a) <- c("cellBC", "numGenicReads", "numTranscripts", "numGenes")
  return(a)
}

digitalExpressionFileO1 <- snakemake@input[[1]][1]
digitalExpressionFileO2 <- snakemake@input[[2]][1]
num_cells <- snakemake@params$expected_cells
organismOne <- names(snakemake@config$META$species)[1]
organismTwo <- names(snakemake@config$META$species)[2]

par(mar = c(5, 4, 4, 2) + 0.5)

pdf(snakemake@output$genes_pdf, height = 8, width = 8)
df_temp <- categorizeCellsUsingKneeKnownNumCellsPaper(
  digitalExpressionFileO1,
  digitalExpressionFileO2,
  organismOne = organismOne,
  organismTwo = organismTwo,
  pureRatio = snakemake@config$META$ratio,
  numCells = num_cells,
  numBeads = num_cells * 2,
  point.cex = 1,
  category = "genes"
)
dev.off()

pdf(snakemake@output$transcripts_pdf, height = 8, width = 8)
df <- categorizeCellsUsingKneeKnownNumCellsPaper(
  digitalExpressionFileO1,
  digitalExpressionFileO2,
  organismOne = organismOne,
  organismTwo = organismTwo,
  pureRatio = snakemake@config$META$ratio,
  numCells = num_cells,
  numBeads = num_cells * 2,
  point.cex = 1,
  category = "transcripts"
)
dev.off()

organism1 <- subset(df, df$organism == organismOne)
organism2 <- subset(df, df$organism == organismTwo)

write.table(organism1$tag, snakemake@output$barcodes_species[1], row.names = F, col.names = F, quote = F)
write.table(organism2$tag, snakemake@output$barcodes_species[2], row.names = F, col.names = F, quote = F)

# save.image(paste0(snakemake@output$genes_pdf,".rdata"))
if (debug_flag) {
  save.image(file = file.path(
    path_debug,
    paste0("plot_species_plot_workspace_", attr(snakemake, "wildcard")$sample, ".rdata")
  ))
}
```
13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 | debug_flag <- FALSE if (snakemake@config$DEBUG) { debug_flag <- TRUE message("In debug mode: saving R objects to inspect later") path_debug <- file.path(snakemake@config$LOCAL$results, "debug") dir.create(path_debug, showWarnings = FALSE) save(snakemake, file = file.path(path_debug, "plot_violin_snakemake.rdata")) } options(warn = -1) library(plyr, quietly = TRUE, warn.conflicts = FALSE) library(dplyr, quietly = TRUE, warn.conflicts = FALSE) # Dataframe manipulation library(Matrix, quietly = TRUE, warn.conflicts = FALSE) # Sparse matrices library(stringr, quietly = TRUE, warn.conflicts = FALSE) library(RColorBrewer, quietly = TRUE, warn.conflicts = FALSE) library(devtools, quietly = TRUE, warn.conflicts = FALSE) library(Seurat, quietly = TRUE, warn.conflicts = FALSE) library(plotly, quietly = TRUE, warn.conflicts = FALSE) # rule map in Snakefile # rule map: # input: # 'plots/violinplots_comparison_UMI.pdf', # ... # importing UMI # importing counts ( summary/counts_expression_matrix.tsv ) ReadMTX <- function(mtx_path) { data_dir <- dirname(mtx_path) files <- list.files(data_dir) # Find files barcodes_file <- grep("barcodes", files, value = TRUE) features_file <- grep(pattern = "genes|features", x = files, value = TRUE) mtx <- grep("mtx", files, value = TRUE) # load the data data <- readMM(file.path(data_dir, mtx)) barcodes <- read.csv(file.path(data_dir, barcodes_file), header = FALSE)$V1 features <- read.csv(file.path(data_dir, features_file), header = FALSE)$V1 colnames(data) <- barcodes rownames(data) <- features return(data) } #count_matrix <- ReadMTX(snakemake@input$counts) # importing UMIs ( summary/umi_expression_matrix.tsv ) #umi_matrix <- ReadMTX(snakemake@input$UMIs) count_matrix <- Read10X(file.path(snakemake@wildcards$results_dir,'summary','read')) umi_matrix <- Read10X(file.path(snakemake@wildcards$results_dir,'summary','umi')) design <- read.csv(snakemake@input$design, stringsAsFactors = TRUE, header = TRUE, row.names = NULL ) metaData <- data.frame(cellNames = colnames(umi_matrix)) %>% mutate(samples = factor(str_replace(cellNames, "_[^_]*$", ""))) %>% mutate(barcode = factor(str_replace(cellNames, ".+_", ""))) %>% left_join(design, by = "samples") rownames(metaData) <- metaData$cellNames # possible to set is.expr = -1 to avoid filtering whilst creating # seuratobj <- CreateSeuratObject(raw.data = umi_matrix, meta.data = metaData, is.expr = -1) seuratobj <- CreateSeuratObject(raw.data = umi_matrix, meta.data = metaData) seuratobj <- SetAllIdent(object = seuratobj, id = "samples") # relabel cell idenity (https://github.com/satijalab/seurat/issues/380) seuratobj@meta.data$orig.ident <- seuratobj@meta.data$samples mycount <- CreateSeuratObject(raw.data = count_matrix, meta.data = metaData) mycount <- SetAllIdent(object = mycount, id = 
"samples") mycount@meta.data$orig.ident <- mycount@meta.data$samples # turn off filtering # note, the @meta.data slot contains usefull summary stuff # head([email protected],2) # nGene nUMI expected_cells read_length barcode # dropseqLib1_ACTAACATTATT 15 33 400 100 ACTAACATTATT # dropseqLib1_GAGTCTGAGGCG 5 9 400 100 GAGTCTGAGGCG # origin origin # dropseqLib1_ACTAACATTATT dropseqLib1 dropseqLib1 # dropseqLib1_GAGTCTGAGGCG dropseqLib1 dropseqLib1 meta.data <- seuratobj@meta.data # combining UMIs and Counts in to one Seurat object meta.data$nCounts <- mycount@meta.data$nUMI seuratobj@meta.data <- meta.data # delete since Counts have been added to seuratobj as nCounts column rm(mycount) # mytheme <- theme_bw(base_size = 9) + mytheme <- theme_bw() + theme( legend.position = "right", axis.ticks = element_blank(), axis.text.x = element_text(angle = 300, hjust = 0) ) theme_set(mytheme) # predefined ggplot layers for subsequent plots gglayers <- list( geom_smooth(method = "loess"), geom_point(size = .5), scale_y_continuous( labels = scales::unit_format(unit = "", scale = 1e-3, digits = 2), breaks = scales::pretty_breaks(n = 8) ), scale_x_continuous( labels = scales::unit_format(unit = "", scale = 1e-3, digits = 2), breaks = scales::pretty_breaks(n = 8) ) ) gg <- ggplot(meta.data, aes(x = nUMI, y = nCounts, color = orig.ident)) + # coord_trans(y="log10",x = "log10") + gglayers + geom_abline(intercept = 0, slope = 1) + labs( title = "UMI counts vs raw Counts", subtitle = "Number of UMIs and raw Counts for each Bead", x = "Number of UMIs per Bead [k]", y = "Number of Counts per Bead [k]" ) # dev.new() # htmlwidgets::saveWidget(ggplotly(gg), file.path(getwd(),snakemake@output$html_umivscounts)) ggsave(gg, file = file.path(getwd(), snakemake@output$pdf_umivscounts), width = 12, height = 7) # how about unaligned reads/UMI? # Note(Seb): raw.data is actually filtered data i.e. nr of genes likely to be smaller than input data! 
The tail of the Seurat QC script computes per-cell mitochondrial and ribosomal fractions, draws the QC violin and scatter plots, and exports the annotated Seurat object (the object setup and the `meta.data`/`gglayers` definitions happen earlier in the script):

```r
# Gene sets for the QC fractions (mouse-style prefixes, matched case-insensitively)
mito.gene.names <- grep("^mt-", rownames(seuratobj@raw.data), value = TRUE, ignore.case = TRUE)
sribo.gene.names <- grep("^Rps", rownames(seuratobj@raw.data), value = TRUE, ignore.case = TRUE)
lribo.gene.names <- grep("^Rpl", rownames(seuratobj@raw.data), value = TRUE, ignore.case = TRUE)

col.total <- Matrix::colSums(seuratobj@raw.data)
meta.data$col.total <- col.total

# Fraction of each cell's counts contributed by its 50 most expressed genes
seuratobj.top_50 <- apply(seuratobj@raw.data, 2,
                          function(x) sum(x[order(x, decreasing = TRUE)][1:50]) / sum(x))

seuratobj <- AddMetaData(seuratobj, Matrix::colSums(seuratobj@raw.data[sribo.gene.names, ]) / col.total, "pct.sribo")
seuratobj <- AddMetaData(seuratobj, Matrix::colSums(seuratobj@raw.data[lribo.gene.names, ]) / col.total, "pct.lribo")
seuratobj <- AddMetaData(seuratobj, Matrix::colSums(seuratobj@raw.data[unique(c(sribo.gene.names, lribo.gene.names)), ]) / col.total, "pct.Ribo")
seuratobj <- AddMetaData(seuratobj, Matrix::colSums(seuratobj@raw.data[mito.gene.names, ]) / col.total, "pct.mito")
seuratobj <- AddMetaData(seuratobj, seuratobj.top_50, "top50")

# UMIs per detected gene, per cell
tmp <- seuratobj@meta.data$nUMI / seuratobj@meta.data$nGene
names(tmp) <- rownames(seuratobj@meta.data)
seuratobj <- AddMetaData(seuratobj, tmp, "umi.per.gene")

gg <- VlnPlot(seuratobj,
              c("nUMI", "nGene", "top50", "umi.per.gene", "pct.Ribo", "pct.mito"),
              x.lab.rot = TRUE, do.return = TRUE)
ggsave(gg, file = snakemake@output$pdf_violine, width = 18, height = 18)

gg <- ggplot(meta.data, aes(x = nUMI, y = nGene, color = orig.ident)) +
  gglayers +
  labs(title = "Genes (pooled mouse and human set) vs UMIs for each bead",
       x = "Number of UMIs per Bead [k]",
       y = "Number of Genes per Bead [k]")
ggsave(gg, file = snakemake@output$pdf_umi_vs_gene, width = 12, height = 7)

## Same plot with Counts instead of UMIs
gg <- ggplot(meta.data, aes(x = nCounts, y = nGene, color = orig.ident)) +
  gglayers +
  labs(title = "Genes (pooled mouse and human set) vs Counts for each bead",
       x = "Number of Counts per Bead [k]",
       y = "Number of Genes per Bead [k]")
ggsave(gg, file = snakemake@output$pdf_count_vs_gene, width = 12, height = 7)

# head(meta.data, 2)
#                      nGene nUMI            cellNames samples      barcode expected_cells read_length  batch orig.ident pct.sribo  pct.lribo  pct.Ribo  pct.mito     top50 umi.per.gene
# sample1_GAGTCTGAGGCG     6    6 sample1_GAGTCTGAGGCG sample1 GAGTCTGAGGCG            100         100 batch1    sample1 0.0000000 0.00000000 0.0000000 0.0000000 1.0000000     1.000000
# sample1_CAGCCCTCAGTA   264  437 sample1_CAGCCCTCAGTA sample1 CAGCCCTCAGTA            100         100 batch1    sample1 0.0389016 0.07551487 0.1144165 0.0228833 0.5102975     1.655303

# Save the snakemake meta information into the misc slot so that everything
# can be exported as one object.
seuratobj@misc <- snakemake

# Export the Seurat object to summary/R_Seurat_objects.rdata
saveRDS(seuratobj, file = file.path(snakemake@output$R_objects))

if (debug_flag) {
  save.image(file = file.path(path_debug, "plot_violin_workspace.rdata"))
}
```
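For orientation, a Snakemake rule driving this script might look like the sketch below. Only the output keys (`pdf_violine`, `pdf_umi_vs_gene`, `pdf_count_vs_gene`, `R_objects`) are taken from the `snakemake@output` accesses in the R code; the rule name, the input, the plot paths, and the script filename are assumed placeholders.

```python
# Hypothetical wiring; adapt names and paths to your workflow.
rule plot_violine:
    input:
        "summary/umi_expression_matrix.tsv",  # placeholder input
    output:
        pdf_violine="plots/violinplots_comparison_UMI.pdf",
        pdf_umi_vs_gene="plots/UMI_vs_gene.pdf",
        pdf_count_vs_gene="plots/Count_vs_gene.pdf",
        R_objects="summary/R_Seurat_objects.rdata",
    script:
        "scripts/plot_violine.R"
```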
The yield-plot script aggregates per-sample read statistics from cutadapt, bbmap's repair.sh and STAR, then stacks two bar charts (absolute read numbers and percentages) into a single PDF:

```r
debug_flag <- FALSE
if (snakemake@config$DEBUG) {
  debug_flag <- TRUE
  message("In debug mode: saving R objects to inspect later")
  path_debug <- file.path(snakemake@config$LOCAL$results, "debug")
  dir.create(path_debug)
  save(snakemake, file = file.path(path_debug, "plot_yield_snakemake.rdata"))
}
#------------------------------------ debugging

library(ggplot2)
library(tidyr)
library(grid)
library(gridExtra)
library(viridis)
library(stringr)

samples <- snakemake@params$sample_names
batches <- snakemake@params$batches

mydata <- data.frame(matrix(nrow = length(samples), ncol = 7))
colnames(mydata) <- c("Sample", "Batch", "Cutadapt filtered", "Unmapped",
                      "Multi mapped", "Uniquely mapped", "Total reads")
mydata[, "Sample"] <- samples
mydata[, "Batch"] <- batches

for (i in seq_along(samples)) {
  # Input files and variables
  STAR_output <- read.table(snakemake@input$STAR_output[i], skip = 5,
                            sep = "\t", fill = TRUE, stringsAsFactors = FALSE)
  mysample <- samples[i]

  # Read pairs left after repair.sh: the "Pairs:" line of the bbmap log
  # reports reads, so divide by two.
  bbmap_log <- readLines(snakemake@input$repaired[i])
  reads_after_filtering <- as.numeric(
    str_match(bbmap_log[grep("^Pairs:", bbmap_log)],
              pattern = "\t([0-9]{1,20}) reads.*")[, 2]) / 2

  # Total reads processed, from the cutadapt report for R1
  R1_filtered <- read.table(snakemake@input$R1_filtered[i], header = FALSE,
                            skip = 7, sep = ":", nrows = 7, row.names = 1)
  total_reads <- as.numeric(
    str_replace_all(R1_filtered["Total reads processed", ], pattern = (" |,"), ""))

  mydata[which(mydata$Sample == mysample), "Cutadapt filtered"] <- total_reads - reads_after_filtering
  mydata[which(mydata$Sample == mysample), "Total reads"] <- total_reads

  # STAR Log.final.out (read with skip = 5):
  # row 1 = "Number of input reads", row 4 = "Uniquely mapped reads number",
  # row 19 = "Number of reads mapped to multiple loci"
  reads_in <- as.numeric(STAR_output$V2[1])
  uniquely_mapped <- as.numeric(STAR_output$V2[4])
  multi_mapped <- as.numeric(STAR_output$V2[19])
  unmapped <- reads_in - uniquely_mapped - multi_mapped
  mydata[which(mydata$Sample == mysample), "Uniquely mapped"] <- uniquely_mapped
  mydata[which(mydata$Sample == mysample), "Multi mapped"] <- multi_mapped
  mydata[which(mydata$Sample == mysample), "Unmapped"] <- unmapped
}

# melt() is being retired; use tidyr::gather() instead
# (https://github.com/hadley/reshape)
mydata_long <- mydata %>% gather(variable, value, -Sample, -Batch)

# Force the factor order
mydata_long$variable <- factor(mydata_long$variable,
                               levels = c("Cutadapt filtered", "Multi mapped",
                                          "Total reads", "Unmapped", "Uniquely mapped"))

p1 <- ggplot(subset(mydata_long, mydata_long$variable != "Total reads"),
             aes(x = Sample, y = value, fill = variable)) +
  geom_histogram(stat = "identity", binwidth = 1 / length(samples)) +
  labs(title = "Yield of all the reads for each category",
       x = "Samples", y = "Number of reads") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        legend.position = "none",
        plot.title = element_text(size = 20, face = "bold")) +
  facet_grid(~Batch, scales = "free") +
  scale_fill_viridis(discrete = TRUE, option = "viridis") +
  scale_y_continuous(labels = scales::scientific)

# Same data as fractions of the total reads
mydata_pct <- mydata[, -c(1, 2)] / mydata[, "Total reads"]
mydata_pct <- cbind(Sample = mydata[, "Sample"], Batch = mydata[, "Batch"], mydata_pct)
mydata_long_pct <- mydata_pct %>% gather(variable, value, -Sample, -Batch)
mydata_long_pct$variable <- factor(mydata_long_pct$variable,
                                   levels = c("Cutadapt filtered", "Multi mapped",
                                              "Total reads", "Unmapped", "Uniquely mapped"))

p2 <- ggplot(subset(mydata_long_pct, mydata_long_pct$variable != "Total reads"),
             aes(x = Sample, y = value, fill = variable)) +
  labs(fill = "Filters") +
  geom_histogram(stat = "identity", binwidth = 1 / length(samples)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 0),
        legend.position = "bottom",
        strip.background = element_blank(),
        strip.text.x = element_blank()) +
  labs(x = "Samples", y = "Percentage of reads") +
  facet_grid(~Batch, scales = "free") +
  scale_fill_viridis(discrete = TRUE, option = "viridis") +
  scale_y_continuous(labels = scales::percent)

# Stack the absolute and the percentage plot into one PDF
gp1 <- ggplotGrob(p1)
gp2 <- ggplotGrob(p2)
pdf(file = snakemake@output$pdf, width = 16, height = 13)
grid::grid.newpage()
grid::grid.draw(rbind(gp1, gp2, size = "last"))
dev.off()

if (debug_flag) {
  save.image(file = file.path(path_debug, "plot_yield_workspace.rdata"))
}
```
A short helper script records the pinned tool versions by parsing the conda environment files:

```r
library(yaml)

# Extract the version pins of the core tools from the conda environment files
versions <- list()
for (yaml_file in snakemake@input$yaml_files) {
  current_env <- yaml.load_file(yaml_file)
  for (package in current_env$dependencies) {
    if (grepl(pattern = "cutadapt", package)) {
      versions[["cutadapt"]] <- strsplit(package, "=|==| ==| == ")[[1]][2]
    } else if (grepl(pattern = "star", package)) {
      versions[["star"]] <- strsplit(package, "=|==| ==| == ")[[1]][2]
    } else if (grepl(pattern = "dropseq_tools", package)) {
      versions[["dropseq_tools"]] <- strsplit(package, "=|==| ==| == ")[[1]][2]
    } else if (grepl(pattern = "bbmap", package)) {
      versions[["bbmap"]] <- strsplit(package, "=|==| ==| == ")[[1]][2]
    }
  }
}

umi_distance <- snakemake@config$EXTRACTION$`UMI-edit-distance`
```
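The split pattern above handles conda-style pins such as `tool=1.0` or `tool==1.0`. A minimal standalone sketch of the same extraction in Python, with illustrative pin strings that are not taken from the workflow's actual environment files:

```python
# Sketch of the version extraction; the dependency pins are made up.
import re

dependencies = ["star=2.7.3a", "cutadapt==2.3", "dropseq_tools=2.0.0", "bbmap=38.22"]
versions = {}
for dep in dependencies:
    # Split once on '=' or '==' (optionally padded with spaces).
    name, _, version = re.split(r"(\s*==?\s*)", dep, maxsplit=1)
    versions[name] = version

print(versions)  # {'star': '2.7.3a', 'cutadapt': '2.3', ...}
```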
The barcode-repair script streams a tagged BAM file: reads whose cell barcode (`XC` tag) is already in the reference whitelist pass through unchanged, barcodes found in the extended reference are rewritten to the reference barcode they belong to, and anything else is passed through untouched but counted:

```python
import pickle

import pysam


def load_obj(name):
    with open(name, 'rb') as f:
        return pickle.load(f)


def save_obj(obj, name):
    with open(name, 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)


infile_bam = pysam.AlignmentFile(snakemake.input.bam, "rb")
outfile = pysam.AlignmentFile(snakemake.output.bam, "wb", template=infile_bam)

mapping = load_obj(snakemake.input.barcode_mapping)
barcode_ref = load_obj(snakemake.input.barcode_ref)
barcode_ext_ref = load_obj(snakemake.input.barcode_ext_ref)
unknown_barcodes = set()

for bam_read in infile_bam:
    barcode = bam_read.get_tag('XC')
    if barcode in barcode_ref:
        # Barcode is in the reference whitelist: just count it.
        mapping[0][barcode]['count'] += 1
    elif barcode in barcode_ext_ref:
        # Barcode is in the extended reference: replace it with the
        # original reference barcode it maps to.
        reference_barcode = mapping[1][barcode]['ref']
        mapping[1][barcode]['count'] += 1
        bam_read.set_tag('XC', reference_barcode, value_type='Z', replace=True)
    else:
        # Unknown barcode: leave it unmodified, but keep a count.
        if barcode in unknown_barcodes:
            mapping['unknown'][barcode]['count'] += 1
        else:
            mapping['unknown'][barcode] = {'count': 1}
            unknown_barcodes.add(barcode)
    outfile.write(bam_read)

infile_bam.close()
outfile.close()

save_obj(obj=mapping, name=snakemake.output.barcode_mapping_counts)
```
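Because the counts are pickled at the end, they can be inspected after a run. A quick, hypothetical post-run check of the most frequent unknown barcodes (the pickle path is an assumption; use whatever `barcode_mapping_counts` points to in your run):

```python
# Hypothetical inspection of the pickled counts written above.
import pickle

with open("barcode_mapping_counts.pkl", "rb") as f:  # path is an assumption
    mapping = pickle.load(f)

# mapping['unknown'] maps barcode -> {'count': n}, as built by the script.
top_unknown = sorted(mapping["unknown"].items(),
                     key=lambda kv: kv[1]["count"],
                     reverse=True)[:10]
for barcode, info in top_unknown:
    print(barcode, info["count"])
```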
Its companion script builds that mapping from a barcode whitelist and pickles the mapping together with the two barcode sets:

```python
from collections import defaultdict
import pickle


def save_obj(obj, name):
    with open(name, 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)


mapping = defaultdict(dict)
barcode_ref = set()
barcode_ext_ref = set()

with open(snakemake.input['whitelist'], 'r') as whitelist:
    for line in whitelist:
        if len(line.strip().split()) == 2:
            # Two fields: no linked (extended) barcodes for this reference.
            (reference, counts_ref) = line.strip().split()
            mapping[0][reference] = defaultdict()
            mapping[0][reference]['ref'] = reference
            mapping[0][reference]['count'] = 0
            mapping[0][reference]['lanes'] = {str(lane): 0 for lane in range(1, 9)}
            barcode_ref.add(reference)
            continue
        # Four fields: reference, comma-separated extended barcodes, and counts.
        (reference, extended_ref, counts_ref, counts_ext) = line.strip().split()
        mapping[0][reference] = defaultdict()
        mapping[0][reference]['ref'] = reference
        mapping[0][reference]['count'] = 0
        mapping[0][reference]['lanes'] = {str(lane): 0 for lane in range(1, 9)}
        barcode_ref.add(reference)
        for barcode in extended_ref.split(','):
            mapping[1][barcode] = defaultdict()
            mapping[1][barcode]['ref'] = reference
            mapping[1][barcode]['count'] = 0
            mapping[1][barcode]['lanes'] = {str(lane): 0 for lane in range(1, 9)}
        barcode_ext_ref.update(extended_ref.split(','))

# Save the mapping and both references for reuse downstream.
save_obj(obj=mapping, name=snakemake.output.barcode_mapping)
save_obj(obj=barcode_ref, name=snakemake.output.barcode_ref)
save_obj(obj=barcode_ext_ref, name=snakemake.output.barcode_ext_ref)
```
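The whitelist parsed above is therefore expected to contain either two or four whitespace-separated fields per line. A small sketch with made-up barcodes shows both shapes and how they fall into the two branches:

```python
# Made-up example lines in the two formats handled by the parser above.
lines = [
    "AAACCTGAGAAA 120",                             # reference, counts_ref
    "AAACCTGAGCCC AAACCTGAGCCA,AAACCTGAGCCG 95 7",  # ref, extended list, counts
]
for line in lines:
    fields = line.strip().split()
    if len(fields) == 2:
        reference, counts_ref = fields
    else:
        reference, extended_ref, counts_ref, counts_ext = fields
        print(reference, "<-", extended_ref.split(","))
```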
The workflow also bundles the snakemake wrapper scripts it relies on. First, the STAR alignment wrapper:

```python
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os

from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

fq1 = snakemake.input.get("fq1")
assert fq1 is not None, "input-> fq1 is a required input parameter"
fq1 = [snakemake.input.fq1] if isinstance(snakemake.input.fq1, str) else snakemake.input.fq1
fq2 = snakemake.input.get("fq2")
if fq2:
    fq2 = [snakemake.input.fq2] if isinstance(snakemake.input.fq2, str) else snakemake.input.fq2
    assert len(fq1) == len(fq2), "input-> equal number of files required for fq1 and fq2"
input_str_fq1 = ",".join(fq1)
input_str_fq2 = ",".join(fq2) if fq2 is not None else ""
input_str = " ".join([input_str_fq1, input_str_fq2])

if fq1[0].endswith(".gz"):
    readcmd = "--readFilesCommand zcat"
else:
    readcmd = ""

outprefix = os.path.dirname(snakemake.output[0]) + "/"

shell(
    "STAR "
    "{extra} "
    "--runThreadN {snakemake.threads} "
    "--genomeDir {snakemake.params.index} "
    "--readFilesIn {input_str} "
    "{readcmd} "
    "--outSAMtype BAM Unsorted "
    "--outFileNamePrefix {outprefix} "
    "--outStd Log "
    "{log}")
```
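A rule invoking this wrapper might look like the following sketch. Every path and the rule name are placeholders; only the `fq1`/`fq2` input keys and `params.index`/`params.extra` are dictated by the wrapper itself:

```python
# Hypothetical rule using the STAR wrapper above.
rule star_align:
    input:
        fq1="trimmed/{sample}_R1.fastq.gz",
        fq2="trimmed/{sample}_R2.fastq.gz",
    output:
        "mapped/{sample}/Aligned.out.bam",
    params:
        index="ref/STAR_index",  # path to the STAR genome index
        extra="",                # additional STAR arguments
    threads: 8
    log:
        "logs/star/{sample}.log",
    script:
        "path/to/star_wrapper.py"  # placeholder location of the wrapper
```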
The FastQC wrapper runs FastQC inside a temporary directory, since concurrent jobs sharing one output directory can race, and then moves the report files into place:

```python
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)


def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""
    base = path.basename(file_path)
    split_ind = 2 if base.endswith(".gz") else 1
    base = ".".join(base.split(".")[:-split_ind])
    return base


# Run fastqc; since there can be race conditions if multiple jobs
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell("fastqc {snakemake.params} --quiet "
          "--outdir {tempdir} {snakemake.input[0]}"
          " {log}")

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path} {snakemake.output.html}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path} {snakemake.output.zip}")
```
Finally, the MultiQC wrapper aggregates all QC reports into a single document:

```python
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell

input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}")
```
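A corresponding rule sketch, again with placeholder paths (the wrapper only requires a list of QC inputs, one report output, and an optional positional `params` string of extra MultiQC arguments):

```python
# Hypothetical rule for the MultiQC wrapper above; all paths are placeholders.
rule multiqc:
    input:
        expand("qc/fastqc/{sample}_fastqc.zip", sample=["sample1", "sample2"]),
    output:
        "reports/multiqc.html",
    params:
        "",  # optional extra command-line arguments
    log:
        "logs/multiqc.log",
    script:
        "path/to/multiqc_wrapper.py"  # placeholder location of the wrapper
```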