A SingleCell RNASeq pre-processing snakemake workflow

public 10mo ago Version: v0.32 0 bookmarks

View Workflow

Description

This pipeline is based on snakemake and the dropseq tools provided by the McCarroll Lab . It allows to go from raw data of your Single Cell RNA seq experiment until the final count matrix with QC plots along the way.

This is the tool we use in our lab to improve our wetlab protocol as well as provide an easy framework to reproduce and compare different experiments with different parameters.

It uses STAR to map the reads. It is usable for any single cell protocol using two reads where the first one holds the Cell and UMI barcodes and the second read holds the RNA. Here is a non-exhausitve list of compatible protocols/brands:

Drop-Seq
SCRB-Seq
10x Genomics
DroNc-seq
Dolomite Bio ( Nadia Instrument )

This package is trying to be as user friendly as possible. One of the hopes is that non-bioinformatician can make use of it without too much hassle. It will still require some command line execution, this is not going to be fully interactive package.

Authors

Patrick Roelli ( @Hoohm) )
Sebastian Mueller ( @seb-mueller) )
Charles Girardot ( @cgirardot) )

Usage

Step 1: Install workflow

If you simply want to use this workflow, download and extract the latest release . If you intend to modify and further develop this workflow, fork this reposity. Please consider providing any generally applicable modifications via a pull request.

In any case, if you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this repository and, once available, its DOI.

Step 2: Configure workflow

Configure the workflow according to your needs via editing the file config.yaml and the samples.tsv following those instructions

Step 3: Execute workflow

All you need to execute this workflow is to install Snakemake via the Conda package manager . Software needed by this workflow is automatically deployed into isolated environments by Snakemake.

Test your configuration by performing a dry-run via

snakemake --use-conda -n --directory $WORKING_DIR

Execute the workflow locally via

snakemake --use-conda --cores $N --directory $WORKING_DIR

using $N cores on the $WORKING_DIR . Alternatively, it can be run in cluster or cloud environments (see the docs for details).

If you not only want to fix the software stack but also the underlying OS, use

snakemake --use-conda --use-singularity

in combination with any of the modes above.

Step 4: Investigate results

After successful execution, you can create a self-contained report with all results via:

snakemake --report report.html

Documentation

You can find the documentation here

Future implementations

I'm actively seeking help to implement the points listed bellow. Don't hesitate to contact me if you wish to contribute.

Create a sharing platform where quality plots/logs can be discussed and troubleshooted.
Create a full html report for the whole pipeline
Multiqc module for drop-seq-tools
Implement an elegant "preview" mode where the pipeline would only run on a couple of millions of reads and allow you to have an approximated view before running all of the data. This would dramatically reduce the time needed to get an idea of what filters whould be used.
I hope it can help you out in your single cell experiments!

Feel free to comment and point out potential improvements via issues

TODO

Add a mixed reference reference for testing purposes
Finalize the parameters validation schema
Make the debug feature a bit "cleaner". Deal with automatic naming of the debug variables
Implement ddseq barcoding strategies

Code Snippets

script:
    '../scripts/generate_extended_ref.py'

SnakeMake From line 19 of rules/cell_barcodes.smk

shell:
    """umi_tools whitelist\
    --stdin {input}\
    --bc-pattern='(?P<cell_1>.{{{params.cell_barcode_length}}})(?P<umi_1>.{{{params.umi_barcode_length}}})'\
    --extract-method=regex\
    --set-cell-number={params.num_cells}\
    --log2stderr > {output}"""

SnakeMake umi_tools From line 32 of rules/cell_barcodes.smk

shell:
    """cat {input} | cut -f 1 > {output}"""

SnakeMake From line 45 of rules/cell_barcodes.smk

script:
    '../scripts/umi_tools_extended_ref.py'

SnakeMake From line 56 of rules/cell_barcodes.smk

script:
    '../scripts/repair_barcodes.py'

SnakeMake From line 70 of rules/cell_barcodes.smk

shell:
    "gunzip -c -d {input} > {output}"

SnakeMake From line 36 of rules/download_meta_mixed.smk

shell:
    "gunzip -d -c {input} > {output}"

SnakeMake From line 44 of rules/download_meta_mixed.smk

shell:
    """sed -e 's/>/>{params.species}/g' {input} > {output}"""

SnakeMake From line 55 of rules/download_meta_mixed.smk

shell:
    """cat {input.genome1} {input.genome2} > {output}"""

SnakeMake From line 77 of rules/download_meta_mixed.smk

run:
    import datetime
    import re
    header1="#!Mixed reference of {} and {}\n".format(
        species_list[0],
        species_list[1])
    header2="#!genome-builds GRC{}{} GRC{}{}\n".format(
        species_list[0].lower()[0],
        build_list[0],
        species_list[1].lower()[0],
        build_list[1])
    header3="#!genome-releases {} {}\n".format(
        release_list[0],
        release_list[1])
    header4="#!genome-date {}\n".format(str(datetime.date.today()))
    header=[header1,header2,header3,header4]
    with open(input.annotation1[0]) as annotation1:
        with open(input.annotation2[0]) as annotation2:
            with open(output[0], 'w') as outfile:
                outfile.writelines(header)
                for line in annotation1:
                    if(not line.startswith('#!')):
                        outfile.write(re.sub('^',species_list[0],line))
                for line in annotation2:
                    if(not line.startswith('#!')):
                        outfile.write(re.sub('^',species_list[1],line))

SnakeMake From line 102 of rules/download_meta_mixed.smk

shell:
    "gunzip -c -d {input} > {output}"

SnakeMake From line 33 of rules/download_meta_single.smk

shell:
    "gunzip -d -c {input} > {output}"

SnakeMake From line 41 of rules/download_meta_single.smk

shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DigitalExpression -m {params.memory}\
    I={input.data}\
    O={output.dense}\
    EDIT_DISTANCE={params.umiBarcodeEditDistance}\
    OUTPUT_LONG_FORMAT={output.long}\
    STRAND_STRATEGY={params.strand_strategy}\
    OUTPUT_READS_INSTEAD=false\
    LOCUS_FUNCTION_LIST={{{params.locus_list}}}\
    MIN_BC_READ_THRESHOLD={params.count_per_umi}\
    CELL_BC_FILE={input.barcode_whitelist}"""

SnakeMake From line 25 of rules/extract_expression_single.smk

shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DigitalExpression -m {params.memory}\
    I={input.data}\
    O={output.dense}\
    EDIT_DISTANCE={params.umiBarcodeEditDistance}\
    OUTPUT_LONG_FORMAT={output.long}\
    STRAND_STRATEGY={params.strand_strategy}\
    OUTPUT_READS_INSTEAD=true\
    LOCUS_FUNCTION_LIST={{{params.locus_list}}}\
    MIN_BC_READ_THRESHOLD={params.count_per_umi}\
    CELL_BC_FILE={input.barcode_whitelist}"""

SnakeMake From line 53 of rules/extract_expression_single.smk

shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && SingleCellRnaSeqMetricsCollector -m {params.memory}\
    INPUT={input.data}\
    OUTPUT={output}\
    ANNOTATIONS_FILE={input.refFlat}\
    CELL_BC_FILE={input.barcode_whitelist}\
    RIBOSOMAL_INTERVALS={input.rRNA_intervals}
    """

SnakeMake From line 86 of rules/extract_expression_single.smk

script:
    '../scripts/plot_rna_metrics.R'

SnakeMake From line 102 of rules/extract_expression_single.smk

script:
    "../scripts/convert_mtx.py"

SnakeMake From line 115 of rules/extract_expression_single.smk

shell:
    """pigz -p {threads} {input.barcodes} {input.features} {input.mtx}"""

SnakeMake From line 129 of rules/extract_expression_single.smk

shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DigitalExpression -m {params.memory}\
    I={input.data}\
    O={output.dense}\
    EDIT_DISTANCE={params.umiBarcodeEditDistance}\
    OUTPUT_LONG_FORMAT={output.long}\
    STRAND_STRATEGY={params.strand_strategy}\
    OUTPUT_READS_INSTEAD=false\
    LOCUS_FUNCTION_LIST={{{params.locus_list}}}\
    MIN_BC_READ_THRESHOLD={params.count_per_umi}\
    CELL_BC_FILE={input.barcode_whitelist}"""

SnakeMake From line 24 of rules/extract_expression_species.smk

shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DigitalExpression -m {params.memory}\
    I={input.data}\
    O={output.dense}\
    EDIT_DISTANCE={params.umiBarcodeEditDistance}\
    OUTPUT_LONG_FORMAT={output.long}\
    STRAND_STRATEGY={params.strand_strategy}\
    OUTPUT_READS_INSTEAD=true\
    LOCUS_FUNCTION_LIST={{{params.locus_list}}}\
    MIN_BC_READ_THRESHOLD={params.count_per_umi}\
    CELL_BC_FILE={input.barcode_whitelist}"""

SnakeMake From line 53 of rules/extract_expression_species.smk

script:
    "../scripts/convert_mtx.py"

SnakeMake From line 74 of rules/extract_expression_species.smk

shell:
    """pigz -p {threads} {input.barcodes} {input.features} {input.mtx}"""

SnakeMake From line 88 of rules/extract_expression_species.smk

shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && SingleCellRnaSeqMetricsCollector -m {params.memory}\
    INPUT={input.data}\
    OUTPUT={output}\
    ANNOTATIONS_FILE={input.refFlat}\
    CELL_BC_FILE={input.barcode_whitelist}\
    RIBOSOMAL_INTERVALS={input.rRNA_intervals}
    """

SnakeMake From line 112 of rules/extract_expression_species.smk

script:
    '../scripts/plot_rna_metrics.R'

SnakeMake From line 127 of rules/extract_expression_species.smk

wrapper:
    '0.36.0/bio/fastqc'

SnakeMake FastQC From line 19 of rules/fastqc.smk

wrapper:
    '0.36.0/bio/fastqc'

SnakeMake FastQC From line 31 of rules/fastqc.smk

wrapper:
    '0.36.0/bio/multiqc'

SnakeMake MultiQC From line 41 of rules/fastqc.smk

wrapper:
    '0.36.0/bio/multiqc'

SnakeMake MultiQC From line 50 of rules/fastqc.smk

script:
    '../scripts/fa2tsv.py'

SnakeMake From line 59 of rules/fastqc.smk

shell:
    """cutadapt\
    --max-n {params.max_n}\
    -a file:{input.adapters}\
    -g file:{input.adapters}\
    -q {params.barcode_quality},{params.barcode_quality}\
    --minimum-length {params.barcode_length}\
    --cores={threads}\
    --overlap {params.cell_barcode_length}\
    -o {output.fastq} {input.R1}\
    {params.extra_params} > {log.qc}"""

SnakeMake From line 29 of rules/filter.smk

shell:
    """cutadapt\
    -a file:{input.adapters}\
    -g file:{input.adapters}\
    -q {params.read_quality}\
    --minimum-length {params.minimum_length}\
    --cores={threads}\
    --overlap {params.adapters_minimum_overlap}\
    -o {output.fastq} {input.R2}\
    {params.extra_params} > {log.qc}"""

SnakeMake From line 56 of rules/filter.smk

script:
    '../scripts/clean_cutadapt.py'

SnakeMake From line 73 of rules/filter.smk

shell:
    """repair.sh\
    -Xmx{params.memory}\
    in={input.R1}\
    in2={input.R2}\
    out1={output.R1}\
    out2={output.R2}\
    repair=t\
    threads={threads} 2> {log}"""

SnakeMake From line 89 of rules/filter.smk

script:
    '../scripts/detect_barcodes.py'

SnakeMake From line 105 of rules/filter.smk

script:
    '../scripts/plot_adapter_content.R'

SnakeMake From line 119 of rules/filter.smk

wrapper:
    '0.36.0/bio/multiqc'

SnakeMake MultiQC From line 128 of rules/filter.smk

wrapper:
    '0.36.0/bio/multiqc'

SnakeMake MultiQC From line 137 of rules/filter.smk

shell:
    """cat {input.annotation} | grep -E "{params.patterns}" > {output}"""

SnakeMake From line 22 of rules/generate_meta.smk

shell:
    """java -jar -Djava.io.tmpdir={params.temp_directory} {params.picard} CreateSequenceDictionary\
    REFERENCE={input}\
    OUTPUT={output}
    """

SnakeMake From line 36 of rules/generate_meta.smk

shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && ReduceGtf -m {params.memory}\
    GTF={input.annotation}\
    OUTPUT={output}\
    SEQUENCE_DICTIONARY={input.reference_dict}\
    IGNORE_FUNC_TYPE='null'\
    ENHANCE_GTF='false'"""

SnakeMake From line 52 of rules/generate_meta.smk

shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && ConvertToRefFlat -m {params.memory}\
    ANNOTATIONS_FILE={input.annotation}\
    OUTPUT={output}\
    SEQUENCE_DICTIONARY={input.reference_dict}
    """

SnakeMake From line 70 of rules/generate_meta.smk

shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && CreateIntervalsFiles -m {params.memory}\
    REDUCED_GTF={input.annotation_reduced}\
    SEQUENCE_DICTIONARY={input.reference_dict}\
    O={params.reference_directory}\
    PREFIX={params.prefix}
    """

SnakeMake From line 89 of rules/generate_meta.smk

run:
    """
    from math import log2
    from platform import system
    if (system() == 'Darwin'):
        genomeLength = shell("wc -c {} | cut -d' ' -f2".format(snakemake.reference_file), iterable=True)
    else:
        genomeLength = shell("wc -c {} | cut -d' ' -f1".format(snakemake.reference_file), iterable=True)
    genomeLength = int(next(genomeLength))
    referenceNumber = shell('grep "^>" {} | wc -l'.format(snakemake.reference_file), iterable=True)
    referenceNumber = int(next(referenceNumber))
    value = min([18,int(log2(genomeLength/referenceNumber))])
    """

SnakeMake From line 105 of rules/generate_meta.smk

script:
    '../scripts/prep_star.py'

SnakeMake From line 131 of rules/generate_meta.smk

shell:
    """mkdir -p {params.genomeDir}; STAR\
    --runThreadN {threads}\
    --runMode genomeGenerate\
    --genomeDir {params.genomeDir}\
    --genomeFastaFiles {input.reference_file}\
    --sjdbGTFfile {input.annotation_file}\
    --sjdbOverhang {params.sjdbOverhang}\
    --genomeChrBinNbits {params.genomeChrBinNbits}\
    --genomeSAsparseD 2
    """

SnakeMake From line 149 of rules/generate_meta.smk

wrapper:
    "0.27.1/bio/star/align"

SnakeMake From line 48 of rules/map.smk

wrapper:
    '0.36.0/bio/multiqc'

SnakeMake MultiQC From line 82 of rules/map.smk

shell:
    """pigz -p 4 {input}"""

SnakeMake From line 92 of rules/map.smk

script:
    '../scripts/merge_bam.py'

SnakeMake From line 108 of rules/map.smk

shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && TagReadWithGeneFunction -m {params.memory}\
    INPUT={input.data}\
    OUTPUT={output}\
    ANNOTATIONS_FILE={input.refFlat}
    """

SnakeMake From line 129 of rules/map.smk

shell:
    """
    export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DetectBeadSubstitutionErrors -m {params.memory}\
    I={input}\
    O={output.data}\
    OUTPUT_REPORT={output.report}\
    OUTPUT_SUMMARY={output.summary}\
    NUM_THREADS={threads}
    """

SnakeMake From line 149 of rules/map.smk

shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DetectBeadSynthesisErrors -m {params.memory}\
    INPUT={input}\
    OUTPUT={output}\
    OUTPUT_STATS={params.out_stats}\
    SUMMARY={params.summary}\
    NUM_BARCODES={params.barcodes}\
    PRIMER_SEQUENCE={params.SmartAdapter}\
    NUM_THREADS={threads}
    """

SnakeMake From line 173 of rules/map.smk

shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && BamTagHistogram -m {params.memory}\
    TAG=XC\
    I={input}\
    READ_MQ=10\
    O={output}
    """

SnakeMake From line 194 of rules/map.smk

script:
    '../scripts/plot_yield.R'

SnakeMake From line 217 of rules/map.smk

script:
    '../scripts/plot_knee_plot.R'

SnakeMake From line 230 of rules/map.smk

script:
    "../scripts/convert_mtx.py"

SnakeMake From line 17 of rules/merge.smk

script:
    '../scripts/plot_violine.R'

SnakeMake From line 46 of rules/merge.smk

script:
    '../scripts/create_summary_stats.R'

SnakeMake From line 61 of rules/merge.smk

script:
    "../scripts/publication_text.Rmd"

SnakeMake From line 16 of rules/report.smk

shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && FilterBam -m {params.memory}\
    REF_SOFT_MATCHED_RETAINED={params.species}\
    INPUT={input}\
    OUTPUT={output}"""

SnakeMake From line 17 of rules/split_species.smk

shell:
    """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && DigitalExpression -m {params.memory}\
    I={input.data}\
    O={output.umi_matrix}\
    SUMMARY={output.summary}\
    EDIT_DISTANCE={params.cellBarcodeEditDistance}\
    CELL_BC_FILE={input.barcode_whitelist}\
    LOCUS_FUNCTION_LIST={{{params.locus_list}}}\
    MIN_BC_READ_THRESHOLD={params.count_per_umi}"""

SnakeMake From line 38 of rules/split_species.smk

script: 
    '../scripts/plot_species_plot.R'

SnakeMake From line 58 of rules/split_species.smk

from collections import defaultdict
import re

def fill_results(snakemake,pair, adapter_results):
    adapter_pattern = re.compile(pattern="=== Adapter (.*) ===\n")
    with open(snakemake.input[pair], 'r') as logfile:
        line = logfile.readline()
        while(line):
            adapter_matched = re.findall(pattern=adapter_pattern,string=line)
            if(adapter_matched):
                logfile.readline()
                line = logfile.readline().rstrip('.\n')
                line_list = line.split(';')
                adapter_results[adapter_matched[0]]['Pair'] = pair
                adapter_results[adapter_matched[0]]['Sequence'] = line_list[0].split(':')[1].strip()
                if(adapter_matched[0] in adapter_results):
                    adapter_results[adapter_matched[0]]['Times'] += int(line_list[3].split(':')[1].split(' ')[1].strip())
                else:    
                    adapter_results[adapter_matched[0]]['Times'] += int(line_list[3].split(':')[1].split(' ')[1].strip())
            line = logfile.readline()
    return(adapter_results)

adapter_results_R1 = defaultdict(lambda :{'Pair':'','Sequence':'','Times':0})
adapter_results_R2 = defaultdict(lambda :{'Pair':'','Sequence':'','Times':0})


adapter_results_R1 = fill_results(snakemake, 'R1', adapter_results_R1)
adapter_results_R2 = fill_results(snakemake, 'R2', adapter_results_R2)

with open(snakemake.output[0],'w') as outfile:
    outfile.write('Adapter,Sequence,Pair,Count\n')
    for adapter in adapter_results_R1:
        outfile.write("{},{},{},{}\n".format(adapter, adapter_results_R1[adapter]['Sequence'],adapter_results_R1[adapter]['Pair'],adapter_results_R1[adapter]['Times']))
    for adapter in adapter_results_R2:
        outfile.write("{},{},{},{}\n".format(adapter, adapter_results_R2[adapter]['Sequence'],adapter_results_R2[adapter]['Pair'],adapter_results_R2[adapter]['Times']))

Python From line 1 of scripts/clean_cutadapt.py

import os
import subprocess


samples = snakemake.params['samples']
barcodes = {}
features = {}

out_folder = os.path.dirname(snakemake.output['mtx'])
out_barcodes = snakemake.output.barcodes
out_features = snakemake.output.features
mtx = snakemake.output.mtx
temp_mtx = os.path.join(out_folder,'temp_umi.mtx')
header = os.path.join(out_folder,'header.mtx')
n_lines=0
barcode_index = 1
feature_index = 1

with open(temp_mtx,'w') as mtx_stream:
	for i,sample in enumerate(snakemake.input):
		if samples[i] not in sample:
			sys.exit("Sample name not found in file path")
		with open(sample,'r') as input_file:
			next(input_file) # skip first line
			for line in input_file:
				barcode,feature,count = line.strip().split('\t')
				if(not isinstance(samples,str)):
					barcode = samples[i] + '_' + barcode
				if(barcode not in barcodes):
					barcodes[barcode] = barcode_index
					barcode_index += 1
				if(feature not in features):
					features[feature] = feature_index
					feature_index += 1
				mtx_stream.write('{} {} {}\n'.format(features[feature],barcodes[barcode],count))
				n_lines +=1

with open(out_barcodes,'w') as barcodes_outfile:
	for barcode in barcodes:
		barcodes_outfile.write('{}\n'.format(barcode))

with open(out_features,'w') as features_outfile:
	for feature in features:
		features_outfile.write('{}\n'.format(feature))

with open(header,'w') as header_outfile:
	header_outfile.write("%%MatrixMarket matrix coordinate real general\n")
	header_outfile.write('{} {} {}\n'.format(len(features), len(barcodes), n_lines))

subprocess.call("cat {} {} > {}".format(header, temp_mtx, mtx), shell=True)

os.remove(temp_mtx)
os.remove(header)

Python From line 5 of scripts/convert_mtx.py

debug_flag <- FALSE
# check if DEBUG flag is set
if (snakemake@config$DEBUG) {
  debug_flag <- TRUE
  message("In debug mode: saving R objects to inspect later")
  path_debug <- file.path(snakemake@config$LOCAL$results, "debug")
  dir.create(path_debug, showWarnings = FALSE)
  save(snakemake, file = file.path(path_debug, "create_summary_stats_snakemake.rdata"))
}
#### /debug

library(dplyr) # Dataframe manipulation
library(Matrix) # Sparse matrices
library(stringr)
library(RColorBrewer)
library(devtools)
library(Seurat)
library(plotly)

samples <- snakemake@params$sample_names
batches <- snakemake@params$batches


# importing Seurat object

seuratobj <- readRDS(file  = file.path(snakemake@input$R_objects))
meta.data <- seuratobj@meta.data

# subset only highest stamps as set in config.yaml
# this is necessary since there are more stamps selected as a safty margin which now have to be taken out again to calculate stats.

meta.data.sub <- meta.data %>%
  group_by(orig.ident) %>%
  arrange(desc(nCounts)) %>%
  slice(1:expected_cells[1]) %>% # makes sure only # expected cell are kept
  as.data.frame()

gini_index <- function (x, weights = rep(1, length = length(x))) {
    ox      <- order(x)
    x       <- x[ox]
    weights <- weights[ox] / sum(weights)
    p       <- cumsum(weights)
    nu      <- cumsum(weights * x)
    n       <- length(nu)
    nu      <- nu/nu[n]
    sum(nu[-1] * p[-n]) - sum(nu[-n] * p[-1])
}

#------------------------------------ post-filter-stats
# stats only based after keeping most abundant `expected-cells`
# taken out from seurat object generated in violine_plot rule.

# median calculator

stats_post <- meta.data.sub %>%
  group_by(orig.ident) %>%
  summarise(
    Total_nb_reads                 = sum(nCounts),
    Nb_STAMPS                      = mean(expected_cells), # should be all the same anyway..
    Median_reads_per_STAMP         = round(median(nCounts), 2),
    Mean_reads_per_STAMP           = round(mean(nCounts), 2),
    Total_nb_UMIs                  = sum(nUMI),
    Median_UMIs_per_STAMP          = round(median(nUMI), 2),
    Mean_UMIs_per_STAMP            = round(mean(nUMI), 2),
    Mean_UMIs_per_Gene             = round(mean(umi.per.gene), 2),
    Median_number_genes_per_STAMP  = round(median(nGene), 2),
    Mean_number_genes_per_STAMP    = round(mean(nGene), 2),
    Mean_Ribo_pct                  = round(100 * mean(pct.Ribo), 2),
    Mean_Mito_pct                  = round(100 * mean(pct.mito), 2),
    Mean_Count_per_UMI             = round(sum(nCounts) / sum(nUMI), 2),
    Read_length                    = mean(read_length), # should be all the same anyway..
    Number_barcodes_used_for_debug = n()
  ) %>%
  as.data.frame()

row.names(stats_post) <- stats_post$orig.ident



# highest, lowest count/UMI Stamp
# pre STAMP stats

# hist out goes into knee plots
# 'results/logs/{sample}_hist_out_cell.txt'
# """export _JAVA_OPTIONS=-Djava.io.tmpdir={params.temp_directory} && BAMTagHistogram -m {params.memory}\
# TAG=XC\
# https://hpc.nih.gov/apps/dropseq.html
# there is not hint in documentation on any filtering (only read quality)

#------------------------------------ pre-filter-stats
# calculating statistics based on barcodes before thresholding it (i.e. keeping the most abudant barcodes based on `expected cells`)
# This is based on 'logs/{sample}_hist_out_cell.txt' generated by `BAMTagHistogram` from dropseq-tools
# TOOO: The documentation only mentions duplicate and quality filter. But it seems do more filtering since most barcodes are expected to have only one read assinged but there are usually more. Find out.
# https://hpc.nih.gov/apps/dropseq.html

# TODO: Nr UMI for pre filter

stats_pre <- data.frame(matrix(nrow = length(samples), ncol = 10))
colnames(stats_pre) <- c(
  "Sample",
  "Batch",
  "Total_raw_reads",
  "Nr_barcodes_total",
  "Nr_barcodes_more_than_1_reads",
  "Nr_barcodes_more_than_10_reads",
  "percentile99",
  "percentile95",
  "percentile50",
  "Gini-index"
)

stats_pre[, "Sample"] <- samples
stats_pre[, "Batch"]  <- batches


for (i in 1:length(samples)) {
  # importing 'logs/{sample}_hist_out_cell.txt'
  hist_out <- read.table(
    file = snakemake@input$hist_cell[i],
    header = FALSE, stringsAsFactors = FALSE
  )
  mysample <- samples[i]
  reads <- hist_out$V1
  barcodes <- hist_out$V2
  # calculations on reads
  # total reads are not sum(reads)! Needs to be taken from
  # results/logs/cutadapt/sample1_R1.qc.txt
  #
  # read in full text file "sample_R2.qc.txt"
  filedump <- readLines(snakemake@input$R2qc[i])
  # subset line matching a pattern
  total_reads <- filedump[grep("Total reads processed:", filedump)] %>% #extract line
    str_extract("[0-9,]+") %>% # extract number from line
    str_replace_all(",", "") %>% # delete comma for subsequent numeric casting
    as.numeric()
  reads_cumsum      <- cumsum(reads)
  reads_cumsum_perc <- (reads_cumsum / sum(reads))
  # reporting stats
  stats_pre[i, "Total_raw_reads"]                <- total_reads
  stats_pre[i, "Reads_assigned_to_expected_STAMPs"] <- sum(reads[1:stats_post$Nb_STAMPS[i]])
  stats_pre[i, "Nr_barcodes_total"]              <- length(barcodes)
  stats_pre[i, "percentile99"]                   <- which.min(reads_cumsum_perc < 0.99)
  stats_pre[i, "percentile95"]                   <- which.min(reads_cumsum_perc < 0.95)
  stats_pre[i, "percentile50"]                   <- which.min(reads_cumsum_perc < 0.50)
  stats_pre[i, "Nr_barcodes_more_than_1_reads"]  <- sum(reads > 1)
  stats_pre[i, "Nr_barcodes_more_than_10_reads"] <- sum(reads > 10)
  stats_pre[i, "Gini-index"]                     <- round(gini_index(reads), 2)
  expected_cells <- as.numeric(filter(stats_post, orig.ident==mysample) %>% select(Nb_STAMPS))
  # % of reads left after applying expected_cells cuttoff
  stats_post[mysample, "Pct_reads_after_filter_expected_cells"] <-
    round(100 * (
                 reads_cumsum[expected_cells] / sum(reads)
    ), 2)
  # % of reads left after applying all filters including mapping etc.
  # Thats the effecive usable reads of the sequencing run
  stats_post[mysample, "Pct_reads_after_filter_everything"] <-
    round(100 * (
                 filter(stats_post, orig.ident==mysample) %>% select(Total_nb_reads) /
                 stats_pre [i, "Total_raw_reads"]), 2
         )
}

stats_pre <- stats_pre %>%
  arrange(Sample)

stats_post <- stats_post %>%
  arrange((orig.ident))

# output
write.csv(stats_pre,  file.path(snakemake@output$stats_pre))
write.csv(stats_post, file.path(snakemake@output$stats_post)) # writes table for excel

if (debug_flag) {
  save.image(file = file.path(path_debug, "create_summary_stats_workspace.rdata"))
}

R dplyr stringr Seurat RColorBrewer Matrix devtools plotly From line 22 of scripts/create_summary_stats.R

from Bio import SeqIO
import gzip
from collections import Counter

fastq_parser = SeqIO.parse(gzip.open(snakemake.input.R1, "rt"), "fastq")
sequences = []
n=0
for fastq_R1 in fastq_parser:
		sequences.append(str(fastq_R1.seq))
		n+=1
		if(n==10000000):
			break
def parse_barcodes(fastq_parser):
	counts={}
	ranges = range(5,len(sequences[0]))

	for cell_bc_length in ranges:
		counts[cell_bc_length] = list()
		for fastq_R1 in sequences:
			counts[cell_bc_length].append(fastq_R1[0:cell_bc_length])
	return(counts)



counts = parse_barcodes(fastq_parser)

with open(snakemake.output[0], "w") as outfile:
	outfile.write('bc_length,first_counts\n')
	for cell_bc_length in counts:
		outfile.write('{},{}\n'.format(cell_bc_length,str(Counter(counts[cell_bc_length]).most_common(100)[0][1])))

Python Biopython From line 1 of scripts/detect_barcodes.py

import sys
from Bio import SeqIO

number_bp = 12

with open( snakemake.output['tsv'], "w" ) as output:
    for seq_record in SeqIO.parse(snakemake.input['fa'], "fasta"):
        myline = (str(seq_record.id)) + "\t" + str(seq_record.seq[0:(number_bp + 1)]) + "\n"
        output.write(myline)

Python Biopython From line 9 of scripts/fa2tsv.py

from itertools import combinations, product
from collections import defaultdict
from copy import deepcopy
import pickle
from shutil import copyfile


def save_obj(obj, name):
    with open(name, 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def generate_all(barcode, reference, mapping, edit_distance):
    mutants = generate_mutants(barcode, edit_distance)
    for mutant in mutants:
        if(mutant not in reference):
            reference.add(mutant)
            mapping[edit_distance][mutant]['ref'] = barcode
            mapping[edit_distance][mutant]['count'] = 0
            mapping[edit_distance][mutant]['lanes'] = {'1':0,'2':0,'3':0,'4':0,'5':0,'6':0,'7':0,'8':0}

    mapping['unknown']=defaultdict()
    return(reference, mapping)


def generate_mutants(sequence, d=1):
    """Taken from stackoverflow: https://stackoverflow.com/a/19823295/9178565"""
    N = len(sequence)
    letters = 'ACGTN'
    pool = list(sequence)
    for indices in combinations(range(N), d):
        for replacements in product(letters, repeat=d):
            skip = False
            for i, a in zip(indices, replacements):
                if pool[i] == a: skip = True
            if skip: continue

            keys = dict(zip(indices, replacements))
            yield ''.join([pool[i] if i not in indices else keys[i] 
                           for i in range(N)])

# Create empty sets and defaultdicts
barcode_ref = set()
mapping=defaultdict(dict)

# Initiate ref and mapping with the given barcodes
with open(snakemake.input.whitelist,'r') as ref_file:
    for line in ref_file.readlines():
        barcode = line.strip()
        barcode_ref.add(barcode)
        mapping[0][barcode]=defaultdict(dict)
        mapping[0][barcode]['ref'] = barcode
        mapping[0][barcode]['count'] = 0
        mapping[0][barcode]['lanes'] = {'1':0,'2':0,'3':0,'4':0,'5':0,'6':0,'7':0,'8':0}

barcode_ext_ref = deepcopy(barcode_ref)
# For now edit distance is one, but can be extended to a higher number later on.
max_edit_distance = 1
for edit_distance in range(1,max_edit_distance+1):
    mapping[edit_distance]=defaultdict(dict)
    for barcode in mapping[0]:
        (barcode_ext_ref,mapping) = generate_all(barcode, barcode_ext_ref, mapping, edit_distance)

# Delete given barcodes out of new reference. This helps later on when running "repair_barcodes.py"
barcode_ref = set(mapping[0])
barcode_ext_ref.difference_update(barcode_ref)

# Save mapping and references to reuse later.
save_obj(obj=mapping, name=snakemake.output.barcode_mapping)
save_obj(obj=barcode_ref,name=snakemake.output.barcode_ref)
save_obj(obj=barcode_ext_ref,name=snakemake.output.barcode_ext_ref)


copyfile(snakemake.input['whitelist'], snakemake.output['barcodes'])

Python From line 1 of scripts/generate_extended_ref.py

import pysam
import re
import csv
from Bio import SeqIO
import gzip
from collections import defaultdict
import sys

#This function fills in a dict with readids
#and their corresponding cell and umi barcodes until it finds
#a specific read id

discard_secondary_alignements = snakemake.params['discard_secondary_alignements']

barcodes_struct = {
	'BC_start':snakemake.params['BC_start'],
	'BC_end':snakemake.params['BC_end'],
	'UMI_start':snakemake.params['UMI_start'],
	'UMI_end':snakemake.params['UMI_end']
	}

def parse_barcodes(fastq_parser, query_name, read_barcodes, barcodes_struct):
	for fastq_R1 in fastq_parser:
		# Some sequencers give a /1 and /2 to R1 and R2 read ids respectively. This attempts to solve the issue #69.
		if '/' in fastq_R1.id:
			R1_id = fastq_R1.id[:fastq_R1.id.find("/")]
		else:
			R1_id = fastq_R1.id
		read_barcodes[R1_id]['XC'] = str(fastq_R1.seq)[barcodes_struct['BC_start']:barcodes_struct['BC_end']]
		read_barcodes[R1_id]['XM'] = str(fastq_R1.seq)[barcodes_struct['UMI_start']:barcodes_struct['UMI_end']]
		if(read_barcodes[R1_id]['XM']==''):
			sys.SystemExit('UMI empty for read {}.\n The barcode is: {}.\nWhole entry is:{}'.format(R1_id, fastq_R1.seq,fastq_R1))
		if (R1_id == query_name):
			return(fastq_parser,read_barcodes)
	return(fastq_parser,read_barcodes)

infile_bam = pysam.AlignmentFile(snakemake.input[0], "rb")

fastq_parser = SeqIO.parse(gzip.open(snakemake.input[1], "rt"), "fastq")

outfile = pysam.AlignmentFile(snakemake.output[0], "wb", template=infile_bam)

read_barcodes = defaultdict(lambda :{'XC':'','XM':''})

for bam_read in infile_bam:
	if(discard_secondary_alignements & bam_read.is_secondary):
		continue
	if (bam_read.query_name) in read_barcodes:
		current_barcodes = read_barcodes.pop(bam_read.query_name)
		tags = bam_read.get_tags()
		tags.extend([
			('XC', current_barcodes['XC'],'Z'),
			('XM', current_barcodes['XM'],'Z')])
		bam_read.set_tags(tags)
	else:
		fastq_parser,read_barcodes = parse_barcodes(fastq_parser, bam_read.query_name, read_barcodes, barcodes_struct)
		if (bam_read.query_name) not in read_barcodes:
			raise SystemExit('Read {} from mapped file is missing in reference fastq file!'.format(bam_read.query_name))
			os.remove(snakemake.output[0])
		current_barcodes = read_barcodes.pop(bam_read.query_name)
		tags = bam_read.get_tags()
		tags.extend([
			('XC', current_barcodes['XC'],'Z'),
			('XM', current_barcodes['XM'],'Z')])
		bam_read.set_tags(tags)
	outfile.write(bam_read)

Python Biopython pysam From line 1 of scripts/merge_bam.py

debug_flag <- FALSE
# check if DEBUG flag is set
if (snakemake@config$DEBUG) {
  debug_flag <- TRUE
  message("In debug mode: saving R objects to inspect later")
  path_debug <- file.path(snakemake@config$LOCAL$results, "debug")
  dir.create(path_debug, showWarnings = FALSE)
  save(snakemake, file = file.path(path_debug, "plot_adapter_content_snakemake.rdata"))
}

#------------------------------------ debugging

library(ggplot2)
library(dplyr)
library(viridis)

samples <- snakemake@params$sample_names
batches <- snakemake@params$batches

#Read files into a list
cutadapt_clean_list <- list()
for (i in seq_along(samples)){
  cutadapt_clean           <- read.csv(snakemake@input[[i]][1], header = TRUE)
  cutadapt_clean$Sample    <- samples[i]
  cutadapt_clean$Batch     <- batches[i]
  cutadapt_clean_list[[i]] <- cutadapt_clean
}

# combining adaptors accross samples
cutadapt_counts <- Reduce(rbind, cutadapt_clean_list, NULL)

#Transform it into percentages
cutadapt_counts <- group_by(cutadapt_counts, Sample, Pair) %>%
  mutate(Percentages=Count/sum(Count))
#      Adapter                           Sequence Pair Count  Sample  Batch
# 1 PrefixNX/1                AGATGTGTATAAGAGACAG   R1     7 sample1 Batch1
# ...
# 6  Trans2_rc CTGTCTCTTATACACATCTCCGAGCCCACGAGAC   R2     5 sample2 Batch2

p1 <- ggplot(cutadapt_counts, aes(x=Sample, y = Percentages, fill = Adapter))  +
  geom_bar(stat = "identity") +
  facet_grid(Pair ~ Batch, scales = "free") +
  theme_minimal() +
  ggtitle("Comparison accross samples of adapter content") +
  scale_x_discrete(label=abbreviate) +
  scale_y_continuous(labels = scales::percent) +
  theme(axis.text.x=element_text(angle = 90, hjust = 0)) +
  scale_fill_viridis(discrete=TRUE)

ggsave(plot=p1, filename=snakemake@output$pdf)

if (debug_flag) {
  save.image(file = file.path(path_debug, "plot_adapter_content_workspace.rdata"))
}

R ggplot2 dplyr viridis From line 7 of scripts/plot_adapter_content.R

debug_flag <- FALSE
if (snakemake@config$DEBUG) {
  debug_flag <- TRUE
  message("In debug mode: saving R objects to inspect later")
  path_debug <- file.path(snakemake@config$LOCAL$results, "debug")
  dir.create(path_debug, showWarnings = FALSE)
  save(snakemake, file = file.path(path_debug,
     paste0("plot_knee_plot_snakemake_",
            attr(snakemake, "wildcard")$sample, ".rdata"))
          )
}
#### /debug

library(ggplot2)
library(plyr)
# Create the cumulative plot
data=read.table(file = snakemake@input[[1]][1], header=FALSE, stringsAsFactors=FALSE)
barcodes = data$V2
total_reads = sum(data$V1)
y_raw=cumsum(data$V1)
y=(y_raw/total_reads)
x = 1:length(y)
plot_data = data.frame(rank = x,cum_sum=y, Barcode=data$V2)
x_scale = snakemake@params$cells * 4

knee_plot <- ggplot(plot_data, aes(x=rank, y=cum_sum)) +
 geom_point(size = 0.1) +
 xlim(0,x_scale) +
 geom_vline(xintercept=snakemake@params$cells, linetype="dashed", color = "red") +
 ggtitle(paste0(snakemake@wildcards$sample, '\nTotal reads: ', prettyNum(total_reads))) +
 theme(plot.title = element_text(size=10)) +
 labs(x='STAMPS', y='Cumulative fraction of reads') +
 scale_y_continuous(labels = scales::percent) +
 theme_classic()

if(!is.null(snakemake@input$barcodes))
{
  selected_cells <- read.csv(snakemake@input$barcodes, header=FALSE, stringsAsFactors=FALSE)
  knee_plot <- knee_plot +
    geom_point(data = plot_data[plot_data$Barcode %in% selected_cells$V1,],
               aes(x=rank, y=cum_sum, color='Selected'), size=0.1) +
    scale_color_manual(values=c('Selected'='green'))
}
ggsave(knee_plot, file=snakemake@output$pdf, width = 4, height = 3)


if (debug_flag) {
  library(gridExtra)
  library(grid)
  # potential usefull to change lag of diff calculation
  mylag <- 1
  # data <- read.table(file = file, header=FALSE, stringsAsFactors=FALSE)
  # head(data,3)
  #        V1           V2
  # 1 1145137 CCCTTCGTCTGC
  # 2 1039974 ATAGTTTTTTAA
  # 3  912199 GCATGAAACTTC

  # borrowed from https://stackoverflow.com/questions/6836409/finding-local-maxima-and-minima
  localMaxima <- function(x) {
    # Use -Inf instead if x is numeric (non-integer)
    y <- diff(c(-.Machine$integer.max, x)) > 0L
    rle(y)$lengths
    y <- cumsum(rle(y)$lengths)
    y <- y[seq.int(1L, length(y), 2L)]
    if (x[[1]] == x[[2]]) {
      y <- y[-1]
    }
    y
  }

  reads        <- data$V1
  barcodes     <- data$V2
  total_reads  <- sum(reads)
  reads_cumsum <- cumsum(reads)
  # 1st dervivative (diff) needs also to be padded to keep same vector length
  reads_diff   <- c(diff(reads,lag=mylag,differences = 1),rep(0,mylag))

  # 2nd derivative: twice as much padding:
  reads_diff_diff   <- c(diff(reads,lag=mylag,differences = 2),rep(0,mylag*2))
  reads_cumsum_perc <- (reads_cumsum/total_reads)
  x                 <- 1:length(reads_cumsum_perc)
  plot_data         <- data.frame(rank       = x,
                                  cum_sum    = reads_cumsum_perc,
                                  read_count = reads,
                                  Barcode    = data$V2,
                                  diff       = reads_diff,
                                  diffdiff   = reads_diff_diff)

  # head(plot_data,6)
  #   rank     cum_sum read_count      Barcode    diff diffdiff
  # 1    1 0.007192916    1145137 CCCTTCGTCTGC -105163   -22612
  # 2    2 0.013725274    1039974 ATAGTTTTTTAA -127775    14025
  # 3    3 0.019455043     912199 GCATGAAACTTC -113750    61486
  # 4    4 0.024470318     798449 GTGTGGGTCTCT  -52264    34985
  # 5    5 0.029157308     746185 CGTACTGACTAC  -17279   -60441
  # 6    6 0.033735764     728906 GTTCGTCCCGCC  -77720    69104
  mystats <- paste0("| Nr barcodes total: ", length(barcodes), ' \n ',
                    "Nr barcodes for 50% reads: ", which.min(reads_cumsum_perc<0.50)," | ",
                    "Nr barcodes for 95% reads: ", which.min(reads_cumsum_perc<0.95)," | ",
                    "Nr barcodes for 99% reads: ", which.min(reads_cumsum_perc<0.99)
                    )

  # x_scale        <- which(reads_cumsum_perc>0.99)[1]
  x_scale        <- snakemake@params$cells * 4
  plot_data_head <- head(plot_data, x_scale)
  # plot_data_head$reads_diff_smooth <- predict(loess(diff~rank,data=plot_data_head))
  plot_data_head$reads_diffdiff_smooth <- predict(loess(diffdiff~rank,span=0.2,data=plot_data_head))

  ## Finding knee in knee-plot:
  # Best approach so far is to calclate the 2nd derivative of the read counts per STAMP (reads_diff_diff), smooth it (loess) and find it's maxima. Since ther can be several maxima, they are just ploted and it's up to the user to visually asses and decide.
  # finding local maxima in 2nd derivative:
  #https://stackoverflow.com/questions/6836409/finding-local-maxima-and-minima
  loc.max <- localMaxima(plot_data_head$reads_diffdiff_smooth)
  # local maxima are then colored green as lines in plots
  plot_data_head_sub <- plot_data_head[loc.max,]
  # which.peaks(plot_data_head$reads_diff_smooth2)

  knee_plot_ext <- ggplot(plot_data_head, aes(x=rank, y=cum_sum)) +
    xlim(0,x_scale) +
    ylim(0,1) +
    geom_text(data=plot_data_head_sub,aes(label = round(cum_sum,2)),nudge_y=-0.05, vjust = "inward", hjust = "inward") +
    geom_vline(xintercept=snakemake@params$cells, linetype="dashed", color = "red") +
    #   geom_vline(xintercept=100, linetype="dashed", color = "red") +
    geom_vline(xintercept=loc.max, col="lightgreen") +
    geom_point(size = 0.1)  +
    ggtitle(paste0(snakemake@wildcards$sample, '\nTotal reads: ', prettyNum(total_reads), mystats)) +
    theme(plot.title = element_text(size=10)) +
    labs(x='STAMPS', y='Cumulative fraction of reads')
  read_count_plot <- ggplot(plot_data_head, aes(x=rank, y=read_count)) +
    geom_text(data=plot_data_head_sub,aes(label = read_count),nudge_y=-0.05, vjust = "inward", hjust = "inward", check_overlap = TRUE) +
    #   geom_smooth() +
    xlim(0,x_scale) +
    #   ylim(0,1000) +
    geom_vline(xintercept=snakemake@params$cells, linetype="dashed", color = "red") +
    #   geom_vline(xintercept=100, linetype="dashed", color = "red") +
    geom_vline(xintercept=loc.max, col="lightgreen") +
    geom_point(size = 0.1)  +
    theme(plot.title = element_text(size=10)) +
    labs(x='STAMPS', y='Read counts per STAMP')

  # knee_plot_ext = knee_plot_ext + scale_y_continuous(labels = scales::percent)
  diff_plot <- ggplot(plot_data_head, aes(x=rank, y=diff)) +
    geom_text(data=plot_data_head_sub,aes(label = diff),nudge_y=-0.05, vjust = "inward", hjust = "inward", check_overlap = TRUE) +
    #   geom_smooth() +
    xlim(0,x_scale) +
    #   ylim(0,1000) +
    geom_vline(xintercept=snakemake@params$cells, linetype="dashed", color = "red") +
    #   geom_vline(xintercept=100, linetype="dashed", color = "red") +
    geom_vline(xintercept=loc.max, col="lightgreen") +
    geom_point(size = 0.1)  +
    theme(plot.title = element_text(size=10)) +
    #1st derivative read count diff to next STAMP")
    labs(x='STAMPS', y="1st derivative of read counts")
  diff_diff_plot <- ggplot(plot_data_head,
                           aes(x=rank, y=reads_diffdiff_smooth)) +
    geom_text(data=plot_data_head_sub,
              aes(label = rank),nudge_y=-1,
              vjust = "inward",
              hjust = "inward",
              check_overlap = TRUE) +
    xlim(0,x_scale) +
    ylim(-30,30) +
    geom_vline(xintercept=snakemake@params$cells,
               linetype="dashed", color = "red") +
    #   geom_vline(xintercept=100, linetype="dashed", color = "red") +
    geom_vline(xintercept=loc.max, col="lightgreen") +
    geom_point(size = 0.1)  +
    theme(plot.title = element_text(size=10)) +
    labs(x='STAMPS', y='2nd derivative of read counts')

  if(!is.null(snakemake@input$barcodes))
  {
    selected_cells <- read.csv(snakemake@input$barcodes, header=FALSE, stringsAsFactors=FALSE)
    knee_plot_ext <- knee_plot_ext +
      geom_point(data = plot_data_head[plot_data_head$Barcode %in% selected_cells$V1,],
                 aes(x=rank, y=cum_sum, color='Selected'), size=0.1) +
      scale_color_manual(values=c('Selected'='green')) +
      theme(legend.position="none")
    diff_diff_plot <- diff_diff_plot +
      geom_point(data = plot_data_head[plot_data_head$Barcode %in% selected_cells$V1,],
                 aes(x=rank, y=reads_diffdiff_smooth, color='Selected'), size=0.1) +
      scale_color_manual(values=c('Selected'='green')) +
      theme(legend.position="none")
  }
  #   scale_y_continuous(position = "right")
  gp1 <- ggplotGrob(knee_plot_ext)
  gp2 <- ggplotGrob(read_count_plot)
  gp3 <- ggplotGrob(diff_plot)
  gp4 <- ggplotGrob(diff_diff_plot)
  # grid::grid.newpage()
  # gg <- grid::grid.draw(rbind(gp1, gp2, gp3, gp4, size = "last"))
  gg <- gridExtra::arrangeGrob(rbind(gp1, gp2, gp3, gp4, size = "last"))
  # gg <- gridExtra::arrangeGrob(gp1, gp2, gp3, gp4, ncol = 1)

  # if barcode.csv is present in base directory, only use barcodes in there (rule plot_knee_plot_whitelist in map.smk)
  ggsave(gg, file=paste0(snakemake@output$pdf, "_extended.pdf"), width = 9, height = 11)
}

if (debug_flag) {
  save.image(file = file.path(path_debug,
       paste0("plot_knee_plot_workspace_",
              attr(snakemake, "wildcard")$sample, ".rdata"))
            )
}

R ggplot2 gridExtra plyr From line 7 of scripts/plot_knee_plot.R

library(ggplot2)
library(tidyr)
library(gridExtra)
library(grid)
library(viridis)
debug_flag <- FALSE
if (snakemake@config$DEBUG) {
  debug_flag <- TRUE
  message("In debug mode: saving R objects to inspect later")
  path_debug <- file.path(snakemake@config$LOCAL$results, "debug")
  dir.create(path_debug, showWarnings = FALSE)
  save(snakemake, file = file.path(path_debug, "plot_rna_metrics_snakemake.rdata"))
}

#### /debug

mydata <- read.csv(file = snakemake@input$rna_metrics, header = T,
                   stringsAsFactors = F, skip = 6, sep = "\t")
mydata <- mydata[order(mydata$PF_ALIGNED_BASES, decreasing = T), ]
mydata_pct <- mydata[, c("READ_GROUP",
                         "PCT_INTERGENIC_BASES",
                         "PCT_UTR_BASES",
                         "PCT_RIBOSOMAL_BASES",
                         "PCT_INTRONIC_BASES",
                         "PCT_CODING_BASES")
                     ]
colnames(mydata_pct) = c('Cell Barcode', 'Intergenic', 'UTR', 'Ribosomial', 'Intronic', 'Coding')

mydata <- mydata[, c("READ_GROUP",
                     "INTERGENIC_BASES",
                     "UTR_BASES",
                     "RIBOSOMAL_BASES",
                     "INTRONIC_BASES",
                     "CODING_BASES")
                 ]
colnames(mydata) = c('Cell Barcode', 'Intergenic', 'UTR', 'Ribosomial', 'Intronic', 'Coding')

# converting into long format for ploting
mydata_long <- mydata %>% gather("Read Overlap", count, -"Cell Barcode")

# Keep the original order of the barcodes using factor and levels.
mydata_long$`Cell Barcode` <- factor(mydata_long$`Cell Barcode`,
                                 levels = factor(unique(mydata_long$`Cell Barcode`)))
mydata_long$`Read Overlap` <- factor(mydata_long$`Read Overlap`,
                                   levels = unique(mydata_long$`Read Overlap`))

p1 <- ggplot(mydata_long, aes(x = `Cell Barcode`, y = count, fill = `Read Overlap`)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 0), legend.position = "none")
p1 <- p1 + labs(title = paste(nrow(mydata),
                              "selected barcodes for",
                              snakemake@wildcards$sample),
                x = "Barcodes", y = "Bases")
p1 <- p1 + theme(axis.title.x = element_blank(),
                 axis.text.x = element_blank(),
                 axis.ticks.x = element_blank())
p1 <- p1 + scale_y_continuous(labels = scales::scientific)
p1 <- p1 + scale_fill_viridis(discrete = TRUE, option = "viridis")


mydata_long_pct <- mydata_pct %>% gather("Read Overlap", fraction, -"Cell Barcode")
# Keep the original order of the barcodes using factor and levels.
mydata_long_pct$`Cell Barcode` <- factor(mydata_long_pct$`Cell Barcode`,
                                     levels = factor(unique(mydata_long_pct$`Cell Barcode`)))
mydata_long_pct$`Read Overlap` <- factor(mydata_long_pct$`Read Overlap`,
                                       levels = unique(mydata_long_pct$`Read Overlap`))

p2 <- ggplot(mydata_long_pct, aes(x = `Cell Barcode`, y = fraction, fill = `Read Overlap`)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 0, size=8, vjust = 0.05), legend.position = "bottom") +
  labs(x = "Barcodes", y = "%Bases") +
  scale_y_continuous(labels = scales::percent) + scale_fill_viridis(discrete = TRUE, option = "viridis")
# This allows to align the main plots so that we can relate both directly with the label from the bottom one.
gp1 <- ggplotGrob(p1)
gp2 <- ggplotGrob(p2)
pdf(file = snakemake@output$pdf, width = 16, height = 13)
grid::grid.newpage()
grid::grid.draw(rbind(gp1, gp2, size = "last"))
dev.off()

if (debug_flag) {
  save.image(file = file.path(path_debug, "plot_rna_metrics_workspace.rdata"))
}

R ggplot2 tidyr gridExtra viridis From line 1 of scripts/plot_rna_metrics.R

debug_flag <- FALSE
if (snakemake@config$DEBUG) {
  debug_flag <- TRUE
  message("In debug mode: saving R objects to inspect later")
  path_debug <- file.path(snakemake@config$LOCAL$results, "debug")
  dir.create(path_debug, showWarnings = FALSE)
  save(snakemake, file = file.path(
    path_debug,
    paste0("plot_species_plot_snakemake_", attr(snakemake, "wildcard")$sample, ".rdata")
  ))
}

#### /debug

categorizeCellsUsingKneeKnownNumCellsPaper <- function(
                                                   digitalExpressionFileO1,
                                                   digitalExpressionFileO2,
                                                   organismOne,
                                                   organismTwo,
                                                   pureRatio = 0.2,
                                                   numCells,
                                                   numBeads,
                                                   point.cex = 1.5,
                                                   xlim_range = NULL,
                                                   ylim_range = NULL,
                                                   category = "transcripts") {
  dfFull <- getNumTranscriptsPerCellBarcodeByOrganismPair(
                                                    digitalExpressionFileO1,
                                                    digitalExpressionFileO2,
                                                    organismOne,
                                                    organismTwo,
                                                    category)
  dfFull <- dfFull[order(dfFull$total, decreasing = T), ]
  dfFull$ratio_one <- dfFull[, 2] / dfFull[, 4]
  dfFull <- head(dfFull, n = numBeads)
  df <- head(dfFull, n = numCells)

  dfNoCall <- dfFull[-1:-numCells, ]
  if (dim(dfNoCall)[1] > 0) {
    dfNoCall$organism <- "No Call"
  }

  df$organism <- "Mixed"

  idx <- which(df$ratio_one >= (1 - pureRatio))
  # checks if the species is actually assigned at all
  if (length(idx) > 0) {
    df[idx, ]$organism <- organismOne
  }
  idx <- which(df$ratio_one <= (pureRatio))
  if (length(idx) > 0) {
    df[idx, ]$organism <- organismTwo
  }

  result <- rbind(df, dfNoCall)

  maxRange <- max(result[, 2], result[, 3])

  dforganismOne <- result[result$organism == organismOne, ]
  dforganismTwo <- result[result$organism == organismTwo, ]
  dfMixed <- result[result$organism == "Mixed", ]
  dfNoCall <- result[result$organism == "No Call", ]

  if (is.null(xlim_range)) {
    xlim_range <- c(0, maxRange)
  }

  if (is.null(ylim_range)) {
    ylim_range <- c(0, maxRange)
  }
  colors <- c("blue", "red", "purple", "grey")
  plot(dforganismOne[, 2], dforganismOne[, 3], col = colors[1],
       pch = 16, xlim = xlim_range, ylim = ylim_range,
       xlab = paste(organismOne, category),
       ylab = paste(organismTwo, category),
       cex = point.cex)
  points(dforganismTwo[, 2], dforganismTwo[, 3],
         col = colors[2], pch = 16, cex = point.cex)
  points(dfMixed[, 2], dfMixed[, 3],
         col = colors[3], pch = 16, cex = point.cex)
  points(dfNoCall[, 2], dfNoCall[, 3],
         col = colors[4], pch = 16, cex = point.cex)
  l <- c(paste(organismOne, dim(dforganismOne)[1]),
         paste(organismTwo, dim(dforganismTwo)[1]),
         paste("Mixed", dim(dfMixed)[1]),
         paste("No Call", dim(dfNoCall)[1]))
  legend("topright", legend = l, fill = colors)
  title(paste("Species plot based on", category))
  return(df)
}

getNumTranscriptsPerCellBarcodeByOrganismPair <- function(
                                                      digitalExpressionFileO1,
                                                      digitalExpressionFileO2,
                                                      organismOne,
                                                      organismTwo,
                                                      category) {
  if (is.null(organismOne) || is.null(organismTwo)) {
    return(NULL)
  }

  o1 <- getGenesAndTranscriptsPerCellBarcode(digitalExpressionFileO1)
  o2 <- getGenesAndTranscriptsPerCellBarcode(digitalExpressionFileO2)

  commonBC <- union(o1$cellBC, o2$cellBC)
  o1p <- o1[match(commonBC, o1$cellBC), ]
  o2p <- o2[match(commonBC, o2$cellBC), ]
  if (category == "genes") {
    df <- data.frame(tag = commonBC, o1Count = o1p$numGenes,
                     o2Count = o2p$numGenes, stringsAsFactors = F)
  }
  else {
    df <- data.frame(tag = commonBC, o1Count = o1p$numTranscripts,
                     o2Count = o2p$numTranscripts, stringsAsFactors = F)
  }

  idx1 <- which(is.na(df$o1Count))
  idx2 <- which(is.na(df$o2Count))
  if (length(idx1) > 0) df[idx1, ]$o1Count <- 0
  if (length(idx2) > 0) df[idx2, ]$o2Count <- 0

  df$total <- apply(df[, 2:3], 1, sum, na.rm = T)
  df <- df[order(df$total, decreasing = T), ]
  colnames(df)[2] <- organismOne
  colnames(df)[3] <- organismTwo
  return(df)
}


getGenesAndTranscriptsPerCellBarcode <- function(digitalExpressionFile) {
  a <- read.table(digitalExpressionFile, header = T, stringsAsFactors = F)
  colnames(a) <- c("cellBC", "numGenicReads", "numTranscripts", "numGenes")
  return(a)
}

digitalExpressionFileO1 <- snakemake@input[[1]][1]
digitalExpressionFileO2 <- snakemake@input[[2]][1]

num_cells <- snakemake@params$expected_cells

organismOne <- names(snakemake@config$META$species)[1]
organismTwo <- names(snakemake@config$META$species)[2]

par(mar = c(5, 4, 4, 2) + 0.5)

pdf(snakemake@output$genes_pdf, height = 8, width = 8)
df_temp <- categorizeCellsUsingKneeKnownNumCellsPaper(
  digitalExpressionFileO1,
  digitalExpressionFileO2,
  organismOne = organismOne,
  organismTwo = organismTwo,
  pureRatio = snakemake@config$META$ratio,
  numCells = num_cells,
  numBeads = num_cells * 2,
  point.cex = 1,
  category = "genes"
)
dev.off()



pdf(snakemake@output$transcripts_pdf, height = 8, width = 8)
df <- categorizeCellsUsingKneeKnownNumCellsPaper(
  digitalExpressionFileO1,
  digitalExpressionFileO2,
  organismOne = organismOne,
  organismTwo = organismTwo,
  pureRatio = snakemake@config$META$ratio,
  numCells = num_cells,
  numBeads = num_cells * 2,
  point.cex = 1,
  category = "transcripts"
)
dev.off()
organism1 <- subset(df, df$organism == organismOne)
organism2 <- subset(df, df$organism == organismTwo)

write.table(organism1$tag, snakemake@output$barcodes_species[1],
            row.names = F, col.names = F, quote = F)
write.table(organism2$tag, snakemake@output$barcodes_species[2],
            row.names = F, col.names = F, quote = F)

# save.image(paste0(snakemake@output$genes_pdf,".rdata"))
if (debug_flag) {
  save.image(file = file.path(
    path_debug,
    paste0(
      "plot_species_plot_workspace_",
      attr(snakemake, "wildcard")$sample, ".rdata"
    )
  ))
}

R From line 4 of scripts/plot_species_plot.R

debug_flag <- FALSE
if (snakemake@config$DEBUG) {
  debug_flag <- TRUE
  message("In debug mode: saving R objects to inspect later")
  path_debug <- file.path(snakemake@config$LOCAL$results, "debug")
  dir.create(path_debug, showWarnings = FALSE)
  save(snakemake, file = file.path(path_debug, "plot_violin_snakemake.rdata"))
}


options(warn = -1)
library(plyr, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE) # Dataframe manipulation
library(Matrix, quietly = TRUE, warn.conflicts = FALSE) # Sparse matrices
library(stringr, quietly = TRUE, warn.conflicts = FALSE)
library(RColorBrewer, quietly = TRUE, warn.conflicts = FALSE)
library(devtools, quietly = TRUE, warn.conflicts = FALSE)
library(Seurat, quietly = TRUE, warn.conflicts = FALSE)
library(plotly, quietly = TRUE, warn.conflicts = FALSE)

# rule map in Snakefile
# rule map:
#     input:
#         'plots/violinplots_comparison_UMI.pdf',
#         ...

# importing UMI
# importing counts ( summary/counts_expression_matrix.tsv )

ReadMTX <- function(mtx_path) {
  data_dir <- dirname(mtx_path)
  files <- list.files(data_dir)
  # Find files
  barcodes_file <- grep("barcodes", files, value = TRUE)
  features_file <- grep(pattern = "genes|features", x = files, value = TRUE)
  mtx <- grep("mtx", files, value = TRUE)
  # load the data
  data <- readMM(file.path(data_dir, mtx))
  barcodes <- read.csv(file.path(data_dir, barcodes_file), header = FALSE)$V1
  features <- read.csv(file.path(data_dir, features_file), header = FALSE)$V1

  colnames(data) <- barcodes
  rownames(data) <- features
  return(data)
}

#count_matrix <- ReadMTX(snakemake@input$counts)
# importing UMIs ( summary/umi_expression_matrix.tsv )
#umi_matrix <- ReadMTX(snakemake@input$UMIs)

count_matrix <- Read10X(file.path(snakemake@wildcards$results_dir,'summary','read'))
umi_matrix <- Read10X(file.path(snakemake@wildcards$results_dir,'summary','umi'))

design <- read.csv(snakemake@input$design,
  stringsAsFactors = TRUE,
  header = TRUE,
  row.names = NULL
)
metaData <- data.frame(cellNames = colnames(umi_matrix)) %>%
  mutate(samples = factor(str_replace(cellNames, "_[^_]*$", ""))) %>%
  mutate(barcode = factor(str_replace(cellNames, ".+_", ""))) %>%
  left_join(design, by = "samples")
rownames(metaData) <- metaData$cellNames

# possible to set is.expr = -1 to avoid filtering whilst creating
# seuratobj <- CreateSeuratObject(raw.data = umi_matrix, meta.data = metaData, is.expr = -1)
seuratobj <- CreateSeuratObject(raw.data = umi_matrix, meta.data = metaData)
seuratobj <- SetAllIdent(object = seuratobj, id = "samples")
# relabel cell idenity (https://github.com/satijalab/seurat/issues/380)
seuratobj@meta.data$orig.ident <- seuratobj@meta.data$samples

mycount <- CreateSeuratObject(raw.data = count_matrix, meta.data = metaData)
mycount <- SetAllIdent(object = mycount, id = "samples")
mycount@meta.data$orig.ident <- mycount@meta.data$samples
# turn off filtering
# note, the @meta.data slot contains usefull summary stuff
# head([email protected],2)
#                              nGene nUMI expected_cells read_length      barcode
# dropseqLib1_ACTAACATTATT    15   33            400         100 ACTAACATTATT
# dropseqLib1_GAGTCTGAGGCG     5    9            400         100 GAGTCTGAGGCG
#                                       origin      origin
# dropseqLib1_ACTAACATTATT dropseqLib1 dropseqLib1
# dropseqLib1_GAGTCTGAGGCG dropseqLib1 dropseqLib1
meta.data <- seuratobj@meta.data
# combining UMIs and Counts in to one Seurat object
meta.data$nCounts <- mycount@meta.data$nUMI
seuratobj@meta.data <- meta.data
# delete since Counts have been added to seuratobj as nCounts column
rm(mycount)


# mytheme <- theme_bw(base_size = 9) +
mytheme <- theme_bw() +
  theme(
    legend.position = "right",
    axis.ticks = element_blank(),
    axis.text.x = element_text(angle = 300, hjust = 0)
  )
theme_set(mytheme)

# predefined ggplot layers for subsequent plots
gglayers <- list(
  geom_smooth(method = "loess"),
  geom_point(size = .5),
  scale_y_continuous(
    labels = scales::unit_format(unit = "", scale = 1e-3, digits = 2),
    breaks = scales::pretty_breaks(n = 8)
  ),
  scale_x_continuous(
    labels = scales::unit_format(unit = "", scale = 1e-3, digits = 2),
    breaks = scales::pretty_breaks(n = 8)
  )
)

gg <- ggplot(meta.data, aes(x = nUMI, y = nCounts, color = orig.ident)) +
  #   coord_trans(y="log10",x = "log10") +
  gglayers +
  geom_abline(intercept = 0, slope = 1) +
  labs(
    title = "UMI counts vs raw Counts",
    subtitle = "Number of UMIs and raw Counts for each Bead",
    x = "Number of UMIs per Bead [k]",
    y = "Number of Counts per Bead [k]"
  )

# dev.new()
# htmlwidgets::saveWidget(ggplotly(gg), file.path(getwd(),snakemake@output$html_umivscounts))
ggsave(gg, file = file.path(getwd(), snakemake@output$pdf_umivscounts), width = 12, height = 7)

# how about unaligned reads/UMI?
# Note(Seb): raw.data is actually filtered data i.e. nr of genes likely to be smaller than input data!
mito.gene.names <- grep("^mt-", rownames(seuratobj@raw.data), value = TRUE, ignore.case = TRUE)
sribo.gene.names <- grep("^Rps", rownames(seuratobj@raw.data), value = TRUE, ignore.case = TRUE)
lribo.gene.names <- grep("^Rpl", rownames(seuratobj@raw.data), value = TRUE, ignore.case = TRUE)

col.total <- Matrix::colSums(seuratobj@raw.data)
meta.data$col.total <- col.total

seuratobj.top_50 <- apply(seuratobj@raw.data, 2, function(x) sum(x[order(x, decreasing = TRUE)][1:50]) / sum(x))
# mycount.top_50 <- apply([email protected], 2, function(x) sum(x[order(x, decreasing = TRUE)][1:50])/sum(x))

seuratobj <- AddMetaData(seuratobj, Matrix::colSums(seuratobj@raw.data[sribo.gene.names, ]) / col.total, "pct.sribo")
seuratobj <- AddMetaData(seuratobj, Matrix::colSums(seuratobj@raw.data[lribo.gene.names, ]) / col.total, "pct.lribo")
seuratobj <- AddMetaData(seuratobj, Matrix::colSums(seuratobj@raw.data[unique(c(sribo.gene.names, lribo.gene.names)), ]) / col.total, "pct.Ribo")
seuratobj <- AddMetaData(seuratobj, Matrix::colSums(seuratobj@raw.data[mito.gene.names, ]) / col.total, "pct.mito")
seuratobj <- AddMetaData(seuratobj, seuratobj.top_50, "top50")
tmp <- seuratobj@meta.data$nUMI / seuratobj@meta.data$nGene
names(tmp) <- rownames(seuratobj@meta.data)
seuratobj <- AddMetaData(seuratobj, tmp, "umi.per.gene")


gg <- VlnPlot(seuratobj,
  c("nUMI", "nGene", "top50", "umi.per.gene", "pct.Ribo", "pct.mito"),
  x.lab.rot = TRUE, do.return = TRUE
)
# ggsave(gg,file=file.path("violinplots_comparison_UMI.pdf"),width=18,height=18)
ggsave(gg, file = snakemake@output$pdf_violine, width = 18, height = 18)
# gg <- VlnPlot(mycount,c("nUMI", "nGene", "top50", "count.per.gene","pct.Ribo", "pct.mito"), x.lab.rot = TRUE, do.return = TRUE)
# ggsave(gg,file=file.path("violinplots_comparison_count.pdf"),width=18,height=18)

# gg <- GenePlot(object = seuratobj, gene1 = "nUMI", gene2 = "nGene")
# ggsave(gg,file=file.path("violinplots_comparison.pdf"),width=18,height=18)


gg <- ggplot(meta.data, aes(x = nUMI, y = nGene, color = orig.ident)) +
  gglayers +
  labs(
    title = "Genes (pooled mouse and human set) vs UMIs for each bead",
    x = "Number of UMIs per Bead [k]",
    y = "Number of Genes per Bead [k]"
  )

# dev.new()
# htmlwidgets::saveWidget(ggplotly(gg),
# file.path(getwd(), snakemake@output$html_umi_vs_gene))
ggsave(gg, file = snakemake@output$pdf_umi_vs_gene, width = 12, height = 7)



################################################################################
## same for Counts instead UMIs (using mycount object)
gg <- ggplot(meta.data, aes(x = nCounts, y = nGene, color = orig.ident)) +
  gglayers +
  labs(
    title = "Genes (pooled mouse and human set) vs Counts for each bead",
    x = "Number of Counts per Bead [k]",
    y = "Number of Genes per Bead [k]"
  )

# dev.new()
# htmlwidgets::saveWidget(ggplotly(gg),
#                       file.path(getwd(), snakemake@output$html_count_vs_gene))

ggsave(gg, file = snakemake@output$pdf_count_vs_gene, width = 12, height = 7)


# head(meta.data,2)
#                              nGene nUMI                    cellNames         samples      barcode expected_cells read_length  batch      orig.ident pct.sribo  pct.lribo  pct.Ribo  pct.mito     top50 umi.per.gene
# sample1_GAGTCTGAGGCG     6    6 sample1_GAGTCTGAGGCG sample1 GAGTCTGAGGCG            100         100 batch1 sample1 0.0000000 0.00000000 0.0000000 0.0000000 1.0000000     1.000000
# sample1_CAGCCCTCAGTA   264  437 sample1_CAGCCCTCAGTA sample1 CAGCCCTCAGTA            100         100 batch1 sample1 0.0389016 0.07551487 0.1144165 0.0228833 0.5102975     1.655303

# saving snakemake meta information into misc slot so all can be exported as one object
seuratobj@misc <- snakemake
# exporting R Seurat objects into summary/R_Seurat_objects.rdata
saveRDS(seuratobj, file = file.path(snakemake@output$R_objects))

if (debug_flag) {
  save.image(file = file.path(path_debug, "plot_violin_workspace.rdata"))
}

R Snakemake ggplot2 dplyr stringr Seurat RColorBrewer plyr Matrix devtools plotly From line 13 of scripts/plot_violine.R

debug_flag <- FALSE
if (snakemake@config$DEBUG) {
  debug_flag <- TRUE
  message("In debug mode: saving R objects to inspect later")
  path_debug <- file.path(snakemake@config$LOCAL$results, "debug")
  dir.create(path_debug)
  save(snakemake, file = file.path(path_debug, "plot_yield_snakemake.rdata"))
}
#------------------------------------ debugging

library(ggplot2)
library(tidyr)
library(grid)
library(gridExtra)
library(viridis)
library(stringr)

samples <- snakemake@params$sample_names
batches <- snakemake@params$batches
mydata  <- data.frame(matrix(nrow = length(samples), ncol = 7))
colnames(mydata) <- c("Sample", "Batch", "Cutadapt filtered", "Unmapped",
                      "Multi mapped", "Uniquely mapped", "Total reads")
mydata[, "Sample"] <- samples
mydata[, "Batch"]  <- batches
for (i in 1:length(samples)) {
  # Input files and variables
  STAR_output <- read.table(snakemake@input$STAR_output[i],
                            skip = 5, sep = "\t",
                            fill = TRUE, stringsAsFactors = FALSE)
  mysample <- samples[i]
  # Read files
  # bbmap_log = read.table(snakemake@input$repaired[i], sep=':', header=FALSE, skip=8, row.names=1, nrows=4)
  # reads_after_filtering = as.numeric(str_match(bbmap_log['Pairs',], pattern = "\t([0-9]{1,20}) reads.*")[,2])/2
  bbmap_log   <- readLines(snakemake@input$repaired[i])
  reads_after_filtering <- as.numeric(str_match(bbmap_log[grep("^Pairs:", bbmap_log)],
                                                pattern = "\t([0-9]{1,20}) reads.*")[, 2]) / 2
  R1_filtered <- read.table(snakemake@input$R1_filtered[i], header = FALSE, skip = 7, sep = ":", nrows = 7, row.names = 1)
  total_reads <- as.numeric(str_replace_all(R1_filtered["Total reads processed", ], pattern = (" |,"), ""))

  # R2_filtered = read.table(snakemake@input$R2_filtered[i], header = FALSE, skip=8, sep=':', nrows=7, row.names=1)

  # R1_adapters = as.numeric(str_remove_all(str_match(R1_filtered['Reads with adapters',], pattern = "(.*) \\(")[,2], pattern = (' |,')))
  # R1_too_short = as.numeric(str_remove_all(str_match(R1_filtered['Reads that were too short',], pattern = "(.*) \\(")[,2], pattern = (' |,')))
  # R1_passed = as.numeric(str_remove_all(str_match(R1_filtered['Reads written (passing filters)',], pattern = "(.*) \\(")[,2], pattern = (' |,')))
  # R1_filtered = total_reads - R1_passed

  # R2_adapters = as.numeric(str_remove_all(str_match(R2_filtered['Reads with adapters',], pattern = "(.*) \\(")[,2], pattern = (' |,')))
  # R2_too_short = as.numeric(str_remove_all(str_match(R2_filtered['Reads that were too short',], pattern = "(.*) \\(")[,2], pattern = (' |,')))
  # R2_passed = as.numeric(str_remove_all(str_match(R2_filtered['Reads written (passing filters)',], pattern = "(.*) \\(")[,2], pattern = (' |,')))
  # R2_filtered = total_reads - R2_passed

  mydata[which(mydata$Sample == mysample), "Cutadapt filtered"] <- total_reads - reads_after_filtering
  mydata[which(mydata$Sample == mysample), "Total reads"] <- total_reads

  # STAR output
  reads_in        <- as.numeric(STAR_output$V2[1])
  uniquely_mapped <- as.numeric(STAR_output$V2[4])
  multi_mapped    <- as.numeric(STAR_output$V2[19])
  unmapped        <- reads_in - uniquely_mapped - multi_mapped

  mydata[which(mydata$Sample == mysample), "Uniquely mapped"] <- uniquely_mapped
  mydata[which(mydata$Sample == mysample), "Multi mapped"]    <- multi_mapped
  mydata[which(mydata$Sample == mysample), "Unmapped"]        <- unmapped
}

# tidyr version
 mydata_long <- mydata %>% gather(variable, value, -Sample, -Batch)
# melt will be retired, use gather instead: https://github.com/hadley/reshape
#Force factor order.
mydata_long$variable = factor(mydata_long$variable, levels = c('Cutadapt filtered','Multi mapped','Total reads','Unmapped','Uniquely mapped'))
color_palette = c('#e88270','#cb7262','#ae6254','#70d6e8')


p1 <- ggplot(subset(mydata_long, mydata_long$variable != "Total reads"),
             aes(x = Sample, y = value, fill = variable)) +
  geom_histogram(stat = "identity", binwidth = 1 / length(samples)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 0)) +
  labs(title = paste("Yield of all the reads for each category"),
       x = "Samples",
       y = "Number of reads") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        legend.position = "none",
        plot.title = element_text(size = 20, face = "bold")) +
  facet_grid(~Batch, scales = "free") +
  scale_fill_viridis(discrete = TRUE, option = "viridis") +
  scale_y_continuous(labels = scales::scientific)

mydata_pct <- mydata[, -c(1, 2)] / mydata[, "Total reads"]
mydata_pct <- cbind(Sample = mydata[, "Sample"],
                    Batch = mydata[, "Batch"], mydata_pct)

mydata_long_pct <- mydata_pct %>% gather(variable, value, -Sample, -Batch)

mydata_long_pct$variable = factor(mydata_long$variable, levels = c('Cutadapt filtered','Multi mapped','Unmapped','Uniquely mapped'))

p2 <- ggplot(subset(mydata_long_pct, mydata_long_pct$variable != "Total reads"),
             aes(x = Sample, y = value, fill = variable)) +
  labs(fill = "Filters") +
  geom_histogram(stat = "identity", binwidth = 1 / length(samples)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 0),
        legend.position = "bottom",
        strip.background = element_blank(),
        strip.text.x = element_blank()) +
  labs(x = "Samples",
       y = "Percentage of reads") +
  facet_grid(~Batch, scales = "free") +
  scale_fill_viridis(discrete = TRUE, option = "viridis") +
  scale_y_continuous(labels = scales::percent)

gp1 <- ggplotGrob(p1)
gp2 <- ggplotGrob(p2)

pdf(file = snakemake@output$pdf, width = 16, height = 13)
grid::grid.newpage()
grid::grid.draw(rbind(gp1, gp2, size = "last"))
dev.off()

if (debug_flag) {
  save.image(file = file.path(path_debug, "plot_yield_workspace.rdata"))
}

R ggplot2 STAR tidyr stringr gridExtra viridis From line 6 of scripts/plot_yield.R

library(yaml)

R Markdown yaml From line 8 of scripts/publication_text.Rmd

versions = list()
for (yaml_file in snakemake@input$yaml_files){
  current_env = yaml.load_file(yaml_file)
  for (package in current_env$dependencies){
    if(grepl(pattern = 'cutadapt', package)){
      versions[['cutadapt']] = strsplit(package,'=|==| ==| == ')[[1]][2]
    }
    else if(grepl(pattern = 'star', package)){
      versions[['star']] = strsplit(package,'=|==| ==| == ')[[1]][2]
    }
    else if(grepl(pattern = 'dropseq_tools', package)){
      versions[['dropseq_tools']] = strsplit(package,'=|==| ==| == ')[[1]][2]
    }
    else if(grepl(pattern = 'bbmap', package)){
      versions[['bbmap']] = strsplit(package,'=|==| ==| == ')[[1]][2]
    }
  }
}

umi_distance=snakemake@config$EXTRACTION$`UMI-edit-distance`

R Markdown Cutadapt BBMap From line 12 of scripts/publication_text.Rmd

import pickle
import pysam

def load_obj(name):
    with open(name, 'rb') as f:
        return pickle.load(f)

def save_obj(obj, name):
    with open(name, 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)


infile_bam = pysam.AlignmentFile(snakemake.input.bam, "rb")
outfile = pysam.AlignmentFile(snakemake.output.bam, "wb", template=infile_bam)

mapping = load_obj(snakemake.input.barcode_mapping)
barcode_ref = load_obj(snakemake.input.barcode_ref)
barcode_ext_ref = load_obj(snakemake.input.barcode_ext_ref)
unknown_barcodes = set()

for bam_read in infile_bam:
	barcode = bam_read.get_tag('XC')
	#lane_number = bam_read.query_name.split(':')[3]
	if barcode in barcode_ref:
		mapping[0][barcode]['count'] += 1
		#mapping[0][barcode]['lanes'][lane_number] += 1
		outfile.write(bam_read)
		continue
	elif barcode in barcode_ext_ref:
		# The barcode is in our extended reference. Change the barcode to the original one
		reference_barcode = mapping[1][barcode]['ref']
		mapping[1][barcode]['count'] += 1
		#mapping[1][barcode]['lanes'][lane_number] += 1
		bam_read.set_tag('XC',reference_barcode,value_type='Z',replace=True)
		outfile.write(bam_read)
		continue
	else:
		# If the barcode is not found in the extended ref, then don't modify it.
		if barcode in unknown_barcodes:
			mapping['unknown'][barcode]['count'] += 1
			#mapping['unknown'][barcode]['lanes'][lane_number] += 1
		else:
			#mapping['unknown'][barcode] = {'count':1, 'lanes':{'1':0,'2':0,'3':0,'4':0,'5':0,'6':0,'7':0,'8':0}}
			mapping['unknown'][barcode] = {'count':1}
			#mapping['unknown'][barcode]['lanes'][lane_number] += 1
			unknown_barcodes.add(barcode)
		outfile.write(bam_read)

save_obj(obj=mapping, name=snakemake.output.barcode_mapping_counts)

Python pysam From line 1 of scripts/repair_barcodes.py

import os
import sys
from collections import defaultdict
import pickle


def save_obj(obj, name):
    with open(name, 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)


mapping=defaultdict(dict)
barcode_ref = set()
barcode_ext_ref = set()


with open(snakemake.input['whitelist'],'r') as whitelist:
	for line in whitelist:
		if len(line.strip().split()) == 2:  # This means we didn't find any other linked barcode
			(reference,counts_ref) = line.strip().split()
			mapping[0][reference]= defaultdict()
			mapping[0][reference]['ref'] = reference
			mapping[0][reference]['count'] = 0
			mapping[0][reference]['lanes'] = {'1':0,'2':0,'3':0,'4':0,'5':0,'6':0,'7':0,'8':0}
			barcode_ref.add(reference)
			continue
		(reference,extended_ref,counts_ref,counts_ext) = line.strip().split()
		mapping[0][reference]= defaultdict()
		mapping[0][reference]['ref'] = reference
		mapping[0][reference]['count'] = 0
		mapping[0][reference]['lanes'] = {'1':0,'2':0,'3':0,'4':0,'5':0,'6':0,'7':0,'8':0}
		barcode_ref.add(reference)
		for barcode in extended_ref.split(','):
			mapping[1][barcode] = defaultdict()
			mapping[1][barcode]['ref'] = reference
			mapping[1][barcode]['count'] = 0
			mapping[1][barcode]['lanes'] = {'1':0,'2':0,'3':0,'4':0,'5':0,'6':0,'7':0,'8':0}
		barcode_ext_ref.update(extended_ref.split(','))

# Save mapping and references to reuse later.
save_obj(obj=mapping, name=snakemake.output.barcode_mapping)
save_obj(obj=barcode_ref,name=snakemake.output.barcode_ref)
save_obj(obj=barcode_ext_ref,name=snakemake.output.barcode_ext_ref)

Python From line 1 of scripts/umi_tools_extended_ref.py

__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

fq1 = snakemake.input.get("fq1")
assert fq1 is not None, "input-> fq1 is a required input parameter"
fq1 = [snakemake.input.fq1] if isinstance(snakemake.input.fq1, str) else snakemake.input.fq1
fq2 =  snakemake.input.get("fq2")
if fq2:
    fq2 = [snakemake.input.fq2] if isinstance(snakemake.input.fq2, str) else snakemake.input.fq2
    assert len(fq1) == len(fq2), "input-> equal number of files required for fq1 and fq2"
input_str_fq1 = ",".join(fq1)
input_str_fq2 = ",".join(fq2) if fq2 is not None else ""
input_str =  " ".join([input_str_fq1, input_str_fq2])

if fq1[0].endswith(".gz"):
    readcmd = "--readFilesCommand zcat"
else:
    readcmd = ""

outprefix = os.path.dirname(snakemake.output[0]) + "/"

shell(
    "STAR "
    "{extra} "
    "--runThreadN {snakemake.threads} "
    "--genomeDir {snakemake.params.index} "
    "--readFilesIn {input_str} "
    "{readcmd} "
    "--outSAMtype BAM Unsorted "
    "--outFileNamePrefix {outprefix} "
    "--outStd Log "
    "{log}")

Python Snakemake STAR From line 1 of align/wrapper.py

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)

    split_ind = 2 if base.endswith(".gz") else 1
    base = ".".join(base.split(".")[:-split_ind])

    return base


# Run fastqc, since there can be race conditions if multiple jobs 
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell("fastqc {snakemake.params} --quiet "
          "--outdir {tempdir} {snakemake.input[0]}"
          " {log}")

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path} {snakemake.output.html}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path} {snakemake.output.zip}")

Python Snakemake FastQC From line 3 of fastqc/wrapper.py

__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}")