Snakemake pipeline for single-cell RNA-seq analysis


Setup

Testing

# generate 10x v3 scRNA-seq test data
cd .test/ngs-test-data
snakemake scrnaseq_10x_v3 --use-conda -c1 --conda-cleanup-pkgs cache
# test the workflow
cd ../..
snakemake all --use-conda -c1 --directory .test --show-failed-logs

Configuration

  1. Create two TSVs: (1) a "runsheet" that describes each 10x run and (2) a "patientsheet" that describes each patient

The runsheet should have the following columns (an example follows the list):

  1. run_id : a unique identifier for each run

  2. r1 : the path to the R1 fastq file

  3. r2 : the path to the R2 fastq file

  4. total_barcodes : the total number of barcodes in the run

  5. cells_loaded : the number of cells loaded into the run

  6. expected_cells : the number of cells expected to be yielded in the run

  7. expected_multiplet_rate : the expected multiplet rate of the run
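
For example, a runsheet describing a single run might look like this (tab-separated; paths and numbers are illustrative):

run_id  r1                     r2                     total_barcodes  cells_loaded  expected_cells  expected_multiplet_rate
run1    data/run1_R1.fastq.gz  data/run1_R2.fastq.gz  100000          16500         10000           0.08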

The patientsheet should have the following columns (see the example after the list):

  1. patient_id : a unique identifier for each patient

  2. condition : the condition of the patient
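
A matching patientsheet for two patients (values are illustrative):

patient_id  condition
patient1    healthy
patient2    disease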

  2. Enter the paths to the runsheet and patientsheet in the config.yaml file, and specify an output directory:

outdir: <path to output directory>
runsheet: <path to runsheet>
patientsheet: <path to patientsheet>
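
For example, assuming the sheets live in a config/ directory (paths are illustrative):

outdir: results
runsheet: config/runsheet.tsv
patientsheet: config/patientsheet.tsv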

Running the workflow

# run the workflow all the way through
# -c16 = use 16 cores; adjust to your machine
outdir="<path to output directory>"
snakemake all --use-conda -c16 --show-failed-logs --config outdir="$outdir"

To run the workflow only up to a particular rule, use the --until flag:

# run until a certain rule
snakemake all --use-conda -c16 --show-failed-logs --config outdir="$outdir" --until <rule name>
# e.g. stop after indexing the genome
snakemake all --use-conda -c16 --show-failed-logs --config outdir="$outdir" --until STARindex
# e.g. stop after running STARsolo
snakemake all --use-conda -c16 --show-failed-logs --config outdir="$outdir" --until STARsolo
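
Snakemake's standard dry-run flag (-n) can be combined with any of the invocations above to preview the scheduled jobs before committing to a long run:

# list the planned jobs and exit without executing anything
snakemake all -n --use-conda --config outdir="$outdir"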

TODO:

  • [ ] Add separate target rules for running individual stages of the pipeline (instead of all, e.g. map_count, compass, preprocess)

Code Snippets

shell:
    """
    # run the installer script, then touch a sentinel file so Snakemake can track completion
    python {input} install
    touch {output}
    """
script:
    "../scripts/prepare_data_for_compass.py"
shell:
    """
    # pre-download and cache Compass resources for human, then touch a sentinel file
    compass --precache --species homo_sapiens
    touch {output}
    """
shell:
    """
    tmpdir=$(mktemp -d -p $(dirname {output}))
    compass \
        --data-mtx {input.norm} {input.features} {input.barcodes} \
        --output-dir $(dirname {output}) \
        --microcluster-size 10 \
        --num-processes {threads} \
        --temp-dir $tmpdir \
        --species homo_sapiens 2> {log}
    """
shell:
    """
    git clone https://github.com/aertslab/popscle_helper_tools.git {output}
    """
shell:
    """
    touch {log} && exec > {log} 2>&1
    source {input.tools}/filter_vcf_file_for_popscle.sh
    check_if_programs_exists

    tmpvcf=$(mktemp)
    trap 'rm -f $tmpvcf' EXIT

    # subset samples of interest
    bcftools view -s {params.samples} -Ou {input.vcf} > $tmpvcf

    {input.tools}/sort_vcf_same_as_bam.sh {input.bam} $tmpvcf \
    | only_keep_snps \
    | calculate_AF_AC_AN_values_based_on_genotype_info \
    | filter_out_mutations_missing_genotype_for_one_or_more_samples \
    | filter_out_mutations_homozygous_reference_in_all_samples \
    | filter_out_mutations_homozygous_in_all_samples > {output}
    """
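
These helper functions come from the popscle_helper_tools repository cloned above: they sort the VCF into the same contig order as the BAM (which popscle expects) and drop SNPs that cannot distinguish the samples, shrinking the subsequent pileup.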
shell:
    "{input.tools}/filter_bam_file_for_popscle_dsc_pileup.sh {input.bam} {input.barcodes} {input.vcf} {output} > {log} 2>&1"
shell:
    """
    touch {log} && exec > {log} 2>&1

    # popscle appends its own suffixes to --out, so strip the .plp.gz extension
    out=$(dirname {output})/$(basename {output} .plp.gz)

    popscle dsc-pileup \
        --sam {input.bam} \
        --group-list {input.barcodes} \
        --vcf {input.vcf} \
        --out $out
    """
shell:
    """
    touch {log} && exec > {log} 2>&1

    out=$(dirname {output})/$(basename {output} .best)
    plp=$(dirname {input.plp})/$(basename {input.plp} .plp.gz)

    popscle demuxlet \
        --plp $plp \
        --group-list {input.barcodes} \
        --vcf {input.vcf} \
        --out $out
    """
shell:
    "jupyter nbconvert --to html {input} --output {output}"
shell:
    "STAR --runMode genomeGenerate --runThreadN {threads} --genomeDir {output} --genomeFastaFiles {input.fa} --sjdbGTFfile {input.gtf} --genomeSAindexNbases {params.genomeSAindexNbases}"
script:
    "../scripts/STARsolo.py"
wrapper:
    "v1.23.5/bio/sambamba/index"
shell:
    "jupyter nbconvert --to html {input}"
shell:
    """
    # use CUDA if GPU is present
    if which nvidia-smi > /dev/null; then
        cuda="--cuda"
    else
        cuda=""
    fi
    # expected cells = number of lines (barcodes) in the filtered barcode list
    expected=$(wc -l < {input.filtered} | tr -d ' ')

    # run cellbender
    mkdir -p $(dirname {output.raw})
    cellbender remove-background \
        --input $(dirname {input.raw}) \
        --output {output.raw} \
        --epochs 300 \
        --learning-rate 0.00001 \
        --z-dim 50 \
        --expected-cells $expected \
        --total-droplets-included {params.total_droplets_included} $cuda
    """
shell:
    """
    irescue \
        --bam {input.bam} \
        --tmpdir IRescue_tmp_{wildcards.run} \
        --genome {params.genome} \
        --threads {threads} \
        --outdir $(dirname {output[0]}) \
        --whitelist {input.whitelist}
    """
shell:
    """
    git clone https://github.com/bvaldebenitom/SoloTE.git {output} 2> {log}

    # fix bad code on line 321 of SoloTE_1.08/SoloTE_pipeline.py
    sed -i 's/stdout.split/split/g' {output}/SoloTE_1.08/SoloTE_pipeline.py

    chmod +x {output}/SoloTE_1.08/convertRMOut_to_SoloTEinput.sh
    """
shell:
    "{input.solote}/SoloTE_1.08/convertRMOut_to_SoloTEinput.sh {input.rmsk_out} {output} > {log} 2>&1"
shell:
    """
    outdir=$(dirname $(dirname {output[0]}))

    python {input.solote}/SoloTE_1.08/SoloTE_pipeline.py \
        --threads {threads} \
        --bam {input.bam} \
        --teannotation {input.te_bed} \
        --outputprefix {wildcards.run} \
        --outputdir $outdir/ > {log} 2>&1

    mv {wildcards.run}_SoloTE.stats $outdir/{wildcards.run}_SoloTE_output/

    # clean up
    rm -rf $outdir/{wildcards.run}_SoloTE_temp
    """
shell:
    """
    tar -xf {input.ref10x} --wildcards '*genome.fa' '*genes.gtf'
    mv $(basename {input.ref10x} .tar.gz)/genes/genes.gtf {output.gtf}
    mv $(basename {input.ref10x} .tar.gz)/fasta/genome.fa {output.fa}
    gunzip -c {input.rmsk_out} > {output.rmsk_out}
    """
From line 25 of rules/ref.smk
shell:
    """
    if [[ {input} == *.gz ]]; then
        gunzip -c {input} > {output}
    else
        cp {input} {output}
    fi
    """
From line 56 of rules/ref.smk
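The script below is ../scripts/prepare_data_for_compass.py, referenced by the script: directive above. It normalizes counts with Pegasus, writes them out in matrix-market format, and rewrites the barcodes and features files into the single-column format Compass expects:
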
__author__ = ["Michael Cuoco", "Joelle Faybishenko"]


import pegasus as pg
import sys
import pandas as pd
from pathlib import Path
from snakemake.shell import shell

sys.stderr = open(snakemake.log[0], "w")

print(f"opening file {snakemake.input[0]}")
data = pg.read_input(snakemake.input[0])

print("normalize data")
pg.normalize(data, norm_count=1e5)

outdir = Path(snakemake.output.norm).parent.parent
print(f"writing files to {outdir}")
pg.write_output(data, outdir, file_type="mtx")

for file in snakemake.output:
    print(f"decompressing {file}")
    shell(f"gunzip {file}.gz")

# fix barcodes and features files for proper reading by compass
df = pd.read_csv(snakemake.output.barcodes, sep="\t")
df["barcodekey"].to_csv(snakemake.output.barcodes, sep="\t", index=False, header=False)

df = pd.read_csv(snakemake.output.features, sep="\t")
df["featurekey"].to_csv(snakemake.output.features, sep="\t", index=False, header=False)


sys.stderr.close()
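The script below is ../scripts/STARsolo.py, referenced above. It assembles the STARsolo command line from the run's fastq files and the 10x chemistry and STARsolo options in the config:
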
__author__ = "Michael Cuoco"

import tempfile, os
from pathlib import Path
from snakemake.shell import shell

# setup input fastqs and outdir
if type(snakemake.input.r1) is str:
    r1 = snakemake.input.r1
    r2 = snakemake.input.r2
else:
    assert len(snakemake.input.r1) == len(
        snakemake.input.r2
    ), "Unequal number of R1 and R2 files"
    r1 = ",".join(snakemake.input.r1)
    r2 = ",".join(snakemake.input.r2)

outdir = str(Path(snakemake.output.bam).parent) + "/"

# set readFilesIn and solo10xProtocol
print(f"10x chemistry: {snakemake.config['10x_chemistry']}")
if snakemake.config["10x_chemistry"] == "5prime":
    solo10xProtocol = "--soloBarcodeMate 1 --clip5pNbases 39 0 --soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 10"
    readFilesIn = f"--readFilesIn {r1} {r2}"
elif snakemake.config["10x_chemistry"] == "3prime_v2":
    solo10xProtocol = "--soloUMIlen 16 --clipAdapterType CellRanger4"
    readFilesIn = f"--readFilesIn {r2} {r1}"
elif snakemake.config["10x_chemistry"] == "3prime_v3":
    solo10xProtocol = "--soloUMIlen 12 --clipAdapterType CellRanger4"
    readFilesIn = f"--readFilesIn {r2} {r1}"
else:
    raise ValueError("Invalid 10x protocol")

# set read command
print(f"Read 1 file(s): {r1}")
print(f"Read 2 file(s): {r2}")
if r1.endswith(".gz"):
    readcmd = "gunzip -c"
elif r1.endswith(".bz2"):
    readcmd = "bunzip2 -c"
elif r1.endswith(".fastq") or r1.endswith(".fq"):
    readcmd = ""
else:
    raise ValueError("Invalid read file")

# set solobarcode read length
if snakemake.config["STARsolo"].get("soloBarcodeReadLength"):
    soloBarcodeReadLength = "--soloBarcodeReadLength " + str(
        snakemake.config["STARsolo"]["soloBarcodeReadLength"]
    )
else:
    soloBarcodeReadLength = ""

# set soloMultiMappers
if snakemake.config["STARsolo"].get("soloMultiMappers"):
    soloMultiMappers = "--soloMultiMappers " + str(
        snakemake.config["STARsolo"]["soloMultiMappers"]
    )
else:
    soloMultiMappers = ""

if snakemake.config["STARsolo"].get("outSAMmultNmax"):
    outSAMmultNmax = "--outSAMmultNmax " + str(
        snakemake.config["STARsolo"]["outSAMmultNmax"]
    )
else:
    outSAMmultNmax = ""

outFilterMultimapNmax = snakemake.config["STARsolo"]["outFilterMultimapNmax"]
winAnchorMultimapNmax = snakemake.config["STARsolo"]["winAnchorMultimapNmax"]

mem = str(int(50e9))  # 50 GB of RAM for BAM sorting (STAR expects an integer byte count)

with tempfile.TemporaryDirectory() as tmpdir:
    statvfs = os.statvfs(tmpdir)
    if statvfs.f_frsize * statvfs.f_bavail < 2e10:
        print(
            "WARNING: Less than 20 GB of free disk space detected at tmpdir location. STAR may fail due to lack of disk space. Set $TMPDIR to a location with more free space."
        )
    shell(
        "STAR "
        " --runThreadN {snakemake.threads}"
        " --genomeDir {snakemake.input.idx}"
        " --readFilesCommand {readcmd}"
        " {readFilesIn}"
        " --soloType CB_UMI_Simple"
        " {solo10xProtocol}"
        " {soloBarcodeReadLength}"
        " {soloMultiMappers}"
        " {outSAMmultNmax}"
        " --soloFeatures {snakemake.params.soloFeatures}"
        " --soloCellFilter EmptyDrops_CR"
        " --soloCBwhitelist {snakemake.input.whitelist}"
        " --soloOutFormatFeaturesGeneField3 -"
        " --outTmpDir {tmpdir}/STARtmp"
        " --soloOutFileNames outs genes.tsv barcodes.tsv matrix.mtx"
        " --outFilterMultimapNmax {outFilterMultimapNmax}"
        " --winAnchorMultimapNmax {winAnchorMultimapNmax}"
        " --limitBAMsortRAM {mem} "
        " --outSAMattributes NH HI nM AS CR UR CB UB GX GN sS sQ sM"
        " --outSAMtype BAM SortedByCoordinate"
        " --outFileNamePrefix {outdir}"
    )
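The code below is the body of the v1.23.5/bio/sambamba/index wrapper invoked above:
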
__author__ = "Jan Forster"
__copyright__ = "Copyright 2021, Jan Forster"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "sambamba index {snakemake.params.extra} -t {snakemake.threads} "
    "{snakemake.input[0]} {snakemake.output[0]} "
    "{log}"
)

URL: https://github.com/gage-lab/scrnaseq
Name: scrnaseq
Version: 2