A Snakemake workflow to process scRNA-seq data from 10x Genomics

public public 1yr ago Version: 3 0 bookmarks

A Snakemake workflow to process scRNA-seq data from 10x Genomics

Contents

Overview

Chromium is a Snakemake workflow to process 3' single cell RNA sequencing data from the 10x Genomics platform. It is compatible with 10xv2 and 10xv3 chemistry and features three different quantification methods to obtain both spliced and unspliced abundance estimates:

Installation

Chromium and all of its dependencies can be installed via the mamba package manager:

  1. Install Snakemake and Snakedeploy

    $ mamba create -c bioconda -c conda-forge --name snakemake snakemake snakedeploy
    
  2. Activate the Snakemake environment

    $ mamba activate snakemake
    
  3. Create a project directory

    $ mkdir -p path/to/project
    
  4. Deploy the workflow in the project directory

    $ snakedeploy deploy-workflow https://github.com/snakemake-workflows/chromium path/to/project
    

Usage

  1. Create workflow configuration

    $ vim config/config.yaml # workflow parameters
    
  2. Create samples table

    $ vim config/samples.csv # sample metadata
    
  3. Create units table

    $ vim config/units.csv # unit metdata
    
  4. Test configuration by performing a dry-run

    $ snakemake -n
    
  5. Execute workflow and deploy software dependencies

    $ snakemake --cores all --use-conda
    

    For more information, see the Usage section of the documentation.

Documentation

Full documentation for Chromium is available here

Support

If you need help, open an issue with one of the following labels:

  • help wanted (extra attention is needed)

  • question (further information is requested)

Feedback

If you have any suggestions, open an issue with one of the following labels:

  • documentation (improvements or additions to documentation)

  • enhancement (new feature or request)

Contributing

To contribute to Chromium, clone this repository locally and commit your code on a separate branch. Please generate unit tests for your code and run the linter before opening a pull request:

$ snakemake --generate-unit-tests # generate unit tests
$ snakemake --lint # run the linter

You can find more details in the Contributing guide.

Participation in this project is subject to a Code of Conduct .

Authors

Chromium was developed by James Ashmore but has benefited from contributions by the following:

If you would like to be added to this list, please open a pull request with your contribution.

Citation

If you use Chromium in your research, please cite using the DOI: 10.5281/zenodo.4783308

Used By

Chromium is used by the following companies and institutes:

If you would like to be added to this list, please open a pull request with your information.

Acknowledgements

The methods were chosen based on this research paper:

Soneson C, Srivastava A, Patro R, Stadler MB (2021) Preprocessing choices affect RNA velocity results for droplet scRNA-seq data. PLoS Comput Biol 17(1): e1008585. https://doi.org/10.1371/journal.pcbi.1008585

The workflow was motivated by the following projects:

The documentation was informed by the following articles:

License

Chromium is licensed under the MIT license.
Copyright © 2020, James Ashmore

Code Snippets

26
27
script:
    "../scripts/get_velocity_files.R"
42
43
script:
    "../scripts/read_velocity_output.R"
18
19
shell:
    "bustools correct -o {output.bus} -w {input.txt} {input.bus} 2> {log}"
32
33
shell:
    "bustools sort -t {threads} -o {output.bus} {input.bus} 2> {log}"
49
50
shell:
    "bustools capture -s -x -o {output.bus} -c {input.txt} -e {input.mat} -t {input.txi} {input.bus} 2> {log}"
66
67
shell:
    "bustools capture -s -x -o {output.bus} -c {input.txt} -e {input.mat} -t {input.txi} {input.bus} 2> {log}"
85
86
shell:
    "bustools count -o {params.out} -g {input.tsv} -e {input.mat} -t {input.txt} --genecounts {input.bus} 2> {log}"
104
105
shell:
    "bustools count -o {params.out} -g {input.tsv} -e {input.mat} -t {input.txt} --genecounts {input.bus} 2> {log}"
20
21
script:
    "../scripts/eisar_ranges.R"
36
37
script:
    "../scripts/eisar_fa.R"
51
52
script:
    "../scripts/eisar_gtf.R"
66
67
script:
    "../scripts/eisar_tsv.R"
81
82
script:
    "../scripts/eisar_tx2gene.R"
23
24
shell:
    "genomepy install -g {params} -a {wildcards.genome} 2> {log}"
37
38
shell:
    "gunzip -c {input} > {output}"
17
18
shell:
    "gffread {input} --table transcript_id,gene_id | sort -u > {output} 2> {log}"
31
32
shell:
    "gffread {input} --table gene_id,gene_name | sort -u > {output} 2> {log}"
45
46
shell:
    "gffread {input} --table @chr,gene_id | awk '$1 == MT' | cut -f 2 | sort -u > {output} 2> {log}"
59
60
shell:
    "gffread {input} --table gene_biotype,gene_id | awk '$1 == rRNA' | cut -f 2 | sort -u > {output} 2> {log}"
17
18
19
20
21
shell:
    "kallisto index"
    " --index {output.idx}"
    " {input.fas}"
    " 2> {log}"
45
46
47
48
49
50
51
52
shell:
    "kallisto bus"
    " --index {input.idx}"
    " --output-dir {params.out}"
    " --technology {params.ver}"
    " --threads {threads}"
    " {params.fqz}"
    " 2> {log}"
17
18
shell:
    "curl {params.url} > {output.txt} 2> {log.err}"
31
32
shell:
    "curl {params.url} > {output.txt} 2> {log.err}"
45
46
shell:
    "curl {params.url} | gunzip -c > {output.txt} 2> {log.err}"
16
17
shell:
    "cat {input} > {output} 2> {log}"
28
29
shell:
    "cut -f 1 {input} > {output} 2> {log}"
46
47
48
49
50
51
52
53
54
shell:
    "salmon index"
    " --transcripts {input.transcripts}"
    " --index {output.index}"
    " --threads {threads}"
    " --decoys {input.decoys}"
    " --keepDuplicates"
    " 1> {log.out}"
    " 2> {log.err}"
77
78
79
80
81
82
83
84
85
86
87
88
89
90
shell:
    "salmon alevin"
    " --libType ISR"
    " --index {input.index}"
    " --mates1 {input.mates1}"
    " --mates2 {input.mates2}"
    " {params.arg[protocol]}"
    " --output {output.output}"
    " --threads {threads}"
    " --tgMap {input.tgMap}"
    " --mrna {input.mrna}"
    " --rrna {input.rrna}"
    " 1> {log.out}"
    " 2> {log.err}"
21
22
script:
    "../scripts/singlecellexperiment_kallisto.R"
38
39
script:
    "../scripts/singlecellexperiment_salmon.R"
53
54
script:
    "../scripts/singlecellexperiment_star.R"
22
23
24
25
26
27
28
29
30
shell:
    "STAR"
    " --runMode genomeGenerate"
    " --runThreadN {threads}"
    " --genomeDir {output.dir}"
    " --genomeFastaFiles {input.fas}"
    " --sjdbGTFfile {input.gtf}"
    " --sjdbOverhang {params.arg[sjdbOverhang]}"
    " 2> {log}"
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
shell:
    "STAR"
    " --runMode alignReads"
    " --runThreadN {threads}"
    " --genomeDir {input.idx}"
    " --sjdbGTFfile {input.gtf}"
    " --readFilesIn {params.fq2} {params.fq1}"
    " --readFilesCommand gunzip -c"
    " --outFileNamePrefix {params.out}"
    " --outSAMtype BAM SortedByCoordinate"
    " --soloType CB_UMI_Simple"
    " --soloCBwhitelist {input.txt}"
    " --soloFeatures Gene Velocyto"
    " --soloCBstart {params.arg[soloCBstart]}"
    " --soloCBlen {params.arg[soloCBlen]}"
    " --soloUMIstart {params.arg[soloUMIstart]}"
    " --soloUMIlen {params.arg[soloUMIlen]}"
    " --soloBarcodeReadLength {params.arg[soloBarcodeReadLength]}"
    " 2> {log}"
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
main <- function(input, output, log) {

    # Log function

    out <- file(log$out, open = "wt")

    err <- file(log$err, open = "wt")

    sink(out, type = "output")

    sink(err, type = "message")

    # Script function

    library(GenomicFeatures)

    library(Rsamtools)

    grl <- readRDS(input$rds)

    con <- FaFile(input$fas)

    seq <- extractTranscriptSeqs(con, grl)

    writeXStringSet(seq, output$fas)

}

main(snakemake@input, snakemake@output, snakemake@log)
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
main <- function(input, output, log) {

    # Log function

    out <- file(log$out, open = "wt")

    err <- file(log$err, open = "wt")

    sink(out, type = "output")

    sink(err, type = "message")

    # Script function

    library(eisaR)

    grl <- readRDS(input$rds)

    exportToGtf(grl, filepath = output$gtf)

}

main(snakemake@input, snakemake@output, snakemake@log)
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
main <- function(input, output, params, log) {

    # Log function

    out <- file(log$out, open = "wt")

    err <- file(log$err, open = "wt")

    sink(out, type = "output")

    sink(err, type = "message")

    # Script function

    library(eisaR)

    len <- switch(params$chemistry, "10xv1" = 98, "10xv2" = 98, "10xv3" = 91)

    grl <- getFeatureRanges(
        gtf = input$gtf,
        featureType = c("spliced", "intron"),
        intronType = "separate",
        flankLength = len,
        joinOverlappingIntrons = FALSE,
        verbose = TRUE
    )

    saveRDS(grl, file = output$rds)

}

main(snakemake@input, snakemake@output, snakemake@params, snakemake@log)
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
main <- function(input, output, params, log) {

    # Log function

    out <- file(log$out, open = "wt")

    err <- file(log$err, open = "wt")

    sink(out, type = "output")

    sink(err, type = "message")

    # Script function

    library(GenomicRanges)

    grl <- readRDS(input$rds)

    write.table(
        x = metadata(grl)$corrgene,
        file = output$tsv,
        quote = FALSE,
        sep = "\t",
        row.names = FALSE
    )

}

main(snakemake@input, snakemake@output, snakemake@params, snakemake@log)
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
main <- function(input, output, params, log) {

    # Log function

    out <- file(log$out, open = "wt")

    err <- file(log$err, open = "wt")

    sink(out, type = "output")

    sink(err, type = "message")

    # Script function

    library(eisaR)

    grl <- readRDS(input$rds)

    getTx2Gene(grl, filepath = output$tsv)

}

main(snakemake@input, snakemake@output, snakemake@params, snakemake@log)
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
main <- function(input, params, log) {

    # Log function

    out <- file(log$out, open = "wt")

    err <- file(log$err, open = "wt")

    sink(out, type = "output")

    sink(err, type = "message")

    # Script function

    library(Biostrings)

    library(BUSpaRse)

    len <- switch(params$chemistry, "10xv1" = 98, "10xv2" = 98, "10xv3" = 91)

    dna <- readDNAStringSet(input$fas)

    names(dna) <- sapply(strsplit(names(dna), " "), .subset, 1)

    get_velocity_files(
        X = input$gtf,
        L = len,
        Genome = dna,
        out_path = params$out_path,
        style = params$style,
        transcript_version = NULL,
        gene_version = NULL
    )

}

main(snakemake@input, snakemake@params, snakemake@log)
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
main <- function(input, output, log) {

    # Log function

    out <- file(log$out, open = "wt")

    err <- file(log$err, open = "wt")

    sink(out, type = "output")

    sink(err, type = "message")

    # Script function

    library(BUSpaRse)

    spliced_mtx <- sub('\\.mtx$', '', input$mtx[1])

    unspliced_mtx <- sub('\\.mtx$', '', input$mtx[2])

    out <- read_velocity_output(
        spliced_dir = dirname(spliced_mtx),
        unspliced_dir = dirname(unspliced_mtx),
        spliced_name = basename(spliced_mtx),
        unspliced_name = basename(unspliced_mtx)
    )

    row <- intersect(rownames(out$spliced), rownames(out$unspliced))

    col <- intersect(colnames(out$spliced), colnames(out$unspliced))

    out$spliced <- out$spliced[row, col]

    out$unspliced <- out$unspliced[row, col]

    saveRDS(out, file = output$rds)

}

main(snakemake@input, snakemake@output, snakemake@log)
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
main <- function(input, output, log, wildcards) {

     # Log function

    out <- file(log$out, open = "wt")

    err <- file(log$err, open = "wt")

    sink(out, type = "output")

    sink(err, type = "message")

    # Script function

    library(SingleCellExperiment)

    mat <- readRDS(input$rds)

    ann <- read.delim(input$tsv, header = FALSE, col.names = c("id", "name"))

    dim <- list(i = rownames(mat$spliced), j = colnames(mat$spliced))

    ind <- match(dim$i, ann$id)

    ann <- ann[ind, ]

    sce <- SingleCellExperiment(

        assays = list(
            counts = mat$spliced,
            spliced = mat$spliced,
            unspliced = mat$unspliced
        ),

        rowData = DataFrame(
            ID = ann$id,
            Symbol = ann$name
        ),

        colData = DataFrame(
            Sample = wildcards$sample,
            Barcode = dim$j
        )
    )

    saveRDS(sce, file = output$rds)

}

main(snakemake@input, snakemake@output, snakemake@log, snakemake@wildcards)
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
main <- function(input, output, log, wildcards) {


    # Log function

    out <- file(log$out, open = "wt")

    err <- file(log$err, open = "wt")

    sink(out, type = "output")

    sink(err, type = "message")


    # Load Bioconductor packages

    library(tximport)

    library(tximeta)

    library(SummarizedExperiment)

    library(SingleCellExperiment)


    # Import salmon quantification

    con <- file.path(input$dir, "alevin", "quants_mat.gz")

    txi <- tximport(con, type = "alevin", dropInfReps = FALSE)

    rse <- SummarizedExperiment(assays = list(counts = round(txi$counts)))


    # Split by spliced and unspliced

    dat <- read.delim(input$tsv[1], header = TRUE, col.names = c("spliced", "unspliced"))

    rse <- splitSE(rse, dat, assayName = "counts")


    # Import and match annotation

    ann <- read.delim(input$tsv[2], header = FALSE, col.names = c("id", "name"))

    ann <- ann[match(rownames(rse), ann$id), ]


    # Create SCE object

    sce <- SingleCellExperiment(
        assays = list(
            counts = assay(rse, "spliced"),
            spliced = assay(rse, "spliced"),
            unspliced = assay(rse, "unspliced")
        ),
        rowData = DataFrame(
            ID = ann$id,
            Symbol = ann$name
        ),
        colData = DataFrame(
            Sample = wildcards$sample,
            Barcode = colnames(rse)
        )
    )


    # Save SCE object

    saveRDS(sce, file = output$rds)


}

main(snakemake@input, snakemake@output, snakemake@log, snakemake@wildcards)
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
main <- function(input, output, params, log, wildcards) {


    # Setup logging

    out <- file(log$out, open = "wt")

    err <- file(log$err, open = "wt")

    sink(out, type = "output")

    sink(err, type = "message")


    # Define directory

    dirname <- input$dir

    dirname <- file.path(dirname, "Solo.out", "Velocyto", "raw")


    # Read features

    names  <- c("id", "name", "type")

    features <- file.path(dirname, "features.tsv")

    features <- data.table::fread(features, header = FALSE, col.names = names)


    # Read barcodes

    names <- c("id")

    barcodes <- file.path(dirname, "barcodes.tsv")

    barcodes <- data.table::fread(barcodes, header = FALSE, col.names = names)

    # Read spliced matrix

    names <- c("feature", "barcode", "count")

    spliced <- file.path(dirname, "spliced.mtx")

    spliced <- data.table::fread(spliced, header = FALSE, skip = 2, col.names = names)


    # Read unspliced matrix

    names <- c("feature", "barcode", "count")

    unspliced <- file.path(dirname, "unspliced.mtx")

    unspliced <- data.table::fread(unspliced, header = FALSE, skip = 2, col.names = names)


    # Set dimensions

    shape <- c(nrow(features), nrow(barcodes))

    names <- list(features$id, barcodes$id)


    # Create object

    sce <- SingleCellExperiment::SingleCellExperiment(

        assays = list(

            counts = Matrix::sparseMatrix(
                i = spliced$feature,
                j = spliced$barcode,
                x = spliced$count,
                dims = shape,
                dimnames = names
            ),

            spliced = Matrix::sparseMatrix(
                i = spliced$feature,
                j = spliced$barcode,
                x = spliced$count,
                dims = shape,
                dimnames = names
            ),

            unspliced = Matrix::sparseMatrix(
                i = unspliced$feature,
                j = unspliced$barcode,
                x = unspliced$count,
                dims = shape,
                dimnames = names
            )
        ),

        rowData = S4Vectors::DataFrame(
            ID = features$id,
            Symbol = features$name
        ),

        colData = S4Vectors::DataFrame(
            Sample = wildcards$sample,
            Barcode = barcodes$id
        )

    )

    # Save object

    saveRDS(sce, file = output$rds)


}

main(snakemake@input, snakemake@output, snakemake@params, snakemake@log, snakemake@wildcards)
ShowHide 32 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/jma1991/scrnaseq
Name: scrnaseq
Version: 3
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...