scRNA-seq Workflow: STARsolo, Variant Calling, and Analysis

public public 1yr ago Version: 5 0 bookmarks

scRNAseq workflow

This workflow runs STARsolo with the appropriate parameters for the particular scRNA-seq technology. Currently, 'indrop_v2', '10x_v1', '10x_v2', '10x_v3' and 'cellseq192' are supported. Optionally, variants can be called suing the GATK RNA-seq workflow. Duplicates are identified using the UB tag for each cell barcode separately. Final variants are hard-filtered, annotated with SNPEff and passed through the R package, SNPRelate, for PCA, MDS and dendrograms.

How to use

  1. Download the pipeline using git clone [email protected]:vari-bbc/scRNAseq.git new_proj_dir_name or git clone https://github.com/vari-bbc/scRNAseq.git new_proj_dir_name depending on if you have an SSH key set up with GitHub or not, respectively.

  2. Put fastq files or symlinks into 'raw_data/'.

  3. Fill out 'samples.tsv':

    • sample Sample name; If more than one row has the same sample name, they will be merged.

    • fq1 R1 filename

    • fq2 R2 filename

    • RG Optional. If provided, read groups will not be inferred from fastq headers. Provide in the style specified for --outSAMattrRGline option in STAR. e.g. 'ID:zzz ”DS:z z”' or 'ID:yyy DS:yyyy'

  4. Fill out 'bin/config.yaml' to indicate the location of index files, the scRNA-seq technology etc. See config file comments for more details.

    For variant calling , set 'call_variants' to True. To variant call only a subset of the cell barcodes , specify only those barcodes in the 'sample_decoder' file. See config file for more info.

  5. Run qsub -q bbc bin/run_snakemake.sh .

Helpful commands

snakemake -l : Print all the rules and a description of what it does.

Code Snippets

18
19
20
21
# set the library path
#.libPaths("/secondary/projects/bbc/tools/kin_R_packages/R-3.6.0_20190701_pkges")

knitr::opts_chunk$set(echo=TRUE, warning=TRUE, message=TRUE, cache=TRUE, dev=c('png','pdf'), fig.width=8, fig.height=8)
27
28
29
30
31
32
library(readr)
library(stringr)
library(dplyr)
library(tibble)
library(Seurat)
library(Matrix)
38
39
40
41
42
43
44
45
46
47
48
49
# define function for reading from 10x-style output directory and renaming by Celseq 192 barcode number as listed in Sagar...Gruen paper
read10x_and_bc_num_as_colnames <- function(tenx_dir, decoder){
  counts <- Seurat::Read10X(tenx_dir)

  # check that barcode sequences match
  stopifnot(identical(sort(colnames(counts)), sort(decoder$bc)))

  # rename columns with bc_num
  colnames(counts) <- decoder$bc_num[match(colnames(counts), decoder$bc)]

  tibble::as_tibble(counts, rownames="Gene") %>% dplyr::select(Gene, as.character(1:192))
}
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
# get input, output, params from the Snakemake pipeline
out_file <- snakemake@output[[1]] %>% str_replace(".html", ".txt")
out_file
starsolo_raw_dir <- dirname(snakemake@input[[1]])
starsolo_raw_dir # print out starsolo dir
decoder <- snakemake@params[["decoder"]]
decoder

# read barcode decoder from file in snakePipes installation
bc_nums_decoder <- read_tsv(decoder, col_names = c("bc_num","bc"))

# read the counts and rename to BC number
counts <- read10x_and_bc_num_as_colnames(starsolo_raw_dir, bc_nums_decoder)
str(counts) # print out str()

# output the counts to files
write_tsv(counts, out_file)
77
sessionInfo()
19
20
21
# save start time for script
start_tm <- Sys.time()
start_tm
25
26
# set the library path
knitr::opts_chunk$set(echo=TRUE, warning=TRUE, message=TRUE, cache=FALSE, dev=c('pdf'), fig.width=8, fig.height=8)
32
33
34
35
36
library(SNPRelate)
library(ggplot2)
library(dplyr)
library(tibble)
library(ggrepel)
45
46
47
48
49
50
51
52
53
54
55
56
57
#outdir <- "./snprelate_out"
#if(!outdir %in% list.dirs()) dir.create(outdir, recursive = TRUE)
vcf.fn <- snakemake@input[[1]]
# convert to GDS
gds_file <- snakemake@params[["gds"]]
snpgdsVCF2GDS(vcf.fn, gds_file, method="biallelic.only")

# Figures
old_figs_folder <- snakemake@params[["figures_dir"]]
new.folder <- snakemake@params[["new_figures_dir"]]

# Set maximum missing rate
missingrate <- 0.05
63
genofile <- snpgdsOpen(gds_file)
73
74
75
76
77
set.seed(1000)
# Try different LD thresholds for sensitivity analysis
snpset <- snpgdsLDpruning(genofile, ld.threshold=0.5, missing.rate=missingrate) # using LD threshold default used in SNPhylo
# Get all selected snp id
snpset.id <- unlist(snpset)
85
86
87
88
89
90
set.seed(100)
ibs <- snpgdsIBS(genofile, num.thread=1, autosome.only=TRUE, missing.rate=missingrate, snp.id = snpset.id)
# we find 1- ibs so that the values become 'distances' where higher values represent more dissimilar pairs of samples
one_minus_ibs <- 1 - ibs$ibs
colnames(one_minus_ibs) <- ibs$sample.id
rownames(one_minus_ibs) <- ibs$sample.id
 96
 97
 98
 99
100
101
102
103
104
loc <- cmdscale(one_minus_ibs, k = 2)
#x <- loc[, 1]; y <- loc[, 2]
df <- tibble::as_tibble(loc, rownames="sample.id") %>%
  dplyr::rename(Dimension1=V1, Dimension2=V2)
#df$sample.id <- ibs$sample.id
# plot(x, y, xlab = "", ylab = "",
#     main = "Multidimensional Scaling Analysis (IBS)")
ggplot(df, aes(x=Dimension1, y=Dimension2, label=sample.id)) + 
  geom_point() + ggrepel::geom_label_repel() + ggtitle("MDS (based on IBS)")
112
113
ibs.hc <- snpgdsHCluster(ibs)
rv <- snpgdsCutTree(ibs.hc)
117
plot(rv$dendrogram, main="Dendrogram based on IBS")
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
#pca <- snpgdsPCA(genofile, snp.id=snpset.id, num.thread=1)
pca <- snpgdsPCA(genofile, num.thread=1, autosome.only=TRUE, missing.rate=missingrate, snp.id = snpset.id)
# variance proportion (%)
pc.percent <- pca$varprop*100
head(round(pc.percent, 2))
tab <- data.frame(sample.id = pca$sample.id,
    EV1 = pca$eigenvect[,1],    # the first eigenvector
    EV2 = pca$eigenvect[,2],    # the second eigenvector
    EV3 = pca$eigenvect[,3],
    EV4 = pca$eigenvect[,4],
    EV5 = pca$eigenvect[,5],
    stringsAsFactors = FALSE)
head(tab)
#plot(tab$EV1, tab$EV2, xlab="eigenvector 1", ylab="eigenvector 2")
ggplot(tab, aes(x=EV1, y=EV2, label=sample.id)) + geom_point() + ggrepel::geom_label_repel() + ggtitle("PC1 and 2")
ggplot(tab, aes(x=EV3, y=EV4, label=sample.id)) + geom_point() + ggrepel::geom_label_repel() + ggtitle("PC3 and 4")
150
151
152
153
154
155
set.seed(100)
ibs <- snpgdsIBS(genofile, num.thread=1, autosome.only=TRUE, missing.rate=missingrate)
# we find 1- ibs so that the values become 'distances' where higher values represent more dissimilar pairs of samples
one_minus_ibs <- 1 - ibs$ibs
colnames(one_minus_ibs) <- ibs$sample.id
rownames(one_minus_ibs) <- ibs$sample.id
161
162
163
164
165
loc <- cmdscale(one_minus_ibs, k = 2)
df <- tibble::as_tibble(loc, rownames="sample.id") %>%
  dplyr::rename(Dimension1=V1, Dimension2=V2)
ggplot(df, aes(x=Dimension1, y=Dimension2, label=sample.id)) + 
  geom_point() + ggrepel::geom_label_repel() + ggtitle("MDS (based on IBS)")
173
174
ibs.hc <- snpgdsHCluster(ibs)
rv <- snpgdsCutTree(ibs.hc)
178
plot(rv$dendrogram, main="Dendrogram based on IBS")
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
#pca <- snpgdsPCA(genofile, snp.id=snpset.id, num.thread=1)
pca <- snpgdsPCA(genofile, num.thread=1, autosome.only=TRUE, missing.rate=missingrate)
# variance proportion (%)
pc.percent <- pca$varprop*100
head(round(pc.percent, 2))
tab <- data.frame(sample.id = pca$sample.id,
    EV1 = pca$eigenvect[,1],    # the first eigenvector
    EV2 = pca$eigenvect[,2],    # the second eigenvector
    EV3 = pca$eigenvect[,3],
    EV4 = pca$eigenvect[,4],
    EV5 = pca$eigenvect[,5],
    stringsAsFactors = FALSE)
head(tab)
#plot(tab$EV1, tab$EV2, xlab="eigenvector 1", ylab="eigenvector 2")
ggplot(tab, aes(x=EV1, y=EV2, label=sample.id)) + geom_point() + ggrepel::geom_label_repel() + ggtitle("PC1 and 2")
ggplot(tab, aes(x=EV3, y=EV4, label=sample.id)) + geom_point() + ggrepel::geom_label_repel() + ggtitle("PC3 and 4")
206
closefn.gds(genofile)
211
212
213
dir.create(new.folder, recursive = TRUE)

file.copy(list.files(old_figs_folder, full.names = TRUE), new.folder, recursive = TRUE)
220
sessionInfo()
226
227
228
229
# output time taken to run script
end_tm <- Sys.time()
end_tm
end_tm - start_tm
111
112
113
114
shell:
    """
    fastqc --outdir {params.outdir} {input}
    """
136
137
138
139
shell:
    """
    fastq_screen --threads {threads} --outdir analysis/fastq_screen/ {input}
    """
SnakeMake From line 136 of main/Snakefile
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
shell:
   """
   STAR  \
   --runThreadN {threads} \
   --limitBAMsortRAM 137438953472 \
   --genomeDir {params.index} \
   --readFilesIn {params.fqs_and_rg[fqs]} \
   --outSAMattrRGline {params.fqs_and_rg[RG]} \
   --outSAMattributes NH HI AS nM CB UB \
   --readFilesCommand zcat  \
   --outSAMtype BAM SortedByCoordinate \
   --outFileNamePrefix {params.outprefix} \
   {params.tech_params}

   samtools index {params.outprefix}Aligned.sortedByCoord.out.bam

   pigz -p {threads} {params.outprefix}Solo.out/Gene/raw/*
   pigz -p {threads} {params.outprefix}Solo.out/Gene/filtered/*
   """
318
319
320
321
shell:
   """
   gunzip -c {input} > {output}
   """
SnakeMake From line 318 of main/Snakefile
341
342
script:
    "scripts/read_star_solo_raw.Rmd"
SnakeMake From line 341 of main/Snakefile
366
367
368
369
shell:
    """
    bamtools split -in {input} -tag CB -stub {params.out_pref}
    """
391
392
393
394
shell:
    """
    samtools reheader -c 'perl -pe "s/^(@SQ.*)(\\tSM:\S+)/\$1\$2.{wildcards.CB}/"' {params.input} 1> {output} 2> {log.stderr}
    """
419
420
421
422
423
424
425
426
427
428
shell:
    """
    java -Xms16g -Xmx{resources.mem_gb}g -Djava.io.tmpdir=./tmp -jar $PICARD MarkDuplicates \
    --INPUT {input} \
    --BARCODE_TAG 'UB' \
    --OUTPUT {output.bam} \
    --CREATE_INDEX true \
    --VALIDATION_STRINGENCY SILENT \
    --METRICS_FILE {output.metrics}
    """
SnakeMake From line 419 of main/Snakefile
450
451
452
453
454
455
456
457
shell:
    """
    gatk --java-options "-Xms8g -Xmx{resources.mem_gb}g -Djava.io.tmpdir=./tmp" \
    SplitNCigarReads \
    -R {params.ref_fasta} \
    -I {input} \
    -O {output} 
    """
480
481
482
483
484
485
486
487
488
shell:
    """
    gatk --java-options "-Xms8g -Xmx{resources.mem_gb}g -XX:+UseParallelGC -XX:ParallelGCThreads={threads} -Djava.io.tmpdir=./tmp" \
        BaseRecalibrator \
        -R {params.ref_fasta} \
        -I {input} \
        -O {output} \
        {params.known_variants}
    """
511
512
513
514
515
516
517
518
519
520
shell:
    """
    gatk --java-options "-Xms8g -Xmx{resources.mem_gb}g  -XX:+UseParallelGC -XX:ParallelGCThreads={threads} -Djava.io.tmpdir=./tmp" \
        ApplyBQSR \
        --add-output-sam-program-record \
        -R {params.ref_fasta} \
        -I {input.bam} \
        -O {output} \
        --bqsr-recal-file {input.recal_table}
    """
544
545
546
547
548
549
550
551
552
553
554
555
556
shell:
    """
    gatk --java-options "-Xms8g -Xmx{resources.mem_gb}g -Djava.io.tmpdir=./tmp" \
    HaplotypeCaller \
    -R {params.ref_fasta} \
    -I {input.bam} \
    -O {output} \
    -ERC GVCF \
    -dont-use-soft-clipped-bases \
    --native-pair-hmm-threads {threads} \
    --standard-min-confidence-threshold-for-calling 20 \
    {params.contigs} 
    """
607
608
609
610
611
612
613
614
shell:
    """
    gatk --java-options "-Xms8g -Xmx{resources.mem_gb}g -Djava.io.tmpdir=./tmp" \
    GenomicsDBImport \
    {params.sample_gvcfs} \
    --genomicsdb-workspace-path {output.genomicsdb} \
    {params.contigs} 
    """
637
638
639
640
641
642
643
644
645
shell:
    """
    gatk --java-options "-Xms8g -Xmx{resources.mem_gb}g -Djava.io.tmpdir=./tmp" \
    GenotypeGVCFs \
    -R {params.ref_fasta} \
    -V gendb://{params.genomicsdb} \
    -O {output.vcf}

    """
667
668
669
670
671
672
673
674
675
shell:
    """
    gatk --java-options "-Xms8g -Xmx{resources.mem_gb}g -Djava.io.tmpdir=./tmp" \
    SortVcf \
    -I {input.vcf} \
    -O {output.sorted_vcf} \
    -SD {params.dictionary} 

    """
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
shell:
    """
    gatk --java-options "-Xms8g -Xmx{resources.mem_gb}g -Djava.io.tmpdir=./tmp" \
    MergeVcfs \
    {params.in_vcfs} \
    --SEQUENCE_DICTIONARY {params.dictionary} \
    --OUTPUT {output.raw} 

    echo "mergeVcfs done." >> {log.stdout}
    echo "mergeVcfs done." >> {log.stderr}

    vt peek -r {params.ref_fasta} {output.raw} 2> {output.vt_peek_raw} 1>>{log.stdout}

    gatk --java-options "-Xms8g -Xmx{resources.mem_gb}g -Djava.io.tmpdir=./tmp" \
    VariantFiltration \
    --R {params.ref_fasta} \
    --V {output.raw} \
    --window 35 \
    --cluster 3 \
    --filter-name "FS" \
    --filter "FS > 30.0" \
    --filter-name "QD" \
    --filter "QD < 2.0" \
    --genotype-filter-name "GQ" \
    --genotype-filter-expression "GQ < 15.0" \
    --genotype-filter-name "DP" \
    --genotype-filter-expression "DP < 10.0" \
    -O {output.filt} 

    echo "VariantFiltration done." >> {log.stdout}
    echo "VariantFiltration done." >> {log.stderr}

    gatk --java-options "-Xms8g -Xmx{resources.mem_gb}g -Djava.io.tmpdir=./tmp" \
    SelectVariants \
    -R {params.ref_fasta} \
    -V {output.filt} \
    --exclude-filtered \
    --set-filtered-gt-to-nocall \
    -O {output.pass_only} 

    echo "SelectVariants done." >> {log.stdout}
    echo "SelectVariants done." >> {log.stderr}

    vt peek -r {params.ref_fasta} {output.pass_only} 2> {output.vt_peek_pass} 1>>{log.stdout}
    """
770
771
772
773
774
shell:
    """
    bcftools reheader --threads {threads} -s {params.sample_decoder} -o {output} {input.vcf}
    bcftools index --threads {threads} -t {output}
    """
800
801
802
803
804
805
806
807
808
809
810
shell:
    """
    gatk --java-options "-Xms8g -Xmx{resources.mem_gb}g -Djava.io.tmpdir=./tmp" VariantsToTable -V {input} \
            -R {params.ref_fasta} --sequence-dictionary {params.dictionary} \
            -F CHROM -F POS -F REF -F ALT -F TYPE -F ANN -GF AD -O {output.raw}

    head -n1 {output.raw} | perl -npe 's/\\tANN\\t/\\tGENE\\t/; s/\.AD(?=[\t\n])//g' > {output.parsed}

    # Parse the comma-separated ANN column (column 5, 0-based counting). Each comma-separated element contains a '|' separated string; We extract fields 3 and 4, 0-based for the gene names.
    tail -n+2 {output.raw} | perl -F'\\t' -lane 'my %hash; $hash{{join("|",(split(/\|/,$_))[3..4])}}++ foreach split(",", $F[5]); $F[5] = join(",", sort(keys(%hash))); print join("\\t", @F)' >> {output.parsed}
    """
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
shell:
    """
    # Only use canonical transcript of each gene
    java -Xms8g -Xmx{resources.mem_gb}g -Djava.io.tmpdir=./tmp -jar $SNPEFF/snpEff.jar eff \
    -v \
    -canon \
    -stats {output.html_canon} \
    {params.db_id} \
    {input} | \
    bgzip > {output.vcf_canon}

    tabix {output.vcf_canon}

    # 'default' settings
    java -Xms8g -Xmx{resources.mem_gb}g -Djava.io.tmpdir=./tmp -jar $SNPEFF/snpEff.jar eff \
    -v \
    -stats {output.html} \
    {params.db_id} \
    {input} | \
    bgzip > {output.vcf}

    tabix {output.vcf}

    # Only annotate variants that overlap genes
    java -Xms8g -Xmx{resources.mem_gb}g -Djava.io.tmpdir=./tmp -jar $SNPEFF/snpEff.jar eff \
    -v \
    -no-downstream \
    -no-intergenic \
    -no-upstream \
    -stats {output.html_inGene} \
    {params.db_id} \
    {input} | \
    bgzip > {output.vcf_inGene}

    tabix {output.vcf_inGene}

    """
899
900
script:
    "scripts/snprelate.Rmd"
SnakeMake From line 899 of main/Snakefile
923
924
925
926
927
928
929
930
931
932
933
shell:
    """
    gatk --java-options "-Xms8g -Xmx{resources.mem_gb}g -Djava.io.tmpdir=./tmp" \
            ASEReadCounter \
            -R {params.ref_fasta} \
            -I {input} \
            -V {params.vcf} \
            --min-base-quality 20 \
            --min-mapping-quality 30 \
            -O {output}
    """
ShowHide 37 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/vari-bbc/scRNAseq
Name: scrnaseq
Version: 5
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: GNU General Public License v3.0
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...