snakemake workflow for post-processing scATACseq data


Cite if you use it :)

What does it do?

For a single-cell ATAC-seq experiment, one gets a merged bam file containing all cells. After clustering, similar cells are grouped into cell types (cell states). This workflow splits the bam by cluster to create a pseudo-bulk bam for each cluster, creates bigwig tracks for visualization, calls peaks for each cluster, and merges the peaks across clusters. Finally, it counts reads per peak per cell from the original bam file over the merged peak set.

Thanks to Brent Pedersen for rcbbc (Read Count By BarCode) in the scripts folder. This script is broadly useful: given a bam file with a CB tag for each cell, it counts the number of reads per cell in regions specified in bed format. The code is written in nim and is very fast: for over 1 million regions and over 8000 cells in a ~64G bam, it took ~40 minutes.

./rcbbc -h
rcbbc
read-count by barcode
Usage:
 rcbbc [options] bed bam
Arguments:
 bed
 bam
Options:
 -f, --fasta=FASTA path to fasta. required only for CRAM
 -e, --exclude=EXCLUDE exclude alignments with any of these bits set (default: 1796)
 -m, --min-mapping-quality=MIN_MAPPING_QUALITY
 exclude alignments with a mapping-quality below this (default: 1)
 -w, --white-list=WHITE_LIST
 line-delimited list of barcodes to include. default is to include any seen in the bam
 -t, --tag=TAG tag on which to partition read-counts (default: CB)
 -p, --prefix=PREFIX output prefix for files written by rcbbc (default: rcbbc)
 -h, --help Show this help
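
For illustration, here is a minimal pysam sketch of the same counting logic (a hypothetical, much slower Python rendering of what rcbbc does; the bed and bam file names are made up, and the bam must be coordinate-sorted and indexed):

import csv
from collections import defaultdict

import pysam

EXCLUDE_FLAG = 1796  # rcbbc's default: unmapped, secondary, QC-fail, duplicate

def count_by_barcode(bed_path, bam_path, min_mapq=1, tag="CB"):
    # region -> cell barcode -> read count
    counts = defaultdict(lambda: defaultdict(int))
    with pysam.AlignmentFile(bam_path, "rb") as bam, open(bed_path) as bed:
        for row in csv.reader(bed, delimiter="\t"):
            if len(row) < 3:
                continue
            chrom, start, end = row[0], int(row[1]), int(row[2])
            region = "{}:{}-{}".format(chrom, start, end)
            for read in bam.fetch(chrom, start, end):
                if read.flag & EXCLUDE_FLAG or read.mapping_quality < min_mapq:
                    continue
                if read.has_tag(tag):
                    counts[region][read.get_tag(tag)] += 1
    return counts

# hypothetical file names
counts = count_by_barcode("merged_peaks.bed", "possorted_bam.bam")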

In the future, peak-calling software should be barcode-aware, so that one would not need to split the bam file by cluster. But for now, this works for me.

Workflow of the pipeline

Dependencies

Judging from the rules and scripts below: snakemake, samtools, bedtools, deepTools (bamCoverage), Genrich, python with pysam, and R with docopt, Seurat, Matrix, presto, and tidyverse. rcbbc is bundled in the scripts folder.

How to use this pipeline

  1. Clone the repo:
git clone https://github.com/crazyhottommy/pyflow-scATACseq
cd pyflow-scATACseq
  2. Prepare a tab-delimited tsv file named meta.tsv, with the header below, inside the pyflow-scATACseq folder:
sample bam_path cluster_id.csv white_list
sample1 /path/to/sample1.bam /path/to/sample1.csv /path/to/white_list_sample1.txt
sample2 /path/to/sample2.bam /path/to/sample2.csv /path/to/white_list_sample2.txt
  • The first column is the sample name.

  • The second column is the path to the possorted_bam.bam produced by 10x cellranger-atac.

  • The cluster_id.csv is a two-column csv file with a header (the header names do not matter). The first column is the cell barcode; the second column is the cluster id (numeric or string).

Make sure the cluster_id.csv file is named exactly <sample>.csv: e.g., if the sample column is sample1, the cluster information csv file must be sample1.csv (a validation sketch follows the white-list example below).

cat example_cluster.csv
name,value
AAACGAAAGCACCATT-1,32
AAACGAAAGTAGTCGG-1,4
AAACGAAGTAACGGCA-1,31
AAACTCGAGAACAGGA-1,17
AAACTCGAGTCCAGAG-1,1
AAACTCGCAAAGGAAG-1,41
AAACTCGCAAGGCGTA-1,5
AAACTCGCAGAAAGAG-1,4
AAACTCGCAGTAGGCA-1,24
  • The white_list is a one-column txt file with the valid barcodes you want to include.

You can create it by:

cat example_cluster.csv | cut -f1 -d, | sed '1d' > example_white_list.txt
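
Here is a minimal sketch (not part of the repo) that checks the meta.tsv paths and the <sample>.csv naming convention described above, anticipating the first to-do item below:

import csv
import os.path
import sys

def validate_meta(meta_path="meta.tsv"):
    with open(meta_path) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            sample = row["sample"]
            # all three paths must exist
            for col in ("bam_path", "cluster_id.csv", "white_list"):
                if not os.path.exists(row[col]):
                    sys.exit("{}: {} not found: {}".format(sample, col, row[col]))
            # the cluster csv must be named exactly <sample>.csv
            csv_name = os.path.basename(row["cluster_id.csv"])
            if csv_name != sample + ".csv":
                sys.exit("{}: cluster csv must be named {}.csv, got {}".format(
                    sample, sample, csv_name))

validate_meta()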
  3. Create a samples.json file:
python sample2json.py meta.tsv
# a samples.json file will be created
cat samples.json
{
 "sample1": [
 "/path/to/sample1.bam",
 "/path/to/sample1.csv",
 "/path/to/white_list_sample1.txt"
 ],
 "sample2": [
 "/path/to/sample2.bam",
 "/path/to/sample2.csv",
 "/path/to/white_list_sample2.txt"
 ]
}
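
The real sample2json.py lives in the repo; a minimal sketch of what it does, assuming the meta.tsv layout above:

import csv
import json
import sys

def meta_to_json(meta_path, out_path="samples.json"):
    samples = {}
    with open(meta_path) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            # sample -> [bam, cluster csv, white list], matching samples.json above
            samples[row["sample"]] = [row["bam_path"], row["cluster_id.csv"], row["white_list"]]
    with open(out_path, "w") as out:
        json.dump(samples, out, indent=4)

meta_to_json(sys.argv[1])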
  4. Run the workflow:
## dry run
snakemake -np
## real run on the local machine
snakemake -j 10
## if you want to use a singularity image that contains presto for the differential accessibility test
snakemake -j 10 --use-singularity
## submit to slurm (adapt this script for your own HPC)
./pyflow-scATACseq.sh

To-do list

  • [ ] Check that file paths exist and validate the file formats.

  • [ ] Have a conda env set up for the Snakefile.

  • [ ] Have a docker container.

  • [ ] Motif analysis.

  • [ ] Find differential peaks using presto.

  • [ ] Associate peaks with genes using Cicero.

  • [ ] Gene set enrichment.

Other Notes

More thoughts on pileup and shifting reads: https://github.com/deeptools/deepTools/issues/453 (maxFragmentLength)

https://github.com/deeptools/deepTools/issues/370

https://groups.google.com/forum/#!topic/deeptools/JU9itiT5rYk

You probably want "--Offset 1"

The above is not exactly shifting; for real read shifting use alignmentSieve (https://deeptools.readthedocs.io/en/develop/content/tools/alignmentSieve.html) with --ATACshift.

filter out fragments?
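
For reference, the usual ATAC-seq convention is to shift plus-strand reads +4 bp and minus-strand reads -5 bp to center them on the Tn5 cut site, which is what alignmentSieve --ATACshift implements for fragments. A rough pysam sketch of the per-read version (not part of this workflow; file names are made up, and the output would need re-sorting and indexing):

import pysam

with pysam.AlignmentFile("in.bam", "rb") as fin, \
     pysam.AlignmentFile("shifted.bam", "wb", template=fin) as fout:
    for read in fin:
        if read.is_unmapped:
            continue
        # +4 on the plus strand, -5 on the minus strand (Tn5 insertion offset)
        if read.is_reverse:
            read.reference_start = max(read.reference_start - 5, 0)
        else:
            read.reference_start += 4
        fout.write(read)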

Code Snippets

scripts/presto.R (lines 2–119):
'Find differentially accessible regions from one or more sparse matrices.
The number of mtx sparse matrices must match the number of cluster information csv
files, and each mtx/csv pair must share the same basename, e.g. sample1.mtx and sample1.csv.
Usage:
    presto.R --mtx=<mtx>... --cluster=<cluster>... <output>
    presto.R --mtx=sample1.mtx --cluster=sample1.csv sample1.txt
    presto.R --mtx=sample1.mtx --mtx=sample2.mtx --cluster=sample1.csv --cluster=sample2.csv sample.txt

Options:
    --mtx=<mtx>...  input file path of the sparse matrix
    --cluster=<cluster>...  input file path of the cluster information. a csv file with a header. column 1 is the
             cell barcode, column 2 is the cluster id.
    -h --help  Show this screen.
    -v --version  Show version.
Arguments:
    output   output file path of the differential peaks
' -> doc

suppressMessages(library(docopt))
suppressMessages(library(Seurat))
suppressMessages(library(Matrix))
suppressMessages(library(presto))
suppressMessages(library(tidyverse))

arguments <- docopt(doc, version = 'presto.R v1.0\n\n')
num_of_matrix<- length(arguments$mtx)
num_of_csv<- length(arguments$cluster)

if (num_of_matrix != num_of_csv){
        stop("the number of matrices must be the same as the number
             of cluster information csv files")
}


check_basename<- function(mtx, csv){
        return(gsub(".mtx", "", basename(mtx)) == gsub(".csv", "", basename(csv)))
}

if (!all(purrr::map2_lgl(arguments$mtx, arguments$cluster, check_basename))){
        stop("the basename of the mtx and the cluster info csv file do not match")
} 

read_mtx<- function(mtx){
        recount<- readMM(mtx)
        peaks<- read_tsv(paste0(mtx, ".sites"), col_names = FALSE)
        cells<- read_tsv(paste0(mtx, ".cells"), col_names = FALSE)
        prefix<- gsub(".mtx", "", basename(mtx))
        cells<- paste0(prefix, "_", cells$X1)
        rownames(recount)<- peaks$X1
        colnames(recount)<- cells
        return(recount)
}

message(paste("reading in the sparse matrix:", arguments$mtx, "\n"))
mtxs<- purrr::map(arguments$mtx, read_mtx)

if (length(unique(purrr::map(mtxs, rownames))) != 1){
        stop("all matrices must have the same row/peak names")
}


recount.mat<- purrr::reduce(mtxs, cbind)

makeSeuratObject<- function(recount.mat){
        recount.TFIDF.mat<- TF.IDF(recount.mat)
        recount.TFIDF.mat <- Seurat:::LogNorm(data = recount.TFIDF.mat, scale_factor = 1e4)
        rownames(recount.TFIDF.mat)<- rownames(recount.mat)
        colnames(recount.TFIDF.mat)<- colnames(recount.mat)
        recount.atac<- CreateSeuratObject(counts = recount.TFIDF.mat, assay = 'ATAC', project = "recount")
        return(recount.atac)
}



### transferred label information
message(paste("reading in the cluster information csv files:", arguments$cluster, "\n"))
read_id<- function(csv){
        id<- read_csv(csv)
        prefix<- gsub(".csv", "", basename(csv))
        id<- id %>% 
                mutate(cell = paste0(prefix, "_", name)) %>%
                dplyr::rename(cluster_id = value)
        return(id)
}

cluster_ids<- purrr::map(arguments$cluster, read_id)
cluster_ids<- bind_rows(cluster_ids)

message("making Seurat ATAC objects")
recount.atac<- makeSeuratObject(recount.mat)
recount.atac[["cell"]]<- rownames(recount.atac@meta.data)
old_meta<- recount.atac@meta.data
new_meta<- left_join(recount.atac@meta.data, cluster_ids) 

if(!identical(rownames(old_meta), new_meta$cell)){
       stop("The names of the cells differ")
}

rownames(new_meta)<- rownames(old_meta)
recount.atac@meta.data<- new_meta

#library(devtools)
#install_github('immunogenomics/presto')
# the counts slot here holds the TF-IDF normalized and log-transformed values
# see how long it takes for 1.3 million features

message("Finding differential accessible regions by presto")
start_time <- Sys.time()
res<- wilcoxauc(recount.atac, 'cluster_id', seurat_assay = 'ATAC', assay = "counts")
end_time <- Sys.time()

message("presto takes")
end_time - start_time

#Time difference of 2.933884 mins !! impressive

message(paste("writing differential peaks to", arguments$output, "\n"))
write_tsv(res %>% arrange(padj, desc(auc)), arguments$output)
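
presto.R delegates the normalization to Seurat's TF.IDF followed by LogNorm; for intuition, here is a sketch of the common Signac/Seurat-style TF-IDF transform in Python (an illustration of the idea, not the script's exact math):

import numpy as np
from scipy import sparse

def tf_idf(counts, scale=1e4):
    # counts: peaks x cells sparse count matrix
    counts = sparse.csr_matrix(counts)
    per_cell = np.maximum(np.asarray(counts.sum(axis=0)).ravel(), 1)        # total counts per cell
    per_peak = np.maximum(np.asarray((counts > 0).sum(axis=1)).ravel(), 1)  # cells in which each peak is open
    tf = counts.multiply(1.0 / per_cell)              # term frequency, per cell
    idf = counts.shape[1] / per_peak                  # inverse document frequency, per peak
    tfidf = tf.multiply(idf.reshape(-1, 1)).tocsr()
    tfidf.data = np.log1p(tfidf.data * scale)         # log-normalize, as LogNorm does
    return tfidf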
scripts/split_scATAC_bam_by_cluster.py (lines 3–83):
import pysam
import csv
import argparse
import os.path
import sys

parser = argparse.ArgumentParser()
parser.add_argument("csv", help="Required. the FULL path to the cluster csv file with header, \
    first column is the cell barcode, second column is the cluster id")
parser.add_argument("bam", help="Required. the FULL path to the 10x scATAC bam file generated \
    by cellranger-atac count")
parser.add_argument("-prefix", help="Optional, the prefix of the output bam, default is cluster_id.bam")
parser.add_argument("-outdir", help="Optional, the output directory for the splitted bams, default is current dir")
args = parser.parse_args()


if not os.path.exists(args.csv):
    print("csv file not found")
    sys.exit(1)

if not os.path.exists(args.bam):
    print("10x scATAC bam not found")
    sys.exit(1)

if args.outdir and not os.path.isdir(args.outdir):
    try:
        os.mkdir(args.outdir)
    except OSError:
        print("cannot create directory {}".format(args.outdir))
        sys.exit(1)

cluster_dict = {}
with open(args.csv) as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    #skip header
    header = next(csv_reader)
    for row in csv_reader:
        cluster_dict[row[0]] = row[1]

clusters = set(cluster_dict.values())


fin = pysam.AlignmentFile(args.bam, "rb")

# open one output bam per cluster and map each cluster id to its output file handle
fouts_dict = {}
for cluster in clusters:
    if args.prefix:
        fout_name = args.prefix + "_" + cluster + ".bam"
    else:
        fout_name = cluster + ".bam"
    if args.outdir:
        fout = pysam.AlignmentFile(os.path.join(args.outdir,fout_name), "wb", template = fin)
    else:
        fout = pysam.AlignmentFile(fout_name, "wb", template = fin)
    fouts_dict[cluster] = fout

for read in fin:
    # CB is the cell barcode tag added by cellranger-atac; skip reads without it
    if not read.has_tag("CB"):
        continue
    cell_barcode = read.get_tag("CB")
    # the bam files may contain reads not in the final clustered barcodes
    # will be None if the barcode is not in the clusters.csv file
    cluster_id = cluster_dict.get(cell_barcode)
    if cluster_id:
        fouts_dict[cluster_id].write(read)

## do not forget to close the files
fin.close()
for fout in fouts_dict.values():
    fout.close()
Snakefile lines 112–116 (split the bam by cluster):
shell:
    """
    scripts/split_scATAC_bam_by_cluster.py -prefix {wildcards.sample} -outdir 01split_bam/{wildcards.sample} {input[1]} {input[0]}
    touch {output}
    """
Snakefile lines 122–125 (index the per-cluster bams):
shell:
    """
    samtools index 01split_bam/{wildcards.sample}/{wildcards.sample}_{wildcards.cluster_id}.bam
    """
Snakefile lines 137–140 (bigwig track per cluster):
shell:
    """
    bamCoverage -b 01split_bam/{wildcards.sample}/{wildcards.sample}_{wildcards.cluster_id}.bam -p {threads} {params.custom} -o {output}
    """
Snakefile lines 168–171 (merge the per-cluster bams):
shell:
    """
    samtools merge -@ 2 -r {output} {params.bam}
    """
Snakefile lines 176–179 (index the merged bam):
shell:
    """
    samtools index {input}
    """
Snakefile lines 189–192 (bigwig for the merged bam):
shell:
    """
    bamCoverage -b {input[0]} -p {threads} {params.custom} -o {output}
    """
Snakefile lines 203–208 (name-sort the merged bam):
shell:
    """
    samtools sort -n -m 2G -@ {threads} -T {wildcards.cluster_id} \
    -o {output} \
    {input[0]} 2> {log}
    """
Snakefile lines 217–220 (call peaks with Genrich):
shell:
   """
   Genrich -t {input} -o {output} {params.custom} 2> {log}
   """
Snakefile lines 228–232 (remove peaks overlapping excluded regions):
shell:
    """
    bedtools intersect -a {input[0]} -b <(zcat {input[1]}) -v > {output}

    """
Snakefile lines 240–245 (extract and extend peak summits):
shell:
    """
    # column 10 of the narrowPeak output is the summit offset from the peak start;
    # braces are doubled to escape Snakemake's formatting
    cat {input} | awk -v OFS="\t" '{{$2=$2+$10; $3=$2+1; print}}' > {output.summit}
    bedtools slop -i {output.summit} -g {config[genome_size]} -b {config[extend_size]} > {output.extend_summit}
    """
Snakefile lines 263–268 (name-sort each per-sample cluster bam):
shell:
    """
    samtools sort -n -m 2G -@ {threads} -T {wildcards.sample}_{wildcards.cluster_id} \
    -o {output} \
    01split_bam/{wildcards.sample}/{wildcards.sample}_{wildcards.cluster_id}.bam 2> {log}
    """
Snakefile lines 279–282 (call peaks with Genrich, per sample):
shell:
   """
   Genrich -t {input} -o {output} {params.custom} 2> {log}
   """
Snakefile lines 290–294 (remove peaks overlapping excluded regions, per sample):
shell:
    """
    bedtools intersect -a {input[0]} -b <(zcat {input[1]}) -v > {output}

    """
Snakefile lines 302–307 (extract and extend peak summits, per sample):
shell:
    """
    # column 10 of the narrowPeak output is the summit offset from the peak start;
    # braces are doubled to escape Snakemake's formatting
    cat {input} | awk -v OFS="\t" '{{$2=$2+$10; $3=$2+1; print}}' > {output.summit}
    bedtools slop -i {output.summit} -g {config[genome_size]} -b {config[extend_size]} > {output.extend_summit}
    """
Snakefile lines 325–330 (natural-sort and merge peaks across clusters):
shell:
    """
    # get rid of unconventional chromosomes and natural sort the chr (chr10 after chr9)
    cat {input} | sort -k1,1V -k2,2n | bedtools merge | grep -v "_"  > {output}

    """
Snakefile lines 340–343 (count reads per peak per cell with rcbbc):
shell:
    """
    scripts/rcbbc -w {params.white_list} -p 07recount/{wildcards.sample}/{wildcards.sample} {input.bed} {input.bam} 
    """
Snakefile lines 355–359 (natural-sort and merge peaks):
shell:
    """
    # remove unconventional chromosomes and natural sort by -V
    cat {input} | sort -k1,1V -k2,2n | bedtools merge | grep -v "_"  > {output}
    """
Snakefile lines 370–373 (count reads per peak per cell with rcbbc, all samples):
shell:
    """
    scripts/rcbbc -w {params.white_list} -p 07recount_all/{wildcards.sample}/{wildcards.sample} {input.bed} {input.bam} 
    """
Snakefile lines 388–391 (differential peaks with presto.R):
shell:
    """
    Rscript scripts/presto.R --mtx {input.mtx} --cluster {input.csv} {output}
    """
Snakefile lines 401–414 (presto.R on multiple samples):
shell:
    """
    mtxs=""
    for mtx in {input.mtxs}; do
        mtxs+=" --mtx=$mtx"
    done

    csvs=""
    for csv in {input.csvs}; do
        csvs+=" --cluster=$csv"
    done

    Rscript scripts/presto.R $mtxs $csvs {output} > {log} 2>&1
    """
