A Snakemake pipeline for cellranger to process 10x single-cell RNAseq data
A Snakemake pipeline to process 10x Genomics data using `cellranger`.
Workflow of the pipeline
The pipeline currently has three steps: download, mkfastq and count, as sketched below.
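A minimal sketch of the three-step dependency chain follows. The rule, variable and file names here are illustrative, not the repo's actual Snakefile; the real rules are excerpted under "Code Snippets" below.

```python
# Illustrative skeleton only -- names are hypothetical placeholders.
RUNS = ["run1"]
SAMPLES = ["sampleA", "sampleB"]

rule all:
    input:
        expand("flags/{sample}.count.done", sample=SAMPLES)

rule download:
    # fetch (or soft-link) the raw BCL run folder
    output: touch("flags/{run}.download.done")

rule mkfastq:
    # cellranger mkfastq: BCL files -> per-sample fastqs
    input: "flags/{run}.download.done"
    output: touch("flags/{run}.mkfastq.done")

rule count:
    # cellranger count: fastqs -> gene-barcode matrix
    input: expand("flags/{run}.mkfastq.done", run=RUNS)
    output: touch("flags/{sample}.count.done")
```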
The Python downloader in the `scripts` folder is from https://help.basespace.illumina.com/articles/tutorials/using-the-python-run-downloader/. Read that page as well to learn how to obtain your access token.
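The downloader's command line, taken from its own usage message (see "Code Snippets" below):

```bash
python scripts/BaseSpaceRunDownloader_v2.py -r <RunID> -a <AccessToken>
```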
Settings can be changed in the `config.yaml` file, e.g. the path to a different reference genome to align against, and `cellranger`-specific parameters.
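A sketch of what the relevant keys might look like. The key names are the ones the rule snippets below actually reference; every value here is a placeholder:

```yaml
download: True                 # pull runs from BaseSpace; set False to soft-link local run folders
token: YOUR_BASESPACE_TOKEN    # see the BaseSpace tutorial link above
run_dirs: run_dirs.txt         # tab-separated run_id -> run folder map, used when download is False
transcriptome: /path/to/refdata-cellranger-mm10-1.2.0   # or a human reference
expect_cells: 5000             # passed to cellranger count --expect-cells
```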
The advantage of a Snakemake workflow is that jobs are dispatched to the cluster in parallel whenever possible. Snakemake has built-in SLURM support (see `pyflow-cellranger.sh`), so you do not have to write the sbatch scripts yourself. The queue, memory, time and other resources can be changed in the `cluster.json` file. When a step fails, rerunning the workflow reruns only the failed jobs, and you can also restrict a run to a specific step or sample.
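`cluster.json` follows Snakemake's cluster-config format: a `__default__` entry plus optional per-rule overrides. A sketch with hypothetical rule names and resource values (the repo's actual file may differ):

```json
{
    "__default__": {
        "queue": "general",
        "mem": "8G",
        "time": "1:00:00"
    },
    "cellranger_count": {
        "mem": "40G",
        "time": "2-00:00:00"
    }
}
```

These fields are referenced from the cluster submission command as `{cluster.queue}`, `{cluster.mem}` and so on (see the wrapper sketch further down); keys missing from a rule entry fall back to `__default__`. To rerun only one step or sample, pass the corresponding rule or target through the wrapper, e.g. `./pyflow-cellranger.sh -R <rule>`, assuming the wrapper forwards its arguments to `snakemake` as the `-np` dry run below suggests.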
All of these parameters are changed in the master `config.yaml` file, and many other `cellranger`-specific parameters can be moved into it over time.
How to run the workflow
```bash
ssh login.rc.fas.harvard.edu

# start a screen session
screen

git clone https://gitlab.com/dulaclab/pyflow_cellranger
cd pyflow_cellranger

## edit the config.yaml file as needed, e.g. set mouse or human for the ref genome
nano config.yaml

## snakemake is distributed through the bioconda channel
conda create -n snakemake -c bioconda -c conda-forge snakemake python=3.6
source activate snakemake

# prepare a meta.txt file containing run information; see the example in the repo.
# for each run, a csv file also needs to be prepared for cellranger to make fastqs.
# see examples in the repo: if you have 3 runs, you should have 3 csv files.
python sample2json.py meta.txt

## check the generated file
less -S samples.json

## dry run
./pyflow-cellranger.sh -np
```
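The contents of `pyflow-cellranger.sh` are not shown on this page; since `-np` is passed straight through to Snakemake above, it is presumably a thin wrapper along these lines (a hypothetical sketch, not the repo's actual script):

```bash
#!/bin/bash
# Hypothetical sketch of a Snakemake-to-SLURM wrapper; the actual
# pyflow-cellranger.sh may use different flags and defaults.
snakemake -j 99 \
    --cluster-config cluster.json \
    --cluster "sbatch -p {cluster.queue} --mem={cluster.mem} -t {cluster.time}" \
    "$@"
```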
To do
- add QC steps
- doublet removal
Code Snippets
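`scripts/BaseSpaceRunDownloader_v2.py` is Illumina's BaseSpace run downloader (see the tutorial link above). Note that it is Python 2 code (`urllib2`, `print` statements, `except URLError, e:` syntax), which is why the download rule further down activates a `py27` environment: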
```python
from urllib2 import Request, urlopen, URLError
import json
import math
import sys
import os
import socket
import optparse


def arg_parser():
    cwd_dir = os.getcwd()
    parser = optparse.OptionParser()
    parser.add_option('-r', dest='runid', help='Run ID: required')
    parser.add_option('-a', dest='accesstoken', help='Access Token: required')
    (options, args) = parser.parse_args()
    try:
        if options.runid == None:
            raise Exception
        if options.accesstoken == None:
            raise Exception
    except Exception:
        print("Usage: BaseSpaceRunDownloader_vN.py -r <RunID> -a <AccessToken>")
        sys.exit()
    return options


def restrequest(rawrequest):
    request = Request(rawrequest)
    try:
        response = urlopen(request)
        json_string = response.read()
        json_obj = json.loads(json_string)
    except URLError, e:
        print 'Got an error code:', e
        sys.exit()
    return json_obj


def downloadrestrequest(rawrequest, path):
    dirname = RunID + os.sep + os.path.dirname(path)
    if dirname != '':
        if not os.path.isdir(dirname):
            os.makedirs(dirname)
    request = (rawrequest)
    outfile = open(RunID + os.sep + path, 'wb')
    try:
        response = urlopen(request, timeout=1)
        outfile.write(response.read())
        outfile.close()
    except URLError, e:
        print 'Got an error code:', e
        outfile.close()
        downloadrestrequest(rawrequest, path)
    except socket.error:
        print 'Got a socket error: retrying'
        outfile.close()
        downloadrestrequest(rawrequest, path)


options = arg_parser()
RunID = options.runid
AccessToken = options.accesstoken

request = 'http://api.basespace.illumina.com/v1pre3/runs/%s/files?access_token=%s' % (RunID, AccessToken)
json_obj = restrequest(request)
totalCount = json_obj['Response']['TotalCount']
noffsets = int(math.ceil(float(totalCount)/1000.0))

hreflist = []
pathlist = []
filenamelist = []
for index in range(noffsets):
    offset = 1000*index
    request = 'http://api.basespace.illumina.com/v1pre3/runs/%s/files?access_token=%s&limit=1000&Offset=%s' % (RunID, AccessToken, offset)
    json_obj = restrequest(request)
    nfiles = len(json_obj['Response']['Items'])
    for fileindex in range(nfiles):
        href = json_obj['Response']['Items'][fileindex]['Href']
        hreflist.append(href)
        path = json_obj['Response']['Items'][fileindex]['Path']
        pathlist.append(path)

for index in range(len(hreflist)):
    request = 'http://api.basespace.illumina.com/%s/content?access_token=%s' % (hreflist[index], AccessToken)
    print 'downloading %s' % (pathlist[index])
    downloadrestrequest(request, pathlist[index])
```
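`scripts/make_seurat.R` builds a Seurat object from the `cellranger count` output and runs a standard normalize → find-variable-genes → scale → PCA → JackStraw → tSNE → cluster pass, picking the number of significant PCs from the JackStraw p-values. It is written against the Seurat v2 API (`CreateSeuratObject(raw.data = ...)`, `FindVariableGenes`, `object@dr`), so it would need changes to run under Seurat v3 or later: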
```r
library(Seurat)
library(dplyr)

## see https://bitbucket.org/snakemake/snakemake/issues/917/enable-stdout-and-stderr-redirection
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

# https://github.com/satijalab/seurat/issues/671
# run NormalizeData first and then FindVariableGenes
PreprocessData <- function(object, x.low.cutoff = 0.05, x.high.cutoff = 10,
                           y.cutoff = 0.5, num.pc = 20, pc.use = NULL,
                           do.par = TRUE, num.cores = 2, score.thresh = 1e-5,
                           sig.pc.thresh = 0.05, n.start = 100, nn.eps = 0,
                           resolution = 0.8, k.param = 30, ...){
    object <- NormalizeData(object = object)
    object <- FindVariableGenes(object = object, x.low.cutoff = x.low.cutoff,
                                x.high.cutoff = x.high.cutoff, y.cutoff = y.cutoff)
    object <- ScaleData(object = object, genes.use = object@var.genes,
                        vars.to.regress = c("nUMI"), block.size = 1000,
                        min.cells.to.block = 3000, display.progress = TRUE,
                        do.par = TRUE, num.cores = num.cores)
    object <- RunPCA(object = object, pc.genes = object@var.genes,
                     pcs.compute = num.pc, do.print = FALSE)
    if (is.null(pc.use)){
        object <- JackStraw(object = object, num.replicate = 100,
                            num.cores = num.cores, do.par = T, num.pc = num.pc)
        object <- JackStrawPlot(object = object, PCs = 1:num.pc,
                                score.thresh = score.thresh)
        PC_pvalues <- object@dr$pca@jackstraw@overall.p.values
        ## determine how many PCs to use
        pc.use <- min(which(PC_pvalues[, "Score"] > sig.pc.thresh)) - 1
    }
    # add significant pc number to metadata, need to have names same as the cells
    pc.use.meta <- rep(pc.use, length(object@cell.names))
    names(pc.use.meta) <- object@cell.names
    object <- AddMetaData(object = object, metadata = pc.use.meta, col.name = "pc.use")
    object <- RunTSNE(object = object, dims.use = 1:pc.use, do.fast = TRUE)
    object <- FindClusters(object = object, reduction.type = "pca",
                           dims.use = 1:pc.use, n.start = n.start,
                           k.param = k.param, nn.eps = nn.eps,
                           resolution = resolution, print.output = FALSE,
                           save.SNN = TRUE, force.recalc = TRUE)
    return(object)
}

sample <- snakemake@wildcards[["sample"]]
data.dir <- snakemake@params[["cellranger_count_dir"]]

# Load the seurat_obj dataset
data <- Read10X(data.dir = data.dir)
seurat_obj <- CreateSeuratObject(raw.data = data, min.cells = 3,
                                 min.genes = 200, project = sample)

make_seurat_obj_args <- snakemake@params[["make_seurat_obj_args"]]
command <- paste("PreprocessData", "(", "seurat_obj,", make_seurat_obj_args, ")")
seurat_obj <- eval(parse(text = command))

saveRDS(seurat_obj, file = paste0("seurat_objs/", sample, "_seurat.rds"))
```
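The `run` block of the download rule either pulls a run from BaseSpace with the Python 2 script above or, when `config['download']` is false, soft-links an existing local run directory listed in the `config['run_dirs']` file: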
```python
run:
    if config['download']:
        # activate and download in the same shell invocation: each shell()
        # call spawns its own process, so a separate "source activate" would
        # not persist into the next call
        shell("source activate py27 && python scripts/BaseSpaceRunDownloader_v2.py -r {wildcards.run_id} -a {config[token]} 2> {log}")
    ## if not downloaded from BaseSpace, soft-link the run directory
    else:
        with open(config['run_dirs']) as f:
            reader = csv.reader(f, delimiter="\t")
            # skip the header
            header = next(reader)
            run_dir_dict = {}
            for row in reader:
                run_id = row[0].strip()
                run_dir = row[1].strip()
                run_dir_dict[run_id] = run_dir
        # snakemake auto-creates the output folder first; remove it, otherwise
        # ln -s would link into a subfolder of run_id
        cmd1 = "rm -rf {run_id}".format(run_id=wildcards.run_id)
        cmd2 = "ln -rs {run_dir} {run_id}".format(run_dir=run_dir_dict[wildcards.run_id],
                                                  run_id=wildcards.run_id)
        shell(cmd1)
        shell(cmd2)
```
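From the parsing code above (tab-delimited, one header row, run ID in the first column and run folder in the second), the `run_dirs` file would look something like this illustrative example:

```
run_id	run_dir
run1	/path/to/bcl_folder_for_run1
run2	/path/to/bcl_folder_for_run2
```

The mkfastq rule then converts each run to fastqs, loading bcl2fastq, which `cellranger mkfastq` drives under the hood: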
```python
shell:
    """
    module load bcl2fastq2
    cellranger mkfastq --id={wildcards.batch_name} \
        --run={params.run_info[0]} \
        --samplesheet={params.run_info[2]} \
        --jobmode=local \
        --localmem=30 \
        --localcores=8 2> {log}
    touch {output}
    """
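The count rule runs `cellranger count` once per sample, pulling the transcriptome reference and expected cell number from `config.yaml`, plus any extra flags via `params.custom`: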
```python
shell:
    """
    cellranger count --id={wildcards.sample} \
        --transcriptome={config[transcriptome]} \
        --fastqs={params.fastqs} \
        --sample={wildcards.sample} \
        --expect-cells={config[expect_cells]} \
        {params.custom} \
        --localcores=12 \
        --localmem=40 2> {log}
    touch {output}
    """
```python
script: "scripts/make_seurat.R"
```
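Snakemake's `script:` directive injects an S4 `snakemake` object into the R session, which is how `make_seurat.R` reads `snakemake@wildcards`, `snakemake@params` and `snakemake@log` above. The enclosing rule is not shown on this page; a hypothetical sketch consistent with what the script accesses (param names match what the R script reads, but the paths and the config key are illustrative):

```python
# Hypothetical sketch only -- not the repo's actual rule.
rule make_seurat:
    input: "cellranger_count/{sample}.done"
    output: "seurat_objs/{sample}_seurat.rds"
    log: "logs/{sample}_make_seurat.log"
    params:
        cellranger_count_dir = "{sample}/outs/filtered_gene_bc_matrices/mm10",
        make_seurat_obj_args = config["make_seurat_obj_args"]
    script: "scripts/make_seurat.R"
```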