A Snakemake pipeline for cellranger to process 10x single-cell RNAseq data
A Snakemake pipeline to process 10x Genomics data using `cellranger`.
Workflow of the pipeline
The pipeline currently has three steps: download, mkfastq and count, as sketched below.
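A minimal sketch of the three-step dependency chain follows. The rule, variable and file names here are illustrative, not the repo's actual Snakefile; the real rules are excerpted under "Code Snippets" below.

```python
# Illustrative skeleton only -- names are hypothetical placeholders.
RUNS = ["run1"]
SAMPLES = ["sampleA", "sampleB"]

rule all:
    input:
        expand("flags/{sample}.count.done", sample=SAMPLES)

rule download:
    # fetch (or soft-link) the raw BCL run folder
    output: touch("flags/{run}.download.done")

rule mkfastq:
    # cellranger mkfastq: BCL files -> per-sample fastqs
    input: "flags/{run}.download.done"
    output: touch("flags/{run}.mkfastq.done")

rule count:
    # cellranger count: fastqs -> gene-barcode matrix
    input: expand("flags/{run}.mkfastq.done", run=RUNS)
    output: touch("flags/{sample}.count.done")
```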
The Python downloader in the `scripts` folder is from https://help.basespace.illumina.com/articles/tutorials/using-the-python-run-downloader/. Read that page as well to learn how to obtain your access token.
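The downloader's command line, taken from its own usage message (see "Code Snippets" below):

```bash
python scripts/BaseSpaceRunDownloader_v2.py -r <RunID> -a <AccessToken>
```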
Settings can be changed in the `config.yaml` file, e.g. the path to a different reference genome to align against, and `cellranger`-specific parameters.
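A sketch of what the relevant keys might look like. The key names are the ones the rule snippets below actually reference; every value here is a placeholder:

```yaml
download: True                 # pull runs from BaseSpace; set False to soft-link local run folders
token: YOUR_BASESPACE_TOKEN    # see the BaseSpace tutorial link above
run_dirs: run_dirs.txt         # tab-separated run_id -> run folder map, used when download is False
transcriptome: /path/to/refdata-cellranger-mm10-1.2.0   # or a human reference
expect_cells: 5000             # passed to cellranger count --expect-cells
```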
The advantage of a Snakemake workflow is that jobs are dispatched to the cluster in parallel whenever possible. Snakemake has built-in SLURM support (see `pyflow-cellranger.sh`), so you do not have to write the sbatch scripts yourself. The queue, memory, time and other resources can be changed in the `cluster.json` file. When a step fails, rerunning the workflow reruns only the failed jobs, and you can also restrict a run to a specific step or sample.
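`cluster.json` follows Snakemake's cluster-config format: a `__default__` entry plus optional per-rule overrides. A sketch with hypothetical rule names and resource values (the repo's actual file may differ):

```json
{
    "__default__": {
        "queue": "general",
        "mem": "8G",
        "time": "1:00:00"
    },
    "cellranger_count": {
        "mem": "40G",
        "time": "2-00:00:00"
    }
}
```

These fields are referenced from the cluster submission command as `{cluster.queue}`, `{cluster.mem}` and so on (see the wrapper sketch further down); keys missing from a rule entry fall back to `__default__`. To rerun only one step or sample, pass the corresponding rule or target through the wrapper, e.g. `./pyflow-cellranger.sh -R <rule>`, assuming the wrapper forwards its arguments to `snakemake` as the `-np` dry run below suggests.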
All of these parameters are changed in the master `config.yaml` file, and many other `cellranger`-specific parameters can be moved into it over time.
How to run the workflow
```bash
ssh login.rc.fas.harvard.edu

# start a screen session
screen

git clone https://gitlab.com/dulaclab/pyflow_cellranger
cd pyflow_cellranger

## edit the config.yaml file as needed, e.g. set mouse or human for the ref genome
nano config.yaml

## snakemake is distributed through the bioconda channel
conda create -n snakemake -c bioconda -c conda-forge snakemake python=3.6
source activate snakemake

# prepare a meta.txt file containing run information; see the example in the repo.
# for each run, a csv file also needs to be prepared for cellranger to make fastqs.
# see examples in the repo: if you have 3 runs, you should have 3 csv files.
python sample2json.py meta.txt

## check the generated file
less -S samples.json

## dry run
./pyflow-cellranger.sh -np
```
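The contents of `pyflow-cellranger.sh` are not shown on this page; since `-np` is passed straight through to Snakemake above, it is presumably a thin wrapper along these lines (a hypothetical sketch, not the repo's actual script):

```bash
#!/bin/bash
# Hypothetical sketch of a Snakemake-to-SLURM wrapper; the actual
# pyflow-cellranger.sh may use different flags and defaults.
snakemake -j 99 \
    --cluster-config cluster.json \
    --cluster "sbatch -p {cluster.queue} --mem={cluster.mem} -t {cluster.time}" \
    "$@"
```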
To do
- add QC steps
- doublet removal
Code Snippets
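`scripts/BaseSpaceRunDownloader_v2.py` is Illumina's BaseSpace run downloader (see the tutorial link above). Note that it is Python 2 code (`urllib2`, `print` statements, `except URLError, e:` syntax), which is why the download rule further down activates a `py27` environment: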
```python
from urllib2 import Request, urlopen, URLError
import json
import math
import sys
import os
import socket
import optparse


def arg_parser():
    cwd_dir = os.getcwd()
    parser = optparse.OptionParser()
    parser.add_option('-r', dest='runid', help='Run ID: required')
    parser.add_option('-a', dest='accesstoken', help='Access Token: required')
    (options, args) = parser.parse_args()
    try:
        if options.runid == None:
            raise Exception
        if options.accesstoken == None:
            raise Exception
    except Exception:
        print("Usage: BaseSpaceRunDownloader_vN.py -r <RunID> -a <AccessToken>")
        sys.exit()
    return options


def restrequest(rawrequest):
    request = Request(rawrequest)
    try:
        response = urlopen(request)
        json_string = response.read()
        json_obj = json.loads(json_string)
    except URLError, e:
        print 'Got an error code:', e
        sys.exit()
    return json_obj


def downloadrestrequest(rawrequest, path):
    dirname = RunID + os.sep + os.path.dirname(path)
    if dirname != '':
        if not os.path.isdir(dirname):
            os.makedirs(dirname)
    request = (rawrequest)
    outfile = open(RunID + os.sep + path, 'wb')
    try:
        response = urlopen(request, timeout=1)
        outfile.write(response.read())
        outfile.close()
    except URLError, e:
        print 'Got an error code:', e
        outfile.close()
        downloadrestrequest(rawrequest, path)
    except socket.error:
        print 'Got a socket error: retrying'
        outfile.close()
        downloadrestrequest(rawrequest, path)


options = arg_parser()
RunID = options.runid
AccessToken = options.accesstoken

request = 'http://api.basespace.illumina.com/v1pre3/runs/%s/files?access_token=%s' % (RunID, AccessToken)
json_obj = restrequest(request)
totalCount = json_obj['Response']['TotalCount']
noffsets = int(math.ceil(float(totalCount)/1000.0))

hreflist = []
pathlist = []
filenamelist = []
for index in range(noffsets):
    offset = 1000*index
    request = 'http://api.basespace.illumina.com/v1pre3/runs/%s/files?access_token=%s&limit=1000&Offset=%s' % (RunID, AccessToken, offset)
    json_obj = restrequest(request)
    nfiles = len(json_obj['Response']['Items'])
    for fileindex in range(nfiles):
        href = json_obj['Response']['Items'][fileindex]['Href']
        hreflist.append(href)
        path = json_obj['Response']['Items'][fileindex]['Path']
        pathlist.append(path)

for index in range(len(hreflist)):
    request = 'http://api.basespace.illumina.com/%s/content?access_token=%s' % (hreflist[index], AccessToken)
    print 'downloading %s' % (pathlist[index])
    downloadrestrequest(request, pathlist[index])
```
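`scripts/make_seurat.R` builds a Seurat object from the `cellranger count` output and runs a standard normalize → find-variable-genes → scale → PCA → JackStraw → tSNE → cluster pass, picking the number of significant PCs from the JackStraw p-values. It is written against the Seurat v2 API (`CreateSeuratObject(raw.data = ...)`, `FindVariableGenes`, `object@dr`), so it would need changes to run under Seurat v3 or later: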
```r
library(Seurat)
library(dplyr)

## see https://bitbucket.org/snakemake/snakemake/issues/917/enable-stdout-and-stderr-redirection
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

# https://github.com/satijalab/seurat/issues/671
# run NormalizeData first and then FindVariableGenes
PreprocessData <- function(object, x.low.cutoff = 0.05, x.high.cutoff = 10,
                           y.cutoff = 0.5, num.pc = 20, pc.use = NULL,
                           do.par = TRUE, num.cores = 2, score.thresh = 1e-5,
                           sig.pc.thresh = 0.05, n.start = 100, nn.eps = 0,
                           resolution = 0.8, k.param = 30, ...){
    object <- NormalizeData(object = object)
    object <- FindVariableGenes(object = object, x.low.cutoff = x.low.cutoff,
                                x.high.cutoff = x.high.cutoff, y.cutoff = y.cutoff)
    object <- ScaleData(object = object, genes.use = object@var.genes,
                        vars.to.regress = c("nUMI"), block.size = 1000,
                        min.cells.to.block = 3000, display.progress = TRUE,
                        do.par = TRUE, num.cores = num.cores)
    object <- RunPCA(object = object, pc.genes = object@var.genes,
                     pcs.compute = num.pc, do.print = FALSE)
    if (is.null(pc.use)){
        object <- JackStraw(object = object, num.replicate = 100,
                            num.cores = num.cores, do.par = T, num.pc = num.pc)
        object <- JackStrawPlot(object = object, PCs = 1:num.pc,
                                score.thresh = score.thresh)
        PC_pvalues <- object@dr$pca@jackstraw@overall.p.values
        ## determine how many PCs to use
        pc.use <- min(which(PC_pvalues[, "Score"] > sig.pc.thresh)) - 1
    }
    # add significant pc number to metadata, need to have names same as the cells
    pc.use.meta <- rep(pc.use, length(object@cell.names))
    names(pc.use.meta) <- object@cell.names
    object <- AddMetaData(object = object, metadata = pc.use.meta, col.name = "pc.use")
    object <- RunTSNE(object = object, dims.use = 1:pc.use, do.fast = TRUE)
    object <- FindClusters(object = object, reduction.type = "pca",
                           dims.use = 1:pc.use, n.start = n.start,
                           k.param = k.param, nn.eps = nn.eps,
                           resolution = resolution, print.output = FALSE,
                           save.SNN = TRUE, force.recalc = TRUE)
    return(object)
}

sample <- snakemake@wildcards[["sample"]]
data.dir <- snakemake@params[["cellranger_count_dir"]]

# Load the seurat_obj dataset
data <- Read10X(data.dir = data.dir)
seurat_obj <- CreateSeuratObject(raw.data = data, min.cells = 3,
                                 min.genes = 200, project = sample)

make_seurat_obj_args <- snakemake@params[["make_seurat_obj_args"]]
command <- paste("PreprocessData", "(", "seurat_obj,", make_seurat_obj_args, ")")
seurat_obj <- eval(parse(text = command))

saveRDS(seurat_obj, file = paste0("seurat_objs/", sample, "_seurat.rds"))
```
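The `run` block of the download rule either pulls a run from BaseSpace with the Python 2 script above or, when `config['download']` is false, soft-links an existing local run directory listed in the `config['run_dirs']` file: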
```python
run:
    if config['download']:
        # activate and download in the same shell invocation: each shell()
        # call spawns its own process, so a separate "source activate" would
        # not persist into the next call
        shell("source activate py27 && python scripts/BaseSpaceRunDownloader_v2.py -r {wildcards.run_id} -a {config[token]} 2> {log}")
    ## if not downloaded from BaseSpace, soft-link the run directory
    else:
        with open(config['run_dirs']) as f:
            reader = csv.reader(f, delimiter="\t")
            # skip the header
            header = next(reader)
            run_dir_dict = {}
            for row in reader:
                run_id = row[0].strip()
                run_dir = row[1].strip()
                run_dir_dict[run_id] = run_dir
        # snakemake auto-creates the output folder first; remove it, otherwise
        # ln -s would link into a subfolder of run_id
        cmd1 = "rm -rf {run_id}".format(run_id=wildcards.run_id)
        cmd2 = "ln -rs {run_dir} {run_id}".format(run_dir=run_dir_dict[wildcards.run_id],
                                                  run_id=wildcards.run_id)
        shell(cmd1)
        shell(cmd2)
```
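From the parsing code above (tab-delimited, one header row, run ID in the first column and run folder in the second), the `run_dirs` file would look something like this illustrative example:

```
run_id	run_dir
run1	/path/to/bcl_folder_for_run1
run2	/path/to/bcl_folder_for_run2
```

The mkfastq rule then converts each run to fastqs, loading bcl2fastq, which `cellranger mkfastq` drives under the hood: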
```python
shell:
    """
    module load bcl2fastq2
    cellranger mkfastq --id={wildcards.batch_name} \
        --run={params.run_info[0]} \
        --samplesheet={params.run_info[2]} \
        --jobmode=local \
        --localmem=30 \
        --localcores=8 2> {log}
    touch {output}
    """
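The count rule runs `cellranger count` once per sample, pulling the transcriptome reference and expected cell number from `config.yaml`, plus any extra flags via `params.custom`: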
```python
shell:
    """
    cellranger count --id={wildcards.sample} \
        --transcriptome={config[transcriptome]} \
        --fastqs={params.fastqs} \
        --sample={wildcards.sample} \
        --expect-cells={config[expect_cells]} \
        {params.custom} \
        --localcores=12 \
        --localmem=40 2> {log}
    touch {output}
    """
```python
script: "scripts/make_seurat.R"
```
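Snakemake's `script:` directive injects an S4 `snakemake` object into the R session, which is how `make_seurat.R` reads `snakemake@wildcards`, `snakemake@params` and `snakemake@log` above. The enclosing rule is not shown on this page; a hypothetical sketch consistent with what the script accesses (param names match what the R script reads, but the paths and the config key are illustrative):

```python
# Hypothetical sketch only -- not the repo's actual rule.
rule make_seurat:
    input: "cellranger_count/{sample}.done"
    output: "seurat_objs/{sample}_seurat.rds"
    log: "logs/{sample}_make_seurat.log"
    params:
        cellranger_count_dir = "{sample}/outs/filtered_gene_bc_matrices/mm10",
        make_seurat_obj_args = config["make_seurat_obj_args"]
    script: "scripts/make_seurat.R"
```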