Single-Cell RNA Sequencing Analysis Workflow Using Drop-seq Tools on SGE Cluster Environment
Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
This pipeline is based on snakemake and the dropseq tools provided by the McCarroll Lab . It starts from raw sequencing data and outputs digital gene expression matrices (DGEs).
WE ARE CURRENTLY USING AN UPDATED VERSION HERE
Updates
We have since moved on from using this method and now use an adapted version of dropSeqPipe which we have found to be much more sophisticated and flexible than our snakemake implementation of Drop-seq Tools .
We have made a number of changes to the original workflow:
1). We have adapted the dropSeqPipe workflow to work on a SGE Cluster Environment , specifically the UCLA Hoffman2 Cluster .
2). We have made the workflow compatible with the new pipeline dropEST which is capable of handling reads aligning to intronic regions and results in a large boost in cell number and genes/UMIs per cell in our hands.
If you would like to use this updated snakemake workflow, please go here .
Requirements
snakemake is the workflow management which wraps Drop-seq Tools which runs everything
Drop-seq Tools can be downloaded from here and a manual of the various functions is available here
STAR is used to map the reads.
Changes to config file
In order to run the workflow, you must modify the config.yaml file with your sample information. Simply provide the "SampleName" for each sample for read1 of your paired end fastq files.
Changes to Snakefile
The following changes must be made to the Snakefile
1). If you are running in a SGE Cluster Environment and you would like email notifications of your job status, add your email to the email parameter.
2). If you would like to give your cluster jobs a specific prefix, you can specify that with job_label
3). Provide a path to your STAR Index for your reference genome with: STARREFDIR
4). Give the paths to your reference genome fasta file: REF_SEQ and gtf file: GTF_FILE
5). Provide the path to the Drop-seq Tools directory: DropSeqTools
6). Provide the path to your STAR executable: STAR
7). If necessary, provide the path to your particular java version executable: JAVA_distro
Running the Workflow
Run the snakemake workflow with the following command:
snakemake -j 100 --cluster "qsub {params.sge_opts}"
Code Snippets
52 53 54 55 56 57 58 59 60 | shell: """ {JAVA_distro} -jar {picard_jar} FastqToSam \ F1={input.fastq1} \ F2={input.fastq2} \ O={output.bam} \ SM={wildcards.sample} \ SORT_ORDER=queryname """ |
71 72 73 74 75 | shell: """ {TagBamWithReadSequenceExtended} BASE_RANGE=1-12 BASE_QUALITY=10 BARCODED_READ=1 DISCARD_READ=false TAG_NAME=XC NUM_BASES_BELOW_QUALITY=1 \ INPUT={input.bam} SUMMARY={output.summary} OUTPUT={output.tagged_cell} """ |
85 86 87 88 89 | shell: """ {TagBamWithReadSequenceExtended} BASE_RANGE=13-20 BASE_QUALITY=10 BARCODED_READ=1 DISCARD_READ=true TAG_NAME=XM NUM_BASES_BELOW_QUALITY=1 \ INPUT={input.tagged_cell} SUMMARY={output.summary} OUTPUT={output.tagged_molecule} """ |
98 99 100 101 102 | shell: """ {FilterBAM} TAG_REJECT=XQ \ INPUT={input.tagged_molecule} OUTPUT={output.tagged_filtered} """ |
112 113 114 115 116 | shell: """ {TrimStartingSequence} SEQUENCE=AAGCAGTGGTATCAACGCAGAGTGAATGGG MISMATCHES=0 NUM_BASES=5 \ INPUT={input.tagged_filtered} OUTPUT_SUMMARY={output.summary} OUTPUT={output.trimmed_smart} """ |
126 127 128 129 130 | shell: """ {PolyATrimmer} MISMATCHES=0 NUM_BASES=6 \ INPUT={input.trimmed_smart} OUTPUT_SUMMARY={output.summary} OUTPUT={output.unmapped_bam} """ |
140 141 142 143 144 | shell: """ java -Xmx500m -jar {picard_jar} SamToFastq \ INPUT={input.unmapped_bam} FASTQ={output.fastq} """ |
154 155 156 157 158 | shell: """ {STAR} --genomeDir {input.starref} --runThreadN 6 \ --readFilesIn {input.fastq} --outFileNamePrefix Output/{wildcards.sample}/star. """ |
169 170 171 172 173 | shell: """ java -Dsamjdk.buffer_size=131072 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx4000m -jar {picard_jar} SortSam \ INPUT={input.aligned_sam} OUTPUT={output.aligned_sorted_bam} SORT_ORDER=queryname TMP_DIR={output.temp_dir} """ |
185 186 187 188 189 | shell: """ java -Xmx4000m -jar {picard_jar} MergeBamAlignment INCLUDE_SECONDARY_ALIGNMENTS=false PAIRED_RUN=false \ REFERENCE_SEQUENCE={input.reference_seq} UNMAPPED_BAM={input.unmapped_bam} ALIGNED_BAM={input.aligned_sorted_bam} OUTPUT={output.merged} """ |
199 200 201 202 203 | shell: """ {TagReadWithGeneExon} TAG=GE CREATE_INDEX=true \ INPUT={input.merged} ANNOTATIONS_FILE={input.gtf} O={output.gene_exon_tagged} """ |
214 215 216 217 218 | shell: """ {DetectBeadSynthesisErrors} NUM_BARCODES={Number_Barcodes} PRIMER_SEQUENCE=AAGCAGTGGTATCAACGCAGAGTAC \ I={input.gene_exon_tagged} O={output.gene_exon_tagged_clean} OUTPUT_STATS={output.stats} SUMMARY={output.summary} """ |
227 228 229 230 231 | shell: """ {GatherMolecularBarcodeDistributionByGene} MIN_NUM_GENES_PER_CELL=500 \ I={input.gene_exon_tagged_clean} O={output.duplicateUMI} """ |
241 242 243 244 245 | shell: """ {BAMTagHistogram} TAG=XC \ I={input.gene_exon_tagged_clean} O={output.cell_readcounts} """ |
258 259 260 261 262 | shell: """ {DigitalExpression} NUM_CORE_BARCODES={Number_Core_Barcodes} \ INPUT={input.gene_exon_tagged_clean} OUTPUT={output.dge} SUMMARY={output.summary} """ |
Support
- Future updates