Snakemake workflow to fetch raw fastqs from SRA given SRR, SRX, SRP, or PRJNA accession numbers


WARNING: in development

Snakemake workflow to go from NCBI accessions to fastqs.

Input: SRR, SRX, or SRP IDs.
- SRP IDs are recommended, since output directories are named based on them.
- Note: the smallest resolution that can be returned is SRX. If an SRR is given as input, the workflow grabs metadata and fastqs for the entire SRX it belongs to.

Output:
- writes fastqs from fasterq-dump to the directory specified in the config file
- appends [project_id]/[experiment_id] to that directory to organize the fastqs
- a status file in ./output/[project_id]/[experiment_id] records the date of the most recent download
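The output layout described above can be sketched in plain shell; the `fastq_out_dir` value and the accession IDs here are made-up stand-ins for the config entry and the Snakemake wildcards:

```shell
# Hypothetical stand-ins for the config entry and Snakemake wildcards (not from the repo).
fastq_out_dir="output"
project_accession="SRP000001"
experiment_accession="SRX000001"

# The workflow appends [project_id]/[experiment_id] to the configured directory.
fastq_dir="${fastq_out_dir}/${project_accession}/${experiment_accession}"
mkdir -p "${fastq_dir}"
```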

TODO:
- make clear which yaml holds parameters vs. the conda environment
- test in an isolated (Singularity) environment
- figure out where prefetch output goes
- write metadata and fastqs to the same location

First time running:

    # will create a conda environment to call snakemake helper scripts with
    $ bash ./scripts/config.sh

sra-tools may need one-time configuration: https://github.com/ncbi/sra-tools/wiki/03.-Quick-Toolkit-Configuration

Call SRA_project_snake with:

    $ bash SRA_project_snake.sh [SRP-accession-id]

Code Snippets

shell:
	"""
	# Given an accession, list the SRX experiment accessions it contains,
	# then create a directory for each experiment to store fastqs.
	esearch -db sra -query {wildcards.experiment_accession} | efetch -mode xml | \
		xpath -q -e '//EXPERIMENT/@accession' | cut -d'"' -f2 > \
		{output.experiment_accession_list}

	awk '{{print "output/"$0"_dir"}}' {output.experiment_accession_list} > tmp_appended_experiment_accessions.txt

	xargs -d '\n' mkdir -p < tmp_appended_experiment_accessions.txt
	"""
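The awk/xargs directory-creation step can be exercised offline with made-up SRX accessions (the doubled braces in the snippet above are Snakemake's escaping; plain shell uses single braces):

```shell
# Made-up experiment accessions standing in for real esearch/efetch output.
printf 'SRX000001\nSRX000002\n' > experiment_accessions.txt

# Prefix each accession with "output/" and suffix it with "_dir", as the rule does.
awk '{print "output/"$0"_dir"}' experiment_accessions.txt > tmp_appended_experiment_accessions.txt

# Create one directory per line of the generated list.
xargs -d '\n' mkdir -p < tmp_appended_experiment_accessions.txt
```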
	shell:
		"""
		mkdir -p output/{wildcards.project_accession}/{wildcards.experiment_accession}/

		# Grab metadata, fetch in table format, and save to csv
		esearch -db sra -query '{wildcards.experiment_accession}' | efetch -format runinfo \
			> {params.metadata_out_dir}/run_metadata.csv

		# Same query, but save only SRRs, one per line, for use with prefetch or fasterq-dump.
		# The first column (-f1) holds SRRs, the delimiter is ",", and egrep drops the header
		# line, returning just the SRR accessions.
		esearch -db sra -query '{wildcards.experiment_accession}' | efetch -format runinfo | cut -f1 -d, \
			| egrep 'SRR' > {output.run_accession_list}
		"""
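The SRR-filtering step can be checked without network access by substituting a mock runinfo CSV for the efetch output (the columns here are abbreviated from the real runinfo format):

```shell
# Mock runinfo CSV: header line plus two runs, standing in for `efetch -format runinfo`.
cat > run_metadata.csv <<'EOF'
Run,ReleaseDate,Experiment
SRR000001,2020-01-01,SRX000001
SRR000002,2020-01-02,SRX000001
EOF

# First column (-f1) holds the run accessions; egrep drops the "Run" header line.
cut -f1 -d, run_metadata.csv | egrep 'SRR' > run_accession_list.txt
```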
	shell:
		"""
		# Use this section if you want to use prefetch and validate the downloads.
		# Necessary with fastq-dump but not necessary with fasterq-dump.
		# e.g. prefetch SRR6854061
		cat {input.run_accession_list} | while IFS= read -r run_accession; do
			prefetch $run_accession && fasterq-dump $run_accession {params.fasterq_flags} \
				-O {params.fastq_out_dir}/{wildcards.project_accession}/{wildcards.experiment_accession}
		done > {log} 2>&1

		# Alternative: feed each line of the SRR accession list to fasterq-dump directly.
		# Flags: --split-files separates paired reads, appending "_1" or "_2" to the pair;
		# -O sets the output directory, -t the temp directory; xargs -l runs once per line.
		# Tags the project accession and experiment accession onto the general fastq
		# location, so experiments can be processed together.
		#cat {input.run_accession_list} | xargs -l fasterq-dump {params.fasterq_flags} \
		#	-O {params.fastq_out_dir}/{wildcards.project_accession}/{wildcards.experiment_accession} \
		#	2> {log}

		echo 'fastqs for {wildcards.experiment_accession} were obtained on:' > {output.status}
		date >> {output.status}
		"""
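The commented-out xargs alternative can be dry-run by substituting `echo` for fasterq-dump; this only demonstrates how xargs issues one command per accession line:

```shell
# Fake run accession list (made-up IDs).
printf 'SRR000001\nSRR000002\n' > run_accession_list.txt

# `xargs -L 1` runs the command once per input line; `echo` stands in for fasterq-dump.
cat run_accession_list.txt | xargs -L 1 echo would-run fasterq-dump > dry_run.log
```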



Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/jaredslosberg/get_SRA_fastqs
Name: get_sra_fastqs
Version: 1
Copyright: Public Domain
License: None
