Snakemake workflow to fetch raw fastqs from SRA given SRR, SRX, SRP, or PRJNA accession numbers


WARNING: in development

Snakemake workflow to go from NCBI accessions to fastqs.

Input: SRR, SRX, or SRP IDs.
- SRP IDs are recommended, since output directories are named based on them.
- Note: the smallest resolution that can be returned is SRX. If an SRR is given as input, the workflow grabs metadata and fastqs for the entire SRX it belongs to.

Output:
- writes fastqs from fasterq-dump to the directory specified in the config file
- appends [project_id]/[experiment_id] to that directory to organize the fastqs
- a status file in ./output/[project_id]/[experiment_id] records the date of the most recent download
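The output layout described above can be sketched in plain shell; the `fastq_out_dir` value and the accession IDs here are made-up stand-ins for the config entry and the Snakemake wildcards:

```shell
# Hypothetical stand-ins for the config entry and Snakemake wildcards (not from the repo).
fastq_out_dir="output"
project_accession="SRP000001"
experiment_accession="SRX000001"

# The workflow appends [project_id]/[experiment_id] to the configured directory.
fastq_dir="${fastq_out_dir}/${project_accession}/${experiment_accession}"
mkdir -p "${fastq_dir}"
```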

TODO:
- make clear which yaml holds parameters vs. the conda environment
- test in an isolated (Singularity) environment
- figure out where prefetch output goes
- write metadata and fastqs to the same location

First time running:

    # will create a conda environment to call snakemake helper scripts with
    $ bash ./scripts/config.sh

sra-tools may need one-time configuration: https://github.com/ncbi/sra-tools/wiki/03.-Quick-Toolkit-Configuration

Call SRA_project_snake with:

    $ bash SRA_project_snake.sh [SRP-accession-id]

Code Snippets

shell:
	"""
	# Given an accession, list the SRX experiment accessions it contains,
	# then create a directory for each experiment to store fastqs.
	esearch -db sra -query {wildcards.experiment_accession} | efetch -mode xml | \
		xpath -q -e '//EXPERIMENT/@accession' | cut -d'"' -f2 > \
		{output.experiment_accession_list}

	awk '{{print "output/"$0"_dir"}}' {output.experiment_accession_list} > tmp_appended_experiment_accessions.txt

	xargs -d '\n' mkdir -p < tmp_appended_experiment_accessions.txt
	"""
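The awk/xargs directory-creation step can be exercised offline with made-up SRX accessions (the doubled braces in the snippet above are Snakemake's escaping; plain shell uses single braces):

```shell
# Made-up experiment accessions standing in for real esearch/efetch output.
printf 'SRX000001\nSRX000002\n' > experiment_accessions.txt

# Prefix each accession with "output/" and suffix it with "_dir", as the rule does.
awk '{print "output/"$0"_dir"}' experiment_accessions.txt > tmp_appended_experiment_accessions.txt

# Create one directory per line of the generated list.
xargs -d '\n' mkdir -p < tmp_appended_experiment_accessions.txt
```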
	shell:
		"""
		mkdir -p output/{wildcards.project_accession}/{wildcards.experiment_accession}/

		# Grab metadata, fetch in table format, and save to csv
		esearch -db sra -query '{wildcards.experiment_accession}' | efetch -format runinfo \
			> {params.metadata_out_dir}/run_metadata.csv

		# Same query, but save only SRRs, one per line, for use with prefetch or fasterq-dump.
		# The first column (-f1) holds SRRs, the delimiter is ",", and egrep drops the header
		# line, returning just the SRR accessions.
		esearch -db sra -query '{wildcards.experiment_accession}' | efetch -format runinfo | cut -f1 -d, \
			| egrep 'SRR' > {output.run_accession_list}
		"""
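The SRR-filtering step can be checked without network access by substituting a mock runinfo CSV for the efetch output (the columns here are abbreviated from the real runinfo format):

```shell
# Mock runinfo CSV: header line plus two runs, standing in for `efetch -format runinfo`.
cat > run_metadata.csv <<'EOF'
Run,ReleaseDate,Experiment
SRR000001,2020-01-01,SRX000001
SRR000002,2020-01-02,SRX000001
EOF

# First column (-f1) holds the run accessions; egrep drops the "Run" header line.
cut -f1 -d, run_metadata.csv | egrep 'SRR' > run_accession_list.txt
```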
	shell:
		"""
		# Use this section if you want to use prefetch and validate the downloads.
		# Necessary with fastq-dump but not necessary with fasterq-dump.
		# e.g. prefetch SRR6854061
		cat {input.run_accession_list} | while IFS= read -r run_accession; do
			prefetch $run_accession && fasterq-dump $run_accession {params.fasterq_flags} \
				-O {params.fastq_out_dir}/{wildcards.project_accession}/{wildcards.experiment_accession}
		done > {log} 2>&1

		# Alternative: feed each line of the SRR accession list to fasterq-dump directly.
		# Flags: --split-files separates paired reads, appending "_1" or "_2" to the pair;
		# -O sets the output directory, -t the temp directory; xargs -l runs once per line.
		# Tags the project accession and experiment accession onto the general fastq
		# location, so experiments can be processed together.
		#cat {input.run_accession_list} | xargs -l fasterq-dump {params.fasterq_flags} \
		#	-O {params.fastq_out_dir}/{wildcards.project_accession}/{wildcards.experiment_accession} \
		#	2> {log}

		echo 'fastqs for {wildcards.experiment_accession} were obtained on:' > {output.status}
		date >> {output.status}
		"""
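The commented-out xargs alternative can be dry-run by substituting `echo` for fasterq-dump; this only demonstrates how xargs issues one command per accession line:

```shell
# Fake run accession list (made-up IDs).
printf 'SRR000001\nSRR000002\n' > run_accession_list.txt

# `xargs -L 1` runs the command once per input line; `echo` stands in for fasterq-dump.
cat run_accession_list.txt | xargs -L 1 echo would-run fasterq-dump > dry_run.log
```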



Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/jaredslosberg/get_SRA_fastqs
Name: get_sra_fastqs
Version: 1
Copyright: Public Domain
License: None
