Snakemake workflow to grab raw fastqs from SRA given SRR, SRX, SRP, or PRJNA accession numbers
WARNING: in development
Snakemake workflow to go from NCBI accessions to fastqs.
Input:
- SRR, SRX, or SRP ids. SRP ids are recommended, since directories are named based on them.
- Note: the smallest resolution that can be returned is SRX. If an SRR is given as input, the workflow grabs metadata and fastqs for the entire SRX it resides in.
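A hypothetical helper (not part of the workflow) illustrates how the accepted accession types differ by prefix:

```shell
# Hypothetical helper, not part of the workflow: classify an accession by its prefix
classify_accession() {
    case "$1" in
        SRR*)   echo "run" ;;
        SRX*)   echo "experiment" ;;
        SRP*)   echo "project" ;;
        PRJNA*) echo "bioproject" ;;
        *)      echo "unknown" ;;
    esac
}

classify_accession SRX000001   # → experiment
```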
Output:
- Writes fastqs from fasterq-dump to the directory specified in the config file.
- Appends [project_id]/[experiment_id] to that directory to organize the fastqs.
- A status file in ./output/[project_id]/[experiment_id] records the date of the most recent download.
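The output layout can be sketched in plain shell; the accession ids and directory name below are placeholders, not values from the workflow's config:

```shell
set -euo pipefail

# Placeholder values standing in for the config entry and the wildcards
fastq_out_dir="output"
project_id="SRP000001"
experiment_id="SRX000001"

# [project_id]/[experiment_id] is appended to the configured fastq directory
target_dir="$fastq_out_dir/$project_id/$experiment_id"
mkdir -p "$target_dir"

# The status file records the date of the most recent download
echo "fastqs for $experiment_id were obtained on:" > "$target_dir/status"
date >> "$target_dir/status"
```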
TODO:
- Make clear which yaml is parameters vs. conda env
- Test in an isolated (singularity) environment
- Where does prefetch go?
- Write metadata and fastqs to the same location
First time running:

```
# will create a conda environment to call snakemake helper scripts
$ bash ./scripts/config.sh
```

sra-tools may need to be configured first (https://github.com/ncbi/sra-tools/wiki/03.-Quick-Toolkit-Configuration).
Call SRA_project_snake with:

```
$ bash SRA_project_snake.sh [SRP-accession-id]
```
Code Snippets
```
shell:
    """
    # Given a project accession, get the SRX experiments that make it up
    # Create a directory for each experiment to store fastqs
    esearch -db sra -query {wildcards.experiment_accession} | efetch -mode xml | \
    xpath -q -e '//EXPERIMENT/@accession' | cut -d'"' -f2 > \
    {output.experiment_accession_list}
    awk '{{print "output/"$0"_dir"}}' {output.experiment_accession_list} > tmp_appended_experiment_accessions.txt
    xargs -d '\n' mkdir -p < tmp_appended_experiment_accessions.txt
    """
```
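The awk/xargs directory-creation step at the end of that rule can be exercised on its own; the accession list below is hand-written with made-up ids:

```shell
set -euo pipefail

# Hand-written experiment accession list (made-up ids)
printf 'SRX000001\nSRX000002\n' > experiment_accession_list.txt

# Same transformation as in the rule: turn each accession into a directory path
awk '{print "output/"$0"_dir"}' experiment_accession_list.txt > tmp_appended_experiment_accessions.txt

# Create one directory per line (GNU xargs; -d '\n' splits on newlines only)
xargs -d '\n' mkdir -p < tmp_appended_experiment_accessions.txt

ls output
```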
```
shell:
    """
    mkdir -p output/{wildcards.project_accession}/{wildcards.experiment_accession}/
    # Grabs metadata, fetches it in table format, and saves it to csv
    esearch -db sra -query '{wildcards.experiment_accession}' | efetch -format runinfo \
    > {params.metadata_out_dir}/run_metadata.csv
    # Same thing, but save only the SRRs, one per line, for use with prefetch or fasterq-dump
    # First column (-f1) holds the SRRs, delim is ",", egrep gets rid of the heading and returns just SRRs
    esearch -db sra -query '{wildcards.experiment_accession}' | efetch -format runinfo | cut -f1 -d, \
    | egrep 'SRR' > {output.run_accession_list}
    """
```
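The cut/egrep filtering can be checked offline against a mocked runinfo table (the header name and accessions below imitate efetch's runinfo CSV; only two columns are shown):

```shell
set -euo pipefail

# Mocked runinfo CSV: header line plus two runs (made-up accessions)
cat > run_metadata.csv <<'EOF'
Run,ReleaseDate
SRR0000001,2020-01-01
SRR0000002,2020-01-02
EOF

# First column (-f1) holds the SRRs, the delimiter is ","; egrep drops the header row
cut -f1 -d, run_metadata.csv | egrep 'SRR' > run_accession_list.txt

cat run_accession_list.txt
```

This should print the two SRR accessions, one per line, with the header row filtered out.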
```
shell:
    """
    # Use this section if you want to use prefetch and validate the downloads. Necessary with
    # fastq-dump but not necessary with fasterq-dump
    # prefetch SRR6854061
    cat {input.run_accession_list} | while IFS= read -r run_accession; do
        prefetch $run_accession && fasterq-dump $run_accession {params.fasterq_flags} \
        -O {params.fastq_out_dir}/{wildcards.project_accession}/{wildcards.experiment_accession}
    done > {log} 2>&1

    # Alternative: feed each line of the SRR accession list to fasterq-dump directly
    # Flags: --split-files separates paired reads
    # xargs operates on each line with "-l"
    # -I appends a unique "1" or "2" to pairs
    # -O output directory, -t temp directory
    # Tags the project and experiment accessions onto the general fastq location, to be processed together
    # cat {input.run_accession_list} | xargs -l fasterq-dump {params.fasterq_flags} \
    # -O {params.fastq_out_dir}/{wildcards.project_accession}/{wildcards.experiment_accession} \
    # 2> {log}

    echo 'fastqs for {wildcards.experiment_accession} were obtained on:' > {output.status}
    date >> {output.status}
    """
```
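The control flow of the loop above (fasterq-dump runs only when prefetch succeeds, and everything is redirected to the log) can be sketched with stub functions standing in for the real sra-tools binaries; the accessions and paths are placeholders:

```shell
set -euo pipefail

# Stubs standing in for the sra-tools binaries (the real workflow calls prefetch / fasterq-dump)
prefetch()     { echo "prefetch $1"; }
fasterq_dump() { echo "fasterq-dump $1 -O $2"; }

printf 'SRR0000001\nSRR0000002\n' > run_accession_list.txt
out_dir="output/SRP000001/SRX000001"
mkdir -p "$out_dir"

# && short-circuits: the dump step runs only if prefetch exits 0
while IFS= read -r run_accession; do
    prefetch "$run_accession" && fasterq_dump "$run_accession" "$out_dir"
done < run_accession_list.txt > download.log 2>&1

cat download.log
```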