Snakemake Pipeline for processing BioMob WP2 partial genome sequencing data
Pipeline for processing Illumina sequencing data generated by target enrichment via hybrid capture experiments. It closely follows the Phyluce methodology outlined in Tutorial I: UCE Phylogenomics.
- Trims Illumina adapters and merges reads together (BBDuk, BBMerge)
- Assembles trimmed and merged reads (Abyss, SPAdes, rnaSPAdes)
- Detects and extracts target contigs (Phyluce)
- Produces summary statistics on targets and assemblies (BBTools stats)
- Provides optional scripts and starting points to perform phylogenetic inference
Prerequisites
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda install -c bioconda -c conda-forge snakemake
- Git
Getting Started
Within a working directory:
git clone https://github.com/AAFC-BICoE/snakemake-partial-genome-pipeline.git .
- Create a folder named "fastq" that contains Illumina-based raw reads in fastq.gz format. Fastq file names should not begin with numbers or contain a mix of "_" and "-" characters.
- Create a folder named "probes" that contains a probe fasta file with fasta headers in Phyluce UCE format, for example:
>uce-1_p1
GCTGGTTATC...
>uce-1_p2
TAACAATA....
>uce-2_p1
AAGCATCT...
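With both folders in place, the working directory should look roughly like the sketch below. The sample and probe file names are placeholders; paired reads ending in _L001_R1_001.fastq.gz / _L001_R2_001.fastq.gz are assumed here, matching the suffixes handled by pipeline_files/evaluate.py.
working-directory/
├── Snakefile
├── pipeline_files/
├── fastq/
│   ├── SampleA_L001_R1_001.fastq.gz
│   ├── SampleA_L001_R2_001.fastq.gz
│   └── ...
└── probes/
    └── probes.fasta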
Dry-run to check that everything is prepared correctly:
snakemake --use-conda -n
To run the pipeline with 32 cores and continue even if some samples fail:
snakemake --use-conda -k --cores 32
To save time on future runs, a central folder of conda environments can be reused so they don't need to be rebuilt repeatedly. This feature has a path-length limit, so make sure the central folder is located in the home directory:
snakemake --use-conda --conda-prefix <Path To Snakemake Conda Envs> --cores 32
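For example, assuming a folder named snakemake_envs directly under the home directory (the folder name is only illustrative):
snakemake --use-conda --conda-prefix ~/snakemake_envs --cores 32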
Pipeline Overview
Pipeline Summary
This pipeline was heavily inspired by and closely follows the protocols developed by Dr. Brant Faircloth and described in Tutorial I: UCE Phylogenomics. Software versions are recorded in the Conda yml environment files, and the specific parameters and commands are in the Snakefile.
Illumina paired-end reads from target enrichment sequencing are trimmed of adapters using BBDuk. A copy of the trimmed fastq reads is merged using BBMerge. The unmerged reads are assembled using SPAdes, rnaSPAdes and Abyss. Merging paired-end reads prior to assembly with Abyss had a noticeable impact on the number of targets detected by Phyluce, while merging had a negligible impact with SPAdes and rnaSPAdes; therefore, the merged reads are assembled with Abyss.
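The corresponding BBDuk and BBMerge invocations appear verbatim in the Snakefile (see Code Snippets below); with the Snakemake wildcards replaced by placeholder file names for a single sample, they look roughly like:
bbduk.sh in1=SampleA_R1.fastq.gz in2=SampleA_R2.fastq.gz out1=SampleA_R1.trimmed.fastq.gz out2=SampleA_R2.trimmed.fastq.gz ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
bbmerge.sh in1=SampleA_R1.trimmed.fastq.gz in2=SampleA_R2.trimmed.fastq.gz out=SampleA_merged.fastq.gz outu=SampleA_unmerged.fastq.gz ihist=SampleA_ihist.txt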
Phyluce, along with the probe set used in the target enrichment experiment, is used to process each assembly independently. This generates four separate Phyluce databases of probe hits and UCE target contigs. Because target detection varies heavily with assembly method, we opted to combine all detected targets into a unique set per sample. The custom script merge_uces.py examines each sample and all UCEs detected across the four assemblies, combines all targets, and keeps only the longest copy of any target found in multiple assemblies. This unique set of merged targets dramatically increases the amount of data available for phylogenetic analysis; the unmodified assemblies remain available for processing if required.
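Because the merge step is a plain script call in the Snakefile, it can also be re-run by hand on the four exploded-fastas folders; the invocation below mirrors the corresponding rule in the Code Snippets section:
python pipeline_files/merge_uces.py -o merged_uces \
    -s phyluce-spades/taxon-sets/all/exploded-fastas/ \
    -r phyluce-rnaspades/taxon-sets/all/exploded-fastas/ \
    -a phyluce-abyss/taxon-sets/all/exploded-fastas/ \
    -u phyluce-abyss_u/taxon-sets/all/exploded-fastas/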
The merged targets are concatenated into a single file that substitutes for the Phyluce-generated all-taxa-incomplete.fasta, the entry point of the Phyluce phylogeny workflow. A rapid phylogeny is then generated for quality control; example commands are provided in the script phylogeny.sh. Phyluce aligns all UCE targets using Mafft, trims the alignments using Gblocks, and removes any targets not present in 50% or more of samples. The resulting phylip file serves as the entry point for RAxML or IQ-TREE, which produces a rapid phylogeny for quality control and for detecting sample or sequencing errors.
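The exact commands are in phylogeny.sh; as a minimal sketch of the final tree-building step only, assuming the Phyluce steps above have produced a concatenated phylip file (file name, output prefix and thread count below are placeholders):
# quality-control tree with IQ-TREE (ultrafast bootstrap)
iqtree -s all-taxa-min50.phylip -bb 1000 -nt AUTO -pre qc_tree
# or a rapid-bootstrap RAxML run on the same alignment
raxmlHPC-PTHREADS-SSE3 -T 16 -f a -m GTRGAMMA -p 12345 -x 12345 -N 100 -s all-taxa-min50.phylip -n qc_tree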
Author
Jackson Eyres
Bioinformatics Programmer
Agriculture & Agri-Food Canada
[email protected]
Copyright
Government of Canada, Agriculture & Agri-Food Canada
License
This project is licensed under the MIT License - see the LICENSE file for details
Publications & Additional Resources
- Brunke, A. J., Hansen, A. K., Salnitska, M., Kypke, J. L., Escalona, H., Chapados, J. T., Eyres, J., Richter, R., Smetana, A., Ślipiński, A., Zwick, A., Hájek, J., Leschen, R., Solodovnikov, A. and Dettman, J. R. The limits of Quediini at last (Coleoptera: Staphylinidae: Staphylininae): a rove beetle mega-radiation resolved by comprehensive sampling and anchored phylogenomics. Systematic Entomology. Accepted. 1–36.
- Dr. Adam Brunke provides some further custom phylogeny instructions.
- Douglas HB, Kundrata R, Brunke AJ, Escalona HE, Chapados JT, Eyres J, Richter R, Savard K, Ślipiński A, McKenna D, Dettman JR. Anchored Phylogenomics, Evolution and Systematics of Elateridae: Are All Bioluminescent Elateroidea Derived Click Beetles? Biology. 2021; 10(6):451. https://doi.org/10.3390/biology10060451
- Hai D. T. Nguyen, Wayne McCormick, Jackson Eyres, Quinn Eggertson, Sarah Hambleton & Jeremy R. Dettman (2021) Development and evaluation of a target enrichment bait set for phylogenetic analysis of oomycetes. Mycologia, 113:4, 856-867. DOI: https://doi.org/10.1080/00275514.2021.1889276
Known Issues
- Fastq files that start with numbers fail with Phyluce.
- rnaSPAdes 3.13.1 sometimes randomly fails to generate a transcripts.fasta for a sample after completing K127. A workaround is to choose one of the K*** assemblies and copy and rename it to transcripts.fasta in the higher-level directory (see the sketch after this list); Snakemake requires a transcripts.fasta for each rnaSPAdes assembly to progress to Phyluce.
- AAFC-specific: Due to an incorrect and difficult-to-fix server-wide implementation of OpenMPI, qsub commands should be run with "qsub -pe smp 1", which prevents Abyss from starting in parallel mode and crashing. SPAdes and rnaSPAdes still appear to use multiple cores as assigned via Snakemake jobs.
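For the rnaSPAdes issue above, the workaround is roughly the following (sample name and chosen K directory are placeholders; the exact file name inside the K directory depends on the rnaSPAdes version, so check what was actually written):
cd rnaspades_assemblies/SampleA                 # per-sample rnaSPAdes output directory (placeholder name)
cp K127/final_contigs.fasta transcripts.fasta   # assumed name of the K127 assembly file; adjust to the file actually present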
Citations
- BioPython - Tools for biological computation
  Cock, P.J.A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009 Jun 1; 25(11): 1422-3. http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878
- Snakemake - Workflow management system
  Köster, Johannes and Rahmann, Sven. "Snakemake - A scalable bioinformatics workflow engine". Bioinformatics 2012.
- SPAdes
  Nurk S. et al. (2013) Assembling Genomes and Mini-metagenomes from Highly Chimeric Reads. In: Deng M., Jiang R., Sun F., Zhang X. (eds) Research in Computational Molecular Biology. RECOMB 2013. Lecture Notes in Computer Science, vol 7821. Springer, Berlin, Heidelberg.
- BBTools
  Brian-JGI (2018). BBTools is a suite of fast, multithreaded bioinformatics tools designed for analysis of DNA and RNA sequence data.
- FASTQC
  Andrews S. (2018). FastQC: a quality control tool for high throughput sequence data. Available online at:
- Phyluce - Target enrichment data analysis
  Faircloth BC. 2016. PHYLUCE is a software package for the analysis of conserved genomic loci. Bioinformatics 32:786-788. doi:10.1093/bioinformatics/btv646.
- Ultraconserved elements
  Faircloth BC, McCormack JE, Crawford NG, Harvey MG, Brumfield RT, Glenn TC. 2012. Ultraconserved elements anchor thousands of genetic markers spanning multiple evolutionary timescales. Systematic Biology 61: 717–726. doi:10.1093/sysbio/SYS004.
- Abyss
  Shaun D Jackman, Benjamin P Vandervalk, Hamid Mohamadi, Justin Chu, Sarah Yeo, S Austin Hammond, Golnaz Jahesh, Hamza Khan, Lauren Coombe, René L Warren, and Inanc Birol (2017). ABySS 2.0: Resource-efficient assembly of large genomes using a Bloom filter. Genome Research, 27(5), 768-777. doi:10.1101/gr.214346.116
Code Snippets
# pipeline_files/count_uces.py - summarizes per-assembler contributions to the merged UCE set
from Bio import SeqIO
import os
import glob
import argparse


def main():
    parser = argparse.ArgumentParser(description='Merges Phyluce UCEs from SPAdes and rnaSPAdes')
    parser.add_argument('-o', type=str, help='Output Folder', required=True)
    parser.add_argument('-i', type=str, help='Input folder of merged fastas', required=True)
    args = parser.parse_args()

    print("Counts merged_uces into a summary file in {} directory".format(args.o))
    count_uces(args.o, args.i)


def count_uces(output_directory, input_directory):
    # Gather each specimen file produced from the Phyluce
    merged_fastas = glob.glob(os.path.join(input_directory, "*_merged.fasta"))

    # Put all the contigs into a single dictionary keyed by specimen name
    specimen_dict = {}
    for fasta in merged_fastas:
        specimen = os.path.basename(fasta)
        specimen_name = specimen.replace("_merged.fasta", "").replace("-", "_")
        with open(fasta) as f:
            count = 0
            abyss_count = 0
            spades_count = 0
            rnaspades_count = 0
            abyss_u_count = 0
            # Contigs are tagged by assembler via the suffix on the sequence id (_AU, _A, _R, _S)
            for seq in SeqIO.parse(fasta, 'fasta'):
                if "_AU" in seq.id[-3:]:
                    abyss_u_count += 1
                elif "_A" in seq.id[-2:]:
                    abyss_count += 1
                elif "_R" in seq.id[-2:]:
                    rnaspades_count += 1
                elif "_S" in seq.id[-2:]:
                    spades_count += 1
                count += 1
            specimen_dict[specimen_name] = [count, abyss_count, abyss_u_count, spades_count, rnaspades_count]

    output_file = os.path.join(output_directory, "merged_uce_summary.csv")
    with open(output_file, "w") as g:
        g.write("Specimen, Merged Targets, Abyss Contribution, Abyss Unmerged Contribution, SPAdes Contribution, rnaSPAdes Contribution\n")
        for key, value in specimen_dict.items():
            g.write("{},{},{},{},{},{}\n".format(key, value[0], value[1], value[2], value[3], value[4]))


if __name__ == "__main__":
    main()
# pipeline_files/evaluate.py - combines Phyluce match logs and statswrapper.sh fastq metrics into a CSV
import os
import argparse


def main():
    parser = argparse.ArgumentParser(description='Combines various log files into a CSV')
    parser.add_argument('-i', type=str, help='UCE Log Input', required=True)
    parser.add_argument('-f', type=str, help='Fastq Metrics from statswrapper.sh', required=True)
    parser.add_argument('-o', type=str, help='UCE Output', required=True)
    args = parser.parse_args()

    summarize_uces(args.i, args.f, args.o)


def summarize_uces(input_path, fastq_metrics, output_path):
    with open(output_path, "w") as g:
        # Map sample names to read counts from the statswrapper.sh table
        reads = {}
        with open(fastq_metrics) as f:
            lines = f.readlines()
            lines.pop(0)
            for line in lines:
                split = line.rstrip().split("\t")
                read_count = split[0]
                file_name = split[-1]
                sample_name = os.path.basename(file_name).\
                    replace("_L001_R1_001.fastq.gz", "").replace("_L001_R2_001.fastq.gz", "")
                reads[sample_name] = read_count

        with open(input_path) as f:
            # Locate the per-specimen block of the Phyluce log, delimited by "INFO - ---" lines
            index = 0
            index_start = 0
            index_end = 0
            lines = f.readlines()
            for line in lines:
                if "INFO - ---" in line:
                    if index_start > 0:
                        index_end = index
                    else:
                        index_start = index
                index += 1
            specimen_lines = lines[index_start + 1: index_end]

            g.write("Species, Reads, Targets, Contigs, Dupes, Targets Filtered, Contigs Filtered\n")
            for line in specimen_lines:
                if "Writing" in line:
                    continue
                sliced = line[76:]
                split = sliced.split(" ")
                species = split[0].replace(":", "")
                species_name = split[0].replace("_A:", "").replace("_S:", "").replace("_R:", "").replace("_AU:", "")
                read_count = 0
                if species_name in reads:
                    read_count = reads[species_name]
                uniques = split[1]
                contigs = split[5]
                dupes = split[7]
                removed = split[11]
                match = split[19]
                g.write("{},{},{},{},{},{},{}\n".format(species, read_count, uniques, contigs, dupes, removed, match))


if __name__ == "__main__":
    main()
# pipeline_files/merge_uces.py - combines UCE targets from the four assemblies, keeping the longest copy of each
from Bio import SeqIO
import os
import glob
import argparse


def main():
    parser = argparse.ArgumentParser(description='Merges Phyluce UCEs from SPAdes and rnaSPAdes')
    parser.add_argument('-o', type=str, help='Output Folder', required=True)
    parser.add_argument('-s', type=str, help='SPAdes exploded-fastas folder', required=True)
    parser.add_argument('-r', type=str, help='rnaSPAdes exploded-fastas folder', required=True)
    parser.add_argument('-a', type=str, help='Abyss exploded-fastas folder', required=True)
    parser.add_argument('-u', type=str, help='Abyss Unmerged exploded-fastas folder', required=True)
    args = parser.parse_args()

    print("Merging SPAdes and rnaSPAdes UCEs together into {} directory".format(args.o))
    combine_uces(args.o, args.s, args.r, args.a, args.u)


def combine_uces(output_directory, spades_directory, rnaspades_directory, abyss_directory, abyss_u_directory):
    """
    Takes the UCEs from various assembly runs and creates a separate file taking only the best sequence per UCE
    :return:
    """
    # Verify folders exist
    if os.path.isdir(spades_directory) and os.path.isdir(rnaspades_directory) and os.path.isdir(abyss_directory):
        pass
    else:
        print("Missing either {} or {} or {}".format(spades_directory, rnaspades_directory, abyss_directory))
        return

    # Gather each specimen file produced from the Phyluce
    spades_fastas = glob.glob(os.path.join(spades_directory, "*.fasta"))
    rnaspades_fastas = glob.glob(os.path.join(rnaspades_directory, "*.fasta"))
    abyss_fastas = glob.glob(os.path.join(abyss_directory, "*.fasta"))
    abyss_u_fastas = glob.glob(os.path.join(abyss_u_directory, "*.fasta"))

    # Put all the contigs into a single dictionary keyed by specimen name
    specimen_dict = {}
    for fasta in spades_fastas:
        specimen = os.path.basename(fasta)
        specimen_name = specimen.replace("-S.unaligned.fasta", "")
        specimen_dict[specimen_name] = [fasta]

    for fasta in rnaspades_fastas:
        specimen = os.path.basename(fasta)
        specimen_name = specimen.replace("-R.unaligned.fasta", "")
        if specimen_name in specimen_dict:
            specimen_dict[specimen_name].append(fasta)

    for fasta in abyss_fastas:
        specimen = os.path.basename(fasta)
        specimen_name = specimen.replace("-A.unaligned.fasta", "")
        if specimen_name in specimen_dict:
            specimen_dict[specimen_name].append(fasta)

    for fasta in abyss_u_fastas:
        specimen = os.path.basename(fasta)
        specimen_name = specimen.replace("-AU.unaligned.fasta", "")
        if specimen_name in specimen_dict:
            specimen_dict[specimen_name].append(fasta)

    # For each specimen, add all the UCEs to a single dictionary from every file, then examine each UCE sequence
    # and choose the one with the greatest length. Write all filtered UCEs to both a merged file and a monolithic file.
    for key, value in specimen_dict.items():
        all_uces = {}
        for fasta in value:
            for seq in SeqIO.parse(fasta, 'fasta'):
                uce = seq.description.split("|")[-1]
                if uce in all_uces:
                    all_uces[uce].append(seq)
                else:
                    all_uces[uce] = [seq]
        print(key, len(all_uces))

        final_uces = []
        for k, v in all_uces.items():
            # Keep only the longest sequence observed for this UCE across the four assemblies
            max_uce = None
            max_length = 0
            for seq in v:
                if len(seq.seq) > max_length:
                    max_uce = seq
                    max_length = len(seq.seq)
            final_uces.append(max_uce)

        # Write final UCEs to a per-specimen merged file
        if not os.path.exists(output_directory):
            os.makedirs(output_directory)
        file_name = str(key) + "_merged.fasta"
        file_path = os.path.join(output_directory, file_name)
        with open(file_path, "w") as f:
            for seq in final_uces:
                SeqIO.write(seq, handle=f, format="fasta")

        # Append renamed records to the monolithic all-taxa file used downstream
        file_name = "all-taxa-incomplete-merged-renamed.fasta"
        file_path = os.path.join(output_directory, file_name)
        with open(file_path, "a") as f:
            for seq in final_uces:
                uce = str(seq.id).split("_")[0]
                specimen = key
                seq.description = "|" + uce
                seq.id = uce + "_" + specimen
                SeqIO.write(seq, handle=f, format="fasta")

    # # Log all the changes made to the SPAdes UCE file to create the merged file
    # file_name = "UCE_Change_Log.txt"
    # file_path = os.path.join(new_directory, file_name)
    # with open(file_path, "a") as f:
    #     f.writelines(uce_change_log)


if __name__ == "__main__":
    main()
# pipeline_files/rename_abyss_contigs.py - renames Abyss contig headers to the SPAdes NODE_ style
from Bio import SeqIO
import os
import glob
import argparse


def main():
    parser = argparse.ArgumentParser(description='Renames Abyss contigs to more closely match SPAdes')
    parser.add_argument("input", type=str, help='Input File')
    parser.add_argument('output', type=str, help='Output File')
    args = parser.parse_args()

    print("Renaming Contigs in {}".format(args.input))
    rename_contigs(args.input, args.output)


def rename_contigs(input, output):
    seqs = []
    with open(input, "r") as f:
        for seq in SeqIO.parse(f, 'fasta'):
            seq.name = ""
            # Abyss headers are "<id> <length> <coverage>"; rebuild them as NODE_<id>_length_<length>_cov_<coverage>
            split = seq.description.split(" ")
            header = "NODE_{}_length_{}_cov_{}".format(split[0], split[1], split[2])
            seq.id = header
            seq.description = ""
            seqs.append(seq)

    with open(output, "w") as g:
        SeqIO.write(seqs, handle=g, format="fasta")


if __name__ == "__main__":
    main()
123 | shell: "statswrapper.sh {input.r1} {input.r2} > {output}" |
135 136 | shell: "fastqc -o fastqc {input.r1} {input.r2}" |
150 | shell: "bbduk.sh in1={input.r1} out1={output.out1} in2={input.r2} out2={output.out2} ref={adaptors} ktrim=r k=23 mink=11 hdist=1 tpe tbo &>{log}; touch {output.out1} {output.out2}" |
163 | shell: "bbmerge.sh in1={input.r1} in2={input.r2} out={output.out_merged} outu={output.out_unmerged} ihist={output.ihist} &>{log}" |
175 176 | shell: "fastqc -o fastqc_trimmed {input.i1} {input.i2} &>{log}" |
189 190 | shell: "multiqc -n multiqc_report.html -o multiqc fastqc; multiqc -n multiqc_report_trimmed.html -o multiqc fastqc_trimmed;" |
205 206 | shell: "spades.py -t {threads} -1 {input.r1} -2 {input.r2} -o spades_assemblies/{wildcards.sample} &>{log}" |
run:
    if os.path.exists(input.assembly):
        if os.path.exists("phyluce-spades/assemblies"):
            pass
        else:
            os.mkdir("phyluce-spades/assemblies")
        copyfile(input.assembly, output.renamed_assembly)
run:
    with open(output.w1, "w") as f:
        f.write("[all]\n")
        for item in SAMPLES:
            f.write(item + "_S\n")
237 | shell: "statswrapper.sh phyluce-spades/assemblies/*.fasta > {output}" |
249 | shell: "rm -r phyluce-spades/uce-search-results; cd phyluce-spades; phyluce_assembly_match_contigs_to_probes --keep-duplicates KEEP_DUPLICATES --contigs assemblies --output uce-search-results --probes ../probes/*.fasta" |
256 | shell: "cd phyluce-spades; phyluce_assembly_get_match_counts --locus-db uce-search-results/probe.matches.sqlite --taxon-list-config taxon.conf --taxon-group 'all' --incomplete-matrix --output taxon-sets/all/all-taxa-incomplete.conf" |
265 | shell: "cd phyluce-spades/taxon-sets/all; mkdir log; phyluce_assembly_get_fastas_from_match_counts --contigs ../../assemblies --locus-db ../../uce-search-results/probe.matches.sqlite --match-count-output all-taxa-incomplete.conf --output all-taxa-incomplete.fasta --incomplete-matrix all-taxa-incomplete.incomplete --log-path log" |
274 | shell: "cd phyluce-spades/taxon-sets/all; rm -r exploded-fastas; phyluce_assembly_explode_get_fastas_file --input all-taxa-incomplete.fasta --output exploded-fastas --by-taxon; phyluce_assembly_explode_get_fastas_file --input all-taxa-incomplete.fasta --output exploded-locus; cd ../../../; touch {output.exploded_fastas}" |
281 | shell: "statswrapper.sh {input} > {output}" |
288 | shell: "python pipeline_files/evaluate.py -i {input.r1} -f {input.f1} -o {output.r2}" |
305 306 | shell: "rnaspades.py -t {threads} -1 {input.r1} -2 {input.r2} -o rnaspades_assemblies/{wildcards.sample} &>{log}" |
run:
    if os.path.exists(input.assembly):
        if os.path.exists("phyluce-rnaspades/assemblies"):
            pass
        else:
            os.mkdir("phyluce-rnaspades/assemblies")
        copyfile(input.assembly, output.renamed_assembly)
run:
    with open(output.w2, "w") as f:
        f.write("[all]\n")
        for item in SAMPLES:
            f.write(item + "_R\n")
339 | shell: "rm -r phyluce-rnaspades/uce-search-results; cd phyluce-rnaspades; phyluce_assembly_match_contigs_to_probes --keep-duplicates KEEP_DUPLICATES --contigs assemblies --output uce-search-results --probes ../probes/*.fasta" |
345 | shell: "cd phyluce-rnaspades; phyluce_assembly_get_match_counts --locus-db uce-search-results/probe.matches.sqlite --taxon-list-config taxon.conf --taxon-group 'all' --incomplete-matrix --output taxon-sets/all/all-taxa-incomplete.conf" |
354 | shell: "cd phyluce-rnaspades/taxon-sets/all; mkdir log; phyluce_assembly_get_fastas_from_match_counts --contigs ../../assemblies --locus-db ../../uce-search-results/probe.matches.sqlite --match-count-output all-taxa-incomplete.conf --output all-taxa-incomplete.fasta --incomplete-matrix all-taxa-incomplete.incomplete --log-path log" |
363 | shell: "cd phyluce-rnaspades/taxon-sets/all; rm -r exploded-fastas; phyluce_assembly_explode_get_fastas_file --input all-taxa-incomplete.fasta --output exploded-fastas --by-taxon; phyluce_assembly_explode_get_fastas_file --input all-taxa-incomplete.fasta --output exploded-locus; cd ../../../; touch {output.exploded_fastas}" |
370 | shell: "statswrapper.sh phyluce-rnaspades/assemblies/*.fasta > {output}" |
377 | shell: "statswrapper.sh {input} > {output}" |
384 | shell: "python pipeline_files/evaluate.py -i {input.r1} -f {input.f1} -o {output.r2}" |
401 402 | shell: "abyss-pe --directory=abyss_assemblies/{wildcards.sample} name={wildcards.sample} k=31 in=../../{input.i2} se=../../{input.i1} &>{log}" |
410 411 | shell: "python pipeline_files/rename_abyss_contigs.py {input} {output}" |
420 421 | shell: "sed -e '/^[^>]/s/[^ATGCatgc]/N/g' {input.assembly} >> {output.renamed_assembly}" |
428 | shell: "statswrapper.sh {input} > {output}" |
run:
    with open(output.w1, "w") as f:
        f.write("[all]\n")
        for item in SAMPLES:
            f.write(item + "_A\n")
446 | shell: "rm -r phyluce-abyss/uce-search-results; cd phyluce-abyss; phyluce_assembly_match_contigs_to_probes --keep-duplicates KEEP_DUPLICATES --contigs assemblies --output uce-search-results --probes ../probes/*.fasta" |
452 | shell: "cd phyluce-abyss; phyluce_assembly_get_match_counts --locus-db uce-search-results/probe.matches.sqlite --taxon-list-config taxon.conf --taxon-group 'all' --incomplete-matrix --output taxon-sets/all/all-taxa-incomplete.conf" |
460 | shell: "cd phyluce-abyss/taxon-sets/all; mkdir log; phyluce_assembly_get_fastas_from_match_counts --contigs ../../assemblies --locus-db ../../uce-search-results/probe.matches.sqlite --match-count-output all-taxa-incomplete.conf --output all-taxa-incomplete.fasta --incomplete-matrix all-taxa-incomplete.incomplete --log-path log" |
469 | shell: "cd phyluce-abyss/taxon-sets/all; rm -r exploded-fastas; phyluce_assembly_explode_get_fastas_file --input all-taxa-incomplete.fasta --output exploded-fastas --by-taxon; phyluce_assembly_explode_get_fastas_file --input all-taxa-incomplete.fasta --output exploded-locus; cd ../../../; touch {output.exploded_fastas}" |
476 | shell: "statswrapper.sh {input} > {output}" |
482 | shell: "python pipeline_files/evaluate.py -i {input.r1} -f {input.f1} -o {output.r2}" |
499 500 | shell: "abyss-pe --directory=abyss_u_assemblies/{wildcards.sample} name={wildcards.sample} k=31 in='../../{input.r1} ../../{input.r2}' &>{log}" |
508 509 | shell: "python pipeline_files/rename_abyss_contigs.py {input} {output}" |
518 519 | shell: "sed -e '/^[^>]/s/[^ATGCatgc]/N/g' {input.assembly} >> {output.renamed_assembly}" |
526 | shell: "statswrapper.sh {input} > {output}" |
run:
    with open(output.w1, "w") as f:
        f.write("[all]\n")
        for item in SAMPLES:
            f.write(item + "_AU\n")
544 | shell: "rm -r phyluce-abyss_u/uce-search-results; cd phyluce-abyss_u; phyluce_assembly_match_contigs_to_probes --keep-duplicates KEEP_DUPLICATES --contigs assemblies --output uce-search-results --probes ../probes/*.fasta" |
550 | shell: "cd phyluce-abyss_u; phyluce_assembly_get_match_counts --locus-db uce-search-results/probe.matches.sqlite --taxon-list-config taxon.conf --taxon-group 'all' --incomplete-matrix --output taxon-sets/all/all-taxa-incomplete.conf" |
558 | shell: "cd phyluce-abyss_u/taxon-sets/all; mkdir log; phyluce_assembly_get_fastas_from_match_counts --contigs ../../assemblies --locus-db ../../uce-search-results/probe.matches.sqlite --match-count-output all-taxa-incomplete.conf --output all-taxa-incomplete.fasta --incomplete-matrix all-taxa-incomplete.incomplete --log-path log" |
567 | shell: "cd phyluce-abyss_u/taxon-sets/all; rm -r exploded-fastas; phyluce_assembly_explode_get_fastas_file --input all-taxa-incomplete.fasta --output exploded-fastas --by-taxon; phyluce_assembly_explode_get_fastas_file --input all-taxa-incomplete.fasta --output exploded-locus; cd ../../../; touch {output.exploded_fastas}" |
574 | shell: "statswrapper.sh {input} > {output}" |
580 | shell: "python pipeline_files/evaluate.py -i {input.r1} -f {input.f1} -o {output.r2}" |
599 | shell: "python pipeline_files/merge_uces.py -o merged_uces -s phyluce-spades/taxon-sets/all/exploded-fastas/ -r phyluce-rnaspades/taxon-sets/all/exploded-fastas/ -a phyluce-abyss/taxon-sets/all/exploded-fastas/ -u phyluce-abyss_u/taxon-sets/all/exploded-fastas/" |
606 | shell: "python pipeline_files/count_uces.py -o summaries -i merged_uces" |
612 | shell: "cat {input} >> {output}" |
Support
- Future updates