Snakemake Pipeline for processing BioMob WP2 partial genome sequencing data


Pipeline for processing Illumina sequencing data generated by target enrichment via hybrid capture experiments. Heavily follows the Phyluce methodology outlined in Tutorial I: UCE Phylogenomics.

  1. Trims Illumina adapters and merges reads together (BBDuk, BBMerge)

  2. Assembles trimmed and merged reads (Abyss, SPAdes, rnaSPAdes)

  3. Detects and extracts target contigs (Phyluce)

  4. Generates summary statistics on targets and assemblies (BBTools Stats)

  5. Provides optional scripts and starting points to perform phylogenetic inference

Prerequisites

  • Miniconda with Snakemake installed:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda install -c bioconda -c conda-forge snakemake

  • Git

Getting Started

Within a working directory:

git clone https://github.com/AAFC-BICoE/snakemake-partial-genome-pipeline.git .
  • Create a folder named "fastq" that contains Illumina based raw reads in fastq.gz format. Fastq files should not begin with numbers, or contain a mix of "_" and "-" characters.

  • Create a folder named "probes" that contains a probe fasta file with fasta headers in Phyluce UCE format

>uce-1_p1
GCTGGTTATC...
>uce-1_p2
TAACAATA....
>uce-2_p1
AAGCATCT...
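
The fastq naming rules above matter because Phyluce derives taxon names from file names (see Known Issues). Below is a minimal pre-flight check, a sketch only; the helper name and messages are illustrative and not part of the pipeline:

import glob
import os

def check_fastq_names(fastq_dir="fastq"):
    # Flag fastq.gz files whose names will break Phyluce downstream
    problems = []
    for path in glob.glob(os.path.join(fastq_dir, "*.fastq.gz")):
        name = os.path.basename(path)
        if name[0].isdigit():
            problems.append("{}: begins with a number".format(name))
        if "_" in name and "-" in name:
            problems.append('{}: mixes "_" and "-"'.format(name))
    for problem in problems:
        print("WARNING:", problem)
    return not problems

if __name__ == "__main__":
    check_fastq_names()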

Dry-run to check that everything is prepared correctly:

snakemake --use-conda -n

To run the pipeline with 32 cores and continue even if some samples fail:

snakemake --use-conda -k --cores 32

To save time on future runs, a central folder of conda environments can be specified so they don't need to be repeatedly rebuilt. There is a path-length limit to this feature, so ensure the central folder is located in the home directory:

snakemake --use-conda --conda-prefix <Path To Snakemake Conda Envs> --cores 32

Pipeline Overview

[Pipeline overview diagram]

Pipeline Summary

This pipeline was heavily inspired by and closely follows protocols developed by Dr. Brant Faircloth and prescribed in Tutorial I: UCE Phylogenomics. Software versions are listed in the Conda yml environment files, and the specific parameters and commands are in the Snakefile.

Illumina paired-end reads from target enrichment sequencing are trimmed of adapters using BBDuk. A copy of the trimmed fastq reads is merged using BBMerge. The unmerged reads are assembled using SPAdes, rnaSPAdes and Abyss. Merging paired-end reads prior to assembly with Abyss demonstrated a noticeable impact on the number of targets detected by Phyluce, whereas merging had a negligible impact with SPAdes and rnaSPAdes; therefore the merged reads were assembled using Abyss.

Phyluce, along with the probe set used in the target enrichment experiment, processes each assembly independently. This generates four separate Phyluce databases of probe hits and UCE target contigs. Because target detection varies heavily with assembly method, all detected targets are combined into a unique set per sample: the custom script merge_uces.py examines every UCE detected across the four assemblies of a sample, combines all targets, and keeps only the longest copy of any target found in multiple assemblies. This unique set of merged targets dramatically increases the amount of data available for phylogenetic inference, while the unmodified assemblies remain available for processing if required.
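
The heart of merge_uces.py (shown in full under Code Snippets) is a longest-wins selection per UCE locus. The sketch below condenses that step; the per-assembly file names are only illustrating the -S/-R/-A/-AU suffix convention used by the pipeline:

from Bio import SeqIO

# Illustrative inputs following the pipeline's suffix convention:
# SPAdes (-S), rnaSPAdes (-R), Abyss merged (-A), Abyss unmerged (-AU)
per_assembly_fastas = ["sample-S.unaligned.fasta", "sample-R.unaligned.fasta",
                       "sample-A.unaligned.fasta", "sample-AU.unaligned.fasta"]

uce_variants = {}  # UCE locus id -> every sequence recovered for it
for fasta in per_assembly_fastas:
    for seq in SeqIO.parse(fasta, "fasta"):
        uce = seq.description.split("|")[-1]  # Phyluce appends "|uce-N"
        uce_variants.setdefault(uce, []).append(seq)

# Longest-wins: keep a single representative sequence per locus
merged = [max(variants, key=len) for variants in uce_variants.values()]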

The merged targets are concatenated into a single file that substitutes for the Phyluce-generated all-taxa-incomplete.fasta, the entry point of the Phyluce phylogeny workflow. A rapid phylogeny is generated for quality-control examination; example commands are provided in the script phylogeny.sh. Phyluce aligns all UCE targets using Mafft, trims the alignments using Gblocks, and removes any targets not present in 50% or more of samples. The resulting phylip file serves as the entry point for RAxML or IQ-TREE, which produces a rapid phylogeny for quality control and for detecting sample or sequencing errors.
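
The 50% rule amounts to keeping a locus only when at least half of the samples contribute a sequence to its alignment. A minimal sketch of that filter, assuming one fasta alignment per locus (the directory name and sample count are placeholders; the pipeline itself delegates this step to Phyluce):

import glob
import os
from Bio import SeqIO

alignment_dir = "mafft-gblocks-clean"  # placeholder path
n_samples = 40                         # placeholder total taxon count

kept = []
for aln in glob.glob(os.path.join(alignment_dir, "*.fasta")):
    taxa = sum(1 for _ in SeqIO.parse(aln, "fasta"))
    if taxa >= 0.5 * n_samples:  # locus present in >= 50% of samples
        kept.append(aln)
print("Loci passing the 50% completeness filter: {}".format(len(kept)))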

Author

Jackson Eyres
Bioinformatics Programmer
Agriculture & Agri-Food Canada
[email protected]

Copyright

Government of Canada, Agriculture & Agri-Food Canada

License

This project is licensed under the MIT License - see the LICENSE file for details

Publications & Additional Resources

  1. Brunke, A. J., Hansen, A. K., Salnitska, M., Kypke, J. L., Escalona, H., Chapados, J. T., Eyres, J., Richter, R., Smetana, A., Ślipiński, A., Zwick, A., Hájek, J., Leschen, R., Solodovnikov, A. and Dettman, J. R. The limits of Quediini at last (Coleoptera: Staphylinidae: Staphylininae): a rove beetle mega-radiation resolved by comprehensive sampling and anchored phylogenomics. Systematic Entomology. Accepted. 1–36.

  2. Dr. Adam Brunke provides some further custom phylogeny instructions

  3. Douglas HB, Kundrata R, Brunke AJ, Escalona HE, Chapados JT, Eyres J, Richter R, Savard K, Ślipiński A, McKenna D, Dettman JR. Anchored Phylogenomics, Evolution and Systematics of Elateridae: Are All Bioluminescent Elateroidea Derived Click Beetles? Biology. 2021; 10(6):451. https://doi.org/10.3390/biology10060451

  4. Hai D. T. Nguyen, Wayne McCormick, Jackson Eyres, Quinn Eggertson, Sarah Hambleton & Jeremy R. Dettman (2021) Development and evaluation of a target enrichment bait set for phylogenetic analysis of oomycetes, Mycologia, 113:4, 856-867, DOI: https://doi.org/10.1080/00275514.2021.1889276

Known Issues

  • Fastq files that start with numbers fail with Phyluce

  • rnaSPAdes 3.13.1 sometimes randomly fails to generate a transcripts.fasta for a sample after completing K127. A workaround is to choose one of the K*** assemblies and copy and rename it to transcripts.fasta in the higher-level directory (a scripted sketch follows this list). Snakemake requires a transcripts.fasta for each rnaSPAdes assembly to progress to Phyluce.

  • AAFC-specific: due to an incorrect and difficult-to-fix server-wide implementation of OpenMPI, qsub commands should be run with "qsub -pe smp 1", which prevents Abyss from starting in parallel mode and crashing. However, SPAdes and rnaSPAdes appear to still use multiple cores as assigned via Snakemake jobs.
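
The transcripts.fasta workaround mentioned above can be scripted. A minimal sketch, assuming the affected sample sits under rnaspades_assemblies/ and that each K directory contains a final_contigs.fasta (that file name is an assumption and may differ between rnaSPAdes versions):

import glob
import os
import shutil

sample_dir = "rnaspades_assemblies/SAMPLE"  # placeholder sample path
target = os.path.join(sample_dir, "transcripts.fasta")

if not os.path.exists(target):
    # Fall back to the highest-K intermediate assembly (e.g. K127)
    k_dirs = glob.glob(os.path.join(sample_dir, "K*"))
    k_dirs.sort(key=lambda d: int(os.path.basename(d)[1:]))
    for k_dir in reversed(k_dirs):
        fallback = os.path.join(k_dir, "final_contigs.fasta")
        if os.path.exists(fallback):
            shutil.copyfile(fallback, target)  # rename into place for Snakemake
            break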

Citations

  • BioPython - Tools for biological computation
    Cock, P.J.A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009 Jun 1; 25(11) 1422-3 http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878

  • Snakemake - Workflow management system
    Köster, Johannes and Rahmann, Sven. “Snakemake - A scalable bioinformatics workflow engine”. Bioinformatics 2012.

  • SPAdes
    Nurk S. et al. (2013) Assembling Genomes and Mini-metagenomes from Highly Chimeric Reads. In: Deng M., Jiang R., Sun F., Zhang X. (eds) Research in Computational Molecular Biology. RECOMB 2013. Lecture Notes in Computer Science, vol 7821. Springer, Berlin, Heidelberg

  • BBTools
    Brian-JGI (2018) BBTools is a suite of fast, multithreaded bioinformatics tools designed for analysis of DNA and RNA sequence data.

  • FASTQC
    Andrews S. (2018). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

  • Phyluce - Target enrichment data analysis
    Faircloth BC. 2016. PHYLUCE is a software package for the analysis of conserved genomic loci. Bioinformatics 32:786-788. doi:10.1093/bioinformatics/btv646.

  • Ultraconserved elements
    Faircloth BC, McCormack JE, Crawford NG, Harvey MG, Brumfield RT, Glenn TC. 2012. Ultraconserved elements anchor thousands of genetic markers spanning multiple evolutionary timescales. Systematic Biology 61: 717–726. doi:10.1093/sysbio/sys004.

  • Abyss
    Shaun D Jackman, Benjamin P Vandervalk, Hamid Mohamadi, Justin Chu, Sarah Yeo, S Austin Hammond, Golnaz Jahesh, Hamza Khan, Lauren Coombe, René L Warren, and Inanc Birol (2017). ABySS 2.0: Resource-efficient assembly of large genomes using a Bloom filter. Genome research, 27(5), 768-777. doi:10.1101/gr.214346.116

Code Snippets

pipeline_files/count_uces.py (lines 7-64):
from Bio import SeqIO
import os
import glob
import argparse


def main():
    parser = argparse.ArgumentParser(description='Counts merged Phyluce UCE targets into a summary CSV')
    parser.add_argument('-o', type=str,
                        help='Output Folder', required=True)
    parser.add_argument('-i', type=str,
                        help='Input folder of merged fastas', required=True)
    args = parser.parse_args()
    print("Counts merged_uces into a summary file in {} directory".format(args.o))

    count_uces(args.o, args.i)


def count_uces(output_directory, input_directory):
    # Gather each specimen file produced by Phyluce
    merged_fastas = glob.glob(os.path.join(input_directory, "*_merged.fasta"))

    # Tally total and per-assembler target counts for each specimen
    specimen_dict = {}
    for fasta in merged_fastas:
        specimen = os.path.basename(fasta)
        specimen_name = specimen.replace("_merged.fasta", "").replace("-", "_")
        count = 0
        abyss_count = 0
        spades_count = 0
        rnaspades_count = 0
        abyss_u_count = 0
        with open(fasta) as f:
            for seq in SeqIO.parse(f, 'fasta'):
                # The assembler of origin is encoded as a suffix
                # (_A, _AU, _S or _R) on the sequence id
                if "_AU" in seq.id[-3:]:
                    abyss_u_count += 1
                elif "_A" in seq.id[-2:]:
                    abyss_count += 1
                elif "_R" in seq.id[-2:]:
                    rnaspades_count += 1
                elif "_S" in seq.id[-2:]:
                    spades_count += 1
                count += 1
        specimen_dict[specimen_name] = [count, abyss_count, abyss_u_count, spades_count, rnaspades_count]

    output_file = os.path.join(output_directory, "merged_uce_summary.csv")
    with open(output_file, "w") as g:
        g.write("Specimen, Merged Targets, Abyss Contribution, Abyss Unmerged Contribution, SPAdes Contribution, rnaSPAdes Contribution\n")
        for key, value in specimen_dict.items():
            g.write("{},{},{},{},{},{}\n".format(key, value[0],value[1],value[2],value[3],value[4]))


if __name__ == "__main__":
    main()
pipeline_files/evaluate.py (lines 7-75):
import os
import argparse


def main():
    parser = argparse.ArgumentParser(description='Combines various log files into a CSV')
    parser.add_argument('-i', type=str,
                        help='UCE Log Input', required=True)
    parser.add_argument('-f', type=str,
                        help='Fastq Metrics from statswrapper.sh', required=True)
    parser.add_argument('-o', type=str,
                        help='UCE Output', required=True)

    args = parser.parse_args()
    summarize_uces(args.i, args.f, args.o)


def summarize_uces(input_path, fastq_metrics, output_path):
    with open(output_path, "w") as g:
        reads = {}

        with open(fastq_metrics) as f:
            lines = f.readlines()
            lines.pop(0)
            for line in lines:
                split = line.rstrip().split("\t")
                read_count = split[0]
                file_name = split[-1]
                sample_name = os.path.basename(file_name).\
                    replace("_L001_R1_001.fastq.gz", "").replace("_L001_R2_001.fastq.gz", "")
                reads[sample_name] = read_count

        with open(input_path) as f:

            # Locate the block of per-specimen summary lines, which the
            # Phyluce log brackets with "INFO - ---" divider lines
            index = 0
            index_start = 0
            index_end = 0
            lines = f.readlines()
            for line in lines:
                if "INFO - ---" in line:
                    if index_start > 0:
                        index_end = index
                    else:
                        index_start = index
                index += 1

            specimen_lines = lines[index_start+1: index_end]
            g.write("Species, Reads, Targets, Contigs, Dupes, Targets Filtered, Contigs Filtered\n")
            for line in specimen_lines:
                if "Writing" in line:
                    continue
                # Drop the fixed-width log timestamp/module prefix before
                # splitting the summary sentence on spaces
                sliced = line[76:]
                split = sliced.split(" ")
                species = split[0].replace(":", "")
                species_name = split[0].replace("_A:", "").replace("_S:", "").replace("_R:", "").replace("_AU:", "")
                read_count = 0
                if species_name in reads:
                    read_count = reads[species_name]
                uniques = split[1]
                contigs = split[5]
                dupes = split[7]
                removed = split[11]
                match = split[19]

                g.write("{},{},{},{},{},{},{}\n".format(species, read_count, uniques, contigs, dupes, removed, match))


if __name__ == "__main__":
    main()
pipeline_files/merge_uces.py (lines 9-126):
from Bio import SeqIO
import os
import glob
import argparse


def main():
    parser = argparse.ArgumentParser(description='Merges Phyluce UCEs from SPAdes, rnaSPAdes and Abyss assemblies')
    parser.add_argument('-o', type=str,
                        help='Output Folder', required=True)
    parser.add_argument('-s', type=str,
                        help='SPAdes exploded-fastas folder', required=True)
    parser.add_argument('-r', type=str,
                        help='rnaSPAdes exploded-fastas folder', required=True)
    parser.add_argument('-a', type=str,
                        help='Abyss exploded-fastas folder', required=True)
    parser.add_argument('-u', type=str,
                        help='Abyss Unmerged exploded-fastas folder', required=True)
    args = parser.parse_args()
    print("Merging SPAdes and rnaSPAdes UCEs together into {} directory".format(args.o))

    combine_uces(args.o, args.s, args.r, args.a, args.u)


def combine_uces(output_directory, spades_directory, rnaspades_directory, abyss_directory, abyss_u_directory):
    """
    Takes the UCEs from the four assembly runs and creates a separate file keeping only the best (longest) sequence per UCE
    :return:
    """

    # Verify all four input folders exist
    if not all(os.path.isdir(d) for d in
               (spades_directory, rnaspades_directory, abyss_directory, abyss_u_directory)):
        print("Missing one of {}, {}, {} or {}".format(
            spades_directory, rnaspades_directory, abyss_directory, abyss_u_directory))
        return

    # Gather each specimen file produced from the Phyluce
    spades_fastas = glob.glob(os.path.join(spades_directory, "*.fasta"))
    rnaspades_fastas = glob.glob(os.path.join(rnaspades_directory, "*.fasta"))
    abyss_fastas = glob.glob(os.path.join(abyss_directory, "*.fasta"))
    abyss_u_fastas = glob.glob(os.path.join(abyss_u_directory, "*.fasta"))
    # Put all the contigs into a single dictionary
    specimen_dict = {}
    for fasta in spades_fastas:
        specimen = os.path.basename(fasta)
        specimen_name = specimen.replace("-S.unaligned.fasta", "")
        specimen_dict[specimen_name] = [fasta]

    for fasta in rnaspades_fastas:
        specimen = os.path.basename(fasta)
        specimen_name = specimen.replace("-R.unaligned.fasta", "")
        if specimen_name in specimen_dict:
            specimen_dict[specimen_name].append(fasta)

    for fasta in abyss_fastas:
        specimen = os.path.basename(fasta)
        specimen_name = specimen.replace("-A.unaligned.fasta", "")
        if specimen_name in specimen_dict:
            specimen_dict[specimen_name].append(fasta)

    for fasta in abyss_u_fastas:
        specimen = os.path.basename(fasta)
        specimen_name = specimen.replace("-AU.unaligned.fasta", "")
        if specimen_name in specimen_dict:
            specimen_dict[specimen_name].append(fasta)

    # For each specimen, add all the UCES to a single dictionary from every file, then examine each UCE sequence and
    # choose the one with the greatest length. Write all filtered UCEs to both a merged file, and monolithic file
    for key, value in specimen_dict.items():
        all_uces = {}
        for fasta in value:
            for seq in SeqIO.parse(fasta, 'fasta'):
                uce = seq.description.split("|")[-1]
                if uce in all_uces:
                    all_uces[uce].append(seq)
                else:
                    all_uces[uce] = [seq]
        print(key, len(all_uces))

        # Keep only the longest sequence recovered for each UCE locus
        final_uces = []
        for k, v in all_uces.items():
            max_uce = None
            max_length = 0
            for seq in v:
                if len(seq.seq) > max_length:
                    max_uce = seq
                    max_length = len(seq.seq)
            final_uces.append(max_uce)

        # Write Final UCES to merged file
        if not os.path.exists(output_directory):
            os.makedirs(output_directory)

        file_name = str(key) + "_merged.fasta"
        file_path = os.path.join(output_directory, file_name)
        with open(file_path, "w") as f:
            for seq in final_uces:
                SeqIO.write(seq, handle=f, format="fasta")

        file_name = "all-taxa-incomplete-merged-renamed.fasta"
        file_path = os.path.join(output_directory, file_name)
        with open(file_path, "a") as f:
            for seq in final_uces:
                uce = str(seq.id).split("_")[0]
                specimen = key
                seq.description = "|" + uce
                seq.id = uce + "_" + specimen
                SeqIO.write(seq, handle=f, format="fasta")

        # # Log all the changes made to the SPAdes UCE file to create the merged file
        # file_name = "UCE_Change_Log.txt"
        # file_path = os.path.join(new_directory, file_name)
        # with open(file_path, "a") as f:
        #     f.writelines(uce_change_log)


if __name__ == "__main__":
    main()
pipeline_files/rename_abyss_contigs.py (lines 6-38):
from Bio import SeqIO
import os
import glob
import argparse

def main():
    parser = argparse.ArgumentParser(description='Renames Abyss contigs to more closely match SPAdes')
    parser.add_argument("input", type=str,
                        help='Input File')
    parser.add_argument('output', type=str,
                        help='Output File')
    args = parser.parse_args()
    print("Renaming Contigs in {}".format(args.input))

    rename_contigs(args.input, args.output)

def rename_contigs(input, output):
    seqs = []
    with open(input, "r") as f:
        for seq in SeqIO.parse(f, 'fasta'):
            seq.name = ""
            # ABySS headers are whitespace-delimited (id, length, coverage);
            # reshape them into SPAdes-style NODE ids
            split = seq.description.split(" ")
            header = "NODE_{}_length_{}_cov_{}".format(split[0], split[1], split[2])
            seq.id = header
            seq.description = ""
            seqs.append(seq)

    with open(output, "w") as g:
        SeqIO.write(seqs, handle=g, format="fasta")


if __name__ == "__main__":
    main()
From line 123 of master/Snakefile:

shell: "statswrapper.sh {input.r1} {input.r2} > {output}"

From lines 135-136 of master/Snakefile:

shell:
    "fastqc -o fastqc {input.r1} {input.r2}"

From line 150 of master/Snakefile:

shell: "bbduk.sh in1={input.r1} out1={output.out1} in2={input.r2} out2={output.out2} ref={adaptors} ktrim=r k=23 mink=11 hdist=1 tpe tbo &>{log}; touch {output.out1} {output.out2}"

From line 163 of master/Snakefile:

shell: "bbmerge.sh in1={input.r1} in2={input.r2} out={output.out_merged} outu={output.out_unmerged} ihist={output.ihist} &>{log}"

From lines 175-176 of master/Snakefile:

shell:
    "fastqc -o fastqc_trimmed {input.i1} {input.i2} &>{log}"

From lines 189-190 of master/Snakefile:

shell:
    "multiqc -n multiqc_report.html -o multiqc fastqc; multiqc -n multiqc_report_trimmed.html -o multiqc fastqc_trimmed;"

From lines 205-206 of master/Snakefile:

shell:
    "spades.py -t {threads} -1 {input.r1} -2 {input.r2} -o spades_assemblies/{wildcards.sample} &>{log}"
From lines 215-221 of master/Snakefile:

run:
    if os.path.exists(input.assembly):
        if not os.path.exists("phyluce-spades/assemblies"):
            os.mkdir("phyluce-spades/assemblies")
        copyfile(input.assembly, output.renamed_assembly)
From lines 226-230 of master/Snakefile:

run:
    with open(output.w1, "w") as f:
        f.write("[all]\n")
        for item in SAMPLES:
            f.write(item + "_S\n")

From line 237 of master/Snakefile:

shell: "statswrapper.sh phyluce-spades/assemblies/*.fasta > {output}"

From line 249 of master/Snakefile:

shell: "rm -r phyluce-spades/uce-search-results; cd phyluce-spades; phyluce_assembly_match_contigs_to_probes --keep-duplicates KEEP_DUPLICATES --contigs assemblies --output uce-search-results --probes ../probes/*.fasta"

From line 256 of master/Snakefile:

shell: "cd phyluce-spades; phyluce_assembly_get_match_counts --locus-db uce-search-results/probe.matches.sqlite --taxon-list-config taxon.conf --taxon-group 'all' --incomplete-matrix --output taxon-sets/all/all-taxa-incomplete.conf"

From line 265 of master/Snakefile:

shell: "cd phyluce-spades/taxon-sets/all; mkdir log; phyluce_assembly_get_fastas_from_match_counts --contigs ../../assemblies --locus-db ../../uce-search-results/probe.matches.sqlite --match-count-output all-taxa-incomplete.conf --output all-taxa-incomplete.fasta --incomplete-matrix all-taxa-incomplete.incomplete --log-path log"

From line 274 of master/Snakefile:

shell: "cd phyluce-spades/taxon-sets/all; rm -r exploded-fastas; phyluce_assembly_explode_get_fastas_file --input all-taxa-incomplete.fasta --output exploded-fastas --by-taxon; phyluce_assembly_explode_get_fastas_file --input all-taxa-incomplete.fasta --output exploded-locus; cd ../../../; touch {output.exploded_fastas}"

From line 281 of master/Snakefile:

shell: "statswrapper.sh {input} > {output}"

From line 288 of master/Snakefile:

shell: "python pipeline_files/evaluate.py -i {input.r1} -f {input.f1} -o {output.r2}"

From lines 305-306 of master/Snakefile:

shell:
    "rnaspades.py -t {threads} -1 {input.r1} -2 {input.r2} -o rnaspades_assemblies/{wildcards.sample} &>{log}"
From lines 314-320 of master/Snakefile:

run:
    if os.path.exists(input.assembly):
        if not os.path.exists("phyluce-rnaspades/assemblies"):
            os.mkdir("phyluce-rnaspades/assemblies")
        copyfile(input.assembly, output.renamed_assembly)
From lines 325-329 of master/Snakefile:

run:
    with open(output.w2, "w") as f:
        f.write("[all]\n")
        for item in SAMPLES:
            f.write(item + "_R\n")

From line 339 of master/Snakefile:

shell: "rm -r phyluce-rnaspades/uce-search-results; cd phyluce-rnaspades; phyluce_assembly_match_contigs_to_probes --keep-duplicates KEEP_DUPLICATES --contigs assemblies --output uce-search-results --probes ../probes/*.fasta"

From line 345 of master/Snakefile:

shell: "cd phyluce-rnaspades; phyluce_assembly_get_match_counts --locus-db uce-search-results/probe.matches.sqlite --taxon-list-config taxon.conf --taxon-group 'all' --incomplete-matrix --output taxon-sets/all/all-taxa-incomplete.conf"

From line 354 of master/Snakefile:

shell: "cd phyluce-rnaspades/taxon-sets/all; mkdir log; phyluce_assembly_get_fastas_from_match_counts --contigs ../../assemblies --locus-db ../../uce-search-results/probe.matches.sqlite --match-count-output all-taxa-incomplete.conf --output all-taxa-incomplete.fasta --incomplete-matrix all-taxa-incomplete.incomplete --log-path log"

From line 363 of master/Snakefile:

shell: "cd phyluce-rnaspades/taxon-sets/all; rm -r exploded-fastas; phyluce_assembly_explode_get_fastas_file --input all-taxa-incomplete.fasta --output exploded-fastas --by-taxon; phyluce_assembly_explode_get_fastas_file --input all-taxa-incomplete.fasta --output exploded-locus; cd ../../../; touch {output.exploded_fastas}"

From line 370 of master/Snakefile:

shell: "statswrapper.sh phyluce-rnaspades/assemblies/*.fasta > {output}"

From line 377 of master/Snakefile:

shell: "statswrapper.sh {input} > {output}"

From line 384 of master/Snakefile:

shell: "python pipeline_files/evaluate.py -i {input.r1} -f {input.f1} -o {output.r2}"

From lines 401-402 of master/Snakefile:

shell:
    "abyss-pe --directory=abyss_assemblies/{wildcards.sample} name={wildcards.sample} k=31 in=../../{input.i2} se=../../{input.i1} &>{log}"

From lines 410-411 of master/Snakefile:

shell:
    "python pipeline_files/rename_abyss_contigs.py {input} {output}"

From lines 420-421 of master/Snakefile:

shell:
    "sed -e '/^[^>]/s/[^ATGCatgc]/N/g' {input.assembly} >> {output.renamed_assembly}"

From line 428 of master/Snakefile:

shell: "statswrapper.sh {input} > {output}"

From lines 433-437 of master/Snakefile:

run:
    with open(output.w1, "w") as f:
        f.write("[all]\n")
        for item in SAMPLES:
            f.write(item + "_A\n")

From line 446 of master/Snakefile:

shell: "rm -r phyluce-abyss/uce-search-results; cd phyluce-abyss; phyluce_assembly_match_contigs_to_probes --keep-duplicates KEEP_DUPLICATES --contigs assemblies --output uce-search-results --probes ../probes/*.fasta"

From line 452 of master/Snakefile:

shell: "cd phyluce-abyss; phyluce_assembly_get_match_counts --locus-db uce-search-results/probe.matches.sqlite --taxon-list-config taxon.conf --taxon-group 'all' --incomplete-matrix --output taxon-sets/all/all-taxa-incomplete.conf"

From line 460 of master/Snakefile:

shell: "cd phyluce-abyss/taxon-sets/all; mkdir log; phyluce_assembly_get_fastas_from_match_counts --contigs ../../assemblies --locus-db ../../uce-search-results/probe.matches.sqlite --match-count-output all-taxa-incomplete.conf --output all-taxa-incomplete.fasta --incomplete-matrix all-taxa-incomplete.incomplete --log-path log"

From line 469 of master/Snakefile:

shell: "cd phyluce-abyss/taxon-sets/all; rm -r exploded-fastas; phyluce_assembly_explode_get_fastas_file --input all-taxa-incomplete.fasta --output exploded-fastas --by-taxon; phyluce_assembly_explode_get_fastas_file --input all-taxa-incomplete.fasta --output exploded-locus; cd ../../../; touch {output.exploded_fastas}"

From line 476 of master/Snakefile:

shell: "statswrapper.sh {input} > {output}"

From line 482 of master/Snakefile:

shell: "python pipeline_files/evaluate.py -i {input.r1} -f {input.f1} -o {output.r2}"
From lines 499-500 of master/Snakefile:

shell:
    "abyss-pe --directory=abyss_u_assemblies/{wildcards.sample} name={wildcards.sample} k=31 in='../../{input.r1} ../../{input.r2}' &>{log}"

From lines 508-509 of master/Snakefile:

shell:
    "python pipeline_files/rename_abyss_contigs.py {input} {output}"

From lines 518-519 of master/Snakefile:

shell:
    "sed -e '/^[^>]/s/[^ATGCatgc]/N/g' {input.assembly} >> {output.renamed_assembly}"

From line 526 of master/Snakefile:

shell: "statswrapper.sh {input} > {output}"

From lines 531-535 of master/Snakefile:

run:
    with open(output.w1, "w") as f:
        f.write("[all]\n")
        for item in SAMPLES:
            f.write(item + "_AU\n")
From line 544 of master/Snakefile:

shell: "rm -r phyluce-abyss_u/uce-search-results; cd phyluce-abyss_u; phyluce_assembly_match_contigs_to_probes --keep-duplicates KEEP_DUPLICATES --contigs assemblies --output uce-search-results --probes ../probes/*.fasta"

From line 550 of master/Snakefile:

shell: "cd phyluce-abyss_u; phyluce_assembly_get_match_counts --locus-db uce-search-results/probe.matches.sqlite --taxon-list-config taxon.conf --taxon-group 'all' --incomplete-matrix --output taxon-sets/all/all-taxa-incomplete.conf"

From line 558 of master/Snakefile:

shell: "cd phyluce-abyss_u/taxon-sets/all; mkdir log; phyluce_assembly_get_fastas_from_match_counts --contigs ../../assemblies --locus-db ../../uce-search-results/probe.matches.sqlite --match-count-output all-taxa-incomplete.conf --output all-taxa-incomplete.fasta --incomplete-matrix all-taxa-incomplete.incomplete --log-path log"

From line 567 of master/Snakefile:

shell: "cd phyluce-abyss_u/taxon-sets/all; rm -r exploded-fastas; phyluce_assembly_explode_get_fastas_file --input all-taxa-incomplete.fasta --output exploded-fastas --by-taxon; phyluce_assembly_explode_get_fastas_file --input all-taxa-incomplete.fasta --output exploded-locus; cd ../../../; touch {output.exploded_fastas}"

From line 574 of master/Snakefile:

shell: "statswrapper.sh {input} > {output}"

From line 580 of master/Snakefile:

shell: "python pipeline_files/evaluate.py -i {input.r1} -f {input.f1} -o {output.r2}"

From line 599 of master/Snakefile:

shell: "python pipeline_files/merge_uces.py -o merged_uces -s phyluce-spades/taxon-sets/all/exploded-fastas/ -r phyluce-rnaspades/taxon-sets/all/exploded-fastas/ -a phyluce-abyss/taxon-sets/all/exploded-fastas/ -u phyluce-abyss_u/taxon-sets/all/exploded-fastas/"

From line 606 of master/Snakefile:

shell: "python pipeline_files/count_uces.py -o summaries -i merged_uces"

From line 612 of master/Snakefile:

shell: "cat {input} >> {output}"