"DTR Phage Genome Polishing Pipeline"


DTR phage pipeline

The DTR phage pipeline is a Snakemake pipeline for obtaining polished, full-length dsDNA bacteriophage genomes with direct terminal repeats (DTRs) from environmental samples.

Getting started

Dependencies

The pipeline relies heavily on conda to supply Snakemake and the software packages required for the various steps in the analysis. If conda is not currently installed on your system, please first follow the installation instructions for miniconda (installing it is highly encouraged!).

Snakemake is a workflow management system that uses a Python-based language and is ideally suited for building reproducible and scalable bioinformatics pipelines. If you are unfamiliar with Snakemake, this tutorial is a good place to start.

Installation

First, clone the repository to the desired location and enter the cloned directory:

git clone https://github.com/nanoporetech/DTR-phage-pipeline.git
cd DTR-phage-pipeline

Next, create a conda environment where you will run Snakemake:

conda env create -f environment.yml

That's all. Ideally, conda should take care of all the remaining dependencies when each specific Snakemake step (described below) is executed.
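Before running any Snakemake commands, activate the environment you just created. The environment name below is only a placeholder; use the name defined in the name: field of environment.yml:

conda activate dtr-phage-pipeline  # hypothetical environment name; check environment.yml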

If conda is slow

For tips on speeding up conda, see the mamba section below.
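As a rough sketch, assuming you are willing to install mamba into your base environment as a drop-in replacement for conda, the environment can be created with:

conda install -n base -c conda-forge mamba
mamba env create -f environment.yml

If your Snakemake version supports it, the per-rule environments can also be built with mamba via the --conda-frontend mamba option.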

Testing

To verify that everything is set up correctly, simply run the workflow without modifying config.yml:

snakemake --use-conda -p -r -j 10

Pipeline configuration

The pipeline determines the input files, output path, and the various software parameters from the file config.yml, which contains key:value pairs that control how the pipeline runs.

  1. Attention must be paid to the first section of config.yml ('Editing mandatory'), as these values might change from run to run.

    • sample : name of the sample (e.g. 25m)

    • stype : sample type (e.g. 1D or filtered_02um, etc)

    • version : run version (for changing run parameters)

    • input_fasta : FASTA file containing nanopore reads to be analyzed

    • input_summary : Sequencing summary file associated with the FASTA file

    • output_root : Base directory where pipeline results will be written

    • max_threads : Maximum number of threads to allocate for each rule

    • MEDAKA:model : Medaka model to use for polishing the draft genome (e.g. r941_min_high_g303 ). See the Medaka documentation for details on selecting the proper model for your basecalled reads.

    • MEDAKA:threads : Number of threads to use for each Medaka genome polishing. Should be << max_threads so that multiple genomes can be polished in parallel.

    • KAIJU:db : Path to the Kaiju database to use for taxonomic annotation. The folder should contain the *.fmi , names.dmp , and nodes.dmp files. These can be downloaded from the Kaiju web server; the nr+euk database is recommended.

    • KAIJU:switch : Parameters to use for Kaiju annotation. These parameters should suffice in most cases.

    • SKIP_BINS:<sample>:<stype>:<version> : List of k-mer bins to skip for analysis. HDBSCAN assigns all unbinned reads to bin_id = -1, which should always be skipped. Occasionally downstream errors in the pipeline indicate that additional bad k-mer bins should also be skipped.

  2. The second section of config.yml ('Editing optional') contains software parameters that can be adjusted, but should perform sufficiently well in most cases using the default values.

    • 'BIN_AVA_ALGN:CLUST:max_recursion': If you get an error in this step from reaching Python's max recursion depth, you can try increasing the limit here if you have the system resources available.

Pipeline execution

The full pipeline, starting from raw reads and ending with nanopore-polished phage genomes, can be executed in one step.

snakemake --use-conda -p -j <nproc> -r

Here <nproc> is the maximum number of cores to use across all tasks in the pipeline. The output files for this workflow are placed in a path determined by the variables set in config.yml. In this description, <run_output_dir> refers to the path <output_root>/<sample>/<stype>/<version> as defined in config.yml.
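For example, with the illustrative values from the configuration section (sample 25m, stype 1D, and a version string of v1 chosen here only as a placeholder), <run_output_dir> would resolve to:

<output_root>/25m/1D/v1

Each of the steps described below corresponds to a Snakemake target and can also be requested individually by name, e.g. for the first step:

snakemake --use-conda -p -j <nproc> -r all_kmer_count_and_bin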

all_kmer_count_and_bin

The first step provides summary plots for the input reads, identifies reads containing DTR sequences, creates k-mer count vectors, embeds these vectors into 2-d via UMAP, and calls bins in the embedding via HDBSCAN.

The outputs from this step will fall into three directories:

  • <run_output_dir>/reads_summary

    • reads.summary.stats.png : Read length and qscore distributions for all input reads
  • <run_output_dir>/dtr_reads

    • output.dtr.fasta : FASTA file of all DTR-containing reads

    • output.dtr.hist.png : Read length distribution of all DTR-containing reads

    • output.dtr.stats.tsv : Statistics for each DTR-containing read

  • <run_output_dir>/kmer_binning

    • bin_membership.tsv : bin assignments for each DTR-containing read (used in the example after this list)

    • kmer_comp.tsv : 5-mer count vectors for each DTR-containing read

    • kmer_comp.umap.tsv : x- and y-coordinates for each read in the 2-d embedding

    • kmer_comp.umap.*.png : A variety of scatter plots of the UMAP embedding of k-mer count vectors, annotated by features including GC content, read length, and the bin assigned by HDBSCAN

    • kmer_comp.umap.bins.tsv : Mean x- and y-coordinates for each bin assigned by HDBSCAN (for finding bins in 2-d embedding)
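For a quick look at how many reads landed in each k-mer bin, the bin label column of bin_membership.tsv can be tallied; this sketch assumes, as in the pipeline's own bin-listing rule shown in the Code Snippets section, that the bin label is in column 6:

cut -f 6 <run_output_dir>/kmer_binning/bin_membership.tsv | tail -n+2 | sort | uniq -c | sort -rn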

all_kaiju

The next (optional) step annotates reads using Kaiju and is not strictly required for producing polished genomes. However, it can be informative for verifying the integrity of the k-mer bins and for other downstream analyses.
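Because the step is optional, it can be requested explicitly once KAIJU:db points at a downloaded database, e.g.:

snakemake --use-conda -p -j <nproc> -r all_kaiju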

Some of the output files for this step include:

  • <run_output_dir>/kaiju

    • results.html : Krona dynamic plot of annotated taxonomic composition of DTR-containing reads

    • results.taxa.tsv : Per-read annotation results

  • <run_output_dir>/kmer_binning

    • kmer_comp.umap.nr_euk.[0-6].png : Scatter plots of UMAP embedding of k-mer count vectors, annotated by various levels of taxonomic annotation

all_populate_kmer_bins

The next step simply populates subdirectories with the binned reads as assigned by HDBSCAN.

These subdirectories are located in the kmer_binning directory:

  • <run_output_dir>/kmer_binning/bins/<bin_id>

  • Each bin subdirectory contains a list of binned read names ( read_list.txt ) and an associated FASTA file of reads ( <bin_id>.reads.fa )
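A simple way to eyeball the resulting bin sizes once this step has run:

wc -l <run_output_dir>/kmer_binning/bins/*/read_list.txt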

all_alignment_clusters

Next, each k-mer bin is refined by all-vs-all aligning all reads within a bin. The resulting alignment scores are clustered hierarchically and refined alignment clusters are called from the clustering.
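Conceptually, the all-vs-all step for a single bin boils down to a minimap2 self-alignment like the sketch below; the actual switches come from config.yml, and the ava-ont preset and bin id 5 are only illustrative:

minimap2 -t 4 -x ava-ont bins/5/5.reads.fa bins/5/5.reads.fa > 5.ava.paf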

The bin refinement results for each k-mer bin are also placed in the kmer_binning directory:

  • <run_output_dir>/kmer_binning/refine_bins/align_clusters/<bin_id>

  • Each bin refinement procedure generates an alignment ( <bin_id>.ava.paf ), clustering heatmap ( <bin_id>.clust.heatmap.png ), and information on alignment cluster assignments ( <bin_id>.clust.info.tsv )

all_polish_and_annotate

Next, a single read is selected from each valid alignment cluster and is polished by the remaining reads in the alignment cluster. Polishing is first done using multiple rounds of Racon (3x by default), then is finished using a single round of Medaka polishing.
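A hand-run equivalent of the finishing Medaka step for a single cluster might look like the line below, mirroring the medaka_consensus rule in the Code Snippets section; the input paths are illustrative and the model comes from MEDAKA:model in config.yml:

medaka_consensus -i <clust_id>.reads.fa -d <clust_id>.racon.fasta -o medaka/<clust_id> -t 4 -m r941_min_high_g303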

This step produces polished output in the following directories:

  • <run_output_dir>/kmer_binning/refine_bins/align_cluster_reads/<clust_id> simply contains the reads corresponding to each alignment cluster

  • <run_output_dir>/kmer_binning/refine_bins/align_cluster_polishing/racon/<clust_id> contains the Racon polishing output for each alignment cluster

  • <run_output_dir>/kmer_binning/refine_bins/align_cluster_polishing/medaka/<clust_id> contains the Medaka polishing output for each alignment cluster

    • The critical output file in each of these medaka/<clust_id> folders is the <clust_id>.ref_read.medaka.fasta file containing the Medaka-polished genome produced from this alignment cluster. Subsequent Snakemake rules analyze, aggregate, and deduplicate these polished genomes.

    • <clust_id>.ref_read.medaka.prodigal.cds.* files describe the coding sequence annotations from Prodigal for this Medaka-polished genome.

    • <clust_id>.ref_read.strands.* files describe the strand abundance for reads in each alignment cluster. Clusters containing >80% of reads from a single strand should be treated with suspicion (a rough check is sketched after this list).

    • <clust_id>.ref_read.dtr.aligns.* files describe the results of aligning the DTR sequence from each corresponding k-mer bin to the polished genome from each alignment cluster. If the DTR sequences all align to the same reference positions, the DTR is fixed. However, if they align all over the reference genome, this suggests that a headful DNA packaging mechanism was used.
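As a rough, hypothetical check for the >80% single-strand warning, assuming the strand counts table keeps the column order used by the strand-counting rule in the Code Snippets section (read, clust_id, n_tot, frac_pos, frac_neg), and adjusting the glob to the actual counts-file name:

awk -F'\t' 'FNR > 1 && ($4 > 0.8 || $5 > 0.8) {print FILENAME}' <run_output_dir>/kmer_binning/refine_bins/align_cluster_polishing/medaka/*/*.ref_read.strands.*.tsv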

all_combine_dedup_summarize

Next, we finish up the genome discovery portion of the pipeline by running a series of aggregations and evaluations of the final polished genome sequences.

All output from this step is written to a single directory:

  • <run_output_dir>/kmer_binning/refine_bins/align_cluster_polishing

    • polished.seqs.fasta : The combined set of genomes from each alignment cluster

    • polished.seqs.unique.fasta : Same as above but only containing unique sequences after the deduplication step (see the example after this list)

    • polished.stats.tsv : Various statistics for each polished genome, including length, GC content, DTR details, cluster strand abundance, CDS annotation statistics, circular permutation status, and many others

    • polished.stats.unique.tsv : Same as above but only containing unique sequences after the deduplication step

    • polished.unique.cds.summary.all.png : Summary plots of statistics for the coding sequences (CDS) annotated by Prodigal in each unique polished genome

    • polished.unique.cds.summary.dtr_npol10.png : Same as above but only including polished genomes with a confirmed DTR and at least 10 reads used for polishing
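For a quick summary of the final deduplicated genome set, seqkit (already used by the pipeline) can be pointed at the unique FASTA:

seqkit stats -a <run_output_dir>/kmer_binning/refine_bins/align_cluster_polishing/polished.seqs.unique.fasta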

all_linear_concatemer_reads

Finally, the sequencing dataset is queried for linear concatemer reads that could represent interesting mobile elements in the environmental sample.

All output from this step is written to a single directory:

  • <run_output_dir>/concatemers

    • concats.fasta : All identified concatemeric reads found in the input reads

    • concats.tsv : Statistics for each concatemeric read found, including read name, length, repeat size, and repeat copy count

    • concats.contours.png : Scatter plot with density contours showing the relationship between the observed repeat length and copy count in all identified concatemeric reads.

Licence and Copyright

(c) 2019 Oxford Nanopore Technologies Ltd.

This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at http://mozilla.org/MPL/2.0/.

Code Snippets

The snippets below reproduce selected shell: and run: rule bodies from the pipeline's Snakemake rules; the Python run: blocks rely on the imports and globals defined in the Snakefile (e.g. os, pandas as pd, numpy as np, and the various *_DIR path templates).

shell:
    'minimap2 -t {threads} {params.cli} {input} {input} > {output}'
shell:
    'python {SCRIPT_DIR}/cluster_ava_alignments.py -p {params.prefix} '
    ' -R {params.max_recurs} '
    ' -s {params.min_score_frac} -n {params.min_reads} {input}'
run:
    fns = input.clust_info_csvs
    df = pd.concat([pd.read_csv(fn) for fn in fns], ignore_index=True)
    df = df.sort_values(['bin_id', 'cluster'])
    print(f"Writing to {output.combined}")
    df.to_csv(str(output.combined), index=False)
run: 
    # parse table; filter by cluster
    for clust_info_file in input.clust_info:
        bin_id = os.path.basename(os.path.dirname(clust_info_file))
        logger.debug(f"DEBUG: bin id: {bin_id}")
        df = pd.read_csv(clust_info_file)
        for clust_id, df_clust in df.groupby('cluster'):
            bin_clust_id = f"{bin_id}_{clust_id}"
            bin_clust_dir = str(BIN_CLUSTER_DIR).format(bin_clust_id=bin_clust_id)
            os.makedirs(bin_clust_dir, exist_ok=True)
            readlist = str(BIN_CLUSTER_READS_LIST).format(bin_clust_id=bin_clust_id)
            df_clust['read_id'].to_csv(readlist, index=False, header=False)
            readinfo = str(BIN_CLUSTER_READS_INFO).format(bin_clust_id=bin_clust_id)
            df_clust.to_csv(readinfo, index=False)
shell:
    'seqkit grep -f {input.clustreads} {input.allfasta} > {output}'
run:
    clust_info_df = pd.read_csv(input.cluster_info)
    ref_idx  = clust_info_df['clust_read_score'].idxmax()
    ref_read = clust_info_df.loc[ref_idx, ['read_id']]
    pol_reads = clust_info_df[~clust_info_df['read_id'].isin(ref_read.values)]['read_id']
    ref_read.to_csv(output.ref, index=False, header=False)
    pol_reads.to_csv(output.pol, index=False, header=False)
shell:
    'seqkit grep -f {input.readlist} {input.fasta} > {output}'
shell:
    'seqkit grep -f {input.readlist} {input.fasta} > {output}'
shell:
    'prodigal -p meta -q -i {input} -o {output.txt} -d {output.fasta}'
shell:
    'python {SCRIPT_DIR}/summarize_cds_result_fasta.py  -o {output} {input.cds} '
    '{input.ref}'
shell:
    'minimap2 -t {threads} {params.switches} {input.ref} '
    '{input.reads} > {output}'
run:
    clust_id = wildcards.bin_clust_id
    cols = ['qname', 'qlen', 'qstart', 'qend', 'strand', 'tname', \
            'tlen', 'tstart', 'tend', 'matches', 'alnlen', 'mapqv']
    paf   = pd.read_csv(input.paf, sep='\t', header=None, usecols=list(range(12)), names=cols)
    paf   = paf.groupby(['qname','strand','tname']).size().reset_index(name='count')[['qname','strand','tname']]
    n_pos = paf[paf['strand']=='+'].shape[0]
    n_neg = paf[paf['strand']=='-'].shape[0]
    n_tot = n_pos + n_neg
    ref   = paf['tname'].unique()[0].split(':')[0]
    f_n   = open(output.counts, 'w')
    f_n.write('read\tclust_id\tn_tot\tfrac_pos\tfrac_neg\n')
    f_n.write('{ref}\t{clust}\t{n_tot}\t{pos}\t{neg}\n'.format(ref=ref,clust=clust_id,n_tot=(n_pos+n_neg),pos=float(n_pos)/n_tot,neg=float(n_neg)/n_tot))

    reads_df       = paf[['qname','strand']].rename(columns={'qname':'read_id', 'strand':'label'})
    ref_df         = pd.DataFrame.from_dict({'read_id':[ref], 'label':['ref']})
    df             = pd.concat([reads_df, ref_df])
    df['clust_id'] = clust_id
    df.to_csv(output.annots, sep='\t', index=False)
shell:
    'nucmer -p {params.prefix} --nosimplify {input} {input}'
shell:
    'show-coords -c -l -T {input.delta} > {output.nucfilt}'
shell:
    'python {SCRIPT_DIR}/dedup_sample_genomes.py -p {params.minpct} '
    '-c {params.mincov} -f {output.fasta} -s {output.stats} '
    '{input.coords} {input.fasta} {input.stats}'
shell:
    'python {SCRIPT_DIR}/plot_cds_summaries.py --output1={output.plot1} '
    '--output2={output.plot2} {input}'
shell:
    'python {SCRIPT_DIR}/plot_headful_dtr_positions.py -t {threads} -p {params.prefix} '
    '-o {params.ovlp} {input.binsdir} {input.ref}'
shell:
    'python {SCRIPT_DIR}/find_dtr_all_seqs.py -t {threads} -p {params.prefix} '
    '-d {params.tmpdir} -o {params.ovlp} -c {params.chunksize} {input}'
shell:
    'kaiju {params.switch} -z {threads} -t {params.db}/nodes.dmp '
    '-f {params.db}/kaiju_db.fmi -i {input} -o {output}'
shell:
    "csvcut -t -c 3 {input} | sort -nk1 | uniq > {output}; "
    "echo -e 'taxID' | cat - {output} > /tmp/out && mv /tmp/out {output}"
shell:
    'kaiju-addTaxonNames -p -t {params.db}/nodes.dmp -n {params.db}/names.dmp -i {input.results} -o {output}'
shell:
    'kaiju2krona -u -t {params.db}/nodes.dmp -n {params.db}/names.dmp '
    '-i {input.results} -o {output.krona}; ktImportText -o {output.html} {output.krona}'
shell:
    'python {SCRIPT_DIR}/kmer_freq.py {params.cli} -k {params.k} -t {threads} '
    '{input} > {output}'
shell:
    'python {SCRIPT_DIR}/add_qscore_to_kmer_freq.py -o {output} {input.freq} {input.summ}'
shell:
    'python {SCRIPT_DIR}/run_umap.py -l {params.min_rl} -q {params.min_q} '
    '-d {params.min_d} -n {params.neighbors} -o {output} {input.comp}'
shell:
    'python {SCRIPT_DIR}/plot_umap_with_qscore.py -o {output} -s {params.size} '
    '-a {params.alpha} {input}'
shell:
    'python {SCRIPT_DIR}/plot_umap_with_gc_content.py -o {output} -s {params.size} '
    '-a {params.alpha} {input.umap} {input.fasta}'
shell:
    'python {SCRIPT_DIR}/plot_umap_with_readlength.py -o {output} -m {params.min_rl} '
    '-n {params.max_cb_rl} -s {params.size} -a {params.alpha} {input.umap} {input.fastx}'
shell:
    'python {SCRIPT_DIR}/run_hdbscan.py -o {output} -c {params.min_cluster} {input}'
shell: "cut -f 6 {input} | tail -n+2 | sort | uniq > {output}"
shell:
    'python {SCRIPT_DIR}/plot_umap_with_hdbscan_labels.py -o {output} -s {params.size} '
    '-a {params.alpha} {input}'
shell:
    'python {SCRIPT_DIR}/plot_umap_with_kaiju_labels.py -r {wildcards.rank} -o {output} '
    '-s {params.size} -a {params.alpha} {input.umap} {input.annot}'
run:
    df_reads = pd.read_csv(input.bin_membership, sep='\t')

    # aggregate read stats into bin stats
    df_bins = df_reads.groupby('bin_id').agg(bases=pd.NamedAgg('length', sum),
                                             genomesize=pd.NamedAgg('length', np.mean),
                                             coverage=pd.NamedAgg('length', len),
                                             readlist=pd.NamedAgg('read', set),
                                            )
    df_bins['rl_gs_est'] = True

    # write bin stats (without readlist) to stats file
    df_bins[['bases','genomesize','coverage','rl_gs_est']] \
           .sort_index() \
           .to_csv(output.all_stats, sep='\t')

    # write bin specific files
    for bin_id, row in df_bins.iterrows():
        #
        # create bin specific subdir of BINS_ROOT
        bin_dir = str(BIN_DIR).format(bin_id=bin_id)
        os.makedirs(bin_dir, exist_ok=True)
        #
        # save genome size to file
        gsize_fn = str(BIN_GENOME_SIZE).format(bin_id=bin_id)
        with open(gsize_fn, 'w') as fh:
            fh.write(f'{row["genomesize"]}\n')
        #
        # save read list to file
        rlist_fn = str(BIN_READLIST).format(bin_id=bin_id)
        with open(rlist_fn, 'w') as fh:
            fh.write('{}\n'.format('\n'.join(row['readlist'])))
shell:
    'seqkit grep -f {input.read_list} -o {output} {input.reads_fasta}'
shell:
    'mkdir -p {params.aln_dir}; python {SCRIPT_DIR}/check_seqs_for_concatemers.py '
    '-o {params.ovlp_len} -d {params.aln_dir} -t {threads} -m {params.minlen} '
    '-o {output} {input}; '
    'rm -rf {params.aln_dir}'
shell:
    'python {SCRIPT_DIR}/plot_linear_concat_reads_contours.py -o {output} {input}'
shell:
    'python {SCRIPT_DIR}/run_racon.py -t {threads} -n {params.iterations} '
    '-q {params.minq} -o {output.polished} {input.raw_reads} {input.draft}'
shell:
    'medaka_consensus -i {input.raw_reads} -d {input.draft} '
    '-o {params.out_dir} -t {threads} -m {params.model}; '
    'mv {params.out_dir}/consensus.fasta {output}'
shell:
    'python {SCRIPT_DIR}/rename_polished_genome.py -o {output} '
    '{input.fasta} '
    '{input.ref_read_list} '
    '{input.pol_reads_list}'
shell:
    'cat {input} > {output}'
shell:
    'porechop {params.switches} -t {threads} -i {input} -v 1 -o {output}'
shell:
    'python {SCRIPT_DIR}/find_dtr_all_seqs.py --no_fasta --no_hist -t {threads} '
    '-p {params.prefix} -o {params.ovlp} -c {params.chunksize} {input}'
shell:
       """python {SCRIPT_DIR}/combine_cds_summary.py -o {output.table} \
         {MEDAKA_DIR}/*/*.ref_read.medaka.prodigal.cds.stats.txt """
shell:
    'python {SCRIPT_DIR}/calculate_genomes_gc_contents.py -p {params.prefix} '
    '-o {output} {input.dtr_table} {input.cds_table} {input.bins_dir} {params.clust_dir}'
run:
    count_fns = [input.counts] if isinstance(input.counts, Path) else input.counts
    count_dfs = [pd.read_csv(fn, sep='\t') for fn in count_fns]
    count_df  = pd.concat(count_dfs)
    count_df['bin']     = count_df['clust_id'].map(lambda x: int(x.split('_')[0]))
    count_df['cluster'] = count_df['clust_id'].map(lambda x: int(x.split('_')[1]))
    count_df            = count_df.sort_values(['bin','cluster']).drop(['bin','cluster'], axis=1).round({'frac_pos': 2, 'frac_neg': 2})
    count_df.to_csv(output.counts, sep='\t', index=False)
    annot_fns = [input.annots] if isinstance(input.annots, Path) else input.annots
    annot_dfs = [pd.read_csv(fn, sep='\t') for fn in annot_fns]
    annot_df  = pd.concat(annot_dfs)
    annot_df.to_csv(output.annots, sep='\t', index=False)
run:
    fns = list(str(i) for i in input)
    logger.debug("DEBUG: Loading dtr alignment infos from " + repr(fns))
    fns = expand_template_from_bin_clusters({}, DTR_ALIGN_TSV)
    logger.debug("DEBUG: Loading dtr alignment infos from " + repr(fns))
    df = pd.concat([pd.read_csv(fn, sep='\t') for fn in fns])
    print(df.shape)
    print(df.head())
    df['dtr_len'] = df['tend'] - df['tstart']
    df['left_dist'] = df['tend']
    df['right_dist'] = df['tlen'] - df['tstart']
    for_df = []
    for group,df_ in df.groupby('bin_clust_id'):
        cyc_perm = False
        # If the distance from the genome end to the inner DTR boundary is more than 2x the median DTR length, keep.
        # If > 25% of the reads pass that filter, label the cluster as being cyclically permuted.
        df_filt = df_[df_['left_dist'] > (2 * df_['dtr_len'].median())]
        df_filt = df_filt[df_filt['right_dist'] > (2 * df_filt['dtr_len'].median())]
        if df_filt.shape[0]>(0.25*df_.shape[0]):
            cyc_perm = True
        for_df.append( (df_.loc[0,'bin_clust_id'], cyc_perm))
    df_out = pd.DataFrame.from_records(for_df, columns=['clust_id','cyc_perm'])
    df_out.to_csv(output.cyc_perm_stats, sep='\t', index=False)
run:
    df_gc     = pd.read_csv(input.dtr_gc_stats, sep='\t').set_index('clust_id')
    df_cds    = pd.read_csv(input.cds_stats, sep='\t').set_index('clust_id')
    df_strand = pd.read_csv(input.strand_stats, sep='\t').set_index('clust_id')
    df_cyc    = pd.read_csv(input.cyc_perm_stats, sep='\t').set_index('clust_id')
    df_cds = df_cds.drop(['read','ref_len'], axis=1)
    df_strand = df_strand.drop('read', axis=1)

    df = df_cds.join(df_strand, how='left')
    df = df.join(df_cyc, how='left')
    df = df.join(df_gc, how='left').reset_index().sort_values('clust_id')
    df.to_csv(output[0], sep='\t', index=False)
run:
    import matplotlib; matplotlib.use('Agg')
    import matplotlib.pyplot as plt
    df = pd.read_csv(input.all_summary, quoting=3, sep='\t')

    # Sometimes these summary files have quotation marks around each row entry
    if isinstance(df.iloc[0,0], str):
        if df.iloc[0,0].find('\"')>-1:
            df.columns    = df.columns.map(lambda x: x.strip('\"'))
            df.iloc[:,0]  = df.iloc[:,0].str.strip('\"')
            df.iloc[:,-1] = df.iloc[:,-1].str.strip('\"').astype('float')

    length = 'sequence_length_template'
    qscore = 'mean_qscore_template'
    if 'sequence_length_2d' in df.columns:
        # This is 1Dsq data, not 1D. Update column ID name for querying
        length = length.replace('template', '2d')
        qscore = qscore.replace('template', '2d')

    fig = plt.figure(figsize=(12,10))

    # Prevent outlier lengths from screwing up histogram
    plot_df = df.query('({l} <= 70000)'.format(l=length))

    ax1 = fig.add_subplot(2,1,1)
    ax1.hist(plot_df[qscore], bins=50, linewidth=0, alpha=0.5, color='b')
    ax1.set_xlabel('qscore')
    ax1.set_ylabel('count')
    ax1.set_title('%s reads; mean qscore = %.1f' % (len(df[qscore]), np.mean(df[qscore])))

    ax2 = fig.add_subplot(2,1,2)
    ax2.hist(plot_df[length], bins=50, linewidth=0, alpha=0.5, color='b')
    ax2.set_xlabel('length')
    ax2.set_ylabel('count')
    ax2.set_title('mean length = %i' % np.mean(df[length]))

    fig.savefig(output.plot)
shell:
    'wrapper_phage_contigs_sorter_iPlant.pl -f {input} \
     --wdir {VIRSORTER_DIR} \
     --data-dir {config[VIRSORTER][data_dir]} \
     --db {config[VIRSORTER][db]} --ncpu {threads} \
     {params.virome} {params.diamond}'
shell: 'cat {input} > {output}'
run:
    print(f"Cannot run kaiju! DB not found at {KAIJU_DB_DIR}")
