Extended genecatalog workflow for metagenome-atlas


Snakemake workflow: genecatalog extension for metagenome-atlas

This workflow extends metagenome-atlas with a stand-alone gene-catalog step: predicted genes are converted to an MMseqs2 database, clustered (e.g. with linclust or cluster), representative sequences are extracted and renamed to stable gene identifiers, and an ORF-to-gene mapping table is produced. Code lives in the respective folders, i.e. scripts, rules, and envs; the entry point of the workflow is defined in the Snakefile and the main configuration in the config.yaml file.

Authors

  • Silas Kieser (@silask)

Usage

Simple

Step 1: Install workflow

If you simply want to use this workflow, download and extract the latest release. If you intend to modify and further extend this workflow or want to work under version control, fork this repository as outlined in Advanced. The latter way is recommended.

In any case, if you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this repository and, if available, its DOI (see above).

Step 2: Configure workflow

Configure the workflow according to your needs by editing the file config.yaml.
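
The exact schema is defined by the config.yaml shipped with the workflow; based on the clustering parameters used in the rules below, a configuration could look roughly like this (all keys are illustrative assumptions, not the verified schema):

genecatalog:
  clustermethod: linclust   # or "cluster" for the slower, more sensitive mode
  minid: 0.9                # passed to mmseqs as --min-seq-id
  coverage: 0.9             # passed to mmseqs as -c
  extra: ""                 # any additional mmseqs flags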

Step 3: Execute workflow

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores or run it in a cluster environment via

snakemake --use-conda --cluster qsub --jobs 100

or

snakemake --use-conda --drmaa --jobs 100

If you not only want to fix the software stack but also the underlying OS, use

snakemake --use-conda --use-singularity

in combination with any of the modes above. See the Snakemake documentation for further details.
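
For example, on a Sun Grid Engine cluster the submission command could look like this (the exact qsub flags depend on your site and are assumptions):

snakemake --use-conda --jobs 100 \
    --cluster "qsub -cwd -pe smp {threads}"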

Step 4: Investigate results

After successful execution, you can create a self-contained interactive HTML report with all results via:

snakemake --report report.html

This report can, e.g., be forwarded to your collaborators.

Advanced

The following recipe provides established best practices for running and extending this workflow in a reproducible way. A git command sketch covering steps 1-5 follows the list.

  1. Fork the repo to a personal or lab account.

  2. Clone the fork to the desired working directory for the concrete project/run on your machine.

  3. Create a new branch (the project-branch) within the clone and switch to it. The branch will contain any project-specific modifications (e.g. to configuration, but also to code).

  4. Modify the config, and any necessary sheets (and probably the workflow) as needed.

  5. Commit any changes and push the project-branch to your fork on github.

  6. Run the analysis.

  7. Optional: Merge back any valuable and generalizable changes to the upstream repo via a pull request. This would be greatly appreciated.

  8. Optional: Push results (plots/tables) to the remote branch on your fork.

  9. Optional: Create a self-contained workflow archive for publication along with the paper (snakemake --archive).

  10. Optional: Delete the local clone/workdir to free space.
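
As a minimal sketch, steps 1-5 of the recipe translate to roughly the following git commands (account, project, and branch names are placeholders):

# step 1: fork on GitHub, then
git clone https://github.com/<your-account>/genecatalog_atlas.git myproject
cd myproject
# step 3: create and switch to the project-branch
git checkout -b myproject
# step 4: edit config.yaml, sheets, and code as needed, then
git add -A
git commit -m "project-specific configuration"
# step 5: push the project-branch to your fork
git push -u origin myproject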

Testing

Test cases are in the subfolder .test. They are automatically executed via continuous integration with Travis CI.
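
To run the tests locally before pushing, a dry-run against the test data should work along these lines (the exact layout of .test is an assumption):

snakemake --use-conda -n --directory .test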

Code Snippets

The following excerpts from the workflow's rules and scripts show the main steps of building the gene catalog.

Symlink the input gene sequences into place:

shell:
    "ln -s {input} {output}"

Create an MMseqs2 sequence database from the input:

shell:
    "mkdir {output} 2> {log} ; "
    "mmseqs createdb {input} {output}/db >> {log} 2>> {log}"

Cluster the gene database; method, identity, and coverage are taken from the workflow parameters:

shell:
    """
    mkdir -p {params.tmpdir} {output} 2>> {log}

    mmseqs {params.clustermethod} -c {params.coverage} \
    --min-seq-id {params.minid} {params.extra} \
    --threads {threads} {input.db}/db {output.clusterdb}/db {params.tmpdir} >> {log} 2>> {log}

    rm -fr {params.tmpdir} 2>> {log}
    """

Extract the representative sequence of each cluster and flatten them to a FASTA file:

shell:
    """
    mkdir {output.rep_seqs_db} 2>> {log}

    mmseqs result2repseq {input.db}/db {input.clusterdb}/db {output.rep_seqs_db}/db >> {log} 2>> {log}

    mmseqs result2flat {input.db}/db {input.db}/db {output.rep_seqs_db}/db {output.rep_seqs} >> {log} 2>> {log}
    """

Write the cluster-to-ORF attribution as a tab-separated table:

shell:
    """
    mmseqs createtsv {input.db}/db {input.db}/db {input.clusterdb}/db {output.cluster_attribution} > {log} 2>> {log}
    """

Rename the representative sequences to stable gene identifiers (see scripts/rename_catalog.py below):

script:
    "../scripts/rename_catalog.py"

Translate the cluster attribution table to the new gene names (see scripts/rename_mapping.py below):

script:
    "../scripts/rename_mapping.py"

A second clustering round, run with the same MMseqs2 invocation:

shell:
    """
    mkdir -p {output} {params.tmpdir} 2> {log}
    mmseqs {params.clustermethod} -c {params.coverage} \
    --min-seq-id {params.minid} {params.extra} \
    --threads {threads} {input.db}/db {output.clusterdb}/db {params.tmpdir} >> {log} 2>> {log}
    """

Extract the representative sequences of the second clustering round:

shell:
    """
    mkdir {output.rep_seqs_db} 2>> {log}

    mmseqs result2repseq {input.db}/db {input.clusterdb}/db {output.rep_seqs_db}/db >> {log} 2>> {log}

    mmseqs result2flat {input.db}/db {input.db}/db {output.rep_seqs_db}/db {output.rep_seqs} >> {log} 2>> {log}
    """

Rename the second-round representatives:

script:
    "../scripts/rename_catalog.py"

Write the second round's cluster-to-ORF table:

shell:
    """
    mmseqs createtsv {input.db}/db {input.db}/db {input.clusterdb}/db {output.cluster_attribution} > {log} 2>> {log}
    """

Map the second round's table to the new gene names:

script:
    "../scripts/rename_mapping.py"

scripts/rename_catalog.py renames the representative sequences and writes the ORF-to-gene name mapping:

import sys

import pandas as pd
import utils

# redirect stdout/stderr of the script to the rule's log file
sys.stdout = open(snakemake.log[0], "w")
sys.stderr = open(snakemake.log[0], "a")


def rename_fasta(fasta_in, fasta_out, new_names):
    """Rename the fasta headers to new_names (in file order).

    The old name is kept in the header as a comment.
    Returns the list of old names.
    """
    old_names = []
    seen = set()
    n = 0

    with open(fasta_in) as fin, open(fasta_out, "w") as fout:
        for line in fin:
            if line[0] == ">":
                old_name = line[1:].split(maxsplit=1)[0]

                assert old_name not in seen, (
                    f"Found duplicate representative name {old_name} in {fasta_in}"
                )
                seen.add(old_name)
                old_names.append(old_name)

                line = f">{new_names[n]} {old_name}\n"
                n += 1

            fout.write(line)
    return old_names


def parse_mmseqs_log_file(log_file, keyword="Number of clusters:"):
    """Return the integer that follows `keyword` in the mmseqs log file."""
    with open(log_file) as f:

        result = None

        for line in f:
            if line.startswith(keyword):
                try:
                    result = int(line.replace(keyword, "").strip())
                except ValueError as e:
                    raise Exception(f"Error parsing line:\n{line}\n") from e

        if result is None:
            raise Exception(
                f"Didn't find a value for keyword '{keyword}' in logfile {log_file}"
            )
        return result


if __name__ == "__main__":

    Nrepresentatives = parse_mmseqs_log_file(snakemake.input.log)
    print(f"Number of representatives is {Nrepresentatives}")

    # generate one stable gene name per representative
    gene_names = utils.gen_names_for_range(Nrepresentatives, snakemake.params.prefix)

    # rename the representative sequences
    original_names = rename_fasta(snakemake.input.faa, snakemake.output.faa, gene_names)

    assert len(gene_names) == len(original_names), (
        "Number of representatives should match the number of clusters "
        "found in the mmseqs log file"
    )

    # table mapping each representative ORF to its new gene name
    orf2gene = pd.Series(index=original_names, data=gene_names, name="Gene")
    orf2gene.index.name = "ORF"
    orf2gene.to_csv(snakemake.output.name_mapping, sep="\t", header=True)

    del orf2gene
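
utils.gen_names_for_range is a helper from the workflow's utils module. Its implementation is not shown here; a stand-in with the assumed behavior (zero-padded sequential names, padding width is a guess) would be:

def gen_names_for_range(N, prefix):
    """Assumed behavior: N sequential, zero-padded names such as Gene0000001."""
    n_digits = max(7, len(str(N)))  # padding width is an assumption
    return [f"{prefix}{i:0{n_digits}d}" for i in range(1, N + 1)]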

scripts/rename_mapping.py translates the ORF-to-representative table to the new gene names and stores it as an HDF5 table:

import sys

# redirect stdout/stderr of the script to the rule's log file
sys.stdout = open(snakemake.log[0], "w")
sys.stderr = open(snakemake.log[0], "a")

import pandas as pd

# mapping from representative ORF name to new gene name, written by rename_catalog.py
name_mapping = pd.read_csv(
    snakemake.input.name_mapping, index_col=0, sep="\t", squeeze=True
)
assert isinstance(name_mapping, pd.Series)


with pd.HDFStore(snakemake.output[0], complevel=3, mode="w") as store:

    # read the cluster mapping in chunks; the format is "{cluster}\t{orf}"
    chunknr = 0
    for orf2gene in pd.read_csv(
        snakemake.input.cluster_mapping,
        usecols=[0, 1],  # the file can have a trailing tab, which would add an empty column
        index_col=1,
        squeeze=True,
        header=None,
        sep="\t",
        dtype={0: "category"},
        chunksize=int(1e7),
    ):

        # rename the categories (representative ORF names) to the new gene names
        orf2gene.cat.set_categories(
            name_mapping, inplace=True, rename=True, ordered=True
        )

        orf2gene.name = snakemake.params.headers[1]
        orf2gene.index.name = snakemake.params.headers[0]
        key = "/".join(snakemake.params.headers)

        # append each chunk to the same table; the header is written only once
        store.append(
            key,
            orf2gene,
            format="table",
            data_columns=[orf2gene.name],
            min_itemsize=50,
        )

        chunknr += 1
        print(f"processed chunk {chunknr}")


URL: https://github.com/metagenome-atlas/genecatalog_atlas
Name: genecatalog_atlas
Version: 1
License: MIT License
