Reproducible Scalable Pipeline For Amplicon-based Metagenomics (RSP4ABM) is a bioinformatics pipeline designed for convenient, reproducible and scalable amplicon-based metagenomics.

Version: v.0.9.18

Overview

This comprehensive pipeline for amplicon-based metagenomics integrates the best functions of many tools into a Snakemake workflow. It enables performant and reproducible processing of 16S rRNA or ITS Illumina paired-end reads. The whole process is covered, from local .fastq files or SRA repository accessions to the generation of basic visualization plots, including quality-control plots for intermediate steps.

The Snakemake pipeline can be executed by cloning this repository and relying on conda environments (Method 1), Singularity (Method 2) or Docker (Method 3).

################### TO BE UPDATED #######################

Method 1 - Snakemake with conda environments

This method offers flexibility, making it easy to modify and personalize the pipeline. However, it carries a risk of errors or inconsistent results when dependency versions change. Furthermore, simulate_PCR must be installed independently for the in silico validation, since it is not available through conda.

Requirements:

Computer

A Linux machine is best (it should also work on macOS, but this has not been tested). At least 16 GB of RAM is needed, and more may be required for larger datasets, depending on the classifier used (RDP requires more RAM than DECIPHER).

Tested on Ubuntu 18.04 with 4 CPUs and 32 GB of RAM

Cloned pipeline
git clone https://github.com/metagenlab/microbiome16S_pipeline.git
Miniconda3

Installed following the developers' recommendations, with the relevant channels added by running the following commands in a terminal:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set restore_free_channel true

Tested with version 4.6.14

Snakemake

Installed in a dedicated snakemake environment with:

conda create -n snakemake snakemake=5.6.0

Tested with version 5.6.0

Use

snakemake --snakefile ${pipeline_folder}/Snakefile --use-conda --conda-prefix ${conda_path} --cores {threads_number} --configfile {config_file_name} --resources max_copy={max_number_of_files_copied_at_the_same_time}
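
For example, with the repository cloned in the current directory, conda environments stored under /home/user/conda_envs, 4 cores and a config file named config.yaml (all values below are illustrative placeholders to adapt to your setup):

snakemake --snakefile microbiome16S_pipeline/Snakefile --use-conda --conda-prefix /home/user/conda_envs --cores 4 --configfile config.yaml --resources max_copy=4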

Method 2 - With Singularity

Relatively easy to set up and easier to work with than Docker, thanks to simpler user-rights management. We take advantage of Singularity's ability to run the Docker container prepared for this pipeline. Containerization ensures software stability: all dependencies are shipped within the container.

Requirements:

Computer

Same as for Method 1, but restricted to Linux (an alpha version of Singularity exists for macOS).

Tested on Ubuntu 18.04 with 4 CPUs and 32 GB of RAM

Singularity

Singularity is a container system able to run both Singularity and Docker containers. It should be installed as indicated here.

Tested with version 3.0.1

Commands:

## Run the container interactively
singularity shell docker://metagenlab/amplicon_pipeline:v.0.9.13
## Run the pipeline from within the container.
snakemake --snakefile /home/pipeline_user/microbiome16S_pipeline/Snakefile --use-conda --conda-prefix /opt/conda/ --cores {threads_number} --configfile {config_file_path} --resources max_copy={max_number_of_files_copied_at_the_same_time} mem_mb={available_memory}
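
For instance, with 4 cores, a config file named config.yaml and 16 GB of memory available (illustrative values only; the snakemake command is run from within the container shell):

singularity shell docker://metagenlab/amplicon_pipeline:v.0.9.13
snakemake --snakefile /home/pipeline_user/microbiome16S_pipeline/Snakefile --use-conda --conda-prefix /opt/conda/ --cores 4 --configfile config.yaml --resources max_copy=4 mem_mb=16000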

Method 3 - With Docker

Computer

Works on Windows, macOS and Linux. Tested on Linux, Windows 10 and macOS.

User settings:

Our Docker image is configured for a user called "pipeline_user" with UID 1080. It is advised to create this user on your computer before using the Docker image to run your analysis:

sudo useradd -G docker,sudo -u 1080 pipeline_user
sudo mkdir /home/pipeline_user/
sudo chown pipeline_user -R /home/pipeline_user/
sudo passwd pipeline_user

Alternatively, you can run the container as root (--user root), but the created folders will then belong to your computer's root user.

Docker

Install the CE version following these instructions for Ubuntu. Also make sure that you have created the docker group and that you can run Docker without sudo, following these instructions. If you cannot access the internet from inside a Docker container, apply these changes.

Use

Connected as pipeline_user:

docker run -it --rm --mount source="$(pwd)",target=/home/pipeline_user/data/analysis/,type=bind metagenlab/amplicon_pipeline:v.0.9.13

and then

snakemake --snakefile /home/pipeline_user/microbiome16S_pipeline/Snakefile --use-conda --conda-prefix /opt/conda/ --cores {threads_number} --configfile {config_file_path} --resources max_copy={max_number_of_files_copied_at_the_same_time}

or directly

docker run -it --rm --mount source="$(pwd)",target=/home/pipeline_user/data/analysis/,type=bind metagenlab/amplicon_pipeline:v.0.9.13 \
sh -c 'snakemake --snakefile /home/pipeline_user/microbiome16S_pipeline/Snakefile --use-conda --conda-prefix /opt/conda/ --cores {threads_number} --configfile {config_file_path} --resources max_copy={max_number_of_files_copied_at_the_same_time}'

References

  • Snakemake

    Köster, J., & Rahmann, S. (2012). Snakemake-a scalable bioinformatics workflow engine. Bioinformatics, 28(19), 2520–2522. https://doi.org/10.1093/bioinformatics/bts480

  • FASTQC

    Andrews, S. (2010). FASTQC. A quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

  • MultiQC

    Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047–3048. https://doi.org/10.1093/bioinformatics/btw354

  • DADA2

    Callahan, B. J., McMurdie, P. J., Rosen, M. J., Han, A. W., Johnson, A. J. A., & Holmes, S. P. (2016). DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods, 13(7), 581–583. https://doi.org/10.1038/nmeth.3869

  • VSEARCH

    Rognes, T., Flouri, T., Nichols, B., Quince, C., & Mahé, F. (2016). VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, e2584. https://doi.org/10.7717/peerj.2584

  • Qiime

    Caporaso, J. G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F. D., Costello, E. K., … Knight, R. (2010). QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7(5), 335–336. https://doi.org/10.1038/nmeth.f.303

  • Qiime2

    Bolyen, E., Dillon, M., Bokulich, N., Abnet, C., Al-Ghalith, G., Alexander, H., … Caporaso, G. (2018). QIIME 2: Reproducible, interactive, scalable, and extensible microbiome data science. PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.27295

  • RDP

    Wang, Q., Garrity, G. M., Tiedje, J. M., & Cole, J. R. (2007). Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73(16), 5261–5267. https://doi.org/10.1128/AEM.00062-07

  • IDTAXA in Decipher

    Murali, A., Bhargava, A., & Wright, E. S. (2018). IDTAXA: A novel approach for accurate taxonomic classification of microbiome sequences. Microbiome. https://doi.org/10.1186/s40168-018-0521-5

  • EzBioCloud

    Yoon, S.-H., Ha, S.-M., Kwon, S., Lim, J., Kim, Y., Seo, H., & Chun, J. (2017). Introducing EzBioCloud: a taxonomically united database of 16S rRNA gene sequences and whole-genome assemblies. International Journal of Systematic and Evolutionary Microbiology, 67(5), 1613–1617. https://doi.org/10.1099/ijsem.0.001755

  • Silva

    Quast, C., Pruesse, E., Yilmaz, P., Gerken, J., Schweer, T., Yarza, P., … Glöckner, F. O. (2013). The SILVA ribosomal RNA gene database project: Improved data processing and web-based tools. Nucleic Acids Research. https://doi.org/10.1093/nar/gks1219

  • phyloseq

    McMurdie, P. J., & Holmes, S. (2013). phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLoS ONE, 8(4), e61217. https://doi.org/10.1371/journal.pone.0061217

  • Krona

    Ondov, B. D., Bergman, N. H., & Phillippy, A. M. (2011). Interactive metagenomic visualization in a Web browser. BMC Bioinformatics, 12(1), 385. https://doi.org/10.1186/1471-2105-12-385

  • ALDex2

    Fernandes, A. D., Reid, J. N. S., Macklaim, J. M., McMurrough, T. A., Edgell, D. R., & Gloor, G. B. (2014). Unifying the analysis of high-throughput sequencing datasets: Characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 2(1), 1–13. https://doi.org/10.1186/2049-2618-2-15

  • Vegan

    Oksanen, J., Kindt, R., Legendre, P., O’Hara, B., Simpson, G. L., Solymos, P. M., … & Wagner, H. (2008). The vegan package. Community Ecology Package, (May 2014), 190. Retrieved from https://bcrc.bio.umass.edu/biometry/images/8/85/Vegan.pdf

  • metagenomeSeq

    Joseph, A., Paulson, N., Olson, N. D., Wagner, J., Talukder, H., & Corrada, H. (2019). Package ‘metagenomeSeq’.

  • edgeR

    Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2009). edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140. https://doi.org/10.1093/bioinformatics/btp616

  • Simulate_PCR

    Gardner, S. N., & Slezak, T. (2014). Simulate_PCR for amplicon prediction and annotation from multiplex, degenerate primers and probes. BMC Bioinformatics, 15(1), 2–7. https://doi.org/10.1186/1471-2105-15-237

Code Snippets

Lines 12-32:
shell:
    """
    if gzip -t {input[0]}
    then
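        # the input file is already gzip-compressed: copy it as-is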
        cp {input[0]} {output[0]}
    else 
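        # the input file is uncompressed: copy it next to the output, compress it, and remove the temporary copy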
        cp {input[0]} $(dirname {output[0]})
        gzip -qc $(dirname "{output[0]}")/$(basename "{input[0]}") > {output[0]}
        rm $(dirname "{output[0]}")/$(basename "{input[0]}")
    fi

    if gzip -t {input[1]}
    then
        cp {input[1]} {output[1]}
    else 
        cp {input[1]} $(dirname {output[1]})
        gzip -qc $(dirname "{output[1]}")/$(basename "{input[1]}") > {output[1]}
        rm $(dirname "{output[1]}")/$(basename "{input[1]}")
    fi            

    """
Lines 42-52:
shell:
    """
    if gzip -t {input[0]}
    then
        cp {input[0]} {output[0]}
    else 
        cp {input[0]} $(dirname {output[0]})
        gzip -qc $(dirname "{output[0]}")/$(basename "{input[0]}") > {output[0]}
        rm $(dirname "{output[0]}")/$(basename "{input[0]}")
    fi
    """
Lines 71-76:
shell:
    """
    #cache_dir=$(vdb-config --cfg -o n | grep "/repository/user/main/public/root" | cut -f2 -d'=' | sed "s/\\"//g")
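    # reset the sra-tools configuration to its defaults, then download the .sra archive for this sample accession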
    vdb-config --restore-defaults
    prefetch -v {wildcards.sample} --max-size 100000000 &> {log}
    """
Lines 90-93:
shell:
    """
    fastq-dump --split-3 --gzip --log-level 1 --disable-multithreading --minReadLen 0 --outdir $(dirname {output[0]}) {input[0]} &> {log}
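    # --split-3 writes paired-end reads to separate _1/_2 FASTQ files, gzip-compressed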
    """
Lines 107-110:
shell:
    """
    fastq-dump --gzip --outdir $(dirname {output[0]}) --log-level 0 --disable-multithreading --minReadLen 0 {wildcards.sample} &> {log}
    """
Lines 16-19:
shell:
    """
    fastqc -t {threads} {input} -o $(dirname {output[0]}) --nogroup 2> {log[0]}
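    # --nogroup reports statistics for every base position instead of binning them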
    """
Lines 47-50:
shell:
    """
    multiqc --interactive -f -n {output[0]} $(dirname {input} | tr "\n" " ") --cl_config "max_table_rows: 100000" 2> {log[0]}
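    # aggregates the per-sample FastQC reports of a run into one interactive MultiQC report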
    """
Lines 80-83:
shell:
    """
    multiqc -f -n {output[0]} $(dirname {input} | tr "\n" " ") --cl_config "max_table_rows: 100000" 2> {log[0]}
    """
Lines 3-57:
from shutil import copyfile
import sys
import datetime
import os
import getpass
import subprocess


## Define log directory path
if "logging_folder" not in config.keys():
    logging_folder = "logs/"
else:
    logging_folder=config["logging_folder"]

if not logging_folder.endswith("/"):
    logging_folder = logging_folder + "/"

## Generate a day_hour folder to store execution-specific parameters (command, config file, sample sheet)
today = datetime.datetime.now()
date = today.strftime("%Y_%m_%d")
time = today.strftime('%H_%M_%S_%f')[:-4]
time_tag = str(date + "-" + time)
timed_logging_folder = logging_folder + time_tag + "/"

## Except for dryrun execution, create a log folder. This will be filled by execution parameters and be provided to all rules to record their standard output or specific log files.
if "--dryrun" not in sys.argv and "-n" not in sys.argv:

    os.makedirs(logging_folder, exist_ok=True)
    os.makedirs(timed_logging_folder, exist_ok=True)

    ## Copy the command line, the config file and the sample sheet in a subdirectory of the log directory
    ### Log command line
    cmd_file = timed_logging_folder + "cmd.txt"
    with open(cmd_file, "w") as f:
        f.write(" ".join(sys.argv)+"\n")

    ### Log config file
    if workflow.overwrite_configfiles[0] is not None:
            copyfile(workflow.overwrite_configfiles[0], timed_logging_folder+"config.yaml")

    ### Log local or SRA sample sheet
    if "local_samples" in config.keys():
        copyfile(config["local_samples"], timed_logging_folder+"local_samples.tsv")
    if "sra_samples" in config.keys():
        copyfile(config["sra_samples"], timed_logging_folder+"sra_samples.tsv")

    ### Log git hash of the pipeline
    git_log_path = timed_logging_folder + "git.txt"
    with open(git_log_path, "w") as devnull:
        subprocess.run(args=["git -C " + workflow.basedir + " rev-parse HEAD"], shell=True, stdout=devnull)   

    ### Log user ID  
    user_path = timed_logging_folder + "user.txt"
    user_cred = getpass.getuser()
    with open(user_path, "w") as f:
        f.write(user_cred +"\n")
Lines 3-244:
import pandas
import re
import itertools

## Define a function to extract the .fastq name pattern
def get_read_naming_patterns(directory):
    result = []
    extension= {}
    for fname in sorted(os.listdir(directory)):
        if fname.endswith("fastq.gz") or fname.endswith("fq.gz") or fname.endswith("fastq") or fname.endswith("fq"):
            regex_str = '(_L0+[1-9]+)?_(R)?(1|2)(\.|_)' #regex for finding R1 and R2, if L001 is present before, it is also included
            regex = re.compile(regex_str)
            ext = re.search(regex, fname)
            if ext is None:
                ext = re.search(r'f(?:ast)?q(?:\.gz)?', fname)
                samp = re.sub("\.$", "", re.search(r'^([^\.]*)\.*', fname).group(0))
                if samp in extension.keys():
                    if ext.group(0).endswith(".gz"):
                        extension[samp] = [ext.group(0)]
                else:
                    extension[samp] = [ext.group(0)]
            else:
                regex_after = re.compile(regex_str+".*")
                regex_before = re.compile(".*"+regex_str)
                read = re.compile(re.search(regex_after, fname).group(0))
                samp = re.sub(regex, '', re.search(regex_before, fname).group(0))
                extension.setdefault(samp, [])
                extension[samp].append(re.sub("^_", "", read.pattern))
    return(extension)


## Check for the presence of a directory for .fastq in config. If none, will be "links".
if "link_directory" in config.keys():
    link_directory = config["link_directory"]
    if not link_directory.endswith("/"):
        link_directory = link_directory + "/"
else:
    link_directory = "links/"

## Check that "tax_DB_path" ends with "/". Otherwise, correct. 
if not config["tax_DB_path"].endswith("/"):
    config["tax_DB_path"] = config["tax_DB_path"] + "/"

## Check for the existence/accessibility of tax_DB_path
if os.path.isdir(config['tax_DB_path']) is False:
   raise IOError("Cannot access the directory defined by the '{}' config variable.".format('tax_DB_path'))


## Check for the presence of a metadata table in config, either for local reads ("local_samples") or distant reads ("sra_samples")
if "local_samples" not in config.keys() and "sra_samples" not in config.keys():
    raise ValueError("No samples defined in the config file. local_samples or sra_samples must be defined in config")

## Check the local_samples/sra_samples tables for the presence of the following variables defined in config: 'sample_label', 'grouping_column', 'run_column'.
### read sample dataframe from config file
if "local_samples" in config.keys():
    metadata = config["local_samples"] 
elif "sra_samples" in config.keys():
    metadata = config["sra_samples"]    
df = pandas.read_csv(metadata, sep="\t", index_col=0)
df_colnames = set(df.columns)

### Get values for which presence should be tested in dataframe
to_check = ['sample_label', 'grouping_column', 'run_column'] # a list of config values describing columns that should be in the dataframe
to_check_dict = (dict((k, config[k]) for k in to_check))

### Set a function to compare list
def list_diff(list1, list2): 
	return (list(set(list1) - set(list2))) 

### Loop over the values of config variable to check
for key in to_check_dict.keys():

    ## Extract values from dictionnary
    to_check_values = to_check_dict[key]

    ## If it is a single value, test that it is in the columns of the metadata
    if isinstance(to_check_values, str):

        ## if the column name list is not in the column names of df dataframe then raise an error
        if to_check_values not in df_colnames:                
            message = f"{to_check_values} not available in the sample TSV file."
            raise IOError(message)

    ## If it is a list, check that it contained in metadata
    elif isinstance(to_check_values, list):

        diff = []
        diff = list_diff(to_check_values, df_colnames)

        ## if the set of column names in to_check list is not in the column names of df dataframe then raise an error
        if diff:
            for item in diff:
                """if the set of column names in to_check list is not a subset of column names of df dataframe then raise an error"""
                message = f"{' and '.join(diff)} not available in the sample TSV file."
                raise IOError(message)

    else:
        message = f"{key} should be a string or a list of string"
        raise IOError(message)


## Generate a list of input files, either local (when "local_samples" is in config, or SRA-hosted (when "sra_samples" is in config))
### Create the final pandas dataframe that will be completed
all_samples=pandas.DataFrame()

### Create a set of dictionaries to store sample characteristics
sras_ext = {}
reads_sra = {}
reads_local = {}
original_names = {}
reads_ext = {}
paths  = {}
layout = {}


### In case of local samples, work our way through the local_samples table to extract read paths (if indicated in the R1/R2 columns) or to match sample names with the .fastq files found in the "links" directory.
if "local_samples" in config.keys():
    ## Read the metadata table
    local_data = pandas.read_csv(config["local_samples"], sep="\t", index_col=0)
    local_data.index = [str(x) for x in local_data.index]
    all_local_sample_names =  "".join(list(local_data.index))
    ## Check for forbidden characters in sample names
    if "(" in all_local_sample_names or ")" in all_local_sample_names or "_-_" in all_local_sample_names:
        raise ValueError("Forbidden character in sample name in sample name file")
    ## Extract path if indicated in "R1" column
    if "R1" in list(local_data):
        for sample, sample_data in local_data.iterrows():
            if sample in paths:
                ## Check for sample names to be unique
                raise IOError("Identical sample name used multiple times: %s" % sample)
            paths[sample] =[sample_data.loc["R1"]] 
            reads_ext[sample] = "single"
            layout[sample] = "single"
            if 'R2' in local_data.columns.values:
                if "R1" in str(sample_data.loc["R2"]):
                    raise IOError("ATTENTION! R1 flag within R2 filename: %s", sample_data.loc["R2"])
                if (str(sample_data.loc["R2"]).strip()) != "nan":
                    paths[sample].append(sample_data.loc["R2"])
                    reads_ext[sample] = ["R1", "R2"]
                    layout[sample] = "paired"
        all_samples = local_data   
        paths = {**paths}


    ## In absence of a "R1" column indicating file paths, try to match the sample names with the content of the "links" folder
    else:

        reads_local = get_read_naming_patterns(link_directory)            
        original_names = { x : x for x in reads_local.keys() }
        read_correct = {}
        original_correct = {}

        ## In the absence of an "OldSampleName" column, the leftmost column is used to match with the "links" content
        if "OldSampleName" not in list(local_data):

            for sample in list(local_data.index):
                regex = re.compile(r'%s([^a-zA-Z0-9]|$)' % sample) # this regex ensures that the matching of the sample names end at the end of the str, to prevent S1 matching S10 for instance
                match = [bool(re.match(regex, x)) for x in sorted(list(original_names.keys()))]
                if sum(match) != 1: #there must be one and only one entry matching one sample name
                    raise ValueError("Problem matching SampleName to read file names: " +  sample)
                read_name = str(sorted(list(original_names.keys()))[match.index(True)]) # Sample name with _SX
                original_correct[sample] = original_names[read_name]
                read_correct[sample] = reads_local[read_name]
                paths[sample] = expand(link_directory + read_name + "_{reads}" ,  reads = read_correct[sample]) 

                ## In the presence of a "LibraryLayout" column, samples can be specified as "single" or "paired".
                if "LibraryLayout" in list(local_data):
                    if local_data.loc[sample, "LibraryLayout"].lower()=="paired":
                        reads_ext[sample]=["R1", "R2"]
                        layout[sample] = "paired"
                    elif local_data.loc[sample, "LibraryLayout"].lower()=="single":
                        reads_ext[sample]=["single"]
                        layout[sample] = "single"
                    else:
                        raise ValueError("Problem in the Local_sample file, LibraryLayout badly defined")
                ## Otherwise, "paired" is assumed
                else:
                    reads_ext[sample]=["R1", "R2"]
                    layout[sample] = "paired"

        ## In the presence of an "OldSampleName" column, this column is used to match with the "links" content
        else:
            for Old in list(local_data["OldSampleName"]):
                regex = re.compile(r'%s([^a-zA-Z0-9]|$)' % Old)
                match = [bool(re.match(regex, x)) for x in sorted(list(original_names.keys()))]
                if sum(match) != 1:
                    raise ValueError("Problem matching OldSampleName to read file names : " + Old)
                read_name=str(sorted(list(original_names.keys()))[match.index(True)]) # Sample with have SX to unify with above
                sample = str(local_data.index[local_data['OldSampleName'] == Old][0])
                original_correct[sample] = original_names[read_name]
                read_correct[sample] = reads_local[read_name]
                paths[sample] = expand(link_directory + read_name + "_{reads}" ,  reads = read_correct[sample])

                ## In the presence of a "LibraryLayout" column, samples can be specified as "single" or "paired".
                if "LibraryLayout" in list(local_data):
                    if local_data.loc[sample, "LibraryLayout"].lower()=="paired":
                        reads_ext[sample]=["R1", "R2"]
                        layout[sample] = "paired"
                    elif local_data.loc[sample, "LibraryLayout"].lower()=="single":
                        reads_ext[sample]=["single"]
                        layout[sample] = "single"
                    else:
                        raise ValueError("Problem in the Local_sample file, LibraryLayout badly defined")
                ## Otherwise, "paired" is assumed
                else:
                    reads_ext[sample]=["R1", "R2"]
                    layout[sample] = "paired"

        original_names = original_correct
        reads_local = read_correct
        reads_ext = reads_ext


        all_samples=local_data



### In case of sra-hosted samples, work our way through the "sra_table" to extract LibraryLayout and clean sample names.
if "sra_samples" in config.keys():
    sra_data = pandas.read_csv(config["sra_samples"], sep="\t", index_col=0).drop_duplicates()
    all_sra_sample_names = "".join(list(sra_data.index))
    if "(" in all_sra_sample_names or ")" in all_sra_sample_names or "_-_" in all_sra_sample_names:
        raise ValueError("Forbidden character in sample name in sra file")
    for sra_sample in list(sra_data.index):
        sample_name = str(sra_sample).replace(" ", "_").replace("&", "and").replace(":", "-")
        if sample_name in reads_sra.keys(): # if the sample name is already used, add _(n+1) at the end
            sample_name = sample_name+"_"+str(list(reads_sra.keys()).count(sample_name))
        reads_sra[sample_name]=str(sra_sample)
        if sra_data.loc[sra_sample, "LibraryLayout"].lower()=="paired":
            sras_ext[sample_name]=["1.fastq.gz", "2.fastq.gz"]
            reads_ext[sample_name]=["R1", "R2"]
        elif sra_data.loc[sra_sample, "LibraryLayout"].lower()=="single":
            sras_ext[sample_name] = ["fastq.gz"]
            reads_ext[sample_name]=["single"]
        else:
            raise ValueError("Problem in the sra file, LibraryLayout badly defined")
    all_samples=sra_data
    config["local_samples"] =  config["sra_samples"]

        # all_samples.loc[sample_name, "Replicate"]=sra_data.loc[i, "Replicate"]
read_naming = {**reads_local, **sras_ext}
original_names = {**original_names, **reads_sra}
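
For orientation, a minimal local sample sheet compatible with the parsing logic above could be written as in the following sketch (tab-separated; the first column holds sample names, the optional R1/R2 columns give read paths, and the SampleLabel, Group and Run column names are hypothetical and must match the sample_label, grouping_column and run_column values set in the config):

printf 'Sample\tR1\tR2\tSampleLabel\tGroup\tRun\n' > local_samples.tsv
printf 'S1\treads/S1_R1.fastq.gz\treads/S1_R2.fastq.gz\tS1\tcontrol\trun1\n' >> local_samples.tsv
printf 'S2\treads/S2_R1.fastq.gz\treads/S2_R2.fastq.gz\tS2\ttreated\trun1\n' >> local_samples.tsv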
Lines 3-105:
def get_grouping_key(column_of_interest):
    file_list = []
    ## Convert to list if provided as a string in config
    if isinstance(column_of_interest,str):
        column_of_interest=[column_of_interest]
    unique_col=list(set(column_of_interest))
    for i in unique_col:
        combined_values = expand("{column_of_interest}/{column_values}", column_of_interest = i, column_values = list(set(all_samples[i])))
        file_list.extend(combined_values)
    return(file_list)

### Define rarefaction levels (values in config + no rarefaction)
def get_rarefaction_key(rarefaction_values):
    file_list = []
    for i in set(rarefaction_values):
        combined_values = expand("rarefaction_{rarefaction_values}", rarefaction_values = i)
        file_list = file_list + combined_values
    file_list = file_list + ["norarefaction"]

    return(file_list)


# Transform the collapse levels from the plotting taxonomic rank
def get_taxa_collapse_level_key(collapse_level):

    file_list = []
    for i in set(collapse_level):
        if i == "OTU":
            value = 'no_collapse'
        elif i == "Species" :
            value = 'collap_7'
        elif i == "Genus" :
            value = 'collap_6'
        elif i == "Family" :
            value = 'collap_5'
        elif i == "Order" :
            value = 'collap_4'
        elif i == "Class" :
            value = 'collap_3'
        elif i == "Phylum" :
            value = 'collap_2'
        elif i == "Kingdom" :
            value = 'collap_1'
        else :
            raise ValueError("Forbidden value in taxa level for barplots")
        file_list.append(value)
    file_list = file_list + ["no_collapse"]
    return(file_list)



## Set of function to generate list of output from config ##############################################################

### Light output, including QC, count table, consensus sequences and taxonomic assignement
def light_output_list():
    output = []
    output = MultiQC
    output.append(light_output)
    if "DADA2" in config["denoiser"]:
        output.append("DADA2/2_denoised/DADA2_denoising_stats.tsv")
    return(output)

### Basic output, diagnostic plots, KRONA plots and rarefaction curves
def basic_plots_list():
    output = []
    output = light_output_list()
    output.append(basic_plots)
    return(output)

### Complete set of phyloseq, in option including transposed count table and metadata (wide to long)
def phyloseq_output_list():
    output = []
    output = basic_plots_list()
    output.append(phyloseq)
    if config["melted_phyloseq"] == True:
        output.append(phyloseq_melted)
    if config["transposed_tables"] == True:
        output.append(transposed_output)
    return(output)


### Qiime2 outputs, offering interactive visualization of the data
def Qiime2_output_list():
    output = []
    output = basic_plots_list()
    if config["Qiime2_basic_output_visualization"] == True:
        output.append(Qiime2_vis_qzv)
    return(output)

### PICRUSt2 outputs, to predict genomic content based on taxonomic profiles
def PICRUSt2_list():
    output = []
    output = basic_plots_list()
    output.append(picrust2)
    return(output)

### Rule all, the default rule including all default output (not PICRUSt2, since long to compute)
def rule_all_list():
    output = []
    output.append(basic_plots_list())
    output.append(Qiime2_vis_qzv)
    output.append(phyloseq_output_list())
    return(output)
Lines 4-249:
include: "make_output_fcts.py"

### Basic output
#### MuliQC
MultiQC = expand("QC/{RUN}_multiqc_raw_reads_report.html", RUN = set(all_samples[config["run_column"]]))
MultiQC.append("QC/multiqc_raw_reads_report.html")


#### Light pipeline output
light_output = expand("{denoiser}/2_denoised/dna-sequences.fasta", denoiser = config["denoiser"])
light_output.append(expand("{denoiser}/2_denoised/count_table.tsv", denoiser = config["denoiser"]))
light_output.append(expand("{denoiser}/3_classified/{classifier}_{tax_DB}/dna-sequences_tax_assignments.qzv", classifier = config["classifier"], denoiser = config["denoiser"], tax_DB = config["tax_DB_name"]))

#### DADA2 stat table
DADA2_stats_tables = "DADA2/2_denoised/DADA2_denoising_stats.tsv"


### Basic evaluation plots
basic_plots = expand("{denoiser}/5_visualization/{classifier}_{tax_DB}/reads/reads_plot_with_filtered.png",
                                 classifier = config["classifier"],
                                 denoiser = config["denoiser"],
                                 tax_DB = config["tax_DB_name"])



basic_plots.append(expand("{denoiser}/5_visualization/{classifier}_{tax_DB}/reads/reads_plot_with_{filter_lineage}_in_{filter_tax_rank}_filtered.png",
                                denoiser = config["denoiser"],
                                classifier = config["classifier"],
                                tax_DB = config["tax_DB_name"],
                                filter_tax_rank = config["filter_tax_rank"],
                                filter_lineage = config["filter_lineage"]))


basic_plots.append(expand("{denoiser}/5_visualization/{classifier}_{tax_DB}/rarefaction_curve/nofiltering/{rarefaction_value}_rarefaction_curve_{grouping_column}.png",
                                 denoiser = config["denoiser"],
                                 classifier = config["classifier"],
                                 tax_DB = config["tax_DB_name"],
                                 rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
                                 grouping_column = config["grouping_column"][0]))


basic_plots.append(expand("{denoiser}/5_visualization/{classifier}_{tax_DB}/rarefaction_curve/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}_rarefaction_curve_{grouping_column}.png",
                                 denoiser = config["denoiser"],
                                 classifier = config["classifier"],
                                 tax_DB = config["tax_DB_name"],
                                 rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
                                 filter_tax_rank = config["filter_tax_rank"],
                                 filter_lineage = config["filter_lineage"],
                                 grouping_column = config["grouping_column"][0]))



basic_plots.append(expand("{denoiser}/5_visualization/{classifier}_{tax_DB}/KRONA/{grouping_key}.html",
                                 denoiser = config["denoiser"],
                                 classifier = config["classifier"],
                                 tax_DB = config["tax_DB_name"],
                                 grouping_key = get_grouping_key(config["grouping_column"])))



### Advanced Phyloseq set of output
#### Not filtered
phyloseq = expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree.rds",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"]))


phyloseq.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree.rds",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"])))

phyloseq.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree_norm{norm_value}_abund{abund_value}_prev{prev_value}.rds",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
                   norm_value = config["normalization"], 
                   abund_value = config["min_abundance"], 
                   prev_value = config["min_prevalence"]))

#### Tax filtered
phyloseq.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree.rds",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
                   filter_tax_rank = config["filter_tax_rank"],
                   filter_lineage = config["filter_lineage"]))


phyloseq.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree.rds",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
                   filter_tax_rank = config["filter_tax_rank"],
                   filter_lineage = config["filter_lineage"]))

phyloseq.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree_norm{norm_value}_abund{abund_value}_prev{prev_value}.rds",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
                   filter_tax_rank = config["filter_tax_rank"],
                   filter_lineage = config["filter_lineage"],
                   norm_value = config["normalization"], 
                   abund_value = config["min_abundance"], 
                   prev_value = config["min_prevalence"]))


### Potentially spurious ASVs
phyloseq.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/norarefaction/no_collapse/base_export/tree_treeshrink/output.txt",
            denoiser = config["denoiser"],
            classifier = config["classifier"],
            tax_DB = config["tax_DB_name"],
            collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
            rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
            filter_tax_rank = config["filter_tax_rank"],
            filter_lineage = config["filter_lineage"],
            norm_value = config["normalization"], 
            abund_value = config["min_abundance"], 
            prev_value = config["min_prevalence"]))


#### Melted Phyloseq set of output
#### Not filtered
phyloseq_melted = expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree_melted.tsv",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"]))


phyloseq_melted.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree_melted.tsv",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"])))

phyloseq_melted.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree_norm{norm_value}_abund{abund_value}_prev{prev_value}_melted.tsv",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
                   norm_value = config["normalization"], 
                   abund_value = config["min_abundance"], 
                   prev_value = config["min_prevalence"]))

#### Tax filtered
phyloseq_melted.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree_melted.tsv",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
                   filter_tax_rank = config["filter_tax_rank"],
                   filter_lineage = config["filter_lineage"]))


phyloseq_melted.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree_melted.tsv",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
                   filter_tax_rank = config["filter_tax_rank"],
                   filter_lineage = config["filter_lineage"]))

phyloseq_melted.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree_norm{norm_value}_abund{abund_value}_prev{prev_value}_melted.tsv",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
                   filter_tax_rank = config["filter_tax_rank"],
                   filter_lineage = config["filter_lineage"],
                   norm_value = config["normalization"], 
                   abund_value = config["min_abundance"], 
                   prev_value = config["min_prevalence"]))


#### Transposed output tables
#### Count-table transposed format - filtered
transposed_output = expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree_norm{norm_value}_abund{abund_value}_prev{prev_value}_export/count_table_transposed.txt",
            denoiser = config["denoiser"],
            classifier = config["classifier"],
            tax_DB = config["tax_DB_name"],
            collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
            rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
            filter_tax_rank = config["filter_tax_rank"],
            filter_lineage = config["filter_lineage"],
            norm_value = config["normalization"], 
            abund_value = config["min_abundance"], 
            prev_value = config["min_prevalence"])


#### Count-table transposed format - not filtered
transposed_output.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree_norm{norm_value}_abund{abund_value}_prev{prev_value}_export/count_table_transposed.txt",
            denoiser = config["denoiser"],
            classifier = config["classifier"],
            tax_DB = config["tax_DB_name"],
            collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
            rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
            filter_tax_rank = config["filter_tax_rank"],
            filter_lineage = config["filter_lineage"],
            norm_value = config["normalization"], 
            abund_value = config["min_abundance"], 
            prev_value = config["min_prevalence"]))


### Qiime2 output
#### Qiime2 interactive visualization
Qiime2_vis_qzv = expand("{denoiser}/2_denoised/dna-sequences.fasta",
                        denoiser = config["denoiser"])
Qiime2_vis_qzv.append(expand("{denoiser}/2_denoised/count-table.qzv",
                             denoiser = config["denoiser"]))
Qiime2_vis_qzv.append(expand("{denoiser}/2_denoised/rep-seqs.qzv",
                             denoiser = config["denoiser"]))
Qiime2_vis_qzv.append(expand("{denoiser}/3_classified/{classifier}_{tax_DB}/dna-sequences_tax_assignments.qzv",
                      denoiser = config["denoiser"],
                      classifier = config["classifier"],
                      tax_DB = config["tax_DB_name"]))


### Picrust2
picrust2 = expand("{denoiser}/6_picrust2/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/picrust/",
                    denoiser = config["denoiser"],
                    classifier = config["classifier"],
                    tax_DB = config["tax_DB_name"],
                    filter_tax_rank = config["filter_tax_rank"],
                    filter_lineage = config["filter_lineage"],
                    rarefaction_value = get_rarefaction_key(config["rarefaction_value"])
                    )
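
To make the configuration keys referenced in the code above concrete, here is a minimal, hypothetical config.yaml sketch written from a shell. The key names are taken from the code in this section, while every value is an illustrative placeholder; the accepted options (for the denoiser, classifier, normalization, taxonomy database, etc.) are defined by the pipeline and must be adapted to your setup:

cat > config.yaml <<'EOF'
# All values below are illustrative placeholders
local_samples: local_samples.tsv
link_directory: links/
logging_folder: logs/
run_column: Run
sample_label: SampleLabel
grouping_column: [Group]
denoiser: [DADA2]
classifier: [RDP]
tax_DB_path: /path/to/taxonomy_database/
tax_DB_name: [EzBioCloud]
filter_tax_rank: Kingdom
filter_lineage: Bacteria
rarefaction_value: [10000]
phyloseq_tax_ranks: [OTU, Genus]
normalization: [NONE]
min_abundance: [0]
min_prevalence: [0]
melted_phyloseq: false
transposed_tables: false
Qiime2_basic_output_visualization: true
EOF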
Lines 22-23:
script: 
    "scripts/1_cutadapt_paired.py"
Lines 43-44:
script: 
    "scripts/1_cutadapt_single.py"
Lines 40-41:
script:
    "scripts/1_DADA2_q_filtering_paired.R"
Lines 62-63:
script:
    "scripts/1_DADA2_q_filtering_single.R"
Lines 85-86:
script:
    "scripts/2a_big_data_DADA2_learn_errors_paired.R"
Lines 104-105:
script:
    "scripts/2a_big_data_DADA2_learn_errors_single.R"
Lines 132-133:
script:
    "scripts/2b_big_data_DADA2_infer_ASV_paired.R"
Lines 157-158:
script:
    "scripts/2b_big_data_DADA2_infer_ASV_single.R"
Lines 196-197:
script:
    "scripts/2c_big_data_DADA2_merge_ASV.R"
Lines 221-222:
script:
    "scripts/2d_big_data_DADA2_merge_chimera.R"
Lines 254-255:
script:
    "scripts/2e_big_data_DADA2_filtering_stats.R"
Lines 271-274:
shell:
    """
    fastqc {input} -o $(dirname {output[0]}) 2> {log[0]}
    """
Lines 306-309:
shell:
    """
    multiqc -f -n {output[0]} $(dirname {input} | tr "\n" " ") --cl_config "max_table_rows: 100000" 2> {log[0]}
    """
Lines 1-39:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
from snakemake import shell

if snakemake.params['amplicon_type'] == "ITS":
    print("ITS Trimming")
    forward_primer_compl = Seq.reverse_complement(Seq(snakemake.params['forward_primer'], IUPAC.ambiguous_dna))
    reverse_primer_compl = Seq.reverse_complement(Seq(snakemake.params['reverse_primer'], IUPAC.ambiguous_dna))
    shell("""cutadapt \
    --cores {snakemake.threads} \
    --error-rate 0.1 \
    --times 2 \
    --overlap 3 \
    -o {snakemake.output[R1_trimmed_reads]} \
    -p {snakemake.output[R2_trimmed_reads]} \
    -g '{snakemake.params[forward_primer]}' \
    -a '{reverse_primer_compl}' \
    -G '{snakemake.params[reverse_primer]}' \
    -A '{forward_primer_compl}' \
    --match-read-wildcards \
    --discard-untrimmed \
    {snakemake.input[R1_raw_reads]} \
    {snakemake.input[R2_raw_reads]} >> {snakemake.log[0]}""")

elif snakemake.params['amplicon_type'] == "16S":
    print("16S Trimming")
    shell("""cutadapt \
    --cores {snakemake.threads} \
    --error-rate 0.1 \
    --times 1 \
    --overlap 3 \
    -o {snakemake.output[R1_trimmed_reads]} \
    -p {snakemake.output[R2_trimmed_reads]} \
    -g '{snakemake.params[forward_primer]}' \
    -G '{snakemake.params[reverse_primer]}' \
    --match-read-wildcards \
    --discard-untrimmed \
    {snakemake.input[R1_raw_reads]} \
    {snakemake.input[R2_raw_reads]} >> {snakemake.log[0]}""")
Lines 1-33:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
from snakemake import shell

if snakemake.params['amplicon_type'] == "ITS":
    print("ITS Trimming")
    reverse_primer_compl = Seq.reverse_complement(Seq(snakemake.params['reverse_primer'], IUPAC.ambiguous_dna))
    shell("""cutadapt \
    --cores {snakemake.threads} \
    --error-rate 0.1 \
    --times 2 \
    --overlap 3 \
    -o {snakemake.output[R1_trimmed_reads]} \
    -g '{snakemake.params[forward_primer]}' \
    -a '{reverse_primer_compl}' \
    --match-read-wildcards \
    --discard-untrimmed \
    {snakemake.input[R1_raw_reads]} >> {snakemake.log[0]}""")

elif snakemake.params['amplicon_type'] == "16S":
    print("16S Trimming")
    reverse_primer_compl = Seq.reverse_complement(Seq(snakemake.params['reverse_primer'], IUPAC.ambiguous_dna))
    shell("""cutadapt \
    --cores {snakemake.threads} \
    --error-rate 0.1 \
    --times 1 \
    --overlap 3 \
    -o {snakemake.output[R1_trimmed_reads]} \
    -g '{snakemake.params[forward_primer]}' \
    -a '{reverse_primer_compl}' \
    --match-read-wildcards \
    --discard-untrimmed \
    {snakemake.input[R1_raw_reads]}  >> {snakemake.log[0]}""")
Lines 8-41:
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    fnFs <- snakemake@input[[1]]
    fnRs <- snakemake@input[[2]]

## Output
    q_score_filtered_F <- snakemake@output[["q_score_filtered_F"]]
    q_score_filtered_R <- snakemake@output[["q_score_filtered_R"]]
    q_filtering_stats_path <- snakemake@output[["q_filtering_stats"]]

## Parameters
    F_length <- snakemake@params[["F_reads_length_trim"]]
    R_length <- snakemake@params[["R_reads_length_trim"]]
    F_extected_error <- snakemake@params[["F_reads_extected_error"]]
    R_extected_error <- snakemake@params[["R_reads_extected_error"]]
    sample_name <- snakemake@params[["sample_name"]]


## Load needed libraries
    library(dada2);packageVersion("dada2")

## Filter and trim
### Reads are filtered based on the number of errors expected for each read (integrating the Q score and the read length). Reads containing uncalled nucleotides (N) and remaining phiX reads are removed as well. Finally, reads are truncated to the lengths set in the config.

### Filter and trim. The filtered reads are written directly, while the filtering stats are saved for later compilation.
    filtering_stats <- filterAndTrim(fwd = fnFs, filt = q_score_filtered_F, rev = fnRs, filt.rev = q_score_filtered_R, truncLen=c(F_length,R_length), maxN=0, maxEE=c(F_extected_error,R_extected_error), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=snakemake@threads, verbose = TRUE)
    filtering_stats <- as.data.frame(filtering_stats)
    filtering_stats$Sample <- sample_name

### Save the stats for this sample in a R object
    saveRDS(filtering_stats, file = q_filtering_stats_path)
Lines 8-37:
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    fnFs <- snakemake@input[[1]]

## Output
    q_score_filtered_F <- snakemake@output[["q_score_filtered_F"]]
    q_filtering_stats_path <- snakemake@output[["q_filtering_stats"]]

## Parameters
    F_length <- snakemake@params[["F_reads_length_trim"]]
    F_extected_error <- snakemake@params[["F_reads_extected_error"]]
    sample_name <- snakemake@params[["sample_name"]]


## Load needed libraries
    library(dada2);packageVersion("dada2")

## Filter and trim
### Reads are filtered based on the number of errors expected for each read (integrating the Q score and the read length). Reads containing uncalled nucleotides (N) and remaining phiX reads are removed as well. Finally, reads are truncated to the length set in the config.

### Filter and trim. The filtered reads are written directly, while the filtering stats are saved for later compilation.
    filtering_stats <- filterAndTrim(fwd = fnFs, filt = q_score_filtered_F, truncLen=c(F_length), maxN=0, maxEE=c(F_extected_error), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=snakemake@threads, verbose = TRUE)
    filtering_stats <- as.data.frame(filtering_stats)
    filtering_stats$Sample <- sample_name

### Save the stats for this sample in a R object
    saveRDS(filtering_stats, file = q_filtering_stats_path)
Lines 8-40:
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    q_score_filtered_F <- snakemake@input[["q_score_filtered_F"]]
    q_score_filtered_R <- snakemake@input[["q_score_filtered_R"]]

## Output
    error_profile_F <- snakemake@output[["error_profile_F"]]
    error_profile_R <- snakemake@output[["error_profile_R"]]
    error_plot_F <- snakemake@output[["error_profile_F_plot"]]
    error_plot_R <- snakemake@output[["error_profile_R_plot"]]

## Load needed libraries
    library(dada2); packageVersion("dada2")


## Learn error rates
    errF <- learnErrors(q_score_filtered_F, nbases=1e8, multithread = snakemake@threads, verbose = 1)
    errR <- learnErrors(q_score_filtered_R, nbases=1e8, multithread = snakemake@threads, verbose = 1)

## Write these error profiles
    saveRDS(object = errF, file = error_profile_F)
    saveRDS(object = errR, file = error_profile_R)

## Write error_plots
    png(error_plot_F)
    plotErrors(errF, nominalQ=TRUE)
    dev.off()

    png(error_plot_R)
    plotErrors(errR, nominalQ=TRUE)
    dev.off()
Lines 8-33:
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    q_score_filtered_F <- snakemake@input[["q_score_filtered_F"]]

## Output
    error_profile_F <- snakemake@output[["error_profile_F"]]
    error_plot_F <- snakemake@output[["error_profile_F_plot"]]

## Load needed libraries
    library(dada2); packageVersion("dada2")


## Learn error rates
    errF <- learnErrors(q_score_filtered_F, nbases=1e8, multithread = snakemake@threads, verbose = 1)

## Write these error profiles
    saveRDS(object = errF, file = error_profile_F)

## Write error_plots
    png(error_plot_F)
    plotErrors(errF, nominalQ=TRUE)

    dev.off()
Lines 8-72:
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    q_score_filtered_F <- snakemake@input[["q_score_filtered_F"]]
    q_score_filtered_R <- snakemake@input[["q_score_filtered_R"]]
    error_profile_F <- snakemake@input[["error_profile_F"]]
    error_profile_R <- snakemake@input[["error_profile_R"]]

## Output
    infer_stats <- snakemake@output[["infer_stats"]]
    sample_seq_tab <- snakemake@output[["sample_seq_tab"]]

## Parameters
    sample_name <- snakemake@params[["sample_name"]]
    run <- snakemake@params[["run"]]
    x_column_value <- snakemake@params[["x_column_value"]]
    min_overlap <- snakemake@params[["min_overlap"]]
    print(paste("min_overlap is :", min_overlap))

## Load needed libraries
    library(dada2); packageVersion("dada2")

## Create a useful function to count the number of sequences
    getN <- function(x) sum(getUniques(x))

## File renaming
    names(q_score_filtered_F) <- sample_name
    names(q_score_filtered_R) <- sample_name

## Read error rates
    errF <- readRDS(error_profile_F)
    errR <- readRDS(error_profile_R)

## Prepare named vector
    mergers <- vector("list", 1)
    names(mergers) <- sample_name

## Sample inference and merger of paired-end reads
    cat("Processing:", sample_name, "\n")
    ## Forward
        derepF <- derepFastq(q_score_filtered_F, verbose = TRUE)
        ddF <- dada(derepF, err=errF, multithread = snakemake@threads, verbose = 1, pool = FALSE, selfConsist = TRUE)
    ## Reverse
        derepR <- derepFastq(q_score_filtered_R, verbose = TRUE)
        ddR <- dada(derepR, err=errR, multithread = snakemake@threads, verbose = 1, pool = FALSE, selfConsist = TRUE)
    ## Merge
        merger <- mergePairs(dadaF = ddF, derepF= derepF, dadaR = ddR, derepR = derepR, verbose = TRUE, minOverlap = min_overlap, maxMismatch = 0)
        mergers[[sample_name]] <- merger

## Save the dereplicated, corrected and merged sequences for this sample
    saveRDS(mergers, file = sample_seq_tab)

## For statistics record the number of reads
    ### Write the statistics in a dataframe
        infer <- data.frame(denoisedF = getN(ddF))
        infer$denoisedR <- getN(ddR)
        infer$merged_pairs <- getN(merger)
        infer$Sample <- sample_name
        infer$label <- x_column_value
        infer$RUN <- run

    ### Save the sequences stats for this sample
        saveRDS(infer, file = infer_stats)
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    q_score_filtered_F <- snakemake@input[["q_score_filtered_F"]]
    error_profile_F <- snakemake@input[["error_profile_F"]]

## Output
    infer_stats <- snakemake@output[["infer_stats"]]
    sample_seq_tab <- snakemake@output[["sample_seq_tab"]]

## Parameters
    sample_name <- snakemake@params[["sample_name"]]
    run <- snakemake@params[["run"]]
    x_column_value <- snakemake@params[["x_column_value"]]
    min_overlap <- snakemake@params[["min_overlap"]]
    print(paste("min_overlap is :", min_overlap))

## Load needed libraries
    library(dada2); packageVersion("dada2")

## Create a useful function to count the number of sequences
    getN <- function(x) sum(getUniques(x))

## File renaming
    names(q_score_filtered_F) <- sample_name

## Read error rates
    errF <- readRDS(error_profile_F)

## Prepare named vector
    mergers <- vector("list", 1)
    names(mergers) <- sample_name

## Sample inference
    cat("Processing:", sample_name, "\n")
    ## Forward
        derepF <- derepFastq(q_score_filtered_F, verbose = TRUE)
        ddF <- dada(derepF, err=errF, multithread = snakemake@threads, verbose = 1, pool = FALSE, selfConsist = TRUE)
        mergers[[sample_name]] <- ddF

## Save the dereplicated, corrected sequences for this sample
    saveRDS(mergers, file = sample_seq_tab)

## For statistics record the number of reads
    ### Write the statistics in a dataframe
        infer <- data.frame(denoisedF = getN(ddF))
        infer$denoisedR <- NA
        infer$merged_pairs <- infer$denoisedF
        infer$Sample <- sample_name
        infer$label <- x_column_value
        infer$RUN <- run

    ### Save the sequences stats for this sample
        saveRDS(infer, file = infer_stats)
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    infer_stats <- snakemake@input[["infer_stats"]]
    sample_seq_table <- snakemake@input[["sample_seq_tab"]]

## Output
    run_stats <- snakemake@output[["run_stats"]]
    run_seq_table <- snakemake@output[["run_seq_table"]]

## Load needed libraries
    library(dada2); packageVersion("dada2")

## Merge and write the merged paired-end reads of the run
    input_M <- sapply(sample_seq_table, readRDS, simplify = TRUE, USE.NAMES = FALSE)
    st.all <- makeSequenceTable(input_M)
    saveRDS(object = st.all, file = run_seq_table)

## Merge and write the read-correction stats of the run
    stats <- do.call("rbind", lapply(infer_stats, readRDS))
    saveRDS(object = stats, file = run_stats)
  log <- file(snakemake@log[[1]], open="wt")
  sink(log)
  sink(log, type="message")

# Input
  seq_tab <- snakemake@input[["seq_tab"]]

# Output
  no_chim <- snakemake@output[["no_chim"]]
  length_filtered <- snakemake@output[["length_filtered"]]
  rep_seqs <- snakemake@output[["rep_seqs"]]
  count_table <- snakemake@output[["count_table"]]
  length_histo <- snakemake@output[["length_histo"]]

# Parameters
  merged_min_length <- snakemake@params[["merged_min_length"]]
  merged_max_length  <- snakemake@params[["merged_max_length"]]

# Load needed libraries
  library(dada2); packageVersion("dada2")
  library(ggplot2); packageVersion("ggplot2")

# Merge data from multiple runs (if necessary)
   if (length(seq_tab) == 1){
        print("Single run, no merging of sequence tables")
        st.all <- readRDS(seq_tab)
   }else{
        print("Multiple runs, merging sequence tables")
        st.all <- do.call("mergeSequenceTables", lapply(seq_tab, readRDS))
   }

# Remove chimeras
  print("Filter chimera")
  seqtab <- removeBimeraDenovo(st.all, method="consensus", multithread=snakemake@threads, verbose=TRUE)
  print("Chimera filtered")

# Sequence length inspection and filtering
#### This is the added step that motivates using this script rather than the QIIME2 equivalent: we typically keep only merged sequences between 390 and 500 bp, which corresponds to the expected length of the V3-V4 region of the 16S rRNA gene.
  ## Inspect distribution of sequence lengths
  unfiltered_length_table <- tapply(colSums(seqtab), factor(nchar(getSequences(seqtab))), sum)
  unfiltered_length_table
  unfiltered_length_table_df <- data.frame(length=as.numeric(names(unfiltered_length_table)), count=unfiltered_length_table)
  png(length_histo)
  p <- ggplot(data=unfiltered_length_table_df, aes(x=length, y=count)) + geom_col()
  print(p)
  dev.off()

# Filter based on length
seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% seq(merged_min_length, merged_max_length)]
# Inspect distribution of sequence lengths after filtration
table(nchar(getSequences(seqtab2)))

# Export reads and count
#### Write the products of this DADA2 process to files: one .fasta file containing the dereplicated, error-corrected, paired-end merged representative sequences, and one .txt count table giving the abundance of each sequence in each sample.
    asv_seqs <- colnames(seqtab2)

    ## giving our seq headers more manageable names (ASV_1, ASV_2...)
    asv_headers <- vector(dim(seqtab2)[2], mode="character")
    for (i in 1:dim(seqtab2)[2]) {
      asv_headers[i] <- paste(">ASV", i, sep="_")
    }

    ## making and writing out a fasta of our final ASV seqs:
    asv_fasta <- c(rbind(asv_headers, asv_seqs))
    write(asv_fasta, rep_seqs)

  ## count table:
    print("Create count table")
    asv_tab <- t(seqtab2)

 ## Rename rows with the ASV headers
    row.names(asv_tab) <- sub(">", "", asv_headers)
    write.table(asv_tab, count_table , sep="\t", quote=F)

  ## Write sequences objects in .rds for use in statistics
    ## Before length filtration
    saveRDS(seqtab, no_chim)
    ## After length filtration
    saveRDS(seqtab2, length_filtered)
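
  ## As a quick check (a sketch only; the file names below are hypothetical placeholders for the configured no_chim and length_filtered outputs), the two RDS objects written above can be reloaded to compare the sequence length distribution before and after the length filter:

    library(dada2)

    seqtab_nochim  <- readRDS("no_chim.rds")          # hypothetical path
    seqtab_len_flt <- readRDS("length_filtered.rds")  # hypothetical path

    table(nchar(getSequences(seqtab_nochim)))   # length distribution before filtering
    table(nchar(getSequences(seqtab_len_flt)))  # lengths kept (e.g. 390-500 bp for V3-V4)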
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    q_filtering_stats <- snakemake@input[["q_filtering_stats"]]
    run_stats <- snakemake@input[["run_stats"]]
    no_chim <- snakemake@input[["no_chim"]]
    l_filtered <- snakemake@input[["length_filtered"]]

## Output
    filtering_stats <- snakemake@output[["filtering_stats"]]

## Load needed libraries
    library(dplyr); packageVersion("dplyr")

## Load the q score filtration R stats
    filtration <- do.call("rbind", lapply(q_filtering_stats, readRDS))

## Load the forward, reverse and paired reads correction stats for each run
    dada_infer <- do.call("rbind", lapply(run_stats, readRDS))

## Load reads with filtered chimera
    no_chimera <- do.call("rowSums", lapply(no_chim, readRDS))
    no_chimera <- as.data.frame(no_chimera)
    no_chimera$Sample <- row.names(no_chimera)

## Load length_filtered sequences
    length_filtered <- do.call("rowSums", lapply(l_filtered, readRDS))
    length_filtered <- as.data.frame(length_filtered)
    length_filtered$Sample <- row.names(length_filtered)

## Merge all dataframes together
    all_stats <- Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "Sample", all.x = TRUE),
       list(filtration, dada_infer, no_chimera, length_filtered))

## Put Sample, label and RUN at the front of the table
    all_stats <- all_stats%>%select(Sample, label, RUN, everything())

## Compute % of passed reads
    all_stats$passed_reads_pct <- (100*all_stats$length_filtered)/all_stats$reads.in

## Write the stats table
write.table(x = all_stats, file = filtering_stats, sep = "\t", row.names = FALSE)
    shell:
        '''
        pandaseq \
        -f {input[R1_raw_reads]} \
        -r {input[R2_raw_reads]} \
        -p {params[forward_primer]} \
        -q {params[reverse_primer]} \
        -A simple_bayesian \
        -l {params[merged_min_length]} \
        -L {params[merged_max_length]} \
        -G {log} \
        -w {output[paired_trimmed_reads]} \
        -B \
        -T {threads} \
        -o {params[min_overlap]} \
        -N
        '''
    shell:
        '''
        pandaseq \
        -f {input[R1_raw_reads]} \
        -r {input[R2_raw_reads]} \
        -A simple_bayesian \
        -l {params[merged_min_length]} \
        -L {params[merged_max_length]} \
        -G {log} \
        -w {output[paired_trimmed_reads]} \
        -B \
        -T {threads} \
        -o {params[min_overlap]} \
        -N
        '''
shell:
    '''
    if [ -s {input} ]
    then
        vsearch --derep_fulllength {input} \
        --sizeout \
        --output {output[0]} \
        --uc {output[1]} 2> {log}
    else
        echo "{input} is empty"
        echo "> \n" > {output[0]}
        touch {output[1]}
    fi

    '''
shell:
    '''
    cat {input} >> {output}
    '''
shell:
    '''
    vsearch --derep_fulllength {input} \
            --sizeout \
            --minuniquesize 2 \
            --output {output} \
            2> {log}
    '''
shell:
    '''
    vsearch --cluster_size {input} \
            --sizein \
            --id 0.97 \
            --centroids {output[centroid]} \
            --consout {output[consout]} \
            --profile {output[profile]} \
            --threads {threads} \
            2> {log}
    '''
shell:
    '''
    vsearch --uchime_denovo {input} \
    --abskew 2 \
    --sizein \
    --nonchimeras {output[non_chimeras]} \
    --borderline {output[borderline]} \
    --uchimeout {output[chimera_out]} \
    2> {log}
    '''
shell:
    '''
    vsearch --derep_fulllength {input} \
    --sizein \
    --relabel OTU_ \
    --xsize \
    --output {output} \
    2> {log}
    '''
shell:
    '''
    vsearch --usearch_global {input[samples]} \
    --otutabout {output} \
    --id 0.97 \
    --strand plus \
    --db {input[rep_seq]} \
    --threads {threads} \
    2> {log}
    '''
script:
    "scripts/create_count_table_from_vsearch.R"
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Load needed library
    library(dplyr);packageVersion("dplyr")
    library(reshape2);packageVersion("reshape2")
    library(magrittr);packageVersion("magrittr")

## Input
    count_table_samples <- snakemake@input[["count_table_samples"]]

## Output
    count_table <- snakemake@output[["count_table"]]

## Reformat
otus_table <- data.frame(array(dim=c(0,3)))

### Loop over each sample file. If a file is empty, only register the sample name as a factor level so the sample still appears in the final table
for (file_path in count_table_samples){
  sample_name <- gsub("_count_table.tsv", "", basename(file_path))
  sample_otu_table <- read.table(file = file_path, sep="\t", as.is=T, check.names = F, header=T, comment.char = "",  skipNul = TRUE)
  if (nrow(sample_otu_table)>0){
    sample_otu_table <- cbind("Sample"=sample_name, sample_otu_table)
    otus_table <- rbind(otus_table, sample_otu_table)
  }else if (nrow(sample_otu_table) == 0){
    levels(otus_table$Sample) <- c(levels(otus_table$Sample), sample_name)
  }
}

colnames(otus_table) <- c("Sample", "OTU_ID", "counts")

## Transform this table to have a wide format where we have a column by sample
transf_vsearch_table <- otus_table %>% 
  dplyr::group_by(Sample, OTU_ID, .drop = FALSE) %>%
  dplyr::summarise(counts = sum(counts)) %>%
  reshape2::dcast(OTU_ID ~ Sample) %>%
  dplyr::filter(!is.na(OTU_ID))

## Set OTU as rownames
transf_vsearch_table <- set_rownames(x = transf_vsearch_table, value = transf_vsearch_table$OTU_ID)
transf_vsearch_table[,1] <- NULL

## Write output
    write.table(x = transf_vsearch_table, file = count_table, sep="\t", quote=F, na = "0")
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    seqs <- snakemake@input[["seqs"]]
    King_to_Species <- snakemake@input[["King_to_Species"]]
    King_to_Genus <- snakemake@input[["King_to_Genus"]]
    Genus_species <- snakemake@input[["Genus_species"]]

## Output
    tax <- snakemake@output[["tax"]]

## Parameters
    cpu <- snakemake@threads

## Load needed libraries
    library(dada2);packageVersion("dada2")
    library(dplyr);packageVersion("dplyr")
    library(tidyr);packageVersion("tidyr")
    library(Biostrings);packageVersion("Biostrings")

## Read data
    fastaFile <- readDNAStringSet(seqs)
    seq.name = names(fastaFile)
    sequence = paste(fastaFile)
    fasta <- data.frame(seq.name, sequence)

## Format seqs
    seq_table <- as.character(fasta$sequence)
    names(seq_table) <- as.character(fasta$seq.name)

## Assign taxonomy
    print("Assigning")
    taxa <- assignTaxonomy(seqs = seq_table, refFasta = King_to_Species, taxLevels = c("Kingdom","Phylum","Class","Order","Family","Genus", "Species"), multithread=cpu, tryRC = TRUE, minBoot = 50, verbose = TRUE, outputBootstraps = TRUE)
    #taxa <- addSpecies(taxtab = taxa, refFasta = Genus_species, verbose=TRUE, allowMultiple = TRUE, tryRC = TRUE)
    # Not working for some reason

## Format and write output
    taxa_table <- data.frame(taxa)
    taxa_table <- data.frame(cbind(Row.Names = names(rownames(unlist(taxa$tax))), taxa_table))
    taxa_table <- taxa_table %>% unite(taxonomy, starts_with(match = "tax."), sep = ";", remove = TRUE)
    taxa_table <- taxa_table %>% unite(boot, starts_with(match = "boot"), sep = ";", remove = TRUE)

    #taxa_table$Row.Names<-NULL
    write.table(taxa_table, tax, row.names = FALSE, sep = "\t", col.names = FALSE, quote = FALSE)
   log <- file(snakemake@log[[1]], open="wt")
   sink(log)
   sink(log, type="message")

## Input
    trained_tax_path <- snakemake@input[["trained_tax"]]
    seq_path <- snakemake@input[["seq"]]	

## Output
    tax_path <- snakemake@output[["tax"]]
    tax_plot_path <- snakemake@output[["tax_plot"]]

## Parameters
    cpu <- snakemake@threads

## Load needed libraries
    library(DECIPHER);packageVersion("DECIPHER")
    library(dplyr);packageVersion("dplyr")
    library(tidyr);packageVersion("tidyr")

# load a training set object (trainingSet)
# see http://DECIPHER.codes/Downloads.html
train_set <- readRDS(trained_tax_path )
query_seqs <- readDNAStringSet(seq_path)


# classify the sequences
ids <- IdTaxa(query_seqs,
   train_set,
   strand="both", # or "top" if same as trainingSet
   threshold=60, # 60 (very high) or 50 (high)
   processors=cpu) # use all available processors

# look at the results
 pdf(file = tax_plot_path)
 plot(ids)
 dev.off()


## Format taxonomy and write it
taxid <- data.frame(t(sapply(ids, function(x){
  taxa <- c(unlist(x$taxon)[2:8], last(x$confidence))
  taxa[startsWith(taxa, "unclassified_")] <- NA
  taxa
  })))
taxid <- unite(data = taxid, 1:7, col = "taxonomy", sep = ";", remove = TRUE)
write.table(x = taxid, tax_path, row.names = TRUE, sep = "\t", col.names = FALSE, quote = FALSE)
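
## For reference, the trainingSet object read above can either be downloaded pre-built from http://DECIPHER.codes/Downloads.html or trained locally. Below is a minimal sketch using DECIPHER::LearnTaxa; the reference FASTA and output file names are hypothetical and not part of the pipeline:

    library(DECIPHER)

    ## Reference sequences whose FASTA headers encode the taxonomy,
    ## e.g. "Root; Bacteria; Proteobacteria; ...; Escherichia" (hypothetical file)
    ref_seqs <- readDNAStringSet("reference_16S.fasta")
    taxonomy <- names(ref_seqs)

    ## Train the classifier and save it as the RDS expected by trained_tax above
    trainingSet <- LearnTaxa(ref_seqs, taxonomy)
    saveRDS(trainingSet, "decipher_trained_tax.rds")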
   log <- file(snakemake@log[[1]], open="wt")
   sink(log)
   sink(log, type="message")

## Input
rdp_tax <- snakemake@input[["RDP_output"]]

## Output
simplified_tax <-snakemake@output[["formatted_output"]]


## Load needed libraries
library(dplyr);packageVersion("dplyr")
library(tidyr);packageVersion("tidyr")

## Read data
tax_table <- read.table(rdp_tax, sep="\t", stringsAsFactors=FALSE)

## Format table
   ### set rownames
   row.names(tax_table) <- tax_table$V1

   ### Remove unnecessary columns
   tax_table[,c("V1", "V2", "V3", "V4", "V5")] <- NULL

## Set a table for confidence score filtering
   ### Keep score columns
   score_table <- tax_table[,seq(from = 3, to = 21, by = 3)]
   ### Rename columns
   colnames(score_table) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")
   ### Create a boolean dataframe for filtering based on confidence
   score_table_bool <- score_table < 0.5


## Set a table with taxa names
   ### Keep names columns
   names_tables <- tax_table[,seq(from = 1, to = 19, by = 3)]
   ### Rename columns
   colnames(names_tables) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")

## Filter taxa names based on boolean table
taxa_names_filtered <- replace(x = names_tables, list = score_table_bool, values = NA)

## Filter scores based on boolean table
score_table_filtered <- replace(x = score_table, list = score_table_bool, values = NA)

## Unite taxonomy
taxa_names_filtered_unite <- unite(data = taxa_names_filtered, col = "", sep = ";", na.rm = TRUE, remove = TRUE)

## Add confidence
taxa_names_filtered_unite$confidence <- apply(score_table_filtered, MARGIN = 1, FUN = min, na.rm = TRUE )

## Write formatted table
write.table(taxa_names_filtered_unite, sep = "\t", simplified_tax, row.names = TRUE, quote = FALSE, col.names = FALSE)
shell:
    '''
    export RDP_JAR_PATH=$(command -v rdp_classifier-2.2.jar);
    assign_path=$(which assign_taxonomy.py)
    python $assign_path \
      -i {input[0]} \
      -r {input[1]} \
      -t {input[2]} \
      -m rdp \
      -o $(dirname {output[0]}) \
      -c 0.5 \
      --rdp_max_memory {resources[mem_mb]} \
       2> {log[0]}
    '''
script:
    "scripts/dada2_rdp_tax_assign.R"
    shell:
        '''
        classifier -Xmx30g -XX:ConcGCThreads=1 classify \
        -c 0.5 \
        -f allrank \
        -t {input[0]} \
        -o {output[0]} {input[1]}  2> {log[0]}
        '''
script:
    "scripts/format_RDP_output.R"
script:
    "scripts/decipher_assign_tax.R"
script:
    "scripts/physeq_initial_import.R"
script:
    "scripts/physeq_export.R"
script:
    "scripts/physeq_filter_taxa.R"
script:
    "scripts/physeq_rarefy.R"
script:
    "scripts/physeq_add_new_tree.R"
script:
    "scripts/physeq_melt_table.R"
script:
    "scripts/physeq_collapse_taxa.R"
script:
    "scripts/transpose_and_add_meta_count_table.R"
script:
    "scripts/transf_norm.R"
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[["phyloseq_object"]]
new_tree <- snakemake@input[["new_tree"]]

## Output
updated_phyloseq_path <- snakemake@output[["phyloseq_object"]]



## Load libraries
library(phyloseq);packageVersion("phyloseq")



## Read the phyloseq object
phyloseq_obj <- readRDS(phyloseq_object)
phyloseq_obj <- prune_taxa(taxa_sums(phyloseq_obj) > 0, phyloseq_obj) ## Removes taxa not at least present in one sample

## Read the new tree
PHY <- read_tree(new_tree)

## Assign the new tree to the phyloseq object
phy_tree(phyloseq_obj) <- PHY

## Write the new phyloseq object
saveRDS(object = phyloseq_obj, file = updated_phyloseq_path)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[[1]]

## Output
phyloseq_filtered_object <- snakemake@output[[1]]

## Parameters
collapse_level <- snakemake@params[["collapse_level"]]



## Load needed libraries
library(phyloseq);packageVersion("phyloseq")



## Import the phyloseq object
phyloseq_object <- readRDS(phyloseq_object)

## Convert the taxonomic level to its name
rank_names(phyloseq_object)[[as.numeric(collapse_level)]]

## Collapse taxa
collapsed_physeq <- tax_glom(phyloseq_object, taxrank=rank_names(phyloseq_object)[[as.numeric(collapse_level)]], NArm=TRUE, bad_empty=c(NA, "", " ", "\t"))
collapsed_physeq <- prune_taxa(taxa_sums(collapsed_physeq) > 0, collapsed_physeq) ## Removes taxa not at least present in one sample


# Write the new phyloseq object
saveRDS(object = collapsed_physeq, file = phyloseq_filtered_object)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[["phyloseq_object"]]

## Output
tree_path <- snakemake@output[["tree_path"]]
meta_path <- snakemake@output[["meta_path"]]
taxonomy_path <- snakemake@output[["taxonomy_path"]]
OTU_path  <- snakemake@output[["OTU_path"]]
rep_seq_path  <- snakemake@output[["rep_seq_path"]]



## Load needed libraries
library(phyloseq);packageVersion("phyloseq")
library(dplyr);packageVersion("dplyr")
library(stringr);packageVersion("stringr")
library(Biostrings);packageVersion("Biostrings")



## Import the phyloseq object
phyloseq_obj <- readRDS(phyloseq_object)
phyloseq_obj <- prune_taxa(taxa_sums(phyloseq_obj) > 0, phyloseq_obj) ## Removes taxa not at least present in one sample

## Write the tree of the phyloseq object
tree <- phy_tree(phyloseq_obj)
ape::write.tree(tree, tree_path)

## Write the metadata table of the phyloseq object
metadata <- (sample_data(phyloseq_obj))
metadata$Sample <- rownames(metadata)
metadata <- metadata %>% select(Sample, everything())
write.table(metadata, meta_path , sep="\t", row.names = FALSE)

## Write the taxonomy table of the phyloseq object
taxa_df<- as.data.frame(tax_table(phyloseq_obj), stringsAsFactors = F)
taxa_df$taxonomy <- apply(taxa_df, 1, function(x) str_c(x[!is.na(x)], collapse = ";"))
taxa_df <- taxa_df %>% select(taxonomy)
write.table(taxa_df, taxonomy_path , sep="\t", quote=F, col.names = F)

## Write the OTU table of the phyloseq object
write.table(otu_table(phyloseq_obj), OTU_path , sep="\t", quote=F)

## Write the representative sequences
writeXStringSet(refseq(phyloseq_obj), rep_seq_path, append=FALSE,
                compress=FALSE, format="fasta")
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[["phyloseq_object"]]

## Output
phyloseq_filtered_object <- snakemake@output[["phyloseq_filtered_object"]]

## Parameters
tax_rank <- snakemake@params[["filter_tax_rank"]]
lineage <- snakemake@params[["filter_lineage"]]
filter_out_tax_rank  <- snakemake@params[["filter_out_tax_rank"]]
filter_out_lineage <- snakemake@params[["filter_out_lineage"]]


## Load needed libraries
library(phyloseq);packageVersion("phyloseq")
library(dplyr);packageVersion("dplyr")


## Import the phyloseq object
phyloseq_obj <- readRDS(phyloseq_object)

## Filter taxa
filtered_taxa <- subset_taxa(phyloseq_obj, get(tax_rank) == as.character(lineage))
filtered_taxa <- subset_taxa(filtered_taxa, get(filter_out_tax_rank) != as.character(filter_out_lineage))
filtered_taxa <- prune_taxa(taxa_sums(filtered_taxa) > 0, filtered_taxa) ## Removes taxa not at least present in one sample

## Recompute alpha diversity indexes after this filtration
### Remove the previously computed values
#sample_data(filtered_taxa) <- select(sample_data(filtered_taxa), -c(Observed, Chao1, se.chao1, ACE, se.ACE, Shannon, Simpson, InvSimpson, Fisher, Observed_min_1))
drop <- c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson", "Fisher", "Observed_min_1")
sample_data(filtered_taxa) <- sample_data(filtered_taxa)[,!(names(sample_data(filtered_taxa)) %in% drop)]


### Add alpha diversity indexes to metadata
alpha_div <- estimate_richness(physeq = filtered_taxa, split = TRUE, c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson"))
sample_data(filtered_taxa) <- cbind(sample_data(filtered_taxa),alpha_div)


### In addition, add the Observed over 1% metric
#### Keep the IDs of the taxa above 1%
physeqrF <- filter_taxa(physeq = filtered_taxa, function(x) mean(x) > 0.01, FALSE)
#### Keep only those
physeqaF <- prune_taxa(physeqrF,filtered_taxa)
#### Calculate the Observed index
alpha_div_1 <- estimate_richness(physeq = physeqaF, split = TRUE, measure = "Observed")
#### Rename this index
colnames(alpha_div_1) <- paste0(colnames(alpha_div_1), ("_min_1"))
#### Again bind this new column
sample_data(filtered_taxa) <- cbind(sample_data(filtered_taxa),alpha_div_1)


## Write the new phyloseq object
saveRDS(object = filtered_taxa, file = phyloseq_filtered_object)
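
## For illustration only, the same filtering logic can be reproduced interactively with hypothetical parameter values (keep Bacteria, drop chloroplast sequences); the file path and lineages below are assumptions, not pipeline defaults:

    library(phyloseq)

    ps <- readRDS("phyloseq_object.rds")               # hypothetical path
    ps_f <- subset_taxa(ps, Kingdom == "Bacteria")     # filter_tax_rank / filter_lineage
    ps_f <- subset_taxa(ps_f, Order != "Chloroplast")  # filter_out_tax_rank / filter_out_lineage
    ps_f <- prune_taxa(taxa_sums(ps_f) > 0, ps_f)      # drop taxa left with zero counts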
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
count_table <- snakemake@input[["count_table"]]
Metadata_table <- snakemake@input[["Metadata_table"]]
taxonomy_table <- snakemake@input[["taxonomy_table"]]
rep_seqs <- snakemake@input[["rep_seqs"]]
tax_tree <- snakemake@input[["tax_tree"]]

## Output
phyloseq_object <- snakemake@output[["phyloseq_object"]]

## Parameters
replace_empty_tax <- snakemake@params[["viz_replace_empty_tax"]]



## Load libraries
library(dplyr);packageVersion("dplyr")
library(tibble);packageVersion("tibble")
library(tidyr);packageVersion("tidyr")
library(phyloseq);packageVersion("phyloseq")
library(Biostrings);packageVersion("Biostrings")


## Import data

### Read count table
print("reading count table")
count_table <- read.table(file = count_table, header = TRUE, check.names=FALSE)

### Read sample_data
print("reading metadata")
metadata <- read.delim(file = Metadata_table, sep = "\t", header = TRUE, na.strings = "NA")

### Read taxonomic tree
print("reading taxonomic tree")
PHY <- read_tree(tax_tree)

### Read representative sequences
print("importing representative sequences from fasta")
SEQS <- readDNAStringSet(rep_seqs)

### Read and format taxonomy table
print("reading taxonomy table")
taxonomy_table<-read.table(file = taxonomy_table, header = FALSE, sep = "\t")

    ### Convert the table into a tabular split version
    taxonomy_table<-taxonomy_table %>% as_tibble() %>% separate(V2, sep=";", c("Kingdom","Phylum","Class","Order","Family","Genus","Species"))

    ### Rename the generic column headers to meaningful ones
    colnames(taxonomy_table)[colnames(taxonomy_table)=="V1"] <- "Feature.ID"
    colnames(taxonomy_table)[colnames(taxonomy_table)=="V3"] <- "Confidence"

    ### Convert taxonomic levels as character (needed for the next steps)
    taxonomy_table$Kingdom<-as.character(taxonomy_table$Kingdom)
    taxonomy_table$Phylum<-as.character(taxonomy_table$Phylum)
    taxonomy_table$Class<-as.character(taxonomy_table$Class)
    taxonomy_table$Order<-as.character(taxonomy_table$Order)
    taxonomy_table$Family<-as.character(taxonomy_table$Family)
    taxonomy_table$Genus<-as.character(taxonomy_table$Genus)
    taxonomy_table$Species<-as.character(taxonomy_table$Species)

        print(paste("replace empty taxonomy :", replace_empty_tax))

        if(isTRUE(replace_empty_tax)) {
          ### Replace NA with the name of the parent rank plus a placeholder suffix at each taxonomic level
          taxonomy_table$Kingdom[is.na(taxonomy_table$Kingdom)] <- (("Unknown_Kingdom")[is.na(taxonomy_table$Kingdom)])
          taxonomy_table$Phylum[is.na(taxonomy_table$Phylum)] <- ((paste(taxonomy_table$Kingdom,"_phy",sep=""))[is.na(taxonomy_table$Phylum)])
          taxonomy_table$Class[is.na(taxonomy_table$Class)] <- ((paste(taxonomy_table$Phylum,"_clas",sep=""))[is.na(taxonomy_table$Class)])
          taxonomy_table$Order[is.na(taxonomy_table$Order)] <- ((paste(taxonomy_table$Class,"_ord",sep=""))[is.na(taxonomy_table$Order)])
          taxonomy_table$Family[is.na(taxonomy_table$Family)] <- ((paste(taxonomy_table$Order,"_fam",sep=""))[is.na(taxonomy_table$Family)])
          taxonomy_table$Genus[is.na(taxonomy_table$Genus)] <- ((paste(taxonomy_table$Family,"_gen",sep=""))[is.na(taxonomy_table$Genus)])
          taxonomy_table$Species[is.na(taxonomy_table$Species)] <- ((paste(taxonomy_table$Genus,"_sp",sep=""))[is.na(taxonomy_table$Species)])

          print("NA values in the taxonomy table replaced by placeholders")

          }else{

          print("NA values in the taxonomy table NOT replaced by placeholders")
          }


## Import as physeq object where needed
OTU <- otu_table(count_table, taxa_are_rows = TRUE)
TAX <- taxonomy_table %>% column_to_rownames("Feature.ID") %>% as.matrix() %>% tax_table()
META <- metadata %>% as.data.frame() %>% column_to_rownames("Sample") %>% sample_data()


## Merge all in a phyloseq object
phyloseq_obj <- phyloseq(OTU, TAX, META, PHY, SEQS)



## Compute alpha diversity indexes after this filtration
### Remove the previously computed values, in case
#sample_data(phyloseq_obj) <- select(sample_data(phyloseq_obj), -c(Observed, Chao1, se.chao1, ACE, se.ACE, Shannon, Simpson, InvSimpson, Fisher, Observed_min_1))

drop <- c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson", "Fisher", "Observed_min_1")
sample_data(phyloseq_obj) <- sample_data(phyloseq_obj)[,!(names(sample_data(phyloseq_obj)) %in% drop)]


### Add alpha diversity indexes to metadata
alpha_div <- estimate_richness(physeq = phyloseq_obj, split = TRUE, c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson"))
sample_data(phyloseq_obj) <- cbind(sample_data(phyloseq_obj),alpha_div)

### In addition, add the Observed over 1% metric
#### Keep the IDs of the taxa above 1%
physeqrF <- filter_taxa(physeq = phyloseq_obj, function(x) mean(x) > 0.01, FALSE)
#### Keep only those
physeqaF <- prune_taxa(physeqrF,phyloseq_obj)
#### Calculate the Observed index
alpha_div_1 <- estimate_richness(physeq = physeqaF, split = TRUE, measure = "Observed")
#### Rename this index
colnames(alpha_div_1) <- paste0(colnames(alpha_div_1), ("_min_1"))
#### Again bind this new column
sample_data(phyloseq_obj) <- cbind(sample_data(phyloseq_obj),alpha_div_1)

phyloseq_obj <- prune_taxa(taxa_sums(phyloseq_obj) > 0, phyloseq_obj) ## Removes taxa not at least present in one sample



## Write the phyloseq object
saveRDS(object = phyloseq_obj, file = phyloseq_object)
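
## As a quick sanity check (a sketch assuming the RDS path configured in your run; the file name here is hypothetical), the resulting object can be loaded and inspected with standard phyloseq accessors:

    library(phyloseq)

    ps <- readRDS("phyloseq_object.rds")    # output of this script (hypothetical name)
    ps                                      # summary of OTU table, taxonomy, tree and sequences
    head(sample_data(ps))                   # metadata now includes the alpha-diversity columns
    nsamples(ps); ntaxa(ps)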
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[[1]]

## Output
phyloseq_melted_table <- snakemake@output[[1]]



## Load needed libraries
library(phyloseq);packageVersion("phyloseq")



## Import the physeq object
phyloseq_obj <- readRDS(phyloseq_object)
phyloseq_obj <- prune_taxa(taxa_sums(phyloseq_obj) > 0, phyloseq_obj) ## Removes taxa not at least present in one sample

### Melt the phyloseq object into a dataframe with one row per feature ID and per sample
physeq_df <- psmelt(phyloseq_obj)

### Write this table in a tsv format
write.table(x = physeq_df, file = phyloseq_melted_table, append = F, sep = "\t", eol = "\n", row.names = F, col.names = T )
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[["phyloseq_object"]]

## Output
rarefied_phyloseq <- snakemake@output[["phyloseq_object"]]

## Parameters
rarefy_value <- snakemake@params[["rarefaction_value"]]

## Load libraries
library(vegan);packageVersion("vegan")
library(phyloseq);packageVersion("phyloseq")
library(dplyr);packageVersion("dplyr")

## Set seed for reproducibility
set.seed(1)

## Import the phyloseq object
phyloseq_obj <- readRDS(phyloseq_object)

## Rarefy the count table
otu_table(phyloseq_obj) <- t(rrarefy(t(otu_table(phyloseq_obj)), sample = as.numeric(rarefy_value)))

## Compute alpha diversity indexes after this rarefaction
### Remove the previously computed values, in case
#sample_data(phyloseq_obj) <- select(sample_data(phyloseq_obj), -c(Observed, Chao1, se.chao1, ACE, se.ACE, Shannon, Simpson, InvSimpson, Observed_min_1))
drop <- c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson", "Fisher", "Observed_min_1")
sample_data(phyloseq_obj) <- sample_data(phyloseq_obj)[,!(names(sample_data(phyloseq_obj)) %in% drop)]


### Add alpha diversity indexes to metadata
alpha_div <- estimate_richness(physeq = phyloseq_obj, split = TRUE, c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson"))
sample_data(phyloseq_obj) <- cbind(sample_data(phyloseq_obj),alpha_div)

### In addition, add the Observed over 1% metric
#### Keep the IDs of the taxa above 1%
physeqrF <- filter_taxa(physeq = phyloseq_obj, function(x) mean(x) > 0.01, FALSE)
#### Keep only those
physeqaF <- prune_taxa(physeqrF,phyloseq_obj)
#### Calculate the Observed index
alpha_div_1 <- estimate_richness(physeq = physeqaF, split = TRUE, measure = "Observed")
#### Rename this index
colnames(alpha_div_1) <- paste0(colnames(alpha_div_1), ("_min_1"))
#### Again bind this new column
sample_data(phyloseq_obj) <- cbind(sample_data(phyloseq_obj),alpha_div_1)

phyloseq_obj <- prune_taxa(taxa_sums(phyloseq_obj) > 0, phyloseq_obj) ## Removes taxa not at least present in one sample


# Write the phyloseq object
saveRDS(object = phyloseq_obj, file = rarefied_phyloseq)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[["phyloseq_object"]]

## Output
phyloseq_norm <- snakemake@output[["phyloseq_norm"]]

## Parameters
# lower-case the normalization name to ease matching
normalization <- tolower(snakemake@params[["normalization"]])
min_abundance <- as.numeric(snakemake@params[["min_abundance_value"]])
min_prevalence <- as.numeric(snakemake@params[["min_prevalence_value"]])
print(paste("normalization setting is", normalization))

## Load needed libraries
library("phyloseq");packageVersion("phyloseq")
library("vegan");packageVersion("vegan")
library("edgeR");packageVersion("edgeR")
library("metagenomeSeq");packageVersion("metagenomeSeq")

## Import phyloseq
physeq <- readRDS(phyloseq_object)

## Recover OTU table
OTU <- data.frame(otu_table(physeq), check.names = FALSE)

## List methods computed by vegan
vegan_methods <- c("total", "max", "freq", "normalize", "range", "pa", "chi.square", "hellinger" ,"log")

## CLR must be computed sample-wise (per sample, on rows). Since the phyloseq object has taxa_are_rows = TRUE, it is computed on t() of the OTU table
if (normalization == "clr"){
  print("clr normalization")
  OTU1 <- OTU + 1
  reads_trfs <- t(chemometrics::clr(t(OTU1)))

## Vegan decostand normalizations (with double t() to take in account the transposed structure of the counts used here, modified from https://www.rdocumentation.org/packages/vegan/versions/2.4-2/topics/decostand)
}else if (normalization %in% vegan_methods){
  print(paste(normalization, "normalization by vegan"))
  reads_trfs <- t(vegan::decostand(t(OTU), method = normalization))

## Percent normalization with basic R
}else if (normalization == "pct"){
  print("Percent normalization with base R")
  reads_trfs <- sapply(OTU, function(x) 100*x/sum(x))
  rownames(reads_trfs) <- rownames(OTU)

## CSS normalization with metagenomeseq (modified form https://bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/inst/doc/metagenomeSeq.pdf )
}else if (normalization == "css"){
  print("css normalization by metagenomeseq")
  OTU_mtgs <- metagenomeSeq::newMRexperiment(counts = OTU, phenoData = NULL, featureData = NULL, libSize = NULL, normFactors = NULL)
  p <- metagenomeSeq::cumNormStat(OTU_mtgs)
  OTU_norm <-  metagenomeSeq::cumNorm(OTU_mtgs, p = p)
  reads_trfs = as.data.frame(metagenomeSeq::MRcounts(OTU_norm, norm = TRUE, log = FALSE))

## TMM normalization with edgeR (modified from https://joey711.github.io/phyloseq-extensions/edgeR.html)
} else if (normalization == "tmm"){
  print("TMM normalization by edgeR")
  OTU1 <- OTU + 1
  group <- rownames(get_variable(physeq))
  tax <- tax_table(physeq, errorIfNULL=FALSE)
  y <- edgeR::DGEList(counts=OTU1, group=group, genes=tax, remove.zeros = TRUE)
  z = edgeR::calcNormFactors(y, method="TMM")
  reads_trfs <- data.frame(z$counts, check.names = FALSE)
  #reads_trfs[reads_trfs==0.5] <- 0

} else if (normalization == "none"){
  print('No normalization (normalization = "none")')
  reads_trfs <- OTU

}else{
  stop(paste('"normalization" was', normalization, '. Must be one of: "clr", "pct", "css", "tmm", "total", "max", "freq", "normalize", "range", "pa", "chi.square", "hellinger" or "log"'))
}

## Filter based on counts and prevalence
### Filter based on the cumulated abundance
if(is.numeric(min_abundance)){
  print(paste("filter based on cumulated min_abundance:", min_abundance))
  filtered_data <- reads_trfs[,colSums(reads_trfs) > min_abundance, drop = FALSE]
}else if (min_abundance == "none"){
  print("NO filtering based on cumulated abundance")
  filtered_data <- reads_trfs
}else{
  stop(paste(min_abundance, 'must be a numeric value or "none"'))
}

### Filter based on prevalence
if(is.numeric(min_prevalence)){
  print(paste("filter based on prevalence:",min_prevalence))
  prevalence_cutoff <- (min_prevalence/100) * nsamples(physeq)
  print(paste("Features must be found in more than", prevalence_cutoff, "samples to be retained"))
  filtered_data <- filtered_data[,colSums(filtered_data != 0) > prevalence_cutoff, drop = FALSE]
}else if (min_prevalence == "none"){
  print("NO filtering based on prevalence")
}else{
  stop(paste(min_prevalence, 'must be a numeric value or "none"'))
}

## Repopulate the physeq object (same approach as for the taxa filter)
print(1)
physeq_filtered <- physeq
sample_names(physeq_filtered)
head(filtered_data)
otu_table(physeq_filtered) <- otu_table(round(filtered_data, digits = 0), taxa_are_rows = TRUE)
filtered_taxa <- prune_taxa(taxa_sums(physeq_filtered) > 0, physeq_filtered) ## Removes taxa not at least present in one sample

print(2)

## Recompute alpha diversity indexes after this filtration
### Remove the previously computed values
#sample_data(filtered_taxa) <- select(sample_data(filtered_taxa), -c(Observed, Chao1, se.chao1, ACE, se.ACE, Shannon, Simpson, InvSimpson, Fisher, Observed_min_1))
drop <- c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson", "Fisher", "Observed_min_1")
sample_data(filtered_taxa) <- sample_data(filtered_taxa)[,!(names(sample_data(filtered_taxa)) %in% drop)]


### Add alpha diversity indexes to metadata
alpha_div <- estimate_richness(physeq = filtered_taxa, split = TRUE, c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson"))
sample_data(filtered_taxa) <- cbind(sample_data(filtered_taxa),alpha_div)


### In addition, add the Observed over 1% metric
#### Keep the IDs of the taxa above 1%
physeqrF <- filter_taxa(physeq = filtered_taxa, function(x) mean(x) > 0.01, FALSE)
#### Keep only those
physeqaF <- prune_taxa(physeqrF,filtered_taxa)
#### Calculate the Observed index
alpha_div_1 <- estimate_richness(physeq = physeqaF, split = TRUE, measure = "Observed")
#### Rename this index
colnames(alpha_div_1) <- paste0(colnames(alpha_div_1), ("_min_1"))
#### Again bind this new column
sample_data(filtered_taxa) <- cbind(sample_data(filtered_taxa),alpha_div_1)


## Write the new phyloseq object
saveRDS(object = filtered_taxa, file = phyloseq_norm)
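
## To illustrate the double-transpose used above for the vegan methods (a toy example, not pipeline code): phyloseq stores taxa as rows while vegan::decostand normalises row-wise per sample, hence the t() before and after:

    library(vegan)

    ## Toy taxa-by-samples matrix, as stored in the phyloseq object (taxa_are_rows = TRUE)
    otu <- matrix(c(10, 0, 5,
                     2, 8, 1),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("ASV_1", "ASV_2"), c("S1", "S2", "S3")))

    ## Transpose so samples are rows, normalise, transpose back to taxa-by-samples
    norm <- t(decostand(t(otu), method = "hellinger"))

    colSums(norm^2)   # each sample sums to 1 on the squared (Hellinger) scale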
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    count_table <- snakemake@input[["count_table"]]
    print(count_table)
    meta <- snakemake@input[["meta"]]

## Output
    transposed_table <- snakemake@output[["transposed_table"]]
    merged_meta <- snakemake@output[["merged_meta"]]

## Load the count table and the metadata
    table <- read.table(count_table,  sep = "\t", header = TRUE)
    meta <- read.delim(meta,  sep = "\t", header = TRUE, na.strings = "NA")

## Transpose table
    table_t <- t(table)

## Bind tables
    binded_tables <- cbind(meta, table_t)

### Write these tables to .tsv files since they are the basis for downstream analyses and slow to compute
    write.table(x = table_t, file = transposed_table, append = F, quote = F, sep = "\t", eol = "\n", row.names = F, col.names = T )
    write.table(x = binded_tables, file = merged_meta, append = F, quote = F, sep = "\t", eol = "\n", row.names = F, col.names = T )
shell:
    """
    run_treeshrink.py -o $(dirname {output[0]}) -q {params[quantile]} -t {input}
    """
script:
    "scripts/raw_to_processed_reads_stats.R"
script:
    "scripts/raw_to_processed_reads_plot.R"
script:
    "scripts/raw_to_tax_filtered_reads_stats.R"
script:
    "scripts/raw_to_processed_reads_plot.R"
script:
    "scripts/KRONA_plots.R"
script:
    "scripts/rarefaction_curve.R"
script:
    "scripts/create_biom_from_count_table.R"
shell:
    '''
    qiime tools import \
    --input-path {input[0]} \
    --type 'FeatureTable[Frequency]' \
    --input-format BIOMV100Format \
    --output-path {output[0]}
    '''
shell:
    '''
    qiime feature-table summarize \
    --i-table {input[0]} \
    --o-visualization {output[0]}
    '''
shell:
    '''
    awk '/^>/ {{print($0)}}; /^[^>]/ {{print(toupper($0))}}' {input[0]} > {output[0]}
    '''
shell:
    '''
    qiime tools import \
    --input-path {input[0]} \
    --output-path {output[0]} \
    --type 'FeatureData[Sequence]'
    '''
shell:
    '''
    qiime phylogeny align-to-tree-mafft-fasttree \
      --p-n-threads {threads} \
      --i-sequences {input[0]} \
      --o-alignment {output[aligned]} \
      --o-masked-alignment {output[masked]} \
      --o-tree {output[unrooted]} \
      --o-rooted-tree {output[rooted]}
    '''
shell:
    '''
    qiime tools export --input-path {input} --output-path $(dirname {output[0]})
    '''
shell:
    '''
    qiime feature-table tabulate-seqs \
    --i-data {input[0]} \
    --o-visualization {output[0]}
    '''
shell:
    '''
    qiime tools import \
     --type FeatureData[Taxonomy] \
     --input-path {input[0]} \
     --input-format HeaderlessTSVTaxonomyFormat \
     --output-path {output[0]}
    '''
shell:
    '''
    qiime metadata tabulate \
        --m-input-file {input} \
        --o-visualization {output}
    '''
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    count_table <- snakemake@input[["count_table"]]

## Output
    biom_count <- snakemake@output[["biom_count"]]

## Library
    library(biomformat);packageVersion("biomformat")
    library(phyloseq);packageVersion("phyloseq")

## Reformat
    asv_tab <- read.table(file = count_table)
    otu <- as(otu_table(asv_tab, taxa_are_rows = TRUE),"matrix")
    otu_biom <- make_biom(data=otu)

## Write
    write_biom(x = otu_biom, biom_file = biom_count)
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    phyloseq_melted_table <- snakemake@input[["phyloseq_melted_table"]]

## Output
    output_folder <- snakemake@output[["output"]]
    output_folder <- (dirname(output_folder)[1])

## Parameters
    sample_label <- snakemake@params[["sample_label"]]
    grouping_column <- snakemake@params[["grouping_column"]]
    grouping_filter_column_value <- snakemake@params[["grouping_col_value"]]

## Load needed library
    library(dplyr);packageVersion("dplyr")

## Read data
melted_dataframe<- read.csv(file.path(phyloseq_melted_table), header = TRUE, sep = "\t")

## Create KRONA
    df <- filter(melted_dataframe, melted_dataframe[[grouping_column]] == grouping_filter_column_value)
    df <- filter(df, df[["Abundance"]] != 0)

    df <- df[, c("Abundance", sample_label, "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species", "OTU")]
    df <- as.data.frame(unclass(df))
    df[, 2] <- gsub(" |\\(|\\)", "", df[, 2])
    df[, 2] <- as.factor(df[, 2])
    dir.create(file.path(output_folder,"/",grouping_filter_column_value))
    for (lvl in levels(df[, 2])) {
      write.table(unique(df[which(df[, 2] == lvl), -2]), file = paste0(output_folder,"/",grouping_filter_column_value, "/", lvl, "taxonomy.txt"), sep = "\t", row.names = F, col.names = F, na = "", quote = F)
    }

# Since negative controls or other samples may contain no reads, we check whether the whole sample group is empty: if so, we write a placeholder .html file stating that no reads are available; otherwise we generate the Krona plot.
if (all(df[,2] == 0) == 1){
     dir.create(file.path(output_folder,"/",grouping_filter_column_value))
  cat(format(Sys.time(), "%a %b %d %Y %X TZ(%z)")," ", "All samples have 0 abundance for this sample group.",file= paste0(output_folder,"/",grouping_filter_column_value,".html"))

} else {

     dir.create(file.path(output_folder,"/",grouping_filter_column_value))
    for (lvl in levels(df[, 2])) {
      write.table(unique(df[which(df[, 2] == lvl), -2]), file = paste0(output_folder,"/",grouping_filter_column_value, "/", lvl, "taxonomy.txt"), sep = "\t", row.names = F, col.names = F, na = "", quote = F)
      }


    krona_args <- paste0(output_folder,"/", grouping_filter_column_value, "/", levels(df[,2]), "taxonomy.txt,", levels(df[, 2]), collapse = " ")
    output <- paste0(output_folder,"/",grouping_filter_column_value,".html")
    system(paste("ktImportText", krona_args, "-o", output, sep = " "))


}
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")


## Input
phyloseq_object <- snakemake@input[["phyloseq_object"]]

## Output
rarefaction_curve <- snakemake@output[["rarefaction_curve"]]

## Parameters
color_factor <- snakemake@params[["grouping_column"]]

## Load libraries
library('phyloseq')
library('ggplot2')
library('plyr') # ldply
library('reshape2') # melt
library('vegan')

## Load the phyloseq object
phyloseq_obj <- readRDS(phyloseq_object)


## Create a modified version of phyloseq's estimate_richness() function that accepts sample names consisting only of numbers

modified_estimated_richness <- function (physeq, split = TRUE, measures = NULL)
{
  if (!any(otu_table(physeq) == 1)) {
    warning("The data you have provided does not have\n",
            "any singletons. This is highly suspicious. Results of richness\n",
            "estimates (for example) are probably unreliable, or wrong, if you have already\n",
            "trimmed low-abundance taxa from the data.\n", "\n",
            "We recommended that you find the un-trimmed data and retry.")
  }
  if (!split) {
    OTU <- taxa_sums(physeq)
  }
  else if (split) {
    OTU <- as(otu_table(physeq), "matrix")
    if (taxa_are_rows(physeq)) {
      OTU <- t(OTU)
    }
  }
  renamevec = c("Observed", "Chao1", "ACE", "Shannon", "Simpson",
                "InvSimpson", "Fisher")
  names(renamevec) <- c("S.obs", "S.chao1", "S.ACE", "shannon",
                        "simpson", "invsimpson", "fisher")
  if (is.null(measures)) {
    measures = as.character(renamevec)
  }
  if (any(measures %in% names(renamevec))) {
    measures[measures %in% names(renamevec)] <- renamevec[names(renamevec) %in%
                                                            measures]
  }
  if (!any(measures %in% renamevec)) {
    stop("None of the `measures` you provided are supported. Try default `NULL` instead.")
  }
  outlist = vector("list")
  estimRmeas = c("Chao1", "Observed", "ACE")
  if (any(estimRmeas %in% measures)) {
    outlist <- c(outlist, list(t(data.frame(estimateR(OTU), check.names = FALSE))))
  }
  if ("Shannon" %in% measures) {
    outlist <- c(outlist, list(shannon = diversity(OTU, index = "shannon")))
  }
  if ("Simpson" %in% measures) {
    outlist <- c(outlist, list(simpson = diversity(OTU, index = "simpson")))
  }
  if ("InvSimpson" %in% measures) {
    outlist <- c(outlist, list(invsimpson = diversity(OTU, index = "invsimpson")))
  }
  if ("Fisher" %in% measures) {
    fisher = tryCatch(fisher.alpha(OTU, se = TRUE), warning = function(w) {
      warning("phyloseq::estimate_richness: Warning in fisher.alpha(). See `?fisher.fit` or ?`fisher.alpha`. Treat fisher results with caution")
      suppressWarnings(fisher.alpha(OTU, se = TRUE)[, c("alpha",
                                                        "se")])
    })
    if (!is.null(dim(fisher))) {
      colnames(fisher)[1:2] <- c("Fisher", "se.fisher")
      outlist <- c(outlist, list(fisher))
    }
    else {
      outlist <- c(outlist, Fisher = list(fisher))
    }
  }


  out = do.call("cbind", outlist)
  namechange = intersect(colnames(out), names(renamevec))
  colnames(out)[colnames(out) %in% namechange] <- renamevec[namechange]
  colkeep = sapply(paste0("(se\\.){0,}", measures), grep, colnames(out),
                   ignore.case = TRUE)
  out = out[, sort(unique(unlist(colkeep))), drop = FALSE]
  out <- as.data.frame(out)
  return(out)
}



# Rarefaction curve function adapted from: https://github.com/joey711/phyloseq/issues/143

set.seed(42)

calculate_rarefaction_curves <- function(psdata, measures, depths) {

  estimate_rarified_richness <- function(psdata, measures, depth) {
    if(max(sample_sums(psdata)) < depth) return()
    psdata <- prune_samples(sample_sums(psdata) >= depth, psdata)

    rarified_psdata <- rarefy_even_depth(psdata, depth, verbose = FALSE)

    alpha_diversity <- modified_estimated_richness(rarified_psdata, measures = measures)

    # as.matrix forces the use of melt.array, which includes the Sample names (rownames)
    molten_alpha_diversity <- melt(as.matrix(alpha_diversity), varnames = c('Sample', 'Measure'), value.name = 'Alpha_diversity')

    molten_alpha_diversity
  }

  names(depths) <- depths # this enables automatic addition of the Depth to the output by ldply
  rarefaction_curve_data <- ldply(depths, estimate_rarified_richness, psdata = psdata, measures = measures, .id = 'Depth', .progress = ifelse(interactive(), 'text', 'none'))

  # convert Depth from factor to numeric
  rarefaction_curve_data$Depth <- as.numeric(levels(rarefaction_curve_data$Depth))[rarefaction_curve_data$Depth]

  rarefaction_curve_data
}

rarefaction_curve_data <- calculate_rarefaction_curves(psdata = phyloseq_obj, measures = c("Observed", "Chao1", "ACE", "Shannon", "Simpson", "InvSimpson"), depth = rep(c(1, 10, 100, 1000, 1:100 * 10000), each = 10))
summary(rarefaction_curve_data)


####

rarefaction_curve_data_summary <- ddply(rarefaction_curve_data, c('Depth', 'Sample', 'Measure'), summarise, Alpha_diversity_mean = mean(Alpha_diversity), Alpha_diversity_sd = sd(Alpha_diversity))


####

rarefaction_curve_data_summary_verbose <- merge(rarefaction_curve_data_summary, data.frame(sample_data(phyloseq_obj)), by.x = 'Sample', by.y = 'row.names')

rarefaction_curve_data_summary_verbose

p <- ggplot(
  data = rarefaction_curve_data_summary_verbose,
  mapping = aes(
    x = Depth,
    y = Alpha_diversity_mean,
    ymin = Alpha_diversity_mean - Alpha_diversity_sd,
    ymax = Alpha_diversity_mean + Alpha_diversity_sd,
    colour = get(color_factor),
    group = get("Sample"))) +
      labs(col = "Sample type") +
      geom_line(size = 0.15) +
      #geom_pointrange(size = 0.2) +
      facet_wrap(facets = ~ Measure, scales = 'free_y')


ggsave(filename = rarefaction_curve,  plot = p, width = 10)
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")


## Input
    raw_to_filtered_reads_stats_path <- snakemake@input[["raw_to_filtered_reads_stats"]]
    raw_to_filtered_reads_stats <- read.table(file = raw_to_filtered_reads_stats_path, sep = "\t", header = TRUE)
    metadata_path <- snakemake@input[["Metadata_table"]]
    metadata <- read.delim(file = metadata_path, sep = "\t", header = TRUE)
    multi_QC_report_path <- snakemake@input[["multi_QC_report_path"]]
    multi_QC_report <- read.table(multi_QC_report_path, header = T)

## Ouput
    reads_plot_with_filtered <- snakemake@output[["reads_plot_with_filtered"]]

## Parameters
    x_axis_column <- snakemake@params[["sample_label"]]

## Load needed libraries
    library("phyloseq");packageVersion("phyloseq")
    library("data.table");packageVersion("data.table")
    library("dplyr");packageVersion("dplyr")
    library("ggplot2");packageVersion("ggplot2")
    library("RColorBrewer"); packageVersion("RColorBrewer")


## Set theme
theme_set(theme_bw())


    ### Record data on the distribution of number of reads (useful later to scale plots axis)
        smin <- min(multi_QC_report$FastQC_mqc.generalstats.fastqc.total_sequences)
        print(smin)
        smean <- mean(multi_QC_report$FastQC_mqc.generalstats.fastqc.total_sequences)
        print(smean)
        smax <- max(multi_QC_report$FastQC_mqc.generalstats.fastqc.total_sequences)
        print(smax)

    ### Order the x axis as in the metadata_table
        raw_to_filtered_reads_stats[[x_axis_column]] = factor(raw_to_filtered_reads_stats[[x_axis_column]], levels = unique(metadata[[x_axis_column]]), ordered = TRUE)
        #raw_to_filtered_reads_stats[[grouping_column]] = factor(raw_to_filtered_reads_stats[[grouping_column]], levels = unique(metadata[[grouping_column]]), ordered = TRUE)

    ### Order the read count categories logically
        ordered_factors <- c("Tax filtered reads", "Reads processing", "Maintained reads")
        raw_to_filtered_reads_stats$Reads <- factor(raw_to_filtered_reads_stats$Reads, ordered = TRUE, levels = ordered_factors)

    ### Generate a color palette
        getPalette <- colorRampPalette(brewer.pal(n=9, "Set3"))
        ColList <- ordered_factors
        ColPalette = getPalette(length(ColList))
        names(ColPalette) = ColList
        colors_palette <- ColPalette


    overall_reads_barplot <- ggplot(raw_to_filtered_reads_stats, aes(x = get(x_axis_column), y = Count, fill = Reads)) +
        geom_col() +
        scale_fill_manual(values = colors_palette) +
        labs(x= x_axis_column,  y ="Reads") +
        ggtitle("Overall read counts") +
        scale_x_discrete(drop = TRUE) + # Drop unused levels; use drop = FALSE to keep all groups, including empty ones
        scale_y_continuous(labels = scales::comma, limits = c(0,smax)) +
        theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 7)) #+
        # guides(fill=guide_legend(title=filling_column))


    ### Save it
        w <- 7 + 0.07 * (length(unique(raw_to_filtered_reads_stats[[x_axis_column]]))) 
        ggsave(overall_reads_barplot, filename = reads_plot_with_filtered, width = w, height = 5)
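For reference, here is a small hypothetical example of the long-format table this plotting script consumes (column names taken from the code above, values invented). geom_col() stacks the categories so that each bar sums to the raw read count reported by FastQC; in the pipeline, the x axis column is taken from the sample_label parameter, so it can be any metadata column rather than Sample.

## Illustrative input only: one row per sample and per read category.
example_stats <- data.frame(
  Sample = rep(c("S1", "S2"), each = 3),
  Count  = c(52000, 8000, 1500, 61000, 9500, 700),
  Reads  = rep(c("Maintained reads", "Reads processing", "Tax filtered reads"), times = 2))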
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    phyloseq_object <- snakemake@input[["phyloseq_object"]]
    multi_QC_report_path <- snakemake@input[["multi_QC_report_path"]]
    multi_QC_report <- read.table(multi_QC_report_path, header = T)

## Output
    raw_to_filtered_reads_stats <- snakemake@output[["raw_to_filtered_reads_stats"]]

    ## Load needed libraries
    library("phyloseq");packageVersion("phyloseq")
    library("data.table");packageVersion("data.table")
    library("dplyr");packageVersion("dplyr")

## Load the phyloseq object
    phyloseq_obj <- readRDS(phyloseq_object)

## Create a table with the number of raw reads and filtered reads

### Filtered reads
    reads_counts_df <- data.table(as(sample_data(phyloseq_obj), "data.frame"), TotalReads = sample_sums(phyloseq_obj), keep.rownames = TRUE)
    setnames(reads_counts_df, "rn", "Sample") # Rename the first column of this new data frame to "Sample"

### Raw reads
    multi_QC_report <- multi_QC_report %>% filter(grepl("R1", Sample)) %>% select(c("Sample","FastQC_mqc.generalstats.fastqc.total_sequences")) # keep only the total raw sequence counts; each sample appears twice (R1, R2), so keep only R1
    multi_QC_report$Sample <- gsub(x=multi_QC_report$Sample, pattern = "_R1", replacement = "") # Remove the "_R1" suffix

### Join the two data frames: the filtered reads from phyloseq and the raw reads from the QC report
    reads_counts_df_with_raw <- merge(multi_QC_report, reads_counts_df, by=c("Sample"))

### Calculate the difference between raw and filtered read counts
    reads_counts_df_with_raw <- mutate(reads_counts_df_with_raw, filtered = FastQC_mqc.generalstats.fastqc.total_sequences - TotalReads)

### Keep two versions of the table: one with the maintained (filtered) reads in "Count", and one with the raw-minus-filtered difference in the same column
    reads_counts_df_raw_count_only <- reads_counts_df_with_raw %>% select(-"filtered") %>% rename("Count" = "TotalReads")
    reads_counts_df_raw_count_only$Reads <- "Maintained reads"
    reads_counts_df_raw_filtered_only <- reads_counts_df_with_raw %>% select(-"TotalReads") %>% rename("Count" = "filtered")
    reads_counts_df_raw_filtered_only$Reads <- "Reads processing"

### Bind the rows of the two tables just prepared
    melted_reads_count <- rbind(reads_counts_df_raw_count_only, reads_counts_df_raw_filtered_only)

### Write this table
    write.table(x = melted_reads_count, file = raw_to_filtered_reads_stats, sep = "\t", col.names = NA, row.names = TRUE)
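A small consistency check, not part of the pipeline, can confirm that the two categories written above stack back to the raw FastQC count per sample; a minimal sketch using the melted_reads_count object from this script:

## Illustrative check only: "Maintained reads" + "Reads processing" should
## equal the raw read count reported by FastQC for every sample.
check <- melted_reads_count %>%
  group_by(Sample) %>%
  summarise(stacked = sum(Count),
            raw     = unique(FastQC_mqc.generalstats.fastqc.total_sequences))
stopifnot(all(check$stacked == check$raw))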
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    read_filtering <- snakemake@input[["read_filtering"]]
    taxonomic_filtering <- snakemake@input[["taxonomic_filtering"]]
    multi_QC_report_path <- snakemake@input[["multi_QC_report_path"]]

## Output
    raw_to_filtered_reads_stats <- snakemake@output[["raw_to_filtered_reads_stats"]]


## Load needed libraries
    library("phyloseq");packageVersion("phyloseq")
    library("data.table");packageVersion("data.table")
    library("dplyr");packageVersion("dplyr")


## Read files
    read_filtering_physeq <- readRDS(read_filtering)
    tax_filtering_physeq <- readRDS(taxonomic_filtering)
    multi_QC_report <- read.table(multi_QC_report_path, header = T)

## Create a table with the number of raw reads and filtered reads
### Raw reads
    multi_QC_report <- multi_QC_report %>% filter(grepl("R1", Sample)) %>% select(c("Sample","FastQC_mqc.generalstats.fastqc.total_sequences")) # keep only the total raw sequence counts; each sample appears twice (R1, R2), so keep only R1
    multi_QC_report$Sample <- gsub(x=multi_QC_report$Sample, pattern = "_R1", replacement = "") # Remove the "_R1" suffix from the label

### Filtered reads
    filtered_reads_counts_df <- data.table(as(sample_data(read_filtering_physeq), "data.frame"), TotalReads_processing = sample_sums(read_filtering_physeq), keep.rownames = TRUE)
    setnames(filtered_reads_counts_df, "rn", "Sample") # Rename the first column of this new data frame to "Sample"

### Taxonomically filtered reads
    tax_filtered_reads_counts_df <- data.table(as(sample_data(tax_filtering_physeq), "data.frame"), TotalReads_taxonomy = sample_sums(tax_filtering_physeq), keep.rownames = TRUE)
    setnames(tax_filtered_reads_counts_df, "rn", "Sample") # Rename the first column of this new data frame to "Sample"
    tax_filtered_reads_counts_df <- select(tax_filtered_reads_counts_df, c("TotalReads_taxonomy","Sample"))

### Join the columns of interest
    merged_columns <- merge(filtered_reads_counts_df, multi_QC_report, by=c("Sample"))
    merged_columns <- merge(merged_columns, tax_filtered_reads_counts_df, by = c("Sample"))

### Calculate the differences
    merged_columns$reads_processing <- merged_columns$FastQC_mqc.generalstats.fastqc.total_sequences - merged_columns$TotalReads_processing
    merged_columns$tax_filtration <- merged_columns$TotalReads_processing - merged_columns$TotalReads_taxonomy


### Keep three versions of the table, where "Count" is respectively the final number of reads, the reads lost during processing, and the reads removed by taxonomic filtering.
#### Keep the final count of reads; remove the two other columns
    final_reads_count <- merged_columns %>% rename("Count" = "TotalReads_taxonomy")
    final_reads_count$Reads <- "Maintained reads"
    final_reads_count[ ,c("reads_processing", "tax_filtration")] <- list(NULL)

#### Keep the reads filtered during processing; remove the two other columns
    processing_reads_count <- merged_columns %>% rename("Count" = "reads_processing")
    processing_reads_count$Reads <- "Reads processing"
    processing_reads_count[ ,c("tax_filtration", "TotalReads_taxonomy")] <- list(NULL)

#### Keep the reads removed by taxonomic filtering; remove the two other columns
    tax_filtered_reads_count <- merged_columns %>% rename("Count" = "tax_filtration")
    tax_filtered_reads_count$Reads <- "Tax filtered reads"
    tax_filtered_reads_count[ ,c("reads_processing", "TotalReads_taxonomy")] <- list(NULL)

#### Bind the rows of the three tables just prepared
    melted_reads_count <- rbind(final_reads_count, tax_filtered_reads_count)
    melted_reads_count <- rbind(melted_reads_count, processing_reads_count)


### Write this table
    write.table(x = melted_reads_count, file = raw_to_filtered_reads_stats, sep = "\t", col.names = NA, row.names = TRUE)
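One more hypothetical check (not part of the pipeline) worth noting: the category labels written here must match the ordered_factors used by the plotting script above, otherwise its factor conversion would yield NA levels.

## Illustrative check only: labels must match the plotting script's ordered_factors.
stopifnot(all(melted_reads_count$Reads %in%
              c("Tax filtered reads", "Reads processing", "Maintained reads")))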
shell:
    '''
    picrust2_pipeline.py \
        --in_traits COG,EC,KO,PFAM,TIGRFAM \
        -s {input[seqs]} \
        -i {input[table]} \
        -o {output}/output \
        -p {threads}
    '''
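The rule header for this shell fragment is not shown on this page. A minimal sketch of how such a block could be wrapped in a Snakemake rule; the rule name, file paths and the added mkdir are assumptions, only the picrust2_pipeline.py command itself comes from the snippet above:

rule picrust2:
    # Sketch only: input/output paths are illustrative, not the pipeline's actual values.
    input:
        seqs = "results/dada2/representative_seqs.fasta",
        table = "results/dada2/feature_table.biom"
    output:
        directory("results/PICRUSt2")
    threads: 4
    shell:
        '''
        mkdir -p {output}
        picrust2_pipeline.py \
            --in_traits COG,EC,KO,PFAM,TIGRFAM \
            -s {input[seqs]} \
            -i {input[table]} \
            -o {output}/output \
            -p {threads}
        '''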

Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://rsp4abm.readthedocs.io/en/latest/
Name: microbiome16s_pipeline
Version: v.0.9.18
Copyright: Public Domain
License: MIT License