Reproducible Scalable Pipeline For Amplicon-based Metagenomics (RSP4ABM) is a bioinformatics pipeline designed for convenient, reproducible and scalable amplicon-based metagenomics.

Version: v.0.9.18

Overview

This comprehensive pipeline for amplicon-based metagenomics integrates the best functions of many tools into a Snakemake workflow. It enables performant and reproducible processing of 16S rRNA or ITS Illumina paired-end reads. The whole process is covered, from local .fastq files or SRA repository accessions to the generation of basic visualization plots, including quality-control plots for intermediate steps.

The Snakemake pipeline can be executed by cloning this repository and relying on conda environments (Method 1), Singularity (Method 2) or Docker (Method 3).

################### TO BE UPDATED #######################

Method 1 - Snakemake with conda environments

This method offers flexibility, making it easy to modify and personalize the pipeline. However, it carries a risk of errors or inconsistent results when dependency versions change. Furthermore, simulate_PCR must be installed independently for the in silico validation, since it is not available through conda.

Requirements:

Computer

A Linux machine is best (it should also work on macOS, but this has not been tested). At least 16 GB of RAM is needed, and more may be required for larger datasets, depending on the classifier used (RDP requires more RAM than DECIPHER).

Tested on Ubuntu 18.04 with 4 CPUs and 32 GB of RAM

Cloned pipeline
git clone https://github.com/metagenlab/microbiome16S_pipeline.git
Miniconda3

Installed following the developers' recommendations, with the relevant channels added by running the following commands in a terminal:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set restore_free_channel true

Tested with version 4.6.14

Snakemake

Installed in a dedicated snakemake environment with:

conda create -n snakemake snakemake=5.6.0

Tested with version 5.6.0

Use

snakemake --snakefile ${pipeline_folder}/Snakefile --use-conda --conda-prefix ${conda_path} --cores {threads_number} --configfile {config_file_name} --resources max_copy={max_number_of_files_copied_at_the_same_time}
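
For example, with the repository cloned in the current directory, conda environments stored under /home/user/conda_envs, 4 cores and a config file named config.yaml (all values below are illustrative placeholders to adapt to your setup):

snakemake --snakefile microbiome16S_pipeline/Snakefile --use-conda --conda-prefix /home/user/conda_envs --cores 4 --configfile config.yaml --resources max_copy=4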

Method 2 - With Singularity

Relatively easy to set up and easier to work with than Docker, thanks to simpler user-rights management. We take advantage of Singularity's ability to run the Docker container prepared for this pipeline. Containerization ensures software stability: all dependencies are shipped within the container.

Requirements:

Computer

Same as for Method 1, but restricted to Linux (an alpha version of Singularity exists for macOS).

Tested on Ubuntu 18.04 with 4 CPUs and 32 GB of RAM

Singularity

Singularity is a container system able to run both Singularity and Docker containers. It should be installed as indicated here.

Tested with version 3.0.1

Commands:

## Run the container interactively
singularity shell docker://metagenlab/amplicon_pipeline:v.0.9.13
## Run the pipeline from within the container.
snakemake --snakefile /home/pipeline_user/microbiome16S_pipeline/Snakefile --use-conda --conda-prefix /opt/conda/ --cores {threads_number} --configfile {config_file_path} --resources max_copy={max_number_of_files_copied_at_the_same_time} mem_mb={available_memory}
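
For instance, with 4 cores, a config file named config.yaml and 16 GB of memory available (illustrative values only; the snakemake command is run from within the container shell):

singularity shell docker://metagenlab/amplicon_pipeline:v.0.9.13
snakemake --snakefile /home/pipeline_user/microbiome16S_pipeline/Snakefile --use-conda --conda-prefix /opt/conda/ --cores 4 --configfile config.yaml --resources max_copy=4 mem_mb=16000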

Method 3 - With Docker

Computer

Works on Windows, macOS and Linux. Tested on Linux, Windows 10 and macOS.

User settings:

Our Docker image is configured for a user called "pipeline_user" with UID 1080. It is advised to create this user on your computer before using the Docker image to run your analysis:

sudo useradd -G docker,sudo -u 1080 pipeline_user
sudo mkdir /home/pipeline_user/
sudo chown pipeline_user -R /home/pipeline_user/
sudo passwd pipeline_user

Alternatively, you can run the container as root (--user root), but the created folders will then belong to your computer's root user.

Docker

Install the CE version following these instructions for Ubuntu. Also make sure that you have created the docker group and that you can run Docker without sudo, following these instructions. If you cannot access the internet from inside a Docker container, apply these changes.

Use

Connected as pipeline_user:

docker run -it --rm --mount source="$(pwd)",target=/home/pipeline_user/data/analysis/,type=bind metagenlab/amplicon_pipeline:v.0.9.13

and then

snakemake --snakefile /home/pipeline_user/microbiome16S_pipeline/Snakefile --use-conda --conda-prefix /opt/conda/ --cores {threads_number} --configfile {config_file_path} --resources max_copy={max_number_of_files_copied_at_the_same_time}

or directly

docker run -it --rm --mount source="$(pwd)",target=/home/pipeline_user/data/analysis/,type=bind metagenlab/amplicon_pipeline:v.0.9.13 \
sh -c 'snakemake --snakefile /home/pipeline_user/microbiome16S_pipeline/Snakefile --use-conda --conda-prefix /opt/conda/ --cores {threads_number} --configfile {config_file_path} --resources max_copy={max_number_of_files_copied_at_the_same_time}'

References

  • Snakemake

    Köster, J., & Rahmann, S. (2012). Snakemake-a scalable bioinformatics workflow engine. Bioinformatics, 28(19), 2520–2522. https://doi.org/10.1093/bioinformatics/bts480

  • FASTQC

    Andrews, S. (2010). FASTQC. A quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

  • MultiQC

    Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047–3048. https://doi.org/10.1093/bioinformatics/btw354

  • DADA2

    Callahan, B. J., McMurdie, P. J., Rosen, M. J., Han, A. W., Johnson, A. J. A., & Holmes, S. P. (2016). DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods, 13(7), 581–583. https://doi.org/10.1038/nmeth.3869

  • VSEARCH

    Rognes, T., Flouri, T., Nichols, B., Quince, C., & Mahé, F. (2016). VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, e2584. https://doi.org/10.7717/peerj.2584

  • Qiime

    Caporaso, J. G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F. D., Costello, E. K., … Knight, R. (2010). QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7(5), 335–336. https://doi.org/10.1038/nmeth.f.303

  • Qiime2

    Bolyen, E., Dillon, M., Bokulich, N., Abnet, C., Al-Ghalith, G., Alexander, H., … Caporaso, G. (2018). QIIME 2: Reproducible, interactive, scalable, and extensible microbiome data science. PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.27295

  • RDP

    Wang, Q., Garrity, G. M., Tiedje, J. M., & Cole, J. R. (2007). Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73(16), 5261–5267. https://doi.org/10.1128/AEM.00062-07

  • IDTAXA in Decipher

    Murali, A., Bhargava, A., & Wright, E. S. (2018). IDTAXA: A novel approach for accurate taxonomic classification of microbiome sequences. Microbiome. https://doi.org/10.1186/s40168-018-0521-5

  • EzBioCloud

    Yoon, S.-H., Ha, S.-M., Kwon, S., Lim, J., Kim, Y., Seo, H., & Chun, J. (2017). Introducing EzBioCloud: a taxonomically united database of 16S rRNA gene sequences and whole-genome assemblies. International Journal of Systematic and Evolutionary Microbiology, 67(5), 1613–1617. https://doi.org/10.1099/ijsem.0.001755

  • Silva

    Quast, C., Pruesse, E., Yilmaz, P., Gerken, J., Schweer, T., Yarza, P., … Glöckner, F. O. (2013). The SILVA ribosomal RNA gene database project: Improved data processing and web-based tools. Nucleic Acids Research. https://doi.org/10.1093/nar/gks1219

  • phyloseq

    McMurdie, P. J., & Holmes, S. (2013). phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. PLoS ONE, 8(4), e61217. https://doi.org/10.1371/journal.pone.0061217

  • Krona

    Ondov, B. D., Bergman, N. H., & Phillippy, A. M. (2011). Interactive metagenomic visualization in a Web browser. BMC Bioinformatics, 12(1), 385. https://doi.org/10.1186/1471-2105-12-385

  • ALDex2

    Fernandes, A. D., Reid, J. N. S., Macklaim, J. M., McMurrough, T. A., Edgell, D. R., & Gloor, G. B. (2014). Unifying the analysis of high-throughput sequencing datasets: Characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 2(1), 1–13. https://doi.org/10.1186/2049-2618-2-15

  • Vegan

    Oksanen, J., Kindt, R., Legendre, P., O’Hara, B., Simpson, G. L., Solymos, P. M., … & Wagner, H. (2008). The vegan package. Community Ecology Package, (May 2014), 190. Retrieved from https://bcrc.bio.umass.edu/biometry/images/8/85/Vegan.pdf

  • metagenomeSeq

    Joseph, A., Paulson, N., Olson, N. D., Wagner, J., Talukder, H., & Corrada, H. (2019). Package ‘metagenomeSeq’.

  • edgeR

    Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2009). edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140. https://doi.org/10.1093/bioinformatics/btp616

  • Simulate_PCR

    Gardner, S. N., & Slezak, T. (2014). Simulate_PCR for amplicon prediction and annotation from multiplex, degenerate primers and probes. BMC Bioinformatics, 15(1), 2–7. https://doi.org/10.1186/1471-2105-15-237

Code Snippets

Lines 12-32:
shell:
    """
    if gzip -t {input[0]}
    then
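        # the input file is already gzip-compressed: copy it as-is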
        cp {input[0]} {output[0]}
    else 
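        # the input file is uncompressed: copy it next to the output, compress it, and remove the temporary copy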
        cp {input[0]} $(dirname {output[0]})
        gzip -qc $(dirname "{output[0]}")/$(basename "{input[0]}") > {output[0]}
        rm $(dirname "{output[0]}")/$(basename "{input[0]}")
    fi

    if gzip -t {input[1]}
    then
        cp {input[1]} {output[1]}
    else 
        cp {input[1]} $(dirname {output[1]})
        gzip -qc $(dirname "{output[1]}")/$(basename "{input[1]}") > {output[1]}
        rm $(dirname "{output[1]}")/$(basename "{input[1]}")
    fi            

    """
Lines 42-52:
shell:
    """
    if gzip -t {input[0]}
    then
        cp {input[0]} {output[0]}
    else 
        cp {input[0]} $(dirname {output[0]})
        gzip -qc $(dirname "{output[0]}")/$(basename "{input[0]}") > {output[0]}
        rm $(dirname "{output[0]}")/$(basename "{input[0]}")
    fi
    """
Lines 71-76:
shell:
    """
    #cache_dir=$(vdb-config --cfg -o n | grep "/repository/user/main/public/root" | cut -f2 -d'=' | sed "s/\\"//g")
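    # reset the sra-tools configuration to its defaults, then download the .sra archive for this sample accession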
    vdb-config --restore-defaults
    prefetch -v {wildcards.sample} --max-size 100000000 &> {log}
    """
Lines 90-93:
shell:
    """
    fastq-dump --split-3 --gzip --log-level 1 --disable-multithreading --minReadLen 0 --outdir $(dirname {output[0]}) {input[0]} &> {log}
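    # --split-3 writes paired-end reads to separate _1/_2 FASTQ files, gzip-compressed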
    """
Lines 107-110:
shell:
    """
    fastq-dump --gzip --outdir $(dirname {output[0]}) --log-level 0 --disable-multithreading --minReadLen 0 {wildcards.sample} &> {log}
    """
Lines 16-19:
shell:
    """
    fastqc -t {threads} {input} -o $(dirname {output[0]}) --nogroup 2> {log[0]}
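    # --nogroup reports statistics for every base position instead of binning them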
    """
Lines 47-50:
shell:
    """
    multiqc --interactive -f -n {output[0]} $(dirname {input} | tr "\n" " ") --cl_config "max_table_rows: 100000" 2> {log[0]}
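    # aggregates the per-sample FastQC reports of a run into one interactive MultiQC report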
    """
Lines 80-83:
shell:
    """
    multiqc -f -n {output[0]} $(dirname {input} | tr "\n" " ") --cl_config "max_table_rows: 100000" 2> {log[0]}
    """
Lines 3-57:
from shutil import copyfile
import sys
import datetime
import os
import getpass
import subprocess


## Define log directory path
if "logging_folder" not in config.keys():
    logging_folder = "logs/"
else:
    logging_folder=config["logging_folder"]

if not logging_folder.endswith("/"):
    logging_folder = logging_folder + "/"

## Generate a day_hour folder to store execution-specific parameters (command, config file, sample sheet)
today = datetime.datetime.now()
date = today.strftime("%Y_%m_%d")
time = today.strftime('%H_%M_%S_%f')[:-4]
time_tag = str(date + "-" + time)
timed_logging_folder = logging_folder + time_tag + "/"

## Except for dryrun execution, create a log folder. This will be filled by execution parameters and be provided to all rules to record their standard output or specific log files.
if "--dryrun" not in sys.argv and "-n" not in sys.argv:

    os.makedirs(logging_folder, exist_ok=True)
    os.makedirs(timed_logging_folder, exist_ok=True)

    ## Copy the command line, the config file and the sample sheet in a subdirectory of the log directory
    ### Log command line
    cmd_file = timed_logging_folder + "cmd.txt"
    with open(cmd_file, "w") as f:
        f.write(" ".join(sys.argv)+"\n")

    ### Log config file
    if workflow.overwrite_configfiles[0] is not None:
            copyfile(workflow.overwrite_configfiles[0], timed_logging_folder+"config.yaml")

    ### Log local or SRA sample sheet
    if "local_samples" in config.keys():
        copyfile(config["local_samples"], timed_logging_folder+"local_samples.tsv")
    if "sra_samples" in config.keys():
        copyfile(config["sra_samples"], timed_logging_folder+"sra_samples.tsv")

    ### Log git hash of the pipeline
    git_log_path = timed_logging_folder + "git.txt"
    with open(git_log_path, "w") as devnull:
        subprocess.run(args=["git -C " + workflow.basedir + " rev-parse HEAD"], shell=True, stdout=devnull)   

    ### Log user ID  
    user_path = timed_logging_folder + "user.txt"
    user_cred = getpass.getuser()
    with open(user_path, "w") as f:
        f.write(user_cred +"\n")
Lines 3-244:
import pandas
import re
import itertools

## Define a function to extract the .fastq name pattern
def get_read_naming_patterns(directory):
    result = []
    extension= {}
    for fname in sorted(os.listdir(directory)):
        if fname.endswith("fastq.gz") or fname.endswith("fq.gz") or fname.endswith("fastq") or fname.endswith("fq"):
            regex_str = '(_L0+[1-9]+)?_(R)?(1|2)(\.|_)' #regex for finding R1 and R2, if L001 is present before, it is also included
            regex = re.compile(regex_str)
            ext = re.search(regex, fname)
            if ext is None:
                ext = re.search(r'f(?:ast)?q(?:\.gz)?', fname)
                samp = re.sub("\.$", "", re.search(r'^([^\.]*)\.*', fname).group(0))
                if samp in extension.keys():
                    if ext.group(0).endswith(".gz"):
                        extension[samp] = [ext.group(0)]
                else:
                    extension[samp] = [ext.group(0)]
            else:
                regex_after = re.compile(regex_str+".*")
                regex_before = re.compile(".*"+regex_str)
                read = re.compile(re.search(regex_after, fname).group(0))
                samp = re.sub(regex, '', re.search(regex_before, fname).group(0))
                extension.setdefault(samp, [])
                extension[samp].append(re.sub("^_", "", read.pattern))
    return(extension)


## Check for the presence of a directory for .fastq in config. If none, will be "links".
if "link_directory" in config.keys():
    link_directory = config["link_directory"]
    if not link_directory.endswith("/"):
        link_directory = link_directory + "/"
else:
    link_directory = "links/"

## Check that "tax_DB_path" ends with "/". Otherwise, correct. 
if not config["tax_DB_path"].endswith("/"):
    config["tax_DB_path"] = config["tax_DB_path"] + "/"

## Check for the existence/accessibility of tax_DB_path
if os.path.isdir(config['tax_DB_path']) is False:
   raise IOError("Cannot access the directory defined by the '{}' config variable.".format('tax_DB_path'))


## Check for the presence of a metadata table in config, either for local reads ("local_samples") or distant reads ("sra_samples")
if "local_samples" not in config.keys() and "sra_samples" not in config.keys():
    raise ValueError("No samples defined in the config file. local_samples or sra_samples must be defined in config")

## Check the local_samples/sra_samples tables for the presence of the following variables defined in config: 'sample_label', 'grouping_column', 'run_column'.
### read sample dataframe from config file
if "local_samples" in config.keys():
    metadata = config["local_samples"] 
elif "sra_samples" in config.keys():
    metadata = config["sra_samples"]    
df = pandas.read_csv(metadata, sep="\t", index_col=0)
df_colnames = set(df.columns)

### Get values for which presence should be tested in dataframe
to_check = ['sample_label', 'grouping_column', 'run_column'] # a list of config values describing columns that should be in the dataframe
to_check_dict = (dict((k, config[k]) for k in to_check))

### Set a function to compare list
def list_diff(list1, list2): 
	return (list(set(list1) - set(list2))) 

### Loop over the values of config variable to check
for key in to_check_dict.keys():

    ## Extract values from dictionnary
    to_check_values = to_check_dict[key]

    ## If it is a single value, test that it is in the columns of the metadata
    if isinstance(to_check_values, str):

        ## if the column name list is not in the column names of df dataframe then raise an error
        if to_check_values not in df_colnames:                
            message = f"{to_check_values} not available in the sample TSV file."
            raise IOError(message)

    ## If it is a list, check that it contained in metadata
    elif isinstance(to_check_values, list):

        diff = []
        diff = list_diff(to_check_values, df_colnames)

        ## if the set of column names in to_check list is not in the column names of df dataframe then raise an error
        if diff:
            for item in diff:
                """if the set of column names in to_check list is not a subset of column names of df dataframe then raise an error"""
                message = f"{' and '.join(diff)} not available in the sample TSV file."
                raise IOError(message)

    else:
        message = f"{key} should be a string or a list of string"
        raise IOError(message)


## Generate a list of input files, either local (when "local_samples" is in config, or SRA-hosted (when "sra_samples" is in config))
### Create the final pandas dataframe that will be completed
all_samples=pandas.DataFrame()

### Create a set of dictionaries to store sample characteristics
sras_ext = {}
reads_sra = {}
reads_local = {}
original_names = {}
reads_ext = {}
paths  = {}
layout = {}


### In case of local samples, work our way through the local_samples table to extract read paths (if indicated in the R1/R2 columns) or to match sample names with the .fastq files found in the "links" directory.
if "local_samples" in config.keys():
    ## Read the metadata table
    local_data = pandas.read_csv(config["local_samples"], sep="\t", index_col=0)
    local_data.index = [str(x) for x in local_data.index]
    all_local_sample_names =  "".join(list(local_data.index))
    ## Check for forbidden characters in sample names
    if "(" in all_local_sample_names or ")" in all_local_sample_names or "_-_" in all_local_sample_names:
        raise ValueError("Forbidden character in sample name in sample name file")
    ## Extract path if indicated in "R1" column
    if "R1" in list(local_data):
        for sample, sample_data in local_data.iterrows():
            if sample in paths:
                ## Check for sample names to be unique
                raise IOError("Identical sample name used multiple times: %s" % sample)
            paths[sample] =[sample_data.loc["R1"]] 
            reads_ext[sample] = "single"
            layout[sample] = "single"
            if 'R2' in local_data.columns.values:
                if "R1" in str(sample_data.loc["R2"]):
                    raise IOError("ATTENTION! R1 flag within R2 filename: %s", sample_data.loc["R2"])
                if (str(sample_data.loc["R2"]).strip()) != "nan":
                    paths[sample].append(sample_data.loc["R2"])
                    reads_ext[sample] = ["R1", "R2"]
                    layout[sample] = "paired"
        all_samples = local_data   
        paths = {**paths}


    ## In absence of a "R1" column indicating file paths, try to match the sample names with the content of the "links" folder
    else:

        reads_local = get_read_naming_patterns(link_directory)            
        original_names = { x : x for x in reads_local.keys() }
        read_correct = {}
        original_correct = {}

        ## In the absence of an "OldSampleName" column, the leftmost column is used to match with the "links" content
        if "OldSampleName" not in list(local_data):

            for sample in list(local_data.index):
                regex = re.compile(r'%s([^a-zA-Z0-9]|$)' % sample) # this regex ensures that the matching of the sample names end at the end of the str, to prevent S1 matching S10 for instance
                match = [bool(re.match(regex, x)) for x in sorted(list(original_names.keys()))]
                if sum(match) != 1: #there must be one and only one entry matching one sample name
                    raise ValueError("Problem matching SampleName to read file names: " +  sample)
                read_name = str(sorted(list(original_names.keys()))[match.index(True)]) # Sample name with _SX
                original_correct[sample] = original_names[read_name]
                read_correct[sample] = reads_local[read_name]
                paths[sample] = expand(link_directory + read_name + "_{reads}" ,  reads = read_correct[sample]) 

                ## In the presence of a "LibraryLayout" column, samples can be specified as "single" or "paired".
                if "LibraryLayout" in list(local_data):
                    if local_data.loc[sample, "LibraryLayout"].lower()=="paired":
                        reads_ext[sample]=["R1", "R2"]
                        layout[sample] = "paired"
                    elif local_data.loc[sample, "LibraryLayout"].lower()=="single":
                        reads_ext[sample]=["single"]
                        layout[sample] = "single"
                    else:
                        raise ValueError("Problem in the Local_sample file, LibraryLayout badly defined")
                ## Otherwise, "paired" is assumed
                else:
                    reads_ext[sample]=["R1", "R2"]
                    layout[sample] = "paired"

        ## In the presence of an "OldSampleName" column, this column is used to match with the "links" content
        else:
            for Old in list(local_data["OldSampleName"]):
                regex = re.compile(r'%s([^a-zA-Z0-9]|$)' % Old)
                match = [bool(re.match(regex, x)) for x in sorted(list(original_names.keys()))]
                if sum(match) != 1:
                    raise ValueError("Problem matching OldSampleName to read file names : " + Old)
                read_name=str(sorted(list(original_names.keys()))[match.index(True)]) # Sample with have SX to unify with above
                sample = str(local_data.index[local_data['OldSampleName'] == Old][0])
                original_correct[sample] = original_names[read_name]
                read_correct[sample] = reads_local[read_name]
                paths[sample] = expand(link_directory + read_name + "_{reads}" ,  reads = read_correct[sample])

                ## In the presence of a "LibraryLayout" column, samples can be specified as "single" or "paired".
                if "LibraryLayout" in list(local_data):
                    if local_data.loc[sample, "LibraryLayout"].lower()=="paired":
                        reads_ext[sample]=["R1", "R2"]
                        layout[sample] = "paired"
                    elif local_data.loc[sample, "LibraryLayout"].lower()=="single":
                        reads_ext[sample]=["single"]
                        layout[sample] = "single"
                    else:
                        raise ValueError("Problem in the Local_sample file, LibraryLayout badly defined")
                ## Otherwise, "paired" is assumed
                else:
                    reads_ext[sample]=["R1", "R2"]
                    layout[sample] = "paired"

        original_names = original_correct
        reads_local = read_correct
        reads_ext = reads_ext


        all_samples=local_data



### In case of sra-hosted samples, work our way through the "sra_table" to extract LibraryLayout and clean sample names.
if "sra_samples" in config.keys():
    sra_data = pandas.read_csv(config["sra_samples"], sep="\t", index_col=0).drop_duplicates()
    all_sra_sample_names = "".join(list(sra_data.index))
    if "(" in all_sra_sample_names or ")" in all_sra_sample_names or "_-_" in all_sra_sample_names:
        raise ValueError("Forbidden character in sample name in sra file")
    for sra_sample in list(sra_data.index):
        sample_name = str(sra_sample).replace(" ", "_").replace("&", "and").replace(":", "-")
        if sample_name in reads_sra.keys(): # if the sample name is already used, add _(n+1) at the end
            sample_name = sample_name+"_"+str(list(reads_sra.keys()).count(sample_name))
        reads_sra[sample_name]=str(sra_sample)
        if sra_data.loc[sra_sample, "LibraryLayout"].lower()=="paired":
            sras_ext[sample_name]=["1.fastq.gz", "2.fastq.gz"]
            reads_ext[sample_name]=["R1", "R2"]
        elif sra_data.loc[sra_sample, "LibraryLayout"].lower()=="single":
            sras_ext[sample_name] = ["fastq.gz"]
            reads_ext[sample_name]=["single"]
        else:
            raise ValueError("Problem in the sra file, LibraryLayout badly defined")
    all_samples=sra_data
    config["local_samples"] =  config["sra_samples"]

        # all_samples.loc[sample_name, "Replicate"]=sra_data.loc[i, "Replicate"]
read_naming = {**reads_local, **sras_ext}
original_names = {**original_names, **reads_sra}
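
For orientation, a minimal local sample sheet compatible with the parsing logic above could be written as in the following sketch (tab-separated; the first column holds sample names, the optional R1/R2 columns give read paths, and the SampleLabel, Group and Run column names are hypothetical and must match the sample_label, grouping_column and run_column values set in the config):

printf 'Sample\tR1\tR2\tSampleLabel\tGroup\tRun\n' > local_samples.tsv
printf 'S1\treads/S1_R1.fastq.gz\treads/S1_R2.fastq.gz\tS1\tcontrol\trun1\n' >> local_samples.tsv
printf 'S2\treads/S2_R1.fastq.gz\treads/S2_R2.fastq.gz\tS2\ttreated\trun1\n' >> local_samples.tsv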
Lines 3-105:
def get_grouping_key(column_of_interest):
    file_list = []
    ## Convert to list if provided as a string in config
    if isinstance(column_of_interest,str):
        column_of_interest=[column_of_interest]
    unique_col=list(set(column_of_interest))
    for i in unique_col:
        combined_values = expand("{column_of_interest}/{column_values}", column_of_interest = i, column_values = list(set(all_samples[i])))
        file_list.extend(combined_values)
    return(file_list)

### Define rarefaction levels (values in config + no rarefaction)
def get_rarefaction_key(rarefaction_values):
    file_list = []
    for i in set(rarefaction_values):
        combined_values = expand("rarefaction_{rarefaction_values}", rarefaction_values = i)
        file_list = file_list + combined_values
    file_list = file_list + ["norarefaction"]

    return(file_list)


# Transform the collapse levels from the plotting taxonomic rank
def get_taxa_collapse_level_key(collapse_level):

    file_list = []
    for i in set(collapse_level):
        if i == "OTU":
            value = 'no_collapse'
        elif i == "Species" :
            value = 'collap_7'
        elif i == "Genus" :
            value = 'collap_6'
        elif i == "Family" :
            value = 'collap_5'
        elif i == "Order" :
            value = 'collap_4'
        elif i == "Class" :
            value = 'collap_3'
        elif i == "Phylum" :
            value = 'collap_2'
        elif i == "Kingdom" :
            value = 'collap_1'
        else :
            raise ValueError("Forbidden value in taxa level for barplots")
        file_list.append(value)
    file_list = file_list + ["no_collapse"]
    return(file_list)



## Set of function to generate list of output from config ##############################################################

### Light output, including QC, count table, consensus sequences and taxonomic assignement
def light_output_list():
    output = []
    output = MultiQC
    output.append(light_output)
    if "DADA2" in config["denoiser"]:
        output.append("DADA2/2_denoised/DADA2_denoising_stats.tsv")
    return(output)

### Basic output, diagnostic plots, KRONA plots and rarefaction curves
def basic_plots_list():
    output = []
    output = light_output_list()
    output.append(basic_plots)
    return(output)

### Complete set of phyloseq, in option including transposed count table and metadata (wide to long)
def phyloseq_output_list():
    output = []
    output = basic_plots_list()
    output.append(phyloseq)
    if config["melted_phyloseq"] == True:
        output.append(phyloseq_melted)
    if config["transposed_tables"] == True:
        output.append(transposed_output)
    return(output)


### Qiime2 outputs, offering interactive visualization of the data
def Qiime2_output_list():
    output = []
    output = basic_plots_list()
    if config["Qiime2_basic_output_visualization"] == True:
        output.append(Qiime2_vis_qzv)
    return(output)

### PICRUSt2 outputs, to predict genomic content based on taxonomic profiles
def PICRUSt2_list():
    output = []
    output = basic_plots_list()
    output.append(picrust2)
    return(output)

### Rule all, the default rule including all default output (not PICRUSt2, since long to compute)
def rule_all_list():
    output = []
    output.append(basic_plots_list())
    output.append(Qiime2_vis_qzv)
    output.append(phyloseq_output_list())
    return(output)
Lines 4-249:
include: "make_output_fcts.py"

### Basic output
#### MuliQC
MultiQC = expand("QC/{RUN}_multiqc_raw_reads_report.html", RUN = set(all_samples[config["run_column"]]))
MultiQC.append("QC/multiqc_raw_reads_report.html")


#### Light pipeline output
light_output = expand("{denoiser}/2_denoised/dna-sequences.fasta", denoiser = config["denoiser"])
light_output.append(expand("{denoiser}/2_denoised/count_table.tsv", denoiser = config["denoiser"]))
light_output.append(expand("{denoiser}/3_classified/{classifier}_{tax_DB}/dna-sequences_tax_assignments.qzv", classifier = config["classifier"], denoiser = config["denoiser"], tax_DB = config["tax_DB_name"]))

#### DADA2 stat table
DADA2_stats_tables = "DADA2/2_denoised/DADA2_denoising_stats.tsv"


### Basic evaluation plots
basic_plots = expand("{denoiser}/5_visualization/{classifier}_{tax_DB}/reads/reads_plot_with_filtered.png",
                                 classifier = config["classifier"],
                                 denoiser = config["denoiser"],
                                 tax_DB = config["tax_DB_name"])



basic_plots.append(expand("{denoiser}/5_visualization/{classifier}_{tax_DB}/reads/reads_plot_with_{filter_lineage}_in_{filter_tax_rank}_filtered.png",
                                denoiser = config["denoiser"],
                                classifier = config["classifier"],
                                tax_DB = config["tax_DB_name"],
                                filter_tax_rank = config["filter_tax_rank"],
                                filter_lineage = config["filter_lineage"]))


basic_plots.append(expand("{denoiser}/5_visualization/{classifier}_{tax_DB}/rarefaction_curve/nofiltering/{rarefaction_value}_rarefaction_curve_{grouping_column}.png",
                                 denoiser = config["denoiser"],
                                 classifier = config["classifier"],
                                 tax_DB = config["tax_DB_name"],
                                 rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
                                 grouping_column = config["grouping_column"][0]))


basic_plots.append(expand("{denoiser}/5_visualization/{classifier}_{tax_DB}/rarefaction_curve/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}_rarefaction_curve_{grouping_column}.png",
                                 denoiser = config["denoiser"],
                                 classifier = config["classifier"],
                                 tax_DB = config["tax_DB_name"],
                                 rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
                                 filter_tax_rank = config["filter_tax_rank"],
                                 filter_lineage = config["filter_lineage"],
                                 grouping_column = config["grouping_column"][0]))



basic_plots.append(expand("{denoiser}/5_visualization/{classifier}_{tax_DB}/KRONA/{grouping_key}.html",
                                 denoiser = config["denoiser"],
                                 classifier = config["classifier"],
                                 tax_DB = config["tax_DB_name"],
                                 grouping_key = get_grouping_key(config["grouping_column"])))



### Advanced Phyloseq set of output
#### Not filtered
phyloseq = expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree.rds",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"]))


phyloseq.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree.rds",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"])))

phyloseq.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree_norm{norm_value}_abund{abund_value}_prev{prev_value}.rds",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
                   norm_value = config["normalization"], 
                   abund_value = config["min_abundance"], 
                   prev_value = config["min_prevalence"]))

#### Tax filtered
phyloseq.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree.rds",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
                   filter_tax_rank = config["filter_tax_rank"],
                   filter_lineage = config["filter_lineage"]))


phyloseq.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree.rds",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
                   filter_tax_rank = config["filter_tax_rank"],
                   filter_lineage = config["filter_lineage"]))

phyloseq.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree_norm{norm_value}_abund{abund_value}_prev{prev_value}.rds",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
                   filter_tax_rank = config["filter_tax_rank"],
                   filter_lineage = config["filter_lineage"],
                   norm_value = config["normalization"], 
                   abund_value = config["min_abundance"], 
                   prev_value = config["min_prevalence"]))


### Potentially spurious ASVs
phyloseq.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/norarefaction/no_collapse/base_export/tree_treeshrink/output.txt",
            denoiser = config["denoiser"],
            classifier = config["classifier"],
            tax_DB = config["tax_DB_name"],
            collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
            rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
            filter_tax_rank = config["filter_tax_rank"],
            filter_lineage = config["filter_lineage"],
            norm_value = config["normalization"], 
            abund_value = config["min_abundance"], 
            prev_value = config["min_prevalence"]))


#### Melted Phyloseq set of output
#### Not filtered
phyloseq_melted = expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree_melted.tsv",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"]))


phyloseq_melted.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree_melted.tsv",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"])))

phyloseq_melted.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree_norm{norm_value}_abund{abund_value}_prev{prev_value}_melted.tsv",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
                   norm_value = config["normalization"], 
                   abund_value = config["min_abundance"], 
                   prev_value = config["min_prevalence"]))

#### Tax filtered
phyloseq_melted.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree_melted.tsv",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
                   filter_tax_rank = config["filter_tax_rank"],
                   filter_lineage = config["filter_lineage"]))


phyloseq_melted.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree_melted.tsv",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
                   filter_tax_rank = config["filter_tax_rank"],
                   filter_lineage = config["filter_lineage"]))

phyloseq_melted.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree_norm{norm_value}_abund{abund_value}_prev{prev_value}_melted.tsv",
                   denoiser = config["denoiser"],
                   classifier = config["classifier"],
                   tax_DB = config["tax_DB_name"],
                   collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
                   rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
                   filter_tax_rank = config["filter_tax_rank"],
                   filter_lineage = config["filter_lineage"],
                   norm_value = config["normalization"], 
                   abund_value = config["min_abundance"], 
                   prev_value = config["min_prevalence"]))


#### Transposed output tables
#### Count-table transposed format - filtered
transposed_output = expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/{collapse_key}/base_with_tree_norm{norm_value}_abund{abund_value}_prev{prev_value}_export/count_table_transposed.txt",
            denoiser = config["denoiser"],
            classifier = config["classifier"],
            tax_DB = config["tax_DB_name"],
            collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
            rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
            filter_tax_rank = config["filter_tax_rank"],
            filter_lineage = config["filter_lineage"],
            norm_value = config["normalization"], 
            abund_value = config["min_abundance"], 
            prev_value = config["min_prevalence"])


#### Count-table transposed format - not filtered
transposed_output.append(expand("{denoiser}/4_physeq/{classifier}_{tax_DB}/nofiltering/{rarefaction_value}/{collapse_key}/base_with_tree_norm{norm_value}_abund{abund_value}_prev{prev_value}_export/count_table_transposed.txt",
            denoiser = config["denoiser"],
            classifier = config["classifier"],
            tax_DB = config["tax_DB_name"],
            collapse_key = get_taxa_collapse_level_key(config["phyloseq_tax_ranks"]),
            rarefaction_value = get_rarefaction_key(config["rarefaction_value"]),
            filter_tax_rank = config["filter_tax_rank"],
            filter_lineage = config["filter_lineage"],
            norm_value = config["normalization"], 
            abund_value = config["min_abundance"], 
            prev_value = config["min_prevalence"]))


### Qiime2 output
#### Qiime2 interactive visualization
Qiime2_vis_qzv = expand("{denoiser}/2_denoised/dna-sequences.fasta",
                        denoiser = config["denoiser"])
Qiime2_vis_qzv.append(expand("{denoiser}/2_denoised/count-table.qzv",
                             denoiser = config["denoiser"]))
Qiime2_vis_qzv.append(expand("{denoiser}/2_denoised/rep-seqs.qzv",
                             denoiser = config["denoiser"]))
Qiime2_vis_qzv.append(expand("{denoiser}/3_classified/{classifier}_{tax_DB}/dna-sequences_tax_assignments.qzv",
                      denoiser = config["denoiser"],
                      classifier = config["classifier"],
                      tax_DB = config["tax_DB_name"]))


### Picrust2
picrust2 = expand("{denoiser}/6_picrust2/{classifier}_{tax_DB}/{filter_lineage}_in_{filter_tax_rank}/{rarefaction_value}/picrust/",
                    denoiser = config["denoiser"],
                    classifier = config["classifier"],
                    tax_DB = config["tax_DB_name"],
                    filter_tax_rank = config["filter_tax_rank"],
                    filter_lineage = config["filter_lineage"],
                    rarefaction_value = get_rarefaction_key(config["rarefaction_value"])
                    )
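
To make the configuration keys referenced in the code above concrete, here is a minimal, hypothetical config.yaml sketch written from a shell. The key names are taken from the code in this section, while every value is an illustrative placeholder; the accepted options (for the denoiser, classifier, normalization, taxonomy database, etc.) are defined by the pipeline and must be adapted to your setup:

cat > config.yaml <<'EOF'
# All values below are illustrative placeholders
local_samples: local_samples.tsv
link_directory: links/
logging_folder: logs/
run_column: Run
sample_label: SampleLabel
grouping_column: [Group]
denoiser: [DADA2]
classifier: [RDP]
tax_DB_path: /path/to/taxonomy_database/
tax_DB_name: [EzBioCloud]
filter_tax_rank: Kingdom
filter_lineage: Bacteria
rarefaction_value: [10000]
phyloseq_tax_ranks: [OTU, Genus]
normalization: [NONE]
min_abundance: [0]
min_prevalence: [0]
melted_phyloseq: false
transposed_tables: false
Qiime2_basic_output_visualization: true
EOF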
Lines 22-23:
script: 
    "scripts/1_cutadapt_paired.py"
Lines 43-44:
script: 
    "scripts/1_cutadapt_single.py"
Lines 40-41:
script:
    "scripts/1_DADA2_q_filtering_paired.R"
Lines 62-63:
script:
    "scripts/1_DADA2_q_filtering_single.R"
Lines 85-86:
script:
    "scripts/2a_big_data_DADA2_learn_errors_paired.R"
Lines 104-105:
script:
    "scripts/2a_big_data_DADA2_learn_errors_single.R"
Lines 132-133:
script:
    "scripts/2b_big_data_DADA2_infer_ASV_paired.R"
Lines 157-158:
script:
    "scripts/2b_big_data_DADA2_infer_ASV_single.R"
Lines 196-197:
script:
    "scripts/2c_big_data_DADA2_merge_ASV.R"
Lines 221-222:
script:
    "scripts/2d_big_data_DADA2_merge_chimera.R"
Lines 254-255:
script:
    "scripts/2e_big_data_DADA2_filtering_stats.R"
Lines 271-274:
shell:
    """
    fastqc {input} -o $(dirname {output[0]}) 2> {log[0]}
    """
Lines 306-309:
shell:
    """
    multiqc -f -n {output[0]} $(dirname {input} | tr "\n" " ") --cl_config "max_table_rows: 100000" 2> {log[0]}
    """
Lines 1-39:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
from snakemake import shell

if snakemake.params['amplicon_type'] == "ITS":
    print("ITS Trimming")
    forward_primer_compl = Seq.reverse_complement(Seq(snakemake.params['forward_primer'], IUPAC.ambiguous_dna))
    reverse_primer_compl = Seq.reverse_complement(Seq(snakemake.params['reverse_primer'], IUPAC.ambiguous_dna))
    shell("""cutadapt \
    --cores {snakemake.threads} \
    --error-rate 0.1 \
    --times 2 \
    --overlap 3 \
    -o {snakemake.output[R1_trimmed_reads]} \
    -p {snakemake.output[R2_trimmed_reads]} \
    -g '{snakemake.params[forward_primer]}' \
    -a '{reverse_primer_compl}' \
    -G '{snakemake.params[reverse_primer]}' \
    -A '{forward_primer_compl}' \
    --match-read-wildcards \
    --discard-untrimmed \
    {snakemake.input[R1_raw_reads]} \
    {snakemake.input[R2_raw_reads]} >> {snakemake.log[0]}""")

elif snakemake.params['amplicon_type'] == "16S":
    print("16S Trimming")
    shell("""cutadapt \
    --cores {snakemake.threads} \
    --error-rate 0.1 \
    --times 1 \
    --overlap 3 \
    -o {snakemake.output[R1_trimmed_reads]} \
    -p {snakemake.output[R2_trimmed_reads]} \
    -g '{snakemake.params[forward_primer]}' \
    -G '{snakemake.params[reverse_primer]}' \
    --match-read-wildcards \
    --discard-untrimmed \
    {snakemake.input[R1_raw_reads]} \
    {snakemake.input[R2_raw_reads]} >> {snakemake.log[0]}""")
Lines 1-33:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
from snakemake import shell

if snakemake.params['amplicon_type'] == "ITS":
    print("ITS Trimming")
    reverse_primer_compl = Seq.reverse_complement(Seq(snakemake.params['reverse_primer'], IUPAC.ambiguous_dna))
    shell("""cutadapt \
    --cores {snakemake.threads} \
    --error-rate 0.1 \
    --times 2 \
    --overlap 3 \
    -o {snakemake.output[R1_trimmed_reads]} \
    -g '{snakemake.params[forward_primer]}' \
    -a '{reverse_primer_compl}' \
    --match-read-wildcards \
    --discard-untrimmed \
    {snakemake.input[R1_raw_reads]} >> {snakemake.log[0]}""")

elif snakemake.params['amplicon_type'] == "16S":
    print("16S Trimming")
    reverse_primer_compl = Seq.reverse_complement(Seq(snakemake.params['reverse_primer'], IUPAC.ambiguous_dna))
    shell("""cutadapt \
    --cores {snakemake.threads} \
    --error-rate 0.1 \
    --times 1 \
    --overlap 3 \
    -o {snakemake.output[R1_trimmed_reads]} \
    -g '{snakemake.params[forward_primer]}' \
    -a '{reverse_primer_compl}' \
    --match-read-wildcards \
    --discard-untrimmed \
    {snakemake.input[R1_raw_reads]}  >> {snakemake.log[0]}""")
Lines 8-41:
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    fnFs <- snakemake@input[[1]]
    fnRs <- snakemake@input[[2]]

## Output
    q_score_filtered_F <- snakemake@output[["q_score_filtered_F"]]
    q_score_filtered_R <- snakemake@output[["q_score_filtered_R"]]
    q_filtering_stats_path <- snakemake@output[["q_filtering_stats"]]

## Parameters
    F_length <- snakemake@params[["F_reads_length_trim"]]
    R_length <- snakemake@params[["R_reads_length_trim"]]
    F_extected_error <- snakemake@params[["F_reads_extected_error"]]
    R_extected_error <- snakemake@params[["R_reads_extected_error"]]
    sample_name <- snakemake@params[["sample_name"]]


## Load needed libraries
    library(dada2);packageVersion("dada2")

## Filter and trim
### Reads are filtered based on the number of errors expected for each read (integrating the Q score and the read length). Reads containing uncalled nucleotides (N) and remaining phiX reads are removed as well. Finally, reads are truncated to the lengths set in the config.

### Filter and trim. The filtered reads are written directly, while the filtering stats are saved for later compilation.
    filtering_stats <- filterAndTrim(fwd = fnFs, filt = q_score_filtered_F, rev = fnRs, filt.rev = q_score_filtered_R, truncLen=c(F_length,R_length), maxN=0, maxEE=c(F_extected_error,R_extected_error), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=snakemake@threads, verbose = TRUE)
    filtering_stats <- as.data.frame(filtering_stats)
    filtering_stats$Sample <- sample_name

### Save the stats for this sample in a R object
    saveRDS(filtering_stats, file = q_filtering_stats_path)
Lines 8-37:
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    fnFs <- snakemake@input[[1]]

## Output
    q_score_filtered_F <- snakemake@output[["q_score_filtered_F"]]
    q_filtering_stats_path <- snakemake@output[["q_filtering_stats"]]

## Parameters
    F_length <- snakemake@params[["F_reads_length_trim"]]
    F_extected_error <- snakemake@params[["F_reads_extected_error"]]
    sample_name <- snakemake@params[["sample_name"]]


## Load needed libraries
    library(dada2);packageVersion("dada2")

## Filter and trim
### Reads are filtered based on the number of errors expected for each read (integrating the Q score and the read length). Reads containing uncalled nucleotides (N) and remaining phiX reads are removed as well. Finally, reads are truncated to the length set in the config.

### Filter and trim. The filtered reads are written directly, while the filtering stats are saved for later compilation.
    filtering_stats <- filterAndTrim(fwd = fnFs, filt = q_score_filtered_F, truncLen=c(F_length), maxN=0, maxEE=c(F_extected_error), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=snakemake@threads, verbose = TRUE)
    filtering_stats <- as.data.frame(filtering_stats)
    filtering_stats$Sample <- sample_name

### Save the stats for this sample in a R object
    saveRDS(filtering_stats, file = q_filtering_stats_path)
Lines 8-40:
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    q_score_filtered_F <- snakemake@input[["q_score_filtered_F"]]
    q_score_filtered_R <- snakemake@input[["q_score_filtered_R"]]

## Output
    error_profile_F <- snakemake@output[["error_profile_F"]]
    error_profile_R <- snakemake@output[["error_profile_R"]]
    error_plot_F <- snakemake@output[["error_profile_F_plot"]]
    error_plot_R <- snakemake@output[["error_profile_R_plot"]]

## Load needed libraries
    library(dada2); packageVersion("dada2")


## Learn error rates
    errF <- learnErrors(q_score_filtered_F, nbases=1e8, multithread = snakemake@threads, verbose = 1)
    errR <- learnErrors(q_score_filtered_R, nbases=1e8, multithread = snakemake@threads, verbose = 1)

## Write these error profiles
    saveRDS(object = errF, file = error_profile_F)
    saveRDS(object = errR, file = error_profile_R)

## Write error_plots
    png(error_plot_F)
    plotErrors(errF, nominalQ=TRUE)
    dev.off()

    png(error_plot_R)
    plotErrors(errR, nominalQ=TRUE)
    dev.off()
Lines 8-33:
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    q_score_filtered_F <- snakemake@input[["q_score_filtered_F"]]

## Output
    error_profile_F <- snakemake@output[["error_profile_F"]]
    error_plot_F <- snakemake@output[["error_profile_F_plot"]]

## Load needed libraries
    library(dada2); packageVersion("dada2")


## Learn error rates
    errF <- learnErrors(q_score_filtered_F, nbases=1e8, multithread = snakemake@threads, verbose = 1)

## Write these error profiles
    saveRDS(object = errF, file = error_profile_F)

## Write error_plots
    png(error_plot_F)
    plotErrors(errF, nominalQ=TRUE)

    dev.off()
Lines 8-72:
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    q_score_filtered_F <- snakemake@input[["q_score_filtered_F"]]
    q_score_filtered_R <- snakemake@input[["q_score_filtered_R"]]
    error_profile_F <- snakemake@input[["error_profile_F"]]
    error_profile_R <- snakemake@input[["error_profile_R"]]

## Output
    infer_stats <- snakemake@output[["infer_stats"]]
    sample_seq_tab <- snakemake@output[["sample_seq_tab"]]

## Parameters
    sample_name <- snakemake@params[["sample_name"]]
    run <- snakemake@params[["run"]]
    x_column_value <- snakemake@params[["x_column_value"]]
    min_overlap <- snakemake@params[["min_overlap"]]
    print(paste("min_overlap is :", min_overlap))

## Load needed libraries
    library(dada2); packageVersion("dada2")

## Create a useful function to count the number of sequences
    getN <- function(x) sum(getUniques(x))

## File renaming
    names(q_score_filtered_F) <- sample_name
    names(q_score_filtered_R) <- sample_name

## Read error rates
    errF <- readRDS(error_profile_F)
    errR <- readRDS(error_profile_R)

## Prepare named vector
    mergers <- vector("list", 1)
    names(mergers) <- sample_name

## Sample inference and merger of paired-end reads
    cat("Processing:", sample_name, "\n")
    ## Forward
        derepF <- derepFastq(q_score_filtered_F, verbose = TRUE)
        ddF <- dada(derepF, err=errF, multithread = snakemake@threads, verbose = 1, pool = FALSE, selfConsist = TRUE)
    ## Reverse
        derepR <- derepFastq(q_score_filtered_R, verbose = TRUE)
        ddR <- dada(derepR, err=errR, multithread = snakemake@threads, verbose = 1, pool = FALSE, selfConsist = TRUE)
    ## Merge
        merger <- mergePairs(dadaF = ddF, derepF= derepF, dadaR = ddR, derepR = derepR, verbose = TRUE, minOverlap = min_overlap, maxMismatch = 0)
        mergers[[sample_name]] <- merger

## Save the dereplicated, corrected and merged sequences for this sample
    saveRDS(mergers, file = sample_seq_tab)

## For statistics record the number of reads
    ### Write the statistics in a dataframe
        infer <- data.frame(denoisedF = getN(ddF))
        infer$denoisedR <- getN(ddR)
        infer$merged_pairs <- getN(merger)
        infer$Sample <- sample_name
        infer$label <- x_column_value
        infer$RUN <- run

    ### Save the sequences stats for this sample
        saveRDS(infer, file = infer_stats)
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    q_score_filtered_F <- snakemake@input[["q_score_filtered_F"]]
    error_profile_F <- snakemake@input[["error_profile_F"]]

## Output
    infer_stats <- snakemake@output[["infer_stats"]]
    sample_seq_tab <- snakemake@output[["sample_seq_tab"]]

## Parameters
    sample_name <- snakemake@params[["sample_name"]]
    run <- snakemake@params[["run"]]
    x_column_value <- snakemake@params[["x_column_value"]]
    min_overlap <- snakemake@params[["min_overlap"]]
    print(paste("min_overlap is :", min_overlap))

## Load needed libraries
    library(dada2); packageVersion("dada2")

## Create a useful function to count the number of sequences
    getN <- function(x) sum(getUniques(x))

## File renaming
    names(q_score_filtered_F) <- sample_name

## Read error rates
    errF <- readRDS(error_profile_F)

## Prepare named vector
    mergers <- vector("list", 1)
    names(mergers) <- sample_name

## Sample inference
    cat("Processing:", sample_name, "\n")
    ## Forward
        derepF <- derepFastq(q_score_filtered_F, verbose = TRUE)
        ddF <- dada(derepF, err=errF, multithread = snakemake@threads, verbose = 1, pool = FALSE, selfConsist = TRUE)
        mergers[[sample_name]] <- ddF

## Save the dereplicated, corrected sequences for this sample
    saveRDS(mergers, file = sample_seq_tab)

## For statistics record the number of reads
    ### Write the statistics in a dataframe
        infer <- data.frame(denoisedF = getN(ddF))
        infer$denoisedR <- NA
        infer$merged_pairs <- infer$denoisedF
        infer$Sample <- sample_name
        infer$label <- x_column_value
        infer$RUN <- run

    ### Save the sequences stats for this sample
        saveRDS(infer, file = infer_stats)
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    infer_stats <- snakemake@input[["infer_stats"]]
    sample_seq_table <- snakemake@input[["sample_seq_tab"]]

## Output
    run_stats <- snakemake@output[["run_stats"]]
    run_seq_table <- snakemake@output[["run_seq_table"]]

## Load needed libraries
    library(dada2); packageVersion("dada2")

## Merge and write the merged paired-end reads of the run
    input_M <- sapply(sample_seq_table, readRDS, simplify = TRUE, USE.NAMES = FALSE)
    st.all <- makeSequenceTable(input_M)
    saveRDS(object = st.all, file = run_seq_table)

## Merge and write the read-correction stats of the run
    stats <- do.call("rbind", lapply(infer_stats, readRDS))
    saveRDS(object = stats, file = run_stats)
  log <- file(snakemake@log[[1]], open="wt")
  sink(log)
  sink(log, type="message")

# Input
  seq_tab <- snakemake@input[["seq_tab"]]

# Output
  no_chim <- snakemake@output[["no_chim"]]
  length_filtered <- snakemake@output[["length_filtered"]]
  rep_seqs <- snakemake@output[["rep_seqs"]]
  count_table <- snakemake@output[["count_table"]]
  length_histo <- snakemake@output[["length_histo"]]

# Parameters
  merged_min_length <- snakemake@params[["merged_min_length"]]
  merged_max_length  <- snakemake@params[["merged_max_length"]]

# Load needed libraries
  library(dada2); packageVersion("dada2")
  library(ggplot2); packageVersion("ggplot2")

# Merge data from multiple runs (if necessary)
   if (length(seq_tab) == 1){
        print("Single run, no merging of sequence tables")
        st.all <- readRDS(seq_tab)
   }else{
        print("Multiple runs, merging sequence tables")
        st.all <- do.call("mergeSequenceTables", lapply(seq_tab, readRDS))
   }

# Remove chimeras
  print("Filter chimera")
  seqtab <- removeBimeraDenovo(st.all, method="consensus", multithread=snakemake@threads, verbose=TRUE)
  print("Chimera filtered")

# Sequence length inspection and filtering
#### This is the added step that motivates using this script rather than the QIIME2 equivalent: we typically keep only merged sequences between 390 and 500 bp, which corresponds to the expected length of the V3-V4 region of the 16S rRNA gene.
  ## Inspect distribution of sequence lengths
  unfiltered_length_table <- tapply(colSums(seqtab), factor(nchar(getSequences(seqtab))), sum)
  unfiltered_length_table
  unfiltered_length_table_df <- data.frame(length=as.numeric(names(unfiltered_length_table)), count=unfiltered_length_table)
  png(length_histo)
  p <- ggplot(data=unfiltered_length_table_df, aes(x=length, y=count)) + geom_col()
  print(p)
  dev.off()

# Filter based on length
seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% seq(merged_min_length, merged_max_length)]
# Inspect distribution of sequence lengths after filtration
table(nchar(getSequences(seqtab2)))

# Export reads and count
#### Write the products of this DADA2 process to files: one .fasta file containing the dereplicated, error-corrected, paired-end merged representative sequences, and one .txt count table giving the abundance of each sequence in each sample.
    asv_seqs <- colnames(seqtab2)

    ## giving our seq headers more manageable names (ASV_1, ASV_2...)
    asv_headers <- vector(dim(seqtab2)[2], mode="character")
    for (i in 1:dim(seqtab2)[2]) {
      asv_headers[i] <- paste(">ASV", i, sep="_")
    }

    ## making and writing out a fasta of our final ASV seqs:
    asv_fasta <- c(rbind(asv_headers, asv_seqs))
    write(asv_fasta, rep_seqs)

  ## count table:
    print("Create count table")
    asv_tab <- t(seqtab2)

 ## Rename rows with the ASV headers
    row.names(asv_tab) <- sub(">", "", asv_headers)
    write.table(asv_tab, count_table , sep="\t", quote=F)

  ## Write sequences objects in .rds for use in statistics
    ## Before length filtration
    saveRDS(seqtab, no_chim)
    ## After length filtration
    saveRDS(seqtab2, length_filtered)
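
  ## As a quick check (a sketch only; the file names below are hypothetical placeholders for the configured no_chim and length_filtered outputs), the two RDS objects written above can be reloaded to compare the sequence length distribution before and after the length filter:

    library(dada2)

    seqtab_nochim  <- readRDS("no_chim.rds")          # hypothetical path
    seqtab_len_flt <- readRDS("length_filtered.rds")  # hypothetical path

    table(nchar(getSequences(seqtab_nochim)))   # length distribution before filtering
    table(nchar(getSequences(seqtab_len_flt)))  # lengths kept (e.g. 390-500 bp for V3-V4)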
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    q_filtering_stats <- snakemake@input[["q_filtering_stats"]]
    run_stats <- snakemake@input[["run_stats"]]
    no_chim <- snakemake@input[["no_chim"]]
    l_filtered <- snakemake@input[["length_filtered"]]

## Output
    filtering_stats <- snakemake@output[["filtering_stats"]]

## Load needed libraries
    library(dplyr); packageVersion("dplyr")

## Load the q score filtration R stats
    filtration <- do.call("rbind", lapply(q_filtering_stats, readRDS))

## Load the forward, reverse and paired reads correction stats for each run
    dada_infer <- do.call("rbind", lapply(run_stats, readRDS))

## Load reads with filtered chimera
    no_chimera <- do.call("rowSums", lapply(no_chim, readRDS))
    no_chimera <- as.data.frame(no_chimera)
    no_chimera$Sample <- row.names(no_chimera)

## Load length_filtered sequences
    length_filtered <- do.call("rowSums", lapply(l_filtered, readRDS))
    length_filtered <- as.data.frame(length_filtered)
    length_filtered$Sample <- row.names(length_filtered)

## Merge all dataframes together
    all_stats <- Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "Sample", all.x = TRUE),
       list(filtration, dada_infer, no_chimera, length_filtered))

## Put Sample, label and RUN at the front of the table
    all_stats <- all_stats%>%select(Sample, label, RUN, everything())

## Compute % of passed reads
    all_stats$passed_reads_pct <- (100*all_stats$length_filtered)/all_stats$reads.in

## Write the stats table
write.table(x = all_stats, file = filtering_stats, sep = "\t", row.names = FALSE)
    shell:
        '''
        pandaseq \
        -f {input[R1_raw_reads]} \
        -r {input[R2_raw_reads]} \
        -p {params[forward_primer]} \
        -q {params[reverse_primer]} \
        -A simple_bayesian \
        -l {params[merged_min_length]} \
        -L {params[merged_max_length]} \
        -G {log} \
        -w {output[paired_trimmed_reads]} \
        -B \
        -T {threads} \
        -o {params[min_overlap]} \
        -N
        '''
    shell:
        '''
        pandaseq \
        -f {input[R1_raw_reads]} \
        -r {input[R2_raw_reads]} \
        -A simple_bayesian \
        -l {params[merged_min_length]} \
        -L {params[merged_max_length]} \
        -G {log} \
        -w {output[paired_trimmed_reads]} \
        -B \
        -T {threads} \
        -o {params[min_overlap]} \
        -N
        '''
shell:
    '''
    if [ -s {input} ]
    then
        vsearch --derep_fulllength {input} \
        --sizeout \
        --output {output[0]} \
        --uc {output[1]} 2> {log}
    else
        echo "{input} is empty"
        echo "> \n" > {output[0]}
        touch {output[1]}
    fi

    '''
shell:
    '''
    cat {input} >> {output}
    '''
shell:
    '''
    vsearch --derep_fulllength {input} \
            --sizeout \
            --minuniquesize 2 \
            --output {output} \
            2> {log}
    '''
shell:
    '''
    vsearch --cluster_size {input} \
            --sizein \
            --id 0.97 \
            --centroids {output[centroid]} \
            --consout {output[consout]} \
            --profile {output[profile]} \
            --threads {threads} \
            2> {log}
    '''
shell:
    '''
    vsearch --uchime_denovo {input} \
    --abskew 2 \
    --sizein \
    --nonchimeras {output[non_chimeras]} \
    --borderline {output[borderline]} \
    --uchimeout {output[chimera_out]} \
    2> {log}
    '''
shell:
    '''
    vsearch --derep_fulllength {input} \
    --sizein \
    --relabel OTU_ \
    --xsize \
    --output {output} \
    2> {log}
    '''
shell:
    '''
    vsearch --usearch_global {input[samples]} \
    --otutabout {output} \
    --id 0.97 \
    --strand plus \
    --db {input[rep_seq]} \
    --threads {threads} \
    2> {log}
    '''
script:
    "scripts/create_count_table_from_vsearch.R"
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Load needed library
    library(dplyr);packageVersion("dplyr")
    library(reshape2);packageVersion("reshape2")
    library(magrittr);packageVersion("magrittr")

## Input
    count_table_samples <- snakemake@input[["count_table_samples"]]

## Output
    count_table <- snakemake@output[["count_table"]]

## Reformat
otus_table <- data.frame(array(dim=c(0,3)))

### Loop over each sample file. If a file is empty, only register the sample name as a factor level so the sample still appears in the final table
for (file_path in count_table_samples){
  sample_name <- gsub("_count_table.tsv", "", basename(file_path))
  sample_otu_table <- read.table(file = file_path, sep="\t", as.is=T, check.names = F, header=T, comment.char = "",  skipNul = TRUE)
  if (nrow(sample_otu_table)>0){
    sample_otu_table <- cbind("Sample"=sample_name, sample_otu_table)
    otus_table <- rbind(otus_table, sample_otu_table)
  }else if (nrow(sample_otu_table) == 0){
    levels(otus_table$Sample) <- c(levels(otus_table$Sample), sample_name)
  }
}

colnames(otus_table) <- c("Sample", "OTU_ID", "counts")

## Transform this table to have a wide format where we have a column by sample
transf_vsearch_table <- otus_table %>% 
  dplyr::group_by(Sample, OTU_ID, .drop = FALSE) %>%
  dplyr::summarise(counts = sum(counts)) %>%
  reshape2::dcast(OTU_ID ~ Sample) %>%
  dplyr::filter(!is.na(OTU_ID))

## Set OTU as rownames
transf_vsearch_table <- set_rownames(x = transf_vsearch_table, value = transf_vsearch_table$OTU_ID)
transf_vsearch_table[,1] <- NULL

## Write output
    write.table(x = transf_vsearch_table, file = count_table, sep="\t", quote=F, na = "0")
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    seqs <- snakemake@input[["seqs"]]
    King_to_Species <- snakemake@input[["King_to_Species"]]
    King_to_Genus <- snakemake@input[["King_to_Genus"]]
    Genus_species <- snakemake@input[["Genus_species"]]

## Output
    tax <- snakemake@output[["tax"]]

## Parameters
    cpu <- snakemake@threads

## Load needed libraries
    library(dada2);packageVersion("dada2")
    library(dplyr);packageVersion("dplyr")
    library(tidyr);packageVersion("tidyr")
    library(Biostrings);packageVersion("Biostrings")

## Read data
    fastaFile <- readDNAStringSet(seqs)
    seq.name = names(fastaFile)
    sequence = paste(fastaFile)
    fasta <- data.frame(seq.name, sequence)

## Format seqs
    seq_table <- as.character(fasta$sequence)
    names(seq_table) <- as.character(fasta$seq.name)

## Assign taxonomy
    print("Assigning")
    taxa <- assignTaxonomy(seqs = seq_table, refFasta = King_to_Species, taxLevels = c("Kingdom","Phylum","Class","Order","Family","Genus", "Species"), multithread=cpu, tryRC = TRUE, minBoot = 50, verbose = TRUE, outputBootstraps = TRUE)
    #taxa <- addSpecies(taxtab = taxa, refFasta = Genus_species, verbose=TRUE, allowMultiple = TRUE, tryRC = TRUE)
    # Not working for some reason

## Format and write output
    taxa_table <- data.frame(taxa)
    taxa_table <- data.frame(cbind(Row.Names = names(rownames(unlist(taxa$tax))), taxa_table))
    taxa_table <- taxa_table %>% unite(taxonomy, starts_with(match = "tax."), sep = ";", remove = TRUE)
    taxa_table <- taxa_table %>% unite(boot, starts_with(match = "boot"), sep = ";", remove = TRUE)

    #taxa_table$Row.Names<-NULL
    write.table(taxa_table, tax, row.names = FALSE, sep = "\t", col.names = FALSE, quote = FALSE)
   log <- file(snakemake@log[[1]], open="wt")
   sink(log)
   sink(log, type="message")

## Input
    trained_tax_path <- snakemake@input[["trained_tax"]]
    seq_path <- snakemake@input[["seq"]]	

## Output
    tax_path <- snakemake@output[["tax"]]
    tax_plot_path <- snakemake@output[["tax_plot"]]

## Parameters
    cpu <- snakemake@threads

## Load needed libraries
    library(DECIPHER);packageVersion("DECIPHER")
    library(dplyr);packageVersion("dplyr")
    library(tidyr);packageVersion("tidyr")

# load a training set object (trainingSet)
# see http://DECIPHER.codes/Downloads.html
train_set <- readRDS(trained_tax_path )
query_seqs <- readDNAStringSet(seq_path)


# classify the sequences
ids <- IdTaxa(query_seqs,
   train_set,
   strand="both", # or "top" if same as trainingSet
   threshold=60, # 60 (very high) or 50 (high)
   processors=cpu) # use all available processors

# look at the results
 pdf(file = tax_plot_path)
 plot(ids)
 dev.off()


## Format taxonomy and write it
taxid <- data.frame(t(sapply(ids, function(x){
  taxa <- c(unlist(x$taxon)[2:8], last(x$confidence))
  taxa[startsWith(taxa, "unclassified_")] <- NA
  taxa
  })))
taxid <- unite(data = taxid, 1:7, col = "taxonomy", sep = ";", remove = TRUE)
write.table(x = taxid, tax_path, row.names = TRUE, sep = "\t", col.names = FALSE, quote = FALSE)
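
## For reference, the trainingSet object read above can either be downloaded pre-built from http://DECIPHER.codes/Downloads.html or trained locally. Below is a minimal sketch using DECIPHER::LearnTaxa; the reference FASTA and output file names are hypothetical and not part of the pipeline:

    library(DECIPHER)

    ## Reference sequences whose FASTA headers encode the taxonomy,
    ## e.g. "Root; Bacteria; Proteobacteria; ...; Escherichia" (hypothetical file)
    ref_seqs <- readDNAStringSet("reference_16S.fasta")
    taxonomy <- names(ref_seqs)

    ## Train the classifier and save it as the RDS expected by trained_tax above
    trainingSet <- LearnTaxa(ref_seqs, taxonomy)
    saveRDS(trainingSet, "decipher_trained_tax.rds")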
   log <- file(snakemake@log[[1]], open="wt")
   sink(log)
   sink(log, type="message")

## Input
rdp_tax <- snakemake@input[["RDP_output"]]

## Output
simplified_tax <-snakemake@output[["formatted_output"]]


## Load needed libraries
library(dplyr);packageVersion("dplyr")
library(tidyr);packageVersion("tidyr")

## Read data
tax_table <- read.table(rdp_tax, sep="\t", stringsAsFactors=FALSE)

## Format table
   ### set rownames
   row.names(tax_table) <- tax_table$V1

   ### Remove unnecessary columns
   tax_table[,c("V1", "V2", "V3", "V4", "V5")] <- NULL

## Set a table for confidence score filtering
   ### Keep score columns
   score_table <- tax_table[,seq(from = 3, to = 21, by = 3)]
   ### Rename columns
   colnames(score_table) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")
   ### Create a boolean dataframe for filtering based on confidence
   score_table_bool <- score_table < 0.5


## Set a table with taxa names
   ### Keep names columns
   names_tables <- tax_table[,seq(from = 1, to = 19, by = 3)]
   ### Rename columns
   colnames(names_tables) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")

## Filter taxa names based on boolean table
taxa_names_filtered <- replace(x = names_tables, list = score_table_bool, values = NA)

## Filter scores based on boolean table
score_table_filtered <- replace(x = score_table, list = score_table_bool, values = NA)

## Unite taxonomy
taxa_names_filtered_unite <- unite(data = taxa_names_filtered, col = "", sep = ";", na.rm = TRUE, remove = TRUE)

## Add confidence
taxa_names_filtered_unite$confidence <- apply(score_table_filtered, MARGIN = 1, FUN = min, na.rm = TRUE )

## Write formatted table
write.table(taxa_names_filtered_unite, sep = "\t", simplified_tax, row.names = TRUE, quote = FALSE, col.names = FALSE)
shell:
    '''
    export RDP_JAR_PATH=$(command -v rdp_classifier-2.2.jar);
    assign_path=$(which assign_taxonomy.py)
    python $assign_path \
      -i {input[0]} \
      -r {input[1]} \
      -t {input[2]} \
      -m rdp \
      -o $(dirname {output[0]}) \
      -c 0.5 \
      --rdp_max_memory {resources[mem_mb]} \
       2> {log[0]}
    '''
script:
    "scripts/dada2_rdp_tax_assign.R"
    shell:
        '''
        classifier -Xmx30g -XX:ConcGCThreads=1 classify \
        -c 0.5 \
        -f allrank \
        -t {input[0]} \
        -o {output[0]} {input[1]}  2> {log[0]}
        '''
script:
    "scripts/format_RDP_output.R"
script:
    "scripts/decipher_assign_tax.R"
script:
    "scripts/physeq_initial_import.R"
script:
    "scripts/physeq_export.R"
script:
    "scripts/physeq_filter_taxa.R"
script:
    "scripts/physeq_rarefy.R"
script:
    "scripts/physeq_add_new_tree.R"
script:
    "scripts/physeq_melt_table.R"
script:
    "scripts/physeq_collapse_taxa.R"
script:
    "scripts/transpose_and_add_meta_count_table.R"
script:
    "scripts/transf_norm.R"
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[["phyloseq_object"]]
new_tree <- snakemake@input[["new_tree"]]

## Output
updated_phyloseq_path <- snakemake@output[["phyloseq_object"]]



## Load libraries
library(phyloseq);packageVersion("phyloseq")



## Read the phyloseq object
phyloseq_obj <- readRDS(phyloseq_object)
phyloseq_obj <- prune_taxa(taxa_sums(phyloseq_obj) > 0, phyloseq_obj) ## Removes taxa not at least present in one sample

## Read the new tree
PHY <- read_tree(new_tree)

## Assign the new tree to the phyloseq object
phy_tree(phyloseq_obj) <- PHY

## Write the new phyloseq object
saveRDS(object = phyloseq_obj, file = updated_phyloseq_path)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[[1]]

## Output
phyloseq_filtered_object <- snakemake@output[[1]]

## Parameters
collapse_level <- snakemake@params[["collapse_level"]]



## Load needed libraries
library(phyloseq);packageVersion("phyloseq")



## Import the phyloseq object
phyloseq_object <- readRDS(phyloseq_object)

## Convert the taxonomic level to its name
rank_names(phyloseq_object)[[as.numeric(collapse_level)]]

## Collapse taxa
collapsed_physeq <- tax_glom(phyloseq_object, taxrank=rank_names(phyloseq_object)[[as.numeric(collapse_level)]], NArm=TRUE, bad_empty=c(NA, "", " ", "\t"))
collapsed_physeq <- prune_taxa(taxa_sums(collapsed_physeq) > 0, collapsed_physeq) ## Removes taxa not at least present in one sample


# Write the new phyloseq object
saveRDS(object = collapsed_physeq, file = phyloseq_filtered_object)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[["phyloseq_object"]]

## Output
tree_path <- snakemake@output[["tree_path"]]
meta_path <- snakemake@output[["meta_path"]]
taxonomy_path <- snakemake@output[["taxonomy_path"]]
OTU_path  <- snakemake@output[["OTU_path"]]
rep_seq_path  <- snakemake@output[["rep_seq_path"]]



## Load needed libraries
library(phyloseq);packageVersion("phyloseq")
library(dplyr);packageVersion("dplyr")
library(stringr);packageVersion("stringr")
library(Biostrings);packageVersion("Biostrings")



## Import the phyloseq object
phyloseq_obj <- readRDS(phyloseq_object)
phyloseq_obj <- prune_taxa(taxa_sums(phyloseq_obj) > 0, phyloseq_obj) ## Removes taxa not at least present in one sample

## Write the tree of the phyloseq object
tree <- phy_tree(phyloseq_obj)
ape::write.tree(tree, tree_path)

## Write the metadata table of the phyloseq object
metadata <- (sample_data(phyloseq_obj))
metadata$Sample <- rownames(metadata)
metadata <- metadata %>% select(Sample, everything())
write.table(metadata, meta_path , sep="\t", row.names = FALSE)

## Write the taxonomy table of the phyloseq object
taxa_df<- as.data.frame(tax_table(phyloseq_obj), stringsAsFactors = F)
taxa_df$taxonomy <- apply(taxa_df, 1, function(x) str_c(x[!is.na(x)], collapse = ";"))
taxa_df <- taxa_df %>% select(taxonomy)
write.table(taxa_df, taxonomy_path , sep="\t", quote=F, col.names = F)

## Write the OTU table of the phyloseq object
write.table(otu_table(phyloseq_obj), OTU_path , sep="\t", quote=F)

## Write the representative sequences
writeXStringSet(refseq(phyloseq_obj), rep_seq_path, append=FALSE,
                compress=FALSE, format="fasta")
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[["phyloseq_object"]]

## Output
phyloseq_filtered_object <- snakemake@output[["phyloseq_filtered_object"]]

## Parameters
tax_rank <- snakemake@params[["filter_tax_rank"]]
lineage <- snakemake@params[["filter_lineage"]]
filter_out_tax_rank  <- snakemake@params[["filter_out_tax_rank"]]
filter_out_lineage <- snakemake@params[["filter_out_lineage"]]


## Load needed libraries
library(phyloseq);packageVersion("phyloseq")
library(dplyr);packageVersion("dplyr")


## Import the phyloseq object
phyloseq_obj <- readRDS(phyloseq_object)

## Filter taxa
filtered_taxa <- subset_taxa(phyloseq_obj, get(tax_rank) == as.character(lineage))
filtered_taxa <- subset_taxa(filtered_taxa, get(filter_out_tax_rank) != as.character(filter_out_lineage))
filtered_taxa <- prune_taxa(taxa_sums(filtered_taxa) > 0, filtered_taxa) ## Removes taxa not at least present in one sample

## Recompute alpha diversity indexes after this filtration
### Remove the previously computed values
#sample_data(filtered_taxa) <- select(sample_data(filtered_taxa), -c(Observed, Chao1, se.chao1, ACE, se.ACE, Shannon, Simpson, InvSimpson, Fisher, Observed_min_1))
drop <- c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson", "Fisher", "Observed_min_1")
sample_data(filtered_taxa) <- sample_data(filtered_taxa)[,!(names(sample_data(filtered_taxa)) %in% drop)]


### Add alpha diversity indexes to metadata
alpha_div <- estimate_richness(physeq = filtered_taxa, split = TRUE, c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson"))
sample_data(filtered_taxa) <- cbind(sample_data(filtered_taxa),alpha_div)


### In addition, add the Observed over 1% metric
#### Keep the IDs of the taxa above 1%
physeqrF <- filter_taxa(physeq = filtered_taxa, function(x) mean(x) > 0.01, FALSE)
#### Keep only those
physeqaF <- prune_taxa(physeqrF,filtered_taxa)
#### Calculate the Observed index
alpha_div_1 <- estimate_richness(physeq = physeqaF, split = TRUE, measure = "Observed")
#### Rename this index
colnames(alpha_div_1) <- paste0(colnames(alpha_div_1), ("_min_1"))
#### Again bind this new column
sample_data(filtered_taxa) <- cbind(sample_data(filtered_taxa),alpha_div_1)


## Write the new phyloseq object
saveRDS(object = filtered_taxa, file = phyloseq_filtered_object)
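
## For illustration only, the same filtering logic can be reproduced interactively with hypothetical parameter values (keep Bacteria, drop chloroplast sequences); the file path and lineages below are assumptions, not pipeline defaults:

    library(phyloseq)

    ps <- readRDS("phyloseq_object.rds")               # hypothetical path
    ps_f <- subset_taxa(ps, Kingdom == "Bacteria")     # filter_tax_rank / filter_lineage
    ps_f <- subset_taxa(ps_f, Order != "Chloroplast")  # filter_out_tax_rank / filter_out_lineage
    ps_f <- prune_taxa(taxa_sums(ps_f) > 0, ps_f)      # drop taxa left with zero counts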
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
count_table <- snakemake@input[["count_table"]]
Metadata_table <- snakemake@input[["Metadata_table"]]
taxonomy_table <- snakemake@input[["taxonomy_table"]]
rep_seqs <- snakemake@input[["rep_seqs"]]
tax_tree <- snakemake@input[["tax_tree"]]

## Output
phyloseq_object <- snakemake@output[["phyloseq_object"]]

## Parameters
replace_empty_tax <- snakemake@params[["viz_replace_empty_tax"]]



## Load libraries
library(dplyr);packageVersion("dplyr")
library(tibble);packageVersion("tibble")
library(tidyr);packageVersion("tidyr")
library(phyloseq);packageVersion("phyloseq")
library(Biostrings);packageVersion("Biostrings")


## Import data

### Read count table
print("reading count table")
count_table <- read.table(file = count_table, header = TRUE, check.names=FALSE)

### Read sample_data
print("reading metadata")
metadata <- read.delim(file = Metadata_table, sep = "\t", header = TRUE, na.strings = "NA")

### Read taxonomic tree
print("reading taxonomic tree")
PHY <- read_tree(tax_tree)

### Read representative sequences
print("importing representative sequences from fasta")
SEQS <- readDNAStringSet(rep_seqs)

### Read and format taxonomy table
print("reading taxonomy table")
taxonomy_table<-read.table(file = taxonomy_table, header = FALSE, sep = "\t")

    ### Convert the table into a tabular split version
    taxonomy_table<-taxonomy_table %>% as_tibble() %>% separate(V2, sep=";", c("Kingdom","Phylum","Class","Order","Family","Genus","Species"))

    ### Rename the generic column headers to meaningful ones
    colnames(taxonomy_table)[colnames(taxonomy_table)=="V1"] <- "Feature.ID"
    colnames(taxonomy_table)[colnames(taxonomy_table)=="V3"] <- "Confidence"

    ### Convert taxonomic levels as character (needed for the next steps)
    taxonomy_table$Kingdom<-as.character(taxonomy_table$Kingdom)
    taxonomy_table$Phylum<-as.character(taxonomy_table$Phylum)
    taxonomy_table$Class<-as.character(taxonomy_table$Class)
    taxonomy_table$Order<-as.character(taxonomy_table$Order)
    taxonomy_table$Family<-as.character(taxonomy_table$Family)
    taxonomy_table$Genus<-as.character(taxonomy_table$Genus)
    taxonomy_table$Species<-as.character(taxonomy_table$Species)

        print(paste("replace empty taxonomy :", replace_empty_tax))

        if(isTRUE(replace_empty_tax)) {
          ### Replace NA with the name of the parent rank plus a placeholder suffix at each taxonomic level
          taxonomy_table$Kingdom[is.na(taxonomy_table$Kingdom)] <- (("Unknown_Kingdom")[is.na(taxonomy_table$Kingdom)])
          taxonomy_table$Phylum[is.na(taxonomy_table$Phylum)] <- ((paste(taxonomy_table$Kingdom,"_phy",sep=""))[is.na(taxonomy_table$Phylum)])
          taxonomy_table$Class[is.na(taxonomy_table$Class)] <- ((paste(taxonomy_table$Phylum,"_clas",sep=""))[is.na(taxonomy_table$Class)])
          taxonomy_table$Order[is.na(taxonomy_table$Order)] <- ((paste(taxonomy_table$Class,"_ord",sep=""))[is.na(taxonomy_table$Order)])
          taxonomy_table$Family[is.na(taxonomy_table$Family)] <- ((paste(taxonomy_table$Order,"_fam",sep=""))[is.na(taxonomy_table$Family)])
          taxonomy_table$Genus[is.na(taxonomy_table$Genus)] <- ((paste(taxonomy_table$Family,"_gen",sep=""))[is.na(taxonomy_table$Genus)])
          taxonomy_table$Species[is.na(taxonomy_table$Species)] <- ((paste(taxonomy_table$Genus,"_sp",sep=""))[is.na(taxonomy_table$Species)])

          print("NA values in the taxonomy table replaced by placeholders")

          }else{

          print("NA values in the taxonomy table NOT replaced by placeholders")
          }


## Import as physeq object where needed
OTU <- otu_table(count_table, taxa_are_rows = TRUE)
TAX <- taxonomy_table %>% column_to_rownames("Feature.ID") %>% as.matrix() %>% tax_table()
META <- metadata %>% as.data.frame() %>% column_to_rownames("Sample") %>% sample_data()


## Merge all in a phyloseq object
phyloseq_obj <- phyloseq(OTU, TAX, META, PHY, SEQS)



## Compute alpha diversity indexes after this filtration
### Remove the previously computed values, in case
#sample_data(phyloseq_obj) <- select(sample_data(phyloseq_obj), -c(Observed, Chao1, se.chao1, ACE, se.ACE, Shannon, Simpson, InvSimpson, Fisher, Observed_min_1))

drop <- c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson", "Fisher", "Observed_min_1")
sample_data(phyloseq_obj) <- sample_data(phyloseq_obj)[,!(names(sample_data(phyloseq_obj)) %in% drop)]


### Add alpha diversity indexes to metadata
alpha_div <- estimate_richness(physeq = phyloseq_obj, split = TRUE, c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson"))
sample_data(phyloseq_obj) <- cbind(sample_data(phyloseq_obj),alpha_div)

### In addition, add the Observed over 1% metric
#### Keep the IDs of the taxa above 1%
physeqrF <- filter_taxa(physeq = phyloseq_obj, function(x) mean(x) > 0.01, FALSE)
#### Keep only those
physeqaF <- prune_taxa(physeqrF,phyloseq_obj)
#### Calculate the Observed index
alpha_div_1 <- estimate_richness(physeq = physeqaF, split = TRUE, measure = "Observed")
#### Rename this index
colnames(alpha_div_1) <- paste0(colnames(alpha_div_1), ("_min_1"))
#### Again bind this new column
sample_data(phyloseq_obj) <- cbind(sample_data(phyloseq_obj),alpha_div_1)

phyloseq_obj <- prune_taxa(taxa_sums(phyloseq_obj) > 0, phyloseq_obj) ## Removes taxa not at least present in one sample



## Write the phyloseq object
saveRDS(object = phyloseq_obj, file = phyloseq_object)
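
## As a quick sanity check (a sketch assuming the RDS path configured in your run; the file name here is hypothetical), the resulting object can be loaded and inspected with standard phyloseq accessors:

    library(phyloseq)

    ps <- readRDS("phyloseq_object.rds")    # output of this script (hypothetical name)
    ps                                      # summary of OTU table, taxonomy, tree and sequences
    head(sample_data(ps))                   # metadata now includes the alpha-diversity columns
    nsamples(ps); ntaxa(ps)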
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[[1]]

## Output
phyloseq_melted_table <- snakemake@output[[1]]



## Load needed libraries
library(phyloseq);packageVersion("phyloseq")



## Import the physeq object
phyloseq_obj <- readRDS(phyloseq_object)
phyloseq_obj <- prune_taxa(taxa_sums(phyloseq_obj) > 0, phyloseq_obj) ## Removes taxa not at least present in one sample

### Melt the phyloseq object into a dataframe with one row per feature ID and per sample
physeq_df <- psmelt(phyloseq_obj)

### Write this table in a tsv format
write.table(x = physeq_df, file = phyloseq_melted_table, append = F, sep = "\t", eol = "\n", row.names = F, col.names = T )
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[["phyloseq_object"]]

## Output
rarefied_phyloseq <- snakemake@output[["phyloseq_object"]]

## Parameters
rarefy_value <- snakemake@params[["rarefaction_value"]]

## Load libraries
library(vegan);packageVersion("vegan")
library(phyloseq);packageVersion("phyloseq")
library(dplyr);packageVersion("dplyr")

## Set seed for reproducibility
set.seed(1)

## Import the phyloseq object
phyloseq_obj <- readRDS(phyloseq_object)

## Rarefy the count table
otu_table(phyloseq_obj) <- t(rrarefy(t(otu_table(phyloseq_obj)), sample = as.numeric(rarefy_value)))

## Compute alpha diversity indexes after this rarefaction
### Remove the previously computed values, in case
#sample_data(phyloseq_obj) <- select(sample_data(phyloseq_obj), -c(Observed, Chao1, se.chao1, ACE, se.ACE, Shannon, Simpson, InvSimpson, Observed_min_1))
drop <- c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson", "Fisher", "Observed_min_1")
sample_data(phyloseq_obj) <- sample_data(phyloseq_obj)[,!(names(sample_data(phyloseq_obj)) %in% drop)]


### Add alpha diversity indexes to metadata
alpha_div <- estimate_richness(physeq = phyloseq_obj, split = TRUE, c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson"))
sample_data(phyloseq_obj) <- cbind(sample_data(phyloseq_obj),alpha_div)

### In addition, add the Observed over 1% metric
#### Keep the IDs of the taxa above 1%
physeqrF <- filter_taxa(physeq = phyloseq_obj, function(x) mean(x) > 0.01, FALSE)
#### Keep only those
physeqaF <- prune_taxa(physeqrF,phyloseq_obj)
#### Calculate the Observed index
alpha_div_1 <- estimate_richness(physeq = physeqaF, split = TRUE, measure = "Observed")
#### Rename this index
colnames(alpha_div_1) <- paste0(colnames(alpha_div_1), ("_min_1"))
#### Again bind this new column
sample_data(phyloseq_obj) <- cbind(sample_data(phyloseq_obj),alpha_div_1)

phyloseq_obj <- prune_taxa(taxa_sums(phyloseq_obj) > 0, phyloseq_obj) ## Removes taxa not at least present in one sample


# Write the phyloseq object
saveRDS(object = phyloseq_obj, file = rarefied_phyloseq)
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

## Input
phyloseq_object <- snakemake@input[["phyloseq_object"]]

## Output
phyloseq_norm <- snakemake@output[["phyloseq_norm"]]

## Parameters
# lower-case the normalization name to ease matching
normalization <- tolower(snakemake@params[["normalization"]])
min_abundance <- as.numeric(snakemake@params[["min_abundance_value"]])
min_prevalence <- as.numeric(snakemake@params[["min_prevalence_value"]])
print(paste("normalization setting is", normalization))

## Load needed libraries
library("phyloseq");packageVersion("phyloseq")
library("vegan");packageVersion("vegan")
library("edgeR");packageVersion("edgeR")
library("metagenomeSeq");packageVersion("metagenomeSeq")

## Import phyloseq
physeq <- readRDS(phyloseq_object)

## Recover OTU table
OTU <- data.frame(otu_table(physeq), check.names = FALSE)

## List methods computed by vegan
vegan_methods <- c("total", "max", "freq", "normalize", "range", "pa", "chi.square", "hellinger" ,"log")

## CLR must be computed sample-wise (per sample, on rows). Since the phyloseq object has taxa_are_rows = TRUE, it is computed on t() of the OTU table
if (normalization == "clr"){
  print("clr normalization")
  OTU1 <- OTU + 1
  reads_trfs <- t(chemometrics::clr(t(OTU1)))

## Vegan decostand normalizations (with double t() to take in account the transposed structure of the counts used here, modified from https://www.rdocumentation.org/packages/vegan/versions/2.4-2/topics/decostand)
}else if (normalization %in% vegan_methods){
  print(paste(normalization, "normalization by vegan"))
  reads_trfs <- t(vegan::decostand(t(OTU), method = normalization))

## Percent normalization with basic R
}else if (normalization == "pct"){
  print("Percent normalization with base R")
  reads_trfs <- sapply(OTU, function(x) 100*x/sum(x))
  rownames(reads_trfs) <- rownames(OTU)

## CSS normalization with metagenomeseq (modified form https://bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/inst/doc/metagenomeSeq.pdf )
}else if (normalization == "css"){
  print("css normalization by metagenomeseq")
  OTU_mtgs <- metagenomeSeq::newMRexperiment(counts = OTU, phenoData = NULL, featureData = NULL, libSize = NULL, normFactors = NULL)
  p <- metagenomeSeq::cumNormStat(OTU_mtgs)
  OTU_norm <-  metagenomeSeq::cumNorm(OTU_mtgs, p = p)
  reads_trfs = as.data.frame(metagenomeSeq::MRcounts(OTU_norm, norm = TRUE, log = FALSE))

## TMM normalization with edgeR (modified from https://joey711.github.io/phyloseq-extensions/edgeR.html)
} else if (normalization == "tmm"){
  print("TMM normalization by edgeR")
  OTU1 <- OTU + 1
  group <- rownames(get_variable(physeq))
  tax <- tax_table(physeq, errorIfNULL=FALSE)
  y <- edgeR::DGEList(counts=OTU1, group=group, genes=tax, remove.zeros = TRUE)
  z = edgeR::calcNormFactors(y, method="TMM")
  reads_trfs <- data.frame(z$counts, check.names = FALSE)
  #reads_trfs[reads_trfs==0.5] <- 0

} else if (normalization == "none"){
  print('No normalization (normalization = "none")')
  reads_trfs <- OTU

}else{
  stop(paste('"normalization" was', normalization, '. Must be one of: "clr", "pct", "css", "tmm", "total", "max", "freq", "normalize", "range", "pa", "chi.square", "hellinger" or "log"'))
}

## Filter based on counts and prevalence
### Filter based on the cumulated abundance
if(is.numeric(min_abundance)){
  print(paste("filter based on cumulated min_abundance:", min_abundance))
  filtered_data <- reads_trfs[,colSums(reads_trfs) > min_abundance, drop = FALSE]
}else if (min_abundance == "none"){
  print("NO filtering based on cumulated abundance")
  filtered_data <- reads_trfs
}else{
  stop(paste(min_abundance, 'must be a numeric value or "none"'))
}

### Filter based on prevalence
if(is.numeric(min_prevalence)){
  print(paste("filter based on prevalence:",min_prevalence))
  prevalence_cutoff <- (min_prevalence/100) * nsamples(physeq)
  print(paste("Features must be found in more than", prevalence_cutoff, "samples to be retained"))
  filtered_data <- filtered_data[,colSums(filtered_data != 0) > prevalence_cutoff, drop = FALSE]
}else if (min_prevalence == "none"){
  print("NO filtering based on prevalence")
}else{
  stop(paste(min_prevalence, 'must be a numeric value or "none"'))
}

## Repopulate the physeq object (same approach as for the taxa filter)
print(1)
physeq_filtered <- physeq
sample_names(physeq_filtered)
head(filtered_data)
otu_table(physeq_filtered) <- otu_table(round(filtered_data, digits = 0), taxa_are_rows = TRUE)
filtered_taxa <- prune_taxa(taxa_sums(physeq_filtered) > 0, physeq_filtered) ## Removes taxa not at least present in one sample

print(2)

## Recompute alpha diversity indexes after this filtration
### Remove the previously computed values
#sample_data(filtered_taxa) <- select(sample_data(filtered_taxa), -c(Observed, Chao1, se.chao1, ACE, se.ACE, Shannon, Simpson, InvSimpson, Fisher, Observed_min_1))
drop <- c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson", "Fisher", "Observed_min_1")
sample_data(filtered_taxa) <- sample_data(filtered_taxa)[,!(names(sample_data(filtered_taxa)) %in% drop)]


### Add alpha diversity indexes to metadata
alpha_div <- estimate_richness(physeq = filtered_taxa, split = TRUE, c("Observed", "Chao1", "se.chao1", "ACE", "se.ACE", "Shannon", "Simpson", "InvSimpson"))
sample_data(filtered_taxa) <- cbind(sample_data(filtered_taxa),alpha_div)


### In addition, add the Observed over 1% metric
#### Keep the IDs of the taxa above 1%
physeqrF <- filter_taxa(physeq = filtered_taxa, function(x) mean(x) > 0.01, FALSE)
#### Keep only those
physeqaF <- prune_taxa(physeqrF,filtered_taxa)
#### Calculate the Observed index
alpha_div_1 <- estimate_richness(physeq = physeqaF, split = TRUE, measure = "Observed")
#### Rename this index
colnames(alpha_div_1) <- paste0(colnames(alpha_div_1), ("_min_1"))
#### Again bind this new column
sample_data(filtered_taxa) <- cbind(sample_data(filtered_taxa),alpha_div_1)


## Write the new phyloseq object
saveRDS(object = filtered_taxa, file = phyloseq_norm)
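
## To illustrate the double-transpose used above for the vegan methods (a toy example, not pipeline code): phyloseq stores taxa as rows while vegan::decostand normalises row-wise per sample, hence the t() before and after:

    library(vegan)

    ## Toy taxa-by-samples matrix, as stored in the phyloseq object (taxa_are_rows = TRUE)
    otu <- matrix(c(10, 0, 5,
                     2, 8, 1),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("ASV_1", "ASV_2"), c("S1", "S2", "S3")))

    ## Transpose so samples are rows, normalise, transpose back to taxa-by-samples
    norm <- t(decostand(t(otu), method = "hellinger"))

    colSums(norm^2)   # each sample sums to 1 on the squared (Hellinger) scale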
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    count_table <- snakemake@input[["count_table"]]
    print(count_table)
    meta <- snakemake@input[["meta"]]

## Output
    transposed_table <- snakemake@output[["transposed_table"]]
    merged_meta <- snakemake@output[["merged_meta"]]

## Load the count table and the metadata
    table <- read.table(count_table,  sep = "\t", header = TRUE)
    meta <- read.delim(meta,  sep = "\t", header = TRUE, na.strings = "NA")

## Transpose table
    table_t <- t(table)

## Bind tables
    binded_tables <- cbind(meta, table_t)

### Write these tables to .tsv files since they are the basis for downstream analyses and slow to compute
    write.table(x = table_t, file = transposed_table, append = F, quote = F, sep = "\t", eol = "\n", row.names = F, col.names = T )
    write.table(x = binded_tables, file = merged_meta, append = F, quote = F, sep = "\t", eol = "\n", row.names = F, col.names = T )
shell:
    """
    run_treeshrink.py -o $(dirname {output[0]}) -q {params[quantile]} -t {input}
    """
script:
    "scripts/raw_to_processed_reads_stats.R"
script:
    "scripts/raw_to_processed_reads_plot.R"
script:
    "scripts/raw_to_tax_filtered_reads_stats.R"
script:
    "scripts/raw_to_processed_reads_plot.R"
script:
    "scripts/KRONA_plots.R"
script:
    "scripts/rarefaction_curve.R"
script:
    "scripts/create_biom_from_count_table.R"
shell:
    '''
    qiime tools import \
    --input-path {input[0]} \
    --type 'FeatureTable[Frequency]' \
    --input-format BIOMV100Format \
    --output-path {output[0]}
    '''
shell:
    '''
    qiime feature-table summarize \
    --i-table {input[0]} \
    --o-visualization {output[0]}
    '''
shell:
    '''
    awk '/^>/ {{print($0)}}; /^[^>]/ {{print(toupper($0))}}' {input[0]} > {output[0]}
    '''
shell:
    '''
    qiime tools import \
    --input-path {input[0]} \
    --output-path {output[0]} \
    --type 'FeatureData[Sequence]'
    '''
shell:
    '''
    qiime phylogeny align-to-tree-mafft-fasttree \
      --p-n-threads {threads} \
      --i-sequences {input[0]} \
      --o-alignment {output[aligned]} \
      --o-masked-alignment {output[masked]} \
      --o-tree {output[unrooted]} \
      --o-rooted-tree {output[rooted]}
    '''
shell:
    '''
    qiime tools export --input-path {input} --output-path $(dirname {output[0]})
    '''
shell:
    '''
    qiime feature-table tabulate-seqs \
    --i-data {input[0]} \
    --o-visualization {output[0]}
    '''
shell:
    '''
    qiime tools import \
     --type FeatureData[Taxonomy] \
     --input-path {input[0]} \
     --input-format HeaderlessTSVTaxonomyFormat \
     --output-path {output[0]}
    '''
shell:
    '''
    qiime metadata tabulate \
        --m-input-file {input} \
        --o-visualization {output}
    '''
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    count_table <- snakemake@input[["count_table"]]

## Output
    biom_count <- snakemake@output[["biom_count"]]

## Library
    library(biomformat);packageVersion("biomformat")
    library(phyloseq);packageVersion("phyloseq")

## Reformat
    asv_tab <- read.table(file = count_table)
    otu <- as(otu_table(asv_tab, taxa_are_rows = TRUE),"matrix")
    otu_biom <- make_biom(data=otu)

## Write
    write_biom(x = otu_biom, biom_file = biom_count)
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    phyloseq_melted_table <- snakemake@input[["phyloseq_melted_table"]]

## Output
    output_folder <- snakemake@output[["output"]]
    output_folder <- (dirname(output_folder)[1])

## Parameters
    sample_label <- snakemake@params[["sample_label"]]
    grouping_column <- snakemake@params[["grouping_column"]]
    grouping_filter_column_value <- snakemake@params[["grouping_col_value"]]

## Load needed library
    library(dplyr);packageVersion("dplyr")

## Read data
melted_dataframe<- read.csv(file.path(phyloseq_melted_table), header = TRUE, sep = "\t")

## Create KRONA
    df <- filter(melted_dataframe, melted_dataframe[[grouping_column]] == grouping_filter_column_value)
    df <- filter(df, df[["Abundance"]] != 0)

    df <- df[, c("Abundance", sample_label, "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species", "OTU")]
    df <- as.data.frame(unclass(df))
    df[, 2] <- gsub(" |\\(|\\)", "", df[, 2])
    df[, 2] <- as.factor(df[, 2])
    dir.create(file.path(output_folder,"/",grouping_filter_column_value))
    for (lvl in levels(df[, 2])) {
      write.table(unique(df[which(df[, 2] == lvl), -2]), file = paste0(output_folder,"/",grouping_filter_column_value, "/", lvl, "taxonomy.txt"), sep = "\t", row.names = F, col.names = F, na = "", quote = F)
    }

# Since negative controls or other samples may contain no reads, we check whether the whole sample group is empty: if so, we write a placeholder .html file stating that no reads are available; otherwise we generate the Krona plot.
if (all(df[,2] == 0) == 1){
     dir.create(file.path(output_folder,"/",grouping_filter_column_value))
  cat(format(Sys.time(), "%a %b %d %Y %X TZ(%z)")," ", "All samples have 0 abundance for this sample group.",file= paste0(output_folder,"/",grouping_filter_column_value,".html"))

} else {

     dir.create(file.path(output_folder,"/",grouping_filter_column_value))
    for (lvl in levels(df[, 2])) {
      write.table(unique(df[which(df[, 2] == lvl), -2]), file = paste0(output_folder,"/",grouping_filter_column_value, "/", lvl, "taxonomy.txt"), sep = "\t", row.names = F, col.names = F, na = "", quote = F)
      }


    krona_args <- paste0(output_folder,"/", grouping_filter_column_value, "/", levels(df[,2]), "taxonomy.txt,", levels(df[, 2]), collapse = " ")
    output <- paste0(output_folder,"/",grouping_filter_column_value,".html")
    system(paste("ktImportText", krona_args, "-o", output, sep = " "))


}
log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")


## Input
phyloseq_object <- snakemake@input[["phyloseq_object"]]

## Output
rarefaction_curve <- snakemake@output[["rarefaction_curve"]]

## Parameters
color_factor <- snakemake@params[["grouping_column"]]

## Load libraries
library('phyloseq')
library('ggplot2')
library('plyr') # ldply
library('reshape2') # melt
library('vegan')

## Load the phyloseq object
phyloseq_obj <- readRDS(phyloseq_object)


## Create a modified version of phyloseq's estimate_richness() function that accepts sample names consisting only of numbers

modified_estimated_richness <- function (physeq, split = TRUE, measures = NULL)
{
  if (!any(otu_table(physeq) == 1)) {
    warning("The data you have provided does not have\n",
            "any singletons. This is highly suspicious. Results of richness\n",
            "estimates (for example) are probably unreliable, or wrong, if you have already\n",
            "trimmed low-abundance taxa from the data.\n", "\n",
            "We recommended that you find the un-trimmed data and retry.")
  }
  if (!split) {
    OTU <- taxa_sums(physeq)
  }
  else if (split) {
    OTU <- as(otu_table(physeq), "matrix")
    if (taxa_are_rows(physeq)) {
      OTU <- t(OTU)
    }
  }
  renamevec = c("Observed", "Chao1", "ACE", "Shannon", "Simpson",
                "InvSimpson", "Fisher")
  names(renamevec) <- c("S.obs", "S.chao1", "S.ACE", "shannon",
                        "simpson", "invsimpson", "fisher")
  if (is.null(measures)) {
    measures = as.character(renamevec)
  }
  if (any(measures %in% names(renamevec))) {
    measures[measures %in% names(renamevec)] <- renamevec[names(renamevec) %in%
                                                            measures]
  }
  if (!any(measures %in% renamevec)) {
    stop("None of the `measures` you provided are supported. Try default `NULL` instead.")
  }
  outlist = vector("list")
  estimRmeas = c("Chao1", "Observed", "ACE")
  if (any(estimRmeas %in% measures)) {
    outlist <- c(outlist, list(t(data.frame(estimateR(OTU), check.names = FALSE))))
  }
  if ("Shannon" %in% measures) {
    outlist <- c(outlist, list(shannon = diversity(OTU, index = "shannon")))
  }
  if ("Simpson" %in% measures) {
    outlist <- c(outlist, list(simpson = diversity(OTU, index = "simpson")))
  }
  if ("InvSimpson" %in% measures) {
    outlist <- c(outlist, list(invsimpson = diversity(OTU, index = "invsimpson")))
  }
  if ("Fisher" %in% measures) {
    fisher = tryCatch(fisher.alpha(OTU, se = TRUE), warning = function(w) {
      warning("phyloseq::estimate_richness: Warning in fisher.alpha(). See `?fisher.fit` or ?`fisher.alpha`. Treat fisher results with caution")
      suppressWarnings(fisher.alpha(OTU, se = TRUE)[, c("alpha",
                                                        "se")])
    })
    if (!is.null(dim(fisher))) {
      colnames(fisher)[1:2] <- c("Fisher", "se.fisher")
      outlist <- c(outlist, list(fisher))
    }
    else {
      outlist <- c(outlist, Fisher = list(fisher))
    }
  }


  out = do.call("cbind", outlist)
  namechange = intersect(colnames(out), names(renamevec))
  colnames(out)[colnames(out) %in% namechange] <- renamevec[namechange]
  colkeep = sapply(paste0("(se\\.){0,}", measures), grep, colnames(out),
                   ignore.case = TRUE)
  out = out[, sort(unique(unlist(colkeep))), drop = FALSE]
  out <- as.data.frame(out)
  return(out)
}



# Rarefaction curve function adapted from: https://github.com/joey711/phyloseq/issues/143

set.seed(42)

calculate_rarefaction_curves <- function(psdata, measures, depths) {

  estimate_rarified_richness <- function(psdata, measures, depth) {
    if(max(sample_sums(psdata)) < depth) return()
    psdata <- prune_samples(sample_sums(psdata) >= depth, psdata)

    rarified_psdata <- rarefy_even_depth(psdata, depth, verbose = FALSE)

    alpha_diversity <- modified_estimated_richness(rarified_psdata, measures = measures)

    # as.matrix forces the use of melt.array, which includes the Sample names (rownames)
    molten_alpha_diversity <- melt(as.matrix(alpha_diversity), varnames = c('Sample', 'Measure'), value.name = 'Alpha_diversity')

    molten_alpha_diversity
  }

  names(depths) <- depths # this enables automatic addition of the Depth to the output by ldply
  rarefaction_curve_data <- ldply(depths, estimate_rarified_richness, psdata = psdata, measures = measures, .id = 'Depth', .progress = ifelse(interactive(), 'text', 'none'))

  # convert Depth from factor to numeric
  rarefaction_curve_data$Depth <- as.numeric(levels(rarefaction_curve_data$Depth))[rarefaction_curve_data$Depth]

  rarefaction_curve_data
}

rarefaction_curve_data <- calculate_rarefaction_curves(psdata = phyloseq_obj, measures = c("Observed", "Chao1", "ACE", "Shannon", "Simpson", "InvSimpson"), depth = rep(c(1, 10, 100, 1000, 1:100 * 10000), each = 10))
summary(rarefaction_curve_data)


####

rarefaction_curve_data_summary <- ddply(rarefaction_curve_data, c('Depth', 'Sample', 'Measure'), summarise, Alpha_diversity_mean = mean(Alpha_diversity), Alpha_diversity_sd = sd(Alpha_diversity))


####

rarefaction_curve_data_summary_verbose <- merge(rarefaction_curve_data_summary, data.frame(sample_data(phyloseq_obj)), by.x = 'Sample', by.y = 'row.names')

rarefaction_curve_data_summary_verbose

p <- ggplot(
  data = rarefaction_curve_data_summary_verbose,
  mapping = aes(
    x = Depth,
    y = Alpha_diversity_mean,
    ymin = Alpha_diversity_mean - Alpha_diversity_sd,
    ymax = Alpha_diversity_mean + Alpha_diversity_sd,
    colour = get(color_factor),
    group = get("Sample"))) +
      labs(col = "Sample type") +
      geom_line(size = 0.15) +
      #geom_pointrange(size = 0.2) +
      facet_wrap(facets = ~ Measure, scales = 'free_y')


ggsave(filename = rarefaction_curve,  plot = p, width = 10)
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")


## Input
    raw_to_filtered_reads_stats_path <- snakemake@input[["raw_to_filtered_reads_stats"]]
    raw_to_filtered_reads_stats <- read.table(file = raw_to_filtered_reads_stats_path, sep = "\t", header = TRUE)
    metadata_path <- snakemake@input[["Metadata_table"]]
    metadata <- read.delim(file = metadata_path, sep = "\t", header = TRUE)
    multi_QC_report_path <- snakemake@input[["multi_QC_report_path"]]
    multi_QC_report <- read.table(multi_QC_report_path, header = T)

## Ouput
    reads_plot_with_filtered <- snakemake@output[["reads_plot_with_filtered"]]

## Parameters
    x_axis_column <- snakemake@params[["sample_label"]]

## Load needed libraries
    library("phyloseq");packageVersion("phyloseq")
    library("data.table");packageVersion("data.table")
    library("dplyr");packageVersion("dplyr")
    library("ggplot2");packageVersion("ggplot2")
    library("RColorBrewer"); packageVersion("RColorBrewer")


## Set theme
theme_set(theme_bw())


    ### Record data on the distribution of number of reads (useful later to scale plots axis)
        smin <- min(multi_QC_report$FastQC_mqc.generalstats.fastqc.total_sequences)
        print(smin)
        smean <- mean(multi_QC_report$FastQC_mqc.generalstats.fastqc.total_sequences)
        print(smean)
        smax <- max(multi_QC_report$FastQC_mqc.generalstats.fastqc.total_sequences)
        print(smax)

    ### Order the x axis as in the metadata_table
        raw_to_filtered_reads_stats[[x_axis_column]] = factor(raw_to_filtered_reads_stats[[x_axis_column]], levels = unique(metadata[[x_axis_column]]), ordered = TRUE)
        #raw_to_filtered_reads_stats[[grouping_column]] = factor(raw_to_filtered_reads_stats[[grouping_column]], levels = unique(metadata[[grouping_column]]), ordered = TRUE)

    ### Order the read count categories logically
        ordered_factors <- c("Tax filtered reads", "Reads processing", "Maintained reads")
        raw_to_filtered_reads_stats$Reads <- factor(raw_to_filtered_reads_stats$Reads, ordered = TRUE, levels = ordered_factors)

    ### Generate a color palette
        getPalette <- colorRampPalette(brewer.pal(n=9, "Set3"))
        ColList <- ordered_factors
        ColPalette = getPalette(length(ColList))
        names(ColPalette) = ColList
        colors_palette <- ColPalette


    overall_reads_barplot <- ggplot(raw_to_filtered_reads_stats, aes(x = get(x_axis_column), y = Count, fill = Reads)) +
        geom_col() +
        scale_fill_manual(values = colors_palette) +
        labs(x= x_axis_column,  y ="Reads") +
        ggtitle("Overall read counts") +
        scale_x_discrete(drop = TRUE) + # Drop unused levels; use drop = FALSE to keep all groups, including empty ones
        scale_y_continuous(labels = scales::comma, limits = c(0,smax)) +
        theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 7)) #+
        # guides(fill=guide_legend(title=filling_column))


    ### Save it
        w <- 7 + 0.07 * (length(unique(raw_to_filtered_reads_stats[[x_axis_column]]))) 
        ggsave(overall_reads_barplot, filename = reads_plot_with_filtered, width = w, height = 5)
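For reference, here is a small hypothetical example of the long-format table this plotting script consumes (column names taken from the code above, values invented). geom_col() stacks the categories so that each bar sums to the raw read count reported by FastQC; in the pipeline, the x axis column is taken from the sample_label parameter, so it can be any metadata column rather than Sample.

## Illustrative input only: one row per sample and per read category.
example_stats <- data.frame(
  Sample = rep(c("S1", "S2"), each = 3),
  Count  = c(52000, 8000, 1500, 61000, 9500, 700),
  Reads  = rep(c("Maintained reads", "Reads processing", "Tax filtered reads"), times = 2))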
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    phyloseq_object <- snakemake@input[["phyloseq_object"]]
    multi_QC_report_path <- snakemake@input[["multi_QC_report_path"]]
    multi_QC_report <- read.table(multi_QC_report_path, header = T)

## Output
    raw_to_filtered_reads_stats <- snakemake@output[["raw_to_filtered_reads_stats"]]

    ## Load needed libraries
    library("phyloseq");packageVersion("phyloseq")
    library("data.table");packageVersion("data.table")
    library("dplyr");packageVersion("dplyr")

## Load the phyloseq object
    phyloseq_obj <- readRDS(phyloseq_object)

## Create a table with the number of raw reads and filtered reads

### Filtered reads
    reads_counts_df <- data.table(as(sample_data(phyloseq_obj), "data.frame"), TotalReads = sample_sums(phyloseq_obj), keep.rownames = TRUE)
    setnames(reads_counts_df, "rn", "Sample") # Rename the first column of this new data frame to "Sample"

### Raw reads
    multi_QC_report <- multi_QC_report %>% filter(grepl("R1", Sample)) %>% select(c("Sample","FastQC_mqc.generalstats.fastqc.total_sequences")) # keep only the total raw sequence counts; each sample appears twice (R1, R2), so keep only R1
    multi_QC_report$Sample <- gsub(x=multi_QC_report$Sample, pattern = "_R1", replacement = "") # Remove the "_R1" suffix

### Join the two data frames: the filtered reads from phyloseq and the raw reads from the QC report
    reads_counts_df_with_raw <- merge(multi_QC_report, reads_counts_df, by=c("Sample"))

### Calculate the difference between raw and filtered read counts
    reads_counts_df_with_raw <- mutate(reads_counts_df_with_raw, filtered = FastQC_mqc.generalstats.fastqc.total_sequences - TotalReads)

### Keep two versions of the table: one with the maintained (filtered) reads in "Count", and one with the raw-minus-filtered difference in the same column
    reads_counts_df_raw_count_only <- reads_counts_df_with_raw %>% select(-"filtered") %>% rename("Count" = "TotalReads")
    reads_counts_df_raw_count_only$Reads <- "Maintained reads"
    reads_counts_df_raw_filtered_only <- reads_counts_df_with_raw %>% select(-"TotalReads") %>% rename("Count" = "filtered")
    reads_counts_df_raw_filtered_only$Reads <- "Reads processing"

### Bind the rows of the two tables just prepared
    melted_reads_count <- rbind(reads_counts_df_raw_count_only, reads_counts_df_raw_filtered_only)

### Write this table
    write.table(x = melted_reads_count, file = raw_to_filtered_reads_stats, sep = "\t", col.names = NA, row.names = TRUE)
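A small consistency check, not part of the pipeline, can confirm that the two categories written above stack back to the raw FastQC count per sample; a minimal sketch using the melted_reads_count object from this script:

## Illustrative check only: "Maintained reads" + "Reads processing" should
## equal the raw read count reported by FastQC for every sample.
check <- melted_reads_count %>%
  group_by(Sample) %>%
  summarise(stacked = sum(Count),
            raw     = unique(FastQC_mqc.generalstats.fastqc.total_sequences))
stopifnot(all(check$stacked == check$raw))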
    log <- file(snakemake@log[[1]], open="wt")
    sink(log)
    sink(log, type="message")

## Input
    read_filtering <- snakemake@input[["read_filtering"]]
    taxonomic_filtering <- snakemake@input[["taxonomic_filtering"]]
    multi_QC_report_path <- snakemake@input[["multi_QC_report_path"]]

## Output
    raw_to_filtered_reads_stats <- snakemake@output[["raw_to_filtered_reads_stats"]]


## Load needed libraries
    library("phyloseq");packageVersion("phyloseq")
    library("data.table");packageVersion("data.table")
    library("dplyr");packageVersion("dplyr")


## Read files
    read_filtering_physeq <- readRDS(read_filtering)
    tax_filtering_physeq <- readRDS(taxonomic_filtering)
    multi_QC_report <- read.table(multi_QC_report_path, header = T)

## Create a table with the number of raw reads and filtered reads
### Raw reads
    multi_QC_report <- multi_QC_report %>% filter(grepl("R1", Sample)) %>% select(c("Sample","FastQC_mqc.generalstats.fastqc.total_sequences")) # keep only the total raw sequence counts; each sample appears twice (R1, R2), so keep only R1
    multi_QC_report$Sample <- gsub(x=multi_QC_report$Sample, pattern = "_R1", replacement = "") # Remove the "_R1" suffix from the label

### Filtered reads
    filtered_reads_counts_df <- data.table(as(sample_data(read_filtering_physeq), "data.frame"), TotalReads_processing = sample_sums(read_filtering_physeq), keep.rownames = TRUE)
    setnames(filtered_reads_counts_df, "rn", "Sample") # Rename the first column of this new data frame to "Sample"

### Taxonomically filtered reads
    tax_filtered_reads_counts_df <- data.table(as(sample_data(tax_filtering_physeq), "data.frame"), TotalReads_taxonomy = sample_sums(tax_filtering_physeq), keep.rownames = TRUE)
    setnames(tax_filtered_reads_counts_df, "rn", "Sample") # Rename the first column of this new data frame to "Sample"
    tax_filtered_reads_counts_df <- select(tax_filtered_reads_counts_df, c("TotalReads_taxonomy","Sample"))

### Join the columns of interest
    merged_columns <- merge(filtered_reads_counts_df, multi_QC_report, by=c("Sample"))
    merged_columns <- merge(merged_columns, tax_filtered_reads_counts_df, by = c("Sample"))

### Calculate the differences
    merged_columns$reads_processing <- merged_columns$FastQC_mqc.generalstats.fastqc.total_sequences - merged_columns$TotalReads_processing
    merged_columns$tax_filtration <- merged_columns$TotalReads_processing - merged_columns$TotalReads_taxonomy


### Keep three versions of the table, where "Count" is respectively the final number of reads, the reads lost during processing, and the reads removed by taxonomic filtering.
#### Keep the final count of reads; remove the two other columns
    final_reads_count <- merged_columns %>% rename("Count" = "TotalReads_taxonomy")
    final_reads_count$Reads <- "Maintained reads"
    final_reads_count[ ,c("reads_processing", "tax_filtration")] <- list(NULL)

#### Keep the reads filtered during processing; remove the two other columns
    processing_reads_count <- merged_columns %>% rename("Count" = "reads_processing")
    processing_reads_count$Reads <- "Reads processing"
    processing_reads_count[ ,c("tax_filtration", "TotalReads_taxonomy")] <- list(NULL)

#### Keep the reads removed by taxonomic filtering; remove the two other columns
    tax_filtered_reads_count <- merged_columns %>% rename("Count" = "tax_filtration")
    tax_filtered_reads_count$Reads <- "Tax filtered reads"
    tax_filtered_reads_count[ ,c("reads_processing", "TotalReads_taxonomy")] <- list(NULL)

#### Bind the rows of the three tables just prepared
    melted_reads_count <- rbind(final_reads_count, tax_filtered_reads_count)
    melted_reads_count <- rbind(melted_reads_count, processing_reads_count)


### Write this table
    write.table(x = melted_reads_count, file = raw_to_filtered_reads_stats, sep = "\t", col.names = NA, row.names = TRUE)
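One more hypothetical check (not part of the pipeline) worth noting: the category labels written here must match the ordered_factors used by the plotting script above, otherwise its factor conversion would yield NA levels.

## Illustrative check only: labels must match the plotting script's ordered_factors.
stopifnot(all(melted_reads_count$Reads %in%
              c("Tax filtered reads", "Reads processing", "Maintained reads")))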
shell:
    '''
    picrust2_pipeline.py \
        --in_traits COG,EC,KO,PFAM,TIGRFAM \
        -s {input[seqs]} \
        -i {input[table]} \
        -o {output}/output \
        -p {threads}
    '''
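The rule header for this shell fragment is not shown on this page. A minimal sketch of how such a block could be wrapped in a Snakemake rule; the rule name, file paths and the added mkdir are assumptions, only the picrust2_pipeline.py command itself comes from the snippet above:

rule picrust2:
    # Sketch only: input/output paths are illustrative, not the pipeline's actual values.
    input:
        seqs = "results/dada2/representative_seqs.fasta",
        table = "results/dada2/feature_table.biom"
    output:
        directory("results/PICRUSt2")
    threads: 4
    shell:
        '''
        mkdir -p {output}
        picrust2_pipeline.py \
            --in_traits COG,EC,KO,PFAM,TIGRFAM \
            -s {input[seqs]} \
            -i {input[table]} \
            -o {output}/output \
            -p {threads}
        '''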

Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://rsp4abm.readthedocs.io/en/latest/
Name: microbiome16s_pipeline
Version: v.0.9.18
Copyright: Public Domain
License: MIT License