Open-source bioinformatics pipeline for the preprocessing of raw sequencing data.



Natrix is an open-source bioinformatics pipeline for the preprocessing of raw sequencing data. The need for a scalable, reproducible workflow for the processing of environmental amplicon data led to the development of Natrix. It is divided into quality assessment, read assembly, dereplication, chimera detection, split-sample merging, ASV or OTU generation and taxonomic assessment. The pipeline is written in Snakemake (Köster and Rahmann 2018), a workflow management engine for the development of data analysis workflows. Snakemake ensures reproducibility of a workflow by automatically deploying the dependencies of workflow steps (rules) and scales seamlessly to different computing environments such as servers, computer clusters or cloud services. While Natrix has only been tested with 16S and 18S amplicon data, it should also work for other kinds of sequencing data. The pipeline contains separate rules for each step, and each rule that has additional dependencies has a separate conda environment that is automatically created when the pipeline is started for the first time. The encapsulation of rules and their dependencies allows hassle-free sharing of rules between workflows.

DAG of an example workflow: each node represents a rule instance to be executed. The direction of each edge represents the order in which the rules are executed, with dashed lines showing rules that are exclusive to the OTU variant and dotted lines showing rules exclusive to the ASV variant of the workflow. Disjoint paths in the DAG can be executed in parallel. Below is a schematic representation of the main steps of the pipeline; the color coding indicates which rules belong to which main step.

If you use Natrix, please cite: Welzel, M., Lange, A., Heider, D. et al. Natrix: a Snakemake-based workflow for processing, clustering, and taxonomically assigning amplicon sequencing reads. BMC Bioinformatics 21, 526 (2020). https://doi.org/10.1186/s12859-020-03852-4

Dependencies

Conda can be downloaded as part of the Anaconda or Miniconda platforms (Python 3.7). We recommend installing Miniconda3. On Linux, you can get it with:

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh

GNU screen can be found in the repositories of most Linux distributions:

  • Debian / Ubuntu based: apt-get install screen

  • RHEL based: yum install screen

  • Arch based: pacman -S screen

All other dependencies will be automatically installed using conda environments and can be found in the corresponding environment.yaml files in the envs folder and the natrix.yaml file in the root directory of the pipeline.

Getting Started

To install Natrix, you'll need the open-source package management system conda and, if you want to try Natrix using the accompanying pipeline.sh script, you'll need GNU screen. After cloning this repository to a folder of your choice, it is recommended to create a general natrix conda environment with the accompanying natrix.yaml. In the main folder of the cloned repository, execute the following command:

$ conda env create -f natrix.yaml

This will create a conda environment containing all dependencies for Snakemake itself.

Natrix comes with an example primertable (example_data.csv), an example configfile (example_data.yaml) and an example amplicon dataset in the folder example_data.

To try out Natrix using the example data, type in the following command:

$ ./pipeline.sh
Please enter the name of the project
$ example_data

The pipeline will then start a screen session using the project name (here, example_data) as the session name and will begin downloading dependencies for the rules. To detach from the screen session, press Ctrl+a, d (first press Ctrl+a and then d). To reattach to a running screen session, type in:

$ screen -r

When the workflow has finished, you can press Ctrl+a, k ( first press Ctrl+a and then k ). This will end the screen session and any processes that are still running.

Tutorial

Prerequisites: dataset, primertable and configuration file

The FASTQ files need to follow a specific naming convention:


samplename_unit_direction.fastq.gz

with:

  • samplename as the name of the sample, without special characters.

  • unit, the identifier for split-samples (A, B). If the split-sample approach is not used, the unit identifier is simply A.

  • direction, the identifier for forward (R1) and reverse (R2) reads of the same sample. If the reads are single-end, the direction identifier is R1.

A dataset should look like this (two samples, paired-end, no split-sample approach):

S2016RU_A_R1.fastq.gz
S2016RU_A_R2.fastq.gz
S2016BY_A_R1.fastq.gz
S2016BY_A_R2.fastq.gz
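
The naming convention can be checked programmatically. The following minimal Python sketch is not part of Natrix; the folder name example_data is only used for illustration. It splits each file name into the three components described above:

import re
from pathlib import Path

# samplename_unit_direction.fastq.gz, e.g. S2016RU_A_R1.fastq.gz
NAME_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<unit>[AB])_(?P<direction>R[12])\.fastq\.gz$")

for path in sorted(Path("example_data").glob("*.fastq.gz")):
    match = NAME_PATTERN.match(path.name)
    if match is None:
        print(f"{path.name} does not follow the naming convention")
    else:
        print(match.group("sample"), match.group("unit"), match.group("direction"))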

Besides the FASTQ data from the sequencing process, Natrix needs a primertable containing the sample names and, if they exist in the data, the length of the poly-N tails, the sequence of the primers and the barcodes used for each sample and direction. Apart from the sample names, all other information can be omitted if the data was already preprocessed or does not contain the corresponding subsequence. Natrix also needs a configuration file in YAML format, specifying parameter values for the tools used in the pipeline.

The primertable, configfile and the folder containing the FASTQ files all have to be in the root directory of the pipeline and have the same name (with their corresponding file extensions, so <project>.yaml, <project>.csv and the <project> folder containing the FASTQ files). The first configfile entry (filename) also needs to be the name of the project.

Running Natrix with the pipeline.sh script

If everything is configured correctly, you can start the pipeline by typing the following commands into your terminal emulator:

$ ./pipeline.sh
Please enter the name of the project
$ example_data

The pipeline will then start a screen session using the project name as the session name and will begin downloading dependencies for the rules. To detach from the screen session, press Ctrl+a, d (first press Ctrl+a and then d). To reattach to a running screen session, type in:

$ screen -r

When the workflow has finished, you can press Ctrl+a, k ( first press Ctrl+a and then k ). This will end the screen session and any processes that are still running.

Running Natrix with Docker or docker-compose

Pulling the image from Dockerhub

Natrix can be run inside a Docker container. For this, Docker has to be installed. Please have a look at the Docker website to find out how to install Docker and set up an environment if you have not used it before.

The easiest way to run the Docker container is to download the pre-built container from Docker Hub:

$ docker pull mw55/natrix

The docker container has all environments pre-installed, eliminating the need for downloading the environments during first-time initialization of the workflow. To connect to the shell inside the docker container, input the following command:

docker run -it --label natrix_container -v </host/database>:/app/database -v </host/results>:/app/results -v </host/input_folder>:/app/input -v </host/demultiplexed>:/app/demultiplexed mw55/natrix bash

/host/database is the full path to a local folder, in which you wish to install the database (SILVA or NCBI). This part is optional and only needed if you want to use BLAST for taxonomic assignment.

/host/results is the full path to a local folder in which the results of the workflow should be stored for the container to use.

/host/input_folder is the full path to a local folder in which the input (the project folder, the project .yaml and the project .csv) should be saved.

/host/demultiplexed is the full path to a local folder in which the demultiplexed data, or, if demultiplexing is turned off, the input data will be saved.

After you have connected to the container shell, you can follow the running Natrix manually tutorial.

Directly starting the workflow using docker-compose

Alternatively, you can start the workflow using the docker-compose command in the root directory of the workflow (it will pull the latest natrix image from Docker Hub):

$ PROJECT_NAME="<project>" docker-compose up (-d)

with project being the name of your project, e.g.:

$ PROJECT_NAME="example_data" docker-compose up # sudo might be needed

All output folders will be available at /srv/docker/natrix/. Make sure to copy your project folder, <project>.yaml and <project>.csv files to /srv/docker/natrix/input/, or create a new volume mapping using the docker-compose.yml file. By default, the container will wait until the input files exist. At first launch, the container will download the required databases to /srv/docker/natrix/databases/; this process might take a while.

Building the container yourself

If you prefer to build the Docker container yourself from the repository (for example, if you modified the source code of Natrix), the container can be built and started directly (the host folders have to be changed):

docker build . --tag natrix
docker run -it --label natrix_container -v </host/database>:/app/database -v </host/results>:/app/results -v </host/input_folder>:/app/input -v </host/demultiplexed>:/app/demultiplexed natrix bash # -v /host/database:/app/database is optional

You will then be at the command prompt inside the Docker container; from there you can follow the tutorial for running Natrix manually.

Running Natrix manually

If you prefer to run the preparation script and Snakemake manually, start by activating the natrix conda environment:

$ conda activate natrix

This is followed by running the preparation script, with project being the name of your project:

$ python3 create_dataframe.py <project>.yaml

This command will create the units.tsv file, containing the file information in a format that Natrix can use.

To start the main pipeline, type in:

snakemake --use-conda --configfile <project>.yaml --cores <cores>

with project being the name of your project and cores being the number of cores you want to allocate to Natrix.

Should the pipeline terminate prematurely (either because of an error or because it was deliberately stopped), running the command above again will resume the pipeline from the point at which it was terminated.

Cluster execution

Natrix can easily be run on a cluster system using either conda or the Docker container. Adding --cluster to the start command of Natrix, together with a command to submit jobs (e.g. qsub), is enough for most cluster computing environments. An example command would be:

$ snakemake -s <path/to/Snakefile> --use-conda --configfile <path/to/configfile.yaml> --cluster "qsub -N <project name> -S /bin/bash/" --jobs 100

Further qsub arguments, including brief explanations, can be found under qsub arguments. For additional commands that should be executed for each job, the argument --jobscript path/to/jobscript.sh can be used. A simple jobscript that sources the bashrc and activates the natrix environment before the execution of each job looks like this:

#!/usr/bin/env bash
source ~/.bashrc
conda activate natrix
{exec_job}

Instead of directly passing the cluster submission arguments to the snakemake command, it is also possible to write a profile that contains the cluster commands and arguments. The use of profiles allows the assignment of rule-specific hardware requirements. For example, the BLAST rule benefits from a large number of CPU cores, while other rules, like AmpliconDuo, do not. With profiles, rules can be assigned a low number of CPU cores by default, with rules like BLAST being assigned a larger number. This allows optimal usage of cluster resources and shorter waiting times. The creation of profiles is largely dependent on the software and hardware available on the cluster. With a profile, Natrix can simply be run with

$ snakemake -s <path/to/Snakefile> --profile myprofile 

The Snakemake documentation contains a tutorial for profile creation and the Snakemake profiles GitHub page contains example profiles for different cluster systems.

Output

After the workflow is finished, the original data can be found under Natrix-Pipeline/demultiplexed/ , while files created during the workflow can be found under Natrix-Pipeline/results/ .


Output file hierarchy, blue nodes represent folders, orange nodes represent files that are created in both variants of the workflow, green nodes are files exclusive to the OTU variant and purple nodes are files exclusive to the ASV variant of the workflow.

| Folder | File(s) | Description |
| --- | --- | --- |
| qc | FastQC reports | Quality reports of the FastQC application. |
| qc | MultiQC report | Aggregated FastQC reports in a single file. |
| logs | Logfiles | Logfiles of the different rules. |
| assembly (one folder for each sample) | sample_low_qual.fastq | Sequences of sample that did not pass the PRINSEQ quality filtering. |
| assembly | sample_assembled.fastq | Sequences assembled with PANDAseq. |
| assembly | sample_singletons.fastq | Sequences that could not be assembled. |
| assembly | sample.fasta | FASTA file of the assembled sequences. |
| assembly | sample.dereplicated.fasta | Dereplicated sequences of sample. |
| assembly | sample_chimera.fasta | Sequences of sample that are thought to be of chimeric origin. |
| finalData | sample.nonchimera.fasta | Sequences of sample that passed the chimera detection rule. |
| finalData | unfiltered_table.csv | Table containing the sequences of all samples and their abundances per sample. |
| finalData | filtered_table.csv | Table containing the sequences of all samples and their abundances per sample after filtering. |
| finalData | filtered_out_table.csv | Table containing the sequences that did not pass the filtering rule. |
| finalData | filtered.fasta | The sequences of filtered_table.csv in FASTA format. |
| finalData | filtered_blast_table.csv | Table containing the sequences of filtered_table.csv and the taxonomic information assigned to each. |
| figures | ampliconduo_unfiltered.png | Discordance graph before filtering. |
| figures | ampliconduo_filtered.png | Discordance graph after filtering. |
| figures | AmpliconDuo.Rdata | RData file containing the results of the AmpliconDuo statistical analysis. |

Steps of the Pipeline

Initial demultiplexing

The sorting of reads according to their barcode is known as demultiplexing.
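
As a simple illustration of the idea (the barcodes and reads below are invented; Natrix's actual implementation is the demultiplexing.py script shown in the code snippets), a read is assigned to the sample whose barcode it starts with:

barcodes = {"sample1": "ACGT", "sample2": "TGCA"}  # invented example barcodes
reads = ["ACGTTTGAGGCAT", "TGCAGGCATAAGT", "NNNNAAGGTTCCA"]

demultiplexed = {sample: [] for sample in barcodes}
unassigned = []
for read in reads:
    for sample, barcode in barcodes.items():
        if read.startswith(barcode):
            demultiplexed[sample].append(read)
            break
    else:
        unassigned.append(read)  # no barcode matched

print(demultiplexed, unassigned)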

Quality control

For quality control the pipeline uses the programs FastQC (Andrews 2010), MultiQC (Ewels et al. 2016) and PRINSEQ (Schmieder and Edwards 2011).

FastQC

FastQC generates a quality report for each FASTQ file, containing information such as the per-base and average sequence quality (using the Phred quality score), overrepresented sequences, GC content, adapter content and the k-mer content of the FASTQ file.

MultiQC

MultiQC aggregates the FastQC reports for a given set of FASTQ files into a single report, allowing reviews of all FASTQ files at once.

PRINSEQ

PRINSEQ is used to filter out sequences with an average quality score below the threshold that can be defined in the configuration file of the pipeline.
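
The criterion behind this filter can be sketched in a few lines of Python; this is only an illustration of the idea, not how PRINSEQ is invoked (the actual call lives in scripts/prinseq.py, listed in the code snippets). The threshold corresponds to the mq option of the configuration file (default 25):

def passes_mean_quality(phred_scores, threshold=25):
    # Keep a read only if its mean Phred score reaches the threshold.
    return sum(phred_scores) / len(phred_scores) >= threshold

print(passes_mean_quality([30, 32, 28, 35]))  # True, mean = 31.25
print(passes_mean_quality([10, 12, 15, 20]))  # False, mean = 14.25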

Read assembly

Define primer

The define_primer rule specifies the subsequences to be removed by the assembly rule, based on entries of the configuration file and a primer table that contains information about the primer and barcode sequences used and the length of the poly-N subsequences. Besides removing the subsequences based on their nucleotide sequence, it is also possible to remove them based solely on their length using an offset. Using an offset can be useful if the sequence has many uncalled bases in the primer region, which could otherwise hinder matches between the target sequence defined in the primer table and the sequence read.

Assembly & removal of undesired subsequences (OTU-variant)

To assemble paired-end reads and remove the subsequences described in the previous section, PANDAseq (Masella et al. 2012) is used, which applies probabilistic error correction to assemble overlapping forward and reverse reads. After assembly and sequence trimming, it removes sequences that do not meet a minimal or maximal length threshold, have an assembly quality score below a threshold that can be defined in the configuration file, or whose forward and reverse reads do not have a sufficiently long overlap. The thresholds for each of these procedures can be adjusted in the configuration file. If the reads are single-end, the subsequences (poly-N, barcode and primer) that are defined in the define_primer rule are removed, followed by the removal of sequences that do not meet a minimal or maximal length threshold as defined in the configuration file.

Removal of undesired subsequences (ASV-variant)

In the ASV variant of the workflow Cutadapt (Martin 2011) is used to remove the undesired subsequences defined in the primer table.

ASV denoising (ASV-variant)

After the removal of undesired subsequences, ASVs are generated using the DADA2 (Callahan et al. 2016) algorithm. It dereplicates the dataset and applies a denoising algorithm, which infers whether a sequence was produced from another sequence based on the composition, quality and abundance of the sequence and an Illumina error model. After ASV generation, exactly overlapping forward and reverse reads are assembled. The assembled ASVs are saved as FASTA files for downstream analysis.

Similarity clustering

Conversion of FASTQ files to FASTA files (OTU-variant)

The rule copy_to_fasta converts the FASTQ files to FASTA files to reduce the disk space occupied by the files and to allow the usage of CD-HIT, which requires FASTA-formatted sequencing files.

Clustering of similar sequences

The CD-HIT-EST algorithm (Fu et al. 2012) clusters sequences together if they are either identical or if one sequence is a subsequence of another. This clustering approach is known as dereplication. Beginning with the longest sequence of the dataset as the first representative sequence, it iterates through the dataset in order of decreasing sequence length, comparing at each iteration the current query sequence to all representative sequences. If the sequence identity threshold defined in the configuration file is met for a representative sequence, the counter of that representative sequence is increased by one. If the threshold is not met for any of the existing representative sequences, the query sequence is added to the pool of representative sequences.

Cluster sorting

The cluster_sorting rule uses the output of the cdhit rule to count the number of sequences represented by each cluster, sorts the representative sequences in descending order according to cluster size and adds a specific header to each sequence, as required by the UCHIME chimera detection algorithm.
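
A toy version of the dereplication logic described above looks like this (CD-HIT-EST itself is far more efficient and also supports identity thresholds below 100%; the sequences here are invented):

def dereplicate(sequences):
    # representative sequence -> number of sequences it represents
    representatives = {}
    for seq in sorted(sequences, key=len, reverse=True):
        for rep in representatives:
            if seq in rep:  # identical to, or a subsequence of, the representative
                representatives[rep] += 1
                break
        else:
            representatives[seq] = 1  # open a new cluster
    return representatives

print(dereplicate(["ACGTACGT", "ACGTACGT", "GTACG", "TTTT"]))
# {'ACGTACGT': 3, 'TTTT': 1}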

Chimera detection

VSEARCH

VSEARCH is an open-source alternative to the USEARCH toolkit; it aims to functionally replicate the USEARCH algorithms, whose source code is not openly available and which are often only rudimentarily described (Rognes et al. 2016). As an alternative to UCHIME, Natrix uses the VSEARCH uchime3_denovo algorithm (further referred to as VSEARCH3) to detect chimeric sequences. The VSEARCH3 algorithm is a replication of the UCHIME2 algorithm with optimized default parameters. The UCHIME2 algorithm is described by R. Edgar 2016 as follows:

"Given a query sequence Q , UCHIME2 uses the UCHIME algorithm to construct a model ( M ), then makes a multiple alignment of Q with the model and top hit ( T , the most similar reference sequence). The following metrics are calculated from the alignment: number of differences dQT between Q and T and dQM between Q and M , the alignment score ( H ) using eq. 2 in R. C. Edgar et al. 2011. The fractional divergence with respect to the top hit is calculated as divT = (dQT − dQM)/|Q|. If divT is large, the model is a much better match than the top hit and the query is more likely to be chimeric, and conversely if divT is small, the model is more likely to be a fake."

The difference between the UCHIME2 and UCHIME3 algorithms is that in UCHIME3 a sequence needs at least 16 times the abundance of the query sequence to be selected as a potential parent, whereas in UCHIME2 double the abundance of the query sequence is sufficient.
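
The two quantities referred to above can be written out as follows (the numbers are invented and only illustrate the formulas; divT comes from the UCHIME2 description quoted above, and abskew corresponds to the abskew option of the configuration file):

def div_t(d_qt, d_qm, query_length):
    # Fractional divergence with respect to the top hit: (dQT - dQM) / |Q|
    return (d_qt - d_qm) / query_length

def potential_parent(parent_abundance, query_abundance, abskew=16):
    # UCHIME3 requires a parent to be at least abskew (16) times as abundant
    # as the query; UCHIME2 used an abundance skew of 2.
    return parent_abundance >= abskew * query_abundance

print(div_t(d_qt=12, d_qm=2, query_length=250))                    # 0.04
print(potential_parent(parent_abundance=400, query_abundance=20))  # True (400 >= 320)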

Table creation and filtering

Merging of all FASTA files into a single table

For further processing, the rule unfiltered_table merges all FASTA files into a single nested dictionary, containing each sequence as a key, with another dictionary as value whose keys are all (split-)samples in which the sequence occurred and whose values are the abundance of the sequence in the particular (split-)sample. For further pipeline processing, the dictionary is temporarily saved in JSON format. To ease statistical analysis of the data, the dictionary is also exported as a comma-separated table.

Filtering

In the filtering rule of the pipeline, all sequences that do not occur in both split-samples of at least one sample are filtered out. For single-sample data, the filtering rule uses an abundance cutoff value that can be specified in the configuration file to filter out all sequences whose abundances are less than or equal to the specified cutoff value. The filtered data and the filtered-out data are subsequently exported as comma-separated tables.
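
A minimal sketch of the data structure and of both filtering modes described above (the sequences, sample names and counts are invented; the actual logic lives in scripts/unfiltered_table.py and scripts/filtering.py, listed in the code snippets):

unfiltered = {
    "ACGTAC...": {"S2016BY_A": 12, "S2016BY_B": 9},  # present in both split-samples
    "TTGACC...": {"S2016BY_A": 3, "S2016BY_B": 0},   # present in only one split-sample
}

def split_sample_filter(table):
    # Keep a sequence only if it occurs in both split-samples of at least one sample.
    kept = {}
    for seq, counts in table.items():
        per_sample = {}
        for split_sample, abundance in counts.items():
            sample = split_sample.rsplit("_", 1)[0]  # strip the A/B unit identifier
            per_sample.setdefault(sample, []).append(abundance)
        if any(len(abundances) == 2 and all(abundances) for abundances in per_sample.values()):
            kept[seq] = counts
    return kept

def abundance_filter(table, cutoff=3):
    # Single-sample data: drop sequences with total abundance <= cutoff.
    return {seq: counts for seq, counts in table.items() if sum(counts.values()) > cutoff}

print(split_sample_filter(unfiltered))  # only the first sequence is kept
print(abundance_filter(unfiltered))     # the second sequence (3 reads in total) is dropped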

Table conversion to FASTA files

As the swarm rule needs FASTA files as input, the resulting table of the filtering is converted to a FASTA file by the rule write_fasta.

AmpliconDuo / Split-sample approach

The pipeline supports both single-sample and split-sample FASTQ amplicon data. The split-sample protocol (Lange et al. 2015) aims to reduce the amount of sequences that are the result of PCR or sequencing errors without the usage of stringent abundance cutoffs, which often lead to the loss of rare but naturally occurring sequences. To achieve this, extracted DNA from a single sample is divided into two split-samples, which are then separately amplified and sequenced. All sequences that do not occur in both split-samples are seen as erroneous sequences and filtered out. The method is therefore based on the idea that a sequence that was created by PCR or sequencing errors does not occur in both samples. A schematic representation of the split-sample method is shown below:


Schematic representation of the split-sample approach: extracted DNA from a single environmental sample is split and separately amplified and sequenced. The filtering rule compares the resulting read sets between the two split-samples, filtering out all sequences that do not occur in both split-samples. Image adapted from Lange et al. 2015.

The initial proposal for the split-sample approach by Dr. Lange was accompanied by the release of the R package AmpliconDuo for the statistical analysis of amplicon data produced by the aforementioned split-sample approach. It uses Fisher's exact test to detect significantly deviating read numbers between the two experimental branches A and B of a sample S. To measure the discordance between the two branches of a sample, the read-weighted discordance ∆rSθ (weighted by the average read number of sequence i in both branches) and the unweighted discordance ∆uSθ are calculated for each sequence i. If ∆uSθ = 0, each branch of sample S contains the same set of sequences, while if ∆rSθ = 0, the read numbers for each sequence in sample S are within the error margin set by the chosen false discovery rate. The results of the discordance calculations are then plotted for visualization purposes and written to an R data file to allow the filtering of significantly deviating sequences.
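
The statistical idea can be sketched as follows; the read counts are invented, the contingency-table layout is one common way to apply the test and may differ from what AmpliconDuo does internally, and the actual analysis is performed by the scripts/ampliconduo.R script listed in the code snippets:

from scipy.stats import fisher_exact

reads_seq_a, reads_seq_b = 40, 5   # reads of sequence i in branch A and branch B
total_a, total_b = 10000, 9500     # total reads in branch A and branch B

contingency = [[reads_seq_a, reads_seq_b],
               [total_a - reads_seq_a, total_b - reads_seq_b]]
odds_ratio, p_value = fisher_exact(contingency)
print(p_value)  # a small p-value indicates significantly deviating read numbers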

OTU generation (OTU-variant)

OTUs are generated using the Swarm clustering algorithm (Mahé et al. 2015) in the identically named rule. Swarm clusters sequences into OTUs using an iterative approach with a local threshold: it creates a pool of amplicons from the input file and an empty OTU. Subsequently, it removes the first amplicon from the pool, which becomes the OTU seed. All amplicons left in the pool that differ in their nucleotide composition from the initial seed by no more than a user-given threshold (the default threshold is 1 nucleotide) are removed from the pool and added to the OTU as subseeds. In the next iteration, each amplicon differing by at most the threshold from any of the subseeds is in turn removed from the pool and added to the OTU. This iterative process continues until there are no amplicons left in the pool with a nucleotide difference of at most the threshold to any of the subseeds added in the previous iteration, leading to the closure of the OTU and the opening of a new one. This approach to OTU generation circumvents two sources of OTU variability that are inherent to greedy clustering algorithms: the input-order dependency, in which the first amplicon in a FASTA file becomes the centroid of an OTU, and the use of a global threshold, recruiting all amplicons that have fewer differences to the centroid than a user-defined threshold. The iterative approach of Swarm produces a star-shaped minimum spanning tree, with a (usually highly abundant) amplicon as the center, regardless of the first amplicon chosen for the OTU, as visualized in the figure below. The sequence of the amplicon at the center of each OTU tree is used in subsequent analysis steps as the representative sequence of the corresponding OTU.


Schematic representation of the greedy clustering approach and the iterative Swarm approach. The greedy approach (a), which uses a global clustering threshold t and input-order-dependent centroid selection, can lead to the placement of closely related amplicons into different OTUs. The iterative approach of Swarm (b), with a local threshold d, leads to OTUs containing only closely related amplicons, with a naturally forming centroid during the iterative growth process of the OTU. Image from Mahé et al. 2015.
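
The iterative growth described above can be mimicked with a short (and, compared to Swarm, very slow) sketch. The sequences are invented and, for simplicity, only substitutions between equal-length sequences are counted as differences:

def differs_by_at_most_one(a, b):
    # Simplified distance: equal length, at most one substitution.
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1

def swarm_like(amplicons):
    pool = list(amplicons)
    otus = []
    while pool:
        otu = [pool.pop(0)]            # the seed of a new OTU
        frontier = list(otu)
        while frontier:                # grow from the subseeds added last iteration
            frontier = [a for a in pool
                        if any(differs_by_at_most_one(a, s) for s in frontier)]
            for a in frontier:
                pool.remove(a)
            otu.extend(frontier)
        otus.append(otu)               # no more neighbours: close the OTU
    return otus

print(swarm_like(["AAAA", "AAAT", "AATT", "GGGG"]))
# [['AAAA', 'AAAT', 'AATT'], ['GGGG']]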

Sequence comparison

The assignment of taxonomic information to an OTU/ASV is an important part of the processing of environmental amplicon data, as the characteristics of known groups or species can be used to assess the environmental conditions of the sample's origin. To find sequences that are similar to the representative sequence of each OTU/ASV, the BLAST (basic local alignment search tool) algorithm (Altschul et al. 1990) is used to search for similar sequences in the SILVA database (Pruesse et al. 2007). The SILVA database contains aligned rRNA sequencing data that are curated in a multi-step process. While it has an extensive collection of high-quality prokaryotic rRNA sequencing data, it only contains a limited amount of microbial eukaryotic sequencing data. If the database is not locally available, the required files are automatically downloaded and the database is built in the rule make_silva_db. The BLAST algorithm itself is executed in the blast rule. As the aim is to find similar nucleotide sequences in the database for each query sequence, the nucleotide-nucleotide BLAST (BLASTn) variant of the BLAST algorithm is used. The tab-separated output file of the blast rule contains the following information for each representative sequence, if the output of the BLASTn search meets the criteria defined in the configuration file:

Column Nr. Column Name Description
1. qseqid Query sequence identification
2. qlen Length of the query sequence
3. length Length of the alignment
4. pident Percentage of identical matches
5. mismatch Number of mismatches
6. qstart Start of the alignment in the query sequence
7. qend End of the alignment in the query sequence
8. sstart Start of the alignment in the target sequence
9. send End of the alignment in the target sequence
10. gaps Number of gaps
11. evalue E-value
12. stitle Title (taxonomy) of the target sequence
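
The tab-separated output can be loaded with pandas using the column names listed above; the file name blast_output.tsv is only a placeholder for the actual output file of the blast rule:

import pandas as pd

columns = ["qseqid", "qlen", "length", "pident", "mismatch", "qstart",
           "qend", "sstart", "send", "gaps", "evalue", "stitle"]
hits = pd.read_csv("blast_output.tsv", sep="\t", names=columns)
print(hits[["qseqid", "pident", "evalue", "stitle"]].head())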

Merging of the results

The output data from the write_fasta, swarm and blast rules are merged into a single comma-separated table in the merge_results rule, containing for each representative sequence the sequence identification number, the nucleotide sequence, the abundance of the sequence in each sample, the sum of abundances and, if the blast rule found a similar sequence, all information listed in the table shown in the previous section.

Example primertable

The primertable should be a .csv file (<project>.csv) in the following format:

| Probe | poly_N | Barcode_forward | specific_forward_primer | poly_N_rev | Barcode_reverse | specific_reverse_primer |
| --- | --- | --- | --- | --- | --- | --- |
| S2016BY_A | NNNNN | | GTACACACCGCCCGTC | N | | GCTGCGYYCTTCATCGDTR |
| S2016RU_A | NNNN | | GTACACACCGCCCGTC | NN | | GCTGCGYYCTTCATCGDTR |

(The barcode columns are empty in the example data.)
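
The bundled scripts read this table with pandas, indexed by the Probe column (compare assembly.py and cutadapt.py in the code snippets below); a minimal example:

import pandas as pd

primer_table = pd.read_csv("example_data.csv", index_col="Probe",
                           na_filter=False).to_dict("index")
print(primer_table["S2016BY_A"]["specific_forward_primer"])  # GTACACACCGCCCGTC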

Configfile

Below are the explanations for the configfile (<project>.yaml) entries:

Option Default Description
filename project The path / filename of the project folder, primertable (.csv) and configfile (.yaml). If the raw data folder is not in the root directory of Natrix, please add the path relative to the root directory (e.g. input/example_data)
primertable project.csv Path to the primertable. If the primertable is not in the root directory of Natrix, please add the path relative to the root directory (e.g. input/example_data.csv)
units units.tsv Path to the sequencing unit sheet.
cores 4 Amount of cores available for the workflow.
multiqc False Initial quality check (fastqc & multiqc), currently only works for not yet assembled reads.
demultiplexing False Demultiplexing for reads that were not demultiplexed by the sequencing company (slow).
read_sorting False Read sorting for paired end reads that were not sorted by the sequencing company (slow).
already_assembled False Skipping of the quality control and read assembly steps for data that is already assembled.
seq_rep OTU How the sequences should be represented; possible values are "ASV" (amplicon sequence variants, created with DADA2) or "OTU" (operational taxonomic units, created with Swarm).
threshold 0.9 PANDAseq score threshold a sequence must meet to be kept in the output.
minoverlap 15 Sets the minimum overlap between forward and reverse reads.
minqual 1 Minimal quality score for bases in an assembled read to be accepted by PANDAseq.
minlen 100 The minimal length of a sequence after primer removal to be accepted by PANDAseq.
maxlen 600 The maximal length of a sequence after primer removal to be accepted by PANDAseq.
primer_offset False Using PANDAseq to remove primer sequences by length offset instead of sequence identity.
mq 25 Minimum quality sequence check (prinseq), filtering of sequences according to the PHRED quality score before the assembly.
barcode_removed True Boolean that indicates if the sequence is free of barcodes.
all_primer True Boolean that indicates if the sequence is free of any kind of additional subsequences (primer, barcodes etc.).
clustering 1.0 Percent identity for cdhit (dereplication) (1 = 100%), if cdhit is solely to be used for dereplication (recommended), keep the default value.
length_overlap 0.0 Length difference cutoff (default 0.0). If set to 0.9, the shorter sequence needs to be at least 90% of the length of the cluster representative.
representative longest Which sequence to use as a representative sequence per CDHIT cluster. longest = the longest sequence of the corresponding cluster, most_common = the most common sequence of the corresponding cluster.
beta 8.0 Weight of a "no" vote for the VSEARCH chimera detection algorithm.
pseudo_count 1.2 Pseudo-count prior on the number of “no” votes.
abskew 16 Minimum abundance skew, defined as min(abund(parent1), abund(parent2)) / abund(child).
filter_method not_split If the split sample approach was used (split_sample) or not (not_split).
cutoff 3 An additional abundance filter if the split sample approach was not used.
ampli_corr fdr Specifies the correction method for Fisher's exact test.
save_format png File format for the frequency-frequency plot.
plot_AmpDuo True If the frequency-frequency plot should be saved.
paired_End True The format of the sequencing data, TRUE if the reads are in paired-end format.
name_ext R1 The identifier for the forward read (for the reverse read, the 1 is replaced by a 2 if the data is in paired-end format). It has to be included at the end of the file name, before the file format identifier (also for single-end files).
swarm True Boolean to indicate the use of the SWARM clustering algorithm to create operational taxonomic units (OTUs) from the data.
blast False Boolean to indicate the use of the BLAST algorithm to assign taxonomic information to the OTUs.
database SILVA Database against which the BLAST should be carried out, at the moment "NCBI" and "SILVA" are supported.
drop_tax_classes '.*unclassified Bacteria.*,.*uncultured.*bacterium.*' Given a comma-separated list, drops undesired classes either by id, by name or using regex.
db_path database/silva/silva.db Path to the database file against which the BLAST should be carried out. At the moment, only the SILVA and NCBI databases will be downloaded automatically; other databases have to be downloaded and configured manually.
max_target_seqs 1 Number of blast hits that are saved per sequence / OTU.
ident 90.0 Minimal identity overlap between target and query sequence.
evalue 1e-51 Highest accepted evalue.
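
Since the configfile is plain YAML, it can be inspected with a few lines of Python. This is only a sketch; depending on the Natrix version, the options listed above may be grouped into sub-sections within the file, so printing the whole dictionary is the safest way to see its layout:

import yaml

with open("example_data.yaml") as handle:
    config = yaml.safe_load(handle)

print(config)  # shows which of the options listed above are set, and how they are nested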

References

  • Köster, Johannes and Sven Rahmann (2018). “Snakemake—a scalable bioinformatics workflow engine”. In: Bioinformatics.

  • Andrews, S. (2010). FASTQC. A quality control tool for high throughput sequence data.

  • Ewels, P et al. (2016). “MultiQC: summarize analysis results for multiple tools and samples in a single report”. In: Bioinformatics 32.19, pp. 3047–3048.

  • Callahan, B. J. et al. (2016). “DADA2: High-resolution sample inference from Illumina amplicon data”. In: Nature Methods 13(7), pp. 581–583.

  • Schmieder, Robert and Robert A. Edwards (2011). “Quality control and preprocessing of metagenomic datasets.” In: Bioinformatics 27.6, pp. 863–864.

  • Masella, Andre P et al. (2012). “PANDAseq: paired-end assembler for illumina sequences”. In: BMC Bioinformatics 13.1 , p. 31.

  • Fu, Limin et al. (2012). “CD-HIT: accelerated for clustering the next-generation sequencing data”. In: Bioinformatics 28.23 , pp. 3150–3152.

  • Martin, M. (2011). “Cutadapt removes adapter sequences from high-throughput sequencing reads”. In: EMBnet.journal 17(1), p. 10.

  • Rognes, Torbjørn et al. (2016). “VSEARCH: a versatile open source tool for metagenomics”. In: PeerJ Preprints. doi: 10.7287/peerj.preprints.2409v1.

  • Edgar, Robert (2016). “UCHIME2: improved chimera prediction for amplicon sequencing”. In: bioRxiv.

  • Mahé, Frédéric et al. (2015). “Swarm v2: highly-scalable and high-resolution amplicon clustering”. In: PeerJ 3.

  • Lange, Anja et al. (2015). “AmpliconDuo: A Split-Sample Filtering Protocol for High-Throughput Amplicon Sequencing of Microbial Communities”. In: Plos One 10.11.

  • Altschul, Stephen F. et al. (1990). “Basic local alignment search tool”. In: Journal of Molecular Biology 215.3 , pp. 403–410.

  • Pruesse, E. et al. (2007). “SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB”. In: Nucleic Acids Research 35.21 , pp. 7188–7196.

Code Snippets

shell:
    """
        dir_name=$(dirname {params[0]});
        wget -P $dir_name/ --progress=bar ftp://ftp.arb-silva.de/current/Exports/SILVA_*_SSURef_tax_silva.fasta.gz;
        gunzip -c $dir_name/SILVA_*_SSURef_tax_silva.fasta.gz > $dir_name/silva.db.fasta;
        makeblastdb -in $dir_name/silva.db.fasta -dbtype nucl -parse_seqids -out $dir_name/silva.db -blastdb_version 5
    """
script:
    "../scripts/create_silva_taxonomy.py"
shell:
    """
        dir_name=$(dirname {params[0]});
        wget --spider --no-remove-listing ftp://ftp.ncbi.nlm.nih.gov/blast/db/
        number=$(awk '$9 ~ /^nt.[0-9]*.tar.gz[^.]/ {{print substr($9,4,2)}}' {output.listing} | tail -n 1)
        for i in `seq -w 00 $number`;
            do wget -N -P $dir_name/ --progress=bar ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.${{i}}.tar.gz;
        tar xvzf $dir_name/nt.${{i}}.tar.gz -C $dir_name;
        done;
        touch {output.path}
    """
shell:
    """
        dir_name=$(dirname {params[0]});
        wget -N -P $dir_name/ --progress=bar ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz;
        wget -N -P $dir_name/ --progress=bar ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz;
        tar xvzf $dir_name/new_taxdump.tar.gz -C $dir_name;
    """
script:
    "../scripts/create_blast_taxonomy.py"
shell:
    "blastn -num_threads {threads} -query {input[0]} -db {params.db_path}"
    " -max_target_seqs {params.max_target_seqs}"
    " -perc_identity {params.ident} -evalue {params.evalue}"
    " -outfmt {params.out6} -out {output}"
script:
    "../scripts/ncbi_taxonomy.py"
SnakeMake From line 116 of rules/blast.smk
script:
    "../scripts/silva_taxonomy.py"
SnakeMake From line 133 of rules/blast.smk
script:
    "../scripts/merge_results.py"
SnakeMake From line 149 of rules/blast.smk
shell:
    "vsearch --uchime3_denovo {input} -uchimeout {output.uchime_out}"
    " -chimeras {output.chim} -nonchimeras {output.nonchim} -xn {params.beta}"
    "  -dn {params.pseudo_c} -abskew {params.abskew} --log {log} 2>&1"
script:
    "../scripts/dada2.R"
run:
    import csv
    with open(input[0], "r") as csv_in, open(
        output[0], "w") as fasta_out, open(output[1], "w") as csv_out:
        filtered_table = csv.reader(csv_in)
        filtered_table_seqid = csv.writer(csv_out)
        for row in enumerate(filtered_table):
            if row[0] != 0:
                abu_sum = sum([int(num) for num in row[1][1:]])
                filtered_table_seqid.writerow([">{};size={};".format(
                    row[0], abu_sum)] + row[1])
                fasta_out.write(">{};size={};\n{}\n".format(row[0],
                    abu_sum, row[1][0]))
            else:
                filtered_table_seqid.writerow(["seqid"] + row[1])
shell:
    "swarm -t {threads} -f -z -w {output[0]} < {input} > {output[1]}"
script:
    "../scripts/merge_swarm_results.py"
script:
    "../scripts/demultiplexing.py"
shell:
     "gunzip -c {input} > {output}"
shell:
    "cd-hit-est -i {input} -o {output[0]} -c {params.id_percent} -T"
    " {threads} -s {params.length_cutoff} -M 2000 -sc 1 -d 0"
script:
    "../scripts/dereplication.py"
script:
    "../scripts/unfiltered_table.py"
script:
    "../scripts/filtering.py"
script:
    "../scripts/ampliconduo.R"
shell:
    "fastqc -o results/qc/ {input} -t {threads}"
shell:
    "multiqc results/qc/ -o results/qc/"
script:
    "../scripts/define_primer.py"
script:
    "../scripts/prinseq.py"
script:
    "../scripts/cutadapt.py"
script:
    "../scripts/assembly.py"
shell:
    "seqtk seq -a {input} > {output}"
library(AmpliconDuo)
library(ggplot2)
library(xtable)
library(data.table)

# Apparently ggplot opens a device to get, for example, height and width for plots if not specified,
# which in turn produces an additional Rplots.pdf in the working dir.
# The following line should suppress this.
pdf(NULL)

log <- file(toString(snakemake@log), open="wt")
sink(log, append = TRUE)
sink(log, type="message", append =TRUE)

# Script for statistical analysis of the filtering process (filtered vs unfiltered),
# it will also use Fisher's exact test to find significantly deviating read numbers
# between split-samples, which could be used to implement an additional filtering step.
unfiltered.table <- fread(snakemake@input[["unfiltered_table"]],
                          stringsAsFactors=FALSE, data.table=FALSE)
filtered.table <- fread(snakemake@input[["filtered_table"]],
                        stringsAsFactors=FALSE, data.table=FALSE)
figure.folder <- unlist(strsplit(toString(snakemake@output),
                                 split="/AmpliconDuo.RData"))

amp.duo <- function(table, figure.folder, saving.format, file.name, p.corr) {
  intable <- table[, -1]
  names <- names(intable)
  names <- names[seq(1, length(names), by=2)]
  names <- sapply(X=names, FUN=function(x){
  unlist(strsplit(x, split="+.A$"))
  })
  splitsamples <-  lapply(names, function(x) grep(x, colnames(intable), value=T))
  allzero <- unlist(lapply(splitsamples, function(x) all(rowSums(intable[,x])==0)))
  if(any(allzero)){
    print("These splitsamples do not have any OTUS/ASVs in common:")
    print(splitsamples[allzero])
    intable <- intable[,unlist(splitsamples[!allzero])]
    names <- names[!allzero]
  }
  ampliconduos <- ampliconduo(intable, sample.names=names,
                              correction=p.corr)
  if (snakemake@params[["plot_ampduo"]]) {
    nrow = as.integer(sqrt(length(names)))
    plotAmpliconduo.set(ampliconduos, nrow=nrow, save=TRUE,
                        path=figure.folder, format=saving.format,
                        file.name=file.name)
  }
  return(ampliconduos)
}

discordance <- function(ampliconduos, figure.folder, file.name) {
  discordance.delta(ampliconduos, corrected=TRUE, printToTex=TRUE,
                    directory=figure.folder, file.name=file.name)
}

amp.duo.unfiltered = amp.duo(unfiltered.table, figure.folder,
                             snakemake@params[["saving_format"]],
                             paste("ampliconduo_unfiltered_",
                                   Sys.Date(), sep=""),
                             snakemake@params[["p_corr"]])

discordance(amp.duo.unfiltered, figure.folder,
            paste("discordanceDelta_unfiltered_", Sys.Date(), sep=""))

amp.duo.filtered = amp.duo(filtered.table, figure.folder,
                           snakemake@params[["saving_format"]],
                           paste("ampliconduo_filtered_", Sys.Date(), sep=""),
                           snakemake@params[["p_corr"]])

discordance(amp.duo.filtered, figure.folder,
            paste("discordanceDelta_filtered_", Sys.Date(), sep=""))

save(amp.duo.unfiltered, amp.duo.filtered, file=paste(figure.folder,
                                                        "/AmpliconDuo.RData",
                                                        sep=""))

sink()
sink(type="message")
import re
import yaml
import dinopy
import logging
import subprocess
import numpy as np
import pandas as pd
from glob import glob

# Script to assemble paired-end reads using PandaSeq.
# If the primers were not previously removed, PandaSeq will remove them.
# If the sequences are single-end, the primers will be cut off
# and the remaining length of the sequences will be compared to the
# cutoff values defined in the config.

primer_table = pd.read_csv(snakemake.input.primer_t, index_col="Probe",
        na_filter=False).to_dict("index")

if snakemake.params.paired_end:
    if snakemake.params.prim_rm:
        subprocess.call(["pandaseq",
            "-f",snakemake.input[0], "-r", snakemake.input[1], "-B", "-a", "-F",
            "-g", str(snakemake.log),
            "-w", str(snakemake.output), "-N",
            "-T", str(snakemake.threads),
            "-t", str(snakemake.params.threshold),
            "-o", str(snakemake.params.minoverlap),
            "-l", str(snakemake.params.minlen),
            "-L", str(snakemake.params.maxlen),
            "-C" "min_phred:" + str(snakemake.params.minqual)])
    else:
        r1_primer = primer_table[snakemake.wildcards.sample + "_"
            + snakemake.wildcards.unit]["specific_forward_primer"]
        r2_primer = primer_table[snakemake.wildcards.sample + "_"
            + snakemake.wildcards.unit]["specific_reverse_primer"]

        subprocess.call(["pandaseq",
            "-f", snakemake.input[0], "-r", snakemake.input[1], "-B", "-a", "-F",
            "-g", str(snakemake.log),
            "-w", str(snakemake.output),"-N",
            "-p", r1_primer, "-q", r2_primer,
            "-T", str(snakemake.threads),
            "-t", str(snakemake.params.threshold),
            "-o", str(snakemake.params.minoverlap),
            "-l", str(snakemake.params.minlen),
            "-L", str(snakemake.params.maxlen),
            "-C" "min_phred:" + str(snakemake.params.minqual)])
else:
    logging.basicConfig(filename=str(snakemake.log),
            level=logging.DEBUG)
    iupac_dict_regex = {"M":"[AC]", "R":"[AG]", "W":"[AT]", "S":"[CG]",
            "Y":"[CT]","K":"[GT]", "V":"[ACG]", "H":"[ACT]",
            "D":"[AGT]", "B":"[CGT]", "X":"[ACGT]", "N":"[ACGT]"}

    # Function to remove the primer, barcode & polyN from the sequences
    # and compare the sequence length to the cutoffs defined in the config.
    def primer_len_filter(path, sample):
        sequence = dinopy.FastqReader(path)
        assembled = dinopy.FastqWriter(path.rsplit("_", 1)[0]
                + "_assembled.fastq")
        filt_out = dinopy.FastqWriter(path.rsplit("_", 1)[0]
                + "_filtered_out.fastq")
        assembled_counter = 0
        filt_out_counter = 0
        assembled.open()
        filt_out.open()
        for read in sequence.reads(quality_values=True):
            name = read.name.decode()
            seq = check_for_match(read.sequence.decode(), sample)
            if seq[0] and snakemake.params.maxlen >= len(seq[1]) >= snakemake.params.minlen:
                assembled.write(seq[1].encode(), name.split(" ")[0].encode(), read.quality)
                assembled_counter += 1
            else:
                filt_out.write(read.sequence, name.split(" ")[0].encode(), read.quality)
                filt_out_counter += 1
        logging.info("{}: {} sequences were kept, \
                {} sequences were filtered out".format(sample,
                    assembled_counter, filt_out_counter))
        assembled.close()
        filt_out.close()

    # Helper function for the IUPAC extended nucleotide base code.
    def iupac_replace(sequence, iupac_dict):
        for i, j in iupac_dict_regex.items():
            sequence = sequence.replace(i, j)
        return sequence

    # Function to search and remove primer and barcode sequences.
    # It will also search for matching sequences two bases shifted
    # to the left and right from the position defined by the primertable,
    # to account for potential errors in the polyN column of the primertable.
    def check_for_match(sequence, sample):
        if snakemake.params.prim_rm:
            return (True, sequence)
        else:
            poly_prim_bar = [primer_table[sample][key] for key
                    in primer_table[sample].keys() if key in ["poly_N",
                        "specific_forward_primer", "Barcode_forward"]]
            prim_bar = re.compile(poly_prim_bar[1]
                    + iupac_replace(poly_prim_bar[2], iupac_dict_regex))
            for i in [0, 1, -1, 2, -2]:
                start = np.clip(len(primer_table[sample]["poly_N"]) + i,
                        a_min=0, a_max=None)
                end = np.clip(len("".join(poly_prim_bar)) + i, a_min=0,
                        a_max=None)
                if prim_bar.match(sequence[start : end]):
                    return (True, sequence.replace(sequence[ : end], ""))
            # Only reached when no offset produced a match; previously this else
            # was attached to the if, so only the first offset was ever tried
            # (compare the for/else in check_for_match_demulti in demultiplexing.py).
            return (False, sequence)

    primer_len_filter(snakemake.input[0],
            snakemake.input[0].split("/")[-1].rsplit("_", 1)[0])
import pandas as pd
import logging

logging.basicConfig(filename=str(snakemake.log),
            level=logging.DEBUG)

tax_df = pd.read_csv(snakemake.input[0], sep="\t\\|\t", index_col=0, header=None, engine='python')

def get_tax_lineage(row):
    if(isinstance(row[2], str)):
        result = row[2][:-2]
    else:
        result = row[2]
        logging.info("Taxid {} is not present in NCBI lineage file.".format(row.name))
    return result

tax_df["tax_lineage"] = tax_df.apply(lambda row: get_tax_lineage(row), axis=1).map(str) + tax_df[1]
tax_df = tax_df.drop([1,2], axis=1)

tax_df.to_hdf(snakemake.output[0], key='tax_df', mode='w')
import pandas as pd
import logging
import dinopy

far = dinopy.FastaReader(snakemake.input[0])
header = [i.name.decode().split(" ", 1) for i in far.entries()]
df = pd.DataFrame(header, columns=["id", "taxonomy"])
df = df.set_index(keys="id", drop=True)

df.to_hdf(snakemake.output[0], key='df', mode='w')
import subprocess
import pandas as pd

primer_table = pd.read_csv(snakemake.input.primer_t, index_col="Probe",
        na_filter=False).to_dict("index")

if not snakemake.params.bar_removed:
    r1_barcode = primer_table[snakemake.wildcards.sample + "_"
    + snakemake.wildcards.unit]["Barcode_forward"]
    r2_barcode = primer_table[snakemake.wildcards.sample + "_"
                 + snakemake.wildcards.unit]["Barcode_reverse"]
r1_primer = primer_table[snakemake.wildcards.sample + "_"
                         + snakemake.wildcards.unit]["specific_forward_primer"]
r2_primer = primer_table[snakemake.wildcards.sample + "_"
                         + snakemake.wildcards.unit]["specific_reverse_primer"]

logfile = open(str(snakemake.log), "w")
if snakemake.params.paired_end:
    if snakemake.params.prim_rm:
        subprocess.call(["mv", snakemake.input[0], snakemake.output[0]])
        subprocess.call(["mv", snakemake.input[1], snakemake.output[1]])
    elif not snakemake.params.bar_removed:
        subprocess.call(["cutadapt","-j 0", "-m", str(snakemake.params.minlen), "-M", str(snakemake.params.maxlen),
                         "-g", r1_primer + r1_barcode, "-G", r2_primer + r2_barcode, "-o",
                         snakemake.output[0], "-p", snakemake.output[1], snakemake.input[0], snakemake.input[1]], stdout=logfile)
    else:
        subprocess.call(["cutadapt", "-j 0", "-m", str(snakemake.params.minlen), "-M", str(snakemake.params.maxlen),
                         "-g", r1_primer, "-G", r2_primer, "-o", snakemake.output[0], "-p",
                         snakemake.output[1], snakemake.input[0], snakemake.input[1]], stdout=logfile)
else:
    if snakemake.params.prim_rm:
        subprocess.call(["mv", snakemake.input[0], snakemake.output[0]])
    elif not snakemake.params.bar_removed:
        subprocess.call(["cutadapt", "-j 0", "-m", str(snakemake.params.minlen), "-M", str(snakemake.params.maxlen),
                         "-g", r1_primer + r1_barcode, "-o", snakemake.output[0], snakemake.input[0]], stdout=logfile)
    else:
        subprocess.call(["cutadapt", "-j 0", "-m", str(snakemake.params.minlen), "-M", str(snakemake.params.maxlen),
                         "-g", r1_primer, "-o", snakemake.output[0], snakemake.input[0]], stdout=logfile)
logfile.close()
library(dada2)
library(ShortRead)

log <- file(toString(snakemake@log), open="wt")
sink(log, append = TRUE)
sink(log, type="message", append =TRUE)

paired.end <- snakemake@params[["paired_end"]]
min.overlap <- snakemake@params[["minoverlap"]]
split.samples <- snakemake@params[["splitsamples"]]=="split_sample"
sample.names <- sapply(strsplit(basename(snakemake@input[["fwd"]]), "_[12]_cut.fastq"), `[`, 1)
fnFs <- snakemake@input[["fwd"]]
names(fnFs) <- sample.names

set.seed(100)
# Filtering was done before with prinseq and cutadapt
# Learn forward error rates
errF <- learnErrors(fnFs, nbases=1e8, multithread=TRUE, randomize = TRUE, verbose = TRUE)

if(paired.end){
  fnRs <- snakemake@input[["rev"]]
  # Learn reverse error rates
  names(fnRs) <- sample.names
  errR <- learnErrors(fnRs, nbases=1e8, multithread=TRUE, randomize = TRUE, verbose = TRUE)
}

if(split.samples){
  samples <- unlist(unique(strsplit(sample.names, "_[AB]$")))
  sample.names2 <- lapply(samples, function(x) grep(paste("^",x,sep=""), sample.names, value=TRUE))
  names(sample.names2) <- samples
} else {
  sample.names2 <- sample.names
  names(sample.names2) <- sample.names
}

n <- 1
for(i in 1:length(sample.names2)) {
  result <- c()
  sam <- sample.names2[i]
  cat("Processing:", names(sam), "\n")
  # dereplication done on the fly since dada 1.12
  dadaF <- dada(fnFs[unlist(sam)], err=errF, multithread=TRUE, verbose = TRUE)
  print("Forward After DADA2 dereplication and denoising:")
  print(dadaF)
  if (paired.end == TRUE) {
    dadaR <- dada(fnRs[unlist(sam)], err=errR, multithread=TRUE, verbose = TRUE)
    print("Reverse After DADA2 dereplication and denoising:")
    print(dadaR)
    result <- mergePairs(dadaF, fnFs[unlist(sam)], dadaR, fnRs[unlist(sam)], verbose = TRUE, minOverlap=min.overlap)
    if(!is(result, "list")){result <- list(result)}
  } else {
    if(is(dadaF, "dada")){dadaF <- list(dadaF)}
    seqtab <- lapply(dadaF, function(x) t(makeSequenceTable(x)))
    result <- lapply(seqtab, function(x) data.frame(abundance = x[,1], sequence=rownames(x)))
  }
  print("Sequences left after assembly:")
  lapply(result, function(x) print(nrow(x)))
  for(j in 1:length(result)){
    if(nrow(result[[j]]) == 0)
    {
      print(paste0("No sequences left for sample ", names(sam)))
      cat(NULL, file=snakemake@output[[j]])
    } else {
      seq_names = paste0(seq(nrow(result[[j]])), ";size=", result[[j]]$abundance, ";")
      uniquesToFasta(result[[j]], fout = snakemake@output[[n]], ids = seq_names)
    }
    n <- n+1
  }
}

sink()
sink(type="message")
import pandas as pd

# Script to define the primer used by pandaseq. The defined
# primer depends on the existence of the barcode, if the reads are
# single or paired end and if pandaseq should use an offset of
# the primer length instead of the sequence. Using an offset can
# be useful if the sequence has many uncalled bases in the primer
# region, preventing a nucleotide primer from matching.

primertable = pd.read_csv(str(snakemake.input), index_col="Probe")

if snakemake.params.all_removed:
    pass
else:
    if snakemake.params.paired_end:
        if snakemake.params.offset:
            if snakemake.params.bar_removed:
                primertable["f_primer"] = (
                    primertable[["poly_N",
                    "specific_forward_primer"]].sum(axis=1).str.len())
                primertable["r_primer"] = (
                    primertable[["poly_N_rev",
                    "specific_reverse_primer"]].sum(axis=1).str.len())
            else:
                primertable["f_primer"] = (
                    primertable[["poly_N", "Barcode_forward",
                    "specific_forward_primer"]].sum(axis=1).str.len())
                primertable["r_primer"] = (
                    primertable[["poly_N_rev","Barcode_reverse",
                    "specific_reverse_primer"]].sum(axis=1).str.len())
        else:
            if snakemake.params.bar_removed:
                primertable["f_primer"] = (
                    primertable["specific_forward_primer"])
                primertable["r_primer"] = (
                    primertable["specific_reverse_primer"])
            else:
                primertable["f_primer"] = (
                    primertable[["Barcode_forward",
                    "specific_forward_primer"]].sum(axis=1))
                primertable["r_primer"] = (
                    primertable[["Barcode_reverse",
                    "specific_reverse_primer"]].sum(axis=1))
    else:
        if snakemake.params.offset:
            if snakemake.params.bar_removed:
                primertable["f_primer"] = (
                    primertable[["poly_N",
                    "specific_forward_primer"]].sum(axis=1).str.len())
            else:
                primertable["f_primer"] = (
                    primertable[["poly_N", "Barcode_forward",
                    "specific_forward_primer"]].sum(axis=1).str.len())
        else:
            if snakemake.params.bar_removed:
                primertable["f_primer"] = (
                    primertable["specific_forward_primer"])
            else:
                primertable["f_primer"] = (
                    primertable[["Barcode_forward",
                    "specific_forward_primer"]].sum(axis=1))

primertable.to_csv(str(snakemake.output))
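
The resulting f_primer / r_primer columns are consumed by the read-assembly rule. As a rough, hypothetical sketch (the file names and flag set are assumptions, not taken from the workflow), PandaSeq accepts either a primer sequence or a number of leading bases to trim via -p / -q, which is why the offset variant stores lengths instead of sequences:

import subprocess

f_primer = "18"   # offset variant: number of forward bases to strip (assumed value)
r_primer = "20"   # offset variant: number of reverse bases to strip (assumed value)
subprocess.run([
    "pandaseq",
    "-f", "sample_R1.fastq",      # hypothetical input names
    "-r", "sample_R2.fastq",
    "-p", f_primer,               # primer sequence or offset length
    "-q", r_primer,
    "-w", "sample_assembled.fasta",
])
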
import pandas as pd
import numpy as np
import dinopy
import os
import re
import yaml
import glob
import sys
import shutil
import subprocess
import pathlib
# This script moves already assembled fastq files to the correct folder for
# further processing, demultiplexes samples that were pooled in-lab (instead
# of at the sequencing company) and manually sorts reads by their primer
# sequence.

p_table = pd.read_csv(snakemake.params.primertable, index_col='Probe')
primertable = p_table.to_dict('index')
data_folder = str(snakemake.params.filename)
file_path_list = sorted(glob.glob(data_folder + "/*.fast*"))

# Regex substitution dict.
iupac_dict_regex = {'M':'[AC]', 'R':'[AG]', 'W':'[AT]', 'S':'[CG]', 'Y':'[CT]',
                    'K':'[GT]', 'V':'[ACG]', 'H':'[ACT]', 'D':'[AGT]',
                    'B':'[CGT]', 'X':'[ACGT]', 'N':'[ACGT]'}


# Substitution helper function.
def iupac_replace(sequence, iupac_dict):
    for code, char_class in iupac_dict.items():
        sequence = sequence.replace(code, char_class)
    return sequence


def define_direction_demulti(polyN, prim, barcode):
    def check_for_match_demulti(sequence, sample):
        poly_prim_bar = [primertable[sample][key] for key
                        in primertable[sample].keys() if key
                        in [polyN, prim, barcode]]
        prim_bar = re.compile(poly_prim_bar[1] + iupac_replace(poly_prim_bar[2],
                                iupac_dict_regex))
        # Allow the barcode/primer to be shifted by up to two positions
        # relative to the expected poly-N offset.
        for i in [0, 1, -1, 2, -2]:
            start = np.clip(len(primertable[sample][polyN]) + i, a_min=0,
                            a_max=None)
            end = np.clip(len(''.join(poly_prim_bar)) + i, a_min=0,
                          a_max=None)
            if prim_bar.match(sequence[start:end]):
                return True
        return False
    return check_for_match_demulti


# These are the variations of the check_for_match closure for the
# forward and reverse direction; the arguments are the names of the
# corresponding poly_N, primer and barcode columns of the primertable.
check_for_match_fwd_demulti = define_direction_demulti('poly_N', 'specific_forward_primer',
                                        'Barcode_forward')
check_for_match_rev_demulti = define_direction_demulti('poly_N_rev', 'specific_reverse_primer',
                                       'Barcode_reverse')


def define_direction_sort(prim):
    def check_for_match_sort(sequence, sample):
        prim_regex = re.compile(iupac_replace(primertable[sample][prim],
                                              iupac_dict_regex))
        if prim_regex.match(sequence[:len(primertable[sample][prim])]):
            return True
        else:
            return False
    return check_for_match_sort


check_for_match_sort_fwd = define_direction_sort('specific_forward_primer')
check_for_match_sort_rev = define_direction_sort('specific_reverse_primer')


# Create a dict of Dinopy writer instances and write the sequences
# according to their barcode and primer sequence in the corresponding
# files defined in the primertable.
def demultiplexer(file_path_list):
    samples = []
    output_filepaths = []
    for sample in primertable.keys():
        samples.append(sample + '_R1')
        samples.append(sample + '_R2')
        output_filepaths.append('demultiplexed/' + sample + '_R1.fastq.gz')
        output_filepaths.append('demultiplexed/' + sample + '_R2.fastq.gz')

    # Create a dict of writers.
    writers = {name: dinopy.FastqWriter(path) for name, path in
               zip(samples, output_filepaths)}

    # Open all writers.
    for writer in writers.values():
        writer.open()

    # Start writing.
    for file_path in file_path_list:
        sequence = dinopy.FastqReader(file_path)
        for read in sequence.reads(quality_values=True):
            for sample in primertable.keys():
                if check_for_match_fwd_demulti(read.sequence.decode(), sample):
                    writers[sample + '_R1'].write(read.sequence, read.name,
                            read.quality)
                elif check_for_match_rev_demulti(read.sequence.decode(), sample):
                    writers[sample + '_R2'].write(read.sequence, read.name,
                            read.quality)
                else:
                    pass

    # Close all writers.
    for writer in writers.values():
        writer.close()

def read_sorter(primertable):
    if not os.path.exists('demultiplexed/not_sorted'):
        os.mkdir('demultiplexed/not_sorted')
    samples = []
    output_filepaths = []
    for sample in primertable.keys():
        samples.append(sample + snakemake.params.name_ext[:-1] + '1')
        samples.append(sample + snakemake.params.name_ext[:-1] + '2')
        samples.append(sample + '_not_sorted')
        output_filepaths.append('demultiplexed/' + sample + '_R1.fastq.gz')
        output_filepaths.append('demultiplexed/' + sample + '_R2.fastq.gz')
        output_filepaths.append('demultiplexed/not_sorted/' + sample + '_not_sorted.fastq.gz')

    # Create a dict of writers.
    writers = {name: dinopy.FastqWriter(path) for name, path in
               zip(samples, output_filepaths)}

    # Open all writers.
    for writer in writers.values():
        writer.open()

    # Start writing.
    for sample in primertable.keys():
        fwd = dinopy.FastqReader(data_folder + '/' + sample + str(snakemake.params.name_ext)[:-1] + '1.fastq.gz')
        rev = dinopy.FastqReader(data_folder + '/' + sample + str(snakemake.params.name_ext)[:-1] + '2.fastq.gz')
        for read_f, read_r in zip(fwd.reads(quality_values=True), rev.reads(quality_values=True)):
            if check_for_match_sort_fwd(read_f.sequence.decode(),
                sample.split('/')[-1]) and check_for_match_sort_rev(read_r.sequence.decode(),
                    sample.split('/')[-1]):
                writers[sample + '_R1'].write(read_f.sequence, read_f.name,
                                read_f.quality)
                writers[sample + '_R2'].write(read_r.sequence, read_r.name,
                                read_r.quality)
            elif check_for_match_sort_rev(read_f.sequence.decode(),
                sample.split('/')[-1]) and check_for_match_sort_fwd(read_r.sequence.decode(),
                    sample.split('/')[-1]):
                writers[sample + '_R2'].write(read_f.sequence, read_f.name,
                                read_f.quality)
                writers[sample + '_R1'].write(read_r.sequence, read_r.name,
                                read_r.quality)
            else:
                writers[sample + '_not_sorted'].write(read_f.sequence, read_f.name,
                                read_f.quality)
                writers[sample + '_not_sorted'].write(read_r.sequence, read_r.name,
                                read_r.quality)

    # Close all writers.
    for writer in writers.values():
        writer.close()


def already_assembled(primertable, file_path_list):
    for f_ in file_path_list:
        rm_unzipped = False
        if '.gz' in f_:
            subprocess.run(['gunzip', '-k', f_])
            f_ = f_.split('.gz')[0]
            rm_unzipped = True
        for sample in primertable.keys():
            pathlib.Path('results/assembly/' + sample).mkdir(parents=True, exist_ok=True)
            if sample in f_:
                shutil.copy(f_, 'results/assembly/' + sample + '/' + sample + '_assembled.fastq')
        if rm_unzipped:
            os.remove(f_)


# Run the demultiplexing / read sorting / copying step, depending on the
# parameters of the rule.
if snakemake.params.demultiplexing:
    print('Demultiplexing pooled samples.')
    demultiplexer(file_path_list)
elif snakemake.params.read_sorting:
    print('Sorting reads by primer sequence.')
    read_sorter(primertable)
elif snakemake.params.assembled:
    print('Copying already assembled reads.')
    already_assembled(primertable, file_path_list)
else:
    # If the files do not need demultiplexing / sorting, just copy them to
    # the demultiplexed folder and leave the originals in the input folder.
    print('No demultiplexing or read sorting needed, copying files.')
    for file in file_path_list:
        print(file)
        shutil.copy(file, 'demultiplexed/')
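
As a standalone illustration of the IUPAC substitution used by the sorting and demultiplexing closures above (the primer and read below are example sequences):

import re

iupac_dict_regex = {'M': '[AC]', 'R': '[AG]', 'W': '[AT]', 'S': '[CG]',
                    'Y': '[CT]', 'K': '[GT]', 'V': '[ACG]', 'H': '[ACT]',
                    'D': '[AGT]', 'B': '[CGT]', 'X': '[ACGT]', 'N': '[ACGT]'}

def iupac_replace(sequence, iupac_dict):
    for code, char_class in iupac_dict.items():
        sequence = sequence.replace(code, char_class)
    return sequence

primer = "GTGYCAGCMGCCGCGGTAA"                        # degenerate primer (example)
pattern = re.compile(iupac_replace(primer, iupac_dict_regex))
print(bool(pattern.match("GTGTCAGCAGCCGCGGTAATAC")))  # True: Y -> T, M -> A
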
import dinopy
from glob import glob

fasta = snakemake.input[0]
clstr = snakemake.input[1]
fasta_not_clstr = snakemake.input[2]

def sequence_dict(fasta_file):
    return {entry.name:entry.sequence for entry
                in dinopy.FastaReader(str(fasta_file)).entries()}


# Returns a list of tuples containing the old header of the
# representative seq of each cluster and the new header as required
# by uchime (sorted by cluster size).
# This version uses the longest sequence as representative of a cluster,
# just as CDHIT does.
def get_longest_rep(clstr):
    clust_sizes = []
    ids = []
    with open(clstr) as f:
        current = None
        count = 0
        for line in f:
            if line[0] == ">":
                if current is not None and count > 0:
                    clust_sizes.append("{};size={};".format(
                        current[1:].strip().replace("Cluster ",
                            clstr.split("/")[-2] + "_"), count))
                current = line
                count = 0
            elif line[-2] == "*":
                ids.append(line[line.find(">")+1:line.find("...")])
                count += 1
            else:
                count += 1
    return list(zip(clust_sizes, ids))

# Helper functions to get the abundance of each sequence per
# sequence header, needed to use the most common sequence
# of a cluster as representative.
def not_clstr_dict(clstr_list, seq_dict_mc):
    not_clstr = dict()
    for id in clstr_list:
        s = seq_dict_mc[id.encode()]
        if s in not_clstr.keys():
            not_clstr[s]["names"].append(id)
            not_clstr[s]["count"] += 1
        else:
            not_clstr[s] = {"count" : 1}
            not_clstr[s]["names"] = list()
            not_clstr[s]["names"].append(id)
    return not_clstr

def count_clstr(clstr_list, seq_dict_mc):
    not_clstr = not_clstr_dict(clstr_list, seq_dict_mc)
    best_count = 0
    total = 0
    most_common = None
    for s in not_clstr.keys():
        if not_clstr[s]["count"] > best_count:
            most_common = s
            best_count = not_clstr[s]["count"]
        total += not_clstr[s]["count"]
    return [total, not_clstr[most_common]["names"][0]]

# Version that uses the most common sequence as representative,
# because of the larger overhead it is much slower than taking
# the longest sequence as representative, but could lead to
# a better representation of the genetic composition of the
# sample.
def get_most_common_rep(clstr, fasta_not_clstr):
    seq_dict_mc = sequence_dict(fasta_not_clstr)
    with open(clstr) as f:
        clstr_list = []
        cluster = -1
        clust_sizes = []
        ids = []
        for line in f:
            if line[0] == ">":
                if cluster != -1:
                    if snakemake.params.length_cutoff == 0:
                        clust_sizes.append("{};size={};".format(clstr.split("/")[-2] + "_" + str(cluster), str(len(clstr_list))))
                        ids.append(clstr_list[0])
                    else:
                        hd = count_clstr(clstr_list, seq_dict_mc)
                        clust_sizes.append("{};size={};".format(clstr.split("/")[-2] + "_" + str(cluster), str(hd[0])))
                        ids.append(hd[1])
                clstr_list = []
                cluster += 1
            else:
                clstr_list.append(line[line.find(">")+1:line.find("...")])
        # for last cluster
        if snakemake.params.length_cutoff == 0:
            clust_sizes.append("{};size={};".format(clstr.split("/")[-2] + "_" + str(cluster), str(len(clstr_list))))
            ids.append(clstr_list[0])
        else:
            hd = count_clstr(clstr_list, seq_dict_mc)
            clust_sizes.append("{};size={};".format(clstr.split("/")[-2] + "_" + str(cluster), str(hd[0])))
            ids.append(hd[1])
    return list(zip(clust_sizes, ids))

# Writes the representatives in a fasta file with the new headers, uses
# the old header as key to get the matching sequences from the dict.
def writer(c_size, seq_dict):
    with dinopy.FastaWriter(str(snakemake.output), force_overwrite=True,
            line_width=1000) as clust:
        clust.write_entries([(seq_dict[line[1].encode()],
            line[0].encode()) for line in c_size])

if str(snakemake.params.repr) == "most_common":
    seq_dict_mc = sequence_dict(fasta_not_clstr)
    c_size_mc = get_most_common_rep(clstr, fasta_not_clstr)
    writer(c_size_mc, seq_dict_mc)
else:
    seq_dict_l = sequence_dict(fasta)
    c_size_l = get_longest_rep(clstr)
    writer(c_size_l, seq_dict_l)
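
For orientation, a toy example of the CD-HIT ".clstr" format that get_longest_rep() parses (the cluster contents below are invented): a line starting with ">" opens a cluster, and the member flagged with "*" is its longest, representative sequence. The snippet mirrors the counting logic; the real script additionally prefixes the cluster id with the sample directory and appends ";size=<count>;".

toy_clstr = [
    ">Cluster 0",
    "0\t151nt, >read_12... *",
    "1\t150nt, >read_45... at +/99.33%",
    ">Cluster 1",
    "0\t149nt, >read_7... *",
]
reps, sizes = [], []
count, current = 0, None
for line in toy_clstr:
    if line.startswith(">"):
        if current is not None:
            sizes.append(count)
        current, count = line, 0
    else:
        count += 1
        if line.endswith("*"):
            reps.append(line[line.find(">") + 1:line.find("...")])
sizes.append(count)
print(list(zip(reps, sizes)))  # [('read_12', 2), ('read_7', 1)]
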
import pandas as pd
import tables
import numpy as np

df = pd.read_hdf(str(snakemake.input), 'df')

if snakemake.params.filter_method == "split_sample":
    samples =  set(map(lambda x: x.split("_")[0], list(df.columns)))
    for sample in samples:
        pair = [ sample+'_A', sample+'_B']
        # replace value by zero where XOR is True for A and B pair of samples,
        # since where() replace False positions, replace the invert of xor
        df[pair] = df[pair].where(np.invert(np.logical_xor(df[pair[0]]>=1, df[pair[1]]>=1)),0)
    # finally remove all empty rows
    selected_rows = df.sum(axis=1)>0
    df_filtered = df[selected_rows]
    df_filtered_out = df[np.invert(selected_rows)]
elif snakemake.params.filter_method == "not_split":
    df_filtered = df[(df > snakemake.params.cutoff).all(1)]
    df_filtered_out = df[(df <= snakemake.params.cutoff).all(1)]
else:
    raise ValueError("Valid filter methods are 'split_sample' and 'not_split'")

df_filtered.to_csv(snakemake.output[0])
df_filtered_out.to_csv(snakemake.output[1])
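
A worked toy example of the split-sample filter above: counts are zeroed wherever only one of the two sub-samples (_A/_B) contains a sequence, and rows that end up empty are dropped (sample names and counts below are invented).

import numpy as np
import pandas as pd

df = pd.DataFrame({"S1_A": [10, 5, 0], "S1_B": [8, 0, 2]},
                  index=["seq1", "seq2", "seq3"])
pair = ["S1_A", "S1_B"]
df[pair] = df[pair].where(np.invert(np.logical_xor(df[pair[0]] >= 1,
                                                   df[pair[1]] >= 1)), 0)
print(df[df.sum(axis=1) > 0])
# seq1 is kept (present in both sub-samples); seq2 and seq3 are zeroed and removed.
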
import pandas as pd
import collections as col
import logging
import numpy as np

blast = pd.read_csv(snakemake.input['blast_result'], sep="\t")

if snakemake.params['seq_rep'] == 'ASV':
    merged = pd.read_csv(snakemake.input['merged'], sep=",")
else:
    merged = pd.read_csv(snakemake.input['merged'], index_col=0, sep=",")


def new_index(table):
    new_index = table["seqid"].tolist()[0:]
    splitted = [i.lstrip(">").replace("size=", "").split(";")[0:2] for i in new_index]
    new_index = ["N{}_{}".format(i[0], i[1]) for i in splitted]
    df = pd.DataFrame(new_index)
    table["seqid"] = df[0]
    table = table.set_index('seqid')
    return table

blast = new_index(blast)
if snakemake.params['seq_rep'] == 'ASV':
    merged = new_index(merged)

result = merged.join(blast, how='outer')

logging.basicConfig(filename=str(snakemake.log), level=logging.DEBUG)

nas = pd.isna(result["taxonomy"])
nbh_counter = sum(nas)
bh_counter = sum(~nas)

no_blast_hit = list(result.loc[nas, :].index)

logging.info("{} sequences could be assigned taxonomic information using BLAST,\
            {} could not be assigned taxonomic information using BLAST".format(bh_counter, nbh_counter))
logging.info("No fitting BLAST hit found for seq_id:")
logging.info(no_blast_hit)

cols = set(result.columns.tolist())
cols_include = ['sequences', 'qlen', 'length', 'pident', 'mismatch', 'qstart', 'qend', 'sstart', 'send', 'gaps', 'evalue', 'taxonomy']
cols_all = cols_include + list(cols - set(cols_include))

result = result[cols_all]
result.to_csv(snakemake.output["complete"], index_label="seqid")
# filter out OTUs without taxonomic annotation
result = result.loc[~nas, :]
result.to_csv(snakemake.output["filtered"], index_label="seqid")
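
As a small illustration of the header rewriting done by new_index() above (identifiers are made up), an id such as ">12;size=345;" becomes the short index "N12_345":

seqids = [">12;size=345;", ">7;size=3;"]
splitted = [i.lstrip(">").replace("size=", "").split(";")[0:2] for i in seqids]
print(["N{}_{}".format(i[0], i[1]) for i in splitted])  # ['N12_345', 'N7_3']
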
import pandas as pd
import collections as col

filtered_dict = pd.read_csv(snakemake.input["final_table_path2"],
                index_col="seqid").to_dict(orient="index")

with open(snakemake.input["merged"], "r") as swarms:
    seq_names = []
    for row in swarms:
        otu_seq_list = [s for s in row.split()]
        seq_names.append(otu_seq_list)

swarm_dict = col.defaultdict()
for j in range(len(seq_names)):
    swarm_dict[j] = {}
    swarm_dict[j]["sequences"] = filtered_dict[">" + seq_names[j][0]]["sequences"]
    for sample in filter(lambda i: i not in ["sequences"], filtered_dict[">"
                    + seq_names[j][0]].keys()):
        swarm_dict[j][sample] = sum([filtered_dict[">"
            + key][sample] for key in seq_names[j]])
    swarm_dict[j]["seqid"] = "N{}_{}".format(seq_names[j][0].split(";")[0],
        sum([value for key, value in swarm_dict[j].items() if not key in {"seqid", "sequences"}]))

df = pd.DataFrame.from_dict(swarm_dict, orient="index").set_index("seqid")
df.to_csv(snakemake.output[0])
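
For clarity, a toy version of the Swarm merging above: every line of the Swarm output lists the members of one OTU (seed first), and the script sums the per-sample counts of all members while keeping the seed's sequence (all names and counts below are invented).

swarm_line = "1;size=120; 7;size=3; 15;size=1;"
filtered_dict = {
    ">1;size=120;": {"sequences": "ACGT...", "S1_A": 100, "S1_B": 20},
    ">7;size=3;":   {"sequences": "ACGA...", "S1_A": 2,   "S1_B": 1},
    ">15;size=1;":  {"sequences": "ACGC...", "S1_A": 1,   "S1_B": 0},
}
members = swarm_line.split()
otu = {"sequences": filtered_dict[">" + members[0]]["sequences"]}
for sample in ("S1_A", "S1_B"):
    otu[sample] = sum(filtered_dict[">" + m][sample] for m in members)
otu["seqid"] = "N{}_{}".format(members[0].split(";")[0], otu["S1_A"] + otu["S1_B"])
print(otu["seqid"])  # N1_124
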
import pandas as pd
import logging
import math

columns = ["seqid","qlen","length","pident","mismatch","qstart","qend",
            "sstart","send","gaps","evalue","taxonomy", "dbseqid"]

logging.basicConfig(filename=str(snakemake.log),
            level=logging.DEBUG)

blast_df = pd.read_csv(snakemake.input["blast_result"], sep="\t",names=columns, index_col="seqid")

taxids = list(set(blast_df["taxonomy"]))

tax_df = pd.read_hdf(snakemake.input["lineage"], 'tax_df')
tax_df = tax_df.loc[taxids]

def split2(char, string):
    if(isinstance(string, str)):
        if char in string:
            return string.split(char)
        else:
            return [string]
    else:
        return ["NaN"]


def create_filter_set():
    res_set = set()
    for f in snakemake.params.drop_tax_classes.split(","):
        f = f.strip()
        if not f:
            pass
        elif f.isdigit():
            res_set.add(int(f))
        else:
            # f = f.replace("*", ".*")
            tmp = tax_df[tax_df.tax_lineage.str.contains(f, na=False)]
            res_set.update(tmp.index)
    return res_set

querynames = list(set(blast_df.index.values))

header = True
f_set = create_filter_set()
for i in querynames:
    blast_res = blast_df.loc[i,:] # blast results
    if blast_res.ndim == 1: # case only 1 blast hit
        blast_res = blast_res.to_frame().T
    blast_res = blast_res[~blast_res['taxonomy'].isin(f_set)] # remove blast results with blacklisted taxonomy
    tax_res = tax_df.loc[blast_res['taxonomy'],:] # taxonomy for result
    tax_res = tax_res.set_index(blast_res.index)
    blast_res = pd.concat([blast_res, tax_res], axis=1)
    if blast_res.empty:
        continue
    blast_res["tax_level"] = blast_res.apply(lambda row: math.log(min(len(split2("; ", row['tax_lineage'])),6)), axis=1) # max length 6, log
    blast_res["len_val"] = blast_res.apply(lambda row: math.log(row['qlen']+1), axis=1)
    blast_res["Mscore"] = blast_res.apply(lambda row: row['pident']*row['tax_level']*row['len_val'], axis=1)
    blast_res = blast_res.sort_values(by=['Mscore'], ascending=False) # sort by score
    mode = 'w' if header else 'a'
    blast_res.to_csv(snakemake.output["all_tax"], mode=mode, header=header, sep="\t", index_label="seqid")
    best_tax = blast_res[:1][["qlen","length","pident","mismatch","qstart","qend", "sstart","send","gaps","evalue","tax_lineage"]]
    best_tax = best_tax.rename(index=str, columns={"tax_lineage": "taxonomy"})
    best_tax.to_csv(snakemake.output["tax_lineage"], mode=mode, header=header, sep="\t", index_label="seqid")
    header = False
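
A worked example of the ranking score used above (the numbers are illustrative): Mscore = pident * log(min(lineage_depth, 6)) * log(qlen + 1), so hits with a more resolved lineage and a longer query are preferred at comparable identity.

import math

pident = 98.5        # percent identity of the hit
lineage_depth = 7    # ranks in the "; "-separated tax_lineage (capped at 6)
qlen = 420           # query length
mscore = pident * math.log(min(lineage_depth, 6)) * math.log(qlen + 1)
print(round(mscore, 1))  # ~1066.5
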
from subprocess import call

# For paired-end data the declared output name still contains the read
# identifier, so the string slicing removes the last eight characters;
# for single-end data only the file extension (the last six characters)
# is removed.

if(len(snakemake.input)) == 2:
    output_edit = str(snakemake.output[0])[:-8]
    output_bad = str(snakemake.output[0])[:-8] + "_low_qual"
    call(["prinseq-lite.pl", "-verbose",
          "-fastq", snakemake.input[0], "-fastq2", snakemake.input[1], "-ns_max_n", "0",
          "-min_qual_mean", str(snakemake.params), "-out_good", output_edit,
          "-out_bad", output_bad, "-log", str(snakemake.log)])
else:
    output_edit = str(snakemake.output[0])[:-6]
    output_bad = str(snakemake.output[0])[:-6] + "_low_qual"
    call(["prinseq-lite.pl", "-verbose",
          "-fastq", snakemake.input[0], "-ns_max_n", "0", "-min_qual_mean", str(snakemake.params),
          "-out_good", output_edit, "-out_bad", output_bad, "-log", str(snakemake.log)])
import pandas as pd
import logging
import math

columns = ["seqid","qlen","length","pident","mismatch","qstart","qend",
            "sstart","send","gaps","evalue","taxonomy", "dbseqid"]

logging.basicConfig(filename=str(snakemake.log),
            level=logging.DEBUG)

blast_df = pd.read_csv(snakemake.input["blast_result"], sep="\t",names=columns, index_col="seqid")
blast_df = blast_df.drop(['taxonomy'], axis=1)

taxids = list(set(blast_df["dbseqid"]))

tax_df = pd.read_hdf(snakemake.input["lineage"], 'df')
tax_df = tax_df.loc[taxids, :]


def create_filter_set():
    res_set = set()
    for f in snakemake.params.drop_tax_classes.split(","):
        f = f.strip()
        if not f:
            continue
        elif any(tax_df.index == f):
            res_set.add(f)
        else:
            tmp = tax_df[tax_df.taxonomy.str.contains(f, na=False)]
            res_set.update(tmp.index)
    return res_set


querynames=list(set(blast_df.index.values))

header = True
f_set = create_filter_set()
for i in querynames:
    blast_res = blast_df.loc[i,:] # blast results
    if blast_res.ndim == 1: # case only 1 blast hit
        blast_res = blast_res.to_frame().T
    blast_res = blast_res[~blast_res['dbseqid'].isin(f_set)] # remove blast results with blacklisted taxonomy
    if blast_res.empty:
        continue
    tax_res = tax_df.loc[blast_res['dbseqid'],:] # taxonomy for result
    tax_res = tax_res.set_index(blast_res.index)
    blast_res = pd.concat([blast_res, tax_res], axis=1)
    mode = 'w' if header else 'a'
    best_tax = blast_res[["qlen","length","pident","mismatch","qstart","qend", "sstart","send","gaps","evalue","taxonomy"]]
    best_tax.to_csv(snakemake.output[0], mode=mode, header=header, sep="\t", index=True, index_label="seqid")
    header = False
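
A toy illustration of create_filter_set() above (the ids and lineages are invented): each comma-separated entry of drop_tax_classes is treated either as an exact database id or as a substring that is matched against the taxonomy strings.

import pandas as pd

tax_df = pd.DataFrame(
    {"taxonomy": ["Bacteria;Proteobacteria", "Eukaryota;Chlorophyta",
                  "Bacteria;Cyanobacteria;Chloroplast"]},
    index=["AB001", "AB002", "AB003"])
drop_tax_classes = "AB002, Chloroplast"
res_set = set()
for f in (x.strip() for x in drop_tax_classes.split(",")):
    if not f:
        continue
    if any(tax_df.index == f):
        res_set.add(f)
    else:
        res_set.update(tax_df[tax_df.taxonomy.str.contains(f, na=False)].index)
print(sorted(res_set))  # ['AB002', 'AB003']
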
import dinopy
import pandas as pd
import tables
import numpy as np

uniq_seqs = []
for i in range(len(snakemake.input)):
    seqs = dinopy.FastaReader(snakemake.input[i])
    uniq_seqs = uniq_seqs + list(set([entry.sequence.decode() for entry in seqs.entries()]))
uniq_seqs = set(uniq_seqs)

# create empty matrix and fill, all other solutions cost too much memory
sample_names = [i.split("/")[-1].split(".")[0] for i in snakemake.input]
df = pd.DataFrame(0, index=uniq_seqs, columns=sample_names, dtype=np.uint16)

# fill matrix
for i in range(len(snakemake.input)):
    sample_name = sample_names[i]
    seqs = dinopy.FastaReader(snakemake.input[i])
    for entry in seqs.entries():
        seq = entry.sequence.decode()
        value = np.uint16(entry.name.decode().split("size=")[1].split(";")[0])
        df.at[seq, sample_name] = value

# save to file
df.index.name = "sequences"
df.to_hdf(snakemake.output[1], key='df', mode='w')
df.to_csv(snakemake.output[0])
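
The HDF5 table written above (key 'df') is the sequence-by-sample abundance matrix that the filtering script further up this page reads back with pd.read_hdf. A minimal, illustrative read-back (the file name is a placeholder):

import pandas as pd

df = pd.read_hdf("unfiltered_table.h5", "df")  # placeholder path
print(df.shape)         # (number of unique sequences, number of samples)
print(df.sum(axis=0))   # total counts per sample
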

Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/MW55/Natrix
Name: natrix
Version: 1.0
Downloaded: 0
Copyright: Public Domain
License: MIT License