Pipeline for de novo clustering of long transcriptomic reads

We have a new bioinformatic resource that largely replaces the functionality of this project! See our new repository here: https://github.com/epi2me-labs/wf-isoforms

This repository is now unsupported and we do not recommend its use. Please contact Oxford Nanopore ([email protected]) for help with your application if it is not possible to upgrade to our new resources, or if we are missing key features.

The first natural step in the de novo analysis of long transcriptomic data in the absence of a reference genome is the clustering of the reads into groups corresponding to gene families. This pipeline performs that task using the isONclust2 tool, which is based on the approach pioneered by isONclust using minimizers and occasional pairwise alignment.
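
As a rough illustration of the minimizer idea (a toy sketch, not isONclust2's implementation; the k and w values here are arbitrary): reads from the same gene share many minimizers, so candidate cluster members can be found cheaply and only occasionally need to be confirmed by alignment.

def minimizers(seq, k=13, w=20):
    # A (k, w) minimizer set: the lexicographically smallest k-mer from
    # every window of w consecutive k-mers.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}

# Reads sharing enough minimizers are candidates for the same cluster.
read_a = "ACGTTAGCGGATCCAGTTACGGCATCGATCGGAATTCCGGTACGATCG"
read_b = "ACGTTAGCGGATCCAGTTACGGCATCGATCGGAATTCCGGTACGTTCG"
print("shared minimizers:", len(minimizers(read_a) & minimizers(read_b)))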

Since isONclust2 is implemented in C++ using efficient data structures, it can distribute computing across multiple cores and machines, so it can cope with large transcriptomic datasets generated using PromethION P24 and P48 flow cells.

The pipeline optionally concatenates the FASTQ files produced by MinKNOW/guppy, and can optionally trim and orient cDNA reads using pychopper.
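
The pychopper step wraps the cdna_classifier.py script (see the Snakefile excerpts under Code Snippets below); run standalone, an equivalent invocation looks roughly like this, with -t setting the thread count:

cdna_classifier.py -t 8 input_reads.fq full_length_reads.fq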

The main output of the pipeline is the assignment of read identifiers to gene clusters, along with the clustered reads grouped into one FASTQ file per gene cluster. This output is suitable for downstream, gene-level analysis.

Getting Started

Input

  • The input is a file with fastq records, or a directory containing fastq files, and is specified in config.yml .

Output

The main output files generated by the pipeline are under output_directory/final_clusters:

  • batch_info.tsv - information on the final output batch

  • cluster_cons.fq - a fastq file with the cluster representatives (or consensuses)

  • cluster_fastq/ - a directory of fastq files (one per gene cluster)

  • clusters_info.tsv - a TSV file with the cluster sizes

  • clusters.tsv - a TSV file with read identifiers assigned to clusters (see the loading sketch after this list)
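
As a minimal sketch of downstream, gene-level use, clusters.tsv can be loaded into a per-cluster list of reads. The two-column layout assumed here (cluster ID, then read identifier) should be verified against a few lines of your own output:

from collections import defaultdict

# Load read-to-cluster assignments; assumes tab-separated lines of the
# form "<cluster_id>\t<read_id>".
clusters = defaultdict(list)
with open("output_directory/final_clusters/clusters.tsv") as tsv:
    for line in tsv:
        cluster_id, read_id = line.rstrip("\n").split("\t")[:2]
        clusters[cluster_id].append(read_id)

print("number of clusters:", len(clusters))
print("largest cluster:", max(map(len, clusters.values())), "reads")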

Dependencies

  • miniconda

  • The rest of the dependencies are installed via conda.

  • CPU with AVX/SSE extensions; the workflow has been validated using the GridION device.

Installation

Clone the pipeline and the pipeline toolset by issuing:

git clone --recursive https://github.com/nanoporetech/pipeline-nanopore-denovo-isoforms.git

Install the dependencies using conda into a new environment:

conda env create -f env.yml

Activate the conda environment:

conda activate denovo-isoforms

Usage

Edit config.yml to set the input fastq and parameters. The sketch below uses key names that appear in the Snakefile excerpts later on this page; the input path key and all values are illustrative, so consult the config.yml shipped with the repository for the exact schema:
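
in_fastq: /path/to/reads.fq   # file or directory of fastq files (key name illustrative)
concatenate: true             # concatenate fastq files found under a directory
run_pychopper: true           # trim and orient cDNA reads with pychopper
pychopper_opts: ""            # extra options passed through to cdna_classifier.py
cores: 8                      # number of CPU cores to use
batch_size: -1                # negative values: derived from input size and core count
cls_mode: "sahlin"            # clustering mode passed to isONclust2 -x

Then, on a local machine, issue: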

snakemake -j <num_cores> all

For analysing larger datasets (e.g. a PromethION flow cell) it is advisable to run the pipeline on an SGE cluster through DRMAA:

snakemake --rerun-incomplete -j 1000 --latency-wait 600 --drmaa-log-dir sge_logs --drmaa ' -P project_name -V -cwd -l h_vmem=200G,mem_free=155G -pe mt 5' all

Results

The evaluation metrics reported (also described in the isONclust paper) are:

Clustering is not a binary classification problem (i.e. one where a single read is either correctly or incorrectly "labelled" by the algorithm given a ground truth). Instead, each read must be evaluated in relation to the reads in the same and other clusters (e.g. which pairs of reads are correctly assigned to the same cluster, and which are erroneously assigned to different clusters?). Therefore, common measures such as precision, recall, and F-score cannot be used directly. Homogeneity, completeness, and V-measure are analogous to the precision, recall, and F-score measures for binary classification problems, but are adapted for clustering problems.

Intuitively, homogeneity (cf. precision) penalizes over-clustering, i.e. wrongly clustering together reads, while completeness (cf. recall) penalizes under-clustering, i.e. mistakenly keeping reads in different clusters. The V-measure is then defined as the harmonic mean of homogeneity and completeness, analogous to the F-measure. We also include the commonly used Adjusted Rand Index (ARI). Intuitively, ARI measures the percentage of read pairs correctly clustered, normalized so that a perfect clustering achieves an ARI of 1 and a random cluster assignment achieves an ARI of 0. Briefly, both of these clustering quality metrics are derived from computing pairwise correct and incorrect groupings of reads, instead of individually classifying each read as correct or incorrect (as in classification problems).
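
For intuition, these metrics can be reproduced on a toy example with scikit-learn (used here purely for illustration; the labels below are invented):

from sklearn.metrics import (adjusted_rand_score, completeness_score,
                             homogeneity_score, v_measure_score)

# Ground-truth gene of origin and predicted cluster ID for eight toy reads.
true_genes    = [0, 0, 0, 1, 1, 2, 2, 2]
pred_clusters = [0, 0, 1, 1, 1, 2, 2, 2]

print("homogeneity: ", homogeneity_score(true_genes, pred_clusters))   # penalizes over-clustering
print("completeness:", completeness_score(true_genes, pred_clusters))  # penalizes under-clustering
print("V-measure:   ", v_measure_score(true_genes, pred_clusters))     # harmonic mean of the two
print("ARI:         ", adjusted_rand_score(true_genes, pred_clusters)) # chance-corrected pairwise agreement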

Performance on PCS109 SIRV data

The performance on ~19k SIRV E0 reads generated using the PCS109 protocol can be assessed by running the evaluation script:

./run_evaluation.sh

The main results are:

[Figure: clustering benchmark on SIRV E0 data]

Performance on PCS109 Drosophila melanogaster data

The performance on a D. melanogaster dataset generated using the SQK-PCS109 protocol can be assessed by running the evaluation script:

./run_evaluation_dmel.sh

The main results are:

[Figure: clustering benchmark on D. melanogaster data]

Acknowledgements

This software was built in collaboration with Kristoffer Sahlin and Paul Medvedev.

Licence and Copyright

(c) 2020 Oxford Nanopore Technologies Ltd.

This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at http://mozilla.org/MPL/2.0/.

FAQs and tips

References and Supporting Information

See the post announcing transcriptomics tools in the Nanopore Community here.

Research Release

Research releases are provided as technology demonstrators to provide early access to features or to stimulate Community development of tools. Support for this software will be minimal and is only provided directly by the developers. Feature requests, improvements, and discussions are welcome and can be implemented by forking and pull requests. However, much as we would like to rectify every issue and piece of feedback users may have, the developers may have limited resources for support of this software. Research releases may be unstable and subject to rapid iteration by Oxford Nanopore Technologies.

Code Snippets

The snippets below are excerpts from the pipeline's Snakefile. First, the tails of the two rule templates: init_template clusters a single batch (leaf jobs), while template merges a left (-l) and a right (-r) batch:

shell: "isONclust2 cluster -x %s -v -Q -l %s -o %s %s; sync"

"""

template = """
shell: "isONclust2 cluster -x %s -v -Q -l %s -r %s -o %s %s; sync"

"""

link_template symlinks the final (root) clustering result into place:

link_template="""
    shell: "ln -s `realpath {input}` {output}"

    """

    # Emit the dynamically generated Snakemake rules: one clustering job per
    # node of the batch merge tree (levels maps tree depth -> nodes).
    fh = open(snk, "w")
    for nr, l in levels.items():
        for n in l:
            purge = "-z" if n.RightSide else ""
            if nr == 0 or n.Left is None or n.Right is None:
                # Leaf job: cluster a single input batch.
                jr = init_template % (n.Id, n.Id, n.Id, config["cls_mode"], "{input.left}", "{output}", purge)
                fh.write(jr)
            else:
                # Internal job: merge the clustering results of the two children.
                jr = template % (n.Id, n.Left.Id, n.Right.Id, n.Id, config["cls_mode"], "{input.left}", "{input.right}", "{output}", purge)
                fh.write(jr)
    global ROOT
    fh.write(link_template % ROOT)
    fh.write("\nROOT = %d" % ROOT)
    fh.flush()
    fh.close()

def count_fastq_bases(fname, size=128000000):
    # Approximate the number of bases in a fastq file by counting the
    # nucleotide characters (A/T/G/C/U) in fixed-size chunks.
    fh = open(fname, "r")
    count = 0
    while True:
        b = fh.read(size)
        if not b:
            break
        count += b.count("A")
        count += b.count("T")
        count += b.count("G")
        count += b.count("C")
        count += b.count("U")
    fh.close()
    return count

def preprocess_reads(fq):
    # Optionally concatenate the input fastq files and run pychopper,
    # staging everything under processed_reads/.
    pc_opts = config["pychopper_opts"]
    concat = config["concatenate"]
    thr = config["cores"]

    out_fq = "processed_reads/full_length_reads.fq"
    if os.path.isdir("processed_reads"):
        # Preprocessing has already been done in a previous run.
        return out_fq

    shell("mkdir -p processed_reads")
    if concat:
        print("Concatenating reads under directory: " + fq)
        shell("find %s  -regextype posix-extended -regex '.*\.(fastq|fq)$' -exec cat {{}} \\; > processed_reads/input_reads.fq" % fq)
    else:
        shell("ln -s `realpath %s` processed_reads/input_reads.fq" % fq)

    if config["run_pychopper"]:
        print("Running pychopper on fastq file: processed_reads/input_reads.fq")
        shell("(cd processed_reads; cdna_classifier.py -t %d %s input_reads.fq full_length_reads.fq)" % (thr, pc_opts))
    else:
        shell("ln -s `realpath processed_reads/input_reads.fq` processed_reads/full_length_reads.fq")

    return out_fq


ROOT = None
DYNAMIC_RULES="job_rules.snk"
if ((not os.path.isfile(os.path.join(WORKDIR,"sorted","sorted_reads.fastq"))) or (not os.path.isfile(os.path.join(SNAKEDIR, DYNAMIC_RULES)))):
    print("Preprocessing reads in fastq file:", in_fastq)
    proc_fastq = preprocess_reads(in_fastq)
    print("Counting bases in input fastq:", proc_fastq)
    nr_bases = count_fastq_bases(proc_fastq)
    print("Bases in input: {} megabases".format(int(nr_bases/10**6)))
    # Derive the batch size from the input size and the core count when it
    # is not set explicitly in config.yml.
    if config['batch_size'] < 0:
        config['batch_size'] = int(nr_bases/1000/config["cores"])

    print("Batch size is: {}".format(config['batch_size']))

    init_cls_options = """ --batch-size {} --kmer-size {} --window-size {} --min-shared {} --min-qual {}\
The final clusters are extracted from the root of the merge tree by isONclust2 dump:

shell:
    """ isONclust2 dump -v -i sorted/sorted_reads_idx.cer -o final_clusters {input}; sync """
run:
    print("--------------------------------------------")
    [generate_help(sfile) for sfile in input]
    print("--------------------------------------------")
shell: "rm -fr {input.wdir}"
run:
    print("Pipeline name: ", params.name)
    print("Pipeline working directory: ", params.wdir)
    print("Pipeline repository: ", params.repo)