Source code for the characterization of new genes paper published in NAR.
Important note
The code in this package is intended specifically to run the analysis as it appears in Boivin V. and Reulet G. et al., 2020. Reducing the structure bias of RNA-Seq reveals a large number of non-annotated non-coding RNA. Nucleic Acids Research. Available here.
RNA-Seq pipeline
Author : Gaspard Reulet
Email : [email protected]
Software to install
Conda (Miniconda3) needs to be installed (https://docs.conda.io/en/latest/miniconda.html)
For Linux users :
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Answer "yes" to "Do you wish the installer to initialize Miniconda3?"
To create the Snakemake environment used to launch Snakemake, run the following commands. The conda create command can appear to hang at "Solving environment"; it is probably not stuck, so be patient.
exec bash
conda config --set auto_activate_base False
conda create --name smake -c bioconda -c conda-forge snakemake=5.7.0
Before running Snakemake, you have to activate the environment:
conda activate smake
Usage
Step 1: Install workflow
If you simply want to use this workflow, download and extract the latest release.
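For example (a minimal sketch; the archive URL and directory name below are placeholders, not the actual release):
wget -O workflow.tar.gz <release-tarball-url>   # placeholder: use the URL of the latest release archive
tar -xzf workflow.tar.gz
cd <extracted-workflow-directory>               # placeholder: name of the extracted directory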
If you use this workflow in a paper, don't forget to give credit to the author by citing the URL of this repository and the publication.
Step 2: Configure workflow
This workflow was developed for a Slurm cluster system. The system used for developing, testing and running the analysis is described here.
If you also use a Slurm cluster system, configure the workflow according to your system by editing the file cluster.json.
If you use a cluster system that operates with another scheduler, configure the workflow according to your system by editing both cluster.json and slurmSubmit.py. See the Snakemake documentation for further details.
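As an illustration only, a typical Snakemake cluster-config file maps rule names to scheduler resources, with a __default__ entry used as fallback. This is a minimal sketch; the resource keys and values are placeholders and may differ from the cluster.json shipped with this workflow:
{
    "__default__": {
        "time": "1:00:00",
        "cpus-per-task": 1,
        "mem": "4G"
    },
    "blockbuster": {
        "time": "12:00:00",
        "cpus-per-task": 1,
        "mem": "64G"
    }
}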
If you don't use a cluster system, you don't need to edit any file. However, a cluster is highly recommended, since certain steps of the workflow are CPU- and RAM-intensive (e.g. rule blockbuster).
Step 3: Execute workflow
After activating the environment, you can test your configuration by performing a dry run with:
snakemake --use-conda --dry-run --printshellcmds
# OR
snakemake --use-conda -np
You can visualize the steps of the workflow with:
snakemake --dag | dot -Tpdf > dag.pdf
# OR
snakemake --dag | dot -Tsvg > dag.svg
# OR
snakemake --dag | dot -Tpng > dag.png
Execute the workflow on a cluster with:
snakemake -j 99 --use-conda --immediate-submit --notemp --cluster-config cluster.json --cluster 'python3 slurmSubmit.py {dependencies}'
If the nodes of the cluster don't have access to the Internet, you will need to download all the required files first. You will then be able to execute the workflow on the cluster:
snakemake --use-conda --until all_internet &&
snakemake -j 99 --use-conda --immediate-submit --notemp --cluster-config cluster.json --cluster 'python3 slurmSubmit.py {dependencies}'
You can execute the workflow locally using N cores with:
snakemake --use-conda --cores N
Step 4: Investigate results
The clusters found by blockbuster should be located in the file data/dataset/clusters/clusters.sorted.bed.
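For a quick look at the result, for example:
head data/dataset/clusters/clusters.sorted.bed        # first clusters (BED-like lines)
wc -l data/dataset/clusters/clusters.sorted.bed       # number of clusters found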
Code Snippets
import pandas as pd
import sys


def Dupplicate_count_column(dataf, ID_column, dup_column):
    # Drop second-mate reads (names containing "/2"), then count how many reads
    # share the same coordinates (ID_column) and keep one line per position.
    dataf = dataf[dataf.name.str.contains("/2") == False]
    dupplicate_serie = dataf.groupby(dataf[ID_column], as_index=False).size()
    dupplicate_quants = list(dupplicate_serie.values)
    unique_keys = list(dupplicate_serie.keys())
    dataf_dup = pd.DataFrame(
        data={
            ID_column: unique_keys,
            dup_column: dupplicate_quants
        }
    )
    dataf = pd.merge(dataf, dataf_dup, on=ID_column, how='left')
    dataf = dataf.drop_duplicates(subset=['chr', 'strand', 'start', 'end'])
    return dataf


def main(bedpath, output):
    # Read the BED file produced by bedtools bamtobed.
    bed_df = pd.read_csv(
        filepath_or_buffer=bedpath,
        index_col=False,
        sep='\t',
        header=None,
        names=['chr', 'start', 'end', 'name', 'score', 'strand']
    )
    ID_column = 'ID'
    dup_column = 'quantity'
    # Build a position ID of the form chr_start_end_strand for each read.
    for col in ['chr', 'start', 'end', 'strand']:
        bed_df[col] = bed_df[col].map(str)
    bed_df[ID_column] = bed_df[['chr', 'start', 'end', 'strand']].apply(
        lambda x: '_'.join(x), axis=1
    )
    print(bed_df)
    bed_df = Dupplicate_count_column(bed_df, ID_column, dup_column)
    bed_df = bed_df.sort_values(['chr', 'strand', 'start', 'end'], axis=0)
    # Write a BED file whose score column holds the read count, as expected by blockbuster.
    cols = ['chr', 'start', 'end', 'name', dup_column, 'strand']
    bed_df = bed_df[cols]
    bed_df.to_csv(path_or_buf=output, index=False, sep='\t', header=None)


main(snakemake.input.bed, snakemake.output.bed)
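To illustrate what this conversion does, here is a toy, self-contained sketch of the same collapsing idea (not the workflow script itself): reads at identical coordinates are merged into one line whose score column holds the read count, and second-mate reads whose name contains "/2" are discarded.
import pandas as pd

# Toy BED-like input: two first-mate reads at the same position plus one "/2" mate.
toy = pd.DataFrame(
    [["chr1", 100, 150, "readA/1", 60, "+"],
     ["chr1", 100, 150, "readB/1", 60, "+"],
     ["chr1", 100, 150, "readB/2", 60, "-"],
     ["chr2", 200, 250, "readC/1", 60, "-"]],
    columns=["chr", "start", "end", "name", "score", "strand"],
)
kept = toy[~toy["name"].str.contains("/2")]                 # drop second mates
counts = (kept.groupby(["chr", "start", "end", "strand"])   # count reads per position
              .size().reset_index(name="quantity"))
collapsed = (kept.merge(counts, on=["chr", "start", "end", "strand"])
                 .drop_duplicates(["chr", "strand", "start", "end"]))
print(collapsed[["chr", "start", "end", "name", "quantity", "strand"]])
# -> chr1 100 150 readA/1 2 +   and   chr2 200 250 readC/1 1 -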
shell:
    "bedtools bamtobed -i {input.bam} > {output.bed}"

script:
    "../bed_to_blockbuster.py"

shell:
    "blockbuster.x -format 1"
    " -minBlockHeight 100"
    " -print 1"
    " -tagFilter 50"
    " {input.bed}"
    " > {output.block} &&"
    """
    grep -E '>' {output.block} | \
    awk '{{ print "chr"$2 "\t" $3 "\t" $4 "\t" $1 "\t" $6 "\t" $5}}' | \
    sort -k1,1 -k2,2n \
    > {output.clusters}
    """

shell:
    "wget -O {output.bam} {params.link}"

shell:
    "samtools flagstat {input} > {output}"

shell:
    "samtools sort -n {input} > {output}"