Source code for the characterization of new genes paper published in NAR.

public public 7mo ago Version: v0.0 0 bookmarks

Important note

The code in this package is intended to specifically run the analysis as it appears in Boivin V. and Reulet G. et al. , 2020. Reducing the structure bias of RNA-Seq reveals a large number of non-annotated non-coding RNA. Nucleic Acids Research . Available here

RNA-Seq pipeline

Author : Gaspard Reulet

Email : [email protected]

Software to install

Conda (Miniconda3) needs to be installed (https://docs.conda.io/en/latest/miniconda.html)

For Linux users :

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Answer yes to Do you wish the installer to initialize Miniconda3?

To create the Snakemake environment used to launch Snakemake, run the following. The conda create command can appear to be stuck on Solving environment. The command is probably not stuck, be patient.

To create the Snakemake environment used to launch Snakemake

exec bash
conda config --set auto_activate_base False
conda create --name smake -c bioconda -c conda-forge snakemake=5.7.0

Before running Snakemake, you have to initialize the environment

conda activate smake

Usage

Step 1: Install workflow

If you simply want to use this workflow, download and extract the latest release .

If you use this workflow in a paper, don't forget to give credits to the author by citing the URL of this repository and the publication. Link

Step 2: Configure workflow

This workflow was developped for a Slurm cluster system. The system used for developping, testing and running the analysis is described here .

If you also use a Slurm cluster system, configure the workflow according to your system by editing the file cluster.json .

If you use a cluster system that operates with another schedule manager, configure the workflow according to your system via editing the files cluster.json and slurmSubmit.py See the Snakemake documentation for further details.

If you don't use a cluster system, you don't need to edit any file. Although, a cluster is highly recommended, certain steps of the worklfow are CPU and RAM intensive (eg: rule blockbuster )

Step 3: Execute workflow

After initializing the environment, you can test your configuration by performing a dry run with:

snakemake --use-conda --dry-run --printshellcmds
# OR
snakemake --use-conda -np

You can visualize the steps of the workflow with:

snakemake --dag | dot -Tpdf > dag.pdf
# OR
snakemake --dag | dot -Tsvg > dag.svg
# OR
snakemake --dag | dot -Tpng > dag.png

Execute the workflow in a cluster with:

snakemake -j 99 --use-conda --immediate-submit --notemp --cluster-config cluster.json --cluster 'python3 slurmSubmit.py {dependencies}'

If the nodes of the cluster don't have access to Internet, you will need to download all the required files. Then, you will be able to execute the worklfow in the cluster:

snakemake --use-conda --until all_internet &&
snakemake -j 99 --use-conda --immediate-submit --notemp --cluster-config cluster.json --cluster 'python3 slurmSubmit.py {dependencies}'

You can execute the workflow locally using N cores with:

snakemake --use-conda --cores N

Step 4: Investigate results

The clusters found by blockbuster should be located in the file data/dataset/clusters/clusters.sorted.bed

Code Snippets

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
import pandas as pd
import sys


def Dupplicate_count_column(dataf, ID_column, dup_column):
    dataf = dataf[dataf.name.str.contains("/2") == False]
    dupplicate_serie = dataf.groupby(dataf[ID_column], as_index=False).size()
    dupplicate_quants = list(dupplicate_serie.values)
    unique_keys = list(dupplicate_serie.keys())
    dataf_dup = pd.DataFrame(
        data={
            ID_column: unique_keys,
            dup_column: dupplicate_quants
        }
    )
    dataf = pd.merge(dataf, dataf_dup, on=ID_column, how='left')
    dataf = dataf.drop_duplicates(subset=['chr', 'strand', 'start', 'end'])
    return dataf


def main(bedpath, output):
    bed_df = pd.read_csv(
        filepath_or_buffer=bedpath,
        index_col=False,
        sep='\t',
        header=None,
        names=['chr', 'start', 'end', 'name', 'score', 'strand']
    )
    ID_column = 'ID'
    dup_column = 'quantity'
    for col in ['chr', 'start', 'end', 'strand']:
        bed_df[col] = bed_df[col].map(str)
    bed_df[ID_column] = bed_df[['chr', 'start', 'end', 'strand']].apply(
        lambda x: '_'.join(x),
        axis=1
    )
    print(bed_df)
    bed_df = Dupplicate_count_column(bed_df, ID_column, dup_column)
    bed_df = bed_df.sort_values(['chr', 'strand', 'start', 'end'], axis=0)
    cols = ['chr', 'start', 'end', 'name', dup_column, 'strand']
    bed_df = bed_df[cols]
    bed_df.to_csv(path_or_buf=output, index=False, sep='\t', header=None)


main(snakemake.input.bed, snakemake.output.bed)
12
13
shell:
    "bedtools bamtobed -i {input.bam} > {output.bed}"
13
14
script:
    "../bed_to_blockbuster.py"
14
15
16
17
18
19
20
21
22
23
24
25
26
shell:
    "blockbuster.x -format 1"
    " -minBlockHeight 100"
    " -print 1"
    " -tagFilter 50"
    " {input.bed}"
    " > {output.block} &&"
    """
    grep -E '>' {output.block} | \
    awk '{{ print "chr"$2 "\t" $3 "\t" $4 "\t" $1 "\t" $6 "\t" $5}}' | \
    sort -k1,1 -k2,2n \
    > {output.clusters}
    """
6
7
shell:
    "wget -O {output.bam} {params.link}"
 9
10
shell:
    "samtools flagstat {input} > {output}"
14
15
shell:
    "samtools sort -n {input} > {output}"
ShowHide 4 more snippets with no or duplicated tags.

Free

Created: 7mo ago
Updated: 7mo ago
Maitainers: public
URL: https://github.com/GaspardR/snakemake_blockbuster
Name: snakemake_blockbuster
Version: v0.0
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...