Source code for the characterization of new genes paper published in NAR.
Important note
The code in this package is intended specifically to run the analysis as it appears in Boivin V. and Reulet G. et al., 2020. Reducing the structure bias of RNA-Seq reveals a large number of non-annotated non-coding RNA. Nucleic Acids Research. Available here.
RNA-Seq pipeline
Author : Gaspard Reulet
Email : [email protected]
Software to install
Conda (Miniconda3) needs to be installed (https://docs.conda.io/en/latest/miniconda.html)
For Linux users :
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Answer "yes" to "Do you wish the installer to initialize Miniconda3?"
To create the Snakemake environment used to launch Snakemake, run the following commands. The conda create command can appear to hang at "Solving environment"; it is probably not stuck, so be patient.
exec bash
conda config --set auto_activate_base False
conda create --name smake -c bioconda -c conda-forge snakemake=5.7.0
Before running Snakemake, you have to activate the environment:
conda activate smake
Usage
Step 1: Install workflow
If you simply want to use this workflow, download and extract the latest release.
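For example (a minimal sketch; the archive URL and directory name below are placeholders, not the actual release):
wget -O workflow.tar.gz <release-tarball-url>   # placeholder: use the URL of the latest release archive
tar -xzf workflow.tar.gz
cd <extracted-workflow-directory>               # placeholder: name of the extracted directory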
If you use this workflow in a paper, don't forget to give credit to the author by citing the URL of this repository and the publication.
Step 2: Configure workflow
This workflow was developed for a Slurm cluster system. The system used for developing, testing and running the analysis is described here.
If you also use a Slurm cluster system, configure the workflow according to your system by editing the file cluster.json.
If you use a cluster system that operates with another scheduler, configure the workflow according to your system by editing both cluster.json and slurmSubmit.py. See the Snakemake documentation for further details.
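As an illustration only, a typical Snakemake cluster-config file maps rule names to scheduler resources, with a __default__ entry used as fallback. This is a minimal sketch; the resource keys and values are placeholders and may differ from the cluster.json shipped with this workflow:
{
    "__default__": {
        "time": "1:00:00",
        "cpus-per-task": 1,
        "mem": "4G"
    },
    "blockbuster": {
        "time": "12:00:00",
        "cpus-per-task": 1,
        "mem": "64G"
    }
}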
If you don't use a cluster system, you don't need to edit any file. However, a cluster is highly recommended, since certain steps of the workflow are CPU- and RAM-intensive (e.g. rule blockbuster).
Step 3: Execute workflow
After activating the environment, you can test your configuration by performing a dry run with:
snakemake --use-conda --dry-run --printshellcmds
# OR
snakemake --use-conda -np
You can visualize the steps of the workflow with:
snakemake --dag | dot -Tpdf > dag.pdf
# OR
snakemake --dag | dot -Tsvg > dag.svg
# OR
snakemake --dag | dot -Tpng > dag.png
Execute the workflow on a cluster with:
snakemake -j 99 --use-conda --immediate-submit --notemp --cluster-config cluster.json --cluster 'python3 slurmSubmit.py {dependencies}'
If the nodes of the cluster don't have access to the Internet, you will need to download all the required files first. You will then be able to execute the workflow on the cluster:
snakemake --use-conda --until all_internet &&
snakemake -j 99 --use-conda --immediate-submit --notemp --cluster-config cluster.json --cluster 'python3 slurmSubmit.py {dependencies}'
You can execute the workflow locally using N cores with:
snakemake --use-conda --cores N
Step 4: Investigate results
The clusters found by blockbuster should be located in the file data/dataset/clusters/clusters.sorted.bed.
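For a quick look at the result, for example:
head data/dataset/clusters/clusters.sorted.bed        # first clusters (BED-like lines)
wc -l data/dataset/clusters/clusters.sorted.bed       # number of clusters found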
Code Snippets
import pandas as pd
import sys


def Dupplicate_count_column(dataf, ID_column, dup_column):
    # Drop second-mate reads (names containing "/2"), then count how many reads
    # share the same coordinates (ID_column) and keep one line per position.
    dataf = dataf[dataf.name.str.contains("/2") == False]
    dupplicate_serie = dataf.groupby(dataf[ID_column], as_index=False).size()
    dupplicate_quants = list(dupplicate_serie.values)
    unique_keys = list(dupplicate_serie.keys())
    dataf_dup = pd.DataFrame(
        data={
            ID_column: unique_keys,
            dup_column: dupplicate_quants
        }
    )
    dataf = pd.merge(dataf, dataf_dup, on=ID_column, how='left')
    dataf = dataf.drop_duplicates(subset=['chr', 'strand', 'start', 'end'])
    return dataf


def main(bedpath, output):
    # Read the BED file produced by bedtools bamtobed.
    bed_df = pd.read_csv(
        filepath_or_buffer=bedpath,
        index_col=False,
        sep='\t',
        header=None,
        names=['chr', 'start', 'end', 'name', 'score', 'strand']
    )
    ID_column = 'ID'
    dup_column = 'quantity'
    # Build a position ID of the form chr_start_end_strand for each read.
    for col in ['chr', 'start', 'end', 'strand']:
        bed_df[col] = bed_df[col].map(str)
    bed_df[ID_column] = bed_df[['chr', 'start', 'end', 'strand']].apply(
        lambda x: '_'.join(x), axis=1
    )
    print(bed_df)
    bed_df = Dupplicate_count_column(bed_df, ID_column, dup_column)
    bed_df = bed_df.sort_values(['chr', 'strand', 'start', 'end'], axis=0)
    # Write a BED file whose score column holds the read count, as expected by blockbuster.
    cols = ['chr', 'start', 'end', 'name', dup_column, 'strand']
    bed_df = bed_df[cols]
    bed_df.to_csv(path_or_buf=output, index=False, sep='\t', header=None)


main(snakemake.input.bed, snakemake.output.bed)
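To illustrate what this conversion does, here is a toy, self-contained sketch of the same collapsing idea (not the workflow script itself): reads at identical coordinates are merged into one line whose score column holds the read count, and second-mate reads whose name contains "/2" are discarded.
import pandas as pd

# Toy BED-like input: two first-mate reads at the same position plus one "/2" mate.
toy = pd.DataFrame(
    [["chr1", 100, 150, "readA/1", 60, "+"],
     ["chr1", 100, 150, "readB/1", 60, "+"],
     ["chr1", 100, 150, "readB/2", 60, "-"],
     ["chr2", 200, 250, "readC/1", 60, "-"]],
    columns=["chr", "start", "end", "name", "score", "strand"],
)
kept = toy[~toy["name"].str.contains("/2")]                 # drop second mates
counts = (kept.groupby(["chr", "start", "end", "strand"])   # count reads per position
              .size().reset_index(name="quantity"))
collapsed = (kept.merge(counts, on=["chr", "start", "end", "strand"])
                 .drop_duplicates(["chr", "strand", "start", "end"]))
print(collapsed[["chr", "start", "end", "name", "quantity", "strand"]])
# -> chr1 100 150 readA/1 2 +   and   chr2 200 250 readC/1 1 -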
shell:
    "bedtools bamtobed -i {input.bam} > {output.bed}"

script:
    "../bed_to_blockbuster.py"

shell:
    "blockbuster.x -format 1"
    " -minBlockHeight 100"
    " -print 1"
    " -tagFilter 50"
    " {input.bed}"
    " > {output.block} &&"
    """
    grep -E '>' {output.block} | \
    awk '{{ print "chr"$2 "\t" $3 "\t" $4 "\t" $1 "\t" $6 "\t" $5}}' | \
    sort -k1,1 -k2,2n \
    > {output.clusters}
    """

shell:
    "wget -O {output.bam} {params.link}"

shell:
    "samtools flagstat {input} > {output}"

shell:
    "samtools sort -n {input} > {output}"