Small snakemake pipeline to explore RNA-Seq data with deepTools.

Version: v2.0.0

Maciej_Bak
Swiss_Institute_of_Bioinformatics

deepTools is a very nice toolset for exploring RNA-Seq data.
This repository contains a Snakemake workflow based on the example usage from the deepTools manual:
https://deeptools.readthedocs.io/en/develop/content/example_usage.html
My aim was to develop an automated and reproducible pipeline for my own research, which I am now happy to share with the community :)

Snakemake pipeline execution

Snakemake is a workflow management system that helps to create and execute data processing pipelines. It requires Python 3 and is most easily installed with conda from the Bioconda channel.

Step 1: Download and installation of Miniconda3

Linux:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc

macOS:

wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
bash Miniconda3-latest-MacOSX-x86_64.sh
source ~/.bashrc

Step 2: Pandas and Snakemake installation

To execute the workflow you need the pandas Python library and the Snakemake workflow manager.
Unless a specific Snakemake version is required, it is most likely best to install the latest versions:

conda install -c conda-forge pandas
conda install -c bioconda snakemake

If any dependency packages are missing, install them first (also with conda install ...).

Step 3: Pipeline execution

Specify all the required information (input/output/parameters) in config.yaml.
The main input to the pipeline is a simple design table, which must have the following format:

sample bam
[sample_name] [path_to_bam]
...

Where:

  • Each row is a sequencing sample.

  • All BAM files must have distinct file names, regardless of their location.

  • The design table may contain more columns than the two above.
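
The constraints above can be checked before starting a run. The following is a minimal sketch and not part of the repository: it assumes the design table is tab-separated with a header row (the README does not state the delimiter), and `check_design_table` is a hypothetical helper name.

```python
import csv
import io
import os


def check_design_table(handle):
    """Return a list of problems found in a design table (empty if none).

    Assumes tab-separated columns with a header row; the delimiter is an
    assumption, since the README does not specify it.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in ("sample", "bam") if col not in header]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]

    problems = []
    seen = {}  # BAM file name -> sample that first used it
    for row in reader:
        sample = row.get("sample") or ""
        bam = row.get("bam") or ""
        if not sample or not bam:
            problems.append(f"incomplete row: {row}")
            continue
        name = os.path.basename(bam)
        if name in seen:
            problems.append(
                f"{sample} and {seen[name]} share the BAM file name {name}"
            )
        else:
            seen[name] = sample
    return problems


# Toy table: two BAM files in different directories but with the same name.
example = "sample\tbam\nA\t/data/run1/reads.bam\nB\t/data/run2/reads.bam\n"
print(check_design_table(io.StringIO(example)))
```

Extra columns are ignored by the check, which matches the note that the design table may contain more than the two required columns.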

Apart from the design table, the pipeline requires a FASTA-formatted genome file.

Once the metadata are ready, write the workflow's DAG (directed acyclic graph) to dag.pdf:

bash snakemake_dag_run.sh
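
To illustrate what the DAG encodes, here is a minimal Python sketch of the dependencies that can be read off the shell commands listed under Code Snippets. The step names are hypothetical labels named after the tools, not the rule names used in the Snakefile, and the edges are an assumption inferred from which files each command reads (including the assumption that every tool reading the sorted BAM also needs its index).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# step -> set of steps whose output it reads (hypothetical labels,
# edges inferred from the shell commands, not taken from the Snakefile)
dependencies = {
    "samtools_sort": set(),
    "samtools_index": {"samtools_sort"},
    "bamCoverage": {"samtools_index"},
    "multiBigwigSummary": {"bamCoverage"},
    "plotPCA": {"multiBigwigSummary"},
    "plotCorrelation": {"multiBigwigSummary"},
    "plotCoverage": {"samtools_index"},
    "faToTwoBit": set(),
    "computeGCBias": {"samtools_index", "faToTwoBit"},
}

# A valid execution order exists only because the graph has no cycles.
order = list(TopologicalSorter(dependencies).static_order())
print(" -> ".join(order))
```

Snakemake derives this kind of ordering itself from the input and output files of each rule; the sketch only shows why the graph can be ordered and drawn.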

There are two scripts to start the pipeline, depending on whether you want to run it locally or on a SLURM cluster. To execute the workflow, Snakemake automatically creates internal conda environments and installs the required software from Anaconda Cloud. For cluster execution you may need to adapt cluster_config.json and the submission scripts before starting the run.

bash snakemake_local_run_conda_env.sh
bash snakemake_cluster_run_conda_env.sh

License

Apache 2.0

Code Snippets

master/Snakefile, lines 75-81:
shell:
    """
    mkdir -p {params.DIR_results_dir}; \
    mkdir -p {params.DIR_cluster_log}; \
    mkdir -p {log.DIR_local_log}; \
    touch {output.TMP_output}
    """

master/Snakefile, lines 112-117:
shell:
    """
    samtools sort -@ {resources.threads} {params.BAM_path} \
    1> {output.BAM_sorted} \
    2> {log.LOG_local_log};
    """

master/Snakefile, lines 148-153:
shell:
    """
    samtools index -@ {resources.threads} {input.BAM_sorted} \
    1> {output.BAI_index} \
    2> {log.LOG_local_log};
    """

master/Snakefile, lines 186-195:
shell:
    """
    bamCoverage \
    --bam {input.BAM_sorted} \
    --outFileName {output.BW_sample} \
    --binSize 1 \
    --outFileFormat bigwig \
    --numberOfProcessors {resources.threads} \
    2> {log.LOG_local_log};
    """

master/Snakefile, lines 226-233:
shell:
    """
    multiBigwigSummary bins \
    --bwfiles {input.BW_sample} \
    --outFileName {output.NPZ_summary} \
    --numberOfProcessors {resources.threads} \
    2> {log.LOG_local_log};
    """

master/Snakefile, lines 269-283:
shell:
    """
    plotPCA \
    --corData {input.NPZ_summary} \
    --plotFile {output.PNG_pca} \
    --outFileNameData {params.TSV_pca_table} \
    --ntop 1000 \
    2> {log.LOG_local_log};
    plotPCA --transpose \
    --corData {input.NPZ_summary} \
    --plotFile {output.PNG_pca_transposed} \
    --outFileNameData {params.TSV_pca_transposed_table} \
    --ntop 1000 \
    2>> {log.LOG_local_log};
    """

master/Snakefile, lines 319-335:
shell:
    """
    plotCorrelation \
    --corData {input.NPZ_summary} \
    --corMethod pearson \
    --whatToPlot heatmap \
    --plotFile {output.PNG_heatmap} \
    --outFileCorMatrix {params.TSV_heatmap_table} \
    2> {log.LOG_local_log};
    plotCorrelation \
    --corData {input.NPZ_summary} \
    --corMethod pearson \
    --whatToPlot scatterplot \
    --plotFile {output.PNG_scatterplot} \
    --outFileCorMatrix {params.TSV_scatterplot_table} \
    2>> {log.LOG_local_log};
    """
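
As a rough illustration of what `--corMethod pearson` measures, the sketch below computes the Pearson correlation between samples from per-bin coverage values in plain Python. The numbers are made up and `pearson` is a hypothetical helper; deepTools applies its own implementation to the matrix stored in the multiBigwigSummary output.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if either sequence is constant.
    return cov / math.sqrt(var_x * var_y)


# Made-up coverage per genomic bin for three samples (illustration only).
coverage = {
    "sample_A": [10.0, 12.0, 0.0, 35.0, 8.0],
    "sample_B": [11.0, 13.0, 1.0, 33.0, 9.0],
    "sample_C": [30.0, 2.0, 18.0, 4.0, 25.0],
}

# Print the pairwise correlation matrix, one row per sample.
names = list(coverage)
for first in names:
    row = [f"{pearson(coverage[first], coverage[second]):6.2f}" for second in names]
    print(first, " ".join(row))
```

Samples with similar coverage profiles (A and B here) score close to 1, which is the pattern the heatmap and scatterplot make visible.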

master/Snakefile, lines 370-377:
shell:
    """
    plotCoverage \
    --bamfiles {input.BAM_sorted} \
    --plotFile {output.PNG_coverage_plot} \
    --numberOfProcessors {resources.threads} \
    2> {log.LOG_local_log};
    """

master/Snakefile, lines 405-409:
shell:
    """
    faToTwoBit {params.FASTA_genome} {output.TWOBIT_genome_2bit} \
    2> {log.LOG_local_log};
    """

master/Snakefile, lines 446-457:
shell:
    """
    computeGCBias \
    --bamfile {input.BAM_sorted} \
    --effectiveGenomeSize {params.effective_genome_size} \
    --genome {input.TWOBIT_genome_2bit} \
    --numberOfProcessors {resources.threads} \
    --GCbiasFrequenciesFile {params.TSV_gc_tsv} \
    --biasPlot {output.PNG_gc_plot} \
    --plotFileFormat png \
    2> {log.LOG_local_log};
    """

URL: https://github.com/AngryMaciek/snakemake_deeptools