Snakemake workflow for the mapping and quantification of miRNAs and isomiRs from miRNA-Seq libraries.


MIRFLOWZ is a [Snakemake][snakemake] workflow for mapping miRNAs and isomiRs.

Table of Contents

  1. Installation

  2. Usage

  3. Workflow description

  4. Contributing

  5. License

  6. Contact

Installation

The workflow lives inside this repository and will be available for you to run after following the installation instructions laid out in this section.

Cloning the repository

Traverse to the desired path on your file system, then clone the repository and change into it with:

git clone https://github.com/zavolanlab/mirflowz.git
cd mirflowz

Dependencies

For improved reproducibility and reusability of the workflow, as well as an easy means to run it on a high performance computing (HPC) cluster managed, e.g., by [Slurm][slurm], all steps of the workflow run inside their own containers. As a consequence, running this workflow has only a few individual dependencies. These are managed by the package manager [Conda][conda], which needs to be installed on your system before proceeding.

If you do not already have [Conda][conda] installed globally on your system, we recommend that you install [Miniconda][miniconda-installation]. For faster creation of the environment (and of Conda environments in general), you can also install [Mamba][mamba] on top of Conda. In that case, replace conda with mamba in the commands below (particularly in conda env create).

Setting up the virtual environment

Create and activate the environment with necessary dependencies with Conda:

conda env create -f environment.yml
conda activate mirflowz

If you plan to run MIRFLOWZ via Singularity and do not already have it installed globally on your system, you must further update the Conda environment using environment.root.yml with the command below. Note that the environment must be activated before you update it.

conda env update -f environment.root.yml

Note that you will need to have root permissions on your system to be able to install Singularity. If you want to run MIRFLOWZ on an HPC cluster (recommended in almost all cases), ask your systems administrator about Singularity.

If you would like to contribute to MIRFLOWZ development, you may find it useful to further update your environment with the development dependencies:

conda env update -f environment.dev.yml

Testing your installation

Several tests are provided to check the integrity of the installation. Follow the instructions in this section to make sure the workflow is ready to use.

Run test workflow on local machine

Execute one of the following commands to run the test workflow on your local machine:

  • Test workflow on local machine with Singularity :
bash test/test_workflow_local_with_singularity.sh
  • Test workflow on local machine with Conda :
bash test/test_workflow_local_with_conda.sh

Run test workflow on a cluster via Slurm

Execute one of the following commands to run the test workflow on a [Slurm][slurm]-managed high-performance computing (HPC) cluster:

  • Test workflow with Singularity :
bash test/test_workflow_slurm_with_singularity.sh
  • Test workflow with Conda :
bash test/test_workflow_slurm_with_conda.sh

Rule graph

Execute the following command to generate a rule graph image for the workflow. The output will be found in the images/ directory in the repository root.

bash test/test_rule_graph.sh

You can see the rule graph below in the workflow description section.

Clean up test results

After successfully running the tests above, you can run the following command to remove all artifacts generated by the test runs:

bash test/test_cleanup.sh

Usage

Now that your virtual environment is set up and the workflow is deployed and tested, you can go ahead and run the workflow on your samples.

Preparing inputs

We suggest keeping all the input files for a given run (or hard links pointing to them) inside a dedicated directory, for instance under the MIRFLOWZ root directory. This makes it easier to keep the data together, reproduce an analysis and set up Singularity access to the files.

1. Prepare a sample table

touch path/to/your/sample/table.csv

Fill the sample table according to the following requirements:

  • sample. This column contains the library name.

  • sample_file. In this column, you must provide the path to the library file. The path must be relative to the working directory.

  • adapter. This field must contain the adapter sequence in capital letters.

  • format. In this field you must state the library format. It can either be fa if providing a FASTA file or fastq if the library is a FASTQ file.

You can look at test/test_files/sample_table.csv to see what this file must look like, or use it as a template.
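For illustration, a minimal two-library sample table could be created as follows. The library names, paths and adapter sequences below are placeholders; check test/test_files/sample_table.csv for the exact delimiter and layout expected by the workflow.

```shell
# Hypothetical example: placeholder names, paths and adapter sequences.
printf '%s\n' \
  'sample,sample_file,adapter,format' \
  'lib_1,data/lib_1.fastq,AACTGTAGGCACCATCAAT,fastq' \
  'lib_2,data/lib_2.fa,TGGAATTCTCGGGTGCCAAGG,fa' \
  > sample_table.csv
```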

2. Prepare genome resources

There are 4 files you must provide:

  1. A gzipped FASTA file containing reference sequences, typically the genome of the source organism from which the library was derived.

  2. A gzipped GTF file with matching gene annotations for the reference sequences above.

MIRFLOWZ expects both the reference sequence and gene annotation files to follow [Ensembl][ensembl] style/formatting. If you obtained these files from a source other than Ensembl, you may first need to convert them to the expected style to avoid issues!

  3. An uncompressed GFF3 file with microRNA annotations for the reference sequences above.

MIRFLOWZ expects the miRNA annotations to follow [miRBase][mirbase] style/formatting. If you obtained this file from a source other than miRBase, you may first need to convert it to the expected style to avoid issues!

  4. An uncompressed tab-separated file with a mapping between the reference names used in the miRNA annotation file (column 1; "UCSC style") and in the gene annotations and reference sequence files (column 2; "Ensembl style"). Values in column 1 are expected to be unique, no header is expected, and any additional columns will be ignored. [This resource][chrMap] provides such files for various organisms, and in the expected format.

General note: If you want to process the genome resources before use (e.g., filtering), you can do that, but make sure the formats of any modified resource files meet the formatting expectations outlined above!
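As an illustration, a few rows of such a mapping file could be generated like this. The names below are hypothetical examples; the actual values must match the names used in your annotation and reference files.

```shell
# Hypothetical excerpt: "UCSC style" names (column 1) mapped to
# "Ensembl style" names (column 2), tab-separated, no header.
printf 'chr1\t1\nchr2\t2\nchrM\tMT\n' > chr_map.tsv
```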

3. Prepare a configuration file

We recommend creating a copy of the configuration file template:

cp path/to/config_template.yaml path/to/config.yaml

Open the new copy in your editor of choice and adjust the configuration parameters to your liking. The template explains what each of the parameters means and how you can meaningfully adjust them.
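As an illustration only, a filled-in configuration might look roughly like the sketch below. The parameter names here are hypothetical; the authoritative list of parameters and their explanations live in the config template itself.

```yaml
# Hypothetical sketch -- consult the configuration file template for the
# real parameter names and their documentation.
samples: "inputs/sample_table.csv"     # sample table prepared above
genome_file: "inputs/genome.fa.gz"     # gzipped reference FASTA
gtf_file: "inputs/annotations.gtf.gz"  # gzipped gene annotations (GTF)
mirna_file: "inputs/mirna.gff3"        # uncompressed miRNA annotations (GFF3)
map_chr_file: "inputs/chr_map.tsv"     # chromosome name mapping table
```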

Running the workflow

With all the required files in place, you can now run the workflow locally with the following command:

snakemake \
 --snakefile="path/to/Snakefile" \
 --cores 4 \
 --configfile="path/to/config.yaml" \
 --use-singularity \
 --singularity-args "--bind ${PWD}/../" \
 --printshellcmds \
 --rerun-incomplete \
 --verbose

NOTE: Depending on your working directory, you may not need the --snakefile and --configfile parameters. For instance, if the Snakefile is in the current working directory, or in a workflow/ directory beneath it, there is no need for the --snakefile parameter. Refer to the [Snakemake documentation][snakemakeDocu] for more information.

After successful execution of the workflow, results and logs will be found in the results/ and logs/ directories, respectively.

Creating a Snakemake report

Snakemake provides the option to generate a detailed HTML report on runtime statistics, workflow topology and results. To create a Snakemake report, run the following command:

snakemake \
 --snakefile="path/to/Snakefile" \
 --configfile="path/to/config.yaml" \
 --report="snakemake_report.html"

NOTE: The report must be created after running the workflow so that the runtime statistics and results are available.

Workflow description

The MIRFLOWZ workflow first processes and indexes the user-provided genome resources. Afterwards, the user-provided short-read small RNA-seq libraries are aligned separately against the genome and the transcriptome. For increased fidelity, two separate aligners, [Segemehl][segemehl] and our in-house tool [Oligomap][oligomap], are used. All resulting alignments are merged such that only the best alignments of each read (smallest edit distance) are kept. Finally, alignments are intersected with the user-provided, pre-processed miRNA annotation file using [bedtools][bedtools]. Counts are tabulated separately for reads consistent with miRNA precursors, mature miRNAs and/or isomiRs.
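The "keep only the best alignments" merging step can be illustrated with a toy example. MIRFLOWZ performs this step in a dedicated script; the awk sketch below is not part of the workflow and assumes SAM-like records whose twelfth field carries the edit distance as an NM:i: tag.

```shell
# Toy SAM body: read r1 has alignments with edit distance 2 and 0,
# read r2 has a single alignment with edit distance 1.
printf 'r1\t0\tchr1\t10\t255\t4M\t*\t0\t0\tACGT\t*\tNM:i:2\nr1\t0\tchr1\t50\t255\t4M\t*\t0\t0\tACGT\t*\tNM:i:0\nr2\t0\tchr1\t70\t255\t4M\t*\t0\t0\tACGT\t*\tNM:i:1\n' > toy.sam

# Two passes over the same file: first record each read's minimum edit
# distance, then print only the alignments that reach it.
awk -F'\t' '
  { nm = $12; sub(/^NM:i:/, "", nm); nm += 0 }
  NR == FNR { if (!($1 in best) || nm < best[$1]) best[$1] = nm; next }
  nm == best[$1]
' toy.sam toy.sam > best.sam
```

Only r1's distance-0 alignment and r2's single alignment survive the filter.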

The schema below is a visual representation of the individual workflow steps and how they are related:

![rule-graph][rule-graph]

Contributing

MIRFLOWZ is an open-source project which relies on community contributions. You are welcome to participate by submitting bug reports or feature requests, taking part in discussions, or proposing fixes and other code changes.

License

This project is covered by the MIT License.

Contact

Do not hesitate to contact us via [email][email] with any inquiries about MIRFLOWZ. Please mention the name of the tool.

Code Snippets

shell:
    "(zcat {input.reads} > {output.reads}) &> {log}"
SnakeMake From line 92 of rules/map.smk
shell:
    "(fastq_quality_filter \
    -v \
    -q {params.q} \
    -p {params.p} \
    -i {input.reads} \
    > {output.reads} \
    ) &> {log}"
SnakeMake From line 122 of rules/map.smk
shell:
    "(fastq_to_fasta -r -n -i {input.reads} > {output.reads}) &> {log}"
SnakeMake From line 156 of rules/map.smk
shell:
    "(fasta_formatter -w 0 -i {input.reads} > {output.reads}) &> {log}"
SnakeMake From line 187 of rules/map.smk
shell:
    "(cutadapt \
    -a {params.adapter} \
    --error-rate {params.error_rate} \
    --minimum-length {params.minimum_length} \
    --overlap {params.overlap} \
    --trim-n \
    --max-n {params.max_n} \
    --cores {resources.threads} \
    -o {output.reads} {input.reads}) &> {log}"
SnakeMake From line 222 of rules/map.smk
shell:
    "(fastx_collapser -i {input.reads} > {output.reads}) &> {log}"
SnakeMake From line 260 of rules/map.smk
shell:
    "(segemehl.x \
    -i {input.genome_index_segemehl} \
    -d {input.genome} \
    -t {threads} \
    -e \
    -q {input.reads} \
    -outfile {output.gmap} \
    ) &> {log}"
SnakeMake From line 296 of rules/map.smk
shell:
    "(segemehl.x \
    -i {input.transcriptome_index_segemehl} \
    -d {input.transcriptome} \
    -t {threads} \
    -q {input.reads} \
    -e \
    -outfile {output.tmap} \
    ) &> {log}"
SnakeMake From line 346 of rules/map.smk
shell:
    "(python {input.script} \
    -r {params.max_length_reads} \
    -i {input.reads} \
    -o {output.reads} \
    ) &> {log}"
SnakeMake From line 387 of rules/map.smk
shell:
    "(oligomap \
    {input.target} \
    {input.reads} \
    -r {output.report} \
    > {output.gmap} \
    ) &> {log}"
SnakeMake From line 429 of rules/map.smk
shell:
    "(bash {input.script} \
    {input.tmap} \
    {resources.threads} \
    {output.sort} \
    ) &> {log}"
SnakeMake From line 467 of rules/map.smk
shell:
    "(python {input.script} \
    -i {input.sort} \
    -n {params.nh} \
    > {output.gmap}) &> {log}"
SnakeMake From line 509 of rules/map.smk
shell:
    "(oligomap \
    {input.target} \
    {input.reads} \
    -s \
    -r {output.report} \
    > {output.tmap} \
    ) &> {log}"
SnakeMake From line 557 of rules/map.smk
shell:
    "(bash {input.script} \
    {input.tmap} \
    {resources.threads} \
    {output.sort} \
    ) &> {log}"
SnakeMake From line 604 of rules/map.smk
shell:
    "(python {input.script} \
    -i {input.sort} \
    -n {params.nh} \
    > {output.tmap} \
    ) &> {log}"
SnakeMake From line 651 of rules/map.smk
shell:
    "(cat {input.gmap1} {input.gmap2} > {output.gmaps}) &> {log}"
SnakeMake From line 684 of rules/map.smk
shell:
    "(cat {input.tmap1} {input.tmap2} > {output.tmaps}) &> {log}"
SnakeMake From line 719 of rules/map.smk
shell:
    "(python {input.script} \
    {input.gmaps} \
    {params.nh} \
    {output.gmaps} \
    ) &> {log}"
SnakeMake From line 749 of rules/map.smk
shell:
    "(python {input.script} \
    {input.tmaps} \
    {params.nh} \
    {output.tmaps} \
    ) &> {log}"
SnakeMake From line 787 of rules/map.smk
shell:
    "samtools view {input.gmap} > {output.gmap}"
shell:
    "samtools view {input.tmap} > {output.tmap}"
shell:
    "(perl {input.script} \
    --in {input.tmap} \
    --exons {input.exons} \
    --out {output.genout} \
    ) &> {log}"
SnakeMake From line 893 of rules/map.smk
shell:
    "(cat {input.gmap1} {input.gmap2} > {output.catmaps}) &> {log}"
SnakeMake From line 928 of rules/map.smk
shell:
    "(cat {input.header} {input.catmaps} > {output.concatenate}) &> {log}"
SnakeMake From line 957 of rules/map.smk
shell:
    "(samtools sort -n -o {output.sort} {input.concatenate}) &> {log}"
SnakeMake From line 987 of rules/map.smk
shell:
    "(perl {input.script} \
    --print-header \
    --keep-mm \
    --in {input.sort} \
    --out {output.remove_inf} \
    ) &> {log}"
SnakeMake From line 1024 of rules/map.smk
shell:
    "(python {input.script} \
    {input.sam} \
    --nh \
    > {output.sam} \
    ) &> {log}"
SnakeMake From line 1064 of rules/map.smk
shell:
    "(samtools view -b {input.maps} > {output.maps}) &> {log}"
SnakeMake From line 1098 of rules/map.smk
shell:
    "(samtools sort {input.maps} > {output.maps}) &> {log}"
SnakeMake From line 1130 of rules/map.smk
shell:
    "(samtools index -b {input.maps} > {output.maps}) &> {log}"
SnakeMake From line 1162 of rules/map.smk
shell:
    "(zcat {input.genome} | {input.script} > {output.genome}) &> {log}"
shell:
    "(zcat {input.gtf} | gffread -w {output.fasta} -g {input.genome}) &> {log}"
shell:
    "(cat {input.fasta} | {input.script} > {output.fasta}) &> {log}"
shell:
    "(segemehl.x -x {output.idx} -d {input.fasta}) &> {log}"
shell:
    "(segemehl.x -x {output.idx} -d {input.genome}) &> {log}"
shell:
    "(bash \
    {input.script} \
    -f {input.gtf} \
    -c 3 \
    -p exon \
    -o {output.exons} \
    ) &> {log}"
shell:
    "(Rscript \
    {input.script} \
    --gtf {input.exons} \
    -o {output.exons} \
    ) &> {log}"
shell:
    "(samtools dict -o {output.header} --uri=NA {input.genome}) &> {log}"
shell:
    "(perl {input.script} \
    {input.anno} \
    {params.column} \
    {params.delimiter} \
    {input.map_chr} \
    {output.gff} \
    ) &> {log}"
shell:
    "(samtools faidx {input.genome}) &> {log}"
shell:
    "(cut -f1,2 {input.genome} > {output.chrsize}) &> {log}"
shell:
    "(python {input.script} \
    {input.gff3} \
    --chr {input.chrsize} \
    --extension {params.extension} \
    --outdir {params.out_dir} \
    ) &> {log}"
shell:
    "(bedtools intersect \
    -wb \
    -s \
    -F 1 \
    -b {input.alignment} \
    -a {input.primir} \
    -bed \
    > {output.intersect} \
    ) &> {log}"
shell:
    "((samtools view \
    -H {input.alignments}; \
    awk 'NR==FNR {{bed[$13]=1; next}} $1 in bed' \
    {input.intersect} {input.alignments} \
    ) > {output.sam} \
    ) &> {log}"
shell:
    "(samtools view -b {input.maps} > {output.maps}) &> {log}"
shell:
    "(samtools sort -n {input.maps} > {output.maps}) &> {log}"
shell:
    "(samtools index -b {input.maps} > {output.maps}) &> {log}"
shell:
    "(bedtools intersect \
    -wo \
    -s \
    -F 1 \
    -b {input.alignment} \
    -a {input.mirna} \
    -bed \
    > {output.intersect} \
    ) &> {log}"
shell:
    "((samtools view \
    -H {input.alignments}; \
    awk 'NR==FNR {{bed[$13]=1; next}} $1 in bed' \
    {input.intersect} {input.alignments} \
    ) > {output.sam} \
    ) &> {log}"
shell:
    "(python {input.script} \
    --bed {input.intersect} \
    --sam {input.alignments} \
    --extension {params.extension} \
    > {output.sam} \
    ) &> {log}"
shell:
    "(samtools sort -t YW -O SAM {input.sam} > {output.sam}) &> {log}"
shell:
    "(python \
    {input.script} \
    {input.alignments} \
    --collapsed \
    --nh \
    --lib {params.library} \
    --outdir {params.out_dir} \
    --mir-list {params.mir_list} \
    ) &> {log}"
shell:
    "(python \
    {input.script} \
    {input.intersect} \
    --collapsed \
    --nh \
    > {output.table} \
    ) &> {log}"
shell:
    "(Rscript \
    {input.script} \
    --input_dir {params.input_dir} \
    --output_file {output.table} \
    --prefix {params.prefix} \
    --verbose \
    ) &> {log}"
shell:
    "(perl {input.script} \
    --suffix \
    --in {input.maps} \
    --out {output.maps} \
    ) &> {log}"
shell:
    "(samtools view -b {input.maps} > {output.maps}) &> {log}"
shell:
    "(samtools sort {input.maps} > {output.maps}) &> {log}"
shell:
    "(samtools index -b {input.maps} > {output.maps}) &> {log}"
