Make colorful identity heatmaps of genomic sequence

public public 1yr ago Version: v0.5 0 bookmarks

This is a repository for making colorful identity heatmaps of genomic sequence.

Installation

To install you can follow the directions on the usage page or use the information below.

You will need a current version of snakemake to run this workflow. To get snakemake please follow the install instructions on their website, but in brief once conda and mamba are installed you can install snakemake with:

mamba create -n snakemake -c conda-forge -c bioconda snakemake

Afterwards you can activate the conda environment and download the repository. And all additional dependencies will be handled by snakemake .

conda activate snakemake
git clone https://github.com/mrvollger/StainedGlass.git

Running

Choose a sample identifier for your run e.g. chr8 and a fasta file on which you want to show the colorful alignments and the modify the config file config/config.yaml accordingly.

Once this is done and you have activated your conda env with snakemake you can run the pipeline like so:

snakemake --use-conda --cores 24

Or do a dry run of the pipeline:

snakemake --use-conda --cores 24 -n

All parameters are described in config/README.md and you can modify any of them by modifying config/config.yaml . You can also change the configuration via the command line. For example, to change the sample identifier and fasta options do:

snakemake --use-conda --cores 24 --config sample=test2 fasta=/some/fasta/path.fa

Please try the test case with the default configuration file before submitting issues. If you are familiar with snakemake and want to trouble shoot yourself you can find the Snakefile in the directory workflow .

Output

The file results/{sample}.{\d+}.{\d+}.bed will contain all the alignments identified by the pipeline, and is the main input for figure generation. Under the same prefix there will also be a bam file that contains the unprocessed alignments. Note the bam will contain additional alignments not present in the bed file because redundant alignments with lower scores are removed before the figure generation.

Static dot-plots for moderate sized regions

To make pdfs and pngs for a particular set of regions just add make_figures to your command. This is generally appropriate for comparing up to ~5 regions totaling at most ~40 Mbp.

snakemake --use-conda --cores 24 make_figures

This will make an output directory under results/{sample}.{\d+}.{\d+}_figures with a variety of dot plots in pdf and png format.

If you see tri.TRUE in the output pdf/png it means that the dot plot is rotated and cropped into a triangle. If you see onecolorscale.FALSE it means that between different facets in the same plot different color scales are being used.

Visualization of a large region or a whole genome

Making an interactive whole genome visualization requires the use of the program HiGlass and a web browser. However, this pipeline will make the necessary input files with the following command:

snakemake --use-conda --cores 24 cooler

To view locally, use higlass-manage :

pip install higlass-manage
higlass-manage view results/small.5000.10000.strand.mcool

See the T2T CHM13 v1.0 StainedGlass for an example.

High resolution interactive visualization

To create a high-resolution interactive visualization where the coloring is proportionally to the number of reads mapped to each bin, use the following command:

snakemake --use-conda --cores 24 cooler_density --config window=32 cooler_window=100

Arabidopsis: quick start, case example, and benchmark

To demonstrate a case example of using StainedGlass we applied the tool to a 132 Mbp chromosome level assembly of the Arabidopsis genome ( DOI:10.1126/science.abi7489 ).

wget https://github.com/schatzlab/Col-CEN/raw/main/v1.2/Col-CEN_v1.2.fasta.gz \
 && gunzip Col-CEN_v1.2.fasta.gz \
 && samtools faidx Col-CEN_v1.2.fasta

Using 8 cores on a laptop with 32 GB of ram we ran StainedGlass using the following commands:

time snakemake --cores 8 --config sample=arabidopsis fasta=Col-CEN_v1.2.fasta --use-conda

This command generated 41,036,963 self-self pairwise alignments within the assembly, 16,699,976 of which passed filters for downstream analysis.

Then to generate the cooler files that can be loaded in HiGlass we ran the following command with the already computed alignments:

time snakemake --cores 8 --config sample=arabidopsis fasta=Col-CEN_v1.2.fasta --use-conda cooler

The results can be viewed at resgen.io/paper-data/Naish , and we include a static view of the centromeres here:

Arabidopsis CENs

Arabidopsis runtime statistics:

step window (bp) user (s) system (s) cpu (%) wall (h:m:s)
alignment 1,000 16,014.07 163.41 481 56:00.57
cooler 1,000 544.51 32.98 213 4:30.64
static figures [^tnote] 1,000 2,635.30 188.07 58 1:20:14.59

[^tnote]: Not recommended for whole genomes.

A full report of all steps executed and the runtime of those steps is available in case-example-arabidopsis/report.html .

TODO

  • Allow users to adjust the color pallet used in R

  • Test short read aligners with smaller window sizes

  • Make a more intelligent fragmentation method that won't be affected by offsets in repeat motifs

  • Consider alternative ways to score cells with multiple non-overlapping alignments

Cite

Mitchell R Vollger, Peter Kerpedjiev, Adam M Phillippy, Evan E Eichler, StainedGlass: Interactive visualization of massive tandem repeat structures with identity heatmaps, Bioinformatics, 2022; https://doi.org/10.1093/bioinformatics/btac018

Code Snippets

148
149
150
151
shell:
    """
    bedtools makewindows -g {input.fai} -w {wildcards.W} {params.slide} > {output.bed}
    """
168
169
170
171
shell:
    """
    python {params.script} {input.bed} --outputs {output.bed}
    """
187
188
189
190
191
shell:
    """
    bedtools getfasta -fi {input.ref} -bed {input.bed} \
        > {output.fasta}
    """
208
209
210
211
shell:
    """
    samtools faidx {input.ref} {params.name} > {output.fasta} 
    """
229
230
231
232
233
234
235
236
237
shell:
    """
    minimap2 \
        -f {wildcards.F} -s {params.S} \
        {params.MAP_PARAMS} \
         -d {output.split_ref_index} \
        {input.ref} \
    2> {log}
    """
253
254
255
256
257
shell:
    """
    bedtools getfasta -fi {input.ref} -bed {input.bed} \
        > {output.query_fasta}
    """
276
277
278
279
280
281
282
283
284
285
286
287
shell:
    """
    ( minimap2 \
        -t {threads} \
        -f {wildcards.F} -s {params.S} \
        {params.MAP_PARAMS} \
        --dual=yes --eqx \
        {input.split_ref} {input.query} \
            | samtools sort -m {resources.mem}G \
                -o {output.aln} \
    ) 2> {log}
    """
305
306
run:
    open(output.alns, "w").write("\n".join(input.aln) + "\n")
327
328
329
330
331
shell:
    """
    #samtools cat -b {input.alns} -o {output.aln} 
    samtools merge -b {input.alns} {output.aln} 
    """
346
347
348
349
350
shell:
    """
    samtools sort -m {resources.mem}G -@ {threads} --write-index \
        -o {output.aln} {input.aln}
    """
369
370
371
372
373
374
375
shell:
    """
    python {params.script} --threads {threads} \
        --matches  {params.S} --header \
        {input.aln} \
        | pigz -p {threads} > {output.tbl}
    """
395
396
397
398
399
400
401
402
shell:
    """
    python {params.script} \
        --window {wildcards.W} --fai {input.fai} \
        --full {output.full} \
        {params.one} \
        {input.tbl} {output.bed}
    """
435
436
437
438
439
440
441
shell:
    """
    Rscript {params.script} \
        --bed {input.bed} \
        --threads {threads} \
        --prefix {wildcards.SM}.{wildcards.W}.{wildcards.F}
    """
465
466
467
468
469
470
471
472
473
474
475
476
477
shell:
    """
    gunzip -c {input.bed} | tail -n +2 \
        | sed  's/+$/100/g' \
        | sed  's/-$/50/g' \
          | cooler cload pairs \
            -c1 1 -p1 2 -c2 4 -p2 5 \
                    --field count=8:agg=mean,dtype=float \
            --chunksize 50000000000 \
            {input.fai}:{wildcards.W} \
            --zero-based \
            - {output.cool}
    """
493
494
495
496
497
498
499
500
501
502
503
504
shell:
    """
    gunzip -c {input.bed} | tail -n +2 \
          | cooler cload pairs \
            -c1 1 -p1 2 -c2 4 -p2 5 \
                    --field count=7:agg=mean,dtype=float \
            --chunksize 50000000000 \
            {input.fai}:{wildcards.W} \
            --zero-based \
            - {output.cool}

    """
519
520
521
522
523
524
shell:
    """
    cooler zoomify --field count:agg=mean,dtype=float {input.i} \
            -n {threads} \
         -o {output.i}
    """
539
540
541
542
543
544
shell:
    """
    cooler zoomify --field count:agg=mean,dtype=float {input.s} \
            -n {threads} \
         -o {output.s}
    """
557
558
559
560
shell:
    """
    bwa index -p temp/ref_{wildcards.SM} {input.fasta}
    """
576
577
578
579
580
581
shell:
    """
    bwa aln -t {threads} temp/ref_{wildcards.SM} {input.reads} \
        | bwa samse -n {params.num_dups} temp/ref_{wildcards.SM} - {input.reads} \
        | gzip > {output.sam}
    """
596
597
598
599
600
shell:
    """
    gunzip -c {input.sam} | python {params.script} - \
        | gzip > {output.contacts}
    """
616
617
618
619
620
621
622
623
624
shell:
    """
    cooler cload pairs \
            -c1 1 -p1 2 -c2 4 -p2 5 \
            --chunksize 50000000000 \
            {input.fai}:{wildcards.CW} \
            --zero-based \
            {input.contacts} {output.cool}
    """
639
640
641
642
643
644
shell:
    """
    cooler zoomify {input.d} \
            -n {threads} \
         -o {output.d}
    """
ShowHide 17 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://mrvollger.github.io/StainedGlass/
Name: stainedglass
Version: v0.5
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...