Snakemake-based HiC-Pro Workflow for Hi-C Data Analysis


snakemake_hic

This is a template repository for running HiC-Pro under Snakemake. The steps currently implemented are:

  1. Download and index the genome

  2. Trim raw data using AdapterRemoval

  3. Run FastQC on raw and trimmed data

  4. Prepare Files for HiC-Pro

    • Organise genome annotations

    • Define restriction fragments

    • Update hicpro-config.txt based on config.yml

Setup

HiC-Pro must be installed and visible to this workflow (i.e. the HiC-Pro executable must be on your PATH). Please ensure you have a working installation on your system; this workflow is built around v3.0.0.

  • Data must be placed in data/raw/fastq

    • Each sample/replicate needs to go in a separate folder. Use the test dataset as a guide.

    • Delete the test data once you have placed yours in the correct folder

  • Edit config/samples.tsv to reflect the sample names and file names.

    • Data is assumed to be paired, so only one file needs to be listed per directory
  • Edit config/config.yml
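
The setup steps above can be sketched as a few shell commands. The sample and file names below are hypothetical, and the exact column layout of samples.tsv is an assumption; use the bundled test dataset as the authoritative guide.

```shell
# Hypothetical sample/replicate name and file name -- consult the test
# dataset shipped with the repository for the real samples.tsv layout
mkdir -p data/raw/fastq/sample1_rep1 config
touch data/raw/fastq/sample1_rep1/sample1_rep1_R1.fastq.gz

# Data is assumed to be paired, so only one file is listed per directory
printf "sample1_rep1\tsample1_rep1_R1.fastq.gz\n" > config/samples.tsv
```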

Testing

Once all edits are performed and data is uploaded, run:

snakemake -n

This performs a dry run to check that everything will work as expected. If the dry run succeeds, create the conda environments:

snakemake --use-conda --conda-create-envs-only --cores 1

This may take a while. Snakemake creates these in series so only one core is required.

Once complete, create the rulegraph to check everything is as expected.

snakemake --rulegraph > rules/rulegraph.dot
dot -Tpdf rules/rulegraph.dot > rules/rulegraph.pdf

Execution

The following will set the workflow running using 12 cores:

snakemake \
  --use-conda \
  --notemp \
  --cores 12

Code Snippets

shell:
    """
    pigz -p {threads} -c {input.valid_pairs} > {output.valid_pairs}
    """
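
In this snippet, pigz is a parallel drop-in for gzip: -p sets the thread count and -c streams the compressed copy to stdout, leaving the input file untouched. The same pattern with plain gzip, on a made-up input file, in case pigz is unavailable:

```shell
# Made-up stand-in for a valid-pairs file
printf "chr1\t100\tchr2\t200\n" > valid_pairs.txt
# pigz -p N -c would do the same thing, in parallel
gzip -c valid_pairs.txt > valid_pairs.txt.gz
# Round-trips to the original content; the input file is left in place
gunzip -c valid_pairs.txt.gz
```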
shell:
    """
    pigz -p {threads} -c {input.matrix} > {output.matrix}
    """
shell:
    """
    ## The generic copy all. Clearly this will repeat every time though
    # cp {params.in_path}/* {params.out_path}

    ## Copy the specific files
    cp {input.stat} {output.stat}
    """
shell:
    """
    convert \
        -density {params.density} \
        {input.pdf}[0] \
        -quality {params.quality} \
        {output.png}
    """
shell:
    """
    Rscript --vanilla \
      {input.script} \
      {params.bin} &> {log}
    """
shell:
    """
    # Run the python script
    python {params.script} \
      -r {params.restriction_site} \
      -o {output.rs} \
      {input}
    """
shell:
    """
    awk '{{print($3 - $2)}}' {input} | \
      sort -n | \
      uniq -c > {output}
    """
shell:
    """
    pigz -p {threads} -c {input} > {output}
    """
shell:
    """
    Rscript --vanilla \
      scripts/write_hicpro_config.R \
      {params.idx_root} \
      {input.chr_sizes} \
      {input.rs} \
      {params.template} \
      {output}
    """
shell:
    """
    ## Remove any existing data as leaving this here causes HicPro to
    ## make an interactive request. Piping `yes` into HicPro may be the
    ## source of some current problems
    if [[ -d {params.outdir} ]]; then
      rm -rf {params.outdir}
    fi

    ## Run HiC-pro
    HiC-Pro \
      -s mapping \
      -c {input.config} \
      -i {params.indir} \
      -o {params.outdir} 
    """
shell:
    """
    HiC-Pro \
      -s quality_checks \
      -c {input.config} \
      -i {params.indir} \
      -o {params.outdir} 
    """
shell:
    """
    HiC-Pro \
      -s proc_hic \
      -c {input.config} \
      -i {params.indir} \
      -o {params.outdir} 
    """
shell:
    """
    HiC-Pro \
      -s merge_persample \
      -c {input.config} \
      -i {params.indir} \
      -o {params.outdir} 
    """
shell:
    """
    HiC-Pro \
      -s build_contact_maps \
      -c {input.config} \
      -i {params.indir} \
      -o {params.outdir} 
    """
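
The five HiC-Pro invocations above use the stepwise (-s) interface, running one stage at a time in the order shown; a trivial loop simply listing that sequence:

```shell
# Stepwise HiC-Pro order as implemented by the rules above
for step in mapping quality_checks proc_hic merge_persample build_contact_maps; do
  echo "$step"
done
```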
shell:
    """
    git clone https://github.com/Rassa-Gvm/MaxHiC.git scripts/MaxHiC
    rm -rf scripts/MaxHiC/Sample_Inputs
    """
shell:
    """
    ## Given the problems with the raw output from HiC-Pro, we should
    ## delete any *ord.bed* files that exist. They seem to have been 
    ## excluded from HiC-Pro v3
    if compgen -G "{params.input_path}/*ord.bed" > /dev/null; then
      echo -e "Deleting unnecessary symlink"
      rm {params.input_path}/*ord.bed
    fi

    python {input.maxhic_exe} \
      -t {threads} \
      {params.input_path} \
      {params.output_path} &> {log}

    ## Compress the output files
    pigz -p {threads} {params.output_path}/*txt
    """
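
The compgen -G guard above succeeds only when the glob matches at least one existing file, so rm is never invoked on a non-existent pattern. A standalone demo of the same idiom (bash-specific; the directory and file names are made up):

```shell
# Made-up stand-in for a HiC-Pro output directory
mkdir -p demo_hicpro_out
touch demo_hicpro_out/fragments_ord.bed
# compgen -G returns success only if the glob expands to something
if compgen -G "demo_hicpro_out/*ord.bed" > /dev/null; then
  echo "Deleting unnecessary symlink"
  rm demo_hicpro_out/*ord.bed
fi
```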
shell:
    """
    Rscript --vanilla \
        scripts/merge_matrices.R \
        {params.bin} \
        {params.in_path} \
        {params.out_path} \
        {params.samples} &> {log}
    """
shell:
    """
    pigz -p {threads} -c {input.mat} > {output.mat}
    pigz -p {threads} -c {input.bed} > {output.bed}
    """
shell:
    """
    # Write to a separate temp directory for each run to avoid I/O clashes
    TEMPDIR=$(mktemp -d -t fqcXXXXXXXXXX)
    fastqc \
      {params} \
      -t {threads} \
      --outdir $TEMPDIR \
      {input} &> {log}

    # Move the files
    mv $TEMPDIR/*html $(dirname {output.html})
    mv $TEMPDIR/*zip $(dirname {output.zip})

    # Clean up the temp directory
    rm -rf $TEMPDIR
    """
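
The per-run temp directory is what prevents parallel FastQC jobs from clobbering each other's output. The pattern in isolation, with a made-up file name standing in for FastQC's report:

```shell
TEMPDIR=$(mktemp -d -t fqcXXXXXXXXXX)   # unique directory per invocation
touch "$TEMPDIR/demo_fastqc.html"       # stand-in for FastQC output
mv "$TEMPDIR"/*html .                   # move results to their final home
rm -rf "$TEMPDIR"                       # clean up the temp directory
```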
shell:
    """
    # Write to a separate temp directory for each run to avoid I/O clashes
    TEMPDIR=$(mktemp -d -t fqcXXXXXXXXXX)
    fastqc \
      {params} \
      -t {threads} \
      --outdir $TEMPDIR \
      {input} &> {log}

    # Move the files
    mv $TEMPDIR/*html $(dirname {output.html})
    mv $TEMPDIR/*zip $(dirname {output.zip})

    # Clean up the temp directory
    rm -rf $TEMPDIR
    """
shell:
    """
    wget \
      -O "{output}" \
      -o {log} \
      {params.ftp}
    """
shell:
    """
    gunzip -c {input} > {output}
    """
shell:
    """
    bowtie2-build \
      --threads {threads} \
      -f {input} \
      {params.prefix} &> {log}
    """
shell:
    """
    # Download the assembly report
    TEMPDIR=$(mktemp -d -t chrXXXXXXXXXX)
    REPORT="assembly_report.txt"
    curl {params.ftp} > $TEMPDIR/$REPORT 2> {log}

    # Extract the chrom_sizes
    egrep 'assembled-molecule' "$TEMPDIR/$REPORT" | \
      awk '{{print "chr"$2"\t"$3}}' > {output}

    rm -rf $TEMPDIR
    """
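
The egrep/awk pair above keeps only assembled molecules from the assembly report and emits UCSC-style chromosome names alongside their lengths. A demo on a single fabricated report line whose columns are arranged to match the awk fields:

```shell
# Fabricated report line: field 2 is the chromosome name, field 3 its length
printf "id\t1\t248956422\tassembled-molecule\n" > demo_report.txt
egrep 'assembled-molecule' demo_report.txt | awk '{print "chr"$2"\t"$3}'
```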
shell:
    """
    AdapterRemoval \
        --adapter1 {params.adapter1} \
        --adapter2 {params.adapter2} \
        --file1 {input.r1} \
        --file2 {input.r2} \
        --threads {threads} \
        --gzip \
        --maxns {params.maxns} \
        --trimqualities \
        --minquality {params.minqual} \
        --minlength {params.minlength} \
        --output1 {output.t1} \
        --output2 {output.t2} \
        --discarded /dev/null \
        --singleton /dev/null \
        --settings {output.log} &> {log}
    """
shell:
    """
    echo -e "Version: 1.0\n\nRestoreWorkspace: Default\nSaveWorkspace: Default\nAlwaysSaveHistory: Default\n\nEnableCodeIndexing: Yes\nUseSpacesForTab: Yes\nNumSpacesForTab: 2\nEncoding: UTF-8\n\nRnwWeave: knitr\nLaTeX: pdfLaTeX\n\nAutoAppendNewline: Yes\nStripTrailingWhitespace: Yes" > {output}
    """
shell:
    """
    git add analysis/*
    R -e "workflowr::wflow_build('{input.rmd}')" 2>&1 > {log}
    git add docs/*
    git commit -m 'Updated all'
    """
shell:
    """
    R -e "workflowr::wflow_build('{input.rmd}')" 2>&1 > {log}
    """
shell:
    """
    R -e "workflowr::wflow_build('{input.rmd}')" 2>&1 > {log}
    """
shell:
    """
    R -e "workflowr::wflow_build('{input.rmd}')" 2>&1 > {log}
    """
shell:
    """
    R -e "workflowr::wflow_build('{input.rmd}')" 2>&1 > {log}
    """

URL: https://github.com/steveped/snakemake_hic
Name: snakemake_hic
Version: 1
Copyright: Public Domain
License: None