Running a Hi-C Data Processing Pipeline Using Snakemake



Description of individual steps in the pipeline:

DAG of Pipeline (diagram not reproduced here; see step 6 of the run instructions below to regenerate it)

1. run_bwa_mem

# Align a pair of .fastq files to the genome and convert the .sam output to .bam.
# Flags:
#   -t      number of threads to use for alignment
#   -SP5M:
#     -5    for split alignments, take the alignment with the smallest coordinate as primary
#     -S    skip mate rescue
#     -P    skip pairing
#     -M    mark shorter split hits as secondary
# Arguments: the bwa-indexed genome, then the two .fastq files of the pair.
bwa mem \
 -t {params.bwaThreads} \
 -SP5M \
 {params.bwaIndex} \
 {input.fq1} \
 {input.fq2} | \
 samtools view -Shb - > {output.bam}   # convert bwa mem output to a .bam file with header
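
To sanity-check the alignment before moving on, samtools can summarize the mapping statistics. A minimal sketch; the .bam path is a placeholder for whatever {output.bam} resolves to for your sample:

# summarize read counts, mapping rate, and pairing for the aligned .bam
samtools flagstat results/aligned/sample1.bam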

2. make_pairs_with_pairtools_parse

# Parse the .bam alignment file into a .pairs file of Hi-C contacts.
#   -c                    chromosome sizes file
#   --drop-sam            do not carry the sam alignments into the output
#   --add-columns mapq    append the MAPQ of each alignment as extra columns
pairtools parse \
 -c {params.chrom_sizes} \
 --drop-sam \
 --add-columns mapq \
 --output {output.pairs} \
 {input.bam}
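
To see how the parsed pairs break down by category before sorting, pairtools provides a stats subcommand. A quick sketch; the .pairs path is a placeholder:

# tabulate total pairs and per-pair_type counts
pairtools stats results/pairs/sample1.pairs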

3. sort_pairs_with_pairtools_sort

# create the temporary directory if it does not already exist
[ -d {params.tempdir} ] || mkdir {params.tempdir}
# sort the .pairs file created in step 2
pairtools sort \
 --nproc {params.nproc} \
 --memory {params.memory} \
 --tmpdir {params.tempdir} \
 --output {output.sorted_pairs} \
 {input.pairs}
# remove the temporary directory after sorting
rm -rf {params.tempdir}

4. mark_duplicates_with_pairtools_dedup

# Find and mark PCR/optical duplicates in the .pairs file from step 3.
#   --mark-dups          duplicate pairs are marked as DD in pair_type and as duplicates in the sam entries
#   --output-dups -      output duplicates together with the deduped pairs
#   --output-unmapped -  output unmapped pairs together with the deduped pairs
#   --output             name of the output .pairs file with duplicates marked
pairtools dedup \
 --mark-dups \
 --output-dups - \
 --output-unmapped - \
 --output {output.marked_pairs} \
 {input.sorted_pairs}
# index the marked .pairsam file
pairix {output.marked_pairs}
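
Because duplicates are marked rather than removed, the duplication level can be estimated directly from the output. A rough check, assuming the default .pairs column order in which pair_type is the eighth column (the path is a placeholder):

# count pairs flagged as PCR/optical duplicates (pair_type == "DD")
gunzip -c results/marked_pairs/sample1.marked.pairs.gz | \
 grep -v '^#' | \
 awk -F'\t' '$8 == "DD"' | wc -l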

5. filter_pairs

# Generate a lossless .bam from the .pairsam file produced in step 4.
pairtools split \
 --output-sam {output.lossless_bam} \
 {input.marked_pairs}
# Select UU, UR, and RU reads:
#   UU = unique-unique, both alignments are unique
#   UR or RU = unique-rescued, one alignment unique, the other rescued
#   --output-rest  file receiving the remainder of the read pairs
#   --output       temporary file with the selected read pairs
pairtools select \
 '(pair_type == "UU") or (pair_type == "UR") or (pair_type == "RU")' \
 --output-rest {output.unmapped_sam} \
 --output {params.temp_file} \
 {input.marked_pairs}
# Generate a .pairs file from the UU, UR, and RU reads selected above.
pairtools split \
 --output-pairs {params.temp_file1} \
 {params.temp_file}
# Keep only pairs on the chromosomes of interest.
#   --chrom-subset  path to a chromosome sizes file restricted to the chromosomes of interest
pairtools select 'True' \
 --chrom-subset {params.chrom_sizes} \
 -o {output.dedup_pairs} \
 {params.temp_file1}
# index the filtered .pairs file
pairix {output.dedup_pairs}
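
With the pairix index in place, pairs overlapping a 2D region can be pulled straight from the compressed file. A usage sketch; the chromosome name and coordinates are placeholders:

# fetch all pairs with both mates in the first 2 Mb of chr1
pairix {output.dedup_pairs} 'chr1:1-2000000|chr1:1-2000000'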

6. add_frag2Pairs

The Perl script used here (fragment_4dnpairs.pl) is from the Aiden Lab's Juicer repository, Release 1.6 (https://github.com/aidenlab/juicer; Durand et al., 2016). This step requires a restriction site file; step 6A below shows how to generate one outside of the Snakemake pipeline.

6A. FIRST TIME ONLY. Generate restriction site file.

# generate a restriction site file for the chosen enzyme and genome fasta
python generate_site_positions.py 'HindIII' 'ecoli' '/net/qlotsam.lan.omrf.org/qlotsam/data/sansam/hpc-nobackup/scripts/CutAndRun_2020Aug1/ecoliBowtie2/GCF_000005845.2_ASM584v2_genomic.fna'
# decompress the deduplicated .pairs file from step 5 and convert it to a fragment map
#   -a  allows replacing existing frag1/frag2 columns
#   the restriction site file lists, on each line, the sorted locations of the enzyme restriction sites
gunzip -ck {input.dedup_pairs} | \
 workflow/scripts/fragment_4dnpairs.pl \
 -a - \
 {params.frag2_pairs_basename} \
 {params.restriction_file}
# compress the .pairs file generated
bgzip -f {params.frag2_pairs_basename}
# index the .pairs file generated
pairix -f {output.frag2_pairs}
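
For reference, the restriction site file is plain text with one chromosome per line: the chromosome name followed by the sorted positions of its restriction sites. A toy illustration of the layout (names and coordinates are made up):

# <chrom> <site1> <site2> ... , whitespace-separated, sorted ascending
chr1 16007 24571 27981 30429
chr2 11065 25979 31202 41697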

7. run_cooler

# copy the chromosome sizes file so that all chromosomes and contigs are used
cp {params.chrom_sizes} {params.cooler_tempchrsize}
# cooler cload needs the chromosome sizes file itself, supplied as <chrom_sizes>:<bin_size>,
# in addition to the .pairs input
cooler cload pairix \
 -p {params.cooler_n_cores} \
 -s {params.cooler_max_split} \
 {params.cooler_tempchrsize}:{params.cooler_bin_size} \
 {input.frag2_pairs} \
 {output.cooler}
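
Once the .cool file is written, cooler itself can confirm that it loaded correctly. A minimal check; the output path is a placeholder:

# print bin size, chromosome count, and total contacts stored in the .cool file
cooler info results/coolers/sample1.cool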

Step-by-step instructions on running Snakemake pipeline:

1. Load slurm and miniconda

Note: the commands to do this will differ on your machine. These commands are specific to an HPC running SLURM with these modules installed.

ml slurm
ml miniconda

2. Clone repository

git clone https://github.com/SansamLab/Process_HiC_SnakeMake.git
# rename folder with project name
mv Process_HiC_SnakeMake/ My_HiC_Project_Folder/
# change directory into root of your project folder
cd My_HiC_Project_Folder

3. Start the conda environment

3A. FIRST TIME ONLY: Setup conda environment for snakemake

# -f is the location of the environment .yml file. 
## The relative path assumes that you are in the root directory of this repository.
# -p is the path where you want to install this environment
conda env create -f workflow/envs/SnakemakeEnv.yml -p /s/sansam-lab/SnakemakeEnv 

3B. Activate conda environment for snakemake

conda activate /s/sansam-lab/SnakemakeEnv

4. Modify the job-specific configuration files.

4A. Modify the config/config.yml file

You must enter paths for the following (see the example below):

  • bwa_genome: location of the bwa-indexed genome for the alignment
  • chrom_sizes: chromosome sizes file
  • juicer_RE_file: restriction enzyme (restriction site) file generated with juicer
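
A sketch of what these entries might look like in config/config.yml (all paths are placeholders for your system):

bwa_genome: "/path/to/bwa_index/genome.fa"
chrom_sizes: "/path/to/genome.chrom.sizes"
juicer_RE_file: "/path/to/genome_HindIII.txt"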

4B. Modify the config/samples.csv file

The samples.csv file in the config folder has paths to the test fastq files. Replace those paths with the paths to your own fastq files. The first column of each row is the sample name, which will be used to name all output files for that sample.
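
For example, a two-sample table might look like the following; the header names shown here are illustrative, so keep the header row from the repository's test file:

sample,fastq1,fastq2
sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz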

4C. IF SLURM RESOURCE CHANGES ARE NEEDED. Modify the config/cluster_config.yml file

CPU and memory requests for each rule in the pipeline are set in this file. If you are using SLURM, you may need to adjust these values to fit your needs and your system.
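
The keys referenced by the --cluster string in step 7 ({cluster.account}, {cluster.partition}, and so on) are looked up in this file, with a __default__ block plus optional per-rule overrides. A sketch with placeholder values:

__default__:
  account: your_account
  partition: serial
  cpus-per-task: 1
  mem: 8G
  output: logs/%x_%j.out
run_bwa_mem:
  cpus-per-task: 12
  mem: 32G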

5. Do a dry run.

A dry run produces text output showing exactly which commands will be executed. Look this over carefully before submitting the full job. It is normal to see warnings about changes to the code, input, and params.

snakemake -npr

6. Make a DAG diagram.

snakemake --dag | dot -Tpdf > dag.pdf
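
If the job-level DAG is too dense to read, a rule-level graph collapses all jobs of the same rule into a single node:

snakemake --rulegraph | dot -Tpdf > rulegraph.pdf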

7A. Use conda environments

If conda is to be used for rule-specific environments, you may find it useful to create the environments first. Running snakemake with the --conda-create-envs-only option creates the environments without running the pipeline. The --conda-prefix option sets the directory in which the 'conda' and 'conda-archive' directories are created; this may be pointed at a stable or shared location.

sbatch --mem 32G \
--wrap="\
snakemake \
--cores all \
--use-conda \
--conda-prefix ../condEnvs/ \
--conda-create-envs-only \
--conda-frontend conda"

Once the environments are set up, you may execute the pipeline with conda environments using the following command:

sbatch --constraint=westmere \
--wrap="\
snakemake \
-R \
-j 999 \
--use-conda \
--conda-prefix ../condEnvs/ \
--conda-frontend conda \
--latency-wait 100 \
--cluster-config config/cluster_config.yml \
--cluster '\
sbatch \
-A {cluster.account} \
-p {cluster.partition} \
--cpus-per-task {cluster.cpus-per-task} \
--mem {cluster.mem} \
--output {cluster.output}'"

7B. Use environment modules.

Rather than using conda environments, you may prefer the environment modules installed on your computing cluster. These modules are declared for each rule in workflow/Snakefile. They must be customized for your environment, and you must modify the Snakefile yourself.
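
For orientation, an envmodules block sits directly on a rule in the Snakefile and is only honored when snakemake runs with --use-envmodules. A sketch for the alignment rule; the module names and versions are placeholders for whatever your cluster provides:

rule run_bwa_mem:
    envmodules:
        "bwa/0.7.17",
        "samtools/1.14"
    shell:
        """
        bwa mem -t {params.bwaThreads} -SP5M {params.bwaIndex} {input.fq1} {input.fq2} | samtools view -Shb - > {output.bam}
        """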

To execute the pipeline with environment modules, enter the following:

sbatch --constraint=westmere \
--wrap="\
snakemake \
-R \
-j 999 \
--use-envmodules \
--latency-wait 100 \
--cluster-config config/cluster_config.yml \
--cluster '\
sbatch \
-A {cluster.account} \
-p {cluster.partition} \
--cpus-per-task {cluster.cpus-per-task} \
--mem {cluster.mem} \
--output {cluster.output}'"

8. Check results, and when finished, exit environment.

The results will be saved to the "results" folder. Look over the log files generated in either the logs/ or logs/snakelogs folders (depending on whether SLURM was used).

conda deactivate

References:

Goloborodko, A., Abdennur, N., Venev, S., Brandão, H., & Fudenberg, G. (2019). mirnylab/pairtools: v0.3.0 (v0.3.0) [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.2649383

Abdennur, N., Goloborodko, A., Imakaev, M., Kerpedjiev, P., Fudenberg, G., Oullette, S., Lee, S., Strobelt, H., Gehlenborg, N., & Mirny, L. (2021). open2c/cooler: v0.8.11 (v0.8.11) [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.4655850

Lee, S., Bakker, C. R., Vitzthum, C., Alver, B. H., & Park, P. J. (2022). Pairs and Pairix: a file format and a tool for efficient storage and retrieval for Hi-C read pairs. In C. Alkan (Ed.), Bioinformatics (Vol. 38, Issue 6, pp. 1729–1731). Oxford University Press (OUP). https://doi.org/10.1093/bioinformatics/btab870

Durand, N. C., Shamim, M. S., Machol, I., Rao, S. S. P., Huntley, M. H., Lander, E. S., & Aiden, E. L. (2016). Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Systems, 3(1), 95–98. https://doi.org/10.1016/j.cels.2016.07.002

Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008. https://doi.org/10.1093/gigascience/giab008

Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. ArXiv:1303.3997 [q-Bio]. http://arxiv.org/abs/1303.3997

Köster, J., & Rahmann, S. (2012). Snakemake—A scalable bioinformatics workflow engine. Bioinformatics (Oxford, England), 28(19), 2520–2522. https://doi.org/10.1093/bioinformatics/bts480

Anaconda Software Distribution. (2020). Anaconda Documentation. Anaconda Inc. Retrieved from https://docs.anaconda.com/

Code Snippets

# rule run_bwa_mem
shell:
    """
    bwa mem -t {params.bwaThreads} -SP5M {params.bwaIndex} {input.fq1} {input.fq2} | samtools view -Shb - > {output.bam}  
    """

# rule make_pairs_with_pairtools_parse
shell:
    """
    pairtools parse -c {params.chrom_sizes} --drop-sam --add-columns mapq --output {output.pairs} {input.bam}
    """

# rule sort_pairs_with_pairtools_sort
shell:
    """
    [ -d {params.tempdir} ] || mkdir {params.tempdir}
    pairtools sort --nproc {params.nproc} --memory {params.memory} --tmpdir {params.tempdir} --output {output.sorted_pairs} {input.pairs}
    rm -rf {params.tempdir}
    """

# merge sorted .pairs files
shell:
    """
    pairtools merge --output results/sorted_pairs/{params.filename}_sorted.pairs.gz {params.sorted_pairs_list}
    rm -rf results/temp_sorted_pairs/
    """

# rule mark_duplicates_with_pairtools_dedup
shell:
    """
    pairtools dedup --mark-dups --output-dups - --output-unmapped - --output {output.marked_pairs} {input.sorted_pairs}
    pairix {output.marked_pairs}
    """

# rule filter_pairs
shell:
    """
    # Select UU, UR, RU reads
    pairtools select '(pair_type == "UU") or (pair_type == "UR") or (pair_type == "RU")' --output-rest {output.unmapped} --output {params.temp_file} {input.marked_pairs}
    # Select reads overlapping chromosomes in chromosome sizes file
    pairtools select 'True' --chrom-subset {params.chrom_sizes} -o {output.dedup_pairs} {params.temp_file}
    pairix {output.dedup_pairs}

    # remove temp file
    rm {params.temp_file}
    """

# rule add_frag2Pairs
shell:
    """
    gunzip -ck {input.dedup_pairs} | workflow/scripts/fragment_4dnpairs.pl -a - {params.frag2_pairs_basename} {params.restriction_file}
    bgzip -f {params.frag2_pairs_basename}
    pairix -f {output.frag2_pairs}
    """

# merge fragment-annotated .pairs files and re-index
shell:
    """
    pairtools merge --output results/frag2_pairs/{params.filename}.frag2pairs.pairs.gz {params.frag2_pairs_list}
    pairix -f {output.merged}
    rm -rf results/temp_frag2_pairs/
    """

# rule run_cooler
shell:
    """
    # use for all chromosomes and contigs
    cp {params.chrom_sizes} {params.cooler_tempchrsize}

    # the cload command requires the chrom size file to exist besides the chrom size bin file.
    cooler cload pairix -p {params.cooler_n_cores} -s {params.cooler_max_split} {params.cooler_tempchrsize}:{params.cooler_bin_size} {input.frag2_pairs} {output.cooler}
    """

# generate a .hic contact file with juicer_tools pre
shell:
    """
    juicer_tools pre -r {params.custom_resolutions} -j {params.nproc} -q {params.mapQ_minimum} {input.frag2_pairs} {output.hic_file} {params.chrom_sizes}
    """

# call A/B compartments per chromosome with juicer_tools eigenvector
shell:
    """
    chromosomes=($(echo {params.chromosomes} | tr "," "\n"))
    mkdir results/compartments/temp_{params.sampleName}/
    grep ${{chromosomes[@]/#/-e }} {params.chrom_sizes} | sort | uniq > results/compartments/temp_{params.sampleName}/chromosome.sizes
    bedtools makewindows -g results/compartments/temp_{params.sampleName}/chromosome.sizes -w {params.resolution} > results/compartments/temp_{params.sampleName}/{params.resolution}_windows.bed
    parallel juicer_tools eigenvector -p {params.normalization} {input.hic_file} {{.}} BP {params.resolution} results/compartments/temp_{params.sampleName}/{params.sampleName}_eigen_{{.}}.txt ::: ${{chromosomes[@]}}
    parallel grep -P {{.}}'\t' results/compartments/temp_{params.sampleName}/{params.resolution}_windows.bed '>' results/compartments/temp_{params.sampleName}/{params.sampleName}_eigen_{{.}}.bed ::: ${{chromosomes[@]}}
    parallel paste results/compartments/temp_{params.sampleName}/{params.sampleName}_eigen_{{.}}.bed results/compartments/temp_{params.sampleName}/{params.sampleName}_eigen_{{.}}.txt '>' results/compartments/{params.sampleName}_eigen_{{.}}.bedgraph ::: ${{chromosomes[@]}}
    gzip --stdout results/compartments/{params.sampleName}_eigen*.bedgraph > {output.compartment_file}
    rm -rf results/compartments/temp_{params.sampleName}/
    """

# call TADs with juicer_tools arrowhead
shell:
    """
    juicer_tools48g arrowhead --threads {params.threads} -k {params.normalization} -m {params.sliding_window} -r {params.resolution} {input.hic_file} results/TADs/{params.outputDirectory}
    """

URL: https://github.com/SansamLab/HiC-SnakeMake
License: MIT License