Beginning analysis of paired-end ATAC-seq data


Authors

  • Rowan Callahan (@callahro)

Usage

Simple

Step 1: Install workflow

If you simply want to use this workflow, download and extract the latest release. If you intend to modify and further extend this workflow or want to work under version control, fork this repository as outlined in Advanced. The latter way is recommended.

Installing the workflow involves cloning this repository and making sure Snakemake is on your path. Running which snakemake will tell you which Snakemake executable will be used. Once you have cloned the workflow and checked your path, make sure your configuration file is correctly formatted.
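For example, installing and checking the workflow might look like the sketch below (the clone location and conda setup are up to you; the repository URL is the one listed for this workflow):

    # clone the workflow into a working directory
    git clone https://github.com/maxsonBraunLab/atac_seq---old.git
    cd atac_seq---old

    # confirm which snakemake executable is on your path
    which snakemake
    snakemake --version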

Step 2: Configure workflow

Configure the workflow according to your needs by editing the file config.yaml. Alternatively, you can create your own configuration file and specify it when running Snakemake with snakemake --configfile my_config_file.yaml. The example configuration file shows which values are needed and where.

An example configuration file is included with information about which indices and databases to use. Some of these configuration values are paths to databases or whole genomes, and some of these databases have already been downloaded by me. If you are using human sample data you can probably use the defaults included in the file titled Snakefile. This file also lists the final target outputs that will be created by the pipeline and can be edited if you want to include or exclude certain final outputs.
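As a minimal sketch, assuming you want to keep your edits out of the shipped config.yaml, you could copy it and point Snakemake at your copy (the file name my_config_file.yaml is just an example):

    # start from the example configuration shipped with the workflow
    cp config.yaml my_config_file.yaml

    # edit paths to indices, genomes, and the metadata file, then dry-run against your copy
    snakemake --use-conda -n --configfile my_config_file.yaml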

Metadata file configuration

The metadata file is required to have two columns, one titled SampleID and the other titled Condition, and it must be tab separated. When you run this pipeline as a dry run it will build a list of the sample names that it has found and print them out. The script works by matching these sample names exactly to the SampleID strings present in the metadata file. One exception is the remove_filepath_regex: option in the configuration file, which performs a string replacement on the listed sample names before trying to match them to the SampleID strings. This is useful when a large part of the filepath does not contain sample identifiers and is instead repeated across samples: write only the part of the filepath that you care about in SampleID and cut off the rest with remove_filepath_regex:.

An example of this is as follows:

Let's say I have samples in my sample list called ['sample1_long_filepath_to_ignore', 'sample2_long_filepath_to_ignore']. I can set remove_filepath_regex: '_long_filepath_to_ignore' in my config file. The matching now looks for SampleID values like ['sample1', 'sample2'], and these matches are used to build a dictionary that assigns each sample name to the condition it belongs to.
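As a minimal sketch of that matching, assuming a tab-separated metadata.tsv and made-up condition labels, the lookup amounts to stripping the repeated suffix and matching the result against SampleID:

    # hypothetical metadata file: tab separated, with SampleID and Condition columns
    printf 'SampleID\tCondition\nsample1\ttreated\nsample2\tcontrol\n' > metadata.tsv

    # strip the repeated part of the filepath, then look up the condition for the trimmed name
    sample='sample1_long_filepath_to_ignore'
    trimmed=${sample/_long_filepath_to_ignore/}
    awk -F'\t' -v id="$trimmed" '$1 == id {print $2}' metadata.tsv   # prints: treated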

Step 3: Execute workflow

Test your configuration by performing a dry-run via

snakemake --use-conda -n

This pipeline requires conda to run, as all of its external dependencies are installed with conda. It is important that --use-conda is passed every time the pipeline is run.

It is important that you remember to specify a profile path using --profile /path/to/profile/folder. With some versions of Snakemake this can lead to an error saying that a profile configuration file was not found; if this happens, provide the path to the profile configuration file directly with --profile /path/to/profile/config.yaml.

It is possible to submit your snakemake command with sbatch, but you can also just run it inside a terminal multiplexer like screen or tmux, since the Snakemake scheduler process does not take up much CPU or memory. However, it is important that your profile is set up so that jobs are sent to the queue rather than run on the head node.

Any of these options can be used in combination with the execution modes above. See the Snakemake documentation for further details.
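Putting these pieces together, a typical launch might look like the sketch below (the profile path, session name, and the choice of tmux rather than sbatch are placeholders for your own setup):

    # start a detachable session so the scheduler process survives logout
    tmux new -s atac_seq

    # dry-run first, then launch with conda environments and your cluster profile
    snakemake --use-conda -n
    snakemake --use-conda --profile /path/to/profile/folder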

Advanced

The following recipe provides established best practices for running and extending this workflow in a reproducible way.

  1. Fork the repo to a personal or lab account.

  2. Clone the fork to the desired working directory for the concrete project/run on your machine (steps 2-5 are sketched as shell commands after this list).

  3. Create a new branch (the project-branch) within the clone and switch to it. The branch will contain any project-specific modifications (e.g. to configuration, but also to code).

  4. Modify the config, and any necessary sheets (and probably the workflow) as needed.

  5. Commit any changes and push the project-branch to your fork on GitHub.

  6. Run the analysis.

  7. Optional: Merge back any valuable and generalizable changes to the upstream repo via a pull request. This would be greatly appreciated.

  8. Optional: Push results (plots/tables) to the remote branch on your fork.

  9. Optional: Create a self-contained workflow archive for publication along with the paper (snakemake --archive).

  10. Optional: Delete the local clone/workdir to free space.
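Steps 2 through 5 of this recipe might look roughly like the following (the fork URL, branch name, and file names are illustrative):

    # clone your fork and create a project-specific branch
    git clone https://github.com/<your-account>/atac_seq---old.git
    cd atac_seq---old
    git checkout -b my-project

    # edit config.yaml, the metadata sheet, and any rules you need, then commit and push
    git add config.yaml metadata.tsv
    git commit -m "configure workflow for my project"
    git push -u origin my-project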

Code Snippets

shell:
    """
    # generate a raw and a smoothed bigwig coverage track (5 bp bins; 15 bp smoothing for the second track)
    bamCoverage --bam {input.quality} -o {output.bigwig_rough} --numberOfProcessors {threads} --skipNAs --binSize 5
    bamCoverage --bam {input.quality} -o {output.bigwig} --numberOfProcessors {threads} --skipNAs --binSize 5 --smoothLength 15
    """
shell:
    """
    # sort and index the merged bam before any subsampling
    sambamba sort -t {threads} {input.bamfile} -o {output.input_sort}
    sambamba index -t {threads} {output.input_sort} {output.input_index}
    export PATH=$PATH:/opt/installed/samtools-1.6/bin/
    #if [[ $(grep {wildcards.merged_sample} {input.downsample_list}) ]]; then sambamba view --subsample=0.$(grep {wildcards.merged_sample} {input.downsample_list}| awk '{{print $2}}') --subsampling-seed=2 {output.input_sort} -o {output.bamfile}; else cp {output.input_sort} {output.bamfile}; fi
    # if this sample appears in the downsample list, subsample with samtools view -s (seed 1 followed by the fraction digits from the list); otherwise keep every read
    if [[ $(grep {wildcards.merged_sample} {input.downsample_list}) ]]; then samtools view -@ 4 -b -h -s 1.$(grep {wildcards.merged_sample} {input.downsample_list}| awk '{{print $2}}') {output.input_sort} -o {output.bamfile}; else cp {output.input_sort} {output.bamfile}; fi
    sambamba index -t {threads} {output.bamfile} {output.index}
    """
run:
    import pandas as pd
    dataframes = []
    for file in input.readsfile:
        dataframes.append(pd.read_csv(file, names=["name","description","reads"]))
    merged_dataframe = pd.concat(dataframes, ignore_index=True)
    print(merged_dataframe)
    # normalize to the final processed read counts ("satisfy_quality")
    filtered = merged_dataframe[merged_dataframe.description=="satisfy_quality"]
    print(filtered)
    # some counts may have been written more than once; identical rows are dropped
    filtered = filtered.drop_duplicates()
    print(filtered)
    # fraction of each sample to keep so that every sample matches the smallest library
    filtered['downsample'] = pd.to_numeric(filtered['reads'].min()) / pd.to_numeric(filtered['reads'])
    print(filtered)
    filtered.to_csv(output.filtered_list, header=True, index=False, sep='\t')
    out_data = filtered.loc[:,["name", "downsample"]]
    normalizing_sample = out_data[out_data.downsample == 1] # which one is the normalizing sample
    print(normalizing_sample)
    out_data = out_data[out_data.downsample < 1] # only keep samples that actually need downsampling
    # keep the first 8 decimal digits of the fraction; zero-pad so fractions below 0.1 keep their
    # leading zeros when "1." (seed and decimal point) is prepended in the downsampling rule
    out_data["downsample"] = (round(out_data["downsample"], 8)*100000000).astype(int).astype(str).str.zfill(8)
    out_data.to_csv(output.downsample_list, header=False, index=False, sep='\t') # write everything out as a tab-separated file
shell:
    """
    # shift read coordinates to account for the Tn5 transposase insertion offset
    alignmentSieve --bam {input.bamfile} --outFile {output.bamfile} --numberOfProcessors {threads} --ATACshift
    """
shell:
    """
    sambamba index -t {threads} {input.full_bam} {output.indexed_full}
    sambamba index -t {threads} {input.deduplicated} {output.indexed}

    """
shell:
   """
   # -F 1804 drops reads that are unmapped, have an unmapped mate, are secondary alignments, fail QC, or are duplicates
   samtools view -h -@ 4 -b -F 1804 {input.bamfile} > {output.quality}
   # record how many reads pass the quality filter
   samtools view -@ 4 -c {output.quality}| xargs -I{{}} echo "{wildcards.merged_sample},satisfy_quality,"{{}}$'\n'  >> {params.readsfile}
   """
shell:
     """
     # mark duplicates, then record the read count excluding duplicates (-F 1024)
     sambamba markdup -t {threads} {input.bamfile} {output.deduplicated}
     samtools view -@ 4 -F 1024 -c {output.deduplicated}| xargs -I{{}} echo "{wildcards.merged_sample},no_duplicates,"{{}}$'\n'  >> {params.readsfile}
     """
shell:
    """
    # record the starting read count, remove mitochondrial (chrM) reads, sort, and record the post-filter count
    samtools view -@ 4 -c {input.bamfile}| xargs -I{{}} echo "{wildcards.merged_sample},starting_reads,"{{}}$'\n' >> {output.readsfile}
    samtools view -h -@ {threads} {input.bamfile}| grep -v chrM| samtools sort -@ {threads} -O BAM > {output.bamfile}
    samtools view -@ 4 -c {output.bamfile}| xargs -I{{}} echo "{wildcards.merged_sample},no_ChrM,"{{}}$'\n' >> {output.readsfile}
    """
shell:
     """ cp {input.lane1} {output.merged_bamfile} """
shell:
    """
    multiBigwigSummary bins -b {params.bigwigs} --labels {params.names} -out {output.scores} -p {threads}

    plotCorrelation -in {output.scores} --corMethod pearson --skipZeros --plotTitle "Pearson Correlation of Read Counts"  --whatToPlot heatmap --colorMap RdYlBu --plotNumbers -o {output.correlation_plot} 
    """
shell:
    """
    export PATH=$PATH:/home/groups/MaxsonLab/software/bedtools/
    bedtools multiinter -i {params.bedfiles} -header -names {params.names} > {output.bedfile}

    """
shell:
    """
    export PATH=$PATH:/home/groups/MaxsonLab/software/bedtools/
    bedtools intersect -wa -a {input.read_catalog} -b {input.sample_condition_catalog} > {output.sample_condition_catalog}

    """
shell:
    """
    wc -l {input.summits}| awk '{{print $1}}'| xargs -I {{}} echo "{wildcards.merged_sample},unfiltered_peaks,"{{}}$'\n' >> {params.readsfile} 
    touch {output.done}
    """
shell:
    """
    export PATH=$PATH:/opt/installed/samtools-1.6/bin/
    samtools flagstat {input.bamfile} > {params.flagstat}
    grep total {params.flagstat}|awk '{{print $1}}'|xargs -I {{}} echo "{wildcards.merged_sample},total_reads_trimd,"{{}}$'\n' >> {params.readsfile} 
    grep "[0-9] mapped" {params.flagstat} |awk '{{print $1}}' | xargs -I {{}} echo "{wildcards.merged_sample},mapping_reads,"{{}}$'\n' >> {params.readsfile} 
    awk '{{sum +=$7}} END{{print sum}}' {input.read_count}| xargs -I {{}} echo "{wildcards.merged_sample},reads_in_peaks,"{{}}$'\n' >> {params.readsfile} 
    touch {output.done}
    """  
shell:
    """
    export PATH=$PATH:/opt/installed/samtools-1.6/bin/
    samtools merge -@ {threads} - {params.merge_input} > {output.bamfile}
    samtools index {output.bamfile}
    bamCoverage --bam {output.bamfile} -o {output.bigwig} --numberOfProcessors {threads} --skipNAs --binSize 5 --smoothLength 15  
    """
shell:
    """
    bedtools genomecov -ibam {input.non_downsampled} -bg > {output.non_downsampled_bedGraph}

    """
shell:
    """
    bedSort {input.non_downsampled_bedGraph} {output.non_downsampled_sorted_bedGraph}

    """
shell:
    """
    bedGraphToBigWig {input.non_downsampled_sorted_bedGraph} {params.chrom_size} {output.non_downsampled_bigwig}

    """
run:
    import pandas as pd
    dataframes = []
    for file in input.read_count:
        dataframes.append(pd.read_csv(file, sep='\t'))
    merged = pd.concat(dataframes, axis=1)
    merged = merged.loc[:,~merged.columns.duplicated()]   
    merged = merged.sort_index(axis=1)
    merged.to_csv(output.read_count, header=True, index=False, sep='\t')
shell:
    """
    export PATH=$PATH:/opt/installed/samtools-1.6/bin/
    samtools sort -@ 4 {input.bamfile} > {output.bamfile}
    samtools index {output.bamfile}

    bedtools multicov -bams {output.bamfile} -bed {input.read_catalog} >> {output.read_count}
    sed -i "1i {params.header}" {output.read_count}

    """
shell:
    """
    echo "" > {params.merged_bed}
    cat {params.peaks_input} >> {params.merged_bed}
    Rscript --vanilla {params.script_file} {params.merged_bed} {params.blacklist} {params.present_in_number} {output.reads_catalog_bed} {params.genome} {params.metadata} {params.regex} 
    """
shell:
    """
    macs2 callpeak --treatment {input.bamfile} --name {params.name} --format BAMPE --gsize {params.genome} --outdir {params.directory} --broad
    """
run:
    import pandas as pd
    dataframes = []
    for file in input.read_count:
        dataframes.append(pd.read_csv(file, sep='\t'))
    merged = pd.concat(dataframes, axis=1)
    merged = merged.loc[:,~merged.columns.duplicated()]   
    merged = merged.sort_index(axis=1)
    merged.to_csv(output.read_count, header=True, index=False, sep='\t')
shell:
    """
    bedtools multicov -bams {input.bamfile} -bed {input.read_catalog} >> {output.read_count}
    sed -i "1i {params.header}" {output.read_count}

    """
shell:
    """
    annotatePeaks.pl {input.bedfile}  {params.genome} > {output.read_count}
    """
shell:
    """
    echo "" > {params.merged_bed}
    cat {params.peaks_input} >> {params.merged_bed}
    Rscript --vanilla {params.script_file} {params.merged_bed} {params.blacklist} {params.present_in_number} {params.reads_catalog_bed} {params.genome} {params.metadata} {params.regex} 
    """
shell:
    """
    macs2 callpeak --treatment {input.bamfile} --name {params.name} --format BAMPE --gsize {params.genome} --outdir {params.directory} --broad
    """
shell:
    """samtools sort -T {params.outdirectory} -@ {threads} -o {output.sorted_bamfile} {input.bamfile} """
shell:
    """samtools view -bh -@ {threads} -o {output.bamfile} {input.samfile}"""
shell:
    """bwa mem -t {threads} {params.reference} {input.forward_paired} {input.reverse_paired} > {output.samfile} """
shell:
    """trimmomatic PE {input.forward_strand} {input.reverse_strand} {output.forward_paired} {output.forward_unpaired} {output.reverse_paired} {output.reverse_unpaired}   ILLUMINACLIP:{params.adapter}:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:30"""
shell:
    """    
    fastqc --outdir  {params.outdir} --extract  -f fastq {input.forward_strand} {input.reverse_strand}
    touch {output.fastqc_done}
    """
shell:
    """
    fastq_screen --aligner bowtie2 --conf {params.conf} --outdir {params.out_dir} {input.forward_strand}
    fastq_screen --aligner bowtie2 --conf {params.conf} --outdir {params.out_dir} {input.reverse_strand}
    touch {output.fastq_screen_done}
    """


Maintainers: public
URL: https://github.com/maxsonBraunLab/atac_seq---old
Name: atac_seq-old
Version: 1
Copyright: Public Domain
License: None