Beginning analysis of paired-end ATAC-seq data


Authors

  • Rowan Callahan (@callahro)

Usage

Simple

Step 1: Install workflow

If you simply want to use this workflow, download and extract the latest release. If you intend to modify and further extend this workflow or want to work under version control, fork this repository as outlined in Advanced. The latter way is recommended.

Installing the workflow involves cloning this repository and making sure Snakemake is on your path. Running which snakemake will tell you which Snakemake executable will be used. Once you have cloned the workflow and checked your path, make sure your configuration file is correctly formatted.
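For example, installing and checking the workflow might look like the sketch below (the clone location and conda setup are up to you; the repository URL is the one listed for this workflow):

    # clone the workflow into a working directory
    git clone https://github.com/maxsonBraunLab/atac_seq---old.git
    cd atac_seq---old

    # confirm which snakemake executable is on your path
    which snakemake
    snakemake --version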

Step 2: Configure workflow

Configure the workflow according to your needs by editing the file config.yaml. Alternatively, you can create your own configuration file and specify it when running Snakemake with snakemake --configfile my_config_file.yaml. The example configuration file shows which values are needed and where.

An example configuration file is included with information about which indices and databases to use. Some of these configuration values are paths to databases or whole genomes, and some of these databases have already been downloaded by me. If you are using human sample data you can probably use the defaults included in the file titled Snakefile. This file also lists the final target outputs that will be created by the pipeline and can be edited if you want to include or exclude certain final outputs.
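As a minimal sketch, assuming you want to keep your edits out of the shipped config.yaml, you could copy it and point Snakemake at your copy (the file name my_config_file.yaml is just an example):

    # start from the example configuration shipped with the workflow
    cp config.yaml my_config_file.yaml

    # edit paths to indices, genomes, and the metadata file, then dry-run against your copy
    snakemake --use-conda -n --configfile my_config_file.yaml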

Metadata file configuration

The metadata file is required to have two columns, one titled SampleID and the other titled Condition, and it must be tab separated. When you run this pipeline as a dry run it will build a list of the sample names that it has found and print them out. The script works by matching these sample names exactly to the SampleID strings present in the metadata file. One exception is the remove_filepath_regex: option in the configuration file, which performs a string replacement on the listed sample names before trying to match them to the SampleID strings. This is useful when a large part of the filepath does not contain sample identifiers and is instead repeated across samples: write only the part of the filepath that you care about in SampleID and cut off the rest with remove_filepath_regex:.

An example of this is as follows:

Let's say I have samples in my sample list called ['sample1_long_filepath_to_ignore', 'sample2_long_filepath_to_ignore']. I can set remove_filepath_regex: '_long_filepath_to_ignore' in my config file. The matching now looks for SampleID values like ['sample1', 'sample2'], and these matches are used to build a dictionary that assigns each sample name to the condition it belongs to.
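As a minimal sketch of that matching, assuming a tab-separated metadata.tsv and made-up condition labels, the lookup amounts to stripping the repeated suffix and matching the result against SampleID:

    # hypothetical metadata file: tab separated, with SampleID and Condition columns
    printf 'SampleID\tCondition\nsample1\ttreated\nsample2\tcontrol\n' > metadata.tsv

    # strip the repeated part of the filepath, then look up the condition for the trimmed name
    sample='sample1_long_filepath_to_ignore'
    trimmed=${sample/_long_filepath_to_ignore/}
    awk -F'\t' -v id="$trimmed" '$1 == id {print $2}' metadata.tsv   # prints: treated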

Step 3: Execute workflow

Test your configuration by performing a dry-run via

snakemake --use-conda -n

This pipeline requires conda to run, as all of its external dependencies are installed with conda. It is important that --use-conda is passed every time the pipeline is run.

It is important that you remember to specify a profile path using --profile /path/to/profile/folder. With some versions of Snakemake this can lead to an error saying that a profile configuration file was not found; if this happens, provide the path to the profile configuration file directly with --profile /path/to/profile/config.yaml.

It is possible to submit your snakemake command with sbatch, but you can also just run it inside a terminal multiplexer like screen or tmux, since the Snakemake scheduler process does not take up much CPU or memory. However, it is important that your profile is set up so that jobs are sent to the queue rather than run on the head node.

Any of these options can be used in combination with the execution modes above. See the Snakemake documentation for further details.
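Putting these pieces together, a typical launch might look like the sketch below (the profile path, session name, and the choice of tmux rather than sbatch are placeholders for your own setup):

    # start a detachable session so the scheduler process survives logout
    tmux new -s atac_seq

    # dry-run first, then launch with conda environments and your cluster profile
    snakemake --use-conda -n
    snakemake --use-conda --profile /path/to/profile/folder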

Advanced

The following recipe provides established best practices for running and extending this workflow in a reproducible way.

  1. Fork the repo to a personal or lab account.

  2. Clone the fork to the desired working directory for the concrete project/run on your machine (steps 2-5 are sketched as shell commands after this list).

  3. Create a new branch (the project-branch) within the clone and switch to it. The branch will contain any project-specific modifications (e.g. to configuration, but also to code).

  4. Modify the config, and any necessary sheets (and probably the workflow) as needed.

  5. Commit any changes and push the project-branch to your fork on GitHub.

  6. Run the analysis.

  7. Optional: Merge back any valuable and generalizable changes to the upstream repo via a pull request. This would be greatly appreciated.

  8. Optional: Push results (plots/tables) to the remote branch on your fork.

  9. Optional: Create a self-contained workflow archive for publication along with the paper (snakemake --archive).

  10. Optional: Delete the local clone/workdir to free space.
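Steps 2 through 5 of this recipe might look roughly like the following (the fork URL, branch name, and file names are illustrative):

    # clone your fork and create a project-specific branch
    git clone https://github.com/<your-account>/atac_seq---old.git
    cd atac_seq---old
    git checkout -b my-project

    # edit config.yaml, the metadata sheet, and any rules you need, then commit and push
    git add config.yaml metadata.tsv
    git commit -m "configure workflow for my project"
    git push -u origin my-project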

Code Snippets

shell:
    """
    # generate a raw and a smoothed bigwig coverage track (5 bp bins; 15 bp smoothing for the second track)
    bamCoverage --bam {input.quality} -o {output.bigwig_rough} --numberOfProcessors {threads} --skipNAs --binSize 5
    bamCoverage --bam {input.quality} -o {output.bigwig} --numberOfProcessors {threads} --skipNAs --binSize 5 --smoothLength 15
    """
shell:
    """
    # sort and index the merged bam before any subsampling
    sambamba sort -t {threads} {input.bamfile} -o {output.input_sort}
    sambamba index -t {threads} {output.input_sort} {output.input_index}
    export PATH=$PATH:/opt/installed/samtools-1.6/bin/
    #if [[ $(grep {wildcards.merged_sample} {input.downsample_list}) ]]; then sambamba view --subsample=0.$(grep {wildcards.merged_sample} {input.downsample_list}| awk '{{print $2}}') --subsampling-seed=2 {output.input_sort} -o {output.bamfile}; else cp {output.input_sort} {output.bamfile}; fi
    # if this sample appears in the downsample list, subsample with samtools view -s (seed 1 followed by the fraction digits from the list); otherwise keep every read
    if [[ $(grep {wildcards.merged_sample} {input.downsample_list}) ]]; then samtools view -@ 4 -b -h -s 1.$(grep {wildcards.merged_sample} {input.downsample_list}| awk '{{print $2}}') {output.input_sort} -o {output.bamfile}; else cp {output.input_sort} {output.bamfile}; fi
    sambamba index -t {threads} {output.bamfile} {output.index}
    """
run:
    import pandas as pd
    dataframes = []
    for file in input.readsfile:
        dataframes.append(pd.read_csv(file, names=["name","description","reads"]))
    merged_dataframe = pd.concat(dataframes, ignore_index=True)
    print(merged_dataframe)
    # normalize to the final processed read counts ("satisfy_quality")
    filtered = merged_dataframe[merged_dataframe.description=="satisfy_quality"]
    print(filtered)
    # some counts may have been written more than once; identical rows are dropped
    filtered = filtered.drop_duplicates()
    print(filtered)
    # fraction of each sample to keep so that every sample matches the smallest library
    filtered['downsample'] = pd.to_numeric(filtered['reads'].min()) / pd.to_numeric(filtered['reads'])
    print(filtered)
    filtered.to_csv(output.filtered_list, header=True, index=False, sep='\t')
    out_data = filtered.loc[:,["name", "downsample"]]
    normalizing_sample = out_data[out_data.downsample == 1] # which one is the normalizing sample
    print(normalizing_sample)
    out_data = out_data[out_data.downsample < 1] # only keep samples that actually need downsampling
    # keep the first 8 decimal digits of the fraction; zero-pad so fractions below 0.1 keep their
    # leading zeros when "1." (seed and decimal point) is prepended in the downsampling rule
    out_data["downsample"] = (round(out_data["downsample"], 8)*100000000).astype(int).astype(str).str.zfill(8)
    out_data.to_csv(output.downsample_list, header=False, index=False, sep='\t') # write everything out as a tab-separated file
shell:
    """
    # shift read coordinates to account for the Tn5 transposase insertion offset
    alignmentSieve --bam {input.bamfile} --outFile {output.bamfile} --numberOfProcessors {threads} --ATACshift
    """
shell:
    """
    sambamba index -t {threads} {input.full_bam} {output.indexed_full}
    sambamba index -t {threads} {input.deduplicated} {output.indexed}

    """
shell:
   """
   # -F 1804 drops reads that are unmapped, have an unmapped mate, are secondary alignments, fail QC, or are duplicates
   samtools view -h -@ 4 -b -F 1804 {input.bamfile} > {output.quality}
   # record how many reads pass the quality filter
   samtools view -@ 4 -c {output.quality}| xargs -I{{}} echo "{wildcards.merged_sample},satisfy_quality,"{{}}$'\n'  >> {params.readsfile}
   """
shell:
     """
     # mark duplicates, then record the read count excluding duplicates (-F 1024)
     sambamba markdup -t {threads} {input.bamfile} {output.deduplicated}
     samtools view -@ 4 -F 1024 -c {output.deduplicated}| xargs -I{{}} echo "{wildcards.merged_sample},no_duplicates,"{{}}$'\n'  >> {params.readsfile}
     """
shell:
    """
    # record the starting read count, remove mitochondrial (chrM) reads, sort, and record the post-filter count
    samtools view -@ 4 -c {input.bamfile}| xargs -I{{}} echo "{wildcards.merged_sample},starting_reads,"{{}}$'\n' >> {output.readsfile}
    samtools view -h -@ {threads} {input.bamfile}| grep -v chrM| samtools sort -@ {threads} -O BAM > {output.bamfile}
    samtools view -@ 4 -c {output.bamfile}| xargs -I{{}} echo "{wildcards.merged_sample},no_ChrM,"{{}}$'\n' >> {output.readsfile}
    """
shell:
     """ cp {input.lane1} {output.merged_bamfile} """
shell:
    """
    multiBigwigSummary bins -b {params.bigwigs} --labels {params.names} -out {output.scores} -p {threads}

    plotCorrelation -in {output.scores} --corMethod pearson --skipZeros --plotTitle "Pearson Correlation of Read Counts"  --whatToPlot heatmap --colorMap RdYlBu --plotNumbers -o {output.correlation_plot} 
    """
shell:
    """
    export PATH=$PATH:/home/groups/MaxsonLab/software/bedtools/
    bedtools multiinter -i {params.bedfiles} -header -names {params.names} > {output.bedfile}

    """
shell:
    """
    export PATH=$PATH:/home/groups/MaxsonLab/software/bedtools/
    bedtools intersect -wa -a {input.read_catalog} -b {input.sample_condition_catalog} > {output.sample_condition_catalog}

    """
shell:
    """
    wc -l {input.summits}| awk '{{print $1}}'| xargs -I {{}} echo "{wildcards.merged_sample},unfiltered_peaks,"{{}}$'\n' >> {params.readsfile} 
    touch {output.done}
    """
shell:
    """
    export PATH=$PATH:/opt/installed/samtools-1.6/bin/
    samtools flagstat {input.bamfile} > {params.flagstat}
    grep total {params.flagstat}|awk '{{print $1}}'|xargs -I {{}} echo "{wildcards.merged_sample},total_reads_trimd,"{{}}$'\n' >> {params.readsfile} 
    grep "[0-9] mapped" {params.flagstat} |awk '{{print $1}}' | xargs -I {{}} echo "{wildcards.merged_sample},mapping_reads,"{{}}$'\n' >> {params.readsfile} 
    awk '{{sum +=$7}} END{{print sum}}' {input.read_count}| xargs -I {{}} echo "{wildcards.merged_sample},reads_in_peaks,"{{}}$'\n' >> {params.readsfile} 
    touch {output.done}
    """  
shell:
    """
    export PATH=$PATH:/opt/installed/samtools-1.6/bin/
    samtools merge -@ {threads} - {params.merge_input} > {output.bamfile}
    samtools index {output.bamfile}
    bamCoverage --bam {output.bamfile} -o {output.bigwig} --numberOfProcessors {threads} --skipNAs --binSize 5 --smoothLength 15  
    """
shell:
    """
    bedtools genomecov -ibam {input.non_downsampled} -bg > {output.non_downsampled_bedGraph}

    """
shell:
    """
    bedSort {input.non_downsampled_bedGraph} {output.non_downsampled_sorted_bedGraph}

    """
shell:
    """
    bedGraphToBigWig {input.non_downsampled_sorted_bedGraph} {params.chrom_size} {output.non_downsampled_bigwig}

    """
run:
    import pandas as pd
    dataframes = []
    for file in input.read_count:
        dataframes.append(pd.read_csv(file, sep='\t'))
    merged = pd.concat(dataframes, axis=1)
    merged = merged.loc[:,~merged.columns.duplicated()]   
    merged = merged.sort_index(axis=1)
    merged.to_csv(output.read_count, header=True, index=False, sep='\t')
shell:
    """
    export PATH=$PATH:/opt/installed/samtools-1.6/bin/
    samtools sort -@ 4 {input.bamfile} > {output.bamfile}
    samtools index {output.bamfile}

    bedtools multicov -bams {output.bamfile} -bed {input.read_catalog} >> {output.read_count}
    sed -i "1i {params.header}" {output.read_count}

    """
shell:
    """
    echo "" > {params.merged_bed}
    cat {params.peaks_input} >> {params.merged_bed}
    Rscript --vanilla {params.script_file} {params.merged_bed} {params.blacklist} {params.present_in_number} {output.reads_catalog_bed} {params.genome} {params.metadata} {params.regex} 
    """
shell:
    """
    macs2 callpeak --treatment {input.bamfile} --name {params.name} --format BAMPE --gsize {params.genome} --outdir {params.directory} --broad
    """
run:
    import pandas as pd
    dataframes = []
    for file in input.read_count:
        dataframes.append(pd.read_csv(file, sep='\t'))
    merged = pd.concat(dataframes, axis=1)
    merged = merged.loc[:,~merged.columns.duplicated()]   
    merged = merged.sort_index(axis=1)
    merged.to_csv(output.read_count, header=True, index=False, sep='\t')
shell:
    """
    bedtools multicov -bams {input.bamfile} -bed {input.read_catalog} >> {output.read_count}
    sed -i "1i {params.header}" {output.read_count}

    """
shell:
    """
    annotatePeaks.pl {input.bedfile}  {params.genome} > {output.read_count}
    """
shell:
    """
    echo "" > {params.merged_bed}
    cat {params.peaks_input} >> {params.merged_bed}
    Rscript --vanilla {params.script_file} {params.merged_bed} {params.blacklist} {params.present_in_number} {params.reads_catalog_bed} {params.genome} {params.metadata} {params.regex} 
    """
shell:
    """
    macs2 callpeak --treatment {input.bamfile} --name {params.name} --format BAMPE --gsize {params.genome} --outdir {params.directory} --broad
    """
shell:
    """samtools sort -T {params.outdirectory} -@ {threads} -o {output.sorted_bamfile} {input.bamfile} """
shell:
    """samtools view -bh -@ {threads} -o {output.bamfile} {input.samfile}"""
shell:
    """bwa mem -t {threads} {params.reference} {input.forward_paired} {input.reverse_paired} > {output.samfile} """
shell:
    """trimmomatic PE {input.forward_strand} {input.reverse_strand} {output.forward_paired} {output.forward_unpaired} {output.reverse_paired} {output.reverse_unpaired}   ILLUMINACLIP:{params.adapter}:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:30"""
shell:
    """    
    fastqc --outdir  {params.outdir} --extract  -f fastq {input.forward_strand} {input.reverse_strand}
    touch {output.fastqc_done}
    """
shell:
    """
    fastq_screen --aligner bowtie2 --conf {params.conf} --outdir {params.out_dir} {input.forward_strand}
    fastq_screen --aligner bowtie2 --conf {params.conf} --outdir {params.out_dir} {input.reverse_strand}
    touch {output.fastq_screen_done}
    """


Maintainers: public
URL: https://github.com/maxsonBraunLab/atac_seq---old
Name: atac_seq-old
Version: 1
Copyright: Public Domain
License: None