MNase-seq analysis pipeline using BWA and DANPOS2.


Introduction

nf-core/mnaseseq is a bioinformatics analysis pipeline used to analyse DNA sequencing data obtained via micrococcal nuclease digestion (MNase-seq).

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with Docker containers, making installation trivial and results highly reproducible.

Pipeline summary

  1. Raw read QC ( FastQC )

  2. Adapter trimming ( Trim Galore! )

  3. Alignment ( BWA )

  4. Mark duplicates ( picard )

  5. Merge alignments from multiple libraries of the same sample ( picard )

    1. Re-mark duplicates ( picard )

    2. Filtering to remove (a minimal sketch of this step follows the summary):

      • reads mapping to blacklisted regions ( SAMtools , BEDTools )

      • reads that are marked as duplicates ( SAMtools )

      • reads that aren't marked as primary alignments ( SAMtools )

      • reads that are unmapped ( SAMtools )

      • reads that map to multiple locations ( SAMtools )

      • reads containing > 4 mismatches ( BAMTools )

      • reads that are soft-clipped ( BAMTools )

      • reads that have an insert size outside the specified range ( BAMTools ; paired-end only )

      • reads that map to different chromosomes ( Pysam ; paired-end only )

      • reads that aren't in FR orientation ( Pysam ; paired-end only )

      • reads where only one read of the pair fails the above criteria ( Pysam ; paired-end only )

    3. Alignment-level QC and estimation of library complexity ( picard , Preseq )

    4. Create normalised bigWig files scaled to 1 million mapped reads ( BEDTools , bedGraphToBigWig )

    5. Assess genome-wide coverage ( deepTools )

    6. Call nucleosome positions and generate smoothed, normalised coverage bigWig files that can be used to generate occupancy profile plots between samples across features of interest ( DANPOS2 )

    7. Generate gene-body meta-profile from DANPOS2 smoothed bigWig files ( deepTools )

  6. Merge filtered alignments across replicates ( picard )

    1. Re-mark duplicates ( picard )

    2. Remove duplicate reads ( SAMtools )

    3. Create normalised bigWig files scaled to 1 million mapped reads ( BEDTools , bedGraphToBigWig )

    4. Call nucleosome positions and generate smoothed, normalised coverage bigWig files that can be used to generate occupancy profile plots between samples across features of interest ( DANPOS2 )

    5. Generate gene-body meta-profile from DANPOS2 smoothed bigWig files ( deepTools )

  7. Create IGV session file containing bigWig tracks for data visualisation ( IGV ).

  8. Present QC for raw read and alignment results ( MultiQC )
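
As an illustration of the filtering step in the summary above, the core read-level filters can be sketched with SAMtools alone. This is a minimal sketch with assumed flag values and file names, not the pipeline's exact command (the pipeline combines SAMtools, BAMTools and Pysam as listed):

# Drop unmapped (0x4), non-primary (0x100) and duplicate (0x400) reads,
# discard multi-mappers (MAPQ 0) and keep only reads overlapping the
# non-blacklisted regions, then index the result.
samtools view -b -F 0x0504 -q 1 -L include_regions.bed sample.sorted.bam > sample.filtered.bam
samtools index sample.filtered.bam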

Quick Start

i. Install Nextflow

ii. Install either Docker or Singularity for full pipeline reproducibility (please only use Conda as a last resort; see docs)

iii. Download the pipeline and test it on a minimal dataset with a single command

nextflow run nf-core/mnaseseq -profile test,<docker/singularity/conda/institute>

Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your institute. If so, you can simply use -profile <institute> in your command. This will enable either Docker or Singularity and set the appropriate execution settings for your local compute environment.

iv. Start running your own analysis!

nextflow run nf-core/mnaseseq -profile <docker/singularity/conda/institute> --input design.csv --genome GRCh37
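
The design file supplied via --input is a comma-separated sample sheet describing your samples, replicates and FASTQ files. The exact columns required are defined in the usage docs; the header below is an assumption based on related nf-core pipelines of the same generation (leave fastq_2 empty for single-end data):

group,replicate,fastq_1,fastq_2
wildtype,1,wildtype_rep1_R1.fastq.gz,wildtype_rep1_R2.fastq.gz
wildtype,2,wildtype_rep2_R1.fastq.gz,wildtype_rep2_R2.fastq.gz
mutant,1,mutant_rep1_R1.fastq.gz,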

See usage docs for all of the available options when running the pipeline.

Documentation

The nf-core/mnaseseq pipeline comes with documentation, found in the docs/ directory:

  1. Installation

  2. Pipeline configuration

  3. Running the pipeline

  4. Output and how to interpret the results

  5. Troubleshooting

Credits

The pipeline was originally written by The Bioinformatics & Biostatistics Group for use at The Francis Crick Institute, London.

The pipeline was developed by Harshil Patel.

Many thanks to others who have helped out along the way too, including (but not limited to): @crickbabs.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on Slack (you can join with this invite).

Citation

If you use nf-core/mnaseseq for your analysis, please cite it using the following DOI: 10.5281/zenodo.6581372.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

Code Snippets

From line 288 of master/main.nf
"""
check_design.py $design design_reads.csv
"""

From line 344 of master/main.nf
"""
bwa index -a bwtsw $fasta
mkdir BWAIndex && mv ${fasta}* BWAIndex
"""

From line 367 of master/main.nf
"""
gtf2bed $gtf > ${gtf.baseName}.bed
"""

From line 388 of master/main.nf
"""
cat $bed | awk -v FS='\t' -v OFS='\t' '{ if(\$6=="+") \$3=\$2+1; else \$2=\$3-1; print \$1, \$2, \$3, \$4, \$5, \$6;}' > ${bed.baseName}.tss.bed
"""

From line 418 of master/main.nf
"""
samtools faidx $fasta
cut -f 1,2 ${fasta}.fai > ${fasta}.sizes
$blacklist_filter > ${fasta}.include_regions.bed
"""

From line 456 of master/main.nf
"""
[ ! -f  ${name}.fastq.gz ] && ln -s $reads ${name}.fastq.gz
fastqc -q -t $task.cpus ${name}.fastq.gz
"""

From line 461 of master/main.nf
"""
[ ! -f  ${name}_1.fastq.gz ] && ln -s ${reads[0]} ${name}_1.fastq.gz
[ ! -f  ${name}_2.fastq.gz ] && ln -s ${reads[1]} ${name}_2.fastq.gz
fastqc -q -t $task.cpus ${name}_1.fastq.gz
fastqc -q -t $task.cpus ${name}_2.fastq.gz
"""

From line 524 of master/main.nf
"""
[ ! -f  ${name}.fastq.gz ] && ln -s $reads ${name}.fastq.gz
trim_galore --cores $cores --fastqc --gzip $c_r1 $tpc_r1 $nextseq ${name}.fastq.gz
"""

From line 529 of master/main.nf
"""
[ ! -f  ${name}_1.fastq.gz ] && ln -s ${reads[0]} ${name}_1.fastq.gz
[ ! -f  ${name}_2.fastq.gz ] && ln -s ${reads[1]} ${name}_2.fastq.gz
trim_galore --cores $cores --paired --fastqc --gzip $c_r1 $c_r2 $tpc_r1 $tpc_r2 $nextseq ${name}_1.fastq.gz ${name}_2.fastq.gz
"""

From line 566 of master/main.nf
"""
bwa mem \\
    -t $task.cpus \\
    -M \\
    -R $rg \\
    ${index}/${bwa_base} \\
    $reads \\
    | samtools view -@ $task.cpus -b -h -F 0x0100 -O BAM -o ${prefix}.bam -
"""

From line 602 of master/main.nf
"""
samtools sort -@ $task.cpus -o ${prefix}.sorted.bam -T $name $bam
samtools index ${prefix}.sorted.bam
samtools flagstat ${prefix}.sorted.bam > ${prefix}.sorted.bam.flagstat
samtools idxstats ${prefix}.sorted.bam > ${prefix}.sorted.bam.idxstats
samtools stats ${prefix}.sorted.bam > ${prefix}.sorted.bam.stats
"""

From line 660 of master/main.nf
"""
picard -Xmx${avail_mem}g MergeSamFiles \\
    ${'INPUT='+bam_files.join(' INPUT=')} \\
    OUTPUT=${name}.sorted.bam \\
    SORT_ORDER=coordinate \\
    VALIDATION_STRINGENCY=LENIENT \\
    TMP_DIR=tmp
samtools index ${name}.sorted.bam

picard -Xmx${avail_mem}g MarkDuplicates \\
    INPUT=${name}.sorted.bam \\
    OUTPUT=${prefix}.sorted.bam \\
    ASSUME_SORTED=true \\
    REMOVE_DUPLICATES=false \\
    METRICS_FILE=${prefix}.MarkDuplicates.metrics.txt \\
    VALIDATION_STRINGENCY=LENIENT \\
    TMP_DIR=tmp

samtools index ${prefix}.sorted.bam
samtools idxstats ${prefix}.sorted.bam > ${prefix}.sorted.bam.idxstats
samtools flagstat ${prefix}.sorted.bam > ${prefix}.sorted.bam.flagstat
samtools stats ${prefix}.sorted.bam > ${prefix}.sorted.bam.stats
"""

From line 684 of master/main.nf
"""
picard -Xmx${avail_mem}g MarkDuplicates \\
    INPUT=${bam_files[0]} \\
    OUTPUT=${prefix}.sorted.bam \\
    ASSUME_SORTED=true \\
    REMOVE_DUPLICATES=false \\
    METRICS_FILE=${prefix}.MarkDuplicates.metrics.txt \\
    VALIDATION_STRINGENCY=LENIENT \\
    TMP_DIR=tmp

samtools index ${prefix}.sorted.bam
samtools idxstats ${prefix}.sorted.bam > ${prefix}.sorted.bam.idxstats
samtools flagstat ${prefix}.sorted.bam > ${prefix}.sorted.bam.flagstat
samtools stats ${prefix}.sorted.bam > ${prefix}.sorted.bam.stats
"""

From line 737 of master/main.nf
"""
sed 's/MIN_INSERT_SIZE/${params.min_insert}/g' <$bamtools_filter_config >bamtools_filter.json
sed -i -e 's/MAX_INSERT_SIZE/${params.max_insert}/g' bamtools_filter.json
sed -i -e 's/MAX_MISMATCH/${params.max_mismatch}/g' bamtools_filter.json

samtools view \\
    $filter_params \\
    $dup_params \\
    $multimap_params \\
    $blacklist_params \\
    -b ${bam[0]} \\
    | bamtools filter \\
        -out ${prefix}.sorted.bam \\
        -script bamtools_filter.json

samtools index ${prefix}.sorted.bam
samtools flagstat ${prefix}.sorted.bam > ${prefix}.sorted.bam.flagstat
samtools idxstats ${prefix}.sorted.bam > ${prefix}.sorted.bam.idxstats
samtools stats ${prefix}.sorted.bam > ${prefix}.sorted.bam.stats

$name_sort_bam
"""

From line 807 of master/main.nf
"""
bampe_rm_orphan.py ${bam[0]} ${prefix}.bam --only_fr_pairs

samtools sort -@ $task.cpus -o ${prefix}.sorted.bam -T $prefix ${prefix}.bam
samtools index ${prefix}.sorted.bam
samtools flagstat ${prefix}.sorted.bam > ${prefix}.sorted.bam.flagstat
samtools idxstats ${prefix}.sorted.bam > ${prefix}.sorted.bam.idxstats
samtools stats ${prefix}.sorted.bam > ${prefix}.sorted.bam.stats
"""

From line 846 of master/main.nf
"""
preseq lc_extrap -v -output ${prefix}.ccurve.txt -bam ${bam[0]}
"""

From line 883 of master/main.nf
"""
picard -Xmx${avail_mem}g CollectMultipleMetrics \\
    INPUT=${bam[0]} \\
    OUTPUT=${prefix}.CollectMultipleMetrics \\
    REFERENCE_SEQUENCE=$fasta \\
    VALIDATION_STRINGENCY=LENIENT \\
    TMP_DIR=tmp
"""

From line 925 of master/main.nf
"""
picard -Xmx${avail_mem}g CollectMultipleMetrics \\
    INPUT=${bam[0]} \\
    OUTPUT=${prefix}.CollectMultipleMetrics \\
    REFERENCE_SEQUENCE=$fasta \\
    VALIDATION_STRINGENCY=LENIENT \\
    TMP_DIR=tmp
"""

From line 961 of master/main.nf
"""
SCALE_FACTOR=\$(grep 'mapped (' $flagstat | awk '{print 1000000/\$1}')
echo \$SCALE_FACTOR > ${prefix}.scale_factor.txt
genomeCoverageBed -ibam ${bam[0]} -bg -scale \$SCALE_FACTOR $pe_fragment $extend | sort -T '.' -k1,1 -k2,2n >  ${prefix}.bedGraph

bedGraphToBigWig ${prefix}.bedGraph $sizes ${prefix}.bigWig

find * -type f -name "*.bigWig" -exec echo -e "bwa/mergedLibrary/bigwig/"{}"\\t0,0,178" \\; > ${prefix}.bigWig.igv.txt
"""

From line 993 of master/main.nf
"""
plotFingerprint \\
    --bamfiles ${bam[0]} \\
    --plotFile ${prefix}.plotFingerprint.pdf \\
    $extend \\
    --labels $prefix \\
    --outRawCounts ${prefix}.plotFingerprint.raw.txt \\
    --outQualityMetrics ${prefix}.plotFingerprint.qcmetrics.txt \\
    --skipZeros \\
    --numberOfProcessors $task.cpus \\
    --numberOfSamples $params.fingerprint_bins
"""

From line 1034 of master/main.nf
"""
bamToBed -i $ibam > ${prefix}.bed
"""

From line 1068 of master/main.nf
"""
danpos.py dpos \\
    $bed \\
    --span 1 \\
    --smooth_width 20 \\
    --width 40 \\
    --count 1000000 \\
    --out ./result/ \\
    $pe_params
mv ./result/*/* .

wigToBigWig -clip ${prefix}.Fnor.smooth.wig $sizes ${prefix}.Fnor.smooth.bigWig

awk -v FS='\t' -v OFS='\t' 'FNR > 1 { print \$1, \$2-1, \$3, "Interval_"NR-1, \$6, "+" }' ${prefix}.Fnor.smooth.positions.xls > ${prefix}.Fnor.smooth.positions.bed
awk -v FS='\t' -v OFS='\t' 'FNR > 1 { print \$1, \$4-1, \$4, "Interval_"NR-1, \$6, "+" }' ${prefix}.Fnor.smooth.positions.xls > ${prefix}.Fnor.smooth.positions.summit.bed

find * -type f -name "*.bigWig" -exec echo -e "bwa/mergedLibrary/danpos/"{}"\\t0,0,178" \\; > ${prefix}.danpos.bigWig.igv.txt
find * -type f -name "*.bed" -exec echo -e "bwa/mergedLibrary/danpos/"{}"\\t0,0,178" \\; > ${prefix}.danpos.bed.igv.txt
"""

From line 1110 of master/main.nf
"""
computeMatrix scale-regions \\
    --regionsFileName $bed \\
    --scoreFileName $bigwig \\
    --outFileName ${prefix}.computeMatrix.mat.gz \\
    --outFileNameMatrix ${prefix}.computeMatrix.vals.mat.gz \\
    --regionBodyLength 1000 \\
    --beforeRegionStartLength 3000 \\
    --afterRegionStartLength 3000 \\
    --skipZeros \\
    --samplesLabel $name \\
    --numberOfProcessors $task.cpus

plotProfile --matrixFile ${prefix}.computeMatrix.mat.gz \\
    --outFileName ${prefix}.plotProfile.pdf \\
    --outFileNameData ${prefix}.plotProfile.tab
"""

From line 1182 of master/main.nf
"""
picard -Xmx${avail_mem}g MergeSamFiles \\
    ${'INPUT='+bam_files.join(' INPUT=')} \\
    OUTPUT=${name}.sorted.bam \\
    SORT_ORDER=coordinate \\
    VALIDATION_STRINGENCY=LENIENT \\
    TMP_DIR=tmp
samtools index ${name}.sorted.bam

picard -Xmx${avail_mem}g MarkDuplicates \\
    INPUT=${name}.sorted.bam \\
    OUTPUT=${prefix}.sorted.bam \\
    ASSUME_SORTED=true \\
    REMOVE_DUPLICATES=true \\
    METRICS_FILE=${prefix}.MarkDuplicates.metrics.txt \\
    VALIDATION_STRINGENCY=LENIENT \\
    TMP_DIR=tmp

samtools index ${prefix}.sorted.bam
samtools flagstat ${prefix}.sorted.bam > ${prefix}.sorted.bam.flagstat
samtools idxstats ${prefix}.sorted.bam > ${prefix}.sorted.bam.idxstats
samtools stats ${prefix}.sorted.bam > ${prefix}.sorted.bam.stats
"""

From line 1206 of master/main.nf
"""
ln -s ${bams[0]} ${prefix}.sorted.bam
ln -s ${bams[1]} ${prefix}.sorted.bam.bai
touch ${prefix}.MarkDuplicates.metrics.txt
samtools flagstat ${prefix}.sorted.bam > ${prefix}.sorted.bam.flagstat
samtools idxstats ${prefix}.sorted.bam > ${prefix}.sorted.bam.idxstats
samtools stats ${prefix}.sorted.bam > ${prefix}.sorted.bam.stats
"""

From line 1238 of master/main.nf
"""
samtools sort -n -@ $task.cpus -o ${prefix}.bam -T $prefix ${bam[0]}
"""

From line 1281 of master/main.nf
"""
SCALE_FACTOR=\$(grep 'mapped (' $flagstat | awk '{print 1000000/\$1}')
echo \$SCALE_FACTOR > ${prefix}.scale_factor.txt
genomeCoverageBed -ibam ${bam[0]} -bg -scale \$SCALE_FACTOR $pe_fragment $extend | sort -T '.' -k1,1 -k2,2n >  ${prefix}.bedGraph

bedGraphToBigWig ${prefix}.bedGraph $sizes ${prefix}.bigWig

find * -type f -name "*.bigWig" -exec echo -e "bwa/mergedReplicate/bigwig/"{}"\\t0,0,178" \\; > ${prefix}.bigWig.igv.txt
"""

From line 1319 of master/main.nf
"""
bamToBed -i $ibam > ${prefix}.bed
"""

From line 1353 of master/main.nf
"""
danpos.py dpos \\
    $bed \\
    --span 1 \\
    --smooth_width 20 \\
    --width 40 \\
    --count 1000000 \\
    --out ./result/ \\
    $pe_params
mv ./result/*/* .

wigToBigWig -clip ${prefix}.Fnor.smooth.wig $sizes ${prefix}.Fnor.smooth.bigWig

awk -v FS='\t' -v OFS='\t' 'FNR > 1 { print \$1, \$2-1, \$3, "Interval_"NR-1, \$6, "+" }' ${prefix}.Fnor.smooth.positions.xls > ${prefix}.Fnor.smooth.positions.bed
awk -v FS='\t' -v OFS='\t' 'FNR > 1 { print \$1, \$4-1, \$4, "Interval_"NR-1, \$6, "+" }' ${prefix}.Fnor.smooth.positions.xls > ${prefix}.Fnor.smooth.positions.summit.bed

find * -type f -name "*.bigWig" -exec echo -e "bwa/mergedReplicate/danpos/"{}"\\t0,0,178" \\; > ${prefix}.danpos.bigWig.igv.txt
find * -type f -name "*.bed" -exec echo -e "bwa/mergedReplicate/danpos/"{}"\\t0,0,178" \\; > ${prefix}.danpos.bed.igv.txt
"""

From line 1395 of master/main.nf
"""
computeMatrix scale-regions \\
    --regionsFileName $bed \\
    --scoreFileName $bigwig \\
    --outFileName ${prefix}.computeMatrix.mat.gz \\
    --outFileNameMatrix ${prefix}.computeMatrix.vals.mat.gz \\
    --regionBodyLength 1000 \\
    --beforeRegionStartLength 3000 \\
    --afterRegionStartLength 3000 \\
    --skipZeros \\
    --samplesLabel $name \\
    --numberOfProcessors $task.cpus

plotProfile --matrixFile ${prefix}.computeMatrix.mat.gz \\
    --outFileName ${prefix}.plotProfile.pdf \\
    --outFileNameData ${prefix}.plotProfile.tab
"""

From line 1444 of master/main.nf
"""
cat *.txt > igv_files.txt
igv_files_to_session.py igv_session.xml igv_files.txt ../reference_genome/${fasta.getName()} --path_prefix '../'
"""

From line 1473 of master/main.nf
"""
echo $workflow.manifest.version > v_pipeline.txt
echo $workflow.nextflow.version > v_nextflow.txt
fastqc --version > v_fastqc.txt
trim_galore --version > v_trim_galore.txt
echo \$(bwa 2>&1) > v_bwa.txt
samtools --version > v_samtools.txt
bedtools --version > v_bedtools.txt
echo \$(bamtools --version 2>&1) > v_bamtools.txt
echo \$(plotFingerprint --version 2>&1) > v_deeptools.txt || true
picard MarkDuplicates --version &> v_picard.txt  || true
echo \$(R --version 2>&1) > v_R.txt
python -c "import pysam; print(pysam.__version__)" > v_pysam.txt
preseq &> v_preseq.txt
danpos.py --version > v_danpos.txt
multiqc --version > v_multiqc.txt
scrape_software_versions.py &> software_versions_mqc.yaml
"""

From line 1555 of master/main.nf
"""
multiqc . -f $rtitle $rfilename $custom_config_file \\
    -m custom_content -m fastqc -m cutadapt -m samtools -m picard -m preseq -m deeptools
"""

From line 1582 of master/main.nf
"""
markdown_to_html.py $output_docs -o results_description.html
"""