CAGE-sequencing analysis pipeline with trimming, alignment and counting of CAGE tags.

Version: 1.0.2

CAGE-seq pipeline.

Introduction

nf-core/cageseq is a bioinformatics analysis pipeline for CAGE-seq data.

The pipeline takes raw demultiplexed FASTQ files as input and includes steps for linker and artefact trimming (cutadapt), rRNA removal (SortMeRNA), alignment to a reference genome (STAR or bowtie1), and CAGE tag counting and clustering (paraclu). Additionally, several quality control steps (FastQC, RSeQC, MultiQC) are included to allow easy verification of the results after a run.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with Docker containers, making installation trivial and results highly reproducible.

Quick Start

  1. Install Nextflow

  2. Install any of Docker, Singularity or Podman for full pipeline reproducibility (please only use Conda as a last resort; see docs)

  3. Download the pipeline and test it on a minimal dataset with a single command:

    nextflow run nf-core/cageseq -profile test,<docker/singularity/podman/conda/institute>
    

    Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your institute. If so, you can simply use -profile <institute> in your command. This will enable either Docker or Singularity and set the appropriate execution settings for your local compute environment.

  4. Start running your own analysis!

    nextflow run nf-core/cageseq -profile <docker/singularity/podman/conda/institute> --input '*_R1.fastq.gz' --aligner <'star'/'bowtie1'> --genome GRCh38

See usage docs for all of the available options when running the pipeline.
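The quotes around the --input pattern matter. The following is a minimal shell sketch, independent of the pipeline, of how a quoted and an unquoted glob reach a program; the file names and the count_args helper are hypothetical and exist only for this illustration:

```shell
# Two empty placeholder files standing in for real reads (hypothetical names).
touch toy_sampleA_R1.fastq.gz toy_sampleB_R1.fastq.gz

# Hypothetical helper that reports how many arguments it received.
count_args() { echo "$#"; }

# Unquoted: the shell expands the glob before the program starts.
unquoted=$(count_args toy_*_R1.fastq.gz)

# Quoted: the pattern arrives as one literal argument.
quoted=$(count_args 'toy_*_R1.fastq.gz')

echo "unquoted glob -> $unquoted argument(s)"
echo "quoted glob   -> $quoted argument(s)"
```

With the quoted form the pattern reaches Nextflow unchanged, so it can match the files itself; unquoted, the shell would typically hand over each matching file as a separate argument.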

Pipeline Summary

By default, the pipeline currently performs the following:

  1. Input read QC (FastQC)

  2. Adapter + EcoP15 + 5'G trimming (cutadapt)

  3. (optional) rRNA filtering (SortMeRNA)

  4. Trimmed and filtered read QC (FastQC)

  5. Read alignment to a reference genome (STAR or bowtie1)

  6. CAGE tag counting and clustering (paraclu)

  7. CAGE tag clustering QC (RSeQC)

  8. Present QC and visualisation for raw read, alignment and clustering results (MultiQC)
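Step 2 includes removing an extra G from the 5' end of a read, which the pipeline does with an anchored cutadapt pattern (^G) and zero allowed errors. As a rough illustration of what such a trim does to a FASTQ record, here is a plain awk stand-in. It is not cutadapt, it ignores cutadapt's wildcard handling, and the file names are hypothetical:

```shell
# Toy FASTQ with two reads: read1 starts with an extra G, read2 does not.
printf '@read1\nGACGTAC\n+\nIIIIIII\n@read2\nTTGCA\n+\nIIIII\n' > toy_reads.fastq

# Line 2 of each record is the sequence, line 4 the quality string.
# If the sequence starts with G, drop that base and remember it, so the
# quality string can be shortened by one character as well.
awk 'NR % 4 == 2 { trimmed = ($0 ~ /^G/); if (trimmed) $0 = substr($0, 2) }
     NR % 4 == 0 { if (trimmed) $0 = substr($0, 2) }
     { print }' toy_reads.fastq > toy_reads.g_trimmed.fastq

cat toy_reads.g_trimmed.fastq
```

Reads that do not start with G pass through unchanged, which matches the absence of --discard-untrimmed in the pipeline's G-trimming command.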

Documentation

The nf-core/cageseq pipeline comes with documentation about the pipeline: usage and output.

Credits

nf-core/cageseq was originally written by Kevin Menden (@KevinMenden) and Tristan Kast (@TrisKast) and updated by Matthias Hörtenhuber (@mashehu).

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the Slack #cageseq channel (you can join with this invite).

Citations

If you use nf-core/cageseq for your analysis, please cite it using the following DOI: 10.5281/zenodo.4095105

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

In addition, references of tools and data used in this pipeline are as follows:

Nextflow

Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

Pipeline tools

  • BEDTools

    Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2. doi: 10.1093/bioinformatics/btq033. Epub 2010 Jan 28. PubMed PMID: 20110278; PubMed Central PMCID: PMC2832824.

  • bowtie

    Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25. doi: 10.1186/gb-2009-10-3-r25. Epub 2009 Mar 4. PMID: 19261174; PMCID: PMC2690996.

  • cutadapt

    Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011;17(1):10-12.

  • FastQC

  • MultiQC

    Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

  • paraclu

    Frith MC, Valen E, Krogh A, Hayashizaki Y, Carninci P, Sandelin A. A code for transcription initiation in mammalian genomes. Genome Res. 2008 Jan;18(1):1-12. doi: 10.1101/gr.6831208. Epub 2007 Nov 21. PMID: 18032727; PMCID: PMC2134772.

  • RSeQC

    Wang L, Wang S, Li W. RSeQC: quality control of RNA-seq experiments. Bioinformatics. 2012 Aug 15;28(16):2184-5. doi: 10.1093/bioinformatics/bts356. Epub 2012 Jun 27. PubMed PMID: 22743226.

  • SAMtools

    Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.

  • SortMeRNA

    Kopylova E, Noé L, Touzet H. SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics. 2012 Dec 15;28(24):3211-7. doi: 10.1093/bioinformatics/bts611. Epub 2012 Oct 15. PubMed PMID: 23071270.

  • STAR

    Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013 Jan 1;29(1):15-21. doi: 10.1093/bioinformatics/bts635. Epub 2012 Oct 25. PubMed PMID: 23104886; PubMed Central PMCID: PMC3530905.

  • UCSC tools

    Kent WJ, Zweig AS, Barber G, Hinrichs AS, Karolchik D. BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics. 2010 Sep 1;26(17):2204-7. doi: 10.1093/bioinformatics/btq351. Epub 2010 Jul 17. PubMed PMID: 20639541; PubMed Central PMCID: PMC2922891.

Software packaging/containerisation tools

  • Anaconda

    Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

  • Bioconda

    Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

  • BioContainers

    da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

  • Docker

  • Singularity

    Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

Code Snippets

From line 347 of master/main.nf:
"""
echo $workflow.manifest.version > v_pipeline.txt
echo $workflow.nextflow.version > v_nextflow.txt
fastqc --version > v_fastqc.txt
multiqc --version > v_multiqc.txt
STAR --version > v_star.txt
bowtie --version > v_bowtie.txt
cutadapt --version > v_cutadapt.txt
samtools --version > v_samtools.txt
bedtools --version > v_bedtools.txt
read_distribution.py --version > v_rseqc.txt
sortmerna --version > v_sortmerna.txt
scrape_software_versions.py &> software_versions_mqc.yaml
"""
From line 373 of master/main.nf:
"""
gtf2bed.pl $gtf > ${gtf.baseName}.bed
"""
From line 386 of master/main.nf:
'''
cat !{fasta} |  awk '$0 ~ ">" {if (NR > 1) {print c;} c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' > chrom_sizes.txt
'''
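To see what the chromosome-size step produces, the same awk program can be run on a toy FASTA outside Nextflow. This is only a sketch: the record names and file names are made up, and the !{fasta} placeholder is replaced by a literal file.

```shell
# Toy FASTA with two records; chrA spans two sequence lines (4 + 2 bases).
printf '>chrA\nACGT\nAC\n>chrB\nGGG\n' > toy.fa

# Same awk program as in the snippet above: on each header line, print the
# length accumulated for the previous record, then start a new name<TAB> row;
# on sequence lines, add the line length to the running total.
awk '$0 ~ ">" {if (NR > 1) {print c;} c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' toy.fa > toy_chrom_sizes.txt

# Expected rows: chrA<TAB>6 and chrB<TAB>3
cat toy_chrom_sizes.txt
```

Because the header text is used as the printf format, a sequence name containing a percent sign would need extra care; the toy names avoid that.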
From line 410 of master/main.nf:
"""
fastqc --quiet --threads $task.cpus $reads
"""
From line 438 of master/main.nf:
"""
mkdir star
STAR \\
    --runMode genomeGenerate \\
    --runThreadN $task.cpus \\
    --sjdbGTFfile $gtf \\
    --genomeDir star/ \\
    --genomeFastaFiles $fasta \\
    $avail_mem
"""
From line 465 of master/main.nf:
"""
bowtie-build --threads $task.cpus ${fasta} ${fasta.baseName}.index
"""
From line 495 of master/main.nf:
"""
cutadapt -a ^${params.eco_site}...${params.linker_seq} \\
    --match-read-wildcards \\
    --minimum-length 15 --maximum-length 40 \\
    --discard-untrimmed \\
    --quality-cutoff 30 \\
    --cores=$task.cpus \\
    -o "${name}".adapter_trimmed.fastq.gz \\
    $reads \\
    > "${name}"_adapter_trimming.output.txt
"""
From line 510 of master/main.nf:
"""
mkdir trimmed
cutadapt -g ^${params.eco_site} \\
    -e 0 \\
    --match-read-wildcards \\
    --minimum-length 20 --maximum-length 40 \\
    --discard-untrimmed \\
    --quality-cutoff 30 \\
    --cores=$task.cpus \\
    -o "${name}".adapter_trimmed.fastq.gz \\
    $reads \\
    > "${name}"_adapter_trimming.output.txt
"""
From line 527 of master/main.nf:
"""
mkdir trimmed
cutadapt -a ${params.linker_seq}\$ \\
    -e 0 \\
    --match-read-wildcards \\
    --minimum-length 20 --maximum-length 40 \\
    --discard-untrimmed \\
    --quality-cutoff 30 \\
    --cores=$task.cpus \\
    -o "${name}".adapter_trimmed.fastq.gz \\
    $reads \\
    > "${name}"_adapter_trimming.output.txt
"""
From line 567 of master/main.nf:
"""
cutadapt -g ^G \\
    -e 0 --match-read-wildcards \\
    --cores=$task.cpus \\
    -o "${name}".g_trimmed.fastq.gz \\
    $reads \\
    > "${name}".g_trimming.output.txt
"""
From line 604 of master/main.nf:
"""
cutadapt -a file:$artifacts_3end \\
    -g file:$artifacts_5end -e 0.1 --discard-trimmed \\
    --match-read-wildcards -m 15 -O 19 \\
    --cores=$task.cpus \\
    -o "${name}".artifacts_trimmed.fastq.gz \\
    $reads \\
    > ${reads.baseName}.artifacts_trimming.output.txt
"""
From line 645 of master/main.nf:
"""
sortmerna ${Refs} \\
    --reads ${reads} \\
    --num_alignments 1 \\
    --threads $task.cpus \\
    --workdir . \\
    --fastx \\
    --aligned rRNA-reads \\
    --other non-rRNA-reads \\
    -v
gzip --force < non-rRNA-reads.fastq > ${name}.fq.gz
mv rRNA-reads.log ${name}_rRNA_report.txt
"""
From line 680 of master/main.nf:
"""
fastqc -q $reads
"""
From line 711 of master/main.nf:
"""
STAR --genomeDir $index \\
    --sjdbGTFfile $gtf \\
    --readFilesIn $reads \\
    --runThreadN $task.cpus \\
    --outSAMtype BAM SortedByCoordinate \\
    --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 \\
    --seedSearchStartLmax 20 \\
    --outFilterMismatchNmax 1 \\
    --readFilesCommand zcat \\
    --runDirPerm All_RWX \\
    --outFileNamePrefix $name \\
    --outFilterMultimapNmax 1
"""
From line 749 of master/main.nf:
"""
bowtie --sam \\
    -m 1 \\
    --best \\
    --strata \\
    -k 1 \\
    --tryhard \\
    --threads $task.cpus \\
    --phred33-quals \\
    --chunkmbs 64 \\
    --seedmms 2 \\
    --seedlen 20 \\
    --maqerr 70 \\
    ${index} \\
    -q ${reads} \\
    --un ${reads.baseName}.unAl > ${name}.sam 2> ${name}.out
samtools sort -@ $task.cpus -o ${name}.bam ${name}.sam
"""
From line 789 of master/main.nf:
"""
samtools idxstats $bam_count > ${bam_count}.idxstats
"""
From line 818 of master/main.nf:
'''
make_ctss.sh -q 20 -i !{bam_count.baseName} -n !{name}
'''
From line 835 of master/main.nf:
"""
bedtools genomecov -bg -i ${name}.ctss.bed -g ${chrom_sizes} > ${name}.bedgraph
sort -k1,1 -k2,2n ${name}.bedgraph > ${name}_sorted.bedgraph
bedGraphToBigWig ${name}_sorted.bedgraph ${chrom_sizes} ${name}.ctss.bw
"""
From line 857 of master/main.nf:
'''
process_ctss.sh -t !{params.tpm_cluster_threshold} !{ctss}

paraclu !{params.min_cluster} "ctss_all_pos_4Ps" > "ctss_all_pos_clustered"
paraclu !{params.min_cluster} "ctss_all_neg_4Ps" > "ctss_all_neg_clustered"

paraclu-cut  "ctss_all_pos_clustered" >  "ctss_all_pos_clustered_simplified"
paraclu-cut  "ctss_all_neg_clustered" >  "ctss_all_neg_clustered_simplified"

cat "ctss_all_pos_clustered_simplified" "ctss_all_neg_clustered_simplified" >  "ctss_all_clustered_simplified"
awk -F '\t' '{print $1"\t"$3"\t"$4"\t"$1":"$3".."$4","$2"\t"$6"\t"$2}' "ctss_all_clustered_simplified" >  "ctss_all_clustered_simplified.bed"
'''
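The final awk call above reshapes the simplified paraclu clusters into a six-column BED-like file. A small sketch on a single made-up cluster line shows the mapping; the column order (sequence, strand, start, end, number of positions, summed value, minimum density, maximum density) is an assumption about paraclu's output, and the file names are hypothetical.

```shell
# One toy cluster line in the assumed paraclu column order.
printf 'chr1\t+\t100\t120\t5\t42\t0.5\t3.0\n' > toy_clustered_simplified

# Same awk program as in the snippet above: chrom, start, end, a name built
# as chrom:start..end,strand, then the summed value, then the strand.
awk -F '\t' '{print $1"\t"$3"\t"$4"\t"$1":"$3".."$4","$2"\t"$6"\t"$2}' toy_clustered_simplified > toy_clustered_simplified.bed

# Expected row: chr1 100 120 chr1:100..120,+ 42 + (tab-separated)
cat toy_clustered_simplified.bed
```

The name column built here (chrom:start..end,strand) is what the later count table uses as its coordinates column.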
From line 886 of master/main.nf:
'''
intersectBed -a !{clusters} -b !{ctss} -loj -s > !{ctss}_counts_tmp

echo !{name} > !{ctss}_counts.txt

bedtools groupby -i !{ctss}_counts_tmp -g 1,2,3,4,6 -c 11 -o sum > !{ctss}_counts.bed
awk -v OFS='\t' '{if($6=="-1") $6=0; print $6 }' !{ctss}_counts.bed >> !{ctss}_counts.txt
'''
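The last awk call turns the grouped per-cluster sums into one count column per sample. A sketch of that step on made-up data follows; it assumes that a cluster without any overlapping tag comes out of the left outer join and the groupby with a sum of -1, which the awk program then rewrites to 0. Sample and file names are hypothetical.

```shell
# Toy grouped output: chrom, start, end, name, strand, summed count.
# The second cluster is assumed to have had no overlapping tag, hence -1.
printf 'chr1\t100\t120\tchr1:100..120,+\t+\t42\n' > toy_counts.bed
printf 'chr1\t300\t310\tchr1:300..310,-\t-\t-1\n' >> toy_counts.bed

# Sample name as header, then one count per cluster, as in the snippet above.
echo 'sampleA' > toy_counts.txt
awk -v OFS='\t' '{if($6=="-1") $6=0; print $6 }' toy_counts.bed >> toy_counts.txt

# Expected lines: sampleA, 42, 0
cat toy_counts.txt
```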
From line 910 of master/main.nf:
'''
echo 'coordinates' > coordinates
awk '{ print $4}' !{clusters} >> coordinates
paste -d "\t" coordinates !{counts} >> count_table.tsv
'''
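The count table is assembled by pasting the cluster names next to the per-sample count columns. The sketch below reproduces that on toy data with two made-up samples; all file names are hypothetical, and the table is written with > rather than >> so the example can be re-run without appending.

```shell
# Toy cluster BED (name in column 4) and two per-sample count columns,
# each starting with the sample name as a header line.
printf 'chr1\t100\t120\tchr1:100..120,+\t42\t+\n' > toy_clusters.bed
printf 'chr1\t300\t310\tchr1:300..310,-\t7\t-\n' >> toy_clusters.bed
printf 'sampleA\n42\n0\n' > toy_sampleA_counts.txt
printf 'sampleB\n10\n7\n' > toy_sampleB_counts.txt

# Header plus one name per cluster, then the count columns side by side.
echo 'coordinates' > toy_coordinates
awk '{ print $4}' toy_clusters.bed >> toy_coordinates
paste -d "\t" toy_coordinates toy_sampleA_counts.txt toy_sampleB_counts.txt > toy_count_table.tsv

cat toy_count_table.tsv
```

The rows line up only because every count file lists the clusters in the same order as the cluster file, so that order must not change between the counting and the pasting steps.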
From line 938 of master/main.nf:
"""
bedtools bedtobam -i $clusters -g $chrom_sizes > ${clusters.baseName}.bam
read_distribution.py -i ${clusters.baseName}.bam -r $gtf > ${clusters.baseName}.read_distribution.txt
"""
From line 982 of master/main.nf:
"""
multiqc . -f $rtitle $rfilename $custom_config_file
"""
From line 1001 of master/main.nf:
"""
markdown_to_html.py $output_docs -o results_description.html
"""

URL: https://nf-co.re/cageseq