Annotating Novel lncRNAs Using Existing Annotations: A Snakemake-Based Pipeline

public public 1yr ago 0 bookmarks

A Snakemake-based pipeline to annotate novel lncRNA using existing annotation.

Description

This pipeline uses various bioinformatics tools to annotate lncRNA. It requires reference genome(s) (optional HISAT index), sequencing reads, and reference annotation(s).

The workflow is mainly divided into two parts:

Part 1 (Snakefile):

The first part of pipeline works with the following steps:

  • uses HISAT2 to align all reads to the transcriptome,

  • creates a new gene list with all known and novel genes.

  • separates the novel genes

  • uses CPAT and CPC to annotate novel genes with coding potential

  • extracts non-coding genes

  • identifies and removes transcripts with coding isoforms and less than 200 bp long genes

Part 2 (Snakefile_quantification):

The second part of pipeline works with the following steps:

  • uses StringTie to merge lncRNAs from different sample

  • creates transcriptome using GFFread

  • creates Salmon index for transcriptome

  • quantify lncRNAs of each replicate of tissue/sample using salmon

  • creates a TPM matrix for each sample and replicates

  • filter lncRNAs with low expression value

Pipeline

A simple overview of pipeline is shown below:

Part 1 (Snakefile) Part 2 (Snakefile_quantification)

Creating Conda environment

This pipeline uses conda environments. Snakemake will create conda environments during the first run of the pipeline. However, if running this pipeline on High-Performance Computing (HPC) it may not have an active internet connection. In that case, conda environment can be created using snakemake -s env_snakefile --useconda command from the node with an internet connection. And then the pipeline will be able to utilise existing environments.

Authors

Code Snippets

25
26
shell:
    "sed -n 's/^##//p' {input}"
SnakeMake From line 25 of main/Snakefile
47
48
49
50
shell:
	"""
		trim_galore --illumina -o {params.dir} --paired {input.fq1} {input.fq2}
	"""
63
64
65
66
67
shell:
	"""
		mkdir -p {params.fq_dir} && \
		fastqc {input.fq} --outdir={params.fq_dir}
	"""
86
87
88
89
shell:
	"""
		hisat2 -q -p {threads} --dta -x {params.index} -1 {input.fq1} -2 {input.fq2} -S {output.sam} --summary {output.summary} 
	"""
 99
100
101
102
shell:
	"""
		samtools view -b {input} > {output} 
	"""
111
112
113
114
shell:
	"""
		samtools sort {input} > {output}
	"""
123
124
125
126
shell:
	"""
		samtools index -c {input} {output}
	"""
145
146
147
148
shell:
	"""
		stringtie {input.bam} -p 4 -G {input.gff} -o {output.gtf} -A {output.abundance} -b {params.ballgown} -l {params.label}
	"""
170
171
172
173
174
175
176
177
shell:
	"""
		cd {params.gtf_dir} && \
		gffcompare -r {input.ref_gff} {input.query_gtf} -o {params.label} && \
		cat {params.gffcompare_file} | awk '$3=="u"{{print $0}}' | cut {params.column} | sort | uniq > {output.novel_transcripts_list} && \
		sed 's/^/"/; s/$/"/' {output.novel_transcripts_list} > {output.novel_transcripts_list2} && \
		grep -F -f {output.novel_transcripts_list2} {input.query_gtf} > {output.novel_transcripts}	
	"""
192
193
194
195
shell:
	"""
		gffread -w {output.novel_transcripts_fasta} -g {input.reference_fasta} {input.novel_transcripts_gtf}
	"""
207
208
209
210
shell:
	"""
		make_hexamer_tab.py -c {input.cds} -n {input.ncrna} > {output.hexamer}
	"""
SnakeMake From line 207 of main/Snakefile
225
226
227
228
shell:
	"""
		make_logitModel.py -c {input.cds} -n {input.ncrna} -x {input.hexamer} -o {params.logitmodel}
	"""
SnakeMake From line 225 of main/Snakefile
242
243
244
245
246
247
248
shell:
	"""
		cpat.py -g {input.fasta}  -d {input.logitmodel} -x {input.hexamer} -o {output.tsv} && \

		# changing the case of Stringtie, CPAT changes Stringtie to STRINGTIE in previous rule 
		perl -p -e 's/STRINGTIE/Stringtie/g' {output.tsv} > {output.tsv_lowercase}
	"""
SnakeMake From line 242 of main/Snakefile
260
261
262
263
shell:
	"""
		awk -F "\t" '{{ if($6 <= {params.cutoff} ) {{print $1}} }}' {input.tsv} > {output.nctsv}
	"""
SnakeMake From line 260 of main/Snakefile
272
273
274
275
shell:
	"""
		sort {input.nctsv} > {output.ncsortedtsv}
	"""
SnakeMake From line 272 of main/Snakefile
288
289
290
291
shell:
	"""
		CPC2.py -i {input.fasta}  -o {output.tsv}
	"""
SnakeMake From line 288 of main/Snakefile
302
303
304
305
shell:
	"""
		awk '{{ if($8 == "{params.filter}" ) {{print $1}} }}' {input.tsv} > {output.nctsv}
	"""
SnakeMake From line 302 of main/Snakefile
314
315
316
317
shell:
	"""
		sort {input.nctsv} > {output.ncsortedtsv}
	"""
SnakeMake From line 314 of main/Snakefile
327
328
329
330
    shell:
        """
            comm -12 {input.cpat} {input.cpc} > {output.comm}
		"""
SnakeMake From line 327 of main/Snakefile
341
342
343
344
345
shell:
	"""
		sed 's/^/"/; s/$/"/' {input.tsv} > {output.tsv} && \
		grep -F -f {output.tsv} {input.query_gtf} > {output.novel_transcripts}
	"""
SnakeMake From line 341 of main/Snakefile
354
355
356
357
358
359
shell:
	"""
		#keeping " to avoid matching with wrong gene ids
		perl -lne 'print @m if @m=(/((?:gene_id)\s+\S+)/g);' {input.nc_gtf} | perl -p -i -e 's/gene_id|;| //g' > {output.gene_list} && \
		sort {output.gene_list} | uniq
	"""
SnakeMake From line 354 of main/Snakefile
370
371
372
373
374
shell:
	"""
		# gene list is already quoted so not adding quotes again, could use Mikado
		grep -F -f {input.gene_list} {input.query_gtf} > {output.transcripts}
	"""
SnakeMake From line 370 of main/Snakefile
388
389
390
391
392
shell:
	"""
		python {params.scripts}/tra.py < {input.nc_gtf} > {output.nc_pair} && \
		python {params.scripts}/tra.py < {input.combine_gtf} > {output.combine_pair}
	"""
SnakeMake From line 388 of main/Snakefile
404
405
406
407
shell:
	"""
		python {params.scripts}/filter.py {input.nc_pair} {input.combine_pair} > {output.diff}
	"""
SnakeMake From line 404 of main/Snakefile
420
421
422
423
424
shell:
    """
        sed 's/^/"/; s/$/"/' {input.genes} > {output.quoted_genes} && \
        grep -v -F -f {output.quoted_genes} {input.gtf} > {output.gtf}
    """
SnakeMake From line 420 of main/Snakefile
434
435
436
437
shell:
	"""
		gffread {input} -T -l 200 -o {output}
	"""
ShowHide 19 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/TGAC/lncRNA-analysis
Name: lncrna-analysis
Version: 1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...