Analysis of long non-coding RNAs from RNA-seq datasets

public public 1yr ago Version: dev 0 bookmarks

A Nextflow-based pipeline for comprehensive analyses of long non-coding RNAs from RNA-seq datasets

Introduction

Recently, long noncoding RNA molecules (lncRNA) captured widespread attentions for their critical roles in diverse biological process and important implications in variety of human diseases and cancers. Identification and profiling of lncRNAs is a fundamental step to advance our knowledge on their function and regulatory mechanisms. However, RNA sequencing based lncRNA discovery is currently limited due to complicated operations and implementation of the tools involved. Therefore, we present a one-stop multi-tool integrated pipeline called LncPipe focused on characterizing lncRNAs from raw transcriptome sequencing data. The pipeline was developed based on a popular workflow framework Nextflow , composed of four core procedures including reads alignment, assembly, identification and quantification. It contains various unique features such as well-designed lncRNAs annotation strategy, optimized calculating efficiency, diversified classification and interactive analysis report. LncPipe allows users additional control in interuppting the pipeline, resetting parameters from command line, modifying main script directly and resume analysis from previous checkpoint.

Documentation

The nf-core/lncpipe pipeline comes with documentation about the pipeline, found in the docs/ directory:

  1. Installation

  2. Pipeline configuration

  3. Running the pipeline

  4. Output and how to interpret the results

  5. Run analysis for non-human species

  6. Troubleshooting

Schematic diagram

workflow

Acknowledgment

Thanks to the author of AfterQC /fastp, Shifu Chen, for his help on providing a gzip output support to meet the require of LncPipe. Thanks to the internal test by Hongwan Zhang and Yan Wang from SYSUCC Cancer bioinformatics platform.

And also many thanks to the wonderful guys @apeltzer, @ewels and others from nf-core that help me to polish the code and structure of lncpipe.

Citation

For details of LncPipe, plz read the article beblow :happy:

Qi Zhao, Yu Sun, Dawei Wang, Hongwan Zhang, Kai Yu, Jian Zheng, Zhixiang Zuo. LncPipe: A Nextflow-based pipeline for identification and analysis of long non-coding RNAs from RNA-Seq data. J Genet Genomics. 2018 Jul 20;45(7):399-401

Code Snippets

208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
'''
set -o pipefail
touch filenames.txt

perl -lpe 's/ ([^"]\\S+) ;/ "$1" ;/g' !{gencode_annotation_gtf} > gencode_annotation_gtf_mod.gtf 
perl -lpe 's/ ([^"]\\S+) ;/ "$1" ;/g' !{lncipedia_gtf} > lncipedia_mod.gtf 

echo  gencode_annotation_gtf_mod.gtf   >>filenames.txt
echo lncipedia_mod.gtf   >>filenames.txt


stringtie --merge -o merged_lncRNA.gtf  filenames.txt
cat gencode_annotation_gtf_mod.gtf   |grep "protein_coding" > gencode_protein_coding.gtf
gffcompare -r gencode_protein_coding.gtf -p !{task.cpus} merged_lncRNA.gtf
awk '$3 =="u"||$3=="x"{print $5}' gffcmp.merged_lncRNA.gtf.tmap |sort|uniq|perl !{baseDir}/bin/extract_gtf_by_name.pl merged_lncRNA.gtf - > merged.filter.gtf
mv  merged.filter.gtf known.lncRNA.gtf

'''
228
229
230
231
232
233
234
235
236
237
    '''
    set -o pipefail

    cuffmerge -o merged_lncRNA  !{lncRNA_gtflistfile}
    cat !{gencode_annotation_gtf} |grep "protein_coding" > gencode_protein_coding.gtf
    cuffcompare -o merged_lncRNA -r gencode_protein_coding.gtf -p !{task.cpus} merged_lncRNA/merged.gtf
    awk '$3 =="u"||$3=="x"{print $5}' merged_lncRNA/merged_lncRNA.merged.gtf.tmap  |sort|uniq|perl !{baseDir}/bin/extract_gtf_by_name.pl merged_lncRNA/merged.gtf - > merged.filter.gtf
    mv  merged.filter.gtf known.lncRNA.gtf

'''
NextFlow From line 228 of dev/main.nf
276
277
278
279
280
281
282
283
284
285
"""
    mkdir star_index
    STAR \
        --runMode genomeGenerate \
        --runThreadN ${star_threads} \
        --sjdbGTFfile $gencode_annotation_gtf \
        --sjdbOverhang 149 \
        --genomeDir star_index/ \
        --genomeFastaFiles $fasta
    """
303
304
305
"""
    bowtie2-build !{fasta} genome_bt2
    """
325
326
327
328
329
330
"""
    #for human genome it will take more than 160GB memory and take really  long time (6 more hours), thus we recommand to down pre-build genome from hisat website
    extract_splice_sites.py !{gencode_annotation_gtf} >genome_ht2.ss
    extract_exons.py !{gencode_annotation_gtf} > genome_ht2.exon
    hisat2-build -p !{hisat2_index_threads} --ss genome_ht2.ss --exo genome_ht2.exon !{fasta} genome_ht2
    """
365
366
367
    '''
    fastqc -t !{task.cpus} !{fastq_file[0]} !{fastq_file[1]}
'''
391
392
393
    '''
after.py -z -1 !{fastq_file[0]} -g ./
'''
NextFlow From line 391 of dev/main.nf
395
396
397
    '''
after.py -z -1 !{fastq_file[0]} -2 !{fastq_file[1]} -g ./
'''
NextFlow From line 395 of dev/main.nf
424
425
426
427
    '''
fastp -i !{fastq_file[0]} -o !{samplename}.qc.gz -h !{samplename}_fastp.html

'''
429
430
431
    '''
fastp -i !{fastq_file[0]}  -I !{fastq_file[1]} -o !{samplename}_1.qc.fq.gz  -O !{samplename}_2.qc.fq.gz -h !{samplename}_fastp.html
'''
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
"""
         STAR --runThreadN !{task.cpus} \
            --twopassMode Basic \
            --genomeDir !{star_index} \
            --readFilesIn !{pair} \
            --readFilesCommand zcat \
            --outSAMtype BAM SortedByCoordinate \
            --chimSegmentMin 20 \
            --outFilterIntronMotifs RemoveNoncanonical \
            --outFilterMultimapNmax 20 \
            --alignIntronMin 20 \
            --alignIntronMax 1000000 \
            --alignMatesGapMax 1000000 \
            --outFilterType BySJout \
            --alignSJoverhangMin 8 \
            --alignSJDBoverhangMin 1 \
            --outFileNamePrefix !{file_tag_new} 
    """
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
'''
            STAR --runThreadN !{task.cpus}  \
                 --twopassMode Basic --genomeDir !{star_index} \
                 --readFilesIn !{pair[0]} !{pair[1]} \
                 --readFilesCommand zcat \
                 --outSAMtype BAM SortedByCoordinate \
                 --chimSegmentMin 20 \
                 --outFilterIntronMotifs RemoveNoncanonical \
                 --outFilterMultimapNmax 20 \
                 --alignIntronMin 20 \
                 --alignIntronMax 1000000 \
                 --alignMatesGapMax 1000000 \
                 --outFilterType BySJout \
                 --alignSJoverhangMin 8 \
                 --alignSJDBoverhangMin 1 \
                 --outFileNamePrefix !{file_tag_new} 
    '''
541
542
543
544
'''
         tophat -p !{task.cpus} -G !{gtf} -–no-novel-juncs -o !{samplename}_thout --library-type !{strand_str} !{index_base} !{pair} 

'''
NextFlow From line 541 of dev/main.nf
547
548
549
'''
     tophat -p !{task.cpus} -G !{gtf} -–no-novel-juncs -o !{samplename}_thout --library-type !{strand_str} !{index_base} !{pair[0]} !{pair[1]} 
'''
NextFlow From line 547 of dev/main.nf
580
581
582
583
584
585
586
587
588
    '''
   mkdir tmp
   hisat2  -p !{task.cpus} --dta  -x  !{index_base}  -U !{pair}  -S !{file_tag_new}.sam 2>!{file_tag_new}.hisat2_summary.txt
  sambamba view -S -f bam -t !{task.cpus} !{file_tag_new}.sam -o temp.bam 
  sambamba sort -o !{file_tag_new}.sort.bam --tmpdir ./tmp -t !{task.cpus} temp.bam
  rm !{file_tag_new}.sam
  rm temp.bam

'''
591
592
593
594
595
596
597
    '''
    mkdir tmp
  hisat2  -p !{task.cpus} --dta  -x  !{index_base}  -1 !{pair[0]}  -2 !{pair[1]}  -S !{file_tag_new}.sam 2> !{file_tag_new}.hisat2_summary.txt
  sambamba view -S -f bam -t !{hisat2_threads} !{file_tag_new}.sam -o temp.bam
  sambamba sort -o !{file_tag_new}.sort.bam --tmpdir ./tmp -t !{task.cpus} temp.bam
  rm !{file_tag_new}.sam
'''
602
603
604
605
606
607
608
609
610
    '''
   mkdir tmp
   hisat2  -p !{task.cpus} --dta --rna-strandness !{params.hisat_strand} -x  !{index_base}  -U !{pair}  -S !{file_tag_new}.sam 2>!{file_tag_new}.hisat2_summary.txt
  sambamba view -S -f bam -t !{hisat2_threads} !{file_tag_new}.sam -o temp.bam 
  sambamba sort -o !{file_tag_new}.sort.bam --tmpdir ./tmp -t !{hisat2_threads} temp.bam
  rm !{file_tag_new}.sam
  rm temp.bam

'''
613
614
615
616
617
618
619
    '''
 mkdir tmp
  hisat2  -p !{task.cpus} --dta --rna-strandness !{params.hisat_strand} -x  !{index_base}  -1 !{pair[0]}  -2 !{pair[1]}  -S !{file_tag_new}.sam 2> !{file_tag_new}.hisat2_summary.txt
  sambamba view -S -f bam -t !{task.cpus} !{file_tag_new}.sam -o temp.bam
  sambamba sort -o !{file_tag_new}.sort.bam --tmpdir ./tmp -t !{task.cpus} temp.bam
  rm !{file_tag_new}.sam
'''
647
648
649
650
    '''
#run stringtie
stringtie -p !{task.cpus} -G !{gencode_annotation_gtf} -l stringtie_!{file_tag_new} -o stringtie_!{file_tag_new}_transcripts.gtf !{alignment_bam}
'''
652
653
654
655
    '''
#run stringtie
stringtie -p !{task.cpus} -G !{gencode_annotation_gtf} --rf -l stringtie_!{file_tag_new} -o stringtie_!{file_tag_new}_transcripts.gtf !{alignment_bam}
'''
683
684
685
686
687
'''
stringtie --merge -p !{task.cpus} -o merged.gtf !{gtf_filenames}


'''
711
712
713
714
715
716
717
718
719
720
721
722
723
    '''
#run cufflinks

cufflinks -g !{gencode_annotation_gtf} \
          -b !{fasta} \
          --library-type !{strand_str}\
          --max-multiread-fraction 0.25 \
          --3-overhang-tolerance 2000 \
          -o Cufout_!{file_tag_new} \
          -p !{task.cpus} !{alignment_bam}

mv Cufout_!{file_tag_new}/transcripts.gtf Cufout_!{file_tag_new}_transcripts.gtf
'''
726
727
728
729
730
731
732
733
734
735
736
737
738
    '''
#run cufflinks

cufflinks -g !{gencode_annotation_gtf} \
          -b !{fasta} \
          --library-type !{strand_str} \
          --max-multiread-fraction 0.25 \
          --3-overhang-tolerance 2000 \
          -o Cufout_!{file_tag_new} \
          -p !{task.cpus} !{alignment_bam}

mv Cufout_!{file_tag_new}/transcripts.gtf Cufout_!{file_tag_new}_transcripts.gtf
'''
771
772
773
774
775
776
777
778
'''
mkdir CUFFMERGE
cuffmerge -o CUFFMERGE \
          -s !{fasta} \
          -p !{task.cpus} \
             !{gtf_filenames}

'''
NextFlow From line 771 of dev/main.nf
817
818
819
    '''
    fastqc -t !{task.cpus} !{fastq_file[0]} !{fastq_file[1]}
'''
845
846
847
    '''
after.py -z -1 !{fastq_file[0]} -g ./
'''
NextFlow From line 845 of dev/main.nf
849
850
851
    '''
after.py -z -1 !{fastq_file[0]} -2 !{fastq_file[1]} -g ./
'''
NextFlow From line 849 of dev/main.nf
878
879
880
881
    '''
fastp -i !{fastq_file[0]} -o !{samplename}.qc.gz -h !{samplename}_fastp.html

'''
883
884
885
    '''
fastp -i !{fastq_file[0]}  -I !{fastq_file[1]} -o !{samplename}_1.qc.fq.gz  -O !{samplename}_2.qc.fq.gz -h !{samplename}_fastp.html
'''
914
915
916
917
'''
    #!/bin/sh
    gffcompare -r !{gencode_annotation_gtf} -p !{task.cpus} !{mergeGtfFile} -o merged_lncRNA
    '''
938
939
940
941
942
943
944
945
946
947
948
949
950
951
'''
    # filtering novel lncRNA based on cuffmerged trascripts
    awk '$3 =="x"||$3=="u"||$3=="i"{print $0}' !{comparedTmap} > novel.gtf.tmap
    #   excluding length smaller than 200 nt
    awk '$10 >200{print}' novel.gtf.tmap > novel.longRNA.gtf.tmap
    #   extract gtf
    awk '{print $5}' novel.longRNA.gtf.tmap |perl !{baseDir}/bin/extract_gtf_by_name.pl !{mergedGTF} - >novel.longRNA.gtf
    awk '{if($3=="exon"){print $0}}' novel.longRNA.gtf > novel.longRNA.format.gtf 
    perl !{baseDir}/bin/get_exoncount.pl novel.longRNA.format.gtf  > novel.longRNA.exoncount.txt
    # gtf2gff3
    #check whether required
    # get fasta from gtf
    gffread novel.longRNA.gtf -g !{fasta} -w novel.longRNA.fa -W
 '''
968
969
970
971
972
973
    '''
        PLEK.py -fasta !{novel_lncRNA_fasta} \
                                   -out novel.longRNA.PLEK.out \
                                   -thread !{task.cpus}
	    exit 0
        '''
NextFlow From line 968 of dev/main.nf
983
984
985
986
987
988
'''
cpat.py -g !{novel_lncRNA_fasta} \
                               -x !{baseDir}/bin/cpat_model/Human_Hexamer.tsv \
                               -d !{baseDir}/bin/cpat_model/Human_logitModel.RData \
                               -o novel.longRNA.CPAT.out
'''
NextFlow From line 983 of dev/main.nf
990
991
992
993
994
995
'''
cpat.py -g !{novel_lncRNA_fasta} \
                               -x !{baseDir}/bin/cpat_model/Mouse_Hexamer.tsv \
                               -d !{baseDir}/bin/cpat_model/Mouse_logitModel.RData \
                               -o novel.longRNA.CPAT.out
'''
NextFlow From line 990 of dev/main.nf
 998
 999
1000
1001
1002
1003
'''
cpat.py -g !{novel_lncRNA_fasta} \
                               -x !{baseDir}/bin/cpat_model/zebrafish_Hexamer.tsv \
                               -d !{baseDir}/bin/cpat_model/zebrafish_logitModel.RData \
                               -o novel.longRNA.CPAT.out
'''
NextFlow From line 998 of dev/main.nf
1005
1006
1007
1008
1009
1010
'''
cpat.py -g !{novel_lncRNA_fasta} \
                               -x !{baseDir}/bin/cpat_model/fly_Hexamer.tsv \
                               -d !{baseDir}/bin/cpat_model/fly_logitModel.RData \
                               -o novel.longRNA.CPAT.out
'''
NextFlow From line 1005 of dev/main.nf
1034
1035
1036
1037
1038
1039
1040
1041
'''
    #merged transcripts
    perl !{baseDir}/bin/integrate_novel_transcripts.pl > novel.longRNA.txt
    awk '$4 >1{print $1}' novel.longRNA.txt|perl !{baseDir}/bin/extract_gtf_by_name.pl !{cuffmergegtf} - > novel.longRNA.stringent.gtf
    # retain lncRNA only by coding ability
    awk '$4 >1&&$5=="lncRNA"{print $1}' novel.longRNA.txt|perl !{baseDir}/bin/extract_gtf_by_name.pl !{cuffmergegtf} - > novel.lncRNA.stringent.gtf
    awk '$4 >1&&$5=="TUCP"{print $1}' novel.longRNA.txt|perl !{baseDir}/bin/extract_gtf_by_name.pl !{cuffmergegtf} - > novel.TUCP.stringent.gtf
    '''
NextFlow From line 1034 of dev/main.nf
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
'''
gffcompare -G -o filter \
            -r !{knowlncRNAgtf} \
            -p !{task.cpus} !{novel_lncRNA_stringent_Gtf}
awk '$3 =="u"||$3=="x"{print $5}' filter.novel.lncRNA.stringent.gtf.tmap |sort|uniq| \
            perl !{baseDir}/bin/extract_gtf_by_name.pl !{novel_lncRNA_stringent_Gtf} - > novel.lncRNA.stringent.filter.gtf

#rename lncRNAs according to neighbouring protein coding genes
awk '$3 =="gene"{print }' !{gencode_protein_coding_gtf} | perl -F'\\t' -lane '$F[8]=~/gene_id "(.*?)";/ && print join qq{\\t},@F[0,3,4],$1,@F[5,6,1,2,7,8,9]' - | \
    sort-bed - > gencode.protein_coding.gene.bed
gtf2bed < novel.lncRNA.stringent.filter.gtf |sort-bed - > novel.lncRNA.stringent.filter.bed
gtf2bed < !{knowlncRNAgtf} |sort-bed - > known.lncRNA.bed

perl !{baseDir}/bin/rename_lncRNA_2.pl gencode_annotation_gtf_mod.gtf lncipedia_mod.gtf 
# mv lncRNA.final.v2.gtf all_lncRNA_for_classifier.gtf
grep -v NA-1-1 lncRNA.final.v2.gtf > all_lncRNA_for_classifier.gtf
perl !{baseDir}/bin/rename_proteincoding.pl !{gencode_protein_coding_gtf}> protein_coding.final.gtf
cat all_lncRNA_for_classifier.gtf protein_coding.final.gtf > final_all.gtf
gffread final_all.gtf -g !{fasta} -w final_all.fa -W
gffread all_lncRNA_for_classifier.gtf -g !{fasta} -w lncRNA.fa -W
gffread protein_coding.final.gtf -g !{fasta} -w protein_coding.fa -W
#classification 
perl !{baseDir}/bin/lincRNA_classification.pl all_lncRNA_for_classifier.gtf !{gencode_protein_coding_gtf} lncRNA_classification.txt 


'''
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
'''
gffcompare -G -o filter \
            -r !{knowlncRNAgtf} \
            -p !{task.cpus} !{novel_lncRNA_stringent_Gtf}
awk '$3 =="u"||$3=="x"{print $5}' filter.novel.lncRNA.stringent.gtf.tmap |sort|uniq| \
            perl !{baseDir}/bin/extract_gtf_by_name.pl !{novel_lncRNA_stringent_Gtf} - > novel.lncRNA.stringent.filter.gtf

#rename lncRNAs according to neighbouring protein coding genes
awk '$3 =="gene"{print }' !{gencode_protein_coding_gtf} | perl -F'\\t' -lane '$F[8]=~/gene_id "(.*?)";/ && print join qq{\\t},@F[0,3,4],$1,@F[5,6,1,2,7,8,9]' - | \
    sort-bed - > gencode.protein_coding.gene.bed
gtf2bed < novel.lncRNA.stringent.filter.gtf |sort-bed - > novel.lncRNA.stringent.filter.bed
gtf2bed < !{knowlncRNAgtf} |sort-bed - > known.lncRNA.bed
perl !{baseDir}/bin/rename_lncRNA_2.pl non_human_mod.gtf
# mv lncRNA.final.v2.gtf all_lncRNA_for_classifier.gtf
grep -v NA-1-1 lncRNA.final.v2.gtf > all_lncRNA_for_classifier.gtf
perl !{baseDir}/bin/rename_proteincoding.pl !{gencode_protein_coding_gtf}> protein_coding.final.gtf
cat all_lncRNA_for_classifier.gtf protein_coding.final.gtf > final_all.gtf
gffread final_all.gtf -g !{fasta} -w final_all.fa -W
gffread all_lncRNA_for_classifier.gtf -g !{fasta} -w lncRNA.fa -W
gffread protein_coding.final.gtf -g !{fasta} -w protein_coding.fa -W
#classification 
perl !{baseDir}/bin/lincRNA_classification.pl all_lncRNA_for_classifier.gtf !{gencode_protein_coding_gtf} lncRNA_classification.txt 


'''
1144
1145
1146
1147
1148
1149
'''
cpat.py -g !{lncRNA_final_cpat_fasta} \
                               -x !{baseDir}/bin/cpat_model/Human_Hexamer.tsv \
                               -d !{baseDir}/bin/cpat_model/Human_logitModel.RData \
                               -o lncRNA.final.CPAT.out
'''
NextFlow From line 1144 of dev/main.nf
1151
1152
1153
1154
1155
1156
'''
cpat.py -g !{lncRNA_final_cpat_fasta} \
                               -x !{baseDir}/bin/cpat_model/Mouse_Hexamer.tsv \
                               -d !{baseDir}/bin/cpat_model/Mouse_logitModel.RData \
                               -o lncRNA.final.CPAT.out
'''
NextFlow From line 1151 of dev/main.nf
1159
1160
1161
1162
1163
1164
'''
cpat.py -g !{lncRNA_final_cpat_fasta} \
                               -x !{baseDir}/bin/cpat_model/zebrafish_Hexamer.tsv \
                               -d !{baseDir}/bin/cpat_model/zebrafish_logitModel.RData \
                               -o lncRNA.final.CPAT.out
'''
NextFlow From line 1159 of dev/main.nf
1166
1167
1168
1169
1170
1171
'''
cpat.py -g !{lncRNA_final_cpat_fasta} \
                               -x !{baseDir}/bin/cpat_model/fly_Hexamer.tsv \
                               -d !{baseDir}/bin/cpat_model/fly_logitModel.RData \
                               -o lncRNA.final.CPAT.out
'''
NextFlow From line 1166 of dev/main.nf
1181
1182
1183
1184
1185
1186
'''
    cpat.py -g !{final_coding_gene_for_CPAT} \
                                   -x !{baseDir}/bin/cpat_model/Human_Hexamer.tsv \
                                   -d !{baseDir}/bin/cpat_model/Human_logitModel.RData \
                                   -o protein_coding.final.CPAT.out
    '''
NextFlow From line 1181 of dev/main.nf
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
'''
    #!/usr/bin/perl -w
    #since CPAT arbitrarily transforms gene names into upper case, we apply 'uc' function to keep the genenames' consistency.  
    use strict;
    open OUT,">basic_charac.txt" or die;

    open FH,"all_lncRNA_for_classifier.gtf" or die;

    my %class;
    my %g2t;
    my %trans_len;
    my %exon_num;
    while(<FH>){
    chomp;
    my @field=split "\t";
    $_=~/gene_id "(.+?)"/;
    my $gid=$1;
    $_=~/transcript_id "(.+?)"/;
    my $tid=uc($1);
    $class{$tid}=$field[1];
    $g2t{$tid}=$gid;
    my $len=$field[4]-$field[3];
    $trans_len{$tid}=(exists $trans_len{$tid})?$trans_len{$tid}+$len:$len;
    $exon_num{$tid}=(exists $exon_num{$tid})?$exon_num{$tid}+1:1;
    }
    open FH,"protein_coding.final.gtf" or die;

    while(<FH>){
    chomp;
    my @field=split "\t";
    $_=~/gene_id "(.+?)"/;
    my $gid=uc($1);
    $_=~/transcript_id "(.+?)"/;
    my $tid=$1;
    $class{$tid}="protein_coding";
    $g2t{$tid}=$gid;
    my $len=$field[4]-$field[3];
    $trans_len{$tid}=(exists $trans_len{$tid})?$trans_len{$tid}+$len:$len;
    $exon_num{$tid}=(exists $exon_num{$tid})?$exon_num{$tid}+1:1;
    }

    my %lin_class;
    open IN,"lncRNA_classification.txt" or die;                 #change the file name
    while(<IN>){
    chomp;
    my @data = split /\\t/,$_;
    $lin_class{$data[0]} = $data[1];
    }
    open FH,"lncRNA.final.CPAT.out" or die;

    <FH>;

    while(<FH>){
        chomp;
        my @field=split "\t";
        my $tid=uc($field[0]);
        my $class;
        if (defined($lin_class{$tid})){
            $class = $lin_class{$tid};
        }else{
            $class = 'NA';
        }
        print OUT $g2t{$tid}."\t".$tid."\t".$class{$tid}."\t".$field[5]."\t".$trans_len{$tid}."\t".$exon_num{$tid}."\t".$class."\n";
    }

    open FH,"protein_coding.final.CPAT.out" or die;

    <FH>;

    while(<FH>){
        chomp;
        my @field=split "\t";
        my $tid=uc($field[0]);
        my $class;
        if (defined($lin_class{$tid})){
            $class = $lin_class{$tid};
        }else{
            $class = 'protein_coding';
        }
        print OUT $g2t{$tid}."\t".$tid."\t".$class{$tid}."\t".$field[5]."\t".$trans_len{$tid}."\t".$exon_num{$tid}."\t".$class."\n";
     }

'''
NextFlow From line 1201 of dev/main.nf
1310
1311
1312
1313
1314
'''
sambamba view !{bamfile} > !{samplename}.sam # resolved error caused by bam and htseq version conflicts 
htseq-count -t exon -i gene_id -s no -r pos -f sam !{samplename}.sam !{final_gtf} > !{samplename}.htseq.count 
rm !{samplename}.sam
'''
1316
1317
1318
1319
1320
'''
sambamba view !{bamfile} > !{samplename}.sam # resolved error caused by bam and htseq version conflicts 
htseq-count -t exon -i gene_id -r pos -f sam !{samplename}.sam !{final_gtf} > !{samplename}.htseq.count 
rm !{samplename}.sam
'''
1336
1337
1338
1339
1340
'''
#index kallisto reference 
kallisto index -i transcripts.idx !{transript_fasta}

'''
1361
1362
1363
1364
1365
'''
#quantification by kallisto in single end mode
kallisto quant -i !{kallistoIndex} -o !{file_tag_new}_kallisto -t !{task.cpus} -b 100 --single -l 180 -s 20  !{pair} 
mv !{file_tag_new}_kallisto/abundance.tsv !{file_tag_new}_abundance.tsv
'''
1370
1371
1372
1373
1374
'''
#quantification by kallisto 
kallisto quant -i !{kallistoIndex} -o !{file_tag_new}_kallisto -t !{task.cpus} -b 100 !{pair[0]} !{pair[1]}
mv !{file_tag_new}_kallisto/abundance.tsv !{file_tag_new}_abundance.tsv
'''
1396
1397
1398
1399
1400
        '''
#index kallisto reference 
kallisto index -i transcripts.idx !{transript_fasta}

'''
1421
1422
1423
1424
1425
1426
        '''
#quantification by kallisto in single end mode
kallisto quant -i !{kallistoIndex} -o !{file_tag_new}_kallisto -t !{task.cpus} -b 100 --single -l 180 -s 20 !{pair} 
mv !{file_tag_new}_kallisto/abundance.tsv !{file_tag_new}_abundance.tsv

'''
1431
1432
1433
1434
1435
        '''
#quantification by kallisto 
kallisto quant -i !{kallistoIndex} -o !{file_tag_new}_kallisto -t !{task.cpus} -b 100 !{pair[0]} !{pair[1]}
mv !{file_tag_new}_kallisto/abundance.tsv !{file_tag_new}_abundance.tsv
'''
1459
1460
1461
1462
'''
perl !{baseDir}/bin/get_map_table.pl  final_all.gtf  > map.file
R CMD BATCH !{baseDir}/bin/get_htseq_matrix.R
'''
NextFlow From line 1459 of dev/main.nf
1478
1479
1480
1481
'''
perl !{baseDir}/bin/get_map_table.pl  --gtf_file=final_all.gtf  > map.file
R CMD BATCH !{baseDir}/bin/get_kallisto_matrix.R
'''
NextFlow From line 1478 of dev/main.nf
1519
1520
1521
    """
 Rscript -e "library(LncPipeReporter);run_reporter(input='.', output = 'reporter.html',output_dir='./LncPipeReports',de.method=\'${detools}\',theme = 'npg',cdf.percent = ${lncRep_cdf_percent},max.lncrna.len = ${lncRep_max_lnc_len},min.expressed.sample = ${lncRep_min_expressed_sample}, ask = FALSE)"
"""
NextFlow From line 1519 of dev/main.nf
1540
1541
1542
1543
"""
perl -F':|,' -lanE'BEGIN{say qq{SampleID\tcondition}} $del = shift @F; say qq{$_\t$del} for @F' ${design}  > design.matrix 
Rscript -e "library(LncPipeReporter);run_reporter(input='.', output = 'reporter.html',output_dir='./LncPipeReports',de.method=\'${detools}\',theme = 'npg',cdf.percent = ${lncRep_cdf_percent},max.lncrna.len = ${lncRep_max_lnc_len},min.expressed.sample = ${lncRep_min_expressed_sample}, ask = FALSE)"
"""
NextFlow From line 1540 of dev/main.nf
1565
1566
1567
"""
 Rscript -e "library(LncPipeReporter);run_reporter(input='.', output = 'reporter.html',output_dir='./LncPipeReports',de.method=\'${detools}\',theme = 'npg',cdf.percent = ${lncRep_cdf_percent},max.lncrna.len = ${lncRep_max_lnc_len},min.expressed.sample = ${lncRep_min_expressed_sample}, ask = FALSE)"
"""
NextFlow From line 1565 of dev/main.nf
1586
1587
1588
"""
 Rscript -e "library(LncPipeReporter);run_reporter(input='.', output = 'reporter.html',output_dir='./LncPipeReports',de.method=\'${detools}\',theme = 'npg',cdf.percent = ${lncRep_cdf_percent},max.lncrna.len = ${lncRep_max_lnc_len},min.expressed.sample = ${lncRep_min_expressed_sample}, ask = FALSE)"
"""
NextFlow From line 1586 of dev/main.nf
ShowHide 46 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://nf-co.re/lncpipe
Name: lncpipe
Version: dev
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...