The ProteoGenomics database generation workflow creates different protein databases for proteogenomics data analysis.


The ProteoGenomics database generation workflow (pgdb) uses pypgatk and Nextflow to create different protein databases for proteogenomics data analysis.

Introduction

nf-core/pgdb is a bioinformatics pipeline to generate proteogenomics databases. pgdb allows users to create proteogenomics databases using ENSEMBL as the reference proteome database. Three major databases can be attached to the final proteogenomics database:

  • The reference proteome (ENSEMBL Reference proteome)

  • Non-canonical proteins: pseudogenes, sORFs, lncRNAs.

  • Variants: COSMIC, cBioPortal, and gnomAD

The pipeline can also generate decoy proteins using different methods and append them to the final proteogenomics database.
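For example, a run that also appends decoys might look like the following sketch. The decoy_method, decoy_enzyme, and decoy_prefix parameter names appear in the pipeline scripts below; the --add_decoys flag and the example values are assumptions, so verify them against the usage docs:

    nextflow run nf-core/pgdb -profile docker \
        --add_decoys true \
        --decoy_method decoypyrat \
        --decoy_enzyme Trypsin \
        --decoy_prefix Decoy_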

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with Docker containers, making installation trivial and results highly reproducible.

Quick Start

  1. Install Nextflow

  2. Install any of Docker, Singularity, Podman, Shifter or Charliecloud for full pipeline reproducibility (please only use Conda as a last resort; see docs)

  3. Download the pipeline and test it on a minimal dataset with a single command (this run downloads the canonical ENSEMBL reference proteome and creates a proteomics database from it):

    nextflow run nf-core/pgdb -profile test,<docker/singularity/podman/shifter/charliecloud/conda/institute>
    

    Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your institute. If so, you can simply use -profile <institute> in your command. This will enable either Docker or Singularity and set the appropriate execution settings for your local compute environment.

  4. Start running your own analysis!

    nextflow run nf-core/pgdb -profile <docker/singularity/podman/conda/institute> --ncrna true --pseudogenes true --altorfs true
    

    This will create a proteogenomics database containing the ENSEMBL reference proteome plus non-canonical proteins such as pseudogenes, non-coding RNAs, and alternative open reading frames.

See usage docs for all of the available options when running the pipeline.
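For longer option sets, a Nextflow params file keeps the command readable. A minimal sketch, assuming a hypothetical params.yml; -params-file is standard Nextflow, and ensembl_name, ncrna, pseudogenes, and altorfs are parameters used elsewhere in this pipeline, but verify each name against the usage docs:

    # params.yml (hypothetical option set)
    ensembl_name: homo_sapiens
    ncrna: true
    pseudogenes: true
    altorfs: true

Then run:

    nextflow run nf-core/pgdb -profile docker -params-file params.yml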

Pipeline Summary

By default, the pipeline currently performs the following:

ProteoGenomics Database

  • Download protein databases from ENSEMBL

  • Translate genomic variant databases (COSMIC, gnomAD) into proteogenomics databases (see the example command after this list)

  • Add non-coding RNAs and pseudogenes to the reference proteomics database

  • Compute decoys for the proteogenomics database
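As a sketch of enabling one variant source, a COSMIC build could look like this. The cosmic_user_name, cosmic_password, and cosmic_cancer_type parameters appear in the pipeline scripts below; the --cosmic flag that switches the source on, and the example cancer type, are assumptions to check against the usage docs:

    nextflow run nf-core/pgdb -profile docker \
        --cosmic true \
        --cosmic_user_name <your_cosmic_user> \
        --cosmic_password <your_cosmic_password> \
        --cosmic_cancer_type 'breast'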

Documentation

The nf-core/pgdb pipeline comes with documentation about the pipeline: usage and output.

Credits

nf-core/pgdb was originally written by Husen M. Umer (Karolinska Institutet) and Yasset Perez-Riverol (EMBL-EBI).

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the Slack #pgdb channel (you can join with this invite).

Citations

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

Code Snippets

The following process script blocks are taken from main.nf of release 1.0.0; the line references point into that file.

From line 138 of 1.0.0/main.nf: record the pipeline and Nextflow versions for the software-versions report.

"""
echo $workflow.manifest.version > v_pipeline.txt
echo $workflow.nextflow.version > v_nextflow.txt
scrape_software_versions.py &> software_versions_mqc.yaml
"""
From line 164 of 1.0.0/main.nf: download ENSEMBL files for the configured species with pypgatk (the -sv -sc flags skip file types not needed at this step).

"""
pypgatk_cli.py ensembl-downloader \\
    --config_file $ensembl_downloader_config \\
    --ensembl_name $params.ensembl_name \\
    -sv -sc
"""
From line 184 of 1.0.0/main.nf: concatenate the downloaded reference proteome into a single FASTA file.

"""
cat $reference_proteome >> reference_proteome.fa
"""
From line 203 of 1.0.0/main.nf: merge two FASTA inputs into total_cdnas.fa.

"""
cat $a >> total_cdnas.fa
cat $b >> total_cdnas.fa
"""
From line 225 of 1.0.0/main.nf: build a protein database from non-coding RNA biotypes.

"""
pypgatk_cli.py dnaseq-to-proteindb \\
    --config_file "$ensembl_config" \\
    --input_fasta $x \\
    --output_proteindb ncRNAs_proteinDB.fa \\
    --include_biotypes "${params.biotypes['ncRNA']}" \\
    --skip_including_all_cds --var_prefix ncRNA_
"""
From line 253 of 1.0.0/main.nf: build a protein database from pseudogene biotypes.

"""
pypgatk_cli.py dnaseq-to-proteindb \\
    --config_file "$ensembl_config" \\
    --input_fasta "$x" \\
    --output_proteindb pseudogenes_proteinDB.fa \\
    --include_biotypes "${params.biotypes['pseudogene']}" \\
    --skip_including_all_cds \\
    --var_prefix pseudo_
"""
From line 282 of 1.0.0/main.nf: build a protein database of alternative open reading frames from protein-coding transcripts.

"""
pypgatk_cli.py dnaseq-to-proteindb \\
    --config_file "$ensembl_config" \\
    --input_fasta "$x" \\
    --output_proteindb altorfs_proteinDB.fa \\
    --include_biotypes "${params.biotypes['protein_coding']}" \\
    --skip_including_all_cds \\
    --var_prefix altorf_
"""
From line 315 of 1.0.0/main.nf: download COSMIC mutation data (requires COSMIC credentials).

"""
pypgatk_cli.py cosmic-downloader \\
    --config_file "$cosmic_config" \\
    --username $params.cosmic_user_name \\
    --password $params.cosmic_password
"""
From line 340 of 1.0.0/main.nf: translate COSMIC mutations into a protein database, filtered by cancer type.

"""
pypgatk_cli.py cosmic-to-proteindb \\
    --config_file "$cosmic_config" \\
    --input_mutation $m --input_genes $g \\
    --filter_column 'Histology subtype 1' \\
    --accepted_values $params.cosmic_cancer_type \\
    --output_db cosmic_proteinDB.fa
"""
From line 369 of 1.0.0/main.nf: translate COSMIC mutations into a protein database, filtered by cell line name.

"""
pypgatk_cli.py cosmic-to-proteindb \\
    --config_file "$cosmic_config" \\
    --input_mutation $m \\
    --input_genes $g \\
    --filter_column 'Sample name' \\
    --accepted_values $params.cosmic_cellline_name \\
    --output_db cosmic_celllines_proteinDB.fa
"""
From line 397 of 1.0.0/main.nf: download the ENSEMBL VCF files for the variant route (the skip flags suppress the other file types).

"""
pypgatk_cli.py ensembl-downloader \\
    --config_file $ensembl_downloader_config \\
    --ensembl_name $params.ensembl_name \\
    -sg -sp -sc -sd -sn
"""
From line 420 of 1.0.0/main.nf: keep only header lines and VCF records that have both reference and alternative alleles.

"""
awk 'BEGIN{FS=OFS="\t"}{if(\$1~"#" || (\$5!="" && \$4!="")) print}' $vcf_file > checked_$vcf_file
"""
From line 446 of 1.0.0/main.nf: translate ENSEMBL VCF variants into a protein database.

"""
pypgatk_cli.py vcf-to-proteindb \\
    --config_file $e \\
    --af_field "$ensembl_af_field" \\
    --input_fasta $f \\
    --gene_annotations_gtf $g \\
    --vcf $v \\
    --output_proteindb "${v}_proteinDB.fa" \\
    --var_prefix ensvar \\
    --annotation_field_name 'CSQ'
"""
From line 482 of 1.0.0/main.nf: extract transcript sequences from the genome FASTA using the GTF annotation.

"""
gffread -w transcripts.fa -g $f $g
"""
From line 504 of 1.0.0/main.nf: normalize chromosome names in a custom VCF (chrM becomes MT, the chr prefix is removed), then translate its variants into a protein database.

"""
awk 'BEGIN{FS=OFS="\t"}{if(\$1=="chrM") \$1="MT"; gsub("chr","",\$1); print}' \\
    $v > ${v.baseName}_changedChrNames.vcf

pypgatk_cli.py vcf-to-proteindb \\
    --config_file $e \\
    --af_field "$af_field" \\
    --input_fasta $f \\
    --gene_annotations_gtf $g \\
    --vcf ${v.baseName}_changedChrNames.vcf \\
    --output_proteindb ${v.baseName}_proteinDB.fa \\
    --annotation_field_name ''
"""
From line 540 of 1.0.0/main.nf: download the GENCODE v19 transcript sequences and annotation.

"""
wget ${g}/gencode.v19.pc_transcripts.fa.gz
wget ${g}/gencode.v19.annotation.gtf.gz
gunzip *.gz
"""
From line 562 of 1.0.0/main.nf: copy the gnomAD VCF from Google Cloud Storage.

"""
gsutil cp $g .
"""
From line 582 of 1.0.0/main.nf: decompress the gnomAD VCF.

"""
zcat $g > ${g}.vcf
"""
From line 605 of 1.0.0/main.nf: translate gnomAD variants into a protein database.

"""
pypgatk_cli.py vcf-to-proteindb \\
    --config_file $e \\
    --vcf $v \\
    --input_fasta $f \\
    --gene_annotations_gtf $g \\
    --output_proteindb "${v}_proteinDB.fa" \\
    --af_field controls_AF \\
    --transcript_index 6 \\
    --annotation_field_name vep \\
    --var_prefix gnomadvar
"""
From line 638 of 1.0.0/main.nf: download the GRCh37 CDS sequences from ENSEMBL release 75.

"""
wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cds/Homo_sapiens.GRCh37.75.cds.all.fa.gz
gunzip *.gz
"""
From line 658 of 1.0.0/main.nf: fetch all public cBioPortal studies from the datahub repository via Git LFS, then merge the mutation and clinical-sample tables (renaming CANCER_TYPE_DETAILED to CANCER_TYPE where only the detailed column exists).

"""
git clone https://github.com/cBioPortal/datahub.git .
git lfs install --local --skip-smudge
git lfs pull -I public --include "data*clinical*sample.txt"
git lfs pull -I public --include "data_mutations_mskcc.txt"
cat public/*/data_mutations_mskcc.txt > cbioportal_allstudies_data_mutations_mskcc.txt
cat public/*/*data*clinical*sample.txt | \\
    awk 'BEGIN{FS=OFS="\\t"}{if(\$1!~"#SAMPLE_ID"){gsub("#SAMPLE_ID", "\\nSAMPLE_ID");} print}' | \\
    awk 'BEGIN{FS=OFS="\\t"}{s=0; j=0; \\
        for(i=1;i<=NF;i++){ \\
            if(\$i=="CANCER_TYPE_DETAILED") j=1; \\
            if(\$i=="CANCER_TYPE") s=1; \\
        } \\
        if(j==1 && s==0){ \\
            gsub("CANCER_TYPE_DETAILED", "CANCER_TYPE"); \\
        } \\
        print; \\
    }' \\
    > cbioportal_allstudies_data_clinical_sample.txt
"""
From line 679 of 1.0.0/main.nf: download a single cBioPortal study and prepare its mutation and clinical-sample tables.

"""
pypgatk_cli.py cbioportal-downloader \\
    --config_file "$cbioportal_config" \\
    -d "$params.cbioportal_study_id"

tar -xzvf database_cbioportal/${params.cbioportal_study_id}.tar.gz
cat ${params.cbioportal_study_id}/data_mutations_mskcc.txt > cbioportal_allstudies_data_mutations_mskcc.txt
cat ${params.cbioportal_study_id}/data_clinical_sample.txt | \\
    awk 'BEGIN{FS=OFS="\\t"}{if(\$1!~"#SAMPLE_ID"){gsub("#SAMPLE_ID", "\\nSAMPLE_ID");} print}' | \\
    awk 'BEGIN{FS=OFS="\\t"}{s=0; j=0; \\
    for(i=1;i<=NF;i++){ \\
        if(\$i=="CANCER_TYPE_DETAILED") j=1; if(\$i=="CANCER_TYPE") s=1; \\
    } \\
    if(j==1 && s==0){gsub("CANCER_TYPE_DETAILED", "CANCER_TYPE");} print;}' \\
    > cbioportal_allstudies_data_clinical_sample.txt
"""
From line 715 of 1.0.0/main.nf: translate cBioPortal mutations into a protein database, filtered by the configured column and accepted values.

"""
pypgatk_cli.py cbioportal-to-proteindb \\
    --config_file $cbioportal_config \\
    --input_mutation $m \\
    --input_cds $g \\
    --clinical_sample_file $s \\
    --filter_column $params.cbioportal_filter_column \\
    --accepted_values $params.cbioportal_accepted_values \\
    --output_db cbioPortal_proteinDB.fa
"""
From line 747 of 1.0.0/main.nf: merge all generated protein databases into one FASTA file.

"""
cat proteindb* > merged_databases.fa
"""
From line 779 of 1.0.0/main.nf: clean the merged database (enforce a minimum sequence length and apply the configured stop-codon handling).

"""
pypgatk_cli.py ensembl-check \\
    -in "$file" \\
    --config_file "$e" \\
    -out database_clean.fa \\
    --num_aa "$params.minimum_aa" \\
    "$stop_codons"
"""
From line 811 of 1.0.0/main.nf: generate decoy sequences and append them to the final database.

"""
pypgatk_cli.py generate-decoy \\
    --method "$params.decoy_method" \\
    --enzyme "$params.decoy_enzyme" \\
    --config_file $protein_decoy_config \\
    --input_database $f \\
    --decoy_prefix "$params.decoy_prefix" \\
    --output_database decoy_database.fa
"""
From line 838 of 1.0.0/main.nf: convert the output documentation from Markdown to HTML.

"""
markdown_to_html.py $output_docs -o results_description.html
"""