A Snakemake wrapper around Nesvilab's FragPipe CLI

(ASCII-art banner: < mspipeline1 >)

If you want to run FragPipe through its command line interface, then this is the tool for you.

This pipeline takes 1) a list of .d files and 2) a list of amino acid FASTA files, and outputs sane protein calls with abundances. It uses Philosopher (for the database) and FragPipe to do the job. The Snakemake pipeline maintains a nice output file tree.
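Everything is written under output/<batch>/. A rough sketch of that tree, inferred from the shell snippets further down (file names are illustrative and the actual layout may differ):

    output/
    └── <batch>/
        ├── cat_database_sources.faa    # concatenated FASTA sources
        ├── philosopher_database.faa    # database amended with decoys and contaminants
        ├── <batch>.manifest            # FragPipe manifest (name illustrative)
        ├── <batch>.workflow            # FragPipe workflow with appended runtime parameters
        └── ...                         # FragPipe/MSFragger/IonQuant results and logs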

Why you should use this pipeline

Because it makes sure that all outputs are updated when you change input parameters. It also yells at you if something fails, and hopefully makes it a bit easier to find the error.

Installation

  1. Prerequisites:
  • Preferably an HPC system, or a beefy local workstation.

  • An Anaconda or Miniconda3 package manager on that system.

  2. Clone this repo on the HPC/workstation where you want to work.

    git clone https://github.com/cmkobel/mspipeline1.git && cd mspipeline1
    
  3. If you don't already have an environment with snakemake and mamba installed, use the following command to create one from the bundled environment file:

    conda env create -f environment.yaml -n mspipeline1
    

    This environment can then be activated by typing conda activate mspipeline1

  4. If needed, tweak the profiles/slurm/ configuration so that it matches your execution environment. There is a profile for local execution without a job management system (profiles/local/), as well as a few profiles for different HPC environments like PBS and SLURM.
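    As an illustration of what such a profile contains (the keys simply mirror snakemake's command-line flags; the values below are hypothetical and the bundled profiles are the authoritative starting point), a profiles/slurm/config.yaml could look like:

        jobs: 50                # maximum number of concurrent cluster jobs
        use-conda: true         # activate rule-specific conda environments
        latency-wait: 60        # seconds to wait for outputs on shared filesystems
        keep-going: true        # keep running independent jobs if one fails
        cluster: "sbatch --cpus-per-task={threads} --mem={resources.mem_mb}"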

Usage

1) Update config.yaml

The file config_template.yaml contains all the parameters needed to run this pipeline. You should change the parameters to reflect your sample batch.

Because Nesvilab do not make their executables immediately publicly available, you need to tell the pipeline where to find them on your system. Update the paths for the keys philosopher_executable, msfragger_jar, ionquant_jar and fragpipe_executable; the tools can be downloaded from Nesvilab's respective download pages.
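For example, in the config (all paths below are hypothetical placeholders):

    philosopher_executable: /home/user/tools/philosopher/philosopher
    msfragger_jar: /home/user/tools/msfragger/MSFragger.jar
    ionquant_jar: /home/user/tools/ionquant/IonQuant.jar
    fragpipe_executable: /home/user/tools/fragpipe/bin/fragpipe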

Currently the pipeline has only been tested on .d files (Agilent/Bruker): create an item in batch_parameters where you define the key d_base, which is the base directory where all .d files reside. Define the key database_glob, which is a path (or glob) to the amino acid FASTA files that you want to include in the target protein database.

Define items under the samples key that link sample names to the .d files.

Lastly, set the batch key to point at the batch that you want to run.
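Putting the batch-related keys together, a minimal sketch could look like the following (names, paths and exact nesting are illustrative; config_template.yaml is the authoritative reference):

    batch: experiment1                      # the batch to run

    batch_parameters:
      experiment1:
        d_base: /data/raw/experiment1       # base directory where all .d files reside
        database_glob: /data/fastas/*.faa   # FASTA files for the target protein database
        samples:                            # sample name -> .d file
          sampleA: sampleA.d
          sampleB: sampleB.d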

2) Run

Finally, run the pipeline in your command line with:

$ snakemake --profile profiles/slurm/
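Snakemake's usual flags work as expected; for example, a dry run shows what would be executed without actually running anything:

$ snakemake --profile profiles/slurm/ --dry-run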

Below is a visualization of the workflow graph:

[Screenshot of the workflow graph, 2023-02-23]

Future

In the future, this pipeline might include an R Markdown report performing trivial QC, as well as a test data set that accelerates the development cycle. 🚴‍♀️

Code Snippets

From line 125 of main/snakefile:

    shell: """
        echo '''{params.dataframe}''' > {output}

        # TODO: Write something clever to the benchmarks/ directory, so we can infer the relationship between hardware allocations, input size and running time.
    """
From line 143 of main/snakefile:

    shell: """
        cp -vr {params.d_files} {output.dir}

        # Enable editing of these files.
        chmod -R 775 output/
    """
From line 168 of main/snakefile:

    shell: """
        mkdir -p output/{config_batch}/
        >&2 echo "Concatenating database ..."
        cat {input.glob} > output/{config_batch}/cat_database_sources.faa

        cd output/{config_batch}/

        {params.philosopher} workspace --init

        # https://github.com/Nesvilab/philosopher/wiki/Database
        {params.philosopher} database \
            --custom cat_database_sources.faa \
            --contam

        echo "Existing database pattern-matched files:"
        ls *-decoys-contam-cat_database_sources.faa.fas

        mv *-decoys-contam-cat_database_sources.faa.fas philosopher_database.faa # Rename the database file.
        # rm cat_database_sources.faa # Remove the unnecessary .faa file. # No, keep it for annotation purposes.

        {params.philosopher} workspace --clean
    """
From line 210 of main/snakefile:

    shell: """
        # TODO: Database length in basepairs?
        seqkit stats \
            --tabular \
            {params.config_database_glob_read} {input.database} \
        > {output.db_stats_seqkit}
    """
From line 272 of main/snakefile:

    shell: """
        echo "Create manifest ..."
        echo '''{params.manifest}''' > {output.manifest}
        tail {output.manifest}

        echo "Modifying workflow with runtime parameters ..." # TODO: Check if it matters to overwrite or not.
        # Copy the parameter file and append dynamic content.
        cp {params.original_fragpipe_workflow} {output.fragpipe_workflow}
        echo -e "\n# Added by mspipeline1 in rule fragpipe in snakefile below ..." >> {output.fragpipe_workflow}

        echo "num_threads={threads}" >> {output.fragpipe_workflow}
        echo "database_name={input.database}" >> {output.fragpipe_workflow}
        echo "database.db-path={input.database}" >> {output.fragpipe_workflow}

        echo "output_location={params.fragpipe_workdir}" >> {output.fragpipe_workflow}

        # These settings minimize memory usage.
        echo "msfragger.misc.slice-db={params.slice_db}" >> {output.fragpipe_workflow} # Default 1
        echo "msfragger.calibrate_mass=0" >> {output.fragpipe_workflow} # Default 2
        echo "msfragger.digest_max_length=35" >> {output.fragpipe_workflow} # Default 50
        # echo "msfragger.allowed_missed_cleavage_1=1" >> {output.fragpipe_workflow} # Default 2
        # echo "msfragger.allowed_missed_cleavage_2=1" >> {output.fragpipe_workflow} # Default 2

        # Debug: show the bottom of the modified workflow.
        echo "" >> {output.fragpipe_workflow}
        tail {output.fragpipe_workflow}

        # Convert mem_mib into GiB (I'm not sure whether fragpipe reads GiB or GB?).
        mem_gib=$(({resources.mem_mib}/1024-2)) # Because there is some overhead, we subtract a few GiB. Every time fragpipe runs out of memory, I subtract another one: that should be more effective than doing a series of tests ahead of time.
        echo "Fragpipe will be told not to use more than $mem_gib GiB. In practice it usually uses a bit more." # Or maybe there just is an overhead when using a conda environment?

        echo "Fragpipe ..."
        # https://fragpipe.nesvilab.org/docs/tutorial_headless.html
        {params.fragpipe_executable} \
            --headless \
            --workflow {output.fragpipe_workflow} \
            --manifest {output.manifest} \
            --workdir {params.fragpipe_workdir} \
            --ram $mem_gib \
            --threads {threads} \
            --config-msfragger {params.msfragger_jar} \
            --config-ionquant {params.ionquant_jar} \
            --config-philosopher {params.philosopher_executable} \
        | tee {output.fragpipe_stdout} # Write the log, so we can later extract the number of "scans".
    """
From line 334 of main/snakefile:

    shell: """
        # Extract scans from the fragpipe stdout log. Will later be compared to the individual psm files.
        grep -E ": Scans = [0-9]+" {input.fragpipe_stdout} \
        > {output.scans}
    """
From line 366 of main/snakefile:

    shell: """
        cp scripts/QC.Rmd rmarkdown_template.rmd
        Rscript -e 'rmarkdown::render("rmarkdown_template.rmd", "html_document", output_file = "{output.report}", knit_root_dir = "output/{config_batch}/", quiet = F)'
        rm rmarkdown_template.rmd
    """
From line 392 of main/snakefile:

    shell: """
        zip {output} {input}
    """

URL: https://github.com/cmkobel/MS-pipeline1
Name: ms-pipeline1
Version: v1.0.0
License: GNU General Public License v3.0