A snakemake wrapper around Nesvilab's FragPipe-CLI
[ASCII-art banner: < mspipeline1 >]
If you want to run FragPipe through its command line interface, this is the tool for you.
This pipeline takes 1) a list of .d files and 2) a list of amino acid FASTA files, and outputs sane protein calls with abundances. It uses the Philosopher database tools and FragPipe to do the job. The Snakemake pipeline maintains a tidy output file tree.
Why you should use this pipeline
Because it makes sure that all outputs are updated when you change input parameters. It also yells at you if something fails and, hopefully, makes it a bit easier to find the error.
Installation
- Prerequisites:
  - Preferably an HPC system, or a beefy local workstation.
  - An anaconda or miniconda3 package manager on that system.
- Clone this repo on the HPC/workstation where you want to work.
git clone https://github.com/cmkobel/mspipeline1.git && cd mspipeline1
- If you don't already have an environment with snakemake and mamba installed, create one from the bundled environment file:
conda env create -f environment.yaml -n mspipeline1
This environment can then be activated by typing
conda activate mspipeline1
- If needed, tweak the profiles/slurm/ configuration so that it matches your execution environment. There is a profile for local execution without a job management system (profiles/local/), as well as a few profiles for different HPC schedulers such as PBS and SLURM; a sketch of the profile format is shown below.
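A Snakemake profile is simply a directory containing a config.yaml whose keys mirror snakemake's command-line options. As a rough orientation only (the bundled profiles are the authoritative reference; the account and partition names below are placeholders), a SLURM profile might look something like this:

    # profiles/slurm/config.yaml -- illustrative sketch; adapt to your cluster.
    # Each key corresponds to a snakemake command-line option.
    jobs: 100                    # maximum number of concurrently submitted jobs
    use-conda: true              # build per-rule environments with conda/mamba
    keep-going: true             # keep running independent jobs after a failure
    rerun-incomplete: true
    cluster: >-                  # placeholder sbatch template; account/partition are assumptions
      sbatch
      --account=my_account
      --partition=my_partition
      --cpus-per-task={threads}
      --mem={resources.mem_mb}
      --time={resources.runtime}
    default-resources:
      - mem_mb=4096
      - runtime=120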
Usage
1) Update config.yaml
The file config_template.yaml contains all the parameters needed to run this pipeline. You should change the parameters to reflect your sample batch.
Because Nesvilab do not make their executables immediately publicly available, you need to tell the pipeline where to find them on your system. Update the paths for the keys philosopher_executable, msfragger_jar, ionquant_jar and fragpipe_executable, which point to Philosopher, MSFragger, IonQuant and FragPipe, respectively; each can be downloaded from its Nesvilab download page.
Currently the pipeline is only tested on .d-file input (Agilent/Bruker):
- Create an item in batch_parameters where you define the key d_base, which is the base directory where all .d-files reside.
- Define the key database_glob, which is a path (or glob) to the amino acid FASTA files that you want to include in the target protein database.
- Define items under the samples key, which link sample names to the .d-files.
- Lastly, set the batch key to point at the batch that you want to run.
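To make the structure above concrete, here is a minimal, illustrative sketch of the relevant parts of the config file. The key names are the ones described above, but the exact nesting and any additional fields should be checked against config_template.yaml:

    # Illustrative sketch only -- config_template.yaml is the authoritative layout.

    # Paths to the Nesvilab executables/jars on your system (placeholder paths).
    philosopher_executable: /home/you/tools/philosopher
    msfragger_jar: /home/you/tools/MSFragger.jar
    ionquant_jar: /home/you/tools/IonQuant.jar
    fragpipe_executable: /home/you/tools/fragpipe/bin/fragpipe

    batch: my_batch                 # which entry under batch_parameters to run

    batch_parameters:
      my_batch:
        d_base: /data/raw/my_batch              # base directory holding all .d directories
        database_glob: /data/proteomes/*.faa    # FASTA files for the target protein database
        samples:                                # sample name -> .d file
          sample_A: sample_A.d
          sample_B: sample_B.d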
2) Run
Finally, run the pipeline from the command line with:
$ snakemake --profile profiles/slurm/
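If you want to check what Snakemake would do before submitting anything, its standard dry-run flag also works together with the profiles, for example:

    $ snakemake --profile profiles/slurm/ --dry-run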
Below is a visualization of the workflow graph:

[workflow graph image]
Future
This pipeline might later include an R-markdown report performing trivial QC, as well as a test data set that accelerates the development cycle. 🚴♀️
Code Snippets
shell:
    """
    echo '''{params.dataframe}''' > {output}

    # TODO: Write something clever to the benchmarks/ directory, so we can infer relationship between hardware allocations, input size and running time.
    """
shell:
    """
    cp -vr {params.d_files} {output.dir}

    # Enable editing of these files
    chmod -R 775 output/
    """
shell:
    """
    mkdir -p output/{config_batch}/

    >&2 echo "Concatenating database ..."
    cat {input.glob} > output/{config_batch}/cat_database_sources.faa

    mkdir -p output/{config_batch}/
    cd output/{config_batch}/

    {params.philosopher} workspace --init

    # https://github.com/Nesvilab/philosopher/wiki/Database
    {params.philosopher} database \
        --custom cat_database_sources.faa \
        --contam

    echo "Existing database pattern-matched files:"
    ls *-decoys-contam-cat_database_sources.faa.fas

    mv *-decoys-contam-cat_database_sources.faa.fas philosopher_database.faa # rename database file.
    # rm cat_database_sources.faa # remove unnecessary .faa file. # No, keep it for annotation purposes.

    {params.philosopher} workspace --clean
    """
shell:
    """
    # TODO: Database length in basepairs?

    seqkit stats \
        --tabular \
        {params.config_database_glob_read} {input.database} \
        > {output.db_stats_seqkit}
    """
shell:
    """
    echo "Create manifest ..."
    echo '''{params.manifest}''' > {output.manifest}
    tail {output.manifest}

    echo "Modifying workflow with runtime parameters ..."
    # TODO: Check if it matters to overwrite or not.

    # Copy and modify parameter file with dynamic content.
    cp {params.original_fragpipe_workflow} {output.fragpipe_workflow}

    echo -e "\n# Added by mspipeline1 in rule fragpipe in snakefile below ..." >> {output.fragpipe_workflow}
    echo "num_threads={threads}" >> {output.fragpipe_workflow}
    echo "database_name={input.database}" >> {output.fragpipe_workflow}
    echo "database.db-path={input.database}" >> {output.fragpipe_workflow}
    echo "output_location={params.fragpipe_workdir}" >> {output.fragpipe_workflow}

    # These settings minimize memory usage.
    echo "msfragger.misc.slice-db={params.slice_db}" >> {output.fragpipe_workflow} # Default 1
    echo "msfragger.calibrate_mass=0" >> {output.fragpipe_workflow} # Default 2
    echo "msfragger.digest_max_length=35" >> {output.fragpipe_workflow} # Default 50
    # echo "msfragger.allowed_missed_cleavage_1=1" >> {output.fragpipe_workflow} # Default 2
    # echo "msfragger.allowed_missed_cleavage_2=1" >> {output.fragpipe_workflow} # Default 2

    # Debug, presentation of the bottom of the modified workflow
    echo "" >> {output.fragpipe_workflow}
    tail {output.fragpipe_workflow}

    # Convert mem_mb into gb (I'm not sure if fragpipe reads GiB or GB?)
    mem_gib=$(({resources.mem_mib}/1024-2)) # Because there is some overhead, we subtract a few GBs. Every time fragpipe runs out of memory, I subtract another one: that should be more effective than doing a series of tests ahead of time.
    echo "Fragpipe will be told not to use more than $mem_gib GiB. In practice it usually uses a bit more." # Or maybe there just is an overhead when using a conda environment?

    echo "Fragpipe ..."
    # https://fragpipe.nesvilab.org/docs/tutorial_headless.html
    {params.fragpipe_executable} \
        --headless \
        --workflow {output.fragpipe_workflow} \
        --manifest {output.manifest} \
        --workdir {params.fragpipe_workdir} \
        --ram $mem_gib \
        --threads {threads} \
        --config-msfragger {params.msfragger_jar} \
        --config-ionquant {params.ionquant_jar} \
        --config-philosopher {params.philosopher_executable} \
        | tee {output.fragpipe_stdout} # Write the log, so we can later extract the number of "scans"
    """
shell:
    """
    # Extract scans from the fragpipe stdout log. Will later be compared to the individual psm files.
    grep -E ": Scans = [0-9]+" {input.fragpipe_stdout} \
        > {output.scans}
    """
shell:
    """
    cp scripts/QC.Rmd rmarkdown_template.rmd

    Rscript -e 'rmarkdown::render("rmarkdown_template.rmd", "html_document", output_file = "{output.report}", knit_root_dir = "output/{config_batch}/", quiet = F)'

    rm rmarkdown_template.rmd
    """
shell:
    """
    zip {output} {input}
    """