WOMBAT-Pipelines


Introduction

wombat-p pipelines is a bioinformatics analysis pipeline that bundles different workflows for the analysis of label-free proteomics data, with the purpose of comparing and benchmarking them. It accepts input files in the proteomics metadata standard SDRF.
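
An SDRF file is a tab-separated table in which each row describes one raw file together with its sample annotations; the pipeline, for example, reads download locations from the comment[file uri] column (see the snippets below). A minimal, purely illustrative two-sample design could look like this (columns are tab-separated):

source name	characteristics[organism]	comment[data file]	comment[file uri]
sample 1	Homo sapiens	a.raw	https://example.org/a.raw
sample 2	Homo sapiens	b.raw	https://example.org/b.raw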

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers, making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies. The pipeline is based on one of the nf-core templates.
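
As an illustration of this pattern, a minimal DSL2 process pins its own container (a generic sketch, not taken from the pipeline; the process name, container image, and tool are placeholders):

process EXAMPLE_TOOL {
    // each process declares its own container image
    container 'quay.io/biocontainers/example-tool:1.0'  // placeholder image

    input:
    path mzml

    output:
    path "*.tsv"

    script:
    """
    example-tool --in ${mzml} --out ${mzml.baseName}.tsv
    """
}

Because every process carries its own image, a single tool can be upgraded by changing one container tag without touching the rest of the pipeline.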

Pipeline summary

This work contains four different workflows for the analysis of label-free proteomics data originating from LC-MS experiments.

  1. MaxQuant + NormalyzerDE
  2. SearchGui + Proline + PolySTest
  3. Compomics tools + FlashLFQ + MSqRob
  4. Tools from the Trans-Proteomic Pipeline + ROTS

Initialization and parameterization of the workflows is based on tools from the SDRF pipelines, the ThermoRawFileParser with our own contributions, and additional programs from the wombat-p organization (https://github.com/wombat-p/Utilities) as well as our fork. This includes setting a generalized set of data analysis parameters and the calculation of multiple benchmarks.
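
With an SDRF file and a protein database, a run could then look like the following; the parameter names here are illustrative assumptions rather than the pipeline's documented interface:

# hypothetical invocation; --sdrf and --fasta are assumed parameter names
nextflow run wombat-p/WOMBAT-Pipelines -profile docker \
    --sdrf mydata.sdrf.tsv \
    --fasta database.fasta

-profile docker is the standard Nextflow mechanism for selecting the container engine; Singularity can be selected analogously with -profile singularity.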

Code Snippets

Benchmark calculation (source lines 30-41):
"""
echo '$foo' > params.json
cp "${fasta_file}" database.fasta
if [[ "${exp_design_file}" != "exp_design.txt" ]]
then
  cp "${exp_design_file}" exp_design.txt
fi
Rscript $baseDir/bin/CalcBenchmarks.R
mv benchmarks.json benchmarks_${workflow}.json
cp stand_pep_quant_merged.csv stand_pep_quant_merged${workflow}.csv
cp stand_prot_quant_merged.csv stand_prot_quant_merged${workflow}.csv
"""

FlashLFQ quantification, Singularity variant (source lines 38-62):
"""
first_line=""
# avoid exp_design ending with .txt
mv "${exp_design}" exp_design.tsv
for file in *.txt
do
  echo \$file
  tail -n +2 "\$file" >> tlfq_ident.tabular
  first_line=\$(head -n1 "\$file")
done
# Use awk to add 3 new columns rep, frac, trep with ones to exp_design file
#awk 'NR==1{print \$0"\trep\tfrac\ttrep"} NR>1{print \$0"\t1\t1\t1"}' "exp_design.tsv" > ExperimentalDesign.tsv
cp exp_design.tsv ExperimentalDesign.tsv

# Remove .raw and .mzml from file names in first column of ExperimentalDesign.tsv
sed -i 's/.mzML//g' ExperimentalDesign.tsv
sed -i 's/.raw//g' ExperimentalDesign.tsv
sed       -i 's/.Raw//g' ExperimentalDesign.tsv
sed -i 's/.RAW//g' ExperimentalDesign.tsv
# Add first line to tlfq_ident.tabular
echo "\$first_line" | cat - tlfq_ident.tabular > lfq_ident.tabular
# Needed as path is overwritten when running with singularity
PATH=\$PATH:/usr/local/lib/dotnet:/usr/local/lib/dotnet/tools
CONDA_PREFIX=/usr/local FlashLFQ --idt "lfq_ident.tabular" --rep "./" --out ./ --mbr ${parameters.enable_match_between_runs} --ppm ${parameters.precursor_mass_tolerance} --sha ${protein_inference} --thr ${task.cpus}
"""

FlashLFQ quantification (source lines 64-86):
"""
first_line=""
# avoid exp_design ending with .txt
mv "${exp_design}" exp_design.tsv
for file in *.txt
do
  echo \$file
  tail -n +2 "\$file" >> tlfq_ident.tabular
  first_line=\$(head -n1 "\$file")
done
# Use awk to add 3 new columns rep, frac, trep with ones to exp_design file
#awk 'NR==1{print \$0"\trep\tfrac\ttrep"} NR>1{print \$0"\t1\t1\t1"}' "exp_design.tsv" > ExperimentalDesign.tsv
cp exp_design.tsv ExperimentalDesign.tsv

# Remove .raw and .mzml from file names in first column of ExperimentalDesign.tsv
sed -i 's/.mzML//g' ExperimentalDesign.tsv
sed -i 's/.raw//g' ExperimentalDesign.tsv
sed       -i 's/.Raw//g' ExperimentalDesign.tsv
sed -i 's/.RAW//g' ExperimentalDesign.tsv
# Add first line to tlfq_ident.tabular
echo "\$first_line" | cat - tlfq_ident.tabular > lfq_ident.tabular
FlashLFQ --idt "lfq_ident.tabular" --rep "./" --out ./ --mbr ${parameters.enable_match_between_runs} --ppm ${parameters.precursor_mass_tolerance} --sha ${protein_inference} --thr ${task.cpus}
"""

mzDB to MGF conversion (source lines 22-24):
"""
mzdb2mgf -i "${mzdbfile}" -o "${mzdbfile.baseName}.mgf" 
"""

NormalyzerDE normalization (source lines 31-35):
"""
cp "proteinGroups.txt" protein_file.txt
cp "peptides.txt" peptide_file.txt
Rscript $baseDir/bin/runNormalyzer.R --comps="${params.comps}" --method="${parameters.normalization_method}" --exp_design="${exp_file}" --comp_file="${comp_file}"
"""

PolySTest statistical testing (source lines 26-40):
"""
convertProline=\$(which runPolySTestCLI.R)

echo \$convertProline
convertProline=\$(dirname \$convertProline)

echo \$convertProline
Rscript \${convertProline}/convertFromProline.R ${exp_design} ${proline_res}

sed -i "s/threads: 2/threads: ${task.cpus}/g" pep_param.yml
sed -i "s/threads: 2/threads: ${task.cpus}/g" prot_param.yml

runPolySTestCLI.R pep_param.yml
runPolySTestCLI.R prot_param.yml
"""

Input file preparation (source lines 31-85):
    """
    if [[ "$map" != "params2sdrf.yml" ]]
    then
        cp "${map}" params2sdrf.yml
    fi
    if [[ "$sdrf" != "no_sdrf" ]] 
    then    
        if [[ "$sdrf" != "sdrf_local.tsv" ]]
        then
	    cp "${sdrf}" sdrf_local.tsv
        fi
    fi	
    if [[ "$exp_design" != "no_exp_design" ]] 
    then    
       if [[ "$exp_design" != "exp_design.txt" ]] 
       then
	     cp "${exp_design}" exp_design.txt
       fi
    else
        if [[ "$sdrf" == "no_sdrf" ]] 
        then
            printf "raw_file\texp_condition" >> exp_design.txt
	    for a in $raws
	    do
	        printf "\n\$a\tA" >> exp_design.txt
	    done
        else 
	    $baseDir/bin/sdrf2exp_design.py
        fi        
    fi
    if [[ "$sdrf" == "no_sdrf" ]] 
    then
	$baseDir/bin/exp_design2sdrf.py
    fi
    if [ "$raws" == "no_raws" ] && [ "$mzmls" == "no_mzmls" ]
    then
        # Download all files from column file uri		
        echo "Downloading raw files from column file uri\n"
	for a in \$(awk -F '\t' -v column_val='comment[file uri]' '{ if (NR==1) {val=-1; for(i=1;i<=NF;i++) { if (\$i == column_val) {val=i;}}} if(val != -1) { if (NR!=1) print \$val} } ' "$sdrf")
	do
            echo "Downloading \$a\n"
	    wget -c -T 100 -t 5 "\$a"
        done
    fi

    if [[ "$parameters" == "no_params" ]]
    then
	printf "params:\n  None:  \nrawfiles: None\nfastafile: None" >  params.yml
    elif [[ "$parameters" != "params.yml" ]] 
    then
        cp "${parameters}" params.yml
    fi
    echo "See workflow version" > prepare_files.version.txt
    cp sdrf_local.tsv sdrf_temp.tsv
    """

Collection of SearchGUI results and Proline parameter setup (source lines 30-51):
"""
mkdir ./searchgui_results
for file in *.zip
do
unzip "\$file" -d ./searchgui_results/
mv \$(find ./ -type f \\( -name "*.t.xml.gz" -o -name "*.mzid" \\)) ./
gunzip *.t.xml.gz
rm -rf ./searchgui_results/*
done
touch import_file_list.txt
all_id_files=\$(find ./ -type f \\( -name "*.t.xml" -o -name "*.mzid" \\))
for file in \$all_id_files
do
echo "./\$file" >> import_file_list.txt
sed -i "s/PEPFDR/expected_fdr=${peptide_fdr}/g" "${param_file}"
sed -i "s/PROTFDR/expected_fdr=${protein_fdr}/g" "${param_file}"
sed -i "s/NUMPEPS/threshold=${parameters.min_num_peptides}/g" "${param_file}"
sed -i "s/moz_tol = 5/moz_tol=${prec_tol}/g" "${param_file}"
sed -i "s/moz_tol_unit = ppm/moz_tol_unit=${prec_ppm}/g" "${param_file}"
done
cp "${param_file}" lfq_param_file.txt
"""    

Writing the quantification experimental design (source lines 29-32):
"""
touch quant_exp_design.txt    
echo "${exp_design_text}" >> quant_exp_design.txt
"""

Adapting the experimental design to mzDB files (source lines 34-44):
"""
cp ${exp_design} quant_exp_design.txt
sed -i 's/raw_file/mzdb_file/g' quant_exp_design.txt
sed -i 's/.raw/.mzDB/g' quant_exp_design.txt
sed -i 's/.mzML/.mzDB/g' quant_exp_design.txt
sed -i 's/.mzml/.mzDB/g' quant_exp_design.txt
sed -i '2,\$s|^|./|' quant_exp_design.txt
# keep first two columns of quant_exp_design.txt
cut -f1,2 quant_exp_design.txt > quant_exp_design.txt.tmp
mv quant_exp_design.txt.tmp quant_exp_design.txt
"""

Thermo raw to mzDB conversion (source lines 20-23):
"""
ls -la
thermo2mzdb -i "${rawfile}" -o "${rawfile.baseName}.mzDB" 
"""

Conversion to mzML with ThermoRawFileParser (source lines 19-31):
"""
# Check if the file is a mzML file
if [[ "${rawfile}" == *.{mzML,mzml} ]]
  then
      # check if same file
      if [[ "${rawfile}" != "${rawfile.baseName}.mzML" ]]
      then
        cp "${rawfile}" "${rawfile.baseName}.mzML"
      fi
  else
      thermorawfileparser -i "${rawfile}" -b "${rawfile.baseName}.mzML" -f 2
fi
"""

SDRF conversion for MaxQuant and NormalyzerDE (source lines 24-41):
"""
parse_sdrf \\
convert-maxquant \\
-s "${sdrf}" \\
-f "PLACEHOLDER${fasta}" \\
-r PLACEHOLDER \\
-t PLACEHOLDERtemp \\
-o2 exp_design.tsv \\
-n ${task.cpus} 
echo "Preliminary" > sdrf_merge.version.txt

parse_sdrf \\
convert-normalyzerde \\
-s "${sdrf}" \\
-mq exp_design.tsv \\
-o Normalyzer_design.tsv \\
-oc Normalyzer_comparisons.txt
"""

Parameter file preparation (source lines 34-50):
    """
    if [[ "$sdrf" != "sdrf.tsv" ]]
    then
	cp "${sdrf}" sdrf.tsv
    fi
    if [[ "$parameters" != "params.yml" ]] 
    then
        cp "${parameters}" params.yml
    fi
    if [[ "$map" != "params2sdrf.yml" ]]
    then
        cp "${map}" params2sdrf.yml
    fi
    # TODO change to package when available
    python $projectDir/bin/add_data_analysis_param.py > changed_params.txt
    python $projectDir/bin/sdrf2params.py
    """

Decoy database generation with SearchGUI (source lines 26-28):
"""
searchgui eu.isas.searchgui.cmd.FastaCLI -in ${fasta} -decoy
"""    

SearchGUI identification parameter setup (source lines 67-85):
"""
mkdir tmp
mkdir log
searchgui eu.isas.searchgui.cmd.PathSettingsCLI -temp_folder ./tmp -log ./log
searchgui eu.isas.searchgui.cmd.IdentificationParametersCLI -out searchgui \\
  -frag_tol ${frag_tol} -frag_ppm ${frag_ppm} -prec_tol ${prec_tol} -prec_ppm ${prec_ppm} -enzyme "${enzyme}" -mc ${parameters["allowed_miscleavages"]} \\
  -max_isotope ${parameters["isotope_error_range"]} \\
  ${fixed_mods} ${var_mods}\\
  -fi "${parameters["fions"]}" -ri "${parameters["rions"]}" -xtandem_quick_acetyl 0 -xtandem_quick_pyro 0 -peptide_fdr ${parameters["ident_fdr_peptide"]}\\
  -protein_fdr ${parameters["ident_fdr_protein"]} -psm_fdr ${parameters["ident_fdr_psm"]} \\
  -myrimatch_num_ptms ${parameters["max_mods"]} -ms_amanda_max_mod ${parameters["max_mods"]} -msgf_num_ptms ${parameters["max_mods"]} -meta_morpheus_max_mods_for_peptide\\
  ${parameters["max_mods"]} -directag_max_var_mods ${parameters["max_mods"]} -comet_num_ptms ${parameters["max_mods"]} \\#-tide_max_ptms ${parameters["max_mods"]}  \\
  -myrimatch_min_pep_length ${parameters["min_peptide_length"]} -myrimatch_max_pep_length ${parameters["max_peptide_length"]} -ms_amanda_min_pep_length ${parameters["min_peptide_length"]} \\
  -ms_amanda_max_pep_length ${parameters["max_peptide_length"]} -msgf_min_pep_length ${parameters["min_peptide_length"]} -msgf_max_pep_length ${parameters["max_peptide_length"]} \\
  -omssa_min_pep_length ${parameters["min_peptide_length"]} -omssa_max_pep_length ${parameters["max_peptide_length"]} -comet_min_pep_length ${parameters["min_peptide_length"]} \\
  -comet_max_pep_length ${parameters["max_peptide_length"]} -tide_min_pep_length ${parameters["min_peptide_length"]} -tide_max_pep_length ${parameters["max_peptide_length"]} \\
  -andromeda_min_pep_length ${parameters["min_peptide_length"]} -andromeda_max_pep_length ${parameters["max_peptide_length"]} -meta_morpheus_min_pep_length ${parameters["min_peptide_length"]} \\
  -meta_morpheus_max_pep_length ${parameters["max_peptide_length"]} -max_charge ${parameters.max_precursor_charge} -min_charge ${parameters.min_precursor_charge}
"""    

MaxQuant run (source lines 30-42):
"""
cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        maxquant: \$(maxquant --version 2>&1 > /dev/null | cut -f2 -d\" \")
END_VERSIONS
sed \"s_<numThreads>.*_<numThreads>$task.cpus</numThreads>_\" ${paramfile} > mqpar_changed.xml
sed -i \"s|PLACEHOLDER|\$PWD/|g\" mqpar_changed.xml
mkdir temp
chmod -R a+rw *
maxquant mqpar_changed.xml
mv combined/txt/*.txt .
mv combined/proc/*unningTimes.txt runningTimes.txt
"""

URL: https://github.com/wombat-p/WOMBAT-Pipelines
License: MIT License