Cross validation workflow of Semi-binary Matrix Factorization (SBMF)

SBMFCV

SBMFCV searches for the optimal hyper-parameters (rank and binary regularization parameter) for Semi-binary Matrix Factorization (SBMF) as performed by dcTensor::dNMF. In SBMF, a non-negative matrix X is decomposed into a matrix product U * V', and only U is constrained to take binary ({0,1}) values. For details, see the dNMF vignette.
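In symbols, the decomposition described above can be sketched as follows. This is an illustration under stated assumptions, not the formulation from the dNMF vignette: N and M are placeholder names for the dimensions of X, J is the rank, V is assumed to be non-negative as in standard NMF, D stands for whichever reconstruction error is used, and R(U) stands for a penalty that is small when the entries of U are close to 0 or 1. The exact forms of D and R are not given in this document, so they are left abstract.

```latex
X \approx U V^{\top}, \qquad
X \in \mathbb{R}_{\ge 0}^{N \times M}, \quad
U \in \mathbb{R}_{\ge 0}^{N \times J}, \quad
V \in \mathbb{R}_{\ge 0}^{M \times J}

\min_{U \ge 0,\; V \ge 0} \; D\!\left(X \,\middle\|\, U V^{\top}\right) + \lambda \, R(U),
\qquad \lambda = 10^{l}
```

Under this reading, the two hyper-parameters that SBMFCV searches over are the rank J and the exponent l of the regularization weight.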

SBMFCV consists of the Snakemake rules whose shell commands are listed under Code Snippets.

Pre-requisites (our experiment)

  • Snakemake: v7.30.1

  • Singularity: v3.7.1

  • Docker: v20.10.10 (optional)

Snakemake is available via Python package managers such as pip, conda, or mamba.

Singularity and Docker can be installed with the installers provided on their websites or with an OS package manager such as apt-get/yum for Linux or brew for macOS.

For the details, see the installation documents below.

  • https://snakemake.readthedocs.io/en/stable/getting_started/installation.html

  • https://docs.sylabs.io/guides/3.0/user-guide/installation.html

  • https://docs.docker.com/engine/install/

Note: The following source code does not work on M1/M2 Mac. M1/M2 Mac users should refer to README_AppleSilicon.md instead.

Usage

In this demo, we use a toy data matrix (data/testdata.tsv) consisting of 1280 samples and 13 variables, but the user can specify any non-negative matrix.

Note that the input file is assumed to be in tab-separated values (TSV) format with no row or column names.
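Before running the workflow on your own data, it may be worth checking that the file meets the format described above. The following is a minimal sketch of such a check; check_tsv is a hypothetical helper, not part of the SBMFCV repository (which ships its own src/check_input.sh), and it only tests that every line has the same number of tab-separated fields and that every field is a non-negative number.

```shell
#!/bin/sh
# Hypothetical pre-flight check for an SBMFCV input matrix (not part of the repository).
# Usage: check_tsv FILE
# Succeeds and prints "<rows> rows x <cols> columns" when every line has the
# same number of tab-separated fields and every field is a non-negative number.
check_tsv() {
  awk -F '\t' '
    NR == 1 { ncol = NF }
    NF != ncol {
      printf "line %d: expected %d fields, found %d\n", NR, ncol, NF
      bad = 1; exit
    }
    {
      for (i = 1; i <= NF; i++) {
        # Accept plain or scientific notation without a minus sign
        if ($i !~ /^[+]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][-+]?[0-9]+)?$/) {
          printf "line %d, field %d: not a non-negative number: %s\n", NR, i, $i
          bad = 1; exit
        }
      }
    }
    END {
      if (bad || NR == 0) exit 1
      printf "%d rows x %d columns\n", NR, ncol
    }
  ' "$1"
}
```

For example, `check_tsv data/testdata.tsv` reports the matrix dimensions on success. A header row or row names would be reported as non-numeric fields, which matches the requirement that the file has no row or column names.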

Download this GitHub repository

First, download this GitHub repository and change the working directory.

git clone https://github.com/chiba-ai-med/SBMFCV.git
cd SBMFCV

Example with local machine

Next, run SBMFCV with the snakemake command as follows.

Note: To check if the command is executable, set smaller parameters such as rank_min=2 rank_max=2 lambda_max=2 lambda_min=2 trials=2 n_iter_max=2.

snakemake -j 4 --config input=data/testdata.tsv outdir=output rank_min=2 \
rank_max=10 lambda_min=-10 lambda_max=10 trials=10 \
n_iter_max=100 ratio=20 --resources mem_gb=10 --use-singularity
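Before launching the full search, it can help to preview a small run. The sketch below only prints two command lines built from the small parameter values suggested in the note above: one with Snakemake's -n (dry-run) flag, which lists the planned jobs without executing them, and one for an actual small run. The helper name smoke_cmd and the output directory output_smoke are hypothetical, and ratio=20 is carried over from the full example because the note does not specify it.

```shell
#!/bin/sh
# Small settings taken from the note above; the outdir name is an arbitrary choice.
SMOKE_CONFIG="input=data/testdata.tsv outdir=output_smoke rank_min=2 rank_max=2 lambda_min=2 lambda_max=2 trials=2 n_iter_max=2 ratio=20"

# smoke_cmd [EXTRA_FLAG] -> prints a snakemake command line for a quick check
#   EXTRA_FLAG is optional, e.g. -n for a dry run
smoke_cmd() {
  printf '%s\n' "snakemake ${1:+$1 }-j 4 --config $SMOKE_CONFIG --resources mem_gb=10 --use-singularity"
}

smoke_cmd -n   # dry run: list the jobs without executing them
smoke_cmd      # real run with the small settings
```

Printing the commands rather than executing them keeps the sketch safe to try anywhere; copy a printed line into the shell from the repository root to run it.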

The meanings of all the arguments are below.

  • -j : Snakemake option to set the number of cores (e.g. 10, mandatory)

  • --config : Snakemake option to set the configuration (mandatory)

  • input : Input file (e.g., testdata.tsv, mandatory)

  • outdir : Output directory (e.g., output, mandatory)

  • rank_min : Lower limit of rank parameter to search (e.g., 2, which is used for the rank parameter J of dNMF, mandatory)

  • rank_max : Upper limit of rank parameter to search (e.g., 10, which is used for the rank parameter J of dNMF, mandatory)

  • lambda_min : Lower limit of lambda parameter to search (e.g., -10, which means 10^-10 is used for the binary regularization parameter Bin_U of dNMF, mandatory)

  • lambda_max : Upper limit of lambda parameter to search (e.g., 10, which means 10^10 is used for the binary regularization parameter Bin_U of dNMF, mandatory)

  • trials : Number of random trials (e.g., 50, mandatory)

  • n_iter_max : Number of iterations (e.g., 100, mandatory)

  • ratio : Sampling ratio of cross-validation (0 - 100, e.g., 20, mandatory)

  • --resources : Snakemake option to control resources (optional)

  • mem_gb : Memory usage (GB, e.g. 10, optional)

  • --use-singularity : Snakemake option to use Docker containers via Singularity (mandatory)
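To make the search space concrete, the sketch below enumerates the candidate values implied by the arguments above. It assumes the workflow tries every integer rank from rank_min to rank_max and every integer exponent from lambda_min to lambda_max; the step of 1 is an assumption, and the Snakefile remains the authority. Each exponent l is printed as 1e<l>, that is, a Bin_U value of 10^l as described in the list. The helper names ranks and lambdas are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers: list the candidate values searched by SBMFCV,
# assuming integer steps of 1 between the lower and upper limits.

# ranks RANK_MIN RANK_MAX -> one candidate rank per line
ranks() (
  r=$1
  while [ "$r" -le "$2" ]; do
    printf '%d\n' "$r"
    r=$((r + 1))
  done
)

# lambdas LAMBDA_MIN LAMBDA_MAX -> one Bin_U value per line, written as 1e<l>
lambdas() (
  l=$1
  while [ "$l" -le "$2" ]; do
    printf '1e%d\n' "$l"
    l=$((l + 1))
  done
)

# Values from the example command: rank_min=2 rank_max=10 lambda_min=-10 lambda_max=10
ranks 2 10 | paste -sd' ' -
lambdas -10 10 | paste -sd' ' -
```

Since each candidate is presumably repeated for the number of random trials, widening either range or raising trials increases the number of jobs accordingly, which is why small values are suggested for a first check.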

Example with the parallel environment (GridEngine)

If GridEngine (the qsub command) is available in your environment, you can pass a qsub command to Snakemake. Simply adding the --cluster option submits the jobs to multiple nodes and distributes the computation.

Note: To check if the command is executable, set smaller parameters such as rank_min=2 rank_max=2 lambda_max=2 lambda_min=2 trials=2 n_iter_max=2.

snakemake -j 4 --config input=data/testdata.tsv outdir=output rank_min=2 \
rank_max=10 lambda_min=-10 lambda_max=10 trials=10 \
n_iter_max=100 ratio=20 --resources mem_gb=10 --use-singularity \
--cluster "qsub -l nc=4 -p -50 -r yes" --latency-wait 60

Example with the parallel environment (Slurm)

Likewise, if Slurm (the sbatch command) is available in your environment, you can pass an sbatch command via the --cluster option.

Note: To check if the command is executable, set smaller parameters such as rank_min=2 rank_max=2 lambda_max=2 lambda_min=2 trials=2 n_iter_max=2.

snakemake -j 4 --config input=data/testdata.tsv outdir=output rank_min=2 \
rank_max=10 lambda_min=-10 lambda_max=10 trials=10 \
n_iter_max=100 ratio=20 --resources mem_gb=10 --use-singularity \
--cluster "sbatch -n 4 --nice=50 --requeue" --latency-wait 60

Example with a local machine with Docker

If the docker command is available, the following command can be run without installing any other tools.

Note: To check if the command is executable, set smaller parameters such as rank_min=2 rank_max=2 lambda_max=2 lambda_min=2 trials=2 n_iter_max=2.

docker run --rm -v $(pwd):/work ghcr.io/chiba-ai-med/sbmfcv:main \
-i /work/data/testdata.tsv -o /work/output \
--cores=4 --rank_min=2 --rank_max=10 \
--lambda_min=-10 --lambda_max=10 --trials=10 \
--n_iter_max=100 --ratio=20 --memgb=10

Authors

  • Koki Tsuyuzaki

  • Eiryo Kawakami

Code Snippets

Snakemake, from line 50 of main/Snakefile:

shell:
	'src/check_input.sh {input} {output} >& {log}'

Snakemake, from line 67 of main/Snakefile:

shell:
	'src/nmf.sh {input.in1} {output} {wildcards.rank} {N_ITER_MAX} {RATIO} >& {log}'

Snakemake, from line 80 of main/Snakefile:

shell:
	'src/aggregate_nmf.sh {RANK_MIN} {RANK_MAX} {TRIALS} {OUTDIR} {output} > {log}'

Snakemake, from line 92 of main/Snakefile:

shell:
	'src/plot_test_error.sh {input} {output} > {log}'

Snakemake, from line 104 of main/Snakefile:

shell:
	'src/bestrank.sh {input} {output} > {log}'

Snakemake, from line 121 of main/Snakefile:

shell:
	'src/sbmf.sh {input} {output} {wildcards.l} {N_ITER_MAX} >& {log}'

Snakemake, from line 134 of main/Snakefile:

shell:
	'src/aggregate_sbmf.sh {LAMBDA_MIN} {LAMBDA_MAX} {TRIALS} {OUTDIR} {output} > {log}'

Snakemake, from line 146 of main/Snakefile:

shell:
	'src/plot_zero_one_percentage.sh {input} {output} > {log}'

Snakemake, from line 158 of main/Snakefile:

shell:
	'src/bestlambda.sh {input} {output} > {log}'

Snakemake, from line 175 of main/Snakefile:

shell:
	'src/bestrank_bestlambda_sbmf.sh {input} {output} {N_ITER_MAX} >& {log}'

Snakemake, from line 188 of main/Snakefile:

shell:
	'src/aggregate_bestrank_bestlambda_sbmf.sh {TRIALS} {OUTDIR} {output} > {log}'

Snakemake, from line 204 of main/Snakefile:

shell:
	'src/b3.sh {input} {output} > {log}'

Snakemake, from line 216 of main/Snakefile:

shell:
	'src/bindata_for_landscaper.sh {input} {output} > {log}'

Shell:

SLURM_RESTART_COUNT=2

Rscript src/aggregate_bestrank_bestlambda_sbmf.R $@

Shell:

SLURM_RESTART_COUNT=2

Rscript src/aggregate_nmf.R $@

Shell:

SLURM_RESTART_COUNT=2

Rscript src/aggregate_sbmf.R $@

Shell, from line 11 of src/b3.sh:

SLURM_RESTART_COUNT=2

Rscript src/b3.R $@

Shell, from line 11 of src/bestlambda.sh:

SLURM_RESTART_COUNT=2

Rscript src/bestlambda.R $@

Shell:

SLURM_RESTART_COUNT=2

Rscript src/bestrank_bestlambda_sbmf.R $@

Shell, from line 11 of src/bestrank.sh:

SLURM_RESTART_COUNT=2

Rscript src/bestrank.R $@

Shell:

SLURM_RESTART_COUNT=2

Rscript src/bindata_for_landscaper.R $@

Shell, from line 11 of src/check_input.sh:

SLURM_RESTART_COUNT=2

Rscript src/check_input.R $@

Shell, from line 11 of src/nmf.sh:

SLURM_RESTART_COUNT=2

Rscript src/nmf.R $@

Shell:

SLURM_RESTART_COUNT=2

Rscript src/plot_test_error.R $@

Shell:

SLURM_RESTART_COUNT=2

Rscript src/plot_zero_one_percentage.R $@

Shell, from line 11 of src/sbmf.sh:

SLURM_RESTART_COUNT=2

Rscript src/sbmf.R $@

URL: https://github.com/chiba-ai-med/SBMFCV
Name: sbmfcv
Version: v1.0.3
License: MIT License