Help improve this workflow!
This workflow has been published but could be further improved with some additional meta data:- Keyword(s) in categories input, output, operation, topic
You can help improve this workflow by suggesting the addition or removal of keywords, suggest changes and report issues, or request to become a maintainer of the Workflow .
This is the code basis for the paper 'Insights to HIV-1 coreceptor usage by estimating HLA adaptation using Bayesian generalized linear mixed models'. Since HLA information contains private sensitive data that we are not allowed to publish, we can't publish neihter the training HLA data nor the final models.
Content
Snakemake pipeline
The repository contains
- a pipeline for learning HLA adaptation based on the p24 protein and the HLA I and II alleles, as well as clinical factors like age, sex, and ethnicity
./Snakefile
- a prediction pipeline for predicting the HLA adaptation based on the HLA profile and the p24 sequence, as well as clinical factors.
./Snakefile_predict
The folder
./rules
contains the different parts of the pipeline.
The folder
./scripts
contains the source code for the rules.
Config files
For each pipeline there is a corresponding config file:
./config.yml
./config_predict.yml
Environments
All environments are stored in the directory
./envs
.
The conda environment is defined in the file
./envs/conda/HIVIA.yml
The Snakemake environment is defined in the directory
./envs/snakemake/
There are different environments depending on the computational environment (GRID Engine, SLURM, cluster, etc. )
Run pipepline
The command to run the pipeline (on a cluster) is as follows
snakemake -n --configfile config.yml --profile envs/snakemake/[cluster] --nolock
This is a dry run (option -n). Remove the option -n to execute the workflow.
Code for manuscript figures and tables
The file
./notebooks/HIVIA.R
contains the code to produce all figures and tables and p-values in the paper.
Key idea
The key elements of this project is to compute the adaptation score of a viral sequence based on its HLA profile, age, sex, and ethnicity. The adaptation score is computed as follows:
For each frequent variant site si, two Bayesian generalized linear mixed models are computed:
-
HLA model
-
age
-
sex
-
ethnicity
-
phylogeny
-
HLA information (binarized)
-
-
baseline model
-
age
-
sex
-
ethnicity
-
phylogeny
-
The modeling is performed with the script
./scripts/modeling/fit_glmm.R
.
The computation of the adaptation score is performed in the script
./scripts/eval/compute_adaptation.R.
The call for fitting the HLA model using Bayesian GLMM is the following:
fit<- brm(formula = 'y ~ HLA_allele_1 + ... + HLA_allele_i + ... HLA_allele_n + age + sex + ethnicity + (1|patient_id)', cov_ranef = list(patient_id = ape::vcv.phylo(tree)) , data = df, family = 'categorical', prior = c(set_prior('horseshoe(df = 1, par_ratio = 0.01)', class="b")), seed = 100, control = list(adapt_delta = 0.99))
Code Snippets
19 20 | shell: 'Rscript {params.script_dir}/modeling/stratified_kCV_folds.R -i {input.df} -s {wildcards.seed} -d {params.dir} -k {params.k} --group {params.group} --log {log}' |
30 31 | shell: "Rscript {params.script_dir}/modeling/get_rand_test.R -i {input.rand_df} -t {input.test_df} -o {output} --cv cv" |
44 45 46 47 48 49 50 | run: hla = ",".join(params.hla) exec = 'Rscript {params.script_dir}/modeling/convert_train_hla2binary.R' exec += ' -i {input} -c {wildcards.hla_cutoff} -o {output}' exec += ' --hla ' + hla exec += ' -l {log}' shell(exec) |
63 64 65 66 | run: hla = ",".join(params.hla) exec = 'Rscript {params.script_dir}/eval/convert_test_hla2binary.R -i {input.df} --hla ' + hla + ' -o {output}' shell(exec) |
93 94 95 96 97 98 99 100 101 102 | run: import os my_env=dict(os.environ) ENV_VARS=['HOME', 'TMPDIR','TMP','HOSTNAME','PATH','SGE_STDERR_PATH', 'SGE_O_WORKDIR' ,'SGE_O_HOST','SGE_O_HOME', 'PYTHONPATH'] #with open(log[0], 'w') as l: # [l.write(env_var+":"+ my_env.get(env_var, "")+"\n") for env_var in ENV_VARS] for key, value in my_env.items(): print(key, value) fixed = ",".join(params.fixed) |
130 131 | shell: "Rscript {params.script_dir}/eval/CV_classify_SAAV.R -f {input.fit} -t {input.test} -r {input.rand} -o {output} --model {wildcards.model} --model_version {wildcards.model_version} -p {wildcards.protein} -s {wildcards.saav} --seed {wildcards.seed}" |
156 157 158 | run: with open(output[0], 'w') as f: f.write('\n'.join(input)) |
167 168 169 | run: exec = "Rscript {params.script_dir}/eval/compute_adaptation_score.R -i {input} --output_adapt {output} --output_pred {params.output_pred} --outcome_type {wildcards.outcome_type} --rand" shell(exec) |
10 11 12 13 14 | run: path = wildcards.run_id + '/'+'models/' + wildcards.protein + '/' + wildcards.outcome_type +'/final_models/' + wildcards.model_version + '/*' SAAVs_paths=glob.glob(path) SAAVs = [output.split('/')[-1] for output in SAAVs_paths] pd.DataFrame(SAAVs,columns=['sites']).to_csv(output[0], index = None) |
27 28 29 30 31 32 33 | run: hla = ",".join(params.hla) exec = 'Rscript {params.script_dir}/modeling/convert_train_hla2binary.R' exec += ' -i {input} -c {wildcards.hla_cutoff} -o {output}' exec += ' --hla ' + hla exec += ' -l {log}' shell(exec) |
57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 | run: import os my_env=dict(os.environ) ENV_VARS=['HOME', 'TMPDIR','TMP','HOSTNAME','PATH','SGE_STDERR_PATH', 'SGE_O_WORKDIR' ,'SGE_O_HOST','SGE_O_HOME', 'PYTHONPATH'] for key, value in my_env.items(): print(key, value) fixed = ",".join(params.fixed) #random = ",".join(params.random) #outcome = ",".join(params.outcome) # build fit exec = "Rscript {params.script_dir}/modeling/fit_glmm.R" exec += " -i {input.df} -t {input.tree} --method {params.method}" exec += " --model {wildcards.model} -p {params.prior} -s {wildcards.seed}" exec += " --adapt_delta {params.adapt_delta} -o {output}" exec += " -f " + fixed exec += " -r {params.random}" exec += " -y {params.outcome}" exec += " --coeffs {params.coeffs}" exec += " --outcome_type {wildcards.outcome_type}" exec += " --model_version {wildcards.model_version}" exec += " --protein {wildcards.protein}" exec += " --saav {wildcards.saav}" exec += " --ref_map {params.ref_map}" exec += " --par_ratio {params.par_ratio}" shell(exec) |
16 17 | shell: "Rscript {params.script_dir}/prep/coreceptor_data/clean_coreceptor_data.R -i {input} -o {output} -l {log}" |
31 32 | shell: "Rscript {params.script_dir}/prep/sequence_data/subtype_sequences.R -i {input.sequences} -o {output.subtypes} -a {params.api}" |
42 43 | shell: 'Rscript {params.script_dir}/prep/sequence_data/clean_subtype_table.R -s {input.subtypes} -o {output.cleaned_subtypes}' |
95 96 | shell: "Rscript {params.script_dir}/prep/sequence_data/get_reference_sequence.R -a {output.alignment_file} -o {output.refseq_file} -t {params.alignment_type} -s {params.suborganism} --subtype {params.subtype} -d {params.region_definition} -r {params.region} -b {params.basetype} -y {params.year} -f {params.format} -q {params.query} --api_source {params.api_source}" |
109 110 | shell: "mafft --add {input.selected_sequences} --quiet {input.reference_sequence} > {output.alignment}" |
132 133 | shell: "Rscript {params.script_dir}/prep/sequence_data/get_reference_sequence2.R -i {params.id} -n {params.protein} -d {params.db} -o {output}" |
152 153 | shell: "Rscript {params.script_dir}/prep/sequence_data/get_codon_alignment.R -i {input.dna_alignment} -f {params.frame} -c {params.compensating_num} -t {params.input_type} --output_format {params.output_format} --output_seqs_type {params.output_seqs_type} -n {params.protein} --output_dir {params.output_dir} --api_source {params.api_source}" |
167 168 169 170 171 172 173 174 175 | run: exec = "Rscript {params.script_dir}/prep/sequence_data/clean_codon_align.R " exec+= "-i {input.codon_align_aa} " exec+= "-r {output.ref_seq} " exec+= "-a {output.alignment} " exec+= "-l {log} " exec+= "-s {output.seq_freq_mat} " exec+= "-p {output.pos_freq_mat} " shell(exec) |
188 189 | shell: "mafft --add {input.ref_seq_aa} --quiet --mapout {input.codon_align_aa_ref_seq} > {output.align}" |
199 200 | shell: 'Rscript {params.script_dir}/prep/sequence_data/remove_refseq.R -i {input} -o {output} -r {params.refseq_name}' |
219 220 | shell: "Rscript {params.script_dir}/prep/sequence_data/make_SAAV_table.R -a {input.alignment} -r {input.ref_map} -d {wildcards.saav_cutoff} -o {output} --outcome_type {wildcards.outcome_type}" |
264 265 266 267 268 269 270 271 272 273 274 275 276 | run: fixed = ",".join(params.fixed) hla = ",".join(params.hla) exec = 'Rscript {params.script_dir}/modeling/select_variables.R' exec += ' -i {input} -o {output}' exec += ' -f ' + fixed exec += ' -r {params.random}' exec += ' -y {params.outcome}' exec += ' --id {params.id_column}' exec += ' --hla ' + hla shell(exec) |
291 292 | shell: 'raxmlHPC -s {input} -w $(pwd)/{params.dir} -n {params.run_id} -m {params.model} -p {params.seed} --silent' |
Support
- Future updates