Pipeline to identify the abundance determinants of human snoRNAs.


Author: Etienne Fafard-Couture

Email: etienne.fafard-couture@usherbrooke.ca

Description

Snakemake-based workflow to predict the abundance status of human snoRNAs and identify their main abundance determinants. By default, this pipeline also predicts the abundance status of mouse snoRNAs, and it can also be used to predict the abundance status of snoRNAs from several other vertebrate species (see below).

Requirements

  • Conda (Tested with version=4.12.0)

  • Mamba (Tested with version=0.15.3)

  • Snakemake (Tested with version=6.0.5)

1 - Conda (Miniconda3) needs to be installed (https://docs.conda.io/en/latest/miniconda.html). For Linux users:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Answer "yes" when asked "Do you wish the installer to initialize Miniconda3?"
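
If the prompt was answered no (or missed), the shell can still be initialized afterwards before proceeding, e.g. for a bash shell:

conda init bash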

2 - Mamba needs to be installed via conda (mamba greatly speeds up environment creation, but conda can still be used instead of mamba):

conda install -n base -c conda-forge mamba

3 - Snakemake needs to be installed via mamba:

conda activate base
mamba create -c conda-forge -c bioconda -n snakemake snakemake=6.0.5

To activate the 'snakemake' environment that was just created:

conda activate snakemake
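
Optionally, to verify the installation, the Snakemake version can be checked and the download jobs previewed with Snakemake's standard dry-run flag (-n); nothing is executed at this point:

snakemake --version
snakemake all_downloads -n --use-conda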

Running the pipeline on a Slurm-based cluster (to predict the abundance status of human and mouse snoRNAs)

Firstly, the conda environments required for the download steps need to be created:

snakemake all_downloads --conda-create-envs-only --use-conda --conda-frontend mamba --cores 1

Secondly, datasets need to be downloaded (this might take a while):

snakemake all_downloads --use-conda --cores 1

Thirdly, environments required by tools and calculations need to be created:

snakemake --conda-create-envs-only --use-conda --conda-frontend mamba --cores 1

Fourthly, the pipeline is run on compute nodes using the previously created environments as follows (human snoRNA prediction):

snakemake -j 999 --use-conda --immediate-submit --notemp --cluster-config cluster.json --cluster 'python3 slurmSubmit.py {dependencies}'

Fifthly, the pipeline is run on compute nodes using the previously created environments as follows (mouse snoRNA prediction):

snakemake all_mouse -j 999 --use-conda --immediate-submit --notemp --cluster-config cluster.json --cluster 'python3 slurmSubmit.py {dependencies}'

Finally, generating figures for the human and mouse snoRNAs is done as follows on the cluster:

snakemake all_figures -j 999 --use-conda --immediate-submit --notemp --cluster-config cluster.json --cluster 'python3 slurmSubmit.py {dependencies}'

or locally (of note: mouse figures might differ slightly from those in the paper since RNAcentral data is updated frequently):

snakemake all_figures --use-conda --cores 1

Running the pipeline on a Slurm-based cluster (to predict the abundance status of snoRNAs from other vertebrate species)

Firstly, the human/mouse snoRNA pipeline described above needs to be run (to generate the model used to predict species snoRNA abundance status).
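
If it has not been run yet, this corresponds to the commands of the previous section, e.g. for the human predictions on the cluster:

snakemake -j 999 --use-conda --immediate-submit --notemp --cluster-config cluster.json --cluster 'python3 slurmSubmit.py {dependencies}'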

Secondly, datasets need to be downloaded (this might take a while):

snakemake species_downloads --use-conda --cores 1

Thirdly, the species prediction pipeline is run on compute nodes using the previously created environments as follows:

snakemake species_predictions -j 999 --use-conda --immediate-submit --notemp --cluster-config cluster.json --cluster 'python3 slurmSubmit.py {dependencies}'

Finally, generating figures for the vertebrate species snoRNAs is done as follows locally (of note: figures might differ slightly from those in the paper since RNAcentral data is updated frequently):

snakemake species_figures --use-conda --cores 1

Code Snippets

16
17
18
shell:
    "find .snakemake/conda/*/bin/ -name .bioconductor-bsgenome*-link.sh > {output.bedtools_dir} && "
    "sed -i 's/\.bioconductor-bsgenome.*-link\.sh/bedtools/g' {output.bedtools_dir}"
35
36
script:
    "../scripts/r/branch_pointer.R"
57
58
script:
    "../scripts/python/get_best_bp.py"
17
18
19
20
shell:
    "samtools index {input.bams} {output.bam_index} && "
    "samtools view {input.bams} '19:50797704-50804929' -b > {output.filtered_bams} && "
    "samtools index {output.filtered_bams} {output.bam_index_filtered}"
33
34
shell:
    "./{input.sashimi_script} -b {input.bams} -c '19:50797704-50804929' -o {output.sashimi} -g {input.gtf} -F 'svg'"
47
48
script:
    "../scripts/python/multi_HG_different_label_snoRNAs.py"
62
63
script:
    "../scripts/python/graphs/bar_FP_vs_TN_multi_HG_different_labels.py"
77
78
script:
    "../scripts/python/graphs/decision_plot_interesting_snoRNAs.py"
95
96
script:
    "../scripts/python/graphs/decision_plot_multi_HG_snoRNAs.py"
122
123
script:
    "../scripts/python/graphs/pie_donut_multi_HG_snoRNAs.py"
141
142
script:
    "../scripts/python/graphs/hbar_nb_sno_per_confusion_val_HG.py"
33
34
script:
    "../scripts/python/bed_for_vap.py"
22
23
script:
    "../scripts/python/merge_bed_peaks.py"
32
33
shell:
    "cut -f1-6 {input.snoRNA_bed} > {output.formated_bed}"
42
43
44
45
run:
    for i, bed in enumerate(input.beds):
        output_i = output.sorted_beds[i]
        sp.call("""awk -v OFS="\t" '{print $1,$2,$3,$4,".",$5}' """+bed+""" | sort -k1,1 -k2n > """+output_i, shell=True)
59
60
script:
    "../scripts/python/map_bed_peaks_to_sno.py"
74
75
script:
    "../scripts/python/graphs/rbp_enrichment_density.py"
88
89
script:
    "../scripts/python/graphs/rbp_enrichment_density_confusion_value.py"
45
46
script:
    "../scripts/python/fasta_per_sno_type.py"
59
60
script:
    "../scripts/python/cd_box_location_all.py"
74
75
script:
    "../scripts/python/haca_box_location_all.py"
87
88
script:
    "../scripts/python/hamming_distance_box_all.py"
104
105
script:
    "../scripts/python/fasta_sequence_abundance_status.py"
121
122
script:
    "../scripts/python/fasta_sequence_abundance_status_length.py"
135
136
script:
    "../scripts/python/cd_box_location.py"
150
151
script:
    "../scripts/python/haca_box_location.py"
180
181
script:
    "../scripts/python/flanking_nt_to_motifs.py"
207
208
script:
    "../scripts/python/convert_motif_table_to_fa.py"
225
226
script:
    "../scripts/python/graphs/logo_box.py"
244
245
script:
    "../scripts/python/graphs/logo_box.py"
274
275
script:
    "../scripts/python/graphs/logo_box_wo_blank_plus_pie.py"
37
38
script:
    "../scripts/python/scale_features_after_manual_split_only_hamming.py"
60
61
script:
    "../scripts/python/hyperparameter_tuning_cv_scale_after_split.py"
79
80
script:
    "../scripts/python/train_models_scale_after_split.py"
93
94
script:
    "../scripts/python/test_models_scale_after_split.py"
111
112
script:
    "../scripts/python/confusion_matrix_f1_scale_after_split.py"
35
36
script:
    "../scripts/python/scale_features_after_split_10_iterations.py"
58
59
script:
    "../scripts/python/hyperparameter_tuning_cv_scale_after_split.py"
77
78
script:
    "../scripts/python/train_models_scale_after_split.py"
91
92
script:
    "../scripts/python/test_models_scale_after_split.py"
109
110
script:
    "../scripts/python/confusion_matrix_f1_scale_after_split.py"
31
32
script:
    "../scripts/python/test_models_scale_after_split_log_reg_thresh.py"
50
51
script:
    "../scripts/python/confusion_matrix_f1_scale_after_split_log_reg_thresh.py"
27
28
script:
    "../scripts/python/scale_features_species_prediction_top3_rs.py"
46
47
script:
    "../scripts/python/hyperparameter_tuning_cv_scale_after_split.py"
65
66
script:
    "../scripts/python/train_models_scale_after_split.py"
85
86
script:
    "../scripts/python/predict_species_snoRNA_label_top3.py"
 99
100
script:
    "../scripts/python/test_models_scale_after_split.py"
117
118
script:
    "../scripts/python/confusion_matrix_f1_scale_after_split.py"
140
141
script:
    "../scripts/python/predict_species_snoRNA_label_wo_dup_top3.py"
157
158
script:
    "../scripts/python/test_models_scale_after_split.py"
175
176
script:
    "../scripts/python/confusion_matrix_f1_scale_after_split.py"
31
32
script:
    "../scripts/python/test_models_scale_after_split_log_reg_thresh.py"
50
51
script:
    "../scripts/python/confusion_matrix_f1_scale_after_split_log_reg_thresh.py"
26
27
script:
    "../scripts/python/scale_features_species_prediction_top4_rs.py"
45
46
script:
    "../scripts/python/hyperparameter_tuning_cv_scale_after_split.py"
64
65
script:
    "../scripts/python/train_models_scale_after_split.py"
84
85
script:
    "../scripts/python/predict_species_snoRNA_label.py"
98
99
script:
    "../scripts/python/test_models_scale_after_split.py"
116
117
script:
    "../scripts/python/confusion_matrix_f1_scale_after_split.py"
138
139
script:
    "../scripts/python/predict_species_snoRNA_label_wo_dup.py"
155
156
script:
    "../scripts/python/test_models_scale_after_split.py"
173
174
script:
    "../scripts/python/confusion_matrix_f1_scale_after_split.py"
195
196
script:
    "../scripts/python/predict_species_snoRNA_label_no_dup.py"
212
213
script:
    "../scripts/python/test_models_scale_after_split.py"
230
231
script:
    "../scripts/python/confusion_matrix_f1_scale_after_split.py"
29
30
script:
    "../scripts/python/scale_features_species_prediction_top4.py"
48
49
script:
    "../scripts/python/hyperparameter_tuning_cv_scale_after_split.py"
67
68
script:
    "../scripts/python/train_models_scale_after_split.py"
87
88
script:
    "../scripts/python/predict_species_snoRNA_label.py"
101
102
script:
    "../scripts/python/test_models_scale_after_split.py"
119
120
script:
    "../scripts/python/confusion_matrix_f1_scale_after_split.py"
21
22
script:
    "../scripts/python/one_hot_encode.py"
34
35
script:
    "../scripts/python/one_hot_encode.py"
67
68
script:
    "../scripts/python/scale_features_after_manual_split.py"
86
87
script:
    "../scripts/python/hyperparameter_tuning_cv_scale_after_split.py"
105
106
script:
    "../scripts/python/train_models_scale_after_split.py"
119
120
script:
    "../scripts/python/test_models_scale_after_split.py"
137
138
script:
    "../scripts/python/confusion_matrix_f1_scale_after_split.py"
38
39
script:
    "../scripts/python/scale_features_after_manual_split.py"
57
58
script:
    "../scripts/python/hyperparameter_tuning_cv_scale_after_split.py"
76
77
script:
    "../scripts/python/train_models_scale_after_split.py"
90
91
script:
    "../scripts/python/test_models_scale_after_split.py"
108
109
script:
    "../scripts/python/confusion_matrix_f1_scale_after_split.py"
40
41
script:
    "../scripts/python/scale_features_after_manual_split_top3.py"
59
60
script:
    "../scripts/python/hyperparameter_tuning_cv_scale_after_split.py"
78
79
script:
    "../scripts/python/train_models_scale_after_split.py"
92
93
script:
    "../scripts/python/test_models_scale_after_split.py"
110
111
script:
    "../scripts/python/confusion_matrix_f1_scale_after_split.py"
40
41
script:
    "../scripts/python/scale_features_after_manual_split_top4.py"
59
60
script:
    "../scripts/python/hyperparameter_tuning_cv_scale_after_split.py"
78
79
script:
    "../scripts/python/train_models_scale_after_split.py"
92
93
script:
    "../scripts/python/test_models_scale_after_split.py"
110
111
script:
    "../scripts/python/confusion_matrix_f1_scale_after_split.py"
19
20
script:
    "../scripts/python/keep_one_feature.py"
41
42
script:
    "../scripts/python/hyperparameter_tuning_cv.py"
59
60
script:
    "../scripts/python/train_models.py"
73
74
script:
    "../scripts/python/test_models.py"
91
92
script:
    "../scripts/python/confusion_matrix_f1.py"
26
27
script:
    "../scripts/python/hyperparameter_tuning_cv_scale_after_split.py"
45
46
script:
    "../scripts/python/train_models_scale_after_split.py"
60
61
script:
    "../scripts/python/test_models_scale_after_split.py"
79
80
script:
    "../scripts/python/confusion_matrix_f1_scale_after_split.py"
24
25
script:
    "../scripts/python/hyperparameter_tuning_cv.py"
42
43
script:
    "../scripts/python/train_models.py"
56
57
script:
    "../scripts/python/test_models.py"
74
75
script:
    "../scripts/python/confusion_matrix_f1.py"
18
19
script:
    "../scripts/python/remove_snoRNA_clusters.py"
41
42
script:
    "../scripts/python/hyperparameter_tuning_cv.py"
59
60
script:
    "../scripts/python/train_models_wo_clusters.py"
73
74
script:
    "../scripts/python/test_models_wo_clusters.py"
91
92
script:
    "../scripts/python/confusion_matrix_f1_wo_clusters.py"
18
19
script:
    "../scripts/python/remove_feature.py"
40
41
script:
    "../scripts/python/hyperparameter_tuning_cv.py"
58
59
script:
    "../scripts/python/train_models.py"
72
73
script:
    "../scripts/python/test_models.py"
90
91
script:
    "../scripts/python/confusion_matrix_f1.py"
18
19
script:
    "../scripts/python/add_gene_biotype.py"
31
32
script:
    "../scripts/python/add_HG_tpm.py"
45
46
script:
    "../scripts/python/abundance_cutoff.py"
57
58
script:
    "../scripts/python/abundance_cutoff_all_biotypes.py"    
69
70
71
72
73
shell:
    """awk -v FS="," 'NR> 1 {{print $8}}' {input.host_info_df} | sort | uniq > hg_temp && """
    """grep -Ff hg_temp {input.gtf} | grep -v SNHG14 > hg_temp_gtf && """
    """cat hg_temp_gtf {input.snhg14_gtf} > {output.hg_gtf} && """
    """rm hg_temp && rm hg_temp_gtf"""
83
84
85
86
87
shell:
    """awk -v OFS='\t' 'NR>6 {{print $1, $4, $5, "to_remove"$10"to_remove", $6, $7, $2, $3, $8, "to_delete"$0}}' {input.gtf} | """
    """sed -E 's/to_remove"//g; s/";to_remove//g; s/to_delete.*gene_id/gene_id/g' | """
    """sort -n -k1,1 -k2,2 > {output.gtf_bed} && """
    """awk '$8=="gene" {{print $0}}' {output.gtf_bed}  | grep snoRNA | sed 's/\t$//g; s/^/chr/g' | sort -k1,1 -k2,2n > {output.all_sno_bed}"""
95
96
97
98
shell:
    """awk -v OFS='\t' 'NR>6 {{print $1, $4, $5, "to_remove"$10"to_remove", $6, $7, $2, $3, $8, "to_delete"$0}}' {input.HG_gtf} | """
    """sed -E 's/to_remove"//g; s/";to_remove//g; s/to_delete.*gene_id/gene_id/g' | """
    """sort -n -k1,1 -k2,2 > {output.HG_bed} """
106
107
shell:
    """awk '$8=="gene" {{print "chr"$0}}' {input.HG_gtf_bed} | sort -k1,1 -k2,2n > {output.HG_bed}"""
124
125
script:
    "../scripts/python/generate_snoRNA_beds.py"
143
144
script:
    "../scripts/python/sno_location_exon.py"
152
153
shell:
    """awk -v OFS="\t" '{{print $4, $3-$2+1}}' {input.all_sno_bed} > {output.sno_length}"""
168
169
script:
    "../scripts/python/host_functions.py"
186
187
script:
    "../scripts/python/snodb_table_formatting.py"
11
12
shell:
    "wget -O {output.tpm_df} {params.link}"
20
21
shell:
    "wget -O {output.gtf} {params.link}"
29
30
shell:
    "wget -O {output.gtf_df} {params.link}"
41
42
43
44
45
46
47
48
shell:
    """wget {params.link} -O refseq_temp.gtf.gz --quiet && zcat refseq_temp.gtf.gz | """
    """grep SNHG14 | sed 's/NC_000015.10/15/g' > {output.snhg14_gtf} && """
    """zcat refseq_temp.gtf.gz | grep SNHG14 | """
    """awk -v OFS='\t' 'NR>1 {{print "chr15", $4, $5, "ENSG00000224078", """
    """$6, $7, $2, $3, $8, $1, $12, "TEST"$0"TEST2", $10, $16, "transcript_name", $16}}' | """
    """sed -E 's/;//g; s/TEST.*exon_number..//g; s/..TEST2//g' | awk -v OFS='\t' 'NR>1' > {output.snhg14_bed} && """
    """rm refseq_temp.gtf.gz"""
57
58
59
60
61
shell:
    "wget -O temp.gz {params.link} && "
    "gunzip temp.gz && "
    "sed '/>KI270728.1/Q' temp > {output.genome} && "
    "rm temp"
70
71
shell:
    "wget -O {output.phastcons} {params.link}"
84
85
86
87
shell:
    "wget -O {output.snodb} {params.link_snodb} && "
    "wget -O {output.nmd} {params.link_nmd} && "
    "wget -O {output.di_promoter} {params.link_di_promoter}"
 98
 99
100
shell:
    "wget -O {output.hg_df} {params.link_hg} && "
    "wget -O {output.hg_pc_function} {params.link_pc_functions}"
109
110
shell:
    "wget -O {output.lnctard} {params.link_lnctard}"
116
117
shell:
    "pip install --upgrade forgi &> {output.forgi_log}"
130
131
132
133
134
135
shell:
    "wget -O temp_bigbed_1.bb {params.link_bed_file_1} && "
    "bigBedToBed temp_bigbed_1.bb {output.bed_file_1} && "
    "wget -O temp_bigbed_2.bb {params.link_bed_file_2} && "
    "bigBedToBed temp_bigbed_2.bb {output.bed_file_2} && "
    "rm temp_bigbed_*"
145
146
147
148
149
shell:
    "wget -O {output.bed_file_1}.gz {params.link_bed_file_1} && "
    "gunzip {output.bed_file_1}.gz && "
    "wget -O {output.bed_file_2}.gz {params.link_bed_file_2} && "
    "gunzip {output.bed_file_2}.gz"
157
158
159
160
shell:
    "wget -O DKC1_temp.gz {params.link_bed_file} && "
    "gunzip DKC1_temp.gz && sort -k1,1 -k2,2n DKC1_temp > {output.bed_file} && "
    "rm DKC1_temp*"
168
169
170
shell:
    "wget -O {output.liftover_chain_file}.gz {params.link_chain_file} && "
    "gunzip {output.liftover_chain_file}"
193
194
195
196
197
198
199
200
201
202
203
204
205
206
shell:
    "paths=$(echo {params}) && "
    "arr=(${{paths// / }}) && "
    "outputs=$(echo {output}) && "
    "arr_outputs=(${{outputs// / }}) && "
    "for index in ${{!arr[@]}}; do "
    "echo $index; "
    "echo ${{arr[$index]}}; "
    "wget -O temp_par${{index}}.gz ${{arr[$index]}}; "
    "gunzip temp_par${{index}}.gz; "
    "liftOver temp_par${{index}} {input.chain_file} temp_liftover${{index}} unmapped_${{index}}; "
    "sort -k1,1 -k2,2n temp_liftover${{index}} > ${{arr_outputs[$index]}}; "
    "done; "
    "rm temp_par* && rm temp_liftover* && rm unmapped_*"
214
215
shell:
    "wget {params.link} -O {output.sashimi_script} && chmod u+x {output.sashimi_script}"
226
227
228
shell:
    "wget {params.link}.1.fastq.gz.1 -O {output.mouse_fastq_1_gz} && "
    "wget {params.link}.2.fastq.gz.1 -O {output.mouse_fastq_2_gz}"
239
240
241
shell:
    "wget {params.link}.1.fastq.gz.1 -O {output.mouse_fastq_1_gz} && "
    "wget {params.link}.2.fastq.gz.1 -O {output.mouse_fastq_2_gz}"
250
251
252
253
254
shell:
    "wget -O temp.gz {params.link} && "
    "gunzip temp.gz && "
    "sed '/>JH584299.1/,$d; s/>/>chr/' temp > {output.genome} && "
    "rm temp"
263
264
265
266
shell:
    "wget -O temp2.gz {params.link} && "
    "gunzip temp2.gz && "
    "mv temp2 {output.gtf}"
275
276
277
278
279
shell:
    'cd scripts && pwd && '
    'git clone https://github.com/Population-Transcriptomics/pairedBamToBed12 && '
    'cd pairedBamToBed12 && '
    'make '
291
292
293
shell:
    'mkdir -p {output.git_coco_folder} '
    '&& git clone {params.git_coco_link} {output.git_coco_folder}'
301
302
303
304
shell:
    "wget -O tempfile {params.link} && "
    "grep --color=never ENSMUSG tempfile | grep --color=never snoRNA > {output.conversion_table} && "
    "rm tempfile"
311
312
313
314
315
316
shell:
    """pip3 install 'taxoniq==0.6.0' && """
    """taxoniq --scientific-name '{organism_name}' url > temp_id && """
    """IN=$(cat temp_id) && arr=(${{IN//id=/ }}) && """
    """echo ${{arr[1]}} | sed s'/\"//' > {output.taxid} && """
    """rm temp_id"""
323
324
325
326
327
328
shell:
    """pip3 install 'taxoniq==0.6.0' && """
    """taxoniq --scientific-name 'Saccharomyces cerevisiae' url > temp_id_sacch && """
    """IN=$(cat temp_id_sacch) && arr=(${{IN//id=/ }}) && """
    """echo ${{arr[1]}} | sed s'/\"//' > {output.taxid} && """
    """rm temp_id_sacch"""
336
337
script:
    "../scripts/r/download_mouse_HG_RNA_seq_datasets.R"
346
347
348
349
350
shell:
    "wget --quiet -O temp_yeast_genome.fa.gz {params.link} && "
    "gunzip temp_yeast_genome.fa.gz && "
    "sed 's/>/>chr/' temp_yeast_genome.fa > {output.genome} && "
    "rm temp_yeast_genome.fa"
358
359
360
shell:
    "wget --quiet -O temp_yeast.gtf.gz {params.link_std_annotation} && "
    "gunzip temp_yeast.gtf.gz && mv temp_yeast.gtf {output.std_gtf}"
372
373
shell:
    "gtex_var=$(echo {params.ids}) ./scripts/bash/gtex_download.sh {params.link} {output.gtex_data}"
386
387
shell:
    "gtex_var=$(echo {params.ids}) ./scripts/bash/gtex_download.sh {params.link} {output.gtex_data}"
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
shell:
    """
    taxid_var=$(cat {input.taxid}) &&
    psql postgres://reader:NWDMCE5xdipIjRrp@hh-pgsql-public.ebi.ac.uk:5432/pfmegrnargs \
    -c "\copy
     (
        SELECT DISTINCT ON (upi, region_start, region_stop, database)
            r.upi,
            p.short_description,
            p.taxid, a.database,
            a.external_id,
            a.optional_id,
            a.gene,
            a.gene_synonym,
            a.description,
            r.len,
            r.seq_short,
            rfam.rfam_model_id
        FROM rna r
        LEFT JOIN
            rnc_rna_precomputed p ON p.upi = r.upi
        LEFT JOIN rnc_sequence_regions s
            ON s.urs_taxid = p.id
        LEFT JOIN xref x
            ON x.upi = r.upi
        LEFT JOIN rnc_accessions a
            ON a.accession = x.ac
        LEFT JOIN rfam_model_hits rfam
            ON rfam.upi = r.upi
        WHERE p.taxid = $taxid_var
            AND a.ncrna_class IN ('snoRNA', 'scaRNA')
     )  TO '{output.RNA_central_snoRNAs}' WITH (FORMAT CSV, DELIMITER E'\t', HEADER)"
    """
10
11
12
13
shell:
    "wget -O temp.gz {params.link} && "
    "gunzip temp.gz && "
    "mv temp {output.genome}"
27
28
29
30
shell:
    "wget -O temp2.gz {params.link} && "
    "gunzip temp2.gz && "
    "mv temp2 {output.gtf}"
43
44
45
46
47
shell:
    "read id < <(grep -oE 'gene_id \"([[:upper:]]*[[:lower:]]*|[[:lower:]]*[[:upper:]])' {input.gtf} | head -n1 | sed 's/gene_id \"//') && "
    "wget -O tempfile {params.link}/{params.file} && "
    "grep --color=never $id tempfile | grep --color=never snoRNA > {output.conversion_table} && "
    "rm tempfile"
57
58
59
60
61
62
shell:
    """pip3 install 'taxoniq==0.6.0' && """
    """taxoniq --scientific-name '{params.species_complete_name}' url > temp_id && """
    """IN=$(cat temp_id) && arr=(${{IN//id=/ }}) && """
    """echo ${{arr[1]}} | sed s'/\"//' > {output.taxid} && """
    """rm temp_id"""
74
75
76
77
78
79
80
shell:
    "wget {params.link}/{params.species_complete_name}/{params.species_complete_name}_RNA-Seq_read_counts_TPM_FPKM.tar.gz && "
    "tar -xf {params.species_complete_name}_RNA-Seq_read_counts_TPM_FPKM.tar.gz && "
    "mkdir -p data/references && "
    "mv {params.species_complete_name}_RNA-Seq_read_counts_TPM_FPKM {output.expression_dir} && "
    "for file in {output.expression_dir}/*; do gunzip $file; done && "
    "rm {params.species_complete_name}_RNA-Seq_read_counts_TPM_FPKM.tar.gz"
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
shell:
    """
    taxid_var=$(cat {input.taxid}) &&
    psql postgres://reader:NWDMCE5xdipIjRrp@hh-pgsql-public.ebi.ac.uk:5432/pfmegrnargs \
    -c "\copy
     (
        SELECT DISTINCT ON (upi, region_start, region_stop, database)
            r.upi,
            p.short_description,
            p.taxid, a.database,
            a.external_id,
            a.optional_id,
            a.gene,
            a.gene_synonym,
            a.description,
            r.len,
            r.seq_short,
            rfam.rfam_model_id
        FROM rna r
        LEFT JOIN
            rnc_rna_precomputed p ON p.upi = r.upi
        LEFT JOIN rnc_sequence_regions s
            ON s.urs_taxid = p.id
        LEFT JOIN xref x
            ON x.upi = r.upi
        LEFT JOIN rnc_accessions a
            ON a.accession = x.ac
        LEFT JOIN rfam_model_hits rfam
            ON rfam.upi = r.upi
        WHERE p.taxid = $taxid_var
            AND a.ncrna_class IN ('snoRNA', 'scaRNA')
     )  TO '{output.RNA_central_snoRNAs}' WITH (FORMAT CSV, DELIMITER E'\t', HEADER)"
    """
18
19
script:
    "../scripts/python/scale_features.py"
31
32
script:
    "../scripts/python/one_hot_encode.py"
45
46
script:
    "../scripts/python/one_hot_encode.py"
67
68
script:
    "../scripts/python/scale_features_after_split.py"
11
12
13
14
15
16
17
shell:
    """
    conda install -c conda-forge --force-reinstall seaborn
    conda install -c conda-forge --force-reinstall matplotlib
    pip install pca
    &> {output}
    """
30
31
script:
    "../scripts/python/graphs/pie.py"
46
47
script:
    "../scripts/python/graphs/donut_labels_sno_type.py"
62
63
script:
    "../scripts/python/graphs/donut_labels_host_biotype.py"
77
78
script:
    "../scripts/python/graphs/density_features_simple.py"
93
94
script:
    "../scripts/python/graphs/density_features.py"
111
112
script:
    "../scripts/python/graphs/density_features_split.py"
129
130
script:
    "../scripts/python/graphs/density_normalized_mfe_split.py"
166
167
script:
    "../scripts/python/graphs/density_intron_groups_sno_type_features.py"
182
183
script:
    "../scripts/python/graphs/donut_labels_intron_subgroup.py"
203
204
script:
    "../scripts/python/graphs/pairplot_numerical.py"
223
224
script:
    "../scripts/python/graphs/bar_categorical.py"
242
243
script:
    "../scripts/python/graphs/bar_intronic_features.py"
259
260
script:
    "../scripts/python/graphs/venn_host_abundance_cutoff_tgirt_gtex.py"
276
277
script:
    "../scripts/python/graphs/venn_host_abundance_cutoff_tgirt_gtex.py"
25
26
27
28
29
shell:
    "mkdir -p log/ &&"
    "sed -i -E 's/np.sum\(np.abs/np.median\(np.abs/g' {params.path} && "
    "sed -i -E 's/global_shap_values = np.abs\(shap_values\).mean\(0\)/global_shap_values = np.median\(np.abs\(shap_values\), axis=0\)/g' {params.path}"
    "&> {output.fake_log}"
38
39
40
41
42
43
44
45
46
47
48
shell:
    """
    conda install -c conda-forge --force-reinstall seaborn
    conda install -c conda-forge --force-reinstall matplotlib
    conda install -c conda-forge --force-reinstall scikit-learn
    conda install -c conda-forge --force-reinstall shap
    conda install -c conda-forge --force-reinstall upsetplot
    conda install -c bioconda --force-reinstall viennarna
    conda install -c conda-forge --force-reinstall scipy
    &> {output}
    """
68
69
script:
    "../scripts/python/graphs/scatter_accuracies.py"
87
88
script:
    "../scripts/python/graphs/scatter_accuracies_10_iterations.py"
107
108
script:
    "../scripts/python/graphs/scatter_accuracies_10_iterations.py"
135
136
script:
    "../scripts/python/graphs/scatter_accuracies_20_iterations.py"
154
155
script:
    "../scripts/python/graphs/scatter_accuracies_10_iterations.py"
173
174
script:
    "../scripts/python/graphs/scatter_accuracies_10_iterations.py"
192
193
script:
    "../scripts/python/graphs/scatter_accuracies_10_iterations.py"
211
212
script:
    "../scripts/python/graphs/scatter_accuracies_10_iterations.py"
232
233
script:
    "../scripts/python/graphs/roc_curve.py"
254
255
script:
    "../scripts/python/graphs/roc_curve_wo_clusters.py"
276
277
script:
    "../scripts/python/graphs/roc_curve_wo_feature.py"
298
299
script:
    "../scripts/python/graphs/roc_curve_one_feature.py"
321
322
script:
    "../scripts/python/graphs/roc_curve_scale_after_split.py"
340
341
script:
    "../scripts/python/graphs/roc_curve_scale_after_split_10_iterations.py"
359
360
script:
    "../scripts/python/graphs/roc_curve_scale_after_split_10_iterations.py"
378
379
script:
    "../scripts/python/graphs/roc_curve_scale_after_split_10_iterations.py"
397
398
script:
    "../scripts/python/graphs/roc_curve_scale_after_split_10_iterations.py"
417
418
script:
    "../scripts/python/graphs/roc_curve_scale_after_split_10_iterations.py"
436
437
script:
    "../scripts/python/graphs/roc_curve_scale_after_split_10_iterations.py"
452
453
script:
    "../scripts/python/graphs/summary_shap.py"
467
468
script:
    "../scripts/python/graphs/summary_shap_snotype.py"
484
485
script:
    "../scripts/python/graphs/summary_shap_snotype_scale_after_split.py"
499
500
script:
    "../scripts/python/graphs/summary_shap_hg_biotype.py"
516
517
script:
    "../scripts/python/graphs/summary_shap_hg_biotype_scale_after_split.py"
529
530
script:
    "../scripts/python/graphs/decision_plot.py"
542
543
script:
    "../scripts/python/graphs/global_shap_bar_plot.py"
556
557
script:
    "../scripts/python/graphs/global_shap_bar_plot_scale_after_split.py"
576
577
script:
    "../scripts/python/graphs/upset_per_confusion_value.py"
596
597
script:
    "../scripts/python/graphs/upset_per_confusion_value.py"
619
620
script:
    "../scripts/python/graphs/upset_per_confusion_value_10_iterations.py"
643
644
script:
    "../scripts/python/graphs/upset_per_confusion_value_10_iterations.py"
659
660
script:
    "../scripts/python/graphs/upset_top_features_df.py"
675
676
script:
    "../scripts/python/all_feature_rank_df.py"
694
695
script:
    "../scripts/python/all_feature_rank_df_10_iterations.py"
707
708
script:
    "../scripts/python/concat_feature_rank_iterations_df.py"
721
722
script:
    "../scripts/python/graphs/violin_feature_rank.py"
736
737
script:
    "../scripts/python/graphs/violin_feature_rank_10_iterations.py"
750
751
script:
    "../scripts/python/graphs/heatmap_feature_rank_correlation.py"
764
765
script:
    "../scripts/python/graphs/clustermap_feature_rank_correlation.py"
778
779
script:
    "../scripts/python/graphs/pairplot_top_5.py"
792
793
script:
    "../scripts/r/upset_top_features.R"
808
809
script:
    "../scripts/python/graphs/donut_top_features_rank_percent.py"
826
827
script:
    "../scripts/python/graphs/decision_plot_FN.py"
844
845
script:
    "../scripts/python/graphs/decision_plot_FP.py"
862
863
script:
    "../scripts/python/graphs/decision_plot_TN.py"
880
881
script:
    "../scripts/python/graphs/decision_plot_TP.py"
899
900
script:
    "../scripts/python/graphs/decision_plot_FN_scale_after_split.py"
918
919
script:
    "../scripts/python/graphs/decision_plot_FP_scale_after_split.py"
937
938
script:
    "../scripts/python/graphs/decision_plot_TN_scale_after_split.py"
956
957
script:
    "../scripts/python/graphs/decision_plot_TP_scale_after_split.py"
973
974
975
shell:
    "{params.bash_script} {input.sno_sequences} "
    "{input.flanking} {output.snora77b_terminal_stem} "
990
991
script:
    "../scripts/python/graphs/forgi_snora77b.py"
1004
1005
script:
    "../scripts/python/graphs/bivariate_density_haca.py"
1017
1018
script:
    "../scripts/python/graphs/comparison_tpm_sno_type.py"
1034
1035
script:
    "../scripts/python/graphs/stem_comparison_tpm.py"
1048
1049
script:
    "../scripts/python/graphs/PCA.py"
1062
1063
script:
    "../scripts/python/graphs/tSNE.py"
1075
1076
script:
    "../scripts/python/graphs/feature_importance.py"
1089
1090
script:
    "../scripts/python/graphs/feature_importance.py"
1103
1104
script:
    "../scripts/python/graphs/feature_importance.py"
1123
1124
script:
    "../scripts/python/graphs/heatmap_shap.py"
1136
1137
script:
    "../scripts/python/sno_presence_in_all_test_sets.py"
1150
1151
script:
    "../scripts/python/regroup_sno_confusion_value_iterations.py"
1163
1164
script:
    "../scripts/python/regroup_sno_confusion_value_manual_split.py"
1180
1181
script:
    "../scripts/python/graphs/num_feature_distribution_comparison_confusion_value_top10.py"
1197
1198
script:
    "../scripts/python/graphs/num_feature_distribution_comparison_confusion_value_top10_snotype.py"
1214
1215
script:
    "../scripts/python/graphs/cat_feature_distribution_comparison_confusion_value_top10.py"
1231
1232
script:
    "../scripts/python/graphs/cat_feature_distribution_comparison_confusion_value_top10_snotype.py"
1247
1248
script:
    "../scripts/python/get_all_shap_values.py"
1263
1264
script:
    "../scripts/python/get_all_shap_values_manual_split.py"
1276
1277
script:
    "../scripts/python/all_feature_rank_df_manual_split.py"
1290
1291
script:
    "../scripts/python/concat_feature_rank_iterations_df.py"
1305
1306
script:
    "../scripts/python/graphs/violin_feature_rank_manual_split.py"
1321
1322
script:
    "../scripts/python/get_all_shap_values_manual_split.py"
1334
1335
script:
    "../scripts/python/all_feature_rank_df_manual_split.py"
1348
1349
script:
    "../scripts/python/concat_feature_rank_iterations_df.py"
1363
1364
script:
    "../scripts/python/graphs/violin_feature_rank_manual_split.py"
1382
1383
script:
    "../scripts/python/graphs/summary_shap_snotype_scale_after_manual_split.py"
1401
1402
script:
    "../scripts/python/graphs/bar_summary_shap_snotype_scale_after_manual_split.py"
1418
1419
script:
    "../scripts/python/graphs/decision_plot_scale_after_split_10_iterations_per_confusion_value.py"
1431
1432
script:
    "../scripts/python/concat_all_shap_values.py"
1448
1449
script:
    "../scripts/python/graphs/density_FP_vs_TN_host_expressed.py"
1461
1462
script:
    "../scripts/python/real_confusion_value_df.py"
1475
1476
script:
    "../scripts/python/graphs/violin_tpm_confusion_value.py"
1488
1489
script:
    "../scripts/python/graphs/pie_confusion_values.py"
1505
1506
script:
    "../scripts/python/graphs/donut_confusion_values_host_biotype.py"
19
20
script:
    "../scripts/python/add_HG_tpm_gtex.py"
33
34
script:
    "../scripts/python/abundance_cutoff.py"
50
51
script:
    "../scripts/python/add_HG_tpm_gtex.py"
64
65
script:
    "../scripts/python/abundance_cutoff.py"
21
22
script:
    "../scripts/python/merge_features.py"
40
41
script:
    "../scripts/python/merge_features.py"
59
60
script:
    "../scripts/python/merge_features.py"
22
23
script:
    "../scripts/python/graphs/density_features_mouse.py"
39
40
script:
    "../scripts/python/graphs/bar_categorical_mouse.py"
54
55
script:
    "../scripts/python/graphs/violin_models_accuracies_iterations_mouse.py"
71
72
script:
    "../scripts/python/get_consensus_confusion_value_mouse.py"
87
88
script:
    "../scripts/python/get_consensus_confusion_value_per_model_mouse.py"
100
101
script:
    "../scripts/python/graphs/pie_confusion_values_mouse.py"
114
115
script:
    "../scripts/python/graphs/pie_confusion_values_mouse.py"
128
129
script:
    "../scripts/python/graphs/pie_confusion_values_species_prediction_simple.py"
142
143
script:
    "../scripts/python/graphs/pie_confusion_values_species_prediction.py"
160
161
script:
    "../scripts/python/graphs/donut_confusion_values_host_biotype_species_prediction_log_reg_thresh.py"
178
179
script:
    "../scripts/python/graphs/donut_confusion_values_host_biotype_species_prediction_log_reg_thresh.py"
197
198
script:
    "../scripts/python/graphs/donut_confusion_values_host_biotype_species_prediction_log_reg.py"
216
217
script:
    "../scripts/python/graphs/donut_confusion_values_host_biotype_species_prediction_wo_dup.py"
235
236
script:
    "../scripts/python/graphs/donut_confusion_values_host_biotype_species_prediction_wo_dup.py"
251
252
script:
    "../scripts/python/graphs/donut_labels_sno_type_mouse.py"
268
269
script:
    "../scripts/python/graphs/donut_labels_host_biotype_mouse.py"
286
287
script:
    "../scripts/python/graphs/donut_labels_sno_type_mouse_wo_dup.py"
305
306
script:
    "../scripts/python/graphs/donut_labels_host_biotype_mouse_wo_dup.py"
323
324
script:
    "../scripts/python/graphs/donut_labels_sno_type_mouse_no_dup.py"
342
343
script:
    "../scripts/python/graphs/donut_labels_host_biotype_mouse_no_dup.py"
356
357
script:
    "../scripts/python/graphs/violin_tpm_confusion_value_mouse.py"
370
371
script:
    "../scripts/python/graphs/violin_tpm_confusion_value_mouse.py"
384
385
script:
    "../scripts/python/graphs/violin_tpm_confusion_value_log_reg_thresh_mouse.py"
408
409
script:
    "../scripts/python/graphs/scatter_accuracies.py"
432
433
script:
    "../scripts/python/graphs/scatter_accuracies_species_prediction_top4_random_state.py"
456
457
script:
    "../scripts/python/graphs/scatter_accuracies_species_prediction_top4_random_state.py"
480
481
script:
    "../scripts/python/graphs/scatter_accuracies_species_prediction_top4_random_state.py"
506
507
script:
    "../scripts/python/graphs/scatter_accuracies_species_prediction_top4_random_state_w_log_reg_thresh.py"
532
533
script:
    "../scripts/python/graphs/scatter_accuracies_species_prediction_top4_w_log_reg_thresh.py"
556
557
script:
    "../scripts/python/graphs/scatter_accuracies_species_prediction_top4_random_state.py"
582
583
script:
    "../scripts/python/graphs/scatter_accuracies_species_prediction_top3_random_state_w_log_reg_thresh.py"
16
17
18
19
20
shell:
    """awk -v OFS='\t' 'NR>6 {{print $1, $4, $5, "to_remove"$10"to_remove", $6, $7, $2, $3, $8, "to_delete"$0}}' {input.gtf} | """
    """sed -E 's/to_remove"//g; s/";to_remove//g; s/to_delete.*gene_id/gene_id/g' | """
    """sort -n -k1,1 -k2,2 > {output.gtf_bed} && """
    """awk '$8=="gene" {{print $0}}' {output.gtf_bed}  | grep snoRNA | sed 's/\t$//g; s/^/chr/g' | sort -k1,1 -k2,2n > {output.all_sno_bed}"""
32
33
script:
    "../scripts/python/fasta_sno_sequence_species.py"
49
50
script:
    "../scripts/python/find_mouse_snoRNA_type.py"
63
64
script:
    "../scripts/python/find_mouse_snoRNA_labels_w_length.py"
75
76
script:
    "../scripts/python/format_gtf_bed_for_HG.py"
88
89
script:
    "../scripts/python/find_mouse_snoRNA_HG.py"
106
107
script:
    "../scripts/python/find_mouse_HG_expression_level.py"
121
122
123
124
125
shell:
    "sed -E '/^[^>]/ s/T/U/g' {input.sequences} | sed 's/>/>MOUSE_/g' > temp_structure && "
    "RNAfold --infile=temp_structure --outfile={params.temp_name} && sed -i 's/MOUSE_//g' {params.temp_name} && "
    "mv {params.temp_name} {output.mfe} && mkdir -p data/structure/stability_mouse/ && "
    "mv MOUSE*.ps data/structure/stability_mouse/ && rm temp_structure"
137
138
139
140
141
shell:
    """grep -E ">" {input.mfe} | sed 's/>//g' > {params.temp_id} && """
    """grep -oE "\-*[0-9]+\.[0-9]*" {input.mfe} > {params.temp_mfe} && """
    """paste {params.temp_id} {params.temp_mfe} > {output.mfe_final} && """
    """rm temporary_mouse*"""
153
154
155
shell:
    "mkdir -p log/ && samtools faidx {input.genome} && "
    "cut -f1,2 {input.genome}.fai > {output.genome_chr_size}"
174
175
script:
    "../scripts/python/flank_extend_snoRNA_mouse.py"
190
191
script:
    "../scripts/python/get_fasta_terminal_stem.py"
202
203
204
205
shell:
    "sed 's/>/>Mouse_/g' {input.fasta} > Mouse_cofold.fa && "
    "RNAcofold < Mouse_cofold.fa > {output.mfe_stem} && sed -i 's/Mouse_//g' {output.mfe_stem} && "
    "mkdir -p data/terminal_stem_mouse/ && mv Mouse*.ps data/terminal_stem_mouse/ && rm Mouse_cofold.fa"
218
219
220
221
222
shell:
    """grep -E ">" {input.mfe} | sed 's/>//g' > {params.temp_id} && """
    """grep -oE "\-*[0-9]+\.[0-9]*" {input.mfe} > {params.temp_mfe} && """
    """paste {params.temp_id} {params.temp_mfe} > {output.mfe_stem_final} && """
    """rm temporary_terminal_stem*"""
235
236
script:
    "../scripts/python/fasta_per_sno_type_mouse.py"
249
250
script:
    "../scripts/python/cd_box_location_all.py"
264
265
script:
    "../scripts/python/haca_box_location_all.py"
277
278
script:
    "../scripts/python/hamming_distance_box_all.py"
292
293
script:
    "../scripts/python/merge_features_mouse.py"
313
314
script:
    "../scripts/python/predict_mouse_snoRNA_label.py"
327
328
script:
    "../scripts/python/test_models_scale_after_split.py"
345
346
script:
    "../scripts/python/confusion_matrix_f1_scale_after_split.py"
16
17
script:
    "../scripts/python/graphs/bar_abundance_status_per_biotype.py"
18
19
script:
    "../scripts/python/graphs/heatmap_shap_all_models_iterations.py"
30
31
script:
    "../scripts/python/graphs/heatmap_shap_per_model_all_iterations.py"
43
44
script:
    "../scripts/python/graphs/heatmap_shap_per_confusion_value_all_models_iterations.py"
56
57
script:
    "../scripts/python/graphs/heatmap_shap_per_confusion_value_per_model_all_iterations.py"
73
74
script:
    "../scripts/python/graphs/bar_shap_top_3_features_per_confusion_value.py"
16
17
script:
    "../scripts/python/graphs/density_features_species.py"
31
32
script:
    "../scripts/python/graphs/summary_table_sno_type_host_biotype_species.py"
48
49
script:
    "../scripts/python/graphs/bar_categorical_species.py"
66
67
script:
    "../scripts/python/graphs/donut_labels_sno_type_species.py"
84
85
script:
    "../scripts/python/graphs/donut_labels_host_biotype_species.py"
103
104
script:
    "../scripts/python/graphs/bar_ab_status_prediction_species.py"
120
121
script:
    "../scripts/python/graphs/scatter_ab_status_prediction_species.py"
13
14
15
16
17
shell:
    """awk -v OFS='\t' 'NR>6 {{print $1, $4, $5, "to_remove"$10"to_remove", $6, $7, $2, $3, $8, "to_delete"$0}}' {input.gtf} | """
    """sed -E 's/to_remove"//g; s/";to_remove//g; s/to_delete.*gene_id/gene_id/g' | """
    """sort -n -k1,1 -k2,2 > {output.gtf_bed} && """
    """awk '$8=="gene" {{print $0}}' {output.gtf_bed}  | grep snoRNA | sed 's/\t$//g; s/^/chr/g' | sort -k1,1 -k2,2n > {output.all_sno_bed}"""
29
30
31
shell:
    "mkdir -p log/ && samtools faidx {input.genome} && "
    "cut -f1,2 {input.genome}.fai > {output.genome_chr_size}"
45
46
47
48
shell:
    """for i in $(cut -f1 {input.chr_size}); do if [[ $i != *"_ALT_"* ]]; """
    """then echo $i && samtools faidx {input.genome} $i >> {params.temp_filter}; fi; done && """
    """sed 's/>/>chr/' {params.temp_filter} > {output.filtered_genome} && rm {params.temp_filter}"""
60
61
62
shell:
    "mkdir -p log/ && samtools faidx {input.genome} && "
    "cut -f1,2 {input.genome}.fai > {output.genome_chr_size}"
74
75
script:
    "../scripts/python/fasta_sno_sequence_species.py"
88
89
script:
    "../scripts/python/find_species_snoRNA_type.py"
100
101
script:
    "../scripts/python/format_gtf_bed_for_HG.py"
113
114
script:
    "../scripts/python/find_species_snoRNA_HG.py"
128
129
script:
    "../scripts/python/find_species_HG_expression_level.py"
143
144
145
146
147
148
shell:
    "sed -E '/^[^>]/ s/T/U/g; s/>/>{wildcards.species}_fold/g' {input.sequences} > {params.mfe_dir}/temp_structure_species && "
    "RNAfold --infile={params.mfe_dir}/temp_structure_species --outfile={params.temp_name} && "
    "mkdir -p {params.mfe_dir} && mv {params.temp_name} {output.mfe} && "
    "mv {wildcards.species}_fold*.ps {params.mfe_dir} && rm {params.mfe_dir}/temp_structure_species && "
    "sed -i 's/{wildcards.species}_fold//g' {output.mfe}"
160
161
162
163
164
shell:
    """grep -E ">" {input.mfe} | sed 's/>//g' > {params.temp_id} && """
    """grep -oE "\-*[0-9]+\.[0-9]*" {input.mfe} > {params.temp_mfe} && """
    """paste {params.temp_id} {params.temp_mfe} > {output.mfe_final} && """
    """rm {params.temp_mfe} {params.temp_id}"""
183
184
script:
    "../scripts/python/flank_extend_snoRNA_species.py"
199
200
script:
    "../scripts/python/get_fasta_terminal_stem.py"
213
214
215
216
217
shell:
    "sed 's/>/>{wildcards.species}_cofold/g' {input.fasta} > {wildcards.species}_cofold.fa && "
    "RNAcofold < {wildcards.species}_cofold.fa > {output.mfe_stem} && "
    "mv {wildcards.species}_cofold*.ps {params.mfe_dir} && rm {wildcards.species}_cofold.fa && "
    "sed -i 's/{wildcards.species}_cofold//g' {output.mfe_stem}"
230
231
232
233
234
shell:
    """grep -E ">" {input.mfe} | sed 's/>//g' > {params.temp_id} && """
    """grep -oE "\-*[0-9]+\.[0-9]*" {input.mfe} > {params.temp_mfe} && """
    """paste {params.temp_id} {params.temp_mfe} > {output.mfe_stem_final} && """
    """rm {params.temp_mfe} {params.temp_id}"""
247
248
script:
    "../scripts/python/fasta_per_sno_type_mouse.py"
261
262
script:
    "../scripts/python/cd_box_location_all.py"
276
277
script:
    "../scripts/python/haca_box_location_all.py"
289
290
script:
    "../scripts/python/hamming_distance_box_all.py"
304
305
script:
    "../scripts/python/merge_features_species.py"
326
327
script:
    "../scripts/python/predict_yeast_snoRNA_label.py"
13
14
script:
    "../scripts/python/clean_sno_sequences.py"
27
28
29
30
31
32
33
shell:
    "sed 's/>/>HUMAN_/g' {input.sequences} > HUMAN_rna_fold.tsv && "
    "RNAfold --infile=HUMAN_rna_fold.tsv --outfile={params.temp_name} && "
    "sed -i 's/HUMAN_//g' {params.temp_name} && "
    "mv {params.temp_name} {output.mfe} && "
    "mv HUMAN_*.ps data/structure/stability/ && "
    "rm -f data_structure_stability_mfe.tsv HUMAN_rna_fold.tsv"
45
46
47
48
49
shell:
    """grep -E ">" {input.mfe} | sed 's/>//g' > {params.temp_id} && """
    """grep -oE "\-*[0-9]+\.[0-9]*" {input.mfe} > {params.temp_mfe} && """
    """paste {params.temp_id} {params.temp_mfe} > {output.mfe_final} && """
    """rm temporary_*"""
61
62
shell:
    "RNAalifold -f S {input.stk} > {output.consensus_sequence}"
12
13
script:
    "../scripts/python/graphs/density_upstream_conservation.py"
28
29
script:
    "../scripts/python/graphs/stacked_bar_labels_snotype_HG_biotype.py"
58
59
script:
    "../scripts/python/graphs/dist_to_bp_thresh_other_features.py"
76
77
script:
    "../scripts/python/graphs/dist_to_bp_thresh_HG_AQR_overlap.py"
105
106
script:
    "../scripts/python/graphs/core_protein_binding_ab_status_bar.py"
10
11
shell:
    """bigWigToBedGraph {input.phastcons_bigwig} {output.phastcons_bedgraph} """
28
29
script:
    "../scripts/python/sno_conservation.py"
44
45
script:
    "../scripts/python/AQR_binding.py"
59
60
script:
    "../scripts/python/DKC1_binding_eCLIP.py"
74
75
script:
    "../scripts/python/core_RBP_binding_cd_par_clip.py"
22
23
script:
    "../scripts/python/flank_extend_snoRNA.py"
33
34
shell:
    "sed 's/>/>chr/g' {input.genome} > {output.genome_chr}"
50
51
script:
    "../scripts/python/get_fasta_terminal_stem.py"
63
64
65
66
shell:
    "sed 's/>/>Human_cofold/g' {input.fasta} > Human_cofold.fa && "
    "RNAcofold < Human_cofold.fa > {output.mfe_stem} && sed -i 's/Human_cofold//g' {output.mfe_stem} && "
    "mv Human_cofold*.ps data/terminal_stem/ && rm Human_cofold.fa"
79
80
81
82
83
shell:
    """grep -E ">" {input.mfe} | sed 's/>//g' > {params.temp_id} && """
    """grep -oE "\-*[0-9]+\.[0-9]*" {input.mfe} > {params.temp_mfe} && """
    """paste {params.temp_id} {params.temp_mfe} > {output.mfe_stem_final} && """
    """rm temporary_*"""
94
95
script:
    "../scripts/python/terminal_stem_length.py"
20
21
22
shell:
    "export PATH=$PWD/{params}:$PATH && "
    "python3 git_repos/coco/bin/coco.py ca {input.gtf} -o {output.gtf_corrected}"
41
42
43
44
45
46
47
48
shell:
    "fastqc "
    "-f fastq "
    "-t {threads} "
    "-o {params.out_dir} "
    "{input.fastq1} "
    "{input.fastq2} "
    "&> {log}"
72
73
74
75
76
77
78
79
80
shell:
    "trimmomatic PE "
    "-threads {threads} "
    "-phred33 "
    "{input.fastq1} {input.fastq2} "
    "{output.fastq1} {output.unpaired_fastq1} "
    "{output.fastq2} {output.unpaired_fastq2} "
    "{params.options} "
    "&> {log}"
 99
100
101
102
103
104
105
106
shell:
    "fastqc "
    "-f fastq "
    "-t {threads} "
    "-o {params.out_dir} "
    "{input.fastq1} "
    "{input.fastq2} "
    "&> {log}"
115
116
shell:
    "sed 's/^>chr/>/g' {input.fasta} > {output.mod_fasta}"
133
134
135
136
137
138
139
140
shell:
    "STAR --runMode genomeGenerate "
    "--runThreadN {threads} "
    "--genomeDir {params.index_dir} "
    "--genomeFastaFiles {input.fasta} "
    "--sjdbGTFfile {input.standard_gtf} "
    "--sjdbOverhang 74 "
    "&> {log}"
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
shell:
    "STAR --runMode alignReads "
    "--genomeDir {params.index_dir} "
    "--readFilesIn {input.fastq1} {input.fastq2} "
    "--runThreadN {threads} "
    "--readFilesCommand zcat "
    "--outReadsUnmapped Fastx "
    "--outFilterType BySJout "
    "--outStd Log "
    "--outSAMunmapped None "
    "--outSAMtype BAM SortedByCoordinate "
    "--outFileNamePrefix {params.outdir} "
    "--outFilterScoreMinOverLread 0.3 "
    "--outFilterMatchNminOverLread 0.3 "
    "--outFilterMultimapNmax 100 "
    "--winAnchorMultimapNmax 100 "
    "--alignEndsProtrude 5 ConcordantPair"
    "&> {log}"
196
197
198
199
200
201
202
203
204
205
shell:
    "python {params.coco_path}/coco.py cc "
    "--countType both "
    "--thread {threads} "
    "--strand 1 "
    "--paired "
    "{input.gtf} "
    "{input.bam} "
    "{output.counts} "
    "&> {log}"
221
222
script:
    "../scripts/python/merge_coco_cc_output_mouse.py"
15
16
17
18
19
shell:
    """awk -v OFS='\t' 'NR>6 {{print $1, $4, $5, "to_remove"$10"to_remove", $6, $7, $2, $3, $8, "to_delete"$0}}' {input.gtf} | """
    """sed -E 's/to_remove"//g; s/";to_remove//g; s/to_delete.*gene_id/gene_id/g' | """
    """sort -n -k1,1 -k2,2 > {output.gtf_bed} && """
    """awk '$8=="gene" {{print $0}}' {output.gtf_bed}  | grep snoRNA | sed 's/\t$//g; s/^/chr/g' | sort -k1,1 -k2,2n > {output.all_sno_bed}"""
31
32
script:
    "../scripts/python/fasta_sno_sequence_species.py"
46
47
script:
    "../scripts/python/find_yeast_snoRNA_type.py"
60
61
script:
    "../scripts/python/find_mouse_snoRNA_labels_w_length.py"
72
73
script:
    "../scripts/python/format_gtf_bed_for_HG.py"
85
86
script:
    "../scripts/python/find_species_snoRNA_HG.py"
100
101
script:
    "../scripts/python/find_yeast_HG_expression_level.py"
114
115
116
117
118
119
shell:
    "mkdir -p data/structure/stability_yeast/ && "
    "sed -E '/^[^>]/ s/T/U/g' {input.sequences} > temp_structure_yeast && "
    "RNAfold --infile=temp_structure_yeast --outfile={params.temp_name} && "
    "mv {params.temp_name} {output.mfe} && "
    "mv *.ps data/structure/stability_yeast/ && rm temp_structure_yeast"
131
132
133
134
135
shell:
    """grep -E ">" {input.mfe} | sed 's/>//g' > {params.temp_id} && """
    """grep -oE "\-*[0-9]+\.[0-9]*" {input.mfe} > {params.temp_mfe} && """
    """paste {params.temp_id} {params.temp_mfe} > {output.mfe_final} && """
    """rm temporary_yeast*"""
147
148
149
shell:
    "mkdir -p log/ && samtools faidx {input.genome} && "
    "cut -f1,2 {input.genome}.fai > {output.genome_chr_size}"
168
169
script:
    "../scripts/python/flank_extend_snoRNA_mouse.py"
184
185
script:
    "../scripts/python/get_fasta_terminal_stem.py"
196
197
198
shell:
    "RNAcofold < {input.fasta} > {output.mfe_stem} && "
    "mv *.ps data/terminal_stem_yeast/"
211
212
213
214
215
shell:
    """grep -E ">" {input.mfe} | sed 's/>//g' > {params.temp_id} && """
    """grep -oE "\-*[0-9]+\.[0-9]*" {input.mfe} > {params.temp_mfe} && """
    """paste {params.temp_id} {params.temp_mfe} > {output.mfe_stem_final} && """
    """rm temporary_terminal_stem_mfe_yeast*"""
228
229
script:
    "../scripts/python/fasta_per_sno_type_mouse.py"
242
243
script:
    "../scripts/python/cd_box_location_all.py"
257
258
script:
    "../scripts/python/haca_box_location_all.py"
270
271
script:
    "../scripts/python/hamming_distance_box_all.py"
285
286
script:
    "../scripts/python/merge_features_mouse.py"
309
310
script:
    "../scripts/python/predict_species_snoRNA_label_final.py"
331
332
script:
    "../scripts/python/predict_yeast_snoRNA_label.py"
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
import pandas as pd
import collections as coll

""" Add an abundance cutoff column for all RNAs to define them as expressed or not
    expressed. """

df = pd.read_csv(snakemake.input.tpm_df)

# If an RNA is expressed at >1 TPM in at least one averaged tissue (average of the triplicates), it is considered expressed; otherwise it is not expressed
rna_abundance = df.iloc[:, 4:25]
cols = list(rna_abundance.columns)
triplicates = [cols[n:n+3] for n in range(0, len(cols), 3)]

df['abundance_cutoff_2'] = ''
for i in range(len(df)):
    for triplicate in triplicates:
        if df.loc[i, triplicate].mean() > 1:
            df.loc[i, 'abundance_cutoff_2'] = 'expressed'
            break
df['abundance_cutoff_2'] = df['abundance_cutoff_2'].replace('', 'not_expressed')

for i, bio in enumerate(list(pd.unique(df['gene_biotype']))):
    print(bio)
    d = df[df['gene_biotype'] == bio]
    print(coll.Counter(d['abundance_cutoff_2']))  


df.to_csv(snakemake.output.abundance_cutoff_df, index=False, sep='\t')
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
import pandas as pd
import collections as coll

""" Add an abundance cutoff column for snoRNAs to define them as expressed or not
    expressed. This column will serve as the label used by the predictor. Also
    define an abundance cutoff for host genes that will be used as a feature. """

df = pd.read_csv(snakemake.input.tpm_df)

# If snoRNA is expressed >1 TPM in at least one tissue sample, it is expressed, else not expressed
df.loc[(df.iloc[:, 4:25] > 1).any(axis=1), 'abundance_cutoff'] = 'expressed'
df.loc[(df.iloc[:, 4:25] <= 1).all(axis=1), 'abundance_cutoff'] = 'not_expressed'
print('Abundance cutoff based on >1 TPM in at least one sample:')
print(coll.Counter(df['abundance_cutoff']))  # 967 not_expressed; 574 expressed


# If snoRNA is expressed >1 TPM in at least one average tissue (average of the triplicates), it is expressed, else not expressed
sno_abundance = df.iloc[:, 4:25]
cols = list(sno_abundance.columns)
triplicates = [cols[n:n+3] for n in range(0, len(cols), 3)]

df['abundance_cutoff_2'] = ''
for i in range(len(df)):
    for triplicate in triplicates:
        if df.loc[i, triplicate].mean() > 1:
            df.loc[i, 'abundance_cutoff_2'] = 'expressed'
            break
df['abundance_cutoff_2'] = df['abundance_cutoff_2'].replace('', 'not_expressed')

print('Abundance cutoff based on >1 TPM in at least one average tissue:')
print(coll.Counter(df['abundance_cutoff_2']))  # 1056 not expressed; 485 expressed

# If host gene is expressed >1 TPM in at least one average tissue (average of the triplicates), it is expressed, else not expressed (or no host gene at all)
hg_abundance = df.iloc[:, 33:54]
cols_hg = list(hg_abundance.columns)
triplicates_hg = [cols_hg[n:n+3] for n in range(0, len(cols_hg), 3)]

df['abundance_cutoff_host'] = ''
for i in range(len(df)):
    for triplicate in triplicates_hg:
        if df.loc[i, triplicate].mean() > 1:
            df.loc[i, 'abundance_cutoff_host'] = 'host_expressed'
            break
df['abundance_cutoff_host'] = df['abundance_cutoff_host'].replace('', 'host_not_expressed')
df.loc[df['host_id'].isnull(), 'abundance_cutoff_host'] = 'intergenic'


df.to_csv(snakemake.output.abundance_cutoff_df, index=False, sep='\t')
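
For reference only (not part of the pipeline), the triplicate-averaging loops above can also be written in a vectorized form; this is a sketch assuming the same column layout (21 TPM columns in consecutive triplicates starting at column 4):

import numpy as np
import pandas as pd

# Vectorized equivalent of the triplicate loop above (sketch; assumes 7 tissues x 3 replicates)
sno_abundance = df.iloc[:, 4:25]                          # 21 TPM columns
groups = np.arange(sno_abundance.shape[1]) // 3           # 0,0,0,1,1,1,... one id per triplicate
tissue_means = sno_abundance.T.groupby(groups).mean().T   # one averaged column per tissue
df['abundance_cutoff_2'] = np.where((tissue_means > 1).any(axis=1), 'expressed', 'not_expressed')
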
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
import pandas as pd


"""Add a gene biotype and a simplified gene biotype column in an existing dataframe using a reference table"""

ref_table = pd.read_csv(snakemake.input.ref_table, sep='\t')
df = pd.read_csv(snakemake.input.tpm_df)

# Create a dictionary of gene_id/gene_biotype (key/value) from the ref table and use it to create the gene_biotype col
ref_dict = ref_table.set_index('gene_id').to_dict()['gene_biotype']
df.insert(loc=2, column='gene_biotype', value=df['gene_id'].map(ref_dict))  # insert new col at position 2

# Create a dictionary of simplified gene biotypes and use it to create the gene_biotype2 column
simplified_dict = {}
all_biotypes = list(set(ref_dict.values()))
pc_list = ['IG_C_gene', 'IG_D_gene', 'TR_V_gene', 'IG_V_gene', 'IG_J_gene', 'TR_J_gene', 'TR_C_gene', 'TR_D_gene', 'protein_coding', ]
pseudogene_list = ['translated_unprocessed_pseudogene', 'translated_processed_pseudogene', 'unitary_pseudogene', 'unprocessed_pseudogene',
                   'processed_pseudogene', 'transcribed_unprocessed_pseudogene', 'transcribed_unitary_pseudogene', 'transcribed_processed_pseudogene',
                   'IG_V_pseudogene', 'pseudogene', 'TR_V_pseudogene', 'TR_J_pseudogene', 'IG_C_pseudogene', 'IG_J_pseudogene', 'IG_pseudogene',
                   'polymorphic_pseudogene', 'rRNA_pseudogene']
other_list = ['rRNA', 'Mt_rRNA', 'ribozyme', 'scRNA', 'vault_RNA', 'sRNA', 'ETS-RNA', 'ITS-RNA', 'TEC']
tRNA_list = ['tRNA', 'Mt_tRNA', 'pre-tRNA', 'tRNA_fragment']
same_biotype = ['lncRNA', 'misc_RNA', 'intronic_cluster', 'intergenic_cluster', 'snoRNA', 'snRNA', 'miRNA', 'scaRNA']


for gene in pc_list:
    simplified_dict[gene] = 'protein_coding'
for gene in pseudogene_list:
    simplified_dict[gene] = 'pseudogene'
for gene in other_list:
    simplified_dict[gene] = 'other'
for gene in tRNA_list:
    simplified_dict[gene] = 'tRNA'
for gene in same_biotype:
    simplified_dict[gene] = gene

df.insert(loc=3, column='gene_biotype2', value=df['gene_biotype'].map(simplified_dict))  # insert new col at position 3

df.to_csv(snakemake.output.tpm_biotype, index=False)
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
import pandas as pd

"""Add host gene (HG) abundance values in each tissue to each snoRNA in a dataframe using a reference table"""

ref_table_HG = pd.read_csv(snakemake.input.ref_HG_table)
tpm_df = pd.read_csv(snakemake.input.tpm_biotype_df)
df = tpm_df[tpm_df['gene_biotype2'] == 'snoRNA']
gtex_df = pd.read_csv(snakemake.input.gtex_tpm_df, sep='\t')
gtex_id_dict = snakemake.params.gtex_id_dict

# Merge host id, name, biotype, start, end and also sno start and end to existing dataframe
df = df.merge(ref_table_HG[['sno_id', 'sno_start', 'sno_end', 'host_id', 'host_name', 'host_biotype', 'host_start', 'host_end']],
              how='left', left_on='gene_id', right_on='sno_id')

# Format gtex_df
gtex_df = gtex_df.rename(columns={'Description': 'gene_name', 'Name': 'gene_id_temp'})
gtex_df = gtex_df.rename(columns=gtex_id_dict)
gtex_df[['gene_id', 'gene_version']] = gtex_df.gene_id_temp.str.split('.', expand=True)
gtex_df = gtex_df.drop(columns=['gene_id_temp', 'gene_version'])
col_list = ['gene_id', 'gene_name'] + list(gtex_id_dict.values())
gtex_df = gtex_df[col_list]

# Build a nested dictionary from gtex_df: outer key = tissue column, inner key = host gene (HG) id, value = abundance in TPM
gtex_df = gtex_df[gtex_df['gene_id'].isin(ref_table_HG['host_id'])].set_index('gene_id').iloc[:, 1:23].add_suffix('_host')  # select tpm col and add '_host' to col titles
tpm_dict = gtex_df.to_dict()

# Add new columns of HG tpm based on the host_id col in df and the tpm_dict
for key, val in tpm_dict.items():
    df[key] = df['host_id'].map(val)

df.to_csv(snakemake.output.sno_HG_df, index=False)
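
The nested tpm_dict built above has tissue columns as outer keys and host gene ids as inner keys; a minimal sketch of its shape (tissue names and gene ids are hypothetical placeholders):

# Approximate shape of tpm_dict (keys shown are placeholders for illustration only):
# {
#     'liver_host': {'ENSG00000xxxxx1': 12.3, 'ENSG00000xxxxx2': 0.4, ...},
#     'brain_host': {'ENSG00000xxxxx1': 7.8,  'ENSG00000xxxxx2': 1.1, ...},
#     ...
# }
# so df['host_id'].map(val) adds one '<tissue>_host' TPM column per tissue.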
import pandas as pd

"""Add host gene (HG) abundance values in each tissue to each snoRNA in a dataframe using a reference table"""

ref_table_HG = pd.read_csv(snakemake.input.ref_HG_table)
tpm_df = pd.read_csv(snakemake.input.tpm_biotype_df)
df = tpm_df[tpm_df['gene_biotype2'] == 'snoRNA']

# Merge host id, name, biotype, start, end and also sno start and end to existing dataframe
df = df.merge(ref_table_HG[['sno_id', 'sno_start', 'sno_end', 'host_id', 'host_name', 'host_biotype', 'host_start', 'host_end']],
              how='left', left_on='gene_id', right_on='sno_id')

# Build a nested dictionary from tpm_df: outer key = tissue column, inner key = host gene (HG) id, value = abundance in TPM
tpm_df = tpm_df[tpm_df['gene_id'].isin(ref_table_HG['host_id'])].set_index('gene_id').iloc[:, 3:25].add_suffix('_host')  # select tpm col and add '_host' to col titles
tpm_dict = tpm_df.to_dict()

# Add new columns of HG tpm based on the host_id col in df and the tpm_dict
for key, val in tpm_dict.items():
    df[key] = df['host_id'].map(val)

df.to_csv(snakemake.output.sno_HG_df, index=False)
import pandas as pd
import shap
import numpy as np
import re
import pickle
""" Create a dataframe containing the rank of importance for each feature
    and per model per iteration."""

X_train = pd.read_csv(snakemake.input.X_train, sep='\t', index_col='gene_id_sno')
X_test = pd.read_csv(snakemake.input.X_test, sep='\t', index_col='gene_id_sno')
iteration = snakemake.wildcards.iteration

log_reg_model = snakemake.input.log_reg
svc_model, rf_model = snakemake.input.svc, snakemake.input.rf

# Instantiate log_reg model and get the SHAP ranking of its features (highest absolute value of SHAP is considered as rank 1 most predictive feature)
log_reg = pickle.load(open(log_reg_model[0], 'rb'))  # log_reg_model is a snakemake Namedlist of one item, so we need to select the first item through indexing
explainer_log_reg = shap.LinearExplainer(log_reg, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
shap_values_log_reg = explainer_log_reg.shap_values(X_test)
vals_log_reg = np.abs(shap_values_log_reg).mean(0)  # mean SHAP value across all examples in X_test for each feature
feature_importance_log_reg = pd.DataFrame(list(zip(X_train.columns, vals_log_reg)), columns=['feature', 'feature_importance'])
feature_importance_log_reg.sort_values(by=['feature_importance'], ascending=False , inplace=True)
feature_importance_log_reg['feature_rank'] = feature_importance_log_reg.reset_index().index + 1  # Create a rank column for feature importance rank
feature_importance_log_reg['model'] = f'log_reg_{iteration}'
feature_importance_log_reg = feature_importance_log_reg.drop('feature_importance', axis=1)

# Instantiate the two other models (svc and rf) and get their SHAP ranking of all features
dfs = [feature_importance_log_reg]
for i, mod in enumerate([svc_model, rf_model]):
    model = pickle.load(open(mod[0], 'rb'))  # mod is a snakemake Namedlist of one item, so we need to select the first item through indexing
    model_substring = re.search("results/trained_models/(.*)_trained_scale_.*sav", mod[0]).group(1)  # find the model name within the pickled model name
    explainer = shap.KernelExplainer(model.predict, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
    shap_values = explainer.shap_values(X_test)
    vals = np.abs(shap_values).mean(0)  # mean SHAP value across all examples in X_test for each feature
    feature_importance = pd.DataFrame(list(zip(X_train.columns, vals)), columns=['feature', 'feature_importance'])
    feature_importance.sort_values(by=['feature_importance'], ascending=False , inplace=True)
    feature_importance['feature_rank'] = feature_importance.reset_index().index + 1  # Create a rank column for feature importance rank
    feature_importance['model'] = f'{model_substring}_{iteration}'
    feature_importance = feature_importance.drop('feature_importance', axis=1)
    dfs.append(feature_importance)

# Concat all dfs into one df
df_final = pd.concat(dfs)
df_final.to_csv(snakemake.output.rank_features_df, sep='\t', index=False)
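
For reference, the concatenated output is in long format, with one row per feature per model and iteration; a hypothetical excerpt (feature names, ranks and the iteration label are placeholders, not real pipeline values):

# feature      feature_rank    model
# feature_A    1               log_reg_<iteration>
# feature_B    2               log_reg_<iteration>
# feature_A    1               svc_<iteration>
# ...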
import pandas as pd
import numpy as np
""" Create a dataframe containing the rank of importance for each feature
    and per model per iteration."""
manual_iteration = snakemake.wildcards.manual_iteration
shap_paths = snakemake.input.shap_vals
log_reg_shap_path = [path for path in shap_paths if 'log_reg' in path][0]
svc_shap_path = [path for path in shap_paths if 'svc' in path][0]
rf_shap_path = [path for path in shap_paths if 'rf' in path][0]

shap_values_log_reg = pd.read_csv(log_reg_shap_path, sep='\t')
shap_values_log_reg = shap_values_log_reg.set_index('gene_id_sno')
log_reg_cols = list(shap_values_log_reg.columns)
log_reg_cols = [col.split('_norm_SHAP')[0] for col in log_reg_cols]
shap_values_log_reg.columns = log_reg_cols # remove the _norm_SHAP suffix
vals_log_reg = np.abs(shap_values_log_reg).mean(0)  # mean SHAP value across all examples in X_test for each feature
feature_importance_log_reg = pd.DataFrame(list(zip(shap_values_log_reg.columns, vals_log_reg)), columns=['feature', 'feature_importance'])
feature_importance_log_reg.sort_values(by=['feature_importance'], ascending=False , inplace=True)
feature_importance_log_reg['feature_rank'] = feature_importance_log_reg.reset_index().index + 1  # Create a rank column for feature importance rank
feature_importance_log_reg['model'] = f'log_reg_{manual_iteration}'
feature_importance_log_reg = feature_importance_log_reg.drop('feature_importance', axis=1)

# Get the SHAP ranking of all features for the svc and rf models
shap_values_svc = pd.read_csv(svc_shap_path, sep='\t')
shap_values_svc = shap_values_svc.set_index('gene_id_sno')
svc_cols = list(shap_values_svc.columns)
svc_cols = [col.split('_norm_SHAP')[0] for col in svc_cols]
shap_values_svc.columns = svc_cols # remove the _norm_SHAP suffix
shap_values_rf = pd.read_csv(rf_shap_path, sep='\t')
shap_values_rf = shap_values_rf.set_index('gene_id_sno')
rf_cols = list(shap_values_rf.columns)
rf_cols = [col.split('_norm_SHAP')[0] for col in rf_cols]
shap_values_rf.columns = rf_cols # remove the _norm_SHAP suffix

dfs = [feature_importance_log_reg]
paths = [svc_shap_path, rf_shap_path]
for i, shap_values in enumerate([shap_values_svc, shap_values_rf]):
    specific_path = paths[i]
    model_substring = specific_path.split('/')[-1].split('_manual')[0]  # find the model name within the path
    vals = np.abs(shap_values).mean(0)  # mean SHAP value across all examples in X_test for each feature
    feature_importance = pd.DataFrame(list(zip(shap_values.columns, vals)), columns=['feature', 'feature_importance'])
    feature_importance.sort_values(by=['feature_importance'], ascending=False , inplace=True)
    feature_importance['feature_rank'] = feature_importance.reset_index().index + 1  # Create a rank column for feature importance rank
    feature_importance['model'] = f'{model_substring}_{manual_iteration}'
    feature_importance = feature_importance.drop('feature_importance', axis=1)
    dfs.append(feature_importance)

# Concat all dfs into one df
df_final = pd.concat(dfs)
df_final.to_csv(snakemake.output.rank_features_df, sep='\t', index=False)
import pandas as pd
import shap
import numpy as np
import re
import pickle
""" Create a dataframe containing the rank of importance for each feature
    and per model (wo RF)."""

X_train = pd.read_csv(snakemake.input.X_train, sep='\t', index_col='gene_id_sno')
X_test = pd.read_csv(snakemake.input.X_test, sep='\t', index_col='gene_id_sno')

# Instantiate log_reg model and get the SHAP ranking of its features (highest absolute value of SHAP is considered as rank 1 most predictive feature)
log_reg = pickle.load(open(snakemake.input.log_reg, 'rb'))
explainer_log_reg = shap.LinearExplainer(log_reg, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
shap_values_log_reg = explainer_log_reg.shap_values(X_test)
vals_log_reg = np.abs(shap_values_log_reg).mean(0)  # mean SHAP value across all examples in X_test for each feature
feature_importance_log_reg = pd.DataFrame(list(zip(X_train.columns, vals_log_reg)), columns=['feature', 'feature_importance'])
feature_importance_log_reg.sort_values(by=['feature_importance'], ascending=False , inplace=True)
feature_importance_log_reg['feature_rank'] = feature_importance_log_reg.reset_index().index + 1  # Create a rank column for feature importance rank
feature_importance_log_reg['model'] = 'log_reg'
feature_importance_log_reg = feature_importance_log_reg.drop('feature_importance', axis=1)

# Instantiate the three other models (knn, gbm and svc) and get their SHAP ranking of all features
dfs = [feature_importance_log_reg]
for i, mod in enumerate(snakemake.input.other_model):
    model = pickle.load(open(mod, 'rb'))
    model_substring = re.search("results/trained_models/(.*)_trained_scale_after_split.sav", mod).group(1)  # find the model name within the pickled model name
    explainer = shap.KernelExplainer(model.predict, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
    shap_values = explainer.shap_values(X_test)
    vals = np.abs(shap_values).mean(0)  # mean SHAP value across all examples in X_test for each feature
    feature_importance = pd.DataFrame(list(zip(X_train.columns, vals)), columns=['feature', 'feature_importance'])
    feature_importance.sort_values(by=['feature_importance'], ascending=False , inplace=True)
    feature_importance['feature_rank'] = feature_importance.reset_index().index + 1  # Create a rank column for feature importance rank
    feature_importance['model'] = model_substring
    feature_importance = feature_importance.drop('feature_importance', axis=1)
    dfs.append(feature_importance)

# Concat all dfs into one df
df_final = pd.concat(dfs)
df_final.to_csv(snakemake.output.rank_features_df, sep='\t', index=False)
import pandas as pd
from pybedtools import BedTool
import subprocess as sp

""" Determine the extended overlap between a bed file of all snoRNAs and a bed
    of the binding of AQR (eCLIP data)."""

col = ['chr', 'start', 'end', 'gene_id', 'dot', 'strand', 'source', 'feature',
        'dot2', 'gene_info']
sno_bed = pd.read_csv(snakemake.input.sno_bed, sep='\t', names=col)  # generated with gtf_to_bed
HG_bed = BedTool(snakemake.input.HG_bed)
aqr_hepg2, aqr_k562 = snakemake.input.aqr_HepG2_bed, snakemake.input.aqr_K562_bed
df = pd.read_csv(snakemake.input.df, sep='\t')


# First, merge the two AQR eCLIP bed files (remove redundancy in peaks)
# All peaks have at least a pVal < 0.01 and max 1 nt between them
sp.call(f'cat {aqr_hepg2} {aqr_k562} | sort -k1,1 -k2,2n > temp_aqr.bed', shell=True)
aqr_temp_bed = BedTool('temp_aqr.bed')
aqr_bed = aqr_temp_bed.merge(s=True, d=1, c=[5,7,6,8], o='distinct')


# Generate bed of the intron of all intronic snoRNAs
df = df[df['abundance_cutoff_host'] != 'intergenic']
sno_bed = sno_bed[sno_bed['gene_id'].isin(list(df.gene_id_sno))]
d_upstream = dict(zip(df.gene_id_sno, df.distance_upstream_exon))
d_downstream = dict(zip(df.gene_id_sno, df.distance_downstream_exon))
intron_bed = sno_bed.copy()

# Extend sno_bed to upstream and downstream exon (thereby the snoRNA whole intron) depending on the strand
intron_bed.loc[intron_bed['strand'] == "+", 'start_intron'] = intron_bed['start'] - intron_bed['gene_id'].map(d_upstream)
intron_bed.loc[intron_bed['strand'] == "+", 'end_intron'] = intron_bed['end'] + intron_bed['gene_id'].map(d_downstream)
intron_bed.loc[intron_bed['strand'] == "-", 'start_intron'] = intron_bed['start'] - intron_bed['gene_id'].map(d_downstream)
intron_bed.loc[intron_bed['strand'] == "-", 'end_intron'] = intron_bed['end'] + intron_bed['gene_id'].map(d_upstream)
intron_bed = intron_bed[['chr', 'start_intron', 'end_intron', 'gene_id', 'dot', 'strand', 'source', 'feature',
        'dot2', 'gene_info']]
intron_bed['start_intron'] = intron_bed['start_intron'].astype('int')
intron_bed['end_intron'] = intron_bed['end_intron'].astype('int')

intron_bed.to_csv(snakemake.output.intron_bed, sep='\t', header=False, index=False)
intron_bed = BedTool(snakemake.output.intron_bed)

# Intersect the aqr peaks with snoRNA intron bed
intersection = intron_bed.intersect(aqr_bed, wa=True, s=True, wb=True, sorted=True).saveas(snakemake.output.overlap_sno_AQR)

sp.call('rm temp_aqr.bed', shell=True)
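
To make the strand-dependent extension above concrete, here is a worked toy example (all coordinates are hypothetical):

# Hypothetical "+" strand snoRNA:
#   sno start = 1000, end = 1100
#   distance_upstream_exon = 50, distance_downstream_exon = 30
#   -> intron coordinates written to intron_bed: start_intron = 950, end_intron = 1130
# For a "-" strand snoRNA the distances are swapped, since the upstream exon
# (in transcript orientation) lies at higher genomic coordinates:
#   -> start_intron = start - distance_downstream_exon, end_intron = end + distance_upstream_exon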
import pandas as pd
from pybedtools import BedTool

""" Generate specific bed files for expressed vs not_expressed snoRNAs (either
    intronic or intergenic) and of their HG (if the snoRNA is intronic)"""
cols = ['chr', 'start', 'end', 'gene_id', 'dot', 'strand', 'source', 'feature', 'dot2', 'characteristics']
intronic = pd.read_csv(snakemake.input.intronic_sno_bed, names=cols, sep='\t')
intergenic = pd.read_csv(snakemake.input.intergenic_sno_bed, names=cols, sep='\t')
hg_bed = pd.read_csv(snakemake.input.hg_bed, names=cols, sep='\t')
abundance_status_df = pd.read_csv(snakemake.input.abundance_status_df, sep='\t')
hg_df = pd.read_csv(snakemake.input.hg_df)

ab_status_dict = dict(zip(abundance_status_df['gene_id_sno'], abundance_status_df['abundance_cutoff_2']))

# Split intronic snoRNA bed file based on their abundance status
expressed_intronic_snoRNA = intronic.loc[intronic['gene_id'].isin([k for k,v in ab_status_dict.items() if v == 'expressed'])]
not_expressed_intronic_snoRNA = intronic.loc[intronic['gene_id'].isin([k for k,v in ab_status_dict.items() if v == 'not_expressed'])]

# Split intergenic snoRNA bed file based on their abundance status
expressed_intergenic_snoRNA = intergenic.loc[intergenic['gene_id'].isin([k for k,v in ab_status_dict.items() if v == 'expressed'])]
not_expressed_intergenic_snoRNA = intergenic.loc[intergenic['gene_id'].isin([k for k,v in ab_status_dict.items() if v == 'not_expressed'])]

# Create a bed file for HG of either expressed or not expressed snoRNAs
HG_expressed_sno_df = hg_df[hg_df['sno_id'].isin(list(expressed_intronic_snoRNA['gene_id']))]
HG_expressed_sno_bed = hg_bed[hg_bed['gene_id'].isin(list(HG_expressed_sno_df['host_id']))]

HG_not_expressed_sno_df = hg_df[hg_df['sno_id'].isin(list(not_expressed_intronic_snoRNA['gene_id']))]
HG_not_expressed_sno_bed = hg_bed[hg_bed['gene_id'].isin(list(HG_not_expressed_sno_df['host_id']))]

# Save all bed files
expressed_intronic_snoRNA.to_csv(snakemake.output.expressed_intronic_sno_bed, index=False, sep='\t', header=False)
not_expressed_intronic_snoRNA.to_csv(snakemake.output.not_expressed_intronic_sno_bed, index=False, sep='\t', header=False)
expressed_intergenic_snoRNA.to_csv(snakemake.output.expressed_intergenic_sno_bed, index=False, sep='\t', header=False)
not_expressed_intergenic_snoRNA.to_csv(snakemake.output.not_expressed_intergenic_sno_bed, index=False, sep='\t', header=False)
HG_expressed_sno_bed.to_csv(snakemake.output.HG_expressed_sno_bed, index=False, sep='\t', header=False)
HG_not_expressed_sno_bed.to_csv(snakemake.output.HG_not_expressed_sno_bed, index=False, sep='\t', header=False)
import pandas as pd
import re
import regex
from functools import reduce

""" Find C, D, C' and D' boxes of each snoRNA (if they exist) and their position"""
cd_fasta = snakemake.input.cd_fasta

def cut_sequence(seq):
    # Get the 20 first and 20 last nt of a given sequence
    first, last = seq[:20], seq[-20:]
    length = len(seq)
    return first, last, length

def find_d_box(seq):
    """ Find exact D box (CUGA), if not present, find D box with 1 or max 2
        substitutions. Return also the start and end position of that box as
        1-based values. If no D box is found, return a 'NNNN' empty D box and 0
        as start and end of D box."""
    first_20, last_20, length_seq = cut_sequence(seq)
    len_d_box = 4
    # First, find exact D box (CUGA) within 20 nt of the snoRNA 3' end
    if re.search('CUGA', last_20) is not None:  # find exact D box
        *_, last_possible_d = re.finditer('CUGA', last_20)
        d_motif = last_possible_d.group(0)  # if multiple exact D boxes found, keep the D box closest to 3' end
        d_start = (length_seq - 20) + last_possible_d.start() + 1
        d_end = (length_seq - 20) + last_possible_d.end()
        return d_motif, d_start, d_end
    else:  # find not exact D box (up to max 50% of substitution allowed (i.e. 2 nt))
        for sub in range(1, int(len_d_box/2 + 1)):  # iterate over 1 to 2 substitutions allowed
            d_motif = regex.findall("(CUGA){s<="+str(sub)+"}", last_20, overlapped=True)
            if len(d_motif) >= 1:  # if we have a match, break and keep that match (1 sub privileged over 2 subs)
                d_motif = d_motif[-1]  # if multiple D boxes found, keep the D box closest to the 3' end
                d_start = (length_seq - 20) + last_20.rindex(d_motif) + 1
                d_end = d_start + len(d_motif) - 1
                return d_motif, d_start, d_end  # this exits the global else statement
        # If no D box is found, return NNNN and 0, 0 as D box sequence, start and end
        d_motif, d_start, d_end = 'NNNN', 0, 0
        return d_motif, d_start, d_end


def find_c_box(seq):
    """ Find exact C box (RUGAUGA, where R is A or G), if not present, find C
        box with 1,2 or max 3 substitutions. Return also the start and end
        position of that box as 1-based values. If no C box is found, return a
        'NNNNNNN' empty C box and 0 as start and end of C box."""
    first_20, last_20, length_seq = cut_sequence(seq)
    len_c_box = 7
    # First, find exact C box (RUGAUGA) within 20 nt of the snoRNA 5' end
    if re.search('(A|G)UGAUGA', first_20) is not None:  # find exact C box
        i = 1
        for possible_c in re.finditer('(A|G)UGAUGA', first_20):
            if i <= 1:  # select first matched group only (closest RUGAUGA to 5' end of snoRNA)
                c_motif = possible_c.group(0)
                c_start = possible_c.start() + 1
                c_end = possible_c.end()
                i += 1
                return c_motif, c_start, c_end  # this exits the global if statement
    else:  # find not exact C box (up to max 3 substitution allowed)
        for sub in range(1, int((len_c_box-1)/2 + 1)):  # iterate over 1 to 3 substitutions allowed
            c_motif = regex.findall("((A|G)UGAUGA){s<="+str(sub)+"}", first_20, overlapped=True)
            if len(c_motif) >= 1:  # if we have a match, break and keep that match (1 sub privileged over 2 subs)
                c_motif = c_motif[0][0]  # if multiple C boxes found, keep the C box closest to the 5' end
                c_start = first_20.find(c_motif) + 1
                c_end = c_start + len(c_motif) - 1
                return c_motif, c_start, c_end  # this exits the global else statement
        # If no C box is found, return NNNNNNN and 0, 0 as C box sequence, start and end
        c_motif, c_start, c_end = 'NNNNNNN', 0, 0
        return c_motif, c_start, c_end



def find_c_prime_d_prime_hamming(seq):
    """ Find best C'/D' pair that minimizes Hamming distance compared to consensus C'/D' motif. """
    # Get the nucleotides between the 20th and last 20th nucleotides
    middle_seq = seq[20:-20]
    len_c_prime_box, len_d_prime_box = 7, 4

    # Find all possible C' boxes and their start/end and compute their Hamming distance compared to consensus motif
    hamming_c_prime, all_c_primes, all_c_primes_start, all_c_primes_end, temp_motif = [], [], [], [], ''
    for sub in range(0, int((len_c_prime_box-1)/2 + 1)):
        c_prime_motif = regex.findall("((A|G)UGAUGA){s<="+str(sub)+"}", middle_seq, overlapped=True)
        if len(c_prime_motif) >= 1:
            for motif in c_prime_motif:
                if motif[0] != temp_motif:  # to avoid repeated motif between 0 and 1 substitution allowed
                    all_c_primes.append(motif[0])
                    hamming_c_prime.append(sub)
                    c_prime_start = middle_seq.index(motif[0]) + 21
                    c_prime_end = c_prime_start + len(motif[0]) - 1
                    all_c_primes_start.append(c_prime_start)
                    all_c_primes_end.append(c_prime_end)
                    temp_motif = motif[0]
    if len(all_c_primes) == 0:  # if no C' box was found, hamming distance is 7, C' motif is NNNNNNN and start and end are 0
        hamming_c_prime, all_c_primes, all_c_primes_start, all_c_primes_end = [7], ['NNNNNNN'], [0], [0]

    # Find all possible D' boxes and their start/end and compute their Hamming distance compared to consensus motif
    hamming_d_prime, all_d_primes, all_d_primes_start, all_d_primes_end, temp_motif = [], [], [], [], ''
    for sub in range(0, int((len_d_prime_box)/2 + 1)):
        d_prime_motif = regex.findall("(CUGA){s<="+str(sub)+"}", middle_seq, overlapped=True)
        if len(d_prime_motif) >= 1:
            for motif in d_prime_motif:
                if motif != temp_motif:  # to avoid repeated motif between 0 and 1 substitution allowed
                    all_d_primes.append(motif)
                    hamming_d_prime.append(sub)
                    d_prime_start = middle_seq.index(motif) + 21
                    d_prime_end = d_prime_start + len(motif) - 1
                    all_d_primes_start.append(d_prime_start)
                    all_d_primes_end.append(d_prime_end)
                    temp_motif = motif
    if len(all_d_primes) == 0:  # if no D' box was found, hamming distance is 4, D' motif is NNNN and start and end are 0
        hamming_d_prime, all_d_primes, all_d_primes_start, all_d_primes_end = [4], ['NNNN'], [0], [0]

    # Find all possible D'-C' pairs where C' is downstream of D' by at least 2 nt (i.e. at least 1 nt between D' and C' boxes)
    # and return the best pair according to the lowest total Hamming distance (if two pairs have the same Hamming distance, the closest one to 5' of snoRNA is chosen)
    total_hamming, d_prime_index, c_prime_index = 10000000000, 10000000000, 10000000000  # these are dummy high numbers
    for i, d_prime in enumerate(all_d_primes):
        for j, c_prime in enumerate(all_c_primes):
            # C' downstream of D' by at least 2 nt or if D' is found but not C' (the case where no D' is found but C' is found is also included: d_prime_end = 0 is smaller than any existing c_prime_start)
            if (all_d_primes_end[i] <= all_c_primes_start[j] + 2) | ((all_d_primes[i] != 'NNNN') & (all_c_primes[j] == 'NNNNNNN')):
                temp_total_hamming = hamming_d_prime[i] + hamming_c_prime[j]
                if temp_total_hamming < total_hamming:
                    total_hamming = temp_total_hamming
                    d_prime_index = i
                    c_prime_index = j

            # if no D' nor C' box are found
            elif (all_d_primes[i] != 'NNNN') & (all_c_primes[j] == 'NNNNNNN'):
                temp_total_hamming = hamming_d_prime[i] + hamming_c_prime[j]
                d_prime_index, c_prime_index = 0, 0


    # If only one C' box is found and multiple D' are found (i.e. for 3 snoRNAs) but are overlapping, keep the D' box with the
    # lowest Hamming distance (closest to 5' if multiple D' have the same Hamming) and return NNNNNNN as the C'
    if (d_prime_index == 10000000000) | (c_prime_index == 10000000000):
        all_c_primes, all_c_primes_start, all_c_primes_end, c_prime_index = ['NNNNNNN'], [0], [0], 0
        d_prime_index = hamming_d_prime.index(min(hamming_d_prime))
    c_prime_motif, c_prime_start, c_prime_end = all_c_primes[c_prime_index], all_c_primes_start[c_prime_index], all_c_primes_end[c_prime_index]
    d_prime_motif, d_prime_start, d_prime_end = all_d_primes[d_prime_index], all_d_primes_start[d_prime_index], all_d_primes_end[d_prime_index]
    return c_prime_motif, c_prime_start, c_prime_end, d_prime_motif, d_prime_start, d_prime_end



def generate_df(fasta, func, motif_name):
    """ From a fasta of snoRNA sequences, find a given motif (C or D) using
        predefined function func and output the motif sequence, start and end
        as a df."""
    # Get motif, start and end position inside dict
    box_dict = {}
    with open(fasta, 'r') as f:
        sno_id = ''
        for line in f:
            if line.startswith('>'):
                id = line.lstrip('>').rstrip('\n')
                sno_id = id
            else:
                seq = line.rstrip('\n')
                motif, start, end = func(seq)
                box_dict[sno_id] = [motif, start, end]

    # Create dataframe from box_dict
    box = pd.DataFrame.from_dict(box_dict, orient='index',
                                columns=[f'{motif_name}_sequence', f'{motif_name}_start',
                                        f'{motif_name}_end'])
    box = box.reset_index()
    box = box.rename(columns={"index": "gene_id"})
    return box


def generate_df_prime(fasta):
    """ From a fasta of snoRNA sequences, find a given motif (C' or D') using
        predefined function find_c_prime_d_prime_hamming and output the motif sequence,
        start and end as a df."""
    # Get motif, start and end position inside dict
    box_dict = {}
    with open(fasta, 'r') as f:
        sno_id = ''
        for line in f:
            if line.startswith('>'):
                id = line.lstrip('>').rstrip('\n')
                sno_id = id
            else:
                seq = line.rstrip('\n')
                c_prime_motif, c_prime_start, c_prime_end, d_prime_motif, d_prime_start, d_prime_end = find_c_prime_d_prime_hamming(seq)
                box_dict[sno_id] = [c_prime_motif, c_prime_start,
                                    c_prime_end, d_prime_motif, d_prime_start,
                                    d_prime_end]

    # Create dataframe from box_dict
    box = pd.DataFrame.from_dict(box_dict, orient='index',
                                columns=['C_prime_sequence', 'C_prime_start',
                                        'C_prime_end', 'D_prime_sequence',
                                        'D_prime_start', 'D_prime_end'])
    box = box.reset_index()
    box = box.rename(columns={"index": "gene_id"})
    return box


def find_all_boxes(fasta, path):
    """ Find C, D, C' and D' boxes in given fasta using generate_df and concat
        resulting dfs horizontally."""
    df_c = generate_df(fasta, find_c_box, 'C')
    df_d = generate_df(fasta, find_d_box, 'D')
    df_c_prime_d_prime = generate_df_prime(fasta)

    df_final = reduce(lambda  left,right: pd.merge(left,right,on=['gene_id'],
                                            how='outer'),
                                            [df_c, df_d, df_c_prime_d_prime])
    df_final.to_csv(path, index=False, sep='\t')


find_all_boxes(cd_fasta, snakemake.output.c_d_box_location)
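
As a quick sanity check of the 1-based coordinates returned by find_d_box and find_c_box, the snippet below uses a toy 80-nt sequence (illustrative only, not part of the original script and not pipeline data):

# Optional sanity check of the coordinate conventions (toy sequence):
toy_seq = 'A' * 70 + 'CUGAAAAAAA'   # 80 nt; single exact D box at 1-based positions 71-74
assert find_d_box(toy_seq) == ('CUGA', 71, 74)
assert find_c_box(toy_seq) == ('NNNNNNN', 0, 0)   # no C box within the first 20 nt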
import pandas as pd
import re
import regex
from functools import reduce

""" Find C, D, C' and D' boxes of each snoRNA (if they exist) and their position"""
expressed_cd = snakemake.input.expressed_cd
not_expressed_cd = snakemake.input.not_expressed_cd


def cut_sequence(seq):
    # Get the 20 first and 20 last nt of a given sequence
    first, last = seq[:20], seq[-20:]
    length = len(seq)
    return first, last, length

def find_d_box(seq):
    """ Find exact D box (CUGA), if not present, find D box with 1 or max 2
        substitutions. Return also the start and end position of that box as
        1-based values. If no D box is found, return a 'NNNN' empty D box and 0
        as start and end of D box."""
    first_20, last_20, length_seq = cut_sequence(seq)
    len_d_box = 4
    # First, find exact D box (CUGA) within 20 nt of the snoRNA 3' end
    if re.search('CUGA', last_20) is not None:  # find exact D box
        *_, last_possible_d = re.finditer('CUGA', last_20)
        d_motif = last_possible_d.group(0)  # if multiple exact D boxes found, keep the D box closest to 3' end
        d_start = (length_seq - 20) + last_possible_d.start() + 1
        d_end = (length_seq - 20) + last_possible_d.end()
        return d_motif, d_start, d_end
    else:  # find not exact D box (up to max 50% of substitution allowed (i.e. 2 nt))
        for sub in range(1, int(len_d_box/2 + 1)):  # iterate over 1 to 2 substitutions allowed
            d_motif = regex.findall("(CUGA){s<="+str(sub)+"}", last_20, overlapped=True)
            if len(d_motif) >= 1:  # if we have a match, break and keep that match (1 sub privileged over 2 subs)
                d_motif = d_motif[-1]  # if multiple D boxes found, keep the D box closest to the 3' end
                d_start = (length_seq - 20) + last_20.rindex(d_motif) + 1
                d_end = d_start + len(d_motif) - 1
                return d_motif, d_start, d_end  # this exits the global else statement
        # If no D box is found, return NNNN and 0, 0 as D box sequence, start and end
        d_motif, d_start, d_end = 'NNNN', 0, 0
        return d_motif, d_start, d_end


def find_c_box(seq):
    """ Find exact C box (RUGAUGA, where R is A or G), if not present, find C
        box with 1,2 or max 3 substitutions. Return also the start and end
        position of that box as 1-based values. If no C box is found, return a
        'NNNNNNN' empty C box and 0 as start and end of C box."""
    first_20, last_20, length_seq = cut_sequence(seq)
    len_c_box = 7
    # First, find exact C box (RUGAUGA) within 20 nt of the snoRNA 5' end
    if re.search('(A|G)UGAUGA', first_20) is not None:  # find exact C box
        i = 1
        for possible_c in re.finditer('(A|G)UGAUGA', first_20):
            if i <= 1:  # select first matched group only (closest RUGAUGA to 5' end of snoRNA)
                c_motif = possible_c.group(0)
                c_start = possible_c.start() + 1
                c_end = possible_c.end()
                i += 1
                return c_motif, c_start, c_end  # this exits the global if statement
    else:  # find not exact C box (up to max 3 substitution allowed)
        for sub in range(1, int((len_c_box-1)/2 + 1)):  # iterate over 1 to 3 substitutions allowed
            c_motif = regex.findall("((A|G)UGAUGA){s<="+str(sub)+"}", first_20, overlapped=True)
            if len(c_motif) >= 1:  # if we have a match, break and keep that match (1 sub privileged over 2 subs)
                c_motif = c_motif[0][0]  # if multiple C boxes found, keep the C box closest to the 5' end
                c_start = first_20.find(c_motif) + 1
                c_end = c_start + len(c_motif) - 1
                return c_motif, c_start, c_end  # this exits the global else statement
        # If no C box is found, return NNNNNNN and 0, 0 as C box sequence, start and end
        c_motif, c_start, c_end = 'NNNNNNN', 0, 0
        return c_motif, c_start, c_end


def find_c_prime_d_prime(seq):
    """ Find exact C' box (RUGAUGA, where R is A or G), if not present, find C'
        box with 1,2 or max 3 substitutions. Return also the start and end
        position of that box as 1-based values. If no C' box is found, return a
        'NNNNNNN' empty C' box and 0 as start and end of C' box. Of all potential
        C' boxes, always pick the C' closest to 3' end. Find after the closest
        D' box that must be upstream of the found C' box (exact D' (CUGA), then
        1 and 2 substitutions allowed)."""
    # Get the nucleotides between the 20th and last 20th nucleotides
    middle_seq = seq[20:-20]
    len_c_prime_box, len_d_prime_box = 7, 4

    # First, find exact C' box (RUGAUGA) closest to 3' end
    if re.search('(A|G)UGAUGA', middle_seq) is not None:  # find exact C' box
        *_, last_possible_c_prime = re.finditer('(A|G)UGAUGA', middle_seq)
        c_prime_motif = last_possible_c_prime.group(0)
        c_prime_start = last_possible_c_prime.start() + 21
        c_prime_end = last_possible_c_prime.end() + 20
        # Find closest exact D' box upstream of found C' box (upstream by at least 1 nt between D' and C')
        if re.search('CUGA', seq[20:c_prime_start]) is not None:  # find exact D' box
            *_, last_possible_d_prime = re.finditer('CUGA', seq[20:c_prime_start])
            d_prime_motif = last_possible_d_prime.group(0)  # if multiple exact D' boxes found, keep the D' box closest to 3' (i.e. closest to C' box)
            d_prime_start = last_possible_d_prime.start() + 21
            d_prime_end = last_possible_d_prime.end() + 20
            return c_prime_motif, c_prime_start, c_prime_end, d_prime_motif, d_prime_start, d_prime_end  # this exits the global if statement
        else:  # find not exact D' box (up to max 50% of substitution allowed (i.e. 2 nt))
            for sub in range(1, int(len_d_prime_box/2 + 1)):  # iterate over 1 to 2 substitutions allowed
                d_prime_motif = regex.findall("(CUGA){s<="+str(sub)+"}", seq[20:c_prime_start], overlapped=True)
                if len(d_prime_motif) >= 1:  # if we have a match, break and keep that match (1 sub privileged over 2 subs)
                    d_prime_motif = d_prime_motif[-1]  # if multiple D' boxes found, keep the D' box closest to 3' (i.e. closest to the C' box)
                    d_prime_start = seq[20:c_prime_start].rindex(d_prime_motif) + 21
                    d_prime_end = d_prime_start + len(d_prime_motif) - 1
                    return c_prime_motif, c_prime_start, c_prime_end, d_prime_motif, d_prime_start, d_prime_end  # this exits the global if statement
            # If no D' box is found, return NNNN and 0, 0 as D' box sequence, start and end
            d_prime_motif, d_prime_start, d_prime_end = 'NNNN', 0, 0
            return c_prime_motif, c_prime_start, c_prime_end, d_prime_motif, d_prime_start, d_prime_end

    else: # find not exact C' box (up to max 3 substitution allowed)
        for sub in range(1, int((len_c_prime_box-1)/2 + 1)):  # iterate over 1 to 3 substitutions allowed
            c_prime_motif = regex.findall("((A|G)UGAUGA){s<="+str(sub)+"}", middle_seq, overlapped=True)
            if len(c_prime_motif) >= 1:  # if we have a match, break and keep that match (1 sub privileged over 2 subs)
                c_prime_motif = c_prime_motif[0][0]  # if multiple C' boxes found, keep the C' box closest to the 3' end
                c_prime_start = middle_seq.rfind(c_prime_motif) + 21
                c_prime_end = c_prime_start + len(c_prime_motif) - 1
                # Find closest exact D' box upstream of found C' box (upstream by at least 1 nt between D' and C')
                if re.search('CUGA', seq[20:c_prime_start]) is not None:  # find exact D' box
                    *_, last_possible_d_prime = re.finditer('CUGA', seq[20:c_prime_start])
                    d_prime_motif = last_possible_d_prime.group(0)  # if multiple exact D' boxes found, keep the D' box closest to 3' (i.e. closest to C' box)
                    d_prime_start = last_possible_d_prime.start() + 21
                    d_prime_end = last_possible_d_prime.end() + 20
                    return c_prime_motif, c_prime_start, c_prime_end, d_prime_motif, d_prime_start, d_prime_end  # this exits the global else statement
                else:  # find closest not exact D' box (up to max 50% of substitution allowed (i.e. 2 nt))
                    for sub in range(1, int(len_d_prime_box/2 + 1)):  # iterate over 1 to 2 substitutions allowed
                        d_prime_motif = regex.findall("(CUGA){s<="+str(sub)+"}", seq[20:c_prime_start], overlapped=True)
                        if len(d_prime_motif) >= 1:  # if we have a match, break and keep that match (1 sub privileged over 2 subs)
                            d_prime_motif = d_prime_motif[-1]  # if multiple D' boxes found, keep the D' box closest to 3' (i.e. closest to the C' box)
                            d_prime_start = seq[20:c_prime_start].rindex(d_prime_motif) + 21
                            d_prime_end = d_prime_start + len(d_prime_motif) - 1
                            return c_prime_motif, c_prime_start, c_prime_end, d_prime_motif, d_prime_start, d_prime_end  # this exits the global else statement
                    # If no D' box is found, return NNNN and 0, 0 as D' box sequence, start and end
                    d_prime_motif, d_prime_start, d_prime_end = 'NNNN', 0, 0
                    return c_prime_motif, c_prime_start, c_prime_end, d_prime_motif, d_prime_start, d_prime_end  # this exits the global else statement
        # If no C' and no D' box are found, return NNNNNNN/NNNN and 0, 0 as C'/D' box sequence, start and end
        d_prime_motif, d_prime_start, d_prime_end = 'NNNN', 0, 0
        c_prime_motif, c_prime_start, c_prime_end = 'NNNNNNN', 0, 0
        return c_prime_motif, c_prime_start, c_prime_end, d_prime_motif, d_prime_start, d_prime_end


def find_d_prime_c_prime(seq):
    """ Find exact D' box (CUGA) first; if not present, find D'
        box with 1 or 2 max substitutions. Return also the start and end
        position of that box as 1-based values. If no D' box is found, return a
        'NNNN' empty D' box and 0 as start and end of C box. Of all potential
        D' boxes, always pick the D' closest to 5' end. Find after the closest
        C' box that must be downstream of the found D' box (exact C' (RUGAUGA), then
        1, 2 and 3 substitutions allowed)."""
    # Get the nucleotides between the 20th and last 20th nucleotides
    middle_seq = seq[20:-20]
    len_c_prime_box, len_d_prime_box = 7, 4

    # First, find exact D' box (CUGA) closest to 5' end
    if re.search('CUGA', middle_seq) is not None:  # find exact D' box
        i = 1
        for possible_d_prime in re.finditer('CUGA', middle_seq):
            if i <= 1:  # select first matched group only (closest CUGA to 5' end of snoRNA)
                d_prime_motif = possible_d_prime.group(0)
                d_prime_start = possible_d_prime.start() + 21
                d_prime_end = possible_d_prime.end() + 20
                i += 1

        # Find closest exact C' box downstream of found D' box (downstream by at least 2 nt (1 nt between D' and C'))
        if re.search('(A|G)UGAUGA', seq[d_prime_end+2:-20]) is not None:  # find exact C' box
            j = 1
            for possible_c_prime in re.finditer('(A|G)UGAUGA', seq[d_prime_end+2:-20]):
                if j <= 1:  # select first matched group only (closest RUGAUGA to found D' box)
                    c_prime_motif = possible_c_prime.group(0)
                    c_prime_start = possible_c_prime.start() + d_prime_end
                    c_prime_end = possible_c_prime.end() + d_prime_end
                    j += 1
                    return c_prime_motif, c_prime_start, c_prime_end, d_prime_motif, d_prime_start, d_prime_end  # this exits the global if statement

        else:  # find not exact C' box (up to max 3 substitutions allowed)
            for sub in range(1, int((len_c_prime_box-1)/2 + 1)):  # iterate over 1 to 3 substitutions allowed
                c_prime_motif = regex.findall("((A|G)UGAUGA){s<="+str(sub)+"}", seq[d_prime_end+2:-20], overlapped=True)
                if len(c_prime_motif) >= 1:  # if we have a match, break and keep that match (1 sub privileged over 2 subs)
                    c_prime_motif = c_prime_motif[0][0]  # if multiple C' boxes found, keep the C' box closest to 5' (i.e. closest to the D' box)
                    c_prime_start = seq[d_prime_end+2:-20].index(c_prime_motif) + d_prime_end
                    c_prime_end = c_prime_start + len(c_prime_motif) - 1
                    return c_prime_motif, c_prime_start, c_prime_end, d_prime_motif, d_prime_start, d_prime_end  # this exits the global if statement
            # If no C' box is found, return NNNNNNN and 0, 0 as C' box sequence, start and end
            c_prime_motif, c_prime_start, c_prime_end = 'NNNNNNN', 0, 0
            return c_prime_motif, c_prime_start, c_prime_end, d_prime_motif, d_prime_start, d_prime_end

    else: # find not exact D' box (up to max 2 substitutions allowed)
        for sub in range(1, int((len_d_prime_box)/2 + 1)):  # iterate over 1 to 2 substitutions allowed
            d_prime_motif = regex.findall("(CUGA){s<="+str(sub)+"}", middle_seq, overlapped=True)
            if len(d_prime_motif) >= 1:  # if we have a match, break and keep that match (1 sub privileged over 2 subs)
                d_prime_motif = d_prime_motif[0]  # if multiple D' boxes found, keep the D' box closest to the 5' end
                d_prime_start = middle_seq.find(d_prime_motif) + 21
                d_prime_end = d_prime_start + len(d_prime_motif) - 1
                # Find closest exact C' box downstream of found D' box (downstream by at least 2 nt (1 nt between D' and C'))
                if re.search('(A|G)UGAUGA', seq[d_prime_end+2:-20]) is not None:  # find exact C' box
                    k = 1
                    for possible_c_prime in re.finditer('(A|G)UGAUGA', seq[d_prime_end+2:-20]):
                        if k <= 1:  # select first matched group only (closest RUGAUGA to found D' box)
                            c_prime_motif = possible_c_prime.group(0)
                            c_prime_start = possible_c_prime.start() + d_prime_end
                            c_prime_end = possible_c_prime.end() + d_prime_end
                            k += 1
                            return c_prime_motif, c_prime_start, c_prime_end, d_prime_motif, d_prime_start, d_prime_end  # this exits the global else statement
                else:  # find closest not exact C' box (up to max 3 substitutions allowed)
                    for sub in range(1, int((len_c_prime_box-1)/2 + 1)):  # iterate over 1 to 3 substitutions allowed
                        c_prime_motif = regex.findall("((A|G)UGAUGA){s<="+str(sub)+"}", seq[d_prime_end+2:-20], overlapped=True)
                        if len(c_prime_motif) >= 1:  # if we have a match, break and keep that match (1 sub privileged over 2 subs)
                            c_prime_motif = c_prime_motif[0][0]  # if multiple C' boxes found, keep the C' box closest to 5' (i.e. closest to the D' box)
                            c_prime_start = seq[d_prime_end+2:-20].index(c_prime_motif) + d_prime_end
                            c_prime_end = c_prime_start + len(c_prime_motif) - 1
                            return c_prime_motif, c_prime_start, c_prime_end, d_prime_motif, d_prime_start, d_prime_end  # this exits the global else statement
                    # If no C' box is found, return NNNN and 0, 0 as C' box sequence, start and end
                    c_prime_motif, c_prime_start, c_prime_end = 'NNNNNNN', 0, 0
                    return c_prime_motif, c_prime_start, c_prime_end, d_prime_motif, d_prime_start, d_prime_end  # this exits the global else statement
        # If no D' and no C' box are found, return NNNN/NNNNNNN and 0, 0 as D'/C' box sequence, start and end
        d_prime_motif, d_prime_start, d_prime_end = 'NNNN', 0, 0
        c_prime_motif, c_prime_start, c_prime_end = 'NNNNNNN', 0, 0
        return c_prime_motif, c_prime_start, c_prime_end, d_prime_motif, d_prime_start, d_prime_end


def find_c_prime_d_prime_hamming(seq):
    """ Find best C'/D' pair that minimizes Hamming distance compared to consensus C'/D' motif. """
    # Get the nucleotides between the 20th and last 20th nucleotides
    middle_seq = seq[20:-20]
    len_c_prime_box, len_d_prime_box = 7, 4

    # Find all possible C' boxes and their start/end and compute their Hamming distance compared to consensus motif
    hamming_c_prime, all_c_primes, all_c_primes_start, all_c_primes_end, temp_motif = [], [], [], [], ''
    for sub in range(0, int((len_c_prime_box-1)/2 + 1)):
        c_prime_motif = regex.findall("((A|G)UGAUGA){s<="+str(sub)+"}", middle_seq, overlapped=True)
        if len(c_prime_motif) >= 1:
            for motif in c_prime_motif:
                if motif[0] != temp_motif:  # to avoid repeated motif between 0 and 1 substitution allowed
                    all_c_primes.append(motif[0])
                    hamming_c_prime.append(sub)
                    c_prime_start = middle_seq.index(motif[0]) + 21
                    c_prime_end = c_prime_start + len(motif[0]) - 1
                    all_c_primes_start.append(c_prime_start)
                    all_c_primes_end.append(c_prime_end)                    
                    temp_motif = motif[0]              
    if len(all_c_primes) == 0:  # if no C' box was found, hamming distance is 7, C' motif is NNNNNNN and start and end are 0
        hamming_c_prime, all_c_primes, all_c_primes_start, all_c_primes_end = [7], ['NNNNNNN'], [0], [0]

    # Find all possible D' boxes and their start/end and compute their Hamming distance compared to consensus motif
    hamming_d_prime, all_d_primes, all_d_primes_start, all_d_primes_end, temp_motif = [], [], [], [], ''
    for sub in range(0, int((len_d_prime_box)/2 + 1)):
        d_prime_motif = regex.findall("(CUGA){s<="+str(sub)+"}", middle_seq, overlapped=True)
        if len(d_prime_motif) >= 1:
            for motif in d_prime_motif:
                if motif != temp_motif:  # to avoid repeated motif between 0 and 1 substitution allowed
                    all_d_primes.append(motif)
                    hamming_d_prime.append(sub)
                    d_prime_start = middle_seq.index(motif) + 21
                    d_prime_end = d_prime_start + len(motif) - 1
                    all_d_primes_start.append(d_prime_start)
                    all_d_primes_end.append(d_prime_end)
                    temp_motif = motif                
    if len(all_d_primes) == 0:  # if no D' box was found, hamming distance is 4, D' motif is NNNN and start and end are 0
        hamming_d_prime, all_d_primes, all_d_primes_start, all_d_primes_end = [4], ['NNNN'], [0], [0]

    # Find all possible D'-C' pairs where C' is downstream of D' by at least 2 nt (i.e. at least 1 nt between D' and C' boxes) 
    # and return the best pair according to the lowest total Hamming distance (if two pairs have the same Hamming distance, the closest one to 5' of snoRNA is chosen)
    total_hamming, d_prime_index, c_prime_index = 10000000000, 10000000000, 10000000000  # these are dummy high numbers 
    for i, d_prime in enumerate(all_d_primes):
        for j, c_prime in enumerate(all_c_primes):
            # C' downstream of D' by at least 2 nt or if D' is found but not C' (the case where no D' is found but C' is found is also included: d_prime_end = 0 is smaller than any existing c_prime_start)
            if (all_d_primes_end[i] <= all_c_primes_start[j] + 2) | ((all_d_primes[i] != 'NNNN') & (all_c_primes[j] == 'NNNNNNN')):   
                temp_total_hamming = hamming_d_prime[i] + hamming_c_prime[j]
                if temp_total_hamming < total_hamming:
                    total_hamming = temp_total_hamming
                    d_prime_index = i
                    c_prime_index = j

            # if no D' nor C' box are found
            elif (all_d_primes[i] != 'NNNN') & (all_c_primes[j] == 'NNNNNNN'):
                temp_total_hamming = hamming_d_prime[i] + hamming_c_prime[j]
                d_prime_index, c_prime_index = 0, 0


    # If only one C' box is found and multiple D' are found (i.e. for 3 snoRNAs) but are overlapping, keep the D' box with the 
    # lowest Hamming distance (closest to 5' if multiple D' have the same Hamming) and return NNNNNNN as the C'
    if (d_prime_index == 10000000000) | (c_prime_index == 10000000000):
        all_c_primes, all_c_primes_start, all_c_primes_end, c_prime_index = ['NNNNNNN'], [0], [0], 0
        d_prime_index = hamming_d_prime.index(min(hamming_d_prime))
    c_prime_motif, c_prime_start, c_prime_end = all_c_primes[c_prime_index], all_c_primes_start[c_prime_index], all_c_primes_end[c_prime_index]
    d_prime_motif, d_prime_start, d_prime_end = all_d_primes[d_prime_index], all_d_primes_start[d_prime_index], all_d_primes_end[d_prime_index]
    return c_prime_motif, c_prime_start, c_prime_end, d_prime_motif, d_prime_start, d_prime_end



def generate_df(fasta, func, motif_name):
    """ From a fasta of snoRNA sequences, find a given motif (C or D) using
        predefined function func and output the motif sequence, start and end
        as a df."""
    # Get motif, start and end position inside dict
    box_dict = {}
    with open(fasta, 'r') as f:
        sno_id = ''
        for line in f:
            if line.startswith('>'):
                id = line.lstrip('>').rstrip('\n')
                sno_id = id
            else:
                seq = line.rstrip('\n')
                motif, start, end = func(seq)
                box_dict[sno_id] = [motif, start, end]

    # Create dataframe from box_dict
    box = pd.DataFrame.from_dict(box_dict, orient='index',
                                columns=[f'{motif_name}_sequence', f'{motif_name}_start',
                                        f'{motif_name}_end'])
    box = box.reset_index()
    box = box.rename(columns={"index": "gene_id"})
    return box


def generate_df_prime(fasta):
    """ From a fasta of snoRNA sequences, find a given motif (C' or D') using
        predefined function find_c_prime_d_prime_hamming and output the motif sequence,
        start and end as a df."""
    # Get motif, start and end position inside dict
    box_dict = {}
    with open(fasta, 'r') as f:
        sno_id = ''
        for line in f:
            if line.startswith('>'):
                id = line.lstrip('>').rstrip('\n')
                sno_id = id
            else:
                seq = line.rstrip('\n')
                c_prime_motif, c_prime_start, c_prime_end, d_prime_motif, d_prime_start, d_prime_end = find_c_prime_d_prime_hamming(seq)
                box_dict[sno_id] = [c_prime_motif, c_prime_start,
                                    c_prime_end, d_prime_motif, d_prime_start,
                                    d_prime_end]

    # Create dataframe from box_dict
    box = pd.DataFrame.from_dict(box_dict, orient='index',
                                columns=['C_prime_sequence', 'C_prime_start',
                                        'C_prime_end', 'D_prime_sequence',
                                        'D_prime_start', 'D_prime_end'])
    box = box.reset_index()
    box = box.rename(columns={"index": "gene_id"})
    return box


def find_all_boxes(fasta, path):
    """ Find C, D, C' and D' boxes in given fasta using generate_df and concat
        resulting dfs horizontally."""
    df_c = generate_df(fasta, find_c_box, 'C')
    df_d = generate_df(fasta, find_d_box, 'D')
    df_c_prime_d_prime = generate_df_prime(fasta)

    df_final = reduce(lambda  left,right: pd.merge(left,right,on=['gene_id'],
                                            how='outer'),
                                            [df_c, df_d, df_c_prime_d_prime])
    df_final.to_csv(path, index=False, sep='\t')


find_all_boxes(expressed_cd, snakemake.output.c_d_box_location_expressed)
find_all_boxes(not_expressed_cd, snakemake.output.c_d_box_location_not_expressed)
import pandas as pd

""" Convert T into U in snoRNA sequences and create a fasta file containing all
    snoRNA sequences with their gene_id as names"""

sno_df = pd.read_csv(snakemake.input.snodb, sep='\t')
sno_df = sno_df[['gene_id_sno', 'seq']]

# Replace T by U in snoRNA sequences
sno_df['seq'] = sno_df['seq'].str.replace('T', 'U')

# Create the fasta
dictio = sno_df.set_index('gene_id_sno')['seq'].to_dict()
dictio = {'>'+ k: v for k, v in dictio.items()}  # Add '>' in front of all sno id

with open(snakemake.output.sno_sequences, "a+") as file:  # a+ for append in new file
    for k, v in dictio.items():
        file.write(k+'\n'+v+'\n')
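
For reference, the resulting fasta simply alternates id and sequence lines; the id and sequence below are placeholders for illustration only:

# >ENSG00000XXXXXXX           <- gene_id_sno prefixed with '>'
# GUGCAAUGAUGAUUACA...CUGACA  <- snoRNA sequence with T replaced by U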
import pandas as pd

all_shap_path = snakemake.input.all_shap
output_path = snakemake.output.concat_shap
sno_per_confusion_value = snakemake.input.sno_per_confusion_value
conf_val = snakemake.wildcards.confusion_value
conf_val_pair = {'FN': 'TP', 'TP': 'FN', 'FP': 'TN', 'TN': 'FP'}  # to help select only real confusion value
                                                                # (i.e. those always predicted as such across iterations and models)
conf_val_df = pd.read_csv([path for path in sno_per_confusion_value if conf_val in path][0], sep='\t')
conf_val_pair_df = pd.read_csv([path for path in sno_per_confusion_value if conf_val_pair[conf_val] in path][0], sep='\t')

# Select only real confusion_value (ex: FN) (those always predicted as such across models and iterations)
real_conf_val = list(set(conf_val_df.gene_id_sno.to_list()) - set(conf_val_pair_df.gene_id_sno.to_list()))

dfs = []
for path in all_shap_path:
    iteration = path.split('/')[-1].split('_shap')[0].split('_')[-1]
    df = pd.read_csv(path, sep='\t')
    df['iteration'] = iteration
    df = df[df['gene_id_sno'].isin(real_conf_val)]
    dfs.append(df)

concat_df = pd.concat(dfs)
concat_df.to_csv(output_path, index=False, sep='\t')
import pandas as pd

""" Concat all iterations dfs (of feature rank per model) into one df."""

dfs = []
for i, df in enumerate(snakemake.input.dfs):
    temp_df = pd.read_csv(df, sep='\t')
    dfs.append(temp_df)

# Concat all dfs into one df
df_final = pd.concat(dfs)
df_final.to_csv(snakemake.output.concat_df, sep='\t', index=False)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score
import pickle

""" Return the confusion matrix associated to each model and their F1 score.
    Return also a df containing each snoRNA in the test set and their associated
    confusion matrix value (i.e. TN, FP, FN or TP)."""

# Generate the same CV, training and test sets (only the test set will be
# used in this script) that were generated in hyperparameter_tuning_cv and train_models
# (respectively 15%, 70% and 15% of all dataset examples)
df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next, total_train is split into train and test sets (1077 and 232 correspond
# to the number of examples in the train and test sets respectively, i.e.
# approximately 70% and 15% of all examples)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train,
                                    test_size=232, train_size=1077, random_state=42,
                                    stratify=y_total_train)


# Unpickle and thus instantiate the trained model defined by the 'models' wildcard
model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))

# Predict label (expressed (1) or not_expressed (0)) on test data and compare to y_test
y_pred = model.predict(X_test)

# Compute the confusion matrix (where T:True, F:False, P:Positive, N:Negative)
TN, FP, FN, TP = confusion_matrix(y_test, y_pred).ravel()

# Compute F1 score
f1 = f1_score(y_test, y_pred)

# Return confusion matrix with F1 score included
matrix_dict = {'true_negatives': TN, 'false_positives': FP,
                'false_negatives': FN, 'true_positives': TP, 'f1_score': f1}
matrix_df = pd.DataFrame(matrix_dict, index=[0])
matrix_df.to_csv(snakemake.output.confusion_matrix, sep='\t', index=False)

# Return snoRNAs and their confusion matrix value (TN, FP, FN and TP) as a df
y_test_df = pd.DataFrame(y_test)
y_test_df = y_test_df.reset_index()
y_pred_df = pd.DataFrame(y_pred)
y_pred_df.columns = ['predicted_label']

info_df = pd.concat([y_test_df, y_pred_df], axis=1)

info_df.loc[(info_df['label'] == 0) & (info_df['predicted_label'] == 0), 'confusion_matrix_val_' + snakemake.wildcards.models] = 'TN'
info_df.loc[(info_df['label'] == 0) & (info_df['predicted_label'] == 1), 'confusion_matrix_val_' + snakemake.wildcards.models] = 'FP'
info_df.loc[(info_df['label'] == 1) & (info_df['predicted_label'] == 0), 'confusion_matrix_val_' + snakemake.wildcards.models] = 'FN'
info_df.loc[(info_df['label'] == 1) & (info_df['predicted_label'] == 1), 'confusion_matrix_val_' + snakemake.wildcards.models] = 'TP'

info_df.to_csv(snakemake.output.info_df, sep='\t', index=False)
import pandas as pd
from sklearn.metrics import confusion_matrix, f1_score, roc_curve, accuracy_score
from sklearn.linear_model import LogisticRegression
import pickle
import numpy as np

""" Return the confusion matrix associated to each model and their F1 score.
    Return also a df containing each snoRNA in the test set and their associated
    confusion matrix value (i.e. TN, FP, FN or TP)."""

X_test = pd.read_csv(snakemake.input.X_test[0], sep='\t', index_col='gene_id_sno')
y_test = pd.read_csv(snakemake.input.y_test[0], sep='\t')
y_test.index = X_test.index  # set gene_id_sno as index

# Define class LogisticRegressionWithThreshold
class LogisticRegressionWithThreshold(LogisticRegression):
    def predict(self, X, threshold=None):
        if threshold is None:  # if no threshold is passed in, simply call the base class predict (effectively threshold=0.5)
            return LogisticRegression.predict(self, X)
        else:
            y_scores = LogisticRegression.predict_proba(self, X)[:, 1]
            y_pred_with_threshold = (y_scores >= threshold).astype(int)

            return y_pred_with_threshold

    def threshold_from_optimal_tpr_minus_fpr(self, X, y):
        # Find optimal log_reg threshold where we maximize the True positive rate (TPR) and minimize the False positive rate (FPR)
        y_scores = LogisticRegression.predict_proba(self, X)[:, 1]
        fpr, tpr, thresholds = roc_curve(y, y_scores)

        optimal_idx = np.argmax(tpr - fpr)
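        # tpr - fpr is Youden's J statistic; taking its argmax selects the ROC point
        # (and thus the probability threshold) that best separates the two classes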

        return thresholds[optimal_idx], tpr[optimal_idx] - fpr[optimal_idx]


# Unpickle and thus instantiate the trained model defined by the 'models' wildcard
model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))

# Find optimal threshold and predict using that threshold instead of 0.5
threshold, optimal_tpr_minus_fpr = model.threshold_from_optimal_tpr_minus_fpr(X_test, y_test)
print('Optimal threshold and tpr-fpr:')
print(threshold, optimal_tpr_minus_fpr)
y_pred = model.predict(X_test, threshold)
print('Accuracy:')
print(accuracy_score(y_test, y_pred))



# Compute the confusion matrix (where T:True, F:False, P:Positive, N:Negative)
TN, FP, FN, TP = confusion_matrix(y_test, y_pred).ravel()

# Compute F1 score
f1 = f1_score(y_test, y_pred)

# Return confusion matrix with F1 score included
matrix_dict = {'true_negatives': TN, 'false_positives': FP,
                'false_negatives': FN, 'true_positives': TP, 'f1_score': f1}
matrix_df = pd.DataFrame(matrix_dict, index=[0])
matrix_df.to_csv(snakemake.output.confusion_matrix, sep='\t', index=False)

# Return snoRNAs and their confusion matrix value (TN, FP, FN and TP) as a df
y_test_df = pd.DataFrame(y_test)
y_test_df = y_test_df.reset_index()
y_pred_df = pd.DataFrame(y_pred)
y_pred_df.columns = ['predicted_label']

info_df = pd.concat([y_test_df, y_pred_df], axis=1)

info_df.loc[(info_df['label'] == 0) & (info_df['predicted_label'] == 0), 'confusion_matrix_val_log_reg_thresh'] = 'TN'
info_df.loc[(info_df['label'] == 0) & (info_df['predicted_label'] == 1), 'confusion_matrix_val_log_reg_thresh'] = 'FP'
info_df.loc[(info_df['label'] == 1) & (info_df['predicted_label'] == 0), 'confusion_matrix_val_log_reg_thresh'] = 'FN'
info_df.loc[(info_df['label'] == 1) & (info_df['predicted_label'] == 1), 'confusion_matrix_val_log_reg_thresh'] = 'TP'

info_df.to_csv(snakemake.output.info_df, sep='\t')
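# A minimal sketch (toy probabilities, not pipeline outputs) of what predict(X, threshold)
# does internally once a tuned threshold is chosen:
#   y_scores = np.array([0.10, 0.35, 0.62, 0.90])  # hypothetical predict_proba values for the positive class
#   y_pred = (y_scores >= 0.35).astype(int)        # -> array([0, 1, 1, 1]) instead of using the default 0.5 cutoff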
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score
import pickle

""" Return the confusion matrix associated to each model and their F1 score.
    Return also a df containing each snoRNA in the test set and their associated
    confusion matrix value (i.e. TN, FP, FN or TP)."""

X_test = pd.read_csv(snakemake.input.X_test, sep='\t', index_col='gene_id_sno')
y_test = pd.read_csv(snakemake.input.y_test, sep='\t')
y_test.index = X_test.index  # set gene_id_sno as index

# Unpickle and thus instantiate the trained model defined by the 'models' wildcard
model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))

# Predict label (expressed (1) or not_expressed (0)) on test data and compare to y_test
y_pred = model.predict(X_test)

# Compute the confusion matrix (where T:True, F:False, P:Positive, N:Negative)
TN, FP, FN, TP = confusion_matrix(y_test, y_pred).ravel()

# Compute F1 score
f1 = f1_score(y_test, y_pred)

# Return confusion matrix with F1 score included
matrix_dict = {'true_negatives': TN, 'false_positives': FP,
                'false_negatives': FN, 'true_positives': TP, 'f1_score': f1}
matrix_df = pd.DataFrame(matrix_dict, index=[0])
matrix_df.to_csv(snakemake.output.confusion_matrix, sep='\t', index=False)

# Return snoRNAs and their confusion matrix value (TN, FP, FN and TP) as a df
y_test_df = pd.DataFrame(y_test)
y_test_df = y_test_df.reset_index()
y_pred_df = pd.DataFrame(y_pred)
y_pred_df.columns = ['predicted_label']

info_df = pd.concat([y_test_df, y_pred_df], axis=1)

info_df.loc[(info_df['label'] == 0) & (info_df['predicted_label'] == 0), 'confusion_matrix_val_' + snakemake.wildcards.models2] = 'TN'
info_df.loc[(info_df['label'] == 0) & (info_df['predicted_label'] == 1), 'confusion_matrix_val_' + snakemake.wildcards.models2] = 'FP'
info_df.loc[(info_df['label'] == 1) & (info_df['predicted_label'] == 0), 'confusion_matrix_val_' + snakemake.wildcards.models2] = 'FN'
info_df.loc[(info_df['label'] == 1) & (info_df['predicted_label'] == 1), 'confusion_matrix_val_' + snakemake.wildcards.models2] = 'TP'

info_df.to_csv(snakemake.output.info_df, sep='\t')
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score
import pickle

""" Return the confusion matrix associated to each model and their F1 score.
    Return also a df containing each snoRNA in the test set and their associated
    confusion matrix value (i.e. TN, FP, FN or TP)."""

# Generate the same CV, training and test sets (only the test set will be
# used in this script) that were generated in hyperparameter_tuning_cv and train_models
# (respectively 15%, 70% and 15% of all dataset examples)
df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next, total_train is split into train and test sets (1017 and 180 correspond
# to the number of examples in the train and test sets respectively, i.e.
# approximately 70% and 15% of all examples)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train,
                                    test_size=180, train_size=1017, random_state=42,
                                    stratify=y_total_train)


# Unpickle and thus instantiate the trained model defined by the 'models' wildcard
model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))

# Predict label (expressed (1) or not_expressed (0)) on test data and compare to y_test
y_pred = model.predict(X_test)

# Compute the confusion matrix (where T:True, F:False, P:Positive, N:Negative)
TN, FP, FN, TP = confusion_matrix(y_test, y_pred).ravel()

# Compute F1 score
f1 = f1_score(y_test, y_pred)

# Return confusion matrix with F1 score included
matrix_dict = {'true_negatives': TN, 'false_positives': FP,
                'false_negatives': FN, 'true_positives': TP, 'f1_score': f1}
matrix_df = pd.DataFrame(matrix_dict, index=[0])
matrix_df.to_csv(snakemake.output.confusion_matrix, sep='\t', index=False)

# Return snoRNAs and their confusion matrix value (TN, FP, FN and TP) as a df
y_test_df = pd.DataFrame(y_test)
y_test_df = y_test_df.reset_index()
y_pred_df = pd.DataFrame(y_pred)
y_pred_df.columns = ['predicted_label']

info_df = pd.concat([y_test_df, y_pred_df], axis=1)

info_df.loc[(info_df['label'] == 0) & (info_df['predicted_label'] == 0), 'confusion_matrix_val_' + snakemake.wildcards.models] = 'TN'
info_df.loc[(info_df['label'] == 0) & (info_df['predicted_label'] == 1), 'confusion_matrix_val_' + snakemake.wildcards.models] = 'FP'
info_df.loc[(info_df['label'] == 1) & (info_df['predicted_label'] == 0), 'confusion_matrix_val_' + snakemake.wildcards.models] = 'FN'
info_df.loc[(info_df['label'] == 1) & (info_df['predicted_label'] == 1), 'confusion_matrix_val_' + snakemake.wildcards.models] = 'TP'

info_df.to_csv(snakemake.output.info_df, sep='\t', index=False)
import pandas as pd

c_d_box_expressed = pd.read_csv(snakemake.input.c_d_box_location_expressed, sep='\t')
c_d_box_not_expressed = pd.read_csv(snakemake.input.c_d_box_location_not_expressed, sep='\t')
h_aca_box_expressed = pd.read_csv(snakemake.input.h_aca_box_location_expressed, sep='\t')
h_aca_box_not_expressed = pd.read_csv(snakemake.input.h_aca_box_location_not_expressed, sep='\t')

def table_to_fasta(df, col1, col2, output_path):
    """ Use 2 columns in table to create fasta (col1 being used for ID lines and
        col2 being sequence lines)."""
    d = dict(zip(df[col1], df[col2]))
    with open(output_path, 'w') as f:
        for sno_id, sequence in d.items():
            f.write(f'>{sno_id}\n')
            f.write(f'{sequence}\n')



table_to_fasta(c_d_box_expressed, 'gene_id', 'C_sequence', snakemake.output.c_expressed)
table_to_fasta(c_d_box_expressed, 'gene_id', 'D_sequence', snakemake.output.d_expressed)
table_to_fasta(c_d_box_not_expressed, 'gene_id', 'C_sequence', snakemake.output.c_not_expressed)
table_to_fasta(c_d_box_not_expressed, 'gene_id', 'D_sequence', snakemake.output.d_not_expressed)
table_to_fasta(c_d_box_expressed, 'gene_id', 'C_prime_sequence', snakemake.output.c_prime_expressed)
table_to_fasta(c_d_box_expressed, 'gene_id', 'D_prime_sequence', snakemake.output.d_prime_expressed)
table_to_fasta(c_d_box_not_expressed, 'gene_id', 'C_prime_sequence', snakemake.output.c_prime_not_expressed)
table_to_fasta(c_d_box_not_expressed, 'gene_id', 'D_prime_sequence', snakemake.output.d_prime_not_expressed)
table_to_fasta(h_aca_box_expressed, 'gene_id', 'H_sequence', snakemake.output.h_expressed)
table_to_fasta(h_aca_box_expressed, 'gene_id', 'ACA_sequence', snakemake.output.aca_expressed)
table_to_fasta(h_aca_box_not_expressed, 'gene_id', 'H_sequence', snakemake.output.h_not_expressed)
table_to_fasta(h_aca_box_not_expressed, 'gene_id', 'ACA_sequence', snakemake.output.aca_not_expressed)
import pandas as pd
from pybedtools import BedTool
import subprocess as sp

""" Determine the overlap between a bed file of all C/D snoRNAs and a bed
    of the binding of different C/D core proteins (PAR-CLIP data)."""

col = ['chr', 'start', 'end', 'gene_id', 'dot', 'strand', 'source', 'feature',
        'dot2', 'gene_info']
sno_bed = pd.read_csv(snakemake.input.sno_bed, sep='\t', names=col)  # generated with gtf_to_bed
nop58_a_bed = [path for path in snakemake.input.RBP_par_clip_beds if 'NOP58_repA' in path][0]
nop58_b_bed = [path for path in snakemake.input.RBP_par_clip_beds if 'NOP58_repB' in path][0]
nop56_bed = [path for path in snakemake.input.RBP_par_clip_beds if 'NOP56' in path][0]
fbl_bed = [path for path in snakemake.input.RBP_par_clip_beds if 'FBL_merge' in path][0]
fbl_mnase_bed = [path for path in snakemake.input.RBP_par_clip_beds if 'FBL_mnase' in path][0]
df = pd.read_csv(snakemake.input.df, sep='\t')

# Keep only C/D snoRNAs in sno_bed 
cd_bed = sno_bed[sno_bed['gene_id'].isin(list(df[df['sno_type'] == 'C/D'].gene_id_sno))]
cd_bed.to_csv('cd_temp.bed', index=False, header=False, sep='\t')
cd_bed = BedTool('cd_temp.bed')

# Merge the RBP PAR-CLIP bed file (remove redundancy in peaks)
# Merged peaks have at most 1 nt between them. Concat the 2 NOP58 replicates together (same for FBL)
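# (In bedtools merge terms: s=True merges only same-strand peaks, d=1 allows up to 1 nt
# between merged peaks, and c=[6,5,6] with o='distinct,sum,distinct' keeps the strand,
# sums the peak scores and repeats the strand, producing a stranded 6-column output)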
sp.call(f'cat {nop58_a_bed} {nop58_b_bed} | sort -k1,1 -k2,2n > nop58_temp_merge.bed', shell=True)
nop58_parclip_temp_bed = BedTool('nop58_temp_merge.bed')
nop58_parclip_bed = nop58_parclip_temp_bed.merge(s=True, d=1, c=[6,5,6], o='distinct,sum,distinct').saveas()

sp.call(f'cat {fbl_bed} {fbl_mnase_bed} | sort -k1,1 -k2,2n > fbl_temp_merge.bed', shell=True)
fbl_parclip_temp_bed = BedTool('fbl_temp_merge.bed')
fbl_parclip_bed = fbl_parclip_temp_bed.merge(s=True, d=1, c=[6,5,6], o='distinct,sum,distinct').saveas()

nop56_parclip_temp_bed = BedTool(nop56_bed)
nop56_parclip_bed = nop56_parclip_temp_bed.merge(s=True, d=1, c=[6,5,6], o='distinct,sum,distinct').saveas()

# Intersect the RBP peaks with C/D snoRNA bed (make sure that at least 50% of the RBP peak is overlapped by a given snoRNA)
intersection = cd_bed.intersect(nop58_parclip_bed, wa=True, s=True, wb=True, sorted=True, F=0.5).saveas(snakemake.output.overlap_sno_NOP58_par_clip)
intersection2 = cd_bed.intersect(fbl_parclip_bed, wa=True, s=True, wb=True, sorted=True, F=0.5).saveas(snakemake.output.overlap_sno_FBL_par_clip)
intersection3 = cd_bed.intersect(nop56_parclip_bed, wa=True, s=True, wb=True, sorted=True, F=0.5).saveas(snakemake.output.overlap_sno_NOP56_par_clip)

sp.call('rm cd_temp.bed nop58_temp_merge.bed fbl_temp_merge.bed', shell=True)
import pandas as pd
from pybedtools import BedTool
import subprocess as sp

""" Determine the overlap between a bed file of all H/ACA snoRNAs and a bed
    of the binding of DKC1 (eCLIP data)."""

col = ['chr', 'start', 'end', 'gene_id', 'dot', 'strand', 'source', 'feature',
        'dot2', 'gene_info']
sno_bed = pd.read_csv(snakemake.input.sno_bed, sep='\t', names=col)  # generated with gtf_to_bed
dkc1_hepg2, dkc1_hek293 = snakemake.input.dkc1_HepG2_eCLIP_bed, snakemake.input.dkc1_HEK293_par_clip_bed
df = pd.read_csv(snakemake.input.df, sep='\t')

# Keep only H/ACA snoRNAs in sno_bed (i.e. those that bind DKC1)
haca_bed = sno_bed[sno_bed['gene_id'].isin(list(df[df['sno_type'] == 'H/ACA'].gene_id_sno))]
haca_bed.to_csv('haca_temp.bed', index=False, header=False, sep='\t')
haca_bed = BedTool('haca_temp.bed')

# First, merge the DKC1 eCLIP bed file (remove redundancy in peaks)
# All peaks have a pVal < 0.01; merged peaks are at most 1 nt apart
dkc1_temp_bed = BedTool(dkc1_hepg2)
dkc1_bed = dkc1_temp_bed.merge(s=True, d=1, c=[5,7,6,8], o='distinct')

# Second, merge the DKC1 PAR-CLIP bed file (remove redundancy in peaks)
# All peaks have a pVal < 0.01; merged peaks are at most 1 nt apart
dkc1_parclip_temp_bed = BedTool(dkc1_hek293)
dkc1_parclip_bed = dkc1_parclip_temp_bed.merge(s=True, d=1, c=[6,5,6], o='distinct,sum,distinct').saveas()

# Intersect the DKC1 peaks with H/ACA snoRNA bed (make sure that at least 50% of the DKC1 peak is overlapped by a given snoRNA)
intersection = haca_bed.intersect(dkc1_bed, wa=True, s=True, wb=True, sorted=True, F=0.5).saveas(snakemake.output.overlap_sno_DKC1_eCLIP)
intersection2 = haca_bed.intersect(dkc1_parclip_bed, wa=True, s=True, wb=True, sorted=True, F=0.5).saveas(snakemake.output.overlap_sno_DKC1_par_clip)
sp.call('rm haca_temp.bed', shell=True)
import pandas as pd

sno_fasta = snakemake.input.sno_fasta
sno_info = pd.read_csv(snakemake.input.sno_info, sep='\t')
cd_output, haca_output = snakemake.output.cd_fasta, snakemake.output.haca_fasta

# Select either C/D or H/ACA box snoRNAs in fasta of all snoRNA sequences
cd_ids = sno_info[sno_info['snoRNA_type'] == 'C/D']['gene_id'].to_list()
haca_ids = sno_info[sno_info['snoRNA_type'] == 'H/ACA']['gene_id'].to_list()
cd_dict, haca_dict = {}, {}
with open(sno_fasta, 'r') as f:
    sno_id = ''
    for line in f:
        if line.startswith('>'):
            id = line.lstrip('>').rstrip('\n')
            sno_id = id
        else:
            seq = line.rstrip('\n')
            if sno_id in cd_ids:
                cd_dict[sno_id] = seq
            elif sno_id in haca_ids:
                haca_dict[sno_id] = seq

# Create fasta of C/D snoRNAs
with open(cd_output, 'w') as f:
    for sno_id, sequence in cd_dict.items():
        f.write(f'>{sno_id}\n')
        f.write(f'{sequence}\n')

# Create fasta of H/ACA snoRNAs
with open(haca_output, 'w') as f:
    for sno_id, sequence in haca_dict.items():
        f.write(f'>{sno_id}\n')
        f.write(f'{sequence}\n')
import pandas as pd

sno_fasta = snakemake.input.sno_fasta
snodb = pd.read_csv(snakemake.input.snodb, sep='\t')
cd_output, haca_output = snakemake.output.cd_fasta, snakemake.output.haca_fasta

# Select either C/D or H/ACA box snoRNAs in fasta of all snoRNA sequences
cd_ids = snodb[snodb['sno_type'] == 'C/D']['gene_id_sno'].to_list()
haca_ids = snodb[snodb['sno_type'] == 'H/ACA']['gene_id_sno'].to_list()
cd_dict, haca_dict = {}, {}
with open(sno_fasta, 'r') as f:
    sno_id = ''
    for line in f:
        if line.startswith('>'):
            id = line.lstrip('>').rstrip('\n')
            sno_id = id
        else:
            seq = line.rstrip('\n')
            if sno_id in cd_ids:
                cd_dict[sno_id] = seq
            elif sno_id in haca_ids:
                haca_dict[sno_id] = seq

# Create fasta of C/D snoRNAs
with open(cd_output, 'w') as f:
    for sno_id, sequence in cd_dict.items():
        f.write(f'>{sno_id}\n')
        f.write(f'{sequence}\n')

# Create fasta of H/ACA snoRNAs
with open(haca_output, 'w') as f:
    for sno_id, sequence in haca_dict.items():
        f.write(f'>{sno_id}\n')
        f.write(f'{sequence}\n')



def generate_df_prime(fasta):
    """ From a fasta of snoRNA sequences, find a given motif (C' or D') using
        predefined function find_c_prime_d_prime_hamming and output the motif sequence,
        start and end as a df."""
    # Get motif, start and end position inside dict
    box_dict = {}
    with open(fasta, 'r') as f:
        sno_id = ''
        for line in f:
            if line.startswith('>'):
                id = line.lstrip('>').rstrip('\n')
                sno_id = id
            else:
                seq = line.rstrip('\n')
                c_prime_motif, c_prime_start, c_prime_end, d_prime_motif, d_prime_start, d_prime_end = find_c_prime_d_prime_hamming(seq)
                box_dict[sno_id] = [c_prime_motif, c_prime_start,
                                    c_prime_end, d_prime_motif, d_prime_start,
                                    d_prime_end]

    # Create dataframe from box_dict
    box = pd.DataFrame.from_dict(box_dict, orient='index',
                                columns=['C_prime_sequence', 'C_prime_start',
                                        'C_prime_end', 'D_prime_sequence',
                                        'D_prime_start', 'D_prime_end'])
    box = box.reset_index()
    box = box.rename(columns={"index": "gene_id"})
    return box
import pandas as pd
import itertools
""" Split fastas file of expressed or not expressed C/D box snoRNAs according
    to their length (small or long, i.e < or >= 200 nt). """

df = pd.read_csv(snakemake.input.all_features_labels_df, sep='\t')
df = df[df['sno_type'] == 'C/D']
sno_fasta = snakemake.input.sno_fasta
outputs = [snakemake.output.small_expressed_cd, snakemake.output.long_expressed_cd,
            snakemake.output.small_not_expressed_cd, snakemake.output.long_not_expressed_cd]

# Get gene_id of C/D snoRNAs per length and abundance_status
small_expressed_cd = df[(df['sno_length'] < 200) & (df['abundance_cutoff_2'] == 'expressed')].gene_id_sno.to_list()
long_expressed_cd = df[(df['sno_length'] >= 200) & (df['abundance_cutoff_2'] == 'expressed')].gene_id_sno.to_list()
small_not_expressed_cd = df[(df['sno_length'] < 200) & (df['abundance_cutoff_2'] == 'not_expressed')].gene_id_sno.to_list()
long_not_expressed_cd = df[(df['sno_length'] >= 200) & (df['abundance_cutoff_2'] == 'not_expressed')].gene_id_sno.to_list()

# Create dict from sno_fasta as key: val --> id: sequence
d = {}
with open(sno_fasta, 'r') as f:
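    # zip_longest(*[f]*2) walks the file two lines at a time (header line, sequence line),
    # which assumes the single-line-per-sequence fasta format produced upstream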
    for fasta_id, sequence in itertools.zip_longest(*[f]*2):
            fasta_id = fasta_id.lstrip('>').rstrip('\n')
            sequence = sequence.rstrip('\n')
            d[fasta_id] = sequence

# Get sequence for all snoRNA groups in their respective output file
for i, group in enumerate([small_expressed_cd, long_expressed_cd, small_not_expressed_cd, long_not_expressed_cd]):
    with open(outputs[i], 'w') as f:
        for j, id in enumerate(group):
            seq = d[id]
            f.write(f'>{id}\n')
            f.write(f'{seq}\n')
import pandas as pd
import itertools
""" Extract specific snoRNA sequences (per sno_type and abundance_status) from
    the fasta file of all snoRNA sequences."""

df = pd.read_csv(snakemake.input.all_features_labels_df, sep='\t')
sno_fasta = snakemake.input.sno_fasta
outputs = [snakemake.output.expressed_cd, snakemake.output.expressed_haca,
            snakemake.output.not_expressed_cd, snakemake.output.not_expressed_haca]

# Get gene_id of snoRNAs per sno_type and abundance_status
expressed_cd = df[(df['sno_type'] == 'C/D') & (df['abundance_cutoff_2'] == 'expressed')].gene_id_sno.to_list()
expressed_haca = df[(df['sno_type'] == 'H/ACA') & (df['abundance_cutoff_2'] == 'expressed')].gene_id_sno.to_list()
not_expressed_cd = df[(df['sno_type'] == 'C/D') & (df['abundance_cutoff_2'] == 'not_expressed')].gene_id_sno.to_list()
not_expressed_haca = df[(df['sno_type'] == 'H/ACA') & (df['abundance_cutoff_2'] == 'not_expressed')].gene_id_sno.to_list()

# Create dict from sno_fasta as key: val --> id: sequence
d = {}
with open(sno_fasta, 'r') as f:
    for fasta_id, sequence in itertools.zip_longest(*[f]*2):
            fasta_id = fasta_id.lstrip('>').rstrip('\n')
            sequence = sequence.rstrip('\n')
            d[fasta_id] = sequence

# Get sequence for all snoRNA groups in their respective output file
for i, group in enumerate([expressed_cd, expressed_haca, not_expressed_cd, not_expressed_haca]):
    with open(outputs[i], 'w') as f:
        for j, id in enumerate(group):
            seq = d[id]
            f.write(f'>{id}\n')
            f.write(f'{seq}\n')
import pandas as pd
from pybedtools import BedTool
import subprocess as sp

""" Get sno sequences from snoRNA bed file and genome fasta."""

bed = BedTool(snakemake.input.sno_bed)
genome = snakemake.input.genome
if 'Mus_musculus' in genome:
    temp_output = 'mouse_temp_output.fa'
else:
    species = snakemake.wildcards.species
    temp_output = f'{species}_temp_output.fa'
output = snakemake.output.sno_fasta

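# bedtools getfasta via pybedtools: s=True extracts the reverse complement for minus-strand
# snoRNAs and nameOnly=True uses only the BED name field (gene_id) as the fasta header;
# the '(+)'/'(-)' suffix appended to the header is stripped by the sed command below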
fasta = bed.sequence(fi=genome, nameOnly=True, s=True)
with open(fasta.seqfn, 'r') as fasta_file, open(temp_output, 'w') as output_file:
    for line in fasta_file:
        if '>' not in line:
            line = line.replace('T', 'U')
        output_file.write(line)

# Remove strand info from fasta ids
sp.call(f"sed -i 's/(+)//g; s/(-)//g' {temp_output} && mv {temp_output} {output}", shell=True)
import pandas as pd
import collections as coll

""" Define an abundance cutoff for host genes that will be used as a feature
    (>1 TPM in at least one average condition). The samples used to quantify HG
    abundance are from Shen et al. Nature 2012. They are mouse RNA-Seq
    experiments on stem cells, embryos and adult tissues. We select only the
    mESC and embryonic brain to use as the samples to determine the
    abundance_cutoff_host threshold. These samples were processed using the
    Recount3 pipeline. """

host_df = pd.read_csv(snakemake.input.host_df, sep='\t')
sno_tpm_df = pd.read_csv(snakemake.input.sno_tpm_df, sep='\t')
recount_tpm_df = pd.read_csv(snakemake.input.recount_tpm_df)
srr_id_dict = snakemake.params.srr_id_conversion

# Add host id column to sno_tpm_df
host_dict = dict(zip(host_df.gene_id_sno, host_df.host_id))
sno_tpm_df['host_id'] = sno_tpm_df['gene_id'].map(host_dict)

# Recount tpm df processing
recount_tpm_df[['gene_id', 'version']] = recount_tpm_df['Unnamed: 0'].str.split('.', expand=True)
recount_tpm_df = recount_tpm_df.drop(['version', 'Unnamed: 0'], axis=1)
recount_tpm_df = recount_tpm_df.rename(columns=srr_id_dict)

# Select only host genes from recount total tpm_df
recount_tpm_df = recount_tpm_df[recount_tpm_df['gene_id'].isin(list(host_df['host_id']))].reset_index(drop=True)

# Select only mESC (embryonic stem cells) and brain of mouse embryo at age E14.5
# These samples are comparable to the TGIRT-Seq samples quantifying snoRNAs in mESC and in mESC treated with retinoic acid (differentiate into neurons)
recount_tpm_df = recount_tpm_df[['gene_id', 'mESC_1', 'mESC_2', 'E14_5_brain_1', 'E14_5_brain_2']]

# If host gene is expressed >1 TPM in at least one average condition (average of the duplicates), it is expressed, else not expressed (or no host gene at all)
hg_abundance = recount_tpm_df.filter(regex='_[123]$').reset_index(drop=True)  # tpm column must end with '_1, _2 or _3'
cols_hg = list(hg_abundance.columns)
duplicates_hg = [cols_hg[n:n+2] for n in range(0, len(cols_hg), 2)]  # group the TPM columns into pairs of duplicates

recount_tpm_df['abundance_cutoff_host'] = ''
for i in range(0, len(recount_tpm_df)):
    row = hg_abundance.iloc[i]
    for j, duplicate in enumerate(duplicates_hg):
        if recount_tpm_df.loc[i, duplicate].mean() > 1:
            recount_tpm_df.loc[i, 'abundance_cutoff_host'] = 'host_expressed'
            break
recount_tpm_df['abundance_cutoff_host'] = recount_tpm_df['abundance_cutoff_host'].replace('', 'host_not_expressed')

# if HG was not quantified in recount tpm_df, consider it as not expressed
no_tpm_HG = [recount_tpm_df]
for host_id in host_df.host_id:
    if host_id in list(recount_tpm_df.gene_id):
        continue
    else:
        vals = [[host_id, 0, 0, 0, 0, 'host_not_expressed']]
        temp_df = pd.DataFrame(vals, columns=['gene_id', 'mESC_1', 'mESC_2', 'E14_5_brain_1', 'E14_5_brain_2', 'abundance_cutoff_host'])
        no_tpm_HG.append(temp_df)
recount_tpm_df = pd.concat(no_tpm_HG)
recount_tpm_df = recount_tpm_df.rename(columns={'gene_id': 'host_id'})

# Merge abundance_cutoff_host information to sno_tpm df and fill NA for intergenic snoRNAs
final_df = sno_tpm_df.merge(recount_tpm_df[['host_id', 'abundance_cutoff_host']], how='left', on='host_id')
final_df['abundance_cutoff_host'] = final_df['abundance_cutoff_host'].fillna('intergenic')

final_df.to_csv(snakemake.output.HG_abundance_df, index=False, sep='\t')
import pandas as pd
from pybedtools import BedTool
import subprocess as sp
import re

sno_bed_path = snakemake.input.sno_bed
gtf_bed_path = snakemake.input.formatted_gtf_bed
sno_bed = BedTool(sno_bed_path)

# Intersect sno_bed with bed of genes from gtf to find if any are a snoRNA host gene
# Intersect must be on the same strand (s=True) and must fully include the snoRNA (f=1)
intersection = sno_bed.intersect(gtf_bed_path, s=True, f=1, wb=True, sorted=True).saveas('temp_snoRNA_HG.bed')

# Load intersect_df
cols = ['chr_sno', 'start_sno', 'end_sno', 'gene_id_sno', 'dot', 'strand_sno',
        'source_sno', 'feature_sno', 'dot2', 'attributes_sno', 'chr_host',
        'start_host', 'end_host', 'host_id', 'dot3', 'strand_host', 'source_host',
        'feature_host', 'dot4', 'attributes_host']
intersect_df = pd.read_csv('temp_snoRNA_HG.bed', sep='\t', names=cols, header=None)



# Retrieve host name and biotype from the attributes_host column
attribute_dict = dict(zip(intersect_df.host_id, intersect_df.attributes_host.str.split('; ')))
host_name, host_biotype = {}, {}
for k, v in attribute_dict.items():
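    # each attribute string looks roughly like 'gene_name "Snhg1"'; strip the quotes and
    # semicolon, then keep the value after the first space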
    name = [item.replace('"', '').replace(';', '').split(' ', maxsplit=1)[1] for item in v if 'gene_name' in item][0]
    biotype = [item.replace('"', '').replace(';', '').split(' ', maxsplit=1)[1] for item in v if 'gene_biotype' in item][0]
    host_name[k] = name
    host_biotype[k] = biotype

intersect_df['host_biotype'] = intersect_df['host_id'].map(host_biotype)
intersect_df['host_name'] = intersect_df['host_id'].map(host_name)

# Keep only relevant columns
intersect_df = intersect_df.drop(columns=['dot', 'source_sno', 'feature_sno', 'dot2',
                                'attributes_sno', 'chr_host', 'dot3', 'strand_host',
                                'source_host', 'feature_host', 'dot4', 'attributes_host'])

dfs = []
for i, group in intersect_df.groupby('gene_id_sno'):
    group['host_length'] = group['end_host'] - group['start_host'] + 1
    group = group.reset_index(drop=True)
    cols_ = group.columns
    if len(group) > 1:
        if len(group[group['host_name'].str.contains('Snhg|SNHG')]) == 1: # if only one SNHG gene is present in the potential HG, define as HG
            temp_df = group[group['host_name'].str.contains('Snhg|SNHG')]
            dfs.append(temp_df)
        elif len(group[group['host_name'].str.contains('Snhg|SNHG')]) > 1: # if multiple SNHG genes are present in the potential HG, define the shortest as HG
            temp_df = group[group['host_name'].str.contains('Snhg|SNHG')]
            temp_df = temp_df.iloc[temp_df['host_length'].idxmin()].reset_index(drop=True).to_frame()
            temp_df = temp_df.T
            temp_df.columns = cols_
            dfs.append(temp_df)
        else:
            temp_df = group.iloc[group['host_length'].idxmin()].reset_index(drop=True).to_frame()  # select the shortest potential HG
            temp_df = temp_df.T
            temp_df.columns = cols_
            dfs.append(temp_df)
    else:
        dfs.append(group)

# Concat dfs together
final_host_df = pd.concat(dfs)
final_host_df.to_csv(snakemake.output.mouse_snoRNA_HG, sep='\t', index=False)

sp.call('rm temp_snoRNA_HG.bed', shell=True)
import pandas as pd
import collections as coll

""" Add an abundance cutoff column for snoRNAs to define them as expressed or not
    expressed in mouse. This column will serve as the label used by the predictor.
    Also find the snoRNA length. """

df = pd.read_csv(snakemake.input.sno_tpm_df, sep='\t')
sno_bed = pd.read_csv(snakemake.input.sno_bed, sep='\t',
                    names=['chr', 'start', 'end', 'gene_id', 'dot',
                            'strand', 'feature', 'dot2', 'attributes'], index_col=False)

# Compute the snoRNA length
sno_bed['sno_length'] = sno_bed['end'].astype(int) - sno_bed['start'].astype(int) + 1
sno_bed = sno_bed[['gene_id', 'sno_length']]
df = df.merge(sno_bed, how='left', on='gene_id')


# If snoRNA is expressed >1 TPM in at least one average sample (average of the triplicates), it is expressed, else not expressed
sno_abundance = df.filter(regex='_[123]$')  # tpm column must end with '_1, _2 or _3'
cols = list(sno_abundance.columns)
triplicates = [cols[n:n+3] for n in range(0, len(cols), 3)]

df['abundance_cutoff'] = ''
for i in range(0, len(df)):
    row = sno_abundance.iloc[i]
    for j, triplicate in enumerate(triplicates):
        if df.loc[i, triplicate].mean() > 1:
            df.loc[i, 'abundance_cutoff'] = 'expressed'
            break
df['abundance_cutoff'] = df['abundance_cutoff'].replace('', 'not_expressed')

print('Abundance cutoff based on >1 TPM in at least one average condition:')
print(coll.Counter(df['abundance_cutoff']))


df.to_csv(snakemake.output.tpm_label_df, index=False, sep='\t')
import pandas as pd
import collections as coll
import re
import regex
from math import isnan

""" Find snoRNA type (C/D vs H/ACA) of mouse snoRNAs """

def cut_sequence(seq):
    """ Get the 20 first and 20 last nt of a given sequence."""
    first, last = seq[:20], seq[-20:]
    length = len(seq)
    return first, last, length


def find_c_box(seq):
    """ Find exact C box (RUGAUGA, where R is A or G), if not present, find C
        box with 1,2 or max 3 substitutions. Return also the start and end
        position of that box as 1-based values. If no C box is found, return a
        'NNNNNNN' empty C box and 0 as start and end of C box."""
    first_20, last_20, length_seq = cut_sequence(seq)
    len_c_box = 7
    # First, find exact C box (RUGAUGA) within 20 nt of the snoRNA 5' end
    if re.search('(A|G)UGAUGA', first_20) is not None:  # find exact C box
        i = 1
        for possible_c in re.finditer('(A|G)UGAUGA', first_20):
            if i <= 1:  # select first matched group only (closest RUGAUGA to 5' end of snoRNA)
                c_motif = possible_c.group(0)
                c_start = possible_c.start() + 1
                c_end = possible_c.end()
                i += 1
                return c_motif, c_start, c_end  # this exits the global if statement
    else:  # find non-exact C box (up to 3 substitutions allowed)
        for sub in range(1, int((len_c_box-1)/2 + 1)):  # iterate over 1 to 3 substitutions allowed
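            # the regex module's fuzzy matching syntax {s<=n} allows up to n substitutions
            # in the matched motif (overlapped=True also reports overlapping matches)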
            c_motif = regex.findall("((A|G)UGAUGA){s<="+str(sub)+"}", first_20, overlapped=True)
            if len(c_motif) >= 1:  # if we have a match, break and keep that match (1 sub privileged over 2 subs)
                c_motif = c_motif[0][0]  # if multiple C boxes found, keep the C box closest to the 5' end
                c_start = first_20.find(c_motif) + 1
                c_end = c_start + len(c_motif) - 1
                return c_motif, c_start, c_end  # this exits the global else statement
        # If no C box is found, return NNNNNNN and 0, 0 as C box sequence, start and end
        c_motif, c_start, c_end = 'NNNNNNN', 0, 0
        return c_motif, c_start, c_end


def find_d_box(seq):
    """ Find exact D box (CUGA), if not present, find D box with 1 or max 2
        substitutions. Return also the start and end position of that box as
        1-based values. If no D box is found, return a 'NNNN' empty D box and 0
        as start and end of D box."""
    first_20, last_20, length_seq = cut_sequence(seq)
    len_d_box = 4
    # First, find exact D box (CUGA) within 20 nt of the snoRNA 3' end
    if re.search('CUGA', last_20) is not None:  # find exact D box
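        # extended unpacking (*_, last) keeps only the last match from finditer,
        # i.e. the exact CUGA closest to the snoRNA 3' end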
        *_, last_possible_d = re.finditer('CUGA', last_20)
        d_motif = last_possible_d.group(0)  # if multiple exact D boxes found, keep the D box closest to 3' end
        d_start = (length_seq - 20) + last_possible_d.start() + 1
        d_end = (length_seq - 20) + last_possible_d.end()
        return d_motif, d_start, d_end
    else:  # find non-exact D box (up to 50% substitutions allowed, i.e. 2 nt)
        for sub in range(1, int(len_d_box/2 + 1)):  # iterate over 1 to 2 substitutions allowed
            d_motif = regex.findall("(CUGA){s<="+str(sub)+"}", last_20, overlapped=True)
            if len(d_motif) >= 1:  # if we have a match, break and keep that match (1 sub privileged over 2 subs)
                d_motif = d_motif[-1]  # if multiple D boxes found, keep the D box closest to the 3' end
                d_start = (length_seq - 20) + last_20.rindex(d_motif) + 1
                d_end = d_start + len(d_motif) - 1
                return d_motif, d_start, d_end  # this exits the global else statement
        # If no D box is found, return NNNN and 0, 0 as D box sequence, start and end
        d_motif, d_start, d_end = 'NNNN', 0, 0
        return d_motif, d_start, d_end

def find_aca(line):
    """ Find the most downstream ACA motif in the last 10 nt of H/ACA box
        snoRNAs."""
    last_10 = line[-10:]
    length_seq = len(line)
    if re.search('ACA', last_10) is not None:  # find exact ACA box
        *_, last_possible_aca = re.finditer('ACA', last_10)
        aca_motif = last_possible_aca.group(0)  # if multiple exact ACA boxes found, keep the ACA box closest to 3' end
        aca_start = (length_seq - 10) + last_possible_aca.start() + 1  # 1-based position
        aca_end = (length_seq - 10) + last_possible_aca.end()  # 1-based
    else:  # if no ACA is found
        aca_motif, aca_start, aca_end = 'NNN', 0, 0

    return aca_motif, aca_start, aca_end



# Loading file and dfs
sno_fasta = snakemake.input.sno_fasta
tpm_df = pd.read_csv(snakemake.input.tpm_df, sep='\t')
rna_central_df = pd.read_csv(snakemake.input.rna_central_df, sep='\t')
id_conversion_df = pd.read_csv(snakemake.input.id_conversion_df, sep='\t',
                        names=['rna_central_id', 'source', 'ensembl_transcript_id',
                                'taxid', 'biotype', 'ensembl_gene_id'])
mouse_gtf = pd.read_csv(snakemake.input.gtf, sep='\t', skiprows=5,
                        names=['chr', 'source', 'feature', 'start', 'end', 'dot',
                                'strand', 'dot2', 'attributes'])
mouse_gtf = mouse_gtf[mouse_gtf['feature'] == 'gene']

# Remove ".number" a the end of ensembl gene id
id_conversion_df[['ensembl_gene_id', 'suffix']] = id_conversion_df['ensembl_gene_id'].str.split('.', expand=True)

# Select only snoRNAs in the ensembl gtf
sno_only = mouse_gtf[mouse_gtf['attributes'].str.contains('gene_biotype "snoRNA"')]
sno_attributes = list(sno_only['attributes'])
sno_ids = [attribute.split(';') for attribute in sno_attributes]
sno_ids = [item for sublist in sno_ids for item in sublist]  # flatten list of list into one list
sno_ids = [attr.split(' "')[1].strip('"') for attr in sno_ids if 'gene_id' in attr]  # remove 'gene_id 'and '"'
sno_tpm_df = tpm_df[tpm_df['gene_id'].isin(sno_ids)]

# Dict of corresponding ensembl and rnacentral ids
id_mapping = dict(zip(id_conversion_df.ensembl_gene_id, id_conversion_df.rna_central_id))

# Find if sno_type is known for mouse snoRNAs in RNAcentral description
rna_central_df.loc[rna_central_df['description'].str.contains('C/D|SNORD|Snord|U3'), 'snoRNA_type'] = 'C/D'
rna_central_df.loc[rna_central_df['description'].str.contains('H/ACA|SNORA|Snora|ACA'), 'snoRNA_type'] = 'H/ACA'
sno_type_dict = dict(zip(rna_central_df.upi, rna_central_df.snoRNA_type))  # where upi is the column of RNA central id

# Create rna_central_id and snoRNA_type columns
sno_tpm_df['rna_central_id'] = sno_tpm_df['gene_id'].map(id_mapping)
sno_tpm_df['snoRNA_type'] = sno_tpm_df['rna_central_id'].map(sno_type_dict)

# For snoRNAs without snoRNA type from RNAcentral description, find if their gene_name contains this information directly
sno_tpm_df.loc[(sno_tpm_df['snoRNA_type'].isna()) & (sno_tpm_df['gene_name'].str.contains('Snord|SNORD|U8|U3')), 'snoRNA_type'] = 'C/D'
sno_tpm_df.loc[(sno_tpm_df['snoRNA_type'].isna()) & (sno_tpm_df['gene_name'].str.contains('Snora|SNORA')), 'snoRNA_type'] = 'H/ACA'

# Drop all snoRNAs where we couldn't find the snoRNA type (158 snoRNAs)
sno_tpm_df = sno_tpm_df.dropna(subset=['snoRNA_type'])

# For the snoRNAs without names telling if they are C/D or H/ACA, infer from their sequence if they have the canonical box motifs
'''
no_snoRNA_type_ids = list(sno_tpm_df[~sno_tpm_df['snoRNA_type'].isin(['C/D', 'H/ACA'])].gene_id)

# Create dictionary of ensembl_id: RNA sequence
seq_dict = {}
with open(sno_fasta, 'r') as f:
    temp_id = ''
    for line in f:
        if line.startswith('>'):
            id = line.strip('\n').strip('>')
            temp_id = id
        else:
            sno_sequence = line.strip('\n').replace('T', 'U')
            seq_dict[temp_id] = sno_sequence

# First, search for C and D boxes; if not present, search for ACA box
sno_type_search_dict = dict(zip(sno_tpm_df.gene_id, sno_tpm_df.snoRNA_type))
sno_type_search_dict = {k: sno_type_search_dict[k] for k in sno_type_search_dict if sno_type_search_dict[k] != 'NaN'}  # remove sno that have NaN as snoRNA type
other = 0
for sno_id in no_snoRNA_type_ids:
    snoRNA_seq = seq_dict[sno_id]
    c_motif, c_start, c_end = find_c_box(snoRNA_seq)
    d_motif, d_start, d_end = find_d_box(snoRNA_seq)
    if (c_motif == "NNNNNNN") | (d_motif == "NNNN"):  # if we don't find either a C or D box
        aca_motif, aca_start, aca_end = find_aca(snoRNA_seq)
        if aca_motif == "ACA":  # this is a ACA
            sno_type_search_dict[sno_id] = "H/ACA"
        else:
            other += 1
            print(f'{sno_id} is of unknown snoRNA type')
    else: # this is a C/D snoRNA
        sno_type_search_dict[sno_id] = "C/D"
print(f'{other} snoRNAs are of unknown snoRNA type. They are thereby excluded of downstream analyses.')

# Add the snoRNA type to these snoRNAs in sno_tpm_df and exclude the remaining snoRNAs that do not have a snoRNA type (only 25 snoRNAs for Mus musculus)
sno_tpm_df['snoRNA_type'] = sno_tpm_df['gene_id'].map(sno_type_search_dict)
sno_tpm_df = sno_tpm_df.dropna(subset=['snoRNA_type'])
len_sno_tpm_df = len(sno_tpm_df)
print(f'The snoRNA type (C/D or H/ACA) was found for {len_sno_tpm_df} snoRNAs. These snoRNAs are included in downstream analyses.')
'''
sno_tpm_df.to_csv(snakemake.output.snoRNA_type_df, sep='\t', index=False)
import pandas as pd
import os

""" Define an abundance cutoff for host genes that will be used as a feature
    (>1 TPM in at least one average condition). The samples used to quantify HG
    abundance are from the Bgee database of normal animal samples (Bastian et
    al. 2021). """

host_df = pd.read_csv(snakemake.input.host_df, sep='\t')
sno_df = pd.read_csv(snakemake.input.sno_df, sep='\t')
tpm_df_dir = snakemake.input.tpm_df_dir

# Add host id column to sno_df
host_dict = dict(zip(host_df.gene_id_sno, host_df.host_id))
sno_df['host_id'] = sno_df['gene_id'].map(host_dict)


# Load all TPM dfs and filter to keep only the gene_id, condition and TPM value of host genes
dfs = []
host_ids = list(pd.unique(host_df.host_id))
for file in os.listdir(tpm_df_dir):
    if file.endswith('.tsv'):
        df = pd.read_csv(f'{tpm_df_dir}/{file}', sep='\t')
        df = df[['Gene ID', 'Anatomical entity name', 'TPM']]
        df.columns = ['gene_id', 'condition', 'TPM']
        df = df[df['gene_id'].isin(host_ids)]
        dfs.append(df)

# Concat all dfs and groupby gene_id and condition; return average TPM per gene_id and condition
concat_df = pd.concat(dfs)
grouped_df = concat_df.groupby(['gene_id', 'condition'])['TPM'].mean()
grouped_df = grouped_df.reset_index()

# If host gene is expressed >1 TPM in at least one average condition (average of the replicates), it is expressed,
# else not expressed
pivot_df = grouped_df.pivot_table(index='gene_id', columns='condition', values='TPM')
pivot_df.loc[(pivot_df.iloc[:, :] > 1).any(axis=1), 'abundance_cutoff_host'] = 'host_expressed'
pivot_df['abundance_cutoff_host'] = pivot_df.abundance_cutoff_host.fillna('host_not_expressed')
pivot_df = pivot_df.reset_index()

# Merge pivot_df to host_df (if host not present/quantified in Bgee datasets, consider it as host_not_expressed)
host_df = host_df[['host_id', 'host_name', 'host_biotype']]
host_merge_df = host_df.drop_duplicates(subset='host_id').merge(pivot_df, how='left', left_on='host_id', right_on='gene_id')
host_merge_df['abundance_cutoff_host'] = host_merge_df['abundance_cutoff_host'].fillna('host_not_expressed')
host_merge_df = host_merge_df.drop(columns='gene_id')

# Merge host_abundance_cutoff info to sno_df
final_df = sno_df.merge(host_merge_df, how='left', on='host_id')
final_df['abundance_cutoff_host'] = final_df['abundance_cutoff_host'].fillna('intergenic')
temp_df = final_df.pop('abundance_cutoff_host')
final_df.insert(4, temp_df.name, temp_df)  # move abundance_cutoff_host column to the fifth position in df

final_df.to_csv(snakemake.output.HG_abundance_df, index=False, sep='\t')
import pandas as pd
from pybedtools import BedTool
import subprocess as sp
import re

species = snakemake.wildcards.species
sno_bed_path = snakemake.input.sno_bed
gtf_bed_path = snakemake.input.formatted_gtf_bed
sno_bed = BedTool(sno_bed_path)

# Intersect sno_bed with bed of genes from gtf to find if any are a snoRNA host gene
# Intersect must be on the same strand (s=True) and must fully include the snoRNA (f=1)
intersection = sno_bed.intersect(gtf_bed_path, s=True, f=1, wb=True, sorted=True).saveas(f'{species}_temp_snoRNA_HG.bed')

# Load intersect_df
cols = ['chr_sno', 'start_sno', 'end_sno', 'gene_id_sno', 'dot', 'strand_sno',
        'source_sno', 'feature_sno', 'dot2', 'attributes_sno', 'chr_host',
        'start_host', 'end_host', 'host_id', 'dot3', 'strand_host', 'source_host',
        'feature_host', 'dot4', 'attributes_host']
intersect_df = pd.read_csv(f'{species}_temp_snoRNA_HG.bed', sep='\t', names=cols, header=None)


# Retrieve host name and biotype from the attributes_host column
attribute_dict = dict(zip(intersect_df.host_id, intersect_df.attributes_host.str.split('; ')))
host_name, host_biotype = {}, {}
for k, v in attribute_dict.items():
    if all('gene_name' not in att for att in v):  # if attribute 'gene_name' is missing in all attributes
        gene_id = [attribute for attribute in v if 'gene_id' in attribute][0]
        fake_gene_name = gene_id.replace('gene_id', 'gene_name') # create a fake gene name which will be the gene_id
        v = v + [fake_gene_name]
    name = [item.replace('"', '').replace(';', '').split(' ', maxsplit=1)[1] for item in v if 'gene_name' in item][0]
    biotype = [item.replace('"', '').replace(';', '').split(' ', maxsplit=1)[1] for item in v if 'gene_biotype' in item][0]
    host_name[k] = name
    host_biotype[k] = biotype

intersect_df['host_biotype'] = intersect_df['host_id'].map(host_biotype)
intersect_df['host_name'] = intersect_df['host_id'].map(host_name)

# Keep only relevant columns
intersect_df = intersect_df.drop(columns=['dot', 'source_sno', 'feature_sno', 'dot2',
                                'attributes_sno', 'chr_host', 'dot3', 'strand_host',
                                'source_host', 'feature_host', 'dot4', 'attributes_host'])

dfs = []
for i, group in intersect_df.groupby('gene_id_sno'):
    group['host_length'] = group['end_host'] - group['start_host'] + 1
    group = group.reset_index(drop=True)
    cols_ = group.columns
    if len(group) > 1:
        if len(group[group['host_name'].str.contains('Snhg|SNHG')]) == 1: # if only one SNHG gene is present in the potential HG, define as HG
            temp_df = group[group['host_name'].str.contains('Snhg|SNHG')]
            dfs.append(temp_df)
        elif len(group[group['host_name'].str.contains('Snhg|SNHG')]) > 1: # if multiple SNHG genes are present in the potential HG, define the shortest as HG
            temp_df = group[group['host_name'].str.contains('Snhg|SNHG')]
            temp_df = temp_df.iloc[temp_df['host_length'].idxmin()].reset_index(drop=True).to_frame()
            temp_df = temp_df.T
            temp_df.columns = cols_
            dfs.append(temp_df)
        else:
            temp_df = group.iloc[group['host_length'].idxmin()].reset_index(drop=True).to_frame()  # select the shortest potential HG
            temp_df = temp_df.T
            temp_df.columns = cols_
            dfs.append(temp_df)
    else:
        dfs.append(group)

# Concat dfs together
final_host_df = pd.concat(dfs)
final_host_df.to_csv(snakemake.output.species_snoRNA_HG, sep='\t', index=False)

sp.call(f'rm {species}_temp_snoRNA_HG.bed', shell=True)
import pandas as pd

""" Find snoRNA type (C/D vs H/ACA) of species snoRNAs """

# Loading file and dfs
rna_central_df = pd.read_csv(snakemake.input.rna_central_df, sep='\t')
id_conversion_df = pd.read_csv(snakemake.input.id_conversion_df, sep='\t',
                        names=['rna_central_id', 'source', 'ensembl_transcript_id',
                                'taxid', 'biotype', 'ensembl_gene_id'])
sno_bed = pd.read_csv(snakemake.input.sno_bed, sep='\t',
                        names=['chr', 'start', 'end', 'gene_id', 'dot', 'strand',
                                'source', 'feature', 'dot2', 'attributes'])

attributes = [att.split(';') for att in list(sno_bed.attributes)]
gene_id_name = {}
for sno_attributes in attributes:
    temp_name = None
    for specific_attribute in sno_attributes:
        if 'gene_id' in specific_attribute:
            id = specific_attribute.split(' "')[-1].strip('"')
        elif 'gene_name' in specific_attribute:
            name = specific_attribute.split(' "')[-1].strip('"')
            temp_name = name
    if all('gene_name' not in attri for attri in sno_attributes):
        temp_name = id
    gene_id_name[id] = temp_name


# Get snoRNA ids and name and create df
sno_df = pd.DataFrame(gene_id_name.items(), columns=['gene_id', 'gene_name'])

# Remove ".number" a the end of ensembl gene id
if '.' in list(id_conversion_df['ensembl_gene_id'])[0]:
    id_conversion_df[['ensembl_gene_id', 'suffix']] = id_conversion_df['ensembl_gene_id'].str.split('.', expand=True)

# Dict of corresponding ensembl and rnacentral ids
id_mapping = dict(zip(id_conversion_df.ensembl_gene_id, id_conversion_df.rna_central_id))

# Find if sno_type is known for species snoRNAs in RNAcentral description
rna_central_df.loc[rna_central_df['description'].str.contains('C/D|SNORD|Snord|U3'), 'snoRNA_type'] = 'C/D'
rna_central_df.loc[rna_central_df['description'].str.contains('H/ACA|SNORA|Snora|ACA'), 'snoRNA_type'] = 'H/ACA'
sno_type_dict = dict(zip(rna_central_df.upi, rna_central_df.snoRNA_type))  # where upi is the column of RNA central id

# Create rna_central_id and snoRNA_type columns
sno_df['rna_central_id'] = sno_df['gene_id'].map(id_mapping)
sno_df['snoRNA_type'] = sno_df['rna_central_id'].map(sno_type_dict)

# For snoRNAs without snoRNA type from RNAcentral description, find if their gene_name contains this information directly
sno_df.loc[(sno_df['snoRNA_type'].isna()) & (sno_df['gene_name'].str.contains('Snord|SNORD|U8|U3')), 'snoRNA_type'] = 'C/D'
sno_df.loc[(sno_df['snoRNA_type'].isna()) & (sno_df['gene_name'].str.contains('Snora|SNORA')), 'snoRNA_type'] = 'H/ACA'


# Drop all snoRNAs where we couldn't find the snoRNA type
sno_df = sno_df.dropna(subset=['snoRNA_type'])


sno_df.to_csv(snakemake.output.snoRNA_type_df, sep='\t', index=False)
import pandas as pd
import collections as coll

""" Define an abundance cutoff for host genes that will be used as a feature
    (>1 TPM in at least one average condition). The samples used to quantify HG
    abundance are 3 WT S. cerevisiae samples processed using the TGIRT-Seq
    pipeline (Fafard-Couture et al., 2021, Genome Biology)."""

host_df = pd.read_csv(snakemake.input.host_df, sep='\t')
sno_tpm_df = pd.read_csv(snakemake.input.sno_tpm_df, sep='\t')
tpm_df = pd.read_csv(snakemake.input.tpm_df, sep='\t')

# Add host id column to sno_tpm_df
host_dict = dict(zip(host_df.gene_id_sno, host_df.host_name))
sno_tpm_df['host_id'] = sno_tpm_df['gene_id'].map(host_dict)
# Select only host genes from total tpm_df
tpm_df = tpm_df[tpm_df['gene_id'].isin(list(host_df['host_id']))].reset_index(drop=True)

# If the host gene is expressed at >1 TPM in at least one average condition (average of the triplicates), it is flagged as expressed; otherwise as not expressed (or there is no host gene at all)
hg_abundance = tpm_df.filter(regex='_[123]$').reset_index(drop=True)  # tpm columns must end with '_1', '_2' or '_3'
cols_hg = list(hg_abundance.columns)
triplicates_hg = [cols_hg[n:n+3] for n in range(0, len(cols_hg), 3)]

tpm_df['abundance_cutoff_host'] = ''
for i in range(len(tpm_df)):
    for triplicate in triplicates_hg:
        if tpm_df.loc[i, triplicate].mean() > 1:
            tpm_df.loc[i, 'abundance_cutoff_host'] = 'host_expressed'
            break
tpm_df['abundance_cutoff_host'] = tpm_df['abundance_cutoff_host'].replace('', 'host_not_expressed')

# Merge abundance_cutoff_host information to sno_tpm df and fill NA for intergenic snoRNAs
final_df = sno_tpm_df.merge(tpm_df[['gene_name', 'abundance_cutoff_host']], how='left', left_on='host_id', right_on='gene_name')
final_df['abundance_cutoff_host'] = final_df['abundance_cutoff_host'].fillna('intergenic')
final_df = final_df.drop(columns='gene_name_y')
final_df = final_df.rename(columns={'gene_name_x': 'gene_name'})
final_df.to_csv(snakemake.output.HG_abundance_df, index=False, sep='\t')
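
# The row-by-row loop above can also be expressed with a groupby over the replicate columns;
# a minimal sketch of the same >1 TPM cutoff, assuming columns are ordered by condition with
# replicates suffixed '_1', '_2', '_3' (toy values for illustration):
import numpy as np
import pandas as pd

toy = pd.DataFrame({'gene_id': ['HG1', 'HG2'],
                    'brain_1': [0.5, 3.0], 'brain_2': [0.7, 2.5], 'brain_3': [0.6, 4.1],
                    'liver_1': [0.2, 0.1], 'liver_2': [0.3, 0.0], 'liver_3': [0.1, 0.2]})
reps = toy.filter(regex='_[123]$')
# Average each consecutive triplet of replicate columns (one mean per condition)
cond_means = reps.T.groupby(np.arange(len(reps.columns)) // 3).mean().T
toy['abundance_cutoff_host'] = np.where((cond_means > 1).any(axis=1),
                                        'host_expressed', 'host_not_expressed')
print(toy[['gene_id', 'abundance_cutoff_host']])  # HG1 -> host_not_expressed, HG2 -> host_expressed
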
import pandas as pd
import collections as coll

""" Find snoRNA type (C/D vs H/ACA) of yeast snoRNAs """

# Loading file and dfs
tpm_df = pd.read_csv(snakemake.input.tpm_df, sep='\t')
snoRNA_type_df = pd.read_csv(snakemake.input.yeast_mine_df, sep='\t',
                        names=['id', 'other_id', 'gene_id', 'description'])

# Convert to lowercase the first 2 letters of gene_id (snR instead of SNR) in snoRNA_type_df
snoRNA_type_df['gene_id'] = snoRNA_type_df['gene_id'].str.replace(r'SNR([a-zA-Z0-9]*)', r'snR\1', regex=True)
snoRNA_type_df['gene_id'] = snoRNA_type_df['gene_id'].str.replace('snR17A', 'snR17a')
snoRNA_type_df['gene_id'] = snoRNA_type_df['gene_id'].str.replace('snR17B', 'snR17b')

print(snoRNA_type_df)
print(tpm_df)

# Create snoRNA_type column
snoRNA_type_df.loc[snoRNA_type_df['description'].str.contains('C/D|U3'), 'snoRNA_type'] = 'C/D'
snoRNA_type_df.loc[snoRNA_type_df['description'].str.contains('H/ACA'), 'snoRNA_type'] = 'H/ACA'
snoRNA_type_df = snoRNA_type_df.dropna(subset=['snoRNA_type'])
snoRNA_type_df = snoRNA_type_df[['gene_id', 'snoRNA_type']]

# Merge with tpm_df
sno_tpm_df = snoRNA_type_df.merge(tpm_df, how='left', on='gene_id')

sno_tpm_df.to_csv(snakemake.output.snoRNA_type_df, sep='\t', index=False)
import pandas as pd
from pybedtools import BedTool
import subprocess as sp

""" Get the upstream and downstream flanking regions (15 nt) of each snoRNA
    using pybedtools flank. Then extend these flanking regions inside the
    snoRNA sequence using pybedtools slop. Extend for 5 nt from the 5' and
    3' of C/D box snoRNAs; extend 5 nt from the 5' and 3 nt from the 3' of
    H/ACA box snoRNAs."""


all_sno_bed = BedTool(snakemake.input.all_sno_bed)
sno_info_df = pd.read_csv(snakemake.input.sno_info, sep='\t')
chr_size_file = snakemake.input.genome_chr_size

# Get snoRNA type of all snoRNAs from RNAcentral reference
sno_type_dict = sno_info_df.set_index('gene_id')['snoRNA_type'].to_dict()

# Filter sno_bed to only keep snoRNAs where we could find a snoRNA type in RNAcentral
all_sno_bed = BedTool(line for line in all_sno_bed if line[3] in sno_type_dict.keys())

# Get 15 nt flanking regions upstream and downstream of C/D and H/ACA snoRNAs
# The .saveas() is needed to write the object to a temporary file since it is reused afterwards (a generator-based BedTool can only be consumed once)
flanking = all_sno_bed.flank(g=chr_size_file, b=15)  # this is a BedTool object


cd_flank = BedTool(line for line in flanking if sno_type_dict[line[3]] == 'C/D').saveas()  # where line[3] corresponds to the gene_id
haca_flank = BedTool(line for line in flanking if sno_type_dict[line[3]] == 'H/ACA').saveas()  # where line[3] corresponds to the gene_id

# Separate the bed objects into the left or the right flanking regions for both snoRNA type
cd_flank_left = BedTool(line for i, line in enumerate(cd_flank) if i % 2 == 0).saveas()
cd_flank_right = BedTool(line for i, line in enumerate(cd_flank) if i % 2 != 0).saveas()

haca_flank_left = BedTool(line for i, line in enumerate(haca_flank) if i % 2 == 0).saveas()
haca_flank_right = BedTool(line for i, line in enumerate(haca_flank) if i % 2 != 0).saveas()

# For H/ACA snoRNAs, split by strand, since we don't extend (slop) the same number of nt (5 vs 3 nt respectively from the 5' and 3' end)
# We don't need to do that for C/D snoRNAs since we extend 5 nt from both the 5' and 3' ends, so strand does not affect the terminal stem sequences
haca_flank_left_plus = BedTool(line for line in haca_flank_left if line[5] == '+').saveas()  # where line[5] corresponds to the strand
haca_flank_left_minus = BedTool(line for line in haca_flank_left if line[5] == '-').saveas()  # where line[5] corresponds to the strand
haca_flank_right_plus = BedTool(line for line in haca_flank_right if line[5] == '+').saveas()  # where line[5] corresponds to the strand
haca_flank_right_minus = BedTool(line for line in haca_flank_right if line[5] == '-').saveas()  # where line[5] corresponds to the strand

# For C/D snoRNAs, extend the flanking region inside the snoRNA for 5 nt from the 5' and 3' of the snoRNA
cd_extend_left = cd_flank_left.slop(r=5, l=0, g=chr_size_file).saveas(snakemake.output.flanking_cd_left)
cd_extend_right = cd_flank_right.slop(l=5, r=0, g=chr_size_file).saveas(snakemake.output.flanking_cd_right)

# For H/ACA snoRNAs, extend the flanking region inside the snoRNA for 5 nt from the 5' and 3 nt from the 3' of the snoRNA
haca_extend_left_plus = haca_flank_left_plus.slop(r=5, l=0, g=chr_size_file, s=True).saveas('temp_haca_extend_left_plus.bed')
haca_extend_right_plus = haca_flank_right_plus.slop(l=3, r=0, g=chr_size_file, s=True).saveas('temp_haca_extend_right_plus.bed')

haca_extend_left_minus = haca_flank_left_minus.slop(l=3, r=0, g=chr_size_file, s=True).saveas('temp_haca_extend_left_minus.bed')
haca_extend_right_minus = haca_flank_right_minus.slop(r=5, l=0, g=chr_size_file, s=True).saveas('temp_haca_extend_right_minus.bed')

sp.call('cat temp_haca_extend_left_plus.bed temp_haca_extend_left_minus.bed > '+snakemake.output.flanking_haca_left, shell=True)
sp.call('cat temp_haca_extend_right_plus.bed temp_haca_extend_right_minus.bed > '+snakemake.output.flanking_haca_right, shell=True)
sp.call('rm temp_haca_extend*.bed', shell=True)
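
# A minimal pybedtools sketch of the flank/slop pattern above, using one toy snoRNA and a
# hypothetical chromosome sizes file (coordinates are illustrative only):
from pybedtools import BedTool

with open('toy.genome', 'w') as f:  # hypothetical chrom sizes file for the sketch
    f.write('chr1\t248956422\n')
sno = BedTool('chr1\t1000\t1100\tsno1\t0\t+', from_string=True)

# flank() returns two intervals per feature, which is why the scripts above separate the
# left and right flanking regions with even/odd indices
flanks = sno.flank(g='toy.genome', b=15).saveas()
left = BedTool(iv for i, iv in enumerate(flanks) if i % 2 == 0).saveas()
# slop() then extends the 15 nt left flank 5 nt into the snoRNA (chr1:985-1005)
left_extended = left.slop(r=5, l=0, g='toy.genome')
print(left_extended)
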
import pandas as pd
from pybedtools import BedTool
import subprocess as sp

""" Get the upstream and downstream flanking regions (15 nt) of each snoRNA
    using pybedtools flank. Then extend these flanking regions inside the
    snoRNA sequence using pybedtools slop. Extend for 5 nt from the 5' and
    3' of C/D box snoRNAs; extend 5 nt from the 5' and 3 nt from the 3' of
    H/ACA box snoRNAs."""


all_sno_bed = BedTool(snakemake.input.all_sno_bed)
sno_info_df = pd.read_csv(snakemake.input.snodb_info, sep='\t')

# Get snoRNA type of all snoRNAs from snoDB reference
sno_type_dict = sno_info_df.set_index('gene_id_sno')['sno_type'].to_dict()

# Get 15 nt flanking regions upstream and downstream of C/D and H/ACA snoRNAs
# The .saveas() is needed to write the object to a temporary file since it is reused afterwards (a generator-based BedTool can only be consumed once)
flanking = all_sno_bed.flank(genome="hg38", b=15)  # this is a BedTool object
cd_flank = BedTool(line for line in flanking if sno_type_dict[line[3]] == 'C/D').saveas()  # where line[3] corresponds to the gene_id
haca_flank = BedTool(line for line in flanking if sno_type_dict[line[3]] == 'H/ACA').saveas()  # where line[3] corresponds to the gene_id

# Separate the bed objects into the left or the right flanking regions for both snoRNA type
cd_flank_left = BedTool(line for i, line in enumerate(cd_flank) if i % 2 == 0).saveas()
cd_flank_right = BedTool(line for i, line in enumerate(cd_flank) if i % 2 != 0).saveas()

haca_flank_left = BedTool(line for i, line in enumerate(haca_flank) if i % 2 == 0).saveas()
haca_flank_right = BedTool(line for i, line in enumerate(haca_flank) if i % 2 != 0).saveas()

# For H/ACA snoRNAs, split by strand, since we don't extend (slop) the same number of nt (5 vs 3 nt respectively from the 5' and 3' end)
# We don't need to do that for C/D snoRNAs since we extend 5 nt from both the 5' and 3' ends, so strand does not affect the terminal stem sequences
haca_flank_left_plus = BedTool(line for line in haca_flank_left if line[5] == '+').saveas()  # where line[5] corresponds to the strand
haca_flank_left_minus = BedTool(line for line in haca_flank_left if line[5] == '-').saveas()  # where line[5] corresponds to the strand
haca_flank_right_plus = BedTool(line for line in haca_flank_right if line[5] == '+').saveas()  # where line[5] corresponds to the strand
haca_flank_right_minus = BedTool(line for line in haca_flank_right if line[5] == '-').saveas()  # where line[5] corresponds to the strand

# For C/D snoRNAs, extend the flanking region inside the snoRNA for 5 nt from the 5' and 3' of the snoRNA
cd_extend_left = cd_flank_left.slop(r=5, l=0, genome="hg38").saveas(snakemake.output.flanking_cd_left)
cd_extend_right = cd_flank_right.slop(l=5, r=0, genome="hg38").saveas(snakemake.output.flanking_cd_right)

# For H/ACA snoRNAs, extend the flanking region inside the snoRNA for 5 nt from the 5' and 3 nt from the 3' of the snoRNA
haca_extend_left_plus = haca_flank_left_plus.slop(r=5, l=0, genome="hg38", s=True).saveas('temp_haca_human_extend_left_plus.bed')
haca_extend_right_plus = haca_flank_right_plus.slop(l=3, r=0, genome="hg38", s=True).saveas('temp_haca_human_extend_right_plus.bed')

haca_extend_left_minus = haca_flank_left_minus.slop(l=3, r=0, genome="hg38", s=True).saveas('temp_haca_human_extend_left_minus.bed')
haca_extend_right_minus = haca_flank_right_minus.slop(r=5, l=0, genome="hg38", s=True).saveas('temp_haca_human_extend_right_minus.bed')

sp.call('cat temp_haca_human_extend_left_plus.bed temp_haca_human_extend_left_minus.bed > '+snakemake.output.flanking_haca_left, shell=True)
sp.call('cat temp_haca_human_extend_right_plus.bed temp_haca_human_extend_right_minus.bed > '+snakemake.output.flanking_haca_right, shell=True)
sp.call('rm temp_haca_human_extend*.bed', shell=True)
import pandas as pd
from pybedtools import BedTool
import subprocess as sp

""" Get the upstream and downstream flanking regions (15 nt) of each snoRNA
    using pybedtools flank. Then extend these flanking regions inside the
    snoRNA sequence using pybedtools slop. Extend for 5 nt from the 5' and
    3' of C/D box snoRNAs; extend 5 nt from the 5' and 3 nt from the 3' of
    H/ACA box snoRNAs."""

species = snakemake.wildcards.species
all_sno_bed = BedTool(snakemake.input.all_sno_bed)
sno_info_df = pd.read_csv(snakemake.input.sno_info, sep='\t')
chr_size_file = snakemake.input.genome_chr_size

# Get snoRNA type of all snoRNAs from RNAcentral reference
sno_type_dict = sno_info_df.set_index('gene_id')['snoRNA_type'].to_dict()

# Filter sno_bed to only keep snoRNAs where we could find a snoRNA type in RNAcentral
all_sno_bed = BedTool(line for line in all_sno_bed if line[3] in sno_type_dict.keys())

# Get 15 nt flanking regions upstream and downstream of C/D and H/ACA snoRNAs
# The .saveas() is needed to write the object to a temporary file since it is reused afterwards (a generator-based BedTool can only be consumed once)
flanking = all_sno_bed.flank(g=chr_size_file, b=15)  # this is a BedTool object


cd_flank = BedTool(line for line in flanking if sno_type_dict[line[3]] == 'C/D').saveas()  # where line[3] corresponds to the gene_id
haca_flank = BedTool(line for line in flanking if sno_type_dict[line[3]] == 'H/ACA').saveas()  # where line[3] corresponds to the gene_id

# Separate the bed objects into the left or the right flanking regions for both snoRNA type
cd_flank_left = BedTool(line for i, line in enumerate(cd_flank) if i % 2 == 0).saveas()
cd_flank_right = BedTool(line for i, line in enumerate(cd_flank) if i % 2 != 0).saveas()

haca_flank_left = BedTool(line for i, line in enumerate(haca_flank) if i % 2 == 0).saveas()
haca_flank_right = BedTool(line for i, line in enumerate(haca_flank) if i % 2 != 0).saveas()

# For H/ACA snoRNAs, split by strand, since we don't extend (slop) the same number of nt (5 vs 3 nt respectively from the 5' and 3' end)
# We don't need to do that for C/D snoRNAs since we extend 5 nt from both the 5' and 3' ends, so strand does not affect the terminal stem sequences
haca_flank_left_plus = BedTool(line for line in haca_flank_left if line[5] == '+').saveas()  # where line[5] corresponds to the strand
haca_flank_left_minus = BedTool(line for line in haca_flank_left if line[5] == '-').saveas()  # where line[5] corresponds to the strand
haca_flank_right_plus = BedTool(line for line in haca_flank_right if line[5] == '+').saveas()  # where line[5] corresponds to the strand
haca_flank_right_minus = BedTool(line for line in haca_flank_right if line[5] == '-').saveas()  # where line[5] corresponds to the strand

# For C/D snoRNAs, extend the flanking region inside the snoRNA for 5 nt from the 5' and 3' of the snoRNA
cd_extend_left = cd_flank_left.slop(r=5, l=0, g=chr_size_file).saveas(snakemake.output.flanking_cd_left)
cd_extend_right = cd_flank_right.slop(l=5, r=0, g=chr_size_file).saveas(snakemake.output.flanking_cd_right)

# For H/ACA snoRNAs, extend the flanking region inside the snoRNA for 5 nt from the 5' and 3 nt from the 3' of the snoRNA
haca_extend_left_plus = haca_flank_left_plus.slop(r=5, l=0, g=chr_size_file, s=True).saveas(f'temp_haca_extend_left_plus_{species}.bed')
haca_extend_right_plus = haca_flank_right_plus.slop(l=3, r=0, g=chr_size_file, s=True).saveas(f'temp_haca_extend_right_plus_{species}.bed')

haca_extend_left_minus = haca_flank_left_minus.slop(l=3, r=0, g=chr_size_file, s=True).saveas(f'temp_haca_extend_left_minus_{species}.bed')
haca_extend_right_minus = haca_flank_right_minus.slop(r=5, l=0, g=chr_size_file, s=True).saveas(f'temp_haca_extend_right_minus_{species}.bed')

sp.call(f'cat temp_haca_extend_left_plus_{species}.bed temp_haca_extend_left_minus_{species}.bed > '+snakemake.output.flanking_haca_left, shell=True)
sp.call(f'cat temp_haca_extend_right_plus_{species}.bed temp_haca_extend_right_minus_{species}.bed > '+snakemake.output.flanking_haca_right, shell=True)
sp.call(f'rm temp_haca_extend*_{species}.bed', shell=True)
import pandas as pd

expressed_cd_fa = snakemake.input.expressed_cd_fa
not_expressed_cd_fa = snakemake.input.not_expressed_cd_fa
expressed_haca_fa = snakemake.input.expressed_haca_fa
not_expressed_haca_fa = snakemake.input.not_expressed_haca_fa

c_d_box_expressed = pd.read_csv(snakemake.input.c_d_box_location_expressed, sep='\t')
c_d_box_not_expressed = pd.read_csv(snakemake.input.c_d_box_location_not_expressed, sep='\t')
h_aca_box_expressed = pd.read_csv(snakemake.input.h_aca_box_location_expressed, sep='\t')
h_aca_box_not_expressed = pd.read_csv(snakemake.input.h_aca_box_location_not_expressed, sep='\t')

def dict_to_fasta(dictio, output_path):
    """ Use dict to create fasta (keys being used for ID lines and
        values being sequence lines)."""
    with open(output_path, 'w') as f:
        for sno_id, sequence in dictio.items():
            f.write(f'>{sno_id}\n')
            f.write(f'{sequence}\n')



def flanking_nt_to_c_box(position_df, input_fasta, output_fasta, box_type):
    """ Find start and end of C box in position_df and extract 3nt flanking up- and downstream
        of C box. Return flanking nt + C box sequence as fasta. If C box is directly at the 5'
        start of the snoRNA, return 'nnn' as the flanking nt. Also, return flanking nt in
        lowercase and motif as uppercase. This function can also be applied to C' and H boxes
        using box_type."""
    len_motif_dict = {"C": "nnnNNNNNNNnnn", "C_prime": "nnnNNNNNNNnnn", "H": "nnnNNNNNNnnn"}
    flanking_motif_dict = {}
    position_df = position_df.set_index('gene_id')
    start_end_dict = position_df[[f'{box_type}_start', f'{box_type}_end']].to_dict('index')  # these positions are 1-based
    with open(input_fasta, 'r') as f:
        sno_id = ''
        for line in f:
            if line.startswith('>'):
                id = line.lstrip('>').rstrip('\n')
                sno_id = id
            else:
                seq = line.rstrip('\n')
                start, end = start_end_dict[sno_id][f'{box_type}_start'], start_end_dict[sno_id][f'{box_type}_end']
                if (start == 0) & (end == 0):  # i.e. no C box was found
                    flanking_motif_dict[sno_id] = len_motif_dict[box_type]
                elif start >= 4:  # i.e. a C box was found after the first 3 nt
                    flanking_motif = seq[start-4:end+3]  # -4 and +3 to convert to 0-based indexing
                    left, right = flanking_motif[0:3].lower(), flanking_motif[-3:].lower()
                    flanking_motif_dict[sno_id] = left + flanking_motif[3:-3] + right
                else:  # i.e. a C box was found but included within the first 3 nt
                    n = {0: 'nnn', 1: 'nn', 2:'n'}
                    flanking_motif = seq[0:end+3]
                    flanking_motif = n[start-1] + flanking_motif
                    left, right = flanking_motif[0:3].lower(), flanking_motif[-3:].lower()
                    flanking_motif_dict[sno_id] = left + flanking_motif[3:-3] + right

    dict_to_fasta(flanking_motif_dict, output_fasta)


def flanking_nt_to_d_box(position_df, input_fasta, output_fasta, box_type):
    """ Find start and end of D box in position_df and extract 3nt flanking up- and downstream
        of D box. Return flanking nt + D box sequence as fasta. If D box is directly at the 3'
        start of the snoRNA, return 'nnn' as the flanking nt. Also, return flanking nt in
        lowercase and motif as uppercase. This function can also be applied to D' and ACA boxes
        using box_type."""
    len_motif_dict = {"D": "nnnNNNNnnn", "D_prime": "nnnNNNNnnn", "ACA": "nnnNNNnnn"}
    flanking_motif_dict = {}
    position_df = position_df.set_index('gene_id')
    start_end_dict = position_df[[f'{box_type}_start', f'{box_type}_end']].to_dict('index')  # these positions are 1-based
    with open(input_fasta, 'r') as f:
        sno_id = ''
        for line in f:
            if line.startswith('>'):
                id = line.lstrip('>').rstrip('\n')
                sno_id = id
            else:
                seq = line.rstrip('\n')
                start, end = start_end_dict[sno_id][f'{box_type}_start'], start_end_dict[sno_id][f'{box_type}_end']
                if (start == 0) & (end == 0):  # i.e. no D box was found
                    flanking_motif_dict[sno_id] = len_motif_dict[box_type]
                elif end <= len(seq) - 3:  # i.e. a D box was found before the last 3 nt
                    flanking_motif = seq[start-4:end+3]  # -4 and +3 to convert to 0-based indexing
                    left, right = flanking_motif[0:3].lower(), flanking_motif[-3:].lower()
                    flanking_motif_dict[sno_id] = left + flanking_motif[3:-3] + right
                else:  # i.e. a D box was found but included within the last 3 nt
                    n = {0: 'nnn', 1: 'nn', 2:'n'}
                    flanking_motif = seq[start-4:]
                    flanking_motif = flanking_motif + n[len(seq) - end]
                    left, right = flanking_motif[0:3].lower(), flanking_motif[-3:].lower()
                    flanking_motif_dict[sno_id] = left + flanking_motif[3:-3] + right

    dict_to_fasta(flanking_motif_dict, output_fasta)


# Create fasta of motif and flanking nt for expressed and not expressed C/D box snoRNAs
flanking_nt_to_c_box(c_d_box_expressed, expressed_cd_fa, snakemake.output.c_expressed, 'C')
flanking_nt_to_c_box(c_d_box_expressed, expressed_cd_fa, snakemake.output.c_prime_expressed, 'C_prime')
flanking_nt_to_d_box(c_d_box_expressed, expressed_cd_fa, snakemake.output.d_expressed, 'D')
flanking_nt_to_d_box(c_d_box_expressed, expressed_cd_fa, snakemake.output.d_prime_expressed, 'D_prime')

flanking_nt_to_c_box(c_d_box_not_expressed, not_expressed_cd_fa, snakemake.output.c_not_expressed, 'C')
flanking_nt_to_c_box(c_d_box_not_expressed, not_expressed_cd_fa, snakemake.output.c_prime_not_expressed, 'C_prime')
flanking_nt_to_d_box(c_d_box_not_expressed, not_expressed_cd_fa, snakemake.output.d_not_expressed, 'D')
flanking_nt_to_d_box(c_d_box_not_expressed, not_expressed_cd_fa, snakemake.output.d_prime_not_expressed, 'D_prime')

# Create fasta of motif and flanking nt for expressed and not expressed H/ACA box snoRNAs
flanking_nt_to_c_box(h_aca_box_expressed, expressed_haca_fa, snakemake.output.h_expressed, 'H')
flanking_nt_to_d_box(h_aca_box_expressed, expressed_haca_fa, snakemake.output.aca_expressed, 'ACA')

flanking_nt_to_c_box(h_aca_box_not_expressed, not_expressed_haca_fa, snakemake.output.h_not_expressed, 'H')
flanking_nt_to_d_box(h_aca_box_not_expressed, not_expressed_haca_fa, snakemake.output.aca_not_expressed, 'ACA')
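
# A small worked example of the 1-based to 0-based conversion used in the functions above:
# for a C box occupying 1-based positions 8-14, seq[start-4:end+3] returns the 3 nt upstream,
# the box and the 3 nt downstream (toy sequence for illustration):
seq = 'ACGTACGRTGATGATCATCAGG'  # toy sequence; 'RTGATGA' is the C box at 1-based positions 8-14
start, end = 8, 14
flanking_motif = seq[start-4:end+3]         # 0-based slice -> 'ACGRTGATGATCA'
left, right = flanking_motif[:3].lower(), flanking_motif[-3:].lower()
print(left + flanking_motif[3:-3] + right)  # 'acgRTGATGAtca'
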
import pandas as pd
import subprocess as sp
if 'Mus_musculus' in snakemake.input.gtf_bed:
    species = 'mus_musculus'
else:
    species = snakemake.wildcards.species
df = pd.read_csv(snakemake.input.gtf_bed, sep='\t',
                    names=['chr', 'start', 'end', 'gene_id', 'dot',
                            'strand', 'source', 'feature', 'dot2', 'attributes'], index_col=False)

embedded_genes = ['miRNA', 'Mt_tRNA', 'scaRNA', 'scRNA', 'snoRNA', 'snRNA', 'sRNA']
embedded_genes = '|'.join(f'gene_biotype "{item}"' for item in embedded_genes)

# Keep only gene features and remove embedded genes
df = df[df['feature'] == 'gene']
df = df[~df['attributes'].str.contains(embedded_genes)]

# Add "chr" in front of chr number
df['chr'] = 'chr' + df['chr'].astype(str)
df.to_csv(f'temp_gtf_{species}.bed', index=False, sep='\t', header=False)

# Sort gtf bed file
sp.call(f'sort -k1,1 -k2,2n temp_gtf_{species}.bed > '+snakemake.output.formatted_gtf_bed, shell=True)
sp.call(f'rm temp_gtf_{species}.bed', shell=True)
import pandas as pd
import subprocess as sp
""" From a bed file containing all snoRNAs (generated via gtf_to_bed.sh) and a csv file containing snoRNA info (i.e.
    their HG), generate snoRNA bed files (intronic, intergenic, intronic without SNHG14 snoRNAs, only SNHG14 snoRNAs)"""

all_sno_bed = pd.read_csv(snakemake.input.all_sno_bed, sep='\t',
                          names=['chr', 'start', 'end', 'gene_id', 'dot',
                                'strand', 'source', 'feature', 'dot2',
                                 'gene_info'])
all_sno_bed['chr'] = all_sno_bed['chr'].astype(str)
all_sno_bed = all_sno_bed.sort_values(['chr', 'start', 'end'])
all_sno_bed = all_sno_bed.replace(to_replace='"cluster_([0-9]{0,4})', value=r'"cluster_\1"; gene_version "1', regex=True)  # Add gene_version for blockbuster detected snoRNAs


sno_info = pd.read_csv(snakemake.input.sno_info_df)
intergenic_sno_id = list(sno_info[sno_info['host_id'].isna()].gene_id)  # snoRNAs with NaN host_id are intergenic
snhg14_sno_id = list(sno_info[sno_info['host_name'] == 'SNHG14'].gene_id)

intronic_sno_bed = all_sno_bed[~all_sno_bed['gene_id'].isin(intergenic_sno_id)]
intergenic_sno_bed = all_sno_bed[all_sno_bed['gene_id'].isin(intergenic_sno_id)]

intronic_sno_bed_wo_snhg14 = intronic_sno_bed[~intronic_sno_bed['gene_id'].isin(snhg14_sno_id)]  # bed file containing all intronic snoRNAs except those encoded in SNHG14
snhg14_sno_bed = intronic_sno_bed[intronic_sno_bed['gene_id'].isin(snhg14_sno_id)]  # bed file containing SNHG14 snoRNAs

intronic_sno_bed.to_csv(snakemake.output.intronic_sno_bed, index=False, header=False, sep='\t')
intergenic_sno_bed.to_csv(snakemake.output.intergenic_sno_bed, index=False, header=False, sep='\t')
intronic_sno_bed_wo_snhg14.to_csv(snakemake.output.intronic_sno_bed_wo_snhg14, index=False, header=False, sep='\t')
snhg14_sno_bed.to_csv(snakemake.output.snhg14_sno_bed, index=False, header=False, sep='\t')
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
import pickle
import shap
import numpy as np

""" Compute the SHAP value of all features for all snoRNAs in each test set and each model."""

iteration_ = snakemake.wildcards.manual_iteration
model_name = snakemake.wildcards.models2
X_train = pd.read_csv(snakemake.input.X_train, sep='\t', index_col='gene_id_sno')
X_test = pd.read_csv(snakemake.input.X_test, sep='\t', index_col='gene_id_sno')
model_path = snakemake.input.pickled_trained_model
model = pickle.load(open(model_path, 'rb'))

if model_name == 'log_reg':
    explainer = shap.LinearExplainer(model, shap.sample(X_train, 100, random_state=42))
    shap_val = explainer.shap_values(X_test)
    shap_val_df = pd.DataFrame(shap_val, index=X_test.index, columns=X_test.columns)
    shap_val_df = shap_val_df.add_suffix('_SHAP')
    shap_val_df = shap_val_df.reset_index()
    base_value = explainer.expected_value  # this is the base value where the decision plot starts (average of all X_train log odds)
    base_value_df = pd.DataFrame([base_value], columns=[f'expected_value_{model_name}_{iteration_}'])
    shap_val_df.to_csv(snakemake.output.shap, sep='\t', index=False)
    base_value_df.to_csv(snakemake.output.expected_value, sep='\t', index=False)
else:
    explainer = shap.KernelExplainer(model.predict, shap.sample(X_train, 100, random_state=42))
    shap_val = explainer.shap_values(X_test)
    shap_val_df = pd.DataFrame(shap_val, index=X_test.index, columns=X_test.columns)
    shap_val_df = shap_val_df.add_suffix('_SHAP')
    shap_val_df = shap_val_df.reset_index()
    base_value = explainer.expected_value  # this is the base value where the decision plot starts (average of all X_train log odds)
    base_value_df = pd.DataFrame([base_value], columns=[f'expected_value_{model_name}_{iteration_}'])
    shap_val_df.to_csv(snakemake.output.shap, sep='\t', index=False)
    base_value_df.to_csv(snakemake.output.expected_value, sep='\t', index=False)
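
# A minimal, self-contained sketch of the LinearExplainer branch above on a toy logistic
# regression (scikit-learn is assumed to be available, as in the model training steps;
# data and feature names are made up for illustration):
import numpy as np
import pandas as pd
import shap
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=['feat_a', 'feat_b', 'feat_c'])
y = (X['feat_a'] + 0.5 * X['feat_b'] > 0).astype(int)
toy_model = LogisticRegression().fit(X, y)

# Background sample of the training data, as in the script above
explainer = shap.LinearExplainer(toy_model, shap.sample(X, 100, random_state=42))
shap_val = explainer.shap_values(X.iloc[:5])
print(pd.DataFrame(shap_val, index=X.index[:5], columns=X.columns).add_suffix('_SHAP'))
print(explainer.expected_value)  # base value where the decision plot starts
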
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
import pickle
import shap
import numpy as np

""" Compute the SHAP value of all features for all snoRNAs in each test set and each model."""

iteration_ = snakemake.wildcards.iteration
model_name = snakemake.wildcards.models2
X_train = pd.read_csv(snakemake.input.X_train, sep='\t', index_col='gene_id_sno')
X_test = pd.read_csv(snakemake.input.X_test, sep='\t', index_col='gene_id_sno')
model_path = snakemake.input.pickled_trained_model
model = pickle.load(open(model_path, 'rb'))

if model_name == 'log_reg':
    explainer = shap.LinearExplainer(model, shap.sample(X_train, 100, random_state=42))
    shap_val = explainer.shap_values(X_test)
    shap_val_df = pd.DataFrame(shap_val, index=X_test.index, columns=X_test.columns)
    shap_val_df = shap_val_df.add_suffix('_SHAP')
    shap_val_df = shap_val_df.reset_index()
    base_value = explainer.expected_value  # this is the base value where the decision plot starts (average of all X_train log odds)
    base_value_df = pd.DataFrame([base_value], columns=[f'expected_value_{model_name}_{iteration_}'])
    shap_val_df.to_csv(snakemake.output.shap, sep='\t', index=False)
    base_value_df.to_csv(snakemake.output.expected_value, sep='\t', index=False)
else:
    explainer = shap.KernelExplainer(model.predict, shap.sample(X_train, 100, random_state=42))
    shap_val = explainer.shap_values(X_test)
    shap_val_df = pd.DataFrame(shap_val, index=X_test.index, columns=X_test.columns)
    shap_val_df = shap_val_df.add_suffix('_SHAP')
    shap_val_df = shap_val_df.reset_index()
    base_value = explainer.expected_value  # this is the base value where the decision plot starts (average of all X_train log odds)
    base_value_df = pd.DataFrame([base_value], columns=[f'expected_value_{model_name}_{iteration_}'])
    shap_val_df.to_csv(snakemake.output.shap, sep='\t', index=False)
    base_value_df.to_csv(snakemake.output.expected_value, sep='\t', index=False)
import pandas as pd

""" Select only the predicted branch point nucleotide with the highest
    probability per HG intron and compute its distance to the snoRNA. The columns
    in the output df are seqnames (chr), start (start position of the window,
    i.e. always 44 nt upstream of the 3' exon), end (end position of the window,
    i.e. always 18 nt upstream of the 3' exon), strand, gene_id, transcript_id,
    exon_3prime (exon_id of the 3' exon), exon_5prime (exon_id of the 5' exon),
    exon_number (exon number of the exon downstream of the branch point),
    intron_number (intron in which the branch point is located, i.e. the number of
    the upstream exon), test_site (position of the branch point), seq_pos0 (nt
    at the branch point), branchpoint_prob (branchpoint probability computed by
    branchpointer), U2_binding_energy (U2 binding energy to the branch point
    computed by branchpointer) and bp_to_3prime (distance of branch point to
    3' exon)."""

bp_total_df = pd.read_csv(snakemake.input.bp_distance_total_df)
sno_location_df = pd.read_csv(snakemake.input.sno_location_df, sep='\t')
sno_overlap_df = pd.read_csv(snakemake.params.sno_overlap_df, sep='\t')

# Split bp_total_df into a SNHG14 df and an 'all other host genes' (HG) df
# (.copy() avoids SettingWithCopyWarning when new columns are added below)
snhg14_bp = bp_total_df[bp_total_df['transcript_id'] == "NR_146177.1"].copy()
all_other_hg_bp = bp_total_df[bp_total_df['transcript_id'] != "NR_146177.1"].copy()

# Group the df by chr, exon_3prime and exon_5prime of the window and keep only the line with the highest branch point probability per group
# For all HG except SNHG14
all_other_hg_bp['max'] = all_other_hg_bp.groupby(['seqnames', 'exon_3prime', 'exon_5prime'])['branchpoint_prob'].transform('max')
all_other_hg_bp = all_other_hg_bp[all_other_hg_bp['max'] == all_other_hg_bp['branchpoint_prob']]  # Select only the max probability branch point

#For SNHG14, groupby exon number only (since no exon ids are available in the refseq gtf)
snhg14_bp['max'] = snhg14_bp.groupby('exon_number')['branchpoint_prob'].transform('max')
snhg14_bp = snhg14_bp[snhg14_bp['max'] == snhg14_bp['branchpoint_prob']]  # Select only the max probability branch point


# Calculate bp_to_3prime distance and intron number for both dfs and concat the resulting dfs into one df
all_other_hg_bp['bp_to_3prime'] = all_other_hg_bp['end'] + 18 - all_other_hg_bp['test_site']
all_other_hg_bp['intron_number'] = all_other_hg_bp['exon_number'] - 1

all_other_hg_bp = all_other_hg_bp[['seqnames', 'start', 'end', 'strand', 'gene_id',
                'transcript_id', 'exon_3prime', 'exon_5prime',
                'exon_number', 'intron_number', 'test_site', 'seq_pos0',
                'branchpoint_prob', 'U2_binding_energy', 'bp_to_3prime']]

snhg14_bp['bp_to_3prime'] = snhg14_bp['end'] + 18 - snhg14_bp['test_site']
snhg14_bp['intron_number'] = snhg14_bp['exon_number'] - 1

snhg14_bp = snhg14_bp[['seqnames', 'start', 'end', 'strand', 'gene_id',
                'transcript_id', 'exon_3prime', 'exon_5prime',
                'exon_number', 'intron_number', 'test_site', 'seq_pos0',
                'branchpoint_prob', 'U2_binding_energy', 'bp_to_3prime']]

bp_distance_simple = pd.concat([all_other_hg_bp, snhg14_bp], ignore_index=True)
bp_distance_simple.to_csv(snakemake.output.bp_distance_simple, index=False, sep='\t')

# Merge bp_distance_simple df to sno_location_df and calculate the distance between intronic snoRNAs and the predicted branch_point in their intron
sno_location_df['intron_number'] = sno_location_df['intron_number'].astype(int)
sno_bp_df = sno_location_df.merge(bp_distance_simple[['transcript_id', 'intron_number', 'bp_to_3prime']],
                how='left', left_on=['transcript_id_host', 'intron_number'], right_on=['transcript_id', 'intron_number'])

sno_bp_df['dist_to_bp'] = sno_bp_df['distance_downstream_exon'] - sno_bp_df['bp_to_3prime']

# Replace bp_to_3prime and dist_to_bp by 0 for snoRNAs that overlap a HG exon
overlapping_sno = list(sno_overlap_df[~sno_overlap_df['hg_overlap'].isna()].sno)
overlapping_sno.append('NR_000026')  # this snoRNA is 2 nt away from its downstream exon, so dist_to_bp would be negative; we therefore also consider it as overlapping
for sno_id in overlapping_sno:
    sno_bp_df.loc[sno_bp_df.gene_id_sno == sno_id, 'bp_to_3prime'] = 0
    sno_bp_df.loc[sno_bp_df.gene_id_sno == sno_id, 'dist_to_bp'] = 0

sno_bp_df.drop(['transcript_id_x', 'transcript_id_y'], axis=1, inplace=True)

sno_bp_df.to_csv(snakemake.output.sno_distance_bp, index=False, sep='\t')
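
# A minimal sketch of the groupby/transform('max') pattern used above to keep only the
# highest-probability branch point per intron (toy values for illustration):
import pandas as pd

toy = pd.DataFrame({'exon_3prime': ['E2', 'E2', 'E3', 'E3'],
                    'test_site': [105, 110, 503, 510],
                    'branchpoint_prob': [0.2, 0.9, 0.7, 0.4]})
toy['max'] = toy.groupby('exon_3prime')['branchpoint_prob'].transform('max')
best_bp = toy[toy['max'] == toy['branchpoint_prob']]
print(best_bp)  # one row per intron: test_site 110 (prob 0.9) and test_site 503 (prob 0.7)
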
import pandas as pd
import collections as coll
from statistics import mode

""" Define the consensus confusion value across the 3 models and 10 iterations. 
    Choose the confusion value based on the highest number of time predicted 
    as such across the 30 models. If 2 confusion values have an equal number of 
    votes (15 vs 15), remove randomly 1 iteration and the equality will then be 
    broken and a predominant confusion value will be chosen."""

df_paths = snakemake.input.confusion_val_df

all_mouse_sno_ids = pd.read_csv(df_paths[0], sep='\t')  # we take the first df here, but they all contain all the snoRNAs
all_mouse_sno_ids = list(all_mouse_sno_ids['gene_id_sno'])

conf_val = {}
for sno_id in all_mouse_sno_ids:
    temp_val = []
    for path in df_paths:
        df = pd.read_csv(path, sep='\t')
        df = df.filter(regex='gene_id_sno|^confusion_matrix_val')
        val = df[df['gene_id_sno'] == sno_id].values[0][1]
        temp_val.append(val)
    if 15 in coll.Counter(temp_val).values():  # if there is an equality in confusion value votes (ex: 15 TN vs 15 FP)
        del temp_val[0]  # remove first iteration for all three models
        del temp_val[9]
        del temp_val[18]
        consensus = mode(temp_val)
        conf_val[sno_id] = consensus
    else:
        consensus = mode(temp_val)
        conf_val[sno_id] = consensus

# Create df from dict
final_df = pd.DataFrame(conf_val.items(), columns=['gene_id_sno', 'consensus_confusion_value'])
final_df.to_csv(snakemake.output.consensus_conf_val_df, sep='\t', index=False)
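
# A toy example of the tie-breaking rule above: with an exact 15/15 split across the 30
# model-iteration votes, dropping the first iteration of each model (original indices 0,
# 10 and 20) leaves a single most frequent confusion value for mode() to pick:
import collections as coll
from statistics import mode

votes = ['TN'] * 15 + ['FP'] * 15  # illustrative 15 vs 15 tie
if 15 in coll.Counter(votes).values():
    del votes[0]   # original index 0
    del votes[9]   # original index 10
    del votes[18]  # original index 20
print(coll.Counter(votes), mode(votes))  # Counter({'FP': 14, 'TN': 13}) FP
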
import pandas as pd
import collections as coll
from statistics import mode

""" Define the consensus confusion value across the 10 iterations for each model.
    Choose the confusion value based on the highest number of time predicted
    as such across the 10 models. If 2 confusion values have an equal number of
    votes (5 vs 5), remove randomly 1 iteration and the equality will then be
    broken and a predominant confusion value will be chosen."""

df_paths = snakemake.input.confusion_val_df

all_mouse_sno_ids = pd.read_csv(df_paths[0], sep='\t')  # we take the first df here, but they all contain all the snoRNAs
all_mouse_sno_ids = list(all_mouse_sno_ids['gene_id_sno'])

conf_val = {}
for sno_id in all_mouse_sno_ids:
    temp_val = []
    for path in df_paths:
        df = pd.read_csv(path, sep='\t')
        df = df.filter(regex='gene_id_sno|^confusion_matrix_val')
        val = df[df['gene_id_sno'] == sno_id].values[0][1]
        temp_val.append(val)
    if 5 in coll.Counter(temp_val).values():  # if there is an equality in confusion value votes (ex: 5 TN vs 5 FP)
        del temp_val[0]  # remove first iteration 
        consensus = mode(temp_val)
        conf_val[sno_id] = consensus
    else:
        consensus = mode(temp_val)
        conf_val[sno_id] = consensus

# Create df from dict
final_df = pd.DataFrame(conf_val.items(), columns=['gene_id_sno', 'consensus_confusion_value'])
final_df.to_csv(snakemake.output.consensus_conf_val_df, sep='\t', index=False)
import pandas as pd
from pybedtools import BedTool

""" Get the sequences composing the potential terminal stems of snoRNAs into a
    combined fasta file. Each entry correspond to the left flanking region in
    reverse order followed by a '&' and the right flanking region in reverse
    order. This orders ensures that RNAcofold respects the 5'->3' opposite base
    pairing composing the potential terminal stems and create representative
    structure graphs. On the resulting graphs, the green sequence corresponds to
    the left flanking region and the red sequence corresponds to the right
    flanking region (the 5' of these sequences being at the extremity with no
    full dot inside the nucleotide on the graph; the 3' of these sequences being
    at the extermitiy where there is a full dot inside the nucleotide). """
col_names = ['chr', 'start', 'end', 'gene_id', 'score', 'strand', 'source', 'feature', 'score2', 'characteristics']
cd_left = BedTool(snakemake.input.flanking_cd_left)
cd_right = BedTool(snakemake.input.flanking_cd_right)
haca_left = BedTool(snakemake.input.flanking_haca_left)
haca_right = BedTool(snakemake.input.flanking_haca_right)

# Verify if there is a left and right sequence for all snoRNAs (sometimes the
# snoRNA is at the end of a chr, which makes it impossible to find a right sequence)
cd_left_df = pd.read_table(cd_left.fn, names=col_names)
cd_right_df = pd.read_table(cd_right.fn, names=col_names)
if list(cd_left_df.gene_id) != list(cd_right_df.gene_id):
    diff_id = list(set(list(cd_left_df.gene_id)) - set(list(cd_right_df.gene_id)))
    diff_id2 = list(set(list(cd_right_df.gene_id)) - set(list(cd_left_df.gene_id)))
    diff_ids = diff_id + diff_id2
    print(diff_ids)
    cd_left_df = cd_left_df[~cd_left_df['gene_id'].isin(diff_ids)]
    cd_right_df = cd_right_df[~cd_right_df['gene_id'].isin(diff_ids)]
    cd_left = BedTool.from_dataframe(cd_left_df)
    cd_right = BedTool.from_dataframe(cd_right_df)

# Get the sequences of the extended flanking regions of C/D snoRNAs
fasta_cd_left = cd_left.sequence(fi=snakemake.input.genome_fasta, s=True)
fasta_cd_right = cd_right.sequence(fi=snakemake.input.genome_fasta, s=True)

# For C/D snoRNAs, open both left and right flanking regions fasta file and append the right region
# to the right of the left region as reverse order strings separated by a '&'; this combined sequence is used by RNAcofold.
# The attribute "seqfn" points to the new fasta file created by sequence().
seqs_cd = []
with open(fasta_cd_left.seqfn, 'r') as file_left, open(fasta_cd_right.seqfn, 'r') as file_right:
    for line_left, line_right in zip(file_left, file_right):
        line_left = str(line_left)
        line_right = str(line_right)
        if (not line_left.startswith('>')) & (not line_right.startswith('>')):
            # Reverse the order of both flanking sequences with [::-1]
            co_seq = line_left[::-1] + "&" + line_right[::-1]  # this order lets RNAcofold evaluate the best base pairing
                                                # between the left and right extended flanking regions of snoRNAs
            co_seq = co_seq.replace('T', 'U')  # convert DNA to RNA
            co_seq = co_seq.replace('\n', '')  # remove new lines from string
            seqs_cd.append(co_seq)
print(len(seqs_cd))
# Create a dictionary of C/D snoRNAs id as keys and co_seq as values
cd_dictio = {}
cd_ids = pd.read_table(cd_left.fn, names=col_names)
for i, gene_id in enumerate(cd_ids['gene_id']):
    cd_dictio[gene_id] = seqs_cd[i]
cd_dictio = {'>'+ k: v for k, v in cd_dictio.items()}  # Add '>' in front of all sno id

# Append these C/D snoRNAs ids and co_seq into the output file
with open(snakemake.output.sequences, "a+") as file:  # a+ for append in new file
    for k, v in cd_dictio.items():
        file.write(k+'\n'+v+'\n')



# Get the sequences of the extended flanking regions of H/ACA snoRNAs
fasta_haca_left = haca_left.sequence(fi=snakemake.input.genome_fasta, s=True)
fasta_haca_right = haca_right.sequence(fi=snakemake.input.genome_fasta, s=True)

# For H/ACA snoRNAs, open both left and right flanking regions fasta file and append the right region
# to the right of the left region as reverse order strings separated by a '&'; this combined sequence is used by RNAcofold.
# The attribute "seqfn" points to the new fasta file created by sequence().
seqs_haca = []
with open(fasta_haca_left.seqfn, 'r') as file_left, open(fasta_haca_right.seqfn, 'r') as file_right:
    for line_left, line_right in zip(file_left, file_right):
        line_left = str(line_left)
        line_right = str(line_right)
        if (not line_left.startswith('>')) & (not line_right.startswith('>')):
            # Reverse the order of both flanking sequences with [::-1]
            co_seq = line_left[::-1] + "&" + line_right[::-1]  # this order lets RNAcofold evaluate the best base pairing
                                                # between the left and right extended flanking regions of snoRNAs
            co_seq = co_seq.replace('T', 'U')  # convert DNA to RNA
            co_seq = co_seq.replace('\n', '')  # remove new lines from string
            seqs_haca.append(co_seq)

# Create a dictionary of H/ACA snoRNAs id as keys and co_seq as values
haca_dictio = {}
haca_ids = pd.read_table(haca_left.fn, names=col_names)
for i, gene_id in enumerate(haca_ids['gene_id']):
    haca_dictio[gene_id] = seqs_haca[i]
haca_dictio = {'>'+ k: v for k, v in haca_dictio.items()}  # Add '>' in front of all sno id

# Append these H/ACA snoRNAs ids and co_seq into the output file
with open(snakemake.output.sequences, "a+") as file:  # a+ for append in output file already containing C/D snoRNAs co_seq
    for k, v in haca_dictio.items():
        file.write(k+'\n'+v+'\n')
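
# A minimal sketch of how one left/right flank pair is turned into the RNAcofold input
# above (toy DNA sequences for illustration):
left_flank = 'ATGCATGCATGCATGCATGC'   # left flanking region, 5'->3'
right_flank = 'GGATCCAAGGTTCCAAGGTT'  # right flanking region, 5'->3'
# Reverse both sequences and join them with '&' so RNAcofold pairs them in the orientation
# of a potential terminal stem, then convert DNA to RNA
co_seq = (left_flank[::-1] + '&' + right_flank[::-1]).replace('T', 'U')
print(co_seq)  # 'CGUACGUACGUACGUACGUA&UUGGAACCUUGGAACCUAGG'
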
import pandas as pd
import functions as ft

""" Stacked bar chart of all predicted expressed/not_expressed snoRNAs per species"""
species_ordered = ['pan_troglodytes', 'gorilla_gorilla', 'macaca_mulatta',
                    'oryctolagus_cuniculus', 'rattus_norvegicus', 'bos_taurus',
                    'ornithorhynchus_anatinus', 'gallus_gallus', 'xenopus_tropicalis',
                    'danio_rerio']
human_df = pd.read_csv(snakemake.input.human_labels, sep='\t')
mouse_df = pd.read_csv(snakemake.input.mouse_labels, sep='\t')
paths = snakemake.input.dfs
dfs = []
for i, path in enumerate(paths):
    species_name = path.split('/')[-1].split('_predicted_label')[0]
    df = pd.read_csv(path, sep='\t')
    df['species_name'] = species_name
    df = df[['predicted_label', 'species_name']]
    dfs.append(df)

# Concat dfs into 1 df
concat_df = pd.concat(dfs)

# Given a species name list, count the occurrences of each criteria value in specific_col
# of the df, after filtering the df on each species name in global_col
def count_list_species(initial_df, species_name_list, global_col, criteria, specific_col):
    """
    Create a list of lists: one nested list per value in species_name_list (filtered on
    global_col), containing the count of each criteria value found in specific_col.
    """
    df_list = []

    # Create a list of dfs, one per species in species_name_list (filtered on global_col)
    print(species_name_list)
    for val in species_name_list:
        temp_val = initial_df[initial_df[global_col] == val]
        df_list.append(temp_val)


    l = []
    for i, df in enumerate(df_list):
        temp = []
        for j, temp1 in enumerate(criteria):
            crit = df[df[specific_col] == temp1]
            crit = len(crit)
            temp.append(crit)
        l.append(temp)

    return l


# Generate a bar chart of categorical features with a hue of gene_biotype
counts_per_feature = count_list_species(concat_df, species_ordered, 'species_name',
                    list(snakemake.params.hue_color.keys()),
                    'predicted_label')
# Add human and mouse actual labels for comparison
human_expressed = len(human_df[human_df['abundance_cutoff_2'] == 'expressed'])
human_not_expressed = len(human_df[human_df['abundance_cutoff_2'] == 'not_expressed'])
mouse_expressed = len(mouse_df[mouse_df['abundance_cutoff'] == 'expressed'])
mouse_not_expressed = len(mouse_df[mouse_df['abundance_cutoff'] == 'not_expressed'])
counts_per_feature = [[human_expressed, human_not_expressed]] + [[mouse_expressed, mouse_not_expressed]] + counts_per_feature

# Convert to percent
percent = ft.percent_count(counts_per_feature)


# Get the total number of snoRNAs (for which we found snoRNA type) per species
total_nb_sno = str([sum(l) for l in counts_per_feature])
xtick_labels = ['homo_sapiens', 'mus_musculus'] + species_ordered
xtick_labels = [label.capitalize().replace('_', ' ') for label in xtick_labels]


ft.stacked_bar2(percent, xtick_labels,
                list(snakemake.params.hue_color.keys()), '', '',
                'Proportion of snoRNAs (%)', snakemake.params.hue_color, total_nb_sno,
                snakemake.output.bar)
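
# For reference, the per-species counting done by count_list_species above is equivalent
# to a pandas crosstab; an illustrative alternative on toy data (not the function used by
# the pipeline):
import pandas as pd

toy = pd.DataFrame({'species_name': ['danio_rerio', 'danio_rerio', 'gallus_gallus'],
                    'predicted_label': ['expressed', 'not_expressed', 'expressed']})
counts = pd.crosstab(toy['species_name'], toy['predicted_label'])
print(counts.reindex(['gallus_gallus', 'danio_rerio']).values.tolist())
# [[1, 0], [1, 1]] -> one nested list of label counts per species, as count_list_species returns
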
import pandas as pd
import functions as ft

df = pd.read_csv(snakemake.input.abundance_cutoff_df, sep='\t')
print(df)
biotypes = ['protein_coding', 'snRNA', 'snoRNA', 'tRNA', 'lncRNA']

# Keep only protein_coding, lncRNA, snRNA, snoRNA and tRNA
df = df[df['gene_biotype'].isin(biotypes)]
print(df)


# Generate a stacked bar chart of abundance status with one bar per gene_biotype
counts_per_feature = ft.count_list_x(df, 'gene_biotype',
                    list(snakemake.params.colors.keys()),
                    'abundance_cutoff_2')
percent = ft.percent_count(counts_per_feature)

ft.stacked_bar(percent, sorted(list(df['gene_biotype'].unique())),
                list(snakemake.params.colors.keys()), '', 'Biotype',
                'Proportion of RNAs (%)', snakemake.params.colors,
            snakemake.output.bar)
import pandas as pd
import functions as ft

df = pd.read_csv(snakemake.input.df, sep='\t')
snoRNA_type_df = pd.read_csv(snakemake.input.snoRNA_type_df, sep='\t')
sno_type = str(snakemake.wildcards.sno_type)
sno_type = sno_type[0] + '/' + sno_type[1:]

# Keep only C/D or H/ACA snoRNAs
df = df.merge(snoRNA_type_df, how='left', left_on='gene_id_sno', right_on='gene_id')
df = df[df['snoRNA_type'] == sno_type]

# Generate a stacked bar chart of snoRNA abundance status with a hue of host abundance status,
# per sno_type
counts_per_feature = ft.count_list_x(df, 'abundance_cutoff',
                    list(snakemake.params.hue_color.keys()),
                    'abundance_cutoff_host')
percent = ft.percent_count(counts_per_feature)

ft.stacked_bar(percent, sorted(list(df['abundance_cutoff'].unique())),
                list(snakemake.params.hue_color.keys()), '', 'Abundance status of snoRNAs',
                'Proportion of snoRNAs (%)', snakemake.params.hue_color,
            snakemake.output.bar)
import pandas as pd
import functions as ft

df = pd.read_csv(snakemake.input.df, sep='\t')
cd = df[df['sno_type'] == 'C/D']
haca = df[df['sno_type'] == 'H/ACA']

# Generate a bar chart of categorical features with a hue of abundance_cutoff_2
# for all snoRNAs
counts_per_feature = ft.count_list_x(df, 'abundance_cutoff_2',
                    list(snakemake.params.hue_color.keys()),
                    snakemake.wildcards.categorical_features)
percent = ft.percent_count(counts_per_feature)

ft.stacked_bar(percent, sorted(list(df['abundance_cutoff_2'].unique())),
                list(snakemake.params.hue_color.keys()), '', 'Abundance status of snoRNAs',
                'Proportion of snoRNAs (%)', snakemake.params.hue_color,
            snakemake.output.bar_categorical_features)


# Generate a bar chart of categorical features with a hue of abundance_cutoff_2
# for C/D and H/ACA snoRNAs separately

#For C/D
counts_per_feature_cd = ft.count_list_x(cd, 'abundance_cutoff_2',
                    list(snakemake.params.hue_color.keys()),
                    snakemake.wildcards.categorical_features)
percent_cd = ft.percent_count(counts_per_feature_cd)


ft.stacked_bar(percent_cd, sorted(list(cd['abundance_cutoff_2'].unique())),
                list(snakemake.params.hue_color.keys()), '', 'Abundance status of C/D snoRNAs',
                'Proportion of snoRNAs (%)', snakemake.params.hue_color,
                snakemake.output.bar_categorical_features_cd)


#For H/ACA
counts_per_feature_haca = ft.count_list_x(haca, 'abundance_cutoff_2',
                    list(snakemake.params.hue_color.keys()),
                    snakemake.wildcards.categorical_features)
percent_haca = ft.percent_count(counts_per_feature_haca)

ft.stacked_bar(percent_haca, sorted(list(haca['abundance_cutoff_2'].unique())),
                list(snakemake.params.hue_color.keys()), '', 'Abundance status of H/ACA snoRNAs',
                'Proportion of snoRNAs (%)', snakemake.params.hue_color,
                snakemake.output.bar_categorical_features_haca)
import pandas as pd
import functions as ft

df = pd.read_csv(snakemake.input.df, sep='\t')
snoRNA_type_df = pd.read_csv(snakemake.input.snoRNA_type_df, sep='\t')
sno_type = str(snakemake.wildcards.sno_type)
sno_type = sno_type[0] + '/' + sno_type[1:]

# Keep only C/D or H/ACA snoRNAs
df = df.merge(snoRNA_type_df, how='left', left_on='gene_id_sno', right_on='gene_id')
df = df[df['snoRNA_type'] == sno_type]

# Generate a stacked bar chart of predicted snoRNA abundance status with a hue of host abundance status,
# per sno_type
counts_per_feature = ft.count_list_x(df, 'predicted_label',
                    list(snakemake.params.hue_color.keys()),
                    'abundance_cutoff_host')
percent = ft.percent_count(counts_per_feature)

ft.stacked_bar(percent, sorted(list(df['predicted_label'].unique())),
                list(snakemake.params.hue_color.keys()), '', 'Abundance status of snoRNAs',
                'Proportion of snoRNAs (%)', snakemake.params.hue_color,
            snakemake.output.bar)
import pandas as pd
import functions as ft
from scipy.stats import fisher_exact

FP_path = [path for path in snakemake.input.confusion_value_df if 'FP' in path][0]
TN_path = [path for path in snakemake.input.confusion_value_df if 'TN' in path][0]
FP, TN = pd.read_csv(FP_path, sep='\t'), pd.read_csv(TN_path, sep='\t')
multi_HG_df = pd.read_csv(snakemake.input.multi_HG_df, sep='\t')
multi_HG_sno = list(multi_HG_df.gene_id_sno)
color_dict = snakemake.params.color_dict

# Create tables required for the Fisher's exact test contingency table
FP.loc[FP['gene_id_sno'].isin(multi_HG_sno), 'multi_HG_different_labels'] = 'yes'
FP['multi_HG_different_labels'] = FP['multi_HG_different_labels'].fillna('no')

TN.loc[TN['gene_id_sno'].isin(multi_HG_sno), 'multi_HG_different_labels'] = 'yes'
TN['multi_HG_different_labels'] = TN['multi_HG_different_labels'].fillna('no')

# Create contingency table
table = ft.fisher_contingency(FP, TN, 'multi_HG_different_labels', 'yes')
oddsratio, p_val = fisher_exact(table)

print(f'p-val is {p_val}')

# Create bar chart
counts_per_conf_val = [list(table['group1']), list(table['group2'])]
percent = ft.percent_count(counts_per_conf_val)

ft.stacked_bar(percent, ['FP', 'TN'], color_dict.keys(),
                '', 'Confusion value',
                'Proportion of snoRNAs within \n multi-intronic HG (%)', color_dict.values(),
            snakemake.output.bar)
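
# A minimal sketch of the 2x2 contingency table fed to fisher_exact above (illustrative
# counts only):
from scipy.stats import fisher_exact

#            multi-HG yes   multi-HG no
toy_table = [[12,            30],    # FP snoRNAs
             [5,             80]]    # TN snoRNAs
oddsratio, p_val = fisher_exact(toy_table)
print(f'odds ratio: {oddsratio:.2f}; p-value: {p_val:.4f}')
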
import pandas as pd
import functions as ft

df = pd.read_csv(snakemake.input.df, sep='\t')
feature = snakemake.wildcards.intronic_features
limits = [[0, 25, '[0; 25['], [25, 50, '[25; 50['], [50, 100, '[50; 100['],
            [100, 250, '[100; 250['], [250, 500, '[250; 500['],
            [500, 1000, '[500; 1000['], [1000, 5000, '[1000; 5000['],
            [5000, 10000, '[5000; 10000['], [10000, 100000, '[10000; 100000['],
            [100000, 1000000000, '>100000']]
labels = []
# Create ranges of data for intronic features (with large scale of data)
for i, limit in enumerate(limits):
    lower, upper, label = limit[0], limit[1], limit[2]
    df.loc[(df[feature] >= lower) & (df[feature] < upper), feature+'_range'] = label
    labels.append(label)

# Drop intergenic snoRNAs
df = df.dropna(subset=[feature])
cd = df[df['sno_type'] == 'C/D']
haca = df[df['sno_type'] == 'H/ACA']


# Generate a grouped bar chart of intronic features with a hue of abundance_cutoff_2
# for C/D and H/ACA snoRNAs separately

#For C/D
counts_cd = ft.count_list_x_unsorted(cd, list(snakemake.params.hue_color.keys()),
                'abundance_cutoff_2', labels, feature+'_range')

ft.bar_from_lst_of_lst(counts_cd, [0.15, 0.45], list(snakemake.params.hue_color.values()), 0.3, labels,
                        feature+'_range', "Number of snoRNAs",
                        list(snakemake.params.hue_color.keys()), snakemake.output.bar_cd)


#For H/ACA
counts_haca = ft.count_list_x_unsorted(haca, list(snakemake.params.hue_color.keys()),
                'abundance_cutoff_2', labels, feature+'_range')

ft.bar_from_lst_of_lst(counts_haca, [0.15, 0.45], list(snakemake.params.hue_color.values()), 0.3, labels,
                        feature+'_range', "Number of snoRNAs",
                        list(snakemake.params.hue_color.keys()), snakemake.output.bar_haca)
import pandas as pd
import functions as ft
import numpy as np

confusion_val = snakemake.wildcards.confusion_value
confusion_value_sno_df = pd.read_csv(snakemake.input.confusion_value_df, sep='\t')
confusion_value_sno = list(confusion_value_sno_df.gene_id_sno)
shap_value_paths = snakemake.input.shap_values
shap_dfs = []

for i, path in enumerate(shap_value_paths):
    model, iteration = path.split('/')[-1].split('_shap_values')[0].rsplit('_', 1)
    df = pd.read_csv(path, sep='\t')
    df['model'] = model
    df['iteration'] = iteration
    shap_dfs.append(df)

concat_shap = pd.concat(shap_dfs)

# Keep only snoRNAs that are part of the given confusion value
concat_shap = concat_shap[concat_shap['gene_id_sno'].isin(confusion_value_sno)]

# Convert all SHAP values to absolute values of these SHAP values
concat_shap.update(concat_shap.select_dtypes(include=[np.number]).abs())

# Get the top 1, 2 and 3 feature (i.e. highest abs(SHAP value) across features) for each snoRNA
concat_shap['top_1'] = concat_shap.filter(regex="_SHAP$").apply(lambda row: row[row == row.nlargest(1).values[-1]].index[0], axis=1)
concat_shap['top_2'] = concat_shap.filter(regex="_SHAP$").apply(lambda row: row[row == row.nlargest(2).values[-1]].index[0], axis=1)
concat_shap['top_3'] = concat_shap.filter(regex="_SHAP$").apply(lambda row: row[row == row.nlargest(3).values[-1]].index[0], axis=1)

# Drop snoRNAs that have the same model and top1/2/3 (so it does not bias the global portrait
# if a snoRNA is present in multiple iterations and always predicted based on the same features)
concat_shap = concat_shap.drop_duplicates(subset=['top_1', 'top_2', 'top_3', 'model'])
concat_shap.to_csv(snakemake.output.df, sep='\t', index=False)
print(concat_shap)
len_df = len(concat_shap)

# Create a list of list containing the relative number of times a feature is classified as a top 1, 2 or 3 feature
'''
relative_number_tops = []
for feature in concat_shap.filter(regex="_SHAP$").columns:
    print(feature)
    temp = []
    for top in ['top_1', 'top_2', 'top_3']:
        number_in_top_x = len(concat_shap[concat_shap[top] == feature])
        relative_number = (number_in_top_x / len_df) * 100
        temp.append(relative_number)
    relative_number_tops.append(temp)
print(relative_number_tops)
'''
relative_number_tops = []
for top in ['top_1', 'top_2', 'top_3']:
    temp = []
    for feature in concat_shap.filter(regex="_SHAP$").columns:
        number_in_top_x = len(concat_shap[concat_shap[top] == feature])
        relative_number = (number_in_top_x / len_df) * 100
        temp.append(relative_number)
    relative_number_tops.append(temp)
print(relative_number_tops)


# Create grouped bar chart
xticklabels = [feat.split('_norm_')[0] for feat in concat_shap.filter(regex="_SHAP$").columns]
ft.bar_from_lst_of_lst(relative_number_tops, [0.15, 0.3, 0.45], ['blue', 'green', 'pink'], 0.2, xticklabels, 'Features',
        f'Proportion of all predicted {confusion_val} snoRNAs (%)',
        ['1st most predictive feature', '2nd most predictive feature', '3rd most predictive feature'], snakemake.output.bar)
import pandas as pd
import matplotlib.pyplot as plt
import shap
import numpy as np

X_test_paths = snakemake.input.X_test
shap_iterations_paths = snakemake.input.shap_values
feature_df = pd.read_csv(snakemake.input.df, sep='\t')
cd_sno = feature_df[feature_df['sno_type'] == 'C/D']['gene_id_sno'].to_list()
haca_sno = feature_df[feature_df['sno_type'] == 'H/ACA']['gene_id_sno'].to_list()

# Load all manual split iterations dfs and select only C/D or H/ACA (shap values and feature values)
shap_values_all_iterations, X_test_snotype_all_iterations = [], []
for i, df_path in enumerate(X_test_paths):
    X_test = pd.read_csv(X_test_paths[i], sep='\t', index_col='gene_id_sno')
    shap_iteration = pd.read_csv(shap_iterations_paths[i], sep='\t', index_col='gene_id_sno')

    # Split between C/D and H/ACA snoRNAs
    if snakemake.wildcards.sno_type == "CD":
        X_test_snotype = X_test[X_test.index.isin(cd_sno)]
        shap_iteration_sno_type = shap_iteration[shap_iteration.index.isin(cd_sno)]
    elif snakemake.wildcards.sno_type == "HACA":
        X_test_snotype = X_test[X_test.index.isin(haca_sno)]
        shap_iteration_sno_type = shap_iteration[shap_iteration.index.isin(haca_sno)]

    X_test_snotype_all_iterations.append(X_test_snotype)
    shap_values_all_iterations.append(shap_iteration_sno_type)

# Concat values of all 10 iterations in a df
final_shap_values = np.concatenate(shap_values_all_iterations, axis=0)
final_X_test_snotype = pd.concat(X_test_snotype_all_iterations)  # Concat vertically all X_test_snotype dfs to infer feature value in the summary plot

# Create summary bar plot
plt.rcParams['svg.fonttype'] = 'none'
fig, ax = plt.subplots(1, 1, figsize=(15, 15))
shap.summary_plot(final_shap_values, final_X_test_snotype, plot_type="bar", show=False, max_display=50, color="#969696")
plt.savefig(snakemake.output.summary_plot, bbox_inches='tight', dpi=600)
import pandas as pd
import functions as ft

df = pd.read_csv(snakemake.input.all_features, sep='\t')
haca = df[df['sno_type'] == 'H/ACA']
colors_dict = snakemake.params.ab_status_color

ft.bivariate_density(haca, 'conservation_score', 'sno_mfe', 'abundance_cutoff_2',
                    snakemake.output.bivariate_density, palette=colors_dict,
                    edgecolor='grey', xlim=(-0.1,1.1), ylim=(-175,0))


expressed = haca[haca['abundance_cutoff_2'] == 'expressed']
print('Average conservation across expressed H/ACA', expressed['conservation_score'].mean())
print('Average sno_mfe across expressed H/ACA', expressed['sno_mfe'].mean())
import pandas as pd
import functions as ft

""" Create a bar plot to compare confusion values for all
    categorical features in the top 10 predictive features."""

color_dict = snakemake.params.color_dict
output = snakemake.output.bar
categorical_feature = snakemake.wildcards.top_10_categorical_features
feature_df = pd.read_csv(snakemake.input.feature_df, sep='\t')
feature_df = feature_df[['gene_id_sno', categorical_feature]]
confusion_value_df = pd.read_csv(snakemake.input.sno_per_confusion_value, sep='\t')

# Get df of all snoRNAs of a given confusion_value inside dict
sno_per_confusion_value = {}
for conf_val in ['TN', 'TP', 'FN', 'FP']:
    df_temp = confusion_value_df[confusion_value_df['confusion_matrix'] == conf_val]
    sno_list = df_temp['gene_id_sno'].to_list()
    df = feature_df[feature_df['gene_id_sno'].isin(sno_list)]
    sno_per_confusion_value[conf_val] = df

dfs = [sno_per_confusion_value['TN'], sno_per_confusion_value['TP'],
        sno_per_confusion_value['FN'], sno_per_confusion_value['FP']]

# Count feature values to create the bar chart in the order (TN, TP, FN, FP)
count_list = []
for conf_val_df in dfs:
    true_condition = conf_val_df[conf_val_df[categorical_feature] == 1.0]  # ex: "host is expressed" is true
    false_condition = conf_val_df[conf_val_df[categorical_feature] == 0.0]  # ex: "host is expressed" is false
    temp = [len(true_condition), len(false_condition)]
    count_list.append(temp)

percent_count = ft.percent_count(count_list)

# Create bar plot
colors = [color_dict['True'], color_dict['False']]
ft.stacked_bar(percent_count, ['TN', 'TP', 'FN', 'FP'],
                ["True", "False"], f'{categorical_feature}',
                'Confusion value snoRNA group', 'Proportion of snoRNAs (%)', colors, output)
import pandas as pd
import functions as ft

""" Create a bar plot per confusion value comparison (e.g. FP vs TP) for all
    categorical features in the top 10 predictive features per snoRNA type C/D vs
    H/ACA). Each comparison counts only one time a snoRNA (ex: it considers a TP
    snoRNA once even if it is predicted multiple time as a TP across iterations)."""

sno_type = snakemake.wildcards.sno_type
sno_type = sno_type[0] + '/' + sno_type[1:]
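# e.g. the 'CD' wildcard becomes 'C/D' and 'HACA' becomes 'H/ACA', matching the one-hot encoded snoRNA type columns used below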
color_dict = snakemake.params.color_dict
output = snakemake.output.bar
categorical_feature = snakemake.wildcards.top_10_categorical_features
comparison = snakemake.wildcards.comparison_confusion_val
feature_df = pd.read_csv(snakemake.input.feature_df, sep='\t')

# Select only one snoRNA type
feature_df = feature_df[feature_df[sno_type] == 1.0]
feature_df = feature_df[['gene_id_sno', categorical_feature]]


# Get the list of all snoRNAs of a given confusion_value inside dict
sno_per_confusion_value_paths = snakemake.input.sno_per_confusion_value
sno_per_confusion_value = {}
for path in sno_per_confusion_value_paths:
    confusion_value = path.split('/')[-1]
    confusion_value = confusion_value.split('_')[0]
    df = pd.read_csv(path, sep='\t')
    sno_list = df['gene_id_sno'].to_list()
    sno_per_confusion_value[confusion_value] = sno_list


# Get the snoRNA feature value for each confusion value in the comparison
confusion_val1, confusion_val2_3 = comparison.split('_vs_')
confusion_val2, confusion_val3 = confusion_val2_3.split('_')  # this is respectively TN and TP
df1 = feature_df[feature_df['gene_id_sno'].isin(sno_per_confusion_value[confusion_val1])]  # this removes duplicates (snoRNA is counted only once per confusion value)
df2 = feature_df[feature_df['gene_id_sno'].isin(sno_per_confusion_value[confusion_val2])]
df3 = feature_df[feature_df['gene_id_sno'].isin(sno_per_confusion_value[confusion_val3])]


# Get only snoRNAs that are always predicted as their confusion value
# (i.e. remove snoRNAs that are for example predicted in an iteration as FP and in another as TN)
if confusion_val1 == 'FP':
    all_fp = df1['gene_id_sno'].to_list()
    all_tn = df2['gene_id_sno'].to_list()
    all_tp = df3['gene_id_sno'].to_list()
    all_fn = feature_df[feature_df['gene_id_sno'].isin(sno_per_confusion_value['FN'])]
    all_fn = list(pd.unique(all_fn['gene_id_sno']))
    real_fp = list(set(all_fp) - set(all_tn))
    real_tn = list(set(all_tn) - set(all_fp))
    real_tp = list(set(all_tp) - set(all_fn))
    df1 = df1[df1['gene_id_sno'].isin(real_fp)]
    df2 = df2[df2['gene_id_sno'].isin(real_tn)]
    df3 = df3[df3['gene_id_sno'].isin(real_tp)]
elif confusion_val1 == 'FN':
    all_fn = df1['gene_id_sno'].to_list()
    all_tn = df2['gene_id_sno'].to_list()
    all_tp = df3['gene_id_sno'].to_list()
    all_fp = feature_df[feature_df['gene_id_sno'].isin(sno_per_confusion_value['FP'])]
    all_fp = list(pd.unique(all_fp['gene_id_sno']))
    real_fn = list(set(all_fn) - set(all_tp))
    real_tn = list(set(all_tn) - set(all_fp))
    real_tp = list(set(all_tp) - set(all_fn))
    df1 = df1[df1['gene_id_sno'].isin(real_fn)]
    df2 = df2[df2['gene_id_sno'].isin(real_tn)]
    df3 = df3[df3['gene_id_sno'].isin(real_tp)]


# Count feature values to create the bar chart in the order (TN, FP/FN, TP)
count_list = []
for conf_val_df in [df2, df1, df3]:
    true_condition = conf_val_df[conf_val_df[categorical_feature] == 1.0]  # ex: "host is expressed" is true
    false_condition = conf_val_df[conf_val_df[categorical_feature] == 0.0]  # ex: "host is expressed" is false
    temp = [len(true_condition), len(false_condition)]
    count_list.append(temp)

percent_count = ft.percent_count(count_list)

# Create bar plot
colors = [color_dict['True'], color_dict['False']]
ft.stacked_bar(percent_count, [confusion_val2, confusion_val1, confusion_val3],
                ["True", "False"], f'{categorical_feature} ({sno_type})',
                'Confusion value snoRNA group', 'Proportion of snoRNAs (%)', colors, output)
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft
import numpy as np
import seaborn as sns

rank_df = pd.read_csv(snakemake.input.rank_features_df, sep='\t')
rank_df[['feature2', 'norm']] = rank_df['feature'].str.split('_norm', expand=True)
rank_df = rank_df.drop(columns=['norm', 'feature'])

feature_distribution = {}
for i, group in enumerate(rank_df.groupby('feature2')['feature_rank']):
    feature_name = group[0]
    range_ = group[1].max() - group[1].min()
    median_ = group[1].median()
    feature_distribution[feature_name] = [median_, range_]
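# feature_distribution maps each feature to [median rank, rank range] computed across all models and iterations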

# Order features by increasing median value of feature_ranks and by range as second sort if two features have the same median
# i.e. the same order as in the violin plot of feature rank
feature_distribution_df = pd.DataFrame.from_dict(feature_distribution, columns = ['median', 'range'], orient='index')
ordered_features = feature_distribution_df.sort_values(by=['median', 'range'], ascending=[True, True]).index.to_list()
print(ordered_features)


models = ['log_reg', 'svc', 'rf']
outputs = snakemake.output.heatmaps
for mod in models:
    output = [path for path in outputs if mod in path][0]
    iterations_df = rank_df[rank_df['model'].str.startswith(mod)]
    print(list(iterations_df['feature2']))
    pivot = iterations_df.pivot(index='model', columns='feature2', values='feature_rank')
    pivot = pivot[ordered_features]
    print(pivot)
    correlation_df = pivot.corr(method='spearman')
    print(correlation_df)
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots()
    sns.clustermap(correlation_df, cmap='viridis', cbar_kws={'label': "Feature rank correlation\n(Spearman's ρ)"},
                row_cluster=False)
    plt.xticks(fontsize=8)
    plt.yticks(fontsize=8)
    plt.xlabel(xlabel="Features")
    plt.ylabel(ylabel="Features")
    plt.savefig(output, dpi=600, bbox_inches='tight')
import pandas as pd
import functions as ft
from scipy import stats as st
import numpy as np

tpm_df = pd.read_csv(snakemake.input.tpm_df, sep='\t')
feature_df = pd.read_csv(snakemake.input.all_features, sep='\t')

# Create average TPM (and log10) columns in tpm_df and merge that column to feature_df
tpm_df['avg_tpm'] = tpm_df.filter(regex='^[A-Z].*_[1-3]$', axis=1).mean(axis=1)
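# The regex keeps replicate abundance columns (assumed to be named like 'Tissue_1' to 'Tissue_3',
# i.e. a capitalized sample name followed by a replicate number) and averages them per snoRNA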
tpm_df = tpm_df[['gene_id', 'avg_tpm']]
feature_df = feature_df.merge(tpm_df, how='left', left_on='gene_id_sno', right_on='gene_id')
feature_df = feature_df.drop(['gene_id'], axis=1)
feature_df['avg_tpm_log10'] = np.log10(feature_df['avg_tpm'])

# Get expressed snoRNAs
expressed = feature_df[feature_df['abundance_cutoff_2'] == "expressed"]


# Create violin plots to compare the abundance of expressed snoRNAs per snoRNA type
ft.violin(expressed, "sno_type", "avg_tpm_log10", None, None, "Type of snoRNA",
                "Average abundance across \n tissues (log10(TPM))", "",
                snakemake.params.colors, ['black'], snakemake.output.violin)
import functions as ft
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import collections as coll
from scipy.stats import fisher_exact

cols_eclip = ['chr', 'start', 'end', 'gene_id', 'dot', 'strand', 'source', 'feature',
        'dot2', 'gene_info', 'chr_rbp', 'start_rbp', 'end_rbp', 'score_rbp', 'signalValue_rbp', 'strand_rbp', 'pval_rbp']
cols_par_clip = ['chr', 'start', 'end', 'gene_id', 'dot', 'strand', 'source', 'feature',
        'dot2', 'gene_info', 'chr_rbp', 'start_rbp', 'end_rbp', 'strand_rbp', 'score_rbp', 'strand_rbp_duplicated']
dkc1_eclip = pd.read_csv(snakemake.input.dkc1_eclip_overlap, sep='\t', names=cols_eclip)
dkc1_par_clip = pd.read_csv(snakemake.input.dkc1_par_clip_overlap, sep='\t', names=cols_par_clip)
nop58_par_clip = pd.read_csv(snakemake.input.nop58_par_clip_overlap, sep='\t', names=cols_par_clip)
fbl_par_clip = pd.read_csv(snakemake.input.fbl_par_clip_overlap, sep='\t', names=cols_par_clip)
nop56_par_clip = pd.read_csv(snakemake.input.nop56_par_clip_overlap, sep='\t', names=cols_par_clip)
df = pd.read_csv(snakemake.input.df, sep='\t')
cd, haca = df[df['sno_type'] == 'C/D'].copy(), df[df['sno_type'] == 'H/ACA'].copy()  # .copy() so the RBP-binding columns added below do not trigger SettingWithCopyWarning


# Filter PAR-CLIP peaks based on enrichment score
dkc1_par_clip = dkc1_par_clip[dkc1_par_clip['score_rbp'] > 10]
nop58_par_clip = nop58_par_clip[nop58_par_clip['score_rbp'] > 10]
nop56_par_clip = nop56_par_clip[nop56_par_clip['score_rbp'] > 10]
fbl_par_clip = fbl_par_clip[fbl_par_clip['score_rbp'] > 10]


# Define if sno is bound by RBP in given dataset
haca.loc[haca['gene_id_sno'].isin(list(dkc1_eclip.gene_id)), 'DKC1_eCLIP'] = "RBP is bound to snoRNA"
haca.loc[~haca['gene_id_sno'].isin(list(dkc1_eclip.gene_id)), 'DKC1_eCLIP'] = "RBP is not bound to snoRNA"
haca.loc[haca['gene_id_sno'].isin(list(dkc1_par_clip.gene_id)), 'DKC1_PAR_CLIP'] = "RBP is bound to snoRNA"
haca.loc[~haca['gene_id_sno'].isin(list(dkc1_par_clip.gene_id)), 'DKC1_PAR_CLIP'] = "RBP is not bound to snoRNA"

cd.loc[cd['gene_id_sno'].isin(list(nop58_par_clip.gene_id)), 'NOP58_PAR_CLIP'] = "RBP is bound to snoRNA"
cd.loc[~cd['gene_id_sno'].isin(list(nop58_par_clip.gene_id)), 'NOP58_PAR_CLIP'] = "RBP is not bound to snoRNA"
cd.loc[cd['gene_id_sno'].isin(list(fbl_par_clip.gene_id)), 'FBL_PAR_CLIP'] = "RBP is bound to snoRNA"
cd.loc[~cd['gene_id_sno'].isin(list(fbl_par_clip.gene_id)), 'FBL_PAR_CLIP'] = "RBP is not bound to snoRNA"
cd.loc[cd['gene_id_sno'].isin(list(nop56_par_clip.gene_id)), 'NOP56_PAR_CLIP'] = "RBP is bound to snoRNA"
cd.loc[~cd['gene_id_sno'].isin(list(nop56_par_clip.gene_id)), 'NOP56_PAR_CLIP'] = "RBP is not bound to snoRNA"


# Generate RBP_binding bar chart comparison between expressed and not expressed snoRNAs 
def global_stacked_bar(df, hue_col, color_dict, sep_col, title, xlabel, ylabel, output_path):
    counts_per_feature = ft.count_list_x(df, hue_col, list(color_dict.keys()), sep_col)
    percent = ft.percent_count(counts_per_feature)
    ft.stacked_bar(percent, sorted(list(df[hue_col].unique())),  # use the df argument (not the global haca) so calls with cd are correct
                    list(color_dict.keys()), title, xlabel,
                    ylabel, color_dict, output_path)

global_stacked_bar(haca, 'abundance_cutoff_2', snakemake.params.RBP_binding_colors, 'DKC1_eCLIP', 'H/ACA with DKC1 eCLIP', 
                    'Expression status of snoRNAs', 'Proportion of expressed snoRNAs (%)', snakemake.output.bar_haca_dkc1_eclip)

global_stacked_bar(haca, 'abundance_cutoff_2', snakemake.params.RBP_binding_colors, 'DKC1_PAR_CLIP', 'H/ACA with DKC1 PAR-CLIP',
                    'Expression status of snoRNAs', 'Proportion of expressed snoRNAs (%)', snakemake.output.bar_haca_dkc1_par_clip)

global_stacked_bar(cd, 'abundance_cutoff_2', snakemake.params.RBP_binding_colors, 'NOP58_PAR_CLIP', 'C/D with NOP58 PAR-CLIP',
                    'Expression status of snoRNAs', 'Proportion of expressed snoRNAs (%)', snakemake.output.bar_cd_nop58_par_clip)

global_stacked_bar(cd, 'abundance_cutoff_2', snakemake.params.RBP_binding_colors, 'FBL_PAR_CLIP', 'C/D with FBL PAR-CLIP',
                    'Expression status of snoRNAs', 'Proportion of expressed snoRNAs (%)', snakemake.output.bar_cd_fbl_par_clip)

global_stacked_bar(cd, 'abundance_cutoff_2', snakemake.params.RBP_binding_colors, 'NOP56_PAR_CLIP', 'C/D with NOP56 PAR-CLIP',
                    'Expression status of snoRNAs', 'Proportion of expressed snoRNAs (%)', snakemake.output.bar_cd_nop56_par_clip)



def criteria_count(group1, group2, col, crit):
    "This creates the contingency table needed to perform Fisher's exact test"
    count1a = len(group1[group1[col] == crit])
    count1b = len(group1[group1[col] != crit])
    count2a = len(group2[group2[col] == crit])
    count2b = len(group2[group2[col] != crit])

    contingency = {'group1': [count1a, count1b], 'group2': [count2a, count2b]}  # avoid shadowing the built-in dict
    table = pd.DataFrame(data=contingency, index=[crit, '!= '+crit])
    print(table)
    return table


for binding_status in list(snakemake.params.RBP_binding_colors.keys()):
    table_haca_eclip = criteria_count(haca[haca['abundance_cutoff_2'] == 'expressed'], haca[haca['abundance_cutoff_2'] == 'not_expressed'], 'DKC1_eCLIP', binding_status)
    table_haca_par_clip = criteria_count(haca[haca['abundance_cutoff_2'] == 'expressed'], haca[haca['abundance_cutoff_2'] == 'not_expressed'], 'DKC1_PAR_CLIP', binding_status)
    table_cd_nop58 = criteria_count(cd[cd['abundance_cutoff_2'] == 'expressed'], cd[cd['abundance_cutoff_2'] == 'not_expressed'], 'NOP58_PAR_CLIP', binding_status)
    table_cd_fbl = criteria_count(cd[cd['abundance_cutoff_2'] == 'expressed'], cd[cd['abundance_cutoff_2'] == 'not_expressed'], 'FBL_PAR_CLIP', binding_status)
    table_cd_nop56 = criteria_count(cd[cd['abundance_cutoff_2'] == 'expressed'], cd[cd['abundance_cutoff_2'] == 'not_expressed'], 'NOP56_PAR_CLIP', binding_status)

    oddsratio_haca_eclip, p_val_haca_eclip = fisher_exact(table_haca_eclip)
    oddsratio_haca_par_clip, p_val_haca_par_clip = fisher_exact(table_haca_par_clip)
    oddsratio_cd_nop58, p_val_cd_nop58 = fisher_exact(table_cd_nop58)
    oddsratio_cd_fbl, p_val_cd_fbl = fisher_exact(table_cd_fbl)
    oddsratio_cd_nop56, p_val_cd_nop56 = fisher_exact(table_cd_nop56)

    print('\n'+binding_status)
    print(f"p = {p_val_haca_eclip} (Fisher's exact test) H/ACA eCLIP")
    print(f"p = {p_val_haca_par_clip} (Fisher's exact test) H/ACA PAR-CLIP")
    print(f"p = {p_val_cd_nop58} (Fisher's exact test) C/D NOP58 PAR-CLIP")
    print(f"p = {p_val_cd_fbl} (Fisher's exact test) C/D FBL PAR-CLIP")
    print(f"p = {p_val_cd_nop56} (Fisher's exact test) C/D NOP56 PAR-CLIP")
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import shap
import numpy as np
import subprocess as sp
import pickle

""" Create a shap decision plot per specific snoRNAs that are false negatives in
    all 4 models."""
output_path = snakemake.params.decision_plot_FN
log = snakemake.output.shap_local_FN_log
false_negatives = snakemake.params.false_negatives
sp.call("mkdir -p "+output_path+" &> "+log, shell=True)


# Generate the same CV, training and test sets (only the test set will be
# used in this script) that were generated in hyperparameter_tuning_cv and train_models
# (respectively 15%, 70% and 15% of all dataset examples)
df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next the total_train is split into train and test sets (1077 and 232 correspond
# to the number of examples in train and test sets respectively to get an
# approximately 70 % and 15 % of all examples in these two datasets)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train, test_size=232, train_size=1077, random_state=42, stratify=y_total_train)


# Unpickle and thus instantiate the model represented by the 'models' wildcard
# Instantiate the explainer using the X_train as background data and X_test to generate shap local values for one snoRNA
if snakemake.wildcards.models == "log_reg":
    model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer = shap.LinearExplainer(model, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
    for sno_id in false_negatives:
        shap_values = explainer.shap_values(X_test.loc[sno_id, :])  # Select one snoRNA
        plt.rcParams['svg.fonttype'] = 'none'
        fig, ax = plt.subplots(1, 1, figsize=(15, 15))
        shap.decision_plot(explainer.expected_value, shap_values,
                        X_test.loc[sno_id, :], show=False, feature_display_range=slice(-1, -50, -1), link='logit')
        plt.savefig(output_path+sno_id+"_"+snakemake.wildcards.models+"_all_features_test_set_100_background.svg", bbox_inches='tight', dpi=600)

else:
    model2 = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer2 = shap.KernelExplainer(model2.predict, shap.sample(X_train, 100, random_state=42)) # reduce number of background sample to 100
    for sno_id in false_negatives:
        shap_values2 = explainer2.shap_values(X_test.loc[sno_id, :])  # Select one snoRNA
        plt.rcParams['svg.fonttype'] = 'none'
        fig, ax = plt.subplots(1, 1, figsize=(15, 15))
        shap.decision_plot(explainer2.expected_value, shap_values2, X_test.loc[sno_id, :], show=False, feature_display_range=slice(-1, -50, -1))
        plt.savefig(output_path+sno_id+"_"+snakemake.wildcards.models+"_all_features_test_set_100_background.svg", bbox_inches='tight', dpi=600)
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import shap
import numpy as np
import subprocess as sp
import pickle

""" Create a shap decision plot per specific snoRNAs that are false negatives in
    all 4 models."""
output_path = snakemake.params.decision_plot_FN
log = snakemake.output.shap_local_FN_log
false_negatives = snakemake.params.false_negatives
sp.call("mkdir -p "+output_path+" &> "+log, shell=True)

X_train = pd.read_csv(snakemake.input.X_train, sep='\t', index_col='gene_id_sno')
X_test = pd.read_csv(snakemake.input.X_test, sep='\t', index_col='gene_id_sno')


# Unpickle and thus instantiate the model represented by the 'models' wildcard
# Instantiate the explainer using the X_train as background data and X_test to generate shap local values for one snoRNA
if snakemake.wildcards.models2 == "log_reg":
    model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer = shap.LinearExplainer(model, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
    for sno_id in false_negatives:
        shap_values = explainer.shap_values(X_test.loc[sno_id, :])  # Select one snoRNA
        plt.rcParams['svg.fonttype'] = 'none'
        fig, ax = plt.subplots(1, 1, figsize=(15, 15))
        shap.decision_plot(explainer.expected_value, shap_values,
                        X_test.loc[sno_id, :], show=False, feature_display_range=slice(-1, -50, -1), link='logit')
        plt.savefig(output_path+sno_id+"_"+snakemake.wildcards.models2+"_all_features_test_set_100_background.svg", bbox_inches='tight', dpi=600)

else:
    model2 = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer2 = shap.KernelExplainer(model2.predict, shap.sample(X_train, 100, random_state=42)) # reduce number of background sample to 100
    for sno_id in false_negatives:
        shap_values2 = explainer2.shap_values(X_test.loc[sno_id, :])  # Select one snoRNA
        plt.rcParams['svg.fonttype'] = 'none'
        fig, ax = plt.subplots(1, 1, figsize=(15, 15))
        shap.decision_plot(explainer2.expected_value, shap_values2, X_test.loc[sno_id, :], show=False, feature_display_range=slice(-1, -50, -1))
        plt.savefig(output_path+sno_id+"_"+snakemake.wildcards.models2+"_all_features_test_set_100_background.svg", bbox_inches='tight', dpi=600)
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import shap
import numpy as np
import subprocess as sp
import pickle

""" Create a shap decision plot per specific snoRNAs that are false positives in
    all 4 models."""
output_path = snakemake.params.decision_plot_FP
log = snakemake.output.shap_local_FP_log
false_positives = snakemake.params.false_positives
sp.call("mkdir -p "+output_path+" &> "+log, shell=True)


# Generate the same CV, training and test sets (only the test set will be
# used in this script) that were generated in hyperparameter_tuning_cv and train_models
# (respectively 15%, 70% and 15% of all dataset examples)
df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next the total_train is split into train and test sets (1077 and 232 correspond
# to the number of examples in train and test sets respectively to get an
# approximately 70 % and 15 % of all examples in these two datasets)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train, test_size=232, train_size=1077, random_state=42, stratify=y_total_train)


# Unpickle and thus instantiate the model represented by the 'models' wildcard
# Instantiate the explainer using the X_train as background data and X_test to generate shap local values for one snoRNA
if snakemake.wildcards.models == "log_reg":
    model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer = shap.LinearExplainer(model, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
    for sno_id in false_positives:
        shap_values = explainer.shap_values(X_test.loc[sno_id, :])  # Select one snoRNA
        plt.rcParams['svg.fonttype'] = 'none'
        fig, ax = plt.subplots(1, 1, figsize=(15, 15))
        shap.decision_plot(explainer.expected_value, shap_values,
                        X_test.loc[sno_id, :], show=False, feature_display_range=slice(-1, -50, -1), link='logit')
        plt.savefig(output_path+sno_id+"_"+snakemake.wildcards.models+"_all_features_test_set_100_background.svg", bbox_inches='tight', dpi=600)

else:
    model2 = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer2 = shap.KernelExplainer(model2.predict, shap.sample(X_train, 100, random_state=42)) # reduce number of background sample to 100
    for sno_id in false_positives:
        shap_values2 = explainer2.shap_values(X_test.loc[sno_id, :])  # Select one snoRNA
        plt.rcParams['svg.fonttype'] = 'none'
        fig, ax = plt.subplots(1, 1, figsize=(15, 15))
        shap.decision_plot(explainer2.expected_value, shap_values2, X_test.loc[sno_id, :], show=False, feature_display_range=slice(-1, -50, -1))
        plt.savefig(output_path+sno_id+"_"+snakemake.wildcards.models+"_all_features_test_set_100_background.svg", bbox_inches='tight', dpi=600)
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import shap
import numpy as np
import subprocess as sp
import pickle

""" Create a shap decision plot per specific snoRNAs that are false positives in
    all 4 models."""
output_path = snakemake.params.decision_plot_FP
log = snakemake.output.shap_local_FP_log
false_positives = snakemake.params.false_positives
sp.call("mkdir -p "+output_path+" &> "+log, shell=True)

X_train = pd.read_csv(snakemake.input.X_train, sep='\t', index_col='gene_id_sno')
X_test = pd.read_csv(snakemake.input.X_test, sep='\t', index_col='gene_id_sno')

# Unpickle and thus instantiate the model represented by the 'models' wildcard
# Instantiate the explainer using the X_train as background data and X_test to generate shap local values for one snoRNA
if snakemake.wildcards.models2 == "log_reg":
    model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer = shap.LinearExplainer(model, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
    for sno_id in false_positives:
        shap_values = explainer.shap_values(X_test.loc[sno_id, :])  # Select one snoRNA
        plt.rcParams['svg.fonttype'] = 'none'
        fig, ax = plt.subplots(1, 1, figsize=(15, 15))
        shap.decision_plot(explainer.expected_value, shap_values,
                        X_test.loc[sno_id, :], show=False, feature_display_range=slice(-1, -50, -1), link='logit')
        plt.savefig(output_path+sno_id+"_"+snakemake.wildcards.models2+"_all_features_test_set_100_background.svg", bbox_inches='tight', dpi=600)

else:
    model2 = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer2 = shap.KernelExplainer(model2.predict, shap.sample(X_train, 100, random_state=42)) # reduce number of background sample to 100
    for sno_id in false_positives:
        shap_values2 = explainer2.shap_values(X_test.loc[sno_id, :])  # Select one snoRNA
        plt.rcParams['svg.fonttype'] = 'none'
        fig, ax = plt.subplots(1, 1, figsize=(15, 15))
        shap.decision_plot(explainer2.expected_value, shap_values2, X_test.loc[sno_id, :], show=False, feature_display_range=slice(-1, -50, -1))
        plt.savefig(output_path+sno_id+"_"+snakemake.wildcards.models2+"_all_features_test_set_100_background.svg", bbox_inches='tight', dpi=600)
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors
import shap

""" Create a shap decision plot for SNORA77B for all models"""
output_path = snakemake.output.decision_plot
colors_dict = snakemake.params.colors_dict
colors_ = [colors_dict['not_expressed'], colors_dict['not_expressed'], colors_dict['expressed']]
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", colors_)
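# Repeating the 'not_expressed' colour keeps the lower half of the colormap flat before it blends into the 'expressed' colour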
sno_id = snakemake.wildcards.interesting_sno_ids
X_test_paths = snakemake.input.X_test
expected_value_paths = snakemake.input.expected_values
shap_value_paths = snakemake.input.shap_values

# Select expected_val, shap_val and feature value for the given snoRNA
expected_val, shap_val, X_test = [], [], []
for i, path in enumerate(shap_value_paths):
    model_name, iteration = path.split('/')[-1].split('_shap')[0].split('_manual_')
    shap_i = pd.read_csv(path, sep='\t', index_col='gene_id_sno')
    expected_val_i = pd.read_csv(expected_value_paths[i], sep='\t')
    X_test_i = pd.read_csv(X_test_paths[i], sep='\t', index_col='gene_id_sno')
    if sno_id in shap_i.index:
        X_test_i = X_test_i[X_test_i.index == sno_id]
        shap_i = shap_i[shap_i.index == sno_id]
        shap_i = shap_i.to_numpy()
        shap_val.append(shap_i)
        expected_val.append(expected_val_i)
        X_test.append(X_test_i)

temp_df = pd.read_csv(shap_value_paths[0], sep='\t', index_col='gene_id_sno')
col_names = temp_df.columns.to_list()
col_names = [col.split('_norm')[0] for col in col_names]

# For log_reg, the model output is in log odds, so we need to convert it to probability using the logit function
if snakemake.wildcards.models2 == "log_reg":
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    shap.decision_plot(expected_val[0].values[0][0], shap_val[0], X_test[0],
                    feature_names=col_names, show=False, feature_display_range=slice(-1, -50, -1), link='logit', plot_color=cmap)
    plt.savefig(output_path, bbox_inches='tight', dpi=600)

else: # For RF and SVC, the output is already in probability, so no need to convert log odds to probability
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    shap.decision_plot(expected_val[0].values[0][0], shap_val[0], X_test[0],
                    feature_names=col_names, show=False, feature_display_range=slice(-1, -50, -1), plot_color=cmap)
    plt.savefig(output_path, bbox_inches='tight', dpi=600)
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors
import shap
import numpy as np

""" Create a shap decision plot containing all snoRNAs in GAS5 (for each model)"""
sno_ids_list = snakemake.params.sno_ids
expected_value_paths = snakemake.input.expected_values
shap_value_paths = snakemake.input.shap_values
model = snakemake.wildcards.models2
decision_plot_output = snakemake.output.decision_plot
colors_dict = snakemake.params.colors_dict
colors_ = [colors_dict['not_expressed'], colors_dict['not_expressed'], colors_dict['expressed']]
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", colors_)

expected_val, shap_val = {}, {}
for i, path in enumerate(shap_value_paths):
    model_name, iteration = path.split('/')[-1].split('_shap')[0].split('_manual_')
    shap_i = pd.read_csv(path, sep='\t', index_col='gene_id_sno')
    expected_val_i = pd.read_csv(expected_value_paths[i], sep='\t')
    shap_val[iteration] = shap_i
    expected_val[iteration] = expected_val_i


def find_shap_values(shap_val_dict, expected_val_dict, sno_ids):
    """ Merge SHAP values of snoRNAs of interest (ex: all snoRNAs in same HG)
        into one list (per model) and do the same for expected values. The shap
        values are returned as a list of arrays, whereas the expected_vals are
        returned as a simple list (each element is the base value for a snoRNA
        in a given test set)."""
    shap_values, expected_vals = [None] * len(sno_ids), [None] * len(sno_ids)
    for i, sno_id in enumerate(sno_ids):
        for iteration_nb, shap_df in shap_val_dict.items():
            if sno_id in shap_df.index:  # if we find the snoRNA in that given test set
                shap_df_i = shap_df[shap_df.index == sno_id]
                shap_array_i = shap_df_i.to_numpy()
                shap_values[i] = shap_array_i
                expected_value_df = expected_val_dict[iteration_nb]
                expected_vals[i] = [expected_value_df.values[0][0]] * len(shap_df_i)
    expected_vals = [item for sublist in expected_vals for item in sublist]  # convert list of list into a simple list
    shap_values = np.concatenate(shap_values, axis=0)
    shap_values = list(shap_values[:, np.newaxis, :])
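    # Reshaped into a list of (1, n_features) arrays: shap.multioutput_decision_plot treats each array
    # as one "output", which is used here to draw one line per snoRNA of interest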

    return shap_values, expected_vals


shap_final, expected_val_final = find_shap_values(shap_val, expected_val, sno_ids_list)


# For log_reg, the model output is in log odds, so we need to convert it to probability using the logit function
col_names = shap_val['first'].columns.to_list()
col_names = [col.split('_norm')[0] for col in col_names]
if model in ['log_reg']:
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    plt.rcParams['svg.fonttype'] = 'none'
    shap.multioutput_decision_plot(expected_val_final, shap_final, 0, show=False,
                                feature_display_range=slice(-1, -50, -1), link='logit',
                                feature_names=col_names, plot_color=cmap)
    plt.savefig(decision_plot_output, bbox_inches='tight', dpi=600)
else:  # For RF and SVC, the output is already in probability, so no need to convert log odds to probability
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    plt.rcParams['svg.fonttype'] = 'none'
    shap.multioutput_decision_plot(expected_val_final, shap_final, 0, show=False,
                                feature_display_range=slice(-1, -50, -1),
                                feature_names=col_names, plot_color=cmap)
    plt.savefig(decision_plot_output, bbox_inches='tight', dpi=600)
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import shap
import numpy as np
import pickle

""" Create a shap decision plot per specific snoRNA per model."""

# Generate the same CV, training and test sets (only the test set will be
# used in this script) that were generated in hyperparameter_tuning_cv and train_models
# (respectively 15%, 70% and 15% of all dataset examples)
df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next the total_train is split into train and test sets (1077 and 232 correspond
# to the number of examples in train and test sets respectively to get an
# approximately 70 % and 15 % of all examples in these two datasets)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train, test_size=232, train_size=1077, random_state=42, stratify=y_total_train)


# Unpickle and thus instantiate the model represented by the 'models' wildcard
# Instantiate the explainer using the X_train as background data and X_test to generate shap local values for one snoRNA
if snakemake.wildcards.models == "log_reg":
    model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    #explainer = shap.LinearExplainer(model, X_train)  # Use whole X_train as background (quite longer than using the line below with subsampled background)
    explainer = shap.LinearExplainer(model, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
    shap_values = explainer.shap_values(X_test.loc['ENSG00000212498', :])  # Select one snoRNA
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    shap.decision_plot(explainer.expected_value, shap_values,
                    X_test.loc['ENSG00000212498', :], show=False, feature_display_range=slice(-1, -50, -1))
    plt.savefig(snakemake.output.decision_plot, bbox_inches='tight', dpi=600)

else:
    model2 = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    #explainer2 = shap.KernelExplainer(model2.predict, X_train)  # Use whole X_train as background (quite longer than using the line below with subsampled background)
    explainer2 = shap.KernelExplainer(model2.predict, shap.sample(X_train, 100, random_state=42)) # reduce number of background sample to 100
    shap_values2 = explainer2.shap_values(X_test.loc['ENSG00000212498', :])  # Select one snoRNA
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    shap.decision_plot(explainer2.expected_value, shap_values2, X_test.loc['ENSG00000212498', :], show=False, feature_display_range=slice(-1, -50, -1))
    plt.savefig(snakemake.output.decision_plot, bbox_inches='tight', dpi=600)
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
import matplotlib.pyplot as plt
import shap
import numpy as np

""" Create a clustered shap decision plot containing all snoRNAs of a given confusion value per model."""
log_reg_output, svc_output, rf_output = snakemake.output.shap_local_log_reg, snakemake.output.shap_local_svc, snakemake.output.shap_local_rf
conf_val = snakemake.wildcards.confusion_value
sno_per_confusion_value = snakemake.input.sno_per_confusion_value
conf_val_pair = {'FN': 'TP', 'TP': 'FN', 'FP': 'TN', 'TN': 'FP'}  # to help select only real confusion value
                                                                # (i.e. those always predicted as such across iterations and models)
conf_val_df = pd.read_csv([path for path in sno_per_confusion_value if conf_val in path][0], sep='\t')
conf_val_pair_df = pd.read_csv([path for path in sno_per_confusion_value if conf_val_pair[conf_val] in path][0], sep='\t')

# Load shap_val and expe dfs into dict for all models and iteration
shap_val, expected_val = {}, {}
shap_val_paths, expected_val_paths = snakemake.input.shap_values, snakemake.input.expected_value
for i, path in enumerate(shap_val_paths):
    model_name, iteration = path.split('/')[-1].split('_shap')[0].rsplit('_', maxsplit=1)
    shap_i = pd.read_csv(path, sep='\t', index_col='gene_id_sno')
    expected_val_i = pd.read_csv(expected_val_paths[i], sep='\t')
    if model_name not in shap_val.keys():
        shap_val[model_name] = {iteration: shap_i}
    else:
        shap_val[model_name][iteration] = shap_i
    if model_name not in expected_val.keys():
        expected_val[model_name] = {iteration: expected_val_i}
    else:
        expected_val[model_name][iteration] = expected_val_i



# Select only real confusion_value (ex: FN) (those always predicted as such across models and iterations)
real_conf_val = list(set(conf_val_df.gene_id_sno.to_list()) - set(conf_val_pair_df.gene_id_sno.to_list()))

# Get shap values and expected values in the desired format for the decision plot
def merge_shap_values(shap_val_dict, expected_val_dict, model_name_, real_confusion_values):
    """ Merge SHAP values of 10 iterations into one list (per model) and do the
        same for expected values. The shap values are returned as a list of arrays
        (each array corresponds to all the snoRNA of a given confusion_value in
        one test set (iteration)), whereas the expected_vals are returned as a
        simple list (each element is the base value for a snoRNA in a given test set)"""
    shap_values = [None] * len(shap_val_dict[model_name_].keys())
    expected_vals = [None] * len(shap_val_dict[model_name_].keys())
    for i, iteration_nb in enumerate(shap_val_dict[model_name_].keys()):
        df_shap_i = shap_val_dict[model_name_][iteration_nb]
        df_shap_i = df_shap_i[df_shap_i.index.isin(real_confusion_values)]
        conf_val_sno_nb = len(df_shap_i)
        shap_i_array = df_shap_i.to_numpy()
        shap_values[i] = shap_i_array
        expected_value_df = expected_val_dict[model_name_][iteration_nb]
        expected_vals[i] =  [expected_value_df.values[0][0]] * conf_val_sno_nb
    expected_vals = [item for sublist in expected_vals for item in sublist]  # convert list of list into a simple list
    shap_values = np.concatenate(shap_values, axis=0)
    shap_values = list(shap_values[:, np.newaxis, :])
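    # Each (1, n_features) array is treated as a separate "output" by shap.multioutput_decision_plot,
    # so every snoRNA of the confusion value gets its own line in the plot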
    return shap_values, expected_vals

shap_log_reg, expected_val_log_reg = merge_shap_values(shap_val, expected_val, 'log_reg', real_conf_val)
shap_svc, expected_val_svc = merge_shap_values(shap_val, expected_val, 'svc', real_conf_val)
shap_rf, expected_val_rf = merge_shap_values(shap_val, expected_val, 'rf', real_conf_val)

# For log_reg, the model output is in log odds, so we need to convert it to probability using the logit function
fig, ax = plt.subplots(1, 1, figsize=(15, 15))
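# feature_order='hclust' groups features with similar SHAP contribution patterns; link='logit' converts
# the log-odds output of logistic regression into probabilities on the plot's x axis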
shap.multioutput_decision_plot(expected_val_log_reg, shap_log_reg, 0, show=False,
                            feature_display_range=slice(-1, -50, -1), link='logit', feature_order='hclust',
                            feature_names=shap_val['log_reg']['first'].columns.to_list())
plt.savefig(log_reg_output, bbox_inches='tight', dpi=600)

# For RF and SVC, the output is already in probability, so no need to convert log odds to probability
fig, ax = plt.subplots(1, 1, figsize=(15, 15))
shap.multioutput_decision_plot(expected_val_svc, shap_svc, 0, show=False,
                            feature_display_range=slice(-1, -50, -1), feature_order='hclust',
                            feature_names=shap_val['log_reg']['first'].columns.to_list())
plt.savefig(svc_output, bbox_inches='tight', dpi=600)

fig, ax = plt.subplots(1, 1, figsize=(15, 15))
shap.multioutput_decision_plot(expected_val_rf, shap_rf, 0, show=False,
                            feature_display_range=slice(-1, -50, -1), feature_order='hclust',
                            feature_names=shap_val['log_reg']['first'].columns.to_list())
plt.savefig(rf_output, bbox_inches='tight', dpi=600)
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import shap
import numpy as np
import subprocess as sp
import pickle

""" Create a shap decision plot per specific snoRNAs that are true negatives in
    all 4 models."""
output_path = snakemake.params.decision_plot_TN
log = snakemake.output.shap_local_TN_log
true_negatives = snakemake.params.true_negatives
sp.call("mkdir -p "+output_path+" &> "+log, shell=True)


# Generate the same CV, training and test sets (only the test set will be
# used in this script) that were generated in hyperparameter_tuning_cv and train_models
# (respectively 15%, 70% and 15% of all dataset examples)
df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next the total_train is split into train and test sets (1077 and 232 correspond
# to the number of examples in train and test sets respectively to get an
# approximately 70 % and 15 % of all examples in these two datasets)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train, test_size=232, train_size=1077, random_state=42, stratify=y_total_train)


# Unpickle and thus instantiate the model represented by the 'models' wildcard
# Instantiate the explainer using the X_train as background data and X_test to generate shap local values for one snoRNA
if snakemake.wildcards.models == "log_reg":
    model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer = shap.LinearExplainer(model, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
    for sno_id in true_negatives:
        shap_values = explainer.shap_values(X_test.loc[sno_id, :])  # Select one snoRNA
        plt.rcParams['svg.fonttype'] = 'none'
        fig, ax = plt.subplots(1, 1, figsize=(15, 15))
        shap.decision_plot(explainer.expected_value, shap_values,
                        X_test.loc[sno_id, :], show=False, feature_display_range=slice(-1, -50, -1), link='logit')
        plt.savefig(output_path+sno_id+"_"+snakemake.wildcards.models+"_all_features_test_set_100_background.svg", bbox_inches='tight', dpi=600)

else:
    model2 = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer2 = shap.KernelExplainer(model2.predict, shap.sample(X_train, 100, random_state=42)) # reduce number of background sample to 100
    for sno_id in true_negatives:
        shap_values2 = explainer2.shap_values(X_test.loc[sno_id, :])  # Select one snoRNA
        plt.rcParams['svg.fonttype'] = 'none'
        fig, ax = plt.subplots(1, 1, figsize=(15, 15))
        shap.decision_plot(explainer2.expected_value, shap_values2, X_test.loc[sno_id, :], show=False, feature_display_range=slice(-1, -50, -1))
        plt.savefig(output_path+sno_id+"_"+snakemake.wildcards.models+"_all_features_test_set_100_background.svg", bbox_inches='tight', dpi=600)
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import shap
import numpy as np
import subprocess as sp
import pickle

""" Create a shap decision plot per specific snoRNAs that are true negatives in
    all 4 models."""
output_path = snakemake.params.decision_plot_TN
log = snakemake.output.shap_local_TN_log
true_negatives = snakemake.params.true_negatives
sp.call("mkdir -p "+output_path+" &> "+log, shell=True)

X_train = pd.read_csv(snakemake.input.X_train, sep='\t', index_col='gene_id_sno')
X_test = pd.read_csv(snakemake.input.X_test, sep='\t', index_col='gene_id_sno')


# Unpickle and thus instantiate the model represented by the 'models' wildcard
# Instantiate the explainer using the X_train as background data and X_test to generate shap local values for one snoRNA
if snakemake.wildcards.models2 == "log_reg":
    model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer = shap.LinearExplainer(model, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
    for sno_id in true_negatives:
        shap_values = explainer.shap_values(X_test.loc[sno_id, :])  # Select one snoRNA
        plt.rcParams['svg.fonttype'] = 'none'
        fig, ax = plt.subplots(1, 1, figsize=(15, 15))
        shap.decision_plot(explainer.expected_value, shap_values,
                        X_test.loc[sno_id, :], show=False, feature_display_range=slice(-1, -50, -1), link='logit')
        plt.savefig(output_path+sno_id+"_"+snakemake.wildcards.models2+"_all_features_test_set_100_background.svg", bbox_inches='tight', dpi=600)

else:
    model2 = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer2 = shap.KernelExplainer(model2.predict, shap.sample(X_train, 100, random_state=42)) # reduce number of background sample to 100
    for sno_id in true_negatives:
        shap_values2 = explainer2.shap_values(X_test.loc[sno_id, :])  # Select one snoRNA
        plt.rcParams['svg.fonttype'] = 'none'
        fig, ax = plt.subplots(1, 1, figsize=(15, 15))
        shap.decision_plot(explainer2.expected_value, shap_values2, X_test.loc[sno_id, :], show=False, feature_display_range=slice(-1, -50, -1))
        plt.savefig(output_path+sno_id+"_"+snakemake.wildcards.models2+"_all_features_test_set_100_background.svg", bbox_inches='tight', dpi=600)
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import shap
import numpy as np
import subprocess as sp
import pickle

""" Create a shap decision plot per specific snoRNAs that are true positives in
    all 4 models."""
output_path = snakemake.params.decision_plot_TP
log = snakemake.output.shap_local_TP_log
true_positives = snakemake.params.true_positives
sp.call("mkdir -p "+output_path+" &> "+log, shell=True)


# Generate the same CV, training and test sets (only the test set will be
# used in this script) that were generated in hyperparameter_tuning_cv and train_models
# (respectively 15%, 70% and 15% of all dataset examples)
df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next the total_train is split into train and test sets (1077 and 232 correspond
# to the number of examples in train and test sets respectively to get an
# approximately 70 % and 15 % of all examples in these two datasets)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train, test_size=232, train_size=1077, random_state=42, stratify=y_total_train)


# Unpickle and thus instantiate the model represented by the 'models' wildcard
# Instantiate the explainer using the X_train as background data and X_test to generate shap local values for one snoRNA
if snakemake.wildcards.models == "log_reg":
    model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer = shap.LinearExplainer(model, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
    for sno_id in true_positives:
        shap_values = explainer.shap_values(X_test.loc[sno_id, :])  # Select one snoRNA
        plt.rcParams['svg.fonttype'] = 'none'
        fig, ax = plt.subplots(1, 1, figsize=(15, 15))
        shap.decision_plot(explainer.expected_value, shap_values,
                        X_test.loc[sno_id, :], show=False, feature_display_range=slice(-1, -50, -1), link='logit')
        plt.savefig(output_path+sno_id+"_"+snakemake.wildcards.models+"_all_features_test_set_100_background.svg", bbox_inches='tight', dpi=600)

else:
    model2 = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer2 = shap.KernelExplainer(model2.predict, shap.sample(X_train, 100, random_state=42)) # reduce number of background sample to 100
    for sno_id in true_positives:
        shap_values2 = explainer2.shap_values(X_test.loc[sno_id, :])  # Select one snoRNA
        plt.rcParams['svg.fonttype'] = 'none'
        fig, ax = plt.subplots(1, 1, figsize=(15, 15))
        shap.decision_plot(explainer2.expected_value, shap_values2, X_test.loc[sno_id, :], show=False, feature_display_range=slice(-1, -50, -1))
        plt.savefig(output_path+sno_id+"_"+snakemake.wildcards.models+"_all_features_test_set_100_background.svg", bbox_inches='tight', dpi=600)
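A note on interpreting these plots: SHAP values are additive, so for any one snoRNA the explainer's expected value plus the sum of that snoRNA's SHAP values reconstructs the model output that the decision plot converges to (in log-odds space when link='logit' is used with the linear explainer, and in the predict output space for the kernel explainer). A quick sanity check along those lines (illustrative only, not part of the pipeline) would be:

# Illustrative SHAP additivity check (not part of the pipeline)
sno = true_positives[0]
single_shap = explainer.shap_values(X_test.loc[sno, :])
reconstructed = explainer.expected_value + single_shap.sum()
print(reconstructed)  # matches the model output explained for this snoRNA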
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import shap
import numpy as np
import subprocess as sp
import pickle

""" Create a shap decision plot per specific snoRNAs that are true positives in
    all 4 models."""
output_path = snakemake.params.decision_plot_TP
log = snakemake.output.shap_local_TP_log
true_positives = snakemake.params.true_positives
sp.call("mkdir -p "+output_path+" &> "+log, shell=True)

X_train = pd.read_csv(snakemake.input.X_train, sep='\t', index_col='gene_id_sno')
X_test = pd.read_csv(snakemake.input.X_test, sep='\t', index_col='gene_id_sno')


# Unpickle the trained model designated by the 'models2' wildcard
# Instantiate the explainer using 100 samples of X_train as background data and X_test to generate local SHAP values for each snoRNA
if snakemake.wildcards.models2 == "log_reg":
    model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer = shap.LinearExplainer(model, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
    for sno_id in true_positives:
        shap_values = explainer.shap_values(X_test.loc[sno_id, :])  # Select one snoRNA
        plt.rcParams['svg.fonttype'] = 'none'
        fig, ax = plt.subplots(1, 1, figsize=(15, 15))
        shap.decision_plot(explainer.expected_value, shap_values,
                        X_test.loc[sno_id, :], show=False, feature_display_range=slice(-1, -50, -1), link='logit')
        plt.savefig(output_path+sno_id+"_"+snakemake.wildcards.models2+"_all_features_test_set_100_background.svg", bbox_inches='tight', dpi=600)

else:
    model2 = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer2 = shap.KernelExplainer(model2.predict, shap.sample(X_train, 100, random_state=42)) # reduce number of background sample to 100
    for sno_id in true_positives:
        shap_values2 = explainer2.shap_values(X_test.loc[sno_id, :])  # Select one snoRNA
        plt.rcParams['svg.fonttype'] = 'none'
        fig, ax = plt.subplots(1, 1, figsize=(15, 15))
        shap.decision_plot(explainer2.expected_value, shap_values2, X_test.loc[sno_id, :], show=False, feature_display_range=slice(-1, -50, -1))
        plt.savefig(output_path+sno_id+"_"+snakemake.wildcards.models2+"_all_features_test_set_100_background.svg", bbox_inches='tight', dpi=600)
import pandas as pd
import functions as ft
import numpy as np
df = pd.read_csv(snakemake.input.df, sep='\t')
snoRNA_type_df = pd.read_csv(snakemake.input.snoRNA_type_df, sep='\t')
sno_type = str(snakemake.wildcards.sno_type)
sno_type = sno_type[0] + '/' + sno_type[1:]

# Keep only C/D or H/ACA snoRNAs
df = df.merge(snoRNA_type_df, how='left', left_on='gene_id_sno', right_on='gene_id')
df = df[df['snoRNA_type'] == sno_type]

# Generate a density plot of numerical features with a hue of abundance_cutoff
hues = list(pd.unique(df['abundance_cutoff']))
df_list = []
colors = []
color_dict = snakemake.params.hue_color

for hue in hues:
    temp_df = df[df['abundance_cutoff'] == hue][snakemake.wildcards.mouse_numerical_features]
    df_list.append(temp_df)
    color = color_dict[hue]
    colors.append(color)




ft.density_x(df_list, snakemake.wildcards.mouse_numerical_features, 'Density', 'linear', '',
        colors, hues, snakemake.output.density_features)
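The density plots in these scripts are drawn by helpers from the repository's functions.py module (imported as ft), which is not reproduced here. As a rough sketch only, assuming the helper wraps seaborn's KDE plotting, density_x presumably does something along these lines (the actual implementation may differ):

# Hypothetical sketch of the density_x helper assumed by these scripts;
# the real implementation lives in functions.py and may differ.
import seaborn as sns
import matplotlib.pyplot as plt

def density_x(df_list, xlabel, ylabel, xscale, title, colors, legend_labels, path):
    """Overlay one density curve per group of values and save the figure."""
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(10, 8))
    for values, color, label in zip(df_list, colors, legend_labels):
        sns.kdeplot(values.dropna(), color=color, label=label, ax=ax)
    ax.set_xscale(xscale)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    ax.legend()
    plt.savefig(path, bbox_inches='tight', dpi=600)

density_x_size is called with the same leading arguments plus a figure size and x-axis limits, and is assumed to differ only in those respects.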
import pandas as pd
import functions as ft
import numpy as np
df = pd.read_csv(snakemake.input.df, sep='\t')


# Generate a density plot of numerical features with a hue of sno_type,
# abundance_cutoff or abundance_cutoff_2
hues = list(pd.unique(df[snakemake.wildcards.feature_hue]))
df_list = []
colors = []
color_dict = snakemake.params.hue_color

#logscale_features = ['distance_upstream_exon', 'distance_downstream_exon',
#                    'dist_to_bp', 'intron_length', 'sno_length', 'sno_mfe']
logscale_features = []

print(snakemake.wildcards.numerical_features)
for hue in hues:
    if snakemake.wildcards.numerical_features in logscale_features:
        temp_df = df[df[snakemake.wildcards.feature_hue] == hue][snakemake.wildcards.numerical_features]
        temp_df = np.log10(temp_df)  # log10-transform the feature values (the selection returns a Series)
        print(temp_df)
        df_list.append(temp_df)
    else:
        temp_df = df[df[snakemake.wildcards.feature_hue] == hue][snakemake.wildcards.numerical_features]
        df_list.append(temp_df)
    color = color_dict[hue]
    colors.append(color)




ft.density_x(df_list, snakemake.wildcards.numerical_features, 'Density', 'linear', '',
        colors, hues, snakemake.output.density_features)
import pandas as pd
import functions as ft
import numpy as np

df = pd.read_csv(snakemake.input.df, sep='\t')

# Generate a simple density plot of numerical_features without a hue
ft.density(df[snakemake.wildcards.numerical_features],
        snakemake.wildcards.numerical_features, 'Density', '',
        snakemake.output.density_features_simple, color=snakemake.params.simple_color)
import pandas as pd
import functions as ft

df = pd.read_csv(snakemake.input.df, sep='\t')
snoRNA_type_df = pd.read_csv(snakemake.input.snoRNA_type_df, sep='\t')
sno_type = str(snakemake.wildcards.sno_type)
sno_type = sno_type[0] + '/' + sno_type[1:]

# Keep only C/D or H/ACA snoRNAs
df = df.merge(snoRNA_type_df, how='left', left_on='gene_id_sno', right_on='gene_id')
df = df[df['snoRNA_type'] == sno_type]

# Generate a density plot of numerical features with a hue of predicted expressed vs not_expressed
hues = list(pd.unique(df['predicted_label']))
df_list = []
colors = []
color_dict = snakemake.params.hue_color

for hue in hues:
    temp_df = df[df['predicted_label'] == hue][snakemake.wildcards.species_numerical_features]
    df_list.append(temp_df)
    color = color_dict[hue]
    colors.append(color)




ft.density_x(df_list, snakemake.wildcards.species_numerical_features, 'Density', 'linear', '',
        colors, hues, snakemake.output.density_features)
import pandas as pd
import functions as ft
df = pd.read_csv(snakemake.input.df, sep='\t')
feat = snakemake.wildcards.numerical_features
# Reverse the relative intron rank value (count it from the 5' end instead of from the 3' end)
if feat == 'relative_intron_rank':
    df['relative_intron_rank_switch'] = 1 - df['relative_intron_rank']
    feat = 'relative_intron_rank_switch'
print(feat)
cd = df[df['sno_type'] == 'C/D']
haca = df[df['sno_type'] == 'H/ACA']

# Generate a density plot of numerical features with a hue of abundance_cutoff_2
# for either C/D and H/ACA snoRNAs separately
hues = list(pd.unique(df['abundance_cutoff_2']))
df_list_cd = []
df_list_haca = []
colors = []
color_dict = snakemake.params.hue_color

for hue in hues:
    temp_df_cd = cd[cd['abundance_cutoff_2'] == hue][feat]
    df_list_cd.append(temp_df_cd)
    temp_df_haca = haca[haca['abundance_cutoff_2'] == hue][feat]
    df_list_haca.append(temp_df_haca)
    color = color_dict[hue]
    colors.append(color)
if feat == 'terminal_stem_mfe':
    ft.density_x_size(df_list_cd, feat, 'Density', 'linear', '',
            colors, hues, snakemake.output.density_features_cd, (10,8), -42, 2)
    ft.density_x_size(df_list_haca, feat, 'Density', 'linear', '',
            colors, hues, snakemake.output.density_features_haca, (10,8), -42, 2)
elif feat == 'relative_intron_rank_switch':    
    ft.density_x_size(df_list_cd, 'relative_intron_rank_switch', 'Density', 'linear', '',
            colors, hues, snakemake.output.density_features_cd, (10,8), -0.05, 1.05)
    ft.density_x_size(df_list_haca, 'relative_intron_rank_switch', 'Density', 'linear', '',
            colors, hues, snakemake.output.density_features_haca, (10,8), -0.05, 1.05)
else:
    ft.density_x(df_list_cd, feat, 'Density', 'linear', '',
            colors, hues, snakemake.output.density_features_cd)
    ft.density_x(df_list_haca, feat, 'Density', 'linear', '',
            colors, hues, snakemake.output.density_features_haca)
import pandas as pd
import functions as ft
import numpy as np

feature = snakemake.wildcards.top_10_numerical_features
color_dict = snakemake.params.color_dict
colors = [color_dict['FP'], color_dict['TN']]
sno_per_confusion_value = snakemake.input.sno_per_confusion_value
feature_df = pd.read_csv(snakemake.input.all_features_df, sep='\t')
fp = pd.read_csv([path for path in sno_per_confusion_value if 'FP' in path][0], sep='\t')
tn = pd.read_csv([path for path in sno_per_confusion_value if 'TN' in path][0], sep='\t')

# Select only real confusion values (those always predicted as such across iterations and models)
real_fp = list(set(fp.gene_id_sno.to_list()) - set(tn.gene_id_sno.to_list()))
real_tn = list(set(tn.gene_id_sno.to_list()) - set(fp.gene_id_sno.to_list()))
real_fp_df = feature_df[feature_df['gene_id_sno'].isin(real_fp)].drop_duplicates()
real_tn_df = feature_df[feature_df['gene_id_sno'].isin(real_tn)].drop_duplicates()

# Select only TN that are within an expressed HG
real_tn_df = real_tn_df[real_tn_df['abundance_cutoff_host'] == 'host_expressed']

ft.density_x([real_fp_df[feature], real_tn_df[feature]], feature, 'Density', 'linear',
            'FP vs TN that have an expressed HG', colors, ['FP', 'TN with an expressed HG'], snakemake.output.density)
import pandas as pd
import functions as ft
import math
df = pd.read_csv(snakemake.input.df, sep='\t')

# Select intronic snoRNAs and create intron subgroup (small vs long intron)
df = df[df['host_biotype2'] != 'intergenic']
df.loc[df['intron_length'] < 5000, 'intron_subgroup'] = 'small_intron'
df.loc[df['intron_length'] >= 5000, 'intron_subgroup'] = 'long_intron'

# Separate per snoRNA type
sno_type = snakemake.wildcards.sno_type
sno_type = sno_type[0] + '/' + sno_type[1:]
sno_type_df = df[df['sno_type'] == sno_type]

# Generate a density plot of various features with a hue of abundance_cutoff_2
# per snoRNA type and intron subgroup
hues = list(pd.unique(sno_type_df['abundance_cutoff_2']))
df_list_small = []
df_list_long = []
colors = []
color_dict = snakemake.params.hue_color

# Create function to get min and max values to set the xlim accordingly on the density plot x-axis
def get_min_max(df):
    if snakemake.wildcards.intron_group_feature in ['terminal_stem_mfe', 'sno_mfe']:
        min = df[snakemake.wildcards.intron_group_feature].min()
        min = -1 * (-min + (10 - -min % 10))  # get the closest negative number that is a multiple of 10 going downward in negative values
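        # e.g. the line above maps min = -37 to -1 * (37 + (10 - 37 % 10)) = -40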
        max = 0
        min2, max2 = None, None
    elif snakemake.wildcards.intron_group_feature == 'conservation_score':
        min, max = -0.05, 1.05
        min2, max2 = None, None
    elif snakemake.wildcards.intron_group_feature == 'dist_to_bp':  # these values vary too much between intron subgroups to have the same xlims
        min = 0                                                     # this is why we return 2 min and 2 max values (one for each intron subgroup)
        df_small, df_long = df[df['intron_subgroup'] == 'small_intron'], df[df['intron_subgroup'] == 'long_intron']
        max = df_small[snakemake.wildcards.intron_group_feature].max()
        max = max + (10 - max % 10)  # get the closest positive number that is a multiple of 10 going upward in positive values
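        # e.g. the line above maps max = 73 to 73 + (10 - 73 % 10) = 80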
        min2 = 0
        max2 = df_long[snakemake.wildcards.intron_group_feature].max()
        max2 = max2 + (10 - max2 % 10)  # get the closest positive number that is a multiple of 10 going upward in positive values
    else:
        min = 0
        max = df[snakemake.wildcards.intron_group_feature].max()
        max = max + (10 - max % 10)  # get the closest positive number that is a multiple of 10 going upward in positive values
        min2, max2 = None, None

    return min, max, min2, max2

# Separate per intron subgroup and get xlims for the density plots
small_df = sno_type_df[sno_type_df['intron_subgroup'] == 'small_intron']
long_df = sno_type_df[sno_type_df['intron_subgroup'] == 'long_intron']
min, max, dist_to_bp_min2, dist_to_bp_max2 = get_min_max(sno_type_df)

for hue in hues:
    temp_df_small = small_df[small_df['abundance_cutoff_2'] == hue][snakemake.wildcards.intron_group_feature]
    temp_df_long = long_df[long_df['abundance_cutoff_2'] == hue][snakemake.wildcards.intron_group_feature]
    df_list_small.append(temp_df_small)
    df_list_long.append(temp_df_long)
    colors.append(color_dict[hue])

if snakemake.wildcards.intron_group_feature == 'dist_to_bp':
    ft.density_x_size(df_list_small, snakemake.wildcards.intron_group_feature, 'Density', 'linear', '',
            colors, hues, snakemake.output.density_small, (12, 6), min, max)
    ft.density_x_size(df_list_long, snakemake.wildcards.intron_group_feature, 'Density', 'linear', '',
            colors, hues, snakemake.output.density_long, (12, 6), dist_to_bp_min2, dist_to_bp_max2)
else:
    if math.isnan(max) or math.isnan(min):  # when looking at C,D,C',D' hamming for H/ACA snoRNAs or H,ACA hamming for C/D, the result is NaN values
        min_modified, max_modified = 0, 1
        ft.density_x_size(df_list_small, snakemake.wildcards.intron_group_feature, 'Density', 'linear', '',
                colors, hues, snakemake.output.density_small, (12, 6), min_modified, max_modified)
        ft.density_x_size(df_list_long, snakemake.wildcards.intron_group_feature, 'Density', 'linear', '',
                colors, hues, snakemake.output.density_long, (12, 6), min_modified, max_modified)
    else:
        ft.density_x_size(df_list_small, snakemake.wildcards.intron_group_feature, 'Density', 'linear', '',
                colors, hues, snakemake.output.density_small, (12, 6), min, max)
        ft.density_x_size(df_list_long, snakemake.wildcards.intron_group_feature, 'Density', 'linear', '',
                colors, hues, snakemake.output.density_long, (12, 6), min, max)
import pandas as pd
import functions as ft

df = pd.read_csv(snakemake.input.df, sep='\t')

# Create column of sno mfe normalized by snoRNA length
df['sno_mfe_length_normalized'] = df['sno_mfe'] / df['sno_length']

# Generate a density plot of the sno_mfe normalized by sno_length with a hue of abundance_cutoff_2
# for either C/D and H/ACA snoRNAs separately
cd = df[df['sno_type'] == 'C/D']
haca = df[df['sno_type'] == 'H/ACA']
hues = list(pd.unique(df['abundance_cutoff_2']))
df_list_cd = []
df_list_haca = []
colors = []
color_dict = snakemake.params.hue_color

for hue in hues:
    temp_df_cd = cd[cd['abundance_cutoff_2'] == hue]['sno_mfe_length_normalized']
    df_list_cd.append(temp_df_cd)
    temp_df_haca = haca[haca['abundance_cutoff_2'] == hue]['sno_mfe_length_normalized']
    df_list_haca.append(temp_df_haca)
    color = color_dict[hue]
    colors.append(color)


ft.density_x(df_list_cd, 'Normalized snoRNA stability (kcal/mol*nt)', 'Density', 'linear', '',
        colors, hues, snakemake.output.density_features_cd)

ft.density_x(df_list_haca,'Normalized snoRNA stability (kcal/mol*nt)' , 'Density', 'linear', '',
        colors, hues, snakemake.output.density_features_haca)
import pandas as pd
import functions as ft
from scipy.stats import mannwhitneyu

sno_cons = pd.read_csv(snakemake.input.sno_cons, sep='\t')
sno_cons = sno_cons.rename(columns={'conservation_score': 'sno_conservation_score'})
upstream_sno_cons = pd.read_csv(snakemake.input.upstream_sno_cons, sep='\t')
upstream_sno_cons = upstream_sno_cons.rename(columns={'conservation_score': 'upstream_conservation_score'})
df = pd.read_csv(snakemake.input.feature_label_df, sep='\t')

# Select intergenic and expressed snoRNAs
#intergenic = df[df['abundance_cutoff_host'] == 'intergenic']
#intergenic = df[(df['abundance_cutoff_host'] == 'intergenic') & (df['abundance_cutoff_2'] == 'not_expressed')]
intergenic = df[(df['abundance_cutoff_host'] == 'intergenic') & (df['abundance_cutoff_2'] == 'expressed')]


# Merge conservation info to intergenic df
cons = sno_cons.merge(upstream_sno_cons, how='left', on='gene_id')
intergenic = intergenic.merge(cons, how='left', left_on='gene_id_sno', right_on='gene_id')

# Separate conserved from recently copied snoRNAs (threshold of 0.5 on sno_conservation_score)
new = intergenic[intergenic['sno_conservation_score'] <= 0.5].upstream_conservation_score
old = intergenic[intergenic['sno_conservation_score'] > 0.5].upstream_conservation_score

# Create density
print(new.median())
print(old.median())

U, pval = mannwhitneyu(list(new), list(old))
print(pval)
nb_old, nb_new = str(len(old)), str(len(new))
colors, groups = ['lightgreen', 'grey'], [f'Conserved snoRNAs\n(conservation score > 0.5) (n={nb_old})', f'Recent snoRNAs\n(conservation score <= 0.5) (n={nb_new})']
ft.density_x([old, new], 'Promoter conservation of expressed \nintergenic snoRNAs (Conservation score)', 'Density', 'linear', '', colors, groups, snakemake.output.density)
import functions as ft
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import collections as coll
from scipy.stats import fisher_exact

cols = ['chr', 'start', 'end', 'gene_id', 'dot', 'strand', 'source', 'feature',
        'dot2', 'gene_info', 'chr_aqr', 'start_aqr', 'end_aqr', 'score_aqr', 'signalValue_aqr', 'strand_aqr', 'pval_aqr']
aqr_df = pd.read_csv(snakemake.input.aqr_overlap_HG, sep='\t', names=cols)
df = pd.read_csv(snakemake.input.df, sep='\t')
exp = df[(df['abundance_cutoff_2'] == 'expressed') & (df['abundance_cutoff_host'] != 'intergenic')]

# Define if sno intron is bound by AQR
aqr_bound_sno_intron = list(pd.unique(aqr_df.gene_id))
exp.loc[exp['gene_id_sno'].isin(aqr_bound_sno_intron), 'AQR_binding'] = 'Intron is bound by AQR'
exp['AQR_binding'] = exp['AQR_binding'].fillna('Intron is not bound by AQR')
cd, haca = exp[exp['sno_type'] == 'C/D'], exp[exp['sno_type'] == 'H/ACA']

# Split expressed snoRNAs based on their distance to bp (close (<=100nt) vs far (>100nt))
cd_close, cd_far = cd[cd['dist_to_bp'] <= 100], cd[cd['dist_to_bp'] > 100]
haca_close, haca_far = haca[haca['dist_to_bp'] <= 100], haca[haca['dist_to_bp'] > 100]



# Generate AQR_binding bar chart comparison between snoRNAs close vs far from bp
cd_close['bp_proximity'] = 'Close to branch\npoint (<= 100 nt)'
cd_far['bp_proximity'] = 'Far from branch\npoint (>100 nt)'
cd_combined = pd.concat([cd_close, cd_far])

counts_per_feature = ft.count_list_x(cd_combined, 'bp_proximity',
                    list(snakemake.params.AQR_binding_colors.keys()),
                    'AQR_binding')
percent = ft.percent_count(counts_per_feature)
ft.stacked_bar(percent, sorted(list(cd_combined['bp_proximity'].unique())),
                list(snakemake.params.AQR_binding_colors.keys()), 'C/D', 'Branch point proximity',
                'Proportion of expressed snoRNAs (%)', snakemake.params.AQR_binding_colors, snakemake.output.bar_HG_cd)


haca_close['bp_proximity'] = 'Close to branch\npoint (<= 100 nt)'
haca_far['bp_proximity'] = 'Far from branch\npoint (>100 nt)'
haca_combined = pd.concat([haca_close, haca_far])
counts_per_feature = ft.count_list_x(haca_combined, 'bp_proximity',
                    list(snakemake.params.AQR_binding_colors.keys()),
                    'AQR_binding')
percent = ft.percent_count(counts_per_feature)
ft.stacked_bar(percent, sorted(list(haca_combined['bp_proximity'].unique())),
                list(snakemake.params.AQR_binding_colors.keys()), 'H/ACA', 'Branch point proximity',
                'Proportion of expressed snoRNAs (%)', snakemake.params.AQR_binding_colors, snakemake.output.bar_HG_haca)



def criteria_count(group1, group2, col, crit):
    "This creates the contingency table needed to perform Fisher's exact test"
    count1a = len(group1[group1[col] == crit])
    count1b = len(group1[group1[col] != crit])
    count2a = len(group2[group2[col] == crit])
    count2b = len(group2[group2[col] != crit])

    data = {'group1': [count1a, count1b], 'group2': [count2a, count2b]}
    table = pd.DataFrame(data=data, index=[crit, '!= '+crit])
    print(table)
    return table


for binding_status in list(snakemake.params.AQR_binding_colors.keys()):
    table_cd = criteria_count(cd_close, cd_far, 'AQR_binding', binding_status)
    table_haca = criteria_count(haca_close, haca_far, 'AQR_binding', binding_status)
    oddsratio_cd, p_val_cd = fisher_exact(table_cd)
    oddsratio_haca, p_val_haca = fisher_exact(table_haca)
    print(binding_status)
    print(f"p = {p_val_cd} (Fisher's exact test) C/D")
    print(f"p = {p_val_haca} (Fisher's exact test) H/ACA")
import functions as ft
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import collections as coll
from scipy.stats import mannwhitneyu, fisher_exact

colors = list(snakemake.params.dist_to_bp_group_colors.values())
df = pd.read_csv(snakemake.input.df, sep='\t')
exp = df[(df['abundance_cutoff_2'] == 'expressed') & (df['abundance_cutoff_host'] != 'intergenic')]
cd, haca = exp[exp['sno_type'] == 'C/D'], exp[exp['sno_type'] == 'H/ACA']

# Split expressed snoRNAs based on their distance to bp (close (<=100nt) vs far (>100nt))
cd_close, cd_far = cd[cd['dist_to_bp'] <= 100], cd[cd['dist_to_bp'] > 100]
haca_close, haca_far = haca[haca['dist_to_bp'] <= 100], haca[haca['dist_to_bp'] > 100]

# Generate terminal stem mfe density comparison between snoRNAs close vs far from bp
ft.density_x([cd_close['terminal_stem_mfe'], cd_far['terminal_stem_mfe']], 'Terminal stem stability (kcal/mol)',
            'Density', 'linear', 'Expressed C/D', colors, ['Close to branch point (<= 100 nt)', 'Far from branch point (>100 nt)'], snakemake.output.density_terminal_stem_cd)

ft.density_x([haca_close['terminal_stem_mfe'], haca_far['terminal_stem_mfe']], 'Terminal stem stability (kcal/mol)',
            'Density', 'linear', 'Expressed H/ACA', colors, ['Close to branch point (<= 100 nt)', 'Far from branch point (>100 nt)'], snakemake.output.density_terminal_stem_haca)

U, p = mannwhitneyu(cd_close['terminal_stem_mfe'], cd_far['terminal_stem_mfe'])
print(f'p = {p} (M-W U test, C/D close vs C/D far, terminal stem mfe)')

U, p = mannwhitneyu(haca_close['terminal_stem_mfe'], haca_far['terminal_stem_mfe'])
print(f'p = {p} (M-W U test, H/ACA close vs H/ACA far, terminal stem mfe)')

# Generate box score density comparison between snoRNAs close vs far from bp
ft.density_x([cd_close['combined_box_hamming'], cd_far['combined_box_hamming']], 'Box score',
            'Density', 'linear', 'Expressed C/D', colors, ['Close to branch point (<= 100 nt)', 'Far from branch point (>100 nt)'], snakemake.output.density_box_score_cd)

ft.density_x([haca_close['combined_box_hamming'], haca_far['combined_box_hamming']], 'Box score',
            'Density', 'linear', 'Expressed H/ACA', colors, ['Close to branch point (<= 100 nt)', 'Far from branch point (>100 nt)'], snakemake.output.density_box_score_haca)

U, p = mannwhitneyu(cd_close['combined_box_hamming'], cd_far['combined_box_hamming'])
print(f'p = {p} (M-W U test, C/D close vs C/D far, Box score)')

U, p = mannwhitneyu(haca_close['combined_box_hamming'], haca_far['combined_box_hamming'])
print(f'p = {p} (M-W U test, H/ACA close vs H/ACA far, Box score)')


# Generate sno mfe density comparison between snoRNAs close vs far from bp
ft.density_x([cd_close['sno_mfe'], cd_far['sno_mfe']], 'Structure stability (kcal/mol)',
            'Density', 'linear', 'Expressed C/D', colors, ['Close to branch point (<= 100 nt)', 'Far from branch point (>100 nt)'], snakemake.output.density_sno_mfe_cd)

ft.density_x([haca_close['sno_mfe'], haca_far['sno_mfe']], 'Structure stability (kcal/mol)',
            'Density', 'linear', 'Expressed H/ACA', colors, ['Close to branch point (<= 100 nt)', 'Far from branch point (>100 nt)'], snakemake.output.density_sno_mfe_haca)

U, p = mannwhitneyu(cd_close['sno_mfe'], cd_far['sno_mfe'])
print(f'p = {p} (M-W U test, C/D close vs C/D far, Sno MFE)')

U, p = mannwhitneyu(haca_close['sno_mfe'], haca_far['sno_mfe'])
print(f'p = {p} (M-W U test, H/ACA close vs H/ACA far, Sno MFE)')


# Generate target type bar chart comparison between snoRNAs close vs far from bp
cd_close['bp_proximity'] = 'Close to branch\npoint (<= 100 nt)'
cd_far['bp_proximity'] = 'Far from branch\npoint (>100 nt)'
cd_combined = pd.concat([cd_close, cd_far])
counts_per_feature = ft.count_list_x(cd_combined, 'bp_proximity',
                    list(snakemake.params.sno_target_colors.keys()),
                    'sno_target')
percent = ft.percent_count(counts_per_feature)
ft.stacked_bar(percent, sorted(list(cd_combined['bp_proximity'].unique())),
                list(snakemake.params.sno_target_colors.keys()), 'C/D', 'Branch point proximity',
                'Proportion of expressed snoRNAs (%)', snakemake.params.sno_target_colors, snakemake.output.bar_target_cd)


haca_close['bp_proximity'] = 'Close to branch\npoint (<= 100 nt)'
haca_far['bp_proximity'] = 'Far from branch\npoint (>100 nt)'
haca_combined = pd.concat([haca_close, haca_far])
counts_per_feature = ft.count_list_x(haca_combined, 'bp_proximity',
                    list(snakemake.params.sno_target_colors.keys()),
                    'sno_target')
percent = ft.percent_count(counts_per_feature)
print(percent)

ft.stacked_bar(percent, sorted(list(haca_combined['bp_proximity'].unique())),
                list(snakemake.params.sno_target_colors.keys()), 'H/ACA', 'Branch point proximity',
                'Proportion of expressed snoRNAs (%)', snakemake.params.sno_target_colors, snakemake.output.bar_target_haca)

def criteria_count(group1, group2, col, crit):
    "This creates the contingency table needed to perform Fisher's exact test"
    count1a = len(group1[group1[col] == crit])
    count1b = len(group1[group1[col] != crit])
    count2a = len(group2[group2[col] == crit])
    count2b = len(group2[group2[col] != crit])

    data = {'group1': [count1a, count1b], 'group2': [count2a, count2b]}
    table = pd.DataFrame(data=data, index=[crit, '!= '+crit])
    print(table)
    return table


for target in list(snakemake.params.sno_target_colors.keys()):
    table_cd = criteria_count(cd_close, cd_far, 'sno_target', target)
    table_haca = criteria_count(haca_close, haca_far, 'sno_target', target)
    oddsratio_cd, p_val_cd = fisher_exact(table_cd)
    oddsratio_haca, p_val_haca = fisher_exact(table_haca)
    print(target)
    print(f"p = {p_val_cd} (Fisher's exact test) C/D")
    print(f"p = {p_val_haca} (Fisher's exact test) H/ACA")
import functions as ft
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


conf_val_df = pd.read_csv(snakemake.input.confusion_value_per_sno, sep='\t')
host_biotype_df = pd.read_csv(snakemake.input.host_biotype_df, sep='\t')

df = conf_val_df.merge(host_biotype_df, how='left', on='gene_id_sno')


# Create a donut chart of the confusion value of snoRNAs (outer donut)
# The inner donut shows the host biotype (protein-coding, non-coding or intergenic)

count_datasets = []
count_attributes = []
for status in snakemake.params.conf_val_colors.keys():  # Iterate through confusion values (TN, TP, FN, FP)
    temp_df = df[df['confusion_matrix'] == status]
    count_datasets.append(len(temp_df))
    attributes_dict = {}
    for type in snakemake.params.host_biotype_colors.keys():
        attributes_dict[type] = len(temp_df[temp_df['host_biotype2'] == type])
    sorted_dict = {k: v for k, v in sorted(attributes_dict.items())}  # Sort dictionary alphabetically as in the config file
    print(status, sorted_dict)
    for val in sorted_dict.values():
        count_attributes.append(val)

counts = [count_datasets, count_attributes]

# Set inner_labels as a list of None (no labels for the inner donut), and labels as outer and inner_labels
inner_labels = [None] * len(snakemake.params.conf_val_colors.keys()) * len(snakemake.params.host_biotype_colors.keys())
labels = [list(snakemake.params.conf_val_colors.keys()), inner_labels]

# Set inner colors as a repeated list of colors for each part of the inner donut (ex: same 2 inner colors repeated for each outer donut part)
inner_colors = list(snakemake.params.host_biotype_colors.values()) * len(snakemake.params.conf_val_colors.keys())

# Set colors as outer and inner_colors
colors = [list(snakemake.params.conf_val_colors.values()), inner_colors]

ft.donut_2(counts, labels, colors, '', list(snakemake.params.host_biotype_colors.keys()), list(snakemake.params.host_biotype_colors.values()), snakemake.output.donut)
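donut_2 is likewise a helper from functions.py. It is assumed to draw a two-ring (nested) donut chart with matplotlib, the outer ring taking the first list of counts/labels/colors and the inner ring the second; a minimal sketch of that idea (assumptions only, the real helper may differ):

# Minimal sketch of a nested donut chart in the spirit of donut_2
# (assumed behaviour; the real implementation is in functions.py).
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

def donut_2(counts, labels, colors, title, legend_labels, legend_colors, path):
    """counts, labels and colors are [outer, inner] pairs of lists."""
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.pie(counts[0], labels=labels[0], colors=colors[0], radius=1,
           wedgeprops=dict(width=0.3, edgecolor='white'))
    ax.pie(counts[1], colors=colors[1], radius=0.7,
           wedgeprops=dict(width=0.3, edgecolor='white'))
    ax.set_title(title)
    handles = [Line2D([0], [0], marker='o', linestyle='', markerfacecolor=color,
                      markeredgecolor=color, label=label)
               for label, color in zip(legend_labels, legend_colors)]
    ax.legend(handles=handles, loc='lower right')
    plt.savefig(path, bbox_inches='tight', dpi=600)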
import functions as ft
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


conf_val_df = pd.read_csv(snakemake.input.confusion_value_per_sno[0], sep='\t')
host_biotype_df = pd.read_csv(snakemake.input.host_biotype_df, sep='\t')

# Simplify host_biotype column and merge with conf_val_df
simplified_biotype_dict = {'lncRNA': 'non_coding', 'protein_coding': 'protein_coding', 'TEC': 'non_coding', 'unitary_pseudogene': 'non_coding', 'unprocessed_pseudogene': 'non_coding'}
host_biotype_df['host_biotype2'] = host_biotype_df['host_biotype'].map(simplified_biotype_dict)
df = conf_val_df.merge(host_biotype_df, how='left', on='gene_id_sno')
df['host_biotype2'] = df['host_biotype2'].fillna('intergenic')


# Create a donut chart of the confusion value of snoRNAs (outer donut)
# The inner donut shows the host biotype (protein-coding, non-coding or intergenic)

count_datasets = []
count_attributes = []
for status in snakemake.params.conf_val_colors.keys():  # Iterate through confusion values (TN, TP, FN, FP)
    temp_df = df[df['confusion_matrix_val_log_reg'] == status]
    count_datasets.append(len(temp_df))
    attributes_dict = {}
    for type in snakemake.params.host_biotype_colors.keys():
        attributes_dict[type] = len(temp_df[temp_df['host_biotype2'] == type])
    sorted_dict = {k: v for k, v in sorted(attributes_dict.items())}  # Sort dictionary alphabetically as in the config file
    print(status, sorted_dict)
    for val in sorted_dict.values():
        count_attributes.append(val)

counts = [count_datasets, count_attributes]

# Set inner_labels as a list of None (no labels for the inner donut), and labels as outer and inner_labels
inner_labels = [None] * len(snakemake.params.conf_val_colors.keys()) * len(snakemake.params.host_biotype_colors.keys())
labels = [list(snakemake.params.conf_val_colors.keys()), inner_labels]

# Set inner colors as a repeated list of colors for each part of the inner donut (ex: same 2 inner colors repeated for each outer donut part)
inner_colors = list(snakemake.params.host_biotype_colors.values()) * len(snakemake.params.conf_val_colors.keys())

# Set colors as outer and inner_colors
colors = [list(snakemake.params.conf_val_colors.values()), inner_colors]

ft.donut_2(counts, labels, colors, '', list(snakemake.params.host_biotype_colors.keys()), list(snakemake.params.host_biotype_colors.values()), snakemake.output.donut)
import functions as ft
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


conf_val_df = pd.read_csv(snakemake.input.confusion_value_per_sno[0], sep='\t')
host_biotype_df = pd.read_csv(snakemake.input.host_biotype_df, sep='\t')

# Simplify host_biotype column and merge with conf_val_df
simplified_biotype_dict = {'lncRNA': 'non_coding', 'protein_coding': 'protein_coding', 'TEC': 'non_coding', 'unitary_pseudogene': 'non_coding', 'unprocessed_pseudogene': 'non_coding'}
host_biotype_df['host_biotype2'] = host_biotype_df['host_biotype'].map(simplified_biotype_dict)
df = conf_val_df.merge(host_biotype_df, how='left', on='gene_id_sno')
df['host_biotype2'] = df['host_biotype2'].fillna('intergenic')


# Create a donut chart of the confusion value of snoRNAs (outer donut)
# The inner donut shows the host biotype (protein-coding, non-coding or intergenic)

count_datasets = []
count_attributes = []
for status in snakemake.params.conf_val_colors.keys():  # Iterate through confusion values (TN, TP, FN, FP)
    temp_df = df[df['confusion_matrix_val_log_reg_thresh'] == status]
    count_datasets.append(len(temp_df))
    attributes_dict = {}
    for type in snakemake.params.host_biotype_colors.keys():
        attributes_dict[type] = len(temp_df[temp_df['host_biotype2'] == type])
    sorted_dict = {k: v for k, v in sorted(attributes_dict.items())}  # Sort dictionary alphabetically as in the config file
    print(status, sorted_dict)
    for val in sorted_dict.values():
        count_attributes.append(val)

counts = [count_datasets, count_attributes]

# Set inner_labels as a list of None (no labels for the inner donut), and labels as outer and inner_labels
inner_labels = [None] * len(snakemake.params.conf_val_colors.keys()) * len(snakemake.params.host_biotype_colors.keys())
labels = [list(snakemake.params.conf_val_colors.keys()), inner_labels]

# Set inner colors as a repeated list of colors for each part of the inner donut (ex: same 2 inner colors repeated for each outer donut part)
inner_colors = list(snakemake.params.host_biotype_colors.values()) * len(snakemake.params.conf_val_colors.keys())

# Set colors as outer and inner_colors
colors = [list(snakemake.params.conf_val_colors.values()), inner_colors]

ft.donut_2(counts, labels, colors, '', list(snakemake.params.host_biotype_colors.keys()), list(snakemake.params.host_biotype_colors.values()), snakemake.output.donut)
import functions as ft
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


conf_val_df = pd.read_csv(snakemake.input.confusion_value_per_sno[0], sep='\t')
host_biotype_df = pd.read_csv(snakemake.input.host_biotype_df, sep='\t')
mod = snakemake.wildcards.models2
# Simplify host_biotype column and merge with conf_val_df
simplified_biotype_dict = {'lncRNA': 'non_coding', 'protein_coding': 'protein_coding', 'TEC': 'non_coding', 'unitary_pseudogene': 'non_coding', 'unprocessed_pseudogene': 'non_coding'}
host_biotype_df['host_biotype2'] = host_biotype_df['host_biotype'].map(simplified_biotype_dict)
df = conf_val_df.merge(host_biotype_df, how='left', on='gene_id_sno')
df['host_biotype2'] = df['host_biotype2'].fillna('intergenic')


# Create a donut chart of the confusion value of snoRNAs (outer donut)
# The inner donut shows the host biotype (protein-coding, non-coding or intergenic)

count_datasets = []
count_attributes = []
for status in snakemake.params.conf_val_colors.keys():  # Iterate through confusion values (TN, TP, FN, FP)
    temp_df = df[df[f'confusion_matrix_val_{mod}'] == status]
    count_datasets.append(len(temp_df))
    attributes_dict = {}
    for type in snakemake.params.host_biotype_colors.keys():
        attributes_dict[type] = len(temp_df[temp_df['host_biotype2'] == type])
    sorted_dict = {k: v for k, v in sorted(attributes_dict.items())}  # Sort dictionary alphabetically as in the config file
    print(status, sorted_dict)
    for val in sorted_dict.values():
        count_attributes.append(val)

counts = [count_datasets, count_attributes]

# Set inner_labels as a list of None (no labels for the inner donut), and labels as outer and inner_labels
inner_labels = [None] * len(snakemake.params.conf_val_colors.keys()) * len(snakemake.params.host_biotype_colors.keys())
labels = [list(snakemake.params.conf_val_colors.keys()), inner_labels]

# Set inner colors as a repeated list of colors for each part of the inner donut (ex: same 2 inner colors repeated for each outer donut part)
inner_colors = list(snakemake.params.host_biotype_colors.values()) * len(snakemake.params.conf_val_colors.keys())

# Set colors as outer and inner_colors
colors = [list(snakemake.params.conf_val_colors.values()), inner_colors]

ft.donut_2(counts, labels, colors, '', list(snakemake.params.host_biotype_colors.keys()), list(snakemake.params.host_biotype_colors.values()), snakemake.output.donut)
import functions as ft
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


df = pd.read_csv(snakemake.input.df, sep='\t')
host_biotype_df = pd.read_csv(snakemake.input.host_biotype_df, sep='\t')
host_biotype_df = host_biotype_df[['gene_id_sno', 'host_biotype']]
host_biotype_df = host_biotype_df.rename(columns={'gene_id_sno': 'gene_id'})

simplified_biotype_dict = {'lncRNA': 'non_coding', 'protein_coding': 'protein_coding', 'TEC': 'non_coding', 'unitary_pseudogene': 'non_coding', 'unprocessed_pseudogene': 'non_coding'}
host_biotype_df['host_biotype2'] = host_biotype_df['host_biotype'].map(simplified_biotype_dict)

# Drop duplicates (drop all occurrences)
df = df.drop_duplicates(subset=['sno_mfe', 'terminal_stem_mfe',
                                'combined_box_hamming',
                                'abundance_cutoff_host'], keep=False)

# Merge host biotype df to sno df
df = df.merge(host_biotype_df, how='left', left_on='gene_id_sno', right_on='gene_id')
df['host_biotype2'] = df['host_biotype2'].fillna('intergenic')

# Create a donut chart of the abundance status of snoRNAs (outer donut)
# The inner donut shows the host biotype (intergenic, protein-coding or non-coding)

count_datasets = []
count_attributes = []
for status in snakemake.params.label_colors.keys():  # Iterate through abundance statuses
    temp_df = df[df['abundance_cutoff'] == status]
    count_datasets.append(len(temp_df))
    attributes_dict = {}
    for type in snakemake.params.host_biotype_colors.keys():
        attributes_dict[type] = len(temp_df[temp_df['host_biotype2'] == type])
    sorted_dict = {k: v for k, v in sorted(attributes_dict.items())}  # Sort dictionary alphabetically as in the config file
    print(status, sorted_dict)
    for val in sorted_dict.values():
        count_attributes.append(val)

counts = [count_datasets, count_attributes]

# Set inner_labels as a list of None (no labels for the inner donut), and labels as outer and inner_labels
inner_labels = [None] * len(snakemake.params.label_colors.keys()) * len(snakemake.params.host_biotype_colors.keys())
labels = [list(snakemake.params.label_colors.keys()), inner_labels]

# Set inner colors as a repeated list of colors for each part of the inner donut (ex: same 2 inner colors repeated for each outer donut part)
inner_colors = list(snakemake.params.host_biotype_colors.values()) * len(snakemake.params.label_colors.keys())

# Set colors as outer and inner_colors
colors = [list(snakemake.params.label_colors.values()), inner_colors]

ft.donut_2(counts, labels, colors, '', list(snakemake.params.host_biotype_colors.keys()), list(snakemake.params.host_biotype_colors.values()), snakemake.output.donut)
import functions as ft
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


df = pd.read_csv(snakemake.input.df, sep='\t')
host_biotype_df = pd.read_csv(snakemake.input.host_biotype_df, sep='\t')
host_biotype_df = host_biotype_df[['gene_id_sno', 'host_biotype']]
host_biotype_df = host_biotype_df.rename(columns={'gene_id_sno': 'gene_id'})

simplified_biotype_dict = {'lncRNA': 'non_coding', 'protein_coding': 'protein_coding', 'TEC': 'non_coding', 'unitary_pseudogene': 'non_coding', 'unprocessed_pseudogene': 'non_coding'}
host_biotype_df['host_biotype2'] = host_biotype_df['host_biotype'].map(simplified_biotype_dict)

# Merge host biotype df to sno df
df = df.merge(host_biotype_df, how='left', on='gene_id')
df['host_biotype2'] = df['host_biotype2'].fillna('intergenic')

# Create a donut chart of the abundance status of snoRNAs (outer donut)
# The inner donut shows the host biotype (intergenic, protein-coding or non-coding)

count_datasets = []
count_attributes = []
for status in snakemake.params.label_colors.keys():  # Iterate through abundance statuses
    temp_df = df[df['abundance_cutoff'] == status]
    count_datasets.append(len(temp_df))
    attributes_dict = {}
    for type in snakemake.params.host_biotype_colors.keys():
        attributes_dict[type] = len(temp_df[temp_df['host_biotype2'] == type])
    sorted_dict = {k: v for k, v in sorted(attributes_dict.items())}  # Sort dictionary alphabetically as in the config file
    print(status, sorted_dict)
    for val in sorted_dict.values():
        count_attributes.append(val)

counts = [count_datasets, count_attributes]

# Set inner_labels as a list of None (no labels for the inner donut), and labels as outer and inner_labels
inner_labels = [None] * len(snakemake.params.label_colors.keys()) * len(snakemake.params.host_biotype_colors.keys())
labels = [list(snakemake.params.label_colors.keys()), inner_labels]

# Set inner colors as a repeated list of colors for each part of the inner donut (ex: same 2 inner colors repeated for each outer donut part)
inner_colors = list(snakemake.params.host_biotype_colors.values()) * len(snakemake.params.label_colors.keys())

# Set colors as outer and inner_colors
colors = [list(snakemake.params.label_colors.values()), inner_colors]

ft.donut_2(counts, labels, colors, '', list(snakemake.params.host_biotype_colors.keys()), list(snakemake.params.host_biotype_colors.values()), snakemake.output.donut)
import functions as ft
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


df = pd.read_csv(snakemake.input.df, sep='\t')
host_biotype_df = pd.read_csv(snakemake.input.host_biotype_df, sep='\t')
host_biotype_df = host_biotype_df[['gene_id_sno', 'host_biotype']]
host_biotype_df = host_biotype_df.rename(columns={'gene_id_sno': 'gene_id'})

simplified_biotype_dict = {'lncRNA': 'non_coding', 'protein_coding': 'protein_coding', 'TEC': 'non_coding', 'unitary_pseudogene': 'non_coding', 'unprocessed_pseudogene': 'non_coding'}
host_biotype_df['host_biotype2'] = host_biotype_df['host_biotype'].map(simplified_biotype_dict)

# Drop duplicates (keep only 1 occurrence)
df = df.drop_duplicates(subset=['sno_mfe', 'terminal_stem_mfe',
                                'combined_box_hamming',
                                'abundance_cutoff_host'])

# Merge host biotype df to sno df
df = df.merge(host_biotype_df, how='left', left_on='gene_id_sno', right_on='gene_id')
df['host_biotype2'] = df['host_biotype2'].fillna('intergenic')

# Create a donut chart of the abundance status of snoRNAs (outer donut)
# The inner donut shows the host biotype (intergenic, protein-coding or non-coding)

count_datasets = []
count_attributes = []
for status in snakemake.params.label_colors.keys():  # Iterate through abundance statuses
    temp_df = df[df['abundance_cutoff'] == status]
    count_datasets.append(len(temp_df))
    attributes_dict = {}
    for type in snakemake.params.host_biotype_colors.keys():
        attributes_dict[type] = len(temp_df[temp_df['host_biotype2'] == type])
    sorted_dict = {k: v for k, v in sorted(attributes_dict.items())}  # Sort dictionary alphabetically as in the config file
    print(status, sorted_dict)
    for val in sorted_dict.values():
        count_attributes.append(val)

counts = [count_datasets, count_attributes]

# Set inner_labels as a list of None (no labels for the inner donut), and labels as outer and inner_labels
inner_labels = [None] * len(snakemake.params.label_colors.keys()) * len(snakemake.params.host_biotype_colors.keys())
labels = [list(snakemake.params.label_colors.keys()), inner_labels]

# Set inner colors as a repeated list of colors for each part of the inner donut (ex: same 2 inner colors repeated for each outer donut part)
inner_colors = list(snakemake.params.host_biotype_colors.values()) * len(snakemake.params.label_colors.keys())

# Set colors as outer and inner_colors
colors = [list(snakemake.params.label_colors.values()), inner_colors]

ft.donut_2(counts, labels, colors, '', list(snakemake.params.host_biotype_colors.keys()), list(snakemake.params.host_biotype_colors.values()), snakemake.output.donut)
import functions as ft
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


df = pd.read_csv(snakemake.input.df, sep='\t')

# Create a donut chart of the abundance status of snoRNAs (outer donut)
# The inner donut shows the host biotype (intergenic, protein-coding or non-coding)

count_datasets = []
count_attributes = []
for status in snakemake.params.label_colors.keys():  # Iterate through abundance statuses
    temp_df = df[df['abundance_cutoff_2'] == status]
    count_datasets.append(len(temp_df))
    attributes_dict = {}
    for type in snakemake.params.host_biotype_colors.keys():
        attributes_dict[type] = len(temp_df[temp_df['host_biotype2'] == type])
    sorted_dict = {k: v for k, v in sorted(attributes_dict.items())}  # Sort dictionary alphabetically as in the config file
    print(status, sorted_dict)
    for val in sorted_dict.values():
        count_attributes.append(val)

counts = [count_datasets, count_attributes]

# Set inner_labels as a list of None (no labels for the inner donut), and labels as outer and inner_labels
inner_labels = [None] * len(snakemake.params.label_colors.keys()) * len(snakemake.params.host_biotype_colors.keys())
labels = [list(snakemake.params.label_colors.keys()), inner_labels]

# Set inner colors as a repeated list of colors for each part of the inner donut (ex: same 2 inner colors repeated for each outer donut part)
inner_colors = list(snakemake.params.host_biotype_colors.values()) * len(snakemake.params.label_colors.keys())

# Set colors as outer and inner_colors
colors = [list(snakemake.params.label_colors.values()), inner_colors]

ft.donut_2(counts, labels, colors, '', list(snakemake.params.host_biotype_colors.keys()), list(snakemake.params.host_biotype_colors.values()), snakemake.output.donut)
import functions as ft
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


df = pd.read_csv(snakemake.input.df, sep='\t')
host_biotype_df = pd.read_csv(snakemake.input.host_biotype_df, sep='\t')
host_biotype_df = host_biotype_df[['gene_id_sno', 'host_biotype']]
host_biotype_df = host_biotype_df.rename(columns={'gene_id_sno': 'gene_id'})

simplified_biotype_dict = {'lncRNA': 'non_coding', 'protein_coding': 'protein_coding',
                            'TEC': 'non_coding', 'unitary_pseudogene': 'non_coding',
                            'unprocessed_pseudogene': 'non_coding', 'pseudogene': 'non_coding',
                            'processed_pseudogene': 'non_coding', 'polymorphic_pseudogene': 'non_coding',
                            'processed_transcript': 'non_coding', 'lincRNA': 'non_coding',
                            'antisense': 'non_coding', 'sense_intronic': 'non_coding',
                            'sense_overlapping': 'non_coding'}
host_biotype_df['host_biotype2'] = host_biotype_df['host_biotype'].map(simplified_biotype_dict)

# Merge host biotype df to sno df
df = df.merge(host_biotype_df, how='left', right_on='gene_id', left_on='gene_id_sno')
df['host_biotype2'] = df['host_biotype2'].fillna('intergenic')

# Create a donut chart of the predicted abundance status of snoRNAs (outer donut)
# The inner donut shows the host biotype (intergenic, protein-coding or non-coding)

count_datasets = []
count_attributes = []
for status in snakemake.params.label_colors.keys():  # Iterate through abundance statuses
    temp_df = df[df['predicted_label'] == status]
    count_datasets.append(len(temp_df))
    attributes_dict = {}
    for type in snakemake.params.host_biotype_colors.keys():
        attributes_dict[type] = len(temp_df[temp_df['host_biotype2'] == type])
    sorted_dict = {k: v for k, v in sorted(attributes_dict.items())}  # Sort dictionary alphabetically as in the config file
    print(status, sorted_dict)
    for val in sorted_dict.values():
        count_attributes.append(val)

counts = [count_datasets, count_attributes]

# Set inner_labels as a list of None (no labels for the inner donut), and labels as outer and inner_labels
inner_labels = [None] * len(snakemake.params.label_colors.keys()) * len(snakemake.params.host_biotype_colors.keys())
labels = [list(snakemake.params.label_colors.keys()), inner_labels]

# Set inner colors as a repeated list of colors for each part of the inner donut (ex: same 2 inner colors repeated for each outer donut part)
inner_colors = list(snakemake.params.host_biotype_colors.values()) * len(snakemake.params.label_colors.keys())

# Set colors as outer and inner_colors
colors = [list(snakemake.params.label_colors.values()), inner_colors]

ft.donut_2(counts, labels, colors, 'Host gene biotype according to\nthe predicted abundance status',
            list(snakemake.params.host_biotype_colors.keys()),
            list(snakemake.params.host_biotype_colors.values()),
            snakemake.output.donut)
import functions as ft
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


df = pd.read_csv(snakemake.input.df, sep='\t')

# Select intronic snoRNAs and create intron subgroup (small vs long intron)
df = df[df['host_biotype2'] != 'intergenic']
df.loc[df['intron_length'] < 5000, 'intron_subgroup'] = 'small_intron'
df.loc[df['intron_length'] >= 5000, 'intron_subgroup'] = 'long_intron'

# Create a donut chart of the abundance status of snoRNAs (inner donut)
# The outer donut shows the intron subgroup of snoRNAs
count_datasets = []
count_attributes = []
for sub in snakemake.params.intron_subgroup_colors.keys():  # Iterate through intron subgroups (small vs long intron)
    temp_df = df[df['intron_subgroup'] == sub]
    count_datasets.append(len(temp_df))
    attributes_dict = {}
    for status in snakemake.params.label_colors.keys():
        attributes_dict[status] = len(temp_df[temp_df['abundance_cutoff_2'] == status])
    sorted_dict = {k: v for k, v in sorted(attributes_dict.items())}  # Sort dictionary alphabetically as in the config file
    print(sub, sorted_dict)
    for val in sorted_dict.values():
        count_attributes.append(val)

counts = [count_datasets, count_attributes]

# Set inner_labels as a list of None (no labels for the inner donut), and labels as outer and inner_labels
inner_labels = [None] * len(snakemake.params.intron_subgroup_colors.keys()) * len(snakemake.params.label_colors.keys())
labels = [list(snakemake.params.intron_subgroup_colors.keys()), inner_labels]

# Set inner colors as a repeated list of colors for each part of the inner donut (ex: same 2 inner colors repeated for each outer donut part)
inner_colors = list(snakemake.params.label_colors.values()) * len(snakemake.params.intron_subgroup_colors.keys())

# Set colors as outer and inner_colors
colors = [list(snakemake.params.intron_subgroup_colors.values()), inner_colors]

ft.donut_2(counts, labels, colors, '', list(snakemake.params.label_colors.keys()), list(snakemake.params.label_colors.values()), snakemake.output.donut)
import functions as ft
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


df = pd.read_csv(snakemake.input.df, sep='\t')
feature_df = pd.read_csv(snakemake.input.feature_df, sep='\t')

# Drop duplicates in test set (drop all occurrences)
feature_df = feature_df.drop_duplicates(subset=['sno_mfe', 'terminal_stem_mfe',
                                'combined_box_hamming',
                                'abundance_cutoff_host'], keep=False)

# Merge dfs
df = feature_df.merge(df[['gene_id', 'snoRNA_type']], how='left', left_on='gene_id_sno', right_on='gene_id')

# Create a donut chart of the abundance status of snoRNAs (outer donut)
# The inner donut shows the snoRNA type (C/D or H/ACA)

count_datasets = []
count_attributes = []
for status in snakemake.params.label_colors.keys():  # Iterate through abundance statuses
    temp_df = df[df['abundance_cutoff'] == status]
    count_datasets.append(len(temp_df))
    attributes_dict = {}
    for type in snakemake.params.sno_type_colors.keys():
        attributes_dict[type] = len(temp_df[temp_df['snoRNA_type'] == type])
    sorted_dict = {k: v for k, v in sorted(attributes_dict.items())}  # Sort dictionary alphabetically as in the config file
    print(status, sorted_dict)
    for val in sorted_dict.values():
        count_attributes.append(val)

counts = [count_datasets, count_attributes]

# Set inner_labels as a list of None (no labels for the inner donut), and labels as outer and inner_labels
inner_labels = [None] * len(snakemake.params.label_colors.keys()) * len(snakemake.params.sno_type_colors.keys())
labels = [list(snakemake.params.label_colors.keys()), inner_labels]

# Set inner colors as a repeated list of colors for each part of the inner donut (ex: same 2 inner colors repeated for each outer donut part)
inner_colors = list(snakemake.params.sno_type_colors.values()) * len(snakemake.params.label_colors.keys())

# Set colors as outer and inner_colors
colors = [list(snakemake.params.label_colors.values()), inner_colors]

ft.donut_2(counts, labels, colors, '', list(snakemake.params.sno_type_colors.keys()), list(snakemake.params.sno_type_colors.values()), snakemake.output.donut)
import functions as ft
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


df = pd.read_csv(snakemake.input.df, sep='\t')

# Create a donut chart of the abundance status of snoRNAs (outer donut)
# The inner donut shows the snoRNA type (C/D or H/ACA)

count_datasets = []
count_attributes = []
for status in snakemake.params.label_colors.keys():  # Iterate through abundance statuses
    temp_df = df[df['abundance_cutoff'] == status]
    count_datasets.append(len(temp_df))
    attributes_dict = {}
    for type in snakemake.params.sno_type_colors.keys():
        attributes_dict[type] = len(temp_df[temp_df['snoRNA_type'] == type])
    sorted_dict = {k: v for k, v in sorted(attributes_dict.items())}  # Sort dictionary alphabetically as in the config file
    print(status, sorted_dict)
    for val in sorted_dict.values():
        count_attributes.append(val)

counts = [count_datasets, count_attributes]

# Set inner_labels as a list of None (no labels for the inner donut), and labels as outer and inner_labels
inner_labels = [None] * len(snakemake.params.label_colors.keys()) * len(snakemake.params.sno_type_colors.keys())
labels = [list(snakemake.params.label_colors.keys()), inner_labels]

# Set inner colors as a repeated list of colors for each part of the inner donut (ex: same 2 inner colors repeated for each outer donut part)
inner_colors = list(snakemake.params.sno_type_colors.values()) * len(snakemake.params.label_colors.keys())

# Set colors as outer and inner_colors
colors = [list(snakemake.params.label_colors.values()), inner_colors]

ft.donut_2(counts, labels, colors, '', list(snakemake.params.sno_type_colors.keys()), list(snakemake.params.sno_type_colors.values()), snakemake.output.donut)
import functions as ft
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


df = pd.read_csv(snakemake.input.df, sep='\t')
feature_df = pd.read_csv(snakemake.input.feature_df, sep='\t')

# Drop duplicates in test set (keep only the first occurrence of each duplicated row)
feature_df = feature_df.drop_duplicates(subset=['sno_mfe', 'terminal_stem_mfe',
                                'combined_box_hamming',
                                'abundance_cutoff_host'])

# Merge dfs
df = feature_df.merge(df[['gene_id', 'snoRNA_type']], how='left', left_on='gene_id_sno', right_on='gene_id')

# Create a donut chart of the abundance status of snoRNAs (outer donut)
# The inner donut shows the snoRNA type (C/D or H/ACA)

count_datasets = []
count_attributes = []
for status in snakemake.params.label_colors.keys():  # Iterate through abundance statuses
    temp_df = df[df['abundance_cutoff'] == status]
    count_datasets.append(len(temp_df))
    attributes_dict = {}
    for type in snakemake.params.sno_type_colors.keys():
        attributes_dict[type] = len(temp_df[temp_df['snoRNA_type'] == type])
    sorted_dict = {k: v for k, v in sorted(attributes_dict.items())}  # Sort dictionary alphabetically as in the config file
    print(status, sorted_dict)
    for val in sorted_dict.values():
        count_attributes.append(val)

counts = [count_datasets, count_attributes]

# Set inner_labels as a list of None values (no labels on the inner donut), and labels as outer and inner_labels
inner_labels = [None] * len(snakemake.params.label_colors.keys()) * len(snakemake.params.sno_type_colors.keys())
labels = [list(snakemake.params.label_colors.keys()), inner_labels]

# Set inner colors as a repeated list of colors for each part of the inner donut (ex: same 2 inner colors repeated for each outer donut part)
inner_colors = list(snakemake.params.sno_type_colors.values()) * len(snakemake.params.label_colors.keys())

# Set colors as outer and inner_colors
colors = [list(snakemake.params.label_colors.values()), inner_colors]

ft.donut_2(counts, labels, colors, '', list(snakemake.params.sno_type_colors.keys()), list(snakemake.params.sno_type_colors.values()), snakemake.output.donut)
import functions as ft
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


df = pd.read_csv(snakemake.input.df, sep='\t')

# Create a donut chart of the abundance status of snoRNAs (outer donut)
# The inner donut shows the snoRNA type (C/D or H/ACA)

count_datasets = []
count_attributes = []
for status in snakemake.params.label_colors.keys():  # Iterate through abundance statuses
    temp_df = df[df['abundance_cutoff_2'] == status]
    count_datasets.append(len(temp_df))
    attributes_dict = {}
    for type in snakemake.params.sno_type_colors.keys():
        attributes_dict[type] = len(temp_df[temp_df['sno_type'] == type])
    sorted_dict = {k: v for k, v in sorted(attributes_dict.items())}  # Sort dictionary alphabetically as in the config file
    print(status, sorted_dict)
    for val in sorted_dict.values():
        count_attributes.append(val)

counts = [count_datasets, count_attributes]

# Set inner_labels as a list of None values (no labels on the inner donut), and labels as outer and inner_labels
inner_labels = [None] * len(snakemake.params.label_colors.keys()) * len(snakemake.params.sno_type_colors.keys())
labels = [list(snakemake.params.label_colors.keys()), inner_labels]

# Set inner colors as a repeated list of colors for each part of the inner donut (ex: same 2 inner colors repeated for each outer donut part)
inner_colors = list(snakemake.params.sno_type_colors.values()) * len(snakemake.params.label_colors.keys())

# Set colors as outer and inner_colors
colors = [list(snakemake.params.label_colors.values()), inner_colors]

ft.donut_2(counts, labels, colors, '', list(snakemake.params.sno_type_colors.keys()), list(snakemake.params.sno_type_colors.values()), snakemake.output.donut)
import functions as ft
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


df = pd.read_csv(snakemake.input.df, sep='\t')
snoRNA_type_df = pd.read_csv(snakemake.input.snoRNA_type_df, sep='\t')

# Merge dfs
df = df.merge(snoRNA_type_df, how='left', left_on='gene_id_sno', right_on='gene_id')

# Create a donut chart of the predicted abundance status of snoRNAs (outer donut)
# The inner donut shows the snoRNA type (C/D or H/ACA)

count_datasets = []
count_attributes = []
for status in snakemake.params.label_colors.keys():  # Iterate through abundance statuses
    temp_df = df[df['predicted_label'] == status]
    count_datasets.append(len(temp_df))
    attributes_dict = {}
    for type in snakemake.params.sno_type_colors.keys():
        attributes_dict[type] = len(temp_df[temp_df['snoRNA_type'] == type])
    sorted_dict = {k: v for k, v in sorted(attributes_dict.items())}  # Sort dictionary alphabetically as in the config file
    print(status, sorted_dict)
    for val in sorted_dict.values():
        count_attributes.append(val)

counts = [count_datasets, count_attributes]

# Set inner_labels as a list of None values (no labels on the inner donut), and labels as outer and inner_labels
inner_labels = [None] * len(snakemake.params.label_colors.keys()) * len(snakemake.params.sno_type_colors.keys())
labels = [list(snakemake.params.label_colors.keys()), inner_labels]

# Set inner colors as a repeated list of colors for each part of the inner donut (ex: same 2 inner colors repeated for each outer donut part)
inner_colors = list(snakemake.params.sno_type_colors.values()) * len(snakemake.params.label_colors.keys())

# Set colors as outer and inner_colors
colors = [list(snakemake.params.label_colors.values()), inner_colors]

ft.donut_2(counts, labels, colors, 'SnoRNA type according to the\npredicted abundance status', list(snakemake.params.sno_type_colors.keys()), list(snakemake.params.sno_type_colors.values()), snakemake.output.donut)
import pandas as pd
from pandas.core.common import flatten
import functions as ft
import collections as coll

df = pd.read_csv(snakemake.input.df, sep='\t')
colors = snakemake.params.rank_colors
output = snakemake.output.donut

def count_in_list(input_list, ref_list):
    """ Count the number of occurences of each element in ref_list within the
        input_list. Return the associated percentages in a list."""
    l = []
    for element in ref_list:
        occurences = input_list.count(element)
        l.append(occurences)

    total = sum(l)
    percent_list = []
    for count in l:
        if total != 0:
            percent = count/total * 100
            percent = round(percent, 2)
            percent_list.append(percent)
        else:
            percent = 0
            percent_list.append(percent)

    return percent_list
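
# Hypothetical example of count_in_list:
# count_in_list([1, 1, 2, 5], [1, 2, 3, 4, 5]) counts [2, 1, 0, 0, 1] and returns [50.0, 25.0, 0.0, 0.0, 25.0]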


# Create a dict where each key is a feature and each value is a list of all ranks
# (corresponding to that feature) within the top 5 predictive features across models that use that feature
rank_dict = df.groupby('feature')['feature_rank'].apply(list).to_dict()
print(rank_dict)

## Create a dict where the keys are the models intersections
## (e.g., gbm, log_reg_svc_gbm_knn, svc_knn, etc.) and the values are the features shared by these models
# First aggregate model names together (ex: if a feature was shared by svc and knn, then the model name becomes svc_knn)
groups = df.groupby('feature').agg({'model':list})
groups['model'] = groups['model'].apply(lambda x: "_".join(map(str, x)))
groups.reset_index(inplace=True)
# Combine the common features to each models intersection into a list and create the dict
groups = groups.groupby('model')['feature'].apply(lambda x: ",".join(map(str, x))).reset_index()
groups['feature'] = groups['feature'].str.split(',')
feature_model_intersect_dict = dict(zip(groups.model, groups.feature))

print(feature_model_intersect_dict)

# Generate list of list of rank percentage (1 list per intersection category (8 categories in total))
model_intersect_names = []
features_name = []
all_rank_percent = []
for model_intersect, features in feature_model_intersect_dict.items():
    if len(features) == 1:  # if only one feature is shared in that model intersection
        ranks = rank_dict[features[0]]
        rank_percent = count_in_list(ranks, [1,2,3,4,5])  # count how many ranks 1, 2, ..., 5 are associated with that feature across model(s)
        all_rank_percent.append(rank_percent)
    elif len(features) > 1:  # if multiple features are common in that model intersection
        temp = []
        for feature in features:
            ranks = rank_dict[feature]
            temp.append(ranks)
        rank_percent = count_in_list(list(flatten(temp)), [1,2,3,4,5])
        all_rank_percent.append(rank_percent)
    model_intersect_names.append(model_intersect)
    features_name.append(features)

print(model_intersect_names)
print(features_name)
print(all_rank_percent)

# Generate a donut chart per model intersection according to their rank (1st, 2nd, ..., 5th most predictive feature) across models
ft.pie_multiple(all_rank_percent, colors.keys(), colors.values(), model_intersect_names,
    'Ranking of top 5 most predictive features across model intersections', 'Rank across top 5 features and models', output)
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
import functions as ft
import pickle  # needed below to unpickle the trained models

# Generate the same CV, training and test sets (only the test set will be
# used in this script) that were generated in hyperparameter_tuning_cv and train_models
# (respectively 15%, 70% and 15% of all dataset examples)
df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next the total_train is split into train and test sets (1077 and 232 correspond
# to the number of examples in train and test sets respectively to get an
# approximately 70 % and 15 % of all examples in these two datasets)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train, test_size=232, train_size=1077, random_state=42, stratify=y_total_train)
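
# Sanity check on the split sizes (assuming the full feature table holds ~1540 snoRNAs):
# 1077 + 232 = 1309 examples form total_train (~85 %), so train and test correspond to ~70 % and ~15 % of all examples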


# Function to order features by the absolute value of their importance
def abs_feature_imp(feature_name_list, feature_val_list):
    d = dict(zip(feature_name_list, feature_val_list))
    ordered_val = sorted(feature_val_list, reverse=True, key=abs)  # absolute value used for sorting
    keys = list(d.keys())
    vals = list(d.values())

    ordered_keys = []
    temp_index = 0
    for val in ordered_val:
        if len(list(np.where(np.asarray(ordered_val) == val)[0])) > 1:  # if multiple features have the same value
            multiple_keys = [k for k,v in d.items() if v == val]
            key = multiple_keys[temp_index]
            temp_index += 1  # update to select the next feature name with same value
            ordered_keys.append(key)
        else:
            key = keys[vals.index(val)]
            ordered_keys.append(key)

    return ordered_keys, ordered_val
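
# Hypothetical example: abs_feature_imp(['a', 'b', 'c'], [0.2, -0.7, 0.5]) returns
# (['b', 'c', 'a'], [-0.7, 0.5, 0.2]) since features are ranked by absolute importance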


# Unpickle and thus instantiate the model represented by the 'models' wildcard
if snakemake.wildcards.models == "log_reg":
    model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    importance = model.coef_[0]
    x_tick_labels, values = abs_feature_imp(list(X_train.columns), importance)
    ft.simple_bar(values, x_tick_labels, '', 'Feature name',
        'Feature importance', snakemake.output.bar_plot)


elif snakemake.wildcards.models == "svc":
    # Create empty figure, since we can't extract feature importance with an SVC with a sigmoid kernel
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    plt.savefig(snakemake.output.bar_plot, bbox_inches='tight', dpi=600)

else:  # for gbm and rf
    model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    importance = model.feature_importances_
    x_tick_labels, values = abs_feature_imp(list(X_train.columns), importance)
    ft.simple_bar(values, x_tick_labels, '', 'Feature name',
        'Feature importance', snakemake.output.bar_plot)
import matplotlib.pyplot as plt
import forgi.visual.mplotlib as fvm
import forgi
import subprocess as sp

input_file_stem = snakemake.input.snora77b_terminal_stem
input_file_all_sno = snakemake.input.all_snorna_structure
snora77b_dot_bracket = snakemake.output.snora77b_dot_bracket

# Temporarily remove the MFE (i.e. a negative number between parentheses) from the snoRNA/terminal stem stability file
sp.call(f"sed -E 's/\(.[0-9]*.[0-9]*\)//g' {input_file_stem} > temp_stem.fa", shell=True)

# Extract the sequence and dot bracket of SNORA77B only from the fasta of all sno dot brackets
sp.call(f"grep -A 2 'ENSG00000264346' {input_file_all_sno} > {snora77b_dot_bracket}", shell=True)
sp.call(f"sed -E 's/\(.[0-9]*.[0-9]*\)//g' {snora77b_dot_bracket} > temp_sno.fa", shell=True)

# Create forgi graph of SNORA77B with its terminal stem
plt.rcParams['svg.fonttype'] = 'none'
fig, ax = plt.subplots(1, 1, figsize=(12, 12))
cg = forgi.load_rna("temp_stem.fa", allow_many=False)
fvm.plot_rna(cg, text_kwargs={"fontweight":"black"}, lighten=0.7,
             backbone_kwargs={"linewidth":2}, ax=ax)
plt.savefig(snakemake.output.snora77b_terminal_stem_figure)

# Create forgi graph of SNORA77B alone
plt.rcParams['svg.fonttype'] = 'none'
fig, ax = plt.subplots(1, 1, figsize=(12, 12))
cg = forgi.load_rna("temp_sno.fa", allow_many=False)
fvm.plot_rna(cg, text_kwargs={"fontweight":"black"}, lighten=0.7,
             backbone_kwargs={"linewidth":2}, ax=ax)
plt.savefig(snakemake.output.snora77b_figure)

# Remove temp files
sp.call('rm temp_s*', shell=True)
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import shap
import numpy as np
import pickle  # needed below to unpickle the trained models
# Generate the same CV, training and test sets (only the test set will be
# used in this script) that were generated in hyperparameter_tuning_cv and train_models
# (respectively 15%, 70% and 15% of all dataset examples)
df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next the total_train is split into train and test sets (1077 and 232 correspond
# to the number of examples in train and test sets respectively to get an
# approximately 70 % and 15 % of all examples in these two datasets)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train, test_size=232, train_size=1077, random_state=42, stratify=y_total_train)


# Unpickle and thus instantiate the model represented by the 'models' wildcard
# Instantiate the explainer using the X_train as background data and X_test to generate shap local values for one snoRNA
if snakemake.wildcards.models == "log_reg":
    model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer = shap.LinearExplainer(model, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
    shap_values = explainer.shap_values(X_test)
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    #shap.plots.bar(shap_values, show=False, max_display=50)
    shap.summary_plot(shap_values, X_test, plot_type='bar', max_display=50, show=False)
    plt.savefig(snakemake.output.bar_plot, bbox_inches='tight', dpi=600)

else:
    model2 = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer2 = shap.KernelExplainer(model2.predict, shap.sample(X_train, 100, random_state=42)) # reduce number of background sample to 100
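    # Note: KernelExplainer is model-agnostic but much slower than LinearExplainer,
    # hence the 100-sample background to keep the SHAP computation tractable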
    shap_values2 = explainer2.shap_values(X_test)
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    #shap.plots.bar(shap_values2, show=False, max_display=50)
    shap.summary_plot(shap_values2, X_test, plot_type='bar', max_display=50, show=False)
    plt.savefig(snakemake.output.bar_plot, bbox_inches='tight', dpi=600)
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import shap
import numpy as np
import pickle  # needed below to unpickle the trained models

X_train = pd.read_csv(snakemake.input.X_train, sep='\t', index_col='gene_id_sno')
X_test = pd.read_csv(snakemake.input.X_test, sep='\t', index_col='gene_id_sno')

# Unpickle and thus instantiate the model represented by the 'models' wildcard
# Instantiate the explainer using the X_train as background data and X_test to generate shap local values for one snoRNA
if snakemake.wildcards.models2 == "log_reg":
    model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer = shap.LinearExplainer(model, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
    shap_values = explainer.shap_values(X_test)
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    shap.summary_plot(shap_values, X_test, plot_type='bar', max_display=50, show=False)
    plt.savefig(snakemake.output.bar_plot, bbox_inches='tight', dpi=600)

else:
    model2 = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer2 = shap.KernelExplainer(model2.predict, shap.sample(X_train, 100, random_state=42)) # reduce number of background sample to 100
    shap_values2 = explainer2.shap_values(X_test)
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    shap.summary_plot(shap_values2, X_test, plot_type='bar', max_display=50, show=False)
    plt.savefig(snakemake.output.bar_plot, bbox_inches='tight', dpi=600)
import pandas as pd
import functions as ft
import collections as coll

sno_per_confusion_value_df = pd.read_csv(snakemake.input.sno_per_confusion_value, sep='\t')
host_df = pd.read_csv(snakemake.input.host_df)
multi_HG_different_label_snoRNAs_df = pd.read_csv(snakemake.input.multi_HG_different_label_snoRNAs, sep='\t')
feature_df = pd.read_csv(snakemake.input.feature_df, sep='\t')
feature_df = feature_df[['gene_id_sno', 'abundance_cutoff_2']]
color_dict = snakemake.params.color_dict
bar_output = snakemake.output.hbar


# Drop intergenic snoRNAs and merge sno_per_confusion_value_df to host_df
sno_per_confusion_value_df = sno_per_confusion_value_df.merge(feature_df, how='left', on='gene_id_sno')
intronic_sno_df = host_df.merge(sno_per_confusion_value_df, how='left', left_on='sno_id', right_on='gene_id_sno')



# Select only multi_HG (drop HG with only 1 snoRNA)
mono_HG, multi_HG = [], []
for i, group in enumerate(intronic_sno_df.groupby('host_id')):
    grouped_df = group[1]
    host_id = group[0]
    if len(grouped_df) == 1:  # mono-HG
        mono_HG.append(host_id)
    elif len(grouped_df) > 1:  # multi-HG
        multi_HG.append(host_id)

intronic_sno_df = intronic_sno_df[intronic_sno_df['host_id'].isin(multi_HG)]


# Find in the multi_HG those that have snoRNAs with all the same label vs those with different labels
sno_ids_multi_HG_diff_labels = list(multi_HG_different_label_snoRNAs_df.gene_id_sno)
intronic_sno_df.loc[~intronic_sno_df['gene_id_sno'].isin(sno_ids_multi_HG_diff_labels), 'label_type'] = 'same_label'
intronic_sno_df.loc[intronic_sno_df['gene_id_sno'].isin(sno_ids_multi_HG_diff_labels), 'label_type'] = 'diff_label'

# Find for multi_HG with same snoRNA labels the % of all_expressed or all_not_expressed snoRNAs
all_expressed, all_not_expressed = [], []
for i, group in enumerate(intronic_sno_df[intronic_sno_df['label_type'] == 'same_label'].groupby('host_id')):
    grouped_df = group[1]
    host_id = group[0]
    if 'expressed' in list(grouped_df.abundance_cutoff_2):
        all_expressed.append(host_id)
    elif 'not_expressed' in list(grouped_df.abundance_cutoff_2):
        all_not_expressed.append(host_id)
intronic_sno_df.loc[intronic_sno_df['host_id'].isin(all_expressed), 'expression_category'] = 'all_sno_expressed'
intronic_sno_df.loc[intronic_sno_df['host_id'].isin(all_not_expressed), 'expression_category'] = 'all_sno_not_expressed'


# Find for multi_HG with different snoRNA labels the % of snoRNAs that are 50-50 expressed-not_expressed, more expressed, or more not_expressed
half_expressed_not_expressed, more_expressed, more_not_expressed = [], [], []
for i, group in enumerate(intronic_sno_df[intronic_sno_df['label_type'] == 'diff_label'].groupby('host_id')):
    grouped_df = group[1]
    host_id = group[0]
    if len(grouped_df[grouped_df['abundance_cutoff_2'] == 'expressed']) == len(grouped_df[grouped_df['abundance_cutoff_2'] == 'not_expressed']):
        half_expressed_not_expressed.append(host_id)
    elif len(grouped_df[grouped_df['abundance_cutoff_2'] == 'expressed']) > len(grouped_df[grouped_df['abundance_cutoff_2'] == 'not_expressed']):
        more_expressed.append(host_id)
    elif len(grouped_df[grouped_df['abundance_cutoff_2'] == 'expressed']) < len(grouped_df[grouped_df['abundance_cutoff_2'] == 'not_expressed']):
        more_not_expressed.append(host_id)

intronic_sno_df.loc[intronic_sno_df['host_id'].isin(half_expressed_not_expressed), 'expression_category'] = 'half_expressed_not_expressed'
intronic_sno_df.loc[intronic_sno_df['host_id'].isin(more_expressed), 'expression_category'] = 'more_expressed'
intronic_sno_df.loc[intronic_sno_df['host_id'].isin(more_not_expressed), 'expression_category'] = 'more_not_expressed'



# Group by label_type, expression_category and host_name, then sort by the number of snoRNAs per HG (in descending order)
groupby_df = intronic_sno_df.groupby(['label_type', 'expression_category', 'host_name']).size().reset_index()
groupby_df.columns = ['label_type', 'expression_category', 'host_name', 'number_of_sno_per_HG']
groupby_df = groupby_df.sort_values(['label_type', 'expression_category', 'number_of_sno_per_HG'], ascending=[True, True, False])
original_index = list(groupby_df.host_name)
inverted_index = original_index[::-1]  # we invert the order so that it appears correctly on the hbar chart (from top to bottom instead of bottom to top)

# Count the number of sno per confusion value per HG
counts = []
for host_name in inverted_index:
    temp = []
    for confusion_val in ['TN', 'TP', 'FN', 'FP']:
        df = intronic_sno_df[intronic_sno_df['host_name'] == host_name]
        number = len(df[df['confusion_matrix'] == confusion_val])
        temp.append(number)
    counts.append(temp)

# Create df to generate the hbar chart (keep the good index determined by the groupby)
final_df = pd.DataFrame(counts, index=inverted_index, columns=['TN', 'TP', 'FN', 'FP'])
print(final_df)
ft.barh(final_df, 10, 20, '', 'Number of snoRNAs', 'Host gene name', bar_output, stacked=True, color=color_dict, width=0.85)
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft
import numpy as np
import seaborn as sns

rank_df = pd.read_csv(snakemake.input.rank_features_df, sep='\t')
rank_df[['feature2', 'norm']] = rank_df['feature'].str.split('_norm', expand=True)
rank_df = rank_df.drop(columns=['norm', 'feature'])

feature_distribution = {}
for i, group in enumerate(rank_df.groupby('feature2')['feature_rank']):
    feature_name = group[0]
    range_ = group[1].max() - group[1].min()
    median_ = group[1].median()
    feature_distribution[feature_name] = [median_, range_]

# Order features by increasing median value of feature_ranks and by range as second sort if two features have the same median
# i.e. the same order as in the violin plot of feature rank
feature_distribution_df = pd.DataFrame.from_dict(feature_distribution, columns = ['median', 'range'], orient='index')
ordered_features = feature_distribution_df.sort_values(by=['median', 'range'], ascending=[True, True]).index.to_list()
print(ordered_features)


models = ['log_reg', 'svc', 'rf']
outputs = snakemake.output.heatmaps
for mod in models:
    output = [path for path in outputs if mod in path][0]
    iterations_df = rank_df[rank_df['model'].str.startswith(mod)]
    print(list(iterations_df['feature2']))
    pivot = iterations_df.pivot(index='model', columns='feature2', values='feature_rank')
    pivot = pivot[ordered_features]
    print(pivot)
    correlation_df = pivot.corr(method='spearman')
    print(correlation_df)
    mask = np.zeros_like(correlation_df)  # create matrix of 0s with the same shape as correlation_df
    mask[np.triu_indices_from(mask, k=1)] = True  # return indices for the upper triangle of the matrix (which will mask this portion of the heatmap)
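    # e.g. for a 3x3 correlation matrix, the mask is [[0,1,1],[0,0,1],[0,0,0]], so only
    # the lower triangle and the diagonal of the heatmap are drawn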
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots()
    sns.heatmap(correlation_df, mask=mask, square=True, ax=ax, xticklabels=True,
                yticklabels=True, cmap='viridis', cbar_kws={'label': "Feature rank correlation\n(Spearman's ρ)"})
    plt.xticks(fontsize=8)
    plt.yticks(fontsize=8)
    plt.xlabel(xlabel="Features")
    plt.ylabel(ylabel="Features")
    plt.savefig(output, dpi=600, bbox_inches='tight')
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft
import numpy as np


shap_values_paths = snakemake.input.shap_values
dfs = []
for i, path in enumerate(shap_values_paths):
    model, iteration = path.split('/')[-1].split('_shap_')[0].rsplit('_', 1)
    df = pd.read_csv(path, sep='\t')
    df['id_model_iteration'] = df['gene_id_sno'] + f'_{model}_{iteration}'
    df = df.drop(columns='gene_id_sno')
    df = df.set_index('id_model_iteration')
    dfs.append(df)

concat_df = pd.concat(dfs)


ft.heatmap_simple(concat_df, 'plasma', 'SHAP values', snakemake.output.heatmap)
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft
import numpy as np

confusion_value_df = pd.read_csv(snakemake.input.confusion_value_df, sep='\t')
confusion_value_sno_ids = list(confusion_value_df.gene_id_sno)
shap_values_paths = snakemake.input.shap_values
dfs = []
for i, path in enumerate(shap_values_paths):
    model, iteration = path.split('/')[-1].split('_shap_')[0].rsplit('_', 1)
    df = pd.read_csv(path, sep='\t')
    df = df[df['gene_id_sno'].isin(confusion_value_sno_ids)]  # select only the snoRNAs part of the specific confusion value
    df['id_model_iteration'] = df['gene_id_sno'] + f'_{model}_{iteration}'
    df = df.drop(columns='gene_id_sno')
    df = df.set_index('id_model_iteration')
    dfs.append(df)

concat_df = pd.concat(dfs)


ft.heatmap_simple(concat_df, 'plasma', 'SHAP values', snakemake.output.heatmap)
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft
import numpy as np

confusion_value_df = pd.read_csv(snakemake.input.confusion_value_df, sep='\t')
confusion_value_sno_ids = list(confusion_value_df.gene_id_sno)
shap_values_paths = snakemake.input.shap_values
dfs = []
for i, path in enumerate(shap_values_paths):
    model, iteration = path.split('/')[-1].split('_shap_')[0].rsplit('_', 1)
    df = pd.read_csv(path, sep='\t')
    df = df[df['gene_id_sno'].isin(confusion_value_sno_ids)]  # select only the snoRNAs part of the specific confusion value
    df['id_iteration'] = df['gene_id_sno'] + f'_{iteration}'
    df = df.drop(columns='gene_id_sno')
    df = df.set_index('id_iteration')
    dfs.append(df)

concat_df = pd.concat(dfs)


ft.heatmap_simple(concat_df, 'plasma', 'SHAP values', snakemake.output.heatmap)
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft
import numpy as np

log_reg_output = [path for path in snakemake.output.heatmap if 'log_reg' in path][0]
svc_output = [path for path in snakemake.output.heatmap if 'svc' in path][0]
rf_output = [path for path in snakemake.output.heatmap if 'rf' in path][0]

shap_values_paths = snakemake.input.shap_values
log_reg_dfs, svc_dfs, rf_dfs = [], [], []
for i, path in enumerate(shap_values_paths):
    model, iteration = path.split('/')[-1].split('_shap_')[0].rsplit('_', 1)
    df = pd.read_csv(path, sep='\t')
    df['id_model_iteration'] = df['gene_id_sno'] + f'_{model}_{iteration}'
    df = df.drop(columns='gene_id_sno')
    df = df.set_index('id_model_iteration')
    if model == 'log_reg':
        log_reg_dfs.append(df)
    elif model == 'svc':
        svc_dfs.append(df)
    elif model == 'rf':
        rf_dfs.append(df)

log_reg_concat_df = pd.concat(log_reg_dfs)
svc_concat_df = pd.concat(svc_dfs)
rf_concat_df = pd.concat(rf_dfs)


ft.heatmap_simple(log_reg_concat_df, 'plasma', 'SHAP values', log_reg_output)
ft.heatmap_simple(svc_concat_df, 'plasma', 'SHAP values', svc_output)
ft.heatmap_simple(rf_concat_df, 'plasma', 'SHAP values', rf_output)
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
import matplotlib.pyplot as plt
import shap
import numpy as np
import pickle  # needed below to unpickle the trained models
import functions as ft

X_train = pd.read_csv(snakemake.input.X_train, sep='\t', index_col='gene_id_sno')
X_test = pd.read_csv(snakemake.input.X_test, sep='\t', index_col='gene_id_sno')
y_test = pd.read_csv(snakemake.input.y_test, sep='\t')
y_test.index = X_test.index
col_color_dict = snakemake.params.labels_dict
col_color_dict[0] = col_color_dict.pop('not_expressed')
col_color_dict[1] = col_color_dict.pop('expressed')

# Unpickle and thus instantiate the model represented by the 'models2' wildcard
# Instantiate the explainer using the X_train as background data and X_test to generate shap global values
if snakemake.wildcards.models2 == "log_reg":
    model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer = shap.LinearExplainer(model, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
    shap_values = explainer.shap_values(X_test)

    df = pd.DataFrame(shap_values.T, index=X_test.columns, columns=X_test.index)

    # Data for heatmap column colorbar (with y_test)
    predicted_label = pd.DataFrame(model.predict(X_test))
    predicted_label.index = X_test.index
    predicted_label.columns = ['predicted_label']

    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    ft.heatmap(df, col_color_dict, y_test['label'], predicted_label['predicted_label'],
                'plasma', 'SHAP value', snakemake.output.heatmap)

else:
    model2 = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    explainer2 = shap.KernelExplainer(model2.predict, shap.sample(X_train, 100, random_state=42)) # reduce number of background sample to 100
    shap_values2 = explainer2.shap_values(X_test)

    df = pd.DataFrame(shap_values2.T, index=X_test.columns, columns=X_test.index)
    predicted_label = pd.DataFrame(model2.predict(X_test))
    predicted_label.index = X_test.index
    predicted_label.columns = ['predicted_label']

    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    ft.heatmap(df, col_color_dict, y_test['label'], predicted_label['predicted_label'],
                'plasma', 'SHAP value', snakemake.output.heatmap)
import pandas as pd
import logomaker
import matplotlib.pyplot as plt

""" Create logo of C and D boxes from fasta of either expressed or
    not expressed C/D box snoRNAs."""

fastas = snakemake.input.box_fastas
outputs = snakemake.output.box_logos

# Get all box sequences (not sno_id) in a list
for fasta in fastas:
    # Get the box name and whether it comes from expressed or not expressed snoRNAs, to redirect the figure to the correct output
    ab_status_box = fasta.split('/')[-1].rstrip('.fa')
    output = [path for path in outputs if ab_status_box in path][0]
    with open(fasta) as f:
        raw_seqs = f.readlines()
    seqs = [seq.strip() for seq in raw_seqs if '>' not in seq]


    #Get a count and probability matrix to create the logo
    counts_matrix = logomaker.alignment_to_matrix(seqs)
    prob_matrix = logomaker.transform_matrix(counts_matrix, from_type='counts',
                                            to_type='probability')
    rc = {'ytick.labelsize': 32}
    plt.rcParams.update(**rc)
    plt.rcParams['svg.fonttype'] = 'none'
    logo = logomaker.Logo(prob_matrix, color_scheme='classic')
    logo.ax.set_ylabel("Frequency", fontsize=35)
    plt.savefig(output, bbox_inches='tight', dpi=600)
import pandas as pd
import logomaker
import matplotlib.pyplot as plt
from math import log2
from scipy.stats import entropy, kstest

""" Create logo of C and D boxes from fasta of either expressed or
    not expressed C/D box snoRNAs."""

fastas = snakemake.input.box_fastas
logo_outputs = snakemake.output.box_logos
pie_outputs = snakemake.output.pie_logos
color_dict = snakemake.params.color_dict

def get_entropy(proba_matrix):
    """ Compute the entropy of a given logo from a proba_matrix (each column is
        a nucleotide (A, U, C or G), each line is the position in the logo)."""
    cumulative_entropy = []
    for i, row in proba_matrix.iterrows():
        vals = list(row)
        entropy_per_position = entropy(vals, base=2)
        cumulative_entropy.append(entropy_per_position)
    print(sum(cumulative_entropy))
    return sum(cumulative_entropy)
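
# For reference: a fully conserved position (probabilities [1, 0, 0, 0]) contributes 0 bits,
# whereas a fully degenerate position ([0.25, 0.25, 0.25, 0.25]) contributes 2 bits to the logo entropy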

def make_autopct(values):
    """ Create function to return % in pie chart"""
    def my_autopct(pct):
        total = sum(values)
        val = int(round(pct*total/100.0))
        return '{v:d} \n ({p:.1f}%)'.format(p=pct,v=val)
    return my_autopct
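
# Hypothetical example: with counts [30, 70], the smaller wedge is annotated '30 \n (30.0%)'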

def pie_simple_annot(count_list, colors, annotation, path, **kwargs):
    """
    Creates a pie chart from a simple list of values and add an annotation
    (ex: an entropy value) next to the graph.
    """
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))

    ax.pie(count_list, colors=list(colors.values()), pctdistance=0.5,
           textprops={'fontsize': 35}, autopct=make_autopct(count_list), **kwargs)
    fig.suptitle(annotation, y=0.9, fontsize=18)
    plt.legend(labels=colors.keys(), loc='upper right',
                bbox_to_anchor=(1, 1.18), prop={'size': 35})
    plt.savefig(path, dpi=600)


# Get all box sequences (not sno_id) in a list
f_obs_dict = {}
for fasta in fastas:
    # Get the box name and whether it comes from expressed or not expressed snoRNAs, to redirect the figure to the correct output
    ab_status_box = fasta.split('/')[-1].rstrip('.fa')
    logo_output = [path for path in logo_outputs if ab_status_box in path][0]
    pie_output = [path for path in pie_outputs if ab_status_box in path][0]
    with open(fasta) as f:
        raw_seqs = f.readlines()
    seqs = [seq.strip() for seq in raw_seqs if '>' not in seq]
    seqs_wo_blank = [seq for seq in seqs if 'NNN' not in seq]  # remove blank (NNNN) sequences
    len_seqs, len_seqs_wo_blank = len(seqs), len(seqs_wo_blank)

    #Get a count and probability matrix to create the logo
    counts_matrix = logomaker.alignment_to_matrix(seqs_wo_blank)
    prob_matrix = logomaker.transform_matrix(counts_matrix, from_type='counts',
                                            to_type='probability')

    # Create logo without blanks
    rc = {'ytick.labelsize': 32}
    plt.rcParams.update(**rc)
    plt.rcParams['svg.fonttype'] = 'none'
    logo = logomaker.Logo(prob_matrix, color_scheme='classic')
    logo.ax.set_ylabel("Frequency", fontsize=35)
    plt.savefig(logo_output, bbox_inches='tight', dpi=600)

    # Compute entropy of each logo (blanks removed)
    entropy_ = str(get_entropy(prob_matrix))

    print(ab_status_box, prob_matrix)

    # Get the observed frequency into a flattened list
    l = prob_matrix.values.tolist()
    f_obs = [j for sublist in l for j in sublist]
    f_obs_dict[ab_status_box] = f_obs

    # Create pie chart of found vs not found motif per box
    percent = [(len_seqs_wo_blank/len_seqs)*100, ((len_seqs - len_seqs_wo_blank)/len_seqs)*100]
    pie_simple_annot(percent, color_dict, f'{entropy_} bits', pie_output)

# Compute a Kolmogorov-Smirnov test to assess the statistical significance of box degeneration between expressed and not expressed snoRNAs
res = kstest(f_obs_dict['not_expressed_c_box'], f_obs_dict['expressed_c_box'])
print(res)
res = kstest(f_obs_dict['not_expressed_d_box'], f_obs_dict['expressed_d_box'])
print(res)
res = kstest(f_obs_dict['not_expressed_c_prime_box'], f_obs_dict['expressed_c_prime_box'])
print(res)
res = kstest(f_obs_dict['not_expressed_d_prime_box'], f_obs_dict['expressed_d_prime_box'])
print(res)
res = kstest(f_obs_dict['not_expressed_aca_box'], f_obs_dict['expressed_aca_box'])
print(res)
res = kstest(f_obs_dict['not_expressed_h_box'], f_obs_dict['expressed_h_box'])
print(res)
import pandas as pd
import functions as ft

""" Create a density plot to compare all confusion values for all
    numerical features in the top 10 predictive features."""

color_dict = snakemake.params.color_dict
output = snakemake.output.density
numerical_feature = snakemake.wildcards.top_10_numerical_features
feature_df = pd.read_csv(snakemake.input.feature_df, sep='\t')
feature_df = feature_df[['gene_id_sno', numerical_feature]]
confusion_value_df = pd.read_csv(snakemake.input.sno_per_confusion_value, sep='\t')

# Get df of all snoRNAs of a given confusion_value inside dict
sno_per_confusion_value = {}
for conf_val in ['TN', 'TP', 'FN', 'FP']:
    df_temp = confusion_value_df[confusion_value_df['confusion_matrix'] == conf_val]
    sno_list = df_temp['gene_id_sno'].to_list()
    df = feature_df[feature_df['gene_id_sno'].isin(sno_list)]
    sno_per_confusion_value[conf_val] = df


# Create density plot
colors = [color_dict['TN'], color_dict['TP'], color_dict['FN'], color_dict['FP']]
dfs = [sno_per_confusion_value['TN'][numerical_feature], sno_per_confusion_value['TP'][numerical_feature],
        sno_per_confusion_value['FN'][numerical_feature], sno_per_confusion_value['FP'][numerical_feature]]
ft.density_x(dfs, numerical_feature, 'Density', 'linear', '',
            colors, ['TN', 'TP', 'FN', 'FP'], output)
import pandas as pd
import functions as ft

""" Create a density plot per confusion value comparison (e.g. FP vs TP) for all
    numerical features in the top 10 predictive features per snoRNA type C/D vs
    H/ACA). Each comparison counts only one time a snoRNA (ex: it considers a TP
    snoRNA once even if it is predicted multiple time as a TP across iterations)."""

sno_type = snakemake.wildcards.sno_type
sno_type = sno_type[0] + '/' + sno_type[1:]
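# presumably converts the 'CD' wildcard into 'C/D' and 'HACA' into 'H/ACA'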
color_dict = snakemake.params.color_dict
output = snakemake.output.density
numerical_feature = snakemake.wildcards.top_10_numerical_features
comparison = snakemake.wildcards.comparison_confusion_val
feature_df = pd.read_csv(snakemake.input.feature_df, sep='\t')

# Select only one snoRNA type
feature_df = feature_df[feature_df[sno_type] == 1.0]
feature_df = feature_df[['gene_id_sno', numerical_feature]]


# Get the list of all snoRNAs of a given confusion_value inside dict
sno_per_confusion_value_paths = snakemake.input.sno_per_confusion_value
sno_per_confusion_value = {}
for path in sno_per_confusion_value_paths:
    confusion_value = path.split('/')[-1]
    confusion_value = confusion_value.split('_')[0]
    df = pd.read_csv(path, sep='\t')
    sno_list = df['gene_id_sno'].to_list()
    sno_per_confusion_value[confusion_value] = sno_list


# Get the snoRNA feature value for each confusion value in the comparison
confusion_val1, confusion_val2_3 = comparison.split('_vs_')
confusion_val2, confusion_val3 = confusion_val2_3.split('_')  # this is respectively TN and TP
df1 = feature_df[feature_df['gene_id_sno'].isin(sno_per_confusion_value[confusion_val1])]
df2 = feature_df[feature_df['gene_id_sno'].isin(sno_per_confusion_value[confusion_val2])]
df3 = feature_df[feature_df['gene_id_sno'].isin(sno_per_confusion_value[confusion_val3])]

# Get only snoRNAs that are always predicted as their confusion value
# (i.e. remove snoRNAs that are for example predicted in an iteration as FP and in another as TN)
if confusion_val1 == 'FP':
    all_fp = df1['gene_id_sno'].to_list()
    all_tn = df2['gene_id_sno'].to_list()
    all_tp = df3['gene_id_sno'].to_list()
    all_fn = feature_df[feature_df['gene_id_sno'].isin(sno_per_confusion_value['FN'])]
    all_fn = list(pd.unique(all_fn['gene_id_sno']))
    real_fp = list(set(all_fp) - set(all_tn))
    real_tn = list(set(all_tn) - set(all_fp))
    real_tp = list(set(all_tp) - set(all_fn))
    df1 = df1[df1['gene_id_sno'].isin(real_fp)]
    df2 = df2[df2['gene_id_sno'].isin(real_tn)]
    df3 = df3[df3['gene_id_sno'].isin(real_tp)]
elif confusion_val1 == 'FN':
    all_fn = df1['gene_id_sno'].to_list()
    all_tn = df2['gene_id_sno'].to_list()
    all_tp = df3['gene_id_sno'].to_list()
    all_fp = feature_df[feature_df['gene_id_sno'].isin(sno_per_confusion_value['FP'])]
    all_fp = list(pd.unique(all_fp['gene_id_sno']))
    real_fn = list(set(all_fn) - set(all_tp))
    real_tn = list(set(all_tn) - set(all_fp))
    real_tp = list(set(all_tp) - set(all_fn))
    df1 = df1[df1['gene_id_sno'].isin(real_fn)]
    df2 = df2[df2['gene_id_sno'].isin(real_tn)]
    df3 = df3[df3['gene_id_sno'].isin(real_tp)]

len_df1, len_df2, len_df3 = len(df1), len(df2), len(df3)

# Create density plot
colors = [color_dict[confusion_val1], color_dict[confusion_val2], color_dict[confusion_val3]]
ft.density_x([df1[numerical_feature], df2[numerical_feature], df3[numerical_feature]],
            numerical_feature, 'Density', 'linear', f'{comparison} ({sno_type})',
            colors, [f'{confusion_val1} ({len_df1})', f'{confusion_val2} ({len_df2})', f'{confusion_val3} ({len_df3})'], output)
import pandas as pd
import functions as ft

df = pd.read_csv(snakemake.input.df, sep='\t')
cd = df[df['sno_type'] == 'C/D']
haca = df[df['sno_type'] == 'H/ACA']

# Generate a pairplot of numerical features with a hue of abundance_cutoff_2
# for all snoRNAs
ft.pairplot(df, 'abundance_cutoff_2', snakemake.params.hue_color,
            snakemake.output.pairplot)

# Generate a pairplot of numerical features with a hue of abundance_cutoff_2
# for either C/D and H/ACA snoRNAs separately
ft.pairplot(cd, 'abundance_cutoff_2', snakemake.params.hue_color,
            snakemake.output.pairplot_cd)
ft.pairplot(haca, 'abundance_cutoff_2', snakemake.params.hue_color,
        snakemake.output.pairplot_haca)        
import pandas as pd
import functions as ft

df = pd.read_csv(snakemake.input.rank_features_df, sep='\t')
all_features = pd.read_csv(snakemake.input.all_features_df, sep='\t')

# Get top 5 features across models
top_5 = df.groupby('feature')['feature_rank'].median().sort_values(ascending=True).index.to_list()[0:5]
top_5 = ' '.join(top_5).replace('host_expressed_norm', 'host_expressed').split()
top_5.extend(['C/D', 'label'])
top_5_df = all_features.filter(top_5, axis=1)

# Create df for C/D and H/ACA snoRNAs
cd = top_5_df[top_5_df['C/D'] == 1]
cd = cd.drop('C/D', axis=1)
haca = top_5_df[top_5_df['C/D'] == 0]
haca = haca.drop('C/D', axis=1)

# Replace string labels by numeric labels
hue_color = snakemake.params.hue_color
hue_color[0] = hue_color.pop('not_expressed')
hue_color[1] = hue_color.pop('expressed')

# Generate a pairplot of top 5 features across models with a hue of label
# for either C/D and H/ACA snoRNAs separately
ft.pairplot(cd, 'label', snakemake.params.hue_color,
            snakemake.output.pairplot_cd)
ft.pairplot(haca, 'label', snakemake.params.hue_color,
        snakemake.output.pairplot_haca)
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import decomposition as dec
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

"""
Creates a PCA plot (principal component analysis).
"""

df_initial = pd.read_csv(snakemake.input.df, sep='\t')
df_copy = df_initial.copy()
y = df_initial['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(df_initial, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next the total_train is split into train and test sets (1077 and 232 correspond
# to the number of examples in train and test sets respectively to get an
# approximately 70 % and 15 % of all examples in these two datasets)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train,
                                    test_size=232, train_size=1077, random_state=42,
                                    stratify=y_total_train)


X = df_initial.drop(['label', 'gene_id_sno'], axis=1)
X_test_copy = X_test.copy()
X_test_copy = X_test_copy.drop(['label', 'gene_id_sno'], axis=1)


# Normalize data for PCA
val = X.values
norm_val = StandardScaler().fit_transform(val)

val_test = X_test_copy.values
norm_val_test = StandardScaler().fit_transform(val_test)
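
# StandardScaler centers each feature to zero mean and unit variance so that features
# on different scales contribute comparably to the principal components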


# Create PCA analysis with 2 principal components for all snoRNAs
pca_all = dec.PCA(n_components=2, random_state=42)
principal_components_all = pca_all.fit_transform(norm_val)
principal_df_all = pd.DataFrame(data=principal_components_all, columns=['Principal_component_1', 'Principal_component_2'])
print('Explained variation per principal component: {}'.format(pca_all.explained_variance_ratio_))  # returns the proportion of variance explained by each component
print('For each component, the contribution of each column to the component is: {}'.format(pca_all.components_))  # returns an array of the contribution (loading) of each column per component

# Create PCA analysis with 2 principal components for snoRNAs in test set only
pca_test = dec.PCA(n_components=2, random_state=42)
principal_components_test = pca_test.fit_transform(norm_val_test)
principal_df_test = pd.DataFrame(data=principal_components_test, columns=['Principal_component_1', 'Principal_component_2'])
print('Explained variation per principal component: {}'.format(pca_test.explained_variance_ratio_))
print('For each component, the contribution of each column to the component is: {}'.format(pca_test.components_))  # returns an array of the contribution (loading) of each column per component


# Create the (pca) scatter plot for all snoRNAs
pc1_all = round(pca_all.explained_variance_ratio_[0] * 100, 2)
pc2_all = round(pca_all.explained_variance_ratio_[1] * 100, 2)

plt.rcParams['svg.fonttype'] = 'none'
fig, ax = plt.subplots(1, 1, figsize=(15, 15))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
ax.set_xlabel(f'Principal Component 1 ({pc1_all} %)', fontsize=15)
ax.set_ylabel(f'Principal Component 2 ({pc2_all} %)', fontsize=15)
principal_df2_all = pd.concat([principal_df_all, df_copy[['label', 'intergenic']]], axis=1)
principal_df2_all['label'] = principal_df2_all['label'].replace([0, 1], ['not_expressed', 'expressed'])
principal_df2_all['intergenic'] = principal_df2_all['intergenic'].replace([0, 1], ['intronic', 'intergenic'])

crits = list(snakemake.params.colors_dict.keys())
colors = list(snakemake.params.colors_dict.values())

# Plot each hue separately on the same ax
for crit, color in zip(crits, colors):
    indicesToKeep = principal_df2_all[snakemake.wildcards.pca_hue] == crit
    ax.scatter(principal_df2_all.loc[indicesToKeep, 'Principal_component_1'],
                principal_df2_all.loc[indicesToKeep, 'Principal_component_2'],
                c=color, s=50)

plt.legend(crits, prop={'size': 15})
plt.savefig(snakemake.output.pca_all, dpi=600)


# Create the (pca) scatter plot for snoRNAs in test set only
pc1_test = round(pca_test.explained_variance_ratio_[0] * 100, 2)
pc2_test = round(pca_test.explained_variance_ratio_[1] * 100, 2)

plt.rcParams['svg.fonttype'] = 'none'
fig2, ax2 = plt.subplots(1, 1, figsize=(15, 15))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
ax2.set_xlabel(f'Principal Component 1 ({pc1_test} %)', fontsize=15)
ax2.set_ylabel(f'Principal Component 2 ({pc2_test} %)', fontsize=15)
principal_df2_test = pd.concat([principal_df_test, X_test[['label', 'intergenic']].reset_index()], axis=1)
principal_df2_test['label'] = principal_df2_test['label'].replace([0, 1], ['not_expressed', 'expressed'])
principal_df2_test['intergenic'] = principal_df2_test['intergenic'].replace([0, 1], ['intronic', 'intergenic'])

crits = list(snakemake.params.colors_dict.keys())
colors = list(snakemake.params.colors_dict.values())


# Plot each hue separately on the same ax
for crit, color in zip(crits, colors):
    indicesToKeep = principal_df2_test[snakemake.wildcards.pca_hue] == crit
    ax2.scatter(principal_df2_test.loc[indicesToKeep, 'Principal_component_1'],
                principal_df2_test.loc[indicesToKeep, 'Principal_component_2'],
                c=color, s=50)

plt.legend(crits, prop={'size': 15})
plt.savefig(snakemake.output.pca_test, dpi=600)
import pandas as pd
import functions as ft

df = pd.read_csv(snakemake.input.confusion_value_per_sno, sep='\t')
colors = snakemake.params.color_dict
output = snakemake.output.pie

# Generate a pie chart of the number of snoRNA per confusion_value (TP, TN, FP, FN)
counts = [len(df[df['consensus_confusion_value'] == 'TP']),
            len(df[df['consensus_confusion_value'] == 'TN']),
            len(df[df['consensus_confusion_value'] == 'FP']),
            len(df[df['consensus_confusion_value'] == 'FN'])]  # keep this order as in the config.json color_dict
ft.pie_simple(counts, colors, '', output)
import pandas as pd
import functions as ft
import numpy as np

df = pd.read_csv(snakemake.input.sno_per_confusion_value, sep='\t')
colors = snakemake.params.color_dict
output = snakemake.output.pie
# Generate a pie chart of the number of snoRNA per confusion_value (TP, TN, FP, FN)
counts = [len(df[df['confusion_matrix'] == 'TP']),
            len(df[df['confusion_matrix'] == 'TN']),
            len(df[df['confusion_matrix'] == 'FP']),
            len(df[df['confusion_matrix'] == 'FN'])]  # keep this order as in the config.json color_dict
ft.pie_simple(counts, colors, '', output)
import pandas as pd
import functions as ft
import collections as coll

df = pd.read_csv(snakemake.input.confusion_value_per_sno[0], sep='\t')
colors = snakemake.params.color_dict
output = snakemake.output.pie

col = list(df.filter(regex='^confusion_matrix_').columns)

# Generate a pie chart of the number of snoRNA per confusion_value (TP, TN, FP, FN)
counts = [len(df[df[col[0]] == 'TP']),
            len(df[df[col[0]] == 'TN']),
            len(df[df[col[0]] == 'FP']),
            len(df[df[col[0]] == 'FN'])]  # keep this order as in the config.json color_dict
ft.pie_simple(counts, colors, '', output)
import pandas as pd
import functions as ft
import collections as coll

df = pd.read_csv(snakemake.input.confusion_value_per_sno, sep='\t')
colors = snakemake.params.color_dict
output = snakemake.output.pie

col = list(df.filter(regex='^confusion_matrix_').columns)

# Generate a pie chart of the number of snoRNA per confusion_value (TP, TN, FP, FN)
counts = [len(df[df[col[0]] == 'TP']),
            len(df[df[col[0]] == 'TN']),
            len(df[df[col[0]] == 'FP']),
            len(df[df[col[0]] == 'FN'])]  # keep this order as in the config.json color_dict
ft.pie_simple(counts, colors, '', output)
import pandas as pd
import functions as ft

feature_df = pd.read_csv(snakemake.input.df, sep='\t')
host_df = pd.read_csv(snakemake.input.host_df)
multi_HG_different_label_snoRNAs_df = pd.read_csv(snakemake.input.multi_HG_different_label_snoRNAs, sep='\t')
merged_conf_values_paths = snakemake.params.merged_confusion_values
mono_vs_multi_HG_colors = snakemake.params.mono_vs_multi_HG_colors
multi_HG_labels_colors = snakemake.params.multi_HG_labels_colors
multi_HG_same_labels_proportion_colors = snakemake.params.multi_HG_same_labels_proportion_colors
multi_HG_diff_labels_proportion_colors = snakemake.params.multi_HG_diff_labels_proportion_colors

# Drop intergenic snoRNAs and merge host_df to all_features_df
feature_df = feature_df[feature_df['abundance_cutoff_host'] != 'intergenic']
feature_df = feature_df.merge(host_df, how='left', left_on='gene_id_sno', right_on='sno_id')

# Concat all merged confusion value dfs (one per manual iteration) into 1 df
confusion_value_dfs = []
for path in merged_conf_values_paths:
    df = pd.read_csv(path, sep='\t')
    confusion_value_dfs.append(df)
confusion_value_concat_df = pd.concat(confusion_value_dfs)

# Find the number of mono vs multi HG (respectively containing 1 vs >1 snoRNAs in the same HG)
mono_HG, multi_HG = [], []
for i, group in enumerate(feature_df.groupby('host_id')):
    grouped_df = group[1]
    host_id = group[0]
    if len(grouped_df) == 1:  # mono-HG
        mono_HG.append(host_id)
    elif len(grouped_df) > 1:  # multi-HG
        multi_HG.append(host_id)
mono_HG_nb, multi_HG_nb = len(mono_HG), len(multi_HG)
pie_counts = [mono_HG_nb, multi_HG_nb]

# Find in the multi_HG those that have snoRNAs with all the same label vs those with different labels
sno_ids_multi_HG_diff_labels = list(multi_HG_different_label_snoRNAs_df.gene_id_sno)
multi_HG_same_label = feature_df[feature_df['host_id'].isin(multi_HG)]
multi_HG_same_label = multi_HG_same_label[~multi_HG_same_label['gene_id_sno'].isin(sno_ids_multi_HG_diff_labels)]
multi_HG_same_label_nb = len(pd.unique(multi_HG_same_label['host_id']))
multi_HG_diff_label_nb = len(pd.unique(multi_HG_different_label_snoRNAs_df['host_id']))
outer_donut_counts = [multi_HG_same_label_nb, multi_HG_diff_label_nb]

# Find for multi_HG with same snoRNA labels the % of all_expressed or all_not_expressed snoRNAs
all_expressed, all_not_expressed = [], []
for i, group in enumerate(multi_HG_same_label.groupby('host_id')):
    grouped_df = group[1]
    host_id = group[0]
    if 'expressed' in list(grouped_df.abundance_cutoff_2):
        all_expressed.append(host_id)
    elif 'not_expressed' in list(grouped_df.abundance_cutoff_2):
        all_not_expressed.append(host_id)
multi_HG_all_expressed_nb = len(all_expressed)
multi_HG_all_not_expressed_nb = len(all_not_expressed)
inner_donut_same_labels_counts = [multi_HG_all_expressed_nb, multi_HG_all_not_expressed_nb]

# Find for multi_HG with different snoRNA labels the % of snoRNAs that are 50-50 expressed-not_expressed, more expressed, or more not_expressed
half_expressed_not_expressed, more_expressed, more_not_expressed = [], [], []
for i, group in enumerate(multi_HG_different_label_snoRNAs_df.groupby('host_id')):
    grouped_df = group[1]
    host_id = group[0]
    if len(grouped_df[grouped_df['abundance_cutoff_2'] == 'expressed']) == len(grouped_df[grouped_df['abundance_cutoff_2'] == 'not_expressed']):
        half_expressed_not_expressed.append(host_id)
    elif len(grouped_df[grouped_df['abundance_cutoff_2'] == 'expressed']) > len(grouped_df[grouped_df['abundance_cutoff_2'] == 'not_expressed']):
        more_expressed.append(host_id)
    elif len(grouped_df[grouped_df['abundance_cutoff_2'] == 'expressed']) < len(grouped_df[grouped_df['abundance_cutoff_2'] == 'not_expressed']):
        more_not_expressed.append(host_id)
half_expressed_not_expressed_nb = len(half_expressed_not_expressed)
more_expressed_nb = len(more_expressed)
more_not_expressed_nb = len(more_not_expressed)
inner_donut_diff_labels_counts = [half_expressed_not_expressed_nb, more_expressed_nb, more_not_expressed_nb]

# Create pie chart of mono_HG vs multi_HG
ft.pie_simple(pie_counts, mono_vs_multi_HG_colors, '', snakemake.output.pie)

## Create donut chart of multi_HG with same vs different snoRNA labels
counts_all = [outer_donut_counts, inner_donut_same_labels_counts+inner_donut_diff_labels_counts]
# Set inner_labels as a list of empty strings, and labels as outer and inner_labels
inner_labels = [None] * len(multi_HG_labels_colors.keys()) * len(multi_HG_same_labels_proportion_colors.keys()) + [None]  # + [None] is for the third inner donut label that is only present in one of the half-inner donut
labels = [list(multi_HG_labels_colors.keys()), inner_labels]
# Set inner colors as a repeated list of colors for each part of the inner donut and colors as outer and inner_colors
inner_colors = list(multi_HG_same_labels_proportion_colors.values()) + list(multi_HG_diff_labels_proportion_colors.values())
colors = [list(multi_HG_labels_colors.values()), inner_colors]
legend_labels = list(multi_HG_same_labels_proportion_colors.keys()) + list(multi_HG_diff_labels_proportion_colors.keys())
legend_colors = list(multi_HG_same_labels_proportion_colors.values()) + list(multi_HG_diff_labels_proportion_colors.values())
ft.donut_2(counts_all, labels, colors, '', legend_labels, legend_colors, snakemake.output.donut)
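
The nested counts assembled above (outer_donut_counts plus the two inner count lists) follow the standard matplotlib pattern for concentric donut charts, which the helper ft.donut_2 presumably wraps. A minimal, self-contained sketch with made-up numbers (not part of the pipeline; the output path is illustrative):

import matplotlib.pyplot as plt

outer_counts = [30, 12]               # e.g. multi-HG with same vs different labels
inner_counts = [22, 8, 4, 5, 3]       # subdivisions of each outer wedge, in order
outer_colors = ['#1f77b4', '#ff7f0e']
inner_colors = ['#aec7e8', '#ffbb78', '#98df8a', '#c5b0d5', '#c49c94']

fig, ax = plt.subplots()
# Outer ring: wedges of width 0.3 drawn at radius 1
ax.pie(outer_counts, radius=1, colors=outer_colors,
       wedgeprops=dict(width=0.3, edgecolor='white'))
# Inner ring: same wedge width, drawn at a smaller radius
ax.pie(inner_counts, radius=0.7, colors=inner_colors,
       wedgeprops=dict(width=0.3, edgecolor='white'))
ax.set(aspect='equal')
plt.savefig('nested_donut_example.svg', bbox_inches='tight')
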
import pandas as pd
import functions as ft
import numpy as np

df = pd.read_csv(snakemake.input.df, sep='\t')
colors = snakemake.params.colors
output = snakemake.output.pie
# Generate a pie chart of the number of snoRNA per abundance status (expressed
# vs not_expressed)
ab_status = [len(df[df['abundance_cutoff_2'] == 'expressed']),
            len(df[df['abundance_cutoff_2'] == 'not_expressed'])]
ft.pie_simple(ab_status, colors, '', output)
import pandas as pd
import functions as ft

color_dict = snakemake.params.color_dict
dfs_outputs = snakemake.output.dfs
colors = [color_dict['TN'], color_dict['TP'], color_dict['FN'], color_dict['FP']]
confusion_value_paths = snakemake.input.confusion_value_dfs
rbp_enrichment_df = pd.read_csv(snakemake.input.combined_rbp_score_df, sep='\t')
confusion_value_dict = {}
for path in confusion_value_paths:
    confusion_value = path.split("/")[-1].split("_w_")[0]
    print(confusion_value)
    df = pd.read_csv(path, sep='\t')
    df = df.merge(rbp_enrichment_df, how='left', on='gene_id_sno')
    output_path = [output for output in dfs_outputs if confusion_value in output][0]
    df.to_csv(output_path, sep='\t', index=False)
    confusion_value_dict[confusion_value] = df

df_list = [confusion_value_dict['TN'].combined_rbp_score_log10,
            confusion_value_dict['TP'].combined_rbp_score_log10,
            confusion_value_dict['FN'].combined_rbp_score_log10,
            confusion_value_dict['FP'].combined_rbp_score_log10]
ft.density_x(df_list, 'Combined RBP enrichment score (log10)', 'Density', 'linear', '',
                colors, ['TN', 'TP', 'FN', 'FP'], snakemake.output.density)
import pandas as pd
import functions as ft
import numpy as np
from functools import reduce

feature_df = pd.read_csv(snakemake.input.all_feature_df, sep='\t')
rbp_bed_paths = snakemake.input.beds
output_simple = snakemake.output.density
color_dict = snakemake.params.color_dict
colors = [color_dict['expressed'], color_dict['not_expressed']]
bed_dfs = []
for i, path in enumerate(rbp_bed_paths):
    rbp = path.split('/')[-1].split('_mapped')[0]
    print(rbp)
    bed_df = pd.read_csv(path, sep='\t', names=["chr", "start", "end", "gene_id_sno", "dot", "strand", f"{rbp}_score"])
    bed_df = bed_df[['gene_id_sno', f'{rbp}_score']]
    bed_df[f'{rbp}_score'] = bed_df[f'{rbp}_score'].replace('.', 0).astype(float)
    bed_df[f'{rbp}_score'] = bed_df[f'{rbp}_score'] + 0.000001  # add pseudocount so that a log can be computed
    bed_df = bed_df.merge(feature_df, how='left', on='gene_id_sno')
    bed_df[f'{rbp}_score_log10'] = np.log10(bed_df[f'{rbp}_score'])

    # Optional filter (disabled): keep only intergenic snoRNAs
    #bed_df = bed_df[bed_df['abundance_cutoff_host'] == 'intergenic']
    # Create a density plot for each RBP enrichment to compare expressed vs not expressed snoRNAs
    expressed, not_expressed = bed_df[bed_df['abundance_cutoff_2'] == 'expressed'], bed_df[bed_df['abundance_cutoff_2'] == 'not_expressed']
    ft.density_x([expressed[f'{rbp}_score_log10'], not_expressed[f'{rbp}_score_log10']], f'{rbp} enrichment score (log10)', 'Density', 'linear', '', colors,
                ['expressed', 'not_expressed'], output_simple[i])
    bed_df = bed_df.drop(columns='abundance_cutoff_2')
    bed_dfs.append(bed_df)


# Create a combined RBP enrichment score
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['gene_id_sno'],
                                            how='left'), bed_dfs)
df_merged = df_merged.filter(regex='(gene_id_sno|_score$)')
df_merged = df_merged.drop(columns='FBL_mnase_score')
df_merged['combined_rbp_score'] = df_merged.filter(regex='_score$').sum(axis=1)
df_merged = df_merged.merge(feature_df[['gene_id_sno', 'abundance_cutoff_2']], on='gene_id_sno', how='left')
df_merged['combined_rbp_score_log10'] = np.log10(df_merged['combined_rbp_score'])
df_merged.to_csv(snakemake.output.combined_rbp_score_df, sep='\t', index=False)


expressed_merged, not_expressed_merged = df_merged[df_merged['abundance_cutoff_2'] == 'expressed'], df_merged[df_merged['abundance_cutoff_2'] == 'not_expressed']

# Create a density plot for the combined RBP enrichment to compare expressed vs not expressed snoRNAs
ft.density_x([expressed_merged['combined_rbp_score_log10'], not_expressed_merged['combined_rbp_score_log10']], 'Combined RBP enrichment score (log10)', 'Density', 'linear', '', colors,
            ['expressed', 'not_expressed'], snakemake.output.density_combined)
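
A note on the pseudocount added above: '.' scores in the bed files become 0, and log10(0) is undefined, so a tiny constant keeps every snoRNA plottable on the log scale. A minimal illustration with assumed values:

import numpy as np

scores = np.array([0.0, 0.5, 12.0])      # '.' entries in the bed files become 0
print(np.log10(scores))                   # -> [-inf, -0.301, 1.079] (with a runtime warning)
print(np.log10(scores + 0.000001))        # -> [-6.0, ~-0.301, ~1.079], all finite and plottable
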
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import functions as ft
import pickle
# Generate the same CV, training and test sets (only the test set will be
# used in this script) that were generated in hyperparameter_tuning_cv and train_models
# (respectively 15%, 70% and 15% of all dataset examples)
df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next the total_train set is split into train and test sets (1077 and 232 are
# the numbers of examples in the train and test sets respectively, i.e.
# approximately 70 % and 15 % of all examples)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train,
                                    test_size=232, train_size=1077, random_state=42,
                                    stratify=y_total_train)


# Unpickle and thus instantiate the 5 trained models
log_reg = pickle.load(open(snakemake.input.log_reg, 'rb'))
svc = pickle.load(open(snakemake.input.svc, 'rb'))
rf = pickle.load(open(snakemake.input.rf, 'rb'))
gbm = pickle.load(open(snakemake.input.gbm, 'rb'))
knn = pickle.load(open(snakemake.input.knn, 'rb'))
# Create the ROC curve
classifiers = [log_reg, svc, rf, gbm, knn]
ft.roc_curve(classifiers, X_test, y_test, "False positive rate",
            "True positive rate", "ROC curves of the 5 models showing their performance on"+"\n"+"the test dataset with only the "+snakemake.wildcards.one_feature+" feature",
            snakemake.output.roc_curve)
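
This script does not reload the saved split from disk; it re-derives it. Because train_test_split is called with the same random_state and stratify arguments on the same input table, it reproduces exactly the partition used during training. A small illustrative check on a toy dataframe:

import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({'feat': range(10), 'label': [0, 1] * 5})
a = train_test_split(toy.drop('label', axis=1), toy['label'],
                     test_size=0.3, random_state=42, stratify=toy['label'])
b = train_test_split(toy.drop('label', axis=1), toy['label'],
                     test_size=0.3, random_state=42, stratify=toy['label'])
print(list(a[1].index) == list(b[1].index))  # True: identical test rows both times
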
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import functions as ft
import pickle
# Generate the same CV, training and test sets (only the test set will be
# used in this script) that were generated in hyperparameter_tuning_cv and train_models
# (respectively 15%, 70% and 15% of all dataset examples)
df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next the total_train set is split into train and test sets (1077 and 232 are
# the numbers of examples in the train and test sets respectively, i.e.
# approximately 70 % and 15 % of all examples)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train,
                                    test_size=232, train_size=1077, random_state=42,
                                    stratify=y_total_train)


# Unpickle and thus instantiate the 5 trained models
log_reg = pickle.load(open(snakemake.input.log_reg, 'rb'))
svc = pickle.load(open(snakemake.input.svc, 'rb'))
rf = pickle.load(open(snakemake.input.rf, 'rb'))
gbm = pickle.load(open(snakemake.input.gbm, 'rb'))
knn = pickle.load(open(snakemake.input.knn, 'rb'))
# Create the ROC curve
classifiers = [log_reg, svc, rf, gbm, knn]
ft.roc_curve(classifiers, X_test, y_test, "False positive rate",
            "True positive rate", "ROC curves of the 5 models showing"+"\n"+"their performance on the test dataset",
            snakemake.output.roc_curve)
from sklearn.metrics import auc
from sklearn.metrics import plot_roc_curve
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft
import numpy as np
import pickle

model_colors = snakemake.params.model_colors_dict

# Load test set and labels of all 10 iterations
X_all_test_path, y_all_test_path = snakemake.input.X_test, snakemake.input.y_test
X_test, y_test = [], []
for i, test_path in enumerate(X_all_test_path):
    X_test_iteration = pd.read_csv(test_path, sep='\t', index_col='gene_id_sno')
    y_test_iteration = pd.read_csv(y_all_test_path[i], sep='\t')
    X_test.append(X_test_iteration)
    y_test.append(y_test_iteration)

# Get the name of all models in a list
pickled_models = snakemake.input.pickled_trained_model
model_name = []
for model_path in pickled_models:
    name = model_path.split('_trained')[0]
    name = name.rsplit('/')[-1]
    if name not in model_name:
        model_name.append(name)

# Unpickle and thus instantiate the trained models for all 10 iterations into a dict
loaded_models = {}
for model in model_name:
    iterations_per_model = [path for path in pickled_models if model in path]
    unpickled_models = []
    for iteration_path in iterations_per_model:
        unpickled_model = pickle.load(open(iteration_path, 'rb'))
        unpickled_models.append(unpickled_model)
    loaded_models[model] = unpickled_models


# Compute average true positive rate, false positive rate, AUCs, and stdev of
# AUCs for each model across the 10 iterations
# We use 100 defined thresholds of probabilities to be able to compute a mean and stdev across 10 iterations;
# otherwise, the fpr and tpr given by viz (see below) can have different shape for each iteration
# (because multiple thresholds can have the same x,y --> fpr, tpr coordinates), which would not work to compute an avg per threshold because there would be missing data points
mean_fpr = np.linspace(0, 1, 100)  # we want to create 100 points between 0 and 1 (x-axis) for the roc curve based on
                                    # a smaller number of coordinates given by roc_curve_display (where x,y --> fpr, tpr) using interpolation.
                                    # mean_fpr (mean false positive rate) are 100 evenly distributed x values
mean_tprs = []  # this will contain the avg tpr values (across iterations) per model
std_tprs = []  # this will contain the stdev of tpr values (across iterations) per model
mean_aucs = []  # this will contain the avg auc (across iterations) per model
std_auc = []  # this will contain the stdev of the auc (across iterations) per model
for mod_name, models_per_iteration in loaded_models.items():
    tprs, aucs = [], []  # true positive rates and AUCs for all iterations of a model
    for i, predictor_per_iteration in enumerate(models_per_iteration):
        viz = plot_roc_curve(predictor_per_iteration, X_test[i],  # used to access fpr and tps, not to plot the roc curve
                                            y_test[i])
        interpolated_tpr = np.interp(mean_fpr, viz.fpr, viz.tpr)  # we interpolate the same number of values (100) for each iteration
        interpolated_tpr[0] = 0.0  # force the interpolated curve to start at (0, 0)
        tprs.append(interpolated_tpr)
        aucs.append(viz.roc_auc)
    mean_tpr_per_model = np.mean(tprs, axis=0)
    mean_tpr_per_model[-1] = 1  # we modify the last value to be exactly 1 on the y axis
    std_tpr_per_model = np.std(tprs, axis=0)
    mean_auc_per_model = auc(mean_fpr, mean_tpr_per_model)
    std_auc_per_model = np.std(aucs)
    mean_tprs.append(mean_tpr_per_model)
    std_tprs.append(std_tpr_per_model)
    mean_aucs.append(mean_auc_per_model)
    std_auc.append(std_auc_per_model)

# Plot the roc curve with error clouds below and above
ft.roc_curve_error_fill(model_name, mean_aucs, mean_fpr, mean_tprs, std_tprs,
                        model_colors, snakemake.output.roc_curve)
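
The interpolation step above is what makes averaging possible: each iteration's ROC curve has a different number of thresholds, so every curve is resampled onto the same 100-point false positive rate grid before taking the mean. A toy sketch of the idea:

import numpy as np

mean_fpr = np.linspace(0, 1, 100)
fpr1, tpr1 = np.array([0.0, 0.2, 1.0]), np.array([0.0, 0.8, 1.0])             # 3 thresholds
fpr2, tpr2 = np.array([0.0, 0.1, 0.5, 1.0]), np.array([0.0, 0.6, 0.9, 1.0])   # 4 thresholds

tprs = [np.interp(mean_fpr, fpr1, tpr1), np.interp(mean_fpr, fpr2, tpr2)]
mean_tpr = np.mean(tprs, axis=0)   # now well defined: both arrays have length 100
mean_tpr[0], mean_tpr[-1] = 0.0, 1.0
print(mean_tpr.shape)              # (100,)
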
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import functions as ft
import pickle

X_test = pd.read_csv(snakemake.input.X_test, sep='\t', index_col='gene_id_sno')
y_test = pd.read_csv(snakemake.input.y_test, sep='\t')

# Unpickle and thus instantiate the 5 trained models
log_reg = pickle.load(open(snakemake.input.log_reg, 'rb'))
svc = pickle.load(open(snakemake.input.svc, 'rb'))
rf = pickle.load(open(snakemake.input.rf, 'rb'))
gbm = pickle.load(open(snakemake.input.gbm, 'rb'))
knn = pickle.load(open(snakemake.input.knn, 'rb'))
# Create the ROC curve
classifiers = [log_reg, svc, rf, gbm, knn]
ft.roc_curve(classifiers, X_test, y_test, "False positive rate",
            "True positive rate", "ROC curves of the 5 models showing"+"\n"+"their performance on the test dataset",
            snakemake.output.roc_curve)
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import functions as ft
import pickle
# Generate the same CV, training and test sets (only the test set will be
# used in this script) that were generated in hyperparameter_tuning_cv and train_models
# (respectively 15%, 70% and 15% of all dataset examples)
df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next the total_train set is split into train and test sets (1017 and 180 are
# the numbers of examples in the train and test sets respectively, i.e.
# approximately 70 % and 15 % of all examples)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train,
                                    test_size=180, train_size=1017, random_state=42,
                                    stratify=y_total_train)


# Unpickle and thus instantiate the 5 trained models
log_reg = pickle.load(open(snakemake.input.log_reg, 'rb'))
svc = pickle.load(open(snakemake.input.svc, 'rb'))
rf = pickle.load(open(snakemake.input.rf, 'rb'))
gbm = pickle.load(open(snakemake.input.gbm, 'rb'))
knn = pickle.load(open(snakemake.input.knn, 'rb'))
# Create the ROC curve
classifiers = [log_reg, svc, rf, gbm, knn]
ft.roc_curve(classifiers, X_test, y_test, "False positive rate",
            "True positive rate", "ROC curves of the 5 models showing their performance"+"\n"+"on the test dataset without the snoRNA clusters",
            snakemake.output.roc_curve)
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import functions as ft
import pickle
# Generate the same CV, training and test sets (only the test set will be
# used in this script) that were generated in hyperparameter_tuning_cv and train_models
# (respectively 15%, 70% and 15% of all dataset examples)
df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next the total_train set is split into train and test sets (1077 and 232 are
# the numbers of examples in the train and test sets respectively, i.e.
# approximately 70 % and 15 % of all examples)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train,
                                    test_size=232, train_size=1077, random_state=42,
                                    stratify=y_total_train)


# Unpickle and thus instantiate the 5 trained models
log_reg = pickle.load(open(snakemake.input.log_reg, 'rb'))
svc = pickle.load(open(snakemake.input.svc, 'rb'))
rf = pickle.load(open(snakemake.input.rf, 'rb'))
gbm = pickle.load(open(snakemake.input.gbm, 'rb'))
knn = pickle.load(open(snakemake.input.knn, 'rb'))
# Create the ROC curve
classifiers = [log_reg, svc, rf, gbm, knn]
ft.roc_curve(classifiers, X_test, y_test, "False positive rate",
            "True positive rate", "ROC curves of the 5 models showing their performance on"+"\n"+"the test dataset without the "+snakemake.wildcards.feature_effect+" feature",
            snakemake.output.roc_curve)
import pandas as pd
import functions as ft
from scipy import stats as st

species_ordered = ['pan_troglodytes', 'gorilla_gorilla', 'macaca_mulatta',
                    'oryctolagus_cuniculus', 'rattus_norvegicus', 'bos_taurus',
                    'ornithorhynchus_anatinus', 'gallus_gallus', 'xenopus_tropicalis',
                    'danio_rerio']
human_df = pd.read_csv(snakemake.input.human_labels, sep='\t')
mouse_df = pd.read_csv(snakemake.input.mouse_labels, sep='\t')
paths = snakemake.input.dfs
dfs = []
for i, path in enumerate(paths):
    species_name = path.split('/')[-1].split('_predicted_label')[0]
    df = pd.read_csv(path, sep='\t')
    df['species_name'] = species_name
    df = df[['predicted_label', 'species_name']]
    dfs.append(df)

# Concat dfs into 1 df
concat_df = pd.concat(dfs)

# Given a species name list, count the number of criteria in specific col of df
# that was previously filtered using species_name_list in global_col
def count_list_species(initial_df, species_name_list, global_col, criteria, specific_col):
    """
    Create a list of lists using initial_col to split the global list and
    specific_col to create the nested lists.
    """
    df_list = []

    #Sort in acending order the unique values in global_col and create a list of
    # df based on these values
    print(species_name_list)
    for val in species_name_list:
        temp_val = initial_df[initial_df[global_col] == val]
        df_list.append(temp_val)


    l = []
    for i, df in enumerate(df_list):
        temp = []
        for j, temp1 in enumerate(criteria):
            crit = df[df[specific_col] == temp1]
            crit = len(crit)
            temp.append(crit)
        l.append(temp)

    return l


# Generate a bar chart of categorical features with a hue of gene_biotype
counts_per_feature = count_list_species(concat_df, species_ordered, 'species_name',
                    list(snakemake.params.hue_color.keys()),
                    'predicted_label')
# Add human and mouse actual labels for comparison
human_expressed = len(human_df[human_df['abundance_cutoff_2'] == 'expressed'])
human_not_expressed = len(human_df[human_df['abundance_cutoff_2'] == 'not_expressed'])
mouse_expressed = len(mouse_df[mouse_df['abundance_cutoff'] == 'expressed'])
mouse_not_expressed = len(mouse_df[mouse_df['abundance_cutoff'] == 'not_expressed'])
counts_per_feature = [[human_expressed, human_not_expressed]] + [[mouse_expressed, mouse_not_expressed]] + counts_per_feature

# Convert to percent
percent = ft.percent_count(counts_per_feature)


# Get the total number of snoRNAs (for which we found snoRNA type) per species
total_nb_sno = [sum(l) for l in counts_per_feature]
xtick_labels = ['homo_sapiens', 'mus_musculus'] + species_ordered
sno_nb_dict = dict(zip(xtick_labels, total_nb_sno))

# Create df
df = pd.DataFrame(percent, index=xtick_labels, columns=list(snakemake.params.hue_color.keys()))
df = df.reset_index()
df = df.rename(columns={'index': 'species'})

# Create sno_nb and predicted_vs_actual_ab_status cols
df['sno_nb'] = df['species'].map(sno_nb_dict)
df.loc[(df.species == 'homo_sapiens') | (df.species == 'mus_musculus'), 'predicted_vs_actual_ab_status'] = 'Actual abundance status'
df.predicted_vs_actual_ab_status = df.predicted_vs_actual_ab_status.fillna('Predicted abundance status')
print(df)

# Create scatter plot
color_dictio = {'Actual abundance status': '#000000',
                'Predicted abundance status': '#bdbdbd'}
pearson_r, pval = st.pearsonr(list(df.sno_nb), list(df.expressed))
print(pearson_r, pval)
ft.scatter(df, 'sno_nb', 'expressed', 'predicted_vs_actual_ab_status',
            'Number of snoRNAs per species', 'Proportion of expressed snoRNAs (%)',
            '', color_dictio, f"Pearson's r: {pearson_r}\np-value: {pval}", snakemake.output.scatter)
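
For reference, converting the per-species counts into row percentages (what ft.percent_count presumably does with counts_per_feature) boils down to the following, shown here with made-up numbers:

counts_per_feature = [[120, 30], [80, 80], [10, 40]]   # made-up [expressed, not_expressed] rows
percent = [[100 * c / sum(row) for c in row] for row in counts_per_feature]
print(percent)   # [[80.0, 20.0], [50.0, 50.0], [20.0, 80.0]]
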
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft
import re
import statistics as st

# Define function to return the average and standard deviation of accuracies of the 10 iterations per model in a respective dict each
def get_avg_stdev(dict_of_all_models_accuracies):
    avg, stdev = {}, {}
    for i in range(0, len(sorted(dict_of_all_models_accuracies.keys())), 10):  # sort to regroup in order all 10 iterations per model
        iterations_per_model = sorted(dict_of_all_models_accuracies.keys())[i:i+10]  # select the 10 iterations names per model
        accuracies_per_model = [dict_of_all_models_accuracies[iteration] for iteration in iterations_per_model]  # select corresponding accuracies of these 10 iterations
        model_name = iterations_per_model[0].split('_')[0]  # Get the model name
        avg_acc, stdev_acc = st.mean(accuracies_per_model), st.stdev(accuracies_per_model)
        avg[model_name], stdev[model_name] = avg_acc, stdev_acc
    return avg, stdev

# Get the accuracy of all models on the CV set
cv_accuracy = {}
for i, model in enumerate(snakemake.input.cv_accuracy):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    iteration = model.split('_')[-1][0:-4]
    accuracy = float(df['accuracy_cv'].values)
    cv_accuracy[f'{substring_model}_{iteration}'] = accuracy

# Get the accuracy of all models on the training set
train_accuracy = {}
for i, model in enumerate(snakemake.input.training_accuracy):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    iteration = model.split('_')[-1][0:-4]
    accuracy = float(df[f'{substring_model}_training_accuracy'].values)
    train_accuracy[f'{substring_model}_{iteration}'] = accuracy


# Get the accuracy of all models on the test set
test_accuracy = {}
for i, model in enumerate(snakemake.input.test_accuracy):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    iteration = model.split('_')[-1][0:-4]
    accuracy = float(df[f'{substring_model}_test_accuracy'].values)
    test_accuracy[f'{substring_model}_{iteration}'] = accuracy

# Get the average and standard deviation of CV, training and test sets accuracies for all models across the 10 iterations
cv_avg, cv_stdev = get_avg_stdev(cv_accuracy)
train_avg, train_stdev = get_avg_stdev(train_accuracy)
test_avg, test_stdev = get_avg_stdev(test_accuracy)

# Create the df of all accuracies
all_accuracies = pd.DataFrame.from_dict([cv_avg, train_avg, test_avg])
all_accuracies.index = ["cv", "train", "test"]
all_accuracies = all_accuracies.transpose()
all_accuracies_hue = all_accuracies.copy()
all_accuracies_hue['model'] = all_accuracies_hue.index
print(all_accuracies_hue)
print(all_accuracies)
# Create the connected scatter plot
color_dict = snakemake.params.colors
color_dict['log'] = color_dict.pop('log_reg')  # remap the log_reg color: get_avg_stdev truncates the model name to 'log' via split('_')[0]
ft.connected_scatter_errbars(all_accuracies, all_accuracies_hue, 'model',
                    color_dict, 'Dataset', [cv_stdev, train_stdev, test_stdev],
                    ['Cross-\nvalidation', 'Training', 'Test'], 'Dataset', 'Accuracy',
                    snakemake.output.scatter)
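
The grouping in get_avg_stdev relies on the sorted accuracy keys falling into contiguous blocks of exactly 10 per model, and on split('_')[0] truncating 'log_reg' to 'log' (hence the color patch above). A hedged alternative sketch, not used by the pipeline, that groups by the full model prefix instead (toy values):

import statistics as st
from collections import defaultdict

accuracies = {'rf_1': 0.90, 'rf_2': 0.92, 'gbm_1': 0.88, 'gbm_2': 0.87,
              'log_reg_1': 0.85, 'log_reg_2': 0.84}

per_model = defaultdict(list)
for key, acc in accuracies.items():
    model = key.rsplit('_', 1)[0]        # keeps 'log_reg' intact, unlike split('_')[0]
    per_model[model].append(acc)

avg = {m: st.mean(v) for m, v in per_model.items()}
stdev = {m: st.stdev(v) for m, v in per_model.items()}
print(avg, stdev)
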
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft
import re
import statistics as st

# Define function to return the average and standard deviation of accuracies of the 20 iterations per model in a respective dict each
def get_avg_stdev(dict_of_all_models_accuracies):
    avg, stdev = {}, {}
    for i in range(0, len(sorted(dict_of_all_models_accuracies.keys())), 20):  # sort to regroup in order all 20 iterations per model
        iterations_per_model = sorted(dict_of_all_models_accuracies.keys())[i:i+20]  # select the 20 iterations names per model
        accuracies_per_model = [dict_of_all_models_accuracies[iteration] for iteration in iterations_per_model]  # select corresponding accuracies of these 20 iterations
        model_name = iterations_per_model[0].split('_')[0]  # Get the model name
        avg_acc, stdev_acc = st.mean(accuracies_per_model), st.stdev(accuracies_per_model)
        avg[model_name], stdev[model_name] = avg_acc, stdev_acc
    return avg, stdev

# Get the accuracy of all models on the CV set
cv_accuracy = {}
for i, model in enumerate(snakemake.input.cv_accuracy):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    iteration = model.split('_')[-2]
    accuracy = float(df['accuracy_cv'].values)
    cv_accuracy[f'{substring_model}_{iteration}'] = accuracy

# Get the accuracy of all models on the training set
train_accuracy = {}
for i, model in enumerate(snakemake.input.training_accuracy):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    iteration = model.split('_')[-2]
    accuracy = float(df[f'{substring_model}_training_accuracy'].values)
    train_accuracy[f'{substring_model}_{iteration}'] = accuracy

# Get the accuracy of all models on the test set
test_accuracy = {}
for i, model in enumerate(snakemake.input.test_accuracy):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    iteration = model.split('_')[-2]
    accuracy = float(df[f'{substring_model}_test_accuracy'].values)
    test_accuracy[f'{substring_model}_{iteration}'] = accuracy

# Get the average and standard deviation of CV, training and test sets accuracies for all models across the 20 iterations
cv_avg, cv_stdev = get_avg_stdev(cv_accuracy)
train_avg, train_stdev = get_avg_stdev(train_accuracy)
test_avg, test_stdev = get_avg_stdev(test_accuracy)

print(cv_accuracy)
print(train_accuracy)
print(test_accuracy)
# Create the df of all accuracies
all_accuracies = pd.DataFrame.from_dict([cv_avg, train_avg, test_avg])
all_accuracies.index = ["cv", "train", "test"]
all_accuracies = all_accuracies.transpose()
all_accuracies_hue = all_accuracies.copy()
all_accuracies_hue['model'] = all_accuracies_hue.index

# Create the connected scatter plot
color_dict = snakemake.params.colors
color_dict['log'] = color_dict.pop('log_reg')  # remap the log_reg color: get_avg_stdev truncates the model name to 'log' via split('_')[0]
ft.connected_scatter_errbars(all_accuracies, all_accuracies_hue, 'model',
                    color_dict, 'Dataset', [cv_stdev, train_stdev, test_stdev],
                    ['Cross-\nvalidation', 'Training', 'Test'], 'Dataset', 'Accuracy',
                    snakemake.output.scatter)
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft
import re

# Get the accuracy of all models on the CV set
cv_accuracy = {}
for i, model in enumerate(snakemake.input.cv_accuracy):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    accuracy = float(df['accuracy_cv'].values)
    cv_accuracy[substring_model] = accuracy

# Get the accuracy of all models on the training set
train_accuracy = {}
for i, model in enumerate(snakemake.input.training_accuracy):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    accuracy = float(df[f'{substring_model}_training_accuracy'].values)
    train_accuracy[substring_model] = accuracy

# Get the accuracy of all models on the test set
test_accuracy = {}
for i, model in enumerate(snakemake.input.test_accuracy):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    accuracy = float(df[f'{substring_model}_test_accuracy'].values)
    test_accuracy[substring_model] = accuracy


# Create the df of all accuracies
all_accuracies = pd.DataFrame.from_dict([cv_accuracy, train_accuracy, test_accuracy])
all_accuracies.index = ["cv", "train", "test"]
all_accuracies = all_accuracies.transpose()
all_accuracies_hue = all_accuracies.copy()
all_accuracies_hue['model'] = all_accuracies_hue.index

# Create the connected scatter plot
ft.connected_scatter(all_accuracies, all_accuracies_hue, 'model',
                    snakemake.params.colors.values(), 'Dataset',
                    ['Cross-validation', 'Training', 'Test'], 'Dataset', 'Accuracy',
                    snakemake.output.scatter)
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft
import re
import statistics as st

train_accuracy_df_paths = list(snakemake.input.training_accuracy) + list(snakemake.input.training_accuracy_log_reg_thresh)
test_accuracy_df_paths = list(snakemake.input.test_accuracy) + list(snakemake.input.test_accuracy_log_reg_thresh)

# Define function to return the average and standard deviation of accuracies of the 5 iterations per model in a respective dict each
def get_avg_stdev(dict_of_all_models_accuracies):
    avg, stdev = {}, {}
    for i in range(0, len(sorted(dict_of_all_models_accuracies.keys())), 5):  # sort to regroup in order all 5 iterations per model
        iterations_per_model = sorted(dict_of_all_models_accuracies.keys())[i:i+5]  # select the 5 iterations names per model
        accuracies_per_model = [dict_of_all_models_accuracies[iteration] for iteration in iterations_per_model]  # select corresponding accuracies of these 5 iterations
        model_name = iterations_per_model[0].split('_')[0]  # Get the model name
        avg_acc, stdev_acc = st.mean(accuracies_per_model), st.stdev(accuracies_per_model)
        avg[model_name], stdev[model_name] = avg_acc, stdev_acc
    return avg, stdev

# Get the accuracy of all models on the CV set
cv_accuracy = {}
for i, model in enumerate(snakemake.input.cv_accuracy):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    iteration = model.split('_')[-1][0:-4]
    accuracy = float(df['accuracy_cv'].values)
    cv_accuracy[f'{substring_model}_{iteration}'] = accuracy

# Get the accuracy of all models on the training set
train_accuracy = {}
for i, model in enumerate(train_accuracy_df_paths):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    iteration = model.split('_')[-1][0:-4]
    if substring_model == "log_reg":
        accuracy = float(df[f'{substring_model}_thresh_training_accuracy'].values)
    else:
        accuracy = float(df[f'{substring_model}_training_accuracy'].values)
    train_accuracy[f'{substring_model}_{iteration}'] = accuracy

# Get the accuracy of all models on the test set
test_accuracy = {}
for i, model in enumerate(test_accuracy_df_paths):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    iteration = model.split('_')[-1][0:-4]
    if substring_model == "log_reg":
        accuracy = float(df[f'{substring_model}_thresh_test_accuracy'].values)
    else:
        accuracy = float(df[f'{substring_model}_test_accuracy'].values)
    test_accuracy[f'{substring_model}_{iteration}'] = accuracy

# Get the average and standard deviation of CV, training and test sets accuracies for all models across the 5 iterations
cv_avg, cv_stdev = get_avg_stdev(cv_accuracy)
train_avg, train_stdev = get_avg_stdev(train_accuracy)
test_avg, test_stdev = get_avg_stdev(test_accuracy)


# Create the df of all accuracies
all_accuracies = pd.DataFrame.from_dict([cv_avg, train_avg, test_avg])
all_accuracies.index = ["cv", "train", "test"]
all_accuracies = all_accuracies.transpose()
all_accuracies_hue = all_accuracies.copy()
all_accuracies_hue['model'] = all_accuracies_hue.index

# Create the connected scatter plot
color_dict = snakemake.params.colors
color_dict['log'] = color_dict.pop('log_reg')  # remap the log_reg color: get_avg_stdev truncates the model name to 'log' via split('_')[0]
ft.connected_scatter_errbars2(all_accuracies, all_accuracies_hue, 'model',
                    color_dict, 'Dataset', [cv_stdev, train_stdev, test_stdev],
                    ['Tuning', 'Training', 'Test'], 'Dataset', 'Accuracy', 0.65, 1.025,
                    snakemake.output.scatter)
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft
import re
import statistics as st

# Define function to return the average and standard deviation of accuracies of the 5 iterations per model in a respective dict each
def get_avg_stdev(dict_of_all_models_accuracies):
    avg, stdev = {}, {}
    for i in range(0, len(sorted(dict_of_all_models_accuracies.keys())), 5):  # sort to regroup in order all 5 iterations per model
        iterations_per_model = sorted(dict_of_all_models_accuracies.keys())[i:i+5]  # select the 5 iterations names per model
        accuracies_per_model = [dict_of_all_models_accuracies[iteration] for iteration in iterations_per_model]  # select corresponding accuracies of these 5 iterations
        model_name = iterations_per_model[0].split('_')[0]  # Get the model name
        avg_acc, stdev_acc = st.mean(accuracies_per_model), st.stdev(accuracies_per_model)
        avg[model_name], stdev[model_name] = avg_acc, stdev_acc
    return avg, stdev

# Get the accuracy of all models on the CV set
cv_accuracy = {}
for i, model in enumerate(snakemake.input.cv_accuracy):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    iteration = model.split('_')[-1][0:-4]
    accuracy = float(df['accuracy_cv'].values)
    cv_accuracy[f'{substring_model}_{iteration}'] = accuracy

# Get the accuracy of all models on the training set
train_accuracy = {}
for i, model in enumerate(snakemake.input.training_accuracy):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    iteration = model.split('_')[-1][0:-4]
    accuracy = float(df[f'{substring_model}_training_accuracy'].values)
    train_accuracy[f'{substring_model}_{iteration}'] = accuracy


# Get the accuracy of all models on the test set
test_accuracy = {}
for i, model in enumerate(snakemake.input.test_accuracy):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    iteration = model.split('_')[-1][0:-4]
    accuracy = float(df[f'{substring_model}_test_accuracy'].values)
    test_accuracy[f'{substring_model}_{iteration}'] = accuracy

# Get the average and standard deviation of CV, training and test sets accuracies for all models across the 5 iterations
cv_avg, cv_stdev = get_avg_stdev(cv_accuracy)
train_avg, train_stdev = get_avg_stdev(train_accuracy)
test_avg, test_stdev = get_avg_stdev(test_accuracy)
print(cv_avg)
print(cv_stdev)
print(train_avg)
print(train_stdev)
print(test_avg)
print(test_stdev)

# Create the df of all accuracies
all_accuracies = pd.DataFrame.from_dict([cv_avg, train_avg, test_avg])
all_accuracies.index = ["cv", "train", "test"]
all_accuracies = all_accuracies.transpose()
all_accuracies_hue = all_accuracies.copy()
all_accuracies_hue['model'] = all_accuracies_hue.index
print(all_accuracies_hue)
print(all_accuracies)
# Create the connected scatter plot
color_dict = snakemake.params.colors
color_dict['log'] = color_dict.pop('log_reg')  # remap the log_reg color: get_avg_stdev truncates the model name to 'log' via split('_')[0]
ft.connected_scatter_errbars(all_accuracies, all_accuracies_hue, 'model',
                    color_dict, 'Dataset', [cv_stdev, train_stdev, test_stdev],
                    ['Cross-\nvalidation', 'Training', 'Test'], 'Dataset', 'Accuracy',
                    snakemake.output.scatter)
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft
import re
import statistics as st

train_accuracy_df_paths = list(snakemake.input.training_accuracy) + list(snakemake.input.training_accuracy_log_reg_thresh)
test_accuracy_df_paths = list(snakemake.input.test_accuracy) + list(snakemake.input.test_accuracy_log_reg_thresh)

# Define function to return the average and standard deviation of accuracies of the 5 iterations per model in a respective dict each
def get_avg_stdev(dict_of_all_models_accuracies):
    avg, stdev = {}, {}
    for i in range(0, len(sorted(dict_of_all_models_accuracies.keys())), 5):  # sort to regroup in order all 5 iterations per model
        iterations_per_model = sorted(dict_of_all_models_accuracies.keys())[i:i+5]  # select the 5 iterations names per model
        accuracies_per_model = [dict_of_all_models_accuracies[iteration] for iteration in iterations_per_model]  # select corresponding accuracies of these 5 iterations
        model_name = iterations_per_model[0].split('_')[0]  # Get the model name
        avg_acc, stdev_acc = st.mean(accuracies_per_model), st.stdev(accuracies_per_model)
        avg[model_name], stdev[model_name] = avg_acc, stdev_acc
    return avg, stdev

# Get the accuracy of all models on the CV set
cv_accuracy = {}
for i, model in enumerate(snakemake.input.cv_accuracy):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    iteration = model.split('_')[-1][0:-4]
    accuracy = float(df['accuracy_cv'].values)
    cv_accuracy[f'{substring_model}_{iteration}'] = accuracy

# Get the accuracy of all models on the training set
train_accuracy = {}
for i, model in enumerate(train_accuracy_df_paths):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    iteration = model.split('_')[-1][0:-4]
    if substring_model == "log_reg":
        accuracy = float(df[f'{substring_model}_thresh_training_accuracy'].values)
    else:
        accuracy = float(df[f'{substring_model}_training_accuracy'].values)
    train_accuracy[f'{substring_model}_{iteration}'] = accuracy

# Get the accuracy of all models on the test set
test_accuracy = {}
for i, model in enumerate(test_accuracy_df_paths):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    iteration = model.split('_')[-1][0:-4]
    if substring_model == "log_reg":
        accuracy = float(df[f'{substring_model}_thresh_test_accuracy'].values)
    else:
        accuracy = float(df[f'{substring_model}_test_accuracy'].values)
    test_accuracy[f'{substring_model}_{iteration}'] = accuracy

# Get the average and standard deviation of CV, training and test sets accuracies for all models across the 5 iterations
cv_avg, cv_stdev = get_avg_stdev(cv_accuracy)
train_avg, train_stdev = get_avg_stdev(train_accuracy)
test_avg, test_stdev = get_avg_stdev(test_accuracy)


# Create the df of all accuracies
all_accuracies = pd.DataFrame.from_dict([cv_avg, train_avg, test_avg])
all_accuracies.index = ["cv", "train", "test"]
all_accuracies = all_accuracies.transpose()
all_accuracies_hue = all_accuracies.copy()
all_accuracies_hue['model'] = all_accuracies_hue.index

# Create the connected scatter plot
color_dict = snakemake.params.colors
color_dict['log'] = color_dict.pop('log_reg')  # remap the log_reg color: get_avg_stdev truncates the model name to 'log' via split('_')[0]
ft.connected_scatter_errbars(all_accuracies, all_accuracies_hue, 'model',
                    color_dict, 'Dataset', [cv_stdev, train_stdev, test_stdev],
                    ['Tuning', 'Training', 'Test'], 'Dataset', 'Accuracy',
                    snakemake.output.scatter)
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft
import re

train_accuracy_df_paths = list(snakemake.input.training_accuracy) + [snakemake.input.training_accuracy_log_reg_thresh[0]]
test_accuracy_df_paths = list(snakemake.input.test_accuracy) + [snakemake.input.test_accuracy_log_reg_thresh[0]]

# Get the accuracy of all models on the CV set
cv_accuracy = {}
for i, model in enumerate(snakemake.input.cv_accuracy):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    accuracy = float(df['accuracy_cv'].values)
    cv_accuracy[substring_model] = accuracy

# Get the accuracy of all models on the training set
train_accuracy = {}
for i, model in enumerate(train_accuracy_df_paths):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    if substring_model == 'log_reg':
        accuracy = float(df[f'{substring_model}_thresh_training_accuracy'].values)
    else:
        accuracy = float(df[f'{substring_model}_training_accuracy'].values)
    train_accuracy[substring_model] = accuracy

# Get the accuracy of all models on the test set
test_accuracy = {}
for i, model in enumerate(test_accuracy_df_paths):
    df = pd.read_csv(model, sep='\t')
    substring_model = re.findall("(log_reg|svc|rf|gbm|knn)", model)[0]
    if substring_model == 'log_reg':
        accuracy = float(df[f'{substring_model}_thresh_test_accuracy'].values)
    else:
        accuracy = float(df[f'{substring_model}_test_accuracy'].values)
    test_accuracy[substring_model] = accuracy


# Create the df of all accuracies
all_accuracies = pd.DataFrame.from_dict([cv_accuracy, train_accuracy, test_accuracy])
all_accuracies.index = ["cv", "train", "test"]
all_accuracies = all_accuracies.transpose()
all_accuracies_hue = all_accuracies.copy()
all_accuracies_hue['model'] = all_accuracies_hue.index

# Create the connected scatter plot
ft.connected_scatter(all_accuracies, all_accuracies_hue, 'model',
                    snakemake.params.colors.values(), 'Dataset',
                    ['Cross-validation', 'Training', 'Test'], 'Dataset', 'Accuracy',
                    snakemake.output.scatter)
import functions as ft
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import collections as coll

df = pd.read_csv(snakemake.input.df, sep='\t')
df = df[['abundance_cutoff_2', 'sno_type', 'host_biotype2']]
exp, not_exp = df[df['abundance_cutoff_2'] == 'expressed'], df[df['abundance_cutoff_2'] == 'not_expressed']
df_len = len(df)

snotype_dict_exp, snotype_dict_not_exp = {k:v for k,v in sorted(dict(coll.Counter(exp.sno_type)).items())}, {k:v for k,v in sorted(dict(coll.Counter(not_exp.sno_type)).items())}
host_biotype_dict_exp, host_biotype_dict_not_exp = {k:v for k,v in sorted(dict(coll.Counter(exp.host_biotype2)).items())}, {k:v for k,v in sorted(dict(coll.Counter(not_exp.host_biotype2)).items())}

# Get values (and as %) of each hue per expression status
sno_type_exp_nb, sno_type_not_exp_nb = list(snotype_dict_exp.values()), list(snotype_dict_not_exp.values())
host_biotype_exp_nb, host_biotype_not_exp_nb = list(host_biotype_dict_exp.values()), list(host_biotype_dict_not_exp.values())

def get_percent_df(l1, l2, index, cols):
    percent1 = [i * 100 / sum(l1) for i in l1]
    percent2 = [i * 100 / sum(l2) for i in l2]
    df = pd.DataFrame([percent1, percent2], index = index, columns = cols)
    print(df)
    return df
exp_len, not_exp_len = str(len(exp)), str(len(not_exp))
ind = [f'Expressed\n({exp_len})', f'Not expressed\n({not_exp_len})']
d1 = get_percent_df(sno_type_exp_nb, sno_type_not_exp_nb, ind, list(snotype_dict_exp.keys()))
d2 = get_percent_df(host_biotype_exp_nb, host_biotype_not_exp_nb, ind, list(host_biotype_dict_exp.keys()))
df_l = [d2, d1]

# Create a grouped stacked bar chart showing the % of expressed vs not expressed snoRNAs (one bar each)
# The hue of each bar is either the snoRNA type (C/D or H/ACA) or the host gene biotype
colors = [list(snakemake.params.host_biotype_colors.values()), list(snakemake.params.sno_type_colors.values())]
ft.plot_clustered_stacked2(df_l, colors, 'Expression status of snoRNAs', 'Proportion of snoRNAs (%)', ind, snakemake.output.bar)
import pandas as pd
import functions as ft
from scipy import stats as st

tpm_df = pd.read_csv(snakemake.input.tpm_df, sep='\t')
feature_df = pd.read_csv(snakemake.input.all_features, sep='\t')

# Create average TPM column in tpm_df and merge that column to feature_df
tpm_df['avg_tpm'] = tpm_df.filter(regex='^[A-Z].*_[1-3]$', axis=1).mean(axis=1)
tpm_df = tpm_df[['gene_id', 'avg_tpm']]
feature_df = feature_df.merge(tpm_df, how='left', left_on='gene_id_sno', right_on='gene_id')
feature_df = feature_df.drop(['gene_id'], axis=1)

# Get expressed H/ACA and C/D snoRNAs
haca = feature_df[(feature_df['abundance_cutoff_2'] == "expressed") & (feature_df['sno_type'] == "H/ACA")].copy()
cd = feature_df[(feature_df['abundance_cutoff_2'] == "expressed") & (feature_df['sno_type'] == "C/D")].copy()  # .copy() avoids SettingWithCopyWarning on the .loc assignments below

# Split according to terminal stem MFE (if < (strong stem) or >= (weak) average MFE)
avg_stem_mfe_haca, avg_stem_mfe_cd = haca['terminal_stem_mfe'].mean(), cd['terminal_stem_mfe'].mean()
haca.loc[haca['terminal_stem_mfe'] < avg_stem_mfe_haca, 'terminal_stem_strength'] = 'Strong'
haca.loc[haca['terminal_stem_mfe'] >= avg_stem_mfe_haca, 'terminal_stem_strength'] = 'Weak'
cd.loc[cd['terminal_stem_mfe'] < avg_stem_mfe_cd, 'terminal_stem_strength'] = 'Strong'
cd.loc[cd['terminal_stem_mfe'] >= avg_stem_mfe_cd, 'terminal_stem_strength'] = 'Weak'
stem_haca, no_stem_haca = haca[haca['terminal_stem_strength'] == 'Strong'], haca[haca['terminal_stem_strength'] == 'Weak']
stem_cd, no_stem_cd = cd[cd['terminal_stem_strength'] == 'Strong'], cd[cd['terminal_stem_strength'] == 'Weak']

# Create violin plots to compare weak and strong terminal stem snoRNA's abundance per snoRNA type
ft.violin(haca, "terminal_stem_strength", "avg_tpm", None, None, "Terminal stem strength",
                "Average abundance across tissues (TPM)", "Abundance of H/ACA snoRNAs with strong and weak terminal stem",
                None, None, snakemake.output.violin_haca)

ft.violin(cd, "terminal_stem_strength", "avg_tpm", None, None, "Terminal stem strength",
                "Average abundance across tissues (TPM)", "Abundance of C/D snoRNAs with strong and weak terminal stem",
                None, None, snakemake.output.violin_cd)

# Compute the significance between groups of H/ACA with a strong or weak terminal stem (same with C/D)
MW_U_stats, p_val = st.mannwhitneyu(stem_haca['avg_tpm'], no_stem_haca['avg_tpm'])
print('H/ACA')
print('Mann-Whitney U statistics:', MW_U_stats, ',  p-value:', p_val)

MW_U_stats, p_val = st.mannwhitneyu(stem_cd['avg_tpm'], no_stem_cd['avg_tpm'])
print('C/D')
print('Mann-Whitney U statistics:', MW_U_stats, ',  p-value:', p_val)
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import shap
import numpy as np
import pickle
# Generate the same CV, training and test sets (only the test set will be
# used in this script) that were generated in hyperparameter_tuning_cv and train_models
# (respectively 15%, 70% and 15% of all dataset examples)
df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next the total_train set is split into train and test sets (1077 and 232 are
# the numbers of examples in the train and test sets respectively, i.e.
# approximately 70 % and 15 % of all examples)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train, test_size=232, train_size=1077, random_state=42, stratify=y_total_train)


# Split between intergenic and intronic snoRNAs
if snakemake.wildcards.hg_biotype == "intronic":
    X_test_hg_biotype = X_test[X_test['intergenic'] == 0]
elif snakemake.wildcards.hg_biotype == "intergenic":
    X_test_hg_biotype = X_test[X_test['intergenic'] == 1]
print(X_test_hg_biotype)
print(len(X_test_hg_biotype))

# Unpickle and thus instantiate the model represented by the 'models' wildcard
# Instantiate the explainer using X_train as background data and X_test to generate global SHAP values
if snakemake.wildcards.models == "log_reg":
    model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    #explainer = shap.LinearExplainer(model, X_train)  # Use the whole X_train as background (much slower than the subsampled background below)
    explainer = shap.LinearExplainer(model, shap.sample(X_train, 100, random_state=42))  # subsample the background to 100 examples
    shap_values = explainer.shap_values(X_test_hg_biotype)
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    shap.summary_plot(shap_values, X_test_hg_biotype, show=False, max_display=50)
    plt.savefig(snakemake.output.summary_plot, bbox_inches='tight', dpi=600)

else:
    model2 = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    #explainer2 = shap.KernelExplainer(model2.predict, X_train)  # Use the whole X_train as background (much slower than the subsampled background below)
    explainer2 = shap.KernelExplainer(model2.predict, shap.sample(X_train, 100, random_state=42))  # subsample the background to 100 examples
    shap_values2 = explainer2.shap_values(X_test_hg_biotype)
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    shap.summary_plot(shap_values2, X_test_hg_biotype, show=False, max_display=50)
    plt.savefig(snakemake.output.summary_plot, bbox_inches='tight', dpi=600)
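
The background subsampling above (shap.sample(X_train, 100, ...)) keeps the explainer tractable, since KernelExplainer's cost grows with the size of the background set. A minimal, self-contained sketch of the same pattern on toy data (assumes shap and scikit-learn are installed; all names and values here are illustrative only):

import numpy as np
import pandas as pd
import shap
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X_train = pd.DataFrame(rng.normal(size=(500, 3)), columns=['f1', 'f2', 'f3'])
y_train = (X_train['f1'] + rng.normal(scale=0.5, size=500) > 0).astype(int)
X_test = pd.DataFrame(rng.normal(size=(20, 3)), columns=['f1', 'f2', 'f3'])

model = LogisticRegression().fit(X_train, y_train)
background = shap.sample(X_train, 100, random_state=42)   # 100 background rows instead of 500
explainer = shap.LinearExplainer(model, background)
shap_values = explainer.shap_values(X_test)
print(np.shape(shap_values))
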
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
import matplotlib.pyplot as plt
import shap
import numpy as np
import pickle  # needed below to load the pickled trained model

X_train = pd.read_csv(snakemake.input.X_train, sep='\t', index_col='gene_id_sno')
X_test = pd.read_csv(snakemake.input.X_test, sep='\t', index_col='gene_id_sno')
feature_df = pd.read_csv(snakemake.input.df, sep='\t')
intronic_sno = feature_df[feature_df['host_biotype2'] != 'intergenic']['gene_id_sno'].to_list()
intergenic_sno = feature_df[feature_df['host_biotype2'] == 'intergenic']['gene_id_sno'].to_list()

# Split between intergenic and intronic snoRNAs
if snakemake.wildcards.hg_biotype == "intronic":
    X_test_hg_biotype = X_test[X_test.index.isin(intronic_sno)]
elif snakemake.wildcards.hg_biotype == "intergenic":
    X_test_hg_biotype = X_test[X_test.index.isin(intergenic_sno)]
print(X_test_hg_biotype)
print(len(X_test_hg_biotype))

# Unpickle the trained model corresponding to the 'models2' wildcard
# Instantiate the explainer using (a subsample of) X_train as background data and X_test to generate global SHAP values
if snakemake.wildcards.models2 == "log_reg":
    model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    #explainer = shap.LinearExplainer(model, X_train)  # Use whole X_train as background (quite longer than using the line below with subsampled background)
    explainer = shap.LinearExplainer(model, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
    shap_values = explainer.shap_values(X_test_hg_biotype)
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    shap.summary_plot(shap_values, X_test_hg_biotype, show=False, max_display=50)
    plt.savefig(snakemake.output.summary_plot, bbox_inches='tight', dpi=600)

else:
    model2 = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    #explainer2 = shap.KernelExplainer(model2.predict, X_train)  # Use whole X_train as background (quite longer than using the line below with subsampled background)
    explainer2 = shap.KernelExplainer(model2.predict, shap.sample(X_train, 100, random_state=42)) # reduce number of background sample to 100
    shap_values2 = explainer2.shap_values(X_test_hg_biotype)
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    shap.summary_plot(shap_values2, X_test_hg_biotype, show=False, max_display=50)
    plt.savefig(snakemake.output.summary_plot, bbox_inches='tight', dpi=600)
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import shap
import numpy as np
import pickle  # needed below to load the pickled trained model
# Generate the same CV, training and test sets (only the test set will be
# used in this script) that were generated in hyperparameter_tuning_cv and train_models
# (respectively 15%, 70% and 15% of all dataset examples)
df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next the total_train is split into train and test sets (1077 and 232 correspond
# to the number of examples in train and test sets respectively to get an
# approximately 70 % and 15 % of all examples in these two datasets)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train, test_size=232, train_size=1077, random_state=42, stratify=y_total_train)


# Unpickle the trained model corresponding to the 'models' wildcard
# Instantiate the explainer using (a subsample of) X_train as background data and X_test to generate global SHAP values
if snakemake.wildcards.models == "log_reg":
    model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    #explainer = shap.LinearExplainer(model, X_train)  # Use whole X_train as background (quite longer than using the line below with subsampled background)
    explainer = shap.LinearExplainer(model, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
    shap_values = explainer.shap_values(X_test)
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    shap.summary_plot(shap_values, X_test, show=False, max_display=50)
    plt.savefig(snakemake.output.summary_plot, bbox_inches='tight', dpi=600)

else:
    model2 = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    #explainer2 = shap.KernelExplainer(model2.predict, X_train)  # Use whole X_train as background (quite longer than using the line below with subsampled background)
    explainer2 = shap.KernelExplainer(model2.predict, shap.sample(X_train, 100, random_state=42)) # reduce number of background sample to 100
    shap_values2 = explainer2.shap_values(X_test)
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    shap.summary_plot(shap_values2, X_test, show=False, max_display=50)
    plt.savefig(snakemake.output.summary_plot, bbox_inches='tight', dpi=600)
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import shap
import numpy as np
import pickle  # needed below to load the pickled trained model
# Generate the same CV, training and test sets (only the test set will be
# used in this script) that were generated in hyperparameter_tuning_cv and train_models
# (respectively 15%, 70% and 15% of all dataset examples)
df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next the total_train is split into train and test sets (1077 and 232 correspond
# to the number of examples in train and test sets respectively to get an
# approximately 70 % and 15 % of all examples in these two datasets)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train, test_size=232, train_size=1077, random_state=42, stratify=y_total_train)


# Split between C/D and H/ACA snoRNAs
if snakemake.wildcards.sno_type == "CD":
    X_test_snotype = X_test[X_test['C/D'] == 1]
elif snakemake.wildcards.sno_type == "HACA":
    X_test_snotype = X_test[X_test['H/ACA'] == 1]
print(X_test_snotype)
print(len(X_test_snotype))

# Unpickle the trained model corresponding to the 'models' wildcard
# Instantiate the explainer using (a subsample of) X_train as background data and X_test to generate global SHAP values
if snakemake.wildcards.models == "log_reg":
    model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    #explainer = shap.LinearExplainer(model, X_train)  # Use whole X_train as background (quite longer than using the line below with subsampled background)
    explainer = shap.LinearExplainer(model, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
    shap_values = explainer.shap_values(X_test_snotype)
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    shap.summary_plot(shap_values, X_test_snotype, show=False, max_display=50)
    plt.savefig(snakemake.output.summary_plot, bbox_inches='tight', dpi=600)

else:
    model2 = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    #explainer2 = shap.KernelExplainer(model2.predict, X_train)  # Use whole X_train as background (quite longer than using the line below with subsampled background)
    explainer2 = shap.KernelExplainer(model2.predict, shap.sample(X_train, 100, random_state=42)) # reduce number of background sample to 100
    shap_values2 = explainer2.shap_values(X_test_snotype)
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    shap.summary_plot(shap_values2, X_test_snotype, show=False, max_display=50)
    plt.savefig(snakemake.output.summary_plot, bbox_inches='tight', dpi=600)
import pandas as pd
import matplotlib.pyplot as plt
import shap
import numpy as np

X_test_paths = snakemake.input.X_test
shap_iterations_paths = snakemake.input.shap_values
feature_df = pd.read_csv(snakemake.input.df, sep='\t')
cd_sno = feature_df[feature_df['sno_type'] == 'C/D']['gene_id_sno'].to_list()
haca_sno = feature_df[feature_df['sno_type'] == 'H/ACA']['gene_id_sno'].to_list()

# Load all manual split iterations dfs and select only C/D or H/ACA (shap values and feature values)
shap_values_all_iterations, X_test_snotype_all_iterations = [], []
for i, df_path in enumerate(X_test_paths):
    X_test = pd.read_csv(X_test_paths[i], sep='\t', index_col='gene_id_sno')
    shap_iteration = pd.read_csv(shap_iterations_paths[i], sep='\t', index_col='gene_id_sno')

    # Split between C/D and H/ACA snoRNAs
    if snakemake.wildcards.sno_type == "CD":
        X_test_snotype = X_test[X_test.index.isin(cd_sno)]
        shap_iteration_sno_type = shap_iteration[shap_iteration.index.isin(cd_sno)]
    elif snakemake.wildcards.sno_type == "HACA":
        X_test_snotype = X_test[X_test.index.isin(haca_sno)]
        shap_iteration_sno_type = shap_iteration[shap_iteration.index.isin(haca_sno)]

    X_test_snotype_all_iterations.append(X_test_snotype)
    shap_values_all_iterations.append(shap_iteration_sno_type)

# Concat values of all 10 iterations in a df
final_shap_values = np.concatenate(shap_values_all_iterations, axis=0)
final_X_test_snotype = pd.concat(X_test_snotype_all_iterations)  # Concat vertically all X_test_snotype dfs to infer feature value in the summary plot

# Create summary plot
plt.rcParams['svg.fonttype'] = 'none'
fig, ax = plt.subplots(1, 1, figsize=(15, 15))
shap.summary_plot(final_shap_values, final_X_test_snotype, show=False, max_display=50)
plt.savefig(snakemake.output.summary_plot, bbox_inches='tight', dpi=600)
from warnings import simplefilter
simplefilter(action='ignore', category=UserWarning)  # ignore all user warnings
import pandas as pd
import matplotlib.pyplot as plt
import shap
import numpy as np
import pickle  # needed below to load the pickled trained model

X_train = pd.read_csv(snakemake.input.X_train, sep='\t', index_col='gene_id_sno')
X_test = pd.read_csv(snakemake.input.X_test, sep='\t', index_col='gene_id_sno')
feature_df = pd.read_csv(snakemake.input.df, sep='\t')
cd_sno = feature_df[feature_df['sno_type'] == 'C/D']['gene_id_sno'].to_list()
haca_sno = feature_df[feature_df['sno_type'] == 'H/ACA']['gene_id_sno'].to_list()

# Split between C/D and H/ACA snoRNAs
if snakemake.wildcards.sno_type == "CD":
    X_test_snotype = X_test[X_test.index.isin(cd_sno)]
elif snakemake.wildcards.sno_type == "HACA":
    X_test_snotype = X_test[X_test.index.isin(haca_sno)]
print(X_test_snotype)
print(len(X_test_snotype))

# Unpickle the trained model corresponding to the 'models2' wildcard
# Instantiate the explainer using (a subsample of) X_train as background data and X_test to generate global SHAP values
if snakemake.wildcards.models2 == "log_reg":
    model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    #explainer = shap.LinearExplainer(model, X_train)  # Use whole X_train as background (quite longer than using the line below with subsampled background)
    explainer = shap.LinearExplainer(model, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
    shap_values = explainer.shap_values(X_test_snotype)
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    shap.summary_plot(shap_values, X_test_snotype, show=False, max_display=50)
    plt.savefig(snakemake.output.summary_plot, bbox_inches='tight', dpi=600)

else:
    model2 = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))
    #explainer2 = shap.KernelExplainer(model2.predict, X_train)  # Use whole X_train as background (quite longer than using the line below with subsampled background)
    explainer2 = shap.KernelExplainer(model2.predict, shap.sample(X_train, 100, random_state=42)) # reduce number of background sample to 100
    shap_values2 = explainer2.shap_values(X_test_snotype)
    plt.rcParams['svg.fonttype'] = 'none'
    fig, ax = plt.subplots(1, 1, figsize=(15, 15))
    shap.summary_plot(shap_values2, X_test_snotype, show=False, max_display=50)
    plt.savefig(snakemake.output.summary_plot, bbox_inches='tight', dpi=600)
import pandas as pd
import collections as coll
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

host_biotype_dfs, sno_type_dfs = snakemake.input.host_biotype_df, snakemake.input.snoRNA_type_df

# Load and merge sno_type df, host_biotype df and predicted ab_status for all species
dfs = []
for i, temp_df in enumerate(snakemake.input.df):
    species_name = temp_df.split('/')[-1].split('_predicted_label')[0].replace('_', ' ').capitalize()
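    # e.g. a hypothetical input path ending in 'mus_musculus_predicted_label.tsv'
    # would yield species_name == 'Mus musculus'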
    df = pd.read_csv(temp_df, sep='\t')
    df = df[['gene_id_sno', 'predicted_label']]
    host_biotype_df = pd.read_csv(host_biotype_dfs[i], sep='\t')
    df = df.merge(host_biotype_df[['gene_id_sno', 'host_biotype']], how='left', on='gene_id_sno')
    snoRNA_type_df = pd.read_csv(sno_type_dfs[i], sep='\t')
    snoRNA_type_df = snoRNA_type_df.rename(columns={'gene_id': 'gene_id_sno'})
    df = df.merge(snoRNA_type_df[['gene_id_sno', 'snoRNA_type']], how='left', on='gene_id_sno')
    df['species_name'] = species_name
    dfs.append(df)

# Concat vertically all species dfs
concat_df = pd.concat(dfs)

# Create a simplified version of host_biotype column
simplified_biotype_dict = {'lncRNA': 'non_coding', 'protein_coding': 'protein_coding',
                            'TEC': 'non_coding', 'unitary_pseudogene': 'non_coding',
                            'unprocessed_pseudogene': 'non_coding', 'pseudogene': 'non_coding',
                            'processed_pseudogene': 'non_coding', 'polymorphic_pseudogene': 'non_coding',
                            'processed_transcript': 'non_coding', 'lincRNA': 'non_coding',
                            'antisense': 'non_coding', 'sense_intronic': 'non_coding',
                            'sense_overlapping': 'non_coding'}
concat_df['host_biotype2'] = concat_df['host_biotype'].map(simplified_biotype_dict)
concat_df['host_biotype2'] = concat_df['host_biotype2'].fillna('intergenic')

# One hot encode snoRNA_type and host_biotype2 cols
def split_cols(df, col):
    # Split a column in n one-hot-encoded columns where n is the number of
    # different possibilities in said column
    possibilities = list(pd.unique(df[col]))
    for poss in possibilities:
        df.loc[df[col] == poss, poss] = 1
        df[poss] = df[poss].fillna(0)
    df = df.drop(columns=col)
    return df
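# Minimal illustration of split_cols (hypothetical values): a 'snoRNA_type' column containing
# 'C/D' and 'H/ACA' becomes two 0/1 columns named 'C/D' and 'H/ACA', and 'snoRNA_type' is dropped.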

sno_type = split_cols(concat_df, 'snoRNA_type')
sno_type = sno_type.drop(columns=['host_biotype2', 'host_biotype'])
host = split_cols(concat_df, 'host_biotype2')

# Merge final_df
final_df = sno_type.merge(host[['gene_id_sno', 'intergenic', 'protein_coding', 'non_coding']], how='left', on='gene_id_sno')
final_df = final_df[['species_name', 'predicted_label', 'H/ACA', 'C/D', 'protein_coding', 'non_coding', 'intergenic']]

# Get the number and percentage of snoRNAs with a value of 1 in each one-hot-encoded column of df
def get_percent(df):
    cols = df.columns
    col_temp, percent_temp = [], []
    for col in cols:
        if col == 'predicted_label':
            label = f'{pd.unique(df[col])[0]}'
            col_temp.append(col)
            percent_temp.append(label)
        elif (col != 'species_name') & (col != 'predicted_label'):
            d = dict(coll.Counter(df[col]))
            if 1 not in d.keys():
                d[1] = 0
            nb = d[1]
            percent = round((nb/len(df)) * 100, 1)
            nb_percent = f'{nb} ({percent}%)'
            col_temp.append(col)
            percent_temp.append(nb_percent)
    temp_df = pd.DataFrame([percent_temp], columns=col_temp)
    return temp_df
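# e.g. (hypothetical numbers) for a group of 40 expressed snoRNAs of which 30 are C/D,
# the 'C/D' entry of the returned one-row df would be '30 (75.0%)'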

# Groupby species_name and get the summary (%) of expressed/not_expressed snoRNAs per characteristics
grouped_df = final_df.groupby(['species_name'])
dfs_final = []
for i, group in grouped_df:
    name, total_sno_nb = i, len(group)
    expressed = group[group['predicted_label'] == 'expressed']
    not_expressed = group[group['predicted_label'] == 'not_expressed']
    temp_dfs = []
    for df_ in [expressed, not_expressed]:
        ddf = get_percent(df_)
        ddf['total'] = len(df_)
        temp_dfs.append(ddf)
    concat_temp_df = pd.concat(temp_dfs)
    concat_temp_df['Species name'] = name
    dfs_final.append(concat_temp_df)


merged_final_df = pd.concat(dfs_final)
merged_final_df = merged_final_df[['Species name', 'predicted_label', 'C/D', 'H/ACA', 'protein_coding', 'non_coding', 'intergenic', 'total']]

merged_final_df.to_csv(snakemake.output.df, sep='\t', index=False)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

"""
Creates a tSNE plot (t-distributed Stochastic Neighbor Embedding).
"""

df_initial = pd.read_csv(snakemake.input.df, sep='\t')
y = df_initial['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(df_initial, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next the total_train is split into train and test sets (1077 and 232 correspond
# to the number of examples in train and test sets respectively to get an
# approximately 70 % and 15 % of all examples in these two datasets)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train,
                                    test_size=232, train_size=1077, random_state=42,
                                    stratify=y_total_train)


X = df_initial.drop(['label', 'gene_id_sno'], axis=1)
X_test_copy = X_test.copy()
X_test_copy = X_test_copy.drop(['label', 'gene_id_sno'], axis=1)


# Normalize data for tSNE
val = X.values
norm_val = StandardScaler().fit_transform(val)

val_test = X_test_copy.values
norm_val_test = StandardScaler().fit_transform(val_test)


# Create tSNE with 2 components for all snoRNAs or only those in the test set
all_sno_t_sne = TSNE(n_components=2, random_state=42).fit_transform(norm_val)
df_initial['Component_1'] = all_sno_t_sne[:, 0]
df_initial['Component_2'] = all_sno_t_sne[:, 1]
df_initial['label'] = df_initial['label'].replace([0, 1], ['not_expressed', 'expressed'])
df_initial['intergenic'] = df_initial['intergenic'].replace([0, 1], ['intronic', 'intergenic'])

test_sno_t_sne = TSNE(n_components=2, random_state=42).fit_transform(norm_val_test)
X_test['Component_1'] = test_sno_t_sne[:, 0]
X_test['Component_2'] = test_sno_t_sne[:, 1]
X_test['label'] = X_test['label'].replace([0, 1], ['not_expressed', 'expressed'])
X_test['intergenic'] = X_test['intergenic'].replace([0, 1], ['intronic', 'intergenic'])

# Create the (tSNE) scatter plot for all snoRNAs
plt.rcParams['svg.fonttype'] = 'none'
fig, ax = plt.subplots(1, 1, figsize=(15, 15))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
ax.set_xlabel('Dimension 1', fontsize=20)
ax.set_ylabel('Dimension 2', fontsize=20)

colors = list(snakemake.params.colors_dict.values())
sns.scatterplot(x='Component_1', y='Component_2', data=df_initial,
                hue=snakemake.wildcards.pca_hue, palette=colors, ax=ax)
plt.savefig(snakemake.output.t_sne_all, dpi=600)


# Create the (tSNE) scatter plot for snoRNAs in test set only
plt.rcParams['svg.fonttype'] = 'none'
fig2, ax2 = plt.subplots(1, 1, figsize=(15, 15))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
ax2.set_xlabel('Dimension 1', fontsize=20)
ax2.set_ylabel('Dimension 2', fontsize=20)

colors = list(snakemake.params.colors_dict.values())
sns.scatterplot(x='Component_1', y='Component_2', data=X_test,
                hue=snakemake.wildcards.pca_hue, palette=colors, ax=ax2)
plt.savefig(snakemake.output.t_sne_test, dpi=600)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import functions as ft
import statistics as sts

""" Generate an upset plot per confusion value (true positive, false positive,
    false negative or true negative) to see the intersection in the snoRNAs
    (mis)classified by all three models not overfitting (log_reg, svc and rf).
    The upset plot shows the average intersection (+/-stdev) across 10 iterations
    for all models."""
df_output_path = snakemake.params.df_output_path
color_dict = snakemake.params.color_dict
color = color_dict[snakemake.wildcards.confusion_value]

# Sort alphabetically all confusion matrix df (one per iteration) per model
log_reg_conf = snakemake.input.log_reg
log_reg_conf = sorted(log_reg_conf)
svc_conf = snakemake.input.svc
svc_conf = sorted(svc_conf)
rf_conf = snakemake.input.rf
rf_conf = sorted(rf_conf)

# Load confusion matrix of all models per iteration
log_reg_dfs, svc_dfs, rf_dfs = [], [], []
for i, log_reg_df in enumerate(log_reg_conf):
    log_reg = pd.read_csv(log_reg_conf[i], sep='\t')
    log_reg = log_reg[['gene_id_sno', 'confusion_matrix_val_log_reg']]
    log_reg_dfs.append(log_reg)
    svc = pd.read_csv(svc_conf[i], sep='\t')
    svc = svc[['gene_id_sno', 'confusion_matrix_val_svc']]
    svc_dfs.append(svc)
    rf = pd.read_csv(rf_conf[i], sep='\t')
    rf = rf[['gene_id_sno', 'confusion_matrix_val_rf']]
    rf_dfs.append(rf)

# Get the avg (and stdev) total number of confusion_value (ex: TN) across iterations per model
# This will be used for the horizontal bar chart within the upset plot
confusion_val_nb_per_iteration_log_reg = []
confusion_val_nb_per_iteration_svc = []
confusion_val_nb_per_iteration_rf = []
for i, log_reg_df in enumerate(log_reg_dfs):
    nb_log_reg = len(log_reg_df[log_reg_df['confusion_matrix_val_log_reg'] == snakemake.wildcards.confusion_value])
    confusion_val_nb_per_iteration_log_reg.append(nb_log_reg)
    nb_svc = len(svc_dfs[i][svc_dfs[i]['confusion_matrix_val_svc'] == snakemake.wildcards.confusion_value])
    confusion_val_nb_per_iteration_svc.append(nb_svc)
    nb_rf = len(rf_dfs[i][rf_dfs[i]['confusion_matrix_val_rf'] == snakemake.wildcards.confusion_value])
    confusion_val_nb_per_iteration_rf.append(nb_rf)

# Order from rf to log_reg (bottom to top of horizontal bar chart)
avg_nb, stdev_nb = [], []
for model in [confusion_val_nb_per_iteration_rf, confusion_val_nb_per_iteration_svc, confusion_val_nb_per_iteration_log_reg]:
    avg_, stdev_ = sts.mean(model), sts.stdev(model)
    avg_nb.append(avg_)
    stdev_nb.append(stdev_)


# This will be used for the scatter and vertical bar plot in the upset plot
# Merge all dfs within each iteration to create one df
# and get a dict per iteration to show the number of elements per upset plot
# category (e.g. TP_TP_FN: 32, i.e. TP for log_reg, TP for svc and FN for rf)
upset_dicts = []
temp_dfs = []
for i, log_reg_df in enumerate(log_reg_dfs):
    df = log_reg_dfs[i].merge(svc_dfs[i], how='left', on='gene_id_sno')
    df = df.merge(rf_dfs[i], how='left', on='gene_id_sno')
    df.to_csv(df_output_path[i], sep='\t', index=False)


    # Select rows containing at least one confusion_value wildcard (ex: TP)
    val_df = df[(df.iloc[:, 1:4] == snakemake.wildcards.confusion_value).any(axis=1)]
    unique_category = val_df.drop('gene_id_sno', axis=1)
    unique_category = unique_category.drop_duplicates(['confusion_matrix_val_log_reg',
                            'confusion_matrix_val_svc',
                            'confusion_matrix_val_rf'])
    temp_dfs.append(unique_category)  # get only unique combination of confusion value per iteration (merged_df of that iteration)

    # Get the occurrences of all possible categories containing the confusion_value wildcard (ex: TP_TP_FN, TP_TP_TP, etc.)
    groups = val_df.groupby(['confusion_matrix_val_log_reg',
                            'confusion_matrix_val_svc',
                            'confusion_matrix_val_rf'])
    d = {}
    for group in groups:  # each group is a (category_names_tuple, sub_df) pair
        upset_category_name = "_".join(group[0])
        number_per_category = len(group[1])
        d[upset_category_name] = number_per_category
    upset_dicts.append(d)

# Get the union of all upset possible categories (ex: TP_TP_FN, TP_TP_TP, etc.) across iterations
concat_temp_df = pd.concat(temp_dfs)
concat_temp_df['upset_category_name'] = concat_temp_df['confusion_matrix_val_log_reg'] + '_' + concat_temp_df['confusion_matrix_val_svc'] + '_' + concat_temp_df['confusion_matrix_val_rf']
union = list(pd.unique(concat_temp_df['upset_category_name']))

# Add 0 as value to missing keys in each dictionary in the upset_dicts list (so that all iterations have the same number of categories)
for cat in union:
    for d in upset_dicts:
        d.setdefault(cat, 0)

# Get the average and stdev for all upset categories across iterations
upset_df = pd.DataFrame(upset_dicts)
average = {}
stdev = {}
for col in upset_df.columns:
    avg, std = upset_df[col].mean(), upset_df[col].std()
    average[col], stdev[col] = avg, std

# Sort by descending order of value
sorted_average = sorted(average.items(), key=lambda x: x[1], reverse=True)
sorted_stdev = sorted(stdev.items(), key=lambda x: x[1], reverse=True)


# Convert confusion value names to model names (ex: if TP for TP_FN_TP, then it
# will output the first and third model name, but not the second) for scatter plot in the upset
def convert_vals(val_list, confusion_value):
    """ ex. of val_list --> ['TP_TP_TP', 'TP_TN_TN', etc.]
        ex. of confusion_value --> 'TP'"""
    name, values = [], []
    for val in val_list:
        log_reg, svc, rf = val.split('_')
        for i, confusion_val in enumerate([rf, svc, log_reg]):  # the order is from y=0 to y=2 on the scatter
            if confusion_val == confusion_value:
                name.append(val)
                values.append(i)  # dot at y=i on the scatter
    return name, values
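# Illustration (hypothetical input): convert_vals(['TP_TN_TP'], 'TP') returns
# (['TP_TN_TP', 'TP_TN_TP'], [0, 2]), i.e. the category appears once per matching model,
# with its y position on the scatter (0 for rf, 2 for log_reg).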

names, values = convert_vals([key[0] for key in sorted_average], snakemake.wildcards.confusion_value)


# Get minimal and maximal values (0, 1 or 2) of each upset category and their position on the x axis
ymins, ymaxs, vlines_pos = [], [], []
for name in list(set(names)):
    index_in_names = [i for i in range(len(names)) if names[i] == name]
    vals = [values[j] for j in index_in_names]

    # get position of each vertical line on the x-axis (according to sorted_average, i.e values for the bar chart sorted in descending order)
    pos = [sorted_average.index(key) for key in sorted_average if key[0] == name] # list of only one element

    # get minimal and maximal values for each vertical line on the upset plot
    mini, maxi = min(vals), max(vals)
    ymins.append(mini)
    ymaxs.append(maxi)
    vlines_pos.append(pos[0])

# Create homemade upset plot (vertical bar chart over dot plot plus a horizontal bar chart)
ft.upset_avg_3_cat(sorted_average, sorted_stdev, avg_nb, stdev_nb, names, values,
                    vlines_pos, ymins, ymaxs, 'Average intersection size\nacross iterations',
                    'Average number\nper model\nacross iterations',
                    ["RandomForest", "SupportVector", "LogisticRegression"],
                    snakemake.wildcards.confusion_value, color, snakemake.output.upset)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from upsetplot import plot as upset

""" Generate an upset plot per confusion value (true positive, false positive,
    false negative or true negative) to see the intersection of the snoRNAs
    (mis)classified by all four models."""
df_output_path = "results/tables/confusion_matrix_f1/merged_confusion_matrix.tsv"
vals = ["TP", "TN", "FP", "FN"]
val_cols = ['confusion_matrix_val_log_reg', 'confusion_matrix_val_svc',
            'confusion_matrix_val_gbm', 'confusion_matrix_val_knn']
vals.remove(snakemake.wildcards.confusion_value)

log_reg = pd.read_csv(snakemake.input.log_reg, sep='\t')
log_reg = log_reg[['gene_id_sno', 'confusion_matrix_val_log_reg']]
svc = pd.read_csv(snakemake.input.svc, sep='\t')
svc = svc[['gene_id_sno', 'confusion_matrix_val_svc']]
gbm = pd.read_csv(snakemake.input.gbm, sep='\t')
gbm = gbm[['gene_id_sno', 'confusion_matrix_val_gbm']]
knn = pd.read_csv(snakemake.input.knn, sep='\t')
knn = knn[['gene_id_sno', 'confusion_matrix_val_knn']]

# Merge all dfs and create one df per confusion_value wildcard
df = log_reg.merge(svc, how='left', left_on='gene_id_sno', right_on='gene_id_sno')
df = df.merge(gbm, how='left', left_on='gene_id_sno', right_on='gene_id_sno')
df = df.merge(knn, how='left', left_on='gene_id_sno', right_on='gene_id_sno')
df.to_csv(df_output_path, sep='\t', index=False)


# Select rows containing at least one confusion_value wildcard (ex: TP) across the four model columns
val_df = df[(df[val_cols] == snakemake.wildcards.confusion_value).any(axis=1)]

# Convert confusion_value (ex: TP) to True and all other possible wildcards values (ex: TN, FN, FP) to False
val_df = val_df.replace([snakemake.wildcards.confusion_value, "({})".format("|".join(vals))],
                        [True, False], regex=True)

# Generate Multi_index Serie from DataFrame
upset_df = val_df.copy()
upset_df['gene_id_sno'] = 1  # This ensures that we can sum over the gene_id_sno column with the following groupby
upset_df = upset_df.set_index(val_cols)  # Create multi index

a = upset_df.groupby(level=val_cols).sum()  # Groupby multi index
a = a['gene_id_sno']  # Convert the one-column dataframe into a series
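# The resulting series maps each boolean combination (one entry per model) to the number of
# snoRNAs showing that combination, e.g. (True, True, False, False) -> snoRNAs labelled with the
# chosen confusion value by log_reg and svc only (hypothetical illustration).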

# Create the upset plot
plt.rcParams['svg.fonttype'] = 'none'
upset(a, sort_by='cardinality', sort_categories_by='cardinality')
plt.savefig(snakemake.output.upset, bbox_inches='tight', dpi=600)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import shap
from upsetplot import plot as upset
import numpy as np
import re
import pickle  # needed below to load the pickled trained models
""" Generate an upset plot of the intersection of the top 5 most predictive
    features across all four models (without RF)"""

X_train = pd.read_csv(snakemake.input.X_train, sep='\t', index_col='gene_id_sno')
X_test = pd.read_csv(snakemake.input.X_test, sep='\t', index_col='gene_id_sno')

# Instantiate log_reg model and get its top 5 features
log_reg = pickle.load(open(snakemake.input.log_reg, 'rb'))
explainer_log_reg = shap.LinearExplainer(log_reg, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
shap_values_log_reg = explainer_log_reg.shap_values(X_test)
vals_log_reg = np.abs(shap_values_log_reg).mean(0)  # mean SHAP value across all examples in X_test for each feature
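# shap_values_log_reg has shape (n_test_examples, n_features), so vals_log_reg[i] is the
# mean absolute SHAP value of feature i across the test set (its global importance)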
feature_importance_log_reg = pd.DataFrame(list(zip(X_train.columns, vals_log_reg)), columns=['feature', 'feature_importance'])
feature_importance_log_reg.sort_values(by=['feature_importance'], ascending=False , inplace=True)
feature_importance_log_reg['feature_rank'] = feature_importance_log_reg.reset_index().index + 1  # Create a rank column for feature importance rank (1 to 5)
feature_importance_log_reg['model'] = 'log_reg'
log_reg_df = feature_importance_log_reg.head(n=5)

# Instantiate all other 3 models (knn, gbm and svc) and get their top 5 features
dfs = [log_reg_df]
for i, mod in enumerate(snakemake.input.other_model):
    model = pickle.load(open(mod, 'rb'))
    explainer = shap.KernelExplainer(model.predict, shap.sample(X_train, 100, random_state=42))  # reduce number of background sample to 100
    shap_values = explainer.shap_values(X_test)
    vals = np.abs(shap_values).mean(0)  # mean SHAP value across all examples in X_test for each feature
    feature_importance = pd.DataFrame(list(zip(X_train.columns, vals)), columns=['feature', 'feature_importance'])
    feature_importance.sort_values(by=['feature_importance'], ascending=False , inplace=True)
    feature_importance['feature_rank'] = feature_importance.reset_index().index + 1  # Create a rank column for feature importance rank (1 to 5)
    model_substring = re.search("results/trained_models/(.*)_trained_scale_after_split.sav", mod).group(1)  # find the model name within the pickled model name
    feature_importance['model'] = model_substring  # Create a model column (model name)
    df = feature_importance.head(n=5)
    dfs.append(df)

# Concat top feature dfs into one df
all_top_features = pd.concat(dfs)
all_top_features.to_csv(snakemake.output.top_features_df, sep='\t', index=False)
import pandas as pd
import functions_venn as ft
import matplotlib.pyplot as plt
from matplotlib_venn import venn2

# Create a Venn diagram and compute the Simple Matching Coefficient (SMC), i.e. how similar two
# binary vectors are (number of positions where both vectors agree / total number of positions)
# An SMC of 0 means totally dissimilar and an SMC of 1 totally similar
gtex_df = pd.read_csv(snakemake.input.gtex_df, sep='\t')
tgirt_df = pd.read_csv(snakemake.input.tgirt_df, sep='\t')

# Drop intergenic snoRNAs
gtex_df = gtex_df[gtex_df['intergenic'] == 0]
tgirt_df = tgirt_df[tgirt_df['intergenic'] == 0]


# Get the number of snoRNA host genes expressed in both or either TGIRT and GTEx datasets
# Get also the number of snoRNA host gene not expressed in both datasets

tgirt_only, gtex_only, both_expressed, both_not_expressed = [], [], [], []
for i, tgirt_val in enumerate(tgirt_df.host_expressed):
    gtex_val = list(gtex_df.host_expressed)[i]
    if (tgirt_val == 1) & (gtex_val == 1):
        both_expressed.append(tgirt_val)
    elif (tgirt_val == 1) & (gtex_val == 0):
        tgirt_only.append(tgirt_val)
    elif (tgirt_val == 0) & (gtex_val == 1):
        gtex_only.append(tgirt_val)
    elif (tgirt_val == 0) & (gtex_val == 0):
        both_not_expressed.append(tgirt_val)


# Compute SMC
smc = (len(both_expressed) + len(both_not_expressed)) / (len(both_expressed) + len(both_not_expressed) + len(tgirt_only) + len(gtex_only))
smc = str(smc)
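# Hypothetical illustration: with 100 host genes expressed in both datasets, 50 expressed in
# neither, 10 expressed only in TGIRT and 40 only in GTEx, SMC = (100 + 50) / 200 = 0.75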

# Create host_expressed Venn diagram
ft.venn_2([len(tgirt_only), len(gtex_only), len(both_expressed)],
            ['lightblue', 'red'], ['TGIRT', 'GTEx'],
            f'Number of snoRNA host genes expressed\nin TGIRT and GTEx datasets (SMC={smc})',
            snakemake.output.venn_host_expressed)

# Clear the figure before drawing and saving the second Venn diagram
plt.clf()

# Create host_not_expressed Venn diagram
# (the number of host genes expressed only in TGIRT equals the number not expressed only in GTEx
# and vice versa, which is why the order is switched compared to the previous Venn diagram)
ft.venn_2([len(gtex_only), len(tgirt_only), len(both_not_expressed)],
        ['lightblue', 'red'], ['TGIRT', 'GTEx'],
        f'Number of snoRNA host genes not expressed\nin TGIRT and GTEx datasets (SMC={smc})',
        snakemake.output.venn_host_not_expressed)
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft

df = pd.read_csv(snakemake.input.rank_features_df, sep='\t')
model_colors_dict = snakemake.params.model_colors


# Find the range of each distribution (max rank - min rank per feature) and its median
feature_distribution = {}
for i, group in enumerate(df.groupby('feature')['feature_rank']):
    feature_name = group[0]
    range_ = group[1].max() - group[1].min()
    median_ = group[1].median()
    feature_distribution[feature_name] = [median_, range_]

# Order violin plots by increasing median value of feature_ranks and by range as second sort if two features have the same median
feature_distribution_df = pd.DataFrame.from_dict(feature_distribution, columns = ['median', 'range'], orient='index')
ordered_violin = feature_distribution_df.sort_values(by=['median', 'range'], ascending=[True, True]).index

# Remove the iteration suffix from the model names (ex: log_reg instead of log_reg_first)
df['model'] = df['model'].str.rsplit('_', n=1, expand=True)[0]  # split once from the right and keep the left part ('log_reg', not just 'log')

# Create the violin plot
ft.violin(df, 'feature', 'feature_rank', None, 'model', 'Features', 'Predictive rank across \nmodels and iterations', '',
            ['lightgrey'], model_colors_dict, snakemake.output.violin, order=ordered_violin)
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft

df = pd.read_csv(snakemake.input.rank_features_df, sep='\t')
model_colors_dict = snakemake.params.model_colors
print(df)

# Find the range of each distribution (max rank - min rank per feature) and its median
feature_distribution = {}
for i, group in enumerate(df.groupby('feature')['feature_rank']):
    feature_name = group[0]
    range_ = group[1].max() - group[1].min()
    median_ = group[1].median()
    feature_distribution[feature_name] = [median_, range_]

# Order violin plots by increasing median value of feature_ranks and by range as second sort if two features have the same median
feature_distribution_df = pd.DataFrame.from_dict(feature_distribution, columns = ['median', 'range'], orient='index')
ordered_violin = feature_distribution_df.sort_values(by=['median', 'range'], ascending=[True, True]).index

# Remove the iteration suffix from the model names (ex: log_reg instead of log_reg_manual_first)
df['model'] = df['model'].str.rsplit('_', n=2, expand=True)[0]  # split twice from the right and keep the left part ('log_reg', not just 'log')

# Create the violin plot
ft.violin(df, 'feature', 'feature_rank', None, 'model', 'Features', 'Predictive rank across \nmodels and iterations', '',
            ['lightgrey'], model_colors_dict, snakemake.output.violin, order=ordered_violin)
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft

df = pd.read_csv(snakemake.input.rank_features_df, sep='\t')
model_colors_dict = snakemake.params.model_colors
# Order violin plots by increasing median value of feature_ranks
ordered_violin = df.groupby('feature')['feature_rank'].median().sort_values(ascending=True).index

# Create the violin plot
ft.violin(df, 'feature', 'feature_rank', None, 'model', 'Feature', 'Predictive rank \n across models', '',
            ['lightgrey'], model_colors_dict, snakemake.output.violin, order=ordered_violin)
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft

""" For each model (log_reg, svc and rf), represent a violin plot of the
    accuracies across the 10 iterations based on the predictions of the
    abundance status of mouse snoRNAs."""

accuracies_df_paths = snakemake.input.accuracies
color_dict = snakemake.params.color_dict

# Create dict containing the accuracy of all iterations per model 
acc = {}
for i, path in enumerate(accuracies_df_paths):
    model_name, iteration = path.split('/')[-1].split('.')[0].split('_test_accuracy_')
    df = pd.read_csv(path, sep='\t')
    accuracy = float(df.values[0][0])
    if model_name not in acc.keys():
        acc[model_name] = {iteration: accuracy}
    else:
        acc[model_name][iteration] = accuracy

# Create a df containing all the accuracies in the right format for the violin plot function
dfs = []
for mod_name, accuracy_dict in acc.items():
    vals = list(accuracy_dict.values())
    df = pd.DataFrame({mod_name: vals})
    df = df.rename(columns={mod_name: 'Accuracy'})
    df['Model'] = mod_name
    dfs.append(df)

final_df = pd.concat(dfs)

# Create the violin plot
ft.violin_wo_swarm(final_df, 'Model', 'Accuracy', None, 'Model', 'Accuracy of each\nmodel per iteration', '',
            color_dict, snakemake.output.violin)
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft
import numpy as np

""" Violin plot of TPM values for all confusion_value"""

confusion_value_df = pd.read_csv(snakemake.input.sno_per_confusion_value[0], sep='\t')


fn = confusion_value_df[confusion_value_df['confusion_matrix_val_log_reg_thresh'] == 'FN']
tp = confusion_value_df[confusion_value_df['confusion_matrix_val_log_reg_thresh'] == 'TP']
tn = confusion_value_df[confusion_value_df['confusion_matrix_val_log_reg_thresh'] == 'TN']
fp = confusion_value_df[confusion_value_df['confusion_matrix_val_log_reg_thresh'] == 'FP']



tpm_df = pd.read_csv(snakemake.input.tpm_df, sep='\t')
color_dict = snakemake.params.color_dict

# Create avg_tpm and log2_avg_tpm columns in tpm_df
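# The regex below keeps only replicate abundance columns whose names end in _1, _2 or _3
# (e.g. hypothetical column names such as 'Liver_1' or 'Brain_2') and averages them row-wise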
tpm_df['avg_tpm'] = tpm_df.filter(regex='^[A-Za-z].*_[123]$').mean(axis=1)
tpm_df['avg_tpm'] = tpm_df['avg_tpm'] + 0.0001  # add pseudocount to be able to compute log afterwards
tpm_df['gene_id_sno'] = tpm_df['gene_id']
tpm_df['log2_avg_tpm'] = np.log2(tpm_df['avg_tpm'])
tpm_df = tpm_df[['gene_id_sno', 'log2_avg_tpm', 'avg_tpm']]

# Merge each confusion value df to tpm_df
fn = fn.merge(tpm_df, how='left', on='gene_id_sno')
tp = tp.merge(tpm_df, how='left', on='gene_id_sno')
tn = tn.merge(tpm_df, how='left', on='gene_id_sno')
fp = fp.merge(tpm_df, how='left', on='gene_id_sno')

# Create the violin plot
concat_df = pd.concat([tn, tp, fn, fp])

ft.violin_wo_swarm(concat_df, 'confusion_matrix_val_log_reg_thresh', 'log2_avg_tpm', None, 'Confusion value', 'log2(average TPM)', '',
            color_dict, snakemake.output.violin)
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft
import numpy as np

""" Violin plot of TPM values for all confusion_value"""

confusion_value_df = pd.read_csv(snakemake.input.sno_per_confusion_value, sep='\t')


fn = confusion_value_df[confusion_value_df['consensus_confusion_value'] == 'FN']
tp = confusion_value_df[confusion_value_df['consensus_confusion_value'] == 'TP']
tn = confusion_value_df[confusion_value_df['consensus_confusion_value'] == 'TN']
fp = confusion_value_df[confusion_value_df['consensus_confusion_value'] == 'FP']



tpm_df = pd.read_csv(snakemake.input.tpm_df, sep='\t')
color_dict = snakemake.params.color_dict

# Create avg_tpm and log2_avg_tpm columns in tpm_df
tpm_df['avg_tpm'] = tpm_df.filter(regex='^[A-Za-z].*_[123]$').mean(axis=1)
tpm_df['avg_tpm'] = tpm_df['avg_tpm'] + 0.0001  # add pseudocount to be able to compute log afterwards
tpm_df['gene_id_sno'] = tpm_df['gene_id']
tpm_df['log2_avg_tpm'] = np.log2(tpm_df['avg_tpm'])
tpm_df = tpm_df[['gene_id_sno', 'log2_avg_tpm', 'avg_tpm']]

# Merge each confusion value df to tpm_df
fn = fn.merge(tpm_df, how='left', on='gene_id_sno')
tp = tp.merge(tpm_df, how='left', on='gene_id_sno')
tn = tn.merge(tpm_df, how='left', on='gene_id_sno')
fp = fp.merge(tpm_df, how='left', on='gene_id_sno')

# Create the violin plot
concat_df = pd.concat([tn, tp, fn, fp])

ft.violin_wo_swarm(concat_df, 'consensus_confusion_value', 'log2_avg_tpm', None, 'Confusion value', 'log2(average TPM)', '',
            color_dict, snakemake.output.violin)
import pandas as pd
import matplotlib.pyplot as plt
import functions as ft
import numpy as np

""" Violin plot of TPM values for all real False negatives and real True Positives"""

confusion_value_df = pd.read_csv(snakemake.input.sno_per_confusion_value, sep='\t')


fn = confusion_value_df[confusion_value_df['confusion_matrix'] == 'FN']
tp = confusion_value_df[confusion_value_df['confusion_matrix'] == 'TP']
tn = confusion_value_df[confusion_value_df['confusion_matrix'] == 'TN']
fp = confusion_value_df[confusion_value_df['confusion_matrix'] == 'FP']



tpm_df = pd.read_csv(snakemake.input.tpm_df, sep='\t')
color_dict = snakemake.params.color_dict

# Create avg_tpm and log2_avg_tpm columns in tpm_df
tpm_df['avg_tpm'] = tpm_df.filter(regex='^[A-Z].*_[123]$').mean(axis=1)
tpm_df['avg_tpm'] = tpm_df['avg_tpm'] + 0.0001  # add pseudocount to be able to compute log afterwards
tpm_df['gene_id_sno'] = tpm_df['gene_id']
tpm_df['log2_avg_tpm'] = np.log2(tpm_df['avg_tpm'])
tpm_df = tpm_df[['gene_id_sno', 'log2_avg_tpm', 'avg_tpm']]

# Merge each confusion value df to tpm_df
fn = fn.merge(tpm_df, how='left', on='gene_id_sno')
tp = tp.merge(tpm_df, how='left', on='gene_id_sno')
tn = tn.merge(tpm_df, how='left', on='gene_id_sno')
fp = fp.merge(tpm_df, how='left', on='gene_id_sno')

# Create the violin plot
concat_df = pd.concat([tn, tp, fn, fp])

ft.violin_wo_swarm(concat_df, 'confusion_matrix', 'log2_avg_tpm', None, 'Confusion value', 'log2(average TPM)', '',
            color_dict, snakemake.output.violin)
import pandas as pd
import re
import regex
from functools import reduce

""" Find H and ACA boxes of each snoRNA (if they exist) and their position. The
    found boxes need to be exact (no substitution allowed) given the already not
    so precise H and ACA motifs."""
dot_bracket = snakemake.input.dot_bracket
output_haca = snakemake.output.h_aca_box_location
haca_fasta = snakemake.input.haca_fasta

# Get dot bracket for all snoRNAs inside dict
structure = {}
with open(dot_bracket, 'r') as f:
    sno_id = ''
    for line in f:
        line = line.strip('\n')
        if line.startswith('>'):
            id = line.strip('>')
            sno_id = id
        elif line.startswith(('(', ')', '.')):
            dot_bracket_structure = line.split(' ')[0]
            structure[sno_id] = dot_bracket_structure


def generate_df(dictio, motif_name):
    """ From a dictionary containing the motif and start/end per H/ACA id"""
    # Create dataframe from box_dict
    box = pd.DataFrame.from_dict(dictio, orient='index')
    box.columns = [f'{motif_name}_sequence', f'{motif_name}_start', f'{motif_name}_end']
    box = box.reset_index()
    box = box.rename(columns={"index": "gene_id"})
    return box


def find_h_box(fasta, dot_bracket_dict):
    """ Find potential Hinge region(s) (if it exists) in dot bracket of H/ACA
        snoRNAs. This region in the dot bracket is represented by ')....('. If
        an ANANNA is present, return that box; otherwise return NNNNNN (no
        mismatch allowed since H box is already not so well defined)."""
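    # The ANANNA motif is matched with the regex 'A.A..A'; e.g. a hypothetical hinge
    # subsequence 'AGAUCA' would match (A, any, A, any, any, A)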
    h_box_dict = {}
    with open(fasta, 'r') as f:
        haca_id = ''
        for line in f:
            line = line.strip('\n')
            if line.startswith('>'):
                id = line.strip('>')
                haca_id = id
            else:
                dot_bracket_temp = dot_bracket_dict[haca_id]
                if re.search(r'\)\.{6,}\(', dot_bracket_temp) is not None:  # find possible hinge regions (at least 6 unpaired nucleotides in the middle of the snoRNA)
                    hinges = re.finditer(r'\)\.{6,}\(', dot_bracket_temp)
                    h_dot_bracket = re.findall(r'\)\.{6,}\(', dot_bracket_temp)
                    start = [h.start(0) for h in hinges]
                    end = [start[i] + len(h) for i, h in enumerate(h_dot_bracket)]
                    for i, hinge in enumerate(h_dot_bracket):
                        seq = line[start[i]+1:end[i]-1]  # get hinge region unpaired nucleotides (only the '.', not the surrounding ')' or '(')
                        if re.search('A.A..A', seq) is not None:  # find the first (closest to 5') exact H box within possible hinge regions
                            h_motifs = re.findall('A.A..A', seq)
                            substart = seq.index(h_motifs[0])  # start of H box within the extracted hinge region
                            h_start = start[i] + 2 + substart  # +2 because +1 to be the first unpaired nt in the hinge region and +1 to be 1-based
                            h_box_dict[haca_id] = {'h_motif': h_motifs[0], 'h_start': h_start, 'h_end': h_start + 5}  # +5 nt after the first a in H box
                            break
                    if haca_id not in h_box_dict.keys():  # if no H box was found within the found hinge region(s)
                        h_motif, h_start, h_end = 'NNNNNN', 0, 0
                        h_box_dict[haca_id] = {'h_motif': h_motif, 'h_start': h_start, 'h_end': h_end}

                else:  # if no unpaired hinge region was found, return NNNNNN H box and 0 as start and end
                    h_motif, h_start, h_end = 'NNNNNN', 0, 0
                    h_box_dict[haca_id] = {'h_motif': h_motif, 'h_start': h_start, 'h_end': h_end}

    h_box_df = generate_df(h_box_dict, 'H')
    return h_box_df


def find_aca(fasta):
    """ Find the most downstream ACA motif in the last 10 nt of H/ACA box
        snoRNAs."""
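    # e.g. for a hypothetical snoRNA whose last 10 nt are 'UGGACAUACA', both ACA matches are
    # found and the one closest to the 3' end (the terminal ACA) is kept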
    aca_box_dict = {}
    with open(fasta, 'r') as f:
        haca_id = ''
        for line in f:
            line = line.strip('\n')
            if line.startswith('>'):
                id = line.strip('>')
                haca_id = id
            else:
                last_10 = line[-10:]
                length_seq = len(line)
                if re.search('ACA', last_10) is not None:  # find exact ACA box
                    *_, last_possible_aca = re.finditer('ACA', last_10)
                    aca_motif = last_possible_aca.group(0)  # if multiple exact ACA boxes found, keep the ACA box closest to 3' end
                    aca_start = (length_seq - 10) + last_possible_aca.start() + 1  # 1-based position
                    aca_end = (length_seq - 10) + last_possible_aca.end()  # 1-based
                    aca_box_dict[haca_id] = {'aca_motif': aca_motif, 'aca_start': aca_start, 'aca_end': aca_end}
                else:  # if no ACA is found
                    aca_box_dict[haca_id] = {'aca_motif': 'NNN', 'aca_start': 0, 'aca_end': 0}

    aca_box_df = generate_df(aca_box_dict, 'ACA')
    return aca_box_df


def find_all_boxes(fasta, dot_bracket_dict, path):
    """ Find H and ACA boxes in given fasta using find_h_box and find_aca and
        concat resulting dfs horizontally."""
    df_h = find_h_box(fasta, dot_bracket_dict)
    df_aca = find_aca(fasta)

    df_final = reduce(lambda  left,right: pd.merge(left,right,on=['gene_id'],
                                            how='outer'),
                                            [df_h, df_aca])
    df_final.to_csv(path, index=False, sep='\t')



find_all_boxes(haca_fasta, structure, output_haca)
import pandas as pd
import re
import regex
from functools import reduce

""" Find H and ACA boxes of each snoRNA (if they exist) and their position. The
    found boxes need to be exact (no substitution allowed) given the already not
    so precise H and ACA motifs."""
dot_bracket = snakemake.input.dot_bracket
expressed_haca = snakemake.input.expressed_haca
not_expressed_haca = snakemake.input.not_expressed_haca
output_expressed_haca = snakemake.output.h_aca_box_location_expressed
output_not_expressed_haca = snakemake.output.h_aca_box_location_not_expressed

# Get dot bracket for all snoRNAs inside dict
structure = {}
with open(dot_bracket, 'r') as f:
    sno_id = ''
    for line in f:
        line = line.strip('\n')
        if line.startswith('>'):
            id = line.strip('>')
            sno_id = id
        elif line.startswith(('(', ')', '.')):
            dot_bracket_structure = line.split(' ')[0]
            structure[sno_id] = dot_bracket_structure


def generate_df(dictio, motif_name):
    """ From a dictionary containing the motif and start/end per H/ACA id"""
    # Create dataframe from box_dict
    box = pd.DataFrame.from_dict(dictio, orient='index')
    box.columns = [f'{motif_name}_sequence', f'{motif_name}_start', f'{motif_name}_end']
    box = box.reset_index()
    box = box.rename(columns={"index": "gene_id"})
    return box


def find_h_box(fasta, dot_bracket_dict):
    """ Find potential Hinge region(s) (if it exists) in dot bracket of H/ACA
        snoRNAs. This region in the dot bracket is represented by ')....('. If
        an ANANNA is present, return that box; otherwise return NNNNNN (no
        mismatch allowed since H box is already not so well defined)."""
    h_box_dict = {}
    with open(fasta, 'r') as f:
        haca_id = ''
        for line in f:
            line = line.strip('\n')
            if line.startswith('>'):
                id = line.strip('>')
                haca_id = id
            else:
                dot_bracket_temp = dot_bracket_dict[haca_id]
                if re.search(r'\)\.{6,}\(', dot_bracket_temp) is not None:  # find possible hinge regions (at least 6 unpaired nucleotides in the middle of the snoRNA)
                    hinges = re.finditer(r'\)\.{6,}\(', dot_bracket_temp)
                    h_dot_bracket = re.findall(r'\)\.{6,}\(', dot_bracket_temp)
                    start = [h.start(0) for h in hinges]
                    end = [start[i] + len(h) for i, h in enumerate(h_dot_bracket)]
                    for i, hinge in enumerate(h_dot_bracket):
                        seq = line[start[i]+1:end[i]-1]  # get hinge region unpaired nucleotides (only the '.', not the surrounding ')' or '(')
                        if re.search('A.A..A', seq) is not None:  # find the first (closest to 5') exact H box within possible hinge regions
                            h_motifs = re.findall('A.A..A', seq)
                            substart = seq.index(h_motifs[0])  # start of H box within the extracted hinge region
                            h_start = start[i] + 2 + substart  # +2 because +1 to be the first unpaired nt in the hinge region and +1 to be 1-based
                            h_box_dict[haca_id] = {'h_motif': h_motifs[0], 'h_start': h_start, 'h_end': h_start + 5}  # +5 nt after the first a in H box
                            break
                    if haca_id not in h_box_dict.keys():  # if no H box was found within the found hinge region(s)
                        h_motif, h_start, h_end = 'NNNNNN', 0, 0
                        h_box_dict[haca_id] = {'h_motif': h_motif, 'h_start': h_start, 'h_end': h_end}

                else:  # if no unpaired hinge region was found, return NNNNNN H box and 0 as start and end
                    h_motif, h_start, h_end = 'NNNNNN', 0, 0
                    h_box_dict[haca_id] = {'h_motif': h_motif, 'h_start': h_start, 'h_end': h_end}

    h_box_df = generate_df(h_box_dict, 'H')
    return h_box_df




def find_aca(fasta):
    """ Find the most downstream ACA motif in the last 10 nt of H/ACA box
        snoRNAs."""
    aca_box_dict = {}
    with open(fasta, 'r') as f:
        haca_id = ''
        for line in f:
            line = line.strip('\n')
            if line.startswith('>'):
                id = line.strip('>')
                haca_id = id
            else:
                last_10 = line[-10:]
                length_seq = len(line)
                if re.search('ACA', last_10) is not None:  # find exact ACA box
                    *_, last_possible_aca = re.finditer('ACA', last_10)
                    aca_motif = last_possible_aca.group(0)  # if multiple exact ACA boxes found, keep the ACA box closest to 3' end
                    aca_start = (length_seq - 10) + last_possible_aca.start() + 1  # 1-based position
                    aca_end = (length_seq - 10) + last_possible_aca.end()  # 1-based
                    aca_box_dict[haca_id] = {'aca_motif': aca_motif, 'aca_start': aca_start, 'aca_end': aca_end}
                else:  # if no ACA is found
                    aca_box_dict[haca_id] = {'aca_motif': 'NNN', 'aca_start': 0, 'aca_end': 0}

    aca_box_df = generate_df(aca_box_dict, 'ACA')
    return aca_box_df


def find_all_boxes(fasta, dot_bracket_dict, path):
    """ Find H and ACA boxes in given fasta using find_h_box and find_aca and
        concat resulting dfs horizontally."""
    df_h = find_h_box(fasta, dot_bracket_dict)
    df_aca = find_aca(fasta)

    df_final = reduce(lambda  left,right: pd.merge(left,right,on=['gene_id'],
                                            how='outer'),
                                            [df_h, df_aca])
    df_final.to_csv(path, index=False, sep='\t')



find_all_boxes(expressed_haca, structure, output_expressed_haca)
find_all_boxes(not_expressed_haca, structure, output_not_expressed_haca)
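As a quick standalone illustration of the hinge-region search above (toy sequence and dot bracket, hypothetical and not taken from the real data), the pattern behaves as follows:

import re

# Toy H/ACA-like example: the single ')......(' stretch is the kind of unpaired
# hinge region searched for in find_h_box
seq = "GGGCCAUUGGCAGAUCAGCC"
dot = "((((......)......((("

hinge = re.search(r'\)\.{6,}\(', dot)
hinge_seq = seq[hinge.start() + 1:hinge.end() - 1]  # keep only the unpaired nucleotides
print(hinge_seq)                        # AGAUCA
print(re.findall('A.A..A', hinge_seq))  # ['AGAUCA'] -> a potential H box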
import pandas as pd

c_d_box = pd.read_csv(snakemake.input.c_d_box_location, sep='\t')
h_aca_box = pd.read_csv(snakemake.input.h_aca_box_location, sep='\t')


def cols_to_dict(df, col1, col2):
    """ Convert two columns of df into dictionary (col1 as keys and col2 as values). """
    df = df[[col1, col2]]
    df = df.set_index(col1)
    dictio = df.to_dict('index')
    return dictio



def hamming(found_motif, consensus_motif):
    """ Find the hamming distance of a found motif compared to the consensus
        motif. """
    hamming = 0
    for i, char in enumerate(found_motif):
        if char != consensus_motif[i]:
            hamming += 1
    return hamming


def convert_h_box(h_motif):
    """ Convert H motif for hamming distance purposes. Convert NNNNNN into ZZZZZZ 
        (so that it is counted as totally different from ANANNA), and convert the 
        2nd, 4th and 5th nucleotide into a N if a H motif was found (so the 3 N 
        in the found motif are counted as matching with the consensus motif)"""
    if h_motif == 'NNNNNN':
        h_motif = 'ZZZZZZ'
    else:
        h_motif = h_motif[0] + 'N' + h_motif[2] + 'NN' + h_motif[5]
    return h_motif


# Create one dictionary per motif (where key/val are sno_id/motif_sequence)
motifs = ['D_sequence', 'C_sequence', 'D_prime_sequence', 'C_prime_sequence', 'H_sequence', 'ACA_sequence']
motif_dicts = []
for motif in motifs:
    if motif.startswith(('C','D')):
        d = cols_to_dict(c_d_box, 'gene_id', motif)
    else:
        d = cols_to_dict(h_aca_box, 'gene_id', motif)
    motif_dicts.append(d)


# Compute hamming distance for each motif
seq = {'D_sequence': 'CUGA', 'C_sequence': 'RUGAUGA', 'D_prime_sequence': 'CUGA', 
    'C_prime_sequence': 'RUGAUGA', 'H_sequence': 'ANANNA', 'ACA_sequence': 'ACA'}
hamming_dict = {}
for dictio in motif_dicts:  # iterate through every motif dictionary (one for C, one for D, one for C_prime, etc.)
    for sno_id, motif_dict in dictio.items():
        hamming_dict.setdefault(sno_id, {})  # create the entry the first time a sno_id is seen
        for k, motif in motif_dict.items():
            if k.startswith('C') and motif.startswith(('A', 'G')):  # convert first nt of C or C prime motif to R if it's A|G
                motif = 'R' + motif[1:]
            elif k.startswith('H'):  # convert H motif according to convert_h_box
                motif = convert_h_box(motif)
            # D, D prime and ACA motifs (and C/C prime motifs not starting with A|G) are compared as is
            hamming_dict[sno_id][k] = hamming(motif, seq[k])

# Create df, reorder cols, change col names and compute combined hamming distance (sum of hamming distance for all boxes of a given snoRNA)
df = pd.DataFrame.from_dict(hamming_dict, orient='index')
df = df.reset_index()
df = df.rename(columns={'index': 'gene_id'})
df = df[['gene_id', 'C_sequence', 'D_sequence', 'C_prime_sequence', 'D_prime_sequence', 'H_sequence', 'ACA_sequence']]
df.columns = ['gene_id', 'C_hamming', 'D_hamming', 'C_prime_hamming', 'D_prime_hamming', 'H_hamming', 'ACA_hamming']
df['C_D_and_prime_hamming'] = df['C_hamming'] + df['D_hamming'] + df['C_prime_hamming'] + df['D_prime_hamming'] 
df['H_ACA_hamming'] = df['H_hamming'] + df['ACA_hamming']
df['combined_box_hamming'] = df['C_D_and_prime_hamming'].fillna(df['H_ACA_hamming'])
df = df.drop(columns=['C_D_and_prime_hamming', 'H_ACA_hamming'])

df.to_csv(snakemake.output.hamming_distance_box_df, index=False, sep='\t')
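For reference, here is a small worked example of the distance computation above (toy motifs, not from the dataset); the sum over mismatching positions reproduces what the hamming() helper does:

# A perfect D box ('CUGA') has distance 0; one substitution gives distance 1
print(sum(a != b for a, b in zip('CUGA', 'CUGA')))  # 0
print(sum(a != b for a, b in zip('CUGG', 'CUGA')))  # 1

# A found H box 'AGAUUA' is first masked to keep only its fixed positions
# ('A' + 'N' + 'A' + 'NN' + 'A'), so only the three A positions are compared
# to the consensus 'ANANNA' -> distance 0
masked = 'AGAUUA'[0] + 'N' + 'AGAUUA'[2] + 'NN' + 'AGAUUA'[5]
print(masked, sum(a != b for a, b in zip(masked, 'ANANNA')))  # ANANNA 0

# A missing H box ('NNNNNN') is converted to 'ZZZZZZ' -> maximal distance of 6
print(sum(a != b for a, b in zip('ZZZZZZ', 'ANANNA')))  # 6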
import pandas as pd

""" Generate a host_function column for all snoRNAs. Protein-coding HGs are
    separated into 5 categories (ribosomal protein; ribosome biogenesis &
    translation; RNA binding, processing & splicing; Other; poorly
    characterized). Non-coding HGs are separated in 2 categories
    (functional_ncRNA; non_functional_ncRNA). For intergenic snoRNAs, they will
    have 'intergenic' as host function"""

host_gene_df = pd.read_csv(snakemake.input.host_genes)
host_gene_df = host_gene_df[['host_id', 'host_biotype']]
non_coding = pd.read_csv(snakemake.input.nc_functions, sep='\t', header=0, names=['nc_functions'])
protein_coding = pd.read_csv(snakemake.input.pc_functions, sep='\t')
protein_coding = protein_coding.drop_duplicates(subset=['host_id'])
protein_coding = protein_coding[['host_id', 'pc_host_function']]

# Merge non-coding functions to host gene df
host_gene_df = host_gene_df.merge(non_coding, how='left', left_on='host_id',
                                    right_on='nc_functions')

# Merge protein-coding functions to host gene df
host_gene_df = host_gene_df.merge(protein_coding, how='left', left_on='host_id',
                                    right_on='host_id')

# Create a host_function column
host_gene_df.loc[(host_gene_df['host_biotype'] != 'protein_coding') &
                (host_gene_df['nc_functions'].notnull()), 'host_function'] = 'functional_ncRNA'
host_gene_df.loc[(host_gene_df['host_biotype'] != 'protein_coding') &
                (host_gene_df['nc_functions'].isnull()), 'host_function'] = 'non_functional_ncRNA'

host_gene_df.loc[host_gene_df['pc_host_function'] == 'Ribosomal protein',
                'host_function'] = 'Ribosomal protein'
host_gene_df.loc[host_gene_df['pc_host_function'] == 'Ribosome biogenesis & translation',
                'host_function'] = 'Ribosome biogenesis & translation'
host_gene_df.loc[host_gene_df['pc_host_function'] == 'RNA binding, processing, splicing',
                'host_function'] = 'RNA binding, processing, splicing'
host_gene_df.loc[host_gene_df['pc_host_function'] == 'Other', 'host_function'] = 'Other'
host_gene_df.loc[host_gene_df['pc_host_function'] == 'Poorly characterized',
                'host_function'] = 'Poorly characterized'

host_gene_df = host_gene_df[['host_id', 'host_function']].drop_duplicates(subset=['host_id'])

host_gene_df.to_csv(snakemake.output.hg_function_df, index=False, sep='\t')
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

""" Tune the hyperparameters of each models (Logistic regression, Support
    vector classifier, Random Forest and Gradient boosting classifier)
    before even training them, using GridSearchCV with stratified k-fold.
    First, shuffle the dataset and split it into Cross-validation (CV) set
    and training set (respectively 15 % and 85 % of all examples). The CV
    will take place as a stratifed k-fold (3 fold) using GridSearchCV and
    will return the best hyperparameters for each tested model. Of note, the
    training set will be also split later on into training and test set
    (respectively 70 % and 15 % of all examples)."""

df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# Split dataset into CV set (15 % of all examples) and training set (85 %) and use
# the 'stratify' param to keep the same proportion of expressed vs not_expressed in
# training and CV sets. The datasets are shuffled by default
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.15, random_state=42, stratify=y)

# Instantiate the model defined by the 'models' wildcard
if snakemake.wildcards.models2 == "log_reg":
    model = LogisticRegression(random_state=42, max_iter=500)
elif snakemake.wildcards.models2 == "svc":
    model = svm.SVC(random_state=42)
elif snakemake.wildcards.models2 == "rf":
    model = RandomForestClassifier(random_state=42)
elif snakemake.wildcards.models2 == "knn":
    model = KNeighborsClassifier()
else:
    model = GradientBoostingClassifier(random_state=42)

# Configure the cross-validation strategy (StratifiedKFold where k=3)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Define search space (i.e. in which range of params the GridSearch happens) per model
space = snakemake.params.hyperparameter_space

# Execute the gridsearch per model on the CV set
search = GridSearchCV(estimator=model, param_grid=space,
                        cv=cv, scoring="accuracy")
search.fit(X_cv, y_cv)
print(snakemake.wildcards.models2)
print(search.best_score_)
print(search.best_params_)

# Return the best hyperparameters found by GridSearchCV, and the accuracy of each model
# fitted on the CV set with these hyperparameters into a dataframe
params_dict = search.best_params_
params_dict['accuracy_cv'] = search.best_score_
params_df = pd.DataFrame(params_dict, index=[0])
params_df.to_csv(snakemake.output.best_hyperparameters, sep='\t', index=False)
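The search space itself comes from the Snakefile's params (snakemake.params.hyperparameter_space). As a point of reference, GridSearchCV expects one grid per model of the following shape; the values below are hypothetical placeholders, the real grids live in the pipeline's configuration:

# Hypothetical hyperparameter grids (placeholders only; actual grids are defined in the pipeline config)
example_spaces = {
    "log_reg": {"C": [0.01, 0.1, 1, 10], "solver": ["lbfgs", "liblinear"]},
    "svc": {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"], "gamma": ["scale", "auto"]},
    "rf": {"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    "knn": {"n_neighbors": [3, 5, 11], "weights": ["uniform", "distance"]},
    "gbm": {"n_estimators": [100, 300], "learning_rate": [0.01, 0.1]},
}
# e.g. GridSearchCV(estimator=model, param_grid=example_spaces["rf"], cv=cv, scoring="accuracy")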
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

""" Tune the hyperparameters of each models (Logistic regression, Support
    vector classifier, Random Forest and Gradient boosting classifier)
    before even training them, using GridSearchCV with stratified k-fold on the
    cross-validation (X_cv) set."""

X_cv = pd.read_csv(snakemake.input.X_cv, sep='\t', index_col='gene_id_sno')
y_cv = pd.read_csv(snakemake.input.y_cv, sep='\t')


# Instantiate the model defined by the 'models' wildcard
if snakemake.wildcards.models2 == "log_reg":
    model = LogisticRegression(random_state=42, max_iter=500)
elif snakemake.wildcards.models2 == "svc":
    model = svm.SVC(random_state=42)
elif snakemake.wildcards.models2 == "rf":
    model = RandomForestClassifier(random_state=42)
elif snakemake.wildcards.models2 == "knn":
    model = KNeighborsClassifier()
else:
    model = GradientBoostingClassifier(random_state=42)

# Configure the cross-validation strategy (StratifiedKFold where k=3)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Define search space (i.e. in which range of params the GridSearch happens) per model
space = snakemake.params.hyperparameter_space

# Execute the gridsearch per model on the CV set
search = GridSearchCV(estimator=model, param_grid=space,
                        cv=cv, scoring="accuracy")
search.fit(X_cv, y_cv.values.ravel())
print(snakemake.wildcards.models2)
print(search.best_score_)
print(search.best_params_)

# Return the best hyperparameters found by GridSearchCV, and the accuracy of each model
# fitted on the CV set with these hyperparameters into a dataframe
params_dict = search.best_params_
params_dict['accuracy_cv'] = search.best_score_
params_df = pd.DataFrame(params_dict, index=[0])
params_df.to_csv(snakemake.output.best_hyperparameters, sep='\t', index=False)
import pandas as pd

""" Keep only one of the top predicting feature from the orginal
    dataset. Either conservation_score_norm,
    terminal_stem_mfe_norm, sno_mfe_norm, host_expressed, intergenic or
    dist_to_bp_norm."""

df = pd.read_csv(snakemake.input.df, sep='\t')

if snakemake.wildcards.one_feature == "all_four":
    df = df[['gene_id_sno', "conservation_score_norm", "terminal_stem_mfe_norm",
            "sno_mfe_norm", "host_expressed", 'label']]
elif snakemake.wildcards.one_feature == "all_five":
    df = df[['gene_id_sno', "conservation_score_norm", "terminal_stem_mfe_norm",
            "sno_mfe_norm", "host_expressed", "intron_number_norm", 'label']]
else:
    df = df[['gene_id_sno', snakemake.wildcards.one_feature, 'label']]

df.to_csv(snakemake.output.df_one_feature, index=False, sep='\t')
from pybedtools import BedTool

merged_beds = snakemake.input.merged_beds
snoRNA_bed = BedTool(snakemake.input.snoRNA_bed)
output_beds = snakemake.output.mapped_snoRNA_bed

# Map the enrichment value of overlapping peaks to each snoRNA (return the sum of
# all peaks overlapping a snoRNA per RBP).
for i, path in enumerate(merged_beds):
    bed = BedTool(path)
    mapped_bed = snoRNA_bed.map(b=bed, s=True, c="4", o="sum").saveas(output_beds[i])
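For a self-contained sketch of the same map operation on hypothetical intervals (assuming, as the call above does, that the 4th column of the merged beds holds the enrichment value):

from pybedtools import BedTool

# Toy snoRNA and peak intervals in BED6 format (chrom, start, end, name/score, score, strand)
sno = BedTool("chr1\t100\t200\tSNORD_toy\t0\t+\n", from_string=True)
peaks = BedTool("chr1\t90\t120\t2.5\t2.5\t+\n"
                "chr1\t150\t180\t1.5\t1.5\t+\n", from_string=True)

# Sum column 4 of the overlapping peaks onto the snoRNA interval (strand-aware)
mapped = sno.map(b=peaks, s=True, c="4", o="sum")
print(mapped)  # the appended last column is 2.5 + 1.5, the summed scores of both overlapping peaks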
from pybedtools import BedTool

unmerged_beds = snakemake.input.input_beds
unmerged_beds = [path for path in unmerged_beds if 'DKC1' not in path]
output_beds = snakemake.output.merge_beds
print(unmerged_beds)
print(output_beds)
# Merge overlapping peaks for each bed file by summing the score value of overlapping peaks (max 1 nt distance between peaks)
for i, path in enumerate(unmerged_beds):
    bed = BedTool(path)
    merged_bed = bed.merge(s=True, d=1, c="6,5,6", o="distinct,sum,distinct").saveas(output_beds[i])
import pandas as pd

""" Generate a merged dataframe of all samples output from coco cc (one df for
    the count, one for the cpm and one for the tpm) """

input_counts_paths = snakemake.input.counts
output_counts = snakemake.output.merged_counts
output_cpm = snakemake.output.merged_cpm
output_tpm = snakemake.output.merged_tpm

# Generate the count, cpm and tpm list of sample dfs
count_temp, cpm_temp, tpm_temp = [], [], []
sorted_paths = sorted(input_counts_paths)
for i, file in enumerate(sorted_paths):
    file_name = file.split('/')[-1].split('.')[0]
    df = pd.read_csv(file, sep='\t')
    if i == 0:  # Add gene_id and gene_name column in the list only one time (for the first sample here)
        gene_id = df[['gene_id', 'gene_name']]
        count_temp.append(gene_id)
        cpm_temp.append(gene_id)
        tpm_temp.append(gene_id)

    count, cpm, tpm = df[['count']], df[['cpm']], df[['tpm']]
    count.columns, cpm.columns, tpm.columns = [file_name], [file_name], [file_name]
    count_temp.append(count)
    cpm_temp.append(cpm)
    tpm_temp.append(tpm)

# Generate the count, cpm and tpm dataframes
count_df, cpm_df, tpm_df = pd.concat(count_temp, axis=1), pd.concat(cpm_temp, axis=1), pd.concat(tpm_temp, axis=1)
count_df.to_csv(output_counts, index=False, sep='\t')
cpm_df.to_csv(output_cpm, index=False, sep='\t')
tpm_df.to_csv(output_tpm, index=False, sep='\t')
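Note that the column-wise pd.concat above assumes every coco cc output lists the genes in the same row order; a toy version (hypothetical sample names and values) of the same assembly:

import pandas as pd

genes = pd.DataFrame({'gene_id': ['ENSG_A', 'ENSG_B'], 'gene_name': ['SNORD_A', 'SNORA_B']})
sample1 = pd.DataFrame({'sample1_tpm': [10.0, 0.0]})
sample2 = pd.DataFrame({'sample2_tpm': [25.0, 3.0]})

merged = pd.concat([genes, sample1, sample2], axis=1)  # one column per sample, one row per gene
print(merged)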
import pandas as pd
import numpy as np

""" Generate a merged dataframe of all snoRNA features and labels that will be
    used by the predictor. The labels are 'expressed' or 'not_expressed' using
    the column 'abundance_cutoff'. The features are: host_expressed, snoRNA 
    structure stability (Minimal Free Energy or MFE), snoRNA terminal stem 
    stability (MFE), snoRNA conservation and hamming_distance_box (per box or
    global hamming distance). """

# Labels (abundance_cutoff) and feature abundance_cutoff_host
tpm_df_labels = pd.read_csv(snakemake.input.abundance_cutoff, sep='\t')
tpm_df_labels = tpm_df_labels[['gene_id', 'gene_name', 'abundance_cutoff', 'abundance_cutoff_host']]
tpm_df_labels = tpm_df_labels.rename(columns={'gene_id': 'gene_id_sno'})

# Other features
sno_mfe = pd.read_csv(snakemake.input.sno_structure_mfe, sep='\t',
            names=['gene_id_sno', 'sno_mfe'])
terminal_stem_mfe = pd.read_csv(snakemake.input.terminal_stem_mfe, sep='\t',
                    names=['gene_id_sno', 'terminal_stem_mfe'])

hamming = pd.read_csv(snakemake.input.hamming_distance_box, sep='\t')
hamming = hamming[['gene_id', 'combined_box_hamming']]  # get combined hamming distance for all boxes in a snoRNA
hamming = hamming.rename(columns={'gene_id': 'gene_id_sno'})


# Merge iteratively all of these dataframes
df_list = [tpm_df_labels, hamming, sno_mfe, terminal_stem_mfe]

df_label = df_list[0]
temp = [df_label]
for i, df in enumerate(df_list[1:]):
    df_temp = temp[i].merge(df, how='left', on='gene_id_sno')
    temp.append(df_temp)

final_df = temp[-1]  # get the last df in temp, i.e. the final merged df

final_df.to_csv(snakemake.output.feature_df, index=False, sep='\t')
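The same iterative left merge could also be written more compactly with functools.reduce; a sketch equivalent to the loop above:

from functools import reduce

def merge_all(dfs, key='gene_id_sno'):
    """Left-merge a list of DataFrames on a shared key, in order."""
    return reduce(lambda left, right: left.merge(right, how='left', on=key), dfs)

# e.g. final_df = merge_all([tpm_df_labels, hamming, sno_mfe, terminal_stem_mfe])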
import pandas as pd
import numpy as np

""" Generate a merged dataframe of all snoRNA features and labels that will be
    used by the predictor. The labels are 'expressed' or 'not_expressed' using
    the column 'abundance_cutoff_2'. The features are: snoRNA length, type,
    target, host gene (HG) biotype and function, HG NMD susceptibility, HG
    promoter type, intron rank (from 5', 3' and relative to the total number of
    intron) and length in which the snoRNA is encoded, total number of intron of
    the snoRNA's HG, snoRNA distance to upstream and downstream exons, snoRNA
    distance to predicted branch_point, snoRNA structure stability (Minimal Free
    Energy or MFE), snoRNA terminal stem stability (MFE), snoRNA terminal
    stem length score, snoRNA conservation and hamming_distance_box (per box or
    global hamming distance). """

# Labels (abundance_cutoff and abundance_cutoff_2); feature abundance_cutoff_host
tpm_df_labels = pd.read_csv(snakemake.input.abundance_cutoff, sep='\t')
tpm_df_labels = tpm_df_labels[['gene_id', 'gene_name', 'abundance_cutoff', 'abundance_cutoff_2', 'abundance_cutoff_host']]
tpm_df_labels.columns = ['gene_id_sno', 'gene_name', 'abundance_cutoff', 'abundance_cutoff_2', 'abundance_cutoff_host']

# Other features
sno_length = pd.read_csv(snakemake.input.sno_length, sep='\t',
                names=['gene_id_sno', 'sno_length'])
snodb_nmd_di_promoters = pd.read_csv(snakemake.input.snodb_nmd_di_promoters, sep='\t')
snodb_nmd_di_promoters = snodb_nmd_di_promoters[['gene_id_sno', 'sno_type', 'sno_target',
                            'host_biotype2', 'NMD_susceptibility', 'di_promoter', 'host_function']]
sno_mfe = pd.read_csv(snakemake.input.sno_structure_mfe, sep='\t',
            names=['gene_id_sno', 'sno_mfe'])
terminal_stem_mfe = pd.read_csv(snakemake.input.terminal_stem_mfe, sep='\t',
                    names=['gene_id_sno', 'terminal_stem_mfe'])
terminal_stem_length_score = pd.read_csv(snakemake.input.terminal_stem_length_score,
                            sep='\t')
#conservation = pd.read_csv(snakemake.input.sno_conservation, sep='\t')
#conservation.columns = ['gene_id_sno', 'conservation_score']

hamming = pd.read_csv(snakemake.input.hamming_distance_box, sep='\t')
#hamming = hamming[['gene_id', 'C_hamming', 'D_hamming', 'C_prime_hamming', 'D_prime_hamming', 'H_hamming', 'ACA_hamming']]  # get hamming distance per box only (not combined)
hamming = hamming[['gene_id', 'combined_box_hamming']]  # get combined hamming distance for all boxes in a snoRNA
hamming = hamming.rename(columns={'gene_id': 'gene_id_sno'})

# Get sno location within intron (distances to branchpoint and to up/downstream exons)
location_bp = pd.read_csv(snakemake.input.location_and_branchpoint, sep='\t')
location_bp = location_bp[['gene_id_sno', 'intron_number', 'intron_length', 'exon_number_per_hg',
                        'distance_upstream_exon', 'distance_downstream_exon', 'dist_to_bp']]
# Change 'intron_number' col for 'intron_rank_5prime' (i.e. a better name) and compute actual total number of introns
# Compute also intron rank but counting from the 3' ('intron_rank_3prime') and relative_intron_rank (intron_rank_5prime / intron_number)
location_bp = location_bp.rename(columns={"intron_number": "intron_rank_5prime"})
location_bp['total_intron_number'] = location_bp['exon_number_per_hg'] - 1
location_bp['intron_rank_3prime'] = location_bp['exon_number_per_hg'] - location_bp['intron_rank_5prime']
location_bp.loc[location_bp['intron_rank_5prime'] == 0, 'intron_rank_3prime'] = 0  # patch for snoRNAs encoded (completely or with an overlap) within an exon of a HG
location_bp['relative_intron_rank'] = location_bp['intron_rank_3prime'] / location_bp['total_intron_number']
location_bp = location_bp.replace(np.inf, 0)  # set relative_intron_rank to 0 when total_intron_number equals 0
location_bp = location_bp.drop(columns=['exon_number_per_hg'])


# Merge iteratively all of these dataframes
#df_list = [tpm_df_labels, sno_length, conservation, hamming, snodb_nmd_di_promoters,
 #           location_bp, sno_mfe, terminal_stem_mfe, terminal_stem_length_score]
df_list = [tpm_df_labels, sno_length, hamming, snodb_nmd_di_promoters,
            location_bp, sno_mfe, terminal_stem_mfe, terminal_stem_length_score]

df_label = df_list[0]
temp = [df_label]
for i, df in enumerate(df_list[1:]):
    df_temp = temp[i].merge(df, how='left', on='gene_id_sno')
    temp.append(df_temp)

final_df = temp[-1]  # get the last df in temp, i.e. the final merged df

final_df.to_csv(snakemake.output.feature_df, index=False, sep='\t')
import pandas as pd
import numpy as np

""" Generate a merged dataframe of all snoRNA features that will be
    used by the predictor. The features are: host_expressed, snoRNA
    structure stability (Minimal Free Energy or MFE), snoRNA terminal stem
    stability (MFE), snoRNA combined box hamming. """

# abundance_cutoff_host df
host_cutoff_df = pd.read_csv(snakemake.input.abundance_cutoff, sep='\t')
host_cutoff_df = host_cutoff_df[['gene_id', 'gene_name','abundance_cutoff_host']]
host_cutoff_df = host_cutoff_df.rename(columns={'gene_id': 'gene_id_sno'})

# Other features
sno_mfe = pd.read_csv(snakemake.input.sno_structure_mfe, sep='\t',
            names=['gene_id_sno', 'sno_mfe'])
terminal_stem_mfe = pd.read_csv(snakemake.input.terminal_stem_mfe, sep='\t',
                    names=['gene_id_sno', 'terminal_stem_mfe'])

hamming = pd.read_csv(snakemake.input.hamming_distance_box, sep='\t')
hamming = hamming[['gene_id', 'combined_box_hamming']]  # get combined hamming distance for all boxes in a snoRNA
hamming = hamming.rename(columns={'gene_id': 'gene_id_sno'})


# Merge iteratively all of these dataframes
df_list = [host_cutoff_df, hamming, sno_mfe, terminal_stem_mfe]

temp = [df_list[0]]
for i, df in enumerate(df_list[1:]):
    df_temp = temp[i].merge(df, how='left', on='gene_id_sno')
    temp.append(df_temp)

final_df = temp[-1]  # get the last df in temp, i.e. the final merged df

# Fill NA terminal stem value with 0 (if a stem couldn't be computed because the snoRNA is at the start/end of a chr)
final_df['terminal_stem_mfe'] = final_df['terminal_stem_mfe'].fillna(0)

final_df.to_csv(snakemake.output.feature_df, index=False, sep='\t')
import pandas as pd

host_df = pd.read_csv(snakemake.input.host_df)
host_df = host_df[['sno_id', 'host_id', 'host_name']]
host_df = host_df.rename(columns={'sno_id': 'gene_id_sno'})
feature_df = pd.read_csv(snakemake.input.feature_df, sep='\t')

# Drop intergenic snoRNAs and merge to host df
feature_df = feature_df[feature_df['abundance_cutoff_host'] != 'intergenic']
feature_df =  feature_df.merge(host_df, how='left', on='gene_id_sno')

same_label, different_label = [], []
same, diff = 0, 0
for i, group in enumerate(feature_df.groupby('host_id')):
    grouped_df = group[1]
    if len(grouped_df) > 1:  # multi-HG
        if len(list(pd.unique(grouped_df['abundance_cutoff_2']))) == 1:  # same label for snoRNAs in same HG
            same_label.append(grouped_df)
            same +=1
        if len(list(pd.unique(grouped_df['abundance_cutoff_2']))) == 2:  # different label for snoRNAs in same HG
            different_label.append(grouped_df)
            diff += 1
print(f'Same label multi-HG: {same}')
print(f'Different label multi-HG: {diff}')

different_label_df = pd.concat(different_label)
different_label_df.to_csv(snakemake.output.multi_HG_different_label_snoRNAs, sep='\t')
same_label_df = pd.concat(same_label)
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

""" One-hot encode categorical features and label-encode the labels."""

df = pd.read_csv(snakemake.input.feature_df, sep='\t')

# Label-encode the label column manually
df.loc[df['abundance_cutoff_2'] == 'expressed', 'label'] = 1
df.loc[df['abundance_cutoff_2'] == 'not_expressed', 'label'] = 0
df = df.drop(columns=['gene_name', 'abundance_cutoff', 'abundance_cutoff_2'])

numerical_features_label = df.copy()
numerical_features_label = numerical_features_label.select_dtypes(include=['int64', 'float64'])

# Convert categorical features (astype 'object') into numpy array, convert the
# strings in that array into numbers with LabelEncoder and then one-hot encode this numerical array
dfs = [df[['gene_id_sno']]]
df = df.drop(columns=['gene_id_sno'])
df_cat = df.select_dtypes(include=['object'])
cols = df_cat.columns
for i, col in enumerate(cols):
    df_cat = df[[col]]

    # Convert column into numpy array
    array_cat = df_cat.values.reshape(-1, 1)  # -1 infers the length of df_cat (i.e. 1541)

    # Transform string array into numerical array
    le = LabelEncoder()
    df_cat[col+'_cat'] = le.fit_transform(df_cat[col])

    # Get the string that is linked to each numerical category created by LabelEncoder
    label_dict = dict(zip(le.classes_, le.transform(le.classes_)))
    labels = list(label_dict.keys())

    # One-hot encode the numerical array that was created
    enc = OneHotEncoder(handle_unknown='ignore')
    one_hot_array = enc.fit_transform(df_cat[[col+'_cat']]).toarray()
    enc_df = pd.DataFrame(one_hot_array, columns=labels)

    dfs.append(enc_df)

# Concat all one-hot encoded categorical columns
final_df = pd.concat(dfs, axis=1)

# Concat numerical features and label at the end and set index as sno_id
final_df = pd.concat([final_df, numerical_features_label], axis=1)
final_df = final_df.set_index('gene_id_sno')

# Remove duplicated columns to keep only one (e.g. 'intergenic', which is created 5 times when one-hot encoding host-related columns)
final_df = final_df.loc[:,~final_df.columns.duplicated()]

final_df.to_csv(snakemake.output.one_hot_encoded_df, sep='\t')
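As a small self-contained check of the LabelEncoder + OneHotEncoder step above (toy categories, not the real feature values), one categorical column is expanded as follows; pd.get_dummies would give an equivalent result:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

toy = pd.DataFrame({'host_biotype2': ['protein_coding', 'non_coding', 'intergenic', 'protein_coding']})

le = LabelEncoder()
codes = le.fit_transform(toy['host_biotype2'])            # [2, 1, 0, 2] (classes sorted alphabetically)
enc = OneHotEncoder(handle_unknown='ignore')
one_hot = enc.fit_transform(codes.reshape(-1, 1)).toarray()
one_hot_df = pd.DataFrame(one_hot, columns=le.classes_)   # columns: intergenic, non_coding, protein_coding
print(one_hot_df)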
import pandas as pd
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.utils import shuffle
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

""" Scale features based on the training sets used to train the models and
    predict the abundance status of mouse snoRNAs."""
all_iterations = ["manual_first", "manual_second", "manual_third", "manual_fourth",
                "manual_fifth", "manual_sixth", "manual_seventh", "manual_eighth",
                "manual_ninth", "manual_tenth"]
manual_iteration = snakemake.wildcards.manual_iteration
idx = all_iterations.index(manual_iteration)
random_state = snakemake.params.random_state
human_feature_df_0 = pd.read_csv(snakemake.input.human_snoRNA_feature_df, sep='\t')
feature_df = pd.read_csv(snakemake.input.feature_df, sep='\t')

# Label-encode the label column manually
feature_df.loc[feature_df['abundance_cutoff'] == 'expressed', 'label'] = 1
feature_df.loc[feature_df['abundance_cutoff'] == 'not_expressed', 'label'] = 0
feature_df = feature_df.drop(columns=['gene_name', 'abundance_cutoff'])

numerical_features_label = feature_df.copy()
numerical_features_label = numerical_features_label.select_dtypes(include=['int64', 'float64'])

# Convert categorical features (astype 'object') into numpy array, convert the
# strings in that array into numbers with LabelEncoder and then one-hot encode this numerical array
dfs = [feature_df[['gene_id_sno']]]
feature_df = feature_df.drop(columns=['gene_id_sno'])
df_cat = feature_df.select_dtypes(include=['object'])
cols = df_cat.columns
for i, col in enumerate(cols):
    df_cat = feature_df[[col]]

    # Convert column into numpy array
    array_cat = df_cat.values.reshape(-1, 1)  # -1 infers the length of df_cat

    # Transform string array into numerical array
    le = LabelEncoder()
    df_cat[col+'_cat'] = le.fit_transform(df_cat[col])

    # Get the string that is linked to each numerical category created by LabelEncoder
    label_dict = dict(zip(le.classes_, le.transform(le.classes_)))
    labels = list(label_dict.keys())

    # One-hot encode the numerical array that was created
    enc = OneHotEncoder(handle_unknown='ignore')
    one_hot_array = enc.fit_transform(df_cat[[col+'_cat']]).toarray()
    enc_df = pd.DataFrame(one_hot_array, columns=labels)

    dfs.append(enc_df)

# Concat all one-hot encoded categorical columns
final_df = pd.concat(dfs, axis=1)

# Concat numerical features and label at the end and set index as sno_id
final_df = pd.concat([final_df, numerical_features_label], axis=1)
final_df = final_df.set_index('gene_id_sno')

# Keep only relevant columns
y_mouse = final_df['label']
final_df = final_df[['sno_mfe', 'terminal_stem_mfe', 'combined_box_hamming', 'host_expressed']]



# Unpickle and thus instantiate the trained model defined by the 'models' wildcard
model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))

# Fill NaN with -5
human_feature_df_0 = human_feature_df_0.fillna(-5)
human_feature_df = human_feature_df_0.copy()
human_feature_df = shuffle(human_feature_df, random_state=random_state)

# Keep only top 4 features
X = human_feature_df.drop('label', axis=1)
X = X[['gene_id_sno', 'sno_mfe', 'terminal_stem_mfe', 'combined_box_hamming', 'host_expressed']]  # same order as in final_df that we want to predict
y = human_feature_df['label']

# Configure the cross-validation strategy (StratifiedKFold where k=10)
# This serves only to split the data into ten equal folds; no cross-validation is done at this point
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=random_state)

# Get the corresponding test set (10% of total df) for that manual iteration
i = 0
for total_train_index, test_index in kf.split(X, y):
    if i == idx:
        X_test = X.loc[test_index]
        y_test = y.loc[test_index]

        # Split total_train into CV and train sets (respectively 10 % and 80 % of the total df, i.e. 11.1 % and 88.9 % of the 1386 total_train snoRNAs)
        X_total_train = X.loc[total_train_index]
        y_total_train = y.loc[total_train_index]
        X_train, X_cv, y_train, y_cv = train_test_split(X_total_train, y_total_train,
                                    test_size=0.111, train_size=0.889, random_state=random_state,
                                    stratify=y_total_train)
        X_train = X_train.drop('gene_id_sno', axis=1)
        # Scale by substracting mean and dividing by stdev
        scaler = StandardScaler().fit(X_train)
    i+=1

# Scale final_df (all mouse snoRNAs with top4 features) by the same scaling factor used to scale the training set
scaled_feature_df = scaler.transform(final_df)

# Predict label (expressed (1) or not_expressed (0)) on mouse snoRNAs
y_pred = model.predict(scaled_feature_df)
prediction_df = pd.DataFrame(y_pred, index=final_df.index, columns=['predicted_label'])
prediction_df = prediction_df.reset_index()
prediction_df.to_csv(snakemake.output.predicted_label_df, sep='\t', index=False)

# Save also scaled_feature_df and real label as dfs
scaled_feature_df = pd.DataFrame(scaled_feature_df, index=final_df.index, columns=final_df.columns)
scaled_feature_df = scaled_feature_df.reset_index()
scaled_feature_df.to_csv(snakemake.output.scaled_feature_df, sep='\t', index=False)
y_mouse.to_csv(snakemake.output.label_df, sep='\t', index=False)
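The key point of the scaling step is that the mean and standard deviation learned on the human training fold are reused as-is on the mouse features; a minimal standalone sketch (toy values) of this behaviour:

import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [3.0], [5.0]])
scaler = StandardScaler().fit(train)           # learns mean=3 and std~1.63 from the training set only
print(scaler.transform(np.array([[3.0]])))     # [[0.]]
print(scaler.transform(np.array([[5.0]])))     # [[~1.22]] -> new data is scaled with the training-set parameters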
import pandas as pd
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_curve  # needed by threshold_from_optimal_tpr_minus_fpr below

""" Scale features based on the training sets used to train the models on human
    snoRNAs and predict the abundance status of species snoRNAs."""

random_state = snakemake.params.random_state
human_feature_df_0 = pd.read_csv(snakemake.input.human_snoRNA_feature_df, sep='\t')
feature_df = pd.read_csv(snakemake.input.feature_df, sep='\t')  # species snoRNAs
feature_df_copy = feature_df.copy()
threshold = pd.read_csv(snakemake.input.threshold[0], sep='\t')  # threshold used for log_reg decision
thresh = threshold.iloc[0, 0]

## Species dataset processing
# Drop gene_name col and select numerical feature columns
feature_df = feature_df.drop(columns='gene_name')
numerical_features_label = feature_df.copy()
numerical_features_label = numerical_features_label.select_dtypes(include=['int64', 'float64'])

# Convert categorical features (astype 'object') into numpy array, convert the
# strings in that array into numbers with LabelEncoder and then one-hot encode this numerical array
dfs = [feature_df[['gene_id_sno']]]
feature_df = feature_df.drop(columns=['gene_id_sno'])
df_cat = feature_df.select_dtypes(include=['object'])
cols = df_cat.columns
for i, col in enumerate(cols):
    df_cat = feature_df[[col]]

    # Convert column into numpy array
    array_cat = df_cat.values.reshape(-1, 1)  # -1 infers the length of df_cat

    # Transform string array into numerical array
    le = LabelEncoder()
    df_cat[col+'_cat'] = le.fit_transform(df_cat[col])

    # Get the string that is linked to each numerical category created by LabelEncoder
    label_dict = dict(zip(le.classes_, le.transform(le.classes_)))
    labels = list(label_dict.keys())

    # One-hot encode the numerical array that was created
    enc = OneHotEncoder(handle_unknown='ignore')
    one_hot_array = enc.fit_transform(df_cat[[col+'_cat']]).toarray()
    enc_df = pd.DataFrame(one_hot_array, columns=labels)

    dfs.append(enc_df)

# Concat all one-hot encoded categorical columns
final_df = pd.concat(dfs, axis=1)

# Concat numerical features and label at the end and set index as sno_id
final_df = pd.concat([final_df, numerical_features_label], axis=1)
final_df = final_df.set_index('gene_id_sno')

# Keep only relevant columns
final_df = final_df[['sno_mfe', 'terminal_stem_mfe', 'combined_box_hamming', 'host_expressed']]





## Human dataset processing
# Fill NaN with -5
human_feature_df_0 = human_feature_df_0.fillna(-5)
human_feature_df = human_feature_df_0.copy()
human_feature_df = shuffle(human_feature_df, random_state=random_state)

# Keep only top 4 features
X = human_feature_df.drop('label', axis=1)
X = X[['gene_id_sno', 'sno_mfe', 'terminal_stem_mfe', 'combined_box_hamming', 'host_expressed']]  # same order as in final_df that we want to predict
y = human_feature_df['label']


# Split the X dataset into CV and train sets (respectively 10 % and 90 % of the total df)
X_train, X_cv, y_train, y_cv = train_test_split(X, y,
                            test_size=0.1, train_size=0.9, random_state=random_state,
                            stratify=y)

# Create the scaler that will be used to mean normalize species snoRNA features
df = X_train.set_index('gene_id_sno')
scaler = StandardScaler().fit(df)



# Scale final_df (all species snoRNAs with top4 features) by the same scaling factor used to scale the training set in human snoRNAs
scaled_feature_df = scaler.transform(final_df)


# Define a new class of LogisticRegression in which we can choose the log_reg threshold used to predict
class LogisticRegressionWithThreshold(LogisticRegression):
    def predict(self, X, threshold=None):
        if threshold is None:  # if no threshold is passed in, simply call the base class predict, effectively threshold=0.5
            return LogisticRegression.predict(self, X)
        else:
            y_scores = LogisticRegression.predict_proba(self, X)[:, 1]
            y_pred_with_threshold = (y_scores >= threshold).astype(int)

            return y_pred_with_threshold

    def threshold_from_optimal_tpr_minus_fpr(self, X, y):
        # Find optimal log_reg threshold where we maximize the True positive rate (TPR) and minimize the False positive rate (FPR)
        y_scores = LogisticRegression.predict_proba(self, X)[:, 1]
        fpr, tpr, thresholds = roc_curve(y, y_scores)

        optimal_idx = np.argmax(tpr - fpr)

        return thresholds[optimal_idx], tpr[optimal_idx] - fpr[optimal_idx]


# Unpickle and thus instantiate the trained log_reg thresh model
model = pickle.load(open(snakemake.input.pickled_trained_model[0], 'rb'))

# Predict label (expressed (1) or not_expressed (0)) on species snoRNAs
y_pred = model.predict(scaled_feature_df, thresh)
prediction_df = pd.DataFrame(y_pred, index=final_df.index, columns=['predicted_label'])
prediction_df = prediction_df.reset_index()
prediction_df = prediction_df.replace(to_replace=[0, 1], value=['not_expressed', 'expressed'])

# Merge predictions to feature df and save df
merged_df = feature_df_copy.merge(prediction_df, how='left', on='gene_id_sno')
merged_df.to_csv(snakemake.output.predicted_label_df, sep='\t', index=False)

# Save also scaled_feature_df and real label as dfs
scaled_feature_df = pd.DataFrame(scaled_feature_df, index=final_df.index, columns=final_df.columns)
scaled_feature_df = scaled_feature_df.reset_index()
scaled_feature_df.to_csv(snakemake.output.scaled_feature_df, sep='\t', index=False)
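A short usage sketch of the LogisticRegressionWithThreshold class defined above (toy data and cutoff, hypothetical and for illustration only):

import numpy as np

X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegressionWithThreshold(random_state=42)
clf.fit(X_toy, y_toy)

print(clf.predict(X_toy))        # default behaviour, i.e. a 0.5 probability cutoff
print(clf.predict(X_toy, 0.9))   # stricter cutoff: a sample needs P(class 1) >= 0.9 to be called 1
print(clf.threshold_from_optimal_tpr_minus_fpr(X_toy, y_toy))  # (optimal cutoff, TPR - FPR at that cutoff)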
import pandas as pd
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.utils import shuffle
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

""" Scale features based on the training sets used to train the models and
    predict the abundance status of mouse snoRNAs."""

random_state = snakemake.params.random_state
human_feature_df_0 = pd.read_csv(snakemake.input.human_snoRNA_feature_df, sep='\t')
feature_df = pd.read_csv(snakemake.input.feature_df, sep='\t')  # mouse snoRNAs


## Mouse dataset processing
# Label-encode the label column manually
feature_df.loc[feature_df['abundance_cutoff'] == 'expressed', 'label'] = 1
feature_df.loc[feature_df['abundance_cutoff'] == 'not_expressed', 'label'] = 0
feature_df = feature_df.drop(columns=['gene_name', 'abundance_cutoff'])


# Remove ALL duplicate snoRNAs (i.e. do not keep any snoRNA that has an identical copy)
# from this test set (copies that have exactly the same sequence and terminal stem)
feature_df = feature_df.drop_duplicates(subset=['sno_mfe', 'terminal_stem_mfe',
                                                'combined_box_hamming',
                                                'abundance_cutoff_host'],
                                                keep=False)
numerical_features_label = feature_df.copy()
numerical_features_label = numerical_features_label.select_dtypes(include=['int64', 'float64']).reset_index(drop=True)

# Convert categorical features (astype 'object') into numpy array, convert the
# strings in that array into numbers with LabelEncoder and then one-hot encode this numerical array
dfs = [feature_df[['gene_id_sno']].reset_index(drop=True)]
feature_df = feature_df.drop(columns=['gene_id_sno'])
df_cat = feature_df.select_dtypes(include=['object'])
cols = df_cat.columns
for i, col in enumerate(cols):
    df_cat = feature_df[[col]]
    # Convert column into numpy array
    array_cat = df_cat.values.reshape(-1, 1)  # -1 infers the length of df_cat

    # Transform string array into numerical array
    le = LabelEncoder()
    df_cat[col+'_cat'] = le.fit_transform(df_cat[col])

    # Get the string that is linked to each numerical category created by LabelEncoder
    label_dict = dict(zip(le.classes_, le.transform(le.classes_)))
    labels = list(label_dict.keys())

    # One-hot encode the numerical array that was created
    enc = OneHotEncoder(handle_unknown='ignore')
    one_hot_array = enc.fit_transform(df_cat[[col+'_cat']]).toarray()
    enc_df = pd.DataFrame(one_hot_array, columns=labels)

    dfs.append(enc_df)

# Concat all one-hot encoded categorical columns
final_df = pd.concat(dfs, axis=1)

# Concat numerical features and label at the end and set index as sno_id
final_df = pd.concat([final_df, numerical_features_label], axis=1)
final_df = final_df.set_index('gene_id_sno')

# Keep only relevant columns
y_mouse = final_df['label']
final_df = final_df[['sno_mfe', 'terminal_stem_mfe', 'combined_box_hamming', 'host_expressed']]


## Human dataset processing
# Fill NaN with -5
human_feature_df_0 = human_feature_df_0.fillna(-5)
human_feature_df = human_feature_df_0.copy()
human_feature_df = shuffle(human_feature_df, random_state=random_state)

# Keep only top 4 features
X = human_feature_df.drop('label', axis=1)
X = X[['gene_id_sno', 'sno_mfe', 'terminal_stem_mfe', 'combined_box_hamming', 'host_expressed']]  # same order as in final_df that we want to predict
y = human_feature_df['label']


# Split the X dataset into CV and train sets (respectively 10 % and 90 % of the total df)
X_train, X_cv, y_train, y_cv = train_test_split(X, y,
                            test_size=0.1, train_size=0.9, random_state=random_state,
                            stratify=y)

# Create the scaler that will be used to mean normalize mouse snoRNA features
df = X_train.set_index('gene_id_sno')
scaler = StandardScaler().fit(df)



# Scale final_df (all mouse snoRNAs with top4 features) by the same scaling factor used to scale the training set in human snoRNAs
scaled_feature_df = scaler.transform(final_df)

# Unpickle and thus instantiate the trained model defined by the 'models' wildcard
model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))

# Predict label (expressed (1) or not_expressed (0)) on mouse snoRNAs
y_pred = model.predict(scaled_feature_df)
prediction_df = pd.DataFrame(y_pred, index=final_df.index, columns=['predicted_label'])
prediction_df = prediction_df.reset_index()
prediction_df.to_csv(snakemake.output.predicted_label_df, sep='\t', index=False)

# Save also scaled_feature_df and real label as dfs
scaled_feature_df = pd.DataFrame(scaled_feature_df, index=final_df.index, columns=final_df.columns)
scaled_feature_df = scaled_feature_df.reset_index()
scaled_feature_df.to_csv(snakemake.output.scaled_feature_df, sep='\t', index=False)
y_mouse.to_csv(snakemake.output.label_df, sep='\t', index=False)
import pandas as pd
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.utils import shuffle
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

""" Scale features based on the training sets used to train the models and
    predict the abundance status of mouse snoRNAs."""

random_state = snakemake.params.random_state
human_feature_df_0 = pd.read_csv(snakemake.input.human_snoRNA_feature_df, sep='\t')
feature_df = pd.read_csv(snakemake.input.feature_df, sep='\t')  # mouse snoRNAs


## Mouse dataset processing
# Label-encode the label column manually
feature_df.loc[feature_df['abundance_cutoff'] == 'expressed', 'label'] = 1
feature_df.loc[feature_df['abundance_cutoff'] == 'not_expressed', 'label'] = 0
feature_df = feature_df.drop(columns=['gene_name', 'abundance_cutoff'])

numerical_features_label = feature_df.copy()
numerical_features_label = numerical_features_label.select_dtypes(include=['int64', 'float64'])

# Convert categorical features (astype 'object') into numpy array, convert the
# strings in that array into numbers with LabelEncoder and then one-hot encode this numerical array
dfs = [feature_df[['gene_id_sno']]]
feature_df = feature_df.drop(columns=['gene_id_sno'])
df_cat = feature_df.select_dtypes(include=['object'])
cols = df_cat.columns
for i, col in enumerate(cols):
    df_cat = feature_df[[col]]

    # Convert column into numpy array
    array_cat = df_cat.values.reshape(-1, 1)  # -1 infers the length of df_cat

    # Transform string array into numerical array
    le = LabelEncoder()
    df_cat[col+'_cat'] = le.fit_transform(df_cat[col])

    # Get the string that is linked to each numerical category created by LabelEncoder
    label_dict = dict(zip(le.classes_, le.transform(le.classes_)))
    labels = list(label_dict.keys())

    # One-hot encode the numerical array that was created
    enc = OneHotEncoder(handle_unknown='ignore')
    one_hot_array = enc.fit_transform(df_cat[[col+'_cat']]).toarray()
    enc_df = pd.DataFrame(one_hot_array, columns=labels)

    dfs.append(enc_df)

# Concat all one-hot encoded categorical columns
final_df = pd.concat(dfs, axis=1)

# Concat numerical features and label at the end and set index as sno_id
final_df = pd.concat([final_df, numerical_features_label], axis=1)
final_df = final_df.set_index('gene_id_sno')

# Keep only relevant columns
y_mouse = final_df['label']
final_df = final_df[['sno_mfe', 'terminal_stem_mfe', 'combined_box_hamming', 'host_expressed']]


## Human dataset processing
# Fill NaN with -5
human_feature_df_0 = human_feature_df_0.fillna(-5)
human_feature_df = human_feature_df_0.copy()
human_feature_df = shuffle(human_feature_df, random_state=random_state)

# Keep only top 4 features
X = human_feature_df.drop('label', axis=1)
X = X[['gene_id_sno', 'sno_mfe', 'terminal_stem_mfe', 'combined_box_hamming', 'host_expressed']]  # same order as in final_df that we want to predict
y = human_feature_df['label']


# Split the X dataset into CV and train sets (respectively 10 % and 90 % of the total df)
X_train, X_cv, y_train, y_cv = train_test_split(X, y,
                            test_size=0.1, train_size=0.9, random_state=random_state,
                            stratify=y)

# Create the scaler that will be used to mean normalize mouse snoRNA features
df = X_train.set_index('gene_id_sno')
scaler = StandardScaler().fit(df)



# Scale final_df (all mouse snoRNAs with top4 features) by the same scaling factor used to scale the training set in human snoRNAs
scaled_feature_df = scaler.transform(final_df)

# Unpickle and thus instantiate the trained model defined by the 'models' wildcard
model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))

# Predict label (expressed (1) or not_expressed (0)) on mouse snoRNAs
y_pred = model.predict(scaled_feature_df)
prediction_df = pd.DataFrame(y_pred, index=final_df.index, columns=['predicted_label'])
prediction_df = prediction_df.reset_index()
prediction_df.to_csv(snakemake.output.predicted_label_df, sep='\t', index=False)

# Save also scaled_feature_df and real label as dfs
scaled_feature_df = pd.DataFrame(scaled_feature_df, index=final_df.index, columns=final_df.columns)
scaled_feature_df = scaled_feature_df.reset_index()
scaled_feature_df.to_csv(snakemake.output.scaled_feature_df, sep='\t', index=False)
y_mouse.to_csv(snakemake.output.label_df, sep='\t', index=False)
import pandas as pd
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.utils import shuffle
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

""" Scale features based on the training sets used to train the models and
    predict the abundance status of mouse snoRNAs."""

random_state = snakemake.params.random_state
human_feature_df_0 = pd.read_csv(snakemake.input.human_snoRNA_feature_df, sep='\t')
feature_df = pd.read_csv(snakemake.input.feature_df, sep='\t')  # mouse snoRNAs


## Mouse dataset processing
# Label-encode the label column manually
feature_df.loc[feature_df['abundance_cutoff'] == 'expressed', 'label'] = 1
feature_df.loc[feature_df['abundance_cutoff'] == 'not_expressed', 'label'] = 0
feature_df = feature_df.drop(columns=['gene_name', 'abundance_cutoff'])

numerical_features_label = feature_df.copy()
numerical_features_label = numerical_features_label.select_dtypes(include=['int64', 'float64'])

# Convert categorical features (astype 'object') into numpy array, convert the
# strings in that array into numbers with LabelEncoder and then one-hot encode this numerical array
dfs = [feature_df[['gene_id_sno']]]
feature_df = feature_df.drop(columns=['gene_id_sno'])
df_cat = feature_df.select_dtypes(include=['object'])
cols = df_cat.columns
for i, col in enumerate(cols):
    df_cat = feature_df[[col]]

    # Convert column into numpy array
    array_cat = df_cat.values.reshape(-1, 1)  # -1 infers the length of df_cat

    # Transform string array into numerical array
    le = LabelEncoder()
    df_cat[col+'_cat'] = le.fit_transform(df_cat[col])

    # Get the string that is linked to each numerical category created by LabelEncoder
    label_dict = dict(zip(le.classes_, le.transform(le.classes_)))
    labels = list(label_dict.keys())

    # One-hot encode the numerical array that was created
    enc = OneHotEncoder(handle_unknown='ignore')
    one_hot_array = enc.fit_transform(df_cat[[col+'_cat']]).toarray()
    enc_df = pd.DataFrame(one_hot_array, columns=labels)

    dfs.append(enc_df)

# Concat all one-hot encoded categorical columns
final_df = pd.concat(dfs, axis=1)

# Concat numerical features and label at the end and set index as sno_id
final_df = pd.concat([final_df, numerical_features_label], axis=1)
final_df = final_df.set_index('gene_id_sno')

# Keep only relevant columns
y_mouse = final_df['label']
final_df = final_df[['sno_mfe', 'terminal_stem_mfe', 'combined_box_hamming']]


## Human dataset processing
# Fill NaN with -5
human_feature_df_0 = human_feature_df_0.fillna(-5)
human_feature_df = human_feature_df_0.copy()
human_feature_df = shuffle(human_feature_df, random_state=random_state)

# Keep only top 3 features
X = human_feature_df.drop('label', axis=1)
X = X[['gene_id_sno', 'sno_mfe', 'terminal_stem_mfe', 'combined_box_hamming']]  # same order as in final_df that we want to predict
y = human_feature_df['label']


# Split the X dataset into CV and train sets (respectively 10% and 90% of the total df)
X_train, X_cv, y_train, y_cv = train_test_split(X, y,
                            test_size=0.1, train_size=0.9, random_state=random_state,
                            stratify=y)

# Create the scaler that will be used to mean normalize mouse snoRNA features
df = X_train.set_index('gene_id_sno')
scaler = StandardScaler().fit(df)



# Scale final_df (all mouse snoRNAs with top3 features) by the same scaling factor used to scale the training set in human snoRNAs
scaled_feature_df = scaler.transform(final_df)

# Unpickle and thus instantiate the trained model defined by the 'models' wildcard
model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))

# Predict label (expressed (1) or not_expressed (0)) on mouse snoRNAs
y_pred = model.predict(scaled_feature_df)
prediction_df = pd.DataFrame(y_pred, index=final_df.index, columns=['predicted_label'])
prediction_df = prediction_df.reset_index()
prediction_df.to_csv(snakemake.output.predicted_label_df, sep='\t', index=False)

# Save also scaled_feature_df and real label as dfs
scaled_feature_df = pd.DataFrame(scaled_feature_df, index=final_df.index, columns=final_df.columns)
scaled_feature_df = scaled_feature_df.reset_index()
scaled_feature_df.to_csv(snakemake.output.scaled_feature_df, sep='\t', index=False)
y_mouse.to_csv(snakemake.output.label_df, sep='\t', index=False)
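The scaling step above reuses the scaler fitted on the human training set, so the mouse features in final_df must be provided in the same column order as the dataframe the scaler was fitted on. A minimal sketch with toy values (not taken from the pipeline) of what the fitted scaler does here:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy "human training" features and one toy "mouse" snoRNA (illustrative values only)
human_train = pd.DataFrame({'sno_mfe': [-50.0, -30.0, -10.0],
                            'terminal_stem_mfe': [-5.0, -3.0, -1.0],
                            'combined_box_hamming': [0.0, 2.0, 4.0]})
mouse = pd.DataFrame({'sno_mfe': [-40.0],
                      'terminal_stem_mfe': [-2.0],
                      'combined_box_hamming': [1.0]})

scaler = StandardScaler().fit(human_train)  # stores the per-column mean and std of the training set
print(scaler.transform(mouse))              # applies (x - mean_train) / std_train to each column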
import pandas as pd
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.utils import shuffle
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

""" Scale features based on the training sets used to train the models and
    predict the abundance status of mouse snoRNAs."""

random_state = snakemake.params.random_state
human_feature_df_0 = pd.read_csv(snakemake.input.human_snoRNA_feature_df, sep='\t')
feature_df = pd.read_csv(snakemake.input.feature_df, sep='\t')  # mouse snoRNAs


## Mouse dataset processing
# Label-encode the label column manually
feature_df.loc[feature_df['abundance_cutoff'] == 'expressed', 'label'] = 1
feature_df.loc[feature_df['abundance_cutoff'] == 'not_expressed', 'label'] = 0
feature_df = feature_df.drop(columns=['gene_name', 'abundance_cutoff'])


# Remove duplicate snoRNAs in the test set (copies that have exactly the same sequence and terminal stem)
feature_df = feature_df.drop_duplicates(subset=['sno_mfe', 'terminal_stem_mfe',
                                                'combined_box_hamming',
                                                'abundance_cutoff_host'])

numerical_features_label = feature_df.copy()
numerical_features_label = numerical_features_label.select_dtypes(include=['int64', 'float64']).reset_index(drop=True)

# Convert categorical features (astype 'object') into numpy array, convert the
# strings in that array into numbers with LabelEncoder and then one-hot encode this numerical array
dfs = [feature_df[['gene_id_sno']].reset_index(drop=True)]
feature_df = feature_df.drop(columns=['gene_id_sno'])
df_cat = feature_df.select_dtypes(include=['object'])
cols = df_cat.columns
for i, col in enumerate(cols):
    df_cat = feature_df[[col]]
    # Convert column into numpy array
    array_cat = df_cat.values.reshape(-1, 1)  # -1 infers the length of df_cat

    # Transform string array into numerical array
    le = LabelEncoder()
    df_cat[col+'_cat'] = le.fit_transform(df_cat[col])

    # Get the string that is linked to each numerical category created by LabelEncoder
    label_dict = dict(zip(le.classes_, le.transform(le.classes_)))
    labels = list(label_dict.keys())

    # One-hot encode the numerical array that was created
    enc = OneHotEncoder(handle_unknown='ignore')
    one_hot_array = enc.fit_transform(df_cat[[col+'_cat']]).toarray()
    enc_df = pd.DataFrame(one_hot_array, columns=labels)

    dfs.append(enc_df)

# Concat all one-hot encoded categorical columns
final_df = pd.concat(dfs, axis=1)

# Concat numerical features and label at the end and set index as sno_id
final_df = pd.concat([final_df, numerical_features_label], axis=1)
final_df = final_df.set_index('gene_id_sno')

# Keep only relevant columns
y_mouse = final_df['label']
final_df = final_df[['sno_mfe', 'terminal_stem_mfe', 'combined_box_hamming', 'host_expressed']]


## Human dataset processing
# Fill NaN with -5
human_feature_df_0 = human_feature_df_0.fillna(-5)
human_feature_df = human_feature_df_0.copy()
human_feature_df = shuffle(human_feature_df, random_state=random_state)

# Keep only top 4 features
X = human_feature_df.drop('label', axis=1)
X = X[['gene_id_sno', 'sno_mfe', 'terminal_stem_mfe', 'combined_box_hamming', 'host_expressed']]  # same order as in final_df that we want to predict
y = human_feature_df['label']


# Split the X dataset into CV and train sets (respectively 10% and 90% of the total df)
X_train, X_cv, y_train, y_cv = train_test_split(X, y,
                            test_size=0.1, train_size=0.9, random_state=random_state,
                            stratify=y)

# Create the scaler that will be used to mean normalize mouse snoRNA features
df = X_train.set_index('gene_id_sno')
scaler = StandardScaler().fit(df)



# Scale final_df (all mouse snoRNAs with top4 features) by the same scaling factor used to scale the training set in human snoRNAs
scaled_feature_df = scaler.transform(final_df)

# Unpickle and thus instantiate the trained model defined by the 'models' wildcard
model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))

# Predict label (expressed (1) or not_expressed (0)) on mouse snoRNAs
y_pred = model.predict(scaled_feature_df)
prediction_df = pd.DataFrame(y_pred, index=final_df.index, columns=['predicted_label'])
prediction_df = prediction_df.reset_index()
prediction_df.to_csv(snakemake.output.predicted_label_df, sep='\t', index=False)

# Save also scaled_feature_df and real label as dfs
scaled_feature_df = pd.DataFrame(scaled_feature_df, index=final_df.index, columns=final_df.columns)
scaled_feature_df = scaled_feature_df.reset_index()
scaled_feature_df.to_csv(snakemake.output.scaled_feature_df, sep='\t', index=False)
y_mouse.to_csv(snakemake.output.label_df, sep='\t', index=False)
import pandas as pd
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.utils import shuffle
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

""" Scale features based on the training sets used to train the models and
    predict the abundance status of mouse snoRNAs."""

random_state = snakemake.params.random_state
human_feature_df_0 = pd.read_csv(snakemake.input.human_snoRNA_feature_df, sep='\t')
feature_df = pd.read_csv(snakemake.input.feature_df, sep='\t')  # mouse snoRNAs


## Mouse dataset processing
# Label-encode the label column manually
feature_df.loc[feature_df['abundance_cutoff'] == 'expressed', 'label'] = 1
feature_df.loc[feature_df['abundance_cutoff'] == 'not_expressed', 'label'] = 0
feature_df = feature_df.drop(columns=['gene_name', 'abundance_cutoff'])


# Remove duplicate snoRNAs in the test set (copies that have exactly the same sequence and terminal stem)
feature_df = feature_df.drop_duplicates(subset=['sno_mfe', 'terminal_stem_mfe',
                                                'combined_box_hamming'])

numerical_features_label = feature_df.copy()
numerical_features_label = numerical_features_label.select_dtypes(include=['int64', 'float64']).reset_index(drop=True)

# Convert categorical features (astype 'object') into numpy array, convert the
# strings in that array into numbers with LabelEncoder and then one-hot encode this numerical array
dfs = [feature_df[['gene_id_sno']].reset_index(drop=True)]
feature_df = feature_df.drop(columns=['gene_id_sno'])
df_cat = feature_df.select_dtypes(include=['object'])
cols = df_cat.columns
for i, col in enumerate(cols):
    df_cat = feature_df[[col]]
    # Convert column into numpy array
    array_cat = df_cat.values.reshape(-1, 1)  # -1 infers the length of df_cat

    # Transform string array into numerical array
    le = LabelEncoder()
    df_cat[col+'_cat'] = le.fit_transform(df_cat[col])

    # Get the string that is linked to each numerical category created by LabelEncoder
    label_dict = dict(zip(le.classes_, le.transform(le.classes_)))
    labels = list(label_dict.keys())

    # One-hot encode the numerical array that was created
    enc = OneHotEncoder(handle_unknown='ignore')
    one_hot_array = enc.fit_transform(df_cat[[col+'_cat']]).toarray()
    enc_df = pd.DataFrame(one_hot_array, columns=labels)

    dfs.append(enc_df)

# Concat all one-hot encoded categorical columns
final_df = pd.concat(dfs, axis=1)

# Concat numerical features and label at the end and set index as sno_id
final_df = pd.concat([final_df, numerical_features_label], axis=1)
final_df = final_df.set_index('gene_id_sno')

# Keep only relevant columns
y_mouse = final_df['label']
final_df = final_df[['sno_mfe', 'terminal_stem_mfe', 'combined_box_hamming']]


## Human dataset processing
# Fill NaN with -5
human_feature_df_0 = human_feature_df_0.fillna(-5)
human_feature_df = human_feature_df_0.copy()
human_feature_df = shuffle(human_feature_df, random_state=random_state)

# Keep only top 3 features
X = human_feature_df.drop('label', axis=1)
X = X[['gene_id_sno', 'sno_mfe', 'terminal_stem_mfe', 'combined_box_hamming']]  # same order as in final_df that we want to predict
y = human_feature_df['label']


# Split the X dataset into CV and train sets (respectively 10% and 90% of the total df)
X_train, X_cv, y_train, y_cv = train_test_split(X, y,
                            test_size=0.1, train_size=0.9, random_state=random_state,
                            stratify=y)

# Create the scaler that will be used to mean normalize mouse snoRNA features
df = X_train.set_index('gene_id_sno')
scaler = StandardScaler().fit(df)



# Scale final_df (all mouse snoRNAs with top 3 features) by the same scaling factor used to scale the training set in human snoRNAs
scaled_feature_df = scaler.transform(final_df)

# Unpickle and thus instantiate the trained model defined by the 'models' wildcard
model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))

# Predict label (expressed (1) or not_expressed (0)) on mouse snoRNAs
y_pred = model.predict(scaled_feature_df)
prediction_df = pd.DataFrame(y_pred, index=final_df.index, columns=['predicted_label'])
prediction_df = prediction_df.reset_index()
prediction_df.to_csv(snakemake.output.predicted_label_df, sep='\t', index=False)

# Save also scaled_feature_df and real label as dfs
scaled_feature_df = pd.DataFrame(scaled_feature_df, index=final_df.index, columns=final_df.columns)
scaled_feature_df = scaled_feature_df.reset_index()
scaled_feature_df.to_csv(snakemake.output.scaled_feature_df, sep='\t', index=False)
y_mouse.to_csv(snakemake.output.label_df, sep='\t', index=False)
import pandas as pd
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

""" Scale features based on the training sets used to train the models on human
    snoRNAs and predict the abundance status of species snoRNAs."""

random_state = snakemake.params.random_state
human_feature_df_0 = pd.read_csv(snakemake.input.human_snoRNA_feature_df, sep='\t')
feature_df = pd.read_csv(snakemake.input.feature_df, sep='\t')  # species snoRNAs
feature_df_copy = feature_df.copy()

## Species dataset processing
# Drop gene_name col and select numerical feature columns
feature_df = feature_df.drop(columns='gene_name')
numerical_features_label = feature_df.copy()
numerical_features_label = numerical_features_label.select_dtypes(include=['int64', 'float64'])

# Convert categorical features (astype 'object') into numpy array, convert the
# strings in that array into numbers with LabelEncoder and then one-hot encode this numerical array
dfs = [feature_df[['gene_id_sno']]]
feature_df = feature_df.drop(columns=['gene_id_sno'])
df_cat = feature_df.select_dtypes(include=['object'])
cols = df_cat.columns
for i, col in enumerate(cols):
    df_cat = feature_df[[col]]

    # Convert column into numpy array
    array_cat = df_cat.values.reshape(-1, 1)  # -1 infers the length of df_cat

    # Transform string array into numerical array
    le = LabelEncoder()
    df_cat[col+'_cat'] = le.fit_transform(df_cat[col])

    # Get the string that is linked to each numerical category created by LabelEncoder
    label_dict = dict(zip(le.classes_, le.transform(le.classes_)))
    labels = list(label_dict.keys())

    # One-hot encode the numerical array that was created
    enc = OneHotEncoder(handle_unknown='ignore')
    one_hot_array = enc.fit_transform(df_cat[[col+'_cat']]).toarray()
    enc_df = pd.DataFrame(one_hot_array, columns=labels)

    dfs.append(enc_df)

# Concat all one-hot encoded categorical columns
final_df = pd.concat(dfs, axis=1)

# Concat numerical features and label at the end and set index as sno_id
final_df = pd.concat([final_df, numerical_features_label], axis=1)
final_df = final_df.set_index('gene_id_sno')

# Keep only relevant columns
final_df = final_df[['sno_mfe', 'terminal_stem_mfe', 'combined_box_hamming', 'host_expressed']]





## Human dataset processing
# Fill NaN with -5
human_feature_df_0 = human_feature_df_0.fillna(-5)
human_feature_df = human_feature_df_0.copy()
human_feature_df = shuffle(human_feature_df, random_state=random_state)

# Keep only top 4 features
X = human_feature_df.drop('label', axis=1)
X = X[['gene_id_sno', 'sno_mfe', 'terminal_stem_mfe', 'combined_box_hamming', 'host_expressed']]  # same order as in final_df that we want to predict
y = human_feature_df['label']


# Split the X dataset into CV and train sets (respectively 10% and 90% of the total df)
X_train, X_cv, y_train, y_cv = train_test_split(X, y,
                            test_size=0.1, train_size=0.9, random_state=random_state,
                            stratify=y)

# Create the scaler that will be used to mean normalize species snoRNA features
df = X_train.set_index('gene_id_sno')
scaler = StandardScaler().fit(df)



# Scale final_df (all species snoRNAs with top4 features) by the same scaling factor used to scale the training set in human snoRNAs
scaled_feature_df = scaler.transform(final_df)

# Unpickle and thus instantiate the trained log_reg model
model = pickle.load(open(snakemake.input.pickled_trained_model[0], 'rb'))

# Predict label (expressed (1) or not_expressed (0)) on species snoRNAs
y_pred = model.predict(scaled_feature_df)
prediction_df = pd.DataFrame(y_pred, index=final_df.index, columns=['predicted_label'])
prediction_df = prediction_df.reset_index()
prediction_df = prediction_df.replace(to_replace=[0, 1], value=['not_expressed', 'expressed'])

# Merge predictions to feature df and save df
merged_df = feature_df_copy.merge(prediction_df, how='left', on='gene_id_sno')
merged_df.to_csv(snakemake.output.predicted_label_df, sep='\t', index=False)

# Save also scaled_feature_df and real label as dfs
scaled_feature_df = pd.DataFrame(scaled_feature_df, index=final_df.index, columns=final_df.columns)
scaled_feature_df = scaled_feature_df.reset_index()
scaled_feature_df.to_csv(snakemake.output.scaled_feature_df, sep='\t', index=False)
import pandas as pd

feature_df = pd.read_csv(snakemake.input.all_features_df, sep='\t')
output_path = snakemake.output.real_confusion_value_df
sno_per_confusion_value = snakemake.input.sno_per_confusion_value
conf_val = snakemake.wildcards.confusion_value
conf_val_pair = {'FN': 'TP', 'TP': 'FN', 'FP': 'TN', 'TN': 'FP'}  # to help select only real confusion value
                                                                # (i.e. those always predicted as such across iterations and models)
conf_val_df = pd.read_csv([path for path in sno_per_confusion_value if conf_val in path][0], sep='\t')
conf_val_pair_df = pd.read_csv([path for path in sno_per_confusion_value if conf_val_pair[conf_val] in path][0], sep='\t')

# Select only real confusion_value (ex: FN) (those always predicted as such across models and iterations)
real_conf_val = list(set(conf_val_df.gene_id_sno.to_list()) - set(conf_val_pair_df.gene_id_sno.to_list()))

df = feature_df[feature_df['gene_id_sno'].isin(real_conf_val)]
df.to_csv(output_path, index=False, sep='\t')
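The set difference above is what makes a confusion value "real": a snoRNA is kept as, say, a real FN only if it appears in the FN table and never in the paired TP table. A toy illustration with made-up IDs:

fn_ids = {'snoA', 'snoB', 'snoC'}   # called FN at least once across models/iterations (made-up IDs)
tp_ids = {'snoB'}                   # also called TP at least once
real_fn = fn_ids - tp_ids           # only snoRNAs never called TP are kept
print(sorted(real_fn))              # ['snoA', 'snoC']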
import pandas as pd
import collections as coll

""" Find all the snoRNAs that are either predicted as a TP, TN, FP or FN in at
    least 2 of the 3 chosen models."""

confusion_value = snakemake.wildcards.confusion_value
confusion_value_df_paths = snakemake.input.confusion_value_df

# Load confusion value dfs (contain confusion value for each sno in the respective test set)
confusion_value_dfs = []
for path in confusion_value_df_paths:
    df = pd.read_csv(path, sep='\t')
    last_col_name = [col_name for col_name in df.columns if 'confusion_matrix_val' in col_name][0]
    df = df.rename(columns={last_col_name: last_col_name.split('_val_')[0]})
    confusion_value_dfs.append(df)


# Concat all dfs vertically and create one df per confusion_value
concat_df = pd.concat(confusion_value_dfs)
confusion_final_df = concat_df[concat_df['confusion_matrix'] == confusion_value]
confusion_final_df = confusion_final_df[['gene_id_sno', 'confusion_matrix']]
print(confusion_final_df)
confusion_final_df.to_csv(snakemake.output.sno_per_confusion_value, index=False, sep='\t')
import pandas as pd
import collections as coll

""" Find all the snoRNAs that are either predicted as a TP, TN, FP or FN in at
    least 2 of the 3 chosen models."""

confusion_value_df_paths = snakemake.input.confusion_value_df

# Load confusion value dfs (contain confusion value for each sno in the respective test set for each model)
confusion_value_dfs = []
for path in confusion_value_df_paths:
    df = pd.read_csv(path, sep='\t')
    last_col_name = [col_name for col_name in df.columns if 'confusion_matrix_val' in col_name][0]
    df = df.rename(columns={last_col_name: last_col_name.split('_val_')[0]})
    confusion_value_dfs.append(df)


# Concat all dfs vertically
concat_df = pd.concat(confusion_value_dfs)
confusion_final_df = concat_df[['gene_id_sno', 'confusion_matrix']]

# Keep only one line per snoRNA (the chosen confusion value is the mode, i.e. the most frequent value, across the 3 models' predictions)
confusion_final_df = confusion_final_df.groupby('gene_id_sno')['confusion_matrix'].agg(pd.Series.mode).to_frame()
confusion_final_df = confusion_final_df.reset_index()
print(confusion_final_df)
confusion_final_df.to_csv(snakemake.output.sno_per_confusion_value, index=False, sep='\t')
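One detail of the aggregation above, sketched with toy data: pd.Series.mode returns every most-frequent value, so a snoRNA predicted differently by each model (a tie) ends up with an array rather than a single string in the confusion_matrix column.

import pandas as pd

toy = pd.DataFrame({'gene_id_sno': ['snoA', 'snoA', 'snoA', 'snoB', 'snoB'],
                    'confusion_matrix': ['TP', 'TP', 'FN', 'TN', 'FP']})
# snoA has a clear mode ('TP'); snoB is a tie, so its cell holds both values
print(toy.groupby('gene_id_sno')['confusion_matrix'].agg(pd.Series.mode))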
import pandas as pd

""" Remove the top predicting feature from the orginal
    dataset containing all snoRNA features. Either conservation_score_norm,
    terminal_stem_mfe_norm, sno_mfe_norm, host_expressed or all_four features."""

df = pd.read_csv(snakemake.input.df, sep='\t')

if snakemake.wildcards.feature_effect == "all_four":
    df = df.drop(columns=["conservation_score_norm", "terminal_stem_mfe_norm",
                        "sno_mfe_norm", "host_expressed"])
elif snakemake.wildcards.feature_effect == "top_10":
    df = df.drop(columns=["conservation_score_norm", "terminal_stem_mfe_norm",
                        "sno_mfe_norm", "host_expressed", "intron_number_norm",
                        "dist_to_bp_norm", "intron_length_norm", "sno_length_norm", "intergenic.3",
                        "rRNA", "Orphan", "distance_downstream_exon_norm"])
elif snakemake.wildcards.feature_effect == "top_10_intergenic":
    df = df.drop(columns=["conservation_score_norm", "terminal_stem_mfe_norm",
                        "sno_mfe_norm", "host_expressed", "intron_number_norm",
                        "dist_to_bp_norm", "intron_length_norm", "sno_length_norm", "intergenic.3",
                        "rRNA", "Orphan", "distance_downstream_exon_norm", "intergenic", "intergenic.1", "intergenic.2", "intergenic.4"])
elif snakemake.wildcards.feature_effect == "top_10_all":
    df = df.drop(columns=["conservation_score_norm", "terminal_stem_mfe_norm",
                        "sno_mfe_norm", "host_expressed", "intron_number_norm",
                        "dist_to_bp_norm", "intron_length_norm", "sno_length_norm", "intergenic.3",
                        "rRNA", "Orphan", "distance_downstream_exon_norm", "intergenic", "intergenic.1",
                        "intergenic.2", "intergenic.4", "host_not_expressed", "snRNA", "non_coding",
                        "protein_coding","False", "True", "dual_initiation", "simple_initiation"])
elif snakemake.wildcards.feature_effect == "top_11_all":
    df = df.drop(columns=["conservation_score_norm", "terminal_stem_mfe_norm",
                        "sno_mfe_norm", "host_expressed", "intron_number_norm",
                        "dist_to_bp_norm", "intron_length_norm", "sno_length_norm", "intergenic.3",
                        "rRNA", "Orphan", "distance_downstream_exon_norm", "intergenic", "intergenic.1",
                        "intergenic.2", "intergenic.4", "host_not_expressed", "snRNA", "non_coding",
                        "protein_coding","False", "True", "dual_initiation", "simple_initiation", "distance_upstream_exon_norm"])                        
elif snakemake.wildcards.feature_effect == "Other":
    df = df[["gene_id_sno", "Other", "label"]]


### Numerical vs categorical features
### Intrinsic vs extrinsic features


else:
    df = df.drop(columns=[snakemake.wildcards.feature_effect])

df.to_csv(snakemake.output.df_wo_feature, index=False, sep='\t')
import pandas as pd

""" Remove snoRNA clusters in SNHG14 and MEG8 host genes (so mostly SNORD115,
    116 and 113, 114 snoRNAs) from the original dataset containing all snoRNAs. """

ref_table_HG = pd.read_csv(snakemake.input.host_gene_df)
df = pd.read_csv(snakemake.input.df, sep='\t')

# Get all snoRNAs in SNHG14 and MEG8
clusters = ref_table_HG[(ref_table_HG['host_id'] == 'ENSG00000224078') | (ref_table_HG['host_id'] == 'ENSG00000225746')]
cluster_sno = list(clusters['sno_id'])

# Remove these snoRNAs from the original dataset
df = df[~df['gene_id_sno'].isin(cluster_sno)]

df.to_csv(snakemake.output.df_wo_clusters, index=False, sep='\t')
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.utils import shuffle

""" Fill NaN in feature df in numerical columns with -5 instead. -5 was chosen
    arbitrarily so that this negative value should not interfere with all other
    positive values. Then, AFTER splitting into train, CV and test sets, apply
    mean normalization (feature scaling) to numerical features columns."""
all_iterations = ["manual_first", "manual_second", "manual_third", "manual_fourth",
                "manual_fifth", "manual_sixth", "manual_seventh", "manual_eighth",
                "manual_ninth", "manual_tenth"]
manual_iteration = snakemake.wildcards.manual_iteration
idx = all_iterations.index(manual_iteration)
random_state = snakemake.params.random_state
df_0 = pd.read_csv(snakemake.input.feature_df, sep='\t')

# Fill NaN with -5
df_0 = df_0.fillna(-5)


# First, shuffle the df
df = df_0.copy()
df = shuffle(df, random_state=random_state)

X = df[['gene_id_sno', 'combined_box_hamming']]
y = df['label']

# Configure the cross-validation strategy (StratifiedKFold where k=10)
# This serves only to split into ten equal folds; no cross-validation is done at this point
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=random_state)

# Get the corresponding test set (10% of total df) for that manual iteration
i = 0
for total_train_index, test_index in kf.split(X, y):
    if i == idx:
        X_test = X.loc[test_index]
        y_test = y.loc[test_index]

        # Split the total_train into CV and train sets (respectively 10% and 80% of the total df, i.e. 11.1% and 88.9% of 1386 snoRNAs)
        X_total_train = X.loc[total_train_index]
        y_total_train = y.loc[total_train_index]
        X_train, X_cv, y_train, y_cv = train_test_split(X_total_train, y_total_train,
                                    test_size=0.111, train_size=0.889, random_state=random_state,
                                    stratify=y_total_train)
        y_test.to_csv(snakemake.output.y_test, index=False, sep='\t')
        y_train.to_csv(snakemake.output.y_train, index=False, sep='\t')
        y_cv.to_csv(snakemake.output.y_cv, index=False, sep='\t')

        # Scale feature values using mean normalization for numerical value columns
        # with high standard deviation
        dfs = [X_cv, X_train, X_test]
        output = [snakemake.output.cv, snakemake.output.train, snakemake.output.test]
        for j, df in enumerate(dfs):
            df_num = df.select_dtypes(include=['int64', 'float64'])
            num_cols = list(df_num.columns)
            for i, col in enumerate(num_cols):
                mean = df[col].mean()
                std = df[col].std()
                if std != 0:
                    df[col+'_norm'] = (df[col] - mean) / std
                else:  # to deal with column that has all the same value, thus a std=0
                    df[col+'_norm'] = df[col]  # we don't scale, but these values will either be all 0 or all 1
            df = df.drop(num_cols, axis=1)
            df.to_csv(output[j], index=False, sep='\t')
    i+=1
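The per-column loop above standardizes with pandas' .std(), which is the sample standard deviation (ddof=1), whereas sklearn's StandardScaler used elsewhere in the pipeline divides by the population standard deviation (ddof=0). A small sketch with a toy column showing the slight difference:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

col = pd.Series([1.0, 2.0, 3.0, 4.0])
manual = (col - col.mean()) / col.std()                      # pandas std, ddof=1
sk = StandardScaler().fit_transform(col.to_frame()).ravel()  # sklearn std, ddof=0
print(np.round(manual.to_numpy(), 3), np.round(sk, 3))       # the two scalings differ by a constant factor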
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.utils import shuffle

""" Fill NaN in feature df in numerical columns with -5 instead. -5 was chosen
    arbitrarily so that this negative value should not interfere with all other
    positive values. Then, AFTER splitting into train, CV and test sets, apply
    mean normalization (feature scaling) to numerical features columns."""
all_iterations = ["manual_first", "manual_second", "manual_third", "manual_fourth",
                "manual_fifth", "manual_sixth", "manual_seventh", "manual_eighth",
                "manual_ninth", "manual_tenth"]
manual_iteration = snakemake.wildcards.manual_iteration
idx = all_iterations.index(manual_iteration)
random_state = snakemake.params.random_state
df_0 = pd.read_csv(snakemake.input.feature_df, sep='\t')

# Fill NaN with -5
df_0 = df_0.fillna(-5)


# First, shuffle the df
df = df_0.copy()
df = shuffle(df, random_state=random_state)

X = df.drop('label', axis=1)
y = df['label']

# Configure the cross-validation strategy (StratifiedKFold where k=10)
# This serves only to split into ten equal folds; no cross-validation is done at this point
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=random_state)

# Get the corresponding test set (10% of total df) for that manual iteration
i = 0
for total_train_index, test_index in kf.split(X, y):
    if i == idx:
        X_test = X.loc[test_index]
        y_test = y.loc[test_index]

        # Split the total_train into CV and train sets (respectively 10% and 80% of the total df, i.e. 11.1% and 88.9% of 1386 snoRNAs)
        X_total_train = X.loc[total_train_index]
        y_total_train = y.loc[total_train_index]
        X_train, X_cv, y_train, y_cv = train_test_split(X_total_train, y_total_train,
                                    test_size=0.111, train_size=0.889, random_state=random_state,
                                    stratify=y_total_train)
        y_test.to_csv(snakemake.output.y_test, index=False, sep='\t')
        y_train.to_csv(snakemake.output.y_train, index=False, sep='\t')
        y_cv.to_csv(snakemake.output.y_cv, index=False, sep='\t')

        # Scale feature values using mean normalization for numerical value columns
        # with high standard deviation
        dfs = [X_cv, X_train, X_test]
        output = [snakemake.output.cv, snakemake.output.train, snakemake.output.test]
        for j, df in enumerate(dfs):
            df_num = df.select_dtypes(include=['int64', 'float64'])
            num_cols = list(df_num.columns)
            for i, col in enumerate(num_cols):
                mean = df[col].mean()
                std = df[col].std()
                if std != 0:
                    df[col+'_norm'] = (df[col] - mean) / std
                else:  # to deal with column that has all the same value, thus a std=0
                    df[col+'_norm'] = df[col]  # we don't scale, but these values will either be all 0 or all 1
            df = df.drop(num_cols, axis=1)
            df.to_csv(output[j], index=False, sep='\t')
    i+=1
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.utils import shuffle

""" Fill NaN in feature df in numerical columns with -5 instead. -5 was chosen
    arbitrarily so that this negative value should not interfere with all other
    positive values. Then, AFTER splitting into train, CV and test sets, apply
    mean normalization (feature scaling) to numerical features columns."""
all_iterations = ["manual_first", "manual_second", "manual_third", "manual_fourth",
                "manual_fifth", "manual_sixth", "manual_seventh", "manual_eighth",
                "manual_ninth", "manual_tenth"]
manual_iteration = snakemake.wildcards.manual_iteration
idx = all_iterations.index(manual_iteration)
random_state = snakemake.params.random_state
df_0 = pd.read_csv(snakemake.input.feature_df, sep='\t')

# Fill NaN with -5
df_0 = df_0.fillna(-5)


# First, shuffle the df
df = df_0.copy()
df = shuffle(df, random_state=random_state)

# Keep only top 3 features
X = df.drop('label', axis=1)
X = X[['gene_id_sno', 'combined_box_hamming', 'sno_mfe', 'terminal_stem_mfe']]
y = df['label']

# Configure the cross-validation strategy (StratifiedKFold where k=10)
# This serves only to split into ten equal folds; no cross-validation is done at this point
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=random_state)

# Get the corresponding test set (10% of total df) for that manual iteration
i = 0
for total_train_index, test_index in kf.split(X, y):
    if i == idx:
        X_test = X.loc[test_index]
        y_test = y.loc[test_index]

        # Split the total_train into CV and train sets (respectively 10% and 80% of the total df, i.e. 11.1% and 88.9% of 1386 snoRNAs)
        X_total_train = X.loc[total_train_index]
        y_total_train = y.loc[total_train_index]
        X_train, X_cv, y_train, y_cv = train_test_split(X_total_train, y_total_train,
                                    test_size=0.111, train_size=0.889, random_state=random_state,
                                    stratify=y_total_train)
        y_test.to_csv(snakemake.output.y_test, index=False, sep='\t')
        y_train.to_csv(snakemake.output.y_train, index=False, sep='\t')
        y_cv.to_csv(snakemake.output.y_cv, index=False, sep='\t')

        # Scale feature values using mean normalization for numerical value columns
        # with high standard deviation
        dfs = [X_cv, X_train, X_test]
        output = [snakemake.output.cv, snakemake.output.train, snakemake.output.test]
        for j, df in enumerate(dfs):
            df_num = df.select_dtypes(include=['int64', 'float64'])
            num_cols = list(df_num.columns)
            for i, col in enumerate(num_cols):
                mean = df[col].mean()
                std = df[col].std()
                if std != 0:
                    df[col+'_norm'] = (df[col] - mean) / std
                else:  # to deal with column that has all the same value, thus a std=0
                    df[col+'_norm'] = df[col]  # we don't scale, but these values will either be all 0 or all 1
            df = df.drop(num_cols, axis=1)
            df.to_csv(output[j], index=False, sep='\t')
    i+=1
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.utils import shuffle

""" Fill NaN in feature df in numerical columns with -5 instead. -5 was chosen
    arbitrarily so that this negative value should not interfere with all other
    positive values. Then, AFTER splitting into train, CV and test sets, apply
    mean normalization (feature scaling) to numerical features columns."""
all_iterations = ["manual_first", "manual_second", "manual_third", "manual_fourth",
                "manual_fifth", "manual_sixth", "manual_seventh", "manual_eighth",
                "manual_ninth", "manual_tenth"]
manual_iteration = snakemake.wildcards.manual_iteration
idx = all_iterations.index(manual_iteration)
random_state = snakemake.params.random_state
df_0 = pd.read_csv(snakemake.input.feature_df, sep='\t')

# Fill NaN with -5
df_0 = df_0.fillna(-5)


# First, shuffle the df
df = df_0.copy()
df = shuffle(df, random_state=random_state)

# Keep only top 4 features
X = df.drop('label', axis=1)
X = X[['gene_id_sno', 'combined_box_hamming', 'sno_mfe', 'terminal_stem_mfe', 'host_expressed']]
y = df['label']

# Configure the cross-validation strategy (StratifiedKFold where k=10)
# This serves only to split into ten equal folds; no cross-validation is done at this point
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=random_state)

# Get the corresponding test set (10% of total df) for that manual iteration
i = 0
for total_train_index, test_index in kf.split(X, y):
    if i == idx:
        X_test = X.loc[test_index]
        y_test = y.loc[test_index]

        # Split the total_train into CV and train sets (respectively 10% and 80% of the total df, i.e. 11.1% and 88.9% of 1386 snoRNAs)
        X_total_train = X.loc[total_train_index]
        y_total_train = y.loc[total_train_index]
        X_train, X_cv, y_train, y_cv = train_test_split(X_total_train, y_total_train,
                                    test_size=0.111, train_size=0.889, random_state=random_state,
                                    stratify=y_total_train)
        y_test.to_csv(snakemake.output.y_test, index=False, sep='\t')
        y_train.to_csv(snakemake.output.y_train, index=False, sep='\t')
        y_cv.to_csv(snakemake.output.y_cv, index=False, sep='\t')

        # Scale feature values using mean normalization for numerical value columns
        # with high standard deviation
        dfs = [X_cv, X_train, X_test]
        output = [snakemake.output.cv, snakemake.output.train, snakemake.output.test]
        for j, df in enumerate(dfs):
            df_num = df.select_dtypes(include=['int64', 'float64'])
            num_cols = list(df_num.columns)
            for i, col in enumerate(num_cols):
                mean = df[col].mean()
                std = df[col].std()
                if std != 0:
                    df[col+'_norm'] = (df[col] - mean) / std
                else:  # to deal with column that has all the same value, thus a std=0
                    df[col+'_norm'] = df[col]  # we don't scale, but these values will either be all 0 or all 1
            df = df.drop(num_cols, axis=1)
            df.to_csv(output[j], index=False, sep='\t')
    i+=1
import pandas as pd
from sklearn.model_selection import train_test_split

""" Fill NaN in feature df in numerical columns with -5 instead. -5 was chosen
    arbitrarily so that this negative value should not interfere with all other
    positive values. Then, AFTER splitting into train, CV and test sets, apply
    mean normalization (feature scaling) to numerical features columns."""
iteration = snakemake.wildcards.iteration
random_state_dict = snakemake.params.random_state
random_state = random_state_dict[iteration]
df = pd.read_csv(snakemake.input.feature_df, sep='\t')

# Fill NaN with -5
df = df.fillna(-5)

X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=random_state, stratify=y)
y_cv.to_csv(snakemake.output.y_cv, index=False, sep='\t')

# Next, the total_train is split into train and test sets (1077 and 232 correspond
# to the number of examples in the train and test sets respectively, i.e. approximately
# 70% and 15% of all examples in these two datasets)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train,
                                    test_size=232, train_size=1077, random_state=random_state,
                                    stratify=y_total_train)
y_train.to_csv(snakemake.output.y_train, index=False, sep='\t')
y_test.to_csv(snakemake.output.y_test, index=False, sep='\t')

# Scale feature values using mean normalization for numerical value columns
# with high standard deviation
dfs = [X_cv, X_train, X_test]
output = [snakemake.output.cv, snakemake.output.train, snakemake.output.test]
for j, df in enumerate(dfs):
    df_num = df.select_dtypes(include=['int64', 'float64'])
    num_cols = list(df_num.columns)
    for i, col in enumerate(num_cols):
        mean = df[col].mean()
        std = df[col].std()
        df[col+'_norm'] = (df[col] - mean) / std
    df = df.drop(num_cols, axis=1)
    df.to_csv(output[j], index=False, sep='\t')
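A quick arithmetic check of the 1077/232 counts used above, assuming the feature df holds roughly 1540 snoRNAs in total (an assumption, implied by those counts rather than stated here): 15% goes to the CV set first, and the 1077/232 split of the remainder then corresponds to about 70% and 15% of all examples.

total = 1540                      # assumed total number of snoRNAs implied by the 1077/232 counts
cv = round(0.15 * total)          # ~231 examples go to the CV set
remaining = total - cv            # ~1309 examples left for the train/test split
print(cv, remaining)              # 231 1309
print(round(1077 / total, 3), round(232 / total, 3))  # ~0.699 and ~0.151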
import pandas as pd
from sklearn.model_selection import train_test_split

""" Fill NaN in feature df in numerical columns with -5 instead. -5 was chosen
    arbitrarily so that this negative value should not interfere with all other
    positive values. Then, AFTER splitting into train, CV and test sets, apply
    mean normalization (feature scaling) to numerical features columns."""

df = pd.read_csv(snakemake.input.feature_df, sep='\t')

# Fill NaN with -5
df = df.fillna(-5)

X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)
y_cv.to_csv(snakemake.output.y_cv, index=False, sep='\t')

# Next, the total_train is split into train and test sets (1077 and 232 correspond
# to the number of examples in the train and test sets respectively, i.e. approximately
# 70% and 15% of all examples in these two datasets)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train,
                                    test_size=232, train_size=1077, random_state=42,
                                    stratify=y_total_train)
y_train.to_csv(snakemake.output.y_train, index=False, sep='\t')
y_test.to_csv(snakemake.output.y_test, index=False, sep='\t')

# Scale feature values using mean normalization for numerical value columns
# with high standard deviation
dfs = [X_cv, X_train, X_test]
output = [snakemake.output.cv, snakemake.output.train, snakemake.output.test]
for j, df in enumerate(dfs):
    df_num = df.select_dtypes(include=['int64', 'float64'])
    num_cols = list(df_num.columns)
    for i, col in enumerate(num_cols):
        mean = df[col].mean()
        std = df[col].std()
        df[col+'_norm'] = (df[col] - mean) / std
    df = df.drop(num_cols, axis=1)
    df.to_csv(output[j], index=False, sep='\t')
import pandas as pd


""" Fill NaN in feature df in numerical columns with -5 instead. -5 was chosen
    arbitrarily so that this negative value should not interfere with all other
    positive values. Then, apply mean normalization (feature scaling) to
    numerical features columns."""

df = pd.read_csv(snakemake.input.feature_df, sep='\t')

# Fill NaN with -5
df = df.fillna(-5)

# Scale feature values using mean normalization for numerical value columns
# with high standard deviation
df_num = df.select_dtypes(include=['int64', 'float64'])
num_cols = list(df_num.columns)
for i, col in enumerate(num_cols):
    mean = df[col].mean()
    std = df[col].std()
    df[col+'_norm'] = (df[col] - mean) / std

df = df.drop(num_cols, axis=1)
df.to_csv(snakemake.output.scaled_feature_df, index=False, sep='\t')
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.preprocessing import StandardScaler

""" Fill NaN in feature df in numerical columns with -5 instead. -5 was chosen
    arbitrarily so that this negative value should not interfere with all other
    positive values. Then, AFTER splitting into train and CV sets, apply
    mean normalization (feature scaling) to numerical features columns."""

random_state = int(snakemake.wildcards.rs)

df_0 = pd.read_csv(snakemake.input.feature_df, sep='\t')

# Fill NaN with -5
df_0 = df_0.fillna(-5)


# First, shuffle the df
df = df_0.copy()
df = shuffle(df, random_state=random_state)

# Keep only top 3 features
X = df.drop('label', axis=1)
X = X[['gene_id_sno', 'combined_box_hamming', 'sno_mfe', 'terminal_stem_mfe']]
y = df['label']


# Split the X dataset into CV and train sets (respectively 10% and 90% of the total df)
X_train, X_cv, y_train, y_cv = train_test_split(X, y,
                            test_size=0.1, train_size=0.9, random_state=random_state,
                            stratify=y)
y_cv.to_csv(snakemake.output.y_cv, index=False, sep='\t')
y_train.to_csv(snakemake.output.y_train, index=False, sep='\t')

# Scale feature values using mean normalization for numerical value columns
dfs = [X_cv.set_index('gene_id_sno'), X_train.set_index('gene_id_sno')]
output = [snakemake.output.cv, snakemake.output.train]
for j, df in enumerate(dfs):
    scaler = StandardScaler().fit(df)
    scaled_df_array = scaler.transform(df)
    scaled_df = pd.DataFrame(scaled_df_array, index=df.index, columns=df.columns)
    scaled_df = scaled_df.reset_index()
    scaled_df.to_csv(output[j], index=False, sep='\t')
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.preprocessing import StandardScaler

""" Fill NaN in feature df in numerical columns with -5 instead. -5 was chosen
    arbitrarily so that this negative value should not interfere with all other
    positive values. Then, AFTER splitting into train and CV sets, apply
    mean normalization (feature scaling) to numerical features columns."""

random_state = snakemake.params.random_state
df_0 = pd.read_csv(snakemake.input.feature_df, sep='\t')

# Fill NaN with -5
df_0 = df_0.fillna(-5)


# First, shuffle the df
df = df_0.copy()
df = shuffle(df, random_state=random_state)

# Keep only top 4 features
X = df.drop('label', axis=1)
X = X[['gene_id_sno', 'combined_box_hamming', 'sno_mfe', 'terminal_stem_mfe', 'host_expressed']]
y = df['label']


# Split the X dataset into CV and train sets (respectively 10% and 90% of the total df)
X_train, X_cv, y_train, y_cv = train_test_split(X, y,
                            test_size=0.1, train_size=0.9, random_state=random_state,
                            stratify=y)
y_cv.to_csv(snakemake.output.y_cv, index=False, sep='\t')
y_train.to_csv(snakemake.output.y_train, index=False, sep='\t')

# Scale feature values using mean normalization for numerical value columns
dfs = [X_cv.set_index('gene_id_sno'), X_train.set_index('gene_id_sno')]
output = [snakemake.output.cv, snakemake.output.train]
for j, df in enumerate(dfs):
    scaler = StandardScaler().fit(df)
    scaled_df_array = scaler.transform(df)
    scaled_df = pd.DataFrame(scaled_df_array, index=df.index, columns=df.columns)
    scaled_df = scaled_df.reset_index()
    scaled_df.to_csv(output[j], index=False, sep='\t')
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.preprocessing import StandardScaler

""" Fill NaN in feature df in numerical columns with -5 instead. -5 was chosen
    arbitrarily so that this negative value should not interfere with all other
    positive values. Then, AFTER splitting into train and CV sets, apply
    mean normalization (feature scaling) to numerical features columns."""

random_state = int(snakemake.wildcards.rs)

df_0 = pd.read_csv(snakemake.input.feature_df, sep='\t')

# Fill NaN with -5
df_0 = df_0.fillna(-5)


# First, shuffle the df
df = df_0.copy()
df = shuffle(df, random_state=random_state)

# Keep only top 4 features
X = df.drop('label', axis=1)
X = X[['gene_id_sno', 'combined_box_hamming', 'sno_mfe', 'terminal_stem_mfe', 'host_expressed']]
y = df['label']


# Split the X dataset into CV and train sets (respectively 10% and 90% of the total df)
X_train, X_cv, y_train, y_cv = train_test_split(X, y,
                            test_size=0.1, train_size=0.9, random_state=random_state,
                            stratify=y)
y_cv.to_csv(snakemake.output.y_cv, index=False, sep='\t')
y_train.to_csv(snakemake.output.y_train, index=False, sep='\t')

# Scale feature values using mean normalization for numerical value columns
dfs = [X_cv.set_index('gene_id_sno'), X_train.set_index('gene_id_sno')]
output = [snakemake.output.cv, snakemake.output.train]
for j, df in enumerate(dfs):
    scaler = StandardScaler().fit(df)
    scaled_df_array = scaler.transform(df)
    scaled_df = pd.DataFrame(scaled_df_array, index=df.index, columns=df.columns)
    scaled_df = scaled_df.reset_index()
    scaled_df.to_csv(output[j], index=False, sep='\t')
import pandas as pd
from pybedtools import BedTool
import subprocess as sp

""" Determine the intersection between a bed file of all snoRNAs and a bedGraph
    of the phastCons conservation score for each nucleotide in the human genome.
    Then compute the average conservation score per snoRNA (over all the nt
    composing the snoRNA that have a conservation score associated)."""

col = ['chr', 'start', 'end', 'gene_id', 'dot', 'strand', 'source', 'feature',
        'dot2', 'gene_info']
sno_bed = pd.read_csv(snakemake.input.sno_bed, sep='\t', names=col)  # generated with gtf_to_bed

# Get bed of 100 nt upstream of snoRNA instead of snoRNA itself (account for strandness)
upstream = sno_bed.copy()
upstream['start_up'] = upstream['start'] - 1
upstream.loc[upstream['strand'] == '+', 'start'] = upstream['start_up'] - 100
upstream.loc[upstream['strand'] == '+', 'end'] = upstream['start_up']

upstream['end_up'] = upstream['end'] + 1
upstream.loc[upstream['strand'] == '-', 'end'] = upstream['end_up'] + 100
upstream.loc[upstream['strand'] == '-', 'start'] = upstream['end_up']
upstream = upstream[col]
upstream.to_csv('upstream_temp.bed', sep='\t', index=False, header=False)

def intersection(sno_bed, phastcons_bedgraph, output_path):
    """ Get the intersection between the snoRNA bed file and the sorted
        conservation bedgraph file."""
    a = BedTool(sno_bed)
    intersection = a.intersect(phastcons_bedgraph, wb=True, sorted=True).saveas(output_path)

def average_score(intersect_df_path, sno_bed_df, output_path):
    """Get the average conservation score per snoRNA."""
    intersect_df = pd.read_csv(intersect_df_path, sep='\t', names=['chr','start',
                                'end', 'gene_id', 'dot', 'strand', 'source', 'feature',
                                'dot2', 'gene_info', 'chr_bg', 'start_bg', 'end_bg', 'conservation_score'])

    # Group every line (nucleotide) corresponding to a snoRNA and get the average conservation score across all these nt
    sno_nt = intersect_df.groupby(['gene_id'])['conservation_score'].mean()

    # Some snoRNAs are composed of nt that are not present in the conservation
    # bedgraph (only 2 snoRNAs don't have conservation scores at all; other snoRNAs might have a few nt missing a
    # conservation score, but these null values are not considered in the average
    # since they are not present in the intersect_df). For the 2 snoRNAs missing conservation score, we consider it as a score of 0
    sno_bed_df = sno_bed_df[['gene_id']]

    final_df = sno_bed_df.merge(sno_nt, how='left', left_on='gene_id', right_on='gene_id')
    final_df = final_df.fillna(0)
    final_df.to_csv(output_path, index=False, sep='\t')

def main(sno_bed_path, phastcons_bedgraph_path, output_path_intersection,
            sno_bed_df, output_path_final_df):
    intersect_df = intersection(sno_bed_path, phastcons_bedgraph_path, output_path_intersection)
    final_conservation_df = average_score(output_path_intersection, sno_bed_df, output_path_final_df)



# Get snoRNA average conservation across 100 vertebrates
main(snakemake.input.sno_bed, snakemake.input.phastcons_bg,
    snakemake.output.intersection_sno_conservation, sno_bed, snakemake.output.sno_conservation)

# Get average conservation of the 100 nt upstream of the snoRNAs (~promoter region)
main('upstream_temp.bed', snakemake.input.phastcons_bg,
    snakemake.output.intersection_upstream_sno_conservation, upstream, snakemake.output.upstream_sno_conservation)

sp.call('rm upstream_temp.bed', shell=True)
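A toy check (made-up coordinates) of the strand-aware upstream-window arithmetic above: for a + strand snoRNA the 100-nt window ends just before the annotated start, and for a - strand snoRNA it begins just after the annotated end.

import pandas as pd

toy = pd.DataFrame({'start': [1000, 1000], 'end': [1100, 1100], 'strand': ['+', '-']})
up = toy.copy()
up['start_up'] = up['start'] - 1
up.loc[up['strand'] == '+', 'start'] = up['start_up'] - 100
up.loc[up['strand'] == '+', 'end'] = up['start_up']
up['end_up'] = up['end'] + 1
up.loc[up['strand'] == '-', 'end'] = up['end_up'] + 100
up.loc[up['strand'] == '-', 'start'] = up['end_up']
print(up[['start', 'end', 'strand']])  # + strand: 899-999 ; - strand: 1101-1201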
import pandas as pd

""" Drop NA and duplicates from snodb table and format the rest of the columns.
    Use the final host gene list v101 to determine snoRNA host gene biotype,
    function, name, chr, start and end. Add also information about NMD
    susceptibility of host genes and presence of dual-initiation (DI) promoters
    within host genes. Of note, this table contains all snoRNAs as well as some
    scaRNAs found in the snoDB database."""

snodb_original = pd.read_csv(snakemake.input.snodb_original_table, sep='\t')
hg_df = pd.read_csv(snakemake.input.host_gene_df)
nmd = pd.read_csv(snakemake.input.nmd)
di_promoter = pd.read_csv(snakemake.input.di_promoter)
hg_functions = pd.read_csv(snakemake.input.host_functions, sep='\t')


# Remove NA lines and duplicates
snodb = snodb_original.dropna(subset=['gene_id_annot2020', 'gene_name_annot2020'])
snodb = snodb.drop_duplicates(subset=['gene_id_annot2020', 'gene_name_annot2020'])


# Drop unwanted columns and rename other columns
snodb = snodb[['gene_id_annot2020', 'gene_name_annot2020', 'box type',
                'target summary', 'seq']]
snodb.columns = ['gene_id_sno', 'gene_name_sno', 'sno_type', 'sno_target_old', 'seq']


# Format the sno_target column
rRNA = ['Others, rRNA', 'rRNA', 'Others, rRNA, snRNA', 'rRNA, snRNA']
snRNA = ['Others, snRNA', 'snRNA']
snodb.loc[snodb['sno_target_old'].isin(rRNA), 'sno_target'] = 'rRNA'
snodb.loc[snodb['sno_target_old'].isin(snRNA), 'sno_target'] = 'snRNA'
snodb.loc[snodb['sno_target_old'] == 'Others', 'sno_target'] = 'Orphan'
snodb['sno_target'] = snodb['sno_target'].fillna('Orphan')


# Add host gene info (id, name, chr (seqname), strand, start, end, biotype)
snodb = snodb.merge(hg_df, how='left', left_on='gene_id_sno', right_on='sno_id')

non_coding = ['lncRNA', 'transcribed_unprocessed_pseudogene', 'unprocessed_pseudogene',
                'transcribed_unitary_pseudogene', 'TEC', 'transcribed_processed_pseudogene',
                'processed_pseudogene']
snodb.loc[snodb['host_biotype'] == 'protein_coding', 'host_biotype2'] = 'protein_coding'
snodb.loc[snodb['host_biotype'].isin(non_coding), 'host_biotype2'] = 'non_coding'
snodb['host_biotype2'] = snodb['host_biotype2'].fillna('intergenic')


# Add NMD and DI promoter info for intronic snoRNAs
nmd_substrates = list(pd.unique(nmd['gs']))
di_promoter_host = list(pd.unique(di_promoter['gene_id']))
snodb.loc[snodb['host_name'].isin(nmd_substrates), 'NMD_susceptibility'] = True
snodb.loc[(~snodb['host_name'].isin(nmd_substrates)) &
        (~snodb['host_name'].isnull()), 'NMD_susceptibility'] = False
snodb.loc[snodb['host_name'].isnull(), 'NMD_susceptibility'] = "intergenic"

snodb.loc[snodb['host_id'].isin(di_promoter_host), 'di_promoter'] = "dual_initiation"
snodb.loc[(~snodb['host_id'].isin(di_promoter_host)) &
        (~snodb['host_id'].isnull()), 'di_promoter'] = "simple_initiation"
snodb.loc[snodb['host_id'].isnull(), 'di_promoter'] = "intergenic"


# Add host_function column (combined for protein-coding and non-coding HGs)
snodb = snodb.merge(hg_functions, how='left', left_on='host_id', right_on='host_id')
snodb['host_function'] = snodb['host_function'].fillna('intergenic')

snodb.drop(columns=['sno_target_old', 'sno_id', 'host_biotype'], inplace=True)

snodb.to_csv(snakemake.output.snodb_formatted, index=False, sep='\t')
import pandas as pd
import subprocess as sp
from pybedtools import BedTool

""" Locate the host gene intron in which snoRNAs are located and return the
    intron number and distance to exons related to each snoRNA. We take the
    transcript with the highest number of exons per HG as the 'main transcript'.
    We process SNHG14 snoRNAs separately since their HG transcript with the
    highest number of exons doesn't include all of its embedded snoRNAs
    (instead we take the longest SNHG14 transcript from RefSeq so that it
    includes all SNHG14 snoRNAs)"""



col = ['chr', 'start', 'end', 'gene_id', 'dot', 'strand', 'source', 'feature', 'dot2', 'gene_info']
gtf_bed = pd.read_csv(snakemake.input.sorted_gtf_bed, sep='\t', names=col)  # generated with gtf_to_bed

sno_HG_coordinates = pd.read_csv(snakemake.input.sno_HG_coordinates)
sno_info = pd.read_csv(snakemake.input.sno_tpm_df)
sno_info = sno_info[~sno_info['host_id'].isna()]  # remove intergenic snoRNAs from this analysis
sno_info_wo_snhg14_snornas = sno_info[sno_info['host_name'] != 'SNHG14']  # Exclude SNHG14 snoRNAs here because the HG transcript with the most exons does not cover all of its embedded snoRNAs (they are handled separately with the longest RefSeq transcript)

sno_bed_wo_snhg14_path = snakemake.input.sno_bed_wo_snhg14  # generated with generate_snoRNA_beds.py
snhg14_bed_path = snakemake.input.snhg14_bed  # Obtained from refseq longest transcript so that it includes all SNHG14 snoRNAs
snhg14_sno_bed_path = snakemake.input.sno_snhg14_bed  # generated with generate_snoRNA_beds.py



def generate_hg_bed(gtf_bed, sno_info, output_path_bed, output_path_bed_col_split):
    """Iterate through a gtf file in a bed format to retrieve only the information related to host genes (hgs) and
        create a resulting hg bed file. sno_info is a df that gives the HG of interest (ex: all HGs vs all HGs except
        SNHG14)"""
    hgs = list(pd.unique(sno_info['host_id']))
    df = []
    for i, hg in enumerate(hgs):
        print(hg)
        temp_df = gtf_bed[gtf_bed['gene_info'].str.contains('"'+hg+'"')]
        df.append(temp_df)

    df_final = pd.concat(df)
    df_final.to_csv(output_path_bed, sep='\t', header=False, index=False)  # This is the bed (from gtf) of all host genes except SNHG14

    sp.call("""awk -i inplace -v OFS='\t' '$8=="exon"' """ + output_path_bed, shell=True)  # Keep only the exon features in the HG bed
    sp.call("""sed -i -E 's/ tag .*;"/"/g; s/ exon_id .*;./"/g' """ + output_path_bed, shell=True)  # remove exon_id and tag infos in gene_info column to simplify

    df_final = pd.read_csv(output_path_bed, sep='\t', names=['chr', 'start', 'end', 'gene_id', 'dot', 'strand',
                                                             'source', 'feature', 'dot2', 'gene_info'])

    # Split the gtf bed file into more readable columns
    hg_bed = df_final
    hg_bed[['empty1', 'gene_id2', 'empty2', 'gene_version', 'empty3', 'transcript_id', 'empty4', 'transcript_version',
            'empty5', 'exon_number', 'empty6', 'gene_name', 'empty7', 'gene_source', 'empty8', 'gene_biotype', 'empty9',
            'transcript_name', 'empty10', 'transcript_source',
            'empty11', 'transcript_biotype']] = hg_bed.gene_info.str.split(" ", expand=True).applymap(
        lambda x: x.replace('"', '')).applymap(lambda x: x.replace(';', ''))

    hg_bed = hg_bed.drop(axis=1,
                         labels=['gene_info', 'empty1', 'empty2', 'empty3', 'empty4', 'empty5', 'empty6', 'gene_source',
                                 'empty7', 'empty8', 'empty9', 'transcript_source', 'empty10', 'empty11',
                                 'gene_version', 'transcript_version'])

    hg_bed.to_csv(output_path_bed_col_split, sep='\t', header=False, index=False)

    print('Finished generate_hg_bed!')
    return hg_bed


def get_max_exon_transcript_per_hg(sno_HG_coordinates, hg_bed, output_path, sno_overlap_path):
    """ Sort HG transcripts per exon_number (in descending order) and by their
        name*** if multiple transcripts have the same number of exons (in ascending
        order since a ...-201 transcript is more present than a ...-205). Then
        iterate trough this ordered groupby object and get the first HG transcript
        that doesn't overlap with the snoRNA (if possible, but there are multiple
        snoRNA exceptions that overlap in all kinds of form with their HG; see all
        elifs below). Then regroup all these HG transcripts (1 per HG) in a bed file.
        ***Transcript name with '-201' are the most present transcript and then less
        and less present with the 201 increasing"""
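    # The hg_overlap labels assigned below are: '' (typical intronic snoRNA located
    # between two exons), 'sno_into_single_exon', 'sno_into_multi_exon',
    # 'sno_over_multi_exon_after' and 'sno_over_multi_exon_before'.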
    most_exon_transcripts = []
    groups = hg_bed.groupby('gene_id')

    # Drop SNHG14 (ENSG00000224078) snoRNAs
    sno_HG_coordinates = sno_HG_coordinates[sno_HG_coordinates['host_id'] != 'ENSG00000224078']
    sno_HG_coordinates = sno_HG_coordinates.set_index('sno_id')
    sno_dict = sno_HG_coordinates.to_dict('index')

    for sno_id, cols in sno_dict.items():
        sno_start, sno_end, host_id = sno_dict[sno_id]['sno_start'], sno_dict[sno_id]['sno_end'], sno_dict[sno_id]['host_id']
        host = groups.get_group(host_id)
        host_transcripts = host.groupby('transcript_id')
        temp = []
        for transcript_id, transcript in host_transcripts:  # Create an exon_max column to sort on afterwards
            exon_number = max(map(int, transcript['exon_number']))
            transcript.loc[:, 'exon_max'] = exon_number
            temp.append(transcript)
        host_df = pd.concat(temp)
        trans_max = host_df.sort_values(by=['exon_max', 'transcript_name'], ascending=[False, True])  # Sort from the highest to the lowest number of exons, then by transcript name in ascending order (transcript-201 before -202, etc.)
        for trans_id, trans_df in trans_max.groupby('transcript_id', sort=False):
            trans_df = trans_df.reset_index()
            l = ['ENSG00000261709', 'ENSG00000277947', 'NR_132981', 'NR_132980']
            if sno_id in l:  # This is to patch for the 4 snoRNAs that are encoded within the exon of a 1-exon lncRNA
                print(trans_id)
                gene = host[host['transcript_id'] == trans_id]
                gene.loc[:, 'sno'] = sno_id
                gene.loc[:, 'hg_overlap'] = 'sno_into_single_exon'
                most_exon_transcripts.append(gene)
                break
            for i in trans_df.index.values[:-1]:  # For all other snoRNAs
                if trans_df.loc[i, 'exon_max'] > 1:
                    exon1 = trans_df.iloc[i, :]
                    exon2 = trans_df.iloc[i+1, :]
                    if (sno_start > exon1['end']) & (sno_end < exon2['start']):  # Normal snoRNAs encoded between two exons
                        best_transcript_id = exon1['transcript_id']
                        gene = host[host['transcript_id'] == best_transcript_id]
                        gene.loc[:, 'sno'] = sno_id
                        gene.loc[:, 'hg_overlap'] = ''
                        most_exon_transcripts.append(gene)
                        break
                    elif (((sno_end < exon1['end']) & (sno_start > exon1['start'])) | ((sno_end < exon2['end']) & (sno_start > exon2['start']))):  # snoRNA overlaps with exon but doesn't extend before or after the exon
                        if sno_id in ['ENSG00000201672', 'ENSG00000206620', 'NR_145790', 'ENSG00000212293']:  # To patch for the 4 snoRNAs that overlap with the HG transcript with the most exons, but not with the HG transcript with the second-most exons
                            continue
                        elif sno_id not in ['ENSG00000275662', 'ENSG00000201672', 'ENSG00000206620', 'NR_145790', 'ENSG00000212293']:
                            best_transcript_id = exon1['transcript_id']
                            gene = host[host['transcript_id'] == best_transcript_id]
                            gene.loc[:, 'sno'] = sno_id
                            gene.loc[:, 'hg_overlap'] = 'sno_into_multi_exon'
                            most_exon_transcripts.append(gene)
                            break
                    elif (((sno_start < exon1['end']) & (sno_end > exon1['end'])) | ((sno_start < exon2['end']) & (sno_end > exon2['end']))): # snoRNA overlaps with exon and extends after the exon
                        if sno_id in ['ENSG00000207145', 'ENSG00000207297', 'ENSG00000274309']:  # To patch for the 3 snoRNAs that overlap with the HG transcript with the most exons, but not with the HG transcript with the second-most exons
                            continue
                        elif sno_id not in ['ENSG00000207145', 'ENSG00000207297', 'ENSG00000274309']:
                            best_transcript_id = exon1['transcript_id']
                            gene = host[host['transcript_id'] == best_transcript_id]
                            gene.loc[:, 'sno'] = sno_id
                            gene.loc[:, 'hg_overlap'] = 'sno_over_multi_exon_after'
                            most_exon_transcripts.append(gene)
                            break
                    elif (((sno_start < exon1['start']) & (sno_end > exon1['start'])) | ((sno_start < exon2['start']) & (sno_end > exon2['start']))): # snoRNA overlaps with exon and extends before the exon
                        if sno_id == 'ENSG00000275662':  # To patch for this snoRNA that overlaps with the HG transcript with the most exons, but not with the HG transcript with the second-most exons
                            continue
                        elif sno_id != 'ENSG00000275662':
                            best_transcript_id = exon1['transcript_id']
                            gene = host[host['transcript_id'] == best_transcript_id]
                            gene.loc[:, 'sno'] = sno_id
                            gene.loc[:, 'hg_overlap'] = 'sno_over_multi_exon_before'
                            most_exon_transcripts.append(gene)
                            break
            else:  # If the inner loop is not broken, continue to the next transcript
                continue
            break  # If the inner loop is broken, then break the outer loop


    hg_simple = pd.concat(most_exon_transcripts)
    sno_overlap = hg_simple[['sno', 'strand', 'hg_overlap', 'gene_id', 'transcript_id', 'transcript_name']].drop_duplicates()
    sno_overlap.to_csv(sno_overlap_path, index=False, sep='\t')
    hg_simple = hg_simple.loc[:, hg_simple.columns != 'hg_overlap']
    hg_simple = hg_simple.loc[:, hg_simple.columns != 'sno']

    hg_simple.to_csv(output_path, sep='\t', header=False, index=False)

    sp.call('sort -k1,1 -k2,2n -k3,17 -o '+output_path+' '+output_path, shell=True) #sort the output file by chr and start
    sp.call("""awk -i inplace -v OFS='\t' '$1="chr"$1' """+output_path, shell=True) #add "chr" in front of first column

    print('Finished get_max_exon_transcript_per_hg!')

    return hg_simple, sno_overlap


def get_exon_number_per_hg(hg_bed_file):
    """ Extract the number of exons per HG (from the transcripts chosen with
        get_max_exon_transcript_per_hg())."""

    cols = ['chr', 'start', 'end', 'gene_id', 'dot', 'strand', 'source', 'feature', 'dot', 'gene_id2', 'transcript_id',
            'exon_number', 'gene_name', 'biotype', 'transcript_name', 'transcript_biotype']

    hg_bed_file.columns = cols
    exon_number_df = hg_bed_file.groupby('transcript_id')

    rows = []
    for transcript_id, group in hg_bed_file.groupby('transcript_id'):
        max_nb = max(map(int, group['exon_number']))
        row = [transcript_id, max_nb]
        rows.append(row)
    max_nb_df = pd.DataFrame(rows, columns = ['transcript_id', 'exon_number_per_hg'])

    print('Finished get_exon_number_per_hg!')

    return max_nb_df


def get_up_downstream_exons(hg_bed_file_path, snoRNA_bed_file_path, output_path, sno_overlap_df):
    """Create two bed files giving the number of the exon located either upstream or downstream of snoRNAs and the
        distance to that exon using bedtools closest (ex: upstream exon number 4 means that the snoRNA
        is located in the 4th intron of its HG; downstream exon number 8 means that the snoRNA is located in the 7th
        intron of the HG. Since snoRNAs in the same HG have not the same corresponding HG transcript, we need to use
        bedtools closest for each snoRNa and its corresponding HG transcript (given by the sno_overlap_df)."""
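    # bedtools closest flags used below: io = ignore overlapping features,
    # id = ignore features downstream of the snoRNA (keeps the upstream exon),
    # iu = ignore features upstream of the snoRNA (keeps the downstream exon),
    # t="first" = keep the first hit on ties, D="a" = report the signed distance
    # relative to the snoRNA (feature A).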

    sno_bed = pd.read_csv(snoRNA_bed_file_path, sep='\t', names=['chr', 'start', 'end', 'sno_id', 'dot', 'strand', 'source', 'feature', 'dot2', 'gene_info'])
    hg_bed = pd.read_csv(hg_bed_file_path, sep='\t', names=['chr', 'start', 'end', 'host_id', 'dot', 'strand', 'source', 'feature', 'dot2', 'gene_id2', 'transcript_id', 'exon_number', 'host_name', 'biotype', 'transcript_name', 'transcript_biotype'])
    sno_overlap = sno_overlap_df.set_index('sno')
    sno_overlap_dict = sno_overlap.to_dict('index')

    for i, sno in enumerate(list(pd.unique(sno_bed.loc[:, 'sno_id']))):
        print(sno)
        temp_sno_df = sno_bed[sno_bed['sno_id'] == sno]
        temp_sno_df.to_csv(snoRNA_bed_file_path+'_temp.bed', sep='\t', header=False, index=False)
        a = BedTool(snoRNA_bed_file_path+'_temp.bed')

        hg_transcript_id = sno_overlap_dict[sno]['transcript_id']
        temp_hg_df = hg_bed[hg_bed['transcript_id'] == hg_transcript_id]
        temp_hg_df.to_csv(hg_bed_file_path+'_temp.bed', sep='\t', header=False, index=False)

        upstream = a.closest(hg_bed_file_path+'_temp.bed', t="first", io=True, id=True, D="a").saveas(output_path + 'exon_upstream_of_sno_'+str(i)+'_TEMP')
        downstream = a.closest(hg_bed_file_path+'_temp.bed', t="first", io=True, iu=True, D="a").saveas(output_path + 'exon_downstream_of_sno_'+str(i)+'_TEMP2')

    sp.call("cat "+output_path+"*_TEMP > "+output_path+"exon_upstream_of_sno.bed && rm "+output_path+"*_TEMP", shell=True)
    sp.call("cat "+output_path+"*_TEMP2 > "+output_path+"exon_downstream_of_sno.bed && rm "+output_path+"*_TEMP2 && rm "+hg_bed_file_path+"*_temp.bed", shell=True)


    print('Finished get_up_downstream_exons!')


def get_up_downstream_exons_snhg14(hg_bed_file_path, snoRNA_bed_file_path, output_path):
    """Create two bed files giving the number of the exon located either upstream or downstream of snoRNAs within SNHG14 HG and the
        distance to that exon using bedtools closest (ex: upstream exon number 4 means that the snoRNA
        is located in the 4th intron of its HG; downstream exon number 8 means that the snoRNA is located in the 7th
        intron of the HG"""

    a = BedTool(snoRNA_bed_file_path)
    upstream = a.closest(hg_bed_file_path, t="first", io=True, id=True, D="a").saveas(output_path + 'exon_upstream_of_sno.bed')
    downstream = a.closest(hg_bed_file_path, t="first", io=True, iu=True, D="a").saveas(output_path + 'exon_downstream_of_sno.bed')

    print('Finished get_up_downstream_exons_snhg14!')


def get_intron_number_and_distances(path_to_exon_files, df_total_nb_exons_per_hg, output_file_path, sno_overlap_df):
    """Extract the intron number in which a snoRNA is located (i.e. the exon number of the upstream exon) and the
        distance to the upstream and downstream exons. Extract also from other df the number of exons per HG."""
    cols = ['chr_sno', 'start_sno', 'end_sno', 'gene_id_sno', 'dot', 'strand_sno', 'source_sno', 'feature_sno', 'dot2', 'gene_info_sno', 'chr_host', 'start_host',
            'end_host', 'gene_id_host', 'dot3', 'strand_host', 'source_host', 'feature_host', 'dot4', 'gene_id2_host',
            'transcript_id_host', 'intron_number', 'gene_name_host', 'biotype_host', 'transcript_name_host',
            'transcript_biotype_host']
    upstream = pd.read_csv(path_to_exon_files+'/exon_upstream_of_sno_2.bed', sep='\t', names=cols+['distance_upstream_exon'])
    downstream = pd.read_csv(path_to_exon_files+'/exon_downstream_of_sno_2.bed', sep='\t', names=cols+['distance_downstream_exon'])
    upstream, downstream = upstream.reset_index(), downstream.reset_index()

    # Remove minus in front of distances
    upstream['distance_upstream_exon'] = abs(upstream['distance_upstream_exon'])
    downstream['distance_downstream_exon'] = abs(downstream['distance_downstream_exon'])

    # Merge dfs to keep relevant info only
    final_df = upstream[['gene_id_sno', 'start_sno', 'end_sno', 'strand_sno', 'gene_id_host', 'transcript_id_host', 'intron_number', 'distance_upstream_exon']].merge(
        downstream[['gene_id_sno', 'start_sno', 'distance_downstream_exon']], how='left',
        left_on=['gene_id_sno', 'start_sno'], right_on=['gene_id_sno', 'start_sno'])

    # For snoRNAs that overlap with their HG, correct their distance to upstream and downstream exon in the df
    sno_into_single_exon = sno_overlap_df[sno_overlap_df['hg_overlap'] == 'sno_into_single_exon'].set_index('sno')
    sno_into_multi_exon = sno_overlap_df[sno_overlap_df['hg_overlap'] == 'sno_into_multi_exon'].set_index('sno')
    sno_over_multi_exon_after = sno_overlap_df[sno_overlap_df['hg_overlap'] == 'sno_over_multi_exon_after'].set_index('sno')
    sno_over_multi_exon_before = sno_overlap_df[sno_overlap_df['hg_overlap'] == 'sno_over_multi_exon_before'].set_index('sno')

    into_single, into_multi = sno_into_single_exon.to_dict('index'), sno_into_multi_exon.to_dict('index')
    over_after, over_before = sno_over_multi_exon_after.to_dict('index'), sno_over_multi_exon_before.to_dict('index')

    for d in [into_single, into_multi, over_after, over_before]:  # iterate over the dictionaries of different types of overlapping snoRNA
        for sno, cols in d.items():
            final_df.loc[(final_df.gene_id_sno == sno), 'gene_id_host'] = cols['gene_id']
            final_df.loc[(final_df.gene_id_sno == sno), 'transcript_id_host'] = cols['transcript_id']
            final_df.loc[(final_df.gene_id_sno == sno), 'intron_number'] = 0
            if cols['hg_overlap'] in ['sno_into_single_exon', 'sno_into_multi_exon']:  # if the sno is fully contained within an exon (single- or multi-exon HG), set the distances to the upstream/downstream exons to 0
                final_df.loc[(final_df.gene_id_sno == sno), 'distance_upstream_exon'] = 0
                final_df.loc[(final_df.gene_id_sno == sno), 'distance_downstream_exon'] = 0
            elif cols['hg_overlap'] == 'sno_over_multi_exon_after':  # if sno overlaps and extends after exon
                if cols['strand'] == "-":  # if sno on minus (-) strand, only set the distance to downstream exon to 0
                    final_df.loc[(final_df.gene_id_sno == sno), 'distance_downstream_exon'] = 0
                else:  # if sno on plus (+) strand, only set the distance to upstream exon to 0
                    final_df.loc[(final_df.gene_id_sno == sno), 'distance_upstream_exon'] = 0
            elif cols['hg_overlap'] == 'sno_over_multi_exon_before':  # if sno overlaps and extends before exon, only set the distance to downstream exon to 0
                final_df.loc[(final_df.gene_id_sno == sno), 'distance_downstream_exon'] = 0


    # Add exon number per HG for each sno
    final_df = final_df.merge(df_total_nb_exons_per_hg, how='left', left_on='transcript_id_host', right_on='transcript_id')

    # Add intron length; for snoRNAs that overlap entirely with their HG, correct intron_length for 0
    final_df['intron_length'] = final_df['end_sno'] - final_df['start_sno'] + 1 + final_df['distance_upstream_exon'] + final_df['distance_downstream_exon']
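    # e.g. (hypothetical values) a 100-nt snoRNA lying 500 nt from the upstream exon
    # and 400 nt from the downstream exon sits in a 1000-nt intron (100 + 500 + 400).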
    final_df.loc[(final_df['intron_number'] == 0) & (final_df['distance_upstream_exon'] == 0) & (final_df['distance_downstream_exon'] == 0), 'intron_length'] = 0
    final_df.to_csv(output_file_path, sep='\t', index=False)

    print('Finished get_intron_number_and_distances!')



def main(gtf_bed_file, sno_info_df, output_path, sno_HG_coordinates, sno_bed_wo_snhg14_path, snhg14_bed_path, snhg14_sno_bed_path, output_table_path):

    # Create the output directory
    sp.call("""mkdir -p """ + output_path, shell=True)

    # Generate bed file of HG transcripts except for SNHG14
    hg_bed_wo_snhg14 = generate_hg_bed(gtf_bed_file, sno_info_df,
                                        output_path+'/hg_wo_snhg14.bed',
                                        output_path+'/hg_split_wo_snhg14.bed')

    # Generate from the hg_bed file a simpler bed with only one transcript per HG (the one with the most exons)
    hg_simple, sno_overlap = get_max_exon_transcript_per_hg(sno_HG_coordinates, hg_bed_wo_snhg14, output_path+'/hg_simple_wo_snhg14_sorted.bed', output_path+'/sno_overlap_hg.tsv')

    # Get the maximal number of exons per HG
    exon_number_per_hg = get_exon_number_per_hg(hg_simple)

    # Append SNHG14 row to exon_number_per_hg df
    snhg14_df_dict = {'transcript_id':'NR_146177.1', 'exon_number_per_hg':148}  # To patch for SNHG14 long transcript that has 148 exons
    snhg14_df = pd.DataFrame(snhg14_df_dict, index=[0])
    exon_number_per_hg = exon_number_per_hg.append(snhg14_df, ignore_index=True)

    # Get the exon upstream and downstream of each snoRNA (either in all HG except SNHG14 or the SNHG14 HG)
    get_up_downstream_exons(output_path+'/hg_simple_wo_snhg14_sorted.bed', sno_bed_wo_snhg14_path, output_path+'/wo_snhg14_', sno_overlap)

    get_up_downstream_exons_snhg14(snhg14_bed_path, snhg14_sno_bed_path, output_path+'/snhg14_')

    # Concat the dfs wo and with snhg14 together
    sp.call('cat '+output_path+'/wo_snhg14_exon_downstream_of_sno.bed '+output_path+'/snhg14_exon_downstream_of_sno.bed >'+output_path+'/exon_downstream_of_sno_2.bed', shell=True)
    sp.call('cat '+output_path+'/wo_snhg14_exon_upstream_of_sno.bed '+output_path+'/snhg14_exon_upstream_of_sno.bed >'+output_path+'/exon_upstream_of_sno_2.bed', shell=True)

    get_intron_number_and_distances(output_path, exon_number_per_hg, output_table_path, sno_overlap)

    print('Main script ran successfully!')



main(gtf_bed, sno_info_wo_snhg14_snornas, snakemake.params.sno_location_exon_dir, sno_HG_coordinates,
    sno_bed_wo_snhg14_path, snhg14_bed_path, snhg14_sno_bed_path, snakemake.output.output_table)
import pandas as pd
import collections as coll

""" Find the snoRNAs that found in at leat 1 test set (across the 10
    iterations) and those that are never found in the test set."""

all_sno_df = pd.read_csv(snakemake.input.all_sno_df, sep='\t')
test_set_paths = snakemake.input.test_sets
test_sets = []
for path in test_set_paths:
    df = pd.read_csv(path, sep='\t')
    test_sets.append(df)

# Count the number of times a snoRNA is present across the 10 iterations
concat_df = pd.concat(test_sets)
sno_occurence_in_test_sets = {}
for sno_id in list(all_sno_df['gene_id_sno']):
    sno_in_test_sets = len(concat_df[concat_df['gene_id_sno'] == sno_id])
    sno_occurence_in_test_sets[sno_id] = sno_in_test_sets
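# Equivalent count (illustrative only; not used for the output below) with the
# collections module imported above as coll:
#   counts = coll.Counter(concat_df['gene_id_sno'])
#   sno_occurence_in_test_sets = {sno: counts.get(sno, 0) for sno in all_sno_df['gene_id_sno']}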

final_df = pd.DataFrame(sno_occurence_in_test_sets.items(), columns=['gene_id_sno', 'nb_occurences_in_test_sets'])
final_df.to_csv(snakemake.output.sno_presence_test_sets, index=False, sep='\t')
import pandas as pd
import re
""" Get the terminal stem length score of snoRNAs, if they have a realistic
    terminal stem. The score is equal to the intermolecular paired nt between
    the left and right flanking regions of snoRNAs minus the number of nt gaps
    within the stem."""
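# Worked example (hypothetical 20-nt left flanking region in dot-bracket notation):
# '((((..((((..........' contains 8 '(' (paired_base), 0 ')' (intramolecular_paired)
# and one 2-nt gap between paired stretches, giving a score of 8 - 0 - 2 = 6.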

dot_bracket_mfe_fasta = snakemake.input.rna_cofold

sno_id = ['']
terminal_stem = {}
with open(dot_bracket_mfe_fasta, 'r') as file:
    for line in file:
        if line.startswith('>'):  # sno_id lines
            sno_id_clean = str(line)
            sno_id[0] = sno_id_clean[1:].replace('\n', '')
        elif '(' in line:  #dot_bracket lines
            dot_bracket = str(line)
            dot_bracket = dot_bracket[0:20].replace('\n', '')  # we select only the left flanking region (20 nt)
            paired_base = dot_bracket.count('(')
            intramolecular_paired = dot_bracket.count(')')  # these ')' are intramolecular paired nt
                                                            # (i.e. nt pair within left sequence only)
            # This finds all overlapping (and non-overlapping) gaps of 1 to 19 nt inside the left flanking region
            gaps = re.findall(r'(?=(\(\.{1,19}\())', dot_bracket)
            number_gaps = ''.join(gaps)  # join all gaps together in one string
            number_gaps = len(re.findall(r'\.', number_gaps))  # count the number of nt gaps in sequence
            #print(sno_id[0])
            #print(dot_bracket)
            #print(gaps)
            #print(number_gaps)
            stem_length_score = paired_base - intramolecular_paired - number_gaps
            if stem_length_score < 0:
                stem_length_score = 0
            terminal_stem[sno_id[0]] = stem_length_score

df = pd.DataFrame.from_dict(terminal_stem, orient='index').reset_index()
df.columns = ['gene_id_sno', 'terminal_stem_length_score']

df.to_csv(snakemake.output.length_stem, index=False, sep='\t')
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pickle

""" Test each model performance on unseen test data and report their accuracy."""

# Generate the same CV, training and test sets (only the test set will be
# used in this script) that were generated in hyperparameter_tuning_cv and train_models
# (respectively 15%, 70% and 15% of all dataset examples)
df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next the total_train is split into train and test sets (1077 and 232 correspond
# to the number of examples in the train and test sets respectively, i.e.
# approximately 70 % and 15 % of all examples in these two datasets)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train,
                                    test_size=232, train_size=1077, random_state=42,
                                    stratify=y_total_train)


# Unpickle and thus instantiate the trained model defined by the 'models' wildcard
model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))

# Predict the label (expressed or not_expressed) on test data and compare to y_test
y_pred = model.predict(X_test)
print(snakemake.wildcards.models2)
print(metrics.accuracy_score(y_test, y_pred))
acc = {}
acc[snakemake.wildcards.models2+'_test_accuracy'] = metrics.accuracy_score(y_test, y_pred)
acc_df = pd.DataFrame(acc, index=[0])
acc_df.to_csv(snakemake.output.test_accuracy, sep='\t', index=False)
import pandas as pd
from sklearn import metrics
from sklearn.metrics import roc_curve
from sklearn.linear_model import LogisticRegression
import pickle
import numpy as np

""" Test each model performance on unseen test data and report their accuracy."""
X_train = pd.read_csv(snakemake.input.X_train, sep='\t', index_col='gene_id_sno')
y_train = pd.read_csv(snakemake.input.y_train, sep='\t')
X_test = pd.read_csv(snakemake.input.X_test[0], sep='\t', index_col='gene_id_sno')
y_test = pd.read_csv(snakemake.input.y_test[0], sep='\t')

# Get best hyperparameters per model
hyperparams_df = pd.read_csv(snakemake.input.best_hyperparameters[0], sep='\t')
hyperparams_df = hyperparams_df.drop('accuracy_cv', axis=1)

def df_to_params(df):
    """ Convert a one-line dataframe into a dict of params and their value. The
        column name corresponds to the key and the value corresponds to
        the value of that param (ex: 'max_depth': 2 where max_depth was the column
        name and 2 was the value in the df)."""
    cols = list(df.columns)
    params = {}
    for col in cols:
        value = df.loc[0, col]
        params[col] = value
    return params
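# For example (hypothetical values): a one-row df with columns ['C', 'solver'] and
# values [1.0, 'liblinear'] becomes {'C': 1.0, 'solver': 'liblinear'}.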

hyperparams = df_to_params(hyperparams_df)




# Define a new class of LogisticRegression in which we can choose the log_reg threshold used to predict
class LogisticRegressionWithThreshold(LogisticRegression):
    def predict(self, X, threshold=None):
        if threshold is None:  # If no threshold passed in, simply call the base class predict, effectively threshold=0.5
            return LogisticRegression.predict(self, X)
        else:
            y_scores = LogisticRegression.predict_proba(self, X)[:, 1]
            y_pred_with_threshold = (y_scores >= threshold).astype(int)

            return y_pred_with_threshold

    def threshold_from_optimal_tpr_minus_fpr(self, X, y):
        # Find optimal log_reg threshold where we maximize the True positive rate (TPR) and minimize the False positive rate (FPR)
        y_scores = LogisticRegression.predict_proba(self, X)[:, 1]
        fpr, tpr, thresholds = roc_curve(y, y_scores)

        optimal_idx = np.argmax(tpr - fpr)
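        # Maximizing TPR - FPR corresponds to Youden's J statistic; the returned
        # threshold is the predicted-probability cutoff at that optimum.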

        return thresholds[optimal_idx], tpr[optimal_idx] - fpr[optimal_idx]


# Instantiate that new class and fit the parameters (train on training set)
lrt = LogisticRegressionWithThreshold(C=hyperparams['C'], solver=hyperparams['solver'],
                                random_state=42, max_iter=500, penalty='l2')  # l2 is the default regularization
lrt.fit(X_train, y_train.values.ravel())

# Pickle the model as a .sav file ('wb' for write in binary)
pickle.dump(lrt, open(snakemake.output.pickled_trained_model, 'wb'))

# Find optimal threshold and predict using that threshold instead of 0.5
threshold, optimal_tpr_minus_fpr = lrt.threshold_from_optimal_tpr_minus_fpr(X_test, y_test)
print('Optimal threshold and tpr-fpr:')
print(threshold, optimal_tpr_minus_fpr)
y_pred_thresh = lrt.predict(X_test, threshold)
print('Accuracy:')
print(metrics.accuracy_score(y_test, y_pred_thresh))

# Save training accuracy into df
accu = {}
accu['log_reg_thresh_training_accuracy'] = lrt.score(X_train, y_train.values.ravel())
accu_df = pd.DataFrame(accu, index=[0])
accu_df.to_csv(snakemake.output.training_accuracy, sep='\t', index=False)

# Save test accuracy into df
acc = {}
acc['log_reg_thresh_test_accuracy'] = metrics.accuracy_score(y_test, y_pred_thresh)
acc_df = pd.DataFrame(acc, index=[0])
acc_df.to_csv(snakemake.output.test_accuracy, sep='\t', index=False)

# Save threshold used for prediction in df
t = {}
t['log_reg_threshold'] = threshold
thresh_df = pd.DataFrame(t, index=[0])
thresh_df.to_csv(snakemake.output.threshold, sep='\t', index=False)
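
A downstream script could reload the pickled model together with the saved threshold; a minimal sketch, assuming hypothetical file paths and a feature table with the same columns as X_train:

import pickle
import pandas as pd

# Note: unpickling requires the LogisticRegressionWithThreshold class to be
# defined or importable in the loading script.
lrt = pickle.load(open('log_reg_thresh.sav', 'rb'))  # hypothetical path
threshold = pd.read_csv('log_reg_threshold.tsv', sep='\t').loc[0, 'log_reg_threshold']  # hypothetical path
X_new = pd.read_csv('new_features.tsv', sep='\t', index_col='gene_id_sno')  # hypothetical feature table
y_pred_new = lrt.predict(X_new, threshold)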
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pickle

""" Test each model performance on unseen test data and report their accuracy."""

X_test = pd.read_csv(snakemake.input.X_test, sep='\t', index_col='gene_id_sno')
y_test = pd.read_csv(snakemake.input.y_test, sep='\t')

# Unpickle and thus instantiate the trained model defined by the 'models' wildcard
model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))

# Predict the label (expressed or not_expressed) on test data and compare to y_test
y_pred = model.predict(X_test)
print(snakemake.wildcards.models2)
print(metrics.accuracy_score(y_test, y_pred))
acc = {}
acc[snakemake.wildcards.models2+'_test_accuracy'] = metrics.accuracy_score(y_test, y_pred)
acc_df = pd.DataFrame(acc, index=[0])
acc_df.to_csv(snakemake.output.test_accuracy, sep='\t', index=False)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pickle

""" Test each model performance on unseen test data and report their accuracy."""

# Generate the same CV, training and test sets (only the test set will be
# used in this script) that were generated in hyperparameter_tuning_cv and train_models
# (respectively 15%, 70% and 15% of all dataset examples)
df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next the total_train is split into train and test sets (1017 and 180 correspond
# to the number of examples in the train and test sets respectively, i.e.
# approximately 70 % and 15 % of all examples in these two datasets)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train,
                                    test_size=180, train_size=1017, random_state=42,
                                    stratify=y_total_train)


# Unpickle and thus instantiate the trained model defined by the 'models' wildcard
model = pickle.load(open(snakemake.input.pickled_trained_model, 'rb'))

# Predict the label (expressed or not_expressed) on test data and compare to y_test
y_pred = model.predict(X_test)
print(snakemake.wildcards.models2)
print(metrics.accuracy_score(y_test, y_pred))
acc = {}
acc[snakemake.wildcards.models2+'_test_accuracy'] = metrics.accuracy_score(y_test, y_pred)
acc_df = pd.DataFrame(acc, index=[0])
acc_df.to_csv(snakemake.output.test_accuracy, sep='\t', index=False)
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)  # ignore all future warnings
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import pickle

""" Train (fit) each model on the training set using the best
    hyperparameters found by hyperparameter_tuning_cv. Pickle these fitted
    models (into .sav files) so that they can be reused after without the
    need to retrain them all over again."""

# Get best hyperparameters per model
hyperparams_df = pd.read_csv(snakemake.input.best_hyperparameters, sep='\t')
hyperparams_df = hyperparams_df.drop('accuracy_cv', axis=1)

def df_to_params(df):
    """ Convert a one-line dataframe into a dict of params and their value. The
        column name corresponds to the key and the value corresponds to
        the value of that param (ex: 'max_depth': 2 where max_depth was the column
        name and 2 was the value in the df)."""
    cols = list(df.columns)
    params = {}
    for col in cols:
        value = df.loc[0, col]
        params[col] = value
    return params

hyperparams = df_to_params(hyperparams_df)

# Generate the same CV, training and test sets (only the training set will be
# used in this script) that were generated in hyperparameter_tuning_cv
# (respectively 15%, 70% and 15% of all dataset examples)
df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next the total_train is split into train and test sets (1077 and 232 correspond
# to the number of examples in the train and test sets respectively, i.e.
# approximately 70 % and 15 % of all examples in these two datasets)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train,
                                    test_size=232, train_size=1077, random_state=42,
                                    stratify=y_total_train)


# Instantiate the model defined by the 'models' wildcard using the best hyperparameters
# specific to each model (log_reg, svc, rf, gbm, knn)
if snakemake.wildcards.models2 == "log_reg":
    model = LogisticRegression(C=hyperparams['C'], solver=hyperparams['solver'],
                                random_state=42, max_iter=500)
elif snakemake.wildcards.models2 == "svc":
    model = svm.SVC(C=hyperparams['C'], degree=hyperparams['degree'],
                    gamma=hyperparams['gamma'], kernel=hyperparams['kernel'],
                    random_state=42)
elif snakemake.wildcards.models2 == "rf":
    model = RandomForestClassifier(max_depth=hyperparams['max_depth'],
                min_samples_leaf=hyperparams['min_samples_leaf'],
                min_samples_split=hyperparams['min_samples_split'],
                n_estimators=hyperparams['n_estimators'], random_state=42)
elif snakemake.wildcards.models2 == "knn":
    model = KNeighborsClassifier(n_neighbors=hyperparams['n_neighbors'],
                weights=hyperparams['weights'],
                leaf_size=hyperparams['leaf_size'], p=hyperparams['p'])
else:
    model = GradientBoostingClassifier(loss=hyperparams['loss'],
                max_depth=hyperparams['max_depth'],
                min_samples_leaf=hyperparams['min_samples_leaf'],
                min_samples_split=hyperparams['min_samples_split'],
                n_estimators=hyperparams['n_estimators'], random_state=42)

# Train model and save training accuracy to df
model.fit(X_train, y_train)
print(snakemake.wildcards.models2)
print(model.score(X_train, y_train))

acc = {}
acc[snakemake.wildcards.models2+'_training_accuracy'] = model.score(X_train, y_train)
acc_df = pd.DataFrame(acc, index=[0])
acc_df.to_csv(snakemake.output.training_accuracy, sep='\t', index=False)

# Pickle the model as a .sav file ('wb' for write in binary)
pickle.dump(model, open(snakemake.output.pickled_trained_model, 'wb'))
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)  # ignore all future warnings
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import pickle

""" Train (fit) each model on the training set using the best
    hyperparameters found by hyperparameter_tuning_cv_scale_after_split. Pickle
    these fitted models (into .sav files) so that they can be reused after without the
    need to retrain them all over again."""

# Get best hyperparameters per model
hyperparams_df = pd.read_csv(snakemake.input.best_hyperparameters, sep='\t')
hyperparams_df = hyperparams_df.drop('accuracy_cv', axis=1)

def df_to_params(df):
    """ Convert a one-line dataframe into a dict of params and their value. The
        column name corresponds to the key and the value corresponds to
        the value of that param (ex: 'max_depth': 2 where max_depth was the column
        name and 2 was the value in the df)."""
    cols = list(df.columns)
    params = {}
    for col in cols:
        value = df.loc[0, col]
        params[col] = value
    return params

hyperparams = df_to_params(hyperparams_df)


# Get training set
X_train = pd.read_csv(snakemake.input.X_train, sep='\t', index_col='gene_id_sno')
y_train = pd.read_csv(snakemake.input.y_train, sep='\t')

# Instantiate the model defined by the 'models' wildcard using the best hyperparameters
# specific to each model (log_reg, svc, rf, gbm, knn)
if snakemake.wildcards.models2 == "log_reg":
    model = LogisticRegression(C=hyperparams['C'], solver=hyperparams['solver'],
                                random_state=42, max_iter=500)
elif snakemake.wildcards.models2 == "svc":
    model = svm.SVC(C=hyperparams['C'], degree=hyperparams['degree'],
                    gamma=hyperparams['gamma'], kernel=hyperparams['kernel'],
                    random_state=42)
elif snakemake.wildcards.models2 == "rf":
    model = RandomForestClassifier(max_depth=hyperparams['max_depth'],
                min_samples_leaf=hyperparams['min_samples_leaf'],
                min_samples_split=hyperparams['min_samples_split'],
                n_estimators=hyperparams['n_estimators'], random_state=42)
elif snakemake.wildcards.models2 == "knn":
    model = KNeighborsClassifier(n_neighbors=hyperparams['n_neighbors'],
                weights=hyperparams['weights'],
                leaf_size=hyperparams['leaf_size'], p=hyperparams['p'])
else:
    model = GradientBoostingClassifier(loss=hyperparams['loss'],
                max_depth=hyperparams['max_depth'],
                min_samples_leaf=hyperparams['min_samples_leaf'],
                min_samples_split=hyperparams['min_samples_split'],
                n_estimators=hyperparams['n_estimators'], random_state=42)

# Train model and save training accuracy to df
model.fit(X_train, y_train.values.ravel())
print(snakemake.wildcards.models2)
print(model.score(X_train, y_train.values.ravel()))

acc = {}
acc[snakemake.wildcards.models2+'_training_accuracy'] = model.score(X_train, y_train.values.ravel())
acc_df = pd.DataFrame(acc, index=[0])
acc_df.to_csv(snakemake.output.training_accuracy, sep='\t', index=False)

# Pickle the model as a .sav file ('wb' for write in binary)
pickle.dump(model, open(snakemake.output.pickled_trained_model, 'wb'))
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)  # ignore all future warnings
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import pickle

""" Train (fit) each model on the training set using the best
    hyperparameters found by hyperparameter_tuning_cv. Pickle these fitted
    models (into .sav files) so that they can be reused after without the
    need to retrain them all over again."""

# Get best hyperparameters per model
hyperparams_df = pd.read_csv(snakemake.input.best_hyperparameters, sep='\t')
hyperparams_df = hyperparams_df.drop('accuracy_cv', axis=1)

def df_to_params(df):
    """ Convert a one-line dataframe into a dict of params and their value. The
        column name corresponds to the key and the value corresponds to
        the value of that param (ex: 'max_depth': 2 where max_depth was the column
        name and 2 was the value in the df)."""
    cols = list(df.columns)
    params = {}
    for col in cols:
        value = df.loc[0, col]
        params[col] = value
    return params

hyperparams = df_to_params(hyperparams_df)

# Generate the same CV, training and test sets (only the training set will be
# used in this script) that were generated in hyperparameter_tuning_cv
# (respectively 15%, 70% and 15% of all dataset examples)
df = pd.read_csv(snakemake.input.df, sep='\t', index_col='gene_id_sno')
X = df.drop('label', axis=1)
y = df['label']

# First the CV vs total_train split
X_total_train, X_cv, y_total_train, y_cv = train_test_split(X, y, test_size=0.15,
                                            random_state=42, stratify=y)

# Next the total_train is split into train and test sets (1017 and 180 correspond
# to the number of examples in the train and test sets respectively, i.e.
# approximately 70 % and 15 % of all examples in these two datasets)
X_train, X_test, y_train, y_test = train_test_split(X_total_train, y_total_train,
                                    test_size=180, train_size=1017, random_state=42,
                                    stratify=y_total_train)


# Instantiate the model defined by the 'models' wildcard using the best hyperparameters
# specific to each model (log_reg, svc, rf, gbm, knn)
if snakemake.wildcards.models2 == "log_reg":
    model = LogisticRegression(C=hyperparams['C'], solver=hyperparams['solver'],
                                random_state=42, max_iter=500)
elif snakemake.wildcards.models2 == "svc":
    model = svm.SVC(C=hyperparams['C'], degree=hyperparams['degree'],
                    gamma=hyperparams['gamma'], kernel=hyperparams['kernel'],
                    random_state=42)
elif snakemake.wildcards.models2 == "rf":
    model = RandomForestClassifier(max_depth=hyperparams['max_depth'],
                min_samples_leaf=hyperparams['min_samples_leaf'],
                min_samples_split=hyperparams['min_samples_split'],
                n_estimators=hyperparams['n_estimators'], random_state=42)
elif snakemake.wildcards.models2 == "knn":
    model = KNeighborsClassifier(n_neighbors=hyperparams['n_neighbors'],
                weights=hyperparams['weights'],
                leaf_size=hyperparams['leaf_size'], p=hyperparams['p'])
else:
    model = GradientBoostingClassifier(loss=hyperparams['loss'],
                max_depth=hyperparams['max_depth'],
                min_samples_leaf=hyperparams['min_samples_leaf'],
                min_samples_split=hyperparams['min_samples_split'],
                n_estimators=hyperparams['n_estimators'], random_state=42)

# Train model and save training accuracy to df
model.fit(X_train, y_train)
print(snakemake.wildcards.models2)
print(model.score(X_train, y_train))

acc = {}
acc[snakemake.wildcards.models2+'_training_accuracy'] = model.score(X_train, y_train)
acc_df = pd.DataFrame(acc, index=[0])
acc_df.to_csv(snakemake.output.training_accuracy, sep='\t', index=False)

# Pickle the model as a .sav file ('wb' for write in binary)
pickle.dump(model, open(snakemake.output.pickled_trained_model, 'wb'))
library(rlang)
sessionInfo()
library(branchpointer)
library(data.table)
library(tidyverse)
library(GenomicRanges)

# Get the unique (distinct) transcript_id of all host genes into a character vector 'transcript_ids'
#transcript_df <- read.table(snakemake@input[["transcript_id_df"]], header=TRUE, sep='\t')
transcript_df <- fread(snakemake@input[["transcript_id_df"]], header=TRUE, sep='\t')
transcript_ids <- transcript_df %>% distinct(transcript_id_host, .keep_all=TRUE)

# Remove host genes that are problematic for branchpointer (they only work when not queried together with other host genes in a character vector; no explanation for this...)
# Otherwise, makeBranchpointWindowForExons doesn't work properly for these host genes
prob_hg <- c("ENST00000526015", "ENST00000379060", "ENST00000359074", "ENST00000413987", "ENST00000471759", "ENST00000554429", "ENST00000638012")
transcript_ids_2 <- transcript_ids[!(transcript_ids[["transcript_id_host"]] %in% prob_hg)]

# Create a gtf out of the exons of all host genes
exons <- gtfToExons(snakemake@input[["hg_gtf"]])
write.csv(exons, 'data/test_EXONS.csv', row.names=FALSE)

# Get bedtools bin directory (needed for predictBranchpoints below)
file_bedtools <- file(snakemake@input[["bedtools_dir"]],"r")
bedtools_dir <- readLines(file_bedtools,n=1)
close(file_bedtools)

# Create a window table for each problematic HG (the windows are 18 to 44 intronic nucleotides upstream of the exons in HG)
# and concat vertically (append) all these windows in a single Grange object called window_vec
# The windows are GRanges objects, not dataframes
window_vec <- GRanges()
for (i in prob_hg){
  temp_window <- makeBranchpointWindowForExons(i, idType="transcript_id", exons=exons)
  window_vec <- append(window_vec, temp_window)
}

# Create a window table for all other remaining HG that are not problematic for branchpointer and append it to window_vec
window_table <- makeBranchpointWindowForExons(transcript_ids_2[["transcript_id_host"]], idType="transcript_id", exons=exons)
window_vec <- append(window_vec, window_table)
write.csv(window_vec, snakemake@output[["bp_window_table"]], row.names = FALSE)

# Predict all possible branchpoints and their probability within each window (for all introns of snoRNA host genes)
# The useParallel and cores arguments cannot be used locally, which makes this script quite long to run (~1h); not tested on a computer cluster...
bp <- predictBranchpoints(window_vec, queryType = "region", genome = snakemake@input[["genome"]], bedtoolsLocation=bedtools_dir)
write.csv(bp, snakemake@output[["bp_distance"]], row.names = FALSE)
library(recount3)
library(recount)

# Find all available mouse projects
mouse_projects <- available_projects("mouse")

# Search for specific study of Shen et al. 2012 (mouse tissue RNA-Seq, project_id=SRP006787)
proj_info <- subset(mouse_projects, project == "SRP006787" &
                    project_type == "data_sources")

# Create a RangedSummarizedExperiment (RSE) object at the gene level
rse_gene <- create_rse(proj_info, "gene")


# First, scale counts by total read coverage per sample (saved as a counts matrix; the assay must stay named "counts", otherwise it doesn't work)
assay(rse_gene, "counts") <- transform_counts(rse_gene)

# Then compute TPM from the scaled counts
assay(rse_gene, "TPM") <- getTPM(rse_gene, length_var = "bp_length")

write.csv(assay(rse_gene, "TPM"), snakemake@output[["dataset"]])
library(data.table)
library(svglite)
library(UpSetR)
library(ggplot2)

# Open top 5 feature df
feature_df <- fread(snakemake@input[['df']], header=TRUE, sep='\t')

# Select rows according to each model and then select only the feature column
log_reg <- subset(feature_df, model == 'log_reg')
feat_log_reg <- log_reg[['feature']]

svc <- subset(feature_df, model == 'svc')
feat_svc <- svc[['feature']]

gbm <- subset(feature_df, model == 'gbm')
feat_gbm <- gbm[['feature']]

knn <- subset(feature_df, model == 'knn')
feat_knn <- knn[['feature']]

# Create list input for UpSetR
listInput <- list(log_reg = feat_log_reg, svc = feat_svc, gbm = feat_gbm,
                  knn = feat_knn)

# Create the upset plot and order it by degree of intersection (intersection across all 4 models, then 3 models, then 2, then 1)
fig <- upset(fromList(listInput), text.scale=c(2, 2, 2, 2, 2, 0), order.by='degree',
      sets.x.label="Number of top \npredictive features", point.size=6, line.size=1,
      matrix.color='black', main.bar.color='black', sets.bar.color='black')
svg(snakemake@output[['upset']], width=8, height=8)
print(fig)
dev.off()
Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/etiennefc/Abundance_determinants_snoRNA
Name: abundance_determinants_snorna
Version: 1
Copyright: Public Domain
License: None