Cell cycle analysis of single-cell proteomic and transcriptomic data for the FUCCI cell model

public public 1yr ago Version: 1.2.1 0 bookmarks

This repository contains the code used to perform the single-cell proteogenomic analysis of the human cell cycle . This study was based on immunofluorescence staining of ~200k cells for single-cell analysis of proteomic heterogeneity and ~1k cells for analysis of single-cell RNA variability. These analyses were integrated with other proteomic studies and databases to investigate the functional importance of transcript-regulated and non-transcript regulated variability.

Structure of repository

The code is listed in order of execution, e.g. "1_", "2_" etc. The output of each script is used in the subsequent script. This workflow can also be run using snakemake (see below).

The logic for these analyses is contained in the SingleCellProteogenomics folder.

The input files are contained in the "input" folder. This folder is linked here for release v1.2 as a zip file, input.zip . Expand this folder within the base directory of this repository. If you are looking for the raw imaging proteomic dataset produced after filtering artifacts and such, that is located here .

The output files are added to a folder "output" during the analysis, and figures are added to a folder "figures."

An R-script used to analyze skewness and kurtosis (noted in the Methods of the manuscript) is contained in the other_scripts folder. The other_scripts/ProteinDisorder.py script utilizes IUPRED2A and a human UniProt database.

Running the workflow using snakemake

This workflow can be run using snakemake :

  1. Install Miniconda from https://docs.conda.io/en/latest/miniconda.html.

  2. Install snakemake using conda install -c conda-forge snakemake-minimal .

  3. Within this directory, run snakemake -j 1 --use-conda --snakefile workflow/Snakefile .

Single-cell RNA-Seq analysis

The single-cell RNA-Seq data is available at GEO SRA under project number GSE146773 .

The cell cycle phase and FACS intensity information for these ~1,000 cells are contained in the input folder within three files, one per plate, starting with RNAData/180911_Fucci_single cell seq_ss2-18-*.csv .

The snakemake workflow used to analyze the scRNA-Seq dataset, including RNA velocity calculations and louvain unsupervised clustering, can be found in this repository: https://github.com/CellProfiling/FucciSingleCellSeqPipeline.

The loom file containing the results of RNA velocity analysis, including spliced and unspliced counts, can be found in the input folder under RNAData/a.loom , and the observation names used for each cell that match the "Well_Plate" identifiers can be found in RNAData/a.obs_names.csv .

Citation

Mahdessian, D.*; Cesnik, A. J.*; Gnann, C.; Danielsson, F.; Stenström, L.; Arif, M.; Zhang, C.; Le, T.; Johansson, F.; Shutten, R.; Bäckström, A.; Axelsson, U.; Thul, P.; Cho, N. H.; Carja, O.; Uhlén, M.; Mardinoglu, A.; Stadler, C.; Lindskog, C.; Ayoglu, B.; Leonetti, M. D.; Pontén, F.; Sullivan, D. P.; Lundberg, E. “Spatiotemporal dissection of the cell cycle with single cell proteogenomics.” Nature, 2021, 590, 649–654. *Contributed equally. https://www.nature.com/articles/s41586-021-03232-9

Code Snippets

 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
from SingleCellProteogenomics import (ProteinDataPreparation,
                                      ProteinGaussianClustering)
import matplotlib.pyplot as plt
import seaborn as sbn

# Make PDF text readable
plt.rcParams["pdf.fonttype"] = 42
plt.rcParams["ps.fonttype"] = 42
plt.rcParams["savefig.dpi"] = 300

#%% Read in the protein data (methods in ProteinDataPreparation.py)
my_df = ProteinDataPreparation.read_raw_data()
plate, u_plate, well_plate, well_plate_imgnb, u_well_plates, ab_objnum, area_cell, area_nuc, area_cyto, ensg_dict, ab_dict, result_dict, compartment_dict, ENSG, antibody, result, compartment = ProteinDataPreparation.read_sample_info(
    my_df
)
wp_ensg, wp_ab, wp_prev_ccd, wp_prev_notccd, wp_prev_negative, prev_ccd_ensg, prev_notccd_ensg, prev_negative_ensg = ProteinDataPreparation.previous_results(
    u_well_plates, result_dict, ensg_dict, ab_dict
)

#%% Idea: Filter the raw data (methods in ProteinDataPreparation.py)
# Execution: Use manual annotations and nucleus size to filter samples and images
# Output: Filtered dataframe
my_df_filtered = ProteinDataPreparation.apply_manual_filtering(
    my_df, result_dict, ab_dict
)
my_df_filtered = ProteinDataPreparation.apply_big_nucleus_filter(my_df_filtered)
my_df_filtered = ProteinDataPreparation.apply_cell_count_filter(my_df_filtered)
# my_df_filtered.to_csv("input/ProteinData/FucciData.csv")
plate, u_plate, well_plate, well_plate_imgnb, u_well_plates, ab_objnum, area_cell, area_nuc, area_cyto, ensg_dict, ab_dict, result_dict, compartment_dict, ENSG, antibody, result, compartment = ProteinDataPreparation.read_sample_info(
    my_df_filtered
)
wp_ensg, wp_ab, wp_prev_ccd, wp_prev_notccd, wp_prev_negative, prev_ccd_ensg, prev_notccd_ensg, prev_negative_ensg = ProteinDataPreparation.previous_results(
    u_well_plates, result_dict, ensg_dict, ab_dict
)

#%% Idea: Filter for variation and get compartments (methods in ProteinDataPreparation.py)
# Execution: Use annotated variation and compartment information
# Output: Number of cells filtered
my_df_filtered_variation, my_df_filtered_novariation = ProteinDataPreparation.apply_variation_filter(
    my_df_filtered, result_dict, my_df
)

## Uncomment to output these dataframes (used for skewness / kurtosis analysis)
# my_df_filtered_variation.to_csv("output/nuc_predicted_prob_phases_filtered_variation.csv")
# my_df_filtered_novariation.to_csv("output/nuc_predicted_prob_phases_filtered_novariation.csv")

# filter out the ones missing compartment information; these are localized to mitotic structures and handled differently
plate, u_plate, well_plate, well_plate_imgnb, u_well_plates, ab_objnum, area_cell, area_nuc, area_cyto, ensg_dict, ab_dict, result_dict, compartment_dict, ENSG, antibody, result, compartment = ProteinDataPreparation.read_sample_info(
    my_df_filtered_variation
)
wp_iscell, wp_isnuc, wp_iscyto, my_df_filtered_compartmentvariation = ProteinDataPreparation.metacompartments(
    u_well_plates, compartment_dict, my_df_filtered_variation
)

# demonstrate that there are no more missing compartment information
plate, u_plate, well_plate, well_plate_imgnb, u_well_plates, ab_objnum, area_cell, area_nuc, area_cyto, ensg_dict, ab_dict, result_dict, compartment_dict, ENSG, antibody, result, compartment = ProteinDataPreparation.read_sample_info(
    my_df_filtered_compartmentvariation
)
wp_iscell, wp_isnuc, wp_iscyto, my_df_filtered_compartmentvariation = ProteinDataPreparation.metacompartments(
    u_well_plates, compartment_dict, my_df_filtered_compartmentvariation
)
wp_ensg, wp_ab, wp_prev_ccd, wp_prev_notccd, wp_prev_negative, prev_ccd_ensg, prev_notccd_ensg, prev_negative_ensg = ProteinDataPreparation.previous_results(
    u_well_plates, result_dict, ensg_dict, ab_dict
)

#%% Idea: Get and process intensities (methods in ProteinDataPreparation.py and ProteinGaussianClustering.py)
# Execution: get intensities; zero center fucci intensities
# Output: Fucci plot
ab_nuc, ab_cyto, ab_cell, mt_cell, green_fucci, red_fucci = ProteinDataPreparation.read_sample_data(
    my_df_filtered_compartmentvariation
)
log_green_fucci, log_red_fucci, log_green_fucci_zeroc, log_red_fucci_zeroc, log_green_fucci_zeroc_rescale, log_red_fucci_zeroc_rescale, fucci_data = ProteinGaussianClustering.zero_center_fucci(
    green_fucci, red_fucci, u_plate, well_plate, plate
)

plt.hist2d(log_green_fucci_zeroc_rescale, log_red_fucci_zeroc_rescale, bins=200)
plt.xlabel("Log10 Green Fucci Intensity")
plt.ylabel("Log10 Red Fucci Intensity")
plt.savefig("figures/FucciPlotProteinIFData_unfiltered.png")
# plt.show()
plt.close()

# General picture of antibody intensity density
sbn.displot(ab_cell, kind="hist")
plt.xlabel("Mean Intensity")
plt.ylabel("Density")
plt.savefig("figures/antibody_cell_intensity.pdf")
# plt.show()
plt.close()

#%% Idea: Gaussian clustering per plate to identify G1/S/G2 and do kruskal test for variance
# Exec: sklearn.mixture.GaussianMixture & scipy.stats.kruskal
# Output: FDR for cell cycle variation per well per compartment

# NB! The cluster labels can change if any prior analysis changes. Inspect the plots so that top-left FUCCI cluster is G1, top-right is S, bottom-right is G2.
g1_idx, sph_idx, g2_idx = 1, 2, 0
clusternames = [
    "G2" if g2_idx == 0 else "G1" if g1_idx == 0 else "S-ph",
    "G2" if g2_idx == 1 else "G1" if g1_idx == 1 else "S-ph",
    "G2" if g2_idx == 2 else "G1" if g1_idx == 2 else "S-ph",
]
cluster_labels = ProteinGaussianClustering.gaussian_clustering(
    log_green_fucci_zeroc_rescale, log_red_fucci_zeroc_rescale, clusternames
)
g1 = cluster_labels == g1_idx
sph = cluster_labels == sph_idx
g2 = cluster_labels == g2_idx
alpha_gauss, doGenerateBoxplotsPerGene = 0.05, False
wp_comp_kruskal_gaussccd_adj, wp_pass_kruskal_gaussccd_bh_comp, wp_mt_kruskal_gaussccd_adj, wp_pass_gaussccd_bh_mt = ProteinGaussianClustering.gaussian_clustering_analysis(
    alpha_gauss,
    doGenerateBoxplotsPerGene,
    g1,
    sph,
    g2,
    wp_ensg,
    well_plate,
    u_well_plates,
    ab_cell,
    ab_nuc,
    ab_cyto,
    mt_cell,
    wp_iscell,
    wp_isnuc,
    wp_iscyto,
)

# General look at replicates in mock-bulk analysis
ProteinGaussianClustering.address_replicates(
    alpha_gauss, wp_pass_kruskal_gaussccd_bh_comp, wp_ensg, wp_ab, u_well_plates
)
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
from SingleCellProteogenomics import (FucciPseudotime, Loaders,
                                      ProteinBimodality,
                                      ProteinCellCycleDependence,
                                      ProteinVariability, utils)
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scipy
import argparse

if __name__ == "__main__":
    description = "Single cell proteogenomic analysis script -- protein cell cycle dependence"
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument(
        "--quicker", action="store_true", help="Skip some plotting and analyses, namely for HPA releases."
    )
    args = parser.parse_args()
    doAllPlotsAndAnalyses = not args.quicker

    # Make PDF text readable
    plt.rcParams["pdf.fonttype"] = 42
    plt.rcParams["ps.fonttype"] = 42
    plt.rcParams["savefig.dpi"] = 300

    #%% Read in the protein data
    import_dict = Loaders.load_protein_fucci_pseudotime()
    u_plate, well_plate, well_plate_imgnb, well_plate_imgnb_objnb, u_well_plates = (
        import_dict["u_plate"],
        import_dict["well_plate"],
        import_dict["well_plate_imgnb"],
        import_dict["well_plate_imgnb_objnb"],
        import_dict["u_well_plates"],
    )
    ab_nuc, ab_cyto, ab_cell, mt_cell = (
        import_dict["ab_nuc"],
        import_dict["ab_cyto"],
        import_dict["ab_cell"],
        import_dict["mt_cell"],
    )
    area_cell, area_nuc = import_dict["area_cell"], import_dict["area_nuc"]
    wp_ensg, wp_ab = import_dict["wp_ensg"], import_dict["wp_ab"]
    green_fucci, red_fucci = import_dict["green_fucci"], import_dict["red_fucci"]
    log_green_fucci_zeroc, log_red_fucci_zeroc = (
        import_dict["log_green_fucci_zeroc"],
        import_dict["log_red_fucci_zeroc"],
    )
    log_green_fucci_zeroc_rescale, log_red_fucci_zeroc_rescale = (
        import_dict["log_green_fucci_zeroc_rescale"],
        import_dict["log_red_fucci_zeroc_rescale"],
    )
    wp_comp_kruskal_gaussccd_adj, wp_pass_kruskal_gaussccd_bh_comp = (
        import_dict["wp_comp_kruskal_gaussccd_adj"],
        import_dict["wp_pass_kruskal_gaussccd_bh_comp"],
    )
    fucci_data = import_dict["fucci_data"]
    wp_iscell, wp_isnuc, wp_iscyto = (
        import_dict["wp_iscell"],
        import_dict["wp_isnuc"],
        import_dict["wp_iscyto"],
    )
    curr_wp_phases, mockbulk_phases = (
        import_dict["curr_wp_phases"],
        import_dict["mockbulk_phases"],
    )

    #%%
    # Idea: Calculate the polar coordinates and other stuff
    # Exec: Devin's calculations
    # Output: fucci plot with polar coordinates
    pseudotime_result = FucciPseudotime.pseudotime_protein(
        fucci_data,
        ab_nuc,
        ab_cyto,
        ab_cell,
        mt_cell,
        area_cell,
        area_nuc,
        well_plate,
        well_plate_imgnb,
        well_plate_imgnb_objnb,
        log_red_fucci_zeroc_rescale,
        log_green_fucci_zeroc_rescale,
        mockbulk_phases,
    )
    pol_sort_well_plate, pol_sort_norm_rev, pol_sort_well_plate_imgnb, pol_sort_well_plate_imgnb_objnb, pol_sort_ab_nuc, pol_sort_ab_cyto, pol_sort_ab_cell, pol_sort_mt_cell, pol_sort_area_cell, pol_sort_area_nuc, pol_sort_fred, pol_sort_fgreen, pol_sort_mockbulk_phases = (
        pseudotime_result
    )

    #%% Calculate measures of variance of protein abundance in single cells
    # Idea: Calculate measures of variance, and show them in plots
    # Execution: Now that we already have the data filtered for variability, this is just descriptive.
    # Output: scatters of antibody vs microtubule variances by different measures of variaibility

    # toggle for using log-transformed intensities
    # we decided to use natural intensities
    use_log = False
    calculate_variaton_result = ProteinVariability.calculate_variation(
        use_log,
        u_well_plates,
        wp_iscell,
        wp_isnuc,
        wp_iscyto,
        pol_sort_well_plate,
        pol_sort_ab_cell,
        pol_sort_ab_nuc,
        pol_sort_ab_cyto,
        pol_sort_mt_cell,
        pol_sort_well_plate_imgnb,
    )
    mean_mean_comp, var_comp, gini_comp, cv_comp, var_cell, gini_cell, cv_cell, var_mt, gini_mt, cv_mt = (
        calculate_variaton_result
    )

    # Compare variances for protein and microtubules, the internal control for each image
    if doAllPlotsAndAnalyses:
        removeThese = pd.read_csv("input/ProteinData/ReplicatesToRemove.txt", header=None)[0] # make these independent samples for one-sided Kruskal-Wallis tests
        utils.general_boxplot(
            (var_comp[~np.isin(u_well_plates, removeThese)], var_mt[~np.isin(u_well_plates, removeThese)]),
            ("Protein", "Microtubules"),
            "",
            "Variance",
            "",
            False,
            f"figures/ProteinMicrotubuleVariances.pdf",
        )
        utils.general_boxplot(
            (cv_comp[~np.isin(u_well_plates, removeThese)], gini_mt[~np.isin(u_well_plates, removeThese)]),
            ("Protein", "Microtubules"),
            "",
            "CV",
            "",
            False,
            f"figures/ProteinMicrotubuleCVs.pdf",
        )
        utils.general_boxplot(
            (gini_comp[~np.isin(u_well_plates, removeThese)], gini_mt[~np.isin(u_well_plates, removeThese)]),
            ("Protein", "Microtubules"),
            "",
            "Gini",
            "",
            False,
            f"figures/ProteinMicrotubuleGinis.pdf",
        )
        p_varProt_varMt = 2*scipy.stats.kruskal(var_comp[~np.isin(u_well_plates, removeThese)], var_mt[~np.isin(u_well_plates, removeThese)])[1]
        p_cvProt_cvMt = 2*scipy.stats.kruskal(cv_comp[~np.isin(u_well_plates, removeThese)], cv_mt[~np.isin(u_well_plates, removeThese)])[1]
        p_giniProt_giniMt = 2*scipy.stats.kruskal(gini_comp[~np.isin(u_well_plates, removeThese)], gini_mt[~np.isin(u_well_plates, removeThese)])[1]
        print(
            f"{p_varProt_varMt}: p-value for difference between protein and microtubule variances"
        )
        print(
            f"{p_cvProt_cvMt}: p-value for difference between protein and microtubule CVs"
        )
        print(
            f"{p_giniProt_giniMt}: p-value for difference between protein and microtubule Gini indices"
        )

    #%% Gaussian clustering to identify biomodal intensity distributions
    bimodal_results = ProteinBimodality.identify_bimodal_intensity_distributions(
        u_well_plates,
        wp_ensg,
        pol_sort_well_plate,
        pol_sort_norm_rev,
        pol_sort_ab_cell,
        pol_sort_ab_nuc,
        pol_sort_ab_cyto,
        pol_sort_mt_cell,
        wp_iscell,
        wp_isnuc,
        wp_iscyto,
        doAllPlotsAndAnalyses
    )
    wp_isbimodal_fcpadj_pass, wp_bimodal_cluster_idxs, wp_isbimodal_generally, wp_bimodal_fcmaxmin = (
        bimodal_results
    )

    #%% Determine cell cycle dependence for each protein
    use_log_ccd = False
    do_remove_outliers = True
    alphaa = 0.05

    # Determine cell cycle dependence for proteins
    ccd_results = ProteinCellCycleDependence.cell_cycle_dependence_protein(
        u_well_plates,
        wp_ensg,
        wp_ab,
        use_log_ccd,
        do_remove_outliers,
        pol_sort_well_plate,
        pol_sort_norm_rev,
        pol_sort_ab_cell,
        pol_sort_ab_nuc,
        pol_sort_ab_cyto,
        pol_sort_mt_cell,
        pol_sort_fred,
        pol_sort_fgreen,
        pol_sort_mockbulk_phases,
        pol_sort_area_cell,
        pol_sort_area_nuc,
        pol_sort_well_plate_imgnb,
        wp_iscell,
        wp_isnuc,
        wp_iscyto,
        wp_isbimodal_fcpadj_pass,
        wp_bimodal_cluster_idxs,
        wp_comp_kruskal_gaussccd_adj,
        doAllPlotsAndAnalyses
    )
    wp_comp_ccd_difffromrng, mean_diff_from_rng_mt, wp_comp_ccd_clust1, wp_comp_ccd_clust2, wp_ccd_unibimodal, wp_comp_ccd_gauss, perc_var_comp, mean_diff_from_rng, wp_comp_eq_percvar_adj, mean_diff_from_rng_clust1, wp_comp_eq_percvar_adj_clust1, mean_diff_from_rng_clust2, wp_comp_eq_percvar_adj_clust2, mvavgs_x, mvavgs_comp, curr_pols, curr_ab_norms, mvperc_comps, curr_freds, curr_fgreens, curr_mockbulk_phases, curr_area_cell, curr_ab_nuc, curr_well_plate_imgnb, folder = (
        ccd_results
    )

    # Move the temporal average plots to more informative places
    if doAllPlotsAndAnalyses:
        ProteinCellCycleDependence.copy_mvavg_plots_protein(
            folder,
            wp_ensg,
            wp_comp_ccd_difffromrng,
            wp_isbimodal_fcpadj_pass,
            wp_comp_ccd_clust1,
            wp_comp_ccd_clust2,
            wp_ccd_unibimodal,
            wp_comp_ccd_gauss,
        )
        ProteinCellCycleDependence.global_plots_protein(
            alphaa,
            u_well_plates,
            wp_ccd_unibimodal,
            perc_var_comp,
            mean_mean_comp,
            gini_comp,
            cv_comp,
            mean_diff_from_rng,
            wp_comp_eq_percvar_adj,
            wp_comp_kruskal_gaussccd_adj,
        )

    # Analyze the CCD results and save them
    ccd_comp, nonccd_comp, bioccd = ProteinCellCycleDependence.analyze_ccd_variation_protein(
        folder,
        u_well_plates,
        wp_ensg,
        wp_ab,
        wp_iscell,
        wp_isnuc,
        wp_iscyto,
        wp_comp_ccd_difffromrng,
        wp_comp_ccd_clust1,
        wp_comp_ccd_clust2,
        var_comp,
        gini_comp,
        perc_var_comp,
        mean_diff_from_rng,
        wp_comp_kruskal_gaussccd_adj,
        wp_comp_eq_percvar_adj,
        mean_diff_from_rng_clust1,
        wp_comp_eq_percvar_adj_clust1,
        mean_diff_from_rng_clust2,
        wp_comp_eq_percvar_adj_clust2,
        wp_isbimodal_fcpadj_pass,
        wp_isbimodal_generally,
        wp_ccd_unibimodal,
        wp_bimodal_fcmaxmin,
        wp_comp_ccd_gauss,
    )

    # Make a dataframe for plotting on the HPA website
    ProteinCellCycleDependence.make_plotting_dataframe(
        wp_ensg,
        wp_ab,
        u_well_plates,
        wp_iscell,
        wp_iscyto,
        wp_isnuc,
        ccd_comp,
        bioccd,
        curr_pols,
        curr_ab_norms,
        curr_freds,
        curr_fgreens,
        curr_mockbulk_phases,
        mvavgs_x,
        mvavgs_comp,
        mvperc_comps,
        gini_comp,
        perc_var_comp,
    )

    # Perform comparison to LASSO for finding CCD proteins
    if doAllPlotsAndAnalyses:
        ProteinCellCycleDependence.compare_to_lasso_analysis(
            u_well_plates,
            pol_sort_norm_rev,
            pol_sort_well_plate,
            pol_sort_ab_cell,
            pol_sort_ab_nuc,
            pol_sort_ab_cyto,
            pol_sort_mt_cell,
            pol_sort_fred,
            pol_sort_fgreen,
            wp_iscell,
            wp_isnuc,
            wp_iscyto,
            wp_ensg,
            ccd_comp,
        )

        # Generate UMAPs to illustrate cutoffs and stability
        ProteinCellCycleDependence.generate_protein_umaps(
            u_well_plates,
            pol_sort_norm_rev,
            pol_sort_well_plate,
            pol_sort_ab_cell,
            pol_sort_ab_nuc,
            pol_sort_ab_cyto,
            pol_sort_mt_cell,
            wp_iscell,
            wp_isnuc,
            wp_iscyto,
            mean_diff_from_rng,
        )
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
from SingleCellProteogenomics import (FucciPseudotime,
                                      RNACellCycleDependence,
                                      RNADataPreparation,
                                      RNAVelocity,
                                      utils)
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
import shutil
import os
import argparse

if __name__ == "__main__":
    description = "Single cell proteogenomic analysis script -- protein cell cycle dependence"
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument(
        "--quicker", action="store_true", help="Skip some plotting and analyses, namely for HPA releases."
    )
    args = parser.parse_args()
    doAllPlotsAndAnalyses = not args.quicker

    # Make PDF text readable
    plt.rcParams["pdf.fonttype"] = 42
    plt.rcParams["ps.fonttype"] = 42
    plt.rcParams["savefig.dpi"] = 300

    bioccd = np.genfromtxt(
        "input/ProteinData/BiologicallyDefinedCCD.txt", dtype="str"
    )  # from mitotic structures
    wp_ensg = np.load("output/pickles/wp_ensg.npy", allow_pickle=True)
    ccd_comp = np.load("output/pickles/ccd_comp.npy", allow_pickle=True)
    nonccd_comp = np.load("output/pickles/nonccd_comp.npy", allow_pickle=True)
    u_plates = ["355","356","357"]

    #%% Convert FACS intensities for FUCCI markers to pseudotime using the same polar coordinate methods as for protein
    # Idea: Use the polar coordinate pseudotime calculations to calculate the pseudotime for each cell
    # Execution: Adapt Devin's code for the cells sorted for RNA-Seq
    # Output: Make log-log fucci intensity plots for the cells analyzed by RNA-Seq; Plot of all fucci pseudotimes; table of pseudotimes for each cell
    adata, phases_filt = RNADataPreparation.read_counts_and_phases(
        "Counts", False, "protein_coding", u_plates
    )
    adata = RNADataPreparation.zero_center_fucci(adata)
    FucciPseudotime.pseudotime_rna(adata)

    #%% Single cell RNA-Seq data preparation and general analysis
    if doAllPlotsAndAnalyses:
        RNADataPreparation.general_plots(u_plates)
        RNADataPreparation.analyze_noncycling_cells(u_plates)
        adata, phasesfilt = RNADataPreparation.qc_filtering(
            adata, do_log_normalize=True, do_remove_blob=False
        )
        adata = RNADataPreparation.zero_center_fucci(adata)
        RNADataPreparation.plot_pca_for_batch_effect_analysis(adata, "BeforeRemovingNoncycling")

    #%% Idea: Similar to mock-bulk analysis for proteins, we can evaluate each gene bundled by phase across cells
    # Execution: Make boxplots of RNA expression by phase
    # Output: boxplots for each gene
    valuetype, use_spikeins, biotype_to_use = "Tpms", False, "protein_coding"
    adata, phases = RNADataPreparation.read_counts_and_phases(
        valuetype, use_spikeins, biotype_to_use, u_plates
    )
    adata, phasesfilt = RNADataPreparation.qc_filtering(
        adata, do_log_normalize=True, do_remove_blob=True
    )
    adata = RNADataPreparation.zero_center_fucci(adata)

    if doAllPlotsAndAnalyses:
        RNADataPreparation.plot_markers_vs_reads(adata)
        RNADataPreparation.plot_pca_for_batch_effect_analysis(adata, "AfterRemovingNoncycling")
        g1 = adata.obs["phase"] == "G1"
        s = adata.obs["phase"] == "S-ph"
        g2 = adata.obs["phase"] == "G2M"
        do_make_boxplots = False
        if do_make_boxplots:
            for iii, ensg in enumerate(adata.var_names):
                maxtpm = np.max(
                    np.concatenate((adata.X[g1, iii], adata.X[s, iii], adata.X[g2, iii]))
                )
                RNACellCycleDependence.boxplot_result(
                    adata.X[g1, iii] / maxtpm,
                    adata.X[s, iii] / maxtpm,
                    adata.X[g2, iii] / maxtpm,
                    "figures/RNABoxplotByPhase",
                    ensg,
                )

    #%% Idea: Display general RNA expression patterns in single cells using UMAP dimensionality reduction, and display with FUCCI pseudotime overlayed
    if doAllPlotsAndAnalyses:
        FucciPseudotime.pseudotime_umap(adata)  # Generate a UMAP with the pseudotime overlayed

        # We can also show that the cycle pattern remains when the curated CCD genes or CCD proteins are removed,
        # demonstrating that there's still valuable information about cell cycling beyond what was called CCD
        RNADataPreparation.demonstrate_umap_cycle_without_ccd(adata)

    # Read in the curated CCD genes / CCD proteins from the present work / Non-CCD genes from the present work; filter for genes that weren't filtered in QC of RNA-Seq
    ccd_regev_filtered, ccd_filtered, nonccd_filtered = utils.ccd_gene_lists(adata)
    adata_ccdprotein, adata_nonccdprotein, adata_regevccdgenes = RNADataPreparation.is_ccd(
        adata, wp_ensg, ccd_comp, nonccd_comp, bioccd, ccd_regev_filtered
    )

    # Generate plots with expression of genes overlayed
    expression_data = adata.X
    normalized_exp_data = (expression_data.T / np.max(expression_data, axis=0)[:, None]).T

    # Log-log FUCCI plot with RNA expression overlayed
    if doAllPlotsAndAnalyses:
        RNACellCycleDependence.plot_expression_facs(
            adata,
            wp_ensg[np.isin(wp_ensg, adata.var_names)],
            normalized_exp_data,
            "figures/GeneExpressionFucci",
        )

    # Cluster the expression into phases and analyze it that way
    bulk_phase_tests = RNACellCycleDependence.analyze_ccd_variation_by_phase_rna(
        adata, normalized_exp_data, biotype_to_use
    )
    # RNACellCycleDependence.plot_expression_boxplots(adata, wp_ensg[np.isin(wp_ensg, adata.var_names)], bulk_phase_tests, "figures/GeneExpressionBoxplots")
    # RNACellCycleDependence.plot_expression_umap(adata, wp_ensg[np.isin(wp_ensg, adata.var_names)], "figures/GeneExpressionUmap")

    #%% Moving average calculations and randomization analysis for RNA
    rna_ccd_analysis_results = RNACellCycleDependence.analyze_ccd_variation_by_mvavg_rna(
        adata,
        wp_ensg,
        ccd_comp,
        bioccd,
        adata_nonccdprotein,
        adata_regevccdgenes,
        biotype_to_use,
        make_mvavg_plots_isoforms=doAllPlotsAndAnalyses,
    )
    percent_ccd_variance, total_gini, mean_diff_from_rng, pass_meandiff, eq_percvar_adj, fucci_time_inds, norm_exp_sort, moving_averages, mvavg_xvals, perms, ccdtranscript, ccdprotein, mvpercs = (
        rna_ccd_analysis_results
    )

    RNACellCycleDependence.make_plotting_dataframe(
        adata,
        ccdtranscript,
        wp_ensg,
        bioccd,
        norm_exp_sort[np.argsort(fucci_time_inds), :],
        mvavg_xvals,
        moving_averages,
        mvpercs,
    )

    if doAllPlotsAndAnalyses:
        RNACellCycleDependence.plot_umap_ccd_cutoffs(adata, mean_diff_from_rng)
        RNACellCycleDependence.figures_ccd_analysis_rna(
            adata,
            percent_ccd_variance,
            mean_diff_from_rng,
            pass_meandiff,
            eq_percvar_adj,
            wp_ensg,
            ccd_comp,
            ccd_regev_filtered,
        )
        RNACellCycleDependence.plot_overall_and_ccd_variances(
            adata,
            biotype_to_use,
            total_gini,
            percent_ccd_variance,
            pass_meandiff,
            adata_ccdprotein,
            adata_nonccdprotein,
            adata_regevccdgenes,
        )
        RNACellCycleDependence.compare_to_lasso_analysis(adata, ccdtranscript)
        RNACellCycleDependence.analyze_cnv_calls(adata, ccdtranscript)

    #%% Moving average calculations and randomization analysis for the spike-in internal controls
    if doAllPlotsAndAnalyses:
        adata_spikeins, phases_spikeins = RNADataPreparation.read_counts_and_phases(
            valuetype, use_spike_ins=True, biotype_to_use="", u_plates=u_plates
        )
        sc.pp.filter_genes(adata_spikeins, min_cells=100)
        print(f"data shape after filtering: {adata_spikeins.X.shape}")

        RNACellCycleDependence.ccd_analysis_of_spikeins(adata_spikeins, perms)

    #%% Analyze RNA velocity
    valuetype, use_spikeins, biotype_to_use = "Tpms", False, "protein_coding"
    adata, phases = RNADataPreparation.read_counts_and_phases(
        valuetype, use_spikeins, biotype_to_use, u_plates, load_velocities=True
    )
    adata_blobless, phasesfilt = RNADataPreparation.qc_filtering(
        adata, do_log_normalize=True, do_remove_blob=False
    )
    RNAVelocity.analyze_rna_velocity(adata_blobless, mean_diff_from_rng, doAllPlotsAndAnalyses)

    #%% Analyze isoforms
    if doAllPlotsAndAnalyses:
        adata_isoform, ccdtranscript_isoform = RNACellCycleDependence.analyze_isoforms(
            adata, ccdtranscript, wp_ensg, ccd_comp, nonccd_comp, u_plates, make_mvavg_plots_isoforms=doAllPlotsAndAnalyses
        )
        RNACellCycleDependence.compare_genes_to_isoforms(
            adata,
            ccdprotein,
            ccdtranscript,
            adata_nonccdprotein,
            adata_isoform,
            ccdtranscript_isoform,
        )
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
from SingleCellProteogenomics import (Loaders, RNADataPreparation,
                                      TemporalDelay)
import matplotlib.pyplot as plt

# Make PDF text readable
plt.rcParams["pdf.fonttype"] = 42
plt.rcParams["ps.fonttype"] = 42
plt.rcParams["savefig.dpi"] = 300

#%% Read in the protein data
import_dict = Loaders.load_temporal_delay()
u_well_plates, wp_ensg = import_dict["u_well_plates"], import_dict["wp_ensg"]
wp_iscell, wp_isnuc, wp_iscyto = (
    import_dict["wp_iscell"],
    import_dict["wp_isnuc"],
    import_dict["wp_iscyto"],
)
ccd_comp, ccdtranscript, ccdtranscript_isoform = (
    import_dict["ccd_comp"],
    import_dict["ccdtranscript"],
    import_dict["ccdtranscript_isoform"],
)
pol_sort_well_plate, pol_sort_norm_rev = (
    import_dict["pol_sort_well_plate"],
    import_dict["pol_sort_norm_rev"],
)
pol_sort_ab_nuc, pol_sort_ab_cyto, pol_sort_ab_cell, pol_sort_mt_cell = (
    import_dict["pol_sort_ab_nuc"],
    import_dict["pol_sort_ab_cyto"],
    import_dict["pol_sort_ab_cell"],
    import_dict["pol_sort_mt_cell"],
)
var_comp_prot, gini_comp_prot, cv_comp_prot = (
    import_dict["var_comp"],
    import_dict["gini_comp"],
    import_dict["cv_comp"],
)
var_cell_prot, gini_cell_prot, cv_cell_prot = (
    import_dict["var_cell"],
    import_dict["gini_cell"],
    import_dict["cv_cell"],
)

#%% Idea: Make temporal heatmap for peak protein expression, and compare known and novel proteins that peak at similar times
# Execution: plt.imshow makes a heatmap if given a 2D array
# Output: heatmap; correlations of known/novel proteins
highlights = []  #'ORC6','DUSP19','BUB1B','DPH2', 'FLI1']
highlights_ensg = []  #'ORC6','DUSP19','BUB1B','DPH2', 'FLI1']

nbins = 20
protein_heatmap_results = TemporalDelay.protein_heatmap(
    nbins,
    highlights,
    highlights_ensg,
    ccd_comp,
    u_well_plates,
    wp_ensg,
    pol_sort_norm_rev,
    pol_sort_well_plate,
    pol_sort_ab_cell,
    pol_sort_ab_nuc,
    pol_sort_ab_cyto,
    pol_sort_mt_cell,
    wp_iscell,
    wp_isnuc,
    wp_iscyto,
)
sorted_maxpol_array, wp_binned_values, wp_max_pol, wp_max_pol_ccd, xvals = (
    protein_heatmap_results
)

# Correlations of known and novel proteins that peak at similar times
TemporalDelay.peak_expression_correlation_analysis(
    wp_binned_values, wp_max_pol, wp_ensg, pol_sort_well_plate, u_well_plates
)

#%% Create a heatmap of peak RNA expression
highlight_names, highlight_ensg = [], []
u_rna_plates = ["355","356","357"]

# Read in RNA-Seq data; use TPMs so that the gene-specific results scales match for cross-gene comparisons
valuetype, use_spikeins, biotype_to_use = "Tpms", False, "protein_coding"
adata, phases = RNADataPreparation.read_counts_and_phases(
    valuetype, use_spikeins, biotype_to_use, u_rna_plates
)
adata, phasesfilt = RNADataPreparation.qc_filtering(
    adata, do_log_normalize=True, do_remove_blob=True
)
adata = RNADataPreparation.zero_center_fucci(adata)
sorted_max_moving_avg_pol_ccd, norm_exp_sort, max_moving_avg_pol, sorted_rna_binned_norm = TemporalDelay.rna_heatmap(
    adata, highlight_names, highlight_ensg, ccdtranscript, xvals
)

# Analyze isoforms
adata_isoform, phases_isoform = RNADataPreparation.read_counts_and_phases(
    valuetype, use_spikeins, biotype_to_use, u_rna_plates, use_isoforms=True,
)
adata_isoform, phasesfilt_isoform = RNADataPreparation.qc_filtering(
    adata_isoform, do_log_normalize=True, do_remove_blob=True
)
adata_isoform = RNADataPreparation.zero_center_fucci(adata_isoform)
sorted_max_moving_avg_pol_ccd_isoform, norm_exp_sort_isoform, max_moving_avg_pol_isoform, sorted_rna_binned_norm_isoform = TemporalDelay.rna_heatmap(
    adata_isoform,
    highlight_names,
    highlight_ensg,
    ccdtranscript_isoform,
    xvals,
    isIsoformData=True,
)
pearsonCorrelations = TemporalDelay.analyze_ccd_isoform_correlations(
    adata, adata_isoform, ccdtranscript, ccdtranscript_isoform, xvals, u_rna_plates
)

#%% Compare the variances and time of peak expression between protein and RNA
TemporalDelay.compare_variances_prot_v_rna(
    adata,
    norm_exp_sort,
    wp_ensg,
    var_comp_prot,
    gini_comp_prot,
    cv_comp_prot,
    var_cell_prot,
    gini_cell_prot,
    cv_cell_prot,
)

TemporalDelay.compare_peak_expression_prot_v_rna(
    adata,
    wp_ensg,
    ccd_comp,
    ccdtranscript,
    wp_max_pol,
    wp_max_pol_ccd,
    sorted_maxpol_array,
    max_moving_avg_pol,
    sorted_max_moving_avg_pol_ccd,
)
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
from SingleCellProteogenomics import (FucciCellCycle, Loaders,
                                      ProteinPropertyAnalysis,
                                      RNADataPreparation, utils)
import matplotlib.pyplot as plt
import numpy as np

# Make PDF text readable
plt.rcParams["pdf.fonttype"] = 42
plt.rcParams["ps.fonttype"] = 42
plt.rcParams["savefig.dpi"] = 300
fucci = FucciCellCycle.FucciCellCycle()
u_rna_plates = ["355","356","357"]

#%% Import the genes names we're analyzing
# Read in RNA-Seq data again and the CCD gene lists
valuetype, use_spikeins, biotype_to_use = "Tpms", False, "protein_coding"
adata, phases = RNADataPreparation.read_counts_and_phases(
    valuetype, use_spikeins, biotype_to_use, u_rna_plates
)
adata, phasesfilt = RNADataPreparation.qc_filtering(
    adata, do_log_normalize=True, do_remove_blob=True
)
adata = RNADataPreparation.zero_center_fucci(adata)
import_dict = Loaders.load_ptm_and_stability(adata)
wp_ensg, ccd_comp, nonccd_comp, ccdtranscript, wp_max_pol = (
    import_dict["wp_ensg"],
    import_dict["ccd_comp"],
    import_dict["nonccd_comp"],
    import_dict["ccdtranscript"],
    import_dict["wp_max_pol"],
)
ensg_results, name_results = utils.save_gene_names_by_category(
    adata, wp_ensg, ccd_comp, nonccd_comp, ccdtranscript
)
ensg_ccdtranscript, ensg_nonccdtranscript, ensg_ccdprotein, ensg_nonccdprotein, ensg_ccdprotein_transcript_regulated, ensg_ccdprotein_nontranscript_regulated, genes_analyzed, ccd_regev_filtered, ccd_filtered = ensg_results
names_ccdtranscript, names_nonccdtranscript, names_ccdprotein, names_nonccdprotein, names_ccdprotein_transcript_regulated, names_ccdprotein_nontranscript_regulated, names_genes_analyzed, names_ccd_regev_filtered, names_ccd_filtered = name_results
bioccd = np.genfromtxt(
    "input/ProteinData/BiologicallyDefinedCCD.txt", dtype="str"
)  # from mitotic structures
names_bioccd = utils.ccd_gene_names(bioccd, utils.getGeneNameDict())

#%% Analyze properties of the different groups relative to melting points
proteinProperties = ProteinPropertyAnalysis.ProteinProperties(
    wp_ensg,
    ensg_ccdprotein,
    ensg_ccdprotein_transcript_regulated,
    ensg_ccdprotein_nontranscript_regulated,
    bioccd,
    ensg_nonccdprotein,
    ensg_ccdtranscript,
    names_bioccd,
    names_ccdprotein,
    names_ccdprotein_transcript_regulated,
    names_ccdprotein_nontranscript_regulated,
    names_nonccdprotein,
    names_ccdtranscript,
)
proteinProperties.analyze_melting_points()
proteinProperties.analyze_disorder()
proteinProperties.analyze_length()
proteinProperties.statistical_properties_table()
proteinProperties.generate_properties_table()
proteinProperties.generate_statistical_boxplots()
proteinProperties.tm_scatters()
proteinProperties.kinase_families()
25
shell: "python 1_ProteinCellCycleClusters.py &> {log}"
32
shell: "python 2_ProteinFucciPsuedotime.py &> {log}"
39
shell: "python 3_RNAFucciPseudotime.py &> {log}"
46
shell: "python 4_TemporalDelay.py &> {log}"
53
shell: "python 5_ProteinProperties.py &> {log}"
ShowHide 8 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/CellProfiling/SingleCellProteogenomics
Name: singlecellproteogenomics
Version: 1.2.1
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...