Single-Cell and Single-Nuclei Data Integration Workflow


This repo contains files, scripts, and analysis related to exploring integration of single-cell and single-nuclei data.

Environment setup

Managing R packages with renv

Package dependencies for the analysis workflows in this repository are managed using renv. For renv to work as intended, you'll need to work within the sc-data-integration.Rproj project in RStudio. You may need to run renv::restore() upon opening the project to ensure the renv.lock file is synced with the project library.

Each time you install or use new packages, you will want to run renv::snapshot() to update the renv.lock file with any added packages and dependencies necessary to run the analyses and scripts in this repo.

If there are dependencies you want to include that are not captured automatically by renv::snapshot(), add them to components/dependencies.R with a call to library() and an explanatory comment. For example, if dplyr were recommended but not required by a package and you wanted to make sure to include it in the lockfile, you would add library(dplyr) to components/dependencies.R, then rerun renv::snapshot().
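A minimal components/dependencies.R along these lines would capture the example above (the dplyr entry is illustrative):

# components/dependencies.R
# Dependencies not captured automatically by renv::snapshot().
# dplyr is only suggested by a package we use, but our scripts rely on it:
library(dplyr)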

Snakemake/conda setup

The main workflow for the integration scripts is written with Snakemake, which will handle most dependencies internally, including the renv environment.

You will need the latest version of snakemake and the peppy python package. The easiest way to install these is with conda and/or mamba, which you will want to set up to use the bioconda and conda-forge channels using the following series of commands:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict

You can then install snakemake and peppy into your preferred environment with:

mamba install snakemake peppy

(Use conda install if you do not have mamba installed.)

Python-based environments will be built automatically by Snakemake when the workflow is run, but the environment for R should be built before running the workflow. To create or update the necessary environment for the R scripts, which includes an isolated version of R, pandoc, and the renv package installation, run the following command from the base of the repository:

bash setup_envs.sh

This script will use Snakemake to install all necessary components for the workflow in an isolated conda environment. If you are on an Apple Silicon (M1/M2/Arm) Mac, this should properly handle setting up R to use an Intel-based build for compatibility with Bioconductor packages.

This installation may take up to an hour, as all of the R packages will likely have to be compiled from scratch. However, this should be a one-time cost, and ensures that you have all of the tools for the workflow installed and ready.

To use the environment you have just created, you will need to run Snakemake with the --use-conda flag each time.

If there are updates to the renv.lock file only, those can be applied with the following command (on any system):

snakemake --use-conda -c2 build_renv

Data used for benchmarking integration

For exploring data integration, we used test datasets obtained from the Human Cell Atlas (HCA) Data Portal, the Single-cell Pediatric Cancer Atlas (ScPCA) Portal, and simulated single-cell data published in Luecken et al. (2022).

All data from the HCA that we are using can be found in the private S3 bucket, s3://sc-data-integration/human_cell_atlas_data. The simulated data can be downloaded directly from figshare.

All gene expression data used for benchmarking is stored in the private S3 bucket, s3://sc-data-integration. In order to access these files, you must be a Data Lab staff member and have credentials set up for AWS.

Metadata

Inside the sample-info folder is metadata related to datasets used for testing data integration.

  1. <project_name>-project-metadata.tsv: This file contains information about each of the projects that are being used for testing integration from a given area (e.g., HCA, simulated, ScPCA). Each row in this file corresponds to a project, dataset, or group of libraries that should be integrated together. All project-metadata.tsv files must contain a project_name column, but may also contain other relevant project information such as the following:

| column_id | contents |
| --- | --- |
| project_name | The shorthand project name |
| source_id | Unique ID associated with the project |
| tissue_group | Tissue group the project belongs to (e.g., blood, brain, or kidney) |
| files_directory | Files directory on S3 |
| metadata_filename | Name of the metadata file obtained from the HCA |
| celltype_filename | Name of the file containing cell type information, as found on the HCA |
| celltype_filetype | Format of the cell type file available on the HCA |
  2. hca-library-metadata.tsv: This file contains information about each library from every project that is being used as a test dataset from the HCA. Each row in this file corresponds to a library and contains the following columns:

| column_id | contents |
| --- | --- |
| sample_biomaterial_id | Unique ID associated with the individual sample |
| library_biomaterial_id | Unique ID associated with the individual library that was sequenced |
| bundle_uuid | UUID for the individual folder containing each loom file |
| project_name | The shorthand project name assigned by the HCA |
| source_id | Unique ID associated with the project |
| tissue_group | Tissue group the project belongs to (e.g., blood, brain, or kidney) |
| technology | Sequencing/library technology used (e.g., 10Xv2, 10Xv3) |
| seq_unit | Sequencing unit (cell or nucleus) |
| diagnosis | Indicates whether the sample came from diseased or normal tissue |
| organ | Tissue, as specified by the HCA, from which the sample was obtained |
| organ_part | Tissue region, as specified by the HCA, from which the sample was obtained |
| selected_cell_types | The group of cells selected for prior to sequencing, otherwise NA |
| s3_files_dir | Files directory on S3 |
| loom_file | Loom file name in the format tissue_group/project_name/bundle_uuid/filename |
  3. <project_name>-processed-libraries.tsv: This file contains the list of libraries from each project that are being used for testing data integration. This metadata file is required for most scripts, including running scpca-downstream-analyses using 01-run-downstream-analyses.sh and running the integration workflow. This file must contain the following columns, but may also contain additional columns related to a given dataset (see the sketch after this list):

| column_id | contents |
| --- | --- |
| sample_biomaterial_id | Unique ID associated with the individual sample |
| library_biomaterial_id | Unique ID associated with the individual library that was sequenced |
| project_name | The shorthand project name |
| integration_input_dir | The directory containing the SingleCellExperiment objects to be used as input to the data integration Snakemake workflow |
  4. hca-celltype-info.tsv: This file is not available in the repo and is stored in the private S3 bucket, s3://sc-data-integration/sample-info. It contains all available cell type information for projects listed in hca-project-metadata.tsv and was created using scripts/00a-reformat-celltype-info.R, which takes as input the cell type information available for each project from the Human Cell Atlas Data Portal. The cell type information for each project, in its original format, can be found in s3://sc-data-integration/human_cell_atlas_data/celltype. Each row corresponds to a single cell and contains the following information:

| column_id | contents |
| --- | --- |
| sample_biomaterial_id | Unique ID associated with the individual sample |
| library_biomaterial_id | Unique ID associated with the individual library that was sequenced |
| project | The shorthand project name assigned by the HCA |
| barcode | The unique cell barcode |
| celltype | The assigned cell type for a given barcode, obtained from cell type data stored in s3://sc-data-integration/human_cell_atlas_data/celltype |
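As a quick check that a metadata file will work with the scripts below, it can be read with readr and tested for the required columns (a minimal sketch, using the default hca-processed-libraries.tsv path from the scripts):

# sketch: confirm a processed-libraries file contains the required columns
metadata_df <- readr::read_tsv("sample-info/hca-processed-libraries.tsv")
required_cols <- c("sample_biomaterial_id", "library_biomaterial_id",
                   "project_name", "integration_input_dir")
stopifnot(all(required_cols %in% colnames(metadata_df)))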

Data files present on S3

All data and intermediate files are stored in the private S3 bucket, s3://sc-data-integration . The following data can be found in the above S3 bucket within the human_cell_atlas_data folder:

  • The loom folder contains the original loom files downloaded from the Human Cell Atlas data portal for each test dataset. Here, loom files are nested by tissue_group, project_name, and bundle_uuid.

  • The sce folder contains the unfiltered SingleCellExperiment objects saved as RDS files. These SingleCellExperiment objects have been converted from the loom files using the 00-obtain-sce.R script in the scripts directory of this repo. Here, RDS files are nested by tissue_group and project_name.

The following data can be found in the S3 bucket within the scib_simulated_data folder:

  • The hdf5 folder contains the original hdf5 files for simulated data obtained from figshare.

  • The sce folder contains the individual SingleCellExperiment objects stored as RDS files after running scripts/00b-obtain-sim-sce.R and scripts/00c-create-sim1-subsets.R.

A separate reference-files folder contains any reference files needed for processing datasets, such as the GTF file used to generate the mitochondrial gene list found in the reference-files folder of the repository.

In order to access these files, you must be a Data Lab staff member and have credentials set up for AWS. Additionally, some of the scripts in this repository require use of AWS command line tools. We have previously written up detailed instructions on installing the AWS command line tools and configuring your credentials that can be used as a reference.

After AWS command line tools have been set up, the SingleCellExperiment objects found in s3://sc-data-integration/human_cell_atlas_data/sce can be copied to your local computer by running the 00-obtain-sce.R script with the --copy_s3 flag.

Rscript scripts/00-obtain-sce.R --copy_s3

This will copy any SingleCellExperiment objects for libraries listed in hca-processed-libraries.tsv that have already been converted from loom files. If any libraries listed in hca-processed-libraries.tsv do not have corresponding SingleCellExperiment objects, running 00-obtain-sce.R will also convert those loom files.

Processed SingleCellExperiment objects to use for data integration

The human_cell_atlas_results/scpca-downstream-analyses folder contains all processed SingleCellExperiment objects and the output from running the core workflow in scpca-downstream-analyses. Within this folder, each library that has been processed has its own folder containing both the processed SingleCellExperiment object and an HTML summary report. The SingleCellExperiment objects in this folder have been filtered to remove empty droplets and run through scpca-downstream-analyses using the scripts/01-run-downstream-analyses.sh script. This means they contain a logcounts assay with the normalized counts matrix, both PCA and UMAP embeddings, and clustering assignments, which can be found in the louvain_10 column of the colData. The SingleCellExperiment objects present in human_cell_atlas_results/scpca-downstream-analyses should be used as input for integration methods.
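As a sketch (the file path is a placeholder), these expected contents can be verified in R:

library(SingleCellExperiment)
sce <- readr::read_rds("<path to processed library>.rds")  # placeholder path
assayNames(sce)                 # should include "logcounts"
reducedDimNames(sce)            # should include "PCA" and "UMAP"
table(colData(sce)$louvain_10)  # clustering assignments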

These files were produced and synced to S3 using the following script:

Note: To run the below script, you must have R (v4.1.2), Snakemake, and pandoc available in your path. pandoc must be version 1.12.3 or higher, which can be checked using the pandoc -v command.

bash scripts/01-run-downstream-analyses.sh \
  --downstream_repo <full path to scpca-downstream-analyses-repo> \
  --s3_bucket "s3://sc-data-integration/human_cell_atlas_results/scpca-downstream-analyses"

Note: If you wish to run ScPCA data through the integration workflow, rather than HCA data, please see the special guidelines for preparing ScPCA data.

Running the integration workflow

To run the integration workflow, invoke snakemake from the sc-data-integration directory with the following command:

snakemake -c4 --use-conda

You can adjust the number of cores by changing the value in the -c4 flag (here, 4 cores). Note that you will want to have set up the R conda environment already, especially if you are on an Apple Silicon Mac.

To run the workflow for development, you may wish to specify the config-test.yaml file, which will only run one project through the pipeline to save time:

snakemake -c4 --use-conda --configfile config-test.yaml

Finally, to run the scib_simulated data through the pipeline, use:

snakemake -c4 --use-conda --configfile config-scib_simulated.yaml

Code Snippets

scripts/02a-convert-sce-to-anndata.R (script lines 24-129):
project_root <- here::here()
renv::load(project_root)

# import libraries
library(magrittr)
library(optparse)
source(file.path(project_root, "scripts", "utils", "integration-helpers.R"))

# Set up optparse options
option_list <- list(
  make_option(
    opt_str = c("-l", "--library_file"),
    type = "character",
    default = file.path(project_root, "sample-info", "hca-processed-libraries.tsv"),
    help = "path to file listing all libraries that are to be converted"
  ),
  make_option(
    opt_str = c("-g", "--grouping_var"), 
    type = "character",
    default = "project_name", 
    help = "Column name present in the library metadata file that was used to 
    indicate which SCE objects should be integrated in `02-prepare-merged-sce.R`."
  ),
  make_option(
    opt_str = c("--merged_sce_dir"),
    type = "character",
    default = file.path(project_root, "results", "human_cell_atlas", "merged_sce"),
    help = "Path to folder where SCE objects to be converted are stored, 
    each file should contain the library ID in the filename and be stored as an RDS file.
    Typically this is the output from running scpca-downstream-analyses"
  ),
  make_option(
    opt_str = c("--anndata_output_dir"),
    type = "character",
    default = file.path(project_root, "results", "human_cell_atlas", "merged_anndata"),
    help = "path to folder where all AnnData files will be saved as HDF5 files"
  )
)

# Setup ------------------------------------------------------------------------

# Parse options
opt <- parse_args(OptionParser(option_list = option_list))

# checks that provided metadata files exist
if(!file.exists(opt$library_file)){
  stop("--library_file provided does not exist.")
}

# read in library metadata and grab names of grouping variable 
library_metadata_df <- readr::read_tsv(opt$library_file)

if(!opt$grouping_var %in% colnames(library_metadata_df)){
  stop("--grouping_var must correspond to a column provided in the library metadata file.")
}

group_names <- library_metadata_df %>%
  dplyr::pull(opt$grouping_var) %>%
  unique()

# setup output directory 
create_dir(opt$anndata_output_dir)

# Identify merged SCE files ----------------------------------------------------

# find SCE files that match library ID 
group_name_search <- paste(group_names, collapse = "|")
sce_files <- list.files(opt$merged_sce_dir,
                        pattern = group_name_search, 
                        recursive = TRUE,
                        full.names = TRUE)

# if the number of sce files differs from the number of groups to integrate, find the missing files
if(length(sce_files) < length(group_names)){

  groups_found <- stringr::str_extract(sce_files, group_name_search)
  missing_groups <- setdiff(group_names, groups_found)

  stop(
    glue::glue(
      "Missing SCE object for {missing_groups}.
      Make sure that you have run `02-prepare-merged-sce.R`."
    )
  )
}

# Write H5 ---------------------------------------------------------------------

# extract group names from file path to make sure we are writing the input to the correctly named output 
sce_file_names <- stringr::str_extract(sce_files, pattern = group_name_search)

# create paths to anndata h5 files 
anndata_files <- file.path(opt$anndata_output_dir,
                           paste0(sce_file_names, 
                                  "_anndata.h5"))

# small function to read in sce and export as anndata 
export_anndata <- function(sce_file,
                           anndata_file){

  sce <- readr::read_rds(sce_file)
  scpcaTools::sce_to_anndata(sce, anndata_file = anndata_file)
}

# apply export_anndata function to all sce files
purrr::walk2(sce_files, anndata_files, export_anndata)
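Given the optparse options defined above, a typical invocation of this script might be (all flags shown have the defaults listed above, so each is optional):

Rscript scripts/02a-convert-sce-to-anndata.R \
  --library_file sample-info/hca-processed-libraries.tsv \
  --grouping_var project_name \
  --merged_sce_dir results/human_cell_atlas/merged_sce \
  --anndata_output_dir results/human_cell_atlas/merged_anndata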
scripts/02-prepare-merged-sce.R (script lines 39-323):
project_root <- here::here()
renv::load(project_root)

# import libraries
library(magrittr)
library(optparse)
library(SingleCellExperiment)

# source helper functions
source(file.path(project_root, "scripts", "utils", "integration-helpers.R"))

# Set up optparse options
option_list <- list(
  make_option(
    opt_str = c("-l", "--library_file"),
    type = "character",
    default = file.path(project_root, "sample-info", "hca-processed-libraries.tsv"),
    help = "path to file listing all libraries that are to be converted"
  ),
  make_option(
    opt_str = c("-g", "--grouping_var"),
    type = "character",
    default = "project_name",
    help = "Column name present in the library metadata file to use for grouping SCE objects
    and merging."
  ),
  make_option(
    opt_str = c("--groups_to_integrate"),
    type = "character",
    default = "All",
    help = "Groups present in `grouping_var` column of metadata file to create merged SCE objects for.
      Default is 'All' which produces a merged object for each group in the metadata file. 
      Specify groups by using a vector, e.g. c('group1','group2')"
  ),
  make_option(
    opt_str = c("--add_celltype"),
    action = "store_true",
    default = FALSE,
    help = "Boolean indicating whether or not celltype data should be added to the 
     individual SCE object prior to merging."
  ),
  make_option(
    opt_str = c("--celltype_info"),
    type = "character",
    default = file.path(project_root, "sample-info", "hca-celltype-info.tsv"),
    help = "Path to file containing cell type information for each SCE object. 
      Must contain columns named `library_biomaterial_id`, `celltype`, and `barcode`."
  ),
  make_option(
    opt_str = c("--batch_column"),
    type = "character",
    default = "batch",
    help = "The name of the column in colData that indicates the batches for each cell,
      typically this corresponds to the library id. Default is 'batch'."
  ),
  make_option(
    opt_str = c("--random_merge"),
    default = FALSE,
    action = "store_true",
    help = "Used to indicate whether or not to merge SCE objects in a random order. Default is FALSE."
  ),
  make_option(
    opt_str = c("--subset_hvg"),
    default = FALSE,
    action = "store_true",
    help = "Indicates whether or not to subset the merged SCE object by highly variable genes.
      If --subset_hvg is used, the merged SCE object will only contain genes
      identified as highly variable genes."
  ),
  make_option(
    opt_str = c("--pca_use_all_genes"),
    default = FALSE,
    action = "store_true",
    help = "Indicates whether or not to use the all genes as input to performing
      principal component analysis. Otherwise only highly variable genes are used
      as input."
  ),
  make_option(
    opt_str = c("-n", "--num_hvg"),
    dest = "num_genes",
    type = "integer",
    default = 5000,
    help = "Number of highly variable genes to select."
  ),
  make_option(
    opt_str = c("--merged_sce_dir"),
    type = "character",
    default = file.path(project_root, "results", "human_cell_atlas", "merged_sce"),
    help = "path to folder where all merged SCE objects files will be saved as RDS files"
  ),
  make_option(
    opt_str = c("--seed"),
    type = "integer",
    default = NULL,
    help = "random seed to set prior to merging"
  )
)

# Setup ------------------------------------------------------------------------

# Parse options
opt <- parse_args(OptionParser(option_list = option_list))

set.seed(opt$seed)

# check that num genes provided is an integer
if(!is.integer(opt$num_genes)){
  stop("--num_hvg must be an integer.")
}

# checks that provided metadata files exist
if(!file.exists(opt$library_file)){
  stop("--library_file provided does not exist.")
}

# read in library metadata and grab unfiltered sce file paths
library_metadata_df <- readr::read_tsv(opt$library_file)

# check that cell type file exists if using add_celltype option 
if(opt$add_celltype){
  if(!file.exists(opt$celltype_info)){
    stop("--celltype_info file provided does not exist.")
  }
  celltype_info_df <- readr::read_tsv(opt$celltype_info)
}

# check that grouping variable is present
if(!opt$grouping_var %in% colnames(library_metadata_df)){
  stop("Must provide a grouping_var that is a column in the library metadata file.")
}

# define groups to integrate
groups <- library_metadata_df %>%
  dplyr::pull(opt$grouping_var) %>%
  unique()

# if groups_to_integrate is set to "All", use every group in the metadata file
if(length(opt$groups_to_integrate) == 1 && (opt$groups_to_integrate == "All")){
  groups_to_integrate <- groups
} else {
  # otherwise, split the space-separated string into group names (needed to parse a list passed from snakemake)
  groups_to_integrate <- unlist(stringr::str_split(opt$groups_to_integrate, " "))

  # check that specified groups are present in grouping_var column 
  if(!any(groups_to_integrate %in% groups)){
    stop("Provided `--groups_to_integrate` must also be present in the `--grouping_var` column of 
         the library metadata file.")
  }
}

# subset metadata file to only contain groups to integrate 
library_metadata_df <- library_metadata_df %>%
  dplyr::filter(.data[[opt$grouping_var]] %in% groups_to_integrate)

# grab library ids 
library_ids <- library_metadata_df %>%
  dplyr::pull(library_biomaterial_id)

# setup output directory
create_dir(opt$merged_sce_dir)

# Identify SCE files -----------------------------------------------------------

# grab all unique directories corresponding to projects considered for integration
input_sce_dirs <- library_metadata_df %>%
  dplyr::pull(integration_input_dir) %>%
  unique()

# find SCE files that match library ID, and throw an error if any are missing.
sce_files <- find_input_sce_files(library_ids, input_sce_dirs)

# Merge by group ---------------------------------------------------------------

# get the library IDs from the SCE file names so that we can name the SCEs in the correct order
library_ids_sce_order <- stringr::str_extract(sce_files, 
                                              pattern = paste(library_ids, collapse = "|"))

sce_file_df <- data.frame(sce_files = sce_files,
                          library_id = library_ids_sce_order) %>%
  dplyr::left_join(library_metadata_df, by = c("library_id" = "library_biomaterial_id")) %>%
  dplyr::select(library_id, opt$grouping_var, sce_files)

# group dataframe by the grouping variable
grouped_sce_file_df <- split(sce_file_df, sce_file_df[,opt$grouping_var])

# create a list of SCE lists that is named by the grouping variable with
# each individual inner SCE list named by the library IDs
create_grouped_sce_list <- function(sce_info_dataframe,
                                    celltype_info_df = NULL,
                                    add_celltype){

  library_sce_list = list()
  for (library_idx in 1:length(sce_info_dataframe$library_id)){

    # read sce list for each library
    sce <- readr::read_rds(sce_info_dataframe$sce_files[library_idx])
    library_name <- sce_info_dataframe$library_id[library_idx]

    if(add_celltype){
      # if celltype info is provided add to sce object 
      if(!is.null(celltype_info_df)){
        # check that library has cell type information
        if(library_name %in% unique(celltype_info_df$library_biomaterial_id)){

          # filter celltype info to only have info for specified library 
          filtered_celltype_info <- celltype_info_df %>%
            dplyr::filter(library_biomaterial_id == library_name) %>%
            dplyr::select(barcode, celltype)

          # add celltype info 
          sce <- add_celltype_info(sce_object = sce, 
                                   celltype_info_df = filtered_celltype_info) 

          # add flag indicating that cell type information is available 
          metadata(sce)$celltype_info_available <- TRUE 
        }

      # add_celltype was requested, but no cell type data frame was provided
      } else {
        # note that no cell type information is available
        colData(sce)$celltype <- NA
        metadata(sce)$celltype_info_available <- FALSE
      } 
    }

    # create a list for each group named by the library IDs
    library_sce_list[[library_name]] <- sce

  }

  return(library_sce_list)

}

# create grouped sce list with/without celltype addition
if(opt$add_celltype){
  grouped_sce_list <- grouped_sce_file_df %>%
    purrr::map(create_grouped_sce_list, celltype_info_df, add_celltype = opt$add_celltype) 
} else {
  grouped_sce_list <- grouped_sce_file_df %>%
    purrr::map(create_grouped_sce_list, add_celltype = opt$add_celltype)
}

# create a list of merged SCE objects by group
#  In this default usage, a batch column named `batch` will get created
merged_sce_list <- grouped_sce_list %>%
  purrr::map(combine_sce_objects, 
             preserve_rowdata_columns = c("Gene", "gene_names", "ensembl_ids"),
             random_merge = opt$random_merge)

# HVG and dim reduction --------------------------------------------------------

# apply HVG calculation to list of merged SCEs
# object will only be subset to HVG if subset_hvg is true
merged_sce_list <- merged_sce_list %>%
  purrr::map(~ set_var_genes(.x,
                             num_genes = opt$num_genes,
                             subset_hvg = opt$subset_hvg,
                             batch_column = opt$batch_column))

# add PCA and UMAP
# if --pca_use_all_genes is used, use all genes, otherwise only HVG are used
if(opt$pca_use_all_genes){
  merged_sce_list <- merged_sce_list %>%
    purrr::map( ~ perform_dim_reduction(.x,
                                        var_genes = rownames(.x),
                                        pca_type = "multi"))
} else {
  merged_sce_list <- merged_sce_list %>%
    purrr::map( ~ perform_dim_reduction(.x,
                                        var_genes = metadata(.x)$variable_genes,
                                        pca_type = "multi"))
}


# Write RDS --------------------------------------------------------------------

# create paths to merged SCE files
# named with the name of the sce list which corresponds to the grouping variable, not library ID
merged_sce_files <- file.path(opt$merged_sce_dir,
                              paste0(names(merged_sce_list),
                                     "_merged_sce.rds"))

# export files
purrr::walk2(merged_sce_list, merged_sce_files, readr::write_rds)
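Based on the options above, an example invocation of this script might be (the seed value is arbitrary):

Rscript scripts/02-prepare-merged-sce.R \
  --library_file sample-info/hca-processed-libraries.tsv \
  --grouping_var project_name \
  --num_hvg 5000 \
  --subset_hvg \
  --seed 2022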
scripts/03a-integrate-sce.R (script lines 66-351):
project_root <- here::here()
renv::load(project_root)

# import libraries
library(magrittr)
library(optparse)

# source integration functions
utils_path <- file.path(project_root, "scripts", "utils")
suppressPackageStartupMessages({
  source(file.path(utils_path, "integrate-fastMNN.R"))
  source(file.path(utils_path, "integrate-harmony.R"))
  source(file.path(utils_path, "integrate-seurat.R"))
  source(file.path(utils_path, "integration-helpers.R"))
})

# Set up optparse options
option_list <- list(
  make_option(
    opt_str = c("--input_sce_file"),
    type = "character",
    default = NULL,
    help = "Path to RDS file that contains the merged SCE object to integrate"
  ),
  make_option(
    opt_str = c("--output_sce_file"),
    type = "character",
    default = NULL,
    help = "Path to RDS file where the integrated SCE object will be saved"
  ),
  make_option(
    opt_str = c("--method"),
    type = "character",
    default = NULL,
    help = "Integration method to use, either `fastMNN`, `harmony`, or `Seurat` (case-insensitive)."
  ),
  make_option(
    opt_str = c("-b", "--batch_column"),
    type = "character",
    default = "batch",
    help = "Name of the column in the SCE object that indicates batch groupings."
  ),
  make_option(
    opt_str = c("--seed"),
    type = "integer",
    default = NULL,
    help = "random seed to set during integration"
  ),
  make_option(
    opt_str = c("--harmony_covariate_cols"),
    type = "character",
    default = NULL,
    help = "Optional comma-separated list of columns (e.g. patient, sex) to consider as covariates
            during integration with `harmony`."
  ),
  make_option(
    opt_str = c("--harmony_from_expression"),
    action = "store_false",
    default = TRUE,
    help = "Indicate whether to use the gene expression matrix, rather than PCs, during `harmony` integration.
    By default, PCs are used; to use expression instead, use `--harmony_from_expression`"
  ),
  make_option(
    opt_str = c("--fastmnn_no_cosine"),
    action = "store_false",
    default = TRUE,
    help = "Indicate whether to turn off cosine normalization during `fastMNN` integration.
    To turn off cosine normalization, use `--fastmnn_no_cosine`"
  ),
  make_option(
    opt_str = c("--fastmnn_use_all_genes"),
    action = "store_true",
    default = FALSE,
    help = "Indicate whether to use all genes instead of only HVGs during `fastMNN` integration.
    To use all genes, use `--fastmnn_use_all_genes`"
  ),
  make_option(
    opt_str = c("--fastmnn_auto_merge"),
    action = "store_true",
    default = FALSE,
    help = "Indicates whether or not to use the auto.merge option for `fastMNN` integration.
    To perform auto.merge, use `--fastmnn_auto_merge`."
  ),
  make_option(
    opt_str = c("--seurat_reduction_method"),
    type = "character",
    default = NULL,
    help = "Reduction method to use during Seurat integration."
  ),
  make_option(
    opt_str = c("--seurat_num_genes"),
    type = "numeric",
    default = 2000,
    help = "Number of variable genes for Seurat to identify and use."
  ),
  make_option(
    opt_str = c("--seurat_integration_dims"),
    type = "numeric",
    default = 30,
    help = "Number of dimensions for Seurat to use during integration."
  ),
  make_option(
    opt_str = c("--seurat_umap_dims"),
    type = "numeric",
    default = 30,
    help = "Number of dimensions for Seurat to use during UMAP calculation."
  ),
  make_option(
    opt_str = c("--seurat_anchor_threshold"),
    type = "numeric",
    default = 100,
    help = "Minimum threshold for number of neighbors to consider when weighting anchors during integration."
  ),
  make_option(
    opt_str = c("--corrected_only"),
    action = "store_true",
    default = FALSE,
    help = "Indicate whether to only return the corrected gene expression data, and
    not uncorrected expression, in the integrated SCE object. To return only
    corrected expression, use `--corrected_only`."
  )
)


# --corrected_only: Flag to specify that only corrected gene expression values should
#   be returned in the integrated SCE object. Default usage of this script will
#   return all data.

# Setup ------------------------------------------------------------------------
# Parse options
opt <- parse_args(OptionParser(option_list = option_list))



# Check and assign provided method based on available methods
available_methods <- c("fastmnn", "harmony", "seurat")

# helper function for method check fails
stop_method <- function() {
  stop(
    paste("You must specify one of the following (case-insensitive) to --method:",
          paste(available_methods, collapse = ", ")
    )
  )
}

if (is.null(opt$method)) {
  stop_method()
} else {
  integration_method <- tolower(opt$method)
  if (!(integration_method %in% available_methods)) {
    stop_method()
  }
}


# Check that provided input file exists and is an RDS file
if(is.null(opt$input_sce_file)) {
  stop("You must provide the path to the RDS file with merged SCEs to --input_sce_file")
} else {
  if(!file.exists(opt$input_sce_file)) {
    stop("Provided --input_sce_file file does not exist. Make sure that you have
          run `02-prepare-merged-sce.R` or the provided path is correct.")
  }
}

# Check that both input and output files have RDS extensions
if(!(grepl("\\.rds$", opt$input_sce_file, ignore.case = TRUE)) ||
   !(grepl("\\.rds$", opt$output_sce_file, ignore.case = TRUE))) {
  stop("The provided --input_sce_file and --output_sce_file files must be RDS files.")
}


# Check that directory for output file exists and the specified file is an RDS file
integrated_sce_dir <- dirname(opt$output_sce_file)
create_dir(integrated_sce_dir)


# Read in SCE file -----------------------------
merged_sce_obj <- readr::read_rds(opt$input_sce_file)


# Perform integration with fastMNN, if specified -------------------------
if (integration_method == "fastmnn") {

  # Prepare `gene_list` argument
  if (opt$fastmnn_use_all_genes) {
    fastmnn_gene_list <- NULL
  } else {
    fastmnn_gene_list <- metadata(merged_sce_obj)$variable_genes
  }

  # Perform integration
  integrated_sce_obj <- integrate_fastMNN(
    merged_sce_obj,
    batch_column = opt$batch_column,
    cosine_norm  = opt$fastmnn_no_cosine,
    gene_list    = fastmnn_gene_list,
    seed         = opt$seed,
    auto.merge   = opt$fastmnn_auto_merge
  )

  # add note about auto merge to metadata of integrated object
  if(opt$fastmnn_auto_merge){
    auto_merge <- "auto"
  } else {
    auto_merge <- "input"
  }

  metadata(integrated_sce_obj)$fastmnn_auto_merge <- auto_merge

}

# Perform integration with harmony, if specified -------------------------
if (integration_method == "harmony") {

  # Set up `covariate_cols` argument based on user options
  if (is.null(opt$harmony_covariate_cols)) {
    harmony_covariate_cols <- c()
  } else {
    harmony_covariate_cols <- unlist(stringr::str_split(opt$harmony_covariate_cols, "\\s*,\\s*"))
  }

  # Perform integration
  integrated_sce_obj <- integrate_harmony(
    merged_sce_obj,
    batch_column = opt$batch_column,
    covariate_cols = harmony_covariate_cols,
    from_pca = opt$harmony_from_expression,
    seed = opt$seed
  )
}

# Perform integration with seurat, if specified -------------------------
if (integration_method == "seurat") {

  # Convert the merged sce object into a list of seurat objects
  seurat_list <- prepare_seurat_list(merged_sce_obj)

  # Perform integration
  integrated_seurat_obj <- integrate_seurat(
    seurat_list,
    opt$seurat_reduction_method, # integrate_seurat() will check this argument
    batch_column = opt$batch_column,
    num_genes = opt$seurat_num_genes,
    integration_dims = 1:opt$seurat_integration_dims,
    umap_dims = 1:opt$seurat_umap_dims,
    anchor_threshold = opt$seurat_anchor_threshold
  )

  # Convert the integrated seurat object into an SCE object
  integrated_sce_obj <- as.SingleCellExperiment(integrated_seurat_obj)

  # Rename assays appropriately:

  # The converted `logcounts` is the CORRECTED expression
  assay(integrated_sce_obj, paste0(opt$seurat_reduction_method, "_corrected")) <- logcounts(integrated_sce_obj)

  # The logcounts from the SCT altExp are the uncorrected logcounts
  logcounts(integrated_sce_obj) <- logcounts( altExp(integrated_sce_obj, "SCT") )[rownames(integrated_sce_obj),]

  # The counts from the RNA altExp are the uncorrected counts
  counts(integrated_sce_obj) <- counts( altExp(integrated_sce_obj, "RNA") )[rownames(integrated_sce_obj),]

  # Remove altExps
  integrated_sce_obj <- removeAltExps(integrated_sce_obj)

  # Convert reducedDims names back to <lowercase>_<UPPERCASE> because 
  #   `as.SingleCellExperiment` makes them all uppercase
  reducedDimNames(integrated_sce_obj) <- stringr::str_replace_all(
    reducedDimNames(integrated_sce_obj),
    toupper(opt$seurat_reduction_method),
    opt$seurat_reduction_method
  )

}


# Remove uncorrected expression values, if specified ----------
if (opt$corrected_only) {
  integrated_sce_obj <- remove_uncorrected_expression(integrated_sce_obj)
}


# Write integrated SCE object to RDS -------------------
readr::write_rds(integrated_sce_obj, opt$output_sce_file)
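For example, to integrate a merged object with fastMNN (file paths are placeholders; the seed is arbitrary):

Rscript scripts/03a-integrate-sce.R \
  --input_sce_file results/human_cell_atlas/merged_sce/<project>_merged_sce.rds \
  --output_sce_file <output directory>/<project>_fastmnn_sce.rds \
  --method fastMNN \
  --seed 2022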
scripts/03b-integrate-scanorama.py (script lines 14-99):
import os
import anndata as adata
import argparse
import re
from utils.integrate_scanorama import integrate_scanorama

# define arguments
parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input_anndata',
                    dest = 'input_anndata',
                    required = True,
                    help = 'Path to HDF5 file with merged AnnData object to integrate')
parser.add_argument('-o', '--output_anndata',
                    dest = 'output_anndata',
                    required = True,
                    help = 'Path to HDF5 file to save the integrated AnnData object')
parser.add_argument('-b', '--batch_column',
                    dest = 'batch_column',
                    default = 'batch',
                    help = 'The name of the column in `anndata.obs` that indicates the batches for each cell, '
                            ' typically this corresponds to the library id.')
parser.add_argument('--use_hvg',
                    dest = 'use_hvg',
                    action = 'store_true',
                    default = False,
                    help = 'Boolean indicating whether or not to use only highly variable genes for data integration.'
                    'If --use_hvg is used, integration will only consider highly variable genes, and similarly '
                    'the returned integrated object will only contain the highly variable genes.')
parser.add_argument('--corrected_only',
                    dest = 'corrected_only',
                    action = 'store_true',
                    help = 'Boolean indicating to return only the corrected data or all raw data.'
                    ' Default will return all data. To return only corrected data, use --corrected_only.')
parser.add_argument('-s', '--seed',
                    dest = 'seed',
                    type=int,
                    default = None,
                    help = 'Random seed to set for scanorama.')

args = parser.parse_args()

# compile extension regex
file_ext = re.compile(r"\.hdf5$|\.h5$", re.IGNORECASE)

# check that input file exists, if it does exist, make sure it's an h5 file
if not os.path.exists(args.input_anndata):
    raise FileNotFoundError("--input_anndata file not found.")
elif not file_ext.search(args.input_anndata):
    raise ValueError("--input_anndata must end in either .hdf5 or .h5 and contain a merged AnnData object.")

# make sure output file path is h5 file
if not file_ext.search(args.output_anndata):
    raise ValueError("--output_anndata must provide a file path ending in either .hdf5 or .h5.")

# check that output file directory exists and create directory if doesn't exist
integrated_adata_dir = os.path.dirname(args.output_anndata)
if not os.path.isdir(integrated_adata_dir):
    os.makedirs(integrated_adata_dir, exist_ok = True)

# read in merged anndata object
merged_adata = adata.read_h5ad(args.input_anndata)

if args.use_hvg:
    try:
        var_genes = list(merged_adata.uns['variable_genes'])
    except KeyError:
        print("Variable genes cannot be found in anndata object."
              " If using --use_hvg, make sure HVG are stored in adata.uns['variable_genes'].")
        raise
else:
    var_genes = list(merged_adata.var_names)

# integrate anndata with scanorama
scanorama_integrated_adata = integrate_scanorama(merged_adata,
                                                 integrate_genes = var_genes,
                                                 batch_column = args.batch_column,
                                                 seed = args.seed)

# remove raw data and logcounts to minimize space if corrected_only is true
if args.corrected_only:
    print("Removing raw data and log-normalized data. Only corrected data will be returned.")
    scanorama_integrated_adata.X = None
    del scanorama_integrated_adata.layers["logcounts"]

# write anndata to h5
scanorama_integrated_adata.write(filename = args.output_anndata)
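An example invocation, following the argparse definitions above (paths are placeholders; the seed is arbitrary):

python scripts/03b-integrate-scanorama.py \
  --input_anndata results/human_cell_atlas/merged_anndata/<project>_anndata.h5 \
  --output_anndata <output directory>/<project>_scanorama.h5 \
  --use_hvg \
  --seed 2022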
scripts/03c-integrate-scvi.py (script lines 16-134):
import os
import anndata as adata
import argparse
import re
from utils.integrate_scvi import integrate_scvi

# define arguments
parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input_anndata',
                    dest = 'input_anndata',
                    required = True,
                    help = 'Path to HDF5 file with merged AnnData object to integrate')
parser.add_argument('-o', '--output_anndata',
                    dest = 'output_anndata',
                    required = True,
                    help = 'Path to HDF5 file to save the integrated AnnData object')
parser.add_argument('-b', '--batch_column',
                    dest = 'batch_column',
                    default = 'batch',
                    help = 'The name of the column in `anndata.obs` that indicates the batches for each cell, '
                            ' typically this corresponds to the library id.')
parser.add_argument('--categorical_covariates',
                    dest = 'categorical_covariates',
                    default = None,
                    type = str,
                    help = 'A comma-separated list of columns containing additional categorical data to be'
                    ' included as a covariate. Default is None.')
parser.add_argument('--continuous_covariates',
                    dest = 'continuous_covariates',
                    default = "subsets_mito_percent",
                    type = str,
                    help = 'A comma-separated list of columns containing additional continous data to be'
                    ' included as a covariate. Default is "subsets_mito_percent".')
parser.add_argument('--num_latent',
                    dest = 'num_latent',
                    type = int,
                    default = 20,
                    help = 'Number of dimensions to return from the latent space.')
parser.add_argument('--use_hvg',
                    dest = 'use_hvg',
                    action = 'store_true',
                    default = False,
                    help = 'Boolean indicating whether or not to use only highly variable genes for data integration.'
                    'If --use_hvg is used, the returned integrated object will only contain the highly variable genes.')
parser.add_argument('--corrected_only',
                    dest = 'corrected_only',
                    action = 'store_true',
                    help = 'Boolean indicating to return only the corrected data or all raw data.'
                    ' Default will return all data. To return only corrected data, use --corrected_only.')
parser.add_argument('-s', '--seed',
                    dest = 'seed',
                    type=int,
                    default = None,
                    help = 'Random seed to set for scVI.')

args = parser.parse_args()

# compile extension regex
file_ext = re.compile(r"\.hdf5$|\.h5$", re.IGNORECASE)

# check that input file exists, if it does exist, make sure it's an h5 file
if not os.path.exists(args.input_anndata):
    raise FileNotFoundError("--input_anndata file not found.")
elif not file_ext.search(args.input_anndata):
    raise ValueError("--input_anndata must end in either .hdf5 or .h5 and contain a merged AnnData object.")

# make sure output file path is h5 file
if not file_ext.search(args.output_anndata):
    raise ValueError("--output_anndata must provide a file path ending in either .hdf5 or .h5.")

# check that output file directory exists and create directory if doesn't exist
integrated_adata_dir = os.path.dirname(args.output_anndata)
os.makedirs(integrated_adata_dir, exist_ok = True)

# read in merged anndata object
merged_adata = adata.read_h5ad(args.input_anndata)

if args.use_hvg:
    try:
        var_genes = list(merged_adata.uns['variable_genes'])
    except KeyError:
        print("Variable genes cannot be found in anndata object."
              " If using --use_hvg, make sure HVG are stored in adata.uns['variable_genes'].")
        raise
else:
    var_genes = list(merged_adata.var_names)

# split covariates from comma separated strings into lists
if args.categorical_covariates:
    categorical_covariates = [covariate.strip() for covariate in args.categorical_covariates.split(',')]
else:
    categorical_covariates = None

if args.continuous_covariates:
    if args.continuous_covariates == "None":
        continuous_covariates = None
    else:
        continuous_covariates = [covariate.strip() for covariate in args.continuous_covariates.split(',')]
else:
    continuous_covariates = None

# integrate anndata with scvi
scvi_integrated_adata = integrate_scvi(merged_adata,
                                       integrate_genes = var_genes,
                                       batch_column = args.batch_column,
                                       categorical_covariate_columns = categorical_covariates,
                                       continuous_covariate_columns = continuous_covariates,
                                       num_latent = args.num_latent,
                                       seed = args.seed)

# remove raw data and logcounts to minimize space if corrected_only is true
if args.corrected_only:
    print("Removing raw data and log-normalized data. Only corrected data will be returned.")
    scvi_integrated_adata.X = None
    del scvi_integrated_adata.layers["logcounts"]


# write anndata to h5
scvi_integrated_adata.write(filename = args.output_anndata)
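An example invocation (paths are placeholders; the seed is arbitrary; the covariate and latent-dimension values shown are the script defaults):

python scripts/03c-integrate-scvi.py \
  --input_anndata results/human_cell_atlas/merged_anndata/<project>_anndata.h5 \
  --output_anndata <output directory>/<project>_scvi.h5 \
  --continuous_covariates subsets_mito_percent \
  --num_latent 20 \
  --use_hvg \
  --seed 2022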
scripts/04-post-process-anndata.R (script lines 20-199):
project_root <- here::here()
renv::load(project_root)

# import libraries
suppressPackageStartupMessages({
  library(magrittr)
  library(optparse)
  library(SingleCellExperiment) 
})
# Load helper functions
source(file.path(project_root, "scripts", "utils", "integration-helpers.R"))

# Set up optparse options
option_list <- list(
  make_option(
    opt_str = c("--input_anndata_file"),
    type = "character",
    default = NULL,
    help = "Path to H5 file that contains the integrated AnnData object"
  ),
  make_option(
    opt_str = c("--output_sce_file"),
    type = "character",
    default = NULL,
    help = "Path to RDS file where the integrated SCE object will be saved"
  ),
  make_option(
    opt_str = c("--method"),
    type = "character",
    default = NULL,
    help = "Method that was used for integration, either `scanorama` or `scVI` (case-insensitive)."
  ),
  make_option(
    opt_str = c("--seed"),
    type = "integer",
    default = NULL,
    help = "random seed to set"
  ),
  make_option(
    opt_str = c("--corrected_only"),
    action = "store_true",
    default = FALSE,
    help = "Indicate whether the integrated AnnData object contains only 
    corrected gene expression data, and is missing uncorrected expression (no `X` matrix). 
    To indicate the object only contains corrected data, use `--corrected_only`."
  )
)

# Setup ------------------------------------------------------------------------
# Parse options
opt <- parse_args(OptionParser(option_list = option_list))

set.seed(opt$seed)

# Check provided method based on available methods
available_methods <- c("scanorama", "scvi")

# helper function for method check fails
stop_method <- function() {
  stop(
    paste("You must specify one of the following (case-insensitive) to --method:",
          paste(available_methods, collapse = ", ")
    )
  )
}

if (is.null(opt$method)) {
  stop_method()
} else {
  integration_method <- tolower(opt$method)
  if (!(integration_method %in% available_methods)) {
    stop_method()
  }
}

# Check that the provided input file exists
if(is.null(opt$input_anndata_file)) {
  stop("You must provide the path to the H5 file with merged SCEs to --input_anndata_file")
} else {
  if(!file.exists(opt$input_anndata_file)) {
    stop("Provided --input_anndata_file file does not exist.")
  }
}

# Check that both input and output files have correct extensions
if(!(stringr::str_ends(opt$input_anndata_file, ".hdf5|.h5"))){
  stop("`--input_anndata_file` must end in either '.hdf5' or '.h5'")
}
if(!(stringr::str_ends(opt$output_sce_file, ".rds"))){
  stop("`--output_sce_file` must end in '.rds'")
}

# Check that directory for output file exists
integrated_sce_dir <- dirname(opt$output_sce_file)
create_dir(integrated_sce_dir)

# Anndata to SCE post-processing -----------------------------------------------

# read in AnnData object as H5
integrated_sce <- zellkonverter::readH5AD(opt$input_anndata_file)

# Remove uncorrected expression values if corrected_only is set
# We do this after zellkonverter, which always retains a counts assay, even if empty.
if (opt$corrected_only) {
  integrated_sce <- remove_uncorrected_expression(integrated_sce)
}


# replace . added in anndata names with - found in other sce objects
names(metadata(integrated_sce)) <- stringr::str_replace_all(names(metadata(integrated_sce)), "\\.", "-")
names(colData(integrated_sce)) <- stringr::str_replace_all(names(colData(integrated_sce)), "\\.", "-")
names(rowData(integrated_sce)) <- stringr::str_replace_all(names(rowData(integrated_sce)), "\\.", "-")

# check for corrected data 
corrected_assay_name <- paste(integration_method, "corrected", sep = "_")
if(!corrected_assay_name %in% assayNames(integrated_sce)){
  stop("integrated SCE missing corrected gene expression data in assays.")
}

# function to check for SVD or latent embeddings
check_pseudo_pca <- function(integrated_sce, 
                             pseudo_pca_name){
  # check corrected data and SVD were converted
  if(!pseudo_pca_name %in% reducedDimNames(integrated_sce)){
    stop(glue::glue("integrated SCE missing {pseudo_pca_name} in reduced dimensions slot."))
  }
}

# function to run UMAP with respective pseudo PCA name 
add_umap <- function(integrated_sce,
                     pseudo_pca_name,
                     umap_name){

  integrated_sce <- integrated_sce %>%
    scater::runUMAP(dimred = pseudo_pca_name,
                    name = umap_name)
}

# calculate UMAP dependent on integration method 
if(integration_method == "scanorama"){

  pseudo_pca_name <- "scanorama_SVD"
  check_pseudo_pca(integrated_sce,
                   pseudo_pca_name)

  umap_name <- "scanorama_UMAP"

  # run UMAP using the SVD as input 
  integrated_sce <- add_umap(integrated_sce,
                             pseudo_pca_name,
                             umap_name)

}

if(integration_method == "scvi"){

  pseudo_pca_name <- "scvi_latent"
  check_pseudo_pca(integrated_sce,
                   pseudo_pca_name)

  umap_name <- "scvi_UMAP"

  # run UMAP using the latent embeddings as input 
  integrated_sce <- add_umap(integrated_sce,
                             pseudo_pca_name,
                             umap_name)

}

# Write RDS --------------------------------------------------------------------

readr::write_rds(integrated_sce, opt$output_sce_file)
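An example invocation (paths are placeholders; --method must be scanorama or scvi):

Rscript scripts/04-post-process-anndata.R \
  --input_anndata_file <output directory>/<project>_scvi.h5 \
  --output_sce_file <output directory>/<project>_scvi_sce.rds \
  --method scvi \
  --seed 2022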
From main/Snakefile, lines 24-28:
shell:
    """
    Rscript -e "renv::restore(lockfile = '{input}')" &> {log}
    date -u -Iseconds  > {output}
    """
From main/Snakefile, lines 40-54:
shell:
    """
    Rscript scripts/02-prepare-merged-sce.R \
      --library_file "{input.processed_tsv}" \
      --add_celltype {config[add_celltype]} \
      --celltype_info "{config[celltype_file]}" \
      --grouping_var {config[grouping_var]} \
      --groups_to_integrate "{config[groups_to_integrate]}" \
      --merged_sce_dir "{output}" \
      --num_hvg {config[num_hvg]} \
      --subset_hvg \
      --seed {config[seed]} \
      {params.random_merge_flag} \
      &> {log}
    """
From main/Snakefile, lines 65-73:
shell:
    """
    Rscript scripts/02a-convert-sce-to-anndata.R \
      --library_file "{input.processed_tsv}" \
      --merged_sce_dir "{input.merged_sce_dir}" \
      --grouping_var {config[grouping_var]} \
      --anndata_output_dir "{output}" \
      &> {log}
    """
From main/Snakefile, lines 89-99:
shell:
    """
    Rscript scripts/03a-integrate-sce.R \
      --input_sce_file "{input.merged_sce_dir}/{params.merged_sce_file}" \
      --output_sce_file "{output}" \
      --method fastMNN \
      --seed {config[seed]} \
      --corrected_only \
      {params.auto_merge_flag} \
      &> {log}
    """
From main/Snakefile, lines 113-122:
shell:
    """
    Rscript scripts/03a-integrate-sce.R \
      --input_sce_file "{input.merged_sce_dir}/{params.merged_sce_file}" \
      --output_sce_file "{output}" \
      --method harmony \
      --seed {config[seed]} \
      --corrected_only \
      &> {log}
    """
From main/Snakefile, lines 139-149:
shell:
    """
    Rscript scripts/03a-integrate-sce.R \
      --input_sce_file "{input.merged_sce_dir}/{params.merged_sce_file}" \
      --output_sce_file "{output}" \
      --method seurat \
      --seurat_reduction_method {wildcards.method} \
      --num_genes {params.num_genes} \
      --corrected_only \
      &> {log}
    """
From main/Snakefile, lines 161-170:
shell:
    """
    python scripts/03b-integrate-scanorama.py \
      --input_anndata "{input.merged_anndata_dir}/{params.merged_anndata_file}" \
      --output_anndata "{output}" \
      --seed {config[seed]} \
      --use_hvg \
      --corrected_only \
      &> {log}
    """
From main/Snakefile, lines 182-193:
shell:
    """
    python scripts/03c-integrate-scvi.py \
      --input_anndata "{input.merged_anndata_dir}/{params.merged_anndata_file}" \
      --output_anndata "{output}" \
      --continuous_covariates {config[continuous_covariates]} \
      --num_latent {config[num_latent]} \
      --seed {config[seed]} \
      --use_hvg \
      --corrected_only \
      &> {log}
    """
From main/Snakefile, lines 203-212:
shell:
    """
    Rscript scripts/04-post-process-anndata.R \
      --input_anndata_file "{input}" \
      --output_sce_file "{output}" \
      --method "{wildcards.method}" \
      --seed {config[seed]} \
      --corrected_only \
      &> {log}
    """
From main/Snakefile, lines 227-241:
shell:
    """
    Rscript -e \
    "rmarkdown::render('analysis_templates/01-single-group-integration-check-template.Rmd', \
                        clean = TRUE, \
                        output_file = '{output}', \
                        output_dir = dirname('{output}'), \
                        params = list(group_name = '{wildcards.project}', \
                                      merged_sce_dir = '{workflow.basedir}/{input.merged_sce_dir}', \
                                      integrated_sce_dir = '{workflow.basedir}/{params.integrated_sce_dir}', \
                                      integration_methods = '{config[integration_methods]}', \
                                      library_metadata = '{workflow.basedir}/{config[processed_tsv]}' , \
                                      max_celltypes = {config[max_celltypes]}))" \
    &> {log}
    """
Source repository: https://github.com/AlexsLemonade/sc-data-integration