A single-cell RNA-seq pipeline

public public 1yr ago Version: v1.0.6 0 bookmarks

bollito: single-cell RNA-seq pipeline.

Pipeline status

Introduction

bollito is a Snakemake pipeline that performs a comprehensive single-cell RNA-seq analysis, covering both the basic steps (QC, alignment, quantification and cell specific QC) and more advanced downstream analyses (clustering, diferential expresion, trajectory inference, functional analysis and RNA velocity).

This pipeline uses state-of-the-art single-cell RNA-seq tools like Seurat , STARsolo , Vision , Slingshot , and Velocyto .

The pipeline makes extensive use of Snakemake's integration with the conda package manager and docker/singularity containers, to automatically take care of software requirements and dependencies.

bollito has two main modes of execution depending on the input data:

  • From FASTQ : it accepts FASTQ-formatted raw data (from drop-seq or 10x Genomics experiments).

  • From matrices including:

    • Feature-barcode matrices (expression matrix format from STARsolo).

    • Standard matrices (cell names and gene names included in the matrix).

We've built bollito for flexibility, with the aim of allowing the user to adjust the pipeline to different experiments using configuration parameters. This includes adjusting the cell filtering, normalization, variables regression, number of significant components, clustering resolution, etc. When in doubt, the default parameters were calculated to offer a good starting configuration. Additionally, once the main steps of the pipeline are finished, bollito creates a AnnData file from the Seurat file and test if it is correctly formatted. This feature improves the interconnection between R and Python, helping users which might want to perform analysis with tools from both languages.

Workflow overview

This is a schema of the complete workflow. Certain parts may be skipped depending on input data and chosen configuration.

bollito workflow diagram

Authors

  • Luis García-Jimeno

  • Coral Fustero-Torre

  • María José Jiménez-Santos

  • Gonzalo Gómez-López

  • Tomás Di Domenico

  • Fátima Al-Shahrour

Setup

The setup of the pipeline consists in the modification of three configuration files, indicating the desired parameters and the location of the input files. A general description of these files follows. See the Usage section for more details.

Configuration files

  • config.yaml contains all pipeline parameters.

  • samples.tsv contains information on the samples to be analysed.

  • units.tsv : contains information on the different data files associated to the samples to be analysed.

Input files

  • raw data in gzip compressed FASTQ files

or

  • matrices of count data

    • 10x -like input (matrix.mtx + genes.tsv + barcodes.tsv)

    or

    • standard tabular format (where rows are genes and columns are cells)

Usage

1. Set up the environment

bollito requires the conda package manager in order to work. If you have singularity installed and would rather not install conda, you can provide the "--use-singularity" option when running the pipeline (see below). Otherwise, please install conda by following the bioconda installation instructions .

2. Download bollito repository from Gitlab.

Use git clone command to create a local copy.

git clone https://gitlab.com/bu_cnio/bollito.git

3. Configure the pipeline.

Before executing the pipeline, the users must configure it according to their samples. To do this, they must fill these files:

TIP: different analysis can be run using just one cloned repository. This is achieved by changing the outdir and logdir in the configuration file. Also different parameters values can be used in the different analysis.

a. samples.tsv

This file contains information on the samples to be analyzed. The first column, called "sample", is mandatory, and defines the sample name for each sample. The other columns are used to define the samples. This information is stored as metadata in the Seurat object, so, every cell belonging to that sample is labeled with the value defined by the user, meanwhile the column heaeder will correspond to the condition name.

An example file ( samples-example.tsv) is included in the repository.

Rename it to samples.tsv and edit its contents to list the sample names and the features related to each sample.

b. units.tsv

This file is used to configure the input files.

There are two example files, depending on the type of input data:

  • If your input are FASTQ files:

An example file ( units-example_fastq.tsv ) is included in the repository.

Rename it to units.tsv and edit its contents according to the following table:

Field name Description
sample Sample name (must match the sample name specified in samples.tsv ).
unit A distinct name for each pair of files associated to the same sample (for example in the case of replicates).
fq1 FASTQ file for read 1, containing the Cell Barcode and UMI.
fq2 FASTQ file for read 2, containing the transcriptomic sequence.
  • If your input are matrix files:

An example file ( units-example_matrices.tsv ) is included in the repository.

Rename it to units.tsv and edit its contents according to the following table:

Field name Description
sample Sample name (must match the sample name specified in samples.tsv ).
matrix Matrix file (.mtx for 10x or .tsv for standard) storing the counts.
cell_names (10x only) tsv file containing one cell name per row.
gene_names (10x only) tsv file containing one gene name per row.
metadata (optional) tsv file with two or more columns. First column corresponds to each cell name specified in cell_names.tsv and the rest are metadata variables. First row indicates the metadata variable name.

c. config.yaml

This is the pipeline configuration file, where you can tune all the available parameters to customise your single-cell analysis.

The example file ( config-example.yaml ) features extensive inline documentation.

Here are some of the main available parameters:

Parameter Description
input_type Type of input data ( fastq or matrix ).
technology Technology used to get the reads files ( 10x or Drop-seq ).
outdir Directory where to store the output files.
logdir Directory where to store the log files.
graphics Graphic card availability.
random_seed Seed parameter to allow for reproducible analyses.
case Type of case used to represent the gene names, must be use according the specie and genesets used.
annotation GTF file holding genetic features information.
idx Folder containing STAR genomes indexes.
whitelist Cell barcodes whitelist file needed for 10x experiments quantification and demultiplexing.

d. metadata.tsv (optional)

This file is optional and it is only used when the input file used are matrices. The purpose of this file is to annotate each individual cell, storing that information in the Seurat object. It is a tsv file with two or more columns. The first column corresponds to each cell name specified in cell_names.tsv and the rest are the values of the metadata variables. First row of each column indicates the metadata variable name.

4. Create the Conda environments.

To run the pipeline, the user needs to create the conda environments first, which will take some minutes. This step is done automatically using this command:

snakemake --use-conda --conda-create-envs-only --conda-frontend mamba

5. Run the pipeline.

Once the pipeline is configured and conda environments are created, the user just needs to run bollito.

snakemake --use-conda -j 32

The mandatory arguments are:

  • --use-conda : to install and use the conda environemnts.

  • -j : number of threads/jobs provided to snakemake.

Pipeline steps

Here you can find a general description of the main steps of the pipeline.

1. FASTQ quality control.

General QC

bollito implements FastQC to check the overal quality of the input FASTQ files.

Contamination

FastQ Screen can optionally be enabled in order to check the input FASTQ files for contaminants.

MultiQC report

bollito creates a Quality Control HTML report using MultiQC . This report includes information from FastQC and RSeQC (which will be explained later).

2. Alignment & quantification.

To obtain the cell expression profiles we need to align the FASTQ files against an annotated reference genome. Once we obtain the aligned files, the pipeline assigns the reads to their corresponding cell barcodes, and performs an UMI-based quantification of the annotated features.

The alignment tool implemented in bollito is STAR , taking advantage of its STARsolo mode for single-cell RNA-seq quantification.

This step requires the following parameters, defined in the configuration file:

  • An annotation file containing the features of interest (in GTF format, must match the target genome)

  • One of the following options for the genome:

    • a FASTA file (will be indexed by bollito).

    • a STAR genome index.

  • Technology, which includes Drop-seq, 10x (chemistry version also needs to be specified), or a custom technology where the user will need to set the CB and UMI length.

  • The corresponding CB whitelist.

Example parameters for different STAR configurations are available in the example config file.

3. Alignment quality control.

Quality control of the resulting alignments is performed using RSeQC .

4. Cell quality control, normalization, and dimensionality reduction.

Once the UMI-count matrix is obtained, bollito performs a cell-based quality control. The purpose of this step is to control for broken cells and doublets, and to analyze the cell cycle status of the cells. Next, the normalization and dimensionality reduction steps are applied.

Cell quality control, normalization, and dimensionality reduction are implemented using the Seurat package.

Cell cycle checking is implemented based on molecular signatures from Tirosh et al, 2015 .

In order to customize this step, the following parameters can be adjusted via the configuration file:

  • Minimum and maximum number of detected genes per cell.

  • Minimum and maximum read counts per cell.

  • Minimum ribosomal content.

  • Minimum mitochondrial content.

  • Normalization method "SCT" ( (Hafemeister & Satija, 2019) ) or "standard".

The normalization itself can also be parametrized, including the possibility of regressing the desired variables (such as cell cycle scoring, number of detected genes, % of mitochondrial genes) in order to mitigate their effect on the dataset. Refer to the example config file for the available options.

NOTE: the user will have to choose the correct value for some parameters (QC threshold filters, significant PCA components or regressed variables) after the execution of these steps. For this purpose, it is necessary to stop the pipeline executions when these steps are finished. Then, take a look and the results to define those parameter values at the configuration file. Finally, re-run the pipeline from these steps in order to apply the changes.

4.b. Merging.

bollito can perform an optional merging step. This step consists in combining all the sample's Seurat objects which come from the single-cell QC step. No normalization is performed in this step, since the merged object will be used as input in the normalization step.

To enable this step, the user just need to set the "enabled" parameter to TRUE in the configuration file.

4.c. Integration.

bollito allows for an optional integration step, in order to detect shared cell states between datasets. The integration method is based on the identification of anchor cells between the datasets, and the projection of datasets on each other by using these anchors . For more information regarding the integration step, please refer to the Seurat Integration and Label Transfer documentation .

Integration step also performs the normalization of the integrated samples. It uses the same normalization method and values specified for the normalization rule, so the user does not need to specify any extra parameter.

To enable this step, the user just need to set the "enabled" parameter to TRUE in the configuration file.

5. Clustering.

Clustering of cells uses the normalized expression profiles. After, a dimensionality reduction by PCA, the k -nearest-neightbor (KNN) graph embedded in this lower dimensional space, is obtained. Then a Shared Nearest Neighbour (SNN) graph is constructed calculating the nearest neighbour cells overlapping using the Jaccard index. Once the graph is created, clusters are captured by using a the Louvain algorithm.

To explore the clusters along the resolutions, bollito uses Clustree . Cluster validation is achieved by calculating silhouette scores for each cluster. LISI ( (Korsunsky, I. et al. ) is used to assess if there is a batch effect produced by any of the categorical varaibles that are described in the experiment. If any batch effect is detected I would be advisabe to regress out that variable.

Additionally, once the main steps of the pipeline are finished, bollito creates a AnnData file from the Seurat output file and checks its formatting.

For this step, the following parameters need to be adjusted via the configuration file:

  • Number of significant components based on the elbow plot or JackStraw analysis obtained in previous steps.

  • Number of neighbours ( k ) used to generate the KNN graph (default = 20).

  • Resolutions to be used in the community detection method.

NOTE: the best resolution obtained should be specified in the configuration file, since it is used in posterior effects.

6. Differential expression analysis.

Differential expression analysis is based on the condition that the user requires, including a specific cluster resolution or some annotation information. For each condition, two analyses are performed:

  • Marker gene detection (for this test, only genes with a logFC threshold of 0.25, that are expressed in at least 10% of the cells are included).

  • Differential expression for all genes (no thresholds applied).

Output from this step includes a heatmap of top marker genes per condition, and a .rnk file that may be used for a downstream gene enrichment analysis with GSEA ,

The following parameters of this step need to be adjusted via the configuration file:

  • Set enabled to True .

  • Condition to analyze (cluster resolution or annotation information).

  • Differential expression test to apply.

For a list of available tests see the Seurat FindMarkers function documentation.

7. Functional analysis - Seurat-based.

bollito uses Seurat to apply the AddModuleScore function to the molecular signatures specified by the user. A paired Wilcoxon test is applied to compare the expression of the genes from the signature between the clusters.

To enable this step, the following parameters need to be adjusted via the configuration file:

  • Set enabled to True .

  • Set the path to the molecular signatures file to use (in .gmt format).

  • Set the ratio of expressed genes / total genes from a geneset to be tested (default is 0.2).

8. Functional analysis - Vision-based.

bollito applies the Vision methodology in order to study different molecular signatures and their significance at a specific clustering resolution.

To enable this step, the following parameters need to be adjusted via the configuration file:

  • Set enabled to True .

  • Set the path to the molecular signatures file to use (in .gmt format).

  • Select the desired metadata variables from Seurat.

  • Set the desired cluster resolution.

9. Trajectory inference.

This step analyses the cell lineages of your sample by inferring a pseudotime variable from the data and sorting the clusters according to it. The trajectory inference step is implemented by using the Slingshot package.

To enable it, the following parameters need to be adjusted via the configuration file:

  • Set enabled to True .

  • Cluster resolution must be specified.

  • Specifiy start and end cluster(s) (optional).

  • Number of genes for the heatmap representation.

  • Number of most variable genes (allows you to generate the general additive model of the heatmap).

10. RNA velocity.

The analysis of RNA velocity allows you to capture the expression dynamics of your data by estimating the spliced and unspliced mRNA abundances on each of the available splicing sites. Based on this information, future state of single-cells can be inferred. This step is implemented using the Velocyto wrapper.

To enable this step, the following parameters need to be adjusted via the configuration file:

  • Set enabled to True .

  • Cluster resolution must be specified.

  • If the dataset is large, a downsampling should be considered (optional).

NOTE: RNA velocity step can not be performed if we use a matrix as input of the pipeline, since it needs the BAM files to generate the three count matrices (spliced, unspliced and ambiguous).

Configuration of computation resources

The user can configure bollito to optimise the available computational resources in order to reduce the computational time. The optimization is achieved thanks to Snakemake's ability to run several samples at the same time and single-cell script parallelisation using the future package implementation. Resources configuration is done through the configuration file. This file has a field called resources , where the user can define the RAM memory usage and the number of threads (if the rule admits multithreading) available to a specific step. Additionally, if the user does not provide any value for some of these fields, the pipeline will use the default values.

Shortcuts

bollito features a shortcut system based on specfic targets related to the pipeline's steps. Each target calls a end point rule which terminate the pipeline execution.

To use the shorcuts, you only need to run the pipeline as usual, but with the --until option.

snakemake --use-conda --until target_name

The available targets are:

  • expression_matrix : run bollito until alignment step included.

  • qc_expression_matrix : run bollito until single-cell QC step included (rules: seurat_qc, seurat_postqc & seurat_merge).

  • normalized_expression_matrix : run bollito until single-cell normalization step included (rules: seurat_qc, seurat_postqc, seurat_merge, seurat_normalization & seurat_integration).

Additionally, the user might use the Snakemake rules names as targets, which are available in the config.yaml file.

Reporting

bollito produces a HTML report using Snakemake's automatic report generaration. This report includes the multiQC report and some quality control and normalization information at single-cell level from Seurat.

To generate the report, you only need to use --report option when the analysis is finished.

snakemake --report report.zip

This will generate a zip file containing an HTML index page and all the files needed to correctly display the report.

Scanpy interoperability

Bollito generates an AnnData output file to allow users to perform downstream analyses using Scanpy and other python-based packages. This AnnData file is obtained from the post-clustering Seurat object, so it stores all the annotations and cell filterings applied until that step.

Pipeline benchmarking

The following metrics were generating using the [10K PBMC 3p](https://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/6.1.0/10k_PBMC_3p_nextgem_Chromium_X/10k_PBMC_3p_nextgem_Chromium_X_fastqs.tar

  • from 10x Next GEM Chromium X dataset on a HPC cluster running CentOS 8 (248 cores, 2Tb RAM).
pipeline step running time (s) max RAM usage (MB) threads
fastqc 657.897 3830.980 4
star 5388.925 36103.887 8
rseqc_junction_saturation 1310.431 4848.223 1
rseqc_readdis 2507.294 1058.127 1
rseqc_stat 1022.082 116.960 1
rseqc_readgc 1166.199 2183.963 1
rseqc_readdup 2123.237 31723.523 1
rseqc_infer 7.545 150.740 1
rseqc_innerdis 630.251 970.620 1
rseqc_junction_annotation 1171.999 291.133 1
multiQC 40.741 212.860 2
seurat_preQC 120.221 2310.437 1
seurat_postQC 65.062 1869.23 1
seurat_normalization 405.939 9125.253 1
seurat_find-clusters 253.288 3710.987 2

References

  • Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc [Accessed 13 March 2020]

  • Wingett S. (2010). FastQ Screem:FastQ Screen allows you to screen a library of sequences in FastQ format against a set of sequence databases so you can see if the composition of the library matches with what you expect. Available at: http://www.bioinformatics.babraham.ac.uk/projects/fastq_screen [Accessed 13 March 2020]

  • Dobin, A., Davis, C., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M. and Gingeras, T., 2012. STAR: ultrafast universal RNA-seq aligner. Bioinformatics , 29(1), pp.15-21.

  • Wang, L., Wang, S. and Li, W., 2012. RSeQC: quality control of RNA-seq experiments. Bioinformatics , 28(16), pp.2184-2185.

  • Kowalczyk, M., Tirosh, I., Heckl, D., Ebert, B. and Regev, A., 2014. Single cell RNA-Seq of hematopoietic stem cells reveals a cell cycle-dependent interplay between aging and differentiation. Experimental Hematology , 42(8), p.S21.

  • Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck, W., Hao, Y., Stoeckius, M., Smibert, P. and Satija, R., 2019. Comprehensive Integration of Single-Cell Data. Cell , 177(7), pp.1888-1902.e21.

  • Hafemeister, C. and Satija, R., 2019. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biology , 20(1).

  • Zappia, L. and Oshlack, A., 2018. Clustering trees: a visualization for evaluating clusterings at multiple resolutions. GigaScience , 7(7).

  • DeTomaso, D., Jones, M., Subramaniam, M., Ashuach, T., Ye, C. and Yosef, N., 2019. Functional interpretation of single cell similarity maps. Nature Communications , 10(1).

  • Street, K., Risso, D., Fletcher, R., Das, D., Ngai, J., Yosef, N., Purdom, E. and Dudoit, S., 2018. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics , 19(1).

  • La Manno, G., Soldatov, R., Zeisel, A., Braun, E., Hochgerner, H., Petukhov, V., Lidschreiber, K., Kastriti, M., Lönnerberg, P., Furlan, A., Fan, J., Borm, L., Liu, Z., van Bruggen, D., Guo, J., He, X., Barker, R., Sundström, E., Castelo-Branco, G., Cramer, P., Adameyko, I., Linnarsson, S. and Kharchenko, P., 2018. RNA velocity of single cells. Nature , 560(7719), pp.494-498.

  • Korsunsky, I., Millard, N., Fan, J., Slowikowski, K., Zhang, F., & Wei, K., Baglaenko, Y., Brenner, M., Loh, P. and Raychaudhuri, S. 2019. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods , 16(12), pp.1289-1296.

Test data

The system is pre-configured to run an example based on sample data available from 10x Genomics. The required datasets can be found at these URLS. Please update the "units.tsv" file to point at the data as needed.

  • https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/neuron_10k_v3

  • https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/heart_10k_v3

FASTQ_SCREEN

The pipeline can optionally run FASTQ_SCREEN to check the samples for contamination.

To disable it use the config file option config["rules"]["fastq_screen"]["disabled"] .

Config file pointing to indexes should be placed in a directory named config['rules']['fastq_screen_indexes']['outdir']/FastQ_Screen_Genomes .

If the rule is enabled and no config file provided, default indexes will be downloaded using the command fastq_screen --get_genomes .

![Alt text](https://github.com/cnio-bu/bollito/raw/main/./.img/CNIO_stopcancer.png)

Code Snippets

15
16
wrapper:
    "0.60.0/bio/star/index"
34
35
36
shell:"""
    cat {input} > {output} 2> {log}
"""
82
83
wrapper: 
    "0.60.0/bio/star/align"
27
28
script: 
    "../scripts/step1_qc.R"
63
64
script:
    "../scripts/step2_postqc.R"
85
86
script:
    "../scripts/step2.1_filter.R"
108
109
script:
    "../scripts/step2.5_seurat_merge.R"
154
155
script:
    "../scripts/step3_normalization.R"
182
183
script:
    "../scripts/step3.2_integration.R"
208
209
script:
    "../scripts/step4_find-clusters.R"
228
229
script:
    "../scripts/readAnnData.py"
252
253
script:
    "../scripts/step5_degs.R"
275
276
script:
    "../scripts/step6_gs-scoring.R"
301
302
script:
    "../scripts/step7_traj_in.R"
328
329
script:
    "../scripts/step8_func_analysis.R"
19
20
wrapper:
    "0.35.0/bio/fastqc"
37
38
39
shell:"""
    fastq_screen --threads {threads} --get_genomes --outdir {params.outdir}/ &> {log}
"""
SnakeMake From line 37 of rules/qc.smk
60
61
wrapper:
    "0.35.0/bio/fastq_screen"
SnakeMake From line 60 of rules/qc.smk
79
80
script:
    "../scripts/gtf2bed.py"
SnakeMake From line 79 of rules/qc.smk
102
103
104
shell:
    "junction_annotation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} "
    "> {log[0]} 2>&1"
SnakeMake From line 102 of rules/qc.smk
126
127
128
shell:
    "junction_saturation.py {params.extra} -i {input.bam} -r {input.bed} -o {params.prefix} "
    "> {log} 2>&1"
SnakeMake From line 126 of rules/qc.smk
146
147
shell:
    "bam_stat.py -i {input} > {output} 2> {log}"
SnakeMake From line 146 of rules/qc.smk
167
168
shell:
    "infer_experiment.py -r {input.bed} -i {input.bam} > {output} 2> {log}"
SnakeMake From line 167 of rules/qc.smk
189
190
shell:
    "inner_distance.py -r {input.bed} -i {input.bam} -o {params.prefix} > {log} 2>&1"
SnakeMake From line 189 of rules/qc.smk
209
210
shell:
    "read_distribution.py -r {input.bed} -i {input.bam} > {output} 2> {log}"
SnakeMake From line 209 of rules/qc.smk
230
231
shell:
    "read_duplication.py -i {input} -o {params.prefix} > {log} 2>&1"
SnakeMake From line 230 of rules/qc.smk
251
252
shell:
    "read_GC.py -i {input} -o {params.prefix} > {log} 2>&1"
SnakeMake From line 251 of rules/qc.smk
288
289
wrapper:
    "v1.0.0/bio/multiqc"
17
18
19
shell:"""
    bash scripts/STAR_to_velocyto.sh {params.input_dir} 2> {log}
"""
42
43
script:
    "../scripts/step9_RNA_velocity.R"
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
import sys
sys.stderr=open(snakemake.log[0], "a+")
sys.stdout=open(snakemake.log[0], "a+")

import gffutils
db = gffutils.create_db(snakemake.input[0],
                        dbfn=snakemake.output.db,
                        force=True,
                        keep_order=True,
                        merge_strategy='merge',
                        sort_attribute_values=True,
                        disable_infer_genes=True,
                        disable_infer_transcripts=True)

with open(snakemake.output.bed, 'w') as outfileobj:
    for tx in db.features_of_type('transcript', order_by='start'):
        bed = [s.strip() for s in db.bed12(tx).split('\t')]
        bed[3] = tx.id
        outfileobj.write('{}\n'.format('\t'.join(bed)))

sys.stderr.close()
sys.stdout.close()
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import sys
sys.stderr=open(snakemake.log[0], "a+")
sys.stdout=open(snakemake.log[0], "a+")

import scanpy
import warnings

anndata_path = snakemake.input["anndata_obj"]
outdir = snakemake.params["output_dir"]

a = scanpy.read_h5ad(filename = anndata_path)
check_path = outdir + "/check.txt"
print(check_path)
check = open(check_path, 'w')

'''
try: 

except:
    warnings.warn("AnnData can not be read.")
    sys.exit(1)

'''
sys.stderr.close()
sys.stdout.close()
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
head -n 2 ${1}/matrix.mtx > ${1}/mtx_header.txt
head -n 3 ${1}/matrix.mtx | tail -n 1 > ${1}/mtx_summary.txt

tail -n +4 ${1}/matrix.mtx | cut -d " " -f 1-3 > ${1}/ms.txt
tail -n +4 ${1}/matrix.mtx | cut -d " " -f 1-2,4 > ${1}/mu.txt
tail -n +4 ${1}/matrix.mtx | cut -d " " -f 1-2,5 > ${1}/ma.txt

#rm -r ${1}/spliced ${1}/unspliced ${1}/ambiguous
mkdir -p ${1}/spliced ${1}/unspliced ${1}/ambiguous

cat ${1}/mtx_header.txt ${1}/mtx_summary.txt ${1}/ms.txt > ${1}/spliced/matrix.mtx
cat ${1}/mtx_header.txt ${1}/mtx_summary.txt ${1}/mu.txt > ${1}/unspliced/matrix.mtx
cat ${1}/mtx_header.txt ${1}/mtx_summary.txt ${1}/ma.txt > ${1}/ambiguous/matrix.mtx

cp ${1}/features.tsv ${1}/genes.tsv
cp ${1}/genes.tsv ${1}/barcodes.tsv ${1}/spliced/
cp ${1}/genes.tsv ${1}/barcodes.tsv ${1}/unspliced/
cp ${1}/genes.tsv ${1}/barcodes.tsv ${1}/ambiguous/

rm ${1}/*.txt
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

message("CONFIGURATION STEP")
# A. Parameters: 
# 1. Load libraries. 
suppressMessages(library("Seurat"))
suppressMessages(library("dplyr"))
suppressMessages(library("data.table"))
suppressMessages(library("reticulate"))
suppressMessages(library("ggplot2"))
suppressMessages(library("stringr"))
suppressMessages(library("Matrix"))
suppressMessages(library("patchwork"))
message("1. Libraries were loaded.")

# 2. Folder configuration. 
data_dir = paste0(snakemake@params[["input_dir"]],"/Solo.out/Gene/raw")
dir.name = snakemake@params[["output_dir"]]
folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis")
message("2. Folder paths were set.")

# 3. Get variables from Snakemake. 
samples_path = snakemake@params[["samples_path"]]
min_cells_per_gene = snakemake@params[["min_cells_per_gene"]]
input_type = snakemake@params[["input_type"]]
technology = snakemake@params[["technology"]]
units_path = snakemake@params[["units_path"]]
sample = snakemake@params[["sample"]]
random_seed = snakemake@params[["random_seed"]]
case =  snakemake@params[["case"]]
ram = snakemake@resources[["mem_mb"]]
message("3. Parameters were loaded.")

# 4. Analysis configuration. 
# RAM configuration.
options(future.globals.maxSize = ram*1024^2)
# Set seed.
if (is.numeric(random_seed)) {
  set.seed(random_seed)
}
message(paste0("4. Seed was set at ", random_seed, "."))

# 5. Set regexp for QC variables calculation.
if (case == "lowercase") {
  mito_grep <- "^mt-"
  ribo_grep <- "^rp[sl][[:digit:]]"
} else if (case == "titlecase") {
  mito_grep <- "^mt-"
  ribo_grep <- "^Rp[sl][[:digit:]]"
} else if (case == "uppercase") {
  mito_grep <- "^MT-"
  ribo_grep <- "^RP[SL][[:digit:]]"
} else {
  message("Please choose a correc case option.")
  quit()
}
message(paste0("5. Case is set as ", case, "."))
message("Configuration finished.")
message("\n")


message("PROCESSING STEP")
# 0. Read input and create the expression matrix object.
# If the input file is a fastq file (STARsolo input).
if (input_type == "fastq") {
  file.rename(paste0(data_dir,"/features.tsv"), paste0(data_dir,"/genes.tsv"))
  expression_matrix <- Read10X(data.dir = data_dir)
  message("1. Expression matrix was created from alignment files.")
# If the input file are matrices (directly read from units.tsv).
} else if (input_type == "matrix") { # units.tsv is loaded
  units <- read.csv(units_path, header = TRUE, sep = "\t", row.names = 1, comment.char = "#")
  # If the expression matrix is in 10x like format (matrix, cell barcodes and genes).
  if (technology == "10x") { 
    expression_matrix <- readMM(toString(units[sample,"matrix"]))
    colnames(expression_matrix) <- read.table(toString(units[sample,"cell_names"]))[,1]
    row.names(expression_matrix) <- read.table(toString(units[sample,"gene_names"]))[,1]
    message("1. Expression matrix was created from 10x/CellRanger-like files.")
  # If the expression matrix is in TSV format ((genes as row names and cells as column names).
  } else if (technology == "standard") { 
    expression_matrix = read.csv(toString(units[sample,"matrix"]), sep = "\t", header = TRUE, row.names = 1)
    message("1. Expression matrix was created from TSV matrix.")

  } else {
    message("Please specify a correct unit input.")
  }
} else {
  message("Please specify a correct input type.")
}

# 1. Creating a seurat object. 
expression_matrix <- expression_matrix[, colSums(expression_matrix != 0) > 100] # Take into account cells with more than 100 counts, since CreateSeuratObject function breaks.
seurat = CreateSeuratObject(expression_matrix, project = sample, min.features = 200, min.cells = min_cells_per_gene)
message("2. Seurat object was created.")

# 1.1 Add metadata from samples.tsv file.  
samples_file = read.table(samples_path, sep = "\t", row.names = 1, header = TRUE)
if (dim(samples_file)[2] != 0){
  for (i in 1:length(colnames(samples_file))) {
    seurat <- AddMetaData(seurat, samples_file[sample, i], col.name = colnames(samples_file)[i])
  }
}
message("3. Metadata from sample.tsv was added to Seurat object.")

# 1.1.1 Add specific cell metadata from metadata.tsv file.
if (input_type == "matrix") {
  if (file.exists(toString(units[sample,"metadata"]))){
    meta_file = read.table(toString(units[sample,"metadata"]), sep = "\t", row.names = 1, header = TRUE)
    meta_file <- subset(meta_file, row.names(meta_file) %in% row.names(seurat@meta.data))
    for (i in 1:length(colnames(meta_file))) {
        seurat <- AddMetaData(seurat, meta_file[,i], col.name = colnames(meta_file)[i])
    }
    message("3.1 Metadata from metadata.tsv was added to Seurat object.")
  } else {
    message("Metadata.tsv was not found.")
  }
}

#1.2 Set idents to avoid new idents based on shared CB names. 
Idents(seurat) <- rep(sample, length(colnames(seurat$RNA@data)))

# 2. Preprocessing: Get QC-related values.
# 2.1. Mitochondrial genes - check levels of expression for mt genes. 
seurat[["percent.mt"]] <- PercentageFeatureSet(seurat, pattern = mito_grep)
# 2.2. Ribosomal genes - check levels of expression for rb genes.
seurat[["percent.ribo"]] <- PercentageFeatureSet(seurat, pattern = ribo_grep)
message("4. Ribosomal and mitochondrial percentages were calculated.")

# 2.3. QC: violin plots.
p1 <- VlnPlot(seurat, features = c("nFeature_RNA"), pt.size = 0.25, cols = "#9CCCD0") + ggtitle("Nº features") + theme(legend.position="bottom") 
p2 <- VlnPlot(seurat, features = c("nCount_RNA"), pt.size = 0.25, cols = "#8ADD56")  + ggtitle("Nº counts") + theme(legend.position="bottom")
p3 <- VlnPlot(seurat, features = c("percent.mt"), pt.size = 0.25, cols = "#F07800") + ggtitle("Mitochondrial %") + theme(legend.position="bottom")
p4 <- VlnPlot(seurat, features = c("percent.ribo"), pt.size = 0.25, cols = "#E44631") + ggtitle("Ribosomal %") + theme(legend.position="bottom")
p_comp <- p1 + p2 + p3 + p4 + plot_layout(ncol = 4)
ggsave(paste0(dir.name, "/", folders[1], "/1_vlnplot_QC_variables_prefilt.pdf"), plot = p_comp, scale = 1.2, width = 10, height = 8)
message("5. Combined violin plot was generated.")

# 2.4. QC: GenePlot.
scatter1 <- FeatureScatter(seurat, feature1 = "nCount_RNA", feature2 = "percent.mt", pt.size = 0.25)+ theme(legend.position="bottom") + labs(title = "Mitochondrial % vs Nº counts", x = "Nº counts", y = "Mitochondrial %")
scatter2 <- FeatureScatter(seurat, feature1 = "nCount_RNA", feature2 = "nFeature_RNA", pt.size = 0.25) + theme(legend.position="bottom") + labs(title = "Nº features vs Nº counts", x = "Nº counts", y = "Nº features")
p_comb2 <- scatter1 + scatter2 + plot_layout(ncol = 2)
ggsave(paste0(dir.name, "/", folders[1], "/2_geneplot_numi_vs_pctmit_ngene.pdf"), plot = p_comb2, scale = 1.5)
message("6. Scaterplots were generated.")


# 2.5. Save RDS: we can use this object to generate all the rest of the data.
saveRDS(seurat, file = paste0(dir.name, "/" ,folders[1], "/seurat_pre-qc.rds"))
message("7. Seurat object was saved.")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

message("CONFIGURATION STEP")
# A. Parameters: 
# 1. Load libraries. 
suppressMessages(library("Seurat"))
suppressMessages(library("dplyr"))
suppressMessages(library("data.table"))
suppressMessages(library("reticulate"))
suppressMessages(library("ggplot2"))
message("1. Libraries were loaded.")

# 2. Folder configuration. 
dir.name = snakemake@params[["output_dir"]]
folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis")
message("2. Folder paths were set.")

# 3. Get variables from Snakemake. 
gene.filter = c(snakemake@params[["genes"]]) # Gene or list of genes.
filter.out = snakemake@params[["filter_out"]] # true or false
filter.threshold = snakemake@params[["threshold"]] # numeric
random_seed = snakemake@params[["random_seed"]]
ram = snakemake@resources[["mem_mb"]]
message("3. Parameters were loaded.")

# 4. Analysis configuration. 
# RAM configuration.
options(future.globals.maxSize = ram*1024^2)
# Set seed.
if (is.numeric(random_seed)) {
  set.seed(random_seed)
}
message(paste0("4. Seed was set at ", random_seed, "."))
message("Configuration finished.")
message("\n")

# B. Analysis.
message("PROCESSING STEP")
# Load Seurat object from previous step.
seurat = readRDS(paste0(dir.name, "/", folders[1], "/seurat_post-qc.rds"))
message("1. Seurat object was loaded.")

# 4. Filter cell based on gene expression values.
# 4.1 If there are negative markers availale: filter out cells based on gene expression. In this specific case, we are filtering out all cells expressing: Epcam, Pecam1, Krt19 and Ptprc. CHECK THIS

if(length(gene.filter) > 0){
	if (filter.out == TRUE){
		for(i in 1:length(gene.filter)){
			sub_seurat <- FetchData(object = seurat, vars = gene.filter[i])
			seurat <- seurat[, which(x = sub_seurat > filter.threshold)]
		}			
	} else {
		for(i in 1:length(gene.filter)){
			sub_seurat <- FetchData(object = seurat, vars = gene.filter[i])
			seurat <- seurat[, which(x = sub_seurat <= filter.threshold)]
		}	
	}
} else {
	next()
}
message("2. Cells were filter out.")

# Save RDS: we can use this object to generate all the rest of the data.
saveRDS(seurat, file = paste0(dir.name, "/",folders[1], "/seurat_post-qc-filtered.rds"))
message("3. Seurat object was saved.")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

message("CONFIGURATION STEP")
# A. Parameters: 
# 1. Load libraries. 
suppressMessages(library("Seurat"))

# 2. Folder configuration. 
dir.name = snakemake@params[["output_dir"]]
folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis")
message("2. Folder paths were set.")

# 3. Get variables from Snakemake. 
input_data = snakemake@input[["data"]]
random_seed = snakemake@params[["random_seed"]]
velocyto = snakemake@params[["velocyto"]]
outdir_config = snakemake@params[["outdir_config"]]
ram = snakemake@resources[["mem_mb"]]
write_table = as.logical(snakemake@params[["write_table"]])
message("3. Parameters were loaded.")

# 4. Analysis configuration. 
# RAM configuration.
options(future.globals.maxSize = ram*1024^2)
# Set seed.
if (is.numeric(random_seed)) {
  set.seed(random_seed)
}
message(paste0("4. Seed was set at ", random_seed, "."))
message("Configuration finished.")
message("\n")

# B. Analysis.
message("PROCESSING STEP")
# 2.5 Merging all samples in the execution.
# Loading seurat object function.
combine_object <- function(x, velocyto) {
  experiment = head(tail(strsplit(x, "/")[[1]], n=3), n=1) #Assay name is stored to later use in integrated object metadata.
  seurat_obj <- readRDS(x)
  seurat_obj <- AddMetaData(object = seurat_obj,
                            metadata = experiment,
                            col.name = 'assay_name')
  # If RNA velocity is going to be performed, we add the velocyto matrices in this step.
  if (velocyto){
    # Velocyto matrices path is infered.
    velocyto_dir = paste0(outdir_config,"/star/", experiment,"/Solo.out/Velocyto/raw/")
    velo_names = c("spliced", "unspliced", "ambiguous")
    vel_matrices = list()
    # The matrices are read in 10x format.
    for (name in velo_names) {
      vel_matrices[[name]] <- Read10X(data.dir = paste0(velocyto_dir, name))
    }
    # The matrices are added as assays in the respective seurat object.
    for (name in velo_names) {
      vel_matrices[[name]] <- vel_matrices[[name]][, which(x = colnames(vel_matrices[[name]]) %in% colnames(seurat_obj))] 
      seurat_obj[[name]] <- CreateAssayObject(counts = vel_matrices[[name]])
    }
  }
  return(seurat_obj)
}
message("1. All Seurat object were loaded.")

# Function to get sample names.
get_name_assays <- function(x) {
  experiment = head(tail(strsplit(x, "/")[[1]], n=3), n=1) #Assay name is stored to later use in integrated object metadata.
  return(experiment)
}

# Seurat objects are loaded and sample names are obtained.
seurat_list = lapply(input_data, function(x) combine_object(x, velocyto))
cells_id = sapply(input_data, function(x) get_name_assays(x))
names(seurat_list) <- cells_id

# Seurat object split for merging.
x_seurat <- seurat_list[[1]]
y_seurat <- seurat_list[2:length(seurat_list)]
message("2. List of Seurat object was created.")

# Merging step.
seurat <- merge(x_seurat, y = as.vector(y_seurat), add.cell.ids = cells_id, project = "merged")
message("3. Seurat objects were merged.")

# Save expression matrix.
if (write_table){
	write.table(as.matrix(seurat@assays$RNA@counts), file = paste0(dir.name, "/", folders[1], "/expression_matrix.tsv"), sep = "\t", row.names = TRUE, col.names = NA, quote = FALSE)
	message("4. Merged expression matrix was saved.")
} else {
	message("4. Merged expression matrix was not saved, as specified in the configuration file.")
}

# Save merged Seurat object.
saveRDS(seurat, file = paste0(dir.name, "/", folders[1], "/seurat_post-qc.rds"))
message("5. Seurat object was saved.")
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

message("CONFIGURATION STEP")
# A. Parameters: 
# 1. Load libraries. 
suppressMessages(library("Seurat"))
suppressMessages(library("dplyr"))
suppressMessages(library("data.table"))
suppressMessages(library("reticulate"))
suppressMessages(library("ggplot2"))
suppressMessages(library("patchwork"))
message("1. Libraries were loaded.")

# 2. Folder configuration. 
dir.name = snakemake@params[["output_dir"]]
folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis")
message("2. Folder paths were set.")

# 3. Get variables from Snakemake. 
min_feat = snakemake@params[["min_feat"]]
max_feat = snakemake@params[["max_feat"]]
min_count = snakemake@params[["min_count"]]
max_count = snakemake@params[["max_count"]]
mit = snakemake@params[["mit"]]
ribo = snakemake@params[["ribo"]]
random_seed = snakemake@params[["random_seed"]]
ram = snakemake@resources[["mem_mb"]]
write_table = as.logical(snakemake@params[["write_table"]])
message("3. Parameters were loaded.")

# 4. Analysis configuration. 
# RAM configuration.
options(future.globals.maxSize = ram*1024^2)
# Set seed.
if (is.numeric(random_seed)) {
  set.seed(random_seed)
}
message(paste0("4. Seed was set at ", random_seed, "."))
message("Configuration finished.")
message("\n")

# B. Analysis.
message("PROCESSING STEP")
# Load Seurat object from previous step.
seurat = readRDS(paste0(dir.name, "/", folders[1], "/seurat_pre-qc.rds"))
message("1. Seurat object was loaded.")

#3. Preprocessing: Filter out low-quality cells.
# 3.1 Pre-filtering stats calculus.
stats_pre <- c(length(colnames(seurat)), median(seurat@meta.data[["nCount_RNA"]]), median(seurat@meta.data[["nFeature_RNA"]]), median(seurat@meta.data[["percent.mt"]]), median(seurat@meta.data[["percent.ribo"]]))
message("2. Pre-filtering statistics were obtained.")


# 3.2 We should apply the filterings once the QC plots (GenePlot and Violin plots) have been checked.
# 3.2.1 Feature filter.
min_feat <- ifelse(is.null(min_feat), 0, min_feat)
max_feat <- ifelse(is.null(max_feat), 0, max_feat)
if (min_feat != 0 | max_feat != 0) { 
  cells_seurat <- FetchData(object = seurat, vars = "nFeature_RNA")
  seurat <- seurat[, which(x = cells_seurat > min_feat & cells_seurat < max_feat)]
}

# 3.2.2 Count filter.
min_count <- ifelse(is.null(min_count), 0, min_count)
max_count <- ifelse(is.null(max_count), 0, max_count)
if (min_count != 0 | max_count != 0) { 
  cells_seurat <- FetchData(object = seurat, vars = "nCount_RNA")
  seurat <- seurat[, which(x = cells_seurat > min_count & cells_seurat < max_count)]
}


# 3.2.3 Mitochondrial filter.
if (!is.null(mit)) {
  mit_seurat <- FetchData(object = seurat, vars = "percent.mt")
  seurat <- seurat[, which(x = mit_seurat < mit)]
}

# 3.2.4 Ribosomal filter.
if (!is.null(ribo)) {
  ribo_seurat <- FetchData(object = seurat, vars = "percent.ribo")
  seurat <- seurat[, which(x = ribo_seurat < ribo)]
}
message("3. Cell filters were applied.")

# 3.3 QC: violin plots - After filter.
p1 <- VlnPlot(seurat, features = c("nFeature_RNA"), pt.size = 0.25, cols = "#9CCCD0") + ggtitle("Nº features") + theme(legend.position="bottom") 
p2 <- VlnPlot(seurat, features = c("nCount_RNA"), pt.size = 0.25, cols = "#8ADD56")  + ggtitle("Nº counts") + theme(legend.position="bottom")
p3 <- VlnPlot(seurat, features = c("percent.mt"), pt.size = 0.25, cols = "#F07800") + ggtitle("Mitochondrial %") + theme(legend.position="bottom")
p4 <- VlnPlot(seurat, features = c("percent.ribo"), pt.size = 0.25, cols = "#E44631") + ggtitle("Ribosomal %") + theme(legend.position="bottom")
p_comp <- p1 + p2 + p3 + p4 + plot_layout(ncol = 4)
ggsave(paste0(dir.name, "/", folders[1], "/3_vlnplot_QC_variables_postfilt.pdf"), plot = p_comp, scale = 1.2, width = 10, height = 8)
message("4. Combined violin plot post-filtering was generated.")

# 3.4 Post-filter stats calculus.
stats_post <- c(length(colnames(seurat)), median(seurat@meta.data[["nCount_RNA"]]), median(seurat@meta.data[["nFeature_RNA"]]), median(seurat@meta.data[["percent.mt"]]), median(seurat@meta.data[["percent.ribo"]]))
message("5. Post-filtering statistics were obtained.")

# 3.5 Statistics table.
filtering_df <- data.frame("Number of cells" = c(stats_pre[1],stats_post[1]), "Count median" = c(stats_pre[2],stats_post[2]),"Expressed genes median" = c(stats_pre[3],stats_post[3]), "Mitochondrial percentage median" = c(stats_pre[4],stats_post[4]), "Ribosomal percentage median" = c(stats_pre[5],stats_post[5]))
row.names(filtering_df) <- c("Pre-QC", "Post-QC")
write.table(filtering_df, file = paste0(dir.name, "/", folders[1], "/4_pre_vs_post_stats.tsv"), sep = "\t", col.names = NA, quote = FALSE)
message("6. Statistics table was saved.")

# 3.6 Save expression matrix.
if(write_table){
	write.table(as.matrix(seurat@assays$RNA@counts), file = paste0(dir.name, "/", folders[1], "/expression_matrix.tsv"), sep = "\t", row.names = TRUE, col.names = NA, quote = FALSE)
	message("7. Post-qc expression matrix was saved.")
} else {
	message("7. Post-qc expression matrix was not saved, as specified in the configuration file.")
}

# 3.7 Save RDS: we can use this object to generate all the rest of the data.
saveRDS(seurat, file = paste0(dir.name, "/",folders[1], "/seurat_post-qc.rds"))
message("8. Seurat object was saved.")
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

message("CONFIGURATION STEP")
# A. Parameters: 
# 1. Load libraries. 
suppressMessages(library("Seurat"))
suppressMessages(library("ggplot2"))
suppressMessages(library("RColorBrewer"))
suppressMessages(library("stringr"))
suppressMessages(library("future"))
suppressMessages(library("stringr"))
message("1. Libraries were loaded.")

# 2. Folder configuration. 
dir.name = snakemake@params[["output_dir"]]
input_data = snakemake@input[["data"]]
folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis")
message("2. Folder paths were set.")

# 3. Get variables from Snakemake. 
regress_out = snakemake@params[["regress_out"]]
regress_cell_cycle = snakemake@params[["regress_cell_cycle"]]
vars_to_regress = snakemake@params[["vars_to_regress"]]
norm_type = snakemake@params[["norm_type"]]
vf = snakemake@params[["variable_features"]] 
random_seed = snakemake@params[["random_seed"]]
velocyto = snakemake@params[["velocyto"]]
outdir_config = snakemake@params[["outdir_config"]]
case = snakemake@params[["case"]]
ram = snakemake@resources[["mem_mb"]]
threads = snakemake@threads
write_table = as.logical(snakemake@params[["write_table"]])
message("3. Parameters were loaded.")

# 4. Analysis configuration. 
# RAM configuration.
options(future.globals.maxSize = ram*1024^2)
#Set parallelization.
plan("multiprocess", workers = threads)
message(paste0("4. Threads were set at ", threads, "."))
# Set seed.
if (is.numeric(random_seed)) {
  set.seed(random_seed)
}
message(paste0("5. Seed was set at ", random_seed, "."))

# 5. Load cell cycle markers signature from Tirosh et al, 2015.
s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes

## 6. Change geneset signatures to specific gene case.
if (case == "lowercase") {
  s.genes <- str_to_lower(s.genes)
  g2m.genes <- str_to_lower(g2m.genes)
  message ("Set to lowercase.")
} else if (case == "titlecase") {
  s.genes <- str_to_title(s.genes)
  g2m.genes <- str_to_title(g2m.genes)
  message ("Set to titlecase.")
} else if (case == "uppercase") {
  message ("Set to uppercase.")
} else {
  message("Please choose a correc case option.")
  quit()
}
message(paste0("6. Case is set as ", case, "."))
message("Configuration finished.")
message("\n")


message("PROCESSING STEP")
# 6. Integration.
# 6.1. This function will read each one of the input files and will perform the early normalization steps for each sample.
path_to_seurat_object <- function(x, regress_out, vars_to_regress, regress_cell_cycle, norm_type, add_velocity) {
    experiment = head(tail(strsplit(x, "/")[[1]], n=3), n=1) #Assay name is stored to later use in integrated object metadata.
    seurat_obj <- readRDS(x)
    seurat_obj <- AddMetaData(object = seurat_obj, metadata = experiment, col.name = 'assay_name')
    # If RNA velocity is going to be performed, we add the velocyto matrices in this step.
    if (add_velocity){
   	# Velocyto matrices path is infered. 
	velocyto_dir = paste0(outdir_config,"/star/", experiment,"/Solo.out/Velocyto/raw/")
        velo_names = c("spliced", "unspliced", "ambiguous")
	vel_matrices = list()
	# The matrices are read in 10x format.
	for (name in velo_names) {
		vel_matrices[[name]] <- Read10X(data.dir = paste0(velocyto_dir, name))
	}
	# The matrices are added as assays in the respective seurat object.
	for (name in velo_names) {
		vel_matrices[[name]] <- vel_matrices[[name]][, which(x = colnames(vel_matrices[[name]]) %in% colnames(seurat_obj))] 
		seurat_obj[[name]] <- CreateAssayObject(counts = vel_matrices[[name]])
	}

    }
    # Save regression parameters variable
    regression_param <- c(regress_out, regress_cell_cycle)
    vars_param <- list(vars_to_regress = vars_to_regress, cell_cycle = c("S.Score", "G2M.Score"))

    # Normalization of the individual samples
    if(norm_type == "standard"){
        seurat_obj <- NormalizeData(seurat_obj, normalization.method = "LogNormalize", scale.factor = 10000)
        message(paste0("2. Seurat object was normalized using the ", norm_type, " approach"))
        seurat_obj <- FindVariableFeatures(seurat_obj, selection.method = "vst", nfeatures = 2500)
        # Scaling to perform PCA.
        seurat_obj <- ScaleData(seurat_obj, features = rownames(seurat_obj))
        # PCA previous to cell cycle scoring.
        seurat_obj <- RunPCA(seurat_obj, features = if(vf) VariableFeatures(object = seurat_obj) else rownames(seurat_obj), npcs = 50)
        # Cell cycle scores and plots.
        seurat_obj <- CellCycleScoring(object = seurat_obj, s.features = s.genes, g2m.features = g2m.genes, set.ident = T)
        # Scaling.
        if(all(regression_param)){
            seurat_obj <- ScaleData(seurat_obj, features = rownames(seurat_obj), vars.to.regress = c(vars_to_regress, "S.Score", "G2M.Score"))
        } else if (any(regression_param)){
            seurat_obj <- ScaleData(seurat_obj, features = rownames(seurat_obj), vars.to.regress = unlist(vars_param[regression_param]))
        }
    } else if (norm_type == "SCT"){
        seurat_obj <- SCTransform(seurat_obj, verbose = FALSE)
	    message(paste0("2. Seurat object was normalized using the ", norm_type, " approach"))
        # PCA previous to cell cycle scoring.
        seurat_obj <- RunPCA(seurat_obj, features = if(vf) VariableFeatures(object = seurat_obj) else rownames(seurat_obj), npcs = 50)
        # Cell cycle scores and plots.
        seurat_obj <- CellCycleScoring(object = seurat_obj, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE)
        # If cell cycle regression is needed, a new SCT transformation is performed.
        if(all(regression_param)){
            seurat_obj <- SCTransform(seurat_obj, assay = "RNA", new.assay = "SCT", vars.to.regress = c(vars_to_regress, "S.Score", "G2M.Score"), verbose = FALSE)
        } else if (any(regression_param)){
            seurat_obj <- SCTransform(seurat_obj, assay = "RNA", new.assay = "SCT", vars.to.regress = unlist(vars_param[regression_param]), verbose = FALSE)
        }
    } else {
	    message("Normalization method not found.")
    }
    return(assign(paste0(experiment[[1]][6]),seurat_obj))
}

# 6.2. The function is applied obtaining a list with all the seurat objects.
seurat_object_list <- lapply(input_data, function(x) path_to_seurat_object(x, regress_out = regress_out, vars_to_regress = vars_to_regress, regress_cell_cycle = regress_cell_cycle, norm_type = norm_type, add_velocity = velocyto))
message("1. All Seurat objects were loaded.")

# 6.3. The minimum number of features between assays is obtained to get an aproximate feature number and keep as much genes as possible.
n_feat_calc <- function(seurat) {
  if (seurat@active.assay == "SCT"){
    return(length(rownames(seurat@assays$SCT@counts)))
  } else if (seurat@active.assay == "RNA"){
    return(length(rownames(seurat@assays$RNA@counts)))
  }
}
n_features <- min(unlist(lapply(seurat_object_list, function(x) n_feat_calc(x))))

# 6.4. Final integration steps and integration object creation.
if (norm_type == "SCT") {
  seurat.features <- SelectIntegrationFeatures(object.list = seurat_object_list, nfeatures = n_features, verbose = FALSE)
  seurat.list <- PrepSCTIntegration(object.list = seurat_object_list, anchor.features = seurat.features, verbose = FALSE)
  seurat.anchors <- FindIntegrationAnchors(object.list = seurat.list, normalization.method = "SCT", anchor.features = seurat.features, verbose = FALSE)
  seurat.integrated <- IntegrateData(anchorset = seurat.anchors, normalization.method = "SCT", verbose = FALSE)
  message(paste0("2. All Seurat objects were integrated following the ", norm_type, " approach."))
} else if (norm_type == "standard") {
  seurat.features <- SelectIntegrationFeatures(object.list = seurat_object_list, nfeatures = n_features, verbose = FALSE)
  seurat.anchors <- FindIntegrationAnchors(object.list = seurat_object_list, dims = 1:100, anchor.features = seurat.features, verbose = FALSE)
  seurat.integrated <- IntegrateData(anchorset = seurat.anchors, dims = 1:100, verbose = FALSE)
  seurat.integrated <- ScaleData(seurat.integrated, verbose = TRUE)
  message(paste0("2. All Seurat objects were integrated following the ", norm_type, " approach."))
}

# 6.5. PCA and Visualize Dimensional Reduction genes.
seurat.integrated <- RunPCA(seurat.integrated, features = if(vf) VariableFeatures(object = seurat.integrated) else rownames(seurat.integrated), ncps = 100, verbose = FALSE)
p1 <- VizDimLoadings(seurat.integrated, dims = 1:2, reduction = "pca") + theme(legend.position="bottom") 
ggsave(paste0(dir.name, "/",folders[2], "/1_viz_dim_loadings.pdf"), plot = p1, scale = 1.5)

# 6.6. PCA projection and integration visualization plot.
getPalette <- colorRampPalette(brewer.pal(9,'Set1'))
p2 <- DimPlot(seurat.integrated, reduction = "pca", group.by = "assay_name", cols=getPalette(length(levels(as.factor(seurat.integrated$assay_name)))))
ggsave(paste0(dir.name, "/", folders[2], "/2_dimplot_PCA.pdf"), plot = p2)
message("3. PCA was done.")

# 6.7. Principal component study using Elbow plot.
p3 <- ElbowPlot(seurat.integrated, ndims = 50) + theme(legend.position="bottom")
ggsave(paste0(dir.name, "/",folders[2], "/3_elbowplot.pdf"), plot = p3, scale = 1.5)
message("4. Elbowplot was generated.")


# 6.8. Cell cycle scores and plots.
seurat.integrated <- CellCycleScoring(object = seurat.integrated, s.features = s.genes, g2m.features = g2m.genes, set.ident = T)
p5 <- FeaturePlot(object = seurat.integrated, features ="S.Score")
ggsave(paste0(dir.name, "/", folders[2], "/5_sscore_featureplot.pdf"), plot = p5, scale = 1.5)
p6 <- FeaturePlot(object = seurat.integrated, features ="G2M.Score")
ggsave(paste0(dir.name, "/", folders[2], "/6_g2mscore_featureplot.pdf"), plot = p6, scale = 1.5)
p7 <- DimPlot(seurat.integrated, reduction = "pca", pt.size = 0.5, label = TRUE, label.size = 5) + RotatedAxis() #+ theme(legend.position    ="bottom") 
ggsave(paste0(dir.name, "/", folders[2], "/7_no_umap_pca.pdf"), plot = p7, scale = 1.5)
message("6. Cell-cycle analysis plot was done.")

# 6.9. Save expression matrix.
if(write_table){
	write.table(as.matrix(seurat.integrated@assays$integrated@scale.data), file = paste0(dir.name, "/", folders[2], "/normalized_expression_matrix.tsv"), sep = "\t", row.names = TRUE, col.names = NA, quote = FALSE)
	message("7. Integrated expression matrix was saved.")
} else {
	message("7. Integrated expression matrix was not saved, as specified in the configuration file.")
}

# 6.10. Save Seurat object.
saveRDS(object = seurat.integrated, file = paste0(dir.name, "/", folders[2], "/seurat_normalized-pcs.rds"))
message("8. Seurat object was saved.")
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

message("CONFIGURATION STEP")
# A. Parameters: 
# 1. Load libraries. 
suppressMessages(library("Seurat"))
suppressMessages(library("dplyr"))
suppressMessages(library("data.table"))
suppressMessages(library("reticulate"))
suppressMessages(library("ggplot2"))
suppressMessages(library("stringr"))
suppressMessages(library("future"))
message("1. Libraries were loaded.")

# 2. Folder configuration. 
dir.name = snakemake@params[["output_dir"]]
input_data = snakemake@input[["seurat_obj"]]
folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs")
message("2. Folder paths were set.")

# 3. Get variables from Snakemake.  
normalization = snakemake@params[["norm_type"]] # "SCT" or "standard"
vf = snakemake@params[["variable_features"]] # TO ADD
regress_out = snakemake@params[["regress_out"]] # true or false
vars_to_regress = c(snakemake@params[["vars_to_regress"]]) # check if null
random_seed = snakemake@params[["random_seed"]]
regress_cell_cycle = snakemake@params[["regress_cell_cycle"]]
regress_merge_effect = snakemake@params[["regress_merge_effect"]]
case = snakemake@params[["case"]]
ram = snakemake@resources[["mem_mb"]]
threads = snakemake@threads
write_table = as.logical(snakemake@params[["write_table"]])
message("3. Parameters were loaded.")

# 4. Analysis configuration. 
# RAM configuration.
options(future.globals.maxSize = ram*1024^2)
#Set parallelization.
plan("multiprocess", workers = threads)
message(paste0("4. Threads were set at ", threads, "."))
# Set seed.
if (is.numeric(random_seed)) {
  set.seed(random_seed)
}
message(paste0("5. Seed was set at ", random_seed, "."))

# 5. Load cell cycle markers signature from Tirosh et al, 2015.
s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes

# 6. Change geneset signatures to specific gene case.
if (case == "lowercase") {
  s.genes <- str_to_lower(s.genes)
  g2m.genes <- str_to_lower(g2m.genes)
  message ("Set to lowercase.")
} else if (case == "titlecase") {
  s.genes <- str_to_title(s.genes)
  g2m.genes <- str_to_title(g2m.genes)
  message ("Set to titlecase.")
} else if (case == "uppercase") {
  message ("Set to uppercase.")
} else {
  message("Please choose a correct case option.")
  quit()
}
message(paste0("6. Case is set as ", case, "."))
message("Configuration finished.")
message("\n")


message("PROCESSING STEP")
# Load seurat object.
seurat = readRDS(input_data)
message("1. Seurat object was loaded.")

# Regress merge variable input.
if (regress_merge_effect){
  merge_var = "assay_name"
} else {
  merge_var = NULL
}

# 5. Normalization.
# 5.1. Normalize data depending of the method.
if(normalization == "standard"){
  seurat <- NormalizeData(seurat, normalization.method = "LogNormalize", scale.factor = 10000)
  seurat <- FindVariableFeatures(seurat, selection.method = "vst", nfeatures = 2500)

  # Identify the 10 most highly variable genes.
  top10 <- head(VariableFeatures(seurat), 10)
  p1 <- VariableFeaturePlot(seurat) + theme(legend.position="bottom")
  p1 <- p1 + LabelPoints(plot = p1, points = top10, repel = TRUE) + theme(legend.position="bottom") 
  ggsave(paste0(dir.name, "/",folders[1], "/6_variable_features.pdf"), plot = p1, scale = 1.5)

  # Scaling to perform PCA.
  # Merged object check.
  if (seurat@project.name == "merged") {
    seurat <- ScaleData(seurat, features = rownames(seurat), vars.to.regress = merge_var)
  } else { 
    seurat <- ScaleData(seurat, features = rownames(seurat))
  }
  # PCA previous to cell cycle scoring.
  seurat <- RunPCA(seurat, features = if(vf) VariableFeatures(object = seurat) else rownames(seurat), npcs = 50) 

  # Cell cycle scores and plots.
  seurat <- CellCycleScoring(object = seurat, s.features = s.genes, g2m.features = g2m.genes, set.ident = T)
  p4 <- FeaturePlot(object = seurat, features ="S.Score") + ggtitle("S phase score")
  ggsave(paste0(dir.name, "/", folders[2], "/4_sscore_featureplot.pdf"), plot = p4, scale = 1.5)
  p5 <- FeaturePlot(object = seurat, features ="G2M.Score") + ggtitle("G2/M phase score")
  ggsave(paste0(dir.name, "/", folders[2], "/5_g2mscore_featureplot.pdf"), plot = p5, scale = 1.5)
  p6 <- DimPlot(seurat, reduction = "pca", pt.size = 0.5, label = TRUE, label.size = 5) + RotatedAxis() #+ theme(legend.position    ="bottom") 
  ggsave(paste0(dir.name, "/", folders[2], "/6_cell_cycle_dimplot.pdf"), plot = p6, scale = 1.5)

  message(paste0("2. Seurat object was normalized using the ", normalization, " approach"))
  # Scaling.
  if(regress_out == TRUE){
    if (regress_cell_cycle) {
      if (seurat@project.name == "merged"){
        seurat <- ScaleData(seurat, features = rownames(seurat), vars.to.regress = c(vars_to_regress, "S.Score", "G2M.Score", merge_var))
      } else {
        seurat <- ScaleData(seurat, features = rownames(seurat), vars.to.regress = c(vars_to_regress, "S.Score", "G2M.Score"))
      }
    } else {
      if (seurat@project.name == "merged"){
        seurat <- ScaleData(seurat, features = rownames(seurat), vars.to.regress = c(vars_to_regress, merge_var))
      } else {
        seurat <- ScaleData(seurat, features = rownames(seurat), vars.to.regress = c(vars_to_regress))
      }      
    }
  }
} else if (normalization == "SCT"){
  if(regress_out == TRUE){
    if(seurat@project.name == "merged"){
      seurat <- SCTransform(seurat, vars.to.regress = c(vars_to_regress, merge_var), verbose = FALSE)
    } else {
      seurat <- SCTransform(seurat, vars.to.regress = vars_to_regress, verbose = FALSE)
    }
  } else {
    if(seurat@project.name == "merged"){
      seurat <- SCTransform(seurat, vars.to.regress = merge_var, verbose = FALSE)
    } else {
      seurat <- SCTransform(seurat, verbose = FALSE)
    }		
  }
  # PCA previous to cell cycle scoring.
  seurat <- RunPCA(seurat, features = VariableFeatures(object = seurat), npcs = 50) # This result could all be saved in a table.

  # Cell cycle scores and plots.
  seurat <- CellCycleScoring(object = seurat, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE)
  p4 <- FeaturePlot(object = seurat, features ="S.Score") + ggtitle("S phase score")
  ggsave(paste0(dir.name, "/", folders[2], "/4_sscore_featureplot.pdf"), plot = p4, scale = 1.5)
  p5 <- FeaturePlot(object = seurat, features ="G2M.Score") + ggtitle("G2/M phase score")
  ggsave(paste0(dir.name, "/", folders[2], "/5_g2mscore_featureplot.pdf"), plot = p5, scale = 1.5)
  p6 <- DimPlot(seurat, reduction = "pca", pt.size = 0.5, label = TRUE, label.size = 5) + RotatedAxis() #+ theme(legend.position    ="bottom") 
  ggsave(paste0(dir.name, "/", folders[2], "/6_cell_cycle_dimplot.pdf"), plot = p6, scale = 1.5)
  message(paste0("2. Seurat object was normalized using the ", normalization, " approach"))
  # If cell cycle regression is needed, a new SCT transformation is perform.
  if (regress_cell_cycle){
    if (regress_out) { 
      if (seurat@project.name == "merged"){
        seurat <- SCTransform(seurat, assay = "RNA", new.assay = "SCT", vars.to.regress = c(vars_to_regress, "S.Score", "G2M.Score", merge_var), verbose = FALSE)
      } else {
        seurat <- SCTransform(seurat, assay = "RNA", new.assay = "SCT", vars.to.regress = c(vars_to_regress, "S.Score", "G2M.Score"), verbose = FALSE)
      }
    } else {
      if (seurat@project.name == "merged"){
        seurat <- SCTransform(seurat, assay = "RNA", new.assay = "SCT", vars.to.regress = c("S.Score", "G2M.Score", merge_var), verbose = FALSE)
      } else {
        seurat <- SCTransform(seurat, assay = "RNA", new.assay = "SCT", vars.to.regress = c("S.Score", "G2M.Score"), verbose = FALSE)
      }
    }
  }
} else {
	message("Normalization method not found.")
}
message("3. Seurat object was scaled.")

## 5.2. PCA & metrics calculation.
Idents(seurat) <- seurat@project.name
seurat <- RunPCA(seurat, features = if(vf) VariableFeatures(object = seurat) else rownames(seurat), npcs = 50) # This result could all be saved in a table. 
# Visualizing PCA in Different Ways: elbow plot most variable genes 
p2 <- VizDimLoadings(seurat, dims = 1:2, reduction = "pca") + theme(legend.position="bottom") 
ggsave(paste0(dir.name, "/",folders[2], "/1_viz_dim_loadings.pdf"), plot = p2, scale = 1.5)#, height = height, width = height * aspect_ratio)
p3 <- DimPlot(seurat, reduction = "pca", pt.size = 0.5) + theme(legend.position="bottom") 
ggsave(paste0(dir.name, "/",folders[2], "/2_dimplot.pdf"), plot = p3, scale = 1.5)
# Only if cell cycle was regressed.
if (regress_cell_cycle) {
  p7 <- FeaturePlot(object = seurat, features ="S.Score") + ggtitle("S phase score")
  ggsave(paste0(dir.name, "/", folders[2], "/4_sscore_featureplot_regressed.pdf"), plot = p7, scale = 1.5)
  p8 <- FeaturePlot(object = seurat, features ="G2M.Score") + ggtitle("G2/M phase score")
  ggsave(paste0(dir.name, "/", folders[2], "/5_g2mscore_featureplot_regressed.pdf"), plot = p7, scale = 1.5)
  p9 <- DimPlot(seurat, group.by = "Phase", reduction = "pca", pt.size = 0.5, label = TRUE, label.size = 5) + RotatedAxis()
  ggsave(paste0(dir.name, "/", folders[2], "/6_cell_cycle_dimplot_regressed.pdf"), plot = p9, scale = 1.5)
}
message("4. PCA was calculated.")

# 5.3. Determine the dimensionality of the dataset
p4 <- ElbowPlot(seurat, ndims = 50) + theme(legend.position="bottom")
ggsave(paste0(dir.name, "/",folders[2], "/3_elbowplot.pdf"), plot = p4, scale = 1.5)
message("5. Elbowplot was generated.")

#if (!([email protected] == "SCT")) {
#  seurat <- JackStraw(seurat, num.replicate = 100, dims = 50)
#  seurat <- ScoreJackStraw(seurat, dims = 1:50)
#  p5 <- JackStrawPlot(seurat, dims = 1:50) + theme(legend.position="bottom") + guides(fill=guide_legend(nrow=2, byrow=TRUE)) + labs(y = "Empirical", x="Theoretical")
#  ggsave(paste0(dir.name, "/",folders[2], "/4_jackstrawplot.pdf"), plot = p5, scale = 2)
#  message("6. JackStraw plot was generated.")
#}

if (seurat@project.name == "merged"){
  Idents(seurat) <- "assay_name"
  p6 <- DimPlot(seurat, reduction = "pca", pt.size = 0.5) + theme(legend.position="bottom")
  ggsave(paste0(dir.name, "/",folders[2], "/2_dimplot_merged.pdf"), plot = p6, scale = 1.5)
}

# 5.4. Save the expresion matrix.
if(write_table){
	if (normalization == "SCT") {
		write.table(as.matrix(seurat@assays$SCT@data), file = paste0(dir.name, "/", folders[2], "/normalized_expression_matrix.tsv"), sep = "\t", row.names = TRUE, col.names = NA, quote = FALSE)
        }
        if (normalization == "standard") {
		write.table(as.matrix(seurat@assays$RNA@data), file = paste0(dir.name, "/", folders[2], "/normalized_expression_matrix.tsv"), sep = "\t", row.names = TRUE, col.names = NA, quote = FALSE)
	}
	message("6. Normalized expression matrix was saved.")
} else {
	message("6. Normalized expression matrix was not saved, as specified in the configuration file.")
}


# 5.5. Save RDS: we can use this object to generate all the rest of the data.
saveRDS(seurat, file = paste0(dir.name, "/",folders[2], "/seurat_normalized-pcs.rds"))
message("7. Seurat object was saved.")
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

message("CONFIGURATION STEP")
# A. Parameters: 
# 1. Load libraries. 
suppressMessages(library("Seurat"))
suppressMessages(library("SeuratDisk"))
suppressMessages(library("dplyr"))
suppressMessages(library("data.table"))
suppressMessages(library("reticulate"))
suppressMessages(library("clustree"))
suppressMessages(library("ggplot2"))
suppressMessages(library("cluster"))
suppressMessages(library("writexl"))
suppressMessages(library("future"))
suppressMessages(library("patchwork"))
suppressMessages(library("lisi"))
message("1. Libraries were loaded.")

# 2. Folder configuration. 
input_file = snakemake@input[["seurat_obj"]]
dir.name = snakemake@params[["output_dir"]]
folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis")
message("2. Folder paths were set.")

# 3. Get variables from Snakemake.  
pc = snakemake@params[["pc"]] # We should check the PCs using the Elbowplot and JackStraw plot.
res = as.vector(snakemake@params[["res"]])
random_seed = snakemake@params[["random_seed"]]
k_neighbors = snakemake@params[["k_neighbors"]]
batch_metadata = snakemake@params[["batch_metadata"]] # check if null
ram = snakemake@resources[["mem_mb"]]
threads = snakemake@threads
message("3. Parameters were loaded.")

# 4. Analysis configuration. 
# RAM configuration.
options(future.globals.maxSize = ram*1024^2)
#Set parallelization.
plan("multiprocess", workers = threads)
message(paste0("4. Threads were set at ", threads, "."))
# Set seed.
if (is.numeric(random_seed)) {
  set.seed(random_seed)
}
message(paste0("5. Seed was set at ", random_seed, "."))
message("Configuration finished.")
message("\n")


message("PROCESSING STEP")
# Load seurat object.
seurat <- readRDS(input_file)
assay_type <- seurat@active.assay
message("1. Seurat object was loaded.")

# 7. Clustering.
# 7.1. FindClusters using UMAP projection. We keep the significant PC obtained from PCA.
seurat <- FindNeighbors(seurat, reduction = "pca", dims = 1:pc, k.param = k_neighbors, future.seed = NULL)
seurat <- FindClusters(seurat, resolution = res, future.seed = NULL)
seurat <- RunUMAP(seurat,dims = 1:pc, n.components = 2, verbose = FALSE, future.seed = NULL)
message("2. UMAP was done.")

# 7.2. Clustree.
p1 <- clustree(seurat, prefix = paste0(assay_type,"_snn_res."))
ggsave(paste0(dir.name, "/", folders[3], "/1_clustree.pdf"), plot = p1, scale = 1.5)
message("3. Clustree representation was obtained.")

# 7.3. Clustering plots and silhouette parameters calculus.
# An empty list is created to store silhouette values.
silhouette_scores <- vector(mode = "list", length = length(res))

# If the seurat object contains more than 50K samples, a subset is done to
if (dim(seurat@meta.data)[1] > 50000){
	  message("Downsampling seurat object to perform silhouette analysis")
  seurat_sil <- seurat[, sample(colnames(seurat), size = 50000, replace=F)]
} else {
	  seurat_sil <- seurat
}

# Loop for each resolution.
for(i in 1:length(which(grepl(paste0(assay_type,"_snn_"),colnames(seurat_sil@meta.data))))){
	full_res = colnames(seurat@meta.data[which(grepl(paste0(assay_type,"_snn_"),colnames(seurat@meta.data)))][i])
	Idents(seurat) <- full_res
	p2 <- DimPlot(seurat, reduction = "umap", label = TRUE, label.size = 5) + theme_minimal() #+ theme(legend.position="bottom") 
	ggsave(paste0(dir.name, "/", folders[3], "/2_umap_",full_res,".pdf"), plot = p2, scale = 1.5)

	# Silhhouettes calculus.
	dist.matrix <- dist(x = Embeddings(object = seurat_sil[["pca"]])[, 1:pc]) #!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
	clusters <- slot(seurat_sil, "meta.data")[,full_res]
	sil <- silhouette(x = as.numeric(x = as.factor(x = clusters)), dist = dist.matrix)
	if(is.null(dim(sil))){
		silhouette_scores[[i]] <- as.data.frame(NA)
	} else {
		silhouette_scores[[i]] <- as.data.frame(summary(sil)[2])
	}
	names(silhouette_scores[[i]]) <- full_res
}

# Create a xlsx file to store the silhouette scores.
write_xlsx(silhouette_scores, path = paste0(dir.name, "/", folders[3], "/3_silhouette_score.xlsx"),col_names = TRUE, format_headers = TRUE )
message("4. Silhouette scores were calculated.")

# 7.4. Plots on vars to regress
p3 <- FeaturePlot(seurat, 'nFeature_RNA', pt.size =  0.75) + labs(title = "Nº features") 
ggsave(paste0(dir.name, "/", folders[3], "/4.1_featureplot.pdf"), plot = p3, scale = 1.5)
p4 <- FeaturePlot(seurat, 'nCount_RNA', pt.size =  0.75) + labs(title = "Nº counts") 
ggsave(paste0(dir.name, "/", folders[3], "/4.2_countplot.pdf"), plot = p4, scale = 1.5)
p5 <- FeaturePlot(seurat, 'percent.mt', pt.size =  0.75) + labs(title = "% mitochondrial") 
ggsave(paste0(dir.name, "/", folders[3], "/4.3_mitplot.pdf"), plot = p5, scale = 1.5)
p6 <- FeaturePlot(seurat, 'percent.ribo', pt.size =  0.75) + labs(title = "% ribosomal") 
ggsave(paste0(dir.name, "/", folders[3], "/4.4_riboplot.pdf"), plot = p6, scale = 1.5)
p7 <- FeaturePlot(seurat, 'S.Score', pt.size =  0.75) + labs(title = "S phase score") 
ggsave(paste0(dir.name, "/", folders[3], "/4.5_sscoreplot.pdf"), plot = p7, scale = 1.5)
p8 <- FeaturePlot(seurat, 'G2M.Score', pt.size =  0.75) + labs(title = "G2M phase score") 
ggsave(paste0(dir.name, "/", folders[3], "/4.5_g2mplot.pdf"), plot = p8, scale = 1.5)
message("4. Variable plots on UMAP dimension were produced.")

# 7.5. Dimplot for merged or integrated objects.
if (seurat@active.assay == TRUE || seurat@project.name == "merged"){
  Idents(seurat) <- "assay_name"
  p3 <- DimPlot(seurat, reduction = "umap")
  ggsave(paste0(dir.name, "/", folders[3], "/2_umap_by_assay.pdf"), plot = p3, scale = 1.5)
}

# 7.6 Compute LISI score using the desired variables. 
if(!is.null(batch_metadata)){
	X <- seurat@reductions$umap@cell.embeddings[1:dim(seurat)[2],] # Reductions matrix
	meta_data <- seurat@meta.data[, batch_metadata, drop = FALSE] # Meta data object
	res <- compute_lisi(X, meta_data, label_colnames = batch_metadata) # lisi scores
	# 7.6.1 Superpose lisi score with umap
	lapply(seq_along(batch_metadata), function(x){
		# Orig 
	        p1 <- ggplot(res, aes(x = res[,x])) + geom_density() + theme_classic() #+ theme_minimal()
	        p2 <- DimPlot(seurat, group.by = batch_metadata[x])
	        p3 <- p1 /  p2 + plot_layout(nrow = 2, heights = c(0.25, 2))
	        ggsave(paste0(dir.name, "/", folders[3], "/5_lisi_score_by_", batch_metadata[x], ".pdf"), plot = p3, scale = 1.5)
        })
	message("5. LISI score was calculated.")
}
# 7.7. Statistics table per cluster.
# 7.7.1. Get the index of resolution columns.
resolutions = grep("snn_res", colnames(seurat@meta.data))
# 7.7.2. Set the highest number of clusters.
number_clusters = length(unique(seurat@meta.data$seurat_clusters))
# 7.7.3. Get the maximum cluster names vector.
cluster_names = paste0("Cluster ", seq(number_clusters)) 

# 7.7.4. Loop for each resolution and write table.
for(j in 1:length(resolutions)){
  # Set idents and levels.
  Idents(seurat) <- seurat@meta.data[,resolutions[j]]
  levels(Idents(seurat)) <- cluster_names[1:length(levels(Idents(seurat)))]
  table(Idents(seurat))

  # Proportion of cells per cluster.
  prop.table(x = table(Idents(seurat)))
  cluster.averages <- AverageExpression(object = seurat)
  genes_per_cluster <- Matrix::colSums(cluster.averages$RNA>0)

  # Generate a joint table.
  joint_stats = rbind(number_of_cells = table(Idents(seurat)), prop_per_cluster = prop.table(x = table(Idents(seurat))), genes_per_cluster = genes_per_cluster)
  colnames(joint_stats) = names(table(Idents(seurat)))
  write.table(t(joint_stats), file=paste0(dir.name, "/", folders[3], "/5_", colnames(seurat@meta.data)[resolutions[j]],"_stats.tsv"), sep="\t", col.names = NA, quote = FALSE)
}
message("5. Statistics table was done and saved.")

# 7.8. Save Seurat object in AnnData
SaveH5Seurat(seurat, filename = paste0(dir.name, "/", folders[3], "/seurat_find-clusters.h5Seurat"))
Convert(paste0(dir.name, "/",folders[3], "/seurat_find-clusters.h5Seurat"), dest = "h5ad")

# 7.9. Save RDS: we can use this object to generate all the rest of the data.
saveRDS(seurat, file = paste0(dir.name, "/",folders[3], "/seurat_find-clusters.rds"))
message("8. Seurat object was saved.")
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

message("CONFIGURATION STEP")
# A. Parameters: 
# 1. Load libraries. 
suppressMessages(library("Seurat"))
suppressMessages(library("dplyr"))
suppressMessages(library("data.table"))
suppressMessages(library("reticulate"))
suppressMessages(library("ggplot2"))
suppressMessages(library("BiocParallel"))
suppressMessages(library("openxlsx"))
suppressMessages(library("future"))
message("1. Libraries were loaded.")

# 2. Folder configuration. 
dir.name = snakemake@params[["output_dir"]]
input_data = snakemake@input[["seurat_obj"]]
folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis")
message("2. Folder paths were set.")

# 3. Get variables from Snakemake.  
selected_cond = snakemake@params[["selected_cond"]]
random_seed = snakemake@params[["random_seed"]]
test = snakemake@params[["test"]]
ranking = snakemake@params[["ranking"]]
ram = snakemake@resources[["mem_mb"]]
threads = snakemake@threads
message("3. Parameters were loaded.")

# 4. Analysis configuration. 
# RAM configuration.
options(future.globals.maxSize = ram*1024^2)
#Set parallelization.
plan("multiprocess", workers = threads)
message(paste0("4. Threads were set at ", threads, "."))
# Set seed.
if (is.numeric(random_seed)) {
  set.seed(random_seed)
}
message(paste0("5. Seed was set at ", random_seed, "."))
message("Configuration finished.")
message("\n")


message("PROCESSING STEP")
# Load seurat object and set clustering resolution and assay type.
seurat <- readRDS(input_data)
assay_type <- seurat@active.assay
message("1. Seurat object was loaded.")

lapply(selected_cond, function(cond){
  if(grepl("0.", cond)){
    cond <- paste0(assay_type, "_snn_res.", cond)
    if(!(cond %in% colnames(seurat@meta.data))){
      stop("The specified resolution is not available.")
    }
  } else {
    if (!(cond %in% colnames(seurat@meta.data))){
      stop("The specified condition is not available.")
    } 
  }


  if (length(unique(seurat@meta.data[[cond]])) >1){
    # Check number of cells for the heatmap to avoid errors.
    if (length(colnames(seurat)) < 1000){
      n_cells = length(colnames(seurat))
    } else {
      n_cells = 1000
    }

    #Set styles for xlsx files.
    redStyle <- createStyle(fontColour = "#B60A1C", bgFill = "#FFF06A", textDecoration = c("BOLD"))
    greenStyle <- createStyle(fontColour = "#309143", bgFill = "#FFF06A", textDecoration = c("BOLD"))

    # 8. Differentially expressed genes between clusters. 
    Idents(seurat) <- cond

    # If the input seurat object is integrated we only perform the FindConservedMarkers.
    if (seurat@active.assay == "integrated") {
      comparisons <- FetchData(seurat, vars = c("assay_name", cond))
      comparisons$assay_cond<-  apply(comparisons[ ,c("assay_name", cond)] , 1 , paste , collapse = "-" )
      if(length(unique(comparisons$assay_cond))>2){
        for (i in 1:length(unique(Idents(seurat)))){
          # 8.1. Table on differentially expressed genes - using basic filterings.
          clusterX.markers <- FindConservedMarkers(seurat, ident.1 = unique(Idents(seurat))[i], min.pct = 0.25, grouping.var = "assay_name", test.use = test) #min expressed
          write.table(clusterX.markers, file = paste0(dir.name, "/", folders[4], "/cluster", unique(Idents(seurat))[i],".markers.txt"), sep = "\t", col.names = NA, quote = FALSE)
        }
      } else {
        print("The specified condition overlaps with the assay level.")
      }
      message("2. DEG analysis was done for integrated assay.")

      # If the input seurat object is not integrated.
    } else {
      # 8.1. Table on differentially expressed genes - using basic filterings.
      for (i in 1:length(unique(Idents(seurat)))){
        clusterX.markers <- FindMarkers(seurat, ident.1 = unique(Idents(seurat))[i], min.pct = 0.25, test.use = test) #min expressed
        write.table(clusterX.markers, file = paste0(dir.name, "/", folders[4], "/cluster", unique(Idents(seurat))[i],".markers.txt"), sep = "\t", col.names = NA, quote = FALSE)
      }
      message("2. Cluster markers were obtained.")

      # 8.2. DE includying all genes - needed for a GSEA analysis. 
      if (ranking){
        for (i in 1:length(unique(Idents(seurat)))){
          clusterX.markers <- FindMarkers(seurat, ident.1 = unique(Idents(seurat))[i], min.pct = 0, logfc.threshold = 0, test.use = test) #min expressed
          #write.table(clusterX.markers, file = paste0(dir.name, "/", folders[4], "/cluster", unique(Idents(seurat))[i],".DE.txt"), sep = "\t", col.names = NA, quote = FALSE)
          wb <- createWorkbook()
          addWorksheet(wb, "DE analysis")
          writeData(wb, "DE analysis", clusterX.markers, rowNames = TRUE)
          conditionalFormatting(wb, "DE analysis", cols = 1:(ncol(clusterX.markers)+1),
                                rows = 2:(nrow(clusterX.markers) + 1), rule = "AND($C2<0, $F2<0.05)",
                                style = greenStyle)
          conditionalFormatting(wb, "DE analysis", cols = 1:(ncol(clusterX.markers)+1),
                                rows = 2:(nrow(clusterX.markers) + 1), rule = "AND($C2>0, $F2<0.05)",
                                style = redStyle)
          legend <- createComment(comment = c("Red means a positive LogFold\n\n", "Green means a negative LogFold"), style = c(redStyle, greenStyle))
          writeComment(wb, "DE analysis", col = 8, row = 2, comment = legend)
          saveWorkbook(wb, paste0(dir.name, "/", folders[4], "/cluster", unique(Idents(seurat))[i],".DE.xlsx"), overwrite = TRUE)    

          # 8.2.1. Create RNK file. 
          rnk = NULL
          rnk = as.matrix(clusterX.markers[,2])
          rownames(rnk)= toupper(row.names(clusterX.markers))
          write.table(rnk, file = paste0(dir.name, "/", folders[4], "/cluster", unique(Idents(seurat))[i],".rnk"), sep = "\t", col.names = FALSE, quote = FALSE)
        }
        message("3. All genes DEG and RNK file were finished.")
      }

      # 8.3. Find TOP markers.
      seurat.markers <- FindAllMarkers(object = seurat, only.pos = TRUE, min.pct = 0.25, thresh.use = 0.25, test.use = test)
      groupedby.clusters.markers = seurat.markers %>% group_by(cluster) %>% top_n(10, avg_logFC)

      # 8.3.1. HeatMap top10.f
      # setting heatmap text label sizes
      ytext_value <- 70/(length(unique(seurat@meta.data[[cond]]))*1.30)
      if (ytext_value > 10) {ytext_value = 10}
      xtext_value <- 50/(length(unique(seurat@meta.data[[cond]]))*1.25)
      if (xtext_value > 8) {xtext_value = 8}

      # setting slim.col.label to TRUE will print just the cluster IDS instead of every cell name
      p1 <- DoHeatmap(object = seurat, features = groupedby.clusters.markers$gene, cells = 1:n_cells, size = xtext_value, angle = 45, 
                      group.bar = TRUE, draw.lines = F, raster = FALSE) +
        scale_fill_gradientn(colors = c("blue", "white", "red")) + guides(color=FALSE) + 
        theme(axis.text.y = element_text(size = ytext_value)) + 
        theme(legend.position="bottom") 
      ggsave(paste0(dir.name, "/", folders[4], "/1_heatmap_topmarkers.pdf"), plot = p1, scale = 2)
      message("4. Top marker heatmap was done.")
    }

    # 8.4. Save RDS: we can use this object to generate all the rest of the data.
    saveRDS(seurat, file = paste0(dir.name, "/",folders[4], "/seurat_degs.rds"))
    message("5. Seurat object was saved.")

  } else if (length(unique(seurat@meta.data[[cond]])) <=1) {
    if(assay_type == "integrated"){
    	print("The specified condition has only one level.")  
      	if(!file.exists( paste0(dir.name, "/", folders[4], "/seurat_degs.rds"))){
	      saveRDS(seurat, file = paste0(dir.name, "/", folders[4], "/seurat_degs.rds"))
      	}
    } else {
      print("The specified condition has only one level. Nothing to be done.")
    	if(!file.exists( paste0(dir.name, "/", folders[4], "/seurat_degs.rds"))){
      		saveRDS(seurat, file = paste0(dir.name, "/", folders[4], "/seurat_degs.rds"))
    	}	
    }
  }
})
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

message("CONFIGURATION STEP")
# A. Parameters: 
# 1. Load libraries. 
suppressMessages(library("Seurat"))
suppressMessages(library("dplyr"))
suppressMessages(library("data.table"))
suppressMessages(library("reticulate"))
suppressMessages(library("ggplot2"))
suppressMessages(library("qusage"))
suppressMessages(library("clustree"))
suppressMessages(library("patchwork"))
message("1. Libraries were loaded.")

# 2. Folder configuration. 
dir.name = snakemake@params[["output_dir"]]
input_data = snakemake@input[["seurat_obj"]]
folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis")
message("2. Folder paths were set.")

# B. Parameters: analysis configuration 
geneset_collection = snakemake@params[["gs_collection"]]
random_seed = snakemake@params[["random_seed"]]
resolutions = snakemake@params[["resolutions"]]
geneset_percentage <- snakemake@params[["geneset_percentage"]]
norm_type = snakemake@params[["norm_type"]]
ram = snakemake@resources[["mem_mb"]]
message("3. Parameters were loaded.")

# 4. Analysis configuration. 
# RAM configuration.
options(future.globals.maxSize = ram*1024^2)
# Set seed.
if (is.numeric(random_seed)) {
  set.seed(random_seed)
}
message(paste0("4. Seed was set at ", random_seed, "."))
message("Configuration finished.")
message("\n")

message("PROCESSING STEP")
# Load seurat object.
seurat = readRDS(input_data)
assay_type <- seurat@active.assay
dir.create(paste0(dir.name, "/", folders[5]), showWarnings = FALSE)
message("1. Seurat object was loaded.")

# 9. GS scoring
# 9.1 Geneset loading, filtering and scoring.
genesets <- read.gmt(geneset_collection) #should be a tab file, each row = pathway.
genesets <- genesets[unlist(lapply(genesets, function(x) ((length(which(x%in% rownames(seurat)))/ length(x))> geneset_percentage)))]
seurat <- AddModuleScore(object = seurat, features= genesets, name = names(genesets))
message("2. Signatures were loaded and scores were calculated.")

# 9.2 We create a vector containing the different combinations for each resolutions and each cluster.  
res_clus_comb <- vector()  #Empty vector.
for (res in resolutions){
  full_res = paste0(assay_type, "_snn_res.", res)
  for (cluster in levels(seurat@meta.data[[full_res]])) {
    x <- paste0(full_res,"C",cluster) 
    res_clus_comb <- append(res_clus_comb, x) #Adding combinations to the vector.
  }
}

# 9.3 We create a MxN matrix for the p-values, where M is the previous calculated combinations and N is the different gene sets used.
mtx_pval <- matrix(nrow=length(res_clus_comb), ncol=length(names(genesets)))

# 9.4 We fill the matrix with the p-value calculated at cluster level for each resolution and gene set.
if(norm_type == "standard") {
  norm_type <- "RNA"
}

if(assay_type == "integrated"){
  seurat@active.assay <- norm_type
}

for (col in 1:length(genesets)) {
  row = 1
  for (res in resolutions){
    full_res = paste0(assay_type, "_snn_res.", res)
    if(length(levels(seurat@meta.data[[full_res]]))>1){
      for (cluster in levels(seurat@meta.data[[full_res]])) {
        # The resolution is set as Idents
        Idents(seurat) <- seurat[[full_res]]

        #Counts per gene calculus.
        seurat_target_clust <- subset(seurat, idents = cluster)
        counts_in_gene_set_target<- rowMeans(as.matrix(seurat_target_clust@assays[[norm_type]]@data))[which(names(rowSums(as.matrix(seurat_target_clust@assays[[norm_type]]@data))) %in% 
                                                                                                              genesets[[col]])]
        #expr_genes_target <- length(counts_in_gene_set_target[counts_in_gene_set_target > 1])
        #all_expr_genes_target <- length(rowMeans(as.matrix(seurat_target_clust@assays[[assay_type]]@counts))[rowSums(as.matrix(seurat_target_clust@assays[[assay_type]]@counts)) > 1])

        seurat_offtarget_clust <- subset(seurat, idents = cluster, invert = TRUE)
        counts_in_gene_set_offtarget<- rowMeans(as.matrix(seurat_offtarget_clust@assays[[norm_type]]@data))[which(names(rowSums(as.matrix(seurat_offtarget_clust@assays[[norm_type]]@data))) %in% genesets[[col]])]
        #expr_genes_offtarget <- length(counts_in_gene_set_offtarget[counts_in_gene_set_offtarget > 1])
        #all_expr_genes_offtarget <- length(rowMeans(as.matrix(seurat_offtarget_clust@assays[[assay_type]]@counts))[rowSums(as.matrix(seurat_offtarget_clust@assays[[assay_type]]@counts)) > 1])

        #Max count value used to generate the factor. 
        max_val <- max(c(counts_in_gene_set_target, counts_in_gene_set_offtarget))
        #gene_matrix <- matrix(data = c(expr_genes_target, expr_genes_offtarget, all_expr_genes_target, all_expr_genes_offtarget), nrow = 2)    

        #The factor is calculated and applied for each vector.
        target_factor <- length(counts_in_gene_set_target[counts_in_gene_set_target > 0.05*max_val])/length(counts_in_gene_set_target)
        offtarget_factor <- length(counts_in_gene_set_offtarget[counts_in_gene_set_offtarget > 0.05*max_val])/length(counts_in_gene_set_offtarget)

        counts_in_gene_set_target_factor <- counts_in_gene_set_target*target_factor
        counts_in_gene_set_offtarget_factor <- counts_in_gene_set_offtarget*offtarget_factor

        # Wilcoxon test: paired, using a vectors of gene means for each Seurat subset.
        p_value <- wilcox.test(counts_in_gene_set_target_factor, counts_in_gene_set_offtarget_factor, paired = TRUE, alternative = "greater")$p.value
        mtx_pval[row, col] <- p_value
        row = row + 1
      }
    } else {
      mtx_pval[row, col] <- NA
      row = row + 1
    }
  }
}
message("3. Significance per signature and cluster was calculated.")

# 9.5 The p-values are corrected per geneset.
mtx_corr_pval <- mtx_pval
for (col in 1:length(genesets)) {
  p_val_vec_corr <- p.adjust(mtx_corr_pval[,col], "fdr") 
  mtx_corr_pval[,col] <- p_val_vec_corr
}
message("4. P-values were corrected.")

# 9.6 We create a dataframe from a matrix
df_pval <- data.frame(mtx_corr_pval)
row.names(df_pval) <- res_clus_comb
colnames(df_pval) <- names(genesets)
df_pval <- df_pval[ order(row.names(df_pval)), ]

# 9.7 Save p-value table
write.table(df_pval, file = paste0(dir.name, "/",folders[5], "/pval_table.tsv"), sep = "\t", quote = FALSE, row.names = TRUE, col.names = NA)
message("5. P-value table was saved.")

# 9.8 We loop for each geneset generating the plots
for (i in 1:length(genesets)){
  gs_name = paste0(names(genesets[i]), i)
  module_name = colnames(seurat@meta.data)[grep(names(genesets)[i], colnames(seurat@meta.data))]

  #Feature plot
  p1 <- FeaturePlot(object = seurat, features = module_name) #+ theme(legend.position="bottom") 
  ggsave(paste0(dir.name, "/", folders[5], "/", names(genesets)[i], "_featureplot.pdf"), plot = p1, scale = 1.5)

  #Mean expression plot
  p2 <- clustree(seurat, paste0(assay_type,"_snn_res."), node_colour = gs_name , node_colour_aggr = "mean") + ggtitle(label = names(genesets)[i]) + theme(plot.title = element_text(size = 12, face = "bold"))
  p2$labels["colour"] <- "signature mean"
  #ggsave(paste0(dir.name, "/", folders[5], "/", names(genesets)[i], "_clustree_mean.pdf"), plot = p2, scale = 1)

  #P-val plot
  p3 <- clustree(seurat, paste0(assay_type,"_snn_res."), node_colour = gs_name , node_colour_aggr = "mean") + ggtitle(label = names(genesets)[i]) + scale_colour_gradient2(low = "blue", mid = "white", high = "red", midpoint = 1.3) + theme(plot.title = element_text(size = 12, face = "bold"))
  clustree_gs_colname = paste0("mean_",gs_name)
  p3$data[[clustree_gs_colname]] <- -log(df_pval[, names(genesets)[i]])
  p3$labels[["colour"]] <- "-log(p_value)"
  #ggsave(paste0(dir.name, "/", folders[5], "/", names(genesets)[i], "_clustree_pval.pdf"), plot = p3, scale = 1) 

  #Plot combination
  p4 <- p2 + p3
  ggsave(paste0(dir.name, "/", folders[5], "/", names(genesets)[i], "_clustree_mean_pval.pdf"), plot = p4, scale = 1, width = 12, height=8, units = "in")

}
message("6. Clustree plots were produced.")

#9.9 Save seurat object
saveRDS(seurat, file = paste0(dir.name, "/",folders[5], "/seurat_complete.rds"))
message("7. Seurat object was saved.")
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

message("CONFIGURATION STEP")
# A. Parameters: 
# 1. Load libraries. 
suppressMessages(library("GenomeInfoDbData"))
suppressMessages(library("Seurat"))
suppressMessages(library("RColorBrewer"))
suppressMessages(library("slingshot"))
suppressMessages(library("SingleCellExperiment"))
suppressMessages(library("rmarkdown"))
suppressMessages(library("gam"))
suppressMessages(library("pheatmap"))
if (snakemake@params[["graphics"]]) {suppressMessages(library("rgl"))}
if (snakemake@params[["graphics"]]) {suppressMessages(library("htmltools"))}
message("1. Libraries were loaded.")

# 2. Folder configuration. 
dir.name = snakemake@params[["output_dir"]]
input_data = snakemake@input[["seurat_obj"]]
folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis")
message("2. Folder paths were set.")

# 3. Get variables from Snakemake.  
selected_res = snakemake@params[["selected_res"]]
start.clus = snakemake@params[["start_clus"]]
end.clus = snakemake@params[["end_clus"]]
n_var_genes = snakemake@params[["n_var_genes"]]
n_plotted_genes = snakemake@params[["n_plotted_genes"]]
random_seed = snakemake@params[["random_seed"]]
pc = snakemake@params[["pc"]]
graphics = snakemake@params[["graphics"]]
ram = snakemake@resources[["mem_mb"]]
threads = snakemake@threads
message("3. Parameters were loaded.")

# 4. Analysis configuration. 
# RAM configuration.
options(future.globals.maxSize = ram*1024^2)
# Set seed.
if (is.numeric(random_seed)) {
  set.seed(random_seed)
}
message(paste0("4. Seed was set at ", random_seed, "."))
message("Configuration finished.")
message("\n")

message("PROCESSING STEP")
# 10. Trajectory inference analysis using slingshot.
# 10.1. Seurat object is converted to SingleCellExperiment object (required by slingshot) & cluster resolution is selected. 
seurat <- readRDS(input_data)
assay_type <- seurat@active.assay
cluster_res <- paste0(assay_type,"_snn_res.",selected_res)
if (!(cluster_res %in% colnames(seurat@meta.data))){
  stop("Specified resolution is not available.")
}
seurat.sim <- as.SingleCellExperiment(seurat)
message("1. Seurat object was loaded.")

if (graphics){
  # Running UMAP to get 3 dimensions.
  seurat3D <- RunUMAP(seurat,dims = 1:pc, n.components = 3, verbose = FALSE)
  seurat.sim <- as.SingleCellExperiment(seurat)
  seurat3D.sim <- as.SingleCellExperiment(seurat3D)
  message("1.5. 3D UMAP was calculated.")
}

# 10.2. Slingshot algorithm (dimensions = UMAP). There are 4 options depending of the start and end cluster in the following trajectory:
if(is.numeric(start.clus) == FALSE && is.numeric(end.clus) == FALSE){
  seurat.sim <- slingshot(seurat.sim, clusterLabels=cluster_res, reducedDim="UMAP")
  if (graphics) {seurat3D.sim <- slingshot(seurat3D.sim, clusterLabels=cluster_res, reducedDim="UMAP")}
  const = FALSE
} else if(is.numeric(start.clus) == TRUE && is.numeric(end.clus) == FALSE){
  seurat.sim <- slingshot(seurat.sim, clusterLabels=cluster_res, reducedDim="UMAP", start.clus = start.clus)
  if (graphics) {seurat3D.sim <- slingshot(seurat3D.sim, clusterLabels=cluster_res, reducedDim="UMAP", start.clus = start.clus)}
  const = TRUE
} else if(is.numeric(start.clus) == FALSE && is.numeric(end.clus) == TRUE){
  seurat.sim <- slingshot(seurat.sim, clusterLabels=cluster_res, reducedDim="UMAP", end.clus = end.clus)
  if (graphics) {seurat3D.sim <- slingshot(seurat3D.sim, clusterLabels=cluster_res, reducedDim="UMAP", end.clus = end.clus)}
  const = TRUE
} else if(is.numeric(start.clus) == TRUE && is.numeric(end.clus) == TRUE){
  seurat.sim <- slingshot(seurat.sim, clusterLabels=cluster_res, reducedDim="UMAP", start.clus = start.clus, end.clus=end.clus)
  if (graphics) {seurat3D.sim <- slingshot(seurat3D.sim, clusterLabels=cluster_res, reducedDim="UMAP", start.clus = start.clus, end.clus=end.clus)}
  const = TRUE
}
message("2. Slingshot trajectories were calculated.")

# 10.3. Plots generation.
# 10.3.1. Set the number of colours from the palette and extend it.
n_col <- length(levels(seurat.sim@colData[, cluster_res]))
getPalette <- colorRampPalette(brewer.pal(9,'Set1'))

if (graphics) {
  # 10.3.2. Set a good window size for the 3D plots.
  r3dDefaults$windowRect <- c(0,50, 1024, 720) 

  # 10.3.3. Curves 3D plot with legend --> HTML output.
  plot3d(reducedDims(seurat3D.sim)$UMAP[,1:3], col = getPalette(n_col)[seurat3D.sim@colData[, cluster_res]])
  plot3d.SlingshotDataSet(SlingshotDataSet(seurat3D.sim), type = "curves", add = TRUE, lwd =3)
  legend3d("topright", legend=paste0("Cluster - ", levels(seurat3D.sim@colData[, cluster_res])), pch=16, col=getPalette(n_col), inset=c(0))
  p_curves <- rglwidget(width = 1240, height = 1024)
  htmltools::save_html(htmltools::tagList(p_curves), file = paste0(dir.name, "/", folders[6], "/", paste0("3D_curves_", selected_res, "_res.html")))

  # 10.3.4. Lineage 3D plot with legend --> HTML output.
  plot3d(reducedDims(seurat3D.sim)$UMAP[,1:3], col = getPalette(n_col)[seurat3D.sim@colData[, cluster_res]])
  plot3d.SlingshotDataSet(SlingshotDataSet(seurat3D.sim), type = "lineages", add = TRUE, lwd =3)
  legend3d("topright", legend=paste0("Cluster - ", levels(seurat3D.sim@colData[, cluster_res])), pch=16, col=getPalette(n_col), inset=c(0))
  p_lineages <- rglwidget(width = 1240, height = 1024)
  htmltools::save_html(htmltools::tagList(p_lineages), file = paste0(dir.name, "/", folders[6], "/", paste0("3D_lineages_", selected_res, "_res.html")))
}
# 10.3.5. Curves 2D plot with legend --> pdf output. 
pdf(paste0(dir.name, "/", folders[6], "/2D_curves_", selected_res, "_res.pdf"), width = 11, height = 11)
plot(reducedDims(seurat.sim)$UMAP, col = getPalette(n_col)[seurat.sim@colData[, cluster_res]],
     pch=16, asp = 1, main = "2D curves - Clusters Trajectories")
legend("topright", legend=paste0("Cluster - ", levels(seurat.sim@colData[, cluster_res])), pch=16, col=getPalette(n_col))
lines(SlingshotDataSet(seurat.sim), lwd=2, col = 'black')
dev.off()

# 10.3.6. Lineage 2D plot with legend --> pdf output.
pdf(paste0(dir.name, "/", folders[6], "/2D_lineages_", selected_res, "_res.pdf"), width = 11, height = 11)
plot(reducedDims(seurat.sim)$UMAP, col = getPalette(n_col)[seurat.sim@colData[, cluster_res]],
     pch=16, asp = 1, main = "2D lineages - Clusters Trajectories")
legend("topright", legend=paste0("Cluster - ", levels(seurat.sim@colData[, cluster_res])), pch=16, col=getPalette(n_col))
lines(SlingshotDataSet(seurat.sim), lwd=2, type = 'lineages', col = 'black', show.constraints = const)
dev.off()
message("3. Trajectory lots were done.")

# 10.4. Temporally expressed genes heatmap.
# 10.4.1. Pseudotime is obtained.
t <- seurat.sim$slingPseudotime_1

# 10.4.2. Get the n variable genes.
Y <- assays(seurat.sim)$logcounts
var_genes <- names(sort(apply(Y,1,var),decreasing = TRUE))[1:n_var_genes]
Y <- Y[var_genes,]

# 10.4.3. Fitting a gam model.
gam.pval <- apply(Y,1,function(z){
  d <- data.frame(z=z, t=t)
  suppressWarnings({
    tmp <- suppressWarnings(gam(z ~ lo(t), data=d))
  })
  p <- summary(tmp)[3][[1]][2,3]
  p
})

# 10.4.4. Get the top genes selected by p-val and plot the heatmap.
topgenes <- names(sort(gam.pval, decreasing = FALSE))[1:n_plotted_genes]
heatdata <- assays(seurat.sim)$logcounts[topgenes, order(t, na.last = NA)]
heatclus <- seurat.sim@colData[,cluster_res][order(t, na.last = NA)]
annotation <- data.frame("Cluster" = heatclus, row.names = colnames(heatdata))
ann_colors <- list("Cluster" = setNames(getPalette(n_col),
                                        levels(heatclus)))
# 10.4.5. Plot the genes.
pheatmap(heatdata, cluster_cols = FALSE, 
         color =  colorRampPalette(c("yellow", "red"))(100),
         annotation_col = annotation, annotation_colors = ann_colors,
         show_colnames = FALSE, filename = paste0(dir.name, "/", folders[6], "/temporally_expressed_heatmaps_", selected_res, "_res.pdf"))
message("4. Pseudotime heatmap was obtained.")

# Save used objects.
if (graphics) {save(seurat.sim, seurat3D.sim, file = paste0(dir.name, "/", folders[6], "/slingshot_sce_objects.RData"))
} else {save(seurat.sim, file = paste0(dir.name, "/", folders[6], "/slingshot_sce_objects.RData"))}
message("5. R objects were saved.")
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

message("CONFIGURATION STEP")
# A. Parameters: 
# 1. Load libraries. 
suppressMessages(library("Seurat"))
suppressMessages(library("ggplot2"))
suppressMessages(library("viridis"))
suppressMessages(library("devtools"))
suppressMessages(library("RColorBrewer"))
suppressMessages(library("VISION"))
suppressMessages(library("scales"))
message("1. Libraries were loaded.")

# 2. Folder configuration. 
dir.name = snakemake@params[["output_dir"]]
input_data = snakemake@input[["seurat_obj"]]
folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis")
message("2. Folder paths were set.")

# 3. Get variables from Snakemake.  
geneset_collection = snakemake@params[["geneset_collection"]]
meta_columns = snakemake@params[["meta_columns"]]
selected_res = snakemake@params[["selected_res"]]
random_seed = snakemake@params[["random_seed"]]
regress_out = snakemake@params[["regress_out"]]
vars_to_regress = snakemake@params[["vars_to_regress"]]
regress_cell_cycle = snakemake@params[["regress_cell_cycle"]]
ram = snakemake@resources[["mem_mb"]]
threads = snakemake@threads
message("3. Parameters were loaded.")

# 4. Analysis configuration. 
# RAM configuration.
options(future.globals.maxSize = ram*1024^2)
# Set seed.
if (is.numeric(random_seed)) {
  set.seed(random_seed)
}
message(paste0("4. Seed was set at ", random_seed, "."))
message("Configuration finished.")
message("\n")


message("PROCESSING STEP")
# 11. Vision functional analysis.
# 11.1. Seurat object with clustering is loaded.
seurat <- readRDS(input_data)
assay_type <- seurat@active.assay
cluster_res <- paste0(assay_type, "_snn_res.", selected_res)
if (!(cluster_res %in% colnames(seurat@meta.data))){
  stop("Specified resolution is not available.")
}
meta_columns <-  c(meta_columns, cluster_res)
message("1. Seurat object was loaded.")

# 11.2. Vision object is created. 
# If seurat object is not a integration.
if (seurat@active.assay != "integrated"){
  # To avoid negative values with SCT normalization we use the standard, keeping the information from the previous analysis.
  if (seurat@active.assay == "SCT") {
    seurat@active.assay <- "RNA"
    seurat <- NormalizeData(seurat, normalization.method = "LogNormalize", scale.factor = 10000, seed = TRUE)
    all.genes <- rownames(seurat)
    if(regress_out == TRUE){
      if (regress_cell_cycle) {
        seurat <- ScaleData(seurat, features = rownames(seurat), vars.to.regress = c(vars_to_regress, "S.Score", "G2M.Score"))
      } else {
        seurat <- ScaleData(seurat, features = rownames(seurat), vars.to.regress = vars_to_regress)
      }
    } else {
      seurat <- ScaleData(seurat, features = rownames(seurat))
    }
  }
  suppressMessages(vis <- Vision(seurat,
                                 signatures = geneset_collection,
                                 meta = seurat@meta.data[,meta_columns],
                                 projection_methods = NULL))
} else {
# Calculate vision score using the RNA slot when samples have been integrated. 
  seurat@active.assay <- "RNA"
  seurat <- NormalizeData(seurat, normalization.method = "LogNormalize", scale.factor = 10000, seed = TRUE)
  all.genes <- rownames(seurat)
  if(regress_out == TRUE){
    if (regress_cell_cycle) {
      seurat <- ScaleData(seurat, features = rownames(seurat), vars.to.regress = c(vars_to_regress, "S.Score", "G2M.Score"))
    } else {
      seurat <- ScaleData(seurat, features = rownames(seurat), vars.to.regress = vars_to_regress)
    }
  } else {
    seurat <- ScaleData(seurat, features = rownames(seurat))
  }
  suppressMessages(vis <- Vision(seurat,
                                 signatures = geneset_collection,
                                 meta = seurat@meta.data[,c(meta_columns, "assay_name")],
                                 projection_methods =  NULL))
  seurat@active.assay <- "integrated"
}
message("2. Vision object was created.")

# 11.3. Vision analysis step.
options(mc.cores = threads)
vis <- suppressMessages(analyze(vis))
message("3. Vision object was analyzed.")

# 11.4. Outputs generation.
# 11.4.1. Projection values stored.
if (seurat@active.assay != "integrated"){
  projections <- vis@Projections$Seurat_umap
} else {
  vis@Projections[[1]] <- seurat@reductions$umap@cell.embeddings
  projections <- seurat@reductions$umap@cell.embeddings
}

# 11.4.2. Cluster plot in UMAP projection.
getPalette <- colorRampPalette(brewer.pal(9,'Set1'))
p1 <- ggplot() + aes(x=projections[, 1], y=projections[, 2], color=vis@metaData[,cluster_res]) + geom_point(alpha=0.5) + xlab("UMAP_1") + ylab("UMAP_2") + ggtitle("Clusters representation UMAP") + labs(color='clusters') + scale_color_manual(values = getPalette(nlevels(vis@metaData[,cluster_res])))
suppressMessages(ggsave(paste0(dir.name, "/", folders[7], "/CLUSTERS_REPRESENTATION_UMAP_set1.pdf"), plot = p1))

# 11.4.3. Integration plot in UMAP projection.
if (seurat@active.assay == "integrated"){
  p2 <- ggplot() + aes(x=projections[, 1], y=projections[, 2], color=vis@metaData[,"assay_name"]) + geom_point(alpha=0.5) + xlab("UMAP_1") + ylab("UMAP_2") + ggtitle("Integration representation UMAP") + labs(color='Assay') + scale_color_manual(values = getPalette(nlevels(vis@metaData[,"assay_name"])))
  suppressMessages(ggsave(paste0(dir.name, "/", folders[7], "/INTEGRATION_REPRESENTATION_UMAP_set1.pdf"), plot = p2))
}

# 11.4.4. Clusters vs Molecular Signatures statistics values. 
for (i in 1:nlevels(vis@metaData[,cluster_res])){
  write.table(vis@ClusterComparisons$Signatures[[cluster_res]][[i]], file = paste0(dir.name, "/", folders[7], "/vision_table_res_", selected_res, "_cluster_", (i-1), ".tsv"), col.names = NA, quote = FALSE, sep = "\t")
}
message("4. Vision plots were produced.")

# 11.4.5. Molecular Signatures statistics values.
stats_DF <- data.frame("Consistency" = vis@LocalAutocorrelation$Signatures$C,
                       "p-values" =  vis@LocalAutocorrelation$Signatures$pValue, "FDR" = vis@LocalAutocorrelation$Signatures$FDR)
colnames(stats_DF) <- c("Consistency", "p-values", "FDR")
rownames(stats_DF) <- rownames(vis@LocalAutocorrelation$Signatures) 
write.table(stats_DF, file = paste0(dir.name, "/", folders[7], "/vision_param_values_per_geneset.txt"), col.names = NA, quote = FALSE, sep = "\t" )
message("5. Vision statistics were saved in a table.")

# 11.4.6. Molecular signatures scores in UMAP projection.
for (genesig in colnames(vis@SigScores)){
  if (stats_DF[genesig,"FDR"] < 0.05) {
    p3 <- ggplot() + aes(x=projections[, 1], y=projections[, 2],  color=vis@SigScores[, genesig]) + geom_point() + ggtitle(genesig) +   scale_color_viridis(option = "D") + xlab("UMAP_1") + ylab("UMAP_2") + ggtitle(paste0(genesig, " - UMAP")) +  labs(color='vision score')
    suppressMessages(ggsave(paste0(dir.name, "/", folders[7], "/", genesig, "_UMAP.pdf"), plot = p3))
  }
}
message("6. Vision plots from signatures were produced.")

# 11.5. Save Vision Object.
saveRDS(vis, file = paste0(dir.name, "/", folders[7], "/vision_object.rds"))
message("7. Vision object was saved.")
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
log <- file(snakemake@log[[1]], open = "wt")
sink(log)
sink(log, type = "message")

message("CONFIGURATION STEP")
# A. Parameters: 
# 1. Load libraries.
suppressMessages(library("Seurat"))
suppressMessages(library("velocyto.R"))
suppressMessages(library("SeuratWrappers"))
suppressMessages(library("RColorBrewer"))
suppressMessages(library("future"))
message("1. Libraries were loaded.")

# 2. Folder configuration. 
input_data = snakemake@input[["seurat_obj"]]
velocyto_dir = snakemake@params[["velocyto_dir"]]
dir.name = snakemake@params[["output_dir"]]
folders = c("1_preprocessing", "2_normalization", "3_clustering", "4_degs", "5_gs", "6_traj_in", "7_func_analysis", "8_RNA_velocity")
message("2. Folder paths were set.")

# 3. Get variables from Snakemake.
selected_res = snakemake@params[["selected_res"]]
random_seed = snakemake@params[["random_seed"]]
downsampling = snakemake@params[["downsampling"]]
n_cells = snakemake@params[["n_cells"]]
ram = snakemake@resources[["mem_mb"]]
threads = snakemake@threads
message("3. Parameters were loaded.")

# 4. Analysis configuration. 
# RAM configuration.
options(future.globals.maxSize = ram*1024^2)
#Set parallelization.
plan("multiprocess", workers = threads)
message(paste0("4. Threads were set at ", threads, "."))
# Set seed.
if (is.numeric(random_seed)) {
  set.seed(random_seed)
}
message(paste0("5. Seed was set at ", random_seed, "."))
message("Configuration finished.")
message("\n")


message("PROCESSING STEP")
# Load seurat object
seurat = readRDS(input_data)

# 12. Compute velocity and velocity plots.
# 12.1. Check cluster resoltution.
assay_type <- seurat@active.assay
cluster_res <- paste0(assay_type, "_snn_res.", selected_res)
if (!(cluster_res %in% colnames(seurat@meta.data))){
  stop("Specified resolution is not available.")
}
message("1. Seurat object was loaded.")

# If Seurat objects are not integrated we need to add the velocyto matrices.
if (!(seurat@active.assay == "integrated" || seurat@project.name == "merged")) {
  # 12.2. Get velocity matrices and place them in a list.
  velo_names = c("spliced", "unspliced", "ambiguous")
  vel_matrices = list()
  for (name in velo_names) {
    vel_matrices[[name]] <- Read10X(data.dir = paste0(velocyto_dir, name))
  }

  # 12.3. Load Velocyto matrices as seurat assays.
  for (name in velo_names) {
    vel_matrices[[name]] <- vel_matrices[[name]][, which(x = colnames(vel_matrices[[name]]) %in% colnames(seurat))] 
    seurat[[name]] <- CreateAssayObject(counts = vel_matrices[[name]])
  }
}
message("2. Seurat object was prepared to perform RNA velocity.")

# 12.4. Downsampling (optional).
if (downsampling == TRUE && n_cells < length(rownames(seurat@meta.data))){
  #n_total_cells = length(rownames([email protected]))
  random_sample = sample(x = rownames(seurat@meta.data), size = n_cells, replace = FALSE)
  seurat <- subset(x = seurat, cells = random_sample) 
  message("2.5. Downsampling was performed.")
}

# 12.5. Set specific cluster labels as idents.
Idents(seurat) <- seurat@meta.data[[cluster_res]]

# 12.6. Run velocyto from the wrapper.
seurat <- RunVelocity(object = seurat, deltaT = 1, kCells = 25, fit.quantile = 0.02)
message("3. RNA velocity was calculated.")

# 12.7. Obtain palette.
n_col <- length(levels(seurat))
getPalette <- colorRampPalette(brewer.pal(9,'Set1'))

# 12.8. Set colours to clusters. 
ident.colors <- getPalette(n_col) 
names(ident.colors) <- levels(seurat)
cell.colors <- ident.colors[Idents(seurat)]
names(cell.colors) <- colnames(seurat)

# 12.9. Create the RNA velocity plot.
pdf(paste0(dir.name, "/",folders[8], "/RNA_velocity_plot.pdf"))
par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
show.velocity.on.embedding.cor(emb = Embeddings(object = seurat, reduction = "umap"), vel = Tool(object = seurat, 
                               slot = "RunVelocity"), n = 100, scale = "sqrt", cell.colors = ac(x = cell.colors, alpha = 0.5), 
                               cex = 0.8, arrow.scale = 3, show.grid.flow = TRUE, min.grid.cell.mass = 0.5, grid.n = 40, arrow.lwd = 1, 
                               do.par = FALSE,  cell.border.alpha = 0.1, xlab = "UMAP1", ylab = "UMAP2", 
                               main = paste0("RNA velocity plot - Cluster resolution: ", selected_res))
opar <- par(fig=c(0, 1, 0, 1), oma=c(0, 0, 0, 0), mar=c(0, 0, 0, 0), new=TRUE)
on.exit(par(opar))
plot(0, 0, type='n', bty='n', xaxt='n', yaxt='n')
legend("topright", inset = c(0.05, 0.115), legend=paste0("Cluster - ", levels(seurat)),
       pch=16, col=getPalette(n_col))
dev.off()
message("4. Velocity plot was produced.")


#12.10. This plot the sample coloring the cell by its origin assay (integrated objects).
if (seurat@active.assay == "integrated") {
  Idents(seurat) <- seurat@meta.data[["assay_name"]]
  ident.colors <- getPalette(n_col)
  names(ident.colors) <- levels(seurat)
  cell.colors <- ident.colors[Idents(seurat)]
  names(cell.colors) <- colnames(seurat)

  # 12.11. Create the RNA velocity plot.
  pdf(paste0(dir.name, "/",folders[8], "/RNA_velocity_plot_by_assay.pdf"))
  par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
  show.velocity.on.embedding.cor(emb = Embeddings(object = seurat, reduction = "umap"), vel = Tool(object = seurat,
                                 slot = "RunVelocity"), n = 100, scale = "sqrt", cell.colors = ac(x = cell.colors, alpha = 0.5),
                                 cex = 0.8, arrow.scale = 3, show.grid.flow = TRUE, min.grid.cell.mass = 0.5, grid.n = 40, arrow.lwd = 1,
                                 do.par = FALSE,  cell.border.alpha = 0.1, xlab = "UMAP1", ylab = "UMAP2",
                                 main = paste0("RNA velocity plot - Origin assay"))
  opar <- par(fig=c(0, 1, 0, 1), oma=c(0, 0, 0, 0), mar=c(0, 0, 0, 0), new=TRUE)
  on.exit(par(opar))
  plot(0, 0, type='n', bty='n', xaxt='n', yaxt='n')
  legend("topright", inset = c(0.05, 0.115), legend=paste0("Assay - ", levels(seurat)),
         pch=16, col=getPalette(n_col))
  dev.off()
  message("4.5. Velocity plot from integrated object was produced.")

}

#12.12. This plot the sample coloring the cell by its origin assay (mergedd objects).
if (seurat@project.name == "merged") {
  Idents(seurat) <- "assay_name"
  ident.colors <- getPalette(n_col)
  names(ident.colors) <- levels(seurat)
  cell.colors <- ident.colors[Idents(seurat)]
  names(cell.colors) <- colnames(seurat)

  # 12.13. Create the RNA velocity plot.
  pdf(paste0(dir.name, "/",folders[8], "/RNA_velocity_plot_by_assay.pdf"))
  par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
  show.velocity.on.embedding.cor(emb = Embeddings(object = seurat, reduction = "umap"), vel = Tool(object = seurat,
                                 slot = "RunVelocity"), n = 100, scale = "sqrt", cell.colors = ac(x = cell.colors, alpha = 0.5),
                                 cex = 0.8, arrow.scale = 3, show.grid.flow = TRUE, min.grid.cell.mass = 0.5, grid.n = 40, arrow.lwd = 1,
                                 do.par = FALSE,  cell.border.alpha = 0.1, xlab = "UMAP1", ylab = "UMAP2",
                                 main = paste0("RNA velocity plot - Origin assay"))
  opar <- par(fig=c(0, 1, 0, 1), oma=c(0, 0, 0, 0), mar=c(0, 0, 0, 0), new=TRUE)
  on.exit(par(opar))
  plot(0, 0, type='n', bty='n', xaxt='n', yaxt='n')
  legend("topright", inset = c(0.05, 0.115), legend=paste0("Assay - ", levels(seurat)),
         pch=16, col=getPalette(n_col))
  dev.off()
  message("4.5. Velocity plot from integrated object was produced.")
}

# 12.14. Save seurat object with RNA velocity slots. 
saveRDS(seurat, file = paste0(dir.name, "/",folders[8], "/seurat_velocity.rds"))
message("5. Seurat object was saved.")
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path
from tempfile import TemporaryDirectory

from snakemake.shell import shell

log = snakemake.log_fmt_shell(stdout=False, stderr=True)

def basename_without_ext(file_path):
    """Returns basename of file path, without the file extension."""

    base = path.basename(file_path)

    split_ind = 2 if base.endswith(".gz") else 1
    base = ".".join(base.split(".")[:-split_ind])

    return base


# Run fastqc, since there can be race conditions if multiple jobs 
# use the same fastqc dir, we create a temp dir.
with TemporaryDirectory() as tempdir:
    shell("fastqc {snakemake.params} --quiet "
          "--outdir {tempdir} {snakemake.input[0]}"
          " {log}")

    # Move outputs into proper position.
    output_base = basename_without_ext(snakemake.input[0])
    html_path = path.join(tempdir, output_base + "_fastqc.html")
    zip_path = path.join(tempdir, output_base + "_fastqc.zip")

    if snakemake.output.html != html_path:
        shell("mv {html_path} {snakemake.output.html}")

    if snakemake.output.zip != zip_path:
        shell("mv {zip_path} {snakemake.output.zip}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
import os
from snakemake.shell import shell
import tempfile

__author__ = "Ryan Dale"
__copyright__ = "Copyright 2016, Ryan Dale"
__email__ = "[email protected]"
__license__ = "MIT"

_config = snakemake.params['fastq_screen_config']

subset = snakemake.params.get('subset', 100000)
aligner = snakemake.params.get('aligner', 'bowtie2')
extra = snakemake.params.get('extra', '')
log = snakemake.log_fmt_shell()

# snakemake.params.fastq_screen_config can be either a dict or a string. If
# string, interpret as a filename pointing to the fastq_screen config file.
# Otherwise, create a new tempfile out of the contents of the dict:
if isinstance(_config, dict):
    tmp = tempfile.NamedTemporaryFile(delete=False).name
    with open(tmp, 'w') as fout:
        for label, indexes in _config['database'].items():
            for aligner, index in indexes.items():
                fout.write('\t'.join([
                    'DATABASE', label, index, aligner.upper()]) + '\n')
        for aligner, path in _config['aligner_paths'].items():
            fout.write('\t'.join([aligner.upper(), path]) + '\n')
    config_file = tmp
else:
    config_file = _config

# fastq_screen hard-codes filenames according to this prefix. We will send
# hard-coded output to a temp dir, and then move them later.
prefix = os.path.basename(snakemake.input[0].split('.fastq')[0])
tempdir = tempfile.mkdtemp()

shell(
    "fastq_screen --outdir {tempdir} "
    "--force "
    "--aligner {aligner} "
    "--conf {config_file} "
    "--subset {subset} "
    "--threads {snakemake.threads} "
    "{extra} "
    "{snakemake.input[0]} "
    "{log}"
)

# Move output to the filenames specified by the rule
shell("mv {tempdir}/{prefix}_screen.txt {snakemake.output.txt}")
shell("mv {tempdir}/{prefix}_screen.png {snakemake.output.png}")

# Clean up temp
shell("rm -r {tempdir}")
if isinstance(_config, dict):
    shell("rm {tmp}")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "[email protected]"
__license__ = "MIT"


import os
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

fq1 = snakemake.input.get("fq1")
assert fq1 is not None, "input-> fq1 is a required input parameter"
fq1 = (
    [snakemake.input.fq1]
    if isinstance(snakemake.input.fq1, str)
    else snakemake.input.fq1
)
fq2 = snakemake.input.get("fq2")
if fq2:
    fq2 = (
        [snakemake.input.fq2]
        if isinstance(snakemake.input.fq2, str)
        else snakemake.input.fq2
    )
    assert len(fq1) == len(
        fq2
    ), "input-> equal number of files required for fq1 and fq2"
input_str_fq1 = ",".join(fq1)
input_str_fq2 = ",".join(fq2) if fq2 is not None else ""
input_str = " ".join([input_str_fq1, input_str_fq2])

if fq1[0].endswith(".gz"):
    readcmd = "--readFilesCommand zcat"
else:
    readcmd = ""

outprefix = os.path.dirname(snakemake.output[0]) + "/"

shell(
    "STAR "
    "{extra} "
    "--runThreadN {snakemake.threads} "
    "--genomeDir {snakemake.params.index} "
    "--readFilesIn {input_str} "
    "{readcmd} "
    "--outFileNamePrefix {outprefix} "
    "--outStd Log "
    "{log}"
)
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
__author__ = "Thibault Dayris"
__copyright__ = "Copyright 2019, Dayris Thibault"
__email__ = "[email protected]"
__license__ = "MIT"

from snakemake.shell import shell
from snakemake.utils import makedirs

log = snakemake.log_fmt_shell(stdout=True, stderr=True)

extra = snakemake.params.get("extra", "")
sjdb_overhang = snakemake.params.get("sjdbOverhang", "100")

gtf = snakemake.input.get("gtf")
if gtf is not None:
    gtf = "--sjdbGTFfile " + gtf
    sjdb_overhang = "--sjdbOverhang " + sjdb_overhang
else:
    gtf = sjdb_overhang = ""

makedirs(snakemake.output)

shell(
    "STAR "  # Tool
    "--runMode genomeGenerate "  # Indexation mode
    "{extra} "  # Optional parameters
    "--runThreadN {snakemake.threads} "  # Number of threads
    "--genomeDir {snakemake.output} "  # Path to output
    "--genomeFastaFiles {snakemake.input.fasta} "  # Path to fasta files
    "{sjdb_overhang} "  # Read-len - 1
    "{gtf} "  # Highly recommended GTF
    "{log}"  # Logging
)
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
__author__ = "Julian de Ruiter"
__copyright__ = "Copyright 2017, Julian de Ruiter"
__email__ = "[email protected]"
__license__ = "MIT"


from os import path

from snakemake.shell import shell


input_dirs = set(path.dirname(fp) for fp in snakemake.input)
output_dir = path.dirname(snakemake.output[0])
output_name = path.basename(snakemake.output[0])
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

shell(
    "multiqc"
    " {snakemake.params}"
    " --force"
    " -o {output_dir}"
    " -n {output_name}"
    " {input_dirs}"
    " {log}"
)
ShowHide 37 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/cnio-bu/bollito
Name: bollito
Version: v1.0.6
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: MIT License
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...