Cell cycle analysis of single-cell proteomic and transcriptomic data for the FUCCI cell model -- RNA-Seq analysis pipeline

public public 1yr ago Version: 1.3 0 bookmarks

Single-cell proteogenomic analysis

This repository contains the snakemake pipeline for analyzing the RNA sequencing data for ~1k single cells. The results of this single-cell RNA-Seq analysis provide a transcriptomic context to a proteomic analysis based on immunofluorescence staining of ~200k individual cells. For the code used to perform that single-cell proteogenomic analysis of the human cell cycle, please see the CellProfiling/SingleCellProteogenomics repository.

Single-cell sequencing files

The single-cell RNA-Seq data is available at GEO SRA under project number GSE146773 .

This data is downloaded automatically in this pipeline.

Updating the Ensembl version

The genome and Ensembl versions are located at the top of the file workflow/config/FucciSingleCell.yaml . These can be updated, and the references will be downloaded automatically.

Usage

  1. Clone repository and initialize submodules: git clone --recurse-submodules https://github.com/CellProfiling/FucciSingleCellSeqPipeline.git && cd FucciSingleCellSeqPipeline/workflow

  2. Install conda: https://docs.conda.io/en/latest/miniconda.html

  3. Create and activate setup environment: conda env create -n fuccisetup -f envs/setup.yaml && conda activate fuccisetup

  4. Run the workflow: snakemake --use-conda --conda-frontend mamba --cores 24 --resources mem_mb=100000 , where you can subsitute the max number of cores and max memory allocation. At least 54 GB of free memory should be available.

Usage on cluster

In place of installing conda, you may need to activate it as a module, such as by module load conda and then follow the instructions to initialize it.

Adapt config/cluster_config.yaml for your needs.

In place of the last step above, you can use the scheduler like this: snakemake -j 500 --cores 16 --cluster-config config/cluster_config.yaml --latency-wait 60 --keep-going --use-conda --conda-frontend mamba --cluster "sbatch -t {cluster.time} -N {cluster.nodes} --cpus-per-task {threads} -p {cluster.partition}"

Usage on protected access cluster

  1. Clone repository and initialize submodules on your local machine: git clone --recurse-submodules https://github.com/CellProfiling/FucciSingleCellSeqPipeline.git && cd FucciSingleCellSeqPipeline/workflow

  2. Install conda: https://docs.conda.io/en/latest/miniconda.html

  3. Create and activate setup environment: conda env create -n fuccisetup -f envs/setup.yaml && conda activate fuccisetup

  4. If running the pipeline on protected access computer, predownload files by running snakemake -j 16 ../results/setup.txt on a machine with internet access.

  5. Make a tarball of the project with cd ../.. && tar -cxvf FucciSingleCellSeqPipeline.zip FucciSingleCellSeqPipeline and transfer it to the protected access cluster.

  6. Load conda as a module on the protected access cluster, such as with module load conda , and follow the instructions to activate it.

  7. Create and activate setup environment: conda env create -n fuccisetup -f envs/setup.yaml && conda activate fuccisetup

  8. Adapt config/cluster_config.yaml for your needs.

  9. Use the scheduler from snakemake like this: snakemake -j 500 --cores 16 --cluster-config config/cluster_config.yaml --latency-wait 60 --keep-going --use-conda --conda-frontend mamba --cluster "sbatch -A {cluster.account} -t {cluster.time} -N {cluster.nodes} --cpus-per-task {threads} -p {cluster.partition}"

Citation

Mahdessian, D.*; Cesnik, A. J.*; Gnann, C.; Danielsson, F.; Stenström, L.; Arif, M.; Zhang, C.; Le, T.; Johansson, F.; Shutten, R.; Bäckström, A.; Axelsson, U.; Thul, P.; Cho, N. H.; Carja, O.; Uhlén, M.; Mardinoglu, A.; Stadler, C.; Lindskog, C.; Ayoglu, B.; Leonetti, M. D.; Pontén, F.; Sullivan, D. P.; Lundberg, E. “Spatiotemporal dissection of the cell cycle with single cell proteogenomics.” Nature, 2021, 590, 649–654. *Contributed equally. https://www.nature.com/articles/s41586-021-03232-9

Code Snippets

18
19
20
21
shell:
    "(STAR --runMode genomeGenerate --runThreadN {threads} --genomeDir {params.genomedir}"
    " --genomeFastaFiles {input.efa} {input.gfa} --sjdbGTFfile {input.gtf}"
    " {params.sjoptions} {params.size}) &> {log}"
36
37
38
39
40
shell:
    "(STAR --genomeLoad NoSharedMemory --genomeDir {params.genomedir}"
    " --runThreadN {threads} {params.bam} "
    " --outFileNamePrefix {params.outfolder}/{wildcards.sra}"
    " --readFilesIn <(zcat {input.fastq}) ) &> {log}"
57
58
59
60
shell:
    "(STAR --runMode genomeGenerate --runThreadN {threads} --genomeDir {params.genomedir}"
    " --genomeFastaFiles {input.efa} {input.gfa} --sjdbGTFfile {input.gtf} {params.sjoptions}"
    " --sjdbFileChrStartEnd {input.jj} {params.size}) &> {log}"
84
85
86
87
88
shell:
    "(STAR {params.mode} --genomeDir {params.genomedir}"
    " --runThreadN {threads} {params.bam} {params.transcripts}"
    " --outFileNamePrefix {params.outfolder}/{wildcards.sra}"
    " --readFilesIn <(zcat {input.fastq}) ) &> {log}"
18
19
20
21
22
shell:
    "((wget -O - {params.primary} || wget -O - {params.toplevel}) | gunzip -c - > {output.gfa} && "
    "wget -O - {params.gff} | gunzip -c - > {output.gff3} && "
    "wget -O - {params.pep} | gunzip -c - > {output.pfa} && "
    "wget -O - {params.gtf} | gunzip -c - > {output.gtf}) 2> {log}"
31
shell: "python scripts/fix_gff3_for_rsem.py {input} {output} &> {log}"
40
shell: "python scripts/fix_gff3_for_rsem.py {input} {output} &> {log}"
49
shell: "grep \"^ERCC\\|^#\\|^20\\|^21\\|^22\" {input} > {output}"
58
shell: "grep \"^ERCC\\|^#\\|^20\\|^21\\|^22\" {input} > {output}"
67
shell: "python scripts/filter_fasta.py {input} {output}"
76
77
78
shell:
    "prefetch {wildcards.sra}"
    " --output-directory {params.outdir} &> {log}"
87
88
89
shell:
    "fastq-dump -I --outdir {params.outdir} --split-files {input} && "
    "mv {params.outdir}/{wildcards.sra}_1.fastq {output} &> {log}"
104
105
106
107
108
shell:
    "fastp -q {params.quality}"
    " -i {input} -o {output.fq}"
    " -h {output.html} -j {output.json}"
    " -w {threads} -R {params.title} --detect_adapter_for_pe &> {log}"
8
shell: "grep -v ERCC_ID {input} > {output} 2> {log}"
25
26
27
28
shell:
    "(rsem-prepare-reference --num-threads {threads} --gtf {input.gtf} "
    " --star" # TODO: figure out how to use transcript STAR output
    " \"{input.efa}\",\"{input.gfa}\" {params.prefix}) &> {log}"
51
52
53
54
55
56
shell:
    "(rsem-calculate-expression --num-threads {threads}"
    " --star"
    " {params.options} <(zcat {input.fq})" #{input.bam}"
    " {params.prefix}"
    " {params.quant}/{wildcards.sra}) &> {log}"
77
78
79
shell:
    "python scripts/make_rsem_dataframe.py genes {input.gtf} {input.srr_lookup} {input.series_matrix}"
    " {output.counts} {output.tpms} {output.names} {output.ids} &> {log}"
100
101
102
shell:
    "python scripts/make_rsem_dataframe.py isoforms {input.gtf} {input.srr_lookup} {input.series_matrix}"
    " {output.counts} {output.tpms} {output.names} {output.ids} &> {log}"
SnakeMake From line 100 of rules/quant.smk
10
11
shell:
    "(gdown -O {output}.zip"
42
43
44
shell:
    "(mkdir -p {output} && cp -r {params.rnadir}/* {output} && "
    "cp {input.quant} {input.ids} {input.velocity} {params.rnadir}) &> {log}"
52
shell: "cd ../results && python ../SingleCellProteogenomics/1_ProteinCellCycleClusters.py &> {log}"
60
shell: "cd ../results && python ../SingleCellProteogenomics/2_ProteinFucciPsuedotime.py --quicker &> {log}"
68
shell: "cd ../results && python ../SingleCellProteogenomics/3_RNAFucciPseudotime.py --quicker &> {log}"
76
shell: "cd ../results && python ../SingleCellProteogenomics/4_TemporalDelay.py &> {log}"
84
shell: "cd ../results && python ../SingleCellProteogenomics/5_ProteinProperties.py &> {log}"
97
98
99
shell:
    "(cp {input.protein} {output.protein} && "
    "cp {input.rna} {output.rna}) &> {log}"
15
16
17
shell:
    "velocyto run-smartseq2 --outputfolder {params.outfolder}"
    " --sampleid {params.sampleid} {input.bams} {input.gtf} &> {log}"
SnakeMake From line 15 of rules/velo.smk
28
shell: "(" + "\n".join(["{params.commandpart1} " + sra + " {params.commandpart2} >> {output}" for sra in config['sra']]) + ") 2> {log}"
40
shell: "python scripts/make_a_obs.py {input.loom} {input.srr_lookup} {input.series_matrix} {output} 2> {log}"
SnakeMake From line 40 of rules/velo.smk
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
import sys, io
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
fasta=sys.argv[1]
output=sys.argv[2]
fasta_sequences = SeqIO.parse(open(fasta),'fasta')

with open(output,"w") as out:
    for seq_record in fasta_sequences:
        if seq_record.id.startswith("20") or seq_record.id.startswith("21") or seq_record.id.startswith("22"):
            out.write(seq_record.format("fasta"))
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
import sys
import io
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

isgtf = sys.argv[2].endswith("gtf")
with open(sys.argv[2], "w") as outff:
    with open(sys.argv[1]) as ff:
        print(f"Reading {sys.argv[1]}")
        for line in ff:
            lins = line.split("\t")
            if line.startswith("#"): continue
            elif lins[8].startswith("ID=gene:"): lins[2] = "gene"
            outff.write("\t".join(lins))

    ercc = SeqIO.parse(open("../resources/ERCC.fa"),'fasta')
    for seq_record in ercc:
        if len(seq_record.seq) > 0:
            id = seq_record.id.split()[0]
            gene_fields = [str(x) for x in [id, "ERCC", "gene", 1, len(seq_record.seq), ".", "+", ".",
                f"ID=gene:{id};Name={id};biotype=;description={''.join(seq_record.id.split()[1:])};gene_id={id}"]]
            transcript_fields = [str(x) for x in [id, "ERCC", "transcript", 1, len(seq_record.seq), ".", "+", ".",
                f"ID=transcript:{id};Parent=gene:{id};Name={id};biotype=;transcript_id={id}"]]
            exon_fields = [str(x) for x in [id, "ERCC", "exon", 1, len(seq_record.seq), ".", "+", ".",
                f"Parent=transcript:{id};Name={id};exon_id={id}"]]
            transcript_fields_gtf = [str(x) for x in [id, "ERCC", "transcript", 1, len(seq_record.seq), ".", "+", ".",
                f"gene_id \"gene:{id}\"; transcript_id \"transcript:{id}\"; gene_name \"{id}\"; transcript_name \"{id}\"; gene_biotype \"\"; transcript_biotype \"\";"]]
            exon_fields_gtf = [str(x) for x in [id, "ERCC", "exon", 1, len(seq_record.seq), ".", "+", ".",
                f"gene_id \"gene:{id}\"; transcript_id \"transcript:{id}\"; exon_number \"1\"; gene_name \"{id}\"; transcript_name \"{id}\"; gene_biotype \"\"; transcript_biotype \"\";"]]
            if isgtf:
                outff.write("\t".join(transcript_fields_gtf) + "\n")
                outff.write("\t".join(exon_fields_gtf) + "\n")
            else:
                outff.write("\t".join(gene_fields) + "\n")
                outff.write("\t".join(transcript_fields) + "\n")
                outff.write("\t".join(exon_fields) + "\n")
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
import scvelo as scv
import sys, io
import numpy as np
from srr_functions import create_srr_to_cell

aloomfile = sys.argv[1]
srr_lookup = sys.argv[2]
series_matrix = sys.argv[3]
output = sys.argv[4]

aloom = scv.read_loom(aloomfile)
aobs = aloom.obs_names

srr_to_cell = create_srr_to_cell(series_matrix, srr_lookup)

# Use aobs to look up cell name from cells using the SRR link somehow
acells = [srr_to_cell[x.split(':')[1].split('Aligned')[0]] for x in aobs]
with open(output, 'w') as output:
    output.write('well_plate\n')
    output.write('\n'.join(acells))
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
import sys
import numpy as np
import pandas as pd
import os
import glob
from srr_functions import create_srr_to_cell

USAGE = "python make_rsem_dataframe.py <level> <gtf> <srrLookup> <seriesMatrix> <outCountsFile> <outTpmsFile> <namesOut> <idsOut>"
if len(sys.argv) != 9: print(USAGE); exit();
level, gtf, srr_lookup, series_matrix, outcounts, outtpms, outn, outids = sys.argv[1:]

print(f"getting gene names from {gtf}")
with open(gtf) as gtfhandle:
    line_ct = sum([1 for line in gtfhandle])
gene_id_to_name, gene_id_to_biotype = {}, {}
transcript_id_to_name, transcript_id_to_biotype, transcript_id_to_gene = {}, {}, {}
with open(gtf) as gtfhandle:
    for i, line in enumerate(gtfhandle):
        if i % 50000 == 0: print(f"Processed {i} lines out of {line_ct} from {gtf}.")
        if not line.startswith('#'):
            line = line.strip().rstrip(';').split('\t')
            if len(line) < 9 or line[2] != "transcript": continue
            attribute_list = line[8].replace('; ', ';').split(';')
            attributes = {}
            for item in attribute_list:
                item = [x.strip('"') for x in item.replace(" \"",";").split(';')]
                if len(item) < 2:
                    print(f"Error: not a key-value pair: {attributes} {item} in line:\n{line}")
                attributes[item[0]] = item[1]
            if not "gene_name" in attributes:
                attributes["gene_name"] = attributes["gene_id"]
            if not "transcript_name" in attributes:
                attributes["transcript_name"] = attributes["transcript_id"]
            if any([x not in attributes for x in ["gene_id", "gene_name", "transcript_id", "transcript_name"]]):
                print("Error: \"gene_id\", \"gene_name\", \"transcript_id\", or \"transcript_name\" not found in line:\n" + "\t".join(line))
                exit(1)
            gene_id_to_name[attributes["gene_id"].strip("gene:")] = attributes["gene_name"]
            gene_id_to_biotype[attributes["gene_id"].strip("gene:")] = attributes["gene_biotype"] if "gene_biotype" in attributes else ""
            transcript_id_to_name[attributes["transcript_id"].strip("transcript:")] = attributes["transcript_name"]
            transcript_id_to_biotype[attributes["transcript_id"].strip("transcript:")] = attributes["transcript_biotype"] if "transcript_biotype" in attributes else ""
            transcript_id_to_gene[attributes["transcript_id"].strip("transcript:")] = attributes["gene_id"]

print(f"globbing ../results/quant/*.{level}.results")
files = glob.glob(f"../results/quant/*.{level}.results")
line_ct = sum(1 for line in open(files[0]))

print("Making SRR-to-cell lookup...")
srr_to_cell = create_srr_to_cell(series_matrix, srr_lookup)

def get_prefix(file):
    srr = os.path.basename(file).split(".")[0].split("_")[0]
    return srr_to_cell[srr]

doUseGene = level.startswith("gene")
ids = [line.split('\t')[0].strip("gene:").strip("transcript:") for line in open(files[0])]
ids[0] = "gene_id" if doUseGene else "transcript_id"
names = [
    (gene_id_to_name[id] if id in gene_id_to_name else id) if doUseGene else \
    (transcript_id_to_name[id] if id in transcript_id_to_name else id) \
    for id in ids[1:]]
biotypes = [
    (gene_id_to_biotype[id] if id in gene_id_to_biotype else "") if doUseGene else \
    (transcript_id_to_biotype[id] if id in transcript_id_to_biotype else "") \
    for id in ids[1:]]

def save_output(outfilename, id_array, in_files, col_idx, doUseGene):
    print(f"Reading data files to create {outfilename} ...")

    with open(outfilename, "w") as ensg_output_file, open(outfilename + ".ercc.csv", "w") as ercc_output_file:
        ensg_output_file.write(",".join([""] + id_array[1:]) + "\n")
        ercc_output_file.write(",".join([""] + id_array[1:]) + "\n")
        prefixes = [get_prefix(file) for file in in_files]
        file_order = np.argsort(prefixes)
        for file in np.array(in_files)[file_order]:
            prefix = get_prefix(file)
            print(f"reading {prefix}")

            df = pd.read_csv(file, sep='\t', header=None, usecols=[0, col_idx], names=['id', 'value'])
            df['id'] = df['id'].str.strip('gene:').str.strip('transcript:')
            df = df.set_index('id').transpose()
            df.index = [prefix]

            ensg_df = df.filter(regex="ENSG" if doUseGene else "ENST")
            ercc_df = df.filter(regex="ERCC")

            ensg_output_file.write(ensg_df.to_csv(header=False, index=True))
            ercc_output_file.write(ercc_df.to_csv(header=False, index=True))

    print(f"Data saved to {outfilename} and {outfilename}.ercc.csv")

for (filename, col_idx) in [(outcounts, 4), (outtpms, 5)]:
    save_output(filename, ids, files, col_idx, doUseGene)

print(f"Saving to {outn} ...")
np.savetxt(outn, np.column_stack((ids[1:], names, biotypes)), delimiter=",", fmt="%s")

print(f"Saving to {outids} ...")
np.savetxt(outids, 
    np.column_stack((list(transcript_id_to_gene.keys()), 
        list(transcript_id_to_gene.values()))), delimiter=",", fmt="%s")

print("Done.")
42
shell: "touch {output} 2> {log}"
ShowHide 28 more snippets with no or duplicated tags.

Login to post a comment if you would like to share your experience with this workflow.

Do you know this workflow well? If so, you can request seller status , and start supporting this workflow.

Free

Created: 1yr ago
Updated: 1yr ago
Maitainers: public
URL: https://github.com/CellProfiling/FucciSingleCellSeqPipeline
Name: fuccisinglecellseqpipeline
Version: 1.3
Badge:
workflow icon

Insert copied code into your website to add a link to this workflow.

Downloaded: 0
Copyright: Public Domain
License: None
  • Future updates

Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...