Knowledge Graph Construction with ngest


Installation

  1. Install conda ( https://docs.conda.io/en/latest/miniconda.html )

  2. Clone the repo and create the conda environment:

git clone https://github.com/hmartiniano/ngest.git
cd ngest
conda env create -n ngest -f env.yml
conda activate ngest
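
To confirm the environment is ready, you can check that Snakemake is available inside it (this assumes env.yml pins Snakemake, since the workflow is driven by it):

snakemake --version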

Usage

Building a KG with all the databases requires 64 GB of RAM and around 10 GB of disk space.

In the root dir of the repo, run:

make

This runs the Snakemake workflow.
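
The Makefile is a thin wrapper around Snakemake, so you can also invoke the workflow directly from the same directory; for example (the dry-run flag and core count below are illustrative choices, not project defaults):

# preview the jobs without running anything
snakemake -n -j 1
# run the full workflow on 8 cores
snakemake -j 8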

Setup Neo4j

  1. Install docker with the docker-compose plugin:

https://docs.docker.com/compose/install/

  2. Copy the example env file to neo4j/env:

cd neo4j
cp env.example env

  3. Replace the username and password in the env file.

  4. Start neo4j:

docker compose up -d

  5. Run the conversion script and copy the results to the import directory:

python ../scripts/tsv_to_neo4j ../data/finals/merged_nodes.tsv ../data/finals/merged_edges.tsv
cp nodes.csv.gz edges.csv.gz import

  6. Enter the container:

docker compose exec neo4j bash

  7. Import the data. Inside the container, run:

./bin/neo4j-admin database import full --nodes /import/nodes.csv.gz --relationships /import/edges.csv.gz --overwrite-destination
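
To sanity-check the import, you can count the loaded nodes from inside the container with cypher-shell, which ships with the official Neo4j image (the credentials are the ones you set in the env file, which the official image reads via NEO4J_AUTH; the query is just an illustrative smoke test):

./bin/cypher-shell -u <username> -p <password> "MATCH (n) RETURN count(n);"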

Code Snippets

From line 5 of rules/bgee.smk:
shell: "curl -L {BGEE} -o {output}"

From line 10 of rules/bgee.smk:
shell: "python scripts/bgee_to_kgx.py -i {input} -o {output}"

From line 7 of rules/cl.smk:
shell: "curl -L {CL} -o {output}"

From line 13 of rules/cl.smk:
shell: "kgx transform -i obojson -o ../data/processed/intermediary/cl -f tsv {input}"

From line 17 of rules/cl.smk:
shell: "curl -L {PRMAPPING} -o {output}"

From line 22 of rules/cl.smk:
shell: "python scripts/cl_kgx_process.py -i {input.nodes} {input.edges} -m {input.mapping} -o {output}"

From line 7:
shell: "curl -L {DISGENET} -o {output}"

From line 11:
shell: "curl -L {MAPPING} -o {output}"

From line 15:
shell: "curl -L {DISGENET_VERSION} -o {output}"

From line 20:
shell: "python scripts/disgenet_to_kgx.py -i {input} -v {input.version} -o {output}"

From line 7:
shell: "curl -L {ENSEMBLPROTEINS} -o {output}"

From line 11:
shell: "curl -L {ENSEMBLGENES} -o {output}"

From line 15:
shell: "curl -L {ENSEMBLENTREZ} -o {output}"

From line 21:
shell: "python scripts/ensembl_to_entrez.py -i {input} -o {output}"

From line 27:
shell: "zcat {input} | awk -F \"\t\" '$3 == \"gene\" {{ print $9 }}' | awk -F \"; \" 'BEGIN {{OFS=\"\t\"}} {{ print > \"{output}\" }}'"

From line 33:
shell: "python scripts/ensembl_to_kgx.py -i {input.ensembl} -u {input.uniprot} -g {input.genes} -o {output}"

From line 9 of rules/goa.smk:
shell: "curl -L {GOAP} -o {output}"

From line 13 of rules/goa.smk:
shell: "curl -L {GOAC} -o {output}"

From line 17 of rules/goa.smk:
shell: "curl -L {GOAR} -o {output}"

From line 21 of rules/goa.smk:
shell: "curl -L {GOAI} -o {output}"

From line 25 of rules/goa.smk:
shell: "curl -L {GOAVERSION} -o {output}"

From line 30 of rules/goa.smk:
shell: "python scripts/goa_to_kgx.py -i {input.rna} {input.protein} {input.complex} {input.isoform} -r {input.ro} -g {input.go} -c {input.cfg} -v {input.version} -o {output}"

From line 6 of rules/go.smk:
shell: "curl {GO} -o {output}"

From line 10 of rules/go.smk:
shell: "curl -L {GOVERSION} -o {output}"

From line 15 of rules/go.smk:
shell: "kgx transform -i obojson -o ../data/processed/intermediary/go -f tsv {input}"

From line 20 of rules/go.smk:
shell: "python scripts/go_kgx_process.py -i {input} -v {input.version} -o {output}"

From line 5:
shell: "curl -L {HPOA} -o {output}"

From line 10 of rules/hpoa.smk:
shell: "python scripts/hpoa_to_kgx.py -i {input.hpoa} -m {input.mondo_map} -n {input.hpo} -o {output}"

From line 5 of rules/hpo.smk:
shell: "curl -L {HPO} -o {output}"

From line 10 of rules/hpo.smk:
shell: "kgx transform -i obojson -o ../data/processed/intermediary/hpo -f tsv {input}"

From line 15 of rules/hpo.smk:
shell: "python scripts/hpo_kgx_process.py -i {input} -o {output}"

From line 5:
shell: "curl -L {MIRTARBASE} -o {output}"

From line 11:
shell: "python scripts/mirtarbase_to_csv.py -i {input} -o {output}"

From line 17:
shell: "python scripts/mirtarbase_to_kgx.py -i {input.mirtarbase} -r {input.rnamapping} -g {input.genemapping} -o {output}"

From line 5:
shell: "curl -L {MONDO} -o {output}"

From line 10:
shell: "python scripts/mondo_mapping.py -i {input} -o {output}"

From line 15:
shell: "kgx transform -i obojson -o ../data/processed/intermediary/mondo -f tsv {input}"

From line 20:
shell: "python scripts/mondo_kgx_process.py -i {input} -o {output}"

From line 5:
shell: "curl -L {NPINTER} -o {output}"

From line 10:
shell: "zcat {input} | awk -F \"\t\" 'BEGIN {{OFS=\"\t\"}} {{ if ($1 == \"interID\" || $11 == \"Homo sapiens\") print $0}}' > {output}"

From line 15:
shell: "python scripts/npinter_to_kgx.py -i {input.npinter} -r {input.noncoddingmapping} {input.tarbasemapping} {input.ensemblmapping} -p {input.proteinmapping} -g {input.genemapping} -o {output}"

From line 15:
shell: "curl -L {RNACENTRALENSEMBLMAPPING} -o {output}"

From line 20:
shell: "curl -L {RNACENTRALTARBASEMAPPING} -o {output}"

From line 24:
shell: "curl -L {RNANONCODINGMAPPING} -o {output}"

From line 28:
shell: "curl -L {RNACENTRAL} -o {output}"

From line 32:
shell: "curl -L {RNAVERSION} -o {output}"

From line 37:
shell: "awk 'BEGIN {{OFS=\"\t\"}} {{ if ($4 == 9606) print $0}}' {input} > {output}"

From line 42:
shell: "awk 'BEGIN {{OFS=\"\t\"}} {{ if ($4 == 9606) print $0}}' {input} > {output}"

From line 47:
shell: "awk 'BEGIN {{OFS=\"\t\"}} {{ if ($4 == 9606) print $0}}' {input} > {output}"

From line 52:
shell: "zcat {input} | awk -F \"\t\" 'BEGIN {{OFS=\"\t\"}} {{ if ($1!~/^!/ && $7 == \"taxon:9606\") print $1,$2,$4,$6}}' > {output}"

From line 58:
shell: "python scripts/rnacentral_to_kgx.py -i {input.rnacentral} -m {input.mapping} -g {input.genes} -v {input.version} -o {output}"

From line 5 of rules/ro.smk:
shell: "curl -L {RO} -o {output}"

From line 4:
shell: "curl -L {STRING} -o {output}"

From line 9:
shell: "python scripts/stringdb_to_kgx.py -i {input.string} -p {input.proteinmapping} -o {output}"

From line 5:
shell: "curl -L {UBERON} -o {output}"

From line 11:
shell: "kgx transform -i obojson -o ../data/processed/intermediary/uberon -f tsv {input}"

From line 16:
shell: "python scripts/uberon_kgx_process.py -i {input} -o {output}"

From line 5:
shell: "curl -L {UNIPROT} -o {output}"

scripts/bgee_to_kgx.py:

import argparse
import pandas as pd
import uuid


def read_files(fname):
    df = pd.read_csv(fname, sep="\t", low_memory=False)
    return df


def get_parser():
    parser = argparse.ArgumentParser(
        prog="bgee_to_kgx.py",
        description="bgee_to_csv: convert an bgee file to CSVs with nodes and edges.",
    )
    parser.add_argument("-i", "--input", help="Input files")
    parser.add_argument(
        "-o", "--output", nargs="+", default="bgee", help="Output prefix. Default: out"
    )
    return parser


def main():
    parser = get_parser()
    args = parser.parse_args()
    bgee = read_files(args.input)

    # bgee = bgee[bgee["Expression"].isin(["present", "absent"])]

    bgee = bgee[bgee["Expression"].isin(["present"])]
    bgee["object"] = bgee["Anatomical entity ID"]
    bgee["subject"] = "ENSEMBL:" + bgee["Gene ID"]
    bgee["provided_by"] = "BGEE"
    bgee = bgee[~bgee["object"].str.contains("∩", na=False)]

    bgee["source"] = "BGEE"
    url = args.input.split("/")[-1]
    bgee["source version"] = url.split("_")[1] + "_" + url.split("_")[2]

    gene_to_ae = bgee
    gene_to_ae["category"] = "biolink:GeneToExpressionSiteAssociation"
    gene_to_ae["predicate"] = "biolink:expressed_in"
    gene_to_ae["relation"] = "RO:0002206"
    gene_to_ae["knowledge_source"] = "BGEE"

    #  to include negated field for absent relations
    #   gene_to_ae["negated"] = gene_to_ae.Expression.str.startswith("absent")

    gene_to_ae = gene_to_ae[
        ["subject", "predicate", "object", "category", "relation", "knowledge_source", "source", "source version"]
    ].drop_duplicates()
    gene_to_ae["id"] = gene_to_ae["subject"].apply(lambda x: uuid.uuid4())
    gene_to_ae.to_csv(f"{args.output[1]}", sep="\t", index=False)

    ae = bgee[["object", "Anatomical entity name", "provided_by", "source", "source version"]]
    ae["id"] = ae["object"]
    ae["name"] = ae["Anatomical entity name"]
    ae.loc[ae["id"].str.contains("UBERON"), "category"] = "biolink:AnatomicalEntity"
    ae.loc[ae["id"].str.contains("CL"), "category"] = "biolink:Cell"

    genes = bgee[["subject", "provided_by", "Gene name", "source", "source version"]]
    genes["id"] = genes["subject"]
    genes["category"] = "biolink:Gene"
    genes["name"] = genes["Gene name"]

    nodes = pd.concat(
        [
            genes[["id", "category", "name", "provided_by", "source", "source version"]],
            ae[["id", "category", "name", "provided_by", "source", "source version"]],
        ]
    ).drop_duplicates()

    nodes[["id", "name", "category", "provided_by","source", "source version"]].to_csv(
        f"{args.output[0]}", sep="\t", index=False
    )


if __name__ == "__main__":
    main()
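
All the *_to_kgx.py and *_kgx_process.py scripts below emit KGX-style node and edge TSVs. For illustration, the nodes file written by bgee_to_kgx.py has the columns selected at the end of main(); a sketch with made-up rows (the IDs and version string are hypothetical examples, not real output):

id	name	category	provided_by	source	source version
ENSEMBL:ENSG00000139618	BRCA2	biolink:Gene	BGEE	BGEE	15_0
UBERON:0002107	liver	biolink:AnatomicalEntity	BGEE	BGEE	15_0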

scripts/cl_kgx_process.py:

import argparse
import pandas as pd
import requests

release = "https://api.github.com/repos/obophenotype/cell-ontology/releases/latest"

def get_parser():
    parser = argparse.ArgumentParser(
        prog="cl_kgx_process.py",
        description="cl_kgx_process: convert protein ids from cl kgx files.",
    )
    parser.add_argument("-i", "--input", nargs="+", help="Input files")
    parser.add_argument("-m", "--mapping", help="Input files")
    parser.add_argument(
        "-o", "--output", nargs="+", default="cl", help="Output prefix. Default: out"
    )
    return parser


def main():
    parser = get_parser()
    args = parser.parse_args()
    clnodes = pd.read_csv(args.input[0], sep="\t", low_memory=False)
    cledges = pd.read_csv(args.input[1], sep="\t", low_memory=False)
    clmapping = pd.read_csv(args.mapping, sep="\t", header=None, low_memory=False)

    response = requests.get(
        release
    )
    version = response.json()["name"]

    clnodes["source"] = "CL"
    clnodes["source version"] = version

    cledges["source"] = "CL"
    cledges["source version"] = version

    clmapping.columns = ["ID", "xref", "Relation"]
    clmapping = (
        clmapping[clmapping["xref"].str.contains("UniProt")][["ID", "xref"]]
        .drop_duplicates()
        .set_index("ID")
    )

    clmapping = clmapping[~clmapping.index.duplicated(keep="first")].iloc[:, 0]

    # Transform nodes
    clnodes["Uniprot ID"] = (
        "UNIPROTKB:" + clnodes["id"].map(clmapping).str.split(":").str[-1]
    )
    cledges["Object Uniprot ID"] = (
        "UNIPROTKB:" + cledges["object"].map(clmapping).str.split(":").str[-1]
    )
    cledges["Subject Uniprot ID"] = (
        "UNIPROTKB:" + cledges["subject"].map(clmapping).str.split(":").str[-1]
    )

    clnodes["id"] = clnodes[["Uniprot ID", "id"]].bfill(axis=1).iloc[:, 0]
    cledges["object"] = (
        cledges[["Object Uniprot ID", "object"]].bfill(axis=1).iloc[:, 0]
    )
    cledges["subject"] = (
        cledges[["Subject Uniprot ID", "subject"]].bfill(axis=1).iloc[:, 0]
    )

    clnodes = clnodes[~clnodes.id.str.startswith("PR")]
    clnodes[["id", "category", "name", "provided_by", "source", "source version"]].drop_duplicates().to_csv(
        f"{args.output[0]}", sep="\t", index=False
    )
    cledges = cledges[~cledges.subject.str.startswith("PR")]
    cledges = cledges[~cledges.object.str.startswith("PR")]

    cledges[
        ["id", "subject", "predicate", "object", "relation", "knowledge_source", "source", "source version"]
    ].to_csv(f"{args.output[1]}", sep="\t", index=False)


if __name__ == "__main__":
    main()

scripts/disgenet_to_kgx.py:

import argparse
import pandas as pd
import uuid


def read_files(fname):
    df = pd.read_csv(fname, sep="\t", low_memory=False)
    return df

def get_version(fname):
    with open(fname) as f:
        for line in f:
            if "version" in line:
                version = line.split("version ")[1].split(").")[0]
    return version

def get_parser():
    parser = argparse.ArgumentParser(
        prog="disgenet_to_kgx.py",
        description=(
            "disgenet_to_csv: convert a disgenet file to CSVs with nodes and edges."
        ),
    )
    parser.add_argument("-i", "--input", nargs="+", help="Input files")
    parser.add_argument("-v", "--version", help="Input version file")
    parser.add_argument(
        "-o",
        "--output",
        nargs="+",
        default="disgenet",
        help="Output prefix. Default: out",
    )
    return parser


def main():
    parser = get_parser()
    args = parser.parse_args()
    disgenet = read_files(args.input[0])
    disgenet_mapping = read_files(args.input[1])
    entrez_to_ensembl = (
        read_files(args.input[2]).drop_duplicates().set_index("Entrez Gene ID")
    )

    entrez_to_ensembl = entrez_to_ensembl[
        ~entrez_to_ensembl.index.duplicated(keep="first")
    ].iloc[:, 0]

    # Transform nodes
    disgenet_mapping = disgenet_mapping[
        disgenet_mapping["vocabulary"].isin(["HPO", "MONDO"])
    ]

    disgenet_mapping["code"] = (
        disgenet_mapping["vocabulary"] + ":" + disgenet_mapping["code"]
    ).str.replace("HPO:", "")
    disgenet_mapping = (
        disgenet_mapping[["diseaseId", "code"]].drop_duplicates().set_index("diseaseId")
    )
    disgenet_mapping = disgenet_mapping[
        ~disgenet_mapping.index.duplicated(keep="first")
    ].iloc[:, 0]

    disgenet["geneId"] = disgenet["geneId"].map(str)

    disgenet["object"] = disgenet["diseaseId"].map(disgenet_mapping)
    disgenet["subject"] = "ENSEMBL:" + disgenet["geneId"].map(entrez_to_ensembl)
    disgenet["provided_by"] = "Disgenet"
    disgenet["source"] = "Disgenet"
    disgenet["source version"] = get_version(args.version)

    disgenet = disgenet.dropna(subset=["object", "subject"])

    gene_to_phenotype = disgenet[disgenet.object.str.startswith("HP")]
    gene_to_phenotype["category"] = "biolink:GeneToPhenotypicFeatureAssociation"
    gene_to_phenotype["predicate"] = "biolink:associated_with"
    gene_to_phenotype["relation"] = "RO:0016001"
    gene_to_phenotype["knowledge_source"] = "Disgenet"

    gene_to_phenotype = gene_to_phenotype[
        [
            "subject",
            "predicate",
            "object",
            "category",
            "relation",
            "knowledge_source",
            "provided_by",
            "diseaseName",
            "source",
            "source version"
        ]
    ].drop_duplicates()
    gene_to_phenotype["id"] = gene_to_phenotype["subject"].apply(lambda x: uuid.uuid4())

    gene_to_disease = disgenet[disgenet.object.str.startswith("MONDO")]
    gene_to_disease["category"] = "biolink:GeneToDiseaseAssociation"
    gene_to_disease["predicate"] = "biolink:associated_with"
    gene_to_disease["relation"] = "RO:0016001"
    gene_to_disease["knowledge_source"] = "Disgenet"
    gene_to_disease = gene_to_disease[
        [
            "subject",
            "predicate",
            "object",
            "category",
            "relation",
            "knowledge_source",
            "provided_by",
            "diseaseName",
            "source",
            "source version"
        ]
    ].drop_duplicates()
    gene_to_disease["id"] = gene_to_disease["subject"].apply(lambda x: uuid.uuid4())

    edges = pd.concat([gene_to_phenotype, gene_to_disease])
    edges[
        [
            "id",
            "subject",
            "predicate",
            "object",
            "category",
            "relation",
            "knowledge_source",
            "source",
            "source version"
        ]
    ].drop_duplicates().to_csv(f"{args.output[1]}", sep="\t", index=False)

    phenotypes = gene_to_phenotype
    phenotypes["id"] = gene_to_phenotype["object"]
    phenotypes["category"] = "biolink:PhenotypicFeature"
    phenotypes["name"] = gene_to_phenotype["diseaseName"]
    phenotypes = phenotypes[["id", "category", "name", "provided_by", "source", "source version"]]

    diseases = gene_to_disease
    diseases["id"] = diseases["object"]
    diseases["category"] = "biolink:Disease"
    diseases["name"] = gene_to_disease["diseaseName"]
    diseases = diseases[["id", "category", "name", "provided_by", "source", "source version"]]

    nodes = disgenet
    nodes["id"] = disgenet["subject"]
    nodes["category"] = "biolink:Gene"
    nodes["name"] = disgenet["geneSymbol"]
    nodes = nodes[["id", "category", "name", "provided_by", "source", "source version"]]

    nodes = pd.concat([nodes, phenotypes, diseases]).drop_duplicates()

    nodes[["id", "name", "category", "provided_by", "source", "source version"]].to_csv(
        f"{args.output[0]}", sep="\t", index=False
    )


if __name__ == "__main__":
    main()

scripts/ensembl_to_entrez.py:

import argparse
import pandas as pd

ENSEMBL_COLUMNS = ["Ensembl ID", "Entrez Gene ID"]


def read_file(fname):
    df = pd.read_csv(fname, sep="\t", low_memory=False)
    df = df[["gene_stable_id", "xref"]].drop_duplicates()
    df.columns = ENSEMBL_COLUMNS
    return df


def get_parser():
    parser = argparse.ArgumentParser(
        prog="ensembl_to_entrez.py",
        description="ensembl_to_entrez: download ensembl and entrez ids to csv file",
    )
    parser.add_argument("-i", "--input", help="Input files")
    parser.add_argument(
        "-o", "--output", default="ensembl to entrez", help="Output ensembl data."
    )
    return parser


def main():
    parser = get_parser()
    args = parser.parse_args()
    ensemblf = read_file(args.input)
    ensemblf[["Entrez Gene ID", "Ensembl ID"]].to_csv(
        f"{args.output}", sep="\t", index=False
    )


if __name__ == "__main__":
    main()

scripts/ensembl_to_kgx.py:

import uuid
import argparse
import pandas as pd

GENES = ["Gene Id", "Gene Version", "Gene Name"]


def read_id_mapping_uniprot(fname):
    df = pd.read_csv(fname, sep="\t", header=None, low_memory=False)
    df.columns = ["ID", "Database", "Database ID"]
    df = df[df["Database"] == "UniProtKB-ID"]
    df["Database ID"] = df["Database ID"].str.split("_").str[0]
    df = df[["ID", "Database ID"]].drop_duplicates().set_index("ID")
    df = df[~df.index.duplicated(keep="first")].iloc[:, 0]
    return df


def read_genes(fname):
    df = pd.read_csv(fname, sep=";", low_memory=False, header=None)
    df = df.iloc[:, :3]
    df.columns = GENES
    df = df[df["Gene Name"].str.contains("gene_name")]
    df["Gene Id"] = "ENSEMBL:" + df["Gene Id"].str.split(" ").str[-1].str.replace(
        '"', ""
    )
    df["Gene Name"] = df["Gene Name"].str.split(" ").str[-1].str.replace('"', "")
    df = df[["Gene Id", "Gene Name"]].drop_duplicates().set_index("Gene Id")
    df = df[~df.index.duplicated(keep="first")].iloc[:, 0]
    return df


def get_parser():
    parser = argparse.ArgumentParser(
        prog="ensembl_to_kgx.py",
        description=(
            "ensembl_to_csv: convert an ensembl file to CSVs with nodes and edges."
        ),
    )
    parser.add_argument("-i", "--input", help="Input files")
    parser.add_argument("-u", "--uniprot", help="Input files")
    parser.add_argument("-g", "--genes", help="Input files")
    parser.add_argument(
        "-o",
        "--output",
        nargs="+",
        default="ensembl",
        help="Output prefix. Default: out",
    )
    return parser


def main():
    parser = get_parser()
    args = parser.parse_args()
    uniprotf = read_id_mapping_uniprot(args.uniprot)
    ensemblf = pd.read_csv(args.input, sep="\t", comment="!", low_memory=False)
    genesf = read_genes(args.genes)

    ensemblf["protein name"] = ensemblf["xref"].map(uniprotf)
    ensemblf["provided_by"] = "ENSEMBL"
    ensemblf["knowledge_source"] = "ENSEMBL"
    ensemblf["xref"] = ensemblf["xref"].str.split("-").str[0]
    ensemblf["protein name"] = ensemblf["xref"].map(uniprotf)
    ensemblf["source"] = "ENSEMBL"
    version = args.input.split(".")
    ensemblf["source version"] = version[3] + " " + version[4]

    gene_to_protein = ensemblf.dropna(subset=["xref"])
    gene_to_protein["subject"] = "ENSEMBL:" + gene_to_protein["gene_stable_id"]
    gene_to_protein["object"] = "UNIPROTKB:" + gene_to_protein["xref"]
    gene_to_protein["predicate"] = "biolink:has_gene_product"
    gene_to_protein["relation"] = "RO:0002205"
    gene_to_protein = gene_to_protein[
        ["subject", "predicate", "object", "relation", "knowledge_source", "source", "source version"]
    ].drop_duplicates()
    gene_to_protein["id"] = gene_to_protein["subject"].apply(lambda x: uuid.uuid4())

    protein = ensemblf.dropna(subset=["xref"])

    protein["id"] = "UNIPROTKB:" + protein["xref"]
    protein["category"] = "biolink:Protein"
    protein["name"] = protein["protein name"]
    protein["xref"] = "ENSEMBL:" + ensemblf["protein_stable_id"]
    protein = protein[["id", "category", "name", "xref", "provided_by", "source", "source version"]]

    edges = gene_to_protein

    genes = ensemblf
    genes["id"] = "ENSEMBL:" + ensemblf["gene_stable_id"]
    genes["category"] = "biolink:Gene"
    genes["name"] = genes["id"].map(genesf)
    genes = genes[["id", "category", "name", "provided_by", "source", "source version"]]

    nodes = pd.concat([genes, protein]).drop_duplicates()

    nodes[["id", "name", "category", "provided_by", "xref", "source", "source version"]].to_csv(
        f"{args.output [0]}", sep="\t", index=False
    )
    edges[
        ["object", "subject", "id", "predicate", "knowledge_source", "relation", "source", "source version"]
    ].to_csv(f"{args.output[1]}", sep="\t", index=False)


if __name__ == "__main__":
    main()

scripts/goa_to_kgx.py:

import uuid
import json
import argparse
import pandas as pd
import yaml


GAF_COLUMNS = [
    "DB",
    "DB Object ID",
    "DB Object Symbol",
    "Qualifier",
    "GO ID",
    "DB:Reference",
    "Evidence Code",
    "With (or) From",
    "Aspect",
    "DB Object Name",
    "DB Object Synonym",
    "DB Object Type",
    "Taxon(|taxon)",
    "Date",
    "Assigned By",
    "Annotation Extension",
    "Gene Product Form ID",
]


def yaml_loader(fname):
    with open(fname) as f:
        classes = pd.DataFrame(yaml.full_load(f)["classes"])
    classes = classes.drop_duplicates().set_index("database")
    classes = classes[~classes.index.duplicated(keep="first")].iloc[:, 0]
    return classes


def read_gaf(fnames, biolinkclasses):
    gaf = pd.DataFrame(columns=GAF_COLUMNS)
    for f in fnames:
        df = pd.read_csv(f, sep="\t", comment="!", header=None, low_memory=False)
        df.columns = GAF_COLUMNS
        df["Qualifier"] = df["Qualifier"].replace("is_active_in", "active_in")
        df["Qualifier"] = df["Qualifier"].replace("NOT|is_active_in", "NOT|active_in")
        df["DB"] = df["DB"].str.upper()
        df["Biolink Category"] = df["DB"].map(biolinkclasses)
        gaf = pd.concat([gaf, df])
    return gaf


def get_predicate_map(fname):
    ro = json.load(open(fname))
    predicate_to_relation = {}
    for node in ro["graphs"][0]["nodes"]:
        relation = node["id"]
        if node.get("lbl", None) == "is active in":
            predicate = "active in"
        else:
            predicate = node.get("lbl", None)
        if predicate is not None:
            relation = relation.split("/")[-1].replace("_", ":")
            predicate_to_relation[predicate.replace(" ", "_")] = relation
    return predicate_to_relation


def get_parser():
    parser = argparse.ArgumentParser(
        prog="goa_to_kgx.py",
        description="goa_to_kgx: convert an goa file to CSVs with nodes and edges.",
    )
    parser.add_argument("-i", "--input", nargs="+", help="Input GAF files")
    parser.add_argument("-r", "--ro", help="Input RO json file")
    parser.add_argument("-g", "--go", help="Input GO nodes file")
    parser.add_argument("-c", "--cfg", help="Input config.yaml file")
    parser.add_argument("-v", "--version", help="Input version file")
    parser.add_argument(
        "-o", "--output", nargs="+", default="goa", help="Output prefix. Default: out"
    )
    return parser


def main():
    parser = get_parser()
    args = parser.parse_args()

    with open(args.version, "r") as f:
        version = json.load(f)["date"]

    biolinkclasses = yaml_loader(args.cfg)
    predicate_to_relation = get_predicate_map(args.ro)
    gof = pd.read_csv(args.go, sep="\t")[["id", "category", "name", "provided_by", "xref", "source", "source version"]]
    gaf = read_gaf(args.input, biolinkclasses)
    gaf["provided_by"] = "GOA"
    gaf["id"] = gaf.DB + ":" + gaf["DB Object ID"].str.split("_").str[0]
    gaf["category"] = gaf["Biolink Category"]
    gaf["name"] = gaf["DB Object Symbol"]
    gaf["source"] = "GOA"
    gaf["source version"] = version



    nodes = pd.concat([gaf[["id", "name", "category", "provided_by", "source", "source version"]], gof])
    nodes.drop_duplicates().to_csv(
        f"{args.output[0]}", sep="\t", index=False
    )
    # Now edges
    gaf["object"] = gaf["GO ID"]
    gaf["subject"] = gaf.DB + ":" + gaf["DB Object ID"]
    gaf["category"] = "biolink:FunctionalAssociation"
    gaf["negated"] = gaf.Qualifier.str.startswith("NOT|")
    gaf["predicate"] = "biolink:" + gaf.Qualifier.str.replace("NOT|", "", regex=False)
    gaf["relation"] = gaf.Qualifier.map(predicate_to_relation)
    gaf["knowledge_source"] = "GOA"
    gaf = gaf[
        [
            "subject",
            "predicate",
            "object",
            "category",
            "negated",
            "relation",
            "knowledge_source",
            "source",
            "source version"
        ]
    ].drop_duplicates()
    gaf["id"] = gaf.subject.apply(lambda x: uuid.uuid4())
    gaf.to_csv(f"{args.output[1]}", sep="\t", index=False)


if __name__ == "__main__":
    main()

scripts/go_kgx_process.py:

import pandas as pd
import argparse
import json

def get_parser():
    parser = argparse.ArgumentParser(
        prog="go_kgx_process.py",
        description=(
            "go_kgx_process: get go version."
        ),
    )
    parser.add_argument("-i", "--input", nargs="+", help="Input files")
    parser.add_argument("-v", "--version", help="Input version file")
    parser.add_argument(
        "-o",
        "--output",
        nargs="+",
        default="go",
        help="Output prefix. Default: out",
    )
    return parser

def main():

    parser = get_parser()
    args = parser.parse_args()
    gonodes = pd.read_csv(args.input[0], sep="\t", low_memory=False)
    goedges = pd.read_csv(args.input[1], sep="\t", low_memory=False)

    with open(args.version, "r") as f:
        version = json.load(f)["date"]

    gonodes["source"] = "GO"
    gonodes["source version"] = version

    goedges["source"] = "GO"
    goedges["source version"] = version

    gonodes[["id", "category", "name", "provided_by", "description", "xref", "source", "source version"]].drop_duplicates().to_csv(
        f"{args.output[0]}", sep="\t", index=False
    )
    goedges[
        ["id", "subject", "predicate", "object", "relation", "knowledge_source", "source", "source version"]
    ].to_csv(f"{args.output[1]}", sep="\t", index=False)

if __name__ == "__main__":
    main()

scripts/hpoa_to_kgx.py:

import uuid
import argparse
import pandas as pd

HPOA_COLUMNS = [
    "DatabaseId",
    "DB Name",
    "Qualifier",
    "HPO ID",
    "DB Reference",
    "Evidence",
    "Onset",
    "Frequency",
    "Sex",
    "Modifier",
    "Aspect",
    "Biocuration",
]

def get_version(fname):
    with open(fname) as f:
        for line in f:
            if "#version:" in line:
                version = line.split(":")[1].split("\n")[0].replace(" ", "")
    return version

def read_hpoa(fname):
    hpoa = pd.read_csv(fname, sep="\t", header=None, low_memory=False, comment="#")
    hpoa.columns = HPOA_COLUMNS
    return hpoa


def read_mondo(fname):
    mondo = pd.read_csv(fname, sep="\t", low_memory=False)
    mondo = mondo.drop_duplicates().set_index("disease")
    mondo = mondo[~mondo.index.duplicated(keep="first")].iloc[:, 0]
    return mondo


def get_parser():
    parser = argparse.ArgumentParser(
        prog="hpoa_to_kgx.py",
        description="hpoa_to_kgx: convert an hpoa file to CSVs with nodes and edges.",
    )
    parser.add_argument("-i", "--input", help="Input hpoa files")
    parser.add_argument("-m", "--mapping", help="Input mondo mapping files")
    parser.add_argument("-n", "--hpo", help="Input hpo nodes")
    parser.add_argument(
        "-o", "--output", nargs="+", default="goa", help="Output prefix. Default: out"
    )
    return parser


def main():
    parser = get_parser()
    args = parser.parse_args()
    hpoa = read_hpoa(args.input)
    mondo_mapping = read_mondo(args.mapping)

    version = get_version(args.input)

    hpoa["provided_by"] = "HPOA"
    hpoa["knowledge_source"] = "HPOA"
    hpoa["id"] = hpoa["DatabaseId"].map(mondo_mapping)
    hpoa["category"] = "biolink:Disease"
    hpoa["name"] = hpoa["DB Name"]
    hpoa["source"] = "HPOA"
    hpoa["source version"] = version
    hpf = pd.read_csv(args.hpo, sep="\t")[
        [
            "id",
            "name",
            "category",
            "provided_by",
            "xref",
            "source",
            "source version"
        ]
    ]
    hpf = hpf[hpf.id.str.startswith("HP")]
    pd.concat([
        hpoa[
                [
                    "id",
                    "name",
                    "category",
                    "provided_by",
                    "source",
                    "source version"
                ]
        ].dropna(subset=["id"]), hpf]).drop_duplicates().to_csv(
        f"{args.output[0]}", sep="\t", index=False
    )
    # Now edges

    hpoa["subject"] = hpoa["DatabaseId"].map(mondo_mapping)
    hpoa["object"] = hpoa["HPO ID"]
    hpoa["id"] = hpoa.id.apply(lambda x: uuid.uuid4())
    hpoa["category"] = "biolink:DiseaseToPhenotypicFeatureAssociation"
    hpoa["negated"] = hpoa.Qualifier.str.startswith("NOT")
    hpoa["predicate"] = "biolink:has_phenotype"
    hpoa["relation"] = "RO:0002200"
    hpoa = (
        hpoa[
            [
                "subject",
                "predicate",
                "object",
                "negated",
                "category",
                "relation",
                "knowledge_source",
                "source",
                "source version"
            ]
        ]
        .dropna()
        .drop_duplicates()
    )
    hpoa["id"] = hpoa.subject.apply(lambda x: uuid.uuid4())
    hpoa.to_csv(f"{args.output[1]}", sep="\t", index=False)


if __name__ == "__main__":
    main()

scripts/hpo_kgx_process.py:

import pandas as pd
import argparse
import requests

release = "https://api.github.com/repos/obophenotype/human-phenotype-ontology/releases/latest"
def get_parser():
    parser = argparse.ArgumentParser(
        prog="hpo_kgx_process.py",
        description=(
            "hpo_kgx_process: get hpo version."
        ),
    )
    parser.add_argument("-i", "--input", nargs="+", help="Input files")
    parser.add_argument(
        "-o",
        "--output",
        nargs="+",
        default="go",
        help="Output prefix. Default: out",
    )
    return parser

def main():

    parser = get_parser()
    args = parser.parse_args()
    hponodes = pd.read_csv(args.input[0], sep="\t", low_memory=False)
    hpoedges = pd.read_csv(args.input[1], sep="\t", low_memory=False)

    response = requests.get(
        release
    )
    version = response.json()["name"]

    hponodes["source"] = "HPO"
    hponodes["source version"] = version

    hpoedges["source"] = "HPO"
    hpoedges["source version"] = version

    hponodes[["id", "category", "name", "provided_by", "description", "xref", "source","source version"]].drop_duplicates().to_csv(
        f"{args.output[0]}", sep="\t", index=False
    )
    hpoedges[
        ["id", "subject", "predicate", "object", "relation", "knowledge_source", "source", "source version"]
    ].to_csv(f"{args.output[1]}", sep="\t", index=False)

if __name__ == "__main__":
    main()

scripts/lcc.py:

import argparse
from collections import Counter
from kgx.transformer import Transformer
import networkx as nx

def get_parser():
    parser = argparse.ArgumentParser(
        prog="lcc.py",
        description=(
            "lcc: extract the largest connected component from a tsv format KG."
        ),
    )
    parser.add_argument("-n", "--nodes", help="Node file")
    parser.add_argument("-e", "--edges", help="Edge files")
    parser.add_argument("-o", "--output", default="lcc", help="Output prefix. Default: lcc")
    return parser


def main():
    parser = get_parser()
    args = parser.parse_args()
    input_args = {'filename': [args.nodes, args.edges], 'format': 'tsv'}
    output_args = {'filename': args.output, 'format': 'tsv'}
    t = Transformer(stream=False)
    t.transform(input_args=input_args)
    print("connected components:", Counter(map(len, nx.connected_components(t.store.graph.graph.to_undirected()))))
    lcc = max(nx.connected_components(t.store.graph.graph.to_undirected()), key=len)
    print("Size of lcc:", len(lcc))
    t.store.graph.graph = t.store.graph.graph.subgraph(lcc)
    t.save(output_args)


if __name__ == "__main__":
    main()

scripts/mirtarbase_to_csv.py:

import argparse
import pandas as pd
from xlsx2csv import Xlsx2csv
from io import StringIO


def get_parser():
    parser = argparse.ArgumentParser(
        prog="mirtarbase_to_csv.py",
        description="mirtarbase_to_csv: convert a mirtarbase xlsx file to csv.",
    )
    parser.add_argument("-i", "--input", help="Input file")
    parser.add_argument(
        "-o", "--output", default="ensembl", help="Output prefix. Default: out"
    )
    return parser


def read_excel(path: str, sheet_name: str) -> pd.DataFrame:
    buffer = StringIO()
    Xlsx2csv(path, outputencoding="utf-8", sheet_name=sheet_name).convert(buffer)
    buffer.seek(0)
    df = pd.read_csv(buffer)
    return df


def main():
    parser = get_parser()
    args = parser.parse_args()
    path = args.input
    sheet = "Homo sapiens"
    read_excel(path, sheet).to_csv(f"{args.output}", sep="\t", index=False)


if __name__ == "__main__":
    main()

scripts/mirtarbase_to_kgx.py:

import uuid
import argparse
import pandas as pd

def get_version(fname):
    version = fname.split("/")[-1]
    version = version.split("_")[0]
    return version

def get_parser():
    parser = argparse.ArgumentParser(
        prog="mirtarbase_to_kgx.py",
        description=(
            "mirtarbase_to_kgx: convert a mirtarbase file to CSVs with nodes and edges."
        ),
    )
    parser.add_argument("-i", "--input", help="Input files")
    parser.add_argument("-r", "--rna", help="Input files")
    parser.add_argument("-g", "--genes", help="Input files")
    parser.add_argument(
        "-o",
        "--output",
        nargs="+",
        default="ensembl",
        help="Output prefix. Default: out",
    )
    return parser


def main():
    parser = get_parser()
    args = parser.parse_args()

    version = get_version(args.input)

    rnamapping = pd.read_csv(args.rna, sep="\t", header=None, low_memory=False).iloc[
        :, :5
    ]
    rnamapping.columns = ["RNACentral", "DB", "xref", "Species", "Type"]
    rnamapping = rnamapping[["RNACentral", "xref"]].drop_duplicates().set_index("xref")
    rnamapping = rnamapping[~rnamapping.index.duplicated(keep="first")].iloc[:, 0]

    genemapping = (
        pd.read_csv(args.genes, sep="\t", low_memory=False)
        .drop_duplicates()
        .set_index("Entrez Gene ID")
    )
    genemapping = genemapping[~genemapping.index.duplicated(keep="first")].iloc[:, 0]

    mirtarbase = pd.read_csv(args.input, sep="\t", low_memory=False)
    mirtarbase = mirtarbase[
        ["miRTarBase ID", "miRNA", "Target Gene", "Target Gene (Entrez ID)"]
    ]

    mirtarbase["object"] = (
        mirtarbase["Target Gene (Entrez ID)"].map(str).map(genemapping)
    )
    mirtarbase["subject"] = mirtarbase["miRNA"].map(rnamapping)
    mirtarbase = mirtarbase.dropna(subset=["object", "subject"])

    mirtarbase["object"] = "ENSEMBL:" + mirtarbase["object"]
    mirtarbase["subject"] = "RNACENTRAL:" + mirtarbase["subject"]
    mirtarbase["provided_by"] = "Mirtarbase"
    mirtarbase["knowledge_source"] = "Mirtarbase"
    mirtarbase["predicate"] = "biolink:interacts_with"
    mirtarbase["relation"] = "RO:0002434"
    mirtarbase["source"] = "Mirtarbase"
    mirtarbase["source version"] = version

    edges = mirtarbase[
        ["object", "subject", "predicate", "knowledge_source", "relation", "source", "source version"]
    ].drop_duplicates()
    edges["id"] = mirtarbase["subject"].apply(lambda x: uuid.uuid4())

    rna = mirtarbase[["subject", "miRNA", "provided_by", "source", "source version"]]
    rna["id"] = rna["subject"]
    rna["xref"] = rna["miRNA"]
    rna["category"] = "biolink:RNAProduct"

    dna = mirtarbase[
        ["object", "Target Gene", "provided_by", "Target Gene (Entrez ID)", "source", "source version"]
    ]
    dna["xref"] = dna["Target Gene (Entrez ID)"]
    dna["name"] = dna["Target Gene"]
    dna["category"] = "biolink:Gene"
    dna["id"] = dna["object"]

    nodes = pd.concat([dna, rna]).drop_duplicates()

    nodes[["id", "name", "category", "provided_by", "xref", "source", "source version"]].to_csv(
        f"{args.output [0]}", sep="\t", index=False
    )
    edges.to_csv(f"{args.output[1]}", sep="\t", index=False)


if __name__ == "__main__":
    main()

scripts/mondo_kgx_process.py:

import pandas as pd
import argparse
import requests

release = "https://api.github.com/repos/monarch-initiative/mondo/releases/latest"
def get_parser():
    parser = argparse.ArgumentParser(
        prog="mondo_kgx_process.py",
        description=(
            "mondo_kgx_process: get mondo version."
        ),
    )
    parser.add_argument("-i", "--input", nargs="+", help="Input files")
    parser.add_argument(
        "-o",
        "--output",
        nargs="+",
        default="go",
        help="Output prefix. Default: out",
    )
    return parser

def main():

    parser = get_parser()
    args = parser.parse_args()
    mondonodes = pd.read_csv(args.input[0], sep="\t", low_memory=False)
    mondoedges = pd.read_csv(args.input[1], sep="\t", low_memory=False)

    response = requests.get(
        release
    )
    version = response.json()["name"]

    mondonodes["source"] = "MONDO"
    mondonodes["source version"] = version

    mondoedges["source"] = "MONDO"
    mondoedges["source version"] = version

    mondonodes[["id", "category", "name", "provided_by", "description", "xref", "source","source version"]].drop_duplicates().to_csv(
        f"{args.output[0]}", sep="\t", index=False
    )
    mondoedges[
        ["id", "subject", "predicate", "object", "relation", "knowledge_source", "source", "source version"]
    ].to_csv(f"{args.output[1]}", sep="\t", index=False)

if __name__ == "__main__":
    main()

scripts/mondo_mapping.py:

import argparse
import pandas as pd


def get_parser():
    parser = argparse.ArgumentParser(
        prog="mondo_mapping.py", description="mondo_mapping: get mondo mapping csv file"
    )
    parser.add_argument("-i", "--input", help="Input mondo data file.")
    parser.add_argument(
        "-o", "--output", default="mondo_mapping", help="Output mondo mapping."
    )
    return parser


def main():
    parser = get_parser()
    args = parser.parse_args()
    mondo_nodes = pd.DataFrame(pd.read_json(args.input).graphs[0]["nodes"])
    mondo_nodes["id"] = (
        mondo_nodes["id"].str.split("/").str[-1].str.replace("_", ":", regex=False)
    )
    mondo_nodes["xrefs"] = mondo_nodes["meta"]
    mondo_map = []

    for node in range(len(mondo_nodes)):
        try:
            xrefs = mondo_nodes["meta"][node]["xrefs"]
            if xrefs is not None:
                for xref in xrefs:
                    mondo_map.append((xref["val"], mondo_nodes["id"][node]))
        except:
            continue
    mondo_map = pd.DataFrame(mondo_map, columns=["disease", "mondo"])
    mondo_map.to_csv(f"{args.output}", sep="\t", index=False)


if __name__ == "__main__":
    main()

scripts/npinter_to_kgx.py:

import argparse
import pandas as pd
import uuid

predicates = {
    "binding": "biolink:binds",
    "binding;regulatory": "biolink:binds",
    "regulatory": "biolink:regulates",
    "expression correlation": "biolink:correlates",
    "coexpression": "biolink:coexpressed_with",
}

GENES = ["Gene Id", "Gene Version", "Gene Name"]

RNACENTRALMAPPING = [
    "RNACentral ID",
    "DB",
    "Transcript ID",
    "Species",
    "RNA Type",
    "Gene ID",
]


def add_predicates(df):
    # do not drop duplicate values here: "binding" and "binding;regulatory"
    # intentionally map to the same predicate
    predicatef = pd.Series(predicates)
    df["predicate"] = df["class"].map(predicatef)
    return df

def read_rna(fnames, type):
    rnamapping = pd.DataFrame()
    for f in fnames:
        df = pd.read_csv(f, sep="\t", low_memory=False, header=None)
        df.columns = RNACENTRALMAPPING
        rnamapping = pd.concat([rnamapping, df])
    rnamapping["ID"] = rnamapping[type].str.split(".").str[0]
    rnamapping = rnamapping[["ID", "RNACentral ID"]].drop_duplicates().set_index("ID")
    rnamapping = rnamapping[~rnamapping.index.duplicated(keep="first")].iloc[:, 0]
    return rnamapping


def read_genes(fname):
    df = pd.read_csv(fname, sep=";", low_memory=False, header=None)
    df = df.iloc[:, :3]
    df.columns = GENES
    df = df[df["Gene Name"].str.contains("gene_name")]
    df["Gene Id"] = "ENSEMBL:" + df["Gene Id"].str.split(" ").str[-1].str.replace(
        '"', ""
    )
    df["Gene Name"] = df["Gene Name"].str.split(" ").str[-1].str.replace('"', "")
    df = df[["Gene Id", "Gene Name"]].drop_duplicates().set_index("Gene Name")
    df = df[~df.index.duplicated(keep="first")].iloc[:, 0]
    return df


def read_id_mapping_uniprot(fname):
    df = pd.read_csv(fname, sep="\t", header=None, low_memory=False)
    df.columns = ["ID", "Database", "Database ID"]
    df = df[df["Database"] == "UniProtKB-ID"]
    df["Database ID"] = df["Database ID"].str.split("_").str[0]
    df = df[["ID", "Database ID"]].drop_duplicates().set_index("ID")
    df = df[~df.index.duplicated(keep="first")].iloc[:, 0]
    return df


def get_parser():
    parser = argparse.ArgumentParser(
        prog="npinter_to_kgx.py",
        description="npinter_to_kgx: convert an npinter file to CSVs with nodes and edges.",
    )
    parser.add_argument("-i", "--input", help="Input files")
    parser.add_argument("-p", "--proteins", help="Input files")
    parser.add_argument("-g", "--genes", help="Input files")
    parser.add_argument("-r", "--rna", nargs="+", help="Input files")
    parser.add_argument(
        "-o", "--output", nargs="+", default="npinter", help="Output prefix. Default: out"
    )
    return parser


def main():
    parser = get_parser()
    args = parser.parse_args()

    npinterf = pd.read_csv(args.input, sep="\t", low_memory=False)
    npinterf = add_predicates(npinterf)
    uniprotf = read_id_mapping_uniprot(args.proteins)
    ensemblf = read_genes(args.genes)
    rnacentraltf = read_rna(args.rna, "Transcript ID")
    rnacentralgf = read_rna(args.rna, "Gene ID")

    version = args.input.split("/")[-1]
    version = version.split(".")[0].split("_")[1]

    npinterf["RNACentral Transcript"] = npinterf["ncID"].map(rnacentraltf)
    npinterf["RNACentral Gene"] = npinterf["ncID"].map(rnacentralgf)
    npinterf["subject"] = (
        npinterf[["RNACentral Transcript", "RNACentral Gene"]].bfill(axis=1).iloc[:, 0]
    )
    npinterf = npinterf.dropna(subset=["subject"])
    npinterf["subject"] = "RNACENTRAL:" + npinterf["subject"]
    npinterf["provided_by"] = "NPInter"
    npinterf["knowledge_source"] = "NPInter"
    npinterf["source"] = "NPInter"
    npinterf["source version"] = version

    npinterproteins = npinterf[npinterf["level"].isin(["RNA-Protein"])]
    npinterproteins["Uniprot Name"] = npinterproteins["tarID"].map(uniprotf)
    npinterproteins = npinterproteins.dropna(subset=["Uniprot Name"])
    npinterproteins["object"] = "UNIPROTKB:" + npinterproteins["tarID"]

    proteins = npinterproteins[["object", "provided_by", "Uniprot Name", "source", "source version"]]
    proteins["id"] = proteins["object"]
    proteins["name"] = proteins["Uniprot Name"]
    proteins["category"] = "biolink:Protein"
    proteins = proteins[["id", "name", "provided_by", "category", "source", "source version"]
    ].drop_duplicates()

    npinterrna = npinterf[npinterf["level"].isin(["RNA-RNA"])]
    npinterrna["RNACentral Transcript"] = npinterrna["tarID"].map(rnacentraltf)
    npinterrna["RNACentral Gene"] = npinterrna["tarID"].map(rnacentralgf)
    npinterrna["object"] = (
        npinterrna[["RNACentral Transcript", "RNACentral Gene"]].bfill(axis=1).iloc[:, 0]
    )
    npinterrna = npinterrna.dropna(subset=["object"])
    npinterrna["object"] = "RNACENTRAL:" + npinterrna["object"]


    rnaobj = npinterrna[["object", "provided_by", "tarName", "tarType", "tarID","source", "source version"]]
    rnaobj["id"] = rnaobj["object"]
    rnaobj["name"] = rnaobj["tarName"]
    rnaobj["category"] = "biolink:RNAProduct"
    rnaobj["node_property"] = rnaobj["tarType"]
    rnaobj["xref"] = rnaobj["tarID"]
    rnaobj = rnaobj[["id", "name", "provided_by", "category", "xref", "node_property", "source", "source version"]
    ].drop_duplicates()


    npintergenes = npinterf[npinterf["level"].isin(["RNA-DNA"])]
    npintergenes["Ensembl ID"] = npintergenes["tarName"].map(ensemblf)
    npintergenes = npintergenes.dropna(subset=["Ensembl ID"])
    npintergenes["object"] = npintergenes["Ensembl ID"]

    genes = npintergenes[["object", "provided_by", "tarName", "source", "source version"]]
    genes["id"] = genes["object"]
    genes["name"] = genes["tarName"]
    genes["category"] = "biolink:Gene"
    genes = genes[["id", "name", "provided_by", "category","source", "source version"]].drop_duplicates()

    rna = npinterf[["subject", "ncID", "provided_by", "ncType", "ncName","source", "source version"]]
    rna["id"] = rna["subject"]
    rna["name"] = rna["ncName"]
    rna["category"] = "biolink:RNAProduct"
    rna["xref"] = rna["ncID"]
    rna["node_property"] = rna["ncType"]
    rna = rna[
        ["id", "name", "provided_by", "category", "xref", "node_property", "source", "source version"]
    ].drop_duplicates()

    nodes = pd.concat([proteins, genes, rna, rnaobj]).drop_duplicates()
    edges = pd.concat(
        [
            npintergenes[["subject", "object", "knowledge_source", "predicate","source", "source version"]],
            npinterrna[["subject", "object", "knowledge_source", "predicate","source", "source version"]],
            npinterproteins[["subject", "object", "knowledge_source", "predicate","source", "source version"]],
        ]
    )
    edges["id"] = edges["subject"].apply(lambda x: uuid.uuid4())

    nodes.to_csv(f"{args.output[0]}", sep="\t", index=False)
    edges.to_csv(f"{args.output[1]}", sep="\t", index=False)


if __name__ == "__main__":
    main()

scripts/rnacentral_to_kgx.py:

import uuid
import argparse
import pandas as pd

RNACENTRALMAPPING = [
    "RNACentral ID",
    "DB",
    "Transcript ID",
    "Species",
    "RNA Type",
    "Gene ID",
]

RNACENTRAL = ["DB", "RNACentral ID", "Name", "Type"]

GENES = ["Gene Id", "Gene Version", "Gene Name"]


def read_file(fname, columns):
    df = pd.read_csv(fname, sep="\t", header=None, comment="!", low_memory=False)
    df.columns = columns
    return df

def get_version(fname):
    with open(fname) as f:
        version = f.readlines()[1].split("\n")[0]
    return version

def read_genes(fname):
    df = pd.read_csv(fname, sep=";", low_memory=False, header=None)
    df = df.iloc[:, :3]
    df.columns = GENES
    df = df[df["Gene Name"].str.contains("gene_name")]
    df["Gene Id"] = "ENSEMBL:" + df["Gene Id"].str.split(" ").str[-1].str.replace(
        '"', ""
    )
    df["Gene Name"] = df["Gene Name"].str.split(" ").str[-1].str.replace('"', "")
    df = df[["Gene Id", "Gene Name"]].drop_duplicates().set_index("Gene Id")
    df = df[~df.index.duplicated(keep="first")].iloc[:, 0]
    return df


def get_parser():
    parser = argparse.ArgumentParser(
        prog="rnacentral_to_kgx.py",
        description=(
            "rnacentral_to_kgx: convert an rnacentral file to CSVs with nodes and"
            " edges."
        ),
    )
    parser.add_argument("-i", "--input", help="Input files")
    parser.add_argument("-m", "--mapping", help="Input files")
    parser.add_argument("-g", "--genes", help="Input files")
    parser.add_argument("-v", "--version", help="Version file")
    parser.add_argument(
        "-o",
        "--output",
        nargs="+",
        default="ensembl",
        help="Output prefix. Default: out",
    )
    return parser


def main():
    parser = get_parser()
    args = parser.parse_args()

    version = get_version(args.version)

    rnacentralmapping = read_file(args.mapping, RNACENTRALMAPPING)
    rnacentralmapping["Gene ID"] = rnacentralmapping["Gene ID"].str.split(".").str[0]

    rnacentralgenemapping = (
        rnacentralmapping[["RNACentral ID", "Gene ID"]]
        .drop_duplicates()
        .set_index("RNACentral ID")
    )
    rnacentralgenemapping = rnacentralgenemapping[
        ~rnacentralgenemapping.index.duplicated(keep="first")
    ].iloc[:, 0]

    rnacentralrnamapping = (
        rnacentralmapping[["RNACentral ID", "Transcript ID"]]
        .drop_duplicates()
        .set_index("RNACentral ID")
    )
    rnacentralrnamapping = rnacentralrnamapping[
        ~rnacentralrnamapping.index.duplicated(keep="first")
    ].iloc[:, 0]

    genenames = read_genes(args.genes)

    rnacentral = read_file(args.input, RNACENTRAL)
    rnacentral["RNACentral ID"] = (
        rnacentralmapping["RNACentral ID"].str.split("_").str[0]
    )
    rnacentral["Ensembl Gene ID"] = rnacentral["RNACentral ID"].map(
        rnacentralgenemapping
    )
    rnacentral["Ensembl Transcript ID"] = rnacentral["RNACentral ID"].map(
        rnacentralrnamapping
    )
    rnacentral["provided_by"] = rnacentral["DB"].str.upper()
    rnacentral["knowledge_source"] = rnacentral["DB"].str.upper()

    rnacentral["subject"] = "ENSEMBL:" + rnacentral["Ensembl Gene ID"]
    rnacentral["object"] = "RNACENTRAL:" + rnacentral["RNACentral ID"]
    rnacentral["predicate"] = "biolink:has_gene_product"
    rnacentral["relation"] = "RO:0002205"
    rnacentral["source"] = "RNACentral"
    rnacentral["source version"] = version
    rnacentral = rnacentral.dropna(subset=["object", "subject"])

    edges = rnacentral[
        ["subject", "predicate", "object", "relation", "knowledge_source", "source", "source version"]
    ].drop_duplicates()
    edges["id"] = rnacentral["subject"].apply(lambda x: uuid.uuid4())

    rna = rnacentral[["object", "Type", "provided_by", "Name", "Ensembl Transcript ID","source", "source version"]]
    rna["id"] = rna["object"]
    rna["category"] = "biolink:RNAProduct"
    rna["name"] = rna["Name"]
    rna["xref"] = "ENSEMBL:" + rna["Ensembl Transcript ID"]
    rna["node_property"] = rna["Type"]
    rna = rna[["id", "category", "name", "xref", "provided_by", "node_property","source", "source version"]]

    genes = rnacentral[["subject", "provided_by","source", "source version"]]
    genes["id"] = genes["subject"]
    genes["name"] = genes["subject"].map(genenames)
    genes["category"] = "biolink:Gene"
    genes = genes[["id", "category", "name", "provided_by","source", "source version"]]

    nodes = pd.concat([genes, rna]).drop_duplicates()

    nodes[["id", "name", "category", "provided_by", "xref", "node_property","source", "source version"]].to_csv(
        f"{args.output [0]}", sep="\t", index=False
    )
    edges[
        ["object", "subject", "id", "predicate", "knowledge_source", "relation","source", "source version"]
    ].to_csv(f"{args.output[1]}", sep="\t", index=False)


if __name__ == "__main__":
    main()

scripts/stringdb_to_kgx.py:

import argparse
import pandas as pd
import uuid


def read_id_mapping_uniprot(fname, id, type):
    df = pd.read_csv(fname, sep="\t", header=None, low_memory=False)
    df.columns = ["ID", "Database", "Database ID"]
    df = df[df["Database"] == type]
    df["Database ID"] = df["Database ID"].str.split("_").str[0]
    df = df[["ID", "Database ID"]].drop_duplicates().set_index(id)
    df = df[~df.index.duplicated(keep="first")].iloc[:, 0]
    return df


def get_parser():
    parser = argparse.ArgumentParser(
        prog="stringdb_to_kgx.py",
        description=(
            "string_to_csv: convert an string file to CSVs with nodes and edges."
        ),
    )
    parser.add_argument("-i", "--input", help="Input files")
    parser.add_argument("-p", "--proteins", help="Input files")
    parser.add_argument(
        "-o",
        "--output",
        nargs="+",
        default="string",
        help="Output prefix. Default: out",
    )
    return parser


def main():
    parser = get_parser()
    args = parser.parse_args()

    version = args.input.split("/")[-1]
    version = version.split(".")[3]

    stringdbf = pd.read_csv(args.input, sep=" ", low_memory=False)

    idmapping = read_id_mapping_uniprot(args.proteins, "Database ID", "STRING")
    namemapping = read_id_mapping_uniprot(args.proteins, "ID", "UniProtKB-ID")

    stringdbf["protein1 id"] = stringdbf["protein1"].map(idmapping)
    stringdbf["protein2 id"] = stringdbf["protein2"].map(idmapping)

    stringdbf = stringdbf.dropna(subset=["protein1 id", "protein2 id"])
    stringdbf["subject"] = "UNIPROTKB:" + stringdbf["protein1 id"]
    stringdbf["object"] = "UNIPROTKB:" + stringdbf["protein2 id"]
    stringdbf["provided_by"] = "STRING"
    stringdbf["knowledge_source"] = "STRING"
    stringdbf["predicate"] = "biolink:interacts_with"
    stringdbf["relation"] = "RO:0002436"
    stringdbf["category"] = "biolink:Protein"
    stringdbf["has_confidence_level"] = stringdbf["combined_score"]
    stringdbf["source"] = "STRING"
    stringdbf["source version"] = version

    protein1 = stringdbf[
        ["protein1", "protein1 id", "subject", "provided_by", "category", "source", "source version"]
    ]
    protein1["id"] = protein1["subject"]
    protein1["name"] = protein1["protein1 id"].map(namemapping)
    protein1["xref"] = "ENSEMBL:" + protein1["protein1"].str.split(".").str[-1]
    protein1 = protein1[["id", "name", "provided_by", "category", "xref", "source", "source version"]]
    protein2 = stringdbf[
        ["protein2", "protein2 id", "object", "provided_by", "category", "source", "source version"]
    ]
    protein2["id"] = protein2["object"]
    protein2["name"] = protein2["protein2 id"].map(namemapping)
    protein2["xref"] = "ENSEMBL:" + protein2["protein2"].str.split(".").str[-1]
    protein2 = protein2[["id", "name", "provided_by", "category", "xref", "source", "source version"]]

    nodes = pd.concat([protein1, protein2]).drop_duplicates()

    edges = stringdbf[
        ["subject", "object", "knowledge_source", "predicate", "has_confidence_level", "source", "source version"]
    ].drop_duplicates()
    edges["id"] = edges["subject"].apply(lambda x: uuid.uuid4())

    nodes.to_csv(f"{args.output[0]}", sep="\t", index=False)
    edges.to_csv(f"{args.output[1]}", sep="\t", index=False)


if __name__ == "__main__":
    main()

scripts/uberon_kgx_process.py:

import pandas as pd
import argparse
import requests

release = "https://api.github.com/repos/obophenotype/uberon/releases/latest"
def get_parser():
    parser = argparse.ArgumentParser(
        prog="uberon_kgx_process.py",
        description=(
            "uberon_kgx_process: get uberon version."
        ),
    )
    parser.add_argument("-i", "--input", nargs="+", help="Input files")
    parser.add_argument(
        "-o",
        "--output",
        nargs="+",
        default="go",
        help="Output prefix. Default: out",
    )
    return parser

def main():

    parser = get_parser()
    args = parser.parse_args()
    uberonnodes = pd.read_csv(args.input[0], sep="\t", low_memory=False)
    uberonedges = pd.read_csv(args.input[1], sep="\t", low_memory=False)

    response = requests.get(
        release
    )
    version = response.json()["name"]

    uberonnodes["source"] = "Uberon"
    uberonnodes["source version"] = version

    uberonedges["source"] = "Uberon"
    uberonedges["source version"] = version

    uberonnodes[["id", "category", "name", "provided_by", "description", "xref", "source","source version"]].drop_duplicates().to_csv(
        f"{args.output[0]}", sep="\t", index=False
    )
    uberonedges[
        ["id", "subject", "predicate", "object", "relation", "knowledge_source", "source", "source version"]
    ].to_csv(f"{args.output[1]}", sep="\t", index=False)

if __name__ == "__main__":
    main()
From line 53:
shell: "kgx merge --merge-config ../config/merge_config.yaml"
From line 58:
shell: "python scripts/lcc.py -n {input.nodes} -e {input.edges} -o ../data/processed/finals/lcc"

Name: ngest
Version: v0.2.0
URL: https://github.com/hmartiniano/ngest
Maintainers: public
License: GNU General Public License v3.0
Copyright: Public Domain
Related Workflows

cellranger-snakemake-gke
snakemake workflow to run cellranger on a given bucket using gke.
A Snakemake workflow for running cellranger on a given bucket using Google Kubernetes Engine. The usage of this workflow ...