Neo4j Data Integration and Build Pipeline for EpiGraphDB Graph Creation


Neo4j data integration and build pipeline - https://github.com/elswob/neo4j-build-pipeline

This pipeline originated from the work done to create the graph for EpiGraphDB. With over 20 separate data sets to process and load, a single, reproducible pipeline was needed to standardise the data preparation and graph build steps.

Code Snippets

Snakefile, lines 39-70:
run:
    # assumes os, yaml and jsonschema's validate are imported at the top of the Snakefile
    #open output file
    o = open(output[0], "w")
    #validate data integration config file
    if NODEDIR in config:
        nodes = config[NODEDIR]
        for i in nodes:
            o.write(f"integration node {i}\n")
            validate(nodes[i], os.path.join(os.getcwd(), DATA_CONFIG_SCHEMA))
    if RELDIR in config:
        rels = config[RELDIR]
        for i in rels:
            o.write(f"integration rel {i}\n")
            validate(rels[i], os.path.join(os.getcwd(), DATA_CONFIG_SCHEMA))

    #validate db schema config file
    with open(os.path.join(CONFIG_PATH, "db_schema.yaml")) as file:
        db_schema = yaml.load(file, Loader=yaml.FullLoader)
        if 'meta_nodes' in db_schema:
            nodes = db_schema['meta_nodes']
            for i in nodes:
                o.write(f"schema node {i}\n")
                validate(nodes[i], os.path.join(os.getcwd(), DB_SCHEMA_NODES_SCHEMA))
        else:
            print('The db schema has no nodes!')
            exit()
        if 'meta_rels' in db_schema:
            rels = db_schema['meta_rels']
            for i in rels:
                o.write(f"schema rel {i}\n")
                validate(rels[i], os.path.join(os.getcwd(), DB_SCHEMA_RELS_SCHEMA))
    o.close()
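
The rule above validates each config entry against a JSON Schema file. A minimal standalone sketch of the same pattern, where the file name and the cut-down META_NODE_SCHEMA are illustrative assumptions, not the pipeline's actual schema files:

# Minimal sketch of the validation pattern used above.
# "db_schema.yaml" and META_NODE_SCHEMA are illustrative assumptions.
import yaml
from jsonschema import validate, ValidationError

META_NODE_SCHEMA = {
    "type": "object",
    "required": ["properties"],
    "properties": {
        "properties": {"type": "object"},
        "index": {"type": "string"},
    },
}

with open("db_schema.yaml") as f:
    db_schema = yaml.safe_load(f)

for name, meta_node in db_schema.get("meta_nodes", {}).items():
    try:
        validate(meta_node, META_NODE_SCHEMA)
        print(f"schema node {name} ok")
    except ValidationError as e:
        print(f"schema node {name} failed: {e.message}")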
Snakefile, lines 85-116:
shell:
    """
    echo 'Deleting {params.NEO4J_IMPORTDIR}/{params.NODEDIR}/merged/*'
    rm -f {params.NEO4J_IMPORTDIR}/{params.NODEDIR}/merged/*
    echo 'Deleting {params.NEO4J_IMPORTDIR}/{params.NODEDIR}/created.txt'
    rm -f {params.NEO4J_IMPORTDIR}/{params.NODEDIR}/created.txt
    echo 'Deleting {params.NEO4J_IMPORTDIR}/{params.RELDIR}/created.txt'
    rm -f {params.NEO4J_IMPORTDIR}/{params.RELDIR}/created.txt
    echo 'Deleting {params.SNAKEMAKELOGS}/{params.NODEDIR}/*.log'
    if [ -d {params.SNAKEMAKELOGS}/{params.NODEDIR} ]; then find {params.SNAKEMAKELOGS}/{params.NODEDIR} -name "*.log" -delete; fi
    echo 'Deleting {params.SNAKEMAKELOGS}/{params.RELDIR}/*.log'
    if [ -d {params.SNAKEMAKELOGS}/{params.RELDIR} ]; then find {params.SNAKEMAKELOGS}/{params.RELDIR} -name "*.log" -delete; fi
    echo 'Deleting {params.NEO4J_IMPORTDIR}/master*'
    rm -f {params.NEO4J_IMPORTDIR}/master*
    echo 'Deleting {params.SNAKEMAKELOGS}/*.log'
    rm -f {params.SNAKEMAKELOGS}/*.log
    echo 'Deleting {params.NEO4J_IMPORTDIR}/logs/*'
    rm -f {params.NEO4J_IMPORTDIR}/logs/*
    #not sure if below is too severe
    echo 'Deleting *.csv.gz and *import-nodes.txt files in {params.NEO4J_IMPORTDIR}/{params.NODEDIR}'
    if [ -d {params.NEO4J_IMPORTDIR}/{params.NODEDIR} ]; 
        then find {params.NEO4J_IMPORTDIR}/{params.NODEDIR} -name "*.csv.gz" -delete -o -name "*import-nodes.txt" -delete; 
    else
        echo "{params.NEO4J_IMPORTDIR}/{params.NODEDIR} is missing"
    fi
    echo 'Deleting *.csv.gz and *import-rels.txt files in {params.NEO4J_IMPORTDIR}/{params.RELDIR}'
    if [ -d {params.NEO4J_IMPORTDIR}/{params.RELDIR} ]; 
        then find {params.NEO4J_IMPORTDIR}/{params.RELDIR} -name "*.csv.gz" -delete -o -name "*import-rels.txt" -delete; 
    else
        echo "{params.NEO4J_IMPORTDIR}/{params.RELDIR} is missing"
    fi
    """
Snakefile, lines 126-134:
shell:
    """
    rm -f {params.NEO4J_IMPORTDIR}/{params.NODEDIR}/merged/*
    rm -f {params.NEO4J_IMPORTDIR}/{params.NODEDIR}/created.txt
    rm -f {params.NEO4J_IMPORTDIR}/{params.RELDIR}/created.txt
    rm -f {params.NEO4J_IMPORTDIR}/master*
    rm -f {params.SNAKEMAKELOGS}/master*
    rm -f {params.SNAKEMAKELOGS}/import_report.log
    """       
Snakefile, lines 146-181:
shell:
    """
    echo 'Starting database...'
    #force load of .env file if it exists to avoid docker issues with cached variables
    if [ -f .env ]; then export $(cat .env | sed 's/#.*//g' | xargs); fi
    #create neo4j directories if not already done
    echo 'Creating Neo4j graph directories'
    python -m workflow.scripts.graph_build.create_neo4j > {log.graph} 2>&1
    #create container
    docker-compose up -d 
    #docker-compose up -d --no-recreate 
    echo 'removing old database...'
    docker exec --user neo4j {CONTAINER_NAME} sh -c 'rm -rf /var/lib/neo4j/data/databases/neo4j' >> {log.graph} 2>&1
    docker exec --user neo4j {CONTAINER_NAME} sh -c 'rm -f /var/lib/neo4j/data/transactions/neo4j/*' >> {log.graph} 2>&1
    echo 'running import...'
    SECONDS=0
    docker exec --user neo4j {CONTAINER_NAME} sh /var/lib/neo4j/import/master_import.sh > {log.build} 2>&1
    duration=$SECONDS
    echo "Import took $(($duration / 60)) minutes and $(($duration % 60)) seconds."
    echo 'stopping container {CONTAINER_NAME}...'
    docker stop {CONTAINER_NAME}
    echo 'starting container {CONTAINER_NAME}...'
    docker start {CONTAINER_NAME}
    echo 'waiting a bit...'
    sleep 30
    echo 'adding constraints and extra bits...'
    docker exec --user neo4j {CONTAINER_NAME} sh /var/lib/neo4j/import/master_constraints.sh > {log.constraints} 2>&1
    echo 'waiting a bit for indexes to populate...'
    sleep 30
    echo 'checking import report...'
    python -m workflow.scripts.graph_build.import-report-check {NEO4J_LOGDIR}/import.report > {output}
    echo 'running tests...'
    python -m pytest -vv
    echo 'Neo4j browser available here: http://{NEO4J_ADDRESS}:{NEO4J_HTTP}/browser'
    #open http://{NEO4J_ADDRESS}:{NEO4J_HTTP}/browser
    """
Snakefile, lines 189-194:
shell: 
    """
    #rm -f {NEO4J_IMPORTDIR}/{NODEDIR}/merged/*
    python -m workflow.scripts.graph_build.merge_sources > {log} 2>&1
    python -m workflow.scripts.graph_build.create_master_import >> {log} 2>&1
    """
Snakefile, line 203:
shell: "echo $(date) > {NEO4J_IMPORTDIR}/{NODEDIR}/created.txt"
Snakefile, lines 215-227:
shell: 
    """
    #make neo4j directory
    d={NEO4J_IMPORTDIR}/{NODEDIR}/{params.meta_id}
    mkdir -p $d

    #clean up any old import and constraint data
    rm -f $d/{params.meta_id}-import-nodes.txt
    rm -f $d/{params.meta_id}-constraint.txt

    #run the processing script
    python -m {params.PROCESSINGDIR}.{params.metaData[script]} -n {params.meta_id} > {log} 2>&1
    """
Snakefile, line 236:
shell: "echo $(date) > {NEO4J_IMPORTDIR}/{RELDIR}/created.txt"
Snakefile, lines 248-260:
shell: 
    """
    #make directory
    d={NEO4J_IMPORTDIR}/{RELDIR}/{params.meta_id}
    mkdir -p $d

    #clean up any old import and constraint data
    rm -f $d/{params.meta_id}-import-rels.txt
    rm -f $d/{params.meta_id}-constraint.txt

    #run the processing script
    python -m {params.PROCESSINGDIR}.{params.metaData[script]} -n {params.meta_id} > /dev/null 2> {log}
    """
Snakefile, lines 264-267:
shell: 
    """
    python -m workflow.scripts.graph_build.create_neo4j_backup > {log} 2>&1
    """



Created: 1yr ago
Updated: 1yr ago
Maintainers: public
URL: https://github.com/Lekanville/epg
Name: epg
Version: 1

Downloaded: 0
Copyright: Public Domain
License: MIT License
