Data processing and analytics code for project "Biodiversity and ecosystem services trade-offs"
Trade-offs between biodiversity and ecosystem services in Europe
Version: 0.2.0
Introduction
Installation
The project relies on both Python and R scripts for pre- and post-processing the data as well as for running some of the analyses. While the project might run on Windows machines, it has never been tested on one. Your safest bet is to run everything on an Ubuntu 14.04 (Trusty) machine. Everything apart from Zonation should run on other Linux distributions as well, but Zonation currently works nicely only on Ubuntu 14.04 (though it is possible to compile it on 16.04 as well).
1. Getting this project
You first need to have git installed on the system you want to run this project on. Install git with:
sudo apt-get install git
Next, get everything in this project using git:
git clone https://github.com/VUEG/bdes_to.git
2. Installing necessary dependencies using conda
This project uses the conda package, dependency and environment management system for setting everything up. It comes with a simple installer script, bootsrap_conda.sh, that will install the right version of the conda command-line program for you. To run it, type:
# Get into the project directory
cd bdes_to
# Install conda
./bootsrap_conda.sh
After the installation is finished and assuming everything went well, you can create a new environment with all the necessary (Python and R) packages installed by doing the following (still in the same directory):
conda env create -n bdes_to
This command will create a new virtual environment called bdes_to and install all the dependencies listed in environment.yml into it.
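Before running anything, activate the environment. With conda versions from this project's era the command is source activate (newer conda versions use conda activate instead):

source activate bdes_to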
3. Installing Zonation
Zonation is not yet available through conda, so you will have to install it separately and system-wide. Follow the installation instructions found here.
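If you need a starting point, a system-wide source build on Ubuntu has typically looked roughly like the sketch below. This is an illustration only: the package names and the repository URL are assumptions rather than instructions from this project, so defer to the official Zonation documentation.

# Build dependencies (package names are assumptions for Ubuntu 14.04)
sudo apt-get install build-essential cmake git libboost-all-dev libgdal-dev libqt4-dev
# Fetch and build the Zonation core (repository URL is an assumption)
git clone https://github.com/cbig/zonation-core.git
cd zonation-core
mkdir build && cd build
cmake .. && make
sudo make install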
Running the processing and analysis workflow
This project uses the snakemake workflow management system to run the different stages in sequence. Snakemake itself is installed into the conda environment created in step 2. You can list all the individual snakemake rules (i.e. stages) by typing:
snakemake -l
or run the whole sequence (not recommended, as this will also run all the Zonation analyses) with:
snakemake
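Before launching the full sequence, it is often useful to do a dry run, which prints the jobs snakemake would execute without actually running them. You can also run a single rule by name; <rule_name> below is a placeholder for one of the names printed by snakemake -l:

# Show what would be executed, without running anything
snakemake -n
# Run a single rule (replace <rule_name> with an actual rule name)
snakemake <rule_name>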
Project organization
├── LICENSE
├── environment.yml    <- Conda environment file
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third-party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   └── processed      <- The final, canonical data sets for modeling.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details.
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting.
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`.
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module.
│   │
│   ├── data           <- Scripts to download or generate data.
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling.
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions.
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results-oriented visualizations.
│       └── visualize.py
│
└── tests              <- Test scripts.
License
See the LICENSE file.
Contributors
- Joona Lehtomäki <joona.lehtomaki@gmail.com>
Code Snippets
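The snippets below are the run blocks of individual Snakemake rules from the project's Snakefile; the line range given with each snippet refers to its location in that file.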
Snakefile lines 118-169 (NUTS level 0 pre-processing):

run:
    llogger = utils.get_local_logger("pprocess_nuts0", log[0])
    # Read in the bounds as used in harmonize_data rule
    bleft = PROJECT_EXTENT["left"] + OFFSET[0]
    bbottom = PROJECT_EXTENT["bottom"] + OFFSET[1]
    bright = PROJECT_EXTENT["right"] + OFFSET[2]
    btop = PROJECT_EXTENT["top"] + OFFSET[3]
    bounds = "{0} {1} {2} {3}".format(bleft, bbottom, bright, btop)
    # Reproject to EPSG:3035 from EPSG:4258
    input_shp = utils.pick_from_list(input.shp, ".shp")
    reprojected_shp = utils.pick_from_list(output.reprojected, ".shp")
    cmd_str = 'ogr2ogr {0} -t_srs "EPSG:{1}" {2}'.format(reprojected_shp, PROJECT_CRS, input_shp)
    shell(cmd_str)
    llogger.info("Reprojected NUTS level 0 data from EPSG:4258 to EPSG:3035")
    llogger.debug(cmd_str)
    # NUTS 0 data has a "NUTS_ID" field, but it's character. Convert to
    # integer for rasterization.
    enhanced_shp = utils.pick_from_list(output.enhanced, ".shp")
    with fiona.drivers():
        with fiona.open(reprojected_shp) as source:
            meta = source.meta
            meta['schema']['geometry'] = 'Polygon'
            # Insert new fields
            meta['schema']['properties']['ID'] = 'int'
            meta['schema']['properties']['mask'] = 'int'
            ID = 1
            with fiona.open(enhanced_shp, 'w', **meta) as sink:
                # Loop over features
                for f in source:
                    f['properties']['ID'] = ID
                    ID += 1
                    # Create a mask ID (same for each feature) that can
                    # later be used in creating a mask.
                    f['properties']['mask'] = 1
                    # Write the record out.
                    sink.write(f)
    # Clip shapefile using ogr2ogr, syntax:
    # ogr2ogr output.shp input.shp -clipsrc <left> <bottom> <right> <top>
    processed_shp = utils.pick_from_list(output.processed, ".shp")
    # Do 2 things at the same time:
    #  1. Select a subset of countries (defined by params.countries)
    #  2. Clip output to an extent (given by bounds)
    # Build the -where clause for ogr2ogr
    where_clause = "NUTS_ID IN ({})".format(
        ", ".join(["'" + item + "'" for item in PROJECT_COUNTRIES]))
    shell('ogr2ogr -where "{where_clause}" {processed_shp} {enhanced_shp} -clipsrc {bounds}')
    llogger.debug("Clipped NUTS data to analysis bounds: {}".format(bounds))
    llogger.debug("Selected only a subset of eurostat countries:")
    llogger.debug("  " + ", ".join(PROJECT_COUNTRIES))
    llogger.debug("Resulting file: {}".format(processed_shp))
Snakefile lines 182-235 (NUTS level 2 pre-processing):

run:
    llogger = utils.get_local_logger("pprocess_nuts2", log[0])
    # Read in the bounds as used in harmonize_data rule
    bleft = PROJECT_EXTENT["left"] + OFFSET[0]
    bbottom = PROJECT_EXTENT["bottom"] + OFFSET[1]
    bright = PROJECT_EXTENT["right"] + OFFSET[2]
    btop = PROJECT_EXTENT["top"] + OFFSET[3]
    bounds = "{0} {1} {2} {3}".format(bleft, bbottom, bright, btop)
    # Reproject to EPSG:3035 from EPSG:4258 and clip
    input_shp = utils.pick_from_list(input.shp, ".shp")
    reprojected_shp = utils.pick_from_list(output.reprojected, ".shp")
    cmd_str = 'ogr2ogr {0} -t_srs "EPSG:{1}" {2}'.format(reprojected_shp, PROJECT_CRS, input_shp)
    shell(cmd_str)
    llogger.debug("Reprojected NUTS level 2 data from EPSG:4258 to EPSG:3035")
    llogger.debug(cmd_str)
    # Clip shapefile using ogr2ogr, syntax:
    # ogr2ogr output.shp input.shp -clipsrc <left> <bottom> <right> <top>
    clipped_shp = utils.pick_from_list(output.clipped, ".shp")
    # Clip output to an extent (given by bounds)
    shell('ogr2ogr {clipped_shp} {reprojected_shp} -clipsrc {bounds}')
    # Two pre-processing steps need to be done:
    #  1. Tease apart the country code from field NUTS_ID
    #  2. Create a running ID field that can be used as value in the
    #     rasterized version
    processed_shp = utils.pick_from_list(output.processed, ".shp")
    with fiona.drivers():
        with fiona.open(clipped_shp) as source:
            meta = source.meta
            meta['schema']['geometry'] = 'Polygon'
            # Insert new fields
            meta['schema']['properties']['ID'] = 'int'
            meta['schema']['properties']['country'] = 'str'
            ID = 1
            with fiona.open(processed_shp, 'w', **meta) as sink:
                # Loop over features
                for f in source:
                    # Check the country code part of NUTS_ID (first 2
                    # characters). NOTE: we're effectively doing the
                    # filtering here.
                    country_code = f['properties']['NUTS_ID'][0:2]
                    if country_code in PROJECT_COUNTRIES:
                        f['properties']['ID'] = ID
                        ID += 1
                        f['properties']['country'] = country_code
                        # Write the record out.
                        sink.write(f)
    llogger.debug("Clipped NUTS level 2 data to analysis bounds: {}".format(bounds))
    llogger.debug("Selected only a subset of eurostat countries:")
    llogger.debug("  " + ", ".join(PROJECT_COUNTRIES))
    llogger.debug("Resulting file: {}".format(processed_shp))
Snakefile lines 248-273 (rasterizing NUTS level 0 data):

run:
    llogger = utils.get_local_logger("rasterize_nuts0", log[0])
    input_shp = utils.pick_from_list(input, ".shp")
    layer_shp = os.path.basename(input_shp).replace(".shp", "")
    # Construct extent
    bounds = "{0} {1} {2} {3}".format(PROJECT_EXTENT["left"],
                                      PROJECT_EXTENT["bottom"],
                                      PROJECT_EXTENT["right"],
                                      PROJECT_EXTENT["top"])
    # 1) Rasterize land mask
    cmd_str = "gdal_rasterize -l {} ".format(layer_shp) + \
              "-a ID -tr 1000 1000 -te {} ".format(bounds) + \
              "-ot Int16 -a_nodata -32768 -co COMPRESS=DEFLATE " + \
              "{0} {1}".format(input_shp, output.land_mask)
    llogger.debug(cmd_str)
    for line in utils.process_stdout(shell(cmd_str, read=True)):
        llogger.debug(line)
    # 2) Rasterize common data mask
    cmd_str = "gdal_rasterize -l {} ".format(layer_shp) + \
              "-a mask -tr 1000 1000 -te {} ".format(bounds) + \
              "-ot Int8 -a_nodata -128 -co COMPRESS=DEFLATE " + \
              "{0} {1}".format(input_shp, output.data_mask)
    llogger.debug(cmd_str)
    for line in utils.process_stdout(shell(cmd_str, read=True)):
        llogger.debug(line)
Snakefile lines 284-301 (rasterizing NUTS level 2 data):

run:
    llogger = utils.get_local_logger("rasterize_nuts2", log[0])
    input_shp = utils.pick_from_list(input, ".shp")
    layer_shp = os.path.basename(input_shp).replace(".shp", "")
    # Construct extent
    bounds = "{0} {1} {2} {3}".format(PROJECT_EXTENT["left"],
                                      PROJECT_EXTENT["bottom"],
                                      PROJECT_EXTENT["right"],
                                      PROJECT_EXTENT["top"])
    # Rasterize
    cmd_str = "gdal_rasterize -l {} ".format(layer_shp) + \
              "-a ID -tr 1000 1000 -te {} ".format(bounds) + \
              "-ot Int16 -a_nodata -32768 -co COMPRESS=DEFLATE " + \
              "{0} {1}".format(input_shp, output[0])
    llogger.debug(cmd_str)
    for line in utils.process_stdout(shell(cmd_str, read=True)):
        llogger.debug(line)
Snakefile lines 313-327 (clipping UDR species data):

run:
    llogger = utils.get_local_logger("clip_udr_data", log[0])
    nsteps = len(input.external)
    for i, s_raster in enumerate(input.external):
        # Target raster
        clipped_raster = s_raster.replace("external", "processed/features")
        prefix = utils.get_iteration_prefix(i+1, nsteps)
        llogger.info("{0} Clipping dataset {1}".format(prefix, s_raster))
        llogger.debug("{0} Target dataset {1}".format(prefix, clipped_raster))
        # Clip data. NOTE: UDR species rasters do not have a SRS defined,
        # but they are in EPSG:3035
        cmd_str = 'gdalwarp -s_srs EPSG:3035 -t_srs EPSG:3035 -cutline {0} {1} {2} -co COMPRESS=DEFLATE'.format(input.clip_shp, s_raster, clipped_raster)
        for line in utils.process_stdout(shell(cmd_str, read=True), prefix=prefix):
            llogger.debug(line)
Snakefile lines 340-384 (creating the CLC mask):

run:
    llogger = utils.get_local_logger("create_clc_mask", log[0])
    minx = PROJECT_EXTENT["left"]
    miny = PROJECT_EXTENT["bottom"]
    maxx = PROJECT_EXTENT["right"]
    maxy = PROJECT_EXTENT["top"]
    # STEP 1: Aggregate CLC data
    cmd_warp_str = "gdalwarp -te {} {} {} {} ".format(minx, miny, maxx, maxy) + \
                   "-multi -tr {} {} -r mode ".format(PROJECT_RES, PROJECT_RES) + \
                   "-s_srs 'EPSG:{}' -t_srs 'EPSG:{}' ".format(PROJECT_CRS, PROJECT_CRS) + \
                   "{} {}".format(input.clc_2012, output.agg_clc)
    llogger.info(" [1/3] Aggregating CLC data")
    for line in utils.process_stdout(shell(cmd_warp_str, read=True)):
        llogger.debug(line)
    # STEP 2: Remove aquatic CLC classes:
    #   39 Intertidal flats
    #   40 Water courses
    #   41 Water bodies
    #   42 Coastal lagoons
    #   43 Estuaries
    #   44 Sea and ocean
    #
    # Also binarize the selection
    clc_tmp_1 = os.path.basename(output.clc_mask).split(".")[0] + "_tmp1.tif"
    clc_tmp_1 = os.path.join(os.path.dirname(output.agg_clc), clc_tmp_1)
    cmd_calc_str = 'rio calc "(where (< (* (>= (read 1) 39) 255) 255) 1 0) " ' + \
                   '{} {} --co "compress=DEFLATE"'.format(output.agg_clc, clc_tmp_1)
    llogger.info(" [2/3] Subsetting and binarizing CLC data")
    for line in utils.process_stdout(shell(cmd_calc_str, read=True)):
        llogger.debug(line)
    # STEP 3: Overlay with NUTS0 mask
    cmd_ovr_str = 'rio calc "(where (== (+ (take a 1) (take b 1)) 2) 1 0) " ' + \
                  '--name "a={}" --name "b={}" '.format(clc_tmp_1, input.nuts0) + \
                  '{} --co "compress=DEFLATE"'.format(output.clc_mask)
    llogger.debug(cmd_ovr_str)
    llogger.info(" [3/3] Overlaying with NUTS0 mask")
    for line in utils.process_stdout(shell(cmd_ovr_str, read=True)):
        llogger.debug(line)
    # Remove temporary files
    os.remove(clc_tmp_1)
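A note on the two rio calc expressions above: rasterio's command-line calculator uses a parenthesized (snuggs) expression syntax. In step 2, (>= (read 1) 39) flags cells of band 1 belonging to the aquatic classes (codes 39 and above), and the surrounding arithmetic and where turn that into a binary mask with 1 for terrestrial cells. In step 3, (+ (take a 1) (take b 1)) sums band 1 of the two named inputs, so the output is 1 only where both the binarized CLC mask and the NUTS0 mask equal 1.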
Snakefile lines 401-430 (harmonizing data):

run:
    llogger = utils.get_local_logger("harmonize_data", log[0])
    nsteps = len(input.external)
    for i, s_raster in enumerate(input.external):
        # The assumption is that zips don't need anything else but
        # extraction
        if s_raster.endswith(".zip"):
            target_dir = s_raster.replace("external", "processed/features_flow_zones")
            target_dir = os.path.dirname(target_dir)
            # Get rid of the last path component to avoid repetition
            target_dir = os.path.sep.join(target_dir.split(os.path.sep)[:-1])
            if not os.path.exists(target_dir):
                os.mkdir(target_dir)
            prefix = utils.get_iteration_prefix(i+1, nsteps)
            llogger.info("{0} Unzipping dataset {1}".format(prefix, s_raster))
            shell("unzip -o {} -d {} >& {}".format(s_raster, target_dir, log[0]))
        else:
            ## WARP
            # Target raster
            warped_raster = s_raster.replace("external", "interim/warped")
            # No need to process the snap raster, just copy it
            prefix = utils.get_iteration_prefix(i+1, nsteps)
            if s_raster == input.like_raster:
                llogger.info("{0} Copying dataset {1}".format(prefix, s_raster))
                llogger.debug("{0} Target dataset {1}".format(prefix, warped_raster))
                ret = shell("cp {s_raster} {warped_raster}", read=True)
            else:
                llogger.info("{0} Warping dataset {1}".format(prefix, s_raster))
                llogger.debug("{0} Target dataset {1}".format(prefix, warped_raster))
Snakefile lines 475-485 (processing flow zones):

run:
    llogger = utils.get_local_logger("process_flowzones", log[0])
    for in_raster, outdir in zip(input.src, output):
        cmd_str = "src/01_pre_processing/cutter.py {} {} {} -f {} -c {}".format(in_raster, input.flow_zone_units, outdir, "ID", threads)
        print(cmd_str)
        for line in utils.process_stdout(shell(cmd_str, read=True)):
            llogger.debug(line)
Snakefile lines 499-505 (calculating flow zone weights):

run:
    llogger = utils.get_local_logger("calculate_flowzone_weights", log[0])
    llogger.info(" [1/2] Calculating zonal stats for {}".format(input.cli_agro))
    stats = zonal_stats(input.flow_zone_units, input.cli_agro, stats=['sum'])
    print(stats)
Snakefile lines 527-541 (matching coverages):

run:
    llogger = utils.get_local_logger("match_coverages", log[0])
    coverage.expand_value_coverage(input.car, input.expand_raster,
                                   output.car, logger=llogger)
    coverage.expand_value_coverage(input.esc, input.expand_raster,
                                   output.esc, logger=llogger)
    coverage.expand_value_coverage(input.esf, input.expand_raster,
                                   output.esf, logger=llogger)
    coverage.expand_value_coverage(input.bio_car, input.expand_raster,
                                   output.bio_car, logger=llogger)
    coverage.expand_value_coverage(input.bio_esc, input.expand_raster,
                                   output.bio_esc, logger=llogger)
    coverage.expand_value_coverage(input.bio_esf, input.expand_raster,
                                   output.bio_esf, logger=llogger)
Snakefile lines 574-597 (generating range size data):

run:
    llogger = utils.get_local_logger("generate_range_data", log[0], debug=True)
    range_stats = pd.DataFrame({"feature": [], "count": [], "sum": [],
                                "q25_ol": [], "mean_ol": [],
                                "median_ol": [], "q75_ol": []})
    features = ES_DST_DATASETS + BD_DST_DATASETS
    # Remove potential zip files
    features = [feature for feature in features if not feature.endswith(".zip")]
    for i, feature in enumerate(features):
        prefix = utils.get_iteration_prefix(i+1, len(features))
        llogger.info("{} Processing {}".format(prefix, feature))
        feature_stats = spatutils.get_range_size(feature, logger=llogger)
        range_stats = pd.concat([range_stats, feature_stats])
    llogger.info(" Saving results to {}".format(output.csv))
    range_stats.to_csv(output.csv,
                       columns=["feature", "count", "sum", "q25_ol",
                                "mean_ol", "median_ol", "q75_ol"],
                       index=False)