This is a workflow produced for PRJ1968. It takes "virtual landscapes" produced by Tom Etherington and determines ecosystem service indicators for them. Each landscape varies some condition of topography, fragmentation or similar metric. This workflow is used to produce a table of outputs for further statistical analysis that ascertains which input properties are influential for particular ecosystem services.
This could be used for real landscapes, but at present would require some modification.
Contact Richard Law for assistance with this code.
Snakemake
This project relies on the workflow software Snakemake. Snakemake determines the correct order of execution for the workflow, and can increase processing efficiency by running jobs in parallel where this is possible.
Snakemake also works with the containerisation software Singularity. Docker is a competing containerisation software that is (arguably) easier to set up. Snakemake works with Singularity, not Docker, but it can convert a Docker image into a Singularity image on the fly; so this workflow specifies Docker images rather than Singularity images.
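In practice this just means that a rule names a Docker image in its `container` directive, and Snakemake performs the conversion when run with `--use-singularity`. A hypothetical rule might look like this (the rule name, files and image are illustrative, not taken from this workflow):

```
rule example_gdal_step:
    input: "input.tif"
    output: "output.tif"
    container: "docker://osgeo/gdal:alpine-small-latest"
    shell: "gdal_translate {input} {output}"
```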
Useful Snakemake commands
Command line reference: https://snakemake.readthedocs.io/en/stable/executing/cli.html
In general, to run the whole workflow, run the command below from the top-level directory. `all` is a dummy target that sits at the bottom of the workflow, and will cause all (missing) intermediate and ultimate files to be generated or updated. Unless you just happen to have all the required software installed on your machine, you will want to run the workflow with Singularity images and Conda environments.
```
snakemake \
    --jobs $(nproc) \
    --snakefile ./workflow/Snakefile \
    --use-singularity \
    --singularity-args "-B /cifs/Tlinc/Projects\ K-O/MCSL/virtual-landscapes-50km:/virtual-landscapes -B $PWD/results:/results -B $PWD/logs:/logs" \
    --singularity-prefix /home/users/$USER/.singularity \
    --use-conda \
    --conda-frontend conda \
    --rerun-incomplete \
    --keep-going \
    all
```
This is a highly intertwined workflow in which multiple rules rely on the same input files. This can cause issues with concurrent file access, where jobs fail because they are unable to read an input that does exist. There is currently no file locking that would stop Snakemake from scheduling jobs that depend on the same input file/s at the same time, so you will likely need to run this command a few times until all outputs are generated. Alternatively, copy the input directory to a local directory and use that instead of the networked one.
Alternatively, use `--jobs 1` to avoid concurrent jobs (in exchange for a much longer processing time).
The above command mounts the shared directory of input landscapes (`/cifs/Tlinc/Projects\ K-O/MCSL/virtual-landscapes-50km`) to the directory `/virtual-landscapes` inside the container. It refers to the "T drive" CIFS mount known as "T Lincoln"; because the mount point is probably different on your machine, you will need to change this to reflect your network drive mapping.
Ideally the second volume binding (`-B $PWD/results:/results`) would be (the host-machine-specific equivalent of) `-B /cifs/Tlinc/Projects\ K-O/MCSL/virtual-landscapes-50km/results:/results`, but I have had issues getting Snakemake and/or Singularity to write to that shared directory, so I get it to write locally instead.
For example, a failed run looked like this:
```
gdal_calc.py -A "/virtual-landscapes/hills/landclass_t1_c10.tif" --outfile=/results/carbon_stocks/hills/t1.c10.LandscapeCarbonStock.tif --calc="(A==0)*5 + (A==1)*10 + (A==2)*18 + (A==3)*30 + (A==4)*140 + (A==5)*200" --type=UInt16 --overwrite --co TILED=YES
Activating singularity image /home/users/lawr/.singularity/d656e5e84caaafdc61745ab1ef68999a.simg
FATAL: could not open image /home/users/lawr/Network/Tlinc/Projects K-O/MCSL/workflow/K-O/MCSL/workflow: failed to retrieve path for /home/users/lawr/Network/Tlinc/Projects K-O/MCSL/workflow/K-O/MCSL/workflow: lstat /cifs/Tlinc/Projects K-O/MCSL/workflow/K-O: no such file or directory
[Wed Dec 2 12:43:28 2020]
Error in rule carbon_stock:
    jobid: 4
    output: results/carbon_stocks/hills/t1.c10.LandscapeCarbonStock.tif
    shell:
        gdal_calc.py -A "/virtual-landscapes/hills/landclass_t1_c10.tif" --outfile=/results/carbon_stocks/hills/t1.c10.LandscapeCarbonStock.tif --calc="(A==0)*5 + (A==1)*10 + (A==2)*18 + (A==3)*30 + (A==4)*140 + (A==5)*200" --type=UInt16 --overwrite --co TILED=YES
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /cifs/Tlinc/Projects K-O/MCSL/workflow/.snakemake/log/2020-12-02T124327.441032.snakemake.log
```
Most relevant part:
```
FATAL: could not open image /home/users/lawr/Network/Tlinc/Projects K-O/MCSL/workflow/K-O/MCSL/workflow: failed to retrieve path for /home/users/lawr/Network/Tlinc/Projects K-O/MCSL/workflow/K-O/MCSL/workflow: lstat /cifs/Tlinc/Projects K-O/MCSL/workflow/K-O: no such file or directory
```
Other handy commands
To visualise the workflow as a directed acyclic graph (DAG), i.e. an image that represents the flow of files and processes (this requires the GraphViz software):

```
snakemake --dag --snakefile ./workflow/Snakefile | dot -Tsvg > dag.svg
```
You can also run a linting tool over the Snakefile to check for potential problems while developing new components:

```
snakemake --lint --snakefile ./workflow/Snakefile
```
You can "clean" the output directory in order to re-run the analysis from a clean slate. This is particularly useful while developing new outputs and making changes.

```
snakemake --snakefile ./workflow/Snakefile -j1 clean
```
Configuration
There is a configuration file in `config/config.yaml`. This defines all of the possible values of `landclass`, `landscape` and `topography`, as well as the project `basepath` (probably your version of `/cifs/Tlinc/Projects K-O/MCSL`).
The configuration also has options for specifying the pixel height and width of all inputs (assumed to be constant) and the resolution. Together these are used in one part of the workflow related to expressing landscape patch sizes as proportions of the landscape. To adapt this for a non-virtual landscape, the workflow would need to determine these dynamically from inputs, and also know how to correctly deal with null/mask data (e.g. ocean) which aren't valid parts of a landscape. These are not issues that exist with the virtual landscapes.
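As a sketch of that arithmetic (the dimensions and resolution below are illustrative, not the values in `config.yaml`): the configured pixel dimensions and resolution fix the total landscape area, and a patch's pixel count converts to a proportion of that area.

```python
# Sketch: how pixel dimensions and resolution translate into landscape and
# patch areas. All numbers here are made up for illustration.
width_px, height_px = 2000, 2000   # pixel dimensions of every input raster
resolution = 25                    # metres per pixel side

total_area = width_px * height_px * resolution**2   # m^2 for the whole landscape
patch_pixels = 1600                                 # pixels in one hypothetical patch
patch_area = patch_pixels * resolution**2           # m^2 for that patch
proportion = patch_area / total_area                # patch size as landscape proportion
```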
Code Snippets
```
shell: 'gdal_calc.py -A {params.landscape} --outfile={output} --calc="(A==0)*5 + (A==1)*10 + (A==2)*18 + (A==3)*30 + (A==4)*140 + (A==5)*200" --type={params.datatype} --overwrite --co TILED=YES > {log} 2>&1'
```
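The `gdal_calc.py` expressions in these snippets are per-class lookups: each land class value maps to a constant. The carbon stock reclassification above is equivalent to this numpy sketch (the 2x3 array is a toy input, not a real landscape):

```python
import numpy as np

# Carbon stock value per land class, mirroring the gdal_calc.py expression.
carbon = {0: 5, 1: 10, 2: 18, 3: 30, 4: 140, 5: 200}

# Toy land class raster with one pixel of each class.
landclass = np.array([[0, 1, 2], [3, 4, 5]], dtype=np.uint8)

# Sum of (mask * value) terms, exactly like the --calc expression.
stock = sum((landclass == k) * v for k, v in carbon.items()).astype(np.uint16)
```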
```
shell: 'gdal_calc.py -A {input} --outfile={output} --calc="(A**2)*0.0012" --type={params.datatype} --overwrite --co TILED=YES > {log} 2>&1'
```
```
shell: "grass -c {params.dem} -e {output} > {log} 2>&1"
```
```
shell:
    '({params.grass} r.in.gdal -r input="{params.dem}" output={params.dem_grass} --overwrite && '
    '{params.grass} r.slope.aspect -e elevation={params.dem_grass} slope={params.slope} format="degrees" precision="FCELL" --overwrite && '
    '{params.grass} r.mapcalc expression="{params.z_factor} = 0.065 + ( 4.56 * ( {params.slope} / 100 ) ) + ( 65.41 * ({params.slope} / 100 ) ^ 2 )" --overwrite && '
    '{params.grass} r.out.gdal -c -m input={params.z_factor} output={output} type=Float64 format=GTiff createopt="TFW=YES,COMPRESS=DEFLATE" --overwrite '
    ') > {log} 2>&1'
```
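The `r.mapcalc` step above encodes a quadratic slope factor. A small Python sketch of the same formula (note that GRASS's `^` is exponentiation, and the slope is divided by 100 before use):

```python
import numpy as np

def z_factor(slope):
    """Quadratic slope factor as computed in the r.mapcalc expression above."""
    s = np.asarray(slope) / 100.0
    return 0.065 + 4.56 * s + 65.41 * s**2

# Flat ground gives the baseline value of 0.065; the factor grows
# quadratically as slope increases.
```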
```
shell:
    '({params.grass} r.in.gdal -r input={params.dem} output={params.dem_grass} --overwrite && '
    '{params.grass} r.flow elevation={params.dem_grass} flowaccumulation={params.flow_acc} --overwrite && '
    '{params.grass} r.mapcalc expression="{params.slope_length} = sqrt({params.flow_acc} * 625 / {params.pi})" --overwrite && '
    '{params.grass} r.mapcalc expression="{params.slope_length_bounded} = if({params.slope_length}>=350,350,{params.slope_length})" --overwrite && '
    '{params.grass} r.mapcalc expression="{params.slope_length_factor} = sqrt({params.slope_length} / 22)" --overwrite && '
    '{params.grass} r.out.gdal -c -m input={params.slope_length_factor} output={output} type=Float64 format=GTiff createopt="TFW=YES,COMPRESS=DEFLATE" --overwrite '
    ') > {log} 2>&1'
```
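The chain above converts flow accumulation to a slope length (625 presumably being the area of a 25 m cell), caps it at 350 m, and derives a length factor. Note that the final `r.mapcalc` expression reads `{params.slope_length}` rather than `{params.slope_length_bounded}`, so the cap may not actually take effect; this sketch applies the cap as apparently intended:

```python
import math

def slope_length_factor(flow_accumulation):
    """Sketch of the r.mapcalc chain above: flow accumulation (in cells) is
    converted to a slope length in metres (625 is assumed to be the cell
    area), capped at 350 m, then scaled to a length factor."""
    length = math.sqrt(flow_accumulation * 625 / math.pi)
    length = min(length, 350.0)  # the bound the workflow computes
    return math.sqrt(length / 22.0)
```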
```
shell: 'gdal_calc.py -A {params.input} --outfile={output} --calc="((A==0)*500 + (A==1)*100 + (A==2)*10 + (A==3)*5 + (A==4)*7 + (A==5)*5 + (A==6)*0)/1000.0" --type={params.datatype} --NoDataValue=0 --overwrite --co TILED=YES > {log} 2>&1'
```
```
shell: 'gdal_calc.py -A {input.p} -B {input.z} -C {input.slp} -D {input.u} --outfile={output} --calc="A * 0.2 * C * B * D" --overwrite --co TILED=YES > {log} 2>&1'
```
```
shell: 'gdal_calc.py -A {params.landscape} --outfile={output} --calc="(A==0)*1400 + (A==1)*8800 + (A==2)*1000 + (A==3)*2 + (A==4)*5 + (A==5)*0" --type={params.datatype} --overwrite --co TILED=YES > {log} 2>&1'
```
```
shell: 'gdal_calc.py -A {params.dem} --outfile={output} --calc="A+{params.dem_offset}" --type={params.datatype} --overwrite --co TILED=YES > {log} 2>&1'
```
```
script: "../scripts/landscape-metrics.R"
```
```
script: "../scripts/invest-ndr.py"
```
```
shell: 'invest run ndr --headless --datastack {input} -w {params.workspace} > {log} 2>&1'
```
```
shell: 'gdal_calc.py -A {params.landscape} --outfile={output} --calc="numpy.isin(A,[0,3])*11 + numpy.isin(A,[1,2,4,5])*0 + (A==6)*999" --NoDataValue=999 --type={params.datatype} --overwrite --co TILED=YES > {log} 2>&1'
```
```
shell: 'gdal_calc.py -A {params.landscape} --outfile={output} --calc="numpy.isin(A,[0,1,6])*0 + numpy.isin(A,[2,4,5])*5 + numpy.isin(A,[3])*10" --NoDataValue=0 --type={params.datatype} --overwrite --co TILED=YES > {log} 2>&1'
```
```
shell: "grass -c {params.landscape} -e {output} > {log} 2>&1"
```
```
shell: 'gdal_calc.py -A {input.q} -B {input.r} --outfile={output} --calc="A+B" --type={params.datatype} --overwrite --co TILED=YES > {log} 2>&1'
```
```
shell: "grass -c {params.dem} -e {output} > {log} 2>&1"
```
```
shell: "grass -c {params.landclass} -e {output} > {log} 2>&1"
```
```
shell:
    '({params.grass} r.in.gdal -r input={params.dem} output={params.dem_grass} --overwrite && '
    '{params.grass} r.neighbors -c input={params.dem_grass} output={params.focal} method={params.method} size={params.size} --overwrite && '
    '{params.grass} r.mapcalc expression="{params.tpi} = {params.dem_grass} - {params.focal}" --overwrite && '
    '{params.grass} r.mapcalc expression="{params.tpi_norm} = if({params.tpi}<{params.threshold},0,1)" --overwrite && '
    '{params.grass} r.out.gdal -c -m input={params.tpi_norm} output={output} type={params.datatype} format=GTiff createopt="COMPRESS=DEFLATE" --overwrite '
    ') > {log} 2>&1'
```
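The topographic position index (TPI) chain above subtracts a neighbourhood statistic from the DEM and binarises the result against a threshold. A rough numpy/scipy equivalent, assuming the neighbourhood method is an average (note GRASS's `-c` flag uses a circular window, whereas `uniform_filter` here is square):

```python
import numpy as np
from scipy import ndimage

def tpi_mask(dem, size=3, threshold=0.0):
    """Sketch of the GRASS chain above: DEM minus a focal (neighbourhood)
    average, then binarised; cells at or above the threshold become 1."""
    focal = ndimage.uniform_filter(dem.astype(float), size=size, mode="nearest")
    tpi = dem - focal
    return np.where(tpi < threshold, 0, 1)
```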
```
shell: 'gdal_calc.py -A {params.landscape} --outfile={output} --calc="(A==0)*0 + (A==1)*0.2 + (A==2)*0.7 + (A==3)*0.6 + (A==4)*0.5 + (A==5)*1.0 + (A==6)*999" --NoDataValue=999 --type={params.datatype} --overwrite --co TILED=YES > {log} 2>&1'
```
```
shell:
    '({params.grass} r.in.gdal -r input={params.landclass} output={params.landclass_grass} --overwrite && '
    '{params.grass} r.clump -d input={params.landclass_grass} output={params.landclass_grass}_clump --overwrite && '
    '{params.grass} r.to.vect input={params.landclass_grass}_clump output={params.landclass_grass}_patch type=area --overwrite && '
    '{params.grass} v.db.addcolumn map={params.landclass_grass}_patch columns="Area DOUBLE PRECISION" && '
    '{params.grass} v.to.db map={params.landclass_grass}_patch option=area units=meters columns=Area --overwrite && '
    '{params.grass} v.to.rast input={params.landclass_grass}_patch output={params.landclass_grass}_patch_r type=area use=attr attribute_column=Area --overwrite && '
    "{params.grass} r.mapcalc expression='{params.landclass_grass}_patch_r_proportion=({params.landclass_grass}_patch_r/{params.total_area})' --overwrite && "
    '{params.grass} r.out.gdal -c -m input={params.landclass_grass}_patch_r_proportion output={output} type=Float64 format=GTiff createopt="COMPRESS=DEFLATE" --overwrite '
    ') > {log} 2>&1'
```
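The GRASS clump/area chain above (label contiguous patches, measure each patch's area, express it as a proportion of the landscape) can be approximated for a single land class with `scipy.ndimage`. This is an illustrative sketch, not the workflow's code:

```python
import numpy as np
from scipy import ndimage

def patch_proportion(landclass, class_id, resolution, total_area):
    """Label contiguous patches of one class (queen-case adjacency, like
    r.clump -d) and return a raster where each patch cell holds that
    patch's area as a proportion of the whole landscape."""
    structure = ndimage.generate_binary_structure(2, 2)  # queen case
    labels, _ = ndimage.label(landclass == class_id, structure=structure)
    counts = np.bincount(labels.ravel())
    counts[0] = 0                       # background is not a patch
    areas = counts * resolution**2      # pixel counts -> areas
    return areas[labels] / total_area   # broadcast areas back onto the grid
```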
```
shell: 'gdal_calc.py -A {input.lc} -B {input.lv} --outfile={output} --calc="A*B" --type={params.datatype} --overwrite --co TILED=YES > {log} 2>&1'
```
```
shell:
    '(gdal_proximity.py {params.rivers} {output} -ot {params.datatype} -distunits {params.distunits} -maxdist {params.maxdist} -fixed-buf-val 1 -nodata 0 -of GTiff && '
    'gdal_calc.py -A {output} -B {params.rivers} --outfile={output} --calc="A+B" --type={params.datatype} --overwrite --co TILED=YES'
    ') > {log} 2>&1'
```
```
shell: 'gdal_calc.py -A {input.r} -B {input.v} -C {input.vp} --outfile={output} --calc="A*0.2 + B*0.2 + C" --type={params.datatype} --overwrite --co TILED=YES > {log} 2>&1'
```
```
import json
import os
import sys

datastack_dict = {
    "invest_version": "3.8.9",
    "args": {
        "biophysical_table_path": snakemake.params["biophysical_table"],
        "calc_n": True,
        "calc_p": False,
        "dem_path": snakemake.params["dem"],
        "k_param": "2",
        "lulc_path": snakemake.params["lulc"],
        "results_suffix": "",
        "runoff_proxy_path": snakemake.params["runoff_proxy"],
        "subsurface_critical_length_n": "200",
        "subsurface_critical_length_p": "",
        "subsurface_eff_n": "80",
        "subsurface_eff_p": "",
        "threshold_flow_accumulation": "4500",
        "watersheds_path": snakemake.params["watersheds"],
        "workspace_dir": snakemake.params["workspace"]
    },
    "model_name": "natcap.invest.ndr.ndr"
}

with open(str(snakemake.output), 'w') as datastack:
    print(json.dumps(datastack_dict, indent=4, sort_keys=True), file=datastack)

sys.exit(0)
```
```
library(landscapemetrics)
library(raster)
library(dplyr)

landscape <- raster(snakemake@params[['landscape']])
check_landscape(landscape)

landscape_metrics = dplyr::bind_rows(
    lsm_l_enn_mn(landscape),
    lsm_l_shdi(landscape),
    lsm_l_contag(landscape)
)

write.csv(landscape_metrics, file=snakemake@output[[1]], fileEncoding='UTF-8', na='', row.names=FALSE)
```
```
from collections import OrderedDict
from itertools import product
import re
import sys

import numpy as np
import pandas as pd
from scipy import ndimage
import rasterio

iter_seq = snakemake.params['iterator_sequence']
params_product = product(*map(
    lambda param: snakemake.params[param], iter_seq
))
df = pd.DataFrame(columns=iter_seq)

landcover_categories = [
    'Crops/horticulture', 'Intensive grass', 'Extensive grass',
    'Shrubland', 'Exotic forest', 'Native forest', 'Bare ground / Others'
]
landcover_categories_ids = range(0, len(landcover_categories))
landscape_metrics_labels = ['enn_mn', 'shdi', 'contag']

# TODO parallelise?
for i, row_params in enumerate(params_product):
    landscape, topography, landclass, proportion = row_params
    data = OrderedDict({
        'landscape': landscape,
        'topography': topography,
        'landclass': landclass,
        'proportion': proportion
    })

    # Landscapes (input summary)
    re_filter = re.compile(f'{landscape}/{landscape}-landclass_{topography}_{landclass}_{proportion}')
    landscape_data = list(filter(re_filter.search, snakemake.params['landscapes']))[0]
    with rasterio.open(landscape_data) as ds:
        band = ds.read(1, masked=True)
        total_cells = ds.height * ds.width
        num_patches = 0
        patch_sizes = []
        for class_name, id in zip(landcover_categories, landcover_categories_ids):
            per_patch_sizes = []
            # Determine percentage of total landscape occupied by class
            data[f'% {class_name}'] = np.count_nonzero(band == id) / total_cells * 100
            # Identify "patches": sub-regional areas of the same class value
            # Queen-case structure; use (2,1) for rook-case
            structure = ndimage.generate_binary_structure(2, 2)
            # Label queen-case regions of landclass
            labels, num_labels = ndimage.label(np.where(band == id, 1, 0), structure=structure)
            data[f'{class_name} patches'] = num_labels
            num_patches += num_labels
            for loc in ndimage.find_objects(labels):  # Slices
                # For each patch (accessed with slice), determine its size
                patch_size = np.count_nonzero(labels[loc]) * snakemake.params['resolution']**2
                patch_sizes.append(patch_size)
                per_patch_sizes.append(patch_size)
            data[f'{class_name} mean patch size'] = np.mean(per_patch_sizes)
        data['Total number of patches'] = num_patches
        data['Mean patch size'] = np.mean(patch_sizes)

    # Nitrate leaching
    re_filter = re.compile(f'/{landscape}/{topography}\.{landclass}\.{proportion}/intermediate_outputs/modified_load_n.tif')
    nitrate_modified_load_data = list(filter(re_filter.search, snakemake.params['ndr_intermediate_modified_load_n']))[0]
    with rasterio.open(nitrate_modified_load_data) as ds:
        band = ds.read(1, masked=True)
        nLoad = band.sum()
    re_filter = re.compile(f'/{landscape}/{topography}\.{landclass}\.{proportion}/n_export.tif')
    nitrate_modified_load_data = list(filter(re_filter.search, snakemake.params['ndr']))[0]
    with rasterio.open(nitrate_modified_load_data) as ds:
        band = ds.read(1, masked=True)
        nExport = band.sum()
    data['Total landscape N load'] = nLoad
    data['Total landscape N export'] = nExport
    data['% landscape N retained'] = ((nLoad - nExport) / nLoad) * 100
    data['% landscape N exported'] = (nExport / nLoad) * 100

    # Soil erosion
    re_filter = re.compile(f'{landscape}/{topography}.{landclass}.{proportion}')
    erosion_data = list(filter(re_filter.search, snakemake.params['erosion']))[0]
    with rasterio.open(erosion_data) as ds:
        band = ds.read(1, masked=True)
        data['Max per pixel soil erosion'] = band.max()
        data['Mean per pixel soil erosion'] = band.mean()
        data['Landscape sum of soil erosion'] = band.sum() / (10000 / snakemake.params['resolution']**2)

    # Recreation
    recreation_data = list(filter(re_filter.search, snakemake.params['recreation']))[0]
    with rasterio.open(recreation_data) as ds:
        band = ds.read(1, masked=True)
        data['Max per pixel recreation score'] = band.max()
        data['Mean per pixel recreation score'] = band.mean()

    # Pollination
    pollination_data = list(filter(re_filter.search, snakemake.params['pollinator_overlap']))[0]
    with rasterio.open(pollination_data) as ds:
        band = ds.read(1, masked=True)
        total_cells = ds.height * ds.width
        data['no pollinators present'] = np.count_nonzero(band == 11) / total_cells * 100
        data['medium quality pollinator habitat present'] = np.count_nonzero(band == 16) / total_cells * 100
        data['high quality pollinator habitat present'] = np.count_nonzero(band == 21) / total_cells * 100

    # Greenhouse gases
    ghg_data = list(filter(re_filter.search, snakemake.params['greenhouse_gas_emissions']))[0]
    with rasterio.open(ghg_data) as ds:
        band = ds.read(1, masked=True)
        data['Mean greenhouse gas emissions'] = band.mean()
        data['Landscape sum of greenhouse gas emissions'] = band.sum() / (10000 / snakemake.params['resolution']**2)

    # Carbon stocks
    carbon_stock_data = list(filter(re_filter.search, snakemake.params['carbon_stocks']))[0]
    with rasterio.open(carbon_stock_data) as ds:
        band = ds.read(1, masked=True)
        data['Mean carbon stocks'] = band.mean()
        data['Landscape sum of carbon stocks'] = band.sum() / (10000 / snakemake.params['resolution']**2)

    # Landscape metrics
    landscape_metrics_data = list(filter(re_filter.search, snakemake.params['landscape_metrics']))[0]
    ls_df = pd.read_csv(landscape_metrics_data)
    for metric in landscape_metrics_labels:
        data[metric] = ls_df[ls_df['metric'] == metric]['value'].iat[0]

    # Append row
    df = df.append(pd.Series(data=data, name=i), ignore_index=False)

df = df.reindex(columns=data.keys())
df.rename(columns={
    'landscape': 'Landscape Type',
    'topography': 'Topography ID',
    'landclass': 'Fragmentation',
    'proportion': 'Proportion set'
}, inplace=True)
df.to_excel(snakemake.output[0], sheet_name="landscape_summary")
sys.exit(0)
```
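The nitrogen retained/exported percentages computed in the summary script reduce to a simple identity that is useful as a sanity check: the two percentages always sum to 100.

```python
def n_retention(n_load, n_export):
    """Percentage of landscape N retained vs exported, as in the summary
    script: retained = (load - export) / load, exported = export / load."""
    retained = (n_load - n_export) / n_load * 100
    exported = n_export / n_load * 100
    return retained, exported
```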
```
shell:
    '''
    rm -rf results/
    '''
```
```
script: "scripts/summary.py"
```