Australia's coastline spans over 44 degrees of latitude, encompassing habitats from tropical coral reefs and mangroves to temperate seagrass meadows and kelp forests. To assess habitat extent and condition at national scales, and to predict impacts from coastal development, resource extraction, and climate-driven range shifts, there are increasing efforts to combine existing survey data from disparate projects into national datasets for broad-scale habitat modelling.
However, the underlying projects typically differ in objectives, spatial extents, and durations, and may employ habitat-stratified, opportunistic, or ad hoc sampling designs. When combined, these datasets often give a highly unbalanced and biased representation of habitat distribution, and that bias propagates into model outputs as prediction error.
To identify and correct for these biases, we present a five-stage analytical workflow: (1) diagnosing spatial bias and autocorrelation in combined datasets using variogram analysis and Getis-Ord Gi* statistics; (2) identifying ecologically relevant multiscale environmental covariates from remote sensing products; (3) generating pseudo-habitat strata from these covariates using k-means clustering; (4) applying Generalized Random Tessellation Stratified (GRTS) sampling within pseudo-habitat strata to produce spatially and environmentally balanced modelling inputs; and (5) evaluating model outputs for spatial accuracy using spatial kernelling and geographically weighted regression.
We demonstrate this workflow using benthic habitat data derived from BRUVS imagery collated through the GlobalArchive national database. Through case studies, we show how the workflow identifies bias in combined datasets, systematically reduces it to produce more robust and generalisable habitat models, and quantifies spatial uncertainty, enabling managers and practitioners to understand where model predictions are reliable, where data gaps exist, and where targeted additional sampling would most improve predictions.
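
As a rough illustration of stages (3) and (4), the sketch below clusters sites into pseudo-habitat strata with a small k-means implementation and then draws an equal number of sites from each stratum. It is a minimal sketch under stated assumptions, not the workflow's implementation: the two covariates (depth and sea surface temperature), the synthetic data, and the function names `kmeans_strata` and `balanced_subsample` are all hypothetical, and simple random sampling with equal allocation stands in for GRTS, so it balances the sample across strata but does not provide the spatial balance that GRTS is designed to give.

```python
import numpy as np


def kmeans_strata(covariates, k, n_iter=50, seed=0):
    """Assign each site to one of k pseudo-habitat strata (Lloyd's k-means)."""
    rng = np.random.default_rng(seed)
    # Standardise so that no single covariate dominates the distances.
    std = covariates.std(axis=0)
    z = (covariates - covariates.mean(axis=0)) / np.where(std > 0, std, 1.0)
    # Farthest-first initialisation: start from one random site, then
    # repeatedly add the site farthest from all centres chosen so far.
    centres = [z[rng.integers(len(z))]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(z - c, axis=1) for c in centres], axis=0)
        centres.append(z[d.argmax()])
    centres = np.array(centres)
    for _ in range(n_iter):
        dist = np.linalg.norm(z[:, None, :] - centres[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each centre; keep the old one if a cluster is empty.
        new_centres = np.array([
            z[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels


def balanced_subsample(labels, n_per_stratum, seed=0):
    """Draw up to n_per_stratum sites at random from every stratum.

    Stand-in for GRTS: balances across strata, not across space.
    """
    rng = np.random.default_rng(seed)
    keep = []
    for stratum in np.unique(labels):
        idx = np.flatnonzero(labels == stratum)
        take = min(n_per_stratum, len(idx))
        keep.append(rng.choice(idx, size=take, replace=False))
    return np.sort(np.concatenate(keep))


# Synthetic, deliberately unbalanced "combined dataset" (hypothetical values):
# 900 sites from a heavily surveyed shallow, warm region and 100 sites from
# a sparsely surveyed deep, cool region. Columns: depth (m), SST (deg C).
rng = np.random.default_rng(42)
shallow = rng.normal([10.0, 26.0], 1.0, size=(900, 2))
deep = rng.normal([60.0, 18.0], 1.0, size=(100, 2))
covariates = np.vstack([shallow, deep])

labels = kmeans_strata(covariates, k=2)
sample = balanced_subsample(labels, n_per_stratum=50)
counts = np.bincount(labels[sample], minlength=2)
print("sites per stratum before:", np.bincount(labels, minlength=2))
print("sites per stratum after: ", counts)
```

In this toy case the 9:1 imbalance between the two environmental groups is reduced to an equal allocation in the modelling subset. In practice the number of strata, the allocation per stratum, and the covariates themselves would need to be chosen for the study region.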