Dazed and confused: how map projections affect disease map analysis and perception. An echo from GeoVet2019
DOI: https://doi.org/10.12834/VetIt.3492.27657.2
Keywords: Bias, Communication, Disease mapping, Geographic epidemiology, Map projection, Spatial statistics
Abstract
Disease maps are integral to spatial epidemiology and public health. The map appearance and analysis of corresponding data may both depend on a map projection used to transform the 3-dimensional world onto a 2-dimensional surface. Map projections necessarily introduce bias - an issue that has not received full attention in the literature. This study aims to demonstrate the impact map projections can have on spatial analysis and disease maps for public health.
Case studies applied varying map projections, including the Lambert, Mercator and Robinson projections, to Israel, North Carolina and Southern Ontario as study areas. The effect of projections on various measures, estimates, tests and models was assessed.
When the map projection was changed: (i) a distance in Israel increased by 30%; (ii) for Southern Ontario the areal size increased by almost 95% and Moran’s I test switched from significant to non-significant; and (iii) a single disease cluster in North Carolina split into three distinct clusters.
Visual bias in disease mapping is unavoidable and should be recognized. Disease maps and spatial analytical inferences, including disease clusters, should be reported with their map projection. Using geographic coordinates can prevent analytical bias.
Introduction
Disease maps are an integral part of spatial epidemiology and have long been used in epidemiological and public health investigations. John Snow’s cholera outbreak map around London’s Broad Street pump in 1854 is arguably the most recognized example of a disease map (Waller and Gotway, 2004, p. 2). The purpose of disease maps is twofold. First, maps are used to visualize spatial patterns in observed spatial public health data (e.g. locations of cases and controls or regional variation of incidence rates). Second, disease maps are used to report the results from spatial epidemiological data analyses (e.g. variation in residual risk or disease cluster locations). In this sense disease maps are communication tools, used to communicate between researchers and governmental agencies as well as with other stakeholders, including the general public.
The Achilles’ heel of epidemiological inference is bias. Disease maps can be biased in two ways that are very specific to the topic of disease mapping. These two types of bias may be called (direct) visual and (indirect) analytical bias. The cause of both biases is the map projection.
Maps depend on a map projection to convert the 3-dimensional world onto a 2-dimensional plane, e.g. a sheet of paper or computer screen. From Carl Friedrich Gauss’ Theorema Egregium of 1827 it is known that map projections distort reality in the process (Banerjee et al., 2003, p. 15), i.e. projections will bias the visual perception of disease maps and the analysis of the respective spatial data. More specifically, map projections affect the size, shape and orientation of maps, as well as the relative position of locations on a map to each other.
Spatial public health data are health status data plus their locations. Health data are generally presented in either of two forms: as point locations of health events, or as event counts and proportions for administrative regions. The latter type might also be called areal or regional data; it is the more commonly used spatial data type in public health and is therefore the focus of this study.
These locations are generally recorded as geographic coordinates, i.e. degrees longitude and latitude. For spatial data analysis, map projections are applied to transform geographic into Cartesian coordinates measured in kilometers or miles. Varying the projection will change the resulting Cartesian coordinates. Thus the chosen map projection determines the coordinates of locations and thereby the very spatial data to be mapped and analyzed.
The epidemiological analysis of spatial public health data can be described as an interplay of the 4 study goals of (i) disease mapping, (ii) disease clustering, (iii) disease cluster detection, and (iv) geographic correlation analysis to understand various aspects of a spatial disease pattern (Lawson, 2006; Berke and Waller, 2010). From a public health perspective, disease maps provide a visual communication tool, which can also be used to locate disease clusters or hot spots highlighting areas that need attention or application of control measures. Furthermore, disease clustering provides insight into the need for spatial analytical methods as well as the distance up to which diseases communicate for certain time periods covered by the data. Lastly, geographic correlation analysis refers to spatial regression models to assess and test putative and known risk factors in a spatial context.
Map projections have been discussed in the spatial epidemiological literature (Waller and Gotway, 2004, p. 43; Banerjee et al., 2003, p. 13), but the biasing effects of map projections on disease map perception and spatial analysis seem not to have been examined. Other forms of error in spatial analysis and sources of bias have received more attention. Ocaña-Riola (2010) reports on common errors in relation to disease mapping, but does not make any reference to map projections. Zimmerman (2008) reports that up to 30% of event locations might not be geocoded, leading to selection bias. Further references discuss diagnostic misclassification bias (Berke and Waller, 2010) or bias due to the modifiable areal unit problem (Waller and Gotway, 2004; Arsenault et al., 2013).
Over centuries a plethora of map projections have been developed and discussed or criticized. Despite or due to the variety of map projections, no clear recommendations regarding their use in spatial epidemiology and public health have been proposed in the form of a reporting guideline or journal guideline for authors.
Reporting guidelines and recommendations for good practice have become important aspects of health science publication. Selvaratnam et al. (2022) present a scoping review of disease map characteristics and make recommendations for a reporting guideline directed at disease mapping studies. The EQUATOR network (EQUATOR network, 2024) is a coordinated attempt to provide access to reporting guidelines. A search on the EQUATOR network did not identify any reference to geographic maps (incl. disease maps) as an item considered in any reporting guideline.
Similarly, a search among Author Guidelines of eight peer-reviewed scientific journals dedicated to the topic of disease mapping and spatial epidemiology found no reference to any advice that authors should report a map projection or the map datum, i.e. the Coordinate Reference System as used for data analysis and mapping. The journals searched were (in alphabetic order): Applied Geography, Geographical Analysis, GeoJournal, Geospatial Health, Health & Place, International Journal of Health Geographics, Spatial and Spatio-temporal Epidemiology, Spatial Statistics.
While Reporting Guidelines and Author Guidelines focus on good practice for better study conduct (i.e. data analysis and communication of study results) the reproducibility of results is another aspect. Reproducible research in epidemiology and public health has enormously increased in importance over the last two decades or so (Peng et al., 2006). Although reproducible research can still be wrong (Leek and Peng, 2015) the value of irreproducible research is questionable.
The general goal of this study was to demonstrate the potential effect of map projections on disease map appearance and the respective spatial epidemiological data analysis, incl. the respective public health communication. The specific objectives were (i) to assess how distance measures on a disease map change under varying projections; (ii) to show how the size and shape of a study area can change, as well as how a map projection can affect the result of disease cluster detection using the spatial scan test; and (iii) to compare analytical results with respect to disease clustering (Moran’s I and the range of a semivariogram) and geographic correlations in spatial regression models. Three different maps or study areas were considered in pursuit of the objectives: a map of Israel, the 100 counties of North Carolina in the USA, and a map of the 29 public health units in Ontario, Canada.
Materials and methods
The Israel boundary file was extracted from the GADM website (GADM, 2018). The coordinates in longitude and latitude of the cities Rehovot and Haifa in Israel, as well as Toronto and Ottawa in Southern Ontario were retrieved from Wikipedia (Wikipedia, 2024).
The boundary file for Southern Ontario was extracted from the Statistics Canada website (StatsCan, 2024). The human West Nile virus (WNv) incidence data for Southern Ontario in 2012 was retrieved from the Ontario Agency for Health Protection and Promotion (Public Health Ontario, 2024).
The boundary file for North Carolina and related SIDS mortality and birth data at county level were accessed from the spData package (Bivand et al., 2019).
To demonstrate the effect of map projections on disease map appearance (shape and orientation) as well as effects on analytic results, the following map projections were employed: (i) the Lambert equal area conic projection (which preserves the areal size of sub-regions), (ii) the Robinson projection (which is often used for World maps), (iii) the early Mercator projection, and (iv) the Universal Transverse Mercator (UTM) projection (which is used for measuring distances). For the Israel map the UTM 36N projection was used, whereas for Southern Ontario and North Carolina the UTM 17N projection was applied. All projections were used with the WGS84 datum, and all maps were used with coordinates in kilometers.
To demonstrate map projection effects on analytical measures, the following methods were applied. The Euclidean distances between two cities each in Israel and Ontario were compared to the great arc length (Waller and Gotway, 2004). The area in square kilometers for each projected map was calculated. The area is important as a basis for any population density measure and can be compared to the area resulting from the Lambert equal area projection as a benchmark, because equal area projections are supposed to preserve area size.
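The distance comparison described above can be sketched in a few lines. This is a minimal illustration and not the R workflow used in the study: it assumes a spherical Earth of radius 6371 km, uses the standard spherical Mercator equations in place of a projection library, and takes rounded city coordinates that are approximations chosen for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth with mean radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great arc length between two points (degrees) via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def mercator_xy_km(lat, lon):
    """Spherical Mercator coordinates in km (true scale at the equator)."""
    phi = math.radians(lat)
    x = EARTH_RADIUS_KM * math.radians(lon)
    y = EARTH_RADIUS_KM * math.log(math.tan(math.pi / 4 + phi / 2))
    return x, y


def mercator_distance_km(lat1, lon1, lat2, lon2):
    """Euclidean distance between two points after Mercator projection."""
    x1, y1 = mercator_xy_km(lat1, lon1)
    x2, y2 = mercator_xy_km(lat2, lon2)
    return math.hypot(x2 - x1, y2 - y1)


# Approximate (lat, lon) in degrees; rounded values assumed for illustration.
rehovot = (31.89, 34.81)
haifa = (32.79, 34.99)

gc = great_circle_km(*rehovot, *haifa)
merc = mercator_distance_km(*rehovot, *haifa)
print(f"great circle: {gc:.0f} km, Mercator plane: {merc:.0f} km, ratio: {merc / gc:.2f}")
```

On the sphere, the Mercator projection stretches local distances by a factor of about 1/cos(latitude), which is why the planar distance exceeds the great arc length at the latitude of Israel.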
The Moran’s I statistic (Waller and Gotway, 2004) was considered as an indicator for the presence of disease clustering. It might be used to decide on the presence of spatial dependence and the need for spatial analytical methods to avoid overdispersion. Overdispersion may result in anti-conservative decision-making, i.e. finding too many significant results. Moran’s I is related to the Pearson correlation coefficient and compares regional observations to observations from neighbouring regions. Moran’s I depends on a choice that defines a neighbourhood for each location. Here a distance-based neighbourhood of 120 km in radius was applied to the empirical Bayesian smoothed incidence of human WNv disease in Southern Ontario in 2012 (Assunção and Reis, 1999; Berke and Waller, 2010; Thompson and Berke, 2017), with the exception of the Mercator projection, which requires a 160 km radius to connect all region centroids and define the distance-based neighbourhood.
Disease cluster detection is a common public health task performed using the circular spatial scan test (Kulldorff and Nagarwalla, 1996). The circles depend here on the Euclidean distance in a Cartesian coordinate system defined by the map projection.
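To make the dependence of Moran’s I on the neighbourhood definition concrete, the following sketch computes the statistic with binary distance-based weights. It is a simplified stand-in for the spdep analysis, not a reproduction of it: the centroids and values are hypothetical, the function name `morans_i` is chosen for this example, and no significance test is included.

```python
import math


def morans_i(coords, values, radius):
    """Moran's I with binary weights: w_ij = 1 if 0 < d_ij <= radius, else 0."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = 0.0      # sum of all weights
    cross = 0.0   # sum of w_ij * dev_i * dev_j
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= radius:
                s0 += 1.0
                cross += dev[i] * dev[j]
    if s0 == 0:
        raise ValueError("no pair of centroids lies within the radius")
    return (n / s0) * cross / sum(d * d for d in dev)


# Hypothetical region centroids (km) and smoothed incidence values.
coords = [(0.0, 0.0), (100.0, 0.0), (200.0, 0.0), (300.0, 0.0)]
values = [1.0, 2.0, 3.0, 4.0]

i_original = morans_i(coords, values, radius=120.0)

# A projection that stretches distances by 40% (hypothetical factor)
# pushes every neighbour beyond the 120 km radius.
stretched = [(1.4 * x, 1.4 * y) for x, y in coords]
try:
    morans_i(stretched, values, radius=120.0)
    neighbours_lost = False
except ValueError:
    neighbours_lost = True

i_wider = morans_i(stretched, values, radius=160.0)
print(i_original, neighbours_lost, i_wider)
```

In this toy setting a fixed radius loses all neighbour links once the coordinates are stretched, and a wider radius is needed to restore them, which mirrors the need for a larger radius under the Mercator projection. The sketch only fails when no links exist at all, whereas the study required all region centroids to be connected.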
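How the Euclidean distance enters the circular scan can be illustrated with a stripped-down search for the most likely cluster. This is a sketch under stated assumptions, not the smerc implementation: it uses the Poisson log likelihood ratio commonly associated with the spatial scan statistic, omits the Monte Carlo significance test, assumes the expected counts sum to the total number of cases, and the function names and toy data are hypothetical.

```python
import math


def poisson_llr(c, e, total):
    """Poisson log likelihood ratio for a zone with c observed and e expected cases."""
    if c <= e:
        return 0.0  # only zones with excess risk are of interest
    llr = c * math.log(c / e)
    if total > c:
        llr += (total - c) * math.log((total - c) / (total - e))
    return llr


def most_likely_cluster(coords, cases, expected, max_fraction=0.5):
    """Grow circles around each centroid and keep the zone with the largest LLR."""
    total = sum(cases)
    best_llr, best_zone = 0.0, []
    for centre in coords:
        # Regions ordered by Euclidean distance in the projected plane.
        order = sorted(range(len(coords)), key=lambda j: math.dist(centre, coords[j]))
        c = e = 0.0
        zone = []
        for j in order:
            c += cases[j]
            e += expected[j]
            if e > max_fraction * total:
                break  # zone would cover too much of the expected cases
            zone.append(j)
            llr = poisson_llr(c, e, total)
            if llr > best_llr:
                best_llr, best_zone = llr, sorted(zone)
    return best_llr, best_zone


# Toy data: five regions on a line with equal expected counts.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
cases = [10, 10, 1, 1, 1]
expected = [4.6] * 5

llr, zone = most_likely_cluster(coords, cases, expected)
print(zone, round(llr, 2))
```

Because candidate zones are built from the distance ordering of centroids, any projection that changes that ordering changes the set of circles examined, and with it the clusters that can be reported.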
The effect of the map projections on a generalized linear geostatistical model (GLGM, Diggle and Ribeiro, 2007) with spherical semi-variogram was studied. Again the WNv data from Ontario was used. Specifically a spatial Poisson regression model was fit with a first order spatial trend surface and spherical semi-variogram dependence structure. Due to lack of covariates the focus was on the range parameter of the semivariogram.
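For readers unfamiliar with the spherical semi-variogram, its usual textbook form can be written as a short function. The parameter names (`nugget`, `partial_sill`, `range_km`) and the numeric values are assumptions made for this sketch and are not estimates from the study.

```python
def spherical_semivariogram(h, nugget, partial_sill, range_km):
    """Spherical semi-variogram: rises from the nugget and levels off at the range."""
    if h <= 0:
        return 0.0
    if h >= range_km:
        return nugget + partial_sill
    r = h / range_km
    return nugget + partial_sill * (1.5 * r - 0.5 * r ** 3)


# Hypothetical parameters: no nugget, unit sill, 100 km range.
gamma_half = spherical_semivariogram(50.0, nugget=0.0, partial_sill=1.0, range_km=100.0)

# If a projection stretches all distances by a factor k, the same
# semivariance is reached at k times the distance, so the fitted range
# scales by k as well.
k = 1.4
gamma_half_stretched = spherical_semivariogram(
    k * 50.0, nugget=0.0, partial_sill=1.0, range_km=k * 100.0
)
print(gamma_half, gamma_half_stretched)
```

The range is the distance beyond which observations are modelled as uncorrelated, so it is expressed in the units of the projected coordinates and inherits any distortion of distance.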
All data analysis was performed in R (R Core Team, 2019) with specific use of the packages maptools (Bivand and Lewin-Koh, 2019) for applying map projections; spdep (Bivand et al., 2013) for the Moran’s I analysis; MASS (Venables and Ripley, 2002) and nlme (Pinheiro et al., 2019) for fitting the GLGM with spherical dependence structure; and smerc (French, 2018) for the circular spatial scan test.
Results and Discussion
Distance, Shape and Size
The great circle distance between the cities Rehovot and Haifa is 103 km. This value is a benchmark as it is measured on the globe, rather than on a plane. The Euclidean distances corresponding to the map projections are, in increasing order: Robinson 98 km, UTM 103 km, Mercator 122 km and Lambert 127 km. The distance under the Lambert projection thus exceeds the great arc length by about 23% (127 vs. 103 km) and the Robinson distance by about 30% (127 vs. 98 km). Figure 1 shows the maps of Israel for the 4 map projections and the locations of Haifa and Rehovot.
The implication for public health is that, for example, quarantine cordons may be miscalculated, with possible ethical implications. Interestingly, the distance between Toronto and Ottawa in Southern Ontario (Figure 2), which is 347 km according to the great arc length, was measured well under the Lambert (338 km) and UTM (336 km) projections, whereas the Robinson and Mercator projections result in considerably longer distances (403 and 470 km, respectively).
The areal size of Southern Ontario was calculated under the Lambert projection to be 119,296 km² (Table I). Under the Mercator projection the area increased by 95%, to 233,004 km². This would reduce a population density, a key parameter in many epidemiological models, to about half its size.
It should be noted that a visual comparison of the Israel maps does not indicate an obvious change in shape and size, whereas the same projections noticeably affect the map appearance of Southern Ontario and North Carolina, i.e. result in a visual bias (compare Figures 1, 2 and 3).
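The arithmetic behind the area comparison, and a rough plausibility check, can be sketched as follows. The two areas are the values reported above; the representative latitude of 44°N for Southern Ontario and the use of the spherical Mercator area factor 1/cos²(latitude) are assumptions made for this illustration.

```python
import math


def mercator_area_factor(lat_deg):
    """Local area inflation of the spherical Mercator projection at a given latitude."""
    return 1.0 / math.cos(math.radians(lat_deg)) ** 2


lambert_km2 = 119_296    # area under the Lambert equal area projection (reported)
mercator_km2 = 233_004   # area under the Mercator projection (reported)

area_ratio = mercator_km2 / lambert_km2   # relative inflation of the area
density_factor = 1.0 / area_ratio         # effect on any population density

print(f"area inflated by {100 * (area_ratio - 1):.0f}%")
print(f"population density multiplied by {density_factor:.2f}")
print(f"Mercator area factor at 44 degrees N: {mercator_area_factor(44.0):.2f}")
```

The theoretical inflation factor at mid-latitudes is of the same order as the reported ratio, which suggests that the near doubling of area is a property of the projection at that latitude rather than a peculiarity of the boundary file.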
Moran’s I and semi-variogram range
Moran’s I was estimated using the empirical Bayesian smoothed WNv incidence data for the 29 regions in Southern Ontario. The results are presented in Table I. The point of note is the p-value of Moran’s I, which indicates the presence of clustering under the Lambert, Mercator and UTM projections, but not under the Robinson projection. Similarly, the semi-variogram range of the GLGM under one projection stands out as much longer than the range estimates for the alternative projections (Table II). However, the p-values reported should be interpreted as exploratory and used cautiously (Matthews et al., 2017).
Spatial scan test
The results from a circular spatial scan test analysis under the four projections are summarized in Figure 3 and Table II. It is a curious result that the number of clusters varies from 1 to 3 across the projections. More importantly, the SMR can be as low as 1.2 or as high as 4.7, which might or might not demand public health intervention. Additionally, the location and size of potential clusters differ between projections (Figure 3).
These examples show that visual and analytic bias can be introduced by the choice of a map projection. The projection-induced bias presented here is a conservative estimate. Map projections are distinguished by their type (e.g. Mercator or Robinson) as well as by a datum (i.e. a model for the earth’s shape). Here all projections were based on the WGS84 datum, although alternatives exist.
Some spatial methods and software applications have been adapted to base the analysis on geographic rather than Cartesian coordinates. For example, the spatial scan statistic can be applied to geographic coordinates, where the Euclidean distance is replaced by the great arc length. Similarly, the Flexible scan test as well as Conditional Auto-Regressive (CAR) models are based on contiguity neighbourhoods rather than continuous neighbourhoods. However, Wall (2004) argues that regional data models using a contiguous dependence structure, i.e. CAR, are not necessarily intuitive and interpretable, and thus promotes distance-based geostatistical modeling instead.
Conclusion
Map projections affect disease maps and spatial epidemiological data analysis in seemingly unpredictable ways. This can result in qualitative changes with respect to public health decision-making. Bias in disease mapping is unavoidable, but should be recognized. Thus it is recommended to report the map projection along with disease maps and inferences for spatial public health data. Recognizing map projection bias requires awareness and thus respective education. While this study was presented with the intention to stimulate awareness among researchers, it is important to include map projection bias in the teaching curriculum where possible. For example, a class on descriptive statistics featuring graphical summaries (histograms and box-plots) could also present maps and hint at the (direct) visual bias inherent in all maps. Furthermore, every teaching unit on bias and confounding could add map projection bias and its effect on spatial data analysis and public health decision-making.
Data analysis should be based on geographic coordinates when possible, rather than Cartesian coordinates to avoid or minimize the (indirect) analytical bias from map projections. It is not always possible to use longitude and latitude coordinates in the analysis; for example, when coordinates are used as covariates in a linear regression model only Cartesian coordinates are permissible. To further an understanding of the unavoidable map projection bias a systematic study of its effect is deemed useful. However, due to the infinite number of projections and situational nature of this bias such a study will certainly be limited.
The four map projections explored in this study were chosen as popular and well-known examples of map projections, which also cover a variety of relevant aspects for spatial epidemiological data analysis. Since all map projections lead to bias, there is no correct projection. But a great many projections were developed to minimize some aspect of map distortion and data analytic bias. This is a wide field of study. The current presentation can only instill some appreciation for the effects of map projections on disease maps and the respective spatial data analysis.
It should be noted that map projections do not just affect larger study areas as sometimes proposed in the literature; see for example Waller and Gotway (2004, p. 47) who suggest distortions are small for small regions. But this refers only to absolute distortion, because a percentage change in distance or area affects large and small study areas proportionally in the same way. This is very well demonstrated here by the Israel example.
Public health communication of spatial epidemiological patterns (such as the size and location of disease clusters or the range of disease spread over specific time periods) should recognize the uncertainty due to map projection bias. An honest reflection on the uncertainty surrounding scientific findings will hopefully lead to more modest and nuanced communications of health risks in spatial epidemiology and public health.
Competing Interests Statement
There are no financial or non-financial competing interests to be declared.
The choice of the boundary files for this study was due to convenience only and implies no political statement.
References
Arsenault, J., Michel, P., Berke, O., Ravel, A., & Gosselin, P. (2013). How to choose geographical units in ecological studies: proposal and application to campylobacteriosis. Spatial and Spatio-temporal Epidemiology, 7, 11–24. https://doi.org/10.1016/j.sste.2013.04.004.
Assunção, R. M., & Reis, E. A. (1999). A new proposal to adjust Moran's I for population density. Statistics in Medicine, 18(16), 2147–2162. https://doi.org/10.1002/(sici)1097-0258(19990830)18:16<2147::aid-sim179>3.0.co;2-i.
Banerjee, S., Carlin, B.P., & Gelfand, A. (2003). Hierarchical modeling and analysis for spatial data. Boca Raton: CRC Press.
Berke, O., & Waller, L. (2010). On the effect of diagnostic misclassification bias on the observed spatial pattern in regional count data - a case study using West Nile virus mortality data from Ontario, 2005. Spatial and Spatio-temporal Epidemiology; 1(2-3):117-122. https://doi.org/10.1016/j.sste.2010.03.004.
Bivand, R., & Lewin-Koh, N. (2019). maptools: Tools for Handling Spatial Objects. R package version 0.9-5. https://CRAN.R-project.org/package=maptools.
Bivand, R., Nowosad, J., & Lovelace, R. (2019). spData: Datasets for Spatial Analysis. R package version 0.3.0. https://CRAN.R-project.org/package=spData.
Bivand, R., Pebesma, E., & Gomez-Rubio, V. (2013). Applied Spatial Data Analysis with R (2nd Edn.). New York: Springer.
Diggle, P., & Ribeiro, P.J. (2007). Model-based Geostatistics. New York: Springer.
French, J. (2018). smerc: Statistical Methods for Regional Counts. R package version 0.4.5. https://CRAN.R-project.org/package=smerc.
GADM (2024). GADM maps and data, https://gadm.org; [last accessed March 9, 2024].
Kulldorff, M., & Nagarwalla, N. (1996). Spatial disease clusters: detection and inference. Statistics in Medicine; 14:799-810. https://doi.org/10.1002/sim.4780140809.
Lawson, A.B. (2006). Statistical Methods in Spatial Epidemiology (2nd edn.). New York: Wiley.
Leek, J.T., & Peng, R.D. (2015). Opinion: Reproducible research can still be wrong: adopting a prevention approach. Proceedings of the National Academy of Sciences of the United States of America, 112(6), 1645–1646. https://doi.org/10.1073/pnas.1421412111.
Matthews, R., Wasserstein, R. & Spiegelhalter, D. (2017). The ASA's p-value statement, one year on. Significance; 14(2): 38-41. https://doi.org/10.1111/j.1740-9713.2017.01021.x.
Ocaña-Riola, R. (2010). Common errors in disease mapping. Geospatial Health, 4(2), 139–154. https://doi.org/10.4081/gh.2010.196.
Peng, R.D., Dominici, F., & Zeger, S.L. (2006). Reproducible epidemiologic research. American Journal of Epidemiology; 163(9): 783–789. https://doi.org/10.1093/aje/kwj093.
Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D., & R Core Team (2019). nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-139. https://CRAN.R-project.org/package=nlme.
Public Health Ontario (2024). Ontario Agency for Health Protection and Promotion: Vector-borne diseases 2012 summary report, 2013. Toronto, ON: Queen’s Printer for Ontario. http://www.publichealthontario.ca/-/media/Documents/V/2013/vector-borne-diseases-2012.pdf?rev=4234df789ae44b47ad5656c4b875d415&sc_lang=en (Accessed on March 9, 2024).
R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org.
Selvaratnam, I., Berke, O., Thaivalappil, A., Imada, J., Vythilingam, M., Beardsall, A., Hachborn, G., Ugas, M., & Forrest, R. (2022). Characteristics of Disease Maps of Zoonoses: A Scoping Review and a Recommendation for a Reporting Guideline for Disease Maps. Cartographica: The International Journal for Geographic Information and Geovisualization; 57:2, 113-126. https://doi.org/10.3138/cart-2021-0019.
StatsCan (2020). Statistics Canada - 2016 Census - Boundary files, https://www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/bound-limit-2016-eng.cfm; (Accessed on March 9, 2024).
Thompson, M., & Berke, O. (2017). Evaluation of the Control of West Nile Virus in Ontario: Did Risk Patterns Change from 2005 to 2012? Zoonoses and Public Health; 64(2), 100–105. https://doi.org/10.1111/zph.12285.
Venables, W.N., & Ripley, B.D. (2002). Modern Applied Statistics with S (4th Edn). New York: Springer.
Wall, M.M. (2004). A close look at the spatial structure implied by the CAR and SAR models. Journal of Statistical Planning and Inference; 121: 311–324. https://doi.org/10.1016/S0378-3758(03)00111-3.
Waller, L., & Gotway, C. (2004). Spatial Statistics for Public Health Data. New York: Wiley.
Wikipedia (2024). Wikipedia, The Free Encyclopedia, https://en.wikipedia.org (Accessed on March 9, 2024).
Zimmerman, D.L. (2008). Estimating the intensity of a spatial point process from locations coarsened by incomplete geocoding. Biometrics; 64(1):262-70. https://doi.org/10.1111/j.1541-0420.2007.00870.x