Fifteen years ago, most users experienced online maps much as they might their paper counterparts: flat, non-interactive images for browsing geography. In 2005, Google Maps changed that, giving rise to enthusiasm for the mapping mashup, where data (often taken from public datasets) is located on an interactive, scrollable, and zoomable map. A year later, OpenStreetMap was launched in response to ongoing frustration at the lack of open geographic data in the United Kingdom (UK), providing a platform for the collection and display of mapping data unencumbered by intellectual property (IP) restrictions.1 The move from large proprietary desktop Geographic Information Systems (GIS) to increasingly open access to geospatial2 data appeared to be underway.
Mapping visualisations have been strategic assets in the popularity of open data and they remain one of the public entry points to engage with open data. A typical mapping portal from the City of Phoenix3in the United States (US) demonstrates the type of geospatial data (and prepared maps) available through a typical North American municipal data portal, including property boundaries, zoning information, traffic volumes, and recreation areas (see Figure 1). A similar site can be found for Manchester, England,4although geospatial data and map access come with terms and conditions that restrict how that data can be used.
Both the potential demonstrated by mapping mashups and user interfaces and the desire for access to valuable geospatial datasets held by governments and government agencies can be seen as driving forces in the development of the open data movement. But what of geospatial data today? Is the data now widely open, accessible, and used? And what progress has been made in unlocking the potential of geospatial data for analysis and improved policy-making?
While much progress has been made in the availability of data, and in the development of tools to visualise it, substantial work is needed to better connect geospatial and open data communities, to equip creators and users of geospatial data with the critical skills (and technical platforms) needed to move beyond simply mapping, and to gain the full benefits of geospatial data analysis. There also are significant risks from the wider use of geospatial data that need to be more directly addressed. Ultimately, advances made in terms of sheer data availability and infrastructures are currently counterbalanced by significant stalemates in terms of analytical approaches to geodata, as well as ownership and privacy risks.
Primer: An overview of open geospatial data
It is estimated that 80% of all government data has some reference to location.5Almost every chapter in this volume touches upon geospatial data in some form. Geospatial content can be found in datasets on subjects as diverse as parks, refugee camps, financial transactions, natural resource distributions, and socioeconomic statistics. Many uses of open data rely on being “mapped” (i.e. attached) to basic geographic framework data.6For example, socioeconomic statistics, like population, may be mapped on top of administrative boundaries. Data on soil quality may be attached to digital elevation models to model erosion, and that same soil data may be compared to geographically intersecting data on land ownership and land subsidies. Without their geospatial component, many open datasets would have much reduced impact.
Mapping generally involves presenting geospatial data alongside a geographic layer. Geographic layers are datasets that are essentially outlines and may or may not be open data themselves. These layers include jurisdictional boundary files (e.g. country, city, school catchment areas, and watershed districts) or linear features like rivers or roads. For completeness, there also are geographic point layers, such as centres of cities or locations of known elevation like mountain peaks. Geographic layers may also include remotely sensed imagery. Imagery can function as a backdrop onto which geospatial data is overlaid (e.g. logging operations in forested areas). Like other geospatial data, remotely sensed imagery can be analysed alone or in combination with other open datasets to identify areas of drought, land use, or pollution.
Many practitioners working with open data consider geography primarily in terms of x and y coordinates, usually expressed as longitude and latitude, respectively. It is important to recognise that there are numerous types of “coordinates”. These include direct location references such as latitude/longitude, postal addresses, or GPS traces. There also are indirect references to location, such as place names (e.g. colloquial neighbourhood names, or official country or region names) that can be turned into a set of coordinates using a gazetteer or a lookup database.
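As a minimal sketch of how an indirect reference becomes a direct one, the following Python snippet resolves place names against a toy gazetteer. The `GAZETTEER` table, the `Location` type, and the `geocode` function are hypothetical names introduced for illustration, and the coordinates are rough approximations rather than authoritative values.

```python
from typing import NamedTuple, Optional


class Location(NamedTuple):
    lat: float  # degrees north of the equator
    lon: float  # degrees east of the prime meridian


# Hypothetical toy gazetteer; coordinates are approximate and illustrative only.
GAZETTEER = {
    "manchester, england": Location(53.48, -2.24),
    "phoenix, arizona": Location(33.45, -112.07),
}


def geocode(place_name: str) -> Optional[Location]:
    """Resolve a place name to coordinates, or None if it is not listed."""
    # Normalise case and whitespace so minor spelling variations still match.
    key = " ".join(place_name.lower().split())
    return GAZETTEER.get(key)


print(geocode("Manchester, England"))
print(geocode("Atlantis"))
```

A real gazetteer also has to handle ambiguous names, alternative spellings, and places whose boundaries are contested, which a simple lookup table like this one ignores.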
The vast diversity of geospatial data may be more or less open along a number of dimensions. Data may be free to browse but not to download. Or data may be free to download but provided under restrictive licences that limit reuse. Or data may be openly licensed but only available in formats that require proprietary software or that use proprietary referencing systems. To understand open geospatial data, we need to ask: What kind of data is this? and How open is it?
Many kinds of geospatial data in terms of structure, representation, and analysis
There are many different kinds of geospatial data, and for any geographic feature, choices are made about how to represent it. The same feature might be represented using points, lines, polygons, or pixels. This choice impacts the kind of analysis that is possible, the technologies that can be used in analysis, and the biases to watch out for when drawing conclusions from the data.
Figure 3 shows how a feature might be represented as a vector (a collection of linked points) or a raster (a collection of pixels scaled to a particular resolution with each individual pixel encoding information from its immediate area).
Figure 4 illustrates how information is linked to geography for presentation (mapping) and analysis. Geographic layers are usually not directly accompanied by geospatial data. Instead, to a polygon (e.g. a country boundary), one could add (join) datasets, such as population data, information on political control, or catchment areas for particular service provision, and to a point (e.g. lat, lng) one could add details of public services provided at that location.
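The join described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular GIS tool: the area codes, the polygon coordinates, the population figures, and the `join_attributes` function are all hypothetical.

```python
# Hypothetical geographic layer: area code -> polygon ring of (lon, lat) pairs.
boundaries = {
    "A01": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "A02": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}

# Hypothetical statistics table keyed by the same area codes.
population = {"A01": 12000, "A03": 500}


def join_attributes(layer, table, field):
    """Attach table values to each geometry using the shared area code."""
    joined = {
        code: {"geometry": geometry, field: table.get(code)}
        for code, geometry in layer.items()
    }
    # Table rows with no matching geometry cannot be mapped at all.
    unmatched = sorted(set(table) - set(layer))
    return joined, unmatched


joined, unmatched = join_attributes(boundaries, population, "population")
print(joined["A01"]["population"])  # value found in the table
print(joined["A02"]["population"])  # None: boundary has no statistics
print(unmatched)                    # codes in the table with no boundary
```

The unmatched codes matter in practice: a join only works when both datasets use the same identifiers for the same areas, which is one reason shared standards for boundary codes are valuable.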
However, geospatial analysis does not require pre-existing boundaries like countries or cities. This can be useful when the boundaries are not available or when mapping onto those boundaries would be misleading (e.g. mapping incidences of crime onto areas with very different populations). Hexbinning, shown in Figure 5, is an approach to handle point data in these cases, creating a new geographic layer of arbitrary shapes into which the points can be aggregated.
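To make the idea concrete, here is a minimal Python sketch of hexagonal binning. It assumes the points are already in a planar (projected) coordinate system and uses a pointy-top hexagon grid addressed by axial coordinates; the function names `hex_cell` and `hexbin` and the sample points are illustrative assumptions, not taken from any particular mapping library.

```python
import math
from collections import Counter


def hex_cell(x, y, size):
    """Return the axial (q, r) index of the pointy-top hexagon containing (x, y).

    `size` is the distance from a hexagon's centre to any of its corners.
    """
    # Convert planar coordinates to fractional axial coordinates.
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    s = -q - r  # third cube coordinate; q + r + s is always 0

    # Round to the nearest hexagon, then repair the coordinate with the
    # largest rounding error so that q + r + s == 0 still holds.
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return (rq, rr)


def hexbin(points, size):
    """Count how many points fall into each hexagonal cell."""
    return Counter(hex_cell(x, y, size) for x, y in points)


points = [(0.0, 0.0), (0.1, 0.1), (-0.1, 0.05), (1.8, 0.0)]
print(hexbin(points, size=1.0))
```

The resulting counts per cell form a new geographic layer. The choice of `size` is an analytical decision in its own right, since larger or smaller hexagons can produce different visual patterns from the same points.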
Progress: Open geospatial data availability and infrastructure
The last decade has seen substantial strides in opening up geospatial datasets. Evidence suggests this has brought significant social and economic value. For example, in 2013, the Government of Denmark, through its Basic Data programme, released digital mapping data free under an open licence. A follow-up study in 2017 estimated that this had led to DKK 3.5 billion (approx. USD 495 million) in socioeconomic value in the preceding year.7 It is estimated that making the US Landsat satellite imagery freely available in 2009 accrued USD 1.8 billion in annual value to the economy, whereas charging for access would lead to substantial inefficiencies and loss of value.8 In the UK, open data policy has led to new datasets being made open by the national mapping agency, the Ordnance Survey. The release of geospatial data responded to advocacy that focused on gains to the economy from a more open approach to this data.9 It has long been argued that Canada suffered significant losses due to the government’s early reluctance to open geospatial data,10 which is being remedied.
In the US, efforts to open up federal geospatial data pre-date most consideration of open data worldwide. The federal government, as well as subnational jurisdictions of the US (states, cities), tends to publish geographic datasets as integral parts of their open data portals. Geospatial data is arguably the first open (government) data because of the establishment of national or subnational spatial data infrastructures (NSDIs), the first one being the Australian Land Information Council in 1986.11 NSDIs are outgrowths of “the technology, policies, standards, and human resources necessary to acquire, process, store, distribute, and improve utilisation of geospatial data”.12 Geospatial data infrastructures tend to require high levels of interoperability in terms of standardisation to function. Their datasets likely originate in different agencies with varying practices of data collection, update schedules, and definitions. Full standardisation requires geospatial data to share the same geographic projection, coordinate system, spatial extent, update schedule, and data definitions. It is by no means easy to coordinate data so that layers “lie on top” of each other in alignment.
Spatial data infrastructures did not necessarily originate as open platforms. Many were designed as government-to-government data sharing platforms, although several promoted the idea that the data should be accessible to a range of applications and support economic development. Openness of geospatial data remains uneven across the world. The latest Open Data Index13identifies just 12 countries where governments provide fully open national geospatial data, and only one (Brazil) is not in the World Bank’s “High-Income Economies” category. There is movement among numerous countries to increase openness (e.g. Indonesia’s widely discussed One Map initiative). Progress has been slow and mostly focused on rationalisation of geospatial data management. Opening up geospatial data is not simply a matter of applying a licence to existing datasets, but also involves the adoption of policies, standards, and human resources specific to geospatial data.
Encouraged by the International Open Data Charter, and noting the value of an “open by default” approach, the Group on Earth Observation adopted open data principles in 2016,14seeing this as the natural step forward from their existing data sharing regime (established in 2006) and justifying this shift on the basis of the economic, social, governance, education, research, and innovation value.15The European Union’s (EU) INSPIRE16directive has driven the inclusion of geospatial data features in a number of national data portals and extensions for geospatial data to the open source CKAN software.17Many NSDIs have had little integration into the open data landscape. However, the EU’s initiative demonstrates how governments may integrate parallel tracks of activity between the open data and geospatial communities.
Gaps in geospatial data are increasingly addressed through the use of cross-border satellite imagery available on digital earth mapping platforms. Some of this data is sourced from government. The Africa Regional Data Cube, launched in May 2018, shares many features of an NSDI in terms of standardisation and provides access to free satellite imagery for Kenya, Senegal, Sierra Leone, Ghana, and Tanzania. It builds on an open source “data cube” platform that compresses pre-processed imagery to reduce the otherwise prohibitive costs of data transfer, storage, and analysis.18
Government data also is being augmented by the private sector and civil society, and some of these new geospatial datasets could become open data. Firms like DigitalGlobe provide imagery derived from commercial satellites. Whereas satellite coverage may be universal, street mapping remains limited by either the availability of non-proprietary street-mapping data or volunteer contributions. Much of this data is licensed to proprietary platforms like Google Maps. Users can zoom into most places on Earth and see road layouts or satellite imagery. To access the same data on other platforms to support applications or analysis can often be prohibitively expensive. For instance, software application programming interfaces (APIs) may be available but based on per-access pricing,19or sudden price changes may leave data out of reach of users seeking to map open data coordinates or build open data-related applications and businesses.20It is important to remember that free to use, but non-open, platforms are subject to prevailing business models of tech industries. Parts of Microsoft’s Bing mapping division were sold to Uber in 2015, and Google increased prices for its mapping APIs up to fourteenfold in 2018. There is a precariousness to basing one’s mapping applications on a specific non-open platform. Fortunately for data consumers, the last decade also has seen the emergence of tools like Leaflet,21which enable digital mapping using a variety of geospatial data providers. Companies like MapBox22provide a commercial offering but are committed to building on top of open source tools and data.
Open geospatial data also is being created through crowdsourcing. The largest platform, OpenStreetMap, “is built by a community of mappers that contribute and maintain data about roads, trails, cafés, railway stations, and much more, all over the world”.23By comparing CIA World Factbook data on road length in a country with OpenStreetMap data, Maron and Channell found that some countries have 100% coverage of major roads.24In Asia and China coverage is more limited. In India, for example, only 21% of the road network has been digitised on OpenStreetMap.25
Use of private or crowdsourced data reflects the costs of collection and maintenance of geospatial data and related infrastructures. When geospatial data is funded directly from government budgets, rather than through cost-recovery (i.e. charging users for use of the data as a method of supporting government data collection and maintenance), access is at greater risk of budget cuts.26This can lead to pressure from agencies working with geospatial data to develop or retain financing regimes. The cost of data collection has led a few governments, particularly in North America, to explore partnerships with private sector firms to collect data through projects, such as Google Waze, Strava Metro, and Uber Movement.27Ironically, these datasets frequently originate from civil society or individual citizens, but ownership is claimed by the firms providing the platforms for data collection. This can introduce new sources of proprietary data in spatial data infrastructures at the same time that other aspects of those infrastructures may be opening up. Additionally, the inclusion of privately sourced or crowdsourced data invariably shifts control from government in terms of data accuracy, coverage, and timeliness of edits and updates. This will increase the risk to governments (real or perceived), particularly if that data is central to government operations.28
Four examples of open geospatial data
Thousands of examples of open geospatial data projects exist. These include:
- Crime Maps presenting data from the police and justice system (see Chapter 4: Crime and justice) for individuals to see recorded crime incidents and rates in their communities.
- Community assets mapping such as the MySociety.org “Keep it in the Community” project that is mapping an England-wide register of community assets and exploring issues around ownership of community buildings and land.
- Disaster relief and resilience initiatives such as the work of Humanitarian OpenStreetMap Team (HOT) which mobilises volunteers to remotely map disaster-hit areas in support of responders. The OpenDRI (Open Data for Resilience Initiative) seeks to reduce vulnerability to natural hazards and impacts of climate change.29
- Aid mapping including work to understand patterns of aid distribution and the geopolitics of aid.30
Challenges: IP, privacy, and standards
For all the progress that has been made in terms of data openness, four issues present notable challenges for work with open geospatial data.
First, numerous countries face challenges in opening key datasets due to IP restrictions. The UK’s mapping agency, the Ordnance Survey, and postal service, Royal Mail, have long been restricted in how they can open up their geospatial data due to Crown Copyright. Ownership of all or part of the IP was further complicated when the management of the postcode database was outsourced to a private firm. The situation shows signs of improvement with a 2015 open data policy supporting a “presumption to publish”.31 However, efforts to create an open address register for the UK have been put on hold, which places this critical lookup dataset out of the reach of many open data projects.32 In Canada, Canada Post has maintained strict IP protections on its postal code database. A one-person Canadian firm, Geolytica, built an application that would reverse engineer Canadian postal code boundaries using computational geometry and crowdsourcing. It was done as a proof-of-concept, but the database was also opened up to the public. Geolytica’s efforts led to it being sued by Canada Post for violating the latter’s ownership of the phrase “postal code” and the underlying content.33
The value of spatial data as IP means that firms are often interested in acquiring exclusive rights to it. Another example from Canada illustrates this. The Ontario-based firm, Teranet, purchased the rights to land registries (cadastres) around the world. In exchange for those rights, the firm maintains the registry datasets and then licenses access back to local and regional governments.34This represents not just private provision of the service but private ownership of the data. There is a paucity of reliable data on how many countries have substantial private ownership of IP in their spatial data infrastructure, yet this is likely to be an important area to track over the coming decade if further gaps are to be avoided in the open geospatial data landscape.
A second key challenge relates to privacy and security. When it concerns data about individuals, location data can often pierce privacy protections and enable surveillance. A combination of just three variables (i.e. gender, birthdate, US zip code) has been found sufficient to identify individuals by name in the US.35Individuals increasingly leave geographic data traces on the web through their use of fitness trackers, location-stamped photographs, or a myriad of other location tracking apps. The existence of this data can jeopardise the anonymity of other datasets that might contain coinciding location and timestamps. Methods exist to maximise privacy while preserving the ability to analyse data (e.g. through geographic masking).36However, the ability to deanonymise data will only improve as artificial intelligence and machine learning are applied to open data.37Whereas open datasets generally do not describe individual persons, the growing availability of geo-indexed data needs to be accounted for when creating, sharing, and using open datasets.
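As an illustration of what geographic masking can look like, the sketch below displaces each point by a random distance in a random direction, with both a minimum and a maximum displacement (a variant sometimes called donut masking). It is one simple approach under stated assumptions, not necessarily the method used in the work cited above: coordinates are assumed to be planar and in metres, and the function name `donut_mask` and its parameters are hypothetical.

```python
import math
import random


def donut_mask(x, y, min_dist, max_dist, rng):
    """Move a point by a random offset between min_dist and max_dist.

    The minimum distance keeps the masked point from landing on the true
    location; the maximum distance limits how much spatial patterns are
    distorted.
    """
    angle = rng.uniform(0.0, 2.0 * math.pi)
    # Sampling the squared radius uniformly spreads the masked points evenly
    # over the ring-shaped area rather than crowding them near the inner edge.
    radius = math.sqrt(rng.uniform(min_dist ** 2, max_dist ** 2))
    return x + radius * math.cos(angle), y + radius * math.sin(angle)


rng = random.Random(42)  # fixed seed so the example is repeatable
masked = donut_mask(1000.0, 2000.0, min_dist=50.0, max_dist=200.0, rng=rng)
print(masked)
```

Masking of this kind trades positional accuracy for privacy. It does not by itself guarantee anonymity, particularly in sparsely populated areas where few people live within the displacement ring.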
Standardisation presents a third major challenge for greater interoperability in the world of geospatial data. The most commonly used standard for geography is the “atomic standard” of the coordinates, latitude and longitude. Multiple alternatives exist to lat/long (e.g. polar coordinates are better suited to locations near the poles). Considering coordinate systems also requires contemplating standards in geographic projections. Inconsistent projections prevent one dataset from being correctly overlaid onto other data layers and may inhibit other operations like calculating travel distances. Polygons like jurisdictional boundaries also generate complexity related to standards. The schema.org standards for place, which contain at least ten different relationships of containment, overlapping, intersection, and equality between areas, provide a sense of how complicated it is to structure geometries beyond simple point locations.38 Maintaining the quality of geographic data and ensuring standards are adopted correctly is not trivial. Unlike in other sectors, the problem is not the availability of standards (e.g. the Open Geospatial Consortium maintains over 30 open standards for geographic data);39 rather, what is needed is an educated understanding of how to adopt them. Without it, instead of creating an integrated world of geospatial data, open data initiatives could lead to a soup of misaligned points and polygons that are difficult to distinguish.
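A short Python sketch shows why coordinate systems matter for something as basic as distance. It uses the haversine formula, which treats the Earth as a sphere with a mean radius of 6,371 km (an approximation), to compare the length of one degree of longitude at the equator and at 60 degrees north. The function name `haversine_km` is chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; the Earth is not a perfect sphere


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude covers very different ground distances
# depending on latitude.
at_equator = haversine_km(0.0, 0.0, 0.0, 1.0)
at_60_north = haversine_km(60.0, 0.0, 60.0, 1.0)
print(round(at_equator, 1), round(at_60_north, 1))
```

Treating latitude and longitude as if they were flat x and y values would report the same distance in both cases, even though the second is roughly half the first. Comparable discrepancies arise when layers stored in different projections are overlaid without conversion.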
This leads to the last challenge: the lack of interaction between open data communities and the communities that traditionally work with geographic data. Open geospatial data (via WAIS servers, NSDIs, and Al Gore’s articulation of a Digital Earth)40 predates the concept and implementation of open data. Open data advocacy in several countries was sparked by a desire for geospatial data, as in the UK FreeOurData campaign41 and Canada’s DataLibre.42 Nonetheless, there has been a gulf between the early open data movement, with its focus on quantity over quality, and the geography/geomatics community, which, by 2010, was already well established and considering issues of standardisation and data management. We have seen plenty of missed opportunities to bridge the gulf, which has resulted in a bifurcation in skills for geospatial data handling that impedes both the opening, and the effective use, of geospatial data. In particular, this has led to the open data world’s focus on mapping but very little focus on geographical analysis. There remains considerable potential for increased interaction between the two communities to enhance skills and analysis.
Pitfalls and potential: From mapping to analysis
Mapping is undoubtedly important, but visualisation of data is just one strategy of many. There has been a tendency among open data practitioners to map and make inferences based on visual inspection of geospatial datasets. However, these ostensible relationships are often not statistically significant. The ability to map open data in the absence of the critical skills to analyse it correctly can lead to problems and even incorrect policy prescriptions. Expanding skills for detailed spatial statistics and analysis, to allow conclusions to be drawn from open datasets and to create new, improved maps based on the results of that analysis, should be a high priority in the open data community. General data literacy capacity has grown, but the availability of tools, resources, and outreach to promote geospatial data literacy is much more limited. The current lack of analytical capacity represents a critical bottleneck to the effective use of open geospatial data.
For example, one large part of open geographic data handling concerns what is known as “feature geometry”. Most open data containing geospatial attributes is point-based. That is, an entity’s location (e.g. a park, a government transaction, a building project, or a refugee settlement) is represented by a single x, y coordinate. The choice of which points to use is not always obvious. Should the location be a headquarters of a local relief agency or the location where activities are occurring? Many of these points reflect what is called a central tendency or the centroid (a geometric centre of an area). Depending on the shape of the area (e.g. a crescent), a centroid could actually appear outside the area. The simple consideration of which location is mapped can affect the message a map communicates.
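The point about centroids falling outside their own area can be checked with a small Python sketch. It computes the area-weighted centroid of a U-shaped polygon with the standard shoelace formula and then tests whether that centroid lies inside the polygon using ray casting. The polygon and the function names are illustrative assumptions.

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon (shoelace formula)."""
    twice_area = cx = cy = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        cross = x1 * y2 - x2 * y1
        twice_area += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)


def point_in_polygon(px, py, ring):
    """Ray casting: count how many edges a ray to the right of the point crosses."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        if (y1 > py) != (y2 > py):  # edge spans the ray's height
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


# A U-shaped area: a 3 x 3 square with a notch cut out of the top middle.
u_shape = [(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3)]

cx, cy = polygon_centroid(u_shape)
print(round(cx, 3), round(cy, 3))
print(point_in_polygon(cx, cy, u_shape))
```

Here the centroid lands in the notch, outside the area it is supposed to represent. Some GIS tools offer an alternative representative point that is guaranteed to fall inside the polygon, which may be a better choice when a single marker has to stand in for an irregular area.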
Numerous forms of analysis should not rely on point location at all. Many features, such as the geographic distribution of poverty or of crop types, are not natural distributions, easily interpreted through the use of latitude and longitude, but are shaped by politics. Such features are more appropriately described by areal measures. For example, poverty should be reported by the political boundary of a township. Unlike geographic points, working with jurisdictional data can be difficult because boundary file availability and discoverability are limited and there may be disputes over borders. Tools for working with containment (polygons) are less user-friendly, in many cases, than those for generating point-based online maps. Similar issues exist for raster datasets (e.g. satellite imagery), which are especially important for rural areas.43Working with raster data, whether it is satellite data or drone data, generally requires more extensive experience and expensive software than other types of data.
A common alternative to mapping by jurisdiction is through aggregation and clustering. Two popular aggregation methods are hexagonal binning (hexbins) and rectangular grids, which rely on the use of regular artificial areas into which points are counted. A different approach is clustering points through hotspot analysis, which infers the geospatial extent of a phenomenon (e.g. a cluster of disease outbreaks) and differentiates statistically significant clusters from non-significant clusters. Many tools can now automate aggregation and clustering, but tools need to be accompanied by a critical understanding of the way the choice of approach affects analysis. Geographers have widely discussed the modifiable areal unit problem (MAUP)44whereby aggregation units are understood as definitionally artificial and the results of data aggregation depend on the choice of the unit. Results (e.g. counts, rates, densities, and correlations) are influenced by the shape and orientation of the unit (e.g. slight tilting or enlarging of a rectangular grid), as well as by the way the units are combined (scale). O’Loughlin et al. (2014), for example, use open data on a rectangular grid to map violence, heat, and precipitation across the African continent.45They note limits in the data and its aggregation, even as they perform analyses at a finer aggregation than previously conducted to better understand climate conflicts. Tools exist to improve data literacy with regard to problems introduced by spatial aggregation.46The challenge is promoting their adoption outside the geography community and within the much wider community of open data users who may otherwise adopt naive analytical strategies. No aggregation is perfect, including those using jurisdictional boundaries. It is important to broaden critical understanding of the malleability of aggregations in the results they deliver.
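The modifiable areal unit problem can be demonstrated with a deliberately small Python example. The same four points are counted on two rectangular grids that differ only in where the grid starts. The point locations, the function name `grid_counts`, and its parameters are illustrative assumptions.

```python
import math
from collections import Counter


def grid_counts(points, cell_size, origin=(0.0, 0.0)):
    """Count points per square grid cell, for a grid anchored at `origin`."""
    ox, oy = origin
    return Counter(
        (math.floor((x - ox) / cell_size), math.floor((y - oy) / cell_size))
        for x, y in points
    )


# Four incidents clustered tightly around the location (1, 1).
points = [(0.9, 0.9), (1.1, 0.9), (0.9, 1.1), (1.1, 1.1)]

aligned = grid_counts(points, cell_size=1.0)
shifted = grid_counts(points, cell_size=1.0, origin=(0.5, 0.5))

print(max(aligned.values()))  # cluster split across four cells
print(max(shifted.values()))  # cluster captured in a single cell
```

On the first grid the cluster straddles a cell corner and looks like four unremarkable cells with one incident each. Shifting the grid by half a cell puts all four incidents in one cell, which would appear as a hotspot. Neither result is wrong, but a map built from either one communicates a different message.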
This noted, we must be aware that improving the quality of analysis of geospatial open data can be knowledge and resource intensive. For example, AidData’s infrastructure for sophisticated geospatial analysis of international aid patterns is expensive to maintain and requires substantial annual resources.47Although Google has instituted a business model for Google Maps, organisations like AidData cannot rely on similar mechanisms of support.
As we look to the future, opportunities lie in better connecting the open data and geospatial data communities. The latter has been working on improving open source geospatial data tooling for many decades. Even though much of this work has been focused in particular professional contexts, critical and community geographers have long been working on ways to open up access to, and support popular engagement with, geospatial data. The extensive learning and thinking within this field should not be ignored in the rush to open up data and excitement over the latest commercial tools and simplified mapping platforms.
Major advances have been made in open geospatial data. However, numerous gaps remain related to IP, standardisation, privacy, and analytical capacity. In the next decade of open data, we need to ensure greater coordination between the geomatics/GIS and the open data communities so better maps can be produced and greater value can be demonstrated from the wealth of geographic content within the open data released in the last decade.
More than anything, anyone working with geographic open data should approach it with a critical eye and ask two questions. Which choices have been made in creating this data? What lessons might there be from the existing geospatial data community to help with the analysis of this data?