Algorithms and AI
Artificial intelligence (AI) involves machines taking on tasks that previously required human intelligence to complete. Smith and Neupane describe AI as “an area of computer science devoted to developing systems that can be taught or learn to make decisions and predictions within specific contexts”, adding that “AI applications can perform a wide range of intelligent behaviours: optimization (e.g. supply chains); pattern recognition and detection (e.g. facial recognition); prediction and hypothesis testing (e.g. predicting disease outbreaks); natural language processing; and machine translation”.1 The AI models that carry out these tasks often require, and generate, vast quantities of data. In many cases, open datasets provide key inputs to AI systems.
Although AI research goes all the way back to the early days of computing, the last few years have seen substantial renewed interest in the topic. Companies, governments, and non-profit organisations have all joined the new wave of AI, looking to mine their existing stocks of data or to process ever increasing data flows. As Figure 1 suggests, interest in AI far outstrips interest in open data per se. While searches for information on open data have flatlined since 2016, interest in AI is on an upward curve. This is also reflected in funding, with one report citing over USD 500 million invested by United States (US) philanthropists into AI research in 2018 alone2 (quite possibly more than the total sum of philanthropic investment in open data over the entire decade), and numerous governments are developing strategies and funding programmes to boost AI development to gain national competitive advantage in the AI space.3
However, the current wave of AI interest has not been without controversy. Concerns about bias and exclusion, the loss of jobs through automation, human rights impacts of automated decision-making, centralisation of power (and wealth), surveillance through mass data collection, hidden environmental damage, and the safety of AI systems have all been expressed by a growing number of AI institutes, think tanks, and other stakeholders.4 This has led to a search for strategies to promote the positive outcomes of AI, while mitigating the potential hazards. Open data has a key role to play in this regard.
This chapter will examine where open data and AI meet, exploring their historic relationship and the ways in which data may be shaped by, and may shape, the future of AI applications. Although open data communities have been relatively slow to engage with AI, the chapter will also explore the pivotal role they can play in making safe and effective use of emerging AI technologies and in providing critical scrutiny to determine when AI approaches are the right answer or when alternative grassroots ‘small data’ strategies should be preferred.
Primer: AI, algorithms, and machine learning
AI relies on applying algorithms to data. An algorithm is simply a set of instructions that can take an input and generate some sort of output.
Some algorithms are manually written. In such cases, it should be theoretically possible to read the algorithm and trace a comprehensible path from input to output. However, many modern algorithms are instead based on machine learning, a process by which an algorithm is “trained” by being repeatedly fed data and judged on its outputs. Over time, a machine-learning algorithm adapts to better provide the required output, even without explicit instructions covering all steps from input to output.
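The contrast can be sketched in a few lines of Python. This is a toy illustration only, not drawn from the chapter: the expense-flagging task, the single-threshold “model”, and all the values are invented for the example.

```python
# Toy illustration: a manually written algorithm versus a "trained" one.
# Hypothetical task: flag expense claims above some limit as suspicious.

# 1. Manually written: every step from input to output is explicit.
def flag_manual(amount, limit=100.0):
    """An explicit, human-readable rule."""
    return amount > limit

# 2. "Trained": the threshold is learned from labelled examples rather
#    than written down. We repeatedly feed the algorithm data, judge its
#    output, and nudge the parameter until it fits the examples.
def train_threshold(examples, steps=1000, lr=0.5):
    threshold = 0.0
    for _ in range(steps):
        for amount, is_suspicious in examples:
            predicted = amount > threshold
            if predicted and not is_suspicious:
                threshold += lr   # flagged a legitimate claim: raise the bar
            elif not predicted and is_suspicious:
                threshold -= lr   # missed a suspicious claim: lower the bar
    return threshold

examples = [(20, False), (45, False), (80, False), (130, True), (250, True)]
learned = train_threshold(examples)
# The learned rule now separates the training examples, even though the
# threshold was never written down by hand.
assert all((amt > learned) == label for amt, label in examples)
```

Real machine-learning models adjust millions of parameters rather than one, but the principle is the same: the behaviour emerges from data and feedback rather than from explicit instructions.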
Many ongoing discussions use AI as a shorthand for machine learning, although AI is a broader field including many other areas of focus, such as knowledge representation, planning, and reasoning. There are hundreds of different machine-learning methods and algorithms now available, some of which are specialised for particular problem spaces (e.g. image or speech recognition) and others which are more general purpose. Deep learning (sometimes called deep neural network) models take advantage of vast computing power to create layered learning systems with a structure inspired by the human brain.
Beside method choice and tuning, the quality of a machine-learning model depends largely on three things: the quantity of training data available, the quality of training and input data, and the amount of computing power used to build the model.
Linked histories: Shaping data inputs, outputs, and environments
Many histories trace the birth of AI to a workshop at Dartmouth College in the US in 1956. Research at the time promised the creation of automated general intelligence and captured the imagination of science fiction writers. Yet, when it proved much more complex than anticipated to deliver, funding was rapidly reduced, leading to what has been termed the first “AI Winter”.5 The late 1980s and early 1990s saw subsequent episodes of AI investment boom and bust, attributed in part by AI advocates not to failures of the technology but to a loss of confidence by governments and companies when AI technologies, far from failing, became so commonplace as to be unremarkable.6 This experience may be instructive for open data communities as they reach the end of their first decade, and early innovations are soon taken for granted. As of today, however, AI is experiencing a substantial and sustained boom, traced by many to the widespread adoption of deep learning methods from 2010 onward7 and buoyed by widespread (albeit shallow) public awareness of the concept from literature and media coverage.
It is notable that a number of open data pioneers had roots in the AI community, or, more specifically, in Semantic Web research.8 In 2001, Web inventor Tim Berners-Lee and colleagues outlined a vision for a Semantic Web as the “evolution of a Web that consisted largely of documents for humans to read to one that included data and information for computers to manipulate”.9 Revisiting that vision in 2006, Shadbolt, Hall, and Berners-Lee discussed the “increasing need and a rising obligation for people and organizations to make their data available […] driven by the imperatives of collaborative science, by commercial incentives such as making product details available, and by regulatory requirements”.10 Two years later, a number of the same authors also reported on their efforts to bootstrap the Semantic Web with public sector information (PSI).11
AKTive PSI: Semantic Web and open data
In 2005, the United Kingdom (UK) Office of Public Sector Information (OPSI) initiated a project with the Advanced Knowledge Technologies (AKT) group at the University of Southampton to develop a range of Semantic Web demonstration applications, drawing on different sources of PSI. Shadbolt et al. (2011) describe how the “small-scale success of AKTive PSI in 2006–2008 paved the way for the more ambitious data.gov.uk12 site, which followed in 2009–2010”.13
The project built on semantic data extraction tools and involved the creation of vocabularies and ontologies for modelling PSI as Linked Open Data (LOD), seeking to lay the groundwork for more advanced “software agents” to intelligently make connections between government services and support greater interoperability.
When, in 2009, Tim Berners-Lee and Nigel Shadbolt had the opportunity to shape the development of the UK’s open data portal, data.gov.uk, they placed a particular emphasis on structuring data according to Semantic Web principles, strongly influenced by AI, knowledge representation, and research work on ontologies, vocabularies, and identifiers.14 A number of other countries also incorporated Semantic Web and “linked data” principles and practices into their early open data work, although there is little evidence of how far this has been sustained. This is perhaps because, although a desire for structured data to feed data-hungry and university-led AI research was a key driver for early open government data advocacy, the current wave of AI interest has quite distinct origins.
ImageNet is a database of annotated images first published by Princeton University in 2009. Since 2010, an annual challenge has invited machine-learning researchers to compete to provide software that outperforms humans in recognising what the images contain. By 2012, the improving accuracy and speed of the submissions sparked renewed industry interest in AI and in the deep learning methods being used.15 It was not the analysis of structured data which sparked this new wave of AI but rather the creation of structured data out of a wealth of unstructured inputs. As a result, the AI methods now widely deployed to extract data from images, sounds, and video, and to make probabilistic connections between data points in different datasets, appear, at first, to sidestep the complex and expensive labour of standardising and structuring every row of a dataset. Yet, these AI models still rely on structured data for training, as with ImageNet, where the annotations were as important as the pictures. Although some applications of AI do draw upon open government datasets, both to train models and as part of their operation, some of the largest applications of AI today draw on big data gathered by private companies, which those firms are able to tailor for their machine-learning needs by gathering millions of inputs from user and customer interactions every day. Many modern applications of AI bypass government data for the most part and operate on personal data inside corporations rather than from an open web of public data. This has led to an active debate on the need for improved data governance, including through stronger articulation of data ownership,16 data rights,17 and regulation of AI use.18
In short, algorithms and open data connect in three main ways. First, at the input level, open data can be an input for machine-learning models (or to develop improved non-AI algorithms), and these models can turn unstructured information into structured open data. Second, at the output level, advanced algorithms can be used to analyse open data in ways that would be far too time consuming to do manually. Suitably designed algorithms can find patterns, supporting decisions and informing actions based on the contents of open datasets. As we will see later, it may also be appropriate for algorithms and machine-learning models themselves to be output as open data. Third, at the level of the wider data landscape or environment, the availability of open data may shape who has access to the data they need to carry out AI research or to adopt algorithmic techniques within their business or projects. Equally, the potential power of AI, and the unequal distribution of access to its power, challenges the (contested)19 idea that opening data will inevitably level the playing field of democratic discourse, innovation, or social action. The ability of machines to draw connections between datasets increases the salience of privacy concerns and raises new ethical issues about data publication, as well as concerns that AI systems risk reinforcing systemic biases and patterns of discrimination.
Open data and algorithms in action
It is notable that few other chapters in this volume specifically discuss examples of AI applied to open data, yet a deeper look reveals numerous cases where elements of machine-learning algorithms are being applied to gain economic and social value from open datasets or simply to further research. Moreover, there are signs that governments are increasingly recognising the important relationship between open data and the development of AI, particularly for countries without a predominance of large private sector data-rich businesses.
Policy: Open data as an engine for AI growth
In October 2016, in the last days of the Obama administration, a report from the Office of Science and Technology Policy on “Preparing for the Future of Artificial Intelligence” included among its leading recommendations the idea of an “Open Data for AI” initiative with the objective of “releasing a significant number of government data sets to accelerate AI research and galvanise the use of open data standards and best practices across government, academia, and the private sector”.20 Similar ideas are found in a 2017 report from the UK Royal Society that called for “continued efforts […] to enhance the availability and usability of public sector data” in order to provide “open data for machine learning”,21 in calls for the Government of India to release more public data to AI developers,22 in recommendations from the University of Pretoria that stakeholders in Africa should “adopt open data initiatives as a way of using technology to support distributed innovation and to make AI development more participatory and transparent”,23 as well as in AI strategies under development in Uruguay and Argentina, to name just a few. Oxford Insights’ 2017 Government AI Readiness Index also draws on the Open Data Barometer and the Organisation for Economic Co-operation and Development (OECD) OURdata index to assess the quality of digital infrastructure available for AI innovation.24 Simply put, open data is seen as a resource governments can use to lower barriers to entry to AI research and to support domestic AI industries to develop.25 Government company registers, for example, are often used as a reliable source against which to reconcile data extracted from unstructured documents and filings. However, while these arguments for open data appear in a number of AI-focused policy papers, there is less evidence that open data communities have used policy engagement and commitments around AI as an additional “sales pitch” to overcome flagging engagement in open data policy initiatives.
Alongside recognition of open data as a resource for AI innovation, there is generally strong recognition that this is not just a case of releasing data that exists. Rather, with awareness of the problems that biased data can create, a study from the University of Pretoria describes the deeply political problems associated with data supply and the insufficiency of data about marginalised communities or data covering informal economies. The study suggests that addressing these challenges may require substantial changes to government data ecosystems, as well as efforts to open data from publicly funded academic institutions and to incentivise the sharing of non-proprietary data from the private sector.26
This question of how to incentivise, or provide regulatory mandates for, private sector actors to share their data is increasingly salient in the context of AI. Tennison argues that while open data approaches remain a net positive overall, ultimately “more data disproportionately benefits big tech”.27 As long as large firms have both the computational resources and the access to proprietary datasets to combine with open data, they are likely to maintain a competitive advantage. This can create monopoly and competition issues that require new regulatory responses. One emerging voluntary solution, appropriate when the data in question might involve commercially sensitive information or personal data, involves the creation of data trusts: practical and legal mechanisms for sharing datasets and supporting data use that can protect data subjects and ensure good data governance.28 For some, attempts by open data organisations to engage with this agenda represent a dilution of commitments to openness and present a risk to the future coherence of the open data agenda. For others, it represents a necessary development, recognising that to shift the balance of market and government power created by data requires a wider range of approaches to data openness and sharing.29
Analysis: Algorithms unlocking open data value
The McKinsey Global Institute has developed a library of over 160 AI use-cases and has found that AI is being applied in relation to each of the 17 Sustainable Development Goals (SDGs).30 They identify that the greatest number of use-cases is associated with Health; Peace, Justice and Strong Institutions; and Education. Although it is not clear how many of the surveyed cases directly use or generate open data, given that deep learning on structured data is the most common AI capability deployed, it is reasonable to assume that open data is a component for many of them. Although McKinsey is focused heavily on examples from the US, cases of algorithmic and machine-learning approaches to open data can be found across the world. For example, since 2013, the Data Science Africa31 initiative has provided a hub for machine-learning research on the continent, and labs such as Makerere University’s AI Research Group32 are among a number exploring traffic models, air quality assessment, and crop disease surveillance using AI-driven systems.33 In Asia, a public–private partnership, the Centre of Excellence on Data Science and Artificial Intelligence established by the State Government of Telangana in India, is just one example of numerous university and private sector programmes building capacity to work with both public and private data.34 And in Latin America, researchers and civil society groups are deploying algorithmic analysis to find patterns in data and promote government accountability (see Serenata.ai: Case study). Crucially, the way algorithms and machine learning are currently applied reflects the availability of structured data in different regions. At Makerere University, for example, applications focus on generating structured data from images, addressing challenges of data scarcity prior to analysis.
This points to the possibility that machine learning may allow some countries to leapfrog the requirement for governments to collect particular structured datasets, instead addressing operational and policy-making data needs by turning unstructured data into structured data.
Serenata.ai: Case study
“Operação Serenata De Amor” [Operation Serenade of Love] is a crowdfunded project hosted by Open Knowledge Brazil using algorithms and AI to audit public spending and politicians’ expenses.35,36 The system uses a popular machine-learning toolkit for the open source Python language to classify incoming expense records and identify outliers that it flags for human investigation. To date, the tool has identified over 8 000 suspicious payments, totalling BRL 3.6 million. During 2017, the project ran a hackathon event to carry out a “citizens’ audit” of suspicious expense claims and generated over 200 reports to the official auditing body.37 In total, the tool has processed more than 1.5 million expense claims.
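The core idea of flagging outlying claims for human review can be sketched in plain Python. This is a deliberately simplified stand-in: the real project uses a full machine-learning toolkit and many features per claim, whereas here a single robust statistic (the median absolute deviation) plays the role of the model, and the claim amounts are invented for the example.

```python
# A simplified sketch of the outlier-flagging idea behind Serenata-style
# expense auditing (illustrative only; not the project's actual model).
from statistics import median

def flag_outliers(amounts, k=6.0):
    """Flag expense amounts that sit far above what is typical.

    A claim is suspicious if it exceeds the median by more than
    k times the median absolute deviation (MAD) of all claims.
    """
    med = median(amounts)
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:           # all claims identical: nothing stands out
        return []
    return [a for a in amounts if (a - med) / mad > k]

# Hypothetical meal-expense claims in BRL: most are routine, one is not.
claims = [35.0, 42.5, 38.0, 51.0, 47.0, 40.0, 44.0, 1200.0]
suspicious = flag_outliers(claims)
# Only the extreme claim is queued for human investigation.
assert suspicious == [1200.0]
```

The crucial design point, which the sketch shares with the real system, is that the algorithm does not decide guilt: it narrows 1.5 million claims down to a shortlist that humans can actually investigate.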
Alongside their role in data extraction and the filtering needed to find the signal in the noise of large datasets, it is the predictive power of certain algorithms and machine learning that holds a particular attraction. Rather than simply visualising data for humans to interpret, algorithmic analysis holds the promise of constructing models that can find patterns and guide future action. As illustrated below (see Dengue fever: Training decision trees in Paraguay), such predictive models can provide data that might guide health decision-making. However, perhaps the most widely discussed predictive AI applications relate to crime and justice, both in terms of predictive policing and algorithmic sentencing decisions. AI models using open data to predict future crimes have been developed by researchers in the UK,38 US,39 India,40 and a number of other countries, although there is little to suggest that these public models have moved beyond proof-of-concept. There is evidence, however, of police forces and courts making use of proprietary systems to guide their decision-making, from the allocation of police resources41 through to sentencing and bail decisions.42 How far such systems are able to draw upon data from other parts of government as a result of open data policies is an area that warrants further investigation.
AI-enabled interfaces have also been explored as a means to make particular open datasets more accessible. Potential users of open data are often only interested in one specific fact,43 but knowing how to search for it can be challenging. Responding to the need for new interfaces to open data, Porreca et al. made use of IBM’s Watson platform and a dataset of Italian infrastructure projects to create an experimental Facebook Messenger chatbot that could answer questions about infrastructure spending.44 In a similar vein, in 2016, the Taiwanese firm DSP ran a challenge to create a “procurement chatbot” that could use public data on contracting to answer questions from potential bidders.45 In other settings, machine-learning-based approaches have the potential to overcome the language barriers to using global open datasets, improving the accessibility of open data and supporting greater inclusion in the work to address the SDGs.
However, for all the potential of machine learning, it is often neither necessary nor sufficient to support the effective use of open data. The World Wide Web Foundation has reported on a case in Uruguay, where the Ministry of the Interior, following a trial of a predictive policing algorithm, abandoned it and turned instead to the use of retrospective statistical analysis methods.46 When the potential downsides of AI are taken into account, it becomes clear that, beyond experimentation, the full application of AI requires a high degree of data literacy and also requires AI solutions to be assessed alongside other less technologically advanced, but potentially more appropriate, methods of data analysis and use.
Dengue fever: Training decision trees in Paraguay
Dengue fever is endemic in Paraguay and a growing public health concern, along with other mosquito-borne diseases, including Zika and Chikungunya. In 2016, using data on past dengue fever outbreaks collected by national health surveillance systems and converting published data to a common standard, researchers at the Universidad Nacional de Asunción in Paraguay were able to train and test a decision tree-based machine-learning model to predict future dengue cases based on climate, geographic, and demographic variables drawn from additional open datasets.47 Care was taken in the project design to consider the privacy of individuals with recorded dengue cases.
The research, which resulted in an open source web application, demonstrated the potential of machine learning to generate new insights from data. Because the machine-learning method chosen uses decision trees rather than deep learning neural networks, it generates a more legible model, which aids interpretation of how different variables influence the predictions generated. Writing about the case in 2017, GovLab noted that the ultimate impact of the new predictive model would depend on how far it could influence strategy within the government agency responsible for responding to disease epidemics.48
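The legibility point can be made concrete with a toy example. The sketch below fits the simplest possible tree, a single split (a "decision stump"), on one invented climate variable; the Paraguayan study trained fuller trees over many open-data variables, so everything here, from the feature to the data values, is a hypothetical stand-in.

```python
# Why a decision-tree model is "legible": the learned rules can be read
# back directly. This toy learner fits a single split (a one-node tree,
# or "stump") on a hypothetical climate variable.

def fit_stump(rows):
    """Find the threshold on one feature that best separates the labels.

    rows: list of (feature_value, label) pairs, label in {0, 1}.
    Returns the threshold: a row is predicted 1 iff value >= threshold.
    """
    best_threshold, best_correct = None, -1
    for threshold, _ in rows:
        correct = sum((value >= threshold) == label for value, label in rows)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

# Hypothetical training data: (weekly mean temperature in °C, outbreak 0/1).
data = [(18, 0), (20, 0), (22, 0), (24, 0), (27, 1), (29, 1), (31, 1)]
t = fit_stump(data)
# The fitted model is a single human-readable rule, not a black box:
print(f"IF temperature >= {t} THEN predict outbreak")
assert all((value >= t) == label for value, label in data)
```

A full decision tree repeats this split-finding recursively over many variables, but the output remains a readable chain of IF/THEN rules, which is exactly what made the Paraguayan model easier to interpret than a neural network.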
Adverse impacts: Addressing the data environment
Concerns have been widely raised about the potential for algorithmic discrimination, when biases in the data on which machine-learning models are trained are then reflected in the outcomes of their operation. This can reinforce patterns of social exclusion. When the datasets used to train these models are open, it becomes possible to scrutinise both whether particular populations are underrepresented in the data and whether particular fields are missing or collected using biased classifications. Where models are failing to take into account key variables, providing new open datasets to fill that gap has the potential to improve AI applications. This is particularly relevant in a development context, where there is a risk that models optimised for the data available in developed economies might be applied to developing countries where limited data flow renders them much less effective. As the Web Foundation argues, “Opening key datasets will help identify potential biases, lead to more competition between potential service providers, ensure better public services, and increase citizen trust in government.”50 However, given that many applications of AI involve making decisions about individuals, there are inevitably difficult privacy challenges to navigate in many instances,51 and the potential use of AI to connect up disparate databases may turn data that did not previously have privacy risks into sensitive data.52
As yet, there are few identifiable cases of projects engaging in the strategic creation, publication, or suppression of open data with AI specifically in mind. However, as the following section illustrates, there is a growing critical community around AI where such issues may need to be addressed in future.
Open data and AI: Critical friends?
Over the last three years, there has been an explosion of work looking at the potential risks of AI, developing ethical frameworks for AI, and creating partnerships to carry out public interest research and advocacy on AI development. A number of organisations active in open data, such as the Web Foundation,53 Iniciativa Latinoamericana por los Datos Abiertos (ILDA),54 and the Open Data Institute (ODI),55 have developed programmes of work around AI. The International Development Research Centre (IDRC) has initiated an Artificial Intelligence for Development (AI4D) programme,56 building on their experience with open data for development, and many more organisations have been established with an AI focus. The Fairness, Accountability and Transparency in Machine Learning (FATML) community,57 for example, has grown steadily since its first meetings in 2014, although questions of how to make algorithms truly accountable and transparent remain technically and politically challenging. Transparency may not refer to disclosure of the full contents of datasets, but could involve providing more detailed and standardised metadata that describes why and how the data was collected and covers issues relevant to decisions about whether to use the data for machine learning.58 The multi-stakeholder Partnership on AI talks of accountability in terms of “systems that can explain the rationale for inferences” that they draw,59 and, more recently, the AI Commons has brought together the AI community to explore the concept of a data commons.60
For some, bringing openness to algorithms and AI involves not just open data, but also open source. However, while many AI frameworks are available as open source software, having access to complex source code often does little to support the auditing of the algorithm against expected outputs,61 and where a model has been trained at least partially on proprietary or sensitive data, copyright62,63 or privacy issues may act as a block to fully opening the black box. In New York, the creation of a task force on “Automated Decision Making”64 in 2018 led to early proposals that agencies using algorithmic systems should also accept user-submitted datasets to be processed by the agencies’ algorithms, with the outputs provided back to the user to allow them to assess whether the system is drawing fair and legitimate conclusions.
ODI has looked beyond data and source code to argue that engagement with AI needs an “open culture”,65 hinting at the kinds of organisational change needed to maintain legitimacy when using systems that may produce useful outputs but are less amenable to simple explanation. The Web Foundation’s publication on AI in Latin America operationalises some of these ideas by summarising the kinds of transparency, public engagement, and accountability activities needed for each element of an algorithmic system, from data collection through model set-up to execution and interpretation of output, as well as in relation to the socio-legal frameworks against which algorithms are deployed.66 Lepri et al. point to “living labs” as a potential testing ground for algorithmic systems.67
Overall, open data ideas are clearly at play in shaping responses to AI, albeit with a focus on public sector applications of AI in most cases. Yet there may also be a case for exploring open data strategies not as a fix for the weak points of AI but as a distinct and different response to the challenges of governance and decision-making in the modern world. When Lepri et al. describe the “tyranny of data”, they point to the kind of technocratic and top-down decision-making that comes from large centralised data systems. However, contrary to the theoretical ideals of a distributed Semantic Web inherent in much early open data advocacy, modern AI systems, almost by definition, rely upon the centralisation of large quantities of data on cloud computing platforms that are only accessible to those with adequate financial and technical resources, not to mention the high energy costs of continually refining machine-learning models.68 When algorithms are proprietary, the market structure around AI-based data analysis may also be very different from that around open data (e.g. giving startups easier access to capital, but making it more difficult to align social goals with the commercial objectives of the platforms).
It may be as important to defend “small data” and to promote accessible methods of data literacy and data analysis as it is to critically engage with big data, AI, and algorithms. As ALLDATA, the International Conference on Big Data, Small Data, Linked Data, and Open Data, recognises, different kinds of data require different analytical approaches, and not all data is big data.69 While machine-learning techniques are often biased toward personalisation and granular decision-making, many applications of open data seek to surface societal problems and to support collective action and empowerment to address sustainable development challenges.
Open data work has already had a powerful and positive influence on the development of AI. Policy interest in supporting domestic AI development has the potential to support both the release of government data and regulatory action to encourage private sector data disclosures. Open data is one of the tools that concerned observers of AI are turning to in order to sketch out ways to make machine learning fair, accountable, and transparent. At the same time, a realisation of what AI can do is starting to change the privacy conversation around open data.
There is a risk that open data work is seen just as the infrastructural labour required to provide the raw material for AI rather than as a distinct area of activity with a much broader role and, at times, divergent set of values. Contemporary AI has much to learn from the last decade of open data, including the need to be problem, rather than data or technology, led and the need for nuance in sector-by-sector approaches rather than the simple application of broad brush principles. The challenges of going from proof-of-concept or prototype to sustainable products that influence policy and practice are also shared by AI and open data communities.
In response to the current state of AI and open data, three specific actions are needed:
- Address machine learning within wider open data literacy building. To make sure the use of AI is led by the need to solve a problem and not by the technology, data literacy and other capacity building should cover machine learning, addressing it as an analytical tool and as a social phenomenon affecting the data environment in which we live. This should support learners in choosing between algorithms and other analytical approaches when identifying the best tools for their own data analysis needs, as well as improving their capacity to critically understand the impacts of AI on their work and their communities. Support for critical literacy among policy-makers in the Global South is particularly important, so that effective decisions can be made about when to use data derived from machine learning and when to continue to invest in state-led structured data collection and disclosure.
- Track machine learning and open data cases across the world. There is a pressing need for baseline data in order to better understand what is really happening with AI and open data,70 cutting through the hype and assessing the depth and breadth of open data practice in work on AI. Work by ILDA on AI addresses the data infrastructure of each AI use-case,71 and the Web Foundation’s work on AI in Latin America includes a case study variable to identify whether inputs to AI projects were open data.72 It will be important for other larger scale AI studies to do the same and for open data studies to incorporate variables that can identify where machine-learning methods are in use.
- Consciously create open data infrastructures. Smith and Neupane describe how “AI leverages existing infrastructure”, but that when we are operating from a starting point of stark inequality, there is no reason to assume that the outputs will be ethical or equitable.73 This makes it all the more important to bring data infrastructures into view by opening them and to actively work to create more equitable open data infrastructures in terms of the datasets, standards, and governance that they are made up of.
Addressing these issues will require deeper collaboration between the AI and open data communities and will undoubtedly benefit from the critical understanding of the messy reality of government data that has been developed by open data practitioners over the last decade.