There is a big problem around open data, especially government data, and in Hadley’s case UK government data. Open data is created during regular operations; taxpayers have paid for the infrastructure, so the data should be made available to them. It is used to make government operations more open and accountable, identify efficiencies, and reduce corruption. Governments spend a lot of time handling their own data. This is not about WikiLeaks. The Tetherless team has 1.1M datasets available, in multiple languages.
In the UK, we talk about four kinds of data: historical (trends, performance), planning (future, forecasting, permits), infrastructural (opening hours; especially useful when something changes, like when a bridge is out), and operational (where trains are, weather; this is real time and costs more to make available). Types: transport, healthcare, demographics, mapping, crimes.
What can we do with it?
- transparency (government info going out: lobbying, statistics, activities, performance data),
- delivering services (job centre, National Health Service),
- improving commercial and non-governmental products and services (VAT codes), and
- public sector efficiencies (navigating public services, parliamentary questions).
Chasing the economic impact: money is being made. It is a $1.5B industry in the US. GPS too. The UK has recently engaged with the Open Data Institute (Tim Berners-Lee). The challenge is that data is flowing through everything. How do we measure it? What is the role of third parties in collecting and publishing?
We’re not there yet: there are lots of publishers (10,000s of publishers, many of them IT or HR teams). Pictures of data are not data. Many challenges are human. There is also a variety of formats (RDF, XML, JSON, CSV, HTML, etc.): what are developers using? Mapping data may need to be converted to different formats. Licences: regular copyright, Creative Commons, Open Government Licence; many orgs can’t set their own licence because they’re legally part of the Crown. Machine readability: charts, tables, tables-inside-tables, and images are not the same as raw data. Access points and SLAs (service level agreements) are not matched to the access and availability people need.
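The format-conversion point can be illustrated with a minimal sketch, assuming the publisher's data is a CSV file with a header row and the developer wants JSON. The column names and sample rows are invented for illustration; only the Python standard library is used.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Convert CSV text with a header row into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [dict(row) for row in reader]
    return json.dumps(rows, indent=2)


# Hypothetical sample: the columns are illustrative, not from a real dataset.
sample = "station,opening_time\nKings Cross,05:00\nEuston,05:30\n"
print(csv_to_json(sample))
```

Note that every value comes out as a string: CSV carries no type information, which is one reason typed, machine-readable formats matter.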
Data quality: the data shows 20,000 men going through a midwife episode (i.e., giving birth) in 2009. Data problem? We need to find a way to communicate how reliable the data is. It is also difficult to get corrections back to the publishers. If there’s a reference to a clearly incorrect year, and the info comes from data.gov.uk, who do you contact? How do you correct this in a manageable way?
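One way to surface problems like these is a simple plausibility check that flags suspicious records so they can be reported back to the publisher. The following is a rough sketch, not anything presented in the talk: the `Episode` record, its field names, and the year bounds are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # Hypothetical record layout; real datasets will differ.
    record_id: str
    sex: str            # "M" or "F" as recorded by the publisher
    episode_type: str
    year: int


def flag_implausible(episodes, min_year=1990, max_year=2012):
    """Return (record_id, reason) pairs for records failing simple sanity rules.

    The year bounds are arbitrary assumptions chosen for this example.
    """
    flags = []
    for e in episodes:
        if e.episode_type == "midwife" and e.sex == "M":
            flags.append((e.record_id, "male patient with midwife episode"))
        if not (min_year <= e.year <= max_year):
            flags.append((e.record_id, f"year {e.year} outside {min_year}-{max_year}"))
    return flags


sample = [
    Episode("r1", "F", "midwife", 2009),
    Episode("r2", "M", "midwife", 2009),
    Episode("r3", "F", "midwife", 2090),
]
for record_id, reason in flag_implausible(sample):
    print(record_id, reason)
```

A flag is only a prompt for a human to look: the record might be miscoded, or the rule itself might be wrong, which is why a route back to the publisher matters.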
We’re all struggling to build lists again, which shows that we have a big discovery problem. We need to make data findable, then encourage more people to publish.
How to fix this? Adding context through linked data: pothole address sources and location data tied to Royal Mail, Ordnance Survey, and maps can be mashed up with other data. Semantically enriching data (watch vocabulary). Machines understand the context of datasets.
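The mash-up idea can be sketched with plain subject-predicate-object triples: two datasets from different publishers become joinable once they refer to the same thing by the same identifier. Everything here is hypothetical, including the `ex:` identifiers, the predicate names, and the coordinates; a real system would use an RDF library and published vocabularies.

```python
# Triples are (subject, predicate, object). All identifiers are made up.
POTHOLES = [
    ("ex:pothole/1", "ex:reportedAt", "ex:postcode/AB1-2CD"),
    ("ex:pothole/2", "ex:reportedAt", "ex:postcode/EF3-4GH"),
]
LOCATIONS = [
    ("ex:postcode/AB1-2CD", "ex:lat", 51.50),
    ("ex:postcode/AB1-2CD", "ex:long", -0.12),
    ("ex:postcode/EF3-4GH", "ex:lat", 53.48),
    ("ex:postcode/EF3-4GH", "ex:long", -2.24),
]


def objects(triples, subject, predicate):
    """Return all objects matching a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def locate_potholes(potholes, locations):
    """Join pothole reports to coordinates through the shared postcode identifier."""
    result = {}
    for pothole, predicate, postcode in potholes:
        if predicate != "ex:reportedAt":
            continue
        lat = objects(locations, postcode, "ex:lat")
        lon = objects(locations, postcode, "ex:long")
        if lat and lon:
            result[pothole] = (lat[0], lon[0])
    return result


print(locate_potholes(POTHOLES, LOCATIONS))
```

The join works only because both datasets use the same postcode identifier, which is the practical argument for shared vocabularies and stable identifiers.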
Coordinated effort: LinkedGov aims to create a clean, usable, understandable body of UK government data for any purpose. Machine- and human-readable, typed, internally linked. Much to do! Creating familiar interfaces (Google Refine?), with wizards for viewing and cleaning data. Lots of tasks underway.