Open Data Means Registries

Open Data Means Registries was originally published in the online magazine Informed Infrastructure on March 4, 2013.

The municipal world has been buzzing with the concepts of Open Data. Many regions and municipalities have declared their support for the idea, and a Global Open Data Day was held on February 22, 2013. Organizations have formed and principles have been declared. These principles emphasize that data should be available in real time, free of license constraints, machine processable, fine-grained, and obtained close to the source.

Many of these principles do not go far enough:

  1. Machine-understandable data: The data must be self-describing to the maximum extent possible, meaning that a schema describes the data based on a formal data model. XML data is machine processable, but without an accompanying XML Schema it is impossible to know the data types or what the tags represent. Try writing a program to style the data for visualization without knowledge of the schema.
  2. Clear data provenance: Users must be able to assess the fitness for purpose of the data. This requires that they understand not only the structure and meaning of the data, but also how the data was collected and processed before they acquired it. Without this information, the data might be open, but users are as likely to misuse it as to use it properly. Providing a clear life cycle management process is as essential as openness.
  3. Provision for live data access: Most open data sites are static repositories of data. You might download the data on Tuesday, only to have it change two days later without your knowledge. Data should be available at a live access point, with automated notification of data changes. Otherwise it might be timely on the site, but not so timely in your application.
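The first point above can be illustrated with a short sketch. Without a schema, a parser sees only element names and untyped text; the record below (element names and values invented for illustration) parses cleanly, yet nothing tells a program what the values mean:

```python
import xml.etree.ElementTree as ET

# A small, hypothetical "open data" record published with no schema.
doc = """
<station>
  <id>17</id>
  <reading>3.2</reading>
  <taken>2013-02-22</taken>
</station>
"""

root = ET.fromstring(doc)
for child in root:
    # Every value arrives as a plain string: the parser cannot tell
    # that <reading> is a measurement (in what unit?) or that <taken>
    # is a date rather than an arbitrary label.
    print(child.tag, repr(child.text))
```

With an accompanying XML Schema based on a formal data model, a validating parser could report the declared type of each element instead of leaving the interpretation to guesswork.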

Adding these additional criteria will greatly increase the value of the data. However, more is required.

Many open data sites are no more than FTP servers with text descriptions or catalogues, the latter providing minimal metadata à la Dublin Core. While these sites meet the principles of availability and openness, they often fail in terms of the criteria for utility, especially the criteria listed above.

Viewing open data in static terms (i.e. “send me the files”) is not going to cut it in the Internet of Things. Open data will have to include sensor data streams, from both fixed and mobile sensing devices. It is great to know the shape of the polluted lake at the town’s edge, but a lot more important to know the time histories of the pollutants. Access to live, machine processable data streams with well-defined provenances and supporting information models is essential for the data to be usable and useful.

Meeting these criteria means changing some of our thinking about data encoding and data interchange: breaking with what is “apparently simple and easy” and moving to what is “required and not very difficult”. It means abandoning ad hoc file transfers and manual ETL-style processing in favor of live interconnections using structured, self-describing data.

To put matters in direct terms:

  1. Encode all open data in GML.
  2. Provide access through a registry service (OGC CSW-ebRIM) and through an OGC WFS, or via a web service involved in the data’s creation.
  3. Provide dynamic synchronization; inform users or machines whenever data changes and provide automated data delivery as required.
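As a sketch of what the second point looks like to a client, a WFS is queried with simple key-value requests. The endpoint URL and feature type name below are hypothetical; the parameter names follow the OGC WFS 2.0 GetFeature convention:

```python
from urllib.parse import urlencode

# Hypothetical WFS endpoint and feature type; the key-value parameters
# follow the OGC WFS 2.0 GetFeature convention.
endpoint = "https://data.example.org/wfs"
params = {
    "service": "WFS",
    "version": "2.0.0",
    "request": "GetFeature",
    "typeNames": "city:WaterQualitySite",  # hypothetical feature type
    "outputFormat": "application/gml+xml; version=3.2",
    "count": "100",
}
url = endpoint + "?" + urlencode(params)
print(url)
# A client would fetch this URL (e.g. with urllib.request.urlopen)
# and receive a GML feature collection in response.
```

The same live access point serves every consumer, human or machine, in place of a one-off file download.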

Encoding all data in GML is neither difficult nor limiting. GML is supported by all of the major GIS and database vendors. GML application schemas can be devised for ANY application domain and information model. Many schemas, such as DIGGS, AIXM, CityGML, O&M, GeoSciML, and SensorML, are open standards in their own right, and cover a wide spectrum of application domains, from city modeling and aviation to geotechnical engineering, general observations and measurements, and geology. There are also plenty of libraries for generating and reading GML data. Finally, modern processors and parsing technologies have made processing megabytes of GML (thousands or millions of features) no big deal.
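Reading GML requires no exotic tooling; an ordinary namespace-aware XML parser suffices. The sketch below parses a minimal GML 3.2 feature collection (inlined here for illustration; the `app:` feature type is invented, while the `wfs:` and `gml:` namespaces are the standard OGC ones) using only the Python standard library:

```python
import xml.etree.ElementTree as ET

# A minimal GML 3.2 feature collection, inlined for illustration; real
# data would arrive as a WFS response or a downloaded file. The app:
# namespace and feature type are hypothetical.
GML = """
<wfs:FeatureCollection xmlns:wfs="http://www.opengis.net/wfs/2.0"
                       xmlns:gml="http://www.opengis.net/gml/3.2"
                       xmlns:app="http://example.org/app">
  <wfs:member>
    <app:Site gml:id="s1">
      <app:name>Mill Pond</app:name>
      <app:location>
        <gml:Point gml:id="p1" srsName="urn:ogc:def:crs:EPSG::4326">
          <gml:pos>45.42 -75.70</gml:pos>
        </gml:Point>
      </app:location>
    </app:Site>
  </wfs:member>
</wfs:FeatureCollection>
"""

NS = {
    "wfs": "http://www.opengis.net/wfs/2.0",
    "gml": "http://www.opengis.net/gml/3.2",
    "app": "http://example.org/app",
}

root = ET.fromstring(GML)
sites = []
for member in root.findall("wfs:member", NS):
    name = member.findtext(".//app:name", namespaces=NS)
    pos = member.findtext(".//gml:pos", namespaces=NS)
    lat, lon = (float(v) for v in pos.split())
    sites.append((name, lat, lon))

print(sites)  # [('Mill Pond', 45.42, -75.7)]
```

Because the geometry carries its coordinate reference system (`srsName`) with it, the consuming program never has to guess what the numbers mean.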

Registry services (CSW-ebRIM) complement GML data encoding in several ways:

  1. CSW-ebRIM uses GML for its expression of geometry
  2. Registries can store GML features or feature collections as local repository items, or use Web Feature Servers as external repositories
  3. Registries can add semantic value to a WFS, by providing additional classifications for its feature types

Registry services (CSW-ebRIM) provide the perfect vehicle for supporting open data, primarily because of the open character of the registry itself. Registries are transactional and can capture data/metadata both on the fly and programmatically. Registries support automated notification, audit trails, and life cycle status management. Most importantly, Registries provide an open data model that can be used to create any metadata schema for data description, including ISO standards, profiles of ISO standards, and metadata schemas developed for a particular community. Registries and open data – they just go together.
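To make the registry access pattern concrete, a CSW catalogue can also be queried with simple key-value requests. The endpoint and record identifier below are hypothetical; the parameter names follow the OGC CSW 2.0.2 GetRecordById convention:

```python
from urllib.parse import urlencode

# Hypothetical catalogue endpoint and record identifier; parameter
# names follow the OGC CSW 2.0.2 GetRecordById key-value convention.
catalogue = "https://registry.example.org/csw"
params = {
    "service": "CSW",
    "version": "2.0.2",
    "request": "GetRecordById",
    "id": "urn:example:dataset:lake-pollutants",
    "elementSetName": "full",
}
record_url = catalogue + "?" + urlencode(params)
print(record_url)
# Fetching this URL would return the full metadata record -- structure,
# provenance, and life cycle status -- for the identified dataset.
```

A client that first consults the registry, then follows the record to the live WFS access point, gets exactly the combination argued for above: open data plus the metadata needed to use it properly.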