
Managing Unstructured Data in CSW-ebRIM

In the world of information sharing, there is a lot of so-called “unstructured data”.  Unstructured data is data for which there is no data model, or at least no data model that exposes any of the semantics of the data.  An HTML document, for example, might have a well-defined structure, but this is of little help in understanding the document, since the markup is only intended for visualization and browser interaction control.  Much can sometimes be inferred but, in essence, the HTML model for the document says little about its content.  The same can be said of other machine-readable, but not machine-understandable, document formats such as PDF or even KML.  Of course, what it means to understand a document may be very application-specific so, in some sense, all data can be considered unstructured.

Since a great deal of information falls into the semi- or un-structured category, there is an interest in being able to attach information that adds structure or meaning to the content.  Of course, wherever possible, this is to be done without changing the content itself.  It might be achieved by embedding or attaching information elements to the document, or by managing the document in a registry which, in turn, provides the extended information externally to the document.  In many cases, both approaches might be used.  The advantages and disadvantages of each approach are fairly obvious.  In the embedded case, the clear advantage is that all of the information is in one information package.  In the exterior description approach, the clear advantage is that it can support multiple descriptions (perhaps for different applications) without cluttering a given package with extraneous information.

In this blog, we explore the use of CSW-ebRIM (see “CSW-ebRIM Registry Service – Part 1: ebRIM profile of CSW (1.0.1)”, OGC document 07-110r4) for the management of semi-structured or unstructured data, and show how this can be exploited in both the embedded and exterior description use cases.

CSW-ebRIM is a standard from the Open Geospatial Consortium which builds on an OASIS standard called ebRIM (ebXML Registry Information Model).  CSW-ebRIM makes use of something called Reg-Rep, meaning that there is both a Registry (containing registry objects) and a Repository (containing repository items), with the registry objects referencing and pointing to the associated repository items.  Think of the repository items as the unstructured information items, and the related registry objects as descriptions that expose their semantics.

Each repository item (e.g. an HTML document) has a Uniform Resource Name (URN), which is a Uniform Resource Identifier (URI) that uses the URN scheme.  The URN allows the repository item to be readily retrieved from the Registry using a simple GET request (e.g. issued from a web client such as a browser).
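
As a rough sketch, such a GET request could be put together as shown below.  The endpoint and the URN are made up for illustration, and the parameter names (service, request, id) reflect our reading of the GetRepositoryItem request in the CSW-ebRIM specification, so check them against the registry you actually use.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint and URN, used only for illustration.
REGISTRY_ENDPOINT = "http://registry.example.org/csw"
ITEM_URN = "urn:example:repository-item:audio-clip-42"


def repository_item_url(endpoint: str, urn: str) -> str:
    """Build a GET URL asking the registry for one repository item."""
    # Parameter names are an assumption based on the GetRepositoryItem request.
    params = {"service": "CSW-ebRIM", "request": "GetRepositoryItem", "id": urn}
    # urlencode percent-encodes the colons in the URN.
    return endpoint + "?" + urlencode(params)


url = repository_item_url(REGISTRY_ENDPOINT, ITEM_URN)
print(url)
```

Pasting the resulting URL into a browser is all a client would need to do to fetch the item, assuming the registry accepts requests in this form.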

In structuring information about an unstructured information item, a model of the information item is created by selecting and defining tags that help convey the meaning of the information item.  These tags can be completely arbitrary, and the list of tags can be changed on the fly at any time, just by sending a message to the registry.  The tags can have more or less arbitrary types, and so can include simple types (integers, strings, etc.) and also geospatial or temporal tags.  The tags can be used to search the registry and to find and retrieve any repository items (the unstructured stuff).  The search requests can make use of any of the attached tags, including geospatial and temporal constraints, and can even look inside the unstructured items (e.g. look at the content of an HTML document) where that might help in the discovery/access process.
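
To make the idea of tagging and searching concrete, here is a toy in-memory model.  It is not the CSW-ebRIM interface: the class, the tag names (language, location, recorded, speaker_count) and the search function are all invented for this sketch, and a real registry would express the same constraints in its own query language.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class RegistryObject:
    """Toy stand-in for a registry object describing one repository item."""
    item_urn: str
    tags: dict = field(default_factory=dict)


def in_bbox(point, bbox):
    """True if (lon, lat) lies inside (min_lon, min_lat, max_lon, max_lat)."""
    lon, lat = point
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def search(objects, bbox=None, after=None, **equals):
    """Return URNs of items whose tags satisfy every given constraint."""
    hits = []
    for obj in objects:
        t = obj.tags
        # Geospatial constraint: the item must have a location inside the box.
        if bbox is not None and not ("location" in t and in_bbox(t["location"], bbox)):
            continue
        # Temporal constraint: the item must be recorded on or after the date.
        if after is not None and not ("recorded" in t and t["recorded"] >= after):
            continue
        # Simple-typed constraints: exact match on any other tag.
        if any(t.get(k) != v for k, v in equals.items()):
            continue
        hits.append(obj.item_urn)
    return hits


registry = [
    RegistryObject("urn:example:item:1",
                   {"language": "en", "location": (-123.1, 49.3), "recorded": date(2010, 5, 1)}),
    RegistryObject("urn:example:item:2",
                   {"language": "fr", "location": (2.35, 48.85), "recorded": date(2009, 1, 15)}),
]

# A new tag can be attached at any time without touching the repository item.
registry[1].tags["speaker_count"] = 3

vancouver_area = (-124.0, 49.0, -122.0, 50.0)
print(search(registry, bbox=vancouver_area, after=date(2010, 1, 1)))  # ['urn:example:item:1']
print(search(registry, language="fr", speaker_count=3))               # ['urn:example:item:2']
```

The point of the sketch is that the descriptions live beside the items rather than inside them, so the set of tags can grow or change while the items themselves stay untouched.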

The Galdos implementation of CSW-ebRIM enables the automated transformation of the output from the registry using XSLT scripts, which are associated with a given registry object type (e.g. audio clip).  Whenever a request is made for such an object, the associated transformation script is retrieved and automatically applied to the registry object.  In this manner, for example, an Atom feed could be generated in which the registry descriptions are attached (embedded) together with the related content (repository item).
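
The following sketch shows the kind of mapping such a transformation performs, turning a registry description into an Atom entry that carries both the tags and a link to the content.  It is not the Galdos mechanism: Python's standard library has no XSLT processor, so the mapping is written by hand with ElementTree, and the description fields and URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)  # serialize Atom as the default namespace


def to_atom_entry(description: dict, content_url: str) -> ET.Element:
    """Map a (hypothetical) registry description to an Atom entry element."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = description["id"]
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = description["name"]
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = description["updated"]
    # Each descriptive tag is embedded as an Atom category.
    for key, value in description["tags"].items():
        ET.SubElement(entry, f"{{{ATOM_NS}}}category", term=f"{key}={value}")
    # The repository item itself is referenced as an enclosure link.
    ET.SubElement(entry, f"{{{ATOM_NS}}}link", rel="enclosure", href=content_url)
    return entry


audio_clip = {
    "id": "urn:example:registry-object:audio-clip-42",
    "name": "Harbour interview",
    "updated": "2010-05-01T00:00:00Z",
    "tags": {"language": "en", "duration_s": 95},
}
entry = to_atom_entry(audio_clip, "http://registry.example.org/items/audio-clip-42")
print(ET.tostring(entry, encoding="unicode"))
```

In the registry the equivalent logic would sit in an XSLT script bound to the audio clip object type, so that every request for such an object yields the transformed output automatically.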

The Galdos implementation further supports automated notification, so the Atom feed could be updated and sent whenever certain objects in the registry are changed, for example when someone modifies the description of the audio clip semantics or adds a new audio clip.

If you manage unstructured or semi-structured content, and especially if you wish to do this in a geospatial context, then you should be looking at CSW-ebRIM.