
GML and Unstructured Data

In the previous blog post, we discussed the use of CSW-ebRIM in managing and structuring unstructured data (i.e. attaching meaning to it). In this post, we look at the relationship between GML (Geography Markup Language) and unstructured data.

Unstructured data is data for which there is no data model, or at least no data model that exposes any of the semantics of the data. An HTML document, for example, might have a well-defined structure, but this is no help in understanding the document, since the markup is only intended for visualization and browser interaction control. Much can sometimes be inferred, but in essence the HTML model for the document is not very helpful for understanding the content. The same can be said of other machine-readable but not machine-understandable document formats such as PDF or even KML. Of course, what it means to understand a document may depend heavily on the application, so in some sense all data can be considered unstructured.

As stated in the previous post, unstructured data has no data model, or has a data model that does not expose any of the data’s semantics. HTML documents are a typical example of this: their structure is well defined, but it is not very helpful for understanding the document because the markup is intended for presentation and browser interaction. Other machine-readable, but not machine-understandable, formats include PDF and KML. Some information about such documents can be inferred from their structure, but usually not enough to understand the content.

Is GML unstructured data?  While GML can certainly be created without a schema (hence no visible data model), the usual view of GML is that there is an associated schema which helps express the meaning of the content, so in this sense, GML would not be thought of as unstructured.  Unstructured data is most typically associated with presentation (i.e. is a presentation), and GML is not about presentation.

GML can, of course, be used to annotate unstructured documents (i.e. to make them structured and add meaning). For example, instead of trying to collect up all the related .prj and .dbf files, one could simply encode a Shape file (.shp) as the value of a GML property (e.g. base 64 encoded), and then add properties in the same GML document that correspond to the fields in the .dbf file and to the CRS referenced in the .prj file. Note that this does not require the .shp file to be converted, and most systems can easily read the limited amount of XML required to capture the .dbf information.
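As a rough illustration of the Shape file idea, here is a minimal Python sketch. It is not taken from any GML application schema: the `app` namespace, the `ShapefileWrapper`, `crs` and `shapeData` element names, and the `wrap_shapefile` / `unwrap_shapefile` helpers are all hypothetical, and it assumes the .dbf information has already been read into a dictionary whose keys are valid XML names. Only the GML 3.2 namespace URI is real.

```python
import base64
import xml.etree.ElementTree as ET

GML_NS = "http://www.opengis.net/gml/3.2"
APP_NS = "http://example.com/app"  # hypothetical application namespace

ET.register_namespace("gml", GML_NS)
ET.register_namespace("app", APP_NS)


def wrap_shapefile(shp_bytes, attributes, crs_uri, feature_id):
    """Build a GML-style feature that carries the .shp bytes unconverted."""
    feature = ET.Element(
        f"{{{APP_NS}}}ShapefileWrapper",  # hypothetical feature type
        {f"{{{GML_NS}}}id": feature_id},
    )

    # The CRS named in the .prj file, recorded as a plain reference.
    crs = ET.SubElement(feature, f"{{{APP_NS}}}crs")
    crs.text = crs_uri

    # One property per .dbf field (assumes field names are valid XML names).
    for name, value in attributes.items():
        prop = ET.SubElement(feature, f"{{{APP_NS}}}{name}")
        prop.text = str(value)

    # The Shape file itself, base64 encoded as the value of a property.
    payload = ET.SubElement(
        feature, f"{{{APP_NS}}}shapeData", {"encoding": "base64"}
    )
    payload.text = base64.b64encode(shp_bytes).decode("ascii")
    return feature


def unwrap_shapefile(feature):
    """Recover the original .shp bytes from the wrapper feature."""
    payload = feature.find(f"{{{APP_NS}}}shapeData")
    return base64.b64decode(payload.text)
```

A reader that only cares about the attributes can parse the few XML properties and ignore the payload, while a reader that understands Shape files can call `unwrap_shapefile` and get back exactly the bytes that went in.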

Another example, which is a core part of GML, is that of describing geographic imagery. In this case, the GML points to the image data file (through its URI) and provides additional properties that describe the structure of that image and the meaning of the “pixels” in the image (e.g. they might be temperatures, radiances, etc.). In this sense, GML is being used quite explicitly for adding content and meaning to unstructured data. In the context of GMLJP2, this goes even further, bundling the image description (as above) and the image data within a single data package (in this case, JPX). This data package can also contain vector geographic features and annotations.
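The pattern of "point to the image, then describe it" can be sketched in a few lines of Python. This is a simplified illustration only and does not reproduce the GML coverage schema: the `app` namespace, the `ImageDescription`, `imageData`, `gridSize` and `pixelMeaning` element names, and the `describe_image` helper are assumptions made for the example. The XLink namespace URI is the standard one.

```python
import xml.etree.ElementTree as ET

APP_NS = "http://example.com/app"  # hypothetical application namespace
XLINK_NS = "http://www.w3.org/1999/xlink"

ET.register_namespace("app", APP_NS)
ET.register_namespace("xlink", XLINK_NS)


def describe_image(image_uri, columns, rows, quantity, uom):
    """Describe an external image file without touching its contents."""
    desc = ET.Element(f"{{{APP_NS}}}ImageDescription")

    # Reference to the image data file through its URI.
    ET.SubElement(
        desc, f"{{{APP_NS}}}imageData", {f"{{{XLINK_NS}}}href": image_uri}
    )

    # Structure of the image: how many pixels in each direction.
    grid = ET.SubElement(desc, f"{{{APP_NS}}}gridSize")
    grid.text = f"{columns} {rows}"

    # Meaning of each pixel value, e.g. a temperature in kelvin.
    meaning = ET.SubElement(desc, f"{{{APP_NS}}}pixelMeaning", {"uom": uom})
    meaning.text = quantity
    return desc


if __name__ == "__main__":
    description = describe_image(
        "http://example.com/images/scene.jp2", 1024, 768, "temperature", "K"
    )
    print(ET.tostring(description, encoding="unicode"))
```

The image file stays exactly as it was; everything a consumer needs in order to interpret the pixels lives in the small XML description beside it.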

GML is structured data, and GML schemas provide a convenient means to add content descriptors (especially relative to geography, time, and connectivity) to many kinds of unstructured or not-so-structured data, from Shape files to JPEG2000 images.