Metadata Architecture

Most of NCPP's tools and services are integrated into the CoG environment.  CoG provides a collaborative participatory environment for the climate research community.  In particular, it provides an environment in which climate research projects and organizations can create links and highlight relationships between one another.  NCPP expects its members and associates to use CoG to document relevant issues and artifacts, ranging from raw climate datasets to polished climate outlook reports.

NCPP will provide added value to those artifacts by:

  1. allowing structured metadata and/or translational information to be associated with those artifacts, and
  2. providing access to tools and services which can manipulate artifacts and/or generate additional artifacts.

Details about the content of metadata are described elsewhere.  The remainder of this document focuses on the architecture that will integrate metadata and related services into CoG for use by NCPP.

This document focuses on tools and services for "metadata" as opposed to "translational information."  Metadata comprises descriptions of the attributes of artifacts.  While these descriptions can be very extensive, they are largely quantitative rather than qualitative.  In contrast, translational information captures a set of value judgements about how artifacts might be used within an application; it includes "narratives" about data.  A metadata document references a particular artifact, whereas translational information mediates between different artifacts.  Additionally, metadata implies a level of structure and formality (i.e., all metadata documents must conform to a specific schema and controlled vocabulary, as detailed below) that might be lacking in translational information.

There are several different types of metadata relevant to NCPP that can be categorized as follows:

  project metadata
    source:          CoG itself
    format:          native Django relational database (currently SQLite)
    schema:          defined using Django models
    how to search:   NA
    what to return:  HTML
    notes:           -

  downscaling descriptions
    source:          user-generated via metadata form embedded within CoG
    format:          Django models generated from CIM; stored in native Django database; can serialize to CIM XML/JSON
    schema:          CIM Experiments / Simulations / Components (including model components and downscaling components)
    how to search:   -
    what to return:  "raw" XML or "pretty" HTML (as displayed in the CIM Editor or Viewer)
    notes:           -

  downscaling evaluations
    source:          user-generated via metadata form embedded within CoG
    format:          Django models generated from CIM; stored in native Django database; can serialize to CIM XML/JSON
    schema:          CIM QualityRecords
    how to search:   -
    what to return:  -
    notes:           -

  "external" metadata instances
    source:          pre-existing CIM instances that have been ingested into the same archive used by other metadata in CoG
    format:          -
    schema:          CIM
    how to search:   -
    what to return:  -
    notes:           -

 

The primary use-cases driving this development are the 2012 Dynamical Core Workshop and the 2013 Downscaling Workshop.  Both of these projects will use CoG to provide access to datasets (or to evaluations of datasets) and to associate metadata instances with those datasets.  The metadata instances can be queried in order to locate particular datasets.

Much of this work is being done under the umbrella of the ES-DOC project.  This is an international effort focused on providing metadata services for the climate research community.  ES-DOC is committed to using the Common Information Model (CIM) as its metadata schema.  The CIM is a general-purpose schema defining structures for several high-level artifacts used in climate modeling (e.g., SoftwareComponents, NumericalExperiments, Grids, etc.).  The CIM uses domain-specific Controlled Vocabularies (CVs) to constrain the content of metadata instances (e.g., Atmospheric Components, CMIP5 Experiments, Arakawa "B" Grids, etc.).  ES-DOC provides a more detailed description of the CIM.  ES-DOC is developing a CIM Editor, CIM Viewer, and CIM Comparator.  NCPP plans on using all of these.  The CIM (its structure, content, and methodologies) represents an emerging standard in climate research.  This fits well with NCPP's goal of interoperability; all artifacts that we develop are intended to benefit and be usable by the entire community.

Architecture: how the pieces fit together

Some of the architectural pieces of NCPP's metadata services

As you can see, NCPP's development role is focused on the CIM Editor and its ability to publish metadata to the central database used by the rest of ES-DOC's tools.  The Viewer and Comparator are JavaScript applications running on an external server which can be called from within a CoG webpage.  The Editor is a Django application which has been added to the server running CoG (itself a Django application).

The CIM schema has been defined in UML and XSD ("XML Schema").  ES-DOC provides Python and Django (a web-framework built in Python) serializations of the CIM schema.  The CIM Controlled Vocabularies (CVs) are defined as formal mindmaps.  Prior to their use by applications these are translated into better-structured XML files.  

A sample CV mindmap

 

The mindmap format may not persist: although mindmaps are well suited to presentation, they do not lend themselves to straightforward implementation.  ES-DOC is considering alternative formats which would allow CVs to be hosted on a webserver with RESTful query services to retrieve the values of particular CV elements.  This would allow CVs to be updated asynchronously from metadata schemas, and no re-coding of ES-DOC's tools would be required to support new types of metadata documents or properties.  Until then, NCPP is committed to using the same mindmap format as the rest of the community.
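Whatever format is eventually chosen, an application needs to answer the same basic question: is a given value permitted for a given CV property?  The following is a minimal sketch of that lookup against a translated XML file.  The element and attribute names (vocabulary, property, value) and the sample terms are assumptions made for illustration; they do not reproduce the actual ES-DOC CV format.

```python
import xml.etree.ElementTree as ET

# Hypothetical CV fragment: the element names, attribute names, and terms are
# illustrative only and are NOT the real ES-DOC CV XML format.
SAMPLE_CV_XML = """
<vocabulary name="downscaling">
    <property name="method_type">
        <value>dynamical</value>
        <value>statistical</value>
    </property>
    <property name="temporal_resolution">
        <value>daily</value>
        <value>monthly</value>
    </property>
</vocabulary>
"""


def load_vocabulary(xml_text):
    """Return {property name: set of permitted values} for one CV document."""
    root = ET.fromstring(xml_text)
    return {
        prop.get("name"): {value.text.strip() for value in prop.findall("value")}
        for prop in root.findall("property")
    }


def is_permitted(vocabulary, property_name, value):
    """True if the value is one of the terms the CV allows for the property."""
    return value in vocabulary.get(property_name, set())


cv = load_vocabulary(SAMPLE_CV_XML)
print(is_permitted(cv, "method_type", "statistical"))
```

A RESTful CV service of the kind ES-DOC is considering would replace load_vocabulary with a remote query, leaving the lookup itself unchanged.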

NCPP is coordinating the definition of CVs for dynamical and statistical downscaling.

As mentioned above, ES-DOC's tools work from Python and Django serializations of the CIM rather than from the UML or XSD directly.  Currently, these serializations are written by hand; in the long term they are expected to be autogenerated from the UML (just as the XSD is currently autogenerated from the UML).  Regarding the CIM Editor, the Django Models representing CIM Classes are surprisingly simple:

from cim_editor.models import *

# Register this set of models with the Editor as version 1.5 of the CIM.
MetadataVersion.factory({"name":"CIM","version":"1.5"})

@CIMDocument()
class ModelComponent(MetadataModel):

    class Meta:
        abstract = False

    _name           = "ModelComponent"
    _title          = "Model Component"
    _description    = "A scientific model"

    shortName           = MetadataAtomicField.Factory("charfield",max_length=LIL_STRING,blank=False)
    longName            = MetadataAtomicField.Factory("charfield",max_length=BIG_STRING,blank=False)
    description         = MetadataAtomicField.Factory("textfield",blank=True)
    responsibleParties  = MetadataManyToManyField(targetModel='cim_1_5.ResponsibleParty',sourceModel="cim_1_5.ModelComponent")

    def __init__(self,*args,**kwargs):
        super(ModelComponent,self).__init__(*args,**kwargs)

 

All that is required is to define the attributes of each model and register the version.  This is clearly the sort of boilerplate code that could be autogenerated.  A version is simply a collection of models (each model corresponds to a CIM class).  Once registered with the CIM Editor application, those models are available to be edited by users.  Any project can select from any model in any version to edit. 
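The relationship between versions and models can be pictured as a simple registry.  The sketch below is framework-free and purely illustrative: MetadataVersionRegistry and its methods are hypothetical names and are not part of the CIM Editor's actual API.

```python
class MetadataVersionRegistry:
    """Hypothetical stand-in for the Editor's version registration."""

    def __init__(self):
        # (schema name, version string) -> {model name: model class}
        self._versions = {}

    def register(self, name, version, models):
        """Record a version as a named collection of model classes."""
        self._versions[(name, version)] = {m.__name__: m for m in models}

    def get_model(self, name, version, model_name):
        """Look up one model class from one registered version."""
        return self._versions[(name, version)][model_name]


# Placeholder classes standing in for the Django models of CIM classes.
class ModelComponent:
    pass


class ResponsibleParty:
    pass


registry = MetadataVersionRegistry()
registry.register("CIM", "1.5", [ModelComponent, ResponsibleParty])

# A project picks the model it wants to edit from a specific version.
model = registry.get_model("CIM", "1.5", "ModelComponent")
```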

Additionally, the version has a single corresponding "categorization" and the model has a set of corresponding "vocabularies."  A categorization defines a mapping between individual model attributes and categories.  In the Editor (as well as the Viewer) categories are rendered as tabs along the top of the form.  Since the CIM is a standard spanning multiple communities in climate research, the presentation of CIM Metadata should be standardized as well.  That presentation is not part of the schema itself, nor does it belong in a controlled vocabulary - hence, the separate categorization file.  These files will be governed by ES-DOC.  They are written in XML and must be uploaded into the CIM Editor application.  A vocabulary represents a CIM Controlled Vocabulary as described above.  As with the categorization, these must be uploaded into the CIM Editor application.  The CV defines the terms and relationships among terms that are permitted in particular CIM Documents - they are specific to user communities.  For example, there is a CV for CMIP5 simulations and one for downscaling methods.  These files are governed by the appropriate communities themselves.
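As a rough illustration of what a categorization does, the sketch below groups model attributes into the tabs under which they would be rendered.  The category names, the "releaseDate" attribute, and the fallback "Other" tab are assumptions made for the example; real categorizations are XML files governed by ES-DOC.

```python
# Hypothetical categorization: attribute name -> category (tab) name.
CATEGORIZATION = {
    "shortName": "Basic Properties",
    "longName": "Basic Properties",
    "description": "Basic Properties",
    "responsibleParties": "Parties",
}


def group_into_tabs(attribute_names, categorization, default="Other"):
    """Group attributes by category, preserving first-seen tab order."""
    tabs = {}
    for name in attribute_names:
        category = categorization.get(name, default)
        tabs.setdefault(category, []).append(name)
    return tabs


tabs = group_into_tabs(
    ["shortName", "longName", "description", "responsibleParties", "releaseDate"],
    CATEGORIZATION,
)
for tab, attributes in tabs.items():
    print(tab, attributes)
```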

The CIM Editor generates a webform for a Version & Model / Categorization / Vocabulary / Project combination

Additionally, as shown in the diagram above, there is a Project that must be registered with the CIM Editor application.  A project is simply a group wishing to use the Editor.  Different projects may wish to present their users with forms for different versions, documents, categorizations, vocabularies, and/or "customizations."

No form is explicitly defined in the Editor code.  Instead, project administrators are expected to define a "customization" for each CIM Document they wish to expose to their users.  CIM documents are comprehensive and complex.  It is unlikely that any user would have the patience to fill out all of the required content of every CIM element.  Therefore, the set of elements that are displayed in any form can be controlled in a customization.  Additional details, such as the names or documentation associated with each CIM or CV attribute, can also be customized.  Customization is performed by a project administrator using a webform:

The Customization Form looks a bit like the Editing Form

Once a customization exists, the editing form can be generated via a factory method.  This is done automatically at runtime.  No extra coding is required to accommodate new or changed CIM Versions with new document types, or new or changed categorizations or vocabularies.  The factory method inspects a given CIM Document class and the corresponding customization for that document/project combination and creates an appropriate form widget for each attribute.  Additionally, depending upon what was specified in the customization, if the attribute is a relationship to another CIM class the form may embed a sub-form by recursively parsing that class.  The resultant form looks a bit like this (notice the nested fieldsets):

A sample autogenerated Editing Form
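The factory approach described above can be sketched without Django.  In the example below, SCHEMA, CUSTOMIZATION, build_form, the widget names, and the ResponsibleParty attributes are all hypothetical; the sketch only shows how a class description plus a customization can drive recursive form generation, and it is not the Editor's actual implementation.

```python
# Hypothetical description of two CIM classes: atomic attributes and
# relationships (attribute name -> name of the related class).
SCHEMA = {
    "ModelComponent": {
        "atomic": ["shortName", "longName", "description"],
        "relations": {"responsibleParties": "ResponsibleParty"},
    },
    "ResponsibleParty": {
        "atomic": ["individualName", "email"],
        "relations": {},
    },
}

# Hypothetical per-project customization: which attributes to show, how to
# label them, and which relationships to embed as sub-forms.
CUSTOMIZATION = {
    "ModelComponent": {
        "displayed": ["shortName", "description", "responsibleParties"],
        "labels": {"shortName": "Short name"},
        "embed_subforms": ["responsibleParties"],
    },
    "ResponsibleParty": {
        "displayed": ["individualName"],
        "labels": {},
        "embed_subforms": [],
    },
}


def build_form(class_name, schema, customization):
    """Return a nested dict describing the editing form for one class."""
    spec = schema[class_name]
    custom = customization[class_name]
    fields = []
    for attr in custom["displayed"]:
        field = {"name": attr, "label": custom["labels"].get(attr, attr)}
        target = spec["relations"].get(attr)
        if target is None:
            field["widget"] = "text"  # atomic attribute
        elif attr in custom["embed_subforms"]:
            field["widget"] = "subform"  # recurse into the related class
            field["subform"] = build_form(target, schema, customization)
        else:
            field["widget"] = "reference"  # link to an existing instance
        fields.append(field)
    return {"model": class_name, "fields": fields}


form = build_form("ModelComponent", SCHEMA, CUSTOMIZATION)
```

Because the form is derived entirely from data, adding a class to SCHEMA or changing a customization changes the generated form without any new form code, which is the property the factory method is meant to provide.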

 

Once a form has been completed it must be validated and then saved.  Saving a single metadata instance includes saving all of the instances it is related to (the sub-forms in the figure above).  This stores the metadata in the relational database being used by CoG.  The metadata instances can be retrieved at any time if changes to their content are required.  Eventually, though, users will want to publish their metadata to the wider community.  ES-DOC maintains a central repository of CIM Documents.  These can be queried and then viewed and compared using ES-DOC's CIM Viewer and Comparator respectively.  Publication is done by serializing CIM Documents from Django to CIM XML in an ATOM feed that can be ingested by an ES-DOC service.
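The publication step can be pictured as wrapping each serialized document in an entry of an Atom feed.  The sketch below uses only the Python standard library.  The Atom namespace is the standard one, but the feed identifier, the "modelComponent" payload element, the input dictionary layout, and the timestamp are simplified assumptions and do not represent real CIM XML or the feed layout that ES-DOC expects.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # standard Atom namespace


def document_to_xml(document):
    """Serialize a flat dict into a simplified, hypothetical CIM-like element."""
    element = ET.Element(document["type"])
    for name, value in document["attributes"].items():
        ET.SubElement(element, name).text = value
    return element


def build_feed(feed_id, title, documents, updated):
    """Wrap each document in an Atom entry and return the feed as a string."""
    feed = ET.Element(ATOM + "feed")
    ET.SubElement(feed, ATOM + "id").text = feed_id
    ET.SubElement(feed, ATOM + "title").text = title
    ET.SubElement(feed, ATOM + "updated").text = updated
    for document in documents:
        entry = ET.SubElement(feed, ATOM + "entry")
        ET.SubElement(entry, ATOM + "id").text = "urn:uuid:" + document["id"]
        ET.SubElement(entry, ATOM + "title").text = document["attributes"]["shortName"]
        ET.SubElement(entry, ATOM + "updated").text = updated
        content = ET.SubElement(entry, ATOM + "content", {"type": "application/xml"})
        content.append(document_to_xml(document))
    return ET.tostring(feed, encoding="unicode")


documents = [
    {
        "id": "856673E6-21D8-11E1-A1E2-7E464824019B",
        "type": "modelComponent",  # hypothetical element name
        "attributes": {"shortName": "CAM EUL Dynamical Core"},
    }
]
feed_xml = build_feed(
    "urn:example:cog-metadata-feed",  # hypothetical feed identifier
    "Published metadata",
    documents,
    "2013-02-27T20:38:00Z",  # example timestamp
)
print(feed_xml)
```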

CoG is also considering supporting custom queries into the CIM Documents stored in its local database (i.e., before they have been published as XML).  This requires serializing them into SOLR XML.  There is an existing ESGF API, which relies on SOLR, that can be written to.  The ESGF SOLR XML Schema defines a relatively simple structure of name/value pairs, and it is straightforward to serialize from a Django Model to a SOLR representation of the salient bits of that model to be searched (i.e., the facets):

<?xml version="1.0"?>
<add>
    <doc boost="1.0">
        <field name="id">856673E6-21D8-11E1-A1E2-7E464824019B</field>       
        <field name="title">CAM EUL Dynamical Core</field>
        <field name="type">DynamicalCoreModel</field>
        <field name="url">http://earthsystemcog/projects/dycore/metadata/dycoremodel/id|text/html|HTTPServer</field>
       
        <field name="equations of motion">shallow atmosphere, hydrostatic</field>
        <field name="numerical method">spectral transform eulerian</field>
        <field name="spatial approximation">spectral transform eulerian</field>....
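A record of this shape can be produced with a few lines of standard-library Python.  The sketch below is an assumption-laden illustration: the input dictionary layout and the to_solr_add_xml helper are hypothetical, the field values are copied from the sample above, and nothing here is taken from the ESGF API itself.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(record, facets):
    """Serialize one record as a SOLR <add><doc> message of name/value fields."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc", {"boost": "1.0"})
    # Core fields that identify the document.
    for name in ("id", "title", "type", "url"):
        ET.SubElement(doc, "field", {"name": name}).text = record[name]
    # Only the properties chosen as search facets are written out.
    for name in facets:
        ET.SubElement(doc, "field", {"name": name}).text = record["properties"][name]
    return ET.tostring(add, encoding="unicode")


# Hypothetical in-memory form of a metadata instance; values are taken from
# the sample SOLR document above.
record = {
    "id": "856673E6-21D8-11E1-A1E2-7E464824019B",
    "title": "CAM EUL Dynamical Core",
    "type": "DynamicalCoreModel",
    "url": "http://earthsystemcog/projects/dycore/metadata/dycoremodel/id|text/html|HTTPServer",
    "properties": {
        "equations of motion": "shallow atmosphere, hydrostatic",
        "numerical method": "spectral transform eulerian",
    },
}

solr_xml = to_solr_add_xml(record, ["equations of motion", "numerical method"])
print(solr_xml)
```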

 

ES-DOC and CoG provide a rich infrastructure for working with metadata. 

Last Update: Feb. 27, 2013, 8:38 p.m. by Site Administrator


