Most of NCPP's tools and services are integrated into the CoG environment. CoG provides a collaborative participatory environment for the climate research community. In particular, it provides an environment in which climate research projects and organizations can create links and highlight relationships between one another. NCPP expects its members and associates to use CoG to document relevant issues and artifacts, ranging from raw climate datasets to polished climate outlook reports.
NCPP will provide added value to those artifacts by:
Details about the content of metadata are described elsewhere. The remainder of this document focuses on the architecture that will integrate metadata and related services into CoG for use by NCPP.
This document focuses on tools and services for "metadata" as opposed to "translational information." Metadata comprises descriptions of the attributes of artifacts. While these descriptions can be very extensive, they are largely quantitative rather than qualitative. In contrast, translational information captures a set of value judgements about how artifacts might be used within an application; it includes "narratives" about data. A metadata document references a particular artifact, whereas translational information mediates between different artifacts. Additionally, metadata implies a level of structure and formality (i.e., all metadata documents must conform to a specific schema and controlled vocabulary, as detailed below) that might be lacking in translational information.
There are several different types of metadata relevant to NCPP that can be categorized as follows:
|type||source||format||schema||how to search||what to return||notes|
|project metadata||CoG itself||native Django relational database (currently SQLite)||defined using Django models||NA||HTML|
|downscaling descriptions||user-generated via metadata form embedded within CoG||Django models generated from CIM; stored in native Django database; can serialize to CIM XML/JSON||CIM Experiments / Simulations / Components (including model components and downscaling components)||"raw" XML or "pretty" HTML (as displayed in the CIM Editor or Viewer)|
|downscaling evaluations||user-generated via metadata form embedded within CoG||Django models generated from CIM; stored in native Django database; can serialize to CIM XML/JSON||CIM QualityRecords|
|"external" metadata instances||pre-existing CIM instances that have been ingested into the same archive used by other metadata in CoG||CIM|
The primary use-cases driving this development are the 2012 Dynamical Core Workshop and the 2013 Downscaling Workshop. Both of these projects will use CoG to provide access to datasets (or to evaluations of datasets) and to associate metadata instances with those datasets. The metadata instances can be queried in order to locate particular datasets.
Much of this work is being done under the umbrella of the ES-DOC project. This is an international effort focused on providing metadata services for the climate research community. ES-DOC is committed to using the Common Information Model (CIM) as its metadata schema. The CIM is a general-purpose schema defining structures for several high-level artifacts used in climate modeling (e.g., SoftwareComponents, NumericalExperiments, Grids). The CIM uses domain-specific Controlled Vocabularies (CVs) to constrain the content of metadata instances (e.g., Atmospheric Components, CMIP5 Experiments, Arakawa "B" Grids). ES-DOC provides a more detailed description of the CIM. ES-DOC is developing a CIM Editor, CIM Viewer, and CIM Comparator. NCPP plans on using all of these. The CIM - its structure, content, and methodologies - represents an emerging standard in climate research. This fits well with NCPP's goal of interoperability: all artifacts that we develop are intended to benefit and be usable by the entire community.
Architecture: how the pieces fit together
The CIM schema has been defined in UML and XSD ("XML Schema"). ES-DOC provides Python and Django (a web-framework built in Python) serializations of the CIM schema. The CIM Controlled Vocabularies (CVs) are defined as formal mindmaps. Prior to their use by applications these are translated into better-structured XML files.
The mindmap format may not persist. Although mindmaps are well-suited for presentation purposes, they do not lend themselves to straightforward implementation. ES-DOC is considering alternative formats that would allow CVs to be hosted on a webserver, with RESTful query services to retrieve the values of particular CV elements. This would allow CVs to be updated independently of metadata schemas, and no re-coding of ES-DOC's tools would be required to support new types of metadata documents or properties. Until that time, though, NCPP is committed to using the same mindmap format as the rest of the community.
As mentioned above, ES-DOC's tools use the Python and Django serializations of the CIM rather than the UML or XSD. Currently, these serializations are written by hand; in the long term, they are expected to be autogenerated from the UML (just as the XSD is currently autogenerated from the UML). Regarding the CIM Editor, the Django Models representing CIM Classes are surprisingly simple:
All that is required is to define the attributes of each model and register the version. This is clearly the sort of boilerplate code that could be autogenerated. A version is simply a collection of models (each model corresponds to a CIM class). Once registered with the CIM Editor application, those models are available to be edited by users. Any project can select from any model in any version to edit.
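The idea of "define the attributes, then register the version" can be illustrated with a minimal sketch. This is not the CIM Editor's actual code: it uses plain Python dataclasses in place of Django models so that it runs without a configured Django project, and the names `VERSION_REGISTRY`, `register_version`, `attribute_names`, and the example classes and attributes are all hypothetical.

```python
from dataclasses import dataclass, field, fields

# Hypothetical registry: maps a version name to the model classes it contains.
VERSION_REGISTRY = {}


def register_version(name, models):
    """Record a named collection of model classes so an editor could offer them."""
    VERSION_REGISTRY[name] = {model.__name__: model for model in models}


@dataclass
class ResponsibleParty:
    # Illustrative attributes only; not the real CIM definition.
    individual_name: str = ""
    organisation_name: str = ""


@dataclass
class SoftwareComponent:
    # Illustrative attributes only; not the real CIM definition.
    short_name: str = ""
    description: str = ""
    responsible_parties: list = field(default_factory=list)


# A version is simply a collection of models.
register_version("example-1.0", [ResponsibleParty, SoftwareComponent])


def attribute_names(version, model_name):
    """List the attributes an editing form would need to cover for one model."""
    model = VERSION_REGISTRY[version][model_name]
    return [f.name for f in fields(model)]
```

In this sketch a project would look up a model by version and name, for example `attribute_names("example-1.0", "SoftwareComponent")`, and build its editing form from the result.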
Additionally, the version has a single corresponding "categorization" and the model has a set of corresponding "vocabularies." A categorization defines a mapping between individual model attributes and categories. In the Editor (as well as the Viewer) categories are rendered as tabs along the top of the form. Since the CIM is a standard spanning multiple communities in climate research, the presentation of CIM Metadata should be standardized as well. That presentation is not part of the schema itself, nor does it belong in a controlled vocabulary - hence, the separate categorization file. These files will be governed by ES-DOC. They are written in XML and must be uploaded into the CIM Editor application. A vocabulary represents a CIM Controlled Vocabulary as described above. As with the categorization, these must be uploaded into the CIM Editor application. The CV defines the terms and relationships among terms that are permitted in particular CIM Documents - they are specific to user communities. For example, there is a CV for CMIP5 simulations and one for downscaling methods. These files are governed by the appropriate communities themselves.
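As a rough illustration of how a categorization maps model attributes to categories (and hence to tabs), the sketch below parses a small XML document. The element and attribute names (`categorization`, `category`, `attribute`, `model`, `name`) are assumptions made for this example; the actual ES-DOC categorization file format may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical categorization file content; the real ES-DOC layout may differ.
CATEGORIZATION_XML = """
<categorization version="example-1.0">
  <category name="Basic">
    <attribute model="SoftwareComponent" name="short_name"/>
    <attribute model="SoftwareComponent" name="description"/>
  </category>
  <category name="Contacts">
    <attribute model="SoftwareComponent" name="responsible_parties"/>
  </category>
</categorization>
"""


def load_categorization(xml_text):
    """Return {(model, attribute): category} from a categorization document."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for category in root.findall("category"):
        for attribute in category.findall("attribute"):
            key = (attribute.get("model"), attribute.get("name"))
            mapping[key] = category.get("name")
    return mapping


def tabs_for_model(mapping, model):
    """Group one model's attributes by category; each category becomes a tab."""
    tabs = {}
    for (model_name, attr), category in mapping.items():
        if model_name == model:
            tabs.setdefault(category, []).append(attr)
    return tabs
```

Keeping this mapping in its own file, separate from both the schema and the vocabularies, means the tab layout can change without touching either.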
Additionally, as shown in the diagram above, there is a Project that must be registered with the CIM Editor application. A project is simply a group wishing to use the Editor. Different projects may wish to present their users with forms for different versions, documents, categorizations, vocabularies, and/or "customizations."
No form is explicitly defined in the Editor code. Instead, project administrators are expected to define a "customization" for each CIM Document they wish to expose to their users. CIM documents are comprehensive and complex, and it is unlikely that any user would have the patience to fill out all of the required content of every CIM element. Therefore, the set of elements displayed in any form can be controlled in a customization. Additional details, such as the names or documentation associated with each CIM or CV attribute, can also be customized. Customization is performed by a project administrator using a webform:
Once a customization exists, the editing form can be generated via a factory method. This is done automatically at runtime. No extra coding is required to accommodate new or changed CIM Versions with new document types, nor new or changed categorizations or vocabularies. The factory method inspects a given CIM Document class and the corresponding customization for that document/project combination and creates an appropriate form widget for each attribute. Additionally, depending upon what was specified in the customization, if the attribute is a relationship to another CIM class, the form may embed a sub-form by recursively parsing that class. The resultant form looks a bit like this (notice the nested fieldsets):
Once a form has been completed, it must be validated and then saved. Saving a single metadata instance includes saving all of the instances it is related to (the sub-forms in the figure above). This stores the metadata in the relational database being used by CoG. The metadata instances can be retrieved at any time if changes to their content are required. Eventually, though, users will want to publish their metadata to the wider community. ES-DOC maintains a central repository of CIM Documents. These can be queried and then viewed and compared using ES-DOC's CIM Viewer and Comparator, respectively. Publication is done by serializing CIM Documents from Django to CIM XML in an ATOM feed that can be ingested by an ES-DOC service.
CoG is also considering support for custom queries into the CIM Documents stored in its local database (i.e., before they have been published as XML). This requires serializing them into SOLR XML. An existing ESGF API, which relies on SOLR, can be written to. The ESGF SOLR XML Schema defines a relatively simple structure of name/value pairs, and it is straightforward to serialize from a Django Model to a SOLR representation of the salient parts of that model to be searched (i.e., the facets):
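The factory approach can be sketched as follows. This is a simplified stand-in, not the Editor's implementation: it uses dataclasses rather than Django models and forms, it returns a plain dictionary rather than a form object, and the customization layout (`label`, `embed`), the `related` metadata key, and the class names are all hypothetical.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ResponsibleParty:
    # Illustrative attributes only; not the real CIM definition.
    individual_name: str = ""
    email: str = ""


@dataclass
class SoftwareComponent:
    short_name: str = ""
    description: str = ""
    # metadata["related"] marks a relationship to another model class.
    contact: ResponsibleParty = field(
        default=None, metadata={"related": ResponsibleParty}
    )


# Hypothetical customization for one project: which attributes to show,
# how to label them, and whether a relationship is embedded as a sub-form.
CUSTOMIZATION = {
    "SoftwareComponent": {
        "short_name": {"label": "Name"},
        "contact": {"label": "Contact", "embed": True},
    },
    "ResponsibleParty": {
        "individual_name": {"label": "Person"},
    },
}


def build_form(model, customization):
    """Inspect a model class and its customization; return a form description.

    Relationships flagged with "embed" are expanded recursively into nested
    fieldsets. Cyclic relationships are not guarded against in this sketch.
    """
    spec = customization.get(model.__name__, {})
    form = {"model": model.__name__, "fields": []}
    for f in fields(model):
        options = spec.get(f.name)
        if options is None:
            continue  # attribute is hidden by the customization
        entry = {"name": f.name, "label": options.get("label", f.name)}
        related = f.metadata.get("related")
        if related is not None and options.get("embed"):
            entry["widget"] = "fieldset"
            entry["subform"] = build_form(related, customization)
        elif related is not None:
            entry["widget"] = "reference"
        else:
            entry["widget"] = "text"
        form["fields"].append(entry)
    return form
```

Because the form is derived from the class and the customization at call time, adding a new model or hiding an attribute requires no change to `build_form` itself, which is the property the text describes.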
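To make the publication step more concrete, here is a minimal sketch of wrapping already-serialized XML documents in an Atom feed using Python's standard library. The Atom namespace is the standard one; everything else is an assumption for illustration, including the function name `build_feed`, the dictionary keys, and the embedded `document`/`shortName` elements, which are placeholders rather than real CIM XML. The feed layout that the ES-DOC ingestion service expects is not specified here.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Use a readable prefix for Atom elements in the serialized output.
ET.register_namespace("atom", ATOM_NS)


def atom(tag):
    """Qualify a tag name with the Atom namespace."""
    return "{%s}%s" % (ATOM_NS, tag)


def build_feed(feed_id, title, updated, documents):
    """Wrap serialized XML documents as entries of an Atom feed.

    Each item in `documents` is a dict with hypothetical keys:
    "id", "title", "updated" (RFC 3339 timestamp) and "xml" (document text).
    """
    feed = ET.Element(atom("feed"))
    ET.SubElement(feed, atom("id")).text = feed_id
    ET.SubElement(feed, atom("title")).text = title
    ET.SubElement(feed, atom("updated")).text = updated
    for doc in documents:
        entry = ET.SubElement(feed, atom("entry"))
        ET.SubElement(entry, atom("id")).text = doc["id"]
        ET.SubElement(entry, atom("title")).text = doc["title"]
        ET.SubElement(entry, atom("updated")).text = doc["updated"]
        content = ET.SubElement(entry, atom("content"), {"type": "application/xml"})
        # Embed the document's XML as the single child of <content>.
        content.append(ET.fromstring(doc["xml"]))
    return ET.tostring(feed, encoding="unicode")
```

A harvesting service could then poll such a feed and ingest each entry's embedded document.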
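A minimal sketch of that serialization is shown below. It emits the generic SOLR XML update message (`<add><doc><field name="...">`), reducing a record to name/value pairs for a chosen set of facets. The function name `to_solr_add_xml`, the plain dictionary standing in for a Django model instance, and the facet names are assumptions for illustration; the field names required by the ESGF schema are not specified here.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(record, facet_names):
    """Serialize selected name/value pairs of a record as a SOLR <add> message.

    `record` is a plain dict standing in for a model instance. Only the
    attributes listed in `facet_names` are emitted; list values become
    repeated fields (i.e. multi-valued facets).
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name in facet_names:
        value = record.get(name)
        if value is None:
            continue  # skip facets the record does not define
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            ET.SubElement(doc, "field", {"name": name}).text = str(item)
    return ET.tostring(add, encoding="unicode")


# Hypothetical record and facets.
example = to_solr_add_xml(
    {"id": "doc-1", "type": "downscaling", "variable": ["tas", "pr"], "notes": "x"},
    ["id", "type", "variable"],
)
```

Attributes that are not listed as facets (here, `notes`) are simply left out of the index document.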
ES-DOC and CoG provide a rich infrastructure for working with metadata.