EML Handbook
Version 2: Working Draft
Written by:
David Blankman
Jeanine McGann
LTER Network Office
Rick Reeves
NCEAS
© 2004
Table of Contents
1 Foreword
1.1 Handbook Objective
1.2 Intended Audience
1.3 Guide to Chapters
2 EML Overview
2.1 Definitions: Data Package, Metadata and EML
2.2 The Data Package Life Cycle
2.3 Setting the Stage: Metadata Example
2.4 EML Document Production and Management Tools
2.4.1 Creating EML with Text Editor
2.4.2 Using Web-Based Data Repository
2.4.3 Using Specialized Metadata Edit Tool
2.4.3.1 Example: Morpho Editor
3 Case Studies: Morpho Data Packages
3.1 MORPHO Data Package Wizard
3.2 Case 1: Data Tables ONLY
3.3 Case 2: Data Tables, Limited Metadata
3.4 Case 3: Data Tables, Extensive Metadata
4 In-Depth: Understanding and using EML
4.1 EML Design Overview
4.2 EML Design Objectives
4.3 Description: Individual EML Modules
<Mark S writing here>
1.1 Objective
The EML Handbook is a
comprehensive collection of conceptual descriptions, detailed technical
information, and case-study tutorials describing the design and implementation
of Ecological Metadata Language (EML). This document assumes a basic level of
computer knowledge, and some experience working with ecological data.
1.2 Intended Audience
This document is designed
for ecologists, ecological information managers, and ecology students as a
comprehensive reference to the Ecological Metadata Language (EML). It describes EML’s
overall structure and presents tutorial material, in the form of case studies
that demonstrate EML’s use in creating ecological Data Packages. For those seeking a
more technical or comprehensive overview of EML and its capabilities, this
document provides links to several online electronic documents. These are
listed at the end of Section 1.3.
1.3 Guide to Chapters
Chapter 1, Foreword, sets forth the objectives for this document.
Chapter 2, EML Overview, presents a general introduction to EML. Senior scientists, data managers, and others seeking a mid- to high-level understanding of the way in which EML is used in the scientific workplace should read this chapter first.
Chapter 3, Case Studies: Creating and Maintaining EML, depicts the EML production process using three interrelated case studies. This chapter includes a demonstration of different approaches to creating and editing EML. Ecologists and research assistants charged with entering and maintaining EML should read this chapter first.
Chapter 4, In-Depth: Creating Data Packages Using EML, presents extensive technical details for EML, focusing on techniques for constructing Data Packages from EML modules. The chapter is intended for EML users who require a detailed understanding of the EML specification.
The following online electronic
documents provide additional information regarding the design and use of EML.
Several EML Handbook chapters refer readers to these documents:
2.1 High-level Definitions: Data Package, Metadata, and EML
2.1.1 The Data Package: A System for Documenting Research Datasets
Ecologists collect information describing many aspects of their data collection efforts. This information is often combined with the actual experiment data into an encapsulated collection, which is subsequently stored in a searchable database where it is potentially available to the scientific community. This collection is known as a Data Package. Note that at this point the Data Package is a language- and device-independent concept. To be useful, it requires implementation via tangible rules, language, and creation-storage-retrieval technology.
2.1.2 Metadata: Systematic Description of Datasets
Simply put, metadata is information that describes, in a systematic manner, other information. Metadata has been with us since the first librarian made a list of the items on a shelf of handwritten scrolls. The term "meta" comes from a Greek word that denotes "alongside, with, after, next." More recent Latin and English usage employs "meta" to denote something transcendental, or beyond nature. Metadata, then, can be thought of as data about other data. It is the Internet-age term for information that librarians traditionally have put into catalogs, and it most commonly refers to descriptive information about Web resources.
A metadata record consists
of a set of attributes, or elements, necessary to describe the resource in question.
For example, a metadata system common in libraries -- the library catalog --
contains a set of metadata records with elements that describe a book or other
library item: author, title, date of creation or publication, subject coverage,
and the call number specifying location of the item on the shelf.
The linkage between a metadata record and
the resource it describes may take one of two forms: elements may be contained
in a record separate from the item, as in the case of the library's catalog record;
or the metadata may be embedded in the resource itself.
Also consider this functional description of metadata, presented through an example. Members of the ecological community compile descriptions of their experimental observations in many ways: they may take “mental notes”, intending to transcribe them upon return to their office. They may accumulate written notes in the margins of field notebooks, interspersed within the actual data, or in separate files. These descriptions of who, what, where, when, why, and how an ecological dataset was collected are metadata. Without accompanying metadata, a scientific data set becomes more difficult to use, and therefore less valuable to scientists, with the passage of time.
One could say that any analysis that uses independently collected data is not possible without metadata. For example, consider a data table that is presented without metadata describing the columns and rows. Such a data table is useless, as the numbers have no scale, units, or context. Further, if there is no Point of Contact attached to the data set, there is no way to contact the entity responsible for collecting it, and thus no way to acquire these key parameters.
Metadata may be implemented within an organization in several ways; however, to maximize its usefulness it is best to adopt a standardized, systematic approach, such as a metadata language. A metadata language defines the content, the format, and possibly the implementation tools needed to create, search, and use ecological metadata. Ecological Metadata Language (EML) is an emerging metadata language that is gaining support within the ecological community.
2.1.3 EML: A Framework for Constructing Data Packages
Ecological
Metadata Language (EML) is a metadata content specification for ecological
data, resulting from a community-based effort involving ecological researchers,
information managers, and software developers from around the
EML describes a number of critical aspects of the data, such as variables’ names and definitions, their units of measurement, the time and place of data collection, and who collected the data. Much of this information is currently collected by ecologists in various forms, such as field notes or marginalia, and hence is subject to being lost, separated from the data, or stored in very arbitrary ways. These disadvantages are avoided by using a more formal and standard framework such as EML.
Even in the case of existing data catalogs, the terminology and types of
metadata collected vary widely, whereas EML provides a standardized set of
terms and definitions intended specifically for ecological metadata.
EML is
implemented using Extensible Markup Language (XML), a growing standard for
marking up documents on the Web. This means that metadata created in EML will
be highly available for searching and exchanging over the Web, if so
desired. This formal structure for EML
not only provides a common structure for ecologists to document and interpret
ecological data, but also one that
better enables efficient development of software applications that process the
metadata. The kinds of technology applications anticipated range from basic
tools that allow users to perform specific targeted searches for datasets, to
advanced applications that can perform automatic integration of datasets. This
will be a tremendous help to ecologists in organizing and implementing
long-term research studies, where the ability to search for and synthesize
large amounts of data is crucial.
EML is a set of pre-defined terms, known as “tags” or “elements”, and a set of rules. Each tag consists of a pair of tokens, an opening token and a closing token. The tag names enumerate different metadata categories; the actual metadata are placed between the tokens. The EML rules specify the attributes and behavior of each tag: for example, its acceptable content, or the maximum number of times the tag may be used for a particular data set.
For example, the abstract tag set must contain a “brief overview of the resource or data being documented”. There should be one abstract tag set within each metadata package, and it should contain character/textual information. The value of the abstract tag for a particular data set might be:
“This dataset
contains demographic information on the survival and reproduction of barbed
goat grass (Aegilops triuncialis)
populations reciprocally transplanted between core and edge sub-populations
along an invasion front located at the Mclaughlin
reserve. The purpose of the study is to examine the potential for rapid
evolutionary change along invasion fronts of this exotic grass. The study will
probably end in Summer, 2003.”
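The tag-pair idea can be sketched in a few lines of Python. The choice of Python and its standard xml.etree.ElementTree module is an assumption of this illustration, not part of EML, and the abstract text is abridged from the example above; the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# An abridged abstract tag set: the metadata sits between the opening
# token <abstract> and the closing token </abstract>.
fragment = (
    "<abstract>This dataset contains demographic information on the "
    "survival and reproduction of barbed goat grass.</abstract>"
)

element = ET.fromstring(fragment)
tag_name = element.tag    # the metadata category
tag_value = element.text  # the metadata itself

print(tag_name)
print(tag_value)
```

Here the tag name identifies the metadata category, and the text between the two tokens is the metadata value.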
Notice that this use of “abstract” matches the popular conception of an abstract. EML attempts to use familiar terms with intuitively corresponding definitions whenever possible. However, the need for consistency and reduction of ambiguity has often required adoption of more technically accurate terms for EML metadata tags, even when some other term might be widely used by practicing ecologists. For example, the term “attribute” refers to what ecologists frequently call a “variable”: data of a given type contained within a column of a data table. This naming choice corresponds with formal use in database terminology. Alternative names for “attribute” might have included “field” or “column header”. Faced with such choices, EML typically adopts the most technically appropriate one.
To illustrate how metadata elements are designed to store specific data fields in EML-based data packages, consider representing the creator of an ecological dataset within the data package. The EML Specification implements the creator element as part of the root-level eml-resource module, which (among other entities) stores general descriptive information about dataset resources. Specific creator data fields are stored using the responsibleParty complex data type. The responsibleParty data type in turn uses other complex data types, for example the Person type, to store the attributes of the creator within the metadata element. The Person type is in turn composed of (among other fields) fundamental character string and numeric elements. Moving back up the schema hierarchy, the eml-resource module uses several other responsibleParty elements to hold resource-specific attributes (e.g., the provider of the data package’s metadata).
In summary, EML consists of:
Software
tools, for example, the KNB project’s Morpho metadata
editor, simplify the EML creation process by hiding some of the formal complexities
of the XML-based tag set. This is similar to the way that a Web browser such as
Netscape encapsulates the complexity of the HTML standard to make Web pages
more accessible.
Chapter 4 discusses the design and internal
structure of EML in greater detail.
2.2 The Data Package Life Cycle
Data packages are living documents. They are created using as much metadata describing the dataset as exists at the time, when ecologists are motivated to document new datasets before the metadata becomes stale. At some future point, new metadata can become available; for example, other research team members may return to base and contribute their observations. Or, the dataset originally described may be amended with new measurements that require further description. Finally, the Data Package may be transferred, or uploaded, into a physically separate second data repository.
Scientists employ desktop-based data management software to enter detailed descriptions of their research datasets (i.e., metadata) in a standardized format. If desired, the researcher can also combine the actual research data sets with the metadata, forming a Data Package. The new Data Package is stored on the researcher’s computer disk. At the researcher’s option, the package can also be exported to one or more Data Repositories maintained on remote computers and accessible via the Internet and special software. Two examples of such software are: 1) a Data Repository web site serving as a ‘front-end’ to a Data Repository; 2) specialized data management software that supports search, retrieval, and maintenance of data packages stored in one or more remote Data Repositories.
Figure 2.2 depicts the
Data Package development and use process:
2.3 Setting the Stage: A Simple Metadata Example
To start the discussion of Data Package development
using EML, we present a simple metadata/EML example: You have just completed a
research project on rodent populations at a northwestern
Example: minimal valid EML document
<?xml version="1.0" encoding="UTF-8"?>
<eml:eml
    packageId="eml.1.1" system="knb"
    xmlns:eml="eml://ecoinformatics.org/eml-2.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:ds="eml://ecoinformatics.org/dataset-2.0.0"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.0.0 eml.xsd">
  <dataset>
    <title>Rodents of the Southwest</title>
    <creator>
      <individualName>
        <surName>Ecosis</surName>
      </individualName>
    </creator>
    <contact>
      <individualName>
        <surName>Ecosis</surName>
      </individualName>
    </contact>
  </dataset>
</eml:eml>
Notes on the example:
<?xml ...?>: All XML documents start with this declaration. Most XML documents use UTF-8, an encoding of Unicode.
xmlns: Short for XML Namespace.
packageId: Must be unique for the system.
<title>, <creator>, <contact>: These fields are the minimum necessary to produce a valid EML document.
You can see that
this EML document doesn’t contain much information—just some basic contact information and indication of content, if the
title is well composed. Your new
question is: what
additional elements might provide sufficient information to enable other scientists
to determine whether sharing or combining your data would lead to
a productive collaboration? The
remainder of this document endeavors to answer this question; it will begin by
examining in greater detail the structure of the sample EML document.
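As a rough illustration of the "minimum necessary fields" idea, the following Python sketch checks that a dataset carries a title, a creator, and a contact. It is a simple presence check under stated assumptions, not validation against the EML schema; the names MINIMAL_EML, REQUIRED, and missing_required are hypothetical, and the XML declaration and unused namespace declarations are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Abridged version of the minimal EML document from this section.
MINIMAL_EML = """\
<eml:eml packageId="eml.1.1" system="knb"
         xmlns:eml="eml://ecoinformatics.org/eml-2.0.0">
  <dataset>
    <title>Rodents of the Southwest</title>
    <creator><individualName><surName>Ecosis</surName></individualName></creator>
    <contact><individualName><surName>Ecosis</surName></individualName></contact>
  </dataset>
</eml:eml>
"""

# The three fields the text names as the minimum for a valid document.
REQUIRED = ("title", "creator", "contact")


def missing_required(eml_text):
    """Return the required dataset children that are absent."""
    root = ET.fromstring(eml_text)
    dataset = root.find("dataset")
    if dataset is None:
        return list(REQUIRED)
    return [name for name in REQUIRED if dataset.find(name) is None]


print(missing_required(MINIMAL_EML))
```

A document that passes this check can still be invalid EML; a schema-aware tool is needed for a full answer.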
The fundamental “object” for which an ecologist might want to create metadata is termed a “dataset” in EML. “Dataset” means different things to different people. In EML, a dataset may contain general information such as the title, creator, and contacts, as well as one or more data entities, such as data tables, that provide more specific research details. Take a look at the expanded EML dataset schema displayed in Figure 3. This schema provides a graphical representation, or “map”, of the actual XML schema that shows where the different elements fit in the hierarchy of the language.
For now, consider only the ResourceGroup schema, which is displayed in Figure 4. The <title> field
provides a description of the resource being documented that is long enough to
differentiate it from other similar resources. Multiple titles may be provided,
particularly when trying to express the title in more than one language. If a
title is in a language other than English, use the “xml:lang” attribute. For example,
<title>Rodents of the Southwest</title>
<title xml:lang="de">Rodents des Südwestens</title>
In the second <title>, the text xml:lang="de" is an XML attribute that says the language of the title is German (de = Deutsch). ISO 639 language codes come in two versions: two-letter and three-letter codes. The three-letter system was designed to expand the list of languages that can be represented. Currently, the two-letter system is used more often, but over time it is likely that the three-letter system will be embraced.
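For readers who process EML with software, the following Python sketch shows one way the language attribute might be read back. It assumes Python's standard xml.etree.ElementTree module, which exposes xml:lang under the reserved XML namespace; the function name titles_by_language and the choice of "en" as the default for titles without the attribute are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

FRAGMENT = """\
<dataset>
  <title>Rodents of the Southwest</title>
  <title xml:lang="de">Rodents des Südwestens</title>
</dataset>
"""


def titles_by_language(xml_text, default="en"):
    """Map each language code to its title; untagged titles get `default`."""
    root = ET.fromstring(xml_text)
    return {t.get(XML_LANG, default): t.text for t in root.findall("title")}


titles = titles_by_language(FRAGMENT)
print(titles)
```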
A clear
discussion of the use of the language attribute can be found at:
msdn.microsoft.com/library/default.asp?url=/library/en-s/prompting_beta2/html/SSML_Elements_lang.asp.
The actual language codes can be found
at: www.w3.org/WAI/ER/IG/ert/iso639.htm.
A <creator> is a structure for
representing the owner of the data. A <creator> can be a person, an
organization, or an organizational role. Examples of organizational roles are:
Executive Director, Department Administrator, or Information Manager.
At this point, we expand the minimal EML
document to look at the <creator> information in more detail.
...<creator>
  <individualName>
    <salutation>Dr.</salutation>
    <givenName>Pat</givenName>
    <givenName>Morgan</givenName>
    <surName>Ecosis</surName>
  </individualName>
  <address>
    <deliveryPoint>Department of Southwestern Ecology</deliveryPoint>
    <deliveryPoint>...</deliveryPoint>
    <city>Noaguaville</city>
    <administrativeArea>NM</administrativeArea>
    <postalCode>78890</postalCode>
    <country>...</country>
  </address>
  <phone>709-345-8970 x 254</phone>
  <electronicMailAddress>pat.ecosis@semiarid.edu</electronicMailAddress>
</creator>
<creator>
  <organizationName>USDA</organizationName>
  <address>
    <deliveryPoint>...</deliveryPoint>
    <deliveryPoint>Mailstop 3654</deliveryPoint>
    <deliveryPoint>25 ...</deliveryPoint>
    <city>...</city>
    <administrativeArea>DC</administrativeArea>
    <postalCode>20025</postalCode>
    <country>...</country>
  </address>
  <phone phonetype="voice">709-345-8970 x 254</phone>
  <phone phonetype="fax">709-345-8962</phone>
  <electronicMailAddress>dataservices@usda.gov</electronicMailAddress>
  <onlineUrl>http://www.usda.gov/ecoinformatics/</onlineUrl>
</creator>
The element names may seem unfamiliar, and you might be wondering why the developers of EML used such strange names. These elements have been taken from the International Organization for Standardization (ISO) schema for representing people: iso-party. For example, a person can have more than one <givenName>, but only one <surName>. EML does not have a specific tag for middle
names. Since <givenName> is optional, <surName> would be used for “
An address can have more than one <deliveryPoint>. A <deliveryPoint>
is that part of an address that precedes the <city>. The most common <deliveryPoint> is a street address. Other examples are
company names, department names, or post office boxes. Because XML does not
recognize carriage returns, the way to enter that part of an address that
precedes the city is to use a separate <deliveryPoint>
element for each item that you would want to appear on a separate line when
displayed on a web page or printed report. An <administrativeArea>
is the tag that would be used in the
A <creator> can have more than one
address. Each address must be contained in its own set of
<address></address> tags:
…<creator>
  <individualName>
    <salutation>Dr.</salutation>
    <givenName>Joe</givenName>
    <surName>Ecologist</surName>
  </individualName>
  <address>
    <deliveryPoint>...</deliveryPoint>
    <city>Oceans</city>
    <administrativeArea>CA</administrativeArea>
    <postalCode>98025</postalCode>
  </address>
  <address>
    <deliveryPoint>Department of Ecological Sciences</deliveryPoint>
    <deliveryPoint>University of the Oceans</deliveryPoint>
    <city>Oceans</city>
    <administrativeArea>CA</administrativeArea>
    <postalCode>98025</postalCode>
  </address>
</creator>
In terms of EML
schema, the above code fragment would look something like this:
There are two additional optional creator
types that can be described by EML, <metadataProvider>
and <associatedParty>. A <metadataProvider> can be used if the person, organization,
or organizational position that provided the metadata is not the one who
created the data being documented. A <metadataProvider>
could be an LTER information manager, a student, or a researcher who may or may
not have been part of the data creation process. While this field is optional,
it is recommended that it be used if the metadata was provided by someone other
than the <creator>.
It is also a good idea to use this if there are multiple creators, but one
of them is the primary source of the metadata. It is especially important to
use this structure if the metadata is being constructed after the fact. For
example, if someone decides to create EML documentation for data that is 20
years old, some metadata may exist in a lab notebook, but the documenter also
is developing new metadata.
An <associatedParty>
is a person, organization, or organizational position that is involved in the
creation of the data, but is neither a <creator> nor a <metadataProvider>. An associated party could be a researcher,
statistician, technician, or an advisor or consultant. If the research data was
developed by a team of researchers, the principal investigator might be the
<creator>, while her associates could be described by <associatedParty>. The schemas for both <metadataProvider> and <associatedParty>
have essentially the same structure as that of <creator>, however, it
should be noted that <associatedParty> also
contains a mandatory element called <role>, which is used to define the
role that the party played in relation to the research being documented, such
as “research assistant,” or “technician.”
As mentioned earlier, the structure of an
EML document is defined by a series of XML schemas. The schema determines not
only the names of valid tags, but also the order of the various structures. As
can be seen in Figure 4, a <creator> must be described before a <metadataProvider>,
and a <metadataProvider> before an <associatedParty>.
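The ordering rule can be illustrated with a short Python sketch. It checks only the relative order of the three party elements named above and is not a substitute for validating against the EML schema; the names PARTY_ORDER, parties_in_order, GOOD, and BAD are hypothetical, and the empty elements stand in for fully populated ones.

```python
import xml.etree.ElementTree as ET

# Order required by the schema, as described in the text above.
PARTY_ORDER = ["creator", "metadataProvider", "associatedParty"]


def parties_in_order(dataset_xml):
    """Return True if the party elements appear in schema order."""
    root = ET.fromstring(dataset_xml)
    ranks = [PARTY_ORDER.index(child.tag)
             for child in root if child.tag in PARTY_ORDER]
    return ranks == sorted(ranks)


GOOD = "<dataset><creator/><creator/><metadataProvider/><associatedParty/></dataset>"
BAD = "<dataset><metadataProvider/><creator/></dataset>"

print(parties_in_order(GOOD))
print(parties_in_order(BAD))
```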
Aside from names and addresses, the ResourceGroup schema also is used to provide an overview of
the information in the dataset. Some of the more useful ResourceGroup
elements include <abstract>, <keywordSet>,
<distribution>, and <coverage>. While they are not required
elements, both <abstract> and <keywordSet>
are useful for describing the general nature of the research being documented,
especially when the data is being designed for use with a search engine. The
<distribution> and <coverage> elements reappear in later modules,
and as they are also optional, they will be discussed later. All of these
elements, however, should be listed after the <creator>, <metadataProvider>, and <associatedParty>.
In the <abstract> element, each paragraph is enclosed in a <para></para> tag set, while each section is enclosed in a <section></section> tag set. For most abstracts, there will be only one section.
As for keywords, the entire set is enclosed in <keywordSet> tags, while each individual keyword is enclosed in a <keyword></keyword> tag set.
…<abstract>
  <section>
    <para>Small mammal species are often affected by changes in vegetation composition and structure. In the southwestern ...</para>
  </section>
</abstract>
<keywordSet>
  <keyword>pinon forest</keyword>
  <keyword>Peromyscus maniculatus</keyword>
  <keyword>deer mouse</keyword>
  <keyword>...</keyword>
</keywordSet>
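Because the abstract and keywords are intended to support searching, it may help to see how software could pull them out of the structure described above. The following Python sketch uses the standard xml.etree.ElementTree module; the enclosing <dataset> element, the abridged paragraph text, and the variable names are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """\
<dataset>
  <abstract>
    <section>
      <para>Small mammal species are often affected by changes in
      vegetation composition and structure.</para>
    </section>
  </abstract>
  <keywordSet>
    <keyword>pinon forest</keyword>
    <keyword>Peromyscus maniculatus</keyword>
    <keyword>deer mouse</keyword>
  </keywordSet>
</dataset>
"""

root = ET.fromstring(FRAGMENT)

# Collapse line breaks inside each paragraph into single spaces.
paragraphs = [" ".join(p.text.split()) for p in root.iter("para")]

# Each keyword is the text of one <keyword> inside the <keywordSet>.
keywords = [k.text for k in root.findall("keywordSet/keyword")]

print(paragraphs)
print(keywords)
```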
The next
element to be added after the abstract and keywords is the contact information.
A <contact> is a structure for representing the person, organization, or
organizational role to contact regarding the use of the data. Although
<contact> followed <creator> in the minimal EML document in Figure
2, a closer look at Figure 3 and Figure 4 reveals that <creator>,
<metadataProvider>, and <associatedParty>
are part of the larger structure of the ResourceGroup,
while <contact> is located just below the ResourceGroup
in the schema. Any elements that are used in the ResourceGroup
must be entered before the next set of elements <purpose> (optional),
<maintenance> (optional), and <contact> (required).
Without
getting too deep into the language of XML schema, we can say that both the
<contact> and <creator> elements have the same structure, that is,
both are “of type ResponsibleParty.” The
<contact> information follows the same structure as the <creator>
information with regard to address and phone and online resources.
…<contact>
  <positionName>Information Manager</positionName>
  <address>
    <deliveryPoint>Oceans LTER</deliveryPoint>
    <deliveryPoint>Department of Marine Ecology</deliveryPoint>
    <deliveryPoint>University of the Oceans</deliveryPoint>
    <deliveryPoint>1514 San Ysidro</deliveryPoint>
    <city>...</city>
    <administrativeArea>LA</administrativeArea>
    <postalCode>78890</postalCode>
    <country>...</country>
  </address>
  <phone>709-345-8970 x 254</phone>
  <electronicMailAddress>imanager@oceans.edu</electronicMailAddress>
  <onlineUrl>http://oceanslter.lternet.edu</onlineUrl>
</contact>
One other element that should be discussed relevant to the entire
dataset is <access>. This element specifies permissions given to certain groups
or individuals regarding access to the data or metadata contained in a
particular file. Within <access>, there is a choice of <allow> or
<deny>, both of which then require both <principal> and
<permission> elements. The <principal> refers to the name of the
individual or organization to which the rule applies, while <permission>
specifies the level of access that is being granted, either “read,” “write,” “changePermission,” or “all.”
…<access>
<allow>
<principal>Sevilleta LTER</principal>
<permission>read</permission>
<permission>write</permission>
</allow>
</access>
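To show how an application might interpret such a rule, here is a minimal Python sketch that lists each <allow> or <deny> rule with its principal and permissions. It only reads the structure described above and does not enforce anything; the names ACCESS and access_rules are hypothetical.

```python
import xml.etree.ElementTree as ET

ACCESS = """\
<access>
  <allow>
    <principal>Sevilleta LTER</principal>
    <permission>read</permission>
    <permission>write</permission>
  </allow>
</access>
"""


def access_rules(access_xml):
    """Return (rule, principal, permissions) tuples for each allow/deny rule."""
    rules = []
    for rule in ET.fromstring(access_xml):
        if rule.tag in ("allow", "deny"):
            principal = rule.findtext("principal")
            permissions = [p.text for p in rule.findall("permission")]
            rules.append((rule.tag, principal, permissions))
    return rules


print(access_rules(ACCESS))
```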
We
have now defined some of the basic elements of EML pertaining to identifying
the author, topic, time, and location of the data. Clearly it is beneficial to create some
additional metadata that describes the structure and contents of the data
itself. As stated earlier, a dataset consists of one or more data entities, and
the most common data entity is a <dataTable>. A
data table is a very familiar construct to most ecologists—and usually looks
something like this:
OBSERVATION_DATE  OBSERVATION_LOCATION  SPECIES    SPECIES_COUNT
2002-10-29        FCE_001               ALLIGATOR  1
2002-10-29        FCE_002               ALLIGATOR  2
2002-10-29        FCE_003               ALLIGATOR  0
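The following Python sketch reads this table as plain whitespace-delimited text, which is roughly what software sees when no metadata accompanies the data. The variable names are hypothetical, and treating SPECIES_COUNT as an integer is an assumption the code has to make on its own, which is exactly the kind of information attribute-level metadata is meant to supply.

```python
TABLE = """\
OBSERVATION_DATE OBSERVATION_LOCATION SPECIES SPECIES_COUNT
2002-10-29 FCE_001 ALLIGATOR 1
2002-10-29 FCE_002 ALLIGATOR 2
2002-10-29 FCE_003 ALLIGATOR 0
"""

lines = TABLE.strip().splitlines()
header = lines[0].split()
records = [dict(zip(header, line.split())) for line in lines[1:]]

# Every value arrives as a string; only metadata (or a guess) tells us
# that SPECIES_COUNT should be treated as an integer.
total_count = sum(int(r["SPECIES_COUNT"]) for r in records)

print(header)
print(total_count)
```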
Data tables are produced by people using software applications such as text editors or word processors, and are often saved as text files (ASCII) or spreadsheets (Excel). They are also produced by statistical packages (SAS, SYSTAT, SPSS), by database systems (MS Access, MySQL, Oracle, MS SQL Server), or by certain kinds of sensors (data loggers).
In addition to data tables, people using database applications may also produce a <view> from a database management system, or a <storedProcedure> that results in data output.
People using GIS (geographical
information system) applications generate both <spatialVector>,
also referred to as boundary or shape files, and <spatialRaster>.
A <spatialRaster> is a geo-referenced image
usually produced by a camera on a satellite or other remote sensing device.
The final kind of data entity is <otherEntity>. An <otherEntity>
is a data entity that cannot be represented by any of the previously defined data
entity structures. A non-geo-referenced photograph is an <otherEntity>, e.g. a photograph of two different types
of butterflies.
The elements must be entered in the
sequence shown in Figure 6, and any
element that was specified in the ResourceGroup applies
to the entire dataset that is being documented. This is especially significant
if the dataset has more than one data entity. For example, assume the dataset
consists of three files: Climate.xls, Biodiversity.xls, and Productivity.xls.
In this case, ResourceGroup elements such as
<distribution> or <coverage> must be general enough to apply to all
of the files. This would be true also if Climate.xls, Biodiversity.xls, and
Productivity.xls were the names of tables in a database.
2.4 EML Document Production and Management Tools
It is possible, but not advisable, for users to create a data package using a text editor. To do so, the user creates an editable ASCII text file containing the ‘raw’ metadata, and then embeds the appropriate EML tags in the correct places throughout the metadata to create a valid EML document. This approach has two main drawbacks: 1) producing a valid EML document entirely ‘by hand’, without supporting software to validate the EML syntax, is time-consuming, error-prone work; 2) there is no way to embed non-ASCII data (e.g., a binary image or database file) in metadata constructed in a text editor. Therefore, the authors do not recommend this approach except for extremely simple metadata cases, or cases where the researcher has absolutely no other choice.
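For those who do edit EML by hand, even a very small script can catch the most common slip, a mistyped or unclosed tag. The Python sketch below, which assumes the standard xml.etree.ElementTree module and a hypothetical helper named is_well_formed, tests only whether the text is well-formed XML. It does not check the document against the EML schema, so a document that passes may still be invalid EML.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# A correctly closed tag pair, and one with a mistyped closing token.
print(is_well_formed("<salutation>Dr.</salutation>"))
print(is_well_formed("<salutation>Dr.</salutationr>"))
```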
Some institutions and research projects are creating Web-enabled data entry forms that collect key metadata fields relevant to the kinds of data sets produced by their affiliated researchers. These forms are then presented to the research community as a Data Registry web site. Data entered into these Data Registries are stored in project-specific Data Repositories. The advantage of these sites is that they focus the researcher’s efforts on collecting and entering the metadata required for that project. The disadvantage is that these Data Registries typically do not support editing of existing data packages, or entry of any ‘off-project’ metadata fields.
Specialized metadata entry tools provide the data package developer with complete data entry and edit capability for any valid EML field. One such tool is the Morpho data management tool, described in the next section.
3.0 Case Studies: Morpho-based Data Packages
3.1 Morpho Data Package Wizard
The Morpho data management tool is comprehensive, and thus complex for novice metadata developers to use. To support researchers who need to create data packages without requiring them to master Morpho’s advanced editing capabilities, the tool includes the Data Package Wizard. The Wizard is a series of interactive data entry forms that collect the most critical and commonly used metadata / EML fields from users to produce a valid data package. The Morpho User Guide (link here) demonstrates use of the Data Package Wizard.
In the following three sections, we will extend the tutorial example presented in the User Guide to demonstrate data package creation for three progressively more comprehensive metadata entry cases. The ecological datasets used in these
examples are found in the EML Guidebook
3.2 Case 1: Package with Data Tables ONLY
Use the Data Package Wizard
simply to import the table, with some minimal metadata.
3.3 Case 2: Data Tables with Limited Metadata
Use the Data Package Wizard (DPW) to import the table and additional metadata.
3.4 Case 3: Data Tables with Comprehensive Metadata
Use the Morpho
data management tool to add additional metadata fields not supported by the
DPW.
4.0 In Depth: Understanding and Using EML
The objective of this chapter is to connect the advanced metadata developer with advanced technical information concerning EML. Specifically, this includes a discussion of EML’s modular design and references to the formal EML Specification. The information in this chapter is of interest to persons designing advanced EML-based data packages, or to anyone requiring advanced, detailed knowledge of EML architecture. After studying the material in this chapter, metadata developers will be able to parse the EML Module descriptions in the EML Specification and select the modules required to construct a Data Package.
4.1 EML Design Overview
The Ecological Metadata Language (EML)
is a metadata standard developed by the ecology discipline and for the ecology
discipline. It is based on prior work done by the Ecological Society of America.
EML is composed
of modules that, in concert, provide a comprehensive ecological metadata
solution. This modularization is intended to assist information managers in
selecting only those aspects of metadata which are of importance to their
organizational needs, as well as to provide for easier flexibility and
integration with other possible metadata standards. The modular design of EML
will usually be hidden from the individual scientists who are using EML.
Still, it is useful to understand that these modules do exist in EML, and are
designed to encompass broad areas of thematic interest.
Each module addresses a general issue, such as 1) basic descriptions of data
set ownership (who, what, where, when, and why); 2) descriptions of how the
data are actually structured; or 3) the distribution rights for the data. For
example, below we list some general metadata concepts along with the
appropriate EML tags:
1. Ownership
   Who did the research? (creator)
   In general terms, what is the research about? (abstract)
   What are some of the key concepts addressed by the data? (keywords)
   Where was the research done? (geographic coverage)
   What time periods are covered in the data? (temporal coverage)
   What species are represented in this research/data? (taxonomic coverage)
   What methods were used? (methods)
   Is the data being documented part of a larger project? (project)
2. Data architecture
   The data representation branch contains modules that are used to describe:
   What kind of data entity is it? (e.g., Table vs. Raster)
   What are the names of variables within the data set? (attribute name)
3. Distributional information
   How would I retrieve this file? (physical)/(distribution)
   Who will I allow access to the data? (access)
These three examples list a subset of EML’s available modules and
tags. The essence of creating (meta)data packages using EML is to understand
which tag to use for which type of metadata, and then to properly fill in the
tags that are critical to your (and others’) needs. In addition, as a
researcher, you might benefit from reviewing the overall set of EML tags, as
you might discover critical types of metadata worth documenting that you hadn’t
previously considered.
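As a rough illustration of how the ownership concepts listed above map onto
tags, the following Python sketch assembles a small dataset fragment. The
element names (creator, individualName, surName, abstract, keywordSet,
coverage, and so on) follow EML naming as we understand it, but the fragment is
deliberately simplified and is not claimed to validate against the EML Schema.
The helper functions and all sample values are hypothetical.

```python
import xml.etree.ElementTree as ET


def add_child(parent, tag, text=None):
    """Append a child element to parent, optionally with text content."""
    child = ET.SubElement(parent, tag)
    if text is not None:
        child.text = text
    return child


def build_dataset(title, surname, abstract, keywords, place, begin, end):
    """Build a simplified, EML-style <dataset> fragment (not schema-valid)."""
    dataset = ET.Element("dataset")
    add_child(dataset, "title", title)

    # Who did the research? (creator)
    creator = add_child(dataset, "creator")
    name = add_child(creator, "individualName")
    add_child(name, "surName", surname)

    # In general terms, what is the research about? (abstract)
    add_child(add_child(dataset, "abstract"), "para", abstract)

    # Key concepts addressed by the data (keywords)
    keyword_set = add_child(dataset, "keywordSet")
    for keyword in keywords:
        add_child(keyword_set, "keyword", keyword)

    # Where and when? (geographic and temporal coverage)
    coverage = add_child(dataset, "coverage")
    geographic = add_child(coverage, "geographicCoverage")
    add_child(geographic, "geographicDescription", place)
    temporal = add_child(coverage, "temporalCoverage")
    date_range = add_child(temporal, "rangeOfDates")
    add_child(add_child(date_range, "beginDate"), "calendarDate", begin)
    add_child(add_child(date_range, "endDate"), "calendarDate", end)
    return dataset


# Hypothetical sample values, for illustration only.
example = build_dataset(
    title="Example vegetation plots",
    surname="Smith",
    abstract="Plant cover measured on permanent plots.",
    keywords=["vegetation", "plant cover"],
    place="Example field station",
    begin="2001-05-01",
    end="2003-09-30",
)
print(ET.tostring(example, encoding="unicode"))
```

Each question in the list corresponds to one branch of the resulting tree. A
complete EML document would contain additional required elements that this
sketch omits.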
The following sections of this document describe EML in more detail,
presenting specific examples intended to familiarize ecologists with the
meaning and appropriate use of many common types of ecological metadata.
4.2 EML Design Objectives
EML is implemented as a series of XML
document types that are used in a modular and extensible manner to document
ecological data. Each EML module is designed to describe one logical part of
the total metadata that should be included with any ecological dataset.
The architecture of EML was designed to
serve the needs of the ecological community, and has benefited from previous
work in other related metadata languages. EML has adopted the strengths of many
of these languages, but also addresses a number of shortcomings that have
proved to inhibit the automated processing and integration of dataset resources
via their metadata.
Four key design features permeate the EML implementation:
· Modularity: EML was designed as a collection of
modules rather than one large standard to facilitate future growth of the
language in both breadth and depth. Because EML is implemented with an
extensible architecture, groups may choose which of the core modules are
pertinent to describing their data, literature, and software resources. Also,
if EML falls short in a particular area, it may be extended by creating a new
module that describes the resource (e.g. a detailed soils metadata profile that
extends eml-dataset). The intent is to provide a common set of core modules for
information exchange, but to allow for future customizations of the language
without the need to go through a lengthy 'approval' process.
· Detailed Structure: EML strives to balance the tradeoff between
too much detail and enough detail to enable advanced services that process
data by parsing the accompanying metadata. Therefore, one key
question driving EML design was: 'Will this particular piece of information be
machine-processed, just human readable, or both?' Information was then broken
down into more highly structured elements when the answer involved machine
processing.
· Compatibility: EML adopts much of its syntax from
the other metadata standards that have evolved from the expertise of groups in
other disciplines. Whenever possible, EML adopted entire trees of information
in order to facilitate conversion of EML documents into other metadata languages.
EML was designed with the following standards in mind: Dublin Core Metadata
Initiative, the Content Standard for Digital Geospatial Metadata (CSDGM from
the US Geological Survey's Federal Geographic Data Committee (FGDC)), the
Biological Profile of the CSDGM (from the National Biological Information
Infrastructure), the International Organization for Standardization's Geographic
Information Standard (ISO 19115), the ISO 8601 Date and Time Standard, the
OpenGIS Consortium's Geography Markup Language (GML), the Scientific,
Technical, and Medical Markup Language (STMML), and the Extensible Scientific
Interchange Language (XSIL).
· Strong Typing: EML is implemented in XML Schema, an Extensible
Markup Language (XML) based language that defines the rules that govern the
EML syntax. XML Schema is an
internet recommendation from the World Wide Web Consortium, and so a metadata
document that is said to comply with the syntax of EML will structurally meet
the criteria defined in the XML Schema documents for EML. Over and above the
structure (what elements can be nested within others, cardinality, etc.), XML
Schema provides the ability to use strong data typing within elements. This
allows for finer validation of the contents of the element, not just its
structure. For instance, an element may be of type 'date', and so the value
that is inserted in the field will be checked against XML Schema's definition
of a date. Traditionally, XML documents (including previous versions of EML)
have been validated against Document Type Definitions (DTDs), which do not
provide a means to employ strong validation on field values through typing.
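The difference between structural validation and typed validation described
under Strong Typing can be imitated in a few lines of Python. This is only an
analogy: real XML Schema validation requires a schema-aware validator, which
the Python standard library does not include. The element name sampleDate and
both checking functions are hypothetical, and the date check is a
simplification that accepts only the plain YYYY-MM-DD form.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date

# Plain YYYY-MM-DD only; a simplification of XML Schema's date definition.
_DATE_PATTERN = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def is_valid_date(text):
    """Return True if text is a real calendar date written as YYYY-MM-DD."""
    match = _DATE_PATTERN.fullmatch(text or "")
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True


def check_structure(root):
    """Structure-only check: is the expected element present at all?"""
    return root.find("sampleDate") is not None


def check_typed(root):
    """Typed check: the element is present AND its content is a valid date."""
    element = root.find("sampleDate")
    return element is not None and is_valid_date(element.text)


good = ET.fromstring("<record><sampleDate>2004-02-29</sampleDate></record>")
bad = ET.fromstring("<record><sampleDate>2004-02-30</sampleDate></record>")

for label, record in (("good", good), ("bad", bad)):
    print(label, check_structure(record), check_typed(record))
```

Both records pass the structure-only check because the element is present, but
only the first passes the typed check, since 30 February is not a calendar
date. This is the kind of error that typed validation can catch and
structure-only validation cannot.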
Two concepts clarify the underlying EML design: the content model (i.e. the
concepts behind the structure of a document - which fields go where,
cardinality, etc.), and the syntactic implementation of that model (the
technology used to express the concepts defined in the content model). The
normative sections below define the content model, and the XML Schema documents
distributed with EML define the syntactic implementation. For the foreseeable
future, XML Schema will be the syntactic specification, although it may change
later.
4.3 Description of Individual EML Modules <Mark S writing here>
EML is composed of 22 structural elements known as modules.
Each module implements a separate syntactic element; a (meta)data
package will be composed of one or more of these elements. Sometimes more than
one instance of an element will occur in a metadata package. Some elements
are hierarchical in that they contain other elements as members. Elements can
thus be combined in many ways to create a potentially unlimited variety of data
packages. Each module is described below; the Appendix
to this report provides additional technical information.