Module Documentation: eml-physical
The eml-physical module defines the structural characteristics of data formats as delivered over the wire or as found in a file system. One physical object (which can be a bytestream or an object in a file system) might contain multiple entities; this would be typical of an MS Access file that contains multiple tables of data. However, the module is typically used to describe a file or stream in some text-based format such as ASCII or UTF-8, and it includes the information needed to parse the data stream and extract the entity and its attributes.

Element Definitions:

eml-physical
  Tooltip: Physical structure.  
  Summary: Physical structure of an entity or entities. 
  Description: Physical structure of an entity or entities. This generally is a detailed description of a text representation that shows how the columns and rows of a table are represented, or simply the name of a well-known binary or proprietary format (e.g., Microsoft Excel 2000).  
  Example:  
identifier
  Tooltip: Unique identifier  
  Summary: The unique identifier of this metadata file or object.  
  Description: The identifier field provides a unique identifier for this metadata documentation. It will most likely be part of a sequence of numbers or letters that are meaningful in a larger context, such as a metadata catalog. That larger system can be identified in the "system" attribute. Multiple identifiers can be listed corresponding to different catalog systems.  
  Example: <identifier system="metacat">nceas.3.2</identifier> 
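As a minimal sketch of how a consumer might read this element, the following Python pulls the "system" attribute and the identifier value out of the example above. The function name `read_identifier` is an assumption made for illustration and is not part of EML.

```python
import xml.etree.ElementTree as ET

def read_identifier(xml_text: str) -> tuple:
    """Return (system, identifier) from a single identifier element."""
    element = ET.fromstring(xml_text)
    # The system attribute is treated as optional here; that is an assumption.
    system = element.get("system", "")
    value = (element.text or "").strip()
    return system, value

print(read_identifier('<identifier system="metacat">nceas.3.2</identifier>'))
# ('metacat', 'nceas.3.2')
```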
format
  Tooltip: File format  
  Summary: Contains the name of the format for this file.  
  Description: This element contains the name of the file's format. The file's format is typically ASCII, Unicode, or some well-known binary format (e.g., Microsoft Excel 2000). This could also be a MIME type.  
  Example: <format>ASCII</format> 
characterEncoding
  Tooltip: Character Encoding  
  Summary: Contains the name of the character encoding used for the data.  
  Description: This element contains the name of the character encoding. This is typically ASCII or UTF-8, or one of the other common encodings.  
  Example: <characterEncoding>UTF-8</characterEncoding> 
size
  Tooltip: Entity size  
  Summary: Describes the physical size of the entity.  
  Description: This element contains information about the physical size of the entity, typically in bytes.  
  Example: <entitySize unit="bytes">13</entitySize> 
authentication
  Tooltip: Authentication method  
  Summary: A value, typically a checksum, used to authenticate that the bitstream delivered to the user is identical to the original.  
  Description: This element describes authentication procedures or techniques, typically by giving a checksum method (e.g., MD5) and checksum value for the bytestream.  
  Example: <authentication method="MD5">f5b2177ea03aea73de12da81f896fe40</authentication>  
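A minimal sketch of how such a checksum might be verified, assuming the method name maps onto an algorithm known to Python's hashlib (MD5 does; CRC does not and would need separate handling). The function name `matches_checksum` and the sample bytes are hypothetical.

```python
import hashlib

def matches_checksum(data: bytes, method: str, expected: str) -> bool:
    """Return True if the digest of data under the named method equals expected."""
    # hashlib.new accepts algorithm names such as "md5" or "sha1".
    digest = hashlib.new(method.lower(), data).hexdigest()
    return digest == expected.strip().lower()

# Hypothetical bytestream; its digest is computed here, not taken from any metadata.
sample = b"May100aaaa1.2"
recorded = hashlib.md5(sample).hexdigest()
print(matches_checksum(sample, "MD5", recorded))          # True
print(matches_checksum(sample + b"\n", "MD5", recorded))  # False
```

A single changed or added byte produces a different digest, which is what lets the recorded value confirm that the delivered bytestream matches the original.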
compressionMethod
  Tooltip: Entity's compression method  
  Summary: Name of the entity's compression method  
  Description: This element describes any compression methods used to compress the entity, such as zip, compress, etc.  
  Example:  
encodingMethod
  Tooltip: Encoding Method  
  Summary: Method used for encoding the entity  
  Description: This element describes the entity's encoding method, such as MIME base64 encoding or binhex encoding.  
  Example:  
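The sketch below illustrates how a reader might use compressionMethod and encodingMethod together to recover the original bytes. It rests on several assumptions that are not stated in this module: that encoding was applied after compression (so it is undone first), that the method names are the literal strings "base64" and "gzip", and that gzip stands in for the compression methods named above. The function name `recover_entity` is hypothetical.

```python
import base64
import gzip

def recover_entity(payload: bytes, encoding_method: str, compression_method: str) -> bytes:
    """Undo the transport encoding first, then the compression (assumed order)."""
    if encoding_method == "base64":
        payload = base64.b64decode(payload)
    elif encoding_method:
        raise ValueError(f"unsupported encodingMethod: {encoding_method}")
    if compression_method == "gzip":
        payload = gzip.decompress(payload)
    elif compression_method:
        raise ValueError(f"unsupported compressionMethod: {compression_method}")
    return payload

# Round trip on a small sample: compress, encode, then recover.
original = b"May,100aaaa,1.2,\n"
wire = base64.b64encode(gzip.compress(original))
print(recover_entity(wire, "base64", "gzip") == original)  # True
```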
numHeaderLines
  Tooltip: Header lines  
  Summary: Header lines in the entity  
  Description: Number of header lines, or lines of other introductory information, that precede the data.  
  Example: <numHeaderLines>3</numHeaderLines> 
recordDelimiter
  Tooltip: Record delimiter character  
  Summary: Character used to delimit records.  
  Description: This element specifies the record delimiter character when the format is text. The record delimiter is usually a newline (\n) on UNIX, a carriage return (\r) on MacOS, or both (\r\n) on Windows/DOS. Multiline records are usually delimited with two line ending characters, for example on UNIX it would be two newline characters (\n\n).  
  Example: <recordDelimiter>\r\n</recordDelimiter> 
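As a rough illustration of how numHeaderLines and recordDelimiter might be applied together, the sketch below splits a text stream into records and drops the header. It assumes that the delimiter is written in the metadata as backslash sequences (as in the description above) and that header lines end with the same delimiter as data records. The names `unescape_delimiter` and `read_records` are hypothetical.

```python
ESCAPES = {"\\n": "\n", "\\r": "\r", "\\t": "\t"}

def unescape_delimiter(text: str) -> str:
    """Turn written backslash sequences into the control characters they name."""
    for written, actual in ESCAPES.items():
        text = text.replace(written, actual)
    return text

def read_records(raw: str, record_delimiter: str, num_header_lines: int = 0) -> list:
    """Split raw text on the record delimiter and skip the header lines."""
    delimiter = unescape_delimiter(record_delimiter)
    records = raw.split(delimiter)
    # A trailing delimiter leaves one empty string at the end; drop it.
    if records and records[-1] == "":
        records.pop()
    return records[num_header_lines:]

raw = "site data\r\ncollected 2001\r\nMay,1\r\nJune,2\r\n"
print(read_records(raw, "\\r\\n", num_header_lines=2))
# ['May,1', 'June,2']
```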
quoteCharacter
  Tooltip: Quote character  
  Summary: Character used to quote values for delimiter escaping  
  Description: This element specifies a character to be used in the entity for quoting values so that field delimiters can be used within the value. This basically allows delimiter "escaping". The quoteCharacter is typically a " or '.  
  Example: <quoteCharacter>"</quoteCharacter> 
literalCharacter
  Tooltip: Literal character  
  Summary: Character used to escape other characters  
  Description: This element specifies a character to be used for escaping character values so that the following character is treated as its literal value. This allows "escaping" for special characters like quotes, commas, and spaces when they aren't intended as a delimiter value. The literalCharacter is typically a \.  
  Example: <literalCharacter>\</literalCharacter> 
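One way to see the difference between quoteCharacter and literalCharacter is to map them onto the `quotechar` and `escapechar` options of Python's csv module, as in the sketch below. This mapping is an assumption made for illustration, and the function name `parse_delimited` and the sample data are hypothetical.

```python
import csv
import io

def parse_delimited(text, field_delimiter=",", quote_character='"', literal_character="\\"):
    """Parse delimited text, honoring a quote character and a literal (escape) character."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=field_delimiter,
        quotechar=quote_character,
        escapechar=literal_character,
        doublequote=False,  # quotes inside values are escaped, not doubled
    )
    return list(reader)

# First record protects its comma by quoting, the second by escaping.
sample = '"Santa Barbara, CA",12\nGoleta\\, CA,7\n'
print(parse_delimited(sample))
# [['Santa Barbara, CA', '12'], ['Goleta, CA', '7']]
```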
fieldStartColumn
  Tooltip: Start column  
  Summary: The starting column number for a fixed format attribute.  
  Description: FixedWidth fields have a set length, thus the end of the field can always be determined by adding the fieldWidth to the starting column number.  
  Example: any positive integer, see example in "fieldDelimiter" description  
fieldWidth
  Tooltip: Field width  
  Summary: FieldWidth specification for fixed field length.  
  Description: FixedWidth fields have a set length, thus the end of the field can always be determined by adding the fieldWidth to the starting column number.  
  Example: any positive integer, see example in "fieldDelimiter" description  
fieldDelimiter
  Tooltip: Attribute delimiter  
  Summary: The end of the attribute (field) is delimited by a special character called a field delimiter.  
  Description: Variable-width fields (attributes) can vary in length, so the end of the field is marked by a special character called a field delimiter (typically a comma, a space, or a tab character). Data sets are generally classified as fixedWidth format or variableWidth format, but we have determined that this is actually a per-field classification, because one may encounter fixedWidth fields mixed together with variableWidth fields in the same data file. In our encoding scheme, the start of each field is assumed to be the column after the last column of the previous field, or the first column if this is the first field in the data set, unless the starting column is explicitly given using the "fieldStartColumn" element. The end column of each field is determined either by a delimiter character given in the "fieldDelimiter" element or by a fixed field length given in the "fieldWidth" element. The delimiter for the last field in the data set can be omitted. fixedWidth fields have a set length, so the end of the field can always be determined by adding the fieldWidth to the starting column number. 
  Here is an example. Assume we have the following data in a data set:
    May,100aaaa,1.2,
    April,200aaaa,3.4,
    June,300bbbb,4.6,
  The metadata indicating the physical layout of the 4 fields would include the following:
    <delimiter>,</delimiter>
    <fieldWidth>3</fieldWidth>
    <fieldWidth>3</fieldWidth>
    <delimiter>,</delimiter>
  In a strictly fixed format file, the metadata would be slightly different:
    May100aaaa1.2
    Apr200aaaa3.4
    Jun300bbbb4.6
    <fieldWidth>3</fieldWidth>
    <fieldWidth>3</fieldWidth>
    <fieldWidth>4</fieldWidth>
    <fieldWidth>3</fieldWidth>
  or, one could explicitly describe the starting columns:
    <fieldStartColumn>1</fieldStartColumn> <fieldWidth>3</fieldWidth>
    <fieldStartColumn>4</fieldStartColumn> <fieldWidth>3</fieldWidth>
    <fieldStartColumn>7</fieldStartColumn> <fieldWidth>4</fieldWidth>
    <fieldStartColumn>11</fieldStartColumn> <fieldWidth>3</fieldWidth>
  Example: comma, tab, white space, etc.  
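The per-field scheme described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the `FieldSpec` class and `split_record` function are hypothetical names, column numbers are taken to be 1-based, and a delimited field with no delimiter (or no match) is taken to run to the end of the record.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FieldSpec:
    width: Optional[int] = None         # from fieldWidth
    delimiter: Optional[str] = None     # from fieldDelimiter
    start_column: Optional[int] = None  # from fieldStartColumn (1-based, assumed)

def split_record(record: str, specs: List[FieldSpec]) -> List[str]:
    """Split one record into field values using per-field layout specs."""
    values = []
    pos = 0  # 0-based index of the next unread character
    for spec in specs:
        if spec.start_column is not None:
            pos = spec.start_column - 1
        if spec.width is not None:
            # Fixed width: the end is the start plus the width.
            end = pos + spec.width
            values.append(record[pos:end])
            pos = end
            continue
        end = record.find(spec.delimiter, pos) if spec.delimiter else -1
        if end == -1:
            # No delimiter given or found: the field runs to the end of the record.
            values.append(record[pos:])
            pos = len(record)
        else:
            values.append(record[pos:end])
            pos = end + len(spec.delimiter)
    return values

# Strictly fixed layout from the text: widths 3, 3, 4, 3.
fixed = [FieldSpec(width=3), FieldSpec(width=3), FieldSpec(width=4), FieldSpec(width=3)]
print(split_record("May100aaaa1.2", fixed))
# ['May', '100', 'aaaa', '1.2']

# A mixed layout chosen for this sketch (delimiter, width 3, delimiter, delimiter).
mixed = [FieldSpec(delimiter=","), FieldSpec(width=3), FieldSpec(delimiter=","), FieldSpec(delimiter=",")]
print(split_record("April,200aaaa,3.4,", mixed))
# ['April', '200', 'aaaa', '3.4']
```

Giving every field an explicit start column (1, 4, 7, 11) with the same widths yields the same result for the fixed-format record, since each start column equals the column after the previous field.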

Attribute Definitions:

system
  Tooltip: Catalog system  
  Summary: The catalog system in which this identifier is used.  
  Description: This attribute gives the name of the catalog system in which this identifier is used. It is useful to determine the scope of the identifier, and to determine the semantics of the various subparts of the identifier. Unresolved issue: can or should this be a URI/URL pointing to the catalog system, or just the name?  
  Example: <identifier system="metacat">nceas.3.2</identifier> 
unit
  Tooltip: Unit of measurement  
  Summary: Unit of measurement for the entity size, typically bytes  
  Description: This attribute gives the unit of measurement for the size of the entity, and is typically bytes.  
  Example: <entitySize unit="bytes">13</entitySize> 
method
  Tooltip: Authentication method  
  Summary: The method used to calculate an authentication checksum.  
  Description: This attribute names the method used to calculate an authentication checksum that can be used to validate a bytestream. Typical checksum methods include MD5 and CRC.  
  Example: <authentication method="MD5">f5b2177ea03aea73de12da81f896fe40</authentication>  

Web Contact: jones@nceas.ucsb.edu