XMARC (Version 1.0)
XML Mapping for MARC Data
geoff at minaret dot biz
Copyright © 2001, 2004 Minaret Corp.
Minaret is a registered trademark of Minaret Corporation.
Permission is hereby granted, free of charge, to any person obtaining a copy of this document to deal in this document without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of this document, and to permit persons to whom this document is furnished to do so, subject to the following conditions:
The above copyright and trademark notices and this permission notice shall be included in all copies and derivative works of this document.
THIS DOCUMENT IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THIS DOCUMENT OR THE USE OR OTHER DEALINGS IN THIS DOCUMENT.
November 28, 2001 (Version 0.1)
February 9, 2004
This document describes a mapping between the MARC (MAchine-Readable Cataloging) communications format and XML, intended for storing MARC data in a database using XML-compliant field names. The mapping uses a few simple rules for constructing XML element names from MARC field and subfield names, which yield a unique and legal XML element name for every field and for every combination of field and subfield.
The goal of this mapping is to produce an XML document type that: (a) is as simple as possible to understand and implement; (b) is database friendly; (c) supports high levels of validation, if desired; and (d) works with all MARC formats. To avoid problems with spaces within a fixed length XMARC control field, either a DTD (Document Type Definition) or XML Schema should be used, or an attribute of 'xml:space="preserve"' should be included with any fixed length data elements. When DTDs or schemas are used, XML parsers generally preserve white space or at least make it available to the application.
A file of 20 test records is available in both MARC and XMARC formats to illustrate the principles of this mapping. They are:
Please note that the above MARC file contains diacritic characters which have been combined with the characters they modify and converted into their Unicode equivalents to produce the XML file. For example, the combination of an acute accent followed by the letter "e" in the MARC source file has been converted into the single character "é" (U+00E9), which has been output as a hexadecimal character reference (for example "&#xE9;"). Also note that any indicators that were blank in the source file were not output in the resulting XML file.
Create a mapping that is easy to implement and understand, and which will handle any type of MARC record.
This is accomplished by using existing MARC field and subfield names as the basis for their corresponding XML element names. This has the advantage of being easy to understand for anyone conversant in MARC terminology and is also extremely easy to implement by system designers and programmers. No sophisticated field mapping is necessary. In addition, no understanding of the content of the MARC record is necessary to perform the conversion. It does not matter whether the data is bibliographic, authority, community information, etc.
One of the most important considerations in this design is the ease with which it can be implemented and understood. This translates into lower costs and more widespread use of MARC records in an XML environment.
Fixed Length Fields.
The MARC control fields (the leader and the 00x fields) can either be subdivided into subfields or kept intact. Subdividing allows more finely tuned control over these fields but requires more effort than keeping them intact. By permitting these fields to stay intact, it is easier to write applications that do not require much knowledge (if any) of the structure or meaning of individual character positions within these fields.
No XML attributes.
There are no XML attributes used in this specification. This has the benefit of simplicity and consistency.
Easy mapping in both directions.
With a few simple rules any MARC field and subfield can be mapped to an XML equivalent. The rules cover all existing and future MARC field and subfield names, including local fields and subfields.
Unique field and subfield names.
Because this mapping is designed with databases in mind, each MARC field and every combination of field and subfield uses a unique name. Therefore, subfield "a" of the "245" title field, for example, uses a different name than subfield "a" of field "100".
Easy to remember XMARC field and subfield names.
Anyone conversant in MARC tag numbers will find it extremely easy to use XMARC field names. With the exception of the MARC leader (which is called leader in XMARC), all XMARC fields start with the letter "f" and are followed by the three digit MARC tag number. For example, the name of the MARC title field (245) is "f245" in XMARC.
Indicator names start with the name of their containing field (e.g. "f245"), followed by either "i1" or "i2" (e.g. "f245i1" or "f245i2"). Likewise, subfield names start with the field name, followed by the letter "s" and the MARC subfield name. So field "245", subfield "a" in MARC is called "f245sa" in XMARC.
No loss of MARC information.
Records can be converted back and forth between MARC and XML with no loss of information. Records can be read from a MARC file, converted into XML, manipulated in XML and written back out in MARC without any unintentional data loss. However, any supplemental XML markup (like XHTML tags) within an XMARC record that is not defined as part of this mapping must necessarily be removed when the record is converted back to MARC.
XMARC records may contain local data.
An XMARC record may contain non-XMARC elements and attributes that are stripped out (leaving their content) or totally ignored when converted to MARC.
The recommended method of handling whitespace in an XMARC record is to use a DTD or schema and a validating XML parser. This combination permits an application to distinguish ignorable whitespace from required spaces. Required spaces are those at the front and back of a fixed field that are needed to maintain the correct data offsets within the field. An alternate approach is to use the "xml:space" attribute with a value of "preserve" in any fields or subfields where whitespace is important, and to output the content of that field with its start and end tag all on one line.
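The alternate approach can be sketched in Python with the standard library's xml.etree.ElementTree. This is only an illustration, not part of the mapping: the helper name fixed_field and the sample f008 value are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Clark-notation attribute name that ElementTree serializes as xml:space.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def fixed_field(name: str, value: str) -> str:
    """Serialize a fixed-length field on one line with xml:space="preserve"."""
    elem = ET.Element(name, {XML_SPACE: "preserve"})
    elem.text = value
    return ET.tostring(elem, encoding="unicode")


# Hypothetical 008 value with internal and trailing blanks that must survive.
value = "010101s2001    xx            000 0 eng  "
line = fixed_field("f008", value)

# Reading the line back returns the value with every blank intact.
restored = ET.fromstring(line).text
```

The start tag, content and end tag stay on a single line, so no line breaks or indentation can end up inside the field.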
Character sets and character conversion.
When converting between MARC Unicode data and XMARC, no character conversion is necessary. When converting from MARC-8 content to XMARC, at a minimum, alternate character sets must be translated into their Unicode equivalents and diacritical marks must be moved from before the character they modify to after it. Ideally, common combinations of a diacritical mark and the character it modifies would be converted into a single Unicode character. When converting in the other direction, the opposite must occur.
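The reordering and combining steps might look like the following Python sketch. It assumes the MARC-8 bytes have already been translated into Unicode code points, with each combining mark still preceding its base character as in MARC-8. The function name is hypothetical, and NFC normalization is used here as one way of producing single precomposed characters.

```python
import unicodedata


def marc8_order_to_unicode(text: str) -> str:
    """Move combining marks after their base character, then compose."""
    out = []
    pending = []  # marks seen before their base character
    for ch in text:
        if unicodedata.combining(ch):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)
            pending.clear()
    out.extend(pending)  # marks with no following base are kept as-is
    # NFC merges base + mark pairs into a single character where one exists.
    return unicodedata.normalize("NFC", "".join(out))


# Acute accent (U+0301) placed before "e", as in MARC-8 ordering.
converted = marc8_order_to_unicode("caf\u0301e")
```

Converting in the other direction would decompose the text (for example with NFD) and move each mark back in front of its base character.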
All letters in XMARC names are lower case.
The document level field in an XMARC file with more than one XMARC record is called xmarc-set, which may contain one or more xmarc elements, one for each XMARC record.
The document level field in an XMARC file with only one record may be called xmarc, without the need for a containing xmarc-set element.
Each xmarc element may contain a leader element and one or more field elements. The leader element contains the leader information from the MARC record.
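As a rough illustration of this document structure, the Python sketch below builds an xmarc-set containing one xmarc record with the standard xml.etree.ElementTree module. The helper name build_xmarc_set and the sample leader and f001 values are assumptions, and only flat elements (leader and control fields) are shown.

```python
import xml.etree.ElementTree as ET


def build_xmarc_set(records):
    """Build an xmarc-set element.

    records is a list with one entry per record; each entry is a list of
    (element_name, text) pairs in record order.
    """
    root = ET.Element("xmarc-set")
    for fields in records:
        rec = ET.SubElement(root, "xmarc")
        for name, text in fields:
            ET.SubElement(rec, name).text = text
    return root


# Hypothetical sample values, for illustration only.
sample = [[("leader", "00714cam a2200205 a 4500"), ("f001", "12345")]]
doc = build_xmarc_set(sample)
xml_text = ET.tostring(doc, encoding="unicode")
```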
The MARC control fields (001 through 009) are represented by XMARC elements f001 through f009.
The MARC leader and control fields (001 through 009) may either store the entire contents of their corresponding MARC field as a single chunk of data or use sub-elements.
When sub-elements are used, their names will consist of the name of their containing element (e.g. "leader", "f001", "f002") followed by an underscore ("_"), followed by the numeric representation of the offset of the current data within the original MARC field. The portion of the name following the underscore must be all numeric, with optional zero fill. It is suggested that two digits be used with zero fill, but this is not a requirement. Examples include: "leader_00", "leader_05", "f007_10", etc.
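The naming rule for sub-elements can be sketched in Python as follows. The function names, the chosen offsets and the sample leader value are assumptions for illustration; which offsets to split at is up to the application.

```python
def sub_element_name(parent: str, offset: int, width: int = 2) -> str:
    """Return e.g. "leader_05": parent name, underscore, zero-filled offset."""
    return f"{parent}_{offset:0{width}d}"


def split_fixed_field(parent: str, data: str, offsets: list[int]) -> list[tuple[str, str]]:
    """Split data at the given offsets into (sub_element_name, chunk) pairs."""
    starts = sorted(offsets)
    ends = starts[1:] + [len(data)]  # each chunk runs up to the next offset
    return [(sub_element_name(parent, s), data[s:e]) for s, e in zip(starts, ends)]


# Hypothetical leader value split at offsets 0, 5 and 6.
parts = split_fixed_field("leader", "00714cam a2200205 a 4500", [0, 5, 6])
```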
The MARC data fields (010 through 999) are represented by XMARC elements f010 through f999. Each of these XMARC elements may contain one or two indicator elements and one or more subfield elements.
Indicator names start with the name of their containing field (e.g. "f245"), followed by either "i1" or "i2" (e.g. "f245i1" or "f245i2").
MARC subfields "a" through "z" and "0" through "9" start with the name of their containing field, followed by the letter "s" and the subfield name. So field "245", subfield "a" in MARC is called "f245sa" in XMARC.
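The field, indicator and subfield naming rules described so far can be sketched in Python as follows. The function names are hypothetical. The sketch covers only subfields "a" through "z" and "0" through "9" and rejects every other subfield name, which is an assumption of this example.

```python
import string

# Subfield names covered by this sketch: "a"-"z" and "0"-"9".
SUBFIELD_CODES = set(string.ascii_lowercase + string.digits)


def field_name(tag: str) -> str:
    """Map a three digit MARC tag such as "245" to "f245"."""
    if len(tag) != 3 or any(c not in string.digits for c in tag) or tag == "000":
        raise ValueError(f"not a MARC field tag: {tag!r}")
    return "f" + tag


def indicator_name(tag: str, number: int) -> str:
    """Map field "245", indicator 1 to "f245i1"."""
    if number not in (1, 2):
        raise ValueError(f"indicator number must be 1 or 2, got {number!r}")
    return f"{field_name(tag)}i{number}"


def subfield_name(tag: str, code: str) -> str:
    """Map field "245", subfield "a" to "f245sa"."""
    if code not in SUBFIELD_CODES:
        raise ValueError(f"subfield name not handled by this sketch: {code!r}")
    return f"{field_name(tag)}s{code}"
```

Because the XMARC name is built from the tag and subfield name alone, the reverse mapping only has to strip the "f", "i" and "s" markers.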
The following subfield names have been set aside for local use by the MARC specification:
! " # $ % & ' ( ) * + , - . / : ; < = > ?
As these are almost all illegal XML name characters, these subfield names are to be mapped as follows (where ### is replaced by the three digit field number):
All other subfield names not explicitly listed here are illegal (per the MARC specification) and should cause any implementation of XMARC to report an error.
Creating a MARC Export File from XMARC Records
This is offered simply as a guideline for how a MARC export file might be created from an input file of XMARC records. It does not cover the details of how to generate a proper MARC record.
In general, only those elements that have been defined as an XMARC element are exported. This permits XMARC records to include non-MARC data that is stripped out during the creation of the MARC export file. It also handles the case of a set of XMARC records using a document level xmarc-set element as well as a single record using a document level xmarc element.
A parser will scan an XMARC input file for any occurrences of xmarc elements. Each xmarc element and its descendants are exported as a MARC record. Everything else is ignored (e.g. other elements, comments and processing instructions).
Within the xmarc element, only those child elements with the following names are exported: leader (parts of which must be re-calculated as part of the export) and f### (where ### is a three digit, zero filled, decimal number).
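A minimal Python sketch of this selection step, using the standard xml.etree.ElementTree module, is shown below. The function name and the sample input are assumptions. The sketch only selects elements and does not build MARC records.

```python
import re
import xml.etree.ElementTree as ET

# Child element names that are exported: leader, or "f" plus three digits.
FIELD_NAME = re.compile(r"leader|f[0-9]{3}")


def exportable_records(xml_text: str) -> list[list[ET.Element]]:
    """Return, for each xmarc element, its exportable children in record order."""
    root = ET.fromstring(xml_text)
    # iter() also yields the root itself, so a lone <xmarc> document works too.
    return [
        [child for child in rec if FIELD_NAME.fullmatch(child.tag)]
        for rec in root.iter("xmarc")
    ]


# Hypothetical input with a non-MARC <note> element that is skipped.
sample = (
    "<xmarc-set><xmarc><leader>00714cam a2200205 a 4500</leader>"
    "<note>local</note><f245><f245sa>Title</f245sa></f245></xmarc></xmarc-set>"
)
records = exportable_records(sample)
```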
For the leader and fields f001 through f009, all text data from each field is extracted (stripping any supplemental markup) and exported. Particular care must be taken with these fields to ensure that XML whitespace processing does not alter their values. It is very important that the length of these control fields does not change inadvertently and that the data is not shifted to the left because of whitespace removal. If sub-elements have been used, the portion of the sub-element name following the underscore is used to sequence and concatenate the sub-elements, with blank padding as needed.
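The sequencing and padding of sub-elements might be implemented along the lines of the following Python sketch. The function name and sample values are assumptions, and treating overlapping sub-elements as an error is a choice made for this example.

```python
def assemble_fixed_field(parts: dict[str, str]) -> str:
    """Concatenate sub-elements by offset, padding any gaps with blanks.

    parts maps sub-element names such as "leader_05" to their text.
    """
    def offset_of(name: str) -> int:
        # The portion after the last underscore is the numeric offset.
        return int(name.rsplit("_", 1)[1])

    result = ""
    for name in sorted(parts, key=offset_of):
        offset = offset_of(name)
        if len(result) > offset:
            raise ValueError(f"{name} overlaps the preceding sub-element")
        result = result.ljust(offset) + parts[name]  # blank-pad up to offset
    return result


# Hypothetical sub-elements, given out of order and with a gap at 6-8.
leader = assemble_fixed_field({"leader_05": "c", "leader_00": "00714", "leader_09": "a"})
```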
For fields f010 through f999, only child elements with the following names are exported as part of that field (where ### is replaced by the three digit field number): f###i1, f###i2, f###sa through f###sz, and f###s0 through f###s30 in record order.
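One way to apply this filter is sketched below in Python. The function name and the sample field are assumptions. The pattern accepts the two indicators, subfields sa through sz and subfields s0 through s30, and only when they carry the same field number as their parent.

```python
import re
import xml.etree.ElementTree as ET


def exportable_children(field: ET.Element) -> list[ET.Element]:
    """Return the indicator and subfield children of a data field, in record order."""
    match = re.fullmatch(r"f([0-9]{3})", field.tag)
    if match is None:
        raise ValueError(f"not an XMARC field element: {field.tag!r}")
    tag = match.group(1)
    # i1, i2, sa-sz, and s0-s30 under the parent's own field number.
    pattern = re.compile(rf"f{tag}(i[12]|s([a-z]|[0-9]|[12][0-9]|30))")
    return [child for child in field if pattern.fullmatch(child.tag)]


# Hypothetical field with local and mismatched children that are skipped.
sample = ET.fromstring(
    "<f245><f245i1>1</f245i1><note>x</note><f245sa>Title</f245sa>"
    "<f100sa>X</f100sa><f245s31>n</f245s31><f245s30>y</f245s30></f245>"
)
kept = [child.tag for child in exportable_children(sample)]
```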
When exporting the content of the XMARC indicator and subfield elements, all of their text is exported without any additional XML markup that may exist in the record. For example, if subfield f300sa looked like this:
<f300sa>A link to <a href="...">some place</a> on the web.</f300sa>
Assuming that the HTML anchor element had been added during XMARC processing, this field would be exported without the element markup but still include its content, thus:
<f300sa>A link to some place on the web.</f300sa>
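With Python's standard xml.etree.ElementTree module, this stripping of supplemental markup can be approximated by joining the text nodes of the subfield element. The function name is hypothetical and the sketch assumes the subfield is well-formed XML.

```python
import xml.etree.ElementTree as ET


def exported_text(elem: ET.Element) -> str:
    """Return all text inside elem, dropping any nested element markup."""
    return "".join(elem.itertext())


subfield = ET.fromstring(
    '<f300sa>A link to <a href="...">some place</a> on the web.</f300sa>'
)
text = exported_text(subfield)
```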