Document Structure

The top-level structure of the STEMMA Document is as follows:


<?xml version="1.0"?>
<Datasets xmlns='URI'>

    <Created> iso-datetime </Created>

    <Id> product-id </Id>
    <Name> product-name </Name>
    <Version> product-version </Version>

    { <Language> code </Language> |
      <Locale> code </Locale> }

    [ TEXT_SEG ] ...

    { <Dataset Name='name' xmlns:prefix='URI' ... >

        [ <Content>
            <Created> iso-datetime </Created>
            <Author> content-author-name </Author>
            <Version> content-version </Version>
            [ <Copyright> copyright-notice </Copyright> ]
            [ <LastModified> iso-datetime </LastModified>
              <ModifiedBy> name </ModifiedBy> ]
            { <Language> code </Language> |
              <Locale> code </Locale> }
            [ <Counters>
                { <Counter [Tag='tag']> integer </Counter> } ...
              </Counters> ]
            [ TEXT_SEG ] ...
          </Content> ]

        [ EXTENDED_PROPERTIES ]

        [ IMPORTS ]

        <entity-type Key='key' [Abstract='boolean']/> ...

      </Dataset> } ...

</Datasets>
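As an illustration only, a minimal Document following this structure might look as shown below. All names, keys, dates, and the namespace URI are hypothetical, and the body of the Person entity is elided:

<?xml version="1.0"?>
<Datasets xmlns='http://example.com/STEMMA/v4'>
    <Created> 2014-03-01T09:30Z </Created>
    <Id> ExampleProduct </Id>
    <Name> Example Product </Name>
    <Version> 1.0 </Version>
    <Locale> en_GB </Locale>

    <Dataset Name='SmithFamily'>
        <Content>
            <Created> 2014-03-01T09:30Z </Created>
            <Author> J. Smith </Author>
            <Version> 1 </Version>
            <Locale> en_GB </Locale>
            <Counters>
                <Counter Tag='Event'> 3 </Counter>
            </Counters>
        </Content>

        <Person Key='pJohnSmith'>
            ...
        </Person>
    </Dataset>
</Datasets>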



The URI for the default namespace is versioned in order to accommodate potential schema updates. The xmlns:prefix attributes of the <Dataset> element provide namespaces for custom types and other tag values. A discussion of extending partially controlled vocabularies may be found at Extended Vocabularies.


The <Dataset> element is an envelope for a self-consistent set of data. It has its own name, change history, and linkages. Although a STEMMA Document may contain a single Dataset, multiple ones can be concatenated in order to separate different family branches, or to isolate common Places, Citations, or Events. Under those circumstances, all Key names are local to their respective Dataset. Key references cannot span Datasets in the stored format, but they can do so when the associated data is loaded into memory. The IMPORTS element controls explicit name imports to a Dataset.


The Dataset Header contains the name of the original author, the current version string (no prescribed format), and the date and author of the current revision. The associated <Text> elements (discussed below) can be used to maintain any change history beyond that.


An 'iso-datetime', as specified here, has the format <yyyy-mm-dd>T<hh:mm><tz>, i.e. a numeric date followed by a literal 'T', followed by a numeric time. If present, 'tz' is either the literal 'Z' (denoting UTC) or a time-zone offset from UTC of the form ±hh:mm. For instance, 2011-12-25T14:00Z and 2011-12-25T14:00+00:00 are equivalent. See Dates.


The Language and Locale designations deserve special mention here. The <Language> element takes an ISO 639-2 three-letter code for a default language, e.g. "eng" for English. The <Locale> element provides a more detailed specification since it involves both an ISO 639-1 two-letter language code and an ISO 3166-1 two-letter territory code, e.g. "en_GB" for British English. This is a subset of the POSIX locale format, and was chosen for its simplicity over the IETF's Language Tags (BCP 47), which are very similar but far more comprehensive, even though the latter are acknowledged by XML (see the xml:lang attribute). It should be noted that these specifications do not make the Dataset specific to that language or locale (see Locale-independence); what they provide is a default language for the interpretation of narrative text. Each narrative <Text> element can specify an explicit Language or Locale, but this Header element provides the default. An example of a situation where a Locale may be more useful than a mere Language is when the text contains an ambiguous date such as 09-06-1956: knowing the Locale would help clarify whether it was 9 June or 6 September. If both are specified in the same element then Locale should take precedence because it is more specific.
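The effect of a Locale on such an ambiguous date can be sketched as follows. This is a simplified illustration, not part of the specification: only the territory part of the locale is consulted, and only 'US' is treated as month-first.

```python
from datetime import datetime

def parse_ambiguous_date(text, locale):
    """Interpret a numeric date according to the day/month order
    conventional for the given POSIX-style locale (e.g. 'en_GB').
    Simplified: only a 'US' territory is treated as month-first."""
    language, _, territory = locale.partition("_")
    fmt = "%m-%d-%Y" if territory == "US" else "%d-%m-%Y"
    return datetime.strptime(text, fmt)

# The same digits yield different dates under different locales.
print(parse_ambiguous_date("09-06-1956", "en_GB").strftime("%d %B %Y"))  # 09 June 1956
print(parse_ambiguous_date("09-06-1956", "en_US").strftime("%d %B %Y"))  # 06 September 1956
```

A real implementation would consult full locale data rather than this single special case, but it shows why the territory code carries information that a bare language code cannot.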


The EXTENDED_PROPERTIES element defines the custom Properties that may be attached to Persons, Places, Animals, Groups, and Events.


The <Counters> element contains an indefinite number of integer counters that can be used for assisted generation of key values. Key values can be generated algorithmically from names and dates for subject entities such as Persons and Places, but it may be a little more difficult for Events, Citations, Resources, etc. Having a number of persisted counters should help support a simple sequential allocation scheme, or something more involved. Values begin at one, and increment by one after the current value has been utilised. The optional tag value can be used by an application to arbitrate on which counter to use.
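A simple sequential allocation scheme over a <Counters> element might look like the sketch below. The function name and the prefix-plus-integer key format are assumptions made for the illustration; only the begin-at-one, increment-after-use behaviour comes from the specification.

```python
import xml.etree.ElementTree as ET

def allocate_key(counters, prefix, tag=None):
    """Allocate the next key value from a <Counters> element:
    return prefix + the current counter value, then increment the
    stored value by one so it persists for the next allocation."""
    for counter in counters.findall("Counter"):
        if counter.get("Tag") == tag:
            value = int(counter.text)
            counter.text = str(value + 1)   # persist the incremented value
            return "%s%d" % (prefix, value)
    raise KeyError("no counter with Tag=%r" % tag)

counters = ET.fromstring(
    "<Counters><Counter Tag='evt'>1</Counter><Counter Tag='cit'>7</Counter></Counters>"
)
print(allocate_key(counters, "eEvent", "evt"))   # eEvent1
print(allocate_key(counters, "eEvent", "evt"))   # eEvent2
```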


Dataset Loading

Although the following section is not technically part of the STEMMA specification, it is included to help clarify why STEMMA does not have an explicit Include statement, and how loading works given that STEMMA does not use a relational database or any other type of database engine (see Do Genealogists Really Need a Database?).


What a STEMMA Dataset does have is an Import statement, where the <Import> entity-type may be Person, Place, Animal, Group, Event, Citation, Resource, Source, Matrix, or Narrative. Imported Keys may be used as though they are defined in the current Dataset. It is an error if an imported Key is already defined. The end-user may require the hosting product to load full definitions from their respective Datasets — either in the current Document or elsewhere. That product may also cache the Keys defined in each Dataset in order to quickly identify which ones to load.


When a Dataset is being modified, its Imports act like 'forward' declarations for validation purposes. When its contents are being viewed (in a tree, timeline, map, or whatever), however, they currently cause the other necessary Datasets to be loaded. There is a subtle difference between these two modes, depending on whether a 'view' or an 'edit' operation is being performed on a given Dataset, and it is similar to the difference between the 'compilation' and 'edit' modes for a software source module. This distinction is only a product of the way the current software has evolved, though, and it could quite easily do a multi-Dataset load every time.


When a Dataset has been modified, it is persisted back to the original file, and an associated directory (not a folder-type "directory") is updated that records which entities are defined in each Dataset of every local Document. It is this directory that the loading process interrogates. In other words, the <Import> element says which entities are required, but it does not indicate where they come from; there is deliberately no concept of an explicit "Include dataset-name" statement. When loading a Dataset, a memory-resident table of unresolved entities is iteratively scanned to resolve all outstanding references: after the first load it will include the entries from that Dataset's Imports. The first entry in the table may require a second Dataset to be loaded, which in turn might resolve many outstanding entries from the table but also add a few new ones of its own, effectively resulting in the loading of a "dataset tree".
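The loading scheme just described can be sketched as follows. This is a minimal illustration, not the actual software: the function names, the shape of the directory (a mapping from entity key to dataset name), and the read_dataset callback (returning the keys a dataset defines and the keys it imports) are all assumptions made for the example.

```python
def load_dataset_tree(root, directory, read_dataset):
    """Load `root` and, transitively, every Dataset needed to resolve
    its imported keys.  `directory` maps entity key -> dataset name;
    `read_dataset(name)` returns (defined_keys, imported_keys)."""
    loaded = {}           # dataset name -> set of keys it defines
    unresolved = set()    # the memory-resident table of unresolved entities
    queue = [root]
    while queue:
        name = queue.pop()
        if name in loaded:
            continue
        defined, imported = read_dataset(name)
        loaded[name] = defined
        unresolved |= imported
        # Scan the table: drop keys now defined somewhere, then queue the
        # dataset that the directory says defines each remaining key.
        unresolved -= set.union(*loaded.values())
        queue.extend(directory[key] for key in unresolved)
    return loaded

# Hypothetical two-dataset example: treeA imports a source defined elsewhere.
directory = {"pTom": "treeA", "pAnn": "treeA", "wCensus1911": "sources"}
datasets = {
    "treeA":   ({"pTom", "pAnn"}, {"wCensus1911"}),
    "sources": ({"wCensus1911"}, set()),
}
print(sorted(load_dataset_tree("treeA", directory, datasets.__getitem__)))
# ['sources', 'treeA']
```

Loading the second dataset here resolves the outstanding reference; had "sources" imported further keys of its own, the scan would continue until the whole dataset tree was in memory.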