Locale-Independence

This section is concerned with the transportability of the data, and with avoiding the ambiguity caused by loading the data on a machine with different regional settings. This is distinct from providing foreign-language translations of text, and from localised formatting of dates, numbers, etc. The format of data values within each STEMMA Dataset should be considered a computer-readable format even though it is text-based. The complete STEMMA Document is not directly concerned with presentation of data to the end-user, and so does not use the end-user’s preferences for numeric formatting, date formatting, citation styles, etc.

 

The character set should be global, which nowadays means UTF-8, and this is also the default with XML. Although an XML-like header could explicitly nominate a non-default character set name, this would put the onus for supporting all possible character-set translations on the receiving software, and could limit portability between different operating systems.

 

A small issue with UTF-8 is that some editors don’t recognise it. For most people reading this, the default character set in their computer account will probably be a Latin-1 set (e.g. ISO 8859-1 or Windows Latin-1). Unless the editor is smart and recognises XML, or a BOM sequence at the start of the data, it may render some characters incorrectly. It is possible to avoid 8-bit character codes in XML, and so avoid the potential ambiguity, by restricting the text to 7-bit ASCII and using either entity references (e.g. &amp;) or numeric character references (e.g. &#8364; for €) for all other cases. The impact of this would be small since only developers would be looking at the raw representation.
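
As an illustration of the 7-bit approach, the following Python sketch converts text into pure ASCII suitable for XML content. The function name to_ascii_xml_text is hypothetical and not part of STEMMA; the sketch relies on Python's standard xmlcharrefreplace error handler, which emits decimal character references, and on xml.sax.saxutils.escape for the markup characters.

```python
from xml.sax.saxutils import escape


def to_ascii_xml_text(text: str) -> str:
    """Return XML character data that uses only 7-bit ASCII.

    Hypothetical helper, for illustration only.
    """
    # First escape the markup-significant characters (&, <, >).
    escaped = escape(text)
    # Then replace every non-ASCII character with a decimal
    # character reference such as &#8364;
    return escaped.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_ascii_xml_text("Café, 5 €"))       # Caf&#233;, 5 &#8364;
print(to_ascii_xml_text("Müller & Söhne"))  # M&#252;ller &amp; S&#246;hne
```

Any conforming XML parser should restore the original characters on loading, so the round trip loses nothing.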

 

Data values should be in a locale-independent format, similar to literals in the source code of a programming language. For this reason, the convention is sometimes described as using a ‘programming locale’. This effectively means using a period (not a comma) as the decimal separator in numbers, the all-numeric ISO 8601 format for (Gregorian) dates (e.g. yyyy-mm-dd), and non-localised 1/0 for Booleans (e.g. for option selections). These conventions are good practice for the designer of any shared data, and are not specific to XML or to micro-history data. Again, just as with programming languages, this ensures transportability and that the data will be loaded (or compiled) identically by any compliant product in any locale.
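
A minimal Python sketch of what loading such values might look like is given below. The function names are hypothetical, and the patterns cover only the simple forms mentioned above (plain decimals, yyyy-mm-dd dates, and 1/0); they are not the full STEMMA value syntax. None of the functions consults the regional settings of the machine.

```python
import re
from datetime import date
from decimal import Decimal

# Illustrative patterns only: ASCII digits, period as decimal separator.
_DECIMAL = re.compile(r"[+-]?[0-9]+(\.[0-9]+)?")
_ISO_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def parse_decimal(value: str) -> Decimal:
    """Parse a decimal number written with a period, never a comma."""
    if not _DECIMAL.fullmatch(value):
        raise ValueError(f"not a locale-independent decimal: {value!r}")
    return Decimal(value)


def parse_date(value: str) -> date:
    """Parse an all-numeric Gregorian date of the form yyyy-mm-dd."""
    match = _ISO_DATE.fullmatch(value)
    if not match:
        raise ValueError(f"not a yyyy-mm-dd date: {value!r}")
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)  # raises ValueError for e.g. month 13


def parse_boolean(value: str) -> bool:
    """Parse a non-localised Boolean: '1' is true, '0' is false."""
    if value not in ("0", "1"):
        raise ValueError(f"not a 1/0 Boolean: {value!r}")
    return value == "1"
```

For example, parse_decimal("3.14") succeeds in every locale, whereas parse_decimal("3,14") is rejected rather than being silently reinterpreted.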

 

It should be noted that tag values for types, subtypes, modes, and other taxonomies (e.g. Union, Birth, Marriage) should be considered part of the data syntax, just as with element names and attribute names. They should never be presented directly in the UI of a software product but rather should undergo a mapping to a meaningful term for the locale of the current end-user.
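
The mapping from tag value to displayed term could be sketched as follows. The table contents, the locale codes, and the function name display_term are all assumptions made for illustration; a real product would normally draw the terms from its own localisation resources.

```python
# Hypothetical lookup table: locale code -> (tag value -> display term).
DISPLAY_TERMS = {
    "en": {"Birth": "Birth", "Marriage": "Marriage"},
    "fr": {"Birth": "Naissance", "Marriage": "Mariage"},
    "de": {"Birth": "Geburt", "Marriage": "Heirat"},
}


def display_term(tag_value: str, locale_code: str, fallback: str = "en") -> str:
    """Map a tag value from the data to a term for the end-user's locale."""
    terms = DISPLAY_TERMS.get(locale_code, DISPLAY_TERMS[fallback])
    # Fall back to the default locale's term if this locale lacks one;
    # an unknown tag value raises KeyError instead of leaking into the UI.
    return terms.get(tag_value) or DISPLAY_TERMS[fallback][tag_value]


print(display_term("Birth", "fr"))     # Naissance
print(display_term("Marriage", "de"))  # Heirat
```

The stored data holds the same tag value regardless of who reads it; only the final lookup depends on the end-user.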

 

The term culturally neutral is used in this specification and refers to the ability of STEMMA to represent data from different cultures. The specification must avoid assumptions about the structure of personal names, the types of possible unions (e.g. marriage), religious ceremonies, family units, inheritance of names, etc.

 

Although Time Zones (TZ) and Daylight Saving Time (DST) are usually applied to local clock times, they can also apply to local calendar dates. The importance for genealogy is likely to be slight, but the area should be clarified. ISO 8601 does not include any named TZ designators: values are either ‘local time’ or expressed relative to UTC (Coordinated Universal Time), either as UTC itself or with a numeric offset from it. Local date/times should be interpreted in the context of the data location rather than the current location of the user, but this would only be significant when creating a timeline across TZ boundaries.
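
The timeline case can be illustrated with a short Python sketch. It assumes that the UTC offset of each data location is already known and is supplied by the caller; the function name and the example offsets are hypothetical, and historical offsets in particular would need to come from a suitable source.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_iso: str, utc_offset_hours: float) -> datetime:
    """Interpret a local date/time in the context of its data location.

    utc_offset_hours is the assumed offset of that location from UTC.
    """
    local = datetime.fromisoformat(local_iso)
    location_zone = timezone(timedelta(hours=utc_offset_hours))
    return local.replace(tzinfo=location_zone).astimezone(timezone.utc)


# Two events recorded in local time at places with different offsets.
event_east = to_utc("1950-01-01T00:30:00", 10)   # location at UTC+10
event_west = to_utc("1949-12-31T20:00:00", -5)   # location at UTC-5

# The eastern event carries the later local date but happened first.
print(event_east.isoformat())   # 1949-12-31T14:30:00+00:00
print(event_west.isoformat())   # 1950-01-01T01:00:00+00:00
print(event_east < event_west)  # True
```

Sorting by the local values alone would place these two events in the wrong order, which is the only situation where the distinction matters.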

 

Issues of date format, such as numeric form, long text form, abbreviated text form, and month/day ordering, are issues for the user interface and are, therefore, controlled by the end-user’s regional settings. The stored dates must be independent of that.
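
The separation between stored and displayed dates might be sketched as below. The style names and the DISPLAY_PATTERNS table are hypothetical stand-ins for whatever the end-user’s regional settings provide; only numeric forms are shown, and the stored value is never altered.

```python
from datetime import date

# Hypothetical display styles; a real UI would take these from the
# end-user's regional settings.
DISPLAY_PATTERNS = {
    "day-month-year": "%d/%m/%Y",
    "month-day-year": "%m/%d/%Y",
    "year-month-day": "%Y-%m-%d",
}


def render_date(stored: str, style: str) -> str:
    """Format a stored yyyy-mm-dd date for display in the given style."""
    year, month, day = (int(part) for part in stored.split("-"))
    return date(year, month, day).strftime(DISPLAY_PATTERNS[style])


stored_value = "1923-04-05"  # identical in every locale
print(render_date(stored_value, "day-month-year"))  # 05/04/1923
print(render_date(stored_value, "month-day-year"))  # 04/05/1923
```

The two displayed forms are ambiguous with respect to each other, which is why neither is acceptable as a stored format.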

 

It should be stressed that this section is concerned with computer-readable data. If, for example, a document image shows a written date, a transcribed version of that date can still be held as text in the data. However, if that written date can be interpreted then the format of the computer-readable value, including any error margins, is rigidly prescribed here.

 

Dates expressed in different calendars, particularly ones that cannot be precisely converted to Gregorian dates, are more challenging. Examples of other calendars include Julian, Hebrew, Islamic, Hindu, Persian, French Republican, Mayan, and Chinese. The STEMMA research notes under Dates and Calendars describe the requirement for a standard representation of dates from all calendars. The same goals that influenced the Gregorian ISO 8601 format, e.g. an unambiguous computer-readable form that is locale-independent, resulted in the encoding used for STEMMA Dates.