Personal Names

STEMMA has a single representation of subject names, whether for Persons, Animals, Places, or Groups. Although this section is primarily about personal names for Person entities, it uses a mechanism that was designed to be subject-neutral and to accommodate cultural differences in naming. The same mechanism is used later for the other subject types.

 

Personal names are not used in the same way around the world, and some things we take for granted in the West have no counterpart elsewhere. As well as variations due to married names, alternative spellings, nicknames, spellings in alternative languages, optional name parts, and stage names, the very structure of a name may vary, leaving it with little uniqueness and no obvious mapping onto our Western given-name/middle-name/surname concepts. An in-depth discussion of the issues may be found under Worldwide Family History Data and at The Game of the Name.

 

The handling of personal names separates the acceptance and matching of the name variants from the generation of the canonical names (i.e. the preferred identifications) during output. Both of these also support the temporal dependencies of those names (e.g. changes during marriage, adoption, etc) and potential overlaps of those time periods.

 

As a generic approach that applies to all subject entities, STEMMA provides a prioritised set of patterns to match. A 'full name' is defined by a list of possible 'token sequences'. These are in priority order, which determines the order in which they should be tested. Each 'token sequence' is an ordered set from the following token types:

 

name              - simple name token, e.g. Tony

{name, ...}       - mandatory selection from alternative tokens

[name, ...]       - optional selection from alternative tokens

 

The following example might belong to someone called Grace Ann Murphy who doesn't always use her middle name and sometimes goes as Gracie. However, she's Irish and also has an Irish version of her name. This would require the following two 'token sequences':

 

{Grace,Gracie} [Ann] Murphy

Gráinne [Ann] Ní Murchú

 

Tokens in each sequence are matched against those in the name from head-to-tail. I emphasise this because some cultures do not write left-to-right.
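
The head-to-tail matching described above can be sketched in Python as follows. This is only an illustrative sketch, not a defined STEMMA algorithm: the `Slot` representation and the function names `matches` and `first_match` are assumptions introduced here, and tokens are compared exactly, with none of the relaxed character matching discussed later.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical in-memory form of one slot in a token sequence:
# {Grace,Gracie} -> (["Grace", "Gracie"], False)   mandatory selection
# [Ann]          -> (["Ann"], True)                optional selection
Slot = Tuple[List[str], bool]

def matches(slots: Sequence[Slot], name_tokens: Sequence[str]) -> bool:
    """Return True if the name tokens are fully consumed, head-to-tail, by the slots."""
    if not slots:
        return not name_tokens
    alternatives, optional = slots[0]
    # First try to consume the head token of the name with this slot.
    if name_tokens and name_tokens[0] in alternatives:
        if matches(slots[1:], name_tokens[1:]):
            return True
    # Otherwise the slot may be skipped, but only if it is optional.
    return optional and matches(slots[1:], name_tokens)

def first_match(sequences: Sequence[Sequence[Slot]],
                name_tokens: Sequence[str]) -> Optional[int]:
    """Test the sequences in priority order; return the index of the first match."""
    for index, slots in enumerate(sequences):
        if matches(slots, name_tokens):
            return index
    return None

# The two token sequences from the Grace Murphy example.
GRACE = [
    [(["Grace", "Gracie"], False), (["Ann"], True), (["Murphy"], False)],
    [(["Gráinne"], False), (["Ann"], True), (["Ní"], False), (["Murchú"], False)],
]
```

With these definitions, `first_match(GRACE, ["Gracie", "Murphy"])` selects the first sequence and `first_match(GRACE, ["Gráinne", "Ní", "Murchú"])` selects the second, while a name lacking a mandatory token, such as `["Ann", "Murphy"]`, matches neither.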

 

An interesting issue here concerns the variations of individual name parts. In this example, Grace accepts "Gracie" as an informal version of her forename. However, the difference between Ann and Anne is more of a spelling error, during either recording, transcription or a subsequent lookup. This should be handled by the software unit, just as a soundex match might be.

 

Such patterns are stored in STEMMA using the following elements:

 

NAME_VARIANTS=

 

<Names>

<Sequences [RANGE_FROM] [RANGE_TO] [Type='name-type']

[Culture='cultural-style'] [DATA_ATTRIBUTE] ... >

<Canonical [Mode='name-mode'] [SortAs='sort-as'] > canonical-name </Canonical> ...

<Sequence [NAME_ATTRIBUTE] ... >

<Tokens [Optional='boolean'] [Initial='boolean']>

{ <Token> name-token-ucf-text </Token> } ...

</Tokens> ...

[ TEXT_SEG ] ...

</Sequence> ...

</Sequences> ...

</Names>

 

RANGE_FROM=

 

AfterEvent='key' | FromEvent='key' | After='std-date' | From='std-date'

 

RANGE_TO=

 

BeforeEvent='key' | UntilEvent='key' | Before='std-date' | Until='std-date'

 

NAME_ATTRIBUTE=

 

Language='code' | Phonetic='boolean' | Romanised='boolean'

 

As with Event constraints, After is >, From is >=, Before is <, and Until is <=.
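
As an illustration of these four comparisons, the following Python sketch tests whether a date falls within a range. It assumes, purely for the sake of the example, that any std-date or Event key has already been resolved to an exact `datetime.date`; real STEMMA dates may be imprecise. The function name `in_range` and its parameter names are hypothetical (`from_` carries a trailing underscore only because `from` is a Python keyword).

```python
from datetime import date
from typing import Optional

def in_range(when: date,
             after: Optional[date] = None,
             from_: Optional[date] = None,
             before: Optional[date] = None,
             until: Optional[date] = None) -> bool:
    """Apply the range semantics: After is >, From is >=, Before is <, Until is <=."""
    if after is not None and not when > after:
        return False
    if from_ is not None and not when >= from_:
        return False
    if before is not None and not when < before:
        return False
    if until is not None and not when <= until:
        return False
    # No bound was violated; an absent bound means "unconstrained".
    return True
```

A date equal to the boundary is therefore accepted by From and Until but rejected by After and Before.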

 

When software loads a <Names> element then it should tokenise the canonical names in addition to the explicit token sequences. This enables a certain level of simplification for the cases where there are no accepted token sequences beyond those implied by the canonical names.

 

The SortAs attribute allows the sort-order to be overridden when it is not determined solely by the available characters (e.g. in Japanese). The string consists of a token-by-token specification with ‘-’ indicating ‘no change from the equivalent canonical token’. For instance: “- - Souza”.
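
A minimal Python sketch of how such a specification might be applied is shown below. The function name `apply_sort_as` is hypothetical, the canonical name is split on whitespace as a simplification, and the name "Maria de Sousa" is an invented example chosen only to fit the "- - Souza" string above.

```python
from typing import List

def apply_sort_as(canonical: str, sort_as: str) -> List[str]:
    """Return the tokens to sort by, replacing canonical tokens where SortAs overrides them."""
    canonical_tokens = canonical.split()
    override_tokens = sort_as.split()
    if len(canonical_tokens) != len(override_tokens):
        raise ValueError("SortAs must supply one entry per canonical token")
    # '-' means 'no change from the equivalent canonical token'.
    return [c if o == "-" else o
            for c, o in zip(canonical_tokens, override_tokens)]
```

Here `apply_sort_as("Maria de Sousa", "- - Souza")` keeps the first two tokens and substitutes "Souza" for the third.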

 

The Culture attribute is yet to be defined. It is designed to indicate the general style of name and its handling. It therefore implies a prevailing Language code for cases where it has not been overridden. Q: Do we need a default Culture in the Dataset header?

 

The name-mode may be one of Formal, SemiFormal (default), Informal, and Listing, where ‘Listing’ is for sorting and collation purposes (e.g. Proctor, Anthony Charles). See the http://stemma.parallaxview.co/name-mode namespace.

 

The name-type may be one of the following. See Extended Vocabularies for defining custom name-types.

 

 

The Initial attribute controls whether individual tokens may be recognised by their initials. When canonical names are being tokenised, this is implied by the Culture setting. Note that initials are not applicable in all languages, even when a foreign name has been Romanised. It is not even the case that only the given names following the first may be replaced with initials; a familiar example in genealogical circles is D. Joshua Taylor, where the first given name is the one abbreviated.
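
The effect of the Initial attribute on a single token comparison might be sketched in Python as follows. This is an assumption-laden illustration: the function name `token_matches` is hypothetical, an initial is assumed to be written as one character followed by a period, and comparison is exact rather than relaxed.

```python
def token_matches(written: str, stored: str, initial_allowed: bool) -> bool:
    """Compare one token from a written name against one stored name token."""
    if written == stored:
        return True
    # Assumed written form of an initial: a single character plus a period, e.g. "G."
    if initial_allowed and len(written) == 2 and written[1] == ".":
        return stored[:1] == written[0]
    return False
```

Under these assumptions "G." is accepted for "Grace" only when the enclosing <Tokens> element has Initial set, and a bare "G" is never treated as an initial.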

 

The default setting for the Optional attribute is ‘0’ (i.e. False). The optional Event range attributes allow the applicability of a set of sequences to be constrained by relevant Events. If they are omitted, the sequences are taken to be always valid. A typical use of these is to differentiate maiden names from married names, but they are applicable to any type of name change. During name matching, it is recommended that the Event range attributes are ignored in order to provide a more relaxed operation. However, when deriving a Person’s full formal name, they should be honoured, and applied in the order they are written, in case there is any overlap due to fuzzy Event dates.

 

The name-attributes identifying the language, or whether the representation is phonetic, etc., probably need some clarification. There are a number of terms that often aren’t distinguished as well as they should be:

 

 

 

 

 

 

The <PersonalName> element is provided as a much simplified alternative to the <Names> element for the case where there are no variations and the matched name is identical to just one canonical name. A personal-name specified by a <PersonalName> element is wholly equivalent to a ‘SemiFormal’ Canonical name provided by the <Names> element.

 

Do we need to identify a subset of the tokens in a canonical name for highlighting as a surname, or family name, in software products? Note that a blind approach to marking tokens for highlighting avoids all the pitfalls associated with the rigorous categorisation of all name tokens.

 

In our example, Grace Murphy might be stored as follows, although the first <Sequences> element could be inferred at load-time from the canonical name:

 

<Names>

<Sequences>

<Canonical>Grace Ann Murphy</Canonical>

<Sequence>

<Tokens Initial='1'>

<Token>Grace</Token>

<Token>Gracie</Token>

</Tokens>

<Tokens Optional='1' Initial='1'>

<Token>Ann</Token>

</Tokens>

<Tokens>

<Token>Murphy</Token>

</Tokens>

</Sequence>

<Sequence Language='gle'>

<Tokens>

<Token>Gráinne</Token>

</Tokens>

<Tokens Optional='1'>

<Token>Ann</Token>

</Tokens>

<Tokens>

<Token>Ní</Token>

</Tokens>

<Tokens>

<Token>Murchú</Token>

</Tokens>

</Sequence>

</Sequences>

</Names>

 

This approach would be familiar to anyone with some knowledge of computer-language parsers. The interpretation of the tokens as given names, etc., might be done by a genealogical product but it is not inherent in the stored data.
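
To make the loading step concrete, here is a hedged Python sketch that reads a <Names> element with the standard `xml.etree.ElementTree` module and produces token sequences, including those implied by the canonical names. The function name `load_sequences`, the slot representation, the choice to list canonical-derived sequences first, and the treatment of '1' as the only true boolean value are all assumptions made for illustration. The sample uses straight quotes around attribute values because XML requires them.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

Slot = Tuple[List[str], bool]  # (alternative tokens, optional?)

SAMPLE = """<Names><Sequences>
<Canonical>Grace Ann Murphy</Canonical>
<Sequence>
<Tokens Initial='1'><Token>Grace</Token><Token>Gracie</Token></Tokens>
<Tokens Optional='1' Initial='1'><Token>Ann</Token></Tokens>
<Tokens><Token>Murphy</Token></Tokens>
</Sequence>
</Sequences></Names>"""

def load_sequences(xml_text: str) -> List[List[Slot]]:
    """Return every token sequence found in a <Names> element."""
    root = ET.fromstring(xml_text)
    result: List[List[Slot]] = []
    for sequences in root.findall("Sequences"):
        # Canonical names are tokenised too; each token becomes a mandatory slot.
        # (Splitting on whitespace is a simplification of real tokenisation.)
        for canonical in sequences.findall("Canonical"):
            tokens = (canonical.text or "").split()
            result.append([([token], False) for token in tokens])
        # Explicit sequences: one slot per <Tokens> element.
        for sequence in sequences.findall("Sequence"):
            slots: List[Slot] = []
            for group in sequence.findall("Tokens"):
                alternatives = [(t.text or "").strip() for t in group.findall("Token")]
                optional = group.get("Optional", "0") == "1"
                slots.append((alternatives, optional))
            result.append(slots)
    return result
```

Note that the result carries no notion of given name or surname; as stated above, any such interpretation is left to the consuming product.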

 

Character matching should be relaxed here, as for Place and Group names. The most obvious case of this to people speaking in a Latin-based language is a case-blind match. However, when looking at other Western locales, the next most common instance of a relaxed match is an accent-blind one. This basically means treating, say, A-acute the same as A, etc. This is common in some locales where the accents are routinely dropped for uppercase. There are also characters that have very different representations in upper and lower case. For instance, the German lowercase sharp s in "straße" (known as eszett) usually (there are exceptions) uppercases to "SS", i.e. "STRASSE". After that, there are symbols with both "composed" forms (i.e. one Unicode character) and "decomposed" forms (i.e. 2 or more Unicode characters). For instance, the following should all be treated the same:

212B (Å) ANGSTROM SIGN
00C5 (Å) LATIN CAPITAL LETTER A WITH RING ABOVE
0041 (A) LATIN CAPITAL LETTER A + 030A (°) COMBINING RING ABOVE

Unicode makes specific recommendations about which composed and decomposed forms should be equivalent:
http://www.unicode.org/reports/tr15/.

 

In summary, any pair of tokens being compared must both be normalised to a “flattened” form that treats each of these categories as equivalent. Only the normalised forms should then be directly compared.
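
One possible way to produce such a flattened form in Python uses the standard `unicodedata` module. This is a sketch under stated assumptions rather than a prescribed algorithm: the function name `flatten` is hypothetical, and the use of compatibility decomposition (NFKD) followed by removal of all combining marks is one choice among several; it is not the complete Unicode caseless-matching procedure.

```python
import unicodedata

def flatten(token: str) -> str:
    """Normalise a name token for case-blind, accent-blind comparison."""
    # Full case folding, so that the German eszett becomes "ss".
    folded = token.casefold()
    # Decompose, so that composed characters become base letter + combining marks.
    decomposed = unicodedata.normalize("NFKD", folded)
    # Drop the combining marks (accents, rings, etc.), keeping the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def tokens_equal(a: str, b: str) -> bool:
    """Compare two tokens using only their flattened forms."""
    return flatten(a) == flatten(b)
```

With this sketch the three representations of Å listed above all flatten to the same string, "straße" compares equal to "STRASSE", and "Gráinne" compares equal to "GRAINNE".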

 

A final note on tokenisation of a name prior to applying the name-matching algorithm: Certain punctuation characters should be used to separate the tokens but should not be present during the matching, e.g. spaces, apostrophe, hyphen, and non-breaking space. Hence, Henri Cartier-Besson should be tokenised as the set [Henri, Cartier, Besson]. An exception to this might be the period which would have to be retained. Hence, James O. O'Seven would be tokenised as the set [James, O., O, Seven] to ensure the initial is distinct from a single-character token. See Worldwide Family History Data for further discussion.
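
The tokenisation rule just described might be sketched in Python as below. The function name `tokenise` is hypothetical, and the inclusion of the typographic apostrophe (U+2019) among the separators is an assumption added here; the period is deliberately not a separator, so initials keep their trailing period.

```python
import re
from typing import List

# Separators: space, non-breaking space, apostrophe (straight or typographic), hyphen.
_SEPARATORS = re.compile(r"[ \u00A0'\u2019\-]+")

def tokenise(name: str) -> List[str]:
    """Split a written name into tokens, discarding the separator characters."""
    return [token for token in _SEPARATORS.split(name) if token]
```

Applied to the two examples above, `tokenise("Henri Cartier-Besson")` gives the tokens Henri, Cartier, Besson, and `tokenise("James O. O'Seven")` gives James, O., O, Seven.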