Worldwide Family History Data

This paper discusses some variations of family history data and research around the world in an attempt to demonstrate how crucial the subject was to the design of STEMMA®.

As well as supporting the exchange and long-term storage of micro-history data (which includes family history), a data format should be ‘free of both cultural limitations and locale dependencies’.

There’s quite a lot of precision implicit within this phrase, so I’ll first try to explain the jargon and give a quick tour of globalisation from a computer point of view.

2 Software Concepts and Standards

Computer software wasn’t always transportable between different countries. Systems now include Native Language Support, or National Language Support (NLS), to help develop software for multiple countries. Despite the availability of much better tools and software libraries, though, it is still possible for a developer to write bad software that is not transportable. The same old assumptions and misconceptions continue to trap the unwary. A firm grasp of the core concepts is therefore a prerequisite.

Internationalisation is the process of making a software product applicable worldwide, i.e. making its development generic enough that it is not constrained to just one country. Localisation is the process of making it address a specific country or culture, e.g. translating its text for that market. Globalisation is sometimes used to refer to both of these. Because of the dual z/s spellings, and the fact that these terms are hard to type, DEC coined the term i18n to represent internationalisation [there are 18 characters between the ‘i’ and ‘n’]. Similarly, “L10n” is used to represent localisation [the uppercase “L” is deliberate to avoid confusion with the digit “1”].

The difference between i18n and L10n may sound a bit vague but it makes sense if you know what software engineers have to do. Once upon a time, all the text that a program wanted to display (e.g. questions, error messages, etc) was burnt into the program code. That meant the program was far from globalised. Nowadays, the text is separated from the code and held in a separate resource file. The program requests the piece of text it requires using an agreed handle, or id. When that same program is configured for a different country, the request for that piece of text is serviced by a different resource file that has been translated appropriately. In this scenario, separating out all the text from the code is part of the i18n process, and the provision of specific translations of that text is part of the L10n process.

As these text resource systems grew more sophisticated, they acquired parameters. This allowed a message to say something like ‘There are <n> files in <m> directories’ without simply concatenating each section of text manually and putting the counts in between. That concatenation approach was another trap that caught the unwary as it didn’t always work elsewhere. A different language may have a different sentence structure that requires the parameters in a different order. Computer code can only supply the parameters in one given order, so those text templates typically number the parameter markers so that they can be moved around freely by the translation teams. For example: “The <%2> directories contain <%1> files”.
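The numbered-marker idea can be sketched in a few lines of Python. This is only an illustration: the `TEMPLATES` table, the `render` function, and the "xx" locale key are hypothetical, and Python's markers are numbered from 0 rather than from 1 as in the `<%1>` notation above.

```python
# Hypothetical message catalogue: one template per locale key.
# The "xx" entry stands in for a translation that needs the counts reversed.
TEMPLATES = {
    "en": "There are {0} files in {1} directories",
    "xx": "The {1} directories contain {0} files",
}

def render(locale_key, file_count, dir_count):
    """Fill the numbered markers; the code always supplies the counts in one order."""
    return TEMPLATES[locale_key].format(file_count, dir_count)

print(render("en", 12, 3))
print(render("xx", 12, 3))
```

The calling code never changes; only the template decides where each count appears.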

This works well but there are yet more traps. In the above example, if there were just ‘…1 file in 1 directory’ then you can see that we not only have plurals to worry about but the rules are different for the two nouns. Well, plurals don’t exist in a lot of languages so the problem grows. Other common traps include gendered nouns, ordinal suffixes (st/nd/rd/th), and the indefinite article (a/an). Great effort may be put into each in the misguided belief that other languages must have an equivalent.

The language spoken in a given region can be represented unambiguously by an ISO 639 code. This standard has a number of variants such as ISO 639-1 which uses a two-letter code (e.g. “en” for English and “de” for German) and ISO 639-2 which uses a three-letter code (e.g. “eng” for English and “deu” for German). There are also four-letter codes. The longer codes allow for more scope in representing dialects and dead languages.

The standard ISO 3166 defines unambiguous codes for present-day countries, territories, and other types of region. Again, there are variants of this standard. ISO 3166-1 alone defines two- and three-letter codes, and numeric country codes. For instance, the US has codes “US” or “USA”, and the UK has codes “GB” or “GBR”. The longer codes are preferred given the rate at which politics moves boundaries and affiliations.

The term locale is used to mean a collection of properties that represent a user’s language, regional settings, and cultural preferences. It is usually given a key that is formed from the user’s ISO 639-1 two-letter language code and the user’s ISO 3166-1 two-letter country code. For instance, “en_GB” for British English or “en_AU” for Australian English. BCP 47 (IETF BCP 47, "Tags for Identifying Languages") lies at the heart of these keys but there isn’t a single representation for the refinements of a locale. The POSIX standard uses language[_region][.codeset][@modifier] while the Java programming language uses language[_region[_variant]].
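As a rough illustration of the POSIX form quoted above, the following Python sketch splits a locale key into its parts. The pattern and the function name `parse_posix_locale` are assumptions made for this example; real systems accept a wider range of names than this pattern allows.

```python
import re

# Illustrative pattern for: language[_region][.codeset][@modifier]
POSIX_LOCALE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:_(?P<region>[A-Za-z]{2}))?"
    r"(?:\.(?P<codeset>[A-Za-z0-9-]+))?"
    r"(?:@(?P<modifier>[A-Za-z0-9]+))?$"
)

def parse_posix_locale(name):
    """Return a dict of the four parts; absent optional parts are None."""
    match = POSIX_LOCALE.match(name)
    if match is None:
        raise ValueError(f"not a recognised locale key: {name!r}")
    return match.groupdict()

print(parse_posix_locale("en_GB"))
print(parse_posix_locale("de_DE.UTF-8@euro"))
```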

These locale properties indicate such things as the default way you want to see your dates and times written, the way you want to express decimal numbers (i.e. do you use a period or a comma for the decimal separator), what character do you use to separate whole digits and how many at a time (for instance, the US/UK use a comma to separate thousands), and whether you want parenthesised negative values. The locale usually includes your default paper size (e.g. A4) and some default systems of measurement. However, the latter tend to cause some confusion. For instance, although a country may claim it uses the Metric System, any “petrol head” will tell you that wheel hub sizes only come in imperial inches — worldwide, and irrespective of what the local government may like. The actual measurement units therefore depend on the context.

The default currency is another source of confusion. Although the UK may use GBP, people can have euro bank accounts, dollar bank accounts, etc. This may be less common in places like the US and so the assumption that a given locale uses a given currency tends to perpetuate. A related trap is to assume that the way you write monetary values depends on your currency rather than your locale. Ireland, for instance, is in the euro-zone but uses a period as a decimal separator, not a comma like mainland Europe. Having been told in a US company that ‘all euro-zone countries use a comma as the decimal separator’, I can testify to the misconceptions.

Monetary values are an interesting case because the locale indicates things like the decimal separator, digit groupings, whether the currency symbol occurs at the front or the end of the value, and whether a negative sign is placed before or after a leading currency symbol. However, both the choice of currency symbol and the number of decimal places are dependent upon the currency, not the locale.

So what is not included in the locale? Well, besides whether you prefer tea or coffee, or routinely avoid adverbs [joke], it does not include the way phone numbers are written. As we’ll see later, there are standards for representing international phone numbers in a computer-readable way but no locale system indicates how a particular region writes its local phone numbers. Each country is different here, and may use different punctuation characters and different digit groupings. See Local_conventions_for_writing_telephone_numbers.

2.1 Character Data

This used to be a huge problem area but is less of a problem these days.

Each printable character is represented in the computer by a specific numeric code. The list of assigned character codes is called a character set. The earliest sets were ASCII and EBCDIC, both developed around 1963, and both US-centric, but as different as chalk and cheese otherwise. Both are termed SBCS (Single Byte Character Sets) since ASCII used just 7 bits to represent each character (i.e. 128 possibilities) and EBCDIC used 8 bits (i.e. 256 possibilities). Under the control of ANSI, the US-ASCII set was supplemented with sets applicable to other countries. However, languages such as Chinese, Japanese, & Korean (CJK) needed far more than 256 possible characters each. Their ANSI sets therefore used one byte for some characters and two bytes for other characters and were termed DBCS (Double Byte Character Sets) as a result. Character sets such as EUC were termed MBCS since they went further and used three or more bytes for some characters.

This was generally a chaotic time since you had to know what character set your data was in if you were to interpret it correctly. Even though computer systems provided masses of tables for software to convert between all the different character sets, data was routinely processed using the wrong set which then resulted in junk characters being displayed.

Then along came Unicode! Developed by the Unicode Consortium in about 1987, and later recognised by ISO, it was a universal character set. This meant that it had codes to represent all possible characters and no more character conversions had to be performed. Each character was represented by a 16-bit quantity which meant there were then 65536 combinations. There were still a few niggles though: Memory was still limited then and so no one was keen on doubling their storage requirement. Some computers serialised a 16-bit value with the high-order byte first (“big-endian”) and some with the low-order byte first (“little-endian”). Also, either half of a 16-bit value could get mistaken for an 8-bit control character when transmitted down an asynchronous connection. There were also lots of ancient characters such as Egyptian Hieroglyphs to consider.

Unicode now defines two mapping methods: the Unicode Transformation Format (UTF) encodings, and the Universal Character Set (UCS) encodings.

UCS-2 — a 2-byte, fixed-width encoding. Now obsolete.

UCS-4 — a 4-byte, fixed-width encoding.

UTF-8 — an 8-bit, variable-width encoding. Uses one to four bytes per character.

UTF-16 — a 16-bit, variable-width encoding. Embraces older UCS-2.

UTF-32 — a 32-bit, fixed-width encoding. Functionally equivalent to UCS-4.

Hence, the ubiquitous UTF-8 is effectively a universal MBCS.
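The widths listed above, and the byte-order issue mentioned earlier, can be observed with Python's built-in codecs. The four sample characters are chosen purely for illustration (the last one is an Egyptian hieroglyph from outside the original 16-bit range), and `encoded_length` is a hypothetical helper.

```python
# Sample code points: U+0041, U+00E9, U+20AC, U+13000
SAMPLES = ["A", "\u00e9", "\u20ac", "\U00013000"]

def encoded_length(text, encoding):
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))

for ch in SAMPLES:
    print(f"U+{ord(ch):04X}",
          encoded_length(ch, "utf-8"),
          encoded_length(ch, "utf-16-be"),
          encoded_length(ch, "utf-32-be"))

# Byte order matters for the multi-byte code units of UTF-16 and UTF-32.
big_endian = "A".encode("utf-16-be")     # high-order byte first
little_endian = "A".encode("utf-16-le")  # low-order byte first
print(big_endian, little_endian)
```

UTF-8 grows from one to four bytes across the samples, UTF-16 needs two 16-bit units for the hieroglyph, and UTF-32 is always four bytes.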

Each locale may have slightly different rules for sorting character data — even when they use the same characters in their languages. For instance, the relative sorting of upper and lower case may differ, or accented and unaccented may differ. There is a Unicode sorting sequence which merely relies on the assigned numeric codes, but only software developers resort to that. A particularly important problem in this area concerns databases. SQL databases were eventually given Unicode column types (e.g. ntext, nchar, etc) and that meant text from multiple locales could be stored in the same table. However, these same databases already provided configurable ‘collation sequences’ that they could assign to such columns to control the semantics of sorting within queries. These collation sequences could only specify one sort order, though, irrespective of the locale associated with the data or the locale associated with the end-user. They’re therefore frowned upon and rarely used in multinational products.
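The sort that relies only on the assigned numeric codes is what most programming languages give by default. A small Python sketch, with an arbitrary word list, shows why it is rarely what an end-user expects:

```python
# "\u00e9" is e-acute; the list is arbitrary and only for illustration.
words = ["b", "a", "B", "\u00e9", "z"]

# Python compares strings by code point, with no locale involved.
code_point_order = sorted(words)
print(code_point_order)
print([f"U+{ord(w):04X}" for w in code_point_order])
```

The uppercase letter sorts before all lowercase letters and the accented letter sorts after "z", which matches no dictionary order that a reader would recognise.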

3 Goals

So what should a universal format aim to achieve for micro-history data? The phrase in the Introduction said “free of both cultural limitations and locale dependencies”.

The first of these criteria means that the structure and interpretation of the data should not be constrained by some specific set of cultural values, or by ignorance of other cultures (either old or new). This is, by far, the bigger of the two and will be discussed in the following sections. It is all too easy to assume that the structure of our daily lives, and the society around us, can be extrapolated to apply to the whole world, both now and at all other times in its history.

The cultural criterion, also called “culturally neutral” elsewhere, does not mean ignoring cultural differences and treating them all the same. It means embracing them all with a set of generic concepts and syntactic devices that can be used to address the differences individually when needed. From this point of view, it is analogous to the difference between i18n and L10n described earlier.

3.1 Locale Independence

This criterion means that the interpretation of the data should not be dependent upon the locale setting of the end-user. It is primarily a computer issue because the data format is designed to be computer-readable. It is therefore a problem that has been solved already and there are applicable standards for it.

Note that this does not relate to textual data being in some specific language. The format is free to hold such data from many languages as long as it qualifies each one.

A good example of a locale-dependent pitfall — which still happens in poorly-designed software to this day — is when you want to store a decimal value. Suppose your program stores the value exactly as your own locale expresses such values. A value stored as “3.14” in the US would be stored as “3,14” in France. When that same program reads the data back, everything seems to work OK in the US and in France. If the data files are exchanged, though, then the program will fail in both locations.
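The failure described above can be imitated without touching the real system locale. In this Python sketch the functions `write_decimal` and `read_decimal` are hypothetical stand-ins for locale-driven formatting and parsing; they are not how any particular NLS library behaves.

```python
def write_decimal(value, decimal_sep):
    """Imitate locale-style output by swapping in the locale's decimal separator."""
    return repr(value).replace(".", decimal_sep)

def read_decimal(text, decimal_sep):
    """Imitate locale-style input: only decimal_sep is accepted as the separator."""
    if decimal_sep != "." and "." in text:
        raise ValueError(f"unexpected separator in {text!r}")
    return float(text.replace(decimal_sep, "."))

us_file = write_decimal(3.14, ".")      # "3.14"
french_file = write_decimal(3.14, ",")  # "3,14"

# Each program can read its own file...
print(read_decimal(us_file, "."), read_decimal(french_file, ","))

# ...but not the other one's.
for text, sep in ((french_file, "."), (us_file, ",")):
    try:
        read_decimal(text, sep)
    except ValueError as err:
        print("failed:", err)

# A fixed representation avoids the problem: repr() and float() ignore the locale.
assert float(repr(3.14)) == 3.14
```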

This problem affects any datum that is interpreted by the computer software when the data is loaded up, and that usually means numbers, dates & times, and Boolean values.

This subject has a direct parallel in the source code for programming languages. The source code is supposed to unambiguously define the actions of the program, irrespective of where it gets compiled into machine code. Hence, keywords should never be translated to different languages, and numeric, date/time, and Boolean constants should be in a fixed format that’s independent of the programmer’s locale setting. As a result, this fixed computer-readable format is often said to belong to the ‘programming locale’.

We take it for granted that the record types (e.g. element and attribute names in XML) are part of the computer-readable syntax and should never be translated into other languages. There is a grey area, though, in textual values associated with some meta-data and data. For instance:

<Role Type="Spouse"/>

<Sex>Male</Sex>

If such values are defined by the data-format specification (i.e. part of its grammar) then they should never be translated and any standard must be clear on that issue. This is sometimes called a “controlled vocabulary”. Also, those raw values from the data should never be shown in the UI as they belong to the ‘programming locale’ and not the user’s locale. In those circumstances, the terms should be translated appropriately prior to display.
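One way to keep the stored terms fixed while still showing translated text is a lookup at display time. The following Python sketch is an assumption-laden illustration: the table `DISPLAY_TEXT`, the function `display_text`, and the French wording are all invented for the example and are not part of any specification.

```python
# Hypothetical translation tables keyed by locale, then by the fixed vocabulary term.
DISPLAY_TEXT = {
    "en_GB": {"Male": "Male", "Spouse": "Spouse"},
    "fr_FR": {"Male": "Masculin", "Spouse": "Conjoint"},
}

def display_text(term, locale_key):
    """Translate a controlled-vocabulary term for display; the stored term never changes."""
    return DISPLAY_TEXT[locale_key][term]

stored_value = "Spouse"  # as written in the data file, whatever the user's locale
print(display_text(stored_value, "en_GB"))
print(display_text(stored_value, "fr_FR"))
```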

3.2 What is the Programming Locale?

This seems to be an older term used in the context of globalisation issues, and relates to the representation of reserved words and literals in textual data or programming languages. It seems to be more common nowadays to talk about specific standards for the representation of literals, but the term Programming Locale gives a lot more clarity to a generic issue.

1) Literals. The representation of decimal numbers, dates/times, and Booleans is fixed by the language, and NOT by the end-user's locale.

I have seen questions such as ‘why does my programming language want date literals in US format?’ (e.g. in VB6), asked by people clearly unaware that those literals are read by software and not by human beings. We take it for granted that decimal numbers in such languages use a period rather than a comma as the decimal separator, not least because it would otherwise prevent the comma being used as a list separator.

The issue becomes very important when generating configuration data that has to be re-imported by the software. For instance, if one of the old-style INI files had a 'Setting=Boolean' line then the value of that setting would likely be TRUE/FALSE, and not a version localised for whoever the end-user is, otherwise it would affect transportability. Similarly with decimal numbers and dates. Basically, the routines used to format those values should be selected for the Programming Locale and NOT from any of the NLS support.

2) Reserved words in such languages are also part of the Programming Locale, even if they look like English, or any other readable language. The rationale is clear — what would be the point of having separate English-C, French-C, etc. — but it still catches people out.

If some data language has certain keywords prescribed by the data specification then they may appear as cryptic abbreviations, and in a particular character case, so there should not be much confusion. But if it also accommodates other reserved words (say, for a setting with an enumeration of possible terms) and they appear to be simple English words, then it's not just tempting to display them directly to the end-user but it happens with unfortunate regularity. It's only when such software is globalised that the naivety of the programmer is uncovered.
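The point about literals in item 1 can be sketched as a pair of routines that write and read settings in one fixed form. This is a minimal Python illustration; the names `serialise` and `parse_boolean` are hypothetical, and the use of ISO 8601 for dates is an assumed choice rather than something the INI example prescribes.

```python
from datetime import date

def serialise(value):
    """Format a setting value in one fixed form, never the end-user's locale."""
    if isinstance(value, bool):       # test bool first: bool is a subclass of int
        return "TRUE" if value else "FALSE"
    if isinstance(value, date):
        return value.isoformat()      # ISO 8601 (an assumed choice), e.g. 1884-10-13
    if isinstance(value, (int, float)):
        return repr(value)            # always a period as the decimal separator
    raise TypeError(f"unsupported setting type: {type(value).__name__}")

def parse_boolean(text):
    """Accept only the fixed spellings, not any translated equivalents."""
    if text == "TRUE":
        return True
    if text == "FALSE":
        return False
    raise ValueError(f"not a Boolean literal: {text!r}")

print(serialise(True), serialise(3.14), serialise(date(1884, 10, 13)))
```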

4 Dates

We may think that this is simply the old argument about whether the day is written before the month, or vice versa. However, the topic is much larger than that, and such an assumption illustrates typical Western parochialism.

A calendar is a mechanism by which dates are reckoned in a given culture. For instance, Gregorian or Julian. A list of worldwide calendars, including historical ones, may be found at: List_of_calendars. Another useful resource may be found at: Calendar FAQ.

The subject of calendars and date representation is examined in more depth under Dates and Calendars.

4.1 Times

The idea of dividing a day into 24 hours, or each day-lit and night-time portion into 12 hours, has been around for many centuries. However, the origin from which they’re counted has varied over that time.

The modern 12-hour clock divides each day into two periods: a.m. (from the Latin ante meridiem, meaning "before midday") and p.m. (from post meridiem, meaning "after midday"). Each period consists of 12 hours numbered: 12 (effectively a zero), 1, 2, … through to 11. See 12-hour_clock for more details.
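The remark that 12 is "effectively a zero" is easiest to see in a conversion routine. A small Python sketch, with the hypothetical function name `to_12_hour`:

```python
def to_12_hour(hour, minute):
    """Convert a 24-hour clock time to 12-hour notation."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("time out of range")
    period = "a.m." if hour < 12 else "p.m."
    display_hour = hour % 12 or 12   # hours 0 and 12 are both written as 12
    return f"{display_hour}:{minute:02d} {period}"

print(to_12_hour(0, 5))    # just after midnight
print(to_12_hour(12, 0))   # midday
print(to_12_hour(13, 30))
```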

At the International Meridian Conference in 1884, the following proposal by Sandford Fleming was adopted:

That this universal day is to be a mean solar day; is to begin for all the world at the moment of mean midnight of the initial meridian, coinciding with the beginning of the civil day and date of that meridian; and is to be counted from zero up to twenty-four hours.

This constitutes the modern 24-hour clock. This is considered less ambiguous than the 12-hour clock, and is the preferred system in countries like the UK and Ireland (although both systems are used interchangeably in everyday life). The ISO 8601 standard for the international representation of dates and times uses the notation of the 24-hour clock.

Decimal time is a term often used to refer to French Republican Time, which divides the day into 10 decimal hours, each decimal hour into 100 decimal minutes and each decimal minute into 100 decimal seconds. See French_Republican_Calendar.
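Since a day holds 86,400 standard seconds and 10 × 100 × 100 = 100,000 decimal seconds, the conversion is simple proportion. A Python sketch follows; the function name `to_decimal_time` is hypothetical and any fraction of a decimal second is truncated.

```python
def to_decimal_time(hour, minute, second):
    """Convert a 24-hour clock time to (decimal hour, decimal minute, decimal second)."""
    standard_seconds = hour * 3600 + minute * 60 + second
    # 86,400 standard seconds correspond to 100,000 decimal seconds.
    decimal_seconds = standard_seconds * 100_000 // 86_400
    decimal_hour, remainder = divmod(decimal_seconds, 10_000)
    decimal_minute, decimal_second = divmod(remainder, 100)
    return decimal_hour, decimal_minute, decimal_second

print(to_decimal_time(12, 0, 0))   # midday
print(to_decimal_time(18, 0, 0))   # three quarters of the way through the day
```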

The six-hour clock is a traditional timekeeping system used in the Thai language, and formerly in the Lao and Khmer languages, alongside the official 24-hour clock. It also counts 24 hours in a day but divides the day into four quarters, each of six hours in length.

5 Letter Case

Most Western languages support the concept of letter case, and distinguish lowercase (minuscule) letters from uppercase or capital (majuscule) letters. See Capitalization.

A character exception in those languages is the German eszett, ß, which only has a lowercase form, although an uppercase form appears in some old books. It is generally uppercased to “SS”, e.g. Straße (street) transforming to STRASSE. There are some contexts where the letter is left intact during uppercasing, and the applicable rules were modified during the German spelling reform of 1996.
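Python's default, locale-independent case mapping follows the "SS" convention, which makes it a convenient way to see that the conversion cannot be undone. A short illustration:

```python
original = "Stra\u00dfe"           # "Straße"

uppercased = original.upper()      # the eszett becomes two letters
round_trip = uppercased.lower()    # lowercasing cannot restore the eszett

print(uppercased, len(original), len(uppercased))
print(round_trip == original.lower())
```

The uppercased word is one character longer than the original, and lowercasing it yields "strasse" rather than "straße".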

In English, uppercase letters are mostly used as the first letter of a sentence, a proper noun or adjective, for initials, or for abbreviations. They may also be used for emphasis, titles, and to avoid ambiguity. The rules have changed over time with fewer words being capitalised now than in previous centuries.

Some words are effectively case-sensitive because they have a different meaning if the case is changed (e.g. Italic and italic). See List_of_case-sensitive_English_words.

In German (and in Luxembourgish which is related to German), all nouns are capitalized (see http://german.about.com/library/weekly/aa020919a.htm). This was also the practice in Danish before their spelling reform of 1948 (see Spelling_reform). It was even done in 18th century English and may be observed in Gulliver's Travels and most of the original 1787 United States Constitution.

In languages that use diacritical marks (e.g. acute accent), those marks may be preserved during uppercasing (as in German) or routinely dropped (as in French and Spanish).

Irish uses two forms of mutation on initial consonants: lenition (Irish: séimhiú) and eclipsis (Irish: urú). Originally for phonological reasons, this is common to all modern Celtic languages. The net result is that the first letter of a capitalised word may not be the one that is uppercased. For instance:

Oibríonn mo Daidí i mBaile Átha Cliath (My Dad works in Dublin)

Téim go dtí an leabharlann ar an gCéadaoin (I go to the library on Wednesdays)

In general, it’s not even just a single letter that may be uppercased. For instance, in O’Donnell, it is the first two letters, and this affects both personal names and place names derived from personal names (e.g. O’Donnell Street).

6 Comparing Names

Character matching should be relaxed when comparing such things as personal names or place names. The most obvious case of this, to people using a Latin-based script, is a "case-blind match". However, when looking at other Western locales, the next most common instance is an "accent-blind match". This basically means treating, say, A-acute the same as A, etc. This is common in some locales where the accents are routinely dropped for uppercase. There are also characters that have very different representations in upper and lower case. For instance, the German lowercase sharp s in "straße" (known as eszett) usually (there are exceptions) uppercases to "SS", i.e. "STRASSE". After that, there are symbols with both "composed" forms (i.e. one Unicode character) and "decomposed" forms (i.e. 2 or more Unicode characters). For instance, the following should all be treated as the same:

212B (Å) ANGSTROM SIGN
00C5 (Å) LATIN CAPITAL LETTER A WITH RING ABOVE
0041 (A) LATIN CAPITAL LETTER A + 030A (°) COMBINING RING ABOVE

Unicode makes specific recommendations about which composed and decomposed forms should be equivalent: http://www.unicode.org/reports/tr15/.

In summary, any pair of tokens being compared must first be normalised to a “flattened” form that treats each of these categories as equivalent, such as a single-case unaccented form. Only the normalised forms should then be compared directly.
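A minimal sketch of such a “flattened” form, using Python's standard `unicodedata` module, is given here. The function name `flatten` and the particular recipe (compatibility decomposition, removal of combining marks, then case folding) are assumptions for illustration; a production matcher may need language-specific refinements.

```python
import unicodedata

def flatten(token):
    """Reduce a token to a single-case, unaccented form for comparison only."""
    decomposed = unicodedata.normalize("NFKD", token)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()

# The three representations listed above flatten to the same string.
forms = ["\u212b", "\u00c5", "A\u030a"]
print({flatten(f) for f in forms})

# Case-blind matching that also copes with the eszett.
print(flatten("Stra\u00dfe") == flatten("STRASSE"))
```

The flattened strings are for matching only and should never replace the stored or displayed form of the name.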

Before two names can be compared, they must be tokenised, i.e. split into a sequence of words or elements. Certain punctuation characters should be used to separate the tokens but should not be present during the matching, e.g. spaces, apostrophes, hyphens, and non-breaking space. Hence, Henri Cartier-Bresson should be tokenised as the set [Henri, Cartier, Bresson]. A possible exception to this might be the period which would have to be retained. Hence, James O. O'Seven would be tokenised as the set [James, O., O, Seven], otherwise, the “O” token would be ambiguous with an “O” initial. This is an important issue in Irish names since the Irish-language equivalent to the O-apostrophe (O-fada, Ó) is a separate token and so must be distinct from any similar initial when “flattened” for comparison. There are several cases where a name may contain a single-letter non-initial, such as the Irish Ó (or O-fada, meaning from) and the Spanish y (meaning and). An English-speaking example would be the renowned geologist J Harlen Bretz whose first name was “J” and not a “J.” initial.
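The tokenising rule described above can be sketched with a regular expression. The separator list in this Python example (space, non-breaking space, straight and curly apostrophes, hyphen) is an assumption drawn from the examples in the text, and `tokenise` is a hypothetical name.

```python
import re

# Separators are dropped; the period is deliberately not one of them.
SEPARATORS = re.compile(r"[ \u00a0'\u2019\-]+")

def tokenise(name):
    """Split a name into tokens, keeping any period attached to its initial."""
    return [token for token in SEPARATORS.split(name) if token]

print(tokenise("James O. O'Seven"))
print(tokenise("Henri Cartier-Bresson"))
print(tokenise("St\u00a0John"))
```

Because the period survives, the initial “O.” remains distinct from the “O” that came from the apostrophised surname.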

An English-speaking example of where a personal name might contain a non-breaking space is ‘St John’, or ‘St. John’ (see St John (name)), pronounced Sinjin or Sinjun.

When comparing tokens, we must take account of general abbreviations and diminutive forms (e.g. Chas matching Charles), and for certain parts of a personal name — in certain languages only — we must take account of initials (e.g. A. Proctor matching Anthony Proctor). The topic of abbreviations and common misspellings, and the possibility of a single authority on them, is discussed further under Person and Place Names.
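A rough Python sketch of such a token comparison follows. The one-entry `DIMINUTIVES` table and the function `tokens_match` are hypothetical, and the rule that an initial is "a single letter followed by a period" is a simplifying assumption that will not suit every language.

```python
# Hypothetical lookup of abbreviations and diminutives (flattened to lower case).
DIMINUTIVES = {"chas": "charles"}

def is_initial_of(short, full):
    """True if short looks like 'A.' and full is a longer word starting with that letter."""
    return (len(short) == 2 and short.endswith(".")
            and len(full) > 1 and full.isalpha()
            and full.startswith(short[0]))

def tokens_match(first, second):
    a = DIMINUTIVES.get(first.casefold(), first.casefold())
    b = DIMINUTIVES.get(second.casefold(), second.casefold())
    return a == b or is_initial_of(a, b) or is_initial_of(b, a)

print(tokens_match("Chas", "Charles"))
print(tokens_match("A.", "Anthony"))
print(tokens_match("J", "James"))   # a bare "J" is a name, not an initial
```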

7 Personal Names

A given name is used to distinguish members of a family group. The term implies that the name is purposefully chosen when the child is born and contrasts with inherited parts of their personal name. In the West, a given name is often called a first name, or forename, but this presupposes the order of the name parts. See Given_name.

A surname is an inherited part of a personal name added to a given name, and is usually a family name. Many dictionaries actually define ‘surname’ as a synonym of ‘family name’ but this is not true where a culture uses patronymic or matronymic names, i.e. where a surname is based on the given name of a male or female ancestor, respectively. In the West, a surname is often called a last name but that presupposes the order of the name parts. In North and South America, as well as in most of Europe, a surname is placed after a person's given name. In China, Japan, Korea, and many other East Asian countries, as well as in Hungary, the family name is placed before a person's given name. In Spain and most Spanish-speaking countries, two or more surnames are commonly used. See Surname.

Family names were not compulsory in the Scandinavian countries until the 20th century, and not in Norway until 1923. At the time of writing, Iceland still does not use family names for its native inhabitants.

In the West, family names are usually inherited from the father, and may be described as patrilineal surnames. However, some cultures use a matrilineal surname inherited from the mother.

A middle name is an additional name placed between a given name and a surname. There may be zero or more of these and they may be extra given names, surnames of ancestors or relatives, a maiden name, or a saint’s name.

Traditional Chinese names can use something called a Generational name to identify members of a particular generation, including siblings, cousins, etc. There is no Western equivalent of this custom.

Honorifics are parts of a name expressing esteem or respect for the person. In English-language names, these are usually academic titles (e.g. Dr. or Prof.), honorific prefixes (e.g. the honourable, or his holiness), honorific titles (e.g. Sir, Lord, Dame, Lady), or post-nominal letters (e.g. VC, OBE, PhD). The page on English name suffixes also identifies generational titles (e.g. Jr., Sr., I, II, III, etc.), although the Irish equivalents are actually infix as opposed to either prefix or postfix.

There is also a general class of name token called a 'name particle', analogous to a grammatical particle. This includes all those small joining words such as: “von”, “van”, “der”, “de [la]”, “d′”, “the”, “[son] of”, “mc”, “mac”, “Ó”, “Ní”, “Nic”, “Mhic”, “Bean”, “Uí”, “y”, etc. These have different characteristics that dictate their behaviour under case conversion and sorting.

The essential elements of a personal name are, therefore, given names, middle names, surnames, and generation names. Terms such as first name, forename, Christian name, last name, and family name are culturally dependent. In conjunction with name particles, honorifics, and generational titles this covers the elements of most modern personal name formats. Some special cases may occur for ancient names or royal titles.

Irish personal names make use of various types of name particle to indicate relationships. A man's surname usually takes the form Ó (or Ua, originally "grandson of") or Mac ("son of") followed by the genitive case of a name, e.g. Ó Dónaill ("grandson of Dónall") or Mac Gearailt ("son of Gerald"). The Ó is usually changed to O' in Anglicised forms, e.g. Ó Conchobhair becoming O'Connor. However, a woman's surname replaces Ó with Ní, and Mac with Nic. Hence the daughter of a man named Ó Dónaill would have the surname Ní Dhónaill and the daughter of a man named Mac Gearailt would have the surname Nic Gearailt. Anglicised forms use O' or Mac regardless of gender.

If an Irish woman marries and takes her husband's surname, the Ó is replaced by [Bean] Uí ("wife of the grandson of") and Mac by [Bean] Mhic ("wife of the son of"). In effect, the Irish surname is dependent on a number of factors, including the gender of the person. It also highlights the need to decide whether those name particles should be considered part of the surname or simply related to it. The Irish Ó (O-fada) is a separate name particle but the Anglicised equivalent (O’) is a prefix in the surname itself.

If the second part of an Irish surname begins with a vowel, the form Ó attaches an h to it, as in Ó hUiginn (O'Higgins) or Ó hAodha (Hughes), and this will be important for any capitalisation or case-conversion operation. The other forms cause no change, e.g. Ní Uiginn, [Bean] Uí Uiginn, etc.
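The h-attachment rule for Ó, and its consequence for case conversion, can be sketched as follows. This Python fragment is deliberately narrow: the function `o_surname` and the vowel list are assumptions, and it covers only the Ó form, not the consonant changes seen in forms such as Ní Dhónaill.

```python
IRISH_VOWELS = "aeiou\u00e1\u00e9\u00ed\u00f3\u00fa"   # plain and fada vowels

def o_surname(genitive_root):
    """Join the particle Ó to a genitive root, attaching h before a vowel."""
    if genitive_root[0].lower() in IRISH_VOWELS:
        return "\u00d3 h" + genitive_root
    return "\u00d3 " + genitive_root

print(o_surname("Uiginn"))
print(o_surname("D\u00f3naill"))

# A naive uppercase conversion loses the distinction between the attached h
# and the capitalised first letter of the root.
print(o_surname("Uiginn").upper())
```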

An Irish given name may also be modified by an adjective, say to distinguish father and son. Mór ("big") and Óg ("young") are analogous to senior and junior in English but occur between the given name and the surname, e.g. Seán Óg Ó Súilleabháin corresponds to John O'Sullivan Jr.

Particularly troublesome cases of personal names occur in Portugal and Spain which, like most Spanish-speaking countries, have two or more surnames. It is common for people with such names to hyphenate them, or specify them as a two-word surname, when completing forms in the English-speaking world.

Just to blow all of this classification out of the water, typical names in the Native American tribes do not have a surname (either family name or patronym). They may still be polynyms, rather than mononyms, but their individual tokens cannot be generally classified. A person might also have different names at different periods of their lives, e.g. an infant name like "little rabbit", later changing to a war name when a boy becomes a man, and changing again for the later periods of their life. Some tribes are also secretive about their personal names, using them only within their own tribe, and resorting to a "public name" outside of it. Some related reading:

personal-names-among-the-indian-nations-east-of-the-mississippi
dissertation_lombard_c
Family Education - Baby Names

There are many reasons for name changes and so any data format must be able to accommodate them all, differentiate their usage, and possibly associate dates with them.

A topographic anthroponym (or topoanthroponym) is a personal name derived from a place name. This usually means a given name or surname that equates to the name of a place, such as London. However, the use of birth toponyms was more common in ancient times. For instance, the 12th-century author John of Salisbury.

Sorting of names is a difficult topic, and it can only really be addressed successfully once a name has been broken down into its component parts so that the rules of the relevant culture can be applied. Countries like Portugal, Thailand, and Iceland would normally sort on the given name rather than a surname. If a Person has multiple surnames, as in the Spanish-speaking world, then sorting may be on either surname depending on which part of the world you’re in. The presence of name particles such as von and de also complicates the sorting rules since they may or may not be significant. They can also occur before and in between multiple surnames, e.g. the Aragonese painter Francisco José de Goya y Lucientes.

In conclusion, there are two main approaches that could be taken here:

Categorise the tokens of a name according to some controlled vocabulary, and then use that knowledge to determine the appropriate presentation style in different contexts, and how the name should be sorted. This requires help from the end-user to indicate the cultural origin of the name, and identify the tokens.

Simply ask the end-user for all the accepted forms of the name that may be matched on input, and a more limited set for specific presentation styles, including for use in sorted lists.

STEMMA adopts the second strategy since it is more portable across cultures and locales.
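The second approach can be illustrated with a minimal Python sketch. This is not STEMMA syntax: the class name `PersonalName`, its fields, and the style labels "full" and "sort" are all hypothetical, and the particular forms listed for Goya are only illustrative choices that an end-user might supply.

```python
from dataclasses import dataclass, field


@dataclass
class PersonalName:
    """A name described by its usable forms rather than by classified tokens."""
    accepted_forms: list[str]  # every form that may be matched on input
    presentation: dict[str, str] = field(default_factory=dict)  # style label -> form

    def matches(self, text: str) -> bool:
        """Case-insensitive comparison of input text with any accepted form."""
        wanted = text.strip().casefold()
        return any(form.casefold() == wanted for form in self.accepted_forms)

    def display(self, style: str) -> str:
        """Form for a presentation style, falling back to the first accepted form."""
        return self.presentation.get(style, self.accepted_forms[0])


# Illustrative data only: the end-user decides which forms are acceptable.
goya = PersonalName(
    accepted_forms=[
        "Francisco José de Goya y Lucientes",
        "Francisco de Goya",
        "Francisco Goya",
        "Goya",
    ],
    presentation={
        "full": "Francisco José de Goya y Lucientes",
        "sort": "Goya y Lucientes, Francisco José de",
    },
)
john = PersonalName(["John of Salisbury"], {"sort": "John of Salisbury"})

# Sorting uses the user-supplied sort form, so no cultural rules are coded here.
ordered = sorted([john, goya], key=lambda name: name.display("sort").casefold())
```

The software never has to decide whether "de" is significant or which surname leads: that knowledge stays with the person who entered the forms.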

Useful documentation resources on personal names:

W3C internationalisation guide. Discusses personal names around the world: qa-personal-names.

Citation Style Language (CSL) 1.0, deals with sorting of names involving name particles: citationstyles.

Wikipedia. Personal names: Personal_names.

ROCIC Law Enforcement Guide to International Names: law-enforcement-guide-to-international-names

IFLA Universal Bibliographic Control and International MARC Program. National Usages for Entry in Catalogues: NamesOfPersons_1996.

Wikipedia Manual of Style.

Patrick McKenzie, "Falsehoods Programmers Believe About Names", Kalzumeus, 2010-06-17 (http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/)

7.1 Capitalisation

The rules for the capitalisation of personal names as proper nouns differ between cultures. For instance, in compound names, name particles such as “von”, “van”, “der”, “de [la]”, “the”, “[son] of”, “mc”, “mac”, etc., are handled in different ways: some are capitalised with the rest of the name and some are not. With “de la” in English, the “La” is capitalised but not the “de”.

Awards and qualifications, particularly when expressed as initials or acronyms, may always be in uppercase.

Given these points, and those made under Letter Case, it is obvious that changing the case of a word is not always reversible, or even straightforward. So why do many genealogists, and genealogy programs, routinely represent surnames in all-uppercase? This may be a trend set in older legal documents, which in turn was probably a convenient way of creating an emphasis without having to overtype with something.

Advocates would claim that capitalisation of the surname makes it obvious that it's a Person being referred to and not a profession or a noun (e.g. Tailor/tailor, Butcher/butcher, etc.), but the initial capital of a proper noun is the normal way to achieve this in English. Others would claim that it creates a very visible emphasis to pick out surnames. I would argue that this is a bad convention and should be strongly discouraged. As well as being culturally biased, it is entirely unnecessary where data is stored on computers. The document under Importance of Narrative explains that when narrative is stored using an appropriate form of mark-up language, it is not only possible to automatically highlight portions of a name but you can select a preferred method such as bold, underline, italic, colours, or fonts that do not require a material change to the name.

This is not an argument against highlighting a surname but one of how to highlight it. If your data contains people from, say, China then their family name would actually be first rather than last. However, Chinese people who have moved to the West often move it to the end to make it easier for Westerners. Hence, a highlight can help.

The international language Esperanto commonly uses all-capitals (albeit small capitals) for surnames.

So let’s consider some of the issues in converting surnames to all-uppercase.

Firstly, what constitutes a surname, and secondly, where do we find it? In the West, we expect to find the surname at the end. However, where a name has a double-barrelled surname, the hyphen may or may not be present depending on the age of the name and whether it was a written version of what was heard (e.g. on census night). This means you need to know a fair amount about the person and their family to be sure. Although the reasons are unclear, Andrew Lloyd Webber apparently had to hyphenate his surname when he was named a life peer as Baron Lloyd-Webber (see Andrew_Lloyd_Webber).

Again, in the West, if a person was born out of wedlock, or a father died and the mother remarried, they sometimes retain their original family name as a middle name. This is not technically a double-barrelled name but should it be highlighted?

I came across a question posed on the Internet about situations where a name involved both a patronymic and a family name. It gave the example of a Russian, Aleksandr Ivanovich Guchkov, and suggested there could be four possibilities for putting a surname in all capitals.

Several Welsh names may begin with a double letter, e.g. Lloyd, Ffoulkes. The Welsh language consists of 28 letters, eight of which are digraphs that are treated as single letters for collation purposes, and these include "ff", and "ll". This is not a problem itself but the use of a double leading ‘f’ was also a very early form of capitalisation in English and that is a problem. For instance: “ffrance”, “ffrancis”, as well as “ffourth”, “ffyfth”, and “ffaith”, may still be found in old wills and manuscripts. It is debated whether the surname Ffoulkes should have the first letter capitalised in keeping with a Welsh origin (see Welsh Surnames), or both leading f’s left in lowercase in keeping with an old English origin (e.g. Charles ffoulkes).

One of the contexts where the German eszett character is left intact, as opposed to uppercasing it to “SS”, is when used in an all-uppercase form, as in legal documents. For instance: HANS STRAßER.

Finally, consider the use of CamelCase in personal names.
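The irreversibility of case conversion is easy to demonstrate. The sketch below uses only Python's built-in `str.upper()` and `str.title()` methods on names already mentioned in this paper; the helper name `survives_uppercasing` is my own. It makes no attempt to apply any cultural rules, which is precisely the point.

```python
def survives_uppercasing(name: str) -> bool:
    """True if upper-casing and then title-casing restores the original name."""
    return name.upper().title() == name


names = [
    "Hans Straßer",                        # eszett becomes "SS" and cannot be recovered
    "Charles ffoulkes",                    # leading double-f loses its lowercase form
    "Patrick McKenzie",                    # CamelCase capital inside the surname is lost
    "Francisco José de Goya y Lucientes",  # particles gain unwanted capitals
]

for name in names:
    shouted = name.upper()
    print(f"{name} -> {shouted} -> {shouted.title()}")
```

None of these four names comes back unchanged. Note also that Python's `upper()` always expands the eszett to “SS”, so a program relying on it cannot honour the legal-document convention described above.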

8 Places and Addresses

Let us first distinguish a postal address from both a place reference and a location. The following definitions are discussed further under Person and Place Names.

Postal Address — A sequence of terms that direct traditional mail (e.g. letters, packages, etc) to a particular recipient.

Location — A fixed geographical point or area, usually referenced by its coordinates.

Place — A named point or area deemed to have significance to humans.

Postal address variations around the world may be found at: international-address-formats. There is no finished standard yet (see ISO 19160), although the topic has been discussed many times (see International Address Standardisation).

Note that a postal address is not quite the same as a geographical address, although the concepts tend to be conflated when the address includes a postal code. This was the source of some controversy in Ireland when they finally introduced a postal code (Eircode) in 2015: their postal service (An Post) sent letters to each household and company indicating their allocated Eircode, but the target addresses were not what people had been using — sometimes being in an entirely different county. The issue was that An Post were using a postal address — one that routed mail to the appropriate sorting office — whereas recipients had been using a geographical address — one that indicated where the building was located. In other countries, such as the UK, this was not an issue as the postal code took care of mail sorting and routing, but it caused much confusion in Ireland and added to the general controversy over the nature of the Eircode.

If data contains explicit contact details, including postal addresses — e.g. for attribution — then there are some important things to consider.

Postal code - Postal codes or zip codes differ widely between countries and so can only be handled as plain strings. Some are based on geographical coordinates while others are based on postal sorting office zones. At the time of writing, some countries did not have such a system, and Ireland only launched one on 13 July 2015, so it should never be mandated.

Telephone numbers - All stored telephone numbers should be fully international. The E.164 standard specifies how to represent an international telephone number, e.g. +15551234567. It does not specifically separate the ISD country dialling code, trunk code, or subscriber number. Also, it does not represent any trunk prefix required within that country. Formats such as +44 (0)1728 123456 should be avoided as the parenthesised trunk prefix is a UK/IE-centric way of representing numbers. E.123 is similar to E.164 but allows for some restricted punctuation for readability (e.g. spaces) and covers e-mail addresses too. Presentation differs greatly between countries but formatting numbers for a given locale has never been provided as part of software locale systems; it has always been added as an extra layer on top.

Country dialling codes are not one-to-one with ISO country codes. For instance, +44 covers four separate UK-based ISO country codes (GB, JE, GG, and IM). Some commercial Web sites try to be too clever by assuming — and sometimes enforcing — a contact phone number to have a particular country dialling code, or your credit card to be issued by the country of the delivery address. This flat-earth thinking forgets that people are mobile. Also, with the advent of telephony products such as Skype, telephone numbers may be effectively non-geographical, irrespective of their country dialling code.
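As a rough illustration of storing numbers in a fully international form, the following Python sketch reduces an E.123-style number with spaces to a plain E.164 string and rejects anything else. The function name `to_e164` is hypothetical, and the pattern encodes only the general E.164 shape (a plus sign followed by at most 15 digits); it deliberately knows nothing about individual country codes.

```python
import re

# '+' followed by 2 to 15 digits, the first of which is not zero.
E164_PATTERN = re.compile(r"\+[1-9]\d{1,14}")


def to_e164(text: str) -> str:
    """Reduce a spaced international number to its E.164 form.

    Only spaces are removed. Anything else, such as a parenthesised
    trunk prefix, is rejected rather than guessed at.
    """
    candidate = text.replace(" ", "")
    if not E164_PATTERN.fullmatch(candidate):
        raise ValueError(f"not an unambiguous international number: {text!r}")
    return candidate


print(to_e164("+1 555 123 4567"))

try:
    to_e164("+44 (0)1728 123456")
except ValueError as error:
    print(error)
```

Rejecting the parenthesised form, rather than silently dropping the “(0)”, keeps the stored data free of assumptions about any one country's dialling conventions.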

Considering places and locations now, geographic coordinates are probably of more use for a fixed location than for irregularly-shaped places, or places with a formal name or address that can be looked up. ISO 6709:2008 supports point location representation through the use of XML but, recognising the need for compatibility with the previous version of the standard, ISO 6709:1983, it also allows for the use of a single alphanumeric string. Different coordinate systems are in common use, though, such as the Ordnance Survey National Grid of Great Britain, and so prescribing a single scheme will not work. In general, the coordinates associated with well-established places, including villages, towns, and counties, should be left to a Place Authority rather than being duplicated in all personal data collections. When a product performs a location search then it may want to weight the finds by their distance from the ideal location, and that requires a reliable system of coordinates that will yield reproducible results.
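To make the idea of distance-weighted finds concrete, here is a minimal Python sketch. It makes several assumptions: only the simple decimal-degrees form of an ISO 6709 string is parsed (no altitude, no degrees-minutes-seconds, no coordinate reference system), the Earth is treated as a sphere of radius 6371 km, and distance is computed with the haversine formula. The function names are hypothetical.

```python
import math
import re

# Decimal-degrees form only, e.g. "+51.5074-000.1278/".
_POINT = re.compile(r"([+-]\d{2}(?:\.\d+)?)([+-]\d{3}(?:\.\d+)?)/?")


def parse_point(text: str) -> tuple[float, float]:
    """Return (latitude, longitude) in degrees from a simple ISO 6709 string."""
    match = _POINT.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported coordinate string: {text!r}")
    return float(match.group(1)), float(match.group(2))


def distance_km(a: tuple[float, float], b: tuple[float, float],
                radius_km: float = 6371.0) -> float:
    """Great-circle distance between two points using the haversine formula."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


def rank_by_distance(ideal, finds):
    """Sort (label, point) pairs so that the nearest to 'ideal' comes first."""
    return sorted(finds, key=lambda find: distance_km(ideal, find[1]))


london = parse_point("+51.5074-000.1278/")
paris = parse_point("+48.8566+002.3522/")
ranked = rank_by_distance(london, [("Paris", paris), ("London", london)])
```

A spherical model is adequate for ranking search results, but it is not a substitute for a properly specified coordinate reference system where reproducible absolute positions are needed.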

Identification of a place by name is a hierarchical process. Each place is part of some larger entity such as a street, town, county, state, etc., right up to a country. The underlying premise is that every place has a unique bounding parent place at any given time. There are different types of hierarchy, such as geographical, religious, administrative, political, and judicial, and they could be represented as alternative hierarchies as long as that premise is not violated. The elements of a Place-hierarchy may be enumerated to yield a Place-hierarchy-path but the ordering (small-to-large or vice versa), and the separating characters, should be considered a cultural or personal preference.
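The premise of a unique parent at any given time, and the treatment of ordering and separators as mere preferences, can be sketched as follows in Python. The `Place` class, the `hierarchy_path` function, and all of the place names are invented for illustration; this is not the STEMMA representation, and dates are simplified to whole years.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Place:
    name: str
    # Each entry is (first_year, last_year, parent); None marks an open end.
    parents: list = field(default_factory=list)

    def parent_in(self, year: int) -> Optional["Place"]:
        """Return the single parent for the given year, or None at the top."""
        hits = [parent for start, end, parent in self.parents
                if (start is None or start <= year) and (end is None or year <= end)]
        if len(hits) > 1:
            raise ValueError(f"{self.name} has more than one parent in {year}")
        return hits[0] if hits else None


def hierarchy_path(place: Place, year: int,
                   separator: str = ", ", small_to_large: bool = True) -> str:
    """Enumerate the hierarchy for a year; order and separator are preferences."""
    names = []
    current: Optional[Place] = place
    while current is not None:
        names.append(current.name)
        current = current.parent_in(year)
    if not small_to_large:
        names.reverse()
    return separator.join(names)


# Invented places: a village that moved from one county to another in 1974.
country = Place("Exampleland")
north = Place("Northshire", [(None, None, country)])
south = Place("Southshire", [(None, None, country)])
village = Place("Littleton", [(None, 1973, north), (1974, None, south)])

print(hierarchy_path(village, 1900))
print(hierarchy_path(village, 2000, separator=" > ", small_to_large=False))
```

An alternative hierarchy (say, ecclesiastical rather than administrative) would simply be a second set of parent links, subject to the same one-parent-at-a-time check.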

The types of element in a Place-hierarchy will vary geographically. For instance:

A house may have a number, or a name, or both a name and number, or be identified by the name of the family residing there (e.g. “The Fockers”). It may be part of a larger building such as a set of apartments.

The regional categories within each country will be different. Terms like county, state, province, etc., will not apply in all cases, and the relative hierarchical relationship may not be the same.

Not everyone lives in a town or city, or even in a village. Some rural communities may be in a simple hamlet, or even a standalone farmstead.

For people in long-distance transit, it may be wise to generalise the concept of a place to include a named vessel or vehicle.

A more in-depth discussion of Place Hierarchies, and Place Authorities (in the software sense), may be found under Person and Place Names. Some further analysis of postal address is also presented in there.

For programmers in particular, a very useful list of presumptions that they often make about places and addresses may be found in: Michael Tandy, "Falsehoods programmers believe about addresses", mjt.me.uk, 2013-05-29 (https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses/). For instance, that a house name and number are exclusive of each other.

9 Citations

A citation is a reference to the source of some information relevant to the current work, or that was consulted in the production of the current work. Terminology can be a little fluid, depending upon the applied field, but the main consensus is that the citation noun (as opposed to the act of citing something) is either a reference note (usually implemented as a footnote or endnote); a source label; or a source list (or bibliography), possibly with an embedded in-text source reference.

In written or printed material, there are several styles of citation, and these may be categorised as applicable to law, science, or the humanities. For instance: APA style, MLA style, The Chicago Manual of Style (CMOS), Bluebook, ALWD Citation Manual, ASA style, Harvard referencing, and Vancouver system.

The Board for Certification of Genealogists (BCG) recommends the use of both CMOS and EE for family history. EE is a style devised by Elizabeth Shown Mills in Evidence Explained: Citing History Sources from Artifacts to Cyberspace (Baltimore: Genealogical Publishing Co., 2009) to cover the wider range of historical and unpublished sources used in family history.

A less considered aspect of citations is how they translate to other locales and cultures. In printed form, and in the absence of a completely separate translation, elements of the citation may be at odds with the regional settings and personal preferences of a reader from a different locale. For instance, dates may be expressed in an ambiguous numeric form or using English month names. Personal names may not be expressed in the appropriate formal manner. There may also be issues with ordinal suffixes (e.g. st, nd, rd, th), with punctuation (inside or outside of quotes — see Typesetters' Quotes vs. Logical Quotes), with choice of quotation style, and potentially with decimal numbers.

A huge problem for citations in a portable micro-history data format is how to encapsulate the essence of a citation without hard-coding the style, or fixing the regional preferences, or presuming the presentation will use an electronic or non-electronic medium. This subject is presented in more detail under Importance of Narrative.

A clear distinction should be drawn between citations and abstract codes used to identify sources. In the world of libraries and archives, there are several such established schemes:

ISBN (International Standard Book Number): a unique numeric book identifier based upon the 9-digit Standard Book Numbering (SBN) code. The 10-digit ISBN format was developed by ISO and published in 1970 as ISO 2108. Since 2007-01-01, ISBNs have contained 13 digits, a format that is compatible with the EAN-13 International Article Number.

ISSN (International Standard Serial Number): a unique 8-digit number used to identify a printed or electronic periodical publication, analogous to ISBN for a book. The ISSN system was first drafted as an ISO standard in 1971 and published as ISO 3297.

SICI (Serial Item and Contribution Identifier): a code used to uniquely identify specific volumes, articles or other identifiable parts of a periodical. It extends the ISSN, which identifies only the periodical as a whole. It is now published as ANSI standard Z39.56.
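The relationship between the 10-digit and 13-digit ISBN forms mentioned above can be shown with a short Python sketch. It relies on two details not spelled out in this paper: a 10-digit ISBN is carried into the 13-digit form by prefixing “978”, and the final check digit is recalculated using alternating weights of 1 and 3. The function names are my own.

```python
def isbn10_is_valid(isbn10: str) -> bool:
    """Check the modulus-11 check digit of a 10-digit ISBN (hyphens ignored)."""
    digits = isbn10.replace("-", "")
    if len(digits) != 10 or not digits[:9].isdigit() or digits[9] not in "0123456789X":
        return False
    values = [int(c) for c in digits[:9]]
    values.append(10 if digits[9] == "X" else int(digits[9]))
    # Weights run from 10 down to 1; a valid number sums to a multiple of 11.
    return sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 == 0


def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert a valid 10-digit ISBN to its 13-digit equivalent."""
    if not isbn10_is_valid(isbn10):
        raise ValueError(f"invalid ISBN-10: {isbn10!r}")
    body = "978" + isbn10.replace("-", "")[:9]  # old check digit is dropped
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(body))
    return body + str((10 - total % 10) % 10)


print(isbn10_to_isbn13("0-306-40615-2"))
```

Because the check digit changes, the two forms cannot be related by simple string concatenation, which is one reason such codes are best stored as opaque strings alongside a proper citation.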

For electronic documents, there is the DOI (Digital Object Identifier): a text string used to uniquely identify an object in a digital repository. Referring to an online document by its DOI is considered more stable than simply linking using its URL since URLs change but DOIs are permanent. In reality, URLs may also die as cited pages are taken down. Michael Bugeja and Daniela V. Dimitrova, in Vanishing Act: The Erosion of Online Footnotes and Implications for Scholarship in the Digital Age (Duluth, Minnesota: Litwin Books, 2010), showed that citations to online resources have a rate of decay that can be measured using a “half life”, akin to radioactive decay. See Impermanent Links.

Most of these abstract codes have a digital representation using a URI. For instance:

Scheme	URI Namespace	Example
ISSN	urn:ISSN:	urn:ISSN:0302-9743
ISBN	urn:ISBN:	urn:ISBN:8884530431
NBN (National Bibliographic Number)	urn:NBN:	urn:NBN:fi-fe19981001
DOI	info:doi/	info:doi/10.1045/july99-caplan
SICI	info:sici/	info:sici/07408188(200010)22:3%3C311:SEUB%3E2.0.CO;2-X
PubMed	info:pmid/	info:pmid/9036860
Open Archives Initiative	info:oai/	info:oai/arXiv.org:hep-th/9901001

At best, they specify indefinite sources — like a simple book reference — and would only be appropriate for published sources where the provenance or any analytical notes are irrelevant. By contrast, a citation is a humanly-readable reference that serves multiple purposes (e.g. attribution, accessibility, and credibility) and must not be displaced by the availability of abstract codes.

The styles mentioned above (e.g. CMOS) all relate to printed citations but for electronic documents there are other options for referencing other material that do not easily fit into those accepted styles. A word or phrase may constitute a hyperlink directly to another electronic document. Alternatively, it (or an attached superscript or other indicator) may link to a traditional reference-note citation. However, that reference-note citation need not constitute a footnote or endnote; the non-sequential nature of electronic documents means that the reference could be popped up when the link is selected.

10 Religion

There are many world religions. Some have one or more central deities and others have none. When you include atheists, and all beliefs in between religion and atheism, then you have a wide range of possibilities for our ancestors. Their beliefs may have shifted over time, too, which then introduces multiple possibilities per person.

More important from the point of view of family history are the events associated with religious celebrations, ceremonies, and rituals. That data will need to distinguish between those events, preferably using a partially controlled vocabulary so that it includes an extensible facility to cater for the lesser-known ones.

Here is a list of some of the better-known ones:

Baptism - a Christian rite of admission into the Christian Church generally, and sometimes into a particular church tradition. Almost always involves water.

Bar-mitzvah - Jewish coming-of-age ritual. According to Jewish law, when a Jewish boy reaches 13, he becomes responsible for his actions and becomes a Bar Mitzvah.

Bat-mitzvah (sometimes Bas-mitzvah) - Similar to Bar-mitzvah but for girls. The associated age is 12 for girls.

Blessing - A religious pronouncement, usually infusing the recipient with divine favour or good fortune.

Christening - In some traditions, baptism is also called christening, but for others the word "christening" is reserved for the baptism of infants.

Confirmation - a rite of initiation in Christian churches, normally carried out through anointing and/or the laying on of hands and prayer for the purpose of bestowing the Gift of the Holy Spirit.

First Communion - a Catholic Church ceremony for a person's first reception of the sacrament of the Holy Eucharist.

Recent census surveys have thrown up the fact that our knowledge of religious habits and affiliations is blurred by our preconceptions. Whether someone was indoctrinated into a religion, or whether they still visit a church and indulge in religious ceremonies, or whether they actually believe in the teachings of a church, or whether they’ve switched religions during their life, or even whether they have their own private beliefs outside of any formal religion, cannot be answered in response to a simple pick-list. If we want to record information about our ancestors’ religions then we should be aware of the many shades of grey to be found in real-life.

11 Marriage

The concept of a marriage between two people of the opposite sex is traditionally used as the root of a family unit by genealogy products but is that realistic? Even in our own locales we’re all aware of other types of family unit but how has this varied over the years, and how does it vary in other locales?

In most countries, marriage is a union recognised by the state or some religious authority, or both. There are some regions, though, where it is recognised by a tribal group or some other peer group.

In England and Wales, the Marriage Act 1753 came into force on 1754-03-25 and required a formal Church of England marriage ceremony. This, of course, put people of other faiths and atheists in a difficult position. The Marriage Act 1836 came into force on 1837-01-01 and legalised the concept of a civil marriage. However, it now means that a couple can have separate church and civil marriage registrations at different times and in different places; sometimes even in different countries.

Polygamy is a marriage that includes more than two participants. When a man has more than one wife then it is called polygyny, and when a woman has more than one husband then it is called polyandry. In both cases there is no marriage bond between the multiple wives or multiple husbands. If polygamy is illegal then such a relationship is termed bigamy. Polygamy has traditionally been associated with positions of wealth or power. Although historically not uncommon, the practice has been outlawed in many countries — some quite recently (Hong Kong in 1971). It was widespread in African countries and, although now in decline, it is still performed.

Same-sex marriages are not the modern phenomenon many of us believe, and there is a long history of such unions. Granting them the same legal status, however, is a more recent issue. Whether this practice is illegal, recognised by the state as a civil union, or recognised as a full form of marriage, varies greatly around the world. In the US, there are even major differences between the states. In countries that do not allow it, there may be some recognition of ceremonies performed elsewhere. Along with the status of the union itself, there are major differences in the status of children belonging to the couple before the union, and in their entitlement to adopt children after the union.

In addition to the lesser concept of a civil union, some regions recognise a common-law marriage. A couple may be cohabiting and may not have undergone any official ceremony. This may be a trial period in anticipation of a later marriage but it may also be for financial reasons, such as for tax or pensions, or sociological ones. Assuming the couple would be legally eligible for a conventional marriage, and they mutually consent to their relationship, then it may be recognised as a common-law marriage. In some regions, that is legally binding and may have implications for dependants and inheritance.

Most cultures accept death as a form of termination or dissolution of a union. Divorce, however, is not universally recognised around the world. Even where it is recognised, it may have been too expensive for our ancestors to achieve, and they may have simply separated and pretended a prior marriage never happened. As a result, in England and Wales, the Offences Against the Person Act 1861 contained a clause in section 57 (Bigamy) that allowed for a presumption of death if the couple had been separated for seven years or more.

"Provided that nothing in this section contained shall extend ... to any person marrying a second time, whose husband or wife shall have been continually absent from such person for the space of seven years then last past, and shall not have been known by such person to be living within that time".

Lack of knowledge was all that was required here, and there was no obligation to go and find them. This became informally known as “the seven year rule” or “a poor man’s divorce”.

Marriages can also be annulled in some societies. This involves an authority declaring that a marriage never happened, e.g. because it was otherwise illegal, performed under duress or with lack of cognitive understanding, or through non-consummation.

We’ve seen here that the concept of a registered union is extremely varied, and sometimes rather vague. Although the associated ceremony or registration is always a significant event in the lives of those concerned, is it relevant to the concept of a family unit? A family unit may be the result of Adoption, Fostering, or some other type of Guardianship. Adoption was once an informal process, with England and Wales only establishing their first formal adoption law in 1926. Children may have been living with another relative after the death of their real parents.

Children may have been born out of wedlock, or as a result of married persons having liaisons outside of their own marriage — something that probably happened a lot more often than we can ever prove or would want to know.

So what absolutes are there? Well, the concept of a family unit must be considered separately from the concept of biological, or progenitive, parents. Unless there are any clones out there then we’re all the result of a seed from one male and one female parent. That fact alone dilutes the significance of a traditional family tree or pedigree chart, and nicely illustrates how family history can be considered as having much greater scope than genealogy in its literal sense. Usually, these parents engaged in a physical union (procreation) but things can get still more complicated. There may have been a sperm donor or an egg donor. In those cases, who should be cited as the birth parents?

12 Family

The concept of a ‘family’ is impossible to pin down without some stricter subdivisions of the term. From the point of view of genealogy (as opposed to family history), it is often considered to be the parents and their unmarried children, and this has influenced the design of data formats and the software that processes them.

However, one or both of the parents may be missing. The group may no longer be living together. Either or both of the parents may have remarried — bringing previous children with them. There may be older generations living with them, or siblings of the parents (i.e. aunts and uncles to the children).

Wikipedia nicely defines a family as a group of people affiliated by consanguinity (blood relationships), affinity, or co-residence. This includes many different possibilities in a single sentence. The article discusses some of our more common family notions, including:

Matrifocal. A mother and her children.

Conjugal (or Nuclear family). A husband, his wife, and children.

Consanguineal (or Extended family). In which parents and children co-reside with other members of a parent's family.

Blended (or Step-family). Families with mixed parents. For instance, where one or both parents remarried, bringing children of the former family into the new family.

Biological relationships are fixed and finite, whereas all other types of relationship are time-dependent and possibly overlapping. The concept of Marriage is dependent upon both culture and life-style so the general case of a family-unit is going to be based more on some sociological grouping such as living together.

Co-residence alone is insufficient to assume the family tag — long-term co-residents may be lodgers, or staff, or may be people forced together by necessity. Although some members may have to live outside the household, say for work, they may still be considered family.

We can’t even assume that a family unit is a sociological group supporting each other emotionally and/or financially without knowledge of their situation. In effect, retrospectively applying this tag to an historical group may need more supporting evidence than can be yielded by birth certificates and census returns alone.

Different societies may also have different traditions or different concepts of a “family unit”, and the aforementioned Wikipedia article discusses some of these.

STEMMA generalises the concept in order to avoid having to rigidly define what a family-unit is. The STEMMA Group element can have a variety of types, including the flavours of family described above, but may be used to model any grouping of people. A Group allows Persons to be associated with it in a time-dependent way, e.g. from the time of a parent’s marriage, or until the time of a child’s marriage. An example may be found under Data Model. The Group syntax also allows derived Groups to be created using SET operators.
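The following Python sketch illustrates the general idea of time-dependent membership and derived groups. It is not the STEMMA Group syntax, which is described under Data Model; the `Group` class, its methods, and the people and dates are all invented, and Python's built-in set operators stand in for the SET operators mentioned above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Group:
    name: str
    # person -> list of (start, end) spans; None marks an open end
    memberships: dict = field(default_factory=dict)

    def add(self, person: str,
            start: Optional[date] = None, end: Optional[date] = None) -> None:
        """Associate a person with the group for a (possibly open-ended) span."""
        self.memberships.setdefault(person, []).append((start, end))

    def members_on(self, when: date) -> set:
        """Everyone whose membership span covers the given date."""
        return {
            person
            for person, spans in self.memberships.items()
            if any((s is None or s <= when) and (e is None or when <= e)
                   for s, e in spans)
        }


# Invented example: a household and a kinship group overlap only partly.
household = Group("Household at Mill Lane")
household.add("Ann", date(1850, 6, 1))
household.add("Ben", date(1850, 6, 1), date(1871, 3, 31))
household.add("Lodger", date(1860, 1, 1), date(1865, 12, 31))

kin = Group("Ann's kin")
for person in ("Ann", "Ben", "Cara"):
    kin.add(person)

census_night = date(1861, 4, 7)
family_at_home = household.members_on(census_night) & kin.members_on(census_night)
family_away = kin.members_on(census_night) - household.members_on(census_night)
```

Keeping co-residence and kinship as separate groups, and deriving the “family at home” only when needed, avoids baking any single definition of a family unit into the data.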

13 Gender

The terms gender and sex are sometimes confused.

Sex is either male or female and so reflects a biological difference. This includes physical, hormonal, and genetic characteristics.

Gender is either masculine or feminine and so reflects a social or cultural characteristic.

Gender reassignment therefore includes many more aspects than surgery alone, although it is still treated as a synonym of Sex Reassignment Surgery (SRS).

Although people can travel to undergo these procedures, recognition of their changed status is still hugely controversial and may be denied in their own countries. Even if recognised by the corresponding government, that recognition might not extend to a change to their passport, a retroactive change to their birth certificate, their ability to marry someone of a complementary sex, or acknowledgement of their status if they find themselves in prison or hospital, or apply to join the armed forces.

A multilateral convention was drafted to provide acceptance in other countries but, to date, it has few signatories. See Convention_on_the_recognition_of_decisions_recording_a_sex_reassignment.

In the UK, the Gender Recognition Act 2004 is an Act of Parliament that allows transsexual people to change their legal gender.

From a family history point of view, this means that there will be a difference between the sex at birth, and the effective sex after any reassignment. Simply having a single property with more than two possible values may not be the best approach since it cannot represent when a change occurred, or the periods before and after it.
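One alternative to a single multi-valued property is a dated sequence of values. The Python sketch below is only an illustration of that idea: the `DatedValue` class and `value_on` function are hypothetical names, the dates are invented, and an end date is treated as exclusive.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DatedValue:
    value: str
    start: Optional[date] = None  # None: applies from birth, or start unknown
    end: Optional[date] = None    # None: still applies; otherwise exclusive


def value_on(history: list, when: date) -> Optional[str]:
    """Return the value in force on a given date, or None if nothing applies."""
    for item in history:
        if (item.start is None or item.start <= when) and \
           (item.end is None or when < item.end):
            return item.value
    return None


# Invented example: the recorded value changes on 1 May 1990.
recorded_sex = [
    DatedValue("female", end=date(1990, 5, 1)),
    DatedValue("male", start=date(1990, 5, 1)),
]

print(value_on(recorded_sex, date(1980, 1, 1)))
print(value_on(recorded_sex, date(1995, 1, 1)))
```

The same shape would serve for a corrected assessment at birth: the earlier value is retained with its dates rather than being overwritten.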

Related to this issue is that of indeterminate sex at birth when a change has to be made due to an inaccurate assessment. See Gender (under ‘Legal Status’).

14 Data Control

This section actually covers a range of issues and I was tempted to call it ‘Sense and Sensitivity’. The issues listed below all share a common factor which is that it may be inappropriate to share certain data with just anyone and everyone.

Ideally, some constraints should be not only visible in the stored data but computer-readable too. This would allow compliant software to acknowledge the limitations on specific data and prevent you accidentally sharing it. It is not possible to prevent wilful infringements but there is a responsibility to recognise that constraints may exist and allow them to be represented in the data.

In a written work, and in computer software, it is usually enough to have a Copyright statement visible that uses the standard © symbol. It doesn’t actually prevent any copying but it is a marker indicating the presence of a restriction and that’s what genealogical data needs. The problem with such a simple visible statement is that (a) there is more than true Copyright to consider, and (b) it has to be computer-readable.
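As a sketch of what computer-readable constraints might allow, the Python fragment below attaches a set of constraint labels to each record and lets compliant software filter on them before sharing. Everything here is hypothetical: the `Record` class, the `shareable` function, the label names, and the sample records are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    summary: str
    constraints: frozenset = frozenset()  # e.g. {"living-person", "copyright"}


def shareable(records, allowed=frozenset()):
    """Keep only records whose every constraint is acceptable to the recipient."""
    return [record for record in records if record.constraints <= allowed]


records = [
    Record("Baptism entry, 1792"),
    Record("Address of a living cousin", frozenset({"living-person"})),
    Record("Transcript of a published article", frozenset({"copyright"})),
]

public_export = shareable(records)
family_export = shareable(records, allowed=frozenset({"living-person"}))
```

Such a filter only prevents accidental sharing by compliant software; as noted above, it cannot stop wilful infringement.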

14.1 Data Protection

This is an issue of data privacy for living people. A lot of data is routinely collected for individuals and most governments legislate to restrict its dissemination.

The UK Data Protection Act 1998 (which was enacted to align with the European Directive 95/46/EC) requires systems that collect and store personal details to be registered. The Act requires those systems to have adequate security measures, that the data is only used for the intended purposes, that it not be shared outside the European Economic Area, that it may be accessed by the respective individuals, that it be up-to-date, and is not retained unnecessarily.

Certain information is a matter of public record though — for instance, our details in the local phone book. In the UK, although census returns are not public for (nearly-)100 years, birth/marriage/death records and electoral registers are publicly available (although not fully online). This has always made nonsense of banks using date-of-birth and mother’s maiden name in security questions but that is thankfully changing now.

Europe has some of the strongest data protection laws in the world. However, the EU Data Protection Directive of 1995 (95/46/EC) was seen as giving too much flexibility to each member state resulting in 27 different regimes. On top of this, each German state had its own rules which resulted in effectively 26 plus 16 regimes. A new regulation (rather than a directive) was announced on 2012-01-25 that was designed to harmonise the rules across the EU, and implement higher fines for data breaches.

So how does this apply to family history data? Unless we have intimate knowledge of the persons involved then we are unlikely to be able to store information about their financial situation, medical history, or criminal records because we shouldn’t have access to it. Although a person’s address may be in the phone book, or electoral registers, they may have elected to be ‘ex-directory’ and keep those details private. There have been some high-profile cases of estranged partners finding people through such details.

To date, genealogical resources have never been challenged on the basis of data protection. However, with increased collaboration and sharing of data on the Internet, it’s only a matter of time before the publishing of information on living people meets a legal challenge. As well as the details listed above, this may include: phone numbers, email addresses, adoptions, divorces, ethnicity, qualifications, and work record.

14.2 Data Ownership

Copyright is a legal mechanism designed to give the creator of an original work exclusive rights to it, usually for a limited period of time. The concept is recognised in most countries but with notable differences. It may allow different types of rights dependent upon the nature of the creation. For a publication, for instance, it may allow unlimited copying as long as the creator is credited, copying only with permission upon request, or copying as long as the work is not redistributed for profit. For a film or video, it may restrict use to non-public performances.

Intellectual property is a broader concept embracing copyright, trademarks, patents, etc.

The Buenos Aires Convention is a copyright treaty signed at Buenos Aires on 1910-08-11. It provided for the mutual recognition of copyrights but required the associated works to carry a notice indicating the reservation of rights. This is where the ubiquitous phrase "All rights reserved" originated.

In 1886, in Bern, Switzerland, the Berne Convention for the Protection of Literary and Artistic Works (aka the Berne Convention) was first accepted. This international agreement tried to harmonise copyright recognition across the signatory countries such that the authors of works in other countries were afforded the same rights as each country’s own nationals. The agreement required copyright to be automatic and not need any formal registration. It was revised a number of times over the years, and the UK and US initially had well-documented issues with full compliance.

On 2000-08-23, Nicaragua became the final member of the Buenos Aires Convention to also become a signatory to the Berne Convention. This meant that no explicit notice of copyright was then necessary and the phrase “All rights reserved” became redundant, although many works still include it as a reminder and to reduce the chances of an “innocent infringement” defence.

In terms of micro-history, although you may have permission to hold a copy of some material in your data collection, you may not have the necessary permission to send duplicate copies to someone else. Furthermore, the duration and the legal wording of a copyright may differ between countries, and it may be necessary to know the home country of the work’s author.

Permission to copy or distribute (which is a permit rather than a prohibition) may be informally granted. A typical situation might be family photographs that have been shared with you by another family member. The permission may have been verbal rather than written, and not expressed unambiguously. Whether such permission is legally binding will differ between jurisdictions but the moral obligation will remain. Hence, the stored data should contain a machine-readable notice indicating the presence of such permission (or prohibition), allowing a warning to be issued before any accidental transmission.
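To make the idea of a machine-readable notice more concrete, here is a minimal sketch in Python of how compliant software might check for such notices before sharing. Everything in it is an assumption for illustration: the names Constraint, Notice, Item, and share_warnings are hypothetical, and no existing genealogical format is implied to work this way.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Constraint(Enum):
    """Kinds of constraint discussed in this section (hypothetical list)."""
    COPYRIGHT = "copyright"
    PERMISSION = "permission"    # a permit, possibly informal or verbal
    PROHIBITION = "prohibition"
    PRIVACY = "privacy"
    SENSITIVITY = "sensitivity"


@dataclass
class Notice:
    kind: Constraint
    holder: str             # who granted the permission or owns the rights
    may_redistribute: bool  # False means "warn before sending copies on"
    note: str = ""          # free text, e.g. "verbal permission only"


@dataclass
class Item:
    title: str
    notices: List[Notice] = field(default_factory=list)


def share_warnings(items: List[Item]) -> List[str]:
    """Return one warning per notice that does not allow redistribution."""
    warnings = []
    for item in items:
        for notice in item.notices:
            if not notice.may_redistribute:
                warnings.append(
                    f"{item.title}: {notice.kind.value} notice "
                    f"({notice.holder}) does not allow redistribution"
                )
    return warnings


# Illustrative data only.
photo = Item("Family photograph", [
    Notice(Constraint.PERMISSION, "a cousin", False, "verbal, personal use only"),
])
transcription = Item("My own census transcription")
for warning in share_warnings([photo, transcription]):
    print(warning)
```

The sketch only warns; as noted earlier, it cannot prevent a wilful infringement. Its purpose is to make the constraint visible to software at the moment of sharing.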

As micro-history data makes more use of narrative, and includes more reasoning and conclusion-forming, then the data becomes a work of intellectual research. As such, it is afforded automatic copyright. The implications of this for online collaborative efforts are mentioned under Importance of Narrative.

14.2.1 Derived Creations

A common topic of discussion is how copyright relates to historical records such as the census. Many genealogists are confused by this, and also worried by the restrictive T&C that online content providers ask them to accept.

Using the UK census as an example case, this section is my own take on the subject because I cannot find a single definitive statement on all the issues. If I cannot find this for UK records then there is almost certainly going to be a bigger issue for worldwide records.

The UK census is subject to Crown copyright. According to the flowchart at Crown copyright, UK census material published after 1989-08-01 is in copyright for 125 years after publication, while that published before that date is in copyright for 50 years from publication.

The raw facts themselves cannot be copyrighted but images of them can be. Findmypast, for instance, includes a statement “Crown Copyright Images reproduced by courtesy of The National Archives, London, England” at the bottom of each image page. Images from the 1911 census include the same text in the images themselves but older ones indicate “Copyright photograph. Not to be reproduced photographically without permission of the Public Records Office, London”.

However, one work may be created from another and become the subject of a separate copyright if it is deemed to be the result of substantial financial or technical investment, or considered to be of independent artistic merit. For a census image, this will include the transcriptions performed by the content provider. We can make our own transcriptions but we cannot claim the provider’s work as our own or use it for commercial profit. Similarly, the indexing of the facts in the provider’s database is a work of their own and automatically the subject of independent rights (see Database rights).

A description of what constitutes a derivative work in UK copyright law may be found at: UK derivative works. A similar concept exists in US copyright law. However, note that Database rights are implemented in the European Union but not in the US.

The derivation of one work from another has some subtle issues. A direct ancestor of my own, William Ashbee, fell foul of a UK copyright case in 1868 and the case of Morris v. Ashbee is still quoted in many modern UK copyright books. Ashbee set about creating The Merchants and Manufacturers Pocket Directory of London by using an existing work called The Business Directory of London as a spring-board for his canvassers to go and check the associated addresses. Hence, he was creating a work of his own, but it relied on a similar work already created by someone else, and without their permission. The author of the prior work, John Morris, got an injunction preventing publication of William Ashbee’s work. Ashbee lost the case and he went bankrupt soon afterwards. See A Copyright Casualty — Part I, Part II, and Part III.

14.3 Data Sensitivity

Sensitivity of data is a more nebulous concept. Most genealogists will eventually encounter details that would have embarrassed a family at the time, and which the family probably didn’t speak of. We’re generally more broad-minded and thick-skinned these days but the sensitivity becomes increasingly potent for recent generations. We would all take care with sensitive information about living relatives, and even those who have passed but are still in our memories.

In the UK, although the 1911 census was made available early (2009-01-13 rather than 100 years after it was taken), there was still some agonising over whether to release the contents of the Infirmity column. This was eventually done but not until 2012-01-03. See http://www.1911census.org.uk/.

As well as the sensitivity being time dependent, though, it will also be culturally dependent. Some cultures may not only take offence at the uncovering of certain details — on the basis of it disrespecting their ancestors or their lineage — but it could cause living relatives to be ostracised or even some form of retribution to be taken against them.

15 Accessibility

I am not aware of any cultural factors that would make family history more or less popular in different regions. However, accessibility to information will definitely vary geographically, and that will impact on its popularity, participation in publishing online content, and in membership of related organisations.

Availability of computerised records, and computerised transcripts of older records, has revolutionised family history research in many countries. From being the pursuit of professionals and specialists, it has now become a hugely popular pastime amongst hobbyists. Irrespective of one’s views on their motives, the freely accessible databases of FamilySearch have played a huge role in this.

Certificates of births, marriages, and deaths (BMD), also known as vital records, are held by government departments. People are usually allowed to request a certified copy of certificates for themselves or their immediate family upon proof of identity. Copies of older ones, for use with family history research, are easily obtained in countries like the UK and the US. With the transcription of BMD indexes, the request can be specific and not require a manual search by the corresponding register office. Access to very recent certificates is harder, mainly due to the fear of identity theft. In the UK, there are currently no digitised versions of the BMD index beyond about 2006. Access to divorce records and adoption details is typically restricted. In the US, access to adoption details varies greatly between states.

Commercial organisations such as Ancestry and brightsolid (which owns findmypast and ScotlandsPeople) obviously charge for access to their digitised data but there are a number of initiatives to create freely available data. In the UK, the following components of the FreeUKGEN project all offer free access to the UK records:

FreeBMD — Transcriptions of civil BMD index for England and Wales.

FreeCEN — Access to 19th-century UK census returns.

FreeREG — Transcripts of baptism, marriage, and burial records, parish and non-conformist registers of the UK.

At the time of writing, some of the highest access charges in Ireland have had a very significant effect on the uptake of family history research (based on feedback within a workshop I organise). For instance, one newspaper archive charges nearly €400/year for access, with no pay-per-view alternative. Similarly, the non-profit IFHF were charging €5 for access to every BMD record returned by a basic search (which is usually imprecise due to the commonality of names there), and an extra €20 simply to use an advanced search.

Most countries of the world have taken a census at some point (Census) although some of these are very recent surveys and unlikely to help with historical research. Access to the data will vary enormously. The UK has a 100-year rule to protect privacy by not allowing public access during that period. The 1911 census was released early, in 2009, and there is pressure to release the 1921 data early. In the US, the 1930 census was released in 2002 and the 1940 census is being transcribed at the time of writing (2011). In Ireland, the 1901 and 1911 census data was made available as early as 1961, and was eventually put online in 2010, although future ones will be subject to a 100-year rule similar to the UK. This early release is probably associated with the catastrophic loss of earlier Irish census data.
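The arithmetic behind a closure rule of this kind is simple, and could be sketched in Python as follows. This is a minimal illustration under my own assumptions: the function names closure_end and is_public are hypothetical, the closure period is a parameter because it differs between countries, and actual release dates are set by the relevant authority rather than by any formula.

```python
from datetime import date


def closure_end(census_date: date, closure_years: int = 100) -> date:
    """Return the date on which a closure period of the given length ends."""
    target_year = census_date.year + closure_years
    try:
        return census_date.replace(year=target_year)
    except ValueError:
        # Only reached for 29 February when the target year is not a leap year.
        return census_date.replace(year=target_year, day=28)


def is_public(census_date: date, today: date, closure_years: int = 100) -> bool:
    """True if the closure period has ended by the given date."""
    return today >= closure_end(census_date, closure_years)


# The 1911 census of England and Wales was taken on 2 April 1911.
taken = date(1911, 4, 2)
print(closure_end(taken))
print(is_public(taken, date(2010, 1, 1)))
```

A check of this kind only says when a strict rule would open the records. Early releases, such as those mentioned above, have to be recorded as exceptions.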

The strong-room of the Public Record Office of Ireland was used as an ammunition store by the anti-Treaty side during the civil war. An incoming shell exploded the munitions and destroyed all of the records in 1922 (http://www.gov.ie/en/essays/genealogy.html). Only those few records in the PRO Reading Room at the start of the conflict survived. This resulted in the loss of

the surviving 19th-century census returns (1821 to 1851),

about two-thirds of pre-1870 Church of Ireland parish registers, and

all of the surviving wills probated in Ireland.

Worse still, the census data for 1861 to 1891 had already been pulped during WWI by a government order (censusmemo). However, the surviving 1901 and 1911 records are available online at Irish National Archives for no charge which contrasts sharply with countries like the UK and the US [NB: the 1881 census of England & Wales is generally accessible for free].

® STEMMA is a registered trademark of Tony Proctor.