Musings on Standardisation

1 Introduction

2 Which Data?

3 Computer versus Human Use

4 Culture and Locale

5 Model versus Data Format

6 What about Databases?

7 What about APIs?

8 Process Orientated versus Structural

9 Rigidity versus Flexibility

10 Standard versus Proprietary

1 Introduction

There would, hopefully, be complete consensus that any new data format, or data model, should be an open standard that is not controlled or dictated by any single vendor or commercial organisation.

However, what else does the concept include?

This short paper looks at specific high-level issues associated with the ideal and attempts to clarify some finer goals. Although STEMMA® is not promoted as a data standard, it exists because there was no acceptable standard available.

The following articles also provide some related discussions of how to proceed with data standards: Bootstrapping a Data Standard and Supporting a Proof Standard.

2 Which Data?

A standard should not be limited to genealogical data, by which I mean data related to biological lineage and family pedigrees. A standard should cater for the much broader range of data often described as ‘family history data’, and ideally ‘micro-history data’ (see below).

The distinction between genealogical data and family history data is not universal but is growing in acceptance. The Society of Genealogists (SoG) posts their distinction at: http://www.sog.org.uk/education/gandfh.shtml. The following very precise and succinct distinction is from Dr Nick Barratt:

"We use genealogy and family history as though they are one and the same thing, but of course they are not. Genealogy is a purer search for historical connectivity between generations — building a family tree or pedigree, if you like — whereas family history is a broader piece of research into their lives and activities"

This appeared in Your Family History magazine, March 2013, issue 38, page 74.

I would go further than this, though, and ensure that a standard catered for other types of micro-history data, including that for One-Name Studies, One-Place Studies, and personal histories (as in APH). Taking a generalised approach to historical data was a primary goal of STEMMA.

This greater scope must include data related to the lives of the respective people such as biography, recollection, historical narrative, significant events and places, and significant people in others’ lives — whether related or not. It should be able to go beyond people to represent the history of places, or groups, or animals, too. It should cater for supporting documents, research, full citation references, data control (e.g. privacy or copyright), and clearly distinguish evidence from conclusion. See What is Genealogy? for a more in-depth analysis.

3 Computer versus Human Use

A standard should be for the exchange and long-term storage of computer-readable genealogical and micro-history data. It must be stressed that it is not designed to be humanly readable or editable.

Textual types of data representation, such as GEDCOM or XML, have advantages over purely binary formats: they are easier to develop and to diagnose, and they are also more transportable. Binary formats that contain data such as integers, dates, and floating-point values do not transport well between different computer architectures because of differences in the binary values themselves or in the byte ordering of those values.
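The byte-ordering point can be illustrated with a minimal Python sketch. The integer value and the variable names are chosen purely for illustration; the standard `struct` module is used to mimic how two different architectures would lay out the same 32-bit integer.

```python
import struct

value = 1837  # an arbitrary example integer, e.g. a year

# The same 32-bit unsigned integer written by two hypothetical machines.
big_endian = struct.pack(">I", value)     # most significant byte first
little_endian = struct.pack("<I", value)  # least significant byte first

# Reading big-endian bytes as if they were little-endian raises no error;
# it silently produces a different number.
misread = struct.unpack("<I", big_endian)[0]

# A textual representation has no byte order to get wrong.
as_text = str(value).encode("ascii")
round_tripped = int(as_text.decode("ascii"))

print(big_endian.hex(), little_endian.hex())
print(misread == value, round_tripped == value)
```

A binary format can of course fix its byte order by specification, but every reader then has to honour it; a textual format sidesteps the issue.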

If a textual format can be understood by an experienced user then that is a bonus but it should not be a primary goal.

4 Culture and Locale

A standard should be free of restrictions or limitations resulting from cultural insularity or cultural ignorance. In effect, it should be equally applicable to cultures around the world, both now and in the past.

A standard should not be dependent on the locale of the end-user. This means that the content should not be interpreted differently if loaded by software with different regional settings.
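As a small illustration of the locale problem, the following Python sketch parses one date string under two regional conventions and obtains two different dates. The date string is invented for the example, and the use of an ISO 8601 style string as the locale-neutral alternative is my assumption for illustration, not something this paper prescribes.

```python
from datetime import date, datetime

ambiguous = "03/04/2013"  # invented example value

# Two hypothetical programs, each applying its own regional convention.
read_as_us = datetime.strptime(ambiguous, "%m/%d/%Y").date()  # month first
read_as_uk = datetime.strptime(ambiguous, "%d/%m/%Y").date()  # day first

# A locale-neutral form (here ISO 8601 style) has a single interpretation.
neutral = "2013-04-03"
parsed = date.fromisoformat(neutral)

print(read_as_us, read_as_uk, parsed)
```

The same consideration applies to decimal separators, name ordering, and any other value whose written form varies by region.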

5 Model versus Data Format

A standard should primarily reflect a data model that defines what is represented and how it is linked rather than how it is physically represented. Issues like ordinality and cardinality can be represented using tools such as Entity-Relationship (ER) Diagrams.

There may be more than one physical representation (or serialisation format) for a data model, and each must additionally be defined by the associated standard. Because of its prevalence and acceptance as an international standard, one of these formats should certainly be XML-based in order to prevent multiple conflicting XML representations being defined outside of the standard.
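The separation between a model and its serialisation formats can be sketched as follows. This is a minimal illustration only: the `Person` entity, its `key` and `name` attributes, and the element names in the XML output are hypothetical and do not reflect the STEMMA schema or any other real one.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass


@dataclass
class Person:
    """A hypothetical entity in the data model (what is represented)."""
    key: str
    name: str


def to_xml(person):
    """One possible physical representation of the model."""
    elem = ET.Element("Person", {"Key": person.key})
    ET.SubElement(elem, "Name").text = person.name
    return ET.tostring(elem, encoding="unicode")


def to_json(person):
    """A second physical representation of the same model."""
    return json.dumps(asdict(person), sort_keys=True)


person = Person(key="p1", name="John Smith")
print(to_xml(person))
print(to_json(person))
```

Both outputs carry the same information because both derive from the one model; the standard would need to define each such mapping so that independent implementations agree.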

6 What about Databases?

An essential property of a data model is the normalisation of its data to minimise, if not exclude, redundancy and duplication. Before such data can be used efficiently, it must be loaded into an indexed form, either in memory or in a database, and that indexed form may define additional linkages and generally denormalise the data for efficiency of interrogation. A standard should not dictate this indexed form, nor mandate any particular database type or design. The article Do Genealogists Really Need a Database? makes the case that genealogists do not even need a database.
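A rough sketch of the difference between the normalised data and an in-memory indexed form follows. The record layout, the keys, and the `build_indexes` helper are all hypothetical and exist only to illustrate the idea.

```python
from collections import defaultdict

# Normalised data: each link is stored exactly once, as (parent, child).
persons = {"p1": "Ann Lee", "p2": "Tom Lee", "p3": "Sam Lee"}
parent_links = [("p1", "p3"), ("p2", "p3")]


def build_indexes(links):
    """Derive a denormalised, indexed form for efficient interrogation."""
    children_of = defaultdict(list)
    parents_of = defaultdict(list)
    for parent, child in links:
        # The same link is now reachable from both directions.
        children_of[parent].append(child)
        parents_of[child].append(parent)
    return children_of, parents_of


children_of, parents_of = build_indexes(parent_links)
print([persons[k] for k in parents_of["p3"]])
```

The indexes duplicate information deliberately, which is why they belong to the software loading the data rather than to the standard itself.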

7 What about APIs?

Live exchange of data occurs when it is passed directly between software units, either on the same machine, across the Internet (e.g. cloud computing), or over some other communications protocol. This differs from a static exchange involving a data file like GEDCOM. Such live exchange requires an API (Application Programming Interface) and associated run-time object model. A run-time model would be similar to a static data model but may include additional linkages and relationships implied by the underlying indexed form. It would certainly embrace procedural “methods” on the various objects for standardised operations (e.g. searching for a person by name).
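To make the idea of a run-time object model more concrete, here is a minimal sketch of one object with a standardised search operation. The `Dataset` and `PersonRecord` classes and their method names are hypothetical; no existing genealogical API is being described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonRecord:
    key: str
    name: str


class Dataset:
    """Hypothetical run-time object that an API might expose."""

    def __init__(self, persons):
        # Indexed form held in memory; not part of the static data model.
        self._by_key = {p.key: p for p in persons}

    def get_person(self, key):
        """Return the person with this key, or None if absent."""
        return self._by_key.get(key)

    def find_persons_by_name(self, text):
        """Case-insensitive substring search on person names."""
        needle = text.casefold()
        return [p for p in self._by_key.values()
                if needle in p.name.casefold()]


dataset = Dataset([
    PersonRecord("p1", "Ann Lee"),
    PersonRecord("p2", "Tom Lee"),
    PersonRecord("p3", "Mary Hall"),
])
matches = dataset.find_persons_by_name("lee")
print([p.key for p in matches])
```

The caller sees only the methods; how the data is indexed behind them remains an implementation choice.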

While an API that allows interoperability between software units is possible, it is not the generally accepted interpretation of an API within genealogy. Such an API would require a peer-to-peer network, and would require both participants to be online at the same time. It is debatable how much advantage it would offer over a static exchange of data.

On the other hand, Web services and service APIs are typically associated with client-server network models (see SOA and SaaS), and this means they are of primary importance to the providers of online content. It is especially important that any API be defined around a standard data model rather than the specific data currently being exposed.

Although such an API would be a good thing to have, it should be defined by a separate standard from the data model, and it must be dependent upon that data model in order to embrace its structure. A relevant analogy might be OpenOffice. This is an open-source development (not to be confused with an open standard) but it implements the ISO/IEC Open Document Format (ODF).

8 Process Orientated versus Structural

There is no single way of researching, documenting, and storing micro-history data, and there never will be. Many aspects of the research process will be understood by everyone (e.g. separating evidence and conclusion) but we must not presume that we will all use the same software, or even use commercial software products at all.

A standard should be as applicable to an experienced or professional user as to a naïve user who just collects names, dates, and places. Hence, it should neither stipulate nor mandate any formal process, and it should be able to represent all data without bias or presumption about the process used to obtain it.

This includes the Genealogical Proof Standard (GPS). While it would be valid for a data model to store details which could be used to support a standard of proof, this is different to making it specific to a given process. We can still recognise and promote best practices for research but a standard should not mandate them. It should concentrate on distinguishing the necessary types of data and linking them together.

9 Rigidity versus Flexibility

A data model must be robustly defined in order to make it unambiguous. It would need firm concepts in order to support analyses such as identifying family groups and depicting timelines.

However, a significant amount of flexibility must also be provided to accommodate the unexpected, or the ad hoc item. Important aspects of this must include:

Narrative — some powerful mechanism for adding text to selected parts of the data. This should be more than a simple note appendage. See under Importance of Narrative.

User-defined Properties — it will not be possible to prescribe all the types of detail or “fact” that may be recorded from sources worldwide. There must be a way of defining additional ones that will be implicitly usable by all recipients, and which should not clash with ones defined by other authors.

Partially controlled vocabularies to allow extensible categorisation of events, places, styles, etc.

An extensible approach to sources, citations and citation elements that is not constrained to a set of predefined items enumerated in isolation.
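One way the user-defined properties and partially controlled vocabularies above might avoid clashes between authors is to qualify each author-defined term with a namespace that the author controls. The following sketch assumes that approach; the core terms, the namespace URIs, and the helper names are all hypothetical.

```python
# Hypothetical core vocabulary predefined by a standard.
CORE_EVENT_TYPES = frozenset({"Birth", "Marriage", "Death"})


def qualify(namespace, term):
    """Return a distinguishable name for an author-defined term."""
    return f"{namespace}#{term}"


def is_known(term, extensions):
    """A term is usable if it is predefined or explicitly declared."""
    return term in CORE_EVENT_TYPES or term in extensions


# Two authors independently define a term with the same local name.
author_a = qualify("http://example.org/a", "Apprenticeship")
author_b = qualify("http://example.net/b", "Apprenticeship")
extensions = {author_a, author_b}

print(author_a != author_b)
print(is_known("Birth", extensions), is_known("Apprenticeship", extensions))
```

A recipient can then accept both authors' terms side by side, while an unqualified custom term is rejected rather than silently confused with someone else's.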

10 Standard versus Proprietary

The community has had enough of proprietary standards that are loosely defined, that offer no formal way of proposing changes to them, that are very US-focused, and that do not embrace modern data standards.

The Semantic Web movement, led by W3C, aims to supplement existing Web content — which is either humanly-readable data (e.g. documents) or machine-readable data that requires an application (e.g. spreadsheets) — with information about the data. The idea is that it will allow machines to make decisions and inferences without having to scan raw data.

The use of meta-data, ontologies and “glue” in the form of RDF, XML, and OWL may well provide the semantic information to help find the right entities during a complex query. However, whether you can then merge, correlate, or do anything with those entities as a group depends on the data itself. If they all use a different syntax and different models then the world is no better off. What is still needed is a standard data model for our data.

Once a data model is formed, it does not mandate a particular physical representation, e.g. in a data file. That could use an internationally-recognised data syntax such as XML or a proprietary syntax such as a GEDCOM-like one. However, the Semantic Web would require a representation that used a standard data syntax, and that used internationally recognised standard structures and concepts within the data. Using a proprietary data syntax, or proprietary element formats (e.g. for dates), would immediately limit the applicability of a standard.

® STEMMA is a registered trademark of Tony Proctor.