Importance of Narrative

1 Introduction

2 Background

3 Types of Text

4 Computer Representation

1 Introduction

This paper discusses the importance of narrative text in micro-history data and how STEMMA® addresses it. The paper suggests how text may be given structure so that it can be integrated into micro-history data, as opposed to being an adjunct or attachment, and how this might even help with the Semantic Web.

It is hoped that it will help offset the current trend to distil family history data down to a set of discrete facts and conclusions.

At the time of writing, no commercial product or data format adequately accommodates narrative content in the context of micro-history. Elizabeth Shown Mills has advocated narrative genealogy by using a word-processor. See also Randy Seaver’s associated blog post. However, this does not adequately integrate the narrative with the structures of your core data.

Structured narrative is neither plain-text notes in your data nor rich-text narrative in separate documents; it is marked-up text segments cross-linked with other entities in one all-embracing micro-history schema. A separate presentation of the structured-narrative concept may be downloaded from Structured Narrative.

2 Background

Online content largely consists of extracted facts such as details from census returns, BMD registrations, and parish records. In the interests of economy, only enough key facts are extracted or transcribed to support computer indexing and searching. The original data may be put online as scanned images but its content is not accessible to a computer search. Even when it is typed, rather than hand-written, bulk text is rarely transcribed to be computer searchable — the most obvious exception being newspaper archives.

We should therefore understand the rationale for why content providers and archives focus on discrete key facts, and not assume that this is an inherent property of micro-history data.

Genealogy in its literal sense (i.e. biological lineage, usually expressed by a family tree or pedigree chart) may not need much more than this. However, family history (see Genealogy & Family History — The Difference), and micro-history in general, require much more in the way of narrative text. Professional genealogists may be more used to writing narrative text, especially to justify their conclusions in a report. However, if a universal representation of micro-history data does not accommodate such narrative then the combination of current online content and the capabilities of current software products may diminish its status to that of eccentricity.

As I have said elsewhere, in The Lineage Trap,

If you want to document the fruits of some research then you want narrative, not a family tree. If you want to explain how you arrived at your conclusions then you want narrative, not some stepwise recipe expressed in “computer speak”. If you want to share your family history with relatives then you want real narrative, not some bunch of fields in a database table or some computer-generated “narrative”.

3 Types of Text

The most obvious type of text is biographical narrative for a Person, or historical narrative for a Place or a Group. Such text might be extensive and will undoubtedly reference other entities such as Persons, Places, Animals, Groups, Events, and even raw dates.

Another type of text that would commonly be used in narrative would be footnotes and endnotes, whether for reference-note citations or for general discursive notes. Although citations are more formalised than discursive notes, they may have analytical notes associated with them that will have less structure.

Other uses of text include:

Narrative Essays — Family-history stories and memories, making frequent reference to conclusion entities such as Persons and Places.

Narrative Reports — Reports of personal research presented in narrative form for general readability, revealing both the research journey and the uncovered history.

Research Reports — Report on the findings of a paid research assignment.

Simple Notes — Commentary attached to one-or-more entities in our data.

Research Notes — Everything we know about an associated person, place, family, etc., expressed in a raw form.

Inference and Logic — Explanation of how information supports or contradicts some claim or proposition; typically in the form of proof arguments or proof statements.

Transcription — Transcribed edition of an information source or prior work, including any transcribed extracts. These would use mark-up to provide a faithful reproduction of the relevant nuances.

Audio Transcription — Usually a specialised field but the representation as text, including nuances and multiple voices, is necessary for both searching and analysis.

A number of properties may also be associated with the text, and these may be inclusive of the above categories.

The language of the text, preferably using an ISO designation.

A surety or confidence assessment. This applies both to transcriptions of data and to conclusions or inferences.

Some indication of how sensitive or controversial the data might be, or some control over its privacy and sharing.

STEMMA uses a percentage value as an indication of the confidence in a piece of evidence, or in an inference (see its Surety attribute). The reason for doing this, rather than simple integers as used by GEDCOM, is that it allows some basic arithmetic to assess the confidence of derived data. For instance, the confidence of A may depend on the confidence of ‘B and C’, or of ‘B or C’, which is something that can be handled mathematically. Another potential advantage is that of ‘collective assessment’. Given three alternatives, X, Y, & Z, simple integers might allow an assessment of X against Y, or X against Z, but not X against all the remaining alternatives (i.e. Y+Z).
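As a minimal sketch of the arithmetic that percentage values permit, the following Python treats each percentage as a probability. The formulas assume that B and C are independent, and that X, Y, and Z are mutually exclusive; these are assumptions made for illustration and are not taken from the STEMMA specification. The function names are hypothetical.

```python
def pct_and(*confidences: float) -> float:
    """Confidence (percent) that all inputs hold, assuming independence."""
    result = 1.0
    for c in confidences:
        result *= c / 100.0
    return result * 100.0


def pct_or(*confidences: float) -> float:
    """Confidence (percent) that at least one input holds, assuming independence."""
    none_hold = 1.0
    for c in confidences:
        none_hold *= 1.0 - c / 100.0
    return (1.0 - none_hold) * 100.0


def against_rest(alternatives: dict[str, float], name: str) -> tuple[float, float]:
    """Return the confidence in `name` and the combined confidence in all the
    remaining alternatives, assuming the alternatives are mutually exclusive."""
    rest = sum(value for key, value in alternatives.items() if key != name)
    return alternatives[name], rest


b, c = 80.0, 50.0
print(round(pct_and(b, c), 1))  # 40.0
print(round(pct_or(b, c), 1))   # 90.0
print(against_rest({"X": 45.0, "Y": 30.0, "Z": 25.0}, "X"))  # (45.0, 55.0)
```

The last line illustrates the 'collective assessment' point: X is the single most likely alternative, yet it is still less likely than Y and Z taken together.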

The use of a numeric representation of confidence is controversial. The subject of "Structured Indications of Uncertainty" is discussed in the context of TEI here: Structured Uncertainty in section 17.1.2. A further discussion directly related to genealogy may be found at: You're Probably Right.

4 Computer Representation

Most data formats for family history have a NOTE element or record type, e.g. GEDCOM. However, these generally provide for small-scale notes and commentary rather than large-scale narrative. STEMMA stands alone in the way it supports narrative for micro-history (see below). It must be stressed that its support is not the same as the marked-up text that might be found on a wiki or a blog; these have no documented data model, and their text is in isolation from structured information such as lineage, timelines, and geography.

Computer software cannot create narrative from raw facts. If a family history program wants to show text then it has to load it from some content that already exists. Presenting an image of the text is OK but the associated content will not have been assimilated into the family history data and so it will be of limited use. Storing it in a separate word-processor document is almost there — the text is inherently computer-readable — but being stored separately from your core data just wastes the content. Also, there is no mark-up that makes the content usable in an historical context, say by some software product — it is just text.

The ideal is to store the text as an intrinsic part of the micro-history data, but with a mechanism that identifies the semantics of key parts. For instance, identifying references to Persons, Places, Animals, Groups, Events, raw dates, or links to other pieces of text, and making them all computer-readable. The latter type is not dissimilar to a footnote and could be used for that purpose, or it could be used to link conclusions and reasoning to supporting evidence.

In effect, this means you need some sort of mark-up language to create structured narrative. A mark-up language is familiar to anyone who has written an HTML (HyperText Markup Language) page, or even a word-processor document although the mark-up is then generated by the software and not generally visible to you. The essential similarity is that the visible text is annotated with extra information, not unlike the original marking-up of a manuscript. See The Power of Annotation.

Using the terminology from the Wikipedia article Markup_language, there are several forms of mark-up that are required for micro-history narrative:

Descriptive: Marking the text in order to capture its structure and content, rather than specific visualisations of it. Ultimate control over explicit physical rendition such as colour, bold, italic, underline, font name, and font size is best left to the tool presenting the text (e.g. HTML+CSS).

Presentational: This mark-up would be essential for a faithful transcription of something. Although modern systems (such as HTML5) frown on explicit presentational information, it may provide important information necessary for the analysis and correct interpretation of transcribed material. STEMMA’s approach to transcription separates structure and content from presentational or stylistic matters: see Descriptive Mark-up.

Semantic: Although the aforementioned Wikipedia link suggests that this is an alternative name for Descriptive mark-up, the usage here is more distinct. This mark-up provides information about the meaning or interpretation of textual references. It is therefore different from the structure and layout in a purely textual context, and is precisely what is needed to identify entities such as Persons and Places.

As an example of semantic mark-up, consider the case of an embedded URL in an HTML or wiki page. The mark-up language provides the computer with the knowledge of the target address, but at the same time provides a separate element of text for the display. There are effectively two bits of information for the same element — one for the end-user and one for the computer.

As another example, consider the citation support in HTML5. Here’s an example:

According to <cite title="HTML & XHTML: The Definitive Guide. Published by O'Reilly Media, Inc.; fifth edition (August 1, 2002)">Chuck Musciano and Bill Kennedy</cite>, the HTML cite tag actually exists!

The <cite> tag provides a formal citation, which can be taken out of line by the computer software, and a separate piece of substitution text for the end-user to read. This example would display the following text in the main body and use the citation elsewhere:

According to Chuck Musciano and Bill Kennedy, the HTML cite tag actually exists!

The actual substitution text might be selectable if presented on a computer display, and used to navigate to the citation.
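The separation of the two pieces of information can be sketched with Python's standard html.parser module. This is only an illustration of the principle, not a model for a micro-history format; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class CiteExtractor(HTMLParser):
    """Collects the visible text and, separately, the title of each <cite> tag."""

    def __init__(self):
        super().__init__()
        self.display_parts = []  # text intended for the end-user
        self.citations = []      # formal citations intended for the software

    def handle_starttag(self, tag, attrs):
        if tag == "cite":
            title = dict(attrs).get("title")
            if title:
                self.citations.append(title)

    def handle_data(self, data):
        self.display_parts.append(data)


def split_cite(html):
    """Return (display text, list of citation strings) for an HTML fragment."""
    parser = CiteExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.display_parts), parser.citations


SAMPLE = (
    'According to <cite title="HTML & XHTML: The Definitive Guide. '
    "Published by O'Reilly Media, Inc.; fifth edition (August 1, 2002)\">"
    "Chuck Musciano and Bill Kennedy</cite>, the HTML cite tag actually exists!"
)

text, citations = split_cite(SAMPLE)
print(text)
print(citations[0])
```

The first value is the sentence as the reader would see it; the second is the citation that the software could move to a footnote or endnote.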

NB: Although irrelevant to this discussion, the HTML <cite> tag is not a practical model for a similar element in a micro-history data format. The citation style is fixed, the regional preferences (e.g. date/time display format) are fixed, and there is no identification of the distinct elements of the citation (e.g. author) for semantic tagging.

4.1 Advantages

There are multiple advantages to using semantic mark-up in narrative text. Allowing computer software to recognise a specific item means that it can use that data, or reference it, in a special way such as for the creation of a footnote. Similarly, it can decide to display the item using special formatting or highlight rules selected from a style gallery.

The following examples cover some of the possibilities:

Persons — Having a reference to a Person entity embedded in your text allows some canonical version of that person’s name to be automatically displayed in its place. If the Person details are later modified then all embedded references will automatically show the modified name thereafter. The software can automatically highlight the surname portion using bold, italic, underline, or a specific colour. This should eliminate the tradition of uppercasing such name parts, which is not culturally neutral (see Letter Case). On a computer display, as opposed to a printed format, the visible name may also be made into a hyperlink that can take you to full details of that Person, or of their family, etc.

Places — Being able to embed a reference to a Place entity allows a hyperlink to be generated that can be selected to obtain further details. As well as presenting details from your own data, such a link might consult a Place Authority (see Place Authority) to obtain full geographical and historical data for that Place.

Dates — It is important for software to be able to understand a date value (see Dates and Calendars). If a date is embedded in a computer-readable fashion then it allows software to relate that to other Events or timelines. It also allows the software to display a version that is automatically formatted according to your regional settings and preferences, whatever they happen to be. A different end-user might see them formatted according to different settings. A date such as “yesterday” or “next week” may make sense to us but not to computer software.

Annotation Notes — If one section of narrative text includes a link to another section then the software can add a traditional indicator marking the presence of the extra text. In a printed form, that extra text might appear as a footnote or an endnote. On a computer display, the indicator may be selectable and could take you to that text if clicked. Similarly, if a specific datum (e.g. a date of birth) had a link to a section of text then that could be handled in the same way whenever that datum is displayed, and it might provide insights into how the datum was derived. Citations are a particular form of note and will be discussed further in a different section.
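To illustrate the point made under Dates, here is a small sketch of how one computer-readable date value might be displayed differently for different readers. The pattern table is a hypothetical stand-in for real regional settings, and the sketch handles only complete Gregorian dates in ISO form, which is far narrower than a real date representation would need to be.

```python
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Hypothetical display patterns; a real product would take these from the
# reader's regional settings and preferences.
PATTERNS = {
    "en_GB": "{d} {month} {y}",
    "en_US": "{month} {d}, {y}",
    "iso": "{y:04d}-{m:02d}-{d:02d}",
}


def display_date(value: str, locale_name: str) -> str:
    """Format an ISO date value (YYYY-MM-DD) using the pattern for a locale."""
    parsed = date.fromisoformat(value)
    return PATTERNS[locale_name].format(
        d=parsed.day, m=parsed.month, y=parsed.year,
        month=MONTHS[parsed.month - 1],
    )


print(display_date("1956-06-09", "en_GB"))  # 9 June 1956
print(display_date("1956-06-09", "en_US"))  # June 9, 1956
```

The stored value never changes; only the presentation does, which is why a word such as "yesterday" cannot be treated in the same way.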

5 STEMMA

Although there are many possible uses for narrative text there are two important categories that STEMMA has strived to unify. They are for transcriptions and for generating new narrative work (e.g. essays, reports, inference, etc.). These have markedly different characteristics as follows:

Transcription (including transcribed extracts) — requires support for textual anomalies (uncertain characters, marginalia, footnotes, interlinear/intralinear notes), audio anomalies (noises, gestures, pauses), indications of alternative spellings/pronunciation/meanings, indications of different contributors, different styles or emphasis, and semantic mark-up for references to persons, places, groups, animals, events, and dates. The latter semantic mark-up also needs to clearly distinguish objective information (e.g. that a reference is to a person) from subjective information (e.g. a conclusion as to whom that person is).

Narrative work — requires support for layout and presentation. Descriptive mark-up captures the content and structure in a way that provides visualisation software with the ultimate control over its rendering. It needs to be able to generate references to known persons, places, and dates that result in a similar mark-up to that for transcriptions. The difference here is that a textual reference is being generated from the ID of a Person entity, say, as opposed to marking an existing textual reference and possibly linking it to a Person with a given ID. It also needs to be capable of generating reference-note citations and general discursive notes.

A narrative entity is defined using the following element structure:

<Narrative Key='key'>
    [ <Title> narrative-title </Title> ]
    { <Text [Key='key'] [TEXT_TYPE] … [DATA_ATTRIBUTE] ... >
        [ <Title> text-title </Title> ]
        …text with embedded entity links…
    </Text> } ...
</Narrative>

The optional Language attribute provides an explicit ISO 639-2 three-letter code for the narrative language. If omitted then the language defaults to the prevailing language of the STEMMA Dataset. The Locale attribute provides a more detailed specification since that involves both an ISO 639-1 two-letter language code plus an ISO 3166-1 two-letter territory code, e.g. “en_GB” for British English.
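As a small illustration of the Locale form, the following sketch splits a value such as "en_GB" into its two parts. It checks only the shape of the string (two letters, an underscore, two letters); it does not consult the ISO registers, and the names used are hypothetical rather than part of STEMMA.

```python
from typing import NamedTuple


class Locale(NamedTuple):
    language: str   # ISO 639-1 two-letter language code
    territory: str  # ISO 3166-1 two-letter territory code


def parse_locale(value: str) -> Locale:
    """Split a locale such as 'en_GB' into its language and territory codes."""
    language, separator, territory = value.partition("_")
    well_formed = (
        separator == "_"
        and len(language) == 2 and language.isalpha()
        and len(territory) == 2 and territory.isalpha()
    )
    if not well_formed:
        raise ValueError(f"not a language_TERRITORY locale: {value!r}")
    return Locale(language.lower(), territory.upper())


print(parse_locale("en_GB"))  # Locale(language='en', territory='GB')
```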

A <Narrative> element is divided into separate Text segments, each of which may have different properties. <Text> elements may specify a key that allows them to be referenced or utilised from elsewhere.

References to other STEMMA entities can be embedded in <Text> elements using the following:

<PersonRef [Key='key']/>
<AnimalRef [Key='key']/>
<PlaceRef [Key='key']/>
<GroupRef [Key='key']/>
<EventRef [Key='key']/>
<ResourceRef Key='key'/>
<CitationRef Key='key'/>

The first set of these can also be used to mark up existing references in a transcription, and optionally link them to a conclusion entity such as a Person.

In STEMMA, a Resource is a separate item in the micro-history collection — typically a separate image or photograph, but also including arbitrary files and even physical artefacts. A Citation makes reference to an external source of information but the concept is generalised and so includes traditionally separate categories such as a section in a published work, the published work itself, and the repository that holds it. See Citations.

Date references may be embedded using a DateRef mark-up and specifying either a STEMMA date-value string or a full STEMMA date-entity. The same element can mark up an existing reference during a transcription and optionally attach a conclusion date. Some simplified examples might be:

<DateRef Value='1956-06-09' Mode='Short'/>
<DateRef Value='1903-03-17'> St Patrick’s Day, 1903 </DateRef>

The date-entity structure allows for different degrees of granularity, imprecision, and multiple calendars for synchronised dates such as Gregorian/Julian Dual Dates.

Here’s an example that references both a Place and a date:

<Text Key='tDemiseJessamine'>
    <Title> Demise of Jessamine Cottages </Title>
    <PlaceRef Key='wJessamine' Mode='Hierarchy'/>, were demolished in <DateRef Value='1956'/>
</Text>

This text could be referenced from another Text section using the key name tDemiseJessamine. It might generate the text title in place of the reference, but the following text might pop up when it is selected.

Jessamine Cottages, Nottingham, were demolished in 1956

Both the name of the Place and the date might be further selectable, as implied here.
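The following sketch suggests how a viewing tool might turn that Text segment into the displayed sentence. It uses Python's standard XML parser; the PLACES table is a hypothetical stand-in for Place entities held elsewhere in the Dataset, and the reading of Mode='Hierarchy' as "join the names of the place and its parents" is an assumption inferred from the displayed result rather than from the STEMMA specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup: Place key -> names from the place up through its parents.
PLACES = {"wJessamine": ["Jessamine Cottages", "Nottingham"]}

SAMPLE = """<Text Key='tDemiseJessamine'>
    <Title> Demise of Jessamine Cottages </Title>
    <PlaceRef Key='wJessamine' Mode='Hierarchy'/>, were demolished in <DateRef Value='1956'/>
</Text>"""


def render_ref(elem):
    """Return the display text for one embedded reference element."""
    if elem.tag == "PlaceRef":
        names = PLACES[elem.get("Key")]
        # Assumption: 'Hierarchy' shows the parent places too.
        return ", ".join(names) if elem.get("Mode") == "Hierarchy" else names[0]
    if elem.tag == "DateRef":
        # Prefer any marked-up text; otherwise fall back to the raw value.
        return (elem.text or "").strip() or elem.get("Value")
    return ""


def render_text(xml_source):
    """Return (title, body) for a <Text> segment, with references expanded."""
    root = ET.fromstring(xml_source)
    title_elem = root.find("Title")
    title = (title_elem.text or "").strip() if title_elem is not None else ""
    parts = [root.text or ""]
    for child in root:
        if child.tag != "Title":
            parts.append(render_ref(child))
        parts.append(child.tail or "")  # plain text following the element
    return title, " ".join("".join(parts).split())


title, body = render_text(SAMPLE)
print(title)  # Demise of Jessamine Cottages
print(body)   # Jessamine Cottages, Nottingham, were demolished in 1956
```

A real tool would wrap the expanded names and dates in hyperlinks rather than returning plain text, and would format the date according to the reader's preferences.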

Here’s an example that references a Person this time:

<Text Inference='1'>
    Head of household is Elizabeth Wildgoose (b. <Date><Value Margin='1'>1802</Value></Date>) and is almost certainly a relative of <PersonRef Key='pSarahElliott'/>, nee Wildgoose
</Text>

It might generate the following text when presented on the screen:

Head of household is Elizabeth Wildgoose (b. c1802) and is almost certainly a relative of Sarah Elliott (b. c1842), nee Wildgoose

A STEMMA file is deemed a Document, and this is broken down into one-or-more Datasets. Each Dataset has a separate self-contained set of entities, distinguished by such things as author, geography, surname, or multiple criteria. Although STEMMA was initially conceived as a format for long-term storage, such as archive or backup, and secondarily as an exchange format, the presence of structured narrative (incl. the aforementioned mark-up) means that it can be used as a traditional document format; it is not a word-processor format, but it represents narrative along with non-narrative data (e.g. lineage, timelines, and geography) in a single schema.

Following this observation, a viewing tool was prototyped that loaded a specific Dataset in one pass from a Document, indexed it in memory, and immediately provided a user interface to navigate around its content and follow the hyperlinks. This is not intended to displace the need for more complicated products, or their associated indexed databases, but it is an interesting digression on the purpose of a file format. On one hand, it provides a way of peeking inside a file without having to learn some low-level data syntax, such as XML, and without having to load it into some proprietary database. On the other hand, it provides a “genealogical document” that has both content and structure, including lineage, events/timelines, geography, and narrative, that can be navigated and presented with a generic tool.

This bundling of information as a “genealogical document” could also make it usable for automatic upload to some online framework (see What to Share, and How - Part II) or transmission as a genealogical report to clients. This would neither limit the content nor reduce editorial control and narrative freedom.

6 Citations

Citations are a fundamental part of micro-history data. However, the concepts of a reference note, a source label, and a source list (as described under Worldwide Family History Data) can be generalised through the use of narrative.

Most readers will think of a citation in terms of its printed reference-note form, e.g.

C. Dallett Hemphill, Bowing to Necessities: A History of Manners in America, 1620-1860 (New York: Oxford University Press, 1999), p.114.

It would be fairly straightforward to represent the essence of such a citation in micro-history data such that a printed form can be generated in the preferred style (e.g. EE, or CMOS) and with the regional preferences of any particular reader (see Meta-data).
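For the simple case, a sketch of what "generating a printed form" might involve is given below. The field names and the two layouts are hypothetical: the first merely reproduces the example reference note above, and the second is an illustrative shortened form; neither is offered as a faithful EE or CMOS template.

```python
from dataclasses import dataclass


@dataclass
class BookCitation:
    """Discrete elements of a citation to a page in a published book."""
    author: str
    title: str
    place: str
    publisher: str
    year: int
    page: str


def reference_note(c: BookCitation) -> str:
    """Full reference-note form, laid out as in the example above."""
    return f"{c.author}, {c.title} ({c.place}: {c.publisher}, {c.year}), p.{c.page}."


def short_note(c: BookCitation) -> str:
    """Illustrative shortened form for subsequent references to the same source."""
    surname = c.author.split()[-1]
    short_title = c.title.split(":")[0]
    return f"{surname}, {short_title}, {c.page}."


hemphill = BookCitation(
    author="C. Dallett Hemphill",
    title="Bowing to Necessities: A History of Manners in America, 1620-1860",
    place="New York",
    publisher="Oxford University Press",
    year=1999,
    page="114",
)

print(reference_note(hemphill))
print(short_note(hemphill))  # Hemphill, Bowing to Necessities, 114.
```

The same stored elements yield both forms, which is the attraction of the approach for straightforward sources.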

When analytical notes are added, though, then they can get much more complicated. Consider this example:

Death notices, Ulster Gazette and Daily National Intelligencer, both dated 24 January 1815. Corra Bacon-Foster, "The Story of Kalorama," Records of the Columbia Historical Society (1910), 108, states Louisa left four children; three have been identified. In 1810, Charles "Cating" and a female, both over 44, were enumerated with one male and female aged 26-44; one male and female aged 16-25; and one male under 10 - suggesting that George, Louisa, and their first son may have been living in the Catton household. See 1810 U.S. census, Ulster County, New York, New Paltz, p. 116, line 6; NA micropublication M252, roll 37.

What this is doing is effectively wrapping one or more simple citations in some commentary by the current author in order to create what might be called a complex citation.

Another requirement of narrative authors is discursive notes, which may or may not include any citations at all. These are simply some text that has been taken out of line; a digression.

In a printed publication, there may be a superscript, or other indicator, at the point where it is relevant, followed by a footnote, endnote, or tablenote containing that text. When viewed on an interactive display, those indicators may be selectable. For generalised notes in purely electronic documents (e.g. in a browser), the footnote/endnote concepts might not be used, and the text may be popped up when some word or indicator is selected or hovered over. The point being made is that structured narrative can be used to create generalised notes, without having to presume a particular mechanism or style.

Although it is possible to generate citations of different style, and for different locales, from the discrete citation-element values, there are many complications in the real world. A citation sentence may contain different layers describing the provenance of the source and its information, or it may contain analytical notes. A reference note may contain multiple citation sentences — a tour of these scenarios was covered in Cite Seeing. Subsequent references to the same source would typically use a shortened form of the associated reference note, or the author may have employed an explicit hereinafter-cited-as term, or the Latin abbreviation Ibid. A footnote may have woven two source references into the same piece of text. Certain parts of a citation may not have been available (e.g. an undated document), or may have been erroneous, and so the citation would need to override any simple template-like formatting. In effect, authors of narrative work are loath to delegate generation of their citations to a piece of software working blindly from a set of data values. It is therefore necessary to support hand-crafted forms, and change the focus of citation-elements to that of correlation and interrogation rather than formatting.

7 Meta-data

In the context of micro-history data, meta-data is data about data. Meta-data is an important concept because it allows data to be processed or utilised appropriately. In a computer context, it allows software to understand (in a primitive sense) what data it is dealing with.

STEMMA’s structured narrative provides a form of meta-data in the semantic mark-up it uses to embed references to Persons, Places, Animals, Events, etc. Without that meta-data, all the narrative would contain is the name or textual description of the same entities.

The Semantic Web movement strives to supplement the currently unstructured Web content with meta-data. The idea is that this will allow computers to search, correlate, and combine information more easily. This will therefore be an important part of micro-history representation in online content. It doesn’t necessarily mean that a standard data model needs to incorporate its RDF tags since it would be wrong to tie it to such a specific technology. However, it does mean that such a data model must provide for meta-data, and that a physical serialisation format (as with file formats) that was derived from the data model for the Semantic Web could use RDF tags.

Another situation where meta-data comes up, and is still hotly debated, is citation-elements. These are the elements of data that would constitute a computer-readable citation, e.g. author(s), title, publisher, etc. A computer-readable citation differs from a printed citation in that factors such as style (e.g. EE, CMOS) and regional settings are removed, and can be reapplied later for the context of a specific end-user.

At its most simple, you might imagine such a citation to be represented by a number of discrete XML elements such as:

<CitationReference>
    <Elements>
        <Author> name </Author>
        <Title> title </Title>
        …etc…
    </Elements>
</CitationReference>

Irrespective of whether each element has a specific tag (e.g. <Author>) or a qualified generic tag (e.g. <Element Name="Author">), it still basically identifies the datum rather than the nature of the datum. There has been some considerable discussion on BetterGEDCOM about whether these tags should follow the Dublin Core scheme and have shared semantics. Dublin Core started with a vocabulary of 15 shared meta-data tags, which included things like Creator, although it has since been extended with extra tags and contextual refinements of existing ones. I have been a critic of this for citation-elements since it’s analogous to having relational databases share a common set of column names, each with fixed semantics. Also, the citation-element tags are not themselves meta-data tags, as already mentioned. If the citation-elements were part of a Semantic Web contribution then they would be annotated with RDF meta-data tags indicating their nature; it would not be deduced from the XML data tags. There may be other schemes, too, but they would be part of the corresponding physical serialisation format.

A related discussion of the importance of meta-data may be found at Technophoo, have no fear, although this does not differentiate between the formal data and meta-data concepts for citation-elements. Hence, although Dublin Core (which is now represented by ISO 15836:2009) is a viable standard, it should be acknowledged that it is a standard for meta-data tags rather than data tags.

8 Copyright

Some of the things that cannot be copyrighted include facts and raw unarranged data[1]. From this perspective, mere details transcribed from vital records cannot be the subject of copyright.

Even the building of family pedigree charts showing marriages between people and linking their respective offspring is little more than a re-arrangement of facts that can be looked up by someone else. Although that linking of Person entities according to their biological lineage constitutes part of a conclusion model, and may have been the product of some research, the expression embodies nothing that can be copyrighted in any practical way.

This means that there is no legal impediment to online collaborative trees that include nothing more than facts and conclusions of the pedigree variety. Unfortunately, we all know where that leads. Online trees with no citations, no reasoning, no evidence supporting their conclusions, and no attribution, are “ten-a-penny” (or “a dime-a-dozen”). These trees are replicated just as easily as they’re created and that compounds the problem to a point where it almost kills serious genealogy. It has even been likened to a virus by blogger Ben Sayer.

Once micro-history data makes use of structured narrative then it starts to become a creative work — a work of academic research that includes reasoning, conclusions, and opinions. Such a work is automatically copyright by virtue of the Berne Convention.

Such data might be published under a Creative Commons licence but that only avoids the legal issues. The fact that those reliable and thoroughly-researched contributions will have taken someone a long time to produce — possibly a life’s work — means that they will be understandably less-inclined to just share it with everyone, especially if some of the weaker researchers would simply pass it off as their own.

Is this an argument against collaborative trees? No, it’s simply an indication that the current naïve approach is demonstrably wrong, and will lead to further issues when we include narrative.

A case is made under Evidence and Conclusion for distinguishing three types of data rather than the two implied by this name; the third being all those parts that justify, or prove, the conclusions. Having three distinct parts gives greater flexibility for handling copyright and sharing issues. Also, What to Share, and How - Part II presents an alternative model that incorporates narrative in such a way as to provide automatic attribution, reduce the need for copying, and avoid edit wars when there are differences of opinion on a shared tree.

® STEMMA is a registered trademark of Tony Proctor.

[1] This is the case in the US, but a sweat of the brow concept still exists in Europe and that has paved the way for database rights. See the “Analysis” section of A Copyright Casualty — Part II.