Importance of Narrative


1        Introduction

2        Background

3        Types of Text

4        Computer Representation

4.1       Advantages

5        STEMMA

6        Citations

7        Meta-data

8        Copyright


1      Introduction

This paper discusses the importance of narrative text in micro-history data and how STEMMA® addresses it. The paper suggests how text may be given structure so that it can be integrated into micro-history data, as opposed to being an adjunct or attachment, and how this might even help with the Semantic Web.

 

It is hoped that it will help offset the current trend to distil family history data down to a set of discrete facts and conclusions.

 

At the time of writing, no commercial product or data format adequately accommodates narrative content in the context of micro-history. Elizabeth Shown Mills has advocated narrative genealogy by using a word-processor. See also Randy Seaver’s associated blog post. However, this does not adequately integrate the narrative with the structures of your core data.

 

Structured narrative is neither plain-text notes in your data nor rich-text narrative in separate documents; it is marked-up text segments cross-linked with other entities in one all-embracing micro-history schema. A separate presentation of the structured-narrative concept may be downloaded from Structured Narrative.

2      Background

Online content largely consists of extracted facts such as details from census returns, BMD registrations, and parish records. In the interests of economy, only enough key facts are extracted or transcribed to support computer indexing and searching. The original data may be put online as scanned images but its content is not accessible to a computer search. Even when it is typed, rather than hand-written, bulk text is rarely transcribed to be computer searchable — the most obvious exception being newspaper archives.

 

We should therefore understand the rationale for why content providers and archives focus on discrete key facts, and not assume that this is an inherent property of micro-history data.

 

Genealogy in its literal sense (i.e. biological lineage, usually expressed by a family tree or pedigree chart) may not need much more than this. However, family history (see Genealogy & Family History — The Difference), and micro-history in general, require much more in the way of narrative text. Professional genealogists may be more used to writing narrative text, especially to justify their conclusions in a report. However, if a universal representation of micro-history data does not accommodate such narrative then the combination of current online content and the capabilities of current software products may diminish its status to that of eccentricity.

 

As I have said elsewhere, in The Lineage Trap,

 

If you want to document the fruits of some research then you want narrative, not a family tree. If you want to explain how you arrived at your conclusions then you want narrative, not some stepwise recipe expressed in “computer speak”. If you want to share your family history with relatives then you want real narrative, not some bunch of fields in a database table or some computer-generated “narrative”.

 

3      Types of Text

The most obvious type of text is biographical narrative for a Person, or historical narrative for a Place or a Group. Such text might be extensive and will undoubtedly reference other entities such as Persons, Places, Animals, Groups, Events, and even raw dates.

 

Another type of text that would commonly be used in narrative would be footnotes and endnotes, whether for reference-note citations or for general discursive notes. Although citations are more formalised than discursive notes, they may have analytical notes associated with them that will have less structure.

 

Other uses of text include:

 

 

A number of properties may also be associated with the text, and these may be inclusive of the above categories.

 

 

STEMMA uses a percentage value as an indication of the confidence in a piece of evidence, or in an inference (see its Surety attribute). The reason for doing this, rather than simple integers as used by GEDCOM, is that it allows some basic arithmetic to assess the confidence of derived data. For instance, the confidence of A may depend on the confidence of ‘B and C’, or of ‘B or C’, which is something that can be handled mathematically. Another potential advantage is that of ‘collective assessment’. Given three alternatives, X, Y, & Z, simple integers might allow an assessment of X against Y, or X against Z, but not X against all the remaining alternatives (i.e. Y+Z).
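The arithmetic described above can be sketched in a few lines of Python. This is only an illustration: it treats each percentage as a probability, assumes that B and C are statistically independent, and assumes that the alternatives in the collective case are mutually exclusive so that their percentages can be added. None of these assumptions, nor the function names, are taken from the STEMMA specification.

```python
def surety_and(*percentages: float) -> float:
    """Confidence that every item holds (assumes the items are independent)."""
    result = 1.0
    for p in percentages:
        result *= p / 100.0
    return result * 100.0


def surety_or(*percentages: float) -> float:
    """Confidence that at least one item holds (assumes independence)."""
    none_hold = 1.0
    for p in percentages:
        none_hold *= 1.0 - p / 100.0
    return (1.0 - none_hold) * 100.0


def outweighs_rest(x: float, *others: float) -> bool:
    """Collective assessment: is X more likely than all the other
    alternatives combined (assumes the alternatives are mutually exclusive)?"""
    return x > sum(others)
```

With simple integer grades there is no obvious equivalent of these operations, which is the point being made in the text.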

 

The use of a numeric representation of confidence is controversial. The subject of "Structured Indications of Uncertainty" is discussed in the context of TEI here: Structured Uncertainty in section 17.1.2. A further discussion directly related to genealogy may be found at:  You're Probably Right.

4      Computer Representation

Most data formats for family history have a NOTE element or record type, e.g. GEDCOM. However, these generally provide for small-scale notes and commentary rather than large-scale narrative. STEMMA stands alone in the way it supports narrative for micro-history (see below). It must be stressed that its support is not the same as the marked-up text that might be found on a wiki or a blog; these have no documented data model, and their text is in isolation from structured information such as lineage, timelines, and geography.

 

Computer software cannot create narrative from raw facts. If a family history program wants to show text then it has to load it from some content that already exists. Presenting an image of the text is OK but the associated content will not have been assimilated into the family history data and so it will be of limited use. Storing it in a separate word-processor document is almost there, since the text is inherently computer-readable, but being stored separately from your core data just wastes the content. Also, there is no mark-up that makes the content usable in an historical context, say by some software product; it is just text.

 

The ideal is to store the text as an intrinsic part of the micro-history data, but with a mechanism that identifies the semantics of key parts. For instance, identifying references to Persons, Places, Animals, Groups, Events, raw dates, or links to other pieces of text, and making them all computer-readable. The latter type is not dissimilar to a footnote and could be used for that purpose, or it could be used to link conclusions and reasoning to supporting evidence.

 

In effect, this means you need some sort of mark-up language to create structured narrative. A mark-up language is familiar to anyone who has written an HTML (HyperText Markup Language) page, or even a word-processor document, although the mark-up is then generated by the software and not generally visible to you. The essential similarity is that the visible text is annotated with extra information, not unlike the original marking-up of a manuscript. See The Power of Annotation.

 

Using the terminology from Markup_language, there are several forms of mark-up that are required for micro-history narrative:

 

 

As an example of semantic mark-up, consider the case of an embedded URL in an HTML or wiki page. The mark-up language provides the computer with the knowledge of the target address, but at the same time provides a separate element of text for the display. There are effectively two bits of information for the same element — one for the end-user and one for the computer.

 

As another example, consider the citation support in HTML5. Here’s an example:

 

According to <cite title="HTML & XHTML: The Definitive Guide. Published by O'Reilly Media, Inc.; fifth edition (August 1, 2002)">Chuck Musciano and Bill Kennedy</cite>, the HTML cite tag actually exists!

 

The <cite> tag provides a formal citation, which can be taken out of line by the computer software, and a separate piece of substitution text for the end-user to read. This example would display the following text in the main body and use the citation elsewhere:

 

According to Chuck Musciano and Bill Kennedy, the HTML cite tag actually exists!

 

The actual substitution text might be selectable if presented on a computer display, and used to navigate to the citation.
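The separation of substitution text from citation can be illustrated with Python's standard html.parser module. This is a minimal sketch rather than a description of how any browser or genealogy product works; the class and function names are invented for the illustration.

```python
from __future__ import annotations

from html.parser import HTMLParser

EXAMPLE = (
    'According to <cite title="HTML & XHTML: The Definitive Guide. '
    "Published by O'Reilly Media, Inc.; fifth edition (August 1, 2002)\">"
    "Chuck Musciano and Bill Kennedy</cite>, the HTML cite tag actually exists!"
)


class CiteExtractor(HTMLParser):
    """Collects the reader-facing text and, separately, each citation."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.display_parts = []    # text shown in the main body
        self.citations = []        # (substitution text, citation) pairs
        self._pending_title = None
        self._pending_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "cite":
            self._pending_title = dict(attrs).get("title") or ""
            self._pending_text = []

    def handle_data(self, data):
        self.display_parts.append(data)
        if self._pending_title is not None:
            self._pending_text.append(data)

    def handle_endtag(self, tag):
        if tag == "cite" and self._pending_title is not None:
            text = "".join(self._pending_text)
            self.citations.append((text, self._pending_title))
            self._pending_title = None


def split_cite(html: str) -> tuple[str, list]:
    """Return (display text, list of (substitution text, citation))."""
    parser = CiteExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.display_parts), parser.citations
```

Calling split_cite(EXAMPLE) yields the sentence for the reader and, separately, the citation that software could move to a footnote or use as a navigation target.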

 

NB: Although irrelevant to this discussion, the HTML <cite> tag is not a practical model for a similar element in a micro-history data format. The citation style is fixed, the regional preferences (e.g. date/time display format) are fixed, and there is no identification of the distinct elements of the citation (e.g. author) for semantic tagging.

 

4.1    Advantages

There are multiple advantages to using semantic mark-up in narrative text. Allowing computer software to recognise a specific item means that it can use that data, or reference it, in a special way such as for the creation of a footnote. Similarly, it can decide to display the item using special formatting or highlight rules selected from a style gallery.

 

The following examples cover some of the possibilities:

 

5      STEMMA

Although there are many possible uses for narrative text there are two important categories that STEMMA has strived to unify. They are for transcriptions and for generating new narrative work (e.g. essays, reports, inference, etc.). These have markedly different characteristics as follows:

 

 

A narrative entity is defined using the following element structure:

 

<Narrative Key='key'>
    [ <Title> narrative-title </Title> ]
    { <Text [Key='key']  [TEXT_TYPE] … [DATA_ATTRIBUTE] ... >
        [ <Title> text-title </Title> ]
        …text with embedded entity links…
    </Text> } ...
</Narrative>

 

 

The optional Language attribute provides an explicit ISO 639-2 three-letter code for the narrative language. If omitted then the language defaults to the prevailing language of the STEMMA Dataset. The Locale attribute provides a more detailed specification since that involves both an ISO 639-1 two-letter language code plus an ISO 3166-1 two-letter territory code, e.g. “en_GB” for British English.

 

A <Narrative> element is divided into separate Text segments, each of which may have different properties. <Text> elements may specify a key that allows them to be referenced or utilised from elsewhere.

 

References to other STEMMA entities can be embedded in <Text> elements using the following:

 

<PersonRef [Key='key']/>
<AnimalRef [Key='key']/>
<PlaceRef [Key='key']/>
<GroupRef [Key='key']/>
<EventRef [Key='key']/>

<ResourceRef Key='key'/>
<CitationRef Key='key'/>

 

The first set of these can also be used to mark up existing references in a transcription, and optionally link them to a conclusion entity such as a Person.

 

In STEMMA, a Resource is a separate item in the micro-history collection — typically a separate image or photograph, but also including arbitrary files and even physical artefacts. A Citation makes reference to an external source of information but the concept is generalised and so includes traditionally separate categories such as a section in a published work, the published work itself, and the repository that holds it. See Citations.

 

Date references may be embedded using a DateRef mark-up and specifying either a STEMMA date-value string or a full STEMMA date-entity. The same element can mark up an existing reference during a transcription and optionally attach a conclusion date. Some simplified examples might be:

 

<DateRef Value='1956-06-09' Mode='Short'/>
<DateRef Value='1903-03-17'> St Patrick’s Day, 1903 </DateRef>

 

The date-entity structure allows for different degrees of granularity, imprecision, and multiple calendars for synchronised dates such as Gregorian/Julian Dual Dates.

 

Here’s an example that references both a Place and a date:

 

<Text Key='tDemiseJessamine'>
<Title> Demise of Jessamine Cottages </Title>
<PlaceRef Key='wJessamine' Mode='Hierarchy'/>, were demolished in <DateRef Value='1956'/>
</Text>

 

This text could be referenced from another Text section using the key name tDemiseJessamine. It might generate the text title in place of the reference, but the following text might pop up when it is selected.

 

Jessamine Cottages, Nottingham, were demolished in 1956

 

Both the name of the Place and the date might be further selectable, as implied here.
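How software might turn such a segment into the displayed sentence can be sketched with Python's standard xml.etree.ElementTree module. The sketch makes several assumptions: the PLACE_NAMES lookup stands in for real Place entities, the Mode attribute is ignored because the lookup already holds the hierarchical name, and the function name is invented. It is not the behaviour of any actual STEMMA tool.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# Hypothetical lookup of display names by Place key (stands in for real entities).
PLACE_NAMES = {"wJessamine": "Jessamine Cottages, Nottingham"}

SEGMENT = """<Text Key='tDemiseJessamine'>
<Title> Demise of Jessamine Cottages </Title>
<PlaceRef Key='wJessamine' Mode='Hierarchy'/>, were demolished in <DateRef Value='1956'/>
</Text>"""


def render_text(segment_xml: str, place_names: dict[str, str]) -> tuple[str, str]:
    """Return (title, display text) for a single <Text> segment."""
    text_el = ET.fromstring(segment_xml)
    title = (text_el.findtext("Title") or "").strip()

    parts = [text_el.text or ""]
    for child in text_el:
        if child.tag == "PlaceRef":
            key = child.get("Key", "")
            # Fall back to the raw key if the Place is unknown.
            parts.append(place_names.get(key, key))
        elif child.tag == "DateRef":
            # Prefer any transcribed text inside the element, else the value.
            parts.append((child.text or "").strip() or child.get("Value", ""))
        # <Title> and unrecognised elements add nothing to the body text.
        parts.append(child.tail or "")

    # Collapse the line breaks and runs of spaces left over from the XML layout.
    return title, " ".join("".join(parts).split())
```

A viewer could show the returned title where the segment is referenced and pop up the body text when it is selected.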

 

Here’s an example that references a Person this time:

 

<Text Inference='1'>
Head of household is Elizabeth Wildgoose (b. <Date><Value Margin='1'>1802</Value></Date>) and is almost certainly a relative of <PersonRef Key='pSarahElliott'/>, nee Wildgoose
</Text>

 

It might generate the following text when presented on the screen:

 

Head of household is Elizabeth Wildgoose (b. c1802) and is almost certainly a relative of Sarah Elliott (b. c1842), nee Wildgoose

 

 

A STEMMA file is deemed a Document, and this is broken down into one-or-more Datasets. Each Dataset has a separate self-contained set of entities, distinguished by such things as author, geography, surname, or multiple criteria. Although STEMMA was initially conceived as a format for long-term storage, such as archive or backup, and secondarily as an exchange format, the presence of structured narrative (incl. the aforementioned mark-up) means that it can be used as a traditional document format; it is not a word-processor format, but it represents narrative along with non-narrative data (e.g. lineage, timelines, and geography) in a single schema.

Following this observation, a viewing tool was prototyped that loaded a specific Dataset in one pass from a Document, indexed it in memory, and immediately provided a user interface to navigate around its content and follow the hyperlinks. This is not intended to displace the need for more complicated products, or their associated indexed databases, but it is an interesting digression on the purpose of a file format. On one hand, it provides a way of peeking inside a file without having to learn some low-level data syntax, such as XML, and without having to load it into some proprietary database. On the other hand, it provides a “genealogical document” that has both content and structure, including lineage, events/timelines, geography, and narrative, that can be navigated and presented with a generic tool.

This bundling of information as a “genealogical document” could also make it usable for automatic upload to some online framework (see What to Share, and How - Part II) or transmission as a genealogical report to clients. This would neither limit the content nor reduce editorial control and narrative freedom.
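The load-and-index step of such a viewer might look something like the following Python sketch. It is not the prototype itself. The element names Document, Dataset, Person, and Place, and the Name attribute, are assumptions made for the illustration; only the Key attribute and the ...Ref elements appear in the examples above.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified Document; real STEMMA files hold far more.
DOCUMENT = """<Document>
  <Dataset Name='Example'>
    <Person Key='pSarahElliott'/>
    <Place Key='wJessamine'/>
    <Narrative Key='nNotes'>
      <Text Key='tDemiseJessamine'>
        <PlaceRef Key='wJessamine'/> were demolished in <DateRef Value='1956'/>
      </Text>
    </Narrative>
  </Dataset>
</Document>"""


def index_dataset(document_xml: str, dataset_name: str):
    """Walk one Dataset once, returning (entities by key, list of links)."""
    root = ET.fromstring(document_xml)
    for dataset in root.iter("Dataset"):
        if dataset.get("Name") == dataset_name:
            break
    else:
        raise KeyError(dataset_name)

    index = {}   # key -> element that defines the entity
    links = []   # (reference tag, target key) pairs found in narrative
    for element in dataset.iter():
        key = element.get("Key")
        if key is None:
            continue
        if element.tag.endswith("Ref"):
            links.append((element.tag, key))   # a hyperlink to follow
        else:
            index[key] = element               # a definition to navigate to
    return index, links
```

Following a hyperlink then amounts to a dictionary lookup, e.g. index['wJessamine'], which is why no separate database is needed for simple browsing.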

6      Citations

Citations are a fundamental part of micro-history data. However, the concept of a reference note, source label, and a source list (as described under Worldwide Family History Data) can be generalised through the use of narrative.

 

Most readers will think of a citation in terms of its printed reference-note form, e.g.

 

C. Dallett Hemphill, Bowing to Necessities: A History of Manners in America, 1620-1860 (New York: Oxford University Press, 1999), p.114.

 

It would be fairly straightforward to represent the essence of such a citation in micro-history data such that a printed form can be generated in the preferred style (e.g. EE, or CMOS) and with the regional preferences of any particular reader (see Meta-data).

 

When analytical notes are added, though, then they can get much more complicated. Consider this example:

 

Death notices, Ulster Gazette and Daily National Intelligencer, both dated 24 January 1815. Corra Bacon-Foster, "The Story of Kalorama," Records of the Columbia Historical Society (1910), 108, states Louisa left four children; three have been identified. In 1810, Charles "Cating" and a female, both over 44, were enumerated with one male and female aged 26-44; one male and female aged 16-25; and one male under 10 - suggesting that George, Louisa, and their first son may have been living in the Catton household. See 1810 U.S. census, Ulster County, New York, New Paltz, p. 116, line 6; NA micropublication M252, roll 37.

 

What this is doing is effectively wrapping one or more simple citations in some commentary by the current author in order to create what might be called a complex citation.

 

Another requirement of narrative authors is discursive notes, which may or may not include any citations at all. These are simply some text that has been taken out of line; a digression.

 

In a printed publication, there may be a superscript, or other indicator, at the point where it is relevant, followed by a footnote, endnote, or tablenote containing that text. When viewed on an interactive display then those indicators may be selectable. For generalised notes in purely electronic documents (e.g. in a browser), the footnote/endnote concepts might not be used, and the text may be popped-up when some word or indicator is selected or hovered-over. The point being made is that structured narrative can be used to create generalised notes, without having to presume a particular mechanism or style.

 

Although it is possible to generate citations of different style, and for different locales, from the discrete citation-element values, there are many complications in the real world. A citation sentence may contain different layers describing the provenance of the source and its information, or it may contain analytical notes. A reference note may contain multiple citation sentences; a tour of these scenarios was covered in Cite Seeing. Subsequent references to the same source would typically use a shortened form of the associated reference note, or the author may have employed an explicit hereinafter-cited-as term, or the Latin abbreviation Ibid. A footnote may have woven two source references into the same piece of text. Certain parts of a citation may not have been available (e.g. an undated document), or may have been erroneous, and so the citation would need to override any simple template-like formatting. In effect, authors of narrative work are loath to delegate generation of their citations to a piece of software working blindly from a set of data values. It is therefore necessary to support hand-crafted forms, and change the focus of citation-elements to that of correlation and interrogation rather than formatting.

 

7      Meta-data

In the context of micro-history data, meta-data is data about data. Meta-data is an important concept because it allows data to be processed or utilised appropriately. In a computer context, it allows software to understand (in a primitive sense) what data it is dealing with.

 

STEMMA’s structured narrative provides a form of meta-data in the semantic mark-up it uses to embed references to Persons, Places, Animals, Events, etc. Without that meta-data, all the narrative would contain is the name or textual description of the same entities.

 

The Semantic Web movement strives to supplement the currently unstructured Web content with meta-data. The idea is that this will allow computers to search, correlate, and combine information more easily. This will therefore be an important part of micro-history representation in online content. It doesn’t necessarily mean that a standard data model needs to incorporate its RDF tags since it would be wrong to tie it to such a specific technology. However, it does mean that such a data model must provide for meta-data, and that a physical serialisation format (as with file formats) that was derived from the data model for the Semantic Web could use RDF tags.

 

Another situation where meta-data comes up, and is still hotly debated, is citation-elements. These are the elements of data that would constitute a computer-readable citation, e.g. author(s), title, publisher, etc. A computer-readable citation differs from a printed citation in that factors such as style (e.g. EE, CMOS) and regional settings are removed, and can be reapplied later for the context of a specific end-user.

 

At its most simple, you might imagine such a citation to be represented by a number of discrete XML elements such as:

 

<CitationReference>
    <Elements>
        <Author> name </Author>
        <Title> title </Title>
        …etc…
    </Elements>
</CitationReference>
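To illustrate how style might be reapplied to such discrete elements, the following Python sketch formats the Hemphill citation quoted under Citations. The Place, Publisher, Year, and Page element names are hypothetical additions, and the two templates are simple illustrations rather than faithful EE or CMOS rules.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# Element names other than Author and Title are assumptions for this sketch.
CITATION = """<CitationReference>
  <Elements>
    <Author>C. Dallett Hemphill</Author>
    <Title>Bowing to Necessities: A History of Manners in America, 1620-1860</Title>
    <Place>New York</Place>
    <Publisher>Oxford University Press</Publisher>
    <Year>1999</Year>
    <Page>114</Page>
  </Elements>
</CitationReference>"""

# Illustrative templates only; real styles have many more rules and exceptions.
TEMPLATES = {
    "reference_note": "{Author}, {Title} ({Place}: {Publisher}, {Year}), p.{Page}.",
    "source_list": "{Author}. {Title}. {Place}: {Publisher}, {Year}.",
}


def read_elements(citation_xml: str) -> dict[str, str]:
    """Collect the citation-elements as a tag -> value mapping."""
    root = ET.fromstring(citation_xml)
    return {el.tag: (el.text or "").strip() for el in root.find("Elements")}


def format_citation(citation_xml: str, style: str) -> str:
    """Apply one of the illustrative templates to the citation-elements."""
    return TEMPLATES[style].format(**read_elements(citation_xml))
```

A template of this kind fails as soon as an element is missing or a note needs commentary, which is consistent with the earlier argument for supporting hand-crafted forms.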

 

Irrespective of whether each element has a specific tag (e.g. <Author>) or a qualified generic tag (e.g. <Element Name="Author">), it still basically identifies the datum rather than the nature of the datum. There has been some considerable discussion on BetterGEDCOM about whether these tags should follow the Dublin Core scheme and have shared semantics. Dublin Core started with a vocabulary of 15 shared meta-data tags, which included things like Creator, although it has since been extended with extra tags and contextual refinements of existing ones. I have been a critic of this for citation-elements since it’s analogous to having relational databases share a common set of column names, each with fixed semantics. Also, the citation-element tags are not themselves meta-data tags, as already mentioned. If the citation-elements were part of a Semantic Web contribution then they would be annotated with RDF meta-data tags indicating their nature; it would not be deduced from the XML data tags. There may be other schemes, too, but they would be part of the corresponding physical serialisation format.

 

A related discussion of the importance of meta-data may be found at Technophoo, have no fear, although this does not differentiate between the formal data and meta-data concepts for citation-elements. Hence, although Dublin Core (which is now represented by ISO 15836:2009) is a viable standard, it should be acknowledged that it is a standard for meta-data tags rather than data tags.

8      Copyright

Some of the things that cannot be copyrighted include facts and raw unarranged data[1]. From this perspective, mere details transcribed from vital records cannot be the subject of copyright.

 

Even the building of family pedigree charts showing marriages between people and linking their respective offspring is little more than a re-arrangement of facts that can be looked up by someone else. Although that linking of Person entities according to their biological lineage constitutes part of a conclusion model, and may have been the product of some research, the expression embodies nothing that can be copyrighted in any practical way.

 

This means that there is no legal impediment to online collaborative trees that include nothing more than facts and conclusions of the pedigree variety. Unfortunately, we all know where that leads. Online trees with no citations, no reasoning, no evidence supporting their conclusions, and no attribution, are “ten-a-penny” (or “a dime-a-dozen”). These trees are replicated just as easily as they’re created and that compounds the problem to a point where it almost kills serious genealogy. It has even been likened to a virus by blogger Ben Sayer.

 

Once micro-history data makes use of structured narrative then it starts to become a creative work: a work of academic research that includes reasoning, conclusions, and opinions. Such a work is automatically protected by copyright by virtue of the Berne Convention.

 

Such data might be published under a Creative Commons licence but that only avoids the legal issues. The fact that those reliable and thoroughly-researched contributions will have taken someone a long time to produce (possibly a life’s work) means that they will be understandably less-inclined to just share it with everyone, especially if some of the weaker researchers would simply pass it off as their own.

 

Is this an argument against collaborative trees? No, it’s simply an indication that the current naïve approach is demonstrably wrong, and will lead to further issues when we include narrative.

 

A case is made under Evidence and Conclusion for distinguishing three types of data rather than the two implied by this name; the third being all those parts that justify, or prove, the conclusions. Having three distinct parts gives greater flexibility for handling copyright and sharing issues. Also, What to Share, and How - Part II presents an alternative model that incorporates narrative in such a way as to provide automatic attribution, reduce the need for copying, and avoid edit wars when there are differences of opinion on a shared tree.



® STEMMA is a registered trademark of Tony Proctor.

[1] This is the case in the US, but a sweat of the brow concept still exists in Europe and that has paved the way for database rights. See the “Analysis” section of A Copyright Casualty — Part II.