2 Evidence and Information
3 Proof and Conclusion
4 Mark-up of Data
4.1 Shallow Semantics
4.2 Deep Semantics
4.3 Example Cases
5 Link Analysis
6 Citations and Attribution
7 Conclusion Sharing
8 Persona
9 Disjoint Persons
This paper examines the relationship between the types of data sometimes referred to as Evidence and Conclusion (or E&C) and their impact on the design of STEMMA®. When employed as a categorisation of our data, these terms may be used inaccurately. Source information on its own is just that; it’s only in the context of some proposition or claim that it can become evidence for or against something. In other words, evidence is what we think some particular source information means.
Some people prefer to talk of records-based data (i.e. that obtained from historical records) in order to differentiate from conclusions. Just to blur the distinction, though, some information sources may be conclusions or inferences themselves, formed by other authors; citing them doesn’t make them facts. I will continue to use the term source information to refer to anything that was used or consulted during our research.
To some people, even the term conclusion is not sufficiently accurate. A conclusion suggests something final — as in a concluding opinion — and has different semantics to, say, a conjecture or hypothesis, both of which still occur during the research process. A conclusion can never be considered final since there’s always a chance of some contradictory evidence coming to light.
A conclusion or inference should be accompanied by some reasoning or rationale, albeit technically separate from that reasoned result. We will see later why distinguishing these two parts is beneficial. If the evidence is direct and non-conflicting then a conclusion may be accompanied by a simple proof summary (usually just a list of citations), or proof statement, but in more complex cases a written proof argument is necessary.
When analysing complex evidence — sometimes referred to as inferential genealogy — evidence will often be indirect, or negative. Assimilating information from multiple sources, correlating it, and resolving conflicts, is a necessary part of such research. Note that the page Inferential Genealogy describes it as “...how family historians can accurately deduce ancestors’ identities and many aspects of their lives by digging below surface information…”, and the page Complex Evidence suggests that “A genealogist’s goal is to establish identity and prove relationships; complex evidence is the ONLY way to do this”. While I agree with the principles of such research, I do not agree with this focus on identities and relationships; if genealogy is about identities and lineage then family history is about their lives, and micro-history is about any type of localised history. See What is Genealogy?
Much of genealogy is currently focused on the information-evidence-conclusion distinction, and this focus implies that all research is goal-directed. It precludes the bottom-up approach that I’ve described as Source Mining, where all information relating to a life, family, or other historical context can be pulled together to create a story; an approach that might be used in producing a biography, for instance.
Genealogy is often hampered by imprecise usage of terminology in the general community, mainly between software and non-software circles, each of which has its own precise definitions that may be unknown to, or in conflict with, those of the other. The different usage of similar terms in science and law may add to the confusion. I will try to use established terms where possible, and be clear when they might be ambiguous.
Several previous E&C proposals and discussions were to be found on the old BetterGEDCOM wiki, e.g. Defining E&C for BetterGEDCOM, although that site is now being run down.
Some related articles may be found at: Proof of the Pudding, Is That a Fact?, Evidence and Where to Stick It.
2 Evidence and Information

Evidence Explained (EE) attaches the following definitions to common terms in its Evidence Analysis Process Map:
Source — source quality:
Original Source is “material in its first oral or recorded form”.
Derivative Source is “material produced by copying an original or manipulating its content”. This includes translations, abstracts, etc. Image copies, recorded copies, and duplicate originals are often treated as per their originals.
Authored Work is “a hybrid of both original and derivative materials produced by writers who study many different sources, reach personal conclusions, and present a new and original piece of writing”.
Information — information reliability:
Primary Information is “details provided by someone with first-hand knowledge of the information reported, such as a participant in an event or an eye witness”.
Secondary Information is “details provided by someone with second-hand or most-distant knowledge of the person, event, or situation. This includes hearsay, tradition, and legend”.
Evidence — evidence applicability:
Direct Evidence is “relevant information that states an answer to a specific research question or appears to solve a research problem all by itself”.
Indirect Evidence is “relevant information that does not answer the research question all by itself. Rather, it has to be combined with other information to arrive at an answer to the research question”. Also called circumstantial evidence.
Negative Evidence is “absence of evidence one would expect to find”; not to be confused with a negative search, which may simply indicate that the required records are not online, or held elsewhere.
Preponderance of the evidence is a legal term that used to be employed in the context of a genealogical assessment. It can easily be shown to mislead since it implicitly assumes that all sources of evidence are independently countable. A sample case that demonstrates the fallacy is that of an immigrant who arrives in a country at a very young age. They may not have known their true date of birth, but may have used a reasonable guess throughout their life. Hence, no matter how many sources quote the same figure, it doesn’t make it any more factual.
As well as a data model characterising the source, the information therein, and the applicability of evidence, along the lines of the EE guidelines, there should also be a way of characterising the credibility of the author, compiler, or reporter. For instance: Expert, Trusted, or Biased.
In a digital world, information needs to be a searchable but faithful copy of the associated text before it can be fully used to derive evidence, and that rules out plain images in separate files! Even having diplomatic transcriptions in separate text or word-processor files is not an ideal situation since it relies totally on the user to read, assimilate, and analyse them — without software being able to help at all.
Since the majority of people who actually do create transcriptions will be using separate files, this needs some explanation. Imagine that you had a number of document copies laid out on your desk. Imagine that you had read them and ringed certain important details, or items that you believed may be useful later. Now imagine that you could connect those ringed items with lines and add notes as you built up an historical picture or a proof argument.
What I’m describing is more correctly known as Link Analysis. It is a form of Graphic Organisation that allows data relationships to be analysed visually. Link analysis has been employed in such fields as anti-terrorism, fraud analysis, medical diagnosis, and crime evidence analysis. The STEMMA data model supports this but without mandating a particular methodology, or any particular software product. The result is merely a capturing of those data relationships and the logic used in correlating them. More than that, though, it provides a trail which can be followed from conclusions (such as a date, biological relationship, etc) back to that logic, to the underlying information, and to the supporting sources — a process known as drill-down. A consequence of this is that citations are created much earlier, are more precise, and are directly associated with the source analysis and transcribed data.
The article at Source Mining describes a bottom-up approach to general historical research that makes these same points, and Link Analysis is a prerequisite for supporting that approach.
3 Proof and Conclusion

The term “E&C” is a common reference to “Evidence and Conclusion” as the two main parts of our data, but we’ve already seen that there isn’t a simple binary choice of concepts to deal with. Evidence comes from information which comes from sources, but it’s only meaningful when supplemented with reasoning in the context of a statement or claim.
Typical family trees, especially online, deal with conclusions; they describe people that are believed to have existed, and the relationships between them, based on the available source information. Most do not include citations for those supporting sources, and virtually none include any logic or proof argument.
An interesting point is that conclusions are easily represented in software data models, and they will usually employ precise taxonomies/ontologies to characterise data (such as a date-of-birth, or a biological father), or equivalent structures (such as a tree). In effect, these conclusions are designed to be read by software in order to populate a database or to graphically depict biological lineage. Source information, on the other hand, cannot be categorised to that extent, and it has to be humanly-readable so that it can be assimilated and analysed by a human. For instance, if a source indicates that a relationship was that of “uncle” or “cousin” then it might actually mean one of many different things, and research is needed to determine what the real relationship was.
There have been a number of attempts to represent the logical analysis of source information using wholly computerised elements (see FHISO papers received, Research Process, Evidence & GPS, and the GenTech data model — ASSERTION concept), but these are too far removed from handling real text. They also lose the possibility of drilling-down from a conclusion to get back to the written human analysis, and to the underlying information fragments in the sources. While allowing analytic notes to be added to information fragments is a fairly obvious requirement, connecting notes and concepts together to build structure needs written human explanation, not “logic gates” and other such notions. Maybe one reason for the overtly computerised approaches is that software designers feel an onus on them to support “proof” in the mathematical sense rather than the genealogical sense; a result of misunderstanding the terminology (see Proof of the Pudding).
The STEMMA model separates information from conclusion, and tries to build the latter by linking it to the former through the associated reasoning. Note, however, that this is merely a representative model, and it may be used to whatever degree by a hosting product. The following is a quotation from under Musings on Standardisation:
The standard should be as applicable to an experienced or professional user as to a naïve user who just collects names, dates, and places. Hence, it should not stipulate nor mandate any formal process, and it should be able to represent all data without bias or presumption about the process used to obtain it.
What it means is that a standard data model should merely differentiate the types of data, and not be tied to a specific research or analysis methodology; that is the prerogative of individual software products. It doesn’t mean that there shouldn’t be an accepted standard of proof, but note that the Genealogical Proof Standard (GPS), from the Board for Certification of Genealogists (BCG), is just that, and is not a prescribed methodology.
4 Mark-up of Data

There are a number of initiatives concerned with marking-up Web-based data by attaching semantic tags to it. For instance, schema.org, TEI, RDF, and historical-data.org.
A major issue in this field is distinguishing objective information from subjective conclusions. We (and any search engine) must always be aware of what is visible in the original data and what is added on top of that.
There are two levels of semantics and the conventional terminology does not account for the distinction. Rather than try to differentiate the semantic and descriptive types of mark-up, which are widely regarded as equivalent, I’ll introduce two new terms: shallow semantics and deep semantics.
STEMMA strives to distinguish these at all times, and this has had a profound influence on the design of its mark-up for structured narrative and the design for its custom properties. An indication of how it handles transcription anomalies may be found at: Recording Evidence.
See also: Semantic Tagging of Historical Data.
4.1 Shallow Semantics

When a datum is a personal name or a place name, such mark-up identifies the text as simply representing the name of a person or a place. What it does not do is attempt to identify the actual person or place.
This type of mark-up is therefore very useful for transcription of source information rather than for representing conclusions. It allows several different identifications to be supported from the same source fragment and so would be applicable to shared content, such as newspaper transcriptions.
4.2 Deep Semantics

This type of semantic information identifies the actual person or place associated with the name reference, and so it’s more appropriate for representing conclusions. This type of mark-up would also be applicable to authored material where there was no prior name reference, but one is required to be generated with the associated semantics attached.
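To illustrate the difference, here is a sketch in the style of STEMMA’s mark-up; the PersonRef element name is hypothetical and used purely for illustration, although the Key attribute follows the convention seen in the Property examples later on. Shallow mark-up merely flags the text as a person reference; deep mark-up additionally links that reference to an identified Person entity.

```xml
<!-- Shallow semantics: the text is flagged as a person reference,
     but no attempt is made to identify which person it refers to -->
<PersonRef> John Smith </PersonRef>

<!-- Deep semantics: the same reference is additionally linked, via a
     conclusion, to a specific Person entity in the data set -->
<PersonRef Key='pJohnSmith1834'> John Smith </PersonRef>
```

The shallow form can be shared verbatim as part of a transcription; the deep form embodies a subjective identification and so belongs with the researcher’s conclusions.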
4.3 Example Cases

This section presents some real cases where the difference between the original and the interpretation is very important.
An article that simply says "John Smith was fined 10s for loitering" obviously has a person reference in it, but identifying which John Smith is subjective and requires context from elsewhere. In the case of micro-history data, the representation of the ‘John Smith’ person may involve many useful details, including alternative names, and so making a conclusion-link is powerful but still different from the objective information. A reference to “grandmother” in a transcribed letter is an even better illustration; without knowledge of the author the identification cannot be made, and yet it is still a person reference.
I have some text with a reference to a place called "Bendigo's Ring". OK, so I know it's a place reference and schema.org, say, could mark it as such. That would allow me to search only on that particular text. However, I happen to have local knowledge of the area (North Nottingham, England) and so the surrounding context of the text allows me to identify it as the colloquial name for an actual place: a small hill now called Sunrise Hill on the maps. However, it’s also mistakenly applied to a copse of trees on a nearby hill called Glade Hill. Hence, using a different type of mark-up that not only identifies it as a place reference but connects it to both actual places means my searches are then richer, as it can be found by multiple names, or by the surrounding geographical context.
An occupation of “charrer” might be someone who burns wood in the making of barrels, or it could be a misspelled version of “charer”, as in charwoman. Knowledge of the sex, age, and previous occupations of the person may hazard an educated guess but that would be a conclusion that’s not evident in the term itself.
A written date may be ambiguous, may have uncertain characters, or it may rely on context outside of that section of text. For instance, a reference of "Last Friday" could be given an interpretation, and an equivalent machine-readable value, but only if someone includes some reasoning and some extra context (e.g. a date from a letterhead, or a newspaper publication date). It's therefore a conclusion and not a verbatim reflection of the written date.
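A sketch of how such an interpretation might sit alongside the verbatim text (the element and attribute names are hypothetical, and the dates are invented for illustration):

```xml
<!-- The verbatim words are preserved; the machine-readable Value is a
     conclusion, not a reflection of what was written -->
<DateRef Value='1897-06-11'> Last Friday </DateRef>
<!-- Reasoning: the letterhead is dated 16 Jun 1897, a Wednesday, so the
     preceding Friday was 11 Jun 1897 -->
```

Note that the reasoning travels with the interpreted value; without it, a later reader cannot tell how the machine-readable date was derived.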
A date in, say, a Hindu calendar must be recorded "as is". Converting it to the Gregorian calendar is not a true reflection of the evidence. There is no agreed "epoch" between parts of some calendars and so a clean, algorithmic conversion is not always possible. At best, it may introduce an error margin that wasn’t in the written form of the date. See: A Calendar for Your Date — Part I and A Calendar for Your Date — Part II.
5 Link Analysis

A Link Chart, or Link Diagram, provides a way of visualising complex information and the relationships between specific parts. The core components in STEMMA’s model for supporting this are the Source and Matrix entities.
The Source entity connects the Citation and Resource entities relevant to a given source and supports preliminary analysis of the information therein, currently via transcriptions but also via images in principle. The Matrix entity supports cross-source analysis.
Assimilation of a given source is typically undertaken once, when you acquire copies of the associated information. Thereafter, its Source entity can be associated with different Matrix entities, each focusing on a particular person, place, family, or discrete research problem.
The small coloured items in each Source may represent references to persons (see Persona), places, animals, groups, events, dates, or any word/phrase that is considered important. These would be connected to each other to represent data relationships or deductive steps (written in English rather than computer-speak) until you have an overall view of the source context. The Matrix entity has very similar items that may correlate those from selected Source entities.
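The following fragment is a purely illustrative sketch, not actual STEMMA syntax: it suggests how ringed items within a Source, and the correlations drawn between Sources in a Matrix, might be captured, with the connecting reasoning kept as written English. All element names, keys, and details here are invented.

```xml
<!-- Items of interest ringed within one source (hypothetical mark-up) -->
<Source Key='s1851Census'>
  <Ref Id='r1' Type='Person'> William Elliott </Ref>
  <Ref Id='r2' Type='Place'> Uttoxeter </Ref>
</Source>

<!-- Cross-source correlation; the reasoning is plain English, not logic gates -->
<Matrix Key='mElliott'>
  <Link From='s1851Census:r1' To='sBaptism:r5'>
    Same name, and an age consistent with the baptism date; the census
    birthplace also matches the parish of the baptism record.
  </Link>
</Matrix>
```

The written reasoning inside each link is what later supports drill-down from a conclusion back to the underlying information fragments.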
6 Citations and Attribution

When working with evidence, we need to say where our information came from. This often involves citing published works, such as books and articles, although it’s also true for unpublished works, online works, and artefacts. Ideally, we might also indicate where imported parts of a family tree came from, or where oral evidence came from, or letters, or emails, etc.
The terms citation and attribution are sometimes confused although there is a clear difference. A citation references a prior work or source of information, whereas attribution gives appropriate credit to individuals.
A citation has a number of purposes: intellectual honesty (not claiming prior work as your own), to allow your sources to be independently assessed by the reader, and to allow the strength of your information sources to be assessed; attribution, though, is more about provenance and credit.
In legal terms, attribution has a specific meaning in the context of a copyrighted work, or even a registered trademark, where it acknowledges the owner of that work or mark. We use the term here in its more generic context of ‘giving credit where credit is due’. This includes where help has been given by another researcher, or permission to reproduce an image or extract has been given to you. Sometimes, they are both required as they have different purposes.
STEMMA generalises the concept of a citation so that the same Citation entity can reference a specific source, particular information within it, or the location and provenance of the source. This degree of generalisation means that a natural progression is to incorporate attribution through the same mechanism.
STEMMA can represent a person through a Person entity (meaning they are part of the micro-history data) or through a Contact entity (meaning they are a researcher or contributor to that data). The Param values which control source and citation references may specify either entity type, and either of the entity types may have associated ContactDetails.
The way such attribution is displayed to the user depends on the source-type URI which identifies the Citation entity.
7 Conclusion Sharing

Online trees are either collaborative (resulting in a single, global tree) or separate (user-owned). User-owned trees may be private, where a subscription is required or an invitation extended by the owner, or public, in which case they can be searched and copied for free. A number of sites now employ a model where user-owned trees are private but the site automatically looks for correlations with other trees behind the scenes. You are then told of potential connections and it is up to you to make contact with the owners of those trees.
There are several online collaborative trees currently available. There is an expectation in some quarters that this is the way forward for family historians, and that there will eventually be a single unified tree available that we can all consult. Just how realistic is that though?
As the term ‘family tree’ implies, such data is basically just representing biological lineage, and might include names, dates, and places for BMD (Birth, Marriage, & Death) events. Family history data in the more general sense is rarely put online — especially for free — but why is that?
Well, one obvious reason is that the associated database model may only accommodate the more limited lineage type of data. Even if you wanted to include citations, narrative, places, or records of other event types, then there would be no easy way to store it except, possibly, as amorphous text.
Another reason is that someone’s full family history, including all their sources, images, workings, notes, etc., may be the product of decades of expensive and time-consuming research. Why should they simply give it away for free, or even store it in an online private tree that, technically-speaking, the content provider owns and may make use of?
Content providers make historical records-based data available but that generally lacks any conclusions. What they need is for genealogical researchers to publish their conclusions online to supplement their historical collections, and the term “conclusion sharing” was coined to represent this. Even the generation of a mere family tree involves the forming of conclusions from the original data, and so those collaborative trees and shared trees are already supporting the concept of conclusion sharing. So what’s wrong with that?
Well, the lack of support for proper citations and reasoning, combined with the reluctance to share precious (in both financial and personal terms) data, means that most online trees tend to be weak and rashly put together. Much of it is copied, which then replicates inherent errors. The easier the data is to obtain, the more likely it is to get copied and re-published. Hence, the predominance of a given relationship in different trees doesn’t improve its likelihood at all; they could all be copies of the same error. This result has been likened to a virus by Ben Sayer.
If we include all the other types of family history data, especially including biographical narrative and all our reasoning, then we introduce another issue: it then amounts to a work of academic research and is, therefore, subject to automatic copyright protection (via the Berne Convention). This is discussed further under Importance of Narrative.
This has serious implications for online content that I do not see being discussed openly. “Conclusion sharing” is a prerequisite for any global or shared online tree but there are basic issues with:
The representation of the data must not only distinguish source information from conclusion but also from evidence (including their reasoning) since an author may be more willing to freely share their conclusions without their reasoning. Some previous discussion of these requirements, and possible ways forward, may be found in the pair of articles: What to Share, and How and What to Share, and How - Part II, as well as Collaboration With Tears and Collaboration Without Tears.
A very specific issue with unified trees is editorial control: if someone has written about personal memories, or their own research, then no one else should have the right to change it. This is a fundamental dichotomy in those trees where the conclusions are currently changeable by anyone. A recent article on this subject (Feeding the Trees) made a specific suggestion to the industry on how this difference in control might be accommodated; it advocated two layers: informational and conclusional, and made reference to the above-linked articles.
8 Persona

Persona is a much-debated concept in the so-called Evidence & Conclusion model of genealogy. Many threads on the subject can be found on the BetterGEDCOM wiki, such as Do we need persona?
The concept of personae exists in several models, including GenTech and, more recently, GEDCOM-X. The concept might be traced to a 1959 paper entitled Automatic Linkage of Vital Records. The origin of the term itself is uncertain but at the meeting that kicked off the GenTech model, in 1994, Tom Wetmore gave a talk entitled "Structured Flexibility in Genealogical Data" in which he stressed the need to record evidence data, and where he used the term persona in that context.
A persona is often, inaccurately, described as an evidence person, as opposed to a conclusion person, meaning that it merely represents a reference from a single source to a particular named person. As such, there is no actual birth event, no death event, and no family tree for a persona — the data associated with a persona is only that derived directly from the particular source, e.g. name, age, place of birth, occupation, etc. Similarly, there is no identification of that named person since that would be making a conclusion. On the other hand, a conclusion person (a Person entity, in STEMMA) may have an associated birth event and a lineage (parents, offspring, etc) that we can conclude from the aggregated evidence.
In a two-tier model, both of these entities would be present, and a user would create the Person by correlating the data from the multiple personae — some of which could eventually turn out to refer to different Persons. It has been suggested that this correlation can be done programmatically but I’m not convinced of this. The correlation must also involve the context of each persona (e.g. where they were in a census, who else was in the household, etc), the nature and reliability of the source, and how it relates to information from other sources, but that is a huge sphere of subjective interpretation.
STEMMA records extracted and summarised pieces of information from a source — called Properties — that are relevant to a person, or other subject entity, but these are not quite personae; the conclusion as to which subject entity they belong to has already been made, as has the interpretation of the associated values; a persona is merely the reference information from a given source.
In effect, Properties are just named data items that have been separated as a convenience and serve a wholly different function, such as providing visual cues or for loading into a database. They may be statically associated with a subject entity, or they may be time-dependent (see Time-dependent Attributes). In the time-dependent case, they are connected to the respective subject entity via a specific source contribution to an Event entity.
<Property Name='Name'> William Elliott </Property>
<Property Name='Age'> 10 </Property>
<Property Name='Occupation'> Scholar </Property>
<Property Name='BirthPlace' Key='wUttoxeter'> Staffordshire Uttoxeter </Property>
<Property Name='Relationship' Key='pTimothyElliott'> Son </Property>
Putting this information into Events allows the information to be presented by time (i.e. a timeline), or geography, or both. The Property values for the Event itself, such as the dates or place, may also be specified in the <SourceLnk> element.
Although earlier versions of the data model suggested that these sets of named values were the closest analogy it had to personae, this changed in V4.0 when the Source entity was introduced. Unlike Properties, the Source entity allows concepts to be built up from the raw source information, and that includes persona-like references to all subject types, including multi-tier versions, and to any item of information deemed relevant to the research in-hand.
For a deeper discussion, see: Genealogical Persona Non Grata.
9 Disjoint Persons

An interesting aspect of evidence and conclusion occurs when the evidence results in two distinct conclusion-persons, but you cannot prove whether they are the same or different. You may be able to flesh-out the lives of both persons but still not be able to separate or merge them.
I have experienced this myself and expended a lot of effort researching both persons. They are not directly associated with the persona issue since they are both normal conclusion-persons, and may have been arrived at by assessing independent evidence, but there might be sufficient similarity that both are required in your data.
STEMMA supports this situation by allowing all sets of Person entities to be disjoint. This means that the Persons in a STEMMA Dataset are not necessarily tied to a single, root ancestor. There may be outliers (not connected to anyone), or disjoint trees with no links between them. As well as allowing both of these similar Persons to be represented, this feature also allows the representation of completely unrelated Persons (i.e. who may not be relatives) who may still be significant to the family history. See Incidental People.
® STEMMA is a registered trademark of Tony Proctor.
Elizabeth Shown Mills, “QuickLesson 13: Classes of Evidence—Direct, Indirect & Negative”, Evidence Explained: Historical Analysis, Citation & Source Usage (https://www.evidenceexplained.com/content/quicklesson-13-classes-evidence%E2%80%94direct-indirect-negative : accessed 2 Nov 2015).
 Elizabeth Shown Mills, Evidence Explained (Baltimore: Genealogical Publishing Co., 2007).
 A BI process where selecting a summarised datum or a hierarchical field — usually with a click in a GUI tool — revealed the underlying data from which it was derived. The term was used routinely during the early 1990s when a new breed of data-driven OLAP product began to emerge.
 "An environment or material in which something develops; a surrounding medium or structure", from Oxford Dictionaries Online (http://www.oxforddictionaries.com/definition/american_english/matrix : accessed 28 Oct 2015), s.v. “matrix”, alternative 1.
 H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James, “Automatic Linkage of Vital Records”, Science, Vol. 130, No. 3381 (16 Oct 1959): 954–959.