Recording Evidence

A number of features are required to correctly record source information in a transcription. This section illustrates how STEMMA deals with them.

 

 

Some of these terms and concepts may be found in Editorial Methods for Journals, volume 1, and The Conventions of Textual Treatment, chapter five. For other attempts at audio transcription, see http://clu.uni.no/icame/manuals/WSC/MARKCONV.HTM and https://www.univie.ac.at/voice/documents/VOICE_mark-up_conventions_v2-1.pdf.

 

Traditional editorial notations for uncertain characters are not well suited to digital text because they do not support efficient and accurate searching within the limits of what is known. TEI has elements such as <choose> and <unclear>, and a comprehensive formalised notation may be found at http://igenie.org under Transcriptions. Although less comprehensive, perhaps the most compact is the UCF (Uncertain Character Format) devised by FreeUKGEN. It is based on the regex pattern-matching language, although it must be remembered that the notation appears within target strings rather than search strings. Regex, in turn, is an extension of traditional wildcard characters[1]. UCF is the basis of the notation used within STEMMA, and the following table is from the FreeBMD pages:

 

 

_ (Underscore)

A single uncertain character. It could be anything but is definitely one character. It can be repeated for each uncertain character.

* (Asterisk)

Several adjacent uncertain characters. A single * is used when there are 1 or more adjacent uncertain characters. It is not used immediately before or after a _ or another *. Note: If it is clear there is a space, then * * is used to represent 2 words, neither of which can be read.

[abc]

A single character that could be any one of the contained characters and only those characters. There must be at least two characters between the brackets. For example, [79] would mean either a 7 or a 9, whereas [C_] would mean a C or possibly some other character.

{min,max}

Repeat count - the preceding character occurs somewhere between min and max times. max may be omitted, meaning there is no upper limit. So _{1,} would be equivalent to *, and _{0,1} means that it is unclear if there is any character.

 

UCF also defines a ‘?’ character for the situation where all of the characters have been read but the word itself remains uncertain, e.g. “RACHARD?”. This is not used within STEMMA because it would be ambiguous with ‘?’ representing an absent value; the equivalent feature is supported by <Alt> mark-up.

 

Some examples:

 

[lt]        Can't tell if it's an l or a t.

___         Three unreadable characters.

[x_]        I think the character is an ‘x’, but it could be something else.

_{2,3}      Two or three unreadable characters.

*           Unknown number of unreadable characters (at least one).

_{0,1}      Not sure if there's a letter or an ink blob.
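Because UCF lives in the target string rather than the search string, a search tool has to treat each stored transcription as the pattern and the user's search term as the candidate reading. The following Python fragment is a minimal sketch of that idea, not part of STEMMA or FreeBMD: the function names ucf_to_regex and ucf_matches are hypothetical, the input is assumed to be well-formed UCF, and ‘*’ is assumed to be allowed to match any character, including a space.

```python
import re

def ucf_to_regex(ucf: str) -> re.Pattern:
    """Translate a well-formed UCF string into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(ucf):
        ch = ucf[i]
        if ch == "_":                      # exactly one uncertain character
            parts.append(".")
            i += 1
        elif ch == "*":                    # one or more uncertain characters
            parts.append(".+")
            i += 1
        elif ch == "[":                    # one character from a set
            end = ucf.index("]", i)
            members = ucf[i + 1:end]
            if "_" in members:             # e.g. [C_]: C or possibly anything else
                parts.append(".")
            else:
                parts.append("[" + re.escape(members) + "]")
            i = end + 1
        elif ch == "{":                    # repeat count {min,max} or {min,}
            end = ucf.index("}", i)
            token = ucf[i:end + 1]
            if not re.fullmatch(r"\{\d+,\d*\}", token):
                raise ValueError("malformed repeat count: " + token)
            parts.append(token)            # same syntax as a regex quantifier
            i = end + 1
        else:                              # an ordinary, confidently read character
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts), re.DOTALL)

def ucf_matches(ucf: str, candidate: str) -> bool:
    """True if the candidate reading is consistent with the UCF transcription."""
    return ucf_to_regex(ucf).fullmatch(candidate) is not None
```

With these definitions, ucf_matches("_12[68]", "1126") is true while ucf_matches("_12[68]", "1127") is false, since the last digit can only be a 6 or an 8.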

 

Early STEMMA designs considered using an ANSI escape sequence to bracket a set of UCF characters. For instance, <APC>_12[68]<ST> where APC=0x9F and ST=0x9C. This was partly to avoid unconditionally reserving a whole set of characters but also to allow them in attribute values as well as element data. The current version accommodates them in a <Ucf> element:

 

<Ucf> ucf-sequence </Ucf>
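As an illustration of how software might pick out these elements, the following Python sketch collects the UCF sequences from a small XML fragment. It is only a sketch under stated assumptions: the enclosing element name Text, the sample sentence, and the function name ucf_sequences are hypothetical, and trimming the whitespace around each sequence is an assumption rather than a documented STEMMA rule.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only the <Ucf> element itself comes from the text above.
fragment = (
    "<Text>Baptised in <Ucf>_12[68]</Ucf> "
    "at Sm<Ucf>[iy]</Ucf>thfield</Text>"
)

def ucf_sequences(xml_text: str) -> list[str]:
    """Return the UCF sequence held in each <Ucf> element, in document order."""
    root = ET.fromstring(xml_text)
    # Assumption: whitespace surrounding a sequence is not significant.
    return [(el.text or "").strip() for el in root.iter("Ucf")]

sequences = ucf_sequences(fragment)
```

Keeping the uncertain characters inside an element means that ‘_’, ‘*’, ‘[’ and ‘{’ outside a <Ucf> element can still be read as ordinary characters.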



[1] Wildcard characters represent variable sequences. There are several schemes but most allocate a single character to represent 0-or-more unknown characters (e.g. ‘*’) and another to represent exactly one unknown character (e.g. ‘?’). These may be combined so that, for instance, ‘?*’ represents 1-or-more unknown characters. Note that since ‘*?’ ≡ ‘?*’ and ‘**’ ≡ ‘*’ then any contiguous sequence of ‘*’ and ‘?’ can be simplified to just [?...][*], i.e. 0-or-more ‘?’ followed by an optional ‘*’.
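The simplification described in this footnote can be sketched in Python as follows. The function names are hypothetical, and the code assumes the common scheme in which ‘?’ stands for exactly one unknown character and ‘*’ for 0-or-more.

```python
import re

def simplify_wildcards(run: str) -> str:
    """Canonicalise a contiguous run of '?' and '*' to 0-or-more '?' then an optional '*'."""
    if set(run) - {"?", "*"}:
        raise ValueError("expected only '?' and '*'")
    # Each '?' demands one character; any number of '*' collapses to a single '*'.
    return "?" * run.count("?") + ("*" if "*" in run else "")

def simplify_pattern(pattern: str) -> str:
    """Apply the same simplification to every wildcard run inside a longer pattern."""
    return re.sub(r"[?*]+", lambda m: simplify_wildcards(m.group(0)), pattern)
```

For example, simplify_wildcards("*?*?") gives "??*", meaning two or more unknown characters.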