3. ENCODING

3.1 SGML Tags

SGML (Standard Generalized Markup Language) is a much more complex protocol than I am able to describe here. I touch only on what is absolutely necessary to a scholar who is starting to tag texts. ISO 8879 (published October 15, 1986) is the definitive technical description of SGML. It and two supporting ISO standards (9069-70) are edited and annotated by Goldfarb (pp. 218-593). The TEI Guidelines (P3), edited by Michael Sperberg-McQueen and Lou Burnard, provide a well-thought-out implementation of SGML for literary texts.

In general, SGML asks authors and editors to abandon "procedural markup," that is, low-level instructions about layout, typeface, etc., in favour of "generalized markup" (definition of textual "objects," their descriptive content, and their mutual relations). For each kind of textual "object," one must invent a descriptive name and decide how these objects logically relate to one another.

Thus an italicized phrase should not be tagged <f type="italics">. The editor instead should tag that string for the perceived function of the italics, whether emphasis (<emp>), or high-lighting (<hi>), etc. Applying logical markup of this sort is not a problem for an author who wishes her meaning to be unambiguous, but it is difficult for editors who are hard pressed to interpret italics one way as opposed to another. Goldfarb (p. 17) explains that SGML is intended to eliminate ambiguity in texts.

  While procedural markup ... leaves a document as a character string that has no form other than that which can be deduced from analysis of the document's meaning, generalized markup reduces a document to a regular expression in a known grammar.

In fact, SGML practitioners see a text as a non-physical "logical construct" quite distinct from the verbal data, the words themselves. Yet in the humanities there is no standard way of interpreting texts or their features. All encoding arises from interpretation, and all interpretations are subject to debate.

For this reason, SGML must be applied carefully to Renaissance texts.

SGML tags may be said to have eight main characteristics.



  1. Names. An SGML tag normally consists at least of an element name within angle bracket delimiters. The name is chosen to describe the text it precedes. Names may start only with a letter but may contain up to eight letters, digits, periods, and hyphens (Goldfarb, p. 33).

  2. Entities. SGML allows for special tags called entities, which function as string substitutions for what is in the text. The form of an entity is always an ampersand (&), followed by the entity's name, and closed by a semicolon (;). For our purposes, entities may help us to implement a Renaissance English character set within the constraints of the ASCII version of the Roman alphabet. For example, &ctlig; might be employed to stand for a ct ligature. See Goldfarb, pp. 23-24.

  3. Explicit Tag Spans. SGML tags may have an indefinite span, prevailing until they are replaced by another tag of the same kind (e.g., <page.break>), but they also may take a closing tag, normally the opening tag with a virgule -- / -- preceding the tag variable. Thus the SGML tag <col> would be concluded by </col>. This is so for the following reason.

  4. Text as Tag Value. When SGML tags of this kind surround text, this text itself (not an editorial word or phrase inside the tag) becomes the tag value or token. Thus the title page of Randle Cotgrave's dictionary of 1611 states "Compiled by RANDLE COTGRAVE." This might be tagged in SGML as "Compiled by <author> RANDLE COTGRAVE </author>."

  5. Attributes. Normally SGML tags take attributes inside their delimiters. For instance, the <col> tag could have the attribute n="". The SGML equivalent to the COCOA tag <page 56> might be <page n="56">. An attribute name should be no longer than eight characters and should be linked to its following value by an equals sign (Goldfarb, p. 33). The value must be in double quotation marks.

  6. Explicit Variable Names. SGML tags do not have an abbreviated form, where the delimiters stand for the tag itself. The variable name must always be present.

  7. Tag Hierarchy. SGML tags may fall into nested hierarchies. They may be employed to record the structure of a text. For example, whereas in COCOA such elements as book, canto, stanza, and line are encoded by separate tags not formally related to one another, in SGML their relationships may be declared in a Document Type Definition (DTD). In this way SGML software, such as a parser or browser, can understand that books contain cantos, cantos contain stanzas, and stanzas contain lines. COCOA-encoded texts often have comparable command files (e.g., TACT's .MKS file), but these indicate hierarchy only implicitly: in counters, which may be reset to their initial condition once another counter occurs, and in x-type tags (see below).

  8. Cross-references. SGML tags may include attributes that are cross-references to other tags or other files. This capability enables SGML encoding to attach footnotes, marginalia, and textual variants to a piece of text or to associate a cross-reference in a text (e.g., "See last fig.") with a specific place in the text.
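
Two of these characteristics, entities and tag hierarchy, can be illustrated with a short program. What follows is only a sketch in Python, not an SGML parser: it ignores attributes, comments, and the DTD syntax itself. The table ENTITIES, the table CONTAINS, and the functions expand_entities and check_nesting are hypothetical names invented for this illustration. Only the &ctlig; entity and the book, canto, stanza, and line elements come from the discussion above; the choice to expand &ctlig; to the plain letters "ct" is an assumption, as is the reading of the eight-character limit as applying to the whole name.

```python
import re

# Hypothetical entity table. Only the name "ctlig" comes from the text;
# expanding it to the plain letters "ct" is an assumption for illustration.
ENTITIES = {"ctlig": "ct"}

# Hypothetical containment rules, standing in for what a DTD would declare:
# books contain cantos, cantos contain stanzas, stanzas contain lines.
CONTAINS = {
    "book": {"canto"},
    "canto": {"stanza"},
    "stanza": {"line"},
    "line": set(),
}

# A name starts with a letter and is read here as at most eight characters
# in all, drawn from letters, digits, periods, and hyphens.
NAME = r"[A-Za-z][A-Za-z0-9.\-]{0,7}"
TAG = re.compile(r"<(/?)(" + NAME + r")>")   # <name> or </name>, no attributes
ENTITY = re.compile(r"&(" + NAME + r");")    # &name;


def expand_entities(text, table=ENTITIES):
    """Replace each known &name; reference; leave unknown ones untouched."""
    return ENTITY.sub(lambda m: table.get(m.group(1), m.group(0)), text)


def check_nesting(text, rules=CONTAINS):
    """Return True if every tag is known, sits inside a permitted parent,
    and is closed in the reverse of the order in which it was opened."""
    stack = []
    for m in TAG.finditer(text):
        closing, name = m.group(1) == "/", m.group(2)
        if closing:
            # A closing tag must match the most recently opened element.
            if not stack or stack.pop() != name:
                return False
        else:
            if name not in rules:
                return False
            # Any known element may stand at the top level in this sketch.
            if stack and name not in rules[stack[-1]]:
                return False
            stack.append(name)
    return not stack  # nothing may be left open at the end


sample = ("<book><canto><stanza><line>A perfe&ctlig; line</line>"
          "</stanza></canto></book>")

if __name__ == "__main__":
    print(expand_entities(sample))
    print(check_nesting(sample))
    print(check_nesting("<book><line>misplaced</line></book>"))
```

The sample passes the nesting check because each element appears only inside the parent that the table permits; the last call fails because a line is placed directly inside a book. A real DTD expresses far more than this table does (order, optionality, repetition), so the sketch shows the principle of declared hierarchy rather than SGML validation.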

Both COCOA and SGML are independent of specific software. For example, a text encoded in either may be indexed by WordCruncher, which has its own encoding style. However, because COCOA tagging has no clearly defined syntax, whereas SGML is an international standard, SGML is preferred for interchange purposes. Note that conforming to SGML does not require an editor to use any specific tags or Document Type Definition. Editors can produce SGML-conformant texts that cannot be handled by the TEI Document Type Definition. HTML, for instance, a set of tags and a DTD employed on the World-Wide Web, is a legitimate, if simple, non-TEI encoding method.