6.10 Diacritics

Diacritics are characters that occur in a word without affecting its sorting. If the hyphen, the possessive apostrophe, and the most common accents are declared to be diacritics, words containing them will be sorted identically to word-forms that, though otherwise the same, lack those characters. There is some reason not to recognize diacritics at all. A word with a genuine hyphen and the same word without one may well be different forms. If the possessive apostrophe is also declared a diacritic, then the possessive form of a noun will fall together with its plural, and with the third-person singular of most verbs that have the same spelling. Note that RET tagging distinguishes end-of-line and censorship hyphens from compounding hyphens and dashes, and eliding apostrophes and single closing quotation marks from possessive apostrophes.
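
The effect of such a declaration on sorting can be sketched in a few lines of Python. This is only an illustration, not the behaviour of any particular text-analysis program: the name `sort_key`, the diacritic set, and the sample words are assumptions made for the example, and accents are left out for brevity.

```python
# Hypothetical diacritic set (an assumption for illustration):
# the compounding hyphen and the possessive apostrophe.
DIACRITICS = {"-", "'"}

def sort_key(word, diacritics=DIACRITICS):
    """Return the word with diacritic characters removed, lower-cased."""
    return "".join(ch for ch in word if ch not in diacritics).lower()

# Invented sample words: each pair differs only by a diacritic.
words = ["recover", "king's", "re-cover", "kings"]

# Words that differ only by a diacritic receive the same key,
# so they fall together in the sorted list.
ordered = sorted(words, key=sort_key)
print(ordered)
```

Here "king's" and "kings" receive the same key, as do "re-cover" and "recover", which is exactly the loss of distinction described above.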

TEI P3 calls these characters lexical punctuation (p. 897).

In any case, it is important to distinguish these textual diacritics from one another in the encoding, whatever status a given interpretation may or may not give them. The following codes are likely candidates for diacritic characters:

      Tag          Function

      -            compounding hyphen
      {\-}         end-of-line hyphen
      {--}         dash
      {-}          censorship hyphen
      '            possessive apostrophe
      {'}          eliding apostrophe
      {`}          single closing quotation mark
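
Because several of the codes listed above share characters, a program that reads them has to try the longer, braced codes before the bare hyphen and apostrophe. The following Python sketch shows one way to do so. The function name `find_tags` and the sample string are hypothetical, and the approach is an assumption for illustration rather than a prescribed method.

```python
import re

# Codes and functions transcribed from the table above.
TAGS = {
    "-": "compounding hyphen",
    "{\\-}": "end-of-line hyphen",
    "{--}": "dash",
    "{-}": "censorship hyphen",
    "'": "possessive apostrophe",
    "{'}": "eliding apostrophe",
    "{`}": "single closing quotation mark",
}

# Try longer codes first, so that "{-}" is matched as one code
# and not as a bare "-" surrounded by stray braces.
TAG_RE = re.compile(
    "|".join(re.escape(t) for t in sorted(TAGS, key=len, reverse=True))
)

def find_tags(text):
    """Return (code, function) pairs in order of occurrence."""
    return [(m.group(0), TAGS[m.group(0)]) for m in TAG_RE.finditer(text)]

# Invented sample string, not taken from any encoded text.
sample = "th{'}one king's well-be{\\-}loved"
for code, function in find_tags(sample):
    print(code, "=", function)
```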

If non-ASCII characters and abbreviated forms are to be recognized on the same basis as other letter-numbers, then both braces and the vertical bar, which delimit them, should be declared as diacritics. Should this be done, then every character employed in these character codes, as well as every character described in these guidelines as a letter-number, should be explicitly declared as a letter-number. Any code character not found in the alphabet specified for a text-analysis program may be taken as a word-separator character, like the space. As a result, word-fragments will be sorted as separate word-integers.
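
The fragmentation described above can be seen in a minimal Python sketch of a tokenizer that treats every undeclared character as a separator. Everything specific here is an assumption for illustration: the function `tokenize`, the sample code `{e/}`, and the alphabet are hypothetical and are not drawn from these guidelines or from any particular program.

```python
def tokenize(text, letters, diacritics):
    """Split text into words; any undeclared character separates words."""
    word_chars = set(letters) | set(diacritics)
    words, current = [], []
    for ch in text:
        if ch in word_chars:
            current.append(ch)
        elif current:
            # An undeclared character ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

letters = set("abcdefghijklmnopqrstuvwxyz")
diacritics = set("{}|")   # braces and vertical bar declared as diacritics

# "{e/}" is a hypothetical character code used only for this example.
sample = "caf{e/}s"

# "/" is not declared, so it acts like a space and splits the word.
fragments = tokenize(sample, letters, diacritics)

# Once "/" is declared as a letter, the word stays whole.
whole = tokenize(sample, letters | {"/"}, diacritics)

print(fragments)
print(whole)
```

In the first call the undeclared "/" breaks the word into the fragments "caf{e" and "}s"; in the second the word survives as a single unit.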