<?xml version="1.0" encoding="UTF-8"?>
<!-- © TEI Consortium. Dual-licensed under CC-by and BSD2 licenses; see the file COPYING.txt for details. -->
<?xml-model href="https://jenkins.tei-c.org/job/TEIP5-dev/lastSuccessfulBuild/artifact/P5/release/xml/tei/odd/p5.nvdl" type="application/xml" schematypens="http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0"?>
<div xmlns="http://www.tei-c.org/ns/1.0" type="div1" xml:id="CH" n="4"> 
<head>Languages and Character Sets</head>
<p>The documents which users of these Guidelines may wish to encode
encompass all kinds of material, potentially expressed in the full
range of written and spoken human languages, including the extinct,
the non-existent, and the conjectural. Because of this wide scope,
special attention has been paid to two particular aspects of the
representation of linguistic information often taken for granted:
language identification and character encoding.</p>
<p>Even within a single document, material in many different languages
may be encountered. Human culture, and the texts which embody it, is
intrinsically multilingual, and shows no sign of ceasing to be so.
Traditional philologists and modern computational linguists alike work
in a polyglot world, in which code-switching (in the linguistic sense)
and accurate representation of differing language systems constitute
the norm, not the exception. The current increased interest in studies
of linguistic diversity, most notably in the recording and
documentation of endangered languages, is one aspect of this
long-standing tradition. Because of their historical importance, the needs
of endangered and even extinct languages must be taken into account
when formulating Guidelines and recommendations such as these. </p>
<p>Beyond the sheer number and diversity of human languages, it should
be remembered that in their written forms they may deploy a huge
variety of scripts or writing systems. These scripts are in turn
composed of smaller units, which for simplicity we term here
characters. A primary goal when encoding a text should be to capture
enough information for subsequent users to correctly identify
not only the constituent characters, but also the language and script. In this chapter we
address this requirement, and propose recommended mechanisms to
indicate the languages, scripts and characters used in a document or a
part thereof.  </p>
<p>Identification of language is dealt with in <ptr target="#CHSH"/>. In summary, it recommends the use of pre-defined
identifiers for a language where these are available, as they
increasingly are, in part as a result of the twin pressures of an
increasing demand for language-specific software and an increased
interest in language documentation. Where such identifiers are not
available or not standardized, these Guidelines recommend a method for
documenting language identifiers and their significance, in the same
way as other metadata is documented in the TEI header.</p>
<p>Standardization of the means available to represent characters and
scripts has moved on considerably since the publication of the first
version of these Guidelines. At that time, it was essential to
explicitly document the characters and encoded character sets used by
almost any digital resource if it was to have any chance of being
usable across different computer platforms or environments, but this
is no longer the case. With the availability of the Unicode standard,
more than 128,000 different characters representing almost all of the world's
current writing systems are available and usable in any XML processing
environment without formality. Nevertheless, however large the number
of standardized characters, there will always be a need to encode
documents which use non-standard characters and glyphs, particularly
but not exclusively in historical material. The second part of
this chapter discusses in some detail the concepts and
practice underlying this standard, and also introduces the methods
available for extending beyond it, which are more fully discussed in
<ptr target="#WD"/>.</p>

<div type="div2" xml:id="CHSH"><head>Language Identification</head>
  
<p>Identifying the language in which a document or part thereof is
written is a crucial requirement for many envisaged uses of
an electronic document. The TEI therefore accommodates this need in the
following ways:<list>
 <item>A global attribute <att>xml:lang</att> is defined for all TEI
 elements. Its value identifies the language and writing system
 used.</item>
 <item>The TEI header has a section set aside for the information
 about the languages used in a document: see further <ptr target="#HD41"/>.</item>
</list></p>
<p>The value of the attribute <att>xml:lang</att> identifies the
language (and, optionally, script) using a coded value. For maximal compatibility with existing
processes, the identifier for the language must be constructed as in
<title>Best Current Practice 47</title><note place="bottom">Currently
BCP 47 comprises two Internet Engineering Task Force documents,
referred to separately as RFC 5646 and RFC 4647; over time, other IETF
documents may succeed these as the best current practice.</note>. This
<emph>same</emph> identifier has to be used to identify
the corresponding <gi>language</gi> element in the TEI header, if one
is present.</p>
<p>The first part of BCP 47 is called <ref target="#CH-BIBL-4"><title>Tags for Identifying
Languages</title></ref>, and proposes the following mechanism for
constructing an identifier (tag) for languages as administered by the
Internet Assigned Numbers Authority (IANA). The tag is assembled from
a sequence of subtags separated by the hyphen (-, U+002D) character.
It gives the language (possibly further identified with an
extended language subtag), optionally followed by subtags for a script,
a region, and one or more variants, in that order.</p>
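<p>By way of illustration only, the following sketch declares three languages in the
header and then uses the same identifiers as values of <att>xml:lang</att> in the text.
The tags <code>en-GB</code>, <code>grc</code>, and <code>sr-Latn-RS</code> are built
from registered subtags, but the choice of languages, the prose descriptions, and the
sample content are assumptions made for this example rather than recommendations:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<langUsage>
 <language ident="en-GB">British English</language>
 <language ident="grc">Ancient Greek</language>
 <language ident="sr-Latn-RS">Serbian, written in the Latin script, as used in Serbia</language>
</langUsage>
<!-- ... -->
<p xml:lang="en-GB">The poem opens with the words
 <foreign xml:lang="grc">μῆνιν ἄειδε θεά</foreign>.</p>
</egXML>
Here <code>sr-Latn-RS</code> combines a language subtag (<code>sr</code>), a script subtag
(<code>Latn</code>), and a region subtag (<code>RS</code>), in that order.</p>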
<p>The authoritative list of registered subtags is maintained by IANA
and is available at <ptr target="https://www.iana.org/assignments/language-subtag-registry"/>.
For a good general overview of the construction of language tags, see
<ptr target="https://www.w3.org/International/articles/language-tags/"/>,
and for a practical step-by-step guide, see <ptr target="https://www.w3.org/International/questions/qa-choosing-language-tags.en.php"/>.</p>
<p>In addition to the list of registered subtags, BCP 47 provides
extensions that can be employed by private convention. The constructs
provided can thus be used to generate identifiers for any language,
past and present, in any usage in any area of the world. If such
private extensions are used within the context of the TEI, they should
be documented within the <gi>language</gi> element of the TEI header,
which might also provide a prose description of the language described
by the language tag.</p>

<p>While language, region, and script can be adequately identified
using this mechanism, there is only very rough provision to express a
dimension of time for the language of a document; those codes provided
(e.g. <code>grc</code> for <q>Greek, Ancient (to 1453)</q>) might not
reflect the segments appropriate for a text at hand. Text encoders
might express the time window of the language used in the document by
means of the extension mechanism defined in BCP 47 and relate that
to a <gi>date</gi> element in the corresponding <gi>language</gi>
section of the TEI header.</p>
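<p>One possible encoding along these lines is sketched below. The private-use subtag
<code>x-early</code> and the date range are purely hypothetical choices made for this
illustration; BCP 47 requires only that private-use subtags follow the singleton
<code>x</code> and contain no more than eight alphanumeric characters each:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<langUsage>
 <language ident="enm-x-early">Early Middle English, here taken to span
  <date from="1100" to="1300">1100 to 1300</date></language>
</langUsage>
<!-- ... -->
<p xml:lang="enm-x-early"><!-- text in the language so identified --></p>
</egXML></p>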
<p>Equivalences to language identifiers by other authorities can be
given in the <gi>language</gi> section as well, but no formal
mechanism for doing so has been defined.</p>
<p>The scope of the language identification extends to the whole
subtree of the document anchored at the element that carries the
<att>xml:lang</att> attribute, including all elements and those
attributes, if any, where a language might apply.<note place="bottom">This excludes all attributes where a non-textual
datatype has been specified, for example tokens, boolean values,
dates, and predefined value lists.</note></p></div>
  
  <div type="div2" xml:id="CHCS">
   <head>Characters and Character Sets</head>
  <p>All document encoding has to do with representing one thing by
   another in an agreed and systematic way. Applied to the smallest
   distinctive units in any given writing system, which for the
   moment we may loosely call <soCalled>characters</soCalled>, such representation
   raises surprisingly complex and troublesome issues. The reasons
   are partly historical and partly to do with conceptual
   unclarities about what is involved in identifying, encoding,
   processing and rendering the characters of a natural
   language.</p>             
  <div type="div3" xml:id="D4-41">            
   <head><!--4.1-->Historical Considerations</head>           
   <p>When the first methods of representing text for storage or
   transmission by machines were devised, long before the
   development of computers, the overriding aim was to identify the
   smallest set of symbols needed to convey the essential semantic
   content, and to encode that symbol set in the most economical
   way that the storage or transmission media allowed. The initial
   outcome was a set of systems that encoded only such content as could be
   expressed in uppercase letters in the Latin script, plus a few
   punctuation marks and some <soCalled>control characters</soCalled> needed to
   regulate the storage and transmission devices. Such encodings,
   originally developed for telegraphy, strongly influenced the way
   the pioneers of computing conceived of and implemented the
   handling of text, with consequences that are with us still.</p> 
   <p>For many years after the invention of computers, the way they
   represented text continued to be constrained by the imperative
   to use expensive resources with maximal efficiency. Even when
   storage and processing costs began their dramatic fall, the
   Anglo-centric outlook of  most hardware designers and software
   engineers hampered initiatives to devise a more generous and
   flexible model for text representation. The wish to retain
   compatibility with <soCalled>legacy</soCalled> data was an additional disincentive.
   Eventually, tension in East Asia between commitment to
   technological progress and the inability of existing computers
   to cope with local writing systems led to decisive developments.
   Japanese, Korean, and Chinese standards bodies, who long before
   the advent of computers had been engaged in the specification of
   character sets, joined with computer manufacturers and software
   houses to devise ways of mapping those character sets to numeric
   encodings and processing the resulting text data.</p>
   <p>Unfortunately, in the early years there was little or no
   co-ordination among either the national standards bodies or the
   manufacturers concerned, so that although commercial necessity
   dictated that these various local standards were all compatible
   with  the representation of US-American English, they were not
   straightforwardly compatible with one another. Even within Japan
   itself there emerged a number of mutually incompatible systems,
   thanks to a mixture of commercial rivalry, disagreements about
   how best to manage certain intractable problems, and the fact
   that such pioneering work inevitably involved some false starts,
   leading to incompatibilities even between successive products of
   the same bodies. At roughly the same time, and for similar
   reasons, multiple and incompatible ways of representing
   languages that use Cyrillic scripts were devised, along with
   methods of encoding ancient writing systems which inevitably
   could not aim for compatibility with other writing systems apart
   from basic Latin script. Many of the earliest projects that fed
   into the TEI were shaped in this developmental phase of the
   computerized representation of texts, and it was also the
   context in which SGML was devised and finalized. </p> 
   <p>SGML had of necessity to offer ways of coping with multiple
   writing systems in multiple representations; or rather, it
   provided a framework within which SGML-compliant applications
   capable of handling such multiple representations might be
   developed by those with sufficient financial and personnel
   resources (such as are seldom found in academia). Earlier
   editions of these Guidelines offered advice on character set and
   writing system issues addressed to the condition of those for
   whom SGML was the only feasible option. That advice is here
   substantially altered because of two closely-related
   developments: the availability of the ISO/Unicode character set
   as an international standard, and the emergence of XML and
   related technologies which are committed to the theory and
   practice of character representation which Unicode embodies.
   </p> </div> 
  <div type="div3" xml:id="D4-42"> <head><!--4.2 -->Terminology and Key
   Concepts</head><p>Before the significance of Unicode and the
   implications of the association between XML and Unicode can be
   adequately explained, it is necessary to clarify some key
   concepts and attempt to establish an adequately precise
   terminology for them.</p> 
   <p><figure xml:id="fig1">
     <graphic width="70%" url="Images/CHfig.png"/>
    <head>Examples of the Latin <mentioned>a</mentioned>, in both lower and upper case, rendered with different fonts.</head>
    <figDesc>Examples of the Latin <mentioned>a</mentioned>, in both lower and upper case, rendered with different fonts.</figDesc>
   </figure></p>
   <p>
    The word <soCalled>character</soCalled> will not of itself take us
    very far towards greater terminological precision. It tends to be
    used to refer indiscriminately both to the visible symbol on a
    page and to the letter or ideograph which that symbol represents,
    two things that it is essential to keep conceptually distinct. The
    visible symbol obviously has some aspects by which we interpret it
    as representing one character rather than another; but its
    appearance may also be significantly determined by features that
    have no effect on our notion of which character in a writing
    system it represents. A familiar instance is the lowercase
    <mentioned>a</mentioned>, which in printed texts may be
    represented either by a <soCalled>single storey</soCalled> symbol
    (<ref target="#fig1">cf. figure 1</ref> in the examples from
    URW Gothic L on the bottom row) or by a <soCalled>two
    storey</soCalled> version (as in <ref target="#fig1">figure
    1</ref> in the examples from Umpush, or URW Bookman L Demi Bold).
    We say that the single-storey and two-storey symbols both represent
    one and the same <term>abstract
    character</term> <mentioned>a</mentioned> using two different
    <term>glyphs</term>. Similarly, an uppercase
    <mentioned>A</mentioned> in a serif typeface has additional
    strokes that are absent from the same letter when printed using a
    sans-serif typeface, so that once again we have differing glyphs
    standing for the same abstract character. The distinction
    between abstract characters and glyphs is fundamental to all
    machine processing of documents.</p>
   <p>In most scholarly encoding projects, the accurate recording of
   the abstract characters which make up the text is of prime
   importance, because it is the essential prerequisite of
   digitizing and processing the document without semantic loss. In
   many cases (though there are important exceptions, to be touched
   on shortly) it may not be necessary to encode the specific
   glyphs used to render those abstract characters in the original
   document. An encoding that faithfully registers the abstract
   characters of a document allows us to search and analyse our
   document's content, language, and structure, and to access its full
   semantics. That same encoding, however, may not contain
   sufficient information to allow an exact visual representation
   of the glyphs in the source text or manuscript to be recreated.
   </p>
   <p>The importance of this distinction between information content
   and its visual representation is not always immediately apparent
   to people unused to the specific complexities of text handling
   by machine. Such users tend to ask first what (in order of
   conceptual priority) should actually be their very last
   question: how do I get a physical image that looks like
   character x in my source document to appear on the screen or
   the output page? Their first question should in fact be: how can
   I get an abstract representation of character x into my encoded
   document in a way that will be universally and unambiguously
   identifiable, no matter what it happens to look like in printout
   or on any particular display? And occasionally the response they
   receive as a result of their misguided initial question is a
   custom <soCalled>solution</soCalled> that satisfies their
   immediate rendering wishes at the price of making their
   underlying document unintelligible to other users (or even to
   the original user in other times and places) because it encodes
   the abstract character in an idiosyncratic way.</p> 
   <p>That said, there will certainly be documents or projects where
   it is a matter of scholarly significance that the compositor or
   scribe chose to represent a given abstract character using one
   particular glyph or set of strokes rather than a
   semantically-equivalent but visually distinct alternative, and
   in that case the specific appearance of the form will have to be
   encoded in one way or another. But that encoding need not (and
   in most cases will not) involve a notation that visually
   resembles the original, any more than italicized text in an
   original document will be represented by the use of italic
   characters in the encoded version.</p> 
   <p>A collection of the abstract characters needed to represent
   documents in a given writing system is known as a 
<term>character set</term>, and the character set or
   <term>character repertoire</term> of a processing or
   rendering device is the set of abstract characters that it is
   equipped to recognize and handle appropriately. There is,
   however, a subtle distinction between these two parallel uses of
   the same term, involving one more key concept which it is
   essential to grasp. The character set of a document (or the
   writing system in which it is recorded) is purely a collection
   of abstract characters. But the character set of a computing
   device is a set of abstract characters which have been mapped in
   a well-defined way to a set of numbers or <term>code points</term> 
   by which the device represents
   those abstract characters internally. It can therefore be
   referred to as a <term>coded character set</term>,
   meaning a set of abstract characters each of which has been
   assigned a numerical code point (or in some instances a sequence
   of code points) which unambiguously identifies the character
   concerned.</p>
<p>It is now possible to use this terminology to
   say what Unicode is: it is a coded character set, devised and
   actively maintained by an international public body, where each
   abstract character is identified by a unique name and assigned a
   distinctive code point.<note place="bottom">Although only Unicode
    is mentioned here explicitly, it should be noted that the
    character repertoire and assigned code points of Unicode and
    the ISO standard 10646 are identical and maintained in a way
    that ensures this continues to be the case. </note> Unicode is 
  distinguished from other coded character sets by its
  (current and potential) size and scope; its built-in provision
  for (in practical terms) limitless expansion; the range and
  quality of linguistic and computational expertise on which it
  draws; the stability, authority, and accessibility it derives
  from its status as an international public standard; and,
  perhaps most importantly, the fact that today it is implemented
  by almost every provider of hardware and software platforms
  worldwide.</p> </div> 
  <div type="div3" xml:id="D4-43"> 
   <head><!--4.3 -->Abstract Characters, Glyphs, and Encoding Scheme
   Design</head> 
   <p>The distinction between abstract characters and glyphs can be
   crucial when devising an encoding scheme.  When performing 
   searches, text retrieval, or creating concordances, users of 
   electronic text will expect the system to recognize and treat 
   different glyphs as instances of the same character; but when 
   perusing the text itself they may well expect to see glyph variants 
   preserved and rendered. When encoding a pre-existing text, the 
   encoder should determine whether a particular
   letter or symbol is a character or a glyphic variant. The Unicode 
   Consortium and an ISO work group (ISO/IEC JTC1
   SC2/WG2) have developed a detailed model of the relationship
   between characters and glyphs. This model, presented in <ref target="https://www.unicode.org/reports/tr17/">Unicode Technical
     Report 17: Character Encoding Model</ref>, underpins
   much subsequent standards work, including the current chapter.</p>
   <p>The model makes explicit the distinction between two different
   properties of the components of written language: 
<list> 
    <item>their content, i.e. their meaning and phonetic value
    (represented by characters)</item>
    <item>their graphical appearance (represented by glyphs).</item>
</list> 
   </p>  
   <p> When searching for information, a system generally operates
   on the content aspects of characters,  with little or no
   attention to their appearance. A layout or formatting process,
   on the other hand, must of necessity be concerned with the exact
   appearance of characters. Of course, some operations
   (hyphenation for example) require attention to both kinds of
   feature, but in general the kind of text encoding described in
   these Guidelines tends to focus on content rather than
   appearance (see further <ptr target="#COHQ"/>).</p>
   <p> An encoder wishing to record information about which glyphs
   are present in a given document may do so at either or both of
   two levels:  
<list> 
    <item>the level of character encoding, using an appropriate
    Unicode code point to represent the glyph concerned </item> 
    <item>the markup level, with the glyph indicated via
    appropriate elements or attributes</item> 
   </list> </p> 
   <p>The encoding practice adopted may be guided by, among other
   things, an assessment of the most  frequent uses to which the
   encoded text will be put. For example, if recognition of
   identical characters represented by a variety of glyphs is the
   main priority, it may be advisable to represent the glyph
   variations at markup level, so that the character value can be
   immediately exposed to the indexing and retrieval software.
   Plainly, an encoding project will need to consider such issues
   carefully and document the outcome of its
   deliberations in its TEI customization file (or other local
   encoding documentation) to ensure encoding consistency. Using
   Unicode code points to represent glyph information requires that
   such choices be documented in the TEI header. Such documentation
   cannot of itself guarantee proper display of the desired
   glyph but at least makes the intention of the encoder
   discoverable.</p> 
   <p>At present the Unicode Standard does not offer detailed
   specifications for the encoding of glyph variations. These
   Guidelines do give some recommendations; some discussion of
   related matters is given in <ptr target="#PH"/>,
   and  <ptr target="#WD"/>  offers some features for the definition of variant
   glyphs. </p> 
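   <p>As a hedged illustration of the markup-level approach, the single-storey form of
   <mentioned>a</mentioned> discussed earlier in this chapter might be recorded with the
   <gi>g</gi> element described in <ptr target="#WD"/>, leaving the character itself as an
   ordinary <mentioned>a</mentioned> for searching and indexing. The identifier
   <code>a-single-storey</code>, the wording of the description, and the sample word are
   all invented for this sketch:
<egXML xmlns="http://www.tei-c.org/ns/Examples">
<charDecl>
 <glyph xml:id="a-single-storey">
  <desc>Single-storey form of Latin small letter a</desc>
  <mapping type="standard">a</mapping>
 </glyph>
</charDecl>
<!-- ... -->
<p>c<g ref="#a-single-storey">a</g>t</p>
</egXML>
   Software that ignores the <gi>g</gi> element still sees the plain character sequence,
   while a renderer that understands it can select the variant glyph.</p>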
  </div> 
  <div type="div3" xml:id="D4-44"> 
  <head><!--4.4. -->Entry of Characters</head>    
  <p>Entering characters was much more complicated before the near-universal
    adoption of Unicode. <term>Input Method Editors</term>
    (IMEs) are now available for most languages, as are fonts that provide glyphs for the full 
    range of the Unicode specification. In those rare situations where there is
    difficulty entering the specific character you want, or some problem representing
    it on the system you are working in, <term>Numeric Character References</term>
    (NCRs) should be used. These take the general form <code><![CDATA[&#D;]]></code> where 
    <code>D</code> is an integer representing the code point of the character in 
    base 10, or <code><![CDATA[&#xH;]]></code>, where <code>H</code> is the code point in
    hexadecimal notation. Every XML processor is capable of recognising NCRs and 
    replacing them with the required code point value without needing access to 
    any additional data. The disadvantage of NCRs as a means of entering, 
    representing and proofing character data is that most human beings find them
    anything but <soCalled>readable</soCalled> and it is all too easy
    for the wrong character to be entered in error and retained undetected. 
    Where characters are not defined in Unicode, these Guidelines provide advice
    on the strategies available for handling their representation in <ref target="#WD">Chapter 25 Representation of non-standard Characters and 
    Glyphs</ref>. </p>
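  <p>For instance, the character <mentioned>é</mentioned> (U+00E9, decimal 233) could be
    entered in either of the two forms shown below; this is a minimal sketch, and the
    sample word and the use of <gi>w</gi> are arbitrary:
<eg><![CDATA[<w>caf&#233;</w>
<w>caf&#xE9;</w>]]></eg>
    An XML processor delivers the content of both elements as the same string,
    <mentioned>café</mentioned>.</p>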
  </div>
  <div type="div3" xml:id="D4-45a">
   <head>Output of Characters</head>
   <p>The rendering of the encoded text is a complicated process that
   depends largely on the purpose, external requirements, local
   equipment, and so forth; it is thus outside the scope of
   these Guidelines. </p>
   <p>It might nevertheless be helpful to put some of the
   terminology used for the rendering process in the context of the
   discussion of this chapter.  As was mentioned above, Unicode
   encodes abstract characters, not specific glyphs.  For any
   process that makes characters visible, however, concrete,
   specifically designed glyph shapes have to be used.  For a printing
   process, for example, these shapes
   describe exactly at which point ink has to be put on the paper
   and which areas have to be left blank.  If we want to print a character
   from the Latin script, this process requires not only the selection of
   the overall glyph shape but also the selection of a
   specific weight and size of the font,
   and of the degree to which the shape should be slanted.  Beyond
   individual characters, the overall typesetting process also
   follows specific rules for calculating the distance between
   characters, for determining how much whitespace occurs between any two words, and how long each line should be (and thus at which
   points a new line begins), and so forth.  </p>
   <p>If we concern ourselves only with the rendering process of the
   characters themselves, leaving out all these other parameters, we
   will realize that of all the information required for this process, only a small
   amount will be drawn from the encoded text itself.  This
   information is the code point used to encode the character in the
   document.  With this information, the font selected for printing
   will be queried to provide a glyph shape for this character.
   Some modern font formats (e.g. OpenType) implement a
   sophisticated mapping from a code point to the glyph selected,
   which might take into account surrounding characters (to create
   ligatures where necessary) and the language or even the region for which the character is
   printed, in order to accommodate different typesetting traditions and
   differences in the usage of glyphs.  </p>
   <p>A TEI document might provide some of the information that is
   required for this process, for example by identifying the
   linguistic context with the <att>xml:lang</att> attribute. The
   selection of fonts and sizes is usually done in a stylesheet,
   while the actual layout of a page is determined by the
   typesetting system used. Similarly, if a document is rendered
   for publication on the Web, information of this kind can be
   shipped with the document in a stylesheet.<note place="bottom">The World Wide
   Web Consortium provides recommendations for two standard
   stylesheet languages: either CSS or
   XSL could be used for this purpose.</note></p>
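   <p>As one possible sketch of such a stylesheet rule, the XSLT template below passes the
   language of a <gi>foreign</gi> element through to HTML output, where a suitable font can
   then be chosen. It assumes that the prefix <code>tei</code> is bound to the TEI namespace
   in the enclosing stylesheet; the class name <code>greek</code> is invented for this
   example:
<eg><![CDATA[<xsl:template match="tei:foreign[lang('grc')]">
  <!-- carry the language through to the output so that a font can be selected there -->
  <span lang="grc" class="greek">
    <xsl:apply-templates/>
  </span>
</xsl:template>]]></eg>
   The XPath <code>lang()</code> function takes inherited values of <att>xml:lang</att>
   into account, so the rule also applies where the language is declared on an ancestor
   element.</p>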
  </div> 
  <div type="div3" xml:id="D4-45b"> <head><!--4.5 -->Unicode and
   XML</head> 
   <p>XML was designed with Unicode in mind as its means of representing
   abstract characters. It is possible to use other character encoding
   schemes, but in general they are best avoided, as you run the risk 
   of encountering compatibility issues with different XML processors,
   as well as potential difficulties with rendering their output. We 
   recommend using the <term>UTF-8</term> encoding, which for the Basic
   Latin range is identical to ASCII, and which uses a variable-length
   set of bytes to represent characters. It should be noted that it is
   not sufficient simply to declare in the XML Declaration that a document
   is in UTF-8 format. Doing so merely means that processors will treat the
   content therein as if it were UTF-8, and may fail to process the 
   document if it is not. For further discussion of UTF-8, see the 
   section below on <ptr target="#D4-48"/>.</p> 
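   <p>To illustrate, a document beginning with the declaration below must also be
   physically saved as UTF-8 by the software used to edit it. In that encoding the Basic
   Latin letter <mentioned>a</mentioned> (U+0061) occupies the single byte <code>61</code>,
   exactly as in ASCII, whereas <mentioned>é</mentioned> (U+00E9) occupies the two bytes
   <code>C3 A9</code> (byte values are given in hexadecimal, and the sample characters are
   chosen only for this example):
<eg><![CDATA[<?xml version="1.0" encoding="UTF-8"?>]]></eg></p>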
  </div>
  <div type="div3" xml:id="D4-46"> 
  <head><!--4.6 -->Special Aspects of Unicode Character Definitions</head> 
  <div type="div4" xml:id="D4-46-1"> 
   <head><!--4.6.1 -->Compatibility Characters</head> 
   <p>The principles of Unicode are judiciously tempered with
   pragmatism. This means, among other things, that the actual
   repertoire of characters which the standard encodes, especially
   those parts dating from its earlier days, includes a number of
   items which on a strict interpretation of the Unicode
   Consortium's theoretical approach should not have been regarded
   as abstract characters in their own right. Some of these
   characters are grouped<!--, almost quarantined,--> together into
   code-point regions assigned to <term>compatibility characters</term>.
   Ligatures are a case in point. Ligatures (e.g. the joining of
   adjacent lowercase letters <q>s</q> and <q>t</q> or <q>f</q> and <q>i</q> in Latin
   scripts, whether produced by a scribal practice of not lifting
   the pen between strokes or dictated by the aesthetics of a type
   design) are representational features with no added semantic
   value beyond that of the two letters they unite (though for
   historians of typography their presence and form in a given
   edition may be of scholarly significance). However, by the time
   the Unicode standard was first being debated, it had become
   common practice to include single glyphs representing the more
   common ligatures in the  repertoires of some typesetting devices
   and high-end printers, and for the coded character sets built
   into those devices to use a single code point for such glyphs,
   even though they represent two distinct abstract characters. So
   as to increase the acceptance of Unicode among the makers and
   users of such devices, it was agreed that some such
   pseudo-characters should be incorporated into the standard as compatibility characters.
   Nevertheless, if a project requires the presence of such
   ligatured forms to be encoded, this should normally be done via
   markup, not by the use of a compatibility character. That way,
   the presence of the ligature can still be identified (and, if
   desired, rendered visually) where appropriate, but indexing and
   retrieval software will treat the code points in the document as
   a simple sequential occurrence of the two constituent characters
   concerned and so correctly align their semantics with
   non-ligatured equivalents. Such ligatures should not be confused
   with digraphs (usually) indicating diphthongs, as in the French
   word <q>cœur</q>. A digraph is an atomic orthographic unit
   representing an abstract character in its own right, not purely an amalgamation
   of glyphs, and indexing and retrieval software will need to 
   treat it as such. Where a digraph occurs in a source text, it
   should normally be encoded using the appropriate code point for
   the single abstract character which it represents. </p> </div> 
  <div type="div4" xml:id="D4-46-2"> 
   <head><!--4.6.2 -->Precomposed and Combining Characters and
   Normalization</head>    
    <p>The treatment of characters with
   diacritical marks within Unicode shows a similar combination of
   rigour and pragmatism. It is obvious enough that it would be
   feasible to represent many characters with diacritical marks in
   Latin and some other scripts by a sequence of code points, where
   one code point designated the base character and the remainder
   represented one or more diacritical marks that were to be
   combined with the base character to produce an appropriate
   glyphic rendering of the abstract character concerned. From its
   earliest phase, the Unicode Consortium espoused this view in
   theory but was prepared in practice to compromise by assigning
   single code points to <term>precomposed</term> characters which were
   already commonly assigned a single distinctive code point in
   existing encoding schemes. This means, however, that for quite a
   large number of commonly-occurring abstract characters, Unicode
   has two different, but logically and semantically equivalent
   encodings: a <term>precomposed</term> single code point, and a code point
   sequence of a base character plus one or more <term>combining</term>
   diacritics. Scripts more recently added to Unicode no longer
   exhibit this code-point duplication (in current practice no new
   precomposed characters are defined where the use of combining
   characters is possible) but this does nothing to remove the
   problem caused by the duplications from older character sets that 
   have been permanently embodied in Unicode. Together with essentially analogous
   issues arising from the encoding of certain East Asian
   ideographs, this duplication gives rise to the need to practise
   <term>normalization</term> of Unicode documents. Normalization is
   the process of ensuring that a given abstract character is represented in one
   way only in a given Unicode document or document collection.
   The Unicode Consortium provides four standard normalization
   forms, of which <term>Normalization Form C</term> (NFC)
   seems to be the most appropriate for text encoding projects. As
   far as possible, NFC converts each base character followed
   by one or more combining characters into the corresponding precomposed
   character. The World Wide Web Consortium has produced a document entitled
   <title>Character Model for the World Wide Web 1.0</title><note place="bottom">Available at
    <ptr target="https://www.w3.org/TR/charmod/"/>.</note>, which among other things
   discusses normalization issues and outlines some relevant
   principles. An authoritative reference is Unicode Standard Annex
   #15, <title>Unicode Normalization Forms</title><note place="bottom">Available at
    <ptr target="https://www.unicode.org/reports/tr15/"/>.</note>. </p> 
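   <p>As an illustrative sketch (the word chosen and the use of the
   <gi>w</gi> element are arbitrary assumptions made for this example),
   the two canonically equivalent encodings of <q>é</q> can be made
   visible by writing them as numeric character references. The first
   form uses the precomposed code point U+00E9; the second uses the base
   letter U+0065 followed by the combining acute accent U+0301:</p>
   <egXML xmlns="http://www.tei-c.org/ns/Examples">
     <!-- precomposed: a single code point, U+00E9 -->
     <w>caf&amp;#xE9;</w>
     <!-- decomposed: base character e followed by combining U+0301 -->
     <w>cafe&amp;#x301;</w>
   </egXML>
   <p>Software that compares these two strings code point by code point
   will treat them as different unless both have first been brought to
   the same normalization form; under NFC both take the first form.</p>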
   <p>It is important that every Unicode-based project should agree
   on, consistently implement, and fully document a comprehensive and
   coherent normalization practice. As well as ensuring data integrity
   within a given project, a consistently implemented and properly
   documented normalization policy is essential for successful
   document interchange. Although input methods may differ
   in the normalization form they produce, most programming languages with Unicode support
   provide mechanisms for converting between normalization forms, so it
   is usually straightforward to ensure that all documents in a project are in a consistent form,
   even if different methods are used to enter data.</p>
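   <p>One possible way of recording such a policy is a prose statement
   in the TEI header. The following is a sketch only: the wording, and
   the choice of <gi>editorialDecl</gi> as the place to record it, are
   assumptions made for this example rather than a recommendation.</p>
   <egXML xmlns="http://www.tei-c.org/ns/Examples">
     <encodingDesc>
       <editorialDecl>
         <!-- hypothetical project statement -->
         <p>All files in this project are encoded in UTF-8 and have been
         converted to Unicode Normalization Form C (NFC) before
         validation.</p>
       </editorialDecl>
     </encodingDesc>
   </egXML>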
  </div> 
  <div type="div4" xml:id="D4-46-3"> 
   <head><!--4.6.3 -->Character Semantics</head>    
   <p>In addition to the Universal Character Set itself, the
    Unicode Consortium maintains a database of additional character
    semantics<note place="bottom"><ptr target="https://www.unicode.org/ucd/"/></note>. This
    includes names for each character code point and normative
    properties for it.  Character properties, as given in this
    database, determine the semantics and thus the intended use of a
    code point or character. The database also contains information that might be
    needed for correctly processing this character for different
    purposes. It is an important reference in determining which Unicode 
    code point to use to encode a certain character.  </p>
   <p>In addition to the printed documentation and lists made
   available by the Unicode Consortium, the information in the database
   may also be accessed through a number of search systems on the Web
   (e.g. <ptr target="http://www.eki.ee/letter/"/>). Character
   properties recorded in the database include case, numeric
   value, directionality, and (where applicable) status as a
   <soCalled>compatibility character</soCalled><note place="bottom">For
   further details, see <title>The Unicode Character Property
   Model</title> (Unicode Technical Report #23), at <ptr target="https://www.unicode.org/reports/tr23/"/>.</note>. Where a
   project undertakes local definition of characters with code points
   in the PUA, it is desirable that any relevant additional
   information about the characters concerned should be recorded in an
   analogous way, as further discussed under <ptr target="#WD"/>.</p>
   </div>
  </div>
  <div type="div3" xml:id="D4-48"> 
   <head><!--4.8  -->Issues Arising from the Internal Representations of
   Unicode</head> 
   <p>In theory it should not be necessary for encoders to have any
   knowledge of the various ways in which Unicode code points can
   be represented internally within a document or in the memory of
   a processing system, but experience shows that problems
   frequently arise in this area because of mistaken practice or
   defective software, and in order to recognize the resulting
   symptoms and correct their causes an outline knowledge of
   certain aspects of Unicode internal representation is desirable.
   There are three encodings of Unicode available for use: UTF-8, which
   uses 1–4 bytes per character; UTF-16, which uses 2 or 4; and UTF-32,
   which uses 4 bytes per character. Current practice for documents to
   be transmitted via the Web recommends only UTF-8.<note place="bottom">See the W3C 
   Internationalization document, <title>Choosing &amp; applying a 
   character encoding</title>, at 
     <ptr target="https://www.w3.org/International/questions/qa-choosing-encodings"/>.</note>
   </p> 
   <div type="div4" xml:id="D4-48-1"> 
   <head><!--4.8.1. -->Encoding Errors Related to UTF-8</head> 
   <p>The code points assigned by Unicode 3.0 and later are
    integers in the range 0 to 10FFFF (hexadecimal), so each fits
    within 32 bits, and the most straightforward way to
    represent each such integer in computer storage would be to use
    four eight-bit bytes. However, many of the code points for
    characters most commonly used in Latin scripts can be
    represented in one byte only and the vast majority of the
    remainder which are in common use (including those assigned
    from the most frequently used PUA range) can be expressed in
    two bytes alone. This accounts for the use of UTF-8 and UTF-16
    and their special place in the XML standard: both
    are ways of representing code points
    economically. </p><p>UTF-8 is a variable-length encoding: the more
    significant bits there are in the underlying code point (or in
    everyday terminology the bigger the number used to represent
    the character), the more bytes UTF-8 uses to encode it. What
    makes UTF-8 particularly attractive for representing Latin
    scripts, explaining its status as the default encoding in XML
    documents, is that all code points that can be expressed in
    seven or fewer bits (the 128 values in the original ASCII
    character set) are also encoded as the same seven or fewer bits
    (and therefore in a single byte) in UTF-8. That is why a
    document which is actually encoded in pure 7-bit ASCII can be
    fed to an XML processor without alteration and without its
    encoding being explicitly declared: the processor will regard
    it as being in the UTF-8 representation of Unicode and be able
    to handle it correctly on that basis.</p>
   <p>However, even within the domain of Latin-based scripts, some
    projects have documents which use characters from 8-bit
    extensions to ASCII, e.g. those in the ISO-8859-n series of
    encodings, and the way characters which under ISO-8859-n use
    all eight bits are encoded in UTF-8 is significantly different,
    giving rise to puzzling errors. Abstract characters that have a
    <emph>single</emph> byte code point where the
    highest bit is set (that is, they have a decimal numeric
    representation between 128 and 255) are encoded in ISO-8859-n
    as a <emph>single</emph> byte with the same value
    as the code point. But in UTF-8 code-point values inside that
    range are expressed as a <emph>two</emph> byte
    sequence. That is to say, the abstract character in question is
    no longer represented in the file or in memory by the same number
    as its code-point value: it is <hi>transformed</hi> (hence the T in
    UTF) into a sequence of two different numbers. Now as a
    side-effect of the way such  UTF-8 sequences are derived from
    the underlying code-point value, many of the single-byte
    eight-bit values employed in ISO-8859-n encodings are illegal
    in UTF-8.</p>
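   <p>A short worked illustration may help. The table below gives the
   byte values, in hexadecimal, for three characters as stored in
   ISO-8859-1 (chosen here simply as a familiar member of the ISO-8859-n
   series) and in UTF-8:</p>
   <table>
     <row role="label">
       <cell>Character</cell>
       <cell>Code point</cell>
       <cell>ISO-8859-1 byte</cell>
       <cell>UTF-8 bytes</cell>
     </row>
     <row>
       <cell>é</cell>
       <cell>U+00E9</cell>
       <cell>E9</cell>
       <cell>C3 A9</cell>
     </row>
     <row>
       <cell>© (copyright sign)</cell>
       <cell>U+00A9</cell>
       <cell>A9</cell>
       <cell>C2 A9</cell>
     </row>
     <row>
       <cell>no-break space</cell>
       <cell>U+00A0</cell>
       <cell>A0</cell>
       <cell>C2 A0</cell>
     </row>
   </table>
   <p>In UTF-8 the bytes A0 and A9 can occur only as the continuation of
   a multi-byte sequence, and E9 announces a three-byte sequence, so
   any of the three ISO-8859-1 bytes standing alone between ASCII
   characters makes the input ill-formed as UTF-8.</p>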
   <p>This complicated situation has a simple consequence which can
    cause great bewilderment. XML processors will effortlessly
    handle character data in pure 7-bit ASCII without that encoding
    needing to be declared to the parser, and will similarly accept
    documents encoded in an undeclared ISO-8859-n encoding if they
    happen to use no characters outside the strict ASCII subset of
    the ISO character sets; but the parse will immediately fail if
    an eight-bit character from an ISO-8859-n set is encountered in
    the input stream, unless the document's encoding has been
    explicitly and correctly declared. Explicitly declaring the
    encoding ought to solve the problem, and if the file is
    correctly encoded throughout, it will do so. But projects dealing 
    with documents of sufficient age may find that they have to deal with some files  encoded
    in UTF-8 along with others in, say, ISO-8859-1. Such encoding
    differences may go unnoticed, especially if the proportion of
    characters where the internal encodings are distinguishable is
    relatively small (for example in a long English text with a
    smattering of French words). These types of error may or may not
    manifest in actual processing errors, and may only become visible
    as <soCalled>garbage</soCalled> characters in the eventual display of documents.</p>
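   <p>For illustration, a file that really is stored in ISO-8859-1 would
   state that fact in its XML declaration as follows (the encoding name
   shown is only one possibility, and must match the bytes actually
   present in the file):</p>
   <eg><![CDATA[<?xml version="1.0" encoding="ISO-8859-1"?>]]></eg>
   <p>A characteristic symptom of the opposite mismatch is that a file
   stored in UTF-8 but read as ISO-8859-1 displays <q>é</q> (bytes C3 A9)
   as the two characters <q>Ã©</q>.</p>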
   <p>In projects that routinely handle documents in non-Latin
    scripts, everyone is well aware of the need to ensure correct
    and consistent encoding, so in such places mixed encoding
    problems seldom arise, and when they do are readily identified
    and remedied. Real confusion tends to arise, however, in
    projects which have a low awareness of the issues because they
    employ predominantly unaccented Latin characters, with only
    thinly-distributed instances of accented letters, or other
    <soCalled>special characters</soCalled> whose internal representations under
    ISO-8859-n and UTF-8 differ (such as the copyright
    symbol, or, a frequent troublemaker where eventual HTML output
    is envisaged, the <soCalled>non-breaking space</soCalled>). Even, or especially,
    if such projects view themselves as concerned only with
    English documents, the close relationship between XML and
    Unicode means they will need to acquire an understanding of
    these encoding issues and develop procedures which assure
    consistency and integrity of encoding and its correct
    declaration, including the use of appropriate software for
    transcoding and verification. </p> </div> 
    </div>
  </div>
  </div>