Compatibility issues
The first version of the BNC was released slightly in advance of
the publication of the TEI's definitive Recommendations, and over a
year before publication of the Corpus Encoding Standard. Although all
three standards have much in common (in particular, CDIF - Corpus
Document Interchange Format - the BNC's own initial DTD, was
influential in the design of the other two), they are not
identical. Several elements are named differently, and some, more
significantly, have different content models or attributes.
In the present release of the Corpus, considerable effort has been
made to improve compatibility of the BNC DTD with TEI and with CES,
while retaining as far as possible a degree of compatibility with
CDIF. The objective was to ensure that a document which conformed to
the BNC's DTD would also conform to either of the other two standards,
rather than to ensure that any CES or TEI conformant document would
also be BNC conformant. This necessarily involved some modification of
the original tagging of the corpus, which is detailed in this section.
Differences between the BNC DTD and TEI
The present version of the BNC document type declaration (DTD) can
be expressed as a set of extensions against the standard TEI DTD,
using the extension mechanism recommended by that standard. Full
details of the procedure are given in chapter 3 of the TEI
Guidelines. Essentially, the procedure requires the
definition of two extension files, called here
bncMods.ent and bncMods.dtd, the former
containing definitions of parameter entities needed for this set of
extensions, and the latter containing the actual SGML element and
attribute definitions which make up the required modifications. Copies
of these files are included in the present release, along with the DTD
derived from them. The present section describes their content
informally.
The DTD described elsewhere in this document makes use of several
elements already defined in other TEI tagsets, in particular the base
tag sets for prose and for spoken texts, and the additional tagsets
for language corpora and analysis. To combine all of these with the extension
files mentioned above, a TEI conformant document will begin as
follows:
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
<!ENTITY % TEI.prose "INCLUDE">
<!ENTITY % TEI.spoken "INCLUDE">
<!ENTITY % TEI.general "INCLUDE">
<!ENTITY % TEI.analysis "INCLUDE">
<!ENTITY % TEI.corpora "INCLUDE">
<!ENTITY % TEI.extensions.ent
SYSTEM "bncMods.ent">
<!ENTITY % TEI.extensions.dtd
SYSTEM "bncMods.dtd">
]>
This file can be compiled to form a single-file version of the DTD, in which
all parameter entity references have been resolved and any redundant
declarations removed, using software such as the TEI PizzaChef.
The file bncMods.ent consists of a number of SGML parameter
entity definitions, which override the definitions provided in the TEI
DTD itself. These declarations have the following effects:
to exclude from the DTD a large number of
standard TEI elements which are not actually used in the BNC
DTD;
to provide alternative names for some standard TEI elements;
to exclude from the TEI DTD some elements which are redefined, either
with a stricter content model, or with differing attribute lists, in the
BNC DTD;
to specify the location within the TEI class system of some elements
not defined in the TEI DTD.
Taking these in turn, some 114 standard TEI elements are excluded from
the DTD by means of parameter entity declarations like the following:
<!ENTITY % ab "IGNORE">
<!-- ... -->
<!ENTITY % xref "IGNORE">
The following is a complete list of standard TEI elements excluded in this way:
ab,
abbr,
add,
affiliation,
alt,
altGrp,
anchor,
argument,
authority,
back,
biblFull,
birth,
broadcast,
byline,
cb,
channel,
cit,
cl,
constitution,
correction,
dateline,
dateRange,
del,
derivation,
distinct,
div0,
div5,
div6,
div7,
divGen,
docAuthor,
docDate,
docEdition,
docImprint,
docTitle,
domain,
education,
emph,
epigraph,
equipment,
expan,
factuality,
firstLang,
foreign,
front,
fsdDecl,
funder,
gloss,
group,
headLabel,
headItem,
hyphenation,
index,
interp,
interpGrp,
interpretation,
join,
joinGrp,
kinesic,
link,
linkGrp,
m,
measure,
meeting,
metDecl,
mentioned,
milestone,
normalization,
notesStmt,
num,
opener,
orig,
personGrp,
phr,
postBox,
postCode,
preparedness,
principal,
purpose,
q,
quotation,
rendition,
residence,
rs,
reg,
scriptStmt,
seg,
segmentation,
series,
seriesStmt,
signed,
soCalled,
socecStatus,
span,
spanGrp,
sponsor,
state,
stdVals,
step,
street,
symbol,
textDesc,
time,
timeRange,
titlePage,
titlePart,
trailer,
variantEncoding,
when,
writing,
xptr,
xref.
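The IGNORE declarations shown before this list rely on two SGML conventions used by the TEI DTD: each element declaration sits inside a marked section controlled by a parameter entity named after the element, and when a parameter entity is declared more than once only the first declaration counts. The following is a simplified sketch of that arrangement for ab. It is not a verbatim extract from the TEI DTD, and the content model shown is a placeholder assumed for illustration.

```
<!-- In bncMods.ent, which is read before the TEI defaults -->
<!ENTITY % ab "IGNORE">

<!-- Later, in the TEI DTD: the default is declared, but the earlier
     declaration above takes precedence, so the value stays IGNORE -->
<!ENTITY % ab "INCLUDE">

<!-- The marked section is therefore skipped and ab is never declared
     (placeholder content model, assumed for illustration) -->
<![ %ab; [
<!ELEMENT ab - O (#PCDATA)>
]]>
```

Because the switch is a single entity per element, an element can be removed without editing the TEI DTD files themselves.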
Four elements in the TEI DTD are given different names in the BNC
DTD. For example, the TEI element speaker is renamed
spkr. The declarations below effect this and the other
renamings required, by changing the value of the relevant parameter
entity:
<!ENTITY % n.teiCorpus.2 "bnc">
<!ENTITY % n.TEI.2 "bncDoc">
<!ENTITY % n.p "para">
<!ENTITY % n.speaker "spkr">
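Renaming works because the TEI DTD does not use an element's name literally in its declarations, but refers to it through a parameter entity whose name begins with n. The sketch below shows the idea for speaker. It is a simplified illustration rather than a copy of the TEI declaration, and the content model is a placeholder assumption.

```
<!-- Set in bncMods.ent; the TEI default value would be "speaker" -->
<!ENTITY % n.speaker "spkr">

<!-- The TEI DTD declares the element through the entity, so this
     declaration now defines an element called spkr
     (placeholder content model, assumed for illustration) -->
<!ELEMENT %n.speaker; - O (#PCDATA)>
```

Content models elsewhere in the DTD that refer to the same entity pick up the new name automatically, so no other declaration needs to change.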
The next part of the bncMods.ent file contains
IGNORE declarations like those above, which remove the existing
definitions of 22 TEI elements that are to be redefined. The
redefinitions are provided in the second of the two BNC extension
files, bncMods.dtd, along with definitions for some new elements
not otherwise available. Their effects are summarized in the
following table.
Summary of differences between TEI and BNC
TEI Element   Difference in BNC DTD
TEI.2         Changed content model to allow either text or stext; renamed as bncDoc
activity      Simplified content model; added attribute
age           New element
align         New element
author        Simplified content model; added attributes
body          Simplified content model
c             Redefined to use endtag and shortref minimization
caption       New element
change        Changed content model
dialect       New element
div           Changed content model, specific to speech
div1          Simplified content model
div2          Simplified content model
div3          Simplified content model
div4          Simplified content model
item          Changed to disallow mixed content
loc           New element
p             New simplified element (TEI p is renamed para)
person        Simplified content model; added attributes
poem          New element
quote         Changed content model to disallow mixed content
recording     Simplified content model; added attributes
s             Simplified content model
shift         Mandatory attribute made optional
sp            Changed content model
stext         New element
text          Simplified content model
trunc         New element
unclear       Simplified content model; added attributes
w             Redefined to use endtag and shortref minimization
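A redefinition of this kind always involves a pair of declarations, one in each extension file. The sketch below illustrates the pattern for item. The content model shown is hypothetical and is not copied from bncMods.dtd; it is chosen only to show what disallowing mixed content means.

```
<!-- In bncMods.ent: suppress the standard TEI declaration of item -->
<!ENTITY % item "IGNORE">

<!-- In bncMods.dtd: supply the replacement declaration. This content
     model is hypothetical; it permits element content only, so
     character data can no longer appear directly inside item -->
<!ELEMENT item - O (p | list)+>
```

New elements need only the second declaration, since there is no existing TEI definition to suppress.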
Finally, as mentioned above, there are six elements defined in the
BNC DTD which do not appear in the TEI DTD. These must be added to the
appropriate element class in the TEI content model. The following
declarations in the bncMods.ent file have that effect:
<!ENTITY % x.chunk "p|">
<!ENTITY % x.common "caption|poem|unclear|">
<!ENTITY % x.divtop "align|">
<!ENTITY % x.seg "trunc|">
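Each of these declarations ends with a vertical bar because the TEI DTD places the extension entity in front of the standard members when it builds the corresponding class. The following sketch shows the pattern for the seg class. The list of standard members is abbreviated and should be read as an assumption, not as the actual TEI definition.

```
<!-- Set in bncMods.ent; the TEI default is the empty string -->
<!ENTITY % x.seg "trunc|">

<!-- In the TEI DTD the class is the extension entity followed by
     the standard members (abbreviated list, for illustration) -->
<!ENTITY % m.seg "%x.seg; s | w | c">

<!-- After expansion the class reads: trunc| s | w | c -->
```

With the default empty value the class contains only its standard members, which is why the separator has to be supplied by the extension itself.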
Detailed discussion of the extension mechanism and general
conformance issues relating to the use of the TEI is given in chapters
28 and 29 of the TEI Guidelines and is not further
discussed here. For an explanation of the mechanisms used above, the
detailed presentation of the general organization of the TEI DTD
provided in chapter 3 of the Guidelines may also be helpful.
Differences between the BNC DTD and CDIF
This section lists significant differences between the current
BNC DTD and CDIF 1.0. It lists elements whose names have been changed,
elements whose attributes have changed, and elements whose content has been
changed in such a way that CDIF-conformant files will not parse against the new
DTD.
The following CDIF elements have been given different names:
avail is now availability
biblscop is now biblScope
biblstr is now biblStruct
bibnote is now note
clasdecl is now classDecl
corr is now an item within encodingDesc
editdecl is now editorialDecl
ednstmt is now editionStmt
encdesc is now encodingDesc
header is now teiHeader
hyph is now an item within encodingDesc
partics is now particDesc
profdesc is now profileDesc
projdesc is now projectDesc
pubstmt is now publicationStmt
quot is now an item within encodingDesc
rec is now recording
recstmt is now recordingStmt
reg is now corr
relation is now particLinks
revdesc is now revisionDesc
segm is now an item within encodingDesc
settdesc is now settingDesc
srcdesc is now sourceDesc
titstmt is now titleStmt
txtclass is now textClass
The following elements have significantly different attributes:
activity has acquired the attribute spont, formerly present on
its parent setting
sic, gap, and corr have all acquired the attribute resp (rather than ed)
corr and sic no longer have a cause attribute
gap has acquired the attribute reason (in place of cause)
the complete attribute has been removed from text and stext
w has a different set of values for its type attribute
The content models for elements in the BNC DTD are generally less restrictive than
those of CDIF. In the following list, we specify only those elements where an
element conforming to the CDIF model would not also conform to the BNC model.
CDIF elements whose content model has changed
address
    CDIF model: #PCDATA
    BNC model:  (addrLine+ | (name | postBox | postCode | street)*)
avail
    CDIF model: #PCDATA
    BNC model:  (para)*
change
    CDIF model: (date, respStmt+)
    BNC model:  (date, respStmt+, para)
clasdecl
    CDIF model: (category+)
    BNC model:  (taxonomy+)
editdecl
    CDIF model: (corr | quot | hyph | segm | trans)+
    BNC model:  (para+)
ednstmt
    CDIF model: #PCDATA
    BNC model:  ((edition, respStmt*) | para+)
encDesc
    CDIF model: (projDesc, (sampDecl|editDecl)*, refsDecl+, tagsDecl?, clasDecl?)
    BNC model:  (projectDesc*, samplingDecl*, editorialDecl*, tagsDecl?, refsDecl*, classDecl*, para*)
imprint
    CDIF model: (pubPlace | name | date)+
    BNC model:  (pubPlace | publisher | date | biblScope)*
langusg
    CDIF model: #PCDATA
    BNC model:  (para | language)+
list
    CDIF model: (head*, (label?, item)+)
    BNC model:  (head?, (item* | (label, item)*))
monogr
    CDIF model: title+, (author | respStmt)*, (edition, respStmt?)*, imprint*, (bibNote | idno | biblScop)*
    BNC model:  ((((author | editor | respStmt)+, title+, (editor | respStmt)*) | (title+, (author | editor | respStmt)*))?, (note | meeting)*, (edition, (editor | respStmt)*)*, imprint, (imprint | extent | biblScope)*)
partics
    CDIF model: (person+, relation*)
    BNC model:  (para+ | (person+, particLinks?))
projdesc
    CDIF model: #PCDATA
    BNC model:  (para)+
refsdecl
    CDIF model: #PCDATA
    BNC model:  (para)+
sampdecl
    CDIF model: #PCDATA
    BNC model:  (para)+
setting
    CDIF model: (locName?, locale?, activity?)
    BNC model:  (para+ | (name | date | locale | activity)*)
sp
    CDIF model: (spkr*, (p | sp | bibl | caption | list | note | poem | quote)*) +(stage)
    BNC model:  (spkr?, (p | l | lg | poem | stage | note | caption)+)
text
    CDIF model: ((p | sp | bibl | caption | list | note | poem | quote)*, (div1)*) +(gap | lb | loc | pb | ptr)
    BNC model:  (body) +(gap | lb | milestone | pb | ptr)