MDF and its Applications

Sebastian Drude & Irina Nevskaya
Goethe-Universität Frankfurt
RELISH / Lexicon Meeting Nijmegen July 2010
MDF and its Applications
1. MDF: what is it?
2. Organization of the MDF-format
3. Advantages, problems with MDF
4. Applications and conversions
5. MDF in the RELISH project: Udi
1. MDF: what is it?
• Originally, the Multi-Dictionary Formatter (MDF)
was an independent computer program
• It converted certain files in Standard Format
into RTF (to be further processed and printed
with office software)
• Today it is part of the Toolbox (formerly Shoebox)
program, in the form of Consistent Changes tables
(*.cct, complex scripts for search-and-replace
routines) and MS Word template files (*.dot)
1. MDF: what is it?
Standard Format (SF) is a very old text format
developed by SIL with minimal mark-up:
• The content is organized in “fields”
• Each field consists of a “marker” (a newline
followed by a backslash and a sequence
of letters, hyphens, digits etc.) and
the “field content” (free text), separated
from the marker by a space character
• This is a simple feature–value structure
[Figure: a “Standard Format” data file, with labels marking an entry, a field, a field marker and the field content]
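The field structure described above can be read with a few lines of code. The following is a minimal sketch, not the Toolbox implementation: it assumes that a new entry starts at each \lx field and that lines without a marker continue the previous field. The function name `parse_sf` and the sample words are invented for illustration.

```python
import re

# A marker is a backslash at the start of a line, followed by letters,
# digits, hyphens or underscores; the field content follows after a space.
FIELD_RE = re.compile(r"^\\([A-Za-z0-9_-]+)(?:[ \t]+(.*))?$")

def parse_sf(text, record_marker="lx"):
    """Split Standard Format text into entries (lists of (marker, content))."""
    entries = []
    current = None
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            marker = match.group(1)
            content = (match.group(2) or "").strip()
            if marker == record_marker:
                # Assumption: the record marker opens a new entry.
                current = []
                entries.append(current)
            if current is not None:
                current.append((marker, content))
        elif current and line.strip():
            # Assumption: a line without a marker continues the previous field.
            marker, content = current[-1]
            current[-1] = (marker, (content + " " + line.strip()).strip())
    return entries

# Invented sample data, for illustration only.
sample = (
    "\\lx kitab\n"
    "\\ps n\n"
    "\\ge book\n"
    "\\xv first line of a long\n"
    "example sentence\n"
    "\\lx su\n"
    "\\ps n\n"
    "\\ge water\n"
)
entries = parse_sf(sample)
```

Fields that precede the first \lx (such as a file header) are skipped in this sketch.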
1. MDF: what is it?
• The MDF program uses a certain SET of markers,
representing typical data categories used in
traditional lexicography
• Properties of the fields (language etc.)
and a minimal hierarchical structure, expressed through
an “is-below” relation, are kept in a separate
“.typ” (type) file, which is also in SF
• In this sense, a file in MDF format is a (SF)
text file which uses the MDF set of markers
(in the MDF hierarchical organization)
[Figure: MDF.typ (config file), showing for each marker its definition, description, language and position in the hierarchy]
2. Organization of the MDF-format
• There are currently about 100 markers directly
supported by MDF (“MDF-fields”)
• The basic hierarchy is:
\lx (lexeme)
  └˃ \se (sub-entry)
    └˃ \ps (part of speech)
      └˃ \sn (sense number)
• Other hierarchies may be supported, or used to be:
( \lx > \se > \sn > \ps or \lx > \sn > \ps > \se )
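Because SF has no closing tags, the nesting has to be inferred from the order of the fields. The sketch below shows one way to do this for the basic hierarchy \lx > \se > \ps > \sn. It is an illustration under stated assumptions, not how Toolbox works internally: the node layout and the function name `nest` are invented, and every field that is not one of the four structural markers is simply attached to the innermost open node.

```python
# Level of each structural marker in the basic hierarchy \lx > \se > \ps > \sn.
LEVELS = {"lx": 0, "se": 1, "ps": 2, "sn": 3}

def nest(fields):
    """Turn a flat list of (marker, content) pairs into a nested tree.

    Each node is a dict with 'marker', 'content', 'fields' (non-structural
    fields attached to the node) and 'children' (lower structural nodes).
    """
    root = None
    stack = []  # currently open structural nodes, outermost first
    for marker, content in fields:
        if marker in LEVELS:
            node = {"marker": marker, "content": content,
                    "fields": [], "children": []}
            # A new structural field implicitly closes every open node
            # at the same or a deeper level.
            while stack and LEVELS[stack[-1]["marker"]] >= LEVELS[marker]:
                stack.pop()
            if stack:
                stack[-1]["children"].append(node)
            elif root is None:
                root = node
            else:
                raise ValueError("more than one top-level field in entry")
            stack.append(node)
        elif stack:
            stack[-1]["fields"].append((marker, content))
    return root

# Invented sample entry, for illustration only.
tree = nest([
    ("lx", "a"), ("ps", "n"), ("sn", "1"), ("ge", "x"),
    ("sn", "2"), ("ge", "y"), ("se", "ab"), ("ps", "v"), ("ge", "z"),
])
```

Supporting one of the alternative hierarchies would only require a different `LEVELS` table, which is also why data written for one hierarchy can silently be misread under another.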
1. MDF: what is it?
MDF is documented by the book:
Coward, David F. & Grimes, Charles E. (2000).
Making Dictionaries: A guide to lexicography
and the Multi-Dictionary Formatter. Waxhaw,
North Carolina: SIL International (1st ed. 1995)
URL: http://www.sil.org/computing/shoebox/MDF_2000.pdf
http://www.sil.org/computing/shoebox/MDF_Updates.html
2. Organization of the MDF-format
Several fields can be repeated for up to four different languages,
where “..” → v = vernacular, e = English, n = national, r = regional
• \ps, \pn – part of speech for main entry word (English, national)
• \g.. – gloss for main entry word
• \d.. – definition for main entry word
• \re, \rn, \rr – reverse (for indexes)
• \we, \wn, \wr – word-level gloss
• \x.. – example (sentence and translations)
• \e.. – encyclopedic information
• \u.. – usage information
• \o.. – only (restriction) information
• (\va), \ve, \vn, \vr – variant form comment
• (\cf), \ce, \cn, \cr – cross reference gloss
• (\lf), \le, \ln, \lr – “lexical function” (gloss for related word)
• \pd.. – “paradigm” (gloss for –irregular– form)
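The “..” notation in the list above stands for a language suffix. As a small illustration, the following sketch expands such a pattern into concrete marker names. The function name `expand` is invented, and which suffixes a given field actually accepts differs per field, so the caller passes them explicitly.

```python
# Language suffixes as given on the slide.
LANG_SUFFIXES = {"v": "vernacular", "e": "English", "n": "national", "r": "regional"}

def expand(pattern, suffixes="venr"):
    """Expand a pattern such as 'g..' into concrete marker names."""
    if not pattern.endswith(".."):
        return [pattern]
    stem = pattern[:-2]
    return [stem + suffix for suffix in suffixes]

gloss_markers = expand("g..", "enr")     # gloss in English, national, regional
example_markers = expand("x..", "venr")  # example sentence and its translations
```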
2. Organization of the MDF-format
Some 20 fields are discouraged:
• \an (antonym), \sy (synonym) are to be substituted
by the \lf (lexical function), \lfv (lexical function vernacular),
\lf.. (lexical function gloss) fields
• \sg (singular), \pl (plural), \1s (first person singular) etc.
are to be substituted by the \pdl (paradigm form label),
\pdv (paradigm form vernacular), \pd.. (paradigm form gloss)
fields (not yet in the documentation)
Two fields (\dt, \st) are administrative fields
So there are only about 50 genuinely different MDF fields
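Replacing the discouraged fields can be done mechanically. The sketch below rewrites \sy and \an into an \lf + \lfv pair and \sg, \pl, \1s into a \pdl + \pdv pair. The label values (“Syn”, “Ant”, “sg”, ...) and the function name `migrate` are assumptions made for illustration; a project would use its own labels, or the one-line style `\lf Syn = ...` if it prefers that.

```python
# Assumed labels for the lexical functions that replace \sy and \an.
LF_LABELS = {"sy": "Syn", "an": "Ant"}
# Assumed labels for the paradigm forms that replace \sg, \pl, \1s.
PD_LABELS = {"sg": "sg", "pl": "pl", "1s": "1s"}

def migrate(fields):
    """Rewrite discouraged fields of one entry into \\lf and \\pd bundles."""
    result = []
    for marker, content in fields:
        if marker in LF_LABELS:
            result.append(("lf", LF_LABELS[marker]))
            result.append(("lfv", content))
        elif marker in PD_LABELS:
            result.append(("pdl", PD_LABELS[marker]))
            result.append(("pdv", content))
        else:
            result.append((marker, content))
    return result

# Invented sample entry, for illustration only.
migrated = migrate([("lx", "a"), ("sy", "b"), ("pl", "c")])
```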
2. Organization of the MDF-format
• Some of the fields form blocks/groups via the hierarchy,
for instance:
• \lf (lexical function, relations to other entries)
└˃ \lfv related form, \lf.. gloss of rel. form (Engl., nat., reg.)
• \pd (Paradigm information & irregular forms)
└˃ \pdl pdg. label, \pdv pdg. form, \pd.. pdg. gl. (Engl., nat., reg.)
• \rf (reference to an example)
└˃ \xv example form in the vernacular
└˃ \x.. translation of the example (Engl., nat., reg.)
• \cf (cross-reference form)
└˃ \c.. cross-reference gloss (Engl., nat., reg.)
• \va (variant form)
└˃ \v.. comment on variant form (Engl., nat., reg.)
3. Advantages, problems with MDF
Advantages:
• Very flexible SF database format
(optional fields, repeated fields etc.)
• Quite exhaustive for standard lexicography
in field research on minority languages
• Is a de-facto standard, although Toolbox
is no longer officially supported by SIL
(now replaced by FieldWorks / FLEX)
3. Advantages, problems with MDF
General problems:
• Flexibility of SF allows for inconsistencies
• The order of sister fields is only recommended, not enforced
• Almost always extended and adjusted
arbitrarily by individual users
(MDF-derived / MDF-based formats)
• Changes in the hierarchy in the configuration are
not reflected in the data file and vice versa
• Missing closing tags in SF impair conversions
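Since the data file and the hierarchy in the configuration are independent of each other, mismatches have to be found by a separate check. The following sketch flags fields that occur in an entry before the field they are supposed to be below. The `parents` table is a hypothetical, partial stand-in for the “is-below” information of a .typ file, and markers missing from it are ignored.

```python
def check_entry(fields, parents):
    """Return (position, marker, message) for fields that precede their parent.

    fields  -- list of (marker, content) pairs of one entry
    parents -- maps a marker to the marker it is below (None for the top)
    """
    problems = []
    seen = set()
    for position, (marker, _content) in enumerate(fields):
        parent = parents.get(marker)
        if parent is not None and parent not in seen:
            problems.append(
                (position, marker, f"appears before its parent \\{parent}")
            )
        seen.add(marker)
    return problems

# Hypothetical, partial hierarchy; a real one would be read from the .typ file.
parents = {"lx": None, "ps": "lx", "sn": "ps", "ge": "sn"}
problems = check_entry(
    [("lx", "a"), ("ge", "x"), ("ps", "n"), ("sn", "1"), ("ge", "y")],
    parents,
)
```

The check is deliberately loose: it only requires that the parent occurred somewhere earlier in the entry, not that it is still “open”.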
3. Advantages, problems with MDF
Specific problems in the RELISH project:
• \ph (phonetic form) is too generic, it would
be needed in several different contexts
(\cf, \va, \pdv, \lfv…)
• \lt (literal meaning) exists only for head word,
it would be needed for borrowed words etc.
• Even the three gloss languages (English, national, regional) are not sufficient
• It should be possible to set a “language” property
for arbitrary fields
3. Advantages, problems with MDF
Specific problems in the RELISH project:
• No clear solution for covering several dialects
• In particular if no dialect is “standard”
• Different solutions:
– \ue (usage information)
– \oe (only / restriction)
– \ns (notes on sociolinguistics, varieties)
– \lf SynD = … (lexical function “Dialectal Synonym”)
– \va & \ve (variant form and English comment)
• Most of these solutions only hold for the head word;
we would need dialect marking for \lx, \xv, \va, …
3. Advantages, problems with MDF
Comment on dialect problem in MDF book:
“We intend future enhancements of MDF to
have fields dedicated to dialectal information,
but at present the programming limitations do
not allow us any more field bundles.
For the present, use \va and \lf SynD =.”
(footnote, p. 23)
4. Applications and conversions
“Applications” (of the format) may have
different meanings:
• For different languages / dictionary projects
• For transformations / conversions:
– print-dictionaries (via Toolbox, MDF, Word / RTF)
– HTML (Lexique Pro)
– XML (Toolbox export)
– LMF – XML (Lexus import)
– FLEX database
4. Applications and conversions
Problems with all conversions:
• What happens with inconsistencies?
• What happens with different orders
of same-level-fields?
• What happens with additional (non-MDF) fields?
• What happens with sub-entries?
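One cautious answer to the questions about field order and additional fields is to convert without discarding anything. The sketch below writes an entry to XML, keeps the original field order, and marks non-MDF fields instead of dropping them. The element and attribute names are invented and do not reproduce the Toolbox XML export or the LMF format.

```python
import xml.etree.ElementTree as ET

def entry_to_xml(fields, known_markers):
    """Convert one entry to an <entry> element, one <field> per SF field."""
    entry = ET.Element("entry")
    for marker, content in fields:
        # The marker is stored as an attribute, because markers such as
        # "1s" are not valid XML element names.
        field = ET.SubElement(entry, "field", marker=marker)
        if marker not in known_markers:
            field.set("status", "non-MDF")
        field.text = content
    return entry

# Invented sample entry; "dia" stands for a project-specific extra field.
root = entry_to_xml(
    [("lx", "a"), ("dia", "north"), ("ge", "x")],
    known_markers={"lx", "ps", "ge"},
)
xml_text = ET.tostring(root, encoding="unicode")
```

A flat representation like this postpones the decisions about hierarchy and sub-entries to a later step, where they can be made explicitly.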
5. MDF in the RELISH project: Udi
• Digital representation of a print dictionary,
with additions
• Main problem: several languages:
– Udi (v)
– Azerbaijani (Cyrillic) (n1)
– Azerbaijani (Latin) (n1lat) (addition)
– Georgian (n2)
– Russian (r)
– English (e) (addition)
5. MDF in the RELISH project: Udi
• The Udi Toolbox database uses 53 fields
• of these, 14 are standard MDF fields
• 11 are MDF fields which have a slightly
different position in the hierarchy
• 28 fields are additional fields
– most (19) of these are for adjusting
the additional “languages” (and scripts)
– 5 are for additional phonetic representations
\lx
. \hm
. \se
. \mn
. . \mn-ph
. . \ph
. . \a
. . . \a-ph
. . \bw
. . \ns
. . \ng
. . \va
. . . \va-ph
. . . \va-ns
. . \pl
. . . \pl-ph
. . \ee
. . . \er
. . \lt
. . . \lte
. . \ps
. . . \pr
. . . \sn
. . . . \nt
. . . . \gn1
. . . . . \dn1
. . . . . \ltn1
. . . . . \nan1
. . . . . \gn1lat
. . . . . . \dn1lat
. . . . . . \ltn1lat
. . . . . . \nan1lat
. . . . \gr
. . . . . \dr
. . . . . \ltr
. . . . . \nar
. . . . \gn2
. . . . . \dn2
. . . . . \ltn2
. . . . . \nan2
. . . . \ge
. . . . . \de
. . . . . \oe
. . . . \xv
. . . . . \xv-ph
. . . . . \x-ns
. . . . . \xn1
. . . . . . \xn1lat
. . . . . \xr
. . . . . \xn2
. . . . . \xe
. \dt
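Many of the additional Udi markers combine a field stem with one of the project's language codes (v, n1, n1lat, n2, r, e). The sketch below splits such a marker into its two parts. The set of stems is our reading of the hierarchy and is an assumption, as is the function name `split_marker`. Longer codes are tried first, so that “n1lat” is not mistaken for another code.

```python
# Language codes of the Udi project, longest first.
LANG_CODES = ["n1lat", "n1", "n2", "v", "r", "e"]
# Field stems that appear with a language code (assumed reading of the hierarchy).
STEMS = {"g", "d", "lt", "na", "x"}

def split_marker(marker):
    """Return (stem, language code), or (marker, None) if no split applies."""
    for code in LANG_CODES:
        if marker.endswith(code):
            stem = marker[: -len(code)]
            if stem in STEMS:
                return stem, code
    return marker, None

pairs = [split_marker(m) for m in ["gn1lat", "ltn2", "xv", "er", "xv-ph"]]
```

Restricting the split to known stems keeps markers such as \er or \se from being misread as “stem plus language”.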
MDF-LEXUS conversion
1. From a printed dictionary to a markup text file
2. From a markup text file to the MDF structure in the Toolbox
environment
3. From the MDF structure to the LEXUS structure
Step 1. From a printed dictionary to a markup text file
Step 2. From a markup text file to the MDF structure
in the Toolbox environment
• Establishing correlations between the different sign combinations and
their linguistic counterparts
• Establishing the structure of the MDF markers and their hierarchies
• Consistency checks:
• Cross-reference failures:
– absence of the head word
– absence of the variant
• Numerous spelling mistakes
• Numerous mistakes in the Russian and English translations
• Inconsistencies in contrasting subentries and examples
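Cross-reference failures of the kind listed above can be found automatically. The sketch below reports every \cf whose target is neither a head word (\lx) nor a variant (\va) anywhere in the database. It is a simplified illustration: the function name is invented, homonym numbers (\hm) are ignored, and targets are compared as exact strings.

```python
def find_broken_references(entries, ref_markers=("cf",)):
    """List (head word, marker, target) for references to missing entries.

    entries -- list of entries, each a list of (marker, content) pairs
    """
    # Every head word and every variant counts as a valid target.
    targets = set()
    for fields in entries:
        for marker, content in fields:
            if marker in ("lx", "va"):
                targets.add(content.strip())
    broken = []
    for fields in entries:
        head = next((c for m, c in fields if m == "lx"), None)
        for marker, content in fields:
            if marker in ref_markers and content.strip() not in targets:
                broken.append((head, marker, content.strip()))
    return broken

# Invented sample data, for illustration only.
broken = find_broken_references([
    [("lx", "a"), ("va", "a2"), ("cf", "b")],
    [("lx", "b"), ("cf", "a2"), ("cf", "c")],
])
```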
Step 3. From the MDF structure to the LEXUS structure
5. MDF in RELISH: Udi into Lexique Pro
From the MDF to the FLEX structure
• Defining writing systems
– Problems with introducing digraphs and the corresponding
sort orders
• Defining import properties
– Problems matching markers, due to different markers
and their hierarchies
– Import failures
• 2 attempts:
– project Udi 1
– project Udi 2
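The sort-order problem with digraphs can be illustrated outside FLEX. In the sketch below a word is split into the letters of the alphabet, preferring the longest match, and sorted by the position of those letters. The alphabet shown is hypothetical and is not the Udi alphabet; the function name `make_sort_key` is invented as well.

```python
def make_sort_key(alphabet):
    """Build a sort key function for an alphabet that may contain digraphs."""
    rank = {letter: index for index, letter in enumerate(alphabet)}
    longest = max(len(letter) for letter in alphabet)

    def key(word):
        ranks = []
        i = 0
        while i < len(word):
            # Prefer the longest letter of the alphabet that matches here.
            for size in range(longest, 0, -1):
                chunk = word[i:i + size]
                if chunk in rank:
                    ranks.append(rank[chunk])
                    i += len(chunk)
                    break
            else:
                # Characters outside the alphabet sort after all its letters.
                ranks.append(len(alphabet) + ord(word[i]))
                i += 1
        return ranks

    return key

# Hypothetical alphabet in which the digraph "ch" is a letter after "c".
key = make_sort_key(["a", "c", "ch", "d", "i"])
ordered = sorted(["cha", "ci", "da"], key=key)
```

With a plain character-by-character sort, “cha” would come before “ci”; treating “ch” as one letter after “c” reverses that.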
Attempt 1: project Udi 1 (import residue)
1. Defining writing systems
2. Defining the file format
3. Language mapping
4. Content mapping
5. Content mapping (continued)
6. Key markers
7. Readiness check
Attempt 2: project Udi 2
1. Encoding writing systems
2. Defining the file format
3. Language mapping
4. Content mapping
5. Content mapping (continued)
6. Defining custom fields
7. Modifying mapping
8. Defining key markers
9. Readiness check
10. Import preview results
11. Import preview results (continued)
12. Ready to import
13. Import failures