Index Thomisticus Treebank

Issues in Building and Exploiting Latin Language Resources
Marco Passarotti
Università Cattolica del Sacro Cuore, Milan (Italy)
Outlook
• Specific issues of ancient languages and texts
• Language Resources for Latin:
– Annotated corpora
– NLP
– The Index Thomisticus Treebank
– IT-VaLex
• Exploiting Latin Language Resources:
– Latin Word Order
– From Syntax to Semantics: Textual Clustering (1)
– From Lexicon to Semantics: Textual Clustering (2)
• Is Latin a less-resourced language? What’s still missing?
Specific Issues of Ancient
Languages and Texts
• No native speakers
• Several interpretations of the same text
• Several versions of the same text
• No born-digital texts: relevance of the original source
• Dead language = “closed” corpus/lexicon
• Diachrony & Dialects: not just one Ancient Greek or Latin
• Representativeness: our data are just the tip of the iceberg
• Not the WSJ, but mostly literary and philosophical texts
• Not (just) for NLP purposes, but for the text itself
• Computational linguists not used to dealing with ancient languages
(LREC ’12: one paper on Latin!). BUT things are changing: see TLT,
ACRH etc.
• Pencil-and-paper scholars (Classicists) not used to dealing with digital
LRs, NLP tools and modern linguistic theories: difficult to find
students with the required expertise. BUT things are changing…
Language Resources
for Latin
Annotated Corpora
Treebanks
• Collaboration between CIRCSE and Perseus:
– Latin Dependency Treebank: Classical Latin (approx.
55,000 annotated tokens)
– Index Thomisticus Treebank: Thomas Aquinas opera
omnia (approx. 180,000 annotated tokens)
• PROIEL: University of Oslo
– Several translations of the New Testament: Latin,
Greek, Old Church Slavonic, Armenian, Gothic
(approx. 120,000 annotated tokens)
• All dependency-based (via PDT): common
guidelines
NLP Tools (1)
- Morphological analysers: Words (Whitaker),
Morpheus (Perseus), LEMLAT (ILC-CNR)
- Data-driven NLP (best rates)
Source of Data: Index Thomisticus Treebank
Training set: 61,024 tokens (2,820 sentences)
Test set: 7,379 tokens (329 sentences)
- PoS Tagging (HMM-based HunPos tagger):
• 96.75: coarse-grained PoS + fine-grained PoS
• 89.90: with morphological features
- Syntactic Parsing (DeSR):
• 80.02 (LAS); 85.23 (UAS); 87.79 (LA)
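The three parsing figures above are the usual dependency-parsing metrics: labeled attachment score (LAS), unlabeled attachment score (UAS) and label accuracy (LA). As a minimal sketch of how such scores are computed, assuming gold and predicted analyses are given as aligned lists of (head index, label) pairs, the toy data and the function name `attachment_scores` below are invented for illustration and are not taken from the DeSR evaluation itself:

```python
from typing import List, Tuple

# Each token is (head_index, dependency_label); index 0 stands for the root.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float, float]:
    """Return (LAS, UAS, LA) as percentages over two aligned token lists."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    n = len(gold)
    # UAS: the predicted head is correct, whatever the label.
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LA: the predicted label is correct, whatever the head.
    la = sum(g[1] == p[1] for g, p in zip(gold, pred))
    # LAS: both head and label are correct.
    las = sum(g == p for g, p in zip(gold, pred))
    return 100 * las / n, 100 * uas / n, 100 * la / n


# Invented four-token example using analytical functions named on the slides.
gold = [(2, "Sb"), (0, "Pred"), (2, "Obj"), (3, "Atr")]
pred = [(2, "Sb"), (0, "Pred"), (2, "Adv"), (2, "Atr")]
las, uas, la = attachment_scores(gold, pred)
print(f"LAS={las:.2f} UAS={uas:.2f} LA={la:.2f}")
```

In this toy case one token has the right head but the wrong label and another has the right label but the wrong head, so LAS is lower than both UAS and LA, the same ordering as in the reported figures.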
NLP Tools (2)
13 Centuries…
• IT-Train: 44,195 – IT-Test: 5,697
• LDT-Train: 47,662 – LDT-Test: 5,481
Parser: DeSR
The Index Thomisticus
Treebank
http://itreebank.marginalia.it
The Corpus
• Index Thomisticus (Busa):
– opera omnia of Thomas Aquinas
– 119 works + 61 of other authors
– approx. 11 million words
– morphologically tagged & lemmatized
• Index Thomisticus Treebank:
– Dependency-based = LDT & PROIEL
– approx. 180,000 words (10,000 sentences)
– from:
• Scriptum super Sententiis Magistri Petri Lombardi
• Summa contra Gentiles
• Summa Theologiae
From FGD to Annotation Layers
language is “a system of means of expression with some definite aim”
(Theses of the Prague Linguistic Circle, 1929)
• L0 (w) Words (tokens): automatic segmentation only
• L1 (m) Morphology: Tags (full morphology, 11 categories) + Lemma
• L2 (a) [FORM] Analytical Layer (surface syntax): dependency-based
Analytical dependency functions: Pred, Sb, Obj, Adv, Atr, Pnom…
• L3 (t) [MEANING] Tectogrammatical Layer (underlying syntax):
dependency-based
– Autosemantic words only (no function words or punctuation)
– Functors (valency): Arguments vs. Adjuncts
• Arguments: ACT, PAT, EFF, ADDR, ORIG
• Adjuncts (~ 50), semantically defined: LOC, TWHEN, MANN, COND,...
– Ellipsis resolution & Coreference (grammatical only: relative clauses,
control-modals, pronouns)
– Topic/focus articulation (deep word order)
In eodem enim instanti terminatur alteratio ad dispositionem quae
est necessitas, et generatio ad formam;
Dynamic Valency Lexicon
IT-VaLex
http://itreebank.marginalia.it/itvalex/
Valency
Number of obligatory
complementations of a word
• ‘arguments’ vs. ‘adjuncts’
• actants vs. circonstants
• ‘inner participants’ vs. ‘free modifications’
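The argument/adjunct distinction can be illustrated with a small sketch that splits the dependents of a verb by their tectogrammatical functor. The five argument functors are those listed on the annotation-layers slide; the function names and the toy dependents are invented, and treating every argument as part of the frame is a simplification (it ignores whether a complementation is obligatory), so this is not how IT-VaLex itself is built:

```python
from typing import List, Tuple

# Argument functors as listed for the tectogrammatical layer; every other
# functor (LOC, TWHEN, MANN, COND, ...) is treated here as an adjunct.
ARGUMENT_FUNCTORS = {"ACT", "PAT", "EFF", "ADDR", "ORIG"}

Dependent = Tuple[str, str]  # (lemma, functor)


def split_dependents(dependents: List[Dependent]) -> Tuple[List[Dependent], List[Dependent]]:
    """Split (lemma, functor) pairs into arguments and adjuncts."""
    arguments = [d for d in dependents if d[1] in ARGUMENT_FUNCTORS]
    adjuncts = [d for d in dependents if d[1] not in ARGUMENT_FUNCTORS]
    return arguments, adjuncts


def valency_frame(dependents: List[Dependent]) -> Tuple[str, ...]:
    """Return the sorted argument functors (simplified: obligatoriness is ignored)."""
    arguments, _ = split_dependents(dependents)
    return tuple(sorted(functor for _, functor in arguments))


# Invented occurrence of a verb of giving with three arguments and one adjunct.
occurrence = [("magister", "ACT"), ("liber", "PAT"),
              ("discipulus", "ADDR"), ("hodie", "TWHEN")]
frame = valency_frame(occurrence)
print(frame)
```

Collecting such frames over every annotated occurrence of a lemma is one way to picture what a corpus-driven ("dynamic") valency lexicon records.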
By Complex Query
One Output
Exploiting
Latin Language Resources
Latin Word-order
OSV OVS SOV SVO VOS VSO
Caesar (35)
Cicero (165)
Jerome (186)
Petronius (425)
0.00
0.02
0.12
0.05
0.00
0.02
0.12
0.05
0.23 0.02
0.14 0.08
0.20 0.20
0.62 0.10 0.03
0.24 0.22 0.01
0.20 0.22 0.12
0.02
0.20
0.11
Propertius (217) 0.14 0.18 0.12 0.32 0.09 0.15
Thomas (2,865) 0.04 0.08 0.19 0.60 0.05 0.04
Sallust (397)
Ovid (182)
Vergil (82)
0.20
0.18
0.03
0.12
0.03
0.03
0.04
0.05
0.69
0.64
0.04
0.46
0.09
0.09
0.73
0.28
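A table of this kind gives, for each author, the relative frequency of the six possible orders of subject, verb and object. The following sketch shows one way such proportions could be derived from a treebank, assuming each clause has already been reduced to the linear positions of its Sb, Pred and Obj nodes. The positions below are invented toy data and the function names are hypothetical; the slides do not describe the actual extraction procedure:

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

ORDERS = ("OSV", "OVS", "SOV", "SVO", "VOS", "VSO")


def order_label(sb_pos: int, pred_pos: int, obj_pos: int) -> str:
    """Name the linear order of subject (S), verb (V) and object (O)."""
    slots = sorted([(sb_pos, "S"), (pred_pos, "V"), (obj_pos, "O")])
    return "".join(tag for _, tag in slots)


def order_distribution(clauses: Iterable[Tuple[int, int, int]]) -> Dict[str, float]:
    """Relative frequency of each order, rounded to two decimals."""
    counts = Counter(order_label(*clause) for clause in clauses)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no clauses given")
    return {order: round(counts[order] / total, 2) for order in ORDERS}


# Invented clauses as (subject position, verb position, object position).
toy_clauses = [(1, 3, 2), (1, 2, 3), (1, 3, 2), (3, 1, 2)]
distribution = order_distribution(toy_clauses)
print(distribution)
```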
From Syntax to Semantics.
Textual Clustering (1)
R Environment for Statistical Computing
Package: cluster (function DIANA)
Clustering
• deals with finding a structure in a
collection of (un)labeled data
• the process of organizing objects into
groups (clusters) whose members are
similar in some way
• a cluster is a collection of objects which
are “similar” to each other and are
“dissimilar” to the objects belonging to
other clusters
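DIANA is a divisive method: it starts from one cluster containing every object and splits it step by step. The sketch below illustrates a single split in the spirit of that procedure, written in plain Python rather than R. It is a simplified illustration, not the code of the R `cluster` package, and the one-dimensional toy points are invented:

```python
from typing import List, Sequence, Tuple


def avg_dissim(i: int, group: Sequence[int], d: Sequence[Sequence[float]]) -> float:
    """Average dissimilarity between object i and the other members of group."""
    others = [j for j in group if j != i]
    return sum(d[i][j] for j in others) / len(others) if others else 0.0


def diana_split(cluster: List[int], d: Sequence[Sequence[float]]) -> Tuple[List[int], List[int]]:
    """Split one cluster into a splinter group and a remainder."""
    if len(cluster) < 2:
        raise ValueError("need at least two objects to split")
    remainder = list(cluster)
    # Seed the splinter group with the object least similar to all the others.
    seed = max(remainder, key=lambda i: avg_dissim(i, remainder, d))
    remainder.remove(seed)
    splinter = [seed]
    while len(remainder) > 1:
        # Positive gain: the object is closer to the splinter group than to the remainder.
        gains = {i: avg_dissim(i, remainder, d) - avg_dissim(i, splinter, d)
                 for i in remainder}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        remainder.remove(best)
        splinter.append(best)
    return sorted(splinter), sorted(remainder)


# Invented one-dimensional objects; dissimilarity is the absolute difference.
points = [0.0, 0.1, 0.2, 5.0, 5.1]
d = [[abs(a - b) for b in points] for a in points]
splinter, remainder = diana_split(list(range(len(points))), d)
print(splinter, remainder)
```

Applying the same split repeatedly to the resulting groups yields the top-down hierarchy that a divisive clustering plot displays.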
Textual Clustering for WSD
• Distributional Hypothesis (Harris, 1954)
words that are used in similar contexts
tend to have the same or related meanings
• Firth (1957)
“You shall know a word by the company it
keeps”
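To make the distributional hypothesis concrete, the sketch below builds a context vector for each word (counts of the words found within a small window) and compares vectors by cosine similarity. The three short Latin "sentences" are invented toy data, and the window size and function names are assumptions made for illustration; the clustering shown on the slides was carried out in R on treebank data:

```python
import math
from collections import Counter
from typing import Dict, List


def context_vectors(sentences: List[List[str]], window: int = 2) -> Dict[str, Counter]:
    """Count, for each word, the words occurring within `window` positions of it."""
    vectors: Dict[str, Counter] = {}
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            context = sent[lo:i] + sent[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented toy corpus: 'forma' and 'figura' share their contexts, 'anima' does not.
corpus = [["forma", "rei", "naturalis"],
          ["figura", "rei", "naturalis"],
          ["anima", "movet", "corpus"]]
vectors = context_vectors(corpus)
sim_forma_figura = cosine(vectors["forma"], vectors["figura"])
sim_forma_anima = cosine(vectors["forma"], vectors["anima"])
print(sim_forma_figura, sim_forma_anima)
```

Words with near-identical context vectors are candidates for the same or related meanings, which is the intuition behind clustering the occurrences of a polysemous word.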
Clustering forma in the IT-TB
Lemma forma
• 18,357 occurrences in the IT
• 5,191 occurrences of forma in the IT-TB
• a ‘technical’ word in Thomas, showing high
polysemy
• 4 main meanings in the lexicon of Thomas
by Deferrari & Barry (1948-1949):
• “form, shape”, synonym of figura
• “form”, the actualizing principle that makes
a thing to be what it is
• “mode, manner”
• “formula”
The Distribution of forma -GsB-
The Distribution of forma -GsB- (tag: 6)
From Lexicon to Semantics.
Textual Clustering (2)
R Environment for Statistical Computing
Packages: tm, RTextTools, Deducer(Text), lsa
…you shall know a text by the words it keeps
Jerome
Thomas
Seneca: Tragedies
Seneca: Dialogues
dist = euclidean - hclust = ward
Jerome – Vulgata: LSA
Is Latin
a less-resourced language?
What’s still missing?
A BLaRK-like Set for Latin
Modules and Tools
• Text pre-processing: named-entity recognition
• Lemmatization and morphological disambiguation: PoS taggers (diachrony)
• Syntactic analysis: parsers and shallow parsing (diachrony)
• Anaphora and ellipsis resolution
• Semantic and pragmatic annotation: coreference, semantic roles, TFA
Applications
• Entering and acquiring information: digitization & OCR systems (images of
original sources)
• Against sparsity: common on-line infrastructure for ancient languages LRs
• e-learning facilities for teaching ancient languages with LRs and NLP tools
Data
• Texts:
• more treebanked data from more eras
• TGTS-like annotated texts
• aligned translation(s)
• Lexica (mono-/multilingual):
• semantic-based valency lexicon: semantic roles + semantic features of
the arguments
• word-formation-based lexicon
…heigh-ho, heigh-ho,
it's off to work we go!
