academic domain words - Japanese Lexical Analyzer

Tatsuhiko Matsushita
LALS, Victoria University of Wellington
Main findings
• VDRJ is useful for designing curriculum (material, tests etc.)
• The more domains a word is shared by as AW or LAD, the more abstract the meaning of the word is.
• Conversation and non-academic texts contain more general words and LW
• Academic texts: more AW and LAD but less LW in any academic domain
• Wikipedia: more proper nouns and low frequency words
• Newspapers and academic items of Wikipedia can be a good resource for learning AW and LAD.
• Natural science texts contain more academic domain words at lower frequency levels than arts and social science texts
• Origins of academic and literary words are quite clearly separated; 3/4 of LW originate in Japanese while 3/4 of AW and LAD originate in Chinese
• LAD contains more Western origin words (Gairaigo)
Contents
1. Motive for this research
2. Goals of this presentation
3. Vocabulary Database for Reading Japanese
4. Tiers of Japanese vocabulary
(Basic words, academic words, limited-academic domain words, literary words)
5. Text coverage by word tier
6. Proportions of word origin types by word tiers
7. Number of characters required to cover the word tiers
7. Implications from the findings
8. Conclusion
1. Motive for this research
How efficiently can we learn vocabulary?
• Learning burden is big!
• More effective choice of target words
• More efficient order for learning the words
Effective choice and efficient order: to maximize the
coverage of text which the learner would encounter in
his/her domain
= Reading comprehension and lexical density
(Hu & Nation, 2000; Komori et al., 2004)
⇒ Q. What words should learners learn first?
And second and next?
Studies on EAP vocabulary
• Basic: General Service List (West, 1953)
• Academic: AWL (Coxhead, 2000)
UWL (Xue & Nation, 1984)
• EGAP-A/S, EGAP-HM/SS etc.
(Tajino, Dalsky, & Sasao, 2009)
• Science-specific Word List (Coxhead & Hirsh, 2007)
• Technical: e.g. Chung (2003)
• Literary vocabulary?
Studies on JAP vocabulary
• Basic: The former JLPT list, Tamamura (1987) etc.
• Academic: Butler (2010), Matsushita (2011)
•?
• Technical: Komiya (1995), Oka (1992) etc.
• Others
• No list for words between academic and technical words
• Literary vocabulary?
2. Goals of this presentation
To introduce
I. the Vocabulary Database for Reading Japanese
II. extracted domain-specific words such as Academic
Words (AW), Limited-Academic-Domain Words
(LAD), Literary Words (LW)
To argue about
IV. how the word tiers work in different types of text
(register variation)
V. how learner’s language background possibly
affects the understanding of texts in different
genres
3. Vocabulary Database for Reading Japanese
• Vocabulary Database for Reading Japanese (VDRJ) (Matsushita, 2010; 2011)
• Created from the Balanced Corpus of Contemporary Written Japanese, 2009 monitor version (NINJAL, 2009)
• 33 million tokens (28 million from books and 5 million from the Internet forum site Yahoo Chiebukuro)
• 19 million content words and 14 million function words
• Unit of counting: Lexeme – considerably inclusive but less inclusive than the word family (Level 6 in Bauer & Nation, 1993) in English
• “Short units of lexemes” are ranked by U (usage coefficient) (Juilland & Chang-Rodriguez, 1964)
• Short unit of lexeme: more inclusive than “lemma”, less inclusive than “word family”
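The ranking by usage coefficient can be illustrated with a minimal sketch. It assumes Juilland's standard definitions, D = 1 − V/√(n−1) and U = F × D, computed over equally sized sub-corpora. The function names and the sample counts are hypothetical, and VDRJ's own computation (for instance its handling of unequal sub-corpus sizes) may differ.

```python
import math

def juilland_d(freqs):
    """Juilland's dispersion D over n equally sized sub-corpora (1 = perfectly even)."""
    n = len(freqs)
    if n < 2:
        raise ValueError("need at least two sub-corpora")
    mean = sum(freqs) / n
    if mean == 0:
        return 0.0
    # Population standard deviation of the sub-corpus frequencies
    sd = math.sqrt(sum((f - mean) ** 2 for f in freqs) / n)
    v = sd / mean  # coefficient of variation
    return 1.0 - v / math.sqrt(n - 1)

def usage_coefficient(freqs):
    """U = total frequency F multiplied by dispersion D."""
    return sum(freqs) * juilland_d(freqs)

# Hypothetical counts of two words across four sub-corpora
evenly_spread = [10, 10, 10, 10]
one_domain_only = [40, 0, 0, 0]
print(usage_coefficient(evenly_spread), usage_coefficient(one_domain_only))
```

Both hypothetical words have the same raw frequency, but the word confined to one sub-corpus is downgraded, which is the effect the dispersion term is meant to have.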
Some problems of existing Japanese word
frequency lists
• Lack of representativeness
• Too old
• The corpus size is not large enough:
low reliability for low frequency words
• No good sub-corpus frequency data that would enable us to
calculate dispersion and downgrade unevenly
distributed words
Advantages of word lists
* Various types of word lists can be created from the
vocabulary database (VDRJ)
A) Reference for developing vocabulary tests
= Checking learners’ vocabulary levels
B) Reference for checking vocabulary level of material
= Checking vocabulary levels of materials
⇒ Specify vocabulary for learners to learn and for teachers
to teach
C) For better choice of material, modification of text
Cf. Nation (2011), Word profiler
How to make VDRJ
A) Method
I. Classify all the texts into some sub corpora to
see the range and dispersion
cf. Nippon Decimal Classification, BCCWJ
(NINJAL, 2009)
II. Segment all the texts into words with a
morphological analyzer and a dictionary (if the
text is not already segmented by spaces
between words)
cf. MeCab, UniDic
III. Make word lists by AntConc and/or
AntWordProfiler
Content and construct of VDRJ
• Vocabulary Database for Reading Japanese
• The list is for reading as it is made from written
corpus of books and internet forum sites
• Written and spoken languages are different in word
frequency, domain and required language processing
skills
⇒ A good corpus of spoken language is necessary to
develop a good word list for it (but there is no very
good corpus of spoken Japanese…)
The Classification of Domains and Fields (Corpus from books and internet forum sites, BCCWJ 2009 monitor version)
Domain/Field | The ten domains | Code for the ten domains | The 28 academic field code | Notes
Literary Works/Imaginative Texts | Literary works | LW | a6_G | All classified as general texts of a6
Humanities and Arts
  Languages and Linguistics | Languages, Linguistics and Philosophy | LP | a1 |
  Philosophy and Religion | Languages, Linguistics and Philosophy | LP | a2 |
  History | History and Ethnology | HE | a3 |
  Ethnology | History and Ethnology | HE | a4 |
  Fine Arts | Arts and Other Humanities | AH | a5 |
  Literature (non-imaginative texts e.g. critique) | Arts and Other Humanities | AH | a6_T | All classified as technical texts of a6
  Other Humanities and Arts | Arts and Other Humanities | AH | a7 |
Social Sciences
  Politics | Politics and Law | PL | s1 |
  Law | Politics and Law | PL | s2 |
  Economics | Economics and Commerce | EC | s3 |
  Commerce and Business | Economics and Commerce | EC | s4 |
  Sociology and Social Issues | Sociology, Education and Other Social Issues | SE | s5 | Including welfare, labour, gender issues
  Education | Sociology, Education and Other Social Issues | SE | s6 | Including pedagogy on each subject
  Other Social Matters | Sociology, Education and Other Social Issues | SE | s7 | Including transportation, media, current issues
Technological Natural Sciences
  Mathematics | Science and Technology | ST | t1 |
  Physics | Science and Technology | ST | t2 |
  Astronomy, Earth and Planetary Science | Science and Technology | ST | t3 |
  Chemistry, Metal and Mine | Science and Technology | ST | t4 |
  Technology (Architecture, Civil Engineering) | Science and Technology | ST | t5 |
  Technology (Mechanics, Electricity, Marine Engineering) | Science and Technology | ST | t6 |
  Other Technological Natural Sciences | Science and Technology | ST | t7 | Including information science, manufacturing, library science, part of domestic science
Biological Natural Sciences
  Biology | Biology and Medicine | BM | b1 |
  Agriculture | Biology and Medicine | BM | b2 | Including forestry, fishery, animal husbandry, veterinary
  Pharmacy | Biology and Medicine | BM | b3 |
  Medicine | Biology and Medicine | BM | b4 |
  Dentistry | Biology and Medicine | BM | b5 |
  Nursing | Biology and Medicine | BM | b6 |
  Other Biological Natural Sciences | Biology and Medicine | BM | b7 | Including sports, hygienics, environmentology, part of domestic science
Internet Q & A Forum (Yahoo Chiebukuro) | Internet Q & A Forum | IF | |
Content of the sub corpora
Types and Tokens by the Ten Domain Classification
(Corpus from books and internet forum sites, BCCWJ 2009 monitor version)
Domain | Number of Tokens | Ratio
Literary Works/Imaginative Texts | 8251999 | 25.1%
Languages, Linguistics and Philosophy | 2134739 | 6.5%
History and Ethnology | 3336818 | 10.2%
Arts and Other Humanities | 3020917 | 9.2%
Politics and Law | 1881012 | 5.7%
Economics and Commerce | 2209107 | 6.7%
Sociology, Education and Other Social Issues | 2996147 | 9.1%
Science and Technology | 1512784 | 4.6%
Biology and Medicine | 2251037 | 6.9%
Internet Q & A Forum | 5224852 | 15.9%
Total | 32819412 | 100.0%
Different word rankings
• The word ranking problem mainly exists in Basic
Words
• This is mainly due to lack of good spoken corpora
• Compromise: frequency weighted to limited domains
which seem to reflect basic daily needs
• For International Students
• For General Learners
• Non-weighted (ranking for overall written Japanese)
Multidimensional scaling (MDS)
[Figures: MDS plot of the 10 domains; MDS plot of the 10 domains + word familiarity]
4. Tiers of Japanese vocabulary
(1) The concept of “word tiers”
• Domain / Level
• Level = general importance
= frequency × dispersion
Some words are frequent only in a particular domain
e.g. 発送 (shipping) 振り込み (paying by bank transfer)
古墳 (tumulus / burial mound)
Assumed word tiers for students
Level
• Basic: Top 1288 = Former JLPT Levels 4 & 3
• Intermediate: Ranked 1289-5000
• Advanced 1: 6K-10K
• Advanced 2: 11K-15K
• Super-Advanced: 16K-20K
• 21K+
• Assumed Known Words (AKW)
Domain
*General / Academic / Literary
4. Tiers of Japanese vocabulary
(2) Basic words (BW)
• Feature of the corpus: formal written language
similar to BNC (Nation, 2004)
• No good spoken corpus for vocabulary studies
• Compromise
• For the learners' and teachers' lists, the former JLPT Level 4 & 3
vocabulary is put at the top of the list as basic words
To order the basic words
• Identify closer domains to word familiarity (basic needs) by
Multidimensional Scaling (MDS)
• Frequency in literary works and the Internet-forum sites
(Yahoo-Chiebukuro) is weighted
4. Tiers of Japanese vocabulary
(3) Academic domain words
Extracting academic domain words
• Log-likelihood ratio (LLR)(Dunning, 1993)
• Target texts: Technical texts
• Classified into four large academic domains
• Total number of tokens: approx. 2.9 million
• Reference texts: General texts in BCCWJ 2009
• Total number of tokens: approx. 29.9 million
• Extract keywords shared by 4, 3, 2 or 1 domain(s)
• Cut off point: higher for more narrowly distributed words
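The keyword extraction step can be sketched as follows. This is a minimal illustration using the common two-term corpus-comparison form of the log-likelihood statistic; the sign convention (negative when a word is under-represented in the target texts) is an assumption made here so that a cut-off such as "LLR > 0" is meaningful, and the function names and sample counts are hypothetical rather than taken from VDRJ.

```python
import math

def log_likelihood(a, b, c, d):
    """Log-likelihood of a word: a = freq in target, b = freq in reference,
    c = size of target corpus, d = size of reference corpus (all in tokens)."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the target corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2.0 * ll

def signed_llr(a, b, c, d):
    """Assumed convention: positive if over-used in the target, negative otherwise."""
    ll = log_likelihood(a, b, c, d)
    return ll if a / c >= b / d else -ll

# Corpus sizes follow the approximate figures on the slide; word counts are made up
TARGET_SIZE, REFERENCE_SIZE = 2_900_000, 29_900_000
print(signed_llr(100, 100, TARGET_SIZE, REFERENCE_SIZE))  # over-used in technical texts
print(signed_llr(1, 1000, TARGET_SIZE, REFERENCE_SIZE))   # under-used in technical texts
```

Running the same comparison once per academic domain and counting the domains in which a word clears the cut-off gives the "number of shared domains" used below.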
Number of Shared Academic Domains among the 4 academic domains
[Figure: Venn diagram of the four academic domains (Ah, Ss, Tn, Bn); each region is labelled with the number of domains sharing it, from 1 to 4]
Ah: Arts & Humanities, Ss: Social Sciences,
Tn: Technological Natural Sciences, Bn: Biological Natural Sciences
4. (3) Academic domain words
• Academic words (AW): high specificity in 3+ academic domains
• 4-domain words (cut off point: LLR > 0)
• 3-domain words (cut off point: LLR > 0)
• Limited-academic-domain words (LAD)
• 2-domain words (cut off point: LLR > 1)
• 1-domain words (cut off point: LLR > average value)
• Eliminate the former JLPT Level 4 vocabulary (Top 700 words)
• Eliminate the words ranked at 20001 or lower
• Classify all the AW and LAD by word ranking levels for
International Students (U=Usage Coefficient):
• 5 levels: Basic / Inter. / Adv. 1 / Adv. 2 / Super-adv.
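The level classification by word ranking can be sketched as a simple lookup. The rank bands are the ones shown in the distribution tables of these slides (679-1288, 1289-5000, 5001-10000, 10001-15000, up to 20000); treating ranks below 679 as the excluded former JLPT Level 4 vocabulary is an assumption drawn from those bands, and the function name is hypothetical.

```python
from bisect import bisect_left

# Upper rank bound of each band; ranks 1-678 and 20001+ are excluded from AW/LAD
_UPPER_BOUNDS = [678, 1288, 5000, 10000, 15000, 20000]
_LEVELS = [None, "Basic", "Inter.", "Adv. 1", "Adv. 2", "Super-adv."]

def level_for_rank(rank):
    """Return the level label for a word rank, or None if the rank is excluded."""
    if rank < 1:
        raise ValueError("rank must be a positive integer")
    i = bisect_left(_UPPER_BOUNDS, rank)
    return _LEVELS[i] if i < len(_LEVELS) else None

for r in (500, 679, 5000, 5001, 20000, 20001):
    print(r, level_for_rank(r))
```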
4. Tiers of Japanese vocabulary
(3) -1 Academic words (AW)
• JAWL = Japanese Academic Word List
• High specificity in 3 or 4 academic domains
• 4-domain words (cut off point: LLR > 0)
• 3-domain words (cut off point: LLR > 0)
• 9 levels (Level 0 to VIII), 2590 words in total
• JAWL I (Intermediate): most essential for learning
• The basic level contains far fewer academic words
• JAWL I: 559 words
Close to AWL in number and text coverage
Coverage in the academic corpus used for extracting AW
AWL: 10.0% JAWL I: 11.1%
Academic Words: Words which are shared by 3 or 4 main academic domains
[Figure: Venn diagram of the four academic domains with the regions shared by 3 or 4 domains highlighted]
Ah: Arts & Humanities, Ss: Social Sciences,
Tn: Technological Natural Sciences, Bn: Biological Natural Sciences
Distribution and examples of JAWL
JAWL Label | Word Rankings for International Students | Level | Number of High Specificity Domains among the 4 Science Domains | Number of Unique Lexemes | Least Frequent 6 Words | Translation of the Least Frequent 6 Words
JAWL 0 | 679-1288 | Basic | 4 | 31 | 科学 規則 割合 生産 産業 講義 | science, rule, proportion, production, industry, lecture
JAWL 0 | 679-1288 | Basic | 3 | 39 | 人口 スクリーン 数学 競争 工業 地理 | population, screen, mathematics, competition, manufacture, geography
JAWL I | 1289-5000 | Inter. | 4 | 559 | 発足 半数 配分 縮小 適正 見直し | inauguration, half the number, allocation, downsize, proper, reconsider
JAWL II | 1289-5000 | Inter. | 3 | 542 | 演説 大小 実情 ステージ ライフ 担保 | speech, size, real situation, stage, life, guarantee
JAWL III | 5001-10000 | Adv. 1 | 4 | 212 | 難問 能動 付随 定型 除 本稿 | difficult problem, active, accompany, standard, except, this article
JAWL IV | 5001-10000 | Adv. 1 | 3 | 452 | 交錯 カウント 精度 一因 箇年 エンド | mixture, count, accuracy, one cause, -year, end
JAWL V | 10001-15000 | Adv. 2 | 4 | 103 | 併存 親和 盛況 散在 補填 関わり合う | coexistence, affinity, prosperity, straggle, compensation, implicated
JAWL VI | 10001-15000 | Adv. 2 | 3 | 328 | 帰着 編著 沿海 拮抗 常套 内情 | come down to, written and edited, coastal, close competition, conventional, internal condition
JAWL VII | 15001-20000 | Super-adv. | 4 | 56 | 閉 増刊 含意 複 活路 所与 | closed, extra edition, implication, double-, way out, given
JAWL VIII | 15001-20000 | Super-adv. | 3 | 269 | 極小 付則 深度 概算 頒布 円錐 | minimal, additional clause, depth, rough estimate, distribution (of goods/paper), cone
Former JLPT Level: L3 for ranks 679-1288, followed by L2, L1 and Other at higher ranks
4. (3) -1 Academic words (AW)
Semantic features of AW (1)
• Highly abstract, essential for operating logic
i.e.
• Range: 占める (occupy, account for), 特殊 (special, particular)
• Relation: 属する (belong to), 依存 (rely/reliance)
• Comparison/Evaluation: 後者 (the latter), 優れる (superior)
• Quantitative change: 減少 (decrease), 強化 (reinforce)
• Stage: 当初 (beginning), 現状 (present condition)
• Development of enunciation: 取り上げる (take up [an issue]),
まとめる (summarize)
• Cause-effect, degree, agent, action, object, direction, goal,
instrument, time etc.
4. Tiers of Japanese vocabulary
(3) -1 Academic words (AW)
Semantic features of AW (2)
The most frequent Kanji used for AW
合 (combine, together), 定 (fix, certain), 分 (divide, minute),
一 (one), 同 (same), 数 (number), 上 (up), 体 (body), 出 (out),
大 (large)
• 3-domain words: Some words have concrete meanings
e.g. 署名 (signature), 保健 (health, hygiene)
• 4-domain words: Few words have concrete meanings
• The nature of the words is the same at all levels
POS of Japanese AW (1)
• Common noun: 1072 words (41.4 %) e.g. 背景 (background)
• Verbal noun: 882 words (34.0 %) e.g. 連続 (continue/continuation)
⇒ Adding other types of nouns together,
2104 words (81.2 %) can be a noun
• Verb (excluding verbal nouns): 225 words (8.7 %)
e.g. 認める (recognize/approve) 述べる (describe/mention)
⇒ Adding other types of verbs together,
1107 words (42.7%) can be a verb
• Adjectival noun: 95 words (3.7 %)
e.g. 詳細 (detail/-ed), 平等 (equal/-ity)
• Adjective: Only 9 words (0.3 %) e.g. 著しい (remarkable)
POS of Japanese AW (2)
• Affix: 106 words (4.1 %) e.g. -期 (period), -種 (type)
substantial in Japanese academic words
• Adverb: 34 words (1.3 %) e.g. しばしば (frequently)
• Other (particle, auxiliary verb etc.): 22 words (0.8 %)
• Remarkably many archaic words
e.g. のみ (only), つつ (while doing), べし (ought to), あらゆる (every)
いかなる (any), 我が (my), 漠然 (vague)
• れる/られる (Passive/Potential/Spontaneous)
specific in academic texts
4. (3) -2 Limited-academic-domain words (LAD)
• Limited-academic-domain words (LAD)
• High specificity in 2 or 1 domain(s)
• 2-domain words (cut off point: LLR > 1)
• 1-domain words (cut off point: LLR > average value)
• Something between “academic” and “technical”
• The “leftovers” from extracting AW?
• Tiers of curriculum cf. Tajino et al. (2007)
• Words correspondent to the curriculum
• Basic: all the learners
• Academic words: prep. to first year
• Limited-academic-domain words (?): prep. to major
• Technical words: major to postgrad.
Number of Shared Academic Domains among the four academic domains
Limited-Academic-Domain Words: Words which are shared by only 1 or 2 main academic domain(s)
[Figure: Venn diagram of the four academic domains with the regions shared by 1 or 2 domains highlighted]
Ah: Arts & Humanities, Ss: Social Sciences,
Tn: Technological Natural Sciences, Bn: Biological Natural Sciences
4. (3) -2 Limited-academic-domain words (LAD)
2 domain words
Distribution of 2-Domain Words of Japanese Limited-Academic-Domain Words (JLAD): number of unique lexemes in LAD of each domain pair
JLAD Label | Word Rankings for International Students | Level | Ah & Ss | Ah & Tn | Ah & Bn | Ss & Tn | Ss & Bn | Tn & Bn | Total
JLAD 0 | 679-1288 | Basic | 15 | 5 | 4 | 5 | 6 | 10 | 45
JLAD I | 1289-5000 | Inter. | 139 | 27 | 30 | 77 | 57 | 61 | 391
JLAD III | 5001-10000 | Adv. 1 | 138 | 38 | 25 | 86 | 50 | 92 | 429
JLAD V | 10001-15000 | Adv. 2 | 91 | 28 | 22 | 58 | 37 | 60 | 296
JLAD VII | 15001-20000 | Super-adv. | 93 | 23 | 17 | 43 | 16 | 40 | 232
Total | | | 476 | 121 | 98 | 269 | 166 | 263 | 1393
Former JLPT Level: L3 for ranks 679-1288, followed by L2, L1 and Other at higher ranks
Ah: Arts & Humanities, Ss: Social Sciences, Tn: Technological Natural Sciences, Bn: Biological Natural Sciences
4. (3) -2 Limited-academic-domain words (LAD)
2 domain words
Examples of 2-Domain Words of Japanese Limited-Academic-Domain Words (JLAD): least frequent 2 words in LAD of each domain pair, with translation
JLAD 0 (L3, 679-1288, Basic)
  Ah & Ss: 貿易 (trade), 輸出 (export)
  Ah & Tn: 砂 (sand), テキスト (text)
  Ah & Bn: 発音 (pronunciation), ステレオ (stereo)
  Ss & Tn: 製 (made (in)), レポート (report)
  Ss & Bn: 以内 (within), パート (part(-timer))
  Tn & Bn: アルコール (alcohol), テニス (tennis)
JLAD I (1289-5000, Inter.)
  Ah & Ss: 孤立 (isolation), 融資 (loan)
  Ah & Tn: オール (all), ペーパー (paper)
  Ah & Bn: 静岡 (Shizuoka pref./city), 書簡 (epistle)
  Ss & Tn: ニーズ (need (n.)), 顧客 (customer)
  Ss & Bn: 総務 (general affairs), 性的 (sexual)
  Tn & Bn: スイッチ (switch), 液 (liquid)
JLAD III (5001-10000, Adv. 1)
  Ah & Ss: 容れる (compatible), 教義 (doctrine)
  Ah & Tn: 音響 (acoustic), 流布 (circulation)
  Ah & Bn: 発現 (manifestation), 海域 (waters)
  Ss & Tn: 本件 (this matter), セクション (section)
  Ss & Bn: 閉塞 (impasse), 弱める (weaken)
  Tn & Bn: 多用 (frequent use), 部位 (region (of body))
JLAD V (10001-15000, Adv. 2)
  Ah & Ss: 払い戻し (refund), ユニバーシティ (university)
  Ah & Tn: 落差 (a drop), コロン (cologne)
  Ah & Bn: 目付け (overseer), 生長 (growth)
  Ss & Tn: VTR (VTR), リハーサル (rehearsal)
  Ss & Bn: 所見 (remark (n.)), 救命 (lifesaving)
  Tn & Bn: 光学 (optics), ペーハー (pH)
JLAD VII (15001-20000, Super-adv.)
  Ah & Ss: 峻別 (sharp distinction), 公債 (public bond)
  Ah & Tn: 目配り (meticulous care), テクノ (techno)
  Ah & Bn: 太極 (tai ji), 増量 (increase in quantity)
  Ss & Tn: パレット (pallet), 軽微 (slight)
  Ss & Bn: マンガン (manganese), 居宅 (dwelling)
  Tn & Bn: 棒状 (stick-shaped), 雨水 (rainwater)
Examples of 2 domain words: Words which
are shared by only 2 main academic domains
Ah & Ss: isolation, doctrine, refund
Ah & Tn: paper, acoustic, a drop
Ah & Bn: epistle, waters, growth
Ss & Tn: need (n.), section, VTR
Ss & Bn: sexual, weaken, lifesaving
Tn & Bn: liquid, frequent use, pH
Ah: Arts & Humanities, Ss: Social Sciences,
Tn: Technological Natural Sciences, Bn: Biological Natural Sciences
4. (3) -2 Limited-academic-domain words (LAD)
2 domain words
• Semantic features
• More concrete and specific than academic words
• Ah & Ss: Social, overlap in history and ethnology
• Ss & Tn: Industrial
• Ss & Bn: Social security, medical and nursing service
• Tn & Bn: Scientific
• Ah & Tn, Ah & Bn: not clear
4. (3) -2 Limited-academic-domain words (LAD)
1 domain words
• It is merely a trial
• The corpus is not the best for academic purpose, especially for
natural sciences
• Extracting something common across domains is much easier,
while extracting words from only one target corpus requires
a more complete target corpus
• Therefore, AW (4 domain words and 3 domain words) will be
more reliable than LAD (2 domain words and 1 domain words)
4. (3) -2 Limited-academic-domain words (LAD)
1 domain words
Distribution of 1 Domain Words of Japanese Limited-Academic-Domain Words (JLAD): number of unique lexemes in each domain
JLAD Label | Word Rankings for International Students | Level | Ah | Ss | Tn | Bn | Total
JLAD 0 | 679-1288 | Basic | 13 | 6 | 5 | 9 | 33
JLAD I | 1289-5000 | Inter. | 104 | 111 | 46 | 52 | 313
JLAD III | 5001-10000 | Adv. 1 | 104 | 127 | 60 | 68 | 359
JLAD V | 10001-15000 | Adv. 2 | 71 | 74 | 48 | 54 | 247
JLAD VII | 15001-20000 | Super-adv. | 60 | 55 | 29 | 53 | 197
Total | | | 352 | 373 | 188 | 236 | 1149
Former JLPT Level: L3 for ranks 679-1288, followed by L2, L1 and Other at higher ranks
Ah: Arts & Humanities, Ss: Social Sciences, Tn: Technological Natural Sciences, Bn: Biological Natural Sciences
4. (3) -2 Limited-academic-domain words (LAD)
1 domain words
Examples of 1 Domain Words of Japanese Limited-Academic-Domain Words (JLAD): least frequent 2 words in LAD of each domain, with translation
JLAD 0 (L3, 679-1288, Basic)
  Ah: 辞典 (dictionary), 文法 (grammar)
  Ss: 工場 (factory), 遊び (play(ing))
  Tn: 海岸 (seashore), 汽車 (train)
  Bn: 退院 (leave hospital), 柔道 (judo)
JLAD I (1289-5000, Inter.)
  Ah: 色彩 (coloring), 滋賀 (Shiga (pref.))
  Ss: 紛争 (conflict), 犯 (offense)
  Tn: 原子 (atom), コンクリート (concrete (n.))
  Bn: 拳 (fist/martial art), 杉 (cedar)
JLAD III (5001-10000, Adv. 1)
  Ah: 王家 (royal family), 呪術 (incantation)
  Ss: 超過 (excess), 欠席 (absence)
  Tn: 硬化 (harden(ing)), ドラッグ (drag/drug)
  Bn: 臓器 (organ), 左足 (left leg/foot)
JLAD V (10001-15000, Adv. 2)
  Ah: 報国 (patriotic), 遍歴 (itinerancy)
  Ss: 持ち分 (quota), 受諾 (acceptance)
  Tn: PM (PM), 蒸留 (distillation)
  Bn: 卵子 (ovum), 緑茶 (green tea)
JLAD VII (15001-20000, Super-adv.)
  Ah: 厳寒 (intense cold), 鼎 (three-legged vessel)
  Ss: 卸売り (wholesale), 引き当て (reserve fund)
  Tn: プログラミング (programming), バラック (shanty)
  Bn: 居合 (iai (martial arts)), 微小 (micro)
• Semantic features are much clearer than 2 domain words
Examples of Academic Domain Words:
Words which are shared by 1, 2, 3 or 4 main academic domain(s)
1 domain
  Ah: coloring, royal family
  Ss: conflict, excess
  Tn: atom, harden(ing)
  Bn: fist/martial art, organ
2 domains
  Ah & Ss: isolation, doctrine, refund
  Ah & Tn: paper, acoustic, a drop
  Ah & Bn: epistle, waters, growth
  Ss & Tn: need (n.), section, VTR
  Ss & Bn: sexual, weaken, lifesaving
  Tn & Bn: liquid, frequent use, pH
3 or 4 domains: at a stroke, guarantee, mixture, -year, proper, except, allocation, lead, end, size, life
Ah: Arts & Humanities, Ss: Social Sciences,
Tn: Technological Natural Sciences, Bn: Biological Natural Sciences
POS of Japanese LAD (1)
• Common noun: 1605 words (63.1 %) – more than AW (41.4%)
• Verbal noun: 633 words (24.9 %) e.g. 融資 (finance) cf. AW (34.0%)
⇒ Adding other types of nouns together,
2104 words (87.9 %) can be a noun – more than AW (81.2%)
• Verb (excl. verbal nouns): 81 words (3.2 %) cf. AW (8.7%)
e.g. 訳す (translate) 向き合う (face (v.))
⇒ Adding other types of verbs together,
714 words (28.1%) can be a verb – fewer than AW (42.7%)
• Adjectival noun: 88 words (3.5 %) cf. AW (3.7%)
e.g. フル (full), 偉大 (great)
• Adjective: Only 3 words (0.1 %) cf. AW (0.3%) e.g. 硬い (stiff)
POS of Japanese LAD (2)
• Affix: 109 words (4.3%) cf. AW (4.1%) e.g. –犯 (offense)
substantial in Japanese academic domain words
• Adverb: 15 words (0.6 %) cf. AW (1.3%) e.g. 現に (surely)
• Other (particle, auxiliary verb etc.): 9 words (0.4 %)
cf. AW (0.8%)
• Remarkably many archaic words – similar to AW
e.g. なり [affirmative aux.], とも (even though),
たり [affirmative aux.], ごとし (as/like), 単なる (mere),
しめる(=しむ) [causative aux.], かかる (such)
4. Tiers of Japanese vocabulary
(4) Literary words (LW)
Extracting literary words: Words for reading literary works
• Log-likelihood ratio (Keyness in AntConc)
• Target corpus: literary works (identified by NDC and C-code) in BCCWJ 2009 (NINJAL, 2009) – Over 8 million tokens
• 4 different reference corpora: Technical texts, general texts in arts and humanities, general texts in the other 3 academic domains, Internet forum texts (Yahoo Chiebukuro)
• Extract keywords shared by the four results (Cutoff point: average value)
• Eliminate the former JLPT Level 4 vocabulary (Top 700 words)
• Eliminate the words ranked at 20001 or lower
• Classify all the LW by word ranking levels for International Students (U=Usage Coefficient)
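The "keywords shared by the four results" step can be sketched as a set intersection. This is only an illustration: the slides do not say over which words the average cut-off is taken, so the sketch assumes the mean keyness over all scored words in each comparison, and the function names, word labels and scores are hypothetical.

```python
from statistics import mean

def keywords_above_average(scores):
    """Words whose keyness exceeds the mean keyness in one target/reference comparison."""
    cutoff = mean(scores.values())
    return {word for word, score in scores.items() if score > cutoff}

def shared_keywords(score_tables):
    """Words that are keywords in every comparison (one table per reference corpus)."""
    if not score_tables:
        return set()
    keyword_sets = [keywords_above_average(table) for table in score_tables]
    return set.intersection(*keyword_sets)

# Hypothetical keyness scores of four words against four reference corpora
tables = [
    {"w1": 10, "w2": 8, "w3": 1, "w4": 1},  # vs. technical texts
    {"w1": 9, "w2": 1, "w3": 1, "w4": 1},   # vs. general arts and humanities texts
    {"w1": 6, "w2": 6, "w3": 0, "w4": 0},   # vs. general texts in the other 3 domains
    {"w1": 4, "w2": 2, "w3": 2, "w4": 0},   # vs. Internet forum texts
]
print(shared_keywords(tables))
```

Requiring a word to be key against all four references filters out words that are merely typical of books or of informal writing in general.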
4. (4) Literary words (LW)
Distribution and examples
Distribution and examples of Japanese Literary Words (JLW)
JLW Label | Word Rankings for International Students | Number of Unique Lexemes of JLW | Least Frequent 2 Words of JLW | Translation of the Least Frequent 2 Words of JLW
Basic Lit. | 679-1288 | 142 | ちっとも, 引き出し | (not) at all, drawer
Inter. Lit. | 1289-5000 | 446 | 戸惑う, 吐き出す | puzzled, vent
Adv. 1 Lit. | 5001-10000 | 483 | 不吉, 銀色 | ominous, silver
Adv. 2 Lit. | 10001-15000 | 345 | 敵機, 口笛 | hostile aircraft, whistle
Super-adv. Lit. | 15001-20000 | 200 | 香菜, 樹海 | coriander, sea of trees
Total | | 1616 | |
Former JLPT Level: L3 for ranks 679-1288, followed by L2, L1 and Other at higher ranks
4. (4) Literary words (LW)
POS of LW
Number of Unique Lexemes of Japanese Literary Words by POS
Level | N. (Excl. VN. & AN.) | V. (Excl. VN.) | VN | AN. (Excl. VN.) | Adj. | Affix | Adv. | Others | Total
Basic Lit. | 56 | 49 | 9 | 2 | 1 | 3 | 10 | 12 | 142
Inter. Lit. | 168 | 157 | 21 | 20 | 12 | 8 | 28 | 32 | 446
Adv. 1 Lit. | 199 | 163 | 23 | 25 | 13 | 12 | 28 | 20 | 483
Adv. 2 Lit. | 137 | 122 | 19 | 15 | 7 | 2 | 27 | 16 | 345
Super-adv. Lit. | 85 | 58 | 14 | 5 | 8 | 1 | 21 | 8 | 200
Total | 645 | 549 | 86 | 67 | 41 | 26 | 114 | 88 | 1616
VN: Verbal Noun
AN: Adjectival Noun
• More verbs, adverbs and interjections than AW and LAD
• Fewer verbal nouns and adjectival nouns
• This inevitably means LW have fewer loan words but more Japanese-origin words.
4. (4) Literary words (LW)
Q. How many LW overlap with AW and LAD?
• Only 27 words (0.5% of academic domain words, 1.7% of LW) are overlapping
• Most of the overlapping words (24/27) overlap with 1 domain words (17 words overlap with words in biological natural science)
• Many physical words such as words for body parts
e.g. 左手 (left hand), こぶし (fist), 血 (blood), 頭上 (overhead)
• No LW overlap with 4 domain words
• Overlapping words are mainly at the intermediate level
• No overlapping words at 11K or above
• Some examples of overlapping words: 音 (sound), 光 (light),
棚 (shelf), 組 (class), 岩 (rock), ひざ (knee), 興奮 (excite/-ment),
全身 (whole body), 帝 (emperor), ネズミ (mouse), 帆 (sail)
Word tiers: In what order should students
learn them?
• Basic
• General
• AW/LAD
• LW
• Intermediate
• General
• AW/LAD
• LW
• Advanced
• General
• AW/LAD
• LW
• Highly Advanced
• General
• AW/LAD
• LW
• Super-Advanced
• General
• AW/LAD
• LW
• Assumed known words
• Proper names
• Fillers, Signs
• (Transparent compounds *)
• Others
5. Text coverage by word tier
• The word tier analyser: An Excel sheet where word
profiling of a text can be checked automatically by
cutting and pasting the result of AntWordProfiler with
the word tier base word list.
• Text covering efficiency
High efficiency in vocabulary learning
= Fewer unique lexemes cover more texts
(Reciprocal Type/Token Ratio = Token/Type Ratio?)
(*Comparison should be made between equally sized texts)
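The kind of profiling the word tier analyser performs can be sketched in a few lines. This is not the Excel sheet itself: the function names, the toy tier dictionary and the token list are hypothetical, and the text is assumed to be already segmented into lexemes.

```python
from collections import Counter

def tier_coverage(tokens, tier_of, default="21K+ / Others"):
    """Percentage of tokens covered by each word tier."""
    if not tokens:
        raise ValueError("empty text")
    counts = Counter(tier_of.get(token, default) for token in tokens)
    return {tier: 100.0 * n / len(tokens) for tier, n in counts.items()}

def token_type_ratio(tokens):
    """Tokens per unique lexeme; only comparable between equally sized texts."""
    return len(tokens) / len(set(tokens))

# Toy tier dictionary (tier assignments of the general words are made up)
tier_of = {"減少": "Academic", "融資": "Limited-Aca.-Dom.", "戸惑う": "Literary", "する": "General"}
tokens = ["減少", "減少", "融資", "戸惑う", "する", "する", "する", "未登録語"]

print(tier_coverage(tokens, tier_of))
print(token_type_ratio(tokens))
```

A higher token/type ratio within a tier means that fewer unique lexemes cover more of the text, which is the sense of "text covering efficiency" above.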
Coverage of Japanese texts by word tier
% of Tokens (Overlap included)
Name of Text | Text Genre | Total Number of Tokens for Each Test Corpus (Million) | AKW (**) (Proper nouns etc.) | General | Academic | Limited-Aca.-Dom. | Literary | Overlap | 21K + Others | Total
MC | Conversation | 1.13 | 1.7 | 81.0 | 2.7 | 1.6 | 10.8 | 0.0 | 2.2 | 100.0
UPC | Non-academic prose | 2.10 | 1.3 | 77.2 | 7.6 | 3.2 | 7.4 | -0.2 | 3.5 | 100.0
OB | Bestseller books (dominantly novels) | 2.30 | 2.4 | 78.0 | 7.2 | 3.8 | 6.5 | -0.2 | 2.2 | 100.0
BCCWJ | Books & Internet Forum | 32.82 | 2.0 | 74.7 | 10.9 | 5.3 | 4.6 | -0.1 | 2.6 | 100.0
WP | Wikipedia | 5.90 | 3.7 | 64.9 | 14.9 | 7.3 | 1.8 | -0.1 | 7.4 | 100.0
UYN | Newspaper | 5.68 | 1.3 | 63.5 | 20.7 | 11.2 | 1.7 | -0.1 | 1.7 | 100.0
TIS | Humanities & Social Sciences | 0.04 | 0.9 | 66.1 | 20.7 | 8.9 | 2.0 | 0.0 | 1.4 | 100.0
TB | Social Sciences | 0.19 | 0.4 | 66.0 | 21.3 | 8.9 | 1.6 | 0.0 | 1.8 | 100.0
MTC-Ss | Social Sciences | 0.05 | 0.3 | 67.2 | 20.9 | 7.7 | 1.6 | 0.0 | 2.4 | 100.0
MTC-Tn | Technological Natural Sciences | 0.07 | 0.6 | 61.1 | 23.2 | 5.9 | 2.3 | -0.1 | 7.0 | 100.0
MTC-Bn | Biological Natural Sciences | 0.01 | 0.3 | 61.6 | 22.7 | 6.8 | 1.4 | -0.1 | 7.3 | 100.0
Total # of Types in Each Tier (Lexeme) (*): AKW (30821); General 13303; Academic 2590; Limited-Aca.-Dom. 2542; Literary 1616; Overlap -27; Total 20024
* All words except 'AKW' and '21K+Others' are listed in top 20000 (01K-20K) ranked by the Word List for International Students (Matsushita, 2011)
** AKW (Assumed Known Words): Words such as proper nouns or fillers which are assumed not to require previous learning.
Findings from the text coverage
• Conversation and non-academic texts: more general words and LW
• Wikipedia: more proper nouns and low frequency words
• Academic items of Wikipedia: 15-20% of the texts is estimated
to be covered by JAWL I (559 types) – encyclopaedic nature of AW?
⇒ can be a good resource for learning AW
• Newspapers: similar to academic texts, but contain more LAD and
AW at the advanced level
⇒ can be a good resource for learning AW (esp. in social sci.)
• Academic texts: more AW and LAD but less LW in any academic
domain
• Academic texts in natural sciences: more academic domain words
at lower frequency levels (technical vocabulary) than Ah. and Ss.
texts – similar to Coxhead, Stevens, & Tinkle (2010)
6. Proportion of word origin types by
word tier
Proportion of Unique Lexemes by Word Origin and
Word Tier in 01K-20K (*) (Matsushita, 2011)
Word Tier | Japanese (%) | Chinese (%) | Western & Other (%) | Mixed (%) | Proper Nouns (%) | Unknown & Signs (%) | Total
General | 38.4 | 45.3 | 10.8 | 3.2 | 1.5 | 0.8 | 100.0
Academic | 15.0 | 75.2 | 7.0 | 1.9 | 0.4 | 0.5 | 100.0
Limited-academic-domain | 12.4 | 69.1 | 13.7 | 1.7 | 2.2 | 1.0 | 100.0
Literary | 71.7 | 21.8 | 2.5 | 3.1 | 0.3 | 0.6 | 100.0
Overlap | 74.1 | 22.2 | 0.0 | 3.7 | 0.0 | 0.0 | 100.0
Total | 34.7 | 50.3 | 10.0 | 2.8 | 1.4 | 0.8 | 100.0
*Including 24 compound numerals (01K+)
Findings from the proportion of word origin
types by word tier
• LW: Japanese origin words are significantly dominant
• AW and LAD: Chinese origin words are significantly
dominant
• LAD: more Western origin words (Gairaigo)
⇒ Western origin words tend to appear more at
lower frequency levels in academic domain words
• Origins of academic and literary words are
quite clearly separated:
• Academic – Chinese origin
• Literary – Japanese origin
7. Implications from the findings
Q. Word Tiers: In what order should students learn them?
• Basic
  • Academic
  • LAD
  • General
• Intermediate
  • Academic
  • LAD
  • General
• Advanced
  • Academic
  • LAD
  • General
• Highly Advanced
  • Academic
  • LAD
  • General
• Super-Advanced
  • Academic
  • LAD
  • General
• Assumed known words
  • Proper names
  • Fillers
  • Signs
  • (Transparent compounds *)
  • Others
Implications for teaching and research
• A vocabulary-conscious curriculum should be designed
and incorporated in Japanese programs depending on
the learners’ needs and language backgrounds
• The gap between Chinese-background learners (CBLs)
and non-CBLs will be smaller in basic conversation and
in reading literary works than in reading academic texts
• A good curriculum for learning academic domain words is
particularly needed for non-CBLs of academic Japanese
• An autonomous mode of vocabulary learning will be
necessary, particularly when the learners’ needs and
language backgrounds are diverse
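The expected gap between CBLs and non-CBLs can be illustrated with the Chinese-origin shares from the word-origin table in Section 6. The sketch below treats the Chinese-origin share of a tier as a rough proxy for how much CBLs might benefit from their background. That proxy is an assumption made for this example, not a measure reported in the study, and the function names are hypothetical.

```python
# Chinese-origin share of unique lexemes (%) per word tier, taken from the
# word-origin table in Section 6 (Matsushita, 2011).
CHINESE_ORIGIN_SHARE = {
    "General": 45.3,
    "Academic": 75.2,
    "Limited-academic-domain": 69.1,
    "Literary": 21.8,
}


def rank_by_chinese_share(shares):
    """Return tier names sorted from highest to lowest Chinese-origin share."""
    return sorted(shares, key=shares.get, reverse=True)


def share_difference(shares, tier_a, tier_b):
    """Difference between two tiers in percentage points (one decimal)."""
    return round(shares[tier_a] - shares[tier_b], 1)


ranking = rank_by_chinese_share(CHINESE_ORIGIN_SHARE)
print("Tiers by Chinese-origin share:", ranking)
print("Academic vs General:",
      share_difference(CHINESE_ORIGIN_SHARE, "Academic", "General"), "points")
print("Academic vs Literary:",
      share_difference(CHINESE_ORIGIN_SHARE, "Academic", "Literary"), "points")
```

Under this assumption, academic and limited-academic-domain words are where Chinese-origin vocabulary is most concentrated, which is consistent with the claim that non-CBLs need particular support for academic domain words.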
8. Conclusion
Limitations of the word lists
• Less valid for narrower-domain words (2D/1D words) and less
reliable in higher frequency levels → need refining with a more
complete academic corpus
• Multi-word units not extracted
• Not sensitive to different usages in different domains (polysemy)
Remaining issues
• Many transparent compounds in Japanese
• What is a Kanji tier? How is it related to the word tiers?
Download sites for VDRJ/JAWL
Matsushita Laboratory for Language Learning
http://www.wa.commufa.jp/~tatsum/English%20top_Tatsu.html
(Interface: English)
Google it with “matsushita” and “language”
松下言語学習ラボ
http://www.wa.commufa.jp/~tatsum/index.html
(Interface: Japanese)
Google it with “松下” and “言語”
Main findings
• VDRJ is useful for designing curricula (materials, tests, etc.)
• The more domains share a word as AW or LAD, the more
abstract the meaning of the word is.
• Conversation and non-academic texts contain more general words
and LW
• Academic texts: more AW and LAD but fewer LW in any academic
domain
• Wikipedia: more proper nouns and low-frequency words
• Newspapers and academic items of Wikipedia can be a good
resource for learning AW and LAD.
• Natural science texts contain more academic domain words at
lower frequency levels than arts and social science texts
• Origins of academic and literary words are quite clearly
separated; about 3/4 of LW originate in Japanese while about 3/4
of AW and LAD originate in Chinese
• LAD contains more Western-origin words (Gairaigo)
References (1)
• Anthony, L. (2007). AntConc Version 3.2.1 (text analysis tool)
http://www.antlab.sci.waseda.ac.jp/software.html
(Version 1.0 first published in 2002)
• Anthony, L. (2009). AntWordProfiler Version 1.2 w (word
profiler) http://www.antlab.sci.waseda.ac.jp/software.html
(Version 1.0 first published in 2008)
• Beck, I. L., McKeown, M. G., & Kucan, L. (2002). Bringing Words
to Life: Robust Vocabulary Instruction. Solving problems in the
teaching of literacy. New York: Guilford Press.
• Butler, Y. G. (バトラー後藤裕子). (2010). 小中学生のための
日本語学習語リスト(試案)(A list of Japanese academic
vocabulary for elementary and junior high school students in
Japan). 母語・継承語・バイリンガル教育(MHB)研究 (Studies in
Mother Tongue, Heritage Language, and Bilingual Education), 6,
42-58.
References (2)
• Chung, T. M. (2003). Identifying technical terms. Unpublished
PhD dissertation, Victoria University of Wellington.
• Corson, D. J. (1995). Using English Words. Dordrecht: Kluwer
Academic Publishers.
• Corson, D. J. (1997). The learning and use of academic English
words. Language Learning, 47(4), 671-718.
• Coxhead, A. (2000). A new academic word list. TESOL Quarterly,
34(2), 213-238.
• Coxhead, A., & Hirsh, D. (2007). A pilot science-specific word list.
Revue Française de Linguistique Appliquée, 12(2), 65-78.
• Coxhead, A., Stevens, L., & Tinkle, J. (2010). Why might
secondary science textbooks be difficult to read? New Zealand
Studies in Applied Linguistics, 16(2), 37-52.
References (3)
• Dunning, T. (1993). Accurate methods for the statistics of
surprise and coincidence. Computational Linguistics, 19, 61-74.
• Eldridge, J. (2008). No, there isn’t an academic vocabulary
but ... TESOL Quarterly, 109-113.
• Hyland, K., & Tse, P. (2007). Is there an “Academic
Vocabulary”? TESOL Quarterly, 41(2), 235-253.
• Hu, M. H. & Nation, P. (2000). Vocabulary density and
reading comprehension. Reading in a Foreign Language,
13(1), 403-430.
• Juilland, A., & Chang-Rodriguez, E. (1964). Frequency
Dictionary of Spanish Words. London: Mouton & Co.
References (4)
• Komiya, C. (小宮千鶴子). (1995). 専門日本語教育の専門語
-経済の基本的な専門語の特定を目指して- [Technical
terms for teaching technical Japanese: Aiming at identifying
basic technical terms for economics]. 日本語教育 [Teaching
Japanese as a Foreign Language], 86, 81-92.
• Komori, K. (小森和子), Mikuni, J. (三國純子), & Kondo, A.
(近藤安月子). (2004). 文章理解を促進する語彙知識の量的
側面 ―既知語率の閾値探索の試み― (What percentage of
known words in a text facilitates reading comprehension: a
case study for exploration of the threshold of known words
coverage). 日本語教育 [Teaching Japanese as a Foreign
Language], 125, 83-92.
References (5)
• Matsushita, T. (松下達彦). (2010). What words are essential to
read Japanese? Making word lists from a large corpus of books
and internet forum sites [日本語を読むために必要な語彙と
は? -書籍とインターネットの大規模コーパスに基づく語彙リ
ストの作成-]. Proceedings for the Conference of the Society
for Teaching Japanese as a Foreign Language, Spring 2010
[2010年度日本語教育学会春季大会予稿集], 335-336.
• Matsushita, T. (松下達彦). (2011). 日本語を読むための語彙
データベース (The Database for Reading Japanese).
Downloaded from http://www.geocities.jp/tatsum2003/, 22
May 2011
• Nation, I. S. P. (2004). A study of the most frequent word
families in the British National Corpus. In P. Bogaards & B. Laufer
(Eds.), Vocabulary in a Second Language: Selection, Acquisition,
and Testing (pp. 3-13). Amsterdam: John Benjamins.
References (6)
• Nation, I. S. P. (2011). Making and using word lists. In I. S. P.
Nation & S. Webb (Eds.), Researching and analysing
vocabulary. Boston: Heinle Cengage Learning.
• Oka, M. (岡 益巳). (1992). 非漢字圏の留学生のための日本
経済基本用語表 [Basic terms of the Japanese economy for
non-Kanji background students]. 岡山大学経済学会雑誌
(Okayama Economic Review), 23(4), 191-229.
References (7)
• Tajino, A., Terauchi, H., Sasao, Y., & Maswana, S. (田地野 彰・
寺内 一・笹尾洋介・マスワナ紗矢子). (2007). 総合研究大学
における英語学術語彙リスト開発の意義 -EAPカリキュラム
開発の観点から- (The development of academic words lists
at a multi-disciplinary university in Japan: A fundamental step
in EAP curriculum design). 京都大学高等教育研究 (Kyoto
University Researches in Higher Education), 13.
• Tajino, A., Dalsky, D., & Sasao, Y. (2009). Academic vocabulary
reconsidered: An EAP curriculum-design perspective. Journal
of Teaching English as a Foreign Language and Literature, 1(4),
3-21.
References (8)
• Tamamura, F. (玉村文郎). (1987). 日本語教育基本2570語 [Basic
2570 words for teaching Japanese as a second language]. 日本語の
語彙・意味(2) [Japanese Vocabulary and Meaning], NAFL Institute
日本語教師養成通信講座 [Training Course of Teachers of Japanese
as a Second Language]. アルク (Alc).
• Townsend, D., & Collins, P. (2008). Academic vocabulary and middle
school English learners: an intervention study. Reading and Writing,
22(9), 993-1019. doi:10.1007/s11145-008-9141-y
• Ward, J. (1999). How large a vocabulary do EAP Engineering
students need? Reading in a Foreign Language, 12(2), 309-323.
• West, M. (1953). A General Service List of English Words. London:
Longmans, Green & Co.
