Ways of searching for the Zeitgeist of Modernity: a corpus-based approach to modern fiction
Ilina Doykova
Shumen University, Shumen (Bulgaria)
Statistical analysis
Simple things may characterise different styles:
• average sentence length
• average word length
• vocabulary richness
• vocabulary growth (homogeneity of text)
More complex analyses give a more interesting picture:
• specific syntactic structures
• degree of modification in NPs
• types of verbs (e.g. verbs of persuasion, speech verbs, action verbs, descriptive verbs)
• distribution of pronouns (1st/2nd/3rd person)
• themes, beliefs, etc.
Especially when used comparatively
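The simple measures listed above can be approximated with a few lines of code. The following is a minimal sketch, not the procedure used by WordSmith or Wmatrix: the tokeniser, the sentence splitter, the function names and the sample text are all assumptions made for illustration, and vocabulary richness is represented here by a plain type/token ratio.

```python
import re

def tokenize(text):
    """Lowercased word tokens; letters and apostrophes only (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def split_sentences(text):
    """Naive sentence split on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_measures(text):
    """Average sentence length (in words), average word length, type/token ratio."""
    words = tokenize(text)
    sentences = split_sentences(text)
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
    }

def vocabulary_growth(text, step=5):
    """Number of distinct word types seen after every `step` tokens."""
    words = tokenize(text)
    seen, curve = set(), []
    for i, w in enumerate(words, start=1):
        seen.add(w)
        if i % step == 0 or i == len(words):
            curve.append((i, len(seen)))
    return curve

# Invented sample text, for illustration only.
sample = "She sat alone. The room was quiet. She was not unhappy, only alone."
measures = simple_measures(sample)
growth = vocabulary_growth(sample, step=5)
print(measures)
print(growth)
```

Because the type/token ratio falls as texts get longer, such figures are best compared across samples of similar size.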
Linguistic Tools: WordSmith and Wmatrix
Useful features:
+ Tagging = identifies and labels PoS
+ WordList = generates word-frequency lists
+ Concordance = lists occurrences of a word in context and its immediate environment, which helps to:
• identify the syntactic use of a word
• identify its range of meanings
• identify the relative frequency of different uses/meanings
+ KWIC (key word) = identification of key words through a comparison with a reference corpus
+ Word Clouds = semantic tagsets in 21 domains
• Listings can be customised to show what you want more clearly:
sort according to next or previous word
show more or less context
highlight important information
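The customisation options above (sorting by the next or previous word, widening or narrowing the context, highlighting the search word) can be illustrated with a small concordancer. This is a hedged sketch only: `concordance`, `format_line`, their parameters and the sample sentences are hypothetical names and data, and the code does not reproduce how WordSmith's Concord tool works internally.

```python
import re

def concordance(text, node, width=3, sort_by=None):
    """Return hits for `node` as (left context, node, right context) tuples.

    width   -- number of context words kept on each side
    sort_by -- None (text order), "next" or "previous"
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    if sort_by == "next":
        hits.sort(key=lambda h: h[2][0].lower() if h[2] else "")
    elif sort_by == "previous":
        hits.sort(key=lambda h: h[0][-1].lower() if h[0] else "")
    return hits

def format_line(hit, pad=25):
    """Right-align the left context and upper-case the node word to highlight it."""
    left, node, right = hit
    return f"{' '.join(left):>{pad}}  {node.upper()}  {' '.join(right)}"

# Invented sample text, for illustration only.
text = ("She lived alone in the house. He was alone with his thoughts. "
        "Left alone again, she wrote.")
lines = concordance(text, "alone", width=2, sort_by="next")
for hit in lines:
    print(format_line(hit))
```

Sorting on the neighbouring word groups similar patterns together, which makes recurring uses of the search word easier to spot.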
Word Frequency List (Wmatrix)
WordSmith frequency list of predicative adjectives, Modern British Women Fiction Writers Corpus
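A word-frequency list of the kind shown on these slides can be sketched as follows. This is an illustrative assumption rather than either tool's actual output format: `frequency_list` is a hypothetical helper, the sample sentence is invented, and no PoS filtering is done, so restricting the list to predicative adjectives would require a tagged corpus.

```python
import re
from collections import Counter

def frequency_list(text, top=None):
    """Return (word, raw count, percentage of all tokens), most frequent first.

    Ties are broken alphabetically; `top=None` returns the full list.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(w, n, round(100 * n / total, 2)) for w, n in ranked[:top]]

# Invented sample text, for illustration only.
sample = "She was alone and she was tired and she was glad"
for word, count, percent in frequency_list(sample, top=3):
    print(f"{word:<10}{count:>5}{percent:>8}")
```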
Key words list and dispersion plot
(ALONE in MBWFW corpus)
Consistency analysis indicates whether a word is found consistently across many different texts, only in a narrow set of texts, or in one specific text
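The ideas behind a dispersion plot and a consistency analysis can be shown in a short sketch. The code below is an assumption-laden illustration, not WordSmith's implementation: `dispersion` simply counts hits in equal slices of one text, `consistency` counts how many texts contain the word, and the three toy texts are invented rather than taken from the MBWFW corpus.

```python
import re

def tokenize(text):
    """Lowercased word tokens; letters and apostrophes only."""
    return re.findall(r"[a-z']+", text.lower())

def dispersion(text, word, segments=5):
    """Count hits of `word` in each of `segments` equal slices of the text."""
    tokens = tokenize(text)
    bins = [0] * segments
    for i, tok in enumerate(tokens):
        if tok == word:
            bins[i * segments // len(tokens)] += 1
    return bins

def consistency(corpus, word):
    """Number of texts in `corpus` (a dict of name -> text) that contain `word`."""
    return sum(1 for text in corpus.values() if word in tokenize(text))

corpus = {  # invented toy texts, not taken from the MBWFW corpus
    "text_a": "alone she walked and alone she stayed until the evening came",
    "text_b": "they talked all night and nobody felt alone",
    "text_c": "the garden was full of roses",
}

for name, text in corpus.items():
    # Crude text rendering of a dispersion plot: '|' marks a slice with hits.
    plot = "".join("|" if n else "." for n in dispersion(text, "alone"))
    print(f"{name}: {plot}")
print("texts containing 'alone':", consistency(corpus, "alone"))
```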
Lemmatized results for relational pairs
WordSmith and Wmatrix
Investigation of semantic domains through semantic tagging (Wmatrix)
Key Domain clouds (for Wmatrix only)
• The larger the word, the greater its “keyness”, i.e. the more unusually frequent it is compared with the BNC Written Sampler of imaginative texts.
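Keyness is obtained by comparing an item's frequency in the study corpus with its frequency in a reference corpus. The slides do not say which statistic was applied, so the sketch below assumes the log-likelihood measure, which is commonly used for this purpose. The function name and all the figures in the example are hypothetical.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness score for one item.

    freq_* are the item's counts, size_* the total token counts of the
    study and reference corpora. A score of 0 means the item is equally
    frequent, relative to corpus size, in both corpora.
    """
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    ll = 0.0
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# Hypothetical counts: an item seen 50 times in 1,000 tokens of the study
# corpus and 10 times in 1,000 tokens of the reference corpus.
print(round(log_likelihood(50, 1000, 10, 1000), 2))
```

The score alone does not show direction: an item is over-represented in the study corpus only if its relative frequency there is also the higher of the two.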
Comparison of linguistic software
Research and language learning
Word frequency
knowledge in present-day language textbooks (grammatical,
collocational, semantic) is frequency-based;
Real usage
corpora represent actual, not prescribed usage;
find the best equivalent;
investigate word classes and specific syntactic structures;
Teaching collocations
‘trouble and strife’, ‘the elephant in the room’; ‘blue murder’
specific content (sexist, racist or ideological, etc.)
identification of true authorship
Analysis of texts written in any language and any alphabet
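Returning to the collocation examples listed above (e.g. ‘blue murder’), a simple way to find candidate collocates is to count the words that occur near a chosen node word. This is a minimal sketch under stated assumptions: `collocates` and its `window` parameter are hypothetical, the sample sentences are invented, and raw co-occurrence counts stand in for the association measures a dedicated tool would offer.

```python
import re
from collections import Counter

def collocates(text, node, window=2):
    """Count words occurring within `window` tokens on either side of `node`.

    The window ignores sentence boundaries, which is a simplification.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Invented sample text, for illustration only.
sample = ("He screamed blue murder. They cried blue murder again. "
          "The sky was blue.")
near_murder = collocates(sample, "murder", window=1)
print(near_murder.most_common(3))
```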
