Week 12 - Personal

Report
Using Corpus Tools in Discourse
Analysis
Discourse and Pragmatics
Week 12
What is a corpus?

A collection of a large number of texts of a particular
type in digital format which can be easily searched and
manipulated with computer programs
What is corpus linguistics?

The analysis of collections of texts (corpora) with
computer tools in order to detect grammatical, lexical or
discourse level patterns, often with the aim of comparing
those patterns with those found in other collections of
texts.
Examples of corpus assisted discourse
analysis

Flowerdew (1997, 2002)




Analysis of the speeches of Gov. Chris Patten and CE Tung
Chee Hwa
Common themes: free market economy, freedom of the
individual, rule of law
Divergent themes: democracy, stability and harmony
Rey (2001)



Star Trek characters from 1966 to 1993
Female language has shifted from being more relational to
more informational
Male language has shifted from being more informational to
more relational
Advantages of using corpora




Easily detecting grammatical and lexical patterns in a large
number of texts
Reducing researcher bias
Efficiently detecting differences among varieties, registers,
genres, and Discourses
Corpus-based (deductive) vs. corpus-driven (inductive)
analysis
Disadvantages of using corpora





Separation of discourse from its social context
Corpus data usually confined to text (cannot account for
images, non-verbal behavior and other aspects of
multimodal discourse)
Frequency does not equal importance (sometimes very
important messages are implicit or ‘taken for granted’
rather than explicit)
‘People don’t say what they mean and people don’t mean
what they say’
Words have multiple meanings and word meanings
change over time and according to the context in which
they are used
Tools for corpus analysis

Online corpora and concordancers

Collins Bank of English
British National Corpus
Corpus of Contemporary American English
International Corpus of English

General vs. Specialized Corpora




Software tools



AntConc
ConcApp
WordSmith Tools
Preparing corpora





Collecting data (Internet? Scanning files?)
Txt files
Separate files for different texts
‘Cleaning’ files
‘Tagging’
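The ‘cleaning’ step can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure any particular tool uses: the function names clean_text and tokenize are hypothetical, and the choice of what counts as noise (HTML-style tags, bracketed notes such as [Chorus], curly apostrophes) is an assumption that should be adapted to the data actually collected.

```python
import re

def clean_text(raw):
    """Strip markup-like residue and normalise quotes and whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)        # drop HTML-style tags
    text = re.sub(r"\[[^\]]*\]", " ", text)    # drop bracketed notes such as [Chorus]
    # Replace curly apostrophes with straight ones so contractions match consistently
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()

def tokenize(text):
    """Lower-case the text and split it into word tokens.

    Assumption: contractions such as "don't" are kept as one token.
    Some concordancers split at the apostrophe instead.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
```

Each text would normally be read from its own .txt file, cleaned, and tokenized before any of the counts in the following procedures are made.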
Procedures in corpus analysis






Type token ratio
Dispersion plots
Frequency lists
Concordance data
Collocation calculations
Keyword calculations
Example



Lady Gaga’s lyrics
Total of 59 songs
Reference corpus: 100 top songs from November 2010
Type Token Ratio
Number of types divided by
the number of tokens
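As a rough sketch of the calculation, the ratio can be computed as below. The tokenizer, the function names, and the sample sentence are all invented for illustration, and the result is expressed as a percentage on the assumption that this is how such figures are usually reported. Note that a raw ratio falls as texts get longer, so corpora of different sizes are often compared with a standardised version calculated over fixed-size chunks.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into word tokens (a simple assumption)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def type_token_ratio(tokens):
    """Return (number of types / number of tokens) as a percentage."""
    if not tokens:
        return 0.0
    types = set(tokens)          # each distinct word form counts once
    return 100 * len(types) / len(tokens)

# Invented toy sentence: 11 tokens, 8 types
sample = "the cat sat on the mat and the dog sat too"
ratio = type_token_ratio(tokenize(sample))
print(round(ratio, 2))
```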
Type Token Ratio







Low indicates narrow range of subjects, lack of variety or
frequent repetition
High indicates wide range of subjects, great variation, less
frequent repetition
BNC Written = 45.53
BNC Spoken = 32.96
Baker’s Holiday Pamphlets = 40.03
100 Song Corpus = 9.07
Gaga Corpus = 11.4
Frequency lists

Function words (articles, prepositions, conjunctions,
pronouns, etc.)



Useful in answering questions about style, register
Pronouns can be particularly important
Content words (nouns, verbs, adjectives, adverbs)

Useful in answering questions about topics/ Discourses
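A frequency list of this kind can be sketched as follows. The tokenizer, the function names, and the tiny FUNCTION_WORDS set are all hypothetical stand-ins: a real analysis would use a fuller list of function words or a part-of-speech tagger to separate the two classes.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list of function words for illustration only
FUNCTION_WORDS = {"i", "you", "the", "and", "it", "me", "a", "to", "of", "in", "my", "on"}

def tokenize(text):
    """Lower-case the text and split it into word tokens (a simple assumption)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def frequency_list(tokens, top_n=5):
    """Return (word, count, percentage of all tokens) for the top_n words."""
    counts = Counter(tokens)
    total = len(tokens)
    return [(word, count, round(100 * count / total, 2))
            for word, count in counts.most_common(top_n)]

def split_by_class(tokens):
    """Separate tokens into function words and content words."""
    function = [t for t in tokens if t in FUNCTION_WORDS]
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return function, content

# Invented toy sentence
tokens = tokenize("I see you and you see me and I see the sea")
function, content = split_by_class(tokens)
print(frequency_list(function))
print(frequency_list(content))
```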
Top 5 function words

100 Song Corpus: I, you, the, and, it
Gaga Corpus: I, you, the, oh, me

100 Song Corpus: I = 5.09%, me = 1.3%
Gaga Corpus: I = 4.4%, me = 2.03%
Murphey 1992: The word count revealed that the total referents in first person
(I, me, my, mine, etc.) amounted to 10% of the total words
‘t (not)

100 Song Corpus: ranked 7 (1.3%)
Gaga Corpus: ranked 9 (1.59%)


Top 5 content words

100 Song Corpus: like, no, can, baby, know; (love) (0.42%)
Gaga Corpus: love (0.98%), baby, can, want, know

Concordances



Can reveal contexts of frequent words
Sorting strategies
Searching for patterns
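A concordance (key word in context) display and one sorting strategy can be sketched as below. The function names and the toy token list are invented for illustration; dedicated concordancers offer many more sorting and search options than this.

```python
def concordance(tokens, node, width=4):
    """Return (left context, node, right context) for every hit of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def sort_by_right(lines):
    """Sort concordance lines alphabetically by their right-hand context."""
    return sorted(lines, key=lambda line: line[2])

# Invented toy data
tokens = "we love music and they love art".split()
for left, node, right in sort_by_right(concordance(tokens, "love", width=2)):
    print(f"{left:>12}  {node}  {right}")
```

Sorting on the right-hand context groups together lines in which the search word is followed by the same words, which makes recurring patterns easier to spot.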
Collocation




‘Co-location’
The frequency with which words appear close to other
words
‘You shall know a word by the company it keeps.’
(Firth 1957)
Span (xL, xR)
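Counting collocates within a span can be sketched as follows. The function name and toy data are invented, and the sketch ranks collocates by raw co-occurrence count only. Corpus tools often rank by a statistical measure such as mutual information instead, so this is not necessarily how any given tool produces its list.

```python
from collections import Counter

def collocates(tokens, node, left=1, right=1):
    """Count words found within `left` tokens before and `right` tokens after node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # The span window: up to `left` words before and `right` words after the node
        window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts

# Invented toy data, span 1L, 1R
tokens = "i want it and i want you but i know".split()
print(collocates(tokens, "i", left=1, right=1).most_common(5))
```

Widening the span (for example 5L, 5R) picks up looser associations at the cost of more noise.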
Top 5 collocates for ‘I’

100 Song Corpus: ‘m, and, can, know, ‘ll
Gaga Corpus: ‘m, want, ‘ll, don’t, can

Span: 1L, 1R

Top 5 collocates of ‘love’

100 Song Corpus: I, you, my, me, the
Gaga Corpus: I, fu, want, ‘t, revenge

Span: 5L, 5R

Keywords


The frequency of words in a corpus in relation to another
corpus
The statistical significance of a keyword's frequency in a
given corpus, relative to a reference corpus.
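One common way to test keyness is the log-likelihood statistic, sketched below. This is an assumption about method, not a statement of what was used for these slides: other tests (such as chi-square) are also used. The function names and toy corpora are invented. The default threshold of 3.84 is the critical value for p < 0.05 with one degree of freedom.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood for a word occurring a times in a study corpus of c tokens
    and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)   # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)   # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study_tokens, reference_tokens, threshold=3.84):
    """Return (word, score) pairs for words unusually frequent in the study corpus."""
    c, d = len(study_tokens), len(reference_tokens)
    if c == 0 or d == 0:
        return []
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    results = []
    for word, a in study.items():
        b = ref[word]
        # Keep only words relatively MORE frequent in the study corpus
        if a / c > b / d:
            score = log_likelihood(a, b, c, d)
            if score >= threshold:
                results.append((word, round(score, 2)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Invented toy corpora
study = ["monster"] * 20 + ["love"] * 80
reference = ["love"] * 100
print(keywords(study, reference))
```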
Keywords: semantic domains

lover*, romance*, love*, loves

fame*, fancy*, ribbons*, glitter, fashion, vanity, rich, presents, famous

retro*, bang*, shake*, dirty*, grease*, bad*, teeth, monster, filthy, oh*, eh*
What does this analysis tell us about Lady
Gaga’s lyrics?



Style and texture
Who’s doing what
Discourses and ideology
