Text Corpora and Lexical Resources

Chapter 2 of Natural Language Processing with Python
So far
• We have learned the basics of Python
– Reading and writing – interactive and files
– Control structures
• if, while, for, function and class definitions
– Important data structures:
• lists, tuples, numeric (int and float)
– Basic natural language processing
• Expanding the scope of textual
information we can access
• Additional language constructions for
working with text
• Reintroduce some Python structures for
organizing programs
Text corpora
• A collection of text entities
– Usually there is some unifying
characteristic, but not always
– Typical examples
• All issues of a newspaper for a period of time
• A collection of reports from a particular industry
or standards body
– More recent
• The whole collection of posts to Twitter
• All the entries in a blog or set of blogs
Check it out
• Go to http://www.gutenberg.org/
• Take a few minutes to explore the site.
– Look at the top 100 downloads of yesterday
– Can you characterize them? What do you
think of this list?
Corpora in nltk
• NLTK includes part of the Project Gutenberg collection
• Find out which texts by listing the file ids:
>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
• These are the texts of the Gutenberg collection that are downloaded with the nltk package.
Accessing other texts
• We will explore the files loaded with nltk
• You may want to explore other texts also.
• From the help(nltk.corpus):
– If C{item} is one of the unique identifiers listed
in the corpus module's C{items} variable, then
the corresponding document will be loaded
from the NLTK corpus package.
– If C{item} is a filename, then that file will be
For now – just a note that we can use these tools on other
texts that we download or acquire from any source.
Using the tools we saw before
• The particular texts we saw in chapter 1
were accessed through aliases that
simplified the interaction.
• Now, in the more general case, we have to do a little more ourselves.
• To get the list of words in a text:
>>> import nltk
>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
• Now we have the form we had for the texts of Chapter 1
and can use the tools found there. Try:
>>> len(emma)
Note the frequency of use of Jane Austen books ???
Shortened reference
• Global context
– Instead of citing the gutenberg corpus in full for each access, import it once:
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]
>>> emma = gutenberg.words('austen-emma.txt')
• So, nltk.corpus.gutenberg.words('austen-emma.txt')
becomes just gutenberg.words('austen-emma.txt')
Other access options
• gutenberg.words('austen-emma.txt')
– the words of the text
• gutenberg.raw('austen-emma.txt')
– the original text, no separation into tokens
(words). One long string.
• gutenberg.sents('austen-emma.txt')
– the text divided into sentences
Some code to run
• Enter and run the code for counting
characters, words, sentences and finding
the lexical diversity score of each text in
the corpus.
import nltk
from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))    # characters, including spaces
    num_words = len(gutenberg.words(fileid))  # word and punctuation tokens
    num_sents = len(gutenberg.sents(fileid))
    # vocabulary: distinct tokens, ignoring case
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    # average word length, average sentence length, lexical diversity
    print(int(num_chars/num_words), int(num_words/num_sents),
          int(num_words/num_vocab), fileid)
Short, simple code. Already seeing some noticeable time to run.
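A rough sketch of the three measures the loop prints, computed on a toy string so it runs without the corpus. The regex tokenization is an assumption made for this sketch. NLTK's words() and sents() use their own tokenizers and will give different counts on real text.

```python
import re

# Toy text standing in for gutenberg.raw(fileid); typed by hand, not from the corpus.
raw = "The cat sat. The cat ran away! Where did the cat go?"

# Naive tokenization (assumption): letters only for words, split on . ! ? for sentences.
words = re.findall(r"[A-Za-z]+", raw)
sents = [s for s in re.split(r"[.!?]+\s*", raw) if s]
vocab = set(w.lower() for w in words)

avg_word_len = len(raw) / len(words)          # characters per word (spaces and punctuation included)
avg_sent_len = len(words) / len(sents)        # words per sentence
lexical_diversity = len(words) / len(vocab)   # average uses of each vocabulary item

print(len(raw), len(words), len(sents), len(vocab))
print(round(avg_word_len, 2), round(avg_sent_len, 2), round(lexical_diversity, 2))
```

A lexical diversity near 1 means almost every word is new. Larger values mean the same words are reused more often.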
Modify the code
• Simple change – print out the total
number of characters, words, sentences
for each text.
The text corpus
• Take a look at your directory of nltk_data to
see the variety of text materials accessible
to you.
– Some are not plain text and we cannot use
them yet – but will
– Of the plain text, note the diversity
• Classic published materials
• News feeds, movie reviews
• Overheard conversations, internet chat
– All categories of language are needed to
understand the language as it is defined and as
it is used.
The Brown Corpus
• First 1 million word corpus
• Explore –
– what are the categories?
– Access words or sentences from one or
more categories or fileids
>>> import nltk
>>> from nltk.corpus import brown
>>> brown.categories()
>>> brown.fileids(categories="<choose>")
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist(w.lower() for w in news_text)
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print(m + ':', fdist[m], end=' ')
• Enter that code and run it.
• What does it give you?
• What does it mean?
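To see what the frequency distribution is doing, here is a small sketch with collections.Counter standing in for nltk.FreqDist. The word list is a toy example, not Brown Corpus data, so the counts say nothing about real news text.

```python
from collections import Counter

# Toy word list standing in for brown.words(categories='news').
news_text = ["He", "will", "go", "if", "he", "can", ",", "but", "she",
             "may", "not", ".", "Will", "they", "?", "They", "must", "."]

# Counter plays the role of nltk.FreqDist here (an illustrative substitute).
# Lowercasing first means "Will" and "will" are counted together.
fdist = Counter(w.lower() for w in news_text)

modals = ["can", "could", "may", "might", "must", "will"]
for m in modals:
    # A word that never occurs counts as 0 rather than raising KeyError.
    print(m + ":", fdist[m], end=" ")
print()
```

The same lookup pattern, fdist[word], is what the loop over modals uses on the real corpus.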
Spot check
• Repeat the previous code, but look for
the use of those same words in the
categories for religion, government
• Now analyze the use of the “wh” words
in the news category and one other of
your choice. (Who, What, Where,
When, Why)
One step comparison
• Consider the following code:
import nltk
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)
Enter and run it.
What does it do?
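One way to picture the result is as one counter per genre, printed as a grid. The sketch below builds that by hand with the standard library. The tabulate function here is a hypothetical helper written for this example, not NLTK's method, and the word lists are made up.

```python
from collections import Counter, defaultdict

# Toy (genre -> words) data; genres and words are invented for illustration.
corpus = {
    "news": ["the", "board", "will", "meet", "and", "may", "vote"],
    "romance": ["she", "could", "not", "see", "how", "he", "could", "go"],
}

# One Counter per condition (genre): a hand-rolled conditional frequency table.
cfd = defaultdict(Counter)
for genre, words in corpus.items():
    for word in words:
        cfd[genre][word] += 1

def tabulate(cfd, conditions, samples):
    """Return the table as text: one row per condition, one column per sample."""
    width = max(len(c) for c in conditions)
    lines = [" " * width + "".join(f"{s:>7}" for s in samples)]
    for c in conditions:
        lines.append(f"{c:>{width}}" + "".join(f"{cfd[c][s]:>7}" for s in samples))
    return "\n".join(lines)

table = tabulate(cfd, ["news", "romance"], ["could", "may", "will"])
print(table)
```

Restricting the conditions and samples, as the slide's code does, keeps the grid small enough to read.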
Other corpora
• There is some information about the
Reuters and Inaugural address corpora
also. Take a look at them with the online
site. (5 minutes or so)
Spot Check
• Take a look at Table 2-2 for a list of some of
the material available from the nltk project. (I
cannot fit it on a slide in any meaningful way)
• Confirm that you have downloaded all of
these (when you did the nltk.download, if you
selected all)
• Find them in your directory and explore.
– How many languages are represented?
– How would you describe the variety of content?
What do you find most
• The Universal Declaration of Human Rights
is available in 300 languages.
Organization of Corpora
• The organization will vary according to
the type of corpus. Knowing the
organization may be important for using
the corpus.
Table 2.3 – Basic Corpus Functionality in NLTK
Example                      Description
fileids()                    the files of the corpus
fileids([categories])        the files of the corpus corresponding to these categories
categories()                 the categories of the corpus
categories([fileids])        the categories of the corpus corresponding to these files
raw()                        the raw content of the corpus
raw(fileids=[f1,f2,f3])      the raw content of the specified files
raw(categories=[c1,c2])      the raw content of the specified categories
words()                      the words of the whole corpus
words(fileids=[f1,f2,f3])    the words of the specified fileids
words(categories=[c1,c2])    the words of the specified categories
sents()                      the sentences of the whole corpus
sents(fileids=[f1,f2,f3])    the sentences of the specified fileids
sents(categories=[c1,c2])    the sentences of the specified categories
abspath(fileid)              the location of the given file on disk
encoding(fileid)             the encoding of the file (if known)
open(fileid)                 open a stream for reading the given corpus file
root                         the path to the root of locally installed corpus
readme()                     the contents of the README file of the corpus
Types of information returned from typical corpus readers
From help(nltk.corpus.reader):
Corpus reader functions are named based on the type of information
they return. Some common examples, and their return types, are:
- I{corpus}.words(): list of str
- I{corpus}.sents(): list of (list of str)
- I{corpus}.paras(): list of (list of (list of str))
- I{corpus}.tagged_words(): list of (str,str) tuple
- I{corpus}.tagged_sents(): list of (list of (str,str))
- I{corpus}.tagged_paras(): list of (list of (list of (str,str)))
- I{corpus}.chunked_sents(): list of (Tree w/ (str,str) leaves)
- I{corpus}.parsed_sents(): list of (Tree with str leaves)
- I{corpus}.parsed_paras(): list of (list of (Tree with str leaves))
- I{corpus}.xml(): A single xml ElementTree
- I{corpus}.raw(): unprocessed corpus contents
For example, to read a list of the words in the Brown Corpus, use
>>> from nltk.corpus import brown
>>> print brown.words()
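The nesting in those return types is easier to see on a tiny example. This sketch builds toy data in the same shapes by hand and flattens it one level at a time. The sentences and tags are illustrative, and real corpus readers return lazy sequence objects rather than plain lists.

```python
# Toy data in the paras() shape: list of paragraphs, each a list of sentences,
# each a list of word strings.
paras = [
    [["Call", "me", "Ishmael", "."], ["Some", "years", "ago", "."]],  # paragraph 1: two sentences
    [["It", "was", "cold", "."]],                                      # paragraph 2: one sentence
]

# paras -> sents: drop the paragraph level.
sents = [sent for para in paras for sent in para]
# sents -> words: drop the sentence level.
words = [word for sent in sents for word in sent]

# tagged_words() shape: (word, tag) tuples; the tags here are illustrative only.
tagged_words = [("It", "PRON"), ("was", "VERB"), ("cold", "ADJ"), (".", ".")]

print(len(paras), len(sents), len(words))
print(type(words[0]).__name__, type(sents[0]).__name__, type(tagged_words[0]).__name__)
```

Each extra "list of" in the help text corresponds to one more level of nesting to loop over.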
Spot check
• Choose a corpus and exercise some of
the functions
– Look at raw, words, sents, categories,
fileids, encoding
• Repeat for a source in a different
• Work in pairs and talk about what you
find, what you might want to look for.
– Report out briefly
Working with your own sources
• NLTK provides a great bunch of resources,
but you will certainly want to access your
own collections – other books you
download, or files you create, etc.
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]
You could get the list of files in any directory this way.
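Conceptually, a plain-text reader does two things: it picks the files under a root directory whose names match a pattern, and it splits each file into tokens. The sketch below imitates that with the standard library only. list_fileids and read_words are hypothetical helpers written for this example, not NLTK functions, and the tokenizer is a simple regex.

```python
import re
import tempfile
from pathlib import Path

def list_fileids(root, pattern):
    """Names of files directly under root whose whole name matches the regex."""
    return sorted(p.name for p in Path(root).iterdir()
                  if p.is_file() and re.fullmatch(pattern, p.name))

def read_words(root, fileid):
    """Split one file into word tokens and punctuation tokens (naive regex)."""
    text = (Path(root) / fileid).read_text(encoding="utf-8")
    return re.findall(r"\w+|[^\w\s]+", text)

# Build a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as corpus_root:
    (Path(corpus_root) / "notes.txt").write_text("Hello, corpus world.", encoding="utf-8")
    (Path(corpus_root) / "draft.txt").write_text("A second file.", encoding="utf-8")
    (Path(corpus_root) / "image.png").write_bytes(b"\x89PNG")

    fileids = list_fileids(corpus_root, r".*\.txt")   # pattern excludes the .png
    words = read_words(corpus_root, "notes.txt")

print(fileids)
print(words)
```

A narrower pattern than '.*' is a simple way to keep non-text files out of your own corpus.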
Other Corpus readers
• There are a number of different readers
for different types of corpora.
• Many files in corpora are “marked up” in
various ways and the reader needs to
understand the markings to return
meaningful results.
• We will stick to the
PlaintextCorpusReader for now
Conditional Frequency
• When texts in a corpus are divided into
categories, we may want to look at the
characteristics by category – word use by
author or over time, for example
Figure 2.4: Counting Words Appearing in a Text Collection (a conditional
frequency distribution)
Frequency Distributions
• A frequency distribution counts some
occurrence, such as the use of a word or
• A conditional frequency distribution counts some occurrence separately for
each of a number of conditions
(author, date, genre, etc.)
• For example:
>>> genre_word = [(genre, word)
...               for genre in ['news', 'romance']
...               for word in brown.words(categories=genre)]
>>> len(genre_word)
170576
Think about this. What exactly is happening?
What are those 170,576 things? Run the code, then enter just
>>> genre_word
• For each genre (‘news’, ‘romance’)
• loop over every word in that genre
• produce the pairs showing the genre and
the word
• What type of data is genre_word?
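The same nested comprehension on a toy dictionary makes the structure easy to inspect. The word lists here are made up and stand in for brown.words(categories=genre).

```python
# Toy stand-in for brown.words(categories=genre); the words are invented.
toy_words = {
    "news": ["The", "jury", "said", "."],
    "romance": ["They", "neither", "liked", "nor", "disliked", "him", "."],
}

# Outer loop over genres, inner loop over that genre's words:
# every word token is emitted as a (genre, word) tuple.
genre_word = [(genre, word)
              for genre in ["news", "romance"]
              for word in toy_words[genre]]

print(type(genre_word).__name__, type(genre_word[0]).__name__)
print(len(genre_word))     # one pair per word token: 4 + 7
print(genre_word[:2])      # the first pairs come from 'news'
print(genre_word[-2:])     # the last pairs come from 'romance'
```

The length of the list is the total number of tokens across the chosen genres, which is why the real example is so large.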
Spot check
• Refining the result
– When you displayed genre_word, you may
have noticed that some of the words are
not words at all. They are punctuation
– Refine this code to eliminate the entries in
genre_word in which the word is not all letters
– Remove duplicate words that differ only in capitalization
Work together. Talk about what you are doing. Share your ideas and
Conditional Frequency
• From the list of pairs we created, we can
generate a conditional frequency
distribution of words by genre
>>> cfd = nltk.ConditionalFreqDist(genre_word)
>>> cfd
>>> cfd.conditions()
Run these. Look at the results.
Look at the conditional frequency distribution for each genre:
>>> cfd['news']
<FreqDist with 100554 outcomes>
>>> cfd['romance']
<FreqDist with 70022 outcomes>
>>> list(cfd['romance'])
[',', '.', 'the', 'and', 'to', 'a', 'of', '``', "''", 'was', 'I', 'in', 'he', 'had',
'?', 'her', 'that', 'it', 'his', 'she', 'with', 'you', 'for', 'at', 'He', 'on', 'him',
'said', '!', '--', 'be', 'as', ';', 'have', 'but', 'not', 'would', 'She', 'The', ...]
>>> cfd['romance']['could']
Presenting the results
• Plotting and tabulating
– concise representations of the frequency
• Tabulate cfd.tabulate()
• With no parameters, simply tabulates all
the conditions against all the values
Look closely
>>> from nltk.corpus import inaugural
>>> cfd = nltk.ConditionalFreqDist(
...     (target, fileid[:4])                   # the two axes
...     for fileid in inaugural.fileids()      # get the text
...     for w in inaugural.words(fileid)       # all the words in each file
...     for target in ['america', 'citizen']
...     if w.lower().startswith(target))       # narrow the word choice
Remember List Comprehension?
Three elements
• For a conditional frequency distribution:
– Two axes
• condition or event, something of interest
• some connected characteristic – a year, a place, an
author, anything that is related in some way to the event
– Something to count
• For the condition and the characteristic, what are we
counting? Words? actions? what?
– From the previous example
• inaugural addresses
• specific words
• count the number of times that a form of either of those
words occurred in that address
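Here are the three elements spelled out as explicit loops on a hypothetical mini-corpus. The file names and word lists are invented. Only the year-prefix naming mirrors the fileid[:4] trick from the inaugural example.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus: file names start with a four-digit year.
addresses = {
    "1801-Example.txt": ["Fellow", "citizens", "of", "America"],
    "1905-Example.txt": ["American", "citizens", "and", "citizenship"],
}
targets = ["america", "citizen"]

cfd = defaultdict(Counter)
for fileid, words in addresses.items():
    year = fileid[:4]                       # the connected characteristic
    for w in words:
        for target in targets:              # the condition of interest
            if w.lower().startswith(target):
                cfd[target][year] += 1      # the thing being counted

for target in targets:
    print(target, dict(cfd[target]))
```

Because the test is startswith, "American" counts toward "america" and "citizenship" counts toward "citizen".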
Spot check
• Run the code on the previous example.
• How many times was some version of
“citizen” used in the 1909 inaugural address?
• How many times was “america”
mentioned in 2009?
• Play with the code. What can you leave
off and still get some meaningful result?
Another case
• Somewhat simpler specification
• Distribution of length of word in
languages, with restriction on languages
>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
... 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
...     (lang, len(word))
...     for lang in languages
...     for word in udhr.words(lang + '-Latin1'))
Now tabulate
>>> cfd.tabulate(conditions=['English', 'German_Deutsch'],
...              samples=range(10), cumulative=True)
                  0    1    2    3    4    5    6    7    8    9
       English    0  185  525  883  997 1166 1283 1440 1558 1638
German_Deutsch    0  171  263  614  717  894 1013 1110 1213 1275
• Only choose to tabulate some of the
Note – so far, I cannot do plots. I hope to get that fixed. If you can do
plots, do try some of the examples.
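The cumulative=True option changes each column from "words of exactly this length" to "words of this length or shorter". A small sketch of that difference, using one short English sentence typed in by hand (the opening of UDHR Article 1) rather than udhr.words(...), with itertools.accumulate standing in for NLTK's cumulative tabulation:

```python
from collections import Counter
from itertools import accumulate

# One sentence entered by hand; the real example reads a whole file per language.
words = "all human beings are born free and equal in dignity and rights".split()

# Frequency of each word length, as in the (lang, len(word)) pairs.
length_counts = Counter(len(w) for w in words)

samples = range(10)
plain = [length_counts[n] for n in samples]   # words of exactly length n
cumulative = list(accumulate(plain))          # words of length n or less

print("length     ", *samples)
print("plain      ", *plain)
print("cumulative ", *cumulative)
```

A cumulative row never decreases from left to right, which matches the shape of the English and German_Deutsch rows in the table.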
Common methods for Conditional Frequency Distributions
Example                                Description
cfdist = ConditionalFreqDist(pairs)    create a conditional frequency distribution from a list of pairs
cfdist.conditions()                    alphabetically sorted list of conditions
cfdist[condition]                      the frequency distribution for this condition
cfdist[condition][sample]              frequency for the given sample for this condition
cfdist.tabulate()                      tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)   tabulation limited to the specified samples and conditions
cfdist.plot()                          graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions)       graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2                      test if samples in cfdist1 occur less frequently than in cfdist2
• This set of slides comes very directly
from the book, Natural Language
Processing with Python. www.nltk.org
