A Framework for Automated Corpus Generation
for Semantic Sentiment Analysis
Amna Asmi and Tanko Ishaya, Member, IAENG
Proceedings of the World Congress on Engineering 2012 Vol I
WCE 2012, July 4 - 6, 2012, London, U.K.
• A variety of corpora exist (WordNet, SentiWordNet and
Multi-Perspective Question Answering (MPQA))
• Some corpora are not large enough
• Generation and annotation is time consuming and
• This paper presents a framework for automated
generation of corpus for semantic sentiment analysis of
user generated web-content
Existing corpora
• Movie Review (Pang et al., 2002)
• Varbrul (Sankoff and Cedergren; program based on
multivariate analysis)
• Fidditch (automated parser for English)
• Automatic Mapping Among Lexico-Grammatical
Annotation Models (AMALGAM)
• International Corpus of English (ICE)
Existing Techniques for
Sentiment Analysis
• Direction-based text: text containing opinions, sentiments,
affects and biases
• Opinion mining using ML techniques (supervised/
unsupervised) at document, sentence or clause level
• Polarity, degree of polarity, features, subjectivity,
relationships, identification, affect types, mood
classification and ordinal scale
Annotation Process
• Methodology
• URL, author, subject, text and comments are collected
• Text is broken into sentences
• Each sentence is processed with the Stanford Dependencies parser and
Penn Treebank tagging, then broken down into clauses
• Subject-Verb-Object triplets are extracted
• Rules based on POS, negation, punctuation and conjunctions
are specified using SentiWordNet and WordNet
• Rules used to extract sentiment, and define polarity and
• Subjectivity is calculated from the subject and object of each
sentence and the topic/title of the post
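The Subject-Verb-Object step above can be illustrated with a minimal sketch. It assumes the parser output is already available as (relation, governor, dependent) tuples and uses the Stanford Dependencies relation names nsubj and dobj; the function name extract_svo and the sample dependencies are hypothetical, not taken from the framework.

```python
from typing import List, Optional, Tuple

# A typed dependency as (relation, governor, dependent),
# e.g. ("nsubj", "helped", "treatment").
Dependency = Tuple[str, str, str]


def extract_svo(dependencies: List[Dependency]) -> List[Tuple[str, str, Optional[str]]]:
    """Pair each verb's nsubj with its dobj (if any) to form S-V-O triplets."""
    subjects = {}
    objects = {}
    for relation, governor, dependent in dependencies:
        if relation == "nsubj":
            subjects[governor] = dependent
        elif relation == "dobj":
            objects[governor] = dependent
    # A verb without a direct object yields None in the object slot.
    return [(subj, verb, objects.get(verb)) for verb, subj in subjects.items()]


# Hand-written dependencies for "The treatment helped my sister"
# (assumed parser output, for illustration only).
deps = [
    ("det", "treatment", "The"),
    ("nsubj", "helped", "treatment"),
    ("poss", "sister", "my"),
    ("dobj", "helped", "sister"),
]
triplets = extract_svo(deps)
```

Clauses with passive subjects or indirect objects would need further relations; the sketch covers only the simplest active-voice case.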
Tools used
Stanford Parser
Penn Treebank
UMLS (Unified Medical Language System)
• Repository:
WordNet, SentiWordNet dictionaries,
UMLS Metathesaurus
Rules for sentence, polarity,
subjectivity and sentiment
identification and analysis
• Data Pre-processor:
Input: Unstructured data from medical
Input cleaned and filtered
Captures thread structure, comments
of forum, and arranges other info like
author, topic, date.
Spell checks
Split to set of posts and sent to post
• Post Pre-Processor
Splits text into sentences using Penn
Treebank tagger
Passes sentences to syntactic parser
Keeps track of start and end of post
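As an illustration of the post pre-processing step above, the sketch below splits posts into sentences while recording a post index and sentence index for each one, so post boundaries stay recoverable. A naive punctuation regex stands in for the tagger-based splitter, and split_posts is a hypothetical name.

```python
import re
from typing import List, Tuple

# Naive boundary rule (an assumption): sentence-ending punctuation
# followed by whitespace. The framework itself uses a tagger here.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_posts(posts: List[str]) -> List[Tuple[int, int, str]]:
    """Return (post_index, sentence_index, sentence) for every sentence."""
    records = []
    for post_index, post in enumerate(posts):
        sentences = [s for s in _BOUNDARY.split(post.strip()) if s]
        for sentence_index, sentence in enumerate(sentences):
            records.append((post_index, sentence_index, sentence))
    return records


records = split_posts([
    "The new drug worked. I feel better!",
    "Did it cause side effects?",
])
```

A change in post_index between consecutive records marks the end of one post and the start of the next.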
• Syntactic Parser (SP)
Collects sentences iteratively and
invokes POS tagger
Named entities and idioms are
Identifies dependencies/relationships
Classifies sentence as a question,
assertion, comparison, confirmation
seeking or confirmation providing
• Sentiment Analyser (SA)
Extracts sentiment oriented words
from each sentence by using
relationship info (dependencies within)
Polarity Calculator (PC) identifies positive
and negative words.
Synonyms used if word is not found
Collects synonyms from
Uses UMLS Metathesaurus if
synonym not found
Rules for polarity identification used
Subjectivity Calculator (SC)
Considers POS and relationships
Identifies all sentences related to topic
Takes nouns and associated info
(synonyms, homonyms, meronyms,
holonyms and hyponyms)
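The synonym fallback used by the Polarity Calculator can be sketched as follows. The two dictionaries are toy stand-ins (assumptions) for SentiWordNet scores and for WordNet or UMLS Metathesaurus synonyms; the scores, words and function names are illustrative only.

```python
from typing import Dict, List, Optional

# Toy stand-ins (assumptions): real scores would come from SentiWordNet,
# real synonyms from WordNet or the UMLS Metathesaurus.
LEXICON: Dict[str, float] = {"good": 0.75, "bad": -0.625, "painful": -0.5}
SYNONYMS: Dict[str, List[str]] = {"agonizing": ["painful"], "beneficial": ["good"]}


def word_polarity(word: str) -> Optional[float]:
    """Look the word up directly, then fall back to its synonyms."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    for synonym in SYNONYMS.get(word, []):
        if synonym in LEXICON:
            return LEXICON[synonym]
    return None  # Unknown word: no polarity assigned.


def label(score: Optional[float]) -> str:
    """Map a score to a coarse polarity label."""
    if score is None or score == 0:
        return "neutral"
    return "positive" if score > 0 else "negative"
```

For example, word_polarity("agonizing") is resolved through its synonym "painful", since "agonizing" itself is absent from the toy lexicon.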
Sentiment Analyser:
Takes polarities of sentences marked by
SC for post polarity calculation
Takes aggregate of all polarities of
sentences related to post
Generates sentiment frame info for each
Frame contains type, subject,
object/feature, sentiment oriented
word(s), sentiment type (absolute /
relative), strength (very weak, weak,
average, strong, very strong), polarity of
sentence, post index and sentence index
Forwards calculated values and info to
Sentiment Frame manager
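A sentiment frame with the fields listed above could be represented as in the sketch below. The field names are hypothetical, and the slides do not say which aggregate is used for post polarity, so the arithmetic mean here is an assumption.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class SentimentFrame:
    sentence_type: str        # e.g. "assertion", "question"
    subject: str
    object_or_feature: str
    sentiment_words: List[str]
    sentiment_type: str       # "absolute" or "relative"
    strength: str             # "very weak" ... "very strong"
    polarity: float           # polarity of the sentence
    post_index: int
    sentence_index: int


def post_polarity(frames: List[SentimentFrame], post_index: int) -> float:
    """Aggregate sentence polarities for one post (mean is an assumption)."""
    scores = [f.polarity for f in frames if f.post_index == post_index]
    return mean(scores) if scores else 0.0


# Illustrative frames for two sentences of the same post.
frames = [
    SentimentFrame("assertion", "drug", "pain", ["relieved"],
                   "absolute", "strong", 0.5, 0, 0),
    SentimentFrame("assertion", "drug", "nausea", ["caused"],
                   "absolute", "weak", -0.25, 0, 1),
]
overall = post_polarity(frames, 0)
```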
• Sentiment Frame Manager
Stores all information to a physical
Loads all frames in tree structure at
runtime memory on program load
Keeps track of changes and appends
Stored into XML file
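Storing frames as XML and reloading them into a tree can be sketched with Python's standard xml.etree.ElementTree module. The element names corpus, post and frame, and the attribute layout, are hypothetical; the slides do not describe the actual schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def frames_to_xml(frames: List[Dict[str, str]]) -> str:
    """Serialise frames as <corpus><post index=".."><frame .../></post></corpus>."""
    root = ET.Element("corpus")
    posts: Dict[str, ET.Element] = {}
    for frame in frames:
        post_id = frame["post_index"]
        if post_id not in posts:
            posts[post_id] = ET.SubElement(root, "post", index=post_id)
        # Every remaining field becomes an attribute of the frame element.
        attrs = {k: v for k, v in frame.items() if k != "post_index"}
        ET.SubElement(posts[post_id], "frame", attrs)
    return ET.tostring(root, encoding="unicode")


def load_frames(xml_text: str) -> List[Dict[str, str]]:
    """Rebuild the flat frame list from the XML tree."""
    frames = []
    for post in ET.fromstring(xml_text).iter("post"):
        for node in post.iter("frame"):
            frames.append({"post_index": post.get("index"), **node.attrib})
    return frames


sample = [
    {"post_index": "0", "sentence_index": "0", "subject": "drug", "polarity": "0.5"},
    {"post_index": "0", "sentence_index": "1", "subject": "drug", "polarity": "-0.25"},
    {"post_index": "1", "sentence_index": "0", "subject": "doctor", "polarity": "0.0"},
]
xml_text = frames_to_xml(sample)
restored = load_frames(xml_text)
```

Grouping frames under their post keeps the post/sentence hierarchy visible in the file and lets the whole corpus be loaded as one tree at start-up.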
Future Work
• Currently being evaluated using medical forums
• Plans to make it general purpose
Thank You
GIFs courtesy : http://www.retrojunkie.com/