Social Media Analytics: From Security Informatics to Business

Business Intelligence and Analytics:
Overview and Examples
Dr. Hsinchun Chen
Director, Artificial Intelligence Lab, University of Arizona
http://ai.arizona.edu
BI & Analytics: The Field
• The Data Deluge (The Economist, March 2010); internet traffic 667 exabytes by 2013 (Cisco); total amount of information in 2010: 1.2 zettabytes (KB-MB-GB-TB-PB-EB-ZB-YB)
• BIG DATA → BIG COMPUTATION → BIG ANALYTICS → BIG (SOCIETAL) IMPACT
• $3B BI revenue in 2009 (Gartner, 2006); $9.4B BI software M&A spending in 2010 and $14.1B by 2014 (Forrester)
• IBM spent $14B in BI in five years; $9B BI revenue in 2010 (USA Today, November 2010); 24 acquisitions, 10,000 BI software developers, 8,000 BI consultants, 200 BI mathematicians → IBM acquired I2/COPLINK in 2011
BI & Analytics: Definition and Components
• “BI and Analytics refers to: (1) the technologies, systems, practices and applications that (2) analyze critical business data to (3) help an enterprise better understand its business and market.”
• Core technologies: data warehousing; Extraction, Transformation, and Load (ETL); Business Performance Management (BPM); visual dashboards; enterprise text and multimedia search; data and text mining; social network analysis
• BI 2.0 research: web analytics, web 2.0, social media analytics, opinion mining; in-memory and real-time BI; cloud computing, data/web services; Hadoop, MapReduce; stream and mobile data mining
BI Industry and Capabilities (Gartner Report, 2011)
Magic Quadrant for BI Platforms (13 Capabilities)
• Integration (e.g., Microsoft, Oracle, SAP)
  ▪ BI (shared) infrastructure
  ▪ Metadata management
  ▪ Development tools, collaboration
• Information Delivery (e.g., SAP, Microsoft, IBM/Cognos)
  ▪ Reporting, dashboards
  ▪ Ad hoc query
  ▪ Microsoft Office integration
  ▪ Search-based BI (structured and unstructured)
• Analysis (e.g., IBM/SPSS, SAS)
  ▪ OLAP
  ▪ Interactive visualization
  ▪ Predictive modeling and data mining
  ▪ Scorecards
[Figure: Magic Quadrant for Business Intelligence Platforms]
[Figure: Hype Cycle for Business Intelligence, 2011]
BI Hype Cycle (Gartner Report, 2011)
• On the Rise
  ▪ Collaborative decision making
  ▪ Information semantic services
  ▪ Search-based data discovery tools
  ▪ Natural language question answering
• At the Peak
  ▪ Enterprise metadata repositories
  ▪ BI SaaS
  ▪ Visualization-based data discovery tools
  ▪ Mobile BI
  ▪ In-memory DBMS
• Sliding into the Trough
  ▪ Real-time decisioning
  ▪ Analytics, content analytics, in-memory analytics, text analytics
  ▪ Open-source BI tools
  ▪ Interactive visualization
BI Hype Cycle (Cont’d)
• Climbing the Slope
  ▪ BI consulting and system integration
  ▪ Business activity monitoring
  ▪ Column-based DBMS
  ▪ Dashboards, data quality tools
  ▪ Predictive analytics
  ▪ Excel as a BI front end
• Entering the Plateau
  ▪ BI platforms
  ▪ Data-mining workbenches
Sample BI Applications (AI Lab)
• Security informatics
  ▪ Securing cyber space, cyber security, predicting Arab Spring
  ▪ Information and system security, enterprise risk management
• Market intelligence
  ▪ Data/text/web mining, web 2.0, social media analytics
  ▪ Big data (volume/variety/velocity/mobility), Hadoop, Cloud apps
• Healthcare informatics
  ▪ Healthcare IT integration and solutions, decision support
  ▪ EHR data/text mining, patient empowerment and social media
(1) BI for Security: COPLINK
COPLINK Identity Resolution and Criminal Network Analysis
Cross-jurisdictional Information Sharing/Collaboration
[Figure: law-enforcement data (AZ, CA, TX) and border crossing data (AZ, CA, TX) on people and vehicles feed the Arizona IDMatcher and the CAN Visualizer. The figure has four panels:]
• Identity Resolution (ID, first/middle/last name, DOB, and address match and similarity features): Detect false and deceptive identities across jurisdictions using a probabilistic naïve-Bayes-based resolution system.
• High-risk Vehicle Identification (mutual information between Vehicle A and Vehicle B; frequent crossers at night, by time of day and date): Identify high-risk vehicles using association techniques like mutual information using border crossing and law enforcement data.
• Criminal Link Prediction (narcotics network): Predict interaction between individuals and vehicles using link prediction techniques to identify high-risk border crossers.
• Suspect Traffic Burst Detection: Detect real-time anomalies and threats in border traffic using Markov switching and other models.
* Only the grayed datasets are available to the AI Lab
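The mutual-information idea for high-risk vehicle identification can be sketched as follows. This is a minimal illustration, not the COPLINK implementation: it uses pointwise mutual information (one common variant) over vehicles that appear in the same time window, and the record schema `(vehicle_id, timestamp)`, the `window_key` function, and the toy data are all assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def vehicle_pmi(crossings, window_key):
    """Pointwise mutual information (in bits) for each co-crossing vehicle pair.

    crossings:  iterable of (vehicle_id, timestamp) records (hypothetical schema).
    window_key: function mapping a timestamp to a time-window label.
    """
    # Group vehicles by the time window in which they crossed.
    windows = {}
    for vehicle, ts in crossings:
        windows.setdefault(window_key(ts), set()).add(vehicle)

    n = len(windows)
    single = Counter()  # number of windows containing each vehicle
    pair = Counter()    # number of windows containing both vehicles
    for vehicles in windows.values():
        single.update(vehicles)
        pair.update(combinations(sorted(vehicles), 2))

    # PMI(a, b) = log2( p(a, b) / (p(a) * p(b)) )
    return {
        (a, b): math.log2((n_ab * n) / (single[a] * single[b]))
        for (a, b), n_ab in pair.items()
    }


# Toy data: vehicles A and B always cross in the same window; C crosses alone.
toy = [("A", 1), ("B", 1), ("A", 2), ("B", 2), ("C", 3), ("C", 4)]
scores = vehicle_pmi(toy, window_key=lambda ts: ts)
```

Pairs with high scores cross together more often than chance would predict; in practice the window size and a minimum co-occurrence count would need tuning.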
(2) BI for Market Intelligence (AZ BizIntel)
• Mass media, social media contents
• Text & social media analytics techniques
• Finance/accounting/marketing models (Tetlock/Columbia, Antweiler/UBC, Das/Santa Clara) → NYU (Dhar), Arizona (Dhaliwal, Kelly, Jiang, Lusch, Yong), National Taiwan U (Li, Hong, Lu)
• Bag of words, named entities, proper nouns, topics (1-, 2-, 3-grams)
• Sentiment/valence, lexicons, machine learning, stakeholder analysis, EFLS analysis
• Time series models, spike detection, decaying function, trading windows, targeted sentiment
• Econometrics/regression models (R-squared, p-value), 10-fold validation (F, accuracy), simulated trading (cost, frequency, exit)
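The bag-of-words features (1-, 2-, 3-grams) and lexicon-based sentiment listed above can be sketched as below. This is a minimal illustration under stated assumptions: the tiny `LEXICON`, the whitespace tokenizer, and the function names are hypothetical and are not the resources used in AZ BizIntel.

```python
from collections import Counter

# Toy sentiment lexicon: an assumption for illustration only.
LEXICON = {"beat": 1.0, "strong": 1.0, "miss": -1.0, "weak": -1.0, "lawsuit": -1.0}


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, max_n=3):
    """Count 1- to max_n-grams using a naive lowercase whitespace tokenizer."""
    tokens = text.lower().split()
    bag = Counter()
    for n in range(1, max_n + 1):
        bag.update(ngrams(tokens, n))
    return bag


def lexicon_sentiment(text):
    """Mean lexicon score over matched tokens; 0.0 if nothing matches."""
    hits = [LEXICON[t] for t in text.lower().split() if t in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


bag = bag_of_ngrams("strong earnings beat")
score = lexicon_sentiment("strong earnings beat")
```

A real pipeline would add proper tokenization, named-entity handling, and a finance-specific lexicon or a trained classifier; the sketch only shows the shape of the features.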
AZ BIZ INTEL System Design
[Figure: system design with the following labeled components]
• Data Sources for US Public Companies (predefined and dynamic): SEC/Edgar, NYSE.com, NASDAQ.com, Finance.Yahoo.com, Yahoo Finance forums, company websites, Twitter, stock exchange, WSJ, search engines, 10K reports, blogs, news
• Company Information Database: Ticker, CIK, CUSIP, Company Name, PERMNO; company keywords
• Data Collection; Data Processing (Transformation/Integration)
• Analytic Approaches: finance/econ models and metrics, topics & sentiments, time series/burst, SNA, risk model analysis
• Visualization: static figures/dashboards, interactive applications
• Other labels: basic information, single media analysis, cross media analysis, simulated trading, predicting markets
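The time series / burst component of the system design, together with the spike detection and decaying function mentioned earlier for market intelligence, might look like the sketch below. It is an assumption-laden illustration: the rolling z-score rule, the `min_sigma` floor, the half-life parameter, and the toy daily counts are not taken from the slides.

```python
from statistics import mean, pstdev


def detect_spikes(counts, window=7, threshold=3.0, min_sigma=1.0):
    """Indices whose count is more than `threshold` standard deviations
    above the mean of the preceding `window` observations."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        # Floor the deviation so a flat history does not divide by zero.
        sigma = max(pstdev(history), min_sigma)
        if (counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


def decayed_signal(values, half_life=3.0):
    """Exponentially decayed running sum: a value loses half its weight
    every `half_life` steps."""
    decay = 0.5 ** (1.0 / half_life)
    total, out = 0.0, []
    for v in values:
        total = total * decay + v
        out.append(total)
    return out


# Toy daily mention counts for one ticker (invented numbers).
daily_mentions = [10, 12, 11, 9, 10, 11, 10, 50, 11]
spike_days = detect_spikes(daily_mentions)
```

The decayed signal is one simple way to let an old sentiment or volume shock fade inside a trading window; other decay shapes would work equally well.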
(3) BI for Healthcare: AZ Smart Health
[Figure: AZ Smart Health overview with the following labeled components]
• Stages: Research, Development, Commercialization
• Data: NTU Hospital EHR; National Health Insurance Database; Targeted Data Subscriptions
• Health Cloud Infrastructure
• Healthcare Business Intelligence: Cost, Performance, Benchmarking, Research & Practical Implications
• Health Informatics System Development: Artificial Intelligence, Data Mining, Decision Support, Visualization
• Market Development: Software, Data, Analytics as a Service
• On Demand Health Analytics Services; Healthcare Business Consultations; Patient Social Media Platform
AZ Smart Health Research
Healthcare Decision Support
• Symptom-Disease-Treatment Extraction for Medical Knowledge Re-use
• Scenario-based Association Rule Mining and Result Validation for Effective Healthcare
• Outcome Assessment and Medication Compliance to Signify Quality of Care
• Temporal Episodes and Disease Progression Modeling for Better Patient Condition Assessment
• Patients-Like-You-and-Me EHR Search Interface to Accelerate Clinical Decision Making
Patient-centered Smart Health
• Personalized Healthcare for Chronic and Family Diseases Management
• Long Term Medication Effects to Improve New Drug Development
• Public Health Modeling and Monitoring for Government Agencies
• Patient Social Media to Empower Patients and Improve Self Care at Home
Healthcare Business Analytics
• Cost Modeling and Containment
• Improving Rate Calculation for the National Health Insurance
• Competency and Performance Benchmarking
• Quality-based Insurance Reimbursement
• Workflow Planning and Coordination for Inter- and Intra-Hospital Process
ARM in Medicine: Symptoms, Diseases, and Treatments
[Figure: association rule network linking symptoms to diseases to treatments. Edges are drawn in three confidence bands: 0.05 < Confidence <= 0.2; 0.2 < Confidence <= 0.5; 0.5 < Confidence. Edge confidences in the figure range from 0.0640 to 0.7615.]
• Symptoms: Hemoptysis (786.3); Other dyspnea and respiratory abnormalities (786.09)
• Diseases: Unspecified pulmonary tuberculosis, confirmation unspecified (011.90); Pneumonia (486); Malignant neoplasm of bronchus and lung, unspecified (162.9)
• Treatments: Terbutaline sulphate 5mg/2ml/vial (ETERBUS); Thoracentesis (34.91); Chest PA view (320011); Pyridoxine HCl Tablets 50mg (OVTB6); Computerized axial tomography of thorax (87.41); Injection or infusion of cancer chemotherapeutic substance (99.25); Direct smear by Gram Stain (130062); Aerobic Culture (13007)
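The confidence values in the symptom-disease-treatment figure follow the standard association rule definition, confidence(A → B) = support(A and B) / support(A). The sketch below computes it on toy visit records and maps a value to the figure's three bands; the records are invented, and only the codes and band boundaries come from the slide.

```python
def rule_confidence(records, antecedent, consequent):
    """confidence(A -> B) = (# records with A and B) / (# records with A)."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_a = [r for r in records if antecedent <= r]
    if not with_a:
        return 0.0
    with_both = [r for r in with_a if consequent <= r]
    return len(with_both) / len(with_a)


def confidence_band(conf):
    """Map a confidence value to the legend bands used in the figure."""
    if conf > 0.5:
        return "0.5 < Confidence"
    if conf > 0.2:
        return "0.2 < Confidence <= 0.5"
    if conf > 0.05:
        return "0.05 < Confidence <= 0.2"
    return None  # below the lowest band, so not drawn


# Toy visits, each a set of codes (invented data for illustration).
visits = [
    {"786.3", "011.90"},
    {"786.3", "486"},
    {"786.3", "011.90", "486"},
    {"786.09", "486"},
]
hemoptysis_to_tb = rule_confidence(visits, {"786.3"}, {"011.90"})
```

Here `hemoptysis_to_tb` is the share of hemoptysis visits that also carry the tuberculosis code in the toy data; real rule mining would also apply a minimum support threshold before reporting a rule.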
Patient Statistics: Breast Cancer
[Figure: three bar charts]
• Patient Genders: M, F (bar values: 1092, 0)
• Patient Age Groups: 15 to 24, 25 to 44, 45 to 64, > 65 (bar values, unordered: 618, 318, 150, 6)
• Frequent Cooccurred Diagnosis (top 10; bar values, unordered: 335, 179, 169, 152, 146, 146, 125, 103, 99, 86):
  ▪ Secondary malignant neoplasm of bone and…
  ▪ Malignant neoplasm of female breast, upper-…
  ▪ Diabetes mellitus without mention of…
  ▪ Secondary malignant neoplasm of lung
  ▪ Secondary malignant neoplasm of liver
  ▪ Malignant neoplasm of other specified sites of…
  ▪ Essential hypertension, unspecified
  ▪ Malignant neoplasm of female breast, upper-…
  ▪ Secondary and unspecified malignant neoplasm…
  ▪ Benign neoplasm of breast
Consistency of Top Treatment Orders
Top 20 treatments from aggregated population
1 Exemestane (Aromasin) (諾曼癌素)
2 Her-2/neu 螢光原位雜交法 (Her-2/neu FISH)
3 Trastuzumab (Herceptin) (賀癌平)
4 Anastrozole (Anazo) (安納柔)
5 Zoledronic acid (Zometa) (卓古祂)
6 Pegylated liposomal doxorubicin (Caelyx) (康利斯微脂利)
7 Radical mastectomy-unilateral (乳癌根除術-單側)
8 Tamoxifen citrate (得適)
9 Docetaxel (Taxotere) (剋癌易)
10 Cyclophosphamide (Endoxan-Asta) (癌得星)
11 Vinorelbine (Navelbine) (溫諾平)
12 Docetaxel (Taxotere) (剋癌易)
13 Epirubicin HCl (Pharmorubicin RD) (泛艾黴素)
14 Epirubicin (Pharmorubicin) ("速溶"泛艾黴素)
15 CA-153 tumor marker (CA-153 腫瘤標記)
16 Epirubicin (Pharmorubicin) ("速溶"泛艾黴素)
17 Methotrexate sodium inj (Amethopterin) (滅殺除癌)
18 Dissection of axillary lymphatics (腋窩淋巴腺清除術)
19 Breast tumor biopsy (乳房腫瘤組織檢查切片術)
20 Intravenous chemotherapy 4-8 hours (靜脈化學藥物注射4-8小時)
[Table: check marks (V) for each of the 20 treatments under the columns Physician (M1130, M1529, M1540, M1585), Department (03, BD), Age Group (4, 5, 6, 7), and Cooccurred Diagnosis (196.3, 198.5); individual marks omitted.]
Department 03: General Surgery; Department BD: Gastrointestinal surgery
Age group 4: 15 to 24; Age group 5: 25 to 44; Age group 6: 45 to 64; Age group 7: > 65
Cooccurred Diagnosis 196.3: Secondary and unspecified malignant neoplasm of lymph nodes;
Cooccurred Diagnosis 198.5: Secondary malignant neoplasm of bone and bone marrow
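One plausible way to build a consistency table like this is to take the top-k treatments over the whole population and mark, for each subgroup, whether the treatment is also in that subgroup's own top-k. The sketch below is an assumption about the procedure, not the authors' stated method; the order lists are invented, and only the physician IDs and drug names echo the slide.

```python
from collections import Counter


def top_k(orders, k):
    """The k most frequent treatment names in a list of orders."""
    return [name for name, _ in Counter(orders).most_common(k)]


def consistency_table(orders_by_group, k=20):
    """For each overall top-k treatment, record whether it is also in
    each group's own top-k (True corresponds to a 'V' mark)."""
    all_orders = [o for orders in orders_by_group.values() for o in orders]
    overall = top_k(all_orders, k)
    group_tops = {g: set(top_k(o, k)) for g, o in orders_by_group.items()}
    return {t: {g: t in tops for g, tops in group_tops.items()} for t in overall}


# Toy orders per physician (invented counts for illustration).
toy_orders = {
    "M1130": ["Caelyx", "Caelyx", "Herceptin"],
    "M1529": ["Herceptin", "Herceptin", "Zometa"],
}
table = consistency_table(toy_orders, k=2)
```

The same function works for departments, age groups, or co-occurring diagnoses by changing the grouping key of `orders_by_group`.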
Treatment Comparison Among Different Physicians
DOCTOR_NO=M1130 (top 20 treatments)
1 Caelyx 20mg/10ml/vial (康利斯微脂利)
2 Aromasin S.C. Tablets 25mg (諾曼癌素)
3 Navelbine 10mg/1ml/vial (溫諾平)
4 Intravenous chemotherapy <1 hours (靜脈化學藥物注射)
5 Herceptin 440mg/20ml/vial (賀癌平)
6 Zometa Powder For Solution For Infusion 4mg/vial (卓古祂)
7 FORMOXOL 30mg/5ml/vial (伏摩素)
8 CA-153 tumor marker (CA-153 腫瘤標記)
9 Abitrexate 50mg/2ml/vial (必除癌)
10 Taxotere 80mg/2ml/vial (剋癌易)
11 Taxotere 20mg/0.5ml/vial (剋癌易)
12 Endoxan-Asta Injection 200mg/vial (癌得星)
13 Pharmorubicin Rapid Dissolution 10mg ("速溶"泛艾黴素)
14 Pharmorubicin RD 50mg/vial (泛艾黴素)
15 Radical mastectomy-unilateral (乳癌根除術-單側)
16 Pharmorubicin 10mg/vial ("速溶"泛艾黴素)
17 Gemzar 200mg/vial (健擇)
18 Intravenous chemotherapy 4-8 hours (靜脈化學藥物注射)
19 Intravenous chemotherapy 1-4 hours (靜脈化學藥物注射)
20 Neurotin Tablets 600mg (鎮頑癲)
DOCTOR_NO=M1529 and DOCTOR_NO=M1540 (top 20 treatments each; entries in extracted order, column assignment not shown)
Zometa Powder For Solution For Infusion 4mg/vial (卓古祂)
Anazo F.C. Tablets (安納柔)
Methotrexate Inj 50mg/2ml (滅殺除癌)
Gemzar 200mg/vial (健擇)
Zometa Powder For Solution For Infusion 4mg/vial (卓古祂)
Taxotere 20mg/0.5ml/vial (剋癌易)
FORMOXOL 30mg/5ml/vial (伏摩素)
Aromasin S.C. Tablets 25mg (諾曼癌素)
Herceptin 440mg/20ml/vial (賀癌平)
Navelbine 10mg/1ml/vial (溫諾平)
Endoxan-Asta Injection 200mg/vial (癌得星)
Caelyx 20mg/10ml/vial (康利斯微脂利)
Taxotere 20mg/0.5ml/vial (剋癌易)
Taxotere 80mg/2ml/vial (剋癌易)
CA-153 tumor marker (CA-153 腫瘤標記)
Radical mastectomy-unilateral (乳癌根除術-單側)
Intravenous chemotherapy 1-4 hours (靜脈化學藥物注射)
Intravenous chemotherapy 4-8 hours (靜脈化學藥物注射)
Taxotere 80mg/2ml/vial (剋癌易)
Herceptin 440mg/20ml/vial (賀癌平)
Radical mastectomy-unilateral (乳癌根除術-單側)
Granocyte 100ug/vial (顆球諾得)
Sentinel lymphadenectomy (腋窩淋巴腺清除術)
CA-153 tumor marker (CA-153 腫瘤標記)
Endoxan-Asta Injection 200mg/vial (癌得星)
Pharmorubicin Rapid Dissolution 10mg ("速溶"泛艾黴素)
Whole body bone scan (全身骨骼掃描)
Pharmorubicin 10mg/vial ("速溶"泛艾黴素)
Simulation procedure (模擬定位攝影)
Pharmorubicin RD 50mg/vial (泛艾黴素)
Pharmorubicin RD 50mg/vial (泛艾黴素)
Breast tumor biopsy examination (乳房腫瘤組織檢查切片術)
Intravenous chemotherapy 4-8 hours (靜脈化學藥物注射)
Intravenous chemotherapy 1-4 hours (靜脈化學藥物注射)
Pharmorubicin 10mg/vial ("速溶"泛艾黴素)
Sodium chloride injection (氯化鈉注射液)
Vascular exploration (血管探查)
Fixed mold-large (固定模具之設計及製作-大)
Rasitol Tablets 40mg (Furosemide) (來喜妥)
Emetrol Tablets 10mg (Domperidone) (愈吐寧)
Treatment Comparison Among Different Patient Age Groups
Top 20 treatments by age group (columns: Age group=5, Age group=6, Age group=7)
Caelyx 20mg/10ml/vial (康利斯微脂利)
Anazo F.C. Tablets (安納柔)
Her-2/neu 螢光原位雜交法 (Her-2/neu FISH)
Herceptin 440mg/20ml/vial (賀癌平)
Aromasin S.C. Tablets 25mg (諾曼癌素)
Abitrexate 50mg/2ml/vial (必除癌)
Zometa Powder For Solution For Infusion 4mg/vial (卓古祂)
Herceptin 440mg/20ml/vial (賀癌平)
Radical mastectomy-unilateral (乳癌根除術- 單側)
Pharmorubicin 10mg/vial ( "速溶"泛艾黴素)
Zometa Powder For Solution For Infusion 4mg/vial (卓古祂)
Herceptin 440mg/20ml/vial (賀癌平)
Taxotere 80mg/2ml/vial (剋癌易)
Navelbine 10mg/1ml/vial (溫諾平)
Sentinel lymphadenectomy (腋窩淋巴腺清除術)
Taxotere 20mg/0.5ml/vial (剋癌易)
Caelyx 20mg/10ml/vial (康利斯微脂利)
Tadex 10mg/tab (得適)
Navelbine 10mg/1ml/vial (溫諾平)
Zometa Powder For Solution For Infusion 4mg/vial (卓古祂)
Radical mastectomy-unilateral (乳癌根除術- 單側)
Pharmorubicin RD 50mg/vial (泛艾黴素)
Tadex 10mg/tab (得適)
Caelyx 20mg/10ml/vial (康利斯微脂利)
Endoxan-Asta Injection 200mg/vial (癌得星)
Taxotere 80mg/2ml/vial (剋癌易)
Endoxan-Asta Injection 200mg/vial(癌得星)
Tadex 10mg/tab (得適)
Endoxan-Asta Injection 200mg/vial(癌得星)
CA-153 tumor marker (CA-153 腫瘤標記)
Xeloda Tablets 500mg (結瘤達)
Pharmorubicin Rapid Dissolution 10mg ("速溶"泛艾黴素)
Breast tumor biopsy (乳房腫瘤組織檢查切片術)
Taxotere 20mg/0.5ml/vial (剋癌易)
Radical mastectomy-unilateral (乳癌根除術- 單側)
Partial mastectomy-unilateral (部份乳癌根除術- 單側)
Pharmorubicin Rapid Dissolution 10mg ("速溶"泛艾黴素)
Granocyte 100ug/vial (顆球諾得)
CA-153 tumor marker (CA-153 腫瘤標記)
Pharmorubicin RD 50mg/vial (泛艾黴素)
Taxotere 20mg/0.5ml/vial (剋癌易)
CA-153 tumor marker (CA-153 腫瘤標記)
Pharmorubicin RD 50mg/vial (泛艾黴素)
Intravenous chemotherapy 4-8 hours (靜脈化學藥物注射)
Pharmorubicin 10mg/vial ("速溶"泛艾黴素)
Granocyte 100ug/vial (顆球諾得)
Abitrexate 50mg/2ml/vial (必除癌)
Taxotere 80mg/2ml/vial (剋癌易)
Methotrexate Inj 50mg/2ml (滅殺除癌)
Pharmorubicin 10mg/vial ( "速溶"泛艾黴素)
Sentinel lymphadenectomy (腋窩淋巴腺清除術)
Intravenous chemotherapy 1-4 hours (靜脈化學藥物注射)
Pharmorubicin Rapid Dissolution 10mg ("速溶"泛艾黴素)
Gemzar 200mg/vial (健擇)
Intravenous chemotherapy 1-4 hours (靜脈化學藥物注射)
Compensator design and production (補償器之設計及製作)
FORMOXOL 30mg/5ml/vial (伏摩素)
• Anazo F.C. Tablets (安納柔) is a treatment for advanced breast cancer in postmenopausal women (advanced age).
• Abitrexate (必除癌) falls in an FDA pregnancy risk category for drugs shown to cause fetal risks and abnormalities. It is therefore less likely to be prescribed for patients in the younger age group 5 (i.e., ages 25 to 44).
Cancer Community Mapping: Text Mining & Visualization for Documents and Patient Forums
[Figure: cancer community map with callouts: Red Blood Cell and Lymph Nodes subtopics; Meningeal Neoplasms and Brain Diseases subtopics; a Brain Neoplasms article about toddlers; breast cancer patient forum messages]
BI & Analytics Research Opportunities and Challenges
• Opportunities: BIG DATA → BIG COMPUTATION → BIG ANALYTICS → BIG (SOCIETAL) IMPACTS (NAE Grand Challenges: security, healthcare)
• Challenges: data deluge (TB/PB) → data variety (numbers, text, multilingual, multimedia) → data velocity (mobile, streaming) → data organization & access (DBMS, Hadoop, IR, image, mobile) → data analytics (statistical analysis, data/text/web mining)
Training the New “Data Scientists”: Core Knowledge
• B-School (Management Information Systems): economics/finance/accounting/marketing, statistical analysis/modeling, organizational/behavioral → business knowledge; statistics
• C-School (Computer Science): programming language, data structure & algorithm, database management system, artificial intelligence, networking, data mining, web computing & mining → computational techniques
• I-School (Information/Library Science): information organization, information retrieval, information visualization, NLP, text mining, HCI → information processing
