Talking About LOD! - Just Do Linked Data

2014.6.27
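Wooju Kim
Department of Information and Industrial Engineering, Yonsei University
Contents
I. The Big Data Era and the Flood of Information
II. Big Data Use Cases
III. Limitations of Big Data and How to Overcome Them
IV. Building and Using Linked Data
V. LOD 2 - The Future of Semantic Technology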
An Instrumented Interconnected World
 12+ TBs of tweet data every day
 25+ TBs of log data every day
 ? TBs of data every day
 2+ billion people on the Web by end 2011
 30 billion RFID tags today (1.3B in 2005)
 4.6 billion camera phones world wide
 100s of millions of GPS enabled devices sold annually
 76 million smart meters in 2009… 200M by 2014
Information Overflow on the Web
 Growth of the Web
 The amount of information available on the Web is growing rapidly.
 The February 2014 Netcraft survey found at least 920,120,079 sites
(http://news.netcraft.com/archives/category/web-server-survey/).
Information Overflow on the Web
 The Indexed Web contains at least 19.8 billion pages (Sunday, 02 March,
2014).
 http://www.worldwidewebsize.com/
What Is Big Data?
 What is big data? (European Commission, 07/11/2013)
 Every minute the world generates 1.7 million billion bytes of data,
equivalent to 360,000 standard DVDs.
 The big data sector is growing at a rate of 40% a year.
 What makes big data important?
 Big data is already affecting all areas of the economy.
 Data-driven decision making leads to 5-6% efficiency gains in the
different sectors observed.
 Intelligent processing of data is also essential for addressing societal
challenges.
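The DVD equivalence quoted above can be sanity-checked with a few lines of arithmetic. This is only a rough check: the 4.7 GB single-layer capacity is an assumption, since the slide does not say what a "standard DVD" holds.

```python
# Figure quoted on the slide (European Commission, 2013).
BYTES_PER_MINUTE = 1.7e15  # "1.7 million billion bytes"

# Assumption: a "standard DVD" is a 4.7 GB (decimal) single-layer disc.
DVD_CAPACITY_BYTES = 4.7e9

dvds_per_minute = BYTES_PER_MINUTE / DVD_CAPACITY_BYTES
print(f"{dvds_per_minute:,.0f} DVDs per minute")
```

Under that assumption the result is a little over 360,000 discs per minute, in line with the quoted figure.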
IBM's Predictions: Six Big Data Trends for 2014
 Management that is more analytical than intuitive
 Companies will grow increasingly data driven and willing to apply analytics-derived insights to key business operations.
 Big data privacy and security issues
 Organizations will make a greater effort to build security, privacy, and
governance policies into their big data processes.
 Increased investment in big data
 The rise of the CDO (Chief Data Officer)
 More organizations will bring a chief data officer (CDO) on board.
 More useful big data applications
 Growing interest in external data
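Google Flu Trends
 Showed that flu can be forecast by analyzing flu-related search queries
 Tracks the frequency of queries that users enter on the Google search site to provide information on the distribution and spread of flu cases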
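San Francisco: Crime Prevention System
 Police deployment based on analysis of where and when past crimes occurred
 Predicts the likelihood of follow-on crimes by analyzing patterns of past crimes
 Analyzes offender behavior in historical data to suggest ways of preventing incidents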
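US Internal Revenue Service: Tax Evasion Prevention System
 Built a system that uses big data analysis to prevent tax evasion and fraud
 Uses fraud-prevention solutions, social network analysis, data integration, and data mining
 Reduces missed tax revenue and unnecessary tax refunds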
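KT and Seoul Metropolitan Government – Big Data-Based
Policy Support for Late-Night Bus Routes
 Floating-population analysis and route analysis for deciding late-night bus routes
 Derives the optimal stop locations for each area from Seoul's traffic environment (stops, bus-only lanes, transfers)
and validates the routes by combining them with late-night floating-population and destination statistics
based on KT's CDR data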
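BC Card: Store Evaluation Service
 Big data analysis based on commercial property data and credit card transaction data,
aimed at raising the start-up success rate of small business owners
 Provides analysis of past conditions (store history, trade-area analysis, business-type recommendation)
along with sales and profit forecasts for recommended or user-selected business types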
Information Overflow Problems
 Problems
 How to cover all available information? - Recall
 How to find the relevant information? - Precision
Not data (search), but integration, analysis and
insight, leading to decisions and discovery
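The two measures named above can be made concrete with a short sketch. It assumes the usual definitions (precision is the share of retrieved items that are relevant, recall is the share of relevant items that were retrieved); the document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
retrieved = ["d1", "d2", "d3", "d4"]          # what the search engine returned
relevant = ["d1", "d3", "d5", "d6", "d7"]     # what the user actually needed

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four results are relevant (precision 0.5), but only two of the five relevant documents were found (recall 0.4).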
Example Query to Google
 Example: searching for ‘iPad’
Information Silo Problem
 Stove-piped Systems and Poor Content Aggregation
Semantic Interoperability
 To cope with the problems mentioned in the preceding slide,
we need Semantic Interoperability.
 Semantics
 “The meaning or the interpretation of a word, sentence, or other
language form.”
 What is Semantic Interoperability?
 “Processing or Integration of resources based on the understanding
of what’s intended or expressed by other systems or parties.”
Front-endedness?
What if I want to ...
 Move my content from one place to another?
 RSS? Not enough
 Aggregate my data
 An open FriendFeed?
 Re-use my Flickr friends on Twitter?
 Invite. Again and again ...
 The Semantic Web and Ontology can help!
 By providing a common framework to interlink data from various
providers in an open way.
How is it Possible?
 Ontology: Agreement with Common Vocabulary & Domain
Knowledge
 Semantic Annotation: metadata (manual & automatic
metadata extraction)
 Reasoning: semantics enabled search, integration, analysis,
mining, discovery
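The three ingredients above can be illustrated with a toy example: a small ontology (a subclass hierarchy), semantic annotations on pages, and a reasoning step that lets a search for a class also match its subclasses. This is a minimal sketch; the class names and page names are hypothetical.

```python
# Ontology: a shared vocabulary with subclass relations (child -> parent).
SUBCLASS_OF = {
    "Politician": "Person",
    "President": "Politician",
    "City": "Place",
}

# Semantic annotation: metadata attached to each page (hypothetical pages).
ANNOTATIONS = {
    "page1.html": "President",
    "page2.html": "City",
    "page3.html": "Person",
}


def is_a(cls, ancestor):
    """Reasoning: follow subclass links upward until the ancestor or the root."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def semantic_search(query_class):
    """Return pages annotated with the query class or any of its subclasses."""
    return sorted(p for p, c in ANNOTATIONS.items() if is_a(c, query_class))


print(semantic_search("Person"))
```

A keyword search for "Person" would miss the page annotated as President; the subclass reasoning is what brings it into the result.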
Semantic Web Layer Cake
Three Technical Building Blocks
 Basic Building Blocks
 URIs for unambiguous names for resources,
 RDF for common data model for expressing metadata,
 Ontology(OWL) for common vocabularies.
 Semantic Web becomes:
 web of data/things/concepts
• What is a Thing/Concept? It can be anything in the world - a movie, a
person, a disease, a location…
• Machines will be able to understand the concept behind an HTML page.
• For example: this page is about ‘Barack Obama’; he is a ‘Person’, and he is the
‘President of the USA’.
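As a rough illustration of how URIs and RDF combine, the statements about Barack Obama can be written as subject-predicate-object triples. The sketch below uses plain Python tuples rather than an RDF library. `rdf:type` and `foaf:Person` are standard terms and the subject is the DBpedia resource URI, while the `example.org` property and office URIs are hypothetical names invented for this example.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
OBAMA = "http://dbpedia.org/resource/Barack_Obama"

# Hypothetical vocabulary terms, used only for this illustration.
EX_HOLDS_OFFICE = "http://example.org/vocab#holdsOffice"
EX_PRESIDENT = "http://example.org/office/President_of_the_USA"

# Each statement is one (subject, predicate, object) triple of URIs.
triples = [
    (OBAMA, RDF_TYPE, FOAF_PERSON),         # "He is a Person"
    (OBAMA, EX_HOLDS_OFFICE, EX_PRESIDENT),  # "He is the President of the USA"
]


def to_ntriples(triples):
    """Serialize URI-only triples in N-Triples syntax."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


print(to_ntriples(triples))
```

Because every name is a URI, another dataset that uses the same URI for Barack Obama is automatically talking about the same thing.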
Who borrows this Idea?
 Facebook
 Facebook Open Graph Protocol and Graph Search
 Google
 Knowledge Graph
 Twitter
 Real-time Semantic Web with Twitter Annotations
Linked Data
 Building a “Web of Data” to enhance the current Web
 The Linking Open Data (LOD) project:
 http://linkeddata.org/
 Translating existing datasets into RDF and linking them together.
• For example, DBpedia (Wikipedia) and GeoNames, Freebase, BBC
programmes, etc.
 Government data is also available as Linked Data
• DATA.gov
• DATA.gov.uk
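The two steps named above, translating an existing dataset into RDF and linking it to other datasets, can be sketched in a few lines. This is an illustration only: the `http://example.org/city/` namespace and the rows are hypothetical, `rdfs:label` and `owl:sameAs` are standard vocabulary terms, and the DBpedia URIs are assumed to follow the `http://dbpedia.org/resource/<Name>` pattern.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
BASE = "http://example.org/city/"            # hypothetical local namespace
DBPEDIA = "http://dbpedia.org/resource/"     # assumed DBpedia URI pattern

# An existing tabular dataset (hypothetical rows).
rows = [
    {"id": "1", "name": "Seoul"},
    {"id": "2", "name": "Busan"},
]


def row_to_ntriples(row):
    """Translate one row into RDF and link it to the matching DBpedia resource."""
    subject = f"<{BASE}{row['id']}>"
    return [
        f'{subject} <{RDFS_LABEL}> "{row["name"]}" .',            # the data itself
        f"{subject} <{OWL_SAME_AS}> <{DBPEDIA}{row['name']}> .",  # the link
    ]


lines = [line for row in rows for line in row_to_ntriples(row)]
print("\n".join(lines))
```

The `owl:sameAs` statements are what turn an isolated RDF dump into Linked Data: they let a client move from the local record to everything DBpedia says about the same city.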
The LOD cloud
[Figures: LOD cloud diagrams for 2007, 2008, and 2009]
Web of Data
Web of Data (Statistics)
 The size of the Web of Data
 The size of the Web of Data can be estimated based on the data set
statistics that are collected by the LOD community in the ESW wiki.
 According to these statistics, the Web of Data currently consists of
31 billion RDF triples, which are interlinked by around 500 million
RDF inter-links (09/19/2011).
Types of Linked Data Applications
 Ways to use Linked Data
Semantic Search Engines
 Top 7 Semantic Search Engines as an Alternative to Google
 Kngine
 Hakia
 Kosmix: now part of @WalmartLabs
 DuckDuckGo
 Evri: specialized for iPad and iPhone
 Powerset: now part of Bing
 Truevert: focuses only on environmental concerns
LOD2 : What is LOD2?
 LOD2(Linked Open Data)
 LOD2 is the large-scale integrating project co-funded by the
European Commission within the FP7 Information and
Communication Technologies Work Programme.
• Started in September 2010
 Partners
• 14 partners (from 11 European countries)
LOD2 : Objectives of LOD2
 LOD2 Project Objectives
 Achieve visualization, deployment, sharing, and accessibility for Linked
Open Data through software technology.
• Increase the visibility of Linked Data activities [Visualization]
• Support the deployment of Linked Data components [Deployment]
• Improve information sharing between Linked Data components so that
publishing Linked Data becomes easier [Sharing]
• Improve access to the content, the online Linked Open Data [Accessibility]
• Improve the software technology that supports it [By software
technology]
LOD2 Stack : Overview
 LOD2 Stack
 The LOD2 project provides the LOD2 Stack to give easy access to Linked Data software.
 The LOD2 Stack is an integrated distribution of aligned tools that support the
Linked Data life cycle, from extraction and authoring/creation through enrichment,
interlinking, and fusing to visualization and maintenance.
LOD2 Stack 3.0
LOD2 Stack : The overview of tools
 Apache Stanbol
 In the LOD2 Stack, Apache Stanbol can be used for NLP services
which rely on the stack internal knowledge bases, such as named
entity recognition and text classification.
 CubeViz
 CubeViz is a facetted browser for statistical data utilizing the RDF
Data Cube vocabulary which is the state-of-the-art in representing
statistical data in RDF.
LOD2 Stack : The overview of tools
 DBpedia Spotlight
 DBpedia Spotlight is a tool for automatically annotating mentions
of DBpedia resources in text, providing a solution for linking
unstructured information sources to the Linked Open Data cloud
through DBpedia.
 D2RQ
 D2RQ is a system for accessing relational databases (RDBMS) as
virtual RDF graphs.
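The idea of a virtual RDF graph, as described for D2RQ above, is that triples are produced from relational rows on demand rather than being stored separately. The sketch below illustrates that idea with Python's built-in `sqlite3` module; it does not use D2RQ or its mapping language, and the table, namespace, and property names are hypothetical.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace

# A small relational database standing in for an existing RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(1, "Alice", "Seoul"), (2, "Bob", "Busan")],
)


def virtual_triples(conn):
    """Generate triples from rows at query time; nothing is materialized."""
    query = "SELECT id, name, city FROM person ORDER BY id"
    for pid, name, city in conn.execute(query):
        subject = f"{BASE}person/{pid}"           # primary key -> resource URI
        yield (subject, f"{BASE}vocab#name", name)  # column -> property
        yield (subject, f"{BASE}vocab#city", city)


triples = list(virtual_triples(conn))
for triple in triples:
    print(triple)
```

Each row becomes a resource, each column a property, and any later change to the table is visible the next time the generator runs.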
LOD2 Stack : The overview of tools
 DL-Learner
 The DL-Learner software learns concepts in Description Logics
(DLs) from user-provided examples. (Supervised-learning)
 ORE
 The ORE (Ontology Repair and Enrichment) tool allows
knowledge engineers to improve an OWL ontology by fixing
inconsistencies and suggesting further axioms to add
to it.
LOD2 Stack : The overview of tools
 PoolParty
 The PoolParty Extractor (PPX) offers an API providing text mining
algorithms based on semantic knowledge models.
LOD2 Stack : The overview of tools
 SemMap
 SemMap visualizes knowledge bases that have a spatial
dimension.
 Silk
 The Silk Link Discovery Framework supports data publishers in
setting RDF links from their data sources to other data sources on
the Web. Using the declarative Silk Link Specification Language
(Silk-LSL), developers can specify which types of RDF links should
be discovered between data sources, as well as which conditions
data items must fulfill in order to be interlinked.
LOD2 Stack : The overview of tools
 Sieve
 Sieve allows Web data to be filtered according to different data
quality assessment policies and provides for fusing Web data
according to different conflict resolution methods.
 LIMES
 LIMES is a link discovery framework for the Web of Data. It
implements time-efficient approaches for large-scale link
discovery based on the characteristics of metric spaces.
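The metric-space idea behind LIMES can be sketched as follows. A metric satisfies the triangle inequality, so the distances from two items to a shared reference point (an exemplar) give a lower bound on their distance to each other: |d(s, e) - d(t, e)| <= d(s, t). Pairs whose lower bound already exceeds the threshold can be skipped without computing d(s, t). This is a simplified illustration of that principle, not LIMES's actual implementation; the function names and data are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (a metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def discover_links(source, target, threshold, exemplar):
    """Link pairs within the threshold, pruning with the triangle inequality."""
    d_src = {s: levenshtein(s, exemplar) for s in source}
    d_tgt = {t: levenshtein(t, exemplar) for t in target}
    links, skipped = [], 0
    for s in source:
        for t in target:
            # Lower bound on d(s, t); if it exceeds the threshold, skip the pair.
            if abs(d_src[s] - d_tgt[t]) > threshold:
                skipped += 1
                continue
            if levenshtein(s, t) <= threshold:
                links.append((s, t))
    return links, skipped


# Hypothetical labels from two data sources.
source = ["Seoul", "Busan", "Incheon"]
target = ["Seoul", "Pusan", "Incheon Metropolitan City"]
links, skipped = discover_links(source, target, threshold=1, exemplar="Seoul")
print(links, "comparisons skipped:", skipped)
```

The pruning never discards a true match, because a pair within the threshold always has a lower bound within the threshold as well; it only saves the cost of full comparisons for pairs that cannot match.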
Silk : Link Discovery Framework
 Interlinking and Fusion Stage Component of LOD2 Stack
 Can be used by data providers to generate RDF links between data
sets on the web of data
• Especially, to set explicit RDF links between data items within different
data sources
 “Data publishers can use Silk to set RDF links from their data
sources to other data sources on the Web”
Silk : Silk – Link Specification Language Example
 Aggregation Example:
 Combines multiple confidence values into a single value (average)
 The confidence value is the average of the two compared values
 Each comparison is based on the numeric difference between parameters
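The aggregation step described above can be sketched in Python. This is not Silk-LSL syntax: it only illustrates the idea of computing one confidence value per comparison and averaging them. The similarity functions, the `max_difference` parameter, and the city records (including the population figures) are hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def string_confidence(a, b):
    """Similarity of two labels in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def numeric_confidence(x, y, max_difference):
    """1.0 for equal values, falling linearly to 0.0 at max_difference."""
    return max(0.0, 1.0 - abs(x - y) / max_difference)


def average_confidence(a, b, max_difference=100_000):
    """Aggregate the individual comparison results into one value (average)."""
    scores = [
        string_confidence(a["label"], b["label"]),
        numeric_confidence(a["population"], b["population"], max_difference),
    ]
    return sum(scores) / len(scores)


# Hypothetical records from two data sources (figures are made up).
city_a = {"label": "Seoul", "population": 10_000_000}
city_b = {"label": "seoul", "population": 10_050_000}
city_c = {"label": "Busan", "population": 3_500_000}

print(average_confidence(city_a, city_b))
print(average_confidence(city_a, city_c))
```

A link would then be generated only for pairs whose aggregated confidence exceeds a chosen threshold, so `city_a` and `city_b` are far better candidates than `city_a` and `city_c`.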
DL-Learner
 Introduction
 The goal of DL-Learner is to provide a DL/OWL based machine
learning tool to solve supervised learning tasks.
 The DL-Learner software learns concepts in Description Logics
(DLs) from examples.
DL-Learner : Features
 Learning Problems
 Positive and Negative Examples (=previous example)
 Class Learning
• Find a class expression for a given class
• $\text{father} \equiv \exists\,\text{hasChild}.(\text{male} \sqcup \text{female}) \sqcap \lnot\,\text{female}$
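Learning a class expression from positive and negative examples can be illustrated with a toy search. The sketch below enumerates a handful of candidate expressions and keeps the one that best separates the examples. It is not DL-Learner's algorithm, which explores a far larger space of expressions; the individuals, relations, and candidate set are hypothetical.

```python
from itertools import combinations

# Hypothetical knowledge base: class membership and hasChild(parent, child).
CLASSES = {
    "male": {"stefan", "markus", "martin", "heinz"},
    "female": {"anna", "michelle"},
}
HAS_CHILD = {("stefan", "markus"), ("markus", "anna"),
             ("martin", "heinz"), ("anna", "heinz")}

# Supervised input: who is a father and who is not.
positives = {"stefan", "markus", "martin"}
negatives = {"heinz", "anna", "michelle"}


def atoms():
    """Candidate building blocks: named classes and 'has some child'."""
    yield "male", CLASSES["male"]
    yield "female", CLASSES["female"]
    yield "∃hasChild.⊤", {parent for parent, _ in HAS_CHILD}


def candidates():
    """Atoms first, then pairwise conjunctions (set intersection)."""
    basic = list(atoms())
    yield from basic
    for (n1, e1), (n2, e2) in combinations(basic, 2):
        yield f"{n1} ⊓ {n2}", e1 & e2


def accuracy(extension):
    """Share of examples classified correctly by a candidate's extension."""
    covered = len(positives & extension)
    rejected = len(negatives - extension)
    return (covered + rejected) / (len(positives) + len(negatives))


best_name, best_extension = max(candidates(), key=lambda c: accuracy(c[1]))
print(best_name, accuracy(best_extension))
```

On this data the only candidate that covers every positive example and no negative one is the conjunction of male and "has some child".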
Demo of SDT Plug-in to Protégé
SWCL - Sample Example
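[Figure: a Country has a PopulationValue (positiveInteger) and is linked by hasPart to Province, which also has a PopulationValue (positiveInteger)]
$\sum_{x \in \mathit{hasPart}^{-}.y} x.\mathit{PopulationValue} = y.\mathit{PopulationValue}$, for all $y \in \mathit{Country}$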

Constraints Representation in SWCL
 Target Constraint
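 $\sum_{x \in \mathit{hasPart}^{-}.y} x.\mathit{PopulationValue} = y.\mathit{PopulationValue}$, for all $y \in \mathit{Country}$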
 Corresponding SWCL Code
<swcl:Constraint rdf:ID="numberOfPopulation">
  <!-- Qualifier: the constraint holds for every y in Country -->
  <swcl:qualifier>
    <swcl:Variable rdf:ID="y">
      <swcl:bindingClass rdf:resource="#Country"/>
    </swcl:Variable>
  </swcl:qualifier>
  <!-- Left-hand side: a sum (Sigma) over every x that is partOf y -->
  <swcl:hasLHS>
    <swcl:TermBlock rdf:ID="termBlock_1">
      <swcl:sign rdf:resource="&swcl;plus"/>
      <swcl:aggregateOperator rdf:resource="&swcl;Sigma"/>
      <swcl:parameter>
        <swcl:Variable rdf:ID="x">
          <rdfs:subClassOf>
            <owl:Restriction>
              <owl:onProperty rdf:resource="#partOf"/>
              <owl:hasValue rdf:resource="#y"/>
            </owl:Restriction>
          </rdfs:subClassOf>
        </swcl:Variable>
      </swcl:parameter>
      <!-- Factor: the term being summed, bound to x -->
      <swcl:factor>
        <swcl:FactorAtom>
          <swcl:bindingClass rdf:resource="#x"/>
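To make the meaning of the target constraint concrete, the following sketch checks it over a small in-memory dataset: for every country, the population values of its parts must add up to the country's own population value. It is plain Python rather than SWCL, and the country names, province names, and figures are made up for illustration.

```python
# Hypothetical data: each province is part of exactly one country.
PART_OF = {
    "ProvinceA1": "CountryA",
    "ProvinceA2": "CountryA",
    "ProvinceB1": "CountryB",
}
POPULATION = {
    "CountryA": 300, "ProvinceA1": 200, "ProvinceA2": 100,
    "CountryB": 500, "ProvinceB1": 400,
}
COUNTRIES = ["CountryA", "CountryB"]


def violations():
    """Return the countries whose population differs from the sum of their parts."""
    bad = []
    for y in COUNTRIES:  # qualifier: for all y in Country
        # Left-hand side: sum over every x with x partOf y.
        lhs = sum(POPULATION[x] for x, whole in PART_OF.items() if whole == y)
        if lhs != POPULATION[y]:  # right-hand side: y's own population value
            bad.append(y)
    return bad


print(violations())
```

CountryA satisfies the constraint (200 + 100 = 300), while CountryB is reported because its single province accounts for only 400 of its 500.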

Our Direction to the Future
 Directions
 Open: share your data whenever and wherever you want
 Semantic: enhance your data to make more sense of it
 An example: LinkedGeoData.org
 We need an integrated framework to enhance communication and
information sharing in GeoData.
Q&A