Report

Introduction to BI
Automated Decision-Making Framework

BI (according to Wikipedia)
• http://he.wikipedia.org/wiki/%D7%91%D7%99%D7%A0%D7%94_%D7%A2%D7%A1%D7%A7%D7%99%D7%AA
Table of contents
• 1 History
• 2 Work process
• 3 Data warehouse and BI
• 4 Online analytical processing (OLAP)
• 5 Data mining (all the learning methods we studied)
• 6 Operational business intelligence
• 7 Main uses
• 8 BI products

History of DSS

Classical Definitions of DSS
• "Interactive computer-based systems, which help decision makers utilize data and models to solve unstructured problems" - Gorry and Scott-Morton, 1971
• Decision support systems couple the intellectual resources of individuals with the capabilities of the computer to improve the quality of decisions. It is a computer-based support system for management decision makers who deal with semistructured problems - Keen and Scott-Morton, 1978

Types of DSS
• Two major types:
  – Model-oriented DSS
  – Data-oriented DSS
• Evolution of DSS into Business Intelligence
  – Use of DSS moved from specialists to managers, and then to whomever, whenever, wherever
  – Enabling tools such as OLAP, data warehousing, data mining, and intelligent systems, delivered via Web technology, have collectively led to the terms “business intelligence” (BI) and “business analytics”

From Wikipedia...
Since the mid-2000s, new business intelligence tools have appeared under an approach called Business Intelligence 2.0 (BI 2.0), which lets employees run queries on the organization's data in real time. The term BI 2.0 was coined by analogy with Web 2.0, because this kind of processing follows the browser-based model of the Web environment. BI 2.0 tools allow more dynamic reporting than the static reports that characterized the previous generation of tools. An important foundation for this kind of processing is the use of SOA, which comes together with more flexible middleware products and with standards for data exchange.
SOA = Service Oriented Architecture

DSS Description
• DSS application: A DSS program built for a specific purpose (e.g., a scheduling system for a specific company)
• Business intelligence (BI): A conceptual framework for decision support.
It combines architecture, databases (or data warehouses), analytical tools, and applications.

Business Intelligence (BI)
• BI is an evolution of decision support concepts over time.
  – Meaning of EIS/DSS…
    • Then: Executive Information System
    • Now: Everybody’s Information System (BI)
• BI systems are enhanced with additional visualizations, alerts, and performance measurement capabilities.
• The term BI emerged from industry apps.

[Figure: The Evolution of BI Capabilities]

The Architecture of BI
• A BI system has four major components
  – a data warehouse, with its source data
  – business analytics, a collection of tools for manipulating, mining, and analyzing the data in the data warehouse
  – business performance management (BPM) for monitoring and analyzing performance
  – a user interface (e.g., dashboard)
• In recent years, business intelligence has taken a central place in information systems. The rapid growth of the information accumulated in computerized systems requires relevant data to be presented and consolidated so that the information is meaningful. One sign of the field's importance is the acquisition of prominent companies specializing in it by large software companies.

[Figure: A High-Level Architecture of BI]

Learning Objectives
• Explain data integration and the extraction, transformation, and load (ETL) processes
• Describe real-time (a.k.a. right-time and/or active) data warehousing
• Understand data warehouse administration and security issues

Stage 1: Data Warehouse
• A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format
• “The data warehouse is a collection of integrated, subject-oriented databases designed to support DSS functions, where each unit of data is nonvolatile and relevant to some moment in time”

DW Framework
[Figure: DW Framework. Data sources: legacy, POS, ERP, other OLTP/Web, external data. ETL process: select, extract, transform, integrate, load. Storage: enterprise data warehouse with metadata, replicated into data marts (Finance, Engineering, Marketing, ...), or a "no data marts" option. Access: API / middleware. Applications (visualization): routine business reporting, data/text mining, OLAP, dashboard, Web, custom-built applications.]

Data Integration and the Extraction, Transformation, and Load (ETL) Process
[Figure: Extraction, transformation, and load (ETL). Sources: packaged application, legacy system, other internal applications, transient data source. Steps: extract, transform, cleanse, load. Targets: data warehouse, data mart.]

Data Mart
A departmental data warehouse that stores only relevant data
  – Dependent data mart: A subset that is created directly from a data warehouse
  – Independent data mart: A small data warehouse designed for a strategic business unit or a department

OLAP vs. OLTP
• Online Analytical Processing vs. Online Transaction Processing

OLAP: Slicing Operations on a Simple Three-Dimensional Data Cube
[Figure: A 3-dimensional OLAP cube with slicing operations; axes: Time, Geography, Product]
• Cells are filled with numbers representing sales volumes
• Sales volumes of a specific Product on variable Time and Region
• Sales volumes of a specific Region on variable Time and Products
• Sales volumes of a specific Time on variable Region and Products

Star vs Snowflake Schema
[Figure: Star Schema. Fact table SALES (UnitsSold, ...) surrounded by Dimension TIME (Quarter, ...), Dimension PRODUCT (Brand, ...), Dimension PEOPLE (Division, ...), Dimension GEOGRAPHY (Country, ...).]
[Figure: Snowflake Schema. Fact table SALES (UnitsSold, ...) with normalized dimensions: DATE (Date, ...), MONTH (M_Name, ...), QUARTER (Q_Name, ...); PRODUCT (LineItem, ...), BRAND (Brand, ...), CATEGORY (Category, ...); PEOPLE (Division, ...); STORE (LocID, ...), LOCATION (State, ...).]
[Figure: Another SNOWFLAKE example]

Data Mining
• Classification (is it worth it or not to ...)
  – Lend money, invest in a field, open a new branch
• Cluster analysis (Clustering)
  – How many types of customers are there? What do they have in common?
• Regression analysis
  – How much will we earn; optimization

Types of Information
• Mining structured data
  – The "simpler" kind
• Mining text
  – Information retrieval
  – Sentiment analysis, trend analysis

Categories of Models
Category | Objective | Techniques
Optimization of problems with few alternatives | Find the best solution from a small number of alternatives | Decision tables, decision trees
Optimization via algorithm | Find the best solution from a large number of alternatives using a step-by-step process | Linear and other mathematical programming models
Optimization via an analytic formula | Find the best solution in one step using a formula | Some inventory models
Simulation | Find a good enough solution by experimenting with a dynamic model of the system | Several types of simulation
Heuristics | Find a good enough solution using “common-sense” rules | Heuristic programming and expert systems
Predictive and other models | Predict future occurrences, what-if analysis, … | Forecasting, Markov chains, financial, …

Static and Dynamic Models
• Static Analysis
  – Single snapshot of the situation
  – Single interval
  – Steady state
• Dynamic Analysis
  – Dynamic models
  – Evaluate scenarios that change over time
  – Time dependent
  – Represents trends and patterns over time
  – More realistic: Extends static models

Decision Analysis: A Few Alternatives
Single Goal Situations
• Decision trees
  – Graphical representation of relationships
  – Multiple criteria approach
  – Demonstrates complex relationships
  – Cumbersome if many alternatives exist

Decision Tables
• Investment example
• One goal: maximize the yield after one year
• Yield depends on the status of the economy (the state of nature)
  – Solid growth
  – Stagnation
  – Inflation

Investment Example: Possible Situations
1. If solid growth in the economy, bonds yield 12%; stocks 15%; time deposits 6.5%
2. If stagnation, bonds yield 6%; stocks 3%; time deposits 6.5%
3. If inflation, bonds yield 3%; stocks lose 2%; time deposits yield 6.5%

Optimization via Mathematical Programming
• Mathematical programming: A family of tools designed to help solve managerial problems in which the decision maker must allocate scarce resources among competing activities to optimize a measurable goal
• Optimal solution: The best possible solution to a modeled problem
  – Linear programming (LP): A mathematical model for the optimal solution of resource allocation problems. All the relationships are linear

LP Problem Characteristics
1. Limited quantity of economic resources
2. Resources are used in the production of products or services
3. Two or more ways (solutions, programs) to use the resources
4. Each activity (product or service) yields a return in terms of the goal
5. Allocation is usually restricted by constraints

Linear Programming Steps
• 1. Identify the …
  – Decision variables
  – Objective function
  – Objective function coefficients
  – Constraints
    • Capacities / Demands
• 2. Represent the model
  – LINDO: Write mathematical formulation
  – EXCEL: Input data into specific cells in Excel
• 3. Run the model and observe the results

LP Example: The Product-Mix Linear Programming Model
• MBI Corporation
• Decision: How many computers to build next month?
• Two types of mainframe computers: CC7 and CC8
• Constraints: Labor limits, Materials limit, Marketing lower limits

              | CC7    | CC8    | Rel | Limit
Labor (days)  | 300    | 500    | <=  | 200,000 /mo
Materials ($) | 10,000 | 15,000 | <=  | 8,000,000 /mo
Units         | 1      |        | >=  | 100
Units         |        | 1      | >=  | 200
Profit ($)    | 8,000  | 12,000 | Max |
Objective: Maximize Total Profit / Month

Sensitivity, What-if, and Goal Seeking Analysis
• Sensitivity
  – Assesses impact of change in inputs on outputs
  – Eliminates or reduces variables
  – Can be automatic or trial and error
• What-if
  – Assesses solutions based on changes in variables or assumptions (scenario analysis)
• Goal seeking
  – Backwards approach, starts with goal
  – Determines values of inputs needed to achieve goal
  – Example is break-even point determination

Heuristic Programming
• Cuts the search space
• Gets satisfactory solutions more quickly and less expensively
• Finds good enough feasible solutions to very complex problems
• Heuristics can be
  – Quantitative
  – Qualitative (in ES)
• Example: Traveling Salesman Problem

Heuristic Programming - SEARCH: Traveling Salesman Problem
• What is it?
  – A traveling salesman must visit customers in several cities, visiting each city only once, across the country. Goal: Find the shortest possible route
  – Total number of unique routes (TNUR): TNUR = (1/2) (Number of Cities – 1)!
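The TNUR formula can be checked with a few lines of Python. This is only a minimal sketch: the function name `tnur` and the choice of sample city counts are illustrative assumptions, not part of the original slides.

```python
from math import factorial


def tnur(num_cities: int) -> int:
    """Total number of unique routes: (1/2) * (num_cities - 1)!.

    Hypothetical helper for illustration. Assumes a symmetric problem
    (a route and its reverse count as one) and at least 3 cities.
    """
    if num_cities < 3:
        raise ValueError("need at least 3 cities")
    # (n - 1)! is even for n >= 3, so integer division is exact.
    return factorial(num_cities - 1) // 2


# Print the route count for a few sample problem sizes.
for n in (5, 6, 9, 20):
    print(n, tnur(n))
```

Even at 20 cities the count is roughly 6.08 × 10^16, which is why exhaustive enumeration quickly becomes impractical and heuristics are used instead.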
Number of Cities | TNUR
5                | 12
6                | 60
9                | 20,160
20               | about 6.08 × 10^16

When to Use Heuristics
• When to use heuristics
  – Inexact or limited input data
  – Complex reality
  – Reliable, exact algorithm not available
  – Computation time excessive
  – For making quick decisions
• Limitations of heuristics
  – Cannot guarantee an optimal solution

Modern Heuristic Methods
• Tabu search
  – Intelligent search algorithm
• Genetic algorithms
  – Survival of the fittest
• Simulated annealing
  – Analogy to thermodynamics

Simulation
• Technique for conducting experiments with a computer on a comprehensive model of the behavior of a system
• Frequently used in DSS tools

Major Characteristics of Simulation
• Imitates reality and captures its richness
• Technique for conducting experiments
• Descriptive, not normative tool
• Often used to “solve” very complex problems
Simulation is normally used only when a problem is too complex to be treated using numerical optimization techniques

Advantages of Simulation
• The theory is fairly straightforward
• Great deal of time compression
• Experiment with different alternatives
• The model reflects manager’s perspective
• Can handle wide variety of problem types
• Can include the real complexities of problems
• Produces important performance measures
• Often it is the only DSS modeling tool for nonstructured problems

Limitations of Simulation
• Cannot guarantee an optimal solution
• Slow and costly construction process
• Cannot transfer solutions and inferences to solve other problems (problem specific)
• So easy to explain/sell to managers that it may lead to overlooking analytical solutions
• Software may require special skills

Simulation Types
• Stochastic vs. deterministic simulation
  – In stochastic simulations we use probability distributions (discrete or continuous)
• Time-dependent vs. time-independent simulation
  – Time-independent stochastic simulation via the Monte Carlo technique (X = A + B)
• Discrete event vs. continuous simulation
• Steady state vs. transient simulation
• Simulation implementation
  – Visual simulation
  – Object-oriented simulation

Data Mining Methods: Classification
• Most frequently used DM method
• Part of the machine-learning family
• Employs supervised learning
• Learns from past data, classifies new data
• The output variable is categorical (nominal or ordinal) in nature
• Classification versus regression?
• Classification versus clustering?

Assessment Methods for Classification
• Predictive accuracy
  – Hit rate
• Speed
  – Model building; predicting
• Robustness
• Scalability
• Interpretability
  – Transparency, explainability

Accuracy of Classification Models
• In classification problems, the primary source for accuracy estimation is the confusion matrix

                         | True Class: Positive      | True Class: Negative
Predicted Class: Positive | True Positive Count (TP)  | False Positive Count (FP)
Predicted Class: Negative | False Negative Count (FN) | True Negative Count (TN)

  $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
  $\text{True Positive Rate} = \frac{TP}{TP + FN}$
  $\text{True Negative Rate} = \frac{TN}{TN + FP}$
  $\text{Precision} = \frac{TP}{TP + FP}$
  $\text{Recall} = \frac{TP}{TP + FN}$

Estimation Methodologies for Classification
• Simple split (or holdout or test sample estimation)
  – Split the data into 2 mutually exclusive sets: training (~70%) and testing (~30%)
[Figure: Preprocessed data is split into 2/3 training data (model development, producing the classifier) and 1/3 testing data (model assessment/scoring, producing the prediction accuracy)]
• k-Fold Cross Validation (rotation estimation)
  – Split the data into k mutually exclusive subsets
  – Use each subset as testing while using the rest of the subsets as training
  – Repeat the experimentation k times
  – Aggregate the test results for a true estimation of prediction accuracy
• Other estimation methodologies
  – Leave-one-out, bootstrapping, jackknifing
  – Area under the ROC curve

Classification Techniques
• Decision tree analysis
• Statistical analysis
• Neural networks
• Support vector machines
• Case-based reasoning
• Bayesian classifiers
• Genetic algorithms
• Rough sets

Decision Trees
• Employs the divide and conquer method
• Recursively divides a training set until each division consists of examples from one class
  1. Create a root node and assign all of the training data to it
  2. Select the best splitting attribute
  3. Add a branch to the root node for each value of the split. Split the data into mutually exclusive subsets along the lines of the specific split
  4. Repeat steps 2 and 3 for each and every leaf node until the stopping criteria are reached
• DT algorithms mainly differ on
  – Splitting criteria
    • Which variable to split first?
    • What values to use to split?
    • How many splits to form for each node?
  – Stopping criteria
    • When to stop building the tree
  – Pruning (generalization method)
    • Pre-pruning versus post-pruning
• Most popular DT algorithms include
  – ID3, C4.5, C5; CART; CHAID; M5

Cluster Analysis for Data Mining
• k-Means Clustering Algorithm
  – k: pre-determined number of clusters
  – Algorithm (Step 0: determine value of k)
    Step 1: Randomly generate k points as initial cluster centers
    Step 2: Assign each point to the nearest cluster center
    Step 3: Re-compute the new cluster centers
    Repetition step: Repeat steps 2 and 3 until some convergence criterion is met (usually that the assignment of points to clusters becomes stable)
[Figure: k-Means Clustering Algorithm, Steps 1 to 3]

Data Mining Myths
• Data mining …
  – provides instant solutions/predictions
  – is not yet viable for business applications
  – requires a separate, dedicated database
  – can only be done by those with advanced degrees
  – is only for large firms that have lots of customer data
  – is another name for good old statistics

Common Data Mining Mistakes
1. Selecting the wrong problem for data mining
2. Ignoring what your sponsor thinks data mining is and what it really can/cannot do
3. Leaving insufficient time for data acquisition, selection and preparation
4. Looking only at aggregated results and not at individual records/predictions
5. Being sloppy about keeping track of the data mining procedure and results
6. Ignoring suspicious (good or bad) findings and quickly moving on
7. Running mining algorithms repeatedly and blindly, without thinking about the next stage
8. Naively believing everything you are told about the data
9. Naively believing everything you are told about your own data mining analysis
10. Measuring your results differently from the way your sponsor measures them

Text Mining Application Area
• Information extraction
• Topic tracking
• Summarization
• Categorization
• Clustering
• Concept linking
• Question answering

Text Mining Terminology
• Unstructured or semistructured data
• Corpus (and corpora)
• Terms
• Concepts
• Stemming
• Stop words (and include words)
• Synonyms (and polysemes)
• Tokenizing
• Term dictionary
• Word frequency
• Part-of-speech tagging
• Morphology
• Term-by-document matrix
  – Occurrence matrix
• Singular value decomposition
  – Latent semantic indexing

Natural Language Processing (NLP)
• Structuring a collection of text
  – Old approach: bag-of-words
  – New approach: natural language processing
• NLP is …
  – a very important concept in text mining
  – a subfield of artificial intelligence and computational linguistics
  – the study of "understanding" natural human language
• Syntax versus semantics based text mining
• Challenges in NLP
  – Part-of-speech tagging
  – Text segmentation
  – Word sense disambiguation
  – Syntax ambiguity
  – Imperfect or irregular input
  – Speech acts
• Dream of AI community
  – to have algorithms that are capable of automatically reading and obtaining knowledge from text

NLP Task Categories
• Information retrieval
• Information extraction
• Named-entity recognition
• Question answering
• Automatic summarization
• Natural language generation and understanding
• Machine translation
• Foreign language reading and writing
• Speech recognition
• Text proofing
• Optical character recognition

Text Mining Applications
• Marketing applications
  – Enables better CRM
• Security applications
  – ECHELON, OASIS
  – Deception detection (…)
• Medicine and biology
  – Literature-based gene identification (…)
• Academic applications
  – Research stream analysis

Web Mining Success Stories
• Amazon.com, Ask.com, Scholastic.com, …
• Website Optimization Ecosystem
[Figure: Website Optimization Ecosystem. Customer interaction on the Web; analysis of interactions (Web analytics, voice of customer, customer experience management); knowledge about the holistic view of the customer.]

Web Mining Tools
Product Name | URL
Angoss Knowledge WebMiner | angoss.com
ClickTracks | clicktracks.com
LiveStats from DeepMetrix | deepmetrix.com
Megaputer WebAnalyst | megaputer.com
MicroStrategy Web Traffic Analysis | microstrategy.com
SAS Web Analytics | sas.com
SPSS Web Mining for Clementine | spss.com
WebTrends | webtrends.com
XML Miner | scientio.com

Machine Learning Methods
• Supervised Learning
  – Classification: Decision Tree, Neural Networks, Support Vector Machines, Case-based Reasoning, Rough Sets, Discriminant Analysis, Logistic Regression, Rule Induction
  – Regression: Regression Trees, Neural Networks, Support Vector Machines, Linear Regression, Non-linear Regression, Bayesian Linear Regression
• Reinforcement Learning
  – Q-Learning, Adaptive Heuristic Critic (AHC), State-Action-Reward-State-Action (SARSA), Genetic Algorithms, Gradient Descent
• Unsupervised Learning
  – Clustering / Segmentation: SOM (Neural Networks), Adaptive Resonance Theory, Expectation Maximization, K-Means, Genetic Algorithms
  – Association: Apriori, ECLAT Algorithm, FP-Growth, One-attribute Rule, Zero-attribute Rule

BPM versus BI
• BPM is an outgrowth of BI and incorporates many of its technologies, applications, and techniques.
  – The same companies market and sell them.
  – BI has evolved so that many of the original differences between the two no longer exist (e.g., BI used to be focused on departmental rather than enterprise-wide projects).
  – BI is a crucial element of BPM.
• BPM = BI + Planning (a unified solution)

Performance Measurement: KPIs and Operational Metrics
• Key performance indicator (KPI): A KPI represents a strategic objective and a metric that measures performance against a goal
• Distinguishing features of KPIs
  – Strategy
  – Targets
  – Ranges
  – Encodings
  – Time frames
  – Benchmarks
• Outcome KPIs (lagging indicators, e.g., revenues) vs. driver KPIs (leading indicators, e.g., sales leads)
• Operational areas covered by driver KPIs
  – Customer performance
  – Service performance
  – Sales operations
  – Sales plan/forecast

BPM Methodologies
• The meaning of “balance”
  – The balanced scorecard (BSC) is designed to overcome the limitations of systems that are financially focused
  – Nonfinancial objectives fall into one of three perspectives:
    1. Customer
    2. Internal business process
    3. Learning and growth
• In BSC, the term “balance” arises because the combined set of measures is supposed to encompass indicators that are:
  – Financial and nonfinancial
  – Leading and lagging
  – Internal and external
  – Quantitative and qualitative
  – Short term and long term
• Strategy map: A visual display that delineates the relationships among the key organizational objectives for all four BSC perspectives

Performance Dashboards
• Dashboards and scorecards both provide visual displays of important information that is consolidated and arranged on a single screen so that information can be digested at a single glance and easily explored
• Dashboards versus scorecards
  – Performance dashboards: Visual display used to monitor operational performance (free form)
  – Performance scorecards: Visual display used to chart progress against strategic and tactical goals and targets (predetermined measures)
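To make the KPI features listed earlier (targets, ranges, encodings) more concrete, the following Python sketch scores a KPI against its target and encodes the result as a traffic-light status, roughly as a scorecard might. Everything here is an illustrative assumption rather than part of the source material: the `Kpi` class, the 90% warning threshold, the "higher is better" convention, and the sample figures.

```python
from dataclasses import dataclass


@dataclass
class Kpi:
    """A hypothetical KPI record: a metric measured against a goal.

    Assumes higher values are better and that the target is positive.
    """
    name: str
    actual: float
    target: float
    warn_ratio: float = 0.9  # assumed range boundary: below 90% of target is "red"

    def attainment(self) -> float:
        """Share of the target achieved (1.0 means the goal was met)."""
        return self.actual / self.target

    def status(self) -> str:
        """Encode performance as a traffic-light color."""
        ratio = self.attainment()
        if ratio >= 1.0:
            return "green"
        if ratio >= self.warn_ratio:
            return "yellow"
        return "red"


# Made-up sample values: one outcome KPI and one driver KPI.
revenue = Kpi("Revenue ($K)", actual=950.0, target=1000.0)
leads = Kpi("Sales leads", actual=70.0, target=100.0)

for kpi in (revenue, leads):
    print(f"{kpi.name}: {kpi.attainment():.0%} of target -> {kpi.status()}")
```

With these sample values, revenue lands in the "yellow" range (95% of target) and sales leads in "red" (70%). A real scorecard would also attach time frames and benchmarks, which this sketch leaves out.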