Business Intelligence CH 1-4*, 13 – Flashcards

question
Business Intelligence- brief history/ timeline
answer
Roots go back to the late 1960s
1970s- decision support systems (DSS)
1980s- EIS, OLAP, GIS, data warehousing & dashboards/ scorecards
Now we call it data analytics/ intelligence/ mining...
question
Business intelligence- definition
answer
Business intelligence is a broad term that encompasses applications, technologies, and processes for gathering, storing, accessing, and analyzing data to help business users make better decisions. BI includes both "getting data out" & "getting data in"
question
What are the three kinds of business intelligence?
answer
1. descriptive analytics 2. predictive analytics 3. prescriptive analytics
question
How are business values found?
answer
capturing, storing, and analyzing new kinds of data.
question
What is Moore's law?
answer
law that states that processing "capacity" doubles every 18 months. ex. CPU, cache, memory
question
How does data mining differ from Moore's law?
answer
it is more aggressive- processing "capacity" doubles every 9 months. the rapid and continuing improvement in computing capacity is an essential enabler of the growth of data mining
question
Examples of BI tool vendors
answer
- SAS Enterprise Miner
- XLMiner
- Weka
- SAP Business Warehouse
- R
- DBMiner
- IBM Intelligent Miner
- MegaPuter PolyAnalyst
- IBM SPSS Modeler (formerly SPSS Clementine)
- Microsoft SQL Server 2008 Analysis Services
- Oracle Data Mining (formerly "Darwin")
- ANGOSS KnowledgeStudio
- DigiMine
question
Who uses Data Mining?
answer
-The Military
-Intelligence agencies
-Security specialists
-Medical researchers
-Businesses, ex. American Express, Target, Amazon, Facebook
question
what does the Residential Positive Achievement Change Tool (R-PACT) do?
answer
R-PACT collects data on prior criminal history, academic performance, involvement with antisocial peers and use of appropriate social skills for controlling emotions and handling difficult situations. -used by Florida's Department of Juvenile Justice (DJJ) to track key areas of development for residential youths.
question
Database Management vs Business Intelligence
answer
Database Management- uses a transactional database to manage the data of an organization's everyday transactions. Business Intelligence- uses an analytical data store to support managerial decision-making- predictive/ periodic analysis
question
Information flow
answer
data entry -> transactional database (stores real-time transactional data)->data extraction -> analytical data store (stores historical transactional and summary data) -> data analysis
question
OLTP (on-line transaction processing)
answer
- repetitive usage, major task of traditional relational DBMS - Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
question
OLAP (on-line analytical processing)
answer
-Ad-hoc usage, major task of data warehouse system -data analysis and decision making
question
How does OLTP differ from OLAP?
answer
- User & system orientation (customer vs market)
- Data contents (current, detailed vs historical, consolidated)
- Database design (ER + application vs star + subject)
- View (current, local vs evolutionary, integrated)
- Access patterns (update vs read-only but complex queries)
question
Data Warehouse definitions
answer
- subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process - a decision support database that is maintained separately from the organization's operational database - supports information processing by providing a solid platform of consolidated, historical data for analysis
question
problems with operational data
answer
- dirty data - missing values - inconsistent data - data not integrated - wrong granularity (too fine vs not fine enough) - too much data (too many attributes, too many data points)
question
Dirty data definition
answer
mistakes in spelling or punctuation, incorrect data associated with a field, incomplete or outdated data or data that is duplicated in the database
question
what is the curse of dimensionality?
answer
problems caused by too much data:
1. the exponential increase in volume associated with adding extra dimensions to a (mathematical) space
2. too many rows or data points
3. the more attributes there are, the easier it is to build a model that fits the sample data but is worthless as a predictor
- a major activity in data mining concerns efficient and effective ways of selecting attributes
question
data integrity problems
answer
- same person, different spellings - multiple ways to denote company name - use of different names - different account numbers generated by different applications for the same customer - required fields left blank -invalid product codes collected at point of sale
question
common applications for data mining across industries- profiling and segmentation
answer
what is predicted? customer behaviors and needs by segment. resulting business decision? how to better target product/ service offers
question
common applications for data mining across industries- cross sell and up-sell
answer
what is predicted? what customers are likely to buy resulting business decision? which product/service to recommend
question
common applications for data mining across industries- acquisition and retention
answer
what is predicted? customer preferences and purchase patterns resulting business decision? how to grow and maintain valuable customers
question
common applications for data mining across industries- campaign management
answer
what is predicted? the success of customer communications resulting business decision? how to direct the right offer to the right person at the right time
question
common applications for data mining across industries- profitability and lifetime value
answer
what is predicted? drivers of future value (margin and retention) resulting business decision? which customers to invest in and how to best appeal to them
question
Data mining process (5)
answer
1. SAMPLE THE DATA by creating a target data set large enough to contain significant information
2. EXPLORE THE DATA by searching for anticipated relationships and unanticipated trends and anomalies to gain deeper understanding
3. MODIFY THE DATA by creating, selecting, and transforming the variables to focus your model selection process
4. MODEL THE DATA by using analytical tools to search for a combination of data that reliably predicts a desired outcome
5. ASSESS THE DATA and models by evaluating the usefulness and reliability of the findings from the data mining process
question
what are the origins of data mining?
answer
data mining draws ideas from: -artificial intelligence -pattern recognition -statistics -database systems
question
Data mining tasks
answer
prediction methods
-predict unknown/future values of other variables
-predict likelihood of a particular outcome
description methods
-find human-interpretable patterns that describe the data
question
data mining methods examples
answer
descriptive
-clustering
-association rule discovery
-sequential pattern discovery
-visualization
predictive
-classification
-regression
-neural networks
-deviation detection
question
RFM analysis
answer
allows you to analyze and rank customers according to purchasing patterns
R= how Recently a customer purchased your products
F= how Frequently a customer purchases your products
M= how much Money a customer typically spends on your products
scored 1-5 in decreasing order
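The scoring idea can be sketched in a few lines: rank each customer on each of the three dimensions, with rank 1 being the best (lowest recency, highest frequency and monetary value). The customer data and field names below are made up for illustration; with only three customers, ranks 1-3 stand in for the 1-5 scores.

```python
# Hypothetical customers; with a full data set the ranks would be
# quintile scores from 1 to 5.
customers = {
    "A": {"recency_days": 5,  "frequency": 20, "monetary": 900},
    "B": {"recency_days": 40, "frequency": 3,  "monetary": 120},
    "C": {"recency_days": 12, "frequency": 9,  "monetary": 480},
}

def rfm_scores(data):
    scores = {name: {} for name in data}
    # Lower recency is better; higher frequency/monetary are better.
    for field, best_is_low in [("recency_days", True),
                               ("frequency", False),
                               ("monetary", False)]:
        ranked = sorted(data, key=lambda n: data[n][field],
                        reverse=not best_is_low)
        for rank, name in enumerate(ranked, start=1):
            scores[name][field[0].upper()] = rank   # keys "R", "F", "M"
    return scores

print(rfm_scores(customers))  # customer A ranks best on all three
```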
question
Decision tree
answer
-classify data according to a pre-defined outcome
-depends on characteristics of the data
-provides insight about whether:
- a customer will receive a loan
- a credit card charge is legitimate
- an investment will pay off
question
Clustering
answer
-determine distinct groups of data -based on data across multiple dimensions -outcome: -customer segmentation -identifying patient care groups -performance of business sectors
question
association rules
answer
- used to determine which events occur together -usually that "event" is a product purchase -determine: -which products are bought together -which web sites are likely to be visited in a single session -sets of customization options that should be bundled
question
nearest- neighbor classifiers
answer
requires: - the set of stored records - a distance metric to compute distance between records - the value of k, the number of nearest neighbors to retrieve
question
nearest neighbor definition
answer
the k-nearest neighbors of a record x are the data points that have the k smallest distances to x
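A minimal sketch of a k-nearest-neighbor classifier using the three ingredients from the card above: stored records, a distance metric (Euclidean here), and k. The records and labels are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical stored records: (features, class label).
records = [((1.0, 1.0), "yes"), ((1.2, 0.8), "yes"),
           ((5.0, 5.0), "no"),  ((5.5, 4.5), "no")]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(x, records, k):
    # Take the k records with the smallest distance to x, then
    # classify by majority vote among their labels.
    neighbors = sorted(records, key=lambda r: euclidean(x, r[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify((1.1, 0.9), records, k=3))  # two of three neighbors are "yes"
```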
question
data, dimension, terminologies
answer
- dimension = number of columns
- variables = columns, ex. (x, y) = (price, income)
synonyms for x: predictor, input variable, independent variable, attribute, feature, field
synonyms for y: response, output variable, outcome variable, dependent variable, target variable
synonyms for records (rows): observation, case, pattern
question
How can business intelligence be used for reporting?
answer
can be used for: - sorting - grouping - summing - averaging - comparing
question
How can business intelligence be used for data mining?
answer
- sophisticated statistical techniques, regression analysis, and decision tree analysis -used to discover hidden pattern and relationships -classification, prediction, association rules - data reduction, data exploration, visualization
question
How can business intelligence be used for knowledge management?
answer
- create value by collecting and sharing core human knowledge about products, best practices, suppliers, etc.
question
RFM vs OLAP
answer
OLAP is more generic than RFM OLAP provides the ability to sum, count, average, and perform other simple arithmetic operations on groups of data
question
Descriptive analytics
answer
- core of traditional BI -> reporting/OLAP, dashboards, and data visualization ex. clustering, association rule, and pattern discovery
question
supervised learning
answer
goal: predict a single "target" or "outcome" variable
uses training data, where the value of the outcome of interest is known
methods: classification and prediction
question
unsupervised learning
answer
goal: segment data into meaningful segments; detect patterns
there is no target (outcome) variable to predict or classify
methods: association rules, data reduction & exploration, visualization
question
supervised learning- classification method
answer
goal: predict a CATEGORICAL target (outcome) variable
ex. purchase/ no purchase, fraud/ no fraud, creditworthy/ not creditworthy
each row is a case (customer, tax return, applicant)
each column is a variable
target variable is often binary (yes/no)
question
supervised learning- prediction method
answer
goal: predict a NUMERICAL target (outcome) variable
ex. sales, revenue, performance
just like the classification method:
-> each row is a case (customer, tax return, applicant)
-> each column is a variable
**taken together, classification and prediction constitute "predictive analytics"
question
unsupervised learning- association rules method
answer
goal: produce rules that define "what goes with what"
ex. if X was purchased, Y was also purchased
rows are transactions
used in recommender systems- "our records show you bought X, you may also like Y"
also called affinity analysis or market basket analysis
question
unsupervised learning- data reduction method
answer
-> distillation of complex/large data into simpler/smaller data -> reducing the number of variables/ columns (ex. principal components) -> reducing the number of records/rows (ex. clustering)
question
unsupervised learning- visualization method
answer
-> graphs and plots of data -> histograms, box plots, scatterplots, bar charts -> useful to examine relationships between pairs of variables and to detect outliers
question
steps in data mining (9)
answer
1. define/understand purpose
2. obtain data
3. explore, clean, pre-process data
4. reduce the data
5. specify task (classification, clustering, etc.)
6. choose the techniques (regression, neural networks, etc.)
7. iterative implementation and "tuning"
8. assess results- compare models
9. deploy best model
question
step 2: obtaining data-> sampling
answer
-data mining typically deals with huge databases -algorithms and models are typically applied to a sample from a database, to produce statistically-valid results
question
sampling
answer
-> allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
-> choose a representative subset of the data
- simple random sampling may have very poor performance in the presence of skew
-> develop adaptive sampling methods
- stratified sampling:
- approximate the % of each class (or subpopulation of interest) in the overall database
- used in conjunction with skewed data
question
rare event oversampling
answer
often the event of interest is rare
ex. response to mailing, fraud in taxes
sampling may yield too few "interesting" cases to effectively train a model
a popular solution: oversample the rare cases to obtain a more balanced training set
later, need to adjust results for the oversampling
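A minimal sketch of the oversampling idea: duplicate rare-class records until the training set is roughly balanced. The records, labels, and balancing rule below are illustrative assumptions, not a prescribed method.

```python
import random

def oversample(records, labels, rare_label, seed=0):
    # Split records by class, then pad the rare class with random
    # duplicates until it matches the size of the common class.
    rng = random.Random(seed)
    rare = [r for r, y in zip(records, labels) if y == rare_label]
    common = [r for r, y in zip(records, labels) if y != rare_label]
    boosted = rare + [rng.choice(rare) for _ in range(len(common) - len(rare))]
    return common, boosted

# Hypothetical data: one "yes" response among five mailings.
common, boosted = oversample([1, 2, 3, 4, 5],
                             ["no", "no", "no", "no", "yes"], "yes")
print(len(common), len(boosted))  # 4 4
```

Remember the card's caveat: results (e.g. predicted response rates) must later be adjusted for the oversampling.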
question
data exploration
answer
data sets are typically large, complex & messy need to review the data to help refine the task use techniques of reduction and visualization
question
types of variables
answer
-> determine the types of pre-processing needed & algorithms used
main groups:
- interval (quantitative or numerical)
- nominal (qualitative or categorical)
- ordinal (categorical in nature but with an order)
*arithmetic operations are performed only on interval variables
question
variable handling
answer
interval variables: occasionally need to "bin" into categories
nominal variables: in most algorithms we must create binary dummies: number of dummies = [# of categories - 1]
ordinal variables: can often be used as-is, as if they were interval variables
question
detecting outliers
answer
an outlier is an observation that is "extreme"- distant from the rest of the data. outliers can have disproportionate influence on models. the purpose of identifying outliers is to call attention to values that need further review. once detected, domain knowledge is required to determine if the value is an error or truly extreme
question
handling missing data
answer
solution 1: omission
-if a small # of records have missing values, omit those records
-if many records are missing values on a small set of variables, we can drop the variables or use a proxy
-if many records have missing values, omission is not practical
solution 2: imputation
-replace missing values with reasonable substitutes
-lets you keep the record and use the rest of its (non-missing) information
solution 3: check the importance of the predictor
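Solution 2 (imputation) can be sketched with mean imputation on a single numeric column; using the mean as the "reasonable substitute" is one common choice (median or mode work the same way). The column values are made up.

```python
import statistics

def impute_mean(column):
    # Replace each missing value (None) with the mean of the
    # observed (non-missing) values in the column.
    observed = [v for v in column if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in column]

print(impute_mean([10, None, 14, None, 12]))  # missing values become 12
```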
question
how to handle noisy data?
answer
binning method:
-sort data & partition into equi-depth bins
-smooth by bin means, medians, boundaries, etc.
clustering:
-detect and remove outliers
combined computer and human inspection:
-detect suspicious values and check by human
regression:
-smooth by fitting data to regression functions
question
normalizing (standardizing) data
answer
-useful for classification involving neural networks or distance measurements such as nearest-neighbor classification and clustering
-used in techniques where variables with the largest scales would otherwise dominate and skew results
-puts all variables on the same scale
-normalizing function: subtract the mean and divide by the std. dev.
-alternate function: scale to 0-1 by subtracting the minimum and dividing by the range -> useful when data contains dummies & numeric variables
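The two normalizing functions described above, as a sketch on a made-up list of values:

```python
import statistics

def z_score(values):
    # Normalizing function: subtract the mean, divide by the std. dev.
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def min_max(values):
    # Alternate function: subtract the minimum, divide by the range,
    # giving values on a 0-1 scale.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

prices = [10, 20, 30, 40]
print(min_max(prices))  # [0.0, 0.333..., 0.666..., 1.0]
```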
question
the problem of overfitting
answer
statistical models can produce highly complex explanations of relationships between variables. the "fit" may be excellent on the training data set. when used with new data, models of great complexity do not do so well
question
partitioning the data
answer
problem: how well will our model perform with new data?
solution: separate data into two parts
-> training partition to develop the model
-> validation partition to implement the model and evaluate its performance on new data
this addresses the issue of overfitting
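A sketch of the random split into training and validation partitions; the 60/40 fraction and fixed seed are illustrative choices, not from the text.

```python
import random

def partition(records, train_frac=0.6, seed=42):
    # Shuffle a copy of the records, then cut into training and
    # validation partitions at the chosen fraction.
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, valid = partition(list(range(10)))
print(len(train), len(valid))  # 6 4
```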
question
test partition
answer
when a model is developed on training data, it can overfit the training data
assessing multiple models on the same validation data can overfit the validation data
some methods use the validation data to choose a parameter; this too can lead to overfitting the validation data
SOLUTION: the final selected model is applied to a test partition to give an unbiased estimate of its performance on new data
question
Graphs for data exploration- basic plots
answer
-line graphs -bar charts -scatterplots
question
Graphs for data exploration- distribution plots
answer
-boxplots -histograms
question
Data exploration in preprocessing
answer
data exploration is a MANDATORY initial step to: -understanding the data structure, cleaning the data (identifying unexpected gaps, incorrect, redundant, or missing values), identifying outliers, discovering initial patterns (correlations among variables and surprising clusters), and generating interesting questions -Data exploration helps variable derivation and selection, redundant data identification, category combination for data reduction, etc.
question
scatterplot
answer
displays RELATIONSHIP between two numerical variables types of relationships: -linear relationship -curvilinear relationship -strong relationship -weak relationship -no relationship
question
distribution plots
answer
display "how many" of each value occur in a data set -or, for continuous data or data with many possible values, "how many" values are in each of a series of ranges or "bins" key: "how many"
question
Histograms
answer
show the distribution of a numerical variable (ex. median house value)
question
Box Plot
answer
Top outliers are defined as those above Q3+1.5(Q3-Q1) -Max= maximum of non-outliers -analogous definitions for bottom outliers and for "min" -details may differ across software
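The Q3 + 1.5(Q3 - Q1) rule above (and its bottom analogue) can be sketched directly; as the card notes, details differ across software, and this sketch uses Python's `statistics.quantiles` convention for the quartiles. The data is made up.

```python
import statistics

def iqr_outliers(values):
    # Flag values above Q3 + 1.5*IQR or below Q1 - 1.5*IQR.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(data))  # [95]
```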
question
Heat Maps
answer
color conveys information in data mining, heat maps are used to visualize: -correlations -missing data
question
Matrix Plot
answer
-shows scatterplots for variable pairs
question
Pivot table/chart
answer
-very useful for showing interactive multi-dimensional views -combines information from multiple variables and computes a range of summary statistics -good for exploratory tasks
question
Binning data
answer
-discretizing continuous distributions
-binning is a process of grouping measured data into data classes (continuous variables are transformed into categorical variables)
-by putting the values of the variable into a certain number of bins, we can use the binned variable as a categorical variable
NOTE:
-continuous numerical values may not be allowed as input/output variables in some techniques
-binned variables can be easily read and interpreted
-non-linear dependencies can be modeled using a linear relationship
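One simple way to discretize, sketched below, is equal-width binning: map each continuous value to one of n labeled bins. The bin labels and the example values are illustrative assumptions (equal-depth binning, mentioned in the noisy-data card, is another option).

```python
def bin_value(x, lo, hi, n_bins):
    # Map x in [lo, hi] to one of n_bins equal-width bins,
    # clamping x == hi into the top bin.
    width = (hi - lo) / n_bins
    idx = min(int((x - lo) / width), n_bins - 1)
    return f"bin_{idx}"

ages = [3, 17, 41, 65]
print([bin_value(a, 0, 80, 4) for a in ages])  # bin width = 20
```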
question
Missing data handling
answer
XLMiner considers a cell to be missing data if it is empty or contains an invalid formula the records with missing data can be either deleted fully or the missing values can be replaced options on how to replace the missing data are by mean or median or mode or a value specified by the user
question
Exploring the data
answer
statistical summary of data: common metrics
-average/median
-min/max
-std. dev.
-counts & percentages
Note: perfect symmetry indicates that mean = median
Note 1*: left/ negative skew = high point is on the right; tail is on the left
no skew = high point is in the middle; tails are equal
right/ positive skew = high point is on the left; tail is on the right
Note 2*: left vs right skew refers to which side the tail of the graph is on
question
Reducing categories
answer
a single categorical variable with M categories is typically transformed into M-1 dummy variables
-each dummy variable takes the value 0 or 1 (0 = no for the category, 1 = yes)
problem: we can end up with too many variables
solution: reduce the number of categories by combining close or similar categories
ex. use only the categories that are most relevant to the analysis, & label the rest as "other"
we can use pivot tables to assess the categorical outcome variable's sensitivity to the dummies
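The M-categories-to-(M-1)-dummies transformation can be sketched as below; dropping the last (alphabetically) category as the baseline is an arbitrary illustrative choice.

```python
def to_dummies(values):
    # M categories become M-1 dummy columns; the dropped category
    # is represented by all-zeros.
    categories = sorted(set(values))[:-1]
    return [[1 if v == c else 0 for c in categories] for v in values]

colors = ["red", "blue", "green", "blue"]
# dummy columns are ["blue", "green"]; "red" is the baseline
print(to_dummies(colors))
```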
question
What are association rules?
answer
association rules are the study of "what goes with what"
-they are transaction-based or event-based
association rules are also called market basket analysis and affinity analysis
association rules originated with the study of customer transaction databases to determine associations among items purchased
note: do not count complements
the most popular method is the APRIORI ALGORITHM
question
ideas similar to association rules apply to industries
answer
-retailers' point-of-sale (POS) data -credit card data (possibly cross-merchant purchases) -banking services ordered -records of insurance claims (for fraud detection) -medical records
question
why do we care about association rules?
answer
- for product placement
ex. Whole Foods- flowers next to birthday cards
ex. Wal-Mart- customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of 3 types of candy bars
- for recommendations
ex. Amazon- as you are looking at HDTVs, you might also want HDMI cables
- for bundling
ex. travel packages (flight + hotel + car)
- for other applications
ex. price discrimination, website/catalog design, fraud detection (multiple suspicious insurance claims), medical complications (based on combinations of treatments)
question
Association rules- definition
answer
given a transactional database (set of transactions) find rules that predict the OCCURRENCE OF AN ITEM BASED ON THE OCCURRENCE OF OTHER ITEMS in the database -> implication means co-occurrence, not causality
question
RULE
answer
* we are not just looking at frequency; we are also looking at dependency *
if {set of items} -> then {set of items}
ex. if {diapers} -> then {beer}
ex.1* if x occurs, y also occurs: x = antecedent, y = consequent ("x leads to y")
"if" part = antecedent
"then" part = consequent
"item set" = the items (products) comprising the antecedent or consequent
-antecedent and consequent are DISJOINT (have no items in common)
question
support formula
answer
# of transactions containing the item set / total # of transactions
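The support formula in code, over a hypothetical transaction database (the item names echo the diapers/beer example later in the deck):

```python
# Made-up transaction database: each transaction is a set of items.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "milk"},
]

def support(item_set, transactions):
    # Fraction of transactions that contain every item in the set.
    hits = sum(1 for t in transactions if item_set <= t)
    return hits / len(transactions)

print(support({"diapers", "beer"}, transactions))  # 2/4 = 0.5
```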
question
Frequent item sets
answer
ideally, we want to create all possible combinations of items PROBLEM: computation time grows exponentially as # items increases SOLUTION: consider only "frequent item sets" * criterion for frequent: "support"
question
Support
answer
support quantifies the significance of the "CO-OCCURRENCE" of the items involved in a rule in practice we only care about item sets with strong enough support
question
Confidence formula
answer
# of occurrences containing both the antecedent and the consequent / # of occurrences of just the antecedent. in other words: support of the rule / support of the antecedent
question
valid association rules
answer
a rule has to meet a MINIMUM SUPPORT and a MINIMUM CONFIDENCE
-> both thresholds determined by the decision maker
high confidence suggests a strong association rule
-> this can be deceptive: when the antecedent and/or the consequent has a high level of support, we can have a high value for confidence even if the antecedent and consequent are independent!
question
Lift
answer
the lift of a rule measures "how much MORE likely the consequent is, given the antecedent" lift = confidence of the rule/ support of the consequent
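The confidence and lift formulas can be sketched together for the rule {diapers} -> {beer}; the transaction database is made up for illustration.

```python
# Made-up transaction database.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "milk"},
]

def support(item_set, ts):
    return sum(1 for t in ts if item_set <= t) / len(ts)

def confidence(antecedent, consequent, ts):
    # support of the rule / support of the antecedent
    return support(antecedent | consequent, ts) / support(antecedent, ts)

def lift(antecedent, consequent, ts):
    # confidence of the rule / support of the consequent
    return confidence(antecedent, consequent, ts) / support(consequent, ts)

ante, cons = {"diapers"}, {"beer"}
print(confidence(ante, cons, transactions))  # 0.5 / 0.75 = 2/3
print(lift(ante, cons, transactions))        # (2/3) / 0.5 = 4/3 > 1
```

Here lift > 1, so the rule finds {beer} better than random selection would.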
question
generating frequent item sets
answer
for k products...
1. user sets a minimum support criterion
2. next, generate the list of ONE-ITEM sets that meet the support criterion
3. use the list of one-item sets to generate the list of TWO-ITEM sets that meet the support criterion
4. use the list of two-item sets to generate the list of THREE-ITEM sets
5. continue recursively up through k-item sets
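The level-wise generation above can be sketched as follows; the transactions are invented, and this naive sketch skips the pruning refinements a full Apriori implementation would add.

```python
from itertools import combinations

# Made-up transaction database over products a, b, c.
transactions = [
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"},
]

def frequent_item_sets(transactions, min_support):
    n = len(transactions)

    def sup(s):
        return sum(1 for t in transactions if s <= t) / n

    items = set().union(*transactions)
    current = [frozenset([i]) for i in items]   # one-item candidates
    frequent, k = [], 1
    while current:
        # Keep only candidates that meet the support criterion.
        passed = [s for s in current if sup(s) >= min_support]
        frequent.extend(passed)
        k += 1
        # Build (k)-item candidates by unioning surviving sets.
        current = list({a | b for a, b in combinations(passed, 2)
                        if len(a | b) == k})
    return frequent

result = frequent_item_sets(transactions, min_support=0.6)
print(sorted(sorted(s) for s in result))
```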
question
measures of performance
answer
CONFIDENCE= the % of antecedent transactions that also have the consequent item set
LIFT= confidence/ (benchmark confidence)
BENCHMARK CONFIDENCE= transactions with the consequent as a % of all transactions
lift > 1 indicates a rule that is useful in finding consequent item sets (more useful than just selecting transactions randomly)
question
process of rule selection
answer
generate all rules that meet specified support and confidence -find frequent item sets (those with sufficient support) -from these item sets, generate rules with sufficient confidence
question
Lift ratio
answer
shows how effective the rule is in finding consequents (useful if finding particular consequents is important)
question
Confidence ratio
answer
shows the rate at which consequents will be found (useful in learning costs of promotion)
question
support ratio
answer
measures overall impact
question
caution: the role of chance
answer
random data can generate apparently interesting association rules the more rules you produce, the greater this danger rules based on large numbers of records are less subject to this danger
question
confidence is not symmetric
answer
if a->b meets the minimum confidence threshold, b->a will not necessarily meet it