Stat Learning – Flashcards
question
The hierarchy principle
answer
- If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.
- The rationale for this principle is that interactions are hard to interpret in a model without main effects, since their meaning is changed.
- Specifically, the interaction terms also contain main effects, if the model has no main-effect terms.
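As an illustration (not from the text), a minimal Python sketch using statsmodels R-style formulas with hypothetical variables x1 and x2: the formula term x1 * x2 expands to x1 + x2 + x1:x2, so the main effects stay in the model alongside the interaction, which is exactly what the hierarchy principle asks for.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: y depends on x1, x2, and their interaction
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1.0 + 0.5 * df.x1 - 0.3 * df.x2 + 2.0 * df.x1 * df.x2 + rng.normal(size=200)

# "x1 * x2" expands to x1 + x2 + x1:x2, so the main effects are retained
# even if their individual p-values turn out to be non-significant.
fit = smf.ols("y ~ x1 * x2", data=df).fit()
print(fit.params)
```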
question
Classification Summary
answer
- Logistic regression is very popular for classification, especially when K = 2.
- LDA is useful when n is small, or the classes are well separated, and Gaussian assumptions are reasonable. Also useful when K > 2.
- Naive Bayes is useful when p is very large.
question
Confidence Intervals
answer
Quantifies uncertainty about a parameter of the population.
question
Statistical Learning model
answer
Suppose that we observe a quantitative response Y and p different predictors, X1, X2, ..., Xp. We assume that there is some relationship between Y and X = (X1, X2, ..., Xp), which can be written in the very general form Y = f(X) + ε. Here f is some fixed but unknown function of X1, ..., Xp, and ε is a random error term, which is independent of X and has mean zero. In this formulation, f represents the systematic information that X provides about Y.
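As an illustration (not from the text), a minimal Python sketch that simulates data from this setup; the particular f used here is hypothetical, and ε is drawn with mean zero, independently of X:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Hypothetical systematic component: the "fixed but unknown" f
    return 3.0 + 2.0 * np.sin(x)

n = 500
X = rng.uniform(-3, 3, size=n)                  # predictor values
eps = rng.normal(loc=0.0, scale=0.5, size=n)    # error term: mean zero, independent of X
Y = f(X) + eps                                  # observed response: Y = f(X) + eps
```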
question
Interesting Advertising Question
answer
- Which media contribute to sales?
- Which media generate the biggest boost in sales?
- How much increase in sales is associated with a given increase in TV advertising?
question
Parametric Methods
answer
The model-based approach just described is referred to as parametric; it reduces the problem of estimating f down to one of estimating a set of parameters. Assuming a parametric form for f simplifies the problem of estimating f because it is generally much easier to estimate a set of parameters, such as β0, β1, ..., βp in the linear model (2.4), than it is to fit an entirely arbitrary function f. The potential disadvantage of a parametric approach is that the model we choose will usually not match the true unknown form of f. If the chosen model is too far from the true f, then our estimate will be poor. We can try to address this problem by choosing flexible models that can fit many different possible functional forms for f. But in general, fitting a more flexible model requires estimating a greater number of parameters. These more complex models can lead to a phenomenon known as overfitting the data, which essentially means they follow the errors, or noise, too closely.
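A minimal sketch of the parametric approach under a linear-model assumption (the simulated data and coefficient values are hypothetical): estimating f reduces to estimating the p + 1 coefficients by least squares.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))                       # three predictors X1, X2, X3
beta = np.array([1.5, -2.0, 0.7])                   # hypothetical true coefficients
y = 4.0 + X @ beta + rng.normal(scale=1.0, size=200)

# Parametric assumption: f(X) = beta0 + beta1*X1 + ... + betap*Xp,
# so fitting f is just estimating p + 1 numbers.
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)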
question
Non-parametric methods
answer
Non-parametric methods do not make explicit assumptions about the functional form of f. Instead they seek an estimate of f that gets as close to the data points as possible without being too rough or wiggly. Such approaches can have a major advantage over parametric approaches: by avoiding the assumption of a particular functional form for f, they have the potential to accurately fit a wider range of possible shapes for f. Any parametric approach brings with it the possibility that the functional form used to estimate f is very different from the true f, in which case the resulting model will not fit the data well. In contrast, non-parametric approaches completely avoid this danger, since essentially no assumption about the form of f is made. But non-parametric approaches do suffer from a major disadvantage: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for f.
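For contrast, a minimal non-parametric sketch using K-nearest-neighbors regression (one possible non-parametric method among several; the data are hypothetical): no functional form for f is assumed, and the fit is driven directly by the training points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(2 * x[:, 0]) + rng.normal(scale=0.3, size=300)   # nonlinear truth plus noise

# No parametric form is assumed; the prediction at a point is the average
# response of its K closest training observations.
knn = KNeighborsRegressor(n_neighbors=10).fit(x, y)
print(knn.predict([[0.5]]))
```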
question
Why would we ever choose to use a more restrictive method instead of a very flexible approach?
answer
There are several reasons that we might prefer a more restrictive model. If we are mainly interested in inference, then restrictive models are much more interpretable. For instance, when inference is the goal, the linear model may be a good choice since it will be quite easy to understand the relationship between Y and X1, X2, ..., Xp. In contrast, very flexible approaches, such as splines and boosting methods, can lead to such complicated estimates of f that it is difficult to understand how any individual predictor is associated with the response. When inference is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods. In some settings, however, we are only interested in prediction, and the interpretability of the predictive model is simply not of interest. We will often obtain more accurate predictions using a less flexible method. This phenomenon, which may seem counterintuitive at first glance, has to do with the potential for overfitting in highly flexible methods.
question
Regression Versus Classification Problems
answer
Variables can be characterized as either quantitative or qualitative (also known as categorical). Quantitative variables take on numerical values. In contrast, qualitative variables take on values in one of K different classes, or categories. We tend to refer to problems with a quantitative response as regression problems, while those involving a qualitative response are often referred to as classification problems. However, the distinction is not always that crisp. Least squares linear regression is used with a quantitative response, whereas logistic regression is typically used with a qualitative (two-class, or binary) response. As such it is often used as a classification method. But since it estimates class probabilities, it can be thought of as a regression method as well. Some statistical methods, such as K-nearest neighbors and boosting, can be used in the case of either quantitative or qualitative responses.
question
Measuring the Quality of Fit
answer
In order to evaluate the performance of a statistical learning method on a given data set, we need some way to measure how well its predictions actually match the observed data. That is, we need to quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation. In the regression setting, the most commonly-used measure is the mean squared error (MSE), given by MSE = (1/n) Σ_{i=1}^{n} (y_i − f̂(x_i))^2, where f̂(x_i) is the prediction that f̂ gives for the ith observation. The MSE will be small if the predicted responses are very close to the true responses, and will be large if for some of the observations, the predicted and true responses differ substantially.
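A minimal sketch of computing the MSE for a fitted model (the observed and predicted values below are hypothetical):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared prediction errors."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical observed responses and model predictions
print(mse([3.1, 2.8, 5.0], [3.0, 3.2, 4.5]))
```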
question
K-Nearest Neighbors
answer
Given a positive integer K and a test observation x0, the KNN classifier first identifies the K points in the training data that are closest to x0, represented by N0. It then estimates the conditional probability for class j as the fraction of points in N0 whose response values equal j. Finally, KNN applies Bayes rule and classifies the test observation x0 to the class with the largest probability. Despite the fact that it is a very simple approach, KNN can often produce classifiers that are surprisingly close to the optimal Bayes classifier.
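A minimal sketch of this classification rule written directly with numpy (Euclidean distance, ties broken arbitrarily; the toy data are hypothetical). In practice, sklearn.neighbors.KNeighborsClassifier provides the same behavior.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x0, K):
    """Classify x0 by majority vote among its K nearest training points."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # distance from x0 to every training point
    neighbors = np.argsort(dists)[:K]              # indices of the K closest points (N0)
    # Estimated class probabilities are the class fractions within N0;
    # predict the class with the largest fraction.
    return Counter(y_train[neighbors]).most_common(1)[0][0]

# Hypothetical toy data with two classes
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.05]), K=3))
```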