# WGU Intro to Probability and Statistics Brad Bledsoe
question

Population

The entire group that is the target of interest, not just people. E.g., "the population of 1-bedroom apartments."
question

Sample

A subgroup of the population. E.g., "the 1-bedroom apartments with dishwashers."
question

Steps in the statistics process

1. PRODUCE DATA (by studying a sample of the population) 2. EXPLORATORY DATA ANALYSIS (Summarize data.) 3. PROBABILITY ANALYSIS (Determine how the sample may differ from the population.) 4. INFERENCE (draw conclusions)
question

Data

pieces of info about individuals organized into variables
question

Individual

a particular person or object
question

Variable

a particular characteristic of the individual
question

Dataset

a set of data identified with particular circumstances. Typically displayed in tables with rows as the individuals and columns as the variables
question

Quantitative vs Categorical/Qualitative variables

Quantitative: numerical values that represent a measurement. Categorical: category or label values into which individuals are grouped.
question

Three steps in Exploratory Data Analysis

1. Organize and SUMMARIZE raw data 2. DISCOVER important features and patterns and striking deviations. 3. INTERPRET findings in the context of the problem
question

Examining Distributions

exploring data obtained from one variable at a time
question

Examining Relationships

exploring data obtained from two variables at a time
question

Distribution

what values the variable takes, how often
question

Three types of graphical displays of categorical distributions

1. Pie Charts 2. Bar Charts 3. Pictogram
question

Bins

ranges of data to make charting easier, like a bar chart where each bar shows a range like 70-80%
question

Numerical Summaries

category counts and percentages
question

Four types of Graphical displays of Quantitative Variables

1. Histogram 2. Stemplot 3. Dotplot 4. Boxplot
question

Histogram

Like a bar chart, but the x-axis is numerical and in order, divided into intervals (bins), and the y-axis is the count (or percent) of observations in each interval. E.g., the x-axis is number of hours studied, and the y-axis is the number of students falling into each hours-studied interval.
question

4 ways to interpret a histogram

1. Shape – Symmetry/Skewness, Peakedness (Modality) 2. Center – midpoint 3. Spread – approximate range covered by the data 4. Outliers – observations that fall outside the overall pattern
question

Symmetric distributions (on a histogram)

The left and right sides roughly mirror each other. Can be multi-peaked and still symmetric.
question

Skewness (on a histogram)

The data is skewed to the right or left when one tail is much longer than the other, often because of outliers. (Careful: the histogram looks heavy on the side opposite to the direction of the skew. Think of the extreme values as pulling a long tail out from the main data, making it asymmetric.)
question

Peakedness (on a histogram) (three types)

1. Unimodal (single peaked) distribution 2. Bimodal (double peaked) distribution 3. Uniform distribution (no clear peak; all values about equally frequent)
question

Stemplot (or stem and leaf plot)

1. Write all the "stems" down in a list, in ascending numerical order. (The stem is every digit but the rightmost one. E.g., for the dataset 34 35 36 347 367 the stems are 3, 3, 3, 34, 36, but you only write each identical stem once, so it would be 3, 34, 36.) 2. Draw a line to the right of the list. 3. Write each leaf (the rightmost digit) next to its stem, and arrange the leaves in increasing order.
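The three steps above can be sketched in Python. This is only an illustration: the function name `stemplot` is made up here, and it assumes non-negative whole numbers. It also lists only stems that actually occur, whereas a hand-drawn stemplot would usually show the empty stems in between.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows: stem = every digit but the last, leaf = last digit."""
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):          # sorting first keeps leaves in increasing order
        stem, leaf = divmod(v, 10)    # e.g. 347 -> stem 34, leaf 7
        leaves_by_stem[stem].append(leaf)
    return [
        f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(leaves_by_stem.items())
    ]

rows = stemplot([34, 35, 36, 347, 367])  # the dataset from the card above
print("\n".join(rows))
```

Each identical stem appears once, with all of its leaves to the right of the line.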
question

two Virtues of a stemplot

1. preserves the data while sorting it 2. when rotated looks like a histogram
question

Dotplot

a stemplot with dots instead of leaves
question

Boxplot

Shows the "five-number summary": min, Q1, median, Q3, max. The y-axis is the range of the data. The drawn box is the interquartile range. Points mark outliers, minimum, and maximum. Most useful for showing side-by-side comparisons.
question

1. Q3 through Max = the top 25% of the data 2. Q3 = the 75th percentile 3. Median = the 50th percentile 4. Q1 = the 25th percentile 5. Minimum through Q1 = the bottom 25% of the data
question

3 Measures of Center

1. Mode – the value most often found (not sensitive to outliers) 2. Median – the center value (or average of the two center values) (not sensitive to outliers) 3. Mean – the average (sensitive to outliers)
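A quick way to see the sensitivity to outliers is to compute all three measures on a small made-up dataset, with and without one extreme value. The numbers below are hypothetical; the functions come from Python's standard `statistics` module.

```python
from statistics import mean, median, mode

data = [2, 3, 3, 4, 5]         # hypothetical observations
with_outlier = data + [100]    # same data plus one extreme value

center_without = (mode(data), median(data), mean(data))
center_with = (mode(with_outlier), median(with_outlier), mean(with_outlier))

print("without outlier (mode, median, mean):", center_without)
print("with outlier    (mode, median, mean):", center_with)
```

The mode stays put and the median barely moves, while the mean is pulled far toward the outlier.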
question

3 Measures of Spread

1. Range – the distance between the maximum and minimum values 2. Inter-Quartile Range – the range of the middle 50% of the data 3. Standard Deviation – roughly the typical distance of the observations from their mean. (E.g., the mean may be 9, while the observations are typically about 4 away from 9.)
question

Calculate range

max – min
question

Calculate inter-quartile range

1. Find the median (after arranging the data in increasing order) 2. Find the median of the bottom 50% (Q1, "the first quartile") 3. Find the median of the top 50% (Q3, "the third quartile") 4. Q3 - Q1 = IQR
question

The 1.5(IQR) criterion for outliers

1. Lower bound: Q1 - 1.5(IQR) 2. Upper bound: Q3 + 1.5(IQR) 3. Any data points below the lower bound or above the upper bound are possible outliers.
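The quartile steps and the 1.5(IQR) criterion can be sketched together in Python. The data are made up, and the helper `quartiles` is a hypothetical name. It assumes the "median of each half" method from these notes, leaving the middle value out of both halves when the number of observations is odd. Software packages often use slightly different quartile conventions.

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 as the medians of the bottom and top halves of the sorted data."""
    xs = sorted(values)
    half = len(xs) // 2            # size of each half; middle value dropped if n is odd
    return median(xs[:half]), median(xs[-half:])

data = [1, 3, 4, 5, 6, 7, 8, 9, 30]   # hypothetical observations
q1, q3 = quartiles(data)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
possible_outliers = [x for x in data if x < lower_bound or x > upper_bound]

print(q1, q3, iqr, lower_bound, upper_bound, possible_outliers)
```

Only the value 30 falls outside the two bounds, so it is the one possible outlier.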
question

Outliers – when to keep, when to discard?

1. Keep if could happen again, produced by essentially same process. 2. Discard if produced by a different process and your purpose is to understand the process which produced most of the data. 3. Discard if produced by an error or typo that cannot be fixed.
question

Notations for Standard Deviation

SD, s, Sd, St Dev
question

Calculate Standard Deviation

1. Find the mean 2. Find the deviation of each observation from the mean 3. Square each deviation 4. Add up the squared deviations and divide by the number of observations minus 1 (n - 1) 5. Take the square root of the result. EXPLANATION: We can’t average the deviations themselves because they always add up to zero. Why we divide by n - 1 rather than n is beyond the scope of this course. The result of step 4, this "average" of the squared deviations, is called the variance of the data.
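The five steps translate directly into a few lines of Python. The dataset is made up for illustration. The last line compares the hand calculation against `statistics.stdev`, which also divides by n - 1.

```python
from math import sqrt
from statistics import stdev

data = [5, 7, 9, 11, 13]                       # hypothetical observations

mean = sum(data) / len(data)                   # step 1: the mean
deviations = [x - mean for x in data]          # step 2: distance of each value from the mean
squared = [d ** 2 for d in deviations]         # step 3: square each deviation
variance = sum(squared) / (len(data) - 1)      # step 4: divide by n - 1
sd = sqrt(variance)                            # step 5: square root

print(mean, deviations, variance, sd)
print("matches statistics.stdev:", abs(sd - stdev(data)) < 1e-9)
```

The deviations sum to zero, which is why they are squared before averaging.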
question

Is the "standard deviation" or "variance of the data" influenced by outliers?

yes, strongly
question

The "standard deviation rule"

For distributions that are approximately normal (bell-shaped): Approx. 68% of observations fall within 1 standard deviation of the mean. Approx. 95% of observations fall within 2 standard deviations of the mean. Approx. 99.7% of observations fall within 3 standard deviations of the mean. (3 standard deviations = the standard deviation x 3)
question

Notation for mean

x̄ (an x with a line over it)
question

Choose between using mean and standard deviation versus the five-number summary

1. use mean and SD for relatively symmetrical distributions with no outliers 2. use five number summary for all others
question

Steps to choose which data display and numerical summary is best

1. Identify the explanatory/independent variable (x) and the response/dependent variable (y) 2. Is the explanatory variable categorical or quantitative? 3. Is the response variable categorical or quantitative? 4. Notate it C-C, C-Q, Q-C, or Q-Q 5. Select approach based on above
question

Select data display and numerical summary approach for case C-C, C-Q, Q-C, or Q-Q

1. Case C-C: Two-way table or double bar chart using conditional percents. 2. Case C-Q: Side-by-side boxplots and five-number summaries 3. Case Q-C: Not covered in the text 4. Case Q-Q: Scatterplot (explanatory on x, response on y) or labelled scatterplot
question

Correlation Coefficient

Measures the strength and direction of a linear relationship between two quantitative variables. Does not tell you IF a relationship is linear. A curvilinear relationship may or may not have a linear component. The correlation coefficient tells you the strength of the linear relationship, not of the curvilinear relationship.
question

Notation of the correlation coefficient

r
question

Correlation Coefficient and Outliers

Outliers strongly affect the r-value, so the correlation coefficient should only be used after looking at the scatterplot.
question

Range of values in the correlation coefficient

-1 to 1. -1 is the strongest negative linear relationship. +1 is the strongest positive linear relationship. Close to zero is a weaker linear relationship.
question

Regression and Linear Regression

The technique that specifies the dependence of the response variable on the explanatory variable. If it’s a linear dependence, then it’s linear regression. It’s finding the line that best fits the pattern of the linear relationship.
question

Calculate linear regression or the "least squares regression line"

1. y = a + bx
2. b = r(Sy/Sx)
3. a = ȳ - b(x̄)
KEY: r = the correlation coefficient. Sx = the standard deviation of the explanatory variable’s values. Sy = the standard deviation of the response variable’s values. x̄ (x with a line over it) = the mean of the explanatory variable’s values. ȳ (y with a line over it) = the mean of the response variable’s values.
EXPLANATION: Just as b in the standard line equation, y = a + bx, is the slope, or the change in y when x changes by 1, the slope of the "least squares regression line" is the average change in the response variable when the explanatory variable increases by 1 unit. It’s called the "least squares regression line" because it’s the line which results in the smallest sum of squared vertical deviations.
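Here is a small worked example of the three formulas in Python. The x and y values are made up. The notes do not give a formula for r, so the standard textbook one is assumed here: the sum of the products of the deviations, divided by (n - 1) times Sx times Sy.

```python
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]    # hypothetical explanatory values
y = [2, 4, 5, 4, 5]    # hypothetical response values
n = len(x)

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# Assumed standard formula for the correlation coefficient r (not from the notes).
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

b = r * s_y / s_x          # slope: b = r(Sy/Sx)
a = y_bar - b * x_bar      # intercept: a = y-bar - b(x-bar)

print(f"r = {r:.4f}, line: y = {a:.2f} + {b:.2f}x")
```

For this data the line is roughly y = 2.2 + 0.6x, so each 1-unit increase in x goes with an average increase of about 0.6 in y.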
question

Line

a set of points that obey a particular relationship between x and y
question

Equation of the Line (Algebra Review)

y = a + bx. a = the y-intercept, or the value that y takes when x = 0. b = the slope, or the change in y when x changes by 1.
question

Extrapolation

prediction for ranges of the explanatory variable that are not in the data
question

Causation and lurking variables

Association does not imply causation. Lurking variables are not among the variables in a study but could substantially affect your interpretation of the relationship among those variables.
question

Whenever a lurking variable causes us to rethink the direction of an association
question

Correlation between quantitative variables vs. correlation between categorical variables

There can only be correlation between quantitative variables, not categorical variables.
question

10 sampling types and terms

(Remember S,V,V,C,S, S,P,C,M,S) "Some very very cute samples. Some pleasing, cute, magnificent samples." 1. Sampling Frame 2. Volunteer Sample 3. Volunteer Response 4. Convenience Sample 5. Systematic Sampling 6. Simple Random Sample 7. Probability Sampling Plan/Technique 8. Cluster Sampling 9. Multi-Stage Sampling 10. Stratified Sampling
question

Sampling Frame

The list of individuals from which the sample is actually drawn. The study should be designed so that the sampling frame covers the entire population being studied.
question

Probability Sampling Plan/Technique

Any sampling plan or technique that relies on random selection
question

Volunteer Sample and Volunteer Response

1. Volunteer sample: participants include themselves in the study. Biased because mostly people with strong opinions volunteer, but sometimes it’s the only ethical method (e.g., medical studies). 2. Volunteer response: participants are not required to respond. Biased because you don’t hear from those not interested in responding.
question

Convenience Sample

Individuals happen to be there at researcher’s convenience, like standing outside the arts building to catch students to question.
question

Cluster, Multi-Stage, and Stratified sampling.

CLUSTER: Select a random sample of natural clusters (5 out of 40 majors) and use all the individuals within the selected clusters (all the students with those 5 majors). MULTI-STAGE: Select a random sample of clusters (5 out of 40 majors) and select random individuals within each selected cluster (random students within the five majors). STRATIFIED: Use all the clusters/strata (all 40 majors). Randomly select individuals from each of the strata. (Random students within all 40 majors.)
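The difference between the three plans is easiest to see side by side. The sketch below uses a made-up population of 40 majors with 30 students each. The names and the sample sizes (5 majors, 10 or 3 students per major) are assumptions chosen only for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 40 majors, 30 students in each.
majors = {f"major_{m}": [f"student_{m}_{s}" for s in range(30)] for m in range(40)}

# CLUSTER: randomly pick 5 majors, then take every student in them.
cluster_majors = rng.sample(sorted(majors), 5)
cluster_sample = [s for m in cluster_majors for s in majors[m]]

# MULTI-STAGE: randomly pick 5 majors, then randomly pick 10 students in each.
stage_one = rng.sample(sorted(majors), 5)
multistage_sample = [s for m in stage_one for s in rng.sample(majors[m], 10)]

# STRATIFIED: use all 40 majors, randomly picking 3 students from each.
stratified_sample = [s for m in sorted(majors) for s in rng.sample(majors[m], 3)]

print(len(cluster_sample), len(multistage_sample), len(stratified_sample))
```

Cluster and multi-stage sampling leave most majors out entirely, while stratified sampling guarantees every major is represented.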
question

Systematic Sampling

E.g., send to every 50th address on a list. (The system can introduce its own bias: for instance, on an alphabetical list it would exclude siblings, because they share a last name. Think through such effects for whatever system is used.)
question

Simple Random Sample

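Select names out of a hat: every individual, and every possible group of the same size, has an equal chance of being chosen. Avoids the selection bias of the other methods.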
question

3 Types of studies

1. Observational – no interference 2. Experiment – Researchers control inputs 3. Sample Survey – individuals report (A study can’t be both observational and experimental)
question

Prospective vs retrospective studies

forward vs backward in time
question

Factor

the explanatory variable in a study
question

Treatments

Imposed values of the explanatory variable in a study. (Four quitting smoking techniques.)
question

Randomized Controlled Experiment – what is it and can you draw causal conclusions from it?

Researchers control value of explanatory variable with a randomized procedure. (Subjects are randomly assigned to different treatments.) Can draw causal conclusions from this kind of study.
question

Notation for "sample size"

n
question

Causal Conclusions (when can you draw them?)

you can draw causal conclusions if the researchers randomly assigned the values of the explanatory variable to individuals
question

Control Group

Segment of studied individuals who didn’t receive the treatment (or who received a placebo, like a sugar pill). Not always necessary, and sometimes ethically questionable.
question

"Blind" and "Double Blind"

Blind – participants don’t know what they’re getting. Double Blind – neither researchers nor participants know who is getting what. Prevents the "experimenter effect".
question

Experimenter Effect

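When researchers’ expectations unintentionally influence how they treat subjects or assess results. Prevented by double-blind studies.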
question

Hawthorne Effect

People in a study behave differently because they know they are being observed. One form of lack of realism (lack of ecological validity) in a study.
question

noncompliance

when study participants don’t do what they are asked to do, which skews the data
question

Blocking

Not imposing complete randomization in a study, but first dividing individuals into groups (blocks) of similar individuals, like male and female, and then randomizing within each block
question

Matched Pairs

1 individual in a study gets 2 treatments or 2 similar individuals get 2 treatments
question

Open vs Closed Questions on a survey

What is your favorite kind of food vs. Which of these five foods is your favorite?
question

6 types of survey questions to be aware of

1. Open vs. Closed questions 2. Unbalanced response options 3. Leading questions 4. Planting ideas with questions 5. complicated questions 6. sensitive questions
question

Leading questions vs. planting ideas with questions

Leading question: "How long have you been beating your wife?" Planting ideas with questions: "Given the huge deficit, are you in favor of universal health care?"
question

Probability Notation

P(it will rain) or P(it will not rain). P(A) or P(not A). P(B), P(C), and so on.
question

Probability Rule #1: MEASUREMENT OF PROBABILITY (Made up term for memory tool. No title given to the rule in the text.)

Between 0 and 1 inclusive, i.e., 0 ≤ P(A) ≤ 1 (which means between a 0% and a 100% chance). So if a solution is above 1 or below 0, it’s wrong.
question

Theoretical (Classical) vs. Empirical (Observational) Probability

Theoretical (Classical): flipping a coin, rolling dice. The probabilities can be worked out from the nature of the situation. Empirical (Observational): the probability can’t be worked out in advance, so it is estimated by observing the outcomes of a series of trials.
question

Relative Frequency

The empirical probability of an event is its relative frequency in a series of trials. Relative frequency of event A = number of times A occurred / total number of repetitions
question

Law of large numbers

As the number of trials increases the empirical probability gets closer and closer to the theoretical probability
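A small simulation makes this concrete. The sketch below flips a simulated fair coin (theoretical probability of heads = 0.5) and prints the relative frequency of heads for larger and larger numbers of flips. The function name and the flip counts are made up for illustration, and the exact numbers depend on the random seed.

```python
import random

def relative_frequency_of_heads(num_flips, rng):
    """Relative frequency = number of heads / total number of flips."""
    heads = sum(1 for _ in range(num_flips) if rng.random() < 0.5)
    return heads / num_flips

rng = random.Random(0)  # fixed seed so the run is repeatable
for num_flips in (10, 100, 10_000, 100_000):
    freq = relative_frequency_of_heads(num_flips, rng)
    print(f"{num_flips:>7} flips: relative frequency of heads = {freq:.4f}")
```

With few flips the relative frequency can be well off 0.5. With many flips it tends to settle close to it.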
question

Sample Space vs. "Possible Outcomes for the Event"

Sample Space: The list of all possible outcomes. Possible Outcomes for the Event: the outcomes which match the "event" being looked for.
question

The complement of event A is

not A: the event that A does not occur
question

Venn Diagram

Overlapping circles to help visualize relationships between probabilities of events
question

Disjoint

mutually exclusive
question

Probability Rule #2 SUM OF PROBABILITIES (Made up term for memory tool. The rule was given no title in the text.)

P(S)=1 The sum of the probabilities of all possible outcomes is 1
question

Probability Rule #3: THE COMPLEMENT RULE

P(not A) = 1 – P(A) or P(A) = 1 – P(not A). The probability that an event does not occur is 1 minus the probability that it does occur, or vice versa. This makes sense when you remember that the sum of all the probabilities is 1. So the likelihood of something not happening is 1 minus the likelihood of it happening. Often it is easier to find the probability of the complement, which is why we can use this formula either way. Use for problems like "at least one of several events occurs."
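As a worked "at least one" example, take a made-up problem: what is the chance of rolling at least one six in four rolls of a fair die? The complement ("no six at all") is much easier to compute. The sketch assumes the rolls are independent, so the probabilities of "no six" on each roll can be multiplied.

```python
p_six = 1 / 6                        # P(six) on a single roll of a fair die
p_not_six = 1 - p_six                # complement rule for a single roll

# Assumes the four rolls are independent, so "no six on any roll" multiplies.
p_no_six_in_four = p_not_six ** 4

# Complement rule again: "at least one six" is the complement of "no six at all".
p_at_least_one_six = 1 - p_no_six_in_four

print(round(p_at_least_one_six, 4))
```

Listing every way to get one, two, three, or four sixes would give the same answer with far more work.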
question

Probability Rule #4: THE ADDITION RULE FOR DISJOINT EVENTS

If A and B are disjoint events, then P(A or B) = P(A) + P(B). In other words, for disjoint events, "or" means "+".
question

Probability Rule #5: THE MULTIPLICATION RULE FOR INDEPENDENT EVENTS

If A and B are independent events, then P(A and B) = P(A) x P(B). In other words, for independent events, "and" means "x" (multiply). (This may seem counterintuitive because you’re expecting that multiplying will make a larger number, but you’re always multiplying numbers between 0 and 1, so the result is no larger than either one.)
question

Independent vs Disjoint events

IF TWO EVENTS ARE DISJOINT (and each has a nonzero probability), THEY CAN’T BE INDEPENDENT. All other combos of the two are possible. DISJOINT = mutually exclusive. One happening means the other can’t happen. PART OF "OR" QUESTIONS. INDEPENDENT = one happening doesn’t affect the probability of the other happening. PART OF "AND" QUESTIONS. (Note: if the group from which individuals are chosen is very large, then one being chosen barely affects the probability that the next one chosen will be of any certain type. In a small set, the first selection does affect the next selection.)
question

In probability, "or" means ________ and "and" means _______.

1. addition (more chance of) 2. multiplication (less chance of)
question

Probability Rule #6: GENERAL ADDITION RULE

P(A or B) = P(A) + P(B) – P(A and B). Think of a Venn diagram with overlapping circles. You subtract the overlapping part because you included it twice, once as part of A and once as part of B. Problems like this can be interpreted as "at least one of two events". You can use the complement rule for them to get the same results, but the general addition rule is easier. The complement rule is best for "at least ___ of many events".
question

P(A or B) How do you solve?

1. Are the events disjoint? 2. If disjoint, use Addition Rule for Disjoint events: P(A)+P(B). 3. If not disjoint, use general addition rule: P(A)+P(B)-P(A and B)
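The decision steps above can be sketched as one small function, since the disjoint rule is just the general rule with P(A and B) = 0. The function name `p_a_or_b` and the card-drawing examples are made up for illustration.

```python
def p_a_or_b(p_a, p_b, p_a_and_b=0.0):
    """General addition rule; with p_a_and_b = 0 it reduces to the disjoint rule."""
    return p_a + p_b - p_a_and_b

# Disjoint: one card from a 52-card deck cannot be both an ace and a king.
p_ace_or_king = p_a_or_b(4 / 52, 4 / 52)

# Not disjoint: a card can be both a heart and a king (the king of hearts).
p_heart_or_king = p_a_or_b(13 / 52, 4 / 52, 1 / 52)

print(p_ace_or_king, p_heart_or_king)
```

In the second case the king of hearts would be counted twice without the subtraction.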
question

How to solve: Two categorical variables each with two possible values

Two way table
question

Notation of conditional probability

P(B|A): the probability of B given A, or the probability of B on the condition that A happened
question

The "definition of conditional probability" formula.

P(B|A) = P(A and B) / P(A)
Similar to how we say something has a 30 out of 100 chance of happening by writing 30/100, to find the probability of B happening given that A has happened, we take the probability of both A and B happening and divide it by the probability of just A happening.
The most common test question for this is "side effect A, side effect B, and both": what is the probability that a patient who has suffered side effect A will also suffer side effect B? That is P(B|A), so we take the chance of A and B and divide it by the chance of A.
A two-way table alone doesn’t answer these problems: if the question is, given that the patient got A, what is the chance he got B, the answer is not simply the "both" figure. You have to take that "both" figure and divide it by the "given side effect" figure. However, it’s very useful to make a two-way table to get the figures to plug into the "definition" formula.
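Here is the side-effect question worked through in Python, using a made-up two-way table of 100 patients. The counts are hypothetical. The point is only to show where each figure in the formula comes from.

```python
# Hypothetical two-way table of counts for 100 patients.
counts = {
    ("A", "B"): 5,            # suffered both side effects
    ("A", "not B"): 15,       # side effect A only
    ("not A", "B"): 10,       # side effect B only
    ("not A", "not B"): 70,   # neither
}
total = sum(counts.values())

p_a = (counts[("A", "B")] + counts[("A", "not B")]) / total   # P(A)
p_a_and_b = counts[("A", "B")] / total                        # P(A and B), the "both" figure

p_b_given_a = p_a_and_b / p_a                                 # P(B|A) = P(A and B) / P(A)

print(p_a, p_a_and_b, p_b_given_a)
```

The "both" figure alone is 5%, but among the patients who got side effect A, about 25% also got B.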
question

Perform an independence check

Events are independent if: Method 1: P(A|B) = P(A) Method 2: P(B|A) = P(B) Method 3: P(B|A) = P(B|not A) Method 4: P(A and B) = P(A) x P(B)
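All four methods should agree, which can be checked on a made-up two-way table. The counts below were chosen so that A and B come out independent. `math.isclose` is used instead of `==` because the probabilities are floating-point numbers.

```python
from math import isclose

# Hypothetical two-way table of counts (100 individuals).
counts = {("A", "B"): 10, ("A", "not B"): 30, ("not A", "B"): 15, ("not A", "not B"): 45}
total = sum(counts.values())

p_a = (counts[("A", "B")] + counts[("A", "not B")]) / total
p_b = (counts[("A", "B")] + counts[("not A", "B")]) / total
p_a_and_b = counts[("A", "B")] / total

p_a_given_b = p_a_and_b / p_b
p_b_given_a = p_a_and_b / p_a
p_b_given_not_a = counts[("not A", "B")] / (counts[("not A", "B")] + counts[("not A", "not B")])

checks = {
    "Method 1: P(A|B) = P(A)": isclose(p_a_given_b, p_a),
    "Method 2: P(B|A) = P(B)": isclose(p_b_given_a, p_b),
    "Method 3: P(B|A) = P(B|not A)": isclose(p_b_given_a, p_b_given_not_a),
    "Method 4: P(A and B) = P(A) x P(B)": isclose(p_a_and_b, p_a * p_b),
}
for name, passed in checks.items():
    print(name, "->", passed)
```

In practice you only need whichever method matches the figures you were given.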
question

Probability Rule #7: (I gave it a number, text did not. Earlier referred to it as a version of rule #5) THE GENERAL MULTIPLICATION RULE

P(A and B) = P(A) x P(B|A)
question

Probability Tree

Draw diagram where possibilities emerge from events. (My words, not the text)
question

When to use a Probability Tree

For scenarios where there are stages or conditional probabilities.
question

Bayes’ Rule or Bayes’ Theorem

P(A|B) = [P(A) x P(B|A)] / [P(A) x P(B|A) + P(not A) x P(B|not A)]
The denominator is P(B), written out using "The Law of Total Probability".
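A short numeric sketch of the rule, with made-up figures: suppose 1% of people have a condition (A), a test comes back positive (B) for 95% of those who have it, and for 5% of those who don't. The function name `bayes` and all three input values are assumptions for illustration.

```python
def bayes(p_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A), and P(B|not A)."""
    p_not_a = 1 - p_a                                   # complement rule
    # Law of Total Probability: P(B) = P(A) x P(B|A) + P(not A) x P(B|not A)
    p_b = p_a * p_b_given_a + p_not_a * p_b_given_not_a
    return (p_a * p_b_given_a) / p_b

# Hypothetical inputs: P(A) = 0.01, P(B|A) = 0.95, P(B|not A) = 0.05
p_condition_given_positive = bayes(0.01, 0.95, 0.05)
print(round(p_condition_given_positive, 3))
```

Even with a positive test, the chance of having the condition is only about 16%, because the condition is rare to begin with.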
question

The "definition" of conditional probability vs. The General Multiplication Rule

Definition: P(B|A) = P(A and B) / P(A). General Multiplication Rule: P(A and B) = P(A) x P(B|A). They are the same equation, rearranged: multiply both sides of the definition by P(A) to get the rule.
question

Linear Regression vs Correlation Coefficient

Linear Regression is finding the line that matches the way the data falls on the scatterplot. (If it’s not linear, then it’s just called regression.) Correlation Coefficient is calculating the strength of the linear relationship. (It can’t tell you IF there’s a linear relationship, though.)
question

The range of the Correlation Coefficient vs. the range of probability