WGU Intro to Probability and Statistics

question
Population
answer
The entire group that is the target of interest, not just people. Eg, "the population of 1 bedroom apartments"
question
Sample
answer
A subgroup of the population. Eg, "the 1 bedroom apartments with dishwashers."
question
Steps in the statistics process
answer
1. PRODUCE DATA (by studying a sample of the population) 2. EXPLORATORY DATA ANALYSIS (Summarize data.) 3. PROBABILITY ANALYSIS (Determine how the sample may differ from the population.) 4. INFERENCE (draw conclusions)
question
Data
answer
pieces of info about individuals organized into variables
question
Individual
answer
a particular person or object
question
Variable
answer
a particular characteristic of the individual
question
Dataset
answer
a set of data identified with particular circumstances. Typically displayed in tables with rows as the individuals and columns as the variables
question
Quantitative vs Categorical/Qualitative variables
answer
Quantitative: Numerical values that represent a measurement. Categorical: Category or label values into which individuals are grouped.
question
Three steps in Exploratory Data Analysis
answer
1. Organize and SUMMARIZE raw data 2. DISCOVER important features and patterns and striking deviations. 3. INTERPRET findings in the context of the problem
question
Examining Distributions
answer
exploring data obtained from one variable at a time
question
Examining Relationships
answer
exploring data obtained from two variables at a time
question
Distribution
answer
what values the variable takes, how often
question
Three types of graphical displays of categorical distributions
answer
1. Pie Charts 2. Bar Charts 3. Pictogram
question
Bins
answer
ranges of data to make charting easier, like a bar chart where each bar shows a range like 70-80%
question
Numerical Summaries
answer
category counts and percentages
question
Four types of Graphical displays of Quantitative Variables
answer
1. Histogram 2. Stemplot 3. Dotplot 4. Boxplot
question
Histogram
answer
Like a bar chart, but the x axis is numerical and in order, split into intervals (bins), and the y axis is the count of observations that fall into each interval. Eg: the x axis is number of hours studied, and the y axis is the number of students falling into each hours-studied interval.
question
4 ways to interpret a histogram
answer
1. Shape - Symmetry/Skewness, Peakedness (Modality) 2. Center - midpoint 3. Spread - approx range covered by all the data 4. Outliers - observations that fall outside the overall pattern
question
Symmetric distributions (on a histogram)
answer
look symmetric. can be multi-peaked, but symmetrical
question
Skewness (on a histogram)
answer
Data is skewed to the right or left because of outliers or a long tail on that side. (Careful: the histogram looks heavy on the side opposite to the direction of the skew. Think of the outliers as pulling a long tail out from the main data, making it not symmetrical.)
question
Peakedness (on a histogram) (three types)
answer
1. Unimodal (single peaked) distribution 2. Bimodal (double peaked) distribution 3. Uniform distribution (no clear peak; all the bars are about the same height)
question
Stemplot (or stem and leaf plot)
answer
1. Write all the "stems" down in a list, in ascending numerical order. (The stem is every digit of a number except the rightmost digit. Eg: for the dataset 34 35 36 347 367 the stems are 3, 3, 3, 34, 36, but each identical stem is written only once, so the list is 3, 34, 36.) 2. Draw a vertical line to the right of the list. 3. Write each leaf (the rightmost digit) next to its stem, and arrange the leaves in increasing order.
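The three steps above can be sketched in Python. This is only an illustration, assuming non-negative whole numbers; the function name `stemplot` is made up for this example, and stems with no leaves are not listed.

```python
from collections import defaultdict

def stemplot(values):
    """Return the rows of a stemplot as strings, e.g. '3 | 4 5 6'."""
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):
        # stem = every digit except the rightmost; leaf = the rightmost digit
        stem, leaf = divmod(v, 10)
        leaves_by_stem[stem].append(leaf)
    rows = []
    for stem in sorted(leaves_by_stem):
        leaves = " ".join(str(leaf) for leaf in leaves_by_stem[stem])
        rows.append(f"{stem} | {leaves}")
    return rows

for row in stemplot([34, 35, 36, 347, 367]):
    print(row)
# 3 | 4 5 6
# 34 | 7
# 36 | 7
```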
question
two Virtues of a stemplot
answer
1. preserves the data while sorting it 2. when rotated looks like a histogram
question
Dotplot
answer
a stemplot with dots instead of leaves
question
Boxplot
answer
Shows the "five number spread": Min, Q1, Median, Q3, Max. The y axis is the range of the data. The drawn box is the interquartile range. Points mark outliers, the minimum and the maximum. Most useful for showing side by side comparisons.
question
The Five Number Spread
answer
1. Minimum (lower limit) 2. Q1 = 25th percentile 3. Median = 50th percentile 4. Q3 = 75th percentile 5. Maximum (upper limit). The four stretches between them (Min through Q1, Q1 through Median, Median through Q3, Q3 through Max) each hold about 25% of the data.
question
3 Measures of Center
answer
1. Mode - the value most often found (not sensitive to outliers) 2. Median - the center value (or average of the two center values) (not sensitive to outliers) 3. Mean - the average (sensitive to outliers)
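A quick way to see the sensitivity to outliers is to compute all three measures with and without one extreme value. This sketch uses Python's standard `statistics` module; the data values and the helper name `centers` are made up for illustration.

```python
import statistics

def centers(values):
    """Return (mode, median, mean) for a list of numbers."""
    return statistics.mode(values), statistics.median(values), statistics.mean(values)

data = [2, 3, 3, 4, 5]
print(centers(data))           # (3, 3, 3.4)
print(centers(data + [100]))   # (3, 3.5, 19.5)
```

Adding the single value 100 barely moves the mode and median, but drags the mean from 3.4 up to 19.5.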
question
3 measures of spread
answer
1. Range - the distance between max and minimum values 2. Inter-Quartile Range - the range of the middle 50% 3. Standard Deviation - roughly how far the observations typically are from their mean. (Eg: the mean may be 9, but the observations are typically about 4 away from 9.)
question
Calculate range
answer
max - min
question
Calculate inter-quartile range
answer
1. Find median (by arranging data in increasing order) 2. Find median of bottom 50% (Q1, "The first quartile") 3. Find median of top 50% (Q3, "The third quartile") 4. Q3-Q1=IQR
question
The 1.5(IQR) criterion for outliers
answer
1. Lower cutoff: Q1-1.5(IQR) 2. Upper cutoff: Q3+1.5(IQR) 3. Any datapoints below the lower cutoff or above the upper cutoff are possible outliers.
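The IQR steps and the 1.5(IQR) criterion can be sketched together. This assumes the "median of each half" method from the IQR card, and that the middle value is left out of both halves when the number of observations is odd (textbooks differ on this detail). The function names and the data are made up for illustration.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q3) as the medians of the bottom and top halves."""
    data = sorted(values)
    half = len(data) // 2   # size of each half; the middle value is skipped when n is odd
    q1 = statistics.median(data[:half])
    q3 = statistics.median(data[len(data) - half:])
    return q1, q3

def possible_outliers(values):
    """Return the values outside Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_cutoff = q1 - 1.5 * iqr
    high_cutoff = q3 + 1.5 * iqr
    return [x for x in values if x < low_cutoff or x > high_cutoff]

scores = [1, 3, 4, 5, 6, 7, 8, 30]
print(quartiles(scores))          # (3.5, 7.5)
print(possible_outliers(scores))  # [30]
```

Here IQR = 7.5 - 3.5 = 4, so the cutoffs are 3.5 - 6 = -2.5 and 7.5 + 6 = 13.5, and only 30 falls outside them.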
question
Outliers - when to keep, when to discard?
answer
1. Keep if could happen again, produced by essentially same process. 2. Discard if produced by a different process and your purpose is to understand the process which produced most of the data. 3. Discard if produced by an error or typo that cannot be fixed.
question
Notations for Standard Deviation
answer
SD, s, Sd, St Dev
question
Calculate Standard Deviation
answer
1. Find the mean 2. Find the deviations (distances between each observation and the mean) 3. Square each deviation 4. Add up the squared deviations and divide by the number of observations minus 1 5. Find the square root of the result EXPLANATION We can't average the deviations themselves because they add up to zero. The reason we divide the sum of the squared deviations by n - 1 instead of n is beyond the scope of this course to explain. The result of step 4 (before taking the square root) is called the variance of the data.
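The five steps translate almost line for line into code. This is a minimal sketch; the function name and the three data values are made up for illustration.

```python
import math

def sample_standard_deviation(values):
    n = len(values)
    mean = sum(values) / n                       # 1. find the mean
    deviations = [x - mean for x in values]      # 2. distance of each observation from the mean
    squared = [d ** 2 for d in deviations]       # 3. square each deviation
    variance = sum(squared) / (n - 1)            # 4. add up the squares, divide by n - 1
    return math.sqrt(variance)                   # 5. square root of the result

print(sample_standard_deviation([5, 9, 13]))   # 4.0
```

For 5, 9, 13 the mean is 9, the deviations are -4, 0, 4, the squares add up to 32, the variance is 32 / 2 = 16, and the standard deviation is 4.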
question
Is the "standard deviation" or "variance of the data" influenced by outliers?
answer
yes, strongly
question
The "standard deviation rule"
answer
For roughly bell-shaped (normal) distributions: Approx 68% of observations fall within 1 standard deviation of the mean. Approx 95% of observations fall within 2 standard deviations of the mean. Approx 99.7% of observations fall within 3 standard deviations of the mean. (3 standard deviations = the standard deviation x 3)
question
Notation for mean
answer
an x with a line over it
question
Choose between using mean and standard deviation versus the five number summary
answer
1. use mean and SD for relatively symmetrical distributions with no outliers 2. use five number summary for all others
question
Steps to choose which data display and numerical summary is best
answer
1. Identify the explanatory/independent variable (x) and the response/dependent variable (y) 2. Is the explanatory variable categorical or quantitative? 3. Is the response variable categorical or quantitative? 4. Notate it C-C, C-Q, Q-C, or Q-Q 5. Select approach based on above
question
Select data display and numerical summary approach for case C-C, C-Q, Q-C, or Q-Q
answer
1. Case C-C: Two way table or double bar chart using conditional percents. 2. Case C-Q: Box plots and five number spread 3. Case Q-C: Not covered in the text 4. Case Q-Q: Scatterplot (explanatory on x, response on y) or labelled scatterplot
question
Correlation Coefficient
answer
Measures the strength and direction of a linear relationship between two quantitative variables. Does not tell you IF a relationship is linear. A curvilinear relationship may or may not have a linear component. The correlation coefficient tells you the strength of the linear relationship only, not the curvilinear relationship.
question
Notation of the correlation coefficient
answer
r
question
Correlation Coefficient and Outliers
answer
Outliers strongly affect the r-value, so the CC should only be used after seeing the scatterplot.
question
Range of values in the correlation coefficient
answer
-1 to 1 -1 is the strongest negative linear relationship +1 is the strongest positive linear relationship Close to zero is a weaker linear relationship
question
Regression and Linear Regression
answer
The technique that specifies the dependence of the response variable on the explanatory variable. If it's a linear dependence, then it's linear regression. It's finding the line that best fits the pattern of the linear relationship.
question
Calculate linear regression or the "least squares regression line"
answer
1. y = a + bx 2. b = r(Sy/Sx) 3. a = (y with line over it) - b(x with line over it) Key: r = the correlation coefficient; Sx = standard deviation of the explanatory variable's values; Sy = standard deviation of the response variable's values; x with line over it = the mean of the explanatory variable's values; y with line over it = the mean of the response variable's values. EXPLANATION Find the slope of the "least squares regression line". In the standard line equation, y = a + bx, the slope b is the change in y when x changes by 1. In the same way, the slope of the "least squares regression line" is the average change in the response variable when the explanatory variable increases by 1 unit. It's called the "least squares regression line" because it's the line which results in the smallest sum of squared vertical deviations.
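The slope and intercept formulas can be checked with a small sketch. The text does not give a formula for r, so the line that computes r below uses the standard formula for the correlation coefficient as an assumption; the function name and the data (hours studied vs. score) are made up for illustration.

```python
import statistics

def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + bx."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    # correlation coefficient r (standard formula, not taken from the text)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * (s_y / s_x)       # slope: b = r(Sy/Sx)
    a = y_bar - b * x_bar     # intercept: a = y-bar - b(x-bar)
    return a, b

hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
a, b = least_squares_line(hours, scores)
print(f"y = {a:.2f} + {b:.2f}x")   # y = 2.20 + 0.60x
```

With this made-up data, each extra hour studied goes with an average increase of 0.6 in the score.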
question
Line
answer
a set of points that obey a particular relationship between x and y
question
Equation of the Line (Algebra Review)
answer
Y=a+bX a=the y-intercept, or the value that y takes when x =zero b=the slope, or the change in y when x changes by 1
question
Extrapolation
answer
prediction for ranges of the explanatory variable that are not in the data
question
Causation and lurking variables
answer
Association does not imply causation. Lurking variables are not among the variables in a study but could substantially affect your interpretation of the relationship among those variables.
question
Simpson's Paradox
answer
Whenever a lurking variable causes us to rethink the direction of an association
question
Correlation between quantitative variables vs. correlation between category variables
answer
There can only be correlation between quantitative variables, not category variables
question
10 sampling types and terms
answer
(Remember S,V,V,C,S, S,P,C,M,S) "Some very very cute samples. Some pleasing, cute, magnificent samples." 1. Sampling Frame 2.Volunteer Sample 3. Volunteer Response 4. Convenience Sample 5. Systematic Sampling 6. Simple Random Sample 7. Probability Sampling Plan/Technique 8. Cluster Sampling 9. Multi-Stage Sampling 10. Stratified Sampling
question
Sampling Frame
answer
The list of individuals from which the sample is actually drawn. The study should be designed so that the sampling frame matches the entire population being studied.
question
Probability Sampling Plan/Technique
answer
Any sampling plan or technique that relies on random selection
question
Volunteer Sample and Volunteer Response
answer
1. Participants include themselves in the study. Biased because only people with strong opinions volunteer, but sometimes it's the only ethical method. (Eg medical) 2.Participants are not required to respond. Biased because you don't hear from those not interested in responding.
question
Convenience Sample
answer
Individuals happen to be there at researcher's convenience, like standing outside the arts building to catch students to question.
question
Cluster, Multi-Stage, and Stratified sampling.
answer
CLUSTER: Select a random sample of natural clusters (5 out of 40 majors) and use all the individuals within the selected clusters (all the students with those 5 majors). MULTI-STAGE: Select a random sample of clusters (5 out of 40 majors) and select random individuals within each selected cluster (random students within the five majors). STRATIFIED: Use all the clusters/strata (all 40 majors). Randomly select individuals from each of the strata. (Random students within all 40 majors.)
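The difference between the three plans is easiest to see side by side. This is a rough sketch using the 40-majors example; the function names, the made-up roster of 10 students per major, and the sample sizes are all assumptions for illustration.

```python
import random

def cluster_sample(groups, num_clusters, rng):
    """Randomly pick clusters, then keep EVERY individual in them."""
    chosen = rng.sample(sorted(groups), num_clusters)
    return [person for g in chosen for person in groups[g]]

def multistage_sample(groups, num_clusters, per_cluster, rng):
    """Randomly pick clusters, then randomly pick individuals inside each."""
    chosen = rng.sample(sorted(groups), num_clusters)
    return [person for g in chosen for person in rng.sample(groups[g], per_cluster)]

def stratified_sample(groups, per_stratum, rng):
    """Use ALL groups (strata), randomly picking individuals from each."""
    return [person for g in sorted(groups) for person in rng.sample(groups[g], per_stratum)]

# Made-up roster: 40 majors with 10 students each
majors = {f"major{m}": [f"major{m}-student{s}" for s in range(10)] for m in range(40)}
rng = random.Random(1)   # fixed seed so the run is repeatable

print(len(cluster_sample(majors, 5, rng)))        # 50 (5 majors x all 10 students)
print(len(multistage_sample(majors, 5, 3, rng)))  # 15 (5 majors x 3 students)
print(len(stratified_sample(majors, 2, rng)))     # 80 (40 majors x 2 students)
```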
question
Systematic Sampling
answer
eg: Send to every 50th address. (Would exclude siblings because same last name. Might have other effects that need to be thought of depending on the system.)
question
Simple Random Sample
answer
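Select names out of a hat. Every individual, and every possible group of a given size, has the same chance of being chosen, so the selection method itself introduces no bias.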
question
3 Types of studies
answer
1. Observational - no interference 2. Experiment - Researchers control inputs 3. Sample Survey - individuals report (A study can't be both observational and experimental)
question
Prospective vs retrospective studies
answer
forward vs backward in time
question
Factor
answer
the explanatory variable in a study
question
Treatments
answer
Imposed values of the explanatory variable in a study. (Four quitting smoking techniques.)
question
Randomized Controlled Experiment - what is it and can you draw causal conclusions from it?
answer
Researchers control value of explanatory variable with a randomized procedure. (Subjects are randomly assigned to different treatments.) Can draw causal conclusions from this kind of study.
question
Notation of sample size
answer
n (the number of individuals in the sample)
question
Causal Conclusions (when can you draw them?)
answer
You can draw causal conclusions if the researchers randomly assigned the values of the explanatory variable to individuals.
question
Control Group
answer
Segment of studied individuals who didn't receive the treatment (they may get a placebo, such as a sugar pill, instead). Not always necessary, and sometimes ethically questionable.
question
"Blind" and "Double Blind"
answer
Blind - participants don't know what they're getting Double Blind - researchers and participants don't know who is getting what. Prevents "experimenter effect"
question
Experimenter Effect
answer
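When the researchers' expectations unintentionally influence how participants respond or how results are measured. Prevented by double blind studies.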
question
Hawthorne Effect
answer
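People in a study behave differently because they know they are being watched. (Not the same as lack of realism, also called lack of ecological validity, which is when the study setting is too artificial for the results to carry over to real life.)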
question
noncompliance
answer
when study participants don't do what they are asked to do which skews the data
question
Blocking
answer
Not imposing complete randomization in a study, but first dividing individuals into groups (blocks) of similar individuals, like male and female, and then randomly assigning treatments within each block.
question
Matched Pairs
answer
1 individual in a study gets 2 treatments or 2 similar individuals get 2 treatments
question
Open vs Closed Questions on a survey
answer
What is your favorite kind of food vs. Which of these five foods is your favorite?
question
6 types of survey questions to be aware of
answer
1. Open vs. Closed questions 2. Unbalanced response options 3. Leading questions 4. Planting ideas with questions 5. complicated questions 6. sensitive questions
question
Leading questions vs. planting ideas with questions
answer
Leading question: "how long have you been beating your wife?" Planting ideas with questions: "Given the huge deficit, are you in favor of universal health care?"
question
Probability Notation
answer
P(it will rain) or P(it will not rain) P(A) or P(not A) P(B), P(C) and so on
question
Probability Rule #1: MEASUREMENT OF PROBABILITY (Made up term for memory tool. No title given to the rule in the text.)
answer
Between 0 and 1 (which means between 0% and 100% chance). So if the solution is above 1 or below 0, it's wrong.
question
Theoretical (Classical) vs. Empirical (Observational) Probability
answer
Theoretical (Classical): flipping a coin, rolling dice. Probabilities can be worked out from the nature of the situation. Empirical (Observational): probabilities are estimated by observing a series of trials, because the outcomes can't be predicted from the nature of the situation.
question
Relative Frequency
answer
The probability of an event is the relative frequency occurring in a series of trials. Relative Frequency of event A = number of times A occurred / total number of repetitions
question
Law of large numbers
answer
As the number of trials increases the empirical probability gets closer and closer to the theoretical probability
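A simulated coin flip shows the idea: the relative frequency of heads (empirical probability) drifts toward the theoretical 0.5 as the number of flips grows. This is a sketch only; the function name and the seed are made up, and the exact numbers printed depend on the random number generator.

```python
import random

def relative_frequency_of_heads(num_flips, seed=0):
    """Flip a simulated fair coin num_flips times; return heads / flips."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

for n in (10, 100, 10_000, 1_000_000):
    print(n, relative_frequency_of_heads(n))
```

With only 10 flips the result can be far from 0.5; with a million flips it is typically within a fraction of a percent.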
question
Sample Space vs. "Possible Outcomes for the Event"
answer
Sample Space: The list of all possible outcomes Possible Outcomes for the Event: outcomes which match the "event" being looked for
question
The complement of event A is
answer
"not A", the event that A does not occur. Its probability is written P(not A).
question
Venn Diagram
answer
Overlapping circles to help visualize relationships between probabilities of events
question
Disjoint
answer
mutually exclusive
question
Probability Rule #2 SUM OF PROBABILITIES (Made up term for memory tool. The rule was given no title in the text.)
answer
P(S)=1 The sum of the probabilities of all possible outcomes is 1
question
Probability Rule #3: THE COMPLEMENT RULE
answer
P(not A) = 1 - P(A) or P(A) = 1 - P(not A) The probability that an event does not occur is 1 minus the probability that it does occur or vice versa. This makes sense when you remember that the sum of all the probabilities is 1. So the likelihood of something not happening is 1 minus the likelihood of it happening. Often, it is easier to find the complement, which is why we can use this formula either way. Use for problems like, "At least one of several events occurs"
question
Probability Rule #4: THE ADDITION RULE FOR DISJOINT EVENTS
answer
If A and B are disjoint events, then P(A or B) = P(A) + P(B). In other words, in probability, "or" means "+" as long as the events are disjoint. (If they overlap, use the general addition rule.)
question
Probability Rule #5: THE MULTIPLICATION RULE FOR INDEPENDENT EVENTS
answer
If A and B are independent events, then P(A and B) = P(A) x P(B). In other words, in probability, "and" means "x" (multiply) as long as the events are independent. (This may seem counterintuitive because you're expecting that multiplying will make a larger number, but actually you're always multiplying numbers between 0 and 1, so it makes a smaller result.)
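Together with the complement rule, the multiplication rule handles "at least one" problems. The example below is made up for illustration: the chance of at least one six in four rolls of a fair die, assuming the rolls are independent. `Fraction` is used so the arithmetic stays exact.

```python
from fractions import Fraction

def at_least_once(p_event, num_trials):
    """P(event happens at least once in num_trials independent trials)."""
    p_not_event = 1 - p_event                 # complement rule
    p_never = p_not_event ** num_trials       # multiplication rule: "not" AND "not" AND ...
    return 1 - p_never                        # complement rule again

p = at_least_once(Fraction(1, 6), 4)
print(p)           # 671/1296
print(float(p))    # about 0.518
```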
question
Independent vs Disjoint events
answer
IF TWO EVENTS ARE DISJOINT (and each has some chance of happening), THEY CAN'T BE INDEPENDENT. There can be all other combos of the two. DISJOINT = mutually exclusive. One happening means the other can't happen. PART OF "OR" QUESTIONS. INDEPENDENT = one happening doesn't affect the probability of the other happening. PART OF "AND" QUESTIONS (Note: if the group from which individuals are chosen is very large, then one being chosen does not noticeably affect the probability that the next one chosen will be any certain type. In a small set, the first selection does affect the next selection.)
question
In probability, "or" means ________ and "and" means _______.
answer
1.addition (more chance of) 2.multiplication (less chance of)
question
Probability Rule #6: GENERAL ADDITION RULE
answer
P(A or B) = P(A) + P(B) - P(A and B) Think of a Venn diagram with overlapping circles. You subtract the overlapping part because you included it twice, once as part of A and once as part of B. Problems like this can be interpreted as "at least one of two events". Indeed you can use the complement rule for them to get the same results, but the general addition rule is easier. The complement rule is best for "at least ___ of many events".
question
P(A or B) How do you solve?
answer
1. Are the events disjoint? 2. If disjoint, use Addition Rule for Disjoint events: P(A)+P(B). 3. If not disjoint, use general addition rule: P(A)+P(B)-P(A and B)
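The decision steps above fit in one small function, because the disjoint rule is just the general rule with P(A and B) = 0. The function name and the playing-card numbers are made up for illustration.

```python
from fractions import Fraction

def p_a_or_b(p_a, p_b, p_a_and_b=0):
    """General addition rule; leave p_a_and_b at 0 when A and B are disjoint."""
    return p_a + p_b - p_a_and_b

# Not disjoint: a card can be both a heart and a king (the king of hearts)
print(p_a_or_b(Fraction(13, 52), Fraction(4, 52), Fraction(1, 52)))   # 4/13

# Disjoint: a card can't be both a heart and a spade
print(p_a_or_b(Fraction(13, 52), Fraction(13, 52)))                   # 1/2
```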
question
How to solve: Two categorical variables each with two possible values
answer
Two way table
question
Notation of conditional probability
answer
P(B|A) Probability of B, given A or Probability of B on the condition that A happened
question
The "definition of conditional probability" formula.
answer
P(B|A) = P(A and B) / P(A) Similar to how we say something has a 30 out of 100 chance of happening by saying 30/100, to find the probability of B happening given that A has happened, we take the probability of A and B happening and divide it by the probability of just A happening. Most common test question for this is "Side effect A, Side effect B, and both". What is the probability that the patient who has suffered side effect A will also suffer side effect B? That is P(B|A): we take the chance of A and B and divide it by the chance of just A. You might think you can read the answer straight off a two way table, but if the question is, given that the patient got A, what is the chance he got B, then it's not a simple matter of using the given info for the chance of getting both at the same time. You have to take that "both" figure and divide it by the "given side effect" figure. However it's very useful to make a two way table to get the figures to plug into the "definition" formula.
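A worked version of the side-effect question, dividing the "both" figure by the "given" figure. The percentages (20% of patients get side effect A, 5% get both A and B) are made up for illustration, as is the function name.

```python
from fractions import Fraction

def conditional_probability(p_a_and_b, p_a):
    """P(B|A) = P(A and B) / P(A)."""
    return p_a_and_b / p_a

p_a = Fraction(20, 100)        # made-up figure: 20% suffer side effect A
p_a_and_b = Fraction(5, 100)   # made-up figure: 5% suffer both A and B

print(conditional_probability(p_a_and_b, p_a))   # 1/4
```

So among patients who got side effect A, 25% also got B, even though only 5% of all patients got both.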
question
Perform an independence check
answer
Events are independent if: Method 1: P(A|B) = P(A) Method 2: P(B|A) = P(B) Method 3: P(B|A) = P(B|not A) Method 4: P(A and B) = P(A) x P(B)
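The four methods always agree with each other (as long as none of the probabilities involved is 0 or 1), which a small sketch can confirm. The counts below are made-up two way table figures, and the function name is hypothetical.

```python
from fractions import Fraction

def independence_checks(n_a_and_b, n_a, n_b, n_total):
    """Run all four checks from counts in a two way table."""
    p_a = Fraction(n_a, n_total)
    p_b = Fraction(n_b, n_total)
    p_a_and_b = Fraction(n_a_and_b, n_total)
    p_a_given_b = p_a_and_b / p_b
    p_b_given_a = p_a_and_b / p_a
    p_b_given_not_a = (p_b - p_a_and_b) / (1 - p_a)
    return {
        "Method 1": p_a_given_b == p_a,
        "Method 2": p_b_given_a == p_b,
        "Method 3": p_b_given_a == p_b_given_not_a,
        "Method 4": p_a_and_b == p_a * p_b,
    }

# 100 individuals: 40 in A, 50 in B, 20 in both -> independent
print(independence_checks(20, 40, 50, 100))
# Same totals but 30 in both -> not independent
print(independence_checks(30, 40, 50, 100))
```

The first call reports True for every method and the second reports False for every method.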
question
Probability Rule #7: (I gave it a number, text did not. Earlier referred to it as a version of rule #5) THE GENERAL MULTIPLICATION RULE
answer
P(A and B) = P(A) x P(B|A)
question
Probability Tree
answer
Draw diagram where possibilities emerge from events. (My words, not the text)
question
When to use a Probability Tree
answer
For scenarios where there are stages or conditional probabilities.
question
Bayes' Rule or Bayes' Theorem
answer
P(A|B) = [P(A) x P(B|A)] / [P(A) x P(B|A) + P(not A) x P(B|not A)] The top is P(A and B) from the general multiplication rule. The bottom is P(B), split up using "The Law of Total Probability".
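A worked example of Bayes' rule in the form P(A|B) = P(A) x P(B|A) / [P(A) x P(B|A) + P(not A) x P(B|not A)]. The numbers are made up for illustration: A = "has the condition" (1% of people), B = "tests positive" (90% of the time with the condition, 5% of the time without it).

```python
from fractions import Fraction

def bayes(p_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A)."""
    p_not_a = 1 - p_a                                    # complement rule
    p_a_and_b = p_a * p_b_given_a                        # general multiplication rule
    p_b = p_a_and_b + p_not_a * p_b_given_not_a          # total probability of B
    return p_a_and_b / p_b

p = bayes(Fraction(1, 100), Fraction(9, 10), Fraction(1, 20))
print(p)          # 2/13
print(float(p))   # about 0.154
```

Even with a positive test, the chance of having the condition is only about 15%, because the condition is rare to begin with.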
question
The "definition" of conditional probability vs. The General Multiplication Rule
answer
Definition: P(B|A) = P(A and B)/P(A) General Multiplication Rule: P(A and B) = P(A) x P(B|A) See how they are the same equation?
question
Linear Regression vs Correlation Coefficient
answer
Linear Regression is finding the line that matches the way the data falls on the scatterplot. (If it's not linear then it's just called regression.) Correlation Coefficient is calculating the strength of the linear relationship. (Can't tell you IF there's a linear relationship though.)
question
The range of the Correlation Coefficient vs. the range of probability
answer
Range of Correlation Coefficient is -1 to 1. Close to zero is a weaker linear relationship. Range of probability is 0-1, which can be translated into 0-100% chance.
question
Calculate the Correlation Coefficient
answer
Text says you don't need to know the formula. (It has lots of symbols I don't know.) But r is part of calculating the linear regression: it goes into the slope formula, b = r(Sy/Sx).
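For reference only, here is a sketch of the standard formula for r, which is not taken from the text: convert each x and y to a z-score (how many standard deviations it is from its mean), multiply the pairs, add them up, and divide by n - 1. The function name and data are made up for illustration.

```python
import math

def correlation_coefficient(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # product of the two z-scores for each (x, y) pair
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

r = correlation_coefficient([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r, 3))   # 0.775
```

The result always lands between -1 and 1; points that fall exactly on a rising line give 1, and points exactly on a falling line give -1.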