Tests & Measurements Chap. 3-5 – Flashcards
question
Test Scores
answer
mathematical representation of an examinee's performance
question
Raw scores: number of items scored in a specific manner
answer
to give raw scores more meaning, we need to transform them into standard scores
question
Standard scores
answer
norm-referenced OR criterion-referenced
question
Norm-referenced interpretations: examinee's performance is compared to that of other people (most psych. tests are norm-referenced)
answer
-norms: average scores of an identified group of individuals -norm-based interpretation: process of comparing an individual's test score to a norm group
question
Standardized samples should be representative of the type of individuals expected to take the test
answer
Developing normative data: define population, select random sample and test it
question
National standardization sample obtained through stratified random sampling, in the U.S. samples stratified based on gender, age, ethnicity, etc. (must exceed 1,000 participants)
answer
once standardization sample is selected, normative tables or norms are developed
question
Nationally representative samples are common
answer
other samples are available for some tests like local norms and clinical norms
question
Standardized administration: test should be administered under the same conditions and same administrative procedures
answer
-standard scores: raw scores are transformed to another unit of measurement -use SD units to indicate where an examinee's score is located relative to the mean of the distribution
question
There are several standard score formats (for transforming raw scores into standard scores): z-scores (M=0, SD=1), T-scores (M=50, SD=10), IQs (M=100, SD=15)
answer
standard scores can be set to any desired M and SD (with the fancy of the test author frequently being the sole determining factor)
question
Z-scores (+ is above mean, and - is below mean): z=(X-M)/SD
answer
z score to raw score: X=(Z)(SD)+M
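A minimal sketch of these conversions in Python; the raw score, mean, and SD are hypothetical illustration values, and the T-score and IQ metrics come from the formats card above:

```python
# z = (X - M) / SD converts a raw score to a z-score;
# X = (z)(SD) + M back-transforms a z-score to any desired score metric.

def to_z(raw, mean, sd):
    return (raw - mean) / sd

def from_z(z, mean, sd):
    return z * sd + mean

raw, M, SD = 62, 50, 8                 # hypothetical raw-score distribution
z = to_z(raw, M, SD)                   # 1.5 -> 1.5 SDs above the mean
t_score = from_z(z, mean=50, sd=10)    # T-score metric -> 65.0
iq = from_z(z, mean=100, sd=15)        # IQ metric -> 122.5
print(z, t_score, iq)
```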
question
Disadvantages of z scores: difficult to interpret
answer
half of the z scores in a distribution will be negative, carry decimal places, few test publishers routinely report z-scores
question
Percentile rank: reflects the percentage of people scoring below a given point (so a percentile rank of 20 indicates that only 20% of individuals scored below this point)
answer
-range from 1 to 99 (a rank of 50 indicates the median score) -percentile rank is not the same as percentage correct: a percentile rank of 60 means the examinee scored better than 60% of the sample, NOT that the examinee correctly answered 60% of the questions -percentile (not percentile rank): the point in a distribution at or below which a specified percentage of scores fall (so the 60th percentile at 104 indicates that 60% of scores are 104 or below)
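A minimal sketch of the percentile-rank calculation against a hypothetical norm group; note that it counts scores below the examinee's score, not items answered correctly:

```python
# Percentile rank = percentage of the norm group scoring below a given score.

def percentile_rank(score, norm_scores):
    below = sum(1 for s in norm_scores if s < score)
    return 100 * below / len(norm_scores)

norm_group = [88, 92, 95, 97, 100, 102, 104, 104, 110, 115]  # hypothetical
print(percentile_rank(104, norm_group))  # 60.0 -> beat 60% of the norm group
```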
question
Quartile scores: lower 25%=1, 26 to 50%=2, 51 to 75%=3, upper 25%=4
answer
Stanine: not as common as percentiles, expressed in whole numbers from 1 to 9, with 4, 5, and 6 being considered average
question
Criterion-referenced score interpretations: the examinee's performance is compared to a specified level of performance
answer
-criterion-referenced interpretations are absolute: compared to an absolute standard -often used in educational settings
question
Examples of criterion-referenced interpretations:
answer
-percentage correct (i.e. 85% on a classroom test) -mastery testing: a cut score is established (pass/fail driver's license) -standards-based interpretations: involves 3 to 5 performance categories (i.e. assigned "A" to reflect superior work)
question
The terms norm-referenced and criterion-referenced apply to score interpretations
answer
NOT tests!
question
Norm-referenced interpretations can be applied to both maximum performance and typical response tests
answer
Criterion-referenced interpretations are typically applied only to maximum performance tests
question
Item Response Theory Scores (Rasch/IRT-scores, Change Sensitive Scores (or CSS)): fundamental for computer adaptive testing
answer
-theory holds that responses to items on a test are accounted for by latent traits -latent trait: inferred to exist based on theory and evidence of its existence -intelligence is a latent trait
question
IRT Scores cont'd: each examinee possesses a certain amount of intelligence
answer
-IRT describes how examinees at different levels of ability will respond to individual test items -the specific ability level of an examinee is defined as the level at which the examinee can answer half of the items correctly -IRT scores can be transformed to either norm- or criterion-referenced scores
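A sketch of a one-parameter (Rasch) item response function, assuming the standard logistic form; the probability of success is exactly .50 when ability equals item difficulty, which matches the definition of ability level above. All values are hypothetical:

```python
import math

# Rasch model: P(correct) = 1 / (1 + e^-(theta - b)),
# where theta = examinee ability and b = item difficulty.

def p_correct(theta, b):
    return 1 / (1 + math.exp(-(theta - b)))

for b in (-1.0, 0.0, 1.0):                  # easy, medium, hard items
    print(b, round(p_correct(0.0, b), 2))
# -1.0 0.73  (easy item: high chance of success)
#  0.0 0.5   (difficulty matches ability exactly)
#  1.0 0.27  (hard item: low chance of success)
```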
question
Qualitative descriptions of test scores: helps communicate test results (i.e. IQs 145 and above=very gifted, IQs 90-109=average)
answer
-Test manuals should provide information on: normative samples (type of sample like national, size of sample, how well it matched U.S. population) and test scores (type of scores provided like T-score, how to transform raw scores, information on confidence intervals)
question
Reliability refers to the: consistency, accuracy, or stability of test scores
answer
Factors that may affect reliability: time test was administered, items included, external distractions, internal distractions, person grading the test
question
Measurement Error: error is present in all measurement
answer
even in physics, measurement error can be reduced but not eliminated
question
Classical Test Theory (or CTT) is the most influential theory to help us understand measurement issues (Charles Spearman in the early 1900s):
answer
-holds that every score has two components: true score that reflects the examinee's true skills AND error score which is the unexplained difference between a person's actual score on a test and that person's true score
question
Xi = T + E
Xi = Obtained or observed score
T = True score
E = Random measurement error
answer
Random measurement error varies from: -person to person -test to test -administration to administration
question
True score cannot be directly measured: it is a theoretical reflection of the actual amount of the trait, so all we ever see is an observed score
answer
Measurement error: -Random -Systematic
question
Random measurement error is the result of chance factors
answer
-It can increase or decrease an individual's observed score -It reduces: the usefulness of measurement, ability to generalize, confidence in test results -Random error lowers the reliability of test results: if errors are responsible for much of the variability, test scores will be inconsistent; if errors have little effect on test scores, the test reflects mainly consistent aspects of performance
question
Systematic measurement error: increases or decreases the true score by same amount each time (E.g., scale that adds 2 pounds, social desirability)
answer
-Does not lower reliability: the test is reliably inaccurate by the same amount each time -It is difficult to identify -It is not considered in reliability analysis
question
Measurement errors are random: Equally likely to be positive or negative, over an infinite number of testings the error will increase and decrease a person's score by the same amount, and errors will tend to average zero
answer
-Making a test longer also reduces the influence of random error for the same reason -Error is normally distributed -Reduce the error and reliability increases -The test developer's job is to reduce the sources of error as much as possible
question
Sources of measurement error: tests rarely include every possible question
answer
-Content sampling error (considered the largest source of measurement error): differences between the sample of items on the test and the total domain of items (all possible items); if the items are a good sample of the domain, content error will be small -Time sampling error (temporal stability): random fluctuations in performance over time, including changes in the examinee (e.g., fatigue) and the environment (e.g., distractions) -Inter-rater differences: when scoring is subjective -Errors in administration -Clerical errors
question
Reliability coefficients: CTT: Xi = T + E, extended to incorporate the concept of variance: σ²X = σ²T + σ²E
answer
σ²X = Observed score variance
σ²T = True score variance
σ²E = Error score variance
question
General symbol for reliability is rxx: rxx = σ²T / σ²X (reliability coefficients range from 0 to 1)
answer
-Reliability is the ratio of true score variance to total score variance -Reliability is the proportion of test score variance due to true score variance
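A small simulation (hypothetical parameters) of this ratio: observed scores are built as true score plus random error, and the resulting variance ratio approximates the reliability:

```python
import random

# Simulate CTT: X = T + E. With var(T) = 15^2 = 225 and var(E) = 5^2 = 25,
# the expected reliability is 225 / (225 + 25) = .90.

random.seed(1)
n = 100_000
true_scores = [random.gauss(100, 15) for _ in range(n)]
observed = [t + random.gauss(0, 5) for t in true_scores]

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(round(var(true_scores) / var(observed), 3))   # ~0.90
```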
question
Reliability coefficients are correlation coefficients: reflect the proportion of test score variance attributable to true score variance
answer
-so rxx = .90 indicates that 90% of the score variance is due to true score variance -there are different ways to obtain the scores that are correlated
question
Psychologist use different methods for checking reliability:
answer
-Test-retest reliability -Alternate forms -Internal consistency -Inter-rater agreement
question
Test-Retest Reliability: administer the same test on two occasions, correlate the scores from both administrations, primarily reflects time sampling error
answer
-reflects the degree to which test scores can be generalized to different situations or over time -important to consider the length of the interval between testings -the optimal interval is determined by the way test results are used (e.g., intelligence vs. mood) -Carry-over effects -Practice and memory effects -Characteristics of the attribute may change with time; also time-consuming and expensive
question
Procedure: test-retest
answer
-Administering a test to a group of individuals -Re-administering the same test at a later time -Compute the correlation between both scores, should be above .70
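A minimal sketch of the procedure with hypothetical scores:

```python
import numpy as np

# Test-retest reliability: correlate scores from two administrations of the
# same test to the same group of examinees.

time1 = [85, 90, 78, 92, 88, 75, 95, 82]   # first administration
time2 = [83, 91, 80, 90, 85, 78, 96, 84]   # later re-administration

r_tt = np.corrcoef(time1, time2)[0, 1]     # stability coefficient
print(round(r_tt, 3))                      # generally want > .70
```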
question
Alternate-Form Reliability (like test form "A" and "B"): Requires two equivalent or parallel forms, correlate the scores of the different forms, can be administered simultaneously (time error) or delayed (content and time error)
answer
-Alternate-form reliability may reduce, but typically not eliminate carryover effects -Few tests have alternate forms
question
Internal Consistency: Estimates errors related to content sampling, Extent to which individuals respond similarly to items measuring the same concept, single administration
answer
-Split-Half Reliability -Coefficient alpha -Kuder-Richardson
question
Split-Half Reliability: Administer the test, then divide it into two equivalent halves, Correlate the scores for the half tests
answer
-How to split a test? First half vs. second half, odd-even split, or randomly -Longer tests are more reliable: with twice as many test items you can sample the domain more accurately -A better sample of the domain means lower error due to content sampling and higher reliability -BUT splitting a test makes each half shorter, which lowers reliability (hence the correction below)
question
Adjusting Split-Half Estimates: Correction formula: The Spearman-Brown formula; statistically adjusts reliability coefficient when test length is reduced to estimate what the reliability would have been if test were longer
answer
rt = 2rh / (1 + rh)
rh = the correlation between the two half-tests
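A sketch combining the split-half procedure with the Spearman-Brown correction, using hypothetical 0/1 item scores and an odd-even split:

```python
import numpy as np

# Rows = examinees, columns = items (hypothetical, scored 0/1).
item_scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 1],
])

odd_half  = item_scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7
even_half = item_scores[:, 1::2].sum(axis=1)   # items 2, 4, 6, 8

r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_total = (2 * r_half) / (1 + r_half)          # Spearman-Brown correction
print(round(r_half, 3), round(r_total, 3))     # corrected estimate is higher
```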
question
Split-Half Method
answer
-Advantages: No need for separate administrations or alternate forms -Problems: Primarily reflects content-sampling error and correlation may vary depending on how test is split
question
Coefficient Alpha: sensitive to content-sampling error and item heterogeneity; can be calculated from one test administration; used as a measure of reliability
answer
-Examines the consistency of responding to all items -Represents the mean reliability coefficient from all possible split halves -Especially useful for tests that do not have right or wrong answers (E.g., attitudes, personality)
question
Reliability coefficient is a function of:
answer
-Extent to which each item represents an observation of the same "thing" observed by other test items -Number of observations one makes
question
rxx = k(rij) / [1 + (k - 1)rij]
k = number of items in the test
rij = average inter-correlation among test items
answer
-Compute the correlations among all items -Compute the average of those inter-correlations -Use formula to obtain standardized estimate
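A sketch of that three-step procedure on hypothetical item responses (rows = examinees, columns = items), producing the standardized estimate of alpha:

```python
import numpy as np

# Standardized coefficient alpha: rxx = k * r_bar / (1 + (k - 1) * r_bar),
# where r_bar is the average inter-item correlation.

items = np.array([
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
])

R = np.corrcoef(items, rowvar=False)          # step 1: inter-item correlations
k = items.shape[1]
r_bar = R[~np.eye(k, dtype=bool)].mean()      # step 2: average off-diagonal r
alpha = (k * r_bar) / (1 + (k - 1) * r_bar)   # step 3: apply the formula
print(round(alpha, 3))
```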
question
One way to increase reliability is to increase the number of items:
answer
-Each item represents an individual assessment of the true score -With multiple items combined, errors will tend to average out -Therefore, increasing the number of items increases reliability
question
Kuder-Richardson Reliability:
answer
Applicable when tests are scored dichotomously (i.e., right or wrong, scored 0 or 1)
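A sketch assuming the standard KR-20 formula, KR-20 = (k / (k - 1)) * (1 - Σpq / σ²total), with hypothetical 0/1 data:

```python
import numpy as np

# Rows = examinees, columns = dichotomously scored items (1 = right, 0 = wrong).
X = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1],
])

k = X.shape[1]
p = X.mean(axis=0)                       # proportion passing each item
q = 1 - p
total_var = X.sum(axis=1).var(ddof=1)    # variance of examinees' total scores
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(round(kr20, 3))
```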
question
Inter-Rater Reliability: Two or more individuals score the same test independently
answer
-Calculate correlation between the scores - Appropriate when scoring requires making judgments -Important when scoring is subjective -A popular index to estimate inter-rater agreement is Cohen's Kappa (categorical data)
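A minimal from-scratch sketch of Cohen's kappa for two raters, with hypothetical categorical ratings:

```python
# kappa = (p_observed - p_expected) / (1 - p_expected), where p_expected is
# the agreement expected by chance from each rater's marginal proportions.

rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

n = len(rater_a)
cats = sorted(set(rater_a) | set(rater_b))
p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
p_exp = sum((rater_a.count(c) / n) * (rater_b.count(c) / n) for c in cats)
kappa = (p_obs - p_exp) / (1 - p_exp)
print(round(kappa, 3))   # 0.467 -- moderate agreement beyond chance
```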
question
Interpreting Reliability Coefficients: The proportion of a scale's total variance that is attributable to a true score
answer
1 - rxx = proportion of error variance. SO, for example, if rxx = .80, then 20% of the variability is due to unsystematic (error) variance
question
Composite scores: when scores are combined to form a composite (like IQ scores)
answer
-the reliability of composite scores is better than that of the individual scores in the composite -tests are simply samples of the test domain -combining multiple measures is analogous to increasing the number of observations
question
Difference scores: involves calculating the difference between two scores (i.e. D = X - Y, where D = Achievement test - IQ Score)
answer
-the reliability of difference scores is typically lower than the individual scores
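A worked example, assuming the standard CTT formula for the reliability of a difference score, r_DD = ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy); all values are hypothetical:

```python
# Even two highly reliable tests can yield a much less reliable difference
# score when the tests correlate substantially with each other.

r_xx, r_yy = 0.90, 0.90   # reliabilities of tests X and Y
r_xy = 0.60               # correlation between the two tests

r_dd = ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)
print(round(r_dd, 3))     # 0.75 -- lower than either test alone
```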
question
If a test is to be administered multiple times: Test-Retest Reliability
answer
Tests to be administered one time: -Homogeneous content - coefficient alpha -Heterogeneous content - split-half coefficient
question
Alternate Forms available:
answer
Alternate form reliability: delayed and simultaneous
question
Factors to consider when evaluating reliability coefficients:
answer
-Construct: what might be acceptable for measure of personality may not be for intelligence -Time available for testing -How the scores will be used -Method of estimating reliability
question
The standard error of measurement (SEM) is more useful when interpreting test scores.
answer
Reliability coefficients are most useful in comparing the scores produced by different tests.
question
Standard error of measurement: the SD of the distribution of scores that would be obtained by one person if he or she were tested on an infinite number of parallel forms of a test comprised of items randomly sampled from the same content domain
answer
-Function of the reliability coefficient and standard deviation of the scores -As reliability increases, the SEM decreases
question
Confidence Intervals: reflect a range that contains the examinee's true score
answer
-Confidence intervals are calculated using the SEM and the SD of the scores -As reliability increases, SEM and confidence intervals get smaller
question
About 68% of the scores in a normal distribution are located between 1 SD above and below the mean
answer
If an individual obtains a score of 70 on a test with an SEM of 3.0, we would expect her true score to be between 67 and 73
question
About 95% of the scores in a normal distribution are located between 1.96 SD above and below the mean
answer
If an individual obtains a score of 70 on a test with an SEM of 3.0, we would expect her true score to be between 64.12 and 75.88
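A sketch reproducing both confidence-interval examples, plus the relationship SEM = SD * sqrt(1 - rxx) described earlier (the SD of 15 in the last line is a hypothetical value):

```python
import math

def sem(sd, rxx):
    """Standard error of measurement from score SD and reliability."""
    return sd * math.sqrt(1 - rxx)

def confidence_interval(score, sem_value, z):
    return (score - z * sem_value, score + z * sem_value)

print(confidence_interval(70, 3.0, 1.00))   # (67.0, 73.0)   ~68% interval
print(confidence_interval(70, 3.0, 1.96))   # (64.12, 75.88) ~95% interval

# As reliability increases, the SEM (and the interval) shrinks:
print(round(sem(15, 0.80), 2), round(sem(15, 0.96), 2))   # 6.71 3.0
```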
question
The SEM and confidence intervals remind us that scores are not perfect
answer
-When the reliability of the test scores is high, the SEM is low because high reliability implies low random measurement error -The smaller the standard error of measurement, the narrower the range
question
CTT: Only an undifferentiated error component
answer
Generalizability theory: Shows how much variance is associated with different sources of error
question
Reliability information reported as a Test Information Function (TIF): A TIF illustrates reliability at different points along the distribution.
answer
TIFs can be converted into an analog of the SEM.
question
How Test Manuals Report Reliability Information:
answer
At a minimum, manuals should report: internal consistency reliability estimates, test-retest reliability, standard error of measurement (SEM), and information on confidence intervals (typically 90% and 95% intervals)
question
Validity: refers to the appropriateness and accuracy of the interpretation of test scores (does the test measure what it is designed to measure?)
answer
if test scores are interpreted in multiple ways, each interpretation needs to be evaluated
question
An achievement test can be used to:
answer
-evaluate students' performance -assign a student to an appropriate instructional program -evaluate a learning disability (the validity of each of these interpretations needs to be evaluated)
question
Reliability tells us whether a test measures whatever it measures consistently
answer
Validity is about our confidence that interpretations we make from a test score are likely to be correct
question
Reliability is a necessary, but insufficient, condition for validity.
answer
-For interpretation of scores to be valid, test scores must be reliable. -However, reliable scores do not guarantee valid score interpretations.
question
Construct underrepresentation: Present when the test does not measure important aspects of the specified construct.
answer
A test of math skills that contains division problems only
question
Construct-irrelevant variance: Present when the test measures features that are unrelated to the specified construct.
answer
A math test with complex written instructions
question
External Features that Can Impact Validity
answer
-Examinee characteristics (e.g., anxiety): max performance test: low motivation/high anxiety impact interpretations AND typical response test: client may attempt to present him/herself in a more/less pathological manner -deviation from standard test administration/scoring procedures (follow time limits/provide instructions) -instruction and coaching -appropriateness of standardization sample (norm-referenced interpretations)
question
Traditional validity nomenclature:
answer
-Content Validity: is the content of the test relevant and representative of the domain? -Criterion-Related Validity: involves examining the relationships between the test and external variables -Construct Validity: involves an integration of evidence that relates to the meaning of the test scores
question
Traditional validity nomenclature suggests that there are different "types" of validity
answer
-Modern conceptualization views validity as a unitary concept. -Not types of validity but sources of validity evidence. -The current view is that validity is a single concept with multiple sources of evidence to demonstrate it
question
Sources of Validity Evidence: Standards for Educational and Psychological Testing (1999) describe five sources of evidence:
answer
-Evidence Based on Test Content -Evidence Based on Relations to Other Variables -Evidence Based on Internal Structure -Evidence Based on Response Processes -Evidence Based on Consequences of Testing
question
Evidence Based on Test Content: Traditionally referred to as content validity; examines the relationship between the content of the test and the construct it is designed to measure; does the test cover the content that it is supposed to cover?
answer
-The process of establishing content relevance starts at the early stages of test development: identify what we want to measure and delineate the construct or content domain to be measured -Typical response scale to measure anxiety: experts review the clinical and research literature and develop items designed to assess the theoretical construct being measured -Test developers include a detailed description of the procedures for writing items as validity evidence
question
After the test is developed, developers continue collecting validity evidence based on content
answer
-A qualitative process: expert judges review the correspondence between test content and its construct -Experts: the same ones who helped during test construction, or an independent group
question
Experts evaluate two major issues:
answer
-Item Relevance: Does each individual item reflect content in the specified domain? -Content Coverage: Does the overall test reflect the essential content in the domain?
question
Content-based validity evidence is especially important for:
answer
-Academic achievement tests -Employment tests: a sample of the skills needed to succeed at the job, used to demonstrate consistency between the content of the test and the job requirements
question
Face Validity
answer
-not a form of validity -Does the test "appear to measure" what it is designed to measure to the general public? -Tests with "face validity" are usually better received by the public.
question
Evidence Based on Relations to Other Variables: Historically referred to as criterion validity
answer
-Obtained by examining relationships between test scores and other variables -Several distinct applications: Test-Criterion Evidence, Convergent and Discriminant Evidence, and Contrasted Groups Studies
question
Test-Criterion Evidence: Criterion: Measure of some outcome of interest
answer
-Many tests are designed to predict performance on some variable (the criterion) -Can test scores predict performance on a criterion? (e.g., SAT predict college GPA) -Types of studies to collect test-criterion evidence: Predictive Studies and Concurrent Studies
question
Predictive studies involve a time interval between test and criterion.
answer
In concurrent studies, the test and criterion are measured at the same time.
question
Predictive evidence of validity:
answer
-Administering a test to applicants for a job -Holding their scores for a pre-established period of time, but not using those scores as part of the selection process -When the time has elapsed, taking a measure of the behavior the test was designed to predict (the criterion) -A test has predictive validity when its scores are significantly correlated with scores on the criterion
question
Concurrent evidence of validity:
answer
-Collect criterion data from a group of current employees -Give those same employees the test they wish to use as part of their selection process -The test demonstrates evidence of concurrent validity if its scores are significantly correlated with scores on the criterion
question
Researchers use a correlation coefficient to examine the relationship between the criterion and the predictor
answer
In this context, the correlation coefficient is referred to as the validity coefficient (rxy)
question
Issues in test-criterion studies
answer
-Selecting a criterion: Criterion's measure must be both valid and reliable -Criterion contamination: Predictor and criterion scores must be obtained independently -Interpreting validity coefficients: How large should validity coefficients be? -Validity generalization
question
Convergent Evidence: Construct Validity
answer
-Correlate test scores with tests of the same or similar construct -Expect moderate to strong positive correlations (like anxiety and depression)
question
Discriminant Evidence: Construct Validity
answer
-Correlate test with tests of a dissimilar construct -Expect negative correlations (like self-esteem and anxiety)
question
Multitrait-Multimethod Studies combines convergent and divergent strategies
answer
-Requires examining two or more traits using two or more measurement methods -Allows one to determine what the test correlates with (and does not correlate with), as well as how the method of measurement influences the relationship
question
Contrasted Group Studies: Examine different groups expected to differ on the construct measured by the test
answer
Examples: -Contrast depressed vs. non-depressed -Young vs. old examinees
question
Evidence Based on Internal Structure: Examine the internal structure and determine if it matches the construct being measured
answer
Factor analysis is a prominent technique.
question
Factor Analysis: A statistical method that evaluates the interrelationships of variables and derives factors
answer
-Factor analysis allows one to detect the presence and structure of latent constructs among a set of variables. -Factor analysis starts with a correlation matrix.
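A bare-bones sketch of that starting point: eigendecomposition of a (hypothetical) correlation matrix suggests how many latent dimensions underlie the variables. This is a simplified stand-in for a full factor analysis:

```python
import numpy as np

# Hypothetical correlations among four measures: two verbal and two math
# tasks that correlate highly within pairs but weakly across pairs.
R = np.array([
    [1.0, 0.8, 0.2, 0.1],   # vocabulary
    [0.8, 1.0, 0.1, 0.2],   # reading
    [0.2, 0.1, 1.0, 0.8],   # arithmetic
    [0.1, 0.2, 0.8, 1.0],   # algebra
])

eigenvalues = np.linalg.eigvalsh(R)[::-1]   # largest first
print(np.round(eigenvalues, 2))             # two large eigenvalues -> two factors
```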
question
Evidence Based on Response Processes
answer
-Are the responses invoked by the test consistent with the construct being assessed? -Does a test of math reasoning require actual analysis and reasoning, or simply rote calculations? -Can also include actions of those administering and grading the test.
question
Evidence Based on Consequences of Testing: "consequential validity evidence." *informal*
answer
-If the test is thought to result in benefits, are those benefits being achieved? -Controversial -Some suggest that this concept should incorporate social issues and values.
question
Validity Argument: Validation should involve the integration of multiple sources of evidence into a coherent commentary.
answer
-All information on test quality is relevant to validity: score reliability, standardized administration and scoring, accurate scaling, equating, and standard setting, and attention to fairness
question
How Test Manual Report Validity Evidence
answer
-Different types of validity evidence are most applicable to different types of tests. -The manual should use multiple sources of validity evidence to build a compelling validity argument.
question
In classical test theory, T stands for ____ score, X stands for ___ score and E stands for ____
answer
true; observed; random measurement error
question
Define random error of measurement and provide an example
answer
error that is the result of chance factors (e.g., distractions in the testing environment)
question
Define systematic error of measurement and provide an example
answer
a scale that adds two extra pounds to every measurement of weight
question
____ error reduces the reliability of test results while ____ error does not lower reliability (test is reliably inaccurate by the same amount each time). Therefore, ____ error is the main focus of classical test theory.
answer
Random; systematic; random
question
What conclusion could be drawn from a reliability coefficient of .75?
answer
75% of the score variance is true score variance; 25% is error variance
question
____ reliability requires that two forms of the test are administered to the same group of individuals while in ____ a test developer gives the same test to the same group of test takers on two different occasion.
answer
alternate form; test/retest
question
____ method of estimating reliability requires dividing the test into two halves, then correlating examinees' scores on the first half with their scores on the second half.
answer
split-half reliability
question
The coefficient alpha is also known as the ____ of all possible split-half coefficients.
answer
average (mean)
question
____ tests produce more reliable scores than ____ tests.
answer
long; short
question
Unreliable test scores will lead to ____ standard error of measurements.
answer
larger
question
When interpreting the test scores of individuals, the ____ is more practical than the ____.
answer
standard error of measurement; reliability coefficient
question
In terms of threats to validity....
answer
construct underrepresentation is present when the test does not measure important aspects of the specified construct
question
On the other hand, ....
answer
construct irrelevant variance is present when the test measures features that are unrelated to the specified construct
question
Common threats to validity
answer
-examinee characteristics (high test anxiety) -deviations from standard test procedures
question
Contemporary conceptualizations view validity as a....
answer
unitary construct while
question
Traditional nomenclature suggests that there are three different....
answer
types of validity
question
Validity evidence based on ....
answer
test content is produced by an examination of the relationship between the content of the test and the construct or domain the test is designed to measure
question
____validity is not technically a form of validity and refers to the degree to which a test 'appears' to measure what it is designed to measure
answer
Face
question
Examples in which validity evidence is based on relations to other variables
answer
GRE given to students prior to entering their first year of grad school
question
____studies involve a time interval between test and criterion but in ____studies the test and criterion are measured at the same time.
answer
Predictive; concurrent
question
"Correlating scores on a new test to measure anxiety with a measure of sensation seeking" is an example of ____validity
answer
discriminant
question
"Correlating scores on a new IQ test with scores on the Wechsler Intelligence Scale" is an example of ____validity
answer
convergent
question
____ ____ studies combine convergent and divergent strategies.
answer
Multitrait-multimethod
question
____ ____ allows one to detect the presence and structure of latent constructs among a set of variables.
answer
Factor analysis
question
____ ____ is a statistical procedure that allows one to predict performance on one test from performance on another (given that both are correlated with each other).
answer
Linear regression
question
____ ____ is a method of obtaining validity that examines different groups expected to differ on the construct measured by the test, e.g., contrasting depressed vs. non-depressed groups.
answer
Contrasted group studies
question
It is important that predictor and criterion scores be obtained independently in order to avoid ____ ____
answer
criterion contamination