Tests & Measurements Chap. 3-5 – Flashcards
question
Test Scores
answer
mathematical representation of an examinee's performance
question
Raw scores: number of items scored in a specific manner
answer
to give raw scores more meaning, we need to transform them into standard scores
question
Standard scores
answer
norm-referenced OR criterion-referenced
question
Norm-referenced interpretations: examinee's performance is compared to that of other people (most psych. tests are norm-referenced)
answer
-norms: average scores of an identified group of individuals -norm-based interpretation: process of comparing an individual's test score to a norm group
question
Standardized samples should be representative of the type of individuals expected to take the test
answer
Developing normative data: define population, select random sample and test it
question
National standardization sample obtained through stratified random sampling, in the U.S. samples stratified based on gender, age, ethnicity, etc. (must exceed 1,000 participants)
answer
once standardization sample is selected, normative tables or norms are developed
question
Nationally representative samples are common
answer
other samples are available for some tests like local norms and clinical norms
question
Standardized administration: test should be administered under the same conditions and same administrative procedures
answer
-standard scores: raw scores are transformed to another unit of measurement -use SD units to indicate where an examinee's score is located relative to the mean of the distribution
question
There are several standard score formats (for transforming raw scores into standard scores): z-scores (M=0, SD=1), T-scores (M=50, SD=10), IQs (M=100, SD=15)
answer
standard scores can be set to any desired M and SD (with the fancy of the test author frequently being the sole determining factor)
question
Z-scores (+ is above mean, and - is below mean): z=(X-M)/SD
answer
z score to raw score: X=(Z)(SD)+M
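A minimal sketch of these conversions in Python; the raw score, mean, and SD are hypothetical illustration values, and the T-score and IQ metrics come from the formats card above:

```python
# z = (X - M) / SD converts a raw score to a z-score;
# X = (z)(SD) + M back-transforms a z-score to any desired score metric.

def to_z(raw, mean, sd):
    return (raw - mean) / sd

def from_z(z, mean, sd):
    return z * sd + mean

raw, M, SD = 62, 50, 8                 # hypothetical raw-score distribution
z = to_z(raw, M, SD)                   # 1.5 -> 1.5 SDs above the mean
t_score = from_z(z, mean=50, sd=10)    # T-score metric -> 65.0
iq = from_z(z, mean=100, sd=15)        # IQ metric -> 122.5
print(z, t_score, iq)
```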
question
Disadvantages of z scores: difficult to interpret
answer
half of the z scores in a distribution will be negative, carry decimal places, few test publishers routinely report z-scores
question
Percentile rank: reflects the percentage of people scoring below a given point (so a percentile rank of 20 indicates that only 20% of individuals scored below this point)
answer
-range from 1 to 99 (a rank of 50 indicates the median score) -percentile rank is not the same as percentage correct: a percentile rank of 60 means the examinee scored better than 60% of the sample, NOT that the examinee correctly answered 60% of the questions -percentile (not percentile rank): the point in a distribution at or below which a specified percentage of scores fall (so the 60th percentile at 104 indicates that 60% of scores are 104 or below)
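A minimal sketch of the percentile-rank calculation against a hypothetical norm group; note that it counts scores below the examinee's score, not items answered correctly:

```python
# Percentile rank = percentage of the norm group scoring below a given score.

def percentile_rank(score, norm_scores):
    below = sum(1 for s in norm_scores if s < score)
    return 100 * below / len(norm_scores)

norm_group = [88, 92, 95, 97, 100, 102, 104, 104, 110, 115]  # hypothetical
print(percentile_rank(104, norm_group))  # 60.0 -> beat 60% of the norm group
```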
question
Quartile scores: lower 25%=1, 26 to 50%=2, 51 to 75%=3, upper 25%=4
answer
Stanine: not as common as percentiles, expressed in whole numbers from 1 to 9, with 4, 5, and 6 being considered average
question
Criterion-referenced score interpretations: the examinee's performance is compared to a specified level of performance
answer
-criterion-referenced interpretations are absolute: compared to an absolute standard -often used in educational settings
question
Examples of criterion-referenced interpretations:
answer
-percentage correct (i.e. 85% on a classroom test) -mastery testing: a cut score is established (pass/fail driver's license) -standards-based interpretations: involves 3 to 5 performance categories (i.e. assigned "A" to reflect superior work)
question
The terms norm-referenced and criterion-referenced apply to score interpretations
answer
NOT tests!
question
Norm-referenced interpretations can be applied to both maximum performance and typical response tests
answer
Criterion-referenced interpretations are typically applied only to maximum performance tests
question
Item Response Theory Scores (Rasch/IRT-scores, Change Sensitive Scores (or CSS)): fundamental for computer adaptive testing
answer
-theory holds that responses to items on a test are accounted for by latent traits -latent trait: inferred to exist based on theory and evidence of its existence -intelligence is a latent trait
question
IRT Scores cont'd: each examinee possesses a certain amount of intelligence
answer
-IRT describes how examinees at different levels of ability will respond to individual test items -the specific ability level of an examinee is defined as the level at which the examinee can answer half of the items correctly -IRT scores can be transformed to either norm- or criterion-referenced scores
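A sketch of a one-parameter (Rasch) item response function, assuming the standard logistic form; the probability of success is exactly .50 when ability equals item difficulty, which matches the definition of ability level above. All values are hypothetical:

```python
import math

# Rasch model: P(correct) = 1 / (1 + e^-(theta - b)),
# where theta = examinee ability and b = item difficulty.

def p_correct(theta, b):
    return 1 / (1 + math.exp(-(theta - b)))

for b in (-1.0, 0.0, 1.0):                  # easy, medium, hard items
    print(b, round(p_correct(0.0, b), 2))
# -1.0 0.73  (easy item: high chance of success)
#  0.0 0.5   (difficulty matches ability exactly)
#  1.0 0.27  (hard item: low chance of success)
```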
question
Qualitative descriptions of test scores: helps communicate test results (i.e. IQs 145 and above=very gifted, IQs 90-109=average)
answer
-Test manuals should provide information on: normative samples (type of sample like national, size of sample, how well it matched U.S. population) and test scores (type of scores provided like T-score, how to transform raw scores, information on confidence intervals)
question
Reliability refers to the: consistency, accuracy, or stability of test scores
answer
Factors that may affect reliability: time test was administered, items included, external distractions, internal distractions, person grading the test
question
Measurement Error: error is present in all measurement
answer
even in physics, measurement error can be reduced but not eliminated
question
Classical Test Theory (or CTT) is the most influential theory to help us understand measurement issues (Charles Spearman in the early 1900s):
answer
-holds that every score has two components: true score that reflects the examinee's true skills AND error score which is the unexplained difference between a person's actual score on a test and that person's true score
question
Xi = T + E
Xi = Obtained or observed score
T = True score
E = Random measurement error
answer
Random measurement error varies from: -person to person -test to test -administration to administration
question
True score cannot be directly measured: it is a theoretical reflection of the actual amount of the trait, so all we ever see is an observed score
answer
Measurement error: -Random -Systematic
question
Random measurement error is the result of chance factors
answer
-It can increase or decrease an individual's observed score -It reduces: the usefulness of measurement, ability to generalize, confidence in test results -Random error lowers the reliability of test results: if errors are responsible for much of the variability, test scores will be inconsistent; if errors have little effect on test scores, the test reflects mainly consistent aspects of performance
question
Systematic measurement error: increases or decreases the true score by same amount each time (E.g., scale that adds 2 pounds, social desirability)
answer
-Does not lower reliability: the test is reliably inaccurate by the same amount each time -It is difficult to identify -It is not considered in reliability analysis
question
Measurement errors are random: Equally likely to be positive or negative, over an infinite number of testings the error will increase and decrease a person's score by the same amount, and errors will tend to average zero
answer
-Making a test longer also reduces the influence of random error for the same reason -Error is normally distributed -Reduce the error and reliability increases -The test developer's job is to reduce the sources of error as much as possible
question
Sources of measurement error: tests rarely include every possible question
answer
-Content sampling error (considered the largest source of measurement error): differences between the sample of items on the test and the total domain of items (all possible items); if the items are a good sample of the domain, content error will be small -Time sampling error (temporal stability): random fluctuations in performance over time, including changes in the examinee (e.g., fatigue) and the environment (e.g., distractions) -Inter-rater differences: when scoring is subjective -Errors in administration -Clerical errors
question
Reliability coefficients: CTT: Xi = T + E, extended to incorporate the concept of variance: σ²X = σ²T + σ²E
answer
σ²X = Observed score variance
σ²T = True score variance
σ²E = Error score variance
question
General symbol for reliability is rxx: rxx = σ²T / σ²X (reliability coefficients range from 0 to 1)
answer
-Reliability is the ratio of true score variance to total score variance -Reliability is the proportion of test score variance due to true score variance
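A small simulation (hypothetical parameters) of this ratio: observed scores are built as true score plus random error, and the resulting variance ratio approximates the reliability:

```python
import random

# Simulate CTT: X = T + E. With var(T) = 15^2 = 225 and var(E) = 5^2 = 25,
# the expected reliability is 225 / (225 + 25) = .90.

random.seed(1)
n = 100_000
true_scores = [random.gauss(100, 15) for _ in range(n)]
observed = [t + random.gauss(0, 5) for t in true_scores]

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(round(var(true_scores) / var(observed), 3))   # ~0.90
```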
question
Reliability coefficients are correlation coefficients: reflect the proportion of test score variance attributable to true score variance
answer
-so rxx = .90 indicates that 90% of the score variance is due to true score variance -there are different ways to obtain the scores that are correlated
question
Psychologist use different methods for checking reliability:
answer
-Test-retest reliability -Alternate forms -Internal consistency -Inter-rater agreement
question
Test-Retest Reliability: administer the same test on two occasions, correlate the scores from both administrations, primarily reflects time sampling error
answer
-reflects the degree to which test scores can be generalized to different situations or over time -important to consider the length of the interval between testings -the optimal interval is determined by the way test results are used (e.g., intelligence vs. mood) -Carry-over effects -Practice and memory effects -Characteristics of the attribute may change with time; also time-consuming and expensive
question
Procedure: test-retest
answer
-Administering a test to a group of individuals -Re-administering the same test at a later time -Compute the correlation between both scores, should be above .70
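A minimal sketch of the procedure with hypothetical scores:

```python
import numpy as np

# Test-retest reliability: correlate scores from two administrations of the
# same test to the same group of examinees.

time1 = [85, 90, 78, 92, 88, 75, 95, 82]   # first administration
time2 = [83, 91, 80, 90, 85, 78, 96, 84]   # later re-administration

r_tt = np.corrcoef(time1, time2)[0, 1]     # stability coefficient
print(round(r_tt, 3))                      # generally want > .70
```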
question
Alternate-Form Reliability (like test form "A" and "B"): Requires two equivalent or parallel forms, correlate the scores of the different forms, can be administered simultaneously (time error) or delayed (content and time error)
answer
-Alternate-form reliability may reduce, but typically not eliminate carryover effects -Few tests have alternate forms
question
Internal Consistency: Estimates errors related to content sampling, Extent to which individuals respond similarly to items measuring the same concept, single administration
answer
-Split-Half Reliability -Coefficient alpha -Kuder-Richardson
question
Split-Half Reliability: Administer the test, then divide it into two equivalent halves, Correlate the scores for the half tests
answer
-How to split a test? First half vs. second half, odd-even split, or randomly -Longer tests are more reliable: with twice as many test items you can sample the domain more accurately -A better sample of the domain means lower error due to content sampling and higher reliability -BUT splitting a test makes each half shorter, which lowers reliability (hence the correction below)
question
Adjusting Split-Half Estimates: Correction formula: The Spearman-Brown formula; statistically adjusts reliability coefficient when test length is reduced to estimate what the reliability would have been if test were longer
answer
rt = 2rh / (1 + rh)
rh = the correlation between the two half-tests
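A sketch combining the split-half procedure with the Spearman-Brown correction, using hypothetical 0/1 item scores and an odd-even split:

```python
import numpy as np

# Rows = examinees, columns = items (hypothetical, scored 0/1).
item_scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 1],
])

odd_half  = item_scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7
even_half = item_scores[:, 1::2].sum(axis=1)   # items 2, 4, 6, 8

r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_total = (2 * r_half) / (1 + r_half)          # Spearman-Brown correction
print(round(r_half, 3), round(r_total, 3))     # corrected estimate is higher
```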
question
Split-Half Method
answer
-Advantages: No need for separate administrations or alternate forms -Problems: Primarily reflects content-sampling error and correlation may vary depending on how test is split
question
Coefficient Alpha: sensitive to content-sampling error and item heterogeneity; can be calculated from one test administration; used as a measure of reliability
answer
-Examines the consistency of responding to all items -Represents the mean reliability coefficient from all possible split halves -Especially useful for tests that do not have right or wrong answers (E.g., attitudes, personality)
question
Reliability coefficient is a function of:
answer
-Extent to which each item represents an observation of the same "thing" observed by other test items -Number of observations one makes
question
rxx = k(rij) / [1 + (k - 1)rij]
k = number of items in the test
rij = average inter-correlation among test items
answer
-Compute the correlations among all items -Compute the average of those inter-correlations -Use formula to obtain standardized estimate
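A sketch of that three-step procedure on hypothetical item responses (rows = examinees, columns = items), producing the standardized estimate of alpha:

```python
import numpy as np

# Standardized coefficient alpha: rxx = k * r_bar / (1 + (k - 1) * r_bar),
# where r_bar is the average inter-item correlation.

items = np.array([
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
])

R = np.corrcoef(items, rowvar=False)          # step 1: inter-item correlations
k = items.shape[1]
r_bar = R[~np.eye(k, dtype=bool)].mean()      # step 2: average off-diagonal r
alpha = (k * r_bar) / (1 + (k - 1) * r_bar)   # step 3: apply the formula
print(round(alpha, 3))
```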
question
One way to increase reliability is to increase the number of items:
answer
-Each item represents an individual assessment of the true score -With multiple items combined, errors will tend to average out -Therefore, increasing the number of items increases reliability
question
Kuder-Richardson Reliability:
answer
Applicable when tests are scored dichotomously (i.e., right or wrong, scored 0 or 1)
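A sketch assuming the standard KR-20 formula, KR-20 = (k / (k - 1)) * (1 - Σpq / σ²total), with hypothetical 0/1 data:

```python
import numpy as np

# Rows = examinees, columns = dichotomously scored items (1 = right, 0 = wrong).
X = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1],
])

k = X.shape[1]
p = X.mean(axis=0)                       # proportion passing each item
q = 1 - p
total_var = X.sum(axis=1).var(ddof=1)    # variance of examinees' total scores
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(round(kr20, 3))
```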
question
Inter-Rater Reliability: Two or more individuals score the same test independently
answer
-Calculate correlation between the scores - Appropriate when scoring requires making judgments -Important when scoring is subjective -A popular index to estimate inter-rater agreement is Cohen's Kappa (categorical data)
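A minimal from-scratch sketch of Cohen's kappa for two raters, with hypothetical categorical ratings:

```python
# kappa = (p_observed - p_expected) / (1 - p_expected), where p_expected is
# the agreement expected by chance from each rater's marginal proportions.

rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

n = len(rater_a)
cats = sorted(set(rater_a) | set(rater_b))
p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
p_exp = sum((rater_a.count(c) / n) * (rater_b.count(c) / n) for c in cats)
kappa = (p_obs - p_exp) / (1 - p_exp)
print(round(kappa, 3))   # 0.467 -- moderate agreement beyond chance
```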
question
Interpreting Reliability Coefficients: The proportion of a scale's total variance that is attributable to a true score
answer
1 - rxx = proportion of error variance. SO, for example, if rxx = .80, then 20% of the variability is due to unsystematic (error) variance
question
Composite scores: when scores are combined to form a composite (like IQ scores)
answer
-the reliability of composite scores is better than that of the individual scores in the composite -tests are simply samples of the test domain -combining multiple measures is analogous to increasing the number of observations
question
Difference scores: involves calculating the difference between two scores (i.e. D = X - Y, where D = Achievement test - IQ Score)
answer
-the reliability of difference scores is typically lower than the individual scores
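A worked example, assuming the standard CTT formula for the reliability of a difference score, r_DD = ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy); all values are hypothetical:

```python
# Even two highly reliable tests can yield a much less reliable difference
# score when the tests correlate substantially with each other.

r_xx, r_yy = 0.90, 0.90   # reliabilities of tests X and Y
r_xy = 0.60               # correlation between the two tests

r_dd = ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)
print(round(r_dd, 3))     # 0.75 -- lower than either test alone
```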
question
If a test is to be administered multiple times: Test-Retest Reliability
answer
Tests to be administered one time: -Homogeneous content - coefficient alpha -Heterogeneous content - split-half coefficient
question
Alternate Forms available:
answer
Alternate form reliability: delayed and simultaneous
question
Factors to consider when evaluating reliability coefficients:
answer
-Construct: what might be acceptable for measure of personality may not be for intelligence -Time available for testing -How the scores will be used -Method of estimating reliability
question
The standard error of measurement (SEM) is more useful when interpreting test scores.
answer
Reliability coefficients are most useful in comparing the scores produced by different tests.
question
Standard error of measurement: the SD of the distribution of scores that would be obtained by one person if he or she were tested on an infinite number of parallel forms of a test comprised of items randomly sampled from the same content domain
answer
-Function of the reliability coefficient and standard deviation of the scores -As reliability increases, the SEM decreases
question
Confidence Intervals: reflect a range that contains the examinee's true score
answer
-Confidence intervals are calculated using the SEM and the SD of the scores -As reliability increases, SEM and confidence intervals get smaller
question
About 68% of the scores in a normal distribution are located between 1 SD above and below the mean
answer
If an individual obtains a score of 70 on a test with an SEM of 3.0, we would expect her true score to be between 67 and 73
question
About 95% of the scores in a normal distribution are located between 1.96 SD above and below the mean
answer
If an individual obtains a score of 70 on a test with an SEM of 3.0, we would expect her true score to be between 64.12 and 75.88
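A sketch reproducing both confidence-interval examples, plus the relationship SEM = SD * sqrt(1 - rxx) described earlier (the SD of 15 in the last line is a hypothetical value):

```python
import math

def sem(sd, rxx):
    """Standard error of measurement from score SD and reliability."""
    return sd * math.sqrt(1 - rxx)

def confidence_interval(score, sem_value, z):
    return (score - z * sem_value, score + z * sem_value)

print(confidence_interval(70, 3.0, 1.00))   # (67.0, 73.0)   ~68% interval
print(confidence_interval(70, 3.0, 1.96))   # (64.12, 75.88) ~95% interval

# As reliability increases, the SEM (and the interval) shrinks:
print(round(sem(15, 0.80), 2), round(sem(15, 0.96), 2))   # 6.71 3.0
```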
question
The SEM and confidence intervals remind us that scores are not perfect
answer
-When the reliability of the test scores is high, the SEM is low because high reliability implies low random measurement error -The smaller the standard error of measurement, the narrower the range
question
CTT: Only an undifferentiated error component
answer
Generalizability theory: Shows how much variance is associated with different sources of error
question
Reliability information reported as a Test Information Function (TIF): A TIF illustrates reliability at different points along the distribution.
answer
TIFs can be converted into an analog of the SEM.
question
How Test Manuals Report Reliability Information:
answer
At a minimum, manuals should report: internal consistency reliability estimates, test-retest reliability, standard error of measurement (SEM), and information on confidence intervals (typically 90% and 95% intervals)
question
Validity: refers to the appropriateness and accuracy of the interpretation of test scores (does the test measure what it is designed to measure?)
answer
if test scores are interpreted in multiple ways, each interpretation needs to be evaluated
question
An achievement test can be used to:
answer
-evaluate students' performance -assign a student to an appropriate instructional program -evaluate a learning disability (the validity of each of these interpretations needs to be evaluated)
question
Reliability tells us whether a test measures whatever it measures consistently
answer
Validity is about our confidence that interpretations we make from a test score are likely to be correct
question
Reliability is a necessary, but insufficient, condition for validity.
answer
-For interpretation of scores to be valid, test scores must be reliable. -However, reliable scores do not guarantee valid score interpretations.
question
Construct underrepresentation: Present when the test does not measure important aspects of the specified construct.
answer
A test of math skills that contains division problems only
question
Construct-irrelevant variance: Present when the test measures features that are unrelated to the specified construct.
answer
A math test with complex written instructions
question
External Features that Can Impact Validity
answer
-Examinee characteristics (e.g., anxiety): max performance test: low motivation/high anxiety impact interpretations AND typical response test: client may attempt to present him/herself in a more/less pathological manner -deviation from standard test administration/scoring procedures (follow time limits/provide instructions) -instruction and coaching -appropriateness of standardization sample (norm-referenced interpretations)
question
Traditional validity nomenclature:
answer
-Content Validity: is the content of the test relevant and representative of the domain? -Criterion-Related Validity: involves examining the relationships between the test and external variables -Construct Validity: involves an integration of evidence that relates to the meaning of the test scores
question
Traditional validity nomenclature suggests that there are different "types" of validity
answer
-Modern conceptualization views validity as a unitary concept. -Not types of validity but sources of validity evidence. -The current view is that validity is a single concept with multiple sources of evidence to demonstrate it
question
Sources of Validity Evidence: Standards for Educational and Psychological Testing (1999) describe five sources of evidence:
answer
-Evidence Based on Test Content -Evidence Based on Relations to Other Variables -Evidence Based on Internal Structure -Evidence Based on Response Processes -Evidence Based on Consequences of Testing
question
Evidence Based on Test Content: Traditionally referred to as content validity; examines the relationship between the content of the test and the construct it is designed to measure; does the test cover the content that it is supposed to cover?
answer
-The process of establishing content relevance starts at the early stages of test development: identify what we want to measure and delineate the construct or content domain to be measured -Typical response scale to measure anxiety: experts review the clinical and research literature and develop items designed to assess the theoretical construct being measured -Test developers include a detailed description of the procedures for writing items as validity evidence
question
After the test is developed, developers continue collecting validity evidence based on content
answer
-A qualitative process: expert judges review the correspondence between test content and its construct -Experts: the same ones who helped during test construction, or an independent group
question
Experts evaluate two major issues:
answer
-Item Relevance: Does each individual item reflect content in the specified domain? -Content Coverage: Does the overall test reflect the essential content in the domain?
question
Content-based validity evidence is especially important for:
answer
-Academic achievement tests -Employment tests: a sample of the skills needed to succeed at the job, used to demonstrate consistency between the content of the test and the job requirements
question
Face Validity
answer
-not a form of validity -Does the test "appear to measure" what it is designed to measure to the general public? -Tests with "face validity" are usually better received by the public.
question
Evidence Based on Relations to Other Variables: Historically referred to as criterion validity
answer
-Obtained by examining relationships between test scores and other variables -Several distinct applications: Test-Criterion Evidence, Convergent and Discriminant Evidence, and Contrasted Groups Studies
question
Test-Criterion Evidence: Criterion: Measure of some outcome of interest
answer
-Many tests are designed to predict performance on some variable (the criterion) -Can test scores predict performance on a criterion? (e.g., SAT predict college GPA) -Types of studies to collect test-criterion evidence: Predictive Studies and Concurrent Studies
question
Predictive studies involve a time interval between test and criterion.
answer
In concurrent studies, the test and criterion are measured at the same time.
question
Predictive evidence of validity:
answer
-Administering a test to applicants for a job -Holding their scores for a pre-established period of time, but not using those scores as part of the selection process -When the time has elapsed, taking a measure of the behavior the test was designed to predict (the criterion) -A test has predictive validity when its scores are significantly correlated with scores on the criterion
question
Concurrent evidence of validity:
answer
-Collect criterion data from a group of current employees -Give those same employees the test they wish to use as part of their selection process -The test demonstrates evidence of concurrent validity if its scores are significantly correlated with scores on the criterion
question
Researchers use a correlation coefficient to examine the relationship between the criterion and the predictor
answer
In this context, the correlation coefficient is referred to as the validity coefficient (rxy)
question
Issues in test-criterion studies
answer
-Selecting a criterion: Criterion's measure must be both valid and reliable -Criterion contamination: Predictor and criterion scores must be obtained independently -Interpreting validity coefficients: How large should validity coefficients be? -Validity generalization
question
Convergent Evidence: Construct Validity
answer
-Correlate test scores with tests of the same or similar construct -Expect moderate to strong positive correlations (like anxiety and depression)
question
Discriminant Evidence: Construct Validity
answer
-Correlate test with tests of a dissimilar construct -Expect negative correlations (like self-esteem and anxiety)
question
Multitrait-Multimethod Studies combines convergent and divergent strategies
answer
-Requires examining two or more traits using two or more measurement methods -Allows one to determine what the test correlates with (and does not correlate with), as well as how the method of measurement influences the relationship
question
Contrasted Group Studies: Examine different groups expected to differ on the construct measured by the test
answer
Examples: -Contrast depressed vs. non-depressed -Young vs. old examinees
question
Evidence Based on Internal Structure: Examine the internal structure and determine if it matches the construct being measured
answer
Factor analysis is a prominent technique.
question
Factor Analysis: A statistical method that evaluates the interrelationships of variables and derives factors
answer
-Factor analysis allows one to detect the presence and structure of latent constructs among a set of variables. -Factor analysis starts with a correlation matrix.
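A bare-bones sketch of that starting point: eigendecomposition of a (hypothetical) correlation matrix suggests how many latent dimensions underlie the variables. This is a simplified stand-in for a full factor analysis:

```python
import numpy as np

# Hypothetical correlations among four measures: two verbal and two math
# tasks that correlate highly within pairs but weakly across pairs.
R = np.array([
    [1.0, 0.8, 0.2, 0.1],   # vocabulary
    [0.8, 1.0, 0.1, 0.2],   # reading
    [0.2, 0.1, 1.0, 0.8],   # arithmetic
    [0.1, 0.2, 0.8, 1.0],   # algebra
])

eigenvalues = np.linalg.eigvalsh(R)[::-1]   # largest first
print(np.round(eigenvalues, 2))             # two large eigenvalues -> two factors
```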
question
Evidence Based on Response Processes
answer
-Are the responses invoked by the test consistent with the construct being assessed? -Does a test of math reasoning require actual analysis and reasoning, or simply rote calculations? -Can also include actions of those administering and grading the test.
question
Evidence Based on Consequences of Testing: "consequential validity evidence." *informal*
answer
-If the test is thought to result in benefits, are those benefits being achieved? -Controversial -Some suggest that this concept should incorporate social issues and values.
question
Validity Argument: Validation should involve the integration of multiple sources of evidence into a coherent commentary.
answer
-All information on test quality is relevant to validity: score reliability, standardized administration and scoring, accurate scaling, equating, and standard setting, and attention to fairness
question
How Test Manual Report Validity Evidence
answer
-Different types of validity evidence are most applicable to different types of tests. -The manual should use multiple sources of validity evidence to build a compelling validity argument.
question
In classical test theory, T stands for ____ score, X stands for ___ score and E stands for ____
answer
true; observed; random measurement error
question
Define random error of measurement and provide an example
answer
error that is the result of chance factors (e.g., distractions in the testing environment)
question
Define systematic error of measurement and provide an example
answer
a scale that adds two extra pounds to every measurement of weight
question
____ error reduces the reliability of test results while ____ error does not lower reliability (test is reliably inaccurate by the same amount each time). Therefore, ____ error is the main focus of classical test theory.
answer
Random; systematic; random
question
What conclusion could be drawn from a reliability coefficient of .75?
answer
75% of the score variance is true score variance; 25% is error variance
question
____ reliability requires that two forms of the test are administered to the same group of individuals while in ____ a test developer gives the same test to the same group of test takers on two different occasion.
answer
alternate form; test/retest
question
____ method of estimating reliability requires dividing the test into two halves, then correlating examinees' scores on the first half with their scores on the second half.
answer
split-half reliability
question
The coefficient alpha is also known as the ____ of all possible split-half coefficients.
answer
average (mean)
question
____ tests produce more reliable scores than ____ tests.
answer
long; short
question
Unreliable test scores will lead to ____ standard error of measurements.
answer
larger
question
When interpreting the test scores of individuals, the ____ is more practical than the ____.
answer
standard error of measurement; reliability coefficient
question
In terms of threats to validity....
answer
construct underrepresentation is present when the test does not measure important aspects of the specified construct
question
On the other hand, ....
answer
construct irrelevant variance is present when the test measures features that are unrelated to the specified construct
question
Common threats to validity
answer
-examinee characteristics (high test anxiety) -deviations from standard test procedures
question
Contemporary conceptualizations view validity as a....
answer
unitary construct while
question
Traditional nomenclature suggests that there are three different....
answer
types of validity
question
Validity evidence based on ....
answer
test content is produced by an examination of the relationship between the content of the test and the construct or domain the test is designed to measure
question
____validity is not technically a form of validity and refers to the degree to which a test 'appears' to measure what it is designed to measure
answer
Face
question
Examples in which validity evidence is based on relations to other variables
answer
GRE given to students prior to entering their first year of grad school
question
____studies involve a time interval between test and criterion but in ____studies the test and criterion are measured at the same time.
answer
Predictive; concurrent
question
"Correlating scores on a new test to measure anxiety with a measure of sensation seeking" is an example of ____validity
answer
discriminant
question
"Correlating scores on a new IQ test with scores on the Wechsler Intelligence Scale" is an example of ____validity
answer
convergent
question
____ ____ studies combine convergent and divergent strategies.
answer
Multitrait-multimethod
question
____ ____ allows one to detect the presence and structure of latent constructs among a set of variables.
answer
Factor analysis
question
____ ____ is a statistical procedure that allows one to predict performance on one test from performance on another (given that both are correlated with each other).
answer
Linear regression
question
____ ____ is a method of obtaining validity that examines different groups expected to differ on the construct measured by the test, e.g., contrasting depressed vs. non-depressed groups.
answer
Contrasted group studies
question
It is important that predictor and criterion scores be obtained independently in order to avoid ____ ____
answer
criterion contamination