How to design a psychological test

Actually, designing a psychological test IS difficult. Measuring individual differences, such as personality traits or intelligence, is fraught with potential pitfalls.

Individual differences are most commonly measured by tests of one kind or another, and these tests all need to be reliable, valid, and unbiased. Before any of this, however, it is vital that a test is administered in a standardised testing situation and is scored using the same scoring and interpretation procedure each time. If these things were not held constant, the results would be chaotic even if the test itself were reliable, valid, and free from bias.

There are two main types of psychological test. The first is the ability test, which consists of problems that are believed to rely on a particular mental ability in order to be solved. These can be "free response" items, but multiple-choice answers are much easier to score. Intelligence tests are an example of an ability test. The second is the personality test, which typically takes the form of a self-report questionnaire, often using Likert scales.

General Points:

Personality tests need to be administered with caution. Firstly, it is important to bear in mind that what is being measured is not the individual answers to items themselves, but rather how all the responses relate to each other. The purpose is to extract personality types through statistical analysis of the way certain types of people tend to respond to questions.

Secondly, personality tests suffer from several potential problems that mean that respondents do not give true answers. For instance, people may simply lie, particularly if the question relates to something socially undesirable. People tend to want to paint a favourable picture of themselves, especially if the test they are taking is part of a selection procedure for a job. To combat this, a good test should make use of a "lie scale", as advocated by Eysenck. This is a series of questions embedded in the body of the test that ask about behaviours that are common but socially undesirable, such as "Did you ever cheat at a test in school?" If an individual admits to very few of these, then questions may be raised over the honesty of their other responses.

People can also display "response biases" when answering test questions. One common phenomenon is "acquiescence": individuals are more likely to agree or say "yes" to statements than to disagree or say "no" (Cronbach, 1946). This would not be much of a problem (a constant could simply be subtracted from each person's score) if it were not for the fact that there may be individual differences in acquiescence. To combat this, a test should have items that measure the same basic thing but are scaled in opposite directions, so that any acquiescence effects cancel out.
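The idea of scaling items in opposite directions can be sketched in a few lines of Python. This is only an illustration: the 5-point Likert scale, the four-item scale, the choice of which items are reversed, and the function names are all assumptions made for the example, not taken from any published test.

```python
# Hypothetical 5-point Likert scale: 1 = strongly disagree ... 5 = strongly agree.
SCALE_MIN, SCALE_MAX = 1, 5


def reverse_score(response):
    """Flip a response so that 5 becomes 1, 4 becomes 2, and so on."""
    return SCALE_MIN + SCALE_MAX - response


def trait_score(responses, reversed_items):
    """Sum the responses, flipping the items worded in the opposite direction.

    responses      -- list of raw answers, one per item
    reversed_items -- set of 0-based item positions that are negatively keyed
    """
    return sum(
        reverse_score(r) if i in reversed_items else r
        for i, r in enumerate(responses)
    )


# Four made-up items measuring one trait; items 1 and 3 are worded the other way round.
reversed_items = {1, 3}

yea_sayer = [5, 5, 5, 5]    # agrees with everything, whatever the content
high_scorer = [5, 1, 5, 1]  # agrees with positive items, disagrees with negative ones

print(trait_score(yea_sayer, reversed_items))    # 12, the midpoint of the 4-20 range
print(trait_score(high_scorer, reversed_items))  # 20, the maximum
```

Someone who agrees with every statement ends up in the middle of the scale rather than at the top, which is the cancelling-out effect the balanced wording is meant to achieve.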

Care must be taken to ensure that a psychological test is as free as possible from the effects of so-called "nuisance factors" - such as lucky guessing of the right answer, misunderstanding what is expected, social pressures like deliberately underperforming so as not to stand out, perceived importance of attaining a high score, or ability. Nuisance factors have a large effect on items in personality tests, since it is almost impossible to find items where the trait being measured accounts for more than 20-30% of the variation in responses. This is a major problem if it is not controlled for. The way around nuisance factors is to make them cancel out. This can be achieved by using several different items to measure the same trait, where each item measures a different aspect of the trait and is therefore affected by different nuisance factors. When these items are summed together, the effects of the nuisance factors are minimised. Using this method, the trait can account for up to 80-90% of the variance in the total score.
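The arithmetic behind this cancelling out can be sketched as follows. The sketch rests on simplifying assumptions that real tests only approximate: every item has the same variance, the trait accounts for the same share of each item, and the nuisance factors are completely uncorrelated from one item to the next. The function name is made up for the illustration.

```python
def trait_share_of_sum(n_items, item_trait_share):
    """Proportion of variance in a summed score that is due to the trait.

    Assumes each item has unit variance, a fraction `item_trait_share` of
    which comes from the trait, with the rest coming from nuisance factors
    that are uncorrelated across items.
    """
    # The trait component is shared by every item, so it grows with n squared.
    trait_var = (n_items ** 2) * item_trait_share
    # Independent nuisance components only add up linearly.
    nuisance_var = n_items * (1 - item_trait_share)
    return trait_var / (trait_var + nuisance_var)


# Items where the trait explains 25% of the variation in each response.
for k in (1, 5, 10, 20, 30):
    print(k, trait_share_of_sum(k, 0.25))
```

Under these assumptions, a single item is 25% trait, while a sum of 20 such items is about 87% trait, which is in line with the 80-90% figure above.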

Reliability:

However, all the measures briefly discussed above would be of no real use if the test itself were not reliable. Reliability is essentially a measure of how consistently a person will achieve the same score on the same test, provided nothing major happens to that person between testing occasions, or of how well their score approximates their true score. This is measured by "alpha" (Cronbach's alpha), which is influenced by the average size of the correlation between the test items and by the number of items in the scale. A test will be maximally reliable if all the items measure the same underlying trait. Also, the more test items, the more reliable the test, since it will be more likely that nuisance factors will cancel out. To calculate alpha accurately, the test must be administered to a sample of at least 200 people. The square root of alpha is then a good approximation of the correlation between an individual's test score and their true score. A rule of thumb for deciding whether a test is reliable enough to be used is that alpha should be above 0.7, as then the correlation between test and true scores will be at least 0.84. But if the test is to be used for important purposes, such as deciding whether a child needs remedial education, then alpha should really be above 0.9.
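As a rough illustration of how alpha depends on the inter-item correlations and the number of items, here is a minimal Python sketch. The first function uses the usual variance-based formula for Cronbach's alpha. The second uses the standardised form, which assumes all items have equal variance. The tiny data set is invented purely to show the calculation and is far smaller than the 200 people recommended above.

```python
import math
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha from raw scores.

    scores -- list of rows, one per respondent, each a list of item scores
    """
    n_items = len(scores[0])
    items = list(zip(*scores))  # regroup the data as one tuple per item
    sum_item_vars = sum(variance(item) for item in items)
    total_var = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum_item_vars / total_var)


def standardised_alpha(n_items, mean_r):
    """Alpha from the number of items and the average inter-item correlation."""
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)


# Invented responses from five people to four items on a 1-5 scale.
toy_scores = [
    [4, 5, 4, 3],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
]
print(cronbach_alpha(toy_scores))

# More items, or more highly correlated items, both push alpha up.
print(standardised_alpha(10, 0.2))
print(standardised_alpha(20, 0.3))

# Approximate correlation between test score and true score.
print(round(math.sqrt(0.7), 2))  # 0.84
print(round(math.sqrt(0.9), 2))  # 0.95
```

With an average inter-item correlation of 0.2, ten items give an alpha of only about 0.71, while twenty items correlating at 0.3 give about 0.90.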

There are several things that should be avoided in tests with respect to reliability. Firstly, tests should not have multiple questions that are in fact paraphrases of each other. This would inevitably increase the average correlation as there will be multiple questions with very highly correlating answers - which can make a great difference since correlations between items are usually only in the range of 0.2 - 0.4. It is also important that the sample used to calculate alpha is similar to the groups of people the test will be administered to. It may be the case that a test is reliable for one group, but not for another due to any number of individual differences.

Validity:

It is also vitally important for a test to be valid. Validity refers to the degree to which test items measure what they are intended to measure. Reliability is a necessary precursor to validity, but reliability does not make a test valid, since it could very consistently measure something that its designer did not intend. A test can have four types of validity. The first is "face validity", which is a measure of how well test items seem to measure what they are supposed to, judged by simple scrutiny of their content. However, this is not a very useful way of establishing validity, since items that look right may still measure something else. But if a test does have face validity, then it helps it to sell commercially. The second type of validity is "content validity". This occurs on occasions when it is possible to construct a test that must be valid. For example, occupational psychologists sometimes use "workbasket" approaches, where applicants for a position are selected on the basis of how well they perform on tasks that they are likely to perform should they get the job. The third type of validity is "construct validity", which can be assessed through carefully designed experiments. Within this, "convergent validity" is the degree to which test scores correlate with things they should be expected to correlate with (such as other similar tests), and "divergent validity" is a check that the test does not measure anything it is not supposed to (based on evidence from the literature). Fourthly, there is "predictive validity", which measures a test's ability to predict behaviour. This is important because many tests do aim to predict behaviour. However, it can be very difficult to measure, especially for tests used to select candidates for positions, since there will be no job performance data for those who fail to qualify.

Bias:

Even if a test is both reliable and valid, that is no guarantee that it will not be biased for, or against, certain groups. A test is biased if it systematically underestimates or overestimates the true scores of certain individuals. For instance, knowledge-based tests are always biased if given to people who have no way of realistically knowing the answers. Another example would be an English-based intelligence test given to a non-English speaker. It is very important for tests in selection procedures to be unbiased so that certain groups are not underrepresented in the workforce. Bias can be due to external factors, such as group differences, and/or due to internal factors, such as some questions being significantly harder for a particular group. But it is important to remember that individual differences within groups tend to be much greater than differences between groups, and it can be very difficult to decide upon a criterion for forming groups in the first place. Also, a very large sample is required before subtle degrees of bias can be detected, but if the sample is too large then almost every item will show a small, but significant, degree of bias.
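The point about sample size can be illustrated with a simple significance test. The sketch below uses an ordinary two-proportion z test on the pass rate of a single item, which is only a stand-in for real item-bias analyses (these usually also match people on overall ability). The pass rates, group sizes, and function name are all invented for the example.

```python
import math


def two_proportion_z(p1, p2, n_per_group):
    """z statistic for the difference between two pass rates, equal group sizes."""
    pooled = (p1 + p2) / 2  # with equal group sizes the pooled rate is the mean
    standard_error = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    return (p1 - p2) / standard_error


# Hypothetical item: 51% of group A and 49% of group B answer it correctly.
for n in (200, 20_000):
    z = two_proportion_z(0.51, 0.49, n)
    verdict = "significant" if abs(z) > 1.96 else "not significant"
    print(n, round(z, 2), verdict)
```

The same two-point gap gives z = 0.4 with 200 people per group and z = 4.0 with 20,000 per group, so a difference too small to matter in practice becomes statistically significant once the sample is large enough.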

Practical Applications:

Tests of ability, such as intelligence tests, are often used to select people for a job or a course of study, and the advantage of this is that, in theory, only the most intelligent individuals are selected. However, the disadvantage may be that a person who is more than able to do the specific tasks required of a job may not be selected as they were up against a less-able candidate who happened to be better at intelligence tests. Another point is that intelligence tests may provide only a very artificial representation of intellectual ability. Intelligence tests typically measure maximal intelligence because they are conducted in a situation where someone is focussed on the specific task of completing the test. However, this may not reflect typical intelligence, which is more stable over time. Ackerman (1994) suggests that this may be the reason why intelligence test scores do not correlate highly with subsequent occupational and academic performance - things that rely more on the long-term application of typical intelligence. Goff and Ackerman (1992) have formulated a self-report measure of typical intellectual engagement (TIE), essentially a personality measure, which can be used to help remedy such testing problems.

There are also advantages and disadvantages to using personality tests to select candidates for positions. They could be used to make sure that a candidate does not have any personality traits that would stop them functioning well in the working environment in question, or that would cause them to clash with colleagues. Also, some companies use personality tests to help show their employees better ways in which they could work and get the most out of themselves. However, some may question the fairness of selecting someone on the basis of something other than ability, and it is unlikely that any candidate would admit to having negative traits such as a poor attitude to work.

Some final points about what makes a good psychological test. It should be as unfamiliar as possible to the participant, since this helps to ensure that scores reflect true underlying ability rather than practice effects and knowledge of what is expected. In addition, it is very helpful to have a few practice items at the beginning of a course of testing so that participants know what they need to do. If a test is to be used for a selection procedure, then it must meet all of the criteria discussed above - it must be reliable, valid, free from bias, and constructed and administered with care.

---------

References:

Cooper, C. (1998) Individual Differences. London: Arnold

Ackerman, P.L., & Heggestad, E.D. (1997) Intelligence, personality and interests: evidence for overlapping tests. Psychological Bulletin, 121: 219-245
