Study Notes on Psychological Tests
Meaning of Psychological Test:
“A psychological test is essentially an objective and standardised measure of a sample of behaviour, that is, it is based on observation of a chosen sample of human behaviour.” In the words of Freeman: “a psychological test is a standardised instrument designed to measure objectively one or more aspects of a total personality by means of samples of verbal or non-verbal responses, or by means of other behaviours. The key words in this definition are standardisation, objectivity and samples.”
Basically, the function of psychological testing is to measure individual differences, or differences between the reactions of the same individual on different occasions. Historically, psychological testing began with the identification of the feeble-minded. Even today the general public still identifies psychological tests with intelligence tests. But intelligence tests represent only one group among the several types of currently available psychological tests.
The psychological tests may be classified from the point of view of purposes to which the tests are put, into various groups:
1. Aptitude tests, and
2. Proficiency tests.
An aptitude test is a test designed to discover what potentiality a given person has for learning some particular vocation or acquiring some particular skill.
A proficiency test, on the other hand, is designed to discover how proficient or skillful a person actually is in a given type of activity.
Aptitude tests may be:
1. Specific or
2. General.
The Orleans-Solomon Latin Prognosis test for ability to learn Latin is an example of a specific aptitude test. The Stenquist Mechanical Aptitude test is designed to test mechanical aptitude in general, i.e., a kind of average of a person’s abilities to do all kinds of mechanical work.
All tests of general ability, like Binet-Simon tests, the Terman group test of mental ability, or the tests of general or average scholastic aptitude will fall under this group. The Solomon Latin Prognosis test will fall under specific scholastic aptitude.
Most aptitudes are of such complexity that a single test unit will rarely be able to sample the determining factors. Hence batteries of aptitude tests are used. Thus aptitude tests may be of single unit type or a battery type.
Psychological tests may also be classified into apparatus vs. non-apparatus tests. The non-apparatus tests are verbal tests, either of the oral or of the pencil-and-paper type, though most verbal tests are pencil-and-paper tests. The classic example of a pencil-and-paper test is the Army Alpha test.
There may be individual vs. group tests, speed vs. power tests. Speed tests may again be classified into time-limit tests and work-limit tests. There may be tests of various stimulus response mechanisms, like tests to study sensory efficiency and motor efficiency.
Tests of mental efficiency or psychological tests are tests involving central processes, like memory tests, tests of problem solving, association, word building, sentence completion etc. There are tests of character and temperament, known as personality tests.
The guiding principles in construction of test batteries are:
1. The tests should each correlate as highly with the aptitude criterion as possible, and
2. They should correlate as little with each other as possible.
Justification for Multiple Aptitude Batteries:
Single tests are rarely adequate to detect latent aptitudes, as aptitudes are ordinarily made up of a complex of abilities. Hence the main problem of modern aptitude testing is one of devising test batteries. This is known as the differential approach to the measurement of ability. Such instruments or test batteries provide an intellectual profile showing the individual’s strengths and weaknesses.
This is because intelligence tests are less general than supposed, and they actually measure a combination of special aptitudes like verbal and numerical aptitudes. Finally, the application of factor analysis to the study of trait organisation has provided the theoretical basis for the construction of multiple aptitude batteries. Some of the best known examples of such batteries are the SRA Primary Mental Abilities (PMA), the Differential Aptitude Tests (DAT), the Guilford-Zimmerman Aptitude Survey, etc.
Construction of Aptitude Battery Tests:
A general outline of six steps involved:
1. Make a careful analysis of the job or activity or vocation in question. The objective is to find out, as far as practicable, the traits or characteristics of human behaviour which lead to success in that particular vocation.
2. The choice of a preliminary battery of tests.
3. Try out the preliminary tests to determine objectively which are to be preserved in the final test and which are to be rejected. This step is known as testing the tests. It consists in applying the tests to a group of subjects who are about to start training in the particular field of aptitude under investigation, but who have not yet started the training.
4. Find a criterion score. That is, to secure a quantitative determination of the final success or vocational proficiency of the trial group of subjects after the completion of the training course.
5. Check the test scores of the trial group of subjects, secured in step 3, against their criterion scores obtained in step 4. This is done by the computation of coefficients of correlation. The tests which survive this hard test will form the final test battery.
6. The sixth and final step is the determination of the relative weights to be given to the various surviving tests. This can be done by the method of the multiple regression equation, in order to make the best possible prediction of the aptitude in question; a brief sketch of this step is given below.
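A minimal sketch of step 6 is given below, using invented data and hypothetical variable names (test_scores, criterion); it is not part of the original notes. It simply shows how regression weights for the surviving tests of a battery might be obtained by ordinary least squares, so that the weighted composite best predicts the criterion.

```python
# Hypothetical sketch: deriving weights for a final aptitude battery
# by multiple regression of criterion scores on the surviving tests.
import numpy as np

rng = np.random.default_rng(0)

# Scores of 50 trainees on three surviving tests (columns) -- invented data.
test_scores = rng.normal(50, 10, size=(50, 3))
# Criterion scores (final vocational proficiency) -- invented as a weighted
# combination of the tests plus error, purely for demonstration.
criterion = (0.5 * test_scores[:, 0] + 0.3 * test_scores[:, 1]
             + 0.1 * test_scores[:, 2] + rng.normal(0, 5, size=50))

# Add an intercept column and solve the least-squares problem.
X = np.column_stack([np.ones(50), test_scores])
weights, *_ = np.linalg.lstsq(X, criterion, rcond=None)

print("intercept and regression weights:", np.round(weights, 3))
# The weights indicate the relative contribution of each test to the
# best linear prediction of the criterion.
```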
An idea of what is regarded as a high or a low correlation is given below:
1. Below .45 or .50—practically useless for differential prognosis.
2. From .50 to .60—of some value.
3. From .60 to .70—of considerable value.
4. From .70 to .80—of decided value.
5. Above .80—usually not obtained.
The guiding principle bears repetition: the tests should each correlate as highly with the aptitude criterion as possible, and they should correlate as little with each other as possible.
Achievement Tests:
Achievement tests have been developed to measure the effects of a course of instruction or training in schools or the results of specialised vocational training. It has already been pointed out that the difference between achievement tests and aptitude tests is one of degree only.
Achievement tests measure the result of a specified course of training under controlled conditions, while aptitude tests measure the result of “learning under relatively uncontrolled and unknown conditions.” Secondly, they can be distinguished with reference to their respective uses.
That is, achievement tests represent the terminal evaluation of an individual’s status on completion of a certain course of training; while aptitude tests attempt to predict later performance of an individual. Both depend on past learning, though in the case of aptitude testing, the past learning is used as an indicator of future learning.
Achievement tests may be teacher-made or standardised. The teacher-made tests are only for use in the classroom. The teacher uses them simply to test the efficiency of his pupils in a subject taught by him in the class. These tests do not have norms determined for the evaluation of scores.
The standardised tests, on the other hand, are prepared by specialists. In order to standardise a test, it is applied to a large sample of the population of the same age group. The value of each item is determined by the number in a certain age group passing that item. The mean or median response of a group determines the norm of the test. Each item of a standardised test thus has a norm determined for it.
It has to be remembered that a norm is not the same as standard. “Norm does not refer to any normative value. It simply represents a score actually achieved by a large group of pupils of the same age, and reading in the same class.
These scores represent the average response, and a value gets attached to an item, while standard refers to an objective set by educationists or syllabus-framers to be achieved by pupils of a certain age group.” In other words – “norms are measures of achievement which represent the typical performance of a designated group or groups.” For standardisation, various norms have been proposed, like national, regional and local norms, age and grade norms, mental age norms, I.Q.s, percentiles, standard score norms, normalised standard scores, etc.
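As a rough illustration of how a norm is established, the sketch below (invented sample, hypothetical names, not part of the original notes) tabulates the median raw score of each age group in a standardisation sample; that median then serves as the age norm.

```python
# Hypothetical sketch: age norms from a small invented standardisation sample.
# The median raw score of each age group is taken as the norm for that age.
import statistics
from collections import defaultdict

# (age in years, raw score) pairs from an imagined standardisation sample.
sample = [(8, 21), (8, 24), (8, 19), (9, 27), (9, 30), (9, 26),
          (10, 33), (10, 35), (10, 31), (11, 38), (11, 40), (11, 37)]

scores_by_age = defaultdict(list)
for age, score in sample:
    scores_by_age[age].append(score)

for age in sorted(scores_by_age):
    norm = statistics.median(scores_by_age[age])
    print(f"age {age}: median raw score (age norm) = {norm}")
```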
There are 4 types of standardised achievement tests:
(1) Survey test batteries—to determine an individual’s specific strengths and weaknesses.
(2) Single Survey tests—The examiner has a choice of a single subject matter test.
(3) Diagnostic test—to identify difficulties in learning a subject.
(4) Prognostic test—designed to predict achievement in specific school subjects.
The teacher-made tests may be of two types:
(i) Objective
(ii) Essay type.
The first step involved in preparation of any achievement test—whether teacher-made or standardised—is the determination of the objectives of measurement, that is, the behaviours to be assessed are to be clearly defined at the outset.
The educational objectives to be assessed, have been very clearly defined by Dr. B. S. Bloom in the Taxonomy of Educational Objectives.
The cognitive domain in the taxonomy lists six categories:
Knowledge, comprehension, application, analysis, synthesis and evaluation.
I. Knowledge involves remembering of facts, terms and principles in the form they were learned, either by recall or recognition.
Knowledge is, again, classified under 3 heads:
(1) Knowledge of specifics which involves knowledge of terminology (e.g. terms in chemistry) and knowledge of specific facts (e.g. knowledge of physical and chemical properties etc.).
(2) Knowledge of universals and abstractions in a field (e.g., knowledge of principles and generalisations, theories and structures).
(3) Knowledge of ways and means of dealing with specific facts (e.g. knowledge of conventions, trends and sequences, classification and categories, criteria and methodology).
II. Comprehension means understanding the meaning or purpose of something. For example, when teachers present a science demonstration, project an instructional film, use a chart or diagram in a text book, or exhibit pictures illustrating different styles of architecture, they are trying to help the students comprehend what they are studying. It may involve translation, interpretation and extrapolation.
III. Application involves the use of information and ideas in new situations. The chief difference between the categories of comprehension and application is that the latter involves facing new or unfamiliar problems.
Comprehension of a concept does not guarantee that the individual will be able to recognise its relevance and apply it correctly in real-life situations. Students need practice in restructuring unfamiliar problem situations and in applying the concepts and principles they have learnt.
IV. Analysis is breaking something down to reveal its structure and the interrelationships among its parts. Sample verbs are “analyse”, “differentiate” and “relate.”
V. Synthesis is combining various elements or parts in a structural whole. Sample verbs are ‘design’, ‘devise’, ‘formulate’ and ‘plan.’
VI. Evaluation of student performance in skills is often neglected, though it is highly important. It involves the capacity to make a judgment based on reasoning. A sample evaluation item is: “Evaluate the procedure involved in standardising this test.”
Besides the cognitive domain, there are affective and psychomotor objectives. One of the main objectives of education is to create certain attitudes, values and other affective states in the learner. A number of classifications of such objectives have been proposed, for example by Bloom, Krathwohl and Masia (1964). Any psychological test, whether an aptitude or an achievement test, must be reliable and valid.
Reliability:
In psychometrics the term reliability means consistency. The reliability of a test depends on how consistently it measures what it measures. The basic definition of reliability as stated by Guilford is as follows: “The reliability of any set of measurements is logically defined as the proportion of their variance that is true variance…operationally, it is some kind of self-correlation of a test.” Suppose a test is applied to a certain group of people; it gives a set of scores.
After some time the same test, or a similar one, is again applied to the same group, and a second set of scores is obtained. If the variance between the two sets of scores is very high, the test must be defective somewhere and is not reliable enough.
If a child’s I.Q. measures 110 on Monday and 50 the next Friday, neither score can be accepted as a reliable index of his intelligence. Hence each test must be thoroughly checked for reliability before it is applied; this is a ‘must’ in the case of all standardised tests. Anastasi describes the reliability of a test as the “consistency of scores obtained by the same individuals on different occasions or with different sets of equivalent items.” It is the internal consistency of any test.
There are different methods of measuring the reliability of a test, and the same method is sometimes described under different names. The American Psychological Association has drawn up a list of the conventional reliability coefficients that are in use; the principal ones are discussed in the sections that follow.
Coefficient of Correlation:
In order to calculate the reliability of a test, it is essential to have an idea of correlation coefficient whose statistical mark of identity is ‘r’. Correlation coefficient is a ratio which expresses an amount of relation or correspondence between two sets of scores. If a group of persons get scores in the same order on a second test as on the first, there is a full positive correspondence between the two tests, and the ‘r’ is + 1.00. Again, a set of scores on a certain test may have absolutely no relation with another set of scores.
That is, a person who stands first in the first test may occupy any place in the second test. Similarly, other persons may also get ranks without any consistent relation to their former ranks. In this case the ‘r’ will be 0.00. If the persons who do well in the first test consistently do badly in the second, there is a negative relation between the two.
Here ‘r’ is -1.00. In other words, the coefficient of correlation or ‘r’ varies on a linear scale ranging from +1.00, through 0.00, to -1.00. “A positive correlation indicates that large amounts of the one variable tend to accompany large amounts of the other; a negative correlation indicates that small amounts of the one variable tend to accompany large amounts of the other. A zero correlation indicates no consistent relationship.”
There are several methods of computing the correlation coefficient. When one has only rank-order data, the most practical method is the Spearman rank-difference method, which yields ‘rho’, the rank-difference coefficient of correlation. For the purposes of research and test construction, the best method of computing the correlation coefficient is the Pearson product-moment method.
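The computation of the two coefficients mentioned above can be illustrated with a small sketch (invented scores; numpy assumed available). It computes the Pearson product-moment r directly and the Spearman rho from the rank-difference formula.

```python
# Illustrative sketch (invented scores): Pearson product-moment r and
# Spearman rank-difference rho for two sets of test scores.
import numpy as np

first_test = np.array([12, 18, 25, 31, 40, 44, 50, 57])
second_test = np.array([15, 22, 20, 35, 38, 47, 60, 49])

# Pearson product-moment coefficient.
r = np.corrcoef(first_test, second_test)[0, 1]

# Spearman rho: rank each set of scores (no ties here), then apply
# rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the rank difference.
def ranks(x):
    order = x.argsort()
    rk = np.empty_like(order)
    rk[order] = np.arange(1, len(x) + 1)
    return rk

d = ranks(first_test) - ranks(second_test)
n = len(first_test)
rho = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))

print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```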
Some of the accepted methods of finding out test-reliability are discussed here:
Retest Reliability:
In the test-retest method the same test is applied twice, and the correlation coefficient between the two sets of scores is calculated. The reliability coefficient (r11 or rtt) is called the coefficient of stability. It is to be noted that the coefficient of reliability is sometimes represented by the symbol rtt (as by Guilford) and sometimes by r11 (as by Garrett).
The difficulty of this method lies in the fact that the time elapsing between the two applications may produce practice effects on the second application. Suppose the test items are based on reasoning: once the pattern of reasoning is understood, the items will be easily solved on the second occasion. Especially in the case of public examinations this method has no scope for application.
Equivalent form of Reliability:
To avoid the difficulties of the test-retest method, different forms (equivalent, alternate, parallel) of the same test may be applied successively on the same group of pupils. The coefficient arrived at by this method is called coefficient of equivalence. Here the question of time interval and consequently practice effects does not arise.
The difficulty of this method is that it is not always possible to construct or find exactly equal or parallel forms. The number of items must be equal and the content of each parallel item must be the same. The difficulty and range of items also should be identical. Hence, the parallel forms are more or less like independent tests. These forms, however, may be utilised for many purposes other than that of checking test reliability.
Split-half method:
In the split-half method the test requires to be applied only once. In this method, the same test is split into two, and, accordingly, two sets of scores on the same group of individuals can be arrived at. As the test is applied once only, the question of time gap and of practice effect does not arise at all. The difficulty lies in the splitting of the test into two equal halves. The two halves must be equal in respect of number, level of difficulty, variability and content of items.
There are many ways of such splitting. One method is to apply the test as a whole on a sample of population, find out the difficulty level by the percentage of pass on each item, and then divide the test into two halves in accordance with the level of item difficulty and content.
Then by application on some other group, the correlation coefficient between two sets of scores, obtained on the two halves, may be computed. From the half-test reliability, the self-correlation of the whole test is estimated by the Spearman-Brown formula. This correlation coefficient is called the split-half reliability.
Another easy method of splitting the test is to divide it into two by taking the odd and the even items. In order to split it into odds and evens, the test items should first be arranged from the least to the most difficult. The test may then be applied once, and the reliability calculated.
The degree of reliability depends on the length of the test, or on averaging the results obtained from several applications of the test, or from alternate forms. Usually the effect of length on the split-half reliability is corrected by the Spearman-Brown (Prophecy) formula, which is:

$$r_{nn} = \frac{n\,r_{11}}{1 + (n - 1)\,r_{11}}$$

where r11 is the reliability of the original test and n is the ratio of the new length of the test to the original length, i.e., if the number of test items is increased from 25 to 100, n will be 4; if it is decreased from 60 to 30, n will be 1/2.

Applied to split-half reliability, the formula involves doubling the length of the test, and it then takes the simpler form given under the split-half technique below.

The Split-Half Technique:

From the self-correlation (rhh) of the half-test, the reliability coefficient of the whole test may be estimated from the formula

$$r_{11} = \frac{2\,r_{hh}}{1 + r_{hh}}$$

When the reliability coefficient of one-half of a test is .60, it follows from the above formula that the reliability of the whole test (r11) is .75.
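A small sketch of the split-half procedure and the Spearman-Brown step-up is given below. The item responses are invented, and the odd-even split is only one of the possible ways of halving a test.

```python
# Illustrative sketch (invented data): split a test into odd and even items,
# correlate the half scores, and step the half-test correlation up to the
# full-length reliability with the Spearman-Brown formula.
import numpy as np

rng = np.random.default_rng(2)
n_persons, n_items = 30, 20
ability = rng.normal(0, 1, n_persons)
# 0/1 item responses: probability of passing rises with ability (invented model).
p_correct = 1 / (1 + np.exp(-(ability[:, None] - rng.normal(0, 1, n_items))))
responses = (rng.random((n_persons, n_items)) < p_correct).astype(int)

odd_half = responses[:, 0::2].sum(axis=1)
even_half = responses[:, 1::2].sum(axis=1)

r_hh = np.corrcoef(odd_half, even_half)[0, 1]   # half-test correlation
r_full = 2 * r_hh / (1 + r_hh)                  # Spearman-Brown step-up

print(f"half-test r = {r_hh:.2f}, estimated whole-test reliability = {r_full:.2f}")

# General form for lengthening a test n times:
#   r_nn = n * r_11 / (1 + (n - 1) * r_11)
```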
Inter-Item Consistency:
The fourth method is to test the reliability by finding out the inter-item consistency, i.e., the consistency of responses on each item. In this method both the equivalence and the homogeneity of items may be tested. If the items are not homogeneous in content, the coefficient of inter-item consistency may be low in spite of a high split-half or equivalent-form reliability. In fact, the difference between the split-half coefficient and the inter-item consistency coefficient may be taken as an index of the heterogeneity of the items.
Kuder and Richardson have evolved various formulae for finding out inter-item consistency. In this method it is not required to divide the test into two forms or equal halves; the test is applied only once, and the consistency is calculated on the basis of the performance of the examinees.
Among the various K-R formulae, the following (K-R formula 20) is the most commonly used:

$$r_{11} = \frac{n}{n - 1}\left(1 - \frac{\sum pq}{\sigma_t^{2}}\right)$$

where r11 is the reliability coefficient of the whole test,
n = the number of test items,
σt = the standard deviation of the total scores on the test,
p = the proportion passing each item, and
q = the proportion failing each item.
The K-R reliability coefficient is not really very different from the split-half reliability; it may rather be described as the mean or average of the various split-half reliability coefficients of the same test. If the test items are not homogeneous, the K-R coefficient will be lower than the split-half reliability. Both methods concern inter-item consistency, because in both the test is applied only once, but the coefficients obtained by the two methods may be quite different because of differences in the homogeneity of the items. Hence, whenever an inter-item consistency coefficient is referred to, it is necessary to mention the method.
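The K-R formula 20 given above can be illustrated with a short sketch (invented 0/1 responses; numpy assumed available):

```python
# Illustrative sketch of Kuder-Richardson formula 20 (KR-20) on a small
# invented matrix of 0/1 item responses (rows = examinees, columns = items).
import numpy as np

responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
])

n = responses.shape[1]             # number of items
p = responses.mean(axis=0)         # proportion passing each item
q = 1 - p                          # proportion failing each item
total = responses.sum(axis=1)      # total score of each examinee
var_total = total.var(ddof=0)      # variance of the total scores

kr20 = (n / (n - 1)) * (1 - (p * q).sum() / var_total)
print(f"KR-20 reliability = {kr20:.2f}")
```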
Validity:
The validity of an examination depends upon the efficiency with which it measures what it attempts to measure. While measuring the validity of a test, it is to be remembered that a test cannot be described valid in general terms. Validity is a relative term. It is to be tested only in relation to the purpose for which the test has been constructed, and in relation to the expected level of the ability of the pupils concerned.
That is, a test constructed to measure the attainment of secondary school pupils in Economics with high validity may have low validity in relation to college pupils. The usual method of judging the validity of a test is to compare and correlate its scores with those on similar tests.
The American Psychological Association, in its technical recommendations, has classified the various methods of testing validity into four groups:
(1) Content validity,
(2) Predictive validity,
(3) Concurrent validity, and
(4) Construct validity.
Content Validity:
Content validity may also be described as curricular validity. If the test is on some academic subject, for example, it is to be seen that the curriculum in that subject has been fully covered and that the objectives of the curriculum are satisfied. Estimation of content does not merely mean estimation of factual knowledge, but understanding of the general principles and their application to particular fields.
A reading test, for example, must take note of the desirable skills in reading.
Reading for information, or ‘work-level type’ reading, includes the following essential skills:
Skill in recognising new words, ability to locate material quickly, ability to comprehend quickly what is read, ability to select and evaluate material needed, ability to organise what is read, remembrance of material read, attitude to reading and proper care of books and so on.
Validity of reading test (for information) will be judged by the extent to which the above skills are measured, and also by the opportunity of the pupils to master these skills. Any educational test must fulfil this criterion.
Predictive Validity:
Test results are expected to predict the future performance of the examinees. When the later performance tallies with the expectations, the test is said to have predictive validity. For example, prediction is made from test scores on mathematics, to students’ achievement in science subjects. The validity of admission tests in the case of professional colleges will be judged by the correlation between the scores on the admission test and the end tests.
Scholastic aptitude tests, vocational aptitude tests, interest inventories, etc., must be tested by the criterion of predictive validity. That is, in order to judge the predictive validity of a test it is necessary to follow up the group of examinees to see how they achieve on such criteria as job performance or grades in schools or colleges. Predictive validity cannot be judged by mere analysis of the contents of a test.
“The basic procedure in studying the predictive validity of a test is:
(1) To administer the test to a group of students or prospective employees,
(2) Follow them up and obtain data for each person on some criterion measure of his later success, and
(3) Compute a coefficient of correlation between individuals’ test scores and their criterion scores, which may represent success in college, in a specific training programme, or on the job.
Such a coefficient of correlation may be called predictive validity. We can interpret predictive validity coefficients in terms of the standard error of estimate of predicted scores.
The formula for the standard error of estimate of a predicted criterion score is:

$$SE_{est} = SD_y\sqrt{1 - r^{2}}$$

where SD_y is the standard deviation of the criterion scores and r is the predictive validity coefficient.”
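A minimal sketch of this procedure, with invented admission-test and criterion scores, is given below; it computes the predictive validity coefficient and the standard error of estimate from the formula above.

```python
# Illustrative sketch (invented data): a predictive validity coefficient
# between admission-test scores and later criterion scores, and the standard
# error of estimate SE = SD * sqrt(1 - r^2) for predictions from the test.
import math
import numpy as np

admission_test = np.array([55, 62, 48, 70, 65, 58, 75, 60, 52, 68])
later_grades   = np.array([60, 66, 50, 78, 70, 59, 80, 64, 55, 72])

r = np.corrcoef(admission_test, later_grades)[0, 1]   # predictive validity
sd_criterion = later_grades.std(ddof=0)               # SD of criterion scores
se_est = sd_criterion * math.sqrt(1 - r ** 2)         # standard error of estimate

print(f"predictive validity r = {r:.2f}")
print(f"standard error of estimate = {se_est:.2f} criterion-score units")
```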
Concurrent Validity:
It means relation to other evidences of ability at the same time. There is a distinction between concurrent validity and predictive validity. While predictive validity refers to future ability, concurrent validity refers to existing states. Anastasi explains the difference very aptly by an example – “The difference can be illustrated by asking – Is Smith neurotic? (Concurrent validity). Is Smith likely to become neurotic? (Predictive validity).”
Concurrent validity may be found by correlating present test scores with ratings and with scores on other existing tests. Correlation with teachers’ ratings is an example of such validation. The difficulty with teachers’ ratings lies in the ‘halo effect’, i.e., the tendency of the teacher to rank a child high in all attributes if the child is found to excel in some respect. Hence rating scales also require to be prepared with great care.
It may also be found by correlating scores with those on other comparable tests, e.g., scores on the school final examination may be correlated with the school test examination which selects pupils appearing in the school final examination. In order to validate multiple choice questions on spelling and arithmetic; for example, students’ scores on dictation tests of spelling and on arithmetic tests involving actual calculation, may be used as the external criteria.
Construct Validity:
When tests are constructed to measure any particular attribute or trait, it is necessary to define the area of that trait very carefully. “Construct validity of a test is the extent to which the test may be said to measure a ‘theoretical construct’ or ‘trait’. There are various methods of finding out the exact nature of a trait, characteristic or ability”. “Age-differentiation, correlations with other tests, factor analysis, internal consistency, and effect of experimental variables on test scores” are some of the methods to find out construct validity.
In the case of intelligence tests, age is a major criterion; that is, scores are expected to show an increase with growth in age. Correlation with other tests refers to the same procedure of validation as described in the case of content validity. The only difference is that the correlations should be moderately high, but not very high. A very high correlation would mean a mere duplication of the test. A moderate correlation would mean that the new test measures broadly the same area as the established test without simply duplicating it.
Factor analysis is another method of locating specific traits, by correlating each test with every other of a certain group of tests applied together. A cluster of correlations among a few tests is accepted to indicate the presence of a common trait. The factor loadings or correlation of the test with each factor is known as factorial validity of the test.
In the internal consistency method the criterion is the total score on the test itself. The sub-test scores may be correlated with the total score. This method was applied in a recent statistical study of examinations in Sanskrit and other languages by the N.C.E.R.T. The total score of a student was accepted as the criterion score, being ‘a fair approximation to his true ability in the subject’, and the question items were taken as sub-tests. The validity of the question items was studied by correlating the item scores with the total scores.
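A rough sketch of the internal-consistency approach is given below (invented data; the variable names are hypothetical). Each item, treated as a sub-test, is correlated with the total score, which serves as the criterion.

```python
# Illustrative sketch: correlating each item (sub-test) with the total score,
# the total score being taken as the criterion. Data and names are invented.
import numpy as np

rng = np.random.default_rng(3)
n_students, n_items = 40, 6
ability = rng.normal(0, 1, n_students)
item_scores = np.clip(
    ability[:, None] * rng.uniform(0.5, 1.5, n_items)
    + rng.normal(0, 1, (n_students, n_items)), 0, None)

total = item_scores.sum(axis=1)    # criterion: total score on the test
for i in range(n_items):
    r_item_total = np.corrcoef(item_scores[:, i], total)[0, 1]
    print(f"item {i + 1}: item-total r = {r_item_total:.2f}")
# Items with very low item-total correlations contribute little to what the
# test as a whole measures and would be candidates for revision.
```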
Other Criteria:
Besides the above statistical criteria to judge the reliability and validity of tests, there are other practical means to measure the value of tests.
These are:
(1) administrability,
(2) scorability,
(3) objectivity,
(4) economy, and
(5) utility.
The test should be constructed in such a manner that it may be administered easily. It means, from the point of view of students, that the questions may be easily grasped by them, and within their capacity to answer.
From the point of view of teachers, the test should involve less time in preparation. Then the test requires to be easily scorable. The scoring should not involve much labour on the part of the teachers. Expense and utility are other factors which also must be taken into account. Besides, the test must be objective as far as possible. The examiners’ point of view should have no influence on the application and the scoring of the test. The measurement of the pupil’s knowledge should be the only consideration.
In this connection, a brief reference is made to the scoring devices. Students’ scores on any test or examination represent items answered correctly. These scores are known as raw scores, unless they are treated statistically. Through raw scores we may at most learn the relative position of a pupil in a certain group in relation to a certain subject-matter. But these are not comparable between groups.
There are various ways of converting raw scores. They may be changed into standard scores, i.e., Z-scores, T-scores and the like, ‘based on the difference of the student’s score from the group average, expressed in SD units or some multiple thereof.’
They may be changed into percentile scores or normalised standard scores, ‘based on the relative position, or rank, of the student’s score within the group of all students tested, or some defined reference group.’ Thirdly, they may be converted into age or grade scores, like ‘the average age or grade status of students obtaining the same score.’ These points may be looked up in any book on educational statistics.
An explanatory note on the need for converting raw scores into some relative measure is quoted from Anastasi:
“First, they indicate the individual’s relative standing in the normative sample, and thus permit an evaluation of his performance in reference to other persons. Secondly, they provide comparable measures which make possible a direct comparison of the individual’s performance on different tests.
If, for example, we find that a given individual has a raw score of 40 on a vocabulary test and 22 on an arithmetic reasoning test, we obviously know nothing about his relative performance on the two tests. Is he better in vocabulary or in arithmetic, or equally good in both? Since raw scores on different tests are usually expressed in different units, a direct comparison of such scores is impossible.
The difficulty level of the particular test would also affect such a comparison between raw scores. Converted scores, on the other hand, can be expressed in same units and referred to the same or to closely similar normative samples for different tests. The individual’s relative performance in many different functions can thus be compared.”
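The conversions mentioned above can be illustrated with a short sketch (invented raw scores): z-scores, T-scores and percentile ranks are computed for two tests so that an individual’s standing on the two can be compared.

```python
# Illustrative sketch (invented scores): converting raw scores into z-scores,
# T-scores (mean 50, SD 10) and percentile ranks, so that performance on
# different tests can be compared on common scales.
import numpy as np

vocabulary_raw = np.array([40, 35, 28, 45, 33, 38, 30, 42])
arithmetic_raw = np.array([22, 18, 15, 25, 17, 20, 14, 23])

def converted(raw):
    z = (raw - raw.mean()) / raw.std(ddof=0)           # standard (z) scores
    t = 50 + 10 * z                                    # T-scores
    # Percentile rank: percentage of the group scoring at or below each score.
    pct = np.array([(raw <= x).mean() * 100 for x in raw])
    return z, t, pct

for label, raw in [("vocabulary", vocabulary_raw), ("arithmetic", arithmetic_raw)]:
    z, t, pct = converted(raw)
    print(label, "z:", np.round(z, 2), "T:", np.round(t, 1), "percentile:", pct)

# A raw score of 40 on the vocabulary test and 22 on the arithmetic test cannot
# be compared directly, but their z-scores, T-scores or percentile ranks can.
```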
Teacher made tests may be of essay type or objective type.
Horace Mann was in favour of essay-type written tests, as against oral tests.
He was the president of the Boston Examination Project, the report of which came out in 1845.
He advanced the following reasons in favour of written essay type examinations against the system of oral tests:
(1) It is impartial
(2) It is just to the pupils
(3) It is more thorough than older form of examination
(4) It prevents the “officious interference” of the teacher
(5) It determines beyond appeal or gainsaying
(6) It takes away “all possibility of favouritism”
(7) It makes the information obtained available to all
(8) It enables all to appraise the ease or difficulty of the questions.
Vernon has classified the general objectives and functions of written examinations into certain groups:
(1) Examinations are necessary to test the level of attainment of pupils. These may be described as educational barometers or achievement tests.
(2) The quality of teaching and school management can be assessed through examinations.
(3) The main value of examination lies in their predictive character and guidance of pupils. An analysis of specific disability and backwardness or excellence in certain areas of learning may lead to valuable guidance in the matter of selection of subjects, professions and eradication of difficulties. Kandel also accepts the main purpose of examination to be guidance.
(4) The capacity to pass an examination indicates not only ability in certain areas of learning but also reveals some personality qualities. Preparation for examination means diligence, perseverance and submitting oneself to strenuous discipline.
(5) Last, though not least, examinations work as a stimulus to study. They make both teacher and pupil work hard, and thus help to raise the standard of teaching.
Unfortunately, examinations do not always fulfil the above purposes; especially, they do not fulfil the statistical criteria of reliability and validity. They are extremely subjective from the point of view of scoring, and they suffer from function fluctuation. Besides, a halo effect has been noticed in teachers’ markings. Another serious complaint is that a few questions cannot adequately sample the extent of the individual’s knowledge of the total subject.
The defects of the traditional essay-type examinations led to the development of objective-type tests, or new-type tests. In the case of objective-type tests, the scoring can be done purely objectively. The sampling of the area of learning can be done in a much better way: since the questions are numerous and short, practically the whole syllabus can be covered. The objective items are of different forms – short answer, true-false, matching and multiple choice. Objective tests may be informal and made by teachers, or standardised and made by specialists.
The main advantages of objective-type tests are, first, that they provide scope for extensive sampling and, secondly, that the subjective element of the examiner’s personal opinions, views and attitudes does not affect the scoring. But the disadvantages are that they do not provide any scope for training in the organisation of thought and good expression in language; secondly, they emphasise factual knowledge and recapitulation; and thirdly, the guessing factor cannot be fully avoided.
The main advantage of the essay-type examination, on the other hand, is that it may reveal some higher mental abilities, like the power of understanding, appreciation, criticism and the expression of reasoning in an organised manner, and it also prepares the student to synthesise ideas and express the deductions in good and precise language.
Steps involved in construction of standardised achievement tests:
(1) Specify the behavioural objectives. We have already mentioned the type of objectives to be considered under the Taxonomy of Educational Objectives by Dr. Benjamin S. Bloom.
(2) A test must be reliable and valid. The various methods of testing reliability have been mentioned earlier. Reliability means consistency but only reliability will not do. The test must be valid.
“Both reliability and validity depend ultimately upon the characteristics of the items making up the test. Any test can be improved through selection, substitution, or revision of items. Item analysis makes it possible to shorten a test and at the same time to increase its validity and reliability. Other things being equal, a longer test is more valid and reliable than a shorter test”—Anastasi.
After the specification of the behavioural objectives comes the question of item selection. The teacher will have to decide upon the types of objective items to be used, like true-false, multiple choice, matching, completion, essay or recall, etc. It is preferable to use only one type of item within a subtest, which is less confusing for students.
The items may be arranged in order of difficulty, i.e., from easy to hard, and then tried out on a small sample. A time limit should then be set, which will depend upon the age of the pupils, the length of the class hour, and the purpose of the test (whether it is for survey or diagnostic purposes, and whether it is a speed or a power test).
Next comes the question of item analysis. Computation of the difficulty and validity or discriminative power of an item is called item analysis. The difficulty of an item depends upon the number of examinees in the try-out group answering it correctly. An item answered correctly by 90% of the group is, naturally, easier than one answered by 10% of the group. Very hard and very easy items are ordinarily less useful than items of intermediate difficulty.
“The validity or discriminative power of an item depends on how well it distinguishes between the brightest and dullest pupils in the group. If all the members of the experimental group answer an item correctly, or if none does, the item has no validity, since in neither case does it separate the good from the poor members of the class.”
Garrett recommends the method of bi-serial r to determine the validity of items in the tests. “By means of bi-serial r, we can compute the correlation between success and failure on a single item and the size of the total score on the test, or on some other measure of performance taken as the criterion. The size of the correlation between the item and the test scores shows how well the item is working together with the other items—as a member of the team. Items unrelated to total scores are discarded.”
Steps in the determination of item validity by use of bi-serial-r are:
(1) Arrange the total scores from the highest to the lowest.
(2) Count off the highest and lowest 27% of the papers, as nearly as possible. It has been pointed out by Garrett that when the distribution of ability is normal, the sharpest discrimination between the extreme groups is obtained when item analysis is based upon the highest and the lowest 27 per cent in each case. He cites an example: if there are 120 children in the standardisation group, 32 may be put in the top and 32 in the bottom group.
(3) Then count the number in the high group and the number in the low group who pass each item, and express these figures as percentages. Suppose, for example, that item no. 18 is passed by 60 p.c. of the high group and by 30 p.c. of the low group; from the tables we read that the bi-serial correlation between this item and the whole test is about .30. In general, any item with a bi-serial r of .20 or more can be taken as valid if the test is fairly long. In a short test, items of higher validity are needed. Both hard and easy items may be valid, that is, may have discriminative power.
(4) Then determine the difficulty value of each item by averaging the percentages of pupils that pass it in the high and the low groups. An item passed by 60 per cent of the high group and by 30 p.c. of the low group, for example, has a difficulty index of .45, that is, (.60 + .30)/2.
(5) It is pointed out that items with difficulty value of .50 or thereabouts are the best items, as these can discriminate very well between the good and poor students.
If the test is to cover a wide range of talent, as is required in most school examinations, Garrett suggests a plan to follow in selecting items:
(i) Take about 15% of items passed by 85-100 p.c. (very easy)
(ii) Take about 35% of items passed by 50-85 p.c. (fairly easy)
(iii) Take about 35% of items passed by 15-50 p.c. (fairly hard)
(iv) Take about 15% of items passed by 0-15 p.c. (very hard)
Items passed by 100 p.c. or by nobody have no validity in either case, but sometimes examiners like to give some very easy items at the beginning to gain confidence of pupils, and a few very hard items to test the very bright pupils.
(6) Then comes the question of distractors, when multiple-choice items are used. If one alternative is not chosen at all, it is not a good distractor; on the other hand, if a mislead is chosen by many from both the high and the low groups, the mislead must be made less attractive.
(7) It is advisable to maintain a file of items for future use. On one side of the card the items may be written.
On the other side of the card, one should write the:
(a) Size and character of the experimental group, on which the data are based,
(b) The validity of each item, i.e. the bi-serial-r with the test score,
(c) The difficulty value of the item, and
(d) Data on misleads.
A teacher must be acquainted with the meaning of bi-serial-r. Garrett points out that the correlation between a set of scores and a two-category classification like yes-no, true-false, pass-fail, cannot be found by the ordinary product moment formula, or by the rank difference method, and if the distribution of pupils is a normal one, the best method of computing the correlation is that of bi-serial-r.
The formula for computing the bi-serial r is:

$$r_{bis} = \frac{M_p - M_q}{\sigma_t} \times \frac{pq}{y}$$

where Mp = the mean of the scores made by the 60 students who answered ‘yes’ to item 72,
Mq = the mean of the scores made by the 40 students who answered ‘no’ to item 72,
σt (sigma) = the SD of the whole distribution, i.e., the distribution of all 100 scores, which gives the spread of the test scores of the entire group (11.63 in the above case),
p and q = the proportions of the group answering ‘yes’ and ‘no’ respectively, and
y = the height (ordinate) of the normal curve at the point which divides the p and q portions of its area.

[In the problem referred to, Mp = 60.08 and Mq = 55.00.]

Since 60% of the group answered ‘yes’ and 40% answered ‘no’ to item 72, and assuming a normal distribution of opinions on this item (varying from complete agreement, through indifference, to complete disagreement) upon which a dichotomous division is forced, the dividing line is placed between the ‘yes’ and the ‘no’ groups at a distance of 10% of the area from the middle of the curve.
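A sketch of this computation, using the figures quoted above (Mp = 60.08, Mq = 55.00, σ = 11.63, p = .60, q = .40), is given below. The ordinate y is obtained from the normal curve at the point that divides the ‘yes’ and ‘no’ proportions; the bisection routine used to locate that point is only a convenience for the illustration.

```python
# Illustrative sketch of the bi-serial r, using the figures quoted in the text.
import math

Mp, Mq = 60.08, 55.00      # mean scores of the 'yes' and 'no' groups
sigma = 11.63              # SD of the whole distribution of 100 scores
p, q = 0.60, 0.40          # proportions answering 'yes' and 'no'

# Find z such that the area of the normal curve above z equals p;
# a simple bisection on the cumulative normal suffices here.
def inv_norm_upper(p_above, lo=-6.0, hi=6.0):
    for _ in range(80):
        mid = (lo + hi) / 2
        area_above = 0.5 * math.erfc(mid / math.sqrt(2))
        if area_above > p_above:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

z = inv_norm_upper(p)                                  # about -0.25 for p = .60
y = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)      # ordinate at that point

r_bis = ((Mp - Mq) / sigma) * (p * q / y)
print(f"bi-serial r = {r_bis:.2f}")
```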
A short method has also been suggested when the teacher may not be able to use the bi-serial-r method of computation.
The steps are:
(1) The test papers may be arranged, in order of the highest score to the lowest.
(2) Count off the 25% best papers and the 25% poorest papers. If the total group is small, e.g., under fifty, take some larger proportion, say the upper half and the lower half. If there are 80 in the try-out sample, 20 (or 25%) fall in the high group and 20 in the low group, and each item may be examined to see whether it separates these two criterion groups.
(3) Determine the validity index. The numbers in the two criterion groups who answer each item correctly are to be counted. If 15 in the high group answer an item correctly, and 5 in the low group answer the same item correctly, the validity is 15 - 5 or 10, and the validity index is 10/20 or .50. In other words, validities are simply the differences between the numbers right in the two extreme groups. The chief advantage of a validity index is that it puts validities on a percentage scale, as are the difficulties.
The lowest validity index of an item by this method is of course 0/20 or .00.
The validity indices run from 0 to 1. Items having zero or negative validity must be re-written before they are used, or else discarded.
(4) So the formula for the validity index is (RH - RL)/NH.
Using the same nomenclature, the difficulty index of an item is (RH + RL)/(NH + NL), in which NH and NL are the numbers in the high and low groups respectively.
In the example above, wherein RH = 15 and RL = 5, the validity index is 10/20 or .50, and the difficulty index is (15 + 5)/(20 + 20) or .50. (A worked sketch of these computations is given after this list.)
(5) Select the item having the highest validity indices for the final test (following the tables provided in apportioning the difficulty values, if the test is to cover a wide range of talent).
(6) Next consider the distractors, and examine the misleads if multiple choice items are to be used.
(7) A card file of acceptable items may always be prepared to help the teacher if he wants to lengthen or shorten a test. When there are many items, parallel forms may be constructed.
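A worked sketch of the short method, with invented 0/1 answers, is given below; it counts the number right in the upper and lower 25 per cent of the papers and computes the validity index (RH - RL)/NH and the difficulty index (RH + RL)/(NH + NL) for each item.

```python
# Illustrative sketch of the short method of item analysis on invented data.
import numpy as np

rng = np.random.default_rng(4)
n_pupils, n_items = 80, 10
ability = rng.normal(0, 1, n_pupils)
p_pass = 1 / (1 + np.exp(-(ability[:, None] - rng.normal(0, 1, n_items))))
answers = (rng.random((n_pupils, n_items)) < p_pass).astype(int)  # 1 = correct

totals = answers.sum(axis=1)
order = np.argsort(totals)[::-1]      # papers from highest to lowest total
n_group = n_pupils // 4               # upper and lower 25 per cent (20 each)
high, low = order[:n_group], order[-n_group:]

for item in range(n_items):
    rh = answers[high, item].sum()    # number right in the high group
    rl = answers[low, item].sum()     # number right in the low group
    validity_index = (rh - rl) / n_group
    difficulty_index = (rh + rl) / (2 * n_group)
    print(f"item {item + 1}: validity {validity_index:.2f}, "
          f"difficulty {difficulty_index:.2f}")
```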
In short, there is not much difference between aptitude and achievement tests. Formerly, aptitude tests were believed to measure “innate capacity” independent of prior learning. But this was a misconception, and it has since been corrected. It is now accepted that every psychological test measures to some extent the individual’s current state of learning and experience; but while revealing the effects of past learning, the test scores may, under certain circumstances, serve as predictors of future learning.
The concept of “developed abilities” is coming to replace the conflicting traditional categories of aptitude and achievement in psychometrics. All ability tests, whether general intelligence tests, special aptitude tests or achievement tests, measure the level of attainment of the individual in some field or other. It is pointed out that much care must be taken in selecting a test, depending on the purpose of testing.
The traditional practice is to administer an achievement test at the end of a unit of course of study to determine whether students have attained the objective of instruction. Technically, this procedure is known as summative evaluation, that is, a test score is viewed as an end product, or summing up. In contrast, there is a formative evaluation which is based on the belief that the processes of instruction and evaluation should be integrated, and evaluation is continuous.
Not only has all educational measurement traditionally been summative, but it has also been norm-referenced rather than criterion-referenced. The purpose of formative evaluation is “to help both the learner and the teacher focus upon the particular learning necessary for movement towards mastery.”
Norm-Referenced and Criterion Referenced Tests:
A person’s score on a norm-referenced test is interpreted by comparing it with the distribution of scores obtained from some norm (standardisation) group. But a person’s score on a criterion-referenced test is interpreted by comparing it with an established standard or criterion of effective performance.
A particular achievement test can serve both as a norm-referenced test and as a criterion-referenced test. How much material a student has learned (the criterion-referenced function) and how his performance compares with that of other students (the norm-referenced function) can sometimes be determined by the same test.
Uses of Achievement Tests:
(1) They can be used as an aid to assignment of grade.
(2) They can facilitate learning and motivate the learner. The incentive value of “knowledge of results” is an accepted fact.
(3) They provide a means of adapting instruction to individual need. “Especially, the criterion-referenced tests with specified content domain will help the teacher to estimate the learner’s degree of knowledge obtained, his difficulty level etc. in specific subjects, and make corresponding arrangement to fill up the gaps.”
(4) The achievement tests may be utilised as aids in evaluation and improvement of teaching and in the formulation of educational goals.
Table of Tests Available with the Psychological Corporation
Order Service Centre
P.O. Box 83994, San Antonio, Texas 78283-3954
A. Cognitive/Intellectual Assessment
i. Wechsler Intelligence Scale for Children—3rd Ed.
ii. Wechsler Preschool and Primary Scale of Intelligence—Revised
iii. Wechsler Adult Intelligence Scale—Revised
iv. Differential Ability Scales
v. System of Multicultural Pluralistic Assessment
vi. Kendrick Cognitive Tests for the Elderly
B. Non-Verbal Cognitive/Intellectual:
i. Draw a person – Screening Procedure for Emotional Disturbance
ii. Draw a person – A Quantitative Scoring System
iii. Matrix Analogies Test – Expanded Form
iv. Matrix Analogies Test – Short Form
v. Raven’s Progressive Matrices
vi. Mill Hill Vocabulary Series
vii. Beta II
viii. Goodenough-Harris Drawing Test
ix. Columbia Mental Maturity Scale
x. Porteus Mazes
C. Individual Achievement/Basic Skills:
i. Wechsler Individual Achievement Test (WIAT)
ii. Basic Achievement Skills Individual Screener (BASIS)
iii. Wide Range Achievement Test—Revised (WRAT-R)
iv. Tests of Academic Performance
v. Multi-level Academic Survey Tests
vi. Test of Word Knowledge (TOWK)
vii. Test of Written Language—2
viii. S.A.M.I.—Standardised Inventory
D. Early Childhood/Infant Assessment:
i. Bayley Scales of Infant Development
ii. Cattell Infant Intelligence Scale
iii. Neurobehavioral Assessment of Pre-term Infant
iv. McCarthy Scales of Children’s Abilities
v. McCarthy Screening Test
vi. Miller Assessment for the Preschoolers
vii. MRT. Fifth Edition
viii. Boehm—Revised
ix. Boehm—Pre-school
x. Boehm—Resource Guide for Basic Concept Teaching
xi. Preschool Language Scale 3 (PLS 3)
xii. Bracken Basic Concept Scale
xiii. Bracken Concept Development Program
xiv. Diagnostic Inventory for Screening Children
xv. Pre-school Language Assessment Instrument (PLAI)
E. Adaptive Behaviour and Behaviour Rating Scales
i. Comprehensive Behaviour Rating Scale for Children
ii. Pupil Rating Scale Revised – Screening for Learning Disabilities
iii. Conners’ Rating Scales
iv. Normative Adaptive Behaviour Checklist
v. Learning Behaviour Scale (Research Edition)
vi. Study of Children’s Learning Behaviours
vii. Kohn Problem Checklist (Revised Edition)
viii. Kohn Social Competence Scale (Research Edition)
ix. AAMD Adaptive Behaviour Scale
x. Adaptive Behaviour Inventory for Children
F. Guidance and Counselling
i. Differential Aptitude Tests, Fifth Edition, Form C
ii. Career Interest Inventory
iii. Multidimensional Self Concept Scale
iv. Strong Interest Inventory
v. Self-directed Search
vi. Reading—Free Vocational Interest Inventory
vii. Gordon Occupational Checklist II
viii. Wide Range Interest—Opinion Test
G. Personality Assessment:
i. Beck Scale for Suicide Ideation
ii. Beck Depression Inventory
iii. Beck Anxiety Inventory
iv. Beck Hopelessness Scale
v. Reynolds Adolescent Depression Scale
vi. Suicidal Ideation Questionnaire
vii. Gordon Personal Profile-Inventory
viii. An Interpretive Guide to the Gordon Personal Profile-Inventory
ix. Edwards Personal Preference Schedule
x. High School Personality Questionnaire
xi. Children’s Personality Questionnaire
xii. Sixteen Personality Factor Questionnaire
xiii. Rust Inventory of Schizotypal Cognitions
xiv. California Psychological Inventory, Second Edition
xv. Mooney Problem Checklists
xvi. Eating Inventory
xvii. Jenkins Activity Survey
H. Projective Techniques:
i. Rorschach Techniques
ii. Holtzman Inkblot Techniques
iii. Rotter Incomplete Sentences Blank
iv. Children’s Apperception Test
v. Early Memories Procedure
I. Neuropsychological Assessment/Motor Impairment
i. Wechsler Memory Scale—Revised
ii. WAIS-R as Neuropsychological Instrument
iii. California Verbal Learning Test (Adult Version)
iv. Thinkable
v. Wide Range Assessment of Memory and Learning (WRAML)
vi. Attention Process Training
vii. Wisconsin Card Sorting Test
viii. Boder Test of Reading—Spelling Patterns
ix. Benton Visual Retention Test
x. Visual Aural Digit Span Test
xi. Western Aphasia Battery
xii. Bender Visual Motor Gestalt Test
xiii. Examining for Aphasia
xiv. Boston Diagnostic Aphasia Examination
xv. Multilingual Aphasia Examination (MAE)
xvi. Quick Neurological Screening Test—Revised
xvii. Minnesota Test for Differential Diagnosis of Aphasia
xviii. TOMI—Henderson Revision
J. Software:
i. WPPSI-R writer
ii. The Interpretative Software System, The Computerized Boston
iii. Differential Aptitude Tests – Computerized Adaptive Edition Report
iv. WISC-R Microcomputer-Assisted Interpretative
v. WAIS-R
vi. Rorschach Interpretation Assistance Program
vii. CVLT Administration and Scoring System
viii. McDermott Multidimensional Assessment of Children (M-MAC)