
Wednesday, February 22, 2012

Our Buddy went to Heaven

Two days ago, I finally decided it was time for our precious dog, Buddy, to have release from his cancer pain and be at peace. Over three months, he had a tumor growing on the side of his face that just got bigger and bigger. I kept looking for any signs of pain or decreasing quality of life, and those things didn't come. Then, after talking to some folks from church, and listening to our very good pastor's sermon, a peace entered my heart, and I knew then that it was time. Buddy refused to leave my side at the doctor's office, and I talked to him and held his head as he relaxed, and his heart stopped. I lost control of myself and wept as he went away. I asked the Lord before going to take Buddy in his arms, and now he is running and being happy again in heaven. The hole in my and my husband's hearts will always be empty and lonely for Buddy as he was a very, very sweet dog, and the 8 years he was with us were full of happiness. So, now we go on. I hope that others who have gone through this are able to gain peace as I hope to someday. Thanks for reading my note, and God bless everyone.

Surveys, Questionnaires, Test Instruments, and Ethical Considerations

Surveys, Questionnaires, and Ethical Considerations

Assessments of three test instruments were conducted as a preliminary review of the potential tools available for the research project described below. By locating and appraising such instruments, and critically weighing the strengths and weaknesses of each, valuable information can be gathered to support the decisions necessary to conduct testing for a research project. In addition to evaluating the instruments from the perspective of ethical considerations, emphasis should also be placed on their reliability and validity. After an initial review of approximately fifty test instruments, further evaluation of additional instruments is still required, which will aid this writer in deciding whether an existing instrument or a newly developed one would be the better choice for this research project.

Research Subject

My research topic addresses the reasons that America continues to drop in the global academic performance ranking at the high school level. As reported by the Organization for Economic Cooperation and Development (OECD) Programme for International Student Assessment (PISA) (2009), based in France, out of 70 countries the United States ranks 14th overall in reading performance, with Korea, Finland, and Canada holding the 1st, 2nd, and 3rd positions. PISA (2009) also reported that America was 25th in mathematics and 17th in science. Zeitvogel (2010) wrote that since the international ranking system began over a decade ago, America has dropped from first position to just average. This decline in educational performance has also been linked to economic performance (Zeitvogel, 2010).


Research Questions and Hypotheses

The following research questions were derived from the research project noted above. These questions are relevant for identifying potential test instruments because the researcher must focus on the answers being sought before deciding which tests to use. The questions for the selected mixed research methodology are:

1. Why has the ranking of 9th grade students in America's learning institutions continued to fall when compared with 9th graders from learning institutions around the world? (quantitative and qualitative)

2. What techniques and instructional philosophies are educators using in the higher ranking countries? (qualitative)

3. Has the implementation of instructional technologies contributed to greater academic performance? (qualitative)

4. Would a global initiative to improve academic performance be attainable? (qualitative)

5. Have the learning-to-learn competencies being discussed globally, as noted by Hoskins and Fredriksson (2008), contributed to greater academic performance? (qualitative)

The first research question above, which addresses the falling ranking of American 9th grade high school students compared with 9th graders in learning institutions around the world, can be investigated and possibly answered using both quantitative and qualitative data. However, because the goal here is to identify test instruments that measure quantifiable information, this paper addresses only the quantitative elements of interest. The first research question is quantifiable because test scores in science, mathematics, and reading comprehension must be gathered for American 9th grade students and compared to the scores of international 9th grade students.

Although PISA (2009) has gathered international achievement data for 9th graders for several years, new data collection is required to ensure the validity and reliability of the suspected (and reported) finding that American 9th grade students' achievement continues to decline when compared to that of international 9th grade students. Therefore, the research proposal includes (1) using one of the test instruments described below to analyze current American 9th grade students' scores, and (2) extending the review of the data collected by PISA to also include the data from the International Association for the Evaluation of Educational Achievement (IEA) (2011). Reviewing the data (scores) gathered by PISA and the IEA for all other countries will further validate the processes used by international associations to conduct achievement tests. These two steps will ensure (1) that the claims made about American 9th grade students' achievement declining compared to international students at the 9th grade level are correct, (2) that the process to answer the qualitative research questions can begin, and (3) that the mixed research results will enable a decision on the current hypothesis:

Null (H0): The academic performance ranking of America's 9th grade students does not continue to fall when compared to 9th grade students from other countries.

Alternate (H1): The academic performance ranking of America's 9th grade students continues to fall when compared with 9th grade students from other countries.
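
To make the quantitative portion of this hypothesis concrete, the following minimal sketch (in Python) shows one way the declining-ranking claim might be examined: fit a trend to the United States' rank across assessment cycles and test whether the slope is significantly positive (a larger rank number means a worse position). The years and ranks below are hypothetical placeholders, not actual PISA or IEA figures; the real analysis would use the newly collected scores described above.

# A minimal sketch of testing H0 vs. H1; the rankings are hypothetical placeholders.
from scipy import stats

assessment_years = [2000, 2003, 2006, 2009]   # illustrative assessment cycles
us_reading_rank = [15, 18, 21, 24]            # hypothetical U.S. ranks (1 = best)

# A falling ranking means the rank number increases over time, so a positive,
# statistically significant slope would support the alternate hypothesis (H1).
result = stats.linregress(assessment_years, us_reading_rank)
print(f"slope per year = {result.slope:.2f}, p-value = {result.pvalue:.3f}")

if result.slope > 0 and result.pvalue < 0.05:
    print("Reject H0: the ranking appears to be falling.")
else:
    print("Fail to reject H0: no evidence the ranking is falling.")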

Instrument One

Instrument's Description

Lowe's (2010) review of the test instrument, EXPLORE, for ninth graders indicates its design measures academic progress in English, math, reading, and science. EXPLORE is also a well-designed achievement test developed by ACT, Inc., and is highly valuable for preparing students for the ACT Assessment that most high schools in America administer. ACT's Career Planning Survey, an accompanying piece, is helpful for all students rather than just college-bound students in exploring career options. The quality of most of the materials for EXPLORE is exceptional, and the testing booklet and materials that students and parents receive are well-written and professionally done.

Information to be Collected

The 9th grade Student Report produces test scores in English, Mathematics, Reading, and Science Reasoning. Student scores are computer generated onto a grid illustrating the percentage of students in a national norm group who scored at or below the student's individual score in those areas. A composite score is also provided showing the average of the four test scores, and students are able to identify strong and weak areas.
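
To illustrate the "at or below" comparison shown on the Student Report grid, the short Python sketch below computes that percentage against a norm group. The norm-group scores are invented placeholders, not ACT norming data.

# A minimal sketch of the percent-at-or-below idea behind the EXPLORE report grid.
def percent_at_or_below(student_score, norm_scores):
    """Percentage of the norm group scoring at or below the student's score."""
    at_or_below = sum(1 for s in norm_scores if s <= student_score)
    return 100.0 * at_or_below / len(norm_scores)

norm_group_reading = [12, 14, 15, 15, 16, 17, 18, 19, 20, 22]   # hypothetical norm scores
print(percent_at_or_below(17, norm_group_reading))               # prints 60.0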

Target Population. Ninth graders.

Variables to be Measured. Scores for English, Mathematics, Reading, and Science.

The Format and Length of the Instrument

EXPLORE is a timed, paper-and-pencil test consisting of four 30-minute sections, for a total of 120 minutes, and it is computer-scored.

Method to be Used to Test for Reliability and Validity

Lowe (2010) reported that the reliability coefficient for the Science test was .78, and .88 for the English test. There are four forms of EXPLORE, and the reliability coefficient for the composite scores on each of the four forms is .95, with the highest standard error of measurement on Form 3 at 1.03 and the lowest on Form 2 at .93 (Lowe, 2010). Salkind (2009) reported that most tests' reliability coefficients fall in the .80 to .90 range. Therefore, the reliability of the EXPLORE instrument is substantial and would be very acceptable for the research project identified above. (Although the developers of EXPLORE expected a correlation between EXPLORE scores and students' grades, these correlations ranged only from .39 to .66. There was a higher correlation, up to .82, between EXPLORE and PLAN test scores.)
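
For context on how a reliability coefficient relates to the standard error of measurement values Lowe (2010) reports, the Python sketch below applies the standard psychometric relationship SEM = SD * sqrt(1 - r). The score standard deviation used here is a hypothetical value chosen for illustration only; Lowe's review, as summarized above, does not report it.

# A minimal sketch relating reliability to the standard error of measurement.
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r); sd is a hypothetical score standard deviation."""
    return sd * math.sqrt(1.0 - reliability)

# With a composite reliability of .95, an SEM near 1.0 would imply a score
# standard deviation of roughly 4.5 scale points (4.5 * sqrt(0.05) is about 1.0).
print(standard_error_of_measurement(4.5, 0.95))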

The Appropriateness of the Instrument for the Research Subject

As noted above, Lowe (2010) reported that EXPLORE's correlation coefficients were in the usual range for test instruments, which suggests a substantial level of reliability because the instrument produced similar scores across test administrations. Conducting this test in a test-retest procedure would, therefore, be expected to provide reliable results. EXPLORE was also designed after (1) carefully studying the instructional objectives for Grades 6 through 9 from states that had published objectives, (2) reviewing state-approved textbooks for Grades 6 through 8, and (3) consulting educators of Grades 6 through 12 and at the postsecondary level. Lowe (2010) found that any test administrator following the directions in the EXPLORE manual would have no difficulty administering the test, and found the Program Guide to be exceptionally thorough and forthcoming with useful information.
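
The test-retest check mentioned above is straightforward to express: correlate the same students' scores from two administrations of the test. The Python sketch below does this with scipy; the paired scores are hypothetical placeholders, not EXPLORE results.

# A minimal sketch of a test-retest reliability estimate (Pearson correlation
# between two administrations); the scores below are hypothetical placeholders.
from scipy import stats

first_administration = [18, 22, 15, 20, 25, 17, 21, 19]
second_administration = [19, 21, 16, 22, 24, 16, 22, 18]

r, p = stats.pearsonr(first_administration, second_administration)
print(f"test-retest reliability r = {r:.2f} (p = {p:.3f})")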

"EXPLORE was originally normed in 1992 with over 4,000 eighth grade students, and the sample was fairly representative of the national population in terms of gender, with some discrepancy in the demographic characteristics of race and geographic region" (Lowe, 2010, par. 6). EXPLORE was again normed in 1995 on over 14,000 eighth grade students, and "the sample was stratified using school size, public or private status, and geographic region" (Lowe, 2010, par. 6), and in 1999 with 4,789 eighth graders and 6,660 ninth graders with some discrepancy in the characteristics of race and geographic region, which could have been due to different cohorts or a sampling error. Because the developer conducted repeated norming analyses, which compares each student's performance on the test to the performance of a representative sample of public school students of the same age and grade, the resulting individual student scores that are reported to the student, parents, school, and national educational organizations provides greater relevance.

Other factors relative to the reliability of EXPLORE that Lowe (2010) did not mention, and that would need to be obtained, include whether the reported correlation coefficients reflect other important reliability components. For example, Lowe's (2010) review indicated that the test was easy to administer and take, which increases reliability. However, other components that affect reliability, such as whether testing conditions were similar and whether external events were controlled, should also be examined.

Factors relative to validity of EXPLORE would also need to be investigated. For example, Lowe (2010) indicates that content validity was achieved due to the test being normed multiple times using large representative samples, and because EXPLORE's test questions were extracted from the instructional objectives for Grades 6 through 9 from states that had published objectives. Criterion validity of EXPLORE could be better assessed by conducting testing sessions for 9th graders with less time between tests, e.g., annually, so that EXPLORE's goal to estimate present academic performance would be more valid.

Lowe (2010) did not account for many of the features relative to ethical testing practices, which certainly bear on whether EXPLORE could be used in the research project. Lowe (2010) remarked that the test's results were very beneficial for students and parents in that each student's strengths and weaknesses were scored against like students. Several of the materials accompanying the EXPLORE test were designed as student handouts, which benefitted students in career-tracking and college-tracking. Other ethical features needing further investigation, not mentioned by Lowe (2010), include whether anonymity concerns were addressed, whether coercion was involved or a signed consent obtained, whether published reports included group rather than individual performance, and whether EXPLORE was designed using information from professional organizations.

Instrument Two

Instrument's Description

Adams (2010) reported that the Stanford Achievement Test represents a "high level of skill in item writing and makes use of adequate samples in item analysis and standardization, and…have a high ceiling" (par. 15), which is frequently helpful for high school students focusing on admission to more selective post-secondary schools. Unfortunately, the tests do not appear "adequate for (a) use in curriculum evaluation (except as item norms are used), (b) measurement of intra-individual differences (except for upper-grade students in major fields), or (c) study of individual student gains from year to year" (Adams, 2010, par. 15). Items (a) and (b) could improve if a greater number of difficult questions were included, and item (c) would improve if data on individual gains were available.

The Stanford Achievement High School Basic Battery instrument provides a good description of the content sampled, and carefully classifies items by content. However, the classifications did not cover the type of behavior each item measures. Another shortcoming reported by Adams (2010) is that highly useful information could have been provided by the Stanford instrument if data had been cast in "the form of expectancy tables showing what performance on the Stanford battery may be anticipated from students having various Otis IQ's" (par. 24); such information is not generally provided by most current achievement test batteries, and it is relevant to the research project.

Information to be Collected

Adams (2010) wrote that the Stanford Achievement Test appears to provide achievement test items derived from a clearly defined, universal set of content. The potential user who finds that the test's content sample reflects subjects of high interest may not find a different achievement test that could serve his or her purpose more efficiently. More detail is provided below.

Target Population. Ninth graders, although the Stanford test provides testing for several high school grades.

Variables to be Measured. Scores for math, science, and reading.

The Format and Length of the Instrument

Adams (2010) assessed the technical quality of the Stanford instrument as high and its general usability as above average, and indicates that each test is 40 minutes in length. The greatest deficiency of the battery, according to Adams (2010), was the manner in which the manual's information is presented. For example, three separate booklet tests are located in the same section of the manual as the information for the basic battery tests, a slight inconvenience that could cause confusion for teachers and school administrators who are not completely familiar with the Stanford Achievement Tests or the basic battery tests that complement them. Also, presenting the means for various norm groups in one table on one page of the manual, and the associated standard deviations in another table on a different page, was unnecessarily confusing. Adams (2010) was also critical of the manual requiring the user to first convert raw scores to standard scores and then, in a different table, convert standard scores to stanines and percentile ranks, when a direct translation of the raw score to the derived score would be possible.
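
The "direct translation" Adams (2010) recommends can be pictured as a single lookup table that maps a raw score straight to all of its derived scores, instead of sending the user through two separate tables. The Python sketch below illustrates the idea; all values are hypothetical placeholders, not actual Stanford norms.

# A minimal sketch of a one-step raw-score-to-derived-score lookup table.
conversion_table = {
    # raw score: (standard score, stanine, percentile rank) -- hypothetical values
    30: (55, 5, 48),
    31: (56, 5, 52),
    32: (58, 6, 61),
    33: (60, 6, 67),
}

raw = 32
standard, stanine, percentile = conversion_table[raw]
print(f"raw {raw} -> standard {standard}, stanine {stanine}, percentile {percentile}")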

Methods Used to Test for Reliability and Validity

Adams (2010) reported that the Stanford test developers provided four estimates of each test's reliability at each grade level. Unfortunately, the estimates were derived using internal consistency methods rather than estimating the reliability coefficients "across forms and across testing occasions" (Adams, 2010, par. 10). Therefore, even though 75 of 108 reliability coefficients were .90 or above, these figures likely overstate the reliability users would observe across forms and occasions. Adams (2010) also wrote that achievement test users want comparisons (and reliability coefficients with ranges) within a single school or school system, but the Stanford test information did not provide such data. On a positive note, the "norming sample was large (22,699 students in 58 schools), and selected so as to be representative of nine geographic regions; regional representation was further checked against two socioeconomic indices" (Adams, 2010, pars. 10-13), which are reported "to be positively related to educational achievement" (pars. 10-13).

Adams (2010) reported that test questions "appear to be consistently well designed" (par. 2). However, Adams (2010) reported issues with reliability and validity:

Although textbooks and courses of study were surveyed and teachers and subject matter specialists consulted, no evidence is presented concerning the extent of their agreement on such important decisions as having about 45 percent more items in physical science than in biological science. One would like to be assured, however, that a number of well-informed judges participated in making the decisions concerning the proportion of items for each of the major content categories; and it would be desirable to know the extent of agreement among judges. (par. 3).

Adams (2010) found that the Stanford instrument should have described the "tryout example for each grade…in terms of the percentage of students at each grade level who had taken advanced mathematics and science courses" (par. 7). If the Stanford scores showed that a high school scored above average on item norms, inferring that the school was doing an excellent instructional job would be incorrect because the above-average scores could be due to a "higher-than-average percentage of students enrolled in these elective subjects" (Adams, 2010, par. 7). Adams (2010) additionally reported that the Stanford test's norms were collected through national testing rather than standardized state-wide testing, and no data were provided for the national testing norms. Although the sample sizes were impressive, which typically improves reliability, no claims were published regarding the representativeness of the tryout sample.

Therefore, Adams (2010) rightly concludes that even if the sample was representative, "item difficulty values obtained in October would undoubtedly be different from those obtained in March" (par. 7). Adams (2010) also disclosed that although the Stanford test's development used many approved psychometric techniques, the 40-minute tests could not serve as sensitive indicators of growth in achievement. These weaknesses in reliability, particularly the test's difficulty in revealing growth in academic achievement, which is very important for the research project, argue against using this test.

Adams' (2010) major complaint involves the instrument's "lack of validity data which are specifically relevant to important school decisions that might be improved if they were made in the light of the Stanford scores" (par. 25). The inclusion of one or two examples of effective uses for the Stanford's test scores could greatly benefit the educational quality for students. Other ethical considerations noted in Adams' (2010) summary include that the test's manual reports difficulties for individual students in interpreting academic performance from year to year, especially when compared to like student populations. Such problems arise from the Stanford test not being subjected to retesting and from regression analysis not being completed, which would aid in interpreting student achievement even in the absence of retesting. Due to these impaired benefits to students, using the Stanford test would negatively impact ethical researcher behavior, in this writer's opinion.

The Appropriateness of the Instrument for the Research Subject

In addition to the topics reviewed in the previous section, which weigh against the use of the Stanford for the research project, Adams' (2010) review of this instrument also noted that "during the process of test development, item analyses were conducted for students in new, as well as traditional, curricula" (par. 1). Tests that include questions related to newer curricula are important to the research project because the test instrument used to assess academic achievement internationally should include the newest learning objectives. If the test is to provide comparisons between improving or decreasing academic achievement in international students, a natural component of the comparison should include the latest academic knowledge as well as testing students' comprehension of concepts and principles, which the Stanford test includes.

Instrument Three

Instrument's Description

Lehmann (2010) presents a critique of this instrument, the National Proficiency Survey Series (NPSS), which was developed to evaluate student proficiency in high school language arts, mathematics, social studies, and science courses. Although criticism has been leveled against national standardized tests, including the Stanford Test, most schools in America continue to use them.

Information to be Collected

There are three components in each of language arts, social studies, and science, and four in mathematics. Each component tests 13-22 objectives. Each objective contains three items, which reflect the major content areas.

Target Population. Ninth graders.

Variables to be Measured. Scores for math, science, and reading.

The Format and Length of the Instrument

Time for administering and testing is about 50 minutes per test, which is the usual length of a high school class period. All of the tests use a four-option, multiple-choice format. Lehmann (2010) wrote that the tests, the Administrator's Manual, and the Technical Manual are printed in standard "8 1/2 by 11 inch booklets...all printing is in blue ink on white paper with the letters identifying the distracters printed in red ink, and the quality of paper, reproductions, and illustrations are excellent" (par. 22). "Machine-scorable (NCS), and self-scorable answer sheets are provided" (Lehmann, 2010, par. 22), and the test developers provide a diskette for computerized grading. No scoring reports are provided.

Method Used to Test for Reliability and Validity

The test's information reports that a national field test involved 7,800 students in 13 school systems within seven states in 1987. According to Lehmann (2010), no specific details are provided by the developers for the instrument other than the following:

Consideration (that) was given to ethnic and gender groups, that the final items were selected from a large item bank, that the items were written by classroom teachers and curriculum experts, the items had to satisfy stringent requirements, and the highest statistical criteria was included in the test. (par. 3)

The tests were normed twice in 1988 on 22,616 public and Roman Catholic school students from 45 high schools in 20 states in grades 9 through 12. The sampling design "stratified public schools on (a) geographic regions (the nation was divided into two regions); (b) district enrollment (small, medium, and large); and (c) district socioeconomic status (low, average, and high)" (Lehmann, 2010, par. 23). However, the designers did not provide sufficient data for users to determine if standardization methodology was correct, and percentile ranks by grade were only provided for the Writing Fundamentals and Literature tests (Lehmann, 2010).

Lehmann (2010) discloses the reliability details reported by the developer:

The reliability data for the Mathematics, History, and Science tests are reported only in terms of composite grades. The KR20s range from .82 to .91 with a median of .86. The standard errors of measurement are all less than 3 raw score points. Although the reliabilities are acceptable, this reviewer contends that if individual decisions are made on the basis of these results, the reliabilities should be higher. Stability estimates would also be desirable. (par. 27)
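
For reference, the KR-20 coefficient quoted above is an internal-consistency estimate computed from dichotomous (right/wrong) item responses. The Python sketch below shows the calculation on a small, made-up response matrix; it is illustrative only and does not use NPSS field-test data.

# A minimal sketch of the KR-20 calculation on hypothetical 0/1 item responses.
import numpy as np

# rows = examinees, columns = items (1 = correct, 0 = incorrect)
responses = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1],
])

k = responses.shape[1]                   # number of items
p = responses.mean(axis=0)               # proportion answering each item correctly
q = 1.0 - p
total_score_variance = responses.sum(axis=1).var()
kr20 = (k / (k - 1)) * (1.0 - (p * q).sum() / total_score_variance)
print(f"KR-20 = {kr20:.2f}")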

Lehmann (2010) reported validity concerns relative to the developers claiming that the test provides "nationally-normed criterion-referenced surveys" (par. 29). However, Lehmann (2010) rebuts their claim by saying that while it is possible for a "domain-referenced test to have norms…how can a true criterion-referenced test have norms?" (par. 29). The "sine qua non (prerequisite) of an achievement test is content validity" (Lehmann, 2010, par. 29), and the test's administrative information tells users that it is the user's responsibility to determine the test's validity.

Although the test's documentation did provide information on "who wrote, edited, and reviewed test items; field trials and statistical data provided; the bias reviews conducted; national p-values for each objective; the item p-value distributions, means, and medians by test; and the self-reported grade-point average for the standardization sample" (Lehmann, 2010, par. 28), other pertinent information was minimal. Lehmann (2010) also suggested the developers provide "test blueprints by test and taxonomical classification (knowledge, understanding, and higher order thinking skills)" (par. 28). Such information would aid users in determining whether the NPSS is valid for their purposes.

The Appropriateness of the Instrument for the Research Subject

The instrument's appropriateness for the research subject reflects comments by Lehmann (2010), which are that the NPSS' "reliabilities are respectable although not spectacular for a test whose results might be used for individual decisions…the content validity evidence presented is supportive...(and) the quality of the items and reproduction are excellent" (par. 32). However, despite its good aspects, two parts of the research project involve (1) determining if American 9th grade students' academic achievement is falling as reported, and (2) identifying methods through qualitative research methodology that can be used to improve academic achievement. Unfortunately, while the first objective could be met using this instrument, the second could not, because Lehmann (2010) reports that "there is little information provided (by the NPSS) to help classroom teachers become better teachers using the test results" (par. 33). Therefore, given that any achievement test is ultimately meant to improve the teaching-learning process, Lehmann (2010) contends the NPSS cannot provide such a result.

Conclusion

Assessments of three test instruments were conducted as a preliminary review of the potential tools available for the stated research project. By locating and appraising such instruments, and critically weighing the strengths and weaknesses of each, valuable information was gained that can be applied toward the decisions necessary to conduct testing for the research project. In addition to evaluating the tools from the perspective of ethical considerations, emphasis was also placed on their reliability and validity. After reviewing approximately fifty test instruments, work will continue to evaluate additional instruments, which should aid this writer in deciding on the most appropriate test instrument for the upcoming research project, and whether one already exists or should be created.


References

Adams, G.S. (2010). Stanford achievement test: High school basic battery. Retrieved from Mental Measurements Yearbook and Tests in Print. (7:27)

Hoskins, B., & Fredriksson, U. (2008). Learning to learn: What is it and can it be measured? Retrieved from http://publications.jrc.ec.europa.eu/repository/bitstream/
111111111/979/1/learning%20to%20learn%20what%20is%20it%20and%20can%20it%20be%20measured%20final.pdf

Lehmann, I.J. (2010). National proficiency survey series. Retrieved from Mental Measurements Yearbook and Tests in Print. (11010792)

Lowe, S.A. (2010). Review of EXPLORE. Retrieved from Mental Measurements Yearbook and Tests in Print. (15012572)

Organization for Economic Cooperation and Development (OECD). (2012). Statistics portal. Retrieved from http://www.oecd.org/statsportal/0,2639,en_2825_293564_1_1_1_1_1,00.htm

Programme for International Student Assessment (PISA). (2009). PISA 2009 results: What students know and can do: Student performance in reading, mathematics and science, Volume I. Retrieved from http://browse.oecdbookshop.org/oecd/pdfs/free/

Salkind, N.J. (2009). Exploring research (7th ed.). Upper Saddle River, New Jersey: Pearson Education Inc.

The International Association for the Evaluation of Educational Achievement (IEA). (2011). About IEA. Retrieved from http://www.iea.nl/home.html

Zeitvogel, K. (2010, December). US falls to average in education ranking. Retrieved from http://lib.store.yahoo.net/lib/realityzone/UFNusEducationAverage.html