Session Overview
Session: PA19: Measurement 2
Time: Friday, 24 Jul 2015, 4:30pm - 6:00pm

Session Chair: Frank M. Goldhammer
Location: KOL-G-217 (Ⅳ)
Capacity: 125

Presentations

Assessing test-taking engagement using response times

Frank Goldhammer1,3, Thomas Martens1, Oliver Lüdtke2,3

1DIPF - German Institute for International Educational Research, Germany; 2IPN - Leibniz-Institut für die Pädagogik der Naturwissenschaften und Mathematik, Germany; 3ZIB - Centre for International Student Assessment, Germany; goldhammer@dipf.de

A problem of low-stakes assessments is low test-taking engagement, which threatens the validity of test score interpretations. We therefore addressed the question of how indicators of test-taking engagement can be defined and validated in the context of the OECD Programme for the International Assessment of Adult Competencies (PIAAC). The approach was to identify disengaged response behavior by means of response time thresholds (cf. Lee & Jia, 2014). Constant thresholds were considered as well as item-specific thresholds based on the visual inspection of (bimodal) response time distributions (VI method) and on the proportion correct conditional on response time (P+>0% method). Results based on 152,514 participants from 22 countries showed that the VI method could be applied to only a portion of items. Overall, the validity checks comparing the proportion correct of engaged and disengaged response behavior revealed that the P+>0% method performed slightly better than the other methods. Finally, we computed the proportion of disengaged responses across items and countries by domain. Overall, this proportion was quite low. The results also revealed that disengaged response behavior increased from part 1 to part 2 of the assessment, suggesting a drop in test-taking motivation over the course of test-taking.
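The following is a minimal sketch, not the authors' implementation, of how an item-specific response-time threshold in the spirit of the P+>0% method might be derived and used to flag disengaged responses: response-time bins are scanned from fast to slow, and the threshold is placed where the conditional proportion correct first rises above zero. Column names, the bin width, and the toy data are illustrative assumptions.

```python
# Sketch of flagging disengaged responses via item-specific
# response-time thresholds (P+>0%-style rule); illustrative only.
import numpy as np
import pandas as pd

def p_plus_threshold(times, correct, bin_width=1.0):
    """Return a response-time threshold (seconds) for one item."""
    edges = np.arange(0, np.nanmax(times) + bin_width, bin_width)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (times >= lo) & (times < hi)
        if in_bin.sum() == 0:
            continue
        if correct[in_bin].mean() > 0:   # proportion correct exceeds 0%
            return lo                    # threshold = lower edge of this bin
    return 0.0                           # no rise found: flag nothing

def flag_disengaged(df):
    """Add a boolean 'disengaged' column, one threshold per item."""
    parts = []
    for item, grp in df.groupby("item"):
        thr = p_plus_threshold(grp["rt"].to_numpy(), grp["correct"].to_numpy())
        parts.append(grp.assign(disengaged=grp["rt"] < thr))
    return pd.concat(parts)

# Toy example: very fast responses are all wrong (rapid guessing).
demo = pd.DataFrame({
    "item": ["A"] * 8,
    "rt": [0.4, 0.7, 0.9, 3.1, 4.0, 5.2, 6.3, 7.5],
    "correct": [0, 0, 0, 1, 0, 1, 1, 1],
})
print(flag_disengaged(demo))
```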

Examining test items for Differential Distractor Functioning (DDF) across different groups

Ioannis Tsaousis

University of Crete, Greece; tsaousis@uoc.gr

The aim of this study was to examine the effectiveness of the alternative false responses (distractors) on multiple-choice items in a cognitive ability test. In particular, using Item Response Theory (e.g., differential distractor analysis) as a methodological framework, we examined whether the distractors, or incorrect option choices, used in each item increase the probability of DIF effects across different groups. Data were sampled from approximately 600 students from the Greek Military Academies (i.e., Air Force, Army and Navy Academy) who completed the Army Numerical Reasoning Test. To examine possible DDF effects we used the odds ratio approach, whereby the DDF effect of each distractor is obtained using a generalization of the Mantel-Haenszel common odds ratio estimator adapted to each distractor. The results from the analysis revealed that some items exhibit DDF across different groups. Results also suggested that items showing DDF were more likely to be located in the second half of the test than in the first half. The findings from this study allow us to determine which items need further examination, and designate DDF analysis as a useful tool for better understanding why a particular item exhibits DIF across groups.
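As a rough illustration of the pooled odds ratio idea, the sketch below (not the study's code) computes a Mantel-Haenszel common odds ratio for one distractor: within each total-score stratum, a 2x2 table of group (reference vs. focal) by response (distractor vs. correct answer) is formed and the stratified odds ratios are pooled. Column names, the grouping variable, and the use of the correct answer as comparison category are assumptions.

```python
# Sketch of a Mantel-Haenszel common odds ratio for a single distractor
# (DDF screening); a value clearly different from 1 suggests the distractor
# attracts one group more than the other at matched ability levels.
import numpy as np
import pandas as pd

def mh_odds_ratio(df, distractor, correct_key, group_col="group",
                  score_col="total_score", response_col="response",
                  reference="ref", focal="foc"):
    num, den = 0.0, 0.0
    sub = df[df[response_col].isin([distractor, correct_key])]
    for _, stratum in sub.groupby(score_col):
        n = len(stratum)
        if n == 0:
            continue
        a = ((stratum[group_col] == reference) & (stratum[response_col] == distractor)).sum()
        b = ((stratum[group_col] == reference) & (stratum[response_col] == correct_key)).sum()
        c = ((stratum[group_col] == focal) & (stratum[response_col] == distractor)).sum()
        d = ((stratum[group_col] == focal) & (stratum[response_col] == correct_key)).sum()
        num += a * d / n
        den += b * c / n
    return num / den if den > 0 else np.nan
```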

Controlling time-related individual differences in test-taking behavior by presenting time information

Miriam Hacker, Frank Goldhammer, Ulf Kröhne

German Institute for International Educational Research (DIPF), Germany; hacker@dipf.de

Generally speaking, in ability or competence assessments test takers answer the questions in a self-paced way. As a result, test takers can differ considerably in the amount of time they spend on a task. Such (construct-irrelevant) individual differences in test-taking behavior can produce differences in test performance even though test takers may be equally able or skilled. Thus, time-related test-taking behavior can influence the measurement and affect the comparability of ability scores. Previous findings on this measurement problem, relating to the so-called ‘speed-accuracy tradeoff’, originate from speed test studies. The present study aims to address these research questions with regard to power tests and to develop appropriate measurement approaches. For this study, reading competence tests were administered in a control condition with no influence on timing behavior and in several experimental conditions differing in how timing behavior was influenced. The impact of the conditions on individual differences in timing behavior, on performance, and on the tests’ reliability and validity was assessed. Additional covariates were assessed to further explore performance differences within experimental conditions. The random sample consists of 1,065 German students (521 female, 544 male; M = 20.51 years). First results show, for instance, that presenting time information can reduce rapid guessing behavior and decrease the number of missing responses.

Gender differences in general knowledge tests: Caused by unbalanced interest domains?

Philipp Meinolf Engelberg, Ralf Schulze

Bergische Universität Wuppertal, Germany; engelberg@uni-wuppertal.de

Robust gender differences favoring men in standardized psychological tests of general knowledge have been repeatedly reported in test manuals and the pertinent literature. For example, the norm sample of the frequently used German general knowledge test I-S-T 2000 R evidenced an effect size of d = 0.30. In the present study, gender differences in interests as well as an unbalanced representation of interest domains between men and women in knowledge tests were both investigated as potential causes for these findings. Based on the results from an assessment of both male and female interests (n = 507), a knowledge test consisting of 121 items that tap exclusively into female interest domains was created. A total of 202 participants completed both this new test and the I-S-T knowledge test. Subsequent factor analyses yielded a 2-factor solution with opposing gender differences. The I-S-T indicators showed substantial loadings only on the factor with a male advantage. The results support the hypothesis that gender differences in knowledge tests are not based on gender differences in true general knowledge but may – at least partially – be attributed to an unbalanced item selection from predominantly male interest domains.
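For readers unfamiliar with the analysis step, a minimal two-factor exploratory factor analysis could look as follows; this is only a sketch under assumed data shapes and variable names, not the authors' procedure, and it uses placeholder data.

```python
# Sketch of a two-factor EFA over knowledge-test indicators, checking
# whether established indicators and new items define separate factors.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_persons, n_indicators = 202, 10                    # e.g., item parcels
scores = rng.normal(size=(n_persons, n_indicators))  # placeholder data matrix

fa = FactorAnalysis(n_components=2, rotation="varimax")
fa.fit(scores)
loadings = fa.components_.T                          # indicators x factors
print(np.round(loadings, 2))
```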

Are student evaluations of teaching really reliable? A Bayesian meta-analysis

Sherin Natalia Bopp, Sven Hug, Rüdiger Mutz

ETH Zurich, Switzerland; sherin.bopp@gess.ethz.ch

Student evaluation of teaching (SET) has become a fixed part of most university quality assurance systems as a means of assessing teaching performance. Numerous primary studies on different topics of SET reflect the strong development of research on SET, especially in the last 30 years. In the face of this large literature, it is still not possible to integrate the results of primary studies into conclusive overall statements, even in comprehensive reviews. Therefore, for the first time in research on SET, more sophisticated Bayesian meta-analysis techniques are used here to establish general quantitative statements about SET and to address the complex problems of data analysis (e.g., multilevel data, different teaching dimensions). Of major concern in research on SET are the key concepts of test theory (e.g., reliability, validity). In a first step, the reliability of SET was investigated with 218 primary studies. We address the following questions: Which kinds of reliability concepts were used in the studies? Are SET scores on average actually sufficiently reliable, as Marsh (1984) has claimed? How much do SET results vary across and within studies? What are the determinants of the reliability of SET? Initial results and conclusions will be presented.
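To give a flavor of the modeling approach, the sketch below shows a generic Bayesian random-effects meta-analysis of study-level reliability estimates; it is not the authors' model, and the data, priors, and Fisher-z handling of the reliability coefficients are illustrative assumptions.

```python
# Sketch of a Bayesian random-effects meta-analysis of reliability
# coefficients: each study contributes a Fisher-z transformed estimate with
# a standard error; the model pools them and estimates between-study SD.
import numpy as np
import pymc as pm

rel = np.array([0.88, 0.91, 0.84, 0.90, 0.86])    # study reliabilities (toy)
se_z = np.array([0.05, 0.04, 0.06, 0.05, 0.07])   # SEs on the z scale (toy)
z = np.arctanh(rel)                               # Fisher z transform

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=2.0)       # overall mean (z scale)
    tau = pm.HalfNormal("tau", sigma=1.0)         # between-study SD
    theta = pm.Normal("theta", mu=mu, sigma=tau, shape=len(z))
    pm.Normal("z_obs", mu=theta, sigma=se_z, observed=z)
    idata = pm.sample(1000, tune=1000, chains=2, progressbar=False)

# Back-transform the posterior mean to the reliability metric.
mu_post = idata.posterior["mu"].mean().item()
print("pooled reliability ≈", round(float(np.tanh(mu_post)), 3))
```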