Half-day – Afternoon
To prevent response biases associated with the use of rating scales, test items may be presented as comparative judgements. These include the popular forced-choice format, where respondents rank two or more items, and more complex Q-sorts, where rankings with ties are obtained. Researchers may also collect the extent to which items are preferred to each other, for example by grading preferences using categories such as “much more like me” or “slightly more like me” (graded-preference format), or by expressing preferences as “proportions-of-total” (compositional format). Responses collected with such formats are relative within the person, which creates a major psychometric challenge: the data are interpersonally incomparable (ipsative). Since measurement of individual differences requires locating respondents’ absolute positions on the traits of interest, appropriate methods of scaling ipsative data are needed.
This workshop will introduce participants to state-of-the-art methods for analysing and scoring comparative judgments, and provide recommendations for designing effective comparative measures through understanding their fundamental strengths and limitations. I will focus on the Thurstonian factor-analytic approach, which is applicable to all types of ipsative data and uses the same Confirmatory Factor Analysis (CFA) framework for outcomes of various types – binary, ordinal and continuous. The Thurstonian family includes the TIRT model for choice and ranking (Brown & Maydeu-Olivares, 2011), which has been applied to major assessments such as the Occupational Personality Questionnaire (OPQ32r). The family has recently been expanded to analyse percentage-of-total (Brown, 2016) and graded-preference data (Brown & Maydeu-Olivares, 2017). This unified approach will be demonstrated with empirical data analysis examples, including well-known personality questionnaires. Participants will learn the basics of good comparative designs and best practices in creating informative measures in comparative format. We will also touch on the topic of response biases and discuss whether and when comparative formats can prevent them. Although the workshop is too short for hands-on practice in analysing and scoring comparative judgments with software, I will point participants towards useful resources to help them start empirical work in this field.
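To make the data structure concrete, here is a minimal sketch in R (with made-up item names and rankings, not the workshop's own materials) of the recoding step that underlies the TIRT approach: rankings of the items within a block are recoded into binary outcomes for every item pair, which can then be modelled within the CFA framework for categorical data.

# Minimal sketch: recode within-block rankings into binary pairwise outcomes,
# the data format analysed by Thurstonian IRT (TIRT) models.
# 'ranks' is a hypothetical matrix: rows = respondents, columns = items in one
# block, entries = rank assigned to each item (1 = most like me).
ranks <- matrix(c(1, 2, 3,
                  3, 1, 2,
                  2, 3, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(NULL, c("i1", "i2", "i3")))

pairs <- combn(colnames(ranks), 2)            # all item pairs within the block
binary <- apply(pairs, 2, function(p) {
  # outcome = 1 if the first item of the pair is ranked above the second
  as.integer(ranks[, p[1]] < ranks[, p[2]])
})
colnames(binary) <- apply(pairs, 2, paste, collapse = "_vs_")
binary
# These binary outcomes are then modelled with a factor model for categorical
# data, with the error structure implied by the TIRT model.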
Half-day – Morning
Educational assessment and measurement tools are rapidly evolving to capture a broad range of learner competencies such as critical thinking, communication and collaborative problem solving (Griffin & Care, 2014). A key innovation in such tools is the use of interfaces that enable immersive interactions and capture complex student data across multiple sensory modalities. However, this poses a challenge: how do we extract meaningful evidence of competency from such complex, noisy and unstructured data? This workshop presents advances in artificial intelligence (AI), including modern deep learning frameworks (Bengio, 2009), to address these challenges. These models are able to exploit concept hierarchies that reflect the structure inherent in the data and the goals of the assessment.
Half-day – Morning
In recent years, many areas of social and biomedical science have been experiencing a credibility revolution. Large-scale studies of reproducibility have shown that many findings fail to replicate, and trust in the underpinnings of some fields has been shaken. To restore credibility and improve rigor, scientists are embracing open scientific practices, including pre-registration; transparent sharing of research materials, data, and analysis code; large-scale collaborations; and pre-printing research results. This workshop provides an overview of the tools and technology needed to facilitate reproducible research. I will focus on why, when, and how to pre-register research, as well as strategies for organizing your research workflow that encourage reproducibility and facilitate sharing of materials, data, and analysis code.
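As one small illustration of such workflow strategies (a hypothetical layout, not a prescribed template), the R sketch below sets up a project skeleton that separates raw data, code, and output, and records the software environment so that others can reproduce the analysis.

# Minimal sketch of a reproducible project skeleton (hypothetical layout):
# raw data, analysis code, and outputs live in separate folders so the whole
# analysis can be re-run, shared, and archived.
project <- "my_study"
dirs <- file.path(project, c("data_raw", "data_clean", "code", "output"))
invisible(lapply(dirs, dir.create, recursive = TRUE, showWarnings = FALSE))

# A README documents how to reproduce the results from the shared materials.
writeLines(c("# my_study",
             "Run code/01_clean.R and then code/02_analysis.R to reproduce all output."),
           file.path(project, "README.md"))

# Recording the software environment (package versions) aids reproducibility.
writeLines(capture.output(sessionInfo()),
           file.path(project, "output", "session_info.txt"))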
Recent theoretical and computational developments in Structural Equation Modeling (SEM) have been implemented in free and open-source R packages and are now available to applied users. The goal of this workshop is to give an overview of these developments and to demonstrate how open-source R packages (e.g., lavaan) can be used to apply them in practice. The following topics will be covered: multilevel SEM, exploratory factor analysis and exploratory SEM, new developments in assessing model fit, and small-sample solutions. Although SEM is an established technology, applying it in practice can still be challenging: measurement instruments may not fit adequately, sample sizes may be (too) small, data may be clustered, and so on. In this workshop, we provide a no-nonsense overview of several important recent developments that applied users should be aware of when applying SEM in practice.
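To give a flavour of what such analyses look like in lavaan, here is a minimal sketch (for illustration only, using the HolzingerSwineford1939 example data shipped with the package) that fits a three-factor CFA and requests some common fit measures.

# Minimal illustrative sketch using lavaan's built-in example data; not the
# workshop's own materials.
library(lavaan)

model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'

fit <- cfa(model, data = HolzingerSwineford1939)
summary(fit, fit.measures = TRUE, standardized = TRUE)
fitMeasures(fit, c("cfi", "tli", "rmsea", "srmr"))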
Half-day – Morning
Multistage testing (MST; Yan, von Davier & Lewis, 2014) has become an important framework for tailored testing, and more and more international operational testing programs are considering MST for practical administrations. This workshop provides a general overview of the computerized multistage test (MST) design and its important concepts and processes for the international testing community. It describes the MST design, why it is needed, how it differs from other test designs such as linear tests and computerized adaptive tests (CAT), and how it works, and it offers hands-on experience with running simulations using MST software.
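As a taste of the kind of simulation covered, the sketch below (base R, with made-up Rasch item parameters rather than any particular MST software) implements a simple two-stage design: all examinees take a routing module, their number-correct score routes them to an easier or harder second-stage module, and ability recovery is then summarised.

# Minimal sketch of a two-stage MST simulation with made-up Rasch item
# parameters: examinees take a routing module, are routed on number-correct
# score, and then take an easier or harder second-stage module.
set.seed(1)
n <- 1000
theta <- rnorm(n)                                   # true abilities

p_rasch  <- function(theta, b) plogis(outer(theta, b, "-"))
sim_resp <- function(theta, b) matrix(rbinom(length(theta) * length(b), 1,
                                             p_rasch(theta, b)),
                                      nrow = length(theta))

b_route <- seq(-1, 1, length.out = 10)              # routing module (medium)
b_easy  <- seq(-2, 0, length.out = 10)              # easier second-stage module
b_hard  <- seq( 0, 2, length.out = 10)              # harder second-stage module

stage1 <- sim_resp(theta, b_route)
route_hard <- rowSums(stage1) >= 6                  # simple number-correct routing rule

stage2 <- matrix(NA, n, 10)
stage2[route_hard, ]  <- sim_resp(theta[route_hard],  b_hard)
stage2[!route_hard, ] <- sim_resp(theta[!route_hard], b_easy)

# Crude ability summary: total score across the 20 administered items
total <- rowSums(stage1) + rowSums(stage2)
cor(theta, total)                                   # how well the MST recovers ability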
Half-day – Afternoon
The elevation of fairness as a foundational element, along with validity and reliability, was a substantive change in the 2014 Standards for Educational and Psychological Testing (herein referred to as the ‘Testing Standards’). This change stems from the belief that fairness “is a central issue in achieving valid test results”. Since fairness and validity are so intertwined, threats to fairness may also interfere with the validity of test score interpretations. Documentation of standards alone does not ensure application in practice. Noted reasons for this disparity include poor dissemination of concepts and principles to professional organizations, the unavailability of accessible and efficient methodologies, and a failure to identify relevant scholarship, examples, or recommendations for practice. The workshop will provide an overview of different definitions and theories of fairness as the term applies to assessment, as well as the unique challenges and opportunities concerning fairness in applied settings and scholarship. The Testing Standards can provide a common resource and perspective for the workshop, but presenters will also use example case studies and one or more group tasks to ensure audience engagement and participation.
Participants will learn about the ethical principles and test standards governing test interpretation, as well as the psychometric principles and procedures needed to assess the viability of the test scores and comparisons required for sound, ethical application.
As Weiner (1989) cogently noted, psychologists must “(a) know what their tests can do and (b) act accordingly. … Acting accordingly—that is, expressing only opinions that are consonant with the current status of validity data—is the measure of his or her ethicality” (p. 829). To follow Weiner’s advice, psychologists must possess and apply fundamental competencies in psychological measurement; the importance of these competencies for ethical assessment and clinical practice cannot be overstated (Dawes, 2005; McFall, 2000). Interpretation of tests and procedures must be informed by strong empirical evidence from different types of reliability, validity, and diagnostic utility studies, each of which addresses a different interpretation issue. Unfortunately, most test technical manuals and popular interpretation guides and textbooks neglect to report and address some critically important psychometric research methods and results necessary to judge the adequacy of the different available test scores and comparisons used in interpretation. So that psychologists may ethically interpret test scores or procedures, this workshop delineates and highlights the varied psychometric research methods psychologists must consider to adequately assess the viability of the different scores and comparisons advocated by publishers and authors. Specific research examples with popular tests and procedures are provided as exemplars. Internal consistency, short- and long-term temporal stability, interrater agreement, concurrent validity, predictive validity, incremental predictive validity, age/developmental changes, distinct group differences, theory-consistent intervention effects, convergent and divergent validity, internal structure (EFA and CFA), and diagnostic efficiency/utility methods are among those presented; each answers different but relevant questions regarding the interpretation of test scores and comparisons. Following this workshop, participants will be better able to critically evaluate psychometric information provided in test manuals, textbooks, interpretation guidebooks, the Mental Measurements Yearbook, and the extant literature.
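As a simple illustration of the first of these methods, the R sketch below (with simulated, hypothetical item scores) computes coefficient alpha for internal consistency and a test-retest correlation for temporal stability.

# Minimal sketch with simulated (hypothetical) item scores, illustrating two of
# the reliability methods listed above: coefficient alpha and temporal stability.
set.seed(42)
n <- 200; k <- 8
true_score <- rnorm(n)
items_t1 <- sapply(1:k, function(j) true_score + rnorm(n, sd = 1))   # occasion 1
items_t2 <- sapply(1:k, function(j) true_score + rnorm(n, sd = 1))   # occasion 2

# Coefficient alpha: k/(k-1) * (1 - sum of item variances / variance of total)
alpha <- function(x) {
  k <- ncol(x)
  k / (k - 1) * (1 - sum(apply(x, 2, var)) / var(rowSums(x)))
}
alpha(items_t1)

# Test-retest (temporal stability): correlation of total scores across occasions
cor(rowSums(items_t1), rowSums(items_t2))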
The goal of this workshop is to provide education on item response theory (IRT) approaches to modeling raters and ratings and to scoring constructed-response tests. This topic is particularly relevant in the current assessment climate, which promotes the use of rich response formats and task-based assessments. It is imperative that appropriate methods are used to analyze and monitor the raters scoring these tasks, and to adjust for rater effects in analysis and scoring. In this workshop we will offer modeling solutions with these goals in mind. The content will include an introduction to ratings and rater effects, an overview of modeling strategies with a goal-based approach to rating and rater analytics, and three units that delve deeper into specific modeling strategies (rater response IRT models, the multifaceted Rasch model, and the hierarchical rater model). The final unit involves hands-on activities and will allow participants to focus on decision-making, model selection, and the interpretation of model parameter estimates in a case study format, as well as provide an opportunity for self-assessment.
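To give a concrete flavour of one of these strategies, the sketch below (in R, with hypothetical parameter values) computes rating-category probabilities under a rating-scale formulation of the multifaceted Rasch model, in which the log-odds of adjacent categories depend on examinee ability, task difficulty, and rater severity.

# Minimal sketch (hypothetical parameter values) of a rating-scale formulation
# of the multifaceted Rasch model: the probability of each rating category
# depends on examinee ability (theta), task difficulty (delta), rater severity
# (lambda), and category thresholds (tau).
facets_probs <- function(theta, delta, lambda, tau) {
  # cumulative sums of (theta - delta - lambda - tau_j) give the category logits
  logits <- c(0, cumsum(theta - delta - lambda - tau))
  exp(logits) / sum(exp(logits))                    # probabilities for categories 0..m
}

tau <- c(-1, 0, 1)                                  # thresholds for a 4-category rubric
# a lenient vs. a severe rater scoring the same examinee on the same task
round(facets_probs(theta = 0.5, delta = 0, lambda = -0.5, tau = tau), 2)
round(facets_probs(theta = 0.5, delta = 0, lambda =  0.5, tau = tau), 2)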