Advanced User Experience Research Techniques, pku20161 lecture notes: Lim, ESL essay raters' cognitive processes

ESL essay raters' cognitive processes
Paula Winke and Hyojung Lim
Michigan State U

This is a study of rater behavior
How does a rater make scoring decisions? What does a rater pay attention to when rating?
Language testers need to know if construct-irrelevant variation in scores stems from how raters approach and think about a rubric.
Empirical studies on raters' cognitive processes are scarce (especially with analytic scoring), and findings are not consistent.

Previous findings
Raters focus on different features in essays when scoring and weight the different scoring categories differently (Cumming et al., 2002; Eckes, 2008; Orr, 2002).
Sometimes they consider external features that are not even described in a
rubric (Barkaoui, 2010; Lumley, 2005; Vaughan, 1991).
Raters may have different attentional foci when scoring, and their foci may depend on the scale type (holistic vs. analytic), the rater's experience (expert vs. novice rater), and the rater's L1 and even L2 background.

The current study
We'd like to know
How raters cognitively process (i.e., use) an analytic rubric while rating ESL essays
Whether variability in processing (difference in rubric usage) is associated with lower inter-rater reliability

Research Questions
To which parts of an analytic rubric do raters pay the most attention (measured as total fixation duration and visit count)?
Are inter-rater reliability statistics on the components of an analytic rubric related to the amount of attention paid to those components?

Method
9 raters, all ESL instructors in the same English-language program at a large Midwestern university and native speakers of English. Each rated 40 essays (4 prompts * 10 essays).
Analytic rating scale: Currently used at the language program; it is a modified version from Jacobs et al. (1981) with five categories: content, organization, vocabulary, language use, and mechanics
Tobii TX300 eye-tracker: The rubric was installed in the Tobii Studio program.
[Rubric columns: Content, Organization, Vocabulary, Language Use, Mechanics]

The data collection set-up
[Figure labels: 64 cm, Rubric, Essay, Score]

Procedure
Session 1 in conference room; Session 2 in Lab; Session 3 in Lab
Two-hour rater training session. The raters worked through 7 benchmark essays with Paula. Hyojung explained the procedure.
Background questionnaire; eye calibration; practice rating (norming session); Block 1: 10 essays; Block 2: 10 essays
Eye calibration; practice rating (norming session); Block 3: 10 essays; Block 4: 10 essays

The data

Data Analysis
To quantify attention: total fixation duration (divided by the number of words in each category) and visit count
To observe a rating process: time to first fixation, gaze plots, and heat maps (Bax & Weir, 2012)
Inter-rater reliability: the intraclass correlation coefficient (ICC) and reliability adjusted by the Spearman-Brown prophecy formula
Statistics: the Kruskal-Wallis and Mann-Whitney (post hoc) tests

Results
In general, raters read the rubric from left to right, starting from content, organization, vocabulary, language use to
mechanics. Oftentimes (71 times, to be specific), mechanics were overlooked.

Results
Organization received the most attention (in terms of fixation duration and visit count) and showed the highest inter-rater reliability; raters attended least to and agreed least on mechanics.
[Figure annotations: r = .90, r = .75]

Category       Fixation duration*   Visit count   Intraclass coefficient   Spearman-Brown prophecy formula
Content        .071                 4.03          .89                      .82
Organization   .080
Vocabulary     .058
Language Use   .052
Mechanics      .045
* Mean, in seconds, with # of words controlled

Statistical results
Organization, Content   Vocab.   Lang   Mechanics
Vocab, Organization, Lang, Content   Mechanics

From a qualitative review of the videos and heat maps in comparison with each rater's inter-rater reliability estimate, we believe that raters who agreed the most had common attentional foci, whereas those who agreed the least did not.

Incongruous Raters
Raters 1 and 7 were found to be most incongruous, given their lowest inter-rater reliability for the total score (.45), and the second lowest reliability for content (.36) and for mechanics (.28). Because the scores for Essay 2 had the largest standard deviation, we looked at the heat maps for
Essay 2 for raters 1 and 7.
[Heat maps: Essay 2, Rater 1; Essay 2, Rater 7]

Agreeing Raters
Raters 6 and 8 had the highest correlation coefficient in total scores (r = .79) as well as on the sub-scores for content (r = .75) and mechanics (r = .67). Given that the scores of Essay 8 show the smallest standard deviation, the heat maps for Essay 8 were compared between raters 6 and 8.
[Heat maps: Essay 8, Rater 6; Essay 8, Rater 8]

Discussion
Raters' attention and inter-rater reliability
More attention leads to higher inter-rater reliability with analytic scoring. (By contrast, greater care and attention decrease reliability with holistic scoring; Wolfe, 1997.)
Those who showed higher inter-rater reliability showed similar reading patterns: reading a relatively large area of the rubric, and having common patterns of attentional foci.

The effect of the layout
With an analytic scale, raters' decision-making behaviors tend to operate within the scope of the given guidelines (Smith, 2000). Part of the guidelines is the order of the categories.
We think that raters gave their most attention to content and organization and their least attention to mechanics because of a primacy effect.
It has to do with rubric real estate.

In Lumley's (2005) study, the conventions of presentation (spelling, punctuation, script layout) received the second most attention after content, more attention than organization and grammar. In his study, the conventions of presentation came second after content in the rubric. This may also be evidence of the primacy effect.

Raters may use the rubric mainly to justify or adjust the scores for an essay on which they have already made decisions.
When finishing reading an essay, raters seemed to know where the quality of the essay would fall in the grid of the analytic rubric.
Those who showed higher
inter-rater agreement appeared to look through more descriptors for various levels; those who didn't seemed to stick to their initial judgment.

Limitations & Future Directions
The eye-movement data don't fully explain why raters paid more attention to certain categories or whether raters considered non-criterion features. Analysis of our stimulated-recall interview data is needed.
We don't know if there was any halo effect across essays in the rating process.
Information is lacking on how raters read the essays and how they went back and forth between the essays and the rating scale.
We have collected data for a second study in which both the rubric and essay are on screen, and data for a third study to investigate potential halo effects.

Questions or comments?
Paula Winke
Hyojung Lim

Notes on Essays
We assembled a stratified sample of 40 essays from prior ESL placement tests at a large Midwestern university. We culled four sets of 10 essays, each set from one of four scoring bands (64 and below, 65-69, 70-74, and 75 and above: see supplemental material that accompanies the online version of this manuscript). We balanced the selection of the 40 essays equally across four prompts as follows, with two to three essays at each score band being a response to one of these prompts:
Do you think it is better for people to make their purchases online or to go shopping in stores and malls? Use specific details and examples to explain your answer.
Some people say that all international students who are studying English should have an American roommate for at least one year. What is your opinion on this topic?
Some employees have bosses that they really like working for, while others have bosses that they absolutely hate. What are the most important qualities of a good boss at work, and why?
If you had the choice, would you rather take a college course online or have the same class face to face with an instructor and classmates in a classroom? Use specific details and examples to explain your answer.
The length of student essays was limited to one page so that raters did not need to flip over pages while rating. The order of 10 essays within each prompt set was randomized, and the order of the four prompt sets was counterbalanced across raters. A packet of 40 copied essays w
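
The randomization and counterbalancing described in these notes could be set up along the following lines. This is only a sketch under assumptions that the notes do not confirm: the prompt-set labels A to D are hypothetical, the counterbalancing is taken to be a simple cyclic Latin square, and essay order is reshuffled for every rater. With 9 raters and 4 possible set orders the design cannot be perfectly balanced, so one order is used once more than the others.

```python
import random

PROMPT_SETS = ["A", "B", "C", "D"]  # hypothetical labels for the four prompt sets
ESSAYS_PER_SET = 10
N_RATERS = 9


def latin_square_orders(items):
    """Return len(items) cyclic rotations, so each item appears once per position."""
    n = len(items)
    return [[items[(start + offset) % n] for offset in range(n)] for start in range(n)]


def build_packets(n_raters=N_RATERS, seed=0):
    """Build one ordered list of (prompt_set, essay_number) pairs per rater."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    orders = latin_square_orders(PROMPT_SETS)
    packets = {}
    for rater in range(1, n_raters + 1):
        # Assumption: raters cycle through the rotations in turn.
        set_order = orders[(rater - 1) % len(orders)]
        packet = []
        for prompt_set in set_order:
            essay_numbers = list(range(1, ESSAYS_PER_SET + 1))
            rng.shuffle(essay_numbers)  # randomize essay order within the set
            packet.extend((prompt_set, number) for number in essay_numbers)
        packets[rater] = packet
    return packets


packets = build_packets()
for rater, packet in packets.items():
    # The first essay of each block of 10 tells us which prompt set the block is.
    set_order = [packet[i][0] for i in range(0, len(packet), ESSAYS_PER_SET)]
    print(f"Rater {rater}: set order {set_order}")
```

Each rater ends up with 40 essays in four blocks of 10, and across any four consecutive raters every prompt set appears once in each block position.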

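Two of the quantities named under Data Analysis are simple enough to sketch: total fixation duration divided by the number of words in a rubric category, and the Spearman-Brown prophecy formula. The slides do not say what lengthening factor the authors used, so `length_factor` below is a free parameter, and all input numbers are made up for illustration; none of them come from the study's data.

```python
def normalized_fixation(total_fixation_seconds, descriptor_word_count):
    """Fixation time per word, so longer rubric categories are not favored."""
    if descriptor_word_count <= 0:
        raise ValueError("descriptor_word_count must be positive")
    return total_fixation_seconds / descriptor_word_count


def spearman_brown(reliability, length_factor):
    """Predicted reliability when the number of raters is multiplied by length_factor.

    Standard prophecy formula: k * r / (1 + (k - 1) * r).
    """
    denominator = 1 + (length_factor - 1) * reliability
    if denominator == 0:
        raise ValueError("formula is undefined for these inputs")
    return (length_factor * reliability) / denominator


# Hypothetical inputs, chosen only to show the arithmetic.
per_word = normalized_fixation(total_fixation_seconds=12.0, descriptor_word_count=150)
doubled = spearman_brown(reliability=0.5, length_factor=2)
print(round(per_word, 3))  # 0.08
print(round(doubled, 3))   # 0.667
```

A `length_factor` of 1 returns the reliability unchanged, which is a quick sanity check on the formula.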