Outline of the session
- Corpus design issues
  - Corpus representativeness
  - Corpus balance
  - Sampling
  - Corpus size
- Types of corpora
- Introducing some well-known English corpora of different types

Representativeness
- A corpus is a collection of (1) machine-readable (2) authentic texts (including transcripts of spoken data) which is (3) sampled to be (4) representative of a particular language or language variety
- A corpus is different from a random collection of texts or an archive
- Representativeness is a defining feature of a corpus
- As language is infinite but a corpus has to be finite in size, we sample and proportionally include a wide range of text types to ensure maximum balance and representativeness

Some definitions…
- “generally assembled with particular purposes in mind, and are often assembled to be (informally speaking) representative
  of some language or text type” (Leech 1992: 116)
- “… selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language” (Sinclair 2019)
- “A well-organized collection of data” (McEnery 2019)
- “gathered according to explicit design criteria” (Tognini-Bonelli 2019: 2)
- “built according to explicit design criteria for a specific purpose” (Atkins et al. 1992)
- texts selected and put together “in a principled way” (Johansson 2019: 3)

What is representativeness?
- “A corpus is thought to be representative of the language variety it is supposed to represent if the findings based on its contents can be generalized to the said language variety” (Leech 1991)
- Representativeness refers to the extent to which a sample includes the full range of variability in a population (Biber 1993)
- Representativeness is a fluid concept closely related to your research questions
  - If you want a corpus which is representative of general English, a corpus representative of newspapers will not do
  - If you want a corpus representative of newspapers, a corpus representative of The Times will not do

Two types of representativeness
- The representativeness of general corpora and that of (domain- or genre-specific) specialized corpora are measured in different ways
- General corpora
  - Balance: the range of genres included in a corpus and their proportions
  - Sampling: how the text chunks for each genre are selected
- Specialized corpora
  - Degree of closure/saturation: closure/saturation for a particular linguistic feature (e.g. size of lexicon) of a variety of language (e.g. computer manuals) means that the feature appears to be finite or is subject to very limited variation beyond a certain point, i.e. the curve of lexical growth is flattening out

Why should we care about representativeness?
- Reader of corpus-based studies (assessment)
  - To interpret the results of corpus research with caution, considering whether the corpus data and the method used in the study were appropriate
- Corpus user (assessment)
  - Important to “know your corpus”
  - To decide whether a given corpus is appropriate for their specific research question
  - To make appropriate claims on the basis of such a corpus
- Corpus creator (assessment?)
  - To make their corpus as representative as possible of the language (variety) it is claimed to represent
  - To document design criteria explicitly and make the documentation available to corpus users

Criteria for text selection
- The criteria used to select texts for a corpus are principally external
- The external vs. internal distinction corresponds to Biber’s (1993: 243) situational vs. linguistic perspectives
  - External criteria are defined situationally, irrespective of the distribution of linguistic features
  - Internal criteria are defined linguistically, taking into account the distribution of such features
- It is circular to use internal criteria like the distribution of words or grammatical features as the primary parameters for the selection of corpus data
  - If the distribution of linguistic features is pre-determined when the corpus is designed, there is no point in analyzing such a corpus to discover naturally occurring linguistic feature distributions

Criteria for text selection: time?
- If a corpus is not regularly updated, it rapidly becomes unrepresentative (Hunston 2019)
- The relevance of permanence in corpus design actually depends on how we view a corpus: as a static or a dynamic language model
  - Static model: sample corpora (nearly all existing corpora, e.g. BNC, LOB/FLOB)
  - Dynamic model: Bank of English

Criteria for text selection: tips
- “Criteria for determining the structure of a corpus should be small in number, clearly separate from each other, and efficient as a group in delineating a corpus that is representative of the language or variety under examination.” (Sinclair 2019)

Corpus balance
- A balanced corpus covers a wide range of text categories which are supposed to be representative of the language (variety) under consideration
- The proportions of different kinds of text it contains should correspond with informed and intuitive judgements
- There is no scientific measure for balance, just a best guess
- The acceptable balance is determined by the intended use, i.e. your research questions

The BNC model
- Generally accepted as being a balanced corpus
- Has been followed in the construction of a number of corpora
- 4,124 texts (including transcripts of recordings)
- ca. 100 million words: 90% written + 10% spoken
- Three criteria for written texts
  - Domain: the content type (i.e. subject field)
  - Time: the period of text production
  - Medium: the type of text publication (books, periodicals etc.)
- Two criteria for spoken texts
  - Demographic: informal conversations by speakers selected by age group, sex, social class and geographical region
  - Context-governed: formal encounters such as meetings, lectures and radio broadcasts recorded in 4 broad context categories

Written BNC

Spoken BNC

BNC vs. balance
- The design criteria of the BNC illustrate the notion of corpus balance
  very well
- “In selecting texts for inclusion in the corpus, account was taken of both production, by sampling a wide variety of distinct types of material, and reception, by selecting instances of those types which have a wide distribution. Thus, having chosen to sample such things as popular novels, or technical writing, best-seller lists and library circulation statistics were consulted to select particular examples of them.” (Aston and Burnard 2019: 28)

Pragmatics in corpus design
- “Most general corpora of today are badly balanced because they do not have nearly enough spoken language in them; estimates of the optimal proportion of spoken language range from 50% - the neutral option - to 90%, following a guess that most people experience many times as much speech as writing” (Sinclair 2019)
- The written BNC is nine times as large as the spoken BNC
- Is speech less frequent or less important than writing?
- Absolutely not!
  - … but writing typically has a larger audience than speech
  - … also, collecting spoken data costs 10 times as much as collecting written data
  - … it takes 10 hours to transcribe one hour of recording
- Pragmatic considerations also mean that balance is a more important issue for a static sample corpus than for a dynamic monitor corpus
  - As a monitor corpus is frequently updated, it is usually “impossible to maintain a corpus that also includes text of many different types, as some of them are just too expensive or time consuming to collect on a regular basis.” (Hunston 2019: 30-31)

Corpus balance: some tips
- “The corpus builder should retain, as target notions, representativeness and balance. While these are not precisely definable and attainable goals, they must be used to guide the design of a corpus and the selection of its components.” (Sinclair 2019)
- “It would be short-sighted indeed to wait until one can scientifically balance a corpus before starting to use one, and hasty to dismiss the results of corpus analysis as ‘unreliable’ or ‘irrelevant’ because the corpus used cannot be proved to be ‘balanced’.” (Atkins et al. 1992: 6)

Sampling in corpus creation
- Language is infinite, but a corpus is finite in size, so sampling is inescapable in corpus building
- “Some of the first considerations in constructing a corpus concern the overall design: for example, the kinds of texts included, the number of texts, the selection of particular texts, the selection of text samples from within texts, and the length of text samples. Each of these involves a sampling decision, either conscious or not.” (Biber 1993)

Sample vs. population
- The aim of sampling “is to secure a sample which, subject to limitations of size, will reproduce the characteristics of the population, especially those of immediate interest, as closely as possible” (Yates 1965: 9)
- A sample is a scaled-down version of a larger population
- A sample is representative if what we find for the sample also holds for the general population
- Corpus representativeness and balance rely heavily on sampling
- A corpus is a sample of a given population (language or language variety)

Sampling in corpus creation: key terms
- Sampling unit
  - For written text, it could be a book, periodical or newspaper
- Sampling frame
  - A list of sampling units
- Population
  - The language or language variety under consideration
  - The assembly of all sampling units, which can be defined in terms of
    - Language production (demographic: speakers and writers)
    - Language reception (demographic: audience and readers)
    - Language as a product (registers and genres)

Examples of Brown and LOB
- Brown
  - Population: written English text published in the United States in 1961
  - Sampling frame: a list of the collection of books and periodicals in the Brown University Library and the Providence Athenaeum
  - Sampling unit: each book/periodical within the sampling frame
- LOB
  - Population: written English text published in the UK around 1961
  - Sampling frame: the British National Bibliography Cumulated Subject Index 1960–1964 (for books) and Willing’s Press Guide 1961 (for periodicals)
  - Sampling unit: each book/periodical within the sampling frame

Sampling techniques
- Simple random sampling
  - All sampling units within the sampling frame are numbered and the sample is chosen by use of a table of random numbers
  - The chance of a feature being sampled correlates positively with its frequency in the population, so rare features may not be included
- Stratified random sampling
  - The population is divided into relatively homogeneous groups (i.e. strata), and these are then sampled at random
  - Never less representative than simple random sampling

Stratified random sampling: examples
- The whole population in the Brown/LOB corpora is divided into 15 text categories, and samples were then drawn from each category at random
- In demographic sampling for collecting spoken data, individuals (sampling units) in the population are first divided into different groups on the basis of speaker/writer age, sex and social class, and then samples are taken at random from each group

Size of samples
- Full texts or text chunks?
  - “Samples of language for a corpus should wherever possible consist of entire documents or transcriptions of complete speech events” (Sinclair 2019)
  - Good for studying textual organization
  - A full-text corpus may be inappropriate or problematic
    - Peculiarity of an individual style or topic may occasionally show through
    - There are copyright issues in including full texts
  - Frequent linguistic features are quite stable in their distributions and hence short text chunks (e.g. 2,000 running words) are usually sufficient
- Text-initial, middle or end chunks?
  - Text-initial, middle, and end samples must be taken in a balanced way

Proportion of samples
- In stratified random sampling, how many samples should be taken for each category?
- The numbers of samples across text categories should be proportional to their frequencies and/or weights in the target population in order for the resulting corpus to be considered representative
- Difficult to determine objectively, just a well-informed and intuitive guess

Proportion of genres in Brown
- Constant sample size: ca. 2,000 words

“Relatively speaking…”
- Any claim of corpus representativeness and balance must be interpreted in relative terms
- There is no objective way to balance a corpus or to measure its representativeness
- Corpus balance and representativeness are fluid concepts
- The research question that one has in mind when building a corpus determines what an acceptable
  balance is for the corpus one should use and whether it is suitably representative
- Corpus balance is also influenced by practical considerations
  - How easily can data of different types be collected?

Corpus size
- How large should a corpus be?
- There is no easy answer to this question.
  - Krishnamurthy (2019): “Size matters.”
  - Leech (1991): “Size is not all-important.”
- The size of the corpus needed depends upon the purpose for which it is intended as well as a number of practical considerations
  - The kind of query that is anticipated from users
    - Are you studying common or rare linguistic features?
  - The methodology they use to study the data
    - How much work can be done by the machine and how much has to be done by hand?
  - For corpus creators, also the source of data
    - Are the data in electronic form readily available at a reasonable cost?

Corpus size: growth over time
- Corpus size increases with the development of technology
  - 1960s-70s: Brown and LOB, one million words
  - 1980s: the Birmingham/Cobuild corpora, 20M words
  - 1990s: the British National Corpus, 100M words
  - Early 21st century: the Bank of English, 524M words

Corpus size: is a large corpus really what you want?
- The size of the corpus needed to explore a research question depends on the frequency and distribution of the linguistic features under consideration in that corpus, i.e. on your research question
- Corpora for lexical studies are much larger than those for grammatical studies
- Specialized corpora serve a very different yet important purpose from large multi-million-word corpora
- Corpora that need extensive manual annotation or analysis are necessarily small
- Many corpus tools set a ceiling on the number of concordances that can be extracted
- The optimum size of a corpus is determined by the research question the corpus is intended to address as well as practical considerations

Exploring existing English corpora
- To learn how corpora can be classified
- To learn about design decisions in creating different kinds of corpora
- To become familiar with a range of well-known and influential corpora

Types of corpora, different uses
- General vs. specialized corpora
- Written vs. spoken corpora
- Synchronic vs. diachronic corpora
- Monolingual vs. multilingual corpora
- Comparable vs. parallel corpora
- Native vs. learner corpora
- Developmental vs. learner corpora
- Raw vs. annotated corpora
- Sample vs. monitor corpora
- …

Monitor corpora
- Constantly updated and growing in size
- Much larger corpus size
- Often contain full text
- Always up-to-date
- Often only admit new material which has new features not already in the corpus
- Used to track changes across different periods of time
- Monitor corpora could be a series of static corpora
- Disadvantages
  - No attempt to balance the corpus
  - Text availability can become an issue (e.g. copyright)
  - Confusing to indicate a specific corpus version
  - Cannot easily compare results run on corpora of different sizes

Some well-known English corpora
- The British National Corpus (BNC)
- The Bank of English (BoE)
- BYU American English corpus
- Corpora of the Brown family (Brown, LOB, FLOB, Frown)
- ICE corpora (GB, EA, HK, Singapore, Philippines, New Zealand etc.)
- London-Lund Corpus of Spoken English
- SBCSAE
- The Helsinki Diachronic Corpus of English Texts (8th-18th century, ca. 5 million words)
- The International Corpus of Learner English (ICLE)
- MICASE

The BNC
- First and best-known national corpus (sample corpus)
- 100M word balanced corpus of written (90%) and spoken (10%) British English in current use, 1960 to early 1990s
- Rich metadata encoded for language variation studies
- POS tagged

Accessing the BNC
- BYU-BNC: /bnc/
- BNC Online: natcorp.ox.ac.uk/getting/index.xml.ID=order_online
- Lancaster BNCWeb CQP edition: bncweb.lancs.ac.uk/bncwebSignup/user/login.php
- BNC Baby: natcorp.ox.ac.uk/corpus/baby/index.html
- Sketch Engine: sketchengine.co.uk/
- BNC PIE: /
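
The difficulty noted earlier of comparing results run on corpora of different sizes is commonly handled by normalizing raw counts to a common base, such as occurrences per million words. The following is a minimal sketch under stated assumptions: only the two corpus sizes (100M words for the BNC, 524M for the BoE) come from the slides, while the function name `per_million` and the raw counts are hypothetical and purely illustrative.

```python
def per_million(raw_count: int, corpus_size: int) -> float:
    """Normalize a raw frequency to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size


# Corpus sizes in words, as given in the slides.
BNC_SIZE = 100_000_000
BOE_SIZE = 524_000_000

# Hypothetical raw counts of the same word in each corpus (invented numbers).
bnc_hits = 2_500
boe_hits = 10_480

bnc_rate = per_million(bnc_hits, BNC_SIZE)
boe_rate = per_million(boe_hits, BOE_SIZE)

# The larger corpus has more raw hits but the lower normalized frequency.
print(f"BNC: {bnc_hits} hits, {bnc_rate:.1f} per million words")
print(f"BoE: {boe_hits} hits, {boe_rate:.1f} per million words")
```

With these invented counts, the raw figures suggest the word is about four times as common in the larger corpus, whereas the normalized figures show it is slightly less frequent there. Normalization of this kind makes counts comparable but does not correct for differences in corpus composition.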
The BoE
- Best-known monitor corpus
- 524M words (counting and growing) of present-day English
  - 75% written and 25% spoken
  - 70% BrE, 20% AmE and 10% other English varieties
- Particularly useful for lexical and lexicographic studies, e.g. tracking new words, new uses or meanings of old words, and words falling out of use
- Access to the BoE
  - A 56M word sampler: collins.co.uk/books.aspx?group=153

Corpus of Contemporary American English (COCA)
- 385+M words of American English
  - 20M per year for 1990-2019
  - Equally divided among spoken, fiction, popular magazines, newspapers, and academic texts
  - Updated every 6-9 months
- Useful for studying variation across genres and over time
- Free online access: /

Corpora of the Brown family
- Brown: written AmE in 1961
- LOB: written BrE in 1961
- FLOB: written BrE in 1991
- Frown: written AmE in 1991
- Common corpus design
  - One M words each
  - 500 samples (ca. 2,000 words each)
  - Same proportions from the same 15 text categories
- Useful for synchronic and diachronic comparison of BrE and AmE
- Further information
  - ICAME CD: khnt.hit.uib.no/icame/manuals/

The ICE corpora
- 20 one-M-word balanced corpora
  - E.g. Britain, Ireland, US, Canada, Hong Kong, Singapore, India, the Philippines, East Africa
- Common corpus design
  - 500 samples (ca. 2,000 words each)
  - 60% spoken + 40% written
  - 12 genres
  - 1990-1994
- Designed for the synchronic study of world Englishes
- More information: ucl.ac.uk/english-usage/ice/

The London-Lund Corpus
- First electronic corpus of spontaneous language
- A corpus of spoken British English recorded from 1953-1987
- 100 texts, each of 5,000 words, totalling half a
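
The common design of the Brown family described above (a fixed number of samples drawn at random from each text category) is an instance of stratified random sampling. The following is a minimal sketch, assuming a hypothetical sampling frame of (unit, category) pairs and hypothetical per-category quotas. The category names, unit names, and quotas are invented for illustration and are not the actual Brown proportions.

```python
import random
from collections import defaultdict


def stratified_sample(frame, quotas, seed=0):
    """Draw quotas[category] sampling units at random from each stratum.

    frame  : list of (unit_id, category) pairs, i.e. the sampling frame
    quotas : dict mapping category -> number of units to draw
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group the sampling units into strata by text category.
    strata = defaultdict(list)
    for unit_id, category in frame:
        strata[category].append(unit_id)

    # Sample without replacement inside each stratum.
    sample = {}
    for category, n in quotas.items():
        units = strata.get(category, [])
        if n > len(units):
            raise ValueError(f"not enough units in stratum {category!r}")
        sample[category] = rng.sample(units, n)
    return sample


# Hypothetical sampling frame: 30 press, 20 fiction and 10 academic titles.
frame = (
    [(f"press_{i}", "press") for i in range(30)]
    + [(f"fiction_{i}", "fiction") for i in range(20)]
    + [(f"academic_{i}", "academic") for i in range(10)]
)

# Hypothetical quotas, proportional to stratum size (one unit in five).
quotas = {"press": 6, "fiction": 4, "academic": 2}

sample = stratified_sample(frame, quotas)
for category, units in sample.items():
    print(category, len(units), units)
```

Unlike simple random sampling over the whole frame, this guarantees that a small stratum (here the academic titles) is represented in the sample. How the quotas should be set remains the informed judgement discussed under "Proportion of samples".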