Portland State University
PDXScholar

Business Faculty Publications and Presentations
The School of Business

8-2021

Unboxing the Algorithm: A Process Model of an Algorithmic Solution

Marta Stelmaszak Rosa
Portland State University, stmarta@

Follow this and additional works at: /busadmin_fac

Part of the Business Commons

Let us know how access to this document benefits you.

Citation Details

Stelmaszak, M. (2021) Unboxing the Algorithm: A Process Model of an Algorithmic Solution. Americas Conference on Information Systems 2021, 9-13 August 2021.

This Conference Proceeding is brought to you for free and open access. It has been accepted for inclusion in Business Faculty Publications and Presentations by an authorized administrator of PDXScholar. Please contact us if we can make this document more accessible: pdxscholar@

Twenty-Seventh Americas Conference on Information Systems, Montreal, 2021

Unboxing the Algorithm: A Process Model of an Algorithmic Solution

Completed Research

Marta Stelmaszak
Portland State University
stmarta@

Abstract

With the explosion of data, analytics and artificial intelligence, information systems research focuses on the use, management and consequences of algorithms. Thus far, only a handful of papers offer insights into how algorithmic solutions work. To address this gap, we studied the code making up 45 public data science Jupyter notebooks containing algorithmic solutions developed to predict customer churn in a credit card dataset on a data science platform K. We synthesized a process model of an algorithmic solution: preparing the environment, reading in data, cleaning data, exploratory data analysis, pre-processing the dataset, building and training the model, and testing and validating the model. Unboxing the algorithm and investigating the process offers a more fine-tuned understanding and language to better conceptualize the use, management and consequences of algorithmic solutions. It also provides a scaffolding for research into the development of algorithmic solutions, highlighting their variability, experimentation and data scientist decisions.

Keywords

Algorithms, algorithmic solutions, data science, information systems development, process model

Introduction

Algorithms have, without a doubt, attracted research attention across a number of fields, from media studies, through sociology, to computer science. Management and information systems (IS) researchers study algorithmic solutions primarily in terms of their use, management and consequences for individuals in work contexts, in organizations, and in the wider society (Galliers et al. 2017; Markus 2017; Newell and Marabelli 2015). However, this research can greatly benefit from an improved understanding of how algorithmic solutions are developed, and thus there have been calls to focus more on theorizing their development (van den Broek et al. 2021). Thus far, only a handful of papers in IS offer insights into how algorithmic solutions work, which is an essential link between understanding their use and their development. Against this background, this study aims to answer a simple question: what is the process of making an algorithmic solution work?

To uncover the building blocks and propose a process model, we studied 45 public data science Jupyter notebooks containing algorithmic solutions developed to predict customer churn in a credit card dataset on a popular data science and machine learning platform K (Dissanayake et al. 2015; Mangal and Kumar 2016). Referring to a common problem faced by many companies and often tackled by algorithmic solutions, the credit card dataset attracted over 200 notebooks with code and comments describing attempts to best predict customer churn. We selected 45 of the best-regarded notebooks, downloaded them and coded them using a grounded theory approach (Charmaz 2006; Glaser and Strauss 1967; Urquhart and Fernandez 2006). We then grouped the themes to identify the elements that made up each proposed algorithmic solution and distilled a process model of how they were developed.

Based on our findings, we propose a process model of making an algorithmic solution work encompassing: preparing the environment, reading in data, cleaning data, exploratory data analysis, pre-processing the dataset, building and training the model, and testing and validating the model. We contribute to information systems and management literature by developing a process model of an algorithmic solution that not only offers a more fine-tuned language to investigate the use, management and consequences of algorithmic solutions for individuals, organizations and societies, but also enables further study of the design and development of such solutions from a socio-technical perspective.

Making Algorithmic Solutions Work

Recent technological (processing capabilities, big data, machine learning), societal (use of smartphones, attitudes towards data, social media) and organizational (platformization, networks) developments contributed to the growth in use of various algorithms (Baptista et al. 2017; Berente et al. 2019). IS research in the socio-technical tradition has thus focused on the study of the use, management and consequences of algorithms on individual, organizational and societal levels (Galliers et al. 2017; Markus 2017; Newell and Marabelli 2015). However, far less attention has been paid so far to understanding how algorithms, and the algorithmic solutions based on them, are developed (van den Broek et al. 2021). Early papers have begun to uncover how data scientists and subject matter experts need to work together in the development process (van den Broek et al. 2021), how the practices of data scientists in the banking industry rely on both subjectivity and objectivity in the production of information (Joshi 2020), and how data scientists engage in the practices of knowledge hiding (Ghasemaghaei and Turel 2021). In other words, while focusing predominantly on what happens after the algorithms are put to work, current literature offers little insight into how algorithms are made to work, that is, what steps need to be in place for an algorithmic solution to work effectively. Such understanding is essential because the process of making an algorithmic solution work, as we show below, determines what kinds of insights and predictions it offers, thus influencing decisions.

Most researchers who describe in broad strokes what goes into making algorithmic solutions work refer to certain aspects with varying consistency: the fact that algorithms process data (Balasubramanian and Ye 2021; van den Broek et al. 2020; Galliers et al. 2017; Gregory et al. 2020; Grønsund and Aanestad 2020; Lebovitz 2020; Lycett 2013; Newell and Marabelli 2015; Pachidi et al. 2021; Shrestha et al. 2019) in an automated or preprogrammed way (Galliers et al. 2017; Grønsund and Aanestad 2020; Günther et al. 2017; Shrestha et al. 2019) to learn models (Balasubramanian and Ye 2021; Li et al. 2019; Shrestha et al. 2019) leading to new insights (Günther et al. 2017; Günther and Joshi 2020; Pachidi et al. 2021), decisions (Balasubramanian and Ye 2021; van den Broek et al. 2020; Galliers et al. 2017; Newell and Marabelli 2015) or predictions (Lebovitz 2020; Li et al. 2019; Shrestha et al. 2019). This offers a punctuated and incomplete picture of the elements involved in developing algorithms that can be subsequently used in business settings.

A handful of papers offer insights into the essential elements of what happens inside algorithmic solutions. Pachidi et al. (2021) provide a detailed description, covering various elements that are at play in a predictive model:

“The model combined a number of internal and external data sources, such as time series of customer transactions, Nielsen market data, Gartner ICT spending predictions, financial data, and usage data. The output of the model was represented in a spreadsheet format that contained a list of all medium-sized customers and predictions regarding potential sales opportunities. The CLM model allocated customers to different customer segments (A, B, C, D) based on their historical and predicted sales with TelCo. For each TelCo product line (e.g., business telephone systems, mobile phone packages, fixed line setups), the CLM model assigned a position in the customer sales lifecycle (inform, specify, sell, maintain), each of which entailed a different contact strategy. Thus, the model output consisted of a ranking of opportunities, with a prioritized action list for account managers.”

Grønsund and Aanestad (2020, p. 7) are similarly detailed:

“The algorithm-supported analysis system was designed to automate both data acquisition and the processing of data for subsequent analysis. Acquisition of data was automated by the system pulling streams of data on ship activity from the satellite-AIS data provider, along with additional data such as vessel descriptions and geospatial data, into a Hadoop-based data warehouse repository. Here the data were extracted and consolidated, then classified using rule-based NLP (Natural Language Processing) classification, and finally presented in BI tools that allowed human interpretation of the output.”

While both descriptions point to obtaining, compiling and processing data, followed by further analysis and classification, they reveal differences in how the solutions work, and do not offer a complete picture. Taking a more general view, Orlikowski and Scott define an algorithm as “a set of step-by-step instructions to achieve a desired result in a finite number of moves” (2015, p. 210). Acknowledging this more traditional definition of an algorithm (a program containing a fixed sequence of instructions executed until a solution is reached), rooted in computer science (Hopcroft and Ullman 1983), Faraj et al. (2018) ‘update’ and broaden the scope of this definition by conceptualizing learning algorithms as “an emergent family of technologies that build on machine learning, computation, and statistical techniques, as well as rely on large data sets to generate responses, classifications, or dynamic predictions that resemble those of a knowledge worker” (p. 62). A similar definition of artificial intelligence algorithms is put forward by Tarafdar et al. (2020, p. 1): “We define AI algorithms as those that extract insights and knowledge from big data sources; computational and statistical techniques such as machine learning (ML) and deep learning embedded in such algorithms, aim to ‘teach’ computers the ability to detect patterns in big data”.

While these definitions offer a good starting point and an initial overview of the elements in the process of making algorithmic solutions work, they are partial and divergent in their focus. These differences in the definition, understanding, scope and scale of the steps and elements required to make algorithms work hamper the development of the understanding of the use, management and consequences of algorithms, and at the same time make uncovering their development more difficult. For IS research to systematically progress in this area it is thus fundamental to ask: what is the process of making an algorithmic solution work?

Research Setting and Methods

To answer this question, we studied 45 public data science notebooks containing algorithmic solutions developed to predict customer churn in a credit card dataset on a popular data science and machine learning platform K. Below we describe the research setting, as well as data collection and analysis methods.

Research Setting

K is a popular platform for data scientists and machine learning engineers where they can develop and improve their skills, as well as participate in corporate-sponsored competitions by addressing a variety of problems related to datasets published on the platform. K, part of Alphabet Inc, allows users to upload datasets, set specific tasks for them and create interactive Jupyter notebooks where users can develop their algorithmic solutions. K was selected as a setting because of its public availability and openness in sharing notebooks, which allows unprecedented access to the inner workings of algorithmic solutions. Others have used K for research purposes as well (Dissanayake et al. 2015; Mangal and Kumar 2016).

The dataset we selected for this study is a well-regarded and popular dataset with high usability. It contains the details of around 10,000 credit card customers of a bank, whereby a portion of customers churned. The goal is to identify, based on 18 variables such as age, salary, credit card limit and similar, what makes a customer churn (give up a credit card) to be able to predict customers at risk of churning in the future, as well as to identify the variables that are most predictive of the risk of churn (“Kaggle.Com” 2021). When the dataset was investigated for the purposes of this research project, there were around 210 notebooks submitted that contained algorithmic solutions pertaining to this dataset, with constant daily activity in existing notebooks and new notebooks being added.

We selected an open and public dataset rather than a competition because the majority of notebooks submitted for competitions are private and thus visible only to sponsor companies, and competitions are usually very specific and limit the number of potential algorithmic solutions applied. In contrast, public notebooks allow good access to a variety of notebooks containing fairly unrestricted solutions and allow for much more experimentation on the part of users. From the many datasets available on K, we selected the credit card customers dataset because it is related to a common problem that many companies and businesses face, and it is a problem that is often tackled by developing algorithmic solutions, thus it is a good representative sample of what researchers in information systems and management would consider of interest.

Data collection

In January and February 2021, we collected 57 Jupyter notebooks that were created using the credit card customer dataset with Python as the set programming language. The notebooks were arranged from the ‘hottest’ (a measure used on K to define notebooks with most activity, edits and highest votes by the community; “Kaggle.Com” 2021) to the least hot, and thus those that we collected were considered among the ‘hottest’ at the time. We decided to select the ‘hottest’ notebooks as these were assessed as high quality by the community, and thus were likely to contain well-developed algorithmic solutions. We discarded notebooks in R to eliminate differences in programming languages, and notebooks that contained only partial solutions, for example those that only analyzed data without building actual models. We ended up with 45 suitable notebooks. Using a feature available on K, we downloaded all of the selected notebooks and converted them to PDF documents to analyze them in NVivo.

Data analysis

Since our study is rooted in grounded theory (Charmaz 2006; Glaser and Strauss 1967; Urquhart and Fernandez 2006), we proceeded by inductively coding the notebooks to identify the different elements of code they contained by what these elements of code did. We coded each segment of code in each notebook to identify its function. Verbal descriptions of data scientists sometimes provided additional information as to the role of each code segment, so these were coded too. However, the descriptions were mostly useful in the second stage of data analysis, where we grouped the codes we obtained into higher-level elements of the process, as they explained the flow of the process. For example, in the notebooks data scientists would sometimes indicate they were proceeding to exploratory data analysis, and we used these comments to group elements of code identified under the element ‘Exploratory Data Analysis’.

Because of the inductive nature of our study, we oscillated between data analysis and further data collection. After coding the first 30 notebooks, we began to group the codes to start building the model. We then proceeded with coding and analyzing notebooks one by one to supplement and verify the model that was emerging from our analysis. When we reached notebook number 35, the subsequent 10 notebooks did not add any new codes to the codebook and at this point we decided to stop coding and analyzing the notebooks as we reached the point of saturation.

Unboxing the Algorithm

In this section, we present the elements of the process of making an algorithmic solution work that we identified in the data. Each element is discussed in turn by showing what kinds of operations were performed in every element.

Preparing the Environment

Notebooks begin with setting the environment in which the development of the algorithmic solution takes place: programming language, acceleration and connection to the internet. The notebooks we observed were all set up in a Python 3 environment, which “comes with many helpful analytics libraries installed” (Notebook 002) and allows writing up to 20GB to the working directory. Notebooks give the possibility to turn on an accelerator, such as a GPU, for faster processing, and to connect to the internet for access to external files. In some notebooks, data scientists use verbal comments to identify and restate the problem.

After this initial setup, various necessary libraries are imported, that is, pre-packaged functions designed for specific purposes that can be deployed by data scientists without the need to code such functions from scratch. Invariably, the notebooks featured “numpy” (Notebook 005), a Python library for linear algebra, and “pandas” (Notebook 007), allowing for data processing and, for example, reading in CSV files, among others. These two libraries are essential to develop the algorithmic solution. Other libraries imported include data visualization packages, such as “seaborn” or “matplotlib” (Notebook 029), which are fairly standard and popular libraries for this purpose. In some notebooks, all required packages are imported at the beginning of the notebook, including “sklearn” and “keras” (Notebook 014) that are used for building models, while other notebooks import additional libraries as and when needed. Libraries are imported with simple code: “import numpy as np” (Notebook 001), for example. Importing libraries is a standard procedure and there are no substantial comments regarding this step. A variety of widely used libraries exist for developing algorithmic solutions, and they encapsulate and abstract away the complexity behind tasks such as training a specific model, as explained below.

Reading in Data

The next element in the process in an algorithmic solution is to read in the required data. The first step here, quite logically, includes loading data in. Because the dataset that the notebooks use is uploaded to Kaggle, it can be attached to each notebook with a simple search within the interface, and then imported by executing a command from the “pandas” library, “read_csv” (Notebook 001).

Inspecting the data follows, usually through the function “head”, displaying the first five (by default) rows of the dataset and corresponding columns with column headers, and sometimes the attribute “shape”, displaying the dimensions of the dataset (number of rows and number of columns), as well as the attribute “columns”, giving the names of columns in the dataset. In just one notebook, we observed explicitly looking for duplicate entries in the dataset. Commands to perform these operations are pre-packaged and take the forms of “df.head()”, “df.shape” or “df.columns” (Notebook 003). This stage of the process also involves checking data types present in the dataset, performed by using “info” or “dtypes”, which indicate which columns contain integer (whole numbers), float (fractions with decimal points) or object (text or mixed numeric and non-numeric values) data types. This is important as most algorithmic solutions work only with numerical values. As part of reading in data, simple descriptive statistics of the data are obtained through the function “describe”, resulting in displaying the number of rows, mean, standard deviation, minimum value, quartiles, and maximum value for each column.

Conducting these steps is essential to load the dataset and obtain basic information about the data, needed to confirm that the data is loaded correctly and contains the expected columns and rows, and to gain initial familiarity with the dataset.
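The inspection steps in this element can be illustrated with a minimal sketch. It is not taken from any of the notebooks: the four-row inline CSV is a hypothetical stand-in for the attached dataset, and all column names and values other than “CLIENTNUM” are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file attached to a notebook; the real
# dataset has around 10,000 rows. Column names other than CLIENTNUM are assumed.
csv_text = """CLIENTNUM,Attrition_Flag,Customer_Age,Gender,Credit_Limit
1001,Existing Customer,45,M,12691.0
1002,Existing Customer,49,F,8256.0
1003,Attrited Customer,51,M,3418.0
1004,Existing Customer,40,F,3313.0
"""

# In a notebook this would be pd.read_csv(<path to the attached file>).
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())      # first five rows by default (only four exist here)
print(df.shape)       # (number of rows, number of columns)
print(df.columns)     # column names
print(df.dtypes)      # data type of each column
df.info()             # data types plus non-null counts per column
print(df.describe())  # count, mean, std, min, quartiles, max of numeric columns

# Explicit check for duplicate rows, as observed in one notebook.
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)
```

Reading from an in-memory string keeps the sketch self-contained; the calls on `df` are the same ones the notebooks apply to the full dataset.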

Cleaning Data

After reading in the dataset, data is cleaned to prepare it for further processing. This is essential before any analysis can take place. Steps at this stage tend to be taken in various orders across the notebooks, and are reported here in no particular order.

Missing values are identified and dealt with: that is, null values in the dataset have to be resolved before any analysis can take place. This is done by using the function “isnull”, listing all columns with the number of missing values (Notebook 001). The customer churn dataset contained no null values, so in this case there was no need to deploy solutions to solve this problem. Missing values have to be resolved as the majority of algorithms cannot deal with datasets containing missing values. One of the ways to solve this problem that is presented in the notebooks is the method of imputation, that is, replacing the missing or null value with an existing value from the dataset. In the solution proposed in the notebook this is done based on the nearest neighbor of the missing value, but since no missing values were detected, the solution is not implemented.
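As a hedged illustration of this step, the sketch below counts missing values and then imputes one from the nearest neighbor. The toy data and column names are hypothetical, and the use of scikit-learn's `KNNImputer` with a single neighbor is our assumption about how such an imputation could be written, not the code of the notebook in question.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data with one missing credit limit.
df = pd.DataFrame({
    "Customer_Age": [45.0, 46.0, 30.0, 31.0],
    "Credit_Limit": [9000.0, np.nan, 3000.0, 3100.0],
})

# Number of missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# With n_neighbors=1 the missing entry is replaced by the value of the single
# most similar row, where similarity is measured on the observed columns.
imputer = KNNImputer(n_neighbors=1)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```

Here the row with the missing credit limit is closest in age to the first row, so it takes over that row's value of 9000.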

In the notebooks, we found that columns are sometimes renamed if their names are not intuitive enough or simply too long. Certain columns containing variables that are not needed for the analysis are removed. For example, the customer churn dataset contains two Naïve Bayes Classifier columns by default, and the author of the dataset suggests removing these columns before proceeding with analysis. At different points in data cleaning, exploratory data analysis or pre-processing the dataset, various columns are also removed if they are not contributing to the model (for example, removing customer ID: “data = data.drop(columns=[‘CLIENTNUM’])”, Notebook 015). In one notebook, outliers are removed from the data using the common statistical measure of z-score, indicating how far from the mean a given data point is. In this notebook, the only one we observed that removed outliers, this resulted in removing 810 rows from the dataset.
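A minimal sketch of dropping an identifier column and filtering outliers by z-score follows. The synthetic data, the “Credit_Limit” column name and the cut-off of 3 standard deviations are assumptions for illustration; the text does not state which threshold the notebook used.

```python
import numpy as np
import pandas as pd

# Synthetic data: 100 customers with credit limits around 5000,
# plus one deliberately extreme value.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "CLIENTNUM": np.arange(100),
    "Credit_Limit": rng.normal(5000.0, 500.0, size=100),
})
data.loc[0, "Credit_Limit"] = 50000.0

# The customer ID carries no predictive information, so it is dropped.
data = data.drop(columns=["CLIENTNUM"])

# z-score: distance of each value from the column mean, in standard deviations.
z = (data["Credit_Limit"] - data["Credit_Limit"].mean()) / data["Credit_Limit"].std()

# Keep rows within 3 standard deviations of the mean (assumed threshold).
cleaned = data[z.abs() < 3]
print(len(data), len(cleaned))
```

The threshold is a decision left to the data scientist: a stricter cut-off removes more rows and changes what the model is later trained on.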

All notebooks we studied transform data types as part of cleaning data. This step, sometimes referred to as feature engineering, is required when the dataset contains object data types, which are categorical variables typical in many datasets, such as marital status, level of education or gender. These data types have to be transformed into numerical variables in order to be analyzed. This is conducted by using pre-existing functions to encode these variables as integers (e.g. primary education as 1, secondary as 2, tertiary as 3) or using popular one-hot encoding where there is no natural ordinal relationship between categories and dummy variables are created (e.g. male is 0, female is 1). Cleaned data is an essential element of any algorithmic solution, as without the steps taken in this element, data either results in erroneous analysis and model training, or simply cannot be used to train models.
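The two encoding strategies can be sketched as follows. The column names and category labels are hypothetical, and the mapping dictionary is our own illustration of integer encoding rather than code from a specific notebook.

```python
import pandas as pd

# Hypothetical categorical columns of the kind described above.
df = pd.DataFrame({
    "Education_Level": ["primary", "tertiary", "secondary", "primary"],
    "Gender": ["M", "F", "F", "M"],
    "Marital_Status": ["Married", "Single", "Divorced", "Single"],
})

# Integer encoding for a variable with a natural order.
education_order = {"primary": 1, "secondary": 2, "tertiary": 3}
df["Education_Level"] = df["Education_Level"].map(education_order)

# A two-category variable becomes a single 0/1 dummy (male is 0, female is 1).
df["Gender"] = (df["Gender"] == "F").astype(int)

# One-hot encoding for a variable with no natural order:
# one 0/1 column is created per category.
df = pd.get_dummies(df, columns=["Marital_Status"], dtype=int)
print(df)
```

Choosing between the two matters: integer codes impose an order on the categories, whereas one-hot columns treat every category as equally distant from the others.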

Exploratory Data Analysis

The next step in the algorithmic solution process is exploratory data analysis, whereby actions are taken to learn about the relationship between the dependent variable of interest (here: customer churn or attrition) and independent variables that may help build the predictive model. This step is essential to uncover what model will be the most appropriate for the dataset and which variables can be potentially of interest.

The first step is to identify the dependent variable (a trivial matter in the given dataset), and to analyze independent variables. This is very often performed by visualizing them independently, in relation to each other, or in relation to the dependent variable. In most cases, such visualizations were implemented using functions from visualization libraries, such as “seaborn”, “matplotlib” or, rarely, “plotly”. Visualizing data is the part that takes up the most code in nearly all notebooks we analyzed. Various visualizations are produced, such as box plots, pie charts and histograms, in order to help identify which variables may be useful in building the model. Visualizations are often accompanied by comments such as “Females are slightly more likely to churn with 17% compared to males with 15%, we’ll convert this feature to 1-0” (Notebook 013). Some notebooks contain more comprehensive comments on the learnings from visualizations.

The next step in exploratory data analysis is to identify correlations between variables. Identifying correlations is an important step in exploratory data analysis, as from this decisions can be made as to which features to include in pre-processing the dataset for model building, as described below. For example, Notebook 022, based on the identification of correlations, decides to “# Drop some features which have less than 0.01 correlation and greater than -0.01 correlation”. Exploratory data analysis is a required step of building an algorithmic solution as it provides the necessary insight into the dataset for the purposes of model building. It is at this stage that the importance of variables with respect to the target variable is assessed.
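The correlation-based decision quoted from Notebook 022 can be sketched as below. Only the 0.01 threshold comes from the quoted comment; the toy data, the column names and the exact filtering code are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Deterministic toy data: "informative" tracks the target closely,
# while "noise" is constructed to be uncorrelated with it.
churn = np.tile([0, 0, 1, 1], 25)
noise = np.tile([0.0, 1.0, 0.0, 1.0], 25)
df = pd.DataFrame({
    "churn": churn,
    "informative": 3.0 * churn + noise,
    "noise": noise,
})

# Pearson correlation of every feature with the target variable.
target_corr = df.corr()["churn"].drop("churn")
print(target_corr)

# Drop features whose correlation with the target lies between -0.01 and 0.01.
weak = target_corr[target_corr.abs() < 0.01].index
reduced = df.drop(columns=weak)
print(list(reduced.columns))
```

A filter of this kind only captures linear association with the target, which is one reason the choice of threshold and measure varies between notebooks.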

Pre-processing the Dataset

The following step in the process is to pre-process the dataset, which involves preparing the dataset according to the requirements of model building. First, data need to be scaled, which may involve actual scaling, that is, changing the range of variables to a common range, e.g. between 0 and 1, or normalizing the variables following a normal distribution. Scaling is performed to ensure that no variable is interpreted as more predictive than it actually is just because its numerical values are on a different scale from other variables. Scaling is routinely performed using standard pre-packaged functions, such as “StandardScaler” from the popular “sklearn” library (Notebook 026).
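The two kinds of scaling can be compared in a short sketch. The three-row feature matrix is hypothetical, and `MinMaxScaler` is our choice for illustrating scaling to a common range; the text names only “StandardScaler”.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales: age and credit limit.
X = np.array([
    [25.0, 1500.0],
    [40.0, 9000.0],
    [55.0, 34500.0],
])

# Rescale each column to the range between 0 and 1.
X_minmax = MinMaxScaler().fit_transform(X)

# Standardize each column to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std)
```

After either transformation the credit limit no longer dominates the age column merely because its raw values are larger.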

The dataset should be resampled if it is not balanced, that is, if one category is present much more frequently than another. In the case of the dataset investigated, customers who attrited occurred much less frequently, as identified in exploratory data analysis, so resampling was required. This is usually done by oversampling from the group of attrited customers, most frequently using a pre-packaged function ‘SMOTE’ (Synthetic Minority Oversampling Technique) which creates additional data points…
