




Portland State University
PDXScholar
Business Faculty Publications and Presentations
The School of Business
8-2021
Unboxing the Algorithm: A Process Model of an Algorithmic Solution
Marta Stelmaszak Rosa
Portland State University, stmarta@
Follow this and additional works at: /busadmin_fac
Part of the Business Commons
Citation Details
Stelmaszak, M. (2021) Unboxing the Algorithm: A Process Model of an Algorithmic Solution. Americas Conference on Information Systems 2021, 9-13 August 2021.
This Conference Proceeding is brought to you for free and open access. It has been accepted for inclusion in Business Faculty Publications and Presentations by an authorized administrator of PDXScholar. Please contact us if we can make this document more accessible: pdxscholar@.
Unboxing the Algorithm: A Process Model
Twenty-Seventh Americas Conference on Information Systems, Montreal, 2021
Unboxing the Algorithm: A Process Model of an Algorithmic Solution
Completed Research
Marta Stelmaszak
Portland State University
stmarta@
Abstract
With the explosion of data, analytics and artificial intelligence, information systems research focuses on the use, management and consequences of algorithms. Thus far, only a handful of papers offer insights into how algorithmic solutions work. To address this gap, we studied the code making up 45 public data science Jupyter notebooks containing algorithmic solutions developed to predict customer churn in a credit card dataset on a data science platform K. We synthesized a process model of an algorithmic solution: preparing the environment, reading in data, cleaning data, exploratory data analysis, pre-processing the dataset, building and training the model, and testing and validating the model. Unboxing the algorithm and investigating the process offers a more fine-tuned understanding and language to better conceptualize the use, management and consequences of algorithmic solutions. It also provides a scaffolding for research into the development of algorithmic solutions, highlighting their variability, experimentation and data scientist decisions.
Keywords
Algorithms, algorithmic solutions, data science, information systems development, process model
Introduction
Algorithms have, without a doubt, attracted research attention across a number of fields, from media studies, through sociology, to computer science. Management and information systems (IS) researchers study algorithmic solutions primarily in terms of their use, management and consequences for individuals in work contexts, in organizations, and in the wider society (Galliers et al. 2017; Markus 2017; Newell and Marabelli 2015). However, this research can greatly benefit from an improved understanding of how algorithmic solutions are developed, and thus there have been calls to focus more on theorizing their development (van den Broek et al. 2021). Thus far, only a handful of papers in IS offer insights into how algorithmic solutions work, which is an essential link between understanding their use and their development. Against this background, this study aims to answer a simple question: what is the process of making an algorithmic solution work?
To uncover the building blocks and propose a process model, we studied 45 public data science Jupyter notebooks containing algorithmic solutions developed to predict customer churn in a credit card dataset on a popular data science and machine learning platform K (Dissanayake et al. 2015; Mangal and Kumar 2016). Referring to a common problem faced by many companies and often tackled by algorithmic solutions, the credit card dataset attracted over 200 notebooks with code and comments describing attempts to best predict customer churn. We selected 45 of the best-regarded notebooks, downloaded them and coded them using a grounded theory approach (Charmaz 2006; Glaser and Strauss 1967; Urquhart and Fernandez 2006). We then grouped the themes to identify the elements that made up each proposed algorithmic solution and distilled a process model of how they were developed.
Based on our findings, we propose a process model of making an algorithmic solution work encompassing: preparing the environment, reading in data, cleaning data, exploratory data analysis, pre-processing the dataset, building and training the model, and testing and validating the model. We contribute to information systems and management literature by developing a process model of an algorithmic solution that not only offers a more fine-tuned language to investigate the use, management and consequences of algorithmic solutions for individuals, organizations and societies, but also enables further study of the design and development of such solutions from a socio-technical perspective.
Making Algorithmic Solutions Work
Recent technological (processing capabilities, big data, machine learning), societal (use of smartphones, attitudes towards data, social media) and organizational (platformization, networks) developments contributed to the growth in use of various algorithms (Baptista et al. 2017; Berente et al. 2019). IS research in the socio-technical tradition has thus focused on the study of the use, management and consequences of algorithms on individual, organizational and societal levels (Galliers et al. 2017; Markus 2017; Newell and Marabelli 2015). However, far less attention has been paid so far to understanding how algorithms, and the algorithmic solutions based on them, are developed (van den Broek et al. 2021). The first papers are beginning to uncover how data scientists and subject matter experts need to work together in the development process (van den Broek et al. 2021), how the practices of data scientists in the banking industry rely on both subjectivity and objectivity in the production of information (Joshi 2020), and how data scientists engage in the practices of knowledge hiding (Ghasemaghaei and Turel 2021). In other words, while focusing predominantly on what happens after the algorithms are put to work, current literature offers little insight into how algorithms are made to work, that is, what steps need to be in place for an algorithmic solution to work effectively. Such understanding is essential because the process of making an algorithmic solution work, as we show below, determines what kinds of insights and predictions it offers, thus influencing decisions.
Most researchers who describe in broad strokes what goes into making algorithmic solutions work refer to certain aspects with varying consistency: the fact that algorithms process data (Balasubramanian and Ye 2021; van den Broek et al. 2020; Galliers et al. 2017; Gregory et al. 2020; Grønsund and Aanestad 2020; Lebovitz 2020; Lycett 2013; Newell and Marabelli 2015; Pachidi et al. 2021; Shrestha et al. 2019) in an automated or preprogrammed way (Galliers et al. 2017; Grønsund and Aanestad 2020; Günther et al. 2017; Shrestha et al. 2019) to learn models (Balasubramanian and Ye 2021; Li et al. 2019; Shrestha et al. 2019) leading to new insights (Günther et al. 2017; Günther and Joshi 2020; Pachidi et al. 2021), decisions (Balasubramanian and Ye 2021; van den Broek et al. 2020; Galliers et al. 2017; Newell and Marabelli 2015) or predictions (Lebovitz 2020; Li et al. 2019; Shrestha et al. 2019). This offers a punctuated and incomplete picture of the elements involved in developing algorithms that can be subsequently used in business settings.
A handful of papers offer insights into the essential elements of what happens inside algorithmic solutions. Pachidi et al. (2021) provide a detailed description, covering various elements that are at play in a predictive model:
“The model combined a number of internal and external data sources, such as time series of customer transactions, Nielsen market data, Gartner ICT spending predictions, financial data, and usage data. The output of the model was represented in a spreadsheet format that contained a list of all medium-sized customers and predictions regarding potential sales opportunities. The CLM model allocated customers to different customer segments (A, B, C, D) based on their historical and predicted sales with TelCo. For each TelCo product line (e.g., business telephone systems, mobile phone packages, fixed line setups), the CLM model assigned a position in the customer sales life cycle (inform, specify, sell, maintain), each of which entailed a different contact strategy. Thus, the model output consisted of a ranking of opportunities, with a prioritized action list for account managers.”
Grønsund and Aanestad (2020, p. 7) are similarly detailed:
“The algorithm-supported analysis system was designed to automate both data acquisition and the processing of data for subsequent analysis. Acquisition of data was automated by the system pulling streams of data on ship activity from the satellite-AIS data provider, along with additional data such as vessel descriptions and geospatial data, into a Hadoop-based data warehouse repository. Here the data were extracted and consolidated, then classified using rule-based NLP (Natural Language Processing) classification, and finally presented in BI tools that allowed human interpretation of the output.”
While both descriptions point to obtaining, compiling and processing data, followed by further analysis and classification, they reveal differences in how the solutions work, and do not offer a complete picture. Taking a more general view, Orlikowski and Scott define an algorithm as “a set of step-by-step instructions to achieve a desired result in a finite number of moves” (2015, p. 210). Acknowledging this more traditional definition of an algorithm (a program containing a fixed sequence of instructions executed until a solution is reached), which is rooted in computer science (Hopcroft and Ullman 1983), Faraj et al. (2018) ‘update’ and broaden the scope of this definition by conceptualizing learning algorithms as “an emergent family of technologies that build on machine learning, computation, and statistical techniques, as well as rely on large data sets to generate responses, classifications, or dynamic predictions that resemble those of a knowledge worker” (p. 62). A similar definition of artificial intelligence algorithms is put forward by Tarafdar et al. (2020, p. 1): “We define AI algorithms as those that extract insights and knowledge from big data sources; computational and statistical techniques such as machine learning (ML) and deep learning embedded in such algorithms, aim to ‘teach’ computers the ability to detect patterns in big data”.
While these definitions offer a good starting point and an initial overview of the elements in the process of making algorithmic solutions work, they are partial and divergent in their focus. These differences in the definition, understanding, scope and scale of the steps and elements required to make algorithms work hamper our understanding of the use, management and consequences of algorithms, and at the same time make uncovering their development more difficult. For IS research to progress systematically in this area, it is thus fundamental to ask: what is the process of making an algorithmic solution work?
Research Setting and Methods
To answer this question, we studied 45 public data science notebooks containing algorithmic solutions developed to predict customer churn in a credit card dataset on a popular data science and machine learning platform K. Below we describe the research setting, as well as data collection and analysis methods.
Research Setting
K is a popular platform for data scientists and machine learning engineers where they can develop and improve their skills, as well as participate in corporate-sponsored competitions by addressing a variety of problems related to datasets published on the platform. K, part of Alphabet Inc, allows users to upload datasets, set specific tasks for them and create interactive Jupyter notebooks in which they can develop their algorithmic solutions. K was selected as a setting because of its public availability and openness in sharing notebooks, which allows unprecedented access to the inner workings of algorithmic solutions. Others have used K for research purposes as well (Dissanayake et al. 2015; Mangal and Kumar 2016).
The dataset we selected for this study is a well-regarded and popular dataset with high usability. It contains the details of around 10,000 credit card customers of a bank, whereby a portion of customers churned. The goal is to identify, based on 18 variables such as age, salary, credit card limit and similar, what makes a customer churn (give up a credit card) to be able to predict customers at risk of churning in the future, as well as to identify the variables that are most predictive of the risk of churn (“Kaggle.Com” 2021). When the dataset was investigated for the purposes of this research project, there were around 210 notebooks submitted that contained algorithmic solutions pertaining to this dataset, with constant daily activity in existing notebooks and new notebooks being added.
We selected an open and public dataset rather than a competition because the majority of notebooks submitted for competitions are private and thus visible only to sponsor companies, and competitions are usually very specific and limit the number of potential algorithmic solutions applied. In contrast, public notebooks allow good access to a variety of notebooks containing fairly unrestricted solutions and allow for much more experimentation on the part of users. From the many datasets available on K, we selected the credit card customers dataset because it is related to a common problem that many companies and businesses face, and it is a problem that is often tackled by developing algorithmic solutions. It is thus a good representative sample of what researchers in information systems and management would consider of interest.
Data collection
In January and February 2021, we collected 57 Jupyter notebooks that were created using the credit card customer dataset with Python as the set programming language. The notebooks were arranged from the ‘hottest’ (a measure used on K to define notebooks with the most activity, edits and highest votes by the community; “Kaggle.Com” 2021) to the least hot, and thus those that we collected were considered among the ‘hottest’ at the time. We decided to select the ‘hottest’ notebooks as these were assessed as high quality by the community, and thus were likely to contain well-developed algorithmic solutions. We discarded notebooks in R to eliminate differences in programming languages, and notebooks that contained only partial solutions, for example those that only analyzed data without building actual models. We ended up with 45 suitable notebooks. Using a feature available on K, we downloaded all of the selected notebooks and converted them to PDF documents to analyze them in NVivo.
Data analysis
Since our study is rooted in grounded theory (Charmaz 2006; Glaser and Strauss 1967; Urquhart and Fernandez 2006), we proceeded by inductively coding the notebooks to identify the different elements of code they contained by what these elements of code did. We coded each segment of code in each notebook to identify its function. Verbal descriptions of data scientists sometimes provided additional information as to the role of each code segment, so these were coded too. However, the descriptions were mostly useful in the second stage of data analysis, where we grouped the codes we obtained into higher-level elements of the process, as they explained the flow of the process. For example, in the notebooks data scientists would sometimes indicate they were proceeding to exploratory data analysis, and we used these comments to group elements of code identified under the element ‘Exploratory Data Analysis’.
Because of the inductive nature of our study, we oscillated between data analysis and further data collection. After coding the first 30 notebooks, we began to group the codes to start building the model. We then proceeded with coding and analyzing notebooks one by one to supplement and verify the model that was emerging from our analysis. When we reached notebook number 35, the subsequent 10 notebooks did not add any new codes to the codebook and at this point we decided to stop coding and analyzing the notebooks as we reached the point of saturation.
Unboxing the Algorithm
In this section, we present the elements of the process of making an algorithmic solution work that we identified in the data. Each element is discussed in turn by showing what kinds of operations were performed in every element.
Preparing the Environment
Notebooks begin with setting the environment in which the development of the algorithmic solution takes place: programming language, acceleration and connection to the internet. The notebooks we observed were all set up in a Python 3 environment, which “comes with many helpful analytics libraries installed” (Notebook 002) and allows writing up to 20GB to the working directory. Notebooks give the possibility to turn on an accelerator, such as a GPU, for faster processing, and to connect to the internet for access to external files. In some notebooks, data scientists use verbal comments to identify and restate the problem.
After this initial setup, various necessary libraries are imported, that is, pre-packaged functions designed for specific purposes that can be deployed by data scientists without the need to code such functions from scratch. Invariably, the notebooks featured “numpy” (Notebook 005), a Python library for linear algebra, and “pandas” (Notebook 007), which allows for data processing and, for example, reading in CSV files, among others. These two libraries are essential to develop the algorithmic solution. Other libraries imported include data visualization packages, such as “seaborn” or “matplotlib” (Notebook 029), which are fairly standard and popular libraries for this purpose. In some notebooks, all required packages are imported at the beginning of the notebook, including “sklearn” and “keras” (Notebook 014) that are used for building models, while other notebooks import additional libraries as and when needed. Libraries are imported with simple code: “import numpy as np” (Notebook 001), for example. Importing libraries is a standard procedure and there are no substantial comments regarding this step. A variety of widely used libraries exists for developing algorithmic solutions, and they encapsulate and abstract away the complexity behind such tasks as training a specific model, as explained below.
Reading in Data
The next element in the process in an algorithmic solution is to read in the required data. The first step here, quite logically, includes loading data in. Because the dataset that the notebooks use is uploaded to Kaggle, it can be attached to each notebook with a simple search within the interface, and then imported by executing a command from the “pandas” library, “read_csv” (Notebook 001).
Inspecting the data follows, usually through the function “head”, which displays the first five (by default) rows of the dataset and the corresponding columns with column headers, and sometimes the function “shape”, which displays the dimensions of the dataset (number of rows and number of columns), as well as the function “columns”, which gives the names of columns in the dataset. In just one notebook did we observe an explicit check for duplicate entries in the dataset. Commands to perform these functions are pre-packaged and take the forms of “df.head()”, “df.shape” or “df.columns” (Notebook 003). This stage of the process also involves checking the data types present in the dataset, performed by using the functions “info” or “dtypes”, which indicate which columns contain integer (whole numbers), float (numbers with decimal points) or object (text or mixed numeric and non-numeric values) data types. This is important as most algorithmic solutions work only with numerical values. As part of reading in data, simple descriptive statistics of the data are obtained through the function “describe”, which displays the number of rows, mean, standard deviation, minimum value, quartiles, and maximum value for each column.
Conducting these steps is essential to load the dataset and obtain basic information about the data needed to confirm that the data is loaded correctly, contains the expected columns and rows, and to gain initial familiarity with the dataset.
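The inspection steps described in this element can be sketched as follows. This is a minimal illustration rather than code taken from any notebook: the inline CSV text stands in for the attached dataset file, and every column name other than CLIENTNUM is a hypothetical placeholder.

```python
import io

import pandas as pd

# Hypothetical stand-in for the attached CSV file; only CLIENTNUM is a
# column name quoted in the notebooks, the others are illustrative.
CSV_TEXT = """CLIENTNUM,Customer_Age,Gender,Credit_Limit,Attrition_Flag
1001,45,M,12691.0,Existing Customer
1002,49,F,8256.0,Existing Customer
1003,51,M,3418.0,Attrited Customer
1004,40,F,3313.0,Existing Customer
"""

# Load the data; a notebook would pass the path of the attached file instead.
df = pd.read_csv(io.StringIO(CSV_TEXT))

print(df.head())         # first five rows by default (here: all four)
print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names
print(df.dtypes)         # data type of each column
print(df.describe())     # count, mean, std, min, quartiles, max per numeric column

# Explicit duplicate check, which we observed in only one notebook.
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)
```

Note that "describe" summarizes only numeric columns by default, which is one reason the data types are checked at this stage.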
Cleaning Data
After reading in the dataset, data is cleaned to prepare it for further processing. This is essential before any analysis can take place. Steps at this stage tend to be taken in various orders across the notebooks, and are reported here in no particular order.
Missing values are identified and dealt with: that is, NULL values in the dataset have to be resolved before any analysis can take place, as the majority of algorithms cannot deal with datasets containing missing values. Identification is done by using the function “isnull”, listing all columns with the number of missing values (Notebook 001). The customer churn dataset contained no null values, so in this case there was no need to deploy solutions to this problem. One of the ways to solve the problem that is presented in the notebooks is imputation, that is, replacing the missing or null value with an existing value from the dataset. In the solution proposed in the notebook this is done based on the nearest neighbor of the missing value, but since no missing values were detected, the solution is not implemented.
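To make the idea of nearest-neighbor imputation concrete, the sketch below counts missing values and then fills each gap with the value from the most similar complete row. It is a simplified, hand-written illustration under stated assumptions, not the notebook's code: the data and column names are hypothetical, the helper `impute_from_nearest_row` is invented for this example, and distances are computed on unscaled values. In practice a packaged routine such as scikit-learn's `KNNImputer` would more likely be used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing credit limit.
df = pd.DataFrame({
    "Customer_Age": [45.0, 46.0, 60.0, 61.0],
    "Credit_Limit": [3000.0, np.nan, 9000.0, 9500.0],
})

# Count missing values per column, as done with "isnull" in the notebooks.
missing_per_column = df.isnull().sum()
print(missing_per_column)


def impute_from_nearest_row(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill each missing cell with the value from the most similar complete row."""
    filled = frame.copy()
    complete = frame.dropna()
    incomplete = frame[frame.isnull().any(axis=1)]
    for idx, row in incomplete.iterrows():
        observed_cols = row.index[row.notna()]
        missing_cols = row.index[row.isna()]
        # Euclidean distance to every complete row, using observed columns only.
        diffs = complete[observed_cols] - row[observed_cols]
        distances = np.sqrt((diffs ** 2).sum(axis=1))
        nearest = complete.loc[distances.idxmin()]
        filled.loc[idx, missing_cols] = nearest[missing_cols].to_numpy()
    return filled


filled = impute_from_nearest_row(df)
print(filled)
```

Here the customer aged 46 is closest to the customer aged 45, so the missing credit limit is copied from that row.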
In the notebooks, we found that columns are sometimes renamed if their names are not intuitive enough or simply too long. Certain columns containing variables that are not needed for the analysis are removed. For example, the customer churn dataset contains two Naïve Bayes Classifier columns by default, and the author of the dataset suggests removing these columns before proceeding with analysis. At different points in data cleaning, exploratory data analysis or pre-processing the dataset, various columns are also removed if they are not contributing to the model (for example, removing the customer ID: “data = data.drop(columns=['CLIENTNUM'])”, Notebook 015). In one notebook, outliers are removed from the data using a common statistical measure, the z-score, which indicates how far from the mean a given data point is. This resulted in removing 810 rows from the dataset.
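The column removal and z-score filtering described above might look roughly like the following sketch. The toy values, the `Credit_Limit` column name and the cut-off of 3 standard deviations are assumptions made for illustration; the notebook's actual threshold is not reported in our data.

```python
import pandas as pd

# Hypothetical toy data: eleven typical credit limits and one extreme value.
data = pd.DataFrame({
    "CLIENTNUM": range(1, 13),
    "Credit_Limit": [3000.0, 3200.0, 2900.0, 3100.0, 3050.0, 2950.0,
                     3150.0, 3000.0, 3100.0, 2900.0, 3050.0, 50000.0],
})

# Drop the identifier column, which carries no predictive information.
data = data.drop(columns=["CLIENTNUM"])

# z-score: distance of each value from the column mean, in standard deviations.
column = data["Credit_Limit"]
z_scores = (column - column.mean()) / column.std()

# Keep rows within 3 standard deviations (assumed, commonly used cut-off).
cleaned = data[z_scores.abs() < 3]
print(len(data), len(cleaned))
```

Because the z-score depends on the mean and standard deviation of the column itself, the number of rows removed is sensitive to the chosen cut-off, which is one of the decisions left to the data scientist.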
All notebooks we studied transform data types as part of cleaning data. This step, sometimes referred to as feature engineering, is required when the dataset contains object data types, which are categorical variables typical in many datasets, such as marital status, level of education or gender. These data types have to be transformed into numerical variables in order to be analyzed. This is conducted by using pre-existing functions to encode these variables as integers (e.g. primary education as 1, secondary as 2, tertiary as 3), or by using the popular one-hot encoding, which creates dummy variables where there is no natural ordinal relationship between categories (e.g. male is 0, female is 1). Cleaned data is an essential element of any algorithmic solution, as without the steps taken in this element, data either results in erroneous analysis and model training, or simply cannot be used to train models.
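The two encoding strategies can be illustrated with a short sketch. The column names and category labels are hypothetical; the ordinal mapping is written by hand here, while the one-hot step uses the pandas function `get_dummies`.

```python
import pandas as pd

# Hypothetical categorical columns of the "object" type.
df = pd.DataFrame({
    "Education_Level": ["primary", "tertiary", "secondary", "primary"],
    "Marital_Status": ["Single", "Married", "Divorced", "Married"],
})

# Ordinal encoding: categories with a natural order become ranked integers.
education_order = {"primary": 1, "secondary": 2, "tertiary": 3}
df["Education_Level"] = df["Education_Level"].map(education_order)

# One-hot encoding: categories without a natural order become 0/1 dummy columns.
df = pd.get_dummies(df, columns=["Marital_Status"], dtype=int)
print(df)
```

The choice between the two matters: encoding marital status as 1, 2, 3 would suggest an ordering to the model that does not exist in the data.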
Exploratory Data Analysis
The next step in the algorithmic solution process is exploratory data analysis, whereby actions are taken to learn about the relationship between the dependent variable of interest (here: customer churn or attrition) and independent variables that may help build the predictive model. This step is essential to uncover what model will be the most appropriate for the dataset and which variables can be potentially of interest.
The first step is to identify the dependent variable (a trivial matter in the given dataset), and to analyze independent variables. This is very often performed by visualizing them independently, in relation to each other, or in relation to the dependent variable. In most cases, such visualizations were implemented using functions from visualization libraries, such as “seaborn”, “matplotlib” or, rarely, “plotly”. Visualizing data is the part that takes up the most code in nearly all notebooks we analyzed. Various visualizations are produced, such as box plots, pie charts and histograms, in order to help identify which variables may be useful in building the model. Visualizations are often accompanied by comments such as “Females are slightly more likely to churn with 17% compared to males with 15%, we’ll convert this feature to 1-0” (Notebook 013). Some notebooks contain more comprehensive comments on the learnings from visualizations.
The next step in exploratory data analysis is to identify correlations between variables. This is an important step, as it informs decisions about which features to include when pre-processing the dataset for model building, as described below. For example, based on the identification of correlations, Notebook 022 decides to “#Drop some features which have less than 0.01 correlation and greater than -0.01 correlation”. Exploratory data analysis is a required step of building an algorithmic solution as it provides the necessary insight into the dataset for the purposes of model building. It is at this stage that the importance of variables with respect to the target variable is assessed.
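A correlation-based filter of the kind quoted from Notebook 022 could be sketched as follows. The toy data and column names are hypothetical, and the use of Pearson correlation (the pandas default) is an assumption, as the notebook comment does not name the measure.

```python
import pandas as pd

# Hypothetical toy data: one informative feature and one uninformative one.
df = pd.DataFrame({
    "churned":      [0, 0, 0, 0, 1, 1, 1, 1],
    "transactions": [60, 55, 70, 65, 30, 25, 35, 20],
    "noise":        [1, 2, 1, 2, 1, 2, 1, 2],
})

# Correlation of every feature with the target variable.
correlations = df.corr()["churned"].drop("churned")
print(correlations)

# Drop features whose correlation with the target lies between -0.01 and 0.01.
weak = correlations[correlations.abs() < 0.01].index.tolist()
reduced = df.drop(columns=weak)
print(weak)
```

A filter like this only captures linear association with the target, so a feature discarded here could still be useful to a non-linear model.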
Pre-processing the Dataset
The following step in the process is to pre-process the dataset, which involves preparing the dataset according to the requirements of model building. First, data need to be scaled, which may involve actual scaling, that is, changing the range of variables to a common range, e.g. between 0 and 1, or normalizing the variables following a normal distribution. Scaling is performed to ensure that no variable is interpreted as more predictive than it actually is just because its numerical values are on a different scale from other variables. Scaling is routinely performed using standard pre-packaged functions, such as “StandardScaler” from the popular “sklearn” library (Notebook 026).
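Both forms of scaling can be written out by hand to show what the pre-packaged functions compute. The sketch below uses plain pandas arithmetic on hypothetical columns instead of calling “StandardScaler”; the standardization uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import pandas as pd

# Hypothetical features on very different numeric scales.
df = pd.DataFrame({
    "Customer_Age": [30.0, 40.0, 50.0, 60.0],
    "Credit_Limit": [2000.0, 4000.0, 6000.0, 8000.0],
})

# Min-max scaling: map each column onto the common range 0 to 1.
min_max = (df - df.min()) / (df.max() - df.min())

# Standardization: zero mean and unit variance per column.
standardized = (df - df.mean()) / df.std(ddof=0)

print(min_max)
print(standardized)
```

After either transformation the two columns carry identical values in this toy example, which shows that the original difference between them was purely one of scale.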
The dataset should be resampled if it is not balanced, that is, if one category is present much more frequently than another. In the case of the dataset investigated, customers who attrited occurred much less frequently, as identified in exploratory data analysis, so resampling was required. This is usually done by oversampling from the group of attrited customers, most frequently using a pre-packaged function ‘SMOTE’ (Synthetic Minority Oversampling Technique) which creates additional data points