版權(quán)說(shuō)明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)
文檔簡(jiǎn)介
_
MachineLearningForAbsoluteBeginners
OliverTheobald
SecondEdition
Copyright?2017byOliverTheobald
Allrightsreserved.Nopartofthispublicationmaybereproduced,distributed,ortransmittedinanyformorbyanymeans,includingphotocopying,recording,orotherelectronicormechanicalmethods,withoutthepriorwrittenpermissionofthepublisher,exceptinthecaseofbriefquotationsembodiedincriticalreviewsandcertainothernon-commercialusespermittedbycopyrightlaw.
Contents
INTRODUCTION
WHATISMACHINELEARNING?MLCATEGORIES
THEMLTOOLBOXDATASCRUBBING
SETTINGUPYOURDATAREGRESSIONANALYSISCLUSTERING
BIAS&VARIANCE
ARTIFICIALNEURALNETWORKSDECISIONTREES
ENSEMBLEMODELINGBUILDINGAMODELINPYTHONMODELOPTIMIZATIONFURTHERRESOURCESDOWNLOADINGDATASETSFINALWORD
INTRODUCTION
MachineshavecomealongwaysincetheIndustrialRevolution.Theycontinuetofillfactoryfloorsandmanufacturingplants,butnowtheircapabilitiesextendbeyondmanualactivitiestocognitivetasksthat,untilrecently,onlyhumanswerecapableofperforming.Judgingsongcompetitions,drivingautomobiles,andmoppingthefloorwithprofessionalchessplayersarethreeexamplesofthespecificcomplextasksmachinesarenowcapableofsimulating.
Buttheirremarkablefeatstriggerfearamongsomeobservers.Partofthisfearnestlesontheneckofsurvivalistinsecurities,whereitprovokesthedeep-seatedquestionofwhatif?Whatifintelligentmachinesturnonusinastruggleofthefittest?Whatifintelligentmachinesproduceoffspringwithcapabilitiesthathumansneverintendedtoimparttomachines?Whatifthelegendofthesingularityistrue?
Theothernotablefearisthethreattojobsecurity,andifyou’reatruckdriveroranaccountant,thereisavalidreasontobeworried.AccordingtotheBritishBroadcastingCompany’s(BBC)interactiveonlineresourceWillarobottakemyjob?,professionssuchasbarworker(77%),waiter(90%),charteredaccountant(95%),receptionist(96%),andtaxidriver(57%)eachhaveahighchanceofbecomingautomatedbytheyear2035.
[1]
Butresearchonplannedjobautomationandcrystalballgazingwithrespecttothefutureevolutionofmachinesandartificialintelligence(AI)shouldbereadwithapinchofskepticism.AItechnologyismovingfast,butbroadadoptionisstillanuncharteredpathfraughtwithknownandunforeseenchallenges.Delaysandotherobstaclesareinevitable.
NorismachinelearningasimplecaseofflickingaswitchandaskingthemachinetopredicttheoutcomeoftheSuperBowlandserveyouadeliciousmartini.Machinelearningisfarfromwhatyouwouldcallanout-of-the-boxsolution.
Machinesoperatebasedonstatisticalalgorithmsmanagedandoverseenbyskilledindividuals—knownasdatascientistsandmachinelearningengineers.Thisisonelabormarketwherejobopportunitiesaredestinedfor
growthbutwhere,currently,supplyisstrugglingtomeetdemand.IndustryexpertslamentthatoneofthebiggestobstaclesdelayingtheprogressofAIistheinadequatesupplyofprofessionalswiththenecessaryexpertiseandtraining.
AccordingtoCharlesGreen,theDirectorofThoughtLeadershipatBelatrixSoftware:
“It’sahugechallengetofinddatascientists,peoplewithmachinelearningexperience,orpeoplewiththeskillstoanalyzeandusethedata,aswellasthosewhocancreatethealgorithmsrequiredformachinelearning.Secondly,whilethetechnologyisstillemerging,therearemanyongoingdevelopments.It’sclearthatAIisalongwayfromhowwemightimagineit.”
[2]
Perhapsyourownpathtobecominganexpertinthefieldofmachinelearningstartshere,ormaybeabaselineunderstandingissufficienttosatisfyyourcuriosityfornow.Inanycase,let’sproceedwiththeassumptionthatyouarereceptivetotheideaoftrainingtobecomeasuccessfuldatascientistormachinelearningengineer.
Tobuildandprogramintelligentmachines,youmustfirstunderstandclassicalstatistics.Algorithmsderivedfromclassicalstatisticscontributethemetaphoricalbloodcellsandoxygenthatpowermachinelearning.Layeruponlayeroflinearregression,k-nearestneighbors,andrandomforestssurgethroughthemachineanddrivetheircognitiveabilities.Classicalstatisticsisattheheartofmachinelearningandmanyofthesealgorithmsarebasedonthesamestatisticalequationsyoustudiedinhighschool.Indeed,statisticalalgorithmswereconductedonpaperwellbeforemachinesevertookonthetitleofartificialintelligence.
Computerprogrammingisanotherindispensablepartofmachinelearning.Thereisn’taclick-and-dragorWeb2.0solutiontoperformadvancedmachinelearninginthewayonecanconvenientlybuildawebsitenowadayswithWordPressorStrikingly.Programmingskillsarethereforevitaltomanagedataanddesignstatisticalmodelsthatrunonmachines.
Somestudentsofmachinelearningwillhaveyearsofprogrammingexperiencebuthaven’ttouchedclassicalstatisticssincehighschool.Others,perhaps,neverevenattemptedstatisticsintheirhighschoolyears.Butnottoworry,manyofthemachinelearningalgorithmswediscussinthisbookhaveworkingimplementationsinyourprogramminglanguageofchoice;noequationwritingnecessary.Youcanusecodetoexecutetheactualnumber
crunchingforyou.
Ifyouhavenotlearnedtocodebefore,youwillneedtoifyouwishtomakefurtherprogressinthisfield.Butforthepurposeofthiscompactstarter’scourse,thecurriculumcanbecompletedwithoutanybackgroundincomputerprogramming.Thisbookfocusesonthehigh-levelfundamentalsofmachinelearningaswellasthemathematicalandstatisticalunderpinningsofdesigningmachinelearningmodels.
Forthosewhodowishtolookattheprogrammingaspectofmachinelearning,Chapter13walksyouthroughtheentireprocessofsettingupasupervisedlearningmodelusingthepopularprogramminglanguagePython.
WHATISMACHINELEARNING?
In1959,IBMpublishedapaperintheIBMJournalofResearchandDevelopmentwithan,atthetime,obscureandcurioustitle.AuthoredbyIBM’sArthurSamuel,thepaperinvestedtheuseofmachinelearninginthegameofcheckers“toverifythefactthatacomputercanbeprogrammedsothatitwilllearntoplayabettergameofcheckersthancanbeplayedbythepersonwhowrotetheprogram.”
[3]
Althoughitwasnotthefirstpublicationtousetheterm“machinelearning”perse,ArthurSamueliswidelyconsideredasthefirstpersontocoinanddefinemachinelearningintheformwenowknowtoday.Samuel’slandmarkjournalsubmission,SomeStudiesinMachineLearningUsingtheGameofCheckers,isalsoanearlyindicationofhomosapiens’determinationtoimpartourownsystemoflearningtoman-mademachines.
Figure1:Historicalmentionsof“machinelearning”inpublishedbooks.Source:GoogleNgramViewer,2017
ArthurSamuelintroducesmachinelearninginhispaperasasubfieldofcomputersciencethatgivescomputerstheabilitytolearnwithoutbeingexplicitlyprogrammed.
[4]
Almostsixdecadeslater,thisdefinitionremainswidelyaccepted.
AlthoughnotdirectlymentionedinArthurSamuel’sdefinition,akeyfeatureofmachinelearningistheconceptofself-learning.Thisreferstotheapplicationofstatisticalmodelingtodetectpatternsandimprove
performancebasedondataandempiricalinformation;allwithoutdirectprogrammingcommands.ThisiswhatArthurSamueldescribedastheabilitytolearnwithoutbeingexplicitlyprogrammed.Buthedoesn’tinferthatmachinesformulatedecisionswithnoupfrontprogramming.Onthecontrary,machinelearningisheavilydependentoncomputerprogramming.Instead,Samuelobservedthatmachinesdon’trequireadirectinputcommandtoperformasettaskbutratherinputdata.
Figure2:ComparisonofInputCommandvsInputData
Anexampleofaninputcommandistyping“2+2”intoaprogramminglanguagesuchasPythonandhitting“Enter.”
>>>2+2
4
>>>
Thisrepresentsadirectcommandwithadirectanswer.
Inputdata,however,isdifferent.Dataisfedtothemachine,analgorithmisselected,hyperparameters(settings)areconfiguredandadjusted,andthemachineisinstructedtoconductitsanalysis.Themachineproceedstodecipherpatternsfoundinthedatathroughtheprocessoftrialanderror.Themachine’sdatamodel,formedfromanalyzingdatapatterns,canthenbeusedtopredictfuturevalues.
Althoughthereisarelationshipbetweentheprogrammerandthemachine,theyoperatealayerapartincomparisontotraditionalcomputerprogramming.Thisisbecausethemachineisformulatingdecisionsbasedonexperienceandmimickingtheprocessofhuman-baseddecision-making.
Asanexample,let’ssaythatafterexaminingtheYouTubeviewinghabitsofdatascientistsyourmachineidentifiesastrongrelationshipbetweendata
scientistsandcatvideos.Later,yourmachineidentifiespatternsamongthephysicaltraitsofbaseballplayersandtheirlikelihoodofwinningtheseason’sMostValuablePlayer(MVP)award.Inthefirstscenario,themachineanalyzedwhatvideosdatascientistsenjoywatchingonYouTubebasedonuserengagement;measuredinlikes,subscribes,andrepeatviewing.Inthesecondscenario,themachineassessedthephysicalfeaturesofpreviousbaseballMVPsamongvariousotherfeaturessuchasageandeducation.However,inneitherofthesetwoscenarioswasyourmachineexplicitlyprogrammedtoproduceadirectoutcome.Youfedtheinputdataandconfiguredthenominatedalgorithms,butthefinalpredictionwasdeterminedbythemachinethroughself-learninganddatamodeling.
Youcanthinkofbuildingadatamodelassimilartotrainingaguidedog.Throughspecializedtraining,guidedogslearnhowtorespondinvarioussituations.Forexample,thedogwilllearntoheelataredlightortosafelyleaditsmasteraroundobstacles.Ifthedoghasbeenproperlytrained,then,eventually,thetrainerwillnolongerberequired;theguidedogwillbeabletoapplyitstraininginvariousunsupervisedsituations.Similarly,machinelearningmodelscanbetrainedtoformdecisionsbasedonpastexperience.
Asimpleexampleiscreatingamodelthatdetectsspamemailmessages.Themodelistrainedtoblockemailswithsuspicioussubjectlinesandbodytextcontainingthreeormoreflaggedkeywords:dearfriend,free,invoice,PayPal,Viagra,casino,payment,bankruptcy,andwinner.Atthisstage,though,wearenotyetperformingmachinelearning.Ifwerecallthevisualrepresentationofinputcommandvsinputdata,wecanseethatthisprocessconsistsofonlytwosteps:Command>Action.
Machinelearningentailsathree-stepprocess:Data>Model>Action.
Thus,toincorporatemachinelearningintoourspamdetectionsystem,weneedtoswitchout“command”for“data”andadd“model”inordertoproduceanaction(output).Inthisexample,thedatacomprisessampleemailsandthemodelconsistsofstatistical-basedrules.Theparametersofthemodelincludethesamekeywordsfromouroriginalnegativelist.Themodelisthentrainedandtestedagainstthedata.
Oncethedataisfedintothemodel,thereisastrongchancethatassumptionscontainedinthemodelwillleadtosomeinaccuratepredictions.Forexample,undertherulesofthismodel,thefollowingemailsubjectlinewouldautomaticallybeclassifiedasspam:“PayPalhasreceivedyourpaymentforCasinoRoyalepurchasedoneBay.”
AsthisisagenuineemailsentfromaPayPalauto-responder,thespamdetectionsystemisluredintoproducingafalsepositivebasedonthenegativelistofkeywordscontainedinthemodel.Traditionalprogrammingishighlysusceptibletosuchcasesbecausethereisnobuilt-inmechanismtotestassumptionsandmodifytherulesofthemodel.Machinelearning,ontheotherhand,canadaptandmodifyassumptionsthroughitsthree-stepprocessandbyreactingtoerrors.
Training&TestData
Inmachinelearning,dataissplitintotrainingdataandtestdata.Thefirstsplitofdata,i.e.theinitialreserveofdatayouusetodevelopyourmodel,providesthetrainingdata.Inthespamemaildetectionexample,falsepositivessimilartothePayPalauto-responsemightbedetectedfromthetrainingdata.Newrulesormodificationsmustthenbeadded,e.g.,emailnotificationsissuedfromthesendingaddress“
payments@
”shouldbeexcludedfromspamfiltering.
Afteryouhavesuccessfullydevelopedamodelbasedonthetrainingdataandaresatisfiedwithitsaccuracy,youcanthentestthemodelontheremainingdata,knownasthetestdata.Onceyouaresatisfiedwiththeresultsofboththetrainingdataandtestdata,themachinelearningmodelisreadytofilterincomingemailsandgeneratedecisionsonhowtocategorizethoseincomingmessages.
Thedifferencebetweenmachinelearningandtraditionalprogrammingmayseemtrivialatfirst,butitwillbecomeclearasyourunthroughfurtherexamplesandwitnessthespecialpowerofself-learninginmorenuancedsituations.
Thesecondimportantpointtotakeawayfromthischapterishowmachinelearningfitsintothebroaderlandscapeofdatascienceandcomputerscience.Thismeansunderstandinghowmachinelearninginterrelateswithparentfieldsandsisterdisciplines.Thisisimportant,asyouwillencountertheserelatedtermswhensearchingforrelevantstudymaterials—andyouwillhearthemmentionedadnauseaminintroductorymachinelearningcourses.Relevantdisciplinescanalsobedifficulttotellapartatfirstglance,suchas“machinelearning”and“datamining.”
Let’sbeginwithahigh-levelintroduction.Machinelearning,datamining,computerprogramming,andmostrelevantfields(excludingclassical
statistics)derivefirstfromcomputerscience,whichencompasseseverythingrelatedtothedesignanduseofcomputers.Withintheall-encompassingspaceofcomputerscienceisthenextbroadfield:datascience.Narrowerthancomputerscience,datasciencecomprisesmethodsandsystemstoextractknowledgeandinsightsfromdatathroughtheuseofcomputers.
Figure3:ThelineageofmachinelearningrepresentedbyarowofRussianmatryoshkadolls
Poppingoutfromcomputerscienceanddatascienceasthethirdmatryoshkadollisartificialintelligence.Artificialintelligence,orAI,encompassestheabilityofmachinestoperformintelligentandcognitivetasks.ComparabletothewaytheIndustrialRevolutiongavebirthtoaneraofmachinesthatcouldsimulatephysicaltasks,AIisdrivingthedevelopmentofmachinescapableofsimulatingcognitiveabilities.
Whilestillbroadbutdramaticallymorehonedthancomputerscienceanddatascience,AIcontainsnumeroussubfieldsthatarepopulartoday.Thesesubfieldsincludesearchandplanning,reasoningandknowledgerepresentation,perception,naturallanguageprocessing(NLP),andofcourse,machinelearning.MachinelearningbleedsintootherfieldsofAI,includingNLPandperceptionthroughtheshareduseofself-learningalgorithms.
Figure4:Visualrepresentationoftherelationshipbetweendata-relatedfields
ForstudentswithaninterestinAI,machinelearningprovidesanexcellentstartingpointinthatitoffersamorenarrowandpracticallensofstudycomparedtotheconceptualambiguityofAI.Algorithmsfoundinmachinelearningcanalsobeappliedacrossotherdisciplines,includingperceptionandnaturallanguageprocessing.Inaddition,aMaster’sdegreeisadequatetodevelopacertainlevelofexpertiseinmachinelearning,butyoumayneedaPhDtomakeanytrueprogressinAI.
Asmentioned,machinelearningalsooverlapswithdatamining—asisterdisciplinethatfocusesondiscoveringandunearthingpatternsinlargedatasets.Popularalgorithms,suchask-meansclustering,associationanalysis,andregressionanalysis,areappliedinbothdataminingandmachinelearningtoanalyzedata.Butwheremachinelearningfocusesontheincrementalprocessofself-learninganddatamodelingtoformpredictionsaboutthefuture,dataminingnarrowsinoncleaninglargedatasetstogleanvaluableinsightfromthepast.
Thedifferencebetweendataminingandmachinelearningcanbeexplainedthroughananalogyoftwoteamsofarchaeologists.Thefirstteamismadeupofarchaeologistswhofocustheireffortsonremovingdebristhatliesinthewayofvaluableitems,hidingthemfromdirectsight.Theirprimarygoalsaretoexcavatethearea,findnewvaluablediscoveries,andthenpackuptheirequipmentandmoveon.Adaylater,theywillflytoanotherexoticdestinationtostartanewprojectwithnorelationshiptothesitethey
excavatedthedaybefore.
Thesecondteamisalsointhebusinessofexcavatinghistoricalsites,butthesearchaeologistsuseadifferentmethodology.Theydeliberatelyreframefromexcavatingthemainpitforseveralweeks.Inthattime,theyvisitotherrelevantarchaeologicalsitesintheareaandexaminehoweachsitewasexcavated.Afterreturningtothesiteoftheirownproject,theyapplythisknowledgetoexcavatesmallerpitssurroundingthemainpit.
Thearchaeologiststhenanalyzetheresults.Afterreflectingontheirexperienceexcavatingonepit,theyoptimizetheireffortstoexcavatethenext.Thisincludespredictingtheamountoftimeittakestoexcavateapit,understandingvarianceandpatternsfoundinthelocalterrainanddevelopingnewstrategiestoreduceerrorandimprovetheaccuracyoftheirwork.Fromthisexperience,theyareabletooptimizetheirapproachtoformastrategicmodeltoexcavatethemainpit.
Ifitisnotalreadyclear,thefirstteamsubscribestodataminingandthesecondteamtomachinelearning.Atamicro-level,bothdataminingandmachinelearningappearsimilar,andtheydousemanyofthesametools.Bothteamsmakealivingexcavatinghistoricalsitestodiscovervaluableitems.Butinpractice,theirmethodologyisdifferent.Themachinelearningteamfocusesondividingtheirdatasetintotrainingdataandtestdatatocreateamodel,andimprovingfuturepredictionsbasedonpreviousexperience.Meanwhile,thedataminingteamconcentratesonexcavatingthetargetareaaseffectivelyaspossible—withouttheuseofaself-learningmodel—beforemovingontothenextcleanupjob.
MLCATEGORIES
Machinelearningincorporatesseveralhundredstatistical-basedalgorithmsandchoosingtherightalgorithmorcombinationofalgorithmsforthejobisaconstantchallengeforanyoneworkinginthisfield.Butbeforeweexaminespecificalgorithms,itisimportanttounderstandthethreeoverarchingcategoriesofmachinelearning.Thesethreecategoriesaresupervised,unsupervised,andreinforcement.
SupervisedLearning
Asthefirstbranchofmachinelearning,supervisedlearningconcentratesonlearningpatternsthroughconnectingtherelationshipbetweenvariablesandknownoutcomesandworkingwithlabeleddatasets.
Supervisedlearningworksbyfeedingthemachinesampledatawithvariousfeatures(representedas“X”)andthecorrectvalueoutputofthedata(representedas“y”).Thefactthattheoutputandfeaturevaluesareknownqualifiesthedatasetas“l(fā)abeled.”Thealgorithmthendecipherspatternsthatexistinthedataandcreatesamodelthatcanreproducethesameunderlyingruleswithnewdata.
Forinstance,topredictthemarketrateforthepurchaseofausedcar,asupervisedalgorithmcanformulatepredictionsbyanalyzingtherelationshipbetweencarattributes(includingtheyearofmake,carbrand,mileage,etc.)andthesellingpriceofothercarssoldbasedonhistoricaldata.Giventhatthesupervisedalgorithmknowsthefinalpriceofothercardssold,itcanthenworkbackwardtodeterminetherelationshipbetweenthecharacteristicsofthecaranditsvalue.
Figure1:Carvaluepredictionmodel
Afterthemachinedecipherstherulesandpatternsofthedata,itcreateswhatisknownasamodel:analgorithmicequationforproducinganoutcomewithnewdatabasedontherulesderivedfromthetrainingdata.Oncethemodelisprepared,itcanbeappliedtonewdataandtestedforaccuracy.Afterthemodelhaspassedboththetrainingandtestdatastages,itisreadytobeappliedandusedintherealworld.
InChapter13,wewillcreateamodelforpredictinghousevalueswhereyistheactualhousepriceandXarethevariablesthatimpacty,suchaslandsize,location,andthenumberofrooms.Throughsupervisedlearning,wewillcreatearuletopredicty(housevalue)basedonthegivenvaluesofvariousvariables(X).
Examplesofsupervisedlearningalgorithmsincluderegressionanalysis,decisiontrees,k-nearestneighbors,neuralnetworks,andsupportvectormachines.Eachofthesetechniqueswillbeintroducedlaterinthebook.
UnsupervisedLearning
Inthecaseofunsupervisedlearning,notallvariablesanddatapatternsareclassified.Instead,themachinemustuncoverhiddenpatternsandcreatelabelsthroughtheuseofunsupervisedlearningalgorithms.Thek-meansclusteringalgorithmisapopularexampleofunsupervisedlearning.ThissimplealgorithmgroupsdatapointsthatarefoundtopossesssimilarfeaturesasshowninFigure1.
Figure1:Exampleofk-meansclustering,apopularunsupervisedlearningtechnique
IfyougroupdatapointsbasedonthepurchasingbehaviorofSME(SmallandMedium-sizedEnterprises)andlargeenterprisecustomers,forexample,youarelikelytoseetwoclustersemerge.ThisisbecauseSMEsandlargeenterprisestendtohavedisparatebuyinghabits.Whenitcomestopurchasingcloudinfrastructure,forinstance,basiccloudhostingresourcesandaContentDeliveryNetwork(CDN)mayprovesufficientformostSMEcustomers.Largeenterprisecustomers,though,aremorelikelytopurchaseawiderarrayofcloudproductsandentiresolutionsthatincludeadvancedsecurityandnetworkingproductslikeWAF(WebApplicationFirewall),adedicatedprivateconnection,andVPC(VirtualPrivateCloud).Byanalyzingcustomerpurchasinghabits,unsupervisedlearningiscapableofidentifyingthesetwogroupsofcustomerswithoutspecificlabelsthatclassifythecompanyassmall,mediumorlarge.
Theadvantageofunsupervisedlearningisitenablesyoutodiscoverpatternsinthedatathatyouwereunawareexisted—suchasthepresenceoftwomajorcustomertypes.Clusteringtechniquessuchask-meansclusteringcanalsoprovidethespringboardforconductingfurtheranalysisafterdiscretegroupshavebeendiscovered.
Inindustry,unsupervisedlearningisparticularlypowerfulinfrauddetection
—wherethemostdangerousattacksareoftenthoseyettobeclassified.Onereal-worldexampleisDataVisor,whoessentiallybuilttheirbusinessmodelbasedonunsupervisedlearning.
Foundedin2013inCalifornia,DataVisorprotectscustomersfromfraudulent
onlineactivities,includingspam,fakereviews,fakeappinstalls,andfraudulenttransactions.Whereastraditionalfraudprotectionservicesdrawonsupervisedlearningmodelsandruleengines,DataVisorusesunsupervisedlearningwhichenablesthemtodetectunclassifiedcategoriesofattacksintheirearlystages.
Ontheirwebsite,DataVisorexplainsthat"todetectattacks,existingsolutionsrelyonhumanexperiencetocreaterulesorlabeledtrainingdatatotunemodels.Thismeanstheyareunabletodetectnewattacksthathaven’talreadybeenidentifiedbyhumansorlabeledintrainingdata."
[5]
Thismeansthattraditionalsolutionsanalyzethechainofactivityforaparticularattackandthencreaterulestopredictarepeatattack.Underthisscenario,thedependentvariable(y)istheeventofanattackandtheindependentvariables(X)arethecommonpredictorvariablesofanattack.Examplesofindependentvariablescouldbe:
Asuddenlargeorderfromanunknownuser.I.E.establishedcustomersgenerallyspendlessthan$100perorder,butanewuserspends$8,000inoneorderimmediatelyuponregisteringtheiraccount.
Asuddensurgeofuserratings.I.E.AsatypicalauthorandbookselleronA,it’suncommonformyfirstpublishedworktoreceivemorethanonebookreviewwithinthespaceofonetotwodays.Ingeneral,approximately1in200Amazonreadersleaveabookreviewandmostbooksgoweeksormonthswithoutareview.However,Icommonlyseecompetitorsinthiscategory(datascience)attracting20-50reviewsinoneday!(Unsurprisingly,IalsoseeAmazonremovingthesesuspiciousreviewsweeksormonthslater.)
Identicalorsimilaruserreviewsfromdifferentusers.FollowingthesameAmazonanalogy,Ioftenseeuserreviewsofmybookappearonotherbooksseveralmonthslater(sometimeswithareferencetomynameastheauthorstillincludedinthereview!).Again,Amazoneventuallyremovesthesefakereviewsandsuspendstheseaccountsforbreakingtheirtermsofservice.
Suspiciousshippingaddress.I.E.Forsmallbusinessesthatroutinelyshipproductstolocalcustomers,anorderfromadistantlocation(wheretheydon'tadvertisetheirproducts)caninrarecasesbeanindicatoroffraudulentormaliciousactivity.
Standaloneactivitiessuchasasuddenlargeorderoradistantshippingaddressmayprovetoolittleinformationtopredictsophisticated
cybercriminalactivityandmorelikelytoleadtomanyfalsepositives.Butamodelthatmonitorscombinationsofindependentvariables,suchasasuddenlargepurchaseorderfromtheothersideoftheglobeoralandslideofbookreviewsthatreuseexistingcontentwillgenerallyleadtomoreaccuratepredictions.Asupervisedlearning-basedmodelcoulddeconstructandclassifywhatthesecommonindependentvariablesareanddesignadetectionsystemtoidentifyandpreventrepeatoffenses.
Sophisticatedcybercriminals,though,learntoevadeclassification-basedruleenginesbymodifyingtheirtactics.Inaddition,leadinguptoanattack,attackersoftenregisterandoperatesingleormultipleaccountsandincubatetheseaccountswithactivitiesthatmimiclegitimateusers.Theythenutilizetheirestablishedaccounthistorytoevadedetectionsystems,whicharetrigger-heavyagainstrecentlyregisteredaccounts.Supervisedlearning-basedsolutionsstruggletodetectsleepercellsuntiltheactualdamagehasbeenmadeandespeciallywithregardtonewcategoriesofattacks.
DataVisorandotheranti-fraudsolutionprovidersthereforeleverageunsupervisedlearningtoaddressthelimitationsofsupervisedlearningbyanalyzingpatternsacrosshundredsofmillionsofaccountsandidentifyi
溫馨提示
- 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。
最新文檔
- 2024版房屋互換的協(xié)議書(shū)范本
- 籃球單手肩上投籃 說(shuō)課稿-2023-2024學(xué)年高一上學(xué)期體育與健康人教版必修第一冊(cè)
- 第二章 問(wèn)題研究:何時(shí)“藍(lán)天”常在說(shuō)課稿2023-2024學(xué)年高中地理人教版(2019)必修一001
- 2024年電動(dòng)自行車(chē)租賃協(xié)議3篇
- 教科版高中信息技術(shù)必修說(shuō)課稿-7.2.2 個(gè)人數(shù)字化信息資源管理實(shí)例
- 2024版游泳池租賃承包協(xié)議樣本版
- 《綠色簡(jiǎn)約團(tuán)隊(duì)介紹》課件
- 2024年項(xiàng)目合作協(xié)議詳細(xì)條款
- 《歷史街區(qū)的保護(hù)》課件
- 2025年滬科新版高二語(yǔ)文下冊(cè)階段測(cè)試試卷
- 上海市徐匯區(qū)上海小學(xué)二年級(jí)上冊(cè)語(yǔ)文期末考試試卷及答案
- 精密制造行業(yè)研究分析
- 心源性暈厥護(hù)理查房課件
- 2022-2023學(xué)年浙江省杭州市蕭山區(qū)五年級(jí)(上)期末科學(xué)試卷(蘇教版)
- 船舶輔機(jī):噴射泵
- 巖土工程勘察服務(wù)投標(biāo)方案(技術(shù)方案)
- 疼痛護(hù)理課件
- 副院長(zhǎng)兼總工程師的崗位說(shuō)明書(shū)
- 農(nóng)民專(zhuān)業(yè)合作社章程參考
- 財(cái)務(wù)會(huì)計(jì)制度及核算軟件備案報(bào)告書(shū)
- 肌骨超聲簡(jiǎn)介
評(píng)論
0/150
提交評(píng)論