




版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進行舉報或認領(lǐng)
文檔簡介
_
MachineLearningForAbsoluteBeginners
OliverTheobald
SecondEdition
Copyright?2017byOliverTheobald
Allrightsreserved.Nopartofthispublicationmaybereproduced,distributed,ortransmittedinanyformorbyanymeans,includingphotocopying,recording,orotherelectronicormechanicalmethods,withoutthepriorwrittenpermissionofthepublisher,exceptinthecaseofbriefquotationsembodiedincriticalreviewsandcertainothernon-commercialusespermittedbycopyrightlaw.
Contents
INTRODUCTION
WHATISMACHINELEARNING?MLCATEGORIES
THEMLTOOLBOXDATASCRUBBING
SETTINGUPYOURDATAREGRESSIONANALYSISCLUSTERING
BIAS&VARIANCE
ARTIFICIALNEURALNETWORKSDECISIONTREES
ENSEMBLEMODELINGBUILDINGAMODELINPYTHONMODELOPTIMIZATIONFURTHERRESOURCESDOWNLOADINGDATASETSFINALWORD
INTRODUCTION
MachineshavecomealongwaysincetheIndustrialRevolution.Theycontinuetofillfactoryfloorsandmanufacturingplants,butnowtheircapabilitiesextendbeyondmanualactivitiestocognitivetasksthat,untilrecently,onlyhumanswerecapableofperforming.Judgingsongcompetitions,drivingautomobiles,andmoppingthefloorwithprofessionalchessplayersarethreeexamplesofthespecificcomplextasksmachinesarenowcapableofsimulating.
Buttheirremarkablefeatstriggerfearamongsomeobservers.Partofthisfearnestlesontheneckofsurvivalistinsecurities,whereitprovokesthedeep-seatedquestionofwhatif?Whatifintelligentmachinesturnonusinastruggleofthefittest?Whatifintelligentmachinesproduceoffspringwithcapabilitiesthathumansneverintendedtoimparttomachines?Whatifthelegendofthesingularityistrue?
Theothernotablefearisthethreattojobsecurity,andifyou’reatruckdriveroranaccountant,thereisavalidreasontobeworried.AccordingtotheBritishBroadcastingCompany’s(BBC)interactiveonlineresourceWillarobottakemyjob?,professionssuchasbarworker(77%),waiter(90%),charteredaccountant(95%),receptionist(96%),andtaxidriver(57%)eachhaveahighchanceofbecomingautomatedbytheyear2035.
[1]
Butresearchonplannedjobautomationandcrystalballgazingwithrespecttothefutureevolutionofmachinesandartificialintelligence(AI)shouldbereadwithapinchofskepticism.AItechnologyismovingfast,butbroadadoptionisstillanuncharteredpathfraughtwithknownandunforeseenchallenges.Delaysandotherobstaclesareinevitable.
NorismachinelearningasimplecaseofflickingaswitchandaskingthemachinetopredicttheoutcomeoftheSuperBowlandserveyouadeliciousmartini.Machinelearningisfarfromwhatyouwouldcallanout-of-the-boxsolution.
Machinesoperatebasedonstatisticalalgorithmsmanagedandoverseenbyskilledindividuals—knownasdatascientistsandmachinelearningengineers.Thisisonelabormarketwherejobopportunitiesaredestinedfor
growthbutwhere,currently,supplyisstrugglingtomeetdemand.IndustryexpertslamentthatoneofthebiggestobstaclesdelayingtheprogressofAIistheinadequatesupplyofprofessionalswiththenecessaryexpertiseandtraining.
AccordingtoCharlesGreen,theDirectorofThoughtLeadershipatBelatrixSoftware:
“It’sahugechallengetofinddatascientists,peoplewithmachinelearningexperience,orpeoplewiththeskillstoanalyzeandusethedata,aswellasthosewhocancreatethealgorithmsrequiredformachinelearning.Secondly,whilethetechnologyisstillemerging,therearemanyongoingdevelopments.It’sclearthatAIisalongwayfromhowwemightimagineit.”
[2]
Perhapsyourownpathtobecominganexpertinthefieldofmachinelearningstartshere,ormaybeabaselineunderstandingissufficienttosatisfyyourcuriosityfornow.Inanycase,let’sproceedwiththeassumptionthatyouarereceptivetotheideaoftrainingtobecomeasuccessfuldatascientistormachinelearningengineer.
Tobuildandprogramintelligentmachines,youmustfirstunderstandclassicalstatistics.Algorithmsderivedfromclassicalstatisticscontributethemetaphoricalbloodcellsandoxygenthatpowermachinelearning.Layeruponlayeroflinearregression,k-nearestneighbors,andrandomforestssurgethroughthemachineanddrivetheircognitiveabilities.Classicalstatisticsisattheheartofmachinelearningandmanyofthesealgorithmsarebasedonthesamestatisticalequationsyoustudiedinhighschool.Indeed,statisticalalgorithmswereconductedonpaperwellbeforemachinesevertookonthetitleofartificialintelligence.
Computerprogrammingisanotherindispensablepartofmachinelearning.Thereisn’taclick-and-dragorWeb2.0solutiontoperformadvancedmachinelearninginthewayonecanconvenientlybuildawebsitenowadayswithWordPressorStrikingly.Programmingskillsarethereforevitaltomanagedataanddesignstatisticalmodelsthatrunonmachines.
Somestudentsofmachinelearningwillhaveyearsofprogrammingexperiencebuthaven’ttouchedclassicalstatisticssincehighschool.Others,perhaps,neverevenattemptedstatisticsintheirhighschoolyears.Butnottoworry,manyofthemachinelearningalgorithmswediscussinthisbookhaveworkingimplementationsinyourprogramminglanguageofchoice;noequationwritingnecessary.Youcanusecodetoexecutetheactualnumber
crunchingforyou.
Ifyouhavenotlearnedtocodebefore,youwillneedtoifyouwishtomakefurtherprogressinthisfield.Butforthepurposeofthiscompactstarter’scourse,thecurriculumcanbecompletedwithoutanybackgroundincomputerprogramming.Thisbookfocusesonthehigh-levelfundamentalsofmachinelearningaswellasthemathematicalandstatisticalunderpinningsofdesigningmachinelearningmodels.
Forthosewhodowishtolookattheprogrammingaspectofmachinelearning,Chapter13walksyouthroughtheentireprocessofsettingupasupervisedlearningmodelusingthepopularprogramminglanguagePython.
WHATISMACHINELEARNING?
In1959,IBMpublishedapaperintheIBMJournalofResearchandDevelopmentwithan,atthetime,obscureandcurioustitle.AuthoredbyIBM’sArthurSamuel,thepaperinvestedtheuseofmachinelearninginthegameofcheckers“toverifythefactthatacomputercanbeprogrammedsothatitwilllearntoplayabettergameofcheckersthancanbeplayedbythepersonwhowrotetheprogram.”
[3]
Althoughitwasnotthefirstpublicationtousetheterm“machinelearning”perse,ArthurSamueliswidelyconsideredasthefirstpersontocoinanddefinemachinelearningintheformwenowknowtoday.Samuel’slandmarkjournalsubmission,SomeStudiesinMachineLearningUsingtheGameofCheckers,isalsoanearlyindicationofhomosapiens’determinationtoimpartourownsystemoflearningtoman-mademachines.
Figure1:Historicalmentionsof“machinelearning”inpublishedbooks.Source:GoogleNgramViewer,2017
ArthurSamuelintroducesmachinelearninginhispaperasasubfieldofcomputersciencethatgivescomputerstheabilitytolearnwithoutbeingexplicitlyprogrammed.
[4]
Almostsixdecadeslater,thisdefinitionremainswidelyaccepted.
AlthoughnotdirectlymentionedinArthurSamuel’sdefinition,akeyfeatureofmachinelearningistheconceptofself-learning.Thisreferstotheapplicationofstatisticalmodelingtodetectpatternsandimprove
performancebasedondataandempiricalinformation;allwithoutdirectprogrammingcommands.ThisiswhatArthurSamueldescribedastheabilitytolearnwithoutbeingexplicitlyprogrammed.Buthedoesn’tinferthatmachinesformulatedecisionswithnoupfrontprogramming.Onthecontrary,machinelearningisheavilydependentoncomputerprogramming.Instead,Samuelobservedthatmachinesdon’trequireadirectinputcommandtoperformasettaskbutratherinputdata.
Figure2:ComparisonofInputCommandvsInputData
Anexampleofaninputcommandistyping“2+2”intoaprogramminglanguagesuchasPythonandhitting“Enter.”
>>>2+2
4
>>>
Thisrepresentsadirectcommandwithadirectanswer.
Inputdata,however,isdifferent.Dataisfedtothemachine,analgorithmisselected,hyperparameters(settings)areconfiguredandadjusted,andthemachineisinstructedtoconductitsanalysis.Themachineproceedstodecipherpatternsfoundinthedatathroughtheprocessoftrialanderror.Themachine’sdatamodel,formedfromanalyzingdatapatterns,canthenbeusedtopredictfuturevalues.
Althoughthereisarelationshipbetweentheprogrammerandthemachine,theyoperatealayerapartincomparisontotraditionalcomputerprogramming.Thisisbecausethemachineisformulatingdecisionsbasedonexperienceandmimickingtheprocessofhuman-baseddecision-making.
Asanexample,let’ssaythatafterexaminingtheYouTubeviewinghabitsofdatascientistsyourmachineidentifiesastrongrelationshipbetweendata
scientistsandcatvideos.Later,yourmachineidentifiespatternsamongthephysicaltraitsofbaseballplayersandtheirlikelihoodofwinningtheseason’sMostValuablePlayer(MVP)award.Inthefirstscenario,themachineanalyzedwhatvideosdatascientistsenjoywatchingonYouTubebasedonuserengagement;measuredinlikes,subscribes,andrepeatviewing.Inthesecondscenario,themachineassessedthephysicalfeaturesofpreviousbaseballMVPsamongvariousotherfeaturessuchasageandeducation.However,inneitherofthesetwoscenarioswasyourmachineexplicitlyprogrammedtoproduceadirectoutcome.Youfedtheinputdataandconfiguredthenominatedalgorithms,butthefinalpredictionwasdeterminedbythemachinethroughself-learninganddatamodeling.
Youcanthinkofbuildingadatamodelassimilartotrainingaguidedog.Throughspecializedtraining,guidedogslearnhowtorespondinvarioussituations.Forexample,thedogwilllearntoheelataredlightortosafelyleaditsmasteraroundobstacles.Ifthedoghasbeenproperlytrained,then,eventually,thetrainerwillnolongerberequired;theguidedogwillbeabletoapplyitstraininginvariousunsupervisedsituations.Similarly,machinelearningmodelscanbetrainedtoformdecisionsbasedonpastexperience.
Asimpleexampleiscreatingamodelthatdetectsspamemailmessages.Themodelistrainedtoblockemailswithsuspicioussubjectlinesandbodytextcontainingthreeormoreflaggedkeywords:dearfriend,free,invoice,PayPal,Viagra,casino,payment,bankruptcy,andwinner.Atthisstage,though,wearenotyetperformingmachinelearning.Ifwerecallthevisualrepresentationofinputcommandvsinputdata,wecanseethatthisprocessconsistsofonlytwosteps:Command>Action.
Machinelearningentailsathree-stepprocess:Data>Model>Action.
Thus,toincorporatemachinelearningintoourspamdetectionsystem,weneedtoswitchout“command”for“data”andadd“model”inordertoproduceanaction(output).Inthisexample,thedatacomprisessampleemailsandthemodelconsistsofstatistical-basedrules.Theparametersofthemodelincludethesamekeywordsfromouroriginalnegativelist.Themodelisthentrainedandtestedagainstthedata.
Oncethedataisfedintothemodel,thereisastrongchancethatassumptionscontainedinthemodelwillleadtosomeinaccuratepredictions.Forexample,undertherulesofthismodel,thefollowingemailsubjectlinewouldautomaticallybeclassifiedasspam:“PayPalhasreceivedyourpaymentforCasinoRoyalepurchasedoneBay.”
AsthisisagenuineemailsentfromaPayPalauto-responder,thespamdetectionsystemisluredintoproducingafalsepositivebasedonthenegativelistofkeywordscontainedinthemodel.Traditionalprogrammingishighlysusceptibletosuchcasesbecausethereisnobuilt-inmechanismtotestassumptionsandmodifytherulesofthemodel.Machinelearning,ontheotherhand,canadaptandmodifyassumptionsthroughitsthree-stepprocessandbyreactingtoerrors.
Training&TestData
Inmachinelearning,dataissplitintotrainingdataandtestdata.Thefirstsplitofdata,i.e.theinitialreserveofdatayouusetodevelopyourmodel,providesthetrainingdata.Inthespamemaildetectionexample,falsepositivessimilartothePayPalauto-responsemightbedetectedfromthetrainingdata.Newrulesormodificationsmustthenbeadded,e.g.,emailnotificationsissuedfromthesendingaddress“
payments@
”shouldbeexcludedfromspamfiltering.
Afteryouhavesuccessfullydevelopedamodelbasedonthetrainingdataandaresatisfiedwithitsaccuracy,youcanthentestthemodelontheremainingdata,knownasthetestdata.Onceyouaresatisfiedwiththeresultsofboththetrainingdataandtestdata,themachinelearningmodelisreadytofilterincomingemailsandgeneratedecisionsonhowtocategorizethoseincomingmessages.
Thedifferencebetweenmachinelearningandtraditionalprogrammingmayseemtrivialatfirst,butitwillbecomeclearasyourunthroughfurtherexamplesandwitnessthespecialpowerofself-learninginmorenuancedsituations.
Thesecondimportantpointtotakeawayfromthischapterishowmachinelearningfitsintothebroaderlandscapeofdatascienceandcomputerscience.Thismeansunderstandinghowmachinelearninginterrelateswithparentfieldsandsisterdisciplines.Thisisimportant,asyouwillencountertheserelatedtermswhensearchingforrelevantstudymaterials—andyouwillhearthemmentionedadnauseaminintroductorymachinelearningcourses.Relevantdisciplinescanalsobedifficulttotellapartatfirstglance,suchas“machinelearning”and“datamining.”
Let’sbeginwithahigh-levelintroduction.Machinelearning,datamining,computerprogramming,andmostrelevantfields(excludingclassical
statistics)derivefirstfromcomputerscience,whichencompasseseverythingrelatedtothedesignanduseofcomputers.Withintheall-encompassingspaceofcomputerscienceisthenextbroadfield:datascience.Narrowerthancomputerscience,datasciencecomprisesmethodsandsystemstoextractknowledgeandinsightsfromdatathroughtheuseofcomputers.
Figure3:ThelineageofmachinelearningrepresentedbyarowofRussianmatryoshkadolls
Poppingoutfromcomputerscienceanddatascienceasthethirdmatryoshkadollisartificialintelligence.Artificialintelligence,orAI,encompassestheabilityofmachinestoperformintelligentandcognitivetasks.ComparabletothewaytheIndustrialRevolutiongavebirthtoaneraofmachinesthatcouldsimulatephysicaltasks,AIisdrivingthedevelopmentofmachinescapableofsimulatingcognitiveabilities.
Whilestillbroadbutdramaticallymorehonedthancomputerscienceanddatascience,AIcontainsnumeroussubfieldsthatarepopulartoday.Thesesubfieldsincludesearchandplanning,reasoningandknowledgerepresentation,perception,naturallanguageprocessing(NLP),andofcourse,machinelearning.MachinelearningbleedsintootherfieldsofAI,includingNLPandperceptionthroughtheshareduseofself-learningalgorithms.
Figure4:Visualrepresentationoftherelationshipbetweendata-relatedfields
ForstudentswithaninterestinAI,machinelearningprovidesanexcellentstartingpointinthatitoffersamorenarrowandpracticallensofstudycomparedtotheconceptualambiguityofAI.Algorithmsfoundinmachinelearningcanalsobeappliedacrossotherdisciplines,includingperceptionandnaturallanguageprocessing.Inaddition,aMaster’sdegreeisadequatetodevelopacertainlevelofexpertiseinmachinelearning,butyoumayneedaPhDtomakeanytrueprogressinAI.
Asmentioned,machinelearningalsooverlapswithdatamining—asisterdisciplinethatfocusesondiscoveringandunearthingpatternsinlargedatasets.Popularalgorithms,suchask-meansclustering,associationanalysis,andregressionanalysis,areappliedinbothdataminingandmachinelearningtoanalyzedata.Butwheremachinelearningfocusesontheincrementalprocessofself-learninganddatamodelingtoformpredictionsaboutthefuture,dataminingnarrowsinoncleaninglargedatasetstogleanvaluableinsightfromthepast.
Thedifferencebetweendataminingandmachinelearningcanbeexplainedthroughananalogyoftwoteamsofarchaeologists.Thefirstteamismadeupofarchaeologistswhofocustheireffortsonremovingdebristhatliesinthewayofvaluableitems,hidingthemfromdirectsight.Theirprimarygoalsaretoexcavatethearea,findnewvaluablediscoveries,andthenpackuptheirequipmentandmoveon.Adaylater,theywillflytoanotherexoticdestinationtostartanewprojectwithnorelationshiptothesitethey
excavatedthedaybefore.
Thesecondteamisalsointhebusinessofexcavatinghistoricalsites,butthesearchaeologistsuseadifferentmethodology.Theydeliberatelyreframefromexcavatingthemainpitforseveralweeks.Inthattime,theyvisitotherrelevantarchaeologicalsitesintheareaandexaminehoweachsitewasexcavated.Afterreturningtothesiteoftheirownproject,theyapplythisknowledgetoexcavatesmallerpitssurroundingthemainpit.
Thearchaeologiststhenanalyzetheresults.Afterreflectingontheirexperienceexcavatingonepit,theyoptimizetheireffortstoexcavatethenext.Thisincludespredictingtheamountoftimeittakestoexcavateapit,understandingvarianceandpatternsfoundinthelocalterrainanddevelopingnewstrategiestoreduceerrorandimprovetheaccuracyoftheirwork.Fromthisexperience,theyareabletooptimizetheirapproachtoformastrategicmodeltoexcavatethemainpit.
Ifitisnotalreadyclear,thefirstteamsubscribestodataminingandthesecondteamtomachinelearning.Atamicro-level,bothdataminingandmachinelearningappearsimilar,andtheydousemanyofthesametools.Bothteamsmakealivingexcavatinghistoricalsitestodiscovervaluableitems.Butinpractice,theirmethodologyisdifferent.Themachinelearningteamfocusesondividingtheirdatasetintotrainingdataandtestdatatocreateamodel,andimprovingfuturepredictionsbasedonpreviousexperience.Meanwhile,thedataminingteamconcentratesonexcavatingthetargetareaaseffectivelyaspossible—withouttheuseofaself-learningmodel—beforemovingontothenextcleanupjob.
MLCATEGORIES
Machinelearningincorporatesseveralhundredstatistical-basedalgorithmsandchoosingtherightalgorithmorcombinationofalgorithmsforthejobisaconstantchallengeforanyoneworkinginthisfield.Butbeforeweexaminespecificalgorithms,itisimportanttounderstandthethreeoverarchingcategoriesofmachinelearning.Thesethreecategoriesaresupervised,unsupervised,andreinforcement.
SupervisedLearning
Asthefirstbranchofmachinelearning,supervisedlearningconcentratesonlearningpatternsthroughconnectingtherelationshipbetweenvariablesandknownoutcomesandworkingwithlabeleddatasets.
Supervisedlearningworksbyfeedingthemachinesampledatawithvariousfeatures(representedas“X”)andthecorrectvalueoutputofthedata(representedas“y”).Thefactthattheoutputandfeaturevaluesareknownqualifiesthedatasetas“l(fā)abeled.”Thealgorithmthendecipherspatternsthatexistinthedataandcreatesamodelthatcanreproducethesameunderlyingruleswithnewdata.
Forinstance,topredictthemarketrateforthepurchaseofausedcar,asupervisedalgorithmcanformulatepredictionsbyanalyzingtherelationshipbetweencarattributes(includingtheyearofmake,carbrand,mileage,etc.)andthesellingpriceofothercarssoldbasedonhistoricaldata.Giventhatthesupervisedalgorithmknowsthefinalpriceofothercardssold,itcanthenworkbackwardtodeterminetherelationshipbetweenthecharacteristicsofthecaranditsvalue.
Figure1:Carvaluepredictionmodel
Afterthemachinedecipherstherulesandpatternsofthedata,itcreateswhatisknownasamodel:analgorithmicequationforproducinganoutcomewithnewdatabasedontherulesderivedfromthetrainingdata.Oncethemodelisprepared,itcanbeappliedtonewdataandtestedforaccuracy.Afterthemodelhaspassedboththetrainingandtestdatastages,itisreadytobeappliedandusedintherealworld.
InChapter13,wewillcreateamodelforpredictinghousevalueswhereyistheactualhousepriceandXarethevariablesthatimpacty,suchaslandsize,location,andthenumberofrooms.Throughsupervisedlearning,wewillcreatearuletopredicty(housevalue)basedonthegivenvaluesofvariousvariables(X).
Examplesofsupervisedlearningalgorithmsincluderegressionanalysis,decisiontrees,k-nearestneighbors,neuralnetworks,andsupportvectormachines.Eachofthesetechniqueswillbeintroducedlaterinthebook.
UnsupervisedLearning
Inthecaseofunsupervisedlearning,notallvariablesanddatapatternsareclassified.Instead,themachinemustuncoverhiddenpatternsandcreatelabelsthroughtheuseofunsupervisedlearningalgorithms.Thek-meansclusteringalgorithmisapopularexampleofunsupervisedlearning.ThissimplealgorithmgroupsdatapointsthatarefoundtopossesssimilarfeaturesasshowninFigure1.
Figure1:Exampleofk-meansclustering,apopularunsupervisedlearningtechnique
IfyougroupdatapointsbasedonthepurchasingbehaviorofSME(SmallandMedium-sizedEnterprises)andlargeenterprisecustomers,forexample,youarelikelytoseetwoclustersemerge.ThisisbecauseSMEsandlargeenterprisestendtohavedisparatebuyinghabits.Whenitcomestopurchasingcloudinfrastructure,forinstance,basiccloudhostingresourcesandaContentDeliveryNetwork(CDN)mayprovesufficientformostSMEcustomers.Largeenterprisecustomers,though,aremorelikelytopurchaseawiderarrayofcloudproductsandentiresolutionsthatincludeadvancedsecurityandnetworkingproductslikeWAF(WebApplicationFirewall),adedicatedprivateconnection,andVPC(VirtualPrivateCloud).Byanalyzingcustomerpurchasinghabits,unsupervisedlearningiscapableofidentifyingthesetwogroupsofcustomerswithoutspecificlabelsthatclassifythecompanyassmall,mediumorlarge.
Theadvantageofunsupervisedlearningisitenablesyoutodiscoverpatternsinthedatathatyouwereunawareexisted—suchasthepresenceoftwomajorcustomertypes.Clusteringtechniquessuchask-meansclusteringcanalsoprovidethespringboardforconductingfurtheranalysisafterdiscretegroupshavebeendiscovered.
Inindustry,unsupervisedlearningisparticularlypowerfulinfrauddetection
—wherethemostdangerousattacksareoftenthoseyettobeclassified.Onereal-worldexampleisDataVisor,whoessentiallybuilttheirbusinessmodelbasedonunsupervisedlearning.
Foundedin2013inCalifornia,DataVisorprotectscustomersfromfraudulent
onlineactivities,includingspam,fakereviews,fakeappinstalls,andfraudulenttransactions.Whereastraditionalfraudprotectionservicesdrawonsupervisedlearningmodelsandruleengines,DataVisorusesunsupervisedlearningwhichenablesthemtodetectunclassifiedcategoriesofattacksintheirearlystages.
Ontheirwebsite,DataVisorexplainsthat"todetectattacks,existingsolutionsrelyonhumanexperiencetocreaterulesorlabeledtrainingdatatotunemodels.Thismeanstheyareunabletodetectnewattacksthathaven’talreadybeenidentifiedbyhumansorlabeledintrainingdata."
[5]
Thismeansthattraditionalsolutionsanalyzethechainofactivityforaparticularattackandthencreaterulestopredictarepeatattack.Underthisscenario,thedependentvariable(y)istheeventofanattackandtheindependentvariables(X)arethecommonpredictorvariablesofanattack.Examplesofindependentvariablescouldbe:
Asuddenlargeorderfromanunknownuser.I.E.establishedcustomersgenerallyspendlessthan$100perorder,butanewuserspends$8,000inoneorderimmediatelyuponregisteringtheiraccount.
Asuddensurgeofuserratings.I.E.AsatypicalauthorandbookselleronA,it’suncommonformyfirstpublishedworktoreceivemorethanonebookreviewwithinthespaceofonetotwodays.Ingeneral,approximately1in200Amazonreadersleaveabookreviewandmostbooksgoweeksormonthswithoutareview.However,Icommonlyseecompetitorsinthiscategory(datascience)attracting20-50reviewsinoneday!(Unsurprisingly,IalsoseeAmazonremovingthesesuspiciousreviewsweeksormonthslater.)
Identicalorsimilaruserreviewsfromdifferentusers.FollowingthesameAmazonanalogy,Ioftenseeuserreviewsofmybookappearonotherbooksseveralmonthslater(sometimeswithareferencetomynameastheauthorstillincludedinthereview!).Again,Amazoneventuallyremovesthesefakereviewsandsuspendstheseaccountsforbreakingtheirtermsofservice.
Suspiciousshippingaddress.I.E.Forsmallbusinessesthatroutinelyshipproductstolocalcustomers,anorderfromadistantlocation(wheretheydon'tadvertisetheirproducts)caninrarecasesbeanindicatoroffraudulentormaliciousactivity.
Standaloneactivitiessuchasasuddenlargeorderoradistantshippingaddressmayprovetoolittleinformationtopredictsophisticated
cybercriminalactivityandmorelikelytoleadtomanyfalsepositives.Butamodelthatmonitorscombinationsofindependentvariables,suchasasuddenlargepurchaseorderfromtheothersideoftheglobeoralandslideofbookreviewsthatreuseexistingcontentwillgenerallyleadtomoreaccuratepredictions.Asupervisedlearning-basedmodelcoulddeconstructandclassifywhatthesecommonindependentvariablesareanddesignadetectionsystemtoidentifyandpreventrepeatoffenses.
Sophisticatedcybercriminals,though,learntoevadeclassification-basedruleenginesbymodifyingtheirtactics.Inaddition,leadinguptoanattack,attackersoftenregisterandoperatesingleormultipleaccountsandincubatetheseaccountswithactivitiesthatmimiclegitimateusers.Theythenutilizetheirestablishedaccounthistorytoevadedetectionsystems,whicharetrigger-heavyagainstrecentlyregisteredaccounts.Supervisedlearning-basedsolutionsstruggletodetectsleepercellsuntiltheactualdamagehasbeenmadeandespeciallywithregardtonewcategoriesofattacks.
DataVisorandotheranti-fraudsolutionprovidersthereforeleverageunsupervisedlearningtoaddressthelimitationsofsupervisedlearningbyanalyzingpatternsacrosshundredsofmillionsofaccountsandidentifyi
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 氣象、水文儀器及裝置項目安全風(fēng)險評價報告
- 復(fù)方芩蘭口服液項目風(fēng)險評估報告
- 蘇州科技大學(xué)《建筑安裝工程概預(yù)算》2023-2024學(xué)年第二學(xué)期期末試卷
- 福建醫(yī)科大學(xué)《能源動力》2023-2024學(xué)年第二學(xué)期期末試卷
- 商洛學(xué)院《建筑裝飾材料與工程概預(yù)算》2023-2024學(xué)年第二學(xué)期期末試卷
- 廣西農(nóng)業(yè)工程職業(yè)技術(shù)學(xué)院《SPSS軟件運用》2023-2024學(xué)年第一學(xué)期期末試卷
- 云南商務(wù)職業(yè)學(xué)院《藥事法規(guī)》2023-2024學(xué)年第一學(xué)期期末試卷
- 四川省成都市雙流棠湖中學(xué)2025屆高三(二模)生物試題試卷含解析
- 郯城縣2025屆小升初總復(fù)習(xí)數(shù)學(xué)測試卷含解析
- 浙江省衢州市江山市2025屆五年級數(shù)學(xué)第二學(xué)期期末質(zhì)量檢測模擬試題含答案
- 肝臟結(jié)核CT表現(xiàn)課件
- 設(shè)備周期保養(yǎng)檢修記錄表
- 中國大學(xué)生心理健康量表(CCSMHS)
- 專利法全套ppt課件(完整版)
- GB∕T 3639-2021 冷拔或冷軋精密無縫鋼管
- 西師版六年級下冊數(shù)學(xué)第五單元 總復(fù)習(xí) 教案
- 色譜、質(zhì)譜、聯(lián)用
- 獨生子女父母退休一次性獎勵審批1
- 鋁合金窗陜西銀杉節(jié)能門窗有限責(zé)任公司鋁合金制作及安裝工藝流程圖
- 蘇教版小學(xué)數(shù)學(xué)四年級下冊《圖形旋轉(zhuǎn)》練習(xí)題
- 燒結(jié)普通磚、多孔磚回彈計算
評論
0/150
提交評論