Foundations of Machine Learning

Feature Extraction and Preprocessing

The examples discussed in linear regression used simple numeric explanatory variables, such as the diameter of a pizza. Many machine learning problems require learning from observations of categorical variables, text, or images. In this lesson, you will learn basic techniques for preprocessing data and creating feature representations of these observations. These techniques can be used with the regression models we have seen, such as LinearRegression, as well as the models we will discuss in subsequent lessons.

Topics:
- Extracting features from categorical variables
- Extracting features from text
- Extracting features from images
- Data normalization

Extracting features from categorical variables

Types of variables

Nominal:
- Categories, states, or "names of things"
- Hair_color = {auburn, black, blond, brown, grey, red, white}
- Marital status, occupation, ID numbers, zip codes
Binary:
- A nominal attribute with only two states (0 and 1)
- Symmetric binary: both outcomes equally important, e.g., gender
- Asymmetric binary: outcomes not equally important, e.g., a medical test (positive vs. negative)
- Convention: assign 1 to the most important outcome (e.g., HIV positive)
Ordinal:
- Values have a meaningful order (ranking), but the magnitude between successive values is not known
- Size = {small, medium, large}, grades, army rankings

Interval:
- Measured on a scale of equal-sized units; values have order
- E.g., temperature in °C or °F, calendar dates
- No true zero-point

Ratio:
- Inherent zero-point
- We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K)
- E.g., temperature in Kelvin, length, counts, monetary quantities

Categorical variables: one-of-K or one-hot encoding

Categorical variables are commonly encoded using one-of-K or one-hot encoding, in which the explanatory variable is encoded using one binary feature for each of the variable's possible values. For example, let's assume that our model has a city explanatory variable that can take one of three values: New York, San Francisco, or Chapel Hill. One-hot encoding represents this explanatory variable using one binary feature for each of the three possible cities.

sklearn.feature_extraction: Feature Extraction

sklearn.feature_extraction.DictVectorizer transforms lists of feature-value mappings to vectors. This transformer turns lists of mappings (dict-like objects) of feature names to feature values into NumPy arrays or scipy.sparse matrices for use with scikit-learn estimators. When feature values are strings, this transformer will do a binary one-hot (aka one-of-K) coding: one boolean-valued feature is constructed for each of the possible string values that the feature can take on. For instance, a feature "f" that can take on the values "ham" and "spam" will become two features in the output, one signifying "f=ham", the other "f=spam". Features that do not occur in a sample (mapping) will have a zero value in the resulting array/matrix.

Example use of DictVectorizer:
>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> X = v.fit_transform(D)
>>> X
array([[2., 0., 1.],
       [0., 1., 3.]])
>>> v.inverse_transform(X) == [{'bar': 2.0, 'foo': 1.0}, {'baz': 1.0, 'foo': 3.0}]
True
>>> v.transform({'foo': 4, 'unseen_feature': 3})
array([[0., 0., 4.]])
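DictVectorizer is one route to one-hot codes. As a supplementary sketch (not from the original slides), scikit-learn's preprocessing module offers OneHotEncoder for nominal columns and OrdinalEncoder for ordinal ones; the hair-color and size values below are illustrative:

# Contrasting encoders for nominal vs. ordinal variables.
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

hair = np.array([['black'], ['blond'], ['red'], ['black']])  # nominal values
size = np.array([['small'], ['large'], ['medium']])          # ordinal values

onehot = OneHotEncoder()                  # one binary column per category
print(onehot.fit_transform(hair).toarray())
print(onehot.categories_)

# categories= fixes the intended ranking, so small < medium < large.
ordinal = OrdinalEncoder(categories=[['small', 'medium', 'large']])
print(ordinal.fit_transform(size))        # [[0.], [2.], [1.]]

An integer code such as OrdinalEncoder's is appropriate only when the order is meaningful; applied to a nominal variable it would invent a ranking that the data does not have.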

Example use of DictVectorizer (one-hot encoding a categorical variable):

>>> from sklearn.feature_extraction import DictVectorizer
>>> onehot_encoder = DictVectorizer(sparse=False)
>>> D = [{'city': 'New York'}, {'city': 'San Francisco'}, {'city': 'Chapel Hill'}]
>>> X = onehot_encoder.fit_transform(D)
>>> X
array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])
>>> onehot_encoder.feature_names_
['city=Chapel Hill', 'city=New York', 'city=San Francisco']

Question: can we represent the values of a categorical explanatory variable with a single integer feature?

Extracting features from text

Many machine learning problems use text as an explanatory variable. Text must be transformed to a different representation that encodes as much of its meaning as possible in a feature vector. In the following sections we will review variations of the most common representation of text used in machine learning: the bag-of-words model.

The bag-of-words representation

The most common representation of text is the bag-of-words model. This representation uses a multiset, or bag, that encodes the words that appear in a text; bag-of-words does not encode any of the text's syntax, ignores the order of words, and disregards all grammar. Bag-of-words can be thought of as an extension of one-hot encoding: it creates one feature for each word of interest in the text. The bag-of-words model is motivated by the intuition that documents containing similar words often have similar meanings, and it can be used effectively for document classification and retrieval despite the limited information that it encodes.
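The multiset idea itself needs no library; as a toy sketch (the lowercase whitespace tokenizer is an assumption, not part of the slides):

# A toy bag-of-words: a Counter is a multiset of tokens, so word order
# and grammar are discarded while counts are kept.
from collections import Counter

def bag_of_words(text):
    return Counter(text.lower().split())

print(bag_of_words('UNC played Duke in basketball'))
# Counter({'unc': 1, 'played': 1, 'duke': 1, 'in': 1, 'basketball': 1})

The scikit-learn classes described next implement the same idea with proper tokenization and sparse output.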

The sklearn.feature_extraction.text submodule gathers utilities to build feature vectors from text documents:

- feature_extraction.text.CountVectorizer: convert a collection of text documents to a matrix of token counts
- feature_extraction.text.HashingVectorizer: convert a collection of text documents to a matrix of token occurrences
- feature_extraction.text.TfidfTransformer: transform a count matrix to a normalized tf or tf-idf representation
- feature_extraction.text.TfidfVectorizer: convert a collection of raw documents to a matrix of TF-IDF features
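CountVectorizer and TfidfVectorizer are demonstrated in the examples that follow. As a supplementary sketch of HashingVectorizer (the tiny n_features and raw-count settings are illustrative assumptions, not from the slides):

# HashingVectorizer is stateless: tokens are hashed straight into a fixed
# number of columns, so no vocabulary is stored and no fit is required.
from sklearn.feature_extraction.text import HashingVectorizer

hv = HashingVectorizer(n_features=8, norm=None, alternate_sign=False)
X = hv.transform(['UNC played Duke in basketball'])
print(X.toarray())  # one row with five non-zero counts (barring hash collisions)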

sklearn.feature_extraction.text.CountVectorizer

CountVectorizer converts a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix. If you do not provide an a-priori dictionary, and you do not use an analyzer that does some kind of feature selection, then the number of features will be equal to the vocabulary size found by analyzing the data.

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = ['UNC played Duke in basketball', 'Duke lost the basketball game']
>>> vectorizer = CountVectorizer(binary=True)
>>> print(vectorizer.fit_transform(corpus).todense())
[[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 0 1 0]]
>>> print(vectorizer.vocabulary_)
{'unc': 7, 'in': 3, 'the': 6, 'lost': 4, 'played': 5, 'basketball': 0, 'duke': 1, 'game': 2}

Adding a third document grows the vocabulary:

>>> corpus = ['UNC played Duke in basketball', 'Duke lost the basketball game', 'I ate a sandwich']
>>> vectorizer = CountVectorizer(binary=True)
>>> print(vectorizer.fit_transform(corpus).todense())
[[0 1 1 0 1 0 1 0 0 1]
 [0 1 1 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0]]
>>> print(vectorizer.vocabulary_)
{'unc': 9, 'in': 4, 'the': 8, 'lost': 5, 'sandwich': 7, 'played': 6, 'basketball': 1, 'duke': 2, 'game': 3, 'ate': 0}

Now, our feature vectors are as follows:

UNC played Duke in basketball = [0 1 1 0 1 0 1 0 0 1]
Duke lost the basketball game = [0 1 1 1 0 1 0 0 1 0]
I ate a sandwich              = [1 0 0 0 0 0 0 1 0 0]

The meanings of the first two documents are more similar to each other than they are to the third document, and their corresponding feature vectors are more similar to each other than to the third document's feature vector when using a metric such as Euclidean distance. sklearn.metrics.pairwise.euclidean_distances computes these distances:

>>> from sklearn.metrics.pairwise import euclidean_distances
>>> counts = [[0, 1, 1, 0, 0, 1, 0, 1], [0, 1, 1, 1, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0, 1, 0]]
>>> print('Distance between 1st and 2nd documents:', euclidean_distances([counts[0]], [counts[1]]))
Distance between 1st and 2nd documents: [[2.]]
>>> print('Distance between 1st and 3rd documents:', euclidean_distances([counts[0]], [counts[2]]))
Distance between 1st and 3rd documents: [[2.44948974]]
>>> print('Distance between 2nd and 3rd documents:', euclidean_distances([counts[1]], [counts[2]]))
Distance between 2nd and 3rd documents: [[2.44948974]]

For real applications: high-dimensional feature vectors

The first problem is that high-dimensional vectors require more memory than smaller vectors. The second problem is known as the curse of dimensionality, or the Hughes effect: as the feature space's dimensionality increases, more training data is required to ensure that there are enough training instances with each combination of the features' values. If there are insufficient training instances for a feature, the algorithm may overfit noise in the training data and fail to generalize.

Stop-word filtering

Remove words that are common to most of the documents in the corpus. These words, called stop words, include determiners such as "the", "a", and "an"; auxiliary verbs such as "do", "be", and "will"; and prepositions such as "on", "around", and "beneath". Stop words are often functional words that contribute to the document's meaning through grammar rather than through their denotations. The CountVectorizer class can filter stop words provided as the stop_words keyword argument, and it also includes a basic English stop list.

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = ['UNC played Duke in basketball', 'Duke lost the basketball game', 'I ate a sandwich']
>>> vectorizer = CountVectorizer(binary=True, stop_words='english')
>>> print(vectorizer.fit_transform(corpus).todense())
[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]
>>> print(vectorizer.vocabulary_)
{'unc': 7, 'lost': 4, 'sandwich': 6, 'played': 5, 'basketball': 1, 'duke': 2, 'game': 3, 'ate': 0}

Stemming and lemmatization

While stop filtering is an easy strategy for dimensionality reduction, most stop lists contain only a few hundred words. A large corpus may still have hundreds of thousands of unique words after filtering. Two similar strategies for further reducing dimensionality are called stemming and lemmatization. Stemming extracts the stem or root form of a word, which does not necessarily express a complete meaning on its own; lemmatization reduces any inflected form of a word to its base form, or lemma, which does express a complete meaning. We can use the Natural Language Toolkit (NLTK) to stem and lemmatize the corpus.

Extending bag-of-words with TF-IDF (term frequency-inverse document frequency) weights

Instead of using a binary value for each element in the feature vector, we will now use an integer that represents the number of times the word appeared in the document. Raw counts favor long documents, so the term frequency f(t, d) of term t in document d is usually adjusted; the formulas below are the common forms of the variants the slides name:

- Normalized term frequency: tf(t, d) = f(t, d) / ||d||, where ||d|| is the total number of tokens in d (or the norm of its count vector).
- Logarithmically scaled term frequency: tf(t, d) = log(f(t, d) + 1).
- Augmented term frequency: tf(t, d) = 0.5 + (0.5 × f(t, d)) / max{ f(w, d) : w in d }, which scales counts by the most frequent word in the document.

Normalization, logarithmically scaled term frequencies, and augmented term frequencies can represent the frequencies of terms in a document while mitigating the effects of different document sizes. However, another problem remains with these representations: the feature vectors contain large weights for terms that occur frequently in a document, even if those terms occur frequently in most documents in the corpus.

Inverse document frequency (IDF) discounts such terms; in its common form, idf(t, D) = log( N / (1 + |{ d in D : t in d }|) ), where N is the total number of documents in the corpus D. The TF-IDF value is the product of a term's frequency and its inverse document frequency: tfidf(t, d, D) = tf(t, d) × idf(t, D).

The TfidfVectorizer class wraps CountVectorizer and TfidfTransformer:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = ['The dog ate a sandwich and I ate a sandwich', 'The wizard transfigured a sandwich']
>>> vectorizer = TfidfVectorizer(stop_words='english')
>>> print(vectorizer.fit_transform(corpus).todense())
[[0.75458397 0.37729199 0.53689271 0.         0.        ]
 [0.         0.         0.44943642 0.6316672  0.6316672 ]]
>>> print(vectorizer.vocabulary_)
{'sandwich': 2, 'dog': 1, 'transfigured': 3, 'ate': 0, 'wizard': 4}

TF-IDF + machine learning classifiers vs. text classification with deep learning

- FastText: averages the word and n-gram vectors of an entire document into a document vector, then applies a softmax multi-class classifier to it. It relies on two tricks: character-level n-gram features and hierarchical softmax classification.
- Word2Vec: one of the word-embedding methods, proposed by Mikolov at Google in 2013. Because Word2Vec takes context into account, it outperforms the embedding methods that preceded it (though not the methods that appeared after 2018).
- BERT (Bidirectional Encoder Representations from Transformers): a word-vector model introduced by Google in October 2018 in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"; it set new best results on 11 different NLP benchmarks.

Extracting features from images

Computer vision is the study and design of computational artifacts that process and understand images. These artifacts sometimes employ machine learning. An overview of computer vision is far beyond the scope of this course, but in this section we will review some basic techniques used in computer vision to represent images in machine learning problems.

Extracting features from pixel intensities

The digits dataset included with scikit-learn contains grayscale images of more than 1,700 hand-written digits between zero and nine. Each image is eight pixels on a side, and each pixel is represented by an intensity value between zero and 16: white is the most intense and is indicated by zero, and black is the least intense and is indicated by 16. [Figure: a hand-written digit from the digits dataset.]

A basic feature representation for an image can be constructed by reshaping the matrix into a vector by concatenating its rows together. This approach has drawbacks:

- It produces large feature vectors.
- It is sensitive to changes in the scale, rotation, and translation of images.
- Furthermore, learning from pixel intensities is itself problematic, as the model can become sensitive to changes in illumination.
- Modern computer vision applications frequently use either hand-engineered feature extraction methods that are applicable to many different problems, or automatically learn features without supervision using techniques such as deep learning.

Extracting points of interest as features

Humans can quickly recognize many objects without observing every attribute of the object. This intuition motivates representations of only the most informative attributes of an image. These informative attributes, or points of interest, are points that are surrounded by rich textures and can be reproduced despite perturbations of the image. Edges and corners are two common types of points of interest. We can use scikit-image to extract points of interest from an image. [Figure: an example photograph with detected points of interest.]

Classic keypoint detectors:
- Harris (1988): Harris Corner Detector
- Shi, Tomasi (1996): Good Features to Track
- SIFT (1999): Scale Invariant Feature Transform (Lowe)
- SURF (2006): Speeded Up Robust Features

Modern keypoint detectors:
- FAST (2006): Features from Accelerated Segment Test
- BRIEF (2010): Binary Robust Independent Elementary Features
- ORB (2011): Oriented FAST and Rotated BRIEF
- BRISK (2011): Binary Robust Invariant Scalable Keypoints
- FREAK (2012): Fast Retina Keypoint
- KAZE (2012): KAZE

SIFT and SURF

Scale-Invariant Feature Transform (SIFT) is a method for extracting features from an image that is less sensitive to the scale, rotation, and illumination of the image than the extraction methods we have previously discussed. Each SIFT feature, or descriptor, is a vector that describes edges and corners in a region of an image. Unlike the points of interest in our previous example, SIFT also captures information about the composition of each point of interest and its surroundings.

Speeded-Up Robust Features (SURF) is another method of extracting interesting points of an image and creating descriptions that are invariant to the image's scale, orientation, and illumination. SURF can be computed more quickly than SIFT, and it is more effective at recognizing features across images that have been transformed in certain ways.

Like the extracted points of interest, the extracted SIFT (or SURF) descriptors are only the first step in creating a feature representation that could be used in a machine learning task.

Data normalization

Using sklearn for data preprocessing: standardization, normalization, and regularization.

Standardization (Z-score), i.e., removing the mean and scaling to unit variance, uses the formula (X - mean) / std, computed independently for each attribute (each column): every column has its mean subtracted and is divided by its standard deviation, so that for each attribute the values are centered around 0 with variance 1. scikit-learn offers two ways to do this:

- The sklearn.preprocessing.scale() function standardizes the given data directly.
- The sklearn.preprocessing.StandardScaler class has the advantage of saving the parameters learned from the training set (mean and variance), so the fitted object can later transform test data with the same parameters.

For example:
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[1., -1., 2.],
...               [2., 0., 0.],
...               [0., 1., -1.]])
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])
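The StandardScaler counterpart to the example above can be sketched as follows; the held-out test row is a hypothetical addition for illustration:

# StandardScaler stores the training mean and variance, so unseen data is
# scaled with the *training* statistics rather than its own.
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_test = np.array([[-1., 1., 0.]])  # hypothetical unseen sample

scaler = preprocessing.StandardScaler().fit(X_train)
print(scaler.mean_)               # per-column means learned from X_train
print(scaler.transform(X_train))  # matches preprocessing.scale(X_train)
print(scaler.transform(X_test))   # X_test scaled with the training parameters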
