Foundations of Machine Learning
Feature Extraction and Preprocessing (Lesson 4)

The examples discussed in linear regression used simple numeric explanatory variables, such as the diameter of a pizza. Many machine learning problems require learning from observations of categorical variables, text, or images. In this lesson, you will learn basic techniques for preprocessing data and creating feature representations of these observations. These techniques can be used with regression models such as LinearRegression, as well as with the models we will discuss in subsequent lessons.

Outline:
- Extracting features from categorical variables
- Extracting features from text
- Extracting features from images
- Data normalization

Extracting features from categorical variables

Types of variables:
- Nominal (定類): categories, states, or "names of things". Examples: Hair_color = {auburn, black, blond, brown, grey, red, white}; marital status, occupation, ID numbers, zip codes.
- Binary (二類): a nominal attribute with only two states (0 and 1). Symmetric binary: both outcomes are equally important, e.g., gender. Asymmetric binary: the outcomes are not equally important, e.g., a medical test (positive vs. negative); by convention, 1 is assigned to the more important outcome (e.g., HIV positive).
- Ordinal (定序): values have a meaningful order (ranking), but the magnitude between successive values is not known. Examples: Size = {small, medium, large}, grades, army rankings.
- Interval (定距): measured on a scale of equal-sized units; values have an order, but there is no true zero-point. E.g., temperature in °C or °F, calendar dates.
- Ratio (定比): has an inherent zero-point, so we can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K). E.g., temperature in Kelvin, length, counts, monetary quantities.

Categorical variables: one-of-K or one-hot (獨(dú)熱) encoding

Categorical variables are commonly encoded using one-of-K or one-hot encoding, in which the explanatory variable is encoded using one binary feature for each of the variable's possible values. For example, let's assume that our model has a city explanatory variable that can take one of three values: New York, San Francisco, or Chapel Hill. One-hot encoding represents this explanatory variable using one binary feature for each of the three possible cities.

sklearn.feature_extraction: feature extraction

sklearn.feature_extraction.DictVectorizer transforms lists of feature-value mappings to vectors. This transformer turns lists of mappings (dict-like objects) of feature names to feature values into NumPy arrays or scipy.sparse matrices for use with scikit-learn estimators. When feature values are strings, this transformer performs a binary one-hot (aka one-of-K) coding: one boolean-valued feature is constructed for each of the possible string values that the feature can take on. For instance, a feature "f" that can take on the values "ham" and "spam" will become two features in the output, one signifying "f=ham", the other "f=spam". Features that do not occur in a sample (mapping) will have a zero value in the resulting array/matrix.

Example use of DictVectorizer:

>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> X = v.fit_transform(D)
>>> X
array([[2., 0., 1.],
       [0., 1., 3.]])
>>> v.inverse_transform(X) == [{'bar': 2.0, 'foo': 1.0}, {'baz': 1.0, 'foo': 3.0}]
True
>>> v.transform({'foo': 4, 'unseen_feature': 3})
array([[0., 0., 4.]])

Encoding the city variable from the example above:

>>> from sklearn.feature_extraction import DictVectorizer
>>> onehot_encoder = DictVectorizer(sparse=False)
>>> D = [{'city': 'New York'}, {'city': 'San Francisco'}, {'city': 'Chapel Hill'}]
>>> X = onehot_encoder.fit_transform(D)
>>> X
array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])
>>> onehot_encoder.feature_names_
['city=Chapel Hill', 'city=New York', 'city=San Francisco']

Can we represent the values of a categorical explanatory variable with a single integer feature?
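Generally, no: a single integer encoding imposes an artificial order and spacing on nominal categories, which a linear model would treat as meaningful magnitudes. A minimal sketch of the problem using scikit-learn's LabelEncoder (which simply assigns integers to the categories in alphabetical order):

>>> from sklearn.preprocessing import LabelEncoder
>>> le = LabelEncoder()
>>> le.fit_transform(['New York', 'San Francisco', 'Chapel Hill'])
array([1, 2, 0])

Under this encoding, Chapel Hill < New York < San Francisco, and San Francisco is "twice" New York, none of which means anything for a nominal variable. One-hot encoding avoids introducing such spurious structure.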
Extracting features from text

Many machine learning problems use text as an explanatory variable. Text must be transformed to a different representation that encodes as much of its meaning as possible in a feature vector. In the following sections we will review variations of the most common representation of text used in machine learning: the bag-of-words model.

The bag-of-words representation

The most common representation of text is the bag-of-words (詞袋) model. This representation uses a multiset, or bag, that encodes the words that appear in a text; the bag-of-words does not encode any of the text's syntax, ignores the order of words, and disregards all grammar. Bag-of-words can be thought of as an extension to one-hot encoding: it creates one feature for each word of interest in the text. The bag-of-words model is motivated by the intuition that documents containing similar words often have similar meanings. Despite the limited information that it encodes, the bag-of-words model can be used effectively for document classification and retrieval.

The sklearn.feature_extraction.text submodule gathers utilities to build feature vectors from text documents:
- feature_extraction.text.CountVectorizer: converts a collection of text documents to a matrix of token counts
- feature_extraction.text.HashingVectorizer: converts a collection of text documents to a matrix of token occurrences
- feature_extraction.text.TfidfTransformer: transforms a count matrix to a normalized tf or tf-idf representation
- feature_extraction.text.TfidfVectorizer: converts a collection of raw documents to a matrix of TF-IDF features

CountVectorizer produces a sparse representation of the counts using scipy.sparse.csr_matrix. If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection, then the number of features will be equal to the vocabulary size found by analyzing the data.
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = ['UNC played Duke in basketball', 'Duke lost the basketball game']
>>> vectorizer = CountVectorizer(binary=True)
>>> print(vectorizer.fit_transform(corpus).todense())
[[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 0 1 0]]
>>> print(vectorizer.vocabulary_)
{'unc': 7, 'in': 3, 'the': 6, 'lost': 4, 'played': 5, 'basketball': 0, 'duke': 1, 'game': 2}

Adding a third document to the corpus grows the vocabulary:

>>> corpus = ['UNC played Duke in basketball', 'Duke lost the basketball game', 'I ate a sandwich']
>>> vectorizer = CountVectorizer(binary=True)
>>> print(vectorizer.fit_transform(corpus).todense())
[[0 1 1 0 1 0 1 0 0 1]
 [0 1 1 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0]]
>>> print(vectorizer.vocabulary_)
{'unc': 9, 'in': 4, 'the': 8, 'lost': 5, 'sandwich': 7, 'played': 6, 'basketball': 1, 'duke': 2, 'game': 3, 'ate': 0}

Now, our feature vectors are as follows:

UNC played Duke in basketball = [[0 1 1 0 1 0 1 0 0 1]]
Duke lost the basketball game = [[0 1 1 1 0 1 0 0 1 0]]
I ate a sandwich              = [[1 0 0 0 0 0 0 1 0 0]]

The meanings of the first two documents are more similar to each other than they are to the third document, and their corresponding feature vectors are more similar to each other than they are to the third document's feature vector when using a metric such as Euclidean distance, which sklearn.metrics.pairwise.euclidean_distances computes:

>>> from sklearn.metrics.pairwise import euclidean_distances
>>> counts = [[0, 1, 1, 0, 1, 0, 1, 0, 0, 1],
...           [0, 1, 1, 1, 0, 1, 0, 0, 1, 0],
...           [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]]
>>> print('Distance between 1st and 2nd documents:', euclidean_distances([counts[0]], [counts[1]]))
Distance between 1st and 2nd documents: [[2.44948974]]
>>> print('Distance between 1st and 3rd documents:', euclidean_distances([counts[0]], [counts[2]]))
Distance between 1st and 3rd documents: [[2.64575131]]
>>> print('Distance between 2nd and 3rd documents:', euclidean_distances([counts[1]], [counts[2]]))
Distance between 2nd and 3rd documents: [[2.64575131]]

For real applications: high-dimensional feature vectors

The first problem is that high-dimensional vectors require more memory than smaller vectors. The second problem is known as the curse of dimensionality (維數(shù)災(zāi)難/維度詛咒), or the Hughes effect. As the feature space's dimensionality increases, more training data is required to ensure that there are enough training instances with each combination of the features' values. If there are insufficient training instances for a feature, the algorithm may overfit noise in the training data and fail to generalize.

Stop-word filtering

A basic strategy for reducing dimensionality is to remove words that are common to most of the documents in the corpus. These words, called stop words, include determiners such as "the", "a", and "an"; auxiliary verbs such as "do", "be", and "will"; and prepositions such as "on", "around", and "beneath". Stop words are often functional words that contribute to the document's meaning through grammar rather than their denotations. The CountVectorizer class can filter stop words provided as the stop_words keyword argument, and it also includes a basic English stop list.

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = ['UNC played Duke in basketball', 'Duke lost the basketball game', 'I ate a sandwich']
>>> vectorizer = CountVectorizer(binary=True, stop_words='english')
>>> print(vectorizer.fit_transform(corpus).todense())
[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]
>>> print(vectorizer.vocabulary_)
{'unc': 7, 'lost': 4, 'sandwich': 6, 'played': 5, 'basketball': 1, 'duke': 2, 'game': 3, 'ate': 0}
Stemming and lemmatization

While stop filtering is an easy strategy for dimensionality reduction, most stop lists contain only a few hundred words. A large corpus may still have hundreds of thousands of unique words after filtering. Two similar strategies for further reducing dimensionality are called stemming and lemmatization. Stemming (詞干提取) extracts the stem or root form of a word, which does not necessarily express a complete meaning on its own. Lemmatization (詞形還原) reduces an inflected word of any form to its base dictionary form, which does express a complete meaning. We can use the Natural Language Toolkit (NLTK) to stem and lemmatize the corpus.
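A minimal sketch with NLTK (this assumes the WordNet corpus has been downloaded, e.g., via nltk.download('wordnet')):

>>> from nltk.stem import PorterStemmer
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> PorterStemmer().stem('gathering')                 # the stem need not be a dictionary word
'gather'
>>> WordNetLemmatizer().lemmatize('gathering', 'v')   # 'gathering' as a verb
'gather'
>>> WordNetLemmatizer().lemmatize('gathering', 'n')   # as a noun it is already a lemma
'gathering'

Note that lemmatization needs the word's part of speech, while stemming simply strips affixes.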
Extending bag-of-words with TF-IDF (詞頻-逆文檔頻率) weights

Instead of using a binary value for each element in the feature vector, we will now use an integer that represents the number of times that the word appeared in the document. Several variants of term frequency mitigate the effect of document length: normalized term frequency, logarithmically scaled term frequency (對數(shù)詞頻調(diào)整方法), and augmented term frequency (詞頻放大法). These can represent the frequencies of terms in a document while mitigating the effects of different document sizes. However, another problem remains with these representations: the feature vectors contain large weights for terms that occur frequently in a document, even if those terms occur frequently in most documents in the corpus. The inverse document frequency (IDF) down-weights such corpus-wide common terms, and a term's TF-IDF value is the product of its term frequency and its inverse document frequency.
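For reference, a common set of definitions for these weights is sketched below. This is one standard textbook convention: f(t, d) denotes the count of term t in document d, x denotes d's count vector, and N the number of documents in the corpus D; scikit-learn's own implementation differs slightly, using the smoothed variant idf(t) = ln((1 + N) / (1 + df(t))) + 1 by default.

\mathrm{tf}(t,d) = \frac{f(t,d)}{\lVert x \rVert} \quad \text{(normalized term frequency)}

\mathrm{tf}(t,d) = \log\big(f(t,d) + 1\big) \quad \text{(logarithmically scaled term frequency)}

\mathrm{tf}(t,d) = 0.5 + \frac{0.5\, f(t,d)}{\max\{f(w,d) : w \in d\}} \quad \text{(augmented term frequency)}

\mathrm{idf}(t,D) = \log\frac{N}{1 + \lvert\{d \in D : t \in d\}\rvert} \qquad \mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t,D)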
The TfidfVectorizer class wraps CountVectorizer and TfidfTransformer:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = ['The dog ate a sandwich and I ate a sandwich', 'The wizard transfigured a sandwich']
>>> vectorizer = TfidfVectorizer(stop_words='english')
>>> print(vectorizer.fit_transform(corpus).todense())
[[0.75458397 0.37729199 0.53689271 0.         0.        ]
 [0.         0.         0.44943642 0.6316672  0.6316672 ]]
>>> print(vectorizer.vocabulary_)
{'sandwich': 2, 'dog': 1, 'transfigured': 3, 'ate': 0, 'wizard': 4}

TF-IDF features are typically fed to a machine learning classifier. More recent, deep-learning-based approaches to text classification include:
- FastText: averages the word and N-gram vectors of an entire document to obtain a document vector, then applies a softmax multi-class classifier to it. It relies on two tricks: character-level N-gram features and hierarchical softmax classification.
- Word2Vec: one of the word embedding methods, proposed by Mikolov's team at Google in 2013. Because Word2Vec takes context into account, it performs better than earlier embedding methods (though not as well as the methods that appeared after 2018).
- BERT (Bidirectional Encoder Representations from Transformers): a language representation model proposed by Google in October 2018 in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"; it set new state-of-the-art results on 11 different NLP benchmarks.

Extracting features from images

Computer vision is the study and design of computational artifacts that process and understand images. These artifacts sometimes employ machine learning. An overview of computer vision is far beyond the scope of this course, but in this section we will review some basic techniques used in computer vision to represent images in machine learning problems.

Extracting features from pixel intensities

The digits dataset included with scikit-learn contains grayscale images of more than 1,700 hand-written digits between zero and nine. Each image has eight pixels on a side, and each pixel is represented by an intensity value between zero and 16, where zero indicates a white (blank) pixel and 16 indicates the darkest, black pixel. A basic feature representation for an image can be constructed by reshaping the 8 x 8 matrix into a 64-dimensional vector by concatenating its rows together, as in the sketch below.
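A short sketch of this representation (the first image in the dataset happens to be a zero):

>>> from sklearn import datasets
>>> digits = datasets.load_digits()
>>> print('Digit:', digits.target[0])
Digit: 0
>>> digits.images[0].shape                   # the 8x8 matrix of pixel intensities
(8, 8)
>>> digits.images[0].reshape(-1, 64).shape   # rows concatenated into one feature vector
(1, 64)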
This representation has several drawbacks:
- It produces large feature vectors.
- It is sensitive to changes in the scale, rotation, and translation of images.
- Furthermore, learning from pixel intensities is itself problematic, as the model can become sensitive to changes in illumination.
Modern computer vision applications frequently use either hand-engineered feature extraction methods that are applicable to many different problems, or automatically learn features without supervision using techniques such as deep learning.

Extracting points of interest as features

Humans can quickly recognize many objects without observing every attribute of the object. This intuition motivates creating representations of only the most informative attributes of an image. These informative attributes, or points of interest, are points that are surrounded by rich textures and can be reproduced despite perturbing the image. Edges and corners are two common types of points of interest. Let's use scikit-image to extract points of interest from an image, as in the sketch below.
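A minimal sketch using scikit-image's Harris corner detector on one of the library's built-in sample images:

import matplotlib.pyplot as plt
from skimage import data
from skimage.feature import corner_harris, corner_peaks

image = data.camera()  # built-in grayscale test image
# Compute the Harris corner response, then keep local maxima at least 5 pixels apart
corners = corner_peaks(corner_harris(image), min_distance=5)

plt.imshow(image, cmap='gray')
plt.scatter(corners[:, 1], corners[:, 0], marker='+')  # coordinates are (row, col)
plt.show()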
Classic keypoint detectors:
- HARRIS - 1988 Harris Corner Detector
- Shi, Tomasi - 1996 Good Features to Track (Shi, Tomasi)
- SIFT - 1999 Scale Invariant Feature Transform (Lowe)
- SURF - 2006 Speeded Up Robust Features

Modern keypoint detectors:
- FAST - 2006 Features from Accelerated Segment Test
- BRIEF - 2010 Binary Robust Independent Elementary Features
- ORB - 2011 Oriented FAST and Rotated BRIEF
- BRISK - 2011 Binary Robust Invariant Scalable Keypoints
- FREAK - 2012 Fast Retina Keypoint
- KAZE - 2012 KAZE
SIFT and SURF

Scale-Invariant Feature Transform (SIFT) (尺度不變特征轉(zhuǎn)換) is a method for extracting features from an image that is less sensitive to the scale, rotation, and illumination of the image than the extraction methods we have previously discussed. Each SIFT feature, or descriptor, is a vector that describes edges and corners in a region of an image. Unlike the points of interest in our previous example, SIFT also captures information about the composition of each point of interest and its surroundings.

Speeded-Up Robust Features (SURF) (加速穩(wěn)健特征) is another method of extracting interesting points of an image and creating descriptions that are invariant to the image's scale, orientation, and illumination. SURF can be computed more quickly than SIFT, and it is more effective at recognizing features across images that have been transformed in certain ways.

Like the extracted points of interest, the extracted SIFT (or SURF) descriptors are only the first step in creating a feature representation that could be used in a machine learning task. A sketch of computing SIFT descriptors follows.
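A minimal sketch, assuming OpenCV 4.4 or later (where SIFT is included in the main opencv-python package) and a hypothetical image file:

import cv2

image = cv2.imread('image.jpg', cv2.IMREAD_GRAYSCALE)  # hypothetical path
sift = cv2.SIFT_create()
# Detect keypoints and compute one 128-dimensional descriptor per keypoint
keypoints, descriptors = sift.detectAndCompute(image, None)
print(len(keypoints), descriptors.shape)  # N keypoints -> an (N, 128) descriptor matrix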
Data normalization (數(shù)據(jù)標(biāo)準(zhǔn)化/歸一化方法)

Preprocessing data with sklearn: standardization, normalization, and regularization.

1. Standardization (Z-score), i.e., removing the mean and scaling by the variance. The formula is (X - mean) / std, computed independently for each attribute (each column): the column's mean is subtracted from the data, and the result is divided by the column's standard deviation. Consequently, for each attribute (column), the data are centered around 0 with a variance of 1. There are two ways to implement this:
- The sklearn.preprocessing.scale() function standardizes given data directly.
- The sklearn.preprocessing.StandardScaler class has the advantage of saving the parameters learned from the training set (mean, variance), so the fitted object can be used directly to transform test data.
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
>>> X_scaled.mean(axis=0)
array([0., 0., 0.])
>>> X_scaled.std(axis=0)
array([1., 1., 1.])
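A minimal sketch of the StandardScaler workflow, continuing from the arrays above: the scaler learns the mean and standard deviation of the training data, and the same fitted object then transforms new data with those parameters:

>>> scaler = preprocessing.StandardScaler().fit(X)
>>> scaler.mean_
array([1.        , 0.        , 0.33333333])
>>> scaler.transform(X)                   # same result as preprocessing.scale(X)
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
>>> scaler.transform([[-1.,  1., 0.]])    # apply the training set's parameters to new data
array([[-2.44...,  1.22..., -0.26...]])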