




PRIVACY ENHANCING TECHNOLOGY (PET):
PROPOSED GUIDE ON SYNTHETIC DATA GENERATION
Published 15 July 2024
Version Number 1.0
TABLE OF CONTENTS
I. Introduction to Privacy Enhancing Technology (PET)
II. Synthetic Data
  What is Synthetic Data?
  Under What Circumstances is Synthetic Data Useful?
  Case Studies
III. Recommendations
Annex A: Handbook on Key Considerations and Best Practices in Synthetic Data Generation
Annex B: Data Dictionary Format
Annex C: Examples of Methods of Synthetic Data Generation
Annex D: Re-identification Risks
Annex E: Examples of Approaches in Evaluation of Re-identification Risks
ACKNOWLEDGEMENTS
I. Introduction to Privacy Enhancing Technology (PET)
Privacy Enhancing Technologies (PETs) are a suite of tools and techniques that allow the processing, analysis, and extraction of insights from data without revealing the underlying personal or commercially sensitive data. By incorporating PETs, companies can maintain a competitive edge in the market through leveraging their existing data assets for innovation while complying with data protection regulations, reducing the risk of data breaches and demonstrating a commitment to data protection. PETs are not just a defensive measure; they are a proactive step towards fostering a culture of data protection and securing a company's reputation in the digital age.
PETs can generally be classified into three key categories [1]: data obfuscation, encrypted data processing, and federated analytics. PETs can also be combined to address varying needs of organisations. Table 1 maps out the current types of PETs in the market and their key applications.
Table 1. Types of PETs and their applications
Columns: Category of PETs; PETs; Examples of applications (non-exhaustive)
Data obfuscation
  Anonymisation/pseudonymisation techniques
  • Secure storage
  • Data sharing and retention
  • Software testing
  Synthetic data generation
  • Privacy-preserving AI/machine learning
  • Data sharing and analysis
  • Software testing
  Differential privacy
  • Expanding research opportunities
  • Data sharing
  Zero knowledge proofs
  • Verifying information without requiring disclosure (e.g., age verification)
Encrypted data processing
  Homomorphic encryption
  • Secure data stored in cloud
  • Computing on private data that is not disclosed
  Multi-party computation (including private set intersection)
  • Computing on private data that is not disclosed
  Trusted execution environments
  • Computing using models that need to remain private
  • Computing on private data that is not disclosed
Federated analytics
  Federated learning
  • Privacy-preserving AI/machine learning
  Distributed analysis
[1] Adapted from OECD, "Emerging Privacy Enhancing Technologies: Current Regulatory and Policy Approaches," OECD Digital Economy Papers (OECD, 2023).
II. Synthetic Data
This guide focuses on the use of synthetic data [2] to generate structured data. While synthetic data is generally fictitious data that may not be considered personal data on its own, it is not inherently risk-free due to possible re-identification risks [3]. As such, this guide proposes good practices that organisations may adopt to generate synthetic data to minimise such risks for a set of common use case archetypes. The guide also includes a set of good practices and risk assessments/considerations for generating synthetic data as well as governance controls, contractual process, and technical measures to mitigate residual risks.
The target audience for this guide comprises CIOs, CTOs, CDOs, data scientists, data protection practitioners, and technical decision-makers who may directly or indirectly be involved in the generation and use of synthetic data.
Synthetic data is a technology that is being actively researched and developed at the time of publication. Hence, this guide is not intended to provide a comprehensive or in-depth review of the technology or its assessment methods. The guide is intended to be a living document, and will be updated to ensure its recommendations remain relevant.
[2] There are two types of synthetic data: fully synthetic data and partially synthetic data. This guide discusses the use of fully synthetic data.
[3] In this guide, we generally refer to privacy risks as re-identification risks.
What is Synthetic Data?
Synthetic data is commonly referred to as artificial data that has been generated using a purpose-built mathematical model (including artificial intelligence (AI)/machine learning (ML) models) or algorithm. It can be derived by training a model (or algorithm) on a source dataset to mimic the characteristics and structure of the source data. Good quality synthetic data can retain the statistical properties and patterns of the source data to a high extent. As a result, performing analysis on synthetic data can produce results similar to those yielded with source data.
Characteristics of synthetic data
Figure 1 shows an example of what synthetic data may look like compared with the source data. Generated synthetic data will generally have different data points from the source data, as seen from the tabular data. However, the synthetic data will have statistical properties that are close to those of the source data, i.e., capturing the distribution and structure of the source data as seen from the trend lines in Figure 1.
Figure 1: Source data versus synthetic data [4]
As such, synthetic data may not always be inherently risk-free, as information about an individual in the source dataset, or confidential data, can still be leaked due to the resemblance of the synthetic data to the source data. There will also be trade-offs [5] between data utility and data protection risks in synthetic data generation. However, such risks can be minimised by taking data protection into consideration during the synthetic data generation process.
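The idea that synthetic data has different data points but similar statistical properties can be illustrated with a deliberately simple sketch. It assumes a single numeric column and a Gaussian model fitted by mean and standard deviation; the function names, the example column, and the choice of model are illustrative assumptions, not methods prescribed by this guide (see Annex C for the methods the guide discusses).

```python
import random
import statistics

def fit_gaussian(values):
    """Fit a very simple model: mean and standard deviation of the source column."""
    return statistics.fmean(values), statistics.stdev(values)

def generate_synthetic(model, n, seed=0):
    """Draw n new values from the fitted model; no source value is copied."""
    mean, stdev = model
    rng = random.Random(seed)
    return [rng.gauss(mean, stdev) for _ in range(n)]

# Hypothetical source column (e.g. ages); the values are simulated for illustration.
source_rng = random.Random(42)
source = [source_rng.gauss(40.0, 12.0) for _ in range(2000)]

model = fit_gaussian(source)
synthetic = generate_synthetic(model, n=2000, seed=1)

print("source mean/stdev:   ", round(statistics.fmean(source), 2),
      round(statistics.stdev(source), 2))
print("synthetic mean/stdev:", round(statistics.fmean(synthetic), 2),
      round(statistics.stdev(synthetic), 2))
print("values shared with source:", len(set(source) & set(synthetic)))
```

The summary statistics of the two columns should be close while the individual values differ. A model this simple also shows the trade-off noted above: the closer the model follows the source data, the more it can reveal about it.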
[4] Diagram taken with modification from Khaled El Emam, Lucy Mosquera, and Richard Hoptroff, Practical Synthetic Data Generation (O'Reilly Media, Inc, 2020).
[5] The trade-off between data utility and data protection risks is further discussed in Annex A: Step 1 and Step 3 in this guide.
Under What Circumstances is Synthetic Data Useful?
Synthetic data can be used in a variety of use cases ranging from generating training datasets for AI models to data analysis and collaboration. The use of synthetic data can not only accelerate research, innovation, collaboration, and decision-making but also mitigate concerns about cybersecurity incidents and data breaches, enabling better compliance with data protection/privacy regulations. Table 2 discusses a few common use case archetypes, their key benefits, and good practices that organisations can focus on when generating synthetic data.
Table 2. Use case archetypes for synthetic data
Columns: Types of Use Cases; Key Benefits; Good Practices to Generate Synthetic Data

Use case archetype 1: Generating training dataset for AI models
Type of use case: Augmenting data for AI/ML models
Key benefits:
• Synthetic data addresses the difficulty of obtaining the large volumes of labelled data needed for training and testing AI/ML models, which arises from costs, legal regulations, and proprietary rights.
• Augmenting training datasets with synthetically generated labelled data can be more cost-effective, especially when the source datasets are sparse.
Type of use case: Increasing data diversity for AI/ML models
Key benefits:
• Synthetic data can be used to simulate rare events or augment under-represented groups in training AI models.
• Diverse datasets can be useful in improving the performance of AI/ML models.
Good practices to generate synthetic data:
• Add noise* to or reduce granularity of the synthetic data points.
• Such fictitious new data points will generally not be considered personal data.
* If the statistical properties/characteristics of the synthetic data are representative of the population in question and not significantly skewed towards a specific individual/group of individuals used as source training data, adding noise might not be necessary as re-identification risks are generally low.

Use case archetype 2: Data analysis and collaboration
Type of use case: Data sharing and analysis
Key benefits:
• Underlying trends or patterns, and biases of the data are useful for data analytics regardless of whether the data source is real or synthetic.
• Synthetic data can enable data sharing for analysis, especially in industries and sectors (e.g., healthcare) where the source data can be sensitive.
Type of use case: Previewing data for collaboration
Key benefits:
• Synthetic data can be used in data exploration, analysis, and collaboration to provide stakeholders with a representative preview of the source data without exposing sensitive information.
• This enables stakeholders to explore and understand the structure, relationships, and potential insights within the data to gain assurance of the data quality before finalising any agreement or collaboration.
Good practices to generate synthetic data:
• Balance the trade-offs between data utility and data protection by incorporating data protection measures throughout the synthetic data generation process, for example:
  Data preparation
  • Remove outliers from source data
  • Pseudonymise source data
  • Employ data minimisation and generalise granular data
  Synthetic data generation
  • Add noise before or after synthetic data generation
  Post synthetic data generation
  • Incorporate technical, contractual, and governance measures to mitigate any residual re-identification risks

Use case archetype 3: Software testing
Type of use case: System development/software testing
Key benefits:
• Organisations can use synthetic data instead of production data to facilitate software development.
• Use of synthetic data can help organisations avoid data breaches in the event of the development environment being compromised.
Good practices to generate synthetic data:
• Focus on generating synthetic data that follows the semantics (e.g., format, min/max values and categories) of the source data instead of its statistical characteristics and properties.

Refer to Annex A for proposed considerations and good practices to generate synthetic data.
Case Studies
(A) Training AI model for fraud detection in the financial sector [6]
Problem: Since the number of fraudulent transactions in the source data is small compared to normal, non-fraudulent transactions, the source data did not train models very well for fraud detection.
Solution: J.P. Morgan successfully used synthetic data for fraud detection model training. AI models were provided with samples of normal and fraudulent transactions to understand the tell-tale signs of suspicious transactions.
Benefit: Synthetic data proved to be more effective in terms of training models to detect anomalous behaviour. This is because the synthetic data used was designed to contain a higher percentage of fraudulent transactions.
(B) Training AI model for research into AI bias [7]
Problem: Multi-label classification and regression models are frequently utilised at Mastercard for various applications, including fraud prevention, anti-money laundering and marketing use cases for portfolio optimisation. These models, while powerful, could learn unintended biases and therefore require careful attention to proxies of demographic attributes within their training data. Ensuring the accuracy and fairness of these models is complex due to their multi-label setting, the confidentiality of the demographic attributes, and the challenges in accessing the training dataset for model development.
Solution: Mastercard partnered with researchers to develop new AI bias testing methods adapted to multi-label settings. To protect the privacy of the data shared externally, synthetic data was created to support model training and methodological research into fair multi-label models.
Benefit: Synthetic data was measured to be sufficiently private to be shared with external researchers while capturing real relationships within the source data. Synthetic data enabled new insights that would not have been possible without the privacy-protecting characteristics inherent to synthetic data.
[6] J.P. Morgan, "Synthetic Data for Real Insights," Technology Blog, n.d., /technology/technology-blog/synthetic-data-for-real-insights
[7] Contributed by Mastercard
(C) Safeguarding patient data for data analysis [8]
Problem: Prior to utilising synthetic data, Johnson & Johnson (J&J) allowed external researchers or consortia to access healthcare data for research proposals validated by J&J. To safeguard patient privacy, the data was transformed into anonymised healthcare data. However, feedback received indicated that the overall usefulness of the anonymised data, which relied on traditional anonymisation techniques, was not always satisfactory and did not always meet the requirements of the researchers or consortia.
Solution: J&J has introduced high-quality AI-generated synthetic data as an additional option to process their healthcare data.
Benefit: Researchers and clients have experienced significantly improved analysis. When employed properly, this form of synthetic data can effectively represent the target population and offer various analytical and scientific benefits.
(D) Facilitating data collaboration [9]
Problem: A pharmaceutical company wanted to purchase heart-related health data from a research institute to test out a new hypothesis. The health data, which was collected by the research institute from consenting subjects, was hosted under a highly regulated environment as required of the healthcare sector. However, this presented significant challenges for many data engagement activities.
Solution: A*STAR was engaged by the pharmaceutical company to build a pipeline to create synthetic copies of the actual data, which could then be brought outside of this regulated environment.
Benefit: This allowed the pharmaceutical company to preview the data and be assured of the data quality prior to the high-value purchase of, and access to, the actual data.
[8] Contributed by Johnson & Johnson (J&J)
[9] Contributed by A*STAR
III. Recommendations
Synthetic data has the potential to drive the growth of AI/ML by enabling AI model training while protecting the underlying personal data. It also addresses dataset-related challenges for AI model training, such as insufficient and biased data, by enabling the augmentation and increased diversity of training datasets.
In addition, synthetic data can be used to facilitate and support organisations' data analytics, collaboration and software development needs. An added benefit of using synthetic data in place of production data to facilitate software development is that data breaches can be avoided in the event the development environment is compromised.
PDPC recommends a set of good practices and risk assessments/considerations for generating synthetic data, and for reducing any residual risks from re-identification through governance controls, contractual process, and technical measures (refer to Annex A).
Annex A: Handbook on Key Considerations and Best Practices in Synthetic Data Generation
In this handbook, we describe the key considerations and best practices for organisations to reduce re-identification risks of synthetic tabular data through a five-step approach.
For any other complex synthetic datasets that are unstructured, organisations are advised to consider hiring synthetic data experts, data scientists or independent risk assessors to assess and mitigate the risks of the generated synthetic data.
Overview of five-step approach to generate synthetic data
Step 1: Know your data
Before embarking on any synthetic data project, it is necessary to have a clear understanding of the purpose and use cases of the synthetic data and the source data that the synthetic data is to mimic. This will help to determine whether use of synthetic data might be relevant and identify the possible risks of using the synthetic data. Some of the considerations may include:
• Where general trends/insights of source data are sensitive, organisations should take note that the use of synthetic data will not offer any protection to the trends/insights since they will be replicated in the synthetic data.
• Where the synthetic data is intended to be released publicly, organisations may have to prioritise data protection over data utility.
• Where relevant, organisations should also put in place proper contractual obligations on recipients of synthetic data to prevent re-identification attacks on the data.
With this knowledge, the management and data owner, with the help of relevant stakeholders such as the data analytics team, should establish objectives prior to synthetic data generation to determine an acceptable risk threshold [10] of the generated synthetic data and the expected utility of the data. This will help provide organisations with the appropriate benchmarks to assess any trade-offs between data protection risks and data utility.
These benchmarks may be adjusted appropriately to meet the business objectives, taking into consideration any trade-offs between data utility and data protection risks after the synthetic data generation process, as well as safeguards and controls to mitigate or lower any residual risks posed by the generated synthetic data. The acceptance criteria should be incorporated into the organisation's risk assessments (e.g., enterprise risk management framework [11] if applicable) or a Data Protection Impact Assessment ("DPIA") [12].
Step 2: Prepare your data
When preparing the source data [13] for generating synthetic data, it is important to consider the following:
• What are the key insights that need to be preserved in the synthetic data?
• Which data attributes are necessary for the synthetic data to meet the business objectives?
[10] The re-identification risk threshold represents the level of re-identification risk that is acceptable for a given synthetic dataset. There is currently no universally accepted numerical value for risk threshold. For further details refer to Step 4 (Assess re-identification risks).
[11] Organisations may refer to ISO 27001 for more information on developing an enterprise risk management framework.
[12] An example of this is PDPC's Guide to Data Protection Impact Assessments. A DPIA is applicable in the case where personal data is involved. The DPIA may not be relevant in situations where the synthetic data generation does not involve personal data processing.
[13] This step assumes that the source data has been properly cleaned (such as fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data) and is of acceptable quality for the generation of synthetic data.
Understanding key insights to be preserved
To ensure that the synthetic data can meet the business objectives, organisations need to understand and identify the trends, key statistical properties, and attribute relationships in the source data that need to be preserved for analysis, e.g., relationships between the demographic characteristics of a population and their health conditions.
Organisations should consider, at this point, whether outlier trends and insights need to be preserved for the business objectives. Key considerations could include the following:
• If outliers are not necessary to meet the business objectives and the risk of re-identification is high, organisations should consider removing the outliers. This can be done prior to synthetic data generation or at subsequent stages of the synthetic data generation.
• If the objective is to mimic the characteristics of the source data as closely as possible, including outliers, then the organisation may have to preserve the outlier trend/insight to meet the business objectives. In such instances, the organisation should note that the re-identification risks of individuals in the outlier data may be high and hence put in place risk mitigation measures.
• If the business objective is to balance the number of data points in different data categories, then the synthetic data generation process itself can help mitigate the issue of outliers simply by generating more outliers. For example, in a dataset, the number of outlier data points comprising male individuals may be balanced with outlier data points comprising female individuals.
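Where outliers are not needed, removing them before generation could look like the minimal sketch below. It assumes a single numeric attribute and uses the interquartile range (IQR) rule with a multiplier of 1.5; the rule, the multiplier, and the example income values are assumptions for illustration, since the guide does not prescribe a particular outlier definition.

```python
import statistics

def remove_outliers_iqr(values, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is a common convention."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    removed = [v for v in values if not (low <= v <= high)]
    return kept, removed

# Hypothetical monthly incomes; one record is far from the rest.
incomes = [3200, 3500, 3700, 4000, 4100, 4300, 4500, 4800, 5200, 98000]
kept, removed = remove_outliers_iqr(incomes)
print("removed:", removed)  # removed: [98000]
print("kept:", len(kept), "records")
```

Reviewing the removed records before discarding them helps confirm that they are not needed for the business objectives.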
Selecting data attributes
Based on the key insights needed, organisations should apply data minimisation to extract only the relevant data attributes from the source data. Thereafter, remove or pseudonymise all direct identifiers [14] from the extracted data.
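A minimal sketch of these two actions is given below, assuming records are held as dictionaries. The attribute names, the example record, and the use of a keyed hash (HMAC-SHA256) as the pseudonymisation method are illustrative assumptions; the guide does not mandate a specific technique.

```python
import hashlib
import hmac

# Hypothetical attribute choices for illustration only.
NEEDED_ATTRIBUTES = ["patient_id", "age", "condition"]
IDENTIFIERS_TO_PSEUDONYMISE = ["patient_id"]

def pseudonymise(value, secret_key):
    """Replace an identifier with a keyed hash, truncated for readability."""
    digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def prepare_record(record, secret_key):
    """Keep only the needed attributes, then pseudonymise retained identifiers."""
    minimised = {name: record[name] for name in NEEDED_ATTRIBUTES}
    for name in IDENTIFIERS_TO_PSEUDONYMISE:
        minimised[name] = pseudonymise(minimised[name], secret_key)
    return minimised

# Placeholder key; in practice the key would be generated and stored separately.
key = b"replace-with-a-secret-key-kept-separately"
record = {"patient_id": "ID-0001", "name": "Example Name", "phone": "00000000",
          "age": 54, "condition": "hypertension"}
print(prepare_record(record, key))
```

Attributes that are not needed (here the name and phone number) are dropped entirely, while the retained identifier is replaced by a value that cannot be reversed without the key.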
Where granular information is not necessary, organisations may generalise or further add noise to the data at this point or at a later step to reduce the risk of re-identification. For example, organisations can generalise exact height and weight information into height and weight bands to reduce the possibility of height and weight combinations being used to identify any outliers.
[14] Refer to PDPC's Guide to Basic Anonymisation on how to identify direct identifiers in a dataset.
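The height and weight example can be sketched as follows. The band width of 10 units and the attribute names are assumptions chosen for illustration; an appropriate width depends on the data and the business objectives.

```python
def to_band(value, width):
    """Map a number to a band label covering [low, low + width)."""
    low = int(value // width) * width
    return f"{low}-{low + width - 1}"

def generalise(record, height_width=10, weight_width=10):
    """Return a copy of the record with exact height and weight replaced by bands."""
    out = dict(record)
    out["height_cm"] = to_band(record["height_cm"], height_width)
    out["weight_kg"] = to_band(record["weight_kg"], weight_width)
    return out

print(generalise({"height_cm": 173, "weight_kg": 68.4}))
# {'height_cm': '170-179', 'weight_kg': '60-69'}
```

Wider bands lower the chance that a height and weight combination singles out an individual, at the cost of less detailed data.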
Organisations should also standardise and document the details on each data attribute (such as data definitions, standards, metrics, etc.) in a data dictionary. This enables the organisation to subsequently validate the integrity of the generated synthetic data to detect anomalies and fix any data inconsistencies. Refer to the following checklist in Table 3 for key considerations.
Table 3: Checklist for data preparation
Data Preparation Checklist
Understand key insights
i. Identify trends and entity relationships to be preserved for synthetic data generation.
ii. Remove outliers if such trends/insights are not necessary. This can be performed post generation.
Select data attributes
iii. Apply data minimisation to select only data attributes that are necessary to meet business needs.
iv. Remove or pseudonymise direct identifiers (e.g., name, national identification numbers).
v. Generalise granular data or add noise (e.g., using differential privacy [15]) to the data/model if such detailed information is not necessary. This can also be performed post generation.
vi. Standardise and document format, constraints, and categories of source data in data dictionary (refer to Annex B for a reference template):
  Format
  • Standardise strings to lower or proper case
  • Data types, column names, structures, relationships
  • Frequency of data record
  Constraints
  • Constraints of values for each data type, e.g., min-max values, non-negative values, non-null values
  Category
  • Types of data categories
  • Expected or valid values for data attributes within each data category. Example of a data category is "country".
[15] The use of differential privacy to add noise to synthetic data is widely discussed as a mechanism to reduce re-identification risks. However, there is currently no universal standard on how to implement differential privacy. Moreover, the noise added may also reduce the utility of the synthetic data, making it less accurate or useful for certain types of analysis.
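As one illustration of adding noise, the sketch below applies the Laplace mechanism, a textbook differential privacy technique, to a table of counts. The privacy parameter epsilon, the assumption that each individual contributes to exactly one count (sensitivity of 1), and the example categories are all assumptions for illustration. This is not a recommended or production-grade implementation, and Python's `random` module is not a cryptographically secure source of randomness.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_counts(counts, epsilon, rng):
    """Add Laplace noise of scale 1/epsilon to each count (sensitivity assumed 1)."""
    scale = 1.0 / epsilon
    return {key: value + laplace_noise(scale, rng) for key, value in counts.items()}

# Hypothetical counts of records per category.
counts = {"condition_a": 120, "condition_b": 45, "condition_c": 3}
rng = random.Random(0)
print(noisy_counts(counts, epsilon=1.0, rng=rng))
```

A smaller epsilon means more noise and stronger protection but lower utility, which reflects the trade-off described in footnote 15.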
Step 3: Generate synthetic data
There are many different methods [16] to generate synthetic data, for example, sequential tree-based synthesisers, copulas, and deep generative models (DGMs). Organisations need to consider which methods are most appropriate, based on their use cases, data objectives, and types of data. Please refer to Annex C for more information on these synthetic data generation methods. Thereafter, organisations may consider splitting the source data into two separate sets, e.g., 80% as training dataset and 20% as control dataset [17], for assessing re-identification risks of the synthetic data.
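The split described above can be sketched as follows, assuming the source data is a list of records. The function name and the fixed random seed are assumptions for illustration.

```python
import random

def split_source(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records, then split into training and control sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = [{"id": i} for i in range(10)]
train, control = split_source(records)
print(len(train), len(control))  # 8 2
```

Only the training set would be used to fit the generator, so that the control set stays unseen and can serve as a baseline when assessing re-identification risks.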
After generating synthetic data, it is a good practice for organisations to perform the following checks on the quality of the generated synthetic data:
• Data integrity
• Data fidelity
• Data utility
Data integrity
Data integrity ensures the accuracy, completeness, consistency, and validity of the synthetic data as compared with the source data. Organisations can validate the integrity of the generated synthetic data against the data dictionary of the source data.
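A validation of this kind could be sketched as below. The dictionary layout, attribute names, and rules are hypothetical and chosen for illustration; Annex B provides the guide's reference template for a data dictionary.

```python
# Hypothetical data dictionary: attribute -> type, constraints, valid categories.
DATA_DICTIONARY = {
    "age": {"type": int, "min": 0, "max": 120, "nullable": False},
    "country": {"type": str, "categories": {"singapore", "malaysia", "indonesia"},
                "nullable": False},
    "income": {"type": float, "min": 0.0, "nullable": True},
}

def validate_record(record, dictionary):
    """Return a list of integrity violations found in one synthetic record."""
    problems = []
    for name, rule in dictionary.items():
        if name not in record:
            problems.append(f"{name}: missing attribute")
            continue
        value = record[name]
        if value is None:
            if not rule.get("nullable", False):
                problems.append(f"{name}: null not allowed")
            continue
        if not isinstance(value, rule["type"]):
            problems.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            problems.append(f"{name}: below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            problems.append(f"{name}: above maximum {rule['max']}")
        if "categories" in rule and value not in rule["categories"]:
            problems.append(f"{name}: unexpected category {value!r}")
    for name in record:
        if name not in dictionary:
            problems.append(f"{name}: not in data dictionary")
    return problems

good = {"age": 37, "country": "singapore", "income": 5200.0}
bad = {"age": -3, "country": "atlantis", "income": None}
print(validate_record(good, DATA_DICTIONARY))  # []
print(validate_record(bad, DATA_DICTIONARY))
```

Running such a check over every generated record gives a simple list of anomalies to investigate or fix.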
Data fidelity
Data fidelity examines whether synthetic data closely follows the characteristics and statistical attributes of the source data. There are a few metrics for measuring data fidelity, and they are typically computed by statistically comparing the generated synthetic data directly with the source data. Organisations should use the performance metric(s) for data fidelity [18] (see Table 4) that best meet their data objectives.
[16] This guide may not be comprehensive in covering all other synthetic data generation methods such as Bayesian model and variational autoencoders (VAE).
[17] Refer to Approach 2 in Annex E for more details on the assessment and evaluation framework for quantifying re-identification risk.
[18] There are other generic metrics in addition to those listed in Table 4. See Khaled El Emam et al., "Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study," JMIR Medical Informatics 10, no. 4 (2022).
Table 4: Performance metrics for data fidelity
Performance metrics generally used for assessing data fidelity
Histogram-based similarity
  Measures the similarity between source and synthetic data's distributions through a histogram comparison of each feature. This ensures the synthetic data preserves important statistical properties such as central tendency (mean, median), dispersion (variance, range), and distribution shape (skewness, kurtosis).
Correlational similarity
  Measures the preservation of relationships between features in the source and synthetic datasets. For example, if higher education typically leads to higher income in the source data, this pattern should also be evident in synthetic data.
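The two metrics in Table 4 can be sketched as follows. The specific formulas (histogram intersection over shared bins, and one minus half the absolute difference between Pearson correlations) are one possible choice made for illustration, not definitions given by this guide; the education and income data are simulated.

```python
import math
import random

def histogram_similarity(source, synthetic, bins=10):
    """Histogram intersection in [0, 1]; 1 means identical binned distributions."""
    lo = min(min(source), min(synthetic))
    hi = max(max(source), max(synthetic))
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def binned(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(values) for c in counts]

    return sum(min(p, q) for p, q in zip(binned(source), binned(synthetic)))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_similarity(source_pairs, synthetic_pairs):
    """1 minus half the absolute gap between the two correlations, in [0, 1]."""
    r_source = pearson(*zip(*source_pairs))
    r_synthetic = pearson(*zip(*synthetic_pairs))
    return 1.0 - abs(r_source - r_synthetic) / 2.0

def make_pairs(n, seed):
    """Simulated (years of education, income) pairs with a positive relationship."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        education = rng.uniform(6, 20)
        pairs.append((education, 1000 + 250 * education + rng.gauss(0, 500)))
    return pairs

source_pairs = make_pairs(1000, seed=1)     # stands in for source data
synthetic_pairs = make_pairs(1000, seed=2)  # stands in for synthetic data

education_source = [p[0] for p in source_pairs]
education_synthetic = [p[0] for p in synthetic_pairs]
print("histogram similarity:", round(histogram_similarity(education_source, education_synthetic), 3))
print("correlation similarity:", round(correlation_similarity(source_pairs, synthetic_pairs), 3))
```

Values near 1 on both metrics suggest that the feature distributions and the education-income relationship are preserved.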
Data utility
Data utility refers to how well synthetic data can replace or add to source data for the specific data objective of the organisation.
There are different approaches to evaluate the utility of synthetic data. The true test of utility is how it performs in real-world tasks. One common approach to check this is by training identical AI/ML models on synthetic and training data.
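A minimal sketch of this comparison is given below. It trains the same simple classifier once on real training data and once on synthetic data, then compares their accuracy on held-out real data. The held-out evaluation step, the nearest-centroid classifier, and the simulated two-class data are assumptions made for illustration.

```python
import random

def make_data(n, seed):
    """Simulated two-class rows: (feature_1, feature_2, label)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = int(rng.random() < 0.5)
        centre = 2.0 if label else 0.0
        rows.append((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0), label))
    return rows

def train_centroids(rows):
    """Train a nearest-centroid classifier: one mean point per class."""
    centroids = {}
    for label in (0, 1):
        points = [(x, y) for x, y, row_label in rows if row_label == label]
        centroids[label] = (sum(p[0] for p in points) / len(points),
                            sum(p[1] for p in points) / len(points))
    return centroids

def accuracy(centroids, rows):
    """Share of rows whose nearest centroid matches the true label."""
    correct = 0
    for x, y, label in rows:
        predicted = min(centroids, key=lambda c: (x - centroids[c][0]) ** 2
                        + (y - centroids[c][1]) ** 2)
        correct += predicted == label
    return correct / len(rows)

real_train = make_data(1000, seed=1)
real_test = make_data(1000, seed=2)   # held-out real data (assumed evaluation set)
synthetic = make_data(1000, seed=3)   # stands in for generated synthetic data

acc_real = accuracy(train_centroids(real_train), real_test)
acc_synthetic = accuracy(train_centroids(synthetic), real_test)
print("trained on real data:     ", round(acc_real, 3))
print("trained on synthetic data:", round(acc_synthetic, 3))
```

A small gap between the two accuracies suggests that the synthetic data is a usable substitute for this particular task.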