




Journal Pre-proofs
Artificial Intelligence in Pharmaceutical Sciences
Mingkun Lu, Jiayi Yin, Qi Zhu, Gaole Lin, Minjie Mou, Fuyao Liu, Ziqi Pan, Nanxin You, Xichen Lian, Fengcheng Li, Hongning Zhang, Lingyan Zheng, Wei Zhang, Hanyu Zhang, Zihao Shen, Zhen Gu, Honglin Li, Feng Zhu
PII: S2095-8099(23)00164-9
DOI: 10.1016/j.eng.2023.01.014
Reference: ENG 1255
To appear in: Engineering
Received Date: 30 September 2022
Revised Date: 11 December 2022
Accepted Date: 6 January 2023
Please cite this article as: M. Lu, J. Yin, Q. Zhu, G. Lin, M. Mou, F. Liu, Z. Pan, N. You, X. Lian, F. Li, H. Zhang, L. Zheng, W. Zhang, H. Zhang, Z. Shen, Z. Gu, H. Li, F. Zhu, Artificial Intelligence in Pharmaceutical Sciences, Engineering (2023), doi: 10.1016/j.eng.2023.01.014
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
© 2023 Published by Elsevier Ltd. on behalf of Chinese Academy of Engineering.
Research
Smart Process Manufacturing: Review
Artificial Intelligence in Pharmaceutical Sciences
Mingkun Lu a,c, Jiayi Yin a, Qi Zhu a, Gaole Lin a, Minjie Mou a, Fuyao Liu a, Ziqi Pan a, Nanxin You a, Xichen Lian a, Fengcheng Li a, Hongning Zhang a, Lingyan Zheng a,c, Wei Zhang a, Hanyu Zhang a, Zihao Shen b,d, Zhen Gu a, Honglin Li b,d,e,*, Feng Zhu a,c,*
a The Second Affiliated Hospital, Zhejiang University School of Medicine & College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
b Shanghai Key Laboratory of New Drug Design, East China University of Science and Technology, Shanghai 200237, China
c Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba–Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
d Innovation Center for AI and Drug Discovery, East China Normal University, Shanghai 200062, China
e Lingang Laboratory, Shanghai 200031, China
* Corresponding authors.
E-mail addresses: hlli@ (H. Li), zhufeng@ (F. Zhu).
ARTICLE INFO
Article history:
Received
Revised
Accepted
Available online
Keywords:
Artificial intelligence
Machine learning
Deep learning
Target identification
Target discovery
Drug design
Drug discovery
ABSTRACT
Drug discovery and development affects various aspects of human health and dramatically impacts the pharmaceutical market. However, investments in a new drug often go unrewarded due to the long and complex process of drug research and development (R&D). With the advancement of experimental technology and computer hardware, artificial intelligence (AI) has recently emerged as a leading tool in analyzing abundant and high-dimensional data. Explosive growth in the size of biomedical data provides advantages in applying AI in all stages of drug R&D. Driven by big data in biomedicine, AI has led to a revolution in drug R&D, due to its ability to discover new drugs more efficiently and at lower cost. This review begins with a brief overview of common AI models in the field of drug discovery; then, it summarizes and discusses in depth their specific applications in various stages of drug R&D, such as target discovery, drug discovery and design, preclinical research, automated drug synthesis, and influences in the pharmaceutical market. Finally, the major limitations of AI in drug R&D are fully discussed and possible solutions are proposed.
1. Introduction
In the past few decades, the pharmaceutical industry has been limited by the extent of cutting-edge research in pharmaceutical sciences, because the development of new drugs is a long and complex process accompanied by high risks and high costs [1,2]. In other words, the current field of drug research and development (R&D) requires significant productivity improvements to shorten the cycle time and reduce the cost of drug development [3]. Technologies such as network pharmacology, RNA-sequencing (RNA-seq), high-throughput screening (HTS), or virtual screening (VS) have all accelerated the discovery of new targets, as well as new drugs to some extent [4–9]. Nevertheless, these technologies have rarely been significant contributors to the current process of new drug discovery. Thus, there is an urgent need for new technology to drive the development of new drugs.
As the computing power of devices grows, artificial intelligence (AI) has been used in many real cases, such as in image classification and speech recognition, due to its ability to learn from, process, and make predictions on massive amounts of information [10–12]. At present, after a long period of data accumulation, in combination with the development of high-throughput RNA-seq technology, massive amounts of biomedical data have been collected [13–18]. Biomedical data, which has a high level of heterogeneity and complexity, comes from a variety of sources, including omics data from different platforms, experimental data from biological or chemical laboratories, data generated by pharmaceutical companies, publicly disclosed textual information, and manually collated data from publicly available databases [19–22]. AI can be used to learn the potential patterns in these vast amounts of biomedical data, thereby bringing new opportunities and challenges to the pharmaceutical sciences and industries.
The AlphaFold2 system used AI in the Critical Assessment of Protein Structure Prediction 14 (CASP14) competition and outperformed others in accurately predicting the three-dimensional (3D) structures of proteins [23]. Similarly, in the Open Graph Benchmark Large-Scale Challenge (OGB-LSC) competition, a graph neural network (GNN) combined with a transformer model won the top rank in predicting the molecular properties calculated by means of density functional theory (DFT), which is difficult and highly time-consuming using traditional methods [24]. These competitions demonstrated the strong ability of AI to analyze biological or chemical data. Due to its powerful capability to utilize related biomedical data to understand complex biological systems and chemical reaction spaces [25,26], AI has had a revolutionary impact on all stages of drug R&D, including not only research on proteins and small molecules but also the assisted design of clinical trials and post-market surveillance [27]. Furthermore, in pharmaceutical companies, many state-of-the-art (SOTA) AI models have been adopted in diverse pipelines to shorten the R&D cycle time and decrease costs [28–30].
AI techniques in this context mainly involve machine learning (ML) and deep learning (DL). Both ML and DL algorithms are involved in target discovery and validation [31], drug discovery and design [32], and preclinical drug research [33], where they are used to analyze different data characteristics in different formats. After a drug candidate is enrolled in a clinical trial [34], DL plays a pivotal role in assisting in the design of the clinical trial and in supervising and analyzing data from clinical phase IV [33]. Approved drugs have a strong impact on manufacturing [35] and the market economy, and DL can play a part in these areas as well. Therefore, in this review, we present a comprehensive overview of most aspects of the use of AI in the pharmaceutical sciences. We focus on how AI can be used to promote target discovery and drug discovery (as shown in Fig. 1) and reflect on how to further accelerate the development of this field.
Fig. 1. Summary of AI applications in the pharmaceutical sciences. ADMET: absorption, distribution, metabolism, excretion, and toxicity.
2. Basic concepts of AI and its scope of application
AI was first proposed at the Dartmouth Conference in 1956 and was defined as an algorithm that gives machines the ability to reason and perform functions [36]. From perceptrons to support vector machines (SVMs) and artificial neural networks (ANNs), the development of AI has gone through several ups and downs, and it is currently flourishing thanks to the hardware support that is now available. Both ML and DL fall under the category of AI; strictly speaking, DL can be placed within the category of ML. However, our discussion of ML in this review concentrates only on traditional ML methods, such as random forest (RF) and SVMs.
2.1. The big data era
In the current big data era, gigantic amounts of biological and clinical data have laid a foundation for the application of AI in the field of medical and pharmaceutical research. Although AI has been successfully and effectively applied in multiple aspects of the drug R&D process, the quantity and quality of medical data have become one of the main obstacles to the development of AI in the pharmaceutical sciences. Thus far, pharmaceutical databases with detailed and structured big data proposed by medicinal researchers worldwide are playing a key role in promoting AI applications in medical and pharmaceutical research.
For example, the therapeutic target database (TTD) includes the most comprehensive information about known and explored therapeutic protein and nucleic acid targets, the targeted disease, pathway information, and the corresponding drugs directed at each of these targets. It provides detailed knowledge of the functions of targets, as well as their sequence, 3D structures, ligand-binding properties, relevant enzymes, and corresponding drug information [37]. PubChem [17] provides collective information of chemical molecules and their activities in response to biological assays, including molecular structure, identifiers, physicochemical properties, patent information, and molecular toxicity. Some popular databases aimed at various pharmaceutical issues have been proposed and are frequently used; these play significant roles in promoting the application of AI in medical and pharmaceutical research [38–42]. Summarizing various popular pharmaceutical databases, Table 1 [17,18,37,43–62] provides brief information on popular pharmaceutical databases, categorized into protein-related, gene-related, drug-related, and disease-related databases.
Table 1
Pharmaceutical databases focusing on proteins, genes, drugs/drug targets, and diseases.

| Focus | Database | Description | Refs. |
|---|---|---|---|
| Proteins | RCSB PDB | PDB contains 3D structural data of large biological molecules, such as proteins and nucleic acids | [43] |
| Proteins | PRIDE | PRIDE is a public data repository for proteomics, including protein and peptide identifications, post-translational modifications and supporting spectral evidence | [44] |
| Proteins | UniProt | UniProt is a protein database containing protein sequences, functional information, and an index of research papers | [18] |
| Proteins | InterPro | InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites | [45] |
| Proteins | VARIDT | VARIDT provides comprehensive data on all aspects of drug transporters’ variability | [46,47] |
| Genes | Ensembl | Ensembl provides centralized genomic data and powerful functionalities such as gene annotation and regulatory function predictions | [48] |
| Genes | UCSC Genome | The UCSC Genome browser offers access to genome sequence data from a variety of vertebrate and invertebrate species and major model organisms | [49] |
| Genes | GEO | The GEO is a database repository of high-throughput gene expression data and hybridization arrays, chips, and microarrays | [50] |
| Genes | GenBank | GenBank is an annotated collection of all publicly available DNA sequences | [51] |
| Genes | RefSeq | RefSeq provides separate and linked records for the genomic DNA, gene transcripts, and corresponding proteins for multiple organisms | [52] |
| Genes | EA | EA collects baseline gene expression data for different species and contexts, and contains differential studies reporting expression changes under two different conditions | [53] |
| Drugs/drug targets | TTD | TTD includes the most comprehensive information about known and explored therapeutic protein and nucleic acid targets | [37] |
| Drugs/drug targets | ChEMBL | ChEMBL is a manually curated library of bioactive compounds with drug-like properties | [54] |
| Drugs/drug targets | PubChem | PubChem covers collective information on chemical molecules and their activities in response to biological assays | [17] |
| Drugs/drug targets | DrugBank | DrugBank combines comprehensive drug target information with specific drug data | [55] |
| Drugs/drug targets | DrugMAP | DrugMAP provides a comprehensive list of interacting molecules for drugs/drug candidates, including information on differential expression patterns | [56] |
| Drugs/drug targets | DTC | DTC enables the exploration of bioactivity data, the processing of new bioactivity data, and data curation in order to improve the understanding of DTIs | [57] |
| Drugs/drug targets | PHAROS | PHAROS provides a comprehensive, integrated knowledge base for the druggable genome | [58] |
| Diseases | TCGA | TCGA has over 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data related to the cancer genome | [59] |
| Diseases | DisGenNET | DisGenNET contains large, publicly available collections of genes and variants associated with human diseases | [60] |
| Diseases | ClinVar | ClinVar is a public archive of reports on relationships among human variations and phenotypes, with supporting evidence | [61] |
| Diseases | OMIM | OMIM is an online catalog of human genes and genetic disorders | [62] |

PDB: protein data bank; PRIDE: proteomics identification database; GEO: gene expression omnibus; EA: expression atlas; DTC: drug target commons; DTIs: drug–target interactions; TCGA: the cancer genome atlas; OMIM: online Mendelian inheritance in man.
2.2. ML and DL
Unlike traditional computer programming calculations, ML and DL can learn potential patterns from the input data without explicit programming. They are not limited by the format of the input data, which is broad and can include text, images, sound, and more (all types of data that can be encoded) [63]. Similar to the human learning model, ML and DL can gradually recognize different features of the data, infer the patterns lying within, and update their model parameters through continuous iterations until a valid model is formed.
According to the application scenarios, the models can be categorized into regression models and classification models. The difference between classification and regression tasks lies mainly in whether the type of output variable is continuous or discrete. Cheng and Ng [64] applied ML approaches to predict the biological activity of per- and polyfluorinated alkyl substances (PFAS) with an output of continuous values, and this study is a typical regression task. Hong et al. [65] built a DL model to predict whether a protein in a bacterium is of the T4SE type, with an output of discrete values (e.g., 0/1), and this study is a typical classification task.
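The distinction between the two task types can be illustrated with a minimal sketch. The data, the function names, and the threshold below are hypothetical and chosen only for illustration; they are not taken from the cited studies. A least-squares line returns a continuous value (regression), and thresholding that value yields a discrete 0/1 label (classification).

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def predict_activity(model, x):
    """Regression: the output is a continuous value."""
    slope, intercept = model
    return slope * x + intercept

def predict_label(model, x, threshold):
    """Classification: the continuous score is mapped to a discrete label (0/1)."""
    return 1 if predict_activity(model, x) >= threshold else 0

# Hypothetical toy data: one descriptor value and one measured activity per compound.
descriptor = [1.0, 2.0, 3.0, 4.0]
activity = [2.0, 4.0, 6.0, 8.0]
model = fit_line(descriptor, activity)

continuous_output = predict_activity(model, 5.0)          # a real number
discrete_output = predict_label(model, 5.0, threshold=5.0)  # 0 or 1
```

In practice the two tasks are usually trained with different loss functions rather than by thresholding a regressor; the sketch only shows how the output types differ.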
Depending on the type of learning algorithm required to solve the problem, models are conceptualized into three categories: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning is a labeled-data-driven process that trains a model on the relationship between input and its prespecified output in order to predict the categories or continuous variables of future input. In comparison, unsupervised methods are used for identifying patterns in unlabeled datasets and exploring a dataset’s potential structures to allow clustering of the data for further analysis. In addition, semi-supervised learning is part-way between supervised and unsupervised learning; it trains a model on data of which only a part is labeled and is used as a potential solution for problems that lack high-quality data [66]. Reinforcement learning performs model construction through constant interactive learning, relying on penalties for failure or rewards for success.
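As a small illustration of the unsupervised case, the sketch below clusters unlabeled one-dimensional values with k-means (Lloyd's algorithm). K-means is used here only as one familiar clustering method; the review does not single it out, and the values and initial centers are hypothetical.

```python
def kmeans_1d(values, centers, n_iter=10):
    """Cluster unlabeled 1D values around the given initial centers.

    Assumes n_iter >= 1. No labels are used at any point.
    """
    centers = list(centers)
    for _ in range(n_iter):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabeled measurements forming two visible groups.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
final_centers, final_clusters = kmeans_1d(values, centers=[0.0, 6.0])
```

A supervised method would instead receive a label for every value and learn the mapping from value to label.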
2.3. Introduction to different types of ML/DL-based algorithms
ML and DL methods have been successfully applied to solve relevant biomedical problems, with the adopted modeling approach varying for different problems or even for the same problem. For example, small molecules used to be characterized as engineered features for direct loading in several ML methods to predict their properties; more recently, however, GNNs can also be utilized to describe small molecules for predictions of properties [67]. Determining the function annotations of proteins is essential for the selection of druggable proteins as potential targets. Maxat et al. [68] built a convolutional neural network (CNN) to annotate the gene ontology annotation (GOA) of proteins. Nadav et al. [69] built a recurrent neural network (RNN) for protein function annotations, and Xia et al. [70] combined both a CNN and an RNN to predict the gene ontology (GO) label of proteins.
ML builds a special algorithm (not a specific algorithm) that focuses on the features of the data and transforms them into knowledge that machines can read to provide humans with new insights. Various common algorithms exist for researchers to choose from. The naïve Bayes (NB) algorithm is a probabilistic classifier based on Bayes’ theorem and independence assumptions between features; it is a simple and intuitive algorithm [71]. An RF algorithm constructs a set of unrelated decision trees that form a whole hierarchical structure; under model construction, each tree is individually responsible for a corresponding problem [72]. The final decision is based on the majority votes of the decision trees. Models that make decisions based on this approach are also commonly referred to as ensemble models. eXtreme gradient boosting (XGBoost) is a scalable ML algorithm based on gradient boosting, which is also an ensemble model [73]. A multi-layer perceptron (MLP) can be viewed as a directed graph consisting of multiple node layers, each fully connected to the next layer, so that it maps a set of input vectors to a set of output vectors. SVM is one of the most widely applied ML algorithms. Samples are classified by an optimal hyperplane, which is obtained by maximizing the margin between different classes in a specific dimensional space, with the dimensionality being determined by the number of features [74]. K-nearest neighbor (KNN) is regarded as “lazy learning” that classifies a sample according to only a few neighboring samples when distinguishing between categories [75]. In addition to the above methods, several other ML methods such as principal component analysis (PCA), partial least-squares (PLS), linear discriminant analysis (LDA), and logistic regression (LR) have been applied in biomedical data processes [76,77].
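To make the KNN description concrete, the following is a minimal sketch of a k-nearest-neighbor classifier with majority voting. The feature vectors, labels, and function name are hypothetical, and Euclidean distance is an assumed choice of metric.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training samples."""
    # "Lazy learning": nothing is fitted in advance; distances are
    # computed only when a prediction is requested.
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature descriptors for six compounds.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["inactive", "inactive", "inactive",
          "active", "active", "active"]

near_origin = knn_predict(points, labels, (0.15, 0.15))
near_one = knn_predict(points, labels, (0.95, 1.0))
```

The same majority-vote step is what an RF applies across its decision trees, with trees in place of neighbors.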
DL is popular due to its powerful generalization and feature-extraction capabilities; its learning and prediction process is end-to-end. Unlike the traditional ML process (which often consists of multiple independent modules), DL obtains the output data (output-end) directly from the input data (input-end) during the model training process and continuously adjusts and optimizes the model based on the error between the output and the true value, until it meets the expected result. A deep neural network (DNN) is a feed-forward neural network consisting of densely connected input, hidden, and output layers. It achieves the feature learning of input data by simulating nonlinear transformations between neurons, with each layer consisting of various neurons [78]. A CNN is a feed-forward neural network that consists of convolutional (feature extraction) and pooling (dimensionality reduction) layers. The convolutional and pooling layers help to extract the information in a dataset without consuming too much time and computational resources [79]. An RNN is a class of ANN in which linked nodes form a directed or undirected graph along a temporal sequence. An RNN includes a feedback component that allows signals from one layer to be fed back to the previous layer. Of the four networks described here, it is the only one with internal memory, which helps to address the difficulty of learning and storing long-term information [80]. A GNN is a connectivity model that derives the dependencies in a graph by means of information transfer between nodes in the network [81,82]. A GNN updates the state of a node according to neighbors of the node at any depth from the node; this state is able to represent the node information. The neural network architectures of the four networks described above are shown in Fig. 2.
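The neighbor-based state update of a GNN can be sketched without any learned parameters. In the simplified example below, each round replaces a node state with the mean of its own state and its neighbors' states; a real GNN would apply learned transformations and nonlinearities at this step. The graph, the states, and the function name are hypothetical.

```python
def message_passing(states, edges, n_rounds=1):
    """Update each node state with the mean of its own and its neighbors' states."""
    neighbors = {node: [] for node in states}
    for a, b in edges:  # treat the graph as undirected
        neighbors[a].append(b)
        neighbors[b].append(a)
    for _ in range(n_rounds):
        new_states = {}
        for node, state in states.items():
            group = [state] + [states[n] for n in neighbors[node]]
            # Average each feature dimension across the node and its neighbors.
            new_states[node] = [sum(col) / len(group) for col in zip(*group)]
        states = new_states
    return states

# Path graph 0 - 1 - 2; only node 0 starts with a non-zero signal.
initial = {0: [1.0], 1: [0.0], 2: [0.0]}
edges = [(0, 1), (1, 2)]

one_round = message_passing(initial, edges, n_rounds=1)
two_rounds = message_passing(initial, edges, n_rounds=2)
```

After one round, node 2 has seen only its direct neighbor; after two rounds, the signal from node 0 has reached it, which is how repeated updates let a node state reflect neighbors at greater depth.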
An autoencoder (AE), which consists of an encoder and a decoder, is used to learn efficient encodings of input data. The encoding, which is generated by feeding the input to the encoder, is used by the decoder to regenerate the input. An AE is usually used for data compression and dimensionality reduction through the representation methods (i.e., the encoding) of a set of data [83]. A generative adversarial network (GAN) is composed of two underlying neural networks: a generator neural network and a discriminator neural network. The former is used to generate content, while the latter is used to distinguish the generated content from real data [84]. Models can also be used in combination to solve a wider range of problems. For example, a graph convolution network (GCN) extends convolutional operations from traditional data (e.g., images) to graph data [85].
Fig. 2. Schematic network architectures for a DNN, GNN, CNN, and RNN.
When a model fails to learn the underlying patterns in data features effectively and loses the ability to generalize to new data, the problem is called model underfitting [86]. In contrast, overfitting occurs when, during training, noise in the data is fitted as a representative feature, resulting in poor predictions for new data [87]. Compared with underfitting, model overfitting is more difficult to deal with. Models often become overfitted due to being overly complex or because of an underrepresentation of data. A dataset used for a model is often divided into a training set, validation set, and test set. These sets are respectively used for model training, model adjustment, and model evaluation. To put it simply, a model that works badly on both the training and test sets is an underfitted model, while a model that works well on the training set but badly on the test set is an overfitted model. Typical ways to suppress overfitting include regularization, data augmentation [88], dropout [89], early stopping, and ensemble learning, among other methods.
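The dataset split and the early-stopping idea mentioned above can be sketched as follows. The split fractions, the patience value, and the validation losses are hypothetical choices for illustration, not recommendations from the review.

```python
import random

def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle and split samples into training, validation, and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the best validation loss seen before training
    is stopped after `patience` epochs without improvement."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # validation loss has stopped improving
    return best_epoch

train_set, val_set, test_set = split_dataset(range(100))

# Hypothetical validation losses: improvement stalls after epoch 2, so
# training stops at epoch 4 and never reaches the final value.
stop_at = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.5])
```

The validation set drives the stopping decision, which leaves the test set untouched for the final evaluation.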
Researchers encountered underfitting and overfitting problems when predicting the long-term trends of the coronavirus disease 2019 (COVID-19) pandemic using only a single traditional epidemic model or ML model. To address these issues, Sun et al. [90] proposed a new model called dynamic-susceptible-exposed-infective-quarantined (D-SEIQ). The D-SEIQ model can accurately predict the long-term trends of COVID-19 outbreaks by appropriately modifying the susceptible-exposed-infective-recovered (SEIR) model and integrating ML-based parameter optimization under reasonable epidemiology constraints.
Different models have different evaluation criteria. In regression models, commonly used evaluation criteria include mean squared error (MSE), root MSE (RMSE), and R-squared. In classification models, commonly used criteria are recall, precision, and F1 score. The receiver operating characteristic (ROC) curve and precision-recall curve (PRC) are also among the most commonly used evaluation criteria in classification models, with ROC curves taking into account both positive and negative cases to assess the overall performance of the model, while PRCs focus more on positive cases [91].
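The point metrics named above follow directly from their standard definitions. The sketch below computes them in plain Python; the function names and the toy values are hypothetical, and the zero-denominator handling is an assumed convention.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, and R-squared for continuous predictions."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {"mse": mse, "rmse": sqrt(mse), "r2": 1 - ss_res / ss_tot}

def classification_metrics(y_true, y_pred):
    """Precision, recall, and F1 score for binary (0/1) labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

reg = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
clf = classification_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

Precision, recall, and F1 are computed from the positive class only, which matches the remark that PRC-style measures focus more on positive cases.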
2.4. A brief description of molecule representation as model input
Over time, the accumulation of data on small molecules and proteins has resulted in an extremely large data resource. Databases of molecular sequences, structures, physicochemical properties, and so forth have been collected and organized by different organizations and contain a great deal of knowledge and information. However, the different sources and formats of the data make it difficult to integrate the correlated data from multiple heterogeneous sources. Therefore, it is particularly important to adopt suitable methods to represent molecules in an appropriate way and to mine the crucial information in the data on molecules by means of AI [92]. Current AI algorithms are highly dependent on the quality of the data; thus, when performing model construction, it is necessary to unify the input format of molecules, such as by representing small molecules and proteins as model-readable vectors or matrices.
At present, the representation of small molecules is generally done using one of four main approaches. The first approach involves knowledge-based representation. Molecular descriptors and molecular fingerprints based on human a priori knowledge are widely used in various ML or DL algorithms [93]. The second approach involves direct representation based on images. CNNs have now been used to learn rules from two-dimensional (2D) digital images. A 2D chemical digital grid of a molecule can be directly used as input to allow a CNN model to learn the properties of the molecule [94]. The third approach is string-based representation. For example, a typical canonical simplified molecular-input line-entry system (SMILES) represents small molecules in the form of strings. Thus, CNNs and RNNs can be further used to learn molecular embeddings from the string representations of chemical structures [95–97]. The fourth approach involves graph-based feature representation. Representation methods based on graph convolution or graph attention have been widely used to explore the feature representation of small molecules. In these methods, atoms and bonds are considered to be nodes and edges, respectively, while new molecular representations are obtained during the continuous updating of information at individual nodes. Graph-based representations have achieved outstanding performance in a variety of pharmaceutical learning tasks [98,99].
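As an illustration of the string-based approach, the sketch below turns a SMILES string into a one-hot matrix that a CNN or RNN could consume. The token pattern is a deliberate simplification and an assumption of this example (it handles bracket atoms and the two-letter halogens Cl and Br, and treats every other character as one token); the vocabulary and function names are likewise hypothetical.

```python
import re

# Simplified token pattern (an assumption for illustration): bracket atoms
# first, then two-letter halogens, then any single remaining character
# (atoms, bond symbols, branch parentheses, ring-closure digits).
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)

def one_hot_encode(smiles, vocabulary):
    """Encode a SMILES string as a (number of tokens) x (vocabulary size) matrix."""
    index = {token: i for i, token in enumerate(vocabulary)}
    matrix = []
    for token in tokenize(smiles):
        row = [0] * len(vocabulary)
        row[index[token]] = 1  # raises KeyError for tokens outside the vocabulary
        matrix.append(row)
    return matrix

ethanol_tokens = tokenize("CCO")                     # ethanol
ethanol_matrix = one_hot_encode("CCO", ["C", "O"])
```

Each row marks which vocabulary token appears at that position, giving the model-readable matrix form that the preceding section calls for.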
Protein representation methods can be basically classified into four categories: representation based on intrinsic properties of sequences, representation based on phy