Decoding the Structural Keywords in Protein Structure Universe

(Whole-issue priority) Online publication date: 2019-01-11
Although the protein sequence-structure gap continues to widen due to the development of high-throughput sequencing tools, the protein structure universe appears to be approaching completeness: no proteins with novel structural folds have been deposited in the Protein Data Bank (PDB) recently. In this work, we identify a protein structural dictionary (Frag-K) composed of a set of backbone fragments ranging from 4 to 20 residues as the structural "keywords" that can effectively distinguish between major protein folds. We first apply randomized spectral clustering and random forest algorithms to construct representative and sensitive protein fragment libraries from a large set of high-quality, non-homologous protein structures available in PDB. We analyze the impact of clustering cut-offs on the performance of the fragment libraries. Then, the Frag-K fragments are employed as structural features to classify protein structures into major protein folds defined by SCOP (Structural Classification of Proteins). Our results show that a structural dictionary with ~400 4- to 20-residue Frag-K fragments is capable of classifying major SCOP folds with high accuracy.
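
To make the idea of fragments as structural "keywords" concrete, the following is a minimal sketch of how a chain might be turned into a keyword-count feature vector. It is an illustration under stated assumptions, not the Frag-K procedure itself: it assumes one C-alpha coordinate per residue, compares fragments by distance-matrix RMSD (dRMSD, chosen here because it needs no superposition), and uses an arbitrary matching cut-off. The function names (fragments, drmsd, keyword_counts) are hypothetical, and the clustering and random forest steps described above are not shown.

```python
import math
from itertools import combinations


def fragments(coords, k):
    """Return all overlapping windows of k consecutive residues."""
    return [coords[i:i + k] for i in range(len(coords) - k + 1)]


def distance_profile(frag):
    """Pairwise intra-fragment distances (rotation/translation invariant)."""
    return [math.dist(a, b) for a, b in combinations(frag, 2)]


def drmsd(frag_a, frag_b):
    """Distance-matrix RMSD between two fragments of equal length."""
    pa, pb = distance_profile(frag_a), distance_profile(frag_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(pa, pb)) / len(pa))


def keyword_counts(coords, dictionary, cutoff):
    """Count, per dictionary fragment, the chain windows within `cutoff`.

    `cutoff` (in the same units as the coordinates) is an assumed
    matching threshold, not a value taken from the paper.
    """
    counts = [0] * len(dictionary)
    for j, key in enumerate(dictionary):
        for frag in fragments(coords, len(key)):
            if drmsd(frag, key) <= cutoff:
                counts[j] += 1
    return counts


# Toy data: an extended 6-residue chain and a two-entry "dictionary".
chain = [(3.8 * i, 0.0, 0.0) for i in range(6)]
extended_key = [(0.0, 3.8 * i, 0.0) for i in range(4)]  # same shape, rotated
bent_key = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
print(keyword_counts(chain, [extended_key, bent_key], cutoff=0.5))
```

A count vector of this kind, computed against the full dictionary, is the sort of feature representation that could then be passed to a classifier such as a random forest to predict the fold label.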