WaveNet Speech Synthesis Technology


REFERENCES

Agiomyrgiannakis, Yannis. Vocaine the vocoder and applications in speech synthesis. In ICASSP, pp. 4230–4234, 2015.

Bishop, Christopher M. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

Chen, Liang-Chieh, Papandreou, George, Kokkinos, Iasonas, Murphy, Kevin, and Yuille, Alan L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015. URL http://arxiv.org/abs/1412.7062.

Chiba, Tsutomu and Kajiyama, Masato. The Vowel: Its Nature and Structure. Tokyo-Kaiseikan, 1942.

Dudley, Homer. Remaking speech. The Journal of the Acoustical Society of America, 11(2):169–177, 1939.

Dutilleux, Pierre. An implementation of the “algorithme à trous” to compute the wavelet transform. In Combes, Jean-Michel, Grossmann, Alexander, and Tchamitchian, Philippe (eds.), Wavelets: Time-Frequency Methods and Phase Space, pp. 298–304. Springer Berlin Heidelberg, 1989.

Fan, Yuchen, Qian, Yao, Xie, Feng-Long, and Soong, Frank K. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Interspeech, pp. 1964–1968, 2014.

Fant, Gunnar. Acoustic Theory of Speech Production. Mouton De Gruyter, 1970.

Garofolo, John S., Lamel, Lori F., Fisher, William M., Fiscus, Jonathon G., and Pallett, David S. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon technical report, 93, 1993.

Gonzalvo, Xavi, Tazari, Siamak, Chan, Chun-an, Becker, Markus, Gutkin, Alexander, and Silen, Hanna. Recent advances in Google real-time HMM-driven unit selection synthesizer. In Interspeech, 2016. URL http://research.google.com/pubs/pub45564.html.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Comput., 9(8):1735–1780, 1997.

Holschneider, Matthias, Kronland-Martinet, Richard, Morlet, Jean, and Tchamitchian, Philippe. A real-time algorithm for signal analysis with the help of the wavelet transform. In Combes, Jean-Michel, Grossmann, Alexander, and Tchamitchian, Philippe (eds.), Wavelets: Time-Frequency Methods and Phase Space, pp. 286–297. Springer Berlin Heidelberg, 1989.

Hoshen, Yedid, Weiss, Ron J., and Wilson, Kevin W. Speech acoustic modeling from raw multichannel waveforms. In ICASSP, pp. 4624–4628. IEEE, 2015.

Hunt, Andrew J. and Black, Alan W. Unit selection in a concatenative speech synthesis system using a large speech database. In ICASSP, pp. 373–376, 1996.

Imai, Satoshi and Furuichi, Chieko. Unbiased estimation of log spectrum. In EURASIP, pp. 203–206, 1988.

Itakura, Fumitada. Line spectrum representation of linear predictor coefficients of speech signals. The Journal of the Acoust. Society of America, 57(S1):S35–S35, 1975.

Itakura, Fumitada and Saito, Shuzo. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53A:35–42, 1970.

ITU-T. Recommendation G.711. Pulse Code Modulation (PCM) of voice frequencies, 1988.

Józefowicz, Rafal, Vinyals, Oriol, Schuster, Mike, Shazeer, Noam, and Wu, Yonghui. Exploring the limits of language modeling. CoRR, abs/1602.02410, 2016. URL http://arxiv.org/abs/1602.02410.


Juang, Biing-Hwang and Rabiner, Lawrence. Mixture autoregressive hidden Markov models for speech signals. IEEE Trans. Acoust. Speech Signal Process., pp. 1404–1413, 1985.

Kameoka, Hirokazu, Ohishi, Yasunori, Mochihashi, Daichi, and Le Roux, Jonathan. Speech analysis with multi-kernel linear prediction. In Spring Conference of ASJ, pp. 499–502, 2010. (in Japanese).

Karaali, Orhan, Corrigan, Gerald, Gerson, Ira, and Massey, Noel. Text-to-speech conversion with neural networks: A recurrent TDNN approach. In Eurospeech, pp. 561–564, 1997.

Kawahara, Hideki, Masuda-Katsuse, Ikuyo, and de Cheveigné, Alain. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based f0 extraction: possible role of a repetitive structure in sounds. Speech Commn., 27:187–207, 1999.

Kawahara, Hideki, Estill, Jo, and Fujimura, Osamu. Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT. In MAVEBA, pp. 13–15, 2001.

Law, Edith and Von Ahn, Luis. Input-agreement: a new mechanism for collecting data using human computation games. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1197–1206. ACM, 2009.

Maia, Ranniery, Zen, Heiga, and Gales, Mark J. F. Statistical parametric speech synthesis with joint estimation of acoustic and excitation model parameters. In ISCA SSW7, pp. 88–93, 2010.

Morise, Masanori, Yokomori, Fumiya, and Ozawa, Kenji. WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. Inf. Syst., E99-D(7):1877–1884, 2016.

Moulines, Eric and Charpentier, Francis. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commn., 9:453–467, 1990.

Muthukumar, P. and Black, Alan W. A deep learning approach to data-driven parameterizations for statistical parametric speech synthesis. arXiv:1409.8558, 2014.

Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted Boltzmann machines. In ICML, pp. 807–814, 2010.

Nakamura, Kazuhiro, Hashimoto, Kei, Nankaku, Yoshihiko, and Tokuda, Keiichi. Integration of spectral feature extraction and modeling for HMM-based speech synthesis. IEICE Trans. Inf. Syst., E97-D(6):1438–1448, 2014.

Palaz, Dimitri, Collobert, Ronan, and Magimai-Doss, Mathew. Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks. In Interspeech, pp. 1766–1770, 2013.

Peltonen, Sari, Gabbouj, Moncef, and Astola, Jaakko. Nonlinear filter design: methodologies and challenges. In IEEE ISPA, pp. 102–107, 2001.

Poritz, Alan B. Linear predictive hidden Markov models and the speech signal. In ICASSP, pp. 1291–1294, 1982.

Rabiner, Lawrence and Juang, Biing-Hwang. Fundamentals of Speech Recognition. Prentice Hall, 1993.

Sagisaka, Yoshinori, Kaiki, Nobuyoshi, Iwahashi, Naoto, and Mimura, Katsuhiko. ATR ν-talk speech synthesis system. In ICSLP, pp. 483–486, 1992.

Sainath, Tara N., Weiss, Ron J., Senior, Andrew, Wilson, Kevin W., and Vinyals, Oriol. Learning the speech front-end with raw waveform CLDNNs. In Interspeech, pp. 1–5, 2015.

Takaki, Shinji and Yamagishi, Junichi. A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis. In ICASSP, pp. 5535–5539, 2016.


Takamichi, Shinnosuke, Toda, Tomoki, Black, Alan W., Neubig, Graham, Sakriani, Sakti, and Nakamura, Satoshi. Postfilters to modify the modulation spectrum for statistical parametric speech synthesis. IEEE/ACM Trans. Audio Speech Lang. Process., 24(4):755–767, 2016.

Theis, Lucas and Bethge, Matthias. Generative image modeling using spatial LSTMs. In NIPS, pp. 1927–1935, 2015.

Toda, Tomoki and Tokuda, Keiichi. A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Trans. Inf. Syst., E90-D(5):816–824, 2007.

Toda, Tomoki and Tokuda, Keiichi. Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM. In ICASSP, pp. 3925–3928, 2008.

Tokuda, Keiichi. Speech synthesis as a statistical machine learning problem. http://www.sp.nitech.ac.jp/~tokuda/tokuda_asru2011_for_pdf.pdf, 2011. Invited talk given at ASRU.

Tokuda, Keiichi and Zen, Heiga. Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis. In ICASSP, pp. 4215–4219, 2015.

Tokuda, Keiichi and Zen, Heiga. Directly modeling voiced and unvoiced components in speech waveforms by neural networks. In ICASSP, pp. 5640–5644, 2016.

Tuerk, Christine and Robinson, Tony. Speech synthesis using artificial neural networks trained on cepstral coefficients. In Proc. Eurospeech, pp. 1713–1716, 1993.

Tüske, Zoltán, Golik, Pavel, Schlüter, Ralf, and Ney, Hermann. Acoustic modeling with deep neural networks using raw time signal for LVCSR. In Interspeech, pp. 890–894, 2014.

Uria, Benigno, Murray, Iain, Renals, Steve, Valentini-Botinhao, Cassia, and Bridle, John. Modelling acoustic feature dependencies with artificial neural networks: Trajectory-RNADE. In ICASSP, pp. 4465–4469, 2015.

van den Oord, Aäron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016a.

van den Oord, Aäron, Kalchbrenner, Nal, Vinyals, Oriol, Espeholt, Lasse, Graves, Alex, and Kavukcuoglu, Koray. Conditional image generation with PixelCNN decoders. CoRR, abs/1606.05328, 2016b. URL http://arxiv.org/abs/1606.05328.

Wu, Yi-Jian and Tokuda, Keiichi. Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis. In Interspeech, pp. 577–580, 2008.

Yamagishi, Junichi. English multi-speaker corpus for CSTR voice cloning toolkit, 2012. URL http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html.

Yoshimura, Takayoshi. Simultaneous modeling of phonetic and prosodic parameters, and characteristic conversion for HMM-based text-to-speech systems. PhD thesis, Nagoya Institute of Technology, 2002.

Yu, Fisher and Koltun, Vladlen. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016. URL http://arxiv.org/abs/1511.07122.

Zen, Heiga. An example of context-dependent label format for HMM-based speech synthesis in English, 2006. URL http://hts.sp.nitech.ac.jp/?Download.

Zen, Heiga, Tokuda, Keiichi, and Kitamura, Tadashi. Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic features. Comput. Speech Lang., 21(1):153–173, 2007.

Zen, Heiga, Tokuda, Keiichi, and Black, Alan W. Statistical parametric speech synthesis. Speech Commn., 51(11):1039–1064, 2009.

Zen, Heiga, Senior, Andrew, and Schuster, Mike. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pp. 7962–7966, 2013.


[Figure 6: Outline of statistical parametric speech synthesis. The training stage comprises vocoder analysis, text analysis, and model training; the synthesis stage comprises text analysis, feature prediction, and vocoder synthesis.]

Zen, Heiga, Agiomyrgiannakis, Yannis, Egberts, Niels, Henderson, Fergus, and Szczepaniak, Przemysław. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In Interspeech, 2016. URL https://arxiv.org/abs/1606.06061.

A TEXT-TO-SPEECH BACKGROUND

The goal of TTS synthesis is to render natural-sounding speech signals given a text to be synthesized. The human speech production process first translates a text (or concept) into movements of the muscles associated with articulators and speech production-related organs. Then, using air flow from the lungs, vocal source excitation signals are generated; these contain both periodic components (from vocal cord vibration) and aperiodic components (from turbulent noise). Filtering the vocal source excitation signals with time-varying vocal tract transfer functions, which are controlled by the articulators, modulates their frequency characteristics. Finally, the generated speech signals are emitted. The aim of TTS is to mimic this process by computer in some way.
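The source-filter view described above can be illustrated with a minimal sketch. It is not taken from the paper: the function names, the 16 kHz sample rate, the 120 Hz pitch, and the two formant frequency/bandwidth pairs are all illustrative assumptions, and the filter is held fixed over time for brevity, whereas a real vocal tract transfer function varies as the articulators move.

```python
import math
import random


def excitation(num_samples, sample_rate=16000, f0=120.0, noise_level=0.05, seed=0):
    """Vocal source sketch: a periodic pulse train plus Gaussian noise.

    The pulse train stands in for vocal cord vibration (periodic part) and
    the noise for turbulence (aperiodic part). All values are assumptions.
    """
    rng = random.Random(seed)
    period = int(round(sample_rate / f0))  # samples between successive pulses
    return [
        (1.0 if n % period == 0 else 0.0) + noise_level * rng.gauss(0.0, 1.0)
        for n in range(num_samples)
    ]


def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate=16000):
    """Coefficients (a1, a2) of a two-pole resonator modelling one formant.

    The recursion is y[n] = x[n] + a1 * y[n-1] + a2 * y[n-2].
    """
    radius = math.exp(-math.pi * bandwidth_hz / sample_rate)  # below 1, so stable
    theta = 2.0 * math.pi * freq_hz / sample_rate
    return 2.0 * radius * math.cos(theta), -radius * radius


def all_pole_filter(signal, a1, a2):
    """Apply the two-pole recursion to a list of samples."""
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(num_samples=1600, formants=((700.0, 130.0), (1220.0, 70.0))):
    """Filter the excitation through a cascade of fixed formant resonators."""
    signal = excitation(num_samples)
    for freq_hz, bandwidth_hz in formants:
        a1, a2 = resonator_coeffs(freq_hz, bandwidth_hz)
        signal = all_pole_filter(signal, a1, a2)
    return signal


waveform = synthesize_vowel()  # 0.1 s of a crude vowel-like sound at 16 kHz
```

Changing `f0` alters the perceived pitch while changing the formant list alters the vowel quality, which is the separation between source and filter that the paragraph describes.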

TTS can be viewed as a sequence-to-sequence mapping problem: from a sequence of discrete symbols (text) to a real-valued time series (speech signals). A typical TTS pipeline has two parts: 1) text analysis and 2) speech synthesis. The text analysis part typically includes a number of natural language processing (NLP) steps, such as sentence segmentation, word segmentation, text normalization, part-of-speech (POS) tagging, and grapheme-to-phoneme (G2P) conversion. It takes a word sequence as input and outputs a phoneme sequence with a variety of linguistic contexts. The speech synthesis part takes the context-dependent phoneme sequence as its input and outputs a synthesized speech waveform. This part typically includes prosody prediction and speech waveform generation.

There are two main approaches to realizing the speech synthesis part: a non-parametric, example-based approach known as concatenative speech synthesis (Moulines & Charpentier, 1990; Sagisaka et al., 1992; Hunt & Black, 1996), and a parametric, model-based approach known as statistical parametric speech synthesis (Yoshimura, 2002; Zen et al., 2009). The concatenative approach builds up the utterance from units of recorded speech, whereas the statistical parametric approach uses a generative model to synthesize the speech.

The statistical parametric approach first extracts a sequence of vocoder parameters (Dudley, 1939) $o = \{o_1, \dots, o_N\}$ from speech signals $x = \{x_1, \dots, x_T\}$ and linguistic features $l$ from the text $W$, where $N$ and $T$ correspond to the numbers of vocoder parameter vectors and speech signal samples, respectively. Typically a vocoder parameter vector $o_n$ is extracted every 5 milliseconds. It often includes cepstra (Imai & Furuichi, 1988) or line spectral pairs (Itakura, 1975), which represent the vocal tract transfer function, and fundamental frequency (F0) and aperiodicity (Kawahara et al., 2001), which represent characteristics of the vocal source excitation signals. Then a set of generative models, such as hidden Markov models (HMMs) (Yoshimura, 2002), feed-forward neural networks (Zen et al., 2013), and recurrent neural networks (Tuerk & Robinson, 1993; Karaali et al., 1997; Fan et al., 2014), is trained from the extracted vocoder parameters and linguistic features as

$$\hat{\Lambda} = \arg\max_{\Lambda} \, p(o \mid l, \Lambda), \tag{4}$$

where $\Lambda$ denotes the set of parameters of the generative model. At the synthesis stage, the most probable vocoder parameters are generated given linguistic features extracted from a text to be synthesized as

$$\hat{o} = \arg\max_{o} \, p(o \mid l, \hat{\Lambda}). \tag{5}$$
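As a toy illustration of how Equations (4) and (5) fit together, the sketch below swaps in a deliberately simple stand-in for the generative model: a scalar linear-Gaussian density $p(o \mid l, \Lambda) = \mathcal{N}(o;\, w l + b,\, \sigma^2)$ with $\Lambda = (w, b, \sigma^2)$. This model, the function names `train` and `generate`, and the sample data are assumptions made for illustration only; the systems cited above use HMMs or neural networks over vector-valued features.

```python
def train(features, targets):
    """Eq. (4) for the assumed model N(o; w*l + b, var).

    For a linear-Gaussian density the maximum-likelihood parameters have a
    closed form: ordinary least squares for (w, b), mean squared residual
    for the variance.
    """
    n = len(features)
    mean_l = sum(features) / n
    mean_o = sum(targets) / n
    cov = sum((l - mean_l) * (o - mean_o) for l, o in zip(features, targets))
    var_l = sum((l - mean_l) ** 2 for l in features)
    w = cov / var_l
    b = mean_o - w * mean_l
    var = sum((o - (w * l + b)) ** 2 for l, o in zip(features, targets)) / n
    return w, b, var


def generate(feature, params):
    """Eq. (5) for the same model: the mode of a Gaussian is its mean."""
    w, b, _var = params
    return w * feature + b


# Hypothetical training pairs (linguistic feature l, vocoder parameter o).
linguistic = [0.0, 1.0, 2.0, 3.0]
vocoder = [1.0, 3.0, 5.0, 7.0]

lambda_hat = train(linguistic, vocoder)  # training stage, Eq. (4)
o_hat = generate(4.0, lambda_hat)        # synthesis stage, Eq. (5)
```

With a Gaussian output density the most probable output is simply the predicted mean, so the synthesis step reduces to a deterministic prediction.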

Then a speech waveform is reconstructed from $\hat{o}$ using a vocoder. The statistical parametric approach offers various advantages over the concatenative one, such as small footprint and flexibility

