Home
This wiki aims to give better insight into the procedure for generating commonly used speech features, of which MFCCs are the best known.
In speech and speaker recognition, it is necessary to extract the components of the audio that are relevant to the context and linguistics (in the case of speech recognition) or to the speaker's vocal characteristics (in the case of speaker recognition), and to discard the non-informative parts of the audio stream. At a high level, the extracted features should represent the vocal tract configuration behind what is being said, so that different parts of the spoken audio can be distinguished. Mel Frequency Cepstral Coefficients (MFCCs), introduced by Davis and Mermelstein in the 1980s, have been widely used and hard to beat ever since.
The high-level procedure for generating MFCCs is described below:
- The signal must be split into a stack of frames.
- For each frame, the power spectrum must be computed.
- The desired filterbank must be designed for the relevant spectrum.
- The energy coefficients should be calculated by filtering the spectrum of each frame with the designed filterbank (this yields the filterbank energy features).
- The log of the energy features should be computed (log-energy features).
- The Discrete Cosine Transform (DCT) must be applied to the log-energy features to reduce the correlation between them.
- Usually the first 13 coefficients are kept and the rest are discarded (this is a common convention, not a strict rule).
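
The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the 2595/700 mel-scale formula, the default of 20 filters, the unnormalized DCT-II used for decorrelation, and the small floor applied before the logarithm are all choices made for this example. Pre-emphasis and windowing are omitted because they are not part of the list above, and the naive DFT is only suitable for short signals (a real implementation would use an FFT routine).

```python
import cmath
import math


def frame_signal(signal, frame_len, frame_step):
    """Split a 1-D signal into overlapping frames; leftover samples are dropped."""
    if len(signal) < frame_len:
        return []
    num_frames = 1 + (len(signal) - frame_len) // frame_step
    return [signal[i * frame_step:i * frame_step + frame_len]
            for i in range(num_frames)]


def power_spectrum(frame, nfft):
    """Periodogram |X[k]|^2 / nfft for k = 0..nfft/2, using a naive DFT."""
    spectrum = []
    for k in range(nfft // 2 + 1):
        x_k = sum(x * cmath.exp(-2j * math.pi * k * t / nfft)
                  for t, x in enumerate(frame))
        spectrum.append(abs(x_k) ** 2 / nfft)
    return spectrum


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(num_filters, nfft, sample_rate):
    """Triangular filters spaced evenly on the mel scale from 0 Hz to Nyquist."""
    high_mel = hz_to_mel(sample_rate / 2.0)
    mel_points = [high_mel * i / (num_filters + 1) for i in range(num_filters + 2)]
    bins = [int(math.floor((nfft + 1) * mel_to_hz(m) / sample_rate))
            for m in mel_points]
    bank = [[0.0] * (nfft // 2 + 1) for _ in range(num_filters)]
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):        # rising edge of the triangle
            bank[m - 1][k] = (k - left) / (center - left)
        for k in range(center, right):       # peak and falling edge
            bank[m - 1][k] = (right - k) / (right - center)
    return bank


def dct_ii(values, num_coeffs):
    """First num_coeffs coefficients of an unnormalized DCT-II."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, v in enumerate(values))
            for k in range(num_coeffs)]


def mfcc(signal, sample_rate, frame_len, frame_step,
         nfft=256, num_filters=20, num_ceps=13):
    """Return one list of num_ceps cepstral coefficients per frame."""
    bank = mel_filterbank(num_filters, nfft, sample_rate)
    features = []
    for frame in frame_signal(signal, frame_len, frame_step):
        spec = power_spectrum(frame, nfft)
        energies = [sum(w * p for w, p in zip(filt, spec)) for filt in bank]
        # Floor the energies so that an empty filter does not produce log(0).
        log_energies = [math.log(max(e, 1e-12)) for e in energies]
        features.append(dct_ii(log_energies, num_ceps))
    return features


if __name__ == "__main__":
    sample_rate = 8000
    # Hypothetical test input: 50 ms of a 440 Hz sine tone.
    tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(400)]
    feats = mfcc(tone, sample_rate, frame_len=200, frame_step=80)
    print(len(feats), "frames,", len(feats[0]), "coefficients each")
```

In this sketch the frame length must not exceed `nfft`, and `nfft` is assumed to be even. Keeping 13 coefficients is exposed as the `num_ceps` parameter so the convention can be changed easily.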
By generating features, we try to obtain a reliable estimate of the characteristics that are important.