Speech/Music Discrimination
Submitted as a course project (CS790)
Instructor: Dr. Sushil Louis
University of Nevada, Reno
Overview
This project uses the Modified Low Energy Ratio (MLER) and a context-based classification method to discriminate between speech and music in audio files. MLER is the only feature used in the decision rule, which keeps the method fast and robust. For an analysis of the method, please refer to the paper by W. Q. Wang: A Fast and Robust Speech/Music Discrimination Approach.
Tool Package
The program uses the Maaate package as its audio feature analysis and extraction tool. Maaate is a package for audio content analysis on audio files. It currently supports MPEG-1/2 layers 1-3 audio files (not AAC) through libMaaateMPEG, so only some mp3 files work. In my testing, it worked on almost all mp2 format files. For more information, please visit Maaate's official web site: http://www.cmis.csiro.au/maaate/
Classification
The classification program divides the input audio file evenly into multiple segment windows. Each window has 120 frames, and each frame contains 1152 PCM samples; each window is about 1 second long. As the program runs, it analyzes each segment window and calculates its MLER feature. Once it has the MLER of the current segment window, it classifies the window as either speech or music. After this first classification pass is finished, the program goes through all the segment windows again and adjusts each class label based on context.
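The two passes described above can be sketched roughly as follows. This is a minimal illustration, not the code from the modified Maaate package: the function names are hypothetical, the input is assumed to be one energy value per frame, and the context rule (relabel a window when both neighbours agree with each other but not with it) is only an assumed stand-in, since the exact rule is not given here.

```python
from typing import List

FRAMES_PER_WINDOW = 120  # from the text: each segment window has 120 frames


def split_into_windows(frame_values: List[float],
                       size: int = FRAMES_PER_WINDOW) -> List[List[float]]:
    """Cut a per-frame sequence into consecutive, non-overlapping windows.

    The last window may be shorter if the frame count is not a multiple
    of `size`.
    """
    return [frame_values[i:i + size] for i in range(0, len(frame_values), size)]


def smooth_by_context(labels: List[str]) -> List[str]:
    """Second pass (assumed rule): fix isolated labels using the neighbours.

    A window is relabelled only when the windows before and after it carry
    the same label and that label differs from its own. Neighbours are read
    from the first-pass labels, so one correction never triggers another.
    """
    smoothed = list(labels)
    for i in range(1, len(labels) - 1):
        if labels[i - 1] == labels[i + 1] != labels[i]:
            smoothed[i] = labels[i - 1]
    return smoothed


if __name__ == "__main__":
    first_pass = ["speech", "speech", "music", "speech", "speech"]
    print(smooth_by_context(first_pass))
```

With this rule a single "music" window in the middle of a run of "speech" windows is relabelled, while a genuine boundary between a music section and a speech section is left alone.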
Application Setting
Each frame has 1152 PCM samples per block, and each segment window contains 120 frames (about 1 second in length).
The MLER coefficient is delta=0.1. From the training results, the MLER threshold for music/speech discrimination is 0.107 (it varies quite a lot depending on which type of music file is used for training).
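To show how these two settings could fit together, here is a small sketch. It assumes that MLER is the fraction of frames in a window whose energy falls below delta times the mean frame energy of that window, and that a window with MLER above the threshold is labelled speech (speech tends to have more low-energy frames than music). Both points are assumptions to be checked against the cited paper; the function names are hypothetical.

```python
from typing import Sequence

DELTA = 0.1        # MLER coefficient from the settings above
THRESHOLD = 0.107  # trained speech/music threshold from the settings above


def mler(frame_energies: Sequence[float], delta: float = DELTA) -> float:
    """Assumed definition: share of frames with energy < delta * window mean."""
    if not frame_energies:
        return 0.0
    cutoff = delta * sum(frame_energies) / len(frame_energies)
    low_frames = sum(1 for energy in frame_energies if energy < cutoff)
    return low_frames / len(frame_energies)


def classify_window(frame_energies: Sequence[float],
                    delta: float = DELTA,
                    threshold: float = THRESHOLD) -> str:
    """Assumed decision direction: a high MLER means speech, otherwise music."""
    return "speech" if mler(frame_energies, delta) > threshold else "music"


if __name__ == "__main__":
    # Toy windows of 120 frames: pauses between words versus steady energy.
    speech_like = [0.0] * 30 + [1.0] * 90
    music_like = [1.0] * 120
    print(classify_window(speech_like), classify_window(music_like))
```

Because the threshold depends on the training material, it is kept as a parameter rather than fixed inside the function.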
Installation:
1) Unzip the source file to a directory.
2) Run configure. Command:
> ./configure --prefix=$your file directory
3) Make and install. Commands:
> make
> make install
4) Run the classification program. The executable file name is analysisSDAudio, and it is located in ../Maaate.0.3.1/demos/.
Before you run the application, copy the target audio file to your working directory. The Maaate package works on MPEG-1/2 layers 1-3 audio files, and not all mp3 format files are supported. You can download a sample file from here.
To run the classifier, you need only one parameter, the audio file name. For example, to classify the audio file test2.mp2, the command is:
../demos>analysisSDAudio test2.mp2
Some of the audio files used in the test, with results
Audio File Name (File Type)          Accuracy (correctly classified segments / total segments)
-------------------------------------------------------------
Interview.mp2 (Speech)               123/123 = 100%
7 days.mp2 (Music)                   57/58 = 98%
colors of the wind.mp2 (Music)       223/223 = 100%
Irish lullaby.mp2 (Music)            31/31 = 100%
Mea culpa.mp2 (Music)                141/141 = 100%
Bye bye bye.mp2 (Music)              168/176 = 95.5%
Speech 1.mp2 (Speech)                138/139 = 99%
Speech 2.mp2 (Speech)                120/127 = 94.5%
Speech 3.mp2 (Speech)                94/100 = 94%
Soak up the sun.mp2 (Music)          141/179 = 78.8%
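As a cross-check on the per-file figures, the segment counts listed above can be pooled per class. The sketch below uses only the counts from this table; the dictionary layout and function name are illustrative. Since these files are only part of the test set, the pooled values need not match the figures for the full set.

```python
# (class, correctly classified segments, total segments) per listed file
RESULTS = {
    "Interview.mp2": ("speech", 123, 123),
    "7 days.mp2": ("music", 57, 58),
    "colors of the wind.mp2": ("music", 223, 223),
    "Irish lullaby.mp2": ("music", 31, 31),
    "Mea culpa.mp2": ("music", 141, 141),
    "Bye bye bye.mp2": ("music", 168, 176),
    "Speech 1.mp2": ("speech", 138, 139),
    "Speech 2.mp2": ("speech", 120, 127),
    "Speech 3.mp2": ("speech", 94, 100),
    "Soak up the sun.mp2": ("music", 141, 179),
}


def pooled_counts(results, wanted_class=None):
    """Sum correct and total segments, optionally for one class only."""
    correct = total = 0
    for file_class, file_correct, file_total in results.values():
        if wanted_class is None or file_class == wanted_class:
            correct += file_correct
            total += file_total
    return correct, total


if __name__ == "__main__":
    for name in ("music", "speech", None):
        correct, total = pooled_counts(RESULTS, name)
        label = name or "all listed files"
        print(f"{label}: {correct}/{total} = {100 * correct / total:.1f}%")
```

Pooling by segment count weights long files more heavily than short ones, which is why a single difficult file such as Soak up the sun.mp2 has a visible effect on the music figure.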
Conclusion
The full test set is about 7400 segments (total duration 123 minutes), split about half and half between speech files and music files. The overall classification accuracy (delta=0.1, threshold=0.107):
Accuracy
Class        Music        Speech
-------------------------------------------------------------
Music        94.71%       5.29%
Speech       2.31%        97.69%
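The table can be reduced to a single overall figure with a little arithmetic. The sketch below assumes that each row is the true class and each column the assigned class (the rows sum to 100%), and that the test set is split exactly half and half, which the text gives only approximately. It also checks the stated segment length from the segment count and total duration.

```python
# Row = true class, column = assigned class, values in percent (assumed reading)
CONFUSION = {
    "music": {"music": 94.71, "speech": 5.29},
    "speech": {"music": 2.31, "speech": 97.69},
}
# Assumed exact 50/50 split; the text says "about half"
CLASS_SHARE = {"music": 0.5, "speech": 0.5}

TOTAL_SEGMENTS = 7400   # approximate, from the text
TOTAL_MINUTES = 123     # from the text


def overall_accuracy(confusion, share):
    """Weight each class's own accuracy by its share of the test set."""
    return sum(share[c] * confusion[c][c] for c in confusion)


def seconds_per_segment(total_minutes, total_segments):
    """Average segment length implied by the totals."""
    return total_minutes * 60 / total_segments


if __name__ == "__main__":
    print(f"overall accuracy: {overall_accuracy(CONFUSION, CLASS_SHARE):.2f}%")
    print(f"segment length: {seconds_per_segment(TOTAL_MINUTES, TOTAL_SEGMENTS):.3f} s")
```

Under these assumptions the overall accuracy comes to about 96.2%, and the average segment length is just under 1 second, in line with the window length given earlier.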
The results indicate that the method is fast and robust, and that the accuracy is good for both classes.
On the other hand, the program does not consider noise level; the sample files are all clear speech or music with a clear rhythm. Silence is not considered either: if a long silence is present together with some noise, it will lower the accuracy. The classification accuracy for speech files is higher than for music files, and music recognition depends strongly on the music type. From the test results above, you can see that for some pure or soft music files (such as mea_culpa.mp2 and irish lullaby.mp2) the classification accuracy is very good (99%, 100%), while for some files (Soak up the sun.mp2) the accuracy drops to 78%, and for some rap music it is around 70%. Some music sounds very close to human talk when it has only a light instrumental accompaniment, which makes it harder to discriminate between music and speech.
Overall, the application runs well and yields a music/speech classifier with good accuracy. I am happy to have made the classifier work and to see the results.
Files modified in the Maaate package:
AnalysiSDAudio.cc
SegmentData.cc
SegmentData.H
Bandnrj.cc
Here is the post for my project
2004. 07.15