Speech/Music Discrimination

Submitted as a course project (CS790)

Instructor: Dr. Sushil Louis

University of Nevada, Reno

 

Overview

 My project uses the Modified Low Energy Ratio (MLER) and a context-based classification method to discriminate between speech and music audio files. MLER is the only feature used in the decision rule, which keeps the method fast and robust. For an analysis of the method, please refer to the paper by W. Q. Wang: A Fast and Robust Speech/Music Discrimination Approach.

 

Tool Package

The program uses the Maaate package as its audio feature analysis and extraction tool. Maaate is a package for audio content analysis on audio files. Currently it supports MPEG-1/2 layers 1-3 audio files (not AAC) in libMaaateMPEG, so only some mp3 files work. In my testing, it worked on almost all mp2 files.
For more information, please visit Maaate's official web site: http://www.cmis.csiro.au/maaate/

Classification

 The classification program divides the input audio file evenly into segment windows. Each window has 120 frames, each frame contains 1152 PCM samples, and each window is about 1 second long. As the program runs, it analyzes each segment window, calculates its MLER, and classifies the window as either speech or music. After this first classification pass, the program goes through all the segment windows again and adjusts each window's class based on its context.
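The two passes described above can be sketched in Python as follows. This is only an illustration, not the code in the modified Maaate sources: the function names are made up, dropping a trailing partial window is an assumption, and the context rule shown (relabel a single window when both of its neighbours agree on the other class) is a hypothetical stand-in, since the exact adjustment rule is not spelled out here.

```python
from typing import List, Sequence

FRAMES_PER_WINDOW = 120  # segment window size stated in the text


def split_into_windows(frame_energies: Sequence[float],
                       frames_per_window: int = FRAMES_PER_WINDOW) -> List[List[float]]:
    """Cut a per-frame energy sequence into consecutive, equal-sized windows.

    Assumption of this sketch: a trailing partial window is dropped.
    """
    n_windows = len(frame_energies) // frames_per_window
    return [list(frame_energies[i * frames_per_window:(i + 1) * frames_per_window])
            for i in range(n_windows)]


def smooth_by_context(labels: Sequence[str]) -> List[str]:
    """Second pass: relabel an isolated window whose two neighbours agree.

    Hypothetical rule for illustration only. Decisions are made on the
    first-pass labels, so one relabelled window cannot trigger another.
    """
    smoothed = list(labels)
    for i in range(1, len(labels) - 1):
        if labels[i - 1] == labels[i + 1] != labels[i]:
            smoothed[i] = labels[i - 1]
    return smoothed
```

For example, first-pass labels of speech, music, speech, speech would come out of smooth_by_context as four speech windows, because the single music window sits between two speech windows.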

Application Setting

Each frame has 1152 PCM samples, and each segment window contains 120 frames (about 1 second in length).

MLER coefficient delta = 0.1. From the training results, the MLER threshold for music/speech discrimination is 0.107. (This value varies quite a lot depending on which type of music files are used for training.)
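To show how these two settings interact, here is a minimal Python sketch of the decision rule. It rests on two assumptions that should be checked against the cited paper: that MLER is the fraction of frames in a window whose energy is below delta times the window's mean energy, and that a window with MLER above the threshold is speech (speech has pauses, so more low-energy frames). The function names are invented for this sketch.

```python
from typing import Sequence

DELTA = 0.1        # MLER coefficient from the settings above
THRESHOLD = 0.107  # trained decision threshold from the settings above


def mler(frame_energies: Sequence[float], delta: float = DELTA) -> float:
    """Fraction of frames whose energy is below delta * mean window energy.

    Assumed definition; see the cited paper for the authoritative one.
    """
    if not frame_energies:
        raise ValueError("window must contain at least one frame")
    low_level = delta * sum(frame_energies) / len(frame_energies)
    low_frames = sum(1 for e in frame_energies if e < low_level)
    return low_frames / len(frame_energies)


def classify_window(frame_energies: Sequence[float],
                    delta: float = DELTA,
                    threshold: float = THRESHOLD) -> str:
    """First-pass label for one segment window."""
    # Assumed direction: more low-energy frames (higher MLER) means speech.
    return "speech" if mler(frame_energies, delta) > threshold else "music"
```

With this definition, a window of 120 frames with constant energy has MLER 0 and is labelled music, while a window in which 30 of the 120 frames are silent has MLER 0.25 and is labelled speech.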

Installation:

1) Unzip the source file to a directory

2) Run configure:

     > ./configure --prefix=<your file directory>

3) Make and install:

     > make

     > make install

4) Run the classification program. The executable is named analysisSDAudio and is located in ../Maaate.0.3.1/demos/.

Before you run the application, copy the target audio file to your working directory. The Maaate package works on MPEG-1/2 layers 1-3 audio files, and not all mp3 files are supported. You can download a sample file from here

To run the classifier, you need just one parameter, the audio file name. For example, to classify the audio file test2.mp2, run:

   ../demos>analysisSDAudio test2.mp2

 Some of the audio files used in the test, with their results

   Audio File Name            File Type    Correctly classified / total segments    Accuracy

   Interview.mp2              Speech       123/123                                  100%
   7 days.mp2                 Music        57/58                                    98%
   colors of the wind.mp2     Music        223/223                                  100%
   Irish lullaby.mp2          Music        31/31                                    100%
   Mea culpa.mp2              Music        141/141                                  100%
   Bye bye bye.mp2            Music        168/176                                  95.5%
   Speech 1.mp2               Speech       138/139                                  99%
   Speech 2.mp2               Speech       120/127                                  97%
   Speech 3.mp2               Speech       94/100                                   94%
   Soak up the sun.mp2        Music        141/179                                  78.8%
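As a rough cross-check, the segment counts listed above can be pooled per class with a few lines of Python. The counts are copied from the list; the function name is made up for this sketch. Because these files are only part of the full test set, the pooled figures need not match the overall accuracy reported in the conclusion.

```python
from typing import Dict, List, Tuple

# (file, class, correctly classified segments, total segments), copied from the list above
RESULTS: List[Tuple[str, str, int, int]] = [
    ("Interview.mp2", "Speech", 123, 123),
    ("7 days.mp2", "Music", 57, 58),
    ("colors of the wind.mp2", "Music", 223, 223),
    ("Irish lullaby.mp2", "Music", 31, 31),
    ("Mea culpa.mp2", "Music", 141, 141),
    ("Bye bye bye.mp2", "Music", 168, 176),
    ("Speech 1.mp2", "Speech", 138, 139),
    ("Speech 2.mp2", "Speech", 120, 127),
    ("Speech 3.mp2", "Speech", 94, 100),
    ("Soak up the sun.mp2", "Music", 141, 179),
]


def per_class_accuracy(results: List[Tuple[str, str, int, int]]) -> Dict[str, float]:
    """Pool correct and total segment counts per class, then divide."""
    totals: Dict[str, Tuple[int, int]] = {}
    for _name, cls, correct, total in results:
        c, t = totals.get(cls, (0, 0))
        totals[cls] = (c + correct, t + total)
    return {cls: c / t for cls, (c, t) in totals.items()}


for cls, acc in per_class_accuracy(RESULTS).items():
    print(f"{cls}: {acc:.2%}")
```

Pooling by segment count weights long files more heavily than short ones, which is the same weighting a per-segment accuracy over the whole test set would use.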

Conclusion

The full test set is about 7400 segments (total duration 123 minutes), split about evenly between speech files and music files. The overall classification accuracy (delta=0.1, threshold=0.107):

               Accuracy

                                  Classified as
               Actual class       Music         Speech
               ---------------------------------------------
               Music              94.71%        5.29%
               Speech             2.31%         97.69%

 

The results indicate that the method is fast and robust, with an accuracy of 94.71% on music and 97.69% on speech.

On the other hand, the program does not consider noise level; the sample files are all clear speech or music with a clear rhythm. Classification accuracy is higher for speech files than for music files. Silence is not considered in my program either: if a long silence is present with some noise, accuracy drops. Music recognition also depends strongly on the type of music. As the test results above show, accuracy is very good (100%) for some pure or soft music files (such as Mea culpa.mp2 and Irish lullaby.mp2), while for others (Soak up the sun.mp2) it drops to 78.8%, and for some rap music it is around 70%. Some music, with only light instrumental accompaniment, sounds very close to human talk, which makes it harder to discriminate between music and speech.

Overall, the application runs well and yields a music/speech classifier with good accuracy. I am happy to have made the classifier work and to see the results.

Files Modified in the Maaate Package:

AnalysiSDAudio.cc   SegmentData.cc    SegmentData.H   Bandnrj.cc

Here is the post for my project

 

2004.07.15