S-TRANSFORM AND GAUSSIAN MIXTURE MODEL FOR ACOUSTIC SCENE CLASSIFICATION

In this study, an Acoustic Scene Classification (ASC) system is designed using the S-transform and the Gaussian Mixture Model (GMM). The S-transform is an extension of the continuous wavelet transform that combines progressive resolution with absolutely referenced phase information; in contrast to the wavelet transform, it exhibits the amplitude response of the frequency samples. The S-transform coefficients are modeled by GMMs, and classification is performed using the posterior probabilities of the testing features. Preprocessing of the acoustic signals is carried out by a series of operations: explosion (bit expansion), pre-emphasis filtering, and windowing. The number of Gaussian components used to model each scene is varied (GMM-4, GMM-8, GMM-16, and GMM-32), and the performance of the ASC system is analyzed on the TAU Urban Acoustic Scenes 2019 dataset. The results show the effectiveness of the system, with average recognition rates of 77.59%, 81.58%, 87.66%, and 84.50% for GMM-4, GMM-8, GMM-16, and GMM-32, respectively.


I. INTRODUCTION
The development of signal processing techniques has greatly extended to a wide range of applications such as radar and sonar, audio and speech analysis, and medical signal analysis. These techniques can extract information from signals that is not readily apparent to a human listener. The ASC system is a pattern recognition application that classifies a scene using the information in acoustic signals. A texture-descriptor-based ASC system is described in [1]: the acoustic signal is first converted into a spectrogram, and then features are extracted. To improve accuracy, visual features are also combined with the acoustic features.
Different neural networks and time-frequency features are discussed in [2] and evaluated either merged or individually. Convolutional Neural Networks (CNNs) and Deep Neural Networks (DNNs) are used, and three features are extracted: the constant-Q transform, Gammatone filter features, and log-energy Mel filter features. Feature selection methods for ASC, applied to the aggregate of visual and acoustic features, are employed in [3]. Principal component analysis and correlation-based selection approaches are used not only to select features but also to reduce the dimension of the feature space.
A multi-scale fusion based ASC system is described in [4]. A top-down pathway strategy is used to integrate the multi-scale semantic features obtained from a CNN with the Xception architecture, and a channel-weighted CNN is employed for the extraction. Efficient non-negative feature learning for ASC is analyzed in [5]; it fuses low-level time-frequency representations learned by a DNN with activation features from a non-negative matrix factorization approach.
Multi-channel CNNs and dense CNNs are explored in [6] for ASC; features are extracted in an end-to-end manner for the detection and classification of acoustic scenes. An ASC system based on ensembles of CNNs is described in [7]: the signals are first converted into spectrograms, and nearest-neighbour filters are then applied to smooth similar patterns. A network ensemble model is built from three different CNNs.
A bag-of-acoustics based ASC system using a distributed microphone array is discussed in [8]. The spatial features extracted from each sound clip are quantized and treated as a document representing a particular acoustic scene. An approach for feature extraction in a constrained learning environment for ASC is described in [9]: a DNN is used to simulate the Fourier transform, and the resulting information loss is alleviated by a temporal transformer module. CNNs and DNNs are applied to ASC in [10] using log-Mel and Mel-frequency features; the parameters of the CNN and DNN are varied and their performances are evaluated. An efficient ASC system using the S-transform and GMM is presented here. The S-transform coefficients are extracted after preprocessing and then modeled by GMMs for recognition. The organization of the work is as follows: the techniques used in the ASC system are discussed in detail in Section 2, and the next section discusses the results obtained by the ASC system. Section 4 concludes this work based on the S-transform and GMM.

II. METHODS AND MATERIALS
The ASC system discussed in this section has two main stages, as in many pattern recognition systems: feature extraction and classification. Through these stages, discriminant features are extracted and classified into their respective scenes. Figure 1 shows the general computer vision system. First, the acoustic signals are acquired and preprocessed so that features can be extracted without redundancy. The next stages are feature extraction and classification (recognition), where the patterns in the given signals are classified using the extracted features.

Fig. 1 Computer vision system
The ASC system based on the S-transform and the GMM classifier is treated as a multiclass problem, defined as follows. Consider n acoustic scenes AS = {AS1, AS2, AS3, ..., ASn}, where n is the total number of different scenes; the aim is to identify the scene of a given signal from these n scenes. To achieve this, a recognition system D is designed, defined by

D : R^t → {AS1, AS2, ..., ASn}

where the decision is achieved by the GMM classifier and R^t (t is the number of features) is the feature space created by the S-transform. The flow of the ASC system is shown in Figure 2.

A. Preprocessing
The overall performance of a pattern recognition system can be improved by employing preprocessing steps. The ASC system uses two preprocessing steps. First, the original acoustic signal samples are exploded (expanded) to 16-bit so that small amplitudes are also considered during feature extraction. Then, the ambiguity in the exploded signal is removed using a pre-emphasis filter, given by

y(n) = x(n) − α · x(n − 1)

where x is the input acoustic signal and α is the pre-emphasis filter coefficient. When this filter is applied, the amplitudes of the low-frequency components are decreased and the amplitudes of the high-frequency components are increased. After filtering, windowing is applied. Let ASp be the preprocessed signal, q the starting sample point of the applied window, and k the sample index within a window of length K; then the resulting signal in a single frame is defined by

ASf(q, k) = ASp(q + k) · w(k), 0 ≤ k ≤ K − 1

where w is the windowing function, here a Hamming window defined by

w(k) = 0.54 − 0.46 cos(2πk / (K − 1)), 0 ≤ k ≤ K − 1

A frame duration of 25 milliseconds with an overlap of 10 milliseconds is used to extract frames from the filtered acoustic signal. After preprocessing, the signals are represented by the S-transform for feature extraction.
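The preprocessing chain described above (pre-emphasis followed by Hamming-window framing) can be sketched in Python as below. This is an illustrative sketch, not the authors' implementation: the filter coefficient α = 0.97 and the 16 kHz sampling rate are assumptions, and the 10 ms figure is interpreted as the frame shift.

```python
import numpy as np

def preemphasis(x, alpha=0.97):
    """y(n) = x(n) - alpha * x(n-1): attenuates low, boosts high frequencies."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, fs, frame_ms=25, hop_ms=10):
    """Split the signal into overlapping Hamming-windowed frames."""
    frame_len = int(fs * frame_ms / 1000)   # K samples per frame
    hop_len = int(fs * hop_ms / 1000)       # frame shift in samples
    n_frames = 1 + max(0, (len(x) - frame_len) // hop_len)
    window = np.hamming(frame_len)          # 0.54 - 0.46*cos(2*pi*k/(K-1))
    frames = np.stack([x[i * hop_len : i * hop_len + frame_len] * window
                       for i in range(n_frames)])
    return frames

# Synthetic 1-second 440 Hz tone at an assumed 16 kHz sampling rate.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
frames = frame_signal(preemphasis(x), fs)
print(frames.shape)  # (n_frames, 400): 25 ms frames at 16 kHz
```

Each row of `frames` is one windowed 25 ms segment, ready for the S-transform stage.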

B. S-Transform
The S-transform, a time-frequency representation, was first developed in 1996 to analyze geophysical data [75]. The S-transform uniquely combines progressive resolution with absolutely referenced phase information, which is its main difference from other time-frequency representations such as the Fourier transform. "Absolutely referenced" means that the phase information given by the S-transform always refers to the time origin t = 0, as in the Fourier transform; by contrast, a transform whose phase is measured relative to the window centre, such as the wavelet transform, is said to have locally referenced phase. The generalized S-transform is

S(τ, f) = ∫ s(t) w(τ − t, f; p) e^(−i2πft) dt

where w denotes the S-transform window and p denotes a set of parameters determining the shape and properties of w, as given below.

 
The equation in (2) can also be related to the Fourier transform. The S-transform window w must satisfy a normalization condition,

∫ w(τ − t, f; p) dτ = 1 for all f,

which states that when S(τ, f) is integrated over all τ, the S-transform converges to the Fourier transform:

∫ S(τ, f) dτ = ∫ s(t) e^(−i2πft) dt = S_Fourier(f)

A further condition is the symmetry of the S-transform analyzing function between positive and negative frequencies. The important feature of the S-transform is that it combines the time-frequency space with frequency-dependent resolution while retaining the local spectral phase; in contrast to the wavelet transform, the S-transform exhibits the amplitude response of the frequency samples. The S-transform can also be derived from the Short-Time Fourier Transform (STFT). Let the STFT of the signal s be

STFT(τ, f) = ∫ s(t) g(τ − t) e^(−i2πft) dt

where τ is the time of spectral localization, f is the Fourier frequency of the input signal, and g(t) is the windowing function. The S-transform is obtained by choosing the windowing function g(t) to be a frequency-dependent Gaussian,

g(t) = (|f| / √(2π)) e^(−t²f²/2)

Applying this Gaussian window, the S-transform becomes

S(τ, f) = ∫ s(t) (|f| / √(2π)) e^(−(τ − t)²f²/2) e^(−i2πft) dt

It is noted that as the window of the S-transform widens in the time domain, the transform provides better frequency resolution; since the Gaussian window scales inversely with |f|, low-frequency components automatically receive wide windows. The S-transform thus retains the complete, absolutely referenced phase information of the signal.

C. GMM Classification
The GMM classifier classifies the given data by computing the posterior probabilities of the testing features under the trained scene models. In general, the likelihood of a t-dimensional feature vector x under a GMM λ = {w_m, μ_m, Σ_m} with M components is

p(x | λ) = Σ_{m=1..M} w_m N(x; μ_m, Σ_m), with Σ_{m=1..M} w_m = 1

where N(x; μ_m, Σ_m) is a Gaussian density with mean μ_m and covariance Σ_m, and the w_m are the mixture weights. One GMM is trained per acoustic scene using the expectation-maximization algorithm, and a test clip is assigned to the scene whose model gives the maximum posterior probability over its feature vectors (under equal priors, the maximum accumulated log-likelihood).
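This per-scene modeling scheme can be sketched as below. The paper does not specify an implementation, so scikit-learn's GaussianMixture is assumed here, with diagonal covariances and the scene names as illustrative choices:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMSceneClassifier:
    """One GMM per acoustic scene; decide by maximum accumulated
    log-likelihood (maximum posterior under equal scene priors)."""

    def __init__(self, n_components=16, seed=0):
        self.n_components = n_components
        self.seed = seed
        self.models = {}

    def fit(self, features_by_scene):
        # features_by_scene: dict mapping scene name -> (n_frames, t) array
        for scene, feats in features_by_scene.items():
            gmm = GaussianMixture(n_components=self.n_components,
                                  covariance_type="diag",
                                  random_state=self.seed)
            gmm.fit(feats)
            self.models[scene] = gmm
        return self

    def predict(self, feats):
        # Sum per-frame log-likelihoods over the clip for each scene model.
        scores = {scene: model.score_samples(feats).sum()
                  for scene, model in self.models.items()}
        return max(scores, key=scores.get)

# Illustrative synthetic "scenes": two well-separated feature clusters.
rng = np.random.default_rng(0)
train = {
    "park":  rng.normal(0.0, 1.0, (200, 4)),
    "metro": rng.normal(6.0, 1.0, (200, 4)),
}
clf = GMMSceneClassifier(n_components=2).fit(train)
print(clf.predict(rng.normal(0.0, 1.0, (50, 4))))
```

Varying `n_components` (4, 8, 16, 32) reproduces the GMM-4 through GMM-32 configurations analyzed in the results section.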

III. RESULTS AND DISCUSSION
The TAU Urban Acoustic Scenes 2019 database [13] is used for the performance evaluation of the ASC system based on the S-transform and the GMM classifier. Only subtask A of task 1 (ASC) is analyzed. It consists of 10 different acoustic scene classes recorded in 12 large European cities: Amsterdam, Vienna, Stockholm, Prague, Paris, Milan, Madrid, Lyon, London, Lisbon, Helsinki, and Barcelona. The database is split into two sets: a development set (acoustic signals from 10 cities) and an evaluation set (acoustic signals from 12 cities). All signals are acquired using the same device. Table 1 shows the different scene classes in the development dataset. All 10 scene classes are classified using the ASC system. Table 2 shows the accuracy of the ASC system together with the baseline model. The system is analyzed with different numbers of Gaussian components in powers of 2, i.e., GMM-4, GMM-8, GMM-16, and GMM-32. Tables 3 to 6 show the confusion matrices of the system for the different numbers of Gaussian components.