Genre Classification for Brazilian Music using Independent and Discriminant Features

Abstract: Digital music files are widely available both online and in private local collections. These databases may comprise hundreds or thousands of files, which in some cases carry no tagged information about their content, making the search for a desired audio file very time consuming. An important task in this context is to organize the available database according to the prevailing musical genre. The purpose of this work is to develop an automatic music genre classification system able to identify international music genres (e.g., pop, rock, classical, soul, funk) as well as typical Brazilian rhythms such as Samba, Forró and Brazilian Popular Music. The proposed signal processing chain comprises two stages. First, audio signal features are computed and their relevance for music genre identification is estimated, and independent component analysis is applied to reduce mutual redundancy among the audio attributes. Then, different classifiers based on neural networks and support vector machines are applied for music genre identification. The efficiency of the proposed system is evaluated using an experimental dataset.


I. INTRODUCTION
Considering the large amount of audio data files available, both online and in personal collections, along with the increasing availability of mobile digital audio players equipped with high-capacity storage drives, the search for the desired information may become tedious and time consuming. In this context, an automatic system for efficiently managing these large datasets is important for the final user.
Music information retrieval (MIR) [1] is an important and very active research field which combines aspects from signal processing, machine learning and musicology in the search for computational systems able to automatically access and identify the information contained in music data files. When dealing with a musical excerpt, different aspects such as the prevailing genre, the singer and the used instruments are relevant for classification purposes [2].
For the automatic classification of audio signals, the initial step usually comprises the extraction of relevant features (or attributes) from the digital files. After that, hypothesis testing (classification) is performed to assign each audio signal to a given class. Several studies have been carried out in the literature to achieve content-based audio signal classification. For example, the work in [3] quantified the relevance of locally estimated parameters for musical instrument recognition. In [2], temporal segmentation was proposed for audio signal analysis by selecting segments of equal time length from different parts of the audio file.
The musical genre classification problem was addressed in [4], where Gaussian mixture model (GMM) and k-nearest neighbor (KNN) classifiers were applied for this purpose. The work in [5] proposed a feature extraction method for music genre classification based on wavelet analysis; both support vector machines and linear discriminants were used for automatic classification. In [6], musical genre definitions and hierarchies were discussed, and techniques were presented for extracting meaningful information from audio data aiming at the characterization of musical excerpts. In [7], signals were classified according to the prevailing audio content into three classes: speech, music, and background noise. Hierarchical approaches for feature selection and classification were also evaluated. The analysis of bass tracks was proposed in [8] for music genre classification by exploring stylistic similarities.
We noticed that not only traditional and popular rhythms received the attention of the community. For example, in [9] the influence of repeated patterns was considered for Dutch folk song classification. Computational techniques were also used in [10] to discover patterns in Native American music and identify musical differences between indigenous groups.
In this work, the genre classification problem is addressed considering 12 different classes, including typical Brazilian genres such as Samba, Brazilian Popular Music (MPB) and Forró [11]. Brazilian culture is very rich, comprising European, African and Native American influences. This aspect, together with the continental dimensions of the country, allowed the appearance of several local (and very particular) music genres, which have been properly considered in only a few previous works, such as [12], in which visibility network features were used to describe the audio signals.
The main contributions of this paper include the study of the relevance of audio features for genre classification and a performance comparison between different classifier architectures based on both single layer feedforward neural networks (SLFN) and support vector machines (SVM) [13]. Independent component analysis (ICA) [14] is proposed here for efficient feature transformation, removing information redundancy. It is important to mention that the inclusion of typical Brazilian genres is itself a relevant novelty of this work. Additionally, results from a preliminary embedded implementation of the proposed system are presented.
This document is organized as follows. In Section II the proposed system is presented. The audio database and the system validation methodology are described in Section III. The experimental results are presented in Section IV, and the conclusions are drawn in Section V.

II. THE PROPOSED SYSTEM
The proposed system architecture is detailed in this section. Initially, a brief overview is used to present the main signal processing chain. In the following, the proposed audio descriptors are introduced. The applied feature selection and redundancy removal techniques are considered in the next subsection. In the final sub-section, the used classification systems are presented.

A. System Overview
As illustrated in Fig. 1, the proposed music genre classification system comprises a signal processing chain that starts with the temporal segmentation of full-length audio files. Three data segments of 30 seconds each are selected from each audio file. The first starts 15 seconds after the beginning of the signal, the second starts exactly at the midpoint of the signal, and the last ends 15 seconds before the end of the signal. The 15-second shift from the beginning and end of the files is intended to avoid selecting audio segments composed mostly of noise or silence, which may occur during the recording process. By selecting segments from different time locations, the estimated features are expected to better represent the characteristics of the complete audio signal. For feature extraction, audio descriptors are estimated from the previously selected music excerpts. The descriptors used are zero-crossing rate (ZCR), mel-frequency cepstral coefficients (MFCC), spectral power concentration (SPC), spectral centroid (SC), loudness (L) [15], and beat histogram (BH) [4].
Before the features are fed to the classifiers, their relevance for genre identification is estimated. In this stage some non-relevant features are discarded. Additionally, independent component analysis is applied as a pre-processing step for the classification module to reduce feature redundancy, which in some cases may prevent the proper training of the classifiers by causing slow convergence and sub-optimal results. Different algorithms were used for pattern recognition and their results were compared considering aspects such as discrimination efficiency and computational complexity.
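The segment selection rule described above can be illustrated with a minimal sketch. The function name, its parameters and the length check are assumptions introduced here for illustration only; they are not taken from the original implementation.

```python
def select_segments(n_samples, sample_rate, seg_seconds=30.0, margin_seconds=15.0):
    """Return (start, end) sample indices of the three analysis segments.

    Hypothetical helper: segment 1 starts `margin_seconds` after the beginning,
    segment 2 starts at the midpoint, segment 3 ends `margin_seconds` before the end.
    """
    seg = int(seg_seconds * sample_rate)
    margin = int(margin_seconds * sample_rate)
    # Assumed guard: the file must be long enough for all three segments.
    if n_samples < max(2 * margin + seg, 2 * seg):
        raise ValueError("signal too short for the requested segments")
    first = (margin, margin + seg)
    mid_start = n_samples // 2
    middle = (mid_start, mid_start + seg)
    last = (n_samples - margin - seg, n_samples - margin)
    return [first, middle, last]
```

For a 3-minute file, the three segments would cover 0:15 to 0:45, 1:30 to 2:00 and 2:15 to 2:45, so they do not overlap.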

B. Used audio descriptors
In this work, feature extraction was performed in short time windows of approximately 30 ms. Hamming windows with 30% overlap were used [2]. For proper characterization of the music files the following audio descriptors were estimated: zero-crossing rate (ZCR), mel-frequency cepstral coefficients (MFCC), spectral power concentration (SPC), beat histogram (BH), spectral centroid (SC) and loudness (L). These descriptors are briefly presented in the following (for more details see, for example, [15]).
The ZCR [2] is commonly used as an estimator for the fundamental (pitch) frequency and may be computed by counting the number of times the signal amplitude crosses the zero axis ($N_{cross}$) during a fixed time interval $\Delta T$:
$$ZCR = \frac{N_{cross}}{\Delta T}$$
The mel-frequency cepstrum coefficients (MFCC) are widely used for audio description (especially in speech processing applications) [16], [17], as they attempt to model the perception of the human ear. For this, a nonlinear frequency scale (the mel scale) is defined, in its usual form, as:
$$f_{mel} = 2595 \log_{10}\left(1 + \frac{f_{Hz}}{700}\right)$$
where $f_{Hz}$ is the frequency in Hz.
In order to obtain the mel-frequency cepstrum (see Fig. 2), the discrete Fourier transform (DFT) is applied to each audio signal frame. Next, the logarithmic amplitude spectrum is mapped to the mel-frequency scale and filtered using triangular overlapping filters. Finally, the discrete cosine transform (DCT) is applied to produce the MFCC.
The spectral power concentration ($S_{PC}$) vector and the spectral centroid ($S_C$) are parameters used to evaluate the distribution of signal power throughout the frequency range of interest ($0 \le f \le F_S/2$, where $F_S$ is the sampling frequency). The $S_{PC}$ consists of the sum of the power spectral density $S(f)$ within each of three frequency bands delimited by fixed limit frequencies. The $S_C$ [18] estimates the "center of mass" of the power spectrum:
$$S_C = \frac{\sum_{f} f \, S(f)}{\sum_{f} S(f)}$$
The loudness (L) [19] is an audio descriptor which aims to approximate the human perception of the intensity of an audio signal. To this end, an approximation of the frequency response of the human ear is used. As proposed in [20], a frequency-dependent weight factor $W(f_{kHz})$ is defined for the outer ear and applied to the magnitude of the FFT coefficients $X(f)$, where $X(f)$ is the discrete Fourier transform of the audio signal. In this work, a simple estimate of the loudness is obtained by summing the weighted FFT components.
Temporal features such as tempo and rhythm are important musical properties. A building block of these parameters is the onset, which may be defined as the beginning of a musical sound event (e.g., a stroke on a percussive instrument). The novelty function is usually applied to estimate the amount of change in the audio signal over time and is an important step for the automatic detection of onsets [15].
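As an illustration of the two definitions above, the following sketch counts zero crossings in one frame and converts a frequency from Hz to mel. The mel constants (2595 and 700) are the widely used convention and are an assumption here, since the exact form adopted by the authors cannot be confirmed from the text; the function names are hypothetical.

```python
import math

def zero_crossing_rate(frame, duration_s):
    """Zero crossings per second, N_cross / delta_T, for one frame of samples."""
    n_cross = sum(
        1 for a, b in zip(frame, frame[1:])
        if (a >= 0) != (b >= 0)  # sign change between consecutive samples
    )
    return n_cross / duration_s

def hz_to_mel(f_hz):
    """Common Hz-to-mel mapping (assumed constants 2595 and 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)
```

With this mapping 0 Hz corresponds to 0 mel and 1000 Hz corresponds to roughly 1000 mel, which is the usual anchoring of the scale.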
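The spectral centroid and the spectral power concentration described above can be sketched as follows, assuming the power spectral density of a frame is already available as a list of values with matching frequencies. The band edges passed to the second function are placeholders chosen by the caller; the limit frequencies used by the authors are not given here.

```python
def spectral_centroid(freqs, power):
    """Power-weighted mean frequency ("center of mass" of the power spectrum)."""
    total = sum(power)
    if total == 0:
        return 0.0  # assumed convention for a silent frame
    return sum(f * p for f, p in zip(freqs, power)) / total

def spectral_power_concentration(freqs, power, band_edges):
    """Sum of power inside each band [band_edges[i], band_edges[i + 1])."""
    bands = []
    for lo, hi in zip(band_edges, band_edges[1:]):
        bands.append(sum(p for f, p in zip(freqs, power) if lo <= f < hi))
    return bands
```

Four band edges produce the three-band vector; a flat spectrum yields a centroid at the middle of the analysed frequency range.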
The Beat Histogram (BH, also called beat spectrum) [4] is used to estimate the amplitude and frequency of the most relevant beats of a song. The BH can be interpreted as the frequency-domain representation of the novelty function; the result is a plot of the beat frequency (in BPM) versus its respective relevance (number of repetitions). The occurrence of multiple peaks indicates an intense rhythmic content. There are multiple ways of computing the beat histogram. In this work, the procedure described in [4] was used, as illustrated in Fig. 3.
The feature vector used to feed the classifiers is composed of the mean and variance estimates of the following frame-level audio descriptors: ZCR, the first five MFCC, $S_{PC}$ (in three frequency bands), $S_C$ and L; plus four beat histogram measures: the relative amplitudes of the first and second peaks, the sum of the histogram, and the period of the first peak. The 11 frame-level descriptors contribute 2 × 11 = 22 parameters and the beat histogram contributes 4, so the feature vector comprises a total of 26 parameters (see Table I).
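The composition of the 26-element feature vector can be made concrete with a short sketch. The descriptor names, the ordering of the entries and the use of the population variance are assumptions made for illustration; they do not reproduce the numbering of Table I.

```python
from statistics import mean, pvariance

# Hypothetical names for the 11 frame-level descriptor tracks.
FRAME_DESCRIPTORS = (
    ["zcr"]
    + [f"mfcc{i}" for i in range(1, 6)]
    + [f"spc_band{i}" for i in range(1, 4)]
    + ["spectral_centroid", "loudness"]
)
# Hypothetical names for the four beat histogram measures.
BEAT_MEASURES = ["peak1_amplitude", "peak2_amplitude", "histogram_sum", "peak1_period"]

def build_feature_vector(frames, beat):
    """frames: name -> list of per-frame values; beat: name -> single value."""
    vector = []
    for name in FRAME_DESCRIPTORS:
        values = frames[name]
        vector.append(mean(values))       # mean over all frames
        vector.append(pvariance(values))  # variance over all frames (assumed population variance)
    vector.extend(beat[name] for name in BEAT_MEASURES)
    return vector
```

The result has 2 × 11 + 4 = 26 entries, matching the count stated in the text.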

C. Feature Selection and Redundancy Removal
In classification systems, preprocessing the inputs is important in order to feed the classifiers with a compact and discriminant set of features. There are multiple ways to perform such feature ranking. In this work, the sequential backward elimination procedure is used: the classifier system is initially trained with the full set of features and then re-trained after eliminating each feature individually [15]. By comparing the efficiency results it is possible to determine whether the discrimination performance changes after removing each feature. Another issue that may be observed is the mutual redundancy among the input features. The classifier training process may be hampered if redundant features are used. To avoid this problem, independent component analysis (ICA) [21] is proposed in this work as a preprocessing step.
Consider that a set of N observed variables $x = [x_1, ..., x_N]^T$ is generated from a linear combination of unknown sources $s = [s_1, ..., s_N]^T$, such that:
$$x = W s$$
where W is the N × N mixing matrix [14]. ICA deals with the problem of finding an estimate y of s under the assumption that the components $y_i$ are mutually independent. As the exact inversion of the mixing model is ill-posed (it is not possible to guarantee the correct multiplying factor in the estimated sources $s_i$) [14], a solution may be obtained by finding an approximation of the inverse of the mixing matrix, $B \approx W^{-1}$, so that:
$$y = B x$$
In this work the FastICA algorithm [14] is applied for the estimation of the independent components. Among the advantages of the method are fast and reliable convergence, computational simplicity and low memory requirements.
ICA is closely related to principal component analysis (PCA) [22], which is used in this work to estimate the level of redundancy in the feature vector. PCA explores second-order statistics, removes signal correlation and produces linear projections (principal components) ordered by the amount of retained energy. Indeed, some ICA algorithms use PCA as a preprocessing step. Since all second-order dependence is removed after PCA, the ICA problem is reduced to dealing with the higher-order statistical information.
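The feature relevance procedure described above (train on the full set, then re-train with each feature removed in turn) can be sketched as follows. The `evaluate` callable is a hypothetical stand-in for "train the classifier on these features and return its performance index"; nothing about the actual training code is implied.

```python
def backward_elimination_ranking(features, evaluate):
    """Return the baseline score and the score change caused by removing each feature.

    features: list of feature names.
    evaluate: hypothetical callable mapping a list of features to a score.
    """
    baseline = evaluate(features)
    deltas = {}
    for feature in features:
        reduced = [f for f in features if f != feature]
        # Positive delta: performance improves without this feature.
        deltas[feature] = evaluate(reduced) - baseline
    return baseline, deltas
```

Features with a clearly positive delta are candidates for removal, while a strongly negative delta marks a feature as relevant. In practice each evaluation would be averaged over several training runs to reduce statistical fluctuation.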
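The scale ambiguity mentioned above can be made explicit with a short derivation. The diagonal matrix D, the permutation matrix P and the index map π are notation introduced here for illustration; they do not appear in the original text.

```latex
\begin{aligned}
\mathbf{x} &= \mathbf{W}\mathbf{s}
  && \text{(observed features as a linear mixture of the sources)} \\
\mathbf{y} &= \mathbf{B}\mathbf{x} = \mathbf{B}\mathbf{W}\mathbf{s}
  && \text{(estimated components)} \\
\mathbf{B}\mathbf{W} &= \mathbf{D}\mathbf{P}
  \;\Rightarrow\; y_i = d_i \, s_{\pi(i)}
  && \text{(sources recovered up to scale and order)}
\end{aligned}
```

Independence of the components $y_i$ is preserved under any such scaling and reordering, which is why ICA cannot fix the multiplying factor of each estimated source.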

D. The Proposed Classifier System
Automatic classifier systems have been successfully applied to different problems such as detection of partial discharges in electrical power systems [23], location of faults in electrical power lines [24], detection of broken bars in induction motors [25], and nondestructive evaluation of materials and structures [26].
In this work two different types of classifiers are applied: one based on a single-hidden-layer feedforward neural network (SLFN) and another based on a support vector machine (SVM). The obtained results are compared considering the discrimination efficiency. A preliminary implementation of the proposed system in dedicated electronics is used to estimate the computational complexity of each signal processing step.
SLFN are widely applied to classification problems (see for example [26], [27], [4]). In this work, two different neural network architectures were used for the SLFN classifiers: (i) one with an output layer comprising one neuron associated with each musical genre (in this case, 12 neurons); and (ii) another that uses one SLFN classifier specialized for each class of interest, in a one-against-all (OAA) approach. In both cases, the number of neurons in the hidden layer is chosen using a network growing procedure (starting from a small number of hidden neurons and adding hidden units until the desired discrimination performance is achieved). The hyperbolic tangent is used as the activation function for all neurons and the standard error backpropagation algorithm is applied for training.
The SVM algorithm comprises a three-layered feedforward network structure, which initially projects the input data into a high-dimensional space and then uses kernel functions (usually nonlinear) to generate a low-dimensional feature space. In this work, the multi-class SVM classifier was implemented using a one-against-all approach.
Some examples of popular kernel functions for SVM are the q-th order polynomial kernel, the radial basis function (RBF) kernel and the sigmoid kernel, respectively given by [28]:
$$K(z_1, z_2) = (z_1^T z_2 + 1)^q$$
$$K(z_1, z_2) = \exp\left(-\frac{\|z_1 - z_2\|^2}{2\sigma^2}\right)$$
$$K(z_1, z_2) = \tanh(\beta \, z_1^T z_2 + \gamma)$$
where $z_1$ and $z_2$ are vectors in the input space, $\|\cdot\|$ and $T$ are the vector norm and the transpose operators, respectively, and q (polynomial kernel order), $\sigma^2$ (RBF kernel variance), β and γ (sigmoid kernel gain and bias, respectively) are constants used to adjust each kernel function [28]. Applications of SVM include voice activity detection [29] and facial expression recognition [30].
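The network growing procedure mentioned above can be outlined as a simple search over hidden-layer sizes. The starting size, step, upper limit and the `train_and_score` callable are all hypothetical; the source does not state the values actually used.

```python
def grow_network(train_and_score, start=5, step=5, max_hidden=60, target=None):
    """Increase the hidden-layer size and keep the best-scoring configuration.

    train_and_score: hypothetical callable that trains an SLFN with the given
    number of hidden neurons and returns its discrimination efficiency.
    target: optional efficiency at which the growth stops early.
    """
    best_n, best_score = None, float("-inf")
    for n_hidden in range(start, max_hidden + 1, step):
        score = train_and_score(n_hidden)
        if score > best_score:
            best_n, best_score = n_hidden, score
        if target is not None and score >= target:
            break
    return best_n, best_score
```

Because each training run depends on the random initialization, `train_and_score` would typically average or take the best of several restarts before returning a score.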
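The three kernel families named above can be written compactly as follows. This is a sketch using the textbook forms of the polynomial, RBF and sigmoid kernels with the parameters q, σ², β and γ described in the text; the default parameter values are placeholders, not the values used in the experiments.

```python
import math

def dot(z1, z2):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(z1, z2))

def polynomial_kernel(z1, z2, q=2):
    """q-th order polynomial kernel: (z1 . z2 + 1) ** q."""
    return (dot(z1, z2) + 1.0) ** q

def rbf_kernel(z1, z2, sigma2=1.0):
    """RBF kernel: exp(-||z1 - z2||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(z1, z2))
    return math.exp(-squared_distance / (2.0 * sigma2))

def sigmoid_kernel(z1, z2, beta=1.0, gamma=0.0):
    """Sigmoid kernel: tanh(beta * z1 . z2 + gamma)."""
    return math.tanh(beta * dot(z1, z2) + gamma)
```

The RBF kernel equals 1 for identical inputs and decays with distance, which makes σ² a direct control of how local the decision function is.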

III. DATABASE AND SYSTEM VALIDATION METHODOLOGY
The database comprises 1008 music files assigned by five expert listeners to twelve different musical genres: Blues, Classical, Country, Forró, Hip Hop, Jazz, Brazilian Popular Music (MPB), Pop, Reggae, Rock, Soul, and Samba (see Table II). It is important to mention that, during the class assignment procedure for the dataset (required for supervised training), the expert listeners did not always agree on the music genre classification. In these cases, the class which received the largest number of indications was assigned to the music file.
To evaluate the performance of the proposed classifiers, the confusion matrix and the geometric mean of the efficiencies (EF) are computed. The confusion matrix presents the discrimination efficiencies (on the main diagonal) and the classification errors (in off-diagonal positions) for each class of interest. EF provides a measure of the overall classifier performance:
$$EF = \left(\prod_{m=1}^{M} EF_m\right)^{1/M}$$
where $EF_m$ is the classification efficiency obtained for class m and M = 12 is the number of classes. The geometric mean is preferred here over the arithmetic mean because it drops faster toward small values when a single class has low efficiency.
In order to account for statistical fluctuations in the dataset, for each classifier architecture the training procedure was restarted 10 times using different samples for the training, testing and validation sets. In this cross-validation procedure the number of examples in each set is kept fixed at 50%, 30% and 20% of the available signals, respectively. After that, the maximum value $EF_{max}$ and the standard deviation $\sigma_{EF}$ are computed.
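The labelling rule described above amounts to a majority vote among the five listeners, which can be sketched as below. How ties were resolved is not stated in the source; this sketch simply keeps the first genre to reach the highest count, which is an assumption.

```python
from collections import Counter

def assign_genre(votes):
    """Return the genre indicated by the largest number of listeners.

    votes: list of genre labels, one per expert listener.
    Ties are resolved in favour of the first most common label (assumed rule).
    """
    counts = Counter(votes)
    genre, _ = counts.most_common(1)[0]
    return genre
```

For example, votes of three "Samba" and two "MPB" for the same file would label it as Samba.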
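The two performance measures described above can be computed as in the following sketch, which builds a row-normalized confusion matrix and takes the geometric mean of its diagonal. Function names are hypothetical, and efficiencies are expressed as fractions between 0 and 1 rather than percentages.

```python
import math

def confusion_rates(true_labels, predicted_labels, classes):
    """Row-normalized confusion matrix: row = true class, column = predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1
    rates = []
    for row in counts:
        total = sum(row)
        rates.append([c / total if total else 0.0 for c in row])
    return rates

def geometric_mean_efficiency(rates):
    """Geometric mean of the per-class efficiencies (main diagonal)."""
    diagonal = [rates[i][i] for i in range(len(rates))]
    if any(e <= 0.0 for e in diagonal):
        return 0.0  # a class that is never recognized drives the index to zero
    return math.exp(sum(math.log(e) for e in diagonal) / len(diagonal))
```

A single poorly recognized class pulls this index down much more than it would pull down the arithmetic mean, which is the behaviour the text relies on.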
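One way to realize the repeated 50/30/20 partition described above is a seeded random shuffle of the file indices. This sketch is not stratified by genre and uses a hypothetical function name; whether the original splits preserved class proportions is not stated.

```python
import random

def random_split(n_items, fractions=(0.5, 0.3, 0.2), seed=None):
    """Shuffle item indices and split them into training, testing and validation sets."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    n_train = int(fractions[0] * n_items)
    n_test = int(fractions[1] * n_items)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]  # remainder, about 20% of the items
    return train, test, validation

# Ten restarts with different partitions of the 1008 files.
splits = [random_split(1008, seed=run) for run in range(10)]
```

Each restart trains a new classifier on its own partition; the maximum and the standard deviation of the resulting efficiency values then summarize the ten runs.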

IV. EXPERIMENTAL RESULTS
This section is divided into three parts. First, the feature selection and preprocessing results are presented. Next, the proposed classifier systems are presented and compared. Finally, the proposed system is compared to previous works.

A. Features selection and preprocessing
It is possible to observe from Fig. 4 typical audio attributes (MFCCs and beat histogram) for different music genres. It is interesting to note that the patterns are not easily distinguished, and indeed in some cases are quite similar (e.g. MPB and Samba).
In order to evaluate the effects of information compaction, the principal component analysis (PCA) load curve was estimated. From Fig. 5 it can be observed that the 22 most energetic principal components (out of the 26 original features) retain approximately 99.9% of the total energy. This indicates that some features probably present high mutual redundancy. As can be seen from the input feature correlation matrix in Fig. 6-(a), there is considerable correlation between some characteristics (see the dashed-circled areas in Fig. 6-(a)). This may hamper the classifier training process. To reduce the redundancy between the input features, the attributes are processed using independent component analysis. It can be seen from Fig. 6-(b) that there is a considerable reduction in correlation, as evidenced by the quasi-diagonal correlation matrix after ICA. The estimated independent components are used as new inputs for the classifier systems.
To estimate feature relevance, the sequential backward elimination procedure [31] was adopted. Each individual feature is removed from the feature set, the classifier is re-trained, and the performance index is computed and compared to the one obtained using the complete feature set. For this procedure an SLFN classifier with 40 hidden neurons was used (the choice of the number of hidden neurons is explained in Section IV-B). The results are illustrated in Fig. 7-(a). It can be observed that removing either of two individual features (namely feat-01, the mean of the ZCR; and feat-20, the period of the first peak of the beat histogram) improves the global classification performance. This clearly indicates that these are confusing features and should therefore be removed from the feature set. There are also some other features that do not contribute considerably to class discrimination, as their removal produces only a slight variation in the global discrimination index (namely feat-04, the variance of the first MFCC; and feat-25, the mean of the loudness). A complementary analysis considered the removal of sets of features computed from the same audio descriptor (see Fig. 7-(b)). In this case it is observed that the MFCC and the beat histogram (BH) sets of features are the most relevant ones. It is interesting to note that the individual BH features (feat-19 to feat-22 in Fig. 7-(a)) are not highly relevant, but when they are considered together they contribute significantly to class discrimination. After this feature relevance analysis it was decided to eliminate features 01 and 20.
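The reading of the PCA load curve described above can be sketched as a cumulative-energy computation over the eigenvalues of the feature covariance matrix. The function name and the way the eigenvalues are obtained are assumptions; only the thresholding step is shown.

```python
def components_for_energy(eigenvalues, target=0.999):
    """Smallest number of principal components whose cumulative energy reaches `target`.

    eigenvalues: variances along each principal direction (any order).
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)
```

When this count is clearly smaller than the number of input features, a few directions carry almost no energy, which signals redundancy among the original features.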

B. Proposed classifiers efficiency evaluation
In this work four different classifiers were trained for the Brazilian music genre classification problem: (i) an SLFN classifier fed with the 24 most discriminant features (called SLFN); (ii) an SLFN classifier fed with 24 discriminant and independent features (SLFN-I); (iii) SLFN classifiers trained in a one-against-all configuration (called SLFN$_{OAA}$); and (iv) an SVM classifier, which was also trained in a one-against-all configuration (called SVM).
For training of SLFN classifiers it is important to properly determine the number of neurons in the hidden layer. In this work this was achieved by a network growing procedure. As illustrated in Fig. 8 for SLFN and SLFN-I classifiers, it can be seen that the highest discrimination efficiency was achieved for the SLFN-I classifier with 35 hidden neurons. The best result for the SLFN classifier was achieved for 40 hidden neurons. An interesting aspect also observed in Fig. 8 is that the use of independent features consistently produced higher discrimination efficiencies and smaller statistical fluctuations in the final global performance.
For the SVM classifiers tests were performed using different kernel functions (linear, radial basis, polynomial and sigmoid) and the best discrimination results were obtained for the sigmoid kernel.
The discrimination efficiencies obtained from the different classifiers are summarized in Table III.
The confusion matrix for the SVM classifier is presented in Table IV. It can be seen that, in most cases, the cross-confusion between two genres is below 5% and only in a few cases is it above 10% (highlighted inside boxes in Table IV). Some high confusion rates appeared for genres which present similar characteristics (and in some cases are also confused by human listeners), such as Reggae and Forró, Jazz and Soul, and Pop and Reggae.
Regarding computational cost, Table V presents the average computational time in system operation (in % of the total time) required for each signal processing step. The classifier used was the SVM (only system operation was considered, not the training phase) and this analysis was performed using a Texas Instruments TMS320C6713 DSP (clock frequency of 225 MHz, 192 KB of internal memory, 512 KB of flash memory and 16 MB of SDRAM [32]). As can be observed, the classification module requires only 1% of the total time. The average processing time for each audio file is approximately 300 ms, so the complete dataset (1008 files × 0.3 s ≈ 302 s) may be processed in less than six minutes. These results indicate that it may be possible to produce a version of the proposed system for embedded applications with a relatively fast response to the final user.

C. Comparison with previous works
In order to allow a fair comparison of the proposed system with previous research in automatic music genre classification, the results presented in [33] for different databases such as GTZAN [4], ISMIR 2004 [34], Homburg [35] and 1,517 Artists [36] were used here.
Table VI presents a summary comprising some relevant aspects of the classification systems considered (the cases marked with '*' use short-length audio excerpts instead of full-length music files). It can be observed that the global discrimination results are usually higher for datasets with a smaller number of genres. Considering this, the proposed system, which comprises 12 genres and is the only one that uses discriminant and independent features, presents results comparable to those obtained for a dataset comprising only 9 genres.
Another interesting aspect is that only in [36] Latin music is considered explicitly in the classification problem and none of these works considered the diversity and the particularities of Brazilian music.

V. CONCLUSIONS
Music information retrieval from multimedia files is very important in the search for desired content in large non-tagged databases. This work deals with the identification of the prevailing musical genre for a dataset which includes Brazilian genres. As Brazilian culture comprises multiple influences (European, African and Native American), its musical genres present very specific characteristics, which can only be accounted for by designing a specific automatic music genre identification system. The experimental results indicate that the combination of relevant and independent input features with SVM classifiers produces a music genre classification system with efficiency comparable to previous results presented in this field. Additionally, a preliminary implementation of the proposed system in embedded electronics indicates that it may be possible to develop a version for mobile devices.

Since 2012 he has been with the Federal University of Bahia (Electrical and Computer Engineering Department). He is currently the coordinator of the Computer Engineering undergraduate program. His main research interests include audio signal processing and machine learning.