Multi-class Unbalanced Data Classification for Sleep Staging

Unbalanced data classification is a research focus for many applications, including financial fraud detection, network intrusion detection and cancer classification. However, unbalanced data classification has rarely been investigated in the field of EEG-based sleep staging. Herein, following the idea that established methods can be exploited in new applications, we propose a practical framework for classifying sleep stages with unbalanced data. In this framework, the data are balanced using the SMOTE algorithm, in which the mean sample number is used as the target for data expansion and the nearest-neighbour number is set according to the G-mean values. Subsequently, features are extracted and selected on the balanced data set. The effectiveness of the proposed framework is validated on eight sets of Sleep-EDF EEG data from the MIT-BIH (PhysioNet) physiological signal database. The results show that the proposed framework improves not only the F-score of the minority class but also the G-mean and AUC values of the whole data set, which might benefit sleep studies and disorder diagnoses.


Introduction
In recent decades, many new data forms have emerged. In particular, the widespread existence of unbalanced data has brought great challenges to traditional machine learning algorithms. Although traditional classification algorithms such as the decision tree [1], [3]-[5], support vector machine (SVM) [4], [6]-[8], naive Bayes [2], and the k-nearest neighbours (KNN) algorithm [3], [9] are successful in dealing with many classification problems, most of these algorithms assume balanced data. When the data are unbalanced, the performance of the classifier is reduced; in particular, the minority class cannot be recognized correctly. In some practical applications, correctly recognizing the minority classes matters more than recognizing the majority classes, such as in cancer symptom identification, credit card fraud detection and network intrusion detection [10], [11].
Unbalanced data refers to an unbalanced distribution of classes [12]; that is, one or more classes account for a relatively small proportion of the total sample. For this kind of data distribution, traditional classification algorithms often fail to achieve good results in practical applications. At present, research on the classification of unbalanced data sets mainly focuses on two aspects, namely, the data level and the algorithm level [13], [14]. At the data level, the main idea is to change the distribution of the data set by random under-sampling or random over-sampling so that the class distribution becomes more balanced.

Table 1 presents the sleep-expert classification results for the experimental data set used in this paper, where W is the awake stage, REM is the rapid eye movement stage, and N1, N2, N3, and N4 are the non-rapid eye movement stages; N1 and N2 are light sleep, and N3 and N4 are deep sleep. The EEG data of the Fpz-Cz and Pz-Oz channels are used in this paper. As Table 1 shows, the experimental data form an unbalanced data set by various criteria; for example, the proportion of the N1 stage is only about 3.7%, showing a serious class imbalance. Therefore, to improve the accuracy for the minority class, the data need to be balanced. After balancing, features are extracted and filtered in three steps to obtain an optimal feature subset, which is then used to train and test an SVM to obtain the classification results.

Data balance processing
For unbalanced data sets, randomly over-sampling the minority classes alone leads to model over-fitting, while random under-sampling loses data information. The framework proposed in this paper therefore improves on these approaches at the data level.
First, the data are divided into a training set and a test set in an 8:2 ratio, and the training set is then processed as follows. According to the sample mean of the data set, each class is assigned to either a large-class set or a small-class set: if the data set has N samples and n classes, the sample mean is N/n, and a class with more samples than this mean is considered large, otherwise small. The total sample numbers of the large-class and small-class sets are then counted separately, and the per-class sample mean of the larger set is taken as the expansion target for each class in the smaller set. The SMOTE algorithm expands the small classes to obtain set A; the sample number M of the small classes is extracted, the large classes are clustered with K = M clusters, and the K cluster centroids are combined with the small-class samples to form a new, balanced training set. The pseudocode of the algorithm is as follows:

Input: initial complete data set Q
Output: balanced training set D_train
1) Q = pre-process();
2) D_train = 80% of the samples randomly selected from data set Q; D_test = the remaining 20%;
3) For the training set D with N samples and n classes, the sample mean is N/n; classes above the mean form the large-class set T1, and the remaining classes form the small-class set T2. Count the sample numbers of T1 and T2 separately; the per-class sample mean of the larger group is the target of the smaller group;
4) Balance the data according to T1, T2 and the target value:
a) Using the target value, apply the SMOTE algorithm to expand the small-class samples in T2 to obtain set A;
b) Extract the small-class and large-class sample sets L1 and L2, respectively, and let M be the sample number of the small-class set L1, so that K = M;

International Journal of Computer Electrical Engineering
c) Perform K-means clustering on the samples of set L2 to obtain K disjoint subsets and their cluster centroids;
d) Take the K cluster centroids and record them as set L2';
5) D_train = L1 ∪ L2'.

Two parameters of the traditional SMOTE algorithm are changed in this data balancing step. Previously, the SMOTE algorithm expanded the data according to an over-sampling rate N; in this paper, the sample mean of the large classes is instead taken as the expansion target of the small classes. Additionally, for the SMOTE parameter K, that is, the number of nearest neighbours, nine values were tested: 1, 3, 5, 7, 9, 11, 13, 15, and 17. To reduce dependence on a particular data split, each result is averaged over five runs, and the best K value is determined by comparing G-mean values. The results for different K values on the balanced data set are shown in Fig. 1, where the horizontal axis represents the number of nearest neighbours and the vertical axis the average G-mean value. The graph shows that the G-mean value is highest when K = 11; therefore, K = 11 is used for the SMOTE algorithm in the following experiments.

Since the original data contain noise, invalid records, and non-uniform dimensions, the initial step, pre-process(), is needed. The specific steps are as follows. First, the signal is denoised with a six-level wavelet decomposition using the 'db4' wavelet basis. Second, invalid data are removed by rules: for example, a similarity threshold can be set for duplicate records, and any record whose similarity exceeds the threshold is removed; incomplete records can be completed with the KNN algorithm; erroneous outliers can be eliminated by clustering, regression, binning and other means, or removed according to the data distribution characteristics. Finally, the non-uniform dimensions are eliminated by normalization.
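The balancing steps 4a-4d above can be sketched as follows. This is an illustrative NumPy implementation, not the authors' exact code: the interpolation follows the standard SMOTE formulation, the K-means is the plain Lloyd iteration, and all function names are our own.

```python
import numpy as np

def smote_expand(X_small, n_target, k=11, seed=0):
    """Step 4a (sketch): SMOTE-style over-sampling. Each synthetic sample lies
    on the segment between a small-class sample and one of its k nearest
    small-class neighbours, until the class holds n_target samples.
    Assumes X_small has at least two samples."""
    rng = np.random.default_rng(seed)
    n = len(X_small)
    k = min(k, n - 1)
    dists = np.linalg.norm(X_small[:, None, :] - X_small[None, :, :], axis=2)
    neighbours = np.argsort(dists, axis=1)[:, 1:k + 1]  # column 0 is the sample itself
    synthetic = []
    while n + len(synthetic) < n_target:
        i = rng.integers(n)                      # random seed sample
        j = neighbours[i, rng.integers(k)]       # one of its nearest neighbours
        gap = rng.random()                       # interpolation factor in [0, 1)
        synthetic.append(X_small[i] + gap * (X_small[j] - X_small[i]))
    return np.vstack([X_small] + synthetic) if synthetic else X_small.copy()

def kmeans_centroids(X_large, k, iters=100, seed=0):
    """Steps 4c-4d (sketch): plain K-means; the K centroids replace the
    large-class samples in the balanced training set."""
    rng = np.random.default_rng(seed)
    centroids = X_large[rng.choice(len(X_large), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X_large[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        new = np.array([X_large[labels == j].mean(0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```

Because synthetic samples are interpolated, they stay inside the convex hull of the small class, which avoids the exact duplicates that plain random over-sampling would produce.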
Min-max normalization is the method most commonly used here: each feature x is scaled to x' = (x - min) / (max - min), so that the results fall in [0, 1].
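The normalization step reduces to one short function; a minimal sketch (constant columns are mapped to 0 to avoid division by zero):

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature column to [0, 1] via (x - min) / (max - min)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard constant columns
    return (X - lo) / span
```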

Feature extraction
According to the expert staging results, a 30-second data segment is intercepted for each feature extraction. First, the two original EEG signals are superimposed; then 25 features related to sleep staging are extracted from three classes: the time domain, the frequency domain and non-linear features (see Table 2). α and β waves occur in the W phase and REM phase, so the α or β wave can distinguish the W and REM phases from the NREM phases; the δ wave occurs only in the deep sleep phase, while the θ wave occurs in the light sleep phase and REM phase, so the δ or θ wave can distinguish deep sleep from light sleep. Sleep-stage entropy is high in the W stage, decreases gradually as sleep deepens, and increases again in the REM stage.
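As an illustration of the frequency-domain part of the feature set, the relative power of the classical EEG wave bands can be estimated from a 30-second epoch with a Welch periodogram. The band edges below are common conventions and an assumption on our part; the paper's exact 25 features are those listed in Table 2.

```python
import numpy as np
from scipy.signal import welch

# Assumed band edges (Hz); the paper's exact definitions may differ.
BANDS = {"delta": (0.5, 4.0), "theta": (4.0, 8.0),
         "alpha": (8.0, 13.0), "beta": (13.0, 30.0)}

def relative_band_powers(epoch, fs):
    """Relative spectral power of one 30 s epoch in the classical EEG bands,
    estimated with a Welch periodogram."""
    f, pxx = welch(epoch, fs=fs, nperseg=min(len(epoch), 4 * fs))
    total = pxx[(f >= 0.5) & (f < 30.0)].sum()  # power in the 0.5-30 Hz range
    return {name: pxx[(f >= lo) & (f < hi)].sum() / total
            for name, (lo, hi) in BANDS.items()}
```

For a wake epoch dominated by a ~10 Hz rhythm, the alpha band carries most of the relative power, matching the staging rules described above.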

Feature selection
In feature selection, a set of optimal feature subsets is selected from the original feature set based on some evaluation criteria. The purpose of feature selection is to select a minimal feature subset according to the given criteria so that the classification accuracy is no worse than with the original feature set. Moreover, feature selection can reduce the redundancy between features and remove irrelevant attributes, making the description of the data sets more accurate and the final model smaller and easier to understand. Therefore, this paper performs feature selection on the balanced data sets using the following three methods.

ReliefF algorithm
The Relief algorithm [23] is a feature-weighting algorithm for two-class problems. The basic idea is to assign different weights according to the correlation between each feature and the class, and features with weights smaller than a certain threshold are removed. The ReliefF algorithm is an extension of the Relief algorithm that can handle multi-class problems [24]. In each iteration, ReliefF randomly extracts one sample i, finds its k nearest neighbours in the same class (nearest hits) and in each other class c (nearest misses), and updates each feature weight: the weight is decreased by the feature distance to the hits and increased by the class-prior-weighted feature distance to the misses [25], [26].
There are two definitions of the per-feature distance function. For a numeric feature A, the distance between two samples x1 and x2 on that feature is diff(A, x1, x2) = |x1,A - x2,A| / (max(A) - min(A)). For a non-numeric feature, diff(A, x1, x2) = 0 if the two samples take the same value on A, and 1 otherwise. The ReliefF algorithm is a well-known feature selection algorithm with comparatively good performance: it is simple in principle, efficient in operation, and places no restrictions on data types. It has therefore obtained good experimental results in classification problems.
In this paper, the ReliefF algorithm is first used to select subsets of 5, 7, 9, 11, 13, and 15 features from the 25 extracted features. An SVM is then used to classify with each subset, and the average F-score values are calculated for the minority class and for all classes. The results are shown in Fig. 2, where the horizontal axis represents the number of selected features and the vertical axis the average F-score for that number of features. The weights of the features in each group are sorted separately, and Table 3 shows the sorted feature sequences. Finally, according to the results in Fig. 2, the best feature subset is determined, that is, the smallest number of features with the highest average F-score. Considering Fig. 2 and Table 3 together, the greater the number of features, the greater the average F-score of all classes and of the minority class; the exception is 7 features, where the average F-score decreases. This decrease may be due to the randomness in feature selection, as a poor selection affects the final result. Weighing the runtime against the average F-score, the final number of features is set to 11, with feature sequence numbers 25, 24, 23, 20, 21, 15, 16, 1, 3, 4, and 11, that is, three EEG wave-band features, the coefficient of variation, SEF95-SEF50 (8-16 Hz), SEF50 (8-16 Hz), SEF95 (8-16 Hz), SEF95-SEF50 (0.5-12 Hz), the zero-crossing rate, information entropy and symbol entropy.
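The ReliefF weight update for numeric features can be sketched as follows. This is a simplified illustration: the number of iterations, k, and the min-max scaling are our choices, not the paper's settings.

```python
import numpy as np

def relieff_weights(X, y, n_iter=200, k=5, seed=0):
    """Simplified ReliefF sketch (numeric features): a feature's weight rises
    when it separates a sample from other classes' nearest misses and falls
    when it separates the sample from its own class's nearest hits."""
    rng = np.random.default_rng(seed)
    span = np.ptp(X, axis=0)
    X = (X - X.min(axis=0)) / np.where(span > 0, span, 1.0)  # per-feature diff in [0, 1]
    n, d = X.shape
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)                          # random sample i
        dist = np.abs(X - X[i]).sum(axis=1)          # Manhattan distance to i
        for c in classes:
            idx = np.where((y == c) & (np.arange(n) != i))[0]
            if len(idx) == 0:
                continue
            nearest = idx[np.argsort(dist[idx])[:k]]
            diff = np.abs(X[nearest] - X[i]).mean(axis=0)
            if c == y[i]:
                w -= diff / n_iter                   # nearest hits shrink the weight
            else:                                    # nearest misses grow it,
                w += prior[c] / (1 - prior[y[i]]) * diff / n_iter  # prior-weighted
    return w
```

An informative feature (one whose value tracks the class) receives a clearly larger weight than a pure-noise feature, which is exactly the ranking used to build Table 3.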

Pearson correlation coefficient
A good subset of features should have a high correlation with the classification, while the correlation between features should be low. Therefore, the Pearson correlation coefficient between features is calculated according to the following formula:

ρ(X, Y) = E[(X - μX)(Y - μY)] / (σX σY). (4)

In formula (4), μ is the mean value, σ is the standard deviation, and E is the mathematical expectation. If the correlation coefficient between two features exceeds 0.95, the two features are strongly correlated and redundant; therefore, the feature that ranks behind in ReliefF, that is, the feature with the smaller weight, is deleted.
The results of the Pearson correlation coefficient method are shown in Table 4 below. It is found that the correlation coefficients for the pairs of tested features are very low. Therefore, no features are deleted in this step.
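The redundancy filter described above, deleting the lower-weighted member of any feature pair whose |Pearson r| exceeds 0.95, can be sketched as (an illustrative implementation; the function name is ours):

```python
import numpy as np

def drop_redundant(X, weights, threshold=0.95):
    """Return a boolean mask of features to keep. For each pair with
    |Pearson r| > threshold, drop the feature with the smaller ReliefF weight."""
    r = np.corrcoef(X, rowvar=False)        # feature-by-feature correlation matrix
    keep = np.ones(X.shape[1], dtype=bool)
    for i in range(X.shape[1]):
        for j in range(i + 1, X.shape[1]):
            if keep[i] and keep[j] and abs(r[i, j]) > threshold:
                drop = i if weights[i] < weights[j] else j  # ranks behind in ReliefF
                keep[drop] = False
    return keep
```

On the paper's data, all tested pairs fell below the threshold, so this step removed nothing; the sketch shows what would happen if a near-duplicate feature were present.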

Sequential backward selection
This method is a top-down approach, which starts by assuming that the whole feature set is the optimal feature set. At each step, the algorithm deletes the feature that contributes least to the criterion function, until the number of remaining features meets the set requirement. The advantage of this algorithm is that it fully accounts for the statistical correlation between features. In practical applications, it runs quickly, scales well computationally, and is robust [27]-[29].
Since sequential backward selection judges whether a feature contributes to the criterion function, each candidate feature subset is evaluated by the final classification accuracy; the subset selected in this round is therefore the optimal subset. According to the results of this round of screening, feature numbers 11, 16, and 24 are deleted, leaving eight features as the optimal subset for this paper, with feature sequence numbers 1, 3, 4, 15, 20, 21, 23, and 25. All of the following results are based on these eight features.
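Sequential backward selection can be sketched generically. Here score_fn stands for the criterion function (in the paper, the final classification accuracy of the candidate subset); the function name and signature are ours:

```python
def sbs(features, score_fn, n_target):
    """Sequential backward selection sketch: repeatedly drop the feature whose
    removal hurts the criterion score the least, until n_target remain."""
    selected = list(features)
    while len(selected) > n_target:
        # Evaluate every one-feature-removed candidate subset and keep the best.
        best_subset, _ = max(
            (([f2 for f2 in selected if f2 != f], f) for f in selected),
            key=lambda cand: score_fn(cand[0]),
        )
        selected = best_subset
    return selected
```

With a toy additive score, the algorithm keeps the highest-scoring features, mirroring how the paper's round of screening settles on the eight-feature subset.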

Classification method
The main idea of the SVM [30] is to map the training set to a high-dimensional space through a kernel function, which alleviates problems such as over-fitting, the curse of dimensionality, and local minima. Studies show that the SVM has a good classification effect on balanced data; therefore, this paper gives priority to the SVM as the classification method. The objective function of the nonlinear SVM problem is as follows:

max_α Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j), subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0. (5)

In formula (5), K(x_i, x_j) is a kernel function. The kernel function selected in this paper is the Gaussian radial basis kernel function, shown in formula (6):

K(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²)). (6)

The values of the parameters σ and C have a great influence on the accuracy of the model. A one-versus-rest coding strategy is used to first divide the multi-classification problem into several binary classification problems. Then, five-fold cross-validation is used: the data set is divided into five equal parts, four parts are randomly selected as the training set, and the remaining part is used as the test set. Ten results are averaged together and used as the final result to optimize the model and obtain more reliable experimental results.
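A sketch of this classification setup with an RBF-kernel SVM and five-fold cross-validation, on synthetic stand-in data. This is illustrative only: scikit-learn's SVC parametrizes the Gaussian kernel by gamma rather than by σ directly, and the sample counts, class counts and parameter values here are our assumptions, not the paper's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: 8 features and 6 classes, mimicking the selected feature
# subset and the six sleep stages (W, N1-N4, REM).
X, y = make_classification(n_samples=600, n_features=8, n_informative=6,
                           n_classes=6, n_clusters_per_class=1,
                           class_sep=2.0, random_state=0)

# RBF-kernel SVM; gamma plays the role of the sigma parameter in formula (6).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scores = cross_val_score(clf, X, y, cv=5)   # five-fold cross-validation
mean_acc = scores.mean()
```

Averaging the fold scores corresponds to the paper's practice of averaging repeated cross-validation results before reporting.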

Model evaluation criteria
In traditional classification methods, accuracy is often used as the evaluation index, but in unbalanced data classification, accuracy is no longer a reasonable index [31], [32]. For unbalanced problems, the commonly used indicators are based on the confusion matrix, such as recall, the F-measure, the G-mean, and the AUC [33]. The confusion matrix is shown in Table 5. The F-measure is the harmonic mean of recall and precision, defined as follows:

F-measure = 2 × recall × precision / (recall + precision), (7)

where recall = TP / (TP + FN) and precision = TP / (TP + FP).

The G-mean value represents the geometric mean of the classification accuracies of the minority and majority classes. It is large only when the accuracies of both classes are high, so maximizing it keeps the accuracies of the majority and minority classes balanced; the G-mean can therefore reasonably evaluate the overall classification performance on unbalanced data sets [34]. It is defined as follows:

G-mean = sqrt( (TP / (TP + FN)) × (TN / (TN + FP)) ).

The Area Under Curve (AUC) is the area under the Receiver Operating Characteristic (ROC) curve and is a commonly used index for classification models of unbalanced data in many papers. The larger the area under the curve, the larger the AUC value and the better the classification effect of the model. This paper uses the F-measure to evaluate the classification performance of the minority classes, the G-mean to evaluate the overall classification performance of the data sets, and the AUC to evaluate the classification performance of the classifiers.

Volume 12, Number 2, June 2020
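The two confusion-matrix metrics above can be computed directly from the TP, FP, FN and TN counts; a minimal sketch (function names are ours):

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of recall and precision."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return 2 * recall * precision / (recall + precision)

def g_mean(tp, fp, fn, tn):
    """Geometric mean of minority-class and majority-class accuracy."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity * specificity) ** 0.5
```

For example, with TP = 8, FN = 2, FP = 2, both recall and precision are 0.8, so the F-measure is 0.8 regardless of how large TN is, whereas the G-mean also reflects the majority-class accuracy.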

Results
The data set is divided into a training set and a test set in an 8:2 ratio. The validation method is five-fold cross-validation, and the average of 10 results is used as the experimental result. Table 6 compares the F-score values of the minority class for different algorithms on the balanced and unbalanced data sets; the average of the F-scores of each minority class, calculated according to formula (7), is taken as the final result. Table 7 compares the G-mean values of the different sleep stages for the different algorithms on balanced and unbalanced data:

Table 7. G-mean values (%) for each sleep stage (eight columns: the four algorithms under unbalanced and balanced data, in the order of the original table):
W:  89 93 89 99 92 92 83 94
N1: 91 90 88 95 66 62 88 82
N2: 87 92 90 93 91 88 90 90
N3: 93 91 91 96 80 77 88 82
N4: 95 93 90 97 90 89 88 90
R:  88 92 88 95 87 82 87 87

Fig. 3 compares the F-score values for balanced and unbalanced data and shows the evaluation results of the different algorithms under both conditions. In Fig. 4, the horizontal axis represents the four algorithms and the vertical axis the values of the different evaluation indexes; Un and B denote the unbalanced and balanced data sets, respectively. This paper also uses the ROC curve to observe classifier performance on the balanced and unbalanced data sets, as shown in Fig. 5. Considering the results in Tables 6, 7 and 8 together, after data balancing the F-score values of the minority class, the G-mean values and the AUC values obtained by the different algorithms are all higher than those obtained on the unbalanced data sets. Additionally, when comparing the decision tree, KNN, naive Bayes and SVM algorithms, SVM achieves the best classification results on both the balanced and unbalanced data sets.
Because SVM is based on the VC dimension theory of statistical learning theory and the principle of structural risk minimization, it can achieve global optimal classification with limited sample information [35]. In addition, in Table 7, the G-mean value of the awake stage is the highest under the SVM classification because the EEG features of waking and sleeping conditions are quite different. From the results shown in Fig. 3 and Fig. 5, the classification effect has been improved after data balancing, and the average F-score value in balanced data sets is 9.2% higher than that in unbalanced data sets. In Fig. 4, the G-mean and F-score values fluctuate greatly before and after balancing, but the AUC values fluctuate very little, indicating that the AUC value is not affected by the sample distribution. AUC values closer to 1 indicate better classification effects. In summary, the proposed framework is suitable for dealing with multi-class unbalanced data models in sleep staging and can more effectively increase the number of minority class samples in unbalanced data sets, thereby improving the classification accuracy of the minority class and the classification performance of classifiers.

Conclusion
To address the low classification accuracy of minority classes in sleep staging with unbalanced data, this paper makes improvements at the data level and then extracts and filters features from the balanced data. The proposed framework is validated using EEG data from the public MIT-BIH (PhysioNet) database. The results show that the proposed framework improves not only the recognition of the minority class but also the overall classification effect. In future work, the following improvements should be considered: (1) reducing the time complexity and improving the real-time performance of data balancing, (2) extracting fewer and better classification features, and (3) improving the method of synthesizing minority-class samples so that the distribution of new samples is more reasonable.

Conflict of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.