Impact of Feature Selection Methods on the Classification of DDoS Attacks using XGBoost

— Distributed Denial of Service (DDoS) attacks pose a major challenge for today's security systems, given the variety of their implementations and the scale they can achieve. One approach for their early detection is the use of Machine Learning (ML) techniques, which derive rules for classifying traffic from historical data. However, different types of data contribute unequally to the assertiveness of the trained model. The use of Feature Selection (FS) techniques as a pre-processing step allows identification of the most relevant features for the problem at hand. This reduces training time and can even improve performance when noisy variables are eliminated. The current work uses a public dataset and the XGBoost algorithm to measure the impact of FS techniques on the DDoS attack classification problem. We consider both techniques independent of the sample labels and methods that use this information to rank the variables in order of importance. We analyze the problem from the point of view of both Binary and Multiclass classification, and we build a benchmark of classification metrics and execution times. Our comparisons involve the Accuracy, Precision, Recall, and F1 Score metrics for different FS methods, in addition to training and execution time. The results show, for both the Binary (78% reduction of the features) and the Multiclass classifiers (60% reduction of the features), that the ANOVA method proved to be the most beneficial.


I. INTRODUCTION
Distributed Denial of Service (DDoS) attacks are increasingly frequent and voluminous on the Internet. Daily, thousands of attacks are launched against the most diverse targets: governments, e-commerce companies, telecommunications service providers, multimedia content distributors, among others [1]. The motivations for these attacks are very diverse, such as economic interests, political activism, or even intellectual curiosity. Currently it is even possible to hire DDoS attacks against a specific target [2]. Attackers, who have numerous infected devices under their command, charge according to the duration and volume of the attack. In addition to the direct economic and social damage caused by the interruption of services, the reputation and credibility of the attack victims are also severely tarnished [3].
Detection of DDoS attacks can be performed by Intrusion Detection/Prevention Systems (IDPS). IDPS have traditionally been divided into two areas: detection by signature and by anomaly [4]. Although attack signatures developed by experts are able to identify threats with great precision, they become ineffective against unprecedented attacks. In contrast, the use of anomaly detection is able to offer some protection even against zero-day attacks. One of the biggest difficulties in implementing DDoS detection systems via anomaly detection is in minimizing the occurrence of false positive or false negative alerts.
There are several anomaly detection techniques. Some are based on the comparison of correlations and gains [5-6], while others focus on clustering methods [7]. However, it is Machine Learning (ML) models that have gained the most relevance recently, thanks to advances in the availability of computational power, specialized software, and public datasets. The application of ML models for detecting DDoS attacks has been discussed in the literature for some decades [8-10]. The problem of anomaly detection has been addressed both with supervised learning (such as classifiers) and with unsupervised learning.
However, the selection of features that serve as input to a model is a less explored subject within the context of attack detection. The use of Feature Selection (FS) can bring several benefits, including: (1) a more agile attack detection process; (2) less need for storage and memory when implementing the classifier; and (3) an increase in the ability to interpret the model generated [11].
The objective of this work is, therefore, to measure the impact of FS techniques on the problem of classifying DDoS attacks using Machine Learning. The proposal is to establish a fixed classifier algorithm and evaluate the influence of FS methods on performance metrics. The use of classifiers presupposes a labeled dataset. In particular, in this context, the following are analyzed:
• The impact of FS methods that are independent of the target feature;
• The impact of FS methods that do depend on the target feature;
• The execution time of the whole model (which includes performing FS and training the classifier).
The rest of the article is divided as follows: Section II presents the related works, introducing the main types of FS, datasets, and classifiers found in the literature. Section III details the proposal of the experiments, including the environment used, the data architecture, and the quality metrics of the classification. Section IV presents and analyzes the main results obtained in the computer simulations. Finally, Section V contains the general conclusions of the work, contributions, and future works.

II. BACKGROUND
This section briefly describes the related work and the main concepts underlying our proposal.

A. Related Works
The FS algorithms belong to the context of dimensionality reduction. The objective of these techniques is to find a subset of input features that are closer to the target feature and more distant from each other [12]. There are several ways to define distance in this case, such as Pearson's Correlation Coefficient (PCC) or Mutual Information (MI) [13]. An FS technique is characterized by the choice of a subset of features, among the original variables, without any transformation or creation of new variables. It differs, therefore, from feature extraction techniques, such as Principal Component Analysis (PCA), which projects the input features into a space different from the original one. An unwanted consequence of methods that transform the original variables is the loss of interpretability of the new variables.
FS methods are divided into: (1) filters; (2) wrappers; (3) embedded; and (4) hybrids [14]. In the work of Polat, Polat, and Cetin [8], the uses of filters, wrappers, and LASSO (Least Absolute Shrinkage and Selection Operator, which is an embedded way of executing FS) are explored as a preprocessing step for the classification of DDoS attacks in Software-Defined Networks (SDN). As classifiers the authors employ Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Naive Bayes (NB), and Artificial Neural Networks (ANN). In all cases, Accuracy improvements occur when the techniques are applied.
The use of filters involves the comparison of a single variable with the target, without taking into account its relationship with the other variables. Because of this, these methods are computationally efficient. They involve assigning a number to each feature and then usually sorting them, choosing the most appropriate K variables. This category includes PCC, Information Gain (IG), and Chi-Square Test, for example. Pattawaro and Polprasert [15] propose a novel technique for classification problems, the Attribute Ratio (AR), which only takes into account counting ratios of the observations belonging to each class, thus being a type of filter.
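To make the filter idea concrete, the short sketch below ranks features with the Chi-Square test mentioned above and keeps the K best. Scikit-Learn is assumed, and the synthetic data and K = 8 are illustrative only.

```python
# Minimal filter-style Feature Selection sketch (illustrative data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X = np.abs(X)  # chi2 requires non-negative feature values

# Score every feature against the target, then keep the 8 best.
selector = SelectKBest(score_func=chi2, k=8)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)       # (500, 8)
print(selector.scores_[:5])  # per-feature scores used for the ranking
```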
Wrappers treat feature selection as a search problem in the space of variables, directly involving the use of some classifier method. A quality metric is chosen (such as Accuracy), a model is trained using a subset of the features and, depending on the result, variables are added or removed iteratively. An example is the work of Gupta [10], who applies Recursive Feature Elimination (RFE) to select the appropriate variables for the classification of DDoS attacks. However, wrappers are computationally very expensive.
The choice of the best variables within the space of all possible combinations is considered an NP-Hard problem [16]. There is even research on ways to optimize the execution of wrappers. For example, one way is to use Genetic Algorithms [17], which incorporate the idea of an objective function (analogous to Natural Selection) to choose the most suitable population of variables. Tree-based classifiers, such as Decision Trees (DT), Random Forest (RF), and XGBoost, already have methods of quantifying the importance of each feature built into their training. An example is the work of Dhaliwal, Nahid, and Abbas [18], in which an XGBoost model is trained for intrusion detection tasks, using the feature importance score generated by the algorithm to interpret the results. In the work of Wang et al. [14] a hybrid FS model is shown. More specifically, an ensemble of the results obtained by other methods is carried out, aggregating them through arithmetic and geometric means.
Several datasets have been used over the years as references for intrusion detection research, with an emphasis on the most popular: KDD 99 [10] [19-20] and NSL-KDD [21-23]. These attack records are used in several studies, many of them focused on detecting anomalous network behavior using ML. However, these datasets are out of date, since new types of attacks, which are not represented in these sets, have been introduced in the 21st century.
Therefore, it is necessary to use a more up-to-date dataset. The dataset Canadian Institute for Cybersecurity DDoS 2019 (CICDDoS2019), published by Sharafaldin et al. [24], is extracted from a testbed with real equipment, such as routers, switches, firewalls, and several servers. In their work, the authors generate both attacks and benign background traffic. In total, 86 network parameters are collected. Attack traffic is more represented than normal traffic in this dataset. As part of the paper that presents the set, the authors propose the use of ML to perform the classification of attacks, building models with the Iterative Dichotomiser 3 (ID3), RF, NB, and Logistic Regression (LR) algorithms. Since its publication, this dataset has already served as the basis for some works such as that of Hussain [25], which uses resampling to address class imbalance, and Li [26], which makes use of both traditional ML and Deep Learning (DL) algorithms.
A summary of the main related works is found in Table I. The datasets used are listed for reference. Table I also shows dataset processing techniques, such as Feature Selection and sampling. The methods for training a DDoS attack classifier are also exposed, such as the algorithm used, the type of classification it performs and whether Cross-Validation was used or not. Finally, quality metrics are also listed in Table I. In addition, the experiments that will be shown in this work are placed at the end of the table for comparison with the others. In general, this work seeks to apply a wide range of Feature Selection techniques and performance metrics to assess the impact of the former on the DDoS attack classification problem. We emphasize that this work tackles the problem from both a Binary and a Multiclass point of view. The ratios between metrics and runtimes are also introduced, here called Benefit-Cost Ratios, and will be explained in Section III.D.
Tree-based algorithms have been shown to be efficient as classifiers and regressors for tabular data. In particular, XGBoost [27] drew the attention of industry and academia for its performance and training agility. It is a gradient boosting algorithm, which produces a strong predictive model through the combination of weak models. Its use is very popular on ML competition platforms, such as Kaggle [28]. In addition, it has been applied to the DDoS attack classification problem with good results [15], [18], [29].

B. Literature Review
As described in [27], XGBoost makes its predictions through an ensemble of decision trees. These trees are constructed so as to minimize a cost function that contains regularization terms (to penalize very complex models). The weight with which each new tree contributes to the final prediction is calculated to descend the gradient of the cost function. As its training depends on the result of the previous iteration, the trees are trained sequentially. XGBoost will be used as the classifier in the experiments that follow.
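A minimal training sketch with the XGBoost scikit-learn wrapper is shown below; the hyperparameters and synthetic data are illustrative defaults, not the values used in the experiments.

```python
# Minimal XGBoost classifier sketch (illustrative data and hyperparameters).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Trees are added sequentially, each one following the gradient of the loss.
model = XGBClassifier(n_estimators=100, max_depth=6, learning_rate=0.3)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # mean accuracy on the held-out split
```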
Several Feature Selection methods are used in the simulations in the next sections. Some of them do not depend on the labels of the samples at hand. For example, features with low variance (and, consequently, also constant features, with variance = 0) have no discriminatory power in a decision tree, and therefore can be discarded [30]. Correlated variables end up dividing among themselves the importance they have in a predictive model [13]. They can thus be represented by a single variable.
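A sketch of this idea with scikit-learn's VarianceThreshold follows; the 1e-2 threshold and toy matrix are illustrative (as discussed in Section III, the actual threshold is a hyperparameter).

```python
# Discarding constant and low-variance features (illustrative threshold).
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[1.0, 0.0, 3.5],
              [1.0, 0.0, 2.5],
              [1.0, 0.0, 3.0]])  # the first two columns carry no information

vt = VarianceThreshold(threshold=1e-2)
X_reduced = vt.fit_transform(X)  # keeps only the third column
print(vt.get_support())          # [False False  True]
```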
There are also variable selection methods that are label-dependent. We can use Analysis of Variance (ANOVA), for example, to see whether a categorical target (as in the case of DDoS attacks) influences the behavior of a numerical variable. ANOVA uses an F-test to determine whether two or more means come from the same distribution [31]. A high value of the F-statistic implies that the target categories influence the distribution of the numerical variable, and the latter should be added to the feature set.
The idea of Mutual Information (MI) comes from Information Theory and can also be used to select features. It is the extent to which knowing one random variable reduces the uncertainty about another random variable. It is usually calculated between two categorical variables, but it can be adapted to the case in which we are dealing with a numerical variable and a categorical one [32]. A high MI value between a feature and the target suggests that the former should be included in the model.
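Both criteria are available as scoring functions in scikit-learn; the sketch below ranks illustrative synthetic features by the ANOVA F-statistic and by MI.

```python
# Ranking features by the ANOVA F-test and by Mutual Information.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

f_scores, _ = f_classif(X, y)          # high F: the classes shift the feature's mean
mi_scores = mutual_info_classif(X, y)  # high MI: the feature reduces label uncertainty

print(np.argsort(f_scores)[::-1][:5])   # top-5 features by ANOVA
print(np.argsort(mi_scores)[::-1][:5])  # top-5 features by MI
```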
The Relief family of algorithms performs variable selection by assigning a weight to each feature [33]. Some samples are taken from the dataset and compared with their closest neighbors: those of the same category (nearest hits) and those of different categories (nearest misses). The weight of a feature grows when its values differ between a sample and its nearest misses, and shrinks when they differ between a sample and its nearest hits.
As a by-product of XGBoost training, rankings of feature importance are created. One of these rankings concerns the Gain of the variables [27], which is a metric that measures the relative contribution of each variable to each tree trained in the model. We can then select only the features considered most important by this metric, which becomes a Feature Selection method embedded in XGBoost.
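The Gain ranking can be read directly from a trained model, as in the sketch below (illustrative data; XGBoost names features f0, f1, ... by default).

```python
# Extracting the Gain-based importance ranking produced by XGBoost training.
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = XGBClassifier(n_estimators=50).fit(X, y)

# Only features actually used in some tree appear in the dictionary.
gains = model.get_booster().get_score(importance_type='gain')
ranking = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
print(ranking[:5])  # most important features first
```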
RFE is a wrapper Feature Selection technique. It starts from a complete feature set and uses a Machine Learning model to list the most important features. From that point on, the worst one is discarded and the process starts again with the remaining set [34], which characterizes the recursive nature of this method. For the experiments in this paper, XGBoost itself is used as the base Machine Learning model for RFE.
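A sketch of RFE wrapped around XGBoost follows; here one feature is dropped per iteration until 10 remain, with both values chosen only for illustration.

```python
# RFE with XGBoost as the base model (illustrative data and parameters).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rfe = RFE(estimator=XGBClassifier(n_estimators=50),
          n_features_to_select=10,  # stop when 10 features remain
          step=1)                   # discard the worst feature per iteration
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the surviving features
```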

III. METHODOLOGY
To measure the influence of the use of FS on the DDoS attack classification problem, the CICDDoS2019 dataset was chosen [24]. This set includes labeled traffic samples from 12 modern DDoS attacks (NTP, DNS, LDAP, MSSQL, NetBIOS, SNMP, SSDP, UDP, UDP-Lag, WebDDoS, SYN, TFTP), in addition to benign traffic. The dataset is processed, subjected to several Feature Selection methods, and serves as input to a classifier that uses the XGBoost algorithm. No new Feature Selection method is being proposed in the experiments that follow. Instead, Feature Selection techniques known in the literature are used in order to compare them numerically.
Tree-based models have been shown to be consistently effective for classification problems using this dataset [25] [35].
The XGBoost classifier is fixed in the course of the experiments, since the objective of the study is to measure the influence of the FS techniques, not the chosen classification algorithm.
The steps of the simulations carried out are detailed in the rest of this section, as well as the main concepts necessary to evaluate them. Fig. 1 illustrates the general data flow of the experiments in this work. The next subsections detail the implementation of the architecture proposed by this diagram. Table II presents the distribution of the samples among the different traffic classes for the CICDDoS2019 dataset, demonstrating both the binary and multiclass representation of this set.

A. Resampling and Balancing
Intuitively, we can say that this is an extremely unbalanced dataset, with a number of attack samples that far exceeds benign traffic. To quantify this, consider a dataset with S distinct classes, where each class Ci has ni elements and the complete dataset has N samples. The balance of the dataset can be measured by the Pielou Index:

$$J = \frac{H}{\ln S}, \qquad H = -\sum_{i=1}^{S} \frac{n_i}{N}\,\ln\frac{n_i}{N} \qquad (1)$$

In equation (1), H is the Shannon Entropy [37]. The Pielou Index normalizes entropy by its maximum (ln S), so that the values are confined between 0 and 1. In this case, 0 represents the largest possible unbalance (all samples are of a single class) and 1 the largest possible balance (the samples are divided equally between classes). Note that the Pielou Index is applicable to sets with both binary and multiclass labels. An unbalanced dataset tends to favor the majority class(es) [38]. Therefore, as illustrated in Step Ⓐ of Fig. 1, two datasets were generated: one balanced from a binary point of view (in which there is the same number of attacks and benign traffic) and another balanced from a multiclass point of view (with all classes equally represented). These two ways of resampling the dataset make the new Pielou Index, in both cases, equal to 1. Due to hardware limitations, the resampling process took place by undersampling all classes, in order to fit the data in the available memory. Works like those of Hussain [25] and Li [26] also opt for undersampling. Finally, the 'WebDDoS' class was dropped because it had a much smaller number of samples than the other types of attacks.
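Equation (1) translates directly into code; the sketch below computes the Pielou Index from a list of per-class sample counts (every class is assumed to have at least one sample).

```python
# Pielou Index J = H / ln(S), equation (1).
import numpy as np

def pielou_index(class_counts):
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()       # class proportions n_i / N
    h = -np.sum(p * np.log(p))      # Shannon Entropy H
    return h / np.log(len(counts))  # normalize by the maximum, ln(S)

print(pielou_index([2640, 2640]))  # 1.0 -> perfectly balanced
print(pielou_index([9990, 10]))    # close to 0 -> heavily unbalanced
```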
After the resampling process, the binary and multiclass datasets serve as the basis for training binary and multiclass classifiers, respectively.

B. Label-Independent Feature Selection
The part called "Preprocessing" in Step Ⓑ of Fig. 1 includes the operations that follow. All categorical variables were discarded based on domain knowledge: attributes such as Flow ID, Source IP Address, and Destination IP Address are only identifiers, having no predictive value. Source Port and Destination Port could be treated as numerical variables, but this work focuses on network traffic statistics and chooses not to use them. All the remaining variables are numeric attributes, such as counters and ratios of these counters over time (such as Bytes/sec). The following variables, despite representing counters and measures of packet sizes (thus positive values), had spurious negative values in the dataset: 'Fwd Header Length', 'Bwd Header Length', 'Init_Win_bytes_forward', 'Init_Win_bytes_backward', 'min_seg_size_forward'. It was then decided to ignore the negative sign of these samples (i.e., make them positive). A detailed description of the features of the complete dataset can be found in [26]. Infinite values were replaced by the maximum value of the feature in question. The datasets after these operations serve as input for the XGBoost classifiers. We used 10-fold Stratified Cross-Validation (CV), during which the metrics illustrated by Step Ⓓ in Fig. 1 were collected.
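A sketch of these preprocessing operations on a pandas DataFrame follows. It assumes the CICDDoS2019 CSVs were already loaded into df; the column names mirror those cited in this paper, though the exact header strings in the published CSVs may differ slightly.

```python
# Preprocessing sketch: drop identifiers, fix spurious negatives, cap infinities.
import numpy as np
import pandas as pd

ID_COLUMNS = ['Flow ID', 'Source IP', 'Source Port', 'Destination IP',
              'Destination Port', 'Protocol', 'Timestamp', 'SimillarHTTP',
              'Inbound']
NEGATIVE_COUNTERS = ['Fwd Header Length', 'Bwd Header Length',
                     'Init_Win_bytes_forward', 'Init_Win_bytes_backward',
                     'min_seg_size_forward']

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop(columns=ID_COLUMNS, errors='ignore')  # identifiers only
    for col in NEGATIVE_COUNTERS:                      # ignore the negative sign
        if col in df.columns:
            df[col] = df[col].abs()
    # Replace infinite values with the per-column maximum finite value.
    for col in df.select_dtypes(include=[np.number]).columns:
        finite_max = df.loc[np.isfinite(df[col]), col].max()
        df[col] = df[col].replace([np.inf, -np.inf], finite_max)
    return df
```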
The second part of the Step Ⓑ performs the selection of variables through two operations: (1) removal of duplicate columns; and (2) removal of columns with low variance. These methods are called "Basic Methods" in subsequent sections.
Note that determining what counts as a "low" variance is a hyperparameter of the model, which should be defined as a threshold at training time. One point of attention: in order to verify the influence of the Feature Selection techniques on the classification metrics, it is important that the former are performed within the Cross-Validation process.
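In scikit-learn, this constraint can be enforced by placing the selector and the classifier in a single Pipeline, so that the selector is re-fitted on each training fold only; a minimal sketch with illustrative data follows.

```python
# Keeping Feature Selection inside Cross-Validation via a Pipeline.
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

pipe = Pipeline([('low_var', VarianceThreshold(threshold=1e-2)),
                 ('clf', XGBClassifier(n_estimators=50))])
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = cross_validate(pipe, X, y, cv=cv, scoring='f1')
print(scores['test_score'].mean(), scores['test_score'].std())
print(scores['fit_time'].mean())  # FS + training time per fold (the Fit Time)
```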
With the dataset reduced by the "Basic Methods", a PCC Matrix is generated between all the predictor variables. The correlation between the predictor variables and the target labels is not adequate, since the former are numerical and the latter categorical [34]. From this operation, groups of features are formed that have a correlation greater than a threshold established in training. This threshold, therefore, is another hyperparameter of the experiment. Within these groups, only one variable is maintained and the others are discarded. The resulting set is also used to train an XGBoost model, within a 10-fold CV. This method will be referred to as "Correlation" in the results of the experiments.
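A sketch of this grouping procedure follows; the 0.9 correlation threshold is illustrative only, and the first feature of each group is kept as its representative.

```python
# "Correlation" method sketch: drop all but one feature per correlated group.
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    corr = df.corr().abs()  # absolute PCC matrix between predictors
    to_drop = set()
    cols = list(corr.columns)
    for i, col_i in enumerate(cols):
        if col_i in to_drop:
            continue
        for col_j in cols[i + 1:]:
            if corr.loc[col_i, col_j] > threshold:
                to_drop.add(col_j)  # col_i stays as the group representative
    return df.drop(columns=sorted(to_drop))
```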

C. Label-Dependent Feature Selection
The methods of Step Ⓒ in Fig. 1 do take into account a relationship between the sample labels and the predictor variables to select the attributes that should be used in the model. In general, all of them are capable of generating rankings of importance of the features. With the ordered variables, the most relevant are chosen and the others are removed. As in the methods independent of the label, here it is also necessary to carry out the variable selection within the Cross-Validation process.
Thus, curves are generated for each of the classification quality metrics as a function of K, the number of selected features. This operation is repeated for each Feature Selection method considered.
To save time and memory allocation in the Feature Selection task with methods that depend on the sample label, it was decided to use as input to this block the output of the process of removing highly correlated variables (as in Fig. 1). This procedure facilitated further processing during the experiments, by significantly reducing the number of input features for this block.
As mentioned in Section II, the chosen FS methods are divided into the following categories: (1) filters; (2) wrappers; (3) embedded; and (4) hybrids. As Filters, the ANOVA F-test, MI, and ReliefF methods are considered for the experiments. XGBoost Gain is used as an embedded FS technique. The RFE method is used as a wrapper around XGBoost. All of these techniques are available in Scikit-Learn [39], with the exception of ReliefF [40], which has its own library, but is also compatible with NumPy Arrays.
Finally, an ensemble of the other techniques is performed, as proposed by Gupta [10]. In this method, the rankings of each variable for each FS technique considered are added in a new vector and the value of this sum is taken into account when ordering the variables. The experiments deal with an ensemble among ANOVA, MI, ReliefF, and XGBoost Gain. The RFE technique was not included in the ensemble due to its high computational cost.
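The rank-sum aggregation can be sketched as below, where each method contributes a per-feature rank and the consensus ordering comes from the summed ranks; the score arrays are assumed to be "higher is better".

```python
# Ensemble of FS rankings by rank sum (lower total rank = more important).
import numpy as np

def ensemble_ranking(score_lists):
    """score_lists: one 1-D score array per FS method, higher = better."""
    rank_sum = np.zeros(len(score_lists[0]))
    for scores in score_lists:
        order = np.argsort(scores)[::-1]      # indices of the best feature first
        ranks = np.empty_like(order)
        ranks[order] = np.arange(len(order))  # rank 0 = best feature
        rank_sum += ranks
    return np.argsort(rank_sum)               # features ordered by consensus

# e.g. ensemble_ranking([anova_scores, mi_scores, relieff_weights, xgb_gains])
```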

D. Performance Evaluation
Both classifiers trained in Steps Ⓑ and Ⓒ are evaluated according to the performance metrics illustrated in Step Ⓓ from Fig. 1. Metrics differ between binary and multiclass cases.
For a binary classifier, we define one of the labels as positive and the other as negative. This choice is arbitrary, but must be taken into account when interpreting the results. Choosing the class "Benign" as negative and "Attack" as positive, we have the nomenclature in Table III that follows. For a multiclass classifier, consider T samples in the test set and S distinct Ci classes. TPi is the number of elements correctly classified with the label of Ci. FPi are elements classified as Ci, but that belong to another class. FNi are the elements of Ci, but mistakenly classified in another class.

1) Accuracy (AC): the ratio between the correct classifications of the model and the total classifications made. For the binary case:

$$AC = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2)$$

For the multiclass case:

$$AC = \frac{1}{T}\sum_{i=1}^{S} TP_i \qquad (3)$$

2) Precision (PR): the ratio of the correct predictions to the total predictions for a given class. A high Precision value is linked to fewer false alarms. For the binary case:

$$PR = \frac{TP}{TP + FP} \qquad (4)$$

For the multiclass case, there are several possible aggregation methods [41]. The Macro-Average of Precision was chosen, in order not to privilege any specific class:

$$PR_{macro} = \frac{1}{S}\sum_{i=1}^{S} \frac{TP_i}{TP_i + FP_i} \qquad (5)$$

3) Recall (RC): the ratio of the correct predictions to the total elements of a given class. A high Recall value implies that most samples in a class have been recognized. For the binary case:

$$RC = \frac{TP}{TP + FN} \qquad (6)$$

For the multiclass case, again with the Macro-Average:

$$RC_{macro} = \frac{1}{S}\sum_{i=1}^{S} \frac{TP_i}{TP_i + FN_i} \qquad (7)$$

4) F1 Score (F1): the PR and RC metrics are conflicting requirements, since increasing one of them compromises the other. The F1 Score is the harmonic mean of the two. For the binary case:

$$F1 = 2\,\frac{PR \cdot RC}{PR + RC} \qquad (8)$$

For the multiclass case, the most recommended calculation [42] is the mean of the per-class F1 Scores, where PRi and RCi are the Precision and Recall of class Ci:

$$F1_{macro} = \frac{1}{S}\sum_{i=1}^{S} 2\,\frac{PR_i \cdot RC_i}{PR_i + RC_i} \qquad (9)$$

The metrics for Accuracy, Precision, Recall, and F1 Score vary between 0 (worst) and 1 (best), for both binary and multiclass classifiers. Since we are using the Cross-Validation process, we can calculate the Mean and Standard Deviation for each of these metrics.
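All four metrics, with the Macro-Averages used here, are available in scikit-learn; the labels below are illustrative.

```python
# Classification metrics with Macro-Averaging for the multiclass case.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2]

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred, average='macro'))
print(recall_score(y_true, y_pred, average='macro'))
print(f1_score(y_true, y_pred, average='macro'))  # mean of per-class F1 scores
```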

5) Fit Time: the sum of the time spent executing the Feature Selection process and training the classifier.

6) Benefit-Cost Ratio (BCR): a family of ratios given by dividing one of the metrics from 1) to 4) by the required Fit Time. For a metric X, we have:

$$BCR_X = \frac{X}{\text{Fit Time}} \qquad (10)$$

E. Multiclass Classifier Training Considerations
XGBoost training involves building new decision trees so as to descend the gradient of a cost function. For a Multiclass classifier, this cost function is traditionally the Categorical Cross-Entropy. However, it is also possible to train a Multiclass classifier from several Binary classifiers, in meta-learning strategies such as One-Vs-Rest (OvR) [43] and One-Vs-One (OvO) [44]. In the latter cases, the cost function to be minimized is the Binary Cross-Entropy. For the experiments, these meta-learning strategies were also considered.
The OvR strategy consists of transforming the classification problem among S classes into S binary classification problems. Each one of them determines whether the sample belongs to a certain class Ci or to the rest [43]. Each Binary classifier must return a probability, and the classifier with the highest probability indicates the predicted class for the sample. The OvO strategy demands more computational power, as it compares all possible 2-element combinations among the S classes [44]. This gives a total of S(S − 1)/2 binary classifiers. Each classifier then votes for a class. The class with the highest number of votes is predicted by the OvO model.
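Both strategies are available as wrappers in scikit-learn and can be combined with XGBoost, as in this illustrative sketch.

```python
# OvR (S binary models) and OvO (S(S-1)/2 binary models) around XGBoost.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=600, n_features=20, n_classes=4,
                           n_informative=8, random_state=0)

ovr = OneVsRestClassifier(XGBClassifier(n_estimators=50)).fit(X, y)
ovo = OneVsOneClassifier(XGBClassifier(n_estimators=50)).fit(X, y)
print(ovr.score(X, y), ovo.score(X, y))  # training accuracy of each strategy
```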

F. Analysis Scope
The present work focuses on numerically characterizing the influence of Feature Selection techniques on the performance metrics of a DDoS attack classifier. As all results are obtained using 10-fold Cross-Validation, we potentially have 10 distinct feature groups for each simulated data point. This situation is the same for both label-independent and label-dependent variable selection methods. It is beyond the scope of this article to list which features were selected in each round of the experiments.

IV. RESULTS
The results and analyses of the experiments described above are presented as follows.

A. Resampling and Balancing
The dataset described in Table II is unbalanced, which can favor the majority classes in a classification problem. From a binary point of view, it has a Pielou Index J = 0.012748 (according to equation 1). From a multiclass point of view, this same dataset has J = 0.764539. Two distinct datasets are generated, sub-samples of the original, in such a way that they become perfectly balanced data sets (J = 1). Sampling is performed in such a way that the final two datasets have the same number of elements.
The resulting datasets are shown in Table IV. Despite having different class distributions, both generated datasets have 5280 elements.
Fig. 2 shows the F1 Score results for the experiments with selection of variables independent of the sample labels. In the "Preprocessing" phase, 9 variables were eliminated, namely: Flow ID, Source IP, Source Port, Destination IP, Destination Port, Protocol, Timestamp, SimillarHTTP, and Inbound. The source and destination IP addresses only serve as identifiers of the devices in the testbed. The datasets were collected in [24] distributing attacks evenly across multiple victim servers. Works like [26] and [35] also chose to discard IP addresses in their predictive modeling, since it is the statistical behavior of the network parameters that is taken into account for the detection of attacks. For the "Basic Methods", all features with variance below a given threshold were eliminated. For "Correlation", features with pairwise correlation above a given threshold were grouped. The values of both thresholds were taken based on the method developed in [30] and proved to be adequate for the datasets used in the experiments.

B. Label-Independent Feature Selection
Analogously, Accuracy, Precision, and Recall were also measured. The behavior of all metrics was very similar with respect to the use of Feature Selection. The results of "Preprocessing" and "Basic Methods" were the same, showing that the removal of variables with low variance does not change the performance of a classifier (whether Binary or Multiclass). The performance after removing the highly correlated variables also did not change much, with a slight increase in the standard deviation of the metrics under Cross-Validation for the Binary, Multiclass, and Multiclass OvR classifiers.
The results of the classifiers subjected to label-independent FS methods are summarized in Table V. Overall, the Binary classifiers performed better than the Multiclass classifiers. Among the Multiclass classifiers, those that used the meta-learning strategies (OvO and OvR) achieved slightly higher performance than the traditional Multiclass classifier.
We can see in Table V that, even though the classification metrics remain stable, the number of variables could be reduced considerably. There is a mean reduction of 46% in the number of variables for the Binary classifiers (from 77 to 41 features), while the Multiclass classifiers have their variables reduced by 53% (from 77 to 36 features). Although the total number of samples is the same, the Binary and Multiclass datasets were built in different ways, with different goals (as already shown in Table IV). The Binary dataset has more elements of the benign traffic class. By having more samples of this type of traffic, which is closer to the normal behavior of users (and thus unpredictable), the Binary dataset has fewer columns with low variance, which would be eliminated by the "Basic Methods" described earlier. Benign traffic also lacks the repetitive patterns of attack traffic, which would be grouped and removed in a correlation analysis. The three Multiclass classifiers have the same number of final features (36 on average). The experiments were coded using the same random seed to divide the dataset into 10 folds, to make them reproducible. Since the dataset is the same for the three Multiclass classifiers, it is divided in the same way within the Cross-Validation and goes through variable selection with the same hyperparameters, so the Multiclass classifiers end up being trained with the same number of features.
The difference in the number of variables is reflected in the Fit Time, as shown in Fig. 3. The Fit Time was defined as the time to execute both the Feature Selection and the training of the classifier. Fig. 3 shows two different situations: for the Binary classifiers, the Fit Time increased with the use of variable selection methods, even though the number of features decreased. For the Multiclass classifiers, in contrast, we observe a sharp decrease in the Fit Time. For the Multiclass OvO and Multiclass OvR classifiers, the results are similar to those of the traditional Multiclass classifiers and are thus omitted from Fig. 3.
By separating the variable selection time from the training time of the Machine Learning model, we can verify the influence of each step of the process. Table VI shows that the training time for the Binary classifiers is already low in the initial models. When we remove variables using the other methods, the training time for the Binary classifiers drops by only 20%. For the Multiclass classifiers, this drop is 37%.
However, the runtimes of the Feature Selection steps are of the same order of magnitude as the training times of the Binary classifiers, but significantly shorter than the training times of the Multiclass classifiers. This makes the Fit Time of the Binary classifiers more sensitive to the increase in complexity between the "Basic Methods" (which remove low-variance features) and the removal of highly correlated variables, which requires more arithmetic operations. Note that, because these are target-independent Feature Selection methods, their execution time depends only on the number of predictor variables.

C. Label-Dependent Feature Selection
The methods of selecting variables dependent on the label of the samples produce rankings of attributes according to a well-established statistic. From this, it is possible to choose the best K variables according to this criterion. Figures 4 to 11 illustrate the change in classification metrics as a function of the K value.

1) Binary Classifiers: Accuracy, Precision, Recall, and F1 Score curves were constructed for the Feature Selection methods considered. Fig. 4 illustrates the Cross-Validation Mean curves for the F1 Score. The other metrics have curves with similar behavior and are therefore omitted.
For Binary classifiers, quality metrics remain stable down to approximately 17 features, for all variable selection techniques. This represents a 78% reduction in the number of features, compared to the first trainings, where only "Preprocessing" had been carried out and 77 features had been used. When choosing fewer than 17 features, we noticed a performance degradation in almost all methods, in particular for ReliefF. As we reduce the multi-dimensional space by removing features, the points in this space become closer, that is, the distance measurements decrease. In this way, the particular choice of points that ReliefF makes is more sensitive to fluctuations in the data, which can cause errors that are reflected in the F1 Score. For fewer than 5 features, RFE suffers an abrupt worsening in metrics, which otherwise remain high when more variables are used. The preparation of the RFE selector considers the features individually. It is possible that a variable has greater predictive power when used together with another in the training of a Machine Learning model. However, in RFE, if this other variable is prematurely discarded, the classification result will be worse when we choose a small number of features.
Fig. 5 shows the F1 Score Standard Deviation for all Feature Selection techniques. The graph is plotted on a logarithmic scale on the vertical axis. Lower values for the Standard Deviation indicate fewer fluctuations for the metric in question. Again, when choosing more than 17 features, all techniques produce results with low variance. Below this value, MI and ReliefF have greater variance. In the case of MI and its adaptation to numerical variables, its calculation is influenced by the data binning process. This process, in turn, is influenced by the Cross-Validation steps. As it is not possible to guarantee homogeneity in the sets generated in Cross-Validation, the features with the most information also differ, which ends up increasing the variance of the performance metrics. For ReliefF, we must take into account that the choice of points that will have their neighbors checked is random, contributing to the increase in variance.
The Fit Time for each of the Feature Selection methods is shown in Fig. 6. There are few intersections in this graph, since most variable selection techniques take a fixed time to choose K attributes. The exception is the RFE method, which, given its recursive nature, takes more time when it needs to eliminate more variables. In the worst cases, it takes up to 4 times longer than the other methods. We can also highlight the Ensemble method, whose execution time is approximately the sum of the times of the 4 methods that compose it. The ANOVA method has the fastest execution of all the considered methods. This is because ANOVA performs fewer arithmetic operations, in addition to not relying on iterative or binning processes, unlike the other variable selection methods considered.
The Benefit-Cost Ratio graph for the mean of the F1 Score is plotted on a logarithmic scale in Fig. 7. Since the F1 Score (and similarly the other metrics) remains approximately constant when removing attributes, it is more efficient to choose Feature Selection methods that execute more quickly to achieve this result. For Binary classifiers, ANOVA is the one that best meets this criterion, regardless of the value of K. XGBoost Gain as a variable selector appears as a second option, since the Binary classification has a low training time. As we remove variables, RFE progressively degrades its Benefit-Cost Ratio because, although it does not suffer such a sharp drop in Accuracy, Precision, Recall, and F1 Score, its cost in execution time grows rapidly.
2) Multiclass Classifiers: Accuracy, Precision, Recall, and F1 Score curves were also generated for the Feature Selection methods applied to the Multiclass classifiers. Fig. 8 shows the Cross-Validation Mean for the F1 Score. The other metrics have similar graphs with similar results.
There is more variety in the Multiclass results than in the Binary case. For Multiclass classifiers with fewer than 31 features, the chosen variable selection method has more influence. Note that this represents a 60% decrease in the number of features used in the original model. Mutual Information is the method that obtains the best results for the F1 Score Mean. The use of Mutual Information is particularly beneficial for multiclass classification, since the greater number of classes decreases the probability of occurrence of each one of them individually. The logarithm in the Mutual Information formula highlights these small differences, making the values obtained for each class farther apart and more suitable for ranking. We can also note that the performance of XGBoost Gain degrades quickly when we choose fewer than 10 features. These gains are calculated on the widest dataset (the result of filtering by the label-independent methods). The individual contribution of each feature may not be as decisive for the final prediction of a sample's class as its contribution combined with the others, resulting in worse performance with fewer features. Finally, RFE again suffers a sharp drop in performance with fewer than 5 features. This happens because of the premature elimination of features (as in the case of the Binary classifiers) and also because it incorporates the Gain of XGBoost in the calculation of feature importance, with the latter performing poorly with few variables.
Fig. 9 presents the F1 Score Standard Deviation for all variable selection techniques, on a logarithmic scale. As there are more classes than in the Binary case, the random cuts of samples and columns performed by XGBoost [27] have more influence on the variable selection methods that are based on it. RFE and XGBoost Gain, which have this characteristic, end up presenting a higher Standard Deviation (for all metrics) when selecting fewer than 15 variables.
The training time for a Multiclass classifier also directly influences the increase in the RFE Fit Time, as can be seen in Fig. 10. When used to choose less than 5 features, this method takes up to 11 times longer to be executed than the others. The computational complexity of the Multiclass classifier also makes XGBoost Gain have a higher execution time than all Filter methods.
The difference in Fit Times between the Binary and Multiclass classifiers directly influences the Benefit-Cost Ratio curve. In Fig. 11, we see this ratio for the F1 Score of the Multiclass classifiers, on a logarithmic scale. RFE requires a much longer execution time than the others, staying below all other methods (in terms of the Benefit-Cost Ratio) for almost all K values. XGBoost Gain and, consequently, the Ensemble that contains it, end up worse than all the Filters according to this measure. The best performances are, in this order, ANOVA, ReliefF, and Mutual Information, which do not use the training of a classifier to select variables. The relative order among the Filters is the same as in the Binary classification.

3) Multiclass OvO and Multiclass OvR Classifiers: again, Accuracy, Precision, Recall, and F1 Score curves were constructed for the variable selection methods, this time with the classifiers trained according to the OvO and OvR strategies. The metrics, in general, came close to those obtained with the traditional Multiclass classifiers (whose cost function is the Categorical Cross-Entropy).
As an example, Table VII presents the results of Accuracy, F1 Score, and Fit Times for all variable selection methods, focusing on the analysis for K = 15 selected variables (a quantity arbitrarily chosen). Complementary results are presented in Table VIII. By the way the experiment is designed, the input of the ANOVA block corresponds to the output of the step of removing highly correlated variables. With this in mind, it is possible to comment on the overall variable removal process performed.
The decrease from 77 to 10 variables (an 87% reduction) brought with it a subtle performance degradation in all cases. For the Binary, Multiclass, Multiclass OvO, and Multiclass OvR classifiers, there was a reduction in the F1 Score of, respectively, 1.06%, 2.05%, 2.86%, and 2.31%. However, the reduction in Fit Times is significantly more pronounced: the Binary, Multiclass, Multiclass OvO, and Multiclass OvR classifiers had their times reduced by 19.86%, 57.30%, 35.39%, and 59.84%, respectively. It is reasonable, therefore, to eliminate some variables to obtain such runtime gains, especially on large datasets.

V. CONCLUSION
In this work, the CICDDoS2019 dataset was used as the basis for verifying the impact of FS on the quality metrics of DDoS attack classification. The reference classifier was XGBoost. FS techniques that are independent of the sample labels were considered, such as the use of domain knowledge and the removal of attributes with low variance or high correlation. Methods that use the information contained in the sample label were also applied, such as ANOVA, Mutual Information, ReliefF, XGBoost Gain, and RFE, in addition to an Ensemble technique involving the rankings of variables generated by the other methods.
The removal of attributes with low variance did not alter the classification metrics. It was also possible to verify the robustness of the classification metrics to substantial reductions in the feature set. As the main contributions of this work, we can highlight:
• The formalization of the notion of dataset unbalance, using the Pielou Index;
• Addressing the DDoS attack classification problem from both a Binary and a Multiclass standpoint;
• The use of meta-learning strategies such as One-Vs-One and One-Vs-Rest;
• The use of a wide range of Feature Selection techniques and the creation of a benchmark of classification metrics according to them.
This study does not exhaust the FS theme in the classification of DDoS attacks, but brings a perspective that can serve as a basis for future work. A limitation of the work was the use of only one dataset (CICDDoS2019), artificially generated in a testbed by the authors of [24]. A dataset extracted from a production environment, or one that is more balanced, could have generated different results. All experiments were performed on a CPU, but the parallelization of several operations on a GPU can bring considerable gains in execution time [45]. An interesting next step would be to list which variables were selected in each Cross-Validation round, identify which were the most frequent, and check whether the choice depends on the Feature Selection method used.
Most of the Feature Selection algorithms considered are available in Scikit-Learn, but there are other interesting techniques described in the literature, such as mRMR [46], BORUTA [47], or even the use of SHAP values [48] to rank the attributes. To verify how well the results described here generalize, we propose applying the same methodology with other base classifiers, such as KNN and SVM, or nonlinear models, such as Neural Networks.
The programs and graphs created for this article, as well as the additional results mentioned in the text but not illustrated for the sake of brevity, are fully available at https://github.com/pedrohauy/ddos_feature_selection.