MindMap Gallery: Improving classification performance with the AdaBoost meta-algorithm
A summary of techniques for improving classification performance with the AdaBoost meta-algorithm. The content covers classifiers built by resampling the data set, weak classifiers based on single-level decision trees (decision stumps), and the imbalanced classification problem.
Edited at 2023-02-25 13:03:37
Improving classification performance with the AdaBoost meta-algorithm
Classifier based on multiple sampling of data sets
Ensemble methods (meta-algorithms)
Integration of different algorithms
Integration of the same algorithm under different settings
Integration after assigning different parts of the data set to different classifiers
AdaBoost
Advantages
Low generalization error rate
Easy to code
Can be applied to most classifiers
No parameters to tune
Disadvantages
Sensitive to outliers
Applicable data types
Numerical type
Nominal type
bagging: a classifier construction method based on random resampling of data
bootstrap aggregating
Resample the original data set S times to obtain S new data sets
Each new data set is the same size as the original
Each new data set is built by randomly drawing samples from the original data set, one at a time, with replacement
This is sampling with replacement
The new data set may contain duplicate samples, while some samples from the original data set may not appear at all
After the S data sets are built, a learning algorithm is applied to each one, yielding S classifiers
To classify a new example, apply all S classifiers and take a majority vote over their predicted classes
random forest
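The bagging procedure above can be sketched as follows. Here `train_fn` is a hypothetical stand-in for any learning algorithm, and labels are assumed to be -1/+1 so the majority vote is just the sign of the summed predictions (an odd S avoids ties):

```python
import numpy as np

def bagging_predict(X_train, y_train, X_test, S, train_fn, rng=None):
    """Train S classifiers on bootstrap resamples and majority-vote.

    train_fn(X, y) must return an object with a .predict(X) method;
    this is a generic sketch, not the book's implementation.
    """
    rng = np.random.default_rng(rng)
    n = len(X_train)
    votes = []
    for _ in range(S):
        # Sample n indices with replacement: duplicates are allowed,
        # and some original samples will not appear at all.
        idx = rng.integers(0, n, size=n)
        clf = train_fn(X_train[idx], y_train[idx])
        votes.append(clf.predict(X_test))
    # Majority vote over the S classifiers for each test sample.
    return np.sign(np.array(votes).sum(axis=0))
```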
boosting
similar to bagging
same
Both combine multiple classifiers of the same type
different
Training
In bagging, the classifiers are trained independently of one another; in boosting, training is sequential, and each new classifier is trained in light of the classifiers already trained
Boosting focuses each new classifier on the data that were misclassified by the classifiers before it
Classification results
In bagging, every classifier has equal weight
In boosting, each classifier's weight reflects how well it performed in the previous round
AdaBoost process
Data collection
any method
Prepare data
Depends on the type of weak classifier used
This chapter uses the single-level decision tree (decision stump)
Simple weak classifiers work better
analyze data
any method
Train the algorithm
Most of the time is spent here
The algorithm trains weak classifiers on the same data set multiple times
Test algorithm
Calculate classification error rate
Use algorithms
Like SVM, AdaBoost predicts one of two classes
Training Algorithms: Improving Classifier Performance Based on Errors
AdaBoost
adaptive boosting
working process
Each sample in the training data is given a weight to form a vector D
Initially the weights are equal
First, train a weak classifier on the training data and calculate the error rate
Then, the weak classifier is trained again on the same data set with adjusted weights
Reweighting
Correctly classified samples: weight is decreased
Misclassified samples: weight is increased
Each classifier is assigned a weight value alpha
alpha is computed from that weak classifier's error rate
Error rate
(number of misclassified samples) / (total number of samples)
Keep iterating until
The training error rate reaches 0
Or the number of weak classifiers reaches the user-specified value
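The reweighting step can be written out concretely. Assuming labels and predictions in {-1, +1}, the update multiplies each correctly classified sample's weight by e^(-alpha) and each misclassified sample's by e^(+alpha), then renormalizes, with alpha = 0.5·ln((1-ε)/ε); this is a minimal sketch of that step, not the book's exact code:

```python
import numpy as np

def update_weights(D, alpha, y, y_pred):
    """One AdaBoost reweighting step.

    y * y_pred is +1 for a correct prediction and -1 for a wrong one,
    so correct samples are scaled by exp(-alpha) (weight decreases)
    and misclassified samples by exp(+alpha) (weight increases).
    """
    D_new = D * np.exp(-alpha * y * y_pred)
    return D_new / D_new.sum()  # renormalize so the weights sum to 1

# Example: a weak classifier with error rate eps = 0.25
eps = 0.25
alpha = 0.5 * np.log((1 - eps) / eps)
```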
Build a weak classifier based on a single-layer decision tree
single level decision tree
Also known as a decision stump
working principle
Make decisions based on only a single feature
pseudocode
Set the minimum error rate minError to positive infinity
For every feature in the data set:
    For every step:
        For each inequality sign:
            Build a single-level decision tree and test it with the weighted data set
            If the error rate is lower than minError: keep this tree as the best single-level decision tree
Return the best single-level decision tree
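The pseudocode above might look like this in Python. This is a sketch: the function and dictionary key names are mine, not the book's exact code, though the triple loop over features, thresholds, and inequality directions follows the pseudocode:

```python
import numpy as np

def build_stump(X, y, D, num_steps=10):
    """Find the best decision stump under sample weights D."""
    n, d = X.shape
    best = {"error": np.inf}
    for dim in range(d):                      # for every feature
        lo, hi = X[:, dim].min(), X[:, dim].max()
        step = (hi - lo) / num_steps
        for i in range(-1, num_steps + 1):    # for every threshold step
            thresh = lo + i * step
            for ineq in ("lt", "gt"):         # for each inequality sign
                pred = np.ones(n)
                if ineq == "lt":
                    pred[X[:, dim] <= thresh] = -1.0
                else:
                    pred[X[:, dim] > thresh] = -1.0
                # Weighted error: sum of D over misclassified samples
                err = D[pred != y].sum()
                if err < best["error"]:
                    best = {"error": err, "dim": dim, "thresh": thresh,
                            "ineq": ineq, "pred": pred.copy()}
    return best
```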
Implementation of the complete AdaBoost algorithm
pseudocode
For each iteration:
    Use the buildStump() function to find the best single-level decision tree
    Add the best single-level decision tree to the array of single-level decision trees
    Calculate alpha
    Calculate the new weight vector D
    Update the aggregate class estimate
    If the training error rate equals 0.0, break out of the loop
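A sketch of the full training loop. To keep it self-contained, the stump finder is passed in as a parameter: `build_stump(X, y, D)` is assumed to return a dict with an `"error"` (weighted error rate) and a `"pred"` (-1/+1 predictions) entry; the names are assumptions, not the book's code:

```python
import numpy as np

def adaboost_train(X, y, build_stump, max_iter=40):
    """AdaBoost training loop over weighted decision stumps."""
    n = X.shape[0]
    D = np.full(n, 1.0 / n)       # initially all weights are equal
    agg_pred = np.zeros(n)        # aggregate (cumulative) class estimate
    stumps = []
    for _ in range(max_iter):
        stump = build_stump(X, y, D)
        # max(..., 1e-16) avoids division by zero on a perfect stump
        eps = max(stump["error"], 1e-16)
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        stump["alpha"] = alpha
        stumps.append(stump)
        # Reweight: decrease correct samples, increase misclassified ones
        D *= np.exp(-alpha * y * stump["pred"])
        D /= D.sum()
        agg_pred += alpha * stump["pred"]
        # Stop once the combined classifier's training error reaches 0
        if np.mean(np.sign(agg_pred) != y) == 0.0:
            break
    return stumps
```

Classification of new data then sums `alpha * pred` over the stored stumps and takes the sign.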
Test algorithm: Classification based on AdaBoost
Example: Applying AdaBoost on a difficult data set
overfitting
overfitting, overlearning
The test error rate reaches a minimum value and then starts to rise again
Some literature reports that for well-behaved data sets the test error rate levels off at a stable value instead of rising
Imbalanced classification problem
Other classification performance metrics: precision, recall, ROC curve
confusion matrix
Can help people better understand errors in classification
True positives (TP), false positives (FP), true negatives (TN), false negatives (FN)
Precision
TP/(TP+FP)
Recall
TP/(TP+FN)
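The two metrics above can be computed directly from confusion-matrix counts; a minimal helper:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP): of all samples predicted positive, the
    fraction that truly are positive.
    Recall = TP/(TP+FN): of all truly positive samples, the fraction
    that the classifier found."""
    return tp / (tp + fp), tp / (tp + fn)
```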
ROC curve
receiver operating characteristic
Horizontal axis
False positive rate: FP/(FP+TN)
Vertical axis
True positive rate: TP/(TP+FN)
used for
Compare classifiers
Cost-benefit analysis
Ideally
The best classifier's curve lies as close to the upper-left corner as possible
Area under the curve (AUC)
Gives the classifier's average performance over all decision thresholds
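The ROC curve can be traced by walking the test examples in decreasing order of classifier confidence: each true positive moves the cursor up the true-positive-rate axis, each false positive moves it right along the false-positive-rate axis, adding a column of area. A sketch of that idea (assumes no score ties across classes):

```python
import numpy as np

def roc_auc(scores, labels):
    """Accumulate the area under the ROC curve for -1/+1 labels."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    pos = np.sum(labels == 1)
    neg = len(labels) - pos
    y_step = 1.0 / pos            # one TP raises the TPR by this much
    x_step = 1.0 / neg            # one FP raises the FPR by this much
    tpr, auc = 0.0, 0.0
    for i in np.argsort(-scores):  # most confident predictions first
        if labels[i] == 1:
            tpr += y_step          # move up: no area added
        else:
            auc += tpr * x_step    # move right: add a column of area
    return auc
```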
Classifier decision control based on cost function
cost sensitive learning
Cost matrix with values other than 0 and 1
Introduce cost information
AdaBoost
Adjust the error weight vector D based on the cost function
Naive Bayes
Select the category with the minimum expected cost rather than the maximum probability as the classification result
SVM
Choose a different parameter C for each class in the cost function
Data sampling methods to deal with imbalanced problems
undersampling
Delete sample
Select and delete samples that are far away from the decision boundary
Mix of undersampling and oversampling
oversampling
Replicate samples
Duplicate existing samples
Or add new points similar to existing samples
Interpolated points
May cause overfitting
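A minimal sketch of interpolation-based oversampling, in the spirit of SMOTE but not the exact algorithm (SMOTE interpolates toward nearest neighbors; here the partner sample is chosen at random):

```python
import numpy as np

def interpolate_samples(X_minority, k, rng=None):
    """Create k synthetic minority-class points, each a random
    interpolation between two existing minority-class samples."""
    rng = np.random.default_rng(rng)
    n = len(X_minority)
    new = []
    for _ in range(k):
        i, j = rng.integers(0, n, size=2)
        t = rng.random()  # interpolation fraction in [0, 1)
        # New point lies on the segment between samples i and j
        new.append(X_minority[i] + t * (X_minority[j] - X_minority[i]))
    return np.array(new)
```

Because the synthetic points stay close to existing ones, heavy use can still lead to overfitting, as noted above.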