MindMap Gallery KNN algorithm
If most of the K most similar samples of a sample in the feature space belong to a certain category, then the sample also belongs to this category. Let's take a look at the selection of K values and kd tree knowledge.
Edited at 2023-03-30 20:41:06
KNN algorithm
definition
If most of the K samples most similar to a given sample (its nearest neighbors in the feature space) belong to a certain category, then the sample is assigned to that category as well.
API
KNeighborsClassifier(n_neighbors=5,algorithm="auto")
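n_neighbors: the value of K, i.e. how many neighbors are consulted (default 5)
algorithm: the method used to find the nearest neighbors: "ball_tree", "kd_tree", "brute" (brute-force search), or "auto" (chosen automatically from the training data)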
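A minimal sketch of the two parameters above in use. The toy data values are made up for illustration only; the class and its parameters are scikit-learn's.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D data: two well-separated groups (hypothetical values)
X = [[0], [1], [2], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

# K = 3 neighbors; "auto" lets scikit-learn pick the search method
clf = KNeighborsClassifier(n_neighbors=3, algorithm="auto")
clf.fit(X, y)

# Each query point takes the majority label of its 3 nearest neighbors
pred = clf.predict([[1.5], [10.5]])
print(pred)
```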
distance
Euclidean distance
Manhattan distance
Chebyshev distance
Minkowski distance
Standardized Euclidean distance
cosine distance
Hamming distance
Jaccard distance
Mahalanobis distance
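Several of the distances listed above can be sketched in plain Python. The function names are chosen here for illustration and are not from any library. Standardized Euclidean and Mahalanobis distance are left out because they also need per-feature variances or a covariance matrix estimated from the data. Hamming distance is shown as a count of differing positions; some libraries report it as a fraction instead.

```python
import math

def minkowski(a, b, p):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # Largest difference along any single coordinate
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

def hamming(a, b):
    # Number of positions at which two equal-length sequences differ
    return sum(x != y for x, y in zip(a, b))

def jaccard_distance(s, t):
    # 1 minus (size of intersection / size of union) of two sets
    s, t = set(s), set(t)
    return 1 - len(s & t) / len(s | t)

a, b = (0, 0), (3, 4)
print(euclidean(a, b), manhattan(a, b), chebyshev(a, b))
```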
Selection of K value
K too small: the prediction is easily swayed by outliers and noise, so the model tends to overfit
K too large: the prediction is dominated by the classes with the most samples (class imbalance), so the model tends to underfit
Approximation error: the error on the training set. Focusing only on it leads to overfitting: the model performs well on the training set but poorly on the test set
Estimation error: the error on the test set. A smaller value indicates better prediction of unseen data
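One way to see the trade-off is to compare training and test accuracy for a few values of K. This is only a sketch: the iris data, the split, and the K values 1, 5, and 100 are assumptions picked for illustration, and the exact numbers depend on the data and the split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=22)

scores = {}
for k in (1, 5, 100):  # illustrative values, not recommendations
    clf = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    # (training accuracy, test accuracy) for this K
    scores[k] = (clf.score(x_train, y_train), clf.score(x_test, y_test))
    print(k, scores[k])
```

A large gap between the two accuracies at small K suggests overfitting; low accuracy on both at very large K suggests underfitting.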
kd tree
Construction of kd tree
nearest neighbor search
Search within the region that contains the query point
Search across neighboring regions (backtracking) when they may hold a closer point
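The construction and the two search cases above can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: the function names `build` and `nearest`, the dictionary node layout, and the sample points are all assumptions made for the example.

```python
def build(points, depth=0):
    """Build a kd tree by splitting on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build(points[:mid], depth + 1),
        "right": build(points[mid + 1:], depth + 1),
    }

def nearest(node, target, best=None):
    """Return (squared distance, point) of the nearest neighbor of target."""
    if node is None:
        return best
    d = sum((a - b) ** 2 for a, b in zip(node["point"], target))
    if best is None or d < best[0]:
        best = (d, node["point"])
    axis = node["axis"]
    diff = target[axis] - node["point"][axis]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    # Case 1: search the region on the target's side of the split
    best = nearest(near, target, best)
    # Case 2: cross into the other region only if the splitting plane
    # is closer than the best distance found so far
    if diff ** 2 < best[0]:
        best = nearest(far, target, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
print(nearest(tree, (2.1, 3.1)))
print(nearest(tree, (2, 4.5)))
```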
Case: Prediction of flower species
Obtaining data: sklearn.datasets
Small datasets: sklearn.datasets.load_*() (bundled with scikit-learn, loaded locally)
Large datasets: sklearn.datasets.fetch_*() (downloaded from the network)
Introduction to data set return values: the return value is a Bunch (a dictionary-like object)
data: feature data array
target: label (target) array
DESCR: data description
feature_names: feature names
target_names: label names
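A short sketch of reading these fields. It assumes the flower data is scikit-learn's bundled iris dataset, loaded with `load_iris`; the source does not name the loader.

```python
from sklearn.datasets import load_iris

iris = load_iris()  # returns a Bunch, a dictionary-like object

print(iris.data.shape)       # feature array: (n_samples, n_features)
print(iris.target[:5])       # label array, encoded as integers
print(iris.feature_names)    # names of the feature columns
print(iris.target_names)     # names of the classes
print(iris.DESCR[:200])      # start of the text description
```

The fields can also be read with dictionary syntax, for example `iris["data"]`.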
Data visualization: seaborn.lmplot(x=, y=, data=, hue="target column", fit_reg=False); fit_reg=False turns off the regression fit line
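Data set division: sklearn.model_selection.train_test_split(x, y, test_size=0.2, random_state=22), where x is the feature values, y is the target values, test_size is the fraction of samples held out as the test set, and random_state is the random number seed
Return values, in order: x_train, x_test, y_train, y_test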
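A minimal sketch of the split, assuming the iris dataset as the input. With 150 samples and test_size=0.2, 30 samples go to the test set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Features first, then targets; the return order is
# x_train, x_test, y_train, y_test
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=22)

print(x_train.shape, x_test.shape)
```

Fixing random_state makes the split reproducible between runs.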
feature engineering
Feature preprocessing
Normalization
Definition: Map the original data onto a fixed range, by default [0, 1]
API
sklearn.preprocessing.MinMaxScaler(feature_range=(0,1))
fit_transform(X): X is a 2-D array of shape (n_samples, n_features); returns a scaled array of the same shape
Summary: The maximum and minimum values are very sensitive to outliers, so this method is less robust and is only suitable for traditional, precise, small-data scenarios.
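A small sketch of min-max scaling. The input values are made up for illustration; each column is scaled independently as (x - min) / (max - min).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 3 samples, 2 features on very different scales
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 40.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)  # same shape as X

print(X_scaled)
```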
standardization
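Definition: Transform the original data so that each feature has a mean of 0 and a standard deviation of 1
API
sklearn.preprocessing.StandardScaler()
fit_transform(X): X is a 2-D array of shape (n_samples, n_features); returns a scaled array of the same shape
Summary: Outliers affect the mean and standard deviation less than they affect the minimum and maximum, so this method is more robust and suits modern, noisy data.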
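A small sketch of standardization on made-up values. Each column is transformed as (x - mean) / std, where the standard deviation is the population standard deviation of that column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 4 samples, 2 features
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 800.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)  # same shape as X

print(X_std.mean(axis=0))  # close to 0 for every column
print(X_std.std(axis=0))   # close to 1 for every column
```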
Summary
Advantages: simple and effective, low retraining cost, suitable for samples whose class domains overlap, and suitable for automatic classification of classes with large sample sizes
Disadvantages: lazy learning (the work is deferred to prediction time), class scores are not normalized (unlike probability scores), handles imbalanced samples poorly, and requires a large amount of computation
Machine learning workflow
Get dataset
Basic data processing
Data splitting
feature engineering
Feature preprocessing
Normalization
standardization
Machine learning (model training)
Model evaluation
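The workflow steps above can be sketched end to end. This is one possible arrangement under assumptions: the iris dataset, standardization as the preprocessing step, and K = 5 are choices made for the example, not values fixed by the outline.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# 1. Get dataset
iris = load_iris()

# 2. Basic data processing and splitting
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=22)

# 3. Feature engineering: fit the scaler on the training set only,
#    then apply the same transformation to the test set
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# 4. Machine learning (model training)
estimator = KNeighborsClassifier(n_neighbors=5)
estimator.fit(x_train, y_train)

# 5. Model evaluation
y_pred = estimator.predict(x_test)
accuracy = estimator.score(x_test, y_test)
print(accuracy)
```

The test set is transformed with `transform` rather than `fit_transform` so that no information from the test set leaks into the preprocessing.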
Cross-validation
Definition: Split the training data further into a training part and a validation part, leaving the test set unchanged. The training data is divided into several equal parts, and each part serves once as the validation set while the rest is used for training; with n parts this is called n-fold cross-validation.
Features: Cross-validation does not by itself improve test accuracy; it makes the evaluation of the model more reliable.
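A minimal sketch of cross-validation on the training data only, using scikit-learn's `cross_val_score`. The dataset, K = 5, and the choice of 5 folds are assumptions for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# The test set is set aside and not touched by cross-validation
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=22)

# 5-fold cross-validation: one accuracy score per fold
fold_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=5), x_train, y_train, cv=5)

print(fold_scores)
print(fold_scores.mean())
```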
grid search
The best value of some hyperparameters cannot be known in advance. Grid search trains and evaluates the model for each candidate hyperparameter combination, usually with cross-validation, and returns the combination that performs best.
API
sklearn.model_selection.GridSearchCV(estimator=estimator object, param_grid=dictionary of hyperparameter names and candidate values, e.g. {"n_neighbors": [1, 3, 5]}, cv=number of cross-validation folds)
return value
best_estimator_: the best model, refit with the best parameters
best_params_: the parameter combination that gave the best result
best_score_: the best mean cross-validation score
cv_results_: cross-validation results for every parameter combination
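A minimal sketch of a grid search over K for a KNN classifier. The dataset, the candidate values, and cv=4 are assumptions picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=22)

# Candidate values for K (illustrative, not recommendations)
param_grid = {"n_neighbors": [1, 3, 5, 7]}

search = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=4)
search.fit(x_train, y_train)

print(search.best_params_)        # best parameter combination
print(search.best_score_)         # its mean cross-validation score
print(search.best_estimator_)     # model refit with the best parameters
print(search.cv_results_["mean_test_score"])  # one score per candidate

# The held-out test set gives the final evaluation
print(search.score(x_test, y_test))
```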