Data mining and analysis technology mind map
A computing process that uses methods such as artificial intelligence, machine learning, and statistics to extract useful, previously unknown patterns or knowledge from massive amounts of data.
Data mining and analysis technology
Chapter 1 Overview of Data Mining
Pre-class overview
Summary
Machine learning
Workflow
Data import
Data preprocessing
Feature engineering
Train/test split
Model training
Model evaluation
Predict new data
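As a concrete illustration, here is a minimal sketch of this workflow with scikit-learn; the file name data.csv and the column name label are hypothetical placeholders.

```python
# Minimal sketch of the machine learning workflow above.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("data.csv")                       # data import (hypothetical file)
df = df.dropna()                                   # data preprocessing: drop missing rows
X = StandardScaler().fit_transform(df.drop(columns="label"))  # feature engineering
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)  # split
model = LogisticRegression().fit(X_train, y_train)             # train model
print(accuracy_score(y_test, model.predict(X_test)))           # evaluate model
# predict new data: model.predict(new_X)
```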
AI
Characteristics of big data
Volume (massive amounts)
Variety (diverse forms)
Velocity (high speed)
Value
1.1 Introduction to Data Mining
Definition
A computing process that uses methods such as artificial intelligence, machine learning, and statistics to extract useful, previously unknown patterns or knowledge from massive amounts of data.
Background
The amount of data has expanded dramatically, giving rise to a new research direction: knowledge discovery in databases (KDD) and the corresponding data mining theories and technologies.
The next technology hotspot after the Internet
While a large amount of information brings convenience to people, it also brings a lot of problems.
Too much information and difficult to digest
It is difficult to distinguish the authenticity of information
Information security is difficult to guarantee
Information comes in different forms and is difficult to process uniformly
Explosive growth in data, but knowledge remains poor
The evolution from business data to business information
Data collection → data access → data warehouse, decision support → data mining (providing predictive information)
Stages
Data preprocessing
Cleaning, integration, selection, transformation
Data mining
Model evaluation
Process
Data → information → knowledge → wisdom
Data
"8000m", "10000m"
Data are produced by observing and measuring objective things; the objective things under study are called entities
Information
"8000 m is the maximum flight altitude of the aircraft", "the mountain is 10000 m high"
Knowledge
"Planes cannot climb over this mountain"
Wisdom
Main content
Association rule mining
beer and diapers
Supervised machine learning
Discrete label prediction: classification
Continuous label prediction: numerical prediction
Unsupervised machine learning: clustering (similarity algorithms)
Regression
Establishes quantitative relationships among multiple variables
Classification of algorithms
Supervised learning
Learns a function (model) from the given training data; when new data arrives, the result can be predicted with this function (model)
Training data carries explicit labels or known outcomes
Regression algorithms, neural networks, SVM (support vector machines)
Regression algorithms
Linear regression
Handles numerical problems; the final prediction is a number, e.g. a house price
Logistic regression
Actually a classification algorithm, e.g. determining whether an email is spam
Neural Networks
Applied to visual recognition and speech recognition
SVM (support vector machine) algorithm
An enhancement of the logistic regression algorithm
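A minimal sketch contrasting the two regression algorithms on toy data (all numbers are made up for illustration):

```python
# Linear regression predicts a number; logistic regression predicts a class.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Linear regression: numeric target (e.g. house price vs. area)
area = np.array([[50], [80], [100], [120]])
price = np.array([150, 240, 300, 360])
lin = LinearRegression().fit(area, price)
print(lin.predict([[90]]))          # predicted price: a number

# Logistic regression: categorical target (e.g. spam or not)
word_count = np.array([[1], [3], [10], [20]])
is_spam = np.array([0, 0, 1, 1])
log = LogisticRegression().fit(word_count, is_spam)
print(log.predict([[15]]))          # predicted class label
```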
Unsupervised learning
Training data is not labeled
Clustering algorithms, dimensionality reduction algorithms
Clustering algorithms
Compute distances within the population and divide the data into groups based on distance
Dimensionality reduction algorithms
Reduce data from high to low dimensionality. The dimension is the number of features of the data; for example, house-price data with the house's length, width, area, and number of rooms is 4-dimensional, but the information in length and width overlaps with that in area (area = length × width). Dimensionality reduction removes such redundancy.
Compress data and improve machine learning efficiency
Enterprise data applications
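A minimal sketch of the two unsupervised families on toy data, using scikit-learn's KMeans and PCA:

```python
# Clustering groups data by distance; PCA projects it to fewer dimensions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]])

# Clustering: split the data into groups based on distance
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)                        # e.g. [0 0 1 1]

# Dimensionality reduction: project 2-D data onto 1 dimension
X_low = PCA(n_components=1).fit_transform(X)
print(X_low.ravel())
```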
Semi-supervised learning
Uses a small number of labeled samples together with a large number of unlabeled samples for training and classification
Image recognition
Reinforcement learning
The learning agent makes decisions based on feedback from its observed environment
Robot control
1.2 Basic processes and methods of data mining
Basic methods
Predictive mining
Extrapolates from current data to make predictions
Descriptive mining
Characterizes the general properties of the data in the database (correlations, trends, clusters, anomalies, ...)
Data mining flow chart
The six main data mining methods in the book (P6)
Summarization of the data set
Data association rules
A way of describing potential connections between data, usually represented by the implication A → B
Classification and prediction
Clustering
Anomaly detection
Time series models
1.3 Application of data mining
Business
Healthcare and medicine
Banking and insurance
Social media
Tools
Weka, MATLAB, Java
Relevant information
Chapter 2 Data Description and Visualization
2.1 Overview
Analyze data attributes and data values → data description and visualization
2.2 Data objects and attribute types
Data set
Composed of data objects
Sales database: customers, store items, sales; medical database: patients, treatment information; university database: students, professors, course information
Data object
A data object represents an entity
Also known as: sample, example, instance, data point, object, tuple
Attributes
A characteristic of a data object
Terminology
Database: dimension
Machine learning: feature
Statistics: variable
Data mining, databases: attribute
Classification
Nominal attributes
Nominal attribute values are symbols or names of things, representing categories or names
Examples: hair color (black, white, brown); marital status (married, single, divorced, widowed)
Binary attributes (special nominal attributes)
Only two categories or states
Symmetric binary
Both states are equally important; example: gender (male, female)
Asymmetric binary
The two states are not equally important; example: medical test (negative, positive)
Ordinal attributes
Values have a meaningful order, but the magnitude between successive values is unknown; often used for ratings
Teacher title, military rank, customer satisfaction
Numeric attributes
Interval-scaled attributes
Measured on a scale of equal-size units; ordered
Ratio-scaled attributes
Have a fixed zero point; ordered, and multiples (ratios) are meaningful
Discrete and continuous attributes
2.3 Basic statistical description of data
Measures of central tendency
Mean, median, mode
Measures of data dispersion
Range, quartiles, interquartile range (IQR)
Five-number summary, box plots and outliers
Variance, standard deviation
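A short NumPy sketch of these statistics on a small made-up sample:

```python
# Basic statistical description of a data set with NumPy.
import numpy as np

x = np.array([2, 4, 4, 5, 7, 9, 30])                 # 30 looks like an outlier

vals, counts = np.unique(x, return_counts=True)
print(np.mean(x), np.median(x), vals[np.argmax(counts)])   # mean, median, mode
q1, q3 = np.percentile(x, [25, 75])
print(x.max() - x.min(), q1, q3, q3 - q1)             # range, quartiles, IQR
print(x.min(), q1, np.median(x), q3, x.max())         # five-number summary
print(np.var(x, ddof=1), np.std(x, ddof=1))           # sample variance, std dev
```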
Graphical depiction of basic statistics of data
Quantile plot
Quantile-quantile (Q-Q) plot
Histogram
Bar height: count or frequency
Scatter plot
Discover correlations between attributes
2.4 Data visualization
Definition
Express data effectively through graphics
Three visualization methods
Boxplot
Analyzes differences in dispersion across multiple attributes
Shows the distribution of the data and exposes outliers (which may need to be removed)
Histogram
Analyzes how a single attribute is distributed across intervals
Scatter plot
Displays the correlation between two sets of data
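A minimal matplotlib sketch of the three plot types on synthetic data:

```python
# Boxplot, histogram, and scatter plot side by side.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
a = rng.normal(0, 1, 200)
b = 0.8 * a + rng.normal(0, 0.5, 200)    # correlated with a

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].boxplot([a, b])                  # dispersion and outliers per attribute
axes[1].hist(a, bins=10)                 # distribution of one attribute over intervals
axes[2].scatter(a, b, s=8)               # correlation between two attributes
plt.show()
```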
2.4.1 Pixel-based visualization
A simple way to visualize one-dimensional values is to use pixels, whose color reflects the value of that dimension
Suitable for one-dimensional values, not suitable for distribution of multi-dimensional spatial data
2.4.2 Geometric projection visualization
Helps users discover projections of multidimensional data; the primary challenge of geometric projection techniques is how to visualize high-dimensional space in two dimensions.
For two-dimensional data points, a Cartesian coordinate system scatter plot is usually used. Different colors or shapes can be used in the scatter plot as the third dimension of the data.
Scatter plots (for three-dimensional data sets), scatter-plot matrices, and parallel-coordinates visualization (when the number of dimensions is large)
2.4.3 Icon-based visualization
Represent multidimensional data values with a small number of icons
Two commonly used icon methods
Chernoff faces (allow visualization of up to 36 dimensions)
Reveal trends in data
Facial elements such as eyes, mouth, and nose use different shapes, sizes, positions, and orientations to represent dimension values
Each face represents an n-dimensional data point (n ≤ 18); the meaning of the facial features is understood by spotting small differences between faces
Stick figures
2.4.4 Hierarchical visualization
Divide all dimensions into subsets (i.e. subspaces) and visualize these subspaces hierarchically
Two commonly used hierarchical visualization methods
X-axis/Y-axis subset hierarchy
Tree map
2.4.5 Visualizing complex objects and relationships
Tag Cloud
2.5 Data similarity and dissimilarity measurement
Concept
Similarity
Measures how similar two data objects are. The larger the value, the more similar they are. The usual value range is [0,1]
Dissimilarity
Measures the degree of difference between two data objects. The smaller the value, the more similar the data is. The minimum dissimilarity is usually 0.
Proximity
Refers to similarity or dissimilarity
Two common data structures
Data matrix (object-attribute matrix)
Stores n data objects in n rows and p attributes in p columns (an n × p matrix)
Dissimilarity matrix (object-object matrix)
Stores the dissimilarity values between data objects
Usually a triangular matrix
Proximity measures for nominal attributes
Proximity measures for binary attributes
Dissimilarity of numeric attributes
Common distance measures for computing the dissimilarity of numeric attribute objects
Euclidean distance
Manhattan distance
Euclidean and Manhattan distance both satisfy the metric properties (non-negativity, identity, symmetry, triangle inequality)
Minkowski distance
A generalization of Euclidean and Manhattan distance
Supremum distance
The maximum difference between the objects on any attribute
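A short NumPy sketch of the four measures; Euclidean, Manhattan, and supremum distance are the Minkowski distance with p = 2, p = 1, and p → ∞ respectively:

```python
# Distance measures between two numeric objects x and y.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

print(minkowski(x, y, 2))        # Euclidean distance (p = 2)
print(minkowski(x, y, 1))        # Manhattan distance (p = 1)
print(minkowski(x, y, 3))        # Minkowski distance with p = 3
print(np.max(np.abs(x - y)))     # supremum distance (p -> infinity)
```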
Proximity measures for ordinal attributes
Dissimilarity of mixed attributes
Group the attributes by type and run the mining analysis (e.g. cluster analysis) on each group separately. If all analyses yield the same result, the method works, but in practice it is difficult for every attribute-type group to yield the same result.
A better approach: do a single analysis, combining the different attributes in a single dissimilarity matrix and transforming them onto a common interval [0.0, 1.0]
Example
Cosine similarity (a conceptual understanding is enough)
Text retrieval, biological information mining
Document vector, word frequency vector
Frequency vectors are usually long and sparse (have many 0 values)
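A minimal sketch computing cosine similarity between two toy term-frequency vectors:

```python
# Cosine similarity of two sparse term-frequency vectors.
import numpy as np

d1 = np.array([3, 0, 1, 0, 0, 2])    # term frequencies of document 1
d2 = np.array([1, 0, 0, 0, 0, 1])    # term frequencies of document 2

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(cos)    # 1.0 means identical direction, 0.0 means no shared terms
```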
Chapter 7 Support Vector Machine
Classification of support vector machines
Linear binary classification problem
Find the optimal hyperplane
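A minimal sketch with scikit-learn's SVC (linear kernel) on toy data; the support vectors are the points that determine the optimal hyperplane:

```python
# Linear SVM for a binary classification problem.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [6, 5], [7, 6]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([[2, 2], [6, 6]]))   # classify new points via the hyperplane
print(clf.support_vectors_)            # points that define the margin
```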
Chapter 6 Classification and Prediction
6.1 Data classification
Continuous variables
height, weight
Categorical variables
Unordered categorical variables
Ordered categorical variables
General methods for data classification
Nominal (classification), ordinal (ordering), interval (distance), ratio scales
6.2 Decision tree model
Generate decision tree
Prune decision tree
6.2.1 How decision trees work
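A minimal scikit-learn sketch of the two steps above: grow a full tree, then prune it with cost-complexity pruning (the ccp_alpha value is an arbitrary illustration):

```python
# Generate a decision tree, then prune it.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier().fit(X, y)                   # generate full tree
pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)   # prune via complexity penalty

print(tree.get_depth(), pruned.get_depth())   # pruning yields a shallower tree
print(export_text(pruned))                    # human-readable split rules
```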
6.3 Bayesian classification model
Maximum a posteriori (MAP) hypothesis
Given data D, the learner selects the most probable hypothesis h from the candidate hypothesis set H; this h is called the maximum a posteriori hypothesis
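In symbols (the standard Bayesian formulation; the denominator P(D) is dropped because it does not depend on h):

$$h_{\mathrm{MAP}} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$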
Requires computing the joint probability
Each attribute is usually assumed to be conditionally independent of the others (the naive Bayes assumption)
Beforehand, correlation analysis and merging should be performed to minimize the correlation between attributes
Characteristics
Attributes can be discrete or continuous
Solid mathematical foundation and stable classification efficiency
Not sensitive to missing values, noisy data, or outliers
If the attributes are uncorrelated, classification performance is very good
6.4 Linear discriminant model
6.5 Logistic regression model
6.6 Model evaluation and selection
Chapter 5 Association Rule Mining
5.1 Overview
Concept
Association rule mining mines the correlations between itemsets in a transaction database, finding all association rules that satisfy the minimum support and minimum confidence thresholds
Association rules are used to find potentially useful dependencies between data items in large amounts of data.
Frequent itemsets
Itemsets that satisfy the minimum support threshold
Support
Confidence
Strong rules
Rules that meet or exceed minimum support and confidence
Main steps of association rule mining
Find all frequent itemsets, i.e. itemsets whose occurrence count is at least the minimum support count
From the frequent itemsets, derive association rules that satisfy the minimum support and confidence conditions
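A minimal pure-Python sketch of support and confidence for the classic beer-and-diapers rule on a toy transaction database (items and thresholds are made up):

```python
# Support and confidence of the rule {beer} -> {diapers}.
transactions = [
    {"beer", "diapers", "bread"},
    {"beer", "diapers"},
    {"beer", "cola"},
    {"bread", "milk"},
]

def support(itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

sup = support({"beer", "diapers"})          # P(beer and diapers) = 0.5
conf = sup / support({"beer"})              # P(diapers | beer)  = 0.666...
print(sup, conf)

min_sup, min_conf = 0.4, 0.6
print(sup >= min_sup and conf >= min_conf)  # True: a "strong rule"
```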
5.2 Classification
5.3 Research steps
5.4 Apriori algorithm analysis
5.6 Generalization of Association Rules (GRI)
Depth-first search
5.7 In-depth exploration of association rules
Chapter 4 Data Reduction
4.1 Overview of data reduction
Streamline the data as much as possible while preserving its original character
4.2 Attribute selection and numerical reduction
Evaluation criteria for attributes (P58)
Consistency measure
The degree of consistency between two attributes
The degree of consistency between education level and VIP level
Correlation measure
The degree to which different attributes are related to each other
Correlation between education level and VIP level
The higher the correlation between two attributes, the higher the accuracy of inferring the value of one attribute from the value of the other attribute.
Discriminative ability measure
The ability of a certain attribute to distinguish records in the database
Information measure
The greater the amount of information an attribute contains, the more important it is
The amount of information is usually measured by "information entropy"
Attribute subset selection methods
Stepwise forward selection (see the sketch after this list)
Start with an empty target attribute set
Each iteration selects the best attribute from the remaining attributes in the original data set and adds it to the target attribute set.
Remove the attribute from the original dataset
Repeat this process until the target set meets the requirements
Stepwise backward selection
First take the entire original attribute set as the target attribute set
In each iteration, the attribute with the worst comprehensive score is eliminated from the target attribute set.
Repeat this process until the target attribute set meets the requirements
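A minimal sketch of stepwise forward selection, assuming cross-validated accuracy as the scoring criterion (the stopping size of 2 is an arbitrary illustration):

```python
# Greedy forward attribute selection: add the best remaining attribute each round.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
remaining, selected = list(range(X.shape[1])), []

while len(selected) < 2:                      # stop when the target set is big enough
    scores = [(cross_val_score(LogisticRegression(max_iter=500),
                               X[:, selected + [j]], y).mean(), j)
              for j in remaining]
    best_score, best_j = max(scores)          # best attribute this iteration
    selected.append(best_j)                   # add it to the target attribute set
    remaining.remove(best_j)                  # remove it from the original set
print(selected)
```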
Numerical reduction
Transform attribute values to reduce their dynamic range
Simple function transformation
Standardization of data
Discretize attributes and encode them with integers
Equal width discretization, equal depth discretization
Binarize the attribute so that it has only two values
If the attribute value is a signal or image, compression encoding can also be performed
4.3 Linear regression
Definition
The study of the relationship between a single dependent variable and one or more independent variables
Uses
Prediction: using the observed independent variables to predict the dependent variable
Causal analysis treats the independent variable as the cause of the dependent variable.
Linear regression
Multiple regression
Nonlinear regression
Model data that does not have linear dependencies
Use polynomial regression: transform the variables to convert the nonlinear model into a linear model, then solve with least squares (see the sketch below)
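A minimal sketch: a quadratic relationship becomes linear in the transformed features, so ordinary least squares applies (the data are synthetic):

```python
# Polynomial regression via variable transformation + least squares.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 2 * x.ravel() ** 2 + 1 + np.random.default_rng(0).normal(0, 0.5, 30)

X_poly = PolynomialFeatures(degree=2).fit_transform(x)  # adds the x^2 column
model = LinearRegression().fit(X_poly, y)               # ordinary least squares
print(model.coef_, model.intercept_)                    # x^2 coefficient near 2, intercept near 1
```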
4.4 Principal Component Analysis (PCA)
A commonly used method for dimensionality reduction of high-dimensional data
Form linear combinations of the original variables so that a few combined variables reflect all or most of the information in the originals
The combined variables are the principal components
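A minimal scikit-learn sketch, mirroring the earlier length × width = area example: three correlated attributes are combined into two principal components:

```python
# PCA: replace correlated attributes with a few composite components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
length = rng.uniform(5, 15, 100)
width = rng.uniform(4, 10, 100)
X = np.column_stack([length, width, length * width])   # area duplicates length/width info

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)   # first components carry most of the information
X_reduced = pca.transform(X)           # 3 correlated attributes -> 2 components
```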
Chapter 3 Data collection and preprocessing (cleaning, integration, reduction, transformation)
3.1 Overview
Characteristics of big data collection
The first step in the big data life cycle
Compared with traditional data, big data is massive, diverse, and heterogeneous
From collection to processing, big data must trade off consistency, availability, and partition tolerance
Big data collection methods (for general understanding)
Distributed system log collection
Network data collection
Web crawler, website public API (application programming interface)
DPI (Deep Packet Inspection)
DFI (Deep/Dynamic Flow Inspection)
Specific system interface data collection
3.2 Purpose and tasks of data preprocessing
Purpose
Improve data quality
Main tasks
Data cleaning
Remove noise from the data and correct inconsistencies
Data integration
Consolidate data from multiple data sources into a consistent data store, such as a data warehouse
Data transformation (such as normalization)
Scale data into a smaller interval
3.3 Data cleaning
The essence is a process of modifying the data model
Data cleaning paths (for general understanding)
1. Missing value cleaning
Remove missing values
Mean imputation
Hot-deck imputation
Nearest-neighbor imputation
Regression imputation
Multiple imputation
k-nearest neighbor method
Bayesian-based approach
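A minimal sketch of two of these paths with scikit-learn imputers (mean imputation and k-nearest-neighbor imputation) on toy data:

```python
# Fill missing values (NaN) by column mean or by nearest rows.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [8.0, np.nan]])

print(SimpleImputer(strategy="mean").fit_transform(X))  # replace NaN with column mean
print(KNNImputer(n_neighbors=2).fit_transform(X))       # replace NaN using nearest rows
```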
2. Outlier cleaning (outliers, wild values)
Definition and identification of outliers
Handling outliers
3. Format content cleaning
4. Logic error cleaning
Remove duplicates
Remove unreasonable values
5. Non-required data cleaning
6. Relevance verification
3.4 Data integration
Concept
Data integration in the traditional sense
Combine data from multiple data stores and store it in a single data store, such as a data warehouse
Data integration in a general sense
ETL: extract, transform, load (into the destination); an important part of building a data warehouse
The user extracts the required data from the data source, cleans it, and finally loads it into the data warehouse according to a predefined data warehouse model
Importance of models
Standardize the definition of data to achieve unified coding, classification and organization
Data redundancy often occurs when integrating multiple databases
Detect redundant attributes
Correlation analysis
Discrete variables
Chi-square test
The larger the statistic, the more related the variables
Continuous variables
Correlation coefficient
Equal to 1 or -1: perfectly linearly correlated
Greater than 0: positive correlation
Equal to 0: no linear correlation
Less than 0: negative correlation
Covariance analysis
Greater than 0: positive correlation
Equal to 0: uncorrelated
Some data have covariance 0 but are not independent
Less than 0: negative correlation
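A short sketch of the three redundancy checks with NumPy/SciPy (the contingency table and series are made up):

```python
# Chi-square for discrete variables; correlation and covariance for continuous ones.
import numpy as np
from scipy.stats import chi2_contingency

# Discrete: contingency table of two categorical attributes
table = np.array([[250, 200], [50, 1000]])
chi2, p, dof, _ = chi2_contingency(table)
print(chi2, p)                      # large chi2 / small p => attributes are related

# Continuous: correlation coefficient in [-1, 1] and covariance
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.1, 3.9, 6.2, 8.1])
print(np.corrcoef(a, b)[0, 1])      # close to 1: strong positive linear relation
print(np.cov(a, b)[0, 1])           # > 0: positive correlation
```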
Data reduction strategies
Dimensionality reduction
Scenarios that call for dimensionality reduction
Data is sparse and high-dimensional
A rule-based classification method is applied to high-dimensional data
A complex model (e.g. deep learning) is used but the training set is small
Visualization is needed
Typical dimensionality reduction method: PCA (principal component analysis)
Motivation
There are some correlations between many attributes in the data.
Can multiple correlated attributes be combined into a single attribute?
Concept
Recombine several correlated original attributes (say p of them) into a set of uncorrelated composite attributes that replace the originals; the usual mathematical treatment is to form the composite attributes as linear combinations of the p original attributes
For example, student scores in language, math, foreign language, history, geography, etc. can be combined into two attributes: liberal arts and science
Data reduction: sampling
Data compression
Reduce the size of the data by reducing its quality, e.g. image pixels
3.5 Data transformation
Data transformation strategy
Smoothing, attribute construction, aggregation, normalization, discretization, concept hierarchy generation
Commonly used data transformation methods
Transform data through normalization
Discretization by binning
Discretization by histogram analysis
Discretization by clustering, decision trees, and correlation analysis
Concept hierarchies for nominal data
Discretization (see the sketch below)
Equal-width method
Equal-frequency method
Clustering method
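A minimal pandas sketch of equal-width and equal-frequency discretization, plus min-max normalization (the values are made up):

```python
# Equal-width vs. equal-frequency binning, and min-max normalization.
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 8, 20, 50])

print(pd.cut(x, bins=4, labels=False))      # equal width: bins of equal value range
print(pd.qcut(x, q=4, labels=False))        # equal frequency: ~2 values per bin
print((x - x.min()) / (x.max() - x.min()))  # min-max normalization onto [0, 1]
```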