MindMap Gallery Data mining concepts and techniques
Mind map of data mining concepts and techniques. Data mining is the process of mining interesting patterns and knowledge from large amounts of data. Data sources include databases, data warehouses, the Web, other information repositories, or data flowing dynamically into the system.
Edited at 2023-05-19 11:54:52
Data mining concepts and techniques
Data Mining Overview
data mining concepts
The process of mining interesting patterns and knowledge from large amounts of data. Data sources include databases, data warehouses, the Web, other information repositories, or data dynamically flowing into the system
Also referred to as knowledge mining from data, or knowledge discovery from data (KDD)
knowledge discovery process
(1) Data cleaning: eliminate noise and delete inconsistent data
(2) Data integration: multiple data sources can be combined together
(3) Data selection: retrieve the data relevant to the analysis task from the database
(4) Data transformation: Transform and unify data into a form suitable for mining through summary and aggregation operations
(5) Data mining: basic steps, using intelligent methods to extract data patterns
(6) Pattern evaluation: Identify truly interesting patterns that represent knowledge based on some measure of interest
(7) Knowledge presentation: use visualization and knowledge representation techniques to present the mined knowledge to users (a pipeline sketch of these steps follows)
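As an illustration of how these steps chain together, here is a minimal Python sketch of a knowledge discovery pipeline; the toy table, its column names ("age", "income", "churned"), and the choice of a clustering step as the "mining" stage are assumptions for illustration, not part of the original outline.

```python
# Minimal sketch of the knowledge discovery (KDD) steps on a hypothetical
# customer table with columns "age", "income", and "churned".
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "age": [25, 32, None, 47, 51, 38],
    "income": [30000, 45000, 52000, None, 88000, 61000],
    "churned": [0, 0, 1, 1, 0, 1],
})

# (1) Data cleaning: fill missing values with the column median.
clean = df.fillna(df.median(numeric_only=True))

# (2)-(3) Data integration and selection: a single source here, so we
# simply select the attributes relevant to the analysis task.
selected = clean[["age", "income"]]

# (4) Data transformation: min-max scale into [0, 1].
transformed = (selected - selected.min()) / (selected.max() - selected.min())

# (5) Data mining: extract a simple pattern (two clusters of customers).
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(transformed)

# (6)-(7) Pattern evaluation and knowledge presentation: report cluster
# sizes and centers so a user can judge whether the pattern is interesting.
print(pd.Series(model.labels_).value_counts())
print(model.cluster_centers_)
```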
Data collection and database creation (1960s and earlier): primitive file processing
Database management systems (1970s-early 1980s)
Advanced Database Systems (mid-1980s-present)
Advanced Data Analysis (late 1980s-present)
Data types for data mining
Database systems
composition
A collection of interrelated data (the database)
Software programs that manage and access data
Define the database structure and data storage, describe and manage concurrent, shared or distributed data access, and ensure the consistency and security of information in the face of system failure and unauthorized access.
A relational database is a collection of tables, each table is given a unique name
Each tuple in the relational table represents an object, is identified by a unique keyword, and is described by a set of attribute values.
Each table contains a set of attributes (columns or fields) and usually holds a large number of tuples (records or rows)
Semantic data models are usually built for relational databases, such as the entity-relationship (ER) data model
Data warehouses
A data warehouse is an information repository that collects information from multiple data sources, stores it in a consistent schema, and typically resides on a single site. Data warehouse is constructed through data cleansing, data transformation, data integration, data loading and periodic data refresh.
Transactional data
Generally, each record in the transaction database represents a transaction, such as a customer's purchase or a flight booking. A transaction contains a unique transaction identification number (TransID), and a list of items (such as purchased items) that make up the transaction. The transaction database may have some additional tables related to it that contain other information about the transaction, such as product descriptions.
Other types of data
Time-related or sequence data (historical records, time-series data), data streams (e.g., continuously arriving video surveillance data), spatial data (maps), engineering design data (building designs, integrated circuits), hypertext and multimedia data (text, images), graph and network data (e.g., social information networks), and the World Wide Web. These carry special semantics (ordering, audio/video content, connectivity) and yield patterns with rich structures and semantics to mine.
Data mining function
(1) Characterization and discrimination
Data characterization: summarize the general characteristics of data for the class under study (the target class)
Simple data summary based on statistical measures and graphs
OLAP roll-up
Attribute-oriented induction techniques
Data discrimination: compare the target class with one or more comparison classes (contrast classes)
Comparative measures are typically expressed in the form of discriminant rules (a characterization/discrimination sketch follows this list)
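A hedged sketch of characterization versus discrimination using pandas: summary statistics are computed for a hypothetical target class and contrasted against a comparison class. The table, column names, and the "big_spender" label are invented for illustration.

```python
# Sketch of data characterization vs. discrimination on a toy customer table.
import pandas as pd

customers = pd.DataFrame({
    "age": [23, 35, 41, 52, 29, 60, 45, 33],
    "income": [28000, 52000, 61000, 90000, 34000, 87000, 70000, 40000],
    "big_spender": [0, 1, 1, 1, 0, 1, 1, 0],  # 1 = target class, 0 = contrast class
})

# Data characterization: summarize the general features of the target class.
target = customers[customers["big_spender"] == 1]
print(target[["age", "income"]].describe())

# Data discrimination: compare the target class against the contrast class.
comparison = customers.groupby("big_spender")[["age", "income"]].mean()
print(comparison)
```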
(2) Frequent patterns
frequent itemsets
Frequent subsequences (sequence patterns)
frequent substructure
(3) Association and correlation mining
Single-dimensional association rules: association rules containing a single predicate
Multidimensional association rules: associations involving multiple attributes or predicates
(4) Classification and regression
Classification
Concept: Find a model (or function) that describes and distinguishes data classes or concepts, so that the model can be used to predict the class label of objects whose class label is unknown
method
Classification rules (IF-THEN rules)
Decision tree: A tree structure similar to a flow chart, in which each node represents a test on an attribute value, each branch represents a result of the test, and the leaves represent classes or class distributions.
Mathematical formula
Neural networks: processing units similar to neurons, with weighted connections between the units
Naive Bayes classification, support vector machine, K nearest neighbor classification
Regression: Used to predict missing or hard-to-obtain numerical data values, and also includes the identification of distribution trends based on available data.
Correlation analysis is performed before classification and regression and it attempts to identify attributes that are significantly related to the classification and regression processes.
(5) Cluster analysis
Concept: Objects are clustered or grouped according to the principle of maximizing intra-class similarity and minimizing inter-class similarity. Clusters of objects are formed in such a way that objects in the same cluster are highly similar but very dissimilar to objects in other clusters. Each cluster formed can be viewed as an object class from which rules can be derived. Clustering also facilitates the formation of classifications, that is, organizing observations into a class hierarchy and grouping similar events together.
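A minimal cluster analysis sketch with scikit-learn, grouping objects so that intra-cluster similarity is high and inter-cluster similarity is low; the synthetic blob data and the choice of k = 3 are assumptions for illustration only.

```python
# Sketch of cluster analysis: objects in the same cluster are similar to
# each other and dissimilar to objects in other clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_

# Each cluster can be viewed as a class of objects; inspect size and centroid.
for c in range(3):
    members = X[labels == c]
    print(f"cluster {c}: {len(members)} objects, centroid {members.mean(axis=0)}")
```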
(6) Outlier analysis
Concept: Find data objects in a dataset that are inconsistent with the general behavior or model of the data
Statistics and Data Mining
Statistics studies the collection, analysis, interpretation and presentation of data, and data mining has a natural connection with statistics.
A statistical model is a set of mathematical functions that describe the behavior of target objects using random variables and their probability distributions.
(1) Statistical models can be the result of a data mining task, and data mining tasks can also be built on top of statistical models. When mining patterns from large data sets, the mining process can therefore use a statistical model to help identify noise and missing values in the data.
(2) Statistical research develops some data and statistical models for prediction and forecasting tools. Statistics is useful for mining various patterns from data and understanding the potential mechanisms that generate and influence these patterns.
(3) Statistical methods can also be used to verify data mining results. For example: after establishing a classification or prediction model, statistical hypothesis testing should be used to verify the model.
Using statistical methods in data mining is not simple. How to apply statistical methods to large data sets is a huge challenge. Many statistical methods have high computational complexity.
machine learning
Concept: How computers learn (or improve their performance) based on data. The main research field is that computers automatically learn to recognize complex patterns based on data and make intelligent decisions.
type
Supervised learning: Similar to classification, the supervision in learning comes from labeled instances in the training data set
Unsupervised learning: similar to clustering, the input instances are not labeled
Semi-supervised learning: using labeled and unlabeled instances when learning a model
Active learning: Let users play an active role in the learning process
For classification and clustering tasks, machine learning research often focuses on model accuracy. In addition to accuracy, data mining research places great emphasis on the effectiveness and scalability of mining methods on large data sets, as well as ways to deal with complex data types, and the development of new, non-traditional methods.
Data mining application areas: business intelligence, web search, bioinformatics, health care informatics, finance, digital libraries and digital government
Main issues in data mining
Mining method
Discover new types of knowledge
Mining knowledge in multi-dimensional space
Data Mining—An Interdisciplinary Endeavor
Improve discovery capabilities in network environments
Dealing with uncertain data, noisy or incomplete data
Pattern evaluation and mining of pattern constraint guidance
user interface
interactive mining
Combine background knowledge
Ad hoc data mining and data mining query languages
Representation and visualization of data mining results
Effectiveness and scalability
Effectiveness and scalability of data mining algorithms
Parallel, distributed, and incremental mining algorithms
Diversity of database types
Handle complex data types
Mining dynamic, networked, global databases
Data Mining and Society
The social impact of data mining
Privacy-preserving data mining
invisible data mining
Data preprocessing
concept
Data object: also called a sample, instance, data point or object, a data object represents an entity
Attributes
Nominal attribute: The value is the name of some symbol or thing. Each value represents some category, encoding or status, so nominal attributes are considered categorical
Binary attribute: It is a nominal attribute with only two status categories: 0 or 1. 0 means that the attribute does not appear, and 1 means that the attribute appears.
A binary attribute is symmetric if its two states are of equal value and carry the same weight, and asymmetric if the consequences of its states are not equally important.
Ordinal attribute: an attribute whose possible values have a meaningful order or ranking, but the magnitude between successive values is unknown
Numeric attributes
Interval-scaled attributes: measured on a scale of equal-size units. The values are ordered and can be positive, zero, or negative, so in addition to ranking, these attributes allow us to compare and quantify differences between values
Ratio-scaled attributes: numeric attributes with an inherent zero point; a value can be described as a multiple (or ratio) of another value. The values are ordered, and we can compute the differences between values, as well as the mean, median, and mode
Cluster: A collection of data objects such that objects in the same cluster are similar to each other but different from objects in other clusters.
Data matrix: used to store data objects; it involves two entities or "things", rows (representing objects) and columns (representing attributes), and is therefore often called a two-mode matrix
Dissimilarity matrix: used to store the dissimilarity values for pairs of data objects; it contains only one kind of entity, and is therefore often called a one-mode matrix
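A small sketch of the two structures above: a data matrix of objects described by numeric attributes, and the corresponding dissimilarity matrix built with scipy; the Euclidean distance metric and the sample values are assumptions for illustration.

```python
# Sketch: a data matrix (objects x attributes) and its dissimilarity matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: 4 objects described by 2 numeric attributes (two-mode).
data_matrix = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [1.5, 1.8],
    [8.0, 8.0],
])

# Dissimilarity matrix: pairwise Euclidean distances between objects (one-mode).
dissimilarity_matrix = squareform(pdist(data_matrix, metric="euclidean"))
print(np.round(dissimilarity_matrix, 2))
```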
Data quality: accuracy, completeness, consistency, timeliness, credibility, interpretability
Data cleaning
Concept: "Clean data" by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
Missing value handling
Ignore tuples
Fill in missing values manually
Use a global constant to fill in missing values
Fill missing values using a central measure of the attribute, such as the mean or median
Use the attribute mean or median of all samples belonging to the same class as the given tuple
Fill missing values using the most likely value
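A hedged sketch of several of the strategies above using pandas; the toy table, its "income" attribute, and its "class" column are hypothetical.

```python
# Sketch of missing-value handling strategies with pandas.
import pandas as pd

df = pd.DataFrame({
    "income": [30000, None, 52000, None, 88000, 61000],
    "class":  ["low", "low", "high", "high", "high", "high"],
})

# Ignore the tuple: drop rows containing missing values.
dropped = df.dropna()

# Fill with a global constant.
constant_filled = df.fillna({"income": -1})

# Fill with a central measure of the attribute (here the median).
median_filled = df.fillna({"income": df["income"].median()})

# Fill with the median of samples belonging to the same class as the tuple.
class_filled = df.copy()
class_filled["income"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.median())
)
print(class_filled)
```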
Noisy data processing
binning
Smoothing by bin means
Smoothing by bin medians
Smoothing by bin boundaries
Regression
Outlier analysis (clustering)
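A minimal sketch of smoothing noisy data by binning: sorted values are partitioned into equal-frequency bins and each value is replaced by its bin mean (bin medians or bin boundaries would be handled analogously). The price list is a made-up example.

```python
# Sketch of noise smoothing by bin means on a toy list of prices.
import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

n_bins = 3
bins = np.array_split(np.sort(prices), n_bins)   # equal-frequency bins

smoothed_by_means = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed_by_means)   # each value replaced by the mean of its bin
```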
Data integration: combining data from multiple data sources into a consistent data store, such as a data warehouse
Reasons for data preprocessing: Low-quality data will lead to low-quality mining results
Importance of data preprocessing: It can significantly improve the overall quality of data mining models and reduce the time required for actual mining.
Data preprocessing steps: data cleaning - data integration - data reduction - data transformation
Data transformation strategy
Smoothing: removes noise from data, including binning, regression and clustering
Attribute construction (feature construction): New attributes are constructed from the given attributes and added to the attribute set to aid the mining process.
Aggregation: summary or aggregation operations are applied to the data
Normalization: Scale the data so that it falls into a specific small interval, such as (-1, 1) or (0, 1)
discretization
Concept: The original value of the numerical attribute is replaced with an interval label or a concept label
Methods: binning, histogram analysis, cluster analysis, decision tree analysis, correlation analysis
Concept hierarchy generation
Concept: Define a mapping sequence that maps low-level concepts to higher-level, more general concepts
method
Specification of a partial ordering of attributes explicitly at the schema level by users or experts
Specification of a portion of the hierarchy by explicit data grouping
Specification of a set of attributes but not their partial ordering, for example: generating the concept hierarchy based on the number of distinct values of each attribute
Specification of only a partial set of attributes, for example: using predefined semantic relationships to generate the concept hierarchy
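A sketch of two of the transformation strategies listed above, min-max normalization into [0, 1] and equal-width discretization into interval (concept) labels; the sample age values and the three label names are assumptions for illustration.

```python
# Sketch of min-max normalization and equal-width discretization with pandas.
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 40, 45, 46, 52, 70])

# Normalization: v' = (v - min) / (max - min), mapping values into [0, 1].
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Discretization: replace raw values with interval (concept) labels.
labels = pd.cut(ages, bins=3, labels=["young", "middle_aged", "senior"])

print(pd.DataFrame({"age": ages, "normalized": normalized, "label": labels}))
```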
Data reduction
Concept: Used to obtain a reduced representation of a data set that is much smaller but still close to maintaining the integrity of the original data.
Strategy
Dimensionality reduction
Concept: Reduce the number of random variables or attributes considered
type
Wavelet transforms, principal component analysis: transform or project the original data onto a smaller space
Attribute subset selection: detecting and removing irrelevant, weakly relevant, or redundant attributes or dimensions
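A hedged sketch of dimensionality reduction with principal component analysis in scikit-learn; the iris data set and the choice of two components are stand-ins for illustration only.

```python
# Sketch of dimensionality reduction: project the original attributes onto
# a smaller space with principal component analysis (PCA).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # 150 objects, 4 attributes

pca = PCA(n_components=2)                 # keep 2 principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)      # variance retained per component
```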
Numerosity reduction
Concept: Replace original data with an alternative, smaller representation of the data
type
Parametric methods: regression, log-linear models
Non-parametric methods: histograms, clustering, sampling, data cube aggregation
data compression
Concept: Use transformations to obtain reduced or compressed representations of original data
type
Lossless: the original data can be reconstructed from the compressed data without loss of information
Lossy: the original data can only be approximately reconstructed
Data Warehousing and Online Analytical Processing
Data warehouse
The data warehouse is a database that is maintained separately from the organization's operational databases.
Data warehouse allows various application systems to be integrated together, provides a solid platform for unified historical data analysis, and provides support for information processing.
A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that supports managers' decision-making processes.
OLTP: Online transaction processing system, performs online transaction and query processing
OLAP: Online analytical processing system, which organizes and provides data in different formats to meet the various needs of different users.
Data warehouse three-tier architecture
Top level: front-end tools
Middle tier: OLAP server
Bottom layer: data warehouse server
data warehouse model
Enterprise warehouse
Gathers all information on a subject, spans the entire enterprise, provides a full range of data integration, usually from one or more operational database systems or external information providers, and is versatile, containing both detailed and summary data
data mart
Concept: Contains a subset of enterprise-wide data that is useful to a specific user group. The scope is limited to selected topics. The data is usually aggregated.
Independent data mart: Data typically comes from one or more operational database systems or external information providers, or from data generated by a specific department or local area
Dependent data mart: data comes directly from the enterprise data warehouse
virtual warehouse
A collection of views over operational databases; for efficient query processing, only some of the possible summary views may be materialized.
Metadata
Concept: Data about data. In a data warehouse, metadata is the data that defines warehouse objects.
content
Description of the data warehouse structure: warehouse schema, views, dimensions, hierarchies, definition of exported data, location and content of data marts
Operational metadata: data lineage, currency of data (active, archived, or purged), and monitoring information
Algorithms used for aggregation: measure and dimension definition algorithms, granularity at which the data is located, partitioning, subject areas, aggregation, summary, predefined queries and reports
Mapping from operating environment to data warehouse: source databases and their contents, gateway descriptions, data extraction, cleansing, transformation rules and default values, data refresh and cleansing rules, security (user authorization and access control)
Data related to system performance: indexes and profiles that improve data access and retrieval performance, along with rules for the timing and scheduling of refresh, update, and replication cycles
Business Metadata: Business Terms and Definitions, Data Owners and Charging Policies
Difference from other data
(1) Metadata is used as a directory to help decision support system analysts locate the content of the data warehouse.
(2) As a guide for data mapping when data are transformed from the operational environment to the data warehouse environment
(3) As a guide to the summarization algorithms that aggregate current detailed data into lightly summarized data, and lightly summarized data into highly summarized data
(4) Metadata should be stored and managed persistently (i.e. stored on the hard disk)
data cube
concept
A data cube consists of a lattice of cuboids, each corresponding to a different level of summarization of the given multidimensional data. It allows data to be modeled and viewed in multiple dimensions, and is defined by dimensions and facts.
Dimension: a perspective or entity with respect to which an organization wants to keep records. Each dimension can have a table associated with it, called a dimension table.
Fact: It is a numerical measure. The fact table contains the name or measure of the fact and the key for each associated dimension table.
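A minimal sketch of the dimension/fact idea with pandas: a tiny fact table of sales with hypothetical dimensions (quarter, city, country) and one measure (sales) is pivoted and re-aggregated, mimicking an OLAP roll-up along the location dimension; all names and figures are invented.

```python
# Sketch of a fact table with dimensions and a measure, plus a roll-up.
import pandas as pd

fact = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "city":    ["Vancouver", "Toronto", "Vancouver", "Toronto", "New York", "New York"],
    "country": ["Canada", "Canada", "Canada", "Canada", "USA", "USA"],
    "sales":   [605, 818, 680, 894, 1087, 1130],
})

# 2-D cuboid: sales by quarter and city.
by_city = fact.pivot_table(values="sales", index="quarter",
                           columns="city", aggfunc="sum")

# Roll-up along the location dimension: city -> country.
by_country = fact.pivot_table(values="sales", index="quarter",
                              columns="country", aggfunc="sum")

print(by_city)
print(by_country)
```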
Data warehouse design and usage
design view
Top-down view: allows you to select relevant information needed for the data warehouse
Data source view: reveals the information collected, stored and managed by the operating database system
Data warehouse view: includes fact tables and dimension tables. They provide information stored in the data warehouse, including precomputed totals and counts, as well as information about sources, dates, and times that provide historical context.
Business Query View: Perspective on the data in the data warehouse from the end user's perspective
designing process
Select the business process to be modeled
Choose the grain (granularity) of the business process
Select the dimensions to use for each fact table record
Select the measure to place in each fact table record
application
Information Processing: Supports queries and basic statistical analysis and reporting using crosstabs, tables, charts or graphs
Analytical processing: Supports basic OLAP operations, including slicing and dicing, drill-down, roll-up, and pivoting, generally operating on summary and detailed historical data
Data mining: supports knowledge discovery, including finding hidden patterns and associations, constructing analytical models, performing classification and prediction, and using visualization tools to provide mining results.
Multidimensional Data Mining (OLAM)
Concept: Integrate data mining with OLAP to discover knowledge in multidimensional databases
importance
High quality of data in data warehouse
Most data mining tools need to work on integrated, consistent, and cleaned data, which requires costly data cleaning, data transformation, and data integration as preprocessing steps. A data warehouse constructed by such preprocessing serves not only OLAP but also data mining, as a high-quality and valuable source of data.
Available information processing infrastructure surrounding data warehouses
Comprehensive information processing and data analysis infrastructures have been or will be systematically constructed around data warehouses, including accessing, integration, consolidation, and transformation of multiple heterogeneous databases, ODBC/OLEDB connections, Web access and service facilities, and reporting and OLAP analysis tools. It is prudent to make the best use of this available infrastructure rather than building everything from scratch.
Multidimensional data exploration based on OLAP
Effective data mining requires exploratory data analysis. Users often want to traverse the database, select relevant data, analyze them at different granularities, and provide knowledge/results in different forms.
Multidimensional data mining provides mechanisms for data mining on different data subsets and different abstraction levels, drilling, rotating, filtering, dicing and slicing on data cubes and intermediate results of data mining.
Together with data/knowledge visualization tools, these will greatly enhance the capabilities and flexibility of exploratory data mining.
Online selection of data mining functions
Users often may not know what type of knowledge they want to mine. By integrating OLAP with multidimensional data mining functions, multidimensional data mining provides users with the flexibility to choose the desired data mining functions and dynamically switch data mining tasks.
Effective processing/steps for OLAP queries
(1) Determine which operations should be performed on the available cubes
This will involve converting select, projection, roll-up (grouping) and drill-down operations in the query into corresponding SQL/OLAP operations
(2) Determine which materialized cubes should be used for relevant operations
This involves identifying all of the materialized cuboids that may potentially be used to answer the query, pruning the set using knowledge of "dominance" relationships among the cuboids, estimating the cost of using the remaining materialized cuboids, and selecting the cuboid with the least cost.
Mining frequent patterns, associations and correlations
concept
Item set
A set of items. An itemset containing k items is called a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset; it is also known as the frequency, support count, or count of the itemset.
frequent itemsets
If the relative support of an itemset I satisfies a prespecified minimum support threshold (equivalently, the absolute support of I satisfies the corresponding minimum support count threshold), then I is a frequent itemset; the set of frequent k-itemsets is denoted Lk
closed itemset
An itemset X is closed in a data set D if there exists no proper super-itemset Y such that Y has the same support count as X in D.
closed frequent itemset
If X is closed and frequent in D, then the itemset X is a closed frequent itemset in D
Maximal frequent itemsets
An itemset X is a maximal frequent itemset in D if X is frequent and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in D
Association rule mining process
(1) Find all frequent itemsets
By definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count, min_sup
(2) Generating strong association rules from frequent item sets
By definition, these rules must satisfy minimum support and minimum confidence
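A hedged sketch of these two steps on a toy transaction database: itemset supports are counted, then rules meeting minimum support and minimum confidence are kept. The transactions, items, and threshold values are made up; a real miner would use Apriori or FP-growth rather than brute-force counting.

```python
# Sketch of association rule mining: (1) find frequent itemsets,
# (2) generate strong rules that satisfy min support and min confidence.
from itertools import combinations
from collections import Counter

transactions = [
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "butter"},
]
min_sup, min_conf = 0.4, 0.6
n = len(transactions)

# Step 1: find frequent 1- and 2-itemsets by counting support.
counts = Counter()
for t in transactions:
    for k in (1, 2):
        for itemset in combinations(sorted(t), k):
            counts[frozenset(itemset)] += 1
frequent = {i: c / n for i, c in counts.items() if c / n >= min_sup}

# Step 2: generate strong rules A -> B from the frequent 2-itemsets.
for itemset, support in frequent.items():
    if len(itemset) != 2:
        continue
    for a in itemset:
        antecedent, consequent = frozenset([a]), itemset - {a}
        confidence = support / frequent[antecedent]
        if confidence >= min_conf:
            print(f"{set(antecedent)} -> {set(consequent)} "
                  f"(support={support:.2f}, confidence={confidence:.2f})")
```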
FP-growth advantages and disadvantages
(1) This method significantly reduces search overhead
(2) When the database is large, it is sometimes unrealistic to construct a main memory-based FP tree. One option is to first partition the database into a collection of projection databases and then mine within each projection database
(3) It is efficient and scalable for both mining long frequent patterns and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm
Classification and prediction
concept
Classification
A form of data analysis that extracts models describing data classes. A classifier (classification model) predicts categorical class labels, whereas numeric prediction models continuous-valued functions. Classification and numeric prediction are the two major types of prediction problems.
supervised learning
The class label of each training tuple is provided, and the learning of the classifier is supervised by being told "which class" each training tuple belongs to.
Unsupervised learning (clustering)
The class label of each training tuple is unknown, and the number or set of classes to be learned may not be known in advance.
data classification process
(1) Learning stage (building a classifier model)
(2) Classification stage (use the model to predict the class label of the given data)
decision tree
Decision tree induction: is a top-down recursive tree induction algorithm that uses an attribute selection metric to select attribute tests for each non-leaf node of the tree. Algorithms include ID3, C4.5, CART, which use different attribute selection metrics.
Decision tree: a flowchart-like tree structure in which each internal (non-leaf) node denotes a test on an attribute, each branch represents an outcome of the test, each leaf (terminal) node holds a class label, and the topmost node is the root.
Decision tree classifier advantages and disadvantages
(1) The construction of the decision tree classifier does not require any domain knowledge or parameter settings, so it is suitable for exploratory knowledge discovery.
(2) Decision trees can handle high-dimensional data
(3) The acquired knowledge is intuitive and easy to understand in the form of tree representation.
(4) The inductive learning and classification steps of decision trees are simple and fast
(5) Generally speaking, decision tree classifiers have good accuracy
(6) Decision trees are the basis of many business induction systems
Disadvantages: Successful use may depend on the data at hand
attribute selection measure
concept
A heuristic for selecting the splitting criterion that "best" separates a given data partition D of class-labeled training tuples into individual classes.
If D is divided into smaller partitions based on the output of the splitting criterion, ideally each partition should be pure (i.e. all tuples falling into a given partition belong to the same class)
Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are split. The measure provides a ranking of the attributes describing the given training tuples, and the attribute with the best score for the measure is chosen as the splitting attribute for those tuples.
If the splitting attribute is continuous-valued, or if we are restricted to binary trees, then a split point or a splitting subset must also be determined as part of the splitting criterion. The tree node created for partition D is labeled with the splitting criterion, branches are grown for each outcome of the criterion, and the tuples are partitioned accordingly.
method
information gain
Gain ratio
Gini index
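A small sketch of two of these measures for a binary split: information gain (entropy reduction) and Gini index reduction. The class counts (a parent partition of 9 positive and 5 negative tuples split into two branches) are hypothetical numbers chosen only to make the arithmetic concrete.

```python
# Sketch: information gain and Gini reduction for a candidate binary split.
import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

def gini(counts):
    p = np.asarray(counts, dtype=float) / sum(counts)
    return 1.0 - (p ** 2).sum()

parent = [9, 5]                 # class counts before the split
branches = [[6, 1], [3, 4]]     # class counts in each branch after the split
n = sum(parent)

info_gain = entropy(parent) - sum(sum(b) / n * entropy(b) for b in branches)
gini_drop = gini(parent) - sum(sum(b) / n * gini(b) for b in branches)

print(f"information gain = {info_gain:.3f}")
print(f"Gini reduction   = {gini_drop:.3f}")
```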
pruning
concept
When a decision tree is created, due to noise and outliers in the data, many branches reflect anomalies in the training data. The pruning method deals with this overfitting problem.
Prepruning
The tree is "pruned" by halting its construction early (e.g., by deciding not to further split or partition the subset of training tuples at a given node). Upon halting, the node becomes a leaf, which may hold the most frequent class among the subset tuples or the probability distribution of those tuples.
Postpruning
Removes subtrees from a fully grown tree. A subtree at a given node is pruned by removing its branches and replacing it with a leaf, labeled with the most frequent class among the subtree being replaced (a pruning sketch follows).
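A hedged sketch of both pruning styles using scikit-learn decision trees: prepruning via early-stopping parameters, and postpruning via cost-complexity pruning. The data set and the parameter values (max_depth, min_samples_split, ccp_alpha) are arbitrary choices for illustration.

```python
# Sketch of prepruning (early stopping) vs. postpruning (cost-complexity).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Prepruning: stop growing early via depth / minimum-split constraints.
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=20,
                             random_state=0).fit(X_tr, y_tr)

# Postpruning: grow fully, then prune subtrees with cost-complexity pruning.
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
for name, tree in [("full", full), ("prepruned", pre), ("postpruned", post)]:
    print(name, tree.get_n_leaves(), "leaves,",
          f"test accuracy {tree.score(X_te, y_te):.3f}")
```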
Ensemble classification
concept
An ensemble combines k learned models (base classifiers) M1, M2, ..., Mk to create an improved composite classification model M*. A given data set is used to create k training sets D1, D2, ..., Dk, where Di (1 <= i <= k) is used to build classifier Mi. Given a new tuple to classify, each base classifier votes by returning a class prediction.
type
bagging
Given a set D of d tuples, bagging works as follows: for iteration i (i = 1, 2, ..., k), a training set Di of d tuples is sampled with replacement from the original set of tuples D. The term bagging stands for bootstrap aggregation; each training set is a bootstrap sample.
Because sampling with replacement is used, some tuples of D may not appear in Di, while others may occur more than once. A classification model Mi is learned from each training set Di. To classify an unknown tuple X, each classifier Mi returns its class prediction, which counts as one vote; the bagged classifier M* counts the votes and assigns the class with the most votes to X.
Boosting
A weight is assigned to each training tuple, and k classifiers are learned iteratively. After classifier Mi is learned, the weights are updated so that the subsequent classifier, Mi+1, pays more attention to the training tuples misclassified by Mi. The final boosted classifier M* combines the votes of the individual classifiers, where the weight of each classifier's vote is a function of its accuracy.
random forest
Imagine that each classifier in the ensemble is a decision tree, so the collection of classifiers is a "forest". Each individual tree determines its splits using a random selection of attributes at each node. More formally, each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. During classification, each tree votes and the most popular class is returned (an ensemble sketch follows).
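A hedged sketch of the three ensemble schemes above using scikit-learn, with decision trees as the base learners; the data set, the number of estimators, and the cross-validation setup are arbitrary choices for illustration.

```python
# Sketch of bagging, boosting, and random forests with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

ensembles = {
    # Bagging: bootstrap samples of the training set, one tree per sample.
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
    # Boosting: reweight tuples so later learners focus on earlier mistakes.
    "boosting": AdaBoostClassifier(n_estimators=50, random_state=0),
    # Random forest: trees split on randomly selected attribute subsets.
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

for name, model in ensembles.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:13s} mean accuracy {scores.mean():.3f}")
```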
Bagging versus boosting
Because boosting focuses on misclassified tuples, there is a danger that the resulting composite model will overfit the data, and therefore the "boosted" resulting model may sometimes be less accurate than a single model derived from the same data.
Bagging is less susceptible to overfitting; while both can significantly improve accuracy over a single model, boosting tends to achieve greater accuracy.
class imbalance problem
concept
Given two classes of data, if the main class of interest (positive class) is represented by only a small number of tuples, while the vast majority of tuples represent the negative class, the data is class-imbalanced.
For multi-class imbalanced data, the data distribution of each class is significantly different, where tuples of the main class or the class of interest are rare. The class imbalance problem is closely related to cost-sensitive learning, where the cost of errors is not equal for each class.
Strategy
Oversampling and undersampling
Assume that the original training set contains 100 positive tuples and 1000 negative tuples. In oversampling, rare tuples are copied to form a new training set containing 1000 positive tuples and 1000 negative tuples. In undersampling, negative tuples are randomly deleted, forming a new training set containing 100 positive tuples and 100 negative tuples.
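A small sketch of random oversampling and undersampling with pandas, mirroring the 100 positive / 1,000 negative example above; the data frame contains synthetic labels only.

```python
# Sketch of oversampling and undersampling a class-imbalanced training set.
import pandas as pd

df = pd.DataFrame({"label": ["pos"] * 100 + ["neg"] * 1000})
pos, neg = df[df["label"] == "pos"], df[df["label"] == "neg"]

# Oversampling: replicate rare positive tuples up to the negative count.
oversampled = pd.concat([pos.sample(len(neg), replace=True, random_state=0), neg])

# Undersampling: randomly discard negative tuples down to the positive count.
undersampled = pd.concat([pos, neg.sample(len(pos), random_state=0)])

print(oversampled["label"].value_counts())
print(undersampled["label"].value_counts())
```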
threshold shift
Does not involve sampling; it applies to classifiers that return a continuous output value for a given input tuple (as in ROC analysis). For an input tuple X, such a classifier returns a mapped output f(X) in [0, 1]. Rather than manipulating the training tuples, this method returns the classification decision based on the output values and a decision threshold.
Threshold shifting and combining methods outperform oversampling and undersampling, and threshold shifting is effective even on very imbalanced data sets.
ROC
The receiver operating characteristic curve (ROC) is a useful visual tool for comparing two classification models. The ROC curve gives the trade-off between the true positive ratio (TPR) and the false positive ratio (FPR) of a given model.
The ROC curve allows us to observe the trade-off between the proportion of positive instances correctly identified by the model and the proportion of negative instances identified as positive instances for different parts of the test set. The increase in TPR comes at the expense of an increase in FPR.
The area under the ROC curve is a measure of model accuracy
The ROC curve uses the class-prediction probability of each test tuple to rank the test tuples, so that the tuples most likely to belong to the positive ("yes") class appear at the top of the list and the tuples least likely to belong to the positive class appear at the bottom (an ROC and threshold-moving sketch follows).
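A hedged sketch of ROC analysis and threshold moving with scikit-learn: test tuples are scored by predicted positive-class probability, the TPR/FPR trade-off and AUC are computed, and a lower decision threshold is tried for the rare positive class. The synthetic imbalanced data set, the logistic regression model, and the threshold values are assumptions for illustration.

```python
# Sketch of ROC/AUC evaluation and threshold moving on imbalanced data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]          # continuous output f(X) in [0, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)  # the TPR vs. FPR trade-off
print("area under ROC curve:", round(roc_auc_score(y_te, scores), 3))
print("number of (FPR, TPR) operating points:", len(thresholds))

# Threshold moving: classify as positive above a threshold lower than 0.5,
# trading a higher FPR for a higher TPR on the rare positive class.
for threshold in (0.5, 0.3):
    pred = (scores >= threshold).astype(int)
    tp = ((pred == 1) & (y_te == 1)).sum()
    fp = ((pred == 1) & (y_te == 0)).sum()
    print(f"threshold {threshold}: true positives = {tp}, false positives = {fp}")
```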
Comparison of OLTP (operational database systems) and OLAP (data warehouse systems)
Users and system orientation: OLTP is customer-oriented; OLAP is market-oriented
Data contents: OLTP manages current data; OLAP manages historical data, provides summarization and aggregation mechanisms, and stores and manages information at different levels of granularity
Database design: OLTP adopts an entity-relationship (ER) data model and an application-oriented design; OLAP adopts a star or snowflake schema and a subject-oriented design
View: OLTP focuses on the current data of an enterprise or department; OLAP spans multiple versions of database schemas, handles information from different organizations, and integrates information from many databases
Access patterns: OLTP access consists mainly of short atomic transactions, requiring concurrency control and recovery mechanisms; OLAP access consists mostly of read-only operations, often complex queries