MindMap Gallery Data mining concepts and techniques
Mind map of data mining concepts and techniques. Data mining is the process of mining interesting patterns and knowledge from large amounts of data. Data sources include databases, data warehouses, the Web, other information repositories, or data flowing dynamically into the system.
Edited at 2023-05-19 11:54:52
Data mining concepts and techniques
Data Mining Overview
data mining concepts
The process of mining interesting patterns and knowledge from large amounts of data. Data sources include databases, data warehouses, the Web, other information repositories, or data dynamically flowing into the system
Also referred to as knowledge mining from data, or knowledge discovery from data (KDD)
knowledge discovery process
(1) Data cleaning: eliminate noise and delete inconsistent data
(2) Data integration: multiple data sources can be combined together
(3) Data selection: retrieve the data relevant to the analysis task from the database
(4) Data transformation: Transform and unify data into a form suitable for mining through summary and aggregation operations
(5) Data mining: basic steps, using intelligent methods to extract data patterns
(6) Pattern evaluation: Identify truly interesting patterns that represent knowledge based on some measure of interest
(7) Knowledge presentation: use visualization and knowledge representation techniques to present the mined knowledge to users (a pipeline sketch of these steps follows)
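As an illustration of how these steps chain together, here is a minimal Python sketch of a knowledge discovery pipeline; the toy table, its column names ("age", "income", "churned"), and the choice of a clustering step as the "mining" stage are assumptions for illustration, not part of the original outline.

```python
# Minimal sketch of the knowledge discovery (KDD) steps on a hypothetical
# customer table with columns "age", "income", and "churned".
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "age": [25, 32, None, 47, 51, 38],
    "income": [30000, 45000, 52000, None, 88000, 61000],
    "churned": [0, 0, 1, 1, 0, 1],
})

# (1) Data cleaning: fill missing values with the column median.
clean = df.fillna(df.median(numeric_only=True))

# (2)-(3) Data integration and selection: a single source here, so we
# simply select the attributes relevant to the analysis task.
selected = clean[["age", "income"]]

# (4) Data transformation: min-max scale into [0, 1].
transformed = (selected - selected.min()) / (selected.max() - selected.min())

# (5) Data mining: extract a simple pattern (two clusters of customers).
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(transformed)

# (6)-(7) Pattern evaluation and knowledge presentation: report cluster
# sizes and centers so a user can judge whether the pattern is interesting.
print(pd.Series(model.labels_).value_counts())
print(model.cluster_centers_)
```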
Data collection and database creation (1960s and earlier): primitive file processing
Database management systems (1970s-early 1980s)
Advanced Database Systems (mid-1980s-present)
Advanced Data Analysis (late 1980s-present)
Data types for data mining
Database systems
composition
A collection of interrelated data (the database)
Software programs that manage and access data
Define the database structure and data storage, describe and manage concurrent, shared or distributed data access, and ensure the consistency and security of information in the face of system failure and unauthorized access.
A relational database is a collection of tables, each table is given a unique name
Each tuple in the relational table represents an object, is identified by a unique keyword, and is described by a set of attribute values.
Each table contains a set of attributes (columns or fields) and usually holds a large number of tuples (records or rows)
Semantic data models are usually built for relational databases, such as the entity-relationship (ER) data model
Data warehouses
A data warehouse is an information repository that collects information from multiple data sources, stores it in a consistent schema, and typically resides on a single site. Data warehouse is constructed through data cleansing, data transformation, data integration, data loading and periodic data refresh.
Transactional data
Generally, each record in the transaction database represents a transaction, such as a customer's purchase or a flight booking. A transaction contains a unique transaction identification number (TransID), and a list of items (such as purchased items) that make up the transaction. The transaction database may have some additional tables related to it that contain other information about the transaction, such as product descriptions.
Other types of data
Time-related or sequence data (historical records, time-series data), data streams (e.g., continuously arriving video surveillance data), spatial data (maps), engineering design data (building designs, integrated circuits), hypertext and multimedia data (text, images), graph and network data (e.g., social information networks), and the World Wide Web. These carry special semantics (ordering, audio/video content, connectivity) and yield patterns with rich structures and semantics to mine.
Data mining function
(1) Characterization and discrimination
Data characterization: summarize the general characteristics of data for the class under study (the target class)
Simple data summary based on statistical measures and graphs
OLAP roll-up
Attribute-oriented induction techniques
Data discrimination: compare the target class with one or more comparison classes (contrast classes)
Comparative measures are typically expressed in the form of discriminant rules (a characterization/discrimination sketch follows this list)
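A hedged sketch of characterization versus discrimination using pandas: summary statistics are computed for a hypothetical target class and contrasted against a comparison class. The table, column names, and the "big_spender" label are invented for illustration.

```python
# Sketch of data characterization vs. discrimination on a toy customer table.
import pandas as pd

customers = pd.DataFrame({
    "age": [23, 35, 41, 52, 29, 60, 45, 33],
    "income": [28000, 52000, 61000, 90000, 34000, 87000, 70000, 40000],
    "big_spender": [0, 1, 1, 1, 0, 1, 1, 0],  # 1 = target class, 0 = contrast class
})

# Data characterization: summarize the general features of the target class.
target = customers[customers["big_spender"] == 1]
print(target[["age", "income"]].describe())

# Data discrimination: compare the target class against the contrast class.
comparison = customers.groupby("big_spender")[["age", "income"]].mean()
print(comparison)
```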
(2) Frequent patterns
frequent itemsets
Frequent subsequences (sequence patterns)
frequent substructure
(3) Association and correlation mining
Single-dimensional association rules: association rules containing a single predicate
Multidimensional association rules: associations involving multiple attributes or predicates
(4) Classification and regression
Classification
Concept: Find a model (or function) that describes and distinguishes data classes or concepts, so that the model can be used to predict the class label of objects whose class label is unknown
method
Classification rules (IF-THEN rules)
Decision tree: A tree structure similar to a flow chart, in which each node represents a test on an attribute value, each branch represents a result of the test, and the leaves represent classes or class distributions.
Mathematical formula
Neural networks: processing units similar to neurons, with weighted connections between the units
Naive Bayes classification, support vector machine, K nearest neighbor classification
Regression: Used to predict missing or hard-to-obtain numerical data values, and also includes the identification of distribution trends based on available data.
Correlation analysis is performed before classification and regression and it attempts to identify attributes that are significantly related to the classification and regression processes.
(5) Cluster analysis
Concept: Objects are clustered or grouped according to the principle of maximizing intra-class similarity and minimizing inter-class similarity. Clusters of objects are formed in such a way that objects in the same cluster are highly similar but very dissimilar to objects in other clusters. Each cluster formed can be viewed as an object class from which rules can be derived. Clustering also facilitates the formation of classifications, that is, organizing observations into a class hierarchy and grouping similar events together.
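A minimal cluster analysis sketch with scikit-learn, grouping objects so that intra-cluster similarity is high and inter-cluster similarity is low; the synthetic blob data and the choice of k = 3 are assumptions for illustration only.

```python
# Sketch of cluster analysis: objects in the same cluster are similar to
# each other and dissimilar to objects in other clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_

# Each cluster can be viewed as a class of objects; inspect size and centroid.
for c in range(3):
    members = X[labels == c]
    print(f"cluster {c}: {len(members)} objects, centroid {members.mean(axis=0)}")
```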
(6) Outlier analysis
Concept: Find data objects in a dataset that are inconsistent with the general behavior or model of the data
Statistics and Data Mining
Statistics studies the collection, analysis, interpretation and presentation of data, and data mining has a natural connection with statistics.
A statistical model is a set of mathematical functions that describe the behavior of target objects using random variables and their probability distributions.
(1) Statistical models can be the result of a data mining task, and data mining tasks can also be built on top of statistical models. When mining patterns from large data sets, the mining process can therefore use a statistical model to help identify noise and missing values in the data.
(2) Statistical research develops some data and statistical models for prediction and forecasting tools. Statistics is useful for mining various patterns from data and understanding the potential mechanisms that generate and influence these patterns.
(3) Statistical methods can also be used to verify data mining results. For example: after establishing a classification or prediction model, statistical hypothesis testing should be used to verify the model.
Using statistical methods in data mining is not simple. How to apply statistical methods to large data sets is a huge challenge. Many statistical methods have high computational complexity.
machine learning
Concept: How computers learn (or improve their performance) based on data. The main research field is that computers automatically learn to recognize complex patterns based on data and make intelligent decisions.
type
Supervised learning: Similar to classification, the supervision in learning comes from labeled instances in the training data set
Unsupervised learning: similar to clustering, the input instances are not labeled
Semi-supervised learning: using labeled and unlabeled instances when learning a model
Active learning: Let users play an active role in the learning process
For classification and clustering tasks, machine learning research often focuses on model accuracy. In addition to accuracy, data mining research places great emphasis on the effectiveness and scalability of mining methods on large data sets, as well as ways to deal with complex data types, and the development of new, non-traditional methods.
Data mining application areas: business intelligence, web search, bioinformatics, health care informatics, finance, digital libraries and digital government
Main issues in data mining
Mining method
Discover new types of knowledge
Mining knowledge in multi-dimensional space
Data Mining—An Interdisciplinary Endeavor
Improve discovery capabilities in network environments
Dealing with uncertain data, noisy or incomplete data
Pattern evaluation and mining of pattern constraint guidance
user interface
interactive mining
Combine background knowledge
Ad hoc data mining and data mining query languages
Representation and visualization of data mining results
Effectiveness and scalability
Effectiveness and scalability of data mining algorithms
Parallel, distributed, and incremental mining algorithms
Diversity of database types
Handle complex data types
Mining dynamic, networked, global databases
Data Mining and Society
The social impact of data mining
Privacy-preserving data mining
invisible data mining
Data preprocessing
concept
Data object: also called a sample, instance, data point or object, a data object represents an entity
Attributes
Nominal attribute: The value is the name of some symbol or thing. Each value represents some category, encoding or status, so nominal attributes are considered categorical
Binary attribute: It is a nominal attribute with only two status categories: 0 or 1. 0 means that the attribute does not appear, and 1 means that the attribute appears.
A binary attribute is symmetric if its two states are of equal value and carry the same weight, and asymmetric if the consequences of its states are not equally important.
Ordinal attribute: an attribute whose possible values have a meaningful order or ranking, but the magnitude between successive values is unknown
Numeric attributes
Interval-scaled attributes: measured on a scale of equal-size units. The values are ordered and can be positive, zero, or negative, so in addition to ranking, these attributes allow us to compare and quantify differences between values
Ratio-scaled attributes: numeric attributes with an inherent zero point; a value can be described as a multiple (or ratio) of another value. The values are ordered, and we can compute the differences between values, as well as the mean, median, and mode
Cluster: A collection of data objects such that objects in the same cluster are similar to each other but different from objects in other clusters.
Data matrix: used to store data objects; it involves two entities or "things", rows (representing objects) and columns (representing attributes), and is therefore often called a two-mode matrix
Dissimilarity matrix: used to store the dissimilarity values for pairs of data objects; it contains only one kind of entity, and is therefore often called a one-mode matrix
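A small sketch of the two structures above: a data matrix of objects described by numeric attributes, and the corresponding dissimilarity matrix built with scipy; the Euclidean distance metric and the sample values are assumptions for illustration.

```python
# Sketch: a data matrix (objects x attributes) and its dissimilarity matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: 4 objects described by 2 numeric attributes (two-mode).
data_matrix = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [1.5, 1.8],
    [8.0, 8.0],
])

# Dissimilarity matrix: pairwise Euclidean distances between objects (one-mode).
dissimilarity_matrix = squareform(pdist(data_matrix, metric="euclidean"))
print(np.round(dissimilarity_matrix, 2))
```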
Data quality: accuracy, completeness, consistency, timeliness, credibility, interpretability
Data cleaning
Concept: "Clean data" by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
Missing value handling
Ignore tuples
Fill in missing values manually
Use a global constant to fill in missing values
Fill missing values using a central measure of the attribute, such as the mean or median
Use the attribute mean or median of all samples belonging to the same class as the given tuple
Fill missing values using the most likely value
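A hedged sketch of several of the strategies above using pandas; the toy table, its "income" attribute, and its "class" column are hypothetical.

```python
# Sketch of missing-value handling strategies with pandas.
import pandas as pd

df = pd.DataFrame({
    "income": [30000, None, 52000, None, 88000, 61000],
    "class":  ["low", "low", "high", "high", "high", "high"],
})

# Ignore the tuple: drop rows containing missing values.
dropped = df.dropna()

# Fill with a global constant.
constant_filled = df.fillna({"income": -1})

# Fill with a central measure of the attribute (here the median).
median_filled = df.fillna({"income": df["income"].median()})

# Fill with the median of samples belonging to the same class as the tuple.
class_filled = df.copy()
class_filled["income"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.median())
)
print(class_filled)
```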
Noisy data processing
binning
Smoothing by bin means
Smoothing by bin medians
Smoothing by bin boundaries
Regression
Outlier analysis (clustering)
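A minimal sketch of smoothing noisy data by binning: sorted values are partitioned into equal-frequency bins and each value is replaced by its bin mean (bin medians or bin boundaries would be handled analogously). The price list is a made-up example.

```python
# Sketch of noise smoothing by bin means on a toy list of prices.
import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

n_bins = 3
bins = np.array_split(np.sort(prices), n_bins)   # equal-frequency bins

smoothed_by_means = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed_by_means)   # each value replaced by the mean of its bin
```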
Data integration: combining data from multiple data sources into a consistent data store, such as a data warehouse
Reasons for data preprocessing: Low-quality data will lead to low-quality mining results
Importance of data preprocessing: It can significantly improve the overall quality of data mining models and reduce the time required for actual mining.
Data preprocessing steps: data cleaning - data integration - data reduction - data transformation
Data transformation strategy
Smoothing: removes noise from data, including binning, regression and clustering
Attribute construction (feature construction): New attributes are constructed from the given attributes and added to the attribute set to aid the mining process.
Aggregation: summary or aggregation operations are applied to the data
Normalization: Scale the data so that it falls into a specific small interval, such as (-1, 1) or (0, 1)
discretization
Concept: The original value of the numerical attribute is replaced with an interval label or a concept label
Methods: binning, histogram analysis, cluster analysis, decision tree analysis, correlation analysis
Concept hierarchy generation
Concept: Define a mapping sequence that maps low-level concepts to higher-level, more general concepts
method
Specification of a partial ordering of attributes explicitly at the schema level by users or experts
Specification of a portion of the hierarchy by explicit data grouping
Specification of a set of attributes but not their partial ordering, for example: generating the concept hierarchy based on the number of distinct values of each attribute
Specification of only a partial set of attributes, for example: using predefined semantic relationships to generate the concept hierarchy
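A sketch of two of the transformation strategies listed above, min-max normalization into [0, 1] and equal-width discretization into interval (concept) labels; the sample age values and the three label names are assumptions for illustration.

```python
# Sketch of min-max normalization and equal-width discretization with pandas.
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 40, 45, 46, 52, 70])

# Normalization: v' = (v - min) / (max - min), mapping values into [0, 1].
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Discretization: replace raw values with interval (concept) labels.
labels = pd.cut(ages, bins=3, labels=["young", "middle_aged", "senior"])

print(pd.DataFrame({"age": ages, "normalized": normalized, "label": labels}))
```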
Data reduction
Concept: Used to obtain a reduced representation of a data set that is much smaller but still close to maintaining the integrity of the original data.
Strategy
Dimensionality reduction
Concept: Reduce the number of random variables or attributes considered
type
Wavelet transforms, principal component analysis: transform or project the original data onto a smaller space
Attribute subset selection: detecting and removing irrelevant, weakly relevant, or redundant attributes or dimensions
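A hedged sketch of dimensionality reduction with principal component analysis in scikit-learn; the iris data set and the choice of two components are stand-ins for illustration only.

```python
# Sketch of dimensionality reduction: project the original attributes onto
# a smaller space with principal component analysis (PCA).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # 150 objects, 4 attributes

pca = PCA(n_components=2)                 # keep 2 principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)      # variance retained per component
```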
Numerosity reduction
Concept: Replace original data with an alternative, smaller representation of the data
type
Parametric methods: regression, log-linear models
Non-parametric methods: histograms, clustering, sampling, data cube aggregation
data compression
Concept: Use transformations to obtain reduced or compressed representations of original data
type
Lossless: the original data can be reconstructed from the compressed data without loss of information
Lossy: the original data can only be approximately reconstructed
Data Warehousing and Online Analytical Processing
Data warehouse
The data warehouse is a database that is maintained separately from the organization's operational databases.
Data warehouse allows various application systems to be integrated together, provides a solid platform for unified historical data analysis, and provides support for information processing.
A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that supports managers' decision-making processes.
OLTP: Online transaction processing system, performs online transaction and query processing
OLAP: Online analytical processing system, which organizes and provides data in different formats to meet the various needs of different users.
Data warehouse three-tier architecture
Top level: front-end tools
Middle tier: OLAP server
Bottom layer: data warehouse server
data warehouse model
Enterprise warehouse
Gathers all information on a subject, spans the entire enterprise, provides a full range of data integration, usually from one or more operational database systems or external information providers, and is versatile, containing both detailed and summary data
data mart
Concept: Contains a subset of enterprise-wide data that is useful to a specific user group. The scope is limited to selected topics. The data is usually aggregated.
Independent data mart: Data typically comes from one or more operational database systems or external information providers, or from data generated by a specific department or local area
Dependent data mart: data comes directly from the enterprise data warehouse
virtual warehouse
A collection of views over operational databases; for efficient query processing, only some of the possible summary views may be materialized.
Metadata
Concept: Data about data. In a data warehouse, metadata is the data that defines warehouse objects.
content
Description of the data warehouse structure: warehouse schema, views, dimensions, hierarchies, definition of exported data, location and content of data marts
Operational metadata: data lineage, currency of data (active, archived, or purged), and monitoring information
Algorithms used for aggregation: measure and dimension definition algorithms, granularity at which the data is located, partitioning, subject areas, aggregation, summary, predefined queries and reports
Mapping from operating environment to data warehouse: source databases and their contents, gateway descriptions, data extraction, cleansing, transformation rules and default values, data refresh and cleansing rules, security (user authorization and access control)
Data related to system performance: indexes and profiles that improve data access and retrieval performance, along with rules for the timing and scheduling of refresh, update, and replication cycles
Business Metadata: Business Terms and Definitions, Data Owners and Charging Policies
Difference from other data
(1) Metadata is used as a directory to help decision support system analysts locate the content of the data warehouse.
(2) As a guide for data mapping when data are transformed from the operational environment to the data warehouse environment
(3) As a guide to the summarization algorithms that aggregate current detailed data into lightly summarized data, and lightly summarized data into highly summarized data
(4) Metadata should be stored and managed persistently (i.e. stored on the hard disk)
data cube
concept
A data cube consists of a lattice of cuboids, each corresponding to a different level of summarization of the given multidimensional data. It allows data to be modeled and viewed in multiple dimensions, and is defined by dimensions and facts.
Dimension: a perspective or entity with respect to which an organization wants to keep records. Each dimension can have a table associated with it, called a dimension table.
Fact: It is a numerical measure. The fact table contains the name or measure of the fact and the key for each associated dimension table.
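A minimal sketch of the dimension/fact idea with pandas: a tiny fact table of sales with hypothetical dimensions (quarter, city, country) and one measure (sales) is pivoted and re-aggregated, mimicking an OLAP roll-up along the location dimension; all names and figures are invented.

```python
# Sketch of a fact table with dimensions and a measure, plus a roll-up.
import pandas as pd

fact = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "city":    ["Vancouver", "Toronto", "Vancouver", "Toronto", "New York", "New York"],
    "country": ["Canada", "Canada", "Canada", "Canada", "USA", "USA"],
    "sales":   [605, 818, 680, 894, 1087, 1130],
})

# 2-D cuboid: sales by quarter and city.
by_city = fact.pivot_table(values="sales", index="quarter",
                           columns="city", aggfunc="sum")

# Roll-up along the location dimension: city -> country.
by_country = fact.pivot_table(values="sales", index="quarter",
                              columns="country", aggfunc="sum")

print(by_city)
print(by_country)
```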
Data warehouse design and usage
design view
Top-down view: allows you to select relevant information needed for the data warehouse
Data source view: reveals the information collected, stored and managed by the operating database system
Data warehouse view: includes fact tables and dimension tables. They provide information stored in the data warehouse, including precomputed totals and counts, as well as information about sources, dates, and times that provide historical context.
Business Query View: Perspective on the data in the data warehouse from the end user's perspective
designing process
Select the business process to be modeled
Choose the grain (granularity) of the business process
Select the dimensions to use for each fact table record
Select the measure to place in each fact table record
application
Information Processing: Supports queries and basic statistical analysis and reporting using crosstabs, tables, charts or graphs
Analytical processing: Supports basic OLAP operations, including slicing and dicing, drill-down, roll-up, and pivoting, generally operating on summary and detailed historical data
Data mining: supports knowledge discovery, including finding hidden patterns and associations, constructing analytical models, performing classification and prediction, and using visualization tools to provide mining results.
Multidimensional Data Mining (OLAM)
Concept: Integrate data mining with OLAP to discover knowledge in multidimensional databases
importance
High quality of data in data warehouse
Most data mining tools need to work on integrated, consistent, and cleaned data, which requires costly data cleaning, data transformation, and data integration as preprocessing steps. A data warehouse constructed by such preprocessing serves not only OLAP but also data mining, as a high-quality and valuable source of data.
Available information processing infrastructure surrounding data warehouses
Comprehensive information processing and data analysis infrastructures have been or will be systematically constructed around data warehouses, including accessing, integration, consolidation, and transformation of multiple heterogeneous databases, ODBC/OLEDB connections, Web access and service facilities, and reporting and OLAP analysis tools. It is prudent to make the best use of this available infrastructure rather than building everything from scratch.
Multidimensional data exploration based on OLAP
Effective data mining requires exploratory data analysis. Users often want to traverse the database, select relevant data, analyze them at different granularities, and provide knowledge/results in different forms.
Multidimensional data mining provides mechanisms for data mining on different data subsets and different abstraction levels, drilling, rotating, filtering, dicing and slicing on data cubes and intermediate results of data mining.
Together with data/knowledge visualization tools, these will greatly enhance the capabilities and flexibility of exploratory data mining.
Online selection of data mining functions
Users often may not know what type of knowledge they want to mine. By integrating OLAP with multidimensional data mining functions, multidimensional data mining provides users with the flexibility to choose the desired data mining functions and dynamically switch data mining tasks.
Effective processing/steps for OLAP queries
(1) Determine which operations should be performed on the available cubes
This will involve converting select, projection, roll-up (grouping) and drill-down operations in the query into corresponding SQL/OLAP operations
(2) Determine which materialized cubes should be used for relevant operations
This involves identifying all of the materialized cuboids that may potentially be used to answer the query, pruning the set using knowledge of "dominance" relationships among the cuboids, estimating the cost of using the remaining materialized cuboids, and selecting the cuboid with the least cost.
Mining frequent patterns, associations and correlations
concept
Item set
A set of items. An itemset containing k items is called a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset; it is also known as the frequency, support count, or count of the itemset.
frequent itemsets
If the relative support of an itemset I satisfies a prespecified minimum support threshold (equivalently, the absolute support of I satisfies the corresponding minimum support count threshold), then I is a frequent itemset; the set of frequent k-itemsets is denoted Lk
closed itemset
An itemset X is closed in a data set D if there exists no proper super-itemset Y such that Y has the same support count as X in D.
closed frequent itemset
If X is closed and frequent in D, then the itemset X is a closed frequent itemset in D
Maximal frequent itemsets
An itemset X is a maximal frequent itemset in D if X is frequent and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in D
Association rule mining process
(1) Find all frequent itemsets
By definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count, min_sup
(2) Generating strong association rules from frequent item sets
By definition, these rules must satisfy minimum support and minimum confidence
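A hedged sketch of these two steps on a toy transaction database: itemset supports are counted, then rules meeting minimum support and minimum confidence are kept. The transactions, items, and threshold values are made up; a real miner would use Apriori or FP-growth rather than brute-force counting.

```python
# Sketch of association rule mining: (1) find frequent itemsets,
# (2) generate strong rules that satisfy min support and min confidence.
from itertools import combinations
from collections import Counter

transactions = [
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "butter"},
]
min_sup, min_conf = 0.4, 0.6
n = len(transactions)

# Step 1: find frequent 1- and 2-itemsets by counting support.
counts = Counter()
for t in transactions:
    for k in (1, 2):
        for itemset in combinations(sorted(t), k):
            counts[frozenset(itemset)] += 1
frequent = {i: c / n for i, c in counts.items() if c / n >= min_sup}

# Step 2: generate strong rules A -> B from the frequent 2-itemsets.
for itemset, support in frequent.items():
    if len(itemset) != 2:
        continue
    for a in itemset:
        antecedent, consequent = frozenset([a]), itemset - {a}
        confidence = support / frequent[antecedent]
        if confidence >= min_conf:
            print(f"{set(antecedent)} -> {set(consequent)} "
                  f"(support={support:.2f}, confidence={confidence:.2f})")
```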
FP-growth advantages and disadvantages
(1) This method significantly reduces search overhead
(2) When the database is large, it is sometimes unrealistic to construct a main memory-based FP tree. One option is to first partition the database into a collection of projection databases and then mine within each projection database
(3) It is efficient and scalable for both mining long frequent patterns and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm
Classification and prediction
concept
Classification
A form of data analysis that extracts models describing data classes. A classifier (classification model) predicts categorical class labels, whereas numeric prediction models continuous-valued functions. Classification and numeric prediction are the two major types of prediction problems.
supervised learning
The class label of each training tuple is provided, and the learning of the classifier is supervised by being told "which class" each training tuple belongs to.
Unsupervised learning (clustering)
The class label of each training tuple is unknown, and the number or set of classes to be learned may not be known in advance.
data classification process
(1) Learning stage (building a classifier model)
(2) Classification stage (use the model to predict the class label of the given data)
decision tree
Decision tree induction: is a top-down recursive tree induction algorithm that uses an attribute selection metric to select attribute tests for each non-leaf node of the tree. Algorithms include ID3, C4.5, CART, which use different attribute selection metrics.
Decision tree: a flowchart-like tree structure in which each internal (non-leaf) node denotes a test on an attribute, each branch represents an outcome of the test, each leaf (terminal) node holds a class label, and the topmost node is the root.
Decision tree classifier advantages and disadvantages
(1) The construction of the decision tree classifier does not require any domain knowledge or parameter settings, so it is suitable for exploratory knowledge discovery.
(2) Decision trees can handle high-dimensional data
(3) The acquired knowledge is intuitive and easy to understand in the form of tree representation.
(4) The inductive learning and classification steps of decision trees are simple and fast
(5) Generally speaking, decision tree classifiers have good accuracy
(6) Decision trees are the basis of many business induction systems
Disadvantages: Successful use may depend on the data at hand
attribute selection measure
concept
A heuristic for selecting the splitting criterion that "best" separates a given data partition D of class-labeled training tuples into individual classes.
If D is divided into smaller partitions based on the output of the splitting criterion, ideally each partition should be pure (i.e. all tuples falling into a given partition belong to the same class)
Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are split. The measure provides a ranking of the attributes describing the given training tuples, and the attribute with the best score for the measure is chosen as the splitting attribute for those tuples.
If the splitting attribute is continuous-valued, or if we are restricted to binary trees, then a split point or a splitting subset must also be determined as part of the splitting criterion. The tree node created for partition D is labeled with the splitting criterion, branches are grown for each outcome of the criterion, and the tuples are partitioned accordingly.
method
information gain
Gain ratio
Gini index
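A small sketch of two of these measures for a binary split: information gain (entropy reduction) and Gini index reduction. The class counts (a parent partition of 9 positive and 5 negative tuples split into two branches) are hypothetical numbers chosen only to make the arithmetic concrete.

```python
# Sketch: information gain and Gini reduction for a candidate binary split.
import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

def gini(counts):
    p = np.asarray(counts, dtype=float) / sum(counts)
    return 1.0 - (p ** 2).sum()

parent = [9, 5]                 # class counts before the split
branches = [[6, 1], [3, 4]]     # class counts in each branch after the split
n = sum(parent)

info_gain = entropy(parent) - sum(sum(b) / n * entropy(b) for b in branches)
gini_drop = gini(parent) - sum(sum(b) / n * gini(b) for b in branches)

print(f"information gain = {info_gain:.3f}")
print(f"Gini reduction   = {gini_drop:.3f}")
```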
pruning
concept
When a decision tree is created, due to noise and outliers in the data, many branches reflect anomalies in the training data. The pruning method deals with this overfitting problem.
Prepruning
The tree is "pruned" by halting its construction early (e.g., by deciding not to further split or partition the subset of training tuples at a given node). Upon halting, the node becomes a leaf, which may hold the most frequent class among the subset tuples or the probability distribution of those tuples.
Postpruning
Removes subtrees from a fully grown tree. A subtree at a given node is pruned by removing its branches and replacing it with a leaf, labeled with the most frequent class among the subtree being replaced (a pruning sketch follows).
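A hedged sketch of both pruning styles using scikit-learn decision trees: prepruning via early-stopping parameters, and postpruning via cost-complexity pruning. The data set and the parameter values (max_depth, min_samples_split, ccp_alpha) are arbitrary choices for illustration.

```python
# Sketch of prepruning (early stopping) vs. postpruning (cost-complexity).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Prepruning: stop growing early via depth / minimum-split constraints.
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=20,
                             random_state=0).fit(X_tr, y_tr)

# Postpruning: grow fully, then prune subtrees with cost-complexity pruning.
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
for name, tree in [("full", full), ("prepruned", pre), ("postpruned", post)]:
    print(name, tree.get_n_leaves(), "leaves,",
          f"test accuracy {tree.score(X_te, y_te):.3f}")
```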
Ensemble classification
concept
An ensemble combines k learned models (base classifiers) M1, M2, ..., Mk to create an improved composite classification model M*. A given data set is used to create k training sets D1, D2, ..., Dk, where Di (1 <= i <= k) is used to build classifier Mi. Given a new tuple to classify, each base classifier votes by returning a class prediction.
type
bagging
Given a set D of d tuples, bagging works as follows: for iteration i (i = 1, 2, ..., k), a training set Di of d tuples is sampled with replacement from the original set of tuples D. The term bagging stands for bootstrap aggregation; each training set is a bootstrap sample.
Because sampling with replacement is used, some tuples of D may not appear in Di, while others may occur more than once. A classification model Mi is learned from each training set Di. To classify an unknown tuple X, each classifier Mi returns its class prediction, which counts as one vote; the bagged classifier M* counts the votes and assigns the class with the most votes to X.
Boosting
A weight is assigned to each training tuple, and k classifiers are learned iteratively. After classifier Mi is learned, the weights are updated so that the subsequent classifier, Mi+1, pays more attention to the training tuples misclassified by Mi. The final boosted classifier M* combines the votes of the individual classifiers, where the weight of each classifier's vote is a function of its accuracy.
random forest
Imagine that each classifier in the ensemble is a decision tree, so the collection of classifiers is a "forest". Each individual tree determines its splits using a random selection of attributes at each node. More formally, each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. During classification, each tree votes and the most popular class is returned (an ensemble sketch follows).
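A hedged sketch of the three ensemble schemes above using scikit-learn, with decision trees as the base learners; the data set, the number of estimators, and the cross-validation setup are arbitrary choices for illustration.

```python
# Sketch of bagging, boosting, and random forests with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

ensembles = {
    # Bagging: bootstrap samples of the training set, one tree per sample.
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
    # Boosting: reweight tuples so later learners focus on earlier mistakes.
    "boosting": AdaBoostClassifier(n_estimators=50, random_state=0),
    # Random forest: trees split on randomly selected attribute subsets.
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

for name, model in ensembles.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:13s} mean accuracy {scores.mean():.3f}")
```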
Bagging versus boosting
Because boosting focuses on misclassified tuples, there is a danger that the resulting composite model will overfit the data, and therefore the "boosted" resulting model may sometimes be less accurate than a single model derived from the same data.
Bagging is less susceptible to overfitting; while both can significantly improve accuracy over a single model, boosting tends to achieve greater accuracy.
class imbalance problem
concept
Given two classes of data, if the main class of interest (positive class) is represented by only a small number of tuples, while the vast majority of tuples represent the negative class, the data is class-imbalanced.
For multi-class imbalanced data, the data distribution of each class is significantly different, where tuples of the main class or the class of interest are rare. The class imbalance problem is closely related to cost-sensitive learning, where the cost of errors is not equal for each class.
Strategy
Oversampling and undersampling
Assume that the original training set contains 100 positive tuples and 1000 negative tuples. In oversampling, rare tuples are copied to form a new training set containing 1000 positive tuples and 1000 negative tuples. In undersampling, negative tuples are randomly deleted, forming a new training set containing 100 positive tuples and 100 negative tuples.
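A small sketch of random oversampling and undersampling with pandas, mirroring the 100 positive / 1,000 negative example above; the data frame contains synthetic labels only.

```python
# Sketch of oversampling and undersampling a class-imbalanced training set.
import pandas as pd

df = pd.DataFrame({"label": ["pos"] * 100 + ["neg"] * 1000})
pos, neg = df[df["label"] == "pos"], df[df["label"] == "neg"]

# Oversampling: replicate rare positive tuples up to the negative count.
oversampled = pd.concat([pos.sample(len(neg), replace=True, random_state=0), neg])

# Undersampling: randomly discard negative tuples down to the positive count.
undersampled = pd.concat([pos, neg.sample(len(pos), random_state=0)])

print(oversampled["label"].value_counts())
print(undersampled["label"].value_counts())
```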
threshold shift
Does not involve sampling; it applies to classifiers that return a continuous output value for a given input tuple (as in ROC analysis). For an input tuple X, such a classifier returns a mapped output f(X) in [0, 1]. Rather than manipulating the training tuples, this method returns the classification decision based on the output values and a decision threshold.
Threshold shifting and combining methods outperform oversampling and undersampling, and threshold shifting is effective even on very imbalanced data sets.
ROC
The receiver operating characteristic curve (ROC) is a useful visual tool for comparing two classification models. The ROC curve gives the trade-off between the true positive ratio (TPR) and the false positive ratio (FPR) of a given model.
The ROC curve allows us to observe the trade-off between the proportion of positive instances correctly identified by the model and the proportion of negative instances identified as positive instances for different parts of the test set. The increase in TPR comes at the expense of an increase in FPR.
The area under the ROC curve is a measure of model accuracy
The ROC curve uses the class-prediction probability of each test tuple to rank the test tuples, so that the tuples most likely to belong to the positive ("yes") class appear at the top of the list and the tuples least likely to belong to the positive class appear at the bottom (an ROC and threshold-moving sketch follows).
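A hedged sketch of ROC analysis and threshold moving with scikit-learn: test tuples are scored by predicted positive-class probability, the TPR/FPR trade-off and AUC are computed, and a lower decision threshold is tried for the rare positive class. The synthetic imbalanced data set, the logistic regression model, and the threshold values are assumptions for illustration.

```python
# Sketch of ROC/AUC evaluation and threshold moving on imbalanced data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]          # continuous output f(X) in [0, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)  # the TPR vs. FPR trade-off
print("area under ROC curve:", round(roc_auc_score(y_te, scores), 3))
print("number of (FPR, TPR) operating points:", len(thresholds))

# Threshold moving: classify as positive above a threshold lower than 0.5,
# trading a higher FPR for a higher TPR on the rare positive class.
for threshold in (0.5, 0.3):
    pred = (scores >= threshold).astype(int)
    tp = ((pred == 1) & (y_te == 1)).sum()
    fp = ((pred == 1) & (y_te == 0)).sum()
    print(f"threshold {threshold}: true positives = {tp}, false positives = {fp}")
```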
Comparison of OLTP (operational database systems) and OLAP (data warehouse systems)
Users and system orientation: OLTP is customer-oriented; OLAP is market-oriented
Data contents: OLTP manages current data; OLAP manages historical data, provides summarization and aggregation mechanisms, and stores and manages information at different levels of granularity
Database design: OLTP adopts an entity-relationship (ER) data model and an application-oriented design; OLAP adopts a star or snowflake schema and a subject-oriented design
View: OLTP focuses on the current data of an enterprise or department; OLAP spans multiple versions of database schemas, handles information from different organizations, and integrates information from many databases
Access patterns: OLTP access consists mainly of short atomic transactions, requiring concurrency control and recovery mechanisms; OLAP access consists mostly of read-only operations, often complex queries