MindMap Gallery Big Data Mind Map
This mind map gives a systematic introduction to big data: its concepts, characteristics, supporting technologies, and applications. I hope it helps interested readers learn.
Edited at 2023-12-03 18:04:17
Big Data
Big data overview
Big Data Era
Internet and big data
Emergence of the Internet
The Internet gave birth to big data
Information Technology and Big Data
Information collection technology
Information storage technology
Information processing technology
Information transmission technology
Cloud computing and big data
Internet of Things and Big Data
Big data concept
Big data in a narrow sense
A collection of large amounts of data that is difficult to manage with existing general techniques
Big data in a broad sense
Big data refers to the huge amounts of data or massive data generated in the era of information explosion, and the resulting series of changes in technology and cognitive concepts.
Characteristics of big data
Large data volume (Volume)
Large storage capacity
Rapid growth
Diverse data types (Variety)
Many sources
Many formats
High data velocity (Velocity)
Data is generated and must be processed at high speed
Low value density (Value)
The significance of developing big data
Thinking changes in the era of big data
Big data and its nature
Data, as a way of expressing information, is the result of the joint action of matter and consciousness.
Data has objective reality
Three major changes in the big data era
Full data model in the era of big data
Accept the messiness of data
Allow imprecision
The more complex the data, the better
Messiness is the standard approach
New database designs
Highlight data correlation rather than causation
Correlation is the key to prediction
Explore the "what" rather than the "why"
No longer understanding the world only through cause and effect
Big data becomes a competitive advantage
Increased demand for big data applications
The rise of big data applications
Real-time response is a new requirement for big data users
Enterprises build big data strategies
Big data collection and storage
Classification of big data
Structured data
Traditional relational database data, row data
Semi-structured data
It has more structure than plain text, but is more flexible than data in relational databases with their strict theoretical models.
Features
The structure is self-describing
No need to distinguish between metadata and ordinary data
The complexity of data structure descriptions
Dynamic nature of data structure description
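A minimal Python sketch of what "self-describing" means in practice: each JSON record carries its own field names, so no separate schema is needed and the structure may vary from record to record. The records are invented for illustration.

```python
import json

# Two records with different fields: the structure travels with the data,
# so no separate metadata is needed to interpret each record.
records = [
    '{"id": 1, "name": "Alice", "email": "alice@example.com"}',
    '{"id": 2, "name": "Bob", "tags": ["vip", "beta"], "address": {"city": "Berlin"}}',
]

for raw in records:
    rec = json.loads(raw)      # each record describes its own fields
    print(sorted(rec.keys()))  # structure can differ from record to record
```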
Unstructured data
Data that cannot be expressed in a database's two-dimensional tables, such as images, files, and hypermedia
Data processing mainly includes
Extraction of information content from web pages
Structural processing (including text lexical segmentation, part-of-speech analysis, and ambiguity processing)
Semantic processing (entity extraction, vocabulary correlation analysis, sentence correlation analysis)
Text modeling (including the vector space model and topic models)
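As a hedged illustration of the last step, here is a small scikit-learn sketch that builds a vector space model (TF-IDF) over a few invented documents; topic models and the earlier lexical and semantic steps would use other tools.

```python
# A minimal vector space model sketch using scikit-learn's TfidfVectorizer;
# the example documents are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "big data needs distributed storage",
    "distributed storage and distributed processing",
    "data value density is low",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)    # one TF-IDF vector per document
print(vectorizer.get_feature_names_out())  # the vocabulary (vector dimensions)
print(matrix.toarray().round(2))           # documents as weighted term vectors
```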
Data sources in the big data environment
Traditional business data
Mainly structured data
Enterprise ERP systems, POS terminals, and online payment systems
Internet data
The vast amounts of data generated by interactions in cyberspace, including social media and communication records
Characterized by large scale, diversity, and speed
IoT data
On the basis of the computer Internet, radio frequency identification (RFID), sensors, infrared sensors, wireless data communications and other technologies are used to construct the internet of things that covers everything in the world.
There are larger amounts of data, higher transmission rates, more diverse data, and higher requirements for data authenticity.
Commonly used data collection methods
System log
Scribe
Facebook’s open source log collection system
Components
Scribe Agent
Scribe
DB storage system
Chukwa
Components
Adaptor
Agent
Collector
Demux (demultiplexer)
Storage System
Data Display
Flume
It is a distributed, reliable and highly available massive log collection, aggregation and transmission system provided by Cloudera.
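Scribe, Chukwa, and Flume are configured systems rather than hand-written code, so the following is only a conceptual Python sketch of the source, channel, and sink pattern they share; the log lines are invented.

```python
import queue

# Conceptual sketch only: real log collectors are configured, not hand-coded.
# It shows the shared pattern: source -> channel (buffer) -> sink (storage).
channel = queue.Queue()  # buffers events between source and sink

def source(lines):
    """Source: ingests raw log lines as events."""
    for line in lines:
        channel.put({"body": line.strip()})

def sink():
    """Sink: drains buffered events toward storage (here, stdout)."""
    while not channel.empty():
        event = channel.get()
        print("storing:", event["body"])

source(["GET /index.html 200", "POST /login 302", "GET /missing 404"])
sink()
```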
Web page data
Components
Collection module
Data processing module
Data storage module
Crawling strategies
Depth-first traversal
Breadth-first traversal
Backlink count strategy
OPIC strategy
Large-site-priority strategy
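A toy Python sketch of the first two strategies above: with a FIFO frontier the crawl is breadth-first; swapping in a LIFO stack would make it depth-first. The link graph is invented and stands in for fetched pages.

```python
from collections import deque

# Toy link graph standing in for the web; a real crawler would fetch and parse pages.
links = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": [],
}

def crawl_bfs(seed):
    """Breadth-first traversal: a FIFO frontier visits pages level by level."""
    frontier, seen, order = deque([seed]), {seed}, []
    while frontier:
        url = frontier.popleft()  # popleft() = FIFO = breadth-first;
        order.append(url)         # a LIFO pop() would make this depth-first
        for nxt in links[url]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order

print(crawl_bfs("A"))  # ['A', 'B', 'C', 'D', 'E']
```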
Other data
Storage management system in the era of big data
File system
A file system is the part of an operating system that organizes and manages files and directories on a computer
Traditional file systems such as NTFS and EXT4 can be used for small-scale data storage, but they may face performance bottlenecks in big data processing.
Distributed file system
It distributes data across multiple storage nodes and connects these nodes through the network to achieve high reliability, high throughput and scalability.
Some common distributed file systems include Hadoop HDFS, Google's GFS (Google File System) and Ceph, etc.
Database systems
A database system is a software system used to store, manage, and retrieve structured data
Such as Apache HBase, Apache Cassandra and MongoDB, etc.
These database systems typically adopt distributed architectures and are highly scalable and fault-tolerant.
Cloud storage
Cloud storage is a solution for storing data in a cloud computing environment. It provides reliable, scalable storage services that allow users to access and manage their data over the Internet
Structural model
Storage layer
Basic management layer
Application interface layer
Access layer
Data visualization
Data visualization overview
What is data visualization
The development history of visualization
Data visualization classification
Scientific visualization
Information visualization
Visual analytics
Data visualization charts
Scatter plot
Bubble chart
Line chart
Bar chart
Heat map
Radar chart
Other
Funnel chart
Tree diagram
Relation chart
Word cloud
Sankey diagram
Calendar chart
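For illustration, a few of the basic chart types above sketched with matplotlib; the data is invented.

```python
# Scatter, line, and bar charts with matplotlib (invented data).
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))
ax1.scatter(x, y)                    # scatter plot: relationship between two variables
ax1.set_title("scatter")
ax2.plot(x, y)                       # line chart: trend over an ordered axis
ax2.set_title("line")
ax3.bar(["a", "b", "c"], [3, 7, 5])  # bar chart: comparison across categories
ax3.set_title("bar")
plt.tight_layout()
plt.show()
```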
Data visualization tools
Beginner level
Excel
Infographic tools
Canva
Visme
Google Charts
Piktochart
Venngage
D3
ECharts
Big Data Magic Mirror
Map tools
My Maps
BatchGeo
Fusion Tables
Mapshaper
CartoDB
Mapbox
Map Stack
Modest Maps
Timeline tools
Timetoast
XTimeline
Timeline Maker
Advanced analysis tools
R
Python
Weka
Gephi
Real-time visualization
Technologies supporting big data
Commercial support for open source technologies
Big data technical architecture
Base layer
Management layer
Analysis layer
Application layer
Big data processing platform
Hadoop
Characteristics
High reliability
Use redundant data storage
Efficiency
Adopting two core technologies of distributed storage and distributed processing to efficiently process PB-level data
High scalability
High fault tolerance
Low cost
Runs on the Linux platform
Developed in Java
Support multiple programming languages
Core components
Hadoop Distributed File System (HDFS)
HDFS is the storage layer of Hadoop used to store data in large-scale distributed environments
Advantages and Disadvantages of HDFS Data Storage
Advantages
High reliability: HDFS provides high reliability through data redundancy and fault tolerance mechanisms. It splits the file data into multiple data blocks and replicates multiple copies on multiple nodes in the cluster. If a node fails, lost copies of data can be recovered from other nodes.
High scalability: HDFS can store and process massive amounts of data on large-scale clusters. It supports horizontal expansion by adding additional nodes to increase storage capacity and processing power to meet growing data needs.
Adaptable to large file storage: HDFS is suitable for storing large files because it divides the file into fixed-size data blocks for storage. This approach helps improve data processing efficiency and reduces metadata overhead.
High throughput: HDFS optimizes the way of data access and transmission, and achieves high throughput data access through parallel reading and writing and data locality principles. This makes HDFS excellent in big data processing and analysis tasks.
Disadvantages
Low-latency access: HDFS is designed to handle batch processing tasks of large data sets, not real-time data access. Therefore, for application scenarios that require low-latency response (such as online transaction processing), the access latency of HDFS is relatively high.
Not suitable for small file storage: Since HDFS divides files into data blocks for storage, for a large number of small files, it will increase storage overhead and complicate metadata management. Therefore, HDFS is not suitable for storing large amounts of small files.
Consistency and real-time performance: HDFS follows a write-once-read-many model, so a file has a single writer and cannot be freely modified by concurrent writers. HDFS is also not suited to application scenarios that require real-time data access and updates.
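Some back-of-the-envelope arithmetic behind these points, assuming the common HDFS defaults of a 128 MB block size and 3x replication (both are configurable):

```python
import math

# Block and replication arithmetic under assumed HDFS defaults.
BLOCK_MB, REPLICATION = 128, 3

def hdfs_footprint(file_mb):
    blocks = math.ceil(file_mb / BLOCK_MB)  # file is split into fixed-size blocks
    return blocks, blocks * REPLICATION, file_mb * REPLICATION

# A 1 TB file: few blocks relative to its size, so metadata overhead stays low.
print(hdfs_footprint(1024 * 1024))  # (8192 blocks, 24576 replicas, 3145728 MB raw)

# A million 1 MB files: one block each, so NameNode metadata grows with file
# count -- this is why HDFS handles large numbers of small files poorly.
print(hdfs_footprint(1), "per file, times 1,000,000 files")
```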
Hadoop YARN (Yet Another Resource Negotiator) is Hadoop's resource management and job scheduling framework
MapReduce is the computing model of Hadoop, used to process parallel computing of large-scale data sets.
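A single-process Python simulation of MapReduce's map, shuffle, and reduce phases on the classic word-count example; a real Hadoop job distributes these same steps across the cluster.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    for word in line.split():
        yield (word, 1)                    # map: emit (key, value) pairs

def word_count(lines):
    pairs = [kv for line in lines for kv in mapper(line)]
    pairs.sort(key=itemgetter(0))          # shuffle: group pairs by key
    return {key: sum(v for _, v in group)  # reduce: aggregate each key's values
            for key, group in groupby(pairs, key=itemgetter(0))}

print(word_count(["big data big value", "data velocity"]))
# {'big': 2, 'data': 2, 'value': 1, 'velocity': 1}
```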
ecosystem
Hive is a data warehouse infrastructure that provides a SQL-like query language (HiveQL) to process and analyze data stored in Hadoop
Pig is a platform for large-scale data processing and provides a scripting language called Pig Latin. Pig Latin language is a data flow language that can be used to describe data transformation operations and analysis processes.
HBase is a distributed column-oriented NoSQL database built on Hadoop's HDFS. It provides real-time read and write access to large-scale structured data with high scalability and reliability
Spark is a fast, general-purpose big data processing engine that can perform parallel computing of large-scale data sets in memory.
Sqoop is a tool for data transfer between Hadoop and relational databases. It can import data from relational databases into Hadoop for analysis, and export result data from Hadoop to relational databases.
Flume is a reliable, scalable big data collection system for collecting, aggregating and moving data from various data sources (such as log files, sensor data, etc.) into Hadoop.
Application scenarios
System log analysis
User habit analysis
Storm
Characteristics
Integrity
The acker mechanism ensures data is not lost
Fault tolerance
Ease of use
Free and open source
Support multiple languages
Core components
Topology
Spout (data source)
Bolt (processing component)
Stream
Nimbus (master node)
Supervisor (worker node)
ZooKeeper (distributed coordination service)
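Storm topologies are usually written in Java, so the following is only a conceptual Python sketch of how tuples flow from a spout through bolts; the sensor readings and threshold are invented.

```python
# Not the Storm API: a conceptual sketch of the spout -> bolt tuple flow.
def sensor_spout():
    """Spout: emits a stream of tuples (finite here, unbounded in Storm)."""
    for reading in [("room1", 21.5), ("room2", 38.0), ("room1", 22.1)]:
        yield reading

def filter_bolt(stream, threshold=30.0):
    """Bolt: processes each tuple and emits a new stream."""
    for room, temp in stream:
        if temp > threshold:
            yield (room, temp)

def alert_bolt(stream):
    """Terminal bolt: performs side effects instead of emitting."""
    for room, temp in stream:
        print(f"ALERT {room}: {temp} degrees")

alert_bolt(filter_bolt(sensor_spout()))  # topology: spout -> filter -> alert
```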
Application scenarios
Stream processing
Continuous computation
Distributed remote procedure call (RPC)
Spark
Characteristics
Speed
Elastic scalability
Various computing models
Multi-language support
Comprehensive
Architecture
The Driver is the main component of a Spark application. It runs in its own process and is responsible for controlling and coordinating the entire application.
Cluster Manager is responsible for managing the resource scheduling and task allocation of Spark applications on the cluster.
Executor is a process running on the worker nodes in the cluster and is responsible for performing specific tasks.
RDD is the core data abstraction of Spark. It is an immutable data collection that can be partitioned and operated in parallel.
The DAG scheduler is responsible for converting operations in Spark applications into directed acyclic graphs (DAGs), optimizing and scheduling them.
Task Scheduler is responsible for allocating tasks in the Stage to available Executors for execution.
In a Spark application, when data reshuffling (Shuffle) operation is required, the data will be network transferred and redistributed between different Executors.
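A minimal PySpark sketch of this flow (assumes a local Spark installation): the driver creates a partitioned RDD, the map transformation is lazily recorded into the DAG, and the reduce action triggers scheduling onto executors.

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-sketch")  # driver connects to a local "cluster"

rdd = sc.parallelize(range(1, 1001), numSlices=4)  # partitioned, parallel collection
squares = rdd.map(lambda x: x * x)                 # transformation: lazily recorded
total = squares.reduce(lambda a, b: a + b)         # action: DAG is run on executors

print(total)  # 333833500
sc.stop()
```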
Extensions
Spark SQL is Spark's structured query module, providing a high-level API and query language for processing structured data.
Spark Streaming is Spark’s stream processing module for real-time processing and analysis of data streams.
MLlib is Spark's machine learning library, which provides a series of machine learning algorithms and tools for data mining, predictive analysis and model training.
GraphX is Spark’s graph computing library for processing large-scale graph data and graph algorithms.
SparkR is the R language interface of Spark, allowing R language users to use Spark for large-scale data processing and analysis.
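A small Spark SQL sketch showing the same query through both the SQL interface and the DataFrame API; the table name and data are invented.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)
df.createOrReplaceTempView("people")  # expose the DataFrame to SQL queries

spark.sql("SELECT name FROM people WHERE age > 30").show()
df.filter(df.age > 30).select("name").show()  # same query via the DataFrame API

spark.stop()
```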
Application scenarios
Applications that operate on the same data set repeatedly
Applications with coarse-grained state updates
Applications where the data volume is not especially large but real-time statistical analysis is required
Comparison of the three
Data processing models
Hadoop is suitable for offline large-scale data processing tasks, mainly used for batch data storage and analysis.
Spark supports multiple data processing models such as batch processing, real-time stream processing, interactive query, and machine learning.
Storm is a real-time stream processing framework for processing continuous data streams
Execution speed
Since Hadoop uses disk storage and MapReduce batch processing model, it is suitable for processing large-scale offline data, but it may be slower for scenarios with high real-time requirements.
Spark uses in-memory computing and RDD-based data abstraction, which can keep data in memory for iterative calculations and is suitable for data processing tasks that require higher performance and interactivity.
Storm focuses on real-time stream processing, has the characteristics of low latency and high throughput, and is suitable for rapid processing and analysis of real-time data.
Data processing capabilities
Hadoop provides a reliable distributed file system (HDFS) and a scalable MapReduce computing model, which is suitable for the storage and batch processing of massive data. It has good fault tolerance and data reliability
Spark provides richer data processing capabilities and supports multiple models such as batch processing, real-time stream processing, interactive query, and machine learning. It also provides high-level APIs and libraries (such as Spark SQL, Spark Streaming, and MLlib) to simplify the development of big data processing and analysis
Storm focuses on real-time stream processing and provides reliable message passing and stream topology processing models. It can process large-scale data streams in real time and supports complex stream processing logic
Ecosystem and support
Hadoop has an extensive ecosystem and a large number of tools and components, such as Hive, Pig, and HBase, for higher-level data processing and analysis. It has a mature community and extensive support
Spark also has an active open source community and a rich ecosystem that supports a variety of data processing and machine learning tasks. It is tightly integrated with the Hadoop ecosystem and can work seamlessly with HDFS, Hive and other Hadoop components
Storm's ecosystem is relatively small and mainly focuses on the field of real-time stream processing. It provides some plugins to integrate with other tools and systems such as Kafka and Cassandra
Cloud computing
The concept and characteristics of cloud computing
Concept
Cloud computing is a dynamically scalable computing model that delivers virtualized resources to users as services over the network.
Features
Hyperscale
Virtualization
High reliability
Versatility
High scalability
On-demand services
Extremely low cost
Main deployment models of cloud computing
Public cloud
Public cloud is cloud computing infrastructure built and managed by third-party service providers (such as Amazon AWS, Microsoft Azure and Google Cloud Platform)
Private Cloud
A private cloud is a cloud computing infrastructure built and managed by an organization itself to support its internal business needs
Hybrid cloud
Hybrid cloud is a combination of public cloud and private cloud, providing more flexible and diverse solutions by connecting and integrating these two cloud environments. In a hybrid cloud, organizations can deploy workloads and applications into public or private clouds depending on their needs
Main service models of cloud computing
Infrastructure as a Service (IaaS)
IaaS is the most basic service model in cloud computing, which provides virtualized computing resources, storage, network and other infrastructure
Common IaaS providers include Amazon AWS’s EC2, Microsoft Azure’s virtual machine service, and Google Cloud Platform’s Compute Engine.
Platform as a Service (PaaS)
PaaS provides a platform to develop, run and manage applications in a cloud environment
Common PaaS providers include Microsoft Azure’s App Service, Google Cloud Platform’s App Engine and Heroku, etc.
Software as a Service (SaaS)
SaaS is the highest level service model in cloud computing. It provides fully managed applications that users can directly access and use through the Internet.
Common SaaS applications include email services (such as Gmail), online office suites (such as Microsoft 365 and Google Workspace), and customer relationship management (CRM) systems (such as Salesforce)
Main technologies of cloud computing
Virtualization technology
Virtualization technology can realize server virtualization, storage virtualization, network virtualization, etc., enabling cloud computing platforms to achieve elastic allocation and management of resources.
Middleware technology
Middleware technology plays a role in connecting and coordinating different components and services in cloud computing. It provides a set of software tools and services for managing and scheduling the deployment, configuration and execution of applications
Middleware technology also includes load balancing, container technology, message queues and service orchestration, etc., used to provide high availability, scalability and flexibility in cloud computing environments.
Cloud storage technology
Cloud storage technology is a technology used to store and manage large-scale data
The relationship between cloud computing and big data
Cloud computing provides the advantages of powerful computing and storage resources, elasticity and cost-effectiveness, providing ideal infrastructure and tools for big data processing and analysis.
Cloud computing provides efficient, flexible and scalable solutions for the storage, processing and analysis of big data, and promotes the development and application of big data technology.
Applications
Business big data
Precision marketing
Data collection and integration
User portrait construction
Target market segmentation (see the clustering sketch after this list)
Predictive analytics and model building
Personalized marketing campaign execution
Results evaluation and optimization
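A hedged sketch of the segmentation and user-portrait steps using scikit-learn's KMeans; the two features (annual spend, monthly visits) and the data are invented, and real pipelines would use far richer features.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented customer features: [annual spend, visits per month].
customers = np.array([
    [200, 1], [220, 2], [250, 1],      # low spend, infrequent
    [1500, 8], [1700, 9], [1600, 10],  # high spend, frequent
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(model.labels_)           # segment assignment per customer
print(model.cluster_centers_)  # the "portrait" of each segment
```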
Decision support
Concept
Decision support is a method that combines information technology and management science to provide decision-makers with the information, tools and models needed for decision-making.
It helps decision-makers make decisions by analyzing and interpreting data, providing decision-making models and algorithms, and providing visualization and interactive interfaces.
Classification
Structured decision-making
Unstructured decision-making
Semi-structured decision-making
Process steps
Identify problems and formulate decision-making goals
Use probability to quantify the likelihood of each plan's possible outcomes (see the sketch after these steps)
Decision makers quantitatively evaluate various outcomes
Comprehensive analysis of all aspects of information
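A toy example of the quantitative steps above: scoring each plan by the probability-weighted value of its outcomes. All plans, probabilities, and payoffs are invented.

```python
# Expected-value scoring of decision alternatives (invented numbers).
plans = {
    "launch now":   [(0.6, 120), (0.4, -40)],  # (probability, payoff) per outcome
    "launch later": [(0.9, 60),  (0.1, -10)],
}

for plan, outcomes in plans.items():
    expected = sum(p * value for p, value in outcomes)
    print(f"{plan}: expected value = {expected:.1f}")

# launch now: 0.6*120 - 0.4*40 = 56.0 ; launch later: 0.9*60 - 0.1*10 = 53.0
```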
Decision support system functions
Data Management and Integration: Collect, integrate, and manage data relevant to decision-making.
Model and algorithm support: Provides various decision-making models and algorithms for analysis and prediction.
Visualization and interactive interface: Help decision makers understand and operate data through visual display and interactive interface.
Scenario simulation and optimization: Supports the simulation and optimization of different decision-making options and evaluates their potential effects.
Collaboration and sharing: Support the collaboration and information sharing of decision-making teams and promote the collective decision-making process.
Innovation model
Concept
Innovation models refer to methods and strategies used to innovate and change existing business models. It focuses on how to provide new value propositions to the market and gain competitive advantage through the creative use of resources, technology, market insights and business logic.
Defining conditions
Provide brand-new products and services and create new industrial fields
Its business model differs from other companies in at least several elements
Have good performance
Methods
Change revenue model
Subscription model: Offer a product or service as a subscription model and obtain a stable revenue stream through regular fees.
Advertising model: Provide products or services for free or at low prices, and earn profits through advertising revenue.
Freemium model: Provides a free version with basic functions and a paid version with advanced functions to generate revenue from paying users.
Data sales model: The collected data is analyzed and processed, and then sold to other organizations or individuals.
Trading platform model: Establish an online platform to connect buyers and sellers, and earn income through transaction commissions or handling fees.
Change business model
Open innovation model: Collaborate with external partners, communities, and innovation ecosystems to jointly develop and promote new products or services.
Platform model: Build a platform and ecosystem, introduce multiple parties to participate, and promote innovation and value co-creation.
Networked model: Through the Internet and digital technology, collaboration and information sharing within and outside the organization are realized to improve efficiency and flexibility.
Social enterprise model: integrate social and environmental responsibility into the business model and pursue social value and sustainable development.
Two-sided market model: Establish a two-sided market, attract suppliers and consumers at the same time, and achieve value creation by balancing the needs of both parties.
Change the industry model
Platform model: By building a platform and ecosystem, integrating upstream and downstream participants in the industry chain to achieve collaborative innovation and value co-creation.
Sharing economy model: improve resource utilization efficiency and meet user needs by sharing resources and services.
Self-service model: Use automation and digital technology to provide self-service and self-service interaction to reduce costs and improve efficiency.
Ecosystem model: Build an industrial ecosystem, integrate different enterprises and organizations, and achieve resource sharing and collaborative development.
Intelligent model: Apply artificial intelligence, Internet of Things and other technologies to provide intelligent products and services, changing the business logic and operation methods of the industry.
changing technology paradigm
Platform technology model: Build an open technology platform to attract developers and partners to achieve technology sharing and innovation.
Cloud computing model: Provide computing and storage resources as cloud services to meet user needs in an elastic and on-demand manner.
Edge computing model: Push computing and data processing to the edge of the network to improve response speed and data privacy.
Blockchain model: Use blockchain technology to achieve decentralized and credible transaction records and contract execution.
AI-driven model: Apply artificial intelligence technology to products or services to provide intelligent functions and personalized experiences.
Dimensions
Strategic positioning innovation
Focus on the company's position and role in the market
Methods
Target market transfer: Shifting the target market from traditional markets to emerging markets or different market segments.
Differentiated positioning: Standing out from competitors by offering a unique product, service, or experience.
Brand Innovation: Redefining brand image and value proposition to attract new audiences and markets.
Resource capability innovation
Focus on the company’s internal resources and capabilities
Methods
Technological Innovation: Developing and applying new technologies to improve products, services, or business processes.
Talent Development: Develop and attract talent with new skills and knowledge to support innovation and business growth.
Partnership: Collaborate with external partners to share resources and capabilities and achieve complementary advantages.
Business ecological environment innovation
Focus on the relationship and interaction between the enterprise and the external environment
Methods
Open Innovation: Collaborating with external partners, startups, and communities to develop new products or services.
Ecosystem construction: Build a platform and ecosystem to attract multiple participants and achieve value co-creation and sharing.
Social Responsibility: Integrate social and environmental responsibility into the business model and pursue sustainable development and shared value.
Hybrid business model innovation
Involving the combination and integration of different business models
Methods
Platform model: Build a platform and ecosystem, integrate multiple business models, and promote multi-party cooperation and innovation.
Vertical integration: integrating different business activities up and down the value chain to achieve greater control and efficiency.
Diversification expansion: Expanding existing products or services into new markets or industries to achieve growth and diversification.
People's Livelihood Big Data
1. Smart healthcare:
Smart healthcare uses information technology and big data analysis to improve medical services and health management. It can include electronic health records, telemedicine, medical data analytics, and more. The goal of smart healthcare is to improve medical efficiency, provide personalized medical services, and improve medical quality and patient experience.
2. Smart transportation:
Smart transportation uses information and communication technologies to optimize the operation and management of transportation systems. It can include traffic data collection, intelligent traffic signal control, traffic flow prediction, intelligent traffic management system, etc. The goal of smart transportation is to improve traffic efficiency, reduce traffic congestion and accidents, and provide more convenient, safe and environmentally friendly travel methods.
3. Smart tourism:
Smart tourism uses information technology and big data analysis to provide more intelligent and personalized tourism services. It can include travel information platforms, intelligent navigation systems, travel data analysis, etc. The goal of smart tourism is to provide a better tourism experience, improve the utilization efficiency of tourism resources, and promote the sustainable development of the tourism industry.
4. Smart logistics:
Smart logistics uses technologies such as the Internet of Things, big data and artificial intelligence to optimize the management and operation of the logistics supply chain. It can include smart warehousing, smart transportation, smart distribution, etc. The goal of smart logistics is to improve logistics efficiency, reduce costs, improve logistics service quality, and meet rapidly changing market demands.
5. Food safety:
Food safety focuses on food quality and safety issues, involving food production, processing, transportation and sales. Using big data analysis and Internet of Things technology, we can monitor the source, quality and safety of food in real time, improve food traceability, prevent food safety incidents from occurring, and protect the health and rights of consumers.
6. Education big data
Educational big data uses big data analysis technology to study and improve teaching, learning and management in the field of education. By collecting and analyzing students' learning data, teachers' teaching data, etc., we can understand students' learning situations and needs, optimize teaching methods and resource allocation, and provide personalized learning support and guidance.
Industrial big data
Smart equipment
Intelligent equipment refers to integrating sensors, control systems, data analysis and other technologies to enable traditional industrial equipment to have perception, analysis and decision-making capabilities.
Intelligent equipment can monitor equipment status in real time, predict failures, optimize operating parameters, and support automated and intelligent production processes.
Smart factory
Smart factories use advanced information technology and automation technology to realize the intelligence and automation of the production process.
Smart factories achieve optimization, flexibility and sustainable development of the production process by integrating various smart equipment, Internet of Things, big data analysis and other technologies
Intelligent service
Intelligent service refers to providing customers with personalized and intelligent services through the use of advanced technology and data analysis
In the industrial field, smart services can include predictive maintenance, remote monitoring, fault diagnosis, etc.
Government big data
Public opinion analysis
Refers to the process of systematically collecting, analyzing and evaluating social opinions and public sentiments. The government can use public opinion analysis to understand public attitudes and feedback on government policies, events and services.
Refined management and service
It refers to the use of government big data and advanced technology to provide more refined and personalized management and services to cities and society.
Emergency plan disposal
Refers to when emergencies and disasters occur, the government responds and handles quickly and effectively based on pre-established emergency plans.
Security big data
Network information security
Refers to security measures that protect networks and information systems from unauthorized access, destruction, leakage, and tampering. Network information security involves network architecture, data encryption, access control, vulnerability management, threat detection, etc.
Natural disaster warning
It refers to discovering and predicting the occurrence and development trends of natural disasters in advance by collecting, analyzing and interpreting various relevant data, so as to take corresponding prevention and response measures.
The future of big data
The rise of data markets
Infochimps
Factual
Windows Azure Marketplace
Public Data Sets on AWS
Turning raw data into value-added data
Consumer privacy protection