MindMap Gallery Big Data Mind Map
This mind map gives a systematic introduction to big data: its concepts, characteristics, supporting technologies, and applications. I hope it helps interested readers learn.
Edited at 2023-12-03 18:04:17
Big Data
Big data overview
Big Data Era
Internet and big data
Emergence of the Internet
The Internet gave birth to big data
Information Technology and Big Data
Information collection technology
Information storage technology
Information processing technology
Information transmission technology
Cloud computing and big data
Internet of Things and Big Data
Big data concept
Big data in a narrow sense
A collection of large amounts of data that is difficult to manage with existing general techniques
Big data in a broad sense
Big data refers to the huge amounts of data or massive data generated in the era of information explosion, and the resulting series of changes in technology and cognitive concepts.
Characteristics of big data
Large data volume (Volume)
Large storage capacity
Rapid growth
Diverse data types (Variety)
Many sources
Many formats
High data velocity (Velocity)
Data is generated and must be processed at high speed
Low value density (Value)
The significance of developing big data
Thinking changes in the era of big data
Big data and its nature
Data, as a way of expressing information, is the result of the joint action of matter and consciousness.
Data has objective reality
Three major changes in the big data era
Full data model in the era of big data
Accept the messiness of data
Allow imprecision
The more complex the data, the better
Messiness is the standard approach
New database designs
Highlight data correlation rather than causation
Correlation is the key to prediction
Explore the "what" rather than the "why"
No longer understanding the world only through cause and effect
Big data becomes a competitive advantage
Increased demand for big data applications
The rise of big data applications
Real-time response is a new requirement for big data users
Enterprises build big data strategies
Big data collection and storage
Classification of big data
Structured data
Traditional relational database data, row data
Semi-structured data
It has more structure than plain text, but is more flexible than data in relational databases with their strict theoretical models.
Features
The structure is self-describing
No need to distinguish between metadata and ordinary data
The complexity of data structure descriptions
Dynamic nature of data structure description
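A minimal Python sketch of what "self-describing" means in practice: each JSON record carries its own field names, so no separate schema is needed and the structure may vary from record to record. The records are invented for illustration.

```python
import json

# Two records with different fields: the structure travels with the data,
# so no separate metadata is needed to interpret each record.
records = [
    '{"id": 1, "name": "Alice", "email": "alice@example.com"}',
    '{"id": 2, "name": "Bob", "tags": ["vip", "beta"], "address": {"city": "Berlin"}}',
]

for raw in records:
    rec = json.loads(raw)      # each record describes its own fields
    print(sorted(rec.keys()))  # structure can differ from record to record
```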
Unstructured data
Data that cannot be expressed in a database's two-dimensional tables, such as images, files, and hypermedia
Data processing mainly includes
Extraction of information content from web pages
Structural processing (including text lexical segmentation, part-of-speech analysis, and ambiguity processing)
Semantic processing (entity extraction, vocabulary correlation analysis, sentence correlation analysis)
Text modeling (including the vector space model and topic models)
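As a hedged illustration of the last step, here is a small scikit-learn sketch that builds a vector space model (TF-IDF) over a few invented documents; topic models and the earlier lexical and semantic steps would use other tools.

```python
# A minimal vector space model sketch using scikit-learn's TfidfVectorizer;
# the example documents are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "big data needs distributed storage",
    "distributed storage and distributed processing",
    "data value density is low",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)    # one TF-IDF vector per document
print(vectorizer.get_feature_names_out())  # the vocabulary (vector dimensions)
print(matrix.toarray().round(2))           # documents as weighted term vectors
```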
Data sources in the big data environment
Traditional business data
Mainly structured data
Enterprise ERP systems, POS terminals, and online payment systems
Internet data
The vast amounts of data generated by interactions in cyberspace, including social media and communication records
Characterized by large scale, diversity, and speed
IoT data
On the basis of the computer Internet, radio frequency identification (RFID), sensors, infrared sensors, wireless data communications and other technologies are used to construct the internet of things that covers everything in the world.
There are larger amounts of data, higher transmission rates, more diverse data, and higher requirements for data authenticity.
Commonly used data collection methods
System log
Scribe
Facebook’s open source log collection system
Components
Scribe Agent
Scribe
DB storage system
Chukwa
Components
Adaptor
Agent
Collector
Demux (demultiplexer)
Storage System
Data Display
Flume
It is a distributed, reliable and highly available massive log collection, aggregation and transmission system provided by Cloudera.
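Scribe, Chukwa, and Flume are configured systems rather than hand-written code, so the following is only a conceptual Python sketch of the source, channel, and sink pattern they share; the log lines are invented.

```python
import queue

# Conceptual sketch only: real log collectors are configured, not hand-coded.
# It shows the shared pattern: source -> channel (buffer) -> sink (storage).
channel = queue.Queue()  # buffers events between source and sink

def source(lines):
    """Source: ingests raw log lines as events."""
    for line in lines:
        channel.put({"body": line.strip()})

def sink():
    """Sink: drains buffered events toward storage (here, stdout)."""
    while not channel.empty():
        event = channel.get()
        print("storing:", event["body"])

source(["GET /index.html 200", "POST /login 302", "GET /missing 404"])
sink()
```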
Web page data
Components
Collection module
Data processing module
Data storage module
Crawling strategies
Depth-first traversal
Breadth-first traversal
Backlink count strategy
OPIC strategy
Large-site-priority strategy
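A toy Python sketch of the first two strategies above: with a FIFO frontier the crawl is breadth-first; swapping in a LIFO stack would make it depth-first. The link graph is invented and stands in for fetched pages.

```python
from collections import deque

# Toy link graph standing in for the web; a real crawler would fetch and parse pages.
links = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": [],
}

def crawl_bfs(seed):
    """Breadth-first traversal: a FIFO frontier visits pages level by level."""
    frontier, seen, order = deque([seed]), {seed}, []
    while frontier:
        url = frontier.popleft()  # popleft() = FIFO = breadth-first;
        order.append(url)         # a LIFO pop() would make this depth-first
        for nxt in links[url]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order

print(crawl_bfs("A"))  # ['A', 'B', 'C', 'D', 'E']
```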
Other data
Storage management system in the era of big data
File system
A file system is the part of an operating system that organizes and manages files and directories on a computer
Traditional file systems such as NTFS and EXT4 can be used for small-scale data storage, but they may face performance bottlenecks in big data processing.
Distributed file system
It distributes data across multiple storage nodes and connects these nodes through the network to achieve high reliability, high throughput and scalability.
Some common distributed file systems include Hadoop HDFS, Google's GFS (Google File System) and Ceph, etc.
Database systems
A database system is a software system used to store, manage, and retrieve structured data
Such as Apache HBase, Apache Cassandra and MongoDB, etc.
These database systems typically adopt distributed architectures and are highly scalable and fault-tolerant.
Cloud storage
Cloud storage is a solution for storing data in a cloud computing environment. It provides reliable, scalable storage services that allow users to access and manage their data over the Internet
Structural model
Storage layer
Basic management layer
Application interface layer
Access layer
Data visualization
Data visualization overview
What is data visualization
The development history of visualization
Data visualization classification
Scientific visualization
Information visualization
Visual analytics
Data visualization charts
Scatter plot
Bubble chart
Line chart
Bar chart
Heat map
Radar chart
Other
Funnel chart
Tree diagram
Relation chart
Word cloud
Sankey diagram
Calendar chart
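For illustration, a few of the basic chart types above sketched with matplotlib; the data is invented.

```python
# Scatter, line, and bar charts with matplotlib (invented data).
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))
ax1.scatter(x, y)                    # scatter plot: relationship between two variables
ax1.set_title("scatter")
ax2.plot(x, y)                       # line chart: trend over an ordered axis
ax2.set_title("line")
ax3.bar(["a", "b", "c"], [3, 7, 5])  # bar chart: comparison across categories
ax3.set_title("bar")
plt.tight_layout()
plt.show()
```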
Data visualization tools
Beginner level
Excel
Infographic tools
Canva
Visme
Google Charts
Piktochart
Venngage
D3
ECharts
Big Data Magic Mirror
Map tools
My Maps
BatchGeo
Fusion Tables
Mapshaper
CartoDB
Mapbox
Map Stack
Modest Maps
Timeline tools
Timetoast
XTimeline
Timeline Maker
Advanced analysis tools
R
Python
Weka
Gephi
Real-time visualization
Technologies supporting big data
Commercial support for open source technologies
Big data technical architecture
Base layer
Management layer
Analysis layer
Application layer
Big data processing platform
Hadoop
Characteristics
High reliability
Use redundant data storage
Efficiency
Adopting two core technologies of distributed storage and distributed processing to efficiently process PB-level data
High scalability
High fault tolerance
Low cost
Runs on the Linux platform
Developed in Java
Support multiple programming languages
Core components
Hadoop Distributed File System (HDFS)
HDFS is the storage layer of Hadoop used to store data in large-scale distributed environments
Advantages and Disadvantages of HDFS Data Storage
Advantages
High reliability: HDFS provides high reliability through data redundancy and fault tolerance mechanisms. It splits the file data into multiple data blocks and replicates multiple copies on multiple nodes in the cluster. If a node fails, lost copies of data can be recovered from other nodes.
High scalability: HDFS can store and process massive amounts of data on large-scale clusters. It supports horizontal expansion by adding additional nodes to increase storage capacity and processing power to meet growing data needs.
Adaptable to large file storage: HDFS is suitable for storing large files because it divides the file into fixed-size data blocks for storage. This approach helps improve data processing efficiency and reduces metadata overhead.
High throughput: HDFS optimizes the way of data access and transmission, and achieves high throughput data access through parallel reading and writing and data locality principles. This makes HDFS excellent in big data processing and analysis tasks.
Disadvantages
Low-latency access: HDFS is designed to handle batch processing tasks of large data sets, not real-time data access. Therefore, for application scenarios that require low-latency response (such as online transaction processing), the access latency of HDFS is relatively high.
Not suitable for small file storage: Since HDFS divides files into data blocks for storage, for a large number of small files, it will increase storage overhead and complicate metadata management. Therefore, HDFS is not suitable for storing large amounts of small files.
Consistency and real-time performance: HDFS follows a write-once-read-many model, so a file has a single writer and cannot be freely modified by concurrent writers. HDFS is also not suited to application scenarios that require real-time data access and updates.
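Some back-of-the-envelope arithmetic behind these points, assuming the common HDFS defaults of a 128 MB block size and 3x replication (both are configurable):

```python
import math

# Block and replication arithmetic under assumed HDFS defaults.
BLOCK_MB, REPLICATION = 128, 3

def hdfs_footprint(file_mb):
    blocks = math.ceil(file_mb / BLOCK_MB)  # file is split into fixed-size blocks
    return blocks, blocks * REPLICATION, file_mb * REPLICATION

# A 1 TB file: few blocks relative to its size, so metadata overhead stays low.
print(hdfs_footprint(1024 * 1024))  # (8192 blocks, 24576 replicas, 3145728 MB raw)

# A million 1 MB files: one block each, so NameNode metadata grows with file
# count -- this is why HDFS handles large numbers of small files poorly.
print(hdfs_footprint(1), "per file, times 1,000,000 files")
```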
Hadoop YARN (Yet Another Resource Negotiator) is Hadoop's resource management and job scheduling framework
MapReduce is the computing model of Hadoop, used to process parallel computing of large-scale data sets.
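A single-process Python simulation of MapReduce's map, shuffle, and reduce phases on the classic word-count example; a real Hadoop job distributes these same steps across the cluster.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    for word in line.split():
        yield (word, 1)                    # map: emit (key, value) pairs

def word_count(lines):
    pairs = [kv for line in lines for kv in mapper(line)]
    pairs.sort(key=itemgetter(0))          # shuffle: group pairs by key
    return {key: sum(v for _, v in group)  # reduce: aggregate each key's values
            for key, group in groupby(pairs, key=itemgetter(0))}

print(word_count(["big data big value", "data velocity"]))
# {'big': 2, 'data': 2, 'value': 1, 'velocity': 1}
```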
ecosystem
Hive is a data warehouse infrastructure that provides a SQL-like query language (HiveQL) to process and analyze data stored in Hadoop
Pig is a platform for large-scale data processing and provides a scripting language called Pig Latin. Pig Latin language is a data flow language that can be used to describe data transformation operations and analysis processes.
HBase is a distributed column-oriented NoSQL database built on Hadoop's HDFS. It provides real-time read and write access to large-scale structured data with high scalability and reliability
Spark is a fast, general-purpose big data processing engine that can perform parallel computing of large-scale data sets in memory.
Sqoop is a tool for data transfer between Hadoop and relational databases. It can import data from relational databases into Hadoop for analysis, and export result data from Hadoop to relational databases.
Flume is a reliable, scalable big data collection system for collecting, aggregating and moving data from various data sources (such as log files, sensor data, etc.) into Hadoop.
Application scenarios
System log analysis
User habit analysis
Storm
Characteristics
Integrity
The acker mechanism ensures data is not lost
Fault tolerance
Ease of use
Free and open source
Support multiple languages
Core components
Topology
Spout (data source)
Bolt (processing component)
Stream
Nimbus (master node)
Supervisor (worker node)
ZooKeeper (distributed coordination service)
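Storm topologies are usually written in Java, so the following is only a conceptual Python sketch of how tuples flow from a spout through bolts; the sensor readings and threshold are invented.

```python
# Not the Storm API: a conceptual sketch of the spout -> bolt tuple flow.
def sensor_spout():
    """Spout: emits a stream of tuples (finite here, unbounded in Storm)."""
    for reading in [("room1", 21.5), ("room2", 38.0), ("room1", 22.1)]:
        yield reading

def filter_bolt(stream, threshold=30.0):
    """Bolt: processes each tuple and emits a new stream."""
    for room, temp in stream:
        if temp > threshold:
            yield (room, temp)

def alert_bolt(stream):
    """Terminal bolt: performs side effects instead of emitting."""
    for room, temp in stream:
        print(f"ALERT {room}: {temp} degrees")

alert_bolt(filter_bolt(sensor_spout()))  # topology: spout -> filter -> alert
```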
Application scenarios
Stream processing
Continuous computation
Distributed remote procedure call (RPC)
Spark
Characteristics
Speed
Elastic scalability
Various computing models
Multi-language support
Comprehensive
Architecture
The Driver is the main component of a Spark application. It runs in its own process and is responsible for controlling and coordinating the entire application.
Cluster Manager is responsible for managing the resource scheduling and task allocation of Spark applications on the cluster.
Executor is a process running on the worker nodes in the cluster and is responsible for performing specific tasks.
RDD is the core data abstraction of Spark. It is an immutable data collection that can be partitioned and operated in parallel.
The DAG scheduler is responsible for converting operations in Spark applications into directed acyclic graphs (DAGs), optimizing and scheduling them.
Task Scheduler is responsible for allocating tasks in the Stage to available Executors for execution.
In a Spark application, when data reshuffling (Shuffle) operation is required, the data will be network transferred and redistributed between different Executors.
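A minimal PySpark sketch of this flow (assumes a local Spark installation): the driver creates a partitioned RDD, the map transformation is lazily recorded into the DAG, and the reduce action triggers scheduling onto executors.

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-sketch")  # driver connects to a local "cluster"

rdd = sc.parallelize(range(1, 1001), numSlices=4)  # partitioned, parallel collection
squares = rdd.map(lambda x: x * x)                 # transformation: lazily recorded
total = squares.reduce(lambda a, b: a + b)         # action: DAG is run on executors

print(total)  # 333833500
sc.stop()
```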
Extensions
Spark SQL is Spark's structured query module, providing a high-level API and query language for processing structured data.
Spark Streaming is Spark’s stream processing module for real-time processing and analysis of data streams.
MLlib is Spark's machine learning library, which provides a series of machine learning algorithms and tools for data mining, predictive analysis and model training.
GraphX is Spark’s graph computing library for processing large-scale graph data and graph algorithms.
SparkR is the R language interface of Spark, allowing R language users to use Spark for large-scale data processing and analysis.
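A small Spark SQL sketch showing the same query through both the SQL interface and the DataFrame API; the table name and data are invented.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)
df.createOrReplaceTempView("people")  # expose the DataFrame to SQL queries

spark.sql("SELECT name FROM people WHERE age > 30").show()
df.filter(df.age > 30).select("name").show()  # same query via the DataFrame API

spark.stop()
```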
Application scenarios
Applications that operate on the same data set repeatedly
Applications with coarse-grained state updates
Applications where the data volume is not especially large but real-time statistical analysis is required
Comparison of the three
Data processing models
Hadoop is suitable for offline large-scale data processing tasks, mainly used for batch data storage and analysis.
Spark supports multiple data processing models such as batch processing, real-time stream processing, interactive query, and machine learning.
Storm is a real-time stream processing framework for processing continuous data streams
Execution speed
Since Hadoop uses disk storage and MapReduce batch processing model, it is suitable for processing large-scale offline data, but it may be slower for scenarios with high real-time requirements.
Spark uses in-memory computing and RDD-based data abstraction, which can keep data in memory for iterative calculations and is suitable for data processing tasks that require higher performance and interactivity.
Storm focuses on real-time stream processing, has the characteristics of low latency and high throughput, and is suitable for rapid processing and analysis of real-time data.
Data processing capabilities
Hadoop provides a reliable distributed file system (HDFS) and a scalable MapReduce computing model, which is suitable for the storage and batch processing of massive data. It has good fault tolerance and data reliability
Spark provides richer data processing capabilities and supports multiple models such as batch processing, real-time stream processing, interactive query, and machine learning. It also provides high-level APIs and libraries (such as Spark SQL, Spark Streaming, and MLlib) to simplify the development of big data processing and analysis
Storm focuses on real-time stream processing and provides reliable message passing and stream topology processing models. It can process large-scale data streams in real time and supports complex stream processing logic
Ecosystem and support
Hadoop has an extensive ecosystem and a large number of tools and components, such as Hive, Pig, and HBase, for higher-level data processing and analysis. It has a mature community and extensive support
Spark also has an active open source community and a rich ecosystem that supports a variety of data processing and machine learning tasks. It is tightly integrated with the Hadoop ecosystem and can work seamlessly with HDFS, Hive and other Hadoop components
Storm's ecosystem is relatively small and mainly focuses on the field of real-time stream processing. It provides some plugins to integrate with other tools and systems such as Kafka and Cassandra
Cloud computing
The concept and characteristics of cloud computing
Concept
Cloud computing is a dynamically scalable computing model that delivers virtualized resources to users as services over the network.
Features
Hyperscale
Virtualization
High reliability
Versatility
High scalability
On-demand services
Extremely low cost
Main deployment models of cloud computing
Public cloud
Public cloud is cloud computing infrastructure built and managed by third-party service providers (such as Amazon AWS, Microsoft Azure and Google Cloud Platform)
Private Cloud
A private cloud is a cloud computing infrastructure built and managed by an organization itself to support its internal business needs
Hybrid cloud
Hybrid cloud is a combination of public cloud and private cloud, providing more flexible and diverse solutions by connecting and integrating these two cloud environments. In a hybrid cloud, organizations can deploy workloads and applications into public or private clouds depending on their needs
Main service models of cloud computing
Infrastructure as a Service (IaaS)
IaaS is the most basic service model in cloud computing, which provides virtualized computing resources, storage, network and other infrastructure
Common IaaS providers include Amazon AWS’s EC2, Microsoft Azure’s virtual machine service, and Google Cloud Platform’s Compute Engine.
Platform as a Service (PaaS)
PaaS provides a platform to develop, run and manage applications in a cloud environment
Common PaaS providers include Microsoft Azure’s App Service, Google Cloud Platform’s App Engine and Heroku, etc.
Software as a Service (SaaS)
SaaS is the highest level service model in cloud computing. It provides fully managed applications that users can directly access and use through the Internet.
Common SaaS applications include email services (such as Gmail), online office suites (such as Microsoft 365 and Google Workspace), and customer relationship management (CRM) systems (such as Salesforce)
Main technologies of cloud computing
Virtualization technology
Virtualization technology can realize server virtualization, storage virtualization, network virtualization, etc., enabling cloud computing platforms to achieve elastic allocation and management of resources.
Middleware technology
Middleware technology plays a role in connecting and coordinating different components and services in cloud computing. It provides a set of software tools and services for managing and scheduling the deployment, configuration and execution of applications
Middleware technology also includes load balancing, container technology, message queues and service orchestration, etc., used to provide high availability, scalability and flexibility in cloud computing environments.
Cloud storage technology
Cloud storage technology is a technology used to store and manage large-scale data
The relationship between cloud computing and big data
Cloud computing provides the advantages of powerful computing and storage resources, elasticity and cost-effectiveness, providing ideal infrastructure and tools for big data processing and analysis.
Cloud computing provides efficient, flexible and scalable solutions for the storage, processing and analysis of big data, and promotes the development and application of big data technology.
Applications
Business big data
Precision marketing
Data collection and integration
User portrait construction
Target market segmentation (see the clustering sketch after this list)
Predictive analytics and model building
Personalized marketing campaign execution
Results evaluation and optimization
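A hedged sketch of the segmentation and user-portrait steps using scikit-learn's KMeans; the two features (annual spend, monthly visits) and the data are invented, and real pipelines would use far richer features.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented customer features: [annual spend, visits per month].
customers = np.array([
    [200, 1], [220, 2], [250, 1],      # low spend, infrequent
    [1500, 8], [1700, 9], [1600, 10],  # high spend, frequent
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(model.labels_)           # segment assignment per customer
print(model.cluster_centers_)  # the "portrait" of each segment
```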
Decision support
Concept
Decision support is a method that combines information technology and management science to provide decision-makers with the information, tools and models needed for decision-making.
It helps decision-makers make decisions by analyzing and interpreting data, providing decision-making models and algorithms, and providing visualization and interactive interfaces.
Classification
Structured decision-making
Unstructured decision-making
Semi-structured decision-making
Process steps
Identify problems and formulate decision-making goals
Use probability to quantify the likelihood of each plan's possible outcomes (see the sketch after these steps)
Decision makers quantitatively evaluate various outcomes
Comprehensive analysis of all aspects of information
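A toy example of the quantitative steps above: scoring each plan by the probability-weighted value of its outcomes. All plans, probabilities, and payoffs are invented.

```python
# Expected-value scoring of decision alternatives (invented numbers).
plans = {
    "launch now":   [(0.6, 120), (0.4, -40)],  # (probability, payoff) per outcome
    "launch later": [(0.9, 60),  (0.1, -10)],
}

for plan, outcomes in plans.items():
    expected = sum(p * value for p, value in outcomes)
    print(f"{plan}: expected value = {expected:.1f}")

# launch now: 0.6*120 - 0.4*40 = 56.0 ; launch later: 0.9*60 - 0.1*10 = 53.0
```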
Decision support system functions
Data Management and Integration: Collect, integrate, and manage data relevant to decision-making.
Model and algorithm support: Provides various decision-making models and algorithms for analysis and prediction.
Visualization and interactive interface: Help decision makers understand and operate data through visual display and interactive interface.
Scenario simulation and optimization: Supports the simulation and optimization of different decision-making options and evaluates their potential effects.
Collaboration and sharing: Support the collaboration and information sharing of decision-making teams and promote the collective decision-making process.
Innovation model
Concept
Innovation models refer to methods and strategies used to innovate and change existing business models. It focuses on how to provide new value propositions to the market and gain competitive advantage through the creative use of resources, technology, market insights and business logic.
Defining conditions
Provide brand-new products and services and create new industrial fields
Its business model differs from other companies in at least several elements
Have good performance
Methods
Change revenue model
Subscription model: Offer a product or service as a subscription model and obtain a stable revenue stream through regular fees.
Advertising model: Provide products or services for free or at low prices, and earn profits through advertising revenue.
Freemium model: Provides a free version with basic functions and a paid version with advanced functions to generate revenue from paying users.
Data sales model: The collected data is analyzed and processed, and then sold to other organizations or individuals.
Trading platform model: Establish an online platform to connect buyers and sellers, and earn income through transaction commissions or handling fees.
Change business model
Open innovation model: Collaborate with external partners, communities, and innovation ecosystems to jointly develop and promote new products or services.
Platform model: Build a platform and ecosystem, introduce multiple parties to participate, and promote innovation and value co-creation.
Networked model: Through the Internet and digital technology, collaboration and information sharing within and outside the organization are realized to improve efficiency and flexibility.
Social enterprise model: integrate social and environmental responsibility into the business model and pursue social value and sustainable development.
Two-sided market model: Establish a two-sided market, attract suppliers and consumers at the same time, and achieve value creation by balancing the needs of both parties.
Change the industry model
Platform model: By building a platform and ecosystem, integrating upstream and downstream participants in the industry chain to achieve collaborative innovation and value co-creation.
Sharing economy model: improve resource utilization efficiency and meet user needs by sharing resources and services.
Self-service model: Use automation and digital technology to provide self-service and self-service interaction to reduce costs and improve efficiency.
Ecosystem model: Build an industrial ecosystem, integrate different enterprises and organizations, and achieve resource sharing and collaborative development.
Intelligent model: Apply artificial intelligence, Internet of Things and other technologies to provide intelligent products and services, changing the business logic and operation methods of the industry.
changing technology paradigm
Platform technology model: Build an open technology platform to attract developers and partners to achieve technology sharing and innovation.
Cloud computing model: Provide computing and storage resources as cloud services to meet user needs in an elastic and on-demand manner.
Edge computing model: Push computing and data processing to the edge of the network to improve response speed and data privacy.
Blockchain model: Use blockchain technology to achieve decentralized and credible transaction records and contract execution.
AI-driven model: Apply artificial intelligence technology to products or services to provide intelligent functions and personalized experiences.
Dimensions
Strategic positioning innovation
Focus on the company's position and role in the market
Methods
Target market transfer: Shifting the target market from traditional markets to emerging markets or different market segments.
Differentiated positioning: Standing out from competitors by offering a unique product, service, or experience.
Brand Innovation: Redefining brand image and value proposition to attract new audiences and markets.
Resource capability innovation
Focus on the company’s internal resources and capabilities
Methods
Technological Innovation: Developing and applying new technologies to improve products, services, or business processes.
Talent Development: Develop and attract talent with new skills and knowledge to support innovation and business growth.
Partnership: Collaborate with external partners to share resources and capabilities and achieve complementary advantages.
Business ecological environment innovation
Focus on the relationship and interaction between the enterprise and the external environment
Methods
Open Innovation: Collaborating with external partners, startups, and communities to develop new products or services.
Ecosystem construction: Build a platform and ecosystem to attract multiple participants and achieve value co-creation and sharing.
Social Responsibility: Integrate social and environmental responsibility into the business model and pursue sustainable development and shared value.
Hybrid business model innovation
Involving the combination and integration of different business models
Methods
Platform model: Build a platform and ecosystem, integrate multiple business models, and promote multi-party cooperation and innovation.
Vertical integration: integrating different business activities up and down the value chain to achieve greater control and efficiency.
Diversification expansion: Expanding existing products or services into new markets or industries to achieve growth and diversification.
People's Livelihood Big Data
1. Smart healthcare:
Smart healthcare uses information technology and big data analysis to improve medical services and health management. It can include electronic health records, telemedicine, medical data analytics, and more. The goal of smart healthcare is to improve medical efficiency, provide personalized medical services, and improve medical quality and patient experience.
2. Smart transportation:
Smart transportation uses information and communication technologies to optimize the operation and management of transportation systems. It can include traffic data collection, intelligent traffic signal control, traffic flow prediction, intelligent traffic management system, etc. The goal of smart transportation is to improve traffic efficiency, reduce traffic congestion and accidents, and provide more convenient, safe and environmentally friendly travel methods.
3. Smart tourism:
Smart tourism uses information technology and big data analysis to provide more intelligent and personalized tourism services. It can include travel information platforms, intelligent navigation systems, travel data analysis, etc. The goal of smart tourism is to provide a better tourism experience, improve the utilization efficiency of tourism resources, and promote the sustainable development of the tourism industry.
4. Smart logistics:
Smart logistics uses technologies such as the Internet of Things, big data and artificial intelligence to optimize the management and operation of the logistics supply chain. It can include smart warehousing, smart transportation, smart distribution, etc. The goal of smart logistics is to improve logistics efficiency, reduce costs, improve logistics service quality, and meet rapidly changing market demands.
5. Food safety:
Food safety focuses on food quality and safety issues, involving food production, processing, transportation and sales. Using big data analysis and Internet of Things technology, we can monitor the source, quality and safety of food in real time, improve food traceability, prevent food safety incidents from occurring, and protect the health and rights of consumers.
6. Education big data
Educational big data uses big data analysis technology to study and improve teaching, learning and management in the field of education. By collecting and analyzing students' learning data, teachers' teaching data, etc., we can understand students' learning situations and needs, optimize teaching methods and resource allocation, and provide personalized learning support and guidance.
Industrial big data
Smart equipment
Intelligent equipment refers to integrating sensors, control systems, data analysis and other technologies to enable traditional industrial equipment to have perception, analysis and decision-making capabilities.
Intelligent equipment can monitor equipment status in real time, predict failures, optimize operating parameters, and support automated and intelligent production processes.
Smart factory
Smart factories use advanced information technology and automation technology to realize the intelligence and automation of the production process.
Smart factories achieve optimization, flexibility and sustainable development of the production process by integrating various smart equipment, Internet of Things, big data analysis and other technologies
Intelligent service
Intelligent service refers to providing customers with personalized and intelligent services through the use of advanced technology and data analysis
In the industrial field, smart services can include predictive maintenance, remote monitoring, fault diagnosis, etc.
Government big data
Public opinion analysis
Refers to the process of systematically collecting, analyzing and evaluating social opinions and public sentiments. The government can use public opinion analysis to understand public attitudes and feedback on government policies, events and services.
Refined management and service
It refers to the use of government big data and advanced technology to provide more refined and personalized management and services to cities and society.
Emergency plan disposal
Refers to when emergencies and disasters occur, the government responds and handles quickly and effectively based on pre-established emergency plans.
Security big data
Network information security
Refers to security measures that protect networks and information systems from unauthorized access, destruction, leakage, and tampering. Network information security involves network architecture, data encryption, access control, vulnerability management, threat detection, etc.
Natural disaster warning
It refers to discovering and predicting the occurrence and development trends of natural disasters in advance by collecting, analyzing and interpreting various relevant data, so as to take corresponding prevention and response measures.
The future of big data
The rise of data markets
Infochimps
Factual
Windows Azure Marketplace
Public Data Sets on AWS
Turning raw data into value-added data
Consumer privacy protection