DAMA-CDGA Data Governance Engineer - 8. Data Integration and Interoperability
Data integration and interoperability describes the processes involved in moving and integrating data within and between different data stores, applications, and organizations.
8. Data integration and interoperability
Introduction
Definition
Data integration and interoperability
Data integration and interoperability describes the process of moving and integrating data within and between different data stores, applications, and organizations
Data integration
The consolidation of data into consistent forms, physical or virtual
Data interoperability
The ability of multiple systems to communicate with one another
Critical to data warehousing and BI, as well as to master data and reference data management
Because these focus on transforming and integrating data from source systems into data hubs, from hubs to target systems, and ultimately to data consumers
It is the core of the field of big data management.
Big data aims to integrate various types of data
Includes structured data stored in a database
Unstructured text data stored in documents or files
and other types of unstructured data, such as audio, video, streaming data, etc.
These data are integrated so that they can be mined, used to develop predictive models, and deployed in operational intelligence activities
Business drivers
The main purpose of data integration and interoperability is to effectively manage data movement
For enterprises, managing the complexity and associated costs of data integration is a reason to build a data integration architecture
Managing the complexity of data integration
Enterprise-grade data integration design is far more efficient than disparate or point-to-point solutions
Point-to-point solutions between applications can create thousands of interfaces and organizations can quickly become overwhelmed.
Maintenance and management costs
When multiple technologies are used to move data, each technology requires specific development and maintenance costs, resulting in increased support costs.
The application of standard tools can reduce maintenance and labor costs and improve the efficiency of troubleshooting work.
goals and principles
Goals
Deliver data in a timely manner and in the format data consumers need
Consolidate data physically or virtually into data hubs
Reduce the cost and complexity of managing solutions by developing shared models and interfaces
Identify meaningful events and automatically trigger alerts and actions
Support business intelligence, data analysis, master data management and improvement of operational efficiency
Principles
Take an enterprise perspective in design to ensure future extensibility, and implement through iterative and incremental delivery
Balance local data needs with enterprise data needs, including support and maintenance
Ensure the reliability of data integration and interoperability designs and activities
Basic concepts
Extract, Transform, Load (ETL)
Overview
1. ETL target: data goes into a data warehouse, where the business purpose is already clear
2. Data type: structured data goes into the data warehouse
3. Data warehouse: the end goal is BI
At the heart of data integration and interoperability is the basic process of extraction, transformation, and loading (ETL)
Whether physical or virtual, batch or real-time, performing ETL is a necessary step in the flow of data between applications and organizations.
Execution
Can be executed as a regularly scheduled event (batch processing)
Data required for analysis or reporting is usually handled in batch jobs
Can be executed whenever new data arrives or existing data is updated (real-time or event-driven)
Operational data processing is often real-time or near-real-time
extract
Includes selecting the required data and extracting it from its source
The extracted data is then stored in a physical data repository on disk or in memory.
Transform
Makes the selected data compatible with the structure of the target data store
Format changes
Technical format conversion
Such as format conversion from EBCDIC to ASCII
Structure changes
Data structure changes
From denormalized to normalized records
Semantic conversion
Maintain consistent semantic representation when converting data values
e.g., 0, 1, 2, 3 → UNKNOWN, FEMALE, MALE, NOT PROVIDED
Eliminate duplicates
If rules require a unique key or unique records, include a means of scanning the target to detect and remove duplicate rows
Re-ordering
Change the order of data elements or records to fit a defined schema
Transformation can be executed in batch or in real time
The results are either stored physically in a staging area
Or held virtually in memory
Until the data moves on to the loading step
Load
Physically store or present the transformation results in the target system
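As a minimal sketch of the extract-transform-load steps above (Python, standard library only; the table and column names are invented for illustration, and the source and target tables are assumed to already exist), the example extracts rows, applies a format change, a semantic conversion, and de-duplication, and then loads the result:

```python
import sqlite3

# Hypothetical semantic mapping, as in the example above: code -> label
GENDER_CODES = {0: "UNKNOWN", 1: "FEMALE", 2: "MALE", 3: "NOT PROVIDED"}

def etl(source_conn: sqlite3.Connection, target_conn: sqlite3.Connection) -> None:
    # Extract: select only the required columns from the source
    rows = source_conn.execute(
        "SELECT customer_id, gender_code, signup_date FROM src_customers"
    ).fetchall()

    # Transform: semantic conversion, format change, and de-duplication
    seen, transformed = set(), []
    for customer_id, gender_code, signup_date in rows:
        if customer_id in seen:                          # eliminate duplicate keys
            continue
        seen.add(customer_id)
        transformed.append((
            customer_id,
            GENDER_CODES.get(gender_code, "UNKNOWN"),    # semantic conversion
            signup_date.replace("/", "-"),               # format change
        ))

    # Load: physically store the results in the target structure
    target_conn.executemany(
        "INSERT INTO dw_customers (customer_id, gender, signup_date) VALUES (?, ?, ?)",
        transformed,
    )
    target_conn.commit()
```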
Extract, Load, Transform (ELT)
Overview
1. ELT target: data goes into a data lake, often before the business scenario is clear
2. Data type: both structured and unstructured data can go into the data lake
3. Data lake: the end goal is AI
If the target system has stronger transformation capability than the source system or an intermediate application system, the processing order can be switched to ELT: extract, load, transform
ELT allows the data to be loaded into the target system before it is transformed
ELT allows the source data to be instantiated on the target system as raw data, which can be useful to other processes
Loading data into a data lake via ELT is common in big data environments
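A corresponding ELT sketch under the same assumptions (invented names; SQLite stands in for the target platform): the raw data is landed on the target first, and the transformation is pushed down to the target system's own engine:

```python
import sqlite3

def elt(raw_records, target_conn: sqlite3.Connection) -> None:
    # Load: land the source records on the target system as raw data first
    target_conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_customers (customer_id, gender_code, signup_date)"
    )
    target_conn.executemany("INSERT INTO raw_customers VALUES (?, ?, ?)", raw_records)

    # Transform: use the target's own processing power (here, SQL) after loading
    target_conn.execute("""
        CREATE TABLE IF NOT EXISTS dw_customers AS
        SELECT DISTINCT
               customer_id,
               CASE gender_code
                    WHEN 1 THEN 'FEMALE'
                    WHEN 2 THEN 'MALE'
                    WHEN 3 THEN 'NOT PROVIDED'
                    ELSE 'UNKNOWN'
               END AS gender,
               REPLACE(signup_date, '/', '-') AS signup_date
        FROM raw_customers
    """)
    target_conn.commit()
```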
Mapping
A synonym for transformation: both the process of developing a lookup matrix from source structures to target structures, and the result of that process
Defines the sources to be extracted, the rules for identifying the data to extract, the targets to load, the rules for identifying which target rows to update, and the transformation or calculation rules to apply (see the sketch below)
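One hypothetical way to record such a mapping as metadata before any code is written (all field names invented for illustration):

```python
# Hypothetical source-to-target mapping specification: which source fields are
# extracted, which target fields they populate, and which rule is applied.
CUSTOMER_MAPPING = [
    {"source": "src_customers.cust_no",   "target": "dw_customers.customer_id", "rule": "copy"},
    {"source": "src_customers.gender_cd", "target": "dw_customers.gender",      "rule": "lookup GENDER_CODES"},
    {"source": "src_customers.fname || ' ' || src_customers.lname",
     "target": "dw_customers.full_name",  "rule": "concatenate"},
]
```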
Latency
Definition
Refers to the time difference between when data is generated by the source system and when the data is available on the target system.
Different data processing methods will lead to different degrees of data delay
Very high latency: batch processing
Lower latency: event-driven processing
Very low latency: real-time synchronization
Batch processing
Data moves between applications and organizations in batches of files, either based on manual requests from data consumers or automatically triggered on a periodic basis. This type of interaction is called batch processing or ETL
Data moved in batch mode may represent the full set of data at a given point in time, or only the data whose values have changed since the last extraction
The set of changed data is called a delta (increment); the full set of data at a point in time is called a snapshot
With batch data integration solutions there is often a significant delay between a change in the source and the update of the target, resulting in high latency
Batch data integration is used for data conversion, migration, and archiving, as well as for extracting data from and loading it into data warehouses and data marts
Micro-batching
Refers to batch processing scheduled to run more frequently than daily
Timing
Running batch processes at the wrong time can introduce risk
To minimize problems with partial application updates, data movement can be scheduled for the end of logical processing of the business day, or for off-hours such as overnight
Change data capture
A method of reducing bandwidth requirements by filtering the data so that only data changed within a defined time frame is included
Change data capture monitors changes (insertions, changes, deletions) to a data set and then communicates these changes (deltas) to other data sets, applications, and organizations that consume the data
As part of the change data capture process, data can also be tagged with identifiers such as tags or timestamps.
Change data capture can be data-based or log-based
Data-based change data capture
The source system populates specific data elements
For example, timestamps, codes, or flags that serve as change indicators
When data changes, source system processes add entries to a simple list of objects and identifiers, which is then used to control the selection of data for extraction
The source system copies the data that has changed
Log-based change data capture
The data activity logs created by the database management system are copied and processed, looking for specific changes that are then translated and applied to the target database
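A minimal sketch of data-based change data capture (invented names; a timestamp column on the source acts as the change indicator, so only rows changed since the last run are extracted):

```python
import sqlite3

def extract_changes(source_conn: sqlite3.Connection, last_run_ts: str):
    """Data-based CDC: filter on a change-indicator column so only deltas move."""
    return source_conn.execute(
        "SELECT customer_id, gender_code, last_modified "
        "FROM src_customers WHERE last_modified > ?",
        (last_run_ts,),
    ).fetchall()

# The extracted delta can then be tagged with the extraction timestamp before
# being passed on to the consuming data sets, applications, or organizations.
```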
Near real-time and event-driven
Most data integration solutions that do not take a batch approach use a near-real-time or event-driven approach
Data is processed in smaller sets within a specific schedule, or as events occur, such as data updates
Near real-time processing has lower latency compared to batch processing
And because the work is distributed over time, the system load is lower
However, it is generally slower than synchronous data integration
Near real-time data integration solutions are often implemented using an enterprise service bus
Asynchronous
In an asynchronous data flow, the system providing the data does not wait for the receiving system to confirm the update before continuing processing.
Asynchronous means that the sending or receiving system may be offline for a period of time while the other system is functioning normally
Asynchronous data integration does not block the source system application from continuing to execute, nor does it cause the source application to be unavailable if any target application is unavailable
Because data updates to applications in an asynchronous configuration are not immediate, this style of integration is called near-real-time integration
Real-time, synchronous
There are situations where time delays or other differences between source and target data are not allowed
When data from one dataset must be perfectly synchronized with data from another dataset, a real-time synchronization solution must be used
In a synchronous integration solution, execution waits for confirmation from other applications or processes before executing the next activity or transaction.
Because time must be spent waiting for confirmation that the data is synchronized, the solution can process fewer transactions
If any application that needs to update data is in an unavailable state, transactions within the application cannot be completed.
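The difference between the two styles can be sketched with a simple in-process queue (Python standard library; function names invented): the asynchronous sender continues without waiting, while the synchronous sender blocks until the receiver has confirmed every update.

```python
import queue
import threading

updates = queue.Queue()

def receiver():
    while True:
        record = updates.get()      # the receiving system processes when it can
        print("applied", record)
        updates.task_done()         # confirmation that the update was applied

threading.Thread(target=receiver, daemon=True).start()

# Asynchronous: the sender does not wait for the receiver to confirm the update.
def send_async(record):
    updates.put(record)             # returns immediately; the source keeps working

# Synchronous: the sender waits for confirmation, so it completes fewer
# transactions in the same amount of time.
def send_sync(record):
    updates.put(record)
    updates.join()                  # blocks until every queued update is confirmed
```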
Low latency or stream processing
Low-latency data integration solutions are designed to reduce event response time
Use of solid-state disks
Reduce read and write latency
Asynchronous solutions
Typically used in low-latency solutions so that processing does not wait for acknowledgment from subsequent processes before handling the next piece of data
Massively multi-processor or parallel processing
It is also a common configuration for low latency
Replication
To provide better response times to users around the world, some applications have exact copies of their data sets maintained across multiple physical locations.
Replication technology minimizes the impact of analytics and queries on the performance of the main transaction operating environment
The data must therefore be kept synchronized across every physically distributed copy of the data set
Replication solutions
Typically monitor the change log of a dataset rather than the dataset itself
Because they do not compete with applications for access to data sets, they minimize the impact on any operational application.
Only data from the change log is transferred between replicas
Standard replication solutions are near real-time
Replication tools perform best when the source and target datasets are exact copies of each other
If data changes occur at multiple replica sites, then a replication solution is not the best choice
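A minimal sketch of log-based replication (the change-log format is invented): the replica reads the change log rather than the data set itself, so it does not compete with applications for access to the data.

```python
# Hypothetical change-log entries: (operation, key, new_value)
change_log = [
    ("insert", "cust-1", {"name": "Ada"}),
    ("update", "cust-1", {"name": "Ada Lovelace"}),
    ("delete", "cust-1", None),
]

def apply_changes(replica: dict, log_entries) -> dict:
    """Apply only the deltas from the change log to a physically distributed copy."""
    for op, key, value in log_entries:
        if op in ("insert", "update"):
            replica[key] = value
        elif op == "delete":
            replica.pop(key, None)
    return replica

replica_site = apply_changes({}, change_log)   # the copy converges on the source state
```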
Archive
Data that is used infrequently can be moved to an alternative data structure or storage solution that is less expensive for the organization
ETL functionality is used to archive data and potentially transform it into data structures in the archive environment
It is important to monitor archiving technology to ensure that data can still be accessed as technology changes
Enterprise message format/canonical format
e.g., Kafka
A canonical data model is a common model used by an organization or a data exchange group to standardize the format in which data is shared
Data from sending systems is converted for receiving systems via the common, enterprise-standard message format
Using a canonical format reduces the number of data translations required between systems or organizations
Each system only needs to convert data into a central canonical format, rather than into numerous system formats.
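A minimal sketch of the idea (field names invented): each system translates only to or from the one canonical message, so adding a new system adds one converter rather than one converter per existing system.

```python
# Hypothetical canonical customer message used for all exchanges.
def billing_to_canonical(rec: dict) -> dict:
    return {"customer_id": rec["CUST_NO"], "name": rec["CUST_NAME"]}

def crm_to_canonical(rec: dict) -> dict:
    return {"customer_id": rec["id"], "name": f'{rec["first"]} {rec["last"]}'}

def canonical_to_warehouse(msg: dict) -> tuple:
    return (msg["customer_id"], msg["name"])

# Each sender converts once into the canonical format, and each receiver
# converts from it, instead of maintaining pairwise system-to-system translations.
row = canonical_to_warehouse(crm_to_canonical({"id": 7, "first": "Ada", "last": "Lovelace"}))
```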
Interaction models
The interaction model describes the way connections are established between systems to transfer data
Point-to-point
The vast majority of interactions between shared data systems are "point-to-point", where they pass data directly to each other
This model works in the context of a small set of systems
But when many systems require the same data from the same source, it becomes inefficient and increases organizational risk
Impacts on processing
If the source system is operational, the workload of providing the data may impact transaction processing
Interface management
The number of interfaces required by a point-to-point interaction model approaches the square of the number of systems (see the illustration below)
Once these interfaces are established, they need to be maintained and supported
The workload of managing and supporting the interfaces between systems can quickly become greater than the support of the systems themselves
Potential inconsistencies
Design issues arise when multiple systems require different versions or data formats
Using multiple interfaces to obtain data can lead to inconsistent data sent to downstream systems.
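The interface growth can be made concrete: with n systems, a fully meshed point-to-point design needs on the order of n² connections (n(n-1)/2 bidirectional pairs), while a hub-and-spoke design (described next) needs only n interfaces to the hub. A quick illustration:

```python
def point_to_point_interfaces(n: int) -> int:
    # every pair of systems needs its own interface
    return n * (n - 1) // 2

def hub_and_spoke_interfaces(n: int) -> int:
    # each system connects only to the central hub
    return n

for n in (10, 50, 200):
    print(n, point_to_point_interfaces(n), hub_and_spoke_interfaces(n))
# 10 systems: 45 vs 10; 50 systems: 1,225 vs 50; 200 systems: 19,900 vs 200
```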
Hub-and-spoke
Consolidates shared data (physically or virtually) into a central data hub that applications can use
All systems that want to exchange data do so through the central, common data hub rather than directly with each other (point-to-point)
Data warehouses, data marts, operational data stores, and master data management hubs are all examples of data hubs
The data hub provides a consistent view of the data with limited impact on source system performance
Adding a new system to the mix requires building only an interface to the data hub
An Enterprise Service Bus (ESB) is a data integration solution for sharing data between multiple systems in near real-time; its hub is a virtual concept representing the standard, canonical format for sharing data in the organization
Some hub-and-spoke models have unacceptable latency or performance issues
The hub itself creates overhead in a hub-and-spoke architecture
Publish and subscribe
The publish-and-subscribe model involves systems that push out (publish) data and other systems that take in (subscribe to) data
Systems that push data are listed in the catalog of data services, and systems that wish to use the data subscribe to these services
When publishing data, the data will automatically be sent to subscribing users
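A minimal in-process publish-and-subscribe sketch (topic name invented): publishers push data to a topic without knowing who consumes it, and the data is delivered automatically to every subscriber.

```python
from collections import defaultdict

subscribers = defaultdict(list)            # topic -> list of subscriber callbacks

def subscribe(topic: str, callback) -> None:
    subscribers[topic].append(callback)

def publish(topic: str, data) -> None:
    for callback in subscribers[topic]:    # data is pushed to every subscriber
        callback(data)

subscribe("customer.updated", lambda d: print("warehouse received", d))
subscribe("customer.updated", lambda d: print("crm received", d))
publish("customer.updated", {"customer_id": 7, "name": "Ada"})
```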
Data integration and interoperability architecture concepts
Application coupling
Coupling describes the degree to which two systems are intertwined
Tightly coupled
Two tightly coupled systems often have a synchronization interface where one system waits for a response from the other system
Represents operational risk
If one party is unavailable, then both are effectively unavailable, and the business continuity plans for both systems must be consistent
Loose coupling
A preferred interface design
Transfers data between systems without waiting for a response, and the unavailability of one system does not make the other system unavailable
Loose coupling can be implemented using various techniques such as services, APIs, or message queues
Service-oriented architecture using an ESB is an example of a loosely coupled data interaction design pattern
Orchestration and process control
Orchestration
The term used to describe how multiple related processes are organized and executed in a system
All systems that handle messages or data packets must be able to manage the order in which those processes execute, in order to preserve consistency and continuity
Process control
Is the component that ensures accurate and complete delivery, scheduling, extraction and loading of data
Enterprise application integration
In the enterprise application integration (EAI) model, software modules interact only through well-defined interface calls (application programming interfaces, APIs)
A data store is updated only by its own software module; other software cannot access the data in an application directly, only through the defined APIs
EAI is based on object-oriented concepts, which emphasizes the ability to reuse and replace any module without affecting any other module
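A minimal sketch of the EAI principle (module and method names invented): the module's data store is private, and other software reaches it only through the module's defined API.

```python
class CustomerModule:
    """Only this module updates its own data store; other modules use the API."""

    def __init__(self):
        self._store = {}                    # private data store, never accessed directly

    # Well-defined interface calls (the module's API)
    def add_customer(self, customer_id: int, name: str) -> None:
        self._store[customer_id] = {"name": name}

    def get_customer(self, customer_id: int) -> dict:
        return dict(self._store.get(customer_id, {}))

customers = CustomerModule()
customers.add_customer(7, "Ada")            # other modules interact only via the API
print(customers.get_customer(7))
```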
Enterprise service bus (ESB)
It acts as an intermediary between systems, delivering messages between them
Applications can encapsulate sent and received messages or files through the existing capabilities of the ESB
As an example of loose coupling, the ESB acts as a service between two applications
Service-oriented architecture (SOA)
Data can be provided to, and updates pushed between, applications through well-defined service calls
Applications do not have to interact directly with other applications or understand the inner workings of other applications
Supports application independence and the organization's ability to replace systems without making significant changes to the systems they interact with
The goal of SOA is to define well-defined interactions between independent software modules
Each module can be used by other software modules or individual consumers to perform functions (provide functionality)
The key concept of SOA is that an independent service is provided: the service has no prior knowledge of the calling application, and the implementation of the service is a black box for the calling application
SOA can be implemented through various technologies such as Web services, messaging, and APIs.
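A minimal sketch of a service contract (names invented): the consumer depends only on the interface, and the implementation behind it is a black box that can be replaced without changing the caller.

```python
from typing import Protocol

class CustomerService(Protocol):
    """Well-defined service contract; callers know nothing about the implementation."""
    def get_customer(self, customer_id: int) -> dict: ...

class WarehouseBackedService:
    def get_customer(self, customer_id: int) -> dict:
        return {"customer_id": customer_id, "source": "warehouse"}   # black box to callers

def print_customer(service: CustomerService, customer_id: int) -> None:
    # Written against the contract, not the implementation, so the backing
    # system can be swapped without touching this code.
    print(service.get_customer(customer_id))

print_customer(WarehouseBackedService(), 7)
```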
Complex event processing (CEP)
Event processing is a method of tracking and analyzing (processing) the flow of information about an occurring event, and drawing conclusions from it
Complex event processing refers to merging data from multiple sources, identifying meaningful events, setting rules for these events to guide event processing and routing, and then predicting behaviors or activities, and automatically triggering real-time responses based on the predicted results.
Such as sales opportunities, web clicks, orders, and customer calls, etc.
Complex event processing requires an environment that can integrate various types of data
Since predictions often involve large amounts of data of various types, complex event processing is often associated with big data
Complex event processing often requires the use of technologies that support ultra-low latency, such as processing real-time streaming data and in-memory databases
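A minimal complex-event-processing sketch (the rule and event shapes are invented): events from several sources are merged per customer, a rule identifies the meaningful combination, and a response is triggered automatically.

```python
from collections import defaultdict

recent_events = defaultdict(set)            # customer_id -> event types seen so far

def trigger_response(customer_id: int) -> None:
    print(f"alert: follow up with customer {customer_id}")   # automatic response

def handle_event(event: dict) -> None:
    recent_events[event["customer_id"]].add(event["type"])
    # Rule: a web click combined with a customer call signals a sales opportunity
    if {"web_click", "customer_call"} <= recent_events[event["customer_id"]]:
        trigger_response(event["customer_id"])

handle_event({"customer_id": 7, "type": "web_click"})
handle_event({"customer_id": 7, "type": "customer_call"})    # the rule fires here
```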
Data federation and virtualization
When data exists in different data repositories, it can also be aggregated by means other than physical integration
Data federation provides access to a combination of independent data repositories regardless of their respective structures
Data virtualization enables distributed databases as well as multiple heterogeneous data stores to be accessed and viewed as a single database
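A minimal federation sketch (sources and fields invented): two independent stores with different structures are queried in place and presented as one combined view, without physically moving the data.

```python
import sqlite3

# Source 1: a relational store
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
db.execute("INSERT INTO customers VALUES (7, 'Ada')")

# Source 2: a non-relational store with a different structure
crm_records = [{"id": 7, "segment": "enterprise"}]

def federated_customer_view(customer_id: int) -> dict:
    """Combine both repositories at query time; neither data set is copied."""
    row = db.execute(
        "SELECT name FROM customers WHERE customer_id = ?", (customer_id,)
    ).fetchone()
    crm = next((r for r in crm_records if r["id"] == customer_id), {})
    return {"customer_id": customer_id,
            "name": row[0] if row else None,
            "segment": crm.get("segment")}

print(federated_customer_view(7))
```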
Data as a service
Software as a Service (SaaS)
A delivery and licensing model
Licensed applications provide services, but the software and data are located in data centers controlled by the software vendor rather than in the licensing organization's data centers
Different tiers of computing infrastructure are also offered as services (Infrastructure as a Service IaaS, Platform as a Service PaaS, Database as a Service DBaaS)
Data as a Service (DaaS)
Data is licensed from a vendor and provided by the vendor on demand, rather than storing and maintaining data in the licensed organization's data center
Cloud integration
Before cloud computing, integration could be divided into internal integration and inter-enterprise (B2B) integration
Internal integration
Services are provided through an internal middleware platform, often using an enterprise service bus (ESB) to manage data exchange between systems
Inter-enterprise (B2B) integration
Performed through electronic data interchange (EDI) gateways and value-added networks (VANs)
Cloud integration
Typically run as a SaaS application in the vendor's data center rather than within the organization that owns the data being integrated
Data exchange standards
Data exchange standards are formal rules for the structure of data elements
Exchange patterns define the data transformation structures required by any system or organization to exchange data
Data needs to be mapped into the exchange specification
Agreeing on a consistent exchange format or data layout between systems can greatly simplify the process of sharing data in the enterprise, thereby reducing support costs and allowing workers to better understand the data
The National Information Exchange Model (NIEM) is a data exchange standard developed for exchanging documents and transactions between U.S. government agencies.
Activities
Planning and analysis
Define data integration and lifecycle requirements
Defining data integration requirements involves understanding the organization's business goals, as well as the data and recommended technology options needed to achieve those goals
Defining requirements also involves identifying any relevant laws or regulations governing the data involved
The process of defining requirements creates and discovers valuable metadata
The more complete and accurate an organization's metadata is, the greater its ability to manage data integration risks and costs.
Perform data exploration
Data exploration should occur before design
The goal of data exploration is to identify potential data sources for data integration efforts
Data exploration will identify where data may be obtained and where it may be integrated
The process combines technical searches with subject matter expertise using tools that scan metadata and actual content on an organization's data sets
Data exploration also includes high-level assessment work on data quality to determine whether the data is suitable for the goals of the integration plan
Record data lineage
The data exploration process will also reveal information about how data flows through an organization
This information can be used to document high-level data lineage: how the data was acquired or created by the organization, how it moved and changed within the organization, and how it was used by the organization for analysis, decision-making, or event triggering
Well-documented data lineage can include the rules under which data is changed and how often it is changed
The analysis process can also provide opportunities to improve existing data flows
Finding and eliminating these inefficiencies or ineffective configurations can greatly aid project success and improve an organization's overall ability to use its data.
Profile data
Understanding the content and structure of data is essential to successful data integration
Data profiling helps achieve this goal
If the data profiling step is skipped, information that affects the design may not be discovered until testing or production
One of the goals of profiling is to assess the quality of the data
As with data exploration, data profiling involves validating assumptions about the data against the actual data
Collect business rules
Business rules are a key subset of requirements, statements that define or constrain aspects of business processing.
Business rules are designed to maintain the business structure and control or influence the behavior of the business
Design data integration solutions
Design data integration solutions
Data integration solutions should be considered at both the enterprise and individual solution levels
Establishing enterprise standards allows organizations to save time implementing individual solutions
Select interaction model
Hub-and-spoke, point-to-point, publish-subscribe
Design a data service or exchange pattern
Model data hubs, interfaces, messages, and data services
Map data sources to targets
Design data orchestration
Develop data integration solutions
Develop data services
Develop data flow orchestration
Develop a data migration plan
Develop a release method
Develop complex event processing flows
Maintain metadata for data integration and interoperability
Implement and monitor
Tools
Data transformation engine/ETL tool
A data transformation engine (or ETL tool) is the primary tool in the data integration toolbox and is at the heart of every enterprise data integration program
Whether the data is batch or real-time, physical or virtual, very sophisticated tools exist to develop and execute ETL.
Basic considerations for data transformation engine selection should include whether batch processing and real-time capabilities are required, and whether unstructured and structured data are included
The most mature ones currently are batch processing tools for structured data
Data virtualization server
Data transformation engine
Physically extract, transform, and load data
Data virtualization server
Virtually extract, transform, and integrate data
Can combine structured and unstructured data
Enterprise service bus
Refers both to a software architecture model and to a type of message-oriented middleware
Used for asynchronous, near real-time messaging between data stores, applications, and servers within the same organization
Business rules engine
Many data integration solutions rely on business rules
As an important form of metadata, these rules can be used for basic integrations or in solutions that include complex event handling so that organizations can respond to these events in near real-time.
Data and process modeling tools
Data modeling tools are used to design not only target data structures but also intermediate data structures required for data integration solutions
Data profiling tools
Perform statistical analysis of the contents of a data set to understand the format, consistency, validity, and structure of the data
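A minimal profiling sketch (the data set is invented; standard library only): for each column it reports the row count, null count, distinct values, and most common value, which is the kind of statistic a profiling tool produces at much larger scale.

```python
from collections import Counter

rows = [
    {"customer_id": 1, "gender": "FEMALE", "signup_date": "2023-01-05"},
    {"customer_id": 2, "gender": None,     "signup_date": "05/02/2023"},
    {"customer_id": 2, "gender": "MALE",   "signup_date": "2023-03-09"},
]

def profile(records: list) -> None:
    for col in records[0].keys():
        values = [r[col] for r in records]
        non_null = [v for v in values if v is not None]
        print(col,
              "| rows:", len(values),
              "| nulls:", len(values) - len(non_null),
              "| distinct:", len(set(non_null)),
              "| most common:", Counter(non_null).most_common(1))

profile(rows)
```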
Metadata repository
Contains information about the data in the organization, including data structures, content, and the business rules used to manage the data
Methods
Keep applications loosely coupled, limit the number of interfaces that must be developed and managed, use a hub-and-spoke approach, and create standard interfaces
Implementation guidelines
Readiness Assessment/Risk Assessment
Organizational and cultural change
Data integration and interoperability governance
Data sharing agreements
Set out the responsibilities and acceptable uses of the data being exchanged, and are approved by the business data stewards of the data involved
Data integration and interoperability and data lineage
Metrics
Data availability
Data volume and speed
Solution cost and complexity