DAMA-CDGA Data Governance Engineer - 8. Data Integration and Interoperability
Data integration and interoperability describes the processes involved in moving and integrating data within and between different data stores, applications, and organizations.
8. Data integration and interoperability
Introduction
Definition
Data integration and interoperability
Data integration and interoperability describes the process of moving and integrating data within and between different data stores, applications, and organizations
Data integration
The consolidation of data into consistent forms, physical or virtual
Data interoperability
The ability of multiple systems to communicate with one another
Critical to data warehousing and BI, as well as to master data and reference data management
Because these focus on transforming and integrating data from source systems into data hubs, from hubs to target systems, and ultimately to data consumers
It is the core of the field of big data management.
Big data aims to integrate various types of data
Includes structured data stored in a database
Unstructured text data stored in documents or files
and other types of unstructured data, such as audio, video, streaming data, etc.
These data are integrated so that they can be mined, used to develop predictive models, and deployed in operational intelligence activities
Business drivers
The main purpose of data integration and interoperability is to effectively manage data movement
For enterprises, managing the complexity and associated costs of data integration is a reason to build a data integration architecture
Managing the complexity of data integration
Enterprise-grade data integration design is far more efficient than disparate or point-to-point solutions
Point-to-point solutions between applications can create thousands of interfaces and organizations can quickly become overwhelmed.
Maintenance and management costs
When multiple technologies are used to move data, each technology requires specific development and maintenance costs, resulting in increased support costs.
The application of standard tools can reduce maintenance and labor costs and improve the efficiency of troubleshooting work.
goals and principles
Goals
Deliver data in a timely manner and in the format data consumers need
Consolidate data physically or virtually into data hubs
Reduce the cost and complexity of managing solutions by developing shared models and interfaces
Identify meaningful events and automatically trigger alerts and actions
Support business intelligence, data analysis, master data management and improvement of operational efficiency
Principles
Take an enterprise perspective in design to ensure future extensibility, and implement through iterative and incremental delivery
Balance local data needs with enterprise data needs, including support and maintenance
Ensure the reliability of data integration and interoperability designs and activities
Basic concepts
Extract, Transform, Load (ETL)
Overview
1. ETL target: data goes into a data warehouse, where the business purpose is already clear
2. Data type: structured data goes into the data warehouse
3. Data warehouse: the end goal is BI
At the heart of data integration and interoperability is the basic process of extraction, transformation, and loading (ETL)
Whether physical or virtual, batch or real-time, performing ETL is a necessary step in the flow of data between applications and organizations.
Execution
Can be executed as a regularly scheduled event (batch processing)
Data required for analysis or reporting is usually handled in batch jobs
Can be executed whenever new data arrives or existing data is updated (real-time or event-driven)
Operational data processing is often real-time or near-real-time
extract
Includes selecting the required data and extracting it from its source
The extracted data is then stored in a physical data repository on disk or in memory.
Transform
Makes the selected data compatible with the structure of the target data store
Format changes
Technical format conversion
Such as format conversion from EBCDIC to ASCII
Structure changes
Data structure changes
From denormalized to normalized records
Semantic conversion
Maintain consistent semantic representation when converting data values
e.g., 0, 1, 2, 3 → UNKNOWN, FEMALE, MALE, NOT PROVIDED
Eliminate duplicates
If rules require a unique key or unique records, include a means of scanning the target to detect and remove duplicate rows
Re-ordering
Change the order of data elements or records to fit a defined schema
Transformation can be executed in batch or in real time
The results are either stored physically in a staging area
Or held virtually in memory
Until the data moves on to the loading step
Load
Physically store or present the transformation results in the target system
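As a minimal sketch of the extract-transform-load steps above (Python, standard library only; the table and column names are invented for illustration, and the source and target tables are assumed to already exist), the example extracts rows, applies a format change, a semantic conversion, and de-duplication, and then loads the result:

```python
import sqlite3

# Hypothetical semantic mapping, as in the example above: code -> label
GENDER_CODES = {0: "UNKNOWN", 1: "FEMALE", 2: "MALE", 3: "NOT PROVIDED"}

def etl(source_conn: sqlite3.Connection, target_conn: sqlite3.Connection) -> None:
    # Extract: select only the required columns from the source
    rows = source_conn.execute(
        "SELECT customer_id, gender_code, signup_date FROM src_customers"
    ).fetchall()

    # Transform: semantic conversion, format change, and de-duplication
    seen, transformed = set(), []
    for customer_id, gender_code, signup_date in rows:
        if customer_id in seen:                          # eliminate duplicate keys
            continue
        seen.add(customer_id)
        transformed.append((
            customer_id,
            GENDER_CODES.get(gender_code, "UNKNOWN"),    # semantic conversion
            signup_date.replace("/", "-"),               # format change
        ))

    # Load: physically store the results in the target structure
    target_conn.executemany(
        "INSERT INTO dw_customers (customer_id, gender, signup_date) VALUES (?, ?, ?)",
        transformed,
    )
    target_conn.commit()
```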
Extract, Load, Transform (ELT)
Overview
1. ELT target: data goes into a data lake, often before the business scenario is clear
2. Data type: both structured and unstructured data can go into the data lake
3. Data lake: the end goal is AI
If the target system has stronger transformation capability than the source system or an intermediate application system, the processing order can be switched to ELT: extract, load, transform
ELT allows the data to be loaded into the target system before it is transformed
ELT allows the source data to be instantiated on the target system as raw data, which can be useful to other processes
Loading data into a data lake via ELT is common in big data environments
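A corresponding ELT sketch under the same assumptions (invented names; SQLite stands in for the target platform): the raw data is landed on the target first, and the transformation is pushed down to the target system's own engine:

```python
import sqlite3

def elt(raw_records, target_conn: sqlite3.Connection) -> None:
    # Load: land the source records on the target system as raw data first
    target_conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_customers (customer_id, gender_code, signup_date)"
    )
    target_conn.executemany("INSERT INTO raw_customers VALUES (?, ?, ?)", raw_records)

    # Transform: use the target's own processing power (here, SQL) after loading
    target_conn.execute("""
        CREATE TABLE IF NOT EXISTS dw_customers AS
        SELECT DISTINCT
               customer_id,
               CASE gender_code
                    WHEN 1 THEN 'FEMALE'
                    WHEN 2 THEN 'MALE'
                    WHEN 3 THEN 'NOT PROVIDED'
                    ELSE 'UNKNOWN'
               END AS gender,
               REPLACE(signup_date, '/', '-') AS signup_date
        FROM raw_customers
    """)
    target_conn.commit()
```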
Mapping
A synonym for transformation: both the process of developing a lookup matrix from source structures to target structures, and the result of that process
Defines the sources to be extracted, the rules for identifying the data to extract, the targets to load, the rules for identifying which target rows to update, and the transformation or calculation rules to apply (see the sketch below)
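One hypothetical way to record such a mapping as metadata before any code is written (all field names invented for illustration):

```python
# Hypothetical source-to-target mapping specification: which source fields are
# extracted, which target fields they populate, and which rule is applied.
CUSTOMER_MAPPING = [
    {"source": "src_customers.cust_no",   "target": "dw_customers.customer_id", "rule": "copy"},
    {"source": "src_customers.gender_cd", "target": "dw_customers.gender",      "rule": "lookup GENDER_CODES"},
    {"source": "src_customers.fname || ' ' || src_customers.lname",
     "target": "dw_customers.full_name",  "rule": "concatenate"},
]
```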
Latency
Definition
Refers to the time difference between when data is generated by the source system and when the data is available on the target system.
Different data processing methods will lead to different degrees of data delay
Very high latency: batch processing
Lower latency: event-driven processing
Very low latency: real-time synchronization
Batch processing
Data moves between applications and organizations in batches of files, either based on manual requests from data consumers or automatically triggered on a periodic basis. This type of interaction is called batch processing or ETL
Data moved in batch mode may represent the full set of data at a given point in time, or only the data whose values have changed since the last extraction
The set of changed data is called a delta (increment); the full set of data at a point in time is called a snapshot
With batch data integration solutions there is often a significant delay between a change in the source and the update of the target, resulting in high latency
Batch data integration is used for data conversion, migration, and archiving, as well as for extracting data from and loading it into data warehouses and data marts
Micro-batching
Refers to batch processing scheduled to run more frequently than daily
Timing
Running batch processes at the wrong time can introduce risk
To minimize problems with partial application updates, data movement can be scheduled for the end of logical processing of the business day, or for off-hours such as overnight
Change data capture
A method of reducing bandwidth requirements by filtering the data so that only data changed within a defined time frame is included
Change data capture monitors changes (insertions, changes, deletions) to a data set and then communicates these changes (deltas) to other data sets, applications, and organizations that consume the data
As part of the change data capture process, data can also be tagged with identifiers such as tags or timestamps.
Change data capture can be data-based or log-based
Data-based change data capture
The source system populates specific data elements
For example, timestamps, codes, or flags that serve as change indicators
When data changes, source system processes add entries to a simple list of objects and identifiers, which is then used to control the selection of data for extraction
The source system copies the data that has changed
Log-based change data capture
The data activity logs created by the database management system are copied and processed, looking for specific changes that are then translated and applied to the target database
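A minimal sketch of data-based change data capture (invented names; a timestamp column on the source acts as the change indicator, so only rows changed since the last run are extracted):

```python
import sqlite3

def extract_changes(source_conn: sqlite3.Connection, last_run_ts: str):
    """Data-based CDC: filter on a change-indicator column so only deltas move."""
    return source_conn.execute(
        "SELECT customer_id, gender_code, last_modified "
        "FROM src_customers WHERE last_modified > ?",
        (last_run_ts,),
    ).fetchall()

# The extracted delta can then be tagged with the extraction timestamp before
# being passed on to the consuming data sets, applications, or organizations.
```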
Near real-time and event-driven
Most data integration solutions that do not take a batch approach use a near-real-time or event-driven approach
Data is processed in smaller sets within a specific schedule, or as events occur, such as data updates
Near real-time processing has lower latency compared to batch processing
And because the work is distributed over time, the system load is lower
However, it is generally slower than synchronous data integration
Near real-time data integration solutions are often implemented using an enterprise service bus
Asynchronous
In an asynchronous data flow, the system providing the data does not wait for the receiving system to confirm the update before continuing processing.
Asynchronous means that the sending or receiving system may be offline for a period of time while the other system is functioning normally
Asynchronous data integration does not block the source system application from continuing to execute, nor does it cause the source application to be unavailable if any target application is unavailable
Because data updates to applications in an asynchronous configuration are not immediate, this style of integration is called near-real-time integration
Real-time, synchronous
There are situations where time delays or other differences between source and target data are not allowed
When data from one dataset must be perfectly synchronized with data from another dataset, a real-time synchronization solution must be used
In a synchronous integration solution, execution waits for confirmation from other applications or processes before executing the next activity or transaction.
Because time must be spent waiting for confirmation that the data is synchronized, the solution can process fewer transactions
If any application that needs to update data is in an unavailable state, transactions within the application cannot be completed.
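The difference between the two styles can be sketched with a simple in-process queue (Python standard library; function names invented): the asynchronous sender continues without waiting, while the synchronous sender blocks until the receiver has confirmed every update.

```python
import queue
import threading

updates = queue.Queue()

def receiver():
    while True:
        record = updates.get()      # the receiving system processes when it can
        print("applied", record)
        updates.task_done()         # confirmation that the update was applied

threading.Thread(target=receiver, daemon=True).start()

# Asynchronous: the sender does not wait for the receiver to confirm the update.
def send_async(record):
    updates.put(record)             # returns immediately; the source keeps working

# Synchronous: the sender waits for confirmation, so it completes fewer
# transactions in the same amount of time.
def send_sync(record):
    updates.put(record)
    updates.join()                  # blocks until every queued update is confirmed
```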
Low latency or stream processing
Low-latency data integration solutions are designed to reduce event response time
Use of solid-state disks
Reduce read and write latency
Asynchronous solutions
Typically used in low-latency solutions so that processing does not wait for acknowledgment from subsequent processes before handling the next piece of data
Massively multi-processor or parallel processing
It is also a common configuration for low latency
Replication
To provide better response times to users around the world, some applications have exact copies of their data sets maintained across multiple physical locations.
Replication technology minimizes the impact of analytics and queries on the performance of the main transaction operating environment
The data must therefore be kept synchronized across every physically distributed copy of the data set
Replication solutions
Typically monitor the change log of a dataset rather than the dataset itself
Because they do not compete with applications for access to data sets, they minimize the impact on any operational application.
Only data from the change log is transferred between replicas
Standard replication solutions are near real-time
Replication tools perform best when the source and target datasets are exact copies of each other
If data changes occur at multiple replica sites, then a replication solution is not the best choice
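A minimal sketch of log-based replication (the change-log format is invented): the replica reads the change log rather than the data set itself, so it does not compete with applications for access to the data.

```python
# Hypothetical change-log entries: (operation, key, new_value)
change_log = [
    ("insert", "cust-1", {"name": "Ada"}),
    ("update", "cust-1", {"name": "Ada Lovelace"}),
    ("delete", "cust-1", None),
]

def apply_changes(replica: dict, log_entries) -> dict:
    """Apply only the deltas from the change log to a physically distributed copy."""
    for op, key, value in log_entries:
        if op in ("insert", "update"):
            replica[key] = value
        elif op == "delete":
            replica.pop(key, None)
    return replica

replica_site = apply_changes({}, change_log)   # the copy converges on the source state
```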
Archive
Data that is used infrequently can be moved to an alternative data structure or storage solution that is less expensive for the organization
ETL functionality is used to archive data and potentially transform it into data structures in the archive environment
It is important to monitor archiving technology to ensure that data can still be accessed as technology changes
Enterprise message format/canonical format
e.g., Kafka
A canonical data model is a common model used by an organization or a data exchange group to standardize the format in which data is shared
Data from sending systems is converted for receiving systems via the common, enterprise-standard message format
Using a canonical format reduces the number of data translations required between systems or organizations
Each system only needs to convert data into a central canonical format, rather than into numerous system formats.
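A minimal sketch of the idea (field names invented): each system translates only to or from the one canonical message, so adding a new system adds one converter rather than one converter per existing system.

```python
# Hypothetical canonical customer message used for all exchanges.
def billing_to_canonical(rec: dict) -> dict:
    return {"customer_id": rec["CUST_NO"], "name": rec["CUST_NAME"]}

def crm_to_canonical(rec: dict) -> dict:
    return {"customer_id": rec["id"], "name": f'{rec["first"]} {rec["last"]}'}

def canonical_to_warehouse(msg: dict) -> tuple:
    return (msg["customer_id"], msg["name"])

# Each sender converts once into the canonical format, and each receiver
# converts from it, instead of maintaining pairwise system-to-system translations.
row = canonical_to_warehouse(crm_to_canonical({"id": 7, "first": "Ada", "last": "Lovelace"}))
```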
Interaction models
The interaction model describes the way connections are established between systems to transfer data
Point-to-point
The vast majority of interactions between shared data systems are "point-to-point", where they pass data directly to each other
This model works in the context of a small set of systems
But when many systems require the same data from the same source, it becomes inefficient and increases organizational risk
Impacts on processing
If the source system is operational, the workload of providing the data may impact transaction processing
Interface management
The number of interfaces required by a point-to-point interaction model approaches the square of the number of systems (see the illustration below)
Once these interfaces are established, they need to be maintained and supported
The workload of managing and supporting the interfaces between systems can quickly become greater than the support of the systems themselves
Potential inconsistencies
Design issues arise when multiple systems require different versions or data formats
Using multiple interfaces to obtain data can lead to inconsistent data sent to downstream systems.
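The interface growth can be made concrete: with n systems, a fully meshed point-to-point design needs on the order of n² connections (n(n-1)/2 bidirectional pairs), while a hub-and-spoke design (described next) needs only n interfaces to the hub. A quick illustration:

```python
def point_to_point_interfaces(n: int) -> int:
    # every pair of systems needs its own interface
    return n * (n - 1) // 2

def hub_and_spoke_interfaces(n: int) -> int:
    # each system connects only to the central hub
    return n

for n in (10, 50, 200):
    print(n, point_to_point_interfaces(n), hub_and_spoke_interfaces(n))
# 10 systems: 45 vs 10; 50 systems: 1,225 vs 50; 200 systems: 19,900 vs 200
```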
Hub-and-spoke
Consolidates shared data (physically or virtually) into a central data hub that applications can use
All systems that want to exchange data do so through the central, common data hub rather than directly with each other (point-to-point)
Data warehouses, data marts, operational data stores, and master data management hubs are all examples of data hubs
The data hub provides a consistent view of the data with limited impact on source system performance
Adding a new system to the mix requires building only an interface to the data hub
An Enterprise Service Bus (ESB) is a data integration solution for sharing data between multiple systems in near real-time; its hub is a virtual concept representing the standard, canonical format for sharing data in the organization
Some hub-and-spoke models have unacceptable latency or performance issues
The hub itself creates overhead in a hub-and-spoke architecture
Publish and subscribe
The publish-and-subscribe model involves systems that push out (publish) data and other systems that take in (subscribe to) data
Systems that push data are listed in the catalog of data services, and systems that wish to use the data subscribe to these services
When publishing data, the data will automatically be sent to subscribing users
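A minimal in-process publish-and-subscribe sketch (topic name invented): publishers push data to a topic without knowing who consumes it, and the data is delivered automatically to every subscriber.

```python
from collections import defaultdict

subscribers = defaultdict(list)            # topic -> list of subscriber callbacks

def subscribe(topic: str, callback) -> None:
    subscribers[topic].append(callback)

def publish(topic: str, data) -> None:
    for callback in subscribers[topic]:    # data is pushed to every subscriber
        callback(data)

subscribe("customer.updated", lambda d: print("warehouse received", d))
subscribe("customer.updated", lambda d: print("crm received", d))
publish("customer.updated", {"customer_id": 7, "name": "Ada"})
```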
Data integration and interoperability architecture concepts
Application coupling
Coupling describes the degree to which two systems are intertwined
Tightly coupled
Two tightly coupled systems often have a synchronization interface where one system waits for a response from the other system
Represents operational risk
If one party is unavailable, then both are effectively unavailable, and the business continuity plans for both systems must be consistent
Loose coupling
A preferred interface design
Transfers data between systems without waiting for a response, and the unavailability of one system does not make the other system unavailable
Loose coupling can be implemented using various techniques such as services, APIs, or message queues
Service-oriented architecture using an ESB is an example of a loosely coupled data interaction design pattern
Orchestration and process control
Orchestration
The term used to describe how multiple related processes are organized and executed in a system
All systems that handle messages or data packets must be able to manage the order in which those processes execute, in order to preserve consistency and continuity
Process control
Is the component that ensures accurate and complete delivery, scheduling, extraction and loading of data
Enterprise application integration
In the enterprise application integration (EAI) model, software modules interact only through well-defined interface calls (application programming interfaces, APIs)
A data store is updated only by its own software module; other software cannot access the data in an application directly, only through the defined APIs
EAI is based on object-oriented concepts, which emphasizes the ability to reuse and replace any module without affecting any other module
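A minimal sketch of the EAI principle (module and method names invented): the module's data store is private, and other software reaches it only through the module's defined API.

```python
class CustomerModule:
    """Only this module updates its own data store; other modules use the API."""

    def __init__(self):
        self._store = {}                    # private data store, never accessed directly

    # Well-defined interface calls (the module's API)
    def add_customer(self, customer_id: int, name: str) -> None:
        self._store[customer_id] = {"name": name}

    def get_customer(self, customer_id: int) -> dict:
        return dict(self._store.get(customer_id, {}))

customers = CustomerModule()
customers.add_customer(7, "Ada")            # other modules interact only via the API
print(customers.get_customer(7))
```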
Enterprise service bus (ESB)
It acts as an intermediary between systems, delivering messages between them
Applications can encapsulate sent and received messages or files through the existing capabilities of the ESB
As an example of loose coupling, the ESB acts as a service between two applications
Service-oriented architecture (SOA)
Data can be provided to, and updates pushed between, applications through well-defined service calls
Applications do not have to interact directly with other applications or understand the inner workings of other applications
Supports application independence and the organization's ability to replace systems without making significant changes to the systems they interact with
The goal of SOA is to define well-defined interactions between independent software modules
Each module can be used by other software modules or individual consumers to perform functions (provide functionality)
The key concept of SOA is that an independent service is provided: the service has no prior knowledge of the calling application, and the implementation of the service is a black box for the calling application
SOA can be implemented through various technologies such as Web services, messaging, and APIs.
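A minimal sketch of a service contract (names invented): the consumer depends only on the interface, and the implementation behind it is a black box that can be replaced without changing the caller.

```python
from typing import Protocol

class CustomerService(Protocol):
    """Well-defined service contract; callers know nothing about the implementation."""
    def get_customer(self, customer_id: int) -> dict: ...

class WarehouseBackedService:
    def get_customer(self, customer_id: int) -> dict:
        return {"customer_id": customer_id, "source": "warehouse"}   # black box to callers

def print_customer(service: CustomerService, customer_id: int) -> None:
    # Written against the contract, not the implementation, so the backing
    # system can be swapped without touching this code.
    print(service.get_customer(customer_id))

print_customer(WarehouseBackedService(), 7)
```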
Complex event processing (CEP)
Event processing is a method of tracking and analyzing (processing) the flow of information about an occurring event, and drawing conclusions from it
Complex event processing refers to merging data from multiple sources, identifying meaningful events, setting rules for these events to guide event processing and routing, and then predicting behaviors or activities, and automatically triggering real-time responses based on the predicted results.
Such as sales opportunities, web clicks, orders, and customer calls, etc.
Complex event processing requires an environment that can integrate various types of data
Since predictions often involve large amounts of data of various types, complex event processing is often associated with big data
Complex event processing often requires the use of technologies that support ultra-low latency, such as processing real-time streaming data and in-memory databases
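A minimal complex-event-processing sketch (the rule and event shapes are invented): events from several sources are merged per customer, a rule identifies the meaningful combination, and a response is triggered automatically.

```python
from collections import defaultdict

recent_events = defaultdict(set)            # customer_id -> event types seen so far

def trigger_response(customer_id: int) -> None:
    print(f"alert: follow up with customer {customer_id}")   # automatic response

def handle_event(event: dict) -> None:
    recent_events[event["customer_id"]].add(event["type"])
    # Rule: a web click combined with a customer call signals a sales opportunity
    if {"web_click", "customer_call"} <= recent_events[event["customer_id"]]:
        trigger_response(event["customer_id"])

handle_event({"customer_id": 7, "type": "web_click"})
handle_event({"customer_id": 7, "type": "customer_call"})    # the rule fires here
```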
Data federation and virtualization
When data exists in different data repositories, it can also be aggregated by means other than physical integration
Data federation provides access to a combination of independent data repositories regardless of their respective structures
Data virtualization enables distributed databases as well as multiple heterogeneous data stores to be accessed and viewed as a single database
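A minimal federation sketch (sources and fields invented): two independent stores with different structures are queried in place and presented as one combined view, without physically moving the data.

```python
import sqlite3

# Source 1: a relational store
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
db.execute("INSERT INTO customers VALUES (7, 'Ada')")

# Source 2: a non-relational store with a different structure
crm_records = [{"id": 7, "segment": "enterprise"}]

def federated_customer_view(customer_id: int) -> dict:
    """Combine both repositories at query time; neither data set is copied."""
    row = db.execute(
        "SELECT name FROM customers WHERE customer_id = ?", (customer_id,)
    ).fetchone()
    crm = next((r for r in crm_records if r["id"] == customer_id), {})
    return {"customer_id": customer_id,
            "name": row[0] if row else None,
            "segment": crm.get("segment")}

print(federated_customer_view(7))
```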
Data as a service
Software as a Service (SaaS)
A delivery and licensing model
Licensed applications provide services, but the software and data are located in data centers controlled by the software vendor rather than in the licensing organization's data centers
Different tiers of computing infrastructure are also offered as services (Infrastructure as a Service IaaS, Platform as a Service PaaS, Database as a Service DBaaS)
Data as a Service (DaaS)
Data is licensed from a vendor and provided by the vendor on demand, rather than storing and maintaining data in the licensed organization's data center
Cloud integration
Before cloud computing, integration could be divided into internal integration and inter-enterprise (B2B) integration
Internal integration
Services are provided through an internal middleware platform, often using an enterprise service bus (ESB) to manage data exchange between systems
Inter-enterprise (B2B) integration
Performed through electronic data interchange (EDI) gateways and value-added networks (VANs)
Cloud integration
Typically run as a SaaS application in the vendor's data center rather than within the organization that owns the data being integrated
Data exchange standards
Data exchange standards are formal rules for the structure of data elements
Exchange patterns define the data transformation structures required by any system or organization to exchange data
Data needs to be mapped into the exchange specification
Agreeing on a consistent exchange format or data layout between systems can greatly simplify the process of sharing data in the enterprise, thereby reducing support costs and allowing workers to better understand the data
The National Information Exchange Model (NIEM) is a data exchange standard developed for exchanging documents and transactions between U.S. government agencies.
Activities
Planning and analysis
Define data integration and lifecycle requirements
Defining data integration requirements involves understanding the organization's business goals, as well as the data and recommended technology options needed to achieve those goals
Defining requirements also involves identifying any relevant laws or regulations governing the data involved
The process of defining requirements creates and discovers valuable metadata
The more complete and accurate an organization's metadata is, the greater its ability to manage data integration risks and costs.
Perform data exploration
Data exploration should occur before design
The goal of data exploration is to identify potential data sources for data integration efforts
Data exploration will identify where data may be obtained and where it may be integrated
The process combines technical searches with subject matter expertise using tools that scan metadata and actual content on an organization's data sets
Data exploration also includes high-level assessment work on data quality to determine whether the data is suitable for the goals of the integration plan
Record data lineage
The data exploration process will also reveal information about how data flows through an organization
This information can be used to document high-level data lineage: how the data was acquired or created by the organization, how it moved and changed within the organization, and how it was used by the organization for analysis, decision-making, or event triggering
Well-documented data lineage can include the rules under which data is changed and how often it is changed
The analysis process can also provide opportunities to improve existing data flows
Finding and eliminating these inefficiencies or ineffective configurations can greatly aid project success and improve an organization's overall ability to use its data.
Profile data
Understanding the content and structure of data is essential to successful data integration
Data profiling helps achieve this goal
If the data profiling step is skipped, information that affects the design may not be discovered until testing or production
One of the goals of profiling is to assess the quality of the data
As with data exploration, data profiling involves validating assumptions about the data against the actual data
Collect business rules
Business rules are a key subset of requirements, statements that define or constrain aspects of business processing.
Business rules are designed to maintain the business structure and control or influence the behavior of the business
Design data integration solutions
Design data integration solutions
Data integration solutions should be considered at both the enterprise and individual solution levels
Establishing enterprise standards allows organizations to save time implementing individual solutions
Select interaction model
Hub-and-spoke, point-to-point, publish-subscribe
Design a data service or exchange pattern
Model data hubs, interfaces, messages, and data services
Map data sources to targets
Design data orchestration
Develop data integration solutions
Develop data services
Develop data flow orchestration
Develop a data migration plan
Develop a release method
Develop complex event processing flows
Maintain metadata for data integration and interoperability
Implement and monitor
Tools
Data transformation engine/ETL tool
A data transformation engine (or ETL tool) is the primary tool in the data integration toolbox and is at the heart of every enterprise data integration program
Whether the data is batch or real-time, physical or virtual, very sophisticated tools exist to develop and execute ETL.
Basic considerations for data transformation engine selection should include whether batch processing and real-time capabilities are required, and whether unstructured and structured data are included
The most mature ones currently are batch processing tools for structured data
Data virtualization server
Data transformation engine
Physically extract, transform, and load data
Data virtualization server
Virtually extract, transform, and integrate data
Can combine structured and unstructured data
Enterprise service bus
Refers both to a software architecture model and to a type of message-oriented middleware
Used for asynchronous, near real-time messaging between data stores, applications, and servers within the same organization
Business rules engine
Many data integration solutions rely on business rules
As an important form of metadata, these rules can be used for basic integrations or in solutions that include complex event handling so that organizations can respond to these events in near real-time.
Data and process modeling tools
Data modeling tools are used to design not only target data structures but also intermediate data structures required for data integration solutions
Data profiling tools
Perform statistical analysis of the contents of a data set to understand the format, consistency, validity, and structure of the data
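A minimal profiling sketch (the data set is invented; standard library only): for each column it reports the row count, null count, distinct values, and most common value, which is the kind of statistic a profiling tool produces at much larger scale.

```python
from collections import Counter

rows = [
    {"customer_id": 1, "gender": "FEMALE", "signup_date": "2023-01-05"},
    {"customer_id": 2, "gender": None,     "signup_date": "05/02/2023"},
    {"customer_id": 2, "gender": "MALE",   "signup_date": "2023-03-09"},
]

def profile(records: list) -> None:
    for col in records[0].keys():
        values = [r[col] for r in records]
        non_null = [v for v in values if v is not None]
        print(col,
              "| rows:", len(values),
              "| nulls:", len(values) - len(non_null),
              "| distinct:", len(set(non_null)),
              "| most common:", Counter(non_null).most_common(1))

profile(rows)
```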
Metadata repository
Contains information about the data in the organization, including data structures, content, and the business rules used to manage the data
Methods
Keep applications loosely coupled, limit the number of interfaces that must be developed and managed, use a hub-and-spoke approach, and create standard interfaces
Implementation guidelines
Readiness Assessment/Risk Assessment
Organizational and cultural change
Data integration and interoperability governance
Data sharing agreements
Set out the responsibilities and acceptable uses of the data being exchanged, and are approved by the business data stewards of the data involved
Data integration and interoperability and data lineage
Metrics
Data availability
Data volume and speed
Solution cost and complexity