MindMap Gallery Big Data Mindmap
Big data has one or more of the following characteristics: high volume, high velocity, or high variety.
Edited at 2020-10-10 02:55:38
Big-Data-Mindmap
Storage
NoSQL
Accumulo (Apache)
BigTable, Key/Value, Distributed, Petabyte scale, fault tolerance
Sorted, schemaless, graph support
On top of Hadoop, ZooKeeper, Thrift
BTrees, bloom filters, compression
Uses ZooKeeper and MR; has a CLI
Clustered. Linear scalability. Failover master
Originally developed by the NSA
Per cell ACL + table ACL
Built-in user DB; plaintext on the wire
Iterators can be inserted in read path and table maintenance code
Advanced UI
HBase
On top of HDFS
BigTable, Key/Value, Distributed, Petabyte scale, fault tolerance
Clustered. Linear scalability. Failover master
Sorted, RESTful
FB, eBay, Flurry, Adobe, Mozilla, TrendMicro
BTrees, bloom filters, compression
Uses ZooKeeper and MR; has a CLI
Column family ACL
Advanced security and integration with Hadoop
Triggers, stored procedures, cluster management hooks
HBase is 4-5 times slower than HDFS in a batch context; HBase is for random access
Stores key/value pairs in columnar fashion (columns are grouped together as column families).
Provides low-latency access to small amounts of data from within a large data set.
Provides a flexible data model.
HBase is not optimized for classic transactional applications or even relational analytics.
HBase is CPU and memory intensive, with sporadic large sequential I/O access.
HBase supports massively parallelized processing via MapReduce, using HBase as both source and sink.
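The sorted key/value model with column families can be sketched in plain Python. This is only an illustration of the data layout (the class and method names are made up, not HBase's API; real HBase also shards sorted row ranges into regions):

```python
class MiniHBaseTable:
    """Toy sorted key/value store with column families (illustrative only)."""

    def __init__(self, families):
        self.families = set(families)   # column families are fixed at table creation
        self.rows = {}                  # row_key -> {(family, qualifier): value}

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {})[(family, qualifier)] = value

    def get(self, row_key):
        return self.rows.get(row_key, {})

    def scan(self, start=None, stop=None):
        # Rows come back in sorted key order, like an HBase scan over a key range.
        for key in sorted(self.rows):
            if (start is None or key >= start) and (stop is None or key < stop):
                yield key, self.rows[key]

t = MiniHBaseTable(families=["info", "stats"])
t.put("user#100", "info", "name", "alice")
t.put("user#200", "info", "name", "bob")
t.put("user#100", "stats", "logins", 7)
print(t.get("user#100")[("info", "name")])                    # alice
print([k for k, _ in t.scan(start="user#1", stop="user#2")])  # ['user#100']
```

Because rows are sorted by key, range scans are cheap, which is why row-key design matters so much in HBase.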
Distributed Filesystems
HDFS
Hadoop part
Optimized for streaming access of large files.
Follows a write-once, read-many ideology.
Doesn't support random read/write.
Good for MapReduce jobs that are primarily I/O-bound, with fixed memory and sporadic CPU.
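The write-once, read-many contract above can be mimicked with a toy store: files are streamed in, closed, and afterwards only readable. All names here are hypothetical; real HDFS additionally splits files into large replicated blocks:

```python
class WormStore:
    """Toy write-once-read-many (WORM) file store, in the spirit of HDFS."""

    def __init__(self):
        self._files = {}     # path -> bytes, immutable once closed
        self._open = {}      # path -> list of pending chunks

    def create(self, path):
        if path in self._files or path in self._open:
            raise FileExistsError(path)   # no overwrite, no random write
        self._open[path] = []

    def append(self, path, chunk):
        self._open[path].append(chunk)    # streaming append while open

    def close(self, path):
        self._files[path] = b"".join(self._open.pop(path))

    def read(self, path):
        return self._files[path]          # many concurrent readers are fine

s = WormStore()
s.create("/logs/part-0")
s.append("/logs/part-0", b"event1\n")
s.append("/logs/part-0", b"event2\n")
s.close("/logs/part-0")
print(s.read("/logs/part-0"))
```

Note there is deliberately no `write_at(offset)`: the absence of random writes is exactly what lets HDFS optimize for large streaming reads.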
Tachyon
Memory centric
Hadoop/Spark compatible without changes
High performance by leveraging lineage information and using memory aggressively
40 contributors from over 15 institutions, including Yahoo, Intel, and Red Hat
High throughput writes
Read 200x and write 300x faster than HDFS
Latency 17x lower than HDFS
Uses memory (instead of disk) and recomputation (instead of replication) to produce a distributed, fault-tolerant, high-throughput file system.
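The recomputation-instead-of-replication idea can be sketched as follows: remember how each dataset was derived, so a lost or evicted copy can be rebuilt from its parents. The names are illustrative, not the real Tachyon API:

```python
class LineageStore:
    """Toy lineage-based store: recompute lost data instead of replicating it."""

    def __init__(self):
        self._cache = {}    # name -> materialized data (the memory tier)
        self._lineage = {}  # name -> (function, parent names)

    def put(self, name, fn, parents=()):
        # Register *how* to compute the dataset, not the data itself.
        self._lineage[name] = (fn, parents)

    def get(self, name):
        if name not in self._cache:            # lost or evicted: recompute
            fn, parents = self._lineage[name]
            self._cache[name] = fn(*[self.get(p) for p in parents])
        return self._cache[name]

    def evict(self, name):
        self._cache.pop(name, None)            # simulate memory pressure

store = LineageStore()
store.put("raw", lambda: list(range(5)))
store.put("doubled", lambda xs: [x * 2 for x in xs], parents=("raw",))
print(store.get("doubled"))   # [0, 2, 4, 6, 8]
store.evict("doubled")        # data lost...
print(store.get("doubled"))   # ...and transparently recomputed from lineage
```

The trade-off is the same one the node above describes: writes are fast because nothing is replicated, at the cost of recomputation work after a failure.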
Data sets are in the gigabytes or terabytes
Query Engine
Interactive
Drill (Apache)
Low latency
JSON-like internal data model to represent and process data
Dynamic schema discovery. Can query self-describing data formats such as JSON, NoSQL, Avro, etc.
SQL queries against a petabyte or more of data distributed across 10,000-plus servers
Drillbit instead of MapReduce
Full ANSI SQL:2003
Low latency distributed execution engine
Unknown schema on HBase, Cassandra, MongoDB
Fault tolerant
Memory intensive
INPUT: RDBMS, NoSQL, HBase, Hive tables, JSON, Avro, Parquet, FS etc.
Shell, WebUI, ODBC, JDBC, CPP API
Not mature enough?
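The dynamic schema discovery idea can be illustrated in plain Python: infer per-field types by walking self-describing JSON records. This is only a sketch (the function name and the type-widening rule are made up, not Drill's actual mechanism):

```python
import json

def discover_schema(json_lines):
    """Infer a field -> type-name map from newline-delimited JSON records."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            t = type(value).__name__
            # Widen to 'mixed' if the same field changes type across records.
            schema[field] = t if schema.get(field, t) == t else "mixed"
    return schema

records = [
    '{"name": "alice", "age": 30}',
    '{"name": "bob", "age": "unknown", "city": "Oslo"}',
]
print(discover_schema(records))
# {'name': 'str', 'age': 'mixed', 'city': 'str'}
```

The point is that no schema is declared up front; the data describes itself, record by record, which is what lets Drill query JSON and NoSQL sources directly.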
Impala (Cloudera)
Low latency
HiveQL and SQL92
INPUT: Can use HDFS or HBase without move/transform. Can use Hive tables as metastore
Integrated with Hadoop, queries Hadoop
Requires specific format for performance
Memory intensive (when joining two tables, one of them needs to fit in cluster memory)
Claimed to be 30x faster than Hive for queries with multiple MR jobs, but that comparison predates Stinger
No fault tolerance
Reads Hadoop file formats, including text, LZO, SequenceFile, Avro, RCFile, and Parquet
Fine-grained, role-based authorization with Sentry. Supports Hadoop security
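The "one join side must fit in memory" constraint noted above is the classic in-memory hash join: build a hash table from the smaller table, then stream the larger one past it. A pure-Python sketch with made-up sample data:

```python
def hash_join(small, large, key_small, key_large):
    """Stream `large` past an in-memory hash table built from `small`."""
    # Build side: the small table is held entirely in memory.
    built = {}
    for row in small:
        built.setdefault(row[key_small], []).append(row)
    # Probe side: the large table is streamed row by row.
    for row in large:
        for match in built.get(row[key_large], []):
            yield {**match, **row}

dims = [{"id": 1, "country": "NO"}, {"id": 2, "country": "SE"}]
facts = [{"user_id": 1, "amount": 10},
         {"user_id": 1, "amount": 5},
         {"user_id": 3, "amount": 2}]
print(list(hash_join(dims, facts, "id", "user_id")))
```

Only the build side lives in memory; the probe side can be arbitrarily large, which is exactly why the smaller table is the one with the size limit.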
Presto (Facebook)
ANSI SQL
Real time
Petabytes of data
INPUT: Can use Hive, HDFS, HBase, Cassandra, Scribe
Similar to Impala and Hive/Stinger
Own engine
Claimed to be more CPU-efficient than Hive/MapReduce
Recently open-sourced; community is still small
Not fault tolerant
Read only
No UDF
Requires specific format for performance
Hive 0.13 with Stinger
A lot of MR micro jobs
HQL
Can use HBase
Spark SQL
SQL, HQL or Scala to run on Spark
Can use Hive data also
JDBC/ODBC
Spark part
Shark
Replaced with Spark SQL
Non interactive
Hive < 0.13
Long jobs
HQL
Based on Hadoop and MR
High latency
Machine Learning
MLlib
Spark implementation of some common machine learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction
Initial contribution from AMPLab, UC Berkeley; shipped with Spark since version 0.8
Spark part
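MLlib itself runs distributed on Spark; as a stand-in, here is a minimal pure-Python k-means (one of the clustering algorithms in the family listed above) on toy 1-D data, just to show the assign/update loop:

```python
def kmeans_1d(points, centers, iters=10):
    """Naive 1-D k-means: assign each point to its nearest center, then
    move each center to the mean of its assigned points."""
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))   # assign step
            clusters[nearest].append(p)
        # Update step: a center with no points keeps its old position.
        centers = [sum(ps) / len(ps) if ps else c for c, ps in clusters.items()]
    return sorted(centers)

print(kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], centers=[0.0, 10.0]))
# [1.0, 9.0]
```

In MLlib the same loop is parallelized: the assign step maps over a distributed dataset and the update step is an aggregation, which is why it fits Spark so naturally.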
Mahout
On top of Hadoop
Using MR
Lucene part
Event Input
Kafka
Distributed publish-subscribe messaging system
High throughput
Persistent, scalable messaging for parallel data load to Hadoop
Compression for performance, mirroring for HA and scalability
Usually used for clickstream
Pull output
Use Kafka if you need a highly reliable and scalable enterprise messaging system to connect multiple systems, one of which is Hadoop.
Flume
Main use case is to ingest data into Hadoop
Distributed system
Collecting data from many sources
Mostly log processing
Push output
A lot of pre-built collectors
OUTPUT: Write to HDFS, HBase, Cassandra etc
INPUT: Can use Kafka
Use Flume if you have non-relational data sources, such as log files, that you want to stream into Hadoop.
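Flume's push-output shape (a source pushes events through a buffering channel into a sink such as HDFS) can be sketched like this; all names are illustrative, not Flume's configuration model:

```python
from collections import deque

class Pipeline:
    """Toy Flume-style source -> channel -> sink pipeline."""

    def __init__(self, sink):
        self.channel = deque()   # buffers events between source and sink
        self.sink = sink

    def source_push(self, event):
        self.channel.append(event)           # the source *pushes* events in

    def drain(self):
        while self.channel:
            self.sink.append(self.channel.popleft())  # deliver downstream

hdfs_like = []                       # stand-in for an HDFS/HBase writer
p = Pipeline(sink=hdfs_like)
for line in ["GET /home 200", "GET /buy 500"]:
    p.source_push(line)              # e.g. tailing a web-server log file
p.drain()
print(hdfs_like)                     # ['GET /home 200', 'GET /buy 500']
```

The channel is the key design piece: it decouples the push rate of the source from the write rate of the sink, which is what real Flume channels (memory- or file-backed) provide.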
Tools
Scheduler
Oozie (Apache)
Schedule Hadoop jobs
Combines multiple jobs sequentially into a unit of work
Integrated with Hadoop stack
Supports jobs for MR, Pig, Hive, and Sqoop, plus system apps, Java, and shell
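A toy sequential workflow runner in the spirit of the node above: chain several job actions into one unit of work and stop at the first failure. (Real Oozie workflows are XML definitions scheduled on the cluster; this sketch only shows the control flow.)

```python
def run_workflow(actions):
    """Run (name, job) actions in order; stop on the first failure."""
    done = []
    for name, job in actions:
        try:
            job()
        except Exception as err:
            return done, f"workflow failed at '{name}': {err}"
        done.append(name)
    return done, "succeeded"

actions = [
    ("import", lambda: None),          # e.g. a Sqoop import step
    ("transform", lambda: None),       # e.g. a Pig or Hive step
]
print(run_workflow(actions))           # (['import', 'transform'], 'succeeded')
```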
Panel
Hue (Cloudera)
UI for Hadoop and satellites (HDFS, MR, Hive, Oozie, Pig, Impala, Solr, etc.)
Webpanel
Upload files to HDFS, send Hive queries etc.
Data analysis
Pig (Apache)
High level scripting language
Can be invoked from code in Java, Ruby, etc.
Can get data from files, streams or other sources
Output to HDFS
Pig scripts are translated into a series of MR jobs
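The MapReduce stages that a Pig script compiles down to have a characteristic shape: map, shuffle (group by key), reduce. This pure-Python word count shows that shape; it is only an illustration, since Pig generates and runs the real jobs for you:

```python
from itertools import groupby

def map_phase(lines):
    # Map: emit (word, 1) for every word.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: bring equal keys together by sorting, then group.
    shuffled = sorted(pairs)
    # Reduce: sum the counts within each group.
    for word, group in groupby(shuffled, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

lines = ["big data", "big cluster"]
print(dict(reduce_phase(map_phase(lines))))   # {'big': 2, 'cluster': 1, 'data': 1}
```

In a real cluster the shuffle is a distributed sort across machines, but the logical pipeline a Pig script expresses is exactly this.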
Data transfer
Sqoop (Apache)
Transfers bulk data between Apache Hadoop and structured datastores such as relational databases.
Two-way replication with both snapshots and incremental updates.
Import between external datastores, HDFS, Hive, HBase, etc.
Works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB
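The snapshot-vs-incremental distinction above can be sketched with a watermark: an incremental transfer only pulls rows modified since the last recorded high-water mark. The function and field names here are made up for illustration, not Sqoop's actual interface:

```python
def incremental_import(table_rows, last_watermark):
    """Return rows newer than the watermark, plus the new watermark."""
    new_rows = [r for r in table_rows if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in new_rows),
                        default=last_watermark)
    return new_rows, new_watermark

rows = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 250},
]
batch, mark = incremental_import(rows, last_watermark=150)
print([r["id"] for r in batch], mark)   # [2] 250
```

A snapshot import is the degenerate case with the watermark at zero: every row qualifies. Persisting the watermark between runs is what makes repeated incremental imports cheap.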
Visualization
Tableau
INPUT: Can access data in Hadoop via Hive, Impala, Spark SQL, Drill, Presto, or any ODBC in Hortonworks, Cloudera, DataStax, and MapR distributions
OUTPUT: reports, UI web, UI client
Clustered. Nearly linear scalability
Can access traditional DB
Can explore and visualize data
SQL
Security
Knox (Apache)
Provides a single point of authentication and access to services in a Hadoop cluster
Graph analytic
GraphX
Spark part
API for graphs and graph parallel computation