
Introduction to Big Data PowerPoint Presentation


Introduction to Big Data Presentation Transcript

Slide 1 - Introduction to Big Data
Slide 2 - What is "Big Data"? "Big Data is a collection of data sets so large and complex that it becomes difficult to process using traditional database management tools and/or applications. Challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization." (Reference: http://en.wikipedia.org/wiki/Big_data)
Slide 3 - The importance of Big Data in the Data Science equation. Large data sets are not new (e.g. energy, telecom). "Big Data" arises when the data itself becomes part of the problem (e.g. pushing existing limits), and it embodies a set of tools and technologies for dealing with vast data sets (capturing, storing, accessing, processing, etc.). Increased data volume dictates increased sophistication in the analysis and use of that data: the foundation of data science.
Slide 4 - Part I: Characterizing Big Data
Slide 5 - Data Size
Kilobyte = 1,024 bytes (2^10), approximately 10^3 bytes
Megabyte = 1,024 Kilobytes (2^20), approximately 10^6 bytes
Gigabyte = 1,024 Megabytes (2^30), approximately 10^9 bytes
Terabyte = 1,024 Gigabytes (2^40), approximately 10^12 bytes
Petabyte = 1,024 Terabytes (2^50), approximately 10^15 bytes
Exabyte = 1,024 Petabytes (2^60), approximately 10^18 bytes
Zettabyte = 1,024 Exabytes (2^70), approximately 10^21 bytes
Yottabyte = 1,024 Zettabytes (2^80), approximately 10^24 bytes
Slide 6 - Data Format / Composition / Mode of Access
Binary digit (bit): 0 or 1.
Byte (8 bits): 00000000 to 11111111.
Data type: a collection of bytes representing simple and complex entities (e.g. 123, 3.14, 'A', "Hello There!", [27,59,-18], ("what","is","big","data")).
Record: a collection of data types representing compound entities; fixed length vs. variable length (e.g. fixed: (Name, DOB); variable: (Name, EmpID, WorkHistory)).
File: a collection of records; text/binary; structured/semi-structured/unstructured (data at rest) (e.g. database, image, video, podcast, CSV, PDF, HTML, books, journals, etc.).
File system: a collection of files; local vs. network/distributed/cloud.
Stream: a collection of records; text/binary; structured/semi-structured/unstructured (data in motion) (e.g. audio/video surveillance, network monitoring, stocks, etc.).
Slide 7 - Data's V-Dimensions
Volume: data size and growth rate.
Velocity: speed requirements.
Variety: data types.
Validity: legitimacy of the data sets (governance provisions).
Veracity: can the elicited results be believed?
Value: what business advantages can be gleaned?
Slide 8 - Part II: Older Methods of Storing Big Data
Relational Database Model, Network Database Model, Hierarchical Database Model, Object Database Model, Object-Relational Database Model, XML Database Model, Content Management Systems, File Systems, Data Warehouse, Distributed Databases.
Slide 9 - File Systems
A collection of information that is arranged in a hierarchy. A file corresponds to a container for information; a directory corresponds to a container for files and directories; a sub-directory corresponds to a directory that is nested within another directory.
Operations: Create, Read, Update, Delete, Find, Navigate (via operating system commands and applications).
Examples: computer operating systems (DOS, Windows, Mac OS, Unix, VMS, etc.), Network File Systems (NFS), Network Attached Storage (NAS), file servers, etc.
Slide 10 - Hierarchical Database Model
A hierarchical database consists of a collection of records which are connected to one another through links. A record corresponds to a collection of fields; each field contains a single data value. A link corresponds to an association between exactly two records.
Schema: boxes represent record types; lines correspond to links. Includes a data definition language (DDL) and a data manipulation language (DML).
Rooted trees: the records are organized into forests (collections of rooted trees), with a dummy node used for each tree root. A parent node can have multiple children (1:N); a child node has exactly one parent (1:1). No cycles are allowed in the structure.
Examples: IBM's Information Management System (IMS), Microsoft Windows Registry.
[Diagram: dummy nodes for record types A and B, with records A1, A2 and B1, B2, B3 connected by links.]
Slide 11 - Hierarchical Database Model (continued)
Representing many-to-many (M:N) relationships between two record types A and B is accomplished through record duplication: create two different trees to depict the one-to-many relationships, a one-to-many relationship from A to B (tree T1) and a one-to-many relationship from B to A (tree T2). Record duplication is necessary to preserve the tree-structure organization of the database, but data inconsistency may result during updates and waste of space is unavoidable.
[Diagram: tree T1 rooted at the A records (A1, A2) with duplicated B records (B1, B2, B3); tree T2 rooted at the B records with duplicated A records.]
Slide 12 - Hierarchical Database Model (continued): Addressing Data Duplication with Virtual Records
Virtual records contain no data; they only represent a logical pointer to a physical record. When a record is to be replicated in several database trees, only a single copy of the record is kept in one of the trees; all other copies are replaced with virtual records.
[Diagram: the dummy nodes for record types A and B hold the physical records A1, A2 and B1, B2, B3; trees T1 and T2 contain only virtual records (Virtual-A1, Virtual-A2, Virtual-B1, Virtual-B2, Virtual-B3) pointing to them.]
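To make the virtual-record idea concrete, here is a minimal Python sketch (the record names and tree contents are illustrative, not from the slide): each record is stored once in a physical store, and both trees hold only keys that point into it, so an update made through one tree is visible through the other.

```python
# One physical copy of every record...
physical_store = {
    "A1": {"name": "A1"}, "A2": {"name": "A2"},
    "B1": {"name": "B1"}, "B2": {"name": "B2"}, "B3": {"name": "B3"},
}

# ...and two trees that contain only "virtual records" (keys into the store),
# one expressing the A -> B relationship and one expressing B -> A.
tree_t1 = {"A1": ["B1", "B2"], "A2": ["B2", "B3"]}          # 1:N from A to B
tree_t2 = {"B1": ["A1"], "B2": ["A1", "A2"], "B3": ["A2"]}  # 1:N from B to A

def resolve(virtual_key):
    """Follow a virtual record (a key) to the single physical record."""
    return physical_store[virtual_key]

# Updating through T1 is visible through T2, because nothing is duplicated.
resolve("B2")["name"] = "B2 (renamed)"
print([resolve(k)["name"] for k in tree_t2])  # includes 'B2 (renamed)'
```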
Slide 13 - Network Database Model
A network database consists of a collection of records which are connected to one another through links. A record corresponds to a collection of fields; each field contains a single data value. A record and its fields are represented by a record type. A link corresponds to an association between exactly two records. Unlike a hierarchical database, a network database allows cycles and can accommodate arbitrary information graphs.
Schema: boxes represent record types; lines correspond to links. Links can be one-to-one (1:1), one-to-many (1:N), many-to-one (N:1), or many-to-many (M:N). Includes a data definition language (DDL) and a data manipulation language (DML).
Examples: Computer Associates Integrated Database Management System (CA IDMS).
[Diagram: record types A and B, with a graph representing the relationship between records A1, A2 and B1, B2, B3.]
Slide 14 - Relational Database Model
A relational database consists of a collection of tables (relations). Rows in each table represent individual records; columns represent attributes (or fields). Each table is made up of key and non-key fields. Associations between tables (relationships) are realized through other tables.
Examples: Apache Derby, IBM DB2, Informix, Ingres, Microsoft Access, PostgreSQL, Microsoft SQL Server, MySQL, Oracle, Paradox, JavaDB.
[Diagram: a table representing all records of type T, with rows Record1..Recordn and columns Attr1..Attrm; separate tables for A, for B, and for the relationship between A and B.]
Slide 15 - Relational Database Model (continued): Relational Database Theory
Based on the concept of normal forms: the higher the normal form for a table, the less susceptible it is to inconsistencies and anomalies.
ACID properties:
Atomicity - all operations occur or none occur; no partial transactions.
Consistency - a transaction brings the database from one valid state to another valid state.
Isolation - no transaction should be able to interfere with another transaction.
Durability - once a transaction has been committed, the changes are permanent.
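As a concrete illustration of atomicity, here is a small sketch using Python's built-in sqlite3 module (the accounts table and amounts are made up, not from the slide): when one statement of a transaction fails, the whole transaction is rolled back and no partial change survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

try:
    # Transfer 500 from account 1 to account 2 as a single transaction.
    conn.execute("UPDATE accounts SET balance = balance + 500 WHERE id = 2")
    conn.execute("UPDATE accounts SET balance = balance - 500 WHERE id = 1")  # violates the CHECK constraint
    conn.commit()
except sqlite3.IntegrityError:
    conn.rollback()  # atomicity: the first update is undone as well

print(conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())
# [(1, 100), (2, 50)]  -- no partial transfer survives
```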
Slide 16 - Relational Database Model (continued): Keys
Simple: a single attribute that uniquely identifies each tuple (row) in a table.
Primary: a unique set of attributes that identifies each tuple (row) in a table.
Composite: two or more attributes that uniquely identify each row, where at least one attribute is not a simple key on its own.
Compound: two or more attributes that uniquely identify each row, where each attribute is a simple key on its own.
Super key: a set of attributes for a relation upon which all attributes are functionally dependent.
Candidate: a minimal super key.
Foreign: a unique set of attributes that identifies each tuple (row) in a different table.
Slide 17 - Relational Database Model (continued): Structured Query Language (SQL)
A declarative (as opposed to imperative), standards-based language (e.g. SQL-2011) for creating, querying, and manipulating relational databases.
Data definition: Create, Alter, Drop; indexes, constraints, triggers, stored procedures; access controls.
Data manipulation: Select (vertical/horizontal slicing), Update, Delete; joins (building intermediate tables): cross, theta, equi, natural, inner, full outer, left outer, right outer; query optimization; set operations: In, Not In, Union, Intersect, Except (difference); Group By, Having, nested queries; views.
Slide 18 - Relational Database Model (continued): Join Examples
Using small example tables R (attributes r1, r2, r3), S, and T:
Cross join (cross product): Select * From R cross join S;  (equivalently: Select * From R, S;)
Theta join (arbitrary condition): Select * From R, T Where R.r1 < T.r1;  or  Select * From R join T On R.r3 < T.s1;
Equi join: a theta join whose condition uses equality, e.g. On R.r1 = T.r1.
Natural join (an equi join on the common attributes): Select * From R natural join T;
Inner join with an arbitrary condition: Select * From S inner join T on S.s3 > (T.r1 + T.s1);
Left outer join (all rows from the left table, Nulls where unmatched): Select * From R Left Outer Join T On R.r1 = T.r1;
Right outer join (all rows from the right table): Select * From R Right Outer Join T On R.r1 = T.r1;
Full outer join (all rows from both sides): Select * From R Full Outer Join T On R.r1 = T.r1;  which can also be emulated as (Select * From R Left Outer Join T On R.r1 = T.r1) Union (Select * From R Right Outer Join T On R.r1 = T.r1);
[The slide also shows the sample tables R, S, T and the result rows of each query, which do not survive the transcript.]
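To make the join behavior runnable, here is a small sketch using Python's built-in sqlite3 module with two hypothetical tables R and T (the column names and rows below are illustrative, not the exact ones from the slide):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R (r1 INTEGER, r2 INTEGER, r3 INTEGER);
    CREATE TABLE T (r1 INTEGER, s1 INTEGER);
    INSERT INTO R VALUES (1, 2, 3), (2, 3, 4);
    INSERT INTO T VALUES (1, 3), (3, 1);
""")

# Equi join: only rows whose r1 values match on both sides.
print(conn.execute("SELECT * FROM R JOIN T ON R.r1 = T.r1").fetchall())
# [(1, 2, 3, 1, 3)]

# Left outer join: every row of R, with NULLs where T has no match.
print(conn.execute("SELECT * FROM R LEFT OUTER JOIN T ON R.r1 = T.r1").fetchall())
# [(1, 2, 3, 1, 3), (2, 3, 4, None, None)]

# Cross join: every combination of rows from R and T.
print(conn.execute("SELECT count(*) FROM R CROSS JOIN T").fetchone())
# (4,)
```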
Slide 19 - Relational Database Model (continued): Set Operations and Nested Queries
Example of a nested query with grouping, over a small example table U with attributes u1, u2, u3:
Select count(*) From (Select u1 From U Group By u1 Having count(u2) > 2 AND sum(u3) > 4) as Temp;
[The slide also illustrates set exclusion over sample tables R, S, T, U and shows the query result (Count = 2); the sample data does not survive the transcript.]
Slide 20 - Relational Database Model (continued): Views
View: a saved query that represents a virtual table; allows information hiding. The virtual table is populated at access time, and access is read-only (Select ... From view_name ...).
Syntax: Create View view_name As SQL_Query;  Create Or Replace View view_name As SQL_Query;  Drop View view_name;
Materialized view: a saved query that represents a persistent (as opposed to virtual) table. Like a regular view with respect to information hiding and read-only access. Differences from a regular view: it is refreshed periodically (configurable), uses different DDL syntax (e.g. Create Materialized View ...), and is not available with every RDBMS.
Slide 21 - Object Database Model
ODBMS are also known as Object-Oriented Database Management Systems (OODBMS). An OODBMS is integrated with an object-oriented programming language: similar to an RDBMS, but with an object-oriented database model. Objects, classes, and inheritance are directly supported in the database schemas and in the query language.
Object-oriented concepts: a class (a template, like a cookie cutter) has properties (attributes) and behaviors (actions/methods), with access/visibility rules for both. An object (a cookie cut into the memory dough) is an instance of a class. Encapsulation means storing an object's properties and behaviors together as part of the instance. Relationships include inheritance (single, multiple) and the inheritance hierarchy.
Examples: db4o, Caché, eXtremeDB, Perst, Objectivity/DB, ObjectStore, Versant Object Database, ObjectDB, VOSS.
[Diagram: an inheritance hierarchy in which Employee IS-A Person IS-A Object. Person has properties SSN, Name, Birthdate with getters/setters plus getAge; Employee adds Org, Dept, Title, Mgr, EmployeeID, HireDate with their accessors plus getReportingHierarchy, getDirectReports, getCoworkers.]
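The inheritance hierarchy on the slide maps naturally onto classes in an object-oriented language. A minimal Python sketch (attribute names abbreviated from the slide, data made up):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Person:
    ssn: str
    name: str
    birthdate: date

    def get_age(self) -> int:
        today = date.today()
        years = today.year - self.birthdate.year
        # Subtract one year if the birthday has not happened yet this year.
        if (today.month, today.day) < (self.birthdate.month, self.birthdate.day):
            years -= 1
        return years

@dataclass
class Employee(Person):            # Employee IS-A Person
    org: str = ""
    title: str = ""
    hire_date: Optional[date] = None

e = Employee(ssn="000-00-0000", name="Ada", birthdate=date(1990, 5, 1),
             org="Engineering", title="Data Scientist", hire_date=date(2020, 2, 3))
print(e.name, e.get_age(), e.title)
```

An OODBMS persists such objects directly (with their relationships), rather than flattening them into tables.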
Slide 22 - Object Database Model (continued)
Object-oriented programming languages: C++, Java, C#, JavaScript, Ruby, Smalltalk, Scala, Groovy, ParaSail, Ceylon, Clojure, JRuby, ...
Object-oriented applications dynamically create and destroy objects and leverage an object graph during the application's execution.
Object-Oriented Database Management Systems support the modeling and creation of data as objects, including classes of objects and the inheritance of class properties and behaviors (methods) by subclasses and their objects. They Create, Read (Search), Update, and Delete objects in the database; the class structure is the database schema. They also provide transactions, queries, indexes, and administration (including tuning).
Persistence can be explicit or transparent. Explicit persistence: CRUD operations are performed in the code. Transparent persistence: objects are moved to and from the database invisibly.
[Diagram: the graph of instantiated objects at time t.]
Slide 23 - Object-Relational Database Model
Tries to bridge the gap between a traditional RDBMS and an OODBMS. Includes the full suite of RDBMS features; object-oriented features typically vary by vendor and revolve around the SQL-99 specification: inheritance (table & type), user-defined data types (attributes & tables), and functions for user-defined data types (UDTs).
Examples: PostgreSQL, CUBRID, Oracle, Informix, DB2, SQL Server.
Slide 24 - Object-Relational Database Model (continued)
A popular alternative to an ORDBMS is using an Object-Relational Mapping (ORM) framework with an RDBMS: Apache Cayenne, Hibernate, JDO, JPA, GORM, Active Record, ...
ORM frameworks allow software engineers to focus on and work with objects, allow database designers to focus on and work with relational database constructs, and handle the impedance mismatch between objects and tables transparently.
[Diagram: an object-oriented application exchanges objects with the ORM framework, which maps objects to tables (and vice versa) in the RDBMS.]
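For illustration, a minimal ORM sketch in Python using SQLAlchemy (not one of the frameworks named on the slide, but the same idea), assuming SQLAlchemy 1.4 or newer and an in-memory SQLite database; the Employee mapping is hypothetical:

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Employee(Base):
    """A Python class that the ORM maps onto a relational table."""
    __tablename__ = "employees"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    title = Column(String)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)          # the ORM emits the CREATE TABLE DDL
Session = sessionmaker(bind=engine)

with Session() as session:
    session.add(Employee(name="Ada", title="Data Scientist"))  # work with objects...
    session.commit()                                           # ...the ORM emits the INSERT
    for emp in session.query(Employee).all():                  # ...and the SELECT
        print(emp.id, emp.name, emp.title)
```

The application code above never writes SQL; the mapping layer translates between the object graph and the tables.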
Slide 25 - XML Database Model
Two approaches: XML-enabled and native XML databases. XML-enabled databases rely on a middle tier to transform XML into another database representation. Native XML databases store, manipulate, and query XML documents directly.
Examples: Sedna, Xindice, BaseX, eXist, MarkLogic Server, MonetDB/XQuery.
Slide 26 - Content Management Systems
Enterprise Content Management Systems (ECMs) provide a mechanism to organize documents in various formats (structured and unstructured data). Features include administrative and user tools, access control based on roles and permissions, storage and retrieval of data, version control, and workflows.
APIs: proprietary APIs; JSR-170 (Content Repository API for Java); Content Management Interoperability Services (CMIS), an open standard for controlling diverse document management systems and repositories.
Examples: OpenCMS, Alfresco, WordPress, Apache Lenya, Apache Jackrabbit, SharePoint, Interwoven, Documentum.
[Diagram: user applications and user/admin tools reach the content through the content management system's API.]
Slide 27 - Data Warehouse
An enterprise data warehouse is a giant data repository facilitating various types of data aggregation, reporting, business intelligence (BI), data mining, and analytics processing. Data from the various source systems is placed into the warehouse via extract, transform, and load (ETL) processes. Data marts represent specialized data warehouses. Exploration, mining, and reporting tools are leveraged to extract new insights from the data warehouse and data marts.
[Diagram: data sources such as Sales, Marketing, Supply Chain, and Operations feed the data warehouse (or data vault) through ETL; further ETL populates data marts, which are queried by exploration, mining, and reporting tools.]
Slide 28 - Distributed Databases
A database system in which the data (and sometimes the processing) is not centralized.
Database duplication: active/passive configuration.
Database replication: active/active configuration; requires conflict detection and resolution.
Database fragmentation (distributed RDBMS): data is distributed/partitioned across locations; vertical (column) and horizontal (row) slicing; semi-joins; shipping data across the network is expensive, so queries are optimized locally based on costs and the results are combined.
Slide 29 - Part III: New Methods for Storing Big Data
Slide 30 - Not Only SQL / No SQL (NOSQL) Approaches
NOSQL databases represent newer data models aimed at Big Data problems. They differ from the relational model in several ways: SQL is not used as the primary query language; fixed-table schemas may not be required; joins are generally not supported; ACID (atomicity, consistency, isolation, durability) guarantees may not exist; and architectures typically leverage massively distributed computing resources (processing and storage).
CAP Theorem (Brewer's Theorem): it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: Consistency (all nodes see the same data at the same time), Availability (every request receives a response about whether or not it was successful), and Partition Tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system).
Slide 31 - Not Only SQL (NOSQL) Approaches (continued)
NOSQL database types are categorized according to the way they store their data: key-value stores, document stores, column stores, and graph databases.
Sharding (horizontal partitioning): breaking a large database into several smaller databases that share nothing; the smaller databases (shards) can be distributed across multiple servers. The motivation: as the size of a database and the number of transactions increase linearly, query response time increases exponentially, so the application routes each request to the appropriate shard via a partitioning scheme (e.g. a hash function), as sketched below.
[Diagram: an application uses a partitioning scheme to split a large database into Shard 1 .. Shard N.]
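A minimal sketch of hash-based shard routing in Python (the shard count and keys are illustrative):

```python
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]   # stand-ins for separate database servers

def shard_for(key: str) -> int:
    """Partitioning scheme: a stable hash of the key picks the shard."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key: str, value) -> None:
    shards[shard_for(key)][key] = value

def get(key: str):
    return shards[shard_for(key)].get(key)

for user in ("alice", "bob", "carol", "dave"):
    put(user, {"name": user})

print([len(s) for s in shards])   # rows spread across the shards
print(get("carol"))
```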
Slide 32 - Key-Value Stores
Conceptually a two-column table (key, value): keys are unique; values do not have to be unique.
Basic operations: AddOrUpdate(key, value), GetValue(key), DeleteKey(key), DoesKeyExist(key).
Feature differentiation: complexity of the keys; advanced operations (e.g. expiry, lists, sets, hashes, ...); distributed vs. non-distributed; memory-resident vs. disk-based.
Examples: Redis, Voldemort, Riak, Hibari, MemcacheDB, BerkeleyDB, Amazon S3, ...
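The basic operations map almost directly onto a Python dictionary. A toy in-memory sketch (real stores such as those listed above add persistence, distribution, and richer value types):

```python
class KeyValueStore:
    """Toy in-memory key-value store exposing the four basic operations."""

    def __init__(self):
        self._data = {}

    def add_or_update(self, key, value):
        self._data[key] = value          # insert if absent, overwrite if present

    def get_value(self, key):
        return self._data[key]           # raises KeyError if the key is missing

    def delete_key(self, key):
        self._data.pop(key, None)

    def does_key_exist(self, key) -> bool:
        return key in self._data

store = KeyValueStore()
store.add_or_update("user:42", {"name": "Ada", "handicap": 5})
print(store.does_key_exist("user:42"))   # True
print(store.get_value("user:42"))
store.delete_key("user:42")
print(store.does_key_exist("user:42"))   # False
```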
Slide 33 - Document Stores
A database of JavaScript Object Notation (JSON) or Binary JSON (BSON) "documents". JSON is a light-weight data interchange format based on a subset of the object-oriented JavaScript programming language (see http://json.org for the syntax). Documents are analogous to records with fields and values; documents are grouped into collections, and collections are indexed. Programming-language tools and frameworks are available.
Examples: CouchDB, MongoDB, RavenDB, ...
JSON example:
{
  "name": "Michael",
  "GolfHandicapIndex": 5,
  "Scores": [
    {"course": "Lakeridge", "score": 73},
    {"course": "Wolf Run", "score": 77},
    {"course": "Wildcreek", "score": 79}
  ]
}
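In Python the same document round-trips through the standard json module. A small sketch that keeps documents in an in-memory "collection" (a plain list standing in for a real document store):

```python
import json

collection = []   # a toy stand-in for a document collection

doc = {
    "name": "Michael",
    "GolfHandicapIndex": 5,
    "Scores": [
        {"course": "Lakeridge", "score": 73},
        {"course": "Wolf Run", "score": 77},
        {"course": "Wildcreek", "score": 79},
    ],
}
collection.append(doc)

# Serialize to JSON text (what a document store persists or ships over the wire)...
text = json.dumps(doc)
# ...and parse it back into a nested Python structure.
parsed = json.loads(text)

# A simple "query": best (lowest) golf score per document in the collection.
for d in collection:
    best = min(s["score"] for s in d["Scores"])
    print(d["name"], best)     # Michael 73
```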
Slide 34 - Column Stores
Motivated by Google's "Bigtable: A Distributed Storage System for Structured Data" [2006]: random read/write access to big data, consisting of billions of rows and millions of columns, atop clusters of commodity hardware. "A Bigtable is a sparse, distributed, persistent, multi-dimensional sorted map" indexed by [row key, column key, timestamp].
Main ideas: large tables can be expensive to process (an entire row must be read), so the data is partitioned vertically according to its attributes (columns). Rows and columns comprise the data model, stored as extensible records partitioned across nodes: horizontal sharding is based on row keys (key ranges), and columns can be partitioned into column groups/column families, allowing related columns to be kept together.
Examples: Bigtable, Apache HBase, Apache Cassandra, Apache Accumulo, DynamoDB, Hypertable, ...
[Diagram: a table with rows Row1..RowN and columns ColA..ColM (M and N large), with a timestamp dimension per cell.]
Slide 35 - Column Stores (continued)
[Diagram: the same Row1..RowN by ColA..ColM table with its timestamp dimension, with the row keys partitioned into 4 ranges and the columns aggregated into 5 column groups/families (CF1..CF5), so the database is distributed over 4 x 5 = 20 nodes.] A sketch of the underlying sparse map follows.
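A minimal Python sketch (with made-up row and column keys) of the "sparse, distributed, persistent, multi-dimensional sorted map" idea: cells are addressed by (row key, column key, timestamp), several timestamped versions can coexist, and only cells that were actually written are stored.

```python
import time

# Cell address: (row_key, "column_family:column", timestamp) -> value
table = {}

def put(row_key, column, value, ts=None):
    """Write a new version of a cell; older versions are kept."""
    ts = ts if ts is not None else time.time()
    table[(row_key, column, ts)] = value

def get_latest(row_key, column):
    """Return the most recent version of a cell, or None if it was never written."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row_key and c == column]
    return max(versions)[1] if versions else None

put("com.example.www", "anchor:cnnsi.com", "CNN")
put("com.example.www", "contents:html", "<html>v1</html>", ts=1)
put("com.example.www", "contents:html", "<html>v2</html>", ts=2)

print(get_latest("com.example.www", "contents:html"))    # <html>v2</html>
# Sparsity: cells that were never written simply do not exist in the map.
print(get_latest("com.example.www", "anchor:missing"))   # None
```

In a real column store the map is additionally sorted by row key and split across nodes by key range and column family, as the diagram above indicates.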
Slide 36 - Graph Databases
Basics: leverage graph structures with nodes, edges, and properties to represent and store data. Nodes are similar to objects in that they have attributes/properties. Edges represent a relationship between two nodes, or between a node and a property. Properties are attributes associated with nodes or edges (relationships). Hyper-edges represent a relationship among a set of nodes. A traversal navigates the graph and is the equivalent of performing a query. Graph databases are fully transactional, enterprise-strength databases; application developers leverage an object-oriented, flexible network structure instead of static tables.
[Diagram: an undirected graph (nodes A, B, C with undirected edges and property lists Property1: Value1, ...) and a directed graph (nodes D, E, F, G with directed edges).]
Slide 37 - Graph Databases (continued)
Key points: nodes (vertices) represent entities; edges represent relationships; both nodes and edges can have properties (property graphs). Query = traversal. Network science: graphs are everywhere!
Examples: Neo4j, InfiniteGraph, OrientDB, AllegroGraph, Titan, ...
[Diagram: a talent-management property graph. Person nodes (Name, ID, ...) follow other people, are interested-in Interest Areas, have Areas of Expertise, and have Job Roles (with YearsInRole, Ratings); Job Roles require Expertise Areas with a RequiredProficiencyRating.]
Queries/traversals/algorithms over such a graph can identify: Who are the best mentors for a person P? Who are the experts in expertise area E? What areas does person P need to improve in? What are the intellectual capital risk areas? How do the people in job role J compare, and who is promotion-ready? A small traversal sketch follows.
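A minimal in-memory sketch (Python, hypothetical data) of "query = traversal": nodes carry properties, edges carry a relationship type, and answering "who are the experts in area E?" is a walk over the incoming edges of that area's node.

```python
from collections import defaultdict

# Nodes with properties, keyed by node id.
nodes = {
    "p1": {"kind": "Person", "name": "Ada"},
    "p2": {"kind": "Person", "name": "Grace"},
    "e1": {"kind": "ExpertiseArea", "name": "Machine Learning"},
    "j1": {"kind": "JobRole", "name": "Data Scientist"},
}

# Directed, typed edges: (source, relationship, target).
edges = [
    ("p1", "has-expertise", "e1"),
    ("p2", "has-expertise", "e1"),
    ("p1", "has-role", "j1"),
    ("j1", "requires", "e1"),
]

incoming = defaultdict(list)
for src, rel, dst in edges:
    incoming[dst].append((rel, src))

def experts_in(area_name):
    """Traversal: ExpertiseArea node <- has-expertise <- Person nodes."""
    area_ids = [n for n, p in nodes.items()
                if p["kind"] == "ExpertiseArea" and p["name"] == area_name]
    return [nodes[src]["name"]
            for a in area_ids
            for rel, src in incoming[a] if rel == "has-expertise"]

print(experts_in("Machine Learning"))   # ['Ada', 'Grace']
```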
Slide 38 - NewSQL Approaches
Motivation: high-volume, transaction-oriented systems (e.g. financial, order processing) cannot give up strong transactional and consistency requirements, and are therefore left wanting with respect to NOSQL options.
NewSQL: a class of RDBMS that aims to provide the same scalability as NOSQL systems for on-line transaction processing (OLTP), i.e. significant read/write activity, whilst still maintaining ACID properties and utilizing SQL as the primary interface.
Approaches:
Parallel databases (parallelization of various operations) on multi-processor architectures: shared memory (multiple processors share the main memory space), shared disk (nodes have autonomous memory but share mass storage), shared nothing (nodes have their own main memory and mass storage), and hybrid architectures such as Non-Uniform Memory Access (NUMA), where memory access cost depends on its location relative to the processor (local vs. other vs. shared).
Clusters: a distributed cluster of shared-nothing nodes where each node owns a subset of the data; transactions and queries are fragmented and routed to the nodes that contain the needed data.
In-memory databases: memory is always faster than mass storage; suited to high-volume transactions that are short-lived, access a small subset of the available data, and are executed over and over with different inputs.
NewSQL examples: Google Spanner, Clustrix, FoundationDB, NuoDB, TransLattice, VoltDB, Pivotal's GemFire & SQLFire, MemSQL.
Slide 39 - Data Virtualization
An approach to data management that allows an application to retrieve and manipulate data without requiring details about the underlying data model or where the data is located. It differs from ETL in that the source data remains in place; data in the source systems is readable and can also be writable.
Features: abstraction (location, data model, API, access language, ...), virtualized data access (a common access point), transformation (transform/reformat for use), data federation (combine data sets from multiple sources), and data delivery (publish views and/or services for reuse).
Examples: Denodo, Composite (Cisco), Informatica, IBM SmartCloud Data Virtualization, ...
References: www.cisco.com/web/services/enterprise-it-services/data-virtualization/documents/cisco-information-server-62-ds.pdf and http://www.denodo.com/en/product/features.php
Slide 40 - Part IV: The Big Data Tool Zoo - Tools for Processing and Accessing Big Data
Slide 41 - The Big Data Tool Zoo (Part 1): Apache Hadoop
A framework which enables distributed processing of large data sets across clusters of computers. Primary components:
Hadoop Common: common utilities and libraries that support the other modules.
Hadoop Distributed File System (HDFS): redundant, reliable storage designed to run on commodity hardware; highly fault tolerant (failures are expected), with fast fault detection and automatic recovery; suitable for large data sets distributed across multiple nodes; provides high aggregate data bandwidth; designed for batch processing and streaming access; scales to hundreds of nodes in a single cluster and supports tens of millions of files in a single instance. After building an HDFS instance, interactive access is provided via File System (FS) shell commands, for example:
hadoop fs -ls
hadoop fs -mkdir -p /user/hadoop/demo
hadoop fs -copyFromLocal myBigData /user/hadoop/demo
hadoop fs -cat /user/hadoop/demo/myBigData
(many other commands exist)
Hadoop YARN: framework for job scheduling and cluster resource management; facilitates a broad array of interaction patterns for data in HDFS, such as batch (MapReduce), interactive (Tez), online (HBase), streaming (Storm), graph (Giraph), in-memory (Spark), and others.
Hadoop MapReduce: batch-oriented, parallel processing of large data sets (Map, Shuffle, Reduce).
Other ecosystem tools also process large data sets.
Slide 42 - MapReduce
Introduced in a Google paper [2004]: a high-level programming model and implementation for large-scale parallel data processing. Hadoop provides a free variant. Google claimed in 2014 that its Cloud Dataflow is meant to replace MapReduce.
MapReduce programming model: input and output are each a set of (key, value) pairs. The programmer specifies two functions:
Map(inKey, inValue) -> List(outKey, intermediateValue): processes an input key/value pair and produces (emits) a list of intermediate key/value pairs.
Reduce(outKey, List(intermediateValue)) -> List(outValue): combines all intermediate values for a particular key and produces a set of merged output values (usually just one).
Phases: the input data is processed in parallel to elicit a set of (key, value) pairs. In the map phase, the system applies the map function in parallel to all input key/value pairs in the input file. In the shuffle phase, all pairs with the same intermediate key are grouped together, similar to what an SQL "group by" would do. In the reduce phase, the system applies the reduce function in parallel to all intermediate values for a particular key and produces a set of merged output values as a result.
Slide 43 - MapReduce Examples
Word frequency from a corpus of documents: how many times does each unique word occur? (Map, Shuffle, Reduce.)
Word-length histograms from a corpus of documents: how many big, medium, and small words occur, where big = 12+ letters, medium = 5..9 letters, and small = 1..4 letters? (Map, Shuffle, Reduce.) A plain-Python sketch of both follows.
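A plain-Python sketch (no Hadoop required) that mimics the three phases for both examples; the bucketing thresholds follow the slide, and the sample documents are made up:

```python
from collections import defaultdict

documents = ["big data is a collection of data sets",
             "processing big data sets is difficult"]

def map_word_count(doc):
    for word in doc.split():
        yield (word, 1)                      # emit (word, 1) per occurrence

def map_word_length(doc):
    for word in doc.split():
        if len(word) >= 12:
            bucket = "big"
        elif len(word) >= 5:
            bucket = "medium"
        else:
            bucket = "small"
        yield (bucket, 1)

def shuffle(pairs):
    """Group all intermediate values by key, like an SQL GROUP BY."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_sum(key, values):
    return (key, sum(values))

def run(map_fn, docs):
    intermediate = (pair for doc in docs for pair in map_fn(doc))
    return dict(reduce_sum(k, vs) for k, vs in shuffle(intermediate).items())

print(run(map_word_count, documents))    # e.g. {'big': 2, 'data': 3, ...}
print(run(map_word_length, documents))   # counts of small/medium/big words
```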
Slide 44 - MapReduce: Try It for Yourself with jsmapreduce
Go to www.jsmapreduce.com and register for a free account. Try the examples (JavaScript and/or Python), then extend them or experiment with your own data. The page provides panes for the input data, the Map function, the Reduce function, execution controls, status, and output.
Slide 45 - JSMapreduce Example: Add a 4-card straight to Poker
Slide 46 - The Big Data Tool Zoo (Part 2): A Broader Perspective
Core layers: Hadoop Common (utilities and libraries), the Hadoop Distributed File System (HDFS: redundant, reliable storage), and Hadoop YARN (job scheduling and cluster resource management), with MapReduce as the batch execution framework on top. Tools built on or alongside this stack include:
Tez: application execution framework for complex directed-acyclic-graph (DAG) data processing tasks; accelerates Hadoop query processing.
Pig: analysis of large data sets; parallelization of MapReduce tasks; the Pig Latin language.
Hive: data warehouse software allowing the querying and managing of large datasets which reside in distributed storage, using an SQL-like language called HiveQL; also allows custom mappers and reducers; includes HCatalog table storage/management.
HBase: Bigtable-like (column store) fault-tolerant distributed database.
Cassandra: fault-tolerant Bigtable-like distributed database.
Storm: real-time distributed processing of incoming data streams; real-time analytics, machine learning, ...
Giraph: iterative graph processing which extends Google's Pregel.
Hama: Bulk Synchronous Parallel (BSP) computing; advanced analytics beyond MapReduce; network algorithms, graph algorithms, machine learning, ...
Spark: in-memory analytics, 100x faster than MapReduce; general-purpose data processing for large datasets; combines SQL (Spark SQL), streaming, and complex analytics.
Twill: distributed application development framework; facilitates generic cluster resource management.
Oozie: workflow scheduler (MapReduce, Pig, Hive, ...).
Ambari: provision, manage, and monitor Hadoop clusters.
Sqoop: bulk data transfer between Hadoop and structured data stores.
Tajo: SQL-supported big data warehouse system for Hadoop.
Samza: distributed stream processing, leveraging the Kafka messaging framework.
Drill: distributed query engine; extends Google's Dremel.
Flume: streaming event data.
Mahout: machine learning.
Chukwa: monitoring.
ZooKeeper: distributed configuration management.