Hadoop Pig


What Is Apache Pig and Why Use It?

  • Apache Pig is a platform for analyzing large data sets. It consists of
    • Pig Latin: a high-level data-flow scripting language for expressing data analysis
    • Pig runtime: the infrastructure that evaluates Pig Latin scripts
  • A script written in Pig Latin is translated into MapReduce jobs by the Pig runtime
  • Pig scripts are easier to write and run than MapReduce code
    • Pig is designed so that data can be analyzed without help from MapReduce Java developers
    • Pig is widely used as an alternative to hand-written MapReduce
    • Users include Yahoo, LinkedIn, Twitter, Netflix, etc.

 

MapReduce vs. Pig

| MapReduce | Pig |
|---|---|
| Must be written in Java | Written in a scripting language |
| Must be done in the "map" and "reduce" paradigm (mandates low-level thinking) | High-level abstraction (enables more natural thinking) |
| Hard to do useful yet advanced operations such as filtering, sorting, aggregation, joining, splitting | Easy to do filtering, sorting, aggregation, joining, splitting |
| | Provides several sophisticated data types (tuples, bags, and maps) |
| | Roughly 20 times shorter than the equivalent MapReduce code, and only slightly slower |

 

SQL vs. Pig

| SQL | Pig |
|---|---|
| Declarative language | Data flow language |
| Each query returns one complete result, so a single query can become complex | Data can be stored or dumped at any point in the pipeline |

 

Pig Use Cases

  • Scaling an Extract-Transform-Load (ETL) process on a large data set
    • Pig is built on top of Hadoop, which scales out to large clusters of servers, so it can process large data sets
  • Analyzing data with unknown or inconsistent schema
    • Pig can load, filter (clean), join, group, sort, and aggregate large amounts of data with unknown or inconsistent schema

Benefits of Pig Latin Language

  • Ease of programming
    • Provides a high-level scripting language
    • Complex tasks made up of multiple interrelated data transformations (joining, grouping, splitting, etc.) can be expressed easily
  • Optimized implementation
    • The Pig runtime optimizes the execution of scripts automatically, so you can focus on what the script does rather than on efficiency
  • Extensibility
    • Users can create their own functions (called User Defined Functions) to do special-purpose processing
    • Users can write their own script vocabulary

Executing Pig

Pig Execution Modes: Shell and Script

  • Shell (Grunt shell) execution mode
    • Interactive shell for executing Pig commands
    • Starts when the "pig" command is run without a script file
    • Supports file system commands within the shell, such as "cat" or "ls"
    • Can execute scripts within the shell via the "run" or "exec" commands
    • Useful for development
  • Script execution mode
    • Execute “pig” command with a script file

Pig Execution Modes: Local vs MapReduce

  • Local mode
    • Executes in a single JVM
    • Accesses files on local file system
    • Used for development
    • pig -x local (Interactive shell called Grunt)
    • pig -x local <script-file> (Script)
  • MapReduce mode (Hadoop mode)
    • Executes on Hadoop cluster and HDFS
    • pig or pig -x mapreduce (Interactive shell called Grunt)
    • pig <script-file> or pig -x mapreduce <script-file> (Script)

Ways to execute Pig Script

  • Running Pig at the command line
  • Using locally installed Hue (Hadoop Web UI)
  • Using net accessible Hue (Hadoop Web UI) through Cloudera Live
    • Use this option if Hadoop is not installed on your machine

Example: word-count.pig

  • In the hands-on lab, we are going to do the following
    • Run each Pig statement in Grunt shell and dump the result
    • Run it as a script at the command-line
    • Run it via Hue Web UI
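
The lab script itself is not reproduced in these notes. A minimal sketch of what a word-count.pig could look like is given here; the input path 'input.txt' and all alias names are assumptions made for illustration.

```pig
-- word-count.pig (sketch): count how often each word occurs in a text file.
-- 'input.txt' is a hypothetical path; TextLoader reads each line as one chararray.
lines = LOAD 'input.txt' USING TextLoader() AS (line:chararray);

-- TOKENIZE turns a line into a bag of words; FLATTEN emits one tuple per word.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Collect identical words, then count the tuples in each group.
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;

-- Nothing is executed until DUMP (or STORE) is reached.
DUMP counts;
```

In the Grunt shell the statements can be entered one at a time, with a DUMP after each to inspect the intermediate relation. As a script, the file can be run with pig -x local word-count.pig.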

 

Pig Latin Concepts

Pig Latin Data Types

  • Field
    • Piece of data
  • Tuple
    • An ordered set of fields
    • Represented with parentheses (..)
      • Example: (11, john, us)
    • Comparable to "rows" in an RDBMS
  • Bag
    • Collection of tuples
    • Represented with braces {..}
      • Example: { (11, john, us), (33, shin) }
    • Comparable to a "table" in an RDBMS, except that a bag does not require all of its tuples to have the same number of fields or fields of the same type
  • Relation
    • Is a bag (more specifically, an outer bag)

Referencing Relations

  • Relations are referred to by name (or alias)
  • Names are assigned by you as part of the Pig Latin statement
  • In the example below, the name (alias) of the relation is A.
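
The original example is not included in these notes. A statement of the following shape would fit the description; the file name 'student' and its schema are assumptions.

```pig
-- A is the alias assigned to the relation produced by LOAD.
-- 'student' is a hypothetical tab-delimited file.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

-- Later statements refer to the relation by its alias.
DUMP A;
```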

Schema Data Types supported in Pig

  • Simple type
    • int, long, float, double
  • Array
    • chararray: Character array (string) in UTF-8
    • bytearray: Byte array (blob)
  • Complex data types
    • tuple: an ordered set of fields (19,2)
    • bag: a collection of tuples {(19,2), (18,1)}
    • map: a set of key value pairs [open#apache]
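
As a sketch of how these types appear in a schema, the statement below declares one field of each kind. The file name 'data' and the field names are hypothetical.

```pig
-- id and name use simple types.
-- t is a tuple such as (19,2).
-- b is a bag of tuples such as {(19,2),(18,1)}.
-- m is a map such as [open#apache].
A = LOAD 'data' AS (
    id:int,
    name:chararray,
    t:tuple(x:int, y:int),
    b:bag{r:tuple(x:int, y:int)},
    m:map[]
);

-- Print the declared schema.
DESCRIBE A;
```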

Referencing Fields

  • Fields are referred to by positional notation or by name
  • Positional notation is indicated with the dollar sign ($) and begins with zero (0); for example, $0, $1, $2
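
Both notations can be mixed in one statement. A small sketch, assuming a hypothetical 'student' file:

```pig
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

-- name and $0 refer to the same field; $2 is the third field (gpa).
X = FOREACH A GENERATE name, $2;
DUMP X;
```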

Referencing Complex Data Type Fields

  • The fields in a tuple can be any data type, including the complex data types: bags, tuples, and maps
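
Fields nested inside a complex type are reached with the dot operator (tuples and bags) or the # operator (maps). The sketch below assumes a hypothetical file 'data' with one tuple field and one map field.

```pig
A = LOAD 'data' AS (t1:tuple(t1a:int, t1b:int, t1c:int), m:map[]);

-- t1.t1a selects a tuple field by name, t1.$1 by position,
-- and m#'open' looks up the value stored under the map key 'open'.
X = FOREACH A GENERATE t1.t1a, t1.$1, m#'open';
DUMP X;
```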

Outer Bag and Inner Bag
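
The distinction is easiest to see after a GROUP. In the sketch below, the file 'data' and its contents are assumed for illustration; the order of tuples in the output is not guaranteed.

```pig
-- Assume 'data' holds the tuples (1,2,3), (4,2,1), (8,3,4), (4,3,3).
-- A is a relation, that is, an outer bag of tuples.
A = LOAD 'data' AS (f1:int, f2:int, f3:int);

-- Each tuple of X has two fields: the group key and an inner bag
-- holding the tuples of A that share that key, for example:
--   (1,{(1,2,3)})
--   (4,{(4,2,1),(4,3,3)})
--   (8,{(8,3,4)})
X = GROUP A BY f1;
DUMP X;
```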

Case Sensitivity

  • The names (aliases) of relations A, B, and C are case sensitive
  • The names (aliases) of fields f1, f2, and f3 are case sensitive
  • Function names PigStorage and COUNT are case sensitive
  • Keywords LOAD, USING, AS, GROUP, BY, FOREACH, GENERATE, and DUMP are case insensitive. They can also be written as load, using, as, group, by, etc.
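
The names listed above appear to come from a short script that is not reproduced in these notes. A sketch along those lines, with the file name 'data' assumed:

```pig
-- Keywords may be written in any case: load/using/as work like LOAD/USING/AS.
A = load 'data' using PigStorage() as (f1:int, f2:int, f3:int);
B = GROUP A BY f1;

-- Aliases (A, B, C), field names (f1) and function names (PigStorage, COUNT)
-- must be written exactly as defined; 'a' or 'count' would not resolve.
C = FOREACH B GENERATE COUNT(A);
DUMP C;
```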

 

Pig Latin Script: Operators & Functions

Pig Latin Operators

  • Relational operators
    • LOAD, FOREACH, FILTER, JOIN, GROUP, ORDER, UNION, SPLIT, STORE, LIMIT, DISTINCT
  • Arithmetic operators
    • +, -, *, /, %, etc.
  • Diagnostic operators
    • DESCRIBE, EXPLAIN, ILLUSTRATE, DUMP

Pig Latin Built-in Functions

  • Eval functions
    • AVG, COUNT, MAX, MIN, SIZE, SUM, TOKENIZE, etc.
  • Load/Store functions
    • PigStorage, BinStorage, HBaseStorage, etc.
  • Math functions
    • ABS, RANDOM, SQRT, CEIL, FLOOR, etc.
  • String functions
    • LOWER, UPPER, SUBSTRING, REPLACE, ENDSWITH, etc.
  • Datetime functions
    • CurrentTime, DaysBetween, etc.

Pig Latin Script Structure

  • A Pig script is made of a sequence of Pig statements
  • A Pig statement is constructed using
    • Relational operator
    • Arithmetic operator
    • Functions
  • The result of Pig statement execution is captured into a relation
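
A single statement often combines all three building blocks. A sketch, with the file 'student' and the field names assumed:

```pig
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);

-- Relational operator: FOREACH ... GENERATE
-- Function:            UPPER
-- Arithmetic operator: +
-- The result is captured in the relation B.
B = FOREACH A GENERATE UPPER(name) AS name, age + 1 AS age_next_year;
```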

 

Pig Latin Script: Flow

Typical Pig Latin Script Flow

Step #1: Load data from the file system

  • LOAD

Step #2: Perform a series of "transformations" on the data

  • FILTER, FOREACH, GROUP, JOIN, UNION, SPLIT, ORDER

Step #3: Execute and then display or save the result

  • DUMP (to display result) or STORE (to save the result into HDFS or HBase)
  • DUMP or STORE triggers the execution

Step #1: Load Data using LOAD statement

LOAD 'data' [USING function] [AS schema];

  • 'data': The name of the file or directory, in single quotes
  • [USING function]: specifies the load function to use
    • PigStorage is the default function
    • PigStorage( [field_delimiter] , ['options'] ): The default field delimiter is tab ('\t'); another delimiter character can be given in single quotes, for example ','
  • [AS schema]: Schemas enable you to assign names to fields and declare types for fields
    • You can use the DESCRIBE operator to view the schema
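
Putting the three parts of the statement together, as a sketch with a hypothetical comma-separated file 'student.csv':

```pig
-- PigStorage(',') overrides the default tab delimiter.
-- The AS clause names the fields and declares their types.
A = LOAD 'student.csv' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);

-- Show the schema that was attached to A.
DESCRIBE A;
```

If the AS clause is omitted, the fields have no names and can only be referenced by position ($0, $1, ...).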

Step #2: Perform Transformation to Data

  • Use FILTER operator to work with tuples (rows of data)
    • To filter out tuples based on a condition
  • Use FOREACH operator to work with columns of data
    • To select columns
  • With a single relation
    • Use GROUP operator to group data in a single relation
  • With multiple relations (we will cover these in the Pig Part 2 presentation)
    • Use COGROUP, inner JOIN, and outer JOIN operators to group or join data in two or more relations
    • Use UNION operator to merge the contents of two or more relations
    • Use SPLIT operator to partition the contents of a relation into multiple relations

 

Pig Latin Relational Operators

FILTER Operator

  • Use the FILTER operator to work with tuples or rows of data (if you want to work with columns of data, use the FOREACH…GENERATE operation).
  • FILTER is commonly used to select the tuples that you want; or, conversely, to filter out (remove) the tuples you don’t want
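
Both uses can be sketched as follows; the file 'student' and the conditions are made up for illustration.

```pig
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);

-- Select the tuples you want: adults with a GPA above 3.0.
B = FILTER A BY age >= 18 AND gpa > 3.0;

-- Conversely, remove the tuples you do not want.
C = FILTER A BY NOT (name == 'john');

DUMP B;
```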

FOREACH … GENERATE Operator
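
FOREACH ... GENERATE works on columns: it produces one output tuple per input tuple, built from the listed expressions. A sketch with an assumed 'student' file:

```pig
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);

-- Projection: keep only two of the three columns.
B = FOREACH A GENERATE name, gpa;

-- Generated columns can be expressions; AS names the new field.
C = FOREACH A GENERATE name, gpa / 4.0 AS gpa_ratio;

DUMP C;
```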

FLATTEN Operator

  • Flatten un-nests tuples as well as bags
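
The two cases behave differently: flattening a tuple widens the row, while flattening a bag produces one row per element. A sketch with a hypothetical file 'data' and assumed field names:

```pig
A = LOAD 'data' AS (id:int, t:tuple(x:int, y:int), b:bag{r:tuple(v:chararray)});

-- Un-nesting a tuple: (1,(2,3)) becomes (1,2,3).
flat_tuple = FOREACH A GENERATE id, FLATTEN(t);

-- Un-nesting a bag: (1,{(a),(b)}) becomes the two tuples (1,a) and (1,b).
flat_bag = FOREACH A GENERATE id, FLATTEN(b);
```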

TOKENIZE Function

  • Splits a string into tokens and outputs as a bag of words
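
A sketch of TOKENIZE on its own, without FLATTEN, so that the bag stays visible; the input path is an assumption.

```pig
-- TextLoader reads each line of the hypothetical file as one chararray.
lines = LOAD 'input.txt' USING TextLoader() AS (line:chararray);

-- For the line "hello pig world", TOKENIZE yields the bag {(hello),(pig),(world)}.
tokens = FOREACH lines GENERATE TOKENIZE(line) AS words;
DUMP tokens;
```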

GROUP Operator
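
GROUP collects the tuples that share a key into one output tuple. A sketch with an assumed 'student' file:

```pig
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);

-- Each tuple of B has two fields: 'group' (the key, here the age)
-- and a bag named A that holds the matching input tuples.
B = GROUP A BY age;

-- Aggregate functions are then applied to the bag of each group.
C = FOREACH B GENERATE group AS age, COUNT(A) AS n, AVG(A.gpa) AS avg_gpa;
DUMP C;
```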

ORDER alias BY Operator
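
ORDER sorts a relation by one or more fields. A sketch with an assumed 'student' file:

```pig
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);

-- Sort by gpa, highest first; ties are broken by name in ascending order.
B = ORDER A BY gpa DESC, name ASC;

-- LIMIT right after ORDER keeps the first tuples of the sorted relation.
top3 = LIMIT B 3;
DUMP top3;
```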

Step #3: DUMP or STORE

  • No action is taken until DUMP or STORE commands are executed
    • Pig will parse, validate, and analyze the statements but not execute them until DUMP or STORE commands are executed
  • DUMP is for displaying the result to the screen
    • Mostly used during development time
  • STORE is for saving the results into HDFS or HBase
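
A sketch of both commands; the file name 'student' and the output path 'honor_roll' are hypothetical.

```pig
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
B = FILTER A BY gpa > 3.0;

-- Up to this point Pig has only parsed and validated the statements.
-- DUMP triggers execution and prints the result to the screen.
DUMP B;

-- STORE triggers execution and writes the result as comma-separated text.
-- The output directory must not already exist.
STORE B INTO 'honor_roll' USING PigStorage(',');
```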

 

Pig Latin Diagnostic Operators

DESCRIBE, ILLUSTRATE, EXPLAIN

  • DESCRIBE displays the schema of a relation
  • ILLUSTRATE shows, step by step, how the Pig engine transforms a small sample of the data
  • EXPLAIN produces various reports
    • Logical plan, Physical plan, MapReduce plan
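
All three operators take the alias of a relation. A sketch with an assumed 'student' file:

```pig
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
B = GROUP A BY age;

-- Print the schema of B.
DESCRIBE B;

-- Walk a small sample of the data through each statement that leads to B.
ILLUSTRATE B;

-- Print the logical, physical, and MapReduce plans used to compute B.
EXPLAIN B;
```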