Hadoop Pig

What is and Why Apache Pig?

Apache Pig is a platform for analyzing large data sets and is made of
- Pig Latin: a high-level, data flow, scripting language for expressing data analysis
- Pig runtime: มีโครงสร้างพื้นฐานสำหรับการประเมินผล script
Pig script ที่เขียนด้วย Pig Latin จะถูกแปลงเป็นงาน MapReduce โดย Pig runtime
Pig script มีการสร้างและการเรียกใช้งานได้ง่ายกว่า MapReduce code
- ถูกออกแบบมาเพื่อใช้ในการวิเคราะห์ข้อมูล โดยไม่จำเป็นต้องมีการช่วยเหลือจาก MapReduce Java developers
- Pig ใช้กันอย่างแพร่หลาย เป็นทางเลือกให้กับ MapReduce
- ใช้กันอย่างแพร่หลาย เช่น Yahoo, LinkedIn, Twitter, NetFlix etc.

MapReduce	VS	Pig
Must be written in Java		Written in scripting language
Must be done in “map” and “reduce” paradigm (mandates low-level thinking)		High-level abstraction (enables more natural thinking)
Hard to do more useful yet advanced operations such as filtering, sorting, aggregation, joining, splitting		Easy to do filtering, sorting, aggregation, joining, splitting
		Provides several sophisticated data types (tuples, bags and maps)
		~20 times shorter than MapReduce code with only slightly slower than MapReduce

SQL	VS	Pig
Declarative language		Data flow language
Complete result for each query the query could be complex		Data can be stored/dumped at any point in the pipeline

Pig Use Cases

Scaling Extract – Transform – Load (ETL) process on a large data set
- Pig is built on top of Hadoop, สามารถขยายเป็น server ขนาดใหญ่ได้ จึงสามารถประมวลชุดข้อมูลขนาดใหญ่ได้
Analyzing data with unknown or inconsistent schema
- Pig can Load, Filter (clean), Join, Group,Sort, and Aggregate large amounts of data with unknown or inconsistent schema

Benefits of Pig Latin Language

Ease of programming
- มีภาษา script ระดับสูง
- งานที่ความซับซ้อน ซึ่งมีการแปลงข้อมูลหลายแบบที่เกี่ยวข้องกัน (joining, grouping, splitting, etc) สามารถแสดงได้อย่างง่ายดาย
Optimized implementation
- Pig runtime is highly optimized
Extensibility
- ผู้ใช้สามารถสร้างฟังก์ชันของตนเอง (เรียกว่า User Defined Functions) เพื่อทำการประมวลผลพิเศษ
- ผู้ใช้สามารถเขียนคำศัพท์ script ของตนเองได้

Executing Pig

Pig Execution Modes: Shell and Script

Shell (Grunt shell) execution mode
- Interactive shell for executing Pig commands
- Started when script file is not provided when “pig” command is run
- Support file system commands within the shell such as “cat” or “ls”
- Can execute scripts within the shell via “run” or “exec” commands
- Useful for development
Script execution mode
- Execute “pig” command with a script file

Pig Execution Modes: Local vs MapReduce

Local mode
- Executes in a single JVM
- Accesses files on local file system
- Used for development
- pig -x local (Interactive shell called Grunt)
- pig -x local <script-file> (Script)
MapReduce mode (Hadoop mode)
- Executes on Hadoop cluster and HDFS
- pig or pig -x mapreduce (Interactive shell called Grunt)
- pig <script-file> or pig -x mapreduce <script-file> (Script)

Ways to execute Pig Script

Running Pig at the command line
Using locally installed Hue (Hadoop Web UI)
Using net accessible Hue (Hadoop Web UI) through Cloudera Live
- ใช้ตัวเลือกนี้หากไม่ได้ติดตั้ง Hadoop ไว้ในเครื่อง

Example: word-count.pig

In the hands-on lab, we are going to do the following
- Run each Pig statement in Grunt shell and dump the result
- Run it as a script at the command-line
- Run it via Hue Web UI

Pig Latin Concepts

Pig Lation Data Types

Field
- Piece of data
Tuple
- An ordered set of fields
- Represented with parentheses (..) Example: (11, john, us)
- เทียบได้กับ “rows” ใน RDBMS
Bag
- Collection of tuples
- Represented with braces {..}
  - Example: { (11, john, us), (33, shin) }
- เทียบกับ “ตาราง” ใน RDBMS – Bags ไม่จำเป็นต้องให้ tuples ทั้งหมดมีหมายเลขเดียวกันหรือข้อมูลประเภทเดียวกัน
Relation
- Is a bag (more specifically, an outer bag)

Referencing Relations

ความสัมพันธ์ถูกเรียกตามชื่อ (หรือนามแฝง)
Names are assigned by you as part of the Pig Latin statement
In the example below, the name (alias) of the relation is A.

Schema Data Types supported in Pig

Simple type
- int, long, float, double
Array
- chararray: Character array (string) in UTF-8
- bytearray: Byte array (blob)
Complex data types
- tuple: an ordered set of fields (19,2)
- bag: a collection of tuples {(19,2), (18,1)}
- map: a set of key value pairs [open#apache]

Referencing Fields

Fields are referred to by positional notation or by name
Positional notation is indicated with the dollar sign ($) and begins with zero (0); for example, $0, $1, $2

Referencing Complex Data Type Fields

The fields in a tuple can be any data type, including the complex data types: bags, tuples, and maps

Outer Bag and Inner Bag

Case Sensitivity

The names (aliases) of relations A, B, and C are case sensitive
The names (aliases) of fields f1, f2, and f3 are case sensitive
Function names PigStorage and COUNT are case sensitive
Keywords LOAD, USING, AS, GROUP, BY, FOREACH, GENERATE, and DUMP are case insensitive. They can also be written as load, using, as, group, by, etc.

Pig Latin Script: Operators & Functions

Pig Latin Operators

relational operators
- LOAD,FOREACH, FILTER, JOIN, GROUP, ORDER, UNION, SPLIT, STORE, DUMP, LIMIT, DISTINCT
Arithmetic operators
- +, -, *,/ ,% ,etc.
Diagnostic operators
- DESCRIBE, EXPLAIN, ILLUSTRATE, DUMP

Pig Latin Built-in Function

Eval functions
- AVG, COUNT, MAX, MIN, SIZE, SUM, TOKENIZE, etc
Load/Store functions
- PigStorage, BinStorage, HBaseStorage, etc
Math functions
- ABS, RANDOM, SORT, CEIL, FLOOR, etc
String functions
- LOWER, UPPER, SUBSTRING, REPLACE, ENDSWITH, etc
Datetime functions
- CurrentTime, DaysBetween, etc

Pig Latin Script Structure

A Pig script is made of a sequence of Pig statements
A Pig statement is constructed using
- Relational operator
- Arithmetic operator
- Functions
The result of Pig statement execution is captured into a relation

Pig Latin Script: Flow

Typical Pig Latin Script Flow

Step #1: Load data from the file system

LOAD

Step #2: Perform a series of “transformation” to the data

FILTER, FOREACH, GROUP, JOIN, UNION, SPLIT, SORT

Step #3: Execute and than display or save result

DUMP (to display result) or STORE (to save the result into HDFS or HBase)
DUMP or STORE triggers the execution

Step #1: Load Data using LOAD statement

LOAD ‘data’ [USING function] [AS schema];

‘data’ : The name of the file or directory, in single quotes
[USING function]: specifies the load function to use
- PigStorage is the default function
- PigStorage( [field_delimiter] , [‘options’] ): The default field delimiter is tab (‘\t’), but can be customized with regular expression
[AS schema]: Schemas enable you to assign names to fields and declare types for fields
- You can use the DESCRIBE operator to view the schema

Step #2: Perform Transformation to Data

Use FILTER operator to work with tuples (rows of data)
- To filter out tuples based on condition
Use FOREACH operator to work with columns of data
- To select columns
With a single relation
- Use GROUP operator to group data in a single relation
With multiple relations (we will cover these in PIG Part 2 presentation)
- Use COGROUP, inner JOIN, and outer JOIN operators to group or join data in two or more relations
- Use UNION operator to merge the contents of two or more relations
- Use SPLIT operator to partition the contents of a relation into multiple relations

Pig Latin Relational Operators

Filter Operator

Use the FILTER operator to work with tuples or rows of data (if you want to work with columns of data, use the FOREACH…GENERATE operation).
FILTER is commonly used to select the tuples that you want; or, conversely, to filter out (remove) the tuples you don’t want

FOREACH … GENERATE Operator

FLATTEN Operator

Flatten un-nests tuples as well as bags

TOKENIZE Function

Splits a string into tokens and outputs as a bag of words

GROUP Operator

ORDER alias BY Operator

Step #3: DUMP or STORE

No action is taken until DUMP or STORE commands are executed
- Pig will parse, validate, and analyze the statements but not execute them until DUMP or STORE commands are executed
DUMP is for displaying the result to the screen
- Mostly used during development time
STORE is for saving the results into HDFS or HBase

Pig Latin Diagnostic Operators

DESCRIBE, ILLUSTRATE, EXPLAIN

DESCRIBE displays the structure of the schema
ILLUSTRATE shows how Pig engine transforms the data

EXPLAIN produces various reports
- Logical plan, Physical plan, MapReduce plan