Module-11B


Module 11B : Hands On : Apache Pig Complex Datatypes : Available (Length 14 Minutes)

1. Understand Map, Tuple and Bag

2. Create Outer Bag and Inner Bag

3. Defining Pig Schema

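Pig data types fall into two categories: simple (scalar) types such as int, long, float, double, chararray and bytearray, and complex types. This module covers the complex types.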

Complex Data Types : (Map, Tuple, Bag)

Tuple : A fixed-length, ordered collection of fields. You can think of it as a row in a database table. Example:

('Amit','Kumar',90,'9943019420') 

The above tuple has 4 fields, as below

(FName:chararray,SName:chararray,Score:int,CellPhone:chararray)

Fields can be accessed by position, counting from 0. The element at index 0 is 'Amit' and the element at index 3 (the fourth and last field) is '9943019420'.
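As a minimal sketch of positional access, the script below loads tuples shaped like the example and projects the first and last fields. The file name students.txt, the comma delimiter and the alias names are assumptions made for illustration. Positions start at $0, so a four-field tuple is addressed as $0 through $3.

```pig
-- Hypothetical input file with lines such as: Amit,Kumar,90,9943019420
students = LOAD 'students.txt' USING PigStorage(',')
           AS (FName:chararray, SName:chararray, Score:int, CellPhone:chararray);

-- $0 is the first field (FName) and $3 is the fourth field (CellPhone)
firstAndLast = FOREACH students GENERATE $0, $3;

DUMP firstAndLast;
```

The same projection can also be written by field name: FOREACH students GENERATE FName, CellPhone;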

Note :

- A tuple is enclosed in parentheses ( ).

- A field is a piece of data. A field can be any data type (including tuple and bag).

Example : ( 'Amit','Kumar',90, {('9943019420','9943019421')})

You can think of a tuple as a row with one or more fields, where each field can be any data type and any field may or may not have data.
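A tuple that holds a bag can be described in a schema by nesting the bag's tuple definition inside the field list. The sketch below assumes a hypothetical tab-delimited file contacts.txt whose fourth column is written in bag notation. The names contacts, phones, t, p1 and p2 are illustrative only.

```pig
-- Hypothetical input line, columns separated by a tab character:
-- Amit <TAB> Kumar <TAB> 90 <TAB> {(9943019420,9943019421)}
contacts = LOAD 'contacts.txt'
           AS (fname:chararray, sname:chararray, score:int,
               phones:bag{t:tuple(p1:chararray, p2:chararray)});

-- FLATTEN un-nests the bag, giving one output row per tuple inside it
flatContacts = FOREACH contacts GENERATE fname, FLATTEN(phones);

DUMP flatContacts;
```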

Bag : An unordered collection of tuples. A bag is enclosed in curly braces { }.

Outer Bag : In the example below, myData is a relation, that is, a bag of tuples. You can think of this bag as an outer bag.

File : /user/cloudera/Training/pig/hadoopexam2.txt

Amit 20 30
Dinesh 20 10
Ganesh 30 40
Dinesh 30 30

myData = LOAD '/user/cloudera/Training/pig/hadoopexam2.txt' USING PigStorage(' ') AS (fname:chararray, score:int, marks:int);

DUMP myData;

('Amit',20,30)
('Dinesh',20,10)
('Ganesh',30,40)
('Dinesh',30,30)

Inner Bag : Now, suppose we group relation myData by the first field to form relation groupData as below.

groupData = GROUP myData BY fname;

DUMP groupData;

('Amit',{('Amit',20,30)})
('Dinesh',{('Dinesh',20,10),('Dinesh',30,30)})
('Ganesh',{('Ganesh',30,40)})

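In the above example, groupData is a relation, that is, an outer bag of tuples. Each tuple has two fields: the group key and a bag holding every myData tuple that shares that key. This second field is the inner bag.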
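To see how the inner bag is used, the sketch below repeats the load and group steps and then aggregates over each group's bag. The alias summary and the output field names records and totalScore are assumptions added for illustration. COUNT and SUM are Pig built-in functions.

```pig
myData = LOAD '/user/cloudera/Training/pig/hadoopexam2.txt' USING PigStorage(' ')
         AS (fname:chararray, score:int, marks:int);

groupData = GROUP myData BY fname;

-- Shows the two fields of groupData: the key "group" and the inner bag "myData"
DESCRIBE groupData;

-- COUNT and SUM run over the inner bag of each group
summary = FOREACH groupData GENERATE group AS fname,
                                     COUNT(myData) AS records,
                                     SUM(myData.score) AS totalScore;

DUMP summary;
```

With the four sample records, the Dinesh group holds two tuples, so its count is 2 and its total score is 20 + 30 = 50.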

Map : A set of key-value pairs. A map is enclosed in square brackets [ ], and # separates each key from its value.

For example, ['fname'#'Amit','sname'#'Kumar'] is a map with two keys, “fname” and “sname”. Here both values are chararray.

Another example:

['fname'#'Amit','sname'#'Kumar','score'#90] is a map with three keys, “fname”, “sname” and “score”. Here the first two values are chararray and the third is int.
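Values are read from a map with the # operator followed by the key. The sketch below assumes a hypothetical file people.txt in which every line is a single map written in Pig's text notation. The aliases people and names and the field name info are illustrative.

```pig
-- Hypothetical input file, one map per line, for example: [fname#Amit,sname#Kumar]
people = LOAD 'people.txt' AS (info:map[]);

-- info#'fname' looks up the value stored under the key fname
names = FOREACH people GENERATE info#'fname' AS fname, info#'sname' AS sname;

DUMP names;
```

Looking up a key that is not present in the map yields null rather than an error.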

Null : It works much as it does in SQL. Null means the value is unset.

Null values affect operators and functions in several ways. We will cover these in more detail as we move ahead with the hands-on sessions.
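As in SQL, a null is tested with IS NULL or IS NOT NULL. The sketch below splits the sample relation by whether score is set. The aliases withScore and missingScore are assumptions for illustration.

```pig
myData = LOAD '/user/cloudera/Training/pig/hadoopexam2.txt' USING PigStorage(' ')
         AS (fname:chararray, score:int, marks:int);

-- Keep only the records whose score is set
withScore = FILTER myData BY score IS NOT NULL;

-- Keep only the records whose score is unset
missingScore = FILTER myData BY score IS NULL;

DUMP withScore;
DUMP missingScore;
```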

Pig schema : A schema assigns a name and a data type to each field.

Example : The data below has the schema (fname:chararray, score:int, marks:int)

('Amit',20,30)
('Dinesh',20,10)
('Ganesh',30,40)
('Dinesh',30,30)
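Once a schema is declared, DESCRIBE reports it and fields can be referenced by name. A minimal sketch using the sample file from this module follows. The alias totals and the field name total are assumptions for illustration.

```pig
myData = LOAD '/user/cloudera/Training/pig/hadoopexam2.txt' USING PigStorage(' ')
         AS (fname:chararray, score:int, marks:int);

-- Prints the schema: myData: {fname: chararray,score: int,marks: int}
DESCRIBE myData;

-- Named, typed fields can be used directly in expressions
totals = FOREACH myData GENERATE fname, score + marks AS total;

DUMP totals;
```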


What you can do when defining a schema :

If the schema is not defined :

/* The field data types are not specified ... */
myData = load 'data.txt' as (val1, val2);
myData: {val1: bytearray,val2: bytearray}

/* The number of fields is not known ... */
myData = load 'data.txt';
myData: Schema for myData unknown
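When a relation has no schema, fields can still be reached by position and given a type with an explicit cast. The sketch below assumes data.txt is tab-delimited (PigStorage's default) and has at least two columns. The alias typed and the chosen target types are illustrative.

```pig
-- No schema: every field is a bytearray and is addressed by position
myData = LOAD 'data.txt';

-- Cast each positional field and give it a name
typed = FOREACH myData GENERATE (chararray)$0 AS val1, (int)$1 AS val2;

-- Prints the schema: typed: {val1: chararray,val2: int}
DESCRIBE typed;
```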