Module11BApachePig.avi





Module 11B : Hands On : Apache Pig Complex Datatypes : Available (Length 14 Minutes)
1. Understand Map, Tuple and Bag
2. Create Outer Bag and Inner Bag
3. Defining Pig Schema

There are two categories of data types available in Pig, as below.

  1. Scalar Data Types : int, long, float, double, chararray, bytearray
  2. Complex Data Types : map, tuple, bag

Complex Data Types : (Map, Tuple, Bag)

Tuple : A fixed-length, ordered collection of fields. You can think of it as a row in a database table. Example:
('Amit','Kumar',90,'9943019420')
The tuple above represents 4 fields, as below:
(FName:chararray, SName:chararray, Score:int, CellPhone:chararray)

You have index-based access to the fields: the element at index 0 is 'Amit' and the element at index 3 is '9943019420'.

Note :
- A tuple is enclosed in parentheses ( ).
- A field is a piece of data. A field can be of any data type (including tuple and bag).
Example : ('Amit','Kumar',90,{('9943019420','9943019421')})

You can think of a tuple as a row with one or more fields, where each field can be of any data type and any field may or may not have data.
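The index-based access described above can be sketched in Pig Latin. Note that the file path and the relation names (people, picked) are hypothetical, made up here to match the example tuple:

-- Hypothetical data file, one comma-separated record per line, e.g.:
-- Amit,Kumar,90,9943019420
people = LOAD '/user/cloudera/Training/pig/people.txt' USING PigStorage(',')
         AS (fname:chararray, sname:chararray, score:int, cellphone:chararray);

-- $0 refers to the field at index 0 ('Amit'), $3 to the field at index 3 ('9943019420')
picked = FOREACH people GENERATE $0, $3;
DUMP picked;

Once a schema is defined, fields can be referenced either by position ($0, $1, ...) or by name (fname, sname, ...).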

Bag : 
  •  It is a collection of tuples.
  •  Tuples in a bag cannot be accessed by position; bags are unordered.
  •  A bag may or may not have a schema associated with it.
  •  Bag data can be bigger than available memory, because extra data is spilled to disk.
  •  An inner bag is enclosed in curly brackets { }.
  •  A bag can have duplicate tuples.
  •  A bag can have tuples with differing numbers of fields.
  •  Bags have two forms: outer bag (or relation) and inner bag.
 
Outer Bag : In the example below, myData is a relation, or bag of tuples. You can think of this bag as an outer bag.

File : /user/cloudera/Training/pig/hadoopexam2.txt

Amit 20 30 
Dinesh 20 10
Ganesh 30 40
Dinesh 30 30
myData = LOAD '/user/cloudera/Training/pig/hadoopexam2.txt' USING PigStorage(' ') AS (fname:chararray, score:int, marks:int);
DUMP myData;
('Amit',20,30)
('Dinesh',20,10)
('Ganesh',30,40)
('Dinesh',30,30)

Inner Bag : Now suppose we group relation myData by the first field to form relation groupData, as below.

groupData = GROUP myData BY fname;
DUMP groupData;
('Amit',{('Amit',20,30)})
('Dinesh',{('Dinesh',20,10),('Dinesh',30,30)})
('Ganesh',{('Ganesh',30,40)})

In the example above, groupData is a relation, or bag of tuples.
  • The tuples in relation groupData have two fields.
  • The first field (named group) is of type chararray.
  • The second field is of type bag; you can think of this bag as an inner bag.
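Inner bags can be processed with FOREACH. A minimal sketch using the groupData relation above, counting the tuples in each inner bag with the built-in COUNT function (the alias counts is made up here):

counts = FOREACH groupData GENERATE group, COUNT(myData);
DUMP counts;

Here myData inside the FOREACH refers to the inner bag field of groupData, not to the original relation.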
Map : It is a set of key-value pairs.
For example, ['fname'#'Amit','sname'#'Kumar'] will create a map with two keys, "fname" and "sname". Here both values are chararray.

Another example:
['fname'#'Amit','sname'#'Kumar','score'#90] will create a map with three keys, "fname", "sname" and "score". Here the first two values are chararray and the third one is int.
  • Maps are enclosed in square brackets [ ].
  • Key-value pairs are separated by the pound sign #.
  • Key type : a key must be of chararray data type.
  • Value type : a value can be of any data type (defaults to bytearray).
  • Unique : keys within a map must be unique.
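A sketch of loading and reading a map field. The file path users.txt and the relation names are hypothetical:

-- Each input line holds one map, e.g. [fname#Amit,sname#Kumar]
users = LOAD '/user/cloudera/Training/pig/users.txt' AS (info:map[]);
-- The # operator also looks up a value by key
names = FOREACH users GENERATE info#'fname';
DUMP names;

With map[] the values default to bytearray; a value type can also be declared, e.g. map[chararray].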

Null : It is very similar to null in the SQL language. Null means the value is unknown or unset.
Null values have various effects on operations; we will get more detail on them as we move ahead with our hands-on session.
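One common effect can be sketched with the myData relation from above: FILTER keeps only rows where the condition is true, so rows where the condition evaluates to null are dropped, and nulls can also be tested explicitly (the alias withScore is made up here):

withScore = FILTER myData BY score IS NOT NULL;
DUMP withScore;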


Pig Schema : Assigning a data type and a name to each field.
Example : The data below has a schema defined like this: (fname:chararray, score:int, marks:int)
('Amit',20,30)
('Dinesh',20,10)
('Ganesh',30,40)
('Dinesh',30,30)

  • If your data has a schema defined, it helps Pig to optimize processing as well as to do error checking.
  • If you have not defined a schema, Pig will still process the data and try to determine possible data types for your fields.
  • You can define a schema (with the AS clause) using the following operators:
    • LOAD : enforces the schema while loading data.
    • STREAM
    • FOREACH
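For example, FOREACH can assign a name and type to each generated field with AS. A sketch based on the myData relation above (the alias totals and the field name total are made up here):

totals = FOREACH myData GENERATE fname, score + marks AS total:int;
DESCRIBE totals;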
How and what you can do while defining a schema :
  • You can define a schema that includes both the field name and the field type.
  • You can define a schema that includes the field name only. In this case, the field type defaults to bytearray.
  • If no schema is defined at all, the fields are un-named and the field types default to bytearray.
  • Data Type Casting : If you assign a type to a field, you can subsequently change the type using the cast operators.
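A sketch of such a cast, assuming fields loaded without types (and therefore bytearray); the aliases raw and typed are made up here:

raw = LOAD 'data.txt' AS (val1, val2);                    -- both fields default to bytearray
typed = FOREACH raw GENERATE (int)val1, (chararray)val2;  -- cast operators change the types
DESCRIBE typed;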
If a schema is not defined :
  • When you JOIN/COGROUP/CROSS two or more relations, e.g. relations A and B, and relation A has no schema defined, then the result of the operation will also have no schema.
  • If you FLATTEN a bag with an empty inner schema, the schema for the resulting relation is null.
  • If you UNION two relations with incompatible schemas, the schema for the resulting relation is null.
  • If the schema is null, Pig treats all fields as bytearray (in the backend, Pig will determine the real type of the fields dynamically).
  • See the examples below. If a field's data type is not specified, Pig will use bytearray to denote the unknown type. If the number of fields is not known, Pig will report the schema as unknown.
/* The field data types are not specified ... */
myData = LOAD 'data.txt' AS (val1, val2);
DESCRIBE myData;
myData: {val1: bytearray,val2: bytearray}

/* The number of fields is not known ... */
myData = LOAD 'data.txt';
DESCRIBE myData;
Schema for myData unknown