Module11DApachePig.avi

Module 11D : Hands On : Apache Pig Statements (Length 8 Minutes)
1. foreach statement
2. Example 1 : Projection using foreach
3. Example 2 : Projection using schema
4. Example 3 : Another way of selecting columns, using two dots (..)


Data transformation in Pig uses the following major operations :
  • Sorting
  • Grouping
  • Joining
  • Projecting
  • Filtering
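As a rough sketch, each of these operations maps to a Pig Latin operator. The lines below assume the cat.txt file and schema used in the hands-on examples of this module; the alias names (sortedCat, groupedCat and so on) and the second load categoriesB are illustrative only.

```pig
-- Assumption: same file and schema as the hands-on examples in this module
categories = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',') AS (id:int, subId:int, catName:chararray);

sortedCat    = ORDER categories BY catName;              -- Sorting
groupedCat   = GROUP categories BY subId;                -- Grouping
projectedCat = FOREACH categories GENERATE id, catName;  -- Projecting
filteredCat  = FILTER categories BY id > 1;              -- Filtering

-- Joining needs two relations; here the same file is loaded a second time (illustrative only)
categoriesB = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',') AS (id:int, subId:int, catName:chararray);
joinedCat   = JOIN categories BY id, categoriesB BY id;  -- Joining
```

This module covers only projecting with foreach.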
foreach : Applies an operation to every record of a relation. For example, if a relation has 60 records, the expressions that follow 'generate' are evaluated once for each of those 60 records.

Syntax : 
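alias = FOREACH alias GENERATE expression [, expression ...];
myData = foreach categories generate *; -- Selects all the columns from categories and generates a new relation, myData

alias : The name of a relation (a relation is an outer bag). The alias on the left names the new relation; the alias after FOREACH names the input relation.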

Example 1 : Projection using foreach (HandsOn) : 
categories = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(','); 
myData = foreach categories generate *;
dump myData;
(1,2,Football)
(2,2,Soccer)
(3,2,Baseball & Softball)
myDataSelected = foreach categories generate $0,$1; -- Selects only the first 2 columns, by position
dump myDataSelected;
(1,2)
(2,2)
(3,2)

Example 2 : Projection using schema (HandsOn)
categories2 = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',') AS (id:int, subId:int, catName:chararray); 
selectedCat = foreach categories2 generate subId,catName; -- Selects only two columns, by column name
DUMP selectedCat;
(2,Football)
(2,Soccer)
(2,Baseball & Softball)
subtract = foreach categories2 generate id-subId; -- generate also accepts expressions, not only column names
dump subtract;
(-1)
(0)
(1)
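
A generated expression has no field name unless you give it one. As a minimal sketch, assuming the same cat.txt file and schema as above, the AS keyword names each generated field. The names computed, idDiff and upperName are illustrative only; UPPER is a built-in Pig string function.

```pig
-- Assumption: same file and schema as Example 2
categories2 = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',') AS (id:int, subId:int, catName:chararray);

-- AS names each generated field; idDiff and upperName are illustrative names
computed = FOREACH categories2 GENERATE id - subId AS idDiff, UPPER(catName) AS upperName;

DESCRIBE computed; -- prints the schema of the new relation
DUMP computed;
```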

So you can refer to the columns of a relation by
  • Their position, like $0 for the first column and $1 for the second column (use this when no schema is defined)
  • Their name as defined by the schema, like id, subId etc. (names are case sensitive)
  • Using * , which selects all the columns
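
The sketch below illustrates these reference styles, assuming the same cat.txt file as the examples above. The aliases mixed and diffByPos are illustrative names. When no schema is defined, Pig treats each field as a bytearray, so the second statement casts explicitly to make the intended type clear.

```pig
-- With a schema: positions and names can be mixed in one generate
categories2 = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',') AS (id:int, subId:int, catName:chararray);
mixed = FOREACH categories2 GENERATE $0, catName;

-- Without a schema: only positions are available, and fields default to bytearray
categories = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',');
diffByPos = FOREACH categories GENERATE (int)$0 - (int)$1;

DUMP mixed;
DUMP diffByPos;
```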
Example 3 : Another way of selecting columns, using two dots .. (HandsOn)
selectedCat3 = foreach categories2 generate id..catName; -- Selects all the columns from id through catName, inclusive
DUMP selectedCat3;
selectedCat4 = foreach categories2 generate subId..; -- Selects subId and all the columns that come after it
DUMP selectedCat4;
selectedCat5 = foreach categories2 generate ..catName; -- Selects all the columns up to and including catName
DUMP selectedCat5;
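
The two-dot range also works with positions. The lines below are a minimal sketch under the same assumptions as Example 2 (same file and schema); firstTwo and fromSecond are illustrative names. With this three-column schema, id..catName and ..catName both select every column.

```pig
-- Assumption: same file and schema as Example 2
categories2 = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',') AS (id:int, subId:int, catName:chararray);

firstTwo = FOREACH categories2 GENERATE $0..$1;  -- same columns as id..subId
DUMP firstTwo;

fromSecond = FOREACH categories2 GENERATE $1..;  -- same columns as subId..
DUMP fromSecond;
```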