Module 11D : Hands On : Apache Pig Statements : Available (Length 8 Minutes)
1. ForEach statement
2. Example 1 : Data projecting and foreach statement
3. Example 2 : Projection using schema
4. Example 3 : Another way of selecting columns using two dots ..
Data transformation in Pig is done using the following major operators:
Sorting
Grouping
Joining
Projecting
Filtering
foreach : Suppose you have 60 records in a relation. Using 'foreach', you can apply your operation to each record of that relation.
Syntax :
alias = FOREACH alias GENERATE expression [AS schema] [, expression [AS schema] ...];
myData = foreach categories generate *; -- Selects all the columns from categories and generates a new relation, myData
alias : The name of a relation, which is an outer bag.
Example 1 : Projection using foreach (HandsOn) :
categories = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',');
myData = foreach categories generate *;
dump myData;
(1,2,Football) (2,2,Soccer)
myDataSelected = foreach categories generate $0,$1; -- Selects only the first two columns
dump myDataSelected;
(1,2) (2,2) (3,2)
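When no schema is given, as in Example 1, Pig treats each positional field as a bytearray. As a minimal sketch (the relation name typedCat and the field names id and catName are chosen here for illustration and are not part of the hands-on), positional fields can be cast and named inside generate:

```pig
-- Illustrative sketch: typedCat, id and catName are names assumed for this example.
categories = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',');

-- Cast the first field to int and the third field to chararray, and name both.
typedCat = FOREACH categories GENERATE (int)$0 AS id, (chararray)$2 AS catName;

-- Print the schema of the new relation to confirm the names and types.
DESCRIBE typedCat;
```

After this, typedCat can be referenced by field name in later statements, in the same way as a relation loaded with a schema.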
Example 2 : Projection using schema (HandsOn)
categories2 = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',') AS (id:int, subId:int, catName:chararray);
selectedCat = foreach categories2 generate subId,catName; -- Selects only two columns, using column names
DUMP selectedCat;
(2,Football) (2,Soccer) (2,Baseball & Softball)
subtract = foreach categories2 generate id-subId; -- Shows that an expression can be used in generate
dump subtract;
(-1) (0) (1)
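The result of an expression has no field name unless one is assigned. A small sketch of naming the computed column with AS, alongside ordinary projected columns (the names withDiff and idMinusSub are assumptions made for this example):

```pig
-- Illustrative sketch: withDiff and idMinusSub are names assumed for this example.
categories2 = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',')
              AS (id:int, subId:int, catName:chararray);

-- Project two existing columns and one computed column, giving the computed one a name.
withDiff = FOREACH categories2 GENERATE id, catName, id - subId AS idMinusSub;

-- Inspect the schema, then print the records.
DESCRIBE withDiff;
DUMP withDiff;
```

Naming the computed field lets later statements refer to it as idMinusSub instead of by position.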
So you can refer to columns in a relation in the following ways:
By position: $0 is the first field, $1 is the second field, and so on (this is the only option when no schema is defined)
By name, as defined in the schema: id, subId etc. (names are case sensitive)
Using * , which selects all the columns
Example 3 : Another way of selecting columns using two dots .. (HandsOn)
selectedCat3 = foreach categories2 generate id..catName; -- Selects all the columns from id through catName, inclusive
DUMP selectedCat3;
selectedCat4 = foreach categories2 generate subId..; -- Selects subId and all the columns that come after it
DUMP selectedCat4;
selectedCat5 = foreach categories2 generate ..catName; -- Selects all the columns up to and including catName
DUMP selectedCat5;
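The two-dot range can also be written with positions instead of names. A minimal sketch, assuming a Pig version that supports the .. range expression (the relation names firstTwo and fromSecond are chosen here for illustration):

```pig
-- Illustrative sketch: firstTwo and fromSecond are names assumed for this example.
categories2 = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',')
              AS (id:int, subId:int, catName:chararray);

-- Columns at positions 0 through 1, inclusive (id and subId).
firstTwo = FOREACH categories2 GENERATE $0..$1;

-- The column at position 1 and every column after it (subId and catName).
fromSecond = FOREACH categories2 GENERATE $1..;

DUMP firstTwo;
DUMP fromSecond;
```

Positional ranges are useful when no schema is defined, because in that case the columns have no names to anchor the range.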