Module-11E


Module 11E : Hands On : Apache Pig Complex Datatype Practice : Available (Length 16 Minutes)

1. Example 1 : Loading Complex Datatypes

2. Example 2 : Loading compressed files 

3. Example 3 : Store relation as compressed files

4. Example 4 : Nested FOREACH statements to solve the same problem.

Example 1 : Loading Complex Datatypes (HandsOn): 

Step 1 : Download complex data

Step 2 : Below is the schema for the data.

member = tuple(member_id:int, member_email:chararray, name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray), course_list:bag{course: tuple(course_name:chararray)}, technology:map[chararray])

Step 3 : Upload data in hdfs first at /user/cloudera/Training/pig/complexData.txt
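For reference, a row of the input file matching this schema could look like the line below. This is a hypothetical sample (the actual downloaded file will differ): fields are separated by '|', a tuple is written as (a,b,c), a bag as {(x),(y)}, and a map as [key#value], which is the text notation PigStorage expects for complex types.

```text
1001|john@hadoopexam.com|(John,K,Smith)|{(Hadoop),(Spark)}|[programming1#Java,programming2#Python]
```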

Step 4 : Now write Pig script using above schema to load the data and create a relation out of this.

hadoopexamMember = LOAD '/user/cloudera/Training/pig/complexData.txt' USING PigStorage('|')
    AS (member_id:int,
        member_email:chararray,
        name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray),
        course_list:bag{course: tuple(course_name:chararray)},
        technology:map[chararray]);

DESCRIBE hadoopexamMember ;

DUMP hadoopexamMember ;

Step 5 : Now find all the courses subscribed to by each member, along with their programming skills. (The goal is to analyze which courses members with a given programming skill are interested in, so the results can be used to recommend the same courses to other members with similar skills.)

primaryProgramming = FOREACH hadoopexamMember GENERATE member_id, member_email, name.first_name, course_list, technology#'programming1' ;

DESCRIBE primaryProgramming ;

DUMP primaryProgramming ;

Step 6 : Now see some flattened results.

getFirstCourseSubscribed = FOREACH hadoopexamMember GENERATE member_id, member_email, name.first_name, FLATTEN(course_list), technology#'programming1' ;

DESCRIBE getFirstCourseSubscribed ;

DUMP getFirstCourseSubscribed ;

Step 7 : Get number of courses subscribed by each user

memberCourseCount = FOREACH hadoopexamMember GENERATE member_email, COUNT(course_list);

DUMP memberCourseCount;
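As a small readability tweak (our own sketch, not part of the original exercise), the counted field can be given an explicit alias with AS so that downstream relations see a named column instead of an auto-generated one:

```pig
-- 'course_count' is a name chosen here for illustration
memberCourseCount = FOREACH hadoopexamMember GENERATE
    member_email,
    COUNT(course_list) AS course_count;

-- COUNT returns a long, so the schema becomes:
-- memberCourseCount: {member_email: chararray, course_count: long}
DESCRIBE memberCourseCount;
```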

Note : We will see the FLATTEN operator in detail in the coming modules.

Example 2 : Loading compressed files (HandsOn): 

Step 1 : Download the compressed file.

Step 2 : Upload it to HDFS using Hue.

Step 3 : Load the compressed data using Pig. (PigStorage automatically decompresses input files based on the .gz extension.)

hadoopexamCompress = LOAD '/user/cloudera/Training/pig/complexData.txt.gz' USING PigStorage('|')
    AS (member_id:int,
        member_email:chararray,
        name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray),
        course_list:bag{course: tuple(course_name:chararray)},
        technology:map[chararray]);

DESCRIBE hadoopexamCompress ;

DUMP hadoopexamCompress ;

Example 3 : Store relation as compressed files (HandsOn): 

STORE hadoopexamMember INTO '/user/cloudera/Training/pig/output/complexData.txt.bz2' USING PigStorage('|') ;

(The .bz2 extension on the output path tells Pig to write the output bzip2-compressed.)

Now check in HDFS using Hue whether the following directory has been created.

/user/cloudera/Training/pig/output/complexData.txt.bz2
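To confirm the compressed output is readable, one option (a sketch reusing the schema from Example 1) is to load the .bz2 output directory back with PigStorage, which decompresses bzip2 input automatically:

```pig
checkCompressed = LOAD '/user/cloudera/Training/pig/output/complexData.txt.bz2' USING PigStorage('|')
    AS (member_id:int,
        member_email:chararray,
        name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray),
        course_list:bag{course: tuple(course_name:chararray)},
        technology:map[chararray]);

DUMP checkCompressed;
```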

Note : A bag does not guarantee that its tuples will always be in order.

Example 4 : Nested FOREACH statements to solve the same problem (HandsOn) :

memberScore = LOAD '/user/cloudera/HEPig/handson11/module11Score.txt' USING PigStorage('|') AS (email:chararray, spend1:int, spend2:int);

DESCRIBE memberScore;

memberScore: {email: chararray,spend1: int,spend2: int}

groupedScore = GROUP memberScore BY email; -- produces a bag named memberScore containing all the records for a given value of email

DESCRIBE groupedScore;

groupedScore: {group: chararray,memberScore: {(email: chararray,spend1: int,spend2: int)}}

result = FOREACH groupedScore {
    -- the inner FOREACH iterates only over the records of the memberScore bag
    individualScore = FOREACH memberScore GENERATE (spend1 + spend2);
    GENERATE group, SUM(individualScore);
};

dump result;

describe result;

result: {group: chararray,long}
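The same result can also be computed without a nested FOREACH, by projecting the per-record sum first and then aggregating. This is a sketch equivalent to the nested version above; the relation names are our own:

```pig
-- compute each record's total spend, then sum the totals per email
perRecord = FOREACH memberScore GENERATE email, (spend1 + spend2) AS total;
grouped   = GROUP perRecord BY email;
result2   = FOREACH grouped GENERATE group, SUM(perRecord.total);

DESCRIBE result2; -- result2: {group: chararray,long}
```

The nested form is useful when the per-group logic needs intermediate relations (FILTER, DISTINCT, ORDER inside the block); for a plain sum, both forms produce the same schema and output.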