Module 11E : Hands On : Apache Pig Complex Datatype practice : Available (Length 16 Minutes)
1. Example 1 : Loading Complex Datatypes
2. Example 2 : Loading compressed files
3. Example 3 : Store relation as compressed files
4. Example 4 : Nested FOREACH statements to solve the same problem.
Example 1 : Loading Complex Datatypes (HandsOn):
Step 1 : Download the complex data file.
Step 2 : Below is the schema for the data.
member = tuple(member_id:int, member_email:chararray, name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray), course_list:bag{course: tuple(course_name:chararray)}, technology:map[chararray])
(Map keys in Pig are always chararray, so only the value type is declared in the map schema.)
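For reference, a single record serialized with this schema and a '|' delimiter would look like the line below (hypothetical values; PigStorage writes tuples as (...), bags as {...}, and maps as [key#value]):
1001|john@hadoopexam.com|(John,A,Doe)|{(Hadoop),(Spark),(Pig)}|[programming1#Java,programming2#Scala]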
Step 3 : Upload the data to HDFS at /user/cloudera/Training/pig/complexData.txt
Step 4 : Now write a Pig script using the above schema to load the data and create a relation from it.
hadoopexamMember = LOAD '/user/cloudera/Training/pig/complexData.txt' USING PigStorage('|')
AS (member_id:int, member_email:chararray, name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray), course_list:bag{course: tuple(course_name:chararray)}, technology:map[chararray]) ;
DESCRIBE hadoopexamMember ;
DUMP hadoopexamMember ;
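If the load succeeds, DESCRIBE should print a schema along these lines (an output sketch based on the declared schema, not captured from a run):
hadoopexamMember: {member_id: int,member_email: chararray,name: (first_name: chararray,middle_name: chararray,last_name: chararray),course_list: {course: (course_name: chararray)},technology: map[chararray]}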
Step 5 : Now find all the courses subscribed to by each member along with their primary programming skill. (The idea is to analyze which courses members with a given programming skill are interested in, so the results can be used to recommend the same course list to other members with the same skill.)
primaryProgramming = FOREACH hadoopexamMember GENERATE member_id, member_email, name.first_name, course_list, technology#'programming1' ;
DESCRIBE primaryProgramming ;
DUMP primaryProgramming ;
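Note that the map lookup technology#'programming1' comes back as an unnamed chararray field. A minimal variation (the aliases first_name and primary_skill are illustrative) that names every projected field explicitly:
primaryProgramming = FOREACH hadoopexamMember GENERATE member_id, member_email, name.first_name AS first_name, course_list, technology#'programming1' AS primary_skill ;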
Step 6 : Now see some flattened results using the FLATTEN operator.
getFirstCourseSubscribed = FOREACH hadoopexamMember GENERATE member_id, member_email, name.first_name, FLATTEN(course_list), technology#'programming1' ;
DESCRIBE getFirstCourseSubscribed ;
DUMP getFirstCourseSubscribed ;
Step 7 : Get the number of courses subscribed to by each member.
memberCourseCount = FOREACH hadoopexamMember GENERATE member_email, COUNT(course_list);
DUMP memberCourseCount;
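As a small extension (a sketch, not part of the original hands-on), the count can be given an alias and sorted to find the most active members:
memberCourseCountNamed = FOREACH hadoopexamMember GENERATE member_email, COUNT(course_list) AS course_count ;
orderedCount = ORDER memberCourseCountNamed BY course_count DESC ;
DUMP orderedCount ;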
Note : We will look at the FLATTEN operator in more detail in coming modules.
Example 2 : Loading compressed files (HandsOn):
Step 1 : Download the compressed file.
Step 2 : Upload it to HDFS using Hue.
Step 3 : Load the compressed data using Pig. PigStorage picks the decompression codec from the file extension (.gz here), so no extra option is needed.
hadoopexamCompress = LOAD '/user/cloudera/Training/pig/complexData.txt.gz' USING PigStorage('|')
AS (member_id:int, member_email:chararray, name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray), course_list:bag{course: tuple(course_name:chararray)}, technology:map[chararray]) ;
DESCRIBE hadoopexamCompress ;
DUMP hadoopexamCompress ;
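The same statement works unchanged for bzip2 input (hypothetical path, assuming a .bz2 copy of the data was uploaded); unlike .gz files, .bz2 files are splittable, so large compressed inputs can still be processed in parallel:
hadoopexamBzip = LOAD '/user/cloudera/Training/pig/complexData.txt.bz2' USING PigStorage('|') AS (member_id:int, member_email:chararray, name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray), course_list:bag{course: tuple(course_name:chararray)}, technology:map[chararray]) ;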
Example 3 : Store relation as compressed files (HandsOn):
STORE hadoopexamMember INTO '/user/cloudera/Training/pig/output/complexData.txt.bz2' USING PigStorage('|') ;
Now check in HDFS using Hue that the following directory has been created. Because the directory name ends in .bz2, the part files inside it are written bzip2-compressed.
/user/cloudera/Training/pig/output/complexData.txt.bz2
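To verify the compressed output is readable, load it back (Pig reads every part file in the directory and decompresses on the fly; the schema is omitted here for brevity):
checkCompressed = LOAD '/user/cloudera/Training/pig/output/complexData.txt.bz2' USING PigStorage('|') ;
DUMP checkCompressed ;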
Note : A bag does not guarantee the order of its tuples.
Example 4 : Nested FOREACH statements to solve the same problem (HandsOn):
memberScore = LOAD '/user/cloudera/HEPig/handson11/module11Score.txt' USING PigStorage('|') AS (email:chararray, spend1:int, spend2:int);
DESCRIBE memberScore;
memberScore: {email: chararray,spend1: int,spend2: int}
groupedScore = GROUP memberScore BY email; -- produces a bag named memberScore containing all the records for each value of email
DESCRIBE groupedScore;
groupedScore: {group: chararray,memberScore: {(email: chararray,spend1: int,spend2: int)}}
result = FOREACH groupedScore {
    individualScore = FOREACH memberScore GENERATE (spend1 + spend2); -- the inner FOREACH iterates only over the records of the memberScore bag
    GENERATE group, SUM(individualScore);
};
DUMP result;
DESCRIBE result;
result: {group: chararray,long}
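For comparison, a minimal sketch of the same aggregation without a nested block (equivalent as long as spend1 and spend2 contain no nulls, since SUM skips null values):
resultDirect = FOREACH groupedScore GENERATE group, SUM(memberScore.spend1) + SUM(memberScore.spend2) ;
DUMP resultDirect ;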