Module11EApachePig.avi





Module 11E : Hands On : Apache Pig Complex Datatypes practice : Available (Length 16 Minutes)
1. Example 1 : Loading Complex Datatypes
2. Example 2 : Loading compressed files 
3. Example 3 : Store relation as compressed files
4. Example 4 : Nested FOREACH statements to solve the same problem.


Example 1 : Loading Complex Datatypes (HandsOn)

Step 1 : Download complex data
Step 2 : Below is the schema for the data.
member = tuple(member_id:int
		, member_email:chararray
		, name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray)
		, course_list:bag{course:tuple(course_name:chararray)}
		, technology:map[chararray]
	    )
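
For reference, a single pipe-delimited record matching this schema could look like the following (the values are made up; in PigStorage text format tuples are written with parentheses, bags with braces, and maps with brackets):

```
1|john@example.com|(John,A,Doe)|{(Hadoop),(Pig)}|[programming1#Java,programming2#Python]
```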

Step 3 : Upload data in hdfs first at /user/cloudera/Training/pig/complexData.txt
Step 4 : Now write a Pig script using the above schema to load the data and create a relation from it.

hadoopexamMember = LOAD '/user/cloudera/Training/pig/complexData.txt' USING PigStorage('|')
    AS (member_id:int
        , member_email:chararray
        , name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray)
        , course_list:bag{course:tuple(course_name:chararray)}
        , technology:map[chararray]
       ) ;
DESCRIBE hadoopexamMember ;
DUMP hadoopexamMember ;

Step 5 : Now find all the courses subscribed by each member together with their primary programming skill. (By analyzing which courses members with a given programming skill are interested in, the same courses can be recommended to similar members who share that skill.)
primaryProgramming = FOREACH hadoopexamMember GENERATE member_id, member_email, name.first_name, course_list, technology#'programming1' ;
DESCRIBE primaryProgramming ;
DUMP primaryProgramming ;
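
Map values are looked up with the # operator, and a key that is absent from a member's map simply yields null. A hedged sketch that looks up a second, hypothetical key 'programming2' and filters out members without it:

```
secondaryProgramming = FOREACH hadoopexamMember
                       GENERATE member_email, technology#'programming2' AS tech2;
withSecondary = FILTER secondaryProgramming BY tech2 IS NOT NULL;
DUMP withSecondary;
```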

Step 6 : Now see some flattened results.
getFirstCourseSubscribed = FOREACH hadoopexamMember GENERATE member_id, member_email, name.first_name, FLATTEN(course_list), technology#'programming1' ;
DESCRIBE getFirstCourseSubscribed ;
DUMP getFirstCourseSubscribed ;

Step 7 : Get number of courses subscribed by each user
memberCourseCount = FOREACH hadoopexamMember GENERATE member_email, COUNT(course_list);
DUMP memberCourseCount;
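
Note that COUNT skips tuples whose first field is null (SIZE counts every element). A hedged variant of the same statement that also names the computed column, so DESCRIBE shows a readable schema:

```
memberCourseCount = FOREACH hadoopexamMember
                    GENERATE member_email, COUNT(course_list) AS courseCount;
DUMP memberCourseCount;
```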

Note : We will cover the FLATTEN operator in detail in the coming modules.

Example 2 : Loading compressed files (HandsOn)

Step 1 : Download compressed file.

Step 2 : Upload in hdfs using Hue
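
Steps 1 and 2 can also be done from the command line instead of Hue. A hedged sketch (the sample data and local filenames are hypothetical; the HDFS path is taken from the example below):

```shell
# Create a tiny pipe-delimited sample file (data is made up)
printf '1|john@example.com|(John,A,Doe)|{(Hadoop)}|[programming1#Java]\n' > complexData.txt

# Compress it with gzip; Pig recognizes the .gz extension at load time
gzip -f complexData.txt    # yields complexData.txt.gz

# Upload to HDFS (needs a running cluster, hence commented out here)
# hdfs dfs -put complexData.txt.gz /user/cloudera/Training/pig/
```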

Step 3 : Load compressed data using Pig
hadoopexamCompress = LOAD '/user/cloudera/Training/pig/complexData.txt.gz' USING PigStorage('|')
    AS (member_id:int
        , member_email:chararray
        , name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray)
        , course_list:bag{course:tuple(course_name:chararray)}
        , technology:map[chararray]
       ) ;
DESCRIBE hadoopexamCompress ;
DUMP hadoopexamCompress ;

Example 3 : Store relation as compressed files (HandsOn)
STORE hadoopexamMember INTO '/user/cloudera/Training/pig/output/complexData.txt.bz2' using PigStorage('|') ;

Now check in Hue whether the following directory has been created in HDFS. Pig picks the compression codec from the output path's extension, so the .bz2 suffix produces bzip2-compressed part files.
/user/cloudera/Training/pig/output/complexData.txt.bz2

Note : A bag does not guarantee that its tuples will always be in order.
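
If a stable order is needed inside each member's bag, it can be imposed with an ORDER statement inside a nested FOREACH. A sketch against the hadoopexamMember relation loaded earlier:

```
orderedCourses = FOREACH hadoopexamMember {
    sortedList = ORDER course_list BY course_name;
    GENERATE member_id, sortedList;
};
DUMP orderedCourses;
```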


Example 4 : Nested FOREACH statements to solve the same problem. (HandsOn)
memberScore = load '/user/cloudera/HEPig/handson11/module11Score.txt' USING PigStorage('|') as (email:chararray, spend1:int, spend2:int);
DESCRIBE memberScore;
memberScore: {email: chararray,spend1: int,spend2: int}
groupedScore = group memberScore by email; -- produces bag memberScore containing all the records for a given value of email
DESCRIBE groupedScore;
groupedScore: {group: chararray,memberScore: {(email: chararray,spend1: int,spend2: int)}}
result = foreach groupedScore {
    individualScore = foreach memberScore generate (spend1 + spend2); -- iterates only over the records of the memberScore bag
    generate group, SUM(individualScore);
};
dump result;
describe result;
result: {group: chararray,long}
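
The last column in the schema above is unnamed because the SUM expression was not aliased. A hedged variant that names both output columns:

```
result = foreach groupedScore {
    individualScore = foreach memberScore generate (spend1 + spend2);
    generate group AS email, SUM(individualScore) AS totalSpend;
};
describe result; -- result: {email: chararray,totalSpend: long}
```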
