Apache Pig Hands-On Lab

Module 11A : Hands On : Apache Pig Coding : Available (Length 23 Minutes)
1. Working with the Grunt shell
2. Creating a word count application
3. Executing the word count application
4. Accessing HDFS from the Grunt shell

Here, we will run sample Apache Pig scripts in the Grunt shell. Don't worry if you do not understand all of the syntax yet; the goal is just to see the power of Apache Pig.
A word count application takes only a few lines of code. (Refer to the video session for a fuller explanation.)

Step 1 : Start the Grunt shell.
    Open a terminal and type pig

Step 1A : Create a file in HDFS at /user/cloudera/Training/pig/hadoopexam.txt
with the following content.

I am learning Pig Using HadoopExam
I am learning Spark Using HadoopExam
I am learning Java Using HadoopExam
I am learning Hadoop Using HadoopExam
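
One possible way to put this file into HDFS from inside the Grunt shell (started in Step 1) is sketched below. It assumes that a local file named hadoopexam.txt with the four lines above already exists in the directory where pig was started; that local file name and location are assumptions, not part of the lab.

```pig
-- Create the target directory in HDFS (-p also creates missing parent directories)
fs -mkdir -p /user/cloudera/Training/pig
-- Upload the local file (assumed to be in the current local directory)
fs -put hadoopexam.txt /user/cloudera/Training/pig/hadoopexam.txt
-- List the directory to confirm that the file arrived
fs -ls /user/cloudera/Training/pig
```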

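Step 2 : Now load the file stored in HDFS. No loader is specified, so Pig uses its default tab-delimited loader; because the space-separated lines contain no tabs, each whole line is read into the single field f1.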
input1  = LOAD '/user/cloudera/Training/pig/hadoopexam.txt' AS (f1:chararray);

DUMP input1;
(I am learning Pig Using HadoopExam)
(I am learning Spark Using HadoopExam)
(I am learning Java Using HadoopExam)
(I am learning Hadoop Using HadoopExam)

Step 3 : Split each line into words and flatten the result, so that every word becomes its own record
wordsInEachLine = FOREACH input1 GENERATE flatten(TOKENIZE(f1)) as word;
DUMP wordsInEachLine;

Step 4 : Group identical words together
groupedWords = group wordsInEachLine by word;
dump groupedWords;
describe groupedWords;

Step 5 : Now count the words in each group.
countedWords = foreach groupedWords generate group, COUNT(wordsInEachLine);
dump countedWords;
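
As an optional variation on Step 5, the sketch below names the output fields and sorts the result so it is easier to read. The aliases namedCounts and sortedCounts and the field name total are made up for this illustration and are not part of the lab script.

```pig
-- Same count as Step 5, but with explicit field names (word, total)
namedCounts = foreach groupedWords generate group as word, COUNT(wordsInEachLine) as total;
-- Highest counts first; ties are ordered by word
sortedCounts = order namedCounts by total desc, word asc;
dump sortedCounts;
```

For the four input lines above, the words I, am, learning, Using and HadoopExam should each have a count of 4, while Pig, Spark, Java and Hadoop should each have a count of 1.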

As you can see, there is no need to wait for the whole script to finish before checking the results. A DUMP statement after each step shows the intermediate output, so we can confirm that the script is correct so far. As we move ahead, we will keep writing more complex applications and building on these concepts.

More About Pig Latin :
  • Pig scripts can be a linear workflow (as shown above in the word count example)
  • Pig scripts can also branch, for example when multiple data inputs are joined (de-normalizing) or when data is split
  • In Pig Latin scripts, you will not find if statements or for loops (a script is simply a DAG : Directed Acyclic Graph)

Grunt : It is the shell in which we have been writing our Pig scripts. Production code is generally written in a separate file. While developing, however, we want to try our scripts against test data, so we use the Grunt shell for prototyping.

Remember :
  • Grunt provides Tab completion for commands (but not for file names, unlike a regular shell)
  • Ctrl+D exits Grunt

Dump and Store : Pig does not execute any statements until it reaches a DUMP or STORE command, as we saw in our example.
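
A minimal sketch of this behaviour with STORE instead of DUMP is shown below. The aliases lines and words and the output directory words_out are hypothetical names chosen for this illustration.

```pig
-- Entering these statements does not start a job; Pig only checks them and builds a plan
lines = LOAD '/user/cloudera/Training/pig/hadoopexam.txt' AS (f1:chararray);
words = FOREACH lines GENERATE flatten(TOKENIZE(f1)) as word;
-- STORE triggers execution and writes the result to an HDFS directory
-- (hypothetical output path; the directory must not already exist)
STORE words INTO '/user/cloudera/Training/pig/words_out';
```

DUMP is convenient while prototyping because it prints to the screen, whereas STORE keeps the result in HDFS for later use.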

Accessing HDFS : You can run HDFS commands inside the Grunt shell with fs, as below

> fs -ls

Accessing the local shell : You can run local shell commands inside the Grunt shell with sh, as below

> sh ls
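
A few more commands in the same style are sketched below, typed at the Grunt prompt without the leading ">". They assume that the input file from Step 1A is already in HDFS.

```pig
-- Print the input file stored in HDFS
fs -cat /user/cloudera/Training/pig/hadoopexam.txt
-- Show the current working directory on the local file system
sh pwd
-- Long listing of the local working directory
sh ls -l
```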

Killing a job inside the Grunt shell : The kill command stops a running MapReduce job. Pig usually prints the job_id as soon as you submit a script, and you pass that job_id to kill as below.
> kill {job_id}

exec command : Inside the Grunt shell, you can use the exec command to run a Pig script.

run command : Inside the Grunt shell, you can also use the run command to run a Pig script.

Difference between run and exec : The run command runs the Pig Latin script in the same shell in which run was executed. Hence, all the aliases defined inside the script remain available in that shell afterwards. This is not the case with the exec command: aliases defined by the script are not available in the shell once it finishes.
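
The difference can be sketched as follows in a fresh Grunt session. The script name wordcount.pig is hypothetical; it is assumed to be a local file containing the statements from Steps 2 to 5, so it defines the alias countedWords.

```pig
-- exec: the script executes in its own separate context
exec wordcount.pig
-- countedWords was defined only inside the script, so the shell does not know it here;
-- a "dump countedWords;" at this point would fail with an undefined-alias error

-- run: the script executes in the current Grunt session
run wordcount.pig
-- The alias defined by the script is still available afterwards
dump countedWords;
```

As a rule of thumb under these assumptions, run suits interactive work where you want to keep exploring the script's aliases, while exec suits running a script without cluttering the current session.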