Module 11 : Apache Pig

Introduction : 
        Apache Pig is an important component of the Hadoop ecosystem. It is a mature component and is used in production. Pig lets you write data flows that process data stored in HDFS (Hadoop Distributed File System), so you can avoid writing MapReduce jobs by hand. A Pig Latin script can express a long series of data operations, and the following activities can be completed using it:

  • ETL (Extract, Transform, and Load) : For example, you can extract the relevant information from web-server logs, apply transformations such as working out which pages each user visited, and finally save the results in HDFS, Hive, HBase, etc.
    WebServerLogs --> Extract Relevant Info --> Apply Transformation (e.g. mapping between page content and userid, aggregate, join, sort, etc.) --> Load in Hive Table
  • Ad-hoc queries on raw data : Data scientists or analytics teams can analyze raw data directly using Pig Latin scripts.
  • Iterative data processing : Some algorithms need to process data iteratively, which can be implemented using Apache Pig.
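
The ETL flow above can be sketched in a few lines of plain Python. This is only an illustration of the extract, transform, and load steps; it is not Pig Latin and not how Pig executes them. The sample log lines, the field layout (userid, page, status), and the function names are assumptions made up for this example. The comments name the Pig Latin operators (LOAD, FILTER, GROUP, STORE) that play roughly the same role.

```python
from collections import defaultdict

# Hypothetical tab-separated web-server log lines: userid, page, HTTP status.
RAW_LOGS = [
    "u1\t/home\t200",
    "u2\t/home\t200",
    "u1\t/cart\t200",
    "u1\t/missing\t404",
    "u2\t/home\t200",
]


def extract(lines):
    """Extract: parse each line into (userid, page, status), roughly Pig's LOAD ... AS (...)."""
    for line in lines:
        userid, page, status = line.split("\t")
        yield userid, page, int(status)


def transform(records):
    """Transform: keep successful requests (FILTER), then collect the pages per user (GROUP)."""
    pages_by_user = defaultdict(set)
    for userid, page, status in records:
        if status == 200:
            pages_by_user[userid].add(page)
    return {user: sorted(pages) for user, pages in pages_by_user.items()}


def load(result):
    """Load: stand-in for STORE; returns rows sorted by userid instead of writing to HDFS or Hive."""
    return [(user, pages) for user, pages in sorted(result.items())]


rows = load(transform(extract(RAW_LOGS)))
for row in rows:
    print(row)
```

In a Pig Latin script each of these steps would typically be a single statement, and Pig would compile the script into MapReduce jobs for you.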

Other Benefits : 

  • Custom Functionality : Developers can write their own custom functions when Pig does not provide a built-in one.
  • Less Coding : You can express a large MapReduce job in a few lines of Pig script.
  • Grunt Shell : Using the Grunt shell, you can analyze data ad hoc instead of going through the MapReduce cycle of writing the job, creating a JAR file, and then running it.
  • Auto optimization : Many jobs can be optimized by the framework itself, so you can focus on your business logic rather than on code optimization.
  • If you use Pig scripts for data processing, you do not always have to think in terms of MapReduce and key-value pairs.
  • Less Custom code : If you implement filters, projections, and joins directly on the MapReduce framework, you have to write a lot of custom code, which can easily be avoided by using Pig scripts.

Components of Apache Pig : Pig has the following components.

  • Pig Latin : A simple yet powerful high-level data flow language, somewhat similar to SQL, whose scripts are executed as MapReduce jobs. Pig Latin is often called simply "Pig".
  • Grunt Shell : You can use the Grunt shell (Pig's interactive shell) to run your Pig scripts and see the results in the shell itself.
  • Pig Engine : The core framework, which converts the Pig scripts we write into a chain of MapReduce jobs, optimizes the jobs, and finally submits them to the Hadoop cluster for execution. Pig runs on YARN.

Execution Mode : Pig can be executed in either of the modes below.

  • MapReduce : This is the default mode, which we will be using in our hands-on sessions. It requires access to a Hadoop cluster.
  • Local : In this mode the Pig script is executed on the same machine on which Pig is installed, using the local file system.