1. What is Pig?
2. Introduction to Pig Data Flow Engine
3. Pig and MapReduce in Detail
4. When Should Pig Be Used?
5. Pig and Hadoop Cluster
6. Pig Interpreter and MapReduce
7. Pig Relations and Data Types
8. PigLatin Example in Detail
9. Debugging and Generating Example in Apache Pig
- ETL (Extract, Transform and Load) : For example, you can extract all the relevant information from web-server logs, apply transformations (e.g. working out which pages were visited by each user) and finally save the results to HDFS/Hive/HBase etc. (see the Pig Latin sketch after this list).
WebServerLogs --> Extract Relevant Info --> Apply Transformations (e.g. mapping between page content and user id, aggregate, join, sort etc.) --> Load into Hive Table
- Ad-hoc queries on raw data : Data scientists or the analytics team can analyze raw data directly using Pig Latin scripts.
- Iterative data processing : Some algorithms need data to be processed iteratively, which can be implemented easily using Apache Pig.
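A minimal Pig Latin sketch of the ETL flow above (the input path /data/weblogs, the tab delimiter and the column layout are assumptions for illustration):

-- Extract: load the raw web-server logs (hypothetical path and schema)
logs = LOAD '/data/weblogs' USING PigStorage('\t')
           AS (userid:chararray, ts:long, page:chararray);

-- Transform: count how many times each user visited each page
grouped = GROUP logs BY (userid, page);
counts  = FOREACH grouped GENERATE
              FLATTEN(group) AS (userid, page),
              COUNT(logs) AS visits;

-- Load: store the result in HDFS (a Hive table is also possible, e.g. via HCatalog)
STORE counts INTO '/output/page_visits' USING PigStorage('\t');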
- Custom Functionality : Developers can write their own custom functions (UDFs) if Pig does not provide the required functionality in-built.
- Less Coding : A huge MapReduce job can often be written in a few lines of Pig script (see the word-count sketch after this list).
- Grunt Shell : Using the Grunt shell, you can do ad-hoc analysis of data interactively, instead of going through the usual MapReduce cycle of writing the job, creating a JAR file and then running it.
- Auto optimization : Many jobs are optimized automatically by the framework itself, so you can focus on your business logic rather than on code optimization.
- If you use Pig scripts for data processing, you do not always have to think in terms of map-reduce and key-value pairs.
- Less Custom code : Doing filters, projections and joins in the MapReduce framework requires a lot of custom code, which can easily be avoided by using an Apache Pig script.
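To illustrate how much coding Pig saves, here is the classic word count, which takes dozens of lines as a Java MapReduce job, written as a handful of Pig Latin statements (the input and output paths are illustrative):

lines  = LOAD '/data/input.txt' AS (line:chararray);
-- split each line into words; TOKENIZE returns a bag, FLATTEN unnests it
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO '/output/wordcount';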
- Pig Latin : a simple yet powerful high-level data flow language, similar to SQL, whose statements are executed as MapReduce jobs. PigLatin is often called simply "Pig".
- Grunt Shell : You can use the Grunt shell (an interactive shell) to run Pig statements one at a time and see the results in the shell itself (a short session is shown after this list).
- Pig Engine : The core framework, which converts the Pig scripts we write into a chain of MapReduce jobs, optimizes those jobs and finally submits them to the Hadoop cluster for execution. Pig runs on YARN.
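For example, a short Grunt session might look like this (the file path and schema are assumptions):

grunt> logs = LOAD '/data/weblogs' AS (userid:chararray, page:chararray);
grunt> DESCRIBE logs;    -- prints the schema of the relation
grunt> top10 = LIMIT logs 10;
grunt> DUMP top10;       -- executes the pipeline and prints the 10 records to the shell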
- MapReduce : This is the default mode, which we will be using in our hands-on sessions. It requires access to a Hadoop cluster.
- Local : In this mode, the Pig script is executed on the same machine on which Pig is installed, using the local filesystem.
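The mode is chosen with the -x option when launching Pig (the script name myscript.pig is hypothetical):

pig -x mapreduce myscript.pig    # default mode: submits jobs to the Hadoop cluster
pig -x local myscript.pig        # local mode: runs against the local filesystem
pig                              # no arguments: starts the Grunt shell in MapReduce mode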