Module-11C


Module 11C : Hands On : Apache Pig Data Loading (Length 14 Minutes)

1. Understand Load statement

2. Loading csv file

3. Loading csv file with schema

4. Loading Tab separated file

5. Storing back data to HDFS.

Introduction : As discussed previously, Pig Latin is a data flow language.

Relation : Each processing step, once completed, produces a new data set. This data set is called a relation.

Example : 

myData = LOAD 'myData.txt';

Remember : 

Load Statement : Specifies the input data for the script. It is usually the first step.

Syntax

LOAD 'file/directory path' [USING function] [AS schema];

 

Example of loading CSV file (HandsOn).

categories = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',');

DESCRIBE categories;

Schema for categories unknown.

DUMP categories; -- Avoid on large data sets, since DUMP prints every record to the console

(1,2,Football)
(2,2,Soccer)
(3,2,Baseball & Softball)
(4,2,Basketball)
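Because no schema was declared, the fields of categories have no names and default to the bytearray type, so later statements can only refer to them by position ($0, $1, and so on). A minimal sketch, assuming the same cat.txt file; the aliases idAndName and firstFew are names chosen here for illustration:

```pig
-- Load without a schema: fields are unnamed and typed as bytearray
categories = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',');

-- Refer to fields by position: $0 is the first column, $2 is the third
idAndName = FOREACH categories GENERATE $0, $2;

-- LIMIT keeps the console output small before inspecting it
firstFew = LIMIT idAndName 5;
DUMP firstFew;
```

Applying LIMIT before DUMP is one way to look at the data without printing the whole relation.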

Example of loading CSV file and defining schema (HandsOn).

categoriesWithSchema = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',') AS (id:int,subId:int,categoryName:chararray);

DESCRIBE categoriesWithSchema;

categoriesWithSchema: {id: int,subId: int,categoryName: chararray}
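With a schema in place, later statements can use field names and typed comparisons instead of positions. A short sketch, assuming the same cat.txt file; the aliases subCategoryTwo and namesOnly are illustrative:

```pig
categoriesWithSchema = LOAD '/user/cloudera/Training/pig/cat.txt'
    USING PigStorage(',')
    AS (id:int, subId:int, categoryName:chararray);

-- subId is an int, so it can be compared with a numeric literal
subCategoryTwo = FILTER categoriesWithSchema BY subId == 2;

-- Project fields by name rather than by $0, $2
namesOnly = FOREACH subCategoryTwo GENERATE id, categoryName;

-- The projected relation keeps the names and types of the selected fields
DESCRIBE namesOnly;
```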

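Example with Tab separated file (HandsOn). No USING clause is needed here, because the default load function, PigStorage, splits fields on the tab character.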

categoriesWithSchemaTab = LOAD '/user/cloudera/Training/pig/catTab.txt' AS (id:int,subId:int,categoryName:chararray);

DESCRIBE categoriesWithSchemaTab;

categoriesWithSchemaTab: {id: int,subId: int,categoryName: chararray}

DUMP categoriesWithSchemaTab;

ILLUSTRATE categoriesWithSchemaTab;

--------------------------------------------------------------------------------------
| categoriesWithSchemaTab     | id:int    | subId:int    | categoryName:chararray    |
--------------------------------------------------------------------------------------
|                             | 51        | 8            | NHL                       |
--------------------------------------------------------------------------------------
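The tab separated load works without a USING clause because PigStorage is the default load function and tab is its default delimiter. The same load can be written with the delimiter spelled out, as sketched below; the alias categoriesExplicitTab is a name chosen here for illustration:

```pig
-- Same load as categoriesWithSchemaTab, with the default delimiter written out
categoriesExplicitTab = LOAD '/user/cloudera/Training/pig/catTab.txt'
    USING PigStorage('\t')
    AS (id:int, subId:int, categoryName:chararray);

DESCRIBE categoriesExplicitTab;
```

Writing the delimiter explicitly can make a script easier to read when it mixes tab separated and comma separated inputs.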

Loading Data from HBase 

divs = load 'myData' using HBaseStorage();
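In practice, HBaseStorage is usually given the list of columns to read, and the table is addressed with an hbase:// prefix. The sketch below assumes that the myData table has a column family named info with columns id and name. Those column names are hypothetical and do not come from this module.

```pig
-- Hypothetical column family 'info' with columns 'id' and 'name'
-- '-loadKey true' returns the HBase row key as the first field
divs = LOAD 'hbase://myData'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:id info:name', '-loadKey true')
    AS (rowKey:chararray, id:int, name:chararray);
```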

Store Statement : Writes a relation to the file system (HDFS in this module).

Syntax :

STORE alias INTO 'directory' [USING function];

 

Example of saving a relation to HDFS (HandsOn).

STORE categoriesWithSchemaTab INTO '/user/cloudera/Training/pig/output/Tab/he1'; -- Save as tab separated data

STORE categoriesWithSchemaTab INTO '/user/cloudera/Training/pig/output/csv/he1' USING PigStorage(','); -- Save as csv data

Now verify the data in the HDFS output directories from the Grunt shell, as below.

cat /user/cloudera/Training/pig/output/Tab/he1

cat /user/cloudera/Training/pig/output/csv/he1
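STORE fails if the output directory already exists, so the old output has to be removed before the hands-on is run again. A minimal sketch, assuming categoriesWithSchemaTab is already defined in the session. It uses the Grunt rmf command, which removes a path and does not raise an error when the path is missing:

```pig
-- Remove any previous output so that STORE does not fail on an existing directory
rmf /user/cloudera/Training/pig/output/csv/he1

STORE categoriesWithSchemaTab INTO '/user/cloudera/Training/pig/output/csv/he1' USING PigStorage(',');
```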