Module-11C
Module 11C : Hands On : Apache Pig Data loading : Available (Length 14 Minutes)
1. Understand Load statement
2. Loading csv file
3. Loading csv file with schema
4. Loading Tab separated file
5. Storing back data to HDFS.
Introduction : As discussed previously, Pig Latin is a data flow language.
Relation : Each processing step, once completed, generates a new data set, which is called a relation.
Example :
myData = load 'myData.txt';
Here 'myData' is a new relation, generated by the load processing step.
Remember :
Keywords of Pig Latin are case insensitive (e.g. load and LOAD are the same), but alias/relation names are case sensitive.
Load Statement : Use this statement to specify the input data for the script. (This is usually the first step.)
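A minimal sketch of the case rule above, reusing the 'myData.txt' path from the earlier example. The second relation name is only illustrative.

```pig
-- Keywords are case insensitive: both statements below load the same file.
myData = load 'myData.txt';
MyData = LOAD 'myData.txt';

-- Relation names are case sensitive: myData and MyData are two separate relations.
DESCRIBE myData;
DESCRIBE MyData;

-- An alias that was never defined (for example MYDATA) cannot be used,
-- so the next line is left commented out.
-- DESCRIBE MYDATA;
```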
By default, load looks for your data on HDFS in a tab-delimited file using the default load function PigStorage.
Syntax :
LOAD 'file/directory path' [USING function] [AS schema];
Path : If you specify a directory name, all the files in the directory are loaded.
USING : A keyword that specifies which function should be used to load the data. It is optional; if the USING clause is omitted, the 'PigStorage' function is used.
function : You can use a built-in function or your own custom function to load the data.
Schema : The loader produces the data of the type specified by the schema. If the data does not conform to the schema, depending on the loader, either a null value or an error is generated.
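The Path and Schema notes above can be sketched as follows. The directory name 'input_dir' is hypothetical and stands for any HDFS directory holding comma-separated files. The sketch assumes the PigStorage loader, which loads a value as null when it cannot be cast to the declared type.

```pig
-- Hypothetical directory: every file inside it is read into one relation.
allFiles = LOAD '/user/cloudera/Training/pig/input_dir' USING PigStorage(',');

-- With a schema, PigStorage casts each field to the declared type.
-- A value that cannot be cast (for example the text 'abc' in an int column)
-- is loaded as null rather than stopping the script.
typed = LOAD '/user/cloudera/Training/pig/input_dir' USING PigStorage(',')
        AS (id:int, subId:int, categoryName:chararray);

-- Keep only the rows whose id is null (missing or not parseable) to inspect bad input.
badRows = FILTER typed BY id IS NULL;
```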
Example of loading CSV file (HandsOn).
Step 1 : Download the file and save it locally (as shown in the video)
Step 2 : Upload this file to HDFS
Step 3 : Write a Pig script as below.
categories = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',');
DESCRIBE categories;
Schema for categories unknown.
DUMP categories; -- Avoid DUMP on large data sets, as it prints every record to the console
(1,2,Football)
(2,2,Soccer)
(3,2,Baseball & Softball)
(4,2,Basketball)
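Since the relation above has no schema and DUMP prints every record, a safer way to inspect it might look like the sketch below. The relation names categoryNames and firstRows are illustrative, not part of the original script.

```pig
categories = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',');

-- Without a schema, fields have no names and are referenced by position ($0, $1, $2).
categoryNames = FOREACH categories GENERATE $2;

-- Inspect only a few records instead of dumping the whole relation.
firstRows = LIMIT categories 3;
DUMP firstRows;
```

LIMIT does not guarantee which records are returned, so the three rows shown can differ between runs.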
Example of loading CSV file and defining schema (HandsOn).
categoriesWithSchema = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',') AS (id:int,subId:int,categoryName:chararray);
DESCRIBE categoriesWithSchema;
categoriesWithSchema: {id: int,subId: int,categoryName: chararray}
Example with Tab separated file (HandsOn)
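Once a schema is defined, later steps can refer to fields by name and rely on their types. A short sketch, in which the relation names subTwo and namesOnly are illustrative:

```pig
categoriesWithSchema = LOAD '/user/cloudera/Training/pig/cat.txt' USING PigStorage(',')
    AS (id:int, subId:int, categoryName:chararray);

-- Fields can now be referenced by name, and comparisons use the declared types.
subTwo = FILTER categoriesWithSchema BY subId == 2;
namesOnly = FOREACH subTwo GENERATE id, categoryName;
DESCRIBE namesOnly;
```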
Step 1 : Download tab separated file
Step 2 : Upload this file in HDFS
Step 3 : Now write Pig Script to load this file.
categoriesWithSchemaTab = LOAD '/user/cloudera/Training/pig/catTab.txt' AS (id:int,subId:int,categoryName:chararray);
DESCRIBE categoriesWithSchemaTab;
categoriesWithSchemaTab: {id: int,subId: int,categoryName: chararray}
DUMP categoriesWithSchemaTab;
ILLUSTRATE categoriesWithSchemaTab;
--------------------------------------------------------------------------------------
| categoriesWithSchemaTab | id:int | subId:int | categoryName:chararray |
--------------------------------------------------------------------------------------
|                         | 51     | 8         | NHL                    |
--------------------------------------------------------------------------------------
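Loading Data from HBase
Load functions can also read from sources other than plain HDFS files. The statement below is a simplified illustration; in practice HBaseStorage also takes the list of columns to read as an argument.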
divs = load 'myData' using HBaseStorage();
Store statement : Saves the data of a relation to the file system.
Syntax :
STORE alias INTO 'directory' [USING function];
alias : Name of the relation (which holds our calculated data) that needs to be stored in the file system. It is written without quotes.
'directory' : The name of the storage directory, in quotes. If the directory already exists, the STORE operation will fail.
The output data files, named part-nnnnn, are written to this directory.
If the USING clause is omitted, the default store function PigStorage is used.
You can write your own store function if your data is in a format that cannot be processed by the built-in functions.
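Because STORE fails when the target directory already exists, a script that is rerun often clears the old output first. A minimal sketch follows; the output directory 'output/demo' and the '|' delimiter are chosen only for illustration.

```pig
-- Remove any earlier output; rmf does not complain when the directory is missing.
rmf /user/cloudera/Training/pig/output/demo

categoriesWithSchemaTab = LOAD '/user/cloudera/Training/pig/catTab.txt'
    AS (id:int, subId:int, categoryName:chararray);

-- Store with an explicit delimiter instead of the default tab.
STORE categoriesWithSchemaTab INTO '/user/cloudera/Training/pig/output/demo' USING PigStorage('|');
```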
Example : Save relation to HDFS (HandsOn)
STORE categoriesWithSchemaTab INTO '/user/cloudera/Training/pig/output/Tab/he1'; -- Save as tab separated data
STORE categoriesWithSchemaTab INTO '/user/cloudera/Training/pig/output/csv/he1' USING PigStorage(','); -- Save as csv data
Now verify the data in the HDFS directories as below.
cat /user/cloudera/Training/pig/output/Tab/he1
cat /user/cloudera/Training/pig/output/csv/he1
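As an alternative check from the Grunt shell, the stored files can be listed and printed with the fs command. The sketch below assumes the csv output directory used in the STORE statement above.

```pig
-- List the files Pig wrote; the data is in files named part-nnnnn.
fs -ls /user/cloudera/Training/pig/output/csv/he1

-- Print the stored records from all part files.
fs -cat /user/cloudera/Training/pig/output/csv/he1/part-*
```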