Requirement:

To create an Oozie workflow with four actions.

Actions in this workflow:

  1. Pig action
  2. FS action
  3. Pig action
  4. MapReduce action

STEP 1: Created two simple Pig scripts as below.

A = LOAD '/user/anoopk/input1/input_an_ri.txt';
STORE A INTO 'out2';

A = LOAD '/user/anoopk/sites.txt';
STORE A INTO 'out3';

The above scripts load data from the input files and store the results under the HDFS folders out2 and out3. I have named the scripts test1.pig and test.pig.
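The scripts can be sanity-checked straight from the edge node before wiring them into the workflow (a quick sketch; assumes the pig client is installed there):

# runs against the cluster; 'pig -x local' would test against local files instead
pig test.pig
pig test1.pig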

Also loaded hadoop-mapreduce-examples.jar, which the map-reduce action needs, to the HDFS folder Workflow/mr/lib/hadoop-mapreduce-examples.jar.
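A minimal sketch of that upload; the local location of the examples jar is an assumption and varies by distribution and version:

hdfs dfs -mkdir -p /user/anoopk/Workflow/mr/lib
# local jar path below is a guess; adjust for your cluster
hdfs dfs -put /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar /user/anoopk/Workflow/mr/lib/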

Created ooziesample.properties as below.

============

anoopk@etl1 ~ $ cat ooziesample.properties

filerepo2asAdapterNumbReduces=1

nameNode=hdfs://hdm1.gphd.local:8020

jobTracker=hdm3.gphd.local:8032

queueName=default

username=${user.name}

#oozie

oozie.use.system.libpath=true

oozie.wf.application.path=${nameNode}/user/${user.name}/Workflow/mr

============

In the above file, the value of nameNode can be obtained from /etc/gphd/hadoop/conf/hdfs-site.xml. It is the IP address/FQDN of the NameNode plus the RPC port (8020) it uses for communication, listed under the property dfs.namenode.rpc-address.

The same applies to the job tracker: we should use the IP address/FQDN and the RPC port of the YARN ResourceManager. This can be obtained from /etc/gphd/conf.gphd-2.0.1/yarn-site.xml under the property yarn.resourcemanager.address.
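A quick way to pull both values out of the configs (a sketch; the config file locations differ from cluster to cluster):

grep -A1 'dfs.namenode.rpc-address' /etc/gphd/hadoop/conf/hdfs-site.xml
grep -A1 'yarn.resourcemanager.address' /etc/gphd/conf.gphd-2.0.1/yarn-site.xml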

 

Created workflow.xml as below

<workflow-app name="ooziesample" xmlns="uri:oozie:workflow:0.4">

    <global>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
    </global>

    <start to="pigsample"/>

    <action name="pigsample">
        <pig>
            <script>test.pig</script>
        </pig>
        <ok to="myfsaction"/>
        <error to="fail"/>
    </action>

    <!-- FS action starts -->
    <action name="myfsaction">
        <fs>
            <mkdir path="/user/anoopk/Workflow/fsaction1"/>
        </fs>
        <ok to="pigsample1"/>
        <error to="fail"/>
    </action>

    <action name="pigsample1">
        <pig>
            <script>test1.pig</script>
        </pig>
        <ok to="mr-node"/>
        <error to="fail"/>
    </action>

    <action name="mr-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.reducer.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>mapreduce.map.class</name>
                    <value>org.apache.hadoop.examples.WordCount$TokenizerMapper</value>
                </property>
                <property>
                    <name>mapreduce.reduce.class</name>
                    <value>org.apache.hadoop.examples.WordCount$IntSumReducer</value>
                </property>
                <property>
                    <name>mapreduce.combine.class</name>
                    <value>org.apache.hadoop.examples.WordCount$IntSumReducer</value>
                </property>
                <property>
                    <name>mapred.output.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapred.output.value.class</name>
                    <value>org.apache.hadoop.io.IntWritable</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>/user/anoopk/Workflow/input-data/text</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/user/anoopk/Workflow/output-data/outputDir</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <end name="end"/>

</workflow-app>

Note that the .properties file is named after the workflow-app name (ooziesample).

 

Validate the workflow.

# oozie validate workflow.xml

Copy the following files to the HDFS path /user/anoopk/Workflow/mr (see the commands sketched below):

  1. workflow.xml
  2. Pig scripts test.pig and test1.pig

The file ooziesample.properties should stay in the local path.
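A minimal sketch of the copy, assuming all three files sit in the current local directory:

hdfs dfs -put workflow.xml test.pig test1.pig /user/anoopk/Workflow/mr/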

 

Then executed the workflow as below

============

anoopk@etl1 [1044]~/job$ oozie job -oozie http://hdm1.gphd.local:11000/oozie/ -config ooziesample.properties -auth KERBEROS -run

job: 0000253-160302152219596-oozie-oozi-W

============

We should use the option '-auth KERBEROS' on a Kerberos-enabled cluster, since the Oozie client uses 'simple' authentication by default.
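This in turn assumes a valid Kerberos ticket on the client; a sketch, with a hypothetical principal and realm:

# principal and realm below are placeholders; use your own
kinit anoopk@GPHD.LOCAL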

Status can be checked using the below command.

anoopk@etl1 [1050]~/job$ oozie job -oozie http://hdm1.gphd.local:11000/oozie/ -info 0000254-160302152219596-oozie-oozi-W

Job ID : 0000254-160302152219596-oozie-oozi-W

------------------------------------------------------------------------------------------------------------

Workflow Name : ooziesample

App Path : hdfs://hdm1.gphd.local:8020/user/anoopk/Workflow/mr

Status : SUCCEEDED

Run : 0

User : anoopk

Group : -

Created : 2016-04-04 12:26 GMT

Started : 2016-04-04 12:26 GMT

Last Modified : 2016-04-04 12:28 GMT

Ended : 2016-04-04 12:28 GMT

CoordAction ID: -

Actions

------------------------------------------------------------------------------------------------------------

ID Status Ext ID Ext Status Err Code

------------------------------------------------------------------------------------------------------------

0000254-160302152219596-oozie-oozi-W@:start: OK - OK -

------------------------------------------------------------------------------------------------------------

0000254-160302152219596-oozie-oozi-W@pigsample OK job_1456868579772_1369 SUCCEEDED -

------------------------------------------------------------------------------------------------------------

0000254-160302152219596-oozie-oozi-W@myfsaction OK - OK -

------------------------------------------------------------------------------------------------------------

0000254-160302152219596-oozie-oozi-W@pigsample1 OK job_1456868579772_1371 SUCCEEDED -

------------------------------------------------------------------------------------------------------------

0000254-160302152219596-oozie-oozi-W@mr-node OK job_1456868579772_1373 SUCCEEDED -

------------------------------------------------------------------------------------------------------------

0000254-160302152219596-oozie-oozi-W@end OK - OK -


 

The job succeeded and created the new folders out2 and out3. The FS action created the HDFS folder /user/anoopk/Workflow/fsaction1, and the map-reduce output is in the folder Workflow/output-data/outputDir.
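The outputs can be verified from the command line (a sketch; the reducer part-file name depends on the job):

hdfs dfs -ls /user/anoopk/out2 /user/anoopk/out3 /user/anoopk/Workflow/fsaction1
# part-r-00000 is the usual name for a single-reducer output; adjust if needed
hdfs dfs -cat /user/anoopk/Workflow/output-data/outputDir/part-r-00000 | head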
