Spark2, PySpark and Jupyter installation and configuration

Steps to be followed for enabling Spark 2, PySpark and Jupyter in Cloudera clusters.

1. INSTALL ORACLE JDK IN ALL NODES

  1. Download and extract Java. It should be JDK 1.8 or later.

# cd /usr/java/
# wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u144-b15/jdk-8u144-linux-x64.tar.gz"

# tar xzf jdk-8u144-linux-x64.tar.gz

 

2. Install Java with alternatives

# cd /usr/java
# alternatives --install /usr/bin/java java /usr/java/jdk1.8.0_144/bin/java 2
# alternatives --config java
There are 3 programs which provide 'java'.

Selection Command
-----------------------------------------------
* 1 /opt/jdk1.7.0_60/bin/java
+ 2 /opt/jdk1.7.0_72/bin/java
  3 /usr/java/jdk1.8.0_144/bin/java

Enter to keep the current selection[+], or type selection number: 3 [Press Enter]

3. Check java version

# java -version

java version "1.8.0_144"

Java(TM) SE Runtime Environment (build 1.8.0_144-b01)

Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
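
Optionally, export JAVA_HOME system-wide so that other services pick up the new JDK. A minimal sketch, assuming the installation path used above (the profile.d filename is arbitrary):

# echo 'export JAVA_HOME=/usr/java/jdk1.8.0_144' >> /etc/profile.d/java.sh
# echo 'export PATH=$JAVA_HOME/bin:$PATH' >> /etc/profile.d/java.sh
# source /etc/profile.d/java.sh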

 

2. INSTALL ANACONDA PYTHON IN ALL NODES

Download the Anaconda Python installer, give it execute permission, and run it as below.

#wget https://repo.continuum.io/archive/Anaconda3-5.0.0.1-Linux-x86_64.sh

#chmod 755 Anaconda3-5.0.0.1-Linux-x86_64.sh

#./Anaconda3-5.0.0.1-Linux-x86_64.sh

Running the installer will ask you to accept the license and to specify the installation location. Take a note of the installation location, as it is needed later when putting Anaconda on the PATH for Jupyter.
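
For unattended installation across all nodes, the installer also supports a silent mode; a sketch, assuming /data/anaconda3 as the target prefix (the location this guide later adds to the PATH):

#./Anaconda3-5.0.0.1-Linux-x86_64.sh -b -p /data/anaconda3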

3. INSTALL SPARK 2.2.0

  1. Make sure that the requirements of Spark2 (CDH, Cloudera Manager and JDK versions) are met.
  2. Download the Spark2 CSD and place it in the configured CSD location (/opt/cloudera/csd by default).
  3. Set the file ownership of the CSD to cloudera-scm:cloudera-scm with permission 644.
  4. Restart the Cloudera Manager Server:

          #  service cloudera-scm-server restart

  5. Log into the Cloudera Manager Admin Console and restart the Cloudera Management Service.
  6. Download the Spark2 parcel, distribute the parcel to the hosts in your cluster, and activate the parcel.
  7. Add the Spark 2 service to your cluster.
  8. Restart the stale services in the cluster.
  9. Do the testing of spark and pyspark as shown below.
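
Steps 2 to 4 might look like the following on the Cloudera Manager host (the CSD filename and download URL are assumptions that depend on the exact Spark2 release):

# cd /opt/cloudera/csd
# wget https://archive.cloudera.com/spark2/csd/SPARK2_ON_YARN-2.2.0.cloudera1.jar
# chown cloudera-scm:cloudera-scm SPARK2_ON_YARN-2.2.0.cloudera1.jar
# chmod 644 SPARK2_ON_YARN-2.2.0.cloudera1.jar
# service cloudera-scm-server restart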

 

#spark2-submit \
--class org.apache.spark.examples.SparkPi \
--deploy-mode cluster \
--master yarn \
$SPARK_HOME/examples/lib/spark-examples_version.jar 10
$ hdfs dfs -mkdir /user/systest/spark
$ pyspark2
SparkSession available as 'spark'.
>>> strings = ["one", "two", "three"]
>>> s2 = sc.parallelize(strings)
>>> s3 = s2.map(lambda word: word.upper())
>>> s3.collect()
['ONE', 'TWO', 'THREE']
>>> s3.saveAsTextFile('hdfs:///user/systest/spark/canary_test')
>>> quit()
$ hdfs dfs -ls /user/systest/spark
Found 1 items
drwxr-xr-x   - systest supergroup          0 2016-08-26 14:41 /user/systest/spark/canary_test
$ hdfs dfs -ls /user/systest/spark/canary_test
Found 3 items
-rw-r--r--   3 systest supergroup          0 2016-08-26 14:41 /user/systest/spark/canary_test/_SUCCESS
-rw-r--r--   3 systest supergroup          4 2016-08-26 14:41 /user/systest/spark/canary_test/part-00000
-rw-r--r--   3 systest supergroup         10 2016-08-26 14:41 /user/systest/spark/canary_test/part-00001
$ hdfs dfs -cat /user/systest/spark/canary_test/part-00000
ONE
$ hdfs dfs -cat /user/systest/spark/canary_test/part-00001
TWO
THREE

4. INSTALLATION OF JUPYTER

 

1. Enable Python 3.6 by placing the Anaconda installation first on the PATH, as follows.

2. Open your .bash_profile: # vim .bash_profile

3. Add the line PATH=/data/anaconda3/bin:$PATH:$HOME/bin

4. Comment out the existing PATH line.

5. The final PATH block will look as below.

       #PATH=$PATH:$HOME/bin
       PATH=/data/anaconda3/bin:$PATH:$HOME/bin
       export PATH
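
Reload the profile and confirm that the Anaconda interpreter is now found first; assuming the /data/anaconda3 prefix used above:

# source ~/.bash_profile
# which python
/data/anaconda3/bin/python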

 

6. Generate the Jupyter config

 

#jupyter notebook --generate-config


7. Open the generated config (~/.jupyter/jupyter_notebook_config.py), set the IP, port and password, and disable opening a browser by default:
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:b590ff3593c9:c469e487d6d4e4650677b318t8dedffec7be35db'
c.NotebookApp.ip = 'node1.host.com'
c.NotebookApp.port = 6090

 

The password is a hashed one and can be generated as below.

from IPython.lib import passwd
password = passwd("secret")
password
8. Launch Jupyter

#jupyter notebook --config .jupyter/jupyter_notebook_config.py &
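
Jupyter should now be reachable at http://node1.host.com:6090/ (the IP and port configured above); log in with the password whose hash you placed in the config.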

5. PYSPARK FROM JUPYTER

Log in to Jupyter, create a normal Python 3 notebook, and issue the commands below first to point it at the Spark2 parcel.

import os
import sys

os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.4-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
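
The environment settings above only put PySpark on the notebook's Python path; the notebook still needs a SparkSession. A minimal sketch, assuming the cluster runs Spark on YARN (the application name is arbitrary):

from pyspark.sql import SparkSession

# Build a SparkSession against the YARN cluster; getOrCreate() reuses an
# existing session if one is already running in this notebook kernel.
spark = SparkSession.builder \
    .master("yarn") \
    .appName("jupyter-pyspark-test") \
    .getOrCreate()
sc = spark.sparkContext

# Quick sanity check, mirroring the canary test from the pyspark2 shell above
print(sc.parallelize(["one", "two", "three"]).map(lambda w: w.upper()).collect())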

 

Successful execution of the above confirms that PySpark is available from Jupyter.
