Spark2, PySpark and Jupyter installation and configuration

Steps to follow for enabling Spark 2, PySpark and Jupyter on Cloudera clusters.

1. Install Oracle JDK on all nodes

Download and install Java. It should be JDK 1.8+.

# cd /usr/java/
# wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u144-b15/jdk-8u144-linux-x64.tar.gz"
# tar xzf jdk-8u144-linux-x64.tar.gz

2. Install Java with alternatives

# cd /usr/java
# alternatives …

Continue reading Spark2, PySpark and Jupyter installation and configuration
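Since the post requires JDK 1.8 or newer on every node, a small check can confirm the requirement before continuing. The following is a minimal sketch, not part of the original post: the function name java_version_ok is hypothetical, and it assumes the version string follows either the legacy "1.8.0_144" form or the newer "11.0.2" form.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post).
# Returns 0 when the given Java version string is 1.8 or newer.
# Assumes legacy "1.8.0_144" or modern "11.0.2" style numbering.
java_version_ok() {
    ver="$1"
    major="${ver%%.*}"                  # text before the first dot
    if [ "$major" = "1" ]; then
        rest="${ver#*.}"                # e.g. "8.0_144"
        minor="${rest%%[!0-9]*}"        # leading digits only, e.g. "8"
        [ "${minor:-0}" -ge 8 ]
    else
        major="${major%%[!0-9]*}"       # strip suffixes such as "-ea"
        [ "${major:-0}" -ge 8 ]
    fi
}
```

On a node, the installed version could be extracted with java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}' and passed to java_version_ok as its first argument.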


Running Hadoop Benchmarking TestDFSIO on Cloudera Clusters

Hadoop provides a benchmarking mechanism for the cluster. The steps to benchmark the Cloudera cluster file system are below.

Set HADOOP_HOME:

HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop/

Run TestDFSIO as below:

# hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-cdh4.3.0-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
# hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-cdh4.3.0-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000

Once you run the test you will see the TestDFSIO_results.log file in … Continue reading Running Hadoop Benchmarking TestDFSIO on Cloudera Clusters
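The write and read commands differ only in the mode flag, so they can be generated from one place. The following is a minimal sketch, not part of the original post: the names TESTS_JAR and dfsio_cmd are hypothetical, the jar path is copied from the commands in the excerpt and will differ on other CDH versions, and the defaults of 10 files and a size of 1000 mirror the excerpt.

```shell
#!/bin/sh
# Hypothetical variable: jar path copied from the excerpt; adjust for your CDH version.
TESTS_JAR="${TESTS_JAR:-/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.0.0-cdh4.3.0-tests.jar}"

# Hypothetical helper: print the TestDFSIO command line for a mode (write|read),
# a file count (default 10) and a -fileSize value (default 1000).
dfsio_cmd() {
    mode="$1"
    nr_files="${2:-10}"
    file_size="${3:-1000}"
    case "$mode" in
        write|read) ;;
        *)
            echo "usage: dfsio_cmd write|read [nrFiles] [fileSize]" >&2
            return 1
            ;;
    esac
    echo "hadoop jar $TESTS_JAR TestDFSIO -$mode -nrFiles $nr_files -fileSize $file_size"
}
```

The function only prints the command so it can be reviewed first; running it on a cluster node would be a matter of eval "$(dfsio_cmd write)" followed by eval "$(dfsio_cmd read)".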