Apache Spark groupByKey and reduceByKey

March 7, 2020

I was answering a question on Stack Overflow and came up with a pictorial representation of groupByKey and reduceByKey. I am sharing it here so that I do not lose it over time.
The Spark documentation itself is quite clear on this topic, so I am not adding much text here.

Both examples below sum up the values for each key.
1. groupByKey – Shuffles the data based on the keys first, and only afterwards lets us work with the values.
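A minimal sketch of that sum using groupByKey, assuming a spark-shell session where sc is already available:

// Pair RDD of (key, value)
val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 1, "a" -> 1, "c" -> 1))
// groupByKey shuffles every (key, value) pair, then we sum the grouped values ourselves
val sums = pairs.groupByKey().mapValues(_.sum)
sums.collect().foreach(println) // (a,2), (b,1), (c,1) – order may vary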

Diagram: groupByKey

2. reduceByKey – Runs the reduce function locally first, then shuffles the partial results and runs the reduce function once more to produce the final result.
Hence reduceByKey acts like a combiner in the MapReduce world: it reduces the amount of data shuffled during the process.
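The same sum with reduceByKey, again a minimal sketch reusing the pairs RDD from the snippet above; values are combined locally on each partition before anything is shuffled:

val sums = pairs.reduceByKey(_ + _) // local combine per partition, then shuffle, then final reduce
sums.collect().foreach(println) // (a,2), (b,1), (c,1) – order may vary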

Diagram: reduceByKey

GroupBy and count using Spark DataFrame

June 18, 2019

Here we group the rows by key and count how many times each key occurs.

val datardd = sc.parallelize(Seq("a" -> 1, "b" -> 1, "a" -> 1, "c" -> 1))

val mydf = datardd.toDF("name", "count")

mydf.groupBy($"name").agg("count" -> "count").
  withColumnRenamed("count(count)", "noofoccurrences").
  orderBy($"noofoccurrences".desc).show

+----+---------------+
|name|noofoccurrences|
+----+---------------+
|   a|              2|
|   b|              1|
|   c|              1|
+----+---------------+
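Since the aggregation here is just a row count, groupBy followed by count() gives an equivalent, slightly shorter form (a minimal sketch against the same mydf):

mydf.groupBy($"name").count().
  withColumnRenamed("count", "noofoccurrences").
  orderBy($"noofoccurrences".desc).show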

Running Apache Spark on Windows

July 10, 2016

Running Hadoop on Windows is not trivial; however, running Apache Spark on Windows proved not too difficult. I came across a couple of blogs and a Stack Overflow discussion that made this possible. Below are my notes, which are the outcome of that reference material.

  1. Download http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-without-hadoop.tgz ( http://spark.apache.org/downloads.html )
  2. Download Hadoop distribution for Windows from http://www.barik.net/archive/2015/01/19/172716/
  3. Create hadoop-env.cmd in the {HADOOP_INSTALL_DIR}/conf directory with the following content:
    SET JAVA_HOME=C:\Progra~1\Java\jdk1.7.0_80
    
  4. In a new command window, run hadoop-env.cmd followed by {HADOOP_INSTALL_DIR}/bin/hadoop classpath
    The output of this command is used to initialize SPARK_DIST_CLASSPATH in spark-env.cmd, which you will create in the next step.
  5. Create spark-env.cmd in {SPARK_INSTALL_DIR}/conf
     REM spark-env.cmd content
     SET HADOOP_HOME=C:\amit\hadoop\hadoop-2.6.0
     SET HADOOP_CONF_DIR=%HADOOP_HOME%\conf
     SET SPARK_DIST_CLASSPATH=<Output of hadoop classpath>
     SET JAVA_HOME=C:\Progra~1\Java\jdk1.7.0_80
    
  6. Now run the examples or spark-shell from the {SPARK_INSTALL_DIR}/bin directory. Please note that you may have to run spark-env.cmd explicitly prior to running the examples or spark-shell. A quick smoke test for the setup is shown after this list.
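Once spark-shell starts, a quick smoke test confirms the setup works; a minimal sketch below (the README.md path is only an example, adjust it to your {SPARK_INSTALL_DIR}):

// Count the lines of the README bundled with the Spark distribution
val readme = sc.textFile("C:\\amit\\spark\\spark-1.6.0-bin-without-hadoop\\README.md")
println(readme.count())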
