Number of Records in an RDD partition

November 9, 2020

I was recently answering a question about how data is partitioned when there are fewer elements than partitions. I ended up writing a simple code snippet to see how many records land in each partition when the number of elements in an RDD is less than the number of partitions specified.

I thought of putting together a couple of lines of code that are useful if you are looking for a way to count the number of records in a partition.

This can also be useful to see whether your partitions are skewed. Sometimes you run into an OutOfMemoryError because one partition is much bigger than the others. This usually happens when a lot of elements share the same hash key and thus end up on the same partition.

For example, if the key is null, all such elements have the same hash code and end up on the same partition.
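To make the skew concrete, here is a minimal plain-Scala sketch (no Spark needed) of hash partitioning, assuming the usual rule that a null key goes to partition 0 and any other key goes to a non-negative modulo of its hash code. The helper names `nonNegativeMod` and `partitionFor` and the sample keys are made up for illustration.

```scala
// Keep the result of % in the range [0, mod), even for negative hash codes.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

// Assumed hash-partitioning rule: null keys map to partition 0.
def partitionFor(key: Any, numPartitions: Int): Int = key match {
  case null => 0
  case k    => nonNegativeMod(k.hashCode, numPartitions)
}

// Illustrative key set dominated by null.
val keys: Seq[String] = Seq(null, "a", null, "b", null, null, "c", null)

// Count how many records each of the 4 partitions would receive.
val recordsPerPartition: Map[Int, Int] =
  keys.groupBy(k => partitionFor(k, 4)).map { case (p, ks) => p -> ks.size }

recordsPerPartition.toSeq.sortBy(_._1).foreach { case (p, n) =>
  println(s"partition $p has $n records")
}
```

With these sample keys, all five null records pile up on partition 0 while the other partitions get one record each.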

Here is the code that helps you find the number of records in each partition.

Below we have 4 elements in the RDD and the number of partitions is 8.

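val rdd = sc.parallelize(List(1, 2, 3, 4), 8)
// records.size consumes the iterator, so emit (partition index, count) pairs instead of returning the iterator
val counts = rdd.mapPartitionsWithIndex((index, records) => Iterator((index, records.size))).collect()
counts.foreach { case (index, count) => println(s"partition $index has $count records") }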
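If you are curious why some partitions come out empty, the sketch below reproduces, in plain Scala, the even-slicing arithmetic that I believe Spark applies when it parallelizes a local collection. Treat that as an assumption rather than a statement about your Spark version; the helper name `slicePositions` is my own.

```scala
// Start (inclusive) and end (exclusive) offsets for each of numSlices slices.
def slicePositions(length: Long, numSlices: Int): Seq[(Int, Int)] =
  (0 until numSlices).map { i =>
    val start = ((i * length) / numSlices).toInt
    val end = (((i + 1) * length) / numSlices).toInt
    (start, end)
  }

val elements = List(1, 2, 3, 4)

// Cut the local list into 8 slices, one per partition.
val slices: Seq[List[Int]] =
  slicePositions(elements.length, 8).map { case (start, end) => elements.slice(start, end) }

slices.zipWithIndex.foreach { case (slice, index) =>
  println(s"partition $index has ${slice.length} records")
}
```

Under this arithmetic, 4 elements over 8 slices alternate between empty and single-element partitions.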

Apache Spark groupByKey and reduceByKey

March 7, 2020

I was answering a question on Stack Overflow and came up with a pictorial representation of groupByKey and reduceByKey. I thought of sharing it here so that I do not lose it over time.
The Spark documentation itself is quite clear about this, so I am not adding much text here.

Both examples below sum up the values for a key.
1. groupByKey – Shuffles the data first based on the keys and then gives us the opportunity to work with the values.





2. reduceByKey – Computes the reduce function locally first, then shuffles the partial results and runs the reduce function once again to produce the final result.
Hence reduceByKey works like a combiner in the MapReduce world. It helps reduce the amount of data shuffled during the process.
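The difference in shuffle volume can be sketched with plain Scala collections. This is only a toy model under my own assumptions (two hand-made "partitions" and a helper called `sumByKey`), not Spark code, but it shows why combining locally first moves fewer records.

```scala
// Two map-side partitions of (key, value) pairs; the data is made up.
val partitions: Seq[Seq[(String, Int)]] = Seq(
  Seq("a" -> 1, "b" -> 1, "a" -> 1),
  Seq("a" -> 1, "b" -> 1, "b" -> 1)
)

// Sum the values for each key in a batch of records.
def sumByKey(records: Seq[(String, Int)]): Seq[(String, Int)] =
  records.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }.toSeq

// groupByKey style: every record crosses the shuffle, then values are summed.
val shuffledByGroup: Seq[(String, Int)] = partitions.flatten
val groupResult: Map[String, Int] = sumByKey(shuffledByGroup).toMap

// reduceByKey style: sum within each partition first (the combiner step),
// shuffle one record per key per partition, then sum again.
val shuffledByReduce: Seq[(String, Int)] = partitions.flatMap(sumByKey)
val reduceResult: Map[String, Int] = sumByKey(shuffledByReduce).toMap

println(s"groupByKey style shuffles ${shuffledByGroup.size} records")
println(s"reduceByKey style shuffles ${shuffledByReduce.size} records")
```

Both styles arrive at the same sums; only the number of records that would cross the shuffle differs.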




GroupBy and count using Spark DataFrame

June 18, 2019

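Here we group by key and count the occurrences of each key.

val datardd = sc.parallelize(Seq("a" -> 1, "b" -> 1, "a" -> 1, "c" -> 1))

// Name the columns explicitly so that $"name" resolves (run in spark-shell, where the implicits are already imported)
val mydf = datardd.toDF("name", "value")

// count() adds a "count" column; rename it to match the output below
mydf.groupBy($"name").count().withColumnRenamed("count", "noofoccurrences").show()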

name noofoccurrences
a 2
b 1
c 1
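If you want to sanity-check the expected counts without a Spark session, the same aggregation can be sketched on a local Scala collection. The names `localData` and `occurrences` are only illustrative.

```scala
// Same sample pairs as above, held in a local collection.
val localData = Seq("a" -> 1, "b" -> 1, "a" -> 1, "c" -> 1)

// Group by the key and count the rows in each group.
val occurrences: Map[String, Int] =
  localData.groupBy(_._1).map { case (name, rows) => name -> rows.size }

occurrences.toSeq.sortBy(_._1).foreach { case (name, n) => println(s"$name $n") }
```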

Kill Tomcat service running on Windows

October 25, 2018

If you terminate a running Spring Boot application from within Eclipse, at times the port on which the embedded Tomcat listens is not freed up. I found the commands below in a Stack Overflow post, and they are really handy.

C:\yourdir>netstat -aon |find /i "listening" |find "8080"

Now grab the PID (68 in this case) and run the command below to kill it.

C:\yourdir>taskkill /F /PID 68
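The PID is the last column of the matching netstat line. As a small sketch of how that lookup could be automated, the Scala snippet below parses one line of output. The sample line is illustrative only (your addresses and PID will differ), and `listeningPid` is a hypothetical helper, not part of any tool.

```scala
import scala.util.Try

// Illustrative netstat -aon line; real output will differ per machine.
val sampleLine =
  "  TCP    0.0.0.0:8080           0.0.0.0:0              LISTENING       68"

// Return the PID (last column) if the line is a LISTENING socket on the given port.
def listeningPid(line: String, port: Int): Option[Int] = {
  val columns = line.trim.split("\\s+")
  val isMatch = columns.length >= 5 &&
    columns(1).endsWith(s":$port") &&
    columns(3).equalsIgnoreCase("LISTENING")
  if (isMatch) Try(columns.last.toInt).toOption else None
}

println(listeningPid(sampleLine, 8080))
```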

Big + Far Math Challenge @ ICC

April 22, 2017


Recently I participated in the Big + Far Math Challenge hosted by ICC and won first prize. The challenge description can be found here –

Participating in it was an exciting learning experience for me. I explored different technical areas while gathering data and preparing visualizations with it.

I have shared the source code and a static version of the visualization on GitHub. The dynamic version was hosted on Apache Solr running on my local desktop.

You can visit the project page @ from which you can navigate to the visualizations that I came up with.

I am also sharing the presentation given to the judges as part of the assessment, in case you are looking for more details.

– Amit

Running Apache Spark on Windows

July 10, 2016

Running Hadoop on Windows is not trivial; however, running Apache Spark on Windows proved not too difficult. I came across a couple of blogs and a Stack Overflow discussion that made this possible. Below are my notes, which are the outcome of that reference material.

  1. Download ( )
  2. Download Hadoop distribution for Windows from
  3. Create hadoop-env.cmd in the {HADOOP_INSTALL_DIR}/conf directory.
    SET JAVA_HOME=C:\Progra~1\Java\jdk1.7.0_80
  4. In a new command window, run hadoop-env.cmd followed by {HADOOP_INSTALL_DIR}/bin/hadoop classpath
    The output of this command is used to initialize SPARK_DIST_CLASSPATH in spark-env.cmd (you may need to create this file).
  5. Create spark-env.cmd in {SPARK_INSTALL_DIR}/conf
     REM spark-env.cmd content
     SET HADOOP_HOME=C:\amit\hadoop\hadoop-2.6.0
     SET SPARK_DIST_CLASSPATH=<Output of hadoop classpath>
     SET JAVA_HOME=C:\Progra~1\Java\jdk1.7.0_80
  6. Now run the examples or the Spark shell from the {SPARK_INSTALL_DIR}/bin directory. Please note that you may have to run spark-env.cmd explicitly prior to running the examples or spark-shell.
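If the examples fail to start, one quick thing to check is whether the variables from the steps above are actually visible to the process. The sketch below is a hypothetical helper of my own (`missingVariables`), not something Spark provides; it simply reports which of the three variables are unset or blank.

```scala
// Variable names taken from the steps above.
val requiredVariables = Seq("JAVA_HOME", "HADOOP_HOME", "SPARK_DIST_CLASSPATH")

// Names from requiredVariables that are absent or blank in the given environment.
def missingVariables(env: Map[String, String]): Seq[String] =
  requiredVariables.filterNot(name => env.get(name).exists(_.trim.nonEmpty))

// Check the environment of the current process.
val missing = missingVariables(sys.env)
println(
  if (missing.isEmpty) "All required variables are set"
  else s"Missing: ${missing.mkString(", ")}"
)
```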


Big O notation

May 14, 2016


Came across a very nice introductory article on Big O notation.

Big Data For Social Good Challenge

March 16, 2015


During this winter, I participated in the Big Data For Social Good Challenge, which I just stumbled upon while searching for something.

This challenge was about using IBM Bluemix’s “Analytics For Hadoop” service to process a data set that is at least 500 MB in size.

This was a wonderful opportunity to get some hands-on experience with IBM Bluemix (IBM is giving extended trial access if you are a participant). Apart from this, I was also keen to build a data visualization app on my own.

I selected CitiBike data for one year (2013-2014). Initially I did not have a clue about what insights I could gather from the dataset, but as soon as I ran some Apache Pig scripts and started looking at the output, I could see more and more use cases around the dataset. I could not address all the use cases I thought of, as I soon hit deadline pressure: I had to finish the video demonstration and write up the project.

Overall it was a very enriching experience as I did so many things for the very first time.

Listing some of them below:

  • IBM BigSheets and BigSQL
  • Using the Chart.js library
  • Using the Google Maps JavaScript APIs – It was remarkably simpler than I thought. I much appreciate these APIs from Google.
  • Creating the custom map icon – I never realized it would be this difficult
  • HTML5/CSS challenges when putting up the UI
  • Last but not least, GitHub’s easy way to publish your work online.

Now that the challenge is in the public voting and judging phase, I would appreciate it if you could take a look at

and provide your feedback and vote if you like it.

Introduction to Apache Pig

September 28, 2014


I created this presentation as an introduction to Apache Pig. I hope you find it useful for understanding the basics of Apache Pig.



Viewing Log statements in Hadoop Map-reduce jobs

August 23, 2014


Anyone who is new to the Hadoop world often ends up frantically searching for the debug log statements they added to a mapper or reducer function. At least this was the case for me when I started working with Hadoop, so I thought it might be a good idea to post this entry.

The mapper and reducer are not executed locally, so you cannot find their logs on the local file system. They run on the Hadoop cluster, and the cluster is where you should look for the logs. To do that, you need to know the “JobTracker” URL for your cluster.

  1. Access the “JobTracker”. The default URL for JobTracker is “http://{hostname}:50030/jobtracker.jsp”.
  2. This simple UI lists the different Map-reduce jobs and their states (namely “running”, “completed”, “failed” and “retired”).
  3. Locate the Map-reduce job that you started. (Please refer to the attached screenshot: jobtracker.)
  4. There are various ways to identify your Map-reduce job.
    1. If you run a Pig script, the Pig client logs the job id, which you can search for in the JobTracker. The screenshot shows the Map-reduce job corresponding to a Pig script.
    2. If you are running a plain Map-reduce application, then once you submit the job you can search for it using either the user id used to submit the job or the application name.
  5. Clicking on the job number reveals the job details, as shown below. (Screenshot: map reduce job details)
  6. The following screenshot shows how to navigate to the log statements. Please note that the screenshot comprises three steps. (Screenshot: Map Reduce Log Statements)

Hope you find this post useful while learning Hadoop.