Monday, November 28, 2016

HashMap - how does choosing the load factor affect the time complexity (Big Oh)?

This post is more about the HashMap data structure itself than about Java.

What is a HashMap? 
A HashMap is a data structure that stores values by first computing a hash of each value's key with a hashing function and then placing the entry in a bucket chosen by that hash. This makes it possible to find the element/value later, usually in O(1) time, simply by computing the hash of the key again.
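As a quick illustration, here is a minimal sketch of the usual put/get pattern (the class name, keys and values are just placeholders I've made up):

    import java.util.HashMap;
    import java.util.Map;

    public class HashMapLookupDemo {
        public static void main(String[] args) {
            // Keys are hashed to bucket positions, so lookups do not scan the whole map.
            Map<String, Integer> ages = new HashMap<>();
            ages.put("alice", 30);   // hash("alice") decides which bucket the entry lands in
            ages.put("bob", 25);

            // get() re-hashes the key and jumps straight to that bucket - O(1) on average.
            System.out.println(ages.get("alice")); // prints 30
        }
    }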

What is the load factor? 
HashMap stores elements in buckets. If 2 or more keys hash to the same bucket, their elements are stored in that bucket as a linked list - this situation is called a 'collision'. We want to avoid collisions, because when a collision happens the search for those elements takes longer than O(1). That is acceptable when it happens for a few of the elements we store, but not when it happens for many. When will it happen for many elements? --> when we have far fewer buckets than the number of elements we need to store.

So we need to decide how full our HashMap should be before we enlarge it. That fullness threshold is called the load factor and is given by the formula n/N, where n is the number of elements already stored in the HashMap and N is the number of buckets. You can also think of N as the capacity of the HashMap. The whole point of this data structure is to make the search operation O(1), so it's important to design it so that collisions stay as rare as possible. The default load factor for Java's HashMap is 0.75 with an initial capacity of 16. That means it will increase the size of the HashMap (and hence rehash the entries) once the number of stored entries exceeds 16 × 0.75 = 12.
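If you want to see this in action, Java's HashMap has a constructor that takes both the initial capacity and the load factor explicitly. A minimal sketch (class name and the values stored are just for illustration):

    import java.util.HashMap;
    import java.util.Map;

    public class LoadFactorDemo {
        public static void main(String[] args) {
            int initialCapacity = 16;    // N: the number of buckets to start with
            float loadFactor = 0.75f;    // the n/N ratio at which the map grows

            // Resize threshold = capacity * load factor = 16 * 0.75 = 12 entries.
            int threshold = (int) (initialCapacity * loadFactor);
            System.out.println("Resize threshold: " + threshold); // prints 12

            // Both values can be passed to the constructor.
            Map<Integer, String> map = new HashMap<>(initialCapacity, loadFactor);
            for (int i = 0; i < 20; i++) {
                map.put(i, "value-" + i); // the bucket array doubles once the threshold is crossed
            }
        }
    }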

How does choosing the load factor affect the performance? 
You have probably figured out this by now. 

  • The higher the load factor, the later the HashMap's buckets are enlarged, so the more collisions occur and the more elements end up with a search time worse than O(1).
  • The lower the load factor (say 0.6 instead of 0.75), the earlier we enlarge the HashMap's capacity - roughly when 40% of the buckets are still empty. That means fewer collision scenarios at this stage, and even fewer in the new HashMap created with the increased size (and hence more buckets), as the small sketch below shows. 
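To make the trade-off concrete, here is a tiny sketch (the capacity and load factor values are chosen only for illustration) that prints how many entries each load factor allows before the same 16-bucket map has to grow:

    public class LoadFactorTradeOff {
        public static void main(String[] args) {
            int capacity = 16;
            // Lower load factor -> the map grows sooner (more memory used, fewer collisions);
            // higher load factor -> the map grows later (less memory used, more collisions).
            for (float lf : new float[] {0.6f, 0.75f, 0.9f}) {
                int threshold = (int) (capacity * lf);
                System.out.printf("load factor %.2f -> grows after %d entries (capacity %d)%n",
                        lf, threshold, capacity);
            }
        }
    }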
Shoot any questions you have in the comments section. :)