Question regarding performance of HashTable of LinkedList Nodes

1338 views java
0

I implemented a HashTable with variable size buckets on init of the Class, merely an Array of linked lists sized at Runtime.

The issue is that with a small number of buckets where the linked-list must be traversed (can reach approx. 5K nodes in depth) is outperforming a HashTable with more buckets differing on three orders of magnitude greater.

    int SMALL_BUCKET_SIZE = 10;
    int BIG_BUCKET_SIZE = 10000;

    HashTable<String, Integer> smallHashTable = new HashTable<>(SMALL_BUCKET_SIZE);
    HashTable<String, Integer> bigHashtTable = new HashTable<>(BIG_BUCKET_SIZE);

I would expect the larger HashTable to be O(1) for search where the smaller hashtable having a higher collision rate, taking more time due to traversal of the linked nodes, yet my numbers below are showing the smaller table outperforming the wider table.

Fetch SmallTable: 0.000007
Fetch BigTable: 0.000018

So I decide to loop my HashTable.get a thousand times to factor for JIT and JVM Optimization. Now I'm starting to see numbers that seem to confirm what I'd expect.

Fetch SmallTable: 0.0000013630
Fetch BigTable: 0.0000002560

My question is around the sound-ness of my logic as well as the additional moving parts here. I've pasted my test alongside a link to the implementation of the HashTable and underlying Node structures.

Looking for depth/experience from folks here who may be able to provide interactive feedback regarding variables that factor into this such as key length and hashing collision rates, bucket density, etc.

HashTableTest.java

@Test
public void canInitializeHashTableWithBucketsForPerformance() throws InterruptedException {
    double smallTableTime, bigTableTime;
    int SMALL_BUCKET_SIZE = 10;
    int BIG_BUCKET_SIZE = 10000;

    HashTable<String, Integer> smallHashTable = new HashTable<>(SMALL_BUCKET_SIZE);
    HashTable<String, Integer> bigHashtTable = new HashTable<>(BIG_BUCKET_SIZE);
    List<String> strings = generateRandomStringKeys(1000);

    strings.forEach(string -> bigHashtTable.put(string, 10));
    strings.forEach(string -> smallHashTable.put(string, 10));

    Consumer<String> bigHashGet = bigHashtTable::get;
    Consumer<String> smallHashGet = smallHashTable::get;

    String theString = strings.get(strings.size() - 1);

    smallTableTime = getElapsedTimeFactoringOutJavaOptimization(theString, smallHashGet);
    bigTableTime = getElapsedTimeFactoringOutJavaOptimization(theString, bigHashGet);

    System.out.println(String.format("Fetch SmallTable: %.10f", smallTableTime));
    System.out.println(String.format("Fetch BigTable:   %.10f", bigTableTime));

    assertTrue(smallTableTime > bigTableTime);
}

public double getElapsedTimeFactoringOutJavaOptimization(String s, Consumer<String> aMethod) {
    long start = 0, end = 0;

    for (int i = 0; i < 1000; i++) {
        start = System.nanoTime();
        aMethod.accept(s);
        end = System.nanoTime();
    }

    return (end - start) / 1_000_000_000D;
}

public List<String> generateRandomStringKeys(int numOfRandomKeys) {
    List<String> keys = new ArrayList<>();

    for (int i = 0; i < numOfRandomKeys; i++) {
        byte[] array = new byte[10];
        new Random().nextBytes(array);
        keys.add(new String(array, Charset.forName("UTF-8")));
    }

    return keys;
}

The test can be found here - Github - HashTableTest.java

The implementation can be found here as well - Github - HashTable.java

answered question

Benchmarking an operation like this is going to be harder than just running it 1000 times. (1000 times is nowhere near enough, for starters.)

You should also average the speed of multiple rounds. Move the System.nanoTime() calls outside of the loop

@flakes Notice that I'm basically dropping all of the System.nanoTimes on the floor except for the last one. Moving them outside the loop would capture the entire time the loop ran, not a single final iteration.

1 Answer

4

There are many things wrong here, but a handful include:

  • Running this operation 1000 times and taking the difference of nanoTime for each of them does not make your benchmark valid. Seriously, use JMH. Or at least run it, like, ten million times.
  • Your hash table doesn't actually work any differently for different-sized tables. You use table[getHash(key) % RADIX], which basically means however big the table is, you only use 10 buckets in it and pretend the rest don't exist.
  • System.identityHashCode is not a useful hash function, especially on strings, especially when you're hoping to actually find an element that's in there...or not.
  • While you're at it, you're not using Node.next as a field, and might as well get rid of it.

posted this

Have an answer?

JD

Please login first before posting an answer.