What is a ‘HASH INDEX’ in SQL?
A Hash Index is a specific type of index that works best for equality comparisons. It uses a hash function to distribute rows evenly across a specified number of buckets. The hash function takes column values as input and returns a bucket number. When a query involving an equality operator is executed, the database system uses the hash function to find the bucket and fetch the related rows.
Creating a HASH INDEX
To create a HASH INDEX, you can use the following syntax:
CREATE INDEX index_name
ON table_name(column_name)
USING HASH;
This will create a HASH INDEX on the specified column of the specified table.
Example
Assume we have a table named ‘Employees’ with a column ‘EmpID’. Below is how to create a HASH INDEX on the ‘EmpID’ column:
CREATE INDEX idx_empid
ON Employees(EmpID)
USING HASH;
Now, any query that involves an equality comparison on ‘EmpID’ will use the ‘idx_empid’ index. This will significantly speed up the query process.
When to use a HASH INDEX?
While a HASH INDEX can improve performance, it’s not always the best choice. A HASH INDEX is most effective when the data distribution is uniform, and you need to perform a lot of equality comparisons. However, if your queries involve range-based searches or sorting operations, a B-Tree index might be a better choice.
Conclusion
Hash Indexing is a powerful tool in SQL that can significantly boost the performance of your database operations. However, it’s important to understand when to use it and when not to. We hope this tutorial provides a solid foundation for you to start using HASH INDEX in your SQL operations.
Glossary
CREATE INDEX
The ‘CREATE INDEX’ command is used in SQL to create an index on a specified column of a specified table. For example, when creating a HASH INDEX, the syntax looks like this: CREATE INDEX index_name ON table_name(column_name) USING HASH;. This command enhances the speed of data retrieval operations on the specified column by creating an index in the background.
HASH INDEX
A Hash Index in software development refers to a specific type of database index that uses a hash function to compute the location for data access. The hash function takes the index key value as input and returns a hash bucket where the corresponding data record can be found. Hash indexes are extremely efficient for equality searches, but not for range searches. They are commonly used in NoSQL databases and some relational databases also support them.
B+ Tree vs Hash Index (and when to use them)
Indexes are the fundamental unit of database performance. There are many index types, but the two most common are the B+ Tree and the hash index.
B+ trees are the default index type for most database systems and are more flexible than hash indexes. They offer excellent lookup and insertion times when configured correctly, and my personal opinion is that you should stick with B+ trees unless you’re trying to optimize an extremely performance sensitive table.
For example, later on, I will show that it only takes 4 disk I/Os to find a value in 8 TB of data using a properly configured B+ tree. No, that’s not a typo. Unless you absolutely need to save 3 disk I/Os, then you should preserve the flexibility of a B+ tree.
Intro to each index type
Let’s start by introducing the B+ tree and hash index.
B+ Tree index
A B+ Tree is a tree data structure with some interesting characteristics that make it great for fast lookups with relatively few disk IOs.
- A B+ Tree can (and should) have many more than 2 children per node.
- A B+ Tree is self balancing.
- A B+ Tree holds keys in internal nodes, but only holds values in the leaf nodes (the nodes on the bottom).
- A B+ Tree is sorted (values ascend from left leaf node to right leaf node).
Hash index
A hash index takes the key of the value that you are indexing and hashes it into buckets. In case you’re not aware, a hash function is something that takes an input, and produces a different, somewhat random, and hopefully unique, output.
- “Somewhat random” – Sometimes you actually want the hash function to place certain things somewhat close together on disk, because they are frequently read together. In this instance, you would want to use a “location sensitive hash function”.
- “Hopefully unique” – Sometimes two different inputs produce the same output. This is called a hash collision. In this case, the database usually uses a different hash function to resolve the collision. Hashing these values allows you to do 0(1) equality lookups based on the value of the indexed column. All you have to do to find the exact location of the rows with that value is hash the value you’re trying to find, and look in the place where the hash value leads you!
When to use B+ Tree vs Hash Index
If you are 100% certain that a column will only ever need to be looked up via direct equality (eg… where id = X), then you can use a hash index. Even in that case, unless that table is at the core of your business and is becoming a bottleneck for system performance in general, I would just use a B+ tree anyways. Why? Because you’re really not giving up very much!
B+ Tree
Flexibility
- The default index type
- Quite flexible! Allows you to use <, >, =, != operators, as well as others depending on your DB implementation.
- If you don’t know what to use, just use a B+ Tree
- Excellent for integer and date columns, where you will frequently compare values.
- Is very fast for “range style” queries, such as selecting records from a certain date range.
- B+ Trees store links between each leaf node, essentially creating a sorted doubly linked list!
Performance
Lookup times: log f (N/F) disk IOs, where:
- f – The “fanout” of your B+ Tree, or how many children each node has.
- N – The number of disk pages needed to hold the indexed keys
- F – The “slack” of our tree. Usually 2/3.
- Slack, in this instance, is how many nodes we build with empty values. Keeping empty values in the tree allows us to do inserts without moving a ton of data around, and keeps the overhead of maintaining the index lower.
Example
- Let’s say you’re indexing a table with 1 trillion records that has an 8 byte ID column. The ID column alone would be 8 trillion bytes (8 TB). As in, just the keys of your data, the index itself, is 8 TBs. An enormous table, in other words.
- Assume your DB blocks are 8 KB (a typical size).
- N is 1 million pages (8 TB / 8 KB) to hold indexed values.
- f of 5460
- F of 2/3
- Applying the formula above, plus one extra disk IO to find the root node, comes out to (drum roll please)… 4 disk IOs to look up a single value in 8 TB of data! Wowza!
If you actually have DB tables where just the index is 8 TB, then you are probably at Google. Cool! Hire me! For most people however, your indexes will probably be small enough to fit in RAM, and won’t require any IOs at all!
Hash Index
Flexibility
Only allows exact equality (and by extension, inequality) operators.
Performance
Lookup times: 0(1). It’s hard to beat that!
Reference:
https://reintech.io/blog/sql-hash-index-statement-detailed-guide#google_vignette
https://www.sqlpipe.com/blog/b-tree-vs-hash-index-and-when-to-use-them
Comments