January 22, 2025

Theoretical Breakthrough at MIT Could Boost Data Storage

The team’s findings relate to so-called “linear-probing hash tables,” which were introduced in 1954 and are among the oldest, simplest, and fastest data structures available today. Data structures provide ways of organizing and storing data in computers, with hash tables being one of the most commonly used approaches. In a linear-probing hash table, the positions in which information can be stored lie along a linear array.
Suppose, for instance, that a database is designed to store the Social Security numbers of 10,000 people, Kuszmaul suggests. “We take your Social Security number, x, and we’ll then compute the hash function of x, h(x), which gives you a random number between one and 10,000.” The next step is to take that random number, h(x), go to that position in the array, and put x, the Social Security number, into that spot.
If there’s already something occupying that spot, Kuszmaul says, “you just move forward to the next free position and put it there. This is where the term ‘linear probing’ comes from, as you keep moving forward linearly until you find an open spot.” In order to later retrieve that Social Security number, x, you just go to the designated spot, h(x), and if it’s not there, you move forward until you either find x or come to a free position and conclude that x is not in your database.
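The insertion and lookup steps described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the class and method names are hypothetical, Python’s built-in `hash` stands in for h(x), positions are numbered from zero, and wrapping around at the end of the array is an added assumption.

```python
class LinearProbingTable:
    """Toy linear-probing table (illustrative names, not from the paper)."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.slots = [None] * capacity  # None marks a free position

    def _home(self, x):
        # Stand-in for h(x): the built-in hash reduced to an array index.
        return hash(x) % self.capacity

    def insert(self, x):
        """Store x in the first free position at or after h(x); return its index."""
        i = self._home(x)
        for _ in range(self.capacity):
            if self.slots[i] is None or self.slots[i] == x:
                self.slots[i] = x
                return i
            i = (i + 1) % self.capacity  # move forward, wrapping (assumption)
        raise RuntimeError("table is full")

    def contains(self, x):
        """Walk forward from h(x) until x or a free position is found."""
        i = self._home(x)
        for _ in range(self.capacity):
            if self.slots[i] is None:
                return False  # a free position means x is not stored
            if self.slots[i] == x:
                return True
            i = (i + 1) % self.capacity
        return False
```

With a capacity of 8, inserting 3, 11, and 19 (which all hash to position 3 here) places them in positions 3, 4, and 5, and a lookup for 27 stops at the free position 6.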
There’s a somewhat different protocol for deleting an item, such as a Social Security number. If you just left an empty spot in the hash table after deleting the information, that could cause confusion when you later tried to find something else, as the vacant spot might erroneously suggest that the item you’re looking for is nowhere to be found in the database. To avoid that problem, Kuszmaul explains, “you can go to the spot where the element was removed and put a little marker there called a ‘tombstone,’ which indicates there used to be an element here, but it’s gone now.”
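A rough sketch of deletion with tombstones might look like the following. The names (`TombstoneTable`, `TOMBSTONE`) are hypothetical, the built-in `hash` again stands in for h(x), and wraparound at the end of the array is an assumption.

```python
TOMBSTONE = object()  # sentinel: "there used to be an element here"


class TombstoneTable:
    """Toy linear-probing table with tombstone deletion (illustrative only)."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.slots = [None] * capacity  # None marks a truly free position

    def _probe(self, x):
        # Yield positions starting at h(x) and moving forward linearly.
        start = hash(x) % self.capacity
        for step in range(self.capacity):
            yield (start + step) % self.capacity

    def _find(self, x):
        for i in self._probe(x):
            slot = self.slots[i]
            if slot is None:
                return None  # only a truly free position ends the search
            if slot is not TOMBSTONE and slot == x:
                return i
        return None

    def contains(self, x):
        return self._find(x) is not None

    def insert(self, x):
        if self._find(x) is not None:
            return  # already stored
        for i in self._probe(x):
            if self.slots[i] is None or self.slots[i] is TOMBSTONE:
                self.slots[i] = x  # a tombstone's position can be reused
                return
        raise RuntimeError("table is full")

    def delete(self, x):
        i = self._find(x)
        if i is not None:
            self.slots[i] = TOMBSTONE  # leave a marker, not a free position
```

Because a search skips over tombstones instead of stopping at them, deleting 11 from a table holding 3, 11, and 19 in adjacent positions does not hide 19 from later lookups.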
This general procedure has been followed for more than half a century. In all that time, almost everyone using linear-probing hash tables has assumed that if you allow them to get too full, long stretches of occupied spots would run together to form “clusters.” As a result, the time it takes to find a free spot would go up dramatically (quadratically, in fact), taking so long as to be impractical. Consequently, people have been trained to operate hash tables at low capacity, a practice that can exact an economic toll by affecting the amount of hardware a company has to purchase and maintain.
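The clustering effect can be observed with a small simulation. The sketch below assumes uniformly random hash positions and an insert-only table with no deletions; the function name and parameters are hypothetical. The classical analysis of this setting estimates roughly (1 + 1/(1 − α)²)/2 probes per insertion, where α is the fraction of occupied positions, which is the quadratic growth mentioned above.

```python
import random


def average_insert_probes(capacity, load, trials=200, seed=0):
    """Estimate how many positions an insertion inspects at a given load.

    Assumes 0 <= load < 1 and uniformly random hash positions.
    """
    rng = random.Random(seed)
    occupied = [False] * capacity

    # Fill the table to the requested load factor using linear probing.
    for _ in range(int(capacity * load)):
        i = rng.randrange(capacity)
        while occupied[i]:
            i = (i + 1) % capacity
        occupied[i] = True

    # Measure the cost of finding a free position, without inserting.
    total = 0
    for _ in range(trials):
        i = rng.randrange(capacity)
        probes = 1
        while occupied[i]:
            i = (i + 1) % capacity
            probes += 1
        total += probes
    return total / trials


if __name__ == "__main__":
    for load in (0.5, 0.8, 0.9, 0.95):
        print(load, average_insert_probes(4096, load))
```

The exact numbers depend on the random seed, but the average should rise sharply as the load approaches 1.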
But this time-honored principle, which has long militated against high load factors, has been totally upended by the work of Kuszmaul and his colleagues, Michael Bender of Stony Brook University and Bradley Kuszmaul of Google. They found that for applications where the number of insertions and deletions stays about the same, and the amount of data added is roughly equal to that removed, linear-probing hash tables can operate at high storage capacities without sacrificing speed.
In addition, the team has devised a new strategy, called “graveyard hashing,” which involves artificially increasing the number of tombstones placed in an array until they occupy about half the free spots. These tombstones then reserve spaces that can be used for future insertions.
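As a loose illustration of that idea only, the sketch below rebuilds a table from its live keys and then places tombstones in about half of the remaining free positions. It is not the procedure from the paper: the function name, the evenly spaced placement, and the choice to do this during a full rebuild are all assumptions made for the example, and the keys are assumed to be distinct.

```python
TOMBSTONE = object()  # sentinel marking a reserved (deleted-looking) position


def rebuild_with_graveyard(keys, capacity):
    """Rebuild a linear-probing array, then reserve about half of the
    free positions with tombstones (hypothetical placement policy)."""
    keys = list(keys)
    if len(keys) > capacity:
        raise ValueError("more keys than positions")

    slots = [None] * capacity
    for x in keys:
        i = hash(x) % capacity  # built-in hash stands in for h(x)
        while slots[i] is not None:
            i = (i + 1) % capacity
        slots[i] = x

    free = capacity - len(keys)
    wanted = free // 2  # "about half the free spots"
    if wanted == 0:
        return slots

    stride = capacity / wanted  # spread tombstones evenly (assumption)
    for k in range(wanted):
        i = int(k * stride) % capacity
        while slots[i] is not None:
            i = (i + 1) % capacity
        slots[i] = TOMBSTONE
    return slots
```

Lookups on such an array would need to skip tombstones rather than stop at them, while insertions are free to overwrite a tombstone they encounter.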
This approach, which runs contrary to what people have customarily been instructed to do, Kuszmaul says, “can lead to optimal performance in linear-probing hash tables.” Or, as he and his coauthors maintain in their paper, the “well-designed use of tombstones can completely change the … landscape of how linear probing behaves.”
William Kuszmaul wrote up these findings with Bender and Bradley Kuszmaul in a paper posted earlier this year that will be presented in February at the Foundations of Computer Science (FOCS) Symposium in Boulder, Colorado.
Kuszmaul’s PhD thesis advisor, MIT computer science professor Charles E. Leiserson (who did not participate in this research), agrees with that assessment. “These new and surprising results overturn one of the oldest conventional wisdoms about hash table behavior,” Leiserson says. “The lessons will reverberate for years among practitioners and theoreticians alike.”
As for translating their results into practice, Kuszmaul notes, “there are many considerations that go into building a hash table. Although we’ve advanced the story considerably from a theoretical standpoint, we’re just starting to explore the experimental side of things.”
Reference: “Linear Probing Revisited: Tombstones Mark the Death of Primary Clustering” by Michael A. Bender, Bradley C. Kuszmaul and William Kuszmaul, 2 July 2021, Computer Science > Data Structures and Algorithms. arXiv:2107.01250

New work on linear-probing hash tables from MIT CSAIL could lead to more efficient data storage and retrieval in computers.
A trio of researchers that includes William Kuszmaul, a computer science PhD student at MIT, has made a discovery that could lead to more efficient data storage and retrieval in computers.