Question

Advanced data structures in practice

In the 10 years I've been programming, I can count the number of data structures I've used on one hand: arrays, linked lists (I'm lumping stacks and queues in with this), and dictionaries. This isn't really surprising given that nearly all of the applications I've written fall into the forms-over-data / CRUD category.

I've never needed to use a red-black tree, skip list, double-ended queue, circularly linked list, priority queue, heaps, graphs, or any of the dozens of exotic data structures that have been researched in the past 50 years. I feel like I'm missing out.

This is an open-ended question, but where are these "exotic" data structures used in practice? Does anyone have any real-world experience using these data structures to solve a particular problem?

45 11329 45

1 Jan 1970

Solution

Some examples. They're vague because they were work for employers:

A heap to get the top N results in a Google-style search. (Starting from candidates in an index, go through them all linearly, sifting them through a min-heap of max size N.) This was for an image-search prototype.
Bloom filters cut the size of certain data about what millions of users had seen down to an amount that'd fit in existing servers (it all had to be in RAM for speed); the original design would have needed many new servers just for that database.
A triangular array representation halved the size of a dense symmetrical array for a recommendation engine (RAM again for the same reason).
Users had to be grouped according to certain associations; union-find made this easy, quick, and exact instead of slow, hacky, and approximate.
An app for choosing retail sites according to drive time for people in the neighborhood used Dijkstra shortest-path with priority queues. Other GIS work took advantage of quadtrees and Morton indexes.

Knowing what's out there in data-structures-land comes in handy -- "weeks in the lab can save you hours in the library". The bloom-filter case was only worthwhile because of the scale: if the problem had come up at a startup instead of Yahoo, I'd have used a plain old hashtable. The other examples I think are reasonable anywhere (though nowadays you're less likely to code them yourself).

2008-12-23

Solution

B-trees are in databases.

R-trees are for geographic searches (e.g. if I have 10000 shapes each with a bounding box scattered around a 2-D plane, which of these shapes intersect an arbitrary bounding box B?)

deques of the form in the C++ STL are growable vectors (more memory-efficient than linked lists, and constant-time to "peek" arbitrary elements in the middle). As far as I can remember, I've never used the deque to its full extent (insert/delete from both ends) but it's general enough that you can use it as a stack (insert/delete from one end) or queue (insert to one end, delete from the other) and also have high-performance access to view arbitrary elements in the middle.

I've just finished reading Java Generics and Collections -- the "generics" part hurts my head, but the collections part was useful & they point out some of the differences between skip lists and trees (both can implement maps/sets): skip lists give you built-in constant time iteration from one element to the next (trees are O(log n) ) and are much simpler for implementing lock-free algorithms in multithreaded situations.

Priority queues are used for scheduling among other things (here's a webpage that briefly discusses application); heaps are usually used to implement them. I've also found that the heapsort (for me at least) is the easiest of the O(n log n) sorts to understand and implement.

2008-12-23