📚 Drozdek (Ch. 3.6)
“Tables” are a natural way to express information for many real-world applications.
Imagine storing ratings for movies.
int (4 bytes), we need ~15.3MiB of storage.
int is silly for “5-star” ratings; we could use 1-byte per rating. We still need 4MiB to store it that way.How many “cells” in the table will have a rating entered?
int (4 bytes), we need ~15.3MiB of storage.
int is silly for “5-star” ratings; we could use 1-byte per rating. We still need 4MiB to store it that way.How many “cells” in the table will have a rating entered?
Not many! Most users won’t watch most movies - and they will rate even fewer.
Most of the space in the table will remain unused.
Sparse table with 17% populated cells.
In the table above, about 17% of the cells are used. (This is actually quite high for this kind of data…)
That means we waste 83% of the space we are using for the table… (about 2.8MiB)
Imagine that we keep a linked list for each user, containing the information:
We could use a 1-D array of these linked lists. (You could implement this as a 1-D array of head pointers as well, if you were writing the data structure from scratch for this purpose.)
We could use a 1-D array of these linked lists. (You could implement this as a 1-D array of head pointers as well, if you were writing the data structure from scratch for this purpose.)
Sparse table with 17% populated cells.
That would allow us to look up all ratings for a user, but it would be difficult to find all ratings for a movie.
Sparse table with 17% populated cells.
That would allow us to look up all ratings for a user, but it would be difficult to find all ratings for a movie.
Of course, if we cared about “per-movie” stats, we could just store a linked list for each movie instead…
But we would lose the ability to quickly find info on a “per-user” basis…
What if we do both?
If we use two 1-D arrays, one for “Movies” and one for “Users”, we could quickly find info from either point-of-view.
Sparse table as linked lists.
Sparse table as linked lists.
The nodes use 1 byte for the rating, plus at least two bytes for the user and movie number, plus two pointers (16 bytes), for a total of 21 bytes.
In addition, we need 8 bytes for each element of the two arrays to point to the lists.
In total, assuming 17% of the cells (\(N = (8000 \times 500) \times 0.17 = 680000\)) in a dense table are used, this implementation would require about \((680000 \times 21) + (500 \times 8) + (8000 \times 8) = 13.7\textrm{MiB}\).
This is not better than the full table of type int in this case, due to overhead from the pointers. However, what if we had 1,000,000 users and 3,600 movies?
Then, the usage is \(1000000 \times 3600 \times 0.17 \times 21 + 1000000 \times 8 + 3600 \times 8\) \(= 12\textrm{GiB}\) versus 3.4GiB for the full (dense) table.
What if we have a more sparse dataset?
Assume that there are values in only 1% of the possible cells… (This is the case for e.g. Netflix ratings.)
Now, we require \(1000000 \times 3600 \times 0.01 \times 21 + 1000000 \times 8 + 3600 \times 8\) \(= 0.17\textrm{GiB}\), versus 3.4GiB for the dense table.
This time, it is 5% the size of the full table! A 95% savings!
The sparsity matters.
Also, the storage size of the values matters. In this example, we used 1-byte representations for the ratings. If you needed to store something larger, the sparse (linked list with index) implementation would become more efficient at a smaller table size (or less sparse distribution).
Sparse Tables

CS 50x2 Accelerated Programming Series