Sparse Tables

📚 Drozdek (Ch. 3.6)

“Tables” are a natural way to express information for many real-world applications.

Think of spreadsheets…

Imagine storing ratings for movies.

“Makes sense” as a table with one row per user and one column per movie…
- Every movie may be reviewed on a “5-star” scale.
- Every user may rate any movie they want… or may not.
- Let’s say we have 8000 users and 500 movies.
  - Table size is \(8000 \times 500 = 4,000,000\) cells.
  - If each cell is an int (4 bytes), we need ~15.3MiB of storage.
    - Using int is silly for “5-star” ratings; we could use 1-byte per rating. We still need 4MiB to store it that way.

How many “cells” in the table will have a rating entered?

“Makes sense” as a table with one row per user and one column per movie…
- Every movie may be reviewed on a “5-star” scale.
- Every user may rate any movie they want… or may not.
- Let’s say we have 8000 users and 500 movies.
  - Table size is \(8000 \times 500 = 4,000,000\) cells.
  - If each cell is an int (4 bytes), we need ~15.3MiB of storage.
    - Using int is silly for “5-star” ratings; we could use 1-byte per rating. We still need 4MiB to store it that way.

How many “cells” in the table will have a rating entered?

Not many! Most users won’t watch most movies - and they will rate even fewer.

Most of the space in the table will remain unused.

Sparse table with 17% populated cells.

In the table above, about 17% of the cells are used. (This is actually quite high for this kind of data…)

That means we waste 83% of the space we are using for the table… (about 2.8MiB)

Possible solution using linked lists.

Imagine that we keep a linked list for each user, containing the information:

Which movie did the user rate?
What was the rating for the movie?

We could use a 1-D array of these linked lists. (You could implement this as a 1-D array of head pointers as well, if you were writing the data structure from scratch for this purpose.)

Sparse table with 17% populated cells.

That would allow us to look up all ratings for a user, but it would be difficult to find all ratings for a movie.

Sparse table with 17% populated cells.

That would allow us to look up all ratings for a user, but it would be difficult to find all ratings for a movie.

Of course, if we cared about “per-movie” stats, we could just store a linked list for each movie instead…

But we would lose the ability to quickly find info on a “per-user” basis…

What if we do both?

If we use two 1-D arrays, one for “Movies” and one for “Users”, we could quickly find info from either point-of-view.

The node would need to know about the Movie, the User, and the Rating.

Sparse table as linked lists.

The nodes use 1 byte for the rating, plus at least two bytes for the user and movie number, plus two pointers (16 bytes), for a total of 21 bytes.

In addition, we need 8 bytes for each element of the two arrays to point to the lists.

In total, assuming 17% of the cells (\(N = (8000 \times 500) \times 0.17 = 680000\)) in a dense table are used, this implementation would require about \((680000 \times 21) + (500 \times 8) + (8000 \times 8) = 13.7\textrm{MiB}\).

This is not better than the full table of type int in this case, due to overhead from the pointers. However, what if we had 1,000,000 users and 3,600 movies?

Then, the usage is \(1000000 \times 3600 \times 0.17 \times 21 + 1000000 \times 8 + 3600 \times 8\) \(= 12\textrm{GiB}\) versus 3.4GiB for the full (dense) table.

What if we have a more sparse dataset?

Assume that there are values in only 1% of the possible cells… (This is the case for e.g. Netflix ratings.)

Now, we require \(1000000 \times 3600 \times 0.01 \times 21 + 1000000 \times 8 + 3600 \times 8\) \(= 0.17\textrm{GiB}\), versus 3.4GiB for the dense table.

This time, it is 5% the size of the full table! A 95% savings!

The sparsity matters.

Also, the storage size of the values matters. In this example, we used 1-byte representations for the ratings. If you needed to store something larger, the sparse (linked list with index) implementation would become more efficient at a smaller table size (or less sparse distribution).

Sparse Tables