The Practical Usefulness of Sparse Data Representation in Machine Learning
Sparse Data Representation
Sparse data representation is useful in machine learning, especially in projects where datasets are huge and many entries in the dataset contain zeros. An example is text datasets applied with TF-IDF transformation, where every text example contains only a small number of vocabulary out of the entire dataset’s vocabulary.
Imagine a 2-dimensional matrix with a size of 10,000 along each dimension. This matrix has 10000² = 100 million entries. However, if only 3% of the entries have non-zero values and the remaining 97% are made up of zeros, is there another way to represent this matrix more efficiently than storing a large number of zeros in it?
1. Use of Sparse Data Representation for Computing
One way is to store only the non-zero values, and to store them in an array. The coordinates of these non-zero values are also stored as an array of coordinates, each with size 2 (one for each dimension). Coordinates that are not stored are deemed to have a value of zero. This representation requires only 3 x 10⁶ non-zero values, (3 x 10⁶) x 2 coordinate values, resulting in a total of 9 x 10⁶ values in contrast to the non-compressed matrix that requires 100 million values.
This is a 91% savings in memory requirement!
If we had needed 10GB of memory for holding some matrices, this is shaved to less than 1GB with the sparse representation.
Assume we need to perform an arithmetic operation on the matrix, being the summation of its entries. With the sparse data representation, instead of summing over the entire 100 million values in the matrix, we only need to sum over the 3 million non-zero values of the matrix. This reduces the number of addition operations from 10⁸-1 to 3x10⁶-1.
This is another 97% savings in computation!
If we had needed to run the processor at 100% capacity, this is shaved to 3% with the sparse representation.
Factors for using Sparse Data Representation for Computing
As the matrix becomes larger and increasingly sparse, using a sparse representation leads to savings in memory. As a larger number of sparse-aware operations are performed on the matrix, using a sparse representation could also lead to savings in computation. The savings further increase as the dimensionality increases beyond a 2-dimensional matrix! This would be practically useful when the total time taken to run a set of matrix operations is significant, such as when training a machine learning model.
Implementations of Sparse Data Representation
There are several implementations of sparse data representation. They trade off among features such as ease of structural modification of the matrix, access to slices of the matrix and efficiency of arithmetic operations. A good resource is the Python’s Scipy Sparse Matrix.
2. Applications of Sparse Data in Machine Learning
Sparse data is also an important application result in machine learning, where meaningful features of the data have values significantly greater than or less than zero, and non-useful features have values close to zero. This indirectly reduces the high dimensional data space to a low dimensional data space.
In another related technique data compression, the data space is directly projected onto a lower space that is more densely informative about the data.
There are many applications of sparse data and data compression in machine learning, such as selecting important features from data, removing noise from data, clustering data into groups, data generation and data reconstruction.