~ Nonlinear Adaptive Filtering as a Form of Artificial Intelligence ~

Non-linear matrix factorization

Simulation code
Many models of social life use matrix factorization for predicting missing elements. Popular applications of this kind are movie and book ratings (common terms: recommender systems and collaborative filtering) and predictions of sport game outcomes. The main problem in all these cases is data quality. For movie ratings the matrix is only a few percent filled. If the dates of submitted ratings are checked, it can be found that, sometimes, 100 movies are all rated within a few seconds of each other. That means all those ratings depend on a particular user's mood at that moment. Sometimes different family members provide ratings, so one user is not necessarily one person. There is also so-called aging of a movie, when ratings obtained several years later differ significantly from those provided at the moment of release. A movie rated 5 by almost all users 10 years ago may not be rated that highly at present, because similar movies were released during that time, and this aging factor should be considered in prediction.

For sport outcomes the matrices are filled completely, or even more than that: they may have multiple elements in the same row/column position, so they are really 3D tables. If the row is the home team and the column is the away team, the two teams could have played several games in the past with different scores, which may be the result of chance or of an evolution in skills. So the data are extremely noisy, but nonetheless prediction is still possible by factoring the matrix and taking the product of the corresponding factors.

Such noisy matrix data often devalue sophisticated mathematical theories and lead to a very small difference between an elementary algorithm and a complex one. A good example is the Netflix Prize. The RMSE difference between the most primitive model (1.0540) and the most complex one (0.8554) is so small that the entire modeling effort can be seen as a miserable failure of applied science. From a practical point of view, if the system predicts my rating for a particular movie as 4 with an error of 1.05 or 0.85, it does not really make a difference for my decision to watch the movie.

However, not everything is that bad. I tried to factor matrices of soccer game outcomes. A primitive model usually provides accuracy from 40% to 45%, which does not give monetary benefits. Making bets with a 40% accurate predictor even leads to losing money; 45% provides an insignificant profit near 3% of the bet amount; but 50% correct predictions give a good money-making opportunity, such as a stable 15% of the bet amount. So there are cases where a few percent in accuracy make a difference.
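These profit figures can be sanity-checked with simple expected-value arithmetic. The odds are not stated here, so the decimal odds $d \approx 2.3$ below are an assumption, chosen because they are consistent with the quoted profits. For a predictor with accuracy $p$, the expected profit per unit bet is $p\,d - 1$:

$$0.40 \cdot 2.3 - 1 = -0.08, \qquad 0.45 \cdot 2.3 - 1 \approx 0.035, \qquad 0.50 \cdot 2.3 - 1 = 0.15,$$

i.e., an 8% loss, a roughly 3% profit, and a 15% profit per bet.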

Since this is research in applied science, we do not provide any endorsement of gambling, and the test example is mathematically generated data, not a history of sport outcomes.

The elements of the test matrix were computed by the formula

$$e_{i,j} = |\sin(x_i - y_j)|,$$

where $x_i$ and $y_j$ are random scalars drawn from the range $[0.0, 3.14]$ for the $i$-th row and $j$-th column, respectively.
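As an illustration, here is a minimal sketch of this data generation in Python (NumPy assumed; all variable names are illustrative). The removal step described next is included so the later sketches can reuse it:

```python
import numpy as np

rng = np.random.default_rng(0)

n_rows, n_cols = 10, 200                    # rectangular case discussed below
x = rng.uniform(0.0, 3.14, size=n_rows)     # hidden row scalars x_i
y = rng.uniform(0.0, 3.14, size=n_cols)     # hidden column scalars y_j

# e[i, j] = |sin(x_i - y_j)|
E = np.abs(np.sin(x[:, None] - y[None, :]))

# Randomly remove about a third of the elements; NaN marks a removed value.
removed = rng.random(E.shape) < 1.0 / 3.0
E_observed = np.where(removed, np.nan, E)
```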

About a third of the computed elements were randomly removed. In the first step, the matrix was factored by the Simon Funk version of gradient descent; the missing elements were then predicted and compared to the removed values via the Pearson correlation coefficient. The accuracy depended on the matrix sizes and varied for each new randomly generated data set.
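A minimal sketch of this factorization step follows. The exact hyperparameters are not given here, so the rank, learning rate, regularization, and epoch count below are assumptions:

```python
import numpy as np

def funk_factorize(E_obs, rank=2, lr=0.02, reg=0.02, epochs=300, seed=0):
    """Simon Funk style SGD factorization: E[i, j] ~ <P[i], Q[j]>.

    E_obs is a 2-D array with np.nan marking the removed elements;
    only the observed positions contribute to the gradient.
    """
    rng = np.random.default_rng(seed)
    n, m = E_obs.shape
    P = 0.1 * rng.standard_normal((n, rank))      # row factor vectors
    Q = 0.1 * rng.standard_normal((m, rank))      # column factor vectors
    rows, cols = np.where(~np.isnan(E_obs))
    for _ in range(epochs):
        for i, j in zip(rows, cols):
            p, q = P[i].copy(), Q[j].copy()
            err = E_obs[i, j] - p @ q
            # One regularized gradient step per observed element.
            P[i] += lr * (err * q - reg * p)
            Q[j] += lr * (err * p - reg * q)
    return P, Q

# Continuing from the previous sketch:
# P, Q = funk_factorize(E_observed)
# E_linear = P @ Q.T                # linear predictions for all elements
```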

In the next step this linear factorization was improved by a Kolmogorov-Arnold representation, using the obtained row and column vectors as inputs and the available matrix elements as targets. The modeling goal was to find a multivariate function $F$ such that, applied to the linear decomposition for a particular row and column,

$$e_{i,j} = F(\mathbf x_i, \mathbf y_j) $$

it gives a better approximation to the matrix element than the inner product $\langle \mathbf x_i, \mathbf y_j \rangle$.
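How the Kolmogorov-Arnold representation was constructed and trained is not specified here, so the sketch below is only one minimal stand-in in its spirit: $F$ is modeled as a sum of univariate piecewise-linear functions applied to fixed random linear combinations of the concatenated row and column vectors, and fit by linear least squares. All function and variable names are illustrative:

```python
import numpy as np

def fit_ka(Z, t, n_terms=20, n_knots=12, seed=0):
    """Fit F(z) = sum_q Phi_q(w_q . z), each Phi_q a univariate
    piecewise-linear function represented on a hat (triangular) basis."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_terms, Z.shape[1]))
    W /= np.linalg.norm(W, axis=1, keepdims=True)     # fixed projection directions
    S = Z @ W.T                                       # (N, n_terms) scalar inputs
    lo, hi = S.min(axis=0), S.max(axis=0)
    knots = np.linspace(0.0, 1.0, n_knots)
    U = (S - lo) / (hi - lo + 1e-12)                  # rescale projections to [0, 1]
    B = np.maximum(0.0, 1.0 - np.abs(U[:, :, None] - knots) * (n_knots - 1))
    coef, *_ = np.linalg.lstsq(B.reshape(len(Z), -1), t, rcond=None)
    return W, lo, hi, knots, coef

def predict_ka(model, Z):
    W, lo, hi, knots, coef = model
    U = np.clip((Z @ W.T - lo) / (hi - lo + 1e-12), 0.0, 1.0)
    B = np.maximum(0.0, 1.0 - np.abs(U[:, :, None] - knots) * (len(knots) - 1))
    return B.reshape(len(Z), -1) @ coef

# Continuing from the previous sketches: inputs are the concatenated
# row/column vectors, targets are the observed matrix elements.
# obs_r, obs_c = np.where(~np.isnan(E_observed))
# model = fit_ka(np.hstack([P[obs_r], Q[obs_c]]), E_observed[obs_r, obs_c])
```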

It was found that for square matrices there was no advantage of the nonlinear model over the linear one. However, for rectangular matrices the advantage was clearly observed. For sizes near $10 \times 200$, the typical accuracy of the linear decomposition, measured by the Pearson correlation coefficient, was near 91%, while the Kolmogorov-Arnold improvement brought it near 96%. As I already said, sometimes such a small advantage matters.
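Putting the sketches together, this kind of comparison can be computed as follows (variable names continue from the earlier blocks; exact numbers will differ, since the model details above are assumptions):

```python
# Pearson accuracy on the removed elements: linear vs. nonlinear model.
miss = np.isnan(E_observed)
r_linear = np.corrcoef(E[miss], (P @ Q.T)[miss])[0, 1]

miss_r, miss_c = np.where(miss)
Z_miss = np.hstack([P[miss_r], Q[miss_c]])
r_ka = np.corrcoef(E[miss_r, miss_c], predict_ka(model, Z_miss))[0, 1]
print(f"linear: {r_linear:.3f}   Kolmogorov-Arnold: {r_ka:.3f}")
```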