Two-class problem in machine learning

Code location: Two class example



Among all classification problems, the case of two classes is special. It is the only case where the class labels may be replaced by two numbers, such as -1.0 and 1.0, and a regression model can be used to estimate probabilities. For three or more classes this approach fails miserably, but for two classes it works remarkably well.
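
A minimal sketch of the idea, in Python with NumPy (an ordinary least-squares plane stands in for a real model here; all names are illustrative):

import numpy as np

# Encode the two classes as -1.0 and 1.0 and fit any regressor; an
# ordinary least-squares plane stands in for a real model.
def fit_linear(X, labels):
    y = np.where(labels, 1.0, -1.0)             # two class labels -> two numbers
    A = np.column_stack([X, np.ones(len(X))])   # add an intercept column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def probability(w, X):
    y = np.column_stack([X, np.ones(len(X))]) @ w   # raw regression output
    return (np.clip(y, -1.0, 1.0) + 1.0) / 2.0      # map [-1, 1] to [0, 1]

The only two-class ingredient is the last line: a regression output near -1.0 or 1.0 is read back as a probability.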

For data generation we consider two random circles given by center coordinates $x_1, y_1, x_2, y_2$ and radii $r_1, r_2$.


The observed coordinates $x_1, y_1, x_2, y_2$ are uniformly distributed random values from $[0, R]$, and the observed radii are uniformly distributed on $[0, R/2]$, where $R$ is the range (simply a constant). The output is assigned either -1.0 or 1.0 depending on whether the circles overlap. Aleatoric uncertainty is modelled by adding unobserved zero-mean random addends to each input. The range of the unobserved addends is $\delta R$ for coordinates and $\delta R/2$ for radii. In the provided example $\delta = 0.7$.
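
The generation step can be sketched as follows (a sketch only: $R = 100$ is assumed from the sample records below, and the zero-mean addends are assumed uniform over the stated ranges; the exact choices live in the linked code):

import numpy as np

R = 100.0     # range constant (an assumption; the sample records below fit [0, 100])
DELTA = 0.7   # the delta = 0.7 noise scale from the text
rng = np.random.default_rng(0)

def generate_record():
    # One observed record: x1, y1, x2, y2, r1, r2 and a noisy +-1 target.
    x1, y1, x2, y2 = rng.uniform(0.0, R, size=4)
    r1, r2 = rng.uniform(0.0, R / 2, size=2)
    # Unobserved zero-mean addends, assumed uniform: total range DELTA*R
    # for coordinates and DELTA*R/2 for radii.
    nx1, ny1, nx2, ny2 = rng.uniform(-DELTA * R / 2, DELTA * R / 2, size=4)
    nr1, nr2 = rng.uniform(-DELTA * R / 4, DELTA * R / 4, size=2)
    d = np.hypot((x1 + nx1) - (x2 + nx2), (y1 + ny1) - (y2 + ny2))
    overlap = d <= (r1 + nr1) + (r2 + nr2)   # the noisy disks intersect
    return x1, y1, x2, y2, r1, r2, (1.0 if overlap else -1.0)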

The probabilities for the targets are computed by repeatedly adding random noise and counting the cases in which the circles overlap; a sketch of this estimate follows the sample records below. This Monte Carlo simulation estimates the input-dependent probabilities of the targets, which are conventionally treated as the true values. The format of the generated data is shown below; all inputs are distinct, and the targets (in the last position) are not necessarily accurate for each individual record, since they are computed from the unobserved (noisy) values.
48.00, 70.00, 43.50, 87.00, 38.00, 21.50, -1.00
98.00, 32.00, 28.50, 33.00, 33.00,  9.50,  1.00
83.00, 44.00, 48.00, 86.00, 65.00, 25.00, -1.00
 0.00, 72.00,  1.00, 25.00, 37.00, 19.00,  1.00
41.00, 49.00,  2.50, 39.00, 78.00, 11.50,  1.00
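
Here is the Monte Carlo estimate referenced above (reusing np, rng, R and DELTA from the generation sketch; the trial count is an arbitrary choice):

def overlap_probability(x1, y1, x2, y2, r1, r2, trials=1000):
    # Re-add the unobserved noise many times and count how often the
    # disks overlap; the mean is the target probability for this input.
    nx = rng.uniform(-DELTA * R / 2, DELTA * R / 2, size=(trials, 4))
    nr = rng.uniform(-DELTA * R / 4, DELTA * R / 4, size=(trials, 2))
    d = np.hypot((x1 + nx[:, 0]) - (x2 + nx[:, 2]),
                 (y1 + nx[:, 1]) - (y2 + nx[:, 3]))
    return float(np.mean(d <= (r1 + nr[:, 0]) + (r2 + nr[:, 1])))
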
The data set contains $4000$ records. Here is a typical printout of the program:
Data is generated.
Building models ...
Training time 1.95 seconds

Probabilities for validation sample
0.72 0.38 0.71 0.86 0.61 0.80 0.21 0.43 0.18 0.48
0.75 0.90 0.94 0.75 0.97 0.05 0.10 0.33 0.41 0.35
0.97 0.96 0.46 0.26 0.69 0.45 0.08 0.26 0.43 0.99
0.26 0.30 0.28 0.83 0.76 0.93 0.98 0.61 0.87 0.88
0.98 0.92 0.74 0.59 0.98 0.36 0.92 0.83 0.29 0.35
0.84 0.54 0.97 0.61 0.95 0.76 0.30 0.72 0.59 0.96
0.34 0.12 0.57 0.41 0.76 0.74 0.38 0.94 0.77 0.72
0.22 0.44 0.75 0.00 0.47 0.21 0.87 0.88 0.95 0.34
0.53 0.71 0.86 0.22 0.32 0.25 0.94 0.73 0.05 0.61
0.91 0.79 0.91 0.44 0.94 0.94 1.00 0.16 0.90 0.96

Right predictions 91, out of 100
Pearson correlation for Monte Carlo and model probabilities 0.98
The validation set consists of 100 inputs not used in training. The actual probabilities were estimated by Monte Carlo simulation of the noise. The printout shows the probability of overlap for each of the 100 validation inputs (the probability of the other class is $1.0 - p$). These were compared to the so-called actual values by the Pearson correlation coefficient, which was $0.98$ for this execution. The number of correct predictions is $91$, but this figure is less informative than the probabilities. Since the data are approximate and the system is stochastic, perfectly accurate outcome prediction is simply impossible, but accurate prediction of probabilities is possible, and it is even more critical than the accuracy of outcome prediction.
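
Both numbers are straightforward to reproduce; a sketch, where model_probs, mc_probs and targets are assumed arrays for the 100 validation inputs (np as in the sketches above):

def validate(model_probs, mc_probs, targets):
    # Pearson correlation against the Monte Carlo probabilities, plus the
    # number of right predictions at the 0.5 threshold.
    corr = np.corrcoef(model_probs, mc_probs)[0, 1]
    right = int(np.sum((model_probs >= 0.5) == (targets > 0.0)))
    return corr, right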

Why predicting probabilities is more important than predicting outcomes

Consider day trading, which is buying and selling stocks within the same day to avoid paying commission. For successful trading, a trader needs only an accurate estimate of the probabilities of the stock price rising or declining over the next few hours.

Comparison to expert software

For comparison we chose Infer.Net, which is a large collection of libraries and methods for machine learning. The closest example to our task appeared to be the Bayes Point Machine. We customized this example to process our data and passed it the same data set whose printout is shown above. The accuracy of Infer.Net was $81\%$; in our implementation it is $98\%$. The number of correct predictions was $82$ for Infer.Net and $91$ in our test.

Implementation details

The class labels were assigned -1.0 or 1.0, and a Kolmogorov-Arnold representation was used as the regression model. The modelled output value was interpreted as a probability, with 0 corresponding to 50%. For stability of the result, elementary bagging was used: the output is the average of the predictions of four models built concurrently, each initialized from a random state.
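
A sketch of the averaging step (the Kolmogorov-Arnold model itself lives in the linked code; the train_one helper and its seed argument are hypothetical stand-ins for any regressor built from a random state):

import numpy as np

def bagged_probability(models, X):
    # Average the raw outputs of the independently seeded models, then map
    # the [-1, 1] regression scale to a probability (0 maps to 50%).
    y = np.mean([m.predict(X) for m in models], axis=0)
    return (np.clip(y, -1.0, 1.0) + 1.0) / 2.0

# Usage sketch: four models, each initialized from its own random state.
# models = [train_one(X_train, y_train, seed=s) for s in range(4)]
# probs = bagged_probability(models, X_valid)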