It is possible to make nontrivial predictions exclusively using information that was available before the participants attended the events, but I haven’t systematically explored how well one could do.
So the model that I developed is potentially useful only in the special case where participants had attended similar past events.
It’s very likely that the simulation overstates the predictive power that the model would give in practice, if for no other reason than regression to the mean.
One example of this: the most popular participants at an event are more likely than the other participants to have been at their best on that particular day. So the confidence one can have that someone who was chosen by most of their dates at one event will also be chosen by partners at a different event is lower than the confidence one can have that the person would be chosen by additional partners at the same event.
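A small simulation makes the point concrete. This is a toy model of my own devising, not the post's data: each hypothetical participant has a stable "true" appeal, and each event score adds independent event-specific noise. The participants who scored best at the first event score lower, on average, at the second.

```python
import random

random.seed(0)

# Hypothetical setup: stable "true" appeal plus event-specific noise.
N = 1000
true_appeal = [random.gauss(0, 1) for _ in range(N)]
event1 = [a + random.gauss(0, 1) for a in true_appeal]
event2 = [a + random.gauss(0, 1) for a in true_appeal]

# The 50 participants who scored best at event 1...
top = sorted(range(N), key=lambda i: event1[i], reverse=True)[:50]

avg1 = sum(event1[i] for i in top) / 50
avg2 = sum(event2[i] for i in top) / 50

# ...score lower on average at event 2: regression to the mean.
print(avg1, avg2)
```

Their event-2 average is still well above the population mean (some of their popularity reflects real appeal), but it regresses noticeably from their event-1 average.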
But since comments focused on methodology rather than on the empirical phenomena, I decided to write about methodology first, so that readers wouldn’t have to suspend disbelief while reading my next post.
This post is more dense and technical than my last one.
(**) = [R(B) − R(A, B)]/(N − 1), where R(B) = Σ_{A′} R(A′, B) is the sum of the ratings that B received from all N partners, so that (**) is the average rating that B received from the partners other than A. In the special case where the rating type is “decision,” the averages correspond to frequencies, and for ease of comparison with other features these are most naturally replaced by their log odds, so I did this.
I normalized these averages by subtracting off the average of all ratings that participants of B’s gender would have received at the event had the surrogates of A and B attended the event in lieu of A and B.
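A sketch of the feature construction, using made-up ratings rather than the actual dataset (the names and the simple event-wide normalization below are illustrative stand-ins, not the exact normalization described above):

```python
import math

# Toy ratings, purely illustrative: decisions[(a, b)] = 1 if a chose b.
decisions = {
    ("a1", "b1"): 1, ("a1", "b2"): 0,
    ("a2", "b1"): 1, ("a2", "b2"): 1,
    ("a3", "b1"): 0, ("a3", "b2"): 1,
}

def loo_frequency(a, b):
    """The (**) feature: average decision b received from partners other than a."""
    others = [v for (x, y), v in decisions.items() if y == b and x != a]
    return sum(others) / len(others)

def log_odds(p, eps=1e-3):
    p = min(max(p, eps), 1 - eps)  # clip so frequencies of 0 or 1 stay finite
    return math.log(p / (1 - p))

# b1 was chosen by a2 but not a3, so from a1's perspective the frequency is 1/2.
f = loo_frequency("a1", "b1")

# Normalize by subtracting an event-wide baseline in log-odds space
# (a crude stand-in for the surrogate-based normalization).
event_freq = sum(decisions.values()) / len(decisions)
normalized = log_odds(f) - log_odds(event_freq)
```

A frequency of 1/2 has log odds zero, so after subtracting the event baseline, b1 comes out below average here.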
In fact, the performance of the random forest model when this feature is used eclipses the performance of the best generalizable model that I was able to construct.
The random forest seems to have used decision rules corresponding to reasoning of the type: “the frequency with which other people chose this person is lower than I would expect of somebody so attractive, fun, and likable, so the person was probably chosen this time.”

Rather than using (**), we imagine that at the event, B had been on a date with someone other than A, whom we call a “surrogate” of A.
The features (**) are contaminated by the decisions that we’re trying to predict. So predicting a ‘yes’ decision whenever (**) is greater than 50% gives a model that performs better on the dataset while generalizing worse.
This is not an abstract hypothetical concern: what led me to recognize the issue is that a random forest model has more predictive power when we use (**) than when we use the surrogate-based averages.
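The contamination can be demonstrated on synthetic data of my own construction (this is not the speed-dating dataset). Each hypothetical "participant" has a base popularity and ten dates; a feature computed over all of a person's decisions, including the very decision being predicted, looks more accurate than the honest leave-one-out version of the same feature:

```python
import random
from collections import defaultdict

random.seed(1)

# Synthetic, purely illustrative data: each "participant" has a base
# popularity; each row is one date with a yes/no decision about them.
n_people, dates_per = 200, 10
rows = []
for p in range(n_people):
    base = random.random()
    for _ in range(dates_per):
        rows.append((p, 1 if random.random() < base else 0))

totals = defaultdict(lambda: [0, 0])  # person -> [yes count, date count]
for p, y in rows:
    totals[p][0] += y
    totals[p][1] += 1

# Contaminated feature: the person's 'yes' frequency over ALL their dates,
# including the very decision being predicted (analogous to (**)).
contaminated = [totals[p][0] / totals[p][1] for p, y in rows]
# Clean feature: leave the predicted decision out.
clean = [(totals[p][0] - y) / (totals[p][1] - 1) for p, y in rows]

def accuracy(feature):
    hits = sum(1 for (p, y), f in zip(rows, feature) if (f > 0.5) == (y == 1))
    return hits / len(rows)

# The contaminated feature appears more predictive than it would be in practice.
print(accuracy(contaminated), accuracy(clean))
```

The contaminated feature can never do worse than the clean one here (including the true label only ever pulls the frequency toward the right side of the threshold), which is exactly why its apparent performance overstates what would be available prospectively.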
If one were to apply the model in a real-world setting, one would collect data that allowed one to quantify the expected regression to the mean, and also to improve the model.