For a non-strictly convex quadratic program, there is often more than one optimal extreme point, and some of these extreme points can be far apart. Why can a neural network accurately approximate one of them? The labels obtained from a solver do not necessarily follow any particular distribution and carry little feature information. Yet a neural network can approximate these solutions almost perfectly despite that lack of feature information. Does this indicate that the network is overfitting?
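To make the degeneracy concrete, here is a minimal sketch (my own illustrative example, not from any specific training setup) of a non-strictly convex QP whose optimal set is a whole face: the Hessian `Q` is positive semidefinite but singular, so every point on a line segment is optimal, and a solver's answer depends on where it starts.

```python
import numpy as np
from scipy.optimize import minimize

# Q is PSD but rank-deficient, so the objective
# 0.5*x^T Q x - c^T x depends only on s = x1 + x2.
Q = np.array([[1.0, 1.0], [1.0, 1.0]])
c = np.array([1.0, 1.0])

def f(x):
    return 0.5 * x @ Q @ x - c @ x  # = 0.5*(x1+x2)^2 - (x1+x2)

bounds = [(0.0, 1.0), (0.0, 1.0)]

# Every point on the segment x1 + x2 = 1 inside the box is optimal
# (objective value -0.5); the extreme points (1, 0) and (0, 1) of that
# segment are far apart. Different starting points yield different
# optimal solutions with the same objective value.
sol_a = minimize(f, x0=np.array([0.9, 0.0]), bounds=bounds, method="SLSQP")
sol_b = minimize(f, x0=np.array([0.0, 0.9]), bounds=bounds, method="SLSQP")

print(sol_a.x, f(sol_a.x))  # one optimal point
print(sol_b.x, f(sol_b.x))  # a different optimal point, same objective
```

If the training labels come from runs like these, two near-identical problem instances can be labeled with optimal solutions that sit far apart on the optimal face, which is exactly the ambiguity the question is about.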
Or to rephrase the question: since the labels are obtained more or less arbitrarily from the solver during training, with no obvious feature information, which extreme point should the network's predicted solutions approach at test time, and why that extreme point rather than another?