Optimal Algorithms and the BFGS Updating Techniques for Solving Unconstrained Nonlinear Minimization Problems

To solve an unconstrained nonlinear minimization problem, we propose an optimal algorithm (OA) as well as a globally optimal algorithm (GOA), by deflecting the gradient direction to the best descent direction at each iteration step, and with an optimal parameter being derived explicitly. An invariant manifold defined for the model problem in terms of a locally quadratic function is used to derive a purely iterative algorithm and the convergence is proven. Then, the rank-two updating techniques of BFGS are employed, which result in several novel algorithms as being faster than the steepest descent method (SDM) and the variable metric method (DFP). Six numerical examples are examined and compared with exact solutions, revealing that the new algorithms of OA, GOA, and the updated ones have superior computational efficiency and accuracy.


Introduction
The steepest descent method (SDM), which can be traced back to Cauchy (1847), is the simplest gradient method for solving unconstrained minimization problems.However, the SDM performs well during earlier stages and as approaching to a stationary point it converges very slowly.In this paper, we consider the following nonlinear minimization problem without considering constraint: where  : R   → R is a C 2 differentiable function.
In the iterative solution of (1), if x  is the current iterative point, then we denote (x  ) by   , ∇(x  ) by g  , and ∇ 2 (x  ) by A  , which is known to be a symmetric Hessian matrix.The second-order Taylor expansion of function () at the point x  is where Δx = x − x  .The superscript T signifies the transpose and meanwhile g T  Δx signifies the inner product of g  and Δx.
Let x = x  −  0 g  , and inserting it into (2) we can obtain By requiring the minimization with respect to  0 , we can derive Then, we have a famous steepest descent method (SDM) for solving (1).
For the minimization problem (1) we need to solve ∇ = 0, and hence the residual means the value of ‖∇‖ = ‖g‖.The above convergence criterion ‖g +1 ‖ <  means that when 2 Journal of Applied Mathematics the residual norm is smaller than a given error tolerance , the iterations are terminated.
In the derivation of SDM for solving (1) it is easy to see that we have transformed the global minimization problem into a model problem in terms of a quadratic minimization problem of and determined the coefficient  0 by (4), where  0 =   − g T  x  + x T  A  x  /2 +  0 with  0 being a constant to raise the level value of , and b = A  x  − g  is a constant vector within each iterative step.Here for the purpose of simple notation we omit the subscript  in (6), and we are going to modify the SDM by starting from the above locally quadratic function.
Several modifications of the SDM have been addressed.These modifications have led to a new interest in the SDM that the gradient vector itself is not a bad choice but rather that the original steplength  0 leads to a slow convergence behavior.Barzilai and Borwein [1] have presented a new choice of steplength through a two-point stepsize.Although their method did not guarantee the descent of the minimum function values, Barzilai and Borwein [1] were able to produce a substantial improvement of the convergence speed for a certain test of a quadratic function.The results of Barzilai and Borwein [1] have spurred many researches on the SDM, for example, Raydan [2,3], Friedlander et al. [4], Raydan and Svaiter [5], Dai et al. [6], Dai and Liao [7], Dai and Yuan [8], Fletcher [9], and Yuan [10].In this paper, we will approach this problem from a quite different viewpoint of invariant manifold and propose a new strategy to modify the steplength and the descent direction.Besides the SDM, there were many modifications of the conjugate gradient method for the unconstrained minimization problems, like Birgin and Martinez [11], Andrei [12][13][14], Zhang [15], Babaie-Kafaki et al. [16], and Shi and Guo [17].
Also, there is another class method with the descent direction d in (x  − d) being taken to be D∇(x  ), where D is a positive definite matrix that approximates the inverse of the Hessian matrix A, which is usually named the quasi-Newton method.The earlier method of minimization of a nonlinear function by using this type approach is performed by Davidon [18], which was then simplified and reformulated by Fletcher and Powell [19] and was referred to as the variable metric method (DFP).
The remaining portions of this paper are arranged as follows.In Section 2 we describe an invariant manifold to derive the governing ordinary differential equations (ODEs).The main results are derived in Section 3, which includes the proof of convergence theorem, optimal parameter, optimal algorithm, a critical parameter, and a globally optimal algorithm.Then, in Section 4 we employ the rank-two Broyden-Fletcher-Goldfarb-Shanno (BFGS) updating techniques to update the Hessian matrix or its inversion, resulting in several novel optimal algorithms.The numerical examples are tested in Section 5 to assess the performance of the newly proposed algorithms.Finally, the conclusions are drawn in Section 6.

An Invariant Manifold
From this section on, we focus on the local minimum problem defined in terms of  in (6), rather than that of .When the novel algorithms are developed, we will return to the minimization problem (1).The present approach is different from the conventional line search method by minimizing the steplength   in where d  is a given search direction.
At the first, we consider an iterative scheme of x derived from the ordinary differential equations (ODEs) defined on an invariant manifold which is formed from (x): Here, we let x be a function of a fictitious time variable .We do not need to specify the function () a priori, of which /() is merely a measure of the decreasing of  in time.
Hence, we expect that in our algorithm if () > 0 is an increasing function of , the iterative point x  can tend to the minimal point.We let (0) = 1, and  is determined from the initial condition x(0) = x 0 by We can suitably choose the constant  0 and hence  0 in (6), such that (x) > 0. Indeed, the different level of  does not alter its minimal point.
When  > 0 and  > 0, the manifold defined by ( 8) is continuous, and thus the following differential operation being carried out on the manifold makes sense.For the requirement of consistency condition we have which is obtained by taking the time differential of (8) with respect to , using (6), and considering x = x().We suppose that x is governed by the following ODEs: where  is to be determined.Inserting (11) into (10) we can solve where We further suppose that where  is a parameter to be determined below through the solution of an optimal equation derived, and B is a descent matrix to be specified.Here, we assert that the driving vector u is an optimal linear combination of the gradient vector and a supplemental vector being the gradient vector times B.

Convergence Theorem.
Before the derivation of optimal algorithms we can prove the following convergence result.
Theorem 1.For an iterative scheme to solve (1) by which is generated from the ODEs in (11), the iterative point x on the manifold (8) has the following convergence rate: where and 0 ≤  < 1 is a relaxation parameter.
Proof.The proof of this theorem is quite lengthy and we divide it into three parts.
(A) Inserting ( 12) into (11) we can obtain an evolution equation for x: In the algorithm if () can be guaranteed to be an increasing function of , we might have an absolutely convergent property in finding the minimum of  through the following equation: which is obtained from (8).Here we simplify the notation of (x()) to ().
(B) By applying the Euler method to (20) we can obtain the following algorithm: where In order to keep x on the manifold defined by (21) we can insert the above x( + Δ) into and obtain Thus, by ( 13), (21), and (6) and through some manipulations we can derive the following scalar equation: where of which the inequality can be achieved by taking a suitable value of  0 and hence  0 in (6).

Optimization of 𝛼.
In Algorithm (22) we do not yet specify how to choose the parameter .We can determine a suitable value of  such that  defined in ( 32) is minimized with respect to , because a smaller  will lead to a faster convergence as shown by (17).
Thus by inserting (19) for  0 into (32) we can write  to be where u as defined by ( 15) includes a parameter .Let / = 0, and through some algebraic operations we can solve  and denote it by where the subscript  signifies that   is the optimal value of .
Remark 2. For the usual three-dimensional vectors a, b, c ∈ R 3 , the following formula is famous: Liu [20] has developed a Jordan algebra by extending the above formula to vectors in -dimension: In terms of the Jordan algebra we can write where the symmetry of A was used.It can be seen that the above equation is a more symmetric form than that in (36).

An Optimal
Algorithm.Now we can let x  denote the numerical value of x at the th step and go back g to g  , u to u  , A to A  , and B to B  .Thus, by using ( 16) we can derive an iterative algorithm: where Therefore, we have the following optimal algorithm (OA).
Again we emphasize that we need to solve ∇ = 0 for the minimization problem (1), and hence the residual means the value of ‖∇‖ = ‖g‖.The above convergence criterion ‖g +1 ‖ <  means that when the residual norm is smaller than a given error tolerance , the iterations are terminated.

A Critical
Value for .In Sections 3.2 and 3.3 we have used / = 0 (or equivalently,  0 / = 0) to find the optimal value of  in the descent vector u = g + Bg.Usually, this value of  obtained from / = 0 is not the global minimum of  0 (or ).Here, we try another approach and attempt to derive a better value of  than   , such that the value of  obtained in this manner is the global minimum of  0 (or ).
In practice, we can take When   is near to 0.5, the convergence speed is very fast.Inserting (15) for u into the above equation and through some elementary operations we can derive a quadratic equation to solve : where If the following condition is satisfied: then  in (44) has a real solution: Inserting ( 45)-(47) into the critical equation: we can derive an algebraic equation to determine that   is the lowest bound of (48).In this lowest bound   is a critical value denoted by   , and for all   ≥   it can satisfy (48) automatically.From (50) through some elementary operations, the critical value   can be solved as Then by inserting it for   into (45) and ( 46) we can obtain a critical value   for  from (49): where  = 0 was used in view of (50).By inserting (51) for   into (52) and cancelling out the common term  we can derive a final equation for   , of which we use the same symbols for saving notations: Here we must emphasize that, in the current descent vector u = g + Bg, the above value   is the best one, and the vector is the best descent vector.Due to its criticality, if one attempts to find a better value of the parameter  than   , there would be no real solution of .Furthermore, the best descent vector is also better than the optimal vector u = g +   Bg derived in Section 3.2.

A Globally Optimal
Algorithm.Then, we can derive the following globally optimal algorithm (GOA) to solve the minimization problem in (1).
(i) Select 0 ≤  < 1 and give an initial guess of x 0 .
Remark 3. We have derived a novel globally optimal algorithm for solving the minimization problem in (1).In terms of the descent vector u = g +   Bg, the GOA is the best one, which leads to the global minimum of  0 (or ) and hence the largest convergence rate.While the parameter  is chosen by the user with problem dependence, the parameter   is exactly given by (56).Up to here we have successfully derived a novel best descent vector algorithm, with the help from ( 19), (53), and (54).
Remark 4. At the very beginning we have set an invariant manifold ℎ(x, ) = ()(x) in ( 8) as our starting point to derive the iterative optimal algorithms, which includes a locally objective function  in the governing equations.However, in the final stage the terms which include  can be cancelled out, and thus we have obtained the optimal algorithm in (42) and the globally optimal algorithm in (57), which are both independent of .

The Broyden-Fletcher-Goldfarb-Shanno Updating Techniques
In the above we have derived two optimal algorithms by leaving the descent matrix B to be specified by the user.First we fix B to be the exact Hessian matrix A, and then we can obtain two optimal algorithms OA and GOA.We can also apply the technique of BFGS by updating the Hessian matrix A and its inverse matrix B. To derive this updating technique let us mention the Newton iterative scheme to solve (1): where   is the optimal steplength along the Newton direction ∇ 2 (x  ) −1 g  at the th step.In order to construct a matrix B  to approximate ∇ 2 (x  ) −1 we can analyze the relation between ∇ 2 (x  ) −1 and the first order derivative g  .We take the Taylor expansion of (x) at a point x +1 , obtaining Then we have Let and from (60) we have Assume that the Hessian matrix ∇ 2 (x +1 ) is invertible and denote the inverse matrix by B +1 .Then from the above equation we have the so-called quasi-Newton condition: When we take the descent matrix B to be the inverse of the Hessian matrix A, we might accelerate the convergence speed, which is however a more difficult task when the dimension of the Hessian matrix is quite large.So far, as that done in the quasi-Newton method, we can employ the following updating technique due to BFGS: The advantage by using the above updating technique is that we can obtain an approximation of the inverse of the Hessian matrix, and we do not need to really calculate A −1 .
By the same token we can also apply the technique of BFGS to update the Hessian matrix without needing the computation of the real Hessian matrix: where at the first step we can take A 0 = I  .The advantage of this approach is that we do not need to calculate the Hessian matrix exactly, as developed independently by Broyden [21], Fletcher [22], Goldfarb [23], and Shanno [24].It is easy to check that the updating technique in (66) satisfies the quasi-Newton condition: which is an inversion relationship of (64).The present notations of A and B are the same as those B and H used by Dai [25,26].
The above technique together with the OA by using B  = A  generated from (66) will be named the OABFGS, and two numerical examples will be given to test its performance.
When the matrix A is given exactly by the Hessian matrix in the OA and GOA, we use the above technique to compute B  in the OA and GOA, and the resultant iterative algorithms are named OABFGS1 and GOABFGS1, respectively.Finally, if in the OA and GOA both A and B are updated by (66) and (65), the resultant iterative algorithms will be named OABFGS2 and GOABFGS2, respectively.
In order to show the high performance of the optimal algorithms OABFGS1, GOABFGS1, and OABFGS2, we will also apply the DFP method [18,19] to solve nonlinear minimization problems, of which the updating technique is For each iterative step we search the minimization of (x  − B  ∇(x  )) to find  by solving the optimality equation ∇(x  −B  ∇(x  ))⋅(B  ∇(x  )) = 0 by using the half-interval method.Gill et al. [27], among several others, have shown that the modification in (65) performs more efficiently than that in (68) for most problems.

Numerical Examples
In order to evaluate the performance of the newly developed methods let us investigate the following examples.Some results are compared with those obtained from the steepest descent method (SDM) and the DFP method [18,19].In the above algorithms there exists a relaxation parameter , which is problem dependent.A good parameter value of  can be selected easily by comparing the convergence speeds for different values of .
Example 1.As the first testing example of OA and GOA we use the following function given by Rosenbrock [28]: [29] have used the particle swarm method to solve this problem; however, the numerical procedures are rather complex.Liu and Atluri [30] have applied a fictitious time integration method to solve the above problem under the constraints of  1 ≥ 0 and  2 ≥ 0, whose accuracy can reach to the fifth order.
We apply the OA to this problem by starting from the initial point at (3, 2) and under a convergence criterion  = 10 −10 .The SDM is run 3749 steps as shown in Figures 1(a) and 1(b) by solid lines for showing the residual and the objective function .The SDM can reach a very accurate minimum value of  with 1.22 × 10 −20 .The OA with  = 0 converges very fast with only 6 steps, with the residual and the objective function  being shown in Figures 1(a) and 1(b) by dashed lines.The OA is faster than the SDM with about 625 times, and furthermore the minimum value of  can be reduced to 1.925 × 10 −25 .In Figure 1(c) we compare the solution paths generated by the SDM and OA.When the SDM is moving very slowly along the valley, the OA is moving outside the region of valley and is convergent very fast to the solution.In Figure 1(c) the red dashed line is used to represent the valley of the Rosenbrock function, which is not used to indicate the result of a numerical method.As shown in Figure 1(b), GOA is slightly better than OA, although the paths of OA and GOA seem to be identical in Figure 1(c).
The SDM usually works very well during early stages; however, as a stationary point is approached, it behaves poorly, taking small nearly orthogonal steps.On the other hand, with the help of the optimal direction g +   Ag the OA can fast reach to the final end point with high accuracy.For this example the GOA spends the same number of steps as the OA; however, the GOA gives very accurate minimum value of  = 1.26 × 10 −29 .Example 2. Next, we consider a generalization of the Rosenbrock function [31,32]: This variant has been shown to have exactly one minimum for  = 3 at ( 1 ,  2 ,  3 ) = (1, 1, 1) and exactly two minima for 4 ≤  ≤ 7. The global minimum happens at ( 1 ,  2 , . . .,   ) = (1, 1, . . ., 1) and a local minimum is near ( 1 ,  2 , . . .,   ) = (−1, 1, . . ., 1).This result is obtained by setting the gradient of the function equal to zero, noticing that the resulting equation is a rational function of   .For small  the polynomials can be determined exactly and Sturm's theorem can be used to determine the number of real roots, while the roots can be bounded in the region of |  | < 2.4 [32].For larger  this method breaks down due to the size of the coefficients involved.We apply the OA to this problem with  = 30, starting from   = 0.1, and under a convergence criterion  = 10 −5 .Under the above stopping criterion, the SDM is run over 13830 steps as shown in Figure 2 At the same time we show the values of the optimal  in Figure 3(a) for the OA.The value of  is much smaller than 1, which means that the term g plays a dominant role in the descent direction; however, the influence of Ag, although a little, is a key point to speed up the convergence.
Then we apply the GOA to this problem under the same conditions as that in OA.The GOA with  = 0.1 converges with 9846 steps, with the residual and  being shown in Figure 2 by dashed-dotted lines.The GOA is faster than the SDM, and the minimum value of  can be further reduced to 9.59 × 10 −12 .We show the values of the optimal  for the GOA in Figure 3(b).This problem is quite difficult, and many other algorithms are failure to solve this problem.
Example 3. We consider a problem due to Powell [33]: The minimum of  is zero occurring at ( 1 ,  2 ,  3 ,  4 ) = (0, 0, 0, 0).We apply the OA to this problem, starting from ( 1 ,  2 ,  3 ,  4 ) = (3, −1, 0, 1) and under a convergence criterion of  = 10 −6 .The SDM converges very slowly over Number of steps As shown in Figure 4(c) the residual obtained by the OABFGS is rapidly growing to the order 10 26 and then fast decaying to the order 10 −6 , and through 1043 steps the OABFGS leads to a better solution with  = 1.285 × 10 −9 than that obtained by the SDM.Here the combination of the optimal algorithm with the BFGS updating technique is very useful when the exact Hessian matrix is not available or when the computation of the Hessian matrix is cumbersome.
Then we apply the GOA in Section 3.5 to the Powell problem.Under the same convergence criterion of  = 10 −6 , the GOA with  = 0.001 converges only with 96 steps as shown in Figure 4(c), where the GOA is faster than the SDM with 500 times and than OA with 3.5 times, and furthermore we can get a very accurate minimum value  = 1.55 × 10 −9 .In Figure 5 we compare the values of  obtained by the OA and the GOA, of which we can observe that both  are quite small and may be negative.
Then, we apply the OABFGS1 and the GOABGGS1 mentioned in Section 4 to solve the Powell problem.Under the same convergence criterion of  = 10 −6 , the OABFGS1 with  = 0.1 converges only with 29 steps as shown in Figure 6(a), where we can get very accurate value of  = 7.57 × 10 −12 .Similarly, the GOABFGS1 with  = 0.1 converges with 30 steps as shown in Figure 6(a), where we can get very accurate value of  = 5.28 × 10 −10 .In Figure 6(b    large, which means that the quasi-Newton direction is a dominant factor to accelerate the convergence speed.Then, we apply the OABFGS2 to solve the Powell problem with  = 0, which converges through 116 steps and with the value of  to be  = 2.77 × 10 −11 , which is very accurate.In Figures 6(c) and 6(d) we show the residual and the value of  obtained by the OABFGS2.
In order to reveal the high performance of the optimal algorithms OABFGS1, GOABFGS1, and OABFGS2, we apply the DFP method [18,19] to solve the Powell problem, which converges through 131 steps as shown in Figure 6  the half-interval method, with two end points fixed by  = 0 and  = 1, and the convergence criterion is given by  = 10 −10 .At the end of the iteration we obtain the minimum value of  to be  = 9.79 × 10 −12 , which is very accurate.
Example 4. In this example we design an office block inside a structure with a curved roof given by  = 100 −  2 .Suppose that the number of total cuboids is  and each cuboid can have different size, and we attempt to find the dimensions of all cuboids with maximum volume which would fit inside the given roof structure; that is, where   > 0 is the height of the th cuboid.The maximum of  is tending to 2000/3 when  is increasing.When  = 95, we apply the GOA to this problem by starting from   = 0.05 and under a convergence criterion of  = 10 −5 .The SDM is convergent with 6868 steps as shown in Figure 7 by solid lines for showing  0 , , residual, and .At the same time, the GOA with  = 0.35 converges with 96 steps, with  0 , , residual, and  being shown in Figure 7 by dashed lines.Both  are tending to 661.9945.The GOA is faster than the SDM with 80 times.The heights and widths of the cuboids with respect to the number of floors are plotted in Figure 8.For this problem both OA and GOA lead to the same numerical results.
Moreover, we also apply the OABFGS and the OABFGS1 as well as the DFP to this problem under the above same initial condition and convergence criterion, where for the DFP we use the half-interval method to solve the local minimization to find the best  with two end points fixed by  = 0 and  = 1, and the convergence criterion is given by  = 10 −10 .The values of  used in the OABFGS and OABFGS1 are, respectively, 0.3 and 0.05.The residuals of these three methods are compared in Figure 9(c), from which we can observe that the OABFGS converges with 35 steps, the OABFGS1 converges with 32 steps, and the DFP converges with 46 steps.They are all better than the above OA and GOA algorithms.The value of  as shown in Figure 9(a) for the OABFGS is quite small, which indicates that the descent direction is dominated by the gradient direction.The value of  as shown in Figure 9(b) for the OABFGS1 is quite large, which indicates that the descent direction is dominated by the quasi-Newton direction.where   = 100(  −  2  ) 2 + (  − 1) 2 .The minimum is zero at   = 1,  = 1, . . ., .Here, we fix  = 8.
It is very difficult to minimize the Whitley function.The SDM diverges as shown in Figure 11 by solid lines.We apply the OA to this problem, starting from   = 1.12 and under a convergence criterion of  = 10 −8 .The OA with  = 0.06 converges with 24 steps, with residual and  being shown in Figure 11 by dashed lines and  tending to 1.47 × 10 −13 .

Conclusions
By formulating the minimization problem into a continuous manifold, we can derive a governing system of ODEs for deriving the iterative algorithms which being proven convergence to find the unknown minimum point x of a given nonlinear minimization function.The novel algorithm is named "an optimal algorithm (OA), " because in the local frame we have derived the optimal parameter of  in the descent direction, which is a linear combination of the gradient vector and a supplemental vector.We have demonstrated a critical descent vector to derive a globally optimal algorithm (GOA), which can substantially accelerate the convergence speed in the numerical solution of nonlinear minimization problem.It was proven that the critical value   in the critical descent vector leads to the largest convergence rate among all the descent vectors specified by u = g +Bg.Due to its criticality, if one attempts to find a better descent vector than u = g +   Bg, there would be no real descent vector of u.Through several numerical tests we found that the both the OA and the GOA outperformed very well.Then we have proposed novel algorithms based on OA and GOA by replacing the Hessian matrix A and the descent matrix B with the updating techniques of BFGS for one or for both of these two matrices.Two numerical examples were given to test the performances of these algorithms, which are faster than the original OA and GOA algorithms, due to the enhancement by using the quasi-Newton conditions on the updated matrices.

Figure 1 :
Figure 1: For the Rosenbrock problem, (a) the residuals, (b) the objective function , and (c) the solution paths of SDM, OA, and GOA.
(a) by solid lines for showing the residual and .The SDM can reach a very accurate minimum value of  with 1.357 × 10 −10 .The OA with  = 0.2 converges with 9956 steps, with the residual and  being shown in Figure 2 by dashed lines.The OA is faster than the SDM, and similarly the minimum of  can be reduced to 1.96 × 10 −11 .

Figure 2 :Figure 3 :
Figure 2: For the generalized Rosenbrock problem, (a) the residuals and (b) the objective functions obtained by SDM, OA, and GOA.

Figure 4 :Figure 5 :
Figure 4: For the Powell problem, comparing (a)  0 and (b)  of SDM and OA and (c) the residuals obtained by the SDM, OA, GOA, and OABFGS.

Figure 7 :
Figure 7: For the maximum area under a given curve comparing (a)  0 , (b) , (c) residual error, and (d)  obtained by the SDM and GOA.

Figure 8 :Figure 9 :
Figure 8: Showing the heights and widths of the floors with respect to the number of floors.

Figure 10 :
Figure 10: For the minimization of Schwefel function comparing (a)  0 , (b) ,(c) residual error, and (d)  obtained by the SDM and OA.

Figure 11 :
Figure 11: For the minimization of Whitley function comparing (a) residuals and (b)  obtained by the SDM and OA, where the SDM is failure.
) we compare the values of  obtained by the OABFGS1 and GOABFGS1, of which we can observe that both  are quite