Appendix A: Optimization

Optimization refers to the process of finding the maximum or minimum of a function in some defined region, which could be either finite or infinite in extent. Here we look only at finding the maximum of a function, since that is the most relevant for pricing and revenue optimization.

A.1 CONTINUOUS OPTIMIZATION

In continuous optimization, we want to maximize a continuous, differentiable function f(x) over a certain region—say, x ≥ 0. Here, x may be either a single number—a scalar—or a vector x = (x1, x2, . . . , xn). In either case, we seek the value of x for which f(x) is maximized. This general problem can be complex and subtle. In particular, a function may have many different local maxima within the region. If so, the problem of finding the global maximum among all the local maxima may be very difficult and may require the use of such approaches as genetic algorithms. Fortunately, the problems treated in this book typically have unimodal objective functions; that is, any local maximum will also be a global maximum. The problems we consider also generally have convex feasible regions. The combination of a unimodal objective function with a convex feasible region means that hill-climbing approaches typically work well. Starting at any point, we find an improving direction and move in that direction to a point with a higher objective function value. From that point we find a new improving direction and repeat the process until we get to a point where no further improvement is possible. This point is the global maximizer.
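To make the hill-climbing idea concrete, here is a minimal sketch of gradient ascent on a unimodal function of two variables. The function, step size, and tolerance are my own illustrative choices, not taken from the text.

```python
import numpy as np

def gradient_ascent(grad_f, x0, step=0.1, tol=1e-8, max_iter=10_000):
    """Hill climbing for a differentiable, unimodal objective: repeatedly move
    in the improving (gradient) direction until no further improvement is possible."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)                  # improving direction
        if np.linalg.norm(g) < tol:    # no meaningful improvement remains
            break
        x = x + step * g               # move to a point with a higher objective value
    return x

# Illustrative objective f(x1, x2) = -(x1 - 2)^2 - (x2 - 3)^2, maximized at (2, 3).
grad = lambda x: np.array([-2.0 * (x[0] - 2.0), -2.0 * (x[1] - 3.0)])
print(gradient_ascent(grad, x0=[0.0, 0.0]))   # approximately [2. 3.]
```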

There are three possibilities.

1. The function may have no maximum. For example, Chapter 5 shows that a supplier facing a constant-elasticity price-response function with elasticity greater than 1 can always increase revenue by increasing his price. In this case, there is no finite price that maximizes revenue.

2. The function may attain its maximum value within the interior of the region.

3. The function may attain its maximum value on the boundary of the region.

Given the restrictions we have put on the underlying function and the feasible region, an intelligent hill-climbing approach will find the optimal solution in the second and third cases. The first case will never occur in a well-posed problem—if we find that our pricing optimization problem has no finite solution, it generally means that the problem has been incorrectly specified or that one or more important constraints have been inadvertently excluded.

For an interior point x* to be a maximizer of a function f, it must be the case that all of the partial derivatives are equal to zero at x*—that is, ∂f(x*)/∂xi = 0 for all i. (If x is a scalar, this is equivalent to the requirement that f′(x*) = 0.) If ∂f(x*)/∂xi > 0 for some i, then f(x*) could be increased by increasing xi by a small amount. If ∂f(x*)/∂xi < 0 for some i, then f(x*) could be increased by decreasing xi by a small amount. In either case, x* is not a maximizer.

Thus, ∂f(x*)/∂xi = 0 for all i is a necessary condition for x* to be an interior maximizer of f. It is not, however, a sufficient condition. There are two other possibilities: x* could be a minimizer of f(x). Or, more rarely, x* could be an inflection point of f(x). In the case when x is a scalar, the most common test to ensure that x* is a maximizer of f(x) is that the second derivative satisfies f″(x*) < 0. If this condition is satisfied, then x* will be a local maximizer of f(x). And if f(x) is unimodal, x* must be a global maximizer as well.

Calculating the second derivative can be difficult. However, note that an x* that satisfies the first-order conditions will also be a maximizer of f(x) if perturbing any of its elements either up or down leads to a smaller value of f(x).

Example A.1

Assume that we want to maximize the function f(x) = –x² + 2x + 6. The first derivative of f(x) is f′(x) = 2 – 2x, which is equal to 0 only for x* = 1, with f(x*) = 7. The question is whether x* = 1 is a maximizer or a minimizer of f(x) (or possibly an inflection point). We note that f″(x) = –2 < 0 for all x, which indicates that x* is a maximizer. Alternatively, we note that f(.9) = 6.99 < f(x*) and f(1.1) = 6.99 < f(x*), which is another way of showing that x* = 1 is a maximizer.
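A quick numerical check of Example A.1—my own sketch, not part of the text—confirms the first-order condition, the second-derivative test, and the perturbation test with finite differences:

```python
# Verify Example A.1: f(x) = -x^2 + 2x + 6, candidate maximizer x* = 1.
f = lambda x: -x**2 + 2*x + 6

x_star = 1.0
h = 1e-4
f_prime = (f(x_star + h) - f(x_star - h)) / (2 * h)                      # ~0
f_double_prime = (f(x_star + h) - 2 * f(x_star) + f(x_star - h)) / h**2  # ~-2

print(f_prime, f_double_prime)     # first derivative ~0, second derivative ~-2 < 0
print(f(0.9), f(1.1), f(x_star))   # 6.99 and 6.99 are both below f(1) = 7
```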

This is a cursory treatment of a deep topic. It does not deal with the many exceptions, caveats, and subtleties to continuous optimization. However, for the purpose of basic pricing and revenue optimization, we usually do not need to deal with these subtleties; a simple approach will usually suffice. Most pricing and revenue optimization problems have a simple structure: Increasing a decision variable (price or capacity allotment) initially improves profit, but, at some point, further increases lead to lower profit. The maximum occurs at the point just before profit begins to degrade. With this structure, we can use simple approaches and not worry about some of the more subtle issues. This does not mean that these subtleties can always be ignored in practice. For example, in optimizing list prices, some products can be complements (like hot dogs and hot-dog buns), while others are substitutes (like Coke and Pepsi). In this situation, the problem of finding the price that maximizes total expected contribution can have many local maxima, and a point that satisfies all of the first- and second-order conditions may not be the global maximum. This means that genetic algorithms or related approaches may be needed to find the global maximum.

A.2 LINEAR PROGRAMMING

Linear programming is a well-established approach to optimization in the case where a linear objective function is being maximized subject to linear constraints. The simplex algorithm for solving such problems was initially developed by George Dantzig in the 1940s and has been extended and refined many times since. The simplex algorithm, like all other commercially used algorithms for solving linear programs, is a hill-climbing approach, in the sense that it moves from point to point within the feasible region, increasing the objective function at every move, until no further improving direction within the feasible region can be found.

The standard form of a linear programming problem is

maximize c1x1 + c2x2 + . . . + cnxn     (A.1)

subject to

a1jx1 + a2jx2 + . . . + anjxn ≤ bj   for j = 1, 2, . . . , m,
xi ≥ 0   for i = 1, 2, . . . , n.
The objective function coefficients ci and the constraint coefficients aij can each be positive, negative, or zero. The constraints have been specified as “less than or equal to,” but “greater than or equal to” constraints can also be incorporated by simply multiplying both sides by –1. Equality constraints can be incorporated by including simultaneous “less than or equal to” and “greater than or equal to” constraints. The standard form specifies that each xi is zero or greater, but it is also possible to constrain any or all of the variables to be zero or less, or to allow them to take any value, positive or negative. Thus, the standard form is quite flexible in representing any linear objective function subject to any set of linear constraints.

The standard form can also be written as

maximize cx

subject to

Ax ≤ b,
x ≥ 0,
where c = (c1, c2, . . . , cn), x = (x1, x2, . . . , xn), b = (b1, b2, . . . , bm), and the matrix A consists of the elements aij.

An optimization problem that can be specified in this form can be solved efficiently by a number of commercial software packages such as CPLEX. Microsoft Excel includes a linear programming capability within its Solver add-in.
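For readers who want to experiment, the sketch below solves a small linear program with the open-source SciPy library's linprog routine; the problem data are invented for illustration, and because linprog minimizes, the objective coefficients are negated to maximize.

```python
from scipy.optimize import linprog

# Illustrative problem:  maximize 2*x1 + 3*x2
#   subject to  x1 +   x2 <= 4
#               x1 + 3*x2 <= 6
#               x1, x2 >= 0
c = [-2, -3]               # negate because linprog minimizes
A = [[1, 1], [1, 3]]
b = [4, 6]

res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)], method="highs")
print(res.x)               # optimal solution, approximately [3. 1.]
print(-res.fun)            # optimal value of the maximization, 9.0
```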

A.3 DUALITY AND COMPLEMENTARY SLACKNESS

Every linear program of the form of Equation A.1 has an associated dual linear programming problem of the form

minimize λb

subject to

λA ≥ c,
λ ≥ 0,
where λ = (λ1, λ2, . . . , λm) is a vector of dual variables, one corresponding to each of the constraints in the original formulation in Equation A.1. The importance of the dual arises partly from the complementary slackness principle:

The incremental change in the objective function from an incremental relaxation of constraint j in Equation A.1 (a small increase in its right-hand side bj) will be (approximately) λj. If constraint j in Equation A.1 is not binding, then λj = 0 and relaxing the constraint will not change the value of the objective function. If λj > 0, then constraint j in Equation A.1 is binding.

In particular, if a constraint in a linear program is associated with a capacity or inventory limitation, the complementary slackness principle means that the corresponding dual variable will be the opportunity cost associated with that constraint.
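Continuing the invented two-variable example from the Section A.2 sketch, the dual can be solved explicitly with linprog and checked against complementary slackness; all numbers here are illustrative.

```python
from scipy.optimize import linprog

# Primal (from the earlier sketch): max 2*x1 + 3*x2
#   s.t. x1 + x2 <= 4, x1 + 3*x2 <= 6, x >= 0.   Optimum 9 at (3, 1).
# Dual: min 4*l1 + 6*l2
#   s.t. l1 + l2 >= 2, l1 + 3*l2 >= 3, l >= 0.
b = [4, 6]                       # dual objective = primal right-hand sides
A_dual = [[-1, -1], [-1, -3]]    # ">=" constraints rewritten as "<=" by negation
c_dual = [-2, -3]

res = linprog(b, A_ub=A_dual, b_ub=c_dual,
              bounds=[(0, None), (0, None)], method="highs")
print(res.x)     # dual variables, approximately [1.5, 0.5]
print(res.fun)   # dual objective, 9.0 -- equal to the primal optimum

# Both primal constraints are binding at (3, 1) and both dual variables are
# positive; relaxing the first constraint to x1 + x2 <= 5 raises the primal
# optimum by about lambda_1 = 1.5 (from 9 to 10.5).
```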

A.4 DISCRETE OPTIMIZATION

Discrete optimization is in theory more applicable to pricing and revenue optimization than is continuous optimization, since most of the entities of interest (demand, bookings, no-shows, prices, etc.) are discrete rather than continuous. Unfortunately, discrete optimization is more computationally difficult, in general, than continuous optimization. In continuous optimization we are navigating a smooth surface, looking for the highest point. If the continuous function is differentiable (as we have assumed), either we are at a maximum or the local derivatives give us a clear direction to move. In contrast, discrete optimization is like hopping from island to island in an archipelago, looking for the island with the highest peak. When we are on a particular island, there may be no indicator to tell us which island to jump to next.

As a result, integer linear programming—finding the vector of integers x that solves Equation A.1—is much more difficult than standard linear programming, in which the answers do not need to be integers. In principle, the optimal integer solution to a linear program could be far from the optimal continuous solution found by standard linear programming software. Since we are usually interested in integer solutions, this would seem to present a problem. Fortunately, for most pricing and revenue optimization problems addressed in this book, rounding the continuous solution to a linear program to the nearest feasible integer solution usually turns out to give a good answer.

A particular case of interest in revenue management is the situation in which we need to find the integer i ≥ 0 that maximizes R(i), where R(i) is known to be unimodal. This is the problem faced, for example, in setting a total booking limit for a flight or setting capacity limits with two booking classes. In this case, we can use the principle of marginality, which states that the value i* that maximizes R(i) is the smallest value of i such that R(i + 1) ≤ R(i). This means that we can start at i = 0, compare R(i) and R(i + 1), increment i as long as R(i + 1) > R(i), and stop as soon as R(i + 1) ≤ R(i). As long as we know in advance that R(i) is unimodal, this approach will always terminate at the optimal value of i.
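A minimal sketch of the marginality principle follows. The revenue function R(i) used here—fare times expected sales under Poisson demand, minus a per-unit cost of capacity—is my own invented example of a unimodal R(i), not a model from the text.

```python
import math

def expected_revenue(i, mean=50.0, fare=100.0, unit_cost=40.0):
    """Hypothetical unimodal R(i): fare * E[min(D, i)] for Poisson demand D
    with the given mean, minus a per-unit cost of making i units available."""
    e_min, pmf, cdf = 0.0, math.exp(-mean), 0.0
    for d in range(i):
        e_min += d * pmf          # contribution of demand outcomes below i
        cdf += pmf
        pmf *= mean / (d + 1)     # advance the Poisson probability to d + 1
    e_min += i * (1.0 - cdf)      # demand at or above i sells exactly i units
    return fare * e_min - unit_cost * i

def optimal_limit(R, i_max=10_000):
    """Principle of marginality: return the smallest i >= 0 with R(i + 1) <= R(i)."""
    i = 0
    while i < i_max and R(i + 1) > R(i):
        i += 1
    return i

i_star = optimal_limit(expected_revenue)
print(i_star, expected_revenue(i_star))
```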

A.5 REINFORCEMENT LEARNING AND BANDIT APPROACHES

The methods that we have considered so far are based on maximizing (or minimizing) some function subject to a set of constraints. Note that optimization requires a stylized model that is always an imperfect representation of the real world. Optimization is a mathematical operation on this model that produces values of decision variables—in our case, primarily prices or booking limits—that can then be applied in the real world. For applications such as pricing and revenue optimization, this is often called a forecast then optimize approach—we build a mathematical model that forecasts what we believe would happen for different decisions we could make (often with a probability distribution) and then use an optimization model to calculate the values of the decision variables that maximize our objective function. This approach has two drawbacks: (1) it requires computational effort to estimate the parameters of the model and the forecast, such as the parameters of the price-response functions estimated in Chapter 4, and (2) the quality of the decisions is dependent on the quality of the assumptions underlying our model. If, for example, we assume demands by booking class are independent when, in fact, they are highly dependent, then our booking limits may be systematically biased.

The idea behind reinforcement learning is “learning by doing”—to systematically try different alternatives and do more of what works well and less of what does not work well. Reinforcement learning (RL) approaches were developed in part in emulation of the way that animals learn to negotiate their environment and meet such goals as obtaining food or shelter. RL approaches have posted some highly publicized successes—for example, the triumph of Google DeepMind’s AlphaGo over the world champion human Go player in 2016. Since about 2010, RL has become commonly applied to a wide variety of business decisions, including pricing.

The basic idea behind reinforcement learning is shown in Figure A.1. At any time, an agent has a set of actions available to it and is in the current state. The agent chooses an action. On the basis of the current state and the action chosen, the environment will respond with a reward and a new state. Both the reward and the new state are likely to be probabilistic, which is why the environment is represented as a cloud. The agent seeks to maximize the (possibly discounted) sum of the expected rewards that it will receive over time.

Figure A.1 Elements of reinforcement learning.
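The loop in Figure A.1 can be written down directly. The following sketch is entirely illustrative—the class names, the placeholder dynamics, and the random policy are my own—and simply spells out how state, action, reward, and next state circulate between agent and environment.

```python
import random

class Environment:
    """Responds to an action with a (possibly random) reward and a next state."""
    def step(self, state, action):
        reward = random.gauss(action, 1.0)   # placeholder stochastic reward
        next_state = state                   # stateless placeholder dynamics
        return reward, next_state

class Agent:
    """Chooses actions and (in a real application) learns from observed rewards."""
    def __init__(self, actions):
        self.actions = actions
    def choose(self, state):
        return random.choice(self.actions)   # placeholder policy
    def learn(self, state, action, reward, next_state):
        pass                                 # policy update would go here

env, agent, state, total = Environment(), Agent(actions=[0, 1, 2]), None, 0.0
for t in range(100):
    action = agent.choose(state)                      # agent picks an action
    reward, next_state = env.step(state, action)      # environment responds
    agent.learn(state, action, reward, next_state)    # agent updates itself
    total, state = total + reward, next_state
print(total)   # (possibly discounted) cumulative reward is what the agent maximizes
```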

The structure in Figure A.1 is quite general. It could apply to such varied situations as these:

• A driver is trying to get from an origin to a destination. His state is his location; his available actions are the choices he could make at any time, such as turn right, turn left, or go straight; and his reward is the inverse of the time it takes him to get from the origin to the destination.

• A poker player is playing a hand of poker. The player’s state is the cards she holds and the past actions of other players—how much they have bet in what order and which ones have folded. Her possible actions are all her possible bets, as well as “fold” or “stay,” and her reward is the payoff at the end of the hand.

• A chess player contemplates her next move. Her state is the position of the pieces on the board, her potential actions are her available moves, and the reward is either a win, a loss, or a draw at the end of the game.

As the examples might suggest, the problems attacked by reinforcement learning can be quite complex. In the examples above, the rewards do not occur primarily in response to each action but at the end when the driver reaches his destination or the chess player either wins, loses, or draws. Such applications can require deep and sophisticated approaches such as neural nets as well as lots of training to become effective.

Fortunately, the problems associated with pricing are much simpler—at least from a reinforcement learning point of view. The action available at each time is the price, and the reward is the immediate contribution—profit—received. There is no long-term goal such as winning a chess match or driving to a destination that requires thinking many steps ahead. And, at least as we consider it in Chapter 5, the pricing problem is stateless. This would not be the case if we were applying reinforcement learning to capacity control (Chapter 9), overbooking (Chapter 11), or markdown management (Chapter 12). In the case of these issues, the decision maker needs to keep track of the remaining unsold inventory, which would be a state variable.

Because of this relative simplicity, many pricing problems can be formulated as a k-armed bandit problem, which is one of the simplest forms of reinforcement learning. The name k-armed bandit arises from the metaphor of a gambler in a casino who has the option of playing one of k different slot machines (or one machine with k arms). The slot machines have different payoff probabilities. For simplicity, assume that they are $1 slot machines and that, if you lose, you lose your dollar, but if you win, you get your original dollar back plus an additional $1. The problem is that you do not know the payoff probabilities for the different machines but you want to maximize your expected winnings over some large number of plays. What should you do?

Let us start with the case of two machines—Machine 1 and Machine 2—and the following assumptions:

1. The probability of winning on either machine is independent of the probability of winning on the other.

2. The probabilities of winning on either machine do not change over time.

3. We have no prior information on the payoff probability of either machine.

For concreteness, let us assume that we bet $1 each time and the payoff is $2 if we win and $0 if we lose. Let us consider the case where Machine 1 has a payoff probability of .7 and Machine 2 has a payoff probability of .3.

The most naïve strategy that we could play is to choose one of the machines at random and play it until we are tired of playing or run out of money—whichever comes first. Alternatively, we could alternate between the machines or flip a coin to determine which of the two machines to play on each turn. Under either strategy, the expected payoff per play is the same: On each play, there is a 50% chance that we are playing the .7 payoff machine with expected payoff .7 × $1.00 – .3 × $1.00 = $0.40, and there is a 50% chance that we are playing the .3 payoff machine with expected payoff .3 × $1.00 – .7 × $1.00 = –$0.40. Our expected payoff per turn for any of these naïve strategies is thus .5 × $0.40 + .5 × (–$0.40) = $0.00.

A more sophisticated strategy is to pick one of the two machines at random and play it. If we win, we play it again; if we lose, we move to the other machine. If we use this strategy, we will end up playing Machine 1 about 70% of the time and Machine 2 about 30% of the time on the average. Thus, using this strategy, our expected return is about .7 × $0.40 + .3 × (–$0.40) = $0.16 per play, which is not bad and far better than anything you will ever encounter in Las Vegas.
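A quick simulation—my own sketch, using the win probabilities from the example—confirms the roughly $0.16 expected payoff per play of the “play again if we win, otherwise switch” strategy.

```python
import random

def win_stay_lose_switch(p=(0.7, 0.3), n_plays=100_000, seed=0):
    """Average payoff per $1 play: stay on the current machine after a win,
    switch machines after a loss."""
    rng = random.Random(seed)
    machine, total = rng.randrange(2), 0.0
    for _ in range(n_plays):
        win = rng.random() < p[machine]
        total += 1.0 if win else -1.0   # win $1 net on a win, lose the $1 bet otherwise
        if not win:
            machine = 1 - machine       # switch after a loss
    return total / n_plays

print(win_stay_lose_switch())           # approximately 0.16
```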

However, the “play again if we win, otherwise switch” strategy has an obvious drawback—it has no memory, so it cannot learn. Therefore, it can never realize that one machine is paying out more often on average than the other. The idea behind a bandit algorithm is to learn from experience. One particular approach—called an epsilon-greedy algorithm—is to do the following:

1. Set a small value of epsilon, ε. For this example, we will choose ε = .1.

2. We have just completed our Nth play. For each machine, we record the number of times that we have played that machine, with n1N being the number of times that we have played Machine 1 and n2N the number of times that we have played Machine 2. Note that n1N + n2N = N. Also, we have recorded the number of wins that we have enjoyed so far on each machine, w1N and w2N.

3. Before play N + 1, calculate the win rates r1N = w1N/n1N and r2N = w2N/n2N.

4. On play N + 1,

a. If r1N > r2N, play Machine 1 with probability 1 – ε and Machine 2 with probability ε.

b. If r2N > r1N, play Machine 2 with probability 1 – ε and Machine 1 with probability ε.

c. If r1N = r2N, play either machine with 50% probability.

5. For the machine that you played—call it machine i—update ni,N+1 = niN + 1. If you won, set wi,N+1 = wiN + 1; if you lost, then wi,N+1 = wiN. For the machine not played—call it machine j—set nj,N+1 = njN and wj,N+1 = wjN. Increment N ← N + 1 and go to step 2.

We continue until we run out of money or get tired of playing.
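The steps above can be coded directly. The sketch below is my own simplified implementation: each machine is played once in the first two plays (consistent with the Figure A.2 discussion, which notes that each machine is chosen once in the first two plays), and ties are broken as in step 4c.

```python
import random

def epsilon_greedy_bandit(p=(0.7, 0.3), n_plays=100, epsilon=0.1, rng=None):
    """Epsilon-greedy play of a two-armed bandit with win probabilities p.
    Returns total winnings and the fraction of plays on Machine 1 (index 0)."""
    rng = rng or random.Random()
    n = [0, 0]        # number of plays of each machine
    w = [0, 0]        # number of wins on each machine
    total = 0.0
    for t in range(n_plays):
        if t < 2:
            i = t                                    # play each machine once first
        else:
            r = [w[k] / n[k] for k in range(2)]      # win rates so far
            if r[0] == r[1]:
                i = rng.randrange(2)                 # step 4c: tie, 50/50
            else:
                best = 0 if r[0] > r[1] else 1
                i = best if rng.random() > epsilon else 1 - best
        win = rng.random() < p[i]
        n[i] += 1
        w[i] += 1 if win else 0
        total += 1.0 if win else -1.0
    return total, n[0] / n_plays

# Average over 50 replications of 100 plays, as in the left panel of Figure A.2.
results = [epsilon_greedy_bandit(rng=random.Random(s)) for s in range(50)]
print(sum(f for _, f in results) / 50)         # fraction on Machine 1, roughly 0.85
print(sum(t for t, _ in results) / 50 / 100)   # payoff per play, roughly $0.28
```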

How well does the ε-greedy multi-armed bandit algorithm perform? Over time, it will converge to the case in which Machine 1 is chosen 90% of the time. In this case, the average payoff will be .9 × $0.40 + .1 × (–$0.40) = $0.32, double the average payoff from the “play again if we win, otherwise switch” approach. In a simulation of 100 plays, it picks Machine 1 about 85% of the time, so the average payoff is .85 × $0.40 + .15 × (–$0.40) = $0.28. The average fraction of the time that Machine 1 is chosen in 100 plays is shown for 50 replications in the left-hand graph in Figure A.2. The fraction is .5 on play 2 because each machine is chosen once in the first two plays—from there the algorithm learns fairly rapidly to favor Machine 1.

The rapid learning displayed in the left-hand graph of Figure A.2 is partly a reflection of the fact that the winning probabilities of the two machines are so different—70% versus 30%. The algorithm learns much more slowly when the difference in winning probabilities is smaller. The average learning trajectory in the case when the winning probability of Machine 1 is 52% and that of Machine 2 is 48% is shown on the right-hand graph in Figure A.2. The learning is clearly slower, and only an average of about 70% of the plays are on Machine 1 after 100 plays. On the other hand, when the difference between machines is smaller, learning is cheaper in the sense that there is less relative cost to spending time on Machine 2 rather than Machine 1.

Figure A.2 Fraction of times that Machine 1 is played as a function of the number of plays averaged over 50 replications when the probabilities of winning are 70% and 30% (left) and 52% and 48% (right).

While there are many refinements and extensions, the algorithm described above incorporates the fundamental features of k-armed bandit algorithms. The small probability ε of playing the apparently inferior arm is critically important because, without occasional testing, there is a chance that the algorithm could pick the wrong arm and stick with it. In the case when the winning probabilities are 70% and 30%, there is a 9% chance that the first pull of Machine 1 will lose and the first pull of Machine 2 will win. In this case, a purely greedy algorithm (i.e., one with ε = 0) would continue playing Machine 2 forever. By occasionally testing a seemingly inferior alternative, we eliminate this possibility.

The two-armed bandit in the example can be extended to multiple arms in a straightforward fashion. After trying each of the arms, we choose the arm with the best average performance a fraction 1 – ε of the time. We can choose among the other arms based on their relative performance—better-performing arms should have a higher probability of being chosen, but every arm should have some chance of being chosen.

The extension of the multi-armed bandit problem to the pricing optimization problem should be clear. For pricing, we have a discrete set of prices p1 < p2 < . . . < pn that we want to test. In period t, we decide on one of the n prices to charge. Assume that we have decided on price pi. We observe the realized demand dt(pi) and the corresponding contribution dt(pi)(pi – c), where c is the unit cost. Let Ii(t) be an indicator variable such that Ii(t) = 1 means that price i was the price chosen at time t and Ii(t) = 0 means that price i was not the price chosen, so that one and only one value of i has Ii(t) = 1 for any t. Then, at time T, the average contribution from price i is
mi(T) = [Σt Ii(t)dt(pi)(pi – c)] / [Σt Ii(t)],

where both sums run over t = 1, 2, . . . , T.
A multi-armed bandit approach to pricing uses the values of mi(T) to determine which price to charge in period T + 1.
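As a hedged sketch of this idea, the code below runs an ε-greedy pricing test that tracks the average realized contribution for each candidate price. The candidate prices, unit cost, and noisy linear price-response function are invented for illustration, and this variant explores by choosing uniformly among all candidate prices with probability ε.

```python
import random

def epsilon_greedy_pricing(prices, demand, cost, n_periods=1000,
                           epsilon=0.1, seed=0):
    """Charge one candidate price each period, observe realized demand, and
    favor the price with the highest average contribution observed so far."""
    rng = random.Random(seed)
    counts = [0] * len(prices)      # periods in which each price was charged
    totals = [0.0] * len(prices)    # cumulative contribution earned by each price
    for t in range(n_periods):
        if t < len(prices):
            i = t                                   # try each price once
        else:
            avg = [totals[k] / counts[k] for k in range(len(prices))]
            best = max(range(len(prices)), key=lambda k: avg[k])
            i = best if rng.random() > epsilon else rng.randrange(len(prices))
        d = demand(prices[i], rng)                  # realized (random) demand
        counts[i] += 1
        totals[i] += d * (prices[i] - cost)         # contribution this period
    return [totals[k] / counts[k] for k in range(len(prices))]

# Hypothetical noisy linear price-response function (illustrative only).
def demand(p, rng):
    mean = max(0.0, 200 - 1.5 * p)
    return max(0, round(rng.gauss(mean, mean ** 0.5)))

avg_contribution = epsilon_greedy_pricing(
    prices=[60, 70, 80, 90, 100], demand=demand, cost=50)
print(avg_contribution)   # highest average contribution identifies the preferred price
```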

While the multi-armed bandit approach is appealing in its simplicity and its lack of underlying assumptions, it has a number of potential drawbacks. First, it requires the ability to test frequently throughout the potential space of decisions. In some circumstances, testing may not be feasible or desirable. For example, an airline may not want to do extensive testing of high overbooking levels, which may lead to high levels of denied boardings. Second, the two-armed bandit example assumed that the payoff probabilities of each machine were constant over time and not influenced by the external environment. However, in pricing and revenue optimization, there are multiple external variables that might change demand—for example, day of the week, whether it is the Christmas holiday season, and known competitive prices. In this case, we are searching not just for a price but for a policy that maps possible values of external variables to a best price. There are many different approaches to generating a policy using reinforcement learning—however, they all require significantly more testing than the simple examples we have seen.

A.6 FURTHER READING

There is a vast literature on classical optimization, both continuous and discrete. I am partial to Linear and Nonlinear Programming (Luenberger and Ye 2008) as a great single source for both continuous and discrete optimization. For those who want to go even deeper, Convex Optimization (Boyd and Vandenberghe 2004) is recommended for continuous optimization, and Understanding and Using Linear Programming (Matousek and Gärtner 2007) is recommended for discrete optimization.

For reinforcement learning in general and the bandit problem in particular, the best starting point is the excellent Reinforcement Learning: An Introduction (Sutton and Barto 2018); chapter 2 discusses the multi-armed bandit problem at length.
