Probability-boosting technique for combinatorial optimization
- Academic Editor: Mehmet Cunkas
- Subject Areas: Algorithms and Analysis of Algorithms, Optimization Theory and Computation, Theory and Formal Methods
- Keywords: Randomized algorithms, Algorithm design and analysis
- Copyright: © 2024 Kantabutra, distributed under a Creative Commons Attribution License
- Cite this article: Kantabutra (2024). Probability-boosting technique for combinatorial optimization. PeerJ Computer Science 10:e2499 https://doi.org/10.7717/peerj-cs.2499
Abstract
In many combinatorial optimization problems we want a particular set of k out of n items with certain properties (or constraints). These properties may involve the k items. In the worst case a deterministic algorithm must scan n−k items in the set to verify the k items. If we pick a set of k items randomly and verify the properties, it will take about $(n/k)^k$ verifications, which can be a really large number for some values of k and n. In this article we introduce a significantly faster randomized strategy that, with very high probability, picks such a set of k items by amplifying the probability of obtaining a target set of k items, and we show how this probability-boosting technique can be applied to solve three different combinatorial optimization problems efficiently. In all three applications, algorithms that use the probability-boosting technique show superiority over their deterministic counterparts.
Introduction
Combinatorial optimization is the study of finding an optimal item or an optimal set of items from a finite set of items, where the set of feasible solutions is discrete. Some of these combinatorial optimization problems are known to be hard while others are not. Often-mentioned problems in combinatorial optimization are the travelling salesperson problem, the minimum spanning tree problem, and the knapsack problem (Cormen et al., 2001). Solutions to combinatorial optimization problems vary from deterministic polynomial-time algorithms, to approximation algorithms, to linear programming, to randomized algorithms. We are interested in a type of combinatorial optimization that chooses $k$ out of $n$ items such that these $k$ items meet certain criteria. The eight queens puzzle, for example, can be viewed as choosing eight out of 64 positions on the chessboard so that no two queens threaten each other (Ball, 1960). In social media networks some users are influential and have a greater power to relay information (Qiang, Pasiliao & Zheng, 2023). The problem of locating the $k$ out of $n$ users who can influence the most users within a limited time frame can also be viewed as this type of choosing $k$ out of $n$ items. Additionally, this particular type of problem appears in many NP-hard decision problems such as the vertex cover, clique, dominating set, independent set, and set cover problems (Garey & Johnson, 1990), and these NP-hard problems have a wide range of applications.
In the worst case a deterministic algorithm must scan $n - k$ items in the set to verify the $k$ items. If we choose such $k$ items at random out of a total of $n$ items, it takes about $(n/k)^k$ verifications, which can be a really large number for some values of $n$ and $k$. More importantly, in some applications, the generation of all possible such combinations could take an excessive amount of memory when the values of $n$ and $k$ are large. There exists an algorithm to generate each of the $\binom{n}{k}$ combinations uniformly at random (Fan, Muller & Rezucha, 1962; Jones, 1962; Knuth, 1998). In addition, there is also a family of randomized algorithms for choosing a simple random sample, without replacement, of $k$ items from a population of unknown size, or of a known size that is too large to fit into main memory, in a single pass over the items (Vitter, 1985). However, in either case, the number of verifications in the worst case is still large.
In the last two decades or so, computer scientists have witnessed a tremendous growth in the use of probability to solve difficult problems. The prevalent use of randomized algorithms is one such example. Randomized algorithms are algorithms that make random choices during their execution and then proceed according to these random outcomes (Mitzenmacher & Upfal, 2005). For example, the protocol implemented in an Ethernet card uses random numbers to decide the next attempts to access the shared Ethernet communication medium when there is a collision of requests (Hussain, Kapoor & Heidemann, 2004). In this case the randomness is useful for breaking symmetry, preventing different cards from repeatedly accessing the medium at the same time. Other applications of randomized algorithms include Monte Carlo simulations (Raychaudhuri, 2008) and primality testing in cryptography (Goldwasser & Kilian, 1999). In these and many other important applications, randomized algorithms are significantly more efficient than the best known deterministic solutions. Moreover, in most cases the randomized algorithms are also simpler and easier to program.
In our case, we would also like to take advantage of the randomized approach to find the $k$ out of $n$ items. If we apply a straightforward uniformly random method to find such $k$ out of $n$ items, we will have a very small chance of success of only one in $\binom{n}{k}$. In this article we therefore consider a method to increase the probability of choosing the $k$ out of $n$ items and illustrate the usefulness of our method in three different real-world applications. Note that if the values of $n$ and $k$ are close, finding such $k$ items becomes easy. It is difficult to find such $k$ items when the values of $n$ and $k$ are far apart, and we only consider this latter case in this article.
Increasing the probability of some particular event of interest is not new. Obviously, appropriately reducing the size of a sample space can have such an effect. Given a set of numbers, the classic Insertion Sort algorithm (Cormen et al., 2001) incrementally reduces the space of all possible permutations until it arrives at the sorted sequence of numbers. In machine learning the gradient descent algorithm (Sra, Nowozin & Wright, 2011) also reduces the sample space of all possible parameters at each iteration until it arrives at a local minimum. There are also other methods in the literature to increase the probability of the event of interest. Consider the Solovay-Strassen primality test, which always answers true for prime numbers and, for composite numbers, answers false with probability at least 1/2 and true with probability less than 1/2 (Solovay & Strassen, 1977). To amplify the success probability, we can run this algorithm several times. If it ever returns false, the number is composite. If it consecutively returns true responses $t$ times, the number is prime with probability at least $1 - 2^{-t}$. In general, any Monte Carlo algorithm with one-sided error can use this probability-amplifying technique. For two-sided error Monte Carlo algorithms, the probability can be boosted by running the algorithm an odd number of times and using the majority of the responses as the answer. In quantum computation, probability or amplitude amplification is used in a family of quantum algorithms inspired by Grover's search algorithm (Brassard & Høyer, 1997; Brassard, Høyer & Tapp, 1998; Grover, 1998). In addition, probability amplification is also used in two-way quantum finite automata (Yakaryılmaz & Say, 2009). To the best of our knowledge, no probability-boosting technique in combinatorial optimization similar to ours exists in the literature.
Our contributions are as follows. We discuss the probability-boosting technique that can be used to increase the probability of choosing $k$ out of $n$ objects in the first section and then illustrate different uses of the technique in three applications. In the first application we use the probability-boosting technique to help choose a set of online advertisements with a more expensive rate and a required frequency of display. In the second application we consider a heterogeneous environment of computer servers, in which there is a small set of powerful supercomputers. Our probability-boosting technique is used to select this small set of supercomputers to effectively increase the overall processing speedup. In the last application we consider genetic comparison, in which two genetic strings are non-related. We show that the probability-boosting technique can successfully help to identify the differences in the genes. The advantages of our technique are summarized in Table 1.
Table 1: Advantages of the probability-boosting technique over deterministic methods in the three applications.

Applications | With probability boosting | Deterministic methods |
---|---|---|
Online content optimization | Meets the frequency requirement for each rate while displaying the advertisements in a random sequence. | None exists. |
Server selection in a heterogeneous environment | Obtains the $k$ supercomputers with very high probability, and therefore a high processing speedup, while saving a significant number of verification tests. | Needs to test essentially all $n$ servers in the worst case to obtain the $k$ supercomputers. |
String comparison | Identifies the differences with a small average number of positional checks and a small deviation from that average. | Requires checking essentially every position in the worst case to identify the differences. |
The probability-boosting technique
In this section we discuss a randomized technique that can potentially be used to solve many combinatorial optimization problems. Suppose we have $n$ different items and $k$ of these items are the items that we would like to obtain. If we simply choose $k$ items out of the $n$ items uniformly at random, the probability is one in $\binom{n}{k}$ that we will obtain the desired $k$ items. This probability is small when $n$ is large and $k$ is small. In the rest of this article, unless stated otherwise, we assume that $k$ is much smaller than $n$. Our technique is based on the observation that if we choose $k + r$ items uniformly at random instead of just $k$, the probability that we will obtain a $(k+r)$-combination that includes the desired $k$ items increases significantly. The following theorem describes this relationship. The proof of the theorem begins with the observation that when a $(k+r)$-combination of the set $S$ is picked uniformly at random, the probability of obtaining a targeted $k$-combination of the set $S$ equals $\binom{n-k}{r}/\binom{n}{k+r}$. This probability is the same for all targeted $k$-combinations of the set $S$ because the denominator is fixed and the numerator depends only on $n$, $k$, and $r$. The proof then uses a lower bound to bound this probability from below for a given $r$. Observe that in each step of the derivation of the inequalities the numerators get smaller while the denominator stays fixed, so the inequalities hold.
Theorem 1 (Probability Boosting). Given a constant $k$, a desired probability $p$, and any targeted $k$-combination of a set $S$ of size $n$, a $(k+r)$-combination of the set $S$ chosen uniformly at random contains the targeted $k$-combination of the same set $S$ with probability at least $p$, provided $r$ is chosen large enough that the lower bound on the containment probability derived in the proof is at least $p$.
Proof. Observe that when a $(k+r)$-combination of the set $S$ is picked uniformly at random, exactly $\binom{n-k}{r}$ of the $\binom{n}{k+r}$ possible combinations contain the targeted $k$-combination, so the probability of obtaining the targeted $k$-combination equals $\binom{n-k}{r}/\binom{n}{k+r}$. Bounding this ratio from below term by term, with the denominator kept fixed, and choosing $r$ large enough that the resulting lower bound is at least $p$ gives the claim.
Theorem 1 states that, given a desired probability $p$ and any targeted $k$-combination of a set $S$, we can always find $r$ such that a $(k+r)$-combination of the set $S$ contains the targeted $k$-combination of the same set $S$ with probability at least $p$. Hence, among all $\binom{n}{k+r}$ such combinations, at least $p\binom{n}{k+r}$ of them contain the targeted $k$-combination. Because this is true for any targeted $k$-combination of the set $S$, each element in $S$ appears in at least $p\binom{n}{k+r}$ of all such combinations. We therefore have the following proposition.
Proposition 1. Let $p$ be the probability in Theorem 1 that a $(k+r)$-combination of the set $S$ contains the targeted $k$-combination of the same set $S$. Each element in $S$ is in at least $p\binom{n}{k+r}$ combinations of a total of $\binom{n}{k+r}$ combinations.
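To make the boost concrete, the following Python sketch (our illustration; the values of $n$, $k$, and $r$ are arbitrary choices) empirically estimates the probability that a uniformly random $(k+r)$-combination contains a fixed targeted $k$-combination and compares it with the closed-form value $\binom{n-k}{r}/\binom{n}{k+r}$ noted in the proof.

```python
import random
from math import comb

def containment_probability(n, k, r, trials=5000):
    """Empirically estimate the chance that a uniformly random (k+r)-subset
    of {0, ..., n-1} contains the fixed target subset {0, ..., k-1}."""
    target = set(range(k))
    hits = sum(target <= set(random.sample(range(n), k + r)) for _ in range(trials))
    return hits / trials

n, k = 1000, 3
for r in (0, 100, 300, 600, 900):
    exact = comb(n - k, r) / comb(n, k + r)  # closed form: C(n-k, r) / C(n, k+r)
    print(f"r={r:3d}  exact={exact:.4f}  simulated={containment_probability(n, k, r):.4f}")
```

As the sketch shows, the containment probability climbs from essentially zero at $r = 0$ toward one as $r$ approaches $n - k$.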
Algorithm 1 (procedure S):
1: procedure S(L, n, k) ▹ a list L of n items and a number k
2: Set t = 0. ▹ t is the total number of items dealt with so far
3: Set m = 0. ▹ m is the number of items selected so far
4: Generate a random number U uniformly at random between 0 and 1.
5: If U ≥ (k − m)/(n − t), go to step 8.
6: Select the next item in L for the sample and increase m and t by 1.
7: If m < k, go to step 4; else the procedure terminates.
8: Skip the next item in L, increase t by 1, and go back to step 4.
Moreover, the value of $r$ is tight in the sense that it gives exactly $p$ when substituted into the lower bound on the containment probability. In other words, if we want to improve this value of $r$, we need to improve the lower bound.
Given an appropriate value of $r$ from Theorem 1, we will compute a random set of $k + r$ out of the $n$ items. A randomized algorithm that generates a $(k+r)$-combination of a set $S$ uniformly at random is due to Fan, Muller & Rezucha (1962) and Jones (1962) and was described in greater detail in Knuth (1998). For historical reasons we keep its original name, procedure S; it is reproduced as Algorithm 1 above. We assume that the number of items to be selected is at most the number of items in the list.
By Theorem 1, if we set $r$ accordingly and execute procedure S one time, the chance of obtaining a $(k+r)$-combination of the set $S$ containing the targeted $k$-combination of the same set $S$ is at least $p$. By executing procedure S $T$ times independently, the chance of not obtaining a $(k+r)$-combination of the set $S$ that contains the targeted $k$-combination of the same set $S$ becomes at most $(1-p)^T$.
We remark here that the time complexity of procedure S is $O(n)$ and that the algorithm terminates before considering the last record a fraction $(n-k)/n$ of the time, since the last item is selected with probability $k/n$ (Knuth, 1998). In other words, most of the time this algorithm does not consider every item in $L$, making it computationally fast. We use $\lceil r \rceil$ when $r$ is not an integer. We note that, in some cases, $k + r$ may be greater than $n$; for instance, this can happen when the desired probability $p$ is very close to 1 or when $k$ is large relative to $n$. In such cases, we will handle it differently. Later we discuss limitations and mitigations for the cases when $k + r$ is too large.
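The following is a minimal Python sketch of procedure S (selection sampling); the function name and the list-based interface are our own illustrative choices, while the select/skip rule mirrors steps 4-8 of Algorithm 1.

```python
import random

def procedure_s(L, k):
    """Selection sampling ('procedure S', cf. Knuth, 1998): return k items of the
    list L chosen uniformly at random, scanning L at most once from left to right."""
    n = len(L)
    sample = []
    t = m = 0                          # t: items dealt with so far, m: items selected so far
    while m < k:
        U = random.random()            # step 4: uniform random number in [0, 1)
        if U < (k - m) / (n - t):      # steps 5-6: select the next item with probability (k-m)/(n-t)
            sample.append(L[t])
            m += 1
        t += 1                         # steps 6/8: move past the current item
    return sample

# Example: a uniformly random 5-combination of the numbers 0..99.
print(procedure_s(list(range(100)), 5))
```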
Content optimization in online advertisement
In this section we discuss how the probability-boosting technique can be applied in the online advertising business. In online social media platforms such as Facebook, customers can pay to have their advertisements displayed on the platform. The rates of service may vary according to how often an advertisement must appear on the targeted social media: the higher the frequency, the more expensive the rate. At the same time, these advertisements cannot stay on display all the time in the same sequence because people may get bored. For the purpose of illustration, suppose an online social media company is paid to display 5,000 advertisements on its platform. Among these 5,000 advertisements, two of them are very expensive and must be displayed at least 50% of the time according to the contracts the company has made with its customers. Algorithm 2, which meets this particular requirement, is defined as follows. In this specific example, $L$ is the list of the 5,000 advertisements with $n = 5{,}000$, $k = 2$, and $r$ chosen by Theorem 1 for a success probability of at least $0.5$. By virtue of Theorem 1, with probability at least $0.5$ the two very expensive advertisements are among the advertisements added to $B$ in step 6 and displayed in step 7. We assume that this algorithm does not terminate because the advertisements run 24 hours a day, every day.
Algorithm 2:
1: Set k to the number of targeted (expensive) advertisements.
2: Set r according to Theorem 1.
3: Set B = ∅.
4: Invoke procedure S(L, n, k + r) to obtain a random (k + r)-set.
5: If all k targeted advertisements are in the resulting (k + r)-set then
6: Add the (k + r)-set to B.
7: Display all advertisements in B.
8: Set B = ∅.
9: Go to step 4.
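The following Python sketch plays one round of Algorithm 2 with the numbers used above (5,000 advertisements, two premium ones); the function, the advertisement identifiers, and the particular value of r are illustrative assumptions rather than parameters from the article.

```python
import random

def display_round(ads, premium, r):
    """One round of Algorithm 2 (sketch): sample a (k+r)-set of advertisements and
    display it only if it contains all k premium advertisements."""
    k = len(premium)
    batch = random.sample(ads, k + r)                     # stands in for procedure S
    return batch if set(premium) <= set(batch) else []    # steps 5-7

ads = [f"ad{i}" for i in range(5000)]
premium = ["ad0", "ad1"]                      # the two expensive advertisements
batch = display_round(ads, premium, r=3550)   # r chosen (illustratively) so both appear about half the time
print("displayed" if batch else "skipped", len(batch), "advertisements this round")
```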
In another scenario, if there is a single rate for all advertisements and each of the advertisements is to be displayed each time, simultaneously at the different social media outlets, with probability at least $p$, the following algorithm (Algorithm 3) can be used.
Algorithm 3:
1: Set k.
2: Set r according to Theorem 1.
3: Set B = ∅.
4: Invoke procedure S(L, n, k + r) to obtain a random (k + r)-set.
5: Add the resulting (k + r)-set to B.
6: Display each advertisement in B at each social media outlet simultaneously.
7: Set B = ∅.
8: Go to step 4.
Again, we assume that the advertisements are displayed without stopping. The correctness of this algorithm follows from Proposition 1.
Server selection in a heterogeneous environment
In this section we discuss how to apply the probability-boosting technique in Theorem 1 to manage a network of $n$ heterogeneous computer servers. Among these servers, $k$ of them are very expensive supercomputers capable of an enormous number of calculations per second (Oak Ridge National Laboratory, 2022). Because they are very expensive, we can only afford to have a small number of them. The other $n - k$ servers are regular servers with a far more modest number of calculations per second (phoenixNAP—Global IT Services, 2022). In a real computer network, in which each computer may be in a different country, we do not see the computers physically. All we know are their IP addresses or server identification numbers, which change dynamically. In order to know whether a computer server is regular or not, a test must be performed, and this test is costly in terms of communication time. We try to avoid testing all computer servers at all costs due to the potential to cause overwhelming network traffic.
Our objective is to leverage the very expensive supercomputers to reduce the overall computational time and, at the same time, to reduce the number of tests to be performed. Let $L$ be the list of the $n$ server identification numbers and let $k = c$, where $c$ is a small constant. To choose a set of servers for processing, we use Algorithm 4, defined as follows:
Algorithm 4:
1: Set k.
2: Set r according to Theorem 1.
3: Set B = ∅.
4: Invoke procedure S(L, n, k + r) to obtain a random (k + r)-set.
5: Add the resulting (k + r)-set to B.
6: Repeat steps 4 and 5 until they have been executed T times in total.
7: Return B.
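A minimal Python sketch of Algorithm 4 follows; the parameter values and the use of random.sample in place of procedure S (both draw a uniformly random (k+r)-combination) are illustrative assumptions.

```python
import random

def algorithm4(server_ids, k, r, T):
    """Algorithm 4 (sketch): return T independently drawn (k+r)-sets of server
    identifiers; random.sample stands in for procedure S, since both draw a
    uniformly random (k+r)-combination."""
    return [random.sample(server_ids, k + r) for _ in range(T)]

# Illustrative values only: 10,000 servers, k = 5 supercomputers, T = 10 repetitions.
B = algorithm4(list(range(10_000)), k=5, r=2_000, T=10)
print(len(B), "candidate sets of", len(B[0]), "servers each")
```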
The time complexity of Algorithm 4 is $O(Tn)$ and the space complexity for $B$ is $O(T(k+r))$. The time to verify the correct $(k+r)$-set is $O(cT(k+r))$ because the verification must be done for every $(k+r)$-set to find the $k$ supercomputers, assuming the time taken to verify the type of a server is a very large constant $c$. Additionally, in step 4, Algorithm 1 terminates before considering the last record a fraction $(n-k-r)/n$ of the time, since the last item is selected with probability $(k+r)/n$ (Knuth, 1998). Therefore, most of the time, Algorithm 4 does not consider every server in $L$ in step 4, so it is relatively fast. Lemma 1 gives the probability that Algorithm 4 obtains the $k$ supercomputers. The main idea of the proof is similar to tossing a fair coin that has probability 1/2 of coming up heads and 1/2 of coming up tails: to increase the probability of obtaining at least one head, we can repeat the toss many times.
Lemma 1. Algorithm 4 gives a $(k+r)$-set of servers that contains the $k$ supercomputers with probability at least $1-(1-p)^T$.
Proof. By Theorem 1, the procedure S on line 4 returns a $(k+r)$-set that contains the $k$ supercomputers with probability at least $p$. Therefore, the probability of not getting such a $(k+r)$-set is at most $1-p$ for one execution of procedure S. Because the algorithm repeats the call to procedure S $T$ times and each call is made independently, the probability that none of the $T$ $(k+r)$-sets of servers contains the $k$ supercomputers is at most $(1-p)^T$. Hence, Algorithm 4 gives a $(k+r)$-set of servers that contains the $k$ supercomputers with probability at least $1-(1-p)^T$.
Observe that if $T$ is chosen sufficiently large (for example, growing logarithmically in $n$), Algorithm 4 gives a $(k+r)$-set of servers that contains the $k$ supercomputers with probability $1-(1-p)^T$, which is a high probability. However, the space complexity becomes $O(T(k+r))$, and this fact inevitably requires the number of verification tests to be of the same order as the space complexity. If $T(k+r)$ is proportional to $n$ or larger, this number of tests would be worse than the $n$ tests of the straightforward deterministic algorithm. Hence, we would like $T(k+r)$ to be as small as possible to gain a real advantage. The following proof shows that if $T$ and $r$ are chosen so that $T(k+r) < n$ in the execution of Algorithm 4, the number of required tests to verify the $k$ supercomputers in the worst case is strictly less than the $n$ tests of the straightforward deterministic algorithm. The idea of the proof is simply to notice that there is a total of at most $T(k+r)$ servers to be tested and then to use the given conditions to show that the theorem holds.
Theorem 2. If $T$ and $r$ are chosen so that $T(k+r) < n$ in the execution of Algorithm 4, the number of required tests to verify the $k$ supercomputers in the worst case is strictly less than the $n$ tests of the straightforward deterministic algorithm.
Proof. Suppose $T(k+r) < n$. The number of required tests to verify the $k$ supercomputers in the worst case depends on the $(k+r)$-sets of servers from the sampling in Algorithm 4. There is a total of at most $T(k+r)$ distinct servers to be tested. Since $T(k+r) < n$, the theorem holds.
To verify that there is a set of numbers for $n$, $k$, $r$, and $T$ such that Theorem 2 holds, we give the following example. Choose $n$, $k$, $r$, and $T$ with $T(k+r) < n$. By virtue of Theorem 1, each sampled $(k+r)$-set contains the $k$ supercomputers with probability at least $p$, and by Lemma 1 at least one of the $T$ sets does with probability at least $1-(1-p)^T$. The number of required tests is strictly less than $n$, saving $n - T(k+r)$ tests, and this saving also comes with the probability of at least $1-(1-p)^T$ of obtaining the $k$ supercomputers by Lemma 1. In the concrete instance considered here, this method saves 374 of the verification tests otherwise required, with probability at least $1-(1-p)^T$ that the selected $k$ servers are the supercomputers.
In a real situation we could apply this probability-boosting technique to select a set of supercomputers to process a number of jobs in parallel. Suppose there are $J$ jobs to complete using the network of $n$ heterogeneous computer servers. These jobs come in one at a time. Among these $n$ computer servers, $k$ of them are very expensive supercomputers, each performing $f$ calculations per second. The other $n - k$ servers are regular servers, each performing $s$ calculations per second, with $f \gg s$. Each of these jobs requires an identical number of $C$ calculations, and each job is divided into $k$ pieces with $C/k$ calculations each. If $k$ regular servers process one of these jobs, it takes $\frac{C}{ks}$ seconds to complete the job. On the other hand, if the $k$ supercomputers process one of these jobs, it takes $\frac{C}{kf}$ seconds to complete the same job. In a parallel processing environment the slowest server determines the total processing time to complete a job. Hence, if we randomly pick a $k$-set from the $n$ computer servers, the slowest of these $k$ servers determines the total processing time for the job.
Observe that if Algorithm 4 with an appropriate value of $r$ does not return the set of $k$ supercomputers, the jobs are still executed by a possibly mixed set of regular servers and supercomputers. No harm is done except that we do not obtain the speedup that we would like to have from the set of $k$ supercomputers. Theorem 3 shows the time expected to complete all jobs using Algorithm 5. The proof simply uses the fact that we know the probability of the case where all $k$ supercomputers are obtained and of the case where they are not obtained by the random process in the algorithm, together with the processing time in each case.
Algorithm 5:
1: Obtain a set B of computers by executing Algorithm 4.
2: Obtain a set F of the k fastest computers among the computers in B.
3: For each incoming job, execute it in parallel on the k computers in F.
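Steps 1 and 2 of Algorithm 5 can be sketched in Python as follows; speed_of is a hypothetical stand-in for the costly verification test, and the speeds and sampling parameters are illustrative assumptions. Each incoming job would then be executed in parallel on the returned servers (step 3).

```python
import random

def algorithm5_setup(candidate_sets, k, speed_of):
    """Steps 1-2 of Algorithm 5 (sketch): pool the servers returned by Algorithm 4
    and keep the k fastest ones; speed_of is the costly verification test and is
    called once per distinct server, i.e. at most T*(k+r) times."""
    pooled = set().union(*map(set, candidate_sets))      # the set B from Algorithm 4
    return sorted(pooled, key=speed_of, reverse=True)[:k]

# Illustrative demo: servers 0..4 play the role of the supercomputers.
speeds = {i: (1e12 if i < 5 else 1e9) for i in range(10_000)}
candidate_sets = [random.sample(range(10_000), 2_005) for _ in range(10)]
fastest = algorithm5_setup(candidate_sets, 5, speeds.get)
print("servers chosen for the jobs:", sorted(fastest))
```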
Theorem 3. If we use Algorithm 5, the expected time to complete all $J$ jobs is at most $J\left[\left(1-(1-p)^T\right)\frac{C}{kf} + (1-p)^T\frac{C}{ks}\right]$, where $T$ is the number of loops in Algorithm 4 and $p$ is the probability from Theorem 1.
Proof. Let $X_i$ be a random variable for the time job $i$ takes to complete and let $X = \sum_{i=1}^{J} X_i$ be the time to complete all $J$ jobs, since the jobs are processed one at a time. With probability at least $1-(1-p)^T$, Algorithm 4 returns a set containing the $k$ supercomputers, so the $k$ fastest servers in $F$ are the supercomputers and job $i$ takes $\frac{C}{kf}$ seconds; otherwise, in the worst case the slowest server used is a regular one and job $i$ takes $\frac{C}{ks}$ seconds.
The expected time for each job is therefore $E[X_i] \le \left(1-(1-p)^T\right)\frac{C}{kf} + (1-p)^T\frac{C}{ks}$,
and, by linearity of expectation, the expected time for all jobs is $E[X] = \sum_{i=1}^{J} E[X_i] \le J\left[\left(1-(1-p)^T\right)\frac{C}{kf} + (1-p)^T\frac{C}{ks}\right]$.
Hence, the theorem holds.
A few remarks are in order. First, the time complexity of Algorithm 5 is the time complexity of Algorithm 4 in Step 1, plus the time for verifying the servers in Step 2, plus the average time to process all jobs in Step 3. Steps 1 and 2 are executed only once, and their time is dominated by the $T$ sampling passes and the at most $T(k+r)$ verification tests. The average time in Step 3 is the time in Theorem 3. Second, this result depends on the value of $T$. As $T$ grows, the expected time in Step 3 approaches the processing time of the $k$ supercomputers multiplied by the number of jobs. Third, if we just pick a set of $k$ computers at random without the probability-boosting technique for each job, the expected time would be as follows.
The expected time for each job is $\frac{1}{\binom{n}{k}}\cdot\frac{C}{kf} + \left(1-\frac{1}{\binom{n}{k}}\right)\cdot\frac{C}{ks}$, because the probability of drawing exactly the $k$ supercomputers in a uniformly random $k$-set is $1/\binom{n}{k}$ (Cormen et al., 2001), and the expected time for all $J$ jobs is therefore $J$ times this quantity.
In this latter case, the expected time for all jobs is much greater than with the probability-boosting technique because $1/\binom{n}{k}$ is a lot smaller than $1-(1-p)^T$. We illustrate this advantage with a concrete example. Let $n$ = 10,000 with a small number $k$ of supercomputers, and consider only one job. With the probability-boosting technique, the expected time to complete the job is close to the all-supercomputer time $\frac{C}{kf}$. On the other hand, without the probability-boosting technique, the expected time to complete the same job is essentially the regular-server time $\frac{C}{ks}$, since the chance of drawing the $k$ supercomputers uniformly at random is negligible. Clearly, with the probability-boosting technique the processing speed is much closer to that of the supercomputers while, without it, the processing speed is much closer to that of the regular computers.
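The contrast can be checked with a short back-of-the-envelope computation; every number below (the speeds, C, p, T, and the choice of k) is an illustrative assumption rather than a figure from the article.

```python
from math import comb

# Illustrative assumptions only (not the article's figures).
n, k = 10_000, 5            # servers in total, supercomputers among them
C = 1e15                    # calculations per job
fast, slow = 1e12, 1e9      # calculations per second: supercomputer vs regular server
p, T = 0.9, 10              # per-sample success probability (Theorem 1) and repetitions

t_fast = C / (k * fast)     # job time if the k chosen servers are all supercomputers
t_slow = C / (k * slow)     # job time if the slowest chosen server is a regular one

q_boost = 1 - (1 - p) ** T  # probability that boosting finds the k supercomputers (Lemma 1)
q_plain = 1 / comb(n, k)    # probability that one uniform k-sample finds them

print("expected job time with boosting   :", q_boost * t_fast + (1 - q_boost) * t_slow, "s")
print("expected job time without boosting:", q_plain * t_fast + (1 - q_plain) * t_slow, "s")
```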
String comparison
In biology a protein can be represented as a sequence of certain English characters, and each position in the sequence has a certain representation. Significant similarity between biological sequences such as DNA, RNA, or proteins is strong evidence that two sequences are related by evolutionary changes from a common ancestral sequence. Identifying differences in these sequences gives us useful information about a common ancestral relation of proteins and therefore of animals. In this section we discuss how the probability-boosting technique can be applied to detect differences in protein sequences, or in character strings in general.
For the sole purpose of illustration, two strings of even length $n$ are exactly the same if and only if the characters in each position of the two strings are the same. If there are at least $k'$ differences in each half of the strings, the two strings are considered non-related. Given two non-related strings of even length $n$, we would like to identify $k'$ differing positions in each half of the two strings.
A straightforward method is to compare each character of the two strings from position 1 to position $n$. This would take $O(n)$ time in the worst case and is not the best approach when $k'$ is small relative to $n$. If we, however, pick just $k'$ positions in each half at random and compare them, the chance that all of them are differing positions can be as small as one in $\binom{n/2}{k'}$. Instead, we use Algorithm 6 with the probability-boosting technique.
Algorithm 6:
1: Set n' = n/2. ▹ the length of each half
2: Set k'. ▹ the required number of differences per half
3: Set r according to Theorem 1 for n' and k'.
4: Set the halves S1, S2 of S and S'1, S'2 of S'.
5: Repeat
6: Invoke procedure S on the positions of the first half to obtain a random (k' + r)-set.
7: Invoke procedure S on the positions of the second half to obtain a random (k' + r)-set.
8: Check the positions in the first (k' + r)-set in S1 and S'1 for differences.
9: Check the positions in the second (k' + r)-set in S2 and S'2 for differences.
10: Until there are at least k' differences in both pairs.
11: Report the k' differing positions in each half.
Let $S$ and $S'$ be the two given non-related strings of even length $n$, and let $S_1$, $S_2$ be the first half and second half of $S$ and $S'_1$, $S'_2$ be the first half and second half of $S'$, respectively. Let $k'$ be the required number of differences per half and let $r$ be chosen according to Theorem 1 for the half length $n/2$ and the target size $k'$.
Observe that Algorithm 6 is a Las Vegas algorithm and is always correct when it reports the positions in each half. The algorithm always terminates because the two given input strings are non-related. However, the time taken might vary between runs, even on the same input. Hence, in this example, we bound the running time. Lemma 2 gives the expected number of loops in Algorithm 6 and its variance. The idea of the proof is to observe that the number of loops has a geometric distribution.
Lemma 2. The expected number of loops in Algorithm 6 is at most 2 and the variance is at most 2.
Proof. We need to bound the probability that both halves of the strings have at least $k'$ differences among the checked positions, because this is when the algorithm terminates. This probability equals the probability that the first half has at least $k'$ differences and the second half has at least $k'$ differences. By Theorem 1, the probability that each half has at least $k'$ differences is at least $p$. Because the positions in each half are independently chosen, the probability that the first half has at least $k'$ differences and the second half has at least $k'$ differences is therefore at least $p^2$. Because this experiment has a geometric distribution, the expected number of loops until there are at least $k'$ differences in both halves is at most $1/p^2$ and the variance is at most $(1-p^2)/p^4$; choosing $r$ so that $p^2 \ge 1/2$ makes both quantities at most 2.
To see how the running time of Algorithm 6 deviates from the expected number of loops, let $X$ be a random variable representing the number of loops in Algorithm 6, with $E[X] \le 2$ and $\mathrm{Var}[X] \le 2$ by Lemma 2. By Chebyshev's inequality, $\Pr[|X - E[X]| \ge a] \le 2/a^2$ for any $a > 0$, which is quite small already for moderate $a$. We further remark that a clear advantage of this algorithm is the small number of positions that are checked to find the differences between the two strings. As a concrete example, the number of positional checks that can be saved on average equals the length of the string minus the average number of iterations multiplied by the $2(k'+r)$ positions checked per iteration. In the deterministic case all 19,995 positions must be checked in the worst case.
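A minimal Python sketch of Algorithm 6 follows; the example strings and the values of k' and r are illustrative assumptions.

```python
import random

def find_differences(s, s_prime, k_prime, r, max_rounds=10_000):
    """Algorithm 6 (sketch): repeatedly sample k'+r positions in each half of two
    non-related strings until at least k' differing positions are found in both
    halves, then report k' differing positions per half."""
    n = len(s)
    half = n // 2
    for _ in range(max_rounds):                              # the Repeat ... Until loop
        left = random.sample(range(half), k_prime + r)       # positions in the first half
        right = random.sample(range(half, n), k_prime + r)   # positions in the second half
        diff_left = [i for i in left if s[i] != s_prime[i]]
        diff_right = [i for i in right if s[i] != s_prime[i]]
        if len(diff_left) >= k_prime and len(diff_right) >= k_prime:
            return diff_left[:k_prime], diff_right[:k_prime]  # step 11: report k' per half
    return None                                               # safety cut-off for this sketch

# Two small illustrative strings with differences in both halves.
s1 = "abcdefghijklmnopqrst"
s2 = "abXdeXghijklXnopXrst"
print(find_differences(s1, s2, k_prime=2, r=6))
```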
Limitations
In Theorem 1, $r$ is computed from the lower-bound inequality used in its proof.
In a rare case, $k + r$ could be greater than or equal to $n$; in that case this method simply picks the complete set. This can happen when $k$ or the desired probability $p$ is large. In most cases, this probability-boosting method produces a good value for $r$. We mention here that $n$ does not have to be large in order to use this method, because the chance of picking the targeted $k$-set out of the $\binom{n}{k}$ possibilities uniformly at random is still very small. As the problem of content optimization in online advertisement in the previous section illustrated, one cannot simply do it manually even when $n$ is as small as 5,000. Nonetheless, we suggest the following remedies for the case where $k + r \ge n$, whenever they are applicable.
1. Increase the allowed failure probability (that is, lower the required probability $p$) in the computation of $r$. This makes $r$ smaller, but we may have to trade it for more running time of the algorithm to obtain the desired result.
2. Reduce the values of $n$ and $k$ using the divide-and-conquer technique illustrated in the example of string comparison. In this case, the more subproblems we have, the smaller the values of $n$ and $k$ become. The divide-and-conquer technique can also accommodate parallel computation to speed up the overall process.
3. Because $\binom{n}{k+r} = \binom{n}{n-k-r}$, some applications may allow us to pick $k+r$ or $n-k-r$ items, whichever is smaller.
Conclusion
In this article we considered a type of combinatorial optimization that involves choosing a targeted $k$-subset of $n$ objects. We discussed the motivation and the probability-boosting technique that can be used to increase the probability of choosing the $k$ out of $n$ objects. Three different applications of the probability-boosting technique were then illustrated. We summarize the advantages of each example in Table 1.
In the end, potential limitations of the probability-boosting technique were discussed and mitigations were suggested. We are convinced that the three examples in this article are among many applications of this probability-boosting technique yet to be discovered. In terms of future research, we may investigate the application of the probability-boosting technique in combinatorial optimization problems whose items are not obvious and/or come with constraints, such as the hybrid flow-shop scheduling problem (Deng et al., 2024).