Received August 8, 2019, accepted August 25, 2019, date of publication September 6, 2019, date of current version September 20, 2019. Digital Object Identifier 10.1109/ACCESS.2019.2939805 Prospective: A Data-Driven Technique to Predict Web Service Response Time Percentiles YASAMAN AMANNEJAD 1 , DIWAKAR KRISHNAMURTHY2 , AND BEHROUZ FAR2 1 Mathematics and Computing Department, Mount Royal University, Calgary, AB T3E 6K6, Canada 2 Department of Electrical and Computer Engineering, University of Calgary, AB T2N 1N4, Canada Corresponding author: Yasaman Amannejad (yamannejad@mtroyal.ca) ABSTRACT Delivering fast response times for user transactions is a critical requirement for Web services. Often, a Web service has Service Level Agreements (SLA) with its users that quantify how quickly the service has to respond to a user transaction. Typically, SLAs stipulate requirements for Web service response time percentiles, e.g., a specified target for the 95th percentile of response time. Violating SLAs can have adverse consequences for a Web service operator. Consequently, operators require systematic techniques to predict Web service response time percentiles. Existing prediction techniques are very time consuming since they often involve manual construction of queuing or machine learning models. To address this problem, we propose Prospective, a data-driven approach for predicting Web service response time percentiles. Given a specification for workload expected at the Web service over a planning horizon, Prospective uses historical data to offer predictions for response time percentiles of interest. At the core of Prospective is a lightweight simulator that uses collaborative filtering to estimate response time behaviour of the service based on behaviour observed historically. Results show that Prospective significantly outperforms other baseline techniques for a wide variety of workloads. In particular, the technique provides accurate estimates even for workload scenarios not directly observed in the historical data. We also show that Prospective can provide a Web service operator with accurate estimates of the types and numbers of Web service instances needed to avoid SLA violations. INDEX TERMS Performance engineering, prediction, response time percentiles, system sizing. I. INTRODUCTION Web services need to respond quickly to transactions issued by their users. Long response times can frustrate a user and can cause them to discontinue using the service. Often, Web service operators have Service Level Agreements (SLA) with end users that among other things specify requirements, e.g., acceptable thresholds, for service response times. Typically, an SLA will stipulate targets for service response time percentiles. Since the objective is to avoid long response times for users, percentiles that capture the tail of the response time distribution, e.g., the 95th percentile, are often used while defining SLAs. Operators need systematic performance engineering techniques to ensure that their service will meet response time percentile requirements when deployed in production mode. Specifically, given a planning horizon, i.e., a future time period, and an estimate of user workload over this planning The associate editor coordinating the review of this manuscript and approving it for publication was Fabrizio Messina. 127904 horizon, an operator needs predictions of the various response time percentiles of interest. Such predictions can help the operator take remedial actions if percentile thresholds are likely to be exceeded. An example of a remedial action could be scaling out the resources and using more server instances to host the service. Operators need a performance engineering approach to help them select the right types and numbers of server instances to avoid exceeding response time percentile thresholds. Performance engineering exercises for Web services typically combine load testing and performance modeling [1], [2]. Since load tests are very time consuming to setup and conduct, typically only a very small number of test workloads and system configuration settings are considered. Consequently, a queuing model or a machine learning model, e.g., regression model, may be used to predict the response time behaviour of the service under workloads and system settings that were not explored during the testing phase. Measurements from the load tests are used to parameterize as well as validate the predictive model. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/ VOLUME 7, 2019 Y. Amannejad et al.: Prospective: Data-Driven Technique to Predict Web Service Response Time Percentiles While modeling techniques can be effective, operators can face many challenges in leveraging them. Specifically, queuing analysis techniques can be time consuming since they require an expert to author, parameterize, and validate a model of the Web service. Similarly, applying machine learning models requires an expert to identify the right type of learning technique to employ. Furthermore, expertise is needed to map workload and system characteristics to numerical independent variables that are good at explaining the variation in the dependent variable, i.e., a response time percentile of interest. This process is called featurization. Additionally, some Machine learning and statistical techniques, e.g., least squares regression [3] and quantile regression [4], require the specification of a suitable performance function, e.g., linear or polynomial, that defines how the identified independent and dependent variables are related. Such tasks typically require an expert and can be time consuming. In this paper, we explore a data-driven and automated approach called Prospective to address these limitations. Prospective does not require an operator to author a queuing model. Furthermore, neither does it require the operator to explicitly define a featurization nor does it require the specification of a performance function. Prospective takes as input historical performance data collected either from load testing campaigns or from a live system deployment and a specification for the workload expected at the Web service over a planning horizon. The specification describes the arrival pattern of transactions, i.e., the time series of arrivals, and the transaction mix, i.e., the probabilities of observing each transaction type in the workload. Each transaction type refers to a specific function invoked by the end user, e.g. browse and buy in an e-commerce system. Given such a specification, Prospective uses the historical data to offer predictions for response time percentiles of interest for the specified workload. The key challenge that Prospective needs to address is that the specification of the workload for which a prediction is desired could be quite different from those of the workloads used to generate the historical data. Prospective employs a novel approach to address this challenge. The central idea behind Prospective is the combined use of load and transaction type for predicting a transaction’s response time. We define the load at any given time instant as a vector that records the numbers of concurrent transactions of each type at a service at that time instant. Prospective consists of a lightweight simulator that interacts with a collaborative filtering (CF) [5] based response time prediction module to estimate the service’s response time behaviour. The prediction process begins by first generating a synthetic trace of transactions that conforms to the input workload specification. Prospective’s simulator traverses this trace to produce initial estimates of the loads experienced by these transactions. Given the initial load estimated for a synthetic transaction and that transaction’s type, response time prediction is carried out by analyzing the historical data to obtain past response times detected for the desired transaction type and load combination. For scenarios where the historical VOLUME 7, 2019 data has not encountered this combination, the CF method infers the response time based on response times recorded in the repository for similar combinations. The simulator iteratively refines the load and response time estimates for the synthetic transactions and then finally calculates the desired percentiles. Results from realistic Web service testbed based on the RUBiS [6] benchmark application show that Prospective achieves high accuracy for a wide variety of workloads. In all of the experimented scenarios, we show that Prospective can offer 95th and 99th percentile predictions with errors less than 14.1%. Furthermore, Prospective outperforms baseline techniques. In particular, it is 16 times more accurate than a similar technique from literature that is agnostic to transaction types. Prospective is also significantly more accurate than existing approaches in literature that are based on building quantile regression models [7], [8]. Finally, we show how Prospective can be used to simulate the impact of horizontal scaling. Specifically, results show that the technique can simulate common load balancing policies used in practice and provide an operator with accurate estimates of the types and numbers of server instances needed to avoid response time percentile violations. This paper represents a significant extension of the preliminary work described in our previous paper [9]. The extensions are along the following directions: 1) We have further improved the accuracy of Prospective by incorporating two new features. We now explore an additional step where transaction types having similar mean response times are automatically clustered into a small number of transaction groups. We also study a refinement to the earlier approach we used to automatically select Prospective’s simulation parameters. On an average, as discussed in Sec. V-C, the two new features combined improve the mean prediction error for the 95th and 99th percentiles from 11.0% and 14.2% to 6.1% and 7.0%, respectively. 2) We have expanded the evaluation by experimentally comparing Prospective with two state-of-the-art quantile regression techniques [7], [8]. 3) We present new experiments to characterize the sensitivity of Prospective to the historical data. The results are discussed in Sec. V-E. 4) To support system sizing questions, we have added a new module to Prospective that can simulate common load balancing policies used in practice. We present new experiments in Sec. VI to show how this feature can be used to simulate horizontal scaling scenarios and answer system sizing questions. 5) The related work section is significantly expanded. The remainder of this paper is organized as follows. Our methodology is discussed in Sec. II. Sec. IV describes the experiments used to validate Prospective. Sec. V provides results that characterize Prospective’s prediction accuracy. Sec. VI show how Prospective can be used to support system 127905 Y. Amannejad et al.: Prospective: Data-Driven Technique to Predict Web Service Response Time Percentiles sizing exercises. Sec. VII discusses related work. Finally, Sec. VIII provides conclusions and future work. II. PROSPECTIVE We propose ‘‘Prospective’’, a system for Predicting RespOnSe Time PErcentiles using CollaboraTIVE Filtering. The high level architecture of Prospective is described in Sec.II-A. Sections II-B to II-D discuss in detail the implementation of Prospective. A. OVERVIEW Fig. 1 shows the architecture of Prospective. The target application is a Web service with n types of transactions with each transaction type assigned a unique id in the range 1 to n. Similar to other performance prediction technique, Prospective requires that the Web service operator specifies as input a workload specification W over a planning horizon. W represents the workload for which a prediction is desired. It consists of the tuple of transaction mix M, transaction arrival model A, and the total number of transactions N . M defines the probabilities of observing each transaction type in the workload. A provides a statistical model that governs the time instants at which transactions in the workload arrive at the Web service, i.e., the arrival pattern. W can be obtained by exploiting workload prediction techniques [10], [11] or using expert knowledge. It can also be obtained by perturbing the current workload of the service for the purpose of a sensitivity analysis. As shown in Fig. 1, a workload generator is used to transform W to a trace T of N workload records. Each record pertains to a transaction and contains an id indicating the transaction’s type and also the time at which the transaction arrives at the Web service. executing concurrently on the service during the execution of transaction i. The response time of the transaction is indiL cated by resi i . D can be constructed by processing trace data Trep from a load testing campaign or the actual Web service deployment. Prospective only requires that each line of Trep contain the arrival instant of a transaction, its transaction type, and its response time. As shown in Fig. 1, the final inputs to Prospective are the overlap parameter O and the number of iterations parameter I . O defines the minimum overlap between the execution of two transactions for them to be considered as concurrent transactions during load calculations. I is the number of times Prospective attempts to refine the response time estimates for a transaction. O and I together provide a mechanism to simultaneously improve prediction accuracy and limit simulation time. The parameters are automatically calculated in a pre-processing phase described in Sec. II-C. Prospective uses a lightweight simulator in combination with a CF based response time prediction module to estimate response times of transactions in the input trace T. These response times are then used to calculate the Pth percentile of response times. Multiple independent simulation runs can be carried out to calculate a confidence interval (CI) for the mean. The Web service operator can use the predictions to take system sizing and application deployment decisions. Fig. 1 also shows the high-level interaction between the simulator and the response time prediction modules. The simulator module takes as input the workload trace T. It then calculates the load Li experienced by a transaction i in the trace and queries the response time prediction module to get a L response time estimate presi i at this load for the transaction. This process is repeated for all N transactions to obtain the response time percentile of interest. We next discuss in detail the implementation of these two modules. B. SIMULATOR FIGURE 1. High level architecture of prospective. From Fig. 1, Prospective predicts the response time percentile for an operator-defined percentile value P. Prospective also takes as input a historical data repository D. Each record in the repository pertains to a transaction i. It includes the transaction’s type k, load Li , and its load-dependant response L time resi i . Li is an n-dimensional vector where the jth element in Li , Li [j], shows the number of transactions of type j 127906 Algorithm 1 describes Prospective’s simulator which takes as input the trace T, the desired percentile P, and the input parameters derived from the pre-processing phase namely, the overlap parameter O and the number of iterations parameter I . We defer the description of the pre-processing phase to Sec. II-C. The output of Prospective is resPpredicted , a prediction for the Pth percentile of response time. As shown in Fig. 2 (a), the simulator first generates a time line consisting of the arrival instants of all transactions in T arranged in chronological order (lines 2-7, Algorithm 1). It then starts processing these arrival events as described in the second loop of Algorithm 1. Specifically, processing each arrival instant involves calculating the response time of that transaction and hence the time instant at which the transaction completes, i.e., the departure time. The simulator uses the response time prediction module to compute the response time of that transaction using the historical repository D. As mentioned previously, to calculate the response time of any given transaction i the simulator requires the VOLUME 7, 2019 Y. Amannejad et al.: Prospective: Data-Driven Technique to Predict Web Service Response Time Percentiles Algorithm 1 Main Steps of Prospective P 1 Input: T, P, O, I , Output: respredicted foreach (transaction in T) do arrival ← transaction.arrivalTime 4 k ← transaction.type 5 event ← createEvent(k, arrival, null) 6 AddToTimeline(event) 7 end 8 foreach event in Timeline do 9 i←i+1 10 k ← event.transactionType 11 arrivali ← event.arrivalTime 12 Li ← calcStartLoad(i, arrivali ) L 13 presi i ← PredictResponseTime(i, k, Li ) L 14 departurei ← arrivali + presi i 15 foreach iteration in (1, I ) do 16 Li ← calcLoad(arrivali , departurei , O) L 17 presi i ← PredictResponseTime(i, k, Li ) L 18 departurei ← arrivali + presi i 19 end L 20 RESPs.add(presi i ) 21 end P 22 respredicted ← calculatePercentile(RESPs, P) 2 3 iterative process where the initial estimate of Li is refined taking into account the departures and arrivals that happen during the newly estimated execution time line of i (lines 16-20, Algorithm 1). We first refine Li using O. The parameter O allows the simulator to only include those transactions that have a significant overlap with the execution of transaction i. For any given transaction i and any other concurrent transaction j, the overlap Oi,j between them is described as the difference between the departure time of the predecessor transaction and the arrival time of the successor transaction expressed as a percentage of the response time of transaction i. The simulator only considers those transaction pairs whose Oi,j values are greater than or equal to O for computing the load Li . Fig. 2 (c) illustrates the use of O. In this example, transactions p1 and p2 arrive prior to transaction i. The simulator has already computed their departure times (solid black circles) and they are both predicted to end after the arrival of i. Transactions s1 and s2 arrive after i but their departure times have not been estimated yet. For this example, only p2 and s1 satisfy the acceptable overlap criterion specified by O. Consequently, the load vector Li is updated such that the number of transactions of the type to which p1 belongs is decremented by 1. This ensures that p1 does not influence the load. Similarly, the number of transactions of the type to which s1 belongs is incremented by 1. Transaction s2 is not involved in the load calculation since it did not meet the overlap criterion. The updated Li is used to calculate a revised departure time for i, as shown in Fig. 2 (d). This refinement process is repeated I times. A pre-processing step, outlined next, is used to select O and I values that can yield accurate predictions. C. PRE-PROCESSING FOR AUTOMATED PARAMETER SELECTION FIGURE 2. Example of load and response time estimation. load Li over the execution of that transaction. However, Li is unknown when the simulator processes the arrival instant. As a result, the simulator has to guess Li and iteratively refine that estimate. The simulator uses the known instantaneous load at the arrival instant of i as an initial estimate of Li . The instantaneous load is a known quantity since the simulator knows the arrival instants of all previous transactions and has also calculated the departure instants of those transactions. The simulator provides this Li estimate as well as the transaction’s type k as inputs to the response time prediction module and obtains presLi i as an initial response time estimate. This response time is then used to compute an initial departure time of transaction i. As shown in Fig. 2 (b), calculation of the departure time (denoted by a solid grey circle) allows the simulator to obtain a different estimate of Li that captures other concurrently executing transactions. This triggers an VOLUME 7, 2019 The basic idea in this step is to use the simulator to predict the measured response times of transactions in the historical trace data Trep used to construct the repository D. The simulator employs various O and I values and selects the values that most accurately predict the historically measured response time percentiles embodied by Trep . We now describe two parameter selection approaches namely, arrival independent and arrival dependent. The basic operation mode of Prospective, as depicted in Fig. 1, uses arrival independent parameter selection. With this approach, the simulator takes as input a workload trace Ttune . Each line of Ttune corresponds to a transaction recorded in Trep . It contains the arrival instant of the transaction and the transaction’s type. For the sake of simplicity, Prospective independently searches for the best O and I values. During the process for selecting the value of O, the value of I is set to 1. The simulator first conducts experiments using progressively increasing O values. For each of these experiments, we calculate over all percentiles of interest the mean of the prediction errors (defined in Sec. IV-D). We then 127907 Y. Amannejad et al.: Prospective: Data-Driven Technique to Predict Web Service Response Time Percentiles select the O value that results in the least error. The simulator then uses this value of O and performs experiments with progressively increasing values of I . For any given percentile of interest, we select the I value that yields the least prediction error. We also study Prospective with arrival dependent parameter selection where the choice of O and I depends on the mean transaction arrival rate λ. During parameter selection, we first split Trep into a set of sub-traces < T1rep , T2rep , . . . , Turep >, each containing the same number of transactions. For each sub-trace Tirep , we calculate the mean transaction arrival rate λi and select O and I using a full factorial parameter sweep over a pre-defined range of values for these parameters. The final result after considering all the sub-traces is a look up table τ that records the O and I values discovered for λi ∀i ∈ {1, . . . u}. During the simulation, when arrival dependent parameter selection is enabled, the simulator considers equal duration transaction windows in the input trace T. For each window w, it calculates the arrival rate λw . For any transaction i in this window, it provides λw to the response time prediction module in addition to the usual inputs, i.e., the load Li and the transaction’s type k. The response time prediction module uses the look up table τ to locate the entry whose arrival rate is closest to λw . It then uses the O and I values pertaining to that entry while predicting response times in that window w. Due to its arrival dependency and the use of full factorial parameter search, parameter selection can take a longer time using this approach. However, it could potentially improve prediction accuracy for workloads with highly variable transaction arrival patterns. We evaluate this in detail in Sec. V. transaction types k and j is defined by Eq. 1. P e∈Ekj S[k, j] = r P e∈Ekj (resek − res∗k ) × (resej − res∗j ) (resek − res∗k )2 × P e∈Ekj (resej − res∗j )2 (1) In Eq. 1, the similarity S[k, j] between transaction types k and j is calculated based on the mean response times of these two transaction types for all common loads, Ekj , they have experienced before. Common loads are loads at the Web service that both k and j have experienced in the past. resek shows the measured response time of transaction type k under one such load e ∈ Ekj . In Equation 1, res∗k and res∗j are the mean response times of transaction types k and j under all system loads recorded in D, respectively. As described shortly, the similarity value between each pair of transaction types is calculated and used within Eq. 2 to estimate the response time. We note that the similarity measures calculation need to be computed only once by processing the repository D. To predict the transaction’s response time using CF the module first generates a random number r using a uniform L distribution between 0 and 1. It then identifies Nk i as the set of other transaction types similar to k that have encountered L the load Li in D. For each similar transaction type j in Nk i , Li the module then constructs a list Hj that contains response times recorded for j under Li . Next, the module selects the r th L quantile of response time from this list as presj i . This process is depicted in lines 6-12 of Algorithm 2. The final step of the CF method is to aggregate all the L presj i values pertaining to the similar transaction types L D. RESPONSE TIME PREDICTION MODULE As shown in Algorithm 2, the response time prediction module takes as input the transaction type k of transaction i and the load Li . The module first looks up the repository D to check if any transactions of type k experienced the load Li when the historical data was collected. If there is a hit for the k and Li combination, then the module obtains a list L Hk i containing past transaction response times observed for this combination. Finally, a response time value is selected L at random from this list and returned as presi i , the response time prediction for transaction i (lines 2-4, Algorithm 2). If there is no hit for the desired transaction type and load combination, then a CF based method is used to predict response time. This involves finding similar transaction types that have experienced the load Li in D. The CF method relies on the computation of similarity measures between pairs of transaction types in the system. Two transaction types j and k are considered to be similar if their mean response times are similar to each other for all observed loads in the repository. Their similarity measure S[j, k] characterizes the extent of their similarity. We use the commonly used Pearson correlation coefficient (PCC) as our similarity metric. Using PCC, the similarity between two 127908 into a single response time prediction presi i for the transaction i. This aggregation uses the similarity measures L between k and the similar transaction types recorded in Nk i . Specifically, we use the aggregation approach proposed by Breese et al. [12] implemented as Eq. 2 in our context. P L (resj i − res∗j ) × S[k, j] L L resk i = res∗k + j∈Nk i P |S[k, j]| (2) L j∈Nk i When no transactions in D have experienced the load Li L the module simply returns the predicted response time presk i as the mean response time of all transactions of type k in the repository D (Lines 14-16, Algorithm 2). We refer to the percentage likelihood of this scenario as the miss rate which provides an estimate of the error in Prospective’s prediction. While the basic operation mode of Prospective distinguishes transactions based on transaction types, we also explore another approach where similar transaction types are further clustered into a smaller number of transaction groups. We refer to this as the clustered approach. With this approach, we use the well-known K-means clustering algorithm [13] during the pre-processing step. We use a simple method where the centroid of a cluster is the mean response time VOLUME 7, 2019 Y. Amannejad et al.: Prospective: Data-Driven Technique to Predict Web Service Response Time Percentiles Algorithm 2 Response Time Prediction L Input: D, i, k, Li , Output: presi i Li 2 H k ← query repository (k, Li ) Li 3 if (H k is not empty) then L L 4 presi i ← randomValue(Hk i ) 5 else L 6 Nk i ← query similar transactions(k, Li ) L 7 if (Nk i is not empty) then 8 r ← randomNumber() L 9 foreach j ∈ Nk i do L 10 Hj i ← query repository (j, Li ) 1 L L presj i ← getValue(r, Hj i ) 11 end L L presi i ← adjusted value of all presj i (Eq. 2) 12 13 else 14 L 15 end 16 17 TABLE 1. Experiment settings. presi i ← mean response time of k end of transaction types assigned to that cluster. In this manner, transaction types with similar mean response times get clustered to the same transaction group. With the clustered approach, the response time prediction module first maps the input transaction type k to its transaction group g. It also translates the input load vector Li into a vector that records the numbers of concurrent transactions belonging to each transaction group. The rest of the prediction process is similar to what is outlined in Algorithm 2 except that the module operates at the granularity of transaction groups as opposed to transaction types. The clustered approach could benefit scenarios where it is possible to come up with a more compact grouping of transactions than that suggested by the service’s URLs. We evaluate this feature in detail in Sec. V. III. USING PROSPECTIVE FOR HORIZONTAL SCALING In the previous sections, we describe how Prospective can be used for predicting the response times of transactions submitted to a single instance of a Web service. If a predicted response time percentile exceeds its corresponding threshold, the service operator would typically be interested in exploring horizontal scaling strategies that employ multiple server instances. With horizontal scaling, incoming transactions are distributed among a set of server instances based on a load balancing policy. To support evaluation of horizontal scaling strategies, we implement a load balancing module in Prospective that simulates the common load balancing policies, such as Weighted Round Robin (WRR) and the Least Connection (LC). The number of instances of each instance type is specified as an input to the simulation. With the WRR policy, each server instance m has associated with it a weight wm that determines the fraction of incoming transactions assigned to that instance. Before starting the VOLUME 7, 2019 simulation process described in Sec. II-B, Prospective uses the weights to assign each transaction in the input trace T to a specific server instance. As a result, each instance m gets assigned an input trace of transactions Tm . Next, Prospective simulates the behaviour of each instance using the traces generated in this manner. Specifically, for each instance m Prospective uses the historical repository pertaining to that instance to estimate the response times in Tm . Finally, response times estimated in this manner are collected from all instances to calculate the response time percentile of interest. The LC policy assigns an incoming transaction to the instance that is currently handling the least number of active transactions ν. Active transactions at an instance are those transactions that are currently being executed at the instance. Since the ν values depend on the response time behaviour of individual instances, it is more challenging to simulate LC than WRR. To realize LC, Prospective maintains separate transaction arrival time lines for the instances. Consider transaction i in T with an arrival instant of t. Prospective assigns this transaction to the arrival time line of the instance with the least ν at time t. We note that in contrast to the example outlined in Fig. 2, at this point Prospective has no knowledge about the transactions that would follow i. This is because future arrivals at the instance depend on instantaneous values of ν at the instances. Consequently, response time and completion time of transaction i is estimated based only on the transactions that arrived before i at this instance. This estimate as well as the response time estimates of the other active transactions at this instance are continually adjusted based on the transactions that arrive subsequently to the instance. Response time percentiles of interest are calculated after all transactions in T have been assigned to the instances in this manner. IV. EXPERIMENT SETUP A. EXPERIMENT TESTBED Table 1 provides details of our test environment. We use virtual machines (VM) running on a multicore server to host our Web servers. We use the httperf tool [14] for generating workloads. httperf is installed on a different multicore server. Both servers are connected by an Ethernet switch that provides dedicated 1 Gbps connectivity. Consequently, response times measured by httperf are a reliable indicator of the server’s performance. Finally, we implement Prospective using a set of Python scripts. 127909 Y. Amannejad et al.: Prospective: Data-Driven Technique to Predict Web Service Response Time Percentiles B. EXPERIMENT FACTORS Table 2 lists the experiment factors and their levels. Default levels are shown in bold. TABLE 2. Experiment factors. We run the RUBiS benchmark version 1.0 [6] as our Web service. RUBiS is a popular benchmark for multi-tiered applications and has been used extensively by others [7], [15]–[19]. It emulates the behavior of an auction server that allows users to browse and bid for items. The service supports 14 distinct transaction types. Examples of the supported transactions are browsing products, searching, and viewing bid histories. We use the Large VM instance by default. The target response time percentiles that we are interested in predicting are 25th , 50th , 75th , 95th , and 99th percentiles. The 95th and 99th percentiles are commonly used in SLAs. They capture the tail behaviour of the response time distribution and are challenging to predict. We study Prospective’s ability to predict the behaviour of different types of transaction arrival patterns. Fig. 3 shows the inter arrival time between successive transactions of a subset of these patterns. Patterns 1 to 3 use the exponential distribution to generate inter arrival times. Their mean inter arrival times are selected to cause low (30%), medium (50%), and high (70%) mean per-core utilization on a large VM instance of the RUBiS server. As shown in Fig. 3, patterns 4 to 7 emulate time varying arrivals, which are typical of many real Web services [20], [21]. Patterns 4 and 5 use exponential distributions with different means to achieve the time varying behaviour. In pattern 4, the mean inter arrival times are chosen such that the RUBiS server’s mean per-core utilization increases from 10% to 80%, and then decreases from 80%, to 10%. In pattern 5, the mean per-core utilization decreases from 80% to 10% and then increases from 10% to 80%. By alternating between very low and very high utilizations, these patterns mimic bursty behaviour. Inter arrival times of pattern 6 and pattern 7 do not follow the exponential distribution. Pattern 6 causes mean per-core utilization to gradually vary between 10% to 65%. Pattern 7 utilizations vary similarly but the variations are more rapid. Patterns 8 and 9 are used to study the impact of distributions with lower and higher variability than the exponential distribution. While pattern 8 has half the coefficient of 127910 variation (CV) of exponential, pattern 9 has two times the CV of exponential. For these patterns we use the same mean inter arrival time as in the default Medium − Exp pattern, i.e., Pattern 2, to facilitate comparisons. To demonstrate the ability of Prospective to support system sizing exercises, we consider two additional patterns. Patterns 10 and 11, VHigh − Exp and XHigh − Exp, are exponential patterns (not shown in Fig. 3) that cause very high and extreme utilizations, respectively. VHigh − Exp is the maximum load that can be sustained on the large instance. This in turn leads to extremely long response times on the service. The mean arrival rate of XHigh − Exp is almost twice that of VHigh − Exp and hence this pattern cannot be handled using one VM instance alone. Results involving these patterns are discussed in Sec. VI. We also study the sensitivity of Prospective to transaction mix. We use three different mixes that differ in the percentages of static transactions (S), i.e., transactions that do not require dynamically generated content, and dynamic transactions (D). Mix40S − 60D is the default browsing mix of RUBiS. It contains 40% static transactions and 60% dynamic transactions. This mix is used in creating the repository. Mix60S − 40D contains 60% static and 40% dynamic transactions. Mix100S − 0D contains only static transactions. We evaluate Prospective under two different modes. Under the basic mode, Prospective offers predictions at the granularity of transaction types and uses arrival independent selection of O an I . With the enhanced mode, it clusters transaction types into transaction groups and also uses arrival dependent parameter selection. In this mode, we consider 5 second transaction windows. As described in Sec. II, we compute λw for each such window and use that in combination with the look up table τ to obtain the O and I values to use for that window. Several different baseline response time estimation strategies are compared. The Avg-Tr approach ignores the dependency of transaction response times on the load vector. It estimates the response time of a transaction by computing the mean of the response times recorded for that transaction’s type in the historical repository. The Avg-Load approach considers load but ignores transaction type. For each transaction, it estimates the load, i.e., total number of concurrent transactions. It then obtains the prediction as the mean response time of all transactions, regardless of transaction type, that have experienced the estimated load. We note that this approach is similar to the approach used by Spicuglia et al. [22]. The Avg-Tr-Load approach considers both load and transaction type similar to Prospective. However, unlike Prospective it merely estimates the response time of a transaction as the mean response time of all records with the desired load and transaction type. We note that these baseline methods all use arrival independent parameter selection. We also compare Prospective with two different quantile regression approaches. The Reg-CPU approach suggested by Watson et al. [7] relates the response time percentile to the mean per-core utilization U of the server, as shown by VOLUME 7, 2019 Y. Amannejad et al.: Prospective: Data-Driven Technique to Predict Web Service Response Time Percentiles FIGURE 3. Arrival patterns used in experiments. the performance function in Eq. 3. The Reg-Rate approach used by Bodik et al. [8] is similar but uses mean transaction arrival rate λ as the independent variable, as shown in Eq. 4. To facilitate a fair comparison, we use the same historical data contained in Prospective’s repository to train and obtain the β coefficients in the models. The historical data is divided into 1-second windows. Each window provides training observations that relate the dependent variable, i.e., response time of a transaction in that window, to the independent variable, i.e., either U or λ of the window. Quantile regression uses the training data to obtain different sets of β coefficients for the different percentiles of interest. resPpredicted (U ) = β0 + β1 × U + exp(β2 + β3 × U ) (3) resPpredicted (λ) = β0 + β1 × λ + exp(β2 + β3 × λ) (4) C. EXPERIMENT PROCESS Before starting the evaluation, we need to create a historical repository. Our intent is to create a repository that covers a diversity of transaction types and per-core utilizations. To create such a repository, we use the default mix, Mix40S − 60D, that contains all the RUBiS transaction types. We then submit to the service traces of transactions conforming to this mix and having exponential inter arrival time distributions. The mean inter arrival times are selected to generate various per-core utilization levels (10% < per-core utilization < 90% VOLUME 7, 2019 in steps of 5%). For each utilization level, we repeat the experiment using 5 different samples drawn from the relevant exponential distribution. Each run contains 9, 000 transactions. As mentioned previously, a transaction trace Trep is collected from system logs under these experiments and is used for creating the repository. After creating the repository, the pre-processing steps discussed previously are carried out. For the arrival independent selection of O and I , we vary O in steps of 10% from 10% to 90% and I in steps of 1 from 1 to 3. Since we search for O and I independently, this translates to 11 simulation experiments to identify these parameters. For the arrival dependent selection, we split the trace used in the arrival independent approach into 9 smaller sub-traces having equal number of transactions. For each sub-trace, we use the same ranges for O and I as in the arrival independent selection. However, since we employ a full factorial parameter sweep, parameter selection involves 27 experiments for each sub-trace. Accounting for the smaller traces, the arrival dependent approach needs approximately 2.5 times more experiments than the arrival independent approach for deducing the values of O and I . For each measurement and simulation experiment, we use multiple statistically similar samples from the relevant arrival model and transaction mix. Each of them involves a trace of 9, 000 transactions. We calculate the mean and the 95% CI for both the measured and predicted response time percentiles by replicating the experiments. 127911 Y. Amannejad et al.: Prospective: Data-Driven Technique to Predict Web Service Response Time Percentiles D. EVALUATION METRIC To evaluate the accuracy of Prospective, we use the absolute relative error of the predicted response time percentile values as shown in Eq. 5. errp = | resPpredicted − resPmeasured resPmeasured | × 100 (5) In Eq. 5, P represents the desired percentile level and errP is the error for the Pth percentile response time prediction. resPpredicted and resPmeasured are the mean of the predicted and measured Pth percentile values. V. RESULTS In this section, we report the result of experiments designed to validate Prospective. A. PRE-PROCESSING We first focus on selecting the O and I values. Fig. 4 shows the selection of the overlap parameter O for three different sub-traces with progressively increasing arrival rates namely, Low, Medium, and High. We only show the 95th percentile with I set to 1. Similar trends are observed for other cases. FIGURE 4. Selecting the overlap value. From Fig. 4, O values around 50%, 40%, and 30% yield the lowest errors for Low, Medium, and High, respectively. For any given arrival rate, very low values of O cause transactions that have very low overlap to be counted towards the load leading to overly pessimistic response time estimates. Very high values ignore the effects of some transactions that have quite a bit of overlap leading to optimistic estimates. Furthermore, Fig. 4 shows that higher arrival rates benefit from lower values of O. This suggests that a transaction’s response time at higher loads is sensitive to even transactions that only have a small overlap. These results motivate the arrival dependent selection of O for scenarios experiencing high arrival rates. Results suggest that the tuning of I is specific for each percentile of interest. In general the value of I needs to be greater than 1 for the extreme tail percentiles, e.g., the 99th percentile, for better accuracy. We omit presenting detailed results for the sake of brevity. Finally, we focus on the clustered approach where Prospective assigns transaction types to transaction groups using K-means clustering. As discussed previously, the selection 127912 of the number of clusters, i.e., transaction groups, and the clustering is done automatically in the pre-processing step. In our case, when increasing the number of clusters the mean distance of a point from its cluster centroid decreases and stabilizes with 5 clusters. As a result, we group RUBiS transaction types into 5 transaction groups. B. COMPARISON WITH BASELINE AND REGRESSION APPROACHES In this section, we compare the baseline estimation approaches and the state-of-the-art quantile regression techniques introduced in Sec. IV-B with Prospective. We use the basic operation mode of Prospective, which uses arrival independent O and I and operates at the granularity of transaction types. We consider the default levels for the experiment factors as shown in Table 2 and report mean prediction errors of all percentile values. Avg-Tr, which ignores load, produces a large mean error of 39.8%. Avg-Load, which considers load but ignores transaction type, also does poorly with a mean error of 72.9%. In contrast, Prospective yields a mean error of 4.6%. These results establish the importance of incorporating both load and transaction type into the prediction process. Similar to Prospective, Avg-Tr-Load considers both transaction type and load. Recalling from Sec. IV-B, Avg-Tr-Load uses mean of the response times recorded for a transaction type and load combination. In contrast, Prospective randomly draws from a list of historical response times recorded for that combination. Fig. 5 compares their predicted response time distributions. From the figure, the averaging done by AvgTr-Load causes probability masses to bunch together over a few values causing the distribution to deviate significantly from the actual measured distribution. Prospective avoids this problem and follows the measured distribution more closely. FIGURE 5. Comparison of response time distributions. Next, we compare the performance of Prospective with the two quantile regression approaches, Reg − CPU and Reg − Rate. Table 3 shows the prediction errors of the three approaches when we vary the mix while keeping other factors at their default levels. From the table, all three techniques perform comparably when predicting for the default mix, i.e., Mix − 40S − 60D. Recalling from Sec. IV, this is the mix used to build the historical repository and train the regression VOLUME 7, 2019 Y. Amannejad et al.: Prospective: Data-Driven Technique to Predict Web Service Response Time Percentiles TABLE 3. Mean prediction errors of regression models. models. From Table 3, as expected, the accuracies of the regression techniques suffer significantly while predicting for the other two mixes, i.e., Mix − 60S − 40D and Mix − 100S − 0D. This is because these regression techniques do not consider the mix in their models. Reg − CPU approach performs better than Reg − Rate, as CPU utilization can capture the change of mix better than arrival rate. However, Reg − CPU requires a method to predict the CPU utilization of the workload for which prediction is desired. In contrast to the regression techniques, Prospective provides accurate estimates for these non-default mixes. These results indicate Prospective’s ability to generalize well for workloads not directly captured by the historical repository. We explore the impact of mix in more detail in Sec. V-D. We also compare the sensitivity of the three approaches to arrival patterns different than the one used to build the repository. Recalling from Sec. IV, the repository is built by using a set of time varying exponential transaction inter-arrival time distributions. We use the three techniques to offer predictions for the four non-exponential workloads, i.e., Patterns 6-9, and default levels for other factors. For these four patterns, Reg − CPU and Reg − Rate perform comparably and result in mean errors of 23% and 25%, respectively. This is about 3 times higher than the mean prediction error of 7.2% that we obtain with Prospective for these four patterns. This result further corroborates the ability of Prospective to generalize better than the regression approaches. The accuracy of the regression approaches could be improved by exploring other choices for performance functions, regression techniques, and featurization, e.g., independent variables that explicitly capture the mix and arrival pattern. Prospective offers an alternative that does not require an operator to spend time investigating such choices. C. EFFECT OF ARRIVAL PATTERNS Table 4 shows Prospective’s percentile prediction errors and their mean for the different arrival patterns in the basic mode. Patterns 1 to 5 use either exponential or time varying exponential distributions. While the repository is constructed using exponential distributions, the exact sequence and inter-arrival times of transactions in the repository are different from those of patterns 1 to 5. For example, unlike Patterns 4 and 5, the repository does not encounter situations where the system switches between very high and very low utilizations. Results show that Prospective yields very good accuracy for these patterns with the maximum mean error being 13.2%. Among patterns 1 to 5, the worst errors occur for cases where the system encounters very high per-core utilizations. VOLUME 7, 2019 For example, the High−Exp pattern imposes a mean per-core utilization of 70% with peak utilizations reaching up to 90%. The prediction errors for the 95th and 99th percentiles for this case are 18.5% and 30.1%, respectively. Similarly, the Low − High and High − Low represent bursty workload patterns where the mean per-core utilization can reach an extremely high value of 80%. The mean prediction errors for the 95th and 99th percentiles for these two case are 15.0% and 23.5%, respectively. We notice that at very high utilizations there is more diversity in the number of unique load vectors that can be encountered by the system. As a result, there is a higher likelihood that Prospective will have a miss, i.e., not find a match for a transaction type and load combination in the repository either by using historical response times directly or by applying the CF method. As discussed previously, Prospective merely uses the mean historical response time for the relevant transaction type as its prediction when there is a miss. This leads to optimistic predictions for the 95th and 99th percentiles. TABLE 4. Prediction errors - basic prospective. Table 4 also shows that Prospective is very accurate for the Periodic − Gradual and Periodic − Rapid patterns, which have non-exponential, time varying arrivals. These results indicate that Prospective can offer predictions for inter arrival time distributions different than that governing the repository. Specifically, it can provide good predictions as long as there is a good match in the repository for the transaction type and load combinations. Next, we focus on the impact of the variability of the inter-arrival time distribution. From Table 4, Prospective is very accurate for the HypoExp pattern, which imposes the same per-core utilization as Medium − Exp but has half the CV of exponential. The HyperExp pattern has twice the CV as exponential and hence triggers phases with intense transaction arrivals where the per-core utilization of the system is very high. As a result, Prospective is less accurate for this case than the Medium − Exp and HypoExp cases. However, the mean error is still only 10.2% for this case. Summarizing, the basic operation mode of Prospective is very accurate for cases that correspond to the utilization ranges of real life servers [23]. Errors increase for scenarios with very high utilizations. We next show how such high utilization scenarios can benefit from the enhanced operation mode of Prospective, i.e., using arrival dependent O and I and operating at the granularity of transaction groups. 127913 Y. Amannejad et al.: Prospective: Data-Driven Technique to Predict Web Service Response Time Percentiles TABLE 5. Prediction errors - enhanced prospective. Table 5 compares Prospective’s accuracy for the different arrival patterns while operating in the enhanced mode. In general, Prospective is even more accurate under the enhanced mode. The mean error over all patterns and all percentiles decreases from 8.4% to 5.5%. The baseline operation of Prospective has mean errors of 11.0% and 14.2% for the 95th and 99th percentiles, respectively. In contrast, the enhanced mode reduces the mean errors for the 95th and 99th percentiles to 6.1% and 7.2%, respectively. From Table 5, the enhanced mode is particularly effective for scenarios that cause high per-core utilizations. For the High−Exp pattern, the errors for the 95th and 99th percentiles decrease from 18.5% and 30.1% to 8.1% and 5.5%, respectively. Similar improvements are also seen for the extremely bursty Low − High and High − Low patterns as well as the highly variable HyperExp pattern. The reason for the effectiveness of the enhanced mode at high utilizations is two fold. First, as noted previously, at high utilizations there is a greater diversity in the number of possible load vectors seen by the system. Clustering reduces this diversity by grouping similar transaction types into the same transaction group. This decreases the number of misses thereby improving accuracy. Second, as shown in Sec. V-A, high load scenarios benefit from smaller values of O when compared to low load scenarios. By adapting the choice of O to load, i.e., transaction arrival rate, the enhanced mode is able to further improve accuracy. For the remainder of the paper we use the enhanced mode of Prospective. D. EFFECT OF TRANSACTION MIX, CF, AND REPOSITORY EVOLUTION To study Prospective’s sensitivity to mix, we submit each of the mixes shown in Table 2 while using the default Medium − Exp arrival pattern. The mixes have very different response time behaviours. For example the 99th percentile of the response time with Mix40S − 60D is 10 times larger than the response time of the Mix100S − 0D. Table 6 shows the prediction errors. As expected, Prospective offers the most accurate predictions for the default Mix40S − 60D, which also corresponds to the mix of the historical repository. Predictions degrade slightly for the non-default mixes Mix60S −40D and Mix100S −0D but are still quite accurate. Errors correlate well with the miss rates seen by the simulator. The miss rates for Mix60S − 40D and Mix100S − 0D are 127914 TABLE 6. Prediction errors(%) for different mixes. 12% and 4%, respectively. This suggests that the miss rate reported by Prospective can be used by the operator to assess predictive accuracy. We explore Mix60S − 40D further to show the value of incorporating the CF method into Prospective. Disabling CF doubles the miss rate. As a result, the prediction error for the 95th percentile and the mean error increase from 8.3% and 7.5% to 16.0% and 14.1%, respectively. We now study the impact of evolving the repository to include a new mix. Prospective can continually expand its repository to include new measurements, e.g., data from the live system. To verify this, we expose the system to a workload that uses Mix60S − 40D. We use a time varying exponential inter arrival time distribution similar to the process described in Sec. IV-C. Response time information from this workload is then integrated into the repository. We then conduct a simulation to predict for Mix60S − 40D with the default Medium − Exp arrival pattern. Due to the expanded repository, the error for the 95th percentile and the mean error decrease from 8.3% and 7.5% to 5.7% and 6.0%, respectively. These prediction errors are close to the error of the default mix, Mix40S −60D, and shows that Prospective is robust and can continually refine its predictions based on new data. The expansion of the repository happens in the same way when new arrival patterns are observed. E. SENSITIVITY TO REPOSITORY CONTENT In this experiment, we study the sensitivity of Prospective to the contents of the repository in more detail. For this purpose, we compare the prediction accuracy for the Low − Exp, Medium − Exp, and High − Exp patterns when the repository contains different sets of historical data. Recalling from Sec. IV, these workloads cause per-core utilizations of 30%, 50%, and 70%, respectively. All other factors are set at their default levels. For this experiment, as shown in Table 7, we use 7 different controlled sets of traces to populate the repository. All traces are generated using exponential inter-arrival times and the default mix. The sets differ in the range of per-core utilizations they cause on the system. For example, C3 causes two distinct periods of activity with mean per-core utilizations of 10% and 90%. C7 is the default trace used to generate the repository for the experiments presented in the previous sections. For the various repositories we consider, Fig. 6a and Fig. 6b show the miss rate and mean prediction error, respectively. From the figures, the miss rate and prediction error are high with C1 when the repository only contains data for low per-core utilization. Interestingly, miss rate decreases and accuracy improves for all three workloads for C2 even though VOLUME 7, 2019 Y. Amannejad et al.: Prospective: Data-Driven Technique to Predict Web Service Response Time Percentiles TABLE 7. Repository content. causes Prospective to generate a highly variable progression of load vectors thereby helping to capture the impact of the variability. Similarly, a mix that favours one particular transaction type will cause Prospective to generate a progression of load vectors where that transaction type dominates. Prospective is a data-driven technique and therefore, similar to any other data-driven technique depends on the quality of the historical repository. Data-driven techniques are usually sensitive to the quality of their data. The superiority of Prospective is in mitigating this sensitivity through the CF technique. Where possible, Prospective exploits existing historical response time measurements for the transaction type and load vector combinations generated during simulation. For example, while predicting for Mix60S − 40D in Sec. V-D, 76% of the combinations generated by Prospective during the simulation are already present in the repository even though the repository is generated using a different mix. For combinations not observed previously, it exploits a novel CF based method to estimate response times. Considering the same Mix60S − 40D example from Sec. V-D, our CF method allowed predictions to be generated for approximately half of the 24% of combinations not present in the repository. The main reasons for the superiority of the technique can be summarized as follows: Prospective’s response time prediction process considers the transaction type. This improves accuracy since different transaction types can stress system resources differently. • The load vector used during prediction records the numbers and types of concurrent transactions observed in the system. The mix of the incoming workload can affect the mix of the transactions observed in the population vector. Therefore, it can capture the impact of the workload mix. • The population vectors can also capture the arrival rate of transactions. The higher the transaction arrival rate is, there is more chance to observe vectors with a higher number of transactions in them. • Moreover, the population vectors can capture the effect of arrival patterns. For example, a high variability in a bursty workload is reflected as higher variability in the vectors experienced by the system. • The use of the population vectors in conjunction with CF, helps our system to offer predictions for a given workload even if that workload has not been explicitly observed in the past. • FIGURE 6. Results with different repository content. the repository only contains data pertaining to the extreme 90% per-core utilization. As stated previously, the number of unique load vectors increases with utilization. As a result, C2 yields more load vectors than C1 thereby helping Prospective. From Fig. 6a and Fig. 6b, miss rate decreases and accuracy improves significantly with C3, which captures behaviour under both low, i.e., 10%, and extreme, i.e., 90%, per-core utilizations. Covering a wide range of utilizations yields a diverse set of load vectors to the repository which in turn helps improve accuracy. F. DISCUSSION From the results, Prospective significantly outperforms other baseline techniques. In particular, unlike the existing quantile regression-based approaches, Prospective is able to offer accurate predictions even for arrival patterns and mixes not directly captured in the repository. The use of load vectors in combination with CF helps offer accurate predictions for a given arrival pattern and mix even if that pattern and mix have not been explicitly observed in the past. Specifically, Prospective modulates the progression of load vectors during a simulation to implicitly consider the impact of both the arrival pattern and the mix. For example, a synthetic input trace with a highly variable arrival pattern VOLUME 7, 2019 VI. USING PROSPECTIVE FOR SYSTEM SIZING We now demonstrate how Prospective can be used for horizontal scaling to mitigate violation of a response time percentile target. Specifically, we show how one can estimate the number and type of server instances with minimum cost to achieve a target response time percentile. This feature can enable self-organization in could services. 127915 Y. Amannejad et al.: Prospective: Data-Driven Technique to Predict Web Service Response Time Percentiles We consider two different types of VM instances for the RUBiS service namely, Large (L) and Small (S). We assume that the cost of acquiring one Large instance is greater than the cost of acquiring two Small instances and lower than the cost of three Small instances. Given this assumption, Prospective searches for the system configuration with a minimum cost that meets the SLA requirements. Accordingly, we create a repository for Large and another repository for Small using the methodology described in Sec. IV-C. We use an iterative sizing approach to determine a horizontal scaling strategy. We start by simulating the service deployed on a single Large instance. If there is a violation, we explore the addition of another instance. We define this as scaling step. At each scaling step, we explore first the addition of a Small instance. If simulation of this configuration suggests that the violation persists, we remove the Small instance from the configuration and explore instead the addition of a Large instance. If the violation still persists even with the addition of the Large instance, we initiate one more scaling step to provision an additional instance. The process is repeated until Prospective indicates no violation. We study both the WRR and LC load balancing policies. From Table 1, the Large instance has twice the number of cores and memory as the Small instance. Consequently, a Large instance in a configuration will handle twice the number of transactions as the Small instance in the configuration. Our goal is to realize a system that meets a 95th percentile response time target of 10 ms. We consider three workloads namely, High − Exp, VHigh − Exp, and XHigh − Exp. Both High − Exp and VHigh − Exp violate the target while using a single Large instance. As mentioned in Sec. IV, XHigh − Exp cannot be sustained by a single Large instance alone. Fig. 7 shows the configurations suggested by Prospective while using WRR. Prospective indicates that the 95th percentile for the High − Exp workload is just under the target of 10.0 ms when using a configuration with one instance each of Large and Small, i.e., L-S in Fig. 7. The VHigh − Exp and XHigh − Exp workloads require 2 and 3 Large instances, respectively. Note that in Fig. 7, the measured and predicted values for XHigh − Exp are shown only for configurations that can sustain this workload. To validate Prospective’s predictions, we measure the performance of all configurations evaluated by Prospective for FIGURE 7. 95th percentiles of WRR for different configurations. 127916 FIGURE 8. Response times under WRR and LC. all three workloads. We use the HAProxy [24] load balancer configured with WRR. From Fig. 7, the 95th percentiles obtained from Prospective are close to the measured percentiles for every configuration explored. Measurements show that the configurations suggested by Prospective are the configurations with the minimum cost that will not violate the target 95th percentile. We repeated the same experiments with the LC policy. As with WRR, predictions closely match measurements. Detailed analysis also reveals that LC yields lower measured response times than WRR and that Prospective is able to capture this behaviour. As an example, we refer to Fig. 8, which focuses on the predicted and measured 95th percentile response times of both policies for the High − Exp workload under the L − S configuration. From the figure, both the measured and predicted response times with LC are lower than their corresponding values with WRR. This shows that Prospective is able to successfully predict the measured behaviour of improved response times with LC. Overall, Prospective has comparable response time prediction accuracies for WRR and LC with the accuracy of LC being slightly lower due to its higher complexity. Over the scaling steps shown in Fig. 7, predictions under LC have only 1.9% higher error on average than those under WRR. VII. RELATED WORK A large body of work has focused on the problem of selecting services that meet end user response time requirements [25]–[34]. Such selection approaches can be used to recommend Web services [25]–[31], automate the service composition [32] and runtime management [33], [35] of complex service chains to meet cost and response time requirements. In contrast to our work, these studies do not focus on predicting the impact of workload characteristics and system configurations on Web service performance. Several studies have proposed novel instrumentation techniques to characterize the performance of Web services [36], [37]. Li et al. propose WebProphet to predict the page load time percentiles of popular services such as Google Maps [36]. Jiang et al. develop WebPerf to estimate a cloud-based Web service’s response time distribution under various cloud configurations and runtime conditions [37]. Prospective differs from WebProphet and WebPerf in that it VOLUME 7, 2019 Y. Amannejad et al.: Prospective: Data-Driven Technique to Predict Web Service Response Time Percentiles does not require specialized binary instrumentation to offer performance predictions. Analytic queuing modeling techniques have been frequently used in literature for performance prediction [1], [38], [38]. With this approach, key workload characteristics and system configuration parameters are reflected into a Queuing Network Model (QNM) [39]. The QNM is often solved using Mean Value Analysis (MVA) [39]. While MVA has been used extensively, it is not intended for predicting response time percentiles. Modeling techniques that are considerably more complex than MVA have been proposed for predicting response time percentiles [40]–[46]. The models involved in these techniques require considerable effort and expertise to construct. Furthermore, in general, techniques based on analytic queuing models often make simplifying assumptions. These assumptions impact accuracy and limit applicability [47]. Queuing modeling techniques require an operator to manually construct and validate a model. Typically, the process of model building and validation is iterative. This process is time consuming and requires experts in performance modeling. To address this problem, several researchers have proposed techniques to automatically build queuing models based on historical system response time measurements [48], [49]. However, the models resulting from these techniques are capable of only predicting mean response times. Queuing simulations are used in situations where it is difficult to abstract a system using an analytic model. Spicuglia et al. propose an automated simulation approach that does not require explicit model construction to predict response time percentiles of cloud applications [22]. Similar to our work, the authors uses results from controlled load tests to drive a simulation that estimates transaction response times. The technique presented in their paper can only handle a single transaction type. As we show in Sec. V, not distinguishing between transaction types can cause significant prediction errors. Machine learning provides an alternative to analytic and simulation based queuing modeling techniques. Techniques such as regression analysis [7], [8], [50], [51], Support Vector Machines (SVM) [15], [52], Artificial Neural Networks (ANN) [15], and Bayesian networks [53] have been used in literature for performance prediction. Watson et al. use quantile regression to relate percentiles of response time metrics as a function of system resource utilizations [7]. Bodik et al. also use quantile regression to predict the 99th percentile of response times of a Web application as a function of the mean transaction arrival rate [8]. Uttamchandani et al. propose a black box performance modeling technique for storage systems that exploits SVM regression [52]. These regression approaches require an expert to identify appropriate workload parameters, e.g., resource utilization and mean transaction arrival rate, as part of a featurization process. Furthermore, one has to specify an appropriate performance function, e.g., a kernel in the case VOLUME 7, 2019 of SVM, that captures how the response time percentile of interest relates to the workload parameters. Several researchers have used regression approaches that do not require the specification of a performance function [50], [51]. For example, Wang et al. use the Classification And Regression Trees (CART) technique to predict the mean and 90th percentile response times of a storage system [50]. The authors show that the accuracy of CART models is acceptable when the workload for which prediction is offered is similar to the workload used to train the model. Kundu et al. explore the use of an ANN to predict the response time percentiles of a virtualized RUBiS Web service as a function of the hardware resources assigned to that service [15]. The authors show that splitting the feature space into different regions and building a sub-model for each region yields better accuracy than using a single global model. However, needs excessive training data. Furthermore, they state that careful tuning of the ANN, e.g., selecting an appropriate activation function, is required for obtaining good prediction accuracy. In summary, applying machine learning requires operator intervention for tasks that can have a big impact on accuracy such as selecting an appropriate learning technique, featurization, performance function selection, and model structure tuning. This motivates the need for approaches such as Prospective that do not burden the operator with such tasks. VIII. CONCLUSION AND FUTURE WORK In this paper we address the key requirement of building a prediction tool that does not require a Web service operator to perform time consuming tasks such as model building, featurization, performance function selection, and model structure tuning. Prospective applies a novel technique based on CF to predict workload scenarios not directly observed in the historical data. The use of load vectors in combination with CF helps offer accurate predictions for a given arrival pattern and mix even if that pattern and mix have not been explicitly observed in the past. Specifically, Prospective modulates the progression of load vectors during a simulation to implicitly consider the impact of both the arrival pattern and the mix. Results show that Prospective outperforms other baseline techniques. It is effective for a wide variety of workloads including those that exhibit highly variable transaction arrival patterns. The technique is robust and is able to improve its predictions continually based on new data. Prospective also reports the miss rate metric, which helps an operator assess the accuracy of a prediction. We show how the load balancing module of Prospective can be leveraged in system sizing exercises to determine the numbers and types of server instances needed for meeting a response time percentile requirement. In future, our research will be complemented along several dimensions. Future research will validate Prospective for other types of applications. We will also explore different ways of featurizing arrival pattern and mix and compare the accuracy of these alternatives. 127917 Y. Amannejad et al.: Prospective: Data-Driven Technique to Predict Web Service Response Time Percentiles REFERENCES [1] D. Krishnamurthy, J. Rolia, and M. Xu, ‘‘WAM—The weighted average method for predicting the performance of systems with bursts of customer sessions,’’ IEEE Trans. Softw. Eng., vol. 37, no. 5, pp. 718–735, Sep./Oct. 2011. [2] G. Casale, N. Mi, L. Cherkasova, and E. Smirni, ‘‘Dealing with burstiness in multi-tier applications: Models and their parameterization,’’ IEEE Trans. Softw. Eng., vol. 38, no. 5, pp. 1040–1053, Sep. 2012. [3] N. R. Draper, H. Smith, and E. Pownell, Applied Regression Analysis, vol. 3. New York, NY, USA: Wiley, 1966. [4] R. Koenker, Quantile Regression, no. 38. Cambridge, U.K.: Cambridge Univ. Press, 2005. [5] F. Ricci, L. Rokach, and B. Shapira, Introduction to Recommender Systems Handbook. Boston, MA, USA: Springer, 2011. [6] RUBiS: Rice University Bidding System. Accessed: Aug. 1, 2019. [Online]. Available: http://rubis.ow2.org/ [7] B. J. Watson, M. Marwah, D. Gmach, Y. Chen, M. Arlitt, and Z. Wang, ‘‘Probabilistic performance modeling of virtualized resource allocation,’’ in Proc. 7th Int. Conf. Auton. Comput., 2010, pp. 99–108. [8] P. Bodık, C. Sutton, A. Fox, D. Patterson, and M. Jordan, ‘‘Response-time modeling for resource allocation and energy-informed SLAs,’’ in Proc. Workshop Stat. Learn. Techn. Solving Syst. Problems (MLSys), Whistler, BC, Canada, 2007, pp. 1–4. [9] Y. Amannejad, D. Krishnamurthy, and B. Far, ‘‘Predicting Web service response time percentiles,’’ in Proc. Int. Conf. Netw. Service Manage. (CNSM), Oct./Nov. 2016, pp. 73–81. [10] R. N. Calheiros, E. Masoumi, R. Ranjan, and R. Buyya, ‘‘Workload prediction using ARIMA model and its impact on cloud applications’ QoS,’’ IEEE Trans. Cloud Comput., vol. 3, no. 4, pp. 449–458, Aug. 2015. [11] V. Debusschere and S. Bacha, ‘‘Hourly server workload forecasting up to 168 hours ahead using seasonal ARIMA model,’’ in Proc. IEEE Int. Conf. Ind. Technol., Mar. 2012, pp. 1127–1131. [12] J. S. Breese, D. Heckerman, and C. Kadie, ‘‘Empirical analysis of predictive algorithms for collaborative filtering,’’ in Proc. 14th Conf. Uncertainty Artif. Intell., 1998, pp. 43–52. [13] J. MacQueen, ‘‘Some methods for classification and analysis of multivariate observations,’’ in Proc. 5th Berkeley Symp. Math. Statist. Probab., Oakland, CA, USA, vol. 1, 1967, pp. 281–297. [14] D. Mosberger and T. Jin, ‘‘Httperf—A tool for measuring Web server performance,’’ ACM SIGMETRICS Perform. Eval. Rev., vol. 26, no. 3, pp. 31–37, 1998. [15] S. Kundu, R. Rangaswami, A. Gulati, M. Zhao, and K. Dutta, ‘‘Modeling virtualized applications using machine learning techniques,’’ ACM SIGPLAN Notices, vol. 47, no. 7, pp. 3–14, Mar. 2012. [16] A. Bahga and V. K. Madisetti, ‘‘Synthetic workload generation for cloud computing applications,’’ J. Softw. Eng. Appl., vol. 4, no. 7, p. 396, 2011. [17] Z. Gong, X. Gu, and J. Wilkes, ‘‘PRESS: Predictive elastic resource scaling for cloud systems,’’ in Proc. Int. Conf. Netw. Service Manage. (CNSM), Oct. 2010, pp. 9–16. [18] P. Padala, X. Zhu, Z. Wang, S. Singhal, and K. G. Shin, ‘‘Performance evaluation of virtualization technologies for server consolidation,’’ HP Labs, Palo Alto, CA, USA, Tech. Rep. HPL-2007-59R1, 2007, p. 137. [19] W. Zhang, Y. Shi, Y. Zheng, L. Liu, and L. Cui, ‘‘Resource and performance prediction at high utilization for N-tier cloud-based service systems,’’ in Proc. Australas. Comput. Sci. Week Multiconf., 2017, p. 43. [20] V. Almeida, M. F. Arlitt, and J. Rolia, ‘‘Analyzing a Web-based system’s performance measures at multiple time scales,’’ ACM SIGMETRICS Perform. Eval. Rev., vol. 30, pp. 3–9, Sep. 2002. [21] N. D. Mickulicz, P. Narasimhan, and R. Gandhi, ‘‘To auto scale or not to auto scale,’’ in Proc. 10th Int. Conf. Auton. Comput. (ICAC), 2013, pp. 145–151. [Online]. Available: https://www.usenix.org/conference/ icac13/technical-sessions/presentation/mickulicz [22] S. Spicuglia, M. Björkqvist, L. Y. Chen, and W. Binder, ‘‘Catching the response time tail in the cloud,’’ in Proc. Int. Symp. Integr. Netw. Manage. (IM), May 2015, pp. 572–577. [23] C. Delimitrou and C. Kozyrakis, ‘‘Quasar: Resource-efficient and QoSaware cluster management,’’ ACM SIGPLAN Notices, vol. 49, no. 4, pp. 127–144, 2014. [24] HAProxy: The Reliable, High Performance TCP/HTTP Load Balancer. [Online]. Available: http://www.haproxy.org/ [25] Z. Zheng, H. Ma, M. R. Lyu, and I. King, ‘‘QoS-aware Web service recommendation by collaborative filtering,’’ IEEE Trans. Services Comput., vol. 4, no. 2, pp. 140–152, Apr./Jun. 2011. 127918 [26] Z. Zheng, H. Ma, M. R. Lyu, and I. King, ‘‘Collaborative Web service QoS prediction via neighborhood integrated matrix factorization,’’ IEEE Trans. Services Comput., vol. 6, no. 3, pp. 289–299, Jan. 2013. [27] Y. Hu, Q. Peng, X. Hu, and R. Yang, ‘‘Time aware and data sparsity tolerant Web service recommendation based on improved collaborative filtering,’’ IEEE Trans. Services Comput., vol. 8, no. 5, pp. 782–794, Sep. 2015. [28] Y. Ma, S. Wang, P. C. K. Hung, C.-H. Hsu, Q. Sun, and F. Yang, ‘‘A highly accurate prediction algorithm for unknown Web service QoS values,’’ IEEE Trans. Services Comput., vol. 9, no. 4, pp. 511–523, Jul./Aug. 2016. [29] X. Wu, B. Cheng, and J. Chen, ‘‘Collaborative filtering service recommendation based on a novel similarity computation method,’’ IEEE Trans. Services Comput., vol. 10, no. 3, pp. 352–365, May 2017. [30] D. Yu, M. Wu, and Y. Yin, ‘‘A combination approach to QoS prediction of Web services,’’ in Proc. Int. Conf. Service-Oriented Comput. Berlin, Germany: Springer, 2012, pp. 99–106. [31] R. Karim, C. Ding, and A. Miri, ‘‘End-to-end performance prediction for selecting cloud services solutions,’’ in Proc. IEEE Symp. Service-Oriented Syst. Eng. (SOSE), Mar. 2015, pp. 69–77. [32] K. Xiong and H. Perros, ‘‘SLA-based service composition in enterprise computing,’’ in Proc. 16th Int. Workshop Qual. Service (IWQoS), Jun. 2008, pp. 30–39. [33] V. Cardellini, E. Casalicchio, V. Grassi, and F. L. Presti, ‘‘Adaptive management of composite services under percentile-based service level agreements,’’ in Proc. Int. Conf. Service-Oriented Comput. Berlin, Germany: Springer, 2010, pp. 381–395. [34] S. Ding, Y. Li, D. Wu, Y. Zhang, and S. Yang, ‘‘Time-aware cloud service recommendation using similarity-enhanced collaborative filtering and ARIMA model,’’ Decis. Support Syst., vol. 107, pp. 103–115, Mar. 2018. [35] G. H. Alférez and V. Pelechano, ‘‘Achieving autonomic Web service compositions with models at runtime,’’ Comput. Elect. Eng., vol. 63, pp. 332–352, Oct. 2017. [36] Z. Li, M. Zhang, Z. Zhu, Y. Chen, A. G. Greenberg, and Y.-M. Wang, ‘‘WebProphet: Automating performance prediction for Web services,’’ in Proc. NSDI, vol. 10, 2010, pp. 143–158. [37] Y. Jiang, L. R. Sivalingam, S. Nath, and R. Govindan, ‘‘WebPerf: Evaluating what-if scenarios for cloud-hosted Web applications,’’ in Proc. ACM SIGCOMM, 2016, pp. 258–271. [38] J. Rolia, G. Casale, D. Krishnamurthy, S. Dawson, and S. Kraft, ‘‘Predictive modelling of SAP ERP applications: Challenges and solutions,’’ in Proc. 4th Int. ICST Conf. Perform. Eval. Methodol. Tools, 2009, pp. 1–9. [39] E. D. Lazowska, J. Zahorjan, G. S. Graham, and K. C. Sevcik, Quantitative System Performance, Computer System Analysis Using Queuing Network Models. Upper Saddle River, NJ, USA: Prentice-Hall, 1984. [40] G. Casale, ‘‘Approximating passage time distributions in queueing models by Bayesian expansion,’’ Perform. Eval., vol. 67, no. 11, pp. 1076–1091, 2010. [41] R. A. Hayden, A. Stefanek, and J. T. Bradley, ‘‘Fluid computation of passage-time distributions in large Markov models,’’ Theor. Comput. Sci., vol. 413, no. 1, pp. 106–141, 2012. [42] J. F. Pérez and G. Casale, ‘‘Line: Evaluating software applications in unreliable environments,’’ IEEE Trans. Rel., vol. 66, no. 3, pp. 837–853, Feb. 2017. [43] M. Grottke, V. Apte, K. S. Trivedi, and S. Woolet, ‘‘Response time distributions in networks of queues,’’ in Queueing Networks. Boston, MA, USA: Springer, 2011, pp. 587–641. doi: 10.1007/978-1-4419-6472-4_14. [44] M. Nguyen, Z. Li, F. Duan, H. Che, and H. Jiang, ‘‘The tail at scale: How to predict it?’’ in Proc. 8th USENIX Workshop Hot Topics Cloud Comput. (HotCloud), 2016, pp. 1–6. [45] U. Sharma, P. Shenoy, and D. F. Towsley, ‘‘Provisioning multi-tier cloud applications using statistical bounds on sojourn time,’’ in Proc. 9th Int. Conf. Auton. Comput., 2012, pp. 43–52. [46] M. Björkqvist, N. Gautam, R. Birke, L. Y. Chen, and W. Binder, ‘‘Optimizing for tail sojourn times of cloud clusters,’’ IEEE Trans. Cloud Comput., vol. 6, no. 1, pp. 156–167, Aug. 2018. [47] U. N. Bhat, ‘‘Sixty years of queueing theory,’’ Manage. Sci., vol. 15, no. 6, p. B-280, 1969. [48] S. Kraft, S. Pacheco-Sanchez, G. Casale, and S. Dawson, ‘‘Estimating service resource consumption from response time measurements,’’ in Proc. 4th Int. ICST Conf. Perform. Eval. Methodol. Tools (VALUETOOLS), 2009, pp. 48:1–48:10. doi: 10.4108/ICST.VALUETOOLS2009.7526. [49] T. Zheng, C. M. Woodside, and M. Litoiu, ‘‘Performance model estimation and tracking using optimal filters,’’ IEEE Trans. Softw. Eng., vol. 34, no. 3, pp. 391–406, May 2008. VOLUME 7, 2019 Y. Amannejad et al.: Prospective: Data-Driven Technique to Predict Web Service Response Time Percentiles [50] M. Wang, K. Au, A. Ailamaki, A. Brockwell, C. Faloutsos, and G. R. Ganger, ‘‘Storage device performance prediction with CART models,’’ in Proc. Int. Symp. Modeling, Anal., Simulation Comput. Telecommun. Syst. (MASCOTS), Oct. 2004, pp. 588–595. [51] M. Courtois and M. Woodside, ‘‘Using regression splines for software performance analysis,’’ in Proc. 2nd Int. Workshop Softw. Perform. (WOSP), 2000, pp. 105–114. [Online]. Available: http://doi.acm.org/10.1145/ 350391.350416 [52] S. Uttamchandani, L. Yin, G. A. Alvarez, J. Palmer, and G. Agha, ‘‘CHAMELEON: A self-evolving, fully-adaptive resource arbitrator for storage systems,’’ in Proc. USENIX Annu. Tech. Conf. (ATEC), 2005, pp. 1–6. [Online]. Available: http://dl.acm.org/citation.cfm?id=1247360. 1247366 [53] I. Cohen, J. S. Chase, M. Goldszmidt, T. Kelly, and J. Symons, ‘‘Correlating instrumentation data to system states: A building block for automated diagnosis and control,’’ in Proc. OSDI, vol. 4, 2004, p. 16. DIWAKAR KRISHNAMURTHY is currently a Professor with the University of Calgary. His research interest includes the performance evaluation of software systems. He is also involved in research projects related to cloud computing, virtualization technologies, big data analytics, and healthcare simulation. YASAMAN AMANNEJAD received the Ph.D. degree in software engineering from the University of Calgary, Canada. She is currently an Assistant Professor with Mount Royal University. Her research interests include software engineering, data analytics, and machine learning. BEHROUZ FAR is currently a Professor with the University of Calgary. His research interests include engineering of intelligent, distributed and heterogeneous networked software systems, specifically design, and implementation and testing agent-based software systems. He is also involved with research projects related to distributed systems, intelligent transportation, cloud computing, and bio-informatics. VOLUME 7, 2019 127919