Large-scale P2P and Overlay Systems

As the demand on content delivery systems has grown, p2p-based architectures have emerged as viable and cost-effective approaches to satisfying such systems' requirements. Given the significant interest from researchers and Internet users alike, improvements in p2p systems' performance are likely to have an impact on a large community. Since such architectures need to be highly efficient, highly scalable, and highly distributed, appropriate design and evaluation of p2p-type systems are fundamental to the future of distributed systems in general.

This part of my research has focused on the design and evaluation of highly distributed p2p systems as well as overlay systems. To focus this research further, I have concentrated my efforts on the following content delivery applications: (a) file downloads, i.e., applications where the entire content is delivered before it is consumed (e.g., software upgrade downloads); (b) live streaming, i.e., applications where continuous media (CM) is being viewed as it is being generated (e.g., live sporting events); and (c) video-on-demand, i.e., applications where CM was recorded at an earlier time and can either be viewed as it is being delivered, or large portions of it can be downloaded before viewing begins. Many such applications require high QoS in order to be commercially viable - the inability to control the resulting QoS is an important barrier to their widespread development and use (and is the reason commercial p2p systems include dedicated servers).

Many challenges in such systems can be viewed in light of two fundamental issues: (1) incentives, i.e., provision of appropriate incentives for peers to contribute their resources such that the entire system performs well according to metrics appropriate for each application; and (2) quality-of-service, i.e., designing protocols and architectures that are amenable to provision of QoS and performance prediction. My work focuses on gaining insight into these fundamental issues, which I believe will lead to the development and proliferation of applications that are able to provide the high QoS that is currently lacking. Challenges to addressing these issues include: (1) the dynamic and heterogeneous nature of p2p and overlay environments; (2) the need to use resources efficiently with little or no centralized control; (3) the design and analysis of models that capture enough aspects of such systems to explain their ``interesting'' behavior and to facilitate better understanding of how such systems work and where improvements are needed; and (4) the design and analysis of protocols and architectures that are robust to the dynamic and heterogeneous nature of such environments and that are subsequently able to provide QoS.

In the context of large-scale overlays, I have also continued my efforts on scalable upload applications over wide-area networks, where we have focused on the design of an application-layer infrastructure that improves the scalability, security, fault tolerance, and performance of current and future Internet-based services. To the best of my knowledge, our work was the first to focus on the scalability and efficiency of upload applications.

P2P Systems

Incentives.   We explore incentive provision in the context of BitTorrent (BT) [D12]. Although BT ``forces'' peers to share their resources while attempting to complete their downloads (through the tit-for-tat (TFT) mechanism), there are still opportunities for free-riders (i.e., peers who do not contribute resources) to download files through: (1) capacity provided by leechers (nodes that have not completed their download yet) via the optimistic unchoking mechanism (used by leechers to probe for new downloading opportunities), and (2) capacity provided by seeds (nodes that have completed their download but continue to contribute their uploading capacity to the system). Sufficient evidence exists that: (1) opportunities for free-riding hurt the system's performance, and (2) a nonnegligible number of nodes in BT participate as seeds, which significantly contributes to providing reasonable performance for the free-riders. Moreover, a leecher's progress through the downloading process is slower at the beginning (when having few chunks prevents a node from participating in TFT) and at the end (when it is more difficult to find peers with the few remaining pieces of needed data). In both cases, seeds can contribute significantly, as they have the complete data and are no longer playing the TFT game. Since the seeds' behavior is an important factor both in facilitating free-riding and in aiding contributing leechers, our main idea was to explore alternative approaches for seeds to contribute their upload capacity to the system, with the goal of degrading free-riders' performance while (if possible) improving other leechers' performance. To the best of our knowledge, no other study before ours explored this.
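
For illustration only, the following is a minimal Python sketch of the general idea that a seed could bias its unchoke slots toward leechers that are themselves uploading, so that free-riders see degraded service; the peer records, slot counts, and tie-breaking rule are illustrative assumptions and not the specific mechanism proposed in [D12].

    import random

    def pick_seed_unchokes(leechers, slots=4):
        """Toy seed-side unchoke policy: prefer leechers that have uploaded
        to others, so that free-riders see degraded service.  The
        'contribution' field and the slot count are illustrative only."""
        contributors = [p for p in leechers if p["contribution"] > 0]
        free_riders = [p for p in leechers if p["contribution"] == 0]
        # Rank contributors by how much they have uploaded to the swarm.
        contributors.sort(key=lambda p: p["contribution"], reverse=True)
        chosen = contributors[:slots]
        # If slots remain, give one to a random non-contributor so that
        # newly joined peers (who have nothing to trade yet) can bootstrap.
        if len(chosen) < slots and free_riders:
            chosen.append(random.choice(free_riders))
        return chosen

    # Example: three contributing leechers and two free-riders.
    peers = [{"id": i, "contribution": c} for i, c in enumerate([50, 0, 20, 0, 5])]
    print([p["id"] for p in pick_seed_unchokes(peers, slots=3)])   # [0, 2, 4]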

Although empirical evidence suggests that many nodes participate in multiple torrents, little research literature exists on this topic. Multi-torrent participation can be exploited to increase the number of seeds in a given torrent, which in turn contributes to (a) helping newly joined nodes ramp up so that they can become contributing peers faster, (b) helping nodes nearing the end of their downloads find the last few file chunks faster, and (c) keeping a torrent ``alive'' longer (i.e., making sure that all file chunks remain available in the system). Thus, our work [D8,A4] focused on (i) what incentives could be provided for nodes to contribute resources as seeds in a multi-torrent environment, and (ii) what the resulting performance consequences are, both for the nodes willing to act as seeds and for the overall system. We found that the current system lacks incentives for nodes to stay around as seeds in a multi-torrent environment. Motivated by that, we proposed a cross-torrent-based method and illustrated, through an extensive performance study, the benefits of such an approach.

QoS/performance prediction.   We have explored different aspects of QoS and performance prediction through the use of analytical models and protocol design. For instance, we have developed a simple, yet extensive, analytical model of BT [D7]. To the best of our knowledge, it is the first (steady-state) analytical performance model of a heterogeneous BT system that includes the behavior of seeds and free-riders (as noted earlier, both are important). The accuracy of our model can be attributed to including important BT characteristics, namely (i) imperfect clustering in regular (TFT-based) unchokes and (ii) bias in optimistic unchokes. This accuracy is validated through simulations and PlanetLab experiments, which also demonstrate the importance of including imperfect clustering and biased optimistic unchoking in accurately predicting nodes' download times. Our model is quite extensible and can be used to evaluate future protocol modifications - e.g., we demonstrated our model's extensibility by explicitly modeling two variations of BT: (a) the well-known large view exploit, and (b) our proposed fix for the large view exploit (the novel seeding approach described above [D12]).

In a p2p-based streaming system, packets can arrive at a node in any order, but they must be played back in a specific order (and at a specific rate), corresponding to the original recording rate of the stream. We can view the subset of the edges (between nodes in a p2p overlay) used for packet delivery as forming a mesh. The system's performance is largely a function of the algorithms used to construct and maintain this mesh as well as the algorithms used for scheduling packet delivery over the mesh. Our work [D6] focused on approaches to constructing such meshes, where the primary goal was to develop an understanding of two related quantities - playback delay and buffering requirements. To this end, we explored two approaches to mesh construction and subsequent streaming, one based on multi-trees and the other based on hypercubes. We used these approaches to explore the resulting playback delay, buffer space, and communication requirements. In our work we focused on a ``structured'' approach to mesh construction (i.e., the set of edges used for delivery of packets was fixed by our algorithms) as compared to an ``unstructured'' approach (i.e., the edges used for delivery are determined on a per-packet basis, essentially on the fly when the data is needed). Although unstructured techniques allow the system to more easily adapt to node churn, existing unstructured approaches to streaming (rather than file downloading) are mostly ``best effort'', and little exists in the way of formal analysis of the resulting QoS. In contrast, an advantage of our approach is that it provides provable QoS, where we give an analysis of the corresponding performance characteristics.
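
As a small illustration of the structured approach, the sketch below derives a node's neighbors in a hypercube-based overlay (ids differing in exactly one bit); the dimension-by-dimension forwarding schedule mentioned in the comments is an illustrative assumption, not the exact construction analyzed in [D6].

    def hypercube_neighbors(node_id, dimension):
        """Neighbors of a node in a d-dimensional hypercube overlay are the
        ids that differ from node_id in exactly one bit."""
        return [node_id ^ (1 << bit) for bit in range(dimension)]

    # With 8 nodes (d = 3), node 5 = 0b101 has neighbors 4, 7, and 1.
    print(hypercube_neighbors(5, 3))   # [4, 7, 1]

    # In a structured mesh built this way, every node has d = log2(n)
    # neighbors, so buffer space and playback delay can be bounded in terms
    # of d (e.g., by forwarding packet k along dimension k mod d first).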

While many efforts focused on the common challenges faced by live and pre-recorded (Video-on-Demand) streaming applications, our efforts focused on some of the open questions in designing p2p-based Video-on-Demand (VoD) systems specifically [D4]. We considered a BT-like VoD system and focused on: (i) how the lack of load balance, which typically exists in a p2p-based VoD system, affects performance and what steps can be taken to remedy that, and (ii) whether a first-come-first-served (FCFS) approach to serving requests is sufficient or whether a Deadline-Aware Scheduling (DAS) approach can lead to performance improvements. Motivated by the lack of high QoS in streaming systems, we proposed several practical schemes aimed at addressing the above questions and illustrated the benefits of our techniques through an extensive performance study.
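
To make the FCFS-versus-deadline contrast concrete, here is a minimal sketch of a peer choosing which chunk requests to serve within a fixed upload budget; the request format and the earliest-deadline-first rule are illustrative assumptions and not necessarily the DAS scheme of [D4].

    import heapq

    def serve_fcfs(requests, budget):
        """Serve chunk requests in arrival order until the upload budget runs out."""
        return [r["chunk"] for r in requests[:budget]]

    def serve_deadline_aware(requests, budget):
        """Serve the requests whose playback deadlines are closest first."""
        urgent = heapq.nsmallest(budget, requests, key=lambda r: r["deadline"])
        return [r["chunk"] for r in urgent]

    requests = [
        {"chunk": 10, "deadline": 30.0},   # arrived first, but not urgent
        {"chunk": 42, "deadline": 2.5},    # about to be played back
        {"chunk": 43, "deadline": 3.5},
    ]
    print(serve_fcfs(requests, budget=2))            # [10, 42] -> chunk 43 may miss its deadline
    print(serve_deadline_aware(requests, budget=2))  # [42, 43]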

Multi-path Streaming

We have also been carrying over our efforts on QoS-oriented design from storage systems [A5,A8,D23,D17] to the delivery of CM data over the Internet. The poor QoS in streaming over the Internet is partly due to variations in delays, bandwidth limitations, and packet losses. Although streaming applications can tolerate some missing data, non-recoverable information loss degrades these applications' QoS. Consequently, a number of application areas have backed away from streaming their content over the Internet. The inability to control the resulting visual and auditory quality of the streamed presentation is an important reason for this trend.

We believe that this trend can be reversed. To this end, our work focused on exploring high-quality streaming through the exploitation of multiple paths existing in the network (between a set of senders and a receiver). By high quality, we mean streams with significant bandwidth requirements and relatively long durations, delivered without information loss or hiccups.

Our goal was to design an application-level approach, i.e., one that can be easily deployed over the Internet, without requirements for support from/modifications to the lower layers of the network. Our focus was largely on providing a fundamental understanding of the benefits of using multiple paths to deliver CM data destined for a particular receiver. (This data is fragmented into packets and the different packets take alternate routes to the receiver, e.g., by streaming the data through an overlay of relays distributed over a wide area network. As each packet is sent on only one of the paths, our approach does not increase the overall workload on the network.)

The existence of multiple paths (with disjoint bottlenecks) offers the following potential benefits: (a) reduction in correlations between consecutive packet losses; (b) increased throughput; and (c) improved robustness and ability to adjust to variations in congestion patterns and failures. Our work focused on the loss characteristics. Using a Gilbert model (GM), we gave [A14] an analytical characterization of when a multi-path (MP) approach is beneficial, as compared to a single-path (SP) approach. Our results indicated that: (1) in general, MP streaming exhibits better loss characteristics than SP streaming; (2) use of an erasure code may not necessarily improve data loss characteristics in the case of SP streaming, while MP streaming (with or without use of an erasure code) can improve data loss characteristics; and (3) the lag-1 autocorrelation of MP streaming is usually closer to zero than that of SP streaming, which should also result in a higher viewing quality of the received data. Our results also indicated that the average error burst length of MP streaming is statistically shorter than that of SP streaming.
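
The intuition can be reproduced with a short Monte Carlo sketch of the two-state Gilbert model, comparing one path with two independent paths used in round-robin fashion; the transition probabilities and the round-robin split are arbitrary illustrative choices, not the parameterization used in [A14].

    import random

    def gilbert_losses(n, p_gb, p_bg, seed=None):
        """Two-state Gilbert model: the 'good' state delivers packets, the
        'bad' state drops them.  p_gb = Pr[good -> bad], p_bg = Pr[bad -> good]."""
        rng, state, losses = random.Random(seed), "good", []
        for _ in range(n):
            losses.append(1 if state == "bad" else 0)
            if state == "good":
                state = "bad" if rng.random() < p_gb else "good"
            else:
                state = "good" if rng.random() < p_bg else "bad"
        return losses

    def mean_burst_length(losses):
        bursts, current = [], 0
        for lost in losses:
            if lost:
                current += 1
            elif current:
                bursts.append(current)
                current = 0
        if current:
            bursts.append(current)
        return sum(bursts) / len(bursts) if bursts else 0.0

    n = 100000
    sp = gilbert_losses(n, 0.02, 0.2, seed=1)                  # all packets on one path
    halves = [gilbert_losses(n // 2, 0.02, 0.2, seed=s) for s in (2, 3)]
    mp = [x for pair in zip(*halves) for x in pair]            # alternate packets across two paths
    print("SP mean loss-burst length:", mean_burst_length(sp))
    print("MP mean loss-burst length:", mean_burst_length(mp))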

Given these benefits, another important question is how to distribute the load among the multiple paths so as to optimize viewing quality. (The fraction of data sent on each path can be set, and adapted, by the receiver based on its perceived QoS.) We studied [D20] the problem of assigning CM data to the multiple paths, according to the path characteristics, such that a specified performance metric is optimized. This was done under the GM, while considering appropriate optimization objectives, with the goal of improving the perceptual quality of the streamed media. Determining and gaining insight into suitable optimization objectives is itself non-trivial and important (e.g., optimizing the lag-1 autocorrelation may result in higher loss rates observed at the receiver when the paths are not homogeneous). Since there is a fundamental tradeoff between the frequency of losses and the corresponding loss correlations, it is natural to consider optimization objectives that encompass both metrics.
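
The following toy sketch illustrates such a formulation: it sweeps the fraction of packets sent on each of two (simulated) Gilbert-model paths and scores each split by the resulting loss rate plus a weighted penalty on the lag-1 autocorrelation of the loss sequence. The path parameters, the weight, and the weighted round-robin assignment are all illustrative assumptions, not the objectives or algorithms of [D20].

    import random

    def gilbert(n, p_gb, p_bg, rng):
        """Two-state Gilbert loss model; returns per-packet loss indicators."""
        state, losses = 0, []              # 0 = good, 1 = bad
        for _ in range(n):
            losses.append(state)
            if state == 0:
                state = 1 if rng.random() < p_gb else 0
            else:
                state = 0 if rng.random() < p_bg else 1
        return losses

    def score_split(alpha, n=50000, weight=0.5, seed=0):
        """Send a fraction alpha of packets on path 1 and the rest on path 2
        (weighted round-robin); score = loss rate + weight * |lag-1 autocorrelation|."""
        rng = random.Random(seed)
        path1, path2 = gilbert(n, 0.02, 0.25, rng), gilbert(n, 0.05, 0.40, rng)
        i1 = i2 = 0
        credit, seq = 0.0, []
        for _ in range(n):
            credit += alpha
            if credit >= 1.0:              # this packet travels on path 1
                credit -= 1.0
                seq.append(path1[i1]); i1 += 1
            else:
                seq.append(path2[i2]); i2 += 1
        mean = sum(seq) / len(seq)
        num = sum((seq[k] - mean) * (seq[k + 1] - mean) for k in range(len(seq) - 1))
        den = sum((x - mean) ** 2 for x in seq) or 1.0
        return mean + weight * abs(num / den)

    # Sweep candidate splits and keep the best-scoring one.
    best = min((a / 10 for a in range(11)), key=score_split)
    print("best fraction of packets on path 1:", best)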

A great deal of our work was done under a ``conventional Gilbert model''. A limitation of using such a model is that the loss process of a path is independent of the bandwidth requirements of the streaming application. Our later effort [A9] proposed the use of a functional Gilbert model (FGM) as a more general approach to characterizing the bursty loss nature of a path as well as its dependency on an application's bandwidth requirements.

We verified our analytical and simulation results under real-world conditions, by building a p2p-based multi-path streaming prototype [A11] and using it for streaming video experiments between UMD, CUHK, and USC. The choice of a p2p-based architecture was motivated by the parallelism between simultaneous downloads in p2p systems and multi-path streaming.

Scalable Infrastructure for Wide-area Uploads

Hotspots are a major obstacle to achieving scalability in the Internet; they are usually caused by either high demand for some data or high demand for a certain service. At the application layer, hotspot problems have traditionally been dealt with using some combination of increasing capacity, spreading the load over time and/or space, and changing the workload. These classes of solutions have been studied in the context of one-to-many, many-to-many, and one-to-one communication. However, to the best of our knowledge, ours was the first effort [D37,D33,B1,A13] to make applications that use many-to-one communication scalable and efficient. This corresponds to an important class of upload applications (e.g., submission of income tax forms to the IRS, submission of papers to conferences, Internet-based storage, certain aspects of cloud computing, and many more).

The main focus of our early work was on scalable infrastructure design for wide-area upload applications, where we proposed Bistro [D37], an architecture that employs an overlay of (untrusted) hosts to improve the efficiency and scalability of uploads. Our basic idea was based on the observations that (a) hotspots in many upload applications are due to approaching deadlines and long transfer times, and (b) what many upload applications actually require is an assurance that specific data was submitted before a specific time; the transfer of the data must still be done in a timely fashion, but it does not have to complete by that deadline. Thus, our approach was to break the original deadline-driven upload problem into: (1) a real-time timestamp subproblem, where we ensure that the data is timestamped and cannot subsequently be tampered with; (2) a low-latency commit subproblem, where the data goes ``somewhere'' (to an intermediary) and the user is assured that the data is safely and securely ``on its way'' to its destination; and (3) a timely data transfer subproblem, which can be carefully planned (and coordinated with other uploads) and results in data delivery to the original destination.
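
For intuition only, a minimal sketch of the timestamp step is given below: before the deadline the client sends just a hash of its data, the destination binds that hash to a timestamp with a keyed signature stand-in (an HMAC here), and the full data can later arrive via an untrusted intermediary and be checked against the receipt. The message formats and the use of an HMAC in place of a real digital signature are illustrative assumptions, not the Bistro protocols of [D37,D33].

    import hashlib, hmac, json, time

    SERVER_KEY = b"destination-signing-key"     # stand-in for a real private signing key

    def client_commit(data: bytes) -> str:
        """Step 1 (before the deadline): send only a small hash of the data."""
        return hashlib.sha256(data).hexdigest()

    def server_timestamp(data_hash: str) -> dict:
        """The destination binds the hash to a timestamp and 'signs' the pair."""
        receipt = {"hash": data_hash, "ts": time.time()}
        blob = json.dumps(receipt, sort_keys=True).encode()
        receipt["sig"] = hmac.new(SERVER_KEY, blob, hashlib.sha256).hexdigest()
        return receipt

    def server_verify(data: bytes, receipt: dict) -> bool:
        """Step 3 (later): data arrives via an untrusted intermediary and is
        checked against the timestamped receipt, so it cannot have been altered."""
        blob = json.dumps({"hash": receipt["hash"], "ts": receipt["ts"]},
                          sort_keys=True).encode()
        expected = hmac.new(SERVER_KEY, blob, hashlib.sha256).hexdigest()
        return (hmac.compare_digest(receipt["sig"], expected)
                and hashlib.sha256(data).hexdigest() == receipt["hash"])

    data = b"income tax form contents"
    receipt = server_timestamp(client_commit(data))   # tiny message, easily beats the deadline
    print(server_verify(data, receipt))               # True; the bulk transfer happens afterwards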

Bistro's ability to share an infrastructure, such as an infrastructure of proxies, among a variety of wide-area applications has clear advantages over more traditional solutions. In [B1] we conducted a performance study that demonstrated the potential performance gains of the Bistro framework as well as provided insight into the general upload problem. Since data confidentiality and other security issues are especially important in upload applications, and since our solution introduces untrusted (public) intermediaries (i.e., bistros), we also developed [D33] a secure data transfer protocol within the Bistro framework, which not only ensures the privacy and integrity of the data but also takes scalability considerations into account.

In later efforts we focused on performance and reliability improvements to the basic Bistro design along the subproblems outlined above [D19,D21,A10,D27,A1,D29,D30]. For instance, we obtained [A10,D27] interesting results in the context of the timely data transfer subproblem (described above), which can be viewed more generally as the problem of collecting a large amount of data from a set of hosts to a single destination in a wide-area network. Often, due to congestion, the paths chosen by the network may have poor throughput. By choosing an alternate route at the application level (e.g., by using relays in an overlay), we may be able to obtain substantially faster completion times. This data collection problem is non-trivial because the issue is not only to avoid congested links, but also to devise a coordinated transfer schedule that achieves the maximum possible utilization of available network resources. In our work, we made no assumptions about knowledge of the topology of the network or the capacity available on individual links, i.e., we only used end-to-end information. We used a network-flow-based solution, utilizing time-expanded graphs, which gave us an optimal data transfer schedule with respect to the makespan metric, given the constraints on knowledge of the topology and capacity of the network. Experimentally, we established that the lack of knowledge of the paths provided by the network is not a significant barrier, and our approach to computing coordinated data collection schedules resulted in significant performance improvements. In general, our approach can be used for solving arbitrary data movement problems over the Internet, and we used the Bistro platform to illustrate one such application.
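
As a rough sketch of the time-expanded-graph idea, the code below (using the networkx library) checks whether all data can reach the destination within T steps via a max-flow computation, and then finds the smallest feasible T. The per-step transfer capacities between overlay hosts are assumed known here, which is a simplification of the end-to-end measurements used in practice, and the host names and numbers are purely illustrative.

    import networkx as nx

    def feasible_within(T, hosts, data, caps, dest):
        """Build a time-expanded graph over T steps and use max flow to check
        whether every host can deliver its data to dest.  caps[(u, v)] is the
        amount transferable from u to v in one step (assumed known here)."""
        G = nx.DiGraph()
        for h, amount in data.items():
            G.add_edge("src", (h, 0), capacity=amount)        # data starts at each source at time 0
        for t in range(T):
            for h in hosts:
                G.add_edge((h, t), (h, t + 1))                # holding data locally costs nothing
            for (u, v), c in caps.items():
                G.add_edge((u, t), (v, t + 1), capacity=c)    # one transfer opportunity per step
        for t in range(T + 1):
            G.add_edge((dest, t), "sink")                     # uncapacitated edges into the sink
        flow_value, _ = nx.maximum_flow(G, "src", "sink")
        return flow_value >= sum(data.values())

    hosts = ["a", "b", "d"]
    data = {"a": 4, "b": 6}                                   # units of data held at each source
    caps = {("a", "d"): 2, ("b", "a"): 3, ("b", "d"): 2}      # per-step end-to-end capacities
    makespan = next(T for T in range(1, 50) if feasible_within(T, hosts, data, caps, "d"))
    print("makespan (time steps):", makespan)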

Another direction in which we considered a general problem motivated by the Bistro architecture is the performance of digital signature schemes [A1,D29]. Digital signature schemes are widely used in many applications; however, the performance of an Internet server computing digital signatures online is limited by the high cost of modular arithmetic. One way to improve the server's performance is to reduce the number of computed digital signatures by combining a set of documents into a batch (in a smart way) and signing each batch only once. This approach can reduce the demand on the CPU but requires additional network bandwidth (for sending extra information to clients). Our work provided a framework (using semi-Markov models) for studying and analyzing the performance of a variety of online digital signature batching schemes. We extensively validated our framework's accuracy and effectiveness. Our results showed that substantial computational benefits can be obtained from batching without significant increases in the amount of additional information that needs to be sent to the clients.
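
One common way to batch, sketched below for intuition, is to build a hash tree over the documents in a batch and sign only the root; each client then receives the root signature plus the sibling hashes on its document's path, which is exactly the kind of extra information referred to above. The hash-tree construction and the HMAC standing in for a real signature are illustrative assumptions and are not necessarily the batching schemes analyzed in [A1,D29].

    import hashlib, hmac

    KEY = b"server-signing-key"              # stand-in for an RSA/DSA private key
    H = lambda b: hashlib.sha256(b).digest()

    def sign_batch(documents):
        """Hash each document, build a hash (Merkle) tree, 'sign' only the root."""
        level = [H(doc) for doc in documents]
        tree = [level]
        while len(level) > 1:
            if len(level) % 2:
                level = level + [level[-1]]                  # duplicate the last node
            level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
            tree.append(level)
        root_signature = hmac.new(KEY, tree[-1][0], hashlib.sha256).digest()
        return tree, root_signature

    def proof_for(tree, index):
        """Sibling hashes a client needs: the extra bandwidth cost of batching."""
        path = []
        for level in tree[:-1]:
            if len(level) % 2:
                level = level + [level[-1]]
            sibling = index ^ 1
            path.append((level[sibling], sibling < index))   # (hash, sibling-is-on-the-left)
            index //= 2
        return path

    def client_verify(document, path, root_signature):
        # With a real signature scheme the client would use the server's
        # public key here; the shared HMAC key is purely an illustrative stand-in.
        node = H(document)
        for sibling, sibling_is_left in path:
            node = H(sibling + node) if sibling_is_left else H(node + sibling)
        expected = hmac.new(KEY, node, hashlib.sha256).digest()
        return hmac.compare_digest(root_signature, expected)

    docs = [b"submission-1", b"submission-2", b"submission-3"]
    tree, signature = sign_batch(docs)
    print(client_verify(docs[1], proof_for(tree, 1), signature))   # True: one signature covers all three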
 


[Last updated Mon Mar 29 2010]