• Building and Managing Business Resiliency on the Cloud
  • Improving overall performance & energy consumption of your cluster with remote GPU virtualization
  • Call for Tutorials (closed)

Building and Managing Business Resiliency on the Cloud (Tuesday, Dec. 8th, starting at 1:30pm; Room: Point Grey, 3rd Floor)

Abstract: Business resiliency is the ability to rapidly adapt and respond to business disruptions and to maintain continuous business operations. Enterprises are rapidly migrating their workloads to cloud environments to reap the many economic advantages of cloud computing. These enterprises have business resiliency requirements that can be very stringent. Some of the very attributes of cloud architectures that create economic advantages (e.g., logically or physically centralized management and control, and common data management mechanisms) are potential single points of failure and may actually make outages and failures more likely. The sheer scale of a cloud environment and the concomitant aggregation of a very large number of workloads into a single site tend to completely overwhelm traditional resiliency technologies.

We will present approaches for providing business resiliency in cloud-based IT infrastructures against unplanned failures ranging from localized hardware and software failures (such as failures of processes, containers, servers, and storage) to large-scale failures such as disasters, as well as planned outages such as those due to maintenance and upgrades.

We will describe the steps involved in formulating, implementing, and operating a business resiliency strategy for cloud-based IT infrastructures, including understanding exactly what the business needs in order to survive unexpected events, assessing risk versus cost, and planning ahead for challenges that could come at any time.

We will present a reference architecture and patterns for business resiliency covering cloud infrastructures and both cloud-enabled and cloud-native applications. The reference architecture provides a formalized approach (including blueprints and patterns/building blocks) for business resiliency, taking into account the capabilities and constraints of different cloud environments and the workloads running on those environments. The reference architecture includes design elements (such as orchestration, replication, monitoring, management) and design principles at each layer of the application stack (e.g., presentation, middleware, data layers) and infrastructure (network, storage, load balancing, monitoring) required to meet the resiliency requirements.

The tutorial will include a survey of commercial cloud resiliency solutions and academic results, and will also cover specific examples on how certain key enabling technologies (including Software Defined Environments, replication technologies, and the use of deep analytics for resiliency planning and assessment) are applied in the framework of the reference architecture to achieve business resiliency. We will also present our own experiences in building and managing cloud systems for business resiliency.

Author Bios


Rick Harper is a Research Staff Member at the IBM T.J. Watson Research Center. Rick’s main assignment since joining IBM Research in 1998 has been to conceive and lead the transfer of research projects to product development. This has resulted in numerous products, such as the Summit Server product line, the Software Rejuvenation product, the Dynamic System Analysis product, virtualization-based Availability Management products, and the High Availability and Disaster Recovery functions for the IBM Cloud Managed Services offering. Rick participated in the National Academy of Sciences Panel on Reengineering the Space Shuttle in 1998 and was elected to the IBM Academy of Technology in 2007. He has authored approximately 30 papers, supervised over 20 graduate student theses, and holds numerous international patents. Prior to joining IBM, Rick was a Senior Technical Advisor at Stratus Computer in Marlboro, Massachusetts, where he was responsible for technical strategy and development for the company’s line of fault-tolerant computers. Prior to Stratus, Rick was a Principal Member of the Technical Staff at the Charles Stark Draper Laboratory, where he created, designed, and implemented massively parallel fault-tolerant computers for mission-critical applications. At Oak Ridge National Laboratory, he designed and implemented instrumentation and control systems for nuclear research projects. He received his PhD in Computer Systems Technology/Aerospace Engineering in 1987 from the Massachusetts Institute of Technology, his MS in Physics and Aerospace Engineering in 1976 from Mississippi State University, and his BS in Physics in 1976 from Mississippi State University.


Hari Ramasamy is a Research Scientist and Manager at the IBM T.J. Watson Research Center. Hari’s research interests are in the areas of cloud resiliency and analytics for IT services transformation. In 2014, Hari was inducted into the IBM Academy of Technology, the global technical leadership body for IBM. At IBM, Hari has received numerous recognitions such as the IBM Research Client Award (2014), the IBM Outstanding Innovation Award (2012), and IBM Research Division Awards (2012, 2015). His research work has been recognized with Best Paper Awards from the IEEE SCC (co-author, 2013) and IEEE PRDC (co-author, 2002) conferences. He is an IEEE Senior Member and has served as the Program Co-Chair of the SAFECONFIG 2011 conference. Hari serves on the Editorial Advisory Board of the Disaster Recovery Journal, the premier publication of the business continuity industry. Hari is an Adjunct Associate Professor at New York University and has previously served as adjunct faculty at Columbia University and at NYU-Poly. He obtained his Ph.D. degree in Computer Science from the University of Illinois, Urbana-Champaign (UIUC) in 2006.


Long Wang is a Research Staff Member at the IBM T.J. Watson Research Center, Yorktown Heights, NY, where he leads the architecture of Disaster Recovery of IBM Cloud Managed Services to IBM Resiliency Services. His research interests include Fault-Tolerance and Reliability of Systems and Applications, Dependable and Secure Systems, Distributed Systems, Cloud Computing, Operating Systems, System Modeling, as well as Measurement and Assessment. He has published more than 20 papers in top conferences and journals and has served on the Program Committees of IEEE SELSE 2015, GlobalIT 2015, and FCC 2014. Dr. Wang is a member of the IEEE. He obtained his Ph.D. degree from the Department of Electrical & Computer Engineering at the University of Illinois at Urbana-Champaign (UIUC) in 2010. Before that, he received an MS degree from the Department of Computer Science at UIUC in 2002 and a BS degree from the Department of Computer Science at Beijing University in 2000.

Improving overall performance and energy consumption of your cluster with remote GPU virtualization

Abstract: Nowadays, GPUs are widely used to accelerate scientific applications, but their adoption in HPC clusters presents several drawbacks. First, in addition to increasing acquisition costs, the use of accelerators also increases maintenance and space costs. Second, energy consumption is also increased. Third, GPUs in a cluster may present a relatively low utilization rate (it is quite unlikely that all the accelerators in the cluster are used all the time). Consequently, virtualizing the GPUs of the cluster is an appealing strategy for dealing with all these drawbacks simultaneously. Additionally, cluster throughput is increased whereas costs and energy consumption are reduced. In this tutorial we present the benefits of remote GPU virtualization and compare several frameworks that provide it: gVirtuS, DS-CUDA, and rCUDA. We present the latest developments within these frameworks: low-power processors, job schedulers, virtual machines, etc. In a hands-on part of the tutorial we show how to install and use the freely available rCUDA solution. We demonstrate how, by using rCUDA over a high-performance interconnect, the overhead of remote GPU virtualization is reduced to negligible values. Finally, attendees will be able to experiment with this framework by connecting to a real cluster that includes several nodes with GPUs.

Presenters:

Federico Silla (https://sites.google.com/site/federicosillaupv/) received the MS and PhD degrees from Technical University of Valencia (UPV), Spain. He is currently an associate professor at the Department of Computer Engineering (DISCA) at that university. His research is mainly performed within the Parallel Architectures Group of Technical University of Valencia, although he is also an external contributor to the Advanced Computer Architecture research group at the Department of Computer Engineering at University of Heidelberg, Germany. Furthermore, he worked for two years at Intel Corporation, developing on-chip networks. His research addresses high-performance on-chip and off-chip interconnection networks as well as distributed memory systems and remote GPU virtualization mechanisms. He has published numerous papers in peer-reviewed conferences and journals, as well as several book chapters. He has been a member of the Program Committee of several of the most prestigious conferences in his area, including PACT, ICS, SC, HiPC, and ICPP. The papers he has published so far give him an h-index of 23 according to Google Scholar. He has coordinated the rCUDA remote GPU virtualization project since it began in 2008. Additionally, he is leading the development of other virtualization technologies.


Carlos Reaño (http://www.gap.upv.es/carregon) received a BS degree in Computer Engineering from the University of Valencia, Spain, in 2008. He also holds an MS degree in Software Engineering, Formal Methods and Information Systems from the Technical University of Valencia, Spain, received in 2012. He is currently pursuing his PhD on the virtualization of remote GPUs at the Department of Computer Engineering (DISCA) of that university, where he is working on the rCUDA project. He has published several papers in peer-reviewed conferences and journals, and has also served as a reviewer for some conferences and journals.

Call for Tutorials

The Middleware conference traditionally includes tutorials on selected topics given by renowned scientists and practitioners in their fields. Tutorials on both mature and emerging topics are welcomed. Tutorials may be lectures, interactive workshops, hands-on training, or any combination of the above. Exploring diverse ways of interacting with the audience is welcome as are cross-disciplinary topics.

Proposers of accepted tutorials have to prepare a webpage containing detailed information about the tutorial. Tutorial proposals should be submitted in PDF format, not exceeding three (3) pages in total, and be sent to:

  • Davide Frey (davide.frey@inria.fr)
  • Xiaohui (Helen) Gu (gu@csc.ncsu.edu)

in an email with subject line “[Middleware 2015 – Tutorial Submission]”.

Important Dates:

Tutorial proposals due July 17, 2015
Notification of acceptance/rejection August 3, 2015

Proposals must include

  • Title and short outline of the tutorial content (max. 200 words)
  • Motivation on why the topic is of particular interest at this time.
  • Information about the presenters (name, affiliation, email address, homepage) and a short description of their expertise, experiences in teaching and in tutorial presentation.
  • The type of tutorial (e.g., lecture vs. hands-on)
  • References to previous iterations of the tutorial (if applicable) including their date, venue, topics and number of participants and the motivation for the new proposal
  • Requirements for the tutorial room (please note that our capabilities in fulfilling unusual requests are limited)
  • Requirements for the attendees (e.g., must bring own laptop or other hardware, familiarity with certain technologies or topics, etc.)
  • Expected number of participants