Cosmic Rays Don't Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design

If you have a question about this talk, please contact Eiko Yoneki.

Main memory is one of the leading hardware causes of machine crashes in today’s datacenters. Designing, evaluating, and modeling systems that are resilient against memory errors requires a good understanding of the underlying characteristics of DRAM errors in the field. While a few initial studies of DRAM errors in production systems have appeared recently, they have been too limited in either the size of the data set or the granularity of the data to conclusively answer many of the open questions on DRAM errors. Such questions include, for example, the prevalence of soft errors compared to hard errors, and the typical patterns of hard errors.

In this project, we study data on DRAM errors collected from a diverse range of production systems, covering nearly 300 terabyte-years of main memory in total. As a first contribution, we provide a detailed analytical study of DRAM error characteristics, including both hard and soft errors. We find that a large fraction of DRAM errors in the field can be attributed to hard errors, and we characterize these hard errors in detail. As a second contribution, we use the results from the measurement study to identify a number of promising directions for designing more resilient systems, and we evaluate the potential of different protection mechanisms in light of realistic error patterns. One of our findings is that simple page retirement policies might be able to mask a large number of DRAM errors in production systems, while sacrificing only a negligible fraction of the total DRAM in the system.
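To make the page retirement idea concrete, the sketch below replays a list of error addresses under a hypothetical "retire a page after N errors" rule and counts how many later errors the retired pages would have absorbed. It is an illustration only, not the policy or tooling evaluated in the talk: the function names, the 4 KiB page size, the threshold, and the synthetic trace are all assumptions.

```python
from collections import Counter

PAGE_SIZE = 4096  # bytes; assumed page size, not taken from the talk


def simulate_page_retirement(error_addresses, threshold=1, page_size=PAGE_SIZE):
    """Replay error addresses under a hypothetical retire-after-N-errors rule.

    Returns (masked_errors, retired_pages), where masked_errors counts the
    errors that fall on a page that had already been retired.
    """
    counts = Counter()   # errors seen so far on each page still in use
    retired = set()      # page numbers taken out of service
    masked = 0
    for addr in error_addresses:
        page = addr // page_size
        if page in retired:
            # The page is no longer in use, so this error is avoided.
            masked += 1
            continue
        counts[page] += 1
        if counts[page] >= threshold:
            retired.add(page)
    return masked, retired


def retired_fraction(retired_pages, total_bytes, page_size=PAGE_SIZE):
    """Fraction of total memory given up to retired pages."""
    return len(retired_pages) * page_size / total_bytes


# Synthetic trace (made up for illustration): one page fails repeatedly,
# as a hard error would, and two other pages each see a single error.
trace = [0x1000, 0x1008, 0x1010, 0x9000, 0x1020, 0x2A000]

masked, retired = simulate_page_retirement(trace, threshold=1)
cost = retired_fraction(retired, total_bytes=1 << 30)  # assume 1 GiB of DRAM
```

On this made-up trace, retiring a page on its first error avoids the three repeat errors on the failing page at a cost of three pages out of 1 GiB. Raising the threshold retires fewer pages but lets more repeat errors through, which is the trade-off a retirement policy has to balance.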

Bio: Ioan Stefanovici is a PhD student in the Computer Systems and Networks Group at the University of Toronto, under the supervision of Prof. Bianca Schroeder. His research has dealt primarily with improving the reliability and performance of large-scale computer systems, studying the reliability of DRAM, and the impact of temperature on data centers. More recently, he has been working at Microsoft Research on software-defined storage. http://www.cs.utoronto.ca/~ioan/

This talk is part of the Computer Laboratory Systems Research Group Seminar series.

