Cosmic Rays Don't Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design
If you have a question about this talk, please contact Eiko Yoneki.

Main memory is one of the leading hardware causes of machine crashes in today’s datacenters. Designing, evaluating and modeling systems that are resilient against memory errors requires a good understanding of the underlying characteristics of DRAM errors in the field. While there have recently been a few initial studies of DRAM errors in production systems, these have been too limited in either the size of the data set or the granularity of the data to conclusively answer many of the open questions about DRAM errors. Such questions include, for example, the prevalence of soft errors compared to hard errors, and the typical patterns of hard errors.

In this project, we study data on DRAM errors collected from a diverse range of production systems, covering nearly 300 terabyte-years of main memory in total. As a first contribution, we provide a detailed analytical study of DRAM error characteristics, including both hard and soft errors. We find that a large fraction of DRAM errors in the field can be attributed to hard errors, and we analyze their characteristics in detail. As a second contribution, we use the results of the measurement study to identify a number of promising directions for designing more resilient systems, and we evaluate the potential of different protection mechanisms in light of realistic error patterns. One of our findings is that simple page retirement policies might be able to mask a large number of DRAM errors in production systems while sacrificing only a negligible fraction of the total DRAM in the system.

Bio: Ioan Stefanovici is a PhD student in the Computer Systems and Networks Group at the University of Toronto, under the supervision of Prof. Bianca Schroeder. His research has dealt primarily with improving the reliability and performance of large-scale computer systems, studying the reliability of DRAM, and the impact of temperature on data centers. More recently, he has been working at Microsoft Research on software-defined storage. http://www.cs.utoronto.ca/~ioan/

This talk is part of the Computer Laboratory Systems Research Group Seminar series.