Fault-Tolerant Computing, English-language course slides: lec01-intro-motiv
Fault-Tolerant Computing
Motivation, Background, and Tools

Sep. 2006    Introduction and Motivation

About This Presentation

Edition: First. Released: Sep. 2006.

This presentation has been prepared for the graduate course ECE 257A (Fault-Tolerant Computing) by Behrooz Parhami, Professor of Electrical and Computer Engineering at University of California, Santa Barbara. The material contained herein can be used freely in classroom teaching or any other educational setting. Unauthorized uses are prohibited.
Behrooz Parhami

Design, Implementation, Operation, and Human Mishaps

The Curse of Complexity

Computer engineering is the art and science of translating user requirements we do not fully understand; into hardware and software we cannot precisely analyze; to operate in environments we cannot accurately predict; all in such a way that the society at large is given no reason to suspect the extent of our ignorance.1

1 Adapted from definition of structural engineering: Ralph Caplan, By Design: Why There Are No Locks on the Bathroom Doors in the Hotel Louis XIV and Other Object Lessons, Fairchild Books, 2004, p. 229

Microsoft Windows NT (1992): 4M lines of code
Microsoft Windows XP (2002): 40M lines of code
Intel Pentium processor (1993): 4M transistors
Intel Pentium 4 processor (2001): 40M transistors
Intel Itanium 2 processor (2002): 500M transistors

Defining Failure

Failure is an unacceptable difference between expected and observed performance.1

1 Definition used by the Tech. Council on Forensic Engineering of the Amer. Society of Civil Engineers

A structure (building or bridge) need not collapse catastrophically to be deemed a failure

Reasons for typical Web site failures
Hardware problems: 15%
Software problems: 34%
Operator error: 51%

[Figure labels: Implementation, Specification, ?]

Design Flaws: “To Engineer is Human”1

Complex systems almost certainly contain multiple design flaws
Redundancy in the form of safety factor is routinely used in buildings and bridges
One catastrophic bridge collapse every 30 years or so
See the following amazing video clip (Tacoma Narrows Bridge): http://www.enm.bris.ac.uk/research/nonlinear/tacoma/tacnarr.mpg
Example of a more subtle flaw: Disney Concert Hall in Los Angeles reflected light into nearby building, causing discomfort for tenants due to blinding light and high temperature

1 Title of book by Henry Petroski

Design Flaws in Computer Systems

Hardware example: Intel Pentium processor, 1994
For certain operands, the FDIV instruction yielded a wrong quotient
Amply documented and reasons well-known (overzealous optimization)

Software example: Patriot missile guidance, 1991
Missed intercepting a Scud missile in 1st Gulf War, causing 28 deaths
Clock reading multiplied by 24-bit representation of 1/10 s (unit of time) caused an error of about 0.0001%; normally, this would cancel out in relative time calculations, but owing to ad hoc updates to some (not all) calls to a routine, calculated time was off by 0.34 s (over 100 hours), during which time a Scud missile travels more than … km

User interface example: Therac 25 machine, mid 1980s1
Serious burns and some deaths due to overdose in radiation therapy
Operator entered “x” (for x-ray), realized error, corrected by entering “e” (for low-power electron beam) before activating the machine; activation was so quick that software had not yet processed the override

1 Accounts of the reasons vary
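The Patriot figures quoted above (an error of about 0.0001% growing to 0.34 s over 100 hours) can be reproduced with a few lines of exact arithmetic. The sketch below is illustrative only: the slide says just "24-bit representation", so the choice of 23 bits after the binary point with chopping is an assumption, and the function names are hypothetical.

```python
from fractions import Fraction

# Assumption (not stated on the slide): the 24-bit register keeps 23 bits
# after the binary point and chops, rather than rounds, the value of 1/10.
FRACTION_BITS = 23
TICKS_PER_SECOND = 10  # the clock counts tenths of a second


def chopped_tenth(bits: int = FRACTION_BITS) -> Fraction:
    """Return 1/10 truncated to `bits` binary digits after the point."""
    scale = 2 ** bits
    return Fraction(scale // 10, scale)


def per_tick_error(bits: int = FRACTION_BITS) -> Fraction:
    """Exact shortfall, in seconds, each time one clock tick is converted."""
    return Fraction(1, 10) - chopped_tenth(bits)


def clock_drift_seconds(hours: float, bits: int = FRACTION_BITS) -> float:
    """Accumulated timing error after `hours` of continuous operation."""
    ticks = hours * 3600 * TICKS_PER_SECOND
    return float(ticks * per_tick_error(bits))


if __name__ == "__main__":
    relative = float(per_tick_error() / Fraction(1, 10))
    print(f"relative error: {relative * 100:.6f} %")
    print(f"drift after 100 hours: {clock_drift_seconds(100):.2f} s")  # 0.34 s
```

Under these assumptions the per-tick error is tiny, but it is never corrected, so it grows linearly with uptime; that is why the problem only surfaced after the system had been running for days.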

Learning Curve: “Normal Accidents”1

Example: Risk of piloting a plane
1903: First powered flight
1908: First fatal accident
1910: Fatalities = 32 (2000 pilots worldwide)
1918: US Air Mail Service founded; pilot life expectancy = 4 years; 31 of the first 40 pilots died in service
1922: One forced landing for every 20 hours of flight
Today: Commercial airline pilots pay normal life insurance rates

1 Title of book by Charles Perrow (Ex. p. 125)

Mishaps, Accidents, and Catastrophes

Mishap: misfortune; unfortunate accident
Accident: unexpected (no-fault) happening causing loss or injury
Catastrophe: final, momentous event of drastic action; utter failure

At one time (following the initial years of highly unreliable hardware), computer mishaps were predominantly the results of human error
Now, most mishaps are due to complexity (unanticipated interactions)

Forum on Risks to the Public in Computers and Related Systems
http://catless.ncl.ac.uk/risks (Peter G. Neumann, moderator)

Example from

On August 17, 2006, a class-two incident occurred at the Swedish atomic reactor Forsmark. A short-circuit in the electricity network caused a problem inside the reactor and it needed to be shut down immediately, using emergency backup electricity. However, in two of the four generators, which run on AC, the AC/DC converters died. The generators disconnected, leaving the reactor in an unsafe state and the operators unaware of the current state of the system for approximately 20 minutes. A meltdown, such as the one in Tschernobyl, could have occurred.

Coincidence of problems in multiple protection levels seems to be a recurring theme in many modern-day mishaps - emergency systems had not been tested with the grid electricity being off

Layers of Safeguards

With multiple layers of safeguards, a system failure occurs only if warning symptoms and compensating actions are missed at each layer, which is quite unlikely
Is it really?
The computer engineering literature is full of examples of mishaps when two or more layers of protection failed at the same time
Multiple layers increase the reliability significantly only if the “holes” in the representation above are fairly randomly distributed, so that the probability of their being aligned is negligible
Dec. 1986: ARPANET had 7 dedicated lines between NY and Boston; a backhoe accidentally cut all 7 (they went through the same conduit)

A Problem to Think About

In a passenger plane, the failure rate of the cabin pressurizing system is $10^{-5}$/hr (loss of cabin pressure occurs once per $10^{5}$ hours of flight)
Failure rate of the oxygen-mask deployment system is also $10^{-5}$/hr
Assuming failure independence, both systems fail at a rate of $10^{-10}$/hr
Fatality probability for a 10-hour flight is about $10^{-10} \times 10 = 10^{-9}$ ($10^{-9}$ or less is generally deemed acceptable)
Probability of death in a car accident is 1/6000 per year ($10^{-7}$/hr)

Alternate reasoning
Probability of cabin pressure system failure in 10-hour flight is $10^{-4}$
Probability of oxygen masks failing to deploy in 10-hour flight is $10^{-4}$
Probability of both systems failing in 10-hour flight is $10^{-8}$
Why is this result different from that of our earlier analysis ($10^{-9}$)? Which one is correct?

Cabin Pressure and Oxygen Masks

When we multiply the two per-hour failure rates and then take the flight duration into account, we are assuming that only the failure of the two systems within the same hour is catastrophic
This produces an optimistic reliability estimate ($1 - 10^{-9}$)

When we multiply the two flight-long failure rates, we are assuming that the failure of these systems would be catastrophic at any time
This produces a pessimistic reliability estimate ($1 - 10^{-8}$)

[Figures: two timelines of a 10-hour flight, hours 0 to 10, each marking “Masks fail” and “Pressure is lost”]
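The two estimates for the cabin pressure problem can be put side by side in a short calculation. This is a sketch rather than a definitive analysis: the rate and flight duration come from the slides, while the function names and the third, intermediate model (masks already failed, undetected, at the moment pressure is lost) are assumptions added here for comparison.

```python
RATE_PER_HOUR = 1e-5  # failure rate of each system, as given on the slide
FLIGHT_HOURS = 10     # flight duration, as given on the slide


def same_hour_estimate(rate: float, hours: float) -> float:
    """Optimistic: only a double failure within the same hour is fatal."""
    return (rate * rate) * hours


def whole_flight_estimate(rate: float, hours: float) -> float:
    """Pessimistic: both systems failing at any time, in any order, is fatal."""
    return (rate * hours) ** 2


def masks_fail_first_estimate(rate: float, hours: float) -> float:
    """Assumed intermediate model: fatal only if the masks have already failed
    when pressure is lost. Integrating rate * (rate * t) over the flight
    gives (rate * hours)**2 / 2."""
    return (rate * hours) ** 2 / 2


if __name__ == "__main__":
    for model in (same_hour_estimate, masks_fail_first_estimate,
                  whole_flight_estimate):
        print(f"{model.__name__}: {model(RATE_PER_HOUR, FLIGHT_HOURS):.1e}")
```

All three use the small-probability approximation (rate times duration) that the slides use. The spread between them is only a factor of ten, but it shows that the answer depends on which combinations of failures are assumed to be catastrophic, not on the arithmetic.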

Causes of Human Errors in Computer Systems

1. Personal factors (35%): Lack of skill, lack of interest or motivation, fatigue, poor memory, age or disability
2. System design (20%): Insufficient time for reaction, tedium, lack of incentive for accuracy, inconsistent requirements or formats
3. Written instructions (10%): Hard to understand, incomplete or inaccurate, not up to date, poorly organized
4. Training (10%): Insufficient, not customized to needs, not up to date
5. Human-computer interface (10%): Poor display quality, fonts used, need to remember long codes, ergonomic factors
6. Accuracy requirements (10%): Too much expected of operator
7. Environment (5%): Lighting, temperature, humidity, noise

Because “the interface is the system” (according to a popular saying), items 2, 5, and 6 (40%) could be categorized under user interface

Properties of a Good User Interface

1. Simplicity: Easy to use, clean and unencumbered look
2. Design for error: Makes errors easy to prevent, detect, and reverse; asks for confirmation of critical actions
3. Visibility of system state: Lets user know what is happening inside the system from looking at the interface
4. Use of familiar language: Uses terms that are known to the user (there may be different classes of users, each with its own vocabulary)
5. Minimal reliance on human memory: Shows critical info on screen; uses selection from a set of options whenever possible
6. Frequent feedback: Messages indicate consequences of actions
7. Good error messages: Descriptive, rather than cryptic
8. Consistency: Similar/different actions produce similar/different results and are encoded with similar/different colors and shapes

Operational Errors in Computer Systems

Hardware examples
Permanent incapacitation due to shock, overheating, voltage spike
Intermittent failure due to overload, timing irregularities, crosstalk
Transient signal deviation due to alpha particles, external interference

Software examples
Counter or buffer overflow
Out-of-range, unreasonable, or unanticipated input
Unsatisfied loop termination condition

Dec. 2004: “Comair runs a 15-year old scheduling software package from SBS International (). The software has a hard limit of 32,000 schedule changes per month. With all of the bad weather last week, Comair apparently hit this limit and then was unable to assign pilots to planes.” It appears that they were using a 16-bit integer format to hold the count.

June 1996: Explosion of the Ariane 5 rocket 37 s into its maiden flight was due to a silly software error. For an excellent exposition of the cause, see: p.lancs.ac.uk/computing/users/dixa/teaching/CSC221/ariane.pdf)

These can also be classified as design errors

About the Name of This Course

Fault-tolerant computing: a discipline that began in the late 1960s
1st Fault-Tolerant Computing Symposium (FTCS) was held in 1971
In the early 1980s, the name “dependable computing” was proposed for the field, to account for the fact that tolerating faults is but one appr
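Returning to the counter overflow item and the Comair report quoted under the software examples: a signed 16-bit integer tops out at 32,767, which is consistent with a "hard limit" near 32,000. The sketch below is a hypothetical illustration, not a reconstruction of the actual scheduling package; the slide itself says only that a 16-bit count "appears" to be the cause, and the names `wrap_int16` and `CheckedCounter` are invented here.

```python
INT16_MAX = 2 ** 15 - 1  # 32767, the largest signed 16-bit value


def wrap_int16(value: int) -> int:
    """Interpret the low 16 bits of `value` as a signed two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


class CheckedCounter:
    """Counter that refuses to pass its limit instead of silently wrapping."""

    def __init__(self, limit: int = INT16_MAX) -> None:
        self.limit = limit
        self.count = 0

    def increment(self) -> int:
        if self.count >= self.limit:
            raise OverflowError(f"counter limit of {self.limit} reached")
        self.count += 1
        return self.count


if __name__ == "__main__":
    print(wrap_int16(INT16_MAX))      # 32767
    print(wrap_int16(INT16_MAX + 1))  # -32768: an unchecked counter wraps around
```

Neither behavior is safe on its own: wrapping corrupts the count silently, while a hard stop, as in the quoted report, halts operations. The design question is what the system should do when the limit is approached, which is why such errors can also be classified as design errors.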
