Systems Seminar Schedule
1 October - Douglas Thain, “Error Management in a Virtual Operating System”
15 October - Andrea Arpaci-Dusseau, “Information and Control in Gray-Box Systems”
29 October - John Bent, “Creating Communities for Grid I/O”
12 November - Open
26 November - Open
10 December - Open

Error Management in a Virtual Operating System
Douglas Thain
Condor Project, University of Wisconsin

What is a Virtual OS?
[Figure labels: Hardware; Operating System; Device Drivers; Virtual OS 2; Device Drivers; App 2; App 3; App 4]

Why Use a Virtual OS?
To test and deploy software that would otherwise require destructive changes. (Wine, User Mode Linux)
To improve indirection or fault-tolerance. (Rocks, Socks, Grid Console)
To transparently harness exterior resources. (UFO, Condor, PFS)

Harness the Grid
[Figure labels: Virtual OS 1; Virtual OS 2; App 1; App 2; App 3; App 4]

In a Standard OS, Errors are not Difficult
Layers are members of a unified engineering effort.
A standard namespace and scheme are used end-to-end.
Most interfaces closely resemble the underlying implementation.
Most catastrophic failures are coordinated.
[Figure labels: Device Driver; File System; OS Kernel; Standard Library; App; errno passed between the layers]

Handling Errors is a Serious Problem on the Grid
It is an important problem to solve:
  As systems grow more complex, MTBF approaches 0.
  Failures are generally uncoordinated.
  Propagating knowledge of failure is more important than increasing likelihood of success.
It is a difficult problem to solve:
  Theoretical: Matching different abstractions.
  Technical: Mating different languages and conventions.
  Social: Coordinating distinct engineering efforts.

Error Management: A Problem of Depth
[Figure labels: App; Virtual OS; FTP Driver; Globus FTP Library (twice); FTP Server; Unitree OS; Disk Cache; Tape Archive; interfaces: POSIX, Unitree, Globus, FTP, DDI]
A Problem of Width
[Figure labels: App; Virtual Operating System; errno; UNIX Driver; SRB Driver; FTP Driver; NeST Driver; Kangaroo Driver; Globus GASS Driver]
An alphabet soup of protocols, APIs, systems, authorities, and authors.

A Problem of Design Direction
[Figure labels, Bottom Up Design: App; Application Library; Standard Library; OS Kernel; errno; errno]
[Figure labels, Outside In Design: App; Virtual OS; FTP Driver; FTP Library; Globus; DDI; errno; ?]
How do we correctly represent errors in a virtual operating system?

Spirit of this Talk
Software design involves striking balances - there is no trivial answer.
Concentrate on presenting several concrete problems and working solutions.
Given these “data points,” I will present some reasonable generalizations.
Languages and conventions are ancillary issues, e.g. exceptions vs. signals vs. errnos.
Discussion and disagreement are welcome!

The Pluggable File System
[Figure labels: App; Bypass; Local Driver; SRB Driver; SRB Library; Kangaroo Driver; Kangaroo Library; GridFTP Driver; GridFTP Library; NeST Driver; NeST Library; HTTP Driver; HTTP Library; Grid Services; Host Operating System]

Examples of PFS
% vi /gsiftp/vulture.cs.wisc.edu/etc/hosts
% grep phone /http/www.cs.wisc.edu/
% gcc /nest/turkey.cs.wisc.edu/input.c -o /kangaroo/khaki.ncsa.uiuc.edu/output

The Pluggable File System: A Kernel on Top of a Kernel
[Figure labels: Local Driver; SRB Driver; Kangaroo Driver; GridFTP Driver; NeST Driver; HTTP Driver; Host Operating System; File Descriptors (slots 0 to 12); File Pointers (entries 65, 1001, 0, 150, 126); File Objects (/tmp/input; /gsiftp /host/ out.10; /srb /host /tmp/data; /kangaroo /host /etc/hosts); Current Working Directory; Mount Table; namei]

Not a Complete Virtual OS
Does not address process management, synchronization, etc.
Complete enough to be put to good use with real, non-trivial applications:
  Gaussian - atomic model simulation
  CMSIM - simulation of CERN LHC
  POVray - ray tracing software
Structure and concept are developed enough to explore other OS issues - others welcome!

Top-Level Error Space
A single namespace of integer errors that apply to all levels of the system.
Any call is free to return any possible error. (124)
General vs. specific: ENOENT vs. ECHILD
Some artifacts: EACCES vs. EPERM; EADV and EDOTDOT

EPERM     1   /* Operation not permitted */
ENOENT    2   /* No such file or directory */
ESRCH     3   /* No such process */
EINTR     4   /* Interrupted system call */
EIO       5   /* I/O error */
ENXIO     6   /* No such device or address */
E2BIG     7   /* Arg list too long */
ENOEXEC   8   /* Exec format error */
EBADF     9   /* Bad file number */
ECHILD   10   /* No child processes */
EAGAIN   11   /* Try again */
ENOMEM   12   /* Out of memory */
EACCES   13   /* Permission denied */

Concrete Problems and Solutions
Too little information - file transfer replies (FTP)
  Stick your head in the sand.
  Grope in the dark.
  Never forget a face.
Too much information - infinite namespace (SRB)
  Divide and conquer.
  Appeal to a higher power.
New failure modes - login errors (Globus)
  Take it easy.
  Split hairs.

The Problem of Too Little Information

Too Little Information: FTP Replies
Integer codes indicate the severity of a response to an action.
Many transfer problems are identified, but few file system problems are.
Third digit specified infrequently, and for wide classes of errors.

First digit:
100 - Positive Preliminary
200 - Positive Completion
300 - Positive Intermediate
400 - Transient Negative
500 - Permanent Negative

Second digit:
000 - Syntax
010 - Information
020 - Connections
030 - Authentication
040 - Unspecified
050 - File System

550: “e.g. File not found, no access”

Too Little Information: FTP Replies
[Figure: the App issues “open datafile” to the Virtual OS, which passes it to the FTP Driver; the driver sends “GET datafile” to the FTP Server; the server replies “550: Pas de tellement lime ou repertoire.” Which error should the driver report: ENOENT, EACCES, EISDIR...?]

Too Little Information: “Stick Your Head in the Sand”
If you don’t understand the failure, keep trying until the result is acceptable.
Might work for transient errors.
Might even work for the savvy user that can identify and fix problems.
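The reply-code layout and the “Head in the Sand” strategy above can be combined in a small sketch. This is only an illustration, not code from PFS: the function names, the `attempt` callback, and the choice to bound the retries and to retry only on 4xx replies are assumptions made here. The slide itself describes a blind retry.

```python
import time

# First digit of an FTP reply: severity of the response.
SEVERITY = {
    1: "positive preliminary",
    2: "positive completion",
    3: "positive intermediate",
    4: "transient negative",
    5: "permanent negative",
}

# Second digit of an FTP reply: broad category.
CATEGORY = {
    0: "syntax",
    1: "information",
    2: "connections",
    3: "authentication",
    4: "unspecified",
    5: "file system",
}


def classify_reply(reply):
    """Split the three-digit code at the start of a reply into (severity, category)."""
    code = int(reply[:3])
    severity = SEVERITY.get(code // 100, "unknown")
    category = CATEGORY.get((code // 10) % 10, "unknown")
    return severity, category


def head_in_the_sand(attempt, max_tries=3, delay=0.0):
    """Call attempt() again while the reply is a transient negative (4xx).

    attempt is a hypothetical zero-argument callable returning one reply line.
    """
    reply = attempt()
    for _ in range(max_tries - 1):
        if classify_reply(reply)[0] != "transient negative":
            break
        time.sleep(delay)
        reply = attempt()
    return reply
```

Note that a 550 reply classifies only as “permanent negative, file system”: the code alone cannot separate ENOENT from EACCES or EISDIR, which is the gap the following strategies try to close.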
Too Little Information: “Grope in the Dark”
if GET succeeds
    return success
else
    if CHDIR succeeds
        return EISDIR
    else
        if LIST succeeds
            return EACCES
        else
            return ENOENT
        end
    end
end
[Figure labels: GET; CHDIR; LIST; EACCES]

Too Little Information: “Never Forget a Face”
Each error condition has a signature:
  Server identifier: “wuftpd 4.1 ftp.cs”
  Operation attempted: “GET”
  Message in reply: “550: Pas de tallenmand.”
First “Grope” and then cache the determined error along with the signature.
Problems:
  Server must be consistent.
  Groping is not atomic.
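A minimal sketch of how “Grope in the Dark” and “Never Forget a Face” might fit together, assuming a driver that treats the reply text as an opaque signature. `FakeFtp`, its reply strings, and `open_remote` are hypothetical stand-ins written for this example; they are not a real FTP library API or the PFS implementation.

```python
import errno


class FakeFtp:
    """Hypothetical stand-in for an FTP connection (not a real library API)."""

    server_id = "wuftpd 4.1 ftp.cs"
    # Opaque reply texts: the driver never parses them, it only compares them.
    REPLIES = {"dir": "550: message A", "unreadable": "550: message B", None: "550: message C"}

    def __init__(self, entries):
        self.entries = entries  # path -> "file", "dir", or "unreadable"
        self.probes = 0         # extra CHDIR/LIST commands sent while groping

    def get(self, path):
        kind = self.entries.get(path)
        if kind == "file":
            return True, "226: ok"
        return False, self.REPLIES[kind]

    def chdir(self, path):
        self.probes += 1
        return self.entries.get(path) == "dir"

    def list(self, path):
        self.probes += 1
        return path in self.entries


def grope(ftp, path):
    """After a failed GET, probe with CHDIR and LIST to guess the cause."""
    if ftp.chdir(path):
        return errno.EISDIR
    if ftp.list(path):
        return errno.EACCES
    return errno.ENOENT


def open_remote(ftp, path, cache):
    """Return 0 on success, else an errno; grope only for unseen signatures."""
    ok, message = ftp.get(path)
    if ok:
        return 0
    signature = (ftp.server_id, "GET", message)
    if signature not in cache:
        cache[signature] = grope(ftp, path)
    return cache[signature]
```

The cache key is the (server, operation, message) triple from the slide. The sketch inherits both listed problems: if the server reuses one message for different causes, the cached answer is wrong, and the probes run as separate commands, so the remote state can change between them.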
The Problem of Too Much Information

Too Much Info: SRB Replies
Multiplexes many server back ends into one client interface.
Error space is an amalgam of all back end error spaces.
Any call may return any error.
1026 and growing!

UNIX_EPERM              -1301
UNIX_ENOENT             -1302
. . .
UNIX_EDEADLOCK          -1356
HPSS_EPERM              -1401
HPSS_ENOENT             -1402
. . .
HPSS_NOCOS              -1499
MCAT_OPEN_ERROR         -3001
MCAT_CONNECT_ERROR      -3002
. . .
MCAT_USER_NOT_IN_DOMN   -3032
SQL_RSLT_TOO_LONG       -1600
HTTP_ERR_BAD_PATH       -1700

Too Much Information: “Divide and Conquer”
[Figure: the SRB error codes above are divided among EPERM, ENOENT, ESRCH, EINTR, EIO, EACCES, EISDIR, and OTHER]

“Appeal to a Higher Power”
[Figure: the App issues “open datafile”, which passes through the Virtual OS and the SRB Driver to the SRB Server; the server returns HPSS_NOCOS, which falls under OTHER]
Throw an exception. Kill the process. Dump core.
Human: “Cannot assign a COS.” A)bort R)etry F)ail? EACCES, ENOENT, or EISDIR?

The Problem of New Failure Modes

New Failure Modes: Login Errors
[Figure labels: App; Virtual OS; GSI Driver; GSI Resource; open datafile; GET datafile; Protocol Negotiation; Find Identity; Identify Certificate; Authentication; Authorization; ?; EPERM, EACCES, EPROTO...?]

New Failure Modes: Login Errors
Hierarchy of error objects, much like Java.
Errors may be identified by individual type or their membership in a more general type.

class Error {
    Error  trigger;
    Module place_in_code;
    Object thing_in_question;
    String message;
};

[Figure labels, error hierarchy: Error; Authentication; Authorization; Communication; No Creds; Expired Creds; No Trust]

New Failure Modes: “Take it Easy”
Easy for program to interpret and react.
Difficult for a human to debug.
[Figure: No identity, Couldn’t Authenticate, Not Authorized, and Protocol Not Supp. are all reported as EACCES]
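The error hierarchy and the “Take it Easy” conversion can be sketched as follows. The class names are invented for this example, and placing No Creds, Expired Creds, and No Trust under Authentication is an assumption about the figure, not something the slide text confirms.

```python
import errno


class GridError(Exception):
    """Base error carrying the fields of the slide's Error class."""

    def __init__(self, message, trigger=None, place_in_code=None, thing_in_question=None):
        super().__init__(message)
        self.message = message
        self.trigger = trigger                      # lower-level error that caused this one
        self.place_in_code = place_in_code          # module that raised it
        self.thing_in_question = thing_in_question  # object being operated on


class AuthenticationError(GridError):
    pass


class AuthorizationError(GridError):
    pass


class CommunicationError(GridError):
    pass


# Assumed placement: these three refine AuthenticationError.
class NoCredentialsError(AuthenticationError):
    pass


class ExpiredCredentialsError(AuthenticationError):
    pass


class NoTrustError(AuthenticationError):
    pass


def take_it_easy(error):
    """Collapse every login failure to EACCES, whatever its specific type."""
    if isinstance(error, GridError):
        return errno.EACCES
    raise TypeError("not a login error")
```

A caller can test membership at whatever level it cares about, for example `isinstance(e, AuthenticationError)`, while `take_it_easy` deliberately discards that detail: simple for the program, harder for the person debugging.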
New Failure Modes: “Split Hairs”
Preserves unique error types for the savvy user.
Program may not be prepared to react to arbitrary error values.
[Figure labels: No identity; Couldn’t Authenticate; Not Authorized; Protocol Not Supp.; EPERM; EACCES; EPROTO; ESRCH]

New Failure Modes: Rocks Solution
“Reliable Sockets” by Vic Zandy
Give a general error code along the standard channel.
Give a detailed message along a back channel.
[Figure labels: App; Reliable Sockets; Standard Sockets; Connection Refused; Connection Lost; rserrno; Reconnection Timeout Expired]

A Toolbox for Error Conversions
Simple Conversions:
“Take it Easy”
“Split Hairs”
“Divide and Conquer”
“Grope in the Dark”
“Never Forget a Face”
“Appeal to a Higher Power”
“Stick your Head in the Sand”
[Arrow label: Increasing Cost]

Error Accuracy Can Be a Performance Concern
We can always find some way to produce a correct - even if undesired - execution.
But:
  An “Appeal to a Higher Power” causes badput.
  “Groping in the Dark” yields high latencies.
  “Head in the Sand” may keep trying when no automatic recovery is possible...
  ...or, a failure to retry results in unnecessary user interaction.

Hints for Error Design
1 - Express errors in terms of the interface.
2 - Assume the audience is a program.
3 - Leave room to expand, but avoid using it.
4 - Give the essence, not the details.

1 - Express Errors in Terms of the Interface
Essence of separation of interface and implementation.
The user of an interface should not see a “moving target” as the implementation changes.
[Figure labels: Application; File Interface; Disk Impl; Network Impl; Memory Impl; ?]

2 - Assume the Audience is a Program
A computer-readable error can be used as the basis for a decision at any level.
A human-readable error can only result in a blind retry or an Appeal.
Computer-readable errors are easily made human-readable.
[Figure labels: Layer 0; Layer 1; Layer 2; Human; Error Text (?, ?); Error Code (a Decision at each level)]
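A small sketch of hint 2, assuming a two-layer stack invented for this example: the lower layer reports an integer code, the upper layer makes a decision on it, and text for a human is produced only at the very end with the standard `os.strerror`. The layer functions and the `failures` table are hypothetical.

```python
import errno
import os


def layer0_read(path, failures):
    """Lowest layer: report the outcome as an errno code (0 means success).

    failures is a hypothetical table mapping paths to the error they produce.
    """
    return failures.get(path, 0)


def layer1_open(path, failures, fallback=None):
    """Middle layer: decide on the code; fall back only when the file is missing."""
    code = layer0_read(path, failures)
    if code == errno.ENOENT and fallback is not None:
        return layer0_read(fallback, failures)
    return code


def explain(code):
    """Top layer: turn the code into text only when a human needs to read it."""
    return "Success" if code == 0 else os.strerror(code)
```

Had `layer0_read` returned only a sentence, `layer1_open` could not have told a missing file (worth trying the fallback) from a permission problem (not worth it), which is the blind retry or Appeal the slide warns about.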
3 - Leave Room to Expand ...but Avoid Using It
Any significantly different implementation of an interface will introduce new failure modes.
Possibilities for a new failure:
  Best case: fit it into an existing error.
  Medium case: return “unknown error.”
  Worst case: “Appeal to a Higher Power.”

4 - Give the Essence, not the Details
The details distract the caller from the nature of the problem and result in cascading “Appeals.”
Example in file systems: “Fell off the end of the directory linked list.” or “No file by that name.”
Example in networking: “Timer went off, but no network interrupt received.” or “Connection lost.”
Example in security: “Failure in PEM_do_header while reading password.” or “You have no credentials.”
A restatement of hint #1.

Hall of Fame
All authors remain anonymous.
“Error in return value.”
“A system call failed!”
“Could not execute job. Reason: Success”

In Summary
Error management is part of the “art” of software engineering.
The importance and the difficulty of error management are magnified in a virtual operating system.
All errors have some value, but low-signal errors result in performance problems.
Hints for error interface design.

Contact Info
Douglas Thain
thain@cs.wisc.edu
Software and other info: http://www.cs.wisc.edu/condor/pfs
Questions and discussion?