What Really Happened on Mars?
From: www.yeeyan.org - FeedzShare
Published: July 7, 2011
Source: What really happened on Mars?
Translator: kneep
The Mars Pathfinder mission was widely proclaimed as "flawless" in the early days after its July 4th, 1997 landing on the Martian surface. Successes included its unconventional "landing" (bouncing onto the Martian surface surrounded by airbags), deploying the Sojourner rover (translator's note: a small robotic vehicle), and gathering and transmitting voluminous data back to Earth, including the panoramic pictures that were such a hit on the Web. But a few days into the mission, not long after Pathfinder started gathering meteorological data, the spacecraft began experiencing total system resets, each resulting in losses of data. The press reported these failures in terms such as "software glitches" and "the computer was trying to do too many things at once".
This week at the IEEE Real-Time Systems Symposium I heard a fascinating keynote address by David Wilner, Chief Technical Officer of Wind River Systems. Wind River makes VxWorks, the real-time embedded systems kernel that was used in the Mars Pathfinder mission. In his talk, he explained in detail the actual software problems that caused the total system resets of the Pathfinder spacecraft, how they were diagnosed, and how they were solved. I wanted to share his story with each of you.
VxWorks provides preemptive priority scheduling of threads. Tasks on the Pathfinder spacecraft were executed as threads, with priorities assigned in the usual manner to reflect the relative urgency of these tasks.
Pathfinder contained an "information bus", which you can think of as a shared memory area used for passing information between different components of the spacecraft. A bus management task ran frequently with high priority to move certain kinds of data in and out of the information bus. Access to the bus was synchronized with mutual exclusion locks (mutexes).
The meteorological data gathering task ran as an infrequent, low priority thread, and used the information bus to publish its data. When publishing its data, it would acquire a mutex, do writes to the bus, and release the mutex. If an interrupt caused the information bus thread to be scheduled while this mutex was held, and if the information bus thread then attempted to acquire this same mutex in order to retrieve published data, it would block on the mutex, waiting until the meteorological thread released the mutex before it could continue. The spacecraft also contained a communications task that ran with medium priority.
Most of the time this combination worked fine. However, very infrequently it was possible for an interrupt to occur that caused the (medium priority) communications task to be scheduled during the short interval while the (high priority) information bus thread was blocked waiting for the (low priority) meteorological data thread. In this case, the long-running communications task, having higher priority than the meteorological task, would prevent it from running, and consequently prevent the blocked information bus task from running. After some time had passed, a watchdog timer would go off, notice that the information bus task had not been executed for some time, conclude that something had gone drastically wrong, and initiate a total system reset.
This scenario is a classic case of priority inversion.
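The sequence of events described above can be replayed in a small tick-by-tick model. This is only an illustrative sketch, not the flight software: the task names, priority numbers, step lists, arrival ticks, and the watchdog limit are all assumptions chosen to make the inversion visible, and the mutex here has no priority inheritance.

```python
def simulate(watchdog_limit=5, max_ticks=40):
    """Tick-by-tick model of three tasks sharing one mutex (no inheritance)."""
    # Hypothetical workload. Larger number = higher priority; each step is one tick.
    tasks = {
        "weather": {"prio": 1, "arrive": 0, "steps": ["lock", "run", "run", "unlock"]},
        "bus":     {"prio": 3, "arrive": 1, "steps": ["lock", "run", "unlock"]},
        "comms":   {"prio": 2, "arrive": 2, "steps": ["run"] * 10},
    }
    owner = None                           # current holder of the mutex
    bus_last_ran = tasks["bus"]["arrive"]  # reference point for the watchdog
    timeline = []                          # which task ran on each tick

    for tick in range(max_ticks):
        # Runnable: arrived, has work left, and not waiting on a mutex
        # that some other task currently holds.
        runnable = [
            name for name, t in tasks.items()
            if t["arrive"] <= tick and t["steps"]
            and not (t["steps"][0] == "lock" and owner not in (None, name))
        ]
        if runnable:
            # Preemptive priority scheduling: the highest priority always wins.
            name = max(runnable, key=lambda n: tasks[n]["prio"])
            step = tasks[name]["steps"].pop(0)
            if step == "lock":
                owner = name
            elif step == "unlock":
                owner = None
            if name == "bus":
                bus_last_ran = tick
            timeline.append(name)
        else:
            timeline.append("idle")

        # Watchdog: the bus task still has work but has not run for too long.
        if tasks["bus"]["steps"] and tick - bus_last_ran >= watchdog_limit:
            return "reset", tick, timeline
    return "ok", max_ticks, timeline


status, tick, timeline = simulate()
print(status, tick, timeline)
```

With these assumed numbers the low priority task takes the mutex, the high priority bus task blocks on it, and the medium priority task then occupies the processor until the watchdog fires. Raising `watchdog_limit` lets the run finish, but the bus task still waits for the entire communications burst.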
HOW WAS THIS DEBUGGED?
VxWorks can be run in a mode where it records a total trace of all interesting system events, including context switches, uses of synchronization objects, and interrupts. After the failure, JPL engineers spent hours and hours running the system on the exact spacecraft replica in their lab with tracing turned on, attempting to replicate the precise conditions under which they believed that the reset occurred. Early in the morning, after all but one engineer had gone home, that engineer finally reproduced a system reset on the replica. Analysis of the trace revealed the priority inversion.
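To give a feel for what "analysis of the trace" can mean, here is a minimal sketch that scans a list of events for the inversion pattern. The event format `(tick, event, task, mutex)`, the event names `switch`, `take`, `block`, and `give`, and the task and mutex names are all invented for illustration; they are not the VxWorks trace format.

```python
def find_inversions(trace, prio):
    """Report ticks where a task runs while a higher-priority task is blocked
    on a mutex held by a lower-priority task (larger number = higher priority)."""
    owner = {}     # mutex name -> task currently holding it
    blocked = {}   # blocked task -> mutex it is waiting for
    findings = []
    for tick, event, task, mutex in trace:
        if event == "take":        # task acquired the mutex
            owner[mutex] = task
            blocked.pop(task, None)
        elif event == "give":      # task released the mutex
            owner.pop(mutex, None)
        elif event == "block":     # task started waiting for the mutex
            blocked[task] = mutex
        elif event == "switch":    # context switch: task starts running
            for waiter, m in blocked.items():
                holder = owner.get(m)
                if holder is not None and prio[holder] < prio[task] < prio[waiter]:
                    findings.append((tick, task, waiter, holder))
    return findings


# Hypothetical trace of the scenario described earlier.
prio = {"weather": 1, "comms": 2, "bus": 3}
trace = [
    (0, "switch", "weather", None),
    (0, "take", "weather", "bus_mutex"),
    (1, "switch", "bus", None),
    (1, "block", "bus", "bus_mutex"),
    (1, "switch", "weather", None),
    (2, "switch", "comms", None),
]
print(find_inversions(trace, prio))
```

Each finding names the tick, the task that ran, the blocked high priority task, and the mutex holder that was kept off the processor.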
HOW WAS THE PROBLEM CORRECTED?
When created, a VxWorks mutex object accepts a boolean parameter that indicates whether priority inheritance should be performed by the mutex. The mutex in question had been initialized with the parameter off. Had it been on, the low-priority meteorological thread would have inherited the priority of the high-priority information bus thread blocked on it while it held the mutex, causing it to be scheduled with higher priority than the medium-priority communications task, thus preventing the priority inversion. Once the problem was diagnosed, it was clear to the JPL engineers that using priority inheritance would prevent the resets they were seeing.
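The effect of that boolean can be sketched with a toy mutex model. The `Mutex` class, `pick_next` function, and the priority numbers below are hypothetical stand-ins written for this illustration, not the VxWorks API; the `inherit` flag only plays the role of the creation-time parameter described above.

```python
class Mutex:
    """Toy mutex model with optional priority inheritance (hypothetical)."""

    def __init__(self, inherit):
        self.inherit = inherit   # stands in for the boolean creation parameter
        self.owner = None        # task currently holding the mutex
        self.waiters = []        # tasks blocked on the mutex

    def effective_priority(self, task, base):
        """Priority the scheduler should use for `task`."""
        if self.inherit and task == self.owner and self.waiters:
            # The holder temporarily runs at the priority of its most
            # urgent waiter.
            return max([base[task]] + [base[w] for w in self.waiters])
        return base[task]


def pick_next(ready, mutex, base):
    """Preemptive priority scheduling over effective priorities."""
    return max(ready, key=lambda t: mutex.effective_priority(t, base))


base = {"weather": 1, "comms": 2, "bus": 3}   # larger number = more urgent


def scenario(inherit):
    m = Mutex(inherit)
    m.owner = "weather"          # low priority task holds the mutex
    m.waiters.append("bus")      # high priority task is blocked on it
    # Only the weather and comms tasks are ready to run at this point.
    return pick_next(["weather", "comms"], m, base)


print(scenario(inherit=False))
print(scenario(inherit=True))
```

With inheritance off, the scheduler picks the communications task and the inversion occurs. With it on, the weather task runs at the bus task's priority, finishes its critical section, and releases the mutex.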
VxWorks contains a C language interpreter intended to allow developers to type in C expressions and functions to be executed on the fly during system debugging. The JPL engineers fortuitously decided to launch the spacecraft with this feature still enabled. By coding convention, the initialization parameters for the mutex in question (and those for two others which could have caused the same problem) were stored in global variables, whose addresses were in symbol tables also included in the launch software and available to the C interpreter. A short C program was uploaded to the spacecraft which, when interpreted, changed the values of these variables from FALSE to TRUE. No more system resets occurred.
ANALYSIS AND LESSONS
First and foremost, diagnosing this problem as a black box would have been impossible. Only detailed traces of actual system behavior enabled the faulty execution sequence to be captured and identified.
Secondly, leaving the "debugging" facilities in the system saved the day. Without the ability to modify the system in the field, the problem could not have been corrected.
Finally, the engineers' initial analysis that "the data bus task executes very frequently and is time-critical -- we shouldn't spend the extra time in it to perform priority inheritance" was exactly wrong. It is precisely in such time-critical and important situations that correctness is essential, even at some additional performance cost.
HUMAN NATURE, DEADLINE PRESSURES
David told us that the JPL engineers later confessed that one or two system resets had occurred in their months of pre-flight testing. The resets had never been reproducible or explainable, and so the engineers, in a very human response of denial, decided that they probably weren't important, using the rationale "it was probably caused by a hardware glitch".
Part of it, too, was the engineers' focus. They were extremely focused on ensuring the quality and flawless operation of the landing software. Should it have failed, the mission would have been lost. It is entirely understandable for the engineers to discount occasional glitches in the less-critical land-mission software, particularly given that a spacecraft reset was a viable recovery strategy at that phase of the mission.
THE IMPORTANCE OF GOOD THEORY/ALGORITHMS
David also said that some of the real heroes of the situation were some people from CMU who first identified the priority inversion problem and proposed the solution, in a paper he had heard presented many years ago. He apologized for not remembering the precise details of the paper or who wrote it. Bringing things full circle, it turns out that the three authors of this result were all in the room, and at the end of the talk were encouraged by the program chair to stand and be acknowledged. They were Lui Sha, John Lehoczky, and Raj Rajkumar. When was the last time you saw a room of people cheer a group of computer science theorists for their significant practical contribution to advancing human knowledge? :-) It was quite a moment.
POSTLUDE
For the record, the paper was:
L. Sha, R. Rajkumar, and J. P. Lehoczky. Priority Inheritance Protocols: An Approach to Real-Time Synchronization. In IEEE Transactions on Computers, vol. 39, pp. 1175-1185, Sep. 1990.