What really happened on Mars?

Tags: Mars | Published: 2011-07-08 01:31 | Author: (author unknown) smile
Source: http://www.feedzshare.com

From: www.yeeyan.org - FeedzShare
Published: 2011-07-07

Original article: What really happened on Mars?
Translator: kneep

The Mars Pathfinder mission was widely proclaimed as "flawless" in the early days after its July 4th, 1997 landing on the Martian surface. Successes included its unconventional "landing" -- bouncing onto the Martian surface surrounded by airbags, deploying the Sojourner rover, and gathering and transmitting voluminous data back to Earth, including the panoramic pictures that were such a hit on the Web. But a few days into the mission, not long after Pathfinder started gathering meteorological data, the spacecraft began experiencing total system resets, each resulting in losses of data. The press reported these failures in terms such as "software glitches" and "the computer was trying to do too many things at once".

This week at the IEEE Real-Time Systems Symposium I heard a fascinating keynote address by David Wilner, Chief Technical Officer of Wind River Systems. Wind River makes VxWorks, the real-time embedded systems kernel that was used in the Mars Pathfinder mission. In his talk, he explained in detail the actual software problems that caused the total system resets of the Pathfinder spacecraft, how they were diagnosed, and how they were solved. I wanted to share his story with each of you.

VxWorks provides preemptive priority scheduling of threads. Tasks on the Pathfinder spacecraft were executed as threads with priorities that were assigned in the usual manner reflecting the relative urgency of these tasks.
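As a rough illustration of preemptive priority scheduling, the sketch below always runs the highest-priority task that is ready. The task names, the numeric priorities, and the convention that a larger number means more urgent are assumptions made for this example; none of them come from the Pathfinder software or from VxWorks.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int        # larger = more urgent (a convention of this sketch only)
    ready: bool = True


def pick_next(tasks):
    """Return the ready task with the highest priority, or None if none is ready."""
    ready = [t for t in tasks if t.ready]
    return max(ready, key=lambda t: t.priority, default=None)


tasks = [Task("weather", 1), Task("comms", 2), Task("bus", 3, ready=False)]
first = pick_next(tasks)     # the bus task is not ready, so comms is chosen
tasks[2].ready = True        # an interrupt makes the bus task ready
second = pick_next(tasks)    # the bus task now preempts comms
```

Re-running the selection whenever a task becomes ready is what makes the policy preemptive: the running task loses the CPU as soon as something more urgent can run.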

Pathfinder contained an "information bus", which you can think of as a shared memory area used for passing information between different components of the spacecraft. A bus management task ran frequently with high priority to move certain kinds of data in and out of the information bus. Access to the bus was synchronized with mutual exclusion locks (mutexes).

The meteorological data gathering task ran as an infrequent, low priority thread, and used the information bus to publish its data. When publishing its data, it would acquire a mutex, do writes to the bus, and release the mutex. If an interrupt caused the information bus thread to be scheduled while this mutex was held, and if the information bus thread then attempted to acquire this same mutex in order to retrieve published data, this would cause it to block on the mutex, waiting until the meteorological thread released the mutex before it could continue. The spacecraft also contained a communications task that ran with medium priority.
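The acquire, write, release pattern described above might look roughly like the following. This is only a sketch using Python's standard threading module; the InformationBus class and its method names are hypothetical and do not reflect the actual Pathfinder code.

```python
import threading


class InformationBus:
    """A shared area guarded by one mutex (hypothetical stand-in for the real bus)."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._items = []

    def publish(self, item):
        # Acquire the mutex, write to the bus, release the mutex.
        with self._mutex:
            self._items.append(item)

    def retrieve_all(self):
        # The bus management task takes the same mutex to drain published data.
        with self._mutex:
            items, self._items = self._items, []
            return items


bus = InformationBus()


def weather_task():
    for reading in range(100):
        bus.publish(("temperature", reading))


publisher = threading.Thread(target=weather_task)
publisher.start()
publisher.join()
drained = bus.retrieve_all()
```

Any caller of retrieve_all waits while a publisher holds the mutex, which is exactly the blocking that sets up the failure described next.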

Most of the time this combination worked fine. However, very infrequently it was possible for an interrupt to occur that caused the (medium priority) communications task to be scheduled during the short interval while the (high priority) information bus thread was blocked waiting for the (low priority) meteorological data thread. In this case, the long-running communications task, having higher priority than the meteorological task, would prevent it from running, consequently preventing the blocked information bus task from running. After some time had passed, a watchdog timer would go off, notice that the data bus task had not been executed for some time, conclude that something had gone drastically wrong, and initiate a total system reset.

This scenario is a classic case of priority inversion.
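The watchdog's part in this failure can be sketched in a few lines. The class below is a simplified assumption about how such a timer behaves; the timeout value and tick numbers are invented for illustration.

```python
class Watchdog:
    """Reports a fault when the monitored task has not checked in for `timeout` ticks."""

    def __init__(self, timeout, now=0):
        self.timeout = timeout
        self.last_kick = now

    def kick(self, now):
        """Called each time the monitored task completes a cycle."""
        self.last_kick = now

    def expired(self, now):
        return now - self.last_kick >= self.timeout


dog = Watchdog(timeout=8)
dog.kick(now=1)                     # the bus task last ran to completion at tick 1
ok_at_5 = dog.expired(now=5)        # still inside the allowed window
reset_at_9 = dog.expired(now=9)     # eight ticks with no kick: trigger a reset
```

The watchdog cannot tell why the bus task stopped running. It only sees that a high-priority task has gone silent, which is why the symptom was a total reset rather than a specific error.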

HOW WAS THIS DEBUGGED?

VxWorks can be run in a mode where it records a total trace of all interesting system events, including context switches, uses of synchronization objects, and interrupts. After the failure, JPL engineers spent hours and hours running the system on the exact spacecraft replica in their lab with tracing turned on, attempting to replicate the precise conditions under which they believed that the reset occurred. Early in the morning, after all but one engineer had gone home, the engineer finally reproduced a system reset on the replica. Analysis of the trace revealed the priority inversion.
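To give a feel for how a trace can expose the inversion, here is a sketch that scans a made-up trace. The record layout, task names, and priorities are all assumptions for this example; the article does not describe the real VxWorks trace format.

```python
# Hypothetical priorities; larger = more urgent.
PRIORITY = {"bus": 3, "comms": 2, "weather": 1}

# Each record: (tick, task that ran, {blocked task: task holding the mutex it wants}).
trace = [
    (0, "weather", {}),
    (1, "weather", {"bus": "weather"}),
    (2, "comms", {"bus": "weather"}),
    (3, "comms", {"bus": "weather"}),
]


def find_inversions(trace, priority):
    """Return (tick, running, waiter) wherever a task other than the mutex holder
    runs while a higher-priority task is blocked on that mutex."""
    hits = []
    for tick, running, blocked in trace:
        for waiter, holder in blocked.items():
            if running != holder and priority[running] < priority[waiter]:
                hits.append((tick, running, waiter))
    return hits


inversions = find_inversions(trace, PRIORITY)
# -> [(2, "comms", "bus"), (3, "comms", "bus")]
```

Tick 1 is not flagged because the mutex holder itself is running, which is the normal, short-lived case. Only the ticks where an unrelated task runs ahead of the blocked bus task count as inversion.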

HOW WAS THE PROBLEM CORRECTED?

When created, a VxWorks mutex object accepts a boolean parameter that indicates whether priority inheritance should be performed by the mutex. The mutex in question had been initialized with the parameter off; had it been on, the low-priority meteorological thread would have inherited the priority of the high-priority data bus thread blocked on it while it held the mutex, causing it to be scheduled with higher priority than the medium-priority communications task, thus preventing the priority inversion. Once diagnosed, it was clear to the JPL engineers that using priority inheritance would prevent the resets they were seeing.
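The effect of that one boolean can be seen in a small tick-by-tick simulation of the three tasks described in this article. This is a sketch under invented numbers (release times, work lengths, priorities, and the watchdog limit are all assumptions), not the VxWorks API or the flight code.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    base_priority: int          # larger = more urgent (sketch convention)
    release: int                # tick at which the task becomes ready
    work: int                   # ticks of CPU time still needed
    needs_mutex: bool = False
    priority: int = 0           # effective priority, may be boosted by inheritance
    blocked: bool = False
    finished_at: int = -1

    def __post_init__(self):
        self.priority = self.base_priority


def simulate(inherit, watchdog_limit=8, max_ticks=40):
    """Return (reset_tick, bus_finish_tick); one of the two is None."""
    weather = Task("weather", 1, release=0, work=3, needs_mutex=True)
    bus = Task("bus", 3, release=1, work=1, needs_mutex=True)
    comms = Task("comms", 2, release=2, work=10)
    tasks = [weather, bus, comms]
    owner = None                                    # task holding the mutex

    for tick in range(max_ticks):
        # Watchdog: the bus task must finish within watchdog_limit ticks of release.
        if bus.finished_at < 0 and tick - bus.release >= watchdog_limit:
            return tick, None                       # total system reset
        while True:
            ready = [t for t in tasks
                     if t.release <= tick and t.work > 0 and not t.blocked]
            if not ready:
                break
            task = max(ready, key=lambda t: t.priority)
            if task.needs_mutex and owner is not None and owner is not task:
                task.blocked = True                 # mutex taken: block, pick again
                if inherit:                         # holder inherits waiter's priority
                    owner.priority = max(owner.priority, task.priority)
                continue
            if task.needs_mutex:
                owner = task
            task.work -= 1                          # run the chosen task for one tick
            if task.work == 0:
                task.finished_at = tick
                if owner is task:                   # release mutex, drop any boost
                    owner = None
                    task.priority = task.base_priority
                    for t in tasks:
                        t.blocked = False
            break
    return None, bus.finished_at


without_inheritance = simulate(inherit=False)   # -> (9, None): watchdog reset
with_inheritance = simulate(inherit=True)       # -> (None, 3): bus task completes
```

With inheritance off, the communications task runs from tick 2 onward while the bus task stays blocked, and the watchdog fires at tick 9. With inheritance on, the weather task is boosted above the communications task, releases the mutex at tick 2, and the bus task finishes at tick 3.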

VxWorks contains a C language interpreter intended to allow developers to type in C expressions and functions to be executed on the fly during system debugging. The JPL engineers fortuitously decided to launch the spacecraft with this feature still enabled. By coding convention, the initialization parameters for the mutex in question (and for two others which could have caused the same problem) were stored in global variables, whose addresses were in symbol tables also included in the launch software, and available to the C interpreter. A short C program was uploaded to the spacecraft which, when interpreted, changed the values of these variables from FALSE to TRUE. No more system resets occurred.

ANALYSIS AND LESSONS

First and foremost, diagnosing this problem as a black box would have been impossible. Only detailed traces of actual system behavior enabled the faulty execution sequence to be captured and identified.

Secondly, leaving the "debugging" facilities in the system saved the day. Without the ability to modify the system in the field, the problem could not have been corrected.

Finally, the engineers' initial analysis that "the data bus task executes very frequently and is time-critical -- we shouldn't spend the extra time in it to perform priority inheritance" was exactly wrong. It is precisely in such time-critical and important situations that correctness is essential, even at some additional performance cost.

HUMAN NATURE, DEADLINE PRESSURES

David told us that the JPL engineers later confessed that one or two system resets had occurred in their months of pre-flight testing. They had never been reproducible or explainable, and so the engineers, in a very human-nature response of denial, decided that they probably weren't important, using the rationale "it was probably caused by a hardware glitch".

Part of it too was the engineers' focus. They were extremely focused on ensuring the quality and flawless operation of the landing software. Should it have failed, the mission would have been lost. It is entirely understandable for the engineers to discount occasional glitches in the less-critical land-mission software, particularly given that a spacecraft reset was a viable recovery strategy at that phase of the mission.

THE IMPORTANCE OF GOOD THEORY/ALGORITHMS

David also said that some of the real heroes of the situation were some people from CMU who first identified the priority inversion problem and proposed the solution, in a paper he had heard presented many years earlier. He apologized for not remembering the precise details of the paper or who wrote it. Bringing things full circle, it turned out that the three authors of this result were all in the room, and at the end of the talk they were encouraged by the program chair to stand and be acknowledged. They were Lui Sha, John Lehoczky, and Raj Rajkumar. When was the last time you saw a room of people cheer a group of computer science theorists for their significant practical contribution to advancing human knowledge? :-) It was quite a moment.

POSTLUDE

For the record, the paper was:

L. Sha, R. Rajkumar, and J. P. Lehoczky. Priority Inheritance Protocols: An Approach to Real-Time Synchronization. In IEEE Transactions on Computers, vol. 39, pp. 1175-1185, Sep. 1990.
