标签: 0和1 科技评论 Code Analyst DTrace linpack | 发表时间:2011-09-18 14:38 | 作者:wangleheng 张沈鹏


  LinkedIn的多核与并行计算讨论组(Multicore & Parallel Computing Group)里,刚刚发生了一次讨论,内容很有价值,尤其是对刚踏入这个领域的新人而言。所以转载到BLOG上。

  一开始,是印度小姑娘Ratika Agarwal问了个基本问题:


  How to measure the performance of any working algorithm? do any body know?


  Darryl Gove提到了各种经典工具,例如Linux下的Oprofile、Solaris下的DTrace和MDB、AMD的Code Analyst,还有Intel的VTune:


  If, by performance, you mean “How long does it take”, then you could use the time command, print out the number of units of work completed and divide one by the other. However, that doesn’t guaranty that your implementation of the algorithm is efficient.

  So I’d recommend profiling. On Linux look at Oprofile, on Solaris (or Linux) you can use the performance analyzer from Solaris Studio (just announced the beta programme for 12.3), on Windows with AMD chips they have a Code Analyst, Intel you probably want to look at VTune.

  Most of these tools will give you time based profiles and hardware performance counter data. So you can look into things like the instruction count, cache misses etc.


  John Lilley又提到了Visual Studio Team edition里面的profiler工具。他总结了影响性能的主要因素,优化的时候往往是在各个因素之间寻求平衡。


  Also on Windows, Visual Studio Team edition 2008 and 2010 has a built-in code profiler for native code. I think you might get managed code profiling in the Professional versions; however I am unsure about that. However, there are many dimensions to consider when talking about “performance” in addition to CPU. For example, you may want any of the following to optimize:

  1) Memory usage

  2) Disk usage

  3) Wall-clock time (as opposed to CPU loading, which is what a profiler will give you)

  There are many examples of such tradeoffs. For example consider the problem of sorting a file of records. The fastest sort algorithm will probably load the entire file into memory, perform a parallel sort that utilizes all CPU cores as much as possible, and will probably make a complete copy of the data (or perhaps just a copy of pointers to the data), and then write the output file. However, there are many tradeoffs to consider. What if the file is 100GB and you only have 8GB of memory? Then you must store some intermediate results on disk. You might then consider the CPU/disk tradeoff of compressing the temporary data. These are just some suggestions.


  Jim Dempsey进一步提醒,优化更多是一项工程活儿,程序性能往往受到具体环境影响:如果数据集中在内存里,那么提高缓存的本地性往往比减少需执行的指令数更重要;而如果面临I/O密集型的应用,选择适当的负载均衡技术是最重要的。


  The performance of a given algorithm is not necessarily an absolute property. In reading the prior posts you will find that there are various factors that can affect performance and therefore one algorithm is not always superior to another without considering the runtime circumstances. When your working dataset fits within memory then paying attention to improving cache locality may be more important than reducing instruction count. When I/O is involved then choosing a technique that permits computation overlapped with I/O usually is superior. Overlapping I/O or sequencing the data that is fed into an algorithm is generally not considered part of “the algorithm” but may have more of an impact on application performance than the algorithm itself.


  John Lilley顶贴,对此表示同意。


  Jim is absolutely right. The “performance” of an algorithm must be judged within the context of the program’s requirements. The kind of performance you want out of a program on a multi-CPU server with virtual memory and a fast SAN might be completely different than what you want from a similar algorithm running on a smart phone. On the server, you might optimize for performance and scalability, On the smartphone, you might optimize for power conservation, memory, and storage use.


  Ratika Agarwal表示压力很大,自己是个新手,不知道从哪里开始。


  But I’m a beginner.I’m just going to start implementing the comparative analysis of algorithms.actually I need to do a project for my M.Tech thesis.

  Thus I am thinking about, to construct some parallel algorithms of the existing sequential ones. Is that the right path what i am following, for my M.Tech project and thesis?

  Also I am a very new user for parallel programming. Don’t know from where to start.


  于是John Lilley给她推荐了书和其他学习资料。


  This book: http://books.google.com/books/about/The_Art_of_Concurrency.html?id=rU68SYVS7S8C is a good primer to parallel programming, if you have access. Also read Dmitry’s blog: http://www.1024cores.net. A goodle search on “parallel programming blog” yields good forums in the top 10. Depending on your programming language of choice, I would use “Thread Building Blocks” from Intel for C++, or Cilk++, or if you are C#/.Net, use PLINQ and TPL. On the Mac, there is Grand Central Dispatch. I’m not sure what Java people use. The maintream programming wisdom today is to think in terms of sub-tasks to execute in parallel, not in terms of threads. The aforementioned tools are all parallel task models. This is a concise illustration of parallelizing quicksort using cilk++: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/lin/cref_cls/common/cilk_convertcpp.htm. There are many interesting problems out there that that benefit from efficient parallel implementations, especially in the realm of graph analysis, sorting, searching, indexing, and so on.


  Jeff Kenton又给了些建议。


  Your original question was about ” how to measure the performance”. So, what you need to know is how much faster your parallel algorithm is than the sequential one? And how well does it scale with the addition of more cores / threads? The bottlenecks will likely turn out to be communication between the pieces, and sequential parts that you can’t eliminate.

  You will need to choose a suitable problem to solve. Implement it sequentially first so that you understand the simple version. Then decide what can be done in parallel and how to combine and communicate the results at the end. And also figure out how you will measure your results and how you will instrument the various parts so that you understand what the issues are.

  Good luck.


  Christopher Aziz提到了LINPACK指标和openMP。


  ”performance measurement” is an important, frequently misrepresented and often misunderstood topic usually associated with “timing”. Notionally, “timing” a race is simple. You click the stop watch when the starter’s gun fires and then stop it when the first runner or horse crosses the finish line. This is the lay person’s view: ” Where is the hard part?”. Certainly not in this simplistic notion. Now a 6 dimensional race with variable non-continuous environmental conditions and a difficult to calculate “finish line” boundary condition might be closer to the practical reality but is far beyond the common conception of “timing”.

  The hard part is knowing what to measure and developing a process model that let’s you confirm the sensitivity, variability, repeatability and ability to extrapolate or predict results in other cases from your measurements.

  Shifting from abstract to practical, I’d suggest taking a look at the netlib.org/benchmark suite. My favorites are the LINPACK variants. Leave the HPL for last. To get a handle on factors affecting results start scalar (1 core) and try different compiler options, machines, size of problem effects and look at run to run variation. As you posted to this multicore group, I’d suggest next converting the codes to openMP.

  For timing consistency, I like the openMP elapsed timer, omp_get_wtime, regardless of how you may effect parallelization. It is portable and generally reliable.

  You are stepping into a huge space. It is a personal favorite. To get a “feel” for the subject area, I think your best first step is to pick a small corner and get deep with it rather than survey the broader domain.

  Good Hunting.


  Eugenio Villar进一步补充


  In practice, the performance of an algorithm strongly depends on the run-time conditions on a certain processor. The problem becomes harder in multi-processing platforms. And even harder in heterogeneous multi-processing platforms. When your target is the same as the host on which you are developing the code, some of the techniques commented already can be used. If your target is different, then an ISS or a virtualization-based simulator (www.ovpworld.org) can be used. A more abstract technology not requiring the complete compilable code is native simulation (www.teisa.unican.es/scope). We are now working in keeping the accuracy even when the OS calls are removed…


  Rick Peralta耐心地逐句解答了Ratika Agarwal的困惑


  Performance is relative some metric. There is no absolute performance. You can measure performance relative computations per second. You can measure the relative computations per second based on single or multi-threaded implementations. Of you could measure the ratio of computations to space. Or the quality of the computations (e.g. bits of resolution).

> I’m a beginner.I’m just going to start implementing the comparative analysis of algorithms.

  The first step is to determine what you want to measure. Then look for ways to measure and represent it. Often performance is multi-dimensional and multi-modal, so there is not necessarily a single optimal solution.

  Assuming you are looking for simple computations per second, for various solutions, the wall clock method is a good first step. How long does it take to complete some fixed bit of work. Compile various solutions and see what the run time is. Run the same test for different implementations. On a finer resolution, a profiler can give comparable information for various parts of the code.

> Don’t know from where to start.

  For your research, I’d suggest that you build a benchmark framework that monitors whatever you are looking at (wall clock time, CPU time, memory usage, external I/O, et cetera) and drop in the various solutions. The standard framework will help keep the comparisons comparable.

  Consider the traveling salesman problem. Is it better to do the whole job single threaded, perhaps using the stack to handle branches or to drop each branch to a thread thread or queue to a pool of threads? Code up each and see what the results are. Try to keep the test environment the same, as largely invisible characteristics (such as pipeline behavior or other activity on the system) can significantly impact the results.

  As a general rule, if things are not clear, simplify. Look at a simpler problem, use simpler tools. Too often people get lost in their own sophistication.

  - Rick Keep It Simple Silly ;^)


  Eduardo Grosclaude介绍了几本参考书


  You may find Grama/Gupta/Karypis/Kumar’s book very enlightening.http://www-users.cs.umn.edu/~karypis/parbook/


  Jonathan Beard又给了不少参考资料网址,他同意Rick的说法:“Keep It Simple”


  Here’s a few books I’ve found useful that I used for class that are relatively cheap (although I find it best to check out library editions until i really find I can’t live without a book):

  The Art of Multiprocessor Programming – http://bit.ly/pd3fz9

  Computer Architecture: A quantitative approach – http://bit.ly/o7PGoD , although if you don’t have much hardware experience I’d recommend the other book as well Computer Organization and Design.

  You mention creating parallel algorithms from sequential ones, are you planning on an automated process? If so I’d recommend a compiler book, Advanced Compiler Design (http://bit.ly/qK7wm2).

  There are tons of languages to implement parallel algorithms, the choice of which I think should be left to the algorithm and how the author can best express the idea…in other words, don’t get hung up on one language or the next.

  I agree with Rick on the keep it simple. I’d also go one step further and say turn off hyperthreading (not to knock it, but it’s easier to reason about performance without it), speed stepping, synchronize clocks across cores, and when possible tie threads/processes to a specific processor so you know exactly what data you are collecting. Nothing is more frustrating than performing multiple tests only to find out that part of your data must be thrown out because a context swap or frequency increase gave inconsistent results.

  Good luck.


  Houmad Tighezza引用了1978年的一本老书里的一段话


  The question is simple but the answer good be very complex, the answer depend witch performance you want to know: numerical performance, fault tolerance, CPU, memory, vectorization/ parallelization, speed up, MPIS, MACS, etc.?

  In my opinion, the first book all about the performance evaluation techniques and they application was Wright by DOMENICO FERRARI in 1978 (PRENTICE-HALL).

  Computer Systems Performance Evaluation by Domenico Ferrari 1978 : “The author argues that systems performance evaluation, in the first twenty years of its existence, has developed in substantial isolation with respect to such disciplines as computer architecture, system organization, operating systems, and software engineering. The possible causes for this phenomenon, which seems to be unique in the history of engineering, are explored. Its positive and negative effects on computer science and technology, as well as on performance evaluation itself, are discussed. In the author’s opinion, the drawbacks of isolated development outweigh its advantages. Thus, the author proposes instructional and research initiatives to foster the rapid integration of the performance evaluation viewpoint into the main stream of computer science and engineering.”


  Antonio M Flores建议多查查Wiki,上面全是宝


  Strictly speaking, algorithms are measured by an analisys of complexity, which is a computer science mathematical concept, so it is irrelevant to the hardware platform. I suggest that you take a deep reading to the article in Wikipedia for a first insight: http://en.wikipedia.org/wiki/Analysis_of_algorithms and http://en.wikipedia.org/wiki/Time_complexity. Of course, for the same complexity algorithms, you can later do profiling, which is a programming concept and for which there is plenty of software tools available: http://en.wikipedia.org/wiki/Profiling_(computer_programming), as well to do a measure of efficiency: http://en.wikipedia.org/wiki/Algorithmic_efficiency

  For parallel versus sequential algorithms, you should be using the concept of speed-up and related: http://en.wikipedia.org/wiki/Speed_up

  Good Luck and Success!!


  最后Ratika Agarwal表示感谢。又有一个叫Agnaldo Pereira的新手感谢Ratika Agarwal问了这个好问题,看来对他也帮助很大。

joyfire 王乐珩, 中国科学院计算技术研究所, 蛋白鉴定搜索引擎pFind的架构师

技能1 计算蛋白质组学:
   基于质谱数据的蛋白质鉴定,尤其是搜索引擎方面。对生物学粗通皮毛。翻译过一本名为《The Stuff of Life》的科普漫画小册子(《漫画生命史话》)。

技能2 软件开发:
   C++、Java和Python编程;并行计算、MPI和MapReduce编程;大型软件架构;很早以前整理过《joyfire Linux笔记》。

技能3 技术管理

技能4 单板滑雪




相关 [测量 算法 性能] 推荐:


- 张沈鹏 - joyfire 王乐珩
  先写一句题外话,我是玩社交网络的,例如豆瓣、LinkedIn、42区,俺只是不玩新浪微博而已.   LinkedIn的多核与并行计算讨论组(Multicore & Parallel Computing Group)里,刚刚发生了一次讨论,内容很有价值,尤其是对刚踏入这个领域的新人而言.   一开始,是印度小姑娘Ratika Agarwal问了个基本问题:.


- - ITeye博客
数据集是movielens-1M( 下载)版本. 使用SlopeOne算法,每次随机选取6%的用户预测其喜好,进行5次实验,取MAE的均值,得到下表:. 绘制成折线图,如下图所示:.  由此可知,训练集越大,则推荐的准确率越高. 使用ItemCF算法,训练集大小为数据集的90%,每次随机选取30%的用户预测其喜好,进行5次实验,取MAE的均值,得到下表:.


- - Java译站
本文将会对常用的几个压缩算法的性能作一下比较. 结果表明,某些算法在极端苛刻的CPU限制下仍能正常工作. JDK GZIP ——这是一个压缩比高的慢速算法,压缩后的数据适合长期使用. JDK中的java.util.zip.GZIPInputStream / GZIPOutputStream便是这个算法的实现.


- - InfoQ - 促进软件开发领域知识与创新的传播
在 上一篇文章中,我们回顾了人脸识别算法的发展历程,介绍了人脸识别算法从传统机器学习算法到现在的深度学习算法的演进历程. 接下来,我们将详细介绍一下人脸识别常见的应用方式,以及现在主流的人脸识别算法. 1.人脸识别的主要应用方式. 为了讲清楚人脸识别算法的设计思路,有必要首先介绍人脸识别在实际场景中的主要的三种不同的应用方式.

忘掉 Snowflake,感受一下性能高出 587 倍的全局唯一 ID 生成算法

- - SegmentFault 最新的文章
今天我们来拆解 Snowflake 算法,同时领略百度、美团、腾讯等大厂在全局唯一 ID 服务方面做的设计,接着根据具体需求设计一款全新的全局唯一 ID 生成算法. 这还不够,我们会讨论到全局唯一 ID 服务的分布式 CAP 选择与性能瓶颈. 已经熟悉 Snowflake 的朋友可以先去看大厂的设计和权衡.


- lostsnow - 小彰
没有人能说清哪种缓存算法由于其他的缓存算法. (以下的几种缓存算法,有的我也理解不好,如果感兴趣,你可以Google一下  ). 大家好,我是 LFU,我会计算为每个缓存对象计算他们被使用的频率. 我是LRU缓存算法,我把最近最少使用的缓存对象给踢走. 我总是需要去了解在什么时候,用了哪个缓存对象.


- zii - 小彰
BFPRT算法的作者是5位真正的大牛(Blum 、 Floyd 、 Pratt 、 Rivest 、 Tarjan),该算法入选了在StackExchange上进行的当今世界十大经典算法,而算法的简单和巧妙颇有我们需要借鉴学习之处. BFPRT解决的问题十分经典,即从某n个元素的序列中选出第k大(第k小)的元素,通过巧妙的分析,BFPRT可以保证在最坏情况下仍为线性时间复杂度.


- Shan - 博客园-首页原创精华区
顾名思义,贪心算法总是作出在当前看来最好的选择. 也就是说贪心算法并不从整体最优考虑,它所作出的选择只是在某种意义上的局部最优选择. 当然,希望贪心算法得到的最终结果也是整体最优的. 虽然贪心算法不能对所有问题都得到整体最优解,但对许多问题它能产生整体最优解. 如单源最短路经问题,最小生成树问题等.


- 成 - FeedzShare
来自: 小彰 - FeedzShare  . 发布时间:2011年09月25日,  已有 2 人推荐. 没有人能说清哪种缓存算法由于其他的缓存算法. (以下的几种缓存算法,有的我也理解不好,如果感兴趣,你可以Google一下  ). 大家好,我是 LFU,我会计算为每个缓存对象计算他们被使用的频率.

K-Means 算法

- - 酷壳 - CoolShell.cn
最近在学习一些数据挖掘的算法,看到了这个算法,也许这个算法对你来说很简单,但对我来说,我是一个初学者,我在网上翻看了很多资料,发现中文社区没有把这个问题讲得很全面很清楚的文章,所以,把我的学习笔记记录下来,分享给大家. k-Means 算法是一种  cluster analysis 的算法,其主要是来计算数据聚集的算法,主要通过不断地取离种子点最近均值的算法.