latencytop深度了解你的Linux系统的延迟

标签: Linux 工具介绍 调优 dstat latencytop | 发表时间:2011-03-29 21:00 | 作者:Yu Feng kissingwolf
出处:http://blog.yufeng.info

原创文章,转载请注明: 转载自Erlang非业余研究

本文链接地址: latencytop深度了解你的Linux系统的延迟

我们在系统调优或者定位问题的时候,经常会发现多线程程序的效率很低,但是又不知道问题出在哪里,就知道上下文切换很多,但是为什么上下文切换,是谁导致切换,我们就不知道了。上下文切换可以用dstat这样的工具查看,比如:

$dstat
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  9   2  87   2   0   1|7398k   31M|   0     0 | 9.8k   11k|  16k   64k
 20   4  69   3   0   4|  26M   56M|  34M  172M|   0     0 |  61k  200k
 21   5  64   6   0   3|  26M  225M|  35M  175M|   0     0 |  75k  216k
 21   5  66   4   0   4|  25M  119M|  34M  173M|   0     0 |  66k  207k
 19   4  68   5   0   3|  23M   56M|  33M  166M|   0     0 |  60k  197k

#或者用systemtap脚本来看
$sudo stap -e 'global cnt; probe scheduler.cpu_on {cnt<<<1;} probe timer.s(1){printf("%d\n", @count(cnt)); delete cnt;}'
217779
234141
234759

每秒高达200k左右的的上下文切换, 谁能告诉我发生了什么? 好吧,latencytop来救助了!

它的官网:http://www.latencytop.org/

Skipping audio, slower servers, everyone knows the symptoms of latency. But to know what’s going on in the system, what’s causing the latency, how to fix it… that’s a hard question without good answers right now.

LatencyTOP is a Linux* tool for software developers (both kernel and userspace), aimed at identifying where in the system latency is happening, and what kind of operation/action is causing the latency to happen so that the code can be changed to avoid the worst latency hiccups.

它是Intel贡献的另外一个性能查看器,还有一个是powertop,都是很不错的工具.

Latencytop通过在内核上下文切换的时候,记录被切换的进程的内核栈,然后通过匹配内核栈的函数来判断是什么原因导致上下文切换,同时他把几十种容易引起切换的场景的函数都记录起来,这样在判断系统问题的时候能容易定位到问题。

latencytop分成2个部分,内核部分和应用部分。内核部分负责调用栈的收集并且通过/proc来暴露, 应用部分负责显示.

工作界面截图如下:

latencytop在2.6.256后被内核吸收成为其中一部分,只要编译的时候打开该选项就好,如何确认呢?

$ cat /proc/latency_stats
Latency Top version : v0.1

看到这个就好了, 遗憾的是RHEL6竟然带了latencytop应用部分,而没有打开编译选项,让我们情何以堪呢?
在ubuntu下可以这么安装:

$ uname -r
2.6.38-yufeng
$ apt-get install latencytop
$ sudo latencytop #就可以使用了

但是latencytop比较傻的是默认是开图像界面的,我们很不习惯,我们要文本界面, 自己动手把!

$ apt-get source latencytop
$ diff -up Makefile.orig Makefile
--- Makefile.orig	2011-03-29 20:10:29.025845447 +0800
+++ Makefile	2011-03-28 14:48:11.232318002 +0800
@@ -1,5 +1,5 @@
 # FIXME: Use autoconf ?
-HAS_GTK_GUI = 0
+#HAS_GTK_GUI = 0

 DESTDIR =
 SBINDIR = /usr/sbin

重新make下就好了, 文本界面出现了. 具体使用参看 man latencytop。

fcicq同学说:

加个 –nogui 参数就好了. 不需要重新编译.

谢谢!

好了,那么latencytop支持多少种的延迟原因呢?让latencytop.trans告诉你,我们也可以自己修改这个文件,把新的延迟原因加上去。

$ cat /usr/share/latencytop/latencytop.trans
#
1	vfs_read		Reading from file
1	vfs_write		Writing to file
1	__mark_inode_dirty	Marking inode dirty
1	vfs_readdir		Reading directory content
1	vfs_unlink		Unlinking file
1	blocking_notifier_call_chain	Blocking notifier
1	lock_super		Superblock lock contention
1	vfs_create		Creating a file
1	KAS_ScheduleTimeout	Binary AMD driver delay
1	firegl_lock_device	Binary AMD driver delay
#
2	__bread			Synchronous buffer read
2	do_generic_mapping_read	Reading file data
2	sock_sendmsg		Sending data over socket
2	do_sys_open		Opening file
2	do_sys_poll		Waiting for event (poll)
2	core_sys_select		Waiting for event (select)
2	proc_reg_read		Reading from /proc file
2	__pollwait		Waiting for event (poll)
2	sys_fcntl		FCNTL system call
2	scsi_error_handler	SCSI error handler
2	proc_root_readdir	Reading /proc directory
2	ksoftirqd		Waking ksoftirqd
2	worker_thread		.
2	do_unlinkat		Unlinking file
2	__wait_on_buffer	Waiting for buffer IO to complete
2	pdflush			pdflush() kernel thread
2	kjournald		kjournald() kernel thread
2	blkdev_ioctl		block device IOCTL
2	kauditd_thread		kernel audit daemon
2	tty_ioctl		TTY IOCTL
2	__filemap_fdatawrite_range fdatasync system call
2	do_sync_write		synchronous write
2	kthreadd		kthreadd kernel thread
2	usb_port_resume		Waking up USB device
2	usb_autoresume_device	Waking up USB device
2	kswapd			kswapd() kernel thread
2	md_thread		Raid resync kernel thread
2	i915_wait_request	Waiting for GPU command to complete
2	request_module		Loading a kernel module

#
3	tty_wait_until_sent	Waiting for TTY to finish sending
3	pipe_read		Reading from a pipe
3	pipe_write		Writing to a pipe
3	pipe_wait		Waiting for pipe data
3	read_block_bitmap	Reading EXT3 block bitmaps
3	scsi_execute_req	Executing raw SCSI command
3	sys_wait4		Waiting for a process to die
3	sr_media_change		Checking for media change
3	sr_do_ioctl		SCSI cdrom ioctl
3	sd_ioctl		SCSI disk ioctl
3	sr_cd_check		Checking CDROM media present
3	ext3_read_inode		Reading EXT3 inode
3	htree_dirblock_to_tree	Reading EXT3 directory htree
3	ext3_readdir		Reading EXT3 directory
3	ext3_bread		Synchronous EXT3 read
3	ext3_free_branches	Unlinking file on EXT3
3	ext3_get_branch		Reading EXT3 indirect blocks
3	ext3_find_entry		EXT3: Looking for file
3	__ext3_get_inode_loc	Reading EXT3 inode
3	ext3_delete_inode	EXT3 deleting inode
3	sync_page		Writing a page to disk
3	tty_poll		Waiting for TTY data
3	tty_read		Waiting for TTY input
3	tty_write		Writing data to TTY
3	update_atime		Updating inode atime
3	page_cache_sync_readahead	Pagecache sync readahead
3	do_fork			Fork() system call
3	sys_mkdirat		Creating directory
3	lookup_create		Creating file
3	inet_sendmsg		Sending TCP/IP data
3	tcp_recvmsg		Receiving TCP/IP data
3	link_path_walk		Following symlink
3	path_walk		Walking directory tree
3	sys_getdents		Reading directory content
3	unix_stream_recvmsg	Waiting for data on unix socket
3	ext3_mkdir		EXT3: Creating a directory
3	journal_get_write_access	EXT3: Waiting for journal access
3	synchronize_rcu		Waiting for RCU
3	input_close_device	Closing input device
3	mousedev_close_device	Closing mouse device
3	mousedev_release	Closing mouse device
3	mousedev_open		Opening mouse device
3	kmsg_read		Reading from dmesg
3	sys_futex		Userspace lock contention
3	do_futex		Userspace lock contention
3	vt_waitactive		vt_waitactive IOCTL
3	acquire_console_sem	Waiting for console access
3	filp_close		Closing a file
3	sync_inode		(f)syncing an inode to disk
3	ata_exec_internal_sg	Executing internal ATA command
3	writeback_inodes	Writing back inodes
3	ext3_orphan_add 	EXT3 adding orphan
3	ext3_mark_inode_dirty 	EXT3 marking inode dirty
3	ext3_unlink 		EXT3 unlinking file
3	ext3_create		EXT3 Creating a file
3	log_do_checkpoint	EXT3 journal checkpoint
3	generic_delete_inode	Deleting an inode
3	proc_delete_inode	Removing /proc file
3	do_truncate		Truncating file
3	sys_execve		Executing a program
3	journal_commit_transaction	EXT3: committing transaction
3	__stop_machine_run	Freezing the kernel (for module load)
3	sys_munmap		unmapping memory
3	sys_mmap		mmaping memory
3	sync_buffer		Writing buffer to disk (synchronous)
3	inotify_inode_queue_event	Inotify event
3	proc_lookup		Looking up /proc file
3	generic_make_request	Creating block layer request
3	get_request_wait	Creating block layer request
3	alloc_page_vma		Allocating a VMA
#3	__d_lookup		Looking up a dentry
3	blkdev_direct_IO	Direct block device IO
3	sys_mprotect		mprotect() system call
3	shrink_icache_memory	reducing inode cache memory footprint
3	vfs_stat_fd		stat() operation
3	cdrom_open		opening cdrom device
3	sys_epoll_wait		Waiting for event (epoll)
3	sync_sb_inodes		Syncing inodes
3	tcp_connect		TCP/IP connect
3	ata_scsi_ioctl		ATA/SCSI disk ioctl
3	do_rmdir		Removing directory
3	vfs_rmdir		Removing directory
3	sys_flock		flock() on a file
3	usbdev_open		opening USB device
3	lock_kernel		Big Kernel Lock contention
3	blk_execute_rq		Submitting block IO
3	scsi_cmd_ioctl		SCSI ioctl command
3	acpi_ec_transaction	ACPI hardware access
3	journal_get_undo_access	Waiting for EXT3 journal undo operation
3	i915_irq_wait		Waiting for GPU interrupt
3	i915_gem_throttle_ioctl	Throttling GPU while waiting for commands

#
#
5	do_page_fault		Page fault
5	handle_mm_fault		Page fault
5	filemap_fault		Page fault
5	sync_filesystems	Syncing filesystem
5	sys_nanosleep		Application requested delay
5	sys_pause		Application requested delay
5	evdev_read		Reading keyboard/mouse input
5	do_fsync		fsync() on a file (type 'F' for details)
5	__log_wait_for_space	Waiting for EXT3 journal space

延迟原因非常的详细.

本来到这里,我要介绍的要介绍了,但是且慢,由于这个东西要在2.6.26后的系统上使用,我们的线上系统大部分是RHEL 5U4, 2.6.18的, 我们如何使用呢?

这时候 systemtap 一如既往的前来救助了!

systemtap 1.4版本以后带了个latencytop.stp, 也是intel的贡献. 那我们试验下穷人家的latencytop.
它在那里呢?

$ uname -r
2.6.18-164.el5

$ stap -V
Systemtap translator/driver (version 1.5 /0.137 non-git sources)
Copyright (C) 2005-2011 Red Hat, Inc. and others
This is free software; see the source for copying conditions.
enabled features: AVAHI LIBRPM LIBSQLITE3 NSS BOOST_SHARED_PTR TR1_UNORDERED_MAP NLS

$ ls -al /usr/share/doc/systemtap/examples/profiling/latencytap.stp
-rwxr-xr-x 1 chuba users 16240 Feb 17 22:02/usr/share/doc/systemtap/examples/profiling/latencytap.stp 

$ sudo stap -t --all-modules /usr/share/doc/systemtap/examples/profiling/latencytap.stp
ERROR: Skipped too many probes, check MAXSKIPPED or try again with stap -t for more details.
WARNING: Number of errors: 0, skipped probes: 101
WARNING: Skipped due to global 'dequeue' lock timeout: 2
WARNING: Skipped due to global 'this_sleep' lock timeout: 99
----- probe hit report:
kernel.trace("deactivate_task")!, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:47:1), hits: 254, cycles: 680min/43327avg/2248467max, from: kernel.trace("deactivate_task")
kernel.trace("activate_task")!, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:58:1), hits: 255, cycles: 890min/502549avg/2271568max, from: kernel.trace("activate_task")
kernel.function("finish_task_switch@kernel/sched.c:1969")?, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:78:7), hits: 509, cycles: 213min/1002207avg/5382852max, from: kernel.function("finish_task_switch") from: scheduler.cpu_on
begin, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:123:1), hits: 1, cycles: 1802min/1802avg/1802max, from: begin
begin, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:131:1), hits: 1, cycles: 227979min/227979avg/227979max, from: begin
Pass 5: run failed.  Try again with another '--vp 00001' option.

出错了! 原因是lock timeout, 原来stap的全局变量是用锁保护的,现在超时了!知道原因好办,打个patch吧!

$ diff -up translate.cxx.orig  translate.cxx
--- translate.cxx.orig     2011-03-22 21:26:52.000000000 +0800
+++ /translate.cxx     2011-03-29 20:31:28.000000000 +0800
@@ -5802,10 +5802,10 @@ translate_pass (systemtap_session& s)
       s.op->newline() << "#define MAXACTION_INTERRUPTIBLE (MAXACTION * 10)";
       s.op->newline() << "#endif";
       s.op->newline() << "#ifndef TRYLOCKDELAY";
-      s.op->newline() << "#define TRYLOCKDELAY 10 /* microseconds */";
+      s.op->newline() << "#define TRYLOCKDELAY 50 /* microseconds */";
       s.op->newline() << "#endif";
       s.op->newline() << "#ifndef MAXTRYLOCK";
-      s.op->newline() << "#define MAXTRYLOCK 100 /* 1 millisecond total */";
+      s.op->newline() << "#define MAXTRYLOCK 500 /* 1 millisecond total */";
       s.op->newline() << "#endif";
       s.op->newline() << "#ifndef MAXMAPENTRIES";
       s.op->newline() << "#define MAXMAPENTRIES 2048";

#编译安装后再来一次
$ sudo stap  --all-modules /usr/share/doc/systemtap/examples/profiling/latencytap.stp
ERROR: probe overhead exceeded threshold
WARNING: Number of errors: 1, skipped probes: 0
Pass 5: run failed.  Try again with another '--vp 00001' option.

#又错了,这次原因是probe overhead exceeded threshold, 看下代码我们知道,脚本的开销太大了,超过正常的负载,通过查看代码可以用STP_NO_OVERLOAD来解除这个限制

#再来一次
$ sudo  stap -DSTP_NO_OVERLOAD --all-modules -DMAXSKIPPED=1024 /usr/share/doc/systemtap/examples/profiling/latencytap.stp 

Reason                                  Count  Average(us)  Maximum(us) Percent%
Userspace lock contention                 345     16409195     83258717      45%
                                         1453       867513     60231852      10%
                                           95      7391754     33821926       5%
migration() kernel thread                1733       402701      3571412       5%
                                         7239        87993       401552       5%
Reading from a pipe                       212      2922207     52151180       4%
                                          142      2267850     17990214       2%
                                          108      2457247      7494331       2%
Waking ksoftirqd                           16     16082822     59266312       2%
Waiting for event (select)                 99      2113310     28510974       1%
kjournald() kernel thread                 148      1313447     13983084       1%
Application requested delay                94      1059898     10011409       0%
                                           41      2391993      7618788       0%
Waiting for event (select)                 38      2259444     29057362       0%
                                          719        92947       584944       0%
Waiting for event (poll)                    1     57582711     57582711       0%
Application requested delay                 3     19030709     36000553       0%
Waiting for event (select)                 39      1341880      5847683       0%
                                           34       936628      6649350       0%
                                            5      6163603     10008484       0%
...

这次看到结果了,哈哈,小高兴一把。但是在繁忙的系统上这个脚本的资源占用特别多,也是不爽的。 幸运的是这个脚本支持查看某个进程的延迟情况, 就是在 latencytap.stp 后面加个-x 参数。

这个脚本设计应该是支持进程ID, 但是结果写成了线程ID,属于bug!!!

动手改下吧:

$ diff -u latencytap.stp.orig  latencytap.stp
---  latencytap.stp.orig    2011-02-17 22:02:40.000000000 +0800
+++ latencytap.stp     2011-03-29 20:43:51.000000000 +0800
@@ -15,7 +15,7 @@
 global this_sleep;
 global debug = 0;

-function log_event:long (p:long) { return (!traced_pid || traced_pid == p) }
+function log_event:long (p:long) { return (!traced_pid || traced_pid == task_pid(p)) }

 #func names from hex addresses
 function func_backtrace:string (ips:string)
@@ -50,14 +50,14 @@
   # check to see if task is in appropriate state:
   # TASK_INTERRUPTIBLE      1
   # TASK_UNINTERRUPTIBLE    2
-  if (log_event($p->pid) && (s & 3)) {
+  if (log_event($p) && (s & 3)) {
     dequeue[$p] = gettimeofday_us();
   }
 }

 probe kernel.trace("activate_task") !,
       kernel.function("activate_task") {
-  if (!log_event($p->pid)) next
+  if (!log_event($p)) next

   a = gettimeofday_us()
   d = dequeue[$p]

#再来一次
$ sudo stap  --all-modules /usr/share/doc/systemtap/examples/profiling/latencytap.stp -x $$
...

#如果发现出来的Reason是空行, 就把latencytap.stp里面的debug=0, 改成debug=1

这下终于爽了,旧内核用systemtap版本的,新内核用内核版本的,世界和谐!

通过对线上MySQL的诊断发现大部分时间花在mutex锁的竞争上来,我说过了,我会收拾你的,等着瞧!

玩得开心!

Post Footer automatically generated by wp-posturl plugin for wordpress.

相关 [latencytop 深度 linux] 推荐:

latencytop深度了解你的Linux系统的延迟

- kissingwolf - Erlang非业余研究
原创文章,转载请注明: 转载自Erlang非业余研究. 本文链接地址: latencytop深度了解你的Linux系统的延迟. 我们在系统调优或者定位问题的时候,经常会发现多线程程序的效率很低,但是又不知道问题出在哪里,就知道上下文切换很多,但是为什么上下文切换,是谁导致切换,我们就不知道了. 上下文切换可以用dstat这样的工具查看,比如:.

Javascript 里跑Linux

- rockmaple - Shellex&#39;s Blog
牛逼到暴的大拿 Fabrice Bellard,用Javascript实现了一个x86 PC 模拟器,然后成功在这个模拟器里面跑Linux(请用Firefox 4 / Google Chrome 11打开,Chome 12有BUG). 关于这个东西… 伊说 “I did it for fun“,大大啊大大啊….

Linux Ksplice,MySQL and Oracle

- Syn - DBA Notes
Oracle 在 7 月份收购了 Ksplice. 使用了 Ksplice 的 Linux 系统,为 Kernel 打补丁无需重启动,做系统维护的朋友应该明白这是一个杀手级特性. 现在该产品已经合并到 Oracle Linux 中. 目前已经有超过 700 家客户,超过 10 万套系统使用了 Ksplice (不知道国内是否已经有用户了.

linux makefile编写

- hl - C++博客-首页原创精华区
在讲述这个Makefile之前,还是让我们先来粗略地看一看Makefile的规则. target也就是一个目标文件,可以是Object File,也可以是执行文件. prerequisites就是,要生成那个target所需要的文件或是目标. command也就是make需要执行的命令. 这是一个文件的依赖关系,也就是说,target这一个或多个的目标文件依赖于prerequisites中的文件,其生成规则定义在 command中.

Linux下的VDSO

- 圣斌 - Adam&#39;s
VDSO(Virtual Dynamically-linked Shared Object)是个很有意思的东西, 它将内核态的调用映射到用户态的地址空间中, 使得调用开销更小, 路径更好.. 开销更小比较容易理解, 那么路径更好指的是什么呢. 拿x86下的系统调用举例, 传统的int 0×80有点慢, Intel和AMD分别实现了sysenter, sysexit和syscall, sysret, 即所谓的快速系统调用指令, 使用它们更快, 但是也带来了兼容性的问题.

Linux wget命令

- - CSDN博客推荐文章
wget是linux最常用的下载命令, 一般的使用方法是: wget + 空格 + 要下载文件的url路径. 例如: # wget  http://www.linuxsense.org/xxxx/xxx.tar.gz. 简单说一下-c参数, 这个也非常常见, 可以断点续传, 如果不小心终止了, 可以继续使用命令接着下载.

linux 小技巧

- - DBA Blog
2:如何限制用户的最小密码长度. 修改/etc/login.defs里面的PASS_MIN_LEN的值. 比如限制用户最小密码长度是8:. 3:如何使新用户首次登陆后强制修改密码. 4:更改Linux启动时用图形界面还是字符界面. 将id:5:initdefault: 其中5表示默认图形界面. 改id:3: initdefault: 3表示字符界面.

Linux iostat命令

- - CSDN博客系统运维推荐文章
iostat用于输出CPU和磁盘I/O相关的统计信息. . iostat [ -c | -d ] [ -k | -m ] [ -t ] [ -V ] [ -x ] [ device [. iostat各个参数说明:. -c 仅显示CPU统计信息.与-d选项互斥. -d 仅显示磁盘统计信息.与-c选项互斥.

Linux的架构

- - 博客园_首页
作者:Vamei 出处:http://www.cnblogs.com/vamei 欢迎转载,也请保留这段声明. 我们以下图为基础,说明Linux的架构(architecture). (该图参考《 Advanced Programming in Unix Environment》). 最内层是我们的硬件,最外层是我们常用的各种应用,比如说使用firefox浏览器,打开evolution查看邮件,运行一个计算流体模型等等.

linux命令locate

- - 操作系统 - ITeye博客
    locate命令其实是"find -name"的另一种写法,但是要比后者快得多,原因在于它不搜索具体目录,而是搜索一个数据库(/var/lib/locatedb),这个数据库中含有本地所有文件信息. Linux系统自动创建这个数据库,并且每天自动更新一次,所以使用locate命令查不到最新变动过的文件.