An Applied Solution for Web-Scale OLAP Systems

Tags: personalized recommendation and search | Published: 2012-09-17 15:15 | Author: bicloud

Source: http://blog.sina.com.cn/bicloud

Avatara: OLAP for Webscale Analytics Products

OLAP: online analytical processing

OLAP is well suited to mining and analyzing the multidimensional data that members generate on websites. Providing insights derived from this analysis has become crucial for these websites to give members greater value.

To support LinkedIn's online features such as “Who’s Viewed My Profile?” and “Who’s Viewed This Job?”, LinkedIn built a scalable and fast OLAP serving system called Avatara to solve this "many, small cubes" problem.

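Existing solutions have shortcomings. These enterprise-grade products are built for offline analysis and traditional data warehousing, and they cannot effectively serve high-traffic websites that need high throughput, low latency, and high availability. In these use cases the queries are known in advance, because the product interface restricts users to a known subset of queries. Expensive operations such as joins can therefore be precomputed. Additionally, queries span relatively few dimensions, usually tens to at most a hundred. These queries, and consequently the cubes, can be sharded across a primary dimension.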
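Sharding by a primary dimension can be sketched as follows. This is only an illustration, not Avatara's actual partitioning code: the shard count, the key format `member:42`, and the function name `shard_for` are all assumptions made for the example. The point it shows is that every cell of one small cube shares the same primary key, so a query for that cube is routed to exactly one shard.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count, chosen only for illustration


def shard_for(primary_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a primary-dimension value (e.g. a member id) to a single shard."""
    # A stable hash is used so the mapping is identical across processes;
    # Python's built-in hash() is randomized per process for strings.
    digest = hashlib.sha256(primary_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Every lookup for the same member lands on the same shard, so a query over
# that member's cube never has to fan out across the cluster.
shard = shard_for("member:42")  # "member:42" is a made-up key
```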

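The design of an OLAP system affects its availability. Suppose a cluster has 100 nodes and each machine may be down for 10 minutes every 30 days. The availability of the system is then (1 − MTTR/MTTF)ⁿ, where MTTR is the mean time to repair, MTTF is the mean time to failure, and n is the number of nodes needed to satisfy a query. In our example, the availability of a query is (1 − 10/(30 · 24 · 60))¹⁰⁰ = 0.977115. This looks like a sufficiently available system. However, suppose a user visits a product page 3 times a day to check status information. Over a year, that user will see about 25 “not available” errors, or roughly two per month, which is not a good user experience. If the cube operations for a query can be performed within a single shard, the query can be served from a single disk. The availability then becomes 1 − 10/(30 · 24 · 60) = 0.999768, and the user will see about 0.25 “not available” errors per year, which meets practical needs. There are several assumptions and simplifications here, but this need forms the basis for highly-available key-value systems, such as Amazon’s Dynamo, and it is why a key-value store is used.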
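The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that uses only the numbers stated in the text (10 minutes of downtime per 30 days, 100 nodes, 3 page views per day); the function and variable names are made up for the example.

```python
def availability(mttr_minutes: float, mttf_minutes: float, nodes: int) -> float:
    """Availability of a query that needs all `nodes` machines to be up."""
    return (1.0 - mttr_minutes / mttf_minutes) ** nodes


def expected_errors_per_year(avail: float, queries_per_day: int = 3) -> float:
    """Expected number of failed page views per user per year."""
    return (1.0 - avail) * queries_per_day * 365


MTTR = 10.0            # minutes of downtime per failure
MTTF = 30 * 24 * 60.0  # 30 days, expressed in minutes

fan_out = availability(MTTR, MTTF, nodes=100)  # query touches every node
single = availability(MTTR, MTTF, nodes=1)     # query touches one shard

print(f"fan-out: {fan_out:.6f}, errors/year: {expected_errors_per_year(fan_out):.2f}")
print(f"single:  {single:.6f}, errors/year: {expected_errors_per_year(single):.2f}")
```

The gap between the two cases comes entirely from the exponent n: a query that touches one shard fails about a hundred times less often than one that fans out to all 100 nodes.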

Avatara is a fast, scalable OLAP system built to handle the many, small cubes scenario that LinkedIn encounters with its various analytics features. Offline computation is implemented with Hadoop, and the key-value store is Voldemort.

System architecture:

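Avatara consists of two main components: an offline batch engine and an online query engine. The offline batch engine runs user-defined joins and aggregations over activity stream data to generate cubes. The offline computation is done mainly on Hadoop, and the results are stored in the key-value store. Clients run online queries through SQL-like statements, which fetch their data from the key-value store.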

The system uses a hybrid offline/online strategy coupled with sharding into a key-value store by an application-specified primary dimension to support OLAP queries at web scale.
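The hybrid offline/online split can be illustrated with a toy example. This is a sketch under loud assumptions, not Avatara's implementation: a plain Python dict stands in for the key-value store, a single function stands in for the Hadoop batch job, and the events, dimension names, and the `build_cubes` and `query` helpers are all hypothetical.

```python
from collections import defaultdict

# Hypothetical activity-stream events: (job_id, day, viewer_title, viewer_company)
EVENTS = [
    ("job:1", "2012-09-01", "Engineer", "Acme"),
    ("job:1", "2012-09-01", "Engineer", "Globex"),
    ("job:1", "2012-09-02", "Manager", "Acme"),
    ("job:2", "2012-09-01", "Engineer", "Acme"),
]

DIMENSIONS = ("day", "title", "company")


def build_cubes(events):
    """Offline step: aggregate events into one small cube per job_id."""
    cubes = defaultdict(lambda: defaultdict(int))
    for job_id, day, title, company in events:
        cubes[job_id][(day, title, company)] += 1
    # The returned dict plays the role of the key-value store,
    # keyed by the primary dimension (job_id).
    return {job_id: dict(cells) for job_id, cells in cubes.items()}


def query(kv_store, job_id, group_by, where=None):
    """Online step: fetch one cube by key, then filter and roll up in memory."""
    where = where or {}
    result = defaultdict(int)
    for cell, count in kv_store.get(job_id, {}).items():
        row = dict(zip(DIMENSIONS, cell))
        if all(row[dim] == value for dim, value in where.items()):
            result[tuple(row[dim] for dim in group_by)] += count
    return dict(result)


kv_store = build_cubes(EVENTS)
views_by_title = query(kv_store, "job:1", group_by=["title"])
acme_views_by_day = query(kv_store, "job:1", group_by=["day"], where={"company": "Acme"})
```

Because each cube is small and is fetched with a single key lookup, the online step only needs to scan a handful of cells, which is what keeps the expensive work in the offline batch.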

Figure: Query latency CDF for a high-traffic day. 95% of queries can be served within 25 ms.

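Summary: high-traffic websites can borrow this pattern for some of their data products, for example features like LinkedIn's “Who’s Viewed My Profile?”, “Who’s Viewed This Job?”, and “Jobs You May Be Interested In”. WVTJ (“Who’s Viewed This Job?”) shows the number of job views broken down by time, title, and company. OLAP is an ideal solution to handle these tasks, which can be implemented with a cube configuration file and simple SQL.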

Reference: Avatara: OLAP for Webscale Analytics Products