Essential Skills for Mastering Big Data

Tags: data mining | Published: 2013-06-05 12:53 | Author: bicloud

Source: http://blog.sina.com.cn/bicloud

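I recently condensed what I know about data, algorithms, and products into a compact summary, and I'd like to share it with everyone.

Cultivate both the inner and the outer: inner strength (the fundamentals) is the foundation, and techniques are how you improve. Go over things again and again, take small, steady steps, and iterate quickly.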

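Recommended books:

Data Mining: Concepts and Techniques

Principles of Data Mining http://book.douban.com/subject/1103515/ (a classic)

Introduction to Data Mining

Machine Learning (CMU)

Data Mining: Practical Machine Learning Tools and Techniques (English edition, 3rd ed.)

Data Mining: Concepts, Models, Methods, and Algorithms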

Pattern Recognition And Machine Learning

Programming Collective Intelligence

Pattern Classification

Data mining algorithm papers:

[1]Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th VLDB conference, pp 487–499.

[2]Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and regression trees. Wadsworth, Belmont

[3]Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J Roy Stat Soc B 39:1–38

[4]Langville AN, Meyer CD (2006) Google’s PageRank and beyond: the science of search engine rankings. Princeton University Press, Princeton

[5]Pei, Jian; Han, Jiawei; and Lakshmanan, Laks V. S.; Mining frequent itemsets with convertible constraints, in Proceedings of the 17th International Conference on Data Engineering, April 2–6, 2001, Heidelberg, Germany, 2001, pages 433-442.

[6]MacQueen, J. B. (1967). Some Methods for classification and Analysis of Multivariate Observations. Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability. University of California Press. pp. 281–297.

[7]Quinlan JR (1979) Discovering rules by induction from large collections of examples. In: Michie D (ed), Expert systems in the micro-electronic age. Edinburgh University Press, Edinburgh

[8]Quinlan JR (1993) C4.5: Programs for machine learning. Morgan Kaufmann Publishers, San Mateo

[9]Hosmer, David W.; Lemeshow, Stanley (2000). Applied Logistic Regression (2nd ed.)

[10]Rish, Irina. An empirical study of the naive Bayes classifier. IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence.

[11]Ho, Tin Kam. Random Decision Forests. Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995. pp. 278–282.

[12]Friedman, J. H. Greedy Function Approximation: A Gradient Boosting Machine.

[13]Friedman, J. H. Stochastic Gradient Boosting.

[14]Jerry Ye, Jyh-Herng Chow, Jiang Chen, Zhaohui Zheng. Stochastic Gradient Boosted Distributed Decision Trees. 2009.

[15]Fix E, Hodges JL, Jr (1951) Discriminatory analysis, nonparametric discrimination. USAF School of Aviation Medicine, Randolph Field, Tex., Project 21-49-004, Rept. 4, Contract AF41(128)-31, February 1951

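Machine learning / data mining tools:

Statistical learning tools / model prototyping: R, SAS, Clementine

Support vector machine library: libsvm

Large-scale linear classification and prediction library: liblinear (linear SVM and LR)

Machine learning library for Hadoop: mahout

Data warehouse / ETL: hive/hql, sql

Cloud computing: hadoop

Data crawling / transformation / model prototyping: python, php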

-------------------------------------------------------------

-------Data mining application, part one--------------------------------------

Recommendation algorithm papers:

[1]Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. (2001). Item-Based Collaborative Filtering Recommendation Algorithms. Proceedings of the 10th International Conference on World Wide Web (pp. 285-295). Hong Kong: ACM.

[2]Jiahui Liu, Elin Pedersen, and Peter Dolan. Personalized News Recommendation Based on Click Behavior. 2010 International Conference on Intelligent User Interfaces.

[3]Y. Koren. Collaborative Filtering with Temporal Dynamics. In KDD, 2009.

[4]Yi Ding, Xue Li, Time weight collaborative filtering, Proceedings of the 14th ACM international conference on Information and knowledge management, October 31-November 05, 2005, Bremen, Germany.

[5]Shumeet Baluja, Rohan Seth, D. Sivakumar, Yushi Jing, Jay Yagnik, Shankar Kumar, Deepak Ravichandran, Mohamed Aly, Video suggestion and discovery for youtube: taking random walks through the view graph, Proceeding of the 17th international conference on World Wide Web, April 21-25, 2008, Beijing, China.

[6]Azarias Reda, Yubin Park, Mitul Tiwari, Christian Posse, and Sam Shah. Metaphor: A System for Related Search Recommendations. In the 21st International Conference on Information and Knowledge Management (CIKM 2012).

[7]James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, Mike Lambert, Blake Livingston, Dasarathi Sampath, The YouTube video recommendation system, Proceedings of the fourth ACM conference on Recommender systems, September 26-30, 2010, Barcelona, Spain.

[8]Abhinandan S. Das, Mayur Datar, Ashutosh Garg, Shyam Rajaram, Google news personalization: scalable online collaborative filtering, Proceedings of the 16th international conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada.

[9]Robert M. Bell, Yehuda Koren, Lessons from the Netflix prize challenge, ACM SIGKDD Explorations Newsletter, v.9 n.2, December 2007.

[10]Noam Koenigstein, Nir Nice, Ulrich Paquet, Nir Schleyen, The Xbox recommender system, Proceedings of the sixth ACM conference on Recommender systems, September 09-13, 2012, Dublin, Ireland.

[11]G. Adomavicius, A. Tuzhilin, Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions, IEEE Transactions on Knowledge and Data Engineering 17 (2005) 734-749.

[12]T. Zhou, J. Ren, M. Medo, Y.-C. Zhang, Bipartite network projection and personal recommendation, Physical Review E 76 (2007) 046115.

[13]T. Zhou, Z. Kuscsik, J.-G. Liu, M. Medo, J.R. Wakeling, Y.-C. Zhang, Solving the apparent diversity–accuracy dilemma of recommender systems, Proceedings of the National Academy of Sciences of the United States of America 107 (2010) 4511-4515.

[14]Steffen Rendle (2012): Factorization Machines with libFM, to appear in ACM Trans. Intell. Syst. Technol., 3(3), May.

[15]Daniel Lemire, Anna Maclachlan. Slope One Predictors for Online Rating-Based Collaborative Filtering.

Data storage papers:

[1]Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels. Dynamo: amazon's highly available key-value store. Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles.

[2]Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson Hsieh, Deborah Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. Bigtable: A Distributed Storage System for Structured Data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’06), Berkeley, CA, USA, 2006.

[3]Roshan Sumbaly, Jay Kreps, Alex Feinberg, Lei Gao, and Sam Shah. Serving Large-Scale Batch Computed Data with Project Voldemort. 10th USENIX conference on File and Storage Technologies (FAST 2012).

[4]Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni, PNUTS: Yahoo!'s hosted data serving platform, Proceedings of the VLDB Endowment, v.1 n.2, August 2008.

[5]Adam Silberstein, Jianjun Chen, David Lomax, Brad McMillan, Masood Mortazavi, P. P. S. Narayan, Raghu Ramakrishnan, Rusty Sears. PNUTS in Flight: Web-Scale Data Serving at Yahoo. IEEE Internet Computing, Volume 16 Issue 1.

[6]Lamport, Leslie. Time, clocks, and the ordering of events in a distributed system.

[7]Prince Mahajan, Lorenzo Alvisi, and Mike Dahlin. Consistency, Availability, and Convergence. Technical Report (UTCS TR-11-22)

Data storage tools:

Voldemort: LinkedIn's key-value store http://www.project-voldemort.com/voldemort/ (I love it)

Redis http://redis.io/

hbase http://hbase.apache.org/

cassandra http://cassandra.apache.org/

hypertable http://hypertable.org/

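Data products:

Search optimization

Recommendation engines

Ad delivery

CRM (member marketing)

Shopping guides

Statistical tools for macro-level data analysis

Social small- and medium-scale financial lending (happy to discuss)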

-----wish u good luck-------------
