Essential Skills for Working with Big Data

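Tags: data mining | Published: 2013-06-05 12:53 | Author: bicloud
Source: http://blog.sina.com.cn/bicloud
I recently condensed what I know about data, algorithms, and products into a short summary, and I'd like to share it with everyone.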

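Work on both inner strength and outward technique: fundamentals are the foundation, techniques are how you level up. Go over them again and again, take small steps, and iterate quickly.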

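Recommended books:
Data Mining: Concepts and Techniques
Principles of Data Mining http://book.douban.com/subject/1103515/  (a must-read)
Introduction to Data Mining
Machine Learning (CMU)
Data Mining: Practical Machine Learning Tools and Techniques (English edition, 3rd ed.)
Data Mining: Concepts, Models, Methods, and Algorithms
Pattern Recognition and Machine Learning
Programming Collective Intelligence
Pattern Classification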

Data mining algorithm papers:
[1] Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th VLDB conference, pp 487–499.
[2] Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, Belmont
[3] Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J Roy Stat Soc B 39:1–38
[4] Langville AN, Meyer CD (2006) Google’s PageRank and beyond: the science of search engine rankings. Princeton University Press, Princeton
[5] Pei, Jian; Han, Jiawei; Lakshmanan, Laks V. S. Mining frequent itemsets with convertible constraints. In Proceedings of the 17th International Conference on Data Engineering, April 2–6, 2001, Heidelberg, Germany, pages 433–442.
[6] MacQueen, J. B. (1967). Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability. University of California Press. pp. 281–297.
[7] Quinlan JR (1979) Discovering rules by induction from large collections of examples. In: Michie D (ed), Expert systems in the micro electronic age. Edinburgh University Press, Edinburgh
[8] Quinlan JR (1993) C4.5: Programs for machine learning. Morgan Kaufmann Publishers, San Mateo
[9] Hosmer, David W.; Lemeshow, Stanley (2000). Applied Logistic Regression (2nd ed.)
[10] Rish, Irina. An empirical study of the naive Bayes classifier. IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence.
[11] Ho, Tin Kam. Random Decision Forests. Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995. pp. 278–282.
[12] Friedman, J. H. Greedy Function Approximation: A Gradient Boosting Machine.
[13] Friedman, J. H. Stochastic Gradient Boosting.
[14] Jerry Ye, Jyh-Herng Chow, Jiang Chen, Zhaohui Zheng. Stochastic Gradient Boosted Distributed Decision Trees. 2009.
[15] Fix E, Hodges JL, Jr (1951) Discriminatory analysis, nonparametric discrimination. USAF School of Aviation Medicine, Randolph Field, Tex., Project 21-49-004, Rept. 4, Contract AF41(128)-31, February 1951

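Machine learning / data mining tools:

Statistical learning tools / model prototyping: R, SAS, Clementine
Support vector machine library: libsvm
Large-scale linear classification and prediction library: liblinear (linear SVM and LR)
Hadoop machine learning library: mahout
Data warehouse / ETL: hive/hql, sql
Cloud computing: hadoop
Data crawling / transformation / model prototyping: python, php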
-------------------------------------------------------------
-------Data mining application #1----------------------------
Recommendation algorithm papers:
[1] Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. (2001). Item-Based Collaborative Filtering Recommendation Algorithms. Proceedings of the 10th International Conference on World Wide Web (pp. 285-295). Hong Kong: ACM.
[2] Jiahui Liu, Elin Pedersen, and Peter Dolan. Personalized News Recommendation Based on Click Behavior. In Proceedings of the 2010 International Conference on Intelligent User Interfaces (IUI 2010).
[3] Y. Koren. Collaborative Filtering with Temporal Dynamics. In KDD, 2009.
[4] Yi Ding, Xue Li, Time weight collaborative filtering, Proceedings of the 14th ACM international conference on Information and knowledge management, October 31-November 05, 2005, Bremen, Germany.
[5] Shumeet Baluja, Rohan Seth, D. Sivakumar, Yushi Jing, Jay Yagnik, Shankar Kumar, Deepak Ravichandran, Mohamed Aly, Video suggestion and discovery for youtube: taking random walks through the view graph, Proceeding of the 17th international conference on World Wide Web, April 21-25, 2008, Beijing, China.
[6] Azarias Reda, Yubin Park, Mitul Tiwari, Christian Posse, and Sam Shah. Metaphor: A System for Related Search Recommendations. In the 21st International Conference on Information and Knowledge Management (CIKM 2012).
[7] James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, Mike Lambert, Blake Livingston, Dasarathi Sampath, The YouTube video recommendation system, Proceedings of the fourth ACM conference on Recommender systems, September 26-30, 2010, Barcelona, Spain.
[8] Abhinandan S. Das, Mayur Datar, Ashutosh Garg, Shyam Rajaram, Google news personalization: scalable online collaborative filtering, Proceedings of the 16th international conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada.
[9] Robert M. Bell, Yehuda Koren, Lessons from the Netflix prize challenge, ACM SIGKDD Explorations Newsletter, v.9 n.2, December 2007.
[10] Noam Koenigstein, Nir Nice, Ulrich Paquet, Nir Schleyen, The Xbox recommender system, Proceedings of the sixth ACM conference on Recommender systems, September 09-13, 2012, Dublin, Ireland.
[11] G. Adomavicius, A. Tuzhilin, Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions, IEEE Transactions on Knowledge and Data Engineering 17 (2005) 734-749.
[12] T. Zhou, J. Ren, M. Medo, Y.-C. Zhang, Bipartite network projection and personal recommendation, Physical Review E 76 (2007) 046115.
[13] T. Zhou, Z. Kuscsik, J.-G. Liu, M. Medo, J.R. Wakeling, Y.-C. Zhang, Solving the apparent diversity–accuracy dilemma of recommender systems, Proceedings of the National Academy of Sciences of the United States of America 107 (2010) 4511-4515.
[14] Steffen Rendle (2012): Factorization Machines with libFM, ACM Trans. Intell. Syst. Technol., 3(3), May 2012.
[15] Daniel Lemire, Anna Maclachlan. Slope One Predictors for Online Rating-Based Collaborative Filtering.


Data storage papers:
[1] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels. Dynamo: Amazon's highly available key-value store. Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles.
[2] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson Hsieh, Deborah Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. Bigtable: A Distributed Storage System for Structured Data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’06), Berkeley, CA, USA, 2006.
[3] Roshan Sumbaly, Jay Kreps, Alex Feinberg, Lei Gao, and Sam Shah. Serving Large-Scale Batch Computed Data with Project Voldemort. 10th USENIX conference on File and Storage Technologies (FAST 2012).
[4] Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni, PNUTS: Yahoo!'s hosted data serving platform, Proceedings of the VLDB Endowment, v.1 n.2, August 2008.
[5] Adam Silberstein, Jianjun Chen, David Lomax, Brad McMillan, Masood Mortazavi, P. P. S. Narayan, Raghu Ramakrishnan, Rusty Sears. PNUTS in Flight: Web-Scale Data Serving at Yahoo. IEEE Internet Computing, Volume 16 Issue 1.
[6] Lamport, Leslie. Time, clocks, and the ordering of events in a distributed system.
[7] Prince Mahajan, Lorenzo Alvisi, and Mike Dahlin. Consistency, Availability, and Convergence. Technical Report (UTCS TR-11-22)

Data storage tools:
Voldemort, LinkedIn's key-value store http://www.project-voldemort.com/voldemort/  (I love it)
Redis http://redis.io/
hbase  http://hbase.apache.org/
cassandra http://cassandra.apache.org/
hypertable http://hypertable.org/

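Data products:
Search optimization
Recommendation engines
Ad delivery
CRM (member marketing)
Shopping guides
Statistical tools for macro-level data analysis
Social lending for small and mid-sized borrowers (happy to discuss)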

-----wish u good luck-------------

