Programming Collective Intelligence: Reading Notes
Published: 2011-10-19 12:31 | Author: 崔添翼 (透明)
Source: http://cuitianyi.com
- Making Recommendations (Collaborative Filtering)
- User-based
- Finding similar users
- User as vector based on item score
- Euclidean distance
- Pearson correlation
- By swapping users and items, the same method finds items similar to a given item
- Sort and recommend items based on
- sum(user similarity * user's item score) / sum(user similarity), taken over every other user who scored the item
- Item-based
- Find item similarities
- These results can be cached and periodically updated
- Sort and recommend items based on
- sum(item similarity * user's item score) / sum(item similarity), taken over the items the user has scored
- Significantly faster on large datasets, and works better when the dataset is sparse
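The user-based steps above can be sketched in a few lines of Python. This is a minimal illustration, not the book's code: the names `pearson`, `recommend`, and the `prefs` dictionary layout (user -> {item: score}) are assumptions made here, and the sketch divides by the sum of similarities so that items scored by many users are not favored.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation over the items that both users have scored."""
    shared = [item for item in a if item in b]
    n = len(shared)
    if n == 0:
        return 0.0
    xs = [a[i] for i in shared]
    ys = [b[i] for i in shared]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant score list has no defined correlation
    return cov / sqrt(var_x * var_y)

def recommend(prefs, user, similarity=pearson):
    """Rank items the user has not scored by similarity-weighted average score."""
    totals, sim_sums = {}, {}
    for other, scores in prefs.items():
        if other == user:
            continue
        sim = similarity(prefs[user], scores)
        if sim <= 0:
            continue  # ignore dissimilar users (a choice made in this sketch)
        for item, score in scores.items():
            if item not in prefs[user]:
                totals[item] = totals.get(item, 0.0) + sim * score
                sim_sums[item] = sim_sums.get(item, 0.0) + sim
    ranked = [(total / sim_sums[item], item) for item, total in totals.items()]
    return sorted(ranked, reverse=True)

# Hypothetical toy data: ann and bob agree perfectly, ann and cat disagree.
prefs = {
    "ann": {"a": 1, "b": 2, "c": 3},
    "bob": {"a": 2, "b": 4, "c": 6, "d": 5},
    "cat": {"a": 3, "b": 2, "c": 1, "d": 1},
}
recommendations = recommend(prefs, "ann")
```

Passing a dictionary keyed by item instead of by user to the same `pearson` function gives item-to-item similarities, which is the "swap users and items" idea from the outline.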
- Discovering Groups (Clustering)
- Supervised learning (for contrast; clustering itself is unsupervised)
- use example inputs and outputs
- neural networks, decision trees, support-vector machines, and Bayesian filtering
- Word Vectors of texts
- Hierarchical Clustering
- choose two nearest vectors to combine
- results in binary tree
- Can cluster articles or words
- Dendrogram drawing
- K-Means clustering
- randomly place k centroids
- assign every item to its nearest centroid, then move each centroid to the average location of the items assigned to it; repeat until the assignments stop changing
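The k-means loop described above might look like the following. It is a sketch under stated assumptions: the function name `kmeans`, the `seed` parameter, and the choice to start the centroids on k randomly chosen input points (rather than anywhere in the space) are all made here for illustration.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Cluster points (tuples of numbers) into k groups.

    Returns (assignment, centroids), where assignment[i] is the index of
    the centroid nearest to points[i].
    """
    rng = random.Random(seed)
    # Assumption of this sketch: start on k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(iterations):
        # Step 1: assign every point to its nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering is stable
        assignment = new_assignment
        # Step 2: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids

# Hypothetical toy data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = kmeans(points, k=2)
```

The cluster labels themselves are arbitrary; only which points share a label is meaningful.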
- Searching and Ranking
- word index stored in relational database
- ranking
- content-based
- various metrics: word frequency, document location, word distance
- use inbound links
- simple count
- PageRank algorithm
- random walk
- sparse matrix multiplication iterations
- use link text
- learning from clicks
- click-tracking neural network (multilayer perceptron network, i.e. MLP network)
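As a rough illustration of the iterative PageRank computation mentioned above, here is a small dictionary-based sketch. The function name `pagerank`, the `links` layout (page -> list of pages it links to), the damping factor of 0.85, and the starting score of 1.0 are assumptions of this example; a real index would run the same update as sparse matrix multiplication over the stored link table.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Iteratively score pages from their inbound links.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each linking page shares its score equally among its outbound links.
            inbound = sum(rank[src] / len(targets)
                          for src, targets in links.items()
                          if page in targets)
            new_rank[page] = (1 - damping) + damping * inbound
        rank = new_rank
    return rank

# Hypothetical three-page web: a and b link to each other, c links to a.
links = {"a": ["b"], "b": ["a"], "c": ["a"]}
rank = pagerank(links, iterations=100)
```

Page c has no inbound links, so its score settles at the minimum of 1 - damping, while a ends up slightly ahead of b because it also receives c's vote.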
- Optimization
- stochastic optimization
- numerical solution
- cost function
- random searching
- hill climbing
- increase the most promising dimension of a vector
- simulated annealing
- variable: temperature, starts very high and gradually gets lower
- worse solution being accepted depending on temperature
- genetic algorithms
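The simulated annealing idea above (accept a worse solution with a probability that shrinks as the temperature drops) can be sketched as follows. The function name `anneal`, the parameter defaults, and the decision to remember the best solution seen so far are assumptions of this sketch rather than anything stated in the notes.

```python
import math
import random

def anneal(cost, start, step=1.0, temp=10000.0, cooling=0.95, seed=0):
    """Minimize cost(vector) by simulated annealing.

    Returns (best_vector, best_cost) seen during the run.
    """
    rng = random.Random(seed)
    current = list(start)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    while temp > 0.1:
        # Perturb one randomly chosen dimension.
        candidate = list(current)
        i = rng.randrange(len(candidate))
        candidate[i] += rng.uniform(-step, step)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / temp), which falls as temp falls.
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
        temp *= cooling  # gradually lower the temperature
    return best, best_cost

def sum_of_squares(v):
    """Hypothetical cost function with its minimum at the origin."""
    return sum(x * x for x in v)

start = [5.0, -3.0]
best, best_cost = anneal(sum_of_squares, start)
```

Hill climbing is the special case where worse moves are never accepted, which is why it can get stuck in a local minimum that annealing may escape early in the run.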
- Document Filtering (to be expanded…)
- use words as features
- naive Bayesian classifier
- the Fisher method
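A naive Bayesian classifier over word features can be sketched as below. The class name `NaiveBayes`, its method names, and the smoothing parameters `weight` and `assumed` are assumptions made for this illustration; the smoothing pulls the probability of rarely seen words toward 0.5 so that a single sighting does not dominate.

```python
from collections import defaultdict

class NaiveBayes:
    """Minimal word-feature naive Bayes text classifier (illustrative only)."""

    def __init__(self):
        # feature_counts[word][category] = documents in category containing word
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        self.category_counts = defaultdict(int)

    @staticmethod
    def features(text):
        return set(text.lower().split())

    def train(self, text, category):
        for word in self.features(text):
            self.feature_counts[word][category] += 1
        self.category_counts[category] += 1

    def word_prob(self, word, category, weight=1.0, assumed=0.5):
        """P(word | category), smoothed toward an assumed prior probability."""
        basic = self.feature_counts[word][category] / self.category_counts[category]
        total = sum(self.feature_counts[word].values())
        return (weight * assumed + total * basic) / (weight + total)

    def classify(self, text):
        total_docs = sum(self.category_counts.values())
        scores = {}
        for category, count in self.category_counts.items():
            score = count / total_docs  # prior P(category)
            for word in self.features(text):
                score *= self.word_prob(word, category)  # independence assumption
            scores[category] = score
        return max(scores, key=scores.get)

# Hypothetical training data.
clf = NaiveBayes()
clf.train("cheap pills online", "spam")
clf.train("buy cheap pills", "spam")
clf.train("meeting agenda today", "ham")
clf.train("project meeting notes", "ham")
```

For long documents the product of many small probabilities can underflow; summing logarithms instead is the usual remedy, omitted here for brevity.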
- Modeling with Decision Trees
- Algorithm: CART (Classification and Regression Trees)
- choose the best split from all possible splits
- Gini impurity
- information entropy
- recursively build the whole tree
- then can be used to classify new observations
- pruning the tree
- when it becomes overfitted
- checking pairs of nodes that have a common parent to see if merging them would increase the entropy by less than a specified threshold
- Dealing with
- missing data
- numerical outcomes
- use variance instead of entropy
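The two impurity measures and the "choose the best split" step can be sketched as follows. The function names and the row layout (a tuple of feature values paired with a label) are assumptions of this example; numeric columns are split with `>=` and other columns with `==`.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: chance that two random draws have different labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(rows, impurity=entropy):
    """Try every (column, value) split and return (gain, column, value).

    rows is a list of (features, label) pairs, features being a tuple.
    """
    base = impurity([label for _, label in rows])
    best = (0.0, None, None)
    for col in range(len(rows[0][0])):
        for value in {features[col] for features, _ in rows}:
            if isinstance(value, (int, float)):
                matches = lambda f: f[col] >= value
            else:
                matches = lambda f: f[col] == value
            left = [label for f, label in rows if matches(f)]
            right = [label for f, label in rows if not matches(f)]
            if not left or not right:
                continue  # a split that separates nothing is useless
            p = len(left) / len(rows)
            gain = base - p * impurity(left) - (1 - p) * impurity(right)
            if gain > best[0]:
                best = (gain, col, value)
    return best

# Hypothetical rows: (referrer, age) -> did the user sign up?
rows = [
    (("google", 20), "yes"),
    (("google", 25), "yes"),
    (("direct", 18), "no"),
    (("direct", 30), "no"),
]
gain, column, value = best_split(rows)
```

Building the whole tree is then a matter of calling `best_split` recursively on each side until no split yields a positive gain. For numerical outcomes, passing a variance function as `impurity` follows the same pattern.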
- Building Price Models
- k-nearest neighbors (kNN)
- weighted
- may need scaling or normalizing
- to estimate the probability density
- cross-validation
- divide data into training sets and test sets
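A weighted kNN estimator and a simple cross-validation loop could look like this. It is a sketch: the function names, the Gaussian weighting with `sigma=1.0`, and the 20% test fraction are assumptions chosen for illustration, and the features are assumed to be already scaled.

```python
import random
from math import exp, sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def gaussian_weight(distance, sigma=1.0):
    """Weight that decays smoothly with distance and never goes negative."""
    return exp(-distance ** 2 / (2 * sigma ** 2))

def weighted_knn(train, query, k=3, weight=gaussian_weight):
    """Estimate a numeric target as the weighted mean of the k nearest rows.

    train is a list of (features, target) pairs.
    """
    neighbors = sorted((euclidean(f, query), t) for f, t in train)[:k]
    total_weight = sum(weight(d) for d, _ in neighbors)
    if total_weight == 0:
        # All neighbors are too far for the weights to register: plain mean.
        return sum(t for _, t in neighbors) / len(neighbors)
    return sum(weight(d) * t for d, t in neighbors) / total_weight

def cross_validate(data, predict, test_fraction=0.2, trials=10, seed=0):
    """Average squared error over several random train/test splits."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = max(1, int(len(shuffled) * test_fraction))
        test, train = shuffled[:cut], shuffled[cut:]
        errors.append(sum((predict(train, f) - t) ** 2 for f, t in test) / len(test))
    return sum(errors) / len(errors)

# Hypothetical data: price is ten times the single feature.
data = [((float(x),), 10.0 * x) for x in range(1, 11)]
estimate = weighted_knn(data, (2.0,), k=3)
error = cross_validate(data, weighted_knn)
```

Comparing the cross-validated error for different values of `k` or `sigma` is one way to choose those parameters.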
- Advanced Classification: Kernel Methods and SVMs
- basic linear classification
- using dot-products to determine distance
- kernel methods
- replacing the dot-product with a kernel function is equivalent to mapping the points into a different space
- support-vector machines
- find the dividing line that is as far away as possible from each of the classes (maximum margin)
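To make the kernel idea concrete, the sketch below checks that a squared dot-product kernel gives the same number as an ordinary dot-product in an expanded feature space, and then uses it in a nearest-class-mean classifier written only in terms of kernel calls. The names `poly_kernel`, `expand`, and `nearest_mean_class` are assumptions of this example, and the classifier is the simple class-average method rather than a full SVM.

```python
from math import sqrt

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def poly_kernel(a, b):
    """Degree-2 polynomial kernel: the squared dot-product."""
    return dot(a, b) ** 2

def expand(v):
    """Explicit feature map for poly_kernel on 2-D points."""
    x, y = v
    return (x * x, sqrt(2) * x * y, y * y)

def nearest_mean_class(x, classes, kernel=dot):
    """Pick the class whose mean is closest to x in the kernel's space.

    Uses |phi(x) - mean|^2 = k(x, x) - 2 * avg k(x, p) + avg k(p, q),
    so the mapped points are never computed explicitly.
    """
    def squared_distance(points):
        n = len(points)
        cross = sum(kernel(x, p) for p in points) / n
        within = sum(kernel(p, q) for p in points for q in points) / (n * n)
        return kernel(x, x) - 2 * cross + within
    return min(classes, key=lambda label: squared_distance(classes[label]))

# Hypothetical data: one class near the origin, the other on a ring around it.
classes = {
    "inner": [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)],
    "outer": [(2.0, 0.0), (-2.0, 0.0), (0.0, 2.0), (0.0, -2.0)],
}
far_label = nearest_mean_class((2.1, 0.0), classes, kernel=poly_kernel)
near_label = nearest_mean_class((0.2, 0.0), classes, kernel=poly_kernel)
```

With the plain dot-product both class means sit at the origin, so no straight line separates the two groups; swapping in the kernel is enough to tell them apart.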
- Finding Independent Features
- non-negative matrix factorization
- factor the article-word matrix into two smaller non-negative matrices whose product approximates it
- the features matrix: row for features, column for words
- the weight matrix: row for articles, column for features
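One common way to compute such a factorization is with multiplicative update rules, sketched below in plain Python. The function names, the random initialization, the iteration count, and the small `eps` added to avoid division by zero are assumptions of this example; `w` plays the role of the weight matrix (articles by features) and `h` the features matrix (features by words).

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def squared_error(v, approx):
    return sum((x - y) ** 2 for rv, ra in zip(v, approx) for x, y in zip(rv, ra))

def nmf(v, n_features, iterations=200, seed=0, eps=1e-9):
    """Factor non-negative matrix v (articles x words) into w and h.

    w is articles x features, h is features x words, and w * h approximates v.
    """
    rng = random.Random(seed)
    rows, cols = len(v), len(v[0])
    w = [[rng.random() for _ in range(n_features)] for _ in range(rows)]
    h = [[rng.random() for _ in range(cols)] for _ in range(n_features)]
    for _ in range(iterations):
        # Update h: scale each entry by (w^T v) / (w^T w h).
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(cols)]
             for i in range(n_features)]
        # Update w: scale each entry by (v h^T) / (w h h^T).
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(matmul(w, h), ht)
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_features)]
             for i in range(rows)]
    return w, h

# Hypothetical 3-article, 3-word count matrix.
v = [[1, 2, 0], [2, 4, 0], [0, 0, 3]]
w, h = nmf(v, n_features=2)
error = squared_error(v, matmul(w, h))
```

Because every update only multiplies by non-negative ratios, the entries of `w` and `h` stay non-negative throughout, which is what makes the resulting features readable as additive "topics".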
- Evolving Intelligence
- creating an algorithm that creates algorithms (genetic programming)
- mutation, crossover/breeding
- use trees to represent algorithm to enable evolving
- can be used to guess numerical functions or to build game AI
- Algorithm Summary
- Supervised Learning
- Bayesian Classifier
- Decision Tree Classifier
- Neural Networks
- Support-Vector Machines
- k-Nearest Neighbors
- Unsupervised Learning
- Clustering
- Multidimensional Scaling
- Non-Negative Matrix Factorization
- Optimization