Programming Collective Intelligence: Reading Notes

Tags: 程序园 | Published: 2011-10-19 12:31 | Author: 崔添翼 透明
Source: http://cuitianyi.com
  • Making Recommendations (Collaborative Filtering)
    • User-based
      • Finding similar users
        • User as vector based on item score
          • Euclidean distance
          • Pearson correlation
        • Swap users and items to find the items most similar to a given item
      • Sort and recommend items based on
        • sum(user similarity * user’s item score) / sum(user similarity), taken over all other users
    • Item-based
      • Find item similarities
        • These results can be cached and periodically updated
      • Sort and recommend items based on
        • sum(item similarity * user’s item score) / sum(item similarity), taken over the items the user has scored
      • Significantly faster on large datasets, and usually better on sparse datasets
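The user-based scheme above can be sketched in a few lines of Python. This is only an illustration: the `ratings` data, the function names, and the choice to ignore users with non-positive similarity are assumptions made for the example, not details taken from the notes.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: score}
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob":   {"A": 4.0, "B": 2.0, "C": 3.5, "D": 5.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0},
}

def pearson(prefs, u, v):
    """Pearson correlation over the items both users scored (0.0 if undefined)."""
    shared = [item for item in prefs[u] if item in prefs[v]]
    n = len(shared)
    if n < 2:
        return 0.0
    xs = [prefs[u][item] for item in shared]
    ys = [prefs[v][item] for item in shared]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def recommend(prefs, user):
    """Rank unseen items by the similarity-weighted average of other users' scores."""
    totals, sim_sums = {}, {}
    for other in prefs:
        if other == user:
            continue
        sim = pearson(prefs, user, other)
        if sim <= 0:  # assumption: skip dissimilar users
            continue
        for item, score in prefs[other].items():
            if item in prefs[user]:
                continue
            totals[item] = totals.get(item, 0.0) + sim * score
            sim_sums[item] = sim_sums.get(item, 0.0) + sim
    ranked = [(totals[item] / sim_sums[item], item) for item in totals]
    return sorted(ranked, reverse=True)
```

Calling `recommend(ratings, "alice")` returns a list of `(predicted score, item)` pairs, best first. Transposing `ratings` into item -> {user: score} and reusing `pearson` gives item-to-item similarities, which is the starting point for the item-based variant.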
  • Discovering Groups (Clustering)
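    • Supervised vs. unsupervised learning
      • supervised: learns from example inputs and outputs
        • neural networks, decision trees, support-vector machines, and Bayesian filtering
      • unsupervised: finds structure in the data without example outputs (clustering belongs here)
    • Texts represented as word-count vectors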
    • Hierarchical Clustering
      • repeatedly merge the two nearest clusters
      • results in a binary tree
    • Can cluster articles or words
      • transpose the matrix to switch between the two
    • Dendrogram drawing
    • K-Means clustering
      • randomly place k centroids
      • assign every item to the nearest centroid, then move each centroid to the average location of the items assigned to it; repeat until the assignments stop changing
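A minimal k-means sketch of the two steps above follows. The function name, the parameters, and the choice to start the centroids at randomly sampled data points (rather than at arbitrary random locations) are assumptions made for the example.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Cluster `points` (sequences of numbers) into k groups.

    Returns (assignment, centroids), where assignment[i] is the index
    of the centroid nearest to points[i].
    """
    rng = random.Random(seed)
    # Assumption: initialise centroids at k distinct data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(iterations):
        # Step 1: assign every point to its nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: nothing moved
        assignment = new_assignment
        # Step 2: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```

For example, `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)` should separate the three points near the origin from the three points near (10, 10).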
  • Searching and Ranking
    • word index stored in a relational database
    • ranking
      • content-based
        • various metrics: word frequency, document location, word distance
      • use inbound links
        • simple count
        • PageRank algorithm
          • random walk
          • sparse matrix multiplication iterations
        • use link text
      • learning from clicks
        • click-tracking neural network (a multilayer perceptron, i.e. an MLP network)
          • one hidden layer
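The PageRank item above (a random walk computed by repeated sparse multiplication) can be sketched as a power iteration over an adjacency list. The damping factor of 0.85, the normalisation so that ranks sum to 1, and the handling of pages without outbound links are assumptions of this sketch and may differ from the book's own code.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns page -> rank (sums to 1)."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a base share from the random-jump part of the walk.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among its outbound links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no outbound links spreads its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank
```

With `links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}`, page `c` collects links from three pages and ends up with the highest rank.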
  • Optimization
    • stochastic optimization
      • numerical solution
      • cost function
    • random searching
    • hill climbing
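      • start from a random solution and repeatedly move to the neighboring solution (one dimension of the vector changed) with the lowest cost
    • simulated annealing
      • variable: temperature, starts very high and gradually gets lower
      • a worse solution may still be accepted, with a probability that shrinks as the temperature drops
    • genetic algorithms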
      • mutate, crossover, …
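As a rough illustration of simulated annealing as outlined above, here is a small sketch. The cooling schedule, the step size, and the acceptance rule `exp(-(new_cost - old_cost) / temp)` are common choices assumed for the example; all names and default values are hypothetical.

```python
import math
import random

def anneal(cost, start, step=1.0, temp=1000.0, cooling=0.95, min_temp=0.01, seed=0):
    """Minimise `cost` over a vector of numbers, starting from `start`."""
    rng = random.Random(seed)
    current = list(start)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    while temp > min_temp:
        # Perturb one randomly chosen dimension.
        i = rng.randrange(len(current))
        candidate = list(current)
        candidate[i] += rng.uniform(-step, step)
        cand_cost = cost(candidate)
        # Always accept improvements; accept a worse solution with a
        # probability that falls as the temperature falls.
        if (cand_cost < current_cost
                or rng.random() < math.exp(-(cand_cost - current_cost) / temp)):
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
        temp *= cooling
    return best, best_cost
```

Usage: `anneal(lambda v: sum((x - 3) ** 2 for x in v), [0.0, 0.0])` returns the best vector seen and its cost. Hill climbing corresponds to the same loop with the probabilistic acceptance removed.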
  • Document Filtering (to be expanded…)
    • use words as features
    • naive Bayesian classifier
    • the Fisher method
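Since this section is only an outline, the following is just a sketch of a naive Bayesian classifier that uses words as features. It assumes whitespace tokenisation and add-one (Laplace) smoothing, which is a substitute for whatever weighting the book itself uses; the class and method names are hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> number of docs containing it
        self.doc_counts = Counter()              # category -> number of training docs

    def train(self, text, category):
        self.doc_counts[category] += 1
        for word in set(text.lower().split()):
            self.word_counts[category][word] += 1

    def log_score(self, text, category):
        """log P(category) + sum of log P(word | category) over the words in text."""
        n_docs = self.doc_counts[category]
        score = math.log(n_docs / sum(self.doc_counts.values()))
        for word in set(text.lower().split()):
            # Assumption: add-one smoothing so unseen words never give probability 0.
            p = (self.word_counts[category][word] + 1) / (n_docs + 2)
            score += math.log(p)
        return score

    def classify(self, text):
        return max(self.doc_counts, key=lambda c: self.log_score(text, c))
```

The "naive" part is the assumption that words occur independently given the category, which is what allows the per-word probabilities to be multiplied (here, their logarithms are added).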
  • Modeling with Decision Trees
    • Algorithm: CART (Classification and Regression Trees)
      • choose the best split from all possible splits
        • Gini impurity
        • information entropy
          • -sum of p(x) * log2(p(x)) over all outcomes x
      • recursively build the whole tree
      • then can be used to classify new observations
      • pruning the tree
        • needed when the tree becomes overfitted
        • check pairs of leaf nodes that share a parent, and merge them if doing so would increase the entropy by less than a specified threshold
    • Dealing with
      • missing data
        • follow both branches and weight their results
      • numerical outcomes
        • use variance instead of entropy
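The two impurity measures and the "choose the best split" step can be sketched as follows. The row format (tuples of column values), the `>=` test for numeric columns, and the `==` test for other columns are assumptions of this example; it scores splits by entropy reduction only.

```python
import math
from collections import Counter

def gini_impurity(labels):
    """Probability that two labels drawn at random (with replacement) differ."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """-sum of p(x) * log2(p(x)) over the distinct labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    """Try every (column, value) split; return (information gain, (column, value))."""
    base = entropy(labels)
    best = (0.0, None)
    for col in range(len(rows[0])):
        for value in {row[col] for row in rows}:
            if isinstance(value, (int, float)):
                test = lambda row: row[col] >= value   # numeric column
            else:
                test = lambda row: row[col] == value   # categorical column
            left = [lab for row, lab in zip(rows, labels) if test(row)]
            right = [lab for row, lab in zip(rows, labels) if not test(row)]
            if not left or not right:
                continue  # split does not divide the data
            p = len(left) / len(labels)
            gain = base - p * entropy(left) - (1 - p) * entropy(right)
            if gain > best[0]:
                best = (gain, (col, value))
    return best
```

Building the tree amounts to applying `best_split` recursively to the two halves until no split yields a positive gain. For numerical outcomes, the same structure works with variance in place of `entropy`.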
  • Building Price Models
    • k-nearest neighbors (kNN)
      • weighted
      • may need scaling or normalizing
      • to estimate the probability density
    • cross-validation
      • divide data into training sets and test sets
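A weighted kNN estimator together with a simple cross-validation loop might look like the sketch below. The Gaussian weighting function, the parameter defaults, and the random train/test split are assumptions for illustration. `math.dist` requires Python 3.8 or later.

```python
import math
import random

def weighted_knn(data, query, k=3, sigma=1.0):
    """data: list of (feature_vector, value). Returns a weighted average of the
    k nearest values, with closer neighbours weighted more heavily."""
    nearest = sorted((math.dist(vec, query), value) for vec, value in data)[:k]
    num = den = 0.0
    for d, value in nearest:
        w = math.exp(-d * d / (2 * sigma ** 2))  # assumed Gaussian weight
        num += w * value
        den += w
    return num / den if den > 0 else 0.0

def cross_validate(data, k=3, test_fraction=0.25, trials=20, seed=0):
    """Average squared error over several random train/test splits."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = max(1, int(len(shuffled) * test_fraction))
        test, train = shuffled[:cut], shuffled[cut:]
        err = sum((weighted_knn(train, vec, k) - value) ** 2 for vec, value in test)
        errors.append(err / len(test))
    return sum(errors) / len(errors)
```

Comparing `cross_validate(data, k=1)`, `cross_validate(data, k=3)` and so on is one way to pick `k`. The same comparison can be used to judge different scalings of the feature columns.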
  • Advanced Classification: Kernel Methods and SVMs
    • basic linear classification
      • using dot-products to determine which class a point is closer to
    • kernel methods
      • replace the dot-product with a kernel function, which is equivalent to moving the points into a different space
    • support-vector machines
      • find the dividing line that is as far away as possible from each class (maximum margin)
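The claim that choosing another dot-product amounts to moving the points into a different space can be checked numerically. The sketch below assumes 2-D points and the degree-2 polynomial kernel (x·y)², whose explicit feature map is φ(x) = (x1², √2·x1·x2, x2²). The radial-basis function is included only as a second common kernel; all function names are hypothetical.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def poly_kernel(x, y):
    """Degree-2 polynomial kernel: computed in the original 2-D space."""
    return dot(x, y) ** 2

def phi(x):
    """Explicit map into the 3-D space that poly_kernel implicitly works in."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def rbf_kernel(x, y, gamma=1.0):
    """Radial-basis kernel; its implicit feature space is infinite-dimensional."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

x, y = (1.0, 2.0), (3.0, 4.0)
# Same number either way: the kernel never has to build phi(x) explicitly.
assert abs(poly_kernel(x, y) - dot(phi(x), phi(y))) < 1e-9
```

A linear classifier written purely in terms of `dot` becomes a non-linear one when `dot` is swapped for `poly_kernel` or `rbf_kernel`.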
  • Finding Independent Features
    • non-negative matrix factorization
      • factor the article-word matrix into two matrices
        • the features matrix: row for features, column for words
        • the weight matrix: row for articles, column for features
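One way to compute such a factorization is with multiplicative update rules, sketched below in plain Python. Here `v` is the article-word matrix, `w` the weight matrix (articles × features) and `h` the features matrix (features × words), so that `v` is approximately `w` times `h`. The update rule, the random initialisation, and the small `eps` guard against division by zero are assumptions of this sketch.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def cost(v, wh):
    """Sum of squared differences between v and its reconstruction."""
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))

def nmf(v, n_features, iterations=200, seed=0, eps=1e-9):
    rng = random.Random(seed)
    rows, cols = len(v), len(v[0])
    w = [[rng.random() for _ in range(n_features)] for _ in range(rows)]
    h = [[rng.random() for _ in range(cols)] for _ in range(n_features)]
    for _ in range(iterations):
        # Update the features matrix: h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(cols)]
             for i in range(n_features)]
        # Update the weight matrix: w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(matmul(w, h), ht)
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_features)]
             for i in range(rows)]
    return w, h
```

Because every update only multiplies by non-negative ratios, `w` and `h` stay non-negative as long as `v` is. `cost(v, matmul(w, h))` measures how well the two factors reconstruct the original matrix.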
  • Evolving Intelligence
    • creating an algorithm that creates algorithms
    • mutation, crossover/breeding
    • use trees to represent algorithm to enable evolving
      • used to guess numerical functions or to build game AI
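To make the tree representation concrete, here is a small genetic-programming sketch that evolves an expression tree to fit sample points of a one-variable function. The tuple encoding, the operator set, the elitist selection scheme, and the depth cap are all assumptions made for the example.

```python
import random

OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b, "mul": lambda a, b: a * b}

def evaluate(tree, x):
    """Trees are tuples: ("const", c), ("param",), or (op, left, right)."""
    if tree[0] == "const":
        return tree[1]
    if tree[0] == "param":
        return x
    return OPS[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))

def depth(tree):
    return 1 + max(depth(tree[1]), depth(tree[2])) if tree[0] in OPS else 0

def random_tree(rng, max_depth=3):
    if max_depth == 0 or rng.random() < 0.3:
        return ("param",) if rng.random() < 0.5 else ("const", rng.randint(0, 5))
    op = rng.choice(sorted(OPS))
    return (op, random_tree(rng, max_depth - 1), random_tree(rng, max_depth - 1))

def mutate(tree, rng, prob=0.1):
    """With probability `prob`, replace a node by a fresh random subtree."""
    if rng.random() < prob:
        return random_tree(rng, 2)
    if tree[0] in OPS:
        return (tree[0], mutate(tree[1], rng, prob), mutate(tree[2], rng, prob))
    return tree

def random_subtree(tree, rng):
    while tree[0] in OPS and rng.random() < 0.5:
        tree = rng.choice(tree[1:])
    return tree

def crossover(a, b, rng, prob=0.3):
    """Replace one randomly chosen subtree of `a` with a subtree of `b`."""
    if rng.random() < prob:
        return random_subtree(b, rng)
    if a[0] in OPS:
        if rng.random() < 0.5:
            return (a[0], crossover(a[1], b, rng, prob), a[2])
        return (a[0], a[1], crossover(a[2], b, rng, prob))
    return a

def fitness(tree, samples):
    """Total absolute error on the samples; lower is better."""
    return sum(abs(evaluate(tree, x) - y) for x, y in samples)

def evolve(samples, pop_size=50, generations=20, max_depth=8, seed=0):
    rng = random.Random(seed)
    population = [random_tree(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda t: fitness(t, samples))
        if fitness(population[0], samples) == 0:
            break
        elite = population[:pop_size // 5]  # the best trees survive unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            child = mutate(crossover(rng.choice(elite), rng.choice(elite), rng), rng)
            if depth(child) > max_depth:    # assumption: cap tree growth
                child = random_tree(rng)
            children.append(child)
        population = elite + children
    return min(population, key=lambda t: fitness(t, samples))
```

For instance, `evolve([(x, x * x + 1) for x in range(-3, 4)])` searches for a tree that reproduces x² + 1. Because the elite is carried over unchanged, the best fitness never gets worse from one generation to the next, though an exact match is not guaranteed.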
  • Algorithm Summary
    • Supervised Learning
      • Bayesian Classifier
      • Decision Tree Classifier
      • Neural Networks
      • Support-Vector Machines
      • k-Nearest Neighbors
    • Unsupervised Learning
      • Clustering
      • Multidimensional Scaling
      • Non-Negative Matrix Factorization
    • Optimization
