zillow数据科学应用探索
zillow http://www.zillow.com/
data science at zillow
zillow 美国一家房地产租赁和销售企业。
Zillow serves the full lifecycle of owning and living in a home: buying, selling, renting, financing, remodeling and more.
Zillow described their 20TB dataset and the technology they use to estimate house values for more than 110 million homes in the US.
数据科学技术
基于R,python语言构建原型和生产环境,还会用到graphlab create构建模型
大数据
homes on zillow 110 million
home attributes 103
double precision 8byte
time series 220months
total 20T工具
R
python
R is used for prototyping work, such as proof-of-concept experiments on subsets of the dataset and also in production, mainly as an interface programming layer. The production computations avail of C++ technology. Zillow referenced proprietary R packages which they have developed in-house. One such package is ZPL(实现R并行计算), which provides a function similar to MapReduce. Both SQLserver and SQLite are used in Zillow.
应用
rent zestimate
计算租赁指数 zillow rent index
calculate raw median rent zestimates
应用平滑过滤
考虑季节性因素
质量控制
计算 房屋价值指数 zillow home value index
zillow地理信息技术
大多数开发在windows上完成
sql server 数据库
75%python,15% R,5% sql server,5% bash 和shell
linux-only database used for blazingly fast in memory and http look up
crawl->walk->run
数据挖掘模型
数据建模,寻找异常点,寻找脏数据,数据库清洗,缺失值插入
python在数据科学中的角色,科学计算中应用不断增长,机器学习算法实现更加容易
zillow常用python包,numpy,pandas,scikit-learn,textmining,
pymssql、pyodbc,sqlite3, graphlab create
使用sklearn构建欺诈检测模型,gbrt算法
总结:
基于大数据,数据科学技术,实现房产业务数据化,房产数据业务化,开发数据产品,进行精准营销。国内的安居客,搜房网等,需要接轨。
from:http://workinganalytics.com/zillow-opens-the-kimono-reveals-r-python-and-graphlab-create-underneath/