[原]Mahout 对推荐数据的抽象表示(上部分)

标签: | 发表时间:2015-10-09 05:26 | 作者:huruzun
出处:http://blog.csdn.net/huruzun

学习Mahout推荐相关算法前,我们必须先要理解Mahout如何对推荐数据进行抽象表示。首先来看下Preference,该抽象是最基本的抽象,这个抽象对象一般代表一个单独的 userID、itemID、Preference 分数,在具体实现层面首先是Preference接口:

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.mahout.cf.taste.model;

/**
 * <p>
 * A {@link Preference} encapsulates an item and a preference value, which indicates the strength of the
 * preference for it. {@link Preference}s are associated to users.
 * </p>
 */
public interface Preference {
  
  /** @return ID of user who prefers the item */
  long getUserID();
  
  /** @return item ID that is preferred */
  long getItemID();
  
  /**
   * @return strength of the preference for that item. Zero should indicate "no preference either way";
   *         positive values indicate preference and negative values indicate dislike
   */
  float getValue();
  
  /**
   * Sets the strength of the preference for this item
   * 
   * @param value
   *          new preference
   */
  void setValue(float value);
  
}

Preference接口主要定义了获取Preference三个属性方法和set Preference value值函数。

Mahout中一个Preference 对象表示 一个user 对一个 item的 score(喜爱程度),通常中我们直接用到的实现Preference接口的GenericPreference类,如下代码:

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.mahout.cf.taste.impl.model;

import java.io.Serializable;

import org.apache.mahout.cf.taste.model.Preference;

import com.google.common.base.Preconditions;

/**
 * <p>
 * A simple {@link Preference} encapsulating an item and preference value.
 * </p>
 */
public class GenericPreference implements Preference, Serializable {
  
  private final long userID;
  private final long itemID;
  private float value;
  
  public GenericPreference(long userID, long itemID, float value) {
    Preconditions.checkArgument(!Float.isNaN(value), "NaN value");
    this.userID = userID;
    this.itemID = itemID;
    this.value = value;
  }
  
  @Override
  public long getUserID() {
    return userID;
  }
  
  @Override
  public long getItemID() {
    return itemID;
  }
  
  @Override
  public float getValue() {
    return value;
  }
  
  @Override
  public void setValue(float value) {
    Preconditions.checkArgument(!Float.isNaN(value), "NaN value");
    this.value = value;
  }
  
  @Override
  public String toString() {
    return "GenericPreference[userID: " + userID + ", itemID:" + itemID + ", value:" + value + ']';
  }
  
}
通常看到这里我们会猜想Mahout会用一个容器类来存下很多Preference 对象,但是考虑到存储和效率Mahout选择了更好的方式表示一个Preference 集合。

如下图所示:



这里中文版翻译出错了,图3.2 显示是GenericUserPreferenceArray结构,该结构只需要一个userID,但是GenericItemPreferenceArray 则是依据item 为维度,其中只有一个为ItemID,所以这就看你需要做基于user 的CF还是基于Item 的CF。

从上图可以明显发现Mahout定义的PreferenceArray 更有效率的表示了Preference集合,代码如下:

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.mahout.cf.taste.model;

import java.io.Serializable;

/**
 * An alternate representation of an array of {@link Preference}. Implementations, in theory, can produce a
 * more memory-efficient representation.
 */
public interface PreferenceArray extends Cloneable, Serializable, Iterable<Preference> {
  
  /**
   * @return size of length of the "array"
   */
  int length();
  
  /**
   * @param i
   *          index
   * @return a materialized {@link Preference} representation of the preference at i
   */
  Preference get(int i);
  
  /**
   * Sets preference at i from information in the given {@link Preference}
   * 
   * @param i
   * @param pref
   */
  void set(int i, Preference pref);
  
  /**
   * @param i
   *          index
   * @return user ID from preference at i
   */
  long getUserID(int i);
  
  /**
   * Sets user ID for preference at i.
   * 
   * @param i
   *          index
   * @param userID
   *          new user ID
   */
  void setUserID(int i, long userID);
  
  /**
   * @param i
   *          index
   * @return item ID from preference at i
   */
  long getItemID(int i);
  
  /**
   * Sets item ID for preference at i.
   * 
   * @param i
   *          index
   * @param itemID
   *          new item ID
   */
  void setItemID(int i, long itemID);

  /**
   * @return all user or item IDs
   */
  long[] getIDs();
  
  /**
   * @param i
   *          index
   * @return preference value from preference at i
   */
  float getValue(int i);
  
  /**
   * Sets preference value for preference at i.
   * 
   * @param i
   *          index
   * @param value
   *          new preference value
   */
  void setValue(int i, float value);
  
  /**
   * @return independent copy of this object
   */
  PreferenceArray clone();
  
  /**
   * Sorts underlying array by user ID, ascending.
   */
  void sortByUser();
  
  /**
   * Sorts underlying array by item ID, ascending.
   */
  void sortByItem();
  
  /**
   * Sorts underlying array by preference value, ascending.
   */
  void sortByValue();
  
  /**
   * Sorts underlying array by preference value, descending.
   */
  void sortByValueReversed();
  
  /**
   * @param userID
   *          user ID
   * @return true if array contains a preference with given user ID
   */
  boolean hasPrefWithUserID(long userID);
  
  /**
   * @param itemID
   *          item ID
   * @return true if array contains a preference with given item ID
   */
  boolean hasPrefWithItemID(long itemID);
  
}

类似Preference先定义接口,这里也是先定义了PreferenceArray 接口,GenericUserPreferenceArray是实现了PreferenceArray 接口来表示上面图片所表示的数据结构。

代码:

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.mahout.cf.taste.impl.model;

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import com.google.common.base.Function;
import com.google.common.collect.Iterators;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.PreferenceArray;
import org.apache.mahout.common.iterator.CountingIterator;

/**
 * <p>
 * Like {@link GenericItemPreferenceArray} but stores preferences for one user (all user IDs the same) rather
 * than one item.
 * </p>
 *
 * <p>
 * This implementation maintains two parallel arrays, of item IDs and values. The idea is to save allocating
 * {@link Preference} objects themselves. This saves the overhead of {@link Preference} objects but also
 * duplicating the user ID value.
 * </p>
 * 
 * @see BooleanUserPreferenceArray
 * @see GenericItemPreferenceArray
 * @see GenericPreference
 */
public final class GenericUserPreferenceArray implements PreferenceArray {

  private static final int ITEM = 1;
  private static final int VALUE = 2;
  private static final int VALUE_REVERSED = 3;

  private final long[] ids;
  private long id;
  private final float[] values;

  public GenericUserPreferenceArray(int size) {
    this.ids = new long[size];
    values = new float[size];
    this.id = Long.MIN_VALUE; // as a sort of 'unspecified' value
  }

  public GenericUserPreferenceArray(List<? extends Preference> prefs) {
    this(prefs.size());
    int size = prefs.size();
    long userID = Long.MIN_VALUE;
    for (int i = 0; i < size; i++) {
      Preference pref = prefs.get(i);
      if (i == 0) {
        userID = pref.getUserID();
      } else {
        if (userID != pref.getUserID()) {
          throw new IllegalArgumentException("Not all user IDs are the same");
        }
      }
      ids[i] = pref.getItemID();
      values[i] = pref.getValue();
    }
    id = userID;
  }

  /**
   * This is a private copy constructor for clone().
   */
  private GenericUserPreferenceArray(long[] ids, long id, float[] values) {
    this.ids = ids;
    this.id = id;
    this.values = values;
  }

  @Override
  public int length() {
    return ids.length;
  }

  @Override
  public Preference get(int i) {
    return new PreferenceView(i);
  }

  @Override
  public void set(int i, Preference pref) {
    id = pref.getUserID();
    ids[i] = pref.getItemID();
    values[i] = pref.getValue();
  }

  @Override
  public long getUserID(int i) {
    return id;
  }

  /**
   * {@inheritDoc}
   * 
   * Note that this method will actually set the user ID for <em>all</em> preferences.
   */
  @Override
  public void setUserID(int i, long userID) {
    id = userID;
  }

  @Override
  public long getItemID(int i) {
    return ids[i];
  }

  @Override
  public void setItemID(int i, long itemID) {
    ids[i] = itemID;
  }

  /**
   * @return all item IDs
   */
  @Override
  public long[] getIDs() {
    return ids;
  }

  @Override
  public float getValue(int i) {
    return values[i];
  }

  @Override
  public void setValue(int i, float value) {
    values[i] = value;
  }

  @Override
  public void sortByUser() { }

  @Override
  public void sortByItem() {
    lateralSort(ITEM);
  }

  @Override
  public void sortByValue() {
    lateralSort(VALUE);
  }

  @Override
  public void sortByValueReversed() {
    lateralSort(VALUE_REVERSED);
  }

  @Override
  public boolean hasPrefWithUserID(long userID) {
    return id == userID;
  }

  @Override
  public boolean hasPrefWithItemID(long itemID) {
    for (long id : ids) {
      if (itemID == id) {
        return true;
      }
    }
    return false;
  }

  private void lateralSort(int type) {
    //Comb sort: http://en.wikipedia.org/wiki/Comb_sort
    int length = length();
    int gap = length;
    boolean swapped = false;
    while (gap > 1 || swapped) {
      if (gap > 1) {
        gap /= 1.247330950103979; // = 1 / (1 - 1/e^phi)
      }
      swapped = false;
      int max = length - gap;
      for (int i = 0; i < max; i++) {
        int other = i + gap;
        if (isLess(other, i, type)) {
          swap(i, other);
          swapped = true;
        }
      }
    }
  }

  private boolean isLess(int i, int j, int type) {
    switch (type) {
      case ITEM:
        return ids[i] < ids[j];
      case VALUE:
        return values[i] < values[j];
      case VALUE_REVERSED:
        return values[i] > values[j];
      default:
        throw new IllegalStateException();
    }
  }

  private void swap(int i, int j) {
    long temp1 = ids[i];
    float temp2 = values[i];
    ids[i] = ids[j];
    values[i] = values[j];
    ids[j] = temp1;
    values[j] = temp2;
  }

  @Override
  public GenericUserPreferenceArray clone() {
    return new GenericUserPreferenceArray(ids.clone(), id, values.clone());
  }

  @Override
  public int hashCode() {
    return (int) (id >> 32) ^ (int) id ^ Arrays.hashCode(ids) ^ Arrays.hashCode(values);
  }

  @Override
  public boolean equals(Object other) {
    if (!(other instanceof GenericUserPreferenceArray)) {
      return false;
    }
    GenericUserPreferenceArray otherArray = (GenericUserPreferenceArray) other;
    return id == otherArray.id && Arrays.equals(ids, otherArray.ids) && Arrays.equals(values, otherArray.values);
  }

  @Override
  public Iterator<Preference> iterator() {
    return Iterators.transform(new CountingIterator(length()),
      new Function<Integer, Preference>() {
        @Override
        public Preference apply(Integer from) {
          return new PreferenceView(from);
        }
      });
  }

  @Override
  public String toString() {
    if (ids == null || ids.length == 0) {
      return "GenericUserPreferenceArray[{}]";
    }
    StringBuilder result = new StringBuilder(20 * ids.length);
    result.append("GenericUserPreferenceArray[userID:");
    result.append(id);
    result.append(",{");
    for (int i = 0; i < ids.length; i++) {
      if (i > 0) {
        result.append(',');
      }
      result.append(ids[i]);
      result.append('=');
      result.append(values[i]);
    }
    result.append("}]");
    return result.toString();
  }

  private final class PreferenceView implements Preference {

    private final int i;

    private PreferenceView(int i) {
      this.i = i;
    }

    @Override
    public long getUserID() {
      return GenericUserPreferenceArray.this.getUserID(i);
    }

    @Override
    public long getItemID() {
      return GenericUserPreferenceArray.this.getItemID(i);
    }

    @Override
    public float getValue() {
      return values[i];
    }

    @Override
    public void setValue(float value) {
      values[i] = value;
    }

  }

}

有个很有意思的排序lateralSort算法实现:Comb sort ,形式化描述可能不太明白直接来例子。

梳排序还是基于冒泡排序,与冒泡不同的是,梳排序比较的是固定距离处的数的比较和交换,类似希尔那样

这个固定距离是待排数组长度除以1.3得到近似值,下次则以上次得到的近似值再除以1.3,直到距离小至3时,以1递减

假设待数组[8 4 3 7 6 5 2 1]
待排数组长度为8,而8÷1.3=6,则比较8和2,4和1,并做交换

[8 4 3 7 6 5 2 1]
[8 4 3 7 6 5 2 1]
交换后的结果为
[2 1 3 7 6 5 8 4]
第二次循环,更新间距为6÷1.3=4,比较2和6,1和5,3和8,7和4
[2 1 3 7 6 5 8 4]
[2 1 3 7 6 5 8 4]
[2 1 3 7 6 5 8 4]
[2 1 3 7 6 5 8 4]
只有7和4需要交换,交换后的结果为
[2 1 3 4 6 5 8 7]
第三次循环,更新距离为3,没有交换
第四次循环,更新距离为2,没有交换
第五次循环,更新距离为1,三处交换
[2 1 3 4 6 5 8 7]
[2 1 3 4 6 5 8 7]
[2 1 3 4 6 5 8 7]
三处交换后的结果为[1 2 3 4 5 6 7 8]
交换后排序结束,顺序输出即可得到[1 2 3 4 5 6 7 8]

参考wiki有个生动的过程图: 点击打开链接

这里源码中排序很多代码写的很棒值得参考借鉴。

Mahout设计者认为实现PreferenceArray所带来的设计复杂度是值得的,因为减少了大约75%的内存消耗。看了PreferenceArray设计后,我们将会发现

Mahout中不会使用java 原生的map、set 容器而是从存储效率角度考虑自己重新构建的数据结构 FastByIDMap 、FastByIDSet,很明显大数据下的操作每条数据节省一些存储将是非常大的存储节省。但是这里并不是说java 中容器效率

不好或者设计很糟糕,因为Mahout根据自己特定业务场景需要重新设计这样数据结构,而java 类似HashSet、HashMap之类是设计成的通用场景容器。

FastByIDMap 也是hash-based,但是它使用linear probing (线性探测再散列)而不是separate chaining(可能我们翻译成中文叫拉链法)。

这里稍微展开讲下散列的溢出处理:

1. 线性探测法

计算要插入元素散列地址,如果散列地址槽为空,就直接把该新元素插入该槽中。但是如果新元素被散列到一个已经满了的散列桶,就必须寻找其他散列桶,最简单办法就是把这个新元素插到最近的未满的散列桶中。

2. 拉链法

允许元素散列地址相同,把散列地址相同元素通过链表结构串联起来。

HashMap处理冲突是采用拉链法,我们查看实现源码:

   /**
     * Associates the specified value with the specified key in this map.
     * If the map previously contained a mapping for the key, the old
     * value is replaced.
     *
     * @param key key with which the specified value is to be associated
     * @param value value to be associated with the specified key
     * @return the previous value associated with <tt>key</tt>, or
     *         <tt>null</tt> if there was no mapping for <tt>key</tt>.
     *         (A <tt>null</tt> return can also indicate that the map
     *         previously associated <tt>null</tt> with <tt>key</tt>.)
     */
    public V put(K key, V value) {
        if (key == null)
            return putForNullKey(value);
        int hash = hash(key.hashCode());
        int i = indexFor(hash, table.length);
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }

        modCount++;
        addEntry(hash, key, value, i);
        return null;
    }

Hashmap里面的bucket出现了单链表的形式,散列表要解决的一个问题就是散列值的冲突问题,通常是两种方法:链表法和开放地址法。链表法就是将相同hash值的对象组织成一个链表放在hash值对应的槽位;开放地址法是通过一个探测算法,当某个槽位已经被占据的情况下继续查找下一个可以使用的槽位。java.util.HashMap采用的链表法的方式,链表是单向链表。形成单链表的核心代码如下:

其中很关键的是addEntry方法:

    /**
     * Adds a new entry with the specified key, value and hash code to
     * the specified bucket.  It is the responsibility of this
     * method to resize the table if appropriate.
     *
     * Subclass overrides this to alter the behavior of put method.
     */
    void addEntry(int hash, K key, V value, int bucketIndex) {
        Entry<K,V> e = table[bucketIndex];
        table[bucketIndex] = new Entry<>(hash, key, value, e);
        if (size++ >= threshold)
            resize(2 * table.length);
    }
上面方法的代码很简单,但其中包含了一个设计:系统总是将新添加的 Entry 对象放入 table 数组的 bucketIndex 索引处——如果 bucketIndex 索引处已经有了一个 Entry 对象,那新添加的 Entry 对象指向原有的 Entry 对象(产生一个 Entry 链),如果 bucketIndex 索引处没有 Entry 对象,也就是上面程序代码的 e 变量是 null,也就是新放入的 Entry 对象指向 null,也就是没有产生 Entry 链。HashMap里面没有出现hash冲突时,没有形成单链表时,hashmap查找元素很快,get()方法能够直接定位到元素,但是出现单链表后,单个bucket 里存储的不是一个 Entry,而是一个 Entry 链,系统只能必须按顺序遍历每个 Entry,直到找到想搜索的 Entry 为止——如果恰好要搜索的 Entry 位于该 Entry 链的最末端(该 Entry 是最早放入该 bucket 中),那系统必须循环到最后才能找到该元素。

       当创建 HashMap 时,有一个默认的负载因子(load factor),其默认值为 0.75,这是时间和空间成本上一种折衷:增大负载因子可以减少 Hash 表(就是那个 Entry 数组)所占用的内存空间,但会增加查询数据的时间开销,而查询是最频繁的的操作(HashMap 的 get() 与 put() 方法都要用到查询);减小负载因子会提高数据查询的性能,但会增加 Hash 表所占用的内存空间。


通过HashMap底层实现可以发现,采用拉链法每个条目entry 都额外增加一个Map.Entry,这样的Object 内存开销是非常惊人的。所以mahout中FastByIDMap采用是线性探测法,宁愿适当花费时间也要节省空间。 

Mahout中Keys 总是long 型数据在表示而不是使用object,因为使用long 类型节约了内存并且提升 了性能。

Mahout中Set底层实现不是使用的Map,FastByIDMap 表现像一个cache,因为它有一个最大的size,超过这个size不经常使用的entries将会移除。


在Mahout中对recommend input data 的最上层抽象封装就是DataModel,DataModel是一个抽象接口如下:

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.mahout.cf.taste.model;

import java.io.Serializable;

import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;

/**
 * <p>
 * Implementations represent a repository of information about users and their associated {@link Preference}s
 * for items.
 * </p>
 */
public interface DataModel extends Refreshable, Serializable {
  
  /**
   * @return all user IDs in the model, in order
   * @throws TasteException
   *           if an error occurs while accessing the data
   */
  LongPrimitiveIterator getUserIDs() throws TasteException;
  
  /**
   * @param userID
   *          ID of user to get prefs for
   * @return user's preferences, ordered by item ID
   * @throws org.apache.mahout.cf.taste.common.NoSuchUserException
   *           if the user does not exist
   * @throws TasteException
   *           if an error occurs while accessing the data
   */
  PreferenceArray getPreferencesFromUser(long userID) throws TasteException;
  
  /**
   * @param userID
   *          ID of user to get prefs for
   * @return IDs of items user expresses a preference for
   * @throws org.apache.mahout.cf.taste.common.NoSuchUserException
   *           if the user does not exist
   * @throws TasteException
   *           if an error occurs while accessing the data
   */
  FastIDSet getItemIDsFromUser(long userID) throws TasteException;
  
  /**
   * @return a {@link LongPrimitiveIterator} of all item IDs in the model, in order
   * @throws TasteException
   *           if an error occurs while accessing the data
   */
  LongPrimitiveIterator getItemIDs() throws TasteException;
  
  /**
   * @param itemID
   *          item ID
   * @return all existing {@link Preference}s expressed for that item, ordered by user ID, as an array
   * @throws org.apache.mahout.cf.taste.common.NoSuchItemException
   *           if the item does not exist
   * @throws TasteException
   *           if an error occurs while accessing the data
   */
  PreferenceArray getPreferencesForItem(long itemID) throws TasteException;
  
  /**
   * Retrieves the preference value for a single user and item.
   * 
   * @param userID
   *          user ID to get pref value from
   * @param itemID
   *          item ID to get pref value for
   * @return preference value from the given user for the given item or null if none exists
   * @throws org.apache.mahout.cf.taste.common.NoSuchUserException
   *           if the user does not exist
   * @throws TasteException
   *           if an error occurs while accessing the data
   */
  Float getPreferenceValue(long userID, long itemID) throws TasteException;

  /**
   * Retrieves the time at which a preference value from a user and item was set, if known.
   * Time is expressed in the usual way, as a number of milliseconds since the epoch.
   *
   * @param userID user ID for preference in question
   * @param itemID item ID for preference in question
   * @return time at which preference was set or null if no preference exists or its time is not known
   * @throws org.apache.mahout.cf.taste.common.NoSuchUserException if the user does not exist
   * @throws TasteException if an error occurs while accessing the data
   */
  Long getPreferenceTime(long userID, long itemID) throws TasteException;
  
  /**
   * @return total number of items known to the model. This is generally the union of all items preferred by
   *         at least one user but could include more.
   * @throws TasteException
   *           if an error occurs while accessing the data
   */
  int getNumItems() throws TasteException;
  
  /**
   * @return total number of users known to the model.
   * @throws TasteException
   *           if an error occurs while accessing the data
   */
  int getNumUsers() throws TasteException;
  
  /**
   * @param itemID item ID to check for
   * @return the number of users who have expressed a preference for the item
   * @throws TasteException if an error occurs while accessing the data
   */
  int getNumUsersWithPreferenceFor(long itemID) throws TasteException;

  /**
   * @param itemID1 first item ID to check for
   * @param itemID2 second item ID to check for
   * @return the number of users who have expressed a preference for the items
   * @throws TasteException if an error occurs while accessing the data
   */
  int getNumUsersWithPreferenceFor(long itemID1, long itemID2) throws TasteException;
  
  /**
   * <p>
   * Sets a particular preference (item plus rating) for a user.
   * </p>
   * 
   * @param userID
   *          user to set preference for
   * @param itemID
   *          item to set preference for
   * @param value
   *          preference value
   * @throws org.apache.mahout.cf.taste.common.NoSuchItemException
   *           if the item does not exist
   * @throws org.apache.mahout.cf.taste.common.NoSuchUserException
   *           if the user does not exist
   * @throws TasteException
   *           if an error occurs while accessing the data
   */
  void setPreference(long userID, long itemID, float value) throws TasteException;
  
  /**
   * <p>
   * Removes a particular preference for a user.
   * </p>
   * 
   * @param userID
   *          user from which to remove preference
   * @param itemID
   *          item to remove preference for
   * @throws org.apache.mahout.cf.taste.common.NoSuchItemException
   *           if the item does not exist
   * @throws org.apache.mahout.cf.taste.common.NoSuchUserException
   *           if the user does not exist
   * @throws TasteException
   *           if an error occurs while accessing the data
   */
  void removePreference(long userID, long itemID) throws TasteException;

  /**
   * @return true if this implementation actually stores and returns distinct preference values;
   *  that is, if it is not a 'boolean' DataModel
   */
  boolean hasPreferenceValues();

  /**
   * @return the maximum preference value that is possible in the current problem domain being evaluated. For
   * example, if the domain is movie ratings on a scale of 1 to 5, this should be 5. While a
   * {@link org.apache.mahout.cf.taste.recommender.Recommender} may estimate a preference value above 5.0, it
   * isn't "fair" to consider that the system is actually suggesting an impossible rating of, say, 5.4 stars.
   * In practice the application would cap this estimate to 5.0. Since evaluators evaluate
   * the difference between estimated and actual value, this at least prevents this effect from unfairly
   * penalizing a {@link org.apache.mahout.cf.taste.recommender.Recommender}
   */
  float getMaxPreference();

  /**
   * @see #getMaxPreference()
   */
  float getMinPreference();
  
}

简单描述这个接口提供最基本几个功能如:所有user IDs、关于某个item的所有偏好得分、所有对某个item进行评价user 集合。

最简单的对DataModel实现的是GenericDataModel类,简单看下使用示例程序如下:

FastByIDMap<PreferenceArray> preferences = new FastByIDMap<PreferenceArray>();
		PreferenceArray prefsForUser1 = new GenericUserPreferenceArray(10);
		prefsForUser1.setUserID(0, 1L);
		prefsForUser1.setItemID(0, 101L);
		prefsForUser1.setValue(0, 3.0f);
		prefsForUser1.setItemID(1, 102L);
		prefsForUser1.setValue(1, 4.5f);
		//...(8 more)
		preferences.put(1L, prefsForUser1);
		DataModel model = new GenericDataModel(preferences);
		System.out.println(model);
当然这里只是最简单的使用,平时我们读数据更多是从文件模型或者数据库读入,Mahout也有针对这些场景的设计。

作者:huruzun 发表于2015/10/8 21:26:21 原文链接
阅读:507 评论:1 查看评论

相关 [mahout 数据 抽象] 推荐:

[原]Mahout 对推荐数据的抽象表示(下部分)

- - huruzun的专栏
这篇博客是延续上部分的补充: Mahout 对推荐数据的抽象表示(上部分). 处理无Preference values 数据. 下面都是围绕Mahout对没有Preference values的数据的推荐. 有时进入推荐引擎的数据没有Preference values,而是只有相关联的一个userID、itemID,它们之间有多强的联系我们没有一个Preference values来量化衡量.

[原]Mahout 对推荐数据的抽象表示(上部分)

- - huruzun的专栏
学习Mahout推荐相关算法前,我们必须先要理解Mahout如何对推荐数据进行抽象表示. 首先来看下Preference,该抽象是最基本的抽象,这个抽象对象一般代表一个单独的 userID、itemID、Preference 分数,在具体实现层面首先是Preference接口:. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License.

Mahout介绍

- - 互联网 - ITeye博客
Mahout 是机器学习和数据挖掘的一个分布式框架,区别于其他的开源数据挖掘软件,它是基于hadoop之上的; 所以hadoop的优势就是Mahout的优势. http://mahout.apache.org/ 上说的Scalable就是指hadoop的可扩展性. Mahout用map-reduce实现了部分数据挖掘算法,解决了并行挖掘的问题.

C++ 工程实践(9):数据抽象

- roc - 博客园-陈硕的 Blog
陈硕 (giantchen_AT_gmail). 陈硕关于 C++ 工程实践的系列文章: http://blog.csdn.net/Solstice/category/802325.aspx. 排版正常的版本: http://www.cnblogs.com/Solstice/category/287661.html.

使用Mahout搭建推荐系统之入门篇2-玩转你的数据1

- - 互联网 - ITeye博客
用意: 搞推荐系统或者数据挖掘的, 对数据要绝对的敏感和熟悉, 并且热爱你的数据. 分析数据既要用统计分析那一套,又要熟悉业务发掘有趣的特征(feature). 后者有意思的多,但是因为我业务做的不多,还不太熟悉, 跪求大牛们分析业务经历. 听豆瓣上的大神"懒惰啊我"说过,有一个Nokia的比赛,有一个团队直接用陀螺仪参数就发现了性别分布,因为男生手机都放在 口袋里, 而女生往往放在包里面.

mahout部署实践

- - CSDN博客云计算推荐文章
一 下载mahout并解压. JAVA_HOME mahout运行需指定jdk的目录. MAHOUT_JAVA_HOME指定此变量可覆盖JAVA_HOME值. HADOOP_HOME  如果配置,则在hadoop分布式平台上运行,否则单机运行. HADOOP_CONF_DIR指定hadoop的配置文件目录.

mahout 实用教程之一

- - CSDN博客云计算推荐文章
mahout 实用教程 (一). 本文力求把mahout从使用的角度为读者建立一个框架,为后续的使用打下基础. 本文为原创文章转载请注明原网址 http://blog.csdn.net/comaple,谢谢. 下面首先给出源代码svn地址以及用于测试的公共数据集,大家可以下载并测试. mahout svn仓库地址: http://svn.apache.org/repos/asf/mahout/trunk.

Mahout: SVDRecommender SVD推荐算法

- -

Mahout实现的机器学习算法

- - ITeye博客
使用命令:mahout -h.   在Mahout实现的机器学习算法见下表:. EM聚类(期望最大化聚类). 并行FP Growth算法. 并行化了Watchmaker框架. 非Map-Reduce算法. 扩展了java的Collections类. Mahout最大的优点就是基于hadoop实现,把很多以前运行于单机上的算法,转化为了MapReduce模式,这样大大提升了算法可处理的数据量和处理性能.