Modern Methods for Sentiment Analysis (with the code issues fixed)
I have recently been looking into sentiment analysis and came across the article "Modern Methods for Sentiment Analysis". The methods it describes are not actually very "modern"; they are fairly traditional approaches. Parts of it are excerpted here as study notes. Because the original code no longer runs (most likely due to library version changes), it has been reorganized and fixed up here.
The Role of Word2Vec in Sentiment Analysis
Word2Vec can capture important relationships between words, which makes it useful in many NLP projects and in our sentiment analysis case. Before applying it to sentiment analysis, let's first test Word2Vec's ability to separate words into categories. We will use three categorized word lists: food, sports, and weather words, which can be downloaded from Enchanted Learning. Here Word2Vec uses the pre-trained word vector file GoogleNews-vectors-negative300.bin.
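For a quick sanity check of the relationships these pre-trained vectors capture, a minimal sketch is shown below (assuming GoogleNews-vectors-negative300.bin sits under ./data/; the probe words are arbitrary examples, not taken from the article):

import gensim

# Load the pre-trained Google News vectors (this takes several GB of memory)
model = gensim.models.KeyedVectors.load_word2vec_format(
    './data/GoogleNews-vectors-negative300.bin', binary=True)

# Nearest neighbours of a food word should mostly be other food words
print(model.most_similar('pizza', topn=5))

# Words from the same category should score higher than words from different categories
print(model.similarity('soccer', 'basketball'))  # expected: relatively high
print(model.similarity('soccer', 'rainstorm'))   # expected: relatively low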
# Scrape the category word lists from Enchanted Learning
import requests
import re
from django.utils.html import strip_tags

pattern = re.compile("<div class=wordlist-item>(.*?)</div>")

def get_dict(category):
    url = "https://www.enchantedlearning.com/wordlist/{0}.shtml".format(category)
    r = requests.get(url)
    item_list = re.findall(pattern, r.text)
    for item in item_list:
        with open("./data/" + category + "_words.txt", 'a', encoding='utf-8') as f:
            f.write(strip_tags(item) + '\n')

if __name__ == "__main__":
    get_dict("sports")
    get_dict("food")
    get_dict("weather")
Since these are 300-dimensional vectors, we need the t-SNE dimensionality reduction algorithm from Scikit-Learn to visualize them in 2D.
import numpy as np
import gensim
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

model = gensim.models.KeyedVectors.load_word2vec_format(
    './data/GoogleNews-vectors-negative300.bin', binary=True)

with open('./data/food_words.txt', 'r') as infile:
    food_words = infile.readlines()
with open('./data/sports_words.txt', 'r') as infile:
    sports_words = infile.readlines()
with open('./data/weather_words.txt', 'r') as infile:
    weather_words = infile.readlines()

def get_word_vecs(words):
    vecs = []
    for word in words:
        word = word.replace('\n', '')
        try:
            vecs.append(model[word].reshape((1, 300)))
        except KeyError:
            continue
    vecs = np.concatenate(vecs)
    return np.array(vecs, dtype='float')  # TSNE expects float type values

food_vecs = get_word_vecs(food_words)
sports_vecs = get_word_vecs(sports_words)
weather_vecs = get_word_vecs(weather_words)

ts = TSNE(2)
reduced_vecs = ts.fit_transform(np.concatenate((food_vecs, sports_vecs, weather_vecs)))

# color points by word group to see if Word2Vec can separate them
for i in range(len(reduced_vecs)):
    if i < len(food_vecs):
        color = 'b'  # food words colored blue
    elif len(food_vecs) <= i < (len(food_vecs) + len(sports_vecs)):
        color = 'r'  # sports words colored red
    else:
        color = 'g'  # weather words colored green
    plt.plot(reduced_vecs[i, 0], reduced_vecs[i, 1], marker='o', color=color, markersize=8)
plt.show()
As the resulting plot shows, Word2Vec separates the unrelated word groups well and clusters related words together.
Sentiment Analysis of Emoji Tweets
We use emoji to attach noisy labels to our data: a smiley :-) marks a tweet as positive, and a frowny :-( marks it as negative. A total of 400,000 tweets are split into these two groups. We sample randomly from both groups to build training and test sets with an 8:2 split, and then train a Word2Vec model on the training data. The classifier's input is the average of all word vectors in each tweet. A sketch of the emoticon labelling step itself follows below.
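The labelling step is not part of the downloadable data, which already comes split into pos_tweets.txt and neg_tweets.txt. A minimal sketch of how raw tweets could be split by emoticon (raw_tweets.txt is a hypothetical input file, one tweet per line):

# Hypothetical distant-supervision labelling: emoticons act as noisy sentiment labels
with open('twitter_data/raw_tweets.txt', 'r', encoding='utf-8') as infile, \
     open('twitter_data/pos_tweets.txt', 'w', encoding='utf-8') as pos_out, \
     open('twitter_data/neg_tweets.txt', 'w', encoding='utf-8') as neg_out:
    for tweet in infile:
        if ':-)' in tweet or ':)' in tweet:
            pos_out.write(tweet.replace(':-)', '').replace(':)', ''))  # strip the label source
        elif ':-(' in tweet or ':(' in tweet:
            neg_out.write(tweet.replace(':-(', '').replace(':(', ''))
        # tweets without a clear emoticon are discarded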
Training data: https://github.com/udaykeith/Meetup-4thNov
# Import the data and build the Word2Vec model
from sklearn.model_selection import train_test_split
from gensim.models.word2vec import Word2Vec
import numpy as np
from sklearn.preprocessing import scale
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

with open('twitter_data/pos_tweets.txt', 'r', encoding="utf-8") as infile:
    pos_tweets = infile.readlines()
with open('twitter_data/neg_tweets.txt', 'r', encoding="utf-8") as infile:
    neg_tweets = infile.readlines()

# use 1 for positive sentiment, 0 for negative
y = np.concatenate((np.ones(len(pos_tweets)), np.zeros(len(neg_tweets))))

x_train, x_test, y_train, y_test = train_test_split(
    np.concatenate((pos_tweets, neg_tweets)), y, test_size=0.2)

def clean_text(corpus):
    corpus = [z.lower().replace('\n', '').split() for z in corpus]
    return corpus

x_train = clean_text(x_train)
x_test = clean_text(x_test)

# Initialize model and build vocab
n_dim = 300
twitter_w2v = Word2Vec(size=n_dim, min_count=10)  # on gensim >= 4.0 the parameter is vector_size
twitter_w2v.build_vocab(x_train)

# Train the model over train_reviews (this may take several minutes)
twitter_w2v.train(x_train, epochs=twitter_w2v.epochs, total_examples=twitter_w2v.corpus_count)

# Build a tweet vector as the average of all word vectors in the tweet
def build_word_vector(text, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in text:
        try:
            vec += twitter_w2v.wv[word].reshape((1, size))
            count += 1.
        except KeyError:
            continue
    if count != 0:
        vec /= count
    return vec

# Scaling is part of standardization: each feature is transformed to zero mean and unit variance
# before it is fed to the classifier
train_vecs = np.concatenate([build_word_vector(z, n_dim) for z in x_train])
train_vecs = scale(train_vecs)

# Train word2vec on test tweets
twitter_w2v.train(x_test, epochs=twitter_w2v.epochs, total_examples=twitter_w2v.corpus_count)

# Build test tweet vectors then scale
test_vecs = np.concatenate([build_word_vector(z, n_dim) for z in x_test])
test_vecs = scale(test_vecs)

lr = SGDClassifier(loss='log', penalty='l1')  # on newer scikit-learn use loss='log_loss'
lr.fit(train_vecs, y_train)
print('Test Accuracy: %.2f' % lr.score(test_vecs, y_test))

# Create ROC curve
pred_probas = lr.predict_proba(test_vecs)[:, 1]
fpr, tpr, _ = roc_curve(y_test, pred_probas)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label='area = %.2f' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.legend(loc='lower right')
plt.show()
Output: Test Accuracy: 0.73
Without any hand-crafted features and with only minimal text preprocessing, this simple linear model built with Scikit-Learn reaches 73% accuracy. Interestingly, removing punctuation hurts accuracy, which suggests the Word2Vec model can extract information from the symbols in a document. Handling individual words more carefully, training for longer, doing more preprocessing, and tuning the model parameters can all improve accuracy. Using an artificial neural network (ANN) is said to raise accuracy by about 5 percentage points. Note that Scikit-Learn did not ship an ANN classifier at the time the original article was written (recent versions provide MLPClassifier), so a custom library was used instead: https://github.com/mczerny/NNet
from NNet import NeuralNet

nnet = NeuralNet(100, learn_rate=1e-1, penalty=1e-8)
maxiter = 1000
batch = 150
_ = nnet.fit(train_vecs, y_train, fine_tune=False, maxiter=maxiter, SGD=True, batch=batch, rho=0.9)
print('Test Accuracy: %.2f' % nnet.score(test_vecs, y_test))
Running this gives Test Accuracy: 0.67, which is actually worse. In addition, NNet throws errors when run and the failing spots have to be patched by hand; the code is messy and not recommended.
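Since newer scikit-learn versions do ship an ANN classifier (MLPClassifier), a simpler alternative to the NNet library is a sketch like the following, reusing train_vecs, y_train, test_vecs and y_test from above (the hidden-layer size and iteration count are arbitrary choices, not tuned values):

from sklearn.neural_network import MLPClassifier

# A small feed-forward network trained on the averaged, scaled tweet vectors
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42)
mlp.fit(train_vecs, y_train)
print('Test Accuracy: %.2f' % mlp.score(test_vecs, y_test))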
Analyzing Movie Review Data with Doc2Vec
Averaging word vectors works reasonably well for tweets because a tweet is typically only a dozen or so words long, so the averaged vector still keeps the relevant features. Once we move to paragraph-length data, ignoring context and word order throws away a lot of important information. In that case it is better to use Doc2Vec to build the input features. As an example, we use the IMDB movie review dataset to test how effective Doc2Vec is for sentiment analysis (the code is essentially the same as in the earlier post on IMDB sentiment analysis with Word2Vec/Doc2Vec). The dataset contains 25,000 positive reviews, 25,000 negative reviews, and 50,000 unlabeled reviews. We first build the Doc2Vec models, making use of the unlabeled reviews as well:
from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
import numpy as np
import random
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

with open('imdb_data/pos.txt', 'r', encoding='utf-8') as infile:
    pos_reviews = infile.readlines()
with open('imdb_data/neg.txt', 'r', encoding='utf-8') as infile:
    neg_reviews = infile.readlines()
with open('imdb_data/unsup.txt', 'r', encoding='utf-8') as infile:
    unsup_reviews = infile.readlines()

# use 1 for positive sentiment, 0 for negative
y = np.concatenate((np.ones(len(pos_reviews)), np.zeros(len(neg_reviews))))

x_train, x_test, y_train, y_test = train_test_split(
    np.concatenate((pos_reviews, neg_reviews)), y, test_size=0.2)

# Do some very minor text preprocessing
def cleanText(corpus):
    punctuation = """.,?!:;(){}[]"""
    corpus = [z.lower().replace('\n', '') for z in corpus]
    corpus = [z.replace('<br />', ' ') for z in corpus]
    # treat punctuation as individual words
    for c in punctuation:
        corpus = [z.replace(c, ' %s ' % c) for z in corpus]
    corpus = [z.split() for z in corpus]
    return corpus

x_train = cleanText(x_train)
x_test = cleanText(x_test)
unsup_reviews = cleanText(unsup_reviews)

# Gensim's Doc2Vec implementation requires each document/paragraph to carry a tag.
# We do this by wrapping each review in a TaggedDocument. The tag will be "TRAIN_i" or
# "TEST_i" where "i" is a dummy index of the review.
def labelizeReviews(reviews, label_type):
    labelized = []
    for i, v in enumerate(reviews):
        label = '%s_%s' % (label_type, i)
        labelized.append(TaggedDocument(v, [label]))
    return labelized

x_train = labelizeReviews(x_train, 'TRAIN')
x_test = labelizeReviews(x_test, 'TEST')
unsup_reviews = labelizeReviews(unsup_reviews, 'UNSUP')

# instantiate our DM and DBOW models
size = 400
model_dm = Doc2Vec(min_count=1, window=10, vector_size=size, sample=1e-3, negative=5, workers=6)
model_dbow = Doc2Vec(min_count=1, window=10, vector_size=size, sample=1e-3, negative=5, dm=0, workers=6)

# build vocab over all reviews
model_dm.build_vocab(x_train + x_test + unsup_reviews)
model_dbow.build_vocab(x_train + x_test + unsup_reviews)

# We pass through the data set multiple times, shuffling the training reviews each time to improve accuracy.
all_train_reviews = x_train + unsup_reviews
for epoch in range(10):
    random.shuffle(all_train_reviews)
    model_dm.train(all_train_reviews, total_examples=model_dm.corpus_count, epochs=1)
    model_dbow.train(all_train_reviews, total_examples=model_dbow.corpus_count, epochs=1)

# Get training set vectors from our models by looking up each review's tag
def get_vecs(model, corpus, size):
    # on gensim >= 4.0, model.docvecs is an alias for model.dv
    vecs = [np.array(model.docvecs[z.tags[0]]).reshape((1, size)) for z in corpus]
    return np.concatenate(vecs)

train_vecs_dm = get_vecs(model_dm, x_train, size)
train_vecs_dbow = get_vecs(model_dbow, x_train, size)
train_vecs = np.hstack((train_vecs_dm, train_vecs_dbow))

# train over test set
all_test = list(x_test)  # copy so the shuffle does not reorder x_test itself
for epoch in range(10):
    random.shuffle(all_test)
    model_dm.train(all_test, total_examples=model_dm.corpus_count, epochs=1)
    model_dbow.train(all_test, total_examples=model_dbow.corpus_count, epochs=1)

# Construct vectors for test reviews
test_vecs_dm = get_vecs(model_dm, x_test, size)
test_vecs_dbow = get_vecs(model_dbow, x_test, size)
test_vecs = np.hstack((test_vecs_dm, test_vecs_dbow))

lr = SGDClassifier(loss='log', penalty='l1')  # on newer scikit-learn use loss='log_loss'
lr.fit(train_vecs, y_train)
print('Test Accuracy: %.2f' % lr.score(test_vecs, y_test))

# Create ROC curve
pred_probas = lr.predict_proba(test_vecs)[:, 1]
fpr, tpr, _ = roc_curve(y_test, pred_probas)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label='area = %.2f' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.legend(loc='lower right')
plt.show()
Output: Test Accuracy: 0.83
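The pipeline above retrains the Doc2Vec models over the test reviews to obtain their vectors; for genuinely unseen text, infer_vector is the usual route. A sketch, assuming model_dm, model_dbow and the fitted lr from above are still in memory (the sample review text is made up):

# Tokenise a new review the same way as the training data
new_review = cleanText(['This movie was surprisingly good, great acting and a solid plot.'])[0]

# Infer a document vector from each model without retraining, then concatenate
# in the same DM + DBOW order used for train_vecs
vec_dm = model_dm.infer_vector(new_review).reshape(1, -1)
vec_dbow = model_dbow.infer_vector(new_review).reshape(1, -1)
new_vec = np.hstack((vec_dm, vec_dbow))

print(lr.predict(new_vec))        # 1 = positive, 0 = negative
print(lr.predict_proba(new_vec))  # class probabilities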