Summary of Word Vectorization for Natural Language Processing
Natural Language Processing
1. Word Vector Representations
distributional representation vs. distributed representation: a distributional representation is one of a family of methods built on co-occurrence statistics, while a distributed representation maps words from a high-dimensional space X to a dense low-dimensional space Y. The distributional hypothesis provides the theoretical basis for this idea: words that appear in similar contexts have similar meanings.
The foundation of NLP is word vectorization, i.e. turning text into numbers; once the text is numeric, the downstream data-mining work looks like the usual tasks: classification, clustering, and so on.
1.1 one-hot encoding
In vector space terms, this is a vector with a single 1 (at the index of the word) and zeroes everywhere else:
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
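A minimal sketch of building one-hot vectors with NumPy (the toy vocabulary and the one_hot helper are made up for illustration):

import numpy as np

# Hypothetical toy vocabulary: each word gets an index, and its one-hot
# vector has a 1 at that index and 0 everywhere else.
vocab = ["i", "like", "natural", "language", "processing"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_idx[word]] = 1
    return vec

print(one_hot("language"))  # [0 0 0 1 0]

The dimensionality grows with the vocabulary and every pair of words is equally distant, which is what the dense representations below try to fix.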
1.2 count-based
- TF-IDF weighted term counts (sketched below)
- SVD of the term-document or co-occurrence matrix, as in LSA (sketched below)
- GloVe, fit to global co-occurrence statistics (see the tool in section 2)
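A minimal sketch of the count-based route with scikit-learn: TF-IDF weighting followed by truncated SVD (i.e. LSA). The three-sentence corpus is a made-up example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Made-up corpus for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are animals",
]

# TF-IDF: sparse document-term matrix, weighted by term frequency * inverse document frequency.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)      # shape: (n_docs, n_terms)

# SVD: project the sparse TF-IDF matrix to a dense low-dimensional space (LSA).
svd = TruncatedSVD(n_components=2, random_state=0)
X_lsa = svd.fit_transform(X)         # shape: (n_docs, 2)
print(X_lsa)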
1.3 word embedding
Neural-network-based word vector representations. word2vec offers 2 × 2 = 4 training combinations (a gensim sketch of the options follows below):
- model architecture: CBOW or skip-gram
- training objective: hierarchical softmax or negative sampling
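A minimal gensim sketch of the four combinations; the two-sentence corpus is a toy example, and real training needs far more data:

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (illustration only).
sentences = [
    ["natural", "language", "processing"],
    ["word", "vectors", "capture", "semantics"],
]

# sg selects the architecture: 0 = CBOW, 1 = skip-gram.
# hs / negative select the objective:
#   hs=1, negative=0 -> hierarchical softmax
#   hs=0, negative=k -> negative sampling with k noise words
# (vector_size is named size in gensim < 4.0)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                 sg=1, hs=0, negative=5)

print(model.wv["word"].shape)  # (100,)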
2. Word Vector Tools
word2vec
https://code.google.com/archive/p/word2vec/
gensim
https://github.com/RaRe-Technologies/gensim
fasttext
https://github.com/facebookresearch/fastText
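A minimal sketch of training vectors yourself with the official fastText Python bindings; corpus.txt is a placeholder for a plain-text file with one sentence per line (the next block instead loads pre-trained wiki vectors, which can be fetched with the wget commands near the end of this section):

import fasttext

# Unsupervised skip-gram training on a plain-text corpus
# (model="cbow" is the other architecture).
model = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=100)

print(model.get_word_vector("car").shape)       # (100,)
print(model.get_nearest_neighbors("car", k=5))  # list of (similarity, word) pairs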
from gensim.models import KeyedVectors

# Load pre-trained fastText vectors stored in word2vec text format.
# (gensim < 4.0 exposes model.vocab; gensim >= 4.0 uses model.key_to_index.)
model = KeyedVectors.load_word2vec_format("wiki.en.vec")

words = []
for word in model.vocab:
    words.append(word)

print("word count: {}".format(len(words)))
print("Dimensions of word: {}".format(len(model[words[0]])))

demo_word = "car"
for similar_word in model.similar_by_word(demo_word):
    print("Word: {0}, Similarity: {1:.2f}".format(
        similar_word[0], similar_word[1]
    ))
word count: 2519370
Dimensions of word: 300
Word: cars, Similarity: 0.83
Word: automobile, Similarity: 0.72
Word: truck, Similarity: 0.71
Word: motorcar, Similarity: 0.70
Word: vehicle, Similarity: 0.70
Word: driver, Similarity: 0.69
Word: drivecar, Similarity: 0.69
Word: minivan, Similarity: 0.67
Word: roadster, Similarity: 0.67
Word: racecars, Similarity: 0.67
GloVe
https://github.com/stanfordnlp/GloVe
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from tsne import bh_sne


def read_glove(glove_file):
    """Read GloVe vectors from a text file into a word index and an embedding matrix."""
    embeddings_index = {}
    embeddings_vector = []
    with open(glove_file, "rb") as f:
        for word_idx, line in enumerate(f):
            values = line.decode("utf-8").split()
            word = values[0]
            vector = np.asarray(values[1:], dtype="float64")
            embeddings_index[word] = word_idx
            embeddings_vector.append(vector)
    inv_index = {v: k for k, v in embeddings_index.items()}
    glove_embeddings = np.vstack(embeddings_vector)
    # L2-normalize each row so that cosine similarity reduces to a dot product.
    glove_norms = np.linalg.norm(glove_embeddings, axis=-1, keepdims=True)
    glove_embeddings_normed = glove_embeddings / glove_norms
    return embeddings_index, glove_embeddings, glove_embeddings_normed, inv_index


def get_emb(word, embeddings_index, glove_embeddings):
    idx = embeddings_index.get(word)
    if idx is None:
        return None
    return glove_embeddings[idx]


def get_normed_emb(word, embeddings_index, glove_embeddings_normed):
    idx = embeddings_index.get(word)
    if idx is None:
        return None
    return glove_embeddings_normed[idx]


def most_similar(words, inv_index, embeddings_index, glove_embeddings,
                 glove_embeddings_normed, topn=10):
    """Return the topn words closest (by cosine similarity) to a word or a sum of words."""
    if isinstance(words, list):
        query_emb = 0
        for word in words:
            query_emb += get_emb(word, embeddings_index, glove_embeddings)
    else:
        query_emb = get_emb(words, embeddings_index, glove_embeddings)
    query_emb = query_emb / np.linalg.norm(query_emb)
    cosine = np.dot(glove_embeddings_normed, query_emb)
    idxs = np.argsort(cosine)[::-1][:topn]
    return [(inv_index[idx], cosine[idx]) for idx in idxs]


def plot_tsne(glove_embeddings_normed, inv_index, perplexity, img_file_name, word_cnt=100):
    """Project the first word_cnt embeddings to 2D with t-SNE and save a labeled scatter plot."""
    # word_emb_tsne = TSNE(perplexity=perplexity).fit_transform(glove_embeddings_normed[:word_cnt])
    word_emb_tsne = bh_sne(glove_embeddings_normed[:word_cnt], perplexity=perplexity)
    plt.figure(figsize=(40, 40))
    np.set_printoptions(suppress=True)
    plt.scatter(word_emb_tsne[:, 0], word_emb_tsne[:, 1], marker=".", s=1)
    for idx in range(word_cnt):
        plt.annotate(inv_index[idx],
                     xy=(word_emb_tsne[idx, 0], word_emb_tsne[idx, 1]),
                     xytext=(0, 0), textcoords="offset points")
    plt.savefig(img_file_name)
    plt.show()


def main():
    glove_input_file = "glove.6B.100d.txt"
    embeddings_index, glove_embeddings, glove_embeddings_normed, inv_index = read_glove(glove_input_file)
    print(np.isfinite(glove_embeddings_normed).all())
    print(glove_embeddings.shape)
    print(get_emb("computer", embeddings_index, glove_embeddings))
    print(most_similar("cpu", inv_index, embeddings_index, glove_embeddings, glove_embeddings_normed))
    print(most_similar(["river", "chinese"], inv_index, embeddings_index, glove_embeddings, glove_embeddings_normed))
    # plot a t-SNE visualization of the first 1000 words
    plot_tsne(glove_embeddings_normed, inv_index, 30.0, "tsne.png", word_cnt=1000)


if __name__ == "__main__":
    main()
3. Papers
- Neural Word Embeddings as Implicit Matrix Factorization
- Linguistic Regularities in Sparse and Explicit Word Representations
- Random Walks on Context Spaces: Towards an Explanation of the Mysteries of Semantic Word Embeddings
- word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method
- Linking GloVe with word2vec
- Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective
- Hierarchical Probabilistic Neural Network Language Model
- Notes on Noise Contrastive Estimation and Negative Sampling
- Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
- Distributed Representations of Words and Phrases and their Compositionality
- Efficient Estimation of Word Representations in Vector Space
- GloVe: Global Vectors for Word Representation
- Neural probabilistic language models
- Natural language processing (almost) from scratch
- Learning word embeddings efficiently with noise contrastive estimation
- A scalable hierarchical distributed language model
- Three new graphical models for statistical language modelling
- Improving word representations via global context and multiple word prototypes
- A Primer on Neural Network Models for Natural Language Processing
- Joulin, Armand, et al. "Bag of tricks for efficient text classification." FAIR 2016
- P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
https://github.com/oxford-cs-deepnlp-2017/lectures
wget http://nlp.stanford.edu/data/glove.6B.zip
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.zh.zip
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip
https://blog.manash.me/how-to-use-pre-trained-word-vectors-from-facebooks-fasttext-a71e6d55f27
4. NLP Applications
The foundation of NLP is vectorizing text into word vectors; on top of that you can build classification, clustering, sentiment analysis, and more.
Typical natural language processing applications:
- Topic classification
- Topic modeling
- Sentiment analysis
- Machine translation (Google Translate)
- Chatbots / dialogue systems
- Natural language query understanding (Google Now, Apple Siri, Amazon Alexa)
- Summarization