import re

def find_english(raw_data):
    # Replace every non-alphabetic character with a space, keeping only English letters.
    pattern = re.compile(r'[^a-zA-Z]')
    english = re.sub(pattern, ' ', raw_data)
    return english
raw_sentence_english='''THE recent success of neural networks has boosted research on pattern recognition and data mining. Many machine learning tasks such as object detection [1], [2], machine translation [3], [4], and speech reconition [5], which once heavily relied on handcrafted feature engineering to extract informative feature sets, has recently been revolutionized by various end-to-end deep learning paradigms, e.g., convolutional neural networks (CNNs) [6], recurrent neural networks (RNNs) [7], and autoencoders [8].'''
'THE recent success of neural networks has boosted research on pattern recognition and data mining Many machine learning tasks such as object detection machine translation and speech reconition which once heavily relied on handcrafted feature engineering to extract informative feature sets has recently been revolutionized by various end to end deep learning paradigms e g convolutional neural networks CNNs recurrent neural networks RNNs and autoencoders '
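The spell-check loop further down iterates over splited_sentence_english. A minimal way to build that list from the cleaned string might be the following sketch (the whitespace split and lowercasing are assumptions, chosen to match the lowercase words that appear in the spell-check output):

# Hypothetical tokenization step: split the cleaned string on whitespace
# and lowercase each token before spell checking.
splited_sentence_english = [word.lower() for word in find_english(raw_sentence_english).split()]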
import jieba

def chinese_splitwords(sentence_chinese, suggestions_list=None):
    # Optionally register user-supplied words so jieba keeps them as single tokens.
    if suggestions_list is not None:
        for item in suggestions_list:
            jieba.add_word(item)
    # Segment the sentence into a list of words.
    splited_chinese = [item for item in jieba.cut(sentence_chinese)]
    return splited_chinese
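A call might look like the following (the sample sentence and the custom word are placeholders, not taken from the original post); the first call triggers jieba's one-time dictionary build, logged below:

# Hypothetical usage: the sentence and the custom word are illustrative only.
sentence = '深度学习推动了模式识别和数据挖掘的研究'
print(chinese_splitwords(sentence, suggestions_list=['深度学习']))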
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Yufei Luo\AppData\Local\Temp\jieba.cache
Loading model cost 0.562 seconds.
Prefix dict has been built successfully.
import enchant

enchant_dict = enchant.Dict('en_US')  # assumes the en_US dictionary is installed

for item in splited_sentence_english:
    if not enchant_dict.check(item):
        # Print the unrecognized word together with enchant's suggestions.
        print('original word: ', item, ' suggested words: ', enchant_dict.suggest(item))
original word: reconition suggested words: ['recondition', 'recognition', 'reconnection', 'reconception', 'reconsecration', 'premonition']
original word: convolutional suggested words: ['convolution al', 'convolution-al', 'convolution', 'involutional', 'convocational', 'coevolutionary']
original word: cnns suggested words: ['conns', 'inns', 'cans', 'cons', 'CNS']
original word: rnns suggested words: ['inns', 'runs', 'scorns']
original word: autoencoders suggested words: ['auto encoders', 'auto-encoders', 'encoders']
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
stemmed_sentence_english = [stemmer.stem(item) for item in filtered_splited_sentence_english]
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized_sentence_english = [lemmatizer.lemmatize(item) for item in filtered_splited_sentence_english]
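To see how the two normalizations differ, the stemmer and lemmatizer defined above can be compared on a few words (a quick sketch; the sample words are not from the original text):

# Hypothetical comparison of stemming vs. lemmatization on sample words.
for word in ['studies', 'running', 'paradigms']:
    print(word, '->', stemmer.stem(word), '|', lemmatizer.lemmatize(word))

Stemming chops suffixes heuristically ('studies' becomes 'studi'), while the WordNet lemmatizer maps words to dictionary forms ('studies' becomes 'study'); by default it treats every word as a noun, so 'running' stays unchanged unless pos='v' is passed.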
raw_sentences = [
    r'Reinforcement learning is learning what to do—how to map situations to actions—so as to maximize a numerical reward signal.',
    r'The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.',
    r'In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards.',
    r'These two characteristics—trial-and-error search and delayed reward—are the two most important distinguishing features of reinforcement learning.',
]
processed_sentences = [preprocess_english_sentence(item) for item in raw_sentences]
import numpy as np

def generate_one_hot_encoding(sentences):
    # Assign each distinct word an index in the vocabulary.
    token_index = {}
    for sentence in sentences:
        for word in sentence:
            if word not in token_index:
                token_index[word] = len(token_index)
    # One row per sentence; a 1 marks that the word occurs in that sentence.
    results = np.zeros((len(sentences), len(token_index)))
    for i in range(len(sentences)):
        for word in sentences[i]:
            results[i, token_index[word]] = 1
    return results
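Applied to the preprocessed sentences from above, the encoder yields one row per sentence (a quick usage sketch; the exact shape depends on the vocabulary produced by the preprocessing step):

one_hot = generate_one_hot_encoding(processed_sentences)
print(one_hot.shape)  # (number of sentences, vocabulary size)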
The following example encodes the sentences as a Bag of Words; compared with one-hot encoding, it additionally records word-frequency counts:
def generate_bag_of_words_encoding(sentences):
    # Build the vocabulary exactly as in the one-hot encoder.
    token_index = {}
    for sentence in sentences:
        for word in sentence:
            if word not in token_index:
                token_index[word] = len(token_index)
    # Count how many times each word occurs in each sentence.
    results = np.zeros((len(sentences), len(token_index)))
    for i in range(len(sentences)):
        for word in sentences[i]:
            results[i, token_index[word]] += 1
    return results
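Usage mirrors the one-hot encoder (a sketch; the printed values depend on the preprocessed corpus):

bag_of_words = generate_bag_of_words_encoding(processed_sentences)
print(bag_of_words.shape)
print(bag_of_words.max())  # largest count of any single word within one sentence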
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_vec = TfidfVectorizer(
    stop_words=nltk.corpus.stopwords.words('english'),
).fit_transform(raw_sentences)
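To inspect the result, it helps to keep a reference to the vectorizer so its vocabulary can be queried afterwards (a sketch; the exact feature set depends on the stop-word list and on raw_sentences, and get_feature_names_out requires scikit-learn 1.0 or newer):

vectorizer = TfidfVectorizer(stop_words=nltk.corpus.stopwords.words('english'))
tf_idf_matrix = vectorizer.fit_transform(raw_sentences)
print(tf_idf_matrix.shape)                      # (number of sentences, number of terms)
print(vectorizer.get_feature_names_out()[:10])  # first few vocabulary terms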
The idea behind word embeddings is to represent an item with a low-dimensional dense vector; the item can be a word, a book title, a product, a film, and so on. The key property of these vectors is that items whose vectors lie close together have similar meanings. For example, "The Avengers" and "Iron Man" end up close to each other, while "The Avengers" and "Gone with the Wind" are correspondingly farther apart. Embeddings can even support a kind of attribute arithmetic, for example \(\text{Vec(Woman)} - \text{Vec(Man)} \approx \text{Vec(Queen)} - \text{Vec(King)}\).
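Given a trained gensim word-vector model (model here is a placeholder; no such model is trained in this post), the analogy above can be checked with the most_similar API:

# Hypothetical check of the analogy property, assuming a trained gensim Word2Vec model.
# 'queen' is expected to rank highly among the results.
print(model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))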
The CBOW model takes a word's context as input and predicts the word itself. For example, to learn an embedding for the word "fox" from the sentence "The quick brown fox jumps over the lazy dog", using the 3 words on each side of "fox" as context, the model's input is the six words "The, quick, brown, jumps, over, the". The model structure is shown in the figure below, where each pair of adjacent layers is fully connected:
The Skip-Gram model does the opposite: it takes a single word as input and predicts its surrounding context. Using the same sentence "The quick brown fox jumps over the lazy dog" to learn the embedding of "fox", the model's input is the word "fox" and its output is the 3 words on each side of it, i.e., the six words "The, quick, brown, jumps, over, the". The model structure is shown in the figure below:
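In practice, both architectures are available through gensim's Word2Vec class and are selected with the sg flag (a minimal sketch, assuming gensim 4.x and reusing the tokenized sentences from above as a toy corpus; real embeddings need far more data):

from gensim.models import Word2Vec

# sg=0 trains CBOW (context -> word); sg=1 trains Skip-Gram (word -> context).
# window=3 mirrors the "3 words on either side" used in the examples above.
cbow_model = Word2Vec(processed_sentences, vector_size=50, window=3, min_count=1, sg=0)
skip_gram_model = Word2Vec(processed_sentences, vector_size=50, window=3, min_count=1, sg=1)

print(cbow_model.wv.index_to_key[:5])  # a few words from the learned vocabulary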