
NLP Basics (3): Lemmatization

  • bao1618.com
  • 2019-03-25

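  Lemmatization is an important step in text preprocessing and is closely related to stemming.

  Put simply, lemmatization reduces an inflected word to its base form, the lemma. The result is normally a word found in the dictionary. Stemming, by contrast, only strips affixes, so a stem is not guaranteed to be a dictionary word. For example, lemmatizing "cars" gives "car", and lemmatizing "ate" gives "eat".

  In Python, the nltk module provides a robust lemmatizer based on WordNet, as in the following example: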

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

# lemmatize nouns
print(wnl.lemmatize("cars", "n"))
print(wnl.lemmatize("men", "n"))

# lemmatize verbs
print(wnl.lemmatize("running", "v"))
print(wnl.lemmatize("ate", "v"))

# lemmatize adjectives
print(wnl.lemmatize("saddest", "a"))
print(wnl.lemmatize("fancier", "a"))

Output:

car
men
run
eat
sad
fancy

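In the code above, wnl.lemmatize() performs the lemmatization. The first argument is the word and the second is its part of speech (noun, verb, adjective, and so on); the return value is the lemma of the input word.

  Lemmatization is usually straightforward, but in practice it matters that the correct part of speech is given. With the wrong one, the word may come back unchanged, as in the following code: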

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

# wrong part of speech: "ate" treated as a noun, "fancier" as a verb
print(wnl.lemmatize("ate", "n"))
print(wnl.lemmatize("fancier", "v"))

Output:

ate
fancier
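One way to picture why the part of speech changes the result is to treat lemmatization as a lookup keyed on both the word and its part of speech. The sketch below is a toy stand-in, not the WordNet algorithm: `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names invented for illustration. The default of `"n"` mirrors the fact that NLTK's `lemmatize()` treats a word as a noun when no part of speech is passed.

```python
# Toy stand-in for a POS-aware lemmatizer; NOT how WordNet works internally.
# The table and function names are hypothetical, chosen for illustration only.
TOY_LEMMAS = {
    ("cars", "n"): "car",
    ("ate", "v"): "eat",
    ("running", "v"): "run",
    ("fancier", "a"): "fancy",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEMMAS.get((word, pos), word)


print(toy_lemmatize("ate", "v"))  # eat
print(toy_lemmatize("ate", "n"))  # ate (no noun entry, so the word is unchanged)
print(toy_lemmatize("ate"))       # ate (pos defaults to noun)
```

The same word gives different results purely because of the second argument, which is why a missing or wrong part of speech can silently leave words unlemmatized.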

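  So how do we find a word's part of speech? In NLP this is done with part-of-speech (POS) tagging. In nltk, nltk.pos_tag() returns the POS tag of each word in a sentence, as in the following Python code: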

import nltk

sentence = "The brown fox is quick and he is jumping over the lazy dog"
tokens = nltk.word_tokenize(sentence)  # split the sentence into tokens
tagged_sent = nltk.pos_tag(tokens)     # list of (word, tag) pairs
print(tagged_sent)

Output:

[('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

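  These tags come from the Penn Treebank tag set, in which the leading letters indicate the broad category: tags beginning with NN are nouns, VB verbs, JJ adjectives, and RB adverbs.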
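As a quick reference for the tags in this particular sentence, the sketch below pairs each one with a short gloss and then uses the tag prefix to pick out the verbs. The `tagged_sent` list is copied from the output above so that the snippet runs without nltk; `TAG_GLOSS` is a hypothetical helper that covers only the tags seen here, not the full tag set.

```python
# Tagged output copied from the example above, so nltk is not needed here.
tagged_sent = [
    ("The", "DT"), ("brown", "JJ"), ("fox", "NN"), ("is", "VBZ"),
    ("quick", "JJ"), ("and", "CC"), ("he", "PRP"), ("is", "VBZ"),
    ("jumping", "VBG"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
    ("dog", "NN"),
]

# Short glosses for the Penn Treebank tags that occur in this sentence only.
TAG_GLOSS = {
    "CC": "coordinating conjunction",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "PRP": "personal pronoun",
    "VBG": "verb, gerund or present participle",
    "VBZ": "verb, 3rd person singular present",
}

# Print one line per distinct tag.
for tag in sorted({t for _, t in tagged_sent}):
    print(f"{tag:4} {TAG_GLOSS[tag]}")

# The first letter of a tag gives the broad category, e.g. "V" for verbs.
verbs = [word for word, tag in tagged_sent if tag.startswith("V")]
print(verbs)  # ['is', 'is', 'jumping']
```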

  Now that we can get each word's part of speech in context, we can combine POS tagging with lemmatization to get good results. Example Python code:

from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer


# Map a Penn Treebank tag to the corresponding WordNet part of speech
def get_wordnet_pos(tag):
    if tag.startswith("J"):
        return wordnet.ADJ
    elif tag.startswith("V"):
        return wordnet.VERB
    elif tag.startswith("N"):
        return wordnet.NOUN
    elif tag.startswith("R"):
        return wordnet.ADV
    else:
        return None


sentence = "football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal."
tokens = word_tokenize(sentence)  # tokenize
tagged_sent = pos_tag(tokens)     # get the POS tag of each token

wnl = WordNetLemmatizer()
lemmas_sent = []
for word, tag in tagged_sent:
    # fall back to noun when the tag has no WordNet equivalent
    wordnet_pos = get_wordnet_pos(tag) or wordnet.NOUN
    lemmas_sent.append(wnl.lemmatize(word, pos=wordnet_pos))  # lemmatize

print(lemmas_sent)

Output:

['football', 'be', 'a', 'family', 'of', 'team', 'sport', 'that', 'involve', ',', 'to', 'vary', 'degree', ',', 'kick', 'a', 'ball', 'to', 'score', 'a', 'goal', '.']
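The tag-mapping step can also be written as a table lookup and kept separate from the lemmatizer, which makes it easy to check on its own. This is a minimal sketch, not the code used above: `to_wordnet_pos`, `lemmatize_tagged`, and `recording_stub` are hypothetical names, and the sketch relies on WordNet's part-of-speech constants being the single-character strings "a", "v", "n", and "r". The stub stands in for a real lemmatizer and simply records which part of speech each word would be lemmatized with.

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech string.
# Assumes wordnet.ADJ == "a", wordnet.VERB == "v", wordnet.NOUN == "n",
# and wordnet.ADV == "r".
TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def to_wordnet_pos(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS, falling back to noun."""
    return TAG_PREFIX_TO_WORDNET.get(tag[:1], default)


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Apply lemmatize(word, pos) to every (word, tag) pair."""
    return [lemmatize(word, to_wordnet_pos(tag)) for word, tag in tagged_tokens]


def recording_stub(word, pos):
    """Hypothetical stand-in for a lemmatizer: shows the POS it received."""
    return f"{word}/{pos}"


tagged = [("sports", "NNS"), ("varying", "VBG"), ("to", "TO"), ("quick", "JJ")]
print(lemmatize_tagged(tagged, recording_stub))
# ['sports/n', 'varying/v', 'to/n', 'quick/a']
```

With NLTK installed, passing `WordNetLemmatizer().lemmatize` in place of `recording_stub` should give the real lemmas, since that method accepts the word and its part of speech as its first two arguments.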

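The list printed by the NLTK pipeline contains the lemma of each token in the sentence.

  That concludes this post. Comments and discussion are welcome.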

