Create N-Grams From Text in Python

In computational linguistics, n-grams are important for language processing and for contextual and semantic analysis. They are contiguous, consecutive sequences of words taken from a string of tokens.

Unigrams, bigrams, and trigrams are the most popular, and they work well; for n > 3, data sparsity can become a problem.

This article will discuss how to create n-grams in Python using language features and libraries.

Create N-Grams From Text in Python Using a for Loop

We can write an ngrams function that takes the text and a value of n and returns a list containing the n-grams.

To create this function, we split the text and create an empty list (output) to store the n-grams. We then use a for loop over the starting indices of the splitInput list, so that every slice of num consecutive tokens stays within bounds.

Each slice of tokens is appended to the output list.

def ngrams(input, num):
    # Split the text on spaces to get a list of tokens
    splitInput = input.split(' ')
    output = []
    # Slide a window of num tokens across the list
    for i in range(len(splitInput) - num + 1):
        output.append(splitInput[i:i + num])
    return output

text = "Welcome to the abode, and more importantly, our in-house exceptional cooking service which is close to the Burj Khalifa"
print(ngrams(text, 3))

Output of the code:

[['Welcome', 'to', 'the'], ['to', 'the', 'abode,'], ['the', 'abode,', 'and'], ['abode,', 'and', 'more'], ['and', 'more', 'importantly,'], ['more', 'importantly,', 'our'], ['importantly,', 'our', 'in-house'], ['our', 'in-house', 'exceptional'], ['in-house', 'exceptional', 'cooking'], ['exceptional', 'cooking', 'service'], ['cooking', 'service', 'which'], ['service', 'which', 'is'], ['which', 'is', 'close'], ['is', 'close', 'to'], ['close', 'to', 'the'], ['to', 'the', 'Burj'], ['the', 'Burj', 'Khalifa']]
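If plain strings are preferred over token lists, a minimal variation (reusing the ngrams function above) joins each slice with spaces:

print([' '.join(gram) for gram in ngrams(text, 3)])
# ['Welcome to the', 'to the abode,', 'the abode, and', ...]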

Create N-Grams From Text in Python Using nltk

The NLTK library is a natural language toolkit that provides an easy-to-use interface to important resources for text processing and tokenization, among other things. To install nltk, we can use the pip command below.

pip install nltk

To show a potential problem before we move on to more detailed code, let us use the word_tokenize() method, which creates a tokenized copy of the text passed to it using NLTK's recommended word tokenizer.

import nltk
text = "well the money has finally come"
tokens = nltk.word_tokenize(text)

Output of the code:

Traceback (most recent call last):
  File "c:\Users\akinl\Documents\Python\SFTP\n-gram-two.py", line 4, in <module>
    tokens = nltk.word_tokenize(text)
  File "C:\Python310\lib\site-packages\nltk\tokenize\__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "C:\Python310\lib\site-packages\nltk\tokenize\__init__.py", line 106, in sent_tokenize
    tokenizer = load(f"tokenizers/punkt/{language}.pickle")
  File "C:\Python310\lib\site-packages\nltk\data.py", line 750, in load
    opened_resource = _open(resource_url)
  File "C:\Python310\lib\site-packages\nltk\data.py", line 876, in _open
    return find(path_, path + [""]).open()
  File "C:\Python310\lib\site-packages\nltk\data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:
  >>> import nltk
  >>> nltk.download('punkt')
  For more information see: https://www.nltk.org/data.html
  Attempted to load tokenizers/punkt/english.pickle
  Searched in:
    - 'C:\Users\akinl/nltk_data'
    - 'C:\Python310\nltk_data'
    - 'C:\Python310\share\nltk_data'
    - 'C:\Python310\lib\nltk_data'
    - 'C:\Users\akinl\AppData\Roaming\nltk_data'
    - 'C:\nltk_data'
    - 'D:\nltk_data'
    - 'E:\nltk_data'
    - ''
**********************************************************************

The above error message and problem occur because some methods in the NLTK library require certain data that we have not downloaded, especially if this is the first time we are using it. Therefore, we need the NLTK downloader to fetch two data modules: punkt and averaged_perceptron_tagger.

This data is needed by methods such as word_tokenize() and pos_tag(). To download it from within a Python script, we use the download() method.

You can create a Python file and run the code below to solve the problem.

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

Alternatively, run the following commands through your command-line interface:

python -m nltk.downloader punkt
python -m nltk.downloader averaged_perceptron_tagger
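
To confirm that a resource is now available, one option is nltk.data.find(), which raises a LookupError if the resource is still missing (the 'tokenizers/punkt' path below follows NLTK's standard data layout):

import nltk
# Prints the local path of the punkt data if it was downloaded successfully;
# raises LookupError otherwise
print(nltk.data.find('tokenizers/punkt'))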

Sample code:

import nltk
text = "well the money has finally come"
tokens = nltk.word_tokenize(text)
# bigrams() and trigrams() build n-grams of size 2 and 3 from the tokens
textBigGrams = nltk.bigrams(tokens)
textTriGrams = nltk.trigrams(tokens)
print(list(textBigGrams), list(textTriGrams))

Output of the code:

[('well', 'the'), ('the', 'money'), ('money', 'has'), ('has', 'finally'), ('finally', 'come')] [('well', 'the', 'money'), ('the', 'money', 'has'), ('money', 'has', 'finally'), ('has', 'finally', 'come')]
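Note that nltk.bigrams() and nltk.trigrams() return generators, so each one can be consumed only once; a minimal sketch of this behavior:

import nltk
tokens = nltk.word_tokenize("well the money has finally come")
textBigGrams = nltk.bigrams(tokens)
print(list(textBigGrams))  # first call consumes the generator
print(list(textBigGrams))  # prints [] because the generator is exhausted

This is why the next example recreates the bigrams and trigrams instead of reusing the ones above.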

Sample code:

import nltk
text = "well the money has finally come"
tokens = nltk.word_tokenize(text)
textBigGrams = nltk.bigrams(tokens)
textTriGrams = nltk.trigrams(tokens)
# Join each n-gram tuple into a string for more readable output
print("The Bigrams of the Text are")
print(*map(' '.join, textBigGrams), sep=', ')
print("The Trigrams of the Text are")
print(*map(' '.join, textTriGrams), sep=', ')

Output of the code:

The Bigrams of the Text are
well the, the money, money has, has finally, finally come
The Trigrams of the Text are
well the money, the money has, money has finally, has finally come
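
For values of n beyond 2 and 3, NLTK also exposes a general ngrams() helper in nltk.util; a minimal sketch building 4-grams:

import nltk
from nltk.util import ngrams

text = "well the money has finally come"
tokens = nltk.word_tokenize(text)
# ngrams() accepts any window size; here we build 4-grams
print(list(ngrams(tokens, 4)))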