How to remove stop words using nltk or python

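I have a dataset from which I would like to remove stop words using

stopwords.words('english')

I'm struggling with how to use this in my code to simply take out these words. I already have a list of the words from this dataset; the part I'm struggling with is comparing it against this list and removing the stop words. Any help is appreciated.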

from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
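
Note that the comprehension above evaluates stopwords.words('english') once per word and then searches the resulting list, which can get slow on large inputs. A minimal sketch of the same idea with the stop words placed in a set once. The small STOP_WORDS set and the remove_stop_words helper are hand-written stand-ins (assumptions for illustration) so the snippet runs without the NLTK corpus:

```python
# Stand-in stop word set (assumption for illustration only).
# With NLTK installed and the corpus downloaded, this could instead be:
#   from nltk.corpus import stopwords
#   STOP_WORDS = set(stopwords.words('english'))
STOP_WORDS = {"a", "an", "the", "in", "is", "on"}


def remove_stop_words(word_list, stop_words=STOP_WORDS):
    """Return the words of word_list that are not in stop_words, in order."""
    return [word for word in word_list if word not in stop_words]


word_list = ["the", "cat", "is", "on", "a", "mat"]
filtered_words = remove_stop_words(word_list)
print(filtered_words)  # ['cat', 'mat']
```

Set membership is a hash lookup, so the cost per word no longer grows with the size of the stop word list.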

You can also do a set difference, for example:

list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
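
One thing to keep in mind with the set difference: it returns each remaining word once and in no particular order, so it only fits when word order and repeated words do not matter. A small sketch of the difference, using hypothetical tokens and a stand-in STOP_WORDS set in place of the NLTK list:

```python
# Hypothetical inputs for illustration; STOP_WORDS stands in for the NLTK list.
STOP_WORDS = {"the", "on"}
tokens = ["the", "cat", "sat", "on", "the", "cat"]

# Set difference: unique non-stop words, in arbitrary order.
unique_words = set(tokens) - STOP_WORDS

# List comprehension: keeps the original order and repeated words.
ordered_words = [t for t in tokens if t not in STOP_WORDS]

print(sorted(unique_words))  # ['cat', 'sat']
print(ordered_words)         # ['cat', 'sat', 'cat']
```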

I suppose you have a list of words (word_list) from which you want to remove stop words. You could do something like this:

from nltk.corpus import stopwords

filtered_word_list = word_list[:]  # make a copy of word_list
for word in word_list:  # iterate over the original list
    if word in stopwords.words('english'):
        filtered_word_list.remove(word)  # remove one occurrence of the stop word from the copy

To exclude all kinds of stop words, including the nltk stop words, you could do something like this:

from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = list(get_stop_words('en'))  # About 900 stopwords
nltk_words = list(stopwords.words('english'))  # About 150 stopwords
stop_words.extend(nltk_words)

output = [w for w in word_list if w not in stop_words]
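
Extending one list with the other keeps words that appear in both lists twice. A possible variation, sketched here with short hypothetical stand-ins for the two lists (stop_words_a and nltk_words are illustrative, not the real contents of either package), is to take the union as a set:

```python
# Hypothetical stand-ins for the two stop word lists in the answer above
# (get_stop_words('en') and stopwords.words('english')).
stop_words_a = ["a", "the", "about", "above"]
nltk_words = ["a", "the", "ours", "ourselves"]

# Union of both lists; words that appear in both are stored once.
all_stop_words = set(stop_words_a) | set(nltk_words)

word_list = ["the", "view", "above", "ourselves"]
output = [w for w in word_list if w not in all_stop_words]
print(output)               # ['view']
print(len(all_stop_words))  # 6
```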

Use the textcleaner library to remove stop words from your data.

Follow this link: https://yugantm.github.io/textcleaner/documentation.html (see remove_stpwrds)

Follow these steps to use this library.

pip install textcleaner

After installing:

import textcleaner as tc
data = tc.document(<file_name>)
# you can also pass a list of sentences to the document class constructor.
data.remove_stpwrds()  # inplace is set to False by default

Use the code above to remove the stop words.

Using filter:

from nltk.corpus import stopwords
# ...
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))

You can use this function. Note that you need to lowercase all the words first:

from nltk.corpus import stopwords

def remove_stopwords(word_list):
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lowercased
        if word not in stopwords.words("english"):
            processed_word_list.append(word)
    return processed_word_list
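
The function above also returns the words in lowercase. If the original casing should survive in the output, one option is to lowercase only for the comparison. A sketch under that assumption, with a hypothetical helper name and a stand-in stop word set so it runs without the NLTK corpus:

```python
STOP_WORDS = {"the", "is", "a"}  # stand-in for set(stopwords.words("english"))


def remove_stopwords_keep_case(word_list, stop_words=STOP_WORDS):
    """Drop stop words case-insensitively, leaving the kept words unchanged."""
    return [word for word in word_list if word.lower() not in stop_words]


print(remove_stopwords_keep_case(["The", "Moon", "is", "A", "satellite"]))
# ['Moon', 'satellite']
```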

# 1) build a new list that skips the stop words
print("enter the string from which you want to remove list of stop words")
userstring = input().split()  # split on whitespace; split("") raises ValueError
stop_list = ["a", "an", "the", "in"]  # renamed from list to avoid shadowing the built-in
another_list = []
for x in userstring:
    if x not in stop_list:  # keep only the words that are not stop words
        another_list.append(x)  # it is also possible to use .remove
for x in another_list:
    print(x, end=' ')

# 2) if you prefer to use .remove
print("enter the string from which you want to remove list of stop words")
userstring = input().split()
stop_list = ["a", "an", "the", "in"]
for x in userstring[:]:  # loop over a copy, since items are removed from userstring
    if x in stop_list:
        userstring.remove(x)
for x in userstring:
    print(x, end=' ')
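
A caveat on the .remove approach: removing items from a list while looping over that same list can skip elements, because the remaining items shift left under the loop index. A short sketch of the effect, using made-up input words:

```python
stop_list = ["a", "an", "the", "in"]

# Looping over the same list that is being modified: after "a" is removed,
# the next item "the" shifts into its slot and is never visited.
buggy = ["a", "the", "cat"]
for x in buggy:
    if x in stop_list:
        buggy.remove(x)
print(buggy)  # ['the', 'cat']

# Looping over a copy (safe[:]) visits every original item.
safe = ["a", "the", "cat"]
for x in safe[:]:
    if x in stop_list:
        safe.remove(x)
print(safe)  # ['cat']
```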