How to remove stop words using NLTK or Python
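So I have a dataset from which I would like to remove stop words using

    stopwords.words('english')

I'm struggling with how to use this in my code to simply take out these words. I already have a list of the words from this dataset; the part I'm struggling with is comparing it against this list and removing the stop words. Any help is appreciated.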
    from nltk.corpus import stopwords
    # ...
    filtered_words = [word for word in word_list if word not in stopwords.words('english')]
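The comprehension above evaluates stopwords.words('english') for every word and then searches a list, which gets slow on large inputs. A minimal sketch of the usual remedy, building the stop-word set once, follows. The helper name `remove_stop_words` and the tiny `DEMO_STOP_WORDS` set are assumptions made so the example runs without the NLTK corpus; with NLTK installed you would pass `stopwords.words('english')` instead.

```python
def remove_stop_words(word_list, stop_words):
    """Return the words of word_list that are not in stop_words, in order."""
    stop_set = set(stop_words)  # built once; membership tests are O(1) on average
    return [word for word in word_list if word not in stop_set]


# Hypothetical stand-in for stopwords.words('english'), to keep the sketch self-contained.
DEMO_STOP_WORDS = {"a", "an", "the", "in", "is", "of"}

word_list = ["the", "cat", "is", "in", "the", "garden"]
filtered_words = remove_stop_words(word_list, DEMO_STOP_WORDS)
print(filtered_words)  # ['cat', 'garden']
```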
You can also do a set difference, for example:

    list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
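One caveat with the set difference: a set keeps neither the original word order nor repeated words. The sketch below contrasts it with an order-preserving filter. The sentence and the small stop-word set are made-up illustration data, and `str.split` stands in for `nltk.regexp_tokenize`.

```python
# Hypothetical illustration data; str.split stands in for nltk.regexp_tokenize.
tokens = "the dog chased the other dog in the park".split()
stop_set = {"the", "in", "other"}

# Set difference: duplicates collapse and the resulting order is arbitrary.
unique_remaining = set(tokens) - stop_set

# List comprehension: keeps the original order and any repeated words.
ordered_remaining = [t for t in tokens if t not in stop_set]

print(sorted(unique_remaining))  # ['chased', 'dog', 'park']
print(ordered_remaining)         # ['dog', 'chased', 'dog', 'park']
```

If word counts or word order matter later on, the list comprehension is probably the safer choice.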
I suppose you have a list of words (word_list) from which you want to remove stop words. You could do something like this:

    from nltk.corpus import stopwords

    filtered_word_list = word_list[:]  # make a copy of word_list
    for word in word_list:  # iterate over the original word_list
        if word in stopwords.words('english'):
            filtered_word_list.remove(word)  # remove word from the copy if it is a stop word
To exclude all types of stop words, including the NLTK stop words, you could do something like this:

    from stop_words import get_stop_words
    from nltk.corpus import stopwords

    stop_words = list(get_stop_words('en'))        # About 900 stopwords
    nltk_words = list(stopwords.words('english'))  # About 150 stopwords
    stop_words.extend(nltk_words)

    output = [w for w in word_list if w not in stop_words]
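Because extend simply appends, any word present in both sources ends up in the combined list twice. A sketch that merges the sources with a set union instead is shown here; the two short lists are hypothetical stand-ins for `get_stop_words('en')` and `stopwords.words('english')`.

```python
# Hypothetical stand-ins for get_stop_words('en') and stopwords.words('english').
package_words = ["a", "the", "in", "whereas"]
nltk_words = ["a", "the", "in", "ours"]

# The union keeps each stop word once, however many sources list it.
all_stop_words = set(package_words) | set(nltk_words)

word_list = ["whereas", "ours", "matters", "in", "theory"]
output = [w for w in word_list if w not in all_stop_words]
print(len(all_stop_words))  # 5
print(output)               # ['matters', 'theory']
```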
Use the textcleaner library to remove stop words from your data.
Follow this link: https://yugantm.github.io/textcleaner/documentation.html (see remove_stpwrds)
Follow these steps to do so with this library.

    pip install textcleaner

After installing:

    import textcleaner as tc
    data = tc.document(<file_name>)
    # you can also pass a list of sentences to the document class constructor.
    data.remove_stpwrds()  # inplace is set to False by default

Use the code above to remove the stop words.
Using filter:

    from nltk.corpus import stopwords
    # ...
    filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))
You can use this function. Note that you need to lowercase all the words first, because the NLTK stop-word list is lowercase.

    from nltk.corpus import stopwords

    def remove_stopwords(word_list):
        stop_set = set(stopwords.words("english"))  # load the stop words once
        processed_word_list = []
        for word in word_list:
            word = word.lower()  # in case they aren't all lowercase
            if word not in stop_set:
                processed_word_list.append(word)
        return processed_word_list
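The function above also returns the words in lowercase. If the original capitalization should survive, one option is to lowercase only for the comparison. This is a sketch under that assumption; the name `remove_stopwords_keep_case` and the short `demo_stop_words` list are hypothetical, the latter standing in for `stopwords.words("english")`.

```python
def remove_stopwords_keep_case(word_list, stop_words):
    """Drop stop words case-insensitively but return the surviving words unchanged."""
    stop_set = {w.lower() for w in stop_words}
    return [word for word in word_list if word.lower() not in stop_set]


# Hypothetical stand-in for stopwords.words("english").
demo_stop_words = ["the", "is", "in"]
words = ["The", "Eiffel", "Tower", "IS", "in", "Paris"]
kept = remove_stopwords_keep_case(words, demo_stop_words)
print(kept)  # ['Eiffel', 'Tower', 'Paris']
```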
    # 1) Build a new list that keeps only the words that are not stop words
    print("Enter the string from which you want to remove stop words:")
    user_words = input().split()  # split on whitespace; split("") raises ValueError
    stop_list = ["a", "an", "the", "in"]
    kept_words = []
    for x in user_words:
        if x not in stop_list:  # keep the word only if it is not a stop word
            kept_words.append(x)
    for x in kept_words:
        print(x, end=' ')

    # 2) If you would rather use .remove, delete the stop words in place
    print("Enter the string from which you want to remove stop words:")
    user_words = input().split()
    stop_list = ["a", "an", "the", "in"]
    for x in user_words[:]:  # iterate over a copy so that removing does not skip items
        if x in stop_list:
            user_words.remove(x)
    for x in user_words:
        print(x, end=' ')
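A note on the .remove variant: calling .remove on a list while a for loop is iterating over that same list makes the loop skip the element right after each removal, so consecutive stop words can survive. The sketch below shows the effect and the common workaround of iterating over a copy; the word lists are made-up examples.

```python
stop_list = ["a", "an", "the", "in"]

# Removing from the list being iterated: the element after each removal is skipped.
words = ["in", "the", "house"]
for x in words:
    if x in stop_list:
        words.remove(x)
print(words)  # ['the', 'house']

# Iterating over a copy (words_fixed[:]) visits every element.
words_fixed = ["in", "the", "house"]
for x in words_fixed[:]:
    if x in stop_list:
        words_fixed.remove(x)
print(words_fixed)  # ['house']
```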