Best way to strip punctuation from a string in Python
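It seems like there should be a simpler way than:
```python
import string

s = "string. With. Punctuation?"  # Sample string
# Python 2 only: string.maketrans and the two-argument translate do not exist in Python 3
out = s.translate(string.maketrans("", ""), string.punctuation)
```
Is there?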
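In terms of efficiency, you won't beat the following (Python 2):
```python
s.translate(None, string.punctuation)
```
For newer versions of Python (Python 3), use:
```python
s.translate(str.maketrans('', '', string.punctuation))
```
This performs the raw string operations in C with a lookup table; short of writing your own C code, not much will beat it.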
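To make the lookup table less abstract, here is a small sketch (Python 3) of what `str.maketrans` builds when given a third argument. The sample string is the one from the question; the name `table` is just illustrative.

```python
import string

# Build the table once; it maps each punctuation code point to None (meaning "delete").
table = str.maketrans('', '', string.punctuation)

print(table[ord('!')])      # None: '!' will be deleted
print(ord('a') in table)    # False: letters are not in the table, so they pass through

s = "string. With. Punctuation?"
print(s.translate(table))   # string With Punctuation
```

Building the table once and reusing it for many strings avoids repeating the setup cost on every call.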
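If speed isn't a concern, another option is:
```python
exclude = set(string.punctuation)
s = ''.join(ch for ch in s if ch not in exclude)
```
This is faster than calling s.replace for each character, but it won't perform as well as non-pure-Python approaches such as regexes or string.translate, as the timings below show. For this type of problem, doing the work at as low a level as possible pays off.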
Timing code:
```python
# Python 2 benchmark (print statements, string.maketrans)
import re, string, timeit

s = "string. With. Punctuation"
exclude = set(string.punctuation)
table = string.maketrans("", "")
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    return s.translate(table, string.punctuation)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s = s.replace(c, "")
    return s

print "sets      :", timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex     :", timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :", timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace   :", timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)
```
This gives the following results:
```
sets      : 19.8566138744
regex     : 6.86155414581
translate : 2.12455511093
replace   : 28.4436721802
```
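Regular expressions are simple enough, if you know them.
```python
import re

s = "string. With. Punctuation?"
s = re.sub(r'[^\w\s]', '', s)
```
In the code above, re.sub replaces every character that is not a word character (\w: letters, digits, and underscore) or whitespace (\s) with an empty string. As a result, the . and ? punctuation will no longer be present in s after it has been run through the regex.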
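One detail worth seeing in action: because `\w` includes the underscore, `[^\w\s]` leaves underscores in place. A quick sketch with a made-up sample string, plus one possible way to drop underscores as well if that is what you want:

```python
import re

s = "snake_case, kebab-case; and 'quotes'!"

out = re.sub(r'[^\w\s]', '', s)
print(out)  # snake_case kebabcase and quotes

# If underscores should go too, match them explicitly alongside the negated class.
out_no_underscore = re.sub(r'[^\w\s]|_', '', s)
print(out_no_underscore)  # snakecase kebabcase and quotes
```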
For convenience, here is a summary of how to strip punctuation from a string in both Python 2 and Python 3. Please refer to the other answers for a detailed description.
Python 2
```python
import string

s = "string. With. Punctuation?"
table = string.maketrans("", "")
new_s = s.translate(table, string.punctuation)  # Output: string without punctuation
```
Python 3
```python
import string

s = "string. With. Punctuation?"
table = str.maketrans({key: None for key in string.punctuation})
new_s = s.translate(table)  # Output: string without punctuation
```
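```python
# Python 2 only; in Python 3 use myString.translate(str.maketrans('', '', string.punctuation))
myString.translate(None, string.punctuation)
```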
I usually use something like this:
```python
>>> s = "string. With. Punctuation?"  # Sample string
>>> import string
>>> for c in string.punctuation:
...     s = s.replace(c, "")
...
>>> s
'string With Punctuation'
```
```python
# -*- coding: utf-8 -*-
from unicodedata import category

s = u'String — with - ?punctation ?...'
# Drop every character whose Unicode general category starts with 'P' (punctuation)
s = ''.join(ch for ch in s if category(ch)[0] != 'P')
print('stripped', s)
```
Not necessarily simpler, but a different way, if you are more familiar with the re family.
```python
import re, string

s = "string. With. Punctuation?"  # Sample string
out = re.sub('[%s]' % re.escape(string.punctuation), '', s)
```
For Python 3, to remove (some?) punctuation, use:
```python
import string

remove_punct_map = dict.fromkeys(map(ord, string.punctuation))
s.translate(remove_punct_map)
```
To remove all punctuation, not just ASCII punctuation, your table needs to be a little bigger; see J.F. Sebastian's answer (Python 3 version):
```python
import unicodedata
import sys

remove_punct_map = dict.fromkeys(i for i in range(sys.maxunicode)
                                 if unicodedata.category(chr(i)).startswith('P'))
```
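As a rough illustration of how the larger table behaves, here is a self-contained sketch that builds it and applies it to a made-up string containing non-ASCII punctuation. The sample text is my own; the table construction follows the snippet above.

```python
import sys
import unicodedata

# Map every code point whose general category starts with 'P' to None.
# This scans the whole Unicode range, so build it once and reuse it.
remove_punct_map = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

sample = "Hello, world! «quoted» text… done。"
print(sample.translate(remove_punct_map))  # Hello world quoted text done
```

Note that symbols such as `$` or `+` are in the `S` categories rather than `P`, so this table leaves them alone.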
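```python
import regex  # third-party module (pip install regex), not the stdlib re

s = u"string. With. Some?Really Weird、Non?ASCII。 「(Punctuation)」?"
# r'' instead of ur'': the ur prefix is a syntax error in Python 3
remove = regex.compile(r'[\p{C}|\p{M}|\p{P}|\p{S}|\p{Z}]+', regex.UNICODE)
remove.sub(u" ", s).strip()
```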
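Personally, I believe this is the best way to remove punctuation from a string in Python because:
- It removes all Unicode punctuation.
- It is easy to modify: for example, you can remove the \p{S} if you want to remove punctuation but keep symbols such as $.
- You can be very specific about what to keep and what to remove: for example, \p{Pd} removes only dashes.
- This regex also normalizes whitespace. It maps tabs, carriage returns, and other oddities to nice, single spaces.
This uses Unicode character properties, which you can read more about on Wikipedia.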
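If you would rather not install a third-party module, the same idea of picking specific Unicode categories can be sketched with the standard library's `unicodedata`. This is not the regex approach itself, just a stdlib stand-in for it; `strip_categories` is a hypothetical helper name and the sample string is made up.

```python
import unicodedata

def strip_categories(text, prefixes):
    """Remove characters whose Unicode general category starts with any given prefix."""
    prefixes = tuple(prefixes)
    return ''.join(ch for ch in text
                   if not unicodedata.category(ch).startswith(prefixes))

sample = "pages 10–12, well-known: $5 (approx.)"

print(strip_categories(sample, ['Pd']))  # pages 1012, wellknown: $5 (approx.)
print(strip_categories(sample, ['P']))   # pages 1012 wellknown $5 approx
```

Passing `['Pd']` removes only dashes, while `['P']` removes all punctuation but keeps symbols such as `$`, which belong to the `S` categories.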
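Here's a one-liner for Python 3.5:
```python
import string

# The inner double quote must be escaped, otherwise the string literal ends early
"l*ots! o(f. p@u)n[c}t]u[a'ti\"on#$^?/".translate(str.maketrans({a: None for a in string.punctuation}))
```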
This might not be the best solution, but this is how I did it.
```python
import string

f = lambda x: ''.join([i for i in x if i not in string.punctuation])
```
Here's a function I wrote. It's not very efficient, but it is simple, and you can add or remove any punctuation you like:
```python
def stripPunc(wordList):
    """Strips punctuation from list of words"""
    # Backslash and double quote must be escaped inside string literals
    puncList = [".", ";", ":", "!", "?", "/", "\\", ",", "#", "@", "$", "&", ")", "(", "\""]
    for punc in puncList:
        wordList = [word.replace(punc, '') for word in wordList]
    return wordList
```
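I haven't seen this answer yet. Just use a regex; it removes all characters besides word characters (\w), digits (\d), and whitespace characters (\s):
```python
import re

s = "string. With. Punctuation?"  # Sample string
out = re.sub(r'[^\w\d\s]+', '', s)
```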
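Just as an update, I rewrote the @Brian example in Python 3 and changed it to move the regex compile step inside the function. My thinking was to time every single step needed to make the function work. Perhaps you are using distributed computing, cannot share a regex object between your workers, and need to run the compile step on each worker. I also compared two ways of building the translation table in Python 3:
```python
table = str.maketrans({key: None for key in string.punctuation})
```
vs.
```python
table = str.maketrans('', '', string.punctuation)
```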
In addition, I added another method that uses a set, where I take advantage of the intersection function to reduce the number of iterations.
This is the complete code:
```python
import re, string, timeit

s = "string. With. Punctuation"

def test_set(s):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in s if ch not in exclude)

def test_set2(s):
    _punctuation = set(string.punctuation)
    # Only loop over the punctuation characters that actually occur in s
    for punct in set(s).intersection(_punctuation):
        s = s.replace(punct, ' ')
    return ' '.join(s.split())

def test_re(s):  # From Vinko's solution, with fix.
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    return regex.sub('', s)

def test_trans(s):
    table = str.maketrans({key: None for key in string.punctuation})
    return s.translate(table)

def test_trans2(s):
    table = str.maketrans('', '', string.punctuation)
    return s.translate(table)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s = s.replace(c, "")
    return s

print("sets       :", timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
print("sets2      :", timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
print("regex      :", timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
print("translate  :", timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
print("translate2 :", timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
print("replace    :", timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))
```
These are my results:
```
sets       : 3.1830138750374317
sets2      : 2.189873124472797
regex      : 7.142953420989215
translate  : 4.243278483860195
translate2 : 2.427158243022859
replace    : 4.579746678471565
```
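Here's a solution without regex.
```python
import string

input_text = "!where??and!!or$$then:)"
# Python 3: str.maketrans takes the place of Python 2's string.maketrans
punctuation_replacer = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
print(' '.join(input_text.translate(punctuation_replacer).split()).strip())
# Output: where and or then
```
- Replaces punctuation with spaces
- Replaces multiple spaces between words with a single space
- Removes any trailing spaces with strip()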
```python
>>> import re
>>> s = "string. With. Punctuation?"
>>> s = re.sub(r'[^\w\s]', '', s)
>>> re.split(r'\s+', s)  # \s+ rather than \s*, which also matches the empty string
['string', 'With', 'Punctuation']
```
```python
import re

s = "string. With. Punctuation?"  # Sample string
out = re.sub(r'[^a-zA-Z0-9\s]', '', s)
```
In less strict cases, a one-liner may help:
```python
''.join([c for c in s if c.isalnum() or c.isspace()])
```
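To show what that one-liner keeps and drops, here is a short sketch with a made-up string. Note that `str.isalnum()` is Unicode-aware in Python 3, so accented letters survive, while the underscore (which `\w` would keep) is removed.

```python
s = "naïve café, 100%: déjà_vu!"

# Keep only alphanumeric characters and whitespace.
out = ''.join(c for c in s if c.isalnum() or c.isspace())
print(out)  # naïve café 100 déjàvu
```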
```python
# FIRST METHOD
# Store the punctuation characters to remove
punctuation = '!?,.:;"\')(_-'
newstring = ''  # Start with an empty string
word = input("Enter string: ")
for i in word:
    if i not in punctuation:
        newstring += i
print("The string without punctuation is", newstring)

# SECOND METHOD
word = input("Enter string: ")
punctuation = '!?,.:;"\')(_-'
# Python 3: build a deletion table (Python 2 used word.translate(None, punctuation))
newstring = word.translate(str.maketrans('', '', punctuation))
print("The string without punctuation is", newstring)

# Output for both methods
# Enter string: hello! welcome -to_python(programming.language)??,
# The string without punctuation is hello welcome topythonprogramminglanguage
```
```python
with open('one.txt', 'r') as myFile:
    str1 = myFile.read()
print(str1)

punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
for i in punctuation:
    str1 = str1.replace(i, "")

myList = []
myList.extend(str1.split(" "))  # split on spaces; an empty separator raises ValueError
print(str1)
for i in myList:
    print(i, end=' ')
print("____________")
```
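Do the search and replace using the regex functions. If you have to perform the operation repeatedly, you can keep a compiled copy of the regex pattern (your punctuation) around, which will speed things up a bit.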
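A minimal sketch of that idea: compile the pattern once at module level and reuse it on every call. The names `PUNCT_RE` and `strip_punct` are made up for illustration, and the pattern is the `string.punctuation` character class used elsewhere in this thread.

```python
import re
import string

# Compile once and reuse for every call.
PUNCT_RE = re.compile('[%s]' % re.escape(string.punctuation))

def strip_punct(text):
    """Remove ASCII punctuation using the precompiled pattern."""
    return PUNCT_RE.sub('', text)

lines = ["Hello, world!", "What's up?", "No punctuation here"]
print([strip_punct(line) for line in lines])
# ['Hello world', 'Whats up', 'No punctuation here']
```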
Removing stop words from a text file using Python
```python
print('====THIS IS HOW TO REMOVE STOP WORDS====')
with open('one.txt', 'r') as myFile:
    str1 = myFile.read()

stop_words = ("not", "is", "it", "By", "between", "This", "By", "A", "when", "And", "up", "Then",
              "was", "by", "It", "If", "can", "an", "he", "This", "or", "And", "a", "i", "it", "am",
              "at", "on", "in", "of", "to", "is", "so", "too", "my", "the", "and", "but", "are",
              "very", "here", "even", "from", "them", "then", "than", "this", "that", "though",
              "be", "But", "these")

myList = []
myList.extend(str1.split(" "))  # split on spaces; an empty separator raises ValueError
for i in myList:
    if i not in stop_words:
        print("____________")
        print(i, end=' ')
```
This is how to convert the contents of a file to upper or lower case.
```python
print('@@@@This is lower case@@@@')
with open('students.txt', 'r') as myFile:
    str1 = myFile.read()
print(str1.lower())  # lower() returns a new string; str1 itself is unchanged

print('*****This is upper case****')
with open('students.txt', 'r') as myFile:
    str1 = myFile.read()
print(str1.upper())
```
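I like to use a function like this:
```python
import string

def scrub(abc):
    # Trim punctuation from the end, then from the start; inner punctuation is kept.
    # The "abc and" guard avoids an IndexError once the string is empty.
    while abc and abc[-1] in string.punctuation:
        abc = abc[:-1]
    while abc and abc[0] in string.punctuation:
        abc = abc[1:]
    return abc
```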