Read a very big single-line txt file and split it
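I have the following problem: I have a file that is about 500 MB in size. Its text is all on one line. The text is separated by a virtual line ending called ROW_DEL, and it looks like this:
Now I need to do the following: I want to split this file into its lines, so that I get a file like this:
The problem is that even when I open it with a Windows text editor, the editor breaks because the file is too big.
Is it possible to split this file as I described with C, Java, or Python? What would be the best solution that does not overload my CPU?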
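Here is my solution. It is easy in principle, but not so easy to code if you want to handle some special cases (which Ukaszw.pl did not).
The special cases are: the separator ROW_DEL being split across two read chunks (as i4v pointed out) and, more subtly, two adjacent ROW_DELs of which the second is split across two read chunks.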
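To see why these cases matter, here is a small sketch (not part of the original answer) of a deliberately naive chunked replacement that converts each chunk independently. The function name `naive_chunked_replace` and the sample bytes are invented for illustration.

```python
def naive_chunked_replace(data: bytes, sep: bytes, eol: bytes, chunk_length: int) -> bytes:
    """Replace sep in every chunk separately, ignoring chunk boundaries (naive on purpose)."""
    pieces = []
    for start in range(0, len(data), chunk_length):
        pieces.append(data[start:start + chunk_length].replace(sep, eol))
    return b"".join(pieces)


data = b"first ROW_DELsecond ROW_DELROW_DELthird"
expected = data.replace(b"ROW_DEL", b"\n")   # what a whole-file replace gives

# With chunks of 10 bytes the first separator is cut into b"ROW_" and b"DEL",
# and the third one into b"ROW" and b"_DEL", so both are missed.
result = naive_chunked_replace(data, b"ROW_DEL", b"\n", 10)

print(result == expected)          # False
print(result.count(b"\n"))         # 1 separator converted instead of 3
```

Only the separator that happens to lie entirely inside one chunk is converted, which is the failure the look-ahead described next is meant to prevent.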
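Since ROW_DEL is longer than any possible newline ('\r', '\n' or '\r\n'), it can be replaced in the file by the newline used by the OS. That is why I chose to rewrite the file itself. To do that, I use the mode 'rb+' (see the open calls in the code).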
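The principle is to read a chunk (in real life its size would be, for example, 262144) plus x additional characters, where x is the length of the separator minus 1. Then check whether the separator is present in the last x characters of the chunk followed by the x additional characters. Depending on whether it is present, the chunk is shortened or not before the ROW_DELs are converted and the result is rewritten in place.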
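The end-of-chunk check can be isolated as a small pure function on in-memory bytes. This is only a sketch of the idea in the paragraph above: `safe_length` is a hypothetical helper name and the sample chunks are invented.

```python
def safe_length(chunk: bytes, lookahead: bytes, sep: bytes) -> int:
    """Return how many leading bytes of chunk can be converted right away.

    lookahead holds up to len(sep) - 1 bytes that follow chunk in the file.
    If sep starts near the end of chunk and finishes in lookahead, the chunk
    is shortened so that the whole separator is left for the next read.
    """
    x = len(sep) - 1
    if x == 0:
        return len(chunk)              # a one-byte separator cannot straddle
    tail = chunk[-x:]
    pt = (tail + lookahead).find(sep)
    if pt == -1:
        return len(chunk)              # no straddling separator: keep everything
    return len(chunk) - len(tail) + pt # cut exactly where the separator starts


SEP = b"ROW_DEL"

# Separator split across the boundary: b"...ROW_D" | b"EL..."
print(safe_length(b"infected ROW_D", b"ELwith", SEP))   # 9

# Two adjacent separators, the second one split: the first stays in the chunk.
print(safe_length(b"d ROW_DELROW_D", b"ELlast", SEP))   # 9

# Nothing straddles the boundary.
print(safe_length(b"roommate of a ", b"man in", SEP))   # 14
```

In the second call the first nine bytes, `b"d ROW_DEL"`, still contain one complete separator, so it is converted with this chunk while the split one waits for the next read.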
The bare code is:
```python
from os import fsync, linesep
from os.path import getsize

text = ('The hospital roommate of a man infected ROW_DEL'
        'with novel coronavirus (NCoV)ROW_DEL'
        '—a SARS-related virus first identified ROW_DELROW_DEL'
        'last year and already linked to 18 deaths—ROW_DEL'
        'has contracted the illness himself, ROW_DEL'
        'intensifying concerns about the ROW_DEL'
        "virus's ability to spread ROW_DEL"
        'from person to person.')

# Build the single-line test file (UTF-8, because the text has non-ASCII dashes).
with open('eessaa.txt', 'w', encoding='utf-8') as f:
    f.write(text)

with open('eessaa.txt', 'rb') as f:
    ch = f.read()
# Display only: break the line after each separator to make the text readable.
print(ch.replace(b'ROW_DEL', b'ROW_DEL\n').decode('utf-8'))
print('\nlength of the text : %d bytes\n' % len(ch))

#==========================================

def rewrite(whichfile, sep, chunk_length, OSeol=linesep.encode()):
    # sep and OSeol are bytes. Assumes len(sep) >= 2 and len(OSeol) <= len(sep);
    # otherwise the writer would overtake the reader in the same file.
    if chunk_length < len(sep):
        print('Length of second argument, %d , is '
              'the minimum value for the third argument'
              % len(sep))
        return
    x = len(sep) - 1
    file_length = getsize(whichfile)
    # Two handles on the same file: fR reads ahead, fW rewrites behind it.
    with open(whichfile, 'rb+') as fR, \
         open(whichfile, 'rb+') as fW:
        while True:
            chunk = fR.read(chunk_length)
            pch = fR.tell()                  # position just after the chunk
            # Last x bytes of the chunk + the x bytes that follow it
            # (12 bytes in total for ROW_DEL, hence the name).
            twelve = chunk[-x:] + fR.read(x)

            if sep in twelve:
                # sep straddles the chunk boundary: cut the chunk where sep starts.
                pt = twelve.find(sep)
                y = chunk[0:-x+pt].replace(sep, OSeol)
            else:
                pt = x
                y = chunk.replace(sep, OSeol)

            fW.write(y)
            fW.flush()
            fsync(fW.fileno())

            # First byte not converted yet. Testing this position (rather than
            # fR.tell()) against the file size keeps the look-ahead bytes from
            # being dropped when the look-ahead read reaches the end of the file.
            next_pos = pch - x + pt
            if next_pos < file_length:
                fR.seek(next_pos)
            else:
                fW.truncate()
                break

rewrite('eessaa.txt', b'ROW_DEL', 14)

with open('eessaa.txt', 'rb') as f:
    ch = f.read()
# repr() makes the line endings visible; [2:-1] strips the b'...' wrapper.
print('\n'.join(repr(line)[2:-1] for line in ch.splitlines(True)))
print('\nlength of the text : %d bytes\n' % len(ch))
```
To follow the execution, here is another version of the code that prints messages all along the way:
```python
from os import fsync, linesep
from os.path import getsize

text = ('The hospital roommate of a man infected ROW_DEL'
        'with novel coronavirus (NCoV)ROW_DEL'
        '—a SARS-related virus first identified ROW_DELROW_DEL'
        'last year and already linked to 18 deaths—ROW_DEL'
        'has contracted the illness himself, ROW_DEL'
        'intensifying concerns about the ROW_DEL'
        "virus's ability to spread ROW_DEL"
        'from person to person.')

# Build the single-line test file (UTF-8, because the text has non-ASCII dashes).
with open('eessaa.txt', 'w', encoding='utf-8') as f:
    f.write(text)

with open('eessaa.txt', 'rb') as f:
    ch = f.read()
# Display only: break the line after each separator to make the text readable.
print(ch.replace(b'ROW_DEL', b'ROW_DEL\n').decode('utf-8'))
print('\nlength of the text : %d bytes\n' % len(ch))

#==========================================

def rewrite(whichfile, sep, chunk_length, OSeol=linesep.encode()):
    # sep and OSeol are bytes. Assumes len(sep) >= 2 and len(OSeol) <= len(sep).
    if chunk_length < len(sep):
        print('Length of second argument, %d , is '
              'the minimum value for the third argument'
              % len(sep))
        return
    x = len(sep) - 1
    file_length = getsize(whichfile)
    with open(whichfile, 'rb+') as fR, \
         open(whichfile, 'rb+') as fW:
        while True:
            chunk = fR.read(chunk_length)
            pch = fR.tell()                  # position just after the chunk
            twelve = chunk[-x:] + fR.read(x)
            ptw = fR.tell()                  # position after the look-ahead

            if sep in twelve:
                # sep straddles the chunk boundary: cut the chunk where sep starts.
                pt = twelve.find(sep)
                m = ('\n   !! %r is '
                     'at position %d in twelve !!'
                     % (sep, pt))
                y = chunk[0:-x+pt].replace(sep, OSeol)
            else:
                pt = x
                m = ''
                y = chunk.replace(sep, OSeol)
            print('chunk   == %r   %d bytes\n'
                  '   -> fR now at position  %d\n'
                  'twelve  == %r   %d bytes   %s\n'
                  '   -> fR now at position  %d'
                  % (chunk, len(chunk), pch,
                     twelve, len(twelve), m, ptw))

            pos = fW.tell()
            fW.write(y)
            fW.flush()
            fsync(fW.fileno())
            print('          %r   %d long\n'
                  '   has been written from position %d\n'
                  '   => fW now at position  %d'
                  % (y, len(y), pos, fW.tell()))

            # First byte not converted yet. Testing this position (rather than
            # fR.tell()) against the file size keeps the look-ahead bytes from
            # being dropped when the look-ahead read reaches the end of the file.
            next_pos = pch - x + pt
            if next_pos < file_length:
                fR.seek(next_pos)
                print('   -> fR moved %d bytes back to position %d'
                      % (ptw - next_pos, fR.tell()))
            else:
                print("   => fR is at position %d  == file's size\n"
                      '      File has thoroughly been read'
                      % fR.tell())
                fW.truncate()
                break

            input('\npress Enter to continue')

rewrite('eessaa.txt', b'ROW_DEL', 14)

with open('eessaa.txt', 'rb') as f:
    ch = f.read()
# repr() makes the line endings visible; [2:-1] strips the b'...' wrapper.
print('\n'.join(repr(line)[2:-1] for line in ch.splitlines(True)))
print('\nlength of the text : %d bytes\n' % len(ch))
```
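There is some subtlety in the handling of the ends of the chunks, in order to detect whether ROW_DEL straddles two chunks and whether there are two adjacent ROW_DELs. That is why it took me a long time to post my solution: I finally had to write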
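Actually, 500 MB of text is not that big; it is just that Notepad is terrible. On Windows you probably do not have sed available, but at least try the naive solution in Python, which I think will work fine: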
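```python
# Reads the whole file into memory at once, so it needs enough free RAM for the text.
with open('infile.txt') as f_in, open('outfile.txt', 'w') as f_out:
    # The pattern includes the space that follows ROW_DEL.
    # Text mode already turns '\n' into the platform line ending on write,
    # so writing os.linesep here would give '\r\r\n' on Windows.
    f_out.write(f_in.read().replace('ROW_DEL ', '\n'))
```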
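Read this file in chunks, for example in C#. For each chunk you read, you can replace the separator with a line break and append the result to a new file. Just remember to advance the current index by the number of characters you just read.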
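A minimal Python sketch of this chunk-and-append idea, under a few assumptions: the separator is the byte string `ROW_DEL`, a plain `\n` is wanted, and the replacement is non-empty and does not occur inside the separator. The function `split_to_new_file` and the `carry` buffer are illustrative names, not something from the answer. The carry-over is an addition, so that a separator cut by a chunk boundary is still found.

```python
def split_to_new_file(src, dst, sep=b"ROW_DEL", eol=b"\n", chunk_size=1 << 20):
    """Copy src to dst chunk by chunk, replacing every sep with eol."""
    keep = len(sep) - 1          # longest separator prefix that can end a chunk
    carry = b""                  # bytes held back from the previous chunk
    with open(src, "rb") as f_in, open(dst, "wb") as f_out:
        while True:
            block = f_in.read(chunk_size)
            if not block:
                break
            data = (carry + block).replace(sep, eol)
            # Hold back the last `keep` bytes: they may be the start of a
            # separator whose end arrives with the next block.
            cut = max(len(data) - keep, 0)
            f_out.write(data[:cut])
            carry = data[cut:]
        f_out.write(carry)       # whatever is left cannot hold a full separator


# Hypothetical usage (file names are placeholders):
# split_to_new_file("infile.txt", "outfile.txt")
```

Because the output goes to a new file, the source is never modified and memory use stays close to `chunk_size`, at the cost of needing disk space for a second copy.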