Good way to pull new lines from (non indexed/non sequential) huge file
I have a large CSV file (> 1 GB) sitting on network file storage that is updated weekly with new records. The file has columns similar to these:
    Customer ID | Product | Online? (Bool) | Amount | Date
I need to use this file to update a PostgreSQL database of customer IDs with the total amount per month for each product and store. Something like this:
    Customer ID | Month | (several unrelated fields) | Product 1 (Online) | Product 1 (Offline) | Product 2 (Online) | etc.
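Because the file is so large (and keeps growing with every update), I need an efficient way to pick up the updated records and update the database. Unfortunately, our server updates the file by customer ID rather than by date, so I can't tail it.
Is there a clever way to diff the file that won't break as the file keeps growing?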
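Copy the file into a staging table. This assumes, of course, that you have a PK, i.e. a unique identifier for each row that never mutates. I checksum the remaining columns, do the same for the rows already loaded into the destination table, and compare source against destination. That finds updates, deletes, and new rows.
As you can see, I haven't added any indexes or tuned this in any other way. My goal was just to get it working correctly.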
    CREATE SCHEMA source;
    CREATE SCHEMA destination;

    --DROP TABLE source.employee;
    --DROP TABLE destination.employee;

    -- Build two identical 10-million-row tables: one stands in for the
    -- staged file, the other for the data that is already loaded.
    SELECT x AS employee_id,
           CAST('Bob' AS text) AS first_name,
           CAST('H' AS text) AS last_name,
           CAST(21 AS integer) AS age
    INTO source.employee
    FROM generate_series(1, 10000000) x;

    SELECT x AS employee_id,
           CAST('Bob' AS text) AS first_name,
           CAST('H' AS text) AS last_name,
           CAST(21 AS integer) AS age
    INTO destination.employee
    FROM generate_series(1, 10000000) x;

    -- Audit query: returns only rows that differ between the two tables.
    -- Note: || yields NULL if any operand is NULL, so nullable columns
    -- would need coalesce() before they are concatenated.
    SELECT d.*,
           s.*,
           CASE
               WHEN md5(s.first_name || s.last_name || s.age)
                    != md5(d.first_name || d.last_name || d.age) THEN 'CHECKSUM'
               WHEN d.employee_id IS NULL THEN 'Missing'
               WHEN s.employee_id IS NULL THEN 'Orphan'
           END AS AuditFailureType
    FROM destination.employee d
    FULL OUTER JOIN source.employee s
        ON d.employee_id = s.employee_id
    WHERE (d.employee_id IS NULL OR s.employee_id IS NULL)
       OR md5(s.first_name || s.last_name || s.age)
          != md5(d.first_name || d.last_name || d.age);

    -- Mimic source data getting an update.
    UPDATE source.employee
    SET age = 99
    WHERE employee_id = 45000;

    -- Run the same audit query again; it now reports the changed row.
    SELECT d.*,
           s.*,
           CASE
               WHEN md5(s.first_name || s.last_name || s.age)
                    != md5(d.first_name || d.last_name || d.age) THEN 'CHECKSUM'
               WHEN d.employee_id IS NULL THEN 'Missing'
               WHEN s.employee_id IS NULL THEN 'Orphan'
           END AS AuditFailureType
    FROM destination.employee d
    FULL OUTER JOIN source.employee s
        ON d.employee_id = s.employee_id
    WHERE (d.employee_id IS NULL OR s.employee_id IS NULL)
       OR md5(s.first_name || s.last_name || s.age)
          != md5(d.first_name || d.last_name || d.age);
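If it helps to see the same idea outside the database, here is a rough Python sketch of the checksum comparison. It is only an illustration under my own assumptions: the helper names `row_checksum` and `diff_rows` are made up, rows are plain dicts, and everything fits in memory, which will not be true for the real file.

```python
import hashlib


def row_checksum(row, key_field):
    """MD5 over every column except the key, with a separator between fields."""
    payload = "\x1f".join(
        f"{name}={row[name]!r}" for name in sorted(row) if name != key_field
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def diff_rows(source_rows, destination_rows, key_field):
    """Label each differing key the way the audit query does."""
    source = {r[key_field]: row_checksum(r, key_field) for r in source_rows}
    destination = {r[key_field]: row_checksum(r, key_field) for r in destination_rows}
    result = {}
    for key in source.keys() | destination.keys():
        if key not in destination:
            result[key] = "Missing"    # in the source, not loaded yet
        elif key not in source:
            result[key] = "Orphan"     # loaded earlier, gone from the source
        elif source[key] != destination[key]:
            result[key] = "CHECKSUM"   # same key, different column values
    return result


# Hypothetical sample data, shaped like the employee tables above.
source_rows = [
    {"employee_id": 1, "first_name": "Bob", "last_name": "H", "age": 21},
    {"employee_id": 2, "first_name": "Bob", "last_name": "H", "age": 99},
    {"employee_id": 3, "first_name": "Bob", "last_name": "H", "age": 21},
]
destination_rows = [
    {"employee_id": 1, "first_name": "Bob", "last_name": "H", "age": 21},
    {"employee_id": 2, "first_name": "Bob", "last_name": "H", "age": 21},
    {"employee_id": 4, "first_name": "Bob", "last_name": "H", "age": 21},
]
audit = diff_rows(source_rows, destination_rows, "employee_id")
```

Unlike the plain `||` concatenation in the SQL, this puts a separator between fields and uses `repr`, so `None` values and shifted field boundaries do not hash to the same value.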
Don't store data in a CSV larger than 1 gigabyte. Store it in a file named
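The only truly efficient solution is to take control of the program that creates the file and make it do something more sensible.
If you can't do that: > 1 GB just isn't that big, unless it is >> 1 GB. Just recompute the whole thing. If that's slow, make it faster. There is no reason computing a few summaries over 1 GB should be slow.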
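As a rough sketch of what "recompute the whole thing" could look like, the following streams the CSV once and sums the amounts per customer, month, product, and online flag. The header names, the comma delimiter, ISO `YYYY-MM-DD` dates, and the function name `monthly_totals` are all assumptions on my part; adjust them to the real file.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def monthly_totals(lines, delimiter=","):
    """Stream CSV rows and sum Amount per (customer, month, product, online)."""
    totals = defaultdict(Decimal)  # Decimal() starts every key at 0
    for row in csv.DictReader(lines, delimiter=delimiter):
        month = row["Date"][:7]  # assumes ISO dates: 'YYYY-MM-DD' -> 'YYYY-MM'
        online = row["Online"].strip().lower() in ("true", "t", "1")
        key = (row["Customer ID"], month, row["Product"], online)
        totals[key] += Decimal(row["Amount"])
    return dict(totals)


# Hypothetical sample standing in for the real file.
sample = io.StringIO(
    "Customer ID,Product,Online,Amount,Date\n"
    "42,Product 1,true,10.50,2020-01-03\n"
    "42,Product 1,true,4.50,2020-01-20\n"
    "42,Product 1,false,7.00,2020-01-21\n"
    "42,Product 2,true,1.25,2020-02-01\n"
)
totals = monthly_totals(sample)
```

For the real file you would pass an open file object (opened with `newline=""`) instead of the `StringIO`. Only the running totals are held in memory, not the rows, so memory use depends on the number of distinct customer/month/product combinations rather than on the file size.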