How to process CSV with 100k+ lines in PHP?
I have a CSV file with more than 100,000 lines; each line has 3 values separated by semicolons. The total file size is approx. 5 MB.

The CSV file is in this format:
```
stock_id;product_id;amount
==========================
1;1234;0
1;1235;1
1;1236;0
...
2;1234;3
2;1235;2
2;1236;13
...
3;1234;0
3;1235;2
3;1236;0
...
```
We have 10 stocks, which are indexed 1-10 in the CSV. In the database they are saved as 22-31.

The CSV is sorted by stock_id, product_id, but I don't think that matters.
What I have
```php
<?php
session_start();
require_once('db.php');

echo '<meta charset="iso-8859-2">';

// convert table: `CSV stock id => DB stock id`
$stocks = array(
    1  => 22,
    2  => 23,
    3  => 24,
    4  => 25,
    5  => 26,
    6  => 27,
    7  => 28,
    8  => 29,
    9  => 30,
    10 => 31
);

$sql = $mysqli->query("SELECT product_id FROM table WHERE fielddef_id = 1");

while ($row = $sql->fetch_assoc()) {
    $products[$row['product_id']] = 1;
}

$csv = file('export.csv');

// go thru CSV file and prepare SQL UPDATE query
foreach ($csv as $row) {
    $data = explode(';', $row);
    // $data[0] - stock_id
    // $data[1] - product_id
    // $data[2] - amount

    if (isset($products[$data[1]])) { // in CSV are products which aren't in database
        // there is an echo which should show me the queries
        echo "UPDATE t
              SET value = " . (int)$data[2] . "
              WHERE fielddef_id = " . (int)$stocks[$data[0]] . "
              AND product_id = '" . $data[1] . "' -- product_id isn't just numeric
              LIMIT 1";
    }
}
```
The problem is that this is very slow: just echoing 100k queries takes long minutes, and actually executing 100,000 separate UPDATEs against the database would be slower still.
My idea was to load the CSV file into several variables (arrays, rather) as below, but I'm not sure why it would help:
```
$csv[0] = lines 0 - 10.000;
$csv[1] = lines 10.001 - 20.000;
$csv[2] = lines 20.001 - 30.000;
$csv[3] = lines 30.001 - 40.000;
etc.
```
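For what it's worth, here is a minimal, hypothetical sketch of that chunking idea using `array_chunk()` (untested; note that `file()` still reads the whole file into memory first, so this only changes how you iterate, not how much memory you use):

```php
// Hypothetical sketch of the chunking idea above (untested).
$lines  = file('export.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$chunks = array_chunk($lines, 10000); // $chunks[0] = lines 0-9999, etc.

foreach ($chunks as $chunk) {
    // process one block of up to 10,000 lines at a time
    foreach ($chunk as $line) {
        $data = explode(';', $line);
        // ... build and run the UPDATE for $data here ...
    }
}
```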
I found, for example, "Efficiently counting the number of lines of a text file (200mb+)", but I'm not sure how it would help me.
When I load the whole file into memory at once it works, but processing it row by row still takes far too long.
Any ideas how to update this many records in the database?

Thanks.
Edit: Thanks to the answers and comments on this question, I have a solution. Its basis comes from @Dave; I've only updated it to answer the question better.
```php
<?php
require_once 'include.php';

// stock convert table (key is ID in CSV, value is ID in database)
$stocks = array(
    1  => 22,
    2  => 23,
    3  => 24,
    4  => 25,
    5  => 26,
    6  => 27,
    7  => 28,
    8  => 29,
    9  => 30,
    10 => 31
);

// product IDs in the CSV (value) and the database (product_id) differ. We need
// to take both IDs from the database and build an array of e-shop products.
$sql = mysql_query("SELECT product_id, value FROM cms_module_products_fieldvals WHERE fielddef_id = 1") or die(mysql_error());
while ($row = mysql_fetch_assoc($sql)) {
    $products[$row['value']] = $row['product_id'];
}

$handle = fopen('import.csv', 'r');
$i = 1;

while (($data = fgetcsv($handle, 1000, ';')) !== FALSE) {
    $p_id = (int)$products[$data[1]];
    if ($p_id > 0) {
        // only continue if the product exists in the database. Without this
        // condition it still works, but we'd send many useless queries
        // (... WHERE product_id = 0 updates nothing, yet still takes time)
        if ($i % 300 === 0) { // optional; we'll see what it does with real traffic
            sleep(1);
        }
        $updatesql = "UPDATE table SET value = " . (int)$data[2] . " WHERE fielddef_id = " . $stocks[$data[0]] . " AND product_id = " . (int)$p_id . " LIMIT 1";
        echo "$updatesql"; // for debug only, comment out on live
        $i++;
    }
}

// cca 1.5 sec to import 100.000+ records
fclose($handle);
```
Something like this (please note this is 100% untested and off the top of my head, so it may need some tweaking to actually work :))
```php
// define the array map (there are probably better ways of doing this)
$stocks = array(
    1  => 22,
    2  => 23,
    3  => 24,
    4  => 25,
    5  => 26,
    6  => 27,
    7  => 28,
    8  => 29,
    9  => 30,
    10 => 31
);

$handle = fopen("file.csv", "r"); // open file

while (($data = fgetcsv($handle, 1000, ";")) !== FALSE) { // loop through csv
    $updatesql = "UPDATE t SET `value` = " . $data[2] . " WHERE fielddef_id = " . $stocks[$data[0]] . " AND product_id = " . $data[1];
    echo "$updatesql"; // for debug only, comment out on live
}
```
You don't need to do the initial select, since you only ever set the product data to 1 in your code anyway, and it appears from your description that your product IDs are always correct; it's only your fielddef column that needs the mapping.
Also, for live use, don't forget to actually execute your $updatesql with a mysqli command instead of just echoing it.
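For example, a minimal sketch (assuming the same `$mysqli` connection object from the question):

```php
// execute instead of (or in addition to) echoing the query
if (!$mysqli->query($updatesql)) {
    // handle/log the failure instead of silently continuing
    echo "Update failed: " . $mysqli->error . "<br />";
}
```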
For comparison against code in actual use (which I can benchmark!):

Here's some code from an importer I use for an upload file (it isn't perfect, but it does its job):
```php
if (isset($_POST['action']) && $_POST['action'] == "beginimport") {
    echo "<h4>Starting Import</h4><br />";
    // Ignore user abort and expand time limit
    //ignore_user_abort(true);
    set_time_limit(60);
    if (($handle = fopen($_FILES['clientimport']['tmp_name'], "r")) !== FALSE) {
        $row = 0;
        //defaults
        $sitetype   = 3;
        $sitestatus = 1;
        $startdate  = "2013-01-01 00:00:00";
        $enddate    = "2013-12-31 23:59:59";
        $createdby  = 1;
        //loop and insert
        while (($data = fgetcsv($handle, 10000, ",")) !== FALSE) {
            // loop through each line of CSV. Returns array of that line each time
            // so we can hard reference it if we want.
            if ($row > 0) {
                if (strlen($data[1]) > 0) {
                    $clientshortcode = mysqli_real_escape_string($db->mysqli, trim(stripslashes($data[0])));
                    $sitename = mysqli_real_escape_string($db->mysqli, trim(stripslashes($data[0])) . " " . trim(stripslashes($data[1])));
                    $address = mysqli_real_escape_string($db->mysqli, trim(stripslashes($data[1])) . "," . trim(stripslashes($data[2])) . "," . trim(stripslashes($data[3])));
                    $postcode = mysqli_real_escape_string($db->mysqli, trim(stripslashes($data[4])));
                    //look up client ID
                    $client = $db->queryUniqueObject("SELECT ID FROM tblclients WHERE ShortCode='$clientshortcode'", ENABLE_DEBUG);
                    if ($client->ID > 0 && is_numeric($client->ID)) {
                        //got client ID so now check if site already exists. We can trust
                        //the site name here since we only care about double matching
                        //against already imported sites.
                        $sitecount = $db->countOf("tblsites", "SiteName='$sitename'");
                        if ($sitecount > 0) {
                            //site exists
                            echo "SITE $sitename ALREADY EXISTS SKIPPING<br />";
                        } else {
                            //site doesn't exist so do import
                            $db->execute("INSERT INTO tblsites (SiteName,SiteAddress,SitePostcode,SiteType,SiteStatus,CreatedBy,StartDate,EndDate,CompanyID)
                                VALUES ('$sitename','$address','$postcode',$sitetype,$sitestatus,$createdby,'$startdate','$enddate'," . $client->ID . ")", ENABLE_DEBUG);
                            echo "IMPORTED - " . $data[0] . " - " . $data[1] . "<br />";
                        }
                    } else {
                        echo "CLIENT $clientshortcode NOT FOUND PLEASE ENTER AND RE-IMPORT<br />";
                    }
                    fcflush();
                    set_time_limit(60); // reset timer on loop
                }
            } else {
                $row++;
            }
        }
        echo "<br />COMPLETED<br />";
    }
    fclose($handle);
    unlink($_FILES['clientimport']['tmp_name']);
    echo "All Imports finished do not reload this page";
}
```
That imported 150k rows in about 10 seconds.
As I said in the comments, use SplFileObject to iterate over the CSV file. Use prepared statements to reduce the performance overhead of calling UPDATE on every loop iteration. Also, merge your two queries together; there is no reason to pull all of the product rows first and check them against the CSV. You can use a JOIN to ensure that only those stocks in the second table that relate to a product in the first, and that appear in the current CSV row, get updated:
```php
/* First the CSV is pulled in */
$export_csv = new SplFileObject('export.csv');
$export_csv->setFlags(SplFileObject::READ_CSV | SplFileObject::DROP_NEW_LINE | SplFileObject::READ_AHEAD);
$export_csv->setCsvControl(';');

/* Next you prepare your statement object */
$stmt = $mysqli->prepare("
    UPDATE stocks, products
    SET value = ?
    WHERE stocks.fielddef_id = ?
    AND product_id = ?
    AND products.fielddef_id = 1
    LIMIT 1
");

$stmt->bind_param('iis', $amount, $fielddef_id, $product_id);

/* Now you can loop through the CSV, set the fields to match the
 * values bound to the prepared statement, and execute the update
 * on each loop. */

foreach ($export_csv as $csv_row) {
    list($stock_id, $product_id, $amount) = $csv_row;
    $fielddef_id = $stock_id + 21;

    if (!empty($stock_id)) {
        $stmt->execute();
    }
}

$stmt->close();
```
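One optional follow-up, which is an assumption on my part rather than part of the answer above: since each `execute()` is a separate round trip with its own implicit commit, wrapping the loop in a single transaction often speeds up large numbers of small updates considerably. A hedged sketch on top of the same `$stmt` (requires PHP 5.5+ for `begin_transaction()`):

```php
// Assumption, not from the original answer: batch all updates
// into one transaction instead of one implicit commit per row.
$mysqli->begin_transaction();

foreach ($export_csv as $csv_row) {
    list($stock_id, $product_id, $amount) = $csv_row;
    $fielddef_id = $stock_id + 21;

    if (!empty($stock_id)) {
        $stmt->execute();
    }
}

$mysqli->commit();
```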
Updating every record individually will be too expensive (mostly due to seeks, but also due to the writes themselves).
You should first TRUNCATE the table and then insert all the records back (assuming no external foreign keys link to this table).
To make it even faster, you should lock the table before the insert and unlock it afterwards. This will prevent the indexing from happening on every insert.
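A rough sketch of that approach (hedged: the table name is taken from the accepted solution, and in this particular schema a TRUNCATE would also wipe the fielddef_id = 1 product rows, which would then need re-inserting):

```php
// Sketch only: rebuild the table contents instead of updating row by row.
$mysqli->query("TRUNCATE TABLE cms_module_products_fieldvals");

// Lock around the bulk insert; as the answer notes, this avoids
// index maintenance on every single insert.
$mysqli->query("LOCK TABLES cms_module_products_fieldvals WRITE");

// ... run the INSERTs rebuilt from the CSV here ...

$mysqli->query("UNLOCK TABLES");
```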
Make the query bigger, i.e., use a loop to compile a larger query. You may need to split it into chunks (e.g., process 100 rows at a time), but definitely don't do one query per row (this applies to any kind: INSERT, UPDATE, even SELECT if possible). This should improve performance considerably; see the sketch below.
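As a hedged illustration of that idea (reusing `$stocks`, `$products`, and `$handle` from the accepted solution, using mysqli, and assuming a unique key on `(fielddef_id, product_id)` so that `INSERT ... ON DUPLICATE KEY UPDATE` can stand in for the per-row UPDATEs):

```php
// Sketch: compile CSV rows into multi-row statements, ~100 rows per query.
$values = array();
$flush  = function () use ($mysqli, &$values) {
    if (empty($values)) {
        return;
    }
    $mysqli->query(
        "INSERT INTO cms_module_products_fieldvals (fielddef_id, product_id, value) VALUES " .
        implode(',', $values) .
        " ON DUPLICATE KEY UPDATE value = VALUES(value)"
    );
    $values = array();
};

while (($data = fgetcsv($handle, 1000, ';')) !== FALSE) {
    $p_id = (int)$products[$data[1]]; // same product lookup as the solution uses
    if ($p_id > 0) {
        $values[] = "(" . (int)$stocks[$data[0]] . ", $p_id, " . (int)$data[2] . ")";
    }
    if (count($values) >= 100) {
        $flush(); // one round trip per 100 rows instead of 100 round trips
    }
}
$flush(); // send the final partial chunk
```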
And as said in the comments, it's generally recommended that you don't run queries inside a loop.