How to process CSV with 100k+ lines in PHP?

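I have a CSV file with more than 100,000 lines. Each line has 3 values separated by semicolons. The total file size is approx. 5 MB.

The CSV file is in this format: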

stock_id;product_id;amount
==========================
1;1234;0
1;1235;1
1;1236;0
...
2;1234;3
2;1235;2
2;1236;13
...
3;1234;0
3;1235;2
3;1236;0
...

We have 10 stocks, which are indexed 1-10 in the CSV. In the database they are saved as 22-31.

The CSV is sorted by stock_id, product_id, but I think that doesn't matter.

What I have

<?php

session_start();

require_once ('db.php');

echo '<meta charset="iso-8859-2">';

// convert table: `CSV stock id => DB stock id`
$stocks = array(
    1 => 22,
    2 => 23,
    3 => 24,
    4 => 25,
    5 => 26,
    6 => 27,
    7 => 28,
    8 => 29,
    9 => 30,
    10 => 31
);

$sql = $mysqli->query("SELECT product_id FROM table WHERE fielddef_id = 1");

$products = array();
while ($row = $sql->fetch_assoc()) {
    $products[$row['product_id']] = 1;
}

$csv = file('export.csv');

// go through the CSV file and prepare the SQL UPDATE query
foreach ($csv as $row) {
    $data = explode(';', $row);
    // $data[0] - stock_id
    // $data[1] - product_id
    // $data[2] - amount

    if (isset($products[$data[1]])) {
        // the CSV contains products which aren't in the database
        // this echo should show me the queries
        echo " UPDATE t
            SET value = " . (int)$data[2] . "
            WHERE fielddef_id = " . (int)$stocks[$data[0]] . " AND
            product_id = '" . $data[1] . "' -- product_id isn't just numeric
            LIMIT 1";
    }
}

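The problem is that writing out 100k lines with echo is far too slow; it takes many minutes. I'm not sure what MySQL will do, whether it will be faster or take about the same time. I have no testing machine here, so I'm worried about testing it on the prod server.

My idea was to load the CSV file into several variables (or better, an array) like below, though I'm not sure why that would help.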

$csv[0] = lines 0 - 10.000;
$csv[1] = lines 10.001 - 20.000;
$csv[2] = lines 20.001 - 30.000;
$csv[3] = lines 30.001 - 40.000;
etc.

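I found, for example, "Efficiently counting the number of lines of a text file (200mb+)", but I'm not sure how it can help me.

When I replace the foreach with print_r, I get the dump in < 1 second. The task is to make the foreach loop with the database update faster.

Any ideas how to update so many records in the database?
Thanks.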

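Related discussion

Thanks to the answers and comments on the question, I have a solution. Its base comes from @Dave; I only updated it to fit the question better.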

<?php

require_once 'include.php';

// stock convert table (key is ID in CSV, value ID in database)
$stocks = array(
    1 => 22,
    2 => 23,
    3 => 24,
    4 => 25,
    5 => 26,
    6 => 27,
    7 => 28,
    8 => 29,
    9 => 30,
    10 => 31
);

// product IDs in CSV (value) and Database (product_id) are different. We need to take both IDs from database and create an array of e-shop products
// note: the mysql_* functions were removed in PHP 7.0; use mysqli or PDO on current PHP
$sql = mysql_query("SELECT product_id, value FROM cms_module_products_fieldvals WHERE fielddef_id = 1") or die(mysql_error());

$products = array();
while ($row = mysql_fetch_assoc($sql)) {
    $products[$row['value']] = $row['product_id'];
}

$handle = fopen('import.csv', 'r');
$i = 1;

while (($data = fgetcsv($handle, 1000, ';')) !== FALSE) {
    // unknown product codes (and header or blank lines) map to 0 and are skipped
    $p_id = isset($data[1], $products[$data[1]]) ? (int)$products[$data[1]] : 0;

    if ($p_id > 0) {
        // if product exists in database, continue. Without this condition it works but we do many invalid queries to database (... WHERE product_id = 0 updates nothing, but takes time)
        if ($i % 300 === 0) {
            // optional, we'll see what it does with the real traffic
            sleep(1);
        }

        $updatesql = "UPDATE table SET value = " . (int)$data[2] . " WHERE fielddef_id = " . (int)$stocks[$data[0]] . " AND product_id = " . (int)$p_id . " LIMIT 1";
        echo "$updatesql"; // for debug only, comment out on live
        $i++;
    }
}

// approx. 1.5 sec to import 100,000+ records
fclose($handle);
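
The loop body above can be pulled into a small function, which makes it possible to check the generated SQL without touching the database. The following is only a minimal sketch, not the poster's code: the name buildUpdateSql is hypothetical, `table` is the placeholder table name used above, and it assumes every interpolated value is an integer, as in the script.

```php
<?php
// Sketch: build one UPDATE statement for a parsed CSV row, or return null
// when the row should be skipped.
// $stocks maps CSV stock IDs to database fielddef IDs; $products maps CSV
// product codes to database product IDs (both as in the script above).
function buildUpdateSql(array $data, array $stocks, array $products)
{
    if (count($data) < 3) {
        return null; // malformed or empty line
    }
    list($stockId, $productCode, $amount) = $data;

    if (!isset($stocks[$stockId]) || !isset($products[$productCode])) {
        return null; // unknown stock (e.g. a header row) or product not in the e-shop
    }

    // Every interpolated value is formatted as an integer, so no quoting is needed.
    return sprintf(
        'UPDATE `table` SET value = %d WHERE fielddef_id = %d AND product_id = %d LIMIT 1',
        (int) $amount,
        (int) $stocks[$stockId],
        (int) $products[$productCode]
    );
}
```

Inside the fgetcsv loop, a null result simply means "skip this line"; any other result is the statement that would be echoed or executed.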

Something like this (please note this is 100% untested and off the top of my head, so it may need some tweaking to actually work :))

// define the array map (there are probably better ways of doing this)
$stocks = array(
    1 => 22,
    2 => 23,
    3 => 24,
    4 => 25,
    5 => 26,
    6 => 27,
    7 => 28,
    8 => 29,
    9 => 30,
    10 => 31
);

$handle = fopen("file.csv", "r"); // open file
while (($data = fgetcsv($handle, 1000, ";")) !== FALSE) {
    // loop through csv
    $updatesql = "UPDATE t SET `value` = " . (int)$data[2] . " WHERE fielddef_id = " . (int)$stocks[$data[0]] . " AND product_id = " . $data[1];
    echo "$updatesql"; // for debug only, comment out on live
}
fclose($handle);

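You don't need the initial SELECT, since your code only ever sets the product data to 1, and from your description it looks like your product IDs are always correct; only your fielddef column needs the mapping.

Also, for live, don't forget to run your actual mysqli execute command on your $updatesql.

To give you a comparison with code that is actually in use (and that I can benchmark against!),
this is some code I use for an importer of uploaded files (it's not perfect, but it does its job):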

if (isset($_POST['action']) && $_POST['action']=="beginimport") {
    echo "<h4>Starting Import</h4><br />";
    // Ignore user abort and expand time limit
    //ignore_user_abort(true);
    set_time_limit(60);
    if (($handle = fopen($_FILES['clientimport']['tmp_name'],"r")) !== FALSE) {
        $row = 0;
        //defaults
        $sitetype = 3;
        $sitestatus = 1;
        $startdate ="2013-01-01 00:00:00";
        $enddate ="2013-12-31 23:59:59";
        $createdby = 1;
        //loop and insert
        while (($data = fgetcsv($handle, 10000,",")) !== FALSE) { // loop through each line of CSV. Returns array of that line each time so we can hard reference it if we want.
            if ($row>0) {
                if (strlen($data[1])>0) {
                    $clientshortcode = mysqli_real_escape_string($db->mysqli,trim(stripslashes($data[0])));
                    $sitename = mysqli_real_escape_string($db->mysqli,trim(stripslashes($data[0]))."".trim(stripslashes($data[1])));
                    $address = mysqli_real_escape_string($db->mysqli,trim(stripslashes($data[1])).",".trim(stripslashes($data[2])).",".trim(stripslashes($data[3])));
                    $postcode = mysqli_real_escape_string($db->mysqli,trim(stripslashes($data[4])));
                    //look up client ID
                    $client = $db->queryUniqueObject("SELECT ID FROM tblclients WHERE ShortCode='$clientshortcode'",ENABLE_DEBUG);

                    if ($client->ID>0 && is_numeric($client->ID)) {
                        //got client ID so now check if site already exists we can trust the site name here since we only care about double matching against already imported sites.
                        $sitecount = $db->countOf("tblsites","SiteName='$sitename'");
                        if ($sitecount>0) {
                            //site exists
                            echo "SITE $sitename ALREADY EXISTS SKIPPING<br />";
                        } else {
                            //site doesn't exist so do import
                            $db->execute("INSERT INTO tblsites (SiteName,SiteAddress,SitePostcode,SiteType,SiteStatus,CreatedBy,StartDate,EndDate,CompanyID) VALUES
                            ('$sitename','$address','$postcode',$sitetype,$sitestatus,$createdby,'$startdate','$enddate',".$client->ID.")",ENABLE_DEBUG);
                            echo "IMPORTED -".$data[0]." -".$data[1]."<br />";
                        }
                    } else {
                        echo "CLIENT $clientshortcode NOT FOUND PLEASE ENTER AND RE-IMPORT<br />";
                    }
                    flush(); // push the progress output to the browser
                    set_time_limit(60); // reset timer on loop
                }
            } else {
                $row++; // first line is the header row, skip it
            }
        }
        echo "<br />COMPLETED<br />";
        fclose($handle); // only close the handle if fopen() succeeded
    }
    unlink($_FILES['clientimport']['tmp_name']);
    echo "All Imports finished do not reload this page";
}

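This imported 150k rows in about 10 seconds.

Related discussion

As I said in the comments, use SplFileObject to iterate over the CSV file. Using a prepared statement reduces the performance overhead of calling UPDATE on each loop iteration. Also, merge the two queries together; there is no reason to pull all of the product rows first and check them against the CSV. You can use a JOIN to ensure that only those stocks in the second table that relate to a product in the first table and appear in the current CSV row get updated: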

/* First the CSV is pulled in */
$export_csv = new SplFileObject('export.csv');
$export_csv->setFlags(SplFileObject::READ_CSV | SplFileObject::DROP_NEW_LINE | SplFileObject::READ_AHEAD | SplFileObject::SKIP_EMPTY);
$export_csv->setCsvControl(';');

/* Next you prepare your statement object.
   The table names (stocks, products) are placeholders and the join column is
   assumed to be product_id. MySQL does not allow LIMIT in a multi-table
   UPDATE, so it is left out here. */
$stmt = $mysqli->prepare("
    UPDATE stocks
    JOIN products ON products.product_id = stocks.product_id
    SET stocks.value = ?
    WHERE
        stocks.fielddef_id = ? AND
        stocks.product_id = ? AND
        products.fielddef_id = 1
");

$stmt->bind_param('iis', $amount, $fielddef_id, $product_id);

/* Now you can loop through the CSV and set the fields to match the variables bound to the prepared statement and execute the update on each loop. */

foreach ($export_csv as $csv_row) {
    // skip blank lines, the header row and anything else that is not a data row
    if (!is_array($csv_row) || count($csv_row) < 3 || !is_numeric($csv_row[0])) {
        continue;
    }

    list($stock_id, $product_id, $amount) = $csv_row;
    $fielddef_id = $stock_id + 21;

    $stmt->execute();
}

$stmt->close();
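Updating every record individually each time will be too expensive (mostly because of the seeks, but also because of the writes).

You should TRUNCATE the table first and then insert all the records again (assuming you have no external foreign keys pointing to this table).

To make it even faster, lock the table before the inserts and unlock it afterwards. This prevents the index from being updated on every insert.

Make the queries bigger, i.e. use the loop to compile a larger query. You may need to split it into chunks (e.g. process 100 at a time), but definitely don't execute one query at a time (this applies to any kind: insert, update, and even select if possible). This should improve performance considerably.

It is generally recommended that you don't query in a loop.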
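
The advice above (truncate, lock, insert in chunks, unlock) could be sketched roughly as follows. This is an illustration under stated assumptions rather than a tested import: buildReloadStatements is a hypothetical name, the table `t` and its columns are taken from the question, all three columns are assumed to be integers, and truncating is only appropriate if the CSV really is the complete content of the table, which the question does not confirm.

```php
<?php
// Sketch: turn rows of (fielddef_id, product_id, value) into the statement
// sequence described above: TRUNCATE, LOCK, chunked multi-row INSERTs, UNLOCK.
function buildReloadStatements(array $rows, $chunkSize = 100)
{
    $statements = array('TRUNCATE TABLE t', 'LOCK TABLES t WRITE');

    foreach (array_chunk($rows, $chunkSize) as $chunk) {
        $tuples = array();
        foreach ($chunk as $row) {
            // Formatting every value as an integer keeps the concatenated SQL safe.
            $tuples[] = sprintf('(%d, %d, %d)', (int) $row[0], (int) $row[1], (int) $row[2]);
        }
        // One INSERT per chunk instead of one query per row.
        $statements[] = 'INSERT INTO t (fielddef_id, product_id, value) VALUES '
            . implode(', ', $tuples);
    }

    $statements[] = 'UNLOCK TABLES';
    return $statements;
}
```

With 100,000 rows and a chunk size of 100 this produces 1,000 INSERT statements instead of 100,000 single-row queries; each returned string would then be passed to `$mysqli->query()` in order.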