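PHP: How can I upload a CSV file into a MySQL database using multiple threads?

I have a CSV file containing millions of email addresses, and I want to upload it to a MySQL database quickly with PHP.

Right now I am using a single-threaded program, and the upload takes too long.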

//get the csv file
$file = $_FILES['csv']['tmp_name'];
$handle = fopen($file,"r");

//loop through the csv file and insert into database
do {
    if ($data[0]) {
        $expression ="/^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$/";
        if (preg_match($expression, $data[0])) {
            $query=mysql_query("SELECT * FROM `postfix`.`recipient_access` where recipient='".$data[0]."'");
            mysql_query("SET NAMES utf8");
            $fetch=mysql_fetch_array($query);
            if($fetch['recipient']!=$data[0]){
                $query=mysql_query("INSERT INTO `postfix`.`recipient_access`(`recipient`, `note`) VALUES('".addslashes($data[0])."','".$_POST['note']."')");
            }
        }
    }
} while ($data = fgetcsv($handle,1000,",","'"));

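Related discussion

First of all, I can't stress this enough: fix your indentation. It will make everyone's life easier.

Second, the answer depends heavily on the actual bottleneck you are hitting:

Regular expressions are slow, especially when they run inside a loop.
Databases tend to be tuned for either WRITES or READS, but not both: try to reduce the number of queries up front.
It stands to reason that the less PHP code there is in the loop, the faster it runs. Consider reducing the number of conditionals (for example).
For the record, your code is not safe against SQL injection: filter $_POST beforehand[*].
[*] Speaking of which, accessing a plain variable is faster than accessing an index of an array such as $_POST.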
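
The points above about doing less work per row can be sketched roughly as follows. This is only an illustration under stated assumptions, not the asker's code: the helper names `collect_valid_emails` and `build_batch_insert` are hypothetical, `filter_var` with `FILTER_VALIDATE_EMAIL` is swapped in for the hand-written regex, and `INSERT IGNORE` assumes a UNIQUE index on `recipient` so the per-row SELECT can be dropped.

```php
<?php
// Hypothetical helpers; the names are illustrative and not part of the original code.

// Validate and de-duplicate addresses in memory, before any query is sent.
function collect_valid_emails(iterable $rows): array
{
    $seen = [];
    foreach ($rows as $row) {
        $email = trim((string)($row[0] ?? ''));
        // filter_var stands in for the hand-written regex (an assumption)
        if ($email !== '' && filter_var($email, FILTER_VALIDATE_EMAIL) !== false) {
            $seen[$email] = true; // array keys make duplicates collapse
        }
    }
    return array_keys($seen);
}

// Build one multi-row statement with placeholders instead of one query per row.
// INSERT IGNORE assumes a UNIQUE index on `recipient`.
// Callers should skip empty batches, since an empty VALUES list is not valid SQL.
function build_batch_insert(array $emails, string $note): array
{
    $placeholders = implode(', ', array_fill(0, count($emails), '(?, ?)'));
    $sql = 'INSERT IGNORE INTO `postfix`.`recipient_access` (`recipient`, `note`) VALUES '
         . $placeholders;
    $params = [];
    foreach ($emails as $email) {
        $params[] = $email;
        $params[] = $note;
    }
    return [$sql, $params];
}
```

The returned pair can be handed to `PDO::prepare()` and `execute($params)`, a few hundred rows per statement, which also keeps `$_POST['note']` out of the SQL text.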
You can simulate multithreading by having a main program split the huge CSV file into smaller chunks and run each chunk in a separate process.

common.php

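class FileLineFinder {
    protected $handle;

    public function __construct($file){
        // assign to the property; a local $handle would be lost when the constructor returns
        $this->handle = fopen($file, 'r');
        if ($this->handle === false)
            throw new RuntimeException('Cannot open file: '.$file);
    }
    // consume one line and return the byte offset where the next line starts,
    // or false once the end of the file has been reached
    public function next_line(){
        if (fgets($this->handle) === false) return false;
        return ftell($this->handle);
    }
    // advance past $count lines, stopping early at the end of the file
    public function skip_lines($count){
        for($i = 0; $i < $count; $i++)
            if ($this->next_line() === false) break;
    }
    public function __destruct(){
        if (is_resource($this->handle)) fclose($this->handle);
    }
}

// start $cmd in the background, sending its output to $outfile and its pid to $pidfile
function exec_async($cmd, $outfile, $pidfile){
    exec(sprintf('%s > %s 2>&1 & echo $! > %s', $cmd, escapeshellarg($outfile), escapeshellarg($pidfile)));
}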

main.php

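require('common.php');

$maxlines = 200; // maximum lines subtask will be processing at a time
$note = $_POST['note'];
$file = $_FILES['csv']['tmp_name'];
$outdir = dirname(__FILE__) . DIRECTORY_SEPARATOR . 'out' . DIRECTORY_SEPARATOR;

//make sure our output directory exists
if (!is_dir($outdir) && !mkdir($outdir, 0755, true)) {
    die('Cannot create output directory: '.$outdir);
}

// run a task for each chunk of lines in the csv file
$i = 0; $pos = 0;
$size = filesize($file);
$l = new FileLineFinder($file);
do {
    $i++;
    exec_async(
        'php -f sub.php -- '.$pos.' '.$maxlines.' '.escapeshellarg($file).' '.escapeshellarg($note),
        $outdir.'proc'.$i.'.log',
        $outdir.'proc'.$i.'.pid'
    );
    // move past exactly $maxlines lines; next_line() returns the offset of the next chunk
    $l->skip_lines($maxlines - 1);
    $pos = $l->next_line();
} while ($pos !== false && $pos < $size);

// wait for each task to finish: a task counts as running while the pid in its file is alive
do {
    $tasks = 0;
    foreach (glob($outdir.'proc*.pid') as $pidfile) {
        $pid = (int)file_get_contents($pidfile);
        if ($pid > 0 && posix_kill($pid, 0)) { // signal 0 only tests that the process exists
            $tasks++;
        } else {
            unlink($pidfile); // task has exited, so drop its pid file
        }
    }
    echo 'Remaining Tasks: '.$tasks.PHP_EOL;
    sleep(1); // avoid a busy loop while polling
} while ($tasks > 0);
echo 'Finished!'.PHP_EOL;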

sub.php

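require('common.php');

$start = (int)$argv[1]; // byte offset of the first line to process
$count = (int)$argv[2]; // number of lines to process
$file = $argv[3];
$note = $argv[4];

// the mysql_* functions were removed in PHP 7, so this uses mysqli;
// host, user and password are placeholders
$db = new mysqli('localhost', 'user', 'password', 'postfix');
$db->set_charset('utf8');

$handle = fopen($file, 'r');
fseek($handle, $start, SEEK_SET);

$expression ="/^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$/";

// prepared statements keep $recipient and $note out of the SQL text
$select = $db->prepare('SELECT 1 FROM `postfix`.`recipient_access` WHERE `recipient` = ?');
$insert = $db->prepare('INSERT INTO `postfix`.`recipient_access`(`recipient`, `note`) VALUES(?, ?)');
$select->bind_param('s', $recipient);
$insert->bind_param('ss', $recipient, $note);

//loop through this task's slice of the csv file and insert into database
for ($lines = 0; $lines < $count && ($data = fgetcsv($handle, 1000, ',', '\'')) !== false; $lines++) {
    $recipient = $data[0] ?? '';
    if ($recipient === '' || !preg_match($expression, $recipient)) continue;
    $select->execute();
    $select->store_result();
    $exists = $select->num_rows > 0;
    $select->free_result();
    if (!$exists) {
        $insert->execute();
    }
}
fclose($handle);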

Credits

https://stackoverflow.com/a/2162528/314056
https://stackoverflow.com/a/45966/314056

Put the whole loop inside an SQL transaction. That will speed things up by an order of magnitude.
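
A minimal sketch of that idea, assuming a PDO connection and a transactional storage engine such as InnoDB (on MyISAM the transaction calls have no effect). The function name `insert_in_transaction` is hypothetical, and the unqualified table name assumes the connection already selected the `postfix` database.

```php
<?php
// Sketch: commit once per batch instead of once per row.
// $pdo is assumed to be connected to the database that holds recipient_access.
function insert_in_transaction(PDO $pdo, array $emails, string $note): int
{
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $stmt = $pdo->prepare('INSERT INTO recipient_access (recipient, note) VALUES (?, ?)');
    $inserted = 0;

    $pdo->beginTransaction();
    try {
        foreach ($emails as $email) {
            $stmt->execute([$email, $note]);
            $inserted += $stmt->rowCount();
        }
        $pdo->commit();   // a single commit covers the whole batch
    } catch (Throwable $e) {
        $pdo->rollBack(); // leave the table untouched if any row fails
        throw $e;
    }
    return $inserted;
}
```

For millions of rows it is probably wiser to commit every few thousand rows than to hold one enormous transaction open.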

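General advice: the key to speeding up any program is knowing which part takes most of the time.

Then figure out how to reduce it. Sometimes the actual results will surprise you.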
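
One rough way to find out where the time goes is to wrap each stage of the loop in a timer and add up the totals. The helper `timed` below is a hypothetical name, and `hrtime()` needs PHP 7.3 or later.

```php
<?php
// Hypothetical helper: run $work and return [its result, elapsed seconds].
function timed(callable $work): array
{
    $start = hrtime(true);   // monotonic clock, in nanoseconds
    $result = $work();
    return [$result, (hrtime(true) - $start) / 1e9];
}

// Example: measure how long the validation regex takes over a sample of rows.
$expression = "/^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$/";
$sample = array_fill(0, 10000, 'someone@example.com');

[$valid, $seconds] = timed(function () use ($expression, $sample) {
    $count = 0;
    foreach ($sample as $email) {
        $count += preg_match($expression, $email);
    }
    return $count;
});

printf("validated %d addresses in %.4f s\n", $valid, $seconds);
```

Timing the SELECT and INSERT calls the same way shows whether the regex or the database round trips dominate.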

By the way, I don't think multithreading will solve your problem.

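The most pressing thing is to make sure your database is properly indexed, so that the lookup query you run for each row is as fast as possible.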
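
As an illustration of that indexing advice, here is a sketch under assumptions: `$pdo` is a PDO connection to the `postfix` database, `recipient` is a VARCHAR column, and the index name `idx_recipient` and both function names are made up for this example.

```php
<?php
// Add a UNIQUE index so lookups on `recipient` no longer scan the whole table.
function add_unique_recipient_index(PDO $pdo): void
{
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    // Throws if the table already holds duplicate recipients or the index exists.
    $pdo->exec('CREATE UNIQUE INDEX idx_recipient ON recipient_access (recipient)');
}

// With the index in place, the existence check becomes a single indexed lookup.
function recipient_exists(PDO $pdo, string $email): bool
{
    $stmt = $pdo->prepare('SELECT 1 FROM recipient_access WHERE recipient = ? LIMIT 1');
    $stmt->execute([$email]);
    return $stmt->fetchColumn() !== false;
}
```

A unique index also lets the database reject duplicates on its own, which opens the door to dropping the SELECT entirely.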

Beyond that, there isn't much you can do. For a multithreaded solution, you will have to go outside PHP.

You could also import the CSV file directly into MySQL and then use a PHP script to weed out the excess data. This is probably the fastest way.
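
A direct import could look roughly like this, using MySQL's `LOAD DATA` statement. The function names are hypothetical, the field and line terminators are guesses based on the `fgetcsv` call in the question, and the `IGNORE` keyword only skips duplicates if `recipient` carries a UNIQUE index.

```php
<?php
// Build the LOAD DATA statement. Both arguments must already be quoted SQL
// string literals, for example the output of PDO::quote().
function load_data_sql(string $quotedPath, string $quotedNote): string
{
    return "LOAD DATA LOCAL INFILE $quotedPath\n"
         . "IGNORE INTO TABLE `postfix`.`recipient_access`\n"
         . "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\\''\n"
         . "LINES TERMINATED BY '\\n'\n"
         . "(`recipient`)\n"
         . "SET `note` = $quotedNote";
}

// Run the import on an existing MySQL connection and return the affected row count.
function import_csv(PDO $pdo, string $csvPath, string $note): int
{
    $sql = load_data_sql($pdo->quote($csvPath), $pdo->quote($note));
    return (int)$pdo->exec($sql);
}
```

For this to work the connection has to be opened with the `PDO::MYSQL_ATTR_LOCAL_INFILE` option enabled, and the server must allow `local_infile`. Invalid addresses are loaded as well, so the cleanup pass mentioned above still has to run afterwards.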