Java8 - idiomatic way to process a Stream<Callable<SomeClass>> in parallel delivering to a non-thread-safe consumer?

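Suppose I have a Stream<Callable<SomeClass>> stream;. The stream accesses over a million objects that will not fit in memory.

What is the idiomatic way to convert this to a Stream<SomeClass> such that the Callable::call invocations run in parallel before the results are delivered to a non-thread-safe consumer (perhaps by calling .sequential().forEach() or some other bottlenecking mechanism)?

That is, process the stream in parallel but deliver the output sequentially (random order is fine, as long as delivery is single-threaded).

I know I could do what I want by setting up an ExecutorService and a Queue between the original stream and the consumer, but that seems like a lot of code. Is there a magic one-liner?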

You could still use an ExecutorService for the parallelization. Like this:

ExecutorService service = Executors.newFixedThreadPool(4);
Stream<SomeClass> results = stream
    .map(c -> service.submit(c))
    .map(future -> {
        try {
            return future.get(); // retrieve the callable's result (blocks until it is ready)
        } catch (InterruptedException | ExecutionException ex) {
            // exception handling
            throw new RuntimeException(ex);
        }
    });

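You can then process the resulting Stream<SomeClass> sequentially.

If you use forEach / forEachOrdered directly on the Stream<Future<SomeClass>>, you can process each resulting SomeClass object as soon as its future has completed (unlike invokeAll(), which blocks until every task has finished).

If you want to process the results of the callables in the exact order in which they become available, you will have to use a CompletionService. It cannot be used within a single chain of stream operations, because Future<SomeClass> f = completionService.take() has to be called after the Callables have been submitted.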
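As a minimal sketch of the CompletionService approach mentioned above: the callables are submitted first, and only afterwards are the results taken, in completion order, on the calling thread. The class name CompletionOrderDemo and the method runAll are hypothetical, and String stands in for SomeClass. Note that this simple form keeps one Future per element alive, so it is an illustration rather than a fit for the million-element case.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import java.util.stream.Stream;

final class CompletionOrderDemo {

    /** Runs all callables on a pool; each result reaches the consumer on the calling thread only. */
    static <T> void runAll(Stream<? extends Callable<T>> tasks, int threads, Consumer<? super T> consumer) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            CompletionService<T> completion = new ExecutorCompletionService<>(pool);
            // Phase 1: submit everything and remember how many tasks are outstanding.
            long submitted = 0;
            for (Iterator<? extends Callable<T>> it = tasks.iterator(); it.hasNext();) {
                completion.submit(it.next());
                submitted++;
            }
            // Phase 2: take() blocks until the next task (whichever finishes first) is done.
            for (long i = 0; i < submitted; i++) {
                consumer.accept(completion.take().get());
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>(); // not thread-safe; only the main thread touches it
        runAll(Stream.<Callable<String>>of(() -> "a", () -> "b", () -> "c"), 2, seen::add);
        System.out.println(seen); // order depends on which task finishes first
    }
}
```

The submit loop and the take loop are two separate phases, which is why this does not fit into one chain of stream operations.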

EDIT:

Using an ExecutorService within a stream does not work the way I showed above, because each Callable is submitted and then awaited via future.get() one after the other.
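The reason is that a sequential stream pushes one element through the whole pipeline before it pulls the next one. The following sketch (the class LazyPipelineDemo and the method trace are hypothetical names) records the order of the submit and get calls to make this visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

final class LazyPipelineDemo {

    /** Returns the order in which tasks are submitted and awaited by map(submit).map(get). */
    static List<String> trace(int taskCount) {
        List<String> events = new ArrayList<>(); // only touched by the calling thread
        ExecutorService service = Executors.newFixedThreadPool(4);
        try {
            IntStream.range(0, taskCount)
                .<Callable<Integer>>mapToObj(i -> () -> i)
                .map(c -> {
                    events.add("submit");
                    return service.submit(c);
                })
                .map(future -> {
                    events.add("get");
                    try {
                        return future.get(); // blocks before the next element is even submitted
                    } catch (InterruptedException | ExecutionException e) {
                        throw new IllegalStateException(e);
                    }
                })
                .forEach(result -> { });
        } finally {
            service.shutdown();
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(trace(3)); // [submit, get, submit, get, submit, get]
    }
}
```

Every submit is immediately followed by its own get, so at most one task is ever in flight even though the pool has four threads.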

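I found a possible solution, heavier on side effects, that divides the Callables into fixed-size chunks which are run in parallel.

I use a class TaskMapper as the mapping function; it submits the Callables and maps them to chunks: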

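// Stateful mapping function: submits every Callable and hands out a full chunk
// of futures on every chunkSize-th call (null otherwise).
class TaskMapper implements Function<Callable<Integer>, List<Future<Integer>>> {
    private final ExecutorService service;
    private final int chunkSize;
    private List<Future<Integer>> chunk = new ArrayList<>();

    TaskMapper(ExecutorService service, int chunkSize) {
        this.service = service;
        this.chunkSize = chunkSize;
    }

    @Override
    public List<Future<Integer>> apply(Callable<Integer> c) {
        chunk.add(service.submit(c)); // execution starts right away
        if (chunk.size() == chunkSize) {
            List<Future<Integer>> fList = chunk;
            chunk = new ArrayList<>();
            return fList; // chunk is full: pass it downstream
        } else {
            return null; // chunk not full yet: filtered out downstream
        }
    }

    // futures of the last, incomplete chunk
    List<Future<Integer>> getChunk() {
        return chunk;
    }
}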

This is what the chain of stream operations looks like:

ExecutorService service = Executors.newFixedThreadPool(4);
TaskMapper taskMapper = new TaskMapper(service, 4);
stream.map(taskMapper)
    .filter(fl -> fl != null) // filter for the chunks
    .flatMap(fl -> fl.stream()) // flat-map the chunks to futures
    .map(future -> {
        try {
            return future.get();
        } catch (InterruptedException | ExecutionException ex) {
            throw new RuntimeException(ex);
        }
    })
    .forEach(i -> {
        // process i (without a terminal operation nothing would be executed)
    });
// process the remaining futures
for (Future<Integer> f : taskMapper.getChunk()) {
    try {
        Integer i = f.get();
        // process i
    } catch (InterruptedException | ExecutionException ex) {
        // exception handling
    }
}

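It works as follows: the TaskMapper submits 4 callables at a time to the service and maps them to a chunk of futures (without a Spliterator). This is achieved by mapping the 1st, 2nd and 3rd callable of each chunk to null. The null could be replaced by a dummy object, for example. The mapping function that maps the futures to results waits for the result of every future in the chunk. I use Integer in my example instead of SomeClass. Once the results of all futures in the current chunk have been mapped, a new chunk is created and run in parallel. Finally, if the number of elements in the stream is not divisible by chunkSize (4 in my example), the remaining futures have to be retrieved from the TaskMapper and processed outside the stream.

This construct works for the tests I carried out, but I am aware that it may be fragile because of its side effects, its statefulness and the undefined evaluation behavior of the stream.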
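Because the mapper relies on side effects, here is an alternative sketch of the same chunking idea that keeps all state outside the stream pipeline: it pulls the callables through the stream's Iterator and runs each chunk with invokeAll(). The class ChunkedRunner and its parameters are hypothetical names, not part of the construct above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.stream.IntStream;
import java.util.stream.Stream;

final class ChunkedRunner {

    /** Runs the callables chunk by chunk; results reach the consumer on the calling thread. */
    static <T> void run(Stream<? extends Callable<T>> tasks, ExecutorService service,
                        int chunkSize, Consumer<? super T> consumer) {
        Iterator<? extends Callable<T>> it = tasks.iterator();
        List<Callable<T>> chunk = new ArrayList<>(chunkSize);
        while (it.hasNext()) {
            chunk.add(it.next());
            // Run the chunk when it is full, or when it is the last (possibly partial) one.
            if (chunk.size() == chunkSize || !it.hasNext()) {
                try {
                    // invokeAll() blocks until every task of this chunk has completed.
                    for (Future<T> future : service.invokeAll(chunk)) {
                        consumer.accept(future.get());
                    }
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException(e);
                }
                chunk.clear();
            }
        }
    }

    public static void main(String[] args) {
        ExecutorService service = Executors.newFixedThreadPool(4);
        try {
            Stream<Callable<Integer>> tasks =
                IntStream.range(0, 10).<Callable<Integer>>mapToObj(i -> () -> i * i);
            run(tasks, service, 4, System.out::println); // prints the squares 0..81 in order
        } finally {
            service.shutdown();
        }
    }
}
```

Only chunkSize callables are held at a time, and the partial last chunk needs no special handling outside the loop. The trade-off is the same as in the construct above: a slow task delays the whole chunk.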

EDIT2:

I made a version of the construct from the previous EDIT that uses a custom Spliterator:

import java.util.LinkedList;
import java.util.Queue;
import java.util.Spliterator;
import java.util.Spliterators.AbstractSpliterator;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class ExecutorServiceSpliterator<T> extends AbstractSpliterator<Future<T>> {
    private final Spliterator<? extends Callable<T>> srcSpliterator;
    private final ExecutorService service;
    private final int chunkSize;
    private final Queue<Future<T>> futures = new LinkedList<>();

    private ExecutorServiceSpliterator(Spliterator<? extends Callable<T>> srcSpliterator) {
        this(srcSpliterator, Executors.newFixedThreadPool(8), 30); // default
    }

    private ExecutorServiceSpliterator(Spliterator<? extends Callable<T>> srcSpliterator, ExecutorService service, int chunkSize) {
        // the size is reported as unknown, so SIZED (and with it SUBSIZED) is dropped
        super(Long.MAX_VALUE, srcSpliterator.characteristics() & ~SIZED & ~SUBSIZED & ~CONCURRENT);
        this.srcSpliterator = srcSpliterator;
        this.service = service;
        this.chunkSize = chunkSize;
    }

    public static <T> Stream<T> pipeParallelized(Stream<? extends Callable<T>> srcStream) {
        return getStream(new ExecutorServiceSpliterator<>(srcStream.spliterator()));
    }

    public static <T> Stream<T> pipeParallelized(Stream<? extends Callable<T>> srcStream, ExecutorService service, int chunkSize) {
        return getStream(new ExecutorServiceSpliterator<>(srcStream.spliterator(), service, chunkSize));
    }

    private static <T> Stream<T> getStream(ExecutorServiceSpliterator<T> serviceSpliterator) {
        return StreamSupport.stream(serviceSpliterator, false)
            .map(future -> {
                try {
                    return future.get();
                } catch (InterruptedException | ExecutionException ex) {
                    throw new RuntimeException(ex);
                }
            });
    }

    @Override
    public boolean tryAdvance(Consumer<? super Future<T>> action) {
        // submit callables until the source is exhausted or chunkSize futures are pending
        boolean didAdvance = true;
        while ((didAdvance = srcSpliterator.tryAdvance(c -> futures.add(service.submit(c))))
                && futures.size() < chunkSize) {
            // the loop condition does all the work
        }
        if (!didAdvance) {
            service.shutdown(); // no new tasks; already submitted ones still complete
        }

        if (!futures.isEmpty()) {
            Future<T> future = futures.remove();
            action.accept(future);
            return true;
        }
        return false;
    }
}

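This class provides functions (pipeParallelized()) that take a stream of Callable elements, execute them in parallel, and then output a sequential stream containing the results. Spliterators are allowed to be stateful, so this version should hopefully not violate any stream operation constraints. This is how the Spliterator can be used (close to a "magic one-liner"):

ExecutorServiceSpliterator.pipeParallelized(stream);

This line takes the stream of Callables, parallelizes their execution and returns a sequential stream containing the results (the piping happens lazily, so it should work with millions of callables), which can be processed further with regular stream operations.

The implementation of ExecutorServiceSpliterator is very basic. It is mainly meant to demonstrate how this could be done in principle. The resupplying of the service and the retrieval of the results could be optimized. For example, if the resulting stream is allowed to be unordered, a CompletionService could be used.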
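As a sketch of that optimization, assuming an unordered result is acceptable: an ExecutorCompletionService with a fixed number of tasks in flight, where a new callable is only submitted once a finished result has been handed to the consumer. The names WindowedRunner, run and maxInFlight are hypothetical and not part of the class above.

```java
import java.util.Iterator;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import java.util.stream.IntStream;
import java.util.stream.Stream;

final class WindowedRunner {

    /** Keeps at most maxInFlight tasks submitted; results arrive in completion order. */
    static <T> void run(Stream<? extends Callable<T>> tasks, ExecutorService pool,
                        int maxInFlight, Consumer<? super T> consumer) {
        if (maxInFlight < 1) {
            throw new IllegalArgumentException("maxInFlight must be at least 1");
        }
        CompletionService<T> completion = new ExecutorCompletionService<>(pool);
        Iterator<? extends Callable<T>> it = tasks.iterator();
        int inFlight = 0;
        try {
            while (it.hasNext() || inFlight > 0) {
                // Top up the window with new tasks from the (lazy) source.
                while (inFlight < maxInFlight && it.hasNext()) {
                    completion.submit(it.next());
                    inFlight++;
                }
                // Block for whichever task finishes next and deliver it on this thread.
                T result = completion.take().get();
                inFlight--;
                consumer.accept(result);
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            Stream<Callable<Integer>> tasks =
                IntStream.range(0, 20).<Callable<Integer>>mapToObj(i -> () -> i * i);
            run(tasks, pool, 8, System.out::println); // order depends on completion
        } finally {
            pool.shutdown();
        }
    }
}
```

Unlike fixed chunks, a single slow task does not hold back the others here: as soon as any task completes, its slot is refilled.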


None of the other answers worked for me.

I finally settled on something like this (pseudocode):

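ExecutorService executor = Executors.newWorkStealingPool();
CompletionService<SomeClass> completor = new ExecutorCompletionService<>(executor);
// Submit every callable and count the submissions. sum() always traverses the
// stream, whereas count() may skip the map step on Java 9+ when the size is known.
long count = stream.mapToLong(c -> { completor.submit(c); return 1L; }).sum();
while (count-- > 0) {
    // take() blocks until the next task has finished; get() unwraps its result
    SomeClass obj = completor.take().get();
    consume(obj);
}
executor.shutdown();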

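The consume(obj) loop runs sequentially in a single thread, while the individual callable tasks work their way asynchronously through the CompletionService's multiple threads. Memory consumption is bounded because the CompletionService only has as many items in progress at a time as there are threads available. The Callables waiting for execution are eagerly materialized from the stream, but the impact of that is negligible compared to the memory each one consumes once it starts executing (your use case may vary).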

public static void main(String[] args) {
    testInfititeCallableStream();
}

private static void testInfititeCallableStream() {
    ExecutorService service = Executors.newFixedThreadPool(100);
    Consumer<Future<String>> consumeResult = (Future<String> future) -> {
        try {
            System.out.println(future.get());
        } catch (InterruptedException | ExecutionException e) {
            e.printStackTrace();
        }
    };
    getCallableStream().parallel().map(callable -> service.submit(callable)).forEach(consumeResult);
}

private static Stream<Callable<String>> getCallableStream() {
    Random randomWait = new Random();
    return Stream.<Callable<String>>generate(() ->
        new Callable<String>() {
            public String call() throws Exception {
                // wait for testing
                long time = System.currentTimeMillis();
                TimeUnit.MILLISECONDS.sleep(randomWait.nextInt(5000));
                return time + ":" + UUID.randomUUID().toString();
            }
        }).limit(Integer.MAX_VALUE);
}

First example:

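ExecutorService executor = Executors.newWorkStealingPool();

List<Callable<String>> callables = Arrays.asList(
    () -> "job1",
    () -> "job2",
    () -> "job3");

// invokeAll() blocks until all tasks have completed and may throw InterruptedException
executor.invokeAll(callables).stream().map(future -> {
    try {
        return future.get();
    } catch (InterruptedException | ExecutionException e) {
        // get() throws checked exceptions, which a lambda passed to map() cannot propagate
        throw new IllegalStateException(e);
    }
}).forEach(System.out::println);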

Second example:

Stream.of("1", "2", "3", "4", "", "5")
    .filter(s -> s.length() > 0)
    .parallel()
    .forEachOrdered(System.out::println);


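You are asking for an idiomatic solution. Streams with side effects in their behavioral parameters are discouraged (this is stated explicitly in the Stream javadoc).

So the idiomatic solution is basically an ExecutorService plus Futures and some loops / forEach(). If you have a Stream as a parameter, just convert it to a List with a standard collector.

Something like this: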

ExecutorService service = Executors.newFixedThreadPool(5);
service.invokeAll(callables).forEach( doSomething );
// or just
return service.invokeAll(callables);
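A slightly fuller sketch of this approach, assuming the callables and their results fit in memory once collected (which may not hold for the million-element stream in the question). InvokeAllDemo and process are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class InvokeAllDemo {

    /** Collects the stream, runs everything in parallel, then consumes on the calling thread. */
    static <T> void process(Stream<Callable<T>> stream, int threads, Consumer<? super T> consumer) {
        // Materialize the stream with a standard collector.
        List<Callable<T>> callables = stream.collect(Collectors.toList());
        ExecutorService service = Executors.newFixedThreadPool(threads);
        try {
            // invokeAll() returns only after every task has completed,
            // with the futures in the same order as the input list.
            for (Future<T> future : service.invokeAll(callables)) {
                consumer.accept(future.get());
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            service.shutdown();
        }
    }

    public static void main(String[] args) {
        Stream<Callable<String>> jobs =
            Stream.<Callable<String>>of(() -> "job1", () -> "job2", () -> "job3");
        process(jobs, 5, System.out::println); // prints job1, job2, job3 in that order
    }
}
```

The consumer never runs concurrently, because the plain for loop executes on the thread that called process().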
