TPL Dataflow

The document discusses TPL Dataflow, a framework designed for high-performance stream processing and workflows, emphasizing its ability to handle large amounts of data with low latency through asynchronous and parallel programming. It highlights the architecture, key components like BufferBlock and ActionBlock, and provides examples of how to implement various data processing patterns. Additionally, it touches on the integration of TPL Dataflow with Reactive Extensions for enhanced asynchronous programming capabilities.

Uploaded by

Tapan Pratap
High Performance Stream Processing and Workflows with TPL Dataflow

RICCARDO TERRELL
TPL Dataflow … designed to compose
TPL Dataflow workflow for high-throughput, low-latency scenarios
[Diagram: Buffer (input data) → Transform → three parallel Process Tasks → Aggregate → Transform → Output data]
Implementing tailored asynchronous parallel workflows and batch queuing


It's about maximizing resource use
Divide and Conquer
To get the best performance, your application has to
partition and divide processing to take full advantage of
multicore processors – enabling it to do multiple things at
the same time, i.e. concurrently.
When to use TPL Dataflow
Large amounts of data to process and/or generate (stream data)

Any workflow that can take advantage of parallel computation

Applications that can be broken into sections that benefit from batch processing


Why TPL Dataflow?
Asynchronous programming evolution (for both I/O bound and CPU bound operations)

Parallel and concurrent programming without synchronization (no shared state)

Reduce complexity of code structure (declarative)

Seamlessly process large data sets in parallel with low latency


How to start?
TPL Dataflow

[Link]
TPL Dataflow promotes the message-passing programming model
TPL Dataflow Architecture Building Blocks
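public interface IDataflowBlock
{
    void Complete();
    void Fault(Exception exception);
    Task Completion { get; }
}

public interface ISourceBlock<out TOutput> : IDataflowBlock
{
    IDisposable LinkTo(ITargetBlock<TOutput> target, DataflowLinkOptions linkOptions);
    // ConsumeMessage, ReserveMessage and ReleaseReservation omitted
}

// TryReceive and TryReceiveAll are declared on IReceivableSourceBlock<TOutput>
public interface IReceivableSourceBlock<TOutput> : ISourceBlock<TOutput>
{
    bool TryReceive(Predicate<TOutput> filter, out TOutput item);
    bool TryReceiveAll(out IList<TOutput> items);
}

public interface ITargetBlock<in TInput> : IDataflowBlock
{
    DataflowMessageStatus OfferMessage(
        DataflowMessageHeader messageHeader, TInput messageValue,
        ISourceBlock<TInput> source, bool consumeToAccept);
}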
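As a rough illustration of why these interfaces matter, the sketch below writes two helpers against `ITargetBlock<int>` and `IReceivableSourceBlock<string>` only, then passes a `TransformBlock<int, string>` to both. The names `InterfaceDemo`, `Feed` and `DrainAsync` are made up for this example, and it assumes the System.Threading.Tasks.Dataflow package is referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class InterfaceDemo
{
    // Works with any target block: Post is an extension method on
    // ITargetBlock<T>, Complete comes from IDataflowBlock.
    static void Feed(ITargetBlock<int> target, int count)
    {
        for (int i = 0; i < count; i++)
            target.Post(i);
        target.Complete();
    }

    // Works with any receivable source block.
    static async Task<List<string>> DrainAsync(IReceivableSourceBlock<string> source)
    {
        var items = new List<string>();
        // OutputAvailableAsync returns false once the block is completed and empty.
        while (await source.OutputAvailableAsync())
            while (source.TryReceive(null, out string item))
                items.Add(item);
        return items;
    }

    public static Task<List<string>> RunAsync()
    {
        // A TransformBlock is both a target and a receivable source.
        var square = new TransformBlock<int, string>(n => (n * n).ToString());
        Feed(square, 4);
        return DrainAsync(square);
    }

    public static async Task Main()
    {
        Console.WriteLine(string.Join(",", await RunAsync())); // 0,1,4,9
    }
}
```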
BufferBlock<T>
[Diagram: N producers feed a buffer that is read by N consumers]
Producer-Consumer
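BlockingCollection<int> bCollection = new BlockingCollection<int>(10);

Task producerThread = Task.Run(() =>
{
    for (int i = 0; i < 10; ++i)
        bCollection.Add(i);

    bCollection.CompleteAdding();
});

Task consumerThread = Task.Run(() =>
{
    // GetConsumingEnumerable blocks until an item arrives and ends after
    // CompleteAdding. Checking IsCompleted before Take() can race with the
    // producer and make Take() throw InvalidOperationException.
    foreach (int item in bCollection.GetConsumingEnumerable())
    {
        Console.WriteLine(item);
    }
});

Task.WaitAll(producerThread, consumerThread);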
BufferBlock<T> Producer-Consumer
[Diagram: Producers A, B and C → input data → BufferBlock<T> (internal buffer) → output data → Consumer]
static BufferBlock<int> buffer = new BufferBlock<int>();

Task Producer(IEnumerable<int> values) {
    foreach (var value in values)
        buffer.Post(value);
    // Without Complete() the consumer would wait forever for more output
    buffer.Complete();
    return Task.CompletedTask;
}

async Task Consumer(Action<int> process) {
    while (await buffer.OutputAvailableAsync())
        process(await buffer.ReceiveAsync());
}

async Task Run() {
    await Task.WhenAll(Producer(Enumerable.Range(0, 100)),
        Consumer(n => Console.WriteLine($"value {n}")));
}
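BufferBlock<T> Producer-Consumer (bounded capacity)
[Diagram: Producers A, B and C → input data → BufferBlock<T> (internal buffer) → output data → Consumer]
// BufferBlock takes DataflowBlockOptions; BoundedCapacity limits the buffer to 10 items
static BufferBlock<int> buffer = new BufferBlock<int>(
    new DataflowBlockOptions { BoundedCapacity = 10 });

async Task Producer(IEnumerable<int> values) {
    foreach (var value in values)
        await buffer.SendAsync(value);  // waits asynchronously while the buffer is full
    // Without Complete() the consumer would wait forever for more output
    buffer.Complete();
}

async Task Consumer(Action<int> process) {
    while (await buffer.OutputAvailableAsync())
        process(await buffer.ReceiveAsync());
}

async Task Run() {
    await Task.WhenAll(Producer(Enumerable.Range(0, 100)),
        Consumer(n => Console.WriteLine($"value {n}")));
}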
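The difference between `Post` and `SendAsync` on a bounded block can be seen in isolation. The following is a minimal sketch, assuming a capacity of 2 and the made-up class name `BoundedBufferDemo`: `Post` is declined immediately when the buffer is full, while `SendAsync` returns a task that completes once a slot is free.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BoundedBufferDemo
{
    public static async Task<(bool[] Posts, bool SendWasPending, bool SendAccepted)> RunAsync()
    {
        var buffer = new BufferBlock<int>(
            new DataflowBlockOptions { BoundedCapacity = 2 });

        // Post never waits: the third call is declined because the buffer is full.
        bool[] posts = { buffer.Post(1), buffer.Post(2), buffer.Post(3) };

        // SendAsync returns a task that stays pending until there is room.
        Task<bool> send = buffer.SendAsync(4);
        bool sendWasPending = !send.IsCompleted;

        await buffer.ReceiveAsync();     // consuming one item frees a slot
        bool sendAccepted = await send;  // the postponed item is now accepted

        return (posts, sendWasPending, sendAccepted);
    }

    public static async Task Main()
    {
        var r = await RunAsync();
        Console.WriteLine(string.Join(",", r.Posts));               // True,True,False
        Console.WriteLine($"{r.SendWasPending} {r.SendAccepted}");  // True True
    }
}
```

This is why the bounded producer awaits `SendAsync`: with `Post`, items offered while the buffer is full would be dropped unless the return value is checked.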
ActionBlock<T>
Action<TInput> delegate
ActionBlock<T>
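var actionBlock = new ActionBlock<int>(n =>
{
    Thread.Sleep(1000);
    Console.WriteLine(n);
});

for (int i = 0; i < 10; i++)
{
    actionBlock.Post(i);
}

// Post returns immediately, so this prints before the items are processed
Console.WriteLine("Finished!");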
ActionBlock<T> Sync and Async
// Downloading Images Sequentially and Synchronously
var downloader = new ActionBlock<string>(url =>
{
    // Download returns byte[]
    byte[] imageData = Download(url);
    Process(imageData);
});

downloader.Post("[Link]");
downloader.Post("[Link]");

// Downloading Images Sequentially and Asynchronously
var downloader = new ActionBlock<string>(async url =>
{
    byte[] imageData = await DownloadAsync(url);
    Process(imageData);
});

downloader.Post("[Link]");
downloader.Post("[Link]");
MaxDegreeOfParallelism

Action<TInput>

MaxDegreeOfParallelism = 2
TPL Dataflow Parallelism
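var actionBlock = new ActionBlock<int>((i) => {
    Console.WriteLine($"{Thread.CurrentThread.ManagedThreadId}\t{i}");
});

for (var i = 0; i < 10; i++) actionBlock.Post(i);

// Same action, now with up to four messages processed concurrently
actionBlock = new ActionBlock<int>((i) => {
    Console.WriteLine($"{Thread.CurrentThread.ManagedThreadId}\t{i}");
}, new ExecutionDataflowBlockOptions() {
    MaxDegreeOfParallelism = 4 }
);

for (var i = 0; i < 10; i++) actionBlock.Post(i);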
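Thread ids alone do not show how many messages run at once, so here is a small sketch that measures it. The `PeakConcurrency` helper and the 20 ms simulated work are assumptions made for illustration; the helper records the highest number of delegates that were executing at the same time.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks.Dataflow;

public static class ParallelismDemo
{
    // Returns the highest number of messages observed in flight at once.
    public static int PeakConcurrency(int maxDegreeOfParallelism, int messages)
    {
        int running = 0, peak = 0;
        var block = new ActionBlock<int>(_ =>
        {
            int now = Interlocked.Increment(ref running);
            int seen;
            // Lock-free "peak = max(peak, now)"
            while (now > (seen = Volatile.Read(ref peak)))
                Interlocked.CompareExchange(ref peak, now, seen);
            Thread.Sleep(20); // simulated work
            Interlocked.Decrement(ref running);
        }, new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = maxDegreeOfParallelism
        });

        for (int i = 0; i < messages; i++) block.Post(i);
        block.Complete();
        block.Completion.Wait();
        return peak;
    }

    public static void Main()
    {
        Console.WriteLine(PeakConcurrency(1, 8)); // 1: the default is sequential
        Console.WriteLine(PeakConcurrency(4, 8)); // never more than 4
    }
}
```

With the default setting the block behaves like a single-threaded mailbox; raising the limit only sets an upper bound, and the observed peak depends on the thread pool.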


TransformBlock<T, R>
Func<TInput,TOutput>
TransformBlock<T, R>
var tfBlock = new TransformBlock<int, string>(n =>
{
    Thread.Sleep(500);
    return (n * n).ToString();
}, new ExecutionDataflowBlockOptions() { MaxDegreeOfParallelism = 1 }); // Change DOP

for (int i = 0; i < 10; i++)
{
    tfBlock.Post(i);
    Console.WriteLine($"Message {i} processed - queue count {tfBlock.InputCount}");
}
TransformBlock + ActionBlock
Func<,> Action<>

Action<>
TransformBlock<T, R>
var actionBlock = new ActionBlock<int>(n =>
{
    Console.WriteLine($"Message : {n} - Thread Id#{Thread.CurrentThread.ManagedThreadId}");
});

var tfBlock = new TransformBlock<int, int>(n =>
{
    return n * n;
}, new ExecutionDataflowBlockOptions() { MaxDegreeOfParallelism = 1 }); // Change DOP

// Output of the transform flows into the action block
tfBlock.LinkTo(actionBlock);

for (int i = 0; i < 10; i++)
{
    tfBlock.Post(i);
    Console.WriteLine($"Message {i} processed - queue count {tfBlock.InputCount}");
}
Link Filtering
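var actionBlock1 = new ActionBlock<int>(n =>
{
    Console.WriteLine($"Message : {n} - Thread Id#{Thread.CurrentThread.ManagedThreadId}");
});

var actionBlock2 = new ActionBlock<int>(n =>
{
    Console.WriteLine($"Message : {n} - Thread Id#{Thread.CurrentThread.ManagedThreadId}");
});

var bcBlock = new TransformBlock<int, int>(n => n);

// Messages are offered to targets in link order. actionBlock1 has no filter
// and accepts everything, so actionBlock2 only sees a message that
// actionBlock1 declines.
bcBlock.LinkTo(actionBlock1);
bcBlock.LinkTo(actionBlock2, n => n % 2 == 0);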
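A common way to route with link filters is to put the filtered link first and an unfiltered catch-all link last. The sketch below assumes that layout; `LinkFilterDemo` and the even/odd split are invented for the example. A message that matches no link stays at the head of the source and stalls it, which is why a final unfiltered link (or `DataflowBlock.NullTarget<T>()` to discard) is usually added.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class LinkFilterDemo
{
    public static async Task<(List<int> Evens, List<int> Odds)> RunAsync()
    {
        var evens = new List<int>();
        var odds = new List<int>();

        var source = new BufferBlock<int>();
        var evenBlock = new ActionBlock<int>(n => evens.Add(n));
        var oddBlock = new ActionBlock<int>(n => odds.Add(n));
        var propagate = new DataflowLinkOptions { PropagateCompletion = true };

        // Filtered link first, catch-all link second.
        source.LinkTo(evenBlock, propagate, n => n % 2 == 0);
        source.LinkTo(oddBlock, propagate);

        for (int i = 0; i < 10; i++) source.Post(i);
        source.Complete();

        await Task.WhenAll(evenBlock.Completion, oddBlock.Completion);
        return (evens, odds);
    }

    public static async Task Main()
    {
        var (evens, odds) = await RunAsync();
        Console.WriteLine(string.Join(",", evens)); // 0,2,4,6,8
        Console.WriteLine(string.Join(",", odds));  // 1,3,5,7,9
    }
}
```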
TransformManyBlock<T, R>
Func<TInput, IEnumerable<TOutput>>
TransformManyBlock<T, R>
// Asynchronous Web Crawler
var downloader = new TransformManyBlock<string, string>(async url =>
{
    Console.WriteLine("Downloading " + url);
    // Each input URL can produce zero or more output links
    return ParseLinks(await DownloadContents(url));
});

var actionBlock = new ActionBlock<string>(n =>
{
    Console.WriteLine($"Message : {n} - Thread Id#{Thread.CurrentThread.ManagedThreadId}");
});

downloader.LinkTo(actionBlock);
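Since `ParseLinks` and `DownloadContents` are not shown on the slide, here is a self-contained sketch of the same one-to-many shape with a simpler, made-up workload: each input line is split into words, and every word becomes a separate output message.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class TransformManyDemo
{
    public static async Task<List<string>> RunAsync()
    {
        var words = new List<string>();

        // One input line produces several output messages.
        var splitter = new TransformManyBlock<string, string>(
            line => (IEnumerable<string>)line.Split(' '));
        var collector = new ActionBlock<string>(w => words.Add(w));

        splitter.LinkTo(collector,
            new DataflowLinkOptions { PropagateCompletion = true });

        splitter.Post("high performance");
        splitter.Post("stream processing");
        splitter.Complete();

        await collector.Completion;
        return words;
    }

    public static async Task Main()
    {
        // high,performance,stream,processing
        Console.WriteLine(string.Join(",", await RunAsync()));
    }
}
```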
BatchBlock<T>
BatchBlock<T>
// Batching Requests into groups of 100 to Submit to a Database

var batchRequests = new BatchBlock<Request>(batchSize: 100);

var sendToDb = new ActionBlock<Request[]>(reqs => SubmitToDatabase(reqs));

batchRequests.LinkTo(sendToDb);

for (int i = 0; i < 100; i++)
    batchRequests.Post(new Request(i));
BatchBlock<T>
[Diagram: many requests converge on a bottleneck in front of the database]
Multiple application requests for the database
Contention due to a limited number of available connections, causing waits and queuing of the requests
With batching, the number of concurrent requests is reduced
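One detail worth seeing in code is what happens to a final, incomplete batch. The sketch below uses integers instead of the slide's `Request` type (an assumption made to keep it self-contained) and reports the size of each batch: completing the `BatchBlock` flushes whatever is left as a smaller batch.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BatchDemo
{
    // Posts `items` integers and returns the size of every emitted batch.
    public static async Task<List<int>> BatchSizesAsync(int items, int batchSize)
    {
        var sizes = new List<int>();
        var batcher = new BatchBlock<int>(batchSize);
        var sink = new ActionBlock<int[]>(batch => sizes.Add(batch.Length));

        batcher.LinkTo(sink,
            new DataflowLinkOptions { PropagateCompletion = true });

        for (int i = 0; i < items; i++) batcher.Post(i);

        // Complete() emits the remaining items as a last, smaller batch.
        batcher.Complete();
        await sink.Completion;
        return sizes;
    }

    public static async Task Main()
    {
        Console.WriteLine(string.Join(",", await BatchSizesAsync(7, 3))); // 3,3,1
    }
}
```

For a long-running pipeline that never completes, `TriggerBatch()` can be called (for example from a timer) to flush a partial batch on demand.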
BroadcastBlock<T>

“overwrite buffer”
BroadcastBlock<T>
var bcBlock = new BroadcastBlock<int>(n => n);
var actionBlock1 = new ActionBlock<int>(n => Console.WriteLine($"Message {n} processed - ActionBlock 1"));
var actionBlock2 = new ActionBlock<int>(n => Console.WriteLine($"Message {n} processed - ActionBlock 2"));

// Every linked target is offered a copy of each message
bcBlock.LinkTo(actionBlock1);
bcBlock.LinkTo(actionBlock2);

for (int i = 0; i < 10; i++)
    bcBlock.Post(i);
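To check the fan-out behaviour rather than read it off console output, this sketch collects what each target received. It assumes unbounded `ActionBlock` targets, which always accept the offered message; a target with a small `BoundedCapacity` could miss messages because the broadcast block keeps only the latest value. The class name `BroadcastDemo` is made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BroadcastDemo
{
    public static async Task<(List<int> First, List<int> Second)> RunAsync()
    {
        var first = new List<int>();
        var second = new List<int>();

        // The delegate is the cloning function; ints are copied by value.
        var broadcast = new BroadcastBlock<int>(n => n);
        var a = new ActionBlock<int>(n => first.Add(n));
        var b = new ActionBlock<int>(n => second.Add(n));
        var propagate = new DataflowLinkOptions { PropagateCompletion = true };

        broadcast.LinkTo(a, propagate);
        broadcast.LinkTo(b, propagate);

        for (int i = 0; i < 5; i++) broadcast.Post(i);
        broadcast.Complete();

        await Task.WhenAll(a.Completion, b.Completion);
        return (first, second);
    }

    public static async Task Main()
    {
        var (first, second) = await RunAsync();
        Console.WriteLine(string.Join(",", first));  // 0,1,2,3,4
        Console.WriteLine(string.Join(",", second)); // 0,1,2,3,4
    }
}
```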
Completion and Error propagation
var source = new BufferBlock<string>();
var actionBlock = new ActionBlock<string>(n =>
{
    Console.WriteLine($"Message : {n} - Thread Id#{Thread.CurrentThread.ManagedThreadId}");
});

// Completion is not propagated by default
source.LinkTo(actionBlock, new DataflowLinkOptions() { PropagateCompletion = true });

for (int i = 0; i < 10; i++)
{
    source.Post($"Item #{i}");
}
actionBlock.Completion.ContinueWith(a => Console.WriteLine("actionBlock completed"));
source.Complete();
actionBlock.Completion.Wait();
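The slide title also mentions error propagation, which the same link option covers. Below is a minimal sketch under the assumption that item 3 is invalid: the exception thrown in the first block faults it, the fault travels over the link, and it surfaces when the last block's `Completion` is awaited. The names `FaultDemo`, `parser` and `printer` are illustrative.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class FaultDemo
{
    public static async Task<string> RunAsync()
    {
        var parser = new TransformBlock<int, int>(n =>
        {
            if (n == 3) throw new InvalidOperationException("bad item 3");
            return n;
        });
        var printer = new ActionBlock<int>(n => Console.WriteLine($"ok {n}"));

        // With PropagateCompletion, a fault in parser also faults printer.
        parser.LinkTo(printer,
            new DataflowLinkOptions { PropagateCompletion = true });

        for (int i = 0; i < 5; i++) parser.Post(i);
        parser.Complete();

        try
        {
            await printer.Completion;
            return "completed";
        }
        catch (Exception ex)
        {
            // Each block may wrap the error in an AggregateException,
            // so flatten before looking at the original exception.
            Exception root = ex is AggregateException agg
                ? agg.Flatten().InnerExceptions[0]
                : ex;
            return root.Message;
        }
    }

    public static async Task Main()
    {
        Console.WriteLine(await RunAsync()); // bad item 3
    }
}
```

A faulted block drops its pending messages, so how many "ok" lines appear before the failure is not guaranteed.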
Building a Dataflow Network
[Diagram: each block is drawn as an input buffer, a task and an output buffer (or an internal buffer and a task). A source block fans out to transf1 and transf2, both link to a join block, and the join links to an action block]
[Link](transf1);
[Link](transf2);
[Link](join);
[Link](join);
[Link](action);
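The block names in the diagram are partly lost, so the following is only a sketch of one network with that fork/join shape, not the slide's exact code. It assumes a `BroadcastBlock` as the fork, two `TransformBlock`s, and a `JoinBlock<int, int>` whose `Target1` and `Target2` receive the two branches; results are read directly from the join instead of a final action block.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class ForkJoinDemo
{
    public static async Task<List<(int Doubled, int Squared)>> RunAsync(int count)
    {
        var fork = new BroadcastBlock<int>(n => n);
        var doubler = new TransformBlock<int, int>(n => n * 2);
        var squarer = new TransformBlock<int, int>(n => n * n);
        var join = new JoinBlock<int, int>();

        // Fork: both transforms receive every input.
        fork.LinkTo(doubler);
        fork.LinkTo(squarer);
        // Join: pairs the n-th output of each branch.
        doubler.LinkTo(join.Target1);
        squarer.LinkTo(join.Target2);

        for (int i = 1; i <= count; i++) fork.Post(i);

        var pairs = new List<(int Doubled, int Squared)>();
        for (int i = 0; i < count; i++)
        {
            Tuple<int, int> t = await join.ReceiveAsync();
            pairs.Add((t.Item1, t.Item2));
        }
        return pairs;
    }

    public static async Task Main()
    {
        foreach (var p in await RunAsync(3))
            Console.WriteLine($"{p.Doubled} {p.Squared}"); // 2 1, then 4 4, then 6 9
    }
}
```

The pairing stays aligned because transform blocks emit results in input order by default, even when they run with a higher degree of parallelism.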
Parallel Web-Crawler
A TPL Dataflow based WebCrawler
[Diagram: TPL Dataflow blocks: Download Web Page, Broadcast, Link Parser, Image Parser, Image Processor (write to disk), Local File]
• Download web pages asynchronously.
• Download max 4 web pages in parallel.
• Traverse the web pages links tree.
• Parse for links to images.
• Download jpg images to disk.
• Download the images using async (parallel async stream copy)
Reactive Extensions
What are Reactive Extensions
Rx is a library for composing asynchronous and event-based programs
using observable sequences and LINQ-style query operators
Interactive (pulling): IEnumerable<T> / IEnumerator<T>
(1) The consumer asks the data source for new data: MoveNext()
(2) The IEnumerable\IEnumerator pattern pulls data from the source, which blocks the execution if there is no data available

Reactive (pushing): IObservable<T> / IObserver<T>
(1) The source notifies the consumer that new data is available: OnNext()
(2) The IObservable\IObserver pattern receives a notification from the source when new data is available, which is pushed to the consumer
TPL Dataflow & Reactive Extensions
IPropagatorBlock<int, string> source =
    new TransformBlock<int, string>(i => (i + i).ToString());

// Expose the block's output as an observable sequence
IObservable<int> observable = source.AsObservable().Select(int.Parse);

IDisposable subscription =
    observable.Subscribe(i =>
        $"Value {i} - Time {DateTime.Now.ToString("hh:mm:ss.fff")}".Dump());

for (int i = 0; i < 100; i++)
    source.Post(i);
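TPL Dataflow & Reactive Extensions
IPropagatorBlock<string, int> target =
    new TransformBlock<string, int>(s => int.Parse(s));

IDisposable link = target.LinkTo(new ActionBlock<int>(i =>
    $"Value {i} - Time {DateTime.Now.ToString("hh:mm:ss.fff")}".Dump()));

// Expose the block's input as an observer
IObserver<string> observer = target.AsObserver();

IObservable<string> observable = Observable.Range(1, 20)
    .Select(i => (i * i).ToString());
observable.Subscribe(observer);

// When the observable completes, the observer completes the target,
// so Post calls made after that point are declined.
for (int i = 0; i < 100; i++)
    target.Post(i.ToString());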
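`AsObservable` and `AsObserver` are part of the Dataflow library itself, so the bridge can be tried without the Rx package. The sketch below makes that assumption explicit: instead of Rx's lambda-based `Subscribe`, it uses a small hand-written `IObserver<string>` (the hypothetical `CollectingObserver`) to collect what the block pushes.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class ObservableDemo
{
    // Hand-written observer, used here in place of Rx's Subscribe helpers.
    sealed class CollectingObserver : IObserver<string>
    {
        private readonly TaskCompletionSource<bool> done =
            new TaskCompletionSource<bool>();

        public List<string> Items { get; } = new List<string>();
        public Task Done => done.Task;

        public void OnNext(string value) => Items.Add(value);
        public void OnError(Exception error) => done.TrySetException(error);
        public void OnCompleted() => done.TrySetResult(true);
    }

    public static async Task<List<string>> RunAsync()
    {
        var source = new TransformBlock<int, string>(i => (i + i).ToString());
        var observer = new CollectingObserver();

        // Subscribing links the observer to the block's output.
        using (source.AsObservable().Subscribe(observer))
        {
            for (int i = 0; i < 3; i++) source.Post(i);
            source.Complete();   // surfaces as OnCompleted on the observer
            await observer.Done;
        }
        return observer.Items;
    }

    public static async Task Main()
    {
        Console.WriteLine(string.Join(",", await RunAsync())); // 0,2,4
    }
}
```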
TPL Dataflow and Rx
var encryptor = new TransformBlock<CompressDetails, EncryptDetails>();

[Link](compressor, linkOptions);
[Link](encryptor, linkOptions);

[Link]()
    .Scan((new Dictionary<int, EncryptDetails>(), 0),
        (state, msg) => [Link](async () => {
            (Dictionary<int, EncryptDetails> details, int lastIndexProc) = state;
            [Link]([Link], msg);

            return (details, lastIndexProc);
        }).SingleAsync())
    .SubscribeOn([Link]).Subscribe();
TPL Dataflow

[Diagram: Process 1 → Process 2 → Process 3 → Process 4 → Action]

Message passing based concurrency
◦ Processing
◦ Storage – State
◦ Communication only by messages
◦ Share nothing
◦ Messages are passed by value
◦ Lightweight object
[Diagram: an agent with a mailbox (messages 1, 2, 3), behavior and state]
◦ Running on its own thread
◦ No shared state
◦ Messages are kept in the mailbox and processed in order
◦ Massively scalable and lightning fast because of the small call stack
Agent anatomy

[Diagram: incoming message → mailbox (message queue) → behavior (message processor) with state → output message]
A message is sent to the mailbox to communicate with the Agent
Messages are dequeued and processed sequentially
Immutable state
Single threaded
Encapsulates state
TPL Dataflow: a stateful Agent in C#
class StatefulDataflowAgent<TState, TMessage> : IAgent<TMessage>
{
    private TState state;
    private readonly ActionBlock<TMessage> actionBlock;

    public StatefulDataflowAgent(
        TState initialState,
        Func<TState, TMessage, Task<TState>> action,
        CancellationTokenSource cts = null)
    {
        state = initialState;
        var options = new ExecutionDataflowBlockOptions
        {
            CancellationToken = cts != null ?
                cts.Token : CancellationToken.None
        };
        // The block handles one message at a time by default,
        // so state is never updated concurrently
        actionBlock = new ActionBlock<TMessage>(
            async msg => state = await action(state, msg), options);
    }

    public Task Send(TMessage message) => actionBlock.SendAsync(message);

    public void Post(TMessage message) => actionBlock.Post(message);
}
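To illustrate the agent idea end to end, here is a reduced, self-contained sketch rather than the class above: a hypothetical `CounterAgent` whose state is a running total. Many threads post to it concurrently, and no lock is needed because the `ActionBlock` processes its mailbox one message at a time. The `StopAsync` method is an addition for this example so the final state can be read.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical agent: state is a running total of the posted numbers.
public sealed class CounterAgent
{
    private int total; // touched only by the block's single worker
    private readonly ActionBlock<int> mailbox;

    public CounterAgent()
    {
        // Default MaxDegreeOfParallelism is 1: messages are handled in turn.
        mailbox = new ActionBlock<int>(n => { total += n; });
    }

    public void Post(int n) => mailbox.Post(n);

    // Stops accepting messages, drains the mailbox, returns the final state.
    public async Task<int> StopAsync()
    {
        mailbox.Complete();
        await mailbox.Completion;
        return total;
    }
}

public static class AgentDemo
{
    public static async Task<int> RunAsync()
    {
        var agent = new CounterAgent();
        // Concurrent senders: Post is thread-safe.
        Parallel.For(1, 101, n => agent.Post(n));
        return await agent.StopAsync();
    }

    public static async Task Main()
    {
        Console.WriteLine(await RunAsync()); // 5050
    }
}
```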
TPL Dataflow as Agent
[Diagram: StatelessAgent: messages → ActionBlock (buffer → task). StatefulAgent: messages → ActionBlock (buffer → task) with state]
Source [Link]

Twitter @trikace
Blog [Link]
Email tericcardo@[Link]
Github [Link]/rikace/
Q&A?
The tools we use have a profound (and devious!) influence on our thinking habits,
and, therefore, on our thinking abilities.
-- Edsger Dijkstra
How to reach me

[Link]/DCFsharp

[Link]/DC-fsharp/

@DCFsharp

rterrell@[Link]
