This video is really cool. I’ve been following dataflow approaches for a while, including some of Frank McSherry’s (usually enjoyable) articles. None of the comments mention https://materialize.io, so I may as well: it’s a commercial open-source offering built on these concepts.
Watching this explanation, I’m slightly curious whether things like Materialize and Noria are a bit limited, in that this could be a paradigm for an actual functional reactive programming language rather than specifically a “data” thing. It appears to have the structure of nested contexts (loops, scopes, etc.) advocated by structured programming (i.e., “goto considered harmful”). It can reliably calculate an answer at each point in time for each state of input, concurrently and with parallelism, even if there are multiple inputs with their own notion of time (not covered in the video). That’s, like, the holy grail of PLT these days, isn’t it? Or am I missing something?
Yes, we consider ourselves an “open core” company. Timely and differential, the core compute engines, are fully open source projects, but the Materialize layer atop them is licensed under the “Business Source License” (BSL).
We think the BSL strikes a good balance between giving back to the community—four years after every release, the code is automatically relicensed under Apache 2—and ensuring we can build a viable business. And you’re free to use Materialize for any purpose in a non-distributed (i.e., single node) deployment without paying for an enterprise license.
In my opinion, dataflow is the only true representation of computation. Unlike ordinary code, it makes data dependencies, and hence the available parallelism, explicit. Because of this, it is also a great basis for a hardware implementation.
Although we’re more used to how branching and looping are done in conventional control-flow languages, both can be expressed straightforwardly in dataflow without introducing any non-dataflow constructs: just graphs of vertices connected by edges.
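To make that concrete, here’s a toy Python sketch (my own example, not taken from any particular dataflow system or from my language): branching is a vertex with two out-edges that routes a value by a predicate, and looping is just a back-edge that feeds values around a cycle until they exit.

    from collections import deque

    class Graph:
        def __init__(self):
            self.ops = {}      # vertex name -> operator function
            self.edges = {}    # (vertex, output port) -> downstream vertex

        def vertex(self, name, op):
            self.ops[name] = op

        def edge(self, src, dst, port=0):
            self.edges[(src, port)] = dst

        def run(self, start, value):
            # Push values along edges until one leaves the graph.
            queue = deque([(start, value)])
            while queue:
                name, val = queue.popleft()
                port, out = self.ops[name](val)
                nxt = self.edges.get((name, port))
                if nxt is None:
                    return out  # no out-edge on this port: value exits
                queue.append((nxt, out))

    g = Graph()
    # "Loop body": double the value. "Switch": branch on a predicate.
    g.vertex("double", lambda x: (0, x * 2))
    g.vertex("switch", lambda x: (0, x) if x < 100 else (1, x))
    g.edge("double", "switch")          # body output -> branch test
    g.edge("switch", "double", port=0)  # back-edge: loop while x < 100
    # port 1 of "switch" has no out-edge, so the value exits there.

    print(g.run("double", 3))           # 192

The point is that the loop and the branch live entirely in the graph’s shape; the interpreter knows nothing about control flow.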
BTW, in case anyone's wondering why there have been no recent updates to the pages on my visual dataflow language, it's because the many improvements I've been making, particularly big changes to the type system, have required a lot more work than I expected. I haven't abandoned work on it, but it will still be some time before it's ready for release.
Those show that it’s possible, not necessarily that it’s a good interface for those parts of programs. Houdini’s shader language has had branching in a dataflow graph for a long time, and TouchDesigner has a kind of looping construct too. You might want to take a look at these domain-specific interfaces if you’re building your own graph; they’re well done.
Fundamentally, though, text expressions are far denser than a dataflow graph. If what’s being expressed isn’t fundamentally a directed acyclic graph, a graph visualization becomes harder to absorb than the same logic as text.
I looked at the dataflow paradigm a couple of years ago. Back then I thought that the difference from ordinary functions isn’t that big, and that for performance (which is important for my data work) you don’t want to deviate too far from the traditional way of doing things.
Has anyone felt the same, or can you provide a real-world problem where dataflow actually works better than other solutions?
It’s like a cache, except it keeps itself in sync with the database automatically. It generates “materialised views” using dataflows built from the queries asked of it, and it will automatically build a new dataflow if someone issues a query it doesn’t already have one for. Parts of dataflows can also be shared across views.
The paper linked from the GitHub repo goes into detail about the performance gains; it easily outperforms straight database queries and typical caching setups.
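A hand-rolled Python sketch of the core idea (my own toy, not the actual Noria or Materialize API): a materialised count-per-key view that updates itself incrementally from a stream of deltas, instead of re-running the query on every read.

    from collections import defaultdict

    class CountView:
        def __init__(self):
            self.counts = defaultdict(int)   # the materialised view

        def apply(self, key, delta):
            # The "dataflow": each base-table change flows through and
            # adjusts only the affected row of the view.
            self.counts[key] += delta
            if self.counts[key] == 0:
                del self.counts[key]

        def read(self, key):
            # Reads are cache-speed lookups; no query re-execution.
            return self.counts.get(key, 0)

    view = CountView()
    view.apply("story_42", +1)    # a vote is inserted
    view.apply("story_42", +1)
    view.apply("story_42", -1)    # a vote is retracted
    print(view.read("story_42"))  # 1

The real systems generalize this to joins, aggregations, and shared sub-dataflows, but the shape is the same: work scales with the size of the change, not the size of the data.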
If your functions are transforming chunks of data into other formats/types, you’re already doing what dataflow graphs do. Generalizing that can give you much more structured concurrency.
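For example (my own sketch, assuming nothing about any particular framework): once each step declares what it consumes, steps with no dependency between them can be scheduled concurrently straight from the dependency graph.

    from concurrent.futures import ThreadPoolExecutor

    def parse(raw):      return raw.split(",")
    def to_ints(fields): return [int(f) for f in fields]
    def checksum(raw):   return sum(map(ord, raw))

    raw = "1,2,3"
    with ThreadPoolExecutor() as pool:
        fields = pool.submit(parse, raw)
        # checksum depends only on `raw`, so it runs in parallel with parse.
        chk = pool.submit(checksum, raw)
        # to_ints depends on parse's output, so it waits for that edge only.
        nums = pool.submit(to_ints, fields.result())
        print(nums.result(), chk.result())   # [1, 2, 3] 238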