That statement is overly vague and sounds like marketing BS.
I worked on a project where we scored streaming data in R. The biggest bottleneck was getting the data into and out of the R session. We started out with disk-based I/O and ended up using rJava so our streaming system could communicate with R directly. In that case we did get a 100X speedup between our first iteration and the final version, which used rJava to serialize the data.
So basically, the major bottleneck was not R itself; it was the communication with R. In the article, R is installed on the same hardware as SQL Server, which should automatically give it a speedup with streaming data.
If Microsoft also has an optimized way to get data from SQL Server to R, I can see how they got a 100X speedup. In certain cases the MKL libraries can give you that as well through faster scoring, but I suspect the speedup mostly comes from improving the data transfer method.
> If Microsoft also has an optimized way to get data from SQL Server to R I can see how they got a 100X speedup. In certain cases using the MKL libraries can give you that as well, but I suspect the speedup just comes from improving the data transfer method.
The optimized method is that you can run R inside the database in the latest version of SQL Server.
I've actually installed Windows again, just to play with this feature (though I can't claim to have put it to good use yet).
No need to install Windows to get R in a database: you can use PL/R on Postgres (unless you have a particular desire to run it within SQL Server, of course, but doesn't that run on Linux now?).
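For anyone who hasn't seen PL/R, a minimal sketch (assuming the plr extension is installed, and a hypothetical table `measurements` with a float8 `value` column):

```sql
-- Enable the PL/R language extension (ships separately from Postgres)
CREATE EXTENSION IF NOT EXISTS plr;

-- Define a UDF whose body is plain R; unnamed SQL arguments
-- arrive on the R side as arg1, arg2, ...
CREATE OR REPLACE FUNCTION r_median(float8[]) RETURNS float8 AS '
  median(arg1)
' LANGUAGE plr STRICT;

-- Call it like any other SQL function
SELECT r_median(array_agg(value)) FROM measurements;
```

The R interpreter runs inside the Postgres backend process here, which is what avoids shipping the data out to a separate R session.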
Ah, okay. It wasn't clear to me that Microsoft had written its own algorithms. That does seem very useful, then (though those algorithms and data structures could presumably be used outside of the SQL Server environment, I imagine Microsoft is using them to encourage people to use SQL Server rather than another solution).
I believe the second link is referring to being able to initialize and share data between functions within the R runtime (rather than having to transfer back and forth between Postgres and the runtime). Is that not what you were referring to?
>being able to initialize and share data between functions within the R runtime
That's right.
>rather than having to transfer back and forth between Postgres and the runtime
But that's not. There's a difference between being able to use data from outside the database (via the R runtime) in my UDFs (executed in Postgres) on one hand, and being able to attach 2 TB of data straight from an SQL table in the R runtime on the other. I don't even care that much about the algorithms. Moving the data is the bottleneck most of the time. And Microsoft is actually late to the party (but better late than never). Oracle, Netezza, Vertica and HANA have been able to do this for quite a while now.
You are spot on about being able to use the algorithms outside of SQL Server. You can use them on Teradata or Hadoop, rent your own VMs on Azure to run them, or buy standalone licenses too.
So by "attach 2 TB of data straight from an SQL table into the R runtime" you mean that Microsoft taught R to interact directly with SQL Server's storage engine? If so, I agree: data movement is almost always the bottleneck for large data sets, and I don't think PL/R can do that (though I'm not sure whether that's an inherent limitation of the way Postgres's language plugins work, or something that could be done with enough effort).
However, if all you mean is that SQL Server can transfer the data a tuple at a time to R on the same server (in memory), I believe PL/R and Postgres already interact like that (again, maybe I'm wrong). And I don't know how much extra overhead that adds over talking directly to the storage engine, anyway.
>Microsoft taught R to interact directly with SQL Server's storage engine
They have created two new services for SQL Server 2016, BxlServer and SQL Satellite, which facilitate the communication and data exchange. They obviously have additional speedups for the proprietary runtime (fast data access to several RDBMSes was one of the main selling points of the company they acquired), but it's plenty fast for regular R too.
"When you select this feature, extensions are installed in the database engine to support execution of R scripts, and a new service is created, the SQL Server Trusted Launchpad, to manage communications between the R runtime and the SQL Server instance."
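Concretely, invoking R in-database in SQL Server 2016 looks something like this (a minimal sketch; assumes R Services is installed and "external scripts enabled" is configured, and the table `sales` with an `amount` column is a placeholder):

```sql
-- Runs the R script in-process on the SQL Server machine.
-- InputDataSet / OutputDataSet are the conventional data.frame
-- names on the R side of the boundary.
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- data.frame(avg_amount = mean(InputDataSet$amount));',
    @input_data_1 = N'SELECT amount FROM sales'
WITH RESULT SETS ((avg_amount float));
```

The query in `@input_data_1` feeds the R runtime directly, without the data ever leaving the box, which is presumably where much of the claimed speedup comes from.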
So basically SQL Server is talking to the R session. The speedup comes from R being installed locally and from the "communication" layer, which I've yet to figure out.