Plan for updating FsLab #137
Here are some stream-of-consciousness thoughts on steps to update FsLab:
Also:
Links to various parts of this as they get done:
|
@dsyme I only use Deedle and XPlot. Data is consumed via Dapper. I prototype all my research in FSI. I do hope .NET Core support for FSI comes soon, as I refresh the issue every day :) I used the SqlClient type provider heavily until I found it is not supported on .NET Core. I do use the latest SignalR so that Dapper can mitigate the problem. I feel we should discuss the bigger picture of F# in data science, rather than limit ourselves to updating the current libraries to netstandard.
@tpetricek Among the FsLab packages, I found FSharp.Formatting to be a weak link. I experimented with it before, but found it not very productive because it requires some boilerplate and HTML/Razor templates. Documentation and samples were not in good shape either. For a data scientist or a quant in finance, I generalize the workflow as data retrieval -> computation -> visualization. Jupyter notebook is the de facto place to start, as almost zero boilerplate is required. A notebook can easily be rerun to replicate research step by step. On top of that, a dynamically typed language is very easy to get started with for people with a stats/math background. That's partly why Python built up an amazing ecosystem. Though I strongly prefer statically typed languages, I cannot deny the productivity of the Python ecosystem. That productivity comes at a cost in speed, scalability, type robustness and integration. In practice, though, most research teams are willing to accept these long-term costs in exchange for a short-term productivity boost at the idea generation stage. To grow a community using F# in data science, avoiding boilerplate is the first prerequisite for productivity. The productivity gained from the FSharp.Data type providers cannot make up for the cost of preparing boilerplate to plot results. I found IfSharp quite useful, as I can run all my F# scripts and visualize results using XPlot easily. The scripting and debugging experience is not as good as in Visual Studio and FSI, but I just need to copy and paste my script into IfSharp and use it as a notebook to record and visualize results. Most of the time it works right away. We should make improving the FsLab experience on IfSharp a priority by providing more documentation. It took me a while to dig through various issues to get a Deedle frame printed. Having to dig through these issues will discourage newcomers to F#. More sample Azure notebooks on various topics to educate the community would be very useful.
@cgravill Another promising path is a polyglot Jupyter notebook such as BeakerX by Two Sigma. It has just released version 1.0 with built-in two-way autotranslation. There was an old Beaker F# kernel. Maybe it could be ported so that F# can coexist with other languages and ecosystems in the same notebook. Then more libraries and visualizations would be at hand.
@aolney Including an ML library such as Accord.NET/TensorFlowSharp is a good idea, but I am not an expert on it. Maybe @mathias-brandewinder has some good suggestions. I would also include another optimization library, Google.OrTools. They plan to release its FSharp library targeting netstandard in the next version soon. It would serve another group of users with linear optimization problems. I've compiled its F# library for use in production and found its F# examples very elegant. |
I don't know if I missed something, but I feel that F# needs to support complex XSD -> XML generation. It is used heavily here in Brazil by the Government for Sales and Medical, and it would be good to have an easy way to get type safety when generating XML from our data using the XSD schemas provided by the Government. |
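As a stopgap for the XSD point above, generated XML can at least be checked against a schema at run time with the standard .NET `System.Xml.Schema` APIs. This is only a sketch: the `invoice` schema and the `Invoice` record are made up for illustration, and a run-time check is weaker than the compile-time type safety asked for.

```fsharp
open System.IO
open System.Xml
open System.Xml.Linq
open System.Xml.Schema

// Hypothetical schema: an <invoice> element with one required decimal <total>.
let xsd = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="invoice">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="total" type="xs:decimal" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

// Load the schema once; "" is used because this schema has no target namespace.
let schemas = XmlSchemaSet()
schemas.Add("", XmlReader.Create(new StringReader(xsd))) |> ignore

// Returns the validation messages; the list is empty when the document is valid.
let validate (doc: XDocument) : string list =
    let errors = ResizeArray<string>()
    doc.Validate(schemas, ValidationEventHandler(fun _ e -> errors.Add(e.Message)))
    List.ofSeq errors

// Hypothetical record that the XML is generated from.
type Invoice = { Total: decimal }

let toXml (invoice: Invoice) =
    XDocument(XElement(XName.Get "invoice", XElement(XName.Get "total", invoice.Total)))

// A document built from the record, and one with a non-decimal total.
let good = validate (toXml { Total = 12.5m })
let bad = validate (XDocument(XElement(XName.Get "invoice", XElement(XName.Get "total", "abc"))))
```

Here `good` should come back empty while `bad` carries at least one message about the `total` element.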
It's a great discussion, please continue, all the comments are enlightening. Note the list of work items above is not meant to be comprehensive and is a bit stream-of-consciousness. My take is that FsLab should be a collection of packages which "work together and you don't regret". That is, the packages should
Basically you want to be able to "add an FsLab reference" and do some data-science workbook programming, whether that be in Visual Studio, VS Code, Jupyter notebooks or whatever. Equally you should be able to back out of using FsLab and just use individual packages with the same effect. Machine learning packages for .NET are a little tricky for FsLab. The more complete ones like Accord.NET (which is great) tend to be a complete set of packages in their own right (which is also great). Other packages like ML.NET are a little too early to include. So in general I agree. Interop packages like RProvider, python, MATLAB provider, Excel provider etc. are tricky too. On the one hand these are incredibly useful when they are needed and work, and can benefit from regular integration and use with other components. They are also sometimes painful to get working the first time, and people sometimes shy away from them. On the other hand they are a source of considerable complexity and documenting them can be tricky. Note that one approach would be to abandon FsLab as an "integrated" package and simply document the choices and how to get started with them. Finally, FsLab today takes a very strong approach to literate programming, and I agree with @zyzhu that FSharp.Formatting is a bit of a weak link. I need to understand better where we should end up here. |
I think I'm close to having a XPlot netstandard2.0 PR |
@jackfoxy That's great. I did a couple of updates to XPlot to fix the paket bootstrapper and documentation generation, you'll want to integrate those |
If it's already in master, I'll merge. I also implemented paket magic mode, which is possibly what you did @dsyme |
Thanks to @zyzhu and @dsyme for starting this discussion. I'm really happy to hear about these developments! I agree with most of zyzhu's points; still, as one of your goals seems to be to attract new users to FsLab, I feel that providing my 2¢ of opinion could be helpful. From my experience, starting out with F# for data processing can be quite tough for a newcomer (I'm coming from a science background) for several (some non-technical) reasons. Some thoughts:
|
@sebhofer I agree with all those points, thanks. My first aim here is to get FsLab "clean" and spark a round of work on fundamentals like .NET Standard support. But we can also reassess its whole construction - I'm still not sure it should be anything but a template of the kind you propose (does it even need to be a combined nuget package?) FsLab is, at the moment:
plus a template. These seem reasonable (though FSharp.Charting should I think be dropped now). I think each is quite well documented (once links are all fixed). But the centrality of the literate programming support is questionable in the world of notebooks. There are also transient dependencies on
the first two of which are questionable, and also optional dependencies on:
|
@jackfoxy Could you send a PR for your xplot .NET Standard 2.0 work, even if not yet quite complete? Then we can discuss and others can help get it over the line? thanks |
@sebhofer I agree with this and I'm concerned by aspects of the Deedle design. It's possible there are also just better data frame libraries emerging for .NET as well, especially with regard to simplicity and discoverability. We need to reassess this. |
@dsyme fslaborg/XPlot#75 not merged with latest master |
I've merged a change to IfSharp to target .NET 4.7.1 to ease interaction with .NET Standard 2.0 (fsprojects/IfSharp#181). There is some odd behaviour, but with that I'm able to use ML.NET 0.3 in the context of a Jupyter Notebook. It'd be great to have improved support for FsLab. There was some initial work on this in fsprojects/IfSharp#156, but more would be great. The helper script approach does have discovery issues, but it's meant we can keep the core cleaner. |
@dsyme I'm certainly in no position to judge the Deedle design, but I experienced that it's quite easy (for a beginner) to get bad performance if one is not careful. In my case I had to hack my own merge (or join?) function, because the built-in one would just not finish in reasonable time. (The reason was that the built-in version was too general for my problem and could be simplified considerably.) This is in principle not bad, but certainly slows you down in your day-to-day work. So there is certainly some room for improvement. To finish, I also have to say that it's just great that @tpetricek is so responsive on stackoverflow with respect to any Deedle issues that crop up (or any F# related problem for that matter :)! |
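To illustrate how a join can be simplified when the problem is narrower than the general case, here is a minimal sketch using only base F# collections. It is not the hand-written merge described above and not Deedle's implementation; the sample data and the names `innerJoin` and `asOf` are assumptions for illustration.

```fsharp
// Hypothetical sample data: (key, value) pairs.
let prices = [| ("AAPL", 190.0); ("MSFT", 410.0); ("GOOG", 140.0) |]
let sectors = [| ("AAPL", "Tech"); ("MSFT", "Tech") |]

// Equality join: index the right side in a Map, then look up each left row.
// If a key repeats on the right side, Map.ofArray keeps the last value.
let innerJoin (left: ('k * 'a)[]) (right: ('k * 'b)[]) : ('k * 'a * 'b)[] =
    let index = Map.ofArray right
    left
    |> Array.choose (fun (k, a) ->
        Map.tryFind k index |> Option.map (fun b -> (k, a, b)))

// As-of (inequality) lookup: the latest observation at or before time t.
// Assumes `series` is sorted by time in ascending order.
let asOf (series: (int * 'v)[]) (t: int) : 'v option =
    series
    |> Array.tryFindBack (fun (time, _) -> time <= t)
    |> Option.map snd

let joined = innerJoin prices sectors
let quotes = [| (1, 10.0); (5, 11.0); (9, 12.0) |]
let atSix = asOf quotes 6
```

The as-of lookup scans linearly; for long series a binary search over the sorted array would be the natural next step.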
Deedle
The work on Deedle is tremendous, but (coming from R, SQL) I unfortunately found it complicated to understand the programming model and gave up on it. I found much more success using base f# data structures. It was far simpler. That said, saving frames to files and using frames to pass to/from Rprovider are fantastic. The time series join stuff is also great, but I just end up using Array.find for inequality searches or maps for equality searches.
FSharp.Data
No record collection -> Csv file function is a weak point for saving intermediate results. Hand mapping 30 column records to a CSV row type is not practical or type safe (easy to accidentally transpose two neighboring columns of the same type). So I resort to a version of this:
formatting
My current workflow is do calculations in F#, save to CSV, then do literate programming in Rmarkdown documents for tables, figures. The blocking issue for using F# formatting is automatic latex table formatting of fancy regressions. I guess integration with R latex formatting via Rprovider is possible, but I haven't tried it. I think it will be hard to make a lot of progress here, because the first step is to have the statistical models, then second formatting for it. The holdup is the statistical models.
packages used most often
Overall
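On the FSharp.Data point in this comment (no record collection -> CSV function), one possible workaround is to derive the header and column order from the record type by reflection, so neighboring columns cannot be transposed by hand. The snippet the comment refers to did not survive, so this is not a reconstruction of it; it is a sketch, and `toCsvLines`, `escape` and the `Trade` record are hypothetical names.

```fsharp
open System
open System.Globalization
open Microsoft.FSharp.Reflection

// Quote a field when it contains a comma, a double quote, or a line break.
let escape (field: string) =
    if field.IndexOfAny([| ','; '"'; '\n'; '\r' |]) >= 0 then
        "\"" + field.Replace("\"", "\"\"") + "\""
    else
        field

// Turn a sequence of F# records into CSV lines: a header built from the
// record's field names, then one row per record in the same field order.
let toCsvLines (rows: seq<'T>) : string list =
    let fields = FSharpType.GetRecordFields typeof<'T>
    let header = fields |> Array.map (fun f -> escape f.Name) |> String.concat ","
    let line (row: 'T) =
        FSharpValue.GetRecordFields(box row)
        |> Array.map (fun v -> escape (Convert.ToString(v, CultureInfo.InvariantCulture)))
        |> String.concat ","
    header :: [ for row in rows -> line row ]

// Hypothetical record type for illustration.
type Trade = { Ticker: string; Quantity: int; Price: float }

let lines =
    toCsvLines [ { Ticker = "AAPL"; Quantity = 10; Price = 190.5 };
                 { Ticker = "A,B"; Quantity = 2; Price = 1.0 } ]

// System.IO.File.WriteAllLines("trades.csv", lines) would save the result.
```

Values are formatted with the invariant culture so decimal separators do not depend on the machine's locale; option-typed or nested fields would need extra handling.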
|
At the risk of piling on, since I was mentioned in an earlier post, I thought I'd give an update on Beaker for polyglot notebook programming. If this is an unfamiliar concept, basically it means you have a computational notebook that is simultaneously connected to multiple language kernels, and accordingly you can program in any of the corresponding languages across cells. So you can munge some data in F# and then in the next cell do some statistical modeling in R. I've been using Beaker for several years with real workloads, and it works very well. The project has recently pivoted towards supporting Jupyter, with a fairly huge loss of functionality during the pivot. The best current polyglot alternative within Jupyter seems to be the SoS kernel, which can also be used in JupyterLab. So far I've only used SoS for small workloads, but it seems very solid. In Beaker I've had notebooks that use F#, R, Scala, Groovy, Java, and Javascript, using each where it works best (and potentially has a library dependency that I need). From my perspective, this is far better than trying to bring libraries developed in other languages into F# because:
Polyglot notebooks can have some issues, but these seem to mostly be self-inflicted. For example, Beaker had specific kernel connection code for each supported language, making it difficult to maintain dozens of kernels. Also autotranslation (passing data structures between kernels through the notebook) is a cool and often touted feature that can be difficult to implement well with many edge cases. If autotranslation had a very basic implementation, then many of the associated problems would disappear. In practice I've found it's not really that useful except for passing configuration information between cells (e.g. file paths) because data of non-trivial size needs to hit the disk anyways, where it can be read by other kernels. Anyways, it seems that polyglot notebooks are here to stay. I habitually use F# within this context (favorite language naturally) but use other languages in the notebook when their native support is a more natural fit. As far as F# kernels are concerned, since Jupyter has replaced Beaker, the ifSharp kernel is the best F# kernel to keep moving forward. I've used ifSharp with SoS/Jupyter and it works great. |
@aolney Thanks for sharing your experience. Your points clarify my confusion about Beaker and BeakerX. I took a quick look at SoS. It seems that it requires setting up a language module similar to https://github.com/vatlab/sos-r/tree/21883327750a1089066e8933843131d6271bfd74 so that SoS can interop F# with other languages. I found the documentation here https://vatlab.github.io/sos-docs/doc/documentation/Language_Module.html Is that how you get started on using IfSharp on SoS? Any possibility to create a pull request to share your language module to SoS so that it can support IfSharp kernel out of box? That will help F# community get started on SoS notebook. |
I've been using a Jupyterlab installation but I think the process is similar for Jupyter notebook. I believe this is all it takes:
In other words there's no need for a new language module b/c ifSharp works with Jupyter. My understanding (could be wrong) is that the links you provided are only needed for certain functionalities like autotranslation and syntax coloring. They may also be needed for future capabilities like intellisense and linting. |
@aolney Thanks for clarifying. Yes. I was interested in autotranslation so that variables between python and R can be used in F# and vice versa. I already got F# kernel working on SoS. One step further is to autotranslate between pandas dataframe with Deedle. But it requires Feather format support as that's how it's done between R and python right now. https://github.com/vatlab/sos-r/blob/21883327750a1089066e8933843131d6271bfd74/src/sos_r/kernel.py#L83 You mentioned data needs to be dumped to file before consumed by other language. I see that's how it's done between pandas and matlab now https://github.com/vatlab/sos-matlab/blob/d818cb93b8988bb8ecf9e4910c12fe7ab9538e73/src/sos_matlab/kernel.py#L102 We can do that instead to support Deedle and pandas dataframe interaction. Guess this would be another long-term project. At least the path looks clear. |
So, say everything existing is updated and working nicely. Well, a proper RProvider would give you access to whatever is missing in F# and just about anything in python and then some. RProvider though has not worked for quite some time since R 3.5. Missing, Missing, Missing:
model <- lm(data = scores, score ~ age * sex)
let model =
    let data = scores
    let response = [ "score" ]
    let predictors = [ "age"; "sex" ]
    (data, response, predictors)
    |> linearModel ModelType.OLS CrossEffects.Multiplicative
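The `linearModel` function above is a wished-for API, not an existing one. As a rough sketch of what such a function would have to do for `score ~ age * sex`, the model can be fitted with plain arrays by solving the normal equations. The data, the 0/1 coding of `sex`, and the names `solve` and `ols` are assumptions for illustration; a real library would use a more stable factorization and report standard errors.

```fsharp
// Solve the square linear system a * x = b by Gaussian elimination with partial pivoting.
let solve (a: float[,]) (b: float[]) : float[] =
    let n = b.Length
    let m = Array2D.copy a
    let v = Array.copy b
    for col in 0 .. n - 1 do
        // Use the row with the largest absolute value in this column as the pivot.
        let pivot = [ col .. n - 1 ] |> List.maxBy (fun r -> abs m.[r, col])
        if pivot <> col then
            for c in 0 .. n - 1 do
                let tmp = m.[col, c]
                m.[col, c] <- m.[pivot, c]
                m.[pivot, c] <- tmp
            let tmp = v.[col]
            v.[col] <- v.[pivot]
            v.[pivot] <- tmp
        // Eliminate the entries below the pivot.
        for r in col + 1 .. n - 1 do
            let factor = m.[r, col] / m.[col, col]
            for c in col .. n - 1 do
                m.[r, c] <- m.[r, c] - factor * m.[col, c]
            v.[r] <- v.[r] - factor * v.[col]
    // Back substitution.
    let x = Array.zeroCreate<float> n
    for r in n - 1 .. -1 .. 0 do
        let mutable sum = v.[r]
        for c in r + 1 .. n - 1 do
            sum <- sum - m.[r, c] * x.[c]
        x.[r] <- sum / m.[r, r]
    x

// Ordinary least squares via the normal equations (X'X) beta = X'y.
let ols (rows: float[][]) (y: float[]) : float[] =
    let p = rows.[0].Length
    let xtx = Array2D.init p p (fun i j -> rows |> Array.sumBy (fun r -> r.[i] * r.[j]))
    let xty = Array.init p (fun i -> Array.map2 (fun (r: float[]) yi -> r.[i] * yi) rows y |> Array.sum)
    solve xtx xty

// Hypothetical data: (age, sex coded as 0/1, score).
let scores =
    [| (20.0, 0.0, 61.0); (30.0, 0.0, 66.0); (40.0, 0.0, 71.0);
       (20.0, 1.0, 58.0); (30.0, 1.0, 68.0); (40.0, 1.0, 78.0) |]

// In R, score ~ age * sex expands to intercept + age + sex + age:sex.
let design = scores |> Array.map (fun (age, sex, _) -> [| 1.0; age; sex; age * sex |])
let response = scores |> Array.map (fun (_, _, score) -> score)

// Coefficients in the order: intercept, age, sex, age:sex.
let beta = ols design response
```

The sample data lie exactly on two lines (one per group), so the fitted coefficients should reproduce the intercept, slope, and group offsets used to construct them.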
|
For whatever product, you would require a few killer features that would make it indispensable, and for F#, it could easily be the entire data analytics and data science workloads. The same thing that greatly helped propel python to the front. The user base, especially, being more mathematically inclined and comfortable with the syntax (I just love/adore it but dunno why makes lots of people uncomfortable), ideas of immutability and the core of language being input -> function -> output, would be much better adopters than say, developers active in GUI or web. There are other areas I am sure, for example, business applications that fit nicely with Domain-Driven Design. But data science workloads - incidentally, a perfect match for DDD - are certainly worth the investment, especially as they seem to be exponentially growing both in volume and utilisation. If you think about it, one of the most active open source big data projects, Spark, is only 7 years old - with many users adopting a difficult language like Scala just to use full Spark capabilities and performance. The community as a whole seems to be more-so accepting of learning and new tech that makes their life easier. |
Corporate support could be subtle, could be a lot of things from adoption and critique to code contribution to money, marketing, etc. For example,
|
This discussion continues at fslaborg/FsLab#3. Please join us there! |
Now that @zyzhu has started updating Deedle to netstandard 2.0 we should look at updating the whole FsLab collection
@zyzhu - do you use FsLab as an entire collection or just the individual pieces?