Plan for updating FsLab #137

Closed
dsyme opened this issue Jul 18, 2018 · 73 comments

@dsyme
Member

dsyme commented Jul 18, 2018

Now that @zyzhu has started updating Deedle to netstandard 2.0, we should look at updating the whole FsLab collection.

@zyzhu - do you use FsLab as an entire collection or just the individual pieces?

@dsyme
Member Author

dsyme commented Jul 18, 2018

Here are some stream-of-consciousness thoughts on steps to update FsLab:

  • Update FsLab packages to latest
  • Make FsLab support netstandard 2.0 (so you can reference FsLab in a netstandard 2.0 project)
  • Make sure the package is usable with .NET SDK projects
  • Revamp load scripts to account for package locations
  • Update template and documentation to use dotnet new

Also:

  • Check experience and tutorials on Mac, VSCode for Mac, VS for Mac
  • Look again at FSharp.Charting - do we really want to use Gtk etc.
  • Revive the twitter account http://twitter.com/fslaborg
  • add Python interop to the mix (no type provider, just basic interop and samples)
  • consider whether machine learning should be added to the mix (obviously it's needed, just does FsLab want to take a preferred view on the different options)

Links to various parts of this as they get done:

@dsyme changed the title from "Update packages and support to netstandard 2.0" to "Plan for updating FsLab" on Jul 18, 2018

@zyzhu

zyzhu commented Jul 18, 2018

@dsyme I only use Deedle and XPlot. Data is consumed via Dapper. I prototype all my research via FSI. I do hope dotnet core support on FSI will come soon, as I refresh the issue every day :)

I used to rely heavily on the SqlClient type provider, until it turned out not to be supported on dotnet core. I do use the latest SignalR so that Dapper can mitigate the problem.

I feel we should discuss the bigger picture of F# in data science, rather than restricting ourselves to updating the current libraries to netstandard. @tpetricek

Among the FsLab packages, I found FSharp.Formatting to be a weak link. I experimented with it before, but found it not very productive to use, as it requires some boilerplate and templates in html/razor. Documentation and samples were not in good shape either.

For a data scientist or a quant in finance, I would generalize the workflow as data retrieval -> computation -> visualization. A Jupyter notebook is the de facto place to start, as almost zero boilerplate is required. The notebook can easily be rerun to replicate research step by step. On top of that, a dynamically typed language is very easy to get started with for people with a stats/math background. That's partially why Python has built up an amazing ecosystem.

Though I strongly prefer statically typed languages, I cannot deny the productivity of the Python ecosystem. But that productivity comes at a cost in speed, scalability, type robustness and integration. In practice, most research teams are willing to accept these long-term costs in exchange for a short-term productivity boost at the idea generation stage.

In order to grow a community using F# in data science, avoiding boilerplate is the first prerequisite for productivity. The productivity gained from the FSharp.Data type providers cannot make up for the cost of preparing boilerplate to plot results. I found IfSharp quite useful, as I can run all my F# scripts and visualize results using XPlot easily. The scripting and debugging experience is not as great as in Visual Studio and FSI, but I just need to copy and paste my script into IfSharp and use it as a notebook to record and visualize results. Most of the time it works right away.

We should prioritize improving the experience of FsLab on IfSharp by providing more documentation. It took me a while to dig through various issues to get a Deedle frame printed. Having to dig through these issues will discourage newcomers to F#. More sample Azure notebooks on various topics to educate the community would be very useful. @cgravill

Another promising path is a polyglot Jupyter notebook such as BeakerX by Two Sigma. It has just released version 1.0 with built-in two-way autotranslation. There was an old Beaker F# kernel. Maybe it could be ported so that F# can coexist with all the other languages and ecosystems in the same notebook. Then more libraries and visualizations would be at hand. @aolney
twosigma/beakerx#5039

Including an ML library such as Accord.Net/TensorFlowSharp is a good idea. But I am not an expert on it. Maybe @mathias-brandewinder has some good suggestions.

I would also include another optimization library, Google.OrTools. They plan to release its FSharp library targeting netstandard in the next version soon. It would serve another group of users, those with linear optimization problems. I've compiled its F# library to use in production and found its F# examples very elegant.
google/or-tools#722

@TonyHenrique

I don't know if I missed something, but I feel that F# needs to support complex XSD -> XML generation. It is used heavily here in Brazil by the government for sales and medical data, and it would be good to have an easy way to get type safety when generating XML from our data using the XSD schemas provided by the government.

See fsprojects/FSharp.Data.Xsd#26 (comment)

@dsyme
Member Author

dsyme commented Jul 19, 2018

It's a great discussion, please continue, all the comments are enlightening.

Note the list of work items above is not meant to be comprehensive and is a bit stream-of-consciousness.

My take is that FsLab should be a collection of packages which "work together and you don't regret". That is, the packages should

  • be useful for data science (but not necessarily a complete set of packages for every eventuality - you might need to add more)
  • work cross-platform (including .NET Core)
  • work with F# Interactive (including on .NET Core when it is done)
  • work in iFSharp Jupyter notebooks
  • be well-scoped, i.e. do what they say on the tin, and not more or less
  • have relatively few bugs
  • have an active maintainer
  • be well-documented
  • be accepting contributions
  • not interfere with the use of alternative packages
  • together they should not be "too large"
  • be usable as independent components if necessary ("not a big fur-ball that is all or nothing")
  • be usable in both data scripting and compiled code

Basically you want to be able to "add an FsLab reference" and do some data-science workbook programming, whether that be in Visual Studio, VS Code, Jupyter notebooks or whatever.

Equally you should be able to back out of using FsLab and just use individual packages with the same effect.

Machine learning packages for .NET are a little tricky for FsLab. The more complete ones like Accord.NET (which is great) tend to be a complete set of packages in their own right (which is also great). Other packages like ML.NET are a little too early to include. So in general I agree

Interop packages like RProvider, python, MATLAB provider, Excel provider etc. are tricky too. On the one hand these are incredibly useful when they are needed and work, and can benefit from regular integration and use with other components. They are also sometimes painful to get working the first time, and people sometimes shy away from them. On the other hand they are a source of considerable complexity and documenting them can be tricky.

Note that one approach would be to abandon FsLab as an "integrated" package and simply document the choices and how to get started with them

Finally, FsLab today takes a very strong approach to literate programming, and I agree with @zyzhu that FSharp.Formatting is a bit of a weak link. I need to understand better where we should end up here.

@jackfoxy

jackfoxy commented Jul 19, 2018

I think I'm close to having an XPlot netstandard2.0 PR
https://github.com/jackfoxy/XPlot/tree/magicmode
It builds in VS, but I'm getting a strange error: Newtonsoft.Json is not recognized in the build target of the build script.

@dsyme
Member Author

dsyme commented Jul 19, 2018

@jackfoxy That's great. I did a couple of updates to XPlot to fix the paket bootstrapper and documentation generation, you'll want to integrate those

@jackfoxy

If it's already in master, I'll merge. I also implemented paket magic mode, which is possibly what you did @dsyme

@sebhofer

sebhofer commented Jul 19, 2018

Thanks to @zyzhu and @dsyme for starting this discussion. I'm really happy to hear about these developments! I agree with most of zyzhu's points; still, as one of your goals seems to be to attract new users to FsLab, I feel that providing my 2¢ of opinion could be helpful. From my experience, starting out with F# for data processing can be quite tough for a newcomer (I'm coming from a science background) for several (some non-technical) reasons. Some thoughts:

  • First and foremost: getting to know what's available in FsLab is really tough. Try clicking through the FsLab website and the project sites. Frankly, it's quite messy. Just looking at the projects listed on the top right is confusing; this top bar lists anything from 2 to 5 different packages, and hardly ever the same. To this day I don't know what exactly is "part" of FsLab. Also, some (but not all) pages link to http://fsharp.github.io/FSharp.Charting/, which is dead.
  • Starting with Deedle was surprisingly hard for me coming from a dynamic world. Although the documentation is quite extensive, I still needed a lot of time to figure out seemingly simple tasks. I'm not quite sure how one could alleviate this. Maybe a list of common patterns in pandas and their translation to Deedle would help. I also thought about doing a Deedle cheat sheet along the lines of the pandas one, but I never got around to it.
  • Notebook interface: I completely agree with @zyzhu that these are really useful, and I think it's crucial to have a notebook interface which nicely integrates display of dataframes and plotting without too much fiddling around.
  • Interop with R and python would certainly be nice, and I think would attract many people who just can't afford to give up using a certain package for some reason or another.
  • What I would also enjoy having is a data science template similar to this. I'm not sure if FsLab is the place for it, but on the other hand, there are already 2 templates...

@dsyme
Member Author

dsyme commented Jul 20, 2018

@sebhofer I agree with all those points, thanks. My first aim here is to get FsLab "clean" and spark a round of work on fundamentals like .NET Standard support. But we can also reassess its whole construction - I'm still not sure it should be anything but a template of the kind you propose (does it even need to be a combined nuget package?)

FsLab is, at the moment:

  • Deedle
  • XPlot
  • Math.NET Numerics
  • RProvider
  • FSharp.Charting
  • Some literate programming support

plus a template. These seem reasonable (though FSharp.Charting should I think be dropped now). I think each is quite well documented (once links are all fixed). But the centrality of the literate programming support is questionable in the world of notebooks.

There are also transitive dependencies on

  • Suave
  • Newtonsoft.Json
  • Google.DataTable.Net.Wrapper

the first two of which are questionable, and also optional dependencies on:

  • Google charts
  • Plotly
  • R

@dsyme
Member Author

dsyme commented Jul 20, 2018

@jackfoxy Could you send a PR for your XPlot .NET Standard 2.0 work, even if it's not yet quite complete? Then we can discuss it and others can help get it over the line. Thanks

@dsyme
Member Author

dsyme commented Jul 20, 2018

Starting with Deedle was surprisingly hard for me coming from a dynamic world. Although the documentation is quite extensive, I still needed a lot of time to figure out seemingly simple tasks.

@sebhofer I agree with this and I'm concerned by aspects of the Deedle design. It's possible there are also just better data frame libraries emerging for .NET as well, especially with regard to simplicity and discoverability. We need to reassess this.

@jackfoxy

jackfoxy commented Jul 20, 2018

@dsyme fslaborg/XPlot#75 not merged with latest master

@cgravill

I've merged a change to IfSharp to target .NET 4.7.1 to ease interaction with .NET Standard 2.0 (fsprojects/IfSharp#181). There is some odd behaviour, but with that I'm able to use ML.NET 0.3 in the context of a Jupyter Notebook.

It'd be great to have improved support for FsLab. There was some initial work on this in fsprojects/IfSharp#156 but more would be great. The helper script approach does have discovery issues but it's meant we can keep the core cleaner.

@sebhofer

sebhofer commented Jul 20, 2018

@dsyme I'm certainly in no position to judge the Deedle design, but I experienced that it's quite easy (for a beginner) to get bad performance if one is not careful. In my case I had to hack my own merge (or join?) function, because the built-in one would just not finish in reasonable time. (The reason was that the built-in version was too general for my problem and could be simplified considerably.) This is in principle not bad, but certainly slows you down in your day-to-day work. So there is certainly some room for improvement.

To finish, I also have to say that it's just great that @tpetricek is so responsive on stackoverflow with respect to any Deedle issues that crop up (or any F# related problem for that matter :)!

@nhirschey

nhirschey commented Jul 24, 2018

Deedle

The work on Deedle is tremendous, but (coming from R, SQL) I unfortunately found it complicated to understand the programming model and gave up on it. I found much more success using base F# data structures. It was far simpler.

That said, saving frames to files and using frames to pass data to/from RProvider are fantastic.

The time series join stuff is also great, but I just end up using Array.find for inequality searches or maps for equality searches.
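
A minimal sketch of that base-data-structure approach, assuming a small sorted array of (day, price) observations and a ticker-to-sector map. All names and values here are hypothetical and only illustrate the two kinds of lookup:

```fsharp
// Hypothetical sample data: (day, price) pairs sorted by day.
let prices = [| (1, 10.0); (3, 10.5); (7, 11.2) |]

/// Inequality search: the most recent observation at or before `day`
/// (an "as-of" lookup). Returns None when nothing is early enough.
let priceAsOf day =
    prices
    |> Array.tryFindBack (fun (d, _) -> d <= day)
    |> Option.map snd

/// Equality search: build a Map once, then look up by key.
let sectorByTicker = Map.ofList [ "MSFT", "Tech"; "XOM", "Energy" ]
let sectorOf ticker = Map.tryFind ticker sectorByTicker

let asOfDay5 = priceAsOf 5        // Some 10.5
let beforeData = priceAsOf 0      // None
let xomSector = sectorOf "XOM"    // Some "Energy"
```

The array scan is linear in the number of observations, which is fine for small series; a binary search would be the natural next step for larger ones.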

FSharp.Data

The lack of a record collection -> CSV file function is a weak point for saving intermediate results. Hand-mapping 30-column records to a CSV row type is neither practical nor type safe (it is easy to accidentally transpose two neighboring columns of the same type). So I resort to a version of this:
https://stackoverflow.com/questions/25086198/list-of-string-in-a-record-to-csv
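
For illustration, one way such a helper could look, using the reflection support in FSharp.Core. This is a sketch under stated assumptions, not the code from the linked answer: `escapeCsv`, `toCsvLines` and the `Trade` record are hypothetical names, values are formatted with the invariant culture, and nested values such as options or lists are simply converted to strings rather than flattened.

```fsharp
open System
open System.Globalization
open Microsoft.FSharp.Reflection

/// Quote a field when it contains a comma, a quote, or a line break.
let escapeCsv (field: string) =
    if field.IndexOfAny([| ','; '"'; '\n'; '\r' |]) >= 0 then
        "\"" + field.Replace("\"", "\"\"") + "\""
    else
        field

/// Turn a sequence of records into CSV lines:
/// a header row taken from the field names, then one row per record.
let toCsvLines (rows: seq<'T>) : string list =
    if not (FSharpType.IsRecord(typeof<'T>)) then
        invalidArg "rows" "toCsvLines expects an F# record type"
    let header =
        FSharpType.GetRecordFields(typeof<'T>)
        |> Array.map (fun p -> escapeCsv p.Name)
        |> String.concat ","
    let formatRow (row: 'T) =
        FSharpValue.GetRecordFields(box row)
        |> Array.map (fun v -> escapeCsv (Convert.ToString(v, CultureInfo.InvariantCulture)))
        |> String.concat ","
    header :: [ for row in rows -> formatRow row ]

// Hypothetical record used only for the example.
type Trade = { Ticker: string; Price: float; Note: string }

let lines =
    toCsvLines [ { Ticker = "MSFT"; Price = 101.5; Note = "buy, hold" } ]
// lines = [ "Ticker,Price,Note"; "MSFT,101.5,\"buy, hold\"" ]
```

Because the header and the values both come from the record definition, columns cannot be transposed by hand; the resulting lines can be written out with `System.IO.File.WriteAllLines`.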

formatting

My current workflow is to do calculations in F#, save to CSV, then do literate programming in Rmarkdown documents for tables and figures. The blocking issue for using F# formatting is automatic LaTeX table formatting of fancy regressions. I guess integration with R LaTeX formatting via RProvider is possible, but I haven't tried it.

I think it will be hard to make a lot of progress here, because the first step is to have the statistical models, then second formatting for it. The holdup is the statistical models.

packages used most often

  • FSharp.Data
  • MathNet.Numerics
  • PSeq

Overall

  1. The real limiting factor is easy integration with statistical models. The .NET way is weird and lacks a lot of stuff in R or Stata or SAS; the DSL work by Matthias would have the most impact, coupled with modern standard error functions. But I know the only way for this to happen is contributors. RProvider would be fine for models, except that I want literate formatting too so I might as well just use R.

  2. The proposed (I think) "#r paket FSharp.Data " syntax would make it far easier for beginners in scripts.

  3. Figuring out project/solution files is still the thing that took me the longest to get. My only purpose is to put common code used across multiple .fsx files in xxx.fs files. Leaving in .fsx is problematic if A.fsx has common code used in B.fsx and C.fsx, but C.fsx also needs to load B.fsx.
    There probably needs to be documentation showing how to go from a simple script file to a larger project. Simple, but important.

@aolney

aolney commented Jul 24, 2018

At the risk of piling on, since I was mentioned in an earlier post, I thought I'd give an update on Beaker for polyglot notebook programming. If this is an unfamiliar concept, basically it means you have a computational notebook that is simultaneously connected to multiple language kernels, and accordingly you can program in any of the corresponding languages across cells. So you can munge some data in F# and then in the next cell do some statistical modeling in R.

I've been using Beaker for several years with real workloads, and it works very well. The project has recently pivoted towards supporting Jupyter, with a fairly huge loss of functionality during the pivot. The best current polyglot alternative within Jupyter seems to be the SoS kernel, which can also be used in JupyterLab. So far I've only used SoS for small workloads, but it seems very solid.

In Beaker I've had notebooks that use F#, R, Scala, Groovy, Java, and Javascript, using each where it works best (and potentially has a library dependency that I need). From my perspective, this is far better than trying to bring libraries developed in other languages into F# because:

  • Native libraries are always current
  • Native libraries have the best documentation/support
  • Converted libraries can be more difficult/less fluent to use than native (sorry RProvider)

Polyglot notebooks can have some issues, but these seem to mostly be self-inflicted. For example, Beaker had specific kernel connection code for each supported language, making it difficult to maintain dozens of kernels. Also autotranslation (passing data structures between kernels through the notebook) is a cool and often touted feature that can be difficult to implement well with many edge cases. If autotranslation had a very basic implementation, then many of the associated problems would disappear. In practice I've found it's not really that useful except for passing configuration information between cells (e.g. file paths) because data of non-trivial size needs to hit the disk anyways, where it can be read by other kernels.

Anyways, it seems that polyglot notebooks are here to stay. I habitually use F# within this context (favorite language naturally) but use other languages in the notebook when their native support is a more natural fit. As far as F# kernels are concerned, since Jupyter has replaced Beaker, the ifSharp kernel is the best F# kernel to keep moving forward. I've used ifSharp with SoS/Jupyter and it works great.

@zyzhu

zyzhu commented Aug 20, 2018

@aolney Thanks for sharing your experience. Your points clarify my confusion about Beaker and BeakerX. I took a quick look at SoS. It seems that it requires setting up a language module similar to https://github.com/vatlab/sos-r/tree/21883327750a1089066e8933843131d6271bfd74 so that SoS can interop F# with other languages. I found the documentation here https://vatlab.github.io/sos-docs/doc/documentation/Language_Module.html

Is that how you got started using IfSharp on SoS? Any possibility of creating a pull request to share your language module with SoS, so that it can support the IfSharp kernel out of the box? That would help the F# community get started with SoS notebooks.

@aolney

aolney commented Aug 20, 2018

I've been using a Jupyterlab installation but I think the process is similar for Jupyter notebook.

I believe this is all it takes:

pip install jupyterlab
jupyter labextension install jupyterlab-sos
jupyter kernelspec install ifsharp 

In other words there's no need for a new language module b/c ifSharp works with Jupyter.

My understanding (could be wrong) is that the links you provided are only needed for certain functionalities like autotranslation and syntax coloring. They may also be needed for future capabilities like intellisense and linting.

@zyzhu

zyzhu commented Aug 20, 2018

@aolney Thanks for clarifying. Yes. I was interested in autotranslation so that variables between python and R can be used in F# and vice versa. I already got F# kernel working on SoS.

One step further is to autotranslate between pandas dataframes and Deedle frames. But that requires Feather format support, as that's how it's done between R and python right now. https://github.com/vatlab/sos-r/blob/21883327750a1089066e8933843131d6271bfd74/src/sos_r/kernel.py#L83

You mentioned data needs to be dumped to a file before being consumed by another language. I see that's how it's done between pandas and matlab now https://github.com/vatlab/sos-matlab/blob/d818cb93b8988bb8ecf9e4910c12fe7ab9538e73/src/sos_matlab/kernel.py#L102 We can do that instead to support Deedle and pandas dataframe interaction.

Guess this would be another long-term project. At least the path looks clear.

@siavash-babaei

siavash-babaei commented Nov 21, 2020

So, say everything existing is updated and working nicely. Well, a proper RProvider would give you access to whatever is missing in F# and just about anything in python and then some. RProvider though has not worked for quite some time since R 3.5.

Missing, Missing, Missing:

  • Data Frames: For R and python (through pandas), data frames are kind of the primary core data structure that one would work with, especially in the case of in-memory data, and almost every library, package, and algorithm is aware of them and utilizes them in one way or another. So much so that many big data tools simply chunk large data into manageable data frames and take it from there. On the other hand, no matter how good and effective Deedle is at handling data frames, Accord.NET, Math.NET Numerics, ML.NET, etc. are not aware of Deedle data frames and cannot directly consume them (please correct me if I am wrong ...), which severely limits their usage.
  • Visualization: In terms of visualizations, both R and python have superb capabilities in ggplot and Matplotlib. FSharp.Charting is not nearly as good. In addition, R has Shiny and python has Bokeh to handle interactive visualization and dashboard, etc similar to what Microsoft Power BI and Tableau offer in terms of preparing interactive reports and dashboards - although obviously less commercially polished ... So far as I know, F# has nothing similar.
  • Data Retrieval: FSharp.Data is great but it could be expanded upon perhaps incorporating some other TypeProviders as well maybe even supporting data file formats from R, Python, MATLAB, HDFS, SAS, SPSS. The ability to communicate with databases - Whether SQL or NoSQL - and various data sources, is of course of paramount importance. I have not seen TypeProviders for say MongoDB, Cosmos DB, HBase ... As an example, the last update on MongoDB.FSharp which is one of few such projects is 6+ years old and the link on MongoDB website is 8 years old: https://www.mongodb.com/blog/post/enhancing-the-f-developer-experience-with-mongodb.
  • Big Data: I believe a significant proportion of data analytics workflows are still in-memory, using a standard panel of models for regression/classification/clustering tasks, and do not yet involve stuff like big data, Spark, Deep Learning, and the like. However, not having those capabilities is a deal-breaker. Standard tools of the trade for that are Spark, Keras, Tensorflow; probably a couple of others would complete the list of necessary tooling. We have Spark for .NET, and ML.NET is supposed to offer Tensorflow support, so I suppose the best hope lies with Microsoft and then maybe F# API/wrappers in time. It's a pity though that ML.NET is written in C#; had it been F#, maybe it would have helped propel F# similar to what Spark did for Scala, not to mention that F# would have been a lot more suitable at the core than OO C#. Nonetheless, ML.NET is an obvious candidate for inclusion in FsLab and for bigger involvement from F# lads. It offers a very welcome, more unified approach to doing data science, plus support for Tensorflow and ONNX.
  • Light Intuitive Syntax: Since most of the time, you are doing exploratory analysis and prototyping, quick turn-around is a very important feature. What we have in the likes of ML.NET and Accord.NET is very C#, too awkward and verbose for quick and dirty hacking. In R, you would do:
          model <- lm(data = scores, score ~ age * sex)
  and then, from this `model` object, you can extract whatever you need, including statistics, 
  coefficients and confidence intervals, error estimates, etc, even diagnostic plots, with some 
  pretty intuitive names. To me, doing the same thing as above and almost perfect in F# would 
  go like:
          let model = 
              let data = scores
              let response = [ "score" ]
              let predictors = [ "age"; "sex" ]
              (data, response, predictors)
              |> linearModel ModelType.OLS CrossEffects.Multiplicative
with `model` object perhaps being a record type with fields corresponding to coefficients 
table, error estimates, basic statistics, etc. 
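
To make the "model as a record" idea concrete, here is a deliberately reduced sketch: ordinary least squares with a single numeric predictor, returning a record of results. It is not the `linearModel` function imagined above (no formula handling, no interaction terms, no standard errors), and `LinearModel`, `simpleOls` and the field names are hypothetical.

```fsharp
/// Hypothetical result record; the field names are illustrative only.
type LinearModel =
    { Intercept: float
      Slope: float
      RSquared: float
      Residuals: float [] }

/// Ordinary least squares for one predictor: y = intercept + slope * x.
/// Assumes the x values are not all identical.
let simpleOls (xs: float []) (ys: float []) : LinearModel =
    if xs.Length <> ys.Length || xs.Length < 2 then
        invalidArg "xs" "need two or more paired observations"
    let meanX = Array.average xs
    let meanY = Array.average ys
    // Sum of cross-deviations and sum of squared deviations of x.
    let sxy = Array.fold2 (fun acc x y -> acc + (x - meanX) * (y - meanY)) 0.0 xs ys
    let sxx = xs |> Array.sumBy (fun x -> (x - meanX) ** 2.0)
    let slope = sxy / sxx
    let intercept = meanY - slope * meanX
    let residuals = Array.map2 (fun x y -> y - (intercept + slope * x)) xs ys
    let ssRes = residuals |> Array.sumBy (fun r -> r * r)
    let ssTot = ys |> Array.sumBy (fun y -> (y - meanY) ** 2.0)
    { Intercept = intercept
      Slope = slope
      RSquared = 1.0 - ssRes / ssTot
      Residuals = residuals }

// Exactly linear toy data: score = 1 + 2 * age.
let fit = simpleOls [| 1.0; 2.0; 3.0; 4.0 |] [| 3.0; 5.0; 7.0; 9.0 |]
// fit.Slope = 2.0 and fit.Intercept = 1.0 for this data
```

Everything a caller needs is then a plain field access such as `fit.Slope` or `fit.Residuals`, which is the kind of discoverability the R `model` object offers.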

@siavash-babaei

siavash-babaei commented Nov 21, 2020

For whatever product, you would require a few killer features that would make it indispensable, and for F#, it could easily be the entire data analytics and data science workloads. The same thing that greatly helped propel python to the front. The user base, especially, being more mathematically inclined and comfortable with the syntax (I just love/adore it but dunno why it makes lots of people uncomfortable), ideas of immutability and the core of the language being input -> function -> output, would be much better adopters than, say, developers active in GUI or web. There are other areas I am sure, for example, business applications that fit nicely with Domain-Driven Design. But data science workloads - incidentally, a perfect match for DDD - are certainly worth the investment, especially as they seem to be exponentially growing both in volume and utilisation. If you think about it, one of the most active open source big data projects, Spark, is only 7 years old - with many users adopting a difficult language like Scala just to use full Spark capabilities and performance. The community as a whole seems to be more accepting of learning and of new tech that makes their life easier.
FsLab could be that unified environment for data analytics pipelines with a comprehensive suite of up-to-date tools accessible from whatever OS, with pieces that have the necessary awareness of each other. Kind of pointless to have a data frame that cannot be readily consumed within the tools that you use to analyse your data: ML.NET has its own data frame and btw, it seems very inferior to that of Deedle; and, Accord.NET has its own extremely horrible way of consuming data in the form of arrays. It is going to be an involved process though starting from selecting a set of standard features the community and more importantly, the language requires in this regard - a lot of input needed from developers and more importantly, users. Further steps could even involve attracting corporate support and money. Ideally, you would end up in an environment like MATLAB, R, or Julia, where you can readily hack quick-and-dirty, just as well as develop polished applications (very clumsy and difficult to do in R/MATLAB and unsound/non-performant in python).

@siavash-babaei

siavash-babaei commented Nov 21, 2020

Corporate support could be subtle, could be a lot of things from adoption and critique to code contribution to money, marketing, etc. For example,

  • make F# code usable directly from within SQL Server and Power BI (remembering that a big component there, M Language, is inspired by F#) the same as is currently the case with R and Python; and/or,
  • a backend (or some form of help with one) for ML.NET and SPARK.NET in carefully designed stable idiomatic F# 5.0 with superb documentation, tutorials, etc.
  • encouragement of more visibility and publicised utilisation: if you are doing something in F#, make a note of it somewhere on your website ... publish a link to that in some forums/blogs, yadi-yada-yada ...

@dsyme
Member Author

dsyme commented Dec 10, 2020

This discussion continues at fslaborg/FsLab#3

Please join us there!

@dsyme dsyme closed this as completed Dec 10, 2020