direct lake - columns into memory #115
You can already disable all the queries to DirectLake by disabling the "Read statistics from data" setting. |
If I turn off the 'Read statistics from data' I don't get back the column cardinality (which is quite important) because that was being calculated by doing a DISTINCTCOUNT against every column in the model. Why isn't this using the Statistics_DistinctStates from TMSCHEMA_COLUMN_STORAGES? Then you don't need to run all that DAX and put the columns into memory. That seems to potentially solve the problem in an easier fashion. Also, I'm not sure if Vertipaq Analyzer considers only the columns where IsResident=True for the memory usage (for Direct Lake models). That would be useful (if it's not done already). |
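A minimal sketch of the DMV-first idea suggested above, assuming the storage rows have already been fetched into plain dictionaries. Only `Statistics_DistinctStates` and `IsResident` are names taken from this thread; the `TotalSize` key, the function names, the row shape, and the rule that the statistic is only trusted for resident columns are hypothetical and not how VertiPaq Analyzer is known to work.

```python
from typing import List, Optional


def cardinality_from_dmv(row: dict) -> Optional[int]:
    """Return a cardinality taken from a DMV row, or None when it cannot be trusted.

    Assumption: the statistic is only meaningful once the column is resident.
    """
    if not row.get("IsResident", False):
        return None
    distinct = row.get("Statistics_DistinctStates")
    if distinct is None or distinct < 0:
        return None
    return int(distinct)


def resident_memory(rows: List[dict]) -> int:
    """Sum the size of resident columns only ("TotalSize" is a hypothetical key)."""
    return sum(r.get("TotalSize", 0) for r in rows if r.get("IsResident", False))


# Illustrative rows, not real DMV output.
rows = [
    {"Column": "Sales[Amount]", "IsResident": True,
     "Statistics_DistinctStates": 1200, "TotalSize": 4096},
    {"Column": "Sales[FactId]", "IsResident": False,
     "Statistics_DistinctStates": 0, "TotalSize": 0},
]
cardinalities = {r["Column"]: cardinality_from_dmv(r) for r in rows}
```

With this shape no DAX query is issued, so non-resident columns stay out of memory and simply report an unknown cardinality.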
Good point - I have to investigate. Originally, we were using the dictionary size as column cardinality (before we had TMSCHEMA DMVs) and the DISTINCTCOUNT calculation was the only way to get the right number as the dictionary can be larger than the actual unique values in the data (it happens with incremental refresh or when you refresh a partition). |
What should we do in the meantime? I've had customers complain about this and I've told them not to use Vertipaq Analyzer for now. There are many others who simply don't know this is happening behind the scenes and I feel they should know. Large models are more likely to be checked by Vertipaq Analyzer and those are exactly the ones for which this is problematic. |
Well, the code is open source, if you have time for a pull request... :) |
I don't think any of the DMVs will have cardinality information until the column is paged into memory. I think a better option might be to treat DL models like DQ models and assume that reading the cardinality statistics is potentially very expensive. We could change the "Read statistics from Direct Query Tables" option to be "Read statistics from Direct Query / Direct Lake Tables". This is already disabled by default, but if people want to take the performance hit to get the cardinality they can choose to do so. A better option for DL models is probably for people to run a notebook against the source delta tables to view the cardinality.
@dgosbell I would use two separate settings because, depending on the size of DirectLake databases, the approach could be very different, and I might have scenarios where I use different defaults for the two cases. I am not sure about the default, though. There are reasons why it should be enabled also for DirectLake, but if we remove the statistics collection for all the columns, then it would be clear that something is missing and users should act accordingly. |
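As a rough illustration of the two-settings idea, the sketch below gates the statistics queries per storage mode. The class, field, and function names are hypothetical, and the Direct Lake default of `False` is only an assumption for the example, since the default is explicitly undecided in this thread.

```python
from dataclasses import dataclass


@dataclass
class StatsSettings:
    # Global switch, mirroring "Read statistics from data".
    read_stats_from_data: bool = True
    # Described above as already disabled by default for Direct Query.
    read_stats_from_direct_query: bool = False
    # Separate switch for Direct Lake; this default is an assumption.
    read_stats_from_direct_lake: bool = False


def should_query_stats(storage_mode: str, settings: StatsSettings) -> bool:
    """Decide whether DAX statistics queries may run for a table."""
    if not settings.read_stats_from_data:
        return False
    if storage_mode == "DirectQuery":
        return settings.read_stats_from_direct_query
    if storage_mode == "DirectLake":
        return settings.read_stats_from_direct_lake
    return True  # import tables keep the current behavior
```

Keeping the two flags apart lets a user opt in to a full Direct Lake scan without also enabling expensive queries against Direct Query sources.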
Yes, a separate setting would also work. And you are right that it should be very rare that it's needed for DQ models, but there may be situations where someone wants to force a full scan of their DL model. I'm thinking that it might make sense for client tools to have some visual indicator, like an icon or a different background color, when a column has isResident=false, as that will indicate that the cardinality and size columns are probably not accurate.
Yep, I want to spend time on it to improve VertiPaq Analyzer for these cases. I don't have time soon, unfortunately, but it's on the list. |
I'll have a look at doing a pull request for this. |
This is also relevant for Tabular Editor 3, so I'm just posting here to be notified of any updates. Thanks @dgosbell! |
I'm not sure this is the right way to go long-term. |
I don't think people would need "real-time" analysis. It would be more of a specific point in time snapshot. I don't really like the idea of doing an automated refresh.
The problem here is that users can currently turn off "Read statistics from data" and generate an "incomplete" file like this today. The problem I see with doing a "complete" analysis with Direct Lake models is the definition of "complete". Is it:
The problem with querying every column is you could have a hidden column like a fact ID which normally would not cause issues in a DL model if it was not queried, but VPA could cause issues either evicting other columns from memory or forcing a fallback to DQ mode. And obviously with reports in other workspaces and external tools and Excel reports it's tricky to accurately cover columns used in reports. We'd have to ask the user to try and do this manually which is not ideal. One possible compromise might be to use DISCOVER_CALC_DEPENDENCY to get columns referenced in measures and then collect stats for those plus any columns involved in relationships. Would that collect enough information for something like DAX Optimizer? Then we could have an enum rather than a boolean for the Direct Lake options. And we could do something like the following:
It would probably also make sense to store this in the VPAX somewhere so that we can easily tell what option was selected when generating the file and tools like DAX Optimizer can then treat them differently if it needs to. And a client tool like DAX Studio could display a warning with a button to trigger a higher level of scan if the full scan was not performed. What do you think of this as an approach? |
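The enum idea could be sketched roughly as below. The level names Minimal, ResidentOnly, and Complete appear in this discussion, but their exact semantics here, the `Column` type, the `columns_to_scan` helper, and the `directLakeScanLevel` metadata key are all assumptions made for illustration, not part of the VPAX format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class DirectLakeScan(Enum):
    RESIDENT_ONLY = "ResidentOnly"  # only columns already in memory
    MINIMAL = "Minimal"             # assumed: columns used by measures or relationships
    COMPLETE = "Complete"           # every column, regardless of cost


@dataclass
class Column:
    name: str
    is_resident: bool
    is_referenced: bool  # referenced by a measure or involved in a relationship


def columns_to_scan(columns: List[Column], level: DirectLakeScan) -> List[str]:
    """Pick the columns whose statistics would be collected at a given level."""
    if level is DirectLakeScan.COMPLETE:
        return [c.name for c in columns]
    if level is DirectLakeScan.MINIMAL:
        return [c.name for c in columns if c.is_referenced]
    return [c.name for c in columns if c.is_resident]


def scan_metadata(level: DirectLakeScan) -> dict:
    """Record the chosen level so a consumer can tell how the file was produced."""
    return {"directLakeScanLevel": level.value}  # hypothetical key
```

A client tool could read the recorded level and show a warning with an option to rerun a more intensive scan when anything below Complete was used.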
Good points. We are already working on an extension of VPAX called TCDX - Tabular Consumer Dependencies eXchange libraries. It's not public yet. It keeps track of dependencies in different ways, from static analysis to log processing. TCDX could be the "minimal" tracking. From my point of view, limiting the analysis to measure dependencies within the model is not very useful unless you assume that any visible column is used - and it's still incorrect in several cases (users can use hidden columns anyway). Moreover, DirectLake has the same behavior as Large data models for VertiPaq, so it's not really a separate use case - I guess that it's more likely you want a Minimal or ResidentOnly scan with DirectLake rather than with VertiPaq, but there are benefits in doing that also for regular imported models. I think that we should:
Thoughts? |
My thinking behind the minimal option was providing just enough information for someone or some tool (i.e. DAX Optimizer) to suggest options for explicit measure optimization. If we could add regularly used columns using information from the TCDX, I think that might also meet this goal of finding a happy middle ground between ResidentOnly and Complete. What do you think here - is this an achievable goal? Or am I oversimplifying here, and the minimal option is never going to be of much value?
I did consider whether having something like ResidentOnly would make sense for an import model in the Large storage format. It would not be hard to do, but the large format does not have as much overhead for paging in a non-resident column since there is no transcoding. And an import model has to fit within the memory limits for the premium SKU it is running on. So, I was thinking the little bit of extra time to bring in non-resident columns for a large format import model was probably worth it. The challenge with Direct Lake is that the total data size can be larger than the premium SKU memory limits. I can see a situation where it might be impossible to do a "complete" scan, since by the time we've finished scanning the last batch of columns for accurate cardinality, the first batch of columns has been paged out.
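The paging concern can be illustrated with a toy simulation. It assumes a simple least-recently-used eviction policy and made-up column sizes; the engine's real eviction behavior is not described in this thread, so treat this only as a sketch of why a "complete" scan may never have all columns resident at once.

```python
from collections import OrderedDict
from typing import Dict, List


def simulate_full_scan(column_sizes: Dict[str, int], memory_limit: int) -> List[str]:
    """Scan columns in order and return those still resident at the end.

    Toy model: loading a column evicts the oldest resident columns until it fits.
    """
    resident: "OrderedDict[str, int]" = OrderedDict()
    used = 0
    for name, size in column_sizes.items():
        # Evict the oldest columns while the new one does not fit.
        while resident and used + size > memory_limit:
            _, evicted_size = resident.popitem(last=False)
            used -= evicted_size
        resident[name] = size
        used += size
    return list(resident)


# Three equally sized columns with room for only two of them.
still_resident = simulate_full_scan({"A": 4, "B": 4, "C": 4}, memory_limit=8)
```

In this toy run the first column is evicted before the scan finishes, so statistics that depend on residency would already be stale for it.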
Agreed. I'm very aware that DAX Studio is only one client of these libraries. I've started experimenting in my own fork to see what's possible, but I want to make sure we reach a consensus on the best strategic implementation before doing a pull request that will change the behaviour. I was not aware of the Tcdx work, but if it makes sense for the broader use cases for VPAX I don't have objections to including this. But I'm also aware that more and more people are trying out Direct Lake, so I'd like to try and get some smarter defaults in place sooner rather than later.
Agreed. Currently I'm turning off the "Read statistics from data" option globally if I only want to look at resident columns and it's a pain. There are a number of different options. I could add a dialog where the default settings could be overridden for the current run. Or if something lower than a complete scan was done I could display a warning bar with buttons to run one of the more intensive scans. It probably makes sense for @otykier and me to use a similar approach across both TE and DS if we can.
Ok, I'm traveling this week, but I want to make sure we make Tcdx visible in 1-2 weeks so we can start working on a long-term plan for VPAX according to the considerations we made in this thread. I'll keep you posted! |
The new |
Yes, we should use the same default in DAX Studio once we implement that - @dgosbell do you agree? |
When VertiPaq Analyzer is run (inside DAX Studio) against a Direct Lake model, it fires off a bunch of DAX queries which essentially put all the model's columns into memory. This defeats the purpose of a Direct Lake model, where only the columns a user queries are put into memory. VertiPaq Analyzer should have a configuration option so that it does not do this by default for Direct Lake models, and so that it obtains as many properties as possible from DMVs and as few as possible from DAX statements. IMO, the default setting for Direct Lake should be to not run any DAX queries against the model, so as not to put any unnecessary columns into memory. The current behavior is highly problematic for Direct Lake.