Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate column names coming back as array, instead of using the last column #1384

Open
blitzmann opened this issue Apr 17, 2022 · 6 comments

Comments

@blitzmann
Copy link
Contributor

blitzmann commented Apr 17, 2022

It seems that either the documentation on how duplicate column names are handled is incorrect, or the implementation is incorrect. Documentation notes:

If your queries contain output columns with identical names, the default behaviour of mssql will only return column metadata for the last column with that name.

Expected behaviour:

I would expect that, by default, any query with duplicate column names would take the last column and return the keyed object using that data.

{
  recordsets: [ [ [Object] ] ],
  recordset: [ { id: 2 } ],
  output: {},
  rowsAffected: [ 1 ]
}

Actual behaviour:

The actual behavior seems to group the two data sets together into an array.

{
  recordsets: [ [ [Object] ] ],
  recordset: [ { id: [1, 2] } ],
  output: {},
  rowsAffected: [ 1 ]
}

Script:

const sql = require("mssql")
;(async () => {
    await sql.connect({
        user: "sa",
        password: "Admin12345",
        database: "tempdb",
        server: "localhost",
        options: {
            trustServerCertificate: true, // change to true for local dev / self-signed certs
        },
    })
    const result = await sql.query`SELECT 1 as id, 2 as id`
    console.dir(result)
})()

Software versions

  • NodeJS: 14.17.3
  • node-mssql: tested with 8.1 and 7.3.1
  • SQL Server: 2017 (docker image mcr.microsoft.com/mssql/server:2017-latest-ubuntu)
@dhensby
Copy link
Collaborator

dhensby commented Apr 19, 2022

So the behaviour for duplicate columns being presented as arrays was introduced in #1240 (although this is only in v8 and no v7). This only changed the values for the values presented and didn't affect the metadata (which is what your quote regarding the documentation).

At the moment, the behaviour is expected, so from that point of view this is not a valid bug. However, if the mis-match in returned metadata and the actual returned data structures, then that seems valid.

I'll look into it a bit to see what is going on there.

@blitzmann
Copy link
Contributor Author

Oh, I may have mis read it then. If it's only dealing with the metadata instead of the actual values coming back, that would make sense.

Would it be possible to introduce an option that returns just the last value for multiple columns names? Basically, using the last columns meta data as well as the last columns value, instead of just using the last columns metadata while grouping all the values into a collection? This would bring the mssql package more in-line with how other SQL packages handle duplicate column names. The following who how different drivers in TypeORM return the query SELECT 1 as id, 2 as id, and as you can see mssql is the lone wolf in that it returns both values (for better or for worse)

========
better-sqlite3
        [{"id":2}]
sqlite
        [{"id":2}]
sqlite-2
        [{"id":2}]
postgres
        [{"id":2}]
cockroachdb
        [{"id":"2"}]
mysql
        [{"id":"2"}]
mariadb
        [{"id":2}]
mssql
        [{"id":[1,2]}]

If we could get an option (not even defaulted to avoid breaking changes), then we could implement that option within TypeORM and bring the returned resultset in line with what we expect from other drivers.

Thoughts?

@dhensby
Copy link
Collaborator

dhensby commented Apr 20, 2022

The change to return an array of values was brought in deliberately in v8 of the library. I'm not sure that going back would be a good idea, nor adding a feature flag for this.

Whilst I appreciate other libraries appear to be handling this in a defacto-standard way, I'm not sure standardised data loss is a good idea and, instead, users should be writing better queries where these duplicates aren't there OR it's dealt with on the application layer.

It should be fairly trivial to process the returned rows to extract the last value of the array if the value is an array. I suppose the complexity is determining the difference between intended arrays (parsed JSON data, though I'm not sure we parse it, that might be something the drivers do) vs constructed arrays from multiple values, which is why the metadata structure should better represent the recordset data-structure.

@blitzmann
Copy link
Contributor Author

The change to return an array of values was brought in deliberately in v8 of the library.

Then there must be some confusion here. I've tested on both v7 and v8, and both return an array when multiple columns of the same name return. Thoughts? Please see below version as reported by npm list

ryan@RYANDESKTOP MINGW64 /h/git/typeorm ((0.3.6))
$ npm list | grep mssql
Debugger listening on ws://127.0.0.1:56474/438f166b-c964-42fa-8310-49d19798edc1
For help, see: https://nodejs.org/en/docs/inspector
Debugger attached.
+-- [email protected]
Waiting for the debugger to disconnect...

I disagree with the notion that we shouldn't do this because "data loss is bad" but also lose metadata associated with all but the last duplicate column anyway, according to the documentation...?

It seems to me that it is intuitive to think that, as a js driver for a RDMS, the result would be an array of records, with objects only a single level deep (as this is how RDMS return data).

I agree it is trivial to check for an array in some sort of post-processing step, and take the last value. I also understand that we could even use arrayRowMode and build out the raw object we would expect. We could do any of these outside of the mssql package and be fine.

However, I still believe that the current default is unexpected behavior, even if it's technically accurate, and requires you to dig into why it's doing it that way when the vast majority of other drivers default to something more intuitive. I can appreciate why it is done this way, and even why it's defaulted (the whole idea of giving the developer exactly what they asked for, don't just assume they want the last one), but I still think it would be a good idea to allow this to be a general option that can be passed into the config. Something along the lines of

const sql = require("mssql")
;(async () => {
    await sql.connect({
        user: "sa",
        password: "Admin12345",
        database: "tempdb",
        server: "localhost",
        options: {
            trustServerCertificate: true, // change to true for local dev / self-signed certs
        },
        duplicateColumnsToArray: false, //can default to true
    })
    const result = await sql.query`SELECT 1 as id, 2 as id`
    console.dir(result)
})()

This way, it's documented when first becoming acquainted with the library and the user will then be able to determine form the get go what they would prefer.

I hope you reconsider, but at the same time understand that you guys are the experts and know best. If the above proposal is out of scope, feel free to close the issue 🙂 Thanks!

@dhensby
Copy link
Collaborator

dhensby commented Apr 21, 2022

Firstly, it definitely looks like there's something not quite right because the returned metadata and the returned data don't match up quite right. At the moment you'll get the metadata from both columns but they aren't shown as arrays, so it's difficult to actually associate the metadata with the row data. That's something to look into.

Secondly, regarding v7.3 + v8, maybe this was always acting this way, or it was brought in earlier. But that's a bit of a distraction - the point is it is currently intentional behaviour.

To address the main crux of this issue: Should duplicate columns return as an array of values or just replace the last seen value?

In my day-to-day use of databases, I don't encounter duplicate column names like this, so it's entirely an academic exercise for me. My unsympathetic response is "write better SQL", but I appreciate with automated query building this can be harder - however I am involved in automated DB building and SQL builders that are capable of not having duplicate column names.

If duplicate names are unavoidable, (rhetorically) why is the last column the one you want? I can see from a naive implementation you'd do:

const record = {};
dbquery.on('column', (name, value) => {
  record[name] = value;
});

and you'd end up with the last column there... but if you were slightly more thorough and did:

const record = {};
dbquery.on('column', (name, value) => {
  if (!(name in record)) {
    record[name] = value;
  }
});

then you have different behaviour where the first value is retained and I would say that's sensible too.

As a developer who knows they will have duplicate names in their queries, do I automatically want the first or the last value? I'm not sure it's obvious, that feels like a toss-up. What if I want both/all? What if I want the first not the last? What if I want the second (if there are more than two)? It just all feels like we have to make an opinionated choice where there's no obvious correct answer.

We could add an option to always return the last item, but then what if someone wants an option to always return the first one, or another configuration? This just feels like a spiral into endless options for every edge-case. With the current implementation (an array of values) the consumer of the library can make this choice freely because they have all the values at their disposal and their DB interface can just pluck out the values they want/need fairly trivially.

It seems to me that it is intuitive to think that, as a js driver for a RDMS, the result would be an array of records, with objects only a single level deep (as this is how RDMS return data).

It's not right to say the RDMS returns data this way, the RDMS returns all the columns, including the duplicated ones, which is how we are able to present them all... It just appears other libraries have made the decision to override the data on duplicate cols.

I'll definitely think on this, as I appreciate if lots of other DB libs do it this way, it's not fun to be the odd one out. For my understanding, can you justify to me why it makes sense to return the last duplicate named value over the first? What circumstances can you suggest where this is just so more obvious than any other value (eg: typical queries, etc where this is the obvious choice)?

@blitzmann
Copy link
Contributor Author

For my understanding, can you justify to me why it makes sense to return the last duplicate named value over the first?

and

We could add an option to always return the last item, but then what if someone wants an option to always return the first one, or another configuration? This just feels like a spiral into endless options for every edge-case.

I definitely see where you're coming from, it absolutely is an opinionated decision regardless of which direction is taken, and I appreciate the concern about getting into the potential madness of "option for everything. The only thing that I can offer as justification of taking the last column vs taking any other column is simply for uniformity with other packages and protocols, where this seems to be the default and expected behavior. And again, I'm not advocating for it to be defaulted, but rather an option that the user can opt into.

If I was issuing direct commands, I can't think of a situation where I would want to return two of the same column other than it being a mistake. But I confess that my background in this issue is driven in part due to my use of a query builder that is built on top of TypeORM, where something like this is more likely to happen.

Specifically, I'm referencing a querybuilder that takes an OData string and builds a query representing it using TypeORM. The current functionality is problematic because a) the user is the one determining the query, not the developer, and b) some generalizations have to be made with how the query is actually built, in which it's very difficult to account for all edge cases (for example, selecting the joined column twice in certain circumstances). Of course, these could definitely be seen as bugs on the part of the third-party library and how they generate the query, but could also be considered just a byproduct of unexpected behavior by mssql and outside the 'norm', however good-intentioned returning [1, 2] instead of just 2 is.

It's not right to say the RDMS returns data this way, the RDMS returns all the columns, including the duplicated ones, which is how we are able to present them all... It just appears other libraries have made the decision to override the data on duplicate cols.

Sorry, I was more talking about how they conceptually return results, not technically. RDMS's return their data in a 2-dimentional result which lends itself well to an array (resultset) of objects (row) with properties (column) set to the value. It seems to me to be counter-intuitive to thrown in additional dimensions by default 😉

As you said, and I agree, it is trivial to iterate through and just take the last column in order to emulate the other packages, or use arrayRowMode and do it manually. I just think it would be nice to have the option to have mssql do that instead while it's already in the business of getting the data together, rather than having to loop through the resultset and clean it up in the higher-level application.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants