-
Notifications
You must be signed in to change notification settings - Fork 920
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replaced std::string with std::string_view and removed excessive copies in cudf::io #17734
base: branch-25.02
Are you sure you want to change the base?
Conversation
Nice optimization. General question, many of the strings in the PR (except for paths etc) are very tiny (< 10 chars) which may not impact performance or space otherwise. Should we just leave them as is? |
@@ -106,13 +107,13 @@ class selected_rows_offsets { | |||
/** | |||
* @brief Removes the first and Last quote in the string | |||
*/ | |||
string removeQuotes(string str, char quotechar) | |||
std::string_view remove_quotes(std::string_view str, char quotechar) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This utility function is inside of detail
namespace, so perhaps it doesn't matter that much. But from an API standpoint, having std::string_view
as return type should be avoided, since it's very easy to cause UB and program crash:
std::string_view remove_quotes(std::string_view str, char quotechar);
auto sv = remove_quotes(foo_return_a_string(), '\"');
std::cout << sv; // Dangling view. UB.
Also, do we know if the quote chars are always the first and last chars? For a string hello world
, if the quote char is l
, then the initial implementation results in helo word
while the new implementation gives lo wor
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I thought this was a very good use of std::string_view
because the output is solely dependent on the scope of the input std::string_view str
-- it's fate is the same. There is no reason to create a new std::string
if the output is only a sub-view of the input. That said, Tianyu's example does indeed show UB. But this is no different than managing any other view and is a calling pattern to be avoided.
Example in libcudf:
auto sv = cudf::slice( *cudf::sequence(100,0), {20,50} );
// the sv is pointing to a column that has been destroyed
The cudf::slice and cudf::split
APIs are meant to work with cudf::column_view
's and provide a quick way to partition columns without copying data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just as @davidwendt said, we have view types already in cudf, i.e. column_view
and table_view
, they'd have same UB behaviour. I don't think it's a problem.
For the implementation behaviour, My understanding of escaping quotes was to drop until the first and last quote character, the input you used as an example looks like a malformed one.
i.e. "xyz"
with "
as quotechar will give xyz
I don't think the transformation xy"z"
=> xyz
is either a valid input or a desired transformation as far as CSVs are concerned.
I can rename the function to be trim_quotes
as that's what it's now doing. I'm also open to reverting the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better if the comment mentions that str
should not be a temporary and that the quote char is the first and last character of the string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternate implementation which may be a bit cleaner/faster
std::string_view trim_quotes( std::string_view str, char quotechar) {
int first = str.front() == quotechar;
int last = str.size() - (str.back()==quotechar);
return str.substr(first,last-first);
}
Here is a mini-test: https://godbolt.org/z/463nbhqqf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davidwendt's proposal is good, it does less work here. It short-circuits if the find
(rfind
) misses on the first (last) character.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, but we could have spaces/tabs between the start and end of them column name or other tokens that need to be discarded
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This information should be added to the comment description.
only a few, most of them aren't tiny (<16 characters) in the general case, regardless, incurring extra allocations even from small strings (>16 && < 1'000) when they aren't needed isn't ideal either as they lead to memory fragmentation for long-running programs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved. But I still have a different opinion that std::string_view
should not be used in associative containers.
Looks like we have build errors from replacing strings with string_views at certain places! |
Looks like we are getting some runtime errors in libcudf with this PR as well |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like some python tests are now failing due to extra "
chars in some of the output cols
Co-authored-by: David Wendt <[email protected]>
Description
As part of the improvement effort discussed in #15907, this merge request removes some of the excessive
std::string
copies and usesstd::string_view
in place ofstd::string
when the lifetime semantics are clear.std::string
is only replaced in this MR in linear functions and constructors, but not in structs as there's no established ownership or lifetime semantics to guarantee thestring_view
s will not outlive their source.There were also some cases of excessive copies, i.e. consider:
In the above example, the string is likely to be allocated twice if a temporary/string-literal is used to construct "s": one for the temporary and one for the copy constructor for
str
The string is only allocated once in all scenarios.
This also applies to
std::vector
and is arguably worse as there's no small-vector-optimization (i.e.std::string
's small-string-optimization/SSO).Checklist