Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replaced std::string with std::string_view and removed excessive copies in cudf::io #17734

Open
wants to merge 9 commits into
base: branch-25.02
Choose a base branch
from

Conversation

lamarrr
Copy link
Contributor

@lamarrr lamarrr commented Jan 14, 2025

Description

As part of the improvement effort discussed in #15907, this merge request removes some of the excessive std::string copies and uses std::string_view in place of std::string when the lifetime semantics are clear.

std::string is only replaced in this MR in linear functions and constructors, but not in structs as there's no established ownership or lifetime semantics to guarantee the string_views will not outlive their source.
There were also some cases of excessive copies, i.e. consider:

struct source_info{
source_info(std::string const& s) : str{s}{}

private:
std::string str;
};

In the above example, the string is likely to be allocated twice if a temporary/string-literal is used to construct "s": one for the temporary and one for the copy constructor for str

struct source_info{
source_info(std::string  s) : str{std::move(s)}{}

private:
std::string str;
};

The string is only allocated once in all scenarios.
This also applies to std::vector and is arguably worse as there's no small-vector-optimization (i.e. std::string's small-string-optimization/SSO).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@lamarrr lamarrr requested a review from a team as a code owner January 14, 2025 16:23
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jan 14, 2025
@lamarrr lamarrr changed the title replaced std::string with std::string_view and removed excessive copies in cudf::io Replaced std::string with std::string_view and removed excessive copies in cudf::io Jan 14, 2025
@lamarrr lamarrr added feature request New feature or request improvement Improvement / enhancement to an existing function non-breaking Non-breaking change and removed feature request New feature or request labels Jan 14, 2025
@mhaseeb123
Copy link
Member

Nice optimization. General question, many of the strings in the PR (except for paths etc) are very tiny (< 10 chars) which may not impact performance or space otherwise. Should we just leave them as is?

@@ -106,13 +107,13 @@ class selected_rows_offsets {
/**
* @brief Removes the first and Last quote in the string
*/
string removeQuotes(string str, char quotechar)
std::string_view remove_quotes(std::string_view str, char quotechar)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This utility function is inside of detail namespace, so perhaps it doesn't matter that much. But from an API standpoint, having std::string_view as return type should be avoided, since it's very easy to cause UB and program crash:

std::string_view remove_quotes(std::string_view str, char quotechar);
auto sv = remove_quotes(foo_return_a_string(), '\"');
std::cout << sv; // Dangling view. UB.

Also, do we know if the quote chars are always the first and last chars? For a string hello world, if the quote char is l, then the initial implementation results in helo word while the new implementation gives lo wor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I thought this was a very good use of std::string_view because the output is solely dependent on the scope of the input std::string_view str -- it's fate is the same. There is no reason to create a new std::string if the output is only a sub-view of the input. That said, Tianyu's example does indeed show UB. But this is no different than managing any other view and is a calling pattern to be avoided.

Example in libcudf:
auto sv = cudf::slice( *cudf::sequence(100,0), {20,50} );
// the sv is pointing to a column that has been destroyed

The cudf::slice and cudf::split APIs are meant to work with cudf::column_view's and provide a quick way to partition columns without copying data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just as @davidwendt said, we have view types already in cudf, i.e. column_view and table_view, they'd have same UB behaviour. I don't think it's a problem.
For the implementation behaviour, My understanding of escaping quotes was to drop until the first and last quote character, the input you used as an example looks like a malformed one.

i.e. "xyz" with " as quotechar will give xyz
I don't think the transformation xy"z" => xyz is either a valid input or a desired transformation as far as CSVs are concerned.

I can rename the function to be trim_quotes as that's what it's now doing. I'm also open to reverting the change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better if the comment mentions that str should not be a temporary and that the quote char is the first and last character of the string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternate implementation which may be a bit cleaner/faster

std::string_view trim_quotes( std::string_view str, char quotechar) {
    int first = str.front() == quotechar;
    int last = str.size() - (str.back()==quotechar);
    return str.substr(first,last-first);
}

Here is a mini-test: https://godbolt.org/z/463nbhqqf

Copy link
Contributor

@bdice bdice Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidwendt's proposal is good, it does less work here. It short-circuits if the find (rfind) misses on the first (last) character.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, but we could have spaces/tabs between the start and end of them column name or other tokens that need to be discarded

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This information should be added to the comment description.

@lamarrr
Copy link
Contributor Author

lamarrr commented Jan 15, 2025

Nice optimization. General question, many of the strings in the PR (except for paths etc) are very tiny (< 10 chars) which may not impact performance or space otherwise. Should we just leave them as is?

only a few, most of them aren't tiny (<16 characters) in the general case, regardless, incurring extra allocations even from small strings (>16 && < 1'000) when they aren't needed isn't ideal either as they lead to memory fragmentation for long-running programs.

@mhaseeb123 mhaseeb123 removed their assignment Jan 15, 2025
Copy link
Contributor

@kingcrimsontianyu kingcrimsontianyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. But I still have a different opinion that std::string_view should not be used in associative containers.

@mhaseeb123
Copy link
Member

Looks like we have build errors from replacing strings with string_views at certain places!

@mhaseeb123
Copy link
Member

Looks like we have build errors from replacing strings with string_views at certain places!

Looks like we are getting some runtime errors in libcudf with this PR as well

Copy link
Member

@mhaseeb123 mhaseeb123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like some python tests are now failing due to extra " chars in some of the output cols

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Status: Burndown
Development

Successfully merging this pull request may close these issues.

6 participants