[go: nahoru, domu]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed and optimized split methods #11

Open
wants to merge 14 commits into
base: dev
Choose a base branch
from

Conversation

Guiorgy
Copy link
Contributor
@Guiorgy Guiorgy commented Jan 28, 2024

I also replaced the custom InvalidCountExceedingBehaviourException with the builtin ArgumentException since that is what string.Split throws when StringSplitOptions is invalid. If that is unacceptable, the commit can be dropped.

I also renamed the CountExceedingBehaviour options to, what I think, better explains them. Also reworded the summary comments. Again, if you disagree with that, you may drop that commit.

There's 1 more gatcha in string.Split: If separators (delimiters in SplitAny in our case) is an empty collection, the function automatically assumes that we wish to split the string with white spaces:

If the separator parameter is null or contains no characters, white-space characters are assumed to be the delimiters

Sigh... So there are several ways to handle this:

  • Ignore it, and throw an exception if delimiters is empty.
  • Do what .NET does, or in other words, go through the source span first, get all whitespace characters present in a remporary array (which will allocate -_-), and then go through the source span once more to split it.
  • Write an alternative implementation where instead of checking if a character from source span is part of delimiters, we check if the character is a whitespace character using char.IsWhiteSpace(). Though I am assuming this is slower than the option above, otherwise why would .NET do it differently.
  • Any other thoughts?

@dragon7307
Copy link
Member

Looks good at first glance. I'll review, slightly adapt the style to better match the project, and merge it.
I originally created, that custom exception to avoid having to throw two different exceptions for before and after .Net 7, but if string.Split uses ArgumentException, then this sounds good to me. I'll look into the behaviour for null or empty delimiters as well, though I would prefer not to throw an exception. We'll see...

@dragon7307
Copy link
Member

The methods look good to me, but I would still prefer not to merge the yet, since I'd like to conduct some unit tests to compare any any potential differences. Maybe waiting until the test suite is finished??

@Guiorgy
Copy link
Contributor Author
Guiorgy commented Jan 29, 2024

Sure. The only failing tests is with an empty delimiters span, since we haven't implemented that yet. Though, as you said, the tests are a bit lacking still.

Also, fyi SpanExtensions still has StringSplitOptions and count reversed : Split(this Span<char> source, char delimiter, StringSplitOptions options, int count.

@dragon7307
Copy link
Member
dragon7307 commented Jan 30, 2024

Also, I have looked a bit more to the tests you've written and I believe, we should probably merge them into a feature branch
of dev with the src files from dev and the tests files from your pr. Just so we have them somewhat up to date.

@Guiorgy
Copy link
Contributor Author
Guiorgy commented Jan 31, 2024

I have a local branch with uptodate commits from the dev and tests branch for testing. Do you want me to merge the dev branch into tests, or make a separate branch for that?

@dragon7307
Copy link
Member

I have a branch in this repo, called "feature-unit-testing", in which I have adapted your tests branch as of yesterday to better suit the project's style and have brought it up to date with dev. Just take a look at the changes i made. if you agree with them, we have two options: we either merge any eventual changes you have locally into this branch and continue working on this branch, which is based of off dev, into which it will eventually be merged, or you push any changes you have made to tests to your fork and we then merge my updated branch with yours into your fork.
If you don't agree with the changes I made, then yes, do as you propose, and merge dev into a branch created from tests.

@dragon7307
Copy link
Member

Also, unless I have gotten something dramatically wrong, we could refactor the monster that is GetAllStringSplitOptions to the following:

public static StringSplitOptions[] GetAllStringSplitOptions()
{
    return (StringSplitOptions[])Enum.GetValues(typeof(StringSplitOptions)); 
}

@Guiorgy
Copy link
Contributor Author
Guiorgy commented Jan 31, 2024

I'll look into them, and probably just merge them into the tests branch. It was going to be merged with dev eventually anyways.

As for the GetAllStringSplitOptions method, the problem here is that StringSplitOptions is an enum marked with the Falgs attribute, which means you can combine options with the or (|) operator.

Try running this:

var options = (StringSplitOptions[])Enum.GetValues(typeof(StringSplitOptions));
Console.WriteLine(string.Join(", ", options));

var options2 = GetAllStringSplitOptions().Select(o => o.ToString().Replace(",", " |"));
Console.WriteLine(string.Join(", ", options2));

The output should look like this:

None, RemoveEmptyEntries, TrimEntries
None, RemoveEmptyEntries, TrimEntries, RemoveEmptyEntries | TrimEntries

While Enum.GetValues does get all defined enum values, it does not return their combinations when the enum is marked as flags.

The one I wrote was the simplest I could find, since the alternative would be trying to generate all permutations of the enums returned by Enum.GetValues, and remove duplicates (since Enum.GetValues also returns None, and None | TrimEntries is the same as TrimEntries).

If you have an alternative, don't hesitate to suggest.

@dragon7307
Copy link
Member

How about just hard-coding the values into an array?? They are unlikely to soon be changed and the current method seems overkill for such a simple purpose for a test suite.

@Guiorgy
Copy link
Contributor Author
Guiorgy commented Jan 31, 2024

An alternative would be to manually hardcode all possible values, but you'd have to remember to update them if a new one was ever added, and if a lot of values were added, the number of all possible combinations would explode in size. I wanted to make something that (hopefully) doesn't need to ever be modified again.

The only reason why that code wouldn't work anymore (as stated in the comment next to the Debug.Assert), is if a new option was added that for some reason skipped a bit (for example, TrimEntries = 1 /* binary 001 */, SomeNewOption = 4 /* binary 100 */), which shouldn't happen (fingers crossed).

If you prefer hardcoding those values, I'm fine with changing that.

@dragon7307
Copy link
Member
dragon7307 commented Jan 31, 2024

Then let's leave it that way

@Guiorgy Guiorgy mentioned this pull request Feb 4, 2024
53 tasks
@Guiorgy Guiorgy force-pushed the fixed-and-optimized-split-methods branch from 25689c3 to 68d321b Compare February 5, 2024 09:18
@Guiorgy Guiorgy force-pushed the fixed-and-optimized-split-methods branch from e3ad377 to 8bb8994 Compare February 5, 2024 11:48
@Guiorgy Guiorgy force-pushed the fixed-and-optimized-split-methods branch from 4a5dacf to 3993f24 Compare February 7, 2024 16:37
@dragon7307 dragon7307 modified the milestones: Version 1.3.0 , Backlog Feb 7, 2024
@Guiorgy
Copy link
Contributor Author
Guiorgy commented Feb 19, 2024

Run into another edge case when expanding the fuzzing with parameterized ([Theory]) testing.

If the separator span is empty in Split(source, delimiter) the enumeration never stops and keeps returning empty spans forever. Yikes.

Looking at string.Split(string, string), if the separator string is empty, it assumes count = 1 and goes from there. I'll do the same, unless you have other thoughts?

if (separator!.Length == 0)
{
    count = 1;
    goto ShortCircuit;
}
else
{
    return SplitInternal(separator, count, options);
}

Edit: Wait, I tested that against the old implementation (before this PR), give me a sec xd
Edit2: It happens in this PR too 😬
Edit3: Fixed

@dragon7307
Copy link
Member
dragon7307 commented Feb 19, 2024

The current implementation tests for the delimiterLength in every iteration of Move, This is unnecessary, as delimiter.Length doesn't change. The check should probably be in the constructor instead.

@Guiorgy
Copy link
Contributor Author
Guiorgy commented Feb 19, 2024

I know that, and that's what I do in the ones that have count a a parameter. But I don't really see a way to do it from the constructor in the ones that don't.

When delimiter is empty, we need to shorcircuit, return the whole span and exit. The if statement that does that, checks for delimiterIndex which changes on every iteration, so there's no way to influence that from the constuctor. The only way I see is to either return an alternative IEnumerator, or add an additional check in the MoveNext() method.

Edit: If IndexOf returned -1 when delimiter was empty, that would also work.

The ones that have count already do check if CurrentCount <= 1, which can be iffluenced from the constructor, and is exactly what I did.

If you can come up with something, I am all ears.

@dragon7307
Copy link
Member
dragon7307 commented Feb 19, 2024

One could at least do the comparision once in the constructor and the have a field like DelimiterIsEmpty, which is set in the constructor to delimiter.IsEmpty and then checked...

@Guiorgy
Copy link
Contributor Author
Guiorgy commented Feb 19, 2024

That is already basically what I do though?

DelimiterLength = Delimiter.Length; // cached in constructor
...
if(delimiterIndex == -1 || DelimiterLength == 0)

The only difference is that instead of a boolead, I am comparing an int to 0, but considering booleans are basically an int that is compared to 0 to check state, there's no real difference.

@dragon7307
Copy link
Member

Yes, but in theory your version should trigger an invocation of the comparision operator and thereby the Equals method.
And sure, this is something that the compiler will optimize away, but since we cannot predict which compiler the consumer of this library will use, it is better to stay stafe, because maybe the official compiler optimizes it, but maybe someone uses mono, and maybe mono doesn't optimize this, or IL2CPP doesn't. Also, a boolean is in C#, unlike in say C, the size of a byte not of an int.

As you can see here, in jit asm a dword (4 bytes) is used for the comparison for the int . On the other hand, here with the boolean approach, a byte is used. That is a difference in memory allocation. Now, whether this will actually result in a measurable performance difference is debattable, but as a matter of fact, they produce different jit asm and hence work differently. The asm is in favour of using the boolean.

Here are the screenshots top to bottom, on the first is the approach with (DelimiterLength) and the second DelimiterIsEmpty:

image
image

And yes that is micro optimization, but so is the mere concept of Spans in C#.
Also, the example uses an int instead of generics, as generics can't be jit compiled by sharplab in c# and dotnet.

@Guiorgy Guiorgy force-pushed the fixed-and-optimized-split-methods branch from 8486faf to 2d377c0 Compare February 19, 2024 19:25
@Guiorgy
Copy link
Contributor Author
Guiorgy commented Feb 20, 2024

Looking at the some of the changes you made recently on dev, you renamed CountExceedingBehavior enum entries.

I had also done that in this PR, as I had mentioned previously, and was hoping for you input on that.

Personally, instead of AppendLastElements or AppendRemainingElements I prefer IncludeRemainingElements, so include instead of append.

The way string.Split words it is:

[...] the last string in the returned array will contain this instance's remaining trailing substring, [...]

So, since contain and include are synonyms, I felt like it was appropriate.

As for CutLastElements and CutRemainingElements, I went with DropRemainingElements, since it's an antonym to include.

However, this isn't really that important, so if you say so, I'll update to the names you came up with.

@dragon7307
Copy link
Member
dragon7307 commented Feb 20, 2024

I see; I agree it isn't that important now. I get the idea of the antonyms; however I felt and do still feel that 'append' is more descriptive as to where the remaining elements are actually included (they're appended to the last element). The downside of this approach is that the antonym is not necessarilly a good fit for the contrary behaviour. There is no need to Update the pull request right now as the differences are growing so complex that I will need to merge it into a helper branch and then resolve many conflicts.

The final decision on the name is going to be made before the release of version 1.3.0, because I do not want to introduce a breaking change later on.

@Guiorgy
Copy link
Contributor Author
Guiorgy commented Feb 20, 2024

There's 1 more thing missing in the current implementation. All the Span Split Enumerators work and return ReadOnlySpan, even if the source is a Span<T>. This means that the following is impossible:

char[] source = "Lorem ipsum dolor sit amet...";

// Split source by space and capitalize the first letter of every word
foreach (var word in source.AsSpan().Split(' '))
{
    if (word.Length == 0) continue;

    word[0] = word[0].ToUpper(); // impossible, since word is a ReadOnlySpan<char>
}

To fix this, we'd need to duplicate every split enumerable and make it work with normal mutable spans, and update all extension methods and tests. We could do that now, or maybe try to make a source generator first to not have to do all that manually?

@dragon7307
Copy link
Member

That seems like a job for the to be source generator

@Guiorgy Guiorgy force-pushed the fixed-and-optimized-split-methods branch from 64e12ac to bb825aa Compare February 24, 2024 16:23
@Guiorgy Guiorgy force-pushed the fixed-and-optimized-split-methods branch from bb825aa to 591df77 Compare February 24, 2024 16:52
@Guiorgy
Copy link
Contributor Author
Guiorgy commented Feb 24, 2024

I also added a check for validity of StringSplitOptions, which throws an exception when invalid. This is to copy what string.Split does.

@Guiorgy
Copy link
Contributor Author
Guiorgy commented Feb 25, 2024

We should decide what are the final names for CountExceedingBehaviour, so I can fix the 2 PRs.

@dragon7307
Copy link
Member

I am pretty set on AppendRemainingElements and CutRemainingElements.

@Guiorgy
Copy link
Contributor Author
Guiorgy commented Feb 25, 2024

I am pretty set on AppendRemainingElements and CutRemainingElements.

Ok, on it
Edit: Done

@dragon7307 dragon7307 modified the milestones: Backlog, Version 1.4 Mar 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants