Reduce allocations in Get-Content cmdlet #8103
Conversation
| /// <param name="isRawStream">
| /// Indicates raw stream.
| /// </param>
| public FileSystemContentReaderWriter( |
It is an unneeded and unused internal constructor, and we can safely remove it. (Otherwise we would have to fix it to match our changes in the other constructor with the delimiter parameter.)
PaulHigin left a comment
Overall this looks good to me. I agree that the offset Dictionary could be a perf issue for simple delimiter characters. Have you performed any tests to verify perf gains? I imagine they would be significant, in some case, by switching to use Span.
| blocks.Add(contentRead);
| }
| return _reader.Peek() != -1; |
Doesn't this always have to be -1 when reading a raw stream to the end?
| // we can read to generate another possible match.
| // If we read more characters than this, we risk consuming
| // more of the stream than we need.
| _offsetDictionary = new Dictionary<char, int>(_delimiter.Length); |
This looks a little bit dangerous. If _usingDelimiter is set to true anywhere else, _currentLineContent and _offsetDictionary will be null during later reads. I feel we should at least have an assert where these are used, to ensure they are not null and to warn us if the code somehow gets changed.
If somebody makes a wrong change, our tests will crash with a null reference exception and we'll never merge the code. I don't know where we could put the assert or how it would help more than the null reference exception.
I haven't found a quick way to fix this. Will investigate.
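For context, a minimal sketch of the kind of assert being discussed. The class, constructor signature, and method here are hypothetical; only the field names mirror the conversation, and this is not the PR's actual code.

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;

// Hypothetical sketch, not the PR's code: shows one place an assert
// could guard the delimiter-related state.
internal sealed class DelimitedReaderSketch
{
    private bool _usingDelimiter;
    private readonly Dictionary<char, int> _offsetDictionary;
    private readonly StringBuilder _currentLineContent;

    public DelimitedReaderSketch(string delimiter)
    {
        _usingDelimiter = !string.IsNullOrEmpty(delimiter);

        if (_usingDelimiter)
        {
            // All delimiter state is created once, here in the constructor.
            _offsetDictionary = new Dictionary<char, int>(delimiter.Length);
            _currentLineContent = new StringBuilder();
        }
    }

    public void ReadWithDelimiter()
    {
        // Fails loudly in debug builds if _usingDelimiter ever becomes
        // true without the related state being initialized.
        Debug.Assert(
            !_usingDelimiter || (_offsetDictionary != null && _currentLineContent != null),
            "Delimiter state must be initialized when _usingDelimiter is true.");

        // ... delimiter-based read path would go here ...
    }
}
```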
No. I discovered this accidentally while reading the code. Once I realized that we allocate on the heap for every line, and in the worst case for every char, I did not see the point of running PerfView.
I think this can be:
_offsetDictionary[lowByte] = _delimiter.Length - i - 1;
For duplicate chars, the last index will end up in _offsetDictionary[lowByte].
Also, do we need to be worried about char collisions using the low byte?
Also, do we need to be worried about char collisions using the low byte?
There is a comment about this in the code. If we have a file with chars from many languages, we'll get many collisions and lose performance. In practice, we can expect files to contain chars from one or two languages, so in the common case we'll have no collisions and the best performance.
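To illustrate the mapping being discussed, here is a minimal sketch of a Horspool-style bad-character shift table keyed by the low byte of each delimiter char. It uses a flat int[256] purely for illustration rather than the Dictionary shown in the diff, and the helper name is hypothetical, not the PR's code.

```csharp
// Illustration only (not the PR's code): a bad-character shift table
// keyed by the low byte of each delimiter char.
private static int[] BuildLowByteShiftTable(string delimiter)
{
    var table = new int[256];

    // Default: a char that is not in the delimiter allows a full-length shift.
    for (int b = 0; b < table.Length; b++)
    {
        table[b] = delimiter.Length;
    }

    // The last occurrence wins for duplicate low bytes; a collision between
    // different chars only produces a shorter (still safe) shift.
    for (int i = 0; i < delimiter.Length - 1; i++)
    {
        byte lowByte = unchecked((byte)delimiter[i]);
        table[lowByte] = delimiter.Length - i - 1;
    }

    return table;
}
```

A collision only means a char gets a smaller shift than it could, so the search stays correct and merely loses a little speed.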
I think this can be
Thanks! Good catch!
Fixed.
currentChar can be defined below where it is used.
Force-pushed from 4a7488a to 07af4ca
@anmenaga @SteveL-MSFT Do you have any comments?
PR Summary
The current implementation does huge extra allocations for every line of a file and, at worst, for every char (!). The fix removes these extra allocations by moving the initializations into the constructor and using a Span for the temporary buffer.
The Boyer-Moore string search algorithm was also optimized with a simple mapping of Unicode chars to a byte, which works well for common text files that contain chars from one or two languages.
You can review commit by commit.
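For reference, the general shape of the "Span for a temporary buffer" idea mentioned above. This is an assumed sketch, not the actual FileSystemContentReaderWriter code: the class and method names are made up, and the buffer size is arbitrary.

```csharp
using System;
using System.IO;

// Assumed sketch of the Span-based temporary buffer pattern,
// not the code in this PR.
internal static class SpanBufferSketch
{
    public static string ReadChunk(TextReader reader)
    {
        // The temporary buffer lives on the stack, so reading a chunk
        // does not allocate a char[] on the heap.
        Span<char> buffer = stackalloc char[256];
        int read = reader.Read(buffer);

        // Only the final string is allocated.
        return read > 0 ? new string(buffer.Slice(0, read)) : string.Empty;
    }
}
```

The same idea applies per read; the one-time setup (such as the delimiter table) moving into the constructor is what removes the per-line allocations described above.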
PR Checklist
- .h, .cpp, .cs, .ps1 and .psm1 files have the correct copyright header
- Add WIP: to the beginning of the title and remove the prefix when the PR is ready
- [feature] if the change is significant or affects feature tests