Question

Create a string synopsis

Given an unknown string of unknown size, e.g. a ScriptBlock expression or something like:

$Text = @'
LOREM IPSUM

Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
'@

I would like to summarize the string on a single line (replacing every run of consecutive whitespace with a single space) and truncate it to a specific $Length:

$Length = 32
$Text = $Text -Replace '\s+', ' '
if ($Text.Length -gt $Length) { $Text = $Text.SubString(0, $Length) }
$Text
LOREM IPSUM Lorem Ipsum is simpl

The issue is that for a large string this is inefficient: it replaces the whitespace in the whole $Text string, where I only need to replace the first few whitespace runs until I have a string of the required size ($Length = 32).
Swapping the -replace and SubString operations isn't an option either, as that would return fewer characters than required, or even just a single space for any $Text string that starts with something like 32 whitespace characters.
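
To illustrate why swapping the two operations falls short, here is a minimal sketch. The padded sample string is made up for this illustration and is not part of the original data:

```powershell
$Length = 32
$Padded = (' ' * 32) + 'LOREM IPSUM'   # hypothetical input that starts with 32 spaces

# Truncating first keeps only the 32 leading spaces, which then collapse to one.
$Swapped = $Padded.Substring(0, $Length) -replace '\s+', ' '
$Swapped.Length   # 1

# Replacing first keeps the text, at the cost of scanning the whole string.
$Wanted = $Padded -replace '\s+', ' '
$Wanted           # ' LOREM IPSUM'
```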

Question:
How can I efficiently merge the two (-replace and SubString) operations so that I am not replacing more whitespace than necessary and still get a string of the required length (in case the $Text string is longer than the required length)?


Update

I think I am close by using a MatchEvaluator Delegate:

$Length = 8
$TotalSpaces = 0
$Delegate = {
    if ($Args[0].Index - $TotalSpaces -gt $Length) {
        '{break}'
        ([Ref]$TotalSpaces).Value = [int]::MaxValue
    }
    else { ([Ref]$TotalSpaces).Value += $Args[0].Value.Length }
}
[regex]::Replace('test 0 1 2 3 4 5 6 7 8 9', '\s+', $Delegate)
test01234{break}56789

Now the question is: how can I break out of the regex processing at the {break}?
Note that for performance reasons I really want to break out and not substitute the <regular-expression> with the found match (which makes it look like it stopped).
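
One possible way to really stop early, without a MatchEvaluator, is to walk the matches lazily with [regex]::Match and Match.NextMatch() and simply stop asking for further matches once enough characters have been collected. The following is only a sketch under that assumption; the function name Get-Synopsis and its parameters are hypothetical and not taken from the thread:

```powershell
function Get-Synopsis {
    param([string]$Text, [int]$Length = 32)

    $sb = [System.Text.StringBuilder]::new($Length)
    $pos = 0                                  # start of the text not yet copied
    $match = [regex]::Match($Text, '\s+')     # first whitespace run only

    while ($match.Success -and $sb.Length -lt $Length) {
        # Copy the non-whitespace text before this run, capped at what is still needed.
        $take = [Math]::Min($match.Index - $pos, $Length - $sb.Length)
        $null = $sb.Append($Text, $pos, $take)
        # Replace the whole run by one space if there is still room.
        if ($sb.Length -lt $Length) { $null = $sb.Append(' ') }
        $pos = $match.Index + $match.Length
        $match = $match.NextMatch()           # the next run is searched only on demand
    }

    # No whitespace left: copy whatever is still missing from the tail.
    if ($sb.Length -lt $Length -and $pos -lt $Text.Length) {
        $take = [Math]::Min($Length - $sb.Length, $Text.Length - $pos)
        $null = $sb.Append($Text, $pos, $take)
    }

    $sb.ToString()
}

Get-Synopsis -Text 'test 0 1 2 3 4 5 6 7 8 9' -Length 8   # test 0 1
```

Because the loop ends as soon as the builder is full, the regex engine never looks at the rest of the string. Whether this is actually faster than a plain -replace would have to be measured.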


Solution

 4

Perhaps a more manual approach is faster than trying to do it with regex, although it is a lot more code.

$Text = @'
LOREM IPSUM
Lorem   Ipsum is
   simply dummy    text
'@

$Length = 32
$sb = [System.Text.StringBuilder]::new($Length)
$prevSpace = $false   # initialize explicitly so a value from an earlier run cannot leak in

foreach ($char in $Text.GetEnumerator()) {
    if ($sb.Length -eq $Length) {
        break
    }

    if ([char]::IsWhiteSpace($char)) {
        if (-not $prevSpace) {
            $sb = $sb.Append(' ')
        }

        $prevSpace = $true
        continue
    }

    $sb = $sb.Append($char)
    $prevSpace = $false
}

$sb.ToString()

A very similar approach using String.Create would probably be even faster, but it would need to be precompiled or loaded with Add-Type. You can find an example here.

2024-07-24
Santiago Squarzon

Solution

 1

Benchmark:

$Length = 32

$Sizes = 50, 100, 200, 400, 800, 1600 # words
$Strings = @(
    foreach ($Size in $Sizes) {
        -Join @('Word ') * $Size
    }
)

$Iterations = 1000

@(
    $Results = [Ordered]@{ Name = 'Question' }
    for ($i = 1; $i -le $Iterations; $i++) {
        foreach ($String in $Strings) {
            $Results["$($String.Length)"] += (Measure-Command {
                $Text = $String -Replace '\s+', ' '
                if ($Text.Length -gt $Length) { $Text = $Text.SubString(0, $Length) }
                $Void = $Text
            }).TotalMilliseconds
        }
    }
    $Void.Length | Should -be $Length
    [PSCustomObject]$Results

    $Results = [Ordered]@{ Name = 'Santiago' }
    for ($i = 1; $i -le $Iterations; $i++) {
        foreach ($String in $Strings) {
            $Results["$($String.Length)"] += (Measure-Command {
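                $sb = [System.Text.StringBuilder]::new($Length)
                $prevSpace = $false

                # Iterate the string under test ($String), not the leftover $Text from the previous block.
                foreach ($char in $String.GetEnumerator()) {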
                    if ($sb.Length -eq $Length) {
                        break
                    }

                    if ([char]::IsWhiteSpace($char)) {
                        if (-not $prevSpace) {
                            $sb = $sb.Append(' ')
                        }

                        $prevSpace = $true
                        continue
                    }

                    $sb = $sb.Append($char)
                    $prevSpace = $false
                }

                $Void = $sb.ToString()
            }).TotalMilliseconds
        }
    }
    $Void.Length | Should -be $Length
    [PSCustomObject]$Results
) | Format-Table
Name       250   500  1000  2000  4000   8000
----       ---   ---  ----  ----  ----   ----
Question 28.15 12.18 20.19 35.81 68.50 129.03
Santiago 79.05 57.73 54.25 55.33 54.73  54.12
2024-07-24
iRon

Solution

 0

You cannot.
Your best bet to increase efficiency (and I'm not sure by how much) is to first cut the original string down to a substring: you already know you are going to reduce its size anyway, so there is no reason to process a 10 MB file if you only end up needing the first 100 kB.

Something like ($Text.Substring(0, $Length * 2) -replace '\ +', ' ').Substring(0, $Length)
I've used $Length * 2, but you can use any factor you want, depending on how many multiple spaces you realistically expect in the original (sub)string.
I'm guessing anything from $Length * 1.25 up should be enough.
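
Note that Substring throws when the string is shorter than the requested length, and that a fixed factor can still come up short if the text contains a long whitespace run. A hedged sketch of how the idea could be made safe is to clamp the window to the string length and grow it until the collapsed result is long enough. The function name Get-SynopsisByWindow and the -Factor parameter are hypothetical, and '\s+' is used here instead of '\ +' so that line breaks are collapsed as well:

```powershell
function Get-SynopsisByWindow {
    param([string]$Text, [int]$Length = 32, [double]$Factor = 2)

    # Never ask Substring for more characters than the string has.
    $window = [Math]::Min($Text.Length, [int][Math]::Ceiling($Length * $Factor))

    while ($true) {
        $result = $Text.Substring(0, $window) -replace '\s+', ' '
        # Done when enough characters survived or the whole string was processed.
        if ($result.Length -ge $Length -or $window -ge $Text.Length) { break }
        $window = [Math]::Min($Text.Length, $window * 2)   # too much whitespace: widen the window
    }

    if ($result.Length -gt $Length) { $result.Substring(0, $Length) } else { $result }
}

Get-SynopsisByWindow -Text ((' ' * 100) + 'abc def') -Length 5   # ' abc '
```

In the common case only the first window is processed; the loop is just a fallback for inputs with unusually long whitespace runs.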

2024-07-24
sirtao