Thursday 10 July 2014

Regex help please

G'day:
I'm hoping Peter Boughton or Ben Nadel might see this. Or someone else I'm unaware of who is good @ regular expression patterns.

Here's the challenge...



Given this string:

Lorem ipsum dolor sit

I want to extract the leading sub-string which:
  • is no more than n characters long;
  • breaks at the previous whole word, rather than in the middle of a word;
  • if no complete single word fits, contains at least the first word, even if the length of the sub-string is then greater than n.
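Spelled out procedurally, those three rules look something like the sketch below. It is only an illustration of the requirement, not a proposed solution: the name trimToWordPlain is made up, and it assumes words are separated by single spaces with no leading whitespace, so it does not treat whitespace and word boundaries the way \s and \b do in a regex.

```cfml
// trimToWordPlain.cfm (hypothetical file and function name)
string function trimToWordPlain(required string string, required numeric index){
    var length = len(string);

    // Rule 1: the candidate is never longer than `index` characters.
    var cut = min(index, length);

    // Rule 2: step back until the cut sits directly after a non-space character
    // and directly before a space (or the end of the string).
    while (cut > 0){
        var endsOnWord = mid(string, cut, 1) != " ";
        var followedByBreak = cut == length || mid(string, cut + 1, 1) == " ";
        if (endsOnWord && followedByBreak){
            return left(string, cut);
        }
        cut--;
    }

    // Rule 3: no whole word fits, so return the first word
    // even though it is longer than `index`.
    // Assumption: the string does not start with a space.
    var firstSpace = find(" ", string);
    return firstSpace ? left(string, firstSpace - 1) : string;
}
```

With the sample string, trimToWordPlain("Lorem ipsum dolor sit", 3) and trimToWordPlain("Lorem ipsum dolor sit", 10) should both give "Lorem", and a trim point of 16 should give "Lorem ipsum".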

I've come up with this:

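// trimToWord.cfm
string function trimToWord(required string string, required numeric index){
    // First alternative: up to `index` characters, ending on a word character that is followed by whitespace or the end of the string.
    // Second alternative (the fallback): the shortest run ending at a word boundary, which for a string starting with a word is the whole first word.
    return reReplace(string, "^((?:.{1,#index#}(?=\s|$)\b)|(?:.+?\b)).*", "\1", "ONE");
}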

It works, but that regex is a bit hoary.

Here's a visual representation of it (courtesy of regexper.com), by way of explanation:



Anyone fancy improving it for me?

Here are some unit tests to run your suggestions through:



// TestCase.cfc
component extends="testbox.system.BaseSpec" {

    function beforeAll(){
        include "trimToWord.cfm";
        variables.sample = "Lorem ipsum dolor sit";
    }

    function run(){
        describe("Tests for trimToWord()", function(){
            it("works when the trim point is smaller than the first word 'Lorem'", function(index){
                for (var i=1; i <= 5; i++){
                    expect(
                        trimToWord(sample, i)
                    ).toBe(left(sample, 5), "trimming @ #i#");
                }
            });
            it("works when the trim point is between the first and second words 'Lorem ipsum'", function(index){
                for (var i=6; i <= 10; i++){
                    expect(
                        trimToWord(sample, i)
                    ).toBe(left(sample, 5), "trimming @ #i#");
                }
            });
            it("works when the trim point is between the second and third words 'Lorem ipsum dolor'", function(index){
                for (var i=11; i <= 16; i++){
                    expect(
                        trimToWord(sample, i)
                    ).toBe(left(sample, 11), "trimming @ #i#");
                }
            });
            it("works when the trim point is between the third and fourth words 'Lorem ipsum dolor sit'", function(index){
                for (var i=17; i <= 20; i++){
                    expect(
                        trimToWord(sample, i)
                    ).toBe(left(sample, 17), "trimming @ #i#");
                }
            });
            it("works when the trim point is at the end of the string 'Lorem ipsum dolor sit'", function(index){
                expect(
                    trimToWord(sample, 21)
                ).toBe(sample);
            });
        });
    }
}

Cheers.

--
Adam