Visual Basic .NET Search for Repeated Words

From Regex Regular Expression Encyclopedia

You can use this recipe to find words that appear twice in a row on a line, such as "the the".

Code

Imports System
Imports System.IO
Imports System.Text.RegularExpressions
Public Class Recipe
    ' Matches a word, one or more whitespace characters, then the same word again.
    Private Shared _Regex As Regex = New Regex("\b(\w+)\s+\1\b")
    Public Sub Run(ByVal fileName As String)
        Dim lineNbr As Integer = 0
        ' Using closes the reader even if an exception is thrown while reading.
        Using sr As StreamReader = File.OpenText(fileName)
            Dim line As String = sr.ReadLine()
            While line IsNot Nothing
                lineNbr += 1
                ' Report every line that contains a repeated word.
                If _Regex.IsMatch(line) Then
                    Console.WriteLine("Found match '{0}' at line {1}", _
                        line, _
                        lineNbr)
                End If
                line = sr.ReadLine()
            End While
        End Using
    End Sub
    Public Shared Sub Main(ByVal args As String())
        Dim r As Recipe = New Recipe
        r.Run(args(0))
    End Sub
End Class

How It Works

The most important aspect of this regular expression is the back reference, written as \1. The back reference is just a way of saying “whatever you found in the first group.” The parentheses in the expression define the group. Here’s a breakdown of the expression:

Regular Expression    Description
\b                    is a word boundary, followed by . . .
(...)                 a group (explained next), then . . .
\s                    a whitespace character . . .
+                     one or more times, then . . .
\1                    whatever was found in the group, and lastly . . .
\b                    a word boundary.

The group is simply (\w+), which is as follows:

Regular Expression    Description
\w                    a word character . . .
+                     found one or more times.
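
To see what the back reference refers to, the following minimal sketch applies the expression from the breakdown to a sample string and prints both the whole match and the contents of the first group. The module name BackReferenceDemo and the sample text are made up for illustration and are not part of the recipe.

```vbnet
Imports System
Imports System.Text.RegularExpressions

' Hypothetical demo module, not part of the original recipe.
Public Module BackReferenceDemo
    Public Sub Main()
        ' Same expression as in the breakdown: word, whitespace, same word again.
        Dim pattern As New Regex("\b(\w+)\s+\1\b")
        Dim m As Match = pattern.Match("Paris in the the spring")
        If m.Success Then
            ' m.Value is the whole match; Groups(1) is what (\w+) captured,
            ' which is also the text that \1 had to match a second time.
            Console.WriteLine("Whole match: '{0}'", m.Value)
            Console.WriteLine("Repeated word: '{0}'", m.Groups(1).Value)
        End If
    End Sub
End Module
```

For this sample the whole match is 'the the' and the first group holds 'the'.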

The group matches a single word. The expression begins and ends with a word boundary anchor to prevent it from matching a string such as quarterback backrub, where back ends one word and begins the next. If the word boundary anchors are removed, the expression will start matching parts of words.
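
The effect of the word boundary anchors can be checked with a short comparison. This is only a sketch: the module name WordBoundaryDemo and the variable names are invented for the example, and the unanchored expression is simply the recipe's expression with both \b anchors removed.

```vbnet
Imports System
Imports System.Text.RegularExpressions

' Hypothetical demo module, not part of the original recipe.
Public Module WordBoundaryDemo
    Public Sub Main()
        Dim anchored As New Regex("\b(\w+)\s+\1\b")
        ' Assumption for illustration: the same expression without the \b anchors.
        Dim unanchored As New Regex("(\w+)\s+\1")
        Dim sample As String = "quarterback backrub"

        ' With anchors, no whole word is repeated, so there is no match.
        Console.WriteLine("Anchored match found: {0}", anchored.IsMatch(sample))

        ' Without anchors, the tail of the first word pairs with the head of the second.
        Console.WriteLine("Unanchored match: '{0}'", unanchored.Match(sample).Value)
    End Sub
End Module
```

The anchored expression reports False, while the unanchored one matches 'back back' across the two words.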
