Visual Basic .NET Search for Repeated Words Across Multiple Lines

From Regex Regular Expression Encyclopedia


This recipe finds repeated words even when the two occurrences are separated by a line break. For example:

   word
   word

Code

Imports System
Imports System.IO
Imports System.Text.RegularExpressions

Public Class Recipe
    ' \b(\w+)         a whole word, captured as group 1
    ' (\s*$\s*|\s+)   a line end with optional whitespace around it, or plain whitespace
    ' \1\b            the same word again, ending on a word boundary
    Private Shared _Regex As Regex = New Regex("\b(\w+)(\s*$\s*|\s+)\1\b", _
        RegexOptions.IgnoreCase Or RegexOptions.Multiline)

    Public Sub Run(ByVal fileName As String)
        ' Using closes the reader even if reading or matching throws.
        Using sr As StreamReader = File.OpenText(fileName)
            ' Read the whole file into one string so a match can span lines.
            Dim contents As String = sr.ReadToEnd()
            For Each myMatch As Match In _Regex.Matches(contents)
                Console.WriteLine("Found match '{0}'", myMatch.ToString())
            Next
        End Using
    End Sub

    Public Shared Sub Main(ByVal args As String())
        Dim r As Recipe = New Recipe
        r.Run(args(0))
    End Sub
End Class

How It Works

The “magic” part of this expression is the RegexOptions.Multiline option passed to the constructor of the Regex class, which allows the $ anchor to match at the end of each line as well as at the end of the string. The distinction matters here because the ReadToEnd() method of the StreamReader loads the entire contents of the file into one string, even though the contents span multiple lines in the file. The first word can have some whitespace between it and the end of its line, and the repeated word can have some whitespace between the beginning of the next line and itself. The part of the expression that matches this is \s*$\s*, which breaks down as follows:

Regular Expression    Description
\s                    whitespace . . .
*                     that’s optional . . .
$                     followed by the end of a line . . .
\s                    then some more whitespace . . .
*                     that’s optional.
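The effect of RegexOptions.Multiline on the $ anchor can be seen in isolation with a much smaller pattern. The following is a minimal sketch rather than part of the recipe: the module name, the sample text, and the pattern \w+$ are all assumptions chosen for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module MultilineDemo
    Public Sub Main()
        ' Hypothetical sample: two lines separated by a bare line feed.
        Dim sample As String = "first" & vbLf & "second"

        ' Without Multiline, $ matches only at the end of the string,
        ' so only the last word qualifies.
        Dim plainCount As Integer = Regex.Matches(sample, "\w+$").Count

        ' With Multiline, $ also matches just before each line feed,
        ' so the last word of every line qualifies.
        Dim multiCount As Integer = _
            Regex.Matches(sample, "\w+$", RegexOptions.Multiline).Count

        Console.WriteLine("Without Multiline: {0}", plainCount) ' 1
        Console.WriteLine("With Multiline:    {0}", multiCount) ' 2
    End Sub
End Module
```

The sample uses vbLf rather than a carriage return and line feed pair, because $ matches immediately before the line feed and a preceding carriage return would sit between the word and the anchor.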

Since I wanted to match repeated words on a single line as well as across two lines, the expression must also allow for whitespace between words on the same line. That is the second alternative in the group, \s+, which is the same as in recipe 1-10.
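To see both alternatives at work, the recipe's expression can be run against a string held in memory instead of a file. This is a sketch under stated assumptions: the module name and the sample text are hypothetical, and line feeds are shown as \n in the output only to keep each match on one line.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module RepeatedWordsDemo
    Public Sub Main()
        ' Same expression and options as the recipe.
        Dim re As New Regex("\b(\w+)(\s*$\s*|\s+)\1\b", _
            RegexOptions.IgnoreCase Or RegexOptions.Multiline)

        ' Hypothetical sample: "is is" repeats on one line,
        ' "the" repeats across a line break.
        Dim sample As String = _
            "This is is a test" & vbLf & "of the" & vbLf & "the pattern"

        For Each m As Match In re.Matches(sample)
            ' Replace the line feed so each match prints on a single line.
            Console.WriteLine("Found match '{0}'", m.Value.Replace(vbLf, "\n"))
        Next
        ' Prints:
        ' Found match 'is is'
        ' Found match 'the\nthe'
    End Sub
End Module
```

Note that "This is" is not reported: the leading \b requires the captured word to start on a word boundary, so the "is" inside "This" cannot begin a match.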

Another option is passed in on the constructor to the Regex class: RegexOptions.IgnoreCase. When more than one option is specified, the | operator is used in C# and the Or keyword is used in Visual Basic .NET between the options. The option to ignore case in this expression is used so it will find matches such as This this or The the.
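The difference that RegexOptions.IgnoreCase makes can be checked by counting matches with and without it. The sketch below is illustrative only: the module name and sample text are assumptions, and the pattern is reduced to the same-line case (\s+) to keep the comparison simple.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module IgnoreCaseDemo
    Public Sub Main()
        ' Hypothetical sample with one mixed-case and one same-case repeat.
        Dim sample As String = "The the quick quick fox"
        Dim pattern As String = "\b(\w+)\s+\1\b"

        ' Case-sensitive: the backreference must match "The" exactly,
        ' so only "quick quick" is found.
        Dim strictCount As Integer = _
            Regex.Matches(sample, pattern, RegexOptions.Multiline).Count

        ' Options are combined with Or; with IgnoreCase the backreference
        ' also accepts "the" after "The".
        Dim relaxedCount As Integer = Regex.Matches(sample, pattern, _
            RegexOptions.IgnoreCase Or RegexOptions.Multiline).Count

        Console.WriteLine("Case-sensitive:   {0}", strictCount)  ' 1
        Console.WriteLine("Case-insensitive: {0}", relaxedCount) ' 2
    End Sub
End Module
```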