Visual Basic .NET Search for Repeated Words
From Regex Regular Expression Encyclopedia
You can use this recipe to find a word that is immediately repeated on a line, such as the the.
Code
    Imports System
    Imports System.IO
    Imports System.Text.RegularExpressions

    Public Class Recipe
        ' A word, one or more whitespace characters, then the same word again.
        Private Shared _Regex As Regex = New Regex("\b(\w+)\s+\1\b")

        Public Sub Run(ByVal fileName As String)
            Dim line As String
            Dim lineNbr As Integer = 0
            Dim sr As StreamReader = File.OpenText(fileName)
            line = sr.ReadLine
            While Not line Is Nothing
                lineNbr = lineNbr + 1
                ' Report every line that contains a repeated word.
                If _Regex.IsMatch(line) Then
                    Console.WriteLine("Found match '{0}' at line {1}", _
                        line, _
                        lineNbr)
                End If
                line = sr.ReadLine
            End While
            sr.Close()
        End Sub

        Public Shared Sub Main(ByVal args As String())
            Dim r As Recipe = New Recipe
            r.Run(args(0))
        End Sub
    End Class
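The recipe prints the whole line when it finds a match. If you also want to know which word was repeated and where, something like the following sketch may help. It assumes the pattern \b(\w+)\s+\1\b from the breakdown in this recipe; the module name RepeatedWordReport and the helper Describe are hypothetical names chosen for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module RepeatedWordReport
    ' Assumed pattern: a word, whitespace, then the same word again.
    Private ReadOnly Repeated As New Regex("\b(\w+)\s+\1\b")

    ' Hypothetical helper: returns one description per repeated word in the line.
    Public Function Describe(ByVal line As String) As String()
        Dim found As MatchCollection = Repeated.Matches(line)
        Dim result(found.Count - 1) As String
        For i As Integer = 0 To found.Count - 1
            ' Groups(1) holds the text captured by (\w+); Index is zero-based.
            result(i) = String.Format("'{0}' at column {1}", _
                found(i).Groups(1).Value, _
                found(i).Index + 1)
        Next
        Return result
    End Function

    Public Sub Main()
        ' Prints:
        '   'the' at column 10
        '   'is' at column 25
        For Each item As String In Describe("Paris in the the spring is is lovely")
            Console.WriteLine(item)
        Next
    End Sub
End Module
```

Using Matches rather than IsMatch reports every repeated word on the line, not just the fact that at least one exists.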
How It Works
The most important aspect of this regular expression is the back reference, \1. A back reference is a way of saying “whatever you found in the first group.” The parentheses in the expression define the group. Here’s a breakdown of the expression:
| Regular Expression | Description |
|---|---|
| \b | a word boundary, followed by . . . |
| (...) | a group (explained next), then . . . |
| \s | a whitespace character . . . |
| + | one or more times, then . . . |
| \1 | whatever was found in the group, and lastly . . . |
| \b | a word boundary. |

The group is simply (\w+), which is as follows:

| Regular Expression | Description |
|---|---|
| \w | a word character . . . |
| + | found one or more times. |
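The same breakdown can be written directly into the pattern as comments. The sketch below assumes the expression is \b(\w+)\s+\1\b, as the breakdown describes, and uses RegexOptions.IgnorePatternWhitespace so that each piece can sit on its own line with a note. The module name AnnotatedPattern and the sample sentence are illustrative only.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module AnnotatedPattern
    Public Sub Main()
        ' With RegexOptions.IgnorePatternWhitespace, unescaped whitespace in the
        ' pattern is ignored and "#" starts a comment that runs to the end of the line.
        Dim pattern As String = String.Join(Environment.NewLine, New String() { _
            "\b      # a word boundary", _
            "(\w+)   # group 1: one or more word characters", _
            "\s+     # one or more whitespace characters", _
            "\1      # whatever group 1 captured", _
            "\b      # a word boundary"})

        Dim annotated As New Regex(pattern, RegexOptions.IgnorePatternWhitespace)
        Dim m As Match = annotated.Match("This is the the test")

        Console.WriteLine(m.Success)          ' True
        Console.WriteLine(m.Value)            ' the the
        Console.WriteLine(m.Groups(1).Value)  ' the
    End Sub
End Module
```

Note that "This is" does not match: the is inside This does not start at a word boundary, so only the the is reported.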
The group therefore matches a single word. The whole expression begins and ends with a word boundary anchor, which prevents it from matching a string such as quarterback backrub. If the word boundary anchors are removed, the expression will start matching parts of words, in this case the back back that spans the two words.
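To see the effect of the anchors, the following sketch runs the pattern with and without them against the same string. It assumes the anchored expression is \b(\w+)\s+\1\b; the unanchored variant and the module name BoundaryDemo exist only for this comparison.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module BoundaryDemo
    Public Sub Main()
        Dim text As String = "quarterback backrub"

        ' Assumed recipe pattern: both ends anchored at word boundaries.
        Dim anchored As New Regex("\b(\w+)\s+\1\b")
        ' Same pattern with the anchors removed, for comparison only.
        Dim unanchored As New Regex("(\w+)\s+\1")

        Console.WriteLine(anchored.IsMatch(text))        ' False
        Console.WriteLine(unanchored.Match(text).Value)  ' back back
    End Sub
End Module
```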
