Visual Basic .NET Find Variations on Words

From Regex Regular Expression Encyclopedia

You can use this recipe for finding variations on a word with one search. This particular recipe searches for the strings Jon Doe, John Doe, or Jonathan Doe.

Code

Imports System
Imports System.IO
Imports System.Text.RegularExpressions
Public Class Recipe
    ' Matches Jon Doe, John Doe, and Jonathan Doe as whole words.
    Private Shared _Regex As Regex = New Regex("\bJoh?n(athan)? Doe\b")
    Public Sub Run(ByVal fileName As String)
        Dim line As String
        Dim lineNbr As Integer = 0
        ' Using closes the reader even if an exception is thrown.
        Using sr As StreamReader = File.OpenText(fileName)
            line = sr.ReadLine()
            While line IsNot Nothing
                lineNbr = lineNbr + 1
                If _Regex.IsMatch(line) Then
                    Console.WriteLine("Found match '{0}' at line {1}", _
                        line, _
                        lineNbr)
                End If
                line = sr.ReadLine()
            End While
        End Using
    End Sub
    Public Shared Sub Main(ByVal args As String())
        ' Expect the path of the file to search as the first argument.
        If args.Length = 0 Then
            Console.WriteLine("Usage: Recipe <fileName>")
            Return
        End If
        Dim r As Recipe = New Recipe
        r.Run(args(0))
    End Sub
End Class

How It Works

This expression, \bJoh?n(athan)? Doe\b, works by finding the common and optional parts of a word and searching based on them. John, Jon, and Jonathan are all similar: each starts with Jo and contains an n. What differs is the h in John and the athan ending in Jonathan. The expression breaks down as follows:

Regular Expression    Description
\b                    a word boundary . . .
J                     followed by . . .
o                     then . . .
h                     that is . . .
?                     optional, followed by . . .
n                     followed by . . .
(...)                 a group of characters . . .
?                     that may appear once but isn't required, followed by . . .
<space>               a space, followed by . . .
D                     then . . .
o                     and finally . . .
e                     an e, then . . .
\b                    another word boundary.

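The group of characters is athan, which lets the expression match Jonathan. The group is optional as a whole, not character by character, which is why it is wrapped in parentheses and followed by ?. Because h? and (athan)? are independent of each other, the expression also matches the spelling Johnathan Doe.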
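
As a quick check of the walkthrough above, here is a minimal sketch that runs the expression against a few sample strings. The module name VariationDemo, the helper ContainsName, and the sample strings are illustrative assumptions, not part of the recipe.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module VariationDemo
    ' The expression described in the table above.
    Private ReadOnly NameRegex As New Regex("\bJoh?n(athan)? Doe\b")

    ' Hypothetical helper: True when the input contains one of the name variations.
    Public Function ContainsName(ByVal input As String) As Boolean
        Return NameRegex.IsMatch(input)
    End Function

    Public Sub Main()
        ' The last three samples probe the optional parts and the trailing \b.
        Dim samples As String() = {"Jon Doe", "John Doe", "Jonathan Doe", _
            "Johnathan Doe", "Jo Doe", "Jon Doerr"}
        For Each s As String In samples
            Console.WriteLine("{0,-14} {1}", s, ContainsName(s))
        Next
    End Sub
End Module
```

The first four samples should report True. Jo Doe fails because the n is required, and Jon Doerr fails because there is no word boundary between the e and the r.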

Variations

One variation on this recipe is to use alternation instead, as in recipe 1-3: ((Jon)|(John)|(Jonathan)) Doe. Depending on the skills of your peers, this form may be the better choice because it is easier for someone else to read. Another variation is ((Jon(athan)?)|(John)) Doe. Writing an elegant and fast regular expression is nice, but these days processor cycles are often cheaper than labor. Choose whichever form will be easiest for the people in your organization to maintain.
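
To compare the forms side by side, the following sketch runs the recipe's expression and both variations against the same inputs. The module name AlternationDemo, the helper MatchResults, and the \b anchors added to the two variations are assumptions made for this comparison, not part of the original recipe.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module AlternationDemo
    ' The recipe's pattern followed by the two variations from this section.
    ' The \b anchors on the variations are an assumption, added so that all
    ' three patterns are compared on equal terms.
    Public ReadOnly Patterns As String() = { _
        "\bJoh?n(athan)? Doe\b", _
        "\b((Jon)|(John)|(Jonathan)) Doe\b", _
        "\b((Jon(athan)?)|(John)) Doe\b"}

    ' Hypothetical helper: reports, per pattern, whether the input matches.
    Public Function MatchResults(ByVal input As String) As Boolean()
        Dim results(Patterns.Length - 1) As Boolean
        For i As Integer = 0 To Patterns.Length - 1
            results(i) = Regex.IsMatch(input, Patterns(i))
        Next
        Return results
    End Function

    Public Sub Main()
        Dim samples As String() = {"Jon Doe", "John Doe", "Jonathan Doe", _
            "Johnathan Doe"}
        For Each s As String In samples
            Dim r As Boolean() = MatchResults(s)
            Console.WriteLine("{0,-14} {1,-6} {2,-6} {3}", s, r(0), r(1), r(2))
        Next
    End Sub
End Module
```

All three patterns should agree on Jon Doe, John Doe, and Jonathan Doe. Only the first accepts Johnathan Doe, so the alternation forms are slightly stricter as well as more explicit about what they match.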
