"I think the American public wants a solemn ass as a President, and I think I'll go along with them." - (John) Calvin Coolidge


Edit

Overview

A Binary Search is a method for searching a list that is extremely fast and consistent. To use this technique, the data you are searching must mean 2 criteria:

  1. The data must be in the same order as the value you are searching for. For example, if you are searching a personnel database for a given last name and the data is sorted by employee ID, you can't use this technique.
  2. The data must be accessible randomly by a sequential index (i.e. a numeric index that has no gaps).

In .NET, one of the most obvious structures for this kind of search is the Array, ArrayList or List(of T). This is such a common bit of functionality on those structures, that the .NET framework included the BinarySearch method in those classes.

Essentially, the algorithm involves cutting the list of data in half repeatedly until you find the element you are looking for. Since the data is sorted, each time you check the element at the dead-center of the array, assuming it wasn't the one you wanted, you'll eliminate half of the data from consideration. Either the top or bottom half of the data is no longer relevant since your search element is either less than or greater than the one in the middle.

Edit

Basic Algorithm

The routine built into the .Net framework is written in c#; a VB.NET version using the IList interface would look something like this:

Public Function BinarySearch(Of T As IComparable(Of T)) _
    (ByVal lst As IList(Of T), ByVal SrchKey As T) As Integer

Dim hiNdx As Integer = lst.Count - 1 Dim loNdx As Integer = 0 Do While loNdx <= hiNdx Dim midNdx As Integer = loNdx + ((hiNdx - loNdx) >> 1) Dim Order As Integer = SrchKey.CompareTo(lst(midNdx)) Select Case Order Case Is = 0 Return midNdx Case Is < 0 hiNdx = midNdx - 1 Case Is > 0 loNdx = midNdx + 1 End Select Loop Return (Not loNdx) End Function

Edit

An example using List(Of T)

The simplest way to do this is to create data class that implements IComparable(Of T). In my work, I use a lot of phone-based data. For this example, let's create a data class that will store the city and state for area code/exchange combinations in the United States:

Public Class LocationDataClass
    Implements IComparable(Of LocationDataClass)

Public AreaCode As String Public Exchange As String

Public CityName As String Public StateCode As String

Public Function CompareTo(ByVal other As LocationDataClass) As Integer _ Implements System.IComparable(Of LocationDataClass).CompareTo

Dim n As Integer = String.Compare(Me.AreaCode, other.AreaCode) If n = 0 Then n = String.Compare(Me.Exchange, other.Exchange) Return n End Function End Class

The CompareTo function is used by the List class's Sort and BinarySearch mechanism. To sort the list after loading it:

Dim MyData As New List(Of LocationDataClass)
' load some data into MyData
MyData.Sort

To search for a specific area code and exchange:

Dim srch As New LocationDataClass
srch.AreaCode = "312"
srch.Exchange = "555"
Dim i As Integer = MyData.BinarySearch(srch)
If i >= 0 Then
    Console.WriteLine("Found at record #{0} - {1}", i, MyData(i).CityName)
Else
    Console.WriteLine("Not found; should be at record #{0}", Not i)
End If

If the index returned by the BinarySearch method is a negative number, then the record you were searching for isn't in the list. By itself, that's good to know. But that number also tells you where in the list that record would have been if it existed. Simply use the Not function to reverse it and you'll know where that record should have been. This can be very useful in caching functions where the failure to find the record means you need to lookup the record in a database. Then, once you've looked it up, you know where to insert it in this list.

Edit

An example using an ArrayList

Using an ArrayList is almost exactly same as using a Generic.List except that you must implement IComparable instead of IComparable(Of T) and you must cast values from objects to your data type in the CompareTo method.

Edit

An example using an Array

Using an Array is almost exactly same as using a Generic.List except that you call Array.BinarySearch as a shared function instead of calling a method on your list object.

Edit

An example using a data file

Recently, I was given a data file that I needed to work with that was much too large to load into memory. But, other than that, it still met the qualifications of a Binary Search as listed above:

  1. Each data record was in order by it's primary key
  2. Each data record was a fixed length, so accessing them randomly could be done using a Seek function

Let's assume we have a file that contains the same data as the above example (area code, exchange, city and state code). The layout of this file is an ASCII text file that has a fixed length record with the following layout:

Pos.LengthField name
03Area Code
33Exchange
620City name (space padded)
262State code
282Carriage return, Line feed (makes it easier for a human to read the file, but does nothing for our code)

To model the data, we'll create a data class similar to the one above:

Public Class LocationDataClass
    Implements IComparable(Of LocationDataClass)

Public Shared RecordLength As Integer = 30

Private _rawData() As Byte

Public Sub SetData(ByVal rawData() As Byte) _rawData = rawData End Sub

Public Function GetData() As Byte() Return _rawData End Function

Public Property AreaCode() As String Get Return Text.Encoding.ASCII.GetString(_rawData, 0, 3) End Get Set(ByVal value As String) If value Is Nothing Then value = " " If Len(value) < 3 Then value &= StrDup(3 - Len(value), " ") Dim ar() As Byte = Text.Encoding.ASCII.GetBytes(value) Array.Copy(ar, 0, _rawData, 0, 3) End Set End Property

Public Property Exchange() As String Get Return Text.Encoding.ASCII.GetString(_rawData, 3, 3) End Get Set(ByVal value As String) If value Is Nothing Then value = " " If Len(value) < 3 Then value &= StrDup(3 - Len(value), " ") Dim ar() As Byte = Text.Encoding.ASCII.GetBytes(value) Array.Copy(ar, 0, _rawData, 3, 3) End Set End Property

Public ReadOnly Property CityName() As String Get Return Text.Encoding.ASCII.GetString(_rawData, 6, 20).TrimEnd End Get End Property Public ReadOnly Property StateCode() As String Get Return Text.Encoding.ASCII.GetString(_rawData, 26, 2) End Get End Property

Public Function CompareTo(ByVal other As LocationDataClass) As Integer Implements System.IComparable(Of LocationDataClass).CompareTo Dim n As Integer = String.Compare(Me.AreaCode, other.AreaCode) If n = 0 Then n = String.Compare(Me.Exchange, other.Exchange) Return n End Function End Class

This class contains the same information as the original one, but the data store is a byte array. I'm storing the data this way because it's the format you read the data in when you access a stream randomly. We'll be reading a lot of records to find the one we want and we want to avoid parsing them any more than necessary. Since we want to read records randomly, let's write a function that will read one random record:

Public Function GetRawRecord(ByVal fs As IO.Stream, ByVal recordNumber As Integer) As Byte()
    Dim ar(LocationDataClass.RecordLength - 1) As Byte
    fs.Seek(recordNumber * LocationDataClass.RecordLength, IO.SeekOrigin.Begin)
    fs.Read(ar, 0, LocationDataClass.RecordLength)
    Return ar
End Function

To open a stream to this file:

Dim fi As New IO.FileInfo(Filename)
Dim fs As IO.FileStream = fi.Open(IO.FileMode.Open, IO.FileAccess.Read, IO.FileShare.Read)
Dim RecordCount As Integer = CInt(fi.Length \ LocationDataClass.RecordLength)

And finally, the search function for this file looks like this:

Public Sub SearchFileOnly _
    (ByVal fs As IO.Stream, ByVal RecordCount As Integer, _
    ByVal SrchKey As LocationDataClass, _
    ByRef ResultRecord As LocationDataClass, ByRef ResultIndex As Integer)

Dim hiNdx As Integer = RecordCount - 1 Dim loNdx As Integer = 0

Dim srchRaw() As Byte = SrchKey.GetData Dim tmpRecord As New LocationDataClass

' if the lo and hi end up out of order, it means we didn't find it Do While loNdx <= hiNdx ' find the center point Dim midNdx As Integer = loNdx + ((hiNdx - loNdx) >> 1)

' test that record against your search key tmpRecord.SetData(GetRawRecord(fs, midNdx)) Dim Order As Integer = SrchKey.CompareTo(tmpRecord) Select Case Order Case Is = 0 ' if we found it, we're done ResultRecord = tmpRecord ResultIndex = midNdx Exit Sub Case Is < 0 'everything above the mid point is not relevent hiNdx = midNdx - 1 Case Is > 0 'everything below the mid point is not relevent loNdx = midNdx + 1 End Select Loop

ResultRecord = Nothing ResultIndex = (Not loNdx) End Sub

Notice that this version returns both the result record and the result index instead of just the index. Since getting the data required a read to the file, performance is enhanced by passing that data back instead of requiring the calling program to read it again. Waste not, want not.

Finally, we have the code to invoke this routine:

Dim fi As New IO.FileInfo(Filename)
Using fs As IO.FileStream = fi.Open(IO.FileMode.Open, IO.FileAccess.Read, IO.FileShare.Read)
    Dim RecordCount As Integer = CInt(fi.Length \ LocationDataClass.RecordLength)

Dim srch As New LocationDataClass srch.AreaCode = "312" srch.Exchange = "555"

Dim iRes As Integer = -1 Dim res As LocationDataClass = Nothing SearchFileOnly(fs, RecordCount, srch, res, iRes)

If iRes >= 0 Then Console.WriteLine("Found at record #{0} - {1}", i, res.CityName) Else Console.WriteLine("Not found; should be at record #{0}", Not iRes) End If End Using

As written, this technique doesn't store anything more than a single result in memory at any given time. Therefore, it scales well. Regardless of the size of the file, you memory usage remains constant. But, reading from a disk file is almost always slower than reading from memory.

Since this technique uses the same comparison method (IComparable) as the first method, it doesn't take much to combine the techniques. Create an instance of List(Of LocationDataClass). Search it first. If that succeeds, use that record. If it fails, use the SearchFileOnly method. Store the result of SearchFileOnly in the List object for next time. You can even add code that purges the List object if it ends up holding too many records. That's beyond the scope of this article, but I'll cover it in another article on caching.

Edit

Another data file technique

Another way to handle a large data file is to pre-read all of the key values into a List object along with a pointer you can use in the Seek function. Then, use the List's built-in BinarySearch method and follow that up with a single seek/read to the file. This works well if the data file is especially large, but the key values alone will fit in memory. It also works well if you expect to get a lot of requests for data that isn't in the file; those requests will never results in a file read other than the initial key load. It's also possible to use this technique on files that aren't fixed length by noting the seek positions as you read in the non-fixed records.

For this example, assume we have a variable length text file with the same data we've been using. This time, fields are tab-separated and each record has a variable length:

Public Class LocationDataClass
    Implements IComparable(Of LocationDataClass)

Public AreaCode As String Public Exchange As String

Public SeekPointer As Long

Public Function CompareTo(ByVal other As LocationDataClass) As Integer _ Implements System.IComparable(Of LocationDataClass).CompareTo

Dim n As Integer = String.Compare(Me.AreaCode, other.AreaCode) If n = 0 Then n = String.Compare(Me.Exchange, other.Exchange) Return n End Function End Class

Dim MyData As New List(Of LocationDataClass)

Public Sub LoadData(ByVal filename As String) Using fs As New IO.StreamReader(filename) Dim seekPos As Long = 0 Dim s As String = fs.ReadLine Do Until s Is Nothing Dim ar() As String = Split(s, vbTab, 3) Dim tmp As New LocationDataClass tmp.AreaCode = ar(0) tmp.Exchange = ar(1) tmp.SeekPointer = seekPos MyData.Add(tmp)

seekPos += (s.Length + 2)

s = fs.ReadLine Loop End Using MyData.Sort() End Sub

The LoadData routine reads each data line and parses out the key values. I limited the Split function to 3 items so I could accurately get the 2 key values and ignore the data values. The SeekPointer is a Long in case the overall size of the file is greater than Int32.MaxValue (approx. 2gb). The SeekPos variable is constantly incremented based on the actual size of the data plus 2 (carriage return + line feed).

This technique will still eat up a significant amount of memory, so use caution. Also, if it's possible that the source file can change while you are using it, your seek pointers will get out of sync and cause a major problem. If you know that shouldn't happen, you can safely prevent it by re-opening the file for random access and leaving it open until the application is done with it. Either that or you can use a FileSystemWatcher object to reload your pointers if the file is changed during the lifetime of your application.

Edit

Using Binary Search when you don't know the full key value

Let's go back to our original example: A List of phone locations.

Public Class LocationDataClass
    Implements IComparable(Of LocationDataClass)

Public AreaCode As String Public Exchange As String

Public CityName As String Public StateCode As String

Public Function CompareTo(ByVal other As LocationDataClass) As Integer _ Implements System.IComparable(Of LocationDataClass).CompareTo

Dim n As Integer = String.Compare(Me.AreaCode, other.AreaCode) If n = 0 Then n = String.Compare(Me.Exchange, other.Exchange) Return n End Function End Class

Dim MyData As New List(Of LocationDataClass) ' load some data into MyData MyData.Sort

The primary CompareTo method requires that we know both the area code and the exchange. But, what if we only know the area code? Can we still use a binary search on this list? Yes and No. A binary search will work on non-unique data that follows our original rules, but the algorithm doesn't guarantee which of the records it will find. It will find the area code if it's in the list, but which exchange within that area code is not predictable.

First, we need to create a comparer that doesn't care about exchange:

Public Class CompareAreaCodeOnly
    Implements IComparer(Of LocationDataClass)

Private Sub New() End Sub

Public Shared Instance As New CompareAreaCodeOnly

Public Function Compare _ (ByVal x As LocationDataClass, _ ByVal y As LocationDataClass) As Integer _ Implements System.Collections.Generic.IComparer(Of LocationDataClass).Compare

Return String.Compare(x.AreaCode, y.AreaCode) End Function End Class

To search for a specific area code only:

Dim srch As New LocationDataClass
srch.AreaCode = "312"
Dim i As Integer = MyData.BinarySearch(srch, CompareAreaCodeOnly.Instance)
If i >= 0 Then
    Console.WriteLine("Area code {0} exists", srch.AreaCode)
Else
    Console.WriteLine("Area code {0} does not exist", srch.AreaCode)
End If

In this case, when it finds the record, it finds one of the records that has that area code, but it could be the first one, the last one or any one in between. It's not random, but it's effectively non-predictable. All we can really tell is that the area code exists in our list or not. If it does exist, we could do a linear search backwards to find the first exchange or forward to find the last one.

This works as long as the partial key we are searching for is the first part of the sort key. If it's in the middle or end of the sort key, it won't work.

Edit

Using Binary Search with databases

Most modern databases have sophisticated indexing techniques that make doing a binary search against the data unnecessary. While you probably can NOT binary search a database file faster than the database server can do a unique index lookup, you can use binary searches against a cache of results so that you can lookup previously found data in-memory before going back to the database for a redundant query. More on that in a caching article.

ScrewTurn Wiki version 2.0.33. Current Page Count: 23. Some of the icons created by FamFamFam.