EditOverview
A Binary Search is a method for searching a list that is extremely fast and consistent. To use this technique, the data you are searching must mean 2 criteria:
- The data must be in the same order as the value you are searching for. For example, if you are searching a personnel database for a given last name and the data is sorted by employee ID, you can't use this technique.
- The data must be accessible randomly by a sequential index (i.e. a numeric index that has no gaps).
In .NET, one of the most obvious structures for this kind of search is the Array, ArrayList or List(of T). This is such a common bit of functionality on those structures, that the .NET framework included the BinarySearch method in those classes.
Essentially, the algorithm involves cutting the list of data in half repeatedly until you find the element you are looking for. Since the data is sorted, each time you check the element at the dead-center of the array, assuming it wasn't the one you wanted, you'll eliminate half of the data from consideration. Either the top or bottom half of the data is no longer relevant since your search element is either less than or greater than the one in the middle.
EditBasic Algorithm
The routine built into the .Net framework is written in c#; a VB.NET version using the IList interface would look something like this:
Public Function BinarySearch(Of T As IComparable(Of T)) _
(ByVal lst As IList(Of T), ByVal SrchKey As T) As Integer
Dim hiNdx As Integer = lst.Count - 1
Dim loNdx As Integer = 0
Do While loNdx <= hiNdx
Dim midNdx As Integer = loNdx + ((hiNdx - loNdx) >> 1)
Dim Order As Integer = SrchKey.CompareTo(lst(midNdx))
Select Case Order
Case Is = 0
Return midNdx
Case Is < 0
hiNdx = midNdx - 1
Case Is > 0
loNdx = midNdx + 1
End Select
Loop
Return (Not loNdx)
End Function
EditAn example using List(Of T)
The simplest way to do this is to create data class that implements IComparable(Of T). In my work, I use a lot of phone-based data. For this example, let's create a data class that will store the city and state for area code/exchange combinations in the United States:
Public Class LocationDataClass
Implements IComparable(Of LocationDataClass)
Public AreaCode As String
Public Exchange As String
Public CityName As String
Public StateCode As String
Public Function CompareTo(ByVal other As LocationDataClass) As Integer _
Implements System.IComparable(Of LocationDataClass).CompareTo
Dim n As Integer = String.Compare(Me.AreaCode, other.AreaCode)
If n = 0 Then n = String.Compare(Me.Exchange, other.Exchange)
Return n
End Function
End Class
The CompareTo function is used by the List class's Sort and BinarySearch mechanism. To sort the list after loading it:
Dim MyData As New List(Of LocationDataClass)
' load some data into MyData
MyData.Sort
To search for a specific area code and exchange:
Dim srch As New LocationDataClass
srch.AreaCode = "312"
srch.Exchange = "555"
Dim i As Integer = MyData.BinarySearch(srch)
If i >= 0 Then
Console.WriteLine("Found at record #{0} - {1}", i, MyData(i).CityName)
Else
Console.WriteLine("Not found; should be at record #{0}", Not i)
End If
If the index returned by the BinarySearch method is a negative number, then the record you were searching for isn't in the list. By itself, that's good to know. But that number also tells you where in the list that record would have been if it existed. Simply use the
Not function to reverse it and you'll know where that record should have been. This can be very useful in caching functions where the failure to find the record means you need to lookup the record in a database. Then, once you've looked it up, you know where to insert it in this list.
EditAn example using an ArrayList
Using an ArrayList is almost exactly same as using a Generic.List except that you must implement IComparable instead of IComparable(Of T) and you must cast values from objects to your data type in the CompareTo method.
EditAn example using an Array
Using an Array is almost exactly same as using a Generic.List except that you call Array.BinarySearch as a shared function instead of calling a method on your list object.
EditAn example using a data file
Recently, I was given a data file that I needed to work with that was much too large to load into memory. But, other than that, it still met the qualifications of a Binary Search as listed above:
- Each data record was in order by it's primary key
- Each data record was a fixed length, so accessing them randomly could be done using a Seek function
Let's assume we have a file that contains the same data as the above example (area code, exchange, city and state code). The layout of this file is an ASCII text file that has a fixed length record with the following layout:
| Pos. | Length | Field name |
|---|
| 0 | 3 | Area Code |
| 3 | 3 | Exchange |
| 6 | 20 | City name (space padded) |
| 26 | 2 | State code |
| 28 | 2 | Carriage return, Line feed (makes it easier for a human to read the file, but does nothing for our code) |
To model the data, we'll create a data class similar to the one above:
Public Class LocationDataClass
Implements IComparable(Of LocationDataClass)
Public Shared RecordLength As Integer = 30
Private _rawData() As Byte
Public Sub SetData(ByVal rawData() As Byte)
_rawData = rawData
End Sub
Public Function GetData() As Byte()
Return _rawData
End Function
Public Property AreaCode() As String
Get
Return Text.Encoding.ASCII.GetString(_rawData, 0, 3)
End Get
Set(ByVal value As String)
If value Is Nothing Then value = " "
If Len(value) < 3 Then value &= StrDup(3 - Len(value), " ")
Dim ar() As Byte = Text.Encoding.ASCII.GetBytes(value)
Array.Copy(ar, 0, _rawData, 0, 3)
End Set
End Property
Public Property Exchange() As String
Get
Return Text.Encoding.ASCII.GetString(_rawData, 3, 3)
End Get
Set(ByVal value As String)
If value Is Nothing Then value = " "
If Len(value) < 3 Then value &= StrDup(3 - Len(value), " ")
Dim ar() As Byte = Text.Encoding.ASCII.GetBytes(value)
Array.Copy(ar, 0, _rawData, 3, 3)
End Set
End Property
Public ReadOnly Property CityName() As String
Get
Return Text.Encoding.ASCII.GetString(_rawData, 6, 20).TrimEnd
End Get
End Property
Public ReadOnly Property StateCode() As String
Get
Return Text.Encoding.ASCII.GetString(_rawData, 26, 2)
End Get
End Property
Public Function CompareTo(ByVal other As LocationDataClass) As Integer Implements System.IComparable(Of LocationDataClass).CompareTo
Dim n As Integer = String.Compare(Me.AreaCode, other.AreaCode)
If n = 0 Then n = String.Compare(Me.Exchange, other.Exchange)
Return n
End Function
End Class
This class contains the same information as the original one, but the data store is a byte array. I'm storing the data this way because it's the format you read the data in when you access a stream randomly. We'll be reading a lot of records to find the one we want and we want to avoid parsing them any more than necessary. Since we want to read records randomly, let's write a function that will read one random record:
Public Function GetRawRecord(ByVal fs As IO.Stream, ByVal recordNumber As Integer) As Byte()
Dim ar(LocationDataClass.RecordLength - 1) As Byte
fs.Seek(recordNumber * LocationDataClass.RecordLength, IO.SeekOrigin.Begin)
fs.Read(ar, 0, LocationDataClass.RecordLength)
Return ar
End Function
To open a stream to this file:
Dim fi As New IO.FileInfo(Filename)
Dim fs As IO.FileStream = fi.Open(IO.FileMode.Open, IO.FileAccess.Read, IO.FileShare.Read)
Dim RecordCount As Integer = CInt(fi.Length \ LocationDataClass.RecordLength)
And finally, the search function for this file looks like this:
Public Sub SearchFileOnly _
(ByVal fs As IO.Stream, ByVal RecordCount As Integer, _
ByVal SrchKey As LocationDataClass, _
ByRef ResultRecord As LocationDataClass, ByRef ResultIndex As Integer)
Dim hiNdx As Integer = RecordCount - 1
Dim loNdx As Integer = 0
Dim srchRaw() As Byte = SrchKey.GetData
Dim tmpRecord As New LocationDataClass
' if the lo and hi end up out of order, it means we didn't find it
Do While loNdx <= hiNdx
' find the center point
Dim midNdx As Integer = loNdx + ((hiNdx - loNdx) >> 1)
' test that record against your search key
tmpRecord.SetData(GetRawRecord(fs, midNdx))
Dim Order As Integer = SrchKey.CompareTo(tmpRecord)
Select Case Order
Case Is = 0
' if we found it, we're done
ResultRecord = tmpRecord
ResultIndex = midNdx
Exit Sub
Case Is < 0
'everything above the mid point is not relevent
hiNdx = midNdx - 1
Case Is > 0
'everything below the mid point is not relevent
loNdx = midNdx + 1
End Select
Loop
ResultRecord = Nothing
ResultIndex = (Not loNdx)
End Sub
Notice that this version returns both the result record and the result index instead of just the index. Since getting the data required a read to the file, performance is enhanced by passing that data back instead of requiring the calling program to read it again. Waste not, want not.
Finally, we have the code to invoke this routine:
Dim fi As New IO.FileInfo(Filename)
Using fs As IO.FileStream = fi.Open(IO.FileMode.Open, IO.FileAccess.Read, IO.FileShare.Read)
Dim RecordCount As Integer = CInt(fi.Length \ LocationDataClass.RecordLength)
Dim srch As New LocationDataClass
srch.AreaCode = "312"
srch.Exchange = "555"
Dim iRes As Integer = -1
Dim res As LocationDataClass = Nothing
SearchFileOnly(fs, RecordCount, srch, res, iRes)
If iRes >= 0 Then
Console.WriteLine("Found at record #{0} - {1}", i, res.CityName)
Else
Console.WriteLine("Not found; should be at record #{0}", Not iRes)
End If
End Using
As written, this technique doesn't store anything more than a single result in memory at any given time. Therefore, it scales well. Regardless of the size of the file, you memory usage remains constant. But, reading from a disk file is almost always slower than reading from memory.
Since this technique uses the same comparison method (IComparable) as the first method, it doesn't take much to combine the techniques. Create an instance of
List(Of LocationDataClass). Search it first. If that succeeds, use that record. If it fails, use the SearchFileOnly method. Store the result of SearchFileOnly in the List object for next time. You can even add code that purges the List object if it ends up holding too many records. That's beyond the scope of this article, but I'll cover it in another article on caching.
EditAnother data file technique
Another way to handle a large data file is to pre-read all of the key values into a List object along with a pointer you can use in the Seek function. Then, use the List's built-in BinarySearch method and follow that up with a single seek/read to the file. This works well if the data file is especially large, but the key values alone will fit in memory. It also works well if you expect to get a lot of requests for data that isn't in the file; those requests will never results in a file read other than the initial key load. It's also possible to use this technique on files that aren't fixed length by noting the seek positions as you read in the non-fixed records.
For this example, assume we have a variable length text file with the same data we've been using. This time, fields are tab-separated and each record has a variable length:
Public Class LocationDataClass
Implements IComparable(Of LocationDataClass)
Public AreaCode As String
Public Exchange As String
Public SeekPointer As Long
Public Function CompareTo(ByVal other As LocationDataClass) As Integer _
Implements System.IComparable(Of LocationDataClass).CompareTo
Dim n As Integer = String.Compare(Me.AreaCode, other.AreaCode)
If n = 0 Then n = String.Compare(Me.Exchange, other.Exchange)
Return n
End Function
End Class
Dim MyData As New List(Of LocationDataClass)
Public Sub LoadData(ByVal filename As String)
Using fs As New IO.StreamReader(filename)
Dim seekPos As Long = 0
Dim s As String = fs.ReadLine
Do Until s Is Nothing
Dim ar() As String = Split(s, vbTab, 3)
Dim tmp As New LocationDataClass
tmp.AreaCode = ar(0)
tmp.Exchange = ar(1)
tmp.SeekPointer = seekPos
MyData.Add(tmp)
seekPos += (s.Length + 2)
s = fs.ReadLine
Loop
End Using
MyData.Sort()
End Sub
The LoadData routine reads each data line and parses out the key values. I limited the Split function to 3 items so I could accurately get the 2 key values and ignore the data values. The SeekPointer is a Long in case the overall size of the file is greater than
Int32.MaxValue (approx. 2gb). The SeekPos variable is constantly incremented based on the actual size of the data plus 2 (carriage return + line feed).
This technique will still eat up a significant amount of memory, so use caution. Also, if it's possible that the source file can change while you are using it, your seek pointers will get out of sync and cause a major problem. If you know that shouldn't happen, you can safely prevent it by re-opening the file for random access and leaving it open until the application is done with it. Either that or you can use a FileSystemWatcher object to reload your pointers if the file is changed during the lifetime of your application.
EditUsing Binary Search when you don't know the full key value
Let's go back to our original example: A List of phone locations.
Public Class LocationDataClass
Implements IComparable(Of LocationDataClass)
Public AreaCode As String
Public Exchange As String
Public CityName As String
Public StateCode As String
Public Function CompareTo(ByVal other As LocationDataClass) As Integer _
Implements System.IComparable(Of LocationDataClass).CompareTo
Dim n As Integer = String.Compare(Me.AreaCode, other.AreaCode)
If n = 0 Then n = String.Compare(Me.Exchange, other.Exchange)
Return n
End Function
End Class
Dim MyData As New List(Of LocationDataClass)
' load some data into MyData
MyData.Sort
The primary CompareTo method requires that we know both the area code and the exchange. But, what if we only know the area code? Can we still use a binary search on this list? Yes and No. A binary search will work on non-unique data that follows our original rules, but the algorithm doesn't guarantee which of the records it will find. It will find the area code if it's in the list, but which exchange within that area code is not predictable.
First, we need to create a comparer that doesn't care about exchange:
Public Class CompareAreaCodeOnly
Implements IComparer(Of LocationDataClass)
Private Sub New()
End Sub
Public Shared Instance As New CompareAreaCodeOnly
Public Function Compare _
(ByVal x As LocationDataClass, _
ByVal y As LocationDataClass) As Integer _
Implements System.Collections.Generic.IComparer(Of LocationDataClass).Compare
Return String.Compare(x.AreaCode, y.AreaCode)
End Function
End Class
To search for a specific area code only:
Dim srch As New LocationDataClass
srch.AreaCode = "312"
Dim i As Integer = MyData.BinarySearch(srch, CompareAreaCodeOnly.Instance)
If i >= 0 Then
Console.WriteLine("Area code {0} exists", srch.AreaCode)
Else
Console.WriteLine("Area code {0} does not exist", srch.AreaCode)
End If
In this case, when it finds the record, it finds one of the records that has that area code, but it could be the first one, the last one or any one in between. It's not random, but it's
effectively non-predictable. All we can really tell is that the area code exists in our list or not. If it does exist, we could do a linear search backwards to find the first exchange or forward to find the last one.
This works as long as the partial key we are searching for is the
first part of the sort key. If it's in the middle or end of the sort key, it won't work.
EditUsing Binary Search with databases
Most modern databases have sophisticated indexing techniques that make doing a binary search against the data unnecessary. While you probably can NOT binary search a database file faster than the database server can do a unique index lookup, you can use binary searches against a cache of results so that you can lookup previously found data in-memory before going back to the database for a redundant query. More on that in a
caching article.