Creating Random Access Text in C#

by David Bolton Aug 7, 2014 7 min read

Back in April I looked at disk folders as a possible alternative to NoSQL or using a relational DB. My conclusion wasn't encouraging—I was concerned about poor performance, especially on Linux. The use case I examined was for a server that had from 100,000 to 1 million users. I wanted to store and retrieve text files for any user. Those files could vary in length from a few bytes to a few KB. Back in the dark ages—before the Web and before databases were in common use—I did a fair bit of this type of code. It was written in Turbo Pascal using binary files. Life was simpler then. There was no Unicode to contend with, just ASCII 8 bit chars. Click here to find C# jobs. But the world has moved on. In this article, I'll be creating a class that stores proper C# strings in a data file using an index file for fast string access and retrieval. The principle is straightforward: There's an index file and a data file. The index file is made up of n index records where n is the number of strings to be stored. Each index record consists of two fields and is held in an instance of type IndexRecord. I've used UInt32 (0-4GB).

public UInt32 offset { get; set; } public UInt32 length { get; set; }

Now you might wonder why length is included, since that's stored with the string when you write it with a BinaryWriter. The reason is that it's needed for comparison when changing a string. If the length of the new string is the same or a lesser size, then the old offset is used. If it's longer, the offset is changed to the end of the data file and the text is written there.

Files and Streams

To make this work requires a FileStream which connects a file to an in-memory stream and uses a BinaryWriter and BinaryReader, created on the fly to do the reading or writing per string access. This code, for instance, reads in the entire indexfile into a List<IndexRecord>. The statement Position != Length is an end of file check.

using (var fs=File.OpenRead(indexFilename)) }
using (var br = new BinaryReader(fs))

{

while (br.BaseStream.Position != br.BaseStream.Length)

{

var indexRec = new IndexRecord

{

offset = br.ReadUInt32(),

length = br.ReadUInt32(),

};

indexes.Add(indexRec);

}

}
}

The Indices class manages a List of IndexRecords and has the Boolean property IsDirty(). This is set to true whenever a new index is added or changed. The Save method only rewrites the complete index to the index file if IsDirty() is true.

The TextRecords Class

This is the class that you instantiate. It encapsulates the index list, providing methods to Add, Change or GetString. In this case, the management of the FileStream instance differs from how it was used in the example above. It's a class-level variable. Once opened, it’s kept open by using the optional LeaveOpen() parameter in the BinaryWriter and BinaryReader constructors. The default Encoding in the example above is UTF8. To set the LeaveOpen() parameter requires the Encoding to be explicitly stated.

using (var bw = new BinaryWriter(fs,Encoding.UTF8,true)) {
bw.Write(text);
}

Although it would seem that creating an instance of a BinaryWriter would slow things down, it runs very quickly, writing 100,000 short strings (about 500Kb total) in just over a second on an i7 950 PC that's getting on four years old. The downside of keeping a FileStream open through the life of the class is that it requires that class to implement the IDisposable interface and ensure that the FileStream.Dispose() method is called. This is also a good place to call indices.Save() to save out the index file if it has been changed. It uses the recommended way of implementing Dispose() which prevents the Finalizer from calling Dispose() as well. The Finalizer is a rough equivalent to a C++ destructor and is called from the garbage collector. It’s non-deterministic as to when it's called, so explicitly disposing is better.

public class TextRecords : IDisposable {
...
}
public void Dispose()

{

Dispose(true);

GC.SuppressFinalize(this); // Prevent Dispose() from being called Twice

}

protected virtual void Dispose(bool disposing)

{

if (disposing)

{

indices.Save();

try

{

fs.Dispose();

}

catch

{

}

}
}

The interplay of FileStream and BinaryReader/BinaryWriter can be a little confusing. The FileStream has methods Seek() and Position(). Seek() moves the file cursor to a specified point and Position() returns where the file cursor is currently located. With a new file, the cursor starts at 0 and is moved by using the BinaryWriter to write something. For reading a string back, the offset position is read from the specified index record. Then the BinaryReader object reads the string from the disk. This is GetString(), below, doing that.

public string GetString(int index) {
var record = indices.GetIndex(index);

if (record == null)

{

return @"";

}

using (var br = new BinaryReader(fs, Encoding.UTF8,true))

{

fs.Seek(record.offset, SeekOrigin.Begin);

return br.ReadString();

}
}

The FileStream Seek() method can seek relative to the start position, end or relative to the current file cursor. So moving the file cursor to the end of the data file is done with this:

fs.Seek(0, SeekOrigin.End);

Whereas the strings are located using the index record's offset, relative to the start of the data file like this:

fs.Seek(record.offset, SeekOrigin.Begin);

It's Getting Larger!

Every time a string is changed, and the new string is larger than the old, it gets added onto the end of the data file and the index is updated to point to it. The space previously occupied by the string remains, so over time the data file will grow larger and larger. It doesn't need new strings to be added either. If you swapped two strings of different length, then the longer one would be written at the end of the data file. To recover this lost space, the TextRecords.Compress() method has been added. This copies the data file string by string into a temporary file. It saves out the indices (now pointing into the temporary data file), disposes of the FileStream object, renames the old data file to a backup and reloads the indices and reopens the FileStream object. The C# 5.0 source code from this has been posted on SourceForge. It consists of project files including two source files.

Upload Your ResumeEmployers want candidates like you. Upload your resume. Show them you're awesome.

Main image of article Creating Random Access Text in C#

public UInt32 offset { get; set; } public UInt32 length { get; set; }

Files and Streams

using (var fs=File.OpenRead(indexFilename)) }
using (var br = new BinaryReader(fs))

{

while (br.BaseStream.Position != br.BaseStream.Length)

{

var indexRec = new IndexRecord

{

offset = br.ReadUInt32(),

length = br.ReadUInt32(),

};

indexes.Add(indexRec);

}

}
}

The TextRecords Class

using (var bw = new BinaryWriter(fs,Encoding.UTF8,true)) {
bw.Write(text);
}

public class TextRecords : IDisposable {
...
}
public void Dispose()

{

Dispose(true);

GC.SuppressFinalize(this); // Prevent Dispose() from being called Twice

}

protected virtual void Dispose(bool disposing)

{

if (disposing)

{

indices.Save();

try

{

fs.Dispose();

}

catch

{

}

}
}

public string GetString(int index) {
var record = indices.GetIndex(index);

if (record == null)

{

return @"";

}

using (var br = new BinaryReader(fs, Encoding.UTF8,true))

{

fs.Seek(record.offset, SeekOrigin.Begin);

return br.ReadString();

}
}

The FileStream Seek() method can seek relative to the start position, end or relative to the current file cursor. So moving the file cursor to the end of the data file is done with this:

fs.Seek(0, SeekOrigin.End);

Whereas the strings are located using the index record's offset, relative to the start of the data file like this:

fs.Seek(record.offset, SeekOrigin.Begin);

It's Getting Larger!

Upload Your ResumeEmployers want candidates like you. Upload your resume. Show them you're awesome.

Creating Random Access Text in C#

Files and Streams

The TextRecords Class

It's Getting Larger!

Related Articles

Creating Random Access Text in C#

Files and Streams

The TextRecords Class

It's Getting Larger!

Related Articles

David Bolton

Creating Random Access Text in C#

Files and Streams

The TextRecords Class

It's Getting Larger!

Related Articles

Creating Random Access Text in C#

Files and Streams

The TextRecords Class

It's Getting Larger!

Related Articles

David Bolton

Related Articles

Apple's Flashy New File System Coming Soon

Cloud-Based File Services Raise Concerns for SMBs, Despite Use

SAS Updates Text Analysis Software