11.2. Binary Encoding of Data

  • File: BinaryEncodings.ml

As discussed in the previous section, there is no big difference between text and binary files, as all of those are represented similarly by sequences of bits, with the former being given a special treatment in the case if an operational system identifies them following a certain encoding pattern.

Let us now learn how to work with binary data (i.e., reading/writing the corresponding files) in OCaml. We will largely rely on the library Extlib.IO that comes as a part of the batteries package:

open Extlib.IO

The standard terminology for writing/reading data to/from its binary representation is to serialize/deserialize it.

11.2.1. Writing and Reading Binary Files

Standard OCaml library does not provide means to work with binary data in the most fine-grained way: with standard functions one can read/write sequences of bits that are multipliers of 8 (i.e., bytes etc), but not individual bits. The functions output_bits and input_bits from Extlib.IO provide this possibility by giving “wrappers” around standard input/output channels for manipulating with files.

The following function, implemented by us, uses input_bits to read bits from a file filename and process them via the client-provided function deserialize:

let read_from_binary deserialize filename =
  Core.In_channel.with_file ~binary:true filename
    ~f:(fun file_input ->
        let bits_input = input_bits @@ input_channel file_input in
        deserialize bits_input)

Writing bits to a file is almost as straightforward and is done with the help of the following function that makes use of the output_bits wrapper:

let write_to_binary serialize filename data =
  Core.Out_channel.with_file filename ~append:false ~binary:true ~f:(fun file ->
      let bits_output = output_bits @@ output_channel file ~cleanup:true in
      serialize bits_output data;
      (* Padding from the end -- important! *)
      flush_bits bits_output)

Notice the last statement flush_bits bits_output. What it does is to add “missing” bits (as zeroes) to the binary file so its length (in bits) would be divisible by 8. If this is not done, then reading such a file might result in an error. The procedure write_to_binary takes as arguments, the function serialize that handles the data to be written to an output file , the filename of the file and the data itself.

11.2.2. Writing and Reading OCaml Strings

Let us now use the binary-manipulating machinery to read/write OCaml strings as if they were just sequences of bits.

Writing is done via the following function:

let write_string_to_binary filename text =
  let serialize out text =
    let size = String.length text in
    for i = 0 to size - 1 do
      let ch = int_of_char text.[i] in
      write_bits out ~nbits:8 ch;
    done
  in
  write_to_binary serialize filename text

The implementation above has a couple of interesting aspects. First, it treats a string as an array of characters that it converts to integers (int_of_char text.[i]). Second, it writes those integers as bits (i.e., 8-bit sequence) into the output file out (write_bits out ~nbits:8 ch). Since OCaml uses 32 bits to represent integers, such a truncation to 8 bits could be unsafe, but we know that our integers are converted from char and hence range at 0-255.

The resulting file thus contains a sequence of bytes precisely encoding the string.

Reading is done similarly:

let read_string_from_binary filename =
  let deserialize input =
    let buffer = Buffer.create 1 in
    (try
       while true do
         let bits = read_bits input 8 in
         let ch = char_of_int bits in
         Buffer.add_char buffer ch
       done;
     with BatInnerIO.No_more_input -> ());
    Buffer.contents buffer
  in
  read_from_binary deserialize filename

For an arbitrary file, we don’t know what is the length of the string it has. Therefore, we just keep adding byte-encoded characters to a buffer in a while true loop, until we hit the end of the file (each invocation of read_bits advances our reading “position” in the file, ultimately reaching the end). Once it happens an exception BatInnerIO.No_more_input is raised, which we can catch and return the result accumulated in the buffer.

We can also test that our serialization is implemented correctly:

let string_serialization_test s =
  let filename = "text.tmp" in
  write_string_to_binary filename s;
  let s' = read_string_from_binary filename in
  Sys.remove filename;
  s = s'

let abracadabra = "ABRACADABRA!"

let%test _ = string_serialization_test abracadabra

For more impressive testing, let us read a large text file (Leo Tolstoy’s “War and Peace”) and make a copy of it, testing the validity of our copying mechanism:

let string_file_serialization_test source_file =
  let s = read_file_to_single_string source_file in
  string_serialization_test s

(* Get the file path *)
let find_file fname =
  Printf.sprintf "%s/%s" (Sys.getcwd ()) fname

let%test _ =
  let f = (find_file "../../../resources/war-and-peace.txt") in
  string_file_serialization_test f

Notice that the function find_file returns the absolute path of a file located by starting at the running directory of the executable (which is different in the cases when we run utop and when we run tests - feel frree to check it). Here, we have tailored the path so it would work correctly with inline tests.

11.2.3. Compressing DNA Sequences

There is no gain in reading strings in binary, as we use the same format for representing them as plain OCaml.

Some domains, however, have data, for which it would be too wasteful to represent it as a string. Realising this gives an initial idea of implementing data compression — exploiting properties of data to find more compact representation of it as a bit-string.

A good example of data that can be efficiently represented are DNA sequences. The sequences are very long strings of only four characters:

  • A (Adenosine)
  • G (Guanine)
  • C (Cytosine)
  • T (Thymidine)

Therefore, a typical sequences look as follows:

let dna_string1 = "CGT"
let dna_string2 = "ATAGATGCATAGCGCATAGCTAGATAGTGCTAG"
let dna_string3 = "ATAGATGCATAGCGCATAGCTAGATAGTGCTAGCGATGCATAGCGCAGATGCATAGCGCAGGGGG"
let dna_string4 = "ATAGATGCATAGCGCATAGCTAGATAGTGCTAGCGATGCATAGCGCAGATGCATAGCGCAGGGGGATAGATGCATAGCGCATAGCTAGATAGTGCTAGCGATGCATAGCGCAGATGCATAGCGCAGGGGGATAGATGCATAGCGCATAGCTAGATAGTGCTAGCGATGCATAGCGCAGATGCATAGCGCAGGGGGATAGATGCATAGCGCATAGCTAGATAGTGCTAGCGATGCATAGCGCAGATGCATAGCGCAGGGGGATAGATGCATAGCGCATAGCTAGATAGTGCTAGCGATGCATAGCGCAGATGCATAGCGCAGGGGGATAGATGCATAGCGCATAGCTAGATAGTGCTAGCGATGCATAGCGCAGATGCATAGCGCAGGGGG"

Since there are only 4 characters in DNA strings, we don’t need 8 bits to encode them — just two bits would do:

let dna_encoding_size = 2

We can the implement the encoding from DNA characters to 2-bit integers and vice versa:

let dna_encoder = function
  | 'A' -> 0
  | 'C' -> 1
  | 'G' -> 2
  | 'T' -> 3
  | _ -> raise (Failure "DNA encoding error")

let dna_decoder = function
  | 0 -> 'A'
  | 1 -> 'C'
  | 2 -> 'G'
  | 3 -> 'T'
  | _ -> raise (Failure "DNA decoding error")

Let us now implement the binary serializers/deserializers for DNA data using this format. This can be accomplished using the general binary-manipulating primitives defined above.

The writing procedure starts by putting a header to the bit file of size 30 (the largest size of a bit-sequence supported by Extlib.IO), which is a serialised integer indicating the length of the following sequence of 2-bit encoded DNA characters. We did not need to put this information for 8-bit strings, but need it here because of the file padding via flush_bits:

let write_dna_to_binary filename text =
  let serialize out text =
    let size = String.length text in
    write_bits out ~nbits:30 size;
    for i = 0 to size - 1 do
      let ch = dna_encoder text.[i] in
      write_bits out ~nbits:dna_encoding_size ch;
    done
  in
  write_to_binary serialize filename text

The deserializer proceeds by first retrieving the header and learning the length of the stream of 2-bit characters, and then using this information to read the DNA string into a buffer and return it as an OCaml string:

let read_dna_from_binary filename =
  let deserialize input =
    let buffer = Buffer.create 1 in
    let input_length = read_bits input 30 in
    for _ = 0 to input_length - 1 do
      let bits = read_bits input dna_encoding_size in
      let ch = dna_decoder bits in
      Buffer.add_char buffer ch
    done;
    Buffer.contents buffer
  in
  read_from_binary deserialize filename

We can now test our compression/decompression procedure for DNAs:

let dna_compression_test d =
  let filename = "dna.tmp" in
  write_dna_to_binary filename d;
  let d' = read_dna_from_binary filename in
  Sys.remove filename;
  d = d'

Question: How can we see if the compression is beneficial?