11.2. Binary Encoding of Data
As discussed in the previous section, there is no fundamental difference between text and binary files: both are represented as sequences of bits. Text files merely receive special treatment when the operating system recognises that their contents follow a certain encoding.
Let us now learn how to work with binary data (i.e., reading/writing the corresponding files) in OCaml. We will largely rely on the library Extlib.IO that comes as a part of the batteries package.
The standard terminology for writing/reading data to/from its binary representation is to serialize/deserialize it.
11.2.1. Writing and Reading Binary Files
The standard OCaml library does not provide means to work with binary data at the finest granularity: with standard functions one can read and write sequences of bits whose length is a multiple of 8 (i.e., bytes), but not individual bits. The functions of Extlib.IO make this possible by providing “wrappers” around standard input/output channels for manipulating individual bits.
The following function, implemented by us, uses input_bits to read bits from a file filename and process them via the client-provided function deserialize:
(* Assumes [open Extlib.IO], which provides input_bits and input_channel. *)
let read_from_binary deserialize filename =
  Core.In_channel.with_file ~binary:true filename
    ~f:(fun file_input ->
        let bits_input = input_bits @@ input_channel file_input in
        deserialize bits_input)
Writing bits to a file is almost as straightforward and is done with the help of the following function, which makes use of output_bits:
let write_to_binary serialize filename data =
  Core.Out_channel.with_file filename ~append:false ~binary:true
    ~f:(fun file ->
        let bits_output = output_bits @@ output_channel file ~cleanup:true in
        serialize bits_output data;
        (* Padding from the end -- important! *)
        flush_bits bits_output)
Notice the last statement, flush_bits bits_output. It appends the “missing” bits (as zeroes) to the binary file so that its length in bits is divisible by 8. If this is not done, then reading such a file might result in an error. The procedure write_to_binary takes as arguments the function serialize that handles the data to be written to an output file, the filename of the file, and the data itself.
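The amount of padding can be worked out with a little arithmetic. The following is a minimal sketch, assuming that the padding always extends the content to the next byte boundary as described above; padding_bits is a hypothetical helper introduced only for illustration and is not a part of Extlib.IO.

```ocaml
(* Number of zero bits needed to extend [nbits] bits to a whole number
   of bytes. Hypothetical helper, not a library function. *)
let padding_bits nbits = (8 - nbits mod 8) mod 8

let () =
  (* 6 payload bits need 2 padding bits; 16 bits need none. *)
  Printf.printf "%d %d\n" (padding_bits 6) (padding_bits 16)
```

For instance, three 2-bit values occupy 6 bits, so two zero bits would be appended to complete the byte.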
11.2.2. Writing and Reading OCaml Strings
Let us now use the binary-manipulating machinery to read/write OCaml strings as if they were just sequences of bits.
Writing is done via the following function:
let write_string_to_binary filename text =
  let serialize out text =
    let size = String.length text in
    for i = 0 to size - 1 do
      let ch = int_of_char text.[i] in
      write_bits out ~nbits:8 ch;
    done
  in
  write_to_binary serialize filename text
The implementation above has a couple of interesting aspects. First, it treats a string as an array of characters that it converts to integers (int_of_char text.[i]). Second, it writes those integers as 8-bit sequences into the output file (write_bits out ~nbits:8 ch). Since OCaml integers are much wider than 8 bits (63 bits on 64-bit platforms), such a truncation to 8 bits could be unsafe in general, but we know that our integers are converted from char and hence range from 0 to 255.
The resulting file thus contains a sequence of bytes precisely encoding the string.
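The claim that 8 bits are enough for any character can be checked directly with the standard library alone. The sketch below uses two hypothetical names, fits_in_byte and all_chars_fit, introduced only for this illustration.

```ocaml
(* A character fits into 8 bits if its code lies in 0..255 and converting
   the code back yields the same character. *)
let fits_in_byte c =
  let n = int_of_char c in
  0 <= n && n <= 255 && char_of_int n = c

(* Check the property for every possible character code. *)
let all_chars_fit =
  let ok = ref true in
  for n = 0 to 255 do
    if not (fits_in_byte (char_of_int n)) then ok := false
  done;
  !ok
```

Note that char_of_int raises Invalid_argument for arguments outside of this range, which is why the loop stops at 255.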
Reading is done similarly:
let read_string_from_binary filename =
  let deserialize input =
    let buffer = Buffer.create 1 in
    (try
       while true do
         let bits = read_bits input 8 in
         let ch = char_of_int bits in
         Buffer.add_char buffer ch
       done;
     with BatInnerIO.No_more_input -> ());
    Buffer.contents buffer
  in
  read_from_binary deserialize filename
For an arbitrary file, we do not know the length of the string it contains. Therefore, we just keep adding byte-encoded characters to a buffer in a while true loop until we hit the end of the file (each invocation of read_bits advances our reading “position” in the file, ultimately reaching the end). Once this happens, the exception BatInnerIO.No_more_input is raised, which we catch before returning the result accumulated in the buffer.
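The same read-until-exception pattern can be sketched with the standard library alone, which works at the granularity of whole bytes. This is an illustration rather than a replacement for the functions above: write_bytes, read_bytes and round_trip are hypothetical names, and the end of the file is signalled here by the standard End_of_file exception instead of BatInnerIO.No_more_input.

```ocaml
(* Write each character of [text] to the file as one byte. *)
let write_bytes filename text =
  let oc = open_out_bin filename in
  String.iter (fun c -> output_byte oc (int_of_char c)) text;
  close_out oc

(* Read bytes one by one until the channel signals the end of the file. *)
let read_bytes filename =
  let ic = open_in_bin filename in
  let buffer = Buffer.create 16 in
  (try
     while true do
       Buffer.add_char buffer (char_of_int (input_byte ic))
     done
   with End_of_file -> ());
  close_in ic;
  Buffer.contents buffer

(* Write a string to a temporary file, read it back, and clean up. *)
let round_trip text =
  let filename = Filename.temp_file "bytes" ".tmp" in
  write_bytes filename text;
  let result = read_bytes filename in
  Sys.remove filename;
  result
```

Since every read consumes exactly 8 bits, no padding is involved in this version.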
We can also test that our serialization is implemented correctly:
let string_serialization_test s =
  let filename = "text.tmp" in
  write_string_to_binary filename s;
  let s' = read_string_from_binary filename in
  Sys.remove filename;
  s = s'

let abracadabra = "ABRACADABRA!"

let%test _ = string_serialization_test abracadabra
For more impressive testing, let us read a large text file (Leo Tolstoy’s “War and Peace”) and make a copy of it, testing the validity of our copying mechanism:
let string_file_serialization_test source_file =
  let s = read_file_to_single_string source_file in
  string_serialization_test s

(* Get the file path *)
let find_file fname =
  Printf.sprintf "%s/%s" (Sys.getcwd ()) fname

let%test _ =
  let f = (find_file "../../../resources/war-and-peace.txt") in
  string_file_serialization_test f
Notice that the function find_file returns the absolute path of a file located by starting at the running directory of the executable (which differs between running utop and running tests; feel free to check it). Here, we have tailored the path so that it works correctly with inline tests.
11.2.3. Compressing DNA Sequences
There is no gain in storing strings in binary this way, as the file uses the same 8-bits-per-character format that OCaml itself uses for strings.
For some domains, however, representing the data as a string is wasteful. Realising this gives an initial idea for implementing data compression: exploiting properties of the data to find a more compact representation of it as a bit-string.
A good example of data that can be represented efficiently is DNA sequences. These are very long strings over only four characters:
- A (Adenine)
- G (Guanine)
- C (Cytosine)
- T (Thymine)
Therefore, typical sequences look as follows:
let dna_string1 = "CGT" let dna_string2 = "ATAGATGCATAGCGCATAGCTAGATAGTGCTAG" let dna_string3 = "ATAGATGCATAGCGCATAGCTAGATAGTGCTAGCGATGCATAGCGCAGATGCATAGCGCAGGGGG" let dna_string4 = "ATAGATGCATAGCGCATAGCTAGATAGTGCTAGCGATGCATAGCGCAGATGCATAGCGCAGGGGGATAGATGCATAGCGCATAGCTAGATAGTGCTAGCGATGCATAGCGCAGATGCATAGCGCAGGGGGATAGATGCATAGCGCATAGCTAGATAGTGCTAGCGATGCATAGCGCAGATGCATAGCGCAGGGGGATAGATGCATAGCGCATAGCTAGATAGTGCTAGCGATGCATAGCGCAGATGCATAGCGCAGGGGGATAGATGCATAGCGCATAGCTAGATAGTGCTAGCGATGCATAGCGCAGATGCATAGCGCAGGGGGATAGATGCATAGCGCATAGCTAGATAGTGCTAGCGATGCATAGCGCAGATGCATAGCGCAGGGGG"
Since there are only 4 characters in DNA strings, we don’t need 8 bits to encode them — just two bits would do:
let dna_encoding_size = 2
We can then implement the encoding from DNA characters to 2-bit integers and vice versa:
let dna_encoder = function
  | 'A' -> 0
  | 'C' -> 1
  | 'G' -> 2
  | 'T' -> 3
  | _ -> raise (Failure "DNA encoding error")

let dna_decoder = function
  | 0 -> 'A'
  | 1 -> 'C'
  | 2 -> 'G'
  | 3 -> 'T'
  | _ -> raise (Failure "DNA decoding error")
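To get a feeling for what the 2-bit encoding buys us, here is a minimal sketch that packs four DNA characters into a single byte using only integer shifts. The names encode, decode, pack4 and unpack4 are hypothetical, and the bit order (first character in the highest bits) is an assumption made for this illustration; it does not necessarily match the layout produced by Extlib.IO.

```ocaml
(* Same mapping as dna_encoder and dna_decoder above, repeated so that
   the sketch is self-contained. *)
let encode = function
  | 'A' -> 0 | 'C' -> 1 | 'G' -> 2 | 'T' -> 3
  | _ -> failwith "DNA encoding error"

let decode = function
  | 0 -> 'A' | 1 -> 'C' | 2 -> 'G' | 3 -> 'T'
  | _ -> failwith "DNA decoding error"

(* Pack exactly four characters into one byte, first character in the
   highest two bits (an assumed layout). *)
let pack4 s =
  assert (String.length s = 4);
  let acc = ref 0 in
  String.iter (fun c -> acc := (!acc lsl 2) lor encode c) s;
  !acc

(* Recover the four characters from a packed byte. *)
let unpack4 byte =
  String.init 4 (fun i -> decode ((byte lsr (2 * (3 - i))) land 3))
```

With this layout, pack4 "ACGT" evaluates to 27 (binary 00 01 10 11), so four characters occupy the space that a single character takes in the 8-bit string format.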
Let us now implement the binary serializers/deserializers for DNA data using this format. This can be accomplished using the general binary-manipulating primitives defined above.
The writing procedure starts by putting a header of size 30 bits (the largest size of a bit-sequence supported by Extlib.IO) into the bit file. The header is a serialised integer indicating the length of the following sequence of 2-bit encoded DNA characters. We did not need to put this information for 8-bit strings, but need it here because of the file padding via flush_bits: the zero bits added at the end would otherwise be indistinguishable from encoded A characters.
let write_dna_to_binary filename text =
  let serialize out text =
    let size = String.length text in
    write_bits out ~nbits:30 size;
    for i = 0 to size - 1 do
      let ch = dna_encoder text.[i] in
      write_bits out ~nbits:dna_encoding_size ch;
    done
  in
  write_to_binary serialize filename text
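To see why the length header matters, consider what would happen without it. The following in-memory sketch does not touch any files and uses hypothetical helper names (padded_codes, decode_all, decode_n); it mimics zero padding to a byte boundary and shows that the padding decodes as spurious A characters unless the original length is known.

```ocaml
let encode = function
  | 'A' -> 0 | 'C' -> 1 | 'G' -> 2 | 'T' -> 3
  | _ -> failwith "DNA encoding error"

let decode = function
  | 0 -> 'A' | 1 -> 'C' | 2 -> 'G' | 3 -> 'T'
  | _ -> failwith "DNA decoding error"

(* The 2-bit codes of the string, followed by zero codes up to a byte
   boundary (four 2-bit codes per byte). No header is included. *)
let padded_codes s =
  let codes = List.map encode (List.of_seq (String.to_seq s)) in
  let missing = (4 - List.length codes mod 4) mod 4 in
  codes @ List.init missing (fun _ -> 0)

(* Without a header, padding cannot be told apart from real 'A's. *)
let decode_all codes =
  String.of_seq (List.to_seq (List.map decode codes))

(* With a known length, the padding is simply ignored. *)
let decode_n n codes =
  String.sub (decode_all codes) 0 n
```

Here decode_all (padded_codes "CGT") yields "CGTA", whereas decode_n 3 (padded_codes "CGT") recovers "CGT".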
The deserializer proceeds by first retrieving the header and learning the length of the stream of 2-bit characters, and then using this information to read the DNA string into a buffer and return it as an OCaml string:
let read_dna_from_binary filename =
  let deserialize input =
    let buffer = Buffer.create 1 in
    let input_length = read_bits input 30 in
    for _ = 0 to input_length - 1 do
      let bits = read_bits input dna_encoding_size in
      let ch = dna_decoder bits in
      Buffer.add_char buffer ch
    done;
    Buffer.contents buffer
  in
  read_from_binary deserialize filename
We can now test our compression/decompression procedure for DNAs:
let dna_compression_test d =
  let filename = "dna.tmp" in
  write_dna_to_binary filename d;
  let d' = read_dna_from_binary filename in
  Sys.remove filename;
  d = d'
Question: How can we see if the compression is beneficial?
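One possible way to approach this question is to estimate the expected file sizes from the formats described above. This is only a sketch based on arithmetic, assuming a 30-bit header, 2 bits per character, and zero padding to a whole number of bytes; string_file_bytes and dna_file_bytes are hypothetical names. Comparing the sizes of the actual files on disk would be the more direct check.

```ocaml
(* Size in bytes of the 8-bit string encoding: one byte per character. *)
let string_file_bytes n = n

(* Size in bytes of the DNA encoding: a 30-bit header plus 2 bits per
   character, rounded up to a whole number of bytes by the padding. *)
let dna_file_bytes n = (30 + 2 * n + 7) / 8

let () =
  List.iter
    (fun n ->
       Printf.printf "length %d: string %d bytes, DNA %d bytes\n"
         n (string_file_bytes n) (dna_file_bytes n))
    [3; 1000]
```

Under these assumptions, a 3-character sequence takes 5 bytes in the DNA format versus 3 bytes as a string, because of the header, while a 1000-character sequence takes 254 bytes versus 1000.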