YSC2229: Introductory Data Structures and Algorithms
  • 1. Course Syllabus
  • 2. Software Prerequisites
  • 3. OCaml Style Guide
  • 1. Week 01: Introduction
  • 2. Week 02: Working with Arrays
  • 3. Week 03: Complexity of Algorithms and Order Notation
  • 4. Week 04: Divide-and-Conquer Algorithms
  • 5. Week 05: Binary Heaps and Priority Queues
  • 6. Week 06: Abstract Data Types
  • 7. Midterm Project: Memory Allocation and Reclamation
  • 8. Week 07: Hashing-Based Data Structures
  • 9. Week 08: Searching in Strings
    • 9.1. Substring Search
      • 9.1.1. Testing a search procedure
      • 9.1.2. A very naive search
      • 9.1.3. A slightly better naive search
      • 9.1.4. A recursive version of the naive search
      • 9.1.5. Testing naive search
    • 9.2. Rabin-Karp Search
    • 9.3. Knuth–Morris–Pratt Algorithm
    • 9.4. Exercises
  • 10. Week 09: Backtracking and Dynamic Programming
  • 11. Week 10: Data Encoding and Compression
  • 12. Week 11: Binary Search Trees
  • 13. Week 12: Graph Algorithms
  • 14. Week 13: Elements of Computational Geometry
  • 15. Final Project: Vroomba Programming
YSC2229: Introductory Data Structures and Algorithms
  • Docs »
  • 9. Week 08: Searching in Strings »
  • 9.1. Substring Search

9.1. Substring Search¶

  • File: StringSearch.ml

Imagine that you are searching for a word on a web page or a Word document.

This problem is known and pattern search in a string. Despite being seemingly a simple challenge to solve, in order to do efficiently, it requires a lot of ingenuity.

In this lecture we will see several solutions for it, of the increased implementation complexity, while reduced time demand.

9.1.1. Testing a search procedure¶

The procedure search takes a string text and a patern pattern and returns a result of type int option, where Some i denotes the first index in the string text, such that pattern starts from it and is fully contained within text. If no such index exist (i.e., pattern is not in text), search returns None.

Even before we implement the search procedure itself, we develop a test for it. The first test function checkes if a pattern pattern is indeed in the string text, as reported by the function search:

let test_pattern_in search text pattern =
  let index = get_exn @@ search text pattern in
  let p' = String.sub text index (String.length pattern) in
  assert (pattern = p')

let test_pattern_not_in search text pattern =
  assert (search text pattern = None)

9.1.2. A very naive search¶

Our first attempt to implement the search is as follows:

let very_naive_search text pattern =
  let n = String.length text in
  let m = String.length pattern in
  if n < m then None
  else
    let k = ref 0 in
    let res = ref None in
    while !k <= n - m && !res = None do
      let candidate = String.sub text !k m in
      if candidate = pattern
      then res := Some !k
      else k := !k + 1
    done;
    !res

Question: what is the worst-case complexity of this search in terms of sizes n and m of text and pattern correspondingly?

If we try to evaluate this search on reasonably large strings, we will notice that it behaves quite poorly already for strings of size 5000 and pattern size 20. This is because it exercises its worst-case theoretical complexity. How can we make it better?

9.1.3. A slightly better naive search¶

We can implement a slightly better naive search as follows. Notice that OCaml syntax s.[i] allows one to refer to a character of a string s at a position i:

let naive_search text pattern =
  let n = String.length text in
  let m = String.length pattern in
  if n < m then None
  else
    let k = ref 0 in
    let res = ref None in
    while !k <= n - m && !res = None do
      let j = ref 0 in
      while !j <= m - 1 &&
            text.[!k + !j] = pattern.[!j]
      do  j := !j + 1  done;
      if !j = m
      then res := Some !k
      else
        k := !k + 1
    done;
    !res

It tries to identify the pattern starting from each position i, checking the characters in the substring one by one. If it fails in the inner search, it simply tries the next position by incrementing i.

Question: what is the worst-case complexity of this search in terms of sizes n and m of text and pattern correspondingly?

9.1.4. A recursive version of the naive search¶

The same implementation can be, obviously rewritten so it would be tail-recursive:

let naive_search_rec text pattern =
  let n = String.length text in
  let m = String.length pattern in
  if n < m then None
  else
    let rec walk k =
      if k > n - m then None
      else (
      let j = ref 0 in
      while !j <= m - 1 &&
            text.[k + !j] = pattern.[!j]
      do  j := !j + 1  done;

      if !j = m
      then Some k
      else walk @@ k + 1)

    in walk 0

9.1.5. Testing naive search¶

Let us construct a number of tests, starting from a simple one:

let big = "abcdefghijklmnopeqrstuvabcsrtdsdqewgdcvaegbdweffwdajbjrag"

let patterns = ["dsd"; "jrag"; "abc"]

let%test "Naive Search Works" =
  List.iter (fun p -> test_pattern_in naive_search big p) patterns;
  true

We can also check, on a random string, that our search returns no false positives and no false negatives.

How do we generate random strings for testing search?

This can be done using the function generate_words, which we generated before. We simply create a list of words and concatenate it to produce the string text. We can also create the list of words that are (with a very high probability) not in the obtained string text:

let generate_string_and_patterns n m =
  let ps_in = generate_words n m in
  let ps_not_in =
    List.filter (fun p -> not (List.mem p ps_in)) @@
    generate_words n m in
  let s = String.concat "" (List.rev ps_in) in
  (s, ps_in, ps_not_in)

let%test "Naive Search True Positives" =
  let (s, ps, _) = generate_string_and_patterns 500 5 in
  List.iter (fun p -> test_pattern_in naive_search s p) ps;
  true

let%test "Naive Search True Negatives" =
  let (s, _, pn) = generate_string_and_patterns 500 5 in
  List.iter (fun p -> test_pattern_not_in naive_search s p) pn;
  true

Finally, we can provide a higher-order testing procedure for strings, so it would test on a specific string, and on randomly-generated strings (for both positive and negative results), as follows:

let search_tester search =
  let (s, ps, pn) = generate_string_and_patterns 500 5 in
  List.iter (fun p -> test_pattern_in search big p) patterns;
  List.iter (fun p -> test_pattern_in search s p) ps;
  List.iter (fun p -> test_pattern_not_in search s p) pn;
  true

let%test _ = search_tester naive_search
Next Previous

© Copyright 2021, Ilya Sergey

Built with Sphinx using a theme provided by Read the Docs.