Collection Filtering Algorithm with Bit Operation - Part.1

Background

Collection is probably the most commonly used data model in programming. There are a number of data structures that represent a collection in memory, among which Array and Linked List are the best-known ones.

When dealing with collections, filtering is one of the most widely used operations: finding the item(s) that satisfy specific criteria in a collection. Although filtering is used a lot, it isn't actually a simple or easy operation to apply, especially when dealing with tremendous data volumes.

In fact, there are a number of derived topics about collection filtering, and it is impossible to cover them all in one article. This article focuses on bit operations: a not widely known but really interesting topic.

Problem: Find the missing integer

Problem
Suppose there is a collection of integers L.
Name the collection that contains integers from 1 to N as G
Then L contains all the elements of G except one number x.
Items in L are in random order.
Find x.

TIP: This problem may be too easy for experienced developers and algorithm hackers, but it is a good opener for the following 2 problems in this series. It also reveals the core concept behind some other advanced technologies, such as CRC or the Bloom filter.

Given this problem, the most intuitive idea (i.e. what a human being would do manually) is to compare the elements of L and G: find each element of L and remove it from G, and the only item left in G is x.

Here is the code:

# @param Integer[] list
# @param Integer n
# @return Integer x
# (Ruby requires lowercase names for parameters and local variables,
#  so L, N and G become list, n and g here)
def find_missing_x(list, n)
  g = [*(1..n)]                 # Create G with all integers from 1 to N
  list.each { |i| g.delete(i) } # Traverse L, and remove each item of L from G
  return g.last                 # Return the only remaining item of G
end

Well, this works, but it is a brute-force algorithm that is efficient in neither time nor space:

  • Instantiating G gives it space complexity O(N)
  • Each delete is a linear scan of G, so removing all N - 1 items costs O(N²) time

Yes, the algorithm can be optimized with bit operations. In fact, the problem can be resolved with time complexity O(N) and space complexity O(1).

Exclusive-Or Operation

Exclusive-Or (or XOR for short, written as ^ in this article) is a basic bit operation; here is a Venn diagram that explains it:

[Figure: Venn diagram of XOR]

XOR is an important bit operation because it has following interesting features:

  1. A ^ B = B ^ A (aka Commutativity)
  2. (A ^ B) ^ C = A ^ (B ^ C) (aka Associativity)
  3. A ^ A = 0
  4. A ^ 0 = A

Commutativity and associativity ensure that XOR can be applied regardless of order. The 3rd and 4th properties establish XOR's critical position in cryptology and encoding.
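These four properties are easy to check directly; here is a quick Ruby sanity check, with values chosen arbitrarily for illustration:

```ruby
# Verify the four XOR properties on some arbitrary integers.
a, b, c = 0b1010, 0b0110, 0b1111

raise "commutativity" unless a ^ b == b ^ a              # 1. A ^ B = B ^ A
raise "associativity" unless (a ^ b) ^ c == a ^ (b ^ c)  # 2. (A ^ B) ^ C = A ^ (B ^ C)
raise "self-inverse"  unless a ^ a == 0                  # 3. A ^ A = 0
raise "identity"      unless a ^ 0 == a                  # 4. A ^ 0 = A
```

(In Ruby, ^ binds tighter than ==, so these expressions parse as intended.)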

Solution: with XOR

By making use of XOR, the previous problem can be resolved in a much more graceful way:

  • Given y is the xor result of all items in G
  • And z is the xor result of all items in L
  • Then x = y ^ z
# @param Integer[] list
# @param Integer n
# @return Integer x
def find_missing_x(list, n)
  x = 0                      # Set x to 0, since A ^ 0 = A
  (1..n).each { |i| x ^= i } # Calculate the xor result for G
  list.each { |i| x ^= i }   # Keep xor-ing all elements of L
  return x                   # Return the only remaining item of G
end

Let’s prove it mathematically:

Given x' = y ^ z
Then x' = (n1 ^ n2 ^ n3 ^ ... ^ x ^ ... ^ nN) ^ (n1 ^ n2 ^ n3 ^ ... ^ nN)
Since ^ obeys commutativity and associativity
Then x' = (n1 ^ n1) ^ (n2 ^ n2) ^ ... ^ (nN ^ nN) ^ x
As n ^ n = 0 and n ^ 0 = n
Then x' = 0 ^ 0 ^ ... ^ 0 ^ x
        = x
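To make the cancellation concrete, here is a tiny worked run (the values N = 5 and x = 3 are chosen purely for illustration):

```ruby
# N = 5, so G = [1, 2, 3, 4, 5]; L is G without 3, in random order.
n = 5
l = [2, 5, 1, 4]

y = (1..n).reduce(:^)  # xor of all members of G
z = l.reduce(:^)       # xor of all members of L
x = y ^ z              # every paired value cancels; only the missing number is left

puts x  # => 3
```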

Actually, the code can be made even more concise by using reduce, in just 2 lines:

# @param Integer[] list
# @param Integer n
# @return Integer x
def find_missing_x(list, n)
  x = (1..n).reduce(&:^)  # Calculate the xor result for G
  list.reduce(x, &:^)     # Keep xor-ing all elements of L, and return
end

Or a C version:

// L has N - 1 members and is missing exactly one of 1..N
long find_missing_x(long N, long* L) {
    long x = N;              // fold in N, which the loop below never produces
    for (long i = 0; i < N - 1; i++) {
        x ^= (i + 1) ^ L[i]; // xor in 1..N-1 alongside the members of L
    }
    return x;
}

Extending problems

So far the problem has been gracefully resolved, but this is not yet the end. Here are a couple of extending questions:

  1. What if G is a collection of random but unique numbers (not consecutive integers from 1 to N)? How would you write the code?
  2. Based on 1, what if 2 numbers are missing? How can you find them both?
  3. Based on 1, but L contains every member of G exactly twice, except x, which may appear 0 or 1 times. How can you find x?
  4. Based on 1, but G isn't a collection of integers; it is a collection of web URLs. How can you find the one missing URL?
  5. Based on 4, G is a collection of URLs and L is a subset of G (it might be missing several members, not exactly one). How can we know that a randomly picked member of G is definitely not contained in L, given a tolerance for occasionally saying a member of G is in L when it isn't?

These problems will be explained in the following posts.

Implement sqrt with basic algebra operators

It is a very basic question, but solving it in a time-limited environment requires solid knowledge of algorithms, and the ability to use that knowledge flexibly.
I found my own problem is that I know the material, but I cannot wield it as naturally as my own hand.

The problem description:

Calculate the square root of a given number N
N could be a decimal, such as 6.25 or 0.01
The implementation is only allowed to use basic algebra operators like +, -, *, /, <, >, etc.
Advanced functions like Math.sqrt are not allowed

As a TDDer, I'm used to starting with a simple test structure that will be used to express the test cases:

Code Skeleton

def assert(expected, actual)
  if expected == actual
    puts "Passed"
  else
    puts "Failed"
    p expected
    p actual
  end
end

def sqrt(n)
end

After this, we can begin to write our 1st test case, the simplest scenario that I can imagine.
We assume:

  1. n must be an integer
  2. n must have an integer square root
1st test case

def assert(expected, actual)
  if expected == actual
    puts "Passed"
  else
    puts "Failed"
    p expected
    p actual
  end
end

def sqrt(n)
end

# Start from the easiest case, the integer square root.
assert 3, sqrt(9)

Run the code; if everything goes right, we will get a Failed message as expected. Then we introduce our first implementation to fix the failing test case.

With the 2 additional assumptions in the 1st test case, we can easily figure out a simple solution: a linear search for the root over the integers between 1 and n.

1st implementation

def assert(expected, actual)
  if expected == actual
    puts "Passed"
  else
    puts "Failed"
    p expected
    p actual
  end
end

def sqrt(n)
  (1..n).each do |x|
    return x if x * x == n
  end
end

# Start from the easiest case, the integer square root.
assert 3, sqrt(9)

So far so good. But there are 2 magic integers related to sqrt: one is 1 and the other is 0.
It seems our function cannot handle both of them correctly (sqrt(0) finds no match and returns the empty range instead of a number), so I want to improve the algorithm to deal with the special numbers 0 and 1.
So I added 2 test cases and improved the implementation:

sqrt of 0 and 1

def assert(expected, actual)
  if expected == actual
    puts "Passed"
  else
    puts "Failed"
    p expected
    p actual
  end
end

def sqrt(n)
  return 0 if n == 0
  (1..n).each do |x|
    return x if x * x == n
  end
end

# Start from the easiest case, the integer square root.
assert 3, sqrt(9)
# 2 corner cases
assert 1, sqrt(1)
assert 0, sqrt(0)

Now everything looks good, except the performance.
The time complexity of this algorithm is O(n), which is bad. I expect the complexity could get close to O(1); at the very least it should be O(log n).

How could we improve the performance?
I had thought it would be safe to shrink the range to (1..n/2), but that doesn't really help: the algorithm is still O(n) after the change.
It also causes problems when dealing with the number 1, so I prefer to keep the range as is.

What the sqrt function does now is really a kind of search: we search for the number matching the condition between 1 and n.
Obviously 1..n is an ascending series, and the mapping x -> x*x is monotonically increasing (it has a positive derivative).
So we can replace the linear search with a variant of binary search, which reduces the time complexity from O(n) to O(log n).

Binary Search

def assert(expected, actual)
  if expected == actual
    puts "Passed"
  else
    puts "Failed"
    p expected
    p actual
  end
end

def binary_search(goal, start, stop)
  mid = (stop - start) / 2 + start
  mid_square = mid * mid
  if mid_square == goal
    return mid
  elsif mid_square > goal
    return binary_search(goal, start, mid)
  else
    return binary_search(goal, mid, stop)
  end
end

def sqrt(n)
  return 0 if n == 0
  binary_search(n, 1, n)
end

# Start from the easiest case, the integer square root.
assert 3, sqrt(9)
# 2 corner cases
assert 1, sqrt(1)
assert 0, sqrt(0)
# 2 normal cases
assert 5, sqrt(25)
assert 9, sqrt(81)

After implementing the binary search algorithm, we find a very interesting phenomenon: we never restricted n to integers, and it seems the function gained some capability to deal with float numbers?!
So I tried to add 2 float number test cases:

Float number test cases

def assert(expected, actual)
  if expected == actual
    puts "Passed"
  else
    puts "Failed"
    p expected
    p actual
  end
end

def binary_search(goal, start, stop)
  mid = (stop - start) / 2 + start
  mid_square = mid * mid
  if mid_square == goal
    return mid
  elsif mid_square > goal
    return binary_search(goal, start, mid)
  else
    return binary_search(goal, mid, stop)
  end
end

def sqrt(n)
  return 0 if n == 0
  binary_search(n, 1, n)
end

# Start from the easiest case, the integer square root.
assert 3, sqrt(9)
# 2 corner cases
assert 1, sqrt(1)
assert 0, sqrt(0)
# 2 normal cases
assert 5, sqrt(25)
assert 9, sqrt(81)
# float number
assert 2.5, sqrt(6.25)
assert 1.5, sqrt(2.25)

Amazing, our code works fine!
But I believe it is a fluke: both 2.5 and 1.5 sit exactly halfway between 2 nearby integers, and the code fails on generic float numbers.
The problem we then meet is a call stack overflow: binary search fails to hit exactly the number we expect, so the recursion never terminates.
To solve the problem, we can replace the exact equality comparison with a small enough range.
We introduce a constant EPSILON to describe the accuracy of the calculation.

Adjust precision

EPSILON = 100 * Float.const_get(:EPSILON)

def assert(expected, actual)
  if (expected - actual).abs < EPSILON # compare the distance, not the signed difference
    puts "Passed"
  else
    puts "Failed"
    p expected
    p actual
  end
end

def binary_search(goal, start, stop)
  mid = (stop - start) / 2 + start
  mid_square = mid * mid
  if (mid_square - goal).abs < EPSILON
    return mid
  elsif mid_square > goal
    return binary_search(goal, start, mid)
  else
    return binary_search(goal, mid, stop)
  end
end

def sqrt(n)
  return 0 if n == 0
  binary_search(n, 1, n)
end

# Start from the easiest case, the integer square root.
assert 3, sqrt(9)
# 2 corner cases
assert 1, sqrt(1)
assert 0, sqrt(0)
# 2 normal cases
assert 5, sqrt(25)
assert 9, sqrt(81)
# float number
assert 2.5, sqrt(6.25)
assert 1.5, sqrt(2.25)
# float numbers not at 2^n
assert 3.3, sqrt(10.89)
assert 7.7, sqrt(59.29)

Now it looks like our code can calculate the square root of most numbers larger than 1, but it fails on numbers less than 1.
The reason for the failure is that x * x < x when 0 < x < 1, but x * x > x when x > 1, which means we should search different ranges for numbers > 1 and numbers < 1.

float numbers <1

EPSILON = 100 * Float.const_get(:EPSILON)

def assert(expected, actual)
  if (expected - actual).abs < (EPSILON * 10)
    puts "Passed"
  else
    puts "Failed"
    p expected
    p actual
  end
end

def binary_search(goal, start, stop)
  mid = (stop - start) / 2 + start
  mid_square = mid * mid
  if (mid_square - goal).abs < EPSILON
    return mid
  elsif mid_square > goal
    return binary_search(goal, start, mid)
  else
    return binary_search(goal, mid, stop)
  end
end

def sqrt(n)
  return 0 if n == 0
  if n == 1
    return 1
  elsif n > 1
    return binary_search(n, 1, n)
  else
    return binary_search(n, n, 1)
  end
end

puts "Start from the easiest case, the integer square root."
assert 3, sqrt(9)
puts "2 corner cases"
assert 1, sqrt(1)
assert 0, sqrt(0)
puts "2 normal cases"
assert 5, sqrt(25)
assert 9, sqrt(81)
puts "float number"
assert 2.5, sqrt(6.25)
assert 1.5, sqrt(2.25)
puts "float numbers not at 2^n"
assert 3.3, sqrt(10.89)
assert 7.7, sqrt(59.29)
puts "float number < 1"
assert 0.1, sqrt(0.01)
assert 0.02, sqrt(0.0004)

Now the algorithm is pretty much what we want, but it still raises a stack overflow exception sometimes, and it wastes too many iterations on unnecessary precision.
So maybe we can bring the algorithm arbitrarily close to O(1) by sacrificing some precision in the result.
We set up a limit on the maximum number of iterations the calculation may take. When the limit is reached, we break out of the iteration and return a less accurate number.

This is really useful when calculating irrational square roots, which have unlimited digits and no exact answer.

irrational square root

EPSILON = 10 * Float.const_get(:EPSILON)
DEPTH_LIMIT = 100

def assert(expected, actual)
  if (expected - actual).abs < (EPSILON * 10)
    puts "Passed"
  else
    puts "Failed"
    p expected
    p actual
  end
end

def binary_search(goal, start, stop, depth)
  mid = (stop - start) / 2 + start
  mid_square = mid * mid
  if (mid_square - goal).abs < EPSILON
    return mid
  else
    return mid if depth >= DEPTH_LIMIT
    if mid_square > goal
      return binary_search(goal, start, mid, depth + 1)
    else
      return binary_search(goal, mid, stop, depth + 1)
    end
  end
end

def sqrt(n)
  return 0 if n == 0
  n = n.to_f
  if n == 1
    return 1
  elsif n > 1
    return binary_search(n, 1, n, 0)
  else
    return binary_search(n, n, 1, 0)
  end
end

puts "Start from the easiest case, the integer square root."
assert 3, sqrt(9)
puts "2 corner cases"
assert 1, sqrt(1)
assert 0, sqrt(0)
puts "2 normal cases"
assert 5, sqrt(25)
assert 9, sqrt(81)
puts "float number"
assert 2.5, sqrt(6.25)
assert 1.5, sqrt(2.25)
puts "float numbers not at 2^n"
assert 3.3, sqrt(10.89)
assert 7.7, sqrt(59.29)
puts "float number < 1"
assert 0.1, sqrt(0.01)
assert 0.02, sqrt(0.0004)
puts "irrational root"
assert 1.414213562373095, sqrt(2)
assert 1.732050807568877, sqrt(3)

Besides the binary search approach, we can also calculate the square root with Newton's method, which computes the result with an iterative equation.
Newton's method has limited precision but very good performance; a famous related trick, the fast inverse square root, was used in the Quake engine to get good performance out of limited computing power.
Here is an easy-to-understand document explaining how it works.
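As a taste of that approach, here is a minimal sketch of Newton's method for square roots (my own illustration, not code from any game engine): starting from a positive guess x, repeatedly replace it with the average of x and n / x until the value stops moving.

```ruby
# Newton's method for f(x) = x^2 - n: each step follows the tangent line,
# which simplifies to x <- (x + n / x) / 2.
def newton_sqrt(n, tolerance = 1e-12)
  return 0.0 if n == 0
  x = n < 1 ? 1.0 : n.to_f  # any positive starting guess converges
  loop do
    next_x = (x + n / x) / 2.0
    break if (next_x - x).abs < tolerance
    x = next_x
  end
  x
end

puts newton_sqrt(2)     # ~1.4142135623730951
puts newton_sqrt(6.25)  # ~2.5
```

Convergence is quadratic (the number of correct digits roughly doubles each step), which is why it typically needs far fewer iterations than the depth-limited binary search above.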