# Collection Filtering Algorithm with Bit Operations - Part 1

## Background

Collections are probably the most commonly used data model in programming, and there are a number of data structures that represent a collection in memory, of which `Array` and `Linked List` must be the best known.

When dealing with collections, `filtering` is one of the most widely used operations: for example, finding the item(s) that satisfy specific criteria in a collection. Although filtering is used a lot, it isn't actually a simple or easy operation to apply, especially when dealing with tremendous data volumes.

In fact, there are a number of derived topics about collection filtering, and it is impossible to cover them all in one article. This article will focus on `bit operations`: a not widely known, but really interesting, topic.

## Problem: Find the missing integer

Problem:
Suppose there is a collection of integers `L`.
Name the collection that contains the integers from 1 to `N` as `G`.
Then `L` contains all the elements of `G` except one number `x`.
The items in `L` are in random order.
Find `x`.

TIP: This problem may be too easy for experienced developers and algorithm hackers, but it is a good opening for the following 2 problems in this series. It also reveals the core concept behind some more advanced techniques, such as `CRC` or the `Bloom filter`.

Given this problem, the most intuitive idea (i.e. the way a human would do it manually) is to compare the elements in `L` and `G`: find each element of `L` and remove it from `G`; the only item left in `G` is then `x`.

Here is the code:
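The original snippet isn't preserved in this copy; a minimal JavaScript sketch of the idea (the name `findMissingBrute` is mine) might look like this:

```javascript
// Brute force: materialize G = {1..N}, then delete every element of L
// from it; the single surviving element is the missing number x.
function findMissingBrute(L, N) {
  const G = new Set();
  for (let i = 1; i <= N; i++) G.add(i);
  for (const item of L) G.delete(item);
  return G.values().next().value;
}
```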

Well, this works, but it is a brute-force algorithm that is inefficient in both time and space:

• Instantiating `G` gives it a space complexity of `O(N)`
• Finding and removing each `i` from `G` gives it a time complexity of `O(N*log(N))`

Yes, the algorithm can be optimized with bit operations. In fact, the problem can be solved with time complexity `O(N)` and space complexity `O(1)`.

## `Exclusive-Or` Operation

`Exclusive-Or` (or `XOR` for short, written as `^` in this article) is a basic bit operation; here is a Venn diagram that explains it:
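The diagram itself isn't reproduced in this copy; as a truth table, `XOR` outputs 1 exactly when its two inputs differ:

| A | B | A ^ B |
|---|---|-------|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |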

`XOR` is an important bit operation because it has following interesting features:

1. A ^ B = B ^ A (aka Commutativity)
2. (A ^ B) ^ C = A ^ (B ^ C) (aka Associativity)
3. A ^ A = 0
4. A ^ 0 = A

Commutativity and associativity ensure that `XOR` can be applied regardless of order. The 3rd and 4th features establish `XOR`'s critical position in `cryptography` and `encoding`.

## Solution: with `XOR`

By making use of `XOR`, the previous problem can be solved in a much more graceful way:

• Let `y` be the `XOR` of all items in `G`
• And let `z` be the `XOR` of all items in `L`
• Then `x = y ^ z`

Let’s prove it mathematically:
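The proof was not carried over into this copy; using the four features above, it runs like this:

```
y ^ z = (1 ^ 2 ^ ... ^ N) ^ (all items of L)   // every number except x appears twice
      = x ^ (a ^ a) ^ (b ^ b) ^ ...            // pair up duplicates (commutativity, associativity)
      = x ^ 0 ^ 0 ^ ...                        // A ^ A = 0
      = x                                      // A ^ 0 = A
```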

Actually, the code can be made more concise by using `reduce`, in 2 lines of code:
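The original two-liner isn't preserved here; a sketch using `reduce` (the names `xorRange` and `findMissing` are mine) could be:

```javascript
// XOR the numbers 1..N together, then XOR in every item of L:
// each number present on both sides cancels itself, leaving the missing x.
const xorRange = (N) =>
  Array.from({ length: N }, (_, i) => i + 1).reduce((a, b) => a ^ b, 0);
const findMissing = (L, N) => L.reduce((a, b) => a ^ b, xorRange(N));
```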

Or `C` version:

## Extending problems

So far the problem has been gracefully solved, but it is not yet the end. Here are a couple of extension questions:

1. What if `G` is a collection of random but unique numbers (not consecutive integers from 1 to `N`)? How would you write the code?
2. Based on `1`, what if 2 numbers are missing? How do you find them both?
3. Based on `1`, but `L` contains every member of `G` except `x` exactly 2 times. That means `L` might contain `x` 0 or 1 times, but exactly 2 of any other member of `G`. How do you find `x`?
4. Based on `1`, but `G` isn't a collection of integers; it is a collection of web URLs. How do you find the one missing URL?
5. Based on `4`, `G` is a collection of URLs and `L` is a subset of `G` (it might contain fewer members, not exactly `N - 1`). How can we know that a randomly picked member of `G` is definitely not contained in `L`? Mistakes in the other direction are tolerated: we may occasionally say a member of `G` is in `L` when it actually isn't.

These problems will be explained in the following posts.

# Implement sqrt with basic algebra operators

This is a very basic question, but solving it in a time-limited environment requires solid knowledge of algorithms, and the ability to use that knowledge flexibly.
I found that my own problem is that I know the techniques, but I cannot apply them as flexibly as I'd like.

### The problem description:

• Calculate the square root of a given number `N`.
• `N` could be a decimal, such as 6.25 or 0.01.
• The implementation is only allowed to use basic algebra operators like `+`, `-`, `*`, `/`, `<`, `>`, etc.
• Advanced functions like `Math.sqrt` are not allowed.

As a TDDer, I'm used to writing a simple test structure that is going to be used to express the test cases:

After this, we can begin to write our 1st test case, which is the simplest scenario that I can imagine:
We assume:

1. n must be an integer
2. n must have an integer square root
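Neither the scaffold nor the test case itself survives in this copy; a minimal stand-in for both (the helper names `test` and `assertEqual` are my own) might be:

```javascript
// Tiny test helper: runs one case and reports pass/fail.
function test(name, fn) {
  try {
    fn();
    console.log(`PASS: ${name}`);
  } catch (e) {
    console.log(`FAIL: ${name} - ${e.message}`);
  }
}

function assertEqual(actual, expected) {
  if (actual !== expected) throw new Error(`expected ${expected}, got ${actual}`);
}

// First test case: an integer input with an exact integer root.
// sqrt is not implemented yet, so this prints a FAIL message for now.
test("sqrt of a perfect square", () => assertEqual(sqrt(9), 3));
```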

Run the code; if everything goes right, we will get a failure message, as expected. Then we introduce our first implementation to fix the failing test case:

With the 2 additional assumptions in the 1st test case, we can easily figure out a simple solution: linearly search for the root among the integers between 1 and n.
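The original implementation isn't preserved; under those assumptions, the linear search might be sketched as:

```javascript
// Try every integer i from 1 to n until we find the one whose square is n.
function sqrt(n) {
  for (let i = 1; i <= n; i++) {
    if (i * i === n) return i;
  }
}
```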

So far so good. But there are 2 magic integers related to square roots: one is 1 and the other is 0.
It seems our function cannot handle both of them correctly, so I want to improve the algorithm to make it deal with these special numbers: 0 and 1.
So I added 2 test cases and improved the implementation:
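The improved version is missing from this copy; a sketch that short-circuits the two special numbers might be:

```javascript
// 0 and 1 are their own square roots; handle them before searching
// (a bare linear search from 1 never returns anything for n = 0).
function sqrt(n) {
  if (n === 0 || n === 1) return n;
  for (let i = 1; i <= n; i++) {
    if (i * i === n) return i;
  }
}
```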

Now everything looks good, except the performance.
The time complexity of this algorithm is O(n), which is bad. I expect the complexity could be close to O(1); at the very least it should be O(log n).

How could we improve the performance?
I once thought it would be safe to shrink the range to (1..n/2), but in fact it doesn't really help the performance of this algorithm: it is still O(n) after the update.
It also causes problems when dealing with the number 1, so I prefer to keep the range as is.

So what we do in the sqrt function now is essentially a search: we search for the number matching the condition between 1 and n.
Obviously 1..n is an ascending series, and the mapping x -> x*x is monotonically increasing (its derivative is positive).
So it is possible for us to replace the linear search with a variant of binary search, which reduces the time complexity from O(n) to O(log n).
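The binary-search version is not included in this copy; a recursive sketch (which still relies on a midpoint hitting the root exactly) might be:

```javascript
// Because x -> x*x is monotonically increasing, we can bisect [1, n]:
// if mid*mid is too big the root lies below mid, otherwise above it.
function sqrt(n) {
  if (n === 0 || n === 1) return n;
  function search(low, high) {
    const mid = (low + high) / 2;
    if (mid * mid === n) return mid;
    return mid * mid > n ? search(low, mid) : search(mid, high);
  }
  return search(1, n);
}
```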

After implementing the binary search algorithm, we find a very interesting phenomenon: we didn't restrict n to integers, and the function seems to have gained some ability to deal with floating-point numbers?!
So I tried adding 2 floating-point test cases:

Amazing, our code works fine!
But I believe this is deceptive, since both 2.5 and 1.5 stand exactly halfway between two neighbouring integers; the code fails when dealing with generic floating-point numbers.
The problem we hit is a call stack overflow: the binary search fails to land exactly on the number we expect.
To solve the problem, we can replace the exact equality comparison with a small enough range.
We introduce a constant EPSILON to describe the accuracy of the calculation.
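The EPSILON version is missing here; a sketch (the value `1e-9` is my choice) could be:

```javascript
const EPSILON = 1e-9;

// Accept mid as the answer when mid*mid is within EPSILON of n,
// instead of demanding exact equality.
function sqrt(n) {
  if (n === 0 || n === 1) return n;
  function search(low, high) {
    const mid = (low + high) / 2;
    const diff = mid * mid - n;
    if (diff < EPSILON && diff > -EPSILON) return mid;
    return diff > 0 ? search(low, mid) : search(mid, high);
  }
  return search(1, n);
}
```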

Now it looks like our code can calculate the square root of most numbers larger than 1, but it fails for numbers less than 1.
The reason for the failure is that x*x < x when x < 1, but x*x > x when x > 1, which means we should search different ranges for numbers > 1 and numbers < 1.
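The range fix isn't preserved in this copy; a sketch that picks the bracket based on n might be:

```javascript
const EPSILON = 1e-9;

// For n > 1 the root lies in (1, n); for n < 1 it lies in (n, 1),
// because x*x < x whenever 0 < x < 1.
function sqrt(n) {
  if (n === 0 || n === 1) return n;
  const low0 = n < 1 ? n : 1;
  const high0 = n < 1 ? 1 : n;
  function search(low, high) {
    const mid = (low + high) / 2;
    const diff = mid * mid - n;
    if (diff < EPSILON && diff > -EPSILON) return mid;
    return diff > 0 ? search(low, mid) : search(mid, high);
  }
  return search(low0, high0);
}
```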

So now the algorithm is pretty much what we want, but it still raises a call stack overflow exception sometimes, and it wastes too many iterations on unnecessary precision.
So I think maybe we can make the algorithm approach O(1) by sacrificing some precision in the result.
We set up a limit on the maximum number of iterations the calculation can take. When the limit is reached, we break out of the iteration and return a less accurate number.
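The capped version is not preserved; an iterative sketch (the cap of 100 is my choice) could be:

```javascript
const EPSILON = 1e-9;
const MAX_ITERATIONS = 100;

// Bisect as before, but give up after MAX_ITERATIONS steps and
// return the best estimate found so far.
function sqrt(n) {
  if (n === 0 || n === 1) return n;
  let low = n < 1 ? n : 1;
  let high = n < 1 ? 1 : n;
  let mid = (low + high) / 2;
  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const diff = mid * mid - n;
    if (diff < EPSILON && diff > -EPSILON) break; // accurate enough
    if (diff > 0) high = mid;
    else low = mid;
    mid = (low + high) / 2;
  }
  return mid;
}
```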

This is really useful when calculating irrational square roots, which have infinitely many digits and no exact finite representation.

Besides the binary search approach, we can also calculate the square root with Newton's method, which computes the result with an iterative equation.
Newton's method is limited in precision, but has good performance. It is said that Quake, one of the ancestors of FPS games, used it in its engine to get good performance out of limited computing power.
Here is an easy-to-understand document explaining how it works.
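That document isn't reproduced here, but a minimal sketch of Newton's method for square roots (the iteration count of 20 is my arbitrary choice) might be:

```javascript
// Newton's method applied to f(x) = x*x - n:
// the update x - f(x)/f'(x) simplifies to x = (x + n / x) / 2.
function newtonSqrt(n, iterations = 20) {
  if (n === 0) return 0;
  let x = n < 1 ? 1 : n; // any positive initial guess converges
  for (let i = 0; i < iterations; i++) {
    x = (x + n / x) / 2;
  }
  return x;
}
```

Because the error roughly squares on each step, a handful of iterations usually reaches full double precision, which is why this converges so much faster than bisection.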