Today, we faced to a technical question when giving solution for a prospect. We need to do a background processing in Java for below reqs:

+ Getting some text data from social networks.

+ Doing some string search for counting the keywords (in a pre-defined list) before giving some final results.

Certainly, the first quick-dirty answer for this is: “using String indexOf function”. We all know that is not good for performance.

We did a quick research on this topic. Below is the list of solution we’re analyzing:

+ Using Full Text search.

+ Change to another string search algorithm.

We still don’t decide the final solution yet. However, below is the list of string search algorithm we find:

- Brute Force algorithm (BF)
- Deterministic Finite Automaton algorithm (DFA)
- Karp-Rabin algorithm (KR)
- Shift Or algorithm (SO)
- Morris-Pratt algorithm (MP)
- Knuth-Morris-Pratt algorithm (KMP)
- Simon algorithm (SMN)
- Colussi algorithm (CLS)
- Galil-Giancarlo algorithm (GG)
- Apostolico-Crochemore algorithm (AC)
- Not So Naive algorithm (NSN)
- Boyer-Moore algorithm (BM)
- Turbo BM algorithm (TBM)
- Apostolico-Giancarlo algorithm (AG)
- Reverse Colussi algorithm (RC)
- Horspool algorithm (HP)
- Quick Search algorithm (QS)
- Tuned Boyer-Moore algorithm (BMT)
- Zhu-Takaoka algorithm (ZT)
- Berry-Ravindran algorithm (BR)
- Smith algorithm (SMT)
- Raita algorithm (RT)
- Reverse Factor algorithm (RF)
- Turbo Reverse Factor algorithm (TRF)
- Forward Dawg Matching algorithm (FDM)
- Backward Nondeterministic Dawg Matching algorithm (BNDM)
- Backward Oracle Matching algorithm (BOM)
- Galil-Seiferas algorithm (GS)
- Two Way algorithm (TW)
- String Matching on Ordered Alphabets algorithm (SMOA)
- Optimal Mismatch algorithm (OM)
- Maximal Shift algorithm (MS)
- Skip Search algorithm (SS)
- KMP Skip Search algorithm (KPMSS)

Below is the benchmark chart of these algorithm:

You can see more detail about the benchmark on the article here.

Boyer More/BNDM are prefer ones. However, the String indexOf implementation in Java is used with Brute Force algorithm.

We also find Boyer More/BNDM implementation on this.

Anyway, we need to do some tests before giving the final answer.