Today, we faced to a technical question when giving solution for a prospect. We need to do a background processing in Java for below reqs:
+ Getting some text data from social networks.
+ Doing some string search for counting the keywords (in a pre-defined list) before giving some final results.
Certainly, the first quick-dirty answer for this is: “using String indexOf function”. We all know that is not good for performance.
We did a quick research on this topic. Below is the list of solution we’re analyzing:
+ Using Full Text search.
+ Change to another string search algorithm.
We still don’t decide the final solution yet. However, below is the list of string search algorithm we find:
- Brute Force algorithm (BF)
- Deterministic Finite Automaton algorithm (DFA)
- Karp-Rabin algorithm (KR)
- Shift Or algorithm (SO)
- Morris-Pratt algorithm (MP)
- Knuth-Morris-Pratt algorithm (KMP)
- Simon algorithm (SMN)
- Colussi algorithm (CLS)
- Galil-Giancarlo algorithm (GG)
- Apostolico-Crochemore algorithm (AC)
- Not So Naive algorithm (NSN)
- Boyer-Moore algorithm (BM)
- Turbo BM algorithm (TBM)
- Apostolico-Giancarlo algorithm (AG)
- Reverse Colussi algorithm (RC)
- Horspool algorithm (HP)
- Quick Search algorithm (QS)
- Tuned Boyer-Moore algorithm (BMT)
- Zhu-Takaoka algorithm (ZT)
- Berry-Ravindran algorithm (BR)
- Smith algorithm (SMT)
- Raita algorithm (RT)
- Reverse Factor algorithm (RF)
- Turbo Reverse Factor algorithm (TRF)
- Forward Dawg Matching algorithm (FDM)
- Backward Nondeterministic Dawg Matching algorithm (BNDM)
- Backward Oracle Matching algorithm (BOM)
- Galil-Seiferas algorithm (GS)
- Two Way algorithm (TW)
- String Matching on Ordered Alphabets algorithm (SMOA)
- Optimal Mismatch algorithm (OM)
- Maximal Shift algorithm (MS)
- Skip Search algorithm (SS)
- KMP Skip Search algorithm (KPMSS)
Below is the benchmark chart of these algorithm:
You can see more detail about the benchmark on the article here.
Boyer More/BNDM are prefer ones. However, the String indexOf implementation in Java is used with Brute Force algorithm.
We also find Boyer More/BNDM implementation on this.
Anyway, we need to do some tests before giving the final answer.