libindic/soundex

Consecutive Codes


Consecutive codes may not be handled correctly. See the test cases Pfister (expected P-236) and Tymczak (expected T-522) at http://www.archives.gov/research/census/soundex.html.

The original Russell and census versions of the algorithm seem to collapse consecutive identical codes only when the letters are adjacent, i.e. not separated by a vowel (a '0' code character).

The archives.gov reference also mentions another special case where a consecutive code is discarded when separated by an 'H' or 'W'.
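
A minimal sketch of the behavior described above, for ASCII English names only. This is not the libindic implementation; `census_soundex` and `CODES` are hypothetical names used for illustration. It assumes that vowels and 'Y' reset the previous code, while 'H' and 'W' are skipped without resetting it.

```python
# Hypothetical sketch of the census-style rules; not the libindic API.
CODES = {}
for letters, digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def census_soundex(name, length=4):
    # Upper-case first so 'h'/'w' are treated the same as 'H'/'W'.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    # The first letter's own code also counts as the "previous" code.
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            # 'H' and 'W' are transparent: the previous code is kept,
            # so a repeated code on the other side is still discarded.
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        # Vowels (and 'Y') have no code, so they reset prev to "".
        prev = code
    # Pad with zeros and truncate to the requested length.
    return (result + "000")[:length]
```

With these assumptions, `census_soundex("Pfister")` drops the 'F' because it is adjacent to 'P' with the same code, `census_soundex("Tymczak")` keeps the 'K' because a vowel separates it from 'Z', and `census_soundex("Ashcraft")` drops the 'C' because only an 'H' separates it from 'S'.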

EDIT: The 'H' or 'W' rule actually is used in the SQL Server implementation. Removed the comment that it's not.

EDIT2: I was right and wrong before my first edit. MSSQL is case sensitive in its handling of 'H' and 'W'. Consecutive codes are discarded when the separating 'H' or 'W' is upper case, and not when it is lower case...