sfu-db/dataprep

clean_phone module doesn't recognize e.164 extension format

yukewang1 opened this issue · 1 comment

Describe the bug

The E.164 standards state that phone numbers can be written in the format +<CountryCode><City/AreaCode><LocalNumber>;ext=<ext>, for example +19052223333;ext=555. The current clean_phone() function doesn't recognize such numbers because the regex at line 16 of clean_phone.py has no rule for the extension suffix.
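As a rough illustration of the missing rule, here is a minimal sketch of a pattern that accepts the optional ;ext= suffix. The pattern, the E164_WITH_EXT name, and the split_extension helper are all hypothetical. They are not taken from clean_phone.py, and the real fix would need to fit into the existing regex there.

```python
import re

# Hypothetical pattern, NOT the regex from clean_phone.py:
# "+", a non-zero leading digit, up to 14 more digits,
# then an optional ";ext=<digits>" suffix.
E164_WITH_EXT = re.compile(r"\+(?P<number>[1-9]\d{1,14})(?:;ext=(?P<ext>\d+))?")


def split_extension(value):
    """Return (number, extension) or None if the value does not match.

    The extension is None when the ";ext=" suffix is absent.
    """
    match = E164_WITH_EXT.fullmatch(value)
    if match is None:
        return None
    return "+" + match.group("number"), match.group("ext")


print(split_extension("+19052223333;ext=555"))  # ('+19052223333', '555')
print(split_extension("+19052223333"))          # ('+19052223333', None)
```

Capturing the extension in its own group would let the formatter append it separately instead of rejecting the whole value.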

To Reproduce
Steps to reproduce the behavior:

from dataprep.clean import clean_phone
import pandas as pd

df = pd.DataFrame({
    "phone": ["+19052223333;ext=555"]
})

clean_phone(df, "phone", output_format="e164")

Expected behavior
The correct output should be of the form +12345678901 ext. 1234, whereas clean_phone() doesn't recognize this format and outputs np.NaN.

Desktop:

  • OS: macOS Monterey
  • Browser: Chrome
  • Platform: Jupyter Notebook
  • Platform Version: 6.4.8
  • Python Version: 3.9.9
  • Dataprep Version: 0.4.2

Additional context
Here's a blog post explaining the E.164 standards, specifically how to specify an extension. Link

yixuy commented

Good catch! Thanks for the context. We will fix it soon!