floere/phony

Validation possible with `phony`?

Closed this issue · 20 comments

fj commented

Given a phone number and a country, is there a way to tell whether it's valid or not using phony?

Hi John,

Can you please give me some examples of what exactly you mean by valid?

From your text, there are several options:

  • You have a full E.164 phone number and a country, and would like to see if that phone number is a valid phone number from that country.
  • You have part of a phone number (E.164 minus the country) and the country and you'd like to see if this partial phone number could exist in that country.

Phony cannot do a mapping from country to number. Not because it's hard, but because there are many specific gems that already implement the country-number prefix mapping.

Cheers,
Florian

fj commented

Hi Florian,

Yep, both of those examples cover what I was asking about. Here are some examples:

# true; could be a US phone number
Phony.conforms?('+1-434-222-3333', Phony.country_for('US'))

# false; not a valid US phone number
Phony.conforms?('+2-434-222-3333', Phony.country_for('US'))

# true; could be a US phone number
Phony.conforms?('434-222-3333', Phony.country_for('US'))

It sounds like you're opposed to phony using the hypothetical country_for method because you think that's out of scope for this gem. That's probably true. Do you have some recommendations for gems that would cover this mapping?

Hi John,

Thanks for the great feedback!

I am opposed to phony using the country_for, simply because I don't want to maintain a swiss army knife gem when there are good solutions out there.

I'm not sure which is the best currently, but a search here helps:
http://rubygems.org/search?utf8=%E2%9C%93&query=3166
(A range from minimal to comprehensive is available – I assume they all have copied the ISO3166 mapping correctly)

Now, let's look at your cases. Phony does not currently offer the specific functionality you need, but since you're the third person asking I might need something like it soon. Meanwhile…

Phony.conforms?('+1-434-222-3333', Phony.country_for('US'))
Phony.conforms?('+2-434-222-3333', Phony.country_for('US'))

One idea here would be to write your own method:

require 'phony'

def conforms? number, cc
  Phony.split(Phony.normalize(number)).first == cc
end

The third case is more complicated and error prone. So my question is: If I am not mistaken, this could very well be a non-US number – would it still be ok to return true? We could only more or less guarantee that it is not a US number, but guaranteeing that it is a US number is hard, perhaps impossible.

Cheers and thanks,
Florian

Usually, I recommend handling the third case via a well-designed user interface where one has to choose a country code, e.g. via a select. However, this might not be the case for you. Is it user input, or do the numbers come from a database or other input?

fj commented

I should probably be clearer on what behavior :conforms? would have:

  • First, it would parse the normalized version of the supplied number as if it were a number of that country. If the number was successfully parsed in this way, and its segments have the right number of digits according to the country's specification, then return true.
  • If that didn't work, then maybe the number is a domestic version of that phone number and is lacking the international prefix. Prepend the country code to the number. Can you parse it successfully now? If so, then return true.
  • If not, then return false.

In short, that would look something like:

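module Phony
  # Hypothetical: neither `conforms?` nor `parse?` exists in Phony.
  def self.conforms? n, country
    c = country.country_code
    Phony.normalize! n

    # prepend the country code if it's missing
    # (`starts_with?` comes from ActiveSupport; plain Ruby has `start_with?`)
    n = c + n unless n.starts_with? c

    # Phony method that returns `true` if n looks
    # like a phone number for the country code `c`
    Phony.parse? n, c
  end
end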

Does that make sense?

I am opposed to phony using the country_for, simply because I don't want to maintain a swiss army knife gem when there are good solutions out there.

I agree, I don't think Phony should do that either. But what if it simply depended on one of these gems to provide country_for?

The third case is more complicated and error prone. So my question is: If I am not mistaken, this could very well be a non-US number – would it still be ok to return true? We could only more or less guarantee that it is not a US number, but guaranteeing that it is a US number is hard, perhaps impossible.

I agree. The intention here is only to answer the question "is it possible that this is a US phone number?", not "is this definitely a US phone number, to the exclusion of all other possibilities?" If there were a country Foo whose country code was 43 and which had 7-digit phone numbers, then we'd also return true for that. Here are some more examples:

# true because this could be a Foo phone number
Phony.conforms?('43-422-23-333', Phony.country_for('Foo'))

# true here because this could also be a US phone number
Phony.conforms?('4342223333', Phony.country_for('US'))

# false because US phone numbers need ten digits
Phony.conforms?('2223333', Phony.country_for('US'))

Is that clearer?

fj commented

Usually, I recommend handling the third case via well-designed user interface where one has to choose a country code via e.g. select. However, this might not be the case for you. Is it user input or do the numbers come from a database or other input?

Yes, that's the case here. We're looking for signs of voter fraud in different countries, and we receive self-reported user data from various databases. Sometimes the databases use the CC prefix; sometimes they don't. We have to verify if they're real numbers by seeing if they match the expected pattern. An obviously fraudulent number indicates that a particular individual is less likely to be a real person.

Hi John,

Again, thanks for the feedback.

I have to note that I am currently in Papua New Guinea and time and bandwidth are a bit scarce, but let me try to respond here :) (I'll be back in Australia this weekend and will have more time etc.)

Your code does make sense, thanks for it. One problem is the n.starts_with? part, as in many countries it is perfectly ok for an NDC to look like the CC (not in the North American scheme, though).
To keep phony free from language-country code mapping I'd like to keep the line c = country.country_code out of the method. We could certainly test for the existence of a country gem, and if it is there, offer this functionality.
I'd prefer the signature to be def conforms? n, country_code – or maybe something like def plausible? number, constraints = {} which checks for plausibility based on the input. Your case would then be Phony.plausible? "123456789", cc: "1".

Now, for the part with the problems. Phony is currently designed to be very lenient. It wasn't at first, but the relative scarcity of information on phone numbers in certain countries led to problems back when Phony was stricter about the numbers it was given. For example, certain phone numbers had their extensions cut off for Airbnb, who use Phony, which left people unable to call their hosts.
So Phony has now basically "lost" this information, meaning that we have to add this plausibility code. The current "parse" method no longer complains about a non-conforming phone number.

So, formatting etc. needs to be lenient, and a plausibility check needs to be relatively strict. I am unsure how to proceed at this stage. Probably it makes sense to add a whole new module to phony, with its own information etc.

Now, so that you can make some progress: may I recommend that you first implement the sanity-checking code separately from Phony, in your app, and give me feedback on how it goes, so that I get an idea of how you would do this? (Then I can think about how to proceed and how to design it.) Would that be a workable way for you to continue?
I would understand perfectly too if you chose to use some other gem that already implements this. If you do, please tell me which :)

Thanks again and all the best,
Florian

fj commented

Thanks for taking time out of your busy schedule, Florian. I'll try to keep this brief!

plausible? sounds like a better name, for the reasons you pointed out, since we'd be checking whether this "looks like" a good enough number. And yes, I can put plausible? in my code with the method signature you described.

Based on what you described, it seems like different people need different things from Phony. Some people want strictness; some people want "close enough". One way to make everybody happy is to elevate each country's logic to a full-fledged class (e.g. NetherlandsPhoneNumber).

These could declare their segments as they do now, and the "plausible" matching could still occur. And for people who want strictness, each country could define more specific rules about what is or isn't allowed. For example, in the US, there are no area codes that begin with 0 or 1. Likewise, the fourth through sixth digits of US numbers can never be "555". These are two examples of additional rules that might be tested if you want to opt into the "strict" version of phony.

For now, I'll implement plausible? myself in my own app. I'd also be willing to help with an approach similar to what I described above, if you need a hand and you think it's a good strategy.

My pleasure.

Yes, different people want different things from Phony. However, whether they want strictness or not is related to the task they hope to achieve. At the moment, Phony offers normalizing/splitting/formatting, where a less strict approach is very helpful (whatever number it gets, it does its best and returns something useful).

So the two approaches are: formatting/relaxedness vs. validating/strictness. Phony currently implements the first.

I am happy to continue in the direction of a plausibility check as described. Plausibility would be defined as being strict, but not necessarily correct. If plausible? returns false, we know that the given number is not plausible, given the constraints. If plausible? returns true, we cannot be sure if it is really a plausible number or not. Perhaps implausible? would be a better naming in this light.
As a start, all the countries would check only whether the country code is correct and whether the normalized number contains more than 15 digits (the basic E.164 check), i.e. more than 15 digits causes the plausible? check to return false. And so does a number starting with 7 when the constraint is cc: 1.
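A minimal sketch of that starting point, in plain Ruby and without Phony. The method name `basic_plausible?` and its keyword arguments are assumptions made for illustration; they are not Phony's API.

```ruby
# Hypothetical helper, not part of Phony.
# Returns false only when the number certainly fails the basic checks:
#   * it contains no digits at all,
#   * it has more than 15 digits (the E.164 maximum),
#   * a country code constraint is given and the number does not start with it.
def basic_plausible?(number, cc: nil, max_digits: 15)
  digits = number.to_s.gsub(/\D/, '') # keep digits only
  return false if digits.empty?
  return false if digits.length > max_digits
  return false if cc && !digits.start_with?(cc.to_s)

  true
end

basic_plausible?('+1 434 222 3333', cc: '1') # passes both checks
basic_plausible?('7 434 222 3333', cc: '1')  # fails: starts with 7, not 1
basic_plausible?('1234567890123456')         # fails: 16 digits
```

A true result here only means that no check failed, which matches the "strict, but not necessarily correct" reading of plausibility described above.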

I'd be glad of your help on NA numbers once I have implemented the basic structure above (I also need to decide where to include specific country checks). Thanks for the offer! :)

@fj If you have any code for me, I'm happy to use it :)

fj commented

@floere We're still kind of poking around with plausible?. Everyone decided the name wasn't accurate though, and we went back to conforms?. Mostly that's because "plausible" means "is this possible?", but that's a less specific question than "does this look like it could be a phone number meeting this specification?"

I think we refined our desired behavior a little bit, too:

conforms? "4346667777", country: "US"
# true; this could be a US phone number

# Imagine that Foo is a country whose country code
# is "43" and whose phone numbers have 8 digits.
conforms? "43-46667777", country: "Foo"
# true; this could be a Foo phone number

conforms? "1-4346667777", country: "US"
# true; this could be a US phone number

conforms? "2-4346667777", country: "US"
# false; this can't be a US phone number because the country code is wrong

conforms? "43466677", country: "US"
# false; this can't be a US phone number because the segments aren't full

(Note that it's possible for the same phone number to conform to multiple countries.)

We wound up hardcoding a "definition" by requiring the presence of segments of a particular length or length range, similar to Phony's DSL, so there's a bit of reinventing the wheel. (Essentially, we now have a mini-Phony directly in our app.)
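A rough sketch of what such a hardcoded definition could look like, reproducing the five example results above. Everything here is an assumption for illustration: the `country_definitions` table, the use of one total digit count instead of per-segment lengths, and the imaginary country "Foo" from the examples.

```ruby
# Hypothetical definitions; "Foo" is the imaginary country from the examples.
def country_definitions
  {
    'US'  => { cc: '1',  national_digits: 10 },
    'Foo' => { cc: '43', national_digits: 8 }
  }
end

# True if the number could belong to the given country, either in
# domestic form (no country code) or in international form.
def conforms?(number, country:)
  definition = country_definitions.fetch(country)
  digits = number.gsub(/\D/, '')
  cc     = definition[:cc]
  length = definition[:national_digits]

  # Domestic form: exactly the national number, no country code.
  return true if digits.length == length

  # International form: country code followed by the national number.
  digits.start_with?(cc) && digits.length == cc.length + length
end

conforms? '4346667777', country: 'US'   # true (domestic form)
conforms? '43-46667777', country: 'Foo' # true (international form)
conforms? '2-4346667777', country: 'US' # false (wrong country code)
```

Because only lengths and the country code are compared, the same digits can conform to several countries at once, as noted above.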

Thanks a lot for the thorough description, again. You make a good point for conforms?.
I hope to have some time soon to devote to this. I'm glad, though, that you found something that works for now.

I'm interested – how complex are your rules?

fj commented

@floere The rules are very minimal. We only wanted to exclude obviously fake numbers and service numbers. Two examples:

  • No one in the U.S. has a phone number that starts with "911", because that is the emergency service number in the US and it would be intercepted as such before you could dial the rest of the number. Therefore someone entering a number that starts with 911 is obviously mistaken or providing fake data.
  • Likewise, no one in the U.S. has a phone number whose prefix is "555". (There's even a Wikipedia article about this.)

So basically, we considered such rules part of the specification for US phone numbers, and we enforce those rules accordingly to preclude them from being in our database.
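For illustration, the US rules mentioned in this thread could be written as a small checker like the one below. This is a sketch under assumptions: the method name `us_rule_violations` is made up, the input is taken to be a national number without the country code, and "prefix" is read as the fourth through sixth digits.

```ruby
# Hypothetical checker for a US national number (no country code).
# Returns a list of violated rules; an empty list means no rule was broken.
def us_rule_violations(national_number)
  digits = national_number.gsub(/\D/, '')
  return ['must have exactly ten digits'] unless digits.length == 10

  area_code = digits[0, 3] # first three digits
  exchange  = digits[3, 3] # fourth through sixth digits

  violations = []
  violations << 'area code begins with 0 or 1' if area_code.start_with?('0', '1')
  violations << 'starts with the emergency number 911' if area_code == '911'
  violations << 'uses the 555 prefix' if exchange == '555'
  violations
end

us_rule_violations('434-666-7777') # no violations
us_rule_violations('911-123-1234') # flags the 911 rule
us_rule_violations('434-555-0123') # flags the 555 rule
```

Returning the violated rules rather than a bare boolean makes it easier to see why a record was rejected.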

Thanks for the info! I am thinking about how to cleanly implement this in Phony. Probably I will add it to the country specification, as a third parameter that operates very similarly to the rest of the DSL. But I am only throwing ideas around in my head at this stage; I hope to have something to show soon.

The 555 wiki page is fantastic. For example, I didn't know this:
"only 555-0100 through 555-0199 are now specifically reserved for fictional use"

I've just released an experimental version of plausible? in version 1.6.7.

I still went with plausible, as something "seems reasonable or probable" (Apple dictionary), not implying 100% correctness on a true, but implying 100% correctness on a false return value.

Appended are the relevant specs. Test against a country by adding the option cc, e.g. cc: '1' for the North American Numbering Plan. If no options are given, it uses only the country-specific validations found in countries.rb; currently, only the 911 NDC is tested for in North America. If options are given, it tests against these, but also against the country Phony thinks it got. This is the current behavior.
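One way to picture how string and regexp constraints might be compared against an already split number is sketched below. It is not Phony's implementation: the helper name `parts_satisfy?` and the `parts` hash are assumptions, and the splitting itself is taken as given.

```ruby
# Hypothetical helper: `parts` is assumed to be the result of splitting a
# number, e.g. '+41 44 111 22 33' into cc '41' and ndc '44'.
# `===` compares Strings by equality and lets a Regexp match anywhere in
# the value, so both kinds of constraint go through the same line.
def parts_satisfy?(parts, constraints = {})
  constraints.all? do |key, expected|
    expected === parts[key]
  end
end

swiss = { cc: '41', ndc: '44' }

parts_satisfy?(swiss, cc: '41', ndc: '44')             # both strings match
parts_satisfy?(swiss, cc: '1')                         # cc differs
parts_satisfy?(swiss, cc: /4(0|1)/, ndc: /4(4|5)/)     # both regexps match
```

With no constraints at all, `all?` on an empty hash is true, so only the country-specific validations would decide the outcome.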

I hope this is a step in the right direction, cheers! :)

P.S: Since this is released experimentally, please don't count on it making it into 1.7.0 the way it is presented here. Most likely yes, but not assuredly so.

describe 'plausible?' do

  it "is correct" do
    Phony.plausible?('0000000').should be_false
  end
  it "is correct" do
    Phony.plausible?('hello').should be_false
  end

  it "is correct" do
    Phony.plausible?('+41 44 111 22 33').should be_true
  end
  it "is correct for explicit checks" do
    Phony.plausible?('+41 44 111 22 33', cc: '41').should be_true
  end
  it "is correct for explicit checks" do
    Phony.plausible?('+41 44 111 22 33', ndc: '44').should be_true
  end
  it "is correct for explicit checks" do
    Phony.plausible?('+41 44 111 22 33', cc: '1').should be_false
  end
  it "is correct for explicit checks" do
    Phony.plausible?('+41 44 111 22 33', ndc: '43').should be_false
  end
  it "is correct for explicit checks" do
    Phony.plausible?('+41 44 111 22 33', cc: '41', ndc: '44').should be_true
  end
  it "works with regexps" do
    Phony.plausible?('+41 44 111 22 33', cc: /4(0|2)/, ndc: /4(4|5)/).should be_false
  end
  it "works with regexps" do
    Phony.plausible?('+41 44 111 22 33', cc: /4(0|1)/, ndc: /4(4|5)/).should be_true
  end

  context 'specific countries' do

    it "is correct for US numbers" do
      # Sorry, still need E164 conform numbers.
      #
      Phony.plausible?('4346667777', cc: '1').should be_false

      # Automatic country checking.
      #
      Phony.plausible?('1-4346667777').should be_true
      Phony.plausible?('1-800-692-7753').should be_true
      Phony.plausible?('1-911').should be_false
      Phony.plausible?('1-911-123-1234').should be_false

      # With string constraints.
      #
      Phony.plausible?('14346667777', cc: '1').should be_true
      Phony.plausible?('14346667777', ndc: '434').should be_true
      Phony.plausible?('14346667777', cc: '1', ndc: '434').should be_true

      # With regexp constraints.
      #
      Phony.plausible?('14346667777', cc: /[123]/).should be_true
      Phony.plausible?('14346667777', ndc: /434|435/).should be_true
      Phony.plausible?('14346667777', cc: /[123]/, ndc: /434|435/).should be_true
    end

  end

end

fj commented

@floere This looks pretty solid, thank you! I'll try swapping it in next week after we do a release.

Does Phony have a way to "normalize to country"? For example, you give this test case:

      # Sorry, still need E164 conform numbers.
      #
      Phony.plausible?('4346667777', cc: '1').should be_false

Could I instead do something like:

      # tries to normalize this phone number to this country:
      #   * returns the number itself if it's already `plausible?`
      #   * returns the number with the prefix if that would make it `plausible?`
      #   * otherwise returns nil
      n = Phony.normalize_to_country('4346667777', cc: '1')
      Phony.plausible?(n, cc: '1').should be_true

Do you think something like this should be part of Phony, or part of a client application?
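As a client-side sketch of the three steps in the comment above: `normalize_to_country` is the hypothetical method from the question, and the plausibility check is passed in as a block so the sketch does not depend on any particular library. The US pattern used in the example is likewise only an assumption.

```ruby
# Hypothetical: tries to normalize a number to a country.
# The block answers "is this digit string plausible for the country?".
def normalize_to_country(number, cc:)
  digits = number.gsub(/\D/, '')

  # 1. Already plausible as given: return it unchanged.
  return digits if yield(digits)

  # 2. Plausible once the country code is prepended: return that.
  candidate = cc + digits
  return candidate if yield(candidate)

  # 3. Otherwise give up.
  nil
end

# Assumed stand-in for a US check: "1", then ten digits, the first not 0 or 1.
us_check = ->(n) { n.match?(/\A1[2-9]\d{9}\z/) }

normalize_to_country('4346667777', cc: '1', &us_check)  # adds the leading 1
normalize_to_country('14346667777', cc: '1', &us_check) # returned unchanged
normalize_to_country('2223333', cc: '1', &us_check)     # nil
```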

I'm going to be lazy and quote myself ;)

Your code does make sense, thanks for it.
One problem is the n.starts_with? part, as in many countries, it is perfectly ok to
have a NDC look like the CC (not in the North American scheme, though).

What I mean by that is that a number may well be plausible (although not your specific example with North American numbers) even when it is missing the country code. That means that even though it is plausible, it would not be correct, and the cc would not be added.

That is why I think it should be part of a client application, especially also because it is simple code. (In all fairness I have to add though that the cc: 1 could be used to improve the heuristics, but atm I think this should not be part of Phony as the plausibility already adds a lot of new things)

I hope this is understandable.

Update:
I misread. To actually answer your question: No it does not (yet) have a way, I'm afraid.

I'm very interested in how you fare and also wanted to thank you again for all your feedback!

fj commented

Yes, I think I am going to introduce a PhoneNumber object into our application and have it wrap Phony accordingly. I think this'll work great.

What I mean by that is that a number may well be plausible (although not your specific example with North American numbers) even when it is missing the country code.

Right, this goes back to our different definitions of "plausible". Phony means "is there any chance that this is a phone number at all?", but I mean, "given a country, could this be a phone number for that country?" So my question isn't as useful as yours, since I require that a country be supplied explicitly before I can answer the question, whereas Phony is capable of making an educated guess.

Thanks for all your help!

This sounds like a great idea!

Thanks for reminding me of our discussion. Your question is more useful I think as Phony could use that supplied hint to – if necessary – add the cc to the ndc+rest part of a number.

With NA numbers it can do this quite easily. Add 1 if the first digit isn't a 1, iirc. Phony's problem is that it can't do this reliably for all countries: In the Swiss scheme, it would be perfectly ok to have a cc+ndc+rest number that looks like this: 41411231212-12. So if just the ndc+rest part were to come along, with the hint cc: 41, Phony would see 411231212-12, and conclude that it sees a correct number, returning it, even if it didn't include the cc part.
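The difference between the two schemes can be seen with a naive "prepend the country code unless it is already there" helper. The name `naive_add_cc` is made up for this sketch, and the Swiss digits are the ones from the example above, without the extension.

```ruby
# Hypothetical helper: prepend the country code unless the number
# already appears to start with it.
def naive_add_cc(number, cc)
  number.start_with?(cc) ? number : cc + number
end

# North American scheme: area codes never start with 1, so this works.
naive_add_cc('4346667777', '1')  # the leading 1 is added
naive_add_cc('14346667777', '1') # left unchanged

# Swiss scheme: the NDC may look like the CC. The ndc+rest part of
# 41411231212 is 411231212, which already "starts with 41", so the
# helper wrongly leaves it unchanged instead of returning 41411231212.
naive_add_cc('411231212', '41')
```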

However, Phony could be smart about it and choose a clever normalize_to_country strategy, tailored to each country, where if cc 1 was given, it'd choose the strategy noted above.

I actually wrote all of this to tell you that no, your question is not less useful: Phony could use your supplied country to be clever about it :)
(But atm I'd rather work on getting plausible? right first – perhaps as a next step including a max length for each country)

Note: I've added the necessary documentation to the README https://github.com/floere/phony/blob/master/README.textile#plausibility.

I'm also going to close this. Please reopen or, better, open a new issue with a specific validation topic. Cheers!