stevejenkins/postwhite

Error Handling with microsoft.com as example

cbfloor opened this issue · 11 comments

Hi Steve,

I am currently testing your script and running into trouble with email providers that, it seems, do not always return valid data.

For example, right now microsoft.com causes the script to stop.
I have no idea why, and I am sure it worked fine a few days ago, so evidently a postwhite cron job can stop working over time, even if left untouched.

I suggest an option to skip invalid responses without stopping the whole script.
And in either case it would be great to output the domain-name that caused the script to skip/stop.
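A rough sketch of what such a skip option could look like, assuming the lookups are done by calling despf.sh once per host. `LOOKUP_CMD`, `FAILED_HOSTS` and `query_hosts` are made-up names for illustration, not anything that exists in postwhite today:

```shell
# Hypothetical sketch, not code from postwhite itself.
# LOOKUP_CMD is an assumed variable; point it at your copy of despf.sh.
LOOKUP_CMD="${LOOKUP_CMD:-/opt/spf-tools/despf.sh}"
FAILED_HOSTS=""

# Query every host given as an argument. A host whose lookup fails or
# returns nothing is reported on stderr and skipped instead of aborting.
query_hosts() {
    FAILED_HOSTS=""
    for host in "$@"; do
        if output="$($LOOKUP_CMD "$host" 2>/dev/null)" && [ -n "$output" ]; then
            printf '%s\n' "$output"
        else
            echo "Skipping $host: lookup failed or returned no data" >&2
            FAILED_HOSTS="$FAILED_HOSTS $host"
        fi
    done
    return 0
}
```

Called as `query_hosts microsoft.com pinterest.com`, this would print whatever the working lookups return and leave the names of the failing hosts in `FAILED_HOSTS`, so the run can finish and still say which domain caused trouble.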

Thanks for your work on this otherwise very helpful project!

Kind regards,
Tobias

Hi, Tobias. I just ran the script manually and got no errors. Can you tell me what version of SPFTools you're using? I'd like to replicate your environment as much as possible to test.

I have no idea where to find a version string in spf-tools.
The virtual machine it runs on was set up two weeks ago, so I guess spf-tools is the latest version cloned from git.
It is an Ubuntu 16.04 installation with Postfix as its only purpose, if that helps.

I also rechecked the script again, but still: microsoft.com makes it halt instantly. Pinterest too, btw!

I just installed SPF-Tools and Postwhite on a fresh install of Debian 8 (ScrolloutF1 ISO) and it ran successfully.

In SPF-Tools/README.md, line 25 shows the version number. 2016/11 (as of today), with the files dated Nov 27, 2016.

UPDATE: I do get the same issue as reported, but it is intermittent, so it may just be an issue with Microsoft as it hits a specific Microsoft server and stops...

Getting spf-a.hotmail.com
Getting _spf-ssg-b.microsoft.com
Getting _spf-ssg-c.microsoft.com
root@MailGW2:/opt/postwhite#

The next time it occurs, try running despf.sh directly on the server that it stops on. Keep in mind that the culprit may be the server following the last one listed. In my case, it stops on or after "Getting _spf-ssg-c.microsoft.com", at which point spf-a.outlook.com is typically the next in line (when it does work).

I also noticed when the script fails, it leaves the temp files in /tmp/, so if those are human readable, we might be able to look and see what the last few entries are and if there are errors.

I would recommend Postwhite have a debug or verbose mode that captures everything into a log for analysis.
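Something along these lines might be enough for a first debug mode. It is only a sketch: `DEBUG`, `DEBUG_LOG` and `debug_run` are hypothetical names, not existing postwhite options, and the log path is an arbitrary choice.

```shell
# Hypothetical sketch of a debug switch; none of these names exist in postwhite.
DEBUG="${DEBUG:-0}"
DEBUG_LOG="${DEBUG_LOG:-/tmp/postwhite-debug.log}"

# Run a command unchanged. With DEBUG=1, also append the command line,
# anything it writes to stderr, and its exit status to DEBUG_LOG.
debug_run() {
    local status
    if [ "$DEBUG" = "1" ]; then
        printf '%s RUN: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$DEBUG_LOG"
        "$@" 2>> "$DEBUG_LOG"
        status=$?
        printf '%s EXIT %s: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$status" "$*" >> "$DEBUG_LOG"
    else
        "$@"
        status=$?
    fi
    return "$status"
}
```

For example, `DEBUG=1 debug_run /opt/spf-tools/despf.sh _spf-ssg-c.microsoft.com` would still print the normal output, while the log records which host was being queried and how the call ended.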

To run despf.sh on a specific server...
/opt/spf-tools/despf.sh _spf-ssg-c.microsoft.com
(change the directory to wherever you placed spf-tools)
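Since the failure is intermittent, a single run may not catch it. Here is a small sketch that repeats the same lookup a few times and counts the failures. `probe_host` and the `DESPF` variable are names invented for this example, and the default path is just the one used above.

```shell
# Hypothetical helper for chasing an intermittent failure.
# DESPF is an assumed variable; adjust the path to your spf-tools checkout.
DESPF="${DESPF:-/opt/spf-tools/despf.sh}"

# usage: probe_host HOST [ATTEMPTS]
# Runs the lookup ATTEMPTS times (default 5), reports each failed attempt
# on stderr, prints a summary, and returns non-zero if any attempt failed.
probe_host() {
    local host="$1"
    local attempts="${2:-5}"
    local i=1
    local failures=0
    while [ "$i" -le "$attempts" ]; do
        if ! $DESPF "$host" >/dev/null 2>&1; then
            failures=$((failures + 1))
            echo "attempt $i: $host failed" >&2
        fi
        i=$((i + 1))
    done
    echo "$host: $failures of $attempts attempts failed"
    [ "$failures" -eq 0 ]
}
```

Running `probe_host _spf-ssg-c.microsoft.com 10` and then the same for the next host in line should show which of the two is actually the flaky one.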

I've also noticed that despf.sh sometimes chokes on certain providers. The guys who run the SPF-Tools project are very responsive to feedback, so if we can figure out which hosts are choking it, it would be best to push the fix upstream so that despf.sh just keeps on trucking.

Also, I'll look into a "keep the logs" mode, but for now the easiest thing to do would be to comment out the section that deletes the log files, starting on line 304. That section runs near the end of the script, so it makes sense that the logs survive if despf.sh kills the script.
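A "keep the logs" mode could be as small as a guard around that cleanup step. This is a sketch only: `KEEP_TEMP` and `cleanup_temp` are hypothetical names, and the real cleanup section in postwhite may look different.

```shell
# Hypothetical toggle; KEEP_TEMP is not an existing postwhite setting.
KEEP_TEMP="${KEEP_TEMP:-0}"

# usage: cleanup_temp FILE...
# Deletes the given files unless KEEP_TEMP=1, in which case it only
# reports on stderr which files were left in place.
cleanup_temp() {
    if [ "$KEEP_TEMP" = "1" ]; then
        echo "Keeping temp files: $*" >&2
        return 0
    fi
    rm -f -- "$@"
}
```

With that in place, running the script with `KEEP_TEMP=1` would leave the files behind for inspection without anyone having to edit the script.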

I ran into this issue again today, on Getting _spf-ssg-c.microsoft.com. After that line, it drops straight to the command line.

I also caught that all Microsoft SPF records are processed twice (probably due to the multiple Microsoft-related domains being checked). Rechecking where the script stops and drops, it is the SECOND time _spf-ssg-c.microsoft.com is processed. The next entry seen (when it runs successfully) is spf2.zoho.com... so it may not be an issue with Microsoft SPFs after all.

After a bit of tinkering, I was able to get it to run successfully again. The next time it stops and drops, I'll remove zoho.com from the script and see if it works again.

As Steven stated, this is most likely an upstream issue with SPF-Tools. However, until I can narrow down the culprit and successfully duplicate it with despf.sh at the same time, I'll hold off on submitting an issue with them.

As for the "keep the logs" request, it may not be necessary: when the script stops and drops, the tmp files are kept since it never gets to that part of the code. However, the tmp files are nearly useless for troubleshooting, as they just list IP addresses.

It would be best if there was a debug feature that gave more verbose output to a file, but again, this could be passed off as an upstream request for SPF-Tools.

I've actually added the specific issue as #10, and filed a bug upstream (as I guessed, they were very receptive). We'll either figure out a flag that turns off the exists: results from despf.sh, or I'll tinker with the normalize function in postwhite to strip that output.
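One possible shape for that stripping step, purely as a sketch. It assumes the entries worth keeping are prefixed with `ip4:` or `ip6:` and that anything else, such as an exists: result, can be dropped. `keep_ip_entries` is a made-up name, not the actual normalize function in postwhite.

```shell
# Hypothetical filter, assuming useful entries start with "ip4:" or "ip6:".
# Reads lines on stdin and passes through only those entries.
keep_ip_entries() {
    # grep exits non-zero when nothing matches; "|| true" keeps an
    # empty result from looking like an error to the caller.
    grep -E '^ip[46]:' || true
}
```

It would sit in a pipeline, for example `/opt/spf-tools/despf.sh example.com | keep_ip_entries`, so that non-IP lines never reach the whitelist.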

Oh, and removing fishbowl.com helped mine stop barfing, too. :)

Check the develop branch, which has the version without fishbowl.com