mayanez/flight_scraper

Matrix requests changed

Opened this issue · 3 comments

It seems like there has been an overhaul of the ITA Matrix search request format. The following request, taken from an earlier closed issue, used to work on www.hurl.it but now returns an error:

Follow redirects: On
POST

Headers---
Host: matrix.itasoftware.com
Content-Type: application/x-www-form-urlencoded
Cache-Control: no-cache
Content-Length: 0

Parameters---
name: specificDatesSlice
summarizers: sliceSelections,carrierStopMatrixSlice,currencyNotice,solutionListSlice,priceSliderSlice,carrierListSlice,departureTimeRangesSlice,arrivalTimeRangesSlice,durationSliderSlice,originsSlice,destinationsSlice,stopCountListSlice,warningsSlice
format: JSON
inputs: {"slices":[{"origins":["SEA"],"originPreferCity":true,"destinations":["NYC"],"destinationPreferCity":true,"date":"2014-09-19","isArrivalDate":false,"dateModifier":{"minus":0,"plus":0},"timeRanges":[{"min":"17:00","max":"21:00"},{"min":"21:00","max":"23:59"}]},{"destinations":["SEA"],"destinationPreferCity":true,"origins":["NYC"],"originPreferCity":true,"date":"2014-09-26","isArrivalDate":false,"dateModifier":{"minus":0,"plus":0},"timeRanges":[{"min":"17:00","max":"21:00"},{"min":"21:00","max":"23:59"}]}],"pax":{"adults":1},"cabin":"COACH","changeOfAirport":true,"checkAvailability":true,"sliceIndex":0,"page":{"size":30},"sorts":"default"}

The error returned is

500 Server Error

The server encountered an error and could not complete your request.

Capturing the search request from the new ITA matrix, the JSON in the request seems to be something like

{"method":"search","params":{"2":["carrierStopMatrix","currencyNotice","solutionList","itineraryPriceSlider","itineraryCarrierList","itineraryDepartureTimeRanges","itineraryArrivalTimeRanges","durationSliderItinerary","itineraryOrigins","itineraryDestinations","itineraryStopCountList","warningsItinerary"],"3":{"4":{"1":1,"2":30},"5":{"1":1},"7":[{"3":["NYC"],"5":["LHR"],"8":"2015-01-10","9":1,"11":0},{"3":["LHR"],"5":["NYC"],"8":"2015-01-17","9":0,"11":1}],"8":"COACH","9":1,"10":1,"15":"SUNDAY","22":"default"},"4":"specificDates"}}
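For anyone who wants to experiment, here is a minimal Python sketch that rebuilds the captured JSON above from a few arguments. The helper names (`build_slice`, `build_search_payload`) are hypothetical, and the meanings given to the numeric keys in the comments are only guesses made by comparing the capture with the old format; keys whose purpose is unknown are copied verbatim from the capture.

```python
import json

# Summarizer names copied verbatim from the captured request.
SUMMARIZERS = [
    "carrierStopMatrix", "currencyNotice", "solutionList",
    "itineraryPriceSlider", "itineraryCarrierList",
    "itineraryDepartureTimeRanges", "itineraryArrivalTimeRanges",
    "durationSliderItinerary", "itineraryOrigins",
    "itineraryDestinations", "itineraryStopCountList",
    "warningsItinerary",
]


def build_slice(origin, destination, date, flag9, flag11):
    # Guess: "3" = origins, "5" = destinations, "8" = date.
    # "9" and "11" are unknown, so the caller passes the captured values.
    return {"3": [origin], "5": [destination], "8": date,
            "9": flag9, "11": flag11}


def build_search_payload(slices, page_size=30, adults=1, cabin="COACH"):
    return {
        "method": "search",
        "params": {
            "2": SUMMARIZERS,
            "3": {
                "4": {"1": 1, "2": page_size},  # guess: paging, "2" = size
                "5": {"1": adults},             # guess: passengers
                "7": slices,                    # guess: one entry per leg
                "8": cabin,
                "9": 1,                         # unknown, copied from capture
                "10": 1,                        # unknown, copied from capture
                "15": "SUNDAY",                 # unknown, copied from capture
                "22": "default",                # guess: sort order
            },
            "4": "specificDates",
        },
    }


payload = build_search_payload([
    build_slice("NYC", "LHR", "2015-01-10", 1, 0),
    build_slice("LHR", "NYC", "2015-01-17", 0, 1),
])
# Compact separators match the formatting of the captured request.
body = json.dumps(payload, separators=(",", ":"))
```

This only constructs the request body; it does not send anything, and it says nothing about whether the server will accept it.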

I will have to take a look at it in more detail to try and figure out their new request format.

The new request format is a little vague (probably on purpose) relative to the old format, but it can be reverse engineered with some trial and error (if you like, I can dig up some old scripts which mapped 95% of the new request format dictionary)...

The bigger issue is that it looks like Google implemented botguard for this ITA matrix search service as well, which is sad news. The OP left out the last parameter of the POST header (in slot "7" after "4: specificDates") which is a random string that is somehow linked to your user session and request.

You will not get the full list of results without the correct random string (hash?).
When submitting a request manually via Chrome I got ~500 flights and the lowest price was $526. When I submitted the same exact request directly through POST using Python requests, I got back ~30 flights and the lowest price was $650. It's a useless result. Also interesting is that the request takes about 30x as long.

HOWEVER, if I submit the request through POST in Python requests while including the same exact hash for parameter "7" in the header that I logged from when I manually entered the search in Chrome, I am able to get the full set of results. But I don't believe there is a way to reverse engineer Google's botguard system to obtain the correct hash for each submission, and even if there is, I'm sure they have resources to detect and patch any temporary workarounds.
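A minimal sketch of that replay step, assuming the random string sits at key "7" inside `params` as described above. The helper name `with_session_token` and the placeholder token are hypothetical; the real value has to be copied from a request logged in the browser, and the sketch deliberately stops short of sending anything.

```python
import copy
import json


def with_session_token(payload, token):
    """Return a copy of the payload with the token placed in params slot "7"."""
    signed = copy.deepcopy(payload)  # leave the caller's payload untouched
    signed["params"]["7"] = token
    return signed


# Placeholder only: paste the string logged from a manual browser search.
captured_token = "PASTE-TOKEN-LOGGED-FROM-BROWSER"

# Abbreviated payload for illustration; a real one carries the full search.
payload = {"method": "search", "params": {"4": "specificDates"}}

signed = with_session_token(payload, captured_token)
body = json.dumps(signed, separators=(",", ":"))
```

Because the token appears to be tied to a session and request, a replayed value like this should be expected to stop working at some point.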

Worse yet, this botguard system is also somehow able to detect Selenium (WebDriver) use. If I try using a webdriver to submit a search request, I get back the same garbage results with ~30 flights.

What I find most interesting is that it returns any results at all (why not just return nothing?), and that it takes so much longer. I read somewhere that the botguard system knowingly allows some bots through, but it logs them, and once it sees the right pattern it adds the user/IP to the next wave of bans. This may be why it (very slowly) returns garbage instead of giving you nothing: if you are not paying attention, you might think everything is fine and continue using it. The technology is quite something. I definitely have a love-hate relationship with it.

Google/ITA are pushing their QPX API for airfare searches now. Unfortunately, some carriers (Delta, American) need to approve your access to their fares, and they aren't doing that for personal users. So the result set is kind of meaningless. Are we completely out of luck for being able to programmatically get fares?