dstl/baleen

REST API for Pipelines doesn't accept Sample YAML

mattplindsay opened this issue · 14 comments

On a freshly built test implementation, the YAML provided in the sample documentation (included below) fails with a 500 error when submitted as a POST with the two form parameters to http://localhost:6413/api/1/pipelines

mapping values are not allowed here
in 'string', line 1, column 26:
collectionreader: class: FolderReader folders: - ./ ...
^

Sample YAML:

mongo:
  db: baleen
  host: localhost

elasticsearch:
  cluster: elasticsearch
  host: localhost

collectionreader:
  class: FolderReader
  folders:
  - C:\baleen\data

annotators:
- cleaners.AddGenderToPerson
- cleaners.AddTitleToPerson
- cleaners.CleanPunctuation
- cleaners.CleanTemporal
- cleaners.CollapseLocations
- cleaners.CorefBrackets
- cleaners.CorefCapitalisationAndApostrophe
- cleaners.CurrencyDetection
- cleaners.EntityInitials
- cleaners.ExpandLocationToDescription
- cleaners.MergeAdjacent
- cleaners.MergeAdjacentQuantities
- cleaners.MergeNationalityIntoEntity
- cleaners.NaiveMergeRelations
- cleaners.NormalizeOSGB
- cleaners.NormalizeTemporal
- cleaners.NormalizeWhitespace
- cleaners.ReferentToEntity
- cleaners.RelationTypeFilter
- cleaners.RemoveLowConfidenceEntities
- cleaners.RemoveNestedEntities
- cleaners.RemoveNestedLocations
- cleaners.RemoveOverlappingEntities
- cleaners.SplitBrackets
- cleaners.Surname
- coreference.SieveCoreference
- gazetteer.Country
- gazetteer.File
- class: gazetteer.Mongo
  type: Buzzword
  collection: buzzwords
- class: gazetteer.Mongo
  type: Location
  collection: location
- class: gazetteer.Mongo
  type: Organisation
  collection: organisations
- class: gazetteer.Mongo
  type: Person
  collection: people
- grammatical.NPAtCoordinate
- grammatical.NPElement
- grammatical.NPLocation
- grammatical.NPOrganisation
- grammatical.NPTitleEntity
- grammatical.QuantityNPEntity
- grammatical.TOLocationEntity
- language.OpenNLP
- class: misc.DocumentTypeByLocation
  baseDirectory: C:\baleen\data
- misc.GenericMilitaryPlatform
- misc.GenericVehicle
- misc.GenericWeapon
- misc.MentionedAgain
- misc.NationalityToLocation
- misc.OrganisationPersonRole
- misc.People
- misc.Pronouns
- regex.Area
- regex.BritishArmyUnits
- regex.Callsign
- regex.CasRegistryNumber
- regex.Date
- regex.DateTime
- regex.Distance
- regex.DocumentNumber
- regex.Dtg
- regex.Email
- regex.FlightNumber
- regex.Frequency
- regex.Hms
- regex.IpV4
- regex.LatLon
- regex.Mgrs
- regex.Money
- regex.Nationality
- regex.Osgb
- regex.Postcode
- regex.RelativeDate
- regex.SocialMediaUsername
- regex.TaskForce
- regex.Telephone
- regex.Time
- regex.TimeQuantity
- regex.USTelephone
- regex.UnqualifiedDate
- regex.Url
- regex.Volume
- regex.Weight
- class: relations.NPVNP
  onlyExisting: true
- stats.DocumentLanguage
- class: stats.OpenNLP
  model: models/en-ner-location.bin
  type: Location
- class: stats.OpenNLP
  model: models/en-ner-organization.bin
  type: Organisation
- class: stats.OpenNLP
  model: models/en-ner-person.bin
  type: Person

consumers:
- Mongo
- Elasticsearch

(Worth mentioning that I was using an API framework, not just the Swagger UI, as I note from the FAQ that the latter doesn't work.)

Hi Matt,

I think that your issue is caused by the yaml text not being formatted correctly. The yaml format is quite particular about the layout, and if you are passing it in as a REST POST request then I think that you need to use URL encoding for spaces and line feeds.
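As a rough illustration of the encoding described above, here is a minimal Python sketch that percent-encodes a YAML fragment so that spaces become %20 and line feeds become %0A. The fragment is a shortened stand-in for the sample pipeline, and keeping ":" unencoded is only a choice made here to mirror the hand-encoded command in this thread.

```python
from urllib.parse import quote, unquote

# Shortened stand-in for the sample pipeline (assumption: any YAML text works the same way).
yaml_text = "mongo:\n  db: baleen\n  host: localhost\n"

# safe=":" leaves colons literal; spaces and newlines are percent-encoded.
encoded = quote(yaml_text, safe=":")
print(encoded)

# Decoding once restores the original text, including indentation and newlines.
restored = unquote(encoded)
```

The indentation and line breaks survive the round trip, which is what the YAML parser needs.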

The following command allowed me to pass in the sample pipeline via a REST request using CURL.

curl -X POST "http://localhost:6413/api/1/pipelines" -d"name=sample_test_rest&yaml=mongo:%0A%20%20db:%20baleen%0A%20%20host:%20localhost%0A%0Aelasticsearch:%0A%20%20cluster:%20elasticsearch%0A%20%20host:%20localhost%0A%0Acollectionreader:%0A%20%20class:%20FolderReader%0A%20%20folders:%0A%20%20-%20C:\baleen\data%0A%0Aannotators:%0A-%20cleaners.AddGenderToPerson%0A-%20cleaners.AddTitleToPerson%0A-%20cleaners.CleanPunctuation%0A-%20cleaners.CleanTemporal%0A-%20cleaners.CollapseLocations%0A-%20cleaners.CorefBrackets%0A-%20cleaners.CorefCapitalisationAndApostrophe%0A-%20cleaners.CurrencyDetection%0A-%20cleaners.EntityInitials%0A-%20cleaners.ExpandLocationToDescription%0A-%20cleaners.MergeAdjacent%0A-%20cleaners.MergeAdjacentQuantities%0A-%20cleaners.MergeNationalityIntoEntity%0A-%20cleaners.NaiveMergeRelations%0A-%20cleaners.NormalizeOSGB%0A-%20cleaners.NormalizeTemporal%0A-%20cleaners.NormalizeWhitespace%0A-%20cleaners.ReferentToEntity%0A-%20cleaners.RelationTypeFilter%0A-%20cleaners.RemoveLowConfidenceEntities%0A-%20cleaners.RemoveNestedEntities%0A-%20cleaners.RemoveNestedLocations%0A-%20cleaners.RemoveOverlappingEntities%0A-%20cleaners.SplitBrackets%0A-%20cleaners.Surname%0A-%20coreference.SieveCoreference%0A-%20gazetteer.Country%0A-%20gazetteer.File%0A-%20class:%20gazetteer.Mongo%0A%20%20type:%20Buzzword%0A%20%20collection:%20buzzwords%0A-%20class:%20gazetteer.Mongo%0A%20%20type:%20Location%0A%20%20collection:%20location%0A-%20class:%20gazetteer.Mongo%0A%20%20type:%20Organisation%0A%20%20collection:%20organisations%0A-%20class:%20gazetteer.Mongo%0A%20%20type:%20Person%0A%20%20collection:%20people%0A-%20grammatical.NPAtCoordinate%0A-%20grammatical.NPElement%0A-%20grammatical.NPLocation%0A-%20grammatical.NPOrganisation%0A-%20grammatical.NPTitleEntity%0A-%20grammatical.QuantityNPEntity%0A-%20grammatical.TOLocationEntity%0A-%20language.OpenNLP%0A-%20class:%20misc.DocumentTypeByLocation%0A%20%20baseDirectory:%20C:\baleen\data%0A-%20misc.GenericMilitaryPlatform%0A-%20misc.GenericVehicle%0A-%20misc.GenericWeapon%0A-%20misc.MentionedAgain%0A-%20misc.NationalityToLocation%0A-%20misc.OrganisationPersonRole%0A-%20misc.People%0A-%20misc.Pronouns%0A-%20regex.Area%0A-%20regex.BritishArmyUnits%0A-%20regex.Callsign%0A-%20regex.CasRegistryNumber%0A-%20regex.Date%0A-%20regex.DateTime%0A-%20regex.Distance%0A-%20regex.DocumentNumber%0A-%20regex.Dtg%0A-%20regex.Email%0A-%20regex.FlightNumber%0A-%20regex.Frequency%0A-%20regex.Hms%0A-%20regex.IpV4%0A-%20regex.LatLon%0A-%20regex.Mgrs%0A-%20regex.Money%0A-%20regex.Nationality%0A-%20regex.Osgb%0A-%20regex.Postcode%0A-%20regex.RelativeDate%0A-%20regex.SocialMediaUsername%0A-%20regex.TaskForce%0A-%20regex.Telephone%0A-%20regex.Time%0A-%20regex.TimeQuantity%0A-%20regex.USTelephone%0A-%20regex.UnqualifiedDate%0A-%20regex.Url%0A-%20regex.Volume%0A-%20regex.Weight%0A-%20class:%20relations.NPVNP%0A%20%20onlyExisting:%20true%0A-%20stats.DocumentLanguage%0A-%20class:%20stats.OpenNLP%0A%20%20model:%20models/en-ner-location.bin%0A%20%20type:%20Location%0A-%20class:%20stats.OpenNLP%0A%20%20model:%20models/en-ner-organization.bin%0A%20%20type:%20Organisation%0A-%20class:%20stats.OpenNLP%0A%20%20model:%20models/en-ner-person.bin%0A%20%20type:%20Person%0A%0Aconsumers:%0A-%20Mongo%0A-%20Elasticsearch"

Hope this helps.

John

Thanks John,
I'm getting close, but that didn't work on my install (I'm doing the basic quick-start).

I'm getting a 500 now - with the following error:
java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map

This is coming from /PipelineBuilder.java line 236 where it's trying to read the configuration.

Just to check: I'm sending the two fields in a POST with the content type application/x-www-form-urlencoded. I assume that's right?

Hi Matt,

Yes I think that is right. You should be supplying a name for the pipeline and the pipeline yaml configuration as application/x-www-form-urlencoded REST parameters.

I cut out some of the parameters from the Baleen Swagger generated REST call to give a minimal working answer with Curl in my previous post... I hope that has not confused things. The full call is:
curl -X POST "http://localhost:6413/api/1/pipelines" -H "accept: application/json" -H "content-type: application/x-www-form-urlencoded" -d"name=sample_test_rest&yaml=mongo:...

I'm afraid that it is not clear to me exactly what you are doing in terms of running Baleen and submitting your request, so I cannot suggest what you might try, but if it helps, what I have done that does work, is:

Download baleen-2.4.0.jar, or build Baleen from source:
mvn install -DskipTests
Launch Baleen from the command line:
java -jar baleen-2.4.0.jar
Submit the REST call from the command line using:
curl -X POST "http://localhost:6413/api/1/pipelines" -H "accept: application/json" -H "content-type: application/x-www-form-urlencoded" -d"name=sample_test_rest&yaml=mongo:...

This returns:
{
"paused" : false,
"name" : "sample_test_rest"
}
and the pipeline can be visualised in the baleen Plankton interface at http://localhost:6413/plankton/
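For anyone without curl to hand, the same kind of call can be sketched in Python. This is only an illustration under assumptions: the helper name build_pipeline_request is hypothetical, the YAML is a shortened stand-in for the sample pipeline, and the URL and field names (name, yaml) are the ones used in the curl command above. Note that urlencode writes spaces as "+" rather than "%20"; both decode to a space in form-encoded data.

```python
from urllib.parse import urlencode, parse_qs
from urllib.request import Request

def build_pipeline_request(name, yaml_text,
                           url="http://localhost:6413/api/1/pipelines"):
    """Build (but do not send) a form-encoded POST carrying name and yaml in the body."""
    # urlencode percent-encodes each value exactly once.
    body = urlencode({"name": name, "yaml": yaml_text}).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )

# Shortened stand-in for the sample pipeline YAML.
sample_yaml = "consumers:\n- Mongo\n- Elasticsearch\n"
req = build_pipeline_request("sample_test_rest", sample_yaml)

# What a server would recover after decoding the body once.
decoded_fields = parse_qs(req.data.decode("ascii"))
```

With Baleen running, the request could then be sent with urllib.request.urlopen(req).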

I have seen the error message that you are reporting, but it was caused by a malformed REST request. This error occurs when the yaml parser has not been able to parse the yaml string extracted from the REST parameters.

If this still doesn't help, please could you provide some more details about how you are running Baleen and how you are submitting the REST request.

Thanks very much John.

I am working in a Windows environment and I know that Plankton is working OK (I can build a pipeline using the Plankton GUI), and in most other respects it seems to be working fine. Unfortunately I can't use curl (as I am on Windows), so I use a GUI REST client called Restlet, but I'm 99% sure it's sending exactly the same POST that your curl command is sending. I've attached a screenshot in case you can see anything obvious.

I will try and get it onto a linux box and curl to see if I can exactly recreate your results.

Many thanks

[Screenshot: issue-01]

Hi Matt,
I am working on Windows too. There is an executable version of curl that you can run from the Windows command line. This seems to work well for me.
I'll see if I can replicate your problem with Restlet when I get a chance.

@mattplindsay - Could you provide a link to Restlet? I was also going to have a look to see if I can recreate the issue, but I can't find the GUI you're using.

Thanks James and John - I'll try and get curl on this machine. Restlet is at https://restlet.com/modules/client/?utm_source=DHC

OK, yes, it's something to do with the way in which the client is sending the POST. The supposedly identical request works from cURL but not from Restlet. For example, I tried:

orderer%3A%20DependencyGraph%0A%0Acollectionreader%3A%0A%20%20class%3A%20FolderReader%0A%20%20folders%3A%20C%3A%5CbaleenInput%0A%20%20reprocess%3A%20true%0A%20%20contentExtractor%3A%20TikaContentExtractor%0A%0Aannotators%3A%0A%20%20-%20cleaners.AddGenderToPerson%0A%0Aconsumers%3A%0A%20%20-%20class%3A%20Html5%0A%20%20%20%20outputFolder%3A%20C%3A%5CbaleenOutput

This worked in cURL but not in Restlet.

Is there something obvious in the structuring of the HTTP call? If we can identify it, I can update the help documents.

Matt

I was able to submit a pipeline through Restlet (on Ubuntu, but as Restlet is Chrome based I don't think that will make a difference), but to do so I had to uncheck the "Encode before Sending" option on the yaml field (see below).

[Screenshot: working_baleen_query]

I think Restlet might be trying to encode the already encoded String, so when Baleen decodes it (once) it still has an encoded String that it doesn't understand.
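The double-encoding effect described above can be reproduced with a small Python sketch. It assumes nothing about Restlet's internals; it only shows what happens when an already encoded string is encoded a second time and then decoded once.

```python
from urllib.parse import quote, unquote

yaml_text = "annotators:\n  - cleaners.AddGenderToPerson\n"

# First encoding: what the user pastes into the client.
encoded_once = quote(yaml_text, safe="")

# Second encoding: what a client would send if it encodes the field again.
# Every "%" becomes "%25", so "%0A" turns into "%250A".
encoded_twice = quote(encoded_once, safe="")

# The server decodes once and is left with percent-encoded text, not YAML.
decoded_once = unquote(encoded_twice)

# Only a second decode would recover the real YAML.
decoded_twice = unquote(decoded_once)
```

After a single decode the string has no real newlines, so it cannot be read as the intended multi-line YAML.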

Thanks for looking into this, James.
I have just run the same test on Windows. Unchecking the "Encode before sending" option does the trick.
If the option is left checked, the servlet's getParameter routine returns a string that is still URL encoded, and this causes problems further down the line.
This suggests that you may be able to submit your YAML content in plain text through Restlet and let it do the encoding for you, but I cannot see how you would enter a newline in that case.
Presumably just unchecking that box resolves your problem sufficiently.

Ahhhh, I think I see the misunderstanding.... and it illustrates an odd use of POST.

What I can see from your work is that you're passing the key-value pairs in the Header (as query parameters). This would be usual for a GET request (which can't accept a body), but I don't think it is common practice for POST.

If I take the example from MDN:

POST / HTTP/1.1
Host: foo.com
Content-Type: application/x-www-form-urlencoded
Content-Length: 13

say=Hi&to=Mom

as opposed to:

POST /?say=Hi&to=Mom HTTP/1.1
Host: foo.com
Content-Type: application/x-www-form-urlencoded
Content-Length: 0

(Not 100% sure if my exact syntax is correct above.)

Whilst it is possible to accept query parameters in a POST request, I don't think it's usual.
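To make the two shapes concrete, here is a small Python sketch that builds both request texts from the same parameters. The function names are hypothetical and the output is just the raw HTTP text, nothing is sent; the Content-Type header is left off the query-string variant here because it has no body.

```python
from urllib.parse import urlencode

def form_post_in_body(host, path, params):
    """Raw HTTP text for a POST that carries the parameters in the body."""
    body = urlencode(params)
    return (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body.encode('ascii'))}\r\n"
        "\r\n"
        f"{body}"
    )

def form_post_in_query(host, path, params):
    """Raw HTTP text for a POST that carries the parameters in the URL, with an empty body."""
    return (
        f"POST {path}?{urlencode(params)} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Length: 0\r\n"
        "\r\n"
    )

params = {"say": "Hi", "to": "Mom"}
body_request = form_post_in_body("foo.com", "/", params)
query_request = form_post_in_query("foo.com", "/", params)
```

The first form matches the MDN example, including the Content-Length of 13 for say=Hi&to=Mom.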

This explains why I was having issues sending it as a body in the message!

I think that clears it up - thanks so much for your help!

I suspect that in baleen/baleencore/src/main/java/uk/gov/dstl/baleen/core/web/servlets/PipelineManagerServlet.java, from line 97, we'd need to change the servlet from reading the URI to using getReader to get the body of the request (and then pass that YAML back), as described in https://stackoverflow.com/questions/14525982/getting-request-payload-from-post-request-in-java-servlet

But having looked through all the API documentation, it would be a bit of an overhaul to change all the POST requests to accept body rather than parameters.
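As a language-neutral illustration of the idea, and not Baleen's servlet code, here is a Python sketch of a handler helper that accepts form parameters from the query string, the body, or both. The function name read_form_params and its arguments are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

FORM_TYPE = "application/x-www-form-urlencoded"

def read_form_params(request_target, body, content_type):
    """Collect parameters from the query string and, if form-encoded, from the body."""
    # Parameters carried in the URL, e.g. /api/1/pipelines?name=demo
    params = parse_qs(urlsplit(request_target).query)
    # Parameters carried in the body are only parsed for form-encoded requests.
    if content_type == FORM_TYPE:
        for key, values in parse_qs(body).items():
            params.setdefault(key, []).extend(values)
    return params

result = read_form_params(
    "/api/1/pipelines?name=demo",
    "yaml=consumers%3A%0A-%20Mongo",
    FORM_TYPE,
)
```

Accepting both sources would let existing query-parameter clients keep working while body-based clients are added, though whether that fits the Baleen API is a separate question.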

Let me know if this is something anyone wants looking at

At least I know a little more about what I'm doing now!
Matt

Hi Matt,
Thanks for looking into this. I don't think that it is something that we will change in the near future as it would be a significant change to the API.
Hopefully you are now able to work with the API as it stands, so I will close this issue.
Thanks,
John