snowplow-referer-parser/referer-parser

Go implementation: Is bindata.go outdated or is there a bug in the go implementation?

thomas91310 opened this issue · 9 comments

@tsileo

Tested with this main.go file:

me@2018-01-19 11:42:34 ~GOPATH/src/github.com/thomas91310/test_referer_parser $ cat main.go
package main

import (
	"log"

	"github.com/snowplow/referer-parser/go"
)

func main() {
	referer_url := "http://www.google.com/search?q=gateway+oracle+cards+denise+linn&hl=en&client=safari"
	r := refererparser.Parse(referer_url)

	log.Printf("Known:%v", r.Known)
}

This is the loadRefererData() function in my refererparser.go file:

func loadRefererData() refererData {
	dat, err := Asset("data/referers.json")
	if err != nil {
		panic(err)
	}
	res := make(refererData)
	if err := json.Unmarshal(dat, &res); err != nil {
		panic(err)
	}

	social, _ := res["social"]
	fmt.Println("social[LinkedIn]: ", social["LinkedIn"])

	return res
}

This is not what I was expecting:

me@2018-01-19 11:42:34 ~GOPATH/src/github.com/thomas91310/test_referer_parser $ ./test_referer_parser
init refererparser
social[LinkedIn]:  map[domains:[linkedin.com]]
Known:true

I was expecting 2 values (linkedin.com and lnkd.in) from reading the data/referers.json file in the go directory but it outputs only one apparently?

After executing go-bindata:

If I go-bindata like it is mentioned in the repository (this creates a new bindata.go file):

me@2018-01-19 11:42:34 ~GOPATH/src/github.com/snowplow/referer-parser/go (develop●●)$ go-bindata -ignore=\\.yml -pkg refererparser data/...
me@2018-01-19 11:42:34 ~GOPATH/src/github.com/snowplow/referer-parser/go (develop●●)$ ls -lah bindata.go
-rw-r--r--  1 me  staff   101K Jan 19 11:53 bindata.go

Now I rebuild everything:

me@2018-01-19 11:42:34 ~GOPATH/src/github.com/snowplow/referer-parser/go (develop●●)$ cd $GOPATH/src/github.com/thomas91310/test_referer_parser
me@2018-01-19 11:42:34 ~GOPATH/src/github.com/thomas91310/test_referer_parser $ go build -v .
github.com/snowplow/referer-parser/go
github.com/thomas91310/test_referer_parser

Executing again results in 2 entries which is the result I was expecting:

me@2018-01-19 11:42:34 ~GOPATH/src/github.com/thomas91310/test_referer_parser $ ./test_referer_parser
init refererparser
social[LinkedIn]:  map[domains:[linkedin.com lnkd.in]]
Known:true

2 questions:

  • Can you replicate this?
  • I haven't looked at how the parsing was implemented but is it an implementation bug or the bindata.go file inside the repo is just outdated? Or, am I missing something?

Thanks in advance for looking into this!

Hi,

You're right, the embedded data is outdated. I would say make a PR and see with other maintainers.

I don't have the time anymore to take care of this, as I'm not using it anymore personally.

Maybe they would be interested in you taking over the "golang port".

We are looking at extracting the different ports out of this repository and keep this one jvm-only, if you're interested in taking over the golang one.

I think we are going to fork the go implementation in a private repository for now but we might be able to take over the golang impl soon?

Right now, we are just going to fix a production issue related to the issue I initially raised (related to bindata.go being outdated)

Proposal

Do you guys see a lot of advantages to source this bindata.go file? Would a CI job take care of building/creating this file be better? (I am thinking yes)

Or even load referers.json from an os path?
I am thinking for the precedence logic:

  1. Read from SNOWPLOW_FILE environment variable (just made it up) that we ask package users to set. A user would then do: export SNOWPLOW_FILE=/usr/local/data/referers.json if he/she wants the package to load data from /usr/local/data/referers.json
  2. If SNOWPLOW_FILE env var is not set then load it from a default path like $HOME/referers.json or something
  3. Finally, I am also tempted to let the package download the file at https://s3-eu-west-1.amazonaws.com/snowplow-hosted-assets/third-party/referer-parser/referers-latest.yml on disk if both 1) and 2) fail

What do you guys think?
How is the data being loaded in the other SDKs?

Hi @thomas91310 - sounds good. We plan on moving all of the referers (maintainers willing) to using that standalone referers.json database in S3, which we want to update regularly through the year. That will decouple the database from the libraries themselves.

Let us know when you are ready to take over the maintenance of the Go version of the library, and we will get the standalone repo setup...

Hey @donspaulding - I don't think we will hard-wire S3-awareness into any of the libraries; we'll just assume that the file is available on the local-file system. My preference is just letting the developer specify the path to the file in an updated initialization method for the library - env vars and default paths are a bit opaque/magical...

Hey @alexanderdean

I think you can go ahead and create the repository for the golang impl. I can take that on. I am currently updating our application to load the latest referers.json and since the current implementation of the referer-parser/go package crashes when I do so, I am also updating the package. As soon as I'm done, I can submit a PR to whichever repository (this one or the new one?)

Regarding the precedence logic

I think the user path you suggested exposing will end up being an environment variable for most applications because you want to be able to load the file from different paths and not having to rebuild the application just because you hardcoded the referers.json path in your application?
Let's say I want to make a referers.json file update. I can, on my local computer / test server, load a referers.json file from different paths without having to rebuild the application?
That's, in my opinion, really valuable but I will follow whatever you decide on if you want to apply the same exact logic to every sdk.

Side question

Can you let me know where the latest referers.json file is? Right now, to get it, I just convert https://s3-eu-west-1.amazonaws.com/snowplow-hosted-assets/third-party/referer-parser/referers-latest.yml to json.

Hey @thomas91310

think the user path you suggested exposing will end up being an environment variable for most applications because you want to be able to load the file from different paths and not having to rebuild the application just because you hardcoded the referers.json path in your application?

Environment variables are a bit odd because it means there's a "backdoor" for a library embedded in an app to access files, without the app being able to control this access. It would be better if the path to the referers database was a piece of configuration for the app, which then passes it to the library. This is pretty standard (MaxMind does the same).

@donspaulding - you should have access to the new Python repo now:

https://github.com/snowplow-referer-parser/python-referer-parser

Could you copy across the git history of the Python implementation? No need to delete the code in this repo (but you could open a PR to remove it).

Many thanks!

repos have been migrated, closing