
Grep for a keyword in multiple web pages. Sample code for golang (the Google Go programming language).


gonetgrep

I find Google Go an `interesting programming language`_. I guess it is very useful for network data mining. This project is just a sample to show Go's power.

.. _interesting programming language: http://www.theregister.co.uk/2011/05/05/google_go/

The best way to learn a new language is to try to teach it to others. So this project won't show finished code all at once. Instead, I'll keep my learning experiences in the Dummy Days section of this document and the progress of this program on https://github.com/dlintw/gonetgrep.

BTW, I'm not a native English speaker, and I'm a newbie at Go. There are probably many bugs in my code and my grammar; please let me know if you find any.

This document can be converted to HTML by docutils, which makes the links clickable. See ReStructuredText for more information.

I define the usage here:

gonetgrep [options] <keyword> <url> [<url> ...]

Grep keyword in multiple web pages.

options:
  -t <num>  0: do not parse tables, 1: parse the first table only
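
The usage above maps naturally onto Go's flag package. The following is only a sketch, not the code from the repository: the options struct, the parseArgs helper, and the default value 0 for -t are assumptions made for illustration.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// options holds the parsed command line. The struct and its field names are
// hypothetical; they are not taken from the gonetgrep source.
type options struct {
	tableMode int      // value of -t
	keyword   string   // first positional argument
	urls      []string // remaining positional arguments
}

// parseArgs parses args (without the program name) according to the usage
// "gonetgrep [options] <keyword> <url> [<url> ...]".
func parseArgs(args []string) (options, error) {
	var opt options
	fs := flag.NewFlagSet("gonetgrep", flag.ContinueOnError)
	// Assumption: -t defaults to 0 when it is not given.
	fs.IntVar(&opt.tableMode, "t", 0, "0: without parse table, 1: first table only")
	if err := fs.Parse(args); err != nil {
		return opt, err
	}
	rest := fs.Args()
	if len(rest) < 2 {
		return opt, errors.New("usage: gonetgrep [options] <keyword> <url> [<url> ...]")
	}
	opt.keyword, opt.urls = rest[0], rest[1:]
	return opt, nil
}

func main() {
	opt, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("keyword=%q table=%d urls=%v\n", opt.keyword, opt.tableMode, opt.urls)
}
```

Using a separate FlagSet instead of the package-level flag.Parse() keeps the parser easy to call from a test with any argument list.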

This section records my experience. I study Golang by the following methods:

  1. Read the documents. (Note: the Specification is also a must-read document.)
  2. Search for the question on golang-nuts.
  3. Find a similar package on the project dashboard and the packages dashboard.
  4. Ask questions on golang-nuts.

You can check out the source code at different stages. The method is described in Version Control System.

In fact, I don't like the name of this language. I would rather it were named 'golang', which is more searchable.

I use Arch Linux. It is easy to install the newest packages with pacman or yaourt:

pacman -S go  # Google Go; the binary package still has a godoc bug
yaourt -S go-hg  # Google Go; I prefer this, it installs into /opt/go

# optional packages
yaourt -S gocode-git # strongly suggested for vim users
# if installing gocode-git fails, just skip it; I'll explain how to fix it.
pacman -S git mercurial # source code version control, needed by goinstall
pacman -S vi vim-dirdiff # my favorite editor
pacman -S docutils # converts this document into web page form (HTML)

Instead of yaourt, you might also be interested in https://github.com/str1ngs/gur, an AUR helper written in Go. It's still a work in progress but fun to use.

You can read Golang's documents offline with godoc:

godoc -http=:6060  # then open http://localhost:6060 in a browser
godoc godoc # read more about the usage of this built-in document tool

In fact, if you use a brand-new version of Go, you should read the package manuals this way instead of reading the official site's manual, because the official site only keeps the package documents of the stable version.

By the way, I suggest opening a GitHub account and learning how to use git on GitHub. Also try writing documents in rst format.

Subject: compile hello world by Make.inc (cmd:gofmt pkg:fmt,flag,os)

Golang provides a gofmt utility to keep a uniform coding style. We can start by copying the Makefile from gofmt:

find /opt/go |grep gofmt

mkdir gonetgrep
cd gonetgrep
cp /opt/go/src/cmd/gofmt/Makefile .

We can use /opt/go/src/cmd/gofmt/gofmt.go as a reference to build our gonetgrep.go. If gocode is installed, here is the most important tip for editing in vim:

package main
import "flag"  // after importing flag
func main() {  // note Go's style: the brace { must not go on the next line
  flag. // press Ctrl-X, Ctrl-O then Ctrl-P or Ctrl-N here
}

Press Ctrl-X then Ctrl-O after typing flag. to see flag's members. If you want to know the usage of the member functions, just look them up in godoc. To clear the automatically typed code, try Ctrl-P again.

A successful programming language should come with a powerful and useful library.

We use the following package functions.

========  =============
C         Golang
========  =============
printf()  fmt.Println()
getopt()  flag.Parse()
argv      flag.Args()
exit()    os.Exit()
========  =============

  1. How to write a long string across several lines in [fd2a code]?

Ans. Use back quotes or the + operator (thanks to Arlen and PeterGo). This bug causes the following compile error:

gonetgrep.go:17: syntax error: unexpected semicolon or newline, expecting )

This bug is fixed in [15dd].
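
Both fixes from the answer can be sketched as follows. This is an illustration, not the code of [15dd]; the constant names usageRaw and usageConcat are made up here. The error above comes from Go's automatic semicolon insertion: a line that ends with a string literal ends the statement, so a + that joins two lines has to be the last token on the first line.

```go
package main

import "fmt"

// usageRaw uses a raw string literal: everything between the back quotes,
// including the newline, is kept verbatim.
const usageRaw = `gonetgrep [options] <keyword> <url> [<url> ...]
Grep keyword in multiple web pages.`

// usageConcat builds the same text with the + operator. The + must stay at
// the end of the first line; moving it to the start of the next line makes
// the compiler insert a semicolon after the first literal.
const usageConcat = "gonetgrep [options] <keyword> <url> [<url> ...]\n" +
	"Grep keyword in multiple web pages."

func main() {
	fmt.Println(usageRaw)
	fmt.Println(usageRaw == usageConcat) // prints true
}
```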

Golang suggests that we use UTF-8 by default. So if we want to display strings, we should write them in UTF-8. For terminals that use a different encoding, we need to convert from UTF-8 to the encoding of the locale. There is no such conversion package among the standard Go packages, so I searched http://godashboard.appspot.com/package. I found two go-iconv packages, chose the one with the highest count, and installed it:

goinstall github.com/sloonz/go-iconv  # this line failed
goinstall github.com/sloonz/go-iconv/src # it works
goinst -clean github.com/sloonz/go-iconv/src # it works when you install by go-hg

The finished code is in [f028]. To format it in the default style:

gofmt -w .

We could use a 3rd-party charset library implemented in Go to solve this problem. Here is the finished code [c567].

  1. Is there a good method to detect the locale other than checking environment variables?
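
For reference, the environment variable check that the question refers to could look like the sketch below. It does not answer the question, and it is not the code of [c567]. The helper name charsetFromEnv is hypothetical, and the lookup order LC_ALL, LC_CTYPE, LANG follows the usual POSIX convention, with values shaped like language_TERRITORY.codeset@modifier.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// charsetFromEnv returns the codeset part (for example "UTF-8" or "Big5") of
// the first non-empty variable among LC_ALL, LC_CTYPE and LANG. It returns ""
// when no codeset is given. getenv is a parameter so that the function can be
// tested without touching the real environment.
func charsetFromEnv(getenv func(string) string) string {
	for _, name := range []string{"LC_ALL", "LC_CTYPE", "LANG"} {
		v := getenv(name)
		if v == "" {
			continue
		}
		// Drop an optional "@modifier" suffix.
		if i := strings.IndexByte(v, '@'); i >= 0 {
			v = v[:i]
		}
		// The codeset follows the first dot.
		if i := strings.IndexByte(v, '.'); i >= 0 {
			return v[i+1:]
		}
		return "" // for example "C" or "POSIX": no codeset given
	}
	return ""
}

func main() {
	fmt.Printf("charset from environment: %q\n", charsetFromEnv(os.Getenv))
}
```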

There are several methods to debug your code.

To debug the code, we can use the 'log' package, as in [9211]:

$ ./gonetgrep foo
2011/05/22 16:01:30 gonetgrep.go:54: before
This is first code Go support utf-8, 也可以用中文寫
2011/05/22 16:01:30 gonetgrep.go:56: after

You may see hexadecimal numbers like this: [fd2a]. That is a snapshot of the source code at git version fd2a.

  • To read the version's source tree in the browser, just click the version.
  • To read the changes of this version, click the link, then click the commit on the right side.
  • To read the commit log, click GitHub's commit button on the upper bar.

To check out the source code on your Linux box, here are sample commands:

# initial copy
git clone git://github.com/dlintw/gonetgrep.git
cd gonetgrep

# get updated source
git pull

# show commit log
git log --all
git log    # show the log of the currently checked-out version only

# switch to a specific version, for example fd2a
git checkout fd2a

# back to the newest version (assuming the branch is named master)
git checkout master

# show the changes from the previous version (fd2a^) to version fd2a
git diff fd2a^ fd2a
  1. Why can't 'git ci' check in when 'git ci -a' can?

Ans. Without -a, git commits only the changes you have staged with 'git add'. Git's process lets you separate a large patch into small pieces by manually adding each new or modified file. [1]

[1] http://plasmasturm.org/log/gitidxpraise

This document is written in reStructuredText format, which is widely used in the Python world.

This document can be converted to HTML with docutils:

rst2html README.rst README.html

  1. How to highlight Go's syntax in rst format?

I need help to finish all these jobs. If you can help me, just fork my source and let me know so I can pull your code and documents.

  • read a file line by line (pkg:io)
  • find the keyword and display the line number (pkg:bytes,regexp)
  • get a web page (pkg:http)
  • store to a file (pkg:path)
  • get multiple web pages with goroutines (pkg:sync)
  • store history in a database (pkg:sqlite)
  • get web pages through multiple agents (pkg:gob)
  • show the web robots' status on the web
  • build test cases (pkg:testing)
  • benchmark the code
  • balance the load at bottlenecks
  • tolerate hardware failure through architecture