vezaynk/Sitemap-Generator-Crawler

Standard output


If --file is not defined, shouldn't the sitemap.xml be written to stdout instead of a file?

It is usually defined. If it is an empty string, sending to stdout would be logical.

In that case logger() would have to be disabled completely so that its messages do not end up mixed into the XML.

I am not really sure about the usefulness of the logger. It's a nice idea, especially if the script is run from the command line, but when it's part of a web service or run from cron it becomes obsolete. Also, the configuration via an array is somewhat complicated. What kind of use cases did you have in mind for your script?

The logger is mostly there for debugging when setting things up for the first time; it becomes useless afterwards.

I'm thinking of 3 things:

  1. a flag to enable stdout output
  2. debug flags that the user can disable
  3. no file will be written if $file is not set

This gives users the most options without forcing anything on them. I'd rather not take away any choices, even if that would be the "right way" to do things.

These behaviors should be documented and perhaps even put into the README since it's probably a common use-case to pipe or redirect the output.
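As a rough illustration of points 1 and 3, here is a minimal sketch, assuming that an empty or unset `$file` means "write to standard output". The helper names `resolve_output_target()` and `write_sitemap()` are hypothetical and are not part of the current script.

```php
<?php
// Hypothetical helper: decide where the sitemap should go.
// An empty or unset $file is treated as "write to standard output".
function resolve_output_target(?string $file): string
{
    return ($file === null || $file === '') ? 'php://stdout' : $file;
}

// Hypothetical helper: write the finished XML to the chosen target.
// Returns false on failure so the caller can decide how to report it.
function write_sitemap(string $xml, ?string $file): bool
{
    return file_put_contents(resolve_output_target($file), $xml) !== false;
}
```

With something like this in place, the output could be redirected or piped from the shell, provided nothing else in the script prints to stdout.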

I would keep it simple with --debug and --verbose options: --verbose would print what the logger prints now, and --debug would print all available info. I would turn both off by default and maybe show some kind of progress animation, the number of pages scanned, or something similar, so the user knows the script is running and has not crashed.
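A minimal sketch of how that could look, assuming the two flags are parsed with PHP's built-in `getopt()` and that log messages go to STDERR so they never mix with a sitemap written to stdout. The `LEVEL_*` constants and the function names are made up for this example.

```php
<?php
// Hypothetical verbosity levels; both flags are off by default.
const LEVEL_QUIET   = 0; // no log output
const LEVEL_VERBOSE = 1; // --verbose: what the logger prints today
const LEVEL_DEBUG   = 2; // --debug: everything

// Derive the level from an options array as returned by getopt().
// Flags without a value show up as keys mapped to false.
function log_level_from_options(array $options): int
{
    if (array_key_exists('debug', $options)) {
        return LEVEL_DEBUG;
    }
    if (array_key_exists('verbose', $options)) {
        return LEVEL_VERBOSE;
    }
    return LEVEL_QUIET;
}

// A message is shown only if the current level is at least its own level.
function should_log(int $messageLevel, int $currentLevel): bool
{
    return $messageLevel <= $currentLevel;
}

// Write to STDERR (CLI only) so stdout stays reserved for the XML.
function log_message(string $message, int $messageLevel, int $currentLevel): void
{
    if (should_log($messageLevel, $currentLevel)) {
        fwrite(STDERR, $message . PHP_EOL);
    }
}

// getopt() can return false on failure, hence the fallback to [].
$level = log_level_from_options(getopt('', ['verbose', 'debug']) ?: []);
log_message('Crawl started', LEVEL_VERBOSE, $level);
```

Because --debug implies --verbose here, a single integer comparison is enough and no per-message configuration array is needed.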

Interesting suggestion.

Debug disabled by default and some animation/counter enabled by default that will be ignored when piped?

I'm not sure how to do that, I don't have much experience writing php cli scripts.

> Debug disabled by default and some animation/counter enabled by default that will be ignored when piped?

Yes, exactly, or could even be stats about the crawl.

It could look something like this: https://www.xml-sitemaps.com/

> I'm not sure how to do that, I don't have much experience writing php cli scripts.

The solution should work with both CLI and HTML output; it should be generic. I have even less CLI experience, but I am happy to figure it out if nobody else can help. Today was the first time I touched such code. Maybe someone here knows?
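One possible approach, sketched under a few assumptions: `PHP_SAPI` distinguishes a CLI run from a web request, and `stream_isatty()` (available since PHP 7.2) tells whether a stream is an interactive terminal. The function names and the three mode strings below are hypothetical.

```php
<?php
// Decide how progress should be shown, if at all.
// $sapi is normally PHP_SAPI; $isTty says whether the progress stream is a terminal.
function progress_mode(string $sapi, bool $isTty): string
{
    if ($sapi !== 'cli') {
        return 'html';
    }
    return $isTty ? 'terminal' : 'none';
}

// Format one progress update for the given mode.
function format_progress(int $scanned, string $mode): string
{
    switch ($mode) {
        case 'terminal':
            // "\r" moves back to the start of the line so the counter overwrites itself.
            return "\rPages scanned: $scanned";
        case 'html':
            return "Pages scanned: $scanned<br>\n";
        default:
            return '';
    }
}

// STDERR only exists in CLI runs; stream_isatty() needs PHP 7.2 or later.
function stderr_is_tty(): bool
{
    return defined('STDERR')
        && function_exists('stream_isatty')
        && stream_isatty(STDERR);
}

$mode = progress_mode(PHP_SAPI, stderr_is_tty());
```

Checking STDERR rather than STDOUT is a design choice: if only stdout is redirected to a file, the counter can still be shown on the terminal through stderr, and it disappears when stderr is redirected too.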

I can do it myself, but only in a week's time because of college.

If you would like to try your hand at it, go for it!

I think the image stuff is most critical right now. This change is very small and almost cosmetic anyway, so it should not take much time, especially if someone who already knows how to do it helps.

The most critical things are bugs such as #34; everything else is meh. Images will be a massive time sink.

As I mentioned previously, this project is centered around being lightweight and not having dependencies (except for cURL, although there could be an attempt to replace it).

This means that we can't make our lives easy by using things like DOMDocument to parse HTML and must heed the call of Cthulhu. We have a working regex to extract attribute values, so it shouldn't be an issue to port it over to work with images.
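To give an idea of what such a port might look like, here is a minimal sketch that pulls `src` values out of `<img>` tags with `preg_match_all()`. This is not the project's existing regex; the pattern and the function name `extract_image_sources()` are assumptions made for illustration, and the pattern only handles quoted attribute values.

```php
<?php
// Hypothetical helper: collect the src values of <img> tags in an HTML string.
// Limits: only quoted values are matched, and a ">" inside an earlier
// attribute value of the same tag will cause that tag to be missed.
function extract_image_sources(string $html): array
{
    // \ssrc requires whitespace before "src", so data-src is not picked up.
    // (["\']) captures the opening quote and \1 demands the same closing quote.
    $pattern = '/<img\b[^>]*?\ssrc\s*=\s*(["\'])(.*?)\1/is';
    preg_match_all($pattern, $html, $matches);

    // Drop duplicates and reindex the result.
    return array_values(array_unique($matches[2]));
}
```

Relative URLs would still need to be resolved against the page URL before being written into the sitemap.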

I'm going off on a tangent here, but image indexing will definitely not be easy.