samdark/sitemap

Change type of access for methods

WinterSilence opened this issue · 34 comments

Change the access type of methods from private to protected to allow extending the classes.

Which concrete methods do you need, and for which use case? Blindly changing all methods to protected is a bad idea.

It's not a bad idea, since all the methods and properties are interdependent. I want to store the data in a cache rather than in files, so at a minimum I need to override Index::write() plus Sitemap::createNewFile() and flush(). But of course these methods work with other methods and properties which are, you guessed it, private...

Alternatively: you could add a new branch and implement a version that returns the content instead of saving it to files.

Describe your use case. Why exactly do you need to store the sitemap in a cache? How do you serve it?

The cache (https://github.com/kohana/cache).
A controller loads the sitemap XML from the cache (or generates it) and serves it to the client. The sitemap doesn't get many requests, so this is much more convenient than a cron job or regenerating it on every change to the list of links.
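A minimal sketch of that controller flow, assuming a simple file-based cache helper (`cachedSitemap` and its parameters are hypothetical names, not part of samdark/sitemap or Kohana's cache module):

```php
<?php
// Hypothetical sketch: serve sitemap XML from a cache, regenerating it
// only when the cached copy has expired. The file-based cache here is a
// stand-in for Kohana's cache module.

function cachedSitemap(callable $generateXml, string $cacheFile, int $ttl): string
{
    clearstatcache(true, $cacheFile);

    // Serve the cached copy while it is still fresh.
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
        return file_get_contents($cacheFile);
    }

    // Otherwise regenerate and refresh the cache.
    $xml = $generateXml();
    file_put_contents($cacheFile, $xml);

    return $xml;
}

// In a controller action one would then do something like:
// header('Content-Type: application/xml');
// echo cachedSitemap($generateSitemapXml, '/tmp/sitemap.cache', 3600);
```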

You keep the XML in a cache?

I keep the XML text in a file cache and have no problems with it whatsoever. The main idea is to let users decide for themselves how to store the data. I see no point in restricting extensibility like this; a small refactoring could separate content generation from content storage.

PR #16 has been sitting open for a couple of months. A user can't implement compression in their own variant, so they're forced to extend yours, which, as I understand it, you don't want. The result: the library loses its point.

Yes, generation can be separated from storage. But I'm against making everything protected.

Look for yourself at what would remain private if these methods were made protected. Will you do the separation yourself, or should I give it a try?

I've already started working on it, but I'd be happy to look at an alternative solution.

I doubt it will be much of an alternative, but I'll give it a try out of curiosity.

I want to extend the classes and override the write() and flush() methods so that I can have them write to Amazon S3 instead of the local filesystem. I can't do that because the methods are marked private instead of protected. I think getCurrentFilePath() and $this->writer would also need to be protected.

@ejunker I think it's better to abstract writing into a separate interface/class. If you have ideas about it, let's discuss them.
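One possible shape for such an abstraction, sketched here with hypothetical names (`WriterInterface` and `MemoryWriter` are assumptions, not existing library classes):

```php
<?php
// Sketch: hide the writing backend behind an interface so that file,
// in-memory, or S3 targets become interchangeable. Names are hypothetical.

interface WriterInterface
{
    public function append($data);
    public function finish();
}

// In-memory implementation, useful for serving the XML from a response
// instead of writing it to the local filesystem.
class MemoryWriter implements WriterInterface
{
    private $buffer = '';

    public function append($data)
    {
        $this->buffer .= $data;
    }

    public function finish()
    {
        // Nothing to close for an in-memory buffer.
    }

    public function getContents()
    {
        return $this->buffer;
    }
}

// A file implementation would wrap fopen()/fwrite()/fclose(); an S3
// implementation would buffer locally and upload on finish().
```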

@samdark @WinterSilence Any news on this? We have a use case where we just want the XML rendered to the output buffer rather than to a file. Right now there is no way to get a Sitemap instance without passing a file path to it.

  1. Get/set the writer so it can be used with or without a file
  2. If no file is provided, it can assume $this->writer->openURI('php://output');
// Null-object writer: used when no file path is given, so nothing is
// written to disk.
class NullWriter implements WriterInterface
{
    public function append($data)
    {
        // Intentionally a no-op.
    }

    public function finish()
    {
        // Nothing to close.
    }
}

__construct($filePath = null, $useXhtml = false)

if ($filePath !== null) {
    $dir = dirname($filePath);

    if (!is_dir($dir)) {
        throw new \InvalidArgumentException(
            "Please specify a valid file path. Directory does not exist. You have specified: {$dir}."
        );
    }

    $this->filePath = $filePath;
}

$this->useXhtml = $useXhtml;

createNewFile()

if ($this->filePath !== null) {
    $filePath = $this->getCurrentFilePath();

    if ($this->useGzip) {
        if (function_exists('deflate_init') && function_exists('deflate_add')) {
            $this->writerBackend = new DeflateWriter($filePath);
        } else {
            $this->writerBackend = new TempFileGZIPWriter($filePath);
        }
    } else {
        $this->writerBackend = new PlainFileWriter($filePath);
    }
} else {
    $this->writerBackend = new NullWriter();
}

if ($this->writerBackend instanceof NullWriter) {
    $this->writer->openURI('php://output');
} else {
    $this->writer->openMemory();
}
  3. A setIndentString() implementation to control the indentation

That is the general idea. I have seen another issue where a completely new solution is suggested. Any news on whether an implementation is done or in progress?

No implementation was done or is in progress. While it's OK to have the ability to write to any stream, it's not clear what your particular use case is.

@samdark I don't want to use a file to store the XML, I just want to render it to the user. Are you open to accepting a PR with the suggested changes?

That's a very bad idea:

  1. A sitemap doesn't change that often.
  2. You'll get quite high memory usage, considering default output buffering in most frameworks.
  3. You won't be able to meet the standard's limits on the number of URLs and sitemap size.
  4. You won't be able to use multiple sitemaps.
  1. Consider a dynamic site where hundreds of new stories are pushed daily
  2. It will follow the same buffering as now
  3. It will respect the number of URLs allowed, will flush, and will never exceed 50K
  4. When writing to the output URI, no multiple sitemaps are required. If they are, it will fall back to the original file implementation. (It's an extension, not something that dumps the old functionality altogether)
  5. We don't want to manage files on a regular basis (file concurrency issues with CRUD operations)
  6. The stream will be cached, so there is no server load to generate it each time
  1. That's fine. If you regenerate the sitemap on publishing a new story, that means only 200 regenerations. If you generate it dynamically on each request, it means significantly more.
  2. Currently it dumps into a file stream, releasing memory. With output buffering it won't release memory until the buffer is flushed, which, in modern frameworks, happens at the very end of the request-response cycle.
  3. That contradicts "hundreds of new stories pushed daily". You'd reach the limit in at most a year and three months at that publishing rate.
  4. Wrong. A search engine spider doesn't care how you serve the content. It has a hard limit of 50,000 URLs and 10 megabytes of payload. If you pass the limit, it will error out and not accept your sitemap at all.
  1. The writer's flush will handle it by itself; since the stream is the output URI, memory is released as soon as the writer's flush is called
  2. We only get the newly published URLs, not all of them, so 50K still won't be reached
  3. Same as point 3
  1. Yes, the problem is when it's called, but yeah, that could be irrelevant and it's fine not to care about it.
  2. Huh? That doesn't make sense. Search engines, as far as I remember, expect a full sitemap, not a partial one.
  1. It will be a full sitemap with the new URLs. Whenever the URLs are updated, the new URLs will be indexed, as the old ones have already been crawled

I think it's a bad idea to submit only new URLs and omit old ones: https://webmasters.stackexchange.com/questions/2459/should-i-include-everything-in-the-sitemap-or-only-new-content. No search engine guarantees that a page that reached the index will stay there forever.

Got it. But I think that's a separate issue from writing the XML to a URI rather than to the filesystem or memory.
We still need a way to render content to a source URI, right?

Well, it was designed to work with files. For example, it advances to the next file when the current one is full: https://github.com/samdark/sitemap/blob/master/Sitemap.php#L227. I'm not sure how to do that if you've specified a stream that's not a file.
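The rotation being referred to boils down to a per-file URL counter; a simplified sketch of the idea (not the actual implementation, class and method names here are assumptions):

```php
<?php
// Simplified sketch of the file-rotation logic: when the current sitemap
// file reaches its URL limit, a new file is started. With a plain output
// stream there is no equivalent of "the next file", which is the problem
// discussed here. This is not the actual samdark/sitemap code.

class RotatingSitemap
{
    private $urlsCount = 0;
    private $fileIndex = 0;
    public $files = [];

    public function __construct(private int $maxUrls = 50000)
    {
        $this->createNewFile();
    }

    private function createNewFile()
    {
        $this->fileIndex++;
        // The real code opens a writer backend here; we only record the name.
        $this->files[] = "sitemap-{$this->fileIndex}.xml";
        $this->urlsCount = 0;
    }

    public function addItem($url)
    {
        // Advance to the next file once the current one is full.
        if ($this->urlsCount >= $this->maxUrls) {
            $this->createNewFile();
        }
        $this->urlsCount++;
    }
}
```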

Nothing happens; as said earlier, the writer just flushes the stream and carries on with the processing.

  1. $this->writerBackend instanceof NullWriter does nothing
  2. Rather than creating file or memory objects on the fly, one should inject them in __construct()
  3. Depending on the type of injection, the respective createNewFile() or createSourceUri() can be called
  4. Injecting an object would change the code a bit and would not be backward compatible
  5. Creating the object on the fly, as it does at the moment, plus an instanceof check, will do

Carries on with processing where? For files it opens a new file. Continuing to flush to output in this case would result in an incorrect sitemap that won't be parsed by search engines. It doesn't make sense to me.

Same with memory. You'll need a sitemap index and a set of sitemaps, but these are read in multiple requests, so you can't direct them into streams.

It will output the whole stream to the source URI; no new file will be created because of the NullWriter check.

But that basically won't be read once you reach 50,000 elements or 10 MB of data, and you have to write all URLs into the sitemap.

Agreed, but we don't have a use case with 50K URLs or 10 MB of data. If that becomes the case, the fallback is always the original implementation, i.e. the file system.

If you have the filesystem fallback anyway, why do you even need to generate anything at runtime at all?

Right now we don't have any need to use the filesystem when we can manage at runtime. Handling file read/write concurrency is not desired.

I see. Well, sorry, but I don't think providing an easy way to shoot yourself in the foot is a good idea. Therefore, please maintain your own fork.