Auto detect html-fragment

Question

Closed this issue 8 years ago · 5 comments

I would like to suggest auto-detecting html-fragment input.

The loader could check whether the input's root node is <html>: if it is, the input is html; otherwise, it is html-fragment.

https://github.com/wasinger/htmlpagedom/blob/master/src/HtmlPageCrawler.php
has a similar detection.

public function addContent($content, $type = null)
{
    if (empty($type)) {
        $type = 'text/html;charset=UTF-8';
    }
    if (substr($type, 0, 9) == 'text/html' && !preg_match('/<html\b[^>]*>/i', $content)) {
        // string contains no <html> Tag => no complete document but an HTML fragment!
        $this->addHtmlFragment($content);
    } else {
        parent::addContent($content, $type);
    }
}
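
To make the suggestion concrete, here is a minimal sketch of the "first root node" check as a standalone function. The name detectHtmlKind and its return values are hypothetical and not part of FluentDOM or HtmlPageDom. Unlike the unanchored pattern above, it only looks at the first element after an optional XML declaration, doctype, comments, and whitespace.

```php
<?php
// Hypothetical helper for illustration; not an existing library function.
// Returns 'html' when the first element of $content is <html>,
// otherwise 'html-fragment'.
function detectHtmlKind(string $content): string
{
    // Things that may legally precede the root element:
    // whitespace, an XML declaration, a doctype, and comments.
    $prefix = '(?:\s|<\?xml\b.*?\?>|<!DOCTYPE\b[^>]*>|<!--.*?-->)*';

    // Anchored at the start, so an "<html>" mentioned later in the
    // text does not count as the root node.
    $isDocument = preg_match('/^' . $prefix . '<html\b[^>]*>/is', $content) === 1;

    return $isDocument ? 'html' : 'html-fragment';
}
```

With this sketch, '<p>Hello</p>' is treated as a fragment and a string starting with a doctype followed by <html> is treated as a document. A document that omits the optional <html> tag would still be classified as a fragment, so the check is only a heuristic.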

What do you think?

Answer 1 · 2017-03-03T13:59:36.000Z

I am not sure that this is a good idea. In HTML5 the <html> tags are optional, so it would not work for this loader (consistency). Additionally, the default behavior is to add the tags.

Answer 2 · 2017-03-03T15:04:22.000Z

I believe that in most cases we want to load some HTML, modify a few DOM nodes, and then save it in the same format. We should not rely on the repair behavior of DOMDocument. It would be better to have an option to save it as a complete HTML document, and to make saving as an html-fragment (without wrapping the fragment) the default, rather than defining this in the loader (sometimes I don't know whether the input is HTML or just a fragment, because it is UGC).

Of course, it would be much better if we could auto-detect all content types (JSON, XML, html, html-fragment) by default, not only html and html-fragment, and leave a second parameter for developers to specify a content type. Sometimes the second parameter is missing when loading HTML (I understand that, as a best practice, we shouldn't omit it. But you know, we are just human. 😢 ), and that throws an exception, because the default content type is xml.

PHP Fatal error:  Uncaught FluentDOM\Exceptions\LoadingError\Libxml: Libxml fatal error in line 7 at character 8: Opening and ending tag mismatch: meta line 2 and html.

Answer 3 · 2017-03-03T19:50:32.000Z

I don't think that this would be a good feature for the core library. It would introduce too many dependencies; basically, you would have a single global loader in the core.

I would be open to extending the core so that the loader can provide the content type.

This would allow implementing an 'auto-detect' loader as a plugin in a separate project. This loader could be specific to the project. Basically, you could write a loader that does the detection and calls the actual loaders.

I have to think about this a little more.

Answer 4 · 2017-03-04T10:04:36.000Z

I checked, and this should work already. Develop your own loader, maybe using "*" as the content type. The loader could then do auto-detection specific to your project and return a FluentDOM\Loader\Result with the detected content type.
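
The detect-then-delegate idea can be sketched roughly as follows. This is only an illustration under assumptions: guessContentType, loadWithDetection, the short type labels, and the array of loader callables are all hypothetical, and the wiring into FluentDOM itself (the custom loader class and the FluentDOM\Loader\Result it would return) is deliberately left out.

```php
<?php
// Hypothetical sketch; none of these function names come from FluentDOM.

/** Guess a content type label from the start of the source, or null. */
function guessContentType(string $source): ?string
{
    $trimmed = ltrim($source);
    if ($trimmed === '') {
        return null;
    }
    // JSON: starts with { or [ and decodes without an error.
    if ($trimmed[0] === '{' || $trimmed[0] === '[') {
        json_decode($trimmed);
        if (json_last_error() === JSON_ERROR_NONE) {
            return 'json';
        }
    }
    // XML: starts with an XML declaration.
    if (strncasecmp($trimmed, '<?xml', 5) === 0) {
        return 'xml';
    }
    // Full HTML document: starts with an HTML doctype or an <html> tag.
    if (preg_match('/^(?:<!DOCTYPE\s+html\b|<html\b)/i', $trimmed) === 1) {
        return 'html';
    }
    // Any other markup is treated as a fragment (project-specific choice).
    if ($trimmed[0] === '<') {
        return 'html-fragment';
    }
    return null;
}

/**
 * Detect the type, then delegate to the matching loader callable.
 * $loaders maps a type label to a callable taking the source string.
 */
function loadWithDetection(string $source, array $loaders): array
{
    $type = guessContentType($source);
    if ($type === null || !isset($loaders[$type])) {
        throw new InvalidArgumentException(
            'No loader for content type: ' . ($type ?? 'unknown')
        );
    }
    // Return the detected type alongside the loaded result.
    return ['contentType' => $type, 'result' => $loaders[$type]($source)];
}
```

Note that XML without a declaration falls into the html-fragment branch here; a project-specific loader would tune these rules to the input it actually expects.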

Answer 5 · 2017-03-05T19:10:38.000Z

Thanks, I will try that.