getpelican/pelican

provide public access to _content etc

sebbASF opened this issue · 8 comments

  • I have searched the issues (including closed ones) and believe that this is not a duplicate.
  • I have searched the documentation and believe that my question is not covered.
  • I am willing to lend a hand to help implement this feature.

Feature Request

Plugins regularly need to access _content.
However pylint (rightly) complains that this is using protected-access.

AFAICT it is expected that plugins may need to access _content (and _summary), so the protected status is just a nuisance when using pylint.

Renaming would cause lots of issues, but it would be possible to provide a public accessor for use by plugins.
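To make the request concrete, here is a minimal sketch of what such an accessor could look like. The Content class below is a stand-in, not Pelican's real class, and the property name raw_content is purely hypothetical; nothing here is an existing Pelican API.

```python
class Content:
    """Stand-in for a content object; NOT Pelican's real class."""

    def __init__(self, raw_html):
        # Protected by convention: pylint flags reads of this name
        # from outside the class as protected-access.
        self._content = raw_html

    @property
    def raw_content(self):
        """Hypothetical public, read-only view of the stored markup."""
        return self._content


def count_links(content):
    # Plugin-style helper that never touches a protected attribute.
    return content.raw_content.count("<a ")


page = Content('<p><a href="{filename}/about.md">About</a></p>')
print(count_links(page))
```

Because the property has no setter, plugins could read the value without being able to rebind it, and the existing _content name would stay untouched for backwards compatibility.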

Would any of these Pelican signals help you deal with access to each document's content, specifically content signals?

Something like:

  • signals.content_object_init.connect(my_plugin_content_processor)

About 80% of the plugins use this signal.

A content handler would look like this:

from pelican import signals
from pelican.contents import Content, Article, Page

def my_content_object_init(content_class):
    # Description:
    #   First signal handler to provide the actual content of any article/page/static
    #   file.
    #
    # arg1 : content_class:Content
    #
    # article of Article(Content) class provides the following variable member items:
    #   allowed_statuses:tuple, author:Author, authors:list, category:Category,
    #   content:str, date:SafeDatetime, date_format:str, default_status:str,
    #   default_template:str, filename:str, get_content:partial, get_summary:partial,
    #   in_default_lang:bool, lang:str, locale_date:str, mandatory_properties:tuple,
    #   metadata:dict, private:str, reader:str, relative_dir:str,
    #   relative_source_path:str, save_as:str, settings:dict, slug:str,
    #   source_path:str, status:str, summary:str, tags:list, template:str,
    #   timezone:Zoneinfo, title:str, translations:list, url:str, url_format:dict
    #
    # Callstack
    #     signals.content_object_init.send()
    #     Content.__init__()
    #     Article.__init__()
    #     Readers.read_file()
    #     ArticlesGenerator.generate_context()
    #     Pelican.run()
    #
    # 4th article-related signal
    # 3rd signal in ArticlesGenerator.generate_context()
    # Still inside read_file()
    # First signal appearance having a content provided by Markdown.read_file()
    #
    # Hooked using signals.content_object_init.connect(my_content_object_init)
    #
    print('my_content_object_init called')
    print('my_content_object_init: content: {0!s}'.format(content_class.content))

    if not isinstance(content_class, (Article, Page)):
        return
    # Do your article/page processing here
    return

You can hook the handler above to the content_object_init signal like this:

# This is how a Pelican plugin works.
# register() is the well-established function name that Pelican looks for
# so that this plugin gets recognized, initialized, and has its
# processors hooked into the Pelican app.
import logging

from pelican import signals

# Module-level logger; the original snippet used `logger` without defining it.
logger = logging.getLogger(__name__)

def register():
    logger.info(
        'MY plugin registered for Pelican, using new 4.0 plugin variant')
    signals.content_object_init.connect(my_content_object_init)

I don't see how that helps.

This request is about getting public access to the protected field _content which is part of the Content object, not about getting access to the Content object.

Having reviewed all the signals (as of v4.9.1), I am not fully convinced ... yet ... that content needs to be made available outside of the signals.content_object_init signal as 'unprotected' access. Of course, I am not the designer, but the current Pelican design resonates with me.

While Python (or the JetBrains PyCharm IDE) will let you access the protected ._content attribute, ideally a plugin should only use the public .content attribute, which is what your plugin's content processor function receives when it is hooked through the signals.content_object_init handler.

Is there a particular signal stage that you need content access within? I have listed all the signals used in Pelican v4.9.1 in chronological order:

    # All signals are listed here as of Pelican v4.9.1
    signals.initialized.connect()
    signals.get_generators.connect()
    signals.readers_init()  # Article class
    signals.generator_init()  # ArticlesGenerator class
    signals.article_generator_init.connect()
    signals.readers_init() 
    signals.readers_init()  # Page class
    signals.generator_init()  # PagesGenerator
    signals.page_generator_init()
    signals.readers_init()
    signals.generator_init()
    signals.readers_init()  # Static class
    signals.generator_init()  # StaticGenerator
    signals.static_generator_init()
    signals.article_generator_preread.connect()
    signals.article_generator_context.connect()
    signals.content_object_init.connect()
    signals.article_generator_pretaxonomy.connect()
    signals.article_generator_finalized.connect()
    signals.page_generator_preread.connect()
    signals.page_generator_context.connect()
    signals.content_object_init.connect()
    signals.page_generator_finalized.connect()
    signals.static_generator_preread.connect()
    signals.static_generator_context.connect()
    signals.content_object_init.connect()
    signals.static_generator_finalized.connect()
    signals.all_generators_finalized.connect()
    signals.get_writers()
    signals.feed_generated()
    signals.feed_written()
    signals.article_generator_write_article.connect()
    signals.content_written()
    signals.article_writer_finalized.connect()
    signals.page_generator_write_page.connect()
    signals.content_written()
    signals.page_writer_finalized()
    signals.content_written()
    signals.pelican_finalized()

Here are some plugins that reference _content:

https://github.com/getpelican/pelican-plugins/blob/c61bd12914fd52af1808c53151a07225e7c3341c/glossary/glossary.py#L36

Got it. I think I may have a fix, but no time to test it.

Right off the bat, I can tell you that this particular plugin should be easily fixable by replacing the article_generator_finalized signal with signals.content_object_init.connect(parse_content):

def register():
    signals.initialized.connect(get_excludes)
    signals.content_object_init.connect(parse_content)
    signals.page_generator_context.connect(set_definitions)

Then upgrade the protected content._content to the normal content.content:

def parse_content(content):
    # vvvvv NEW CODE vvvvv
    # Only process Article or Page subclass contents
    if not isinstance(content, (Article, Page)):
        return
    # ^^^^^ NEW CODE ^^^^^
    # resume normal code
    soup = bs4.BeautifulSoup(content._content, 'html.parser')
    ...

Note the restriction to Article or Page; modify that as needed.

Oh yeah, remove the parse_articles function and its loop over articles entirely, as the signal now operates on a single document at a time.
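A rough sketch of that per-document shape, using stand-in classes instead of Pelican's own so it runs on its own. The Static class and the processed flag are illustrative assumptions, and the final list comprehension only imitates the signal calling the handler once per object.

```python
class Content:
    """Stand-in base class; NOT Pelican's real Content."""
    processed = False

class Article(Content):
    pass

class Page(Content):
    pass

class Static(Content):
    pass


def parse_content(content):
    # Per-object handler: skip anything that is not an Article or a Page.
    if not isinstance(content, (Article, Page)):
        return False
    content.processed = True  # placeholder for the real parsing work
    return True


# Imitates the signal firing once for each content object as it is created,
# so the plugin itself no longer loops over generator.articles.
objects = [Article(), Static(), Page()]
handled = [type(obj).__name__ for obj in objects if parse_content(obj)]
print(handled)
```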

Your suggested change does not solve the issue. It does not change the line where _content is referenced:
soup = bs4.BeautifulSoup(content._content, 'html.parser')

Oops, my bad. I meant: please replace ALL instances of ._content with .content, if you haven't already. Did that work as well?

_content is a protected variable; in short, it is meant to be treated as read-only, and accessing it from code outside the class is discouraged.

We are going round in circles.

._content and .content don't always return the same value, otherwise plugins would not need to use _content.
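A toy model of how the two can diverge. It assumes, as a simplification of my understanding of Pelican, that the public property post-processes the stored markup on each read (here by resolving {filename} link placeholders against a site URL). The class and the regex are illustrative only, not Pelican's implementation.

```python
import re


class Content:
    """Toy model only; Pelican's real Content class is more involved."""

    def __init__(self, raw_html, siteurl):
        self._content = raw_html  # raw markup as produced by the reader
        self._siteurl = siteurl

    @property
    def content(self):
        # Assumed behaviour for this sketch: rewrite "{filename}" link
        # placeholders into absolute site URLs every time it is read.
        return re.sub(r"\{filename\}/?", self._siteurl + "/", self._content)


doc = Content('<a href="{filename}/about.html">About</a>', "https://example.org")
print(doc._content)  # raw markup, placeholder still present
print(doc.content)   # processed markup, placeholder resolved
```

A plugin that wants to edit the markup before this post-processing step has to work on the raw value, which is why simply swapping ._content for .content is not equivalent.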