Cobertos/md2notion

Math equation is broken

shizidushu opened this issue · 10 comments

### Linear Models and Least Squares

Given a vector of inputs $X^T=(X_1, X_2, \ldots, X_p)$, we predict output $Y$ via the model
$$
\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^p X_j \hat{\beta}_j
$$
The term $\hat{\beta}_0$ is the intercept, also known as the *bias* in machine learning. Often it is convenient to include the constant variable 1 in $X$, include $\hat{\beta}_0$ in the vector of coefficients $\hat{\beta}$, and then write the linear model in vector form as an inner product
$$
\hat{Y} = X^T \hat{\beta}
$$
where $X^T$ denotes vector or matrix transpose ($X$ being a column vector). Here we are modeling a single output, so $\hat{Y}$ is a scalar; in general $\hat{Y}$ can be a $K$-vector, in which case $\beta$ would be a $p \times K$ matrix of coefficients. In the $(p+1)$-dimensional input-output space, $(X, \hat{Y})$ represents a hyperplane. If the constant is included in $X$, then the hyperplane includes the origin and is a subspace; if not, it is an affine set cutting the $Y$-axis at the point $(0, \hat{\beta}_0)$. From now on we assume that the intercept is included in $\hat{\beta}$.

In typora:
[image]

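# Imports assumed here; they were not shown in the original post
import re
from notion.block import PageBlock
from md2notion.upload import convert, uploadBlock
from md2notion.NotionPyRenderer import NotionPyRenderer, addLatexExtension

# Same fence pattern as defined later in this thread
pattern = re.compile(r'( {0,3})((?:\$){2,}) *(\S*)')

# `page` is an existing notion-py page obtained elsewhere
with open('temp.md', "r", encoding="utf-8") as mdFile:
    newPage = page.children.add_new(PageBlock, title=mdFile.name)

    # Surround every $$ fence with newlines before converting
    txt = mdFile.read()
    txt_list = re.split(pattern, txt)
    for i, string in enumerate(txt_list):
        if string == '':
            txt_list[i] = '\n'
    new_txt = ''.join(txt_list)

    rendered = convert(new_txt, addLatexExtension(NotionPyRenderer))
    for blockDescriptor in rendered:
        uploadBlock(blockDescriptor, newPage, mdFile.name)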

The equation is broken
[image]

Hmm, it looks like the _0 ... m_ in the equation has been interpreted as Markdown italics by notion-py?

That's the only difference I see between your equation and the below:

\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^p X_j \hat{\beta}_j
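To illustrate the suspicion above, here is a minimal sketch of how a naive underscore-emphasis rule would latch onto this equation. The regex is a made-up stand-in for illustration only; it is not notion-py's actual Markdown handling.

```python
import re

latex = r'\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^p X_j \hat{\beta}_j'

# Hypothetical, deliberately naive emphasis rule: anything between two underscores.
naive_emphasis = re.compile(r'_(.+?)_')

match = naive_emphasis.search(latex)
# The first underscore pair spans from the subscript of \hat{\beta} to \sum_
italic_span = match.group(0)
print(italic_span)
```

The matched span starts at `_0` and ends at the underscore of `\sum_`, which lines up with the `_0 ... m_` stretch that went missing in the rendered equation.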

This line should be setting title_plaintext like CodeBlock does, instead of title. That should fix it

@Cobertos Thanks, I got it working.

from mistletoe.block_token import BlockToken
from mistletoe.html_renderer import HTMLRenderer
from mistletoe import span_token
from mistletoe.block_token import tokenize

from md2notion.NotionPyRenderer import NotionPyRenderer

from notion.block import EquationBlock, field_map



class CustomEquationBlock(EquationBlock):

    latex = field_map(
        ["properties", "title_plaintext"],
        python_to_api=lambda x: [[x]],
        api_to_python=lambda x: x[0][0],
    )

    _type = "equation"


class CustomNotionPyRenderer(NotionPyRenderer):
    
    def render_block_equation(self, token):
        def blockFunc(blockStr):
            return {
                'type': CustomEquationBlock,
                'title_plaintext': blockStr #.replace('\\', '\\\\')
            }
        return self.renderMultipleToStringAndCombine(token.children, blockFunc)


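import re
# Matches a block-equation fence: up to 3 spaces of indent, a run of 2+ '$', then an optional trailing token
pattern = re.compile(r'( {0,3})((?:\$){2,}) *(\S*)')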

class Document(BlockToken):
    def __init__(self, lines):
        if isinstance(lines, str):
            lines = lines.splitlines(keepends=True)
        else:
            txt = lines.read()
            txt_list = re.split(pattern, txt)
            for i, string in enumerate(txt_list):
                if string == '':
                    txt_list[i] = '\n'
            lines = ''.join(txt_list)
            lines = lines.splitlines(keepends=True)
        lines = [line if line.endswith('\n') else '{}\n'.format(line) for line in lines]
        self.footnotes = {}
        global _root_node
        _root_node = self
        span_token._root_node = self
        self.children = tokenize(lines)
        span_token._root_node = None
        _root_node = None

def markdown(iterable, renderer=HTMLRenderer):
    """
    Output HTML with default settings.
    Enables inline and block-level HTML tags.
    """
    with renderer() as renderer:
        return renderer.render(Document(iterable))


def convert(mdFile, notionPyRendererCls=NotionPyRenderer):
    """
    Converts a mdFile into an array of NotionBlock descriptors
    @param {file|string} mdFile The file handle to a markdown file, or a markdown string
    @param {NotionPyRenderer} notionPyRendererCls Class inheritting from the renderer
    incase you want to render the Markdown => Notion.so differently
    """
    return markdown(mdFile, notionPyRendererCls)
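For anyone wondering what the `re.split` preprocessing in `Document.__init__` above actually does to the text, here is a small standalone sketch using the same pattern. The helper name `pad_block_equations` and the sample string are made up for illustration.

```python
import re

# Same pattern as in the snippet above
pattern = re.compile(r'( {0,3})((?:\$){2,}) *(\S*)')

def pad_block_equations(txt):
    # re.split keeps the three captured groups in the result;
    # groups that matched nothing come back as empty strings.
    parts = re.split(pattern, txt)
    # Replacing each empty piece with '\n' surrounds every '$$' fence with newlines.
    return ''.join('\n' if part == '' else part for part in parts)

sample = 'via the model\n$$\nY = X\n$$\nwhere X is a vector'
padded = pad_block_equations(sample)
print(repr(padded))
```

Note that the padding also lands inside the equation (between each fence and the equation body), so the equation string needs trimming afterwards.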

The InlineEquation has the same problem. @Cobertos Can you have a look?

I'll leave it open until the fix gets in the library itself. Will need to do that soon.

As for the inline equations, notion-py is the one that actually handles uploading inline equations to Notion, added in this PR. This is because it does some special conversions to convert to Notion's expected format.

Looking at that PR, it looks like notion-py's inline equations are formatted with double '$$'s, not single? Which seems to differ from your example, not sure if that is working for you?

In your case though, in md2notion, emphasis is handled by re-echoing out the specific markdown as notion-py will handle that later. That's going to cause issues in your case, converting _ to '*'. I will look into seeing if there's a way mistletoe will allow the exact emphasis formatting marker to carry over. That should at least preserve your _ to let notion-py handle the rest.

There is no problem related to the single $, it has been handled well somewhere.

There is another problem worth mentioning: if there is no blank line before the block equation, the block equation will be treated as part of a TextBlock.
I add \n before and after the double $$ and then trim the equation block string to avoid this.

import itertools

# `lines` is the list of newline-terminated lines built in Document.__init__
new_lines = []
for i, line in enumerate(lines):
    new_line = [None, line, None]
    if line == '$$\n':
        # Add a blank line before the fence unless the previous line is already blank
        if i > 0 and lines[i-1] != '\n':
            new_line[0] = '\n'
        # Add a blank line after the fence unless the next line is already blank
        if i < len(lines) - 1 and lines[i+1] != '\n':
            new_line[2] = '\n'
    new_lines.append(new_line)
# Flatten the [before, line, after] triples and drop the unused slots
lines = [part for part in itertools.chain.from_iterable(new_lines) if part is not None]

Hopefully the package can handle this too, and perhaps more intelligently.

title_plaintext is now added to master. I also added two tests. Still need to push a package update

To answer all the fixes/questions related to the current state of equation blocks:

The InlineEquation has the same problem.

I added a test that now tests for this. What gets passed to notion-py should be well-formed. Looks like notion-py is parsing the inline Markdown again, so that is most likely where the issue arises.

I don't see an easy fix for notion-py on this though...

There is no problem related to the single $, it has been handled well somewhere.

Whoops, yes, I was mistaken. This works correctly. Single $'s are converted to $$ in the output.

There is another problem worth mentioning: if there is no blank line before the block equation, the block equation will be treated as part of a TextBlock.

Hmm, I am seeing this issue. Ideally we would support this sort of case because it's similar to how CommonMark's specification describes code fences. "A fenced code block may interrupt a paragraph, and does not require a blank line either before or after."

After some research, the issue lies with how mistletoe's Paragraph block read() function works. It specifically looks for CodeFence.start() to break out of its read() loop. We would need to edit Paragraph's read() function to add BlockEquation.start() in there too to fix this.
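As a rough model of that behaviour, here is a simplified sketch. It is not mistletoe's actual code: `read_paragraph`, `code_fence_start` and `block_equation_start` are hypothetical stand-ins that only mimic the idea of a paragraph reader that stops when another block's start check matches.

```python
import re

# Hypothetical stand-ins for the start() checks of block tokens
def code_fence_start(line):
    return re.match(r' {0,3}(`{3,}|~{3,})', line) is not None

def block_equation_start(line):
    return re.match(r' {0,3}\${2,}', line) is not None

def read_paragraph(lines, interrupters):
    """Consume lines until a blank line or an interrupting block start."""
    paragraph = []
    for line in lines:
        if line.strip() == '' or any(start(line) for start in interrupters):
            break
        paragraph.append(line)
    return paragraph

lines = ['via the model\n', '$$\n', '\\hat{Y} = X^T \\hat{\\beta}\n', '$$\n']

# Only code fences interrupt: the equation is swallowed into the paragraph.
swallowed = read_paragraph(lines, [code_fence_start])
# With the equation check added, the paragraph stops before the '$$' fence.
stopped = read_paragraph(lines, [code_fence_start, block_equation_start])
```

In this toy version the fix is just one more entry in the list of interrupters; in mistletoe itself the check lives inside Paragraph's read(), which is what makes it awkward to patch from an extension.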

Upstream tag for the inline equation issue. Open to ideas to fix the newline thing... I can't think of an easy way to integrate that.