fhightower/html-to-json

Example produces invalid JSON (single quotes)

mlliarm opened this issue · 1 comment

Hello,

I found today that, with Python 3.9.15 and 3.10.xx, printing the result of your library's example gives output that is not valid JSON: the strings are wrapped in single quotes, so JSON tools do not recognize it.

The code I wrote:

# bad_test.py
import html_to_json

html_string = """
<head>
    <title>Floyd Hightower's Projects</title>
    <meta charset="UTF-8">
    <meta name="description" content="Floyd Hightower&#39;s Projects">
    <meta name="keywords" content="projects,fhightower,Floyd,Hightower">
</head>
"""
output_json = html_to_json.convert(html_string)
print(output_json)

Result:

{'head': [{'title': [{'_value': "Floyd Hightower's Projects"}], 'meta': [{'_attributes': {'charset': 'UTF-8'}}, {'_attributes': {'name': 'description', 'content': "Floyd Hightower's Projects"}}, {'_attributes': {'name': 'keywords', 'content': 'projects,fhightower,Floyd,Hightower'}}]}]}
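
For context, here is a minimal sketch of why the output looks like this, assuming `convert` returns a plain Python dict (the `sample` dict is a hand-written stand-in, not real library output). `print` writes the dict's `repr`, which is Python literal syntax rather than JSON, so a JSON parser rejects it:

```python
import json

# Hand-written stand-in for the dict returned by html_to_json.convert (assumption)
sample = {"title": [{"_value": "Floyd Hightower's Projects"}]}

# str(sample) is the text that print(sample) writes: Python repr, single-quoted keys
printed = str(sample)

try:
    json.loads(printed)
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False

# json.dumps produces real JSON, which parses back to an equal dict
serialized = json.dumps(sample)
round_trip = json.loads(serialized)

print(repr_is_valid_json)
print(round_trip == sample)
```

So the dict itself is fine; only the way it is printed is not JSON.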

The fix I found was to use json.dumps on the resulting dict:

# good_test.py
import json
import html_to_json

html_string = """
<head>
    <title>Floyd Hightower's Projects</title>
    <meta charset="UTF-8">
    <meta name="description" content="Floyd Hightower&#39;s Projects">
    <meta name="keywords" content="projects,fhightower,Floyd,Hightower">
</head>
"""

output_json = html_to_json.convert(html_string)
print(json.dumps(output_json))

Output:

{"head": [{"title": [{"_value": "Floyd Hightower's Projects"}], "meta": [{"_attributes": {"charset": "UTF-8"}}, {"_attributes": {"name": "description", "content": "Floyd Hightower's Projects"}}, {"_attributes": {"name": "keywords", "content": "projects,fhightower,Floyd,Hightower"}}]}]}

Result of "python good_test.py | prettyjson":

{
    "head": [
        {
            "title": [
                {
                    "_value": "Floyd Hightower's Projects"
                }
            ],
            "meta": [
                {
                    "_attributes": {
                        "charset": "UTF-8"
                    }
                },
                {
                    "_attributes": {
                        "name": "description",
                        "content": "Floyd Hightower's Projects"
                    }
                },
                {
                    "_attributes": {
                        "name": "keywords",
                        "content": "projects,fhightower,Floyd,Hightower"
                    }
                }
            ]
        }
    ]
}
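
The same pretty-printed form can probably be produced without an external tool. A small sketch, assuming only the standard library `json` module and, again, a hand-written stand-in dict instead of real `convert` output:

```python
import json

# Hand-written stand-in for the dict returned by html_to_json.convert (assumption)
sample = {"head": [{"title": [{"_value": "Floyd Hightower's Projects"}]}]}

# indent=4 matches the 4-space layout shown above
pretty = json.dumps(sample, indent=4)
print(pretty)
```

Piping through `python -m json.tool` instead of `prettyjson` should give a similar result.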

If you agree, I can fix the result of the html_to_json.convert function and send a PR.

Thanks for reporting this! Makes sense - I'll happily accept this as a PR 😄