marcomontalbano/html-miner

Question: How to get p and h3 without surrounding div

Hi,

How do I get every h3 and the p that follows it as an object, when there is no surrounding div around each pair?

<div>
     <div>
          <h3>Text</h3>
          <p>Text</p>
          <h3>Text</h3>
          <p>Text</p>
          <h3>Text</h3>
          <p>Text</p>
     </div>
     <div>
          <h3>Text</h3>
          <p>Text</p>
          <h3>Text</h3>
          <p>Text</p>
          <h3>Text</h3>
          <p>Text</p>
     </div>
</div>

I was able to get both individually as arrays. Here is my unsuccessful attempt:

let json = htmlMiner(html, {
    myObject: {
        _each_: 'div div',
        _eachId_: function(arg) {
            return arg.$scope.data('h3').replace(/\s/g, '');
        },
        h3: 'h3',
        p: function(arg) {
            return arg.$('h3 + p').text().trim();
        }
    }
});
console.log(json);

Thank you very much for your help!

Or another example:

<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>

How do I combine both into one object? I am able to read colLeft and colRight as separate arrays, but I don't understand how to merge them. I know you are the smart one of us, so I prefer to ask you :)

Hi @christianfroehlichconsulting,
thanks a lot for your questions!

I just noticed that there's a bug in $scope that makes it very hard to get the results you want.

For the first comment you wrote, I think that a good result given this HTML:

<div>
  <div>
    <h3>Title 1</h3>
    <p>Text 1</p>
    <h3>Title 2</h3>
    <p>Text 2A</p>
    <p>Text 2B</p>
  </div>
  <div>
    <h3>Title 3</h3>
    <p>Text 3</p>
  </div>
</div>

could be a JSON like this:

[
  {
    "title": "Title 1",
    "paragraphs": [ "Text 1" ]
  },
  {
    "title": "Title 2",
    "paragraphs": [ "Text 2A", "Text 2B" ]
  },
  {
    "title": "Title 3",
    "paragraphs": [ "Text 3" ]
  }
]

Give me a few days and I'll get back to you with a pre-release version and the selector to use for getting that result.

Hi @christianfroehlichconsulting,
I just released a new beta version v3.0.0-beta.0.
You can try it directly in your project with npm install html-miner@beta, or you can use the updated online playground available at https://marcomontalbano.github.io/html-miner/.

Going back to the starting question, given this HTML:

<div>
  <div>
    <h3>Title 1</h3>
    <p>Text 1</p>
    <h3>Title 2</h3>
    <p>Text 2A</p>
    <p>Text 2B</p>
  </div>
  <div>
    <h3>Title 3</h3>
    <p>Text 3</p>
  </div>
</div>

You can use a selector like this:

{
    _each_: 'div div h3',
    _eachId_: (arg) => arg.$scope.text().replace(/\s/, '-').toLowerCase(),
    title: (arg) => arg.$scope.text(),
    paragraphs: (arg) => arg.$scope.nextUntil('h3').toArray().map(t => arg.$(t).text())
}
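
A side note on the _eachId_ expression above: String.prototype.replace with a regex that has no "g" flag replaces only the first match, so a title with several spaces keeps the later ones. The sketch below is plain JavaScript and does not touch html-miner; the helper names firstSpaceOnly and slugify are made up for illustration.

```javascript
// Same expression as in the selector above: /\s/ has no "g" flag,
// so only the first whitespace character is replaced.
const firstSpaceOnly = (text) => text.replace(/\s/, '-').toLowerCase();

// Hypothetical variant: trim, then replace every run of whitespace.
const slugify = (text) => text.trim().replace(/\s+/g, '-').toLowerCase();

console.log(firstSpaceOnly('Title 1'));       // title-1
console.log(firstSpaceOnly('My Long Title')); // my-long title
console.log(slugify('My Long Title'));        // my-long-title
```

For the sample titles ("Title 1", "Title 2", "Title 3") both variants give the same ids; the difference only shows up with longer titles.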

For the second question, given this HTML:

<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>

You can use a selector like this:

{
    _each_: '.colLeft',
    title: (arg) => arg.$scope.text(),
    texts: (arg) => arg.$scope.nextUntil('.colLeft').toArray().map(t => arg.$(t).text())
}
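
Both selectors above rely on the same pattern: each "leader" element (an h3 or a .colLeft) becomes one entry, and nextUntil() collects the siblings that follow it up to the next leader. The sketch below reproduces that grouping on plain data, without html-miner or a DOM, just to make the resulting shape visible. The sample texts and the helper name groupAfterLeader are assumptions for illustration. Unlike nextUntil(), which only looks at siblings under the same parent, this sketch works on a single flat list.

```javascript
// Plain-data stand-in for a flat list of sibling elements (hypothetical texts).
const siblings = [
  { cls: 'colLeft', text: 'Title A' },
  { cls: 'colRight', text: 'Text A' },
  { cls: 'colLeft', text: 'Title B' },
  { cls: 'colRight', text: 'Text B1' },
  { cls: 'colRight', text: 'Text B2' },
];

// Start a new group at every leader and attach the items that follow it
// until the next leader shows up.
function groupAfterLeader(items, isLeader) {
  const groups = [];
  for (const item of items) {
    if (isLeader(item)) {
      groups.push({ title: item.text, texts: [] });
    } else if (groups.length > 0) {
      groups[groups.length - 1].texts.push(item.text);
    }
    // Items that appear before the first leader are ignored.
  }
  return groups;
}

const grouped = groupAfterLeader(siblings, (item) => item.cls === 'colLeft');
console.log(JSON.stringify(grouped, null, 2));
```

A row with two .colRight cells simply ends up with two entries in texts, which is why the result uses an array instead of a single string.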

I'll release v3.0.0 next weekend, which gives us some time to test it.

Let me know if you have any other questions.

Thank you very much! It's working as you promised! You are a hero!

Could you give me a hint on how to modify map() to put a space or some other separator between the paragraphs? Otherwise it's hard to tell later where to separate the words.

EDIT: It's not the map(), it's a <br> tag in the paragraph. The tag itself is gone from the output, of course, but without it the text is glued together. I guess there is no solution. Thank you very much! It's working perfectly!

I have one more question regarding nextUntil('h3'). Sometimes there are empty strings in the array. They come, for example, from empty <p></p> tags, or from <span></span> tags when there is no <p> tag. Is there an easy way to filter those out of the output? I tried arg.$(t).filter(v=>v!=='').text(), but I don't know enough to make it work.

You are welcome!

About your first question: there is a solution for that. You can use .html() instead of .text().

{
    _each_: 'div div h3',
    _eachId_: (arg) => arg.$scope.text().replace(/\s/, '-').toLowerCase(),
    title: (arg) => arg.$scope.html(),
    paragraphs: (arg) => arg.$scope.nextUntil('h3').toArray().map(t => arg.$(t).html())
}

About your second question, you can apply a .filter to the result of the .map:

(arg) => {
  const paragraphs = arg.$scope.nextUntil('h3').toArray();
  return paragraphs.map(p => arg.$(p).text().trim()).filter(p => p !== '')
}
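
The important detail is that the filter runs on the array of strings produced by .map(), not on the element selection. A minimal sketch of just that step, with hypothetical input strings standing in for what .text() might return for normal, empty and whitespace-only elements (the helper name cleanTexts is made up):

```javascript
// Hypothetical raw texts: two real paragraphs, one empty element,
// one element that contains only whitespace.
const rawTexts = ['  Text 2A ', '', '   ', 'Text 2B\n'];

// Trim first, then drop everything that ended up empty.
const cleanTexts = (texts) => texts.map((t) => t.trim()).filter((t) => t !== '');

console.log(cleanTexts(rawTexts)); // [ 'Text 2A', 'Text 2B' ]
```

Trimming before filtering matters: a whitespace-only string is not equal to '' until it has been trimmed.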

Let me know if this is working for you.

Thank you! With your help I understood many things. But I still don't know how to handle the gluing with .html() and replace():

<a><div><div>0938764323<br>0938764322</div></div></a>
<a><div><div>0987654321</div><div></div></div></a>

numbers: {
    _each_: '#numbers a',
    content1: 'div div',
    href: (arg) => arg.$scope.attr('href')
}

If I use this, it glues 0938764323 and 0938764322 together into 09387643230938764322.

I don't understand how to change "content1: 'div div'" into the HTML version. Here is my unsuccessful attempt:

numbers: {
            _each_: '#numbers a',
            content1: 'div div',
            content: (arg) => { return arg.scopeData.content1.html().filter(p => p !== '').map(e => e.replace(/(<([^>]+)>)/ig,' ').replace(/\s{2,}/g, ' ').trim()); },
            href: (arg) => arg.$scope.attr('href')
        }

I am trying to get the same result as "content1: 'div div'", but as HTML, and then replace all tags with spaces and remove double spaces. Please open my eyes.

scopeData already contains the final results, so in the case above .html() is not defined: scopeData.content1 is just a string.

Whenever you need to manipulate the result (that is, you want more than a plain .text()), use a function selector and work on arg.$scope to get the result you want.
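
To make the failure mode concrete, here is a small plain-JavaScript sketch. The scopeData object is a hand-written stand-in (an assumption for illustration, not produced by html-miner) in which content1 has already been resolved to text:

```javascript
// Hypothetical stand-in: content1 is already a plain string at this point.
const scopeData = { content1: '09387643230938764322' };

console.log(typeof scopeData.content1);      // string
console.log(typeof scopeData.content1.html); // undefined

// Calling the missing method throws instead of returning markup.
let outcome = 'no error';
try {
  scopeData.content1.html();
} catch (error) {
  outcome = error instanceof TypeError ? 'TypeError' : 'other error';
}
console.log(outcome); // TypeError
```

Because the string no longer carries any markup, the <br> cannot be recovered from it afterwards; the HTML has to be read from arg.$scope before it is flattened to text.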

Given this HTML

<a><div><div>0938764323     <br><br><br>    0938764322<br> 0938764322</div></div></a>
<a><div><div>0987654321</div><div></div></div></a>

You can use this selector

{
    _each_: 'a',
    content: arg => {
      const html = arg.$scope.find('div div').html();
      return html.replace(/<br>/g, ' ').replace(/\s+/g, ' ');
    }
}

For each a, we take the inner HTML of the first div div element, replace every <br> with a space, and then collapse each run of whitespace into a single space.
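
The string handling can be tried on its own, without html-miner. The sketch below is a slightly more defensive variant and an assumption on my side, not part of the library: it also accepts <br/>, <br /> and upper-case tags, and trims the result. The helper name brToSpaces is made up.

```javascript
// Replace every <br> variant with a space, collapse whitespace runs, trim the ends.
function brToSpaces(html) {
  return html
    .replace(/<br\s*\/?>/gi, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

const sample = '0938764323     <br><br><br>    0938764322<br/> 0938764322';
console.log(brToSpaces(sample)); // 0938764323 0938764322 0938764322
```

Inside the selector, the same function could be applied to the string returned by .html() in place of the two chained replace() calls.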

Yuppy! You are the best! Thank you very much!

Hi, I just released v3.0.0 🎉
Thanks for your help in improving this library.