Question: How to get p and h3 without surrounding div

Closed this issue 5 years ago · 9 comments

christianfroehlichconsulting commented 5 years ago

Hi,

How do I get each h3 and the p that follows it as an object, when there is no surrounding div around each pair:

<div>
     <div>
          <h3>Text</h3>
          <p>Text</p>
          <h3>Text</h3>
          <p>Text</p>
          <h3>Text</h3>
          <p>Text</p>
     </div>
     <div>
          <h3>Text</h3>
          <p>Text</p>
          <h3>Text</h3>
          <p>Text</p>
          <h3>Text</h3>
          <p>Text</p>
     </div>
</div>

I was able to get both individually as arrays. Here is my unsuccessful attempt:

let json = htmlMiner(html, {
    myObject: {
        _each_: 'div div',
        _eachId_: function(arg) {
            return arg.$scope.data('h3').replace(/\s/g, '');
        },
        h3: 'h3',
        p: function(arg) {
            return arg.$('h3 + p').text().trim();
        }
    }
});
console.log( json );

Thank you very much for your help!

Answer 1 · 2020-04-02T13:48:01.000Z

Or another example:

<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>

How do I combine both into one object? I am able to read colLeft and colRight as separate arrays, but I don't understand how to merge them. I know you are the smart one of us, so I prefer to ask you :)

Answer 2 · 2020-04-02T22:02:06.000Z

Hi @christianfroehlichconsulting,
thanks a lot for your questions!

I just noticed that there's a bug in $scope that makes it very hard to get the results you want.

For the first comment you wrote, I think that a good result given this HTML:

<div>
  <div>
    <h3>Title 1</h3>
    <p>Text 1</p>
    <h3>Title 2</h3>
    <p>Text 2A</p>
    <p>Text 2B</p>
  </div>
  <div>
    <h3>Title 3</h3>
    <p>Text 3</p>
  </div>
</div>

could be a JSON like this:

[
  {
    "title": "Title 1",
    "paragraphs": [ "Text 1" ]
  },
  {
    "title": "Title 2",
    "paragraphs": [ "Text 2A", "Text 2B" ]
  },
  {
    "title": "Title 3",
    "paragraphs": [ "Text 3" ]
  }
]

Just give me a few days and I'll get back to you with a pre-release version and the selector to use for getting that result.

Answer 3 · 2020-04-05T09:52:22.000Z

Hi @christianfroehlichconsulting,
I just released a new beta version v3.0.0-beta.0.
You can try directly in your project with npm install html-miner@beta or you can use the updated online playground available at https://marcomontalbano.github.io/html-miner/.

Going back to the starting question, given this HTML:

<div>
  <div>
    <h3>Title 1</h3>
    <p>Text 1</p>
    <h3>Title 2</h3>
    <p>Text 2A</p>
    <p>Text 2B</p>
  </div>
  <div>
    <h3>Title 3</h3>
    <p>Text 3</p>
  </div>
</div>

You can use a selector like this:

{
    _each_: 'div div h3',
    _eachId_: (arg) => arg.$scope.text().replace(/\s/, '-').toLowerCase(),
    title: (arg) => arg.$scope.text(),
    paragraphs: (arg) => arg.$scope.nextUntil('h3').toArray().map(t => arg.$(t).text())
}
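To make the idea behind `_each_` on the headings plus `nextUntil('h3')` easier to follow, here is a minimal sketch in plain JavaScript that does not use html-miner or cheerio at all. It assumes the children of one inner div are already available as a flat array of `{ tag, text }` objects; that shape and the `groupByHeading` helper are hypothetical and exist only for illustration.

```javascript
// Hypothetical flat representation of the children of one inner <div>.
const siblings = [
  { tag: 'h3', text: 'Title 1' },
  { tag: 'p', text: 'Text 1' },
  { tag: 'h3', text: 'Title 2' },
  { tag: 'p', text: 'Text 2A' },
  { tag: 'p', text: 'Text 2B' },
];

// Start a new group at every heading and collect the siblings that
// follow it until the next heading (similar in spirit to nextUntil('h3')).
function groupByHeading(nodes, headingTag = 'h3') {
  const groups = [];
  for (const node of nodes) {
    if (node.tag === headingTag) {
      groups.push({ title: node.text, paragraphs: [] });
    } else if (groups.length > 0) {
      groups[groups.length - 1].paragraphs.push(node.text);
    }
    // Nodes that appear before the first heading are ignored.
  }
  return groups;
}

console.log(JSON.stringify(groupByHeading(siblings), null, 2));
```

The point of the sketch is only that each heading acts as the start of its own record, which is why no wrapper div around each h3/p pair is needed.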

For the second question, given this HTML:

<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>

You can use a selector like this:

{
    _each_: '.colLeft',
    title: (arg) => arg.$scope.text(),
    texts: (arg) => arg.$scope.nextUntil('.colLeft').toArray().map(t => arg.$(t).text())
}
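One way to see why this selector starts from each `.colLeft` and walks forward, instead of merging two separately collected arrays, is the following plain JavaScript sketch. The arrays and the `{ cls, text }` cell shape are hypothetical stand-ins for the HTML above (with distinct labels so the pairing is visible); no html-miner API is used.

```javascript
// Hypothetical data: the third title has two texts, as in the HTML above.
const titles = ['Title 1', 'Title 2', 'Title 3', 'Title 4'];
const texts = ['Text 1', 'Text 2', 'Text 3a', 'Text 3b', 'Text 4'];

// Naive pairing by index: misaligned as soon as one title has two texts.
const zipped = titles.map((title, i) => ({ title, text: texts[i] }));
console.log(zipped[3]); // 'Title 4' ends up paired with 'Text 3b'

// Walking the cells in document order keeps every text with its title.
const cells = [
  { cls: 'colLeft', text: 'Title 1' },
  { cls: 'colRight', text: 'Text 1' },
  { cls: 'colLeft', text: 'Title 2' },
  { cls: 'colRight', text: 'Text 2' },
  { cls: 'colLeft', text: 'Title 3' },
  { cls: 'colRight', text: 'Text 3a' },
  { cls: 'colRight', text: 'Text 3b' },
  { cls: 'colLeft', text: 'Title 4' },
  { cls: 'colRight', text: 'Text 4' },
];

const rows = cells.reduce((acc, cell) => {
  if (cell.cls === 'colLeft') {
    acc.push({ title: cell.text, texts: [] });
  } else if (acc.length > 0) {
    acc[acc.length - 1].texts.push(cell.text);
  }
  return acc;
}, []);

console.log(rows[2]); // 'Title 3' keeps both of its texts
```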

I'll release the v3.0.0 next weekend, so we can test it a bit.

Let me know if you have any other doubts.

Answer 4 · 2020-04-06T13:13:28.000Z

Thank you very much! It's working as you promised! You are a hero!

Could you give me a hint on how to modify map() to put a space or some other separator between the paragraphs? Later it is hard to tell where to separate the words.

EDIT: It's not the map(), it's a <br> tag in the paragraph. Of course it is no longer in the output, but it glues the text together. I guess there is no solution. Thank you very much! It's working perfectly!

I have one more question regarding nextUntil('h3'). Sometimes there are empty strings in the array. They are caused, e.g., by empty -Tags, but also, if there is no -Tag, by -Tags. Is there an easy way to filter the output to remove them? I tried arg.$(t).filter(v=>v!=='').text(), but I don't know enough to make it work.

Answer 5 · 2020-04-07T06:59:11.000Z

You are welcome!

About your first question, there is a solution for that. You can get the .html() instead of .text().

{
    _each_: 'div div h3',
    _eachId_: (arg) => arg.$scope.text().replace(/\s/, '-').toLowerCase(),
    title: (arg) => arg.$scope.html(),
    paragraphs: (arg) => arg.$scope.nextUntil('h3').toArray().map(t => arg.$(t).html())
}

About your second question, you can apply a .filter to the .map result:

(arg) => {
  const paragraphs = arg.$scope.nextUntil('h3').toArray();
  return paragraphs.map(p => arg.$(p).text().trim()).filter(p => p !== '')
}
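The same map, trim, and filter chain can be tried on plain strings, independent of html-miner. The `rawTexts` array below is hypothetical sample data standing in for what `.text()` might return for each sibling. As far as I can tell, cheerio's own `.filter()` works on the selected elements rather than on their text, which may be why `arg.$(t).filter(v=>v!=='').text()` did not behave as hoped; filtering the array of strings avoids that.

```javascript
// Hypothetical raw texts, including empty and whitespace-only entries.
const rawTexts = ['  Text 2A ', '', '\n   ', 'Text 2B'];

const cleaned = rawTexts
  .map((t) => t.trim())       // strip leading/trailing whitespace first
  .filter((t) => t !== '');   // then drop everything that ended up empty

console.log(cleaned); // [ 'Text 2A', 'Text 2B' ]
```

Trimming before filtering matters: a whitespace-only string is not equal to `''` until it has been trimmed.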

Let me know if this is working for you.

Answer 6 · 2020-04-08T11:49:59.000Z

Thank you! With your help I understood many things. But I still don't know how to handle the gluing with .html() and replace():

<a><div><div>0938764323<br>0938764322</div></div></a>
<a><div><div>0987654321</div><div></div></div></a>

numbers: {
            _each_: '#numbers a',
            content1: 'div div',
            href: (arg) => arg.$scope.attr('href')
        }

If I use this, it glues 0938764323 and 0938764322 together into 09387643230938764322.

I don't understand how to change "content1: 'div div'" to the HTML version. Here is my unsuccessful attempt:

numbers: {
            _each_: '#numbers a',
            content1: 'div div',
            content: (arg) => { return arg.scopeData.content1.html().filter(p => p !== '').map(e => e.replace(/(<([^>]+)>)/ig,' ').replace(/\s{2,}/g, ' ').trim()); },
            href: (arg) => arg.$scope.attr('href')
        }

I am trying to get the same result as "content1: 'div div'", but as HTML, with every tag replaced by a space and repeated spaces removed. Please open my eyes.

Answer 7 · 2020-04-08T15:30:58.000Z

When you use scopeData, it already contains the final result, so in the above case the .html() is not defined because scopeData.content1 is just a string.

Every time you need to manipulate the result (so you don't want just a .text()) you should use a function selector and probably manipulate the arg.$scope to get your desired result.

Given this HTML

<a><div><div>0938764323     <br><br><br>    0938764322<br> 0938764322</div></div></a>
<a><div><div>0987654321</div><div></div></div></a>

You can use this selector

{
    _each_: 'a',
    content: arg => {
      const html = arg.$scope.find('div div').html();
      return html.replace(/<br>/g, ' ').replace(/\s+/g, ' ');
    }
}

For each a, we get the div div content as HTML, then replace every <br> with a space and collapse all repeated whitespace into a single space.
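The string handling in that selector can be checked on its own with plain JavaScript. The helper names `normalizeBreaks` and `stripTags` are hypothetical and not part of html-miner; the second one is a rough variant of the tag-stripping regex from the earlier attempt and is only a sketch, not a general HTML sanitizer.

```javascript
// Replace <br>, <br/> or <br /> with a space, then collapse whitespace.
function normalizeBreaks(html) {
  return html
    .replace(/<br\s*\/?>/gi, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Rough variant: replace any tag with a space, then collapse whitespace.
function stripTags(html) {
  return html
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

const sample = '0938764323     <br><br><br>    0938764322<br> 0938764322';
console.log(normalizeBreaks(sample)); // "0938764323 0938764322 0938764322"
console.log(stripTags('<b>a</b><br>b')); // "a b"
```

Unlike the selector above, both helpers also call `.trim()`, so leading and trailing spaces left behind by a tag at the edge of the string are removed.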

Answer 8 · 2020-04-08T16:29:12.000Z

Yuppy! You are the best! Thank you very much!

Answer 9 · 2020-04-14T09:02:26.000Z

Hi, I just released v3.0.0 🎉
Thanks for your help in improving this library.