Question: How to get p and h3 without surrounding div
Closed this issue · 9 comments
Hi,
how do I get each h3 and the p that follows it as one object when there is no div wrapping each pair? The HTML looks like this:
<div>
<div>
<h3>Text</h3>
<p>Text</p>
<h3>Text</h3>
<p>Text</p>
<h3>Text</h3>
<p>Text</p>
</div>
<div>
<h3>Text</h3>
<p>Text</p>
<h3>Text</h3>
<p>Text</p>
<h3>Text</h3>
<p>Text</p>
</div>
</div>
I was able to get both individually as arrays. Here is my unsuccessful attempt:
let json = htmlMiner(html, {
    myObject: {
        _each_: 'div div',
        _eachId_: function(arg) {
            return arg.$scope.data('h3').replace(/\s/g, '');
        },
        h3: 'h3',
        p: function(arg) {
            return arg.$('h3 + p').text().trim();
        }
    }
});
console.log( json );
Thank you very much for your help!
Or another example:
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
How do I combine both into one object? I am able to read colLeft and colRight as separate arrays, but I don't understand how to merge them. I know you are the smart one of us, so I prefer to ask you :)
Hi @christianfroehlichconsulting,
thanks a lot for your questions!
I just noticed that there's a bug in $scope that makes it very hard to get the results you want.
For the first comment you wrote, I think that a good result given this HTML:
<div>
<div>
<h3>Title 1</h3>
<p>Text 1</p>
<h3>Title 2</h3>
<p>Text 2A</p>
<p>Text 2B</p>
</div>
<div>
<h3>Title 3</h3>
<p>Text 3</p>
</div>
</div>
could be a JSON like this:
[
{
"title": "Title 1",
"paragraphs": [ "Text 1" ]
},
{
"title": "Title 2",
"paragraphs": [ "Text 2A", "Text 2B" ]
},
{
"title": "Title 3",
"paragraphs": [ "Text 3" ]
}
]
Just give me a few days and I'll get back to you with a pre-release version and the selector to use for getting that result.
Hi @christianfroehlichconsulting,
I just released a new beta version, v3.0.0-beta.0.
You can try it directly in your project with npm install html-miner@beta,
or you can use the updated online playground available at https://marcomontalbano.github.io/html-miner/.
Going back to the starting question, given this HTML:
<div>
<div>
<h3>Title 1</h3>
<p>Text 1</p>
<h3>Title 2</h3>
<p>Text 2A</p>
<p>Text 2B</p>
</div>
<div>
<h3>Title 3</h3>
<p>Text 3</p>
</div>
</div>
You can use a selector like this:
{
    _each_: 'div div h3',
    _eachId_: (arg) => arg.$scope.text().replace(/\s+/g, '-').toLowerCase(),
    title: (arg) => arg.$scope.text(),
    paragraphs: (arg) => arg.$scope.nextUntil('h3').toArray().map(t => arg.$(t).text())
}
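To see what the nextUntil('h3') grouping computes without running html-miner, here is a minimal plain JavaScript sketch. It assumes the children of each inner div are available as a flat list of { tag, text } records; that representation and the helper name groupByHeading are hypothetical and not part of html-miner.

```javascript
// Hypothetical flat view of the children of each inner <div> from the HTML above.
const containers = [
  [
    { tag: 'h3', text: 'Title 1' },
    { tag: 'p', text: 'Text 1' },
    { tag: 'h3', text: 'Title 2' },
    { tag: 'p', text: 'Text 2A' },
    { tag: 'p', text: 'Text 2B' },
  ],
  [
    { tag: 'h3', text: 'Title 3' },
    { tag: 'p', text: 'Text 3' },
  ],
];

// For every heading, collect the siblings that follow it,
// stopping just before the next heading (the idea behind nextUntil).
function groupByHeading(siblings, headingTag = 'h3') {
  const groups = [];
  for (const node of siblings) {
    if (node.tag === headingTag) {
      groups.push({ title: node.text, paragraphs: [] });
    } else if (groups.length > 0) {
      groups[groups.length - 1].paragraphs.push(node.text);
    }
    // Nodes that appear before the first heading are ignored.
  }
  return groups;
}

// Siblings are grouped per container, because a heading only "sees"
// the elements that share its parent.
const result = containers.flatMap((siblings) => groupByHeading(siblings));
console.log(JSON.stringify(result, null, 2));
```

The sketch produces the array shape proposed earlier in the thread: one entry per h3, with "Title 2" carrying both of its paragraphs.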
For the second question, given this HTML:
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
<div class="colLeft">Title</div><div class="colRight">Text for this Title</div>
You can use a selector like this:
{
    _each_: '.colLeft',
    title: (arg) => arg.$scope.text(),
    texts: (arg) => arg.$scope.nextUntil('.colLeft').toArray().map(t => arg.$(t).text())
}
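As a side note on why the two separately extracted arrays are hard to merge afterwards: the third row has two colRight cells, so pairing by index drifts from that point on. A small sketch with hypothetical labels (the real cells all read "Title" and "Text for this Title", which would hide the problem):

```javascript
// Hypothetical labels standing in for the six colLeft and seven colRight cells.
const lefts = ['T1', 'T2', 'T3', 'T4', 'T5', 'T6'];
const rights = ['R1', 'R2', 'R3a', 'R3b', 'R4', 'R5', 'R6'];

// Naive merge: pair the i-th title with the i-th text.
const zipped = lefts.map((title, i) => ({ title, text: rights[i] }));

// From the fourth row on, every title is paired with the wrong text.
console.log(zipped[3]); // { title: 'T4', text: 'R3b' }
```

Starting from each .colLeft and walking forward to the next one keeps every text attached to its own title, however many colRight cells a row has.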
I'll release v3.0.0 next weekend, so we have time to test it a bit.
Let me know if you have any other doubts.
Thank you very much! It's working as you promised! You are a hero!
Could you give me a hint on how to modify map() to put a space or something between the paragraphs? It's hard to tell later where to separate the words.
EDIT: It's not the map(), it's a <br> tag inside the paragraph. It's gone from the output, of course, but it glues the text together. I guess there is no solution. Thank you very much! It's working perfectly!
I have one more question regarding nextUntil('h3'). Sometimes there are empty strings in the array. They are caused, for example, by empty <p></p> tags, but also, when there is no <p> tag, by <span></span> tags. Is there an easy way to filter the output to delete those? I tried arg.$(t).filter(v=>v!=='').text(), but I don't know enough to make it work.
You are welcome!
About your first question, there is a solution for that. You can get the .html() instead of the .text().
{
    _each_: 'div div h3',
    _eachId_: (arg) => arg.$scope.text().replace(/\s+/g, '-').toLowerCase(),
    title: (arg) => arg.$scope.html(),
    paragraphs: (arg) => arg.$scope.nextUntil('h3').toArray().map(t => arg.$(t).html())
}
About your second question, you can apply a .filter to the result of the .map:
(arg) => {
    const paragraphs = arg.$scope.nextUntil('h3').toArray();
    return paragraphs.map(p => arg.$(p).text().trim()).filter(p => p !== '');
}
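The order matters here: the filter has to run on the extracted strings, not on the wrapped element as in the earlier attempt. A minimal sketch on a plain array, where the sample values in rawTexts are made up to stand for the text of an empty <p>, a whitespace-only <span>, and two real paragraphs:

```javascript
// Made-up sample of what .text() might return for each following sibling.
const rawTexts = ['Text 2A', '', '   ', ' Text 2B\n'];

// Trim first so whitespace-only entries become '', then drop the empty ones.
const paragraphs = rawTexts
  .map((t) => t.trim())
  .filter((t) => t !== '');

console.log(paragraphs); // [ 'Text 2A', 'Text 2B' ]
```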
Let me know if this is working for you.
Thank you! With your help I understood many things. But I still don't know how to handle the gluing with .html() and replace():
<a><div><div>0938764323<br>0938764322</div></div></a>
<a><div><div>0987654321</div><div></div></div></a>
numbers: {
    _each_: '#numbers a',
    content1: 'div div',
    href: (arg) => arg.$scope.attr('href')
}
If I use this, it glues 0938764323 and 0938764322 together into 09387643230938764322.
I don't understand how to change "content1: 'div div'" to the HTML version. Here is my unsuccessful attempt:
numbers: {
    _each_: '#numbers a',
    content1: 'div div',
    content: (arg) => { return arg.scopeData.content1.html().filter(p => p !== '').map(e => e.replace(/(<([^>]+)>)/ig,' ').replace(/\s{2,}/g, ' ').trim()); },
    href: (arg) => arg.$scope.attr('href')
}
I am trying to get the same result as "content1: 'div div'", but as HTML, with all tags replaced by a space and double spaces removed. Please open my eyes.
When you use scopeData, it already contains the final result, so in the case above .html() is not defined because scopeData.content1 is just a string.
Every time you need to manipulate the result (that is, you want more than a plain .text()), you should use a function selector and work on arg.$scope to get your desired result.
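A quick way to see the failure in isolation, assuming scopeData.content1 holds the already-extracted text as described above (the object literal below is a stand-in written for illustration, not something html-miner produced):

```javascript
// Stand-in for the scopeData the function selector receives: plain text only.
const scopeData = { content1: '09387643230938764322' };

console.log(typeof scopeData.content1);      // string
console.log(typeof scopeData.content1.html); // undefined

// Calling .html() on a string therefore throws a TypeError.
let error = null;
try {
  scopeData.content1.html();
} catch (e) {
  error = e;
}
console.log(error instanceof TypeError); // true
```

By that point the markup is gone, so any tag-based cleanup has to happen earlier, while the element is still reachable.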
Given this HTML
<a><div><div>0938764323 <br><br><br> 0938764322<br> 0938764322</div></div></a>
<a><div><div>0987654321</div><div></div></div></a>
You can use this selector
{
    _each_: 'a',
    content: arg => {
        const html = arg.$scope.find('div div').html();
        return html.replace(/<br>/g, ' ').replace(/\s+/g, ' ');
    }
}
For each a, we take the HTML of the first matching div div element and then replace every <br> and every run of whitespace with a single space.
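The same cleanup can be tried on a plain string without html-miner. A minimal sketch, assuming the inner HTML has already been read with .html(); the helper name brToSpaces is hypothetical, and its regex is slightly broader than the one in the selector so that it also covers <br/> and <BR>:

```javascript
// Replace each <br> with a space, collapse whitespace runs, and trim the ends.
function brToSpaces(html) {
  return html
    .replace(/<br\s*\/?>/gi, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Inner HTML of the first <a> from the example above.
const inner = '0938764323 <br><br><br> 0938764322<br> 0938764322';
console.log(brToSpaces(inner)); // 0938764323 0938764322 0938764322
```

Replacing <br> with a space before collapsing whitespace is what keeps adjacent numbers from being glued together.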
Yuppy! You are the best! Thank you very much!
Hi, I just released v3.0.0 🎉
Thanks for your help in improving this library.