Use jQuery DOM manipulation to glean data from HTML

Recently I had a request to help scrape data from a website. These tasks can be tricky and take quite a bit of time, and if the developer who wrote the code generating the HTML decided to go off script, it can get downright ugly. In this article, I'm going to show you one method to clean up the HTML text nodes to make it easier to parse data. I'm going to use jQuery, but you can certainly also do this with vanilla JavaScript.

Note: This is representative of the HTML I was working with but is not the actual data or fields. 

Problem Statement

I needed to pull out the business name, support phone numbers and contact email addresses. The data was jumbled together and, as you can see, there are multiple text nodes separated by break elements, and the overall HTML structure was inconsistent.

		<div>
			<div>
				<img src="image.png" data-userid="123456789">
			</div>
			<div id="business">
				Business name<br>
				<div>
					Suite 1<br>Anytown USA 77777
					<br>
					Business <a href="tel:+18885551111" class="ui-link">(888) 555-1111</a><br>Fax <a href="tel:+18885552222" class="ui-link">(888) 555-2222</a><br>Support <a href="tel:+18885553333" class="ui-link">(888) 555-3333</a><br>
					<a href="MAILTO:sales@company.com" class="ui-link">sales@company.com</a>
				</div>
			</div>
		</div>

I wanted the business name from the first div element, but it was in a text node alongside several other elements and there was no easy way to select it on its own. Selecting the business div with jQuery and calling .text() returns the name along with the text of all the child nodes.

In the output below, the br elements contribute nothing to the text (the line feeds come from the whitespace in the source), so "Suite 1" runs straight into "Anytown USA 77777" and the phone numbers stack up against each other with no separation. I'd be relegated to text string parsing - definitely not a good start or the direction I want to be headed.

            let businessName = $('#business').text()
            console.log(businessName)


				Business name
				
					Suite 1Anytown USA 77777
					
					Business (888) 555-1111Fax (888) 555-2222Support (888) 555-3333
					sales@company.com
				
			
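
To make the alternative concrete, here is a rough sketch of what string parsing of that flattened phone line might look like. The parsePhones helper and its regular expression are my own illustration, not part of the original scraper, and they assume every number follows the (888) 555-1111 format shown above.

```javascript
// The flattened phone line as produced by .text() on the sample markup.
const flattened = 'Business (888) 555-1111Fax (888) 555-2222Support (888) 555-3333';

// Hypothetical helper: capture a one-word label followed by a US-style number.
function parsePhones(text) {
  const pattern = /([A-Za-z]+)\s*(\(\d{3}\) \d{3}-\d{4})/g;
  const phones = {};
  for (const match of text.matchAll(pattern)) {
    phones[match[1]] = match[2]; // label -> number
  }
  return phones;
}

console.log(parsePhones(flattened));
```

It copes with this one line, but a label containing a space or a number in a different format would break the pattern, which is why fixing the structure first is attractive.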

DOM Manipulation

Let's use the power of jQuery to help solve this data and HTML problem. Given a jQuery object that represents a set of DOM elements, the .contents() method returns a new jQuery object holding the immediate children of those elements. Unlike .children(), it includes text and comment nodes as well as elements, which is exactly what we need.
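
For readers following along in vanilla JavaScript, the closest counterpart to .contents() is the childNodes property. The sketch below uses plain objects as stand-ins for DOM nodes so it can run outside a browser. directTextNodes is a hypothetical helper name, and the same function should work on a real element because it only reads childNodes, nodeType and nodeValue.

```javascript
// Node type values defined by the DOM standard.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: return the text of each direct text-node child.
function directTextNodes(parent) {
  return Array.from(parent.childNodes)
    .filter((node) => node.nodeType === TEXT_NODE)
    .map((node) => node.nodeValue);
}

// Plain-object stand-in for: <div id="business">Business name<br><div>Suite 1</div></div>
const business = {
  nodeType: ELEMENT_NODE,
  childNodes: [
    { nodeType: TEXT_NODE, nodeValue: 'Business name' },
    { nodeType: ELEMENT_NODE, nodeName: 'BR', childNodes: [] },
    {
      nodeType: ELEMENT_NODE,
      nodeName: 'DIV',
      childNodes: [{ nodeType: TEXT_NODE, nodeValue: 'Suite 1' }],
    },
  ],
};

// Only the direct child is returned; the nested "Suite 1" is one level too deep.
console.log(directTextNodes(business)); // [ 'Business name' ]
```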

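            // dataBlock is assumed to be a jQuery object wrapping the scraped markup
            let nameBlock = dataBlock.find('#business')

            nameBlock.find('div').addBack()  // .contents() is one level deep, so include the nested div
                .contents()                  // gets the child elements and text nodes
                .filter(function() {
                    return this.nodeType === 3   // filter to just the text nodes, which are type 3
                })
                .wrap('<p></p>')             // wrap each text node in a paragraph element
                .end()                       // pop back to the full .contents() set
                .filter('br')                // filter to just the br elements
                .remove()                    // ...and remove them from the DOM
                .end()                       // pop back to the .contents() set again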

Let's walk through what just happened with this code.

  1. We found the top-level div element with all the business information in it.
  2. We used the contents() method to return all the text nodes and child elements.
  3. We used filter() to narrow the set to just the text nodes and then wrapped each one in a p element.
  4. We used end() to get back to the full contents() set. Most of jQuery's DOM traversal methods operate on a jQuery object instance and produce a new one, matching a different set of DOM elements. When this happens, it is as if the new set of elements is pushed onto a stack that is maintained inside the object. Each successive filtering method pushes a new element set onto the stack. If we need an older element set, we can use end() to pop the sets back off of the stack.
  5. We used filter() again to narrow the set to just the br elements and then removed them from the DOM.
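
The push and pop behaviour described in step 4 can be sketched with a toy class. This is only a model for illustration, not jQuery's real implementation: NodeSet, items and previous are names I made up, and the "nodes" are just strings.

```javascript
// Toy model of a chainable set; NOT how jQuery is actually written.
class NodeSet {
  constructor(items, previous = null) {
    this.items = items;
    this.previous = previous; // the set this one was derived from
  }

  // Produce a new, narrower set and remember where it came from (the "push").
  filter(predicate) {
    return new NodeSet(this.items.filter(predicate), this);
  }

  // Step back to the set we were derived from (the "pop").
  end() {
    return this.previous || new NodeSet([]);
  }
}

const contents = new NodeSet(['#text', 'br', 'div', '#text']);
const textNodes = contents.filter((name) => name === '#text');
const breaks = textNodes.end().filter((name) => name === 'br');

console.log(textNodes.items.length);       // 2
console.log(textNodes.end() === contents); // true
console.log(breaks.items);                 // [ 'br' ]
```

Because end() hands back the earlier set rather than a copy, the second filter() starts from the full contents again, which is what lets one chain handle both the text nodes and the br elements.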

Results

The resulting HTML looks like this and is much easier to look at and parse! 

<div>
	<div>
		<img src="image.png" data-userid="123456789">
	</div>
	<div id="business">
		<p>Business name</p>
		<p>	</p>
		<div>
			<p>Suite 1</p>
			<p>Anytown USA 77777</p>
			<p>Business </p>
			<a href="tel:+18885551111" class="ui-link">(888) 555-1111</a>
			<p>Fax </p>
			<a href="tel:+18885552222" class="ui-link">(888) 555-2222</a>
			<p>Support </p>
			<a href="tel:+18885553333" class="ui-link">(888) 555-3333</a>
			<p>	</p>
			<a href="MAILTO:sales@company.com" class="ui-link">sales@company.com</a>
			<p>	</p>
		</div>
		<p> </p>
	</div>
</div>
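
The empty p elements in the result come from whitespace-only text nodes (the indentation and line breaks in the source). One possible refinement, sketched here with a hypothetical isMeaningfulText predicate, is to test for non-whitespace content before wrapping. In the jQuery chain this check would sit inside the filter() callback, with this in place of node. I have not run that variation against the original site, so treat it as an assumption.

```javascript
const TEXT_NODE = 3; // nodeType value for text nodes in the DOM standard

// Hypothetical predicate: a text node that contains at least one
// non-whitespace character.
function isMeaningfulText(node) {
  return node.nodeType === TEXT_NODE && /\S/.test(node.nodeValue);
}

// Plain-object stand-ins for DOM nodes so the sketch runs outside a browser.
const samples = [
  { nodeType: TEXT_NODE, nodeValue: '\n\t\t\t\tBusiness name' }, // real content
  { nodeType: TEXT_NODE, nodeValue: '\n\t\t\t\t' },              // indentation only
  { nodeType: 1, nodeValue: null },                              // an element
];

console.log(samples.map(isMeaningfulText)); // [ true, false, false ]
```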

The final JavaScript looks like this - much cleaner than string parsing!

            let name = nameBlock.find('p').first().text().trim()
            let contactBlock = nameBlock.find('div').first()
            let line1 = contactBlock.find('p:nth-child(1)').text().trim()
            let city = contactBlock.find('p:nth-child(2)').text().trim()

            // each label paragraph is immediately followed by its phone link
            let businessPhone = contactBlock.find('p:contains("Business")').next().text().trim()
            let faxPhone = contactBlock.find('p:contains("Fax")').next().text().trim()
            let supportPhone = contactBlock.find('p:contains("Support")').next().text().trim()
            // the attribute selector is case-sensitive, so match both spellings of the scheme
            let email = contactBlock.find('a[href^="mailto"], a[href^="MAILTO"]').text().trim()
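
The link text is formatted for people, while the href attributes already hold machine-friendly values. As a possible refinement, the sketch below reads the value out of a tel: or mailto: href without caring about the case of the scheme. valueFromHref is a hypothetical helper, not something from the original scraper; in the jQuery code its input would come from .attr('href').

```javascript
// Hypothetical helper: return the part of an href after "scheme:",
// or null when the href does not use that scheme.
function valueFromHref(href, scheme) {
  const prefix = scheme.toLowerCase() + ':';
  if (href.slice(0, prefix.length).toLowerCase() !== prefix) {
    return null;
  }
  // Drop any "?subject=..." style query that a mailto link may carry.
  return href.slice(prefix.length).split('?')[0];
}

console.log(valueFromHref('tel:+18885553333', 'tel'));            // +18885553333
console.log(valueFromHref('MAILTO:sales@company.com', 'mailto')); // sales@company.com
console.log(valueFromHref('tel:+18885553333', 'mailto'));         // null
```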

Conclusion

In this article, I walked you through a real-world HTML parsing issue and presented a solution that uses jQuery DOM manipulation to get a much easier and cleaner way to glean data from HTML. There are certainly improvements which could be made to the code, but in this scenario it worked beautifully for what was required.

