

Zhohar

parsing HTML with perl

by Zhohar, 07-29-2010 at 04:36 PM
Maybe I'll detail my adventure here.

So I started out with regex. Each span has a unique ID, so regex -- why not? Slurp the whole file, match on the IDs, barf out the content.
Not my most brilliant moment, I've got to admit. The HTML is often formatted strangely, and this method is pretty inefficient for large files. I figured someone had already solved this (I knew it, in fact; I just figured regex would be faster and simpler), so on to the next attempt.
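For the curious, here's roughly what that first attempt looks like. This is only a minimal sketch: the sample page and the span IDs (playerName, score) are made up for illustration, and it assumes double-quoted id attributes and no nested spans, which is exactly the kind of assumption that strangely formatted HTML breaks.

```perl
use strict;
use warnings;

# Hypothetical sample page; the span IDs are invented for this example.
my $html = <<'HTML';
<table>
  <tr><td><span id="playerName">Zhohar</span></td></tr>
  <tr><td><span class="stat"
                id="score">  1337 </span></td></tr>
</table>
HTML

# Return the text inside <span ... id="$id" ...>...</span>.
# Returns nothing (undef in scalar context) when the ID is not found.
# Assumes a double-quoted id attribute and no <span> nested inside.
sub span_by_id {
    my ($doc, $id) = @_;
    return unless $doc =~ m{<span\b[^>]*\sid="\Q$id\E"[^>]*>(.*?)</span>}is;
    my $text = $1;
    $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    return $text;
}

print span_by_id($html, 'playerName'), "\n";   # prints "Zhohar"
print span_by_id($html, 'score'), "\n";        # prints "1337"
```

Every lookup rescans the slurped file from the top, which is where the inefficiency on large files comes from.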
Method #2: the HTML::Parser class from CPAN. Write a handler each for start tags, text, and end tags, then match on ID. It didn't seem elegant or efficient.
I bet there's some programmer looking at me now, laughing, because I'm just stumbling around in the dark.
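Here's a sketch of the handler approach, assuming HTML::Parser is installed from CPAN and using its version 3 API. The sample page and the IDs are again hypothetical. The start handler notices a wanted span, the text handler collects what's inside it, and the end handler switches collection off.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample page; the span IDs are invented for this example.
my $html = <<'HTML';
<table>
  <tr><td><span id="playerName">Zhohar</span></td></tr>
  <tr><td><span class="stat"
                id="score">  1337 </span></td></tr>
</table>
HTML

my %wanted = map { $_ => 1 } qw(playerName score);
my (%found, $current, $depth);

my $parser = HTML::Parser->new(
    api_version => 3,
    # Start tag: begin collecting when a wanted span opens.
    start_h => [ sub {
        my ($tag, $attr) = @_;
        if (defined $current) {
            $depth++ if $tag eq 'span';    # track spans nested inside
            return;
        }
        if ($tag eq 'span' && defined $attr->{id} && $wanted{ $attr->{id} }) {
            $current = $attr->{id};
            $depth   = 1;
            $found{$current} = '';
        }
    }, 'tagname, attr' ],
    # Text: append to the span currently being collected, if any.
    text_h => [ sub {
        my ($text) = @_;
        $found{$current} .= $text if defined $current;
    }, 'dtext' ],
    # End tag: stop collecting once the wanted span itself closes.
    end_h => [ sub {
        my ($tag) = @_;
        return unless defined($current) && $tag eq 'span';
        undef $current if --$depth == 0;
    }, 'tagname' ],
);

$parser->parse($html);
$parser->eof;

s/^\s+|\s+$//g for values %found;    # trim surrounding whitespace
print "$_ = $found{$_}\n" for sort keys %found;
```

It copes with odd formatting and makes a single pass over the file, but that's three callbacks and some shared state just to read two values.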

What I would ideally want are CSS-like selectors that operate on the HTML ... in Perl. So I could say ("#myUniqueID").content to get the content. jQuery is awesome in this respect: I can just ask for what I want.
I know there's a way to do this, and someone's already done it. I've just got to keep hunting around to find it. Maybe Perl isn't the right language for this.
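One candidate, as a hedged sketch: Mojo::DOM, which ships with the Mojolicious distribution on CPAN, takes CSS selectors. This assumes Mojolicious is installed; the sample page, the IDs, and the content_of helper are all made up for illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample page; the span IDs are invented for this example.
my $html = <<'HTML';
<table>
  <tr><td><span id="playerName">Zhohar</span></td></tr>
  <tr><td><span class="stat"
                id="score">  1337 </span></td></tr>
</table>
HTML

my $dom = Mojo::DOM->new($html);

# Hypothetical helper: trimmed text of the first element matching a
# CSS selector, or nothing if no element matches.
sub content_of {
    my ($doc, $selector) = @_;
    my $node = $doc->at($selector) or return;
    my $text = $node->all_text;
    $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    return $text;
}

print content_of($dom, '#playerName'), "\n";
print content_of($dom, 'span#score'), "\n";
```

That's about as close to ("#myUniqueID").content as I'd expect to get: parse once, then ask for what you want by selector.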

In other news, I think we should load up the new BC2 maps on Server 2. I'd want to play them now, not the old maps.


Comments

  1. xBadger's Avatar
    THATS AMERICAN
  2. Arithea's Avatar
    Parsing HTML is much like parsing XML. Look for a good XML parser in Perl and you might find what you're looking for. What ultimate ends are you trying to achieve with the selectors? Or is it just an exercise in brain function?
  3. Zhohar's Avatar
    I've already found "the" HTML parser and it seems endorsed by a lot of people. Problem is, it's awkward and ugly -- I'm hoping there's a simpler way to do it. All I really want is to extract a few values from unique-id'd spans. Ideally, I'd like to do something like this: body.table[2].row[5].content or ("#uniqueElementID").content to access the info I need.
  4. Arithea's Avatar
    Hmm. So you're trying to do what JavaScript does with the DOM, only with a Perl parser. In my experience, if something is inelegant but endorsed by quite a few people, there usually isn't a simple way to do it. Then again, if nobody attempts to find one, no one ever will. Good luck with that!
  5. DrProctor's Avatar
  6. DrProctor's Avatar
    Also, unless you are absolutely forced to use Perl, I know that the CSS selector (http://svn.symfony-project.com/branc...s/CssSelector/) and DomCrawler (http://svn.symfony-project.com/branc...ts/DomCrawler/) components of Symfony 2 will do precisely what you are asking for. An example of how to use those components can be found at http://docs.symfony-reloaded.org/gui...g/crawler.html


  
 
