Thursday 15 September 2011

parsing - How do you parse and process HTML/XML in PHP?




How can one parse HTML/XML and extract information from it?

This is a general reference question for the php tag.

Native XML Extensions

I prefer using one of the native XML extensions since they come bundled with PHP, are generally faster than 3rd party libs, and give me all the control I need over the markup.

DOM

The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.

DOM is capable of parsing and modifying real world (broken) HTML, and it can do XPath queries. It is based on libxml.
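As a minimal sketch of both points (parsing broken HTML and running an XPath query), assuming the DOM extension is enabled. The markup and variable names are made up for illustration:

```php
<?php
// Deliberately broken markup: unclosed <p> and <b>, no </body> or </html>.
$html = '<html><body><p>Intro <b>text<a href="/first">one</a>'
      . '<p><a href="/second" class="x">two</a>';

// Collect libxml's parse warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html); // uses libxml's HTML parser, which repairs the tree

libxml_clear_errors();
libxml_use_internal_errors($previous);

// XPath query: the href attribute of every <a> element, in document order.
$xpath = new DOMXPath($dom);
$hrefs = [];
foreach ($xpath->query('//a/@href') as $attribute) {
    $hrefs[] = $attribute->nodeValue;
}

// $hrefs is now ['/first', '/second']
print_r($hrefs);
```

Note that loadHTML(), not loadXML(), is what makes the broken markup acceptable here.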

It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API.

A basic usage example can be found in "Grabbing the href attribute of an A element" and a general conceptual overview can be found at "DOMDocument in php".

How to use the DOM extension has been covered extensively on Stack Overflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching/browsing Stack Overflow.

XMLReader

The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.
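The cursor behaviour can be sketched roughly like this, assuming the XMLReader extension is enabled. The sample document and the element name "title" are invented for the example:

```php
<?php
$xml = '<library><book id="1"><title>First</title></book>'
     . '<book id="2"><title>Second</title></book></library>';

$reader = new XMLReader();
$reader->XML($xml); // for large files, $reader->open($path) streams from disk

$titles = [];
// Each read() call moves the cursor forward to the next node in the stream.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'title') {
        // readString() returns the text content of the current element.
        $titles[] = $reader->readString();
    }
}
$reader->close();

// $titles is now ['First', 'Second']
print_r($titles);
```

Because only the current node is held in memory, this style suits documents that are too large to load as a whole tree.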

XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML parser module, so chances are that using XMLReader for parsing broken HTML might be less robust than using DOM, where you can explicitly tell it to use libxml's HTML parser module.

A basic usage example can be found at "getting all values from h1 tags using php".

XML Parser

This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.

The XML Parser library is also based on libxml, and implements a SAX style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but it will be more difficult to work with than the pull parser implemented by XMLReader.
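To illustrate the push model, here is a small sketch in which the parser calls back into handlers as it meets start tags, end tags and text. It assumes the xml extension is enabled; the sample document and the $inTitle flag are illustrative only:

```php
<?php
$xml = '<library><book><title>First</title></book>'
     . '<book><title>Second</title></book></library>';

$titles  = [];
$inTitle = false; // state we have to track ourselves between events

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them (the default).
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every start tag.
    function ($parser, $name, $attributes) use (&$inTitle, &$titles) {
        if ($name === 'title') {
            $inTitle  = true;
            $titles[] = '';
        }
    },
    // Called for every end tag.
    function ($parser, $name) use (&$inTitle) {
        if ($name === 'title') {
            $inTitle = false;
        }
    }
);

// Called for text; one text node may arrive in several chunks.
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$inTitle, &$titles) {
        if ($inTitle) {
            $titles[count($titles) - 1] .= $data;
        }
    }
);

$ok = xml_parse($parser, $xml, true); // true marks the final chunk
unset($parser);

// $titles is now ['First', 'Second']
print_r($titles);
```

The manual state tracking is the extra work compared with XMLReader, where you ask for the next node instead of being called back.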

SimpleXML

The SimpleXML extension provides a simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.

SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXML because it will choke.
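A short sketch of both statements, assuming the SimpleXML extension is enabled. The sample documents are made up:

```php
<?php
$xml = '<library><book id="1"><title>First</title></book>'
     . '<book id="2"><title>Second</title></book></library>';

$library = simplexml_load_string($xml);

$books = [];
foreach ($library->book as $book) {
    // Attributes use array syntax, child elements use property syntax.
    $books[] = (string) $book['id'] . ': ' . (string) $book->title;
}

// Broken HTML is not well-formed XML, so loading fails and returns false.
$previous = libxml_use_internal_errors(true);
$broken   = simplexml_load_string('<p>unclosed <b>markup</p>');
libxml_clear_errors();
libxml_use_internal_errors($previous);

// $books is now ['1: First', '2: Second'] and $broken is false
print_r($books);
var_dump($broken);
```

The (string) casts matter: without them you get SimpleXMLElement objects rather than plain strings.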

A basic usage example can be found at "A simple program to CRUD node and node values of xml file" and there are lots of additional examples in the PHP Manual.

3rd Party Libraries (libxml based)

If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.

phpQuery

phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on the jQuery JavaScript library. It is written in PHP5 and provides an additional Command Line Interface (CLI).

Zend_Dom

Zend_Dom provides tools for working with DOM documents and structures. Currently, it offers Zend_Dom_Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.

QueryPath

QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files, but also with web services and database resources. It implements much of the jQuery interface (including CSS-style selectors), but it is heavily tuned for server-side use. It can be installed via Composer.

FluentDOM

FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM by implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. It can be installed via Composer.

fDOMDocument

fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP warnings or notices. It also adds various custom methods and shortcuts for convenience and to simplify the usage of DOM.

sabre/xml

sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "XML to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require little memory on large XML files.

FluidXML

FluidXML is a PHP library for manipulating XML with a concise and fluent API. It leverages XPath and the fluent programming pattern to be fun and effective.

3rd-Party (not libxml-based)

The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them are listed below.

SimpleHtmlDom

An HTML DOM parser written in PHP5+ that lets you manipulate HTML in an easy way. It requires PHP 5+ and supports invalid HTML. You can find tags on an HTML page with selectors just like jQuery, and extract contents from HTML in a single line.

I do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Any of the libxml based libraries should outperform it easily.

Ganon

A universal tokenizer and HTML/XML/RSS DOM parser with the ability to manipulate elements and their attributes. It supports invalid HTML and UTF-8; can perform advanced CSS3-like queries on elements (like jQuery, with namespaces supported); includes an HTML beautifier (like HTML Tidy); can minify CSS and JavaScript; and can sort attributes, change character case, correct indentation, etc. It is extensible: documents are parsed using callbacks based on the current character/token, and operations are separated into smaller functions for easy overriding. It is described as fast and easy.

Never used it. Can't tell if it's any good.

HTML 5

You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you may want to consider using a dedicated parser, like

html5lib

Python and PHP implementations of an HTML parser based on the WHATWG HTML5 specification for maximum compatibility with major desktop web browsers.

We might see more dedicated parsers once HTML5 is finalized. There is also a blog post by the W3C titled "How-To for html 5 parsing" that is worth checking out.

Web Services

If you don't feel like programming PHP, you can also use web services. In general, I found very little utility for these, but that's just me and my use cases.

YQL

The YQL web service enables applications to query, filter, and combine data from different sources across the Internet. YQL statements have a SQL-like syntax, familiar to any developer with database experience.

ScraperWiki

ScraperWiki's external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.

Regular Expressions

Last and least recommended, you can extract data from HTML with regular expressions. In general, using regular expressions on HTML is discouraged.

Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding or changing attributes in a tag, can make the regex fail when it's not properly written. You should know what you are doing before using regex on HTML.

HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught them for each new regex you write. Regex are fine in some cases, but it really depends on your use case.
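To make the brittleness concrete, here is a small sketch comparing a typical copy-and-paste pattern with DOM on two equivalent links. The pattern, the sample markup and the helper firstHref() are all invented for this illustration:

```php
<?php
// Assumes href is the first attribute, double-quoted, after exactly one space.
$pattern = '/<a href="([^"]*)">/';

$original = '<a href="/page">link</a>';
// Same link to a browser: extra attribute, a line break, single quotes.
$changed  = "<a class=\"nav\"\n   href='/page'>link</a>";

$matchesOriginal = preg_match($pattern, $original); // 1: the pattern matches
$matchesChanged  = preg_match($pattern, $changed);  // 0: the pattern fails

// Hypothetical helper: the same extraction done with the DOM extension.
function firstHref($html)
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = $dom->getElementsByTagName('a');
    return $links->length > 0 ? $links->item(0)->getAttribute('href') : null;
}

// Both calls return '/page', because the parser knows HTML attribute syntax.
var_dump(firstHref($original), firstHref($changed));
```

The pattern could be extended to cope with these variations, but every such rule has to be added by hand, which is the point made above.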

You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job of this.

Also see "Parsing Html The Cthulhu Way".

Books

If you want to spend some money, have a look at

PHP Architect's Guide to Webscraping with PHP

I am not affiliated with PHP Architect or the authors.

Tags: php, parsing, xml-parsing, html-parsing
