Thursday 15 April 2010

html - Matlab Regular expression query -




I'm very new to regex, and I haven't found an explanation descriptive enough to turn my understanding of regex into a solution.

I use a script that scrapes the HTML of Yahoo Finance for financial options table data. Yahoo changed its HTML code, and the old algorithm no longer works. The old expression is the following:

main_pattern = '.*?</table><table[^>]*>(.*?)</table';
tables = regexp(urltext, main_pattern, 'tokens');

where tables used to return data, but no longer does. Inspecting the HTML suggests to me that the data is no longer in <table>, but rather in <tbody>...

My question is: what does the main_pattern regex mean in layman's terms? I'm trying to figure out how to modify the expression so that it is applicable to the current HTML.

While I agree with @marcin that regular expressions are best learned by doing and by leveraging the reference of your chosen tool, I'll try to break down what this one is doing.

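.*?</table>: match as little as possible up to and including a </table> literal (this is a lazy expression due to the ?). In practice, this is the first </table> that is immediately followed by <table.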

<table: match this literal.

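[^>]*>: match as many characters as possible that aren't > after the <table literal, followed by the > that closes the opening tag (this is a greedy expression since there is no ? after *). Because the character class excludes >, the match cannot run past the end of the tag.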

(.*?)</table: match and capture everything between the > of the previous part and the next </table literal; the captured text can be retrieved using the 'tokens' option of regexp (you can get the entire matched string using the 'match' option).
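
To see the difference between the lazy and greedy pieces described above, here is a minimal sketch on a made-up string (the sample text is an assumption for illustration, not taken from the Yahoo page):

```matlab
str = '<b>one</b><b>two</b>';   % hypothetical sample string

% Lazy: .*? stops at the first </b> that allows a match.
lazyTok = regexp(str, '<b>(.*?)</b>', 'tokens', 'once');

% Greedy: .* runs on to the last </b> in the string.
greedyTok = regexp(str, '<b>(.*)</b>', 'tokens', 'once');

% [^>]* is greedy too, but it can never cross a > character.
tagAttrs = regexp('<table class="x">rest>', '<table([^>]*)>', 'tokens', 'once');

disp(lazyTok{1});     % one
disp(greedyTok{1});   % one</b><b>two
disp(tagAttrs{1});    % class="x" preceded by a single space
```

With the 'once' option, regexp returns the tokens of the first match only, as a cell array of strings.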

While I broke it into pieces, I'd emphasize that the entire expression works as a whole, which is why some parts refer to previous parts.
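
As a rough illustration of the expression working as a whole, the sketch below runs it on a tiny made-up page. The sample HTML and the tbody_pattern variant are assumptions for illustration only; the real Yahoo markup may need a different pattern.

```matlab
% Hypothetical sample standing in for the downloaded page text.
urltext = ['<html><table id="a"><tr><td>skip</td></tr></table>' ...
           '<table class="opts"><tr><td>1.25</td></tr></table></html>'];

main_pattern = '.*?</table><table[^>]*>(.*?)</table';

% 'tokens' returns only the captured group; 'match' returns the whole match.
tables  = regexp(urltext, main_pattern, 'tokens');
matched = regexp(urltext, main_pattern, 'match');
inner   = tables{1}{1};   % contents of the second table

% Hypothetical variant, assuming the data now sits inside a <tbody> element.
tbodytext = ['<table><thead><tr><th>Strike</th></tr></thead>' ...
             '<tbody><tr><td>1.25</td></tr></tbody></table>'];
tbody_pattern = '<tbody[^>]*>(.*?)</tbody>';
bodies = regexp(tbodytext, tbody_pattern, 'tokens');
body   = bodies{1}{1};    % contents of the first <tbody>
```

Note that the original pattern skips the first table entirely: the leading .*?</table> consumes it, so only the contents of the table that follows are captured.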

Refer to the Operators and Characters section of the MATLAB documentation for more in-depth explanations of the above.

For the future, a more robust alternative might be to use MATLAB's xmlread and the DOM object to traverse the table nodes. I understand that this is another API to learn, but it may be more maintainable in the future.
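
A minimal sketch of the xmlread approach follows, assuming the markup is well-formed XML. The sample table and the variable names are hypothetical; xmlread raises an error on markup that is not well-formed, which real-world HTML pages often are not, so the page may need cleaning first.

```matlab
% Hypothetical, well-formed sample standing in for the real page.
xmlText = ['<table><tbody>' ...
           '<tr><td>AAA</td><td>1.25</td></tr>' ...
           '<tr><td>BBB</td><td>2.50</td></tr>' ...
           '</tbody></table>'];

% xmlread reads from a file or URL, so write the sample to a temporary file.
tmpFile = [tempname '.xml'];
fid = fopen(tmpFile, 'w');
fprintf(fid, '%s', xmlText);
fclose(fid);

doc = xmlread(tmpFile);   % returns a Java DOM Document
delete(tmpFile);

% Walk every <tr>, then every <td> inside it.
rows = doc.getElementsByTagName('tr');
data = cell(rows.getLength, 2);
for r = 1:rows.getLength
    cells = rows.item(r - 1).getElementsByTagName('td');   % DOM indices are zero-based
    for c = 1:cells.getLength
        data{r, c} = char(cells.item(c - 1).getTextContent);
    end
end
```

Here data ends up as a 2-by-2 cell array of strings, one row per <tr>. Unlike the regex, this does not depend on how the tags are spaced or which attributes they carry.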

html regex matlab
