QUOTE–>While grep and regular expressions are a powerful way to search raw text, when text files already have structure – such as comma-delimited files, or raw HTML – we want to take advantage of programs specifically designed to exploit that structure. With HTML, especially, finding a pattern regular enough (nevermind simple) that a regex can exploit is madness.
However, you don’t have to know CSS (i.e. how to style webpages) to do HTML parsing. You just have to understand how CSS Selectors are used to target specific HTML elements. Instead of styling these HTML elements, we will be grabbing the text inside them. Different purpose, but same process and syntax of selection. <– END QUOTE –> Read the whole thing: "HTML parsing for pup - Using the pup tool to more sanely extract data from HTML files" (Stanford Journalism, Computational Methods in the Civic Sphere). A fantastic HOWTO.
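
To make the "same selection syntax, different purpose" point concrete, here is a rough sketch in Python, standard library only. It is not pup and not a real CSS engine: the `SimpleSelector` class, the `select_text` helper, and the `SAMPLE` markup are all made up for illustration, and the matcher only understands three selector shapes (`tag`, `.class`, `tag.class`). The idea it shows is the one from the quote: name the elements with a selector, then grab the text inside them instead of styling them.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SimpleSelector(HTMLParser):
    """Hypothetical helper: collect text inside elements matching
    'tag', '.class', or 'tag.class' (nothing fancier)."""

    def __init__(self, selector):
        super().__init__()
        tag, _, cls = selector.partition(".")
        self.tag = tag or None   # None means "any tag"
        self.cls = cls or None   # None means "any class"
        self.depth = 0           # > 0 while inside a matched element
        self.buf = []
        self.results = []

    def _matches(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return self.tag in (None, tag) and (self.cls is None or self.cls in classes)

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self.depth:                    # nested inside a match: just track depth
            self.depth += 1
        elif self._matches(tag, attrs):   # outermost match: start collecting text
            self.depth = 1
            self.buf = []

    def handle_endtag(self, tag):
        if tag in VOID or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:               # matched element closed: keep its text
            self.results.append("".join(self.buf).strip())

    def handle_data(self, data):
        if self.depth:
            self.buf.append(data)


def select_text(html, selector):
    """Return the text of every element in `html` matching `selector`."""
    parser = SimpleSelector(selector)
    parser.feed(html)
    parser.close()
    return parser.results


# Invented sample markup, for illustration only.
SAMPLE = """
<ul>
  <li class="story"><a href="/a">First <b>headline</b></a></li>
  <li class="ad">Buy now</li>
  <li class="story lead"><a href="/b">Second headline</a></li>
</ul>
"""

if __name__ == "__main__":
    print(select_text(SAMPLE, "li.story"))
```

The selector `li.story` is exactly what you would write in a stylesheet to style those list items; here it is used to pull their text out instead. With pup itself the equivalent should be roughly `pup 'li.story text{}'`, but check the HOWTO for the real syntax. A regex that did the same job would have to cope with nested tags and multiple class names, which is the madness the quote is warning about.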