Scraping Files with Fancy Scripting Tricks
Parsing and extracting specific data from an information file
By: Dave Taylor
May. 30, 2005 12:15 PM
Last month we finally wrapped up the long journey toward creating a useful shell script with the hi-low game. All right, "useful" might be a bit of a stretch, but if you've read through all the columns leading up to this point, you should have a good understanding of the basics of creating and debugging a shell script, a skill that will prove invaluable as you travel further down the Linux and Unix path.

In this column I'm going to present a short shell script that does something darn useful for those of you who secretly are also running Mac OS X, but even if you're not, it's going to be an interesting script to learn.

Parsing XML "plist" Files

A typical file, the bookmarks file for Apple's Safari browser, stores an individual bookmark this way:

    <dict>

Don't panic. The only thing you need to notice here is that the URL appears on the line immediately after the URLString line, and that the name of the bookmark entry appears on the line immediately after the title line.

Extracting Lines from a File with Grep

The first step in writing this script is to use grep to extract the lines that match the two fieldnames specified, plus the one line immediately following each match. This is done with the -A1 flag:

    bm="$HOME/Library/Safari/Bookmarks.plist"
    grep -A1 -E '(>URLString<|>title<)' $bm

(I've assigned the full pathname to the variable "bm" for convenience.) Notice that I'm also using a simple regular expression to match lines that have the pattern >URLString< or >title<. The -E flag tells grep to treat the pattern as an extended regular expression, which is what lets the "|" mean "or."

We're getting there. The problem now is that we have both the lines that contain the information we want and the lines that match the fieldnames. Another job for grep, this time inverting the test to show only the lines that don't match the specified pattern:

    grep -A1 -E '(>URLString<|>title<)' $bm |

Almost done, actually.
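To try the two grep steps without touching your real bookmarks, here's a minimal sketch that runs them against a small made-up file. The sample contents, the temporary file, and the function name extract_values are assumptions for illustration only, and the second grep is just one plausible form of the inverted test described above. One wrinkle it allows for: with -A1, grep prints a "--" line between groups of matches that aren't adjacent, so the inverted grep filters those out too.

```shell
# Hypothetical sample data standing in for Safari's Bookmarks.plist.
bm="$(mktemp)"
cat > "$bm" <<'EOF'
<dict>
    <key>URIDictionary</key>
    <dict>
        <key>title</key>
        <string>Camera</string>
    </dict>
    <key>URLString</key>
    <string>http://www.example.com/camera</string>
</dict>
EOF

# Step 1: keep each fieldname line plus the one line after it (-A1).
# Step 2: invert the match (-v) to drop the fieldname lines, along with
# the "--" separators grep prints between non-adjacent groups.
extract_values() {
    grep -A1 -E '(>URLString<|>title<)' "$1" |
        grep -v -E '(>URLString<|>title<|^--$)'
}

extract_values "$bm"
```

Quoting "$1" keeps the sketch working even if the pathname contains spaces.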
Here's an example of how the output looks now:

    <string>Camera</string>

All that's left is to clean up the format a bit.

Chopping Lines with the Cut Command

Let's look at how we can use cut to strip off anything prior to the first ">" and subsequent to the second "<" in each matching line:

    cut -d\> -f2 | cut -d\< -f1

Not the most elegant or graceful solution, but definitely quick and dirty, with an emphasis on quick. The first command tosses out anything prior to the first ">" symbol, then the second shows only what's on the line prior to the first occurrence of "<". For the first line above, <string>Camera</string>, the first cut would produce Camera</string (the closing ">" is a delimiter, so cut drops it along with everything after it) and the second would produce Camera, exactly as we hoped.

Are we done? Not quite, because while it's useful to be able to produce an output of bookmark name, URL, bookmark name, URL, it'd be much nicer to produce HTML output that can then be viewed in any Web browser. Doing this, however, is a bit trickier and involves learning how you can hook a structured block of scripting code into a pipeline.

And that, I'm afraid, will have to wait until next month. See you then!
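For anyone who wants to poke at the cut step on its own before next month, here's a small sketch that wraps the two cut commands in a function and feeds it the sample line from above. The function name strip_tags is an assumption for illustration, not part of the original script.

```shell
# Keep only the text between the first ">" and the "<" that follows it.
strip_tags() {
    cut -d'>' -f2 | cut -d'<' -f1
}

echo '<string>Camera</string>' | strip_tags    # prints: Camera
```

Because cut splits on every occurrence of its delimiter, this only behaves when each value sits on a single line and contains no ">" or "<" of its own.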