Another Fucking Productive Day at Work

September 10, 2002

I’m loving it. Grep patterns, trying to get an AppleScript together to scrub a page of HTML. LOVE ING IT. Justin took half the day off, so I puttered around a while, trying to figure out how to "carve out" a certain section of a page. Ben actually figured out the right search pattern, and also added some unique identifiers to a few global templates so that we "flagged" the areas we needed.

Ben called Justin and got his password for me (so I could log onto his computer and continue the work I'd done on that AppleScript). By the end of the day, I got a series of "scrubbing" runs worked out, which I’m really, really proud of:

tell application "BBEdit 6.5"

    activate

    --Define search settings
    set gt to {search mode:grep, starting at top:true, wrap around:false, reverse:false, case sensitive:false, match words:false, extend selection:false}

    --Locate beginning of data (<!-- FJBEGIN -->) and delete everything prior to it
    replace "\\A[\\w\\W]*<!-- FJBEGIN -->" using "<!-- FJBEGIN -->" searching in text 1 of text window 1 options gt

    --Locate end of data (<!-- FJEND -->) and delete everything that comes after
    replace "<!-- FJEND -->[\\w\\W]*</html>" using "<!-- FJEND -->" searching in text 1 of text window 1 options gt

    --Find any <table> <tr> <td> tags, delete
    replace "[ \\t]*(<[/]?table[^>]*>|<[/]?tr[^>]*>|<[/]?td[^>]*>)" using "" searching in text 1 of text window 1 options gt

    --Find any <img> tags and delete
    replace "<img[^>]*>" using "" searching in text 1 of text window 1 options gt

    --Find any links, insert placeholder for href and keep what follows
    replace "<a href=\"[^\"]*\"" using "<a href=\"#\"" searching in text 1 of text window 1 options gt

    --Find any comments and remove them (WARNING: REMOVES FJ IDENTIFIER)
    replace "[ \\t]*<!--[^>]*-->" using "" searching in text 1 of text window 1 options gt

    --Find and remove <div> tags
    replace "<[/]?div[^>]*>" using "" searching in text 1 of text window 1 options gt

    --Find and remove <link> tags
    replace "<link[^>]*>" using "" searching in text 1 of text window 1 options gt

    --Remove all blank lines (PROBABLY RUN THIS TOWARD THE END, AFTER OTHER PROCESSES)
    replace "^\\s*[\\s]*[\\r]+" using "" searching in text 1 of text window 1 options gt

    --Remove nonbreaking spaces (&nbsp;) and bullets (assumed here to be &#149;)
    replace "&nbsp;|&#149;" using "" searching in text 1 of text window 1 options gt

    --Remove any space at the start of all lines
    replace "^\\s*" using "" searching in text 1 of text window 1 options gt

    --Find and remove any consecutive breaks
    replace "<br>(<br>|\\r)+" using "" searching in text 1 of text window 1 options gt

end tell

Now, keep in mind that I’ve been working off of a previous AppleScript that Justin wrote. While I’m using his older code for syntax and structure, the search patterns are mine. I tested them out in BBEdit, and inserted them into Script Debugger.

What we’re trying to accomplish is a method of "stripping" out extraneous information from an HTML page, keeping only the raw data. Using a "practice" file to test out my script, I was able to reduce what was originally a page with 929 lines of code into a svelte 54 lines. Jeah, baby!
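
For anyone without BBEdit handy, here is a rough Python sketch of the same stripping idea: carve out whatever sits between the two FJ markers, then run a handful of passes over it. The function name `scrub`, the `PASSES` list, and the sample page are all made up for illustration. Python's `re` module is not BBEdit's grep engine, so treat the patterns as approximations rather than a port of the real script. Unlike the BBEdit version, this one drops the markers themselves right away.

```python
import re

# Each pass is (pattern, replacement), applied in order.
# Hypothetical approximations of a few of the scrubbing runs.
PASSES = [
    (r"</?(?:table|tr|td)\b[^>]*>", ""),   # drop table scaffolding
    (r"<img\b[^>]*>", ""),                 # drop images
    (r'<a href="[^"]*"', '<a href="#"'),   # swap every href for a placeholder
    (r"</?div\b[^>]*>", ""),               # drop div wrappers
    (r"&nbsp;", ""),                       # drop nonbreaking spaces
]


def scrub(html: str) -> str:
    """Return only the flagged region of the page, with the markup thinned out."""
    # Carve out the region between the markers; fall back to the whole page.
    match = re.search(r"<!-- FJBEGIN -->(.*?)<!-- FJEND -->", html, flags=re.S)
    body = match.group(1) if match else html

    # Run every pass, ignoring case the way the BBEdit settings do.
    for pattern, replacement in PASSES:
        body = re.sub(pattern, replacement, body, flags=re.I)

    # Strip whitespace at the edges of each line, then drop blank lines.
    lines = [line.strip() for line in body.splitlines()]
    return "\n".join(line for line in lines if line)


sample = """<html><body>
<table><tr><td>
<!-- FJBEGIN -->
  <div class="x"><img src="logo.gif">
  <a href="/about.html">About</a>&nbsp;
  </div>
<!-- FJEND -->
</td></tr></table>
</body></html>"""

print(scrub(sample))
```

On that made-up sample, ten lines of page collapse down to the single link line, which is the same kind of shrinkage the practice file went through.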

The script above performs all the actions on the foremost document open in BBEdit. Ideally, we can also throw some code into the beginning of this AppleScript that will open a URL, copy the code into a new document, run the scrubbing process, and save it with a unique name. This way, we’ll have automated the entire "scrubbing" process for a site. In a perfect world, we’ll be able to run the script, walk away, and have the entire site stripped down to the bare essentials.
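
The "run it and walk away" version might look something like this in Python. It's only a sketch under assumptions: `scrub_site`, `unique_name`, `fetch`, and the `scrubbed` output folder are hypothetical names, the download step uses the standard library's `urllib.request`, and the scrubbing step is passed in as a plain function so any set of passes can be plugged in.

```python
import re
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def unique_name(url: str, taken: set) -> str:
    """Build a file name from the URL, adding a counter if it is already used."""
    parsed = urlparse(url)
    stem = re.sub(r"[^A-Za-z0-9]+", "-", parsed.netloc + parsed.path).strip("-")
    stem = stem or "page"
    name, n = f"{stem}.html", 1
    while name in taken:
        n += 1
        name = f"{stem}-{n}.html"
    taken.add(name)
    return name


def fetch(url: str) -> str:
    """Download a page and return its source as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrub_site(urls, scrub, out_dir="scrubbed", fetcher=fetch):
    """Fetch each URL, scrub it, and save the result under a unique name."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    taken = set()
    saved = []
    for url in urls:
        target = out / unique_name(url, taken)
        target.write_text(scrub(fetcher(url)), encoding="utf-8")
        saved.append(target)
    return saved
```

A call would be along the lines of `scrub_site(["http://www.example.com/"], scrub=my_passes)`, where `my_passes` stands in for whatever function does the actual stripping. Keeping the fetcher swappable means the whole thing can be tried against saved practice files before pointing it at a live site.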

:D I really had a lot of fun today at work. Ben helped give me more of a "tutorial" on how he and Justin go about structuring sites. Every day, I learn more and more and more. It’s fucking awesome.
