I like to use Leo to perform research, and it struck me that it would be very useful to grab the text of a web page and store it in the outline. There is a plug-in called URLloader (which I couldn't get to work), but it downloads the raw HTML directly -- I wanted only the text, stored in the body of @url nodes.

To solve the problem, I wrote the following button code. It's clunky, but it works pretty well and I didn't want to spend any more time refining the regular expressions:

# @button GrabURLtext v.1 by Dan Rahmel

from urllib import urlopen
import re

# Get the URL from the headline of the @url node
urlStr = p.headString()[5:]
g.es("Loading... " + urlStr)
scrape = urlopen(urlStr).read()
g.es("Scraping...")
# Delete newlines and carriage returns
scrape = re.sub(r'[\n\r]', '', scrape)
# Collapse runs of 2 or more whitespace characters into a single space
scrape = re.sub(r'\s{2,}', ' ', scrape)
# Replace paragraph, br, heading, and list-item tags with newlines
# (the \b keeps the pattern from eating tags like <pre>, and [^>]*
# also matches bare tags such as <p>, which .+? would skip)
scrape = re.sub(r'(?i)<(p|br|h[1-3]|li)\b[^>]*>', '\n', scrape)
# Delete remaining HTML tags
scrape = re.sub(r'<.+?>', '', scrape)
# Collapse leftover whitespace runs into single newlines
scrape = re.sub(r'\s{2,}', '\n', scrape)
c.setBodyString(p, scrape)
g.es("Complete.")
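The regex pipeline above can also be pulled out into a standalone function, which makes it easy to test without Leo or a network connection. This is just a sketch of the same cleanup (the function name and sample HTML are my own, and the tag regex is tightened a little so it also catches bare tags like <p>):

```python
import re

def html_to_text(html):
    """Rough HTML-to-text cleanup mirroring the button's regex pipeline."""
    # Delete newlines and carriage returns
    text = re.sub(r'[\n\r]', '', html)
    # Collapse runs of 2+ whitespace characters into a single space
    text = re.sub(r'\s{2,}', ' ', text)
    # Replace block-level tags (p, br, h1-h3, li) with newlines;
    # \b stops the pattern from matching tags like <pre>
    text = re.sub(r'(?i)<(p|br|h[1-3]|li)\b[^>]*>', '\n', text)
    # Strip all remaining tags
    text = re.sub(r'<.+?>', '', text)
    # Collapse leftover whitespace runs into single newlines
    text = re.sub(r'\s{2,}', '\n', text)
    return text.strip()
```

For example, html_to_text("<p>Hello</p><p>World</p>") comes back as two lines of plain text with the markup gone.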

... --jkw, Thu, 04 Oct 2007 06:22:07 -0700 reply

As I had problems with the encoding of German pages, I add
scrape = scrape.decode('Latin-1')

before the g.es("Scraping...") line. For me it works great.
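Sketched outside Leo, the fix looks like this: the data fetched from a Latin-1 encoded page is an 8-bit byte string, and decoding it turns the raw umlaut bytes into proper text before the regex cleanup runs (the sample bytes below are made up for illustration):

```python
# Hypothetical raw bytes, as a Latin-1 encoded German page would deliver them
raw = b'Gr\xfc\xdfe aus M\xfcnchen'
# Decode to text before running the regex cleanup
text = raw.decode('latin-1')
```

After the decode, text reads "Grüße aus München" instead of showing mangled characters.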