Scrape eBay from a bash script

This tutorial builds on the previous post, where you grabbed your IP from a web page using just simple Unix programs. Let's say you want to grab Samsung products and their prices off eBay. The first thing you want to do is search their site for products. While experimenting with the search box on their site, I noticed that the word you are searching for is always present in the URL of the resulting page.
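For a search on "samsung", the URL looked roughly like this (the exact host and path are a reconstruction and may differ by region; the _nkw parameter is the part that carries the search term):

    https://www.ebay.com/sch/i.html?_nkw=samsung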

Go ahead and paste that into your browser. At first the URL contained a lot of parameters, but I managed to work out that just _nkw and its value are the important ones. Now let's take a look at the page's HTML code.
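You can fetch the raw HTML from the shell with curl (using the sketched URL from above) and page through it with less:

    curl -s "https://www.ebay.com/sch/i.html?_nkw=samsung" | less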

After analysing the HTML, I noticed that the price is always followed by "<b>EURO</b></span>", which turns out to be a good term to search for. It also seems to appear 8 lines after the product name. Let's try to do some filtering using curl and grep.
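The exact command from the original post was lost, so treat this as a sketch: grep's -B 8 option prints the 8 lines of context before each match, which is where the product name should be.

    curl -s "https://www.ebay.com/sch/i.html?_nkw=samsung" | grep -B 8 "<b>EURO</b></span>"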

It seems to work pretty well: we get a product name and its price, but a lot of unwanted HTML too. To remove the HTML we will use sed and a regex I found online.
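The classic tag-stripping regex deletes everything between a "<" and the next ">" (whether this is the exact regex the original post linked to is an assumption):

    curl -s "https://www.ebay.com/sch/i.html?_nkw=samsung" \
        | grep -B 8 "<b>EURO</b></span>" \
        | sed -e 's/<[^>]*>//g'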

The output is almost perfect; there is just a lot of whitespace left. Let's start by removing all empty lines.
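An extra sed expression that deletes empty lines takes care of the first part:

    curl -s "https://www.ebay.com/sch/i.html?_nkw=samsung" \
        | grep -B 8 "<b>EURO</b></span>" \
        | sed -e 's/<[^>]*>//g' -e '/^$/d'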

Now for the last part: removing the leading whitespace. With that, you have created your first web scraping tool.
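One final sed expression strips spaces and tabs from the start of each line:

    curl -s "https://www.ebay.com/sch/i.html?_nkw=samsung" \
        | grep -B 8 "<b>EURO</b></span>" \
        | sed -e 's/<[^>]*>//g' -e '/^$/d' -e 's/^[ \t]*//'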

If you also want to scrape all of the search result pages, you will have to append two more parameters to the URL and adjust them for each page, as in the sketch below.
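The exact pattern from the original post was lost, and only one of the two parameters survives in this sketch. As a hedged guess, assuming eBay passes the page number in a _pgn parameter (verify this against the URLs of the actual result pages), a loop over the first few pages could look like this:

    #!/bin/sh
    # Hypothetical pagination sketch: _pgn as the page-number parameter
    # is an assumption; the second parameter from the original pattern
    # was lost.
    for page in 1 2 3; do
        curl -s "https://www.ebay.com/sch/i.html?_nkw=samsung&_pgn=$page" \
            | grep -B 8 "<b>EURO</b></span>" \
            | sed -e 's/<[^>]*>//g' -e '/^$/d' -e 's/^[ \t]*//'
    done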

How to scrape your IP from a Unix shell

In this simple tutorial, I will explain how to scrape data from a simple web page. In this case I will use a page that contains your external IP, so even if you sit behind a router or access the Internet through a proxy, you will still get the correct address. I will only use common Unix utilities that are available on most systems, so it will work great on Linux. In this case that means curl; start by just downloading the HTML.
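The address from the original post was lost, so the URL below is a placeholder; substitute any page that prints your external IP:

    # Placeholder URL - point it at a page that shows your external IP.
    curl "https://example.com/myip.html"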

As you can see, the page contains a lot of HTML code that must be excluded before we can extract the data. Note that the address is inside an h1 tag, so we can grab that line using grep.
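Piping curl into grep keeps only the line with the h1 tag (still using the placeholder URL):

    curl --silent "https://example.com/myip.html" | grep "<h1>"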

The result looks a lot better after piping it through grep. I used the --silent option to stop curl from printing its progress statistics and error messages. We still need to remove the remaining markup; in other words, remove everything except the dots and the numeric characters. This can be done easily using sed.
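A character class that matches everything except digits and dots does the trick:

    curl --silent "https://example.com/myip.html" | grep "<h1>" | sed 's/[^0-9.]//g'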

We just told sed to replace everything that does not match 0-9 or "." with "", i.e. nothing. To save the IP in a variable, name the variable and wrap the command in backticks.
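For example (still with the placeholder URL):

    IP=`curl --silent "https://example.com/myip.html" | grep "<h1>" | sed 's/[^0-9.]//g'`
    echo "My external IP is $IP"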

Look how easy it is! You have created your first web scraper in just minutes. Now you can go ahead and log the IP, update your DDNS, or do whatever else you want with your scraped data. In fact, I'm using this little scraper on my router to update my firewall. If you don't have curl or sed, maybe you have lynx.
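A lynx variant might look like this; lynx -dump renders the page as plain text, so there are no tags to strip. The grep -o pattern that pulls out just the IPv4 address is my addition, not necessarily what the original post used:

    lynx -dump "https://example.com/myip.html" | grep -o '[0-9]*\.[0-9]*\.[0-9]*\.[0-9]*'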

The command gets a bit shorter since lynx gives you the page as plain text, not HTML like curl does. If you want, you can create a shell script that just prints the external IP and invoke it from anywhere, for example from a PHP script using the system function.
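Wrapped up as a script (the URL is still a placeholder), it could look like this; a PHP page could then call it with system('/usr/local/bin/myip.sh'):

    #!/bin/sh
    # myip.sh - print the external IP address.
    # The URL is a placeholder; point it at a page that shows your IP.
    curl --silent "https://example.com/myip.html" | grep "<h1>" | sed 's/[^0-9.]//g'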