Web Scraping with Nokogiri::HTML and Ruby - save images


April 2019


I'm working on a script to grab data and images from webshop product pages (with approval from the owner).

I have a working script that loops through a CSV file with 20042 product URLs and stores the data I need in a CSV file. The last thing I need is to save the product images.

I have this code (thanks to Phrogz in this thread)

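    URL = 'http://www.sample.com/page.html'

    require 'nokogiri'
    require 'open-uri'
    require 'uri'

    # Resolve a possibly relative href against the URL of the page it came from.
    def make_absolute( href, root )
      URI.join(root, href.to_s).to_s
    end

    # Save every image linked from the element with id="zoom".
    # URI.open is used because a bare open() no longer fetches URLs on Ruby 3.0+.
    Nokogiri::HTML(URI.open(URL)).xpath('//*[@id="zoom"]/@href').each do |src|
      uri = make_absolute(src,URL)
      File.open(File.basename(uri),'wb'){ |f| f.write(URI.open(uri).read) }
    end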

That runs great for a single URL, but I'm struggling to get it to loop through the URLs from the CSV file in my main script, which starts like this:

    # encoding: utf-8
    require 'nokogiri'
    require 'open-uri'
    require 'csv'
    require 'mechanize'

    @prices = Array.new
    @title = Array.new
    @description = Array.new
    @warranty = Array.new
    @leadtime = Array.new
    @urls = Array.new 
    @categories = Array.new
    @subcategories = Array.new
    @subsubcategories = Array.new

    urls = CSV.read("lotofurls.csv")
    (0..urls.length - 1).each do |index|

      puts urls[index][0]
      doc = Nokogiri::HTML(open(urls[index][0]))

It looks like all I need to figure out is how to feed the URLs to the code that saves the image, but any help would be much appreciated!

You can make quick work of this with something like RMagick (or ImageMagick, MiniMagick, etc.).

For RMagick, you could do something like this:

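    require 'rmagick'
    require 'open-uri'

    images.each do |image|
      url = image.url # should be a string
      # Assumed completion: fetch the bytes, resize to 200x200, save under the URL's basename
      img = Magick::Image.from_blob(URI.open(url).read).first
      img.resize_to_fill(200, 200).write(File.basename(url))
    end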

That would write a 200x200px image for each URL you provide (resize_to_fill is optional, obviously). The library is very powerful, with many, many options.

The documentation, if you want to get more advanced: http://rmagick.rubyforge.org/
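If you only need the original files and no resizing, you can also feed the CSV URLs straight into the image-saving snippet from the question. Below is a minimal sketch of that, not a tested drop-in for your script. It assumes the product URL is in the first column of `lotofurls.csv`, that there is no header row, and that the `//*[@id="zoom"]/@href` XPath from the question matches on every product page. The method names `product_urls`, `save_images` and `save_all_images`, and the `images` output directory, are made up for illustration.

```ruby
require 'csv'
require 'open-uri'
require 'uri'
require 'fileutils'

# Resolve a possibly relative href against the page it was found on.
def make_absolute(href, root)
  URI.join(root, href.to_s).to_s
end

# First column of every CSV row; blank rows are skipped.
# Assumption: no header row, product URL in column 0.
def product_urls(csv_path)
  CSV.read(csv_path).map { |row| row[0] }.compact
end

# Download every image linked from the id="zoom" element of one page.
def save_images(page_url, dir = 'images')
  require 'nokogiri' # loaded here so the helpers above work without it
  FileUtils.mkdir_p(dir)
  doc = Nokogiri::HTML(URI.open(page_url))
  doc.xpath('//*[@id="zoom"]/@href').each do |href|
    uri = make_absolute(href.value, page_url)
    # Name the file after the URL path, so a query string is not included.
    target = File.join(dir, File.basename(URI.parse(uri).path))
    File.open(target, 'wb') { |f| f.write(URI.open(uri).read) }
  end
end

# Run save_images for every URL in the CSV, skipping pages that fail to load.
def save_all_images(csv_path, dir = 'images')
  product_urls(csv_path).each do |url|
    begin
      save_images(url, dir)
    rescue OpenURI::HTTPError, SocketError => e
      warn "Skipping #{url}: #{e.message}"
    end
  end
end
```

Calling `save_all_images('lotofurls.csv')` would then walk the whole file. In your main script you could instead call `save_images(urls[index][0])` inside the existing loop, although that fetches each page a second time. With 20042 pages, rescuing failed requests rather than letting one bad URL abort the run is probably worth it.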