Web Scraping with Nokogiri::HTML and Ruby - save images

April 2019

I'm working on a script to grab data and images from webshop product pages (with approval from the owner).

I have a working script that loops through a CSV file with 20042 product URLs, collects the data I need, and stores it in a CSV file. The final thing I need is to save the product images.

I have this code (thanks to Phrogz in this thread):

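    require 'nokogiri'
    require 'open-uri'
    require 'uri'

    URL = 'http://www.sample.com/page.html'

    # Resolve a possibly relative href against the URL of the page.
    def make_absolute(href, root)
      URI.parse(root).merge(URI.parse(href)).to_s
    end

    # Kernel#open no longer opens URLs on Ruby 3.0+, so call URI.open (from open-uri).
    Nokogiri::HTML(URI.open(URL)).xpath('//*[@id="zoom"]/@href').each do |src|
      # src is an attribute node; pass its string value to the URI helpers.
      uri = make_absolute(src.value, URL)
      File.open(File.basename(uri), 'wb') { |f| f.write(URI.open(uri).read) }
    end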

That runs great for a single URL, but I'm struggling to get it to loop through the URLs from the CSV file in my main script, which starts like this:

    # encoding: utf-8
    require 'nokogiri'
    require 'open-uri'
    require 'csv'
    require 'mechanize'

    @prices = Array.new
    @title = Array.new
    @description = Array.new
    @warranty = Array.new
    @leadtime = Array.new
    @urls = Array.new 
    @categories = Array.new
    @subcategories = Array.new
    @subsubcategories = Array.new

    urls = CSV.read("lotofurls.csv")
    (0..urls.length - 1).each do |index|

      puts urls[index][0]
      doc = Nokogiri::HTML(URI.open(urls[index][0]))

It looks like all I need to figure out is how to feed the URLs to the code that saves the image, but any help would be much appreciated!
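One way to connect the two pieces is to move the image-saving code into small methods that take the `doc` the main loop already builds. The following is a minimal sketch, not a tested drop-in: `image_urls`, `local_name`, `save_image` and the `images` output directory are hypothetical names, and the XPath is the one from the single-URL script.

```ruby
require 'open-uri'
require 'uri'
require 'fileutils'

# Resolve a possibly relative href against the URL of the page it came from.
def make_absolute(href, root)
  URI.parse(root).merge(URI.parse(href)).to_s
end

# Collect absolute image URLs from a parsed page.
# `doc` is assumed to respond to #xpath the way a Nokogiri document does.
def image_urls(doc, page_url, xpath = '//*[@id="zoom"]/@href')
  doc.xpath(xpath).map { |attr| make_absolute(attr.value, page_url) }
end

# Local file name for an image URL, ignoring any query string.
def local_name(uri)
  File.basename(URI.parse(uri).path)
end

# Download one image into dir and return the path that was written.
def save_image(uri, dir = 'images')
  FileUtils.mkdir_p(dir)
  path = File.join(dir, local_name(uri))
  File.open(path, 'wb') { |f| f.write(URI.open(uri).read) }
  path
end
```

Inside the main loop, right after `doc` is assigned, the call would then be something like `image_urls(doc, urls[index][0]).each { |uri| save_image(uri) }`. With around 20000 pages it is probably worth wrapping that call in a `rescue` so that one failed download does not stop the whole run.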

1 answer

You can make quick work of this with something like RMagick (or ImageMagick, MiniMagick, etc.).

For RMagick, you could do something like this:

    require 'rmagick'

    images.each do |image|
      url = image.url # should be a string
      Magick::Image.read(url).first.resize_to_fill(200, 200).write(image.desired_filename)
    end

That would write a 200x200 px image for each URL you provide. The resize_to_fill call is optional; leave it out to save each image at its original size. The library is very powerful and has many options. If you go this route, I'd recommend the RailsCasts episode on image manipulation: http://railscasts.com/episodes/374-image-manipulation
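The loop above assumes an `images` collection whose items respond to `url` and `desired_filename`, and the answer does not define either. As a rough sketch, such a collection could be built from a list of absolute image URLs like this; `ProductImage`, `build_images` and the numbered file-name scheme are assumptions made for illustration only.

```ruby
require 'uri'

# Hypothetical value object matching the interface the RMagick loop expects.
ProductImage = Struct.new(:url, :desired_filename)

# Build ProductImage items from absolute image URLs.
# File names get a zero-padded index prefix so that two images
# sharing the same base name do not overwrite each other.
def build_images(image_urls)
  image_urls.each_with_index.map do |url, i|
    base = File.basename(URI.parse(url).path)
    ProductImage.new(url, format('%05d_%s', i + 1, base))
  end
end

images = build_images([
  'http://www.sample.com/img/chair.jpg',
  'http://www.sample.com/img/large/chair.jpg'
])
images.each { |image| puts "#{image.url} -> #{image.desired_filename}" }
```

A five-digit prefix is enough for the 20042 products mentioned in the question; any other unique key, such as a product ID column from the CSV, would work just as well.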

The documentation is here if you want to get more advanced: http://rmagick.rubyforge.org/