Questions tagged [web-scraping]

0

votes
0

answer
11

Views

Scraping a website into Excel, separated into headings and contents: both have the same class and tags, so how do I separate them?

I am trying to scrape http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html and separate it into two parts, heading and content. The problem is that both have the same class and tags. Other than using regex and hard coding, how can I distinguish them and extract them into two columns in Excel...
Pavan SN
0

votes
1

answer
19

Views

How to scrape a website in which I post information

I want to scrape announcement information from https://nseindia.com/corporates/corporateHome.html?id=allAnnouncements. Specifically, I want to go to the Corporate Information tab on the left-hand side of the website and then open the corporate announcements link under equities. After that I want to post i...
user159944
0

votes
0

answer
13

Views

Using ProcessPoolExecutor for Web Scraping: How to get data back to queue and results?

I have written a program to crawl a single website and scrape certain data. I would like to speed up its execution by using ProcessPoolExecutor. However, I am having trouble understanding how to convert it from single-threaded to concurrent. Specifically, when creating a job (via ProcessPoolExecu...
xibalba1
1

votes
1

answer
50

Views

How to save the URLs from a for loop into a single variable?

I want to store multiple URLs in a single variable, 'URLs'. Each URL is made up of three parts, 'urlp1', 'n' and 'urlp2', which you can see in the code below. urlp1 = 'https://www.proteinatlas.org/' URLs = [] for cancer in cancer_list: urlp2 = '/pathology/tissue/' + cancer[1] f = cancer[0] t...
Rujun Guan
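
A minimal sketch of the general pattern the question asks about: collecting every URL built inside a loop into one list. The excerpt's code is truncated, so the helper name `build_urls` and the sample identifiers and suffix below are assumptions made for illustration, not the asker's data.

```python
def build_urls(base, middles, suffix):
    """Return one list holding base + middle + suffix for every middle part."""
    urls = []
    for middle in middles:
        # Append inside the loop so no URL is overwritten by the next one.
        urls.append(f"{base}{middle}{suffix}")
    return urls


# Hypothetical inputs: the real middle parts and suffix are not shown in the excerpt.
urlp1 = "https://www.proteinatlas.org/"
sample_ids = ["ID_A", "ID_B"]
urlp2 = "/pathology/tissue/example"

URLs = build_urls(urlp1, sample_ids, urlp2)
for url in URLs:
    print(url)
```

Appending to a list created before the loop keeps every result; assigning to a plain variable inside the loop would keep only the last one.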
1

votes
0

answer
45

Views

How to get table data from HTTPSConnection using Python

I would like help getting data through HTTPSConnection, since the webpage is ASP.NET and the data cannot be retrieved with Beautiful Soup directly. I implemented the code below: import http.client import requests from urllib.request import urlopen from bs4 import BeautifulSoup conn = h...
Hannah
1

votes
0

answer
84

Views

Removing blank lines in csv output on Windows 10 without an error

I'm using Scrapy 1.4.0 and Python 3.6.3. When I run 'scrapy crawl -o items.csv', every other line of the csv file is blank. While I did find a solution here, it generates the following error: 2017-12-31 18:39:48 [scrapy.utils.signal] ERROR: Error caught on signal handler: Traceback (most recent...
Dan
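
For background, blank rows between records on Windows are commonly caused by the CSV writer's "\r\n" line terminator being translated a second time by a file opened in text mode. The sketch below shows that mechanism with Python's standard csv module only; it is not Scrapy's exporter code, and whether the same change applies to the asker's Scrapy setup is an assumption. The function name `write_rows` and the sample rows are hypothetical.

```python
import csv
import os
import tempfile


def write_rows(path, rows):
    """Write rows to a CSV file without doubled line endings."""
    # newline="" disables newline translation, so the writer's "\r\n"
    # terminator is stored as-is instead of becoming "\r\r\n" on Windows.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


# Hypothetical sample data written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "items.csv")
write_rows(demo_path, [["name", "price"], ["widget", "9.99"]])

with open(demo_path, "rb") as handle:
    print(handle.read())
```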
1

votes
2

answer
71

Views

Fully Iterate/Scrape an HTML document in JavaScript

I am fairly new to web development, including HTML/CSS and JavaScript. Is there a way to scrape a whole HTML document, looking for certain patterns in its inner text, using only vanilla JS? I need to extract/identify different forms of IP addresses from the document, even those that are not mar...
DoeJ123
1

votes
1

answer
48

Views

Counting words in HTML documents

I want to count words in HTML articles using R. Scraping data like titles works nicely and I was able to download the articles (code below). Now I want to count words in all of those articles, for example the word 'Merkel'. It seems to be a bit complicated. I was able to make it work with the headlines...
matt
1

votes
1

answer
86

Views

LXML xpath is stripping output of brackets

I'm trying to scrape SEC financial filings for data. Here is a link to an example table: target_page = 'https://www.sec.gov/Archives/edgar/data/1564408/000156459017022434/R4.htm' In the target_page's source code, a table cell with numeric output is tagged with somevalue. If the value is negative, it...
Phil Dwan
1

votes
0

answer
41

Views

Python Scraper - Request Post Function Not Returning Correct Page

I am working on my first website scraper and have run into another issue. Below is my code. The website that is returned is the main page, not the specific page for the parcel number I searched. Am I using the wrong HTML class to identify the search function? Or is there something missing in the Py...
Taylor29
1

votes
1

answer
34

Views

Python Scraper - Find Data in Column

I am working on my first website scraper and am trying to get the number 41,110 that is saved in a column on the webpage https://mcassessor.maricopa.gov/mcs.php?q=14014003N. Below is my code. How can I get to this number and print it? from bs4 import BeautifulSoup import requests web_page = 'https...
Taylor29
1

votes
0

answer
135

Views

Click on HTML elements with Scrapy (WebScraping)

I'm writing a program in C# using ScrapySharp or HtmlAgilityPack. The drawback is that part of the information I need only appears when I click on an HTML element (button, link). Some forums mention that Selenium can manipulate HTML elements, so I...
Xime Zabala
1

votes
0

answer
48

Views

BeautifulSoup4 Not detecting Select tag

I am trying to learn web scraping. I need to get all the select tags in the web page, but they are not being detected by BeautifulSoup4. Here is my code in Python 3: from urllib.request import Request, urlopen from bs4 import BeautifulSoup import lxml req = Request('https://www.nseindia.com/'...
Abhishek Gangadhar
1

votes
0

answer
271

Views

Python save Stream File to Local File

I'm working on a Python scraping project for politician videos. I have isolated this link (and others like it): http://vod.europarl.europa.eu/wmv/nas/nasvod02/vod0804/2014/wm/VODUnit_20140414_20110100_20125200_63884a681455e14a75888c.wmv It downloads what looks like a streaming video file, but it's o...
Clive
1

votes
0

answer
281

Views

Trying to read a webpage with urllib and getting meta data

I'm new to scraping sites, but here's what I've been able to put together so far: page = urlopen( Request( https://www.example.com, data = None, headers={ 'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11' } ) ) print(page.read()) After setting the User-Agent I no longe...
user3457834
1

votes
1

answer
187

Views

Scrape a website with hidden parameters using the requests library

So I am trying to scrape the following webpage, which is protected by a login page. Nonetheless, when I run the code it keeps redirecting me to the login page. I think this might be due to the fact that the login page has a hidden parameter, though I don't really know how to integrate it into my payload...
Fozoro
1

votes
1

answer
86

Views

ImportXML for Coinbase Prices

I am trying to import the current value of the four cryptocurrencies that Coinbase offers into Google Sheets. Relevant HTML Line: $14,188.72 Formula: =importxml('https://www.coinbase.com/charts', '//div[starts-with(@class,'PriceChart')]') I expected to get the value of each
Steven
1

votes
0

answer
304

Views

wget download webpage with images for local viewing

Based on the discussion here: download-a-working-local-copy-of-a-webpage, I am using the following command to download a webpage with its images: wget --default-page -q -p -k http://mattvh.github.io/solar-theme-jekyll/index.html -O ./page_source/ex.html For some URLs this works, for others (as the...
Sandeep
1

votes
2

answer
130

Views

Python: how to extract data from a text?

I used the BeautifulSoup library to get data from a webpage http://open.dataforcities.org/details?4[]=2016 import urllib2 from bs4 import BeautifulSoup soup = BeautifulSoup(urllib2.urlopen('http://open.dataforcities.org/details?4[]=2016').read()) Now soup looks like the following (I show just a part of...
emax
1

votes
1

answer
65

Views

Completing web forms and ingesting the responses with R?

So, here's the current situation: I have 2000+ lines of R code that produce a couple dozen text files. This code runs in under 10 seconds. I then manually paste each of these text files into a website, wait ~1 minute for the website's response (they're big text files), then manually copy and paste...
Richard Becker
1

votes
0

answer
236

Views

R - How to enter value to input field on webpage and get resulting page for scraping?

I'm working with this genealogical website that has a single input field. I would like to programmatically enter a surname and have the resulting page returned as an object that I can then scrape. I have some experience with the scraping part but not working with input fields. Where would I look to...
Conner M.
1

votes
0

answer
156

Views

How to mimic an XHR request in Ahrefs.com with Python?

I am trying to scrape data from a page that comes from an XHR request. The request is made when the user clicks a link. I've been trying to mimic the request with my scraper, by using the hash in the link's 'onClick' attribute. I can get it to work for the first link, but I need to iterate each of t...
Jake 1986
1

votes
0

answer
99

Views

HTML split on a given character

So I am using Beautiful Soup to read the HTML of a page. req = urllib.request.Request('https://en.wikipedia.org/wiki/Barack_Obama', headers = headers) html = urllib.request.urlopen(reqx) page = BeautifulSoup(html,'html.parser') I want to split the HTML code on periods, on the condition that it does no...
Kiran Baktha
1

votes
0

answer
79

Views

How to get the download link hidden behind a radio button?

I'm trying to download a csv file from this link. I know from this thread that we need to use the requests library to get the link by first submitting the form, in this case, to let the server know we want the csv file. However, since I'm not familiar with html and the previous example has now an up...
George Liu
0

votes
1

answer
146

Views

Scraping Table From Sports Page - AdBlock Interfering

I'm trying to get the 6th (or 'Advanced') table from http://www.sports-reference.com/cbb/schools/duke/2010.html. Using htmltab or XML, I have been able to scrape tables 1 through 3 using the integer reference (i.e. 1 for the first table, 2 for the second, etc.) or the XPath. I can't scrape tables 4, 5, or...
Cody Becker
1

votes
0

answer
76

Views

Scrape different levels of web pages

I am doing research on World Bank (WB) projects in developing countries. To do so, I am scraping their website in order to collect the data I am interested in. The structure of the webpage I want to scrape is the following: List of countries: the list of all countries in which WB has developed proj...
Helena
1

votes
1

answer
72

Views

Using python to scrape push data?

I'm trying to scrape the left side of this news site (= SENESTE NYT): https://www.dr.dk/nyheder/ But the data seems to be nowhere to be found, neither in the HTML nor in the related API/JSON etc. Is it some kind of push data? Using Chrome's Network console I've found this API but it doesn't contain the...
bib
1

votes
0

answer
37

Views

Change value at Onchange event without an ID

I have this problem: I'm trying to automate a PowerShell script that changes an onchange event on a website. I have to do this to scrape the table afterwards. The problem is that the onchange does not have an ID to select it by. Is it possible to do? Last 24 hours Last 7 days Last 30 days
Klask
1

votes
0

answer
298

Views

How to crawl ASP.NET web applications

I'm writing an application which needs to communicate with an ASP.NET website that doesn't have an API. I'm using Python 3.5.2 with Requests 2.18.4 to achieve my purpose. The problem is that the site uses __doPostBack(), so I achieved my goal using Selenium, but it was slow and it opened a browser for e...
Iman Kermani
1

votes
0

answer
39

Views

What happens after rvest submit?

I'm pretty new to scraping and there's something I don't get about rvest. There's no problem with scraping a page, but if I want to write a script to submit a form and then scrape the results, I don't understand it. I mean, ok, I found the form to fill, I found the submit button, I ran th...
Maddz
1

votes
0

answer
1.2k

Views

Scrapy and python: DNS lookup failed: no results for hostname lookup - proxy issue?

I am trying to use Scrapy and Python to scrape some pages from within my company's IT and network. I started by using the scrapy tutorial from here https://doc.scrapy.org/en/latest/intro/tutorial.html When I try the code identical to the one on the tutorials page, I get the error: 2018-01-24 11:49:0...
Cactus
1

votes
1

answer
32

Views

How do I Access Specific Elements of a Webpage for Import into Pandas

I have this code that scrapes a website for menu information. I have got it working so that it gets the text from this week menu items: #Weekly Breakfast Menu import requests from bs4 import BeautifulSoup page = requests.get('https://trinity.campusdish.com/Commerce/Catalog/Menus.aspx?LocationId=103...
Christopher Reid
1

votes
0

answer
308

Views

How do I iterate a Zillow API call / output parse function in R over a list of addresses and zip codes?

I've been working on a project where the goal is to take a two-column CSV of street addresses and zip codes, read it into R, then perform a Zillow query for each one (GetSearchResults, specifically), parse the output, and store the parsed output in a dataframe to be written to a CSV (and placed righ...
Sean
1

votes
1

answer
75

Views

Can't get a valid response from a webpage containing json data

I've written a script in Python to get a response from a webpage. The data in that webpage are in JSON format. However, when I try like below I get an error. Can somebody give me a workaround so that I can get a valid response? Here is my failed attempt: import requests import json URL = 'https:/...
SIM
1

votes
1

answer
676

Views

IE Automation Download

My question is about IE automation in VBA. I got my code to work through the pages and click on the download button. After searching for hours, I came across this page on Stack Overflow: Automate saveas dialogue for IE9 (vba) The code they suggested worked perfectly for saving pending files that need...
Armani
1

votes
0

answer
187

Views

Extract HTML Table from Web Page source in Unity

I built a script in C# for Unity to read and download a specific webpage's source into a text file. What I really want to achieve is to extract only the HTML table data from these pages; for example, I want to remove all the lines from DOCTYPE html PUBLIC to table class='formular' to extract table & d...
Dorin Buraca
1

votes
2

answer
32

Views

How to send scraped revenue data to my tool?

Because a client doesn't have a dataLayer, I'm trying to send scraped revenue data from a thank-you page to a Facebook pixel that's being deployed through GTM. I have document.querySelectorAll('td')[8].textContent from another useful post, but it's giving me a string with spaces and a currency symbol....
Rafael Lopez
1

votes
1

answer
207

Views

Web Scraping using python and generating a price on a site

So basically I am doing a school project that involves web scraping. I understand how to use Python and incorporate web scraping, but how do I put that scraped data onto a website? If it helps, I am making a website that pulls prices from other sites and displays them on mine (like Trivago). I can...
Kaleb Bowen
1

votes
1

answer
28

Views

Overcoming a break in a span in Python

I use the code below to scrape a website. The 'phone' is a challenge. To make it functional I had to use a find all, but would much prefer to use just a find. When I look for it specifically it returns this as a result (I removed the phone number): (###) ###-#### As a result, I can only get and it re...
Justin Guy
1

votes
1

answer
67

Views

Handle Key-Error whilst scraping

I am currently working on a script to scrape data from ClinicalTrials.gov. To do this I have written the following script: def clinicalTrialsGov (id): url = 'https://clinicaltrials.gov/ct2/show/' + id + '?displayxml=true' data = BeautifulSoup(requests.get(url).text, 'lxml') studyType = data.study_ty...
jdeo
