Questions tagged [web-scraping]

0 votes · 0 answers · 4 views

Scraping the web with Java and downloading a video

I'm trying to scrape this 9gag link. I tried using Jsoup to get the HTML tag so I can take the source link and download the video directly. I tried with this code: public static void main(String[] args) throws IOException { Response response = Jsoup.connect("https://9gag.com/gag/a2ZG6Yd") .ignoreConte...
tassi4224
1 vote · 1 answer · 252 views

Scraping a table using the rvest package

I’m totally new to web scraping and I’m exploring the potentialities of the rvest library in R. I’m trying to scrape a table on wellbeing in Italian provinces from the following website, install.packages('rvest') library('rvest') url
Luca De Benedictis
1 vote · 1 answer · 163 views

Capturing JSON data from intermediate events using Selenium

Below I have set up a script which simply executes a search on a website. The goal is to capture JSON data with Selenium from an event that is fired by an intermediate script, namely the POST request to 'https://www.botoxcosmetic.com/sc/api/findclinic/FindSpecialists' as seen in the included i...
ikemblem
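
One way to capture that POST response is to record the browser's network traffic. A minimal sketch, assuming the third-party selenium-wire package; the entry-page URL and the plain UTF-8 decode are assumptions:

    import json
    from seleniumwire import webdriver  # pip install selenium-wire; wraps Selenium and records requests

    driver = webdriver.Chrome()
    driver.get("https://www.botoxcosmetic.com/find-a-clinic")  # hypothetical entry page
    # ... drive the search with ordinary Selenium calls here ...

    for request in driver.requests:
        if request.response and "FindSpecialists" in request.url:
            body = request.response.body.decode("utf-8")  # may need gzip handling
            print(json.loads(body))

    driver.quit()
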
1 vote · 2 answers · 266 views

Rcrawler scrape does not yield pages

I'm using Rcrawler to extract the infobox of Wikipedia pages. I have a list of musicians and I'd like to extract their name, DOB, date of death, instruments, labels, etc. Then I'd like to create a dataframe of all artists in the list as rows and the data stored as columns/vectors. The code below t...
Ben
1 vote · 2 answers · 768 views

Navigating through pagination with Selenium in Python

I'm scraping this website using Python and Selenium. I have the code working, but it currently only scrapes the first page. I would like to iterate through all the pages and scrape them all, but the site handles pagination in a weird way. How would I go through the pages and scrape them one by one? Paginati...
Abdul Jamac
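
A common pattern, independent of the site's exact markup, is to scrape the current page and then click the "next" control until it no longer exists. A minimal sketch; the URL and the a.next selector are placeholders:

    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException

    driver = webdriver.Chrome()
    driver.get("https://example.com/listings")  # placeholder URL

    while True:
        # ... existing single-page scraping logic goes here ...
        try:
            next_button = driver.find_element_by_css_selector("a.next")  # assumed selector
        except NoSuchElementException:
            break  # no next link: last page reached
        next_button.click()

    driver.quit()
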
1 vote · 2 answers · 340 views

UPS Tracking On Google Sheet does not work

I used the following formula to get the UPS live tracking feed, and it worked fine until yesterday. I think UPS has updated their site and this formula does not work anymore. Any ideas or suggestions for how to get the tracking update from UPS? =Index(IMPORTXML('https://wwwapps.ups.com/WebTracking/track?track=ye...
PrasadD
1 vote · 2 answers · 82 views

Checking if div class exists returns an error

I'm trying to web scrape some products from a webpage with BeautifulSoup after I log in. There is a case where a product is no longer available. The webpage has a div class like the following only on pages which do not have a product. There is an error. So I do: if soup.find_all('div', {'class': 'alert'}): print('A...
Evridiki
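
For reference, find_all returns an empty list when nothing matches, and an empty list is falsy, so the if check itself is safe; a small self-contained sketch with stand-in markup:

    from bs4 import BeautifulSoup

    html = "<div class='alert'>Product no longer available</div>"  # sample markup
    soup = BeautifulSoup(html, "html.parser")

    alerts = soup.find_all("div", {"class": "alert"})
    if alerts:
        print("Product unavailable:", alerts[0].get_text(strip=True))
    else:
        print("Product available")
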
1 vote · 2 answers · 94 views

How can I get only the name and contact number from a div?

I'm trying to get the name and contact number from a div, and the div has three spans, but the problem is that sometimes the div has only one span, sometimes two, and sometimes three. The first span has the name, the second span has other data, and the third span has the contact number. Here is the HTML: beth budinich See listing website (2...
Zubair Farooq
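
A sketch of one way to cope with the variable number of spans, assuming the name is always the first span and the contact number, when present, is the last one; the markup below is a stand-in for the real page:

    from bs4 import BeautifulSoup

    html = "<div><span>Beth Budinich</span><span>See listing website</span><span>(206) 555-0100</span></div>"

    soup = BeautifulSoup(html, "html.parser")
    spans = soup.div.find_all("span")

    name = spans[0].get_text(strip=True)
    contact = None
    if len(spans) > 1:
        last = spans[-1].get_text(strip=True)
        if any(ch.isdigit() for ch in last):  # only treat it as a contact number if it contains digits
            contact = last

    print(name, contact)
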
1 vote · 3 answers · 46 views

ValueError when scraping Tripadvisor for reviews with BeautifulSoup

I am trying to scrape some Tripadvisor reviews as a complete newbie to this. I'm using code from Susanli2016. It worked for one link (after removing the attribute 'language'), but it doesn't work for any other link (for example). I'm receiving the error: Traceback (most recent call last): File '',...
Sennheiser
1 vote · 2 answers · 36 views

BeautifulSoup findAll returns empty list when selecting class

findAll() returns an empty list when specifying a class; specifying tags works fine. import urllib2 from bs4 import BeautifulSoup url = 'https://www.reddit.com/r/Showerthoughts/top/?sort=top&t=week' hdr = { 'User-Agent' : 'tempro' } req = urllib2.Request(url, headers=hdr) htmlpage = urllib2.urlopen(req).rea...
Farhaan Mithagare
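
One likely cause is that the redesigned reddit pages build their class names with JavaScript, so a class copied from the browser's DOM often never appears in the raw HTML. A sketch of a workaround that skips HTML parsing entirely by requesting reddit's public .json view of the same listing (shown with requests for brevity):

    import requests

    url = "https://www.reddit.com/r/Showerthoughts/top/.json?sort=top&t=week"
    resp = requests.get(url, headers={"User-Agent": "tempro"})
    for post in resp.json()["data"]["children"]:
        print(post["data"]["title"])
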
1 vote · 1 answer · 40 views

BeautifulSoup can't find div with specific class

So for some background I have been trying to learn web scraping to get some images for machine learning projects involving CNNs. I have been trying to scrape some images from a site (HTML code on the left, my code on the right) with no luck; my code ends up printing/returning an empty list. Is there...
Gabriel Bello
0 votes · 0 answers · 4 views

How to select all classes that are above a certain tag using a CSS selector in Selenium (Python)?

Here I want to get all the class='result-row' elements which are above the 'h4' tag, not the ones which are below the 'h4' tag. My current code selects all of them: section = driver.find_element_by_css_selector("[class='rows']") result_rows = section.find_elements_by_css_selector('li.result-row') So how can I...
zero
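
CSS selectors cannot look ahead like this, but XPath can. A sketch using the preceding-sibling axis, assuming the result rows and the h4 are siblings inside the same 'rows' container (which may not match the real markup):

    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://example.com/search")  # placeholder URL

    marker = driver.find_element_by_css_selector("[class='rows'] h4")
    rows_above = marker.find_elements_by_xpath(
        "preceding-sibling::li[contains(@class, 'result-row')]"
    )
    print(len(rows_above))
    driver.quit()
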
1 vote · 1 answer · 390 views

How to get exact page content in wget if error code is 404

I have two URLs: one is a working URL and the other is a deleted-page URL. The working URL is fine, but for the deleted-page URL, instead of getting the exact page content, wget receives a 404. Working URL: import os def curl(url): data = os.popen('wget -qO- %s ' % url).read() print(url) print(len(data)) #print(data) cur...
Mounarajan
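
By default wget discards the response body when the status is an error (GNU wget has a --content-on-error option for this). A sketch of the same check with requests, which returns the body even for a 404; the URLs are placeholders:

    import requests

    def fetch(url):
        resp = requests.get(url)
        print(url, resp.status_code, len(resp.text))
        return resp.text

    fetch("https://example.com/working-page")
    fetch("https://example.com/deleted-page")
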
0 votes · 0 answers · 6 views

Python crawler finds desired tags in only first few divs

I am trying to scrape some images from a shopping site (https://www.grailed.com/shop/EkpEBRw4rw) but I am having some trouble with it, since the listings update as you scroll. I am trying to get the image source in the HTML tag below. The code I have been using is shown below: from bs4 import...
Gabriel Bello
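
For a listing that loads more items as you scroll, one common approach is to drive the page with Selenium, scroll until the page height stops growing, and only then hand the rendered HTML to BeautifulSoup. A sketch; the image selector is deliberately broad and would need narrowing:

    import time
    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://www.grailed.com/shop/EkpEBRw4rw")

    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # give the new listings time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    soup = BeautifulSoup(driver.page_source, "html.parser")
    images = [img.get("src") for img in soup.select("img")]
    driver.quit()
    print(len(images))
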
0 votes · 0 answers · 4 views

Why's my xpath returning both tables despite specifying the index?

I am using the Fuzi Swift library for parsing this hackernews page. I need to extract only the top description of the post which contains the main post's details (i.e. 'Maybe HN can help solve this little mystery.........low.com/a/55711457/2251982) Attached screenshot: Here's my xpath code: print('D...
Pranoy C
2 votes · 1 answer · 10 views

Table data returning empty values after web scraping

I tried to web scrape the table data from a binary signals website. The data updates after some time, and I wanted to get the data as it updates. The problem is, when I scrape the page it returns empty values. The table has a table tag. I'm not sure if it uses something else other than html because i...
Mark Gacoka
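
Empty cells in the fetched HTML usually mean the table is filled in by JavaScript after the page loads. A sketch of rendering the page in Selenium first and then parsing the live DOM; the URL and the wait condition are placeholders:

    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get("https://example.com/signals")  # placeholder URL

    # wait until at least one populated cell is present before grabbing the DOM
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "table td"))
    )
    tables = pd.read_html(driver.page_source)  # parses every <table> in the rendered page
    driver.quit()
    print(tables[0].head())
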
1 vote · 1 answer · 3k views

Getting "Invalid byte tag in constant pool: 19"

I am creating a web service and getting an error like org.apache.tomcat.util.bcel.classfile.ClassFormatException: Invalid byte tag in constant pool: 19. I am using Tomcat 8.0 and the Java version is 1.8.0.152.
Vishal
1 vote · 2 answers · 98 views

JSONP parsing in JavaScript/Node.js

If I have a string containing a JSONP response, for example 'jsonp([1,2,3])', and I want to retrieve the third element, 3, how could I write a function that does that for me? I want to avoid using eval. My code (below) works fine on the debug line, but returns undefined for some reason. function unwrap(...
ariel
1 vote · 1 answer · 53 views

Extract item price from HTML

I'm struggling to get the price from the following HTML: 7.99. I'm trying to extract the '7.99' from the above example. I've tried HTML.getElementsByClassName('pricing__now')(0).innertext but am drawing a blank. Any help kindly received. Many thanks in advance. Ian
IanG
1 vote · 3 answers · 40 views

Unable to parse streetAddresses out of a json response

I've written a script to fetch street-addresses out of a json response but I can't reach that portion. The structure seems a bit complicated to me. Link to the json content A chunk of the response containing street-addresses: [[44189579,25735941,-80305513,'$640K',1,0,0,0,['$640K',4,3.0,1963,false,nu...
robots.txt
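
When the response is a deeply nested array like this, one generic approach in Python is to walk the whole structure, collect every string, and then keep the ones that look like street addresses. The sample data and the address pattern below are stand-ins; the real response would be json.loads()'d and walked the same way:

    import re

    sample = [[44189579, ["$640K", 4, 3.0, ["123 NW 12th Ave", {"city": "Miami"}]]]]  # made-up data

    def strings_in(node):
        # Recursively yield every string inside nested lists/dicts.
        if isinstance(node, str):
            yield node
        elif isinstance(node, dict):
            for value in node.values():
                yield from strings_in(value)
        elif isinstance(node, (list, tuple)):
            for item in node:
                yield from strings_in(item)

    # crude "number followed by words" test standing in for a street-address check
    addresses = [s for s in strings_in(sample) if re.match(r"^\d+\s+\w+", s)]
    print(addresses)  # ['123 NW 12th Ave']
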
1 vote · 1 answer · 125 views

Web Scraping the screen image from an interactive web map

I need to extract the map component as a static image from: http://www.bom.gov.au/water/landscape/#/sm/Relative/day/-35.30/145.17/5/Point////2018/12/16/ This page contains a Leaflet-based interactive web map, in which the layer data is updated daily via web mapping services. The extracted image sho...
alextc
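
One pragmatic option is to let a real browser render the map and screenshot just the map element. A sketch with Selenium; the leaflet-container selector and the fixed sleep are assumptions:

    import time
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("http://www.bom.gov.au/water/landscape/#/sm/Relative/day/-35.30/145.17/5/Point////2018/12/16/")
    time.sleep(10)  # allow the map tiles to finish loading

    map_element = driver.find_element_by_css_selector("div.leaflet-container")  # assumed selector
    map_element.screenshot("soil_moisture.png")  # element-level screenshot
    driver.quit()
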
1 vote · 1 answer · 45 views

BeautifulSoup web scraper problem: can't find tables on webpage

I want to get tables from this website with this code: from urllib.request import urlopen as uReq from bs4 import BeautifulSoup as soup my_url = 'https://www.flashscore.pl/pilka-nozna/' uClient = uReq(my_url) page_html = uClient.read() uClient.close() page_soup = soup(page_html, 'html.parser') conta...
ak27
1 vote · 3 answers · 32 views

Scrape a table iterating over pages of a website: how to define the last page?

I have the following code that works OK: import requests from bs4 import BeautifulSoup import pandas as pd df_list = [] for i in range(1, 13): url = 'https://www.uzse.uz/trade_results?date=25.01.2019&mkt_id=ALL&page=%d' %i df_list.append(pd.read_html(url)[0]) df = pd.concat(df_list) df But for this...
AK88
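
When the page count isn't known in advance, one option is to stop iterating as soon as a page comes back without a table (or with an empty one). A sketch built on the code from the question; depending on how the site responds past the last page, the stop condition may need adjusting:

    import pandas as pd

    df_list = []
    page = 1
    while True:
        url = "https://www.uzse.uz/trade_results?date=25.01.2019&mkt_id=ALL&page=%d" % page
        try:
            table = pd.read_html(url)[0]
        except (ValueError, IndexError):  # read_html raises ValueError when no table is found
            break
        if table.empty:  # an empty table also means we ran past the last page
            break
        df_list.append(table)
        page += 1

    df = pd.concat(df_list, ignore_index=True)
    print(len(df))
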
1 vote · 2 answers · 49 views

Trouble getting the second link when the first link has a certain keyword right next to it

I've created a script in Python with Selenium to get the first link (populated by duckduckgo.com) for any search term, unless the keyword Ad is right next to that link, as in the image below. If the first link has that keyword next to it, then the script will get the second link and quit. M...
robots.txt
1 vote · 2 answers · 34 views

Cannot read a CSV with URLs to web scrape them in Python

I am brand new to Python, so I tried the following with Visual Studio and Windows 7: import csv from bs4 import BeautifulSoup import requests contents = [] with open('websupplies.csv','r') as csvf: # Open file in read mode urls = csv.reader(csvf) for url in urls: contents.append(url) # Add each url...
Nikos
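
Worth noting that csv.reader yields a list per row, so each appended item above is a one-element list rather than a URL string. A sketch that indexes into the row before requesting it, keeping the file name from the question:

    import csv
    import requests
    from bs4 import BeautifulSoup

    contents = []
    with open("websupplies.csv", "r") as csvf:
        for row in csv.reader(csvf):
            if row:                      # skip blank lines
                contents.append(row[0])  # the URL is the first (only) column

    for url in contents:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        print(url, soup.title)
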
1 vote · 1 answer · 43 views

Unable to make my script stop printing wrong result

I've created a script in VBA using IE to fill in a few inputs on a webpage in order to reach a new page and check some items' availability, based on values entered in an input box. To walk you through what the script is currently doing: select Buy Bricks from the landing page, enter age 30 and coun...
robots.txt
1 vote · 4 answers · 87 views

Pulling current stock price (Yahoo) with beautifulsoup

I'm having issues using Beautiful Soup (Python 3) to pull the latest stock price: import requests from money import Money from bs4 import BeautifulSoup response = requests.get('https://finance.yahoo.com/quote/VTI?p=VTI') soup = BeautifulSoup(response.content, 'lxml') price = soup.find('span', attrs =...
user3654225
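
The visible price span on the Yahoo quote page is produced by JavaScript and its class names change, so matching it by class in the raw HTML is fragile. A sketch of one workaround that pulls the number from the JSON embedded in the page instead; the regularMarketPrice key is an assumption about Yahoo's markup at the time and may change:

    import re
    import requests

    response = requests.get(
        "https://finance.yahoo.com/quote/VTI?p=VTI",
        headers={"User-Agent": "Mozilla/5.0"},
    )
    match = re.search(r'"regularMarketPrice":\{"raw":([\d.]+)', response.text)
    if match:
        print(float(match.group(1)))
    else:
        print("price not found; the embedded JSON layout may have changed")
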
1 vote · 4 answers · 57 views

Extracting from script - beautiful soup

How would the value for the 'tier1Category' be extracted from the source of this page? https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product soup.find('script') returns only a subset of the source, and the following returns another source within that co...
dirtyw0lf
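
soup.find('script') only returns the first script tag; iterating over all of them and searching the one that mentions the key is one way to get at the value. A sketch, assuming tier1Category appears as a quoted "key":"value" pair somewhere in a script:

    import re
    import requests
    from bs4 import BeautifulSoup

    url = ("https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/"
           "ID=prod6205762-product")
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
    soup = BeautifulSoup(html, "html.parser")

    for script in soup.find_all("script"):
        text = script.string or ""
        match = re.search(r'"tier1Category"\s*:\s*"([^"]+)"', text)
        if match:
            print(match.group(1))
            break
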
1 vote · 2 answers · 57 views

How can I create a dataframe from data I scraped from a website?

I'm trying to scrape job postings data from a website, and the output looks like this: [{'job_title': 'Junior Data Scientist', 'company': '\n\n BBC', 'summary': '\n We're now seeking a Junior Data Scientist to come and work with our Marketing & Audiences team in London. The D...
Harshmallo
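
A list of dicts with consistent keys converts straight into a pandas DataFrame, one column per key; stray whitespace can then be stripped column by column. A small sketch with stand-in records shaped like the output above:

    import pandas as pd

    jobs = [
        {"job_title": "Junior Data Scientist", "company": "\n\n  BBC",
         "summary": "\n  We're now seeking a Junior Data Scientist..."},
        {"job_title": "Data Analyst", "company": "Example Ltd",
         "summary": "Analyse things."},
    ]  # stand-in for the scraped output

    df = pd.DataFrame(jobs)
    df["company"] = df["company"].str.strip()
    df["summary"] = df["summary"].str.strip()
    print(df)
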
1 vote · 2 answers · 123 views

How does reCAPTCHA v3 know I'm using Selenium/ChromeDriver?

I'm curious how reCAPTCHA v3 works, specifically the browser fingerprinting. When I launch an instance of Chrome through Selenium/ChromeDriver and test against reCAPTCHA v3 (https://recaptcha-demo.appspot.com/recaptcha-v3-request-scores.php), I always get a score of 0.1 when using selenium/chromedriver...
jamie
1 vote · 2 answers · 88 views

How to deal with the captcha when doing Web Scraping in Puppeteer?

I'm using Puppeteer for web scraping and I have just noticed that sometimes the website I'm trying to scrape asks for a captcha, due to the number of visits I'm making from my computer. The captcha form looks like this one. So, I would need help with how to handle this. I have been thinking about se...
AdriánT95
1 vote · 1 answer · 46 views

Unable to let puppeteer browse newly collected links reusing the same browser

I've created a script in node in combination with puppeteer to scrape the links of different posts from a site's landing page, and my script is doing this flawlessly. Although the content of that site is static, I used puppeteer to see how it performs, as I'm very new to this. What I wish to do now i...
robots.txt
1 vote · 2 answers · 30 views

Extracting text from 'value' attribute using beautifulsoup

HTML code: I would like to extract the text in the 'value' attribute ('1435.95'). I tried doing it by executing the following code, but no luck: driver.get(someURL) page = driver.page_source soup = BeautifulSoup(page, 'lxml') price = soup.find('td', {'id' : re.compile('ContentPlaceHolder1_ContentPlaceHolde...
rkrox 907
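
Once the element is found, attributes are read with dictionary-style access rather than .text. A self-contained sketch with stand-in markup (on the real page the value may live on an input inside the cell instead of on the td itself):

    from bs4 import BeautifulSoup

    html = '<td id="ContentPlaceHolder1_Price" value="1435.95">...</td>'  # sample markup
    soup = BeautifulSoup(html, "html.parser")

    cell = soup.find("td", id=lambda x: x and x.startswith("ContentPlaceHolder1"))
    print(cell["value"])      # -> 1435.95
    print(cell.get("value"))  # same thing, but returns None instead of raising if missing
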
0 votes · 1 answer · 20 views

My web scraping code with BeautifulSoup doesn't go past the first page

It doesn't seem to go past the first page. What's wrong? import requests from bs4 import BeautifulSoup for i in range (1,5): url = 'https://www.nairaland.com/search/ipob/0/0/0/{}'.format(i) the_word = 'is' r = requests.get(url, allow_redirects=False) soup = BeautifulSoup(r.content, 'lxml') words =...
introvertme
0 votes · 0 answers · 4 views

Python web scrape from multiple columns

I am trying to pull data from various columns in the odds table from this website: https://www.sportsbookreview.com/betting-odds/nba-basketball/totals/?date=20190419 I have tried using the following code but I am only getting the open lines. I want to be able to get exact columns. For example, the p...
stekcar
0 votes · 0 answers · 11 views

Scraping a website into Excel, segregated into headings and contents, when both have the same class and tags

I am trying to web scrape http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html, segregating it into two parts, heading and content. The problem is that both have the same class and tags. Other than using regex and hard coding, how do I distinguish them and extract them into two columns in Excel...
Pavan SN
0 votes · 1 answer · 19 views

How to scrape a website to which I post information

I want to scrape announcement information from https://nseindia.com/corporates/corporateHome.html?id=allAnnouncements. Specifically, I want to go to the Corporate Information tab on the left-hand side of the website and then open the corporate announcements link under equities. After that I want to post i...
user159944
0 votes · 0 answers · 13 views

Using ProcessPoolExecutor for Web Scraping: How to get data back to queue and results?

I have written a program to crawl a single website and scrape certain data. I would like to speed up its execution by using ProcessPoolExecutor. However, I am having trouble understanding how I can convert from single-threaded to concurrent. Specifically, when creating a job (via ProcessPoolExecu...
xibalba1
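
The usual shape for this conversion is to have each worker return its scraped record and collect the results from the futures in the parent process, rather than sharing a queue between processes. A generic sketch with placeholder URLs and a trivial scrape function:

    from concurrent.futures import ProcessPoolExecutor, as_completed
    import requests

    def scrape(url):
        # Worker: fetch one page and return whatever was extracted from it.
        resp = requests.get(url)
        return {"url": url, "status": resp.status_code, "length": len(resp.text)}

    def main():
        urls = ["https://example.com/page/%d" % i for i in range(1, 6)]  # placeholder URLs
        results = []
        with ProcessPoolExecutor(max_workers=4) as pool:
            futures = [pool.submit(scrape, url) for url in urls]
            for future in as_completed(futures):
                results.append(future.result())
        print(results)

    if __name__ == "__main__":  # guard required with the spawn start method
        main()
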
1 vote · 1 answer · 50 views

How to save the URLs from a for loop into a single variable?

I want to store multiple URLs in a single variable, 'URLs'. Those URLs are made up of three parts, 'urlp1', 'n' and 'urlp2', which you can see in the code below. urlp1 = 'https://www.proteinatlas.org/' URLs = [] for cancer in cancer_list: urlp2 = '/pathology/tissue/' + cancer[1] f = cancer[0] t...
Rujun Guan
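
A sketch of the append pattern, assuming cancer_list holds (identifier, tissue) pairs as in the question; the sample entries are made up:

    urlp1 = "https://www.proteinatlas.org/"
    cancer_list = [("ENSG00000169083-AR", "breast+cancer"),
                   ("ENSG00000141510-TP53", "lung+cancer")]  # made-up sample entries

    URLs = []
    for cancer in cancer_list:
        urlp2 = "/pathology/tissue/" + cancer[1]
        f = cancer[0]
        URLs.append(urlp1 + f + urlp2)  # every constructed URL ends up in the one list

    print(URLs)
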
1 vote · 0 answers · 45 views

How to get table data from HTTPSConnection using Python

I would like some help getting the data via HTTPSConnection, since the webpage is ASP.NET and the data cannot be retrieved with Beautiful Soup directly. I implemented the code below: import http.client import requests from urllib.request import urlopen from bs4 import BeautifulSoup conn = h...
Hannah
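
A common pattern for ASP.NET pages is that the table only comes back after a form POST that echoes the hidden __VIEWSTATE and __EVENTVALIDATION fields, so it can help to fetch the page once, copy those fields, and post the form with a persistent session before parsing. A sketch with a placeholder URL and without the page's real form fields:

    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()
    url = "https://example.com/Default.aspx"  # placeholder for the ASP.NET page

    page = session.get(url)
    soup = BeautifulSoup(page.text, "html.parser")

    form = {
        "__VIEWSTATE": soup.find("input", {"name": "__VIEWSTATE"})["value"],
        "__EVENTVALIDATION": soup.find("input", {"name": "__EVENTVALIDATION"})["value"],
        # plus whatever search fields the page's form actually expects
    }
    result = session.post(url, data=form)
    tables = BeautifulSoup(result.text, "html.parser").find_all("table")
    print(len(tables))
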
