Questions tagged [beautifulsoup]

1 vote · 1 answer · 15 views

Get href links & text in a Python loop

I need to scrape information from the Apple App Store. I have a hashmap hashmap_genre_link with a genre as the key and a URL as the value ( {'Games': 'https://itunes.apple.com/us/genre/ios-games/id6014?mt=8' ; ...} ), and for each key I want to create another hashmap with the iOS app name (text) as the key and the app URL as the value: games_apps:{'Pokem...
userHG
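
A minimal sketch of that loop, given a parsed genre page; the '/app/' filter is a heuristic for spotting app links and may need adjusting to the actual markup:

    import requests
    from bs4 import BeautifulSoup

    hashmap_genre_link = {'Games': 'https://itunes.apple.com/us/genre/ios-games/id6014?mt=8'}
    apps_by_genre = {}                                   # hypothetical name for the per-genre result
    for genre, url in hashmap_genre_link.items():
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        apps = {}
        for a in soup.find_all('a', href=True):
            if '/app/' in a['href'] and a.get_text(strip=True):   # keep only app links (heuristic)
                apps[a.get_text(strip=True)] = a['href']          # app name -> app URL
        apps_by_genre[genre] = apps
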
1 vote · 1 answer · 1.3k views

Installed BeautifulSoup but still get no module named bs4

I'm using a Jupyter notebook, Python 3.5, and a virtual environment. Within my virtual env I did: (venv) > pip install BeautifulSoup4 which seemed to run fine because the terminal output was: Downloading beautifulsoup4-4.6.0-py2-none-any.whl (86kB) 100% |████████████████...
14wml
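
One common cause is that the notebook kernel runs a different interpreter than the one pip installed into. A sketch that installs the package for the interpreter the notebook is actually using (note the import name is bs4, not BeautifulSoup4):

    import sys, subprocess

    # install beautifulsoup4 for the exact interpreter backing this kernel
    subprocess.check_call([sys.executable, "-m", "pip", "install", "beautifulsoup4"])

    from bs4 import BeautifulSoup   # the package installs as beautifulsoup4 but imports as bs4
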
1 vote · 1 answer · 192 views

How to extract href from an <a> element using lxml CSSSelector?

def extract_page_data(html): tree = lxml.html.fromstring(html) item_sel = CSSSelector('.my-item') text_sel = CSSSelector('.my-text-content') time_sel = CSSSelector('.time') author_sel = CSSSelector('.author-text') a_tag = CSSSelector('.a') for item in item_sel(tree): yield {'href': a_tag(item)[0].te...
elrich bachman
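
A sketch of the usual fix, assuming the link is a plain <a> inside each item: select the tag with the selector 'a' (not the class selector '.a') and read the attribute with .get('href') rather than .text:

    import lxml.html
    from lxml.cssselect import CSSSelector

    def extract_page_data(html):
        tree = lxml.html.fromstring(html)
        item_sel = CSSSelector('.my-item')
        a_sel = CSSSelector('a')                 # tag selector; '.a' would look for class="a"
        for item in item_sel(tree):
            links = a_sel(item)
            if links:
                yield {'href': links[0].get('href'),              # href is an attribute, not text
                       'text': links[0].text_content().strip()}
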
1 vote · 2 answers · 29 views

beautifulsoup get value of attribute using get_attr method

I'd like to print all items in the list except those whose style attribute has the following value: 'text-align: center'. test = soup.find_all('p') for x in test: if not x.has_attr('style'): print(x) Essentially, return all items in the list where style is not equal to 'text-align: center'. Probably jus...
dataviews
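
A minimal sketch of that filter, given the parsed soup; .get('style') returns None when the attribute is absent, so both missing and non-matching styles pass through:

    for p in soup.find_all('p'):
        if p.get('style') != 'text-align: center':
            print(p)
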
1 vote · 2 answers · 53 views

Is there a regular expression for finding all question sentences from a webpage?

I am trying to extract some questions from a website using BeautifulSoup, and want to use a regular expression to get these questions from the web. Is my regular expression incorrect? And how can I combine soup.find_all with re.compile? I have tried the following: from bs4 import BeautifulSoup import...
user3741679
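
find_all accepts a compiled pattern for its string argument, which is one way to combine the two; the pattern below is a rough illustration that keeps strings ending in a question mark, given an already parsed soup:

    import re

    questions = soup.find_all(string=re.compile(r'\?\s*$'))   # text nodes that end with '?'
    for q in questions:
        print(q.strip())
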
1 vote · 2 answers · 41 views

Getting full html back from a website request using Python

I'm trying to send an HTTP request to a website (for example, Digi-Key) and read back the full HTML. For example, I'm using this link: https://www.digikey.com/products/en?keywords=part_number to get a part number such as: https://www.digikey.com/products/en?keywords=511-8002-KIT. However, what I get back i...
ItM
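
One frequent cause of a trimmed response is that the site serves different markup to clients without a browser-like User-Agent; content built by JavaScript after load needs a browser driver instead. A hedged sketch of the header approach:

    import requests

    headers = {'User-Agent': 'Mozilla/5.0'}             # minimal browser-like header
    resp = requests.get('https://www.digikey.com/products/en',
                        params={'keywords': '511-8002-KIT'},
                        headers=headers)
    print(len(resp.text))                                # compare with and without the header
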
1 vote · 2 answers · 82 views

Checking if div class exists returns an error

I am trying to scrape some products from a webpage with BeautifulSoup after I log in. There is a case where a product is no longer available. The webpage has a div class like the following only on pages which do not have a product: There is an error. So I do: if soup.find_all('div', {'class': 'alert'}): print('A...
Evridiki
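
A small sketch of the existence check, given the parsed soup; find returns None when nothing matches, so the condition reads cleanly and avoids indexing into an empty list:

    alert = soup.find('div', class_='alert')
    if alert is not None:
        print('Product unavailable:', alert.get_text(strip=True))
    else:
        print('Product available')
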
1 vote · 1 answer · 46 views

BeautifulSoup4: Need to add inverse paragraph tags to separate a field into two paragraphs

Currently, there is one header tag which has its content attached to it. I need to separate the header from its content by keeping them in separate paragraph tags. block_tag = 1.1 Header Information. Content of the header with multiple lines type(block_tag) The header is expected to be enclo...
DazzlerJay
1 vote · 2 answers · 42 views

Identify the word 'method' from a tag and extract the text

I need to identify all tags which have the word 'method' in them. I developed Python code using requests and regex. The code first reads a text file to extract the ID and then uses requests to open the URL and identify the tags that have the 'method' keyword in them; however, the output is returning empty l...
RRg
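
A sketch of matching the word with find_all and a compiled pattern, then stepping up to the enclosing tag; nothing about the page's own tag names is assumed here:

    import re

    hits = soup.find_all(string=re.compile(r'\bmethod\b', re.IGNORECASE))
    for text in hits:
        print(text.parent.name, text.strip())    # the tag that directly contains the word
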
1 vote · 3 answers · 46 views

ValueError when scraping Tripadvisor for reviews with BeautifulSoup

I am trying to scrape some Tripadvisor reviews as a complete newbie to this. I'm using code from Susanli2016. It worked (though I had to remove the 'language' attribute) for one link, but it doesn't work for any other link (for example.) I'm receiving the error: Traceback (most recent call last): File '',...
Sennheiser
1 vote · 2 answers · 36 views

BeautifulSoup findAll returns empty list when selecting class

find_all() returns an empty list when specifying a class; specifying tags works fine. import urllib2 from bs4 import BeautifulSoup url = 'https://www.reddit.com/r/Showerthoughts/top/?sort=top&t=week' hdr = { 'User-Agent' : 'tempro' } req = urllib2.Request(url, headers=hdr) htmlpage = urllib2.urlopen(req).rea...
Farhaan Mithagare
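
For reference, the two standard ways to filter on class (class is a Python keyword, hence the trailing underscore); an empty result despite correct syntax usually means the served HTML differs from what the browser inspector shows, for example content added by JavaScript:

    # keyword form
    titles = soup.find_all('a', class_='title')            # class name is illustrative
    # attrs-dict form, equivalent
    titles = soup.find_all('a', attrs={'class': 'title'})
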
1 vote · 1 answer · 40 views

BeautifulSoup can't find div with specific class

So for some background I have been trying to learn web scraping to get some images for machine learning projects involving CNNs. I have been trying to scrape some images from a site (HTML code on the left, my code on the right) with no luck; my code ends up printing/returning an empty list. Is there...
Gabriel Bello
1 vote · 3 answers · 1.3k views

How to remove content in nested tags with BeautifulSoup?

How to remove content in nested tags with BeautifulSoup? These posts showed the reverse to retrieve the content in nested tags: How to get contents of nested tag using BeautifulSoup, and BeautifulSoup: How do I extract all the s from a list of s that contains some nested s? I have tried .text but it...
alvas
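
A sketch of the usual approach: decompose() (or extract()) the nested tags in place, after which the outer tag's text no longer includes them; the tag names here are illustrative:

    outer = soup.find('div', class_='outer')        # illustrative outer tag
    for inner in outer.find_all('span'):            # illustrative nested tag
        inner.decompose()                           # removes the tag and its content from the tree
    print(outer.get_text(strip=True))               # remaining text of the outer tag only
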
0 votes · 1 answer · 50 views

Why can't I find a <p> within a <span> tag in this HTML example?

I have a value I need to grab out of a div tag. Within the div there is a , and . When I write out the results of the find_all for the main I can see everything I want to get. But when I look for all the tags within that main div, the one I need doesn't exist/return in the results. This is what...
CubanGT
0 votes · 0 answers · 6 views

Python crawler finds desired tags in only first few divs

I am trying to scrape some images from a shopping site (https://www.grailed.com/shop/EkpEBRw4rw) but I am having some trouble with it since the listings update as you scroll. I am trying to get the image source in the HTML tag below: Now the code I have been using is shown below: from bs4 import...
Gabriel Bello
1 vote · 1 answer · 16 views

Convert a nested HTML table to a nested dictionary in Python?

I am writing an application that converts the HTML-table string data received from a website (by calling a REST API) to dictionary format. The problem is that the HTML table string is in a nested HTML table format. After a while searching the internet I cannot find a solution for this case. Event...
hieu to van
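
A hedged sketch of one recursive approach, assuming two-column label/value rows; rows that belong to an inner table are skipped at the outer level and handled by the recursive call:

    def table_to_dict(table):
        result = {}
        for row in table.find_all('tr'):
            if row.find_parent('table') is not table:        # row belongs to a nested table
                continue
            cells = row.find_all(['td', 'th'], recursive=False)
            if len(cells) < 2:
                continue
            key = cells[0].get_text(strip=True)
            inner = cells[1].find('table')
            result[key] = table_to_dict(inner) if inner else cells[1].get_text(strip=True)
        return result
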
1 vote · 4 answers · 69 views

.split() doesn't transform a string into a list

In my code I'm trying to split a string and put the links (that are in the string) in an array with the method .split(), but when I try to do that: ciao = [] for article in soup.find_all('a', {'style': 'height:81px;'}): ciao = article.get('href').split() print(ciao[1]) I get the error: 'IndexError: li...
JacopoDT
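
The IndexError likely comes from reassigning ciao on every pass and from splitting a single URL, which yields a one-element list; a sketch that accumulates the links instead:

    ciao = []
    for article in soup.find_all('a', {'style': 'height:81px;'}):
        href = article.get('href')
        if href:
            ciao.append(href)        # collect every link rather than overwriting the list
    print(ciao)
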
2 votes · 1 answer · 10 views

Table data returning empty values after web scraping

I tried to web scrape the table data from a binary signals website. The data updates after some time and I wanted to get the data as it updates. The problem is, when I scrape the code it returns empty values. The table has a table tag. I'm not sure if it uses something other than HTML because i...
Mark Gacoka
1 vote · 1 answer · 125 views

Web Scraping the screen image from an interactive web map

I need to extract the map component as a static image from: http://www.bom.gov.au/water/landscape/#/sm/Relative/day/-35.30/145.17/5/Point////2018/12/16/ This page contains a Leaflet-based interactive web map, in which the layer data is updated daily via web mapping services. The extracted image sho...
alextc
1 vote · 3 answers · 46 views

Logging in using Mechanize

I am trying to extract some data from a website - not a lot - but enough to warrant a little script... I am attempting to first log in to the site https://squashlevels.com using mechanize and cookielib, but I am failing... I currently have from bs4 import BeautifulSoup import requests import re imp...
MoonKnight
1 vote · 1 answer · 45 views

beautifulsoup webscraper problem: can't find tables on webpage

I want to get tables from this website with this code: from urllib.request import urlopen as uReq from bs4 import BeautifulSoup as soup my_url = 'https://www.flashscore.pl/pilka-nozna/' uClient = uReq(my_url) page_html = uClient.read() uClient.close() page_soup = soup(page_html, 'html.parser') conta...
ak27
1 vote · 2 answers · 65 views

Scrape URLs using BeautifulSoup in Python 3

I tried this code but the list with the URLs stays empty. No error message, nothing. from bs4 import BeautifulSoup from urllib.request import Request, urlopen import re req = Request('https://www.metacritic.com/browse/movies/genre/date?page=0', headers={'User-Agent': 'Mozilla/5.0'}) html_page = urlo...
TAN-C-F-OK
1 vote · 1 answer · 48 views

Why is BeautifulSoup4 missing the first file URL?

I'm trying to catalog the files on this website as a personal exercise. When I run the following code I don't know why I'm not getting the first file url on this website. Any help is appreciated. import requests from bs4 import BeautifulSoup import regex url = 'https://www.liberliber.it/online/autor...
Frontsky
1 vote · 2 answers · 29 views

Find information in HTML tables with Beautiful soup

I'm trying to extract information from an HTML table (found on this example page https://www.detrasdelafachada.com/house-for-sale-marianao-havana-cuba/dcyktckvwjxhpl9): Type of property: Apartment Building style: 50 year Sale price: 12 000 CUC Rooms: 1 Bathrooms: 1 Kitchens: 1 Surface: 38 mts2...
Rodolphe
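
A sketch of pairing label and value cells row by row, given the parsed soup and assuming each detail sits in a two-cell table row (adjust the tags to the page's actual layout):

    details = {}
    for row in soup.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) == 2:
            details[cells[0].get_text(strip=True)] = cells[1].get_text(strip=True)
    print(details.get('Sale price'))     # e.g. '12 000 CUC', if the label matches exactly
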
1 vote · 2 answers · 55 views

scraping greatschools.org using BeautifulSoup returns empty list

I've been learning how to scrape the greatschools.org website using BeautifulSoup. I've run into a dead end despite looking up different solutions here and in other places. By using the 'inspect' feature on chrome I can see that the website has table tags but a find_all('tr') or find_all('table') o...
ph03nix
1 vote · 2 answers · 85 views

Separating text inside a <pre> tag

I wanted to try some basic web scraping but ran into a problem: I am used to simple td tags, but in this case the webpage has the following pre tag with all of the text inside it, which makes it a bit trickier to scrape. 11111111 11111112 11111113 11111114 11111115 Any suggestions on...
Blueprov
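
Since the values inside a <pre> are one block of text, splitting on whitespace is usually enough; a minimal sketch given the parsed soup:

    pre = soup.find('pre')
    values = pre.get_text().split()      # ['11111111', '11111112', '11111113', ...]
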
1 vote · 2 answers · 53 views

how do I parse info from yahoo finance with beautiful soup

I have got this far by using soup.findAll('span'): Previous Close, 5.52, , 5.49, Volume, 1,164,604, ... I want a table that shows me Open 5.49 Volume 1,164,604 ... I tried soup.findAll('span').text but it gives the error message: ResultSet object has no attribute 'text'. You're probably treating a list of ite...
Candice
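
The error occurs because .text belongs to individual Tag objects, not to the ResultSet returned by findAll. A sketch that extracts the text per element and then pairs labels with values; the alternating label/value order is an assumption about the page:

    texts = [span.get_text(strip=True) for span in soup.find_all('span')]
    pairs = dict(zip(texts[::2], texts[1::2]))   # assumes label, value, label, value ... ordering
    print(pairs.get('Volume'))
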
1 vote · 1 answer · 46 views

Python: How can I break loop and append the last page of results?

I've made a scraper that works except that it won't scrape the last page. The URL doesn't change, so I set it up to run in an infinite loop. I've set the loop up to break when it can't click on the next button anymore (on the last page), and it seems that the script is ending before it appends the last...
Smitty
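
One way to make sure the last page is kept is to append the page's results before attempting to click Next, and only then break; a rough Selenium sketch in which scrape_current_page is a hypothetical helper and the 'Next' link text is an assumption:

    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import NoSuchElementException

    results = []
    while True:
        results.extend(scrape_current_page(driver))      # hypothetical helper that parses the current page
        try:
            driver.find_element(By.LINK_TEXT, 'Next').click()
        except NoSuchElementException:
            break                                        # the last page has already been appended above
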
1 vote · 4 answers · 87 views

Pulling current stock price (Yahoo) with beautifulsoup

I'm having issues using beautiful soup (python3) to pull the latest stock price import requests from money import Money from bs4 import BeautifulSoup response = requests.get('https://finance.yahoo.com/quote/VTI?p=VTI') soup = BeautifulSoup(response.content, 'lxml') price = soup.find('span', attrs =...
user3654225
1 vote · 4 answers · 57 views

Extracting from script - beautiful soup

How would the value for the 'tier1Category' be extracted from the source of this page? https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product soup.find('script') returns only a subset of the source, and the following returns another source within that co...
dirtyw0lf
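
A common pattern is to locate the <script> whose text mentions the key and pull the value out with a regular expression (or json.loads if the whole block is valid JSON); the key layout below is an assumption about the page:

    import re

    script = soup.find('script', string=re.compile('tier1Category'))
    if script and script.string:
        match = re.search(r'"tier1Category"\s*:\s*"([^"]+)"', script.string)
        if match:
            print(match.group(1))
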
1 vote · 1 answer · 48 views

Custom attributes in BeautifulSoup?

I am trying to use Beautiful Soup to target a DIV with a non-standard attribute. Here's the DIV: `` I need to find_all DIVs with the data-asin attribute, and get the asin as well. BS appears to support this feature, but what I am doing isn't working. Here's my code that doesn't work: `rows = soup.fin...
krypterro
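
Since data-* names are not valid keyword arguments, they go through the attrs dict; passing True matches any element that has the attribute at all. A minimal sketch:

    rows = soup.find_all('div', attrs={'data-asin': True})
    for row in rows:
        print(row['data-asin'])
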
1 vote · 2 answers · 30 views

Extracting text from 'value' attribute using beautifulsoup

HTML Code: I would like to extract the text in the 'value' attribute ('1435.95'). I tried doing it by executing the following code, but no luck. driver.get(someURL) page = driver.page_source soup = BeautifulSoup(page, 'lxml') price = soup.find('td', {'id' : re.compile('ContentPlaceHolder1_ContentPlaceHolde...
rkrox 907
0 votes · 1 answer · 20 views

My Webscraping code with beautifulsoup doesn't go past the first page

it doesn't seem to go past the first page. What's wrong? import requests from bs4 import BeautifulSoup for i in range (1,5): url = 'https://www.nairaland.com/search/ipob/0/0/0/{}'.format(i) the_word = 'is' r = requests.get(url, allow_redirects=False) soup = BeautifulSoup(r.content, 'lxml') words =...
introvertme
1 vote · 1 answer · 39 views

I want to extract links of members

I am trying to extract links of the following members from bs4 import BeautifulSoup import requests r = requests.get('https://www.aapkiawaz.in/about/doctor-hospital-directory-medical-directory-doctors-doctor-hospital-listing-medical-directory-doctors-listing-medical-directory-doctors-doctor-hospital...
dhananjay
0 votes · 0 answers · 11 views

To scrape a website and put it in Excel, segregated into headings and contents. The problem is that both have the same class and tags; how to segregate?

I am trying to web scrape http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html, segregating it into 2 parts, heading and content. The problem is that both have the same class and tags. Other than using regex and hard coding, how can I distinguish and extract them into 2 columns in Excel...
Pavan SN
0 votes · 1 answer · 19 views

How to scrape a website to which I post information

I want to scrape announcements information from https://nseindia.com/corporates/corporateHome.html?id=allAnnouncements. Specifically, I want to go to the Corporate Information tab on the left-hand side of the website and then open the link for corporate announcements under equities. After that I want to post i...
user159944
1 vote · 1 answer · 50 views

How to save the URLs from a for loop into a single variable?

I want to store multiple URLs in a single variable 'URLs'. The URLs are made up of three parts, 'urlp1', 'n' and 'urlp2', which you can see in the code below. urlp1 = 'https://www.proteinatlas.org/' URLs = [] for cancer in cancer_list: urlp2 = '/pathology/tissue/' + cancer[1] f = cancer[0] t...
Rujun Guan
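
A sketch of collecting every URL by appending inside the loop instead of reassigning the variable; how the middle segment n is obtained from cancer_list is an assumption based on the snippet:

    urlp1 = 'https://www.proteinatlas.org/'
    URLs = []
    for cancer in cancer_list:
        n = cancer[0]                                # assumed source of the middle URL segment
        urlp2 = '/pathology/tissue/' + cancer[1]
        URLs.append(urlp1 + n + urlp2)               # append keeps every URL in the one variable
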
1 vote · 2 answers · 73 views

Web scraping tables using python

I have been trying to extract a table from the Wikipedia list of Nobel laureates. The table has some None values and I don't know how to take care of those values while looping through the cells. How can I include the None values in the table? The link to the Wikipedia page is: https://en.wikipedia.org/wiki/L...
Aamir
2 votes · 1 answer · 16 views

How do I implement a breadth first and depth first search web crawler?

I am attempting to write a web crawler in Python with Beautiful Soup in order to crawl a webpage for all of the links. After I obtain all the links on the main page, I am trying to implement a depth-first and breadth-first search to find 100 additional links. Currently, I have scraped and obtained t...
dacoda007
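
A compact sketch of the breadth-first variant with a deque; switching queue.popleft() to queue.pop() turns it into depth-first. Politeness (robots.txt, delays) is left out:

    from collections import deque
    from urllib.parse import urljoin
    import requests
    from bs4 import BeautifulSoup

    def bfs_crawl(start_url, limit=100):
        seen, queue, found = {start_url}, deque([start_url]), []
        while queue and len(found) < limit:
            url = queue.popleft()                     # popleft() = BFS; pop() = DFS
            try:
                html = requests.get(url, timeout=5).text
            except requests.RequestException:
                continue
            for a in BeautifulSoup(html, 'html.parser').find_all('a', href=True):
                link = urljoin(url, a['href'])
                if link.startswith('http') and link not in seen:
                    seen.add(link)
                    found.append(link)
                    queue.append(link)
        return found
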
1 vote · 0 answers · 556 views

Beautiful Soup prettify format only string values

I am using Beautiful Soup 4 to parse and modify a couple of Angular templates (HTML files). I have some issues when using the prettify function to write the modified content back into the file. This issue is related to special characters such as: >,
Tudor Ciotlos
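
prettify substitutes HTML entities through its formatter argument; passing formatter=None leaves string values untouched, which is one way to stop characters such as > from being rewritten (at the cost of no escaping at all). A sketch given the parsed soup, with an illustrative file name:

    with open('template.html', 'w', encoding='utf-8') as fh:   # file name is illustrative
        fh.write(soup.prettify(formatter=None))                # no entity substitution on strings
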
