Questions tagged [jsoup]

97 votes, 4 answers, 51.6k views

I get a SocketTimeoutException in Jsoup: Read timed out

I get a SocketTimeoutException when I try to parse a lot of HTML documents using Jsoup. For example, I have a list of links: link1 link2 link3 link4. For each link, I parse the document linked to the URL (from the href attribute) to get other pieces of information in those pages. So I can imagine that...
C. Maillard
85 votes, 15 answers, 48.4k views

How do I preserve line breaks when using jsoup to convert html to plain text?

I have the following code: public class NewClass { public String noTags(String str){ return Jsoup.parse(str).text(); } public static void main(String args[]) { String strings='' + ' body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;} hello worldyo googlez '; NewClass text...
Billy
48 votes, 10 answers, 88k views

How to “scan” a website (or page) for info, and bring it into my program?

Well, I'm pretty much trying to figure out how to pull information from a webpage, and bring it into my program (in Java). For example, if I know the exact page I want info from, for the sake of simplicity a Best Buy item page, how would I get the appropriate info I need off of that page? Like the...
James
46 votes, 4 answers, 52.3k views

jsoup posting and cookie

I'm trying to use jsoup to log in to a site and then scrape information, but I am running into a problem: I can log in successfully and create a Document from index.php, but I cannot get other pages on the site. I know I need to set a cookie after I post and then load it when I'm trying to open another...
Gwindow
39 votes, 1 answer, 3k views

How to parse data in Talend with Java (coming from a previously produced .txt file)?

I have a process in Talend which gets the search result of a page, saves the html and writes it into files, as seen here: Initially I had a two step process with parsing out the date from the HTML files in Java. Here is the code: It works and writes it to a mysql database. Here is the code which bas...
ZedBrannigan
36 votes, 8 answers, 33.6k views

How to add proxy support to Jsoup (HTML parser)?

I am new to Java, and my first task is to parse some 10,000 URLs and extract some info from them. I am using Jsoup for this and it's working fine, but now I want to add proxy support. The proxies have a username and password too. Can anyone help me with this? Thanks
Himanshu
36 votes, 3 answers, 33.3k views

jsoup - strip all formatting and link tags, keep text only

Let's say I have an HTML fragment like this: foo bar foobar baz. What I want to extract from that is: foo bar foobar baz. So my question is: how can I strip all the wrapping tags from the HTML and get only the text in the same order as it is in the HTML? As you can see in the title, I want to use j...
WonderCsabo
34 votes, 4 answers, 43.1k views

JSoup UserAgent, how to set it right?

I'm trying to parse the front page of Facebook with JSoup, but I always get the HTML code for mobile devices and not the version for normal browsers (in my case Firefox 5.0). I'm setting my User Agent like this: doc = Jsoup.connect(url) .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/2010...
Markus
33 votes, 4 answers, 28.6k views

Jsoup: how to get an image's absolute url?

Is there a way in jsoup to extract an image's absolute URL, much like one can get a link's absolute URL? Consider the following image element found in http://www.example.com/. I would like to get http://www.example.com/images/chicken.jpg. What should I do?
r0u1i
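
For the image question above, jsoup's own route is `element.absUrl("src")` (or `attr("abs:src")`), which resolves the attribute against the document's base URI. The sketch below is not jsoup code: it uses only `java.net.URI` to show the same resolution step, so it runs without the library. The class name `AbsoluteUrlSketch`, the `resolve` helper, and the relative path `/images/chicken.jpg` are assumptions for illustration, since the excerpt does not show the actual `src` value.

```java
import java.net.URI;

public class AbsoluteUrlSketch {
    // Resolves a possibly relative reference (e.g. an img src value)
    // against the URL of the page it was found on.
    public static String resolve(String baseUrl, String reference) {
        return URI.create(baseUrl).resolve(reference).toString();
    }

    public static void main(String[] args) {
        // Assumed src value; the page URL is the one from the question.
        String absolute = resolve("http://www.example.com/", "/images/chicken.jpg");
        System.out.println(absolute); // http://www.example.com/images/chicken.jpg
    }
}
```

An already absolute reference passes through `resolve` unchanged, so the helper can be applied to every `src` or `href` without checking first.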
32 votes, 2 answers, 21.6k views

(how) can I download an image using JSoup?

I already know where the image is, but for simplicity's sake I wanted to download the image using JSoup itself. (This is to simplify getting cookies, referrer, etc.) This is what I have so far: //Open a URL Stream Response resultImageResponse = Jsoup.connect(imageLocation).cookies(cookies).ignoreCon...
30 votes, 2 answers, 24.8k views

Does jsoup support xpath?

There's some work in progress on adding XPath support to jsoup: https://github.com/jhy/jsoup/pull/80. Is it working? How can I use it?
gguardin
30 votes, 1 answer, 38.3k views

How to parse XML with jsoup

I am trying to parse XML with jsoup, but I can't find any examples on this task. My XML document looks like this: xxx xxx xxx xxx .... It should be quite straightforward, but my attempt has failed. Code: Element content = doc.getElementById('content'); Elements tests = content.getElementsByTag('test...
JavaCake
29 votes, 2 answers, 18.3k views

Jsoup select div having multiple classes

I am trying to select, using Jsoup, a <div> that has multiple classes: ... The syntax for doing so, to the best of my understanding, should be: document.select("div.content-text.right-align.bold-font"); However, for some reason, this doesn't work for me. When I try the exact same syntax on JSFiddle, it w...
ef2011
26 votes, 2 answers, 48.3k views

How to parse HTML table using jsoup?

I am trying to parse HTML using jsoup. This is my first time working with jsoup, and I have read some tutorials on it as well. Below is the HTML table I am trying to parse. As you can see, it has three tr elements for now (I have shortened it to three table rows just for understanding pur...
john
25 votes, 5 answers, 28.1k views

How do I select this element in JSOUP?

This is the HTML structure: Element link = doc.select("div.subtabs p").first(); That does not seem to work. How do I select that p?
HackToHell
25 votes, 2 answers, 15.2k views

Connection error: “org.jsoup.UnsupportedMimeTypeException: Unhandled content type”

When I try to open a link to parse with jsoup I get an error. Connection command: Document doc = Jsoup.connect("http://www.rfi.ro/podcast/emisiune/174/feed.xml") .timeout(10 * 1000).get(); Errors thrown: Exception in thread "main" org.jsoup.UnsupportedMimeTypeException: Unhandled content type. Must...
user2340897
24 votes, 7 answers, 23.4k views

Jsoup.clean without adding html entities

I'm cleaning some text from unwanted HTML tags (such as ) by using String clean = Jsoup.clean(someInput, Whitelist.basicWithImages()); The problem is that it replaces, for instance, å with &aring; (which causes trouble for me since it's not "pure XML"). For example Jsoup.clean("hello å world", Whitelis...
aioobe
24 votes, 7 answers, 33.6k views

Page content is loaded with javascript and Jsoup doesn't see it

One block on the page is filled with content by JavaScript, and after loading the page with Jsoup none of that information is there. Is there a way to also get JavaScript-generated content when parsing a page with Jsoup? Special UPD for Marcin: Can't paste page code here, since it is too long: http://paste...
Eugene
24 votes, 2 answers, 39k views

how to parse a table from HTML using jsoup

5,390.85 5,428.15 5,376.15 5,413.85 This is the HTML source from which I have to extract the values 5390.85, 5428.15, 5376.15, 5413.85. I wanted to do this using jsoup, but I am relatively new to jsoup (I started using it today). So how should I do this? URL url = new URL("http://www.nseindia.com...
CyprUS
22 votes, 3 answers, 36.2k views

Jsoup select and iterate all elements

I connect to a URL through jsoup and get all of its contents, but if I select with doc.select("body") it returns a single element. I want to get all the elements in the page and iterate over them one by one. For example: Test Hello All Second Page Test. If I select using body I a...
Karthik
22 votes, 2 answers, 35k views

How to POST Data into website using Jsoup

I am trying to POST data to a website to log in to the site using Jsoup, but it's not working. I am trying the code Document docs = Jsoup.connect("http://some.com/login") .data("cmd", "login", "username", "xxxx", "password", "yyyyy") .referrer("http://some.com/login/").post(); here it is giv...
Aspirant
22 votes, 3 answers, 38.6k views

JSOUP select <div> with specific ID

I'm making a small Android application for a class where I find cancer-related events from the American Cancer Society's website. I've been using JSoup to get basic information about the events, and to get specific information from the website I've tried to use the select() method. However, the cu...
Tom
21 votes, 2 answers, 6.3k views

using jsoup with proguard closing force close

EDIT: MY PROGUARD VERSION IS 4.7. Today I tried to include jsoup (version 1.7.1) in my Android application, but it is causing me a lot of trouble. When I exported the signed APK with ProGuard turned on, my application force closed every time; then I disabled ProGuard and exported the ap...
android_newbie
21 votes, 8 answers, 34k views

How to connect via HTTPS using Jsoup?

It's working fine over HTTP, but when I try and use an HTTPS source it throws the following exception: 10-12 13:22:11.169: WARN/System.err(332): javax.net.ssl.SSLHandshakeException: java.security.cert.CertPathValidatorException: Trust anchor for certification path not found. 10-12 13:22:11.179: WARN...
jfisk
21 votes, 4 answers, 36.1k views

JSoup character encoding issue

I am using JSoup to parse content from http://www.latijnengrieks.com/vertaling.php?id=5368. This is a third-party website and does not specify a proper encoding. I am using the following code to load the data: public class Loader { public static void main(String[] args){ String url = "http://www.lati...
Hihaatje
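
When a site does not declare its encoding, the caller has to pick the charset used to decode the response bytes; jsoup accepts one through `Jsoup.parse(InputStream in, String charsetName, String baseUri)`. The sketch below uses only the JDK to show why that choice matters. The class name `CharsetSketch`, the sample text, and ISO-8859-1 as the real encoding are assumptions for illustration; the excerpt does not say which encoding the site actually uses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetSketch {
    // Decodes raw response bytes with an explicitly chosen charset.
    public static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "café" as a server might send it in ISO-8859-1 (assumed encoding).
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // Correct guess: the accented character survives.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1));

        // Wrong guess: the lone 0xE9 byte is not valid UTF-8,
        // so the decoder substitutes the replacement character U+FFFD.
        System.out.println(decode(raw, StandardCharsets.UTF_8));
    }
}
```

If the parsed text shows replacement characters or mangled accents, trying the page's likely legacy charset in place of the default is usually the first thing to check.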
21 votes, 4 answers, 23k views

JSoup: Requesting JSON response

I'm using JSoup to authenticate and then connect to a website. Some URLs have a JSON response (because part of the site uses AJAX). Can JSoup handle a JSON response? Connection.Response doc = Jsoup.connect("...") .data(...) .cookie(...) .header(...) .method(Method.POST) .execute(); String result = doc.b...
20 votes, 4 answers, 28.6k views

Get element by class in JSoup

I am trying to get all the info contained in the div with class bg_block_info, but instead I get info from a div with another class. Why am I getting it wrong? Document doc = Jsoup.connect("http://www.maib.md").get(); Elements myin = doc.getElementsByClass("bg_block_info");
develoops
20 votes, 3 answers, 20.1k views

How do I convert a document made in Jsoup (the Java html parser) into a string

I have a document that was made in jsoup that looks like this: Document doc = Jsoup.connect("http://en.wikipedia.org/").get(); How do I convert that doc into a string?
Hudson Hughes
19 votes, 9 answers, 16.1k views

How can I extract only the main textual content from an HTML page?

Update: Boilerpipe appears to work really well, but I realized that I don't need only the main content, because many pages don't have an article but only links with a short description of the full texts (this is common in news portals), and I don't want to discard these short texts. So if an API...
Renato Dinhani
19 votes, 3 answers, 17.3k views

Java - Obtain text within script tag using Jsoup

I am using the Jsoup library to read a URL. This URL has text within a few <script> tags. Is it possible for me to obtain the text within each <script> tag? Please note that I am not asking to parse a JavaScript file, as I am already aware JSoup does not allow that. The actual source code of the URL has text within...
Matt9Atkins
19 votes, 1 answer, 7.5k views

JSoup.connect throws 403 error while apache.httpclient is able to fetch the content

I am trying to parse the HTML dump of any given page. I used HTML Parser and also tried JSoup for parsing. I found useful functions in Jsoup, but I am getting a 403 error when calling Document doc = Jsoup.connect(url).get(); I tried HTTPClient to get the HTML dump, and it was successful for the same URL....
instanceOfObject
18 votes, 3 answers, 23.3k views

Java - Quickest way to check if URL exists

Hi I am writing a program that goes through many different URLs and just checks if they exist or not. I am basically checking if the error code returned is 404 or not. However as I am checking over 1000 URLs, I want to be able to do this very quickly. The following is my code, I was wondering how I...
Matt9Atkins
18 votes, 3 answers, 6.2k views

Is it possible to convert HTML into XHTML with Jsoup 1.8.1?

String body = ''; Document document = Jsoup.parseBodyFragment(body); document.outputSettings().escapeMode(EscapeMode.xhtml); String str = document.body().html(); System.out.println(str); expect: result: Can Jsoup convert value HTML into XHTML?
Henry
17 votes, 2 answers, 22.4k views

Jsoup Java HTML parser : Executing javascript events

Can I fill out forms, trigger events, and execute JavaScript functions in Jsoup? If yes, how? Or should I go for another parser?
17 votes, 1 answer, 2.9k views

Jsoup exclude children from .text()

I have a problem similar to this one: jQuery: exclude children from .text(). Is it possible to achieve this in JSoup?
emesx
15 votes, 1 answer, 15.6k views

How to extract separate text nodes with Jsoup?

I have an element like this: TextA TextB. How can I extract TextA and TextB separately?
M.M
15 votes, 3 answers, 30.5k views

How to post form login using jsoup?

I want to log in here. Source code: :: Dhaka Electric Supply Company Limited (DESCO):: img{ border:0px; } function checkLogin() { if( document.login.username.value == '') { alert( 'Please enter your account number' ); return false; }return true; } alert('Payments through VISA and Master Card...
MD TAHMID HOSSAIN
15 votes, 3 answers, 5.2k views

How to avoid surrounding html head tags in Jsoup parse

Using Jsoup, I am trying to parse the given HTML content. After Jsoup.parse(), the HTML output has html, head and body tags added around the input. I just want to leave these out. Sample input: This is my sentence of text. Java code: import java.io.File; import java.io.IOException; import org.apache.commons.io.FileUti...
Roshan
15 votes, 3 answers, 10.3k views

Use jsoup to parse XML - prevent jsoup from “cleaning” <link> tags

In most cases, I have no problem using jsoup to parse XML. However, if there are <link> tags in the XML document, jsoup will change <link>some text here</link> to <link />some text here. This makes it impossible to extract the text inside the tag using a CSS selector. So how do I prevent jsoup from "cleaning" <link> tags?
Ethan
14 votes, 2 answers, 14.2k views

How to extract absolute URL from relative HTML links using Jsoup?

I am using Jsoup to extract the URLs from a webpage. The href attributes of those links are relative, like: example. Here is my attempt: Document document = Jsoup.connect(url).get(); Elements results = document.select("div.results"); Elements dls = results.select("dl"); for (Element dl : dls) { String url =...
sundhar