Incomplete HTML page scraped with OkHttp3: is JavaScript needed?


I am scraping some JSON data from a website, which works pretty well: I can log in and download the necessary data. However, in one case I have to download an HTML page and extract the info from the HTML.

I've modified the request headers so that they match the ones visible in Chrome developer tools (F12).

Request request = new Request.Builder().url(url)
        .header("Host", "www.host.com")
        .header("Connection", "Keep-Alive")
        .header("Cache-Control", "max-age=0")
        .header("Upgrade-Insecure-Requests", "1")
        .header("User-Agent", this.user_agent_user_for_this_session)
        .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8")
        // Setting Accept-Encoding explicitly turns off OkHttp's transparent gzip handling,
        // so the body has to be decompressed by hand below.
        .header("Accept-Encoding", "gzip, deflate, br")
        .header("Accept-Language", "en-US,en;q=0.9,fr;q=0.8,nl;q=0.7,de;q=0.6,af;q=0.5")
        .get().build();

Response response = client.newCall(request).execute();

// Assumes the server answered with gzip; a "br" or "deflate" response would make GZIPInputStream fail.
String html = IOUtils.toString(new GZIPInputStream(response.body().byteStream()), StandardCharsets.UTF_8);

I receive an HTML file, but it is much smaller than the one saved manually from Chrome (save source as). I noticed all kinds of ng (Angular) references in the HTML, which made me wonder whether I only received the initial page, before some JavaScript process had finished.

In addition, the downloaded HTML looks identical to the first response shown in Chrome's network view (I copy-pasted the content and the file sizes are the same).

So do I need to do some additional processing after the request?


1 Answer

1

If it is an Angular page then you are out of luck: the whole page is generated at runtime, so the actual index.html is quite small.
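
A quick way to check whether a downloaded page is only such a shell is to look for typical Angular markers in the raw HTML. The following is a rough sketch, not a reliable detector: the class name, the method name, and the marker list (`ng-app`, `ng-controller`, `ng-view`, `<app-root`) are assumptions chosen for illustration, and a real page may use none of them.

```java
import java.util.Locale;

public class AngularShellCheck {

    // Substrings that often show up in Angular/AngularJS shell pages.
    // Heuristic only: this list is an assumption, not an exhaustive rule.
    private static final String[] MARKERS = {"ng-app", "ng-controller", "ng-view", "<app-root"};

    /** Returns true if the HTML contains at least one of the marker substrings. */
    public static boolean looksLikeAngularShell(String html) {
        String lower = html.toLowerCase(Locale.ROOT);
        for (String marker : MARKERS) {
            if (lower.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String shell = "<html><body ng-app=\"shop\"><div ng-view></div></body></html>";
        String plain = "<html><body><p>Hello</p></body></html>";
        System.out.println(looksLikeAngularShell(shell)); // true
        System.out.println(looksLikeAngularShell(plain)); // false
    }
}
```

If the check matches, the data is most likely loaded by JavaScript after the initial response, so parsing the raw HTML alone will not be enough.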

As a workaround, I have used Selenium to open the page in a headless browser and fetch the content after the Angular application has initialized.
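
A minimal sketch of that workaround, assuming Selenium 4 with a local Chrome installation. The class name, the URL path, the `#content` selector, and the 15-second timeout are hypothetical placeholders; pick a selector for an element that only exists once the application has rendered.

```java
import java.time.Duration;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class RenderedPageFetcher {

    /** Loads the URL in headless Chrome and returns the DOM serialized after rendering. */
    public static String fetchRenderedHtml(String url, String readySelector) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new"); // run Chrome without a visible window
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get(url);
            // Block until an element produced by the application is present (assumed selector).
            new WebDriverWait(driver, Duration.ofSeconds(15))
                    .until(ExpectedConditions.presenceOfElementLocated(By.cssSelector(readySelector)));
            return driver.getPageSource();
        } finally {
            driver.quit(); // always shut the browser down
        }
    }

    public static void main(String[] args) {
        // Hypothetical URL and selector, for illustration only.
        String html = fetchRenderedHtml("https://www.host.com/page", "#content");
        System.out.println(html.length());
    }
}
```

Note that the browser session is separate from the OkHttp client, so a login done through OkHttp does not carry over; the login would have to be repeated in the browser or the session cookies transferred to it.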
