What are the most effective ways to scrape content from websites?

If you want to aggregate content from websites that do not offer API integration, what are the most effective methods? Let's say you want to pull consumer product review comments from Best Buy or Amazon. Can this be done? I recall from years past just how messy screen scraping was. Is this still the only option? Has the technology to do this improved in recent years? What if the content you want to aggregate requires a search in order to be found? For instance, let's say you want reviews pertaining to a specific product. If you were on Amazon, you'd need to search for the product first. How does screen scraping get around this? Can you run into legal issues and pushback if you attempt to collect content without consent?

In situations where a web site harbors product review comments and does not offer an API, what's the best approach to get them to share their data? I'm assuming that if they could monetize this content somehow they'd be receptive to the idea.

20 Replies

Andrew Lockley Advisor
Investor and strategy consultant
Import.io
Near Privman Advisor
Googler, Startup Advisor
The technology has greatly improved since the time it was commonly called "screen scraping". Now more commonly called web scraping or data scraping, it is mostly done using automated headless browsers, which can return cookies, execute JavaScript, etc., making it much easier from a technical point of view. If the site you are scraping doesn't defend itself against scraping (but simply hasn't bothered to expose the data in API form), you should be able to use a cloud scraping service like those offered by AWS and Google, for example.
The problem usually lies in the fact that you are probably doing something very much against the website owner's interests, and probably violating their terms of use, exposing yourself to lawsuits, etc.
In the examples you provide (Best Buy, Amazon), the content is very much central to their value proposition to customers and represents a significant competitive advantage (people prefer to shop on Amazon because they trust the reviews there, for some reason). They would have to be convinced that they will gain directly from your use of the content, and that you will safeguard this content against their competitors as diligently as they do themselves (which would be tough for a startup to convince them of).
If you do not have their permission, you will probably find that they spend vast resources to foil scraping attempts, e.g. by blocking or serving fake responses to requests they are able to identify as scraping attempts.
Rob Mitchell Entrepreneur
Senior Java Software Engineer at Direct Commerce
Richard, I've worked a bit with some of what it sounds like you're trying to do and I can tell you from an engineering perspective, it is not trivial. Unless a company's website intends you to scrape or otherwise get their content, they will do many things to protect it including hiding it behind mostly dynamic HTML.

What @NearPrivman talks about is quite true.

If I were you, I would readjust the value proposition of what you're trying to do. Possibly reach out and establish business/partner relations to make what you want a reality.
Moh'd Jebrini Entrepreneur
CTO at Mashvisor
Hi, Richard

Taking everything that has been said into consideration, there are still some engineering tools that could help you achieve your target. There is no single solution, but usually you need to integrate a few technologies with each other to make sure you hit the target.

The Import.io service is quite nice, if you like it. Otherwise, I suggest you check out these tools with your engineers: Portia (https://github.com/scrapinghub/portia) and Scrapy (http://scrapy.org).




5 Star Film Co.Ltd. * Entrepreneur
Agents for an Award Winning Television Channel Franchise
I don't see what legal issues there are in collecting content from a website, because firms have been doing this for years with cookies and scripts. Copyright protects I.P. from others' designs to exploit it for profit at the loss of the owner, which is why "fair usage" was integrated into copyright law. However, your software is designed to facilitate other parties' usage of Amazon, so it would be aiding their growth rather than being a liability.

There are no legal grounds to prevent you. In fact, the concept of Windows 2000 was to create the first O.S. that could establish a window into everyone's computer. Perhaps there should be legal grounds against this spy system, because Microsoft has the ability to look into anyone's files on any computer and download whatever update program they like. The very idea was given to Bill Gates when he was approached by the CIA, who asked him to create the database to assist them with national security. Gates was later conveniently paid off by the corporation to move aside whilst a CIA executive took up a position on their executive board. Not every top executive liked that move, so inevitably there had to be a whistleblower.
Richard Pridham Entrepreneur • Advisor
Investor, President & CEO at Retina Labs
So web scraping has improved, but it's still messy and challenging from an engineering point of view. The idea I have revolves around aggregating and analyzing product reviews from the various places that contain them: review sites, social media, blogs, etc. Some may have APIs, others won't. The intended audience for the collected and analyzed data would be product manufacturers.
Chris Pointon Entrepreneur
Internet Entrepreneur and Technologist
Check out the tools at http://scrapinghub.com. As others have said, it's a complex business retrieving structured data from modern dynamic websites, especially if they don't want you to. Many have terms of use specifically barring scraping or data aggregation.
Richard Pridham Entrepreneur • Advisor
Investor, President & CEO at Retina Labs
I'm not a programmer so I'm not sure what this means:

http://docs.aws.amazon.com/AWSECommerceService/latest/DG/EX_RetrievingCustomerReviews.html

Does this mean that Amazon allows product review retrieval?
Rob Mitchell Entrepreneur
Senior Java Software Engineer at Direct Commerce
@RichardPridham, specifically for Amazon, the link you're asking about is for Amazon's Product Advertising API, which practically means you create a web shopping experience on your website but use Amazon's products and reviews. When a customer is ready to purchase, your website then switches to Amazon's website to log in, complete the order payment, etc.

IMO it's nothing more than a sales affiliate program to sell Amazon products, thus giving them more direct exposure to your website readers/subscribers.

Divya Raghavan Entrepreneur
Software Engineer 2 at Citrix Systems with a passion for products
Hi Richard, the best way to scrape a website is to use an HTML parser like JSoup in Java. You need to step through the HTML tags to get what you want. This is not a very reliable approach, because if the website owner makes changes to the page then your code will break. But this is the best you can do. Hope this helps! Thanks, Divya
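
For anyone wondering what "stepping through the HTML tags" looks like in practice, here is a rough sketch. It uses Python's built-in `html.parser` rather than JSoup, purely so it runs without installing anything. The `<div class="review">` markup, the `ReviewExtractor` class, the `extract_reviews` helper, and the sample page are all made up for illustration; a real site will use its own tags and class names, and you would have to inspect the page to find them.

```python
from html.parser import HTMLParser


class ReviewExtractor(HTMLParser):
    """Collect the text inside every <div class="review"> element.

    The tag and class name are hypothetical; a real page has its own markup.
    """

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._depth = 0    # greater than 0 while inside a review <div>
        self._chunks = []  # text fragments of the review being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1  # a nested <div> inside a review
        elif "review" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Join the fragments and collapse runs of whitespace.
                self.reviews.append(" ".join("".join(self._chunks).split()))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_reviews(html):
    """Return the text of each review block found in an HTML string."""
    parser = ReviewExtractor()
    parser.feed(html)
    parser.close()
    return parser.reviews


# Invented sample page, standing in for HTML fetched from a real site.
SAMPLE_PAGE = """
<html><body>
  <div class="review"><b>Great</b> sound, weak battery.</div>
  <div class="ad">Buy now!</div>
  <div class="review verified">Stopped working after a week.</div>
</body></html>
"""

if __name__ == "__main__":
    for text in extract_reviews(SAMPLE_PAGE):
        print(text)
```

It also shows the fragility mentioned above: if the site renamed the class from `review` to something else, `extract_reviews` would quietly return an empty list instead of raising an error, so it is worth checking for empty results.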