What are some good free web scrapers & techniques? If you've used one please share your story.

Looking to learn more about web scraping and hear from those who've used scrapers. If you've used one (or several), please share your purpose, the scraper's link or name, and your results! (RoR-oriented ones would be most helpful!) Cheers!

9 Replies

Michael Brill Entrepreneur
Technology startup exec focused on AI-driven products
Hi Marlina.

I've been scraping up a storm lately, learning a fair bit about copyright law and proxy meshes in the process. My general assessment is that if you have development skills, you're probably better off using your favorite language than a purpose-built tool. I looked at OutWit and Mozenda... and while they would both have done what I needed, each is yet another technology to learn and support.

No experience with RoR scraping libraries, but I'm sure that if you google around, one or two will float to the top. I looked at Scrapy for Python and it looked quite good. Since we're working in Node.js, I just used two modules (request and cheerio), and it's ridiculous how simple it is. Since many sites have anti-scraping technology installed, you may end up having to use a VPN that cycles through IPs. I just used Hotspot Shield ($30/year) and that's worked on everything so far.
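To give a sense of how little code the "fetch a page, then pick elements out of it" approach needs, here is a minimal Python sketch. It is not the Node.js request/cheerio code described above; it swaps in Python's standard library (`urllib` and `html.parser`). The names `LinkExtractor`, `extract_links`, and `fetch`, the User-Agent string, and the sample URLs are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in the given HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url, timeout=10):
    """Download a page as text. The User-Agent value is a placeholder."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Parsing a hard-coded snippet keeps the example runnable offline.
sample = '<p><a href="/about">About</a> <a href="http://example.org/x">X</a></p>'
links = extract_links(sample, "http://example.com/index.html")
```

Against a live site you would call `extract_links(fetch(url), url)` instead of parsing the hard-coded snippet.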

Make sure you know about intellectual property issues and litigation risks.

...Michael

Scott Fairbairn Entrepreneur
Software Artisan, Technologist
Hi Marlina,

We've used BeautifulSoup, a Python library, successfully in the past. As Michael mentioned, when it comes to scraping data (hopefully legally), you run into all kinds of corner cases that off-the-shelf products might not handle gracefully.

Writing the code directly seems to work best, at least for us.

-Scott

David Hunter Entrepreneur
Machine Learning Research, University of Oxford
Been looking at this very recently and I agree with Michael that it's better to write your own scraper if you can.

I'm a Python addict and have so far found splinter (http://splinter.cobrateam.info/) to be perfect for pretty much any application, including scraping 'ajaxy' data rendered and hidden with JavaScript.

David
Jonathan Vanasco Entrepreneur • Advisor
Co-Founder at Aptise
Ruby on Rails is a web-app development framework -- basically the exact thing you don't want to scrape with.

Generally speaking, you want the scrapers either to be their own daemon/service or to be dispatched tasks out of a message queue. Requesting web pages is a blocking operation, and scraping often creates additional tasks (i.e., you derive another page to scrape), so those are things to consider as well. Scraping is often best implemented in an event-driven, asynchronous framework.
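The queue-driven, asynchronous pattern described above can be sketched roughly as follows. This is a minimal illustration that swaps in Python's standard `asyncio` module rather than any of the stacks named in this reply; `crawl`, `fake_fetch_links`, and the in-memory `FAKE_SITE` are hypothetical stand-ins for real HTTP requests.

```python
import asyncio


async def crawl(start_url, fetch_links, num_workers=3):
    """Visit every page reachable from start_url exactly once.

    fetch_links is a coroutine that takes a URL and returns the URLs
    found on that page; each newly discovered URL becomes a new task.
    """
    queue = asyncio.Queue()
    seen = {start_url}
    queue.put_nowait(start_url)

    async def worker():
        while True:
            url = await queue.get()
            try:
                for link in await fetch_links(url):
                    if link not in seen:
                        seen.add(link)
                        queue.put_nowait(link)  # scraping creates more work
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(num_workers)]
    await queue.join()  # returns once every queued URL has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return seen


# Hypothetical in-memory "site": each page maps to the links it contains.
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}


async def fake_fetch_links(url):
    await asyncio.sleep(0)  # yield control, as a real network call would
    return FAKE_SITE.get(url, [])


visited = asyncio.run(crawl("/", fake_fetch_links))
```

While one worker waits on a page, the others keep pulling URLs off the queue, which is the point of not blocking on each request.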

If you can start from scratch, I'd probably do everything in Erlang (or Node.js).

If you want to stay in Ruby, you should look at Redis+Resque (https://github.com/resque/resque).

In Python, you can do some decent scraping with Redis+Celery for task management. You can also do everything in Twisted Python. I've done both with great results. I'm already biased toward Python, but it has the BeautifulSoup library for parsing and navigating HTML documents, and that makes pulling data out of the scraped pages way, way, way easier.

If you can avoid scraping, I'd suggest avoiding it. There are companies like Embedly (http://embed.ly/) that offer an API that gives you most of the data you'd get from scraping, with a lot less work.
Manu Kodiyan Entrepreneur
Founder at Althea Health
I am interested in this topic because we may go down the web scraping path in the future. It is on my radar, but I am by no means an expert. In the past I worked at a company where a different group than mine did web scraping. I know they worked very hard dealing with badly formed HTML, JavaScript issues, etc. on different websites.
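As a small illustration of the "badly formed HTML" problem, here is a hedged Python sketch that pulls table cells out of markup with missing closing tags and inconsistent case. It uses only the standard library's event-based `html.parser`; the class `CellText`, the function `cell_texts`, and the sample markup are invented for this example, and real sites will throw up many more cases than this handles.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the text of each <td>, even when closing tags are missing."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # A new <td> implicitly ends the previous, unclosed one.
            self.cells.append("")
            self._in_cell = True
        elif tag == "tr":
            self._in_cell = False

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def cell_texts(html):
    """Return the stripped text of every table cell in the HTML string."""
    parser = CellText()
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


# Unclosed <td> and <tr> tags, plus an upper-case closing tag.
messy = "<table><tr><td>Alice<td> 42 </TD><tr><td>Bob<td>17</table>"
cells = cell_texts(messy)
```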

So unless you have significant development resources and you need to scrape just a handful of sites, you may be better off seeing if there are any companies/open source projects that have already learned the lessons and fought the battles of web scraping. That way you can concentrate on your core competencies.

By the way, a company that adopted this strategy fairly successfully is mint.com. They used Yodlee to do the web scraping for them. This was circa 2006; both companies seem to have moved on since then.

You might want to look at this discussion: http://www.quora.com/Web-Scraping/What-are-some-good-free-web-scrapers-scraping-techniques

Finally, in my opinion, the hard job in web scraping is handling the idiosyncrasies of various websites, so if there is a great non-RoR tool, it may not be that hard to integrate it with a RoR back end to handle the results of scraping.

Best of luck,
Manu
Jesal Gadhia Entrepreneur
Full-Stack Developer
Take a look at these two Ruby web scraping libraries: Pismo and Mechanize.

I've played around with both in the past with decent results. I've also used Nokogiri (https://github.com/sparklemotion/nokogiri), which is more bare metal but gives you more flexibility. (Pismo and Mechanize use Nokogiri on the back end.)
Harshit Rastogi Entrepreneur
Breaking the barrier of being #ordinary
I have used BeautifulSoup in Python, which is good to begin with. But I have realized that I prefer using XPath to get the data, since it doesn't have much of a learning curve.

I used Ruby and Nokogiri.
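A rough sketch of the XPath style of extraction, for comparison. The setup described above was Ruby with Nokogiri; this example swaps in Python's standard `xml.etree.ElementTree`, which supports only a limited XPath subset and needs well-formed markup, so treat it as an illustration of the idea rather than a general HTML solution. The table id, class names, and values are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment.
page = """
<html>
  <body>
    <table id="quotes">
      <tr class="row"><td class="symbol">AAA</td><td class="price">10.50</td></tr>
      <tr class="row"><td class="symbol">BBB</td><td class="price">7.25</td></tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(page)

# One path expression selects every data row of the table.
rows = root.findall(".//table[@id='quotes']/tr[@class='row']")

# Relative expressions then pick the individual fields out of each row.
quotes = {
    row.findtext("td[@class='symbol']"): float(row.findtext("td[@class='price']"))
    for row in rows
}
```

The appeal is that the selection logic lives in short path strings rather than in hand-written traversal code.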
Jerome Dangu Entrepreneur
CTO & Co-Founder at ClarityAd
PhantomJS is a headless browser that is especially useful for sites that rely heavily on JavaScript.
You can access the rendered DOM rather than just the raw HTML source.
Toddy Mladenov Entrepreneur • Advisor
CTO and Co-Founder at Agitare Technologies Inc.
I used http://scrapy.org/ (Python) for downloading Yahoo! Finance information, and it required only 40 lines of code. Very easy to learn and use.