Can a Python - Elastic stack handle big data (2 billion + records)?

We estimate our product will aggregate around 2 billion data points in the next year. Each data point will be a combination of 15 to 20 information items (text) about entities. We use Python/Django for the application and APIs, and Elasticsearch as the data store.

Will this stack scale to 2 billion records and the associated processing? If not, what should I be considering? I would appreciate your suggestions.

15 Replies

Kias Hanifa, Entrepreneur
Chief Technology Officer at Fonicom Limited, Malta
My suggestion would be to use a primary data store such as a MySQL partitioned cluster or MongoDB, Elasticsearch in a distributed/cluster mode for search and analytics, and Python/Django for the application.
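The split Kias describes — a durable primary store with Elasticsearch kept purely as a derived search index — can be sketched in plain Python. The in-memory classes below are hypothetical stand-ins for MySQL/MongoDB and an Elasticsearch index, not real client libraries:

```python
# Sketch of the "primary store + search index" split (illustrative stand-ins).
class PrimaryStore:
    """System of record: every write lands here first."""
    def __init__(self):
        self.rows = {}

    def save(self, record_id, record):
        self.rows[record_id] = record

    def get(self, record_id):
        return self.rows[record_id]


class SearchIndex:
    """Derived index: can be rebuilt from the primary store at any time."""
    def __init__(self):
        self.docs = {}

    def index(self, record_id, record):
        self.docs[record_id] = record

    def search(self, field, value):
        return [rid for rid, doc in self.docs.items() if doc.get(field) == value]


def save_entity(store, index, record_id, record):
    # Write to the system of record first; in production the index
    # update would typically happen asynchronously (e.g. via a queue).
    store.save(record_id, record)
    index.index(record_id, record)


store, index = PrimaryStore(), SearchIndex()
save_entity(store, index, 1, {"name": "Acme", "country": "MT"})
save_entity(store, index, 2, {"name": "Globex", "country": "US"})
print(index.search("country", "MT"))  # -> [1]
```

The key property is that Elasticsearch holds no data that cannot be regenerated from the primary store, so losing or rebuilding the index is an inconvenience rather than data loss.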


Radamantis Torres
Support Manager at Appcelerator
Hello,
This depends on how you are architecting your backend. Python/Django can handle that amount of data, but you will need a load balancer and multiple servers processing whatever needs to be processed. As for Elasticsearch, it also depends on the size of the data: you mentioned 2 billion records, but are we talking about terabytes? Petabytes?
I'd suggest using Elasticsearch as a caching mechanism with another powerful database behind it, like PostgreSQL. If you're talking about serious petabytes, you'll want to look at Hadoop.
Marc Milgrom, Advisor
Business Manager at Bloomberg, LP
To reiterate the above, Elastic is NOT a data store. It's an index and search engine that runs on top of another database in the stack.
I would personally recommend PostgreSQL as your persistent data store, unless the data records are very large text documents with no structure, in which case Mongo or Hadoop are better suited.
Slavomir Jasinski, Entrepreneur
Technical Director at Real Estate Industry
An interesting talk on using Python and moving toward Go:

https://www.youtube.com/watch?v=JOx9enktnUM

So if you are looking for super efficiency, consider something "better" than Python.
Federico Marani, Entrepreneur
Technical Architect
There is nothing inherently limiting in the tools you chose, but at that size the backend architecture becomes really important.
I wouldn't use Elasticsearch as a primary datastore; that's not its purpose. Use something like PostgreSQL (or look into Cassandra).
You can use Elasticsearch as a cache for the "aggregated view" of your data (after having joined together all 15-20 data points), but if you are not doing full-text search over it, you may as well keep using PostgreSQL.
A configuration that may work is PostgreSQL with two sets of tables: one with the original data and one with a structure optimized for the most common operations your product performs.
I have seen this sort of thing work well for processing 20 million records every day, but we spent a fair amount of time optimizing it, and the database rows were quite independent.
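Federico's two-sets-of-tables idea can be sketched with SQLite standing in for PostgreSQL (table and column names here are illustrative, not from the thread): raw data points land in one table, and a periodic job rebuilds a second, pre-aggregated table that serves the product's common reads.

```python
# SQLite sketch of the raw-table + optimized-table pattern.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_points (entity_id INTEGER, key TEXT, value TEXT)")
conn.execute("CREATE TABLE entity_summary (entity_id INTEGER PRIMARY KEY, doc TEXT)")

# Ingest the 15-20 raw information items per entity.
conn.executemany(
    "INSERT INTO raw_points VALUES (?, ?, ?)",
    [(1, "name", "Acme"), (1, "city", "Valletta"), (2, "name", "Globex")],
)

# Periodically rebuild the optimized table by joining the raw items together.
conn.execute("DELETE FROM entity_summary")
conn.execute(
    """INSERT INTO entity_summary
       SELECT entity_id, group_concat(key || '=' || value, ';')
       FROM raw_points GROUP BY entity_id"""
)

# Reads hit the small, pre-aggregated table instead of scanning raw_points.
doc = conn.execute(
    "SELECT doc FROM entity_summary WHERE entity_id = 1"
).fetchone()[0]
print(doc)
```

At real scale the rebuild would be incremental (or a materialized view) rather than a full `DELETE` and re-insert, but the read path is the same: one indexed lookup per entity.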
Doug Helferich, Entrepreneur • Advisor
Product Manager - Events and Integrations at Wayfair
MongoDB might be better depending on the structure of the data (if it's inconsistent rather than having a nice, clean relational structure).

I'd second pairing it with Elasticsearch as a cache; it might be a little faster for reads than SQL.
Joe Emison, Advisor
Chief Information Officer at Xceligent
You shouldn't listen to any advice here so far, because you haven't given us enough detail about what you're doing. You have described what you want to store, but you haven't talked about:
  • how you need to search/retrieve data (are you searching with filters and/or a map, or retrieving records directly? Are you retrieving aggregate data or individual records? Do you need to connect your data to other data sources automatically/via API?)
  • how fast you need to retrieve data (sub-second? monthly reports?)
  • your key business risks (are you low on money? need to launch quickly? worried about adoption?)
Again, don't take architecture advice from anyone until you've actually laid out the full range of your needs and risks. Engineers are notorious for overengineering solutions for startups whose biggest risk is that no one will use the thing at all. It's much better to have actual customers running into technical limitations that force a refactor because you launched quickly, than to never have any customers because the engineers decided to build something to last ten years and you ran out of money before reaching product-market fit.
Wedge Martin, Advisor
CTO at Vivo Technology Inc
Lots of good answers here, so I'm late to the party. My short answer: you can scale just about any platform to handle X traffic load, but the real difference is how much development time you have to put into sharding your data, and how much the platform will cost to run. Django and MySQL can get you there reasonably well, and Elasticsearch is a great solution for handling search queries and acting as a bit of a buffer in front of your database. My personal preference would be MongoDB, as I have a lot of experience with it, but you want someone in-house who really knows the platform well. Some other good decisions will be around how the app is architected: the back end should be all API endpoints feeding the front-end clients (mobile or otherwise), with no views generated server-side. Layer a cache on top of everything, and make sure you overwrite cache entries when data is updated, rather than letting them expire, to avoid stampedes.
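Wedge's last point — overwrite the cached entry on update instead of invalidating it — is a write-through pattern. A minimal sketch, with a plain dict standing in for memcached/Redis:

```python
# Write-through caching sketch: updates overwrite the cache entry directly,
# so readers never see a deleted key and stampede the database on a miss.
cache = {}
database = {}

def update_record(record_id, record):
    database[record_id] = record
    cache[record_id] = record      # overwrite, don't invalidate

def read_record(record_id):
    if record_id in cache:         # hot path: no database hit
        return cache[record_id]
    record = database[record_id]   # cold path: single miss fills the cache
    cache[record_id] = record
    return record

update_record(42, {"name": "Acme"})
print(read_record(42))             # served from cache, not the database
```

With expiry-based invalidation, every expired hot key produces a window in which many concurrent readers miss simultaneously and all query the database at once; overwriting on update closes that window for update-driven changes.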
Joe Emison, Advisor
Chief Information Officer at Xceligent
Just to further emphasize that you shouldn't rely on advice here without giving further context:

If your product builds an aggregate report from all of this data, generated once daily and emailed out to people, then almost certainly the best option hasn't been mentioned yet: use HP Vertica or Amazon Redshift (columnar databases), and just write simple scripts to run queries on them and send the emails. (In this context, the scripts would be so small that even using something ugly and hard to maintain, like Perl, would probably be reasonable for an MVP.)

You can see from the advice here how common it is for engineers to recommend their favorite/preferred stacks without full information (hell, they even do it with full information that clearly indicates something else would be better), ignoring that technology must be implemented in a business context, not in isolation from it.
Jofin Joseph, Entrepreneur
Lead data diagnostics at Vibe
Thanks a lot for the suggestions, and sorry for not being comprehensive enough in my question. Our data records are all text, with an average size of 3 KB per record, so we estimate the maximum total size at around 6,000 GB.

Our use case involves searching for and retrieving individual records from this dataset at a frequency of up to 50 per second. There will also be update operations on records at a rate of up to 10 per second.

We are weighing all the options in front of us to reach a decision. Would appreciate more thoughts.
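Jofin's sizing can be sanity-checked with a quick back-of-envelope calculation (using decimal units, 1 GB = 1,000,000 KB; all figures are the estimates from the thread):

```python
# Back-of-envelope check of the numbers given in the thread.
records = 2_000_000_000          # ~2 billion data points in the first year
record_size_kb = 3               # average record size (text)

total_gb = records * record_size_kb / 1_000_000   # KB -> GB, decimal units
reads_per_sec = 50               # individual-record lookups
writes_per_sec = 10              # update operations

print(f"~{total_gb:,.0f} GB total")               # matches the ~6,000 GB estimate
print(f"~{reads_per_sec + writes_per_sec} ops/sec peak")
```

At roughly 6 TB of text and tens of operations per second, this is a large but not extreme workload — well within range of a single well-tuned PostgreSQL or sharded MySQL/MongoDB deployment, which is consistent with most of the advice above.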