In my company we have a data store of 10TB and growing. When we started, we tried to use MySQL, and all the docs said that performance starts to degrade after about 200GB.
We ran everything on a single machine during the tests. At the start we got 50k inserts/sec, not bad, and it was fine for our needs, but as the database grew, the insert speed declined, a known problem with MySQL and other databases as well.
We tried a number of designs: more tables, partitions, different table layouts. But that only meant that instead of starting to lose performance at 50GB, we started losing it at 100GB, which wouldn't get us far.
At 500GB of storage, the insert speed was down to 6k inserts/sec, which we couldn't use. We researched some more and found no solution with MySQL or any other free database (NoSQL was not a fit for us), so we had to either use Amazon Redshift or deploy a 10TB MySQL cluster. That meant paying $20k–$30k per month before making a single dollar in profit.
In the end we spent two years building our own database. It is tailored to our business needs, so we can't sell the software, but it reaches 350k inserts and 350k selects per second, and we can store the entire 10TB on one machine. It has compression, partitions, indexes, and sparse indexes.
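The post doesn't show how its sparse indexes work, but as a rough illustration of the general idea (this is my own minimal sketch, not the author's implementation): keep only every Nth key of a sorted column in memory, binary-search those anchors, then scan at most one block of the on-disk data.

```python
import bisect

class SparseIndex:
    """Sparse index over a sorted column: keep only every block_size-th
    key in memory, then scan at most one block on lookup."""

    def __init__(self, sorted_keys, block_size=1024):
        self.keys = sorted_keys          # full sorted data (on disk in practice)
        self.block_size = block_size
        # One anchor per block: the first key of each block stays in RAM.
        self.anchors = sorted_keys[::block_size]

    def find(self, key):
        # Binary-search the small in-memory anchor list to pick a block.
        block = bisect.bisect_right(self.anchors, key) - 1
        if block < 0:
            return None  # key is smaller than the first stored key
        # Linear scan inside the single candidate block.
        start = block * self.block_size
        for pos in range(start, min(start + self.block_size, len(self.keys))):
            if self.keys[pos] == key:
                return pos
        return None

# Toy usage: 50 even keys, one anchor every 8 keys.
idx = SparseIndex(list(range(0, 100, 2)), block_size=8)
```

The point is the memory trade-off: only one anchor per block is resident, so an index over many terabytes can fit in modest RAM at the cost of a short scan per lookup.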
We also have a 2TB MySQL database being updated at a rate of 100 inserts per second (the rate of the incoming data), and we are quite happy with its performance. We did have to use partitions, though, since past 300GB per table it was hard to optimize the tables.
“Why didn’t you use a distributed system or cloud SQL from vendor X?”
The answer is price. The cost would be $10k–$30k per month, and that is without failover, for 10TB of SSD storage with enough CPU/memory per instance to sustain the insert rate. If a company has the revenue to support it, that might be the right approach, but my company's idea was still in the design phase; I'd rather spend my time than pay money I don't have.
Furthermore, my database can store more information in 10TB than the other solutions' 10TB, because:
- I know my data, so I don't need to store the metadata or other variables that support MySQL's generic nature
- I'm using three types of compression:
  - String compression
  - URL dictionary compression
  - zlib
- I test in real time which compression, or combination of compressions, gives the best result per item/table and use that; with zlib, I also sort the data first and check the compression ratio, which improves size by about 30%
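The post doesn't show how this per-table compression test works; here is a minimal sketch of the sort-then-compress idea using zlib (the function name and candidate set are my own assumptions, not the author's code): compress the column as-is and again after sorting, and keep whichever output is smaller.

```python
import zlib

def best_compression(values):
    """Try zlib on the data as-is and on a sorted copy; return the
    name and blob of the scheme that yields the smaller output.
    Sorting groups similar values together, which often improves
    the compression ratio."""
    candidates = {
        "zlib": zlib.compress("\n".join(values).encode(), 9),
        "sorted+zlib": zlib.compress("\n".join(sorted(values)).encode(), 9),
    }
    name, blob = min(candidates.items(), key=lambda kv: len(kv[1]))
    return name, blob

# Repetitive URL-like data, the kind a URL dictionary also targets.
urls = [f"https://example.com/page/{i % 50}" for i in range(1000)]
scheme, blob = best_compression(urls)
```

Note that sorting destroys row order, so a real system would either sort within self-contained partitions or store the permutation separately to restore the original order on read.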
So for $700/mo, including failover, I'm able to do what the other systems do for $10k–$30k/mo without failover, and I can store more data and scale up at a better cost.
It's true that it cost me two years of time, but I had time and didn't have money. I know the CTO of a very big company who said he'd spend even $100k/mo to keep his engineers working on something else; he has money, but he always needs more good engineers.