What I really want: a SQLite storage backend driver for S3/GCS. No need for disks then. I haven't been able to find such a solution though, and I'm not technically proficient enough in C (the language SQLite is written in) to build it myself.
The challenge with this is that S3 and friends are object stores, meaning you upload or download the whole file each time. As you can imagine, this will cost you tremendous bandwidth even to insert one row.
Furthermore, it doesn't solve the multiple writers problem, because (afaik) there's no way to lock a file on S3.
It costs roughly $5/million PUTs and $0.40/million GETs on S3 in addition to the bandwidth and storage you use.
S3 objects are also immutable. Once they’re written, they can’t be updated.
A read-only version of this might be useful, but probably wouldn’t work in-place.
Something that might be of interest is S3 Select support, which lets you query a single (optionally compressed) CSV, JSON, or Parquet file server-side at the same cost as a regular S3 GET.
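To make that concrete, here's a minimal sketch of what an S3 Select call looks like with boto3's `select_object_content`. The bucket name, key, and query are hypothetical placeholders; actually running the query requires AWS credentials, so the request-building and the network call are split apart:

```python
def build_select_params(bucket, key, expression):
    """Build the request dict for boto3's select_object_content.

    Assumes a gzipped CSV with a header row; adjust InputSerialization
    for JSON or Parquet objects.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": "GZIP",
        },
        "OutputSerialization": {"JSON": {}},
    }

def run_select(params):
    """Stream matching records back from S3 (needs AWS credentials)."""
    import boto3  # imported here so the builder above works offline
    s3 = boto3.client("s3")
    resp = s3.select_object_content(**params)
    for event in resp["Payload"]:
        if "Records" in event:
            yield event["Records"]["Payload"].decode()

# Hypothetical usage:
params = build_select_params(
    "my-bucket", "data.csv.gz", "SELECT s.name FROM S3Object s WHERE s.age > '30'"
)
```

Note that the SQL dialect is limited: you query one object at a time, referenced as `S3Object`, with no JOINs across objects (that's where Athena comes in).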
And if you really want relational semantics (i.e. JOINs, aggregations, and sub-queries) on a bucket full of CSV, JSON, Parquet, ORC, or regular-expression-describable files, definitely look at Athena: it's cost-effective, performs well even on buckets holding hundreds of TBs of data, and costs only $5/TB of data scanned per query.
RediSQL looks pretty sweet! To be frank, and this is only my personal opinion, I don't think I would want to pay for an API; I'd rather run my own. The business model of providing everything OSS but sending telemetry seems rather intriguing; as a hobbyist user I am OK with such telemetry being collected. If I were to run it in production for a business app, though, I wouldn't even consider the unpaid version, for the following reasons:
1. I do not want my production instance to shut down for ANY reason. This is just not an acceptable risk for most businesses. The only time a DB can go down is when something goes wrong.
2. As an engineer, I understand that 3 slightly inaccurate counters aren't a big deal. I can even look at the source and verify that they really do what you say. Justifying this to a security org, however, will be a complete nightmare, as most security orgs in enterprises are staffed with barely technical folks masquerading as "security".
So, it seems like a pretty good way to coerce enterprises to pay up while letting hobbyists continue using it. Very smart, I wish you the very best!
Benchmarks are always tricky, but sometimes useful, so yes, I should post some of them.
Right now I am busy with releasing the v2, but after that I should definitely do some more marketing.
Anyhow, to give you an order of magnitude: with in-memory data storage we reach ~80k inserts per second, on a machine with 1 vCPU and 3 GB of RAM, a $15/month box from DO.
That would be cool! As an interim step, if your data is small enough, perhaps you could run an in-memory SQLite db and periodically back it up to a permanent S3 file?
APSW exposes the SQLite backup API, so you could do the backups online without shutting down the database.
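For what it's worth, Python's stdlib `sqlite3` also exposes the same online backup API (as `Connection.backup`, Python 3.7+), so you don't strictly need APSW. A minimal sketch of the in-memory db plus disk snapshot idea; the table, file name, and the S3 upload step are illustrative assumptions:

```python
import sqlite3

# The live store: an in-memory SQLite database.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
mem.execute("INSERT INTO kv VALUES ('greeting', 'hello')")
mem.commit()

# Online backup to a disk file, without blocking writers on `mem`.
# This uses SQLite's backup API under the hood, the same one APSW wraps.
disk = sqlite3.connect("snapshot.db")
with disk:
    mem.backup(disk)

# snapshot.db could now be uploaded to S3 (e.g. with boto3's upload_file)
# on a timer, giving you cheap periodic durability.
row = disk.execute("SELECT v FROM kv WHERE k = 'greeting'").fetchone()
```

The obvious caveat is that you can lose whatever was written between snapshots, so this only suits data you can afford to replay or re-derive.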