Hacker News new | past | comments | ask | show | jobs | submit login

I'm curious as to why you choose to break out specific headers in the schema.

For example, you have recipients, subject, and sender as JSON fields, when you could have just a headers field with all of them, and even add the rest of the headers in the message.

If it's performance related, you can still have headers as a single json blob and then use generated columns for the specific fields.

For example

  CREATE TABLE IF NOT EXISTS "messages" (
    "id" INTEGER NOT NULL PRIMARY KEY, -- internal id
    "message_id" TEXT NOT NULL, -- Gmail message id
    "thread_id" TEXT NOT NULL, -- Gmail thread id
    "headers" JSON NOT NULL, -- JSON object of { "header": value },
    "subject" TEXT GENERATED ALWAYS AS (json_extract("headers", '$.Subject')) VIRTUAL NOT NULL)
    ...
  );
  CREATE INDEX subjectidx on messages(subject);
I've found this model really powerful, as it allows users to just alter table to add indexed generated columns as they need for their specific queries. For example, if I wanted to query dkim status, it's as simple as

  ALTER TABLE messages ADD dkim TEXT GENERATED ALWAYS AS (json_extract("headers", '$."Dkim-Signature"')) VIRTUAL NOT NULL);
  CREATE INDEX dkimidx on messages(dkim);
  SELECT dkim, COUNT(0) FROM messages GROUP BY dkim;
or whatever you want.



Note that you don't actually need the generated column either, SQLite supports indexes on expressions, so you can do, for example,

  CREATE INDEX subjectidx ON messages(json_extract(headers, '$.Subject'))
and it will use this index anywhere you reference that expression.

I find it useful to create indexes like this, then create VIEWs using these expressions instead of ALTER'ing the main table with generated columns.


And since view and indexes don't change the data, you can use tools like https://github.com/fsaintjacques/recordlite to automate schema management.


This is cool! I quite like this.


What a great timely tip. Was just looking for good direction on how to do this. Thanks!


Adding indexes to support a one off query seems like bad practice?

In general I prefer break out columns that I expect to have/use consistently, especially for something as stable as email headers. Maybe schema changes are a bit easier with a headers column, but imo its just trading the pain on write for pain on read (while leaving the door open to stuff failing silently).


I reach for a similar pattern a lot with postgres as I'm building up a system. Start with a think about the fields I know I want, and create the tables with them, and then store all the metadata I have lying around in a json column, then in 2 months when I realize what fields I actually need populate them from json, and then make my API keep them up to date, or make a view, or what ever.

I've found it really helpful to avoid the growing pains that come with "just shove it all in mongo", or "just put it on the file system", but not much cost.


I see that you defined the `dkim` column as NOT NULL. So what happens when an email message does not contain the Dkim-Signature header?


Probably something like

  Error: stepping, NOT NULL constraint failed: messages.dkim (19)
because, unlike MySQL, SQLite apparently returns SQL NULL for JSON null value.


As hun3 said, it would throw an error. My mistake in the quick example I put together. I just noticed an unbalanced () as well. Whoops.


TIL, thanks a lot!


You can also create indices directly on expressions, including json_extract etc.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: