As a machine learning engineer who mainly works on vision-based projects, I have little opportunity to actually use SQL (Structured Query Language) in my work. Even when I need it, I usually just ask the data engineers for a favor.

However, I think it is an essential skill to learn, even if I do not need it right now.

Why do machine learning engineers need SQL?

It really depends on what kind of data you are dealing with. If you are doing traditional machine learning, such as recommendation or ranking, the raw data will probably be stored in databases, and you will need SQL for data wrangling.
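To make "data wrangling" a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `ratings` table, its columns, and the helper names are all made up for illustration; a real recommendation dataset would look different. The query counts ratings per item and averages them, the kind of aggregation you might run before building features.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small, hypothetical ratings table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ratings (user_id INTEGER, item_id INTEGER, rating REAL)"
    )
    rows = [
        (1, 10, 5.0),
        (2, 10, 4.0),
        (1, 20, 3.0),
        (3, 20, 2.0),
        (2, 30, 1.0),
    ]
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def top_items_by_mean_rating(conn, min_ratings=2):
    """Return (item_id, n_ratings, mean_rating) for items with enough ratings."""
    query = """
        SELECT item_id, COUNT(*) AS n_ratings, AVG(rating) AS mean_rating
        FROM ratings
        GROUP BY item_id
        HAVING COUNT(*) >= ?
        ORDER BY mean_rating DESC, item_id ASC
    """
    return conn.execute(query, (min_ratings,)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for item_id, n_ratings, mean_rating in top_items_by_mean_rating(conn):
        print(item_id, n_ratings, mean_rating)
```

With the toy rows above, item 30 has only one rating, so the HAVING clause filters it out. The same GROUP BY / HAVING / ORDER BY pattern should carry over to MySQL or PostgreSQL with little or no change.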

If you are working on computer vision, audio signal recognition, etc., you may not need SQL day to day, although that again depends on where your data and its metadata are stored.


Which SQL databases should we use?

There are a lot of SQL databases out there. Some common ones are:

  • MySQL: A client-server RDBMS. It has more features than SQLite. Its development is now led by Oracle.
  • PostgreSQL: Another RDBMS. It has more advanced features and is open source.
  • SQLite: An embeddable DBMS. It is small and uses local files. It is not suited to large-scale concurrent requests or terabytes of data.
  • Microsoft SQL Server: Commercial software.
  • Oracle Database: Also commercial software.

All of these databases support a common core of SQL syntax, but each may also have its own extended syntax.

If you like open source, PostgreSQL and MySQL are good choices. According to the DB ranking, PostgreSQL and MySQL are ranked 4th and 2nd among RDBMSs, respectively. PostgreSQL seems to support more of the standard SQL syntax than MySQL, but much of the knowledge is transferable between the two.

In practice, we can learn either MySQL or PostgreSQL, but PostgreSQL seems to have more detailed documentation. I also give some links below that discuss the differences between these databases.