Data Engineering
- How to store multimodal data?
- Where to store data for lower access latency and lower cost?
- How to store models to run on different hardware?
Data Format
Common Data Formats
Source: Designing Machine Learning Systems
Format | Binary or Text | Example Use cases |
---|---|---|
JSON | Text | Everywhere |
CSV | Text | Everywhere |
Parquet | Binary | Hadoop, Amazon Redshift |
Avro | Binary | Hadoop |
Protobuf | Binary | TensorFlow (TFRecord) |
Pickle | Binary | Python, PyTorch serialization |
Text vs. Binary Format
--- | Text | Binary |
---|---|---|
Examples | CSV, JSON | Parquet |
Pros | Human readable | (1) Fast to load/unload, (2) Takes up less space |
Cons | Slow to load/unload, takes up a lot of space | Not human readable |
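
The size difference can be illustrated with a minimal sketch that uses only the Python standard library. The sample values are hypothetical, and `struct` stands in for a real binary format such as Parquet, so the numbers only indicate the general trend.

```python
import csv
import io
import struct

# Hypothetical sample data: 1,000 floating-point readings.
values = [i / 7 for i in range(1000)]

# Text encoding: one CSV row, each number written out as decimal characters.
text_buffer = io.StringIO()
csv.writer(text_buffer).writerow(values)
text_bytes = text_buffer.getvalue().encode("utf-8")

# Binary encoding: each number packed as a fixed 8-byte little-endian double.
binary_bytes = struct.pack(f"<{len(values)}d", *values)

print(len(text_bytes), len(binary_bytes))

# Round trip: the binary form decodes without parsing any text.
decoded = list(struct.unpack(f"<{len(values)}d", binary_bytes))
```

The binary form always takes 8 bytes per value here, while the text form needs one byte per printed digit plus separators.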
Row-Major vs Column-Major
Multidimensional arrays are stored as contiguous data in memory. Source: https://ncar-hackathons.github.io/scientific-computing/numpy/02_memory_layout.html
--- | Row-Major | Column-Major |
---|---|---|
Description | Data is stored and retrieved row by row | Data is stored and retrieved column by column |
Examples | CSV | Parquet |
Use Cases | A lot of writes | A lot of feature reads |
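
A rough sketch of why the two layouts suit different workloads, using plain Python containers rather than real file formats. The field names and values are hypothetical.

```python
# Hypothetical dataset: examples with two features and a label.
# Row-major layout: one record per example, so adding an example is one write.
rows = [
    {"age": 34, "income": 72000, "label": 1},
    {"age": 51, "income": 58000, "label": 0},
    {"age": 27, "income": 43000, "label": 1},
]
rows.append({"age": 45, "income": 91000, "label": 0})  # cheap write

# Reading one feature from the row-major layout touches every record.
ages_from_rows = [record["age"] for record in rows]

# Column-major layout: one contiguous list per feature.
columns = {name: [record[name] for record in rows] for name in rows[0]}

# Reading one feature is a single lookup.
ages_from_columns = columns["age"]
assert ages_from_columns == ages_from_rows

# Adding one example has to touch every column.
new_example = {"age": 39, "income": 66000, "label": 1}
for name, value in new_example.items():
    columns[name].append(value)
```

Appending a record is a single operation in the row layout, while reading a whole feature is a single lookup in the column layout.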
Data Models
How data is stored.
Relational Model
Data is organized into relations, and each relation is a set of tuples. Data following the relational model is usually stored in file formats like CSV or Parquet.
- Normalization (https://stackoverflow.com/questions/4972271/what-is-the-t-sql-to-normalize-an-existing-table)
Pros | Cons |
---|---|
Reduces the storage space and duplicate data | Decreases the query performance and speed, since you have to perform more joins and lookups to retrieve the data. |
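
The trade-off can be sketched with the standard-library `sqlite3` module. The tables (`books_flat`, `books`, `publishers`) and their contents are hypothetical and exist only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized: the publisher's name and country repeat on every book row.
cur.execute("CREATE TABLE books_flat (title TEXT, publisher TEXT, country TEXT)")
cur.executemany(
    "INSERT INTO books_flat VALUES (?, ?, ?)",
    [
        ("Book A", "Acme Press", "US"),
        ("Book B", "Acme Press", "US"),
        ("Book C", "Birch House", "UK"),
    ],
)

# Normalized: publisher details are stored once and referenced by id.
cur.execute("CREATE TABLE publishers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
cur.execute("CREATE TABLE books (title TEXT, publisher_id INTEGER REFERENCES publishers(id))")
cur.executemany(
    "INSERT INTO publishers VALUES (?, ?, ?)",
    [(1, "Acme Press", "US"), (2, "Birch House", "UK")],
)
cur.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [("Book A", 1), ("Book B", 1), ("Book C", 2)],
)

# Reconstructing the flat view from the normalized tables requires a join.
joined = cur.execute(
    "SELECT b.title, p.name, p.country "
    "FROM books AS b JOIN publishers AS p ON b.publisher_id = p.id "
    "ORDER BY b.title"
).fetchall()
flat = cur.execute("SELECT * FROM books_flat ORDER BY title").fetchall()
conn.close()
```

Both queries return the same rows, but the normalized version stores each publisher once and pays for it with a join at read time.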
- The language used to specify the data you want from a database is called a query language. The most popular query language is SQL. SQL is a declarative language, so it's up to the database system to decide how to execute a given query.
- An imperative language (procedural, OOP) focuses on writing an explicit sequence of commands that describe how the computer should do things, while a declarative language (functional, logical) focuses on specifying the result you want.
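
To make the contrast concrete, here is a small sketch that computes the same per-customer totals twice: once with an explicit loop and once with a SQL query run through the standard-library `sqlite3` module. The `orders` data is hypothetical.

```python
import sqlite3

# Hypothetical order records: (customer, amount).
orders = [("alice", 30.0), ("bob", 12.5), ("alice", 7.5), ("carol", 50.0)]

# Imperative: spell out each step of the aggregation.
totals_imperative = {}
for customer, amount in orders:
    totals_imperative[customer] = totals_imperative.get(customer, 0.0) + amount

# Declarative: describe the result; the database decides how to compute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
totals_declarative = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)
conn.close()
```

The loop fixes the execution order, whereas the `GROUP BY` query leaves the choice of scan and aggregation strategy to the database engine.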
NoSQL
- Originally meant nonrelational databases, but was later reinterpreted as Not Only SQL.
Document model
Each document is a single continuous string encoded as JSON, XML, or a binary format like BSON. The document model doesn't enforce a schema. Joins across documents are harder and less efficient to execute than joins across tables in the relational model.
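
A minimal sketch of both points, using the standard-library `json` module in place of a real document database. The documents and field names are hypothetical.

```python
import json

# Hypothetical documents: each one is a self-contained JSON string.
book_doc = json.dumps({"title": "Book A", "publisher_id": "p1", "tags": ["ml", "data"]})
other_book_doc = json.dumps({"title": "Book B", "price": 20})  # different fields are allowed
publisher_doc = json.dumps({"id": "p1", "name": "Acme Press"})

# A "join" is done by hand: parse the documents and match keys in application code.
books = [json.loads(doc) for doc in (book_doc, other_book_doc)]
publishers = {p["id"]: p for p in [json.loads(publisher_doc)]}

joined = [
    {
        "title": book["title"],
        # Books without a publisher_id simply end up with None.
        "publisher": publishers.get(book.get("publisher_id"), {}).get("name"),
    }
    for book in books
]
```

Because nothing enforces a schema, the reading code has to tolerate missing fields such as `publisher_id`.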
Graph model
(https://towardsdatascience.com/comparing-graph-databases-5475bdb2e65f)
Pros | Cons |
---|---|
Query speed depends on the number of relationships traversed rather than the total amount of data. Good for real-time recommendation engines. | There is no standardized query language. Inappropriate for transaction-based systems. |
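
As a rough illustration of the recommendation use case, the sketch below stores a hypothetical "follows" graph as an adjacency list and suggests accounts two hops away. The `follows` data and the `recommend` helper are assumptions for illustration and do not reflect any particular graph database.

```python
from collections import Counter

# Hypothetical social graph stored as an adjacency list (node -> followed nodes).
follows = {
    "ann": {"bob", "cat"},
    "bob": {"cat", "dan"},
    "cat": {"dan", "eve"},
    "dan": set(),
    "eve": set(),
}

def recommend(user):
    """Suggest accounts followed by the user's follows, ranked by number of paths."""
    counts = Counter()
    for friend in follows[user]:
        for candidate in follows[friend]:
            # Skip the user and accounts the user already follows.
            if candidate != user and candidate not in follows[user]:
                counts[candidate] += 1
    # Sort by count (descending), then by name, for a deterministic result.
    return sorted(counts, key=lambda name: (-counts[name], name))

recommendations = recommend("ann")
```

The work done is proportional to the edges traversed from the starting node, not to the total number of nodes in the graph.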
Structured versus Unstructured Data
https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/data-lake-foundation.html
--- | Structured data | Unstructured data |
---|---|---|
Repository | Data warehouses | Data lakes |
Schema | Yes | No |
Search | Fast lookup and analysis | Data arrives quickly, but search is slow |
Schema change | Schema changes cause a lot of trouble | No need to handle schema changes at storage time; downstream consumers have to handle them |
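
The schema-change row can be sketched as follows. `SCHEMA`, `write_structured`, and `write_raw` are hypothetical helpers that mimic a schema-enforcing store and a raw store; they are not the API of any real warehouse or data lake.

```python
import json

# Hypothetical schema for the structured store.
SCHEMA = {"user_id": int, "country": str}

def write_structured(table, record):
    """Enforce the schema at write time and reject records that do not match."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected_type in SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
    table.append(record)

def write_raw(lake, record):
    """Store whatever arrives as a raw string, with no schema check."""
    lake.append(json.dumps(record))

table, lake = [], []
old = {"user_id": 1, "country": "US"}
new = {"user_id": 2, "country": "UK", "device": "ios"}  # upstream added a field

for record in (old, new):
    write_raw(lake, record)  # always succeeds
    try:
        write_structured(table, record)
    except ValueError:
        pass  # the schema change breaks the structured write

# Downstream readers of the raw store must cope with missing fields themselves.
devices = [json.loads(raw).get("device", "unknown") for raw in lake]
```

The raw store accepts both records, while the structured store rejects the one with the new field until its schema is updated.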