Data Engineering

  1. How to store multimodal data?
  2. Where to store data for lower latency of access and lower cost
  3. How to store models to run on different hardware

Data Format

Common Data Formats

Source: Designing ML systems

| Format | Binary or Text | Example Use Cases |
|---|---|---|
| JSON | Text | Everywhere |
| CSV | Text | Everywhere |
| Parquet | Binary | Hadoop, Amazon Redshift |
| Avro | Binary | Hadoop |
| Protobuf | Binary | TensorFlow TFRecord |
| Pickle | Binary | Python, PyTorch serialization |

Text vs. Binary Format

| | Text | Binary |
|---|---|---|
| Examples | CSV, JSON | Parquet |
| Pros | Human readable | (1) Fast to load/unload, (2) Takes up less space |
| Cons | Slow to load/unload, takes up a lot of space | Not human readable |
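As a rough illustration of the space difference, the sketch below encodes one integer both ways using only the Python standard library. The sample value and the choice of a 32-bit integer are assumptions made for the example, not something the formats above prescribe.

```python
import struct

# Hypothetical sample value, chosen for illustration only.
value = 1_000_000

# Text encoding: one byte per digit in ASCII/UTF-8.
text_bytes = str(value).encode("utf-8")

# Binary encoding: a fixed-width little-endian signed 32-bit integer.
binary_bytes = struct.pack("<i", value)

print(len(text_bytes))    # 7
print(len(binary_bytes))  # 4

# The text form is readable as-is; the binary form must be decoded first.
decoded = struct.unpack("<i", binary_bytes)[0]
```

The gap grows with the number of digits, while the binary width stays fixed, which is one reason binary formats tend to be more compact for numeric data.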

Row-Major vs Column-Major

Multidimensional arrays are stored as contiguous data in memory. Source: https://ncar-hackathons.github.io/scientific-computing/numpy/02_memory_layout.html

| | Row-Major | Column-Major |
|---|---|---|
| Description | Data is stored and retrieved row by row | Data is stored and retrieved column by column |
| Examples | CSV | Parquet |
| Use Cases | A lot of writes | A lot of feature reads |
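The trade-off in the table can be sketched in plain Python. This is a toy model under stated assumptions: the dataset and the field names `ride_id` and `price` are hypothetical, and real formats such as CSV and Parquet work on bytes on disk rather than on Python objects.

```python
# Hypothetical toy dataset: three examples with two features each.
rows = [
    {"ride_id": 1, "price": 9.5},
    {"ride_id": 2, "price": 12.0},
    {"ride_id": 3, "price": 7.25},
]

# Row-major layout: each record is kept together, so appending a new
# example is a single write at the end.
row_major = [(r["ride_id"], r["price"]) for r in rows]
row_major.append((4, 10.0))

# Column-major layout: each feature is kept together, so reading one
# feature touches only that column.
column_major = {
    "ride_id": [record[0] for record in row_major],
    "price": [record[1] for record in row_major],
}

# Reading the "price" feature from each layout:
prices_from_rows = [record[1] for record in row_major]  # scans every record
prices_from_columns = column_major["price"]             # one lookup
```

Appending a record to `column_major` would require one write per feature list, which mirrors why row-major formats suit write-heavy workloads.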

Data Models

A data model describes how data is represented and stored.

Relational Model

Data is organized into relations, and each relation is a set of tuples. Data following the relational model is usually stored in file formats like CSV or Parquet.

Pros and cons of normalization, from https://www.linkedin.com/advice/0/what-advantages-disadvantages-using-denormalized#:~:text=Disadvantages%20of%20normalization&text=First%2C%20it%20increases%20the%20complexity,lookups%20to%20retrieve%20the%20data.

| Pros | Cons |
|---|---|
| Reduces storage space and duplicate data | Decreases query performance and speed, since you have to perform more joins and lookups to retrieve the data |

  • The language used to specify the data you want from a database is called a query language. The most popular query language is SQL. SQL is declarative, so it is up to the database system to decide how to execute a given query.
  • An imperative language (procedural, OOP) focuses on writing an explicit sequence of commands that describe how the computer should do things, while a declarative language (functional, logic) focuses on specifying the result you want.
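The contrast between the two styles can be sketched with Python's built-in `sqlite3` module. The `books` table and its rows are hypothetical, made up for this example.

```python
import sqlite3

# Hypothetical in-memory table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [("A", 30.0), ("B", 12.5), ("C", 45.0)],
)

# Declarative: state the result wanted; the database decides how to get it.
declarative = [
    row[0]
    for row in conn.execute(
        "SELECT title FROM books WHERE price > 20 ORDER BY title"
    )
]

# Imperative: spell out each step (scan, filter, sort) explicitly.
imperative = []
for title, price in conn.execute("SELECT title, price FROM books"):
    if price > 20:
        imperative.append(title)
imperative.sort()

conn.close()
```

Both approaches return the same titles. In the declarative version the database is free to choose an index or a different scan order, whereas the imperative version fixes the steps in application code.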

NoSQL

  • Originally meant nonrelational databases, but has since been reinterpreted as "Not Only SQL".

Document model

A document is often a single continuous string, encoded as JSON, XML, or a binary format like BSON. The document model doesn't enforce a schema. Compared to the relational model, it is harder and less efficient to execute joins across documents than across tables.
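A minimal sketch of both points, using only the standard `json` module. The documents and field names (`author_id`, `name`) are hypothetical, and a real document store would handle storage and lookup itself.

```python
import json

# Hypothetical documents; the field names are illustrative only.
raw_books = [
    '{"id": 1, "title": "Harry Potter", "author_id": 10}',
    '{"id": 2, "title": "Untitled draft"}',
]
raw_authors = ['{"id": 10, "name": "J.K. Rowling"}']

books = [json.loads(doc) for doc in raw_books]
authors = {a["id"]: a for a in map(json.loads, raw_authors)}

# No schema is enforced on write, so the reader has to cope with
# missing fields. The "join" across documents is done by hand in
# application code rather than by the storage layer.
joined = [
    (book["title"], authors.get(book.get("author_id"), {}).get("name"))
    for book in books
]
```

The second book has no `author_id`, so its author comes back as `None` instead of being rejected when it was stored.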

Graph model

Source: https://towardsdatascience.com/comparing-graph-databases-5475bdb2e65f

| Pros | Cons |
|---|---|
| Query speed depends on the number of relationships. Good for real-time recommendation engines. | There is no standardized query language. Inappropriate for transaction-based systems. |
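To make the recommendation use case concrete, here is a small sketch that stores a graph as an adjacency mapping in plain Python. The `follows` graph and the `recommend` helper are hypothetical, and this is not how any particular graph database is implemented.

```python
# Hypothetical "follows" graph stored as an adjacency mapping.
follows = {
    "ann": {"bob", "cat"},
    "bob": {"cat", "dan"},
    "cat": {"dan", "eve"},
    "dan": set(),
    "eve": set(),
}


def recommend(user):
    """Suggest accounts followed by the accounts that `user` follows."""
    direct = follows.get(user, set())
    candidates = set()
    for friend in direct:
        # Work done here is proportional to the number of edges visited,
        # not to the total number of users in the graph.
        candidates |= follows.get(friend, set())
    # Drop accounts the user already follows, and the user themselves.
    return sorted(candidates - direct - {user})


print(recommend("ann"))  # ['dan', 'eve']
```

The traversal only follows the relationships that start at the given user, which is the sense in which the cost tracks the number of relationships.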

Structured versus Unstructured Data

https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/data-lake-foundation.html

| | Structured data | Unstructured data |
|---|---|---|
| Repository | Data warehouses | Data lakes |
| Schema | Yes | No |
| Search | Fast lookup and analysis | Data arrives quickly, but search is slow |
| Schema change | Schema changes cause a lot of trouble | No need to handle schema changes at storage time; downstream consumers have to handle them |
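The "Schema" and "Schema change" rows can be sketched as schema-on-write versus schema-on-read. Everything here is an assumption made for illustration: the `SCHEMA` mapping, the function names, and the use of Python lists as stand-ins for a warehouse table and a data lake.

```python
import json

# Hypothetical schema for the structured store.
SCHEMA = {"user_id": int, "amount": float}


def write_structured(table, record):
    """Schema on write: reject records that do not match the schema."""
    if set(record) != set(SCHEMA):
        raise ValueError("unexpected or missing fields")
    for field, expected in SCHEMA.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    table.append(record)


def write_unstructured(lake, raw):
    """No schema on write: store the raw payload as-is."""
    lake.append(raw)


def read_amounts(lake):
    """Schema on read: the consumer parses and copes with variation."""
    amounts = []
    for raw in lake:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip payloads that are not valid JSON
        if isinstance(doc, dict) and "amount" in doc:
            amounts.append(float(doc["amount"]))
    return amounts
```

Writes to the lake never fail, which is why data arrives quickly, but every reader has to carry its own parsing and validation logic.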