GCP_DATA_BQ

Access

Best practices

retention can be configured at the dataset -> table -> partition level
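A minimal sketch of setting expiration at each of those levels with the google-cloud-bigquery Python client; the dataset and table names are placeholders and are assumed to already exist.

```python
import datetime
from google.cloud import bigquery

client = bigquery.Client()  # uses the default project from the environment

# Dataset level: default expiration for any new table in the dataset (30 days).
dataset = client.get_dataset("my_dataset")
dataset.default_table_expiration_ms = 30 * 24 * 60 * 60 * 1000
client.update_dataset(dataset, ["default_table_expiration_ms"])

# Table level: expire one specific table at a fixed time (90 days from now).
table = client.get_table("my_dataset.my_table")
table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(days=90)
client.update_table(table, ["expires"])

# Partition level: drop partitions older than 60 days (table must be partitioned).
part_table = client.get_table("my_dataset.my_partitioned_table")
part_table.time_partitioning.expiration_ms = 60 * 24 * 60 * 60 * 1000
client.update_table(part_table, ["time_partitioning"])
```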

importing data to BQ

can be expensive when queries run over large datasets - estimate the cost first

avoid streaming inserts where possible
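As an alternative to streaming, a sketch of a batch load job from Cloud Storage with the Python client; the bucket path and destination table are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer the schema
)

# Batch load from Cloud Storage -- you pay for storage, not for the load job itself.
load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.csv",       # hypothetical source path
    "my_project.my_dataset.events",      # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for completion
print(client.get_table("my_project.my_dataset.events").num_rows, "rows loaded")
```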

dry run - estimate the volume of data a query will scan before running it
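A quick dry-run sketch with the Python client against a public dataset; nothing is executed or billed, the job only reports the bytes it would scan.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

query_job = client.query(
    "SELECT name, SUM(number) AS total "
    "FROM `bigquery-public-data.usa_names.usa_1910_2013` "
    "GROUP BY name",
    job_config=job_config,
)

# A dry run returns immediately with the estimated scan size.
gib = query_job.total_bytes_processed / 1024**3
print(f"This query would process {gib:.2f} GiB")
```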

price calculator

bq command line tool

BQ REST API

cloud console

client libraries (Java, etc.)

pay for the amount of data scanned plus storage

it makes sense to apply performance techniques

partitioning - split a table into multiple segments

clustering - group related data together

improve perf

optimize costs

partitions inherit the table schema

partitions can be expired automatically (partition expiration)

date/time-related columns and integer ranges are OK as partitioning keys



goal - colocate related data and avoid scanning more than needed

avoid creating many small partitions (see the sketch below)
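A minimal sketch of creating a day-partitioned, clustered table with partition expiration via the Python client; the schema and table name are made up for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
]

table = bigquery.Table("my_project.my_dataset.orders", schema=schema)  # hypothetical table

# Daily partitions on the timestamp column; partitions expire after 90 days.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_ts",
    expiration_ms=90 * 24 * 60 * 60 * 1000,
)

# Cluster within each partition so related rows are stored together.
table.clustering_fields = ["customer_id"]

table = client.create_table(table)
```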

Streaming - expensive (Dataproc, Dataflow streaming, Pub/Sub)

query external data - federated queries
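A sketch of a federated query over a CSV in Cloud Storage using a temporary external table definition; the bucket path, temp table name, and columns are assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Describe the external data (a CSV in Cloud Storage) without loading it into BigQuery.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/exports/sales-*.csv"]  # hypothetical path
external_config.autodetect = True

# Attach the definition as a temporary table usable only inside this query.
job_config = bigquery.QueryJobConfig(table_definitions={"sales_ext": external_config})

query_job = client.query(
    "SELECT region, SUM(amount) AS total FROM sales_ext GROUP BY region",
    job_config=job_config,
)
for row in query_job:
    print(row["region"], row["total"])
```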

Batch - free (Dataproc, Dataflow batch)

BigQuery Data Transfer Service (Google Ads, Google Analytics and other third-party sources, Amazon S3, Redshift, etc.)

streaming is subject to quota limits - roughly 100 MB/s with deduplication (insertId) and 1 GB/s without it


insertId - provides best-effort deduplication of streamed rows
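A sketch of a streaming insert that passes row_ids (which become the insertId) for best-effort deduplication; the table and fields are hypothetical, and the destination table is assumed to exist with a matching schema.

```python
from google.cloud import bigquery

client = bigquery.Client()

rows = [
    {"event_id": "evt-001", "customer_id": "c-42", "amount": 9.99},
    {"event_id": "evt-002", "customer_id": "c-17", "amount": 4.50},
]

# A stable insertId per row lets BigQuery drop the duplicate
# if the same row is retried within a short window.
errors = client.insert_rows_json(
    "my_project.my_dataset.events",          # hypothetical table
    rows,
    row_ids=[r["event_id"] for r in rows],   # becomes the insertId for each row
)
if errors:
    print("Streaming insert errors:", errors)
```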

expire data automatically

storage

lower storage cost - comparable to Nearline in Cloud Storage

long-term storage - applies when a table (or partition) has not been modified for 90 days

well optimised for big analytical queries; for narrow, small lookups BT (Bigtable) is better

audit logs - available in BQ (can be exported there for analysis)


Dataproc

cluster types

perform complex batch processing

there are multiple cluster modes: single node, standard, and HA (3 masters)

managed Spark and Hadoop service

can use preemptible (secondary) workers or regular nodes
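A rough sketch of creating a standard-mode Dataproc cluster with preemptible secondary workers via the google-cloud-dataproc Python client; project, region, cluster name, and machine types are placeholders.

```python
from google.cloud import dataproc_v1

region = "europe-west1"  # hypothetical region
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "my-project",           # hypothetical project
    "cluster_name": "etl-cluster",
    "config": {
        # Standard mode: one master; an HA cluster would use 3 masters.
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Secondary workers are preemptible by default -- cheaper, but can be reclaimed.
        "secondary_worker_config": {"num_instances": 2},
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)
```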

Data_lifecycle

Ingest

Store (different storage options here)

Process and analyse

Explore and visualise

Batch

DB migration

Streaming

Pub/Sub

BQ Transfer Service

Transfer Appliance and gsutil

Storage Transfer Service

Database Migration Service, Dataflow, etc.

Dataproc

DLP

Dataprep

Dataflow

ML

Datalab and Data Studio (managed services to explore and visualise data)

BQ
