GCP_DATA_BQ
Access
Best practices
retention (expiration) can be configured at dataset -> table -> partition level
importing data to BQ
can be expensive when running over large datasets - estimate cost first
avoid streaming inserts if possible
dry run - estimates how much data a query will scan (see the sketch after this list)
pricing calculator
bq command line tool
BQ REST API
Cloud Console
client libraries (Java, etc.)
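A minimal sketch of a dry-run cost estimate with the google-cloud-bigquery Python client; the project, dataset, and query below are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application default credentials

# Dry run: BigQuery validates the query and reports bytes scanned
# without executing it, so no query cost is incurred.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

query = """
    SELECT name, SUM(views) AS total_views
    FROM `my-project.analytics.page_views`   -- hypothetical table
    WHERE view_date >= '2024-01-01'
    GROUP BY name
"""

job = client.query(query, job_config=job_config)
gib = job.total_bytes_processed / 1024 ** 3
print(f"This query would scan ~{gib:.2f} GiB")
```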
pay for the amount of data scanned and for storage
makes sense to apply performance techniques
partition tables into multiple segments
clustering - groups related data together
improves performance
optimizes cost
inherit schema
partitions can be expired automatically
date/timestamp and integer-range columns can be used as partition keys
goal - colocate data and avoid scanning more than necessary (see the sketch below)
avoid creating many small partitions
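A sketch of creating a date-partitioned, clustered table with automatic partition expiration, assuming the google-cloud-bigquery client; the project, dataset, and column names are made up.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.analytics.events"  # hypothetical table

schema = [
    bigquery.SchemaField("event_date", "DATE"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("payload", "STRING"),
]

table = bigquery.Table(table_id, schema=schema)

# Partition by the DATE column; partitions older than 90 days expire automatically.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",
    expiration_ms=90 * 24 * 60 * 60 * 1000,
)

# Cluster within each partition to colocate related rows and cut scanned bytes.
table.clustering_fields = ["customer_id"]

client.create_table(table)
```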
Streaming - expensive (Dataproc, Dataflow streaming, Pub/Sub)
query external data - federated queries
Batch loading - free (Dataproc, Dataflow batch)
BigQuery Data Transfer Service (Google Ads, Analytics, other third-party sources, Amazon S3, Redshift, etc.)
streaming has quota limits: roughly 100 MB/s per table with de-duplication, 1 GB/s without
insertId (a per-row stream ID) - provides best-effort de-duplication to avoid duplicates (see the sketch below)
expire data automatically
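A sketch contrasting a streaming insert (billed, with best-effort de-duplication via row_ids / insertId) and a free batch load from Cloud Storage, assuming the google-cloud-bigquery client; the table, row IDs, and GCS URI are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.analytics.events"  # hypothetical table

# Streaming insert: billed per ingested data; row_ids map to insertId,
# which BigQuery uses for best-effort de-duplication.
rows = [{"event_date": "2024-01-01", "customer_id": "c1", "payload": "x"}]
errors = client.insert_rows_json(table_id, rows, row_ids=["evt-0001"])
if errors:
    print("Streaming insert errors:", errors)

# Batch load from GCS: the load itself is free; prefer it over streaming
# when near-real-time availability is not required.
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.json", table_id, job_config=load_config
)
load_job.result()  # wait for the load job to finish
```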
Storage
lower storage cost - similar to Nearline in Cloud Storage
long-term storage - applies when a table or partition has not been edited for 90 days
BQ is optimised for large analytical queries; for narrow point lookups, Bigtable (BT) is a better fit
audit logs - can be exported to and analysed in BQ
Dataproc
type
performs complex batch processing
multiple cluster modes: single node, standard, HA (3 masters)
managed Spark and Hadoop service
can use preemptible or regular worker nodes (see the sketch below)
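A sketch of creating a standard-mode Dataproc cluster with secondary (by default preemptible) workers, assuming the google-cloud-dataproc Python client; the project, region, cluster name, and machine types are placeholders.

```python
from google.cloud import dataproc_v1

project_id, region = "my-project", "us-central1"  # hypothetical

# The Dataproc API is regional, so point the client at the regional endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "batch-processing",
    "config": {
        # Standard mode: 1 master; HA mode uses 3 masters, single-node has no workers.
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Secondary workers are preemptible by default - cheaper, but can be reclaimed.
        "secondary_worker_config": {"num_instances": 2},
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
result = operation.result()  # block until the cluster is created
print(f"Cluster created: {result.cluster_name}")
```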
Data_lifecycle
Ingest
Store (different storage options)
Process and analyse
Explore and visualise
Batch
DB migration
Streaming
Pub/Sub (streaming ingest; see the sketch after this list)
BQ Data Transfer Service
Transfer Appliance and gsutil
Storage Transfer Service
Database Migration Service, Dataflow, etc.
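A sketch of streaming ingest through Pub/Sub with the google-cloud-pubsub client; the project, topic, and event payload are hypothetical, and a downstream consumer such as Dataflow would read the messages.

```python
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "events")  # hypothetical topic

event = {"customer_id": "c1", "action": "page_view"}

# Messages are published as bytes; publish() returns a future with the message ID.
future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))
print("Published message", future.result())
```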
Dataproc
DLP
Dataprep
Dataflow
ML
Datalab and Data Studio (managed services for exploration and visualisation)
BQ