Dataproc
What is Dataproc?
Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark, Apache Hadoop, Apache Hive, and other open source big data tools.
It integrates with other Google Cloud services, such as BigQuery, Cloud Storage, and Cloud Dataflow.
Note
While Dataproc is fully managed, it is not serverless: you still need to create and manage clusters explicitly.
Advantages
- Low cost: You only pay for the resources you use.
- Fast: Dataproc clusters start up and shut down quickly (in less than 90 seconds).
- Integration: Dataproc is integrated with other Google Cloud services.
- Fully managed: Google Cloud manages the infrastructure, so you can focus on your data and analysis.
- Easy to use: You can drive Dataproc from the console UI, the CLI, or the API.
- Scalable: You can easily scale your clusters up and down.
Use cases
- Data processing: Dataproc can be used to process large amounts of data.
- Data analysis: Dataproc can be used to analyze large amounts of data.
- Machine learning: Dataproc can be used to train machine learning models.
- ETL: Dataproc can be used to extract, transform, and load data.
- ...
How to use Dataproc?
- Setup: Create a cluster
- Configure: Configure your cluster
- Optimize: Optimize configuration for your workload (scale, autoscale, preemptible VMs, CPU, GPU, etc.)
- Run: Submit your job
- Monitor: Check the status of your job (Cloud Monitoring, Cloud Logging, etc.)
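The steps above can be sketched with the `google-cloud-dataproc` Python client. This is a minimal sketch, not a reference setup. The project, cluster, and bucket names, the machine types, and the helper functions `build_cluster`, `build_pyspark_job`, and `run` are all hypothetical. The preemptible secondary workers illustrate the Optimize step, and passing the enum as the string `"PREEMPTIBLE"` is an assumption about how the client converts dict values.

```python
from typing import Any, Dict


def build_cluster(project_id: str, cluster_name: str,
                  workers: int = 2, preemptible_workers: int = 0) -> Dict[str, Any]:
    """Setup / Configure / Optimize: describe the cluster as a plain dict."""
    config: Dict[str, Any] = {
        # Machine types are illustrative; size them for your workload.
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": workers, "machine_type_uri": "n1-standard-4"},
    }
    if preemptible_workers > 0:
        # Optional cheaper capacity (assumed enum name, see the lead-in).
        config["secondary_worker_config"] = {
            "num_instances": preemptible_workers,
            "preemptibility": "PREEMPTIBLE",
        }
    return {"project_id": project_id, "cluster_name": cluster_name, "config": config}


def build_pyspark_job(cluster_name: str, main_python_file_uri: str) -> Dict[str, Any]:
    """Run: describe a PySpark job that targets the given cluster."""
    return {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {"main_python_file_uri": main_python_file_uri},
    }


def run(project_id: str, region: str,
        cluster: Dict[str, Any], job: Dict[str, Any]) -> str:
    """Create the cluster, submit the job, and return the final job state."""
    # Imported here so the helpers above work without the client installed.
    from google.cloud import dataproc_v1  # pip install google-cloud-dataproc

    endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    clusters = dataproc_v1.ClusterControllerClient(client_options=endpoint)
    jobs = dataproc_v1.JobControllerClient(client_options=endpoint)

    # Setup: block until the cluster is ready.
    clusters.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    ).result()

    # Run + Monitor: block until the job finishes, then report its state.
    finished = jobs.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    ).result()
    return finished.status.state.name
```

A call such as `run("my-project", "europe-west1", build_cluster("my-project", "demo"), build_pyspark_job("demo", "gs://my-bucket/job.py"))` would need valid credentials and an existing bucket. Remember to delete the cluster afterwards, since it is billed while it runs.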
Dataproc or Dataflow?
Other Points
Warning
HDFS storage on a Dataproc cluster is not persistent: it disappears when the cluster is deleted. If you want to keep your data, you need to use external storage like Google Cloud Storage.
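In practice, this means jobs should read from and write to `gs://` paths rather than `hdfs://` paths. The following is a minimal PySpark sketch of that pattern, assuming it runs on a Dataproc cluster, where the Cloud Storage connector is preinstalled. The bucket and object names and the helpers `gcs_uri` and `word_count` are hypothetical.

```python
def gcs_uri(bucket: str, path: str) -> str:
    """Build a Cloud Storage URI that Spark on Dataproc can read or write."""
    return f"gs://{bucket}/{path.lstrip('/')}"


def word_count(input_uri: str, output_uri: str) -> None:
    """Count words in text files on Cloud Storage and store the result there too."""
    # Imported here so gcs_uri can be used without PySpark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("gcs-word-count").getOrCreate()

    # Input lives in Cloud Storage, so it outlives the cluster.
    lines = spark.read.text(input_uri)
    words = lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    counts = words.groupBy("word").count()

    # Output goes to Cloud Storage as well, not to the cluster's HDFS.
    counts.write.mode("overwrite").parquet(output_uri)
    spark.stop()


if __name__ == "__main__":
    # Hypothetical bucket and paths.
    print(gcs_uri("my-bucket", "/input/books/*.txt"))
```

With data kept in Cloud Storage, the cluster can be deleted as soon as the job finishes without losing any results.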
Tip
Optimization General optimization Storage