Definition
Amazon S3 (Simple Storage Service) is Amazon Web Services' fully managed object storage service. It is the most common solution for storing data in the cloud securely, efficiently, and in a scalable way.
With this service, we can store information of any kind, which will be treated as objects. For example, we can use S3 to store photos, text files, static web code, logs, backups, videos, etc.
S3 is one of the best-known and most used AWS services. It is also fully integrated with most of the other services offered and is very easy to use. The alternative in the Microsoft Azure cloud would be the Azure Blob Storage service.
How Is the Data Stored? S3 Buckets
Data in S3 is stored as objects within so-called S3 Buckets. An object is the primary storage unit in S3, and it consists of a file with an identifier and associated metadata. A Bucket in Amazon S3 is nothing more than a high-level logical directory where objects are located, each identified with a key. An example of object identification (with an illustrative bucket name) could be the following:

s3://bucket-name/2020/log.csv
In this case, the Bucket name would be “bucket-name” and the file’s key inside that Bucket “2020/log.csv”. As we can see, the identifier starts with the s3 protocol. There are several methods and interfaces to access the files stored in S3 and download them: authenticated URLs, the REST API, the AWS console, and the available SDKs. Files can also be uploaded with FTP-style clients, for example through AWS Transfer Family or third-party tools that support the S3 API.
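As a minimal sketch of the SDK route, the snippet below uses the AWS SDK for Python (boto3) to upload, download, and list objects. The bucket and file names are illustrative, and credentials are assumed to be configured in the environment.

```python
import boto3

# Create an S3 client; credentials are resolved from the environment,
# AWS config files, or an attached IAM role (names below are illustrative).
s3 = boto3.client("s3")

# Upload a local file as an object with the key "2020/log.csv".
s3.upload_file("log.csv", "bucket-name", "2020/log.csv")

# Download the same object back to a local path.
s3.download_file("bucket-name", "2020/log.csv", "downloaded-log.csv")

# List the objects stored under the "2020/" prefix.
response = s3.list_objects_v2(Bucket="bucket-name", Prefix="2020/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```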
When using S3, it is crucial to be aware of access policies and to avoid leaving buckets public as much as possible. It is possible to create rules and ACLs (Access Control Lists) to define who has access to buckets and stored objects. It is also advisable to transfer files over SSL/TLS.
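As an illustration of the SSL recommendation, the sketch below applies a bucket policy that rejects any request not made over HTTPS, using the common aws:SecureTransport condition. The bucket name is illustrative, and the call assumes boto3 with sufficient permissions.

```python
import json
import boto3

s3 = boto3.client("s3")

# Deny any request to the bucket that does not use HTTPS/SSL
# ("bucket-name" is an illustrative name).
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::bucket-name",
                "arn:aws:s3:::bucket-name/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

s3.put_bucket_policy(Bucket="bucket-name", Policy=json.dumps(policy))
```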
Amazon S3 Features
S3 originally had an eventual consistency model: immediately after an object was updated, the latest version might not be returned until the change had propagated and replicated asynchronously through the system. The same could happen when deleting a file, which could continue to appear in listings for a while, from several seconds up to hours depending on the size of the objects and the target regions. Since December 2020, however, S3 provides strong read-after-write consistency for new and overwritten objects, so this is now mainly a concern for asynchronous cross-region replication.
The availability of the data provided by S3 is very high, 99.99% by default, with durability of eleven nines (99.999999999%). There are several storage classes that allow you to adjust availability and cost to your needs. Besides, it scales transparently and gives us practically unlimited storage capacity in the cloud, paid according to the use of data.
The cost of the service depends on its use, both in terms of used storage capacity and network requests. This price is very competitive and includes data replication and backups, which are managed automatically and transparently. S3 also allows you to store compressed data. In this way, the total cost can be reduced.
Regarding the performance of S3, you have to take connection latencies into account. When S3 is used together with other AWS services, this is usually not very significant, thanks to network optimizations within the same ecosystem. On the other hand, if we use S3 as the storage layer for an on-premises system, it is important to weigh it against other storage technologies such as HDFS and their advantages.
Amazon S3 Glacier
AWS also provides another S3-based service called Amazon S3 Glacier. This service is aimed at providing durable cloud object storage for data files at a significantly reduced price.
The price of this service is approximately 1 euro per terabyte per month. It offers several retrieval options, from minutes to hours. There is also an even cheaper tier, Amazon S3 Glacier Deep Archive, where retrieving the data takes from 12 to 48 hours.
The most common use for these Amazon S3 Glacier services is to store backups. These files do not need to be accessed instantly, and access time can generally be sacrificed to save costs since they are often large files.
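A typical way to implement this backup pattern is a lifecycle rule that moves old objects to Glacier and later to Deep Archive. The sketch below shows one possible configuration with boto3; the bucket name, prefix, and transition days are illustrative choices, not fixed values.

```python
import boto3

s3 = boto3.client("s3")

# Move objects under the "backups/" prefix to cheaper storage classes over time
# (bucket name, prefix, and day thresholds are illustrative).
s3.put_bucket_lifecycle_configuration(
    Bucket="bucket-name",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-backups",
                "Filter": {"Prefix": "backups/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "GLACIER"},
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```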
Notification System in AWS S3
S3 Buckets can generate notifications when certain events occur, such as object creation or deletion. Currently, three mechanisms are supported to deliver these events (a minimal configuration sketch follows the list):
- AWS Lambda
- Simple Queue Service (SQS)
- Simple Notification Service (SNS)
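For example, a bucket can be configured so that every object creation triggers a Lambda function. The sketch below assumes boto3, an illustrative bucket name and function ARN, and that the Lambda function already grants S3 permission to invoke it.

```python
import boto3

s3 = boto3.client("s3")

# Invoke a Lambda function whenever an object is created in the bucket.
# The bucket name and function ARN are illustrative, and the function must
# already allow S3 to invoke it (via a Lambda resource-based policy).
s3.put_bucket_notification_configuration(
    Bucket="bucket-name",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:eu-west-1:123456789012:function:process-logs",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```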
Security in S3
When using a cloud storage service like Amazon S3, we must pay attention to the security of the buckets.
The first step is to disable public access if we do not want to serve this data to everyone; when creating a bucket, AWS prompts you to do this. It is also a good idea to enable access logs so that there is a record of the actions performed.
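A possible way to apply both recommendations with boto3 is shown below. The bucket names are illustrative, and the logging bucket is assumed to already exist with permissions that allow S3 log delivery.

```python
import boto3

s3 = boto3.client("s3")

# Block all forms of public access at the bucket level
# ("bucket-name" is an illustrative name).
s3.put_public_access_block(
    Bucket="bucket-name",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Send server access logs to a separate logging bucket.
s3.put_bucket_logging(
    Bucket="bucket-name",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "bucket-name-logs",
            "TargetPrefix": "access-logs/",
        }
    },
)
```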
We will also need to configure the IAM access policies. In AWS, to perform an action on a resource (such as listing the objects in an S3 bucket), the policies attached to that user must allow access to that resource. By default, all actions are denied.
In S3, there are also so-called bucket policies (S3 Bucket Policies) that allow us to limit access to the resource and give us another line of defense. Here, we can define exactly who may access a specific bucket and how.
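As a sketch of this idea, the snippet below attaches a bucket policy that grants a single IAM role read-only access to one bucket. The account ID, role, and bucket name are illustrative, and in practice such statements would be merged into the bucket's single policy document.

```python
import json
import boto3

s3 = boto3.client("s3")

# Allow one IAM role to list the bucket and read its objects; everything
# else remains governed by IAM policies (all names are illustrative).
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadOnlyForAnalyticsRole",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/analytics"},
            "Action": ["s3:ListBucket", "s3:GetObject"],
            "Resource": [
                "arn:aws:s3:::bucket-name",
                "arn:aws:s3:::bucket-name/*",
            ],
        }
    ],
}

s3.put_bucket_policy(Bucket="bucket-name", Policy=json.dumps(policy))
```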
Finally, we must consider the use of encryption on the data stored in the S3 buckets. We can keep the encryption keys in services such as KMS to increase security.
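One way to enforce this is to set default bucket encryption with a KMS key, as in the sketch below. The bucket name and key ARN are illustrative, and the caller is assumed to have permission to use the key.

```python
import boto3

s3 = boto3.client("s3")

# Encrypt every new object by default with a customer-managed KMS key
# (the bucket name and key ARN are illustrative).
s3.put_bucket_encryption(
    Bucket="bucket-name",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:eu-west-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab",
                },
                "BucketKeyEnabled": True,
            }
        ]
    },
)
```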
Amazon S3 vs. HDFS
It is common to compare these two technologies as the storage layer of a big data system. Even so, they are quite different, and there are use cases for both. In this section, we will compare the main aspects of each of them to provide good selection criteria: HDFS vs. S3.
1. Scalability
HDFS scales horizontally by adding more local storage on more nodes: it is possible to add nodes with larger disks or, most commonly, simply add more nodes. This way of increasing capacity is more expensive and complicated than in S3, since it requires buying and maintaining more hardware, not all of which translates directly into storage capacity.
As we have seen, S3 scales automatically and transparently with virtually no limit and no configuration changes.
2. Price
Regarding the price of S3 and HDFS, it must be considered that HDFS keeps three copies of each block in its default configuration, which triples the required space and, with it, the associated costs. For example, storing 10 TB of data needs roughly 30 TB of raw disk in HDFS.
S3 does not have this overhead, since it manages the copies transparently and is billed by consumption.
3. Performance
Performance is usually higher on HDFS, as processing is done on the same hosts where the data is stored, which allows high processing and data access speed. S3 has higher latency for reads and writes, so HDFS seems the better option for compute-intensive processes that need to move large amounts of data.
S3's consistency model and object-handling semantics also make it less than ideal for certain workloads without a significant impact on performance.
In practice, it is common to use S3 and HDFS together rather than choosing one: HDFS stores the frequently accessed data on which operations are performed, while S3 acts as the cheaper object store for data that is accessed or processed less often.
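As a rough illustration of this split, the PySpark sketch below processes hot data stored in HDFS and writes the archived result to S3 through the s3a connector. The paths and bucket name are illustrative, and the cluster is assumed to already have the hadoop-aws connector and AWS credentials configured.

```python
from pyspark.sql import SparkSession

# Assumes a Spark cluster where the Hadoop s3a connector and AWS credentials
# are already configured; all paths are illustrative.
spark = SparkSession.builder.appName("hdfs-and-s3").getOrCreate()

# Hot data that is processed frequently lives in HDFS, close to the compute nodes.
events = spark.read.parquet("hdfs:///data/events/current/")

# Run the heavy transformations against the HDFS copy.
daily_summary = events.groupBy("event_date").count()

# Archive the results to S3, the cheaper store for rarely accessed data.
daily_summary.write.mode("overwrite").parquet("s3a://bucket-name/archive/daily-summary/")
```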