Many organizations are trying to gather and utilise as much data as possible to improve how they run their business, increase revenue, or how they impact the world around them. It is therefore becoming increasingly common for data scientists to face datasets of 50 GB or even 500 GB in size.

Now, these kinds of datasets are a bit… uncomfortable to use. They are small enough to fit on the hard drive of your everyday laptop, but way too big to fit in RAM. Thus, they are already tricky to open and inspect, let alone to explore or analyse.

There are 3 strategies commonly employed when working with such datasets. The first is to sub-sample the data and work with only a portion of it. The drawback here is obvious: one may miss key insights by not looking at the relevant portions, or even worse, misinterpret the story the data is telling by not looking at all of it.

The next strategy is to use distributed computing. While this is a valid approach for some cases, it comes with the significant overhead of managing and maintaining a cluster. Imagine having to set up a cluster for a dataset that is just out of RAM reach, like in the 30–50 GB range.

Alternatively, one can rent a single strong cloud instance with as much memory as required to work with the data in question. For example, AWS offers instances with terabytes of RAM. In this case you still have to manage cloud data buckets, wait for data transfer from bucket to instance every time the instance starts, handle the compliance issues that come with putting data on the cloud, and deal with all the inconveniences that come with working on a remote machine. Not to mention the costs, which although they start low, tend to pile up as time goes on.
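To make the first strategy concrete, here is a minimal sketch of sub-sampling a CSV file that is too large for RAM, using pandas in chunked mode. The function name `subsample_csv` and the default parameters are my own illustrative choices, not from any particular library; the idea is simply that memory use stays bounded by the chunk size while a random fraction of rows is kept.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sub-sample is reproducible
rng = np.random.default_rng(42)

def subsample_csv(path, fraction=0.01, chunksize=1_000_000):
    """Read a CSV too large for RAM, keeping a random fraction of rows.

    The file is processed in chunks, so at most `chunksize` rows are
    held in memory at any one time.
    """
    sampled = []
    for chunk in pd.read_csv(path, chunksize=chunksize):
        # Keep each row independently with probability `fraction`
        mask = rng.random(len(chunk)) < fraction
        sampled.append(chunk[mask])
    return pd.concat(sampled, ignore_index=True)
```

This keeps the sub-sample representative (every row has the same chance of being selected), but it illustrates the drawback mentioned above just as well: any analysis now sees only a small, random slice of the data.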