What if you could get blazing-fast queries on your data without having to be on call for a giant, expensive database? By picking the right file format, you can store your data on disk in the cloud and still get the performance you need for modern analytics. We’ll discuss benchmarks of four different data storage formats: Parquet, ORC, Avro, and traditional character-separated files like CSV. We’ll cover what they are, how they work at a bits-and-bytes level, and why you might choose each one for your use case.
Emily May Curtin is a rare Atlanta native. She works on Apache Spark and applications in the Spark ecosystem. Emily lives in the city with her husband, Ryan. When she’s not busy having very strong opinions about Scala, she can be found on the Hooch, on the Appalachian Trail, or in the zone at her easel.