
Commit
Add parquet vs csv
baniasbaabe committed Feb 4, 2024
1 parent 4d8b00e commit d97634c
Showing 1 changed file with 48 additions and 0 deletions.
48 changes: 48 additions & 0 deletions book/pandas/Chapter.ipynb
@@ -189,6 +189,54 @@
"data = {'Value': [1.2343129, 5.8956701, 6.224289]}\n",
"df = pd.DataFrame(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Faster I/O with Parquet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whenever you work with bigger datasets, avoid the CSV format (or similar text-based formats).\n",
"\n",
"CSV files are plain text, which makes them human-readable and therefore a popular choice for storing data.\n",
"\n",
"For small datasets, this is not a big issue.\n",
"\n",
"But what if your data has millions of rows?\n",
"\n",
"Read and write operations on such files can get really slow.\n",
"\n",
"Binary formats, on the other hand, are not meant to be human-readable.\n",
"\n",
"They are read and written by programs that know how to interpret them, which makes them more compact and less space-consuming.\n",
"\n",
"Parquet is one popular binary file format, and it is faster and more memory-efficient than CSV."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Shape: (100000000, 5)\n",
"df = pd.DataFrame(...)\n",
"\n",
"# Time: 1m 58s\n",
"df.to_csv(\"data.csv\")\n",
"\n",
"# Time: 8s\n",
"df.to_parquet(\"data.parquet\")"
]
}
],
"metadata": {
