deploy: dccdefc

baniasbaabe · Feb 4, 2024 · e10541d
1 parent 7d2451f
commit e10541d
Showing 9 changed files with 261 additions and 2 deletions.
diff --git a/_sources/book/cooltools/Chapter.ipynb b/_sources/book/cooltools/Chapter.ipynb
@@ -1665,6 +1665,45 @@
     "        }\n",
     "    )"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## SQL Query Builder in Python"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can build SQL queries programmatically in Python with pypika.\n",
+    "\n",
+    "pypika provides a builder-style interface, so you compose a query from method calls instead of concatenating strings.\n",
+    "\n",
+    "It covers most common SQL constructs, such as joins and filters."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pypika import Tables, Query\n",
+    "\n",
+    "history, customers = Tables('history', 'customers')\n",
+    "q = Query \\\n",
+    "    .from_(history) \\\n",
+    "    .join(customers) \\\n",
+    "    .on(history.customer_id == customers.id) \\\n",
+    "    .select(history.star) \\\n",
+    "    .where(customers.id == 5)\n",
+    "    \n",
+    "q.get_sql()\n",
+    "# SELECT \"history\".* FROM \"history\" JOIN \"customers\" \n",
+    "# ON \"history\".\"customer_id\"=\"customers\".\"id\" WHERE \"customers\".\"id\"=5"
+   ]
   }
  ],
  "metadata": {

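To see what the generated statement returns, the sketch below pastes the SQL string from the comment in the cell above into the standard-library `sqlite3` module. The table columns and rows are made up for illustration, and SQLite is only an assumption here; pypika itself is not needed to run it.

```python
import sqlite3

# In-memory database with two tiny, hypothetical tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE history (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
INSERT INTO customers VALUES (5, 'Ada'), (6, 'Bob');
INSERT INTO history VALUES (1, 5, 'book'), (2, 6, 'pen'), (3, 5, 'lamp');
""")

# The statement shown in the comment of the pypika cell
sql = (
    'SELECT "history".* FROM "history" JOIN "customers" '
    'ON "history"."customer_id"="customers"."id" WHERE "customers"."id"=5'
)

# Sort so the result does not depend on the row order SQLite happens to use
rows = sorted(conn.execute(sql).fetchall())
print(rows)  # [(1, 5, 'book'), (3, 5, 'lamp')]
```

Only the history rows that belong to customer 5 come back, which is what the `.join(...).on(...).where(...)` chain expresses.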
diff --git a/_sources/book/machinelearning/outlierdetection.ipynb b/_sources/book/machinelearning/outlierdetection.ipynb
@@ -83,6 +83,65 @@
     "    \n",
     "majority_vote(labels)"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Robust Outlier Detection with `puncc`"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Outlier detection is notoriously hard.\n",
+    "\n",
+    "But it doesn't have to be.\n",
+    "\n",
+    "`puncc` offers outlier detection powered by conformal prediction, which calibrates the detection threshold on held-out data.\n",
+    "\n",
+    "As a result, the false alarm rate stays below a level you choose (`alpha`)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install puncc"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn.ensemble import IsolationForest\n",
+    "from deel.puncc.anomaly_detection import SplitCAD\n",
+    "from deel.puncc.api.prediction import BasePredictor\n",
+    "\n",
+    "# We need to redefine the predict to output the nonconformity scores.\n",
+    "class ADPredictor(BasePredictor):\n",
+    "    def predict(self, X):\n",
+    "        return -self.model.score_samples(X)\n",
+    "\n",
+    "# Wrap Isolation Forest in a predictor\n",
+    "if_predictor = ADPredictor(IsolationForest())\n",
+    "\n",
+    "# Instantiate CAD on top of IF predictor\n",
+    "if_cad = SplitCAD(if_predictor, train=True)\n",
+    "\n",
+    "# Fit on 70% of dataset; calibrate the threshold on the remaining 30%\n",
+    "if_cad.fit(z=dataset, fit_ratio=0.7)\n",
+    "\n",
+    "# Maximum false detection rate\n",
+    "alpha = 0.01\n",
+    "\n",
+    "results = if_cad.predict(new_data, alpha=alpha)"
+   ]
   }
  ],
  "metadata": {

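The calibration step mentioned above can be sketched with the standard library alone. This is a generic split-conformal threshold, not puncc's implementation: the function name `conformal_threshold`, the toy score (distance from zero), and the synthetic data are all assumptions made for illustration.

```python
import math
import random


def conformal_threshold(calibration_scores, alpha):
    """Score above which a point is flagged, targeting a false alarm rate <= alpha."""
    n = len(calibration_scores)
    # 1-indexed rank of the conformal quantile among the sorted calibration scores
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to certify this alpha: never flag anything
        return math.inf
    return sorted(calibration_scores)[rank - 1]


random.seed(0)
# Hypothetical nonconformity score: distance from 0 for normal data drawn from N(0, 1)
calibration_scores = [abs(random.gauss(0, 1)) for _ in range(999)]
threshold = conformal_threshold(calibration_scores, alpha=0.01)


def is_outlier(x):
    # Same score function as used for calibration, compared with the calibrated threshold
    return abs(x) > threshold


print(is_outlier(0.3), is_outlier(8.0))
```

The threshold comes from scores of held-out normal data rather than from a hand-picked cutoff, which is why the false alarm rate can be bounded by `alpha` when new normal points resemble the calibration points.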
diff --git a/_sources/book/pandas/Chapter.ipynb b/_sources/book/pandas/Chapter.ipynb
@@ -189,6 +189,54 @@
     "data = {'Value': [1.2343129, 5.8956701, 6.224289]}\n",
     "df = pd.DataFrame(data)"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Faster I/O with Parquet"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "When you work with larger datasets, avoid CSV and similar text formats.\n",
+    "\n",
+    "CSV files are plain text, which makes them human-readable and therefore a popular way to store data.\n",
+    "\n",
+    "For small datasets, that works fine.\n",
+    "\n",
+    "But what if your data has millions of rows?\n",
+    "\n",
+    "Reading and writing them can get really slow.\n",
+    "\n",
+    "Binary file formats are the alternative.\n",
+    "\n",
+    "They store values in their machine representation rather than as text, so they are meant to be read by programs that know the format, not by humans.\n",
+    "\n",
+    "Because of that, binary files are usually more compact and faster to parse.\n",
+    "\n",
+    "Parquet is a popular binary file format that takes less disk space than CSV and is faster to read and write."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "# Shape: (100000000, 5)\n",
+    "df = pd.DataFrame(...)\n",
+    "\n",
+    "# Time: 1m 58s\n",
+    "df.to_csv(\"data.csv\")\n",
+    "\n",
+    "# Time: 8s\n",
+    "df.to_parquet(\"data.parquet\")"
+   ]
   }
  ],
  "metadata": {

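The size difference between text and binary encodings can be illustrated without pandas. The sketch below is not Parquet (which adds a columnar layout and compression on top); it only compares the same floats written as text, as a CSV column would store them, against their raw 8-byte representation. The values are arbitrary.

```python
from array import array

# 100,000 arbitrary floats
values = [i / 7 for i in range(1, 100_001)]

# Text encoding: one number per line, as a CSV column would store it
text_bytes = "\n".join(repr(v) for v in values).encode("utf-8")

# Binary encoding: each float as a raw 8-byte double
binary_bytes = array("d", values).tobytes()

print(len(text_bytes), len(binary_bytes))

# Round trip: the binary form restores the exact floats without parsing any text
restored = array("d")
restored.frombytes(binary_bytes)
```

The text version is larger and has to be parsed character by character when it is read back, while the binary version is copied straight into memory.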
diff --git a/_sources/book/pythontricks/Chapter.ipynb b/_sources/book/pythontricks/Chapter.ipynb
@@ -961,7 +961,15 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": []
+   "source": [
+    "One cool feature in Python 3.12:\n",
+    "\n",
+    "A new, compact syntax for type variables (PEP 695).\n",
+    "\n",
+    "You can use them to parametrize generic classes and functions.\n",
+    "\n",
+    "See below for a small example where our generic class is parametrized by T, which we declare with [T] after the class name."
+   ]
   },
   {
    "cell_type": "code",

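For comparison, a generic class and function can also be written without the 3.12 syntax, using `TypeVar` and `Generic` from `typing`. This is a minimal sketch that runs on Python 3.9 and later; the `Stack` methods and the `first` helper are illustrative assumptions, not taken from the chapter's cell.

```python
from typing import Generic, TypeVar

# Before 3.12, the type variable has to be declared explicitly
T = TypeVar("T")


class Stack(Generic[T]):
    """A stack whose items all share the type T."""

    def __init__(self) -> None:
        self.items: list[T] = []

    def push(self, item: T) -> None:
        self.items.append(item)

    def pop(self) -> T:
        return self.items.pop()


def first(items: list[T]) -> T:
    """Generic function: the return type follows the element type of the list."""
    return items[0]


stack = Stack[int]()
stack.push(1)
stack.push(2)
top = stack.pop()
head = first(["a", "b"])
print(top, head)  # 2 a
```

With the 3.12 syntax, the `T = TypeVar("T")` line and the `Generic[T]` base class are no longer needed: the declarations become `class Stack[T]:` and `def first[T](items: list[T]) -> T:`.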
diff --git a/book/cooltools/Chapter.html b/book/cooltools/Chapter.html
@@ -449,6 +449,7 @@ <h2> Contents </h2>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#better-alternative-to-requests">2.1.32. Better Alternative to <code class="docutils literal notranslate"><span class="pre">requests</span></code></a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#managing-configurations-with-python-dotenv">2.1.33. Managing Configurations with <code class="docutils literal notranslate"><span class="pre">python-dotenv</span></code></a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#work-with-notion-via-python-with">2.1.34. Work with Notion via Python with</a></li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#sql-query-builder-in-python">2.1.35. SQL Query Builder in Python</a></li>
 </ul>
             </nav>
         </div>
@@ -1462,6 +1463,31 @@ <h2><span class="section-number">2.1.34. </span>Work with Notion via Python with
 </div>
 </div>
 </section>
+<section id="sql-query-builder-in-python">
+<h2><span class="section-number">2.1.35. </span>SQL Query Builder in Python<a class="headerlink" href="#sql-query-builder-in-python" title="Permalink to this heading">#</a></h2>
+<p>You can build SQL queries programmatically in Python with pypika.</p>
+<p>pypika provides a builder-style interface, so you compose a query from method calls instead of concatenating strings.</p>
+<p>It covers most common SQL constructs, such as joins and filters.</p>
+<div class="cell docutils container">
+<div class="cell_input docutils container">
+<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pypika</span> <span class="kn">import</span> <span class="n">Tables</span><span class="p">,</span> <span class="n">Query</span>
+
+<span class="n">history</span><span class="p">,</span> <span class="n">customers</span> <span class="o">=</span> <span class="n">Tables</span><span class="p">(</span><span class="s1">&#39;history&#39;</span><span class="p">,</span> <span class="s1">&#39;customers&#39;</span><span class="p">)</span>
+<span class="n">q</span> <span class="o">=</span> <span class="n">Query</span> \
+    <span class="o">.</span><span class="n">from_</span><span class="p">(</span><span class="n">history</span><span class="p">)</span> \
+    <span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">customers</span><span class="p">)</span> \
+    <span class="o">.</span><span class="n">on</span><span class="p">(</span><span class="n">history</span><span class="o">.</span><span class="n">customer_id</span> <span class="o">==</span> <span class="n">customers</span><span class="o">.</span><span class="n">id</span><span class="p">)</span> \
+    <span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">history</span><span class="o">.</span><span class="n">star</span><span class="p">)</span> \
+    <span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">customers</span><span class="o">.</span><span class="n">id</span> <span class="o">==</span> <span class="mi">5</span><span class="p">)</span>
+
+<span class="n">q</span><span class="o">.</span><span class="n">get_sql</span><span class="p">()</span>
+<span class="c1"># SELECT &quot;history&quot;.* FROM &quot;history&quot; JOIN &quot;customers&quot; </span>
+<span class="c1"># ON &quot;history&quot;.&quot;customer_id&quot;=&quot;customers&quot;.&quot;id&quot; WHERE &quot;customers&quot;.&quot;id&quot;=5</span>
+</pre></div>
+</div>
+</div>
+</div>
+</section>
 </section>
 
     <script type="text/x-thebe-config">
@@ -1565,6 +1591,7 @@ <h2><span class="section-number">2.1.34. </span>Work with Notion via Python with
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#better-alternative-to-requests">2.1.32. Better Alternative to <code class="docutils literal notranslate"><span class="pre">requests</span></code></a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#managing-configurations-with-python-dotenv">2.1.33. Managing Configurations with <code class="docutils literal notranslate"><span class="pre">python-dotenv</span></code></a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#work-with-notion-via-python-with">2.1.34. Work with Notion via Python with</a></li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#sql-query-builder-in-python">2.1.35. SQL Query Builder in Python</a></li>
 </ul>
   </nav></div>
 

diff --git a/book/machinelearning/outlierdetection.html b/book/machinelearning/outlierdetection.html
@@ -416,6 +416,7 @@ <h2> Contents </h2>
             <nav aria-label="Page">
                 <ul class="visible nav section-nav flex-column">
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#ensembling-for-outlier-detection">5.6.1. Ensembling for Outlier Detection</a></li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#robust-outlier-detection-with-puncc">5.6.2. Robust Outlier Detection with <code class="docutils literal notranslate"><span class="pre">puncc</span></code></a></li>
 </ul>
             </nav>
         </div>
@@ -486,6 +487,48 @@ <h2><span class="section-number">5.6.1. </span>Ensembling for Outlier Detection<
 </div>
 </div>
 </section>
+<section id="robust-outlier-detection-with-puncc">
+<h2><span class="section-number">5.6.2. </span>Robust Outlier Detection with <code class="docutils literal notranslate"><span class="pre">puncc</span></code><a class="headerlink" href="#robust-outlier-detection-with-puncc" title="Permalink to this heading">#</a></h2>
+<p>Outlier detection is notoriously hard.</p>
+<p>But it doesn’t have to be.</p>
+<p><code class="docutils literal notranslate"><span class="pre">puncc</span></code> offers outlier detection powered by conformal prediction, which calibrates the detection threshold on held-out data.</p>
+<p>As a result, the false alarm rate stays below a level you choose (<code class="docutils literal notranslate"><span class="pre">alpha</span></code>).</p>
+<div class="cell docutils container">
+<div class="cell_input docutils container">
+<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>!pip install puncc
+</pre></div>
+</div>
+</div>
+</div>
+<div class="cell docutils container">
+<div class="cell_input docutils container">
+<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">IsolationForest</span>
+<span class="kn">from</span> <span class="nn">deel.puncc.anomaly_detection</span> <span class="kn">import</span> <span class="n">SplitCAD</span>
+<span class="kn">from</span> <span class="nn">deel.puncc.api.prediction</span> <span class="kn">import</span> <span class="n">BasePredictor</span>
+
+<span class="c1"># We need to redefine the predict to output the nonconformity scores.</span>
+<span class="k">class</span> <span class="nc">ADPredictor</span><span class="p">(</span><span class="n">BasePredictor</span><span class="p">):</span>
+    <span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">):</span>
+        <span class="k">return</span> <span class="o">-</span><span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">score_samples</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
+
+<span class="c1"># Wrap Isolation Forest in a predictor</span>
+<span class="n">if_predictor</span> <span class="o">=</span> <span class="n">ADPredictor</span><span class="p">(</span><span class="n">IsolationForest</span><span class="p">())</span>
+
+<span class="c1"># Instantiate CAD on top of IF predictor</span>
+<span class="n">if_cad</span> <span class="o">=</span> <span class="n">SplitCAD</span><span class="p">(</span><span class="n">if_predictor</span><span class="p">,</span> <span class="n">train</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
+
+<span class="c1"># Fit on 70% of dataset; calibrate the threshold on the remaining 30%</span>
+<span class="n">if_cad</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">z</span><span class="o">=</span><span class="n">dataset</span><span class="p">,</span> <span class="n">fit_ratio</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span>
+
+<span class="c1"># Maximum false detection rate</span>
+<span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.01</span>
+
+<span class="n">results</span> <span class="o">=</span> <span class="n">if_cad</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">new_data</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="n">alpha</span><span class="p">)</span>
+</pre></div>
+</div>
+</div>
+</div>
+</section>
 </section>
 
     <script type="text/x-thebe-config">
@@ -556,6 +599,7 @@ <h2><span class="section-number">5.6.1. </span>Ensembling for Outlier Detection<
   <nav class="bd-toc-nav page-toc">
     <ul class="visible nav section-nav flex-column">
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#ensembling-for-outlier-detection">5.6.1. Ensembling for Outlier Detection</a></li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#robust-outlier-detection-with-puncc">5.6.2. Robust Outlier Detection with <code class="docutils literal notranslate"><span class="pre">puncc</span></code></a></li>
 </ul>
   </nav></div>
 

diff --git a/book/pandas/Chapter.html b/book/pandas/Chapter.html
@@ -420,6 +420,7 @@ <h2> Contents </h2>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#change-the-plotting-backend">8.1.3. Change the Plotting Backend</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#style-your-dataframes">8.1.4. Style your DataFrames</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#set-precision-of-displayed-floats">8.1.5. Set Precision of Displayed Floats</a></li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#faster-i-o-with-parquet">8.1.6. Faster I/O with Parquet</a></li>
 </ul>
             </nav>
         </div>
@@ -534,6 +535,34 @@ <h2><span class="section-number">8.1.5. </span>Set Precision of Displayed Floats
 </div>
 </div>
 </section>
+<section id="faster-i-o-with-parquet">
+<h2><span class="section-number">8.1.6. </span>Faster I/O with Parquet<a class="headerlink" href="#faster-i-o-with-parquet" title="Permalink to this heading">#</a></h2>
+<p>When you work with larger datasets, avoid CSV and similar text formats.</p>
+<p>CSV files are plain text, which makes them human-readable and therefore a popular way to store data.</p>
+<p>For small datasets, that works fine.</p>
+<p>But what if your data has millions of rows?</p>
+<p>Reading and writing them can get really slow.</p>
+<p>Binary file formats are the alternative.</p>
+<p>They store values in their machine representation rather than as text, so they are meant to be read by programs that know the format, not by humans.</p>
+<p>Because of that, binary files are usually more compact and faster to parse.</p>
+<p>Parquet is a popular binary file format that takes less disk space than CSV and is faster to read and write.</p>
+<div class="cell docutils container">
+<div class="cell_input docutils container">
+<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
+
+<span class="c1"># Shape: (100000000, 5)</span>
+<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="o">...</span><span class="p">)</span>
+
+<span class="c1"># Time: 1m 58s</span>
+<span class="n">df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s2">&quot;data.csv&quot;</span><span class="p">)</span>
+
+<span class="c1"># Time: 8s</span>
+<span class="n">df</span><span class="o">.</span><span class="n">to_parquet</span><span class="p">(</span><span class="s2">&quot;data.parquet&quot;</span><span class="p">)</span>
+</pre></div>
+</div>
+</div>
+</div>
+</section>
 </section>
 
     <script type="text/x-thebe-config">
@@ -608,6 +637,7 @@ <h2><span class="section-number">8.1.5. </span>Set Precision of Displayed Floats
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#change-the-plotting-backend">8.1.3. Change the Plotting Backend</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#style-your-dataframes">8.1.4. Style your DataFrames</a></li>
 <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#set-precision-of-displayed-floats">8.1.5. Set Precision of Displayed Floats</a></li>
+<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#faster-i-o-with-parquet">8.1.6. Faster I/O with Parquet</a></li>
 </ul>
   </nav></div>
 

diff --git a/book/pythontricks/Chapter.html b/book/pythontricks/Chapter.html
@@ -1014,6 +1014,10 @@ <h2><span class="section-number">10.1.23. </span>Modify Print Statements<a class
 </section>
 <section id="type-variables-in-python-3-12">
 <h2><span class="section-number">10.1.24. </span>Type Variables in Python 3.12<a class="headerlink" href="#type-variables-in-python-3-12" title="Permalink to this heading">#</a></h2>
+<p>One cool feature in Python 3.12:</p>
+<p>A new, compact syntax for type variables (PEP 695).</p>
+<p>You can use them to parametrize generic classes and functions.</p>
+<p>See below for a small example where our generic class is parametrized by T, which we declare with [T] after the class name.</p>
 <div class="cell docutils container">
 <div class="cell_input docutils container">
 <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">Stack</span><span class="p">[</span><span class="n">T</span><span class="p">]:</span>

diff --git a/searchindex.js b/searchindex.js