Commit 20aaead

deploy: 131c2df

roumail committed Oct 31, 2023
1 parent fb36edf commit 20aaead

Showing 76 changed files with 11,378 additions and 1,434 deletions.
10 changes: 5 additions & 5 deletions 404.html
@@ -1,6 +1,6 @@
<!DOCTYPE html>
<!-- This site was created with Wowchemy. https://www.wowchemy.com -->
<!-- Last Published: October 27, 2023 --><html lang="en-us" >
<!-- Last Published: October 31, 2023 --><html lang="en-us" >


<head>
@@ -712,12 +712,14 @@ <h2>Latest</h2>

<li><a href="/event/example/">Example Talk</a></li>

<li><a href="/personal-values-part-ii/">Unearthing Your True North</a></li>

<li><a href="/post/">Posts</a></li>

<li><a href="/personal-values-part-iii/">Embedding Your True North</a></li>

<li><a href="/post/series-decoding-rohail/">Decoding Rohail</a></li>

<li><a href="/personal-values-part-ii/">Unearthing Your True North</a></li>

<li><a href="/personal-values-part-i/">Quarantine Chronicles: How COVID-19 Helped me look inward</a></li>

<li><a href="/a-scraper-that-scales-part-ii/">Stateful Applications Need to Be Designed Differently</a></li>
@@ -726,8 +728,6 @@

<li><a href="/a-scraper-that-scales-part-i/">The Motivation to Build a Scraper in Python</a></li>

<li><a href="/post/20231017-hello-world/">Welcome to my Website! 👋</a></li>

</ul>


112 changes: 59 additions & 53 deletions a-scraper-that-scales-part-i/index.html
@@ -1,6 +1,6 @@
<!DOCTYPE html>
<!-- This site was created with Wowchemy. https://www.wowchemy.com -->
<!-- Last Published: October 27, 2023 --><html lang="en-us" >
<!-- Last Published: October 31, 2023 --><html lang="en-us" >


<head>
@@ -945,42 +945,44 @@ <h2 id="motivation">Motivation</h2>
property listings website: <a href="https://immoweb.be" target="_blank" rel="noopener">immoweb</a>. My goal for this
first version was, first, to get a very general idea of the Brussels property
market. Thereafter, I would launch this script every few days to look
at the new properties. The output of this script would be a CSV that I&rsquo;d use to
spot good deals and have all the relevant information I&rsquo;d need to schedule
visits.</p>
<h2 id="implementing-the-proof-of-concept">Implementing the Proof of concept</h2>
<p>I was running this script from a Windows machine at the time and, having done
a scraping project once before, I knew that I&rsquo;d start with <code>selenium</code> for
the browser automation and parsing of the HTML. The setup required choosing a
browser and a corresponding driver with a version matching that browser, for
instance geckodriver for Firefox. I&rsquo;ve used the Firefox and Edge browsers (and
their respective drivers) for different iterations of the scraper implementation.</p>
<p>After messing around with the developer tools, using <code>inspect</code> to find the
DOM elements containing the information I was looking for, I had a script that
did the job. I made a conda export of the environment I used for the scraping in
case I ever needed to revisit this work, and I was quite happy leaving it at
that, knowing I could pick the analysis up again when needed. This version of the
script can be found
<a href="https://github.com/roumail/immoweb-scraper/tree/second_run" target="_blank" rel="noopener">here</a> for those who
are interested.</p>
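<p>For the curious, here is a minimal sketch of that kind of setup. It assumes
Selenium 4.6+, which fetches a matching driver by itself, and the search URL is
purely illustrative:</p>
<pre><code class="language-python">from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")  # no browser window needed while scraping

# Selenium 4.6+ downloads a matching geckodriver via Selenium Manager
driver = webdriver.Firefox(options=options)
try:
    driver.get("https://www.immoweb.be/en/search/house/for-sale")  # illustrative URL
    html = driver.page_source  # raw HTML, ready to be parsed
finally:
    driver.quit()
</code></pre>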
<h2 id="a-sidenote-on-bayesian-statistics">A sidenote on Bayesian statistics</h2>
<p>For the longest time I&rsquo;ve been a fan of Bayesian statistics: you can
explicitly encode your modelling assumptions in the form of priors, and you are
forced to be very deliberate in reconstructing the data generating process of the
phenomenon you&rsquo;re modelling. You can visually verify how well your model is
generalizing by doing what is called a
<a href="https://en.wikipedia.org/wiki/Posterior_predictive_distribution" target="_blank" rel="noopener">posterior predictive check</a>.
The computational aspects of MCMC sampling also appeal to the nerd in me, while
the convergence of your sampler gives an indication of how well-informed a
hypothesis you have for your data generating process. An ill-formulated model
will simply not converge, unlike a number of other approaches, which would always
give a solution and then leave you to figure out whether you&rsquo;re overfitting or
underfitting. Then there is the fact that you always work with distributions of
your phenomenon of interest rather than relying solely on point estimates as we
would in most other methods. There are a number of fascinating things that are
possible with these posterior distributions, including Bayesian decision making.
I will link to a great discussion of the subject by Thomas Wiecki
<a href="https://twiecki.io/blog/2019/01/14/supply_chain/" target="_blank" rel="noopener">here</a>, where we can see how to
use our models to directly show the impact of uncertainty on real business
metrics rather than arcane statistical metrics such as <code>mean squared error</code>,
<code>f1 score</code> and the like, which don&rsquo;t hold any real business meaning.</p>
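<p>To make the posterior predictive check concrete, here is a minimal sketch on
toy data. The post doesn&rsquo;t prescribe a library; I&rsquo;m assuming PyMC v5 and
ArviZ here:</p>
<pre><code class="language-python">import arviz as az
import numpy as np
import pymc as pm

# toy data generating process: a noisy linear relationship
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 2.5 * x + rng.normal(scale=0.5, size=100)

with pm.Model():
    beta = pm.Normal("beta", mu=0, sigma=10)  # prior encodes our assumption
    sigma = pm.HalfNormal("sigma", sigma=1)
    pm.Normal("y_obs", mu=beta * x, sigma=sigma, observed=y)
    idata = pm.sample(1000, tune=1000)
    idata.extend(pm.sample_posterior_predictive(idata))

# overlay data simulated from the posterior on the observed data
az.plot_ppc(idata)
</code></pre>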
<p>Naturally, I have my bias for these methods, and using these models brings their
@@ -991,13 +993,15 @@ <h2 id="a-sidenote-on-bayesian-statistics">A sidenote on Bayesian statistics</h2>
the natural hierarchical structure of data in a
<a href="https://en.wikipedia.org/wiki/Bayesian_hierarchical_modeling" target="_blank" rel="noopener">hierarchical modelling</a>
or the flexibility of Gaussian process modelling to capture the intricacies of
non-linear processes. This
<a href="https://www.pymc.io/projects/examples/en/latest/gaussian_processes/GP-smoothing.html" target="_blank" rel="noopener">link</a>
shows the distinction between modelling the same problem as a regression vs
using a Gaussian process smoothing model.</p>
<h2 id="revisiting-the-implementation-once-again">Revisiting the implementation once again</h2>
<p>Any data scientist or machine learning practitioner will tell you about their
struggles with data. It&rsquo;s either data quality (or lack thereof) or just the lack
of data itself for performing interesting analyses. Then it suddenly occurred to
me: property data is perfect for the experiments I wanted to conduct.</p>
<p>Scraping property listings regularly gives the opportunity to model property
prices over time and ask interesting questions, including, but not limited to,
the following:</p>
@@ -1012,13 +1016,13 @@ <h2 id="revisiting-the-implementation-once-again">Revisiting the implementation once again</h2>
property prices evolve over time. There is also a natural structure in the data
that can be exploited since we can indeed expect properties within communes to
be similarly priced. This would be a place where we can use Gaussian process
modelling to capture the underlying trends and fluctuations in property prices
within each commune. We can use property proximity to model the inherent spatial
relationships between properties, assuming that properties closer to each other
are more likely to have similar prices!</p>
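<p>As a sketch of what that could look like for a single commune, here is a
Gaussian process over time on invented prices, assuming the PyMC v5 API:</p>
<pre><code class="language-python">import numpy as np
import pymc as pm

# hypothetical data: 24 monthly median prices (kEUR) for one commune
months = np.arange(24, dtype=float)[:, None]
prices = 420 + 2 * months.ravel() + np.random.default_rng(0).normal(scale=8, size=24)

with pm.Model():
    ell = pm.Gamma("ell", alpha=2, beta=0.5)  # how quickly the trend may wiggle
    eta = pm.HalfNormal("eta", sigma=50)      # amplitude of the trend
    cov = eta**2 * pm.gp.cov.ExpQuad(1, ls=ell)
    gp = pm.gp.Marginal(cov_func=cov)
    noise = pm.HalfNormal("noise", sigma=10)
    gp.marginal_likelihood("prices", X=months, y=prices, sigma=noise)
    idata = pm.sample(500, tune=500)
</code></pre>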
<p>By revisiting my initial scraper implementation with this newfound focus, I am
not just enhancing a tool; I am building a robust data collection pipeline that
will serve as the backbone for these sophisticated analytical experiments.</p>
<h2 id="areas-of-improvement">Areas of improvement</h2>
<p>With a clear goal in mind, I identified several key areas to refine the
scraper&rsquo;s implementation. These improvements were aimed at making the scraper
@@ -1027,41 +1031,43 @@ <h3 id="1-efficiency-in-data-scraping">1. Efficiency in Data Scraping</h3>
<h3 id="1-efficiency-in-data-scraping">1. Efficiency in Data Scraping</h3>
<p>Switch to Beautiful Soup: I wanted to transition from
<a href="https://pypi.org/project/selenium/" target="_blank" rel="noopener">Selenium</a> to
<a href="https://pypi.org/project/beautifulsoup4/" target="_blank" rel="noopener">Beautiful Soup</a> for parsing raw
HTML. This change ought to significantly reduced the time needed to scrape data.</p>
<p>Parametrization of Postal Codes: Allowing postal codes as an input parameter
to make the scraper more flexible. I was initially only looking into a few
communes in Brussels that I was interested in. However, if I wanted to do some
<a href="https://pypi.org/project/beautifulsoup4/" target="_blank" rel="noopener">Beautiful Soup</a> for parsing raw HTML.
This change ought to significantly reduced the time needed to scrape data.</p>
<p>Parametrization of Postal Codes: Allowing postal codes as an input parameter to
make the scraper more flexible. I was initially only looking into a few communes
in Brussels that I was interested in. However, if I wanted to do some
interesting analyses, I also wanted to consider communes neighboring Brussels.</p>
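<p>A sketch of what the two changes amount to; the CSS selectors and field names
below are hypothetical, not immoweb&rsquo;s actual markup:</p>
<pre><code class="language-python">from bs4 import BeautifulSoup

def parse_listings(html: str, postal_code: int) -> list[dict]:
    """Turn one page of raw HTML into listing records."""
    soup = BeautifulSoup(html, "html.parser")
    listings = []
    for card in soup.select("article.card--result"):  # hypothetical selector
        title = card.select_one("h2.card__title")
        price = card.select_one("p.card__price")
        listings.append({
            "postal_code": postal_code,
            "title": title.get_text(strip=True) if title else None,
            "price": price.get_text(strip=True) if price else None,
        })
    return listings
</code></pre>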
<h3 id="2-dependency-management">2. Dependency Management</h3>
<p>Use of Poetry: To manage the project&rsquo;s dependencies more effectively, I wanted
to convert the script into a Python package and use
<a href="https://python-poetry.org/" target="_blank" rel="noopener">Poetry</a> for managing the
dependencies. This streamlines the installation process and allows me to manage
the package versions in a systematic way. This would be especially useful as
we dockerize the analysis in the future and build a CI/CD pipeline.</p>
<p>Implementation of Typer: I used <a href="https://pypi.org/project/typer/" target="_blank" rel="noopener">Typer</a> to
create a command-line interface from the main application entrypoint. I&rsquo;ve
effectively transitioned to using this instead of
<a href="https://pypi.org/project/click/" target="_blank" rel="noopener"><code>click</code></a> recently.</p>
<h3 id="3-code-refactoring-for-readability-and-maintainability">3. Code Refactoring for Readability and Maintainability</h3>
<p>Object-Oriented Approach: I wanted to refactor the code to use Python classes
instead of just functions where appropriate. By using meaningful class names,
the code can become self-documenting and easier to maintain and extend in the
long run.</p>
<h3 id="5-data-storage-and-validation">5. Data Storage and Validation</h3>
<p>SQLite Database: I wanted to use a SQLite database with an initial schema to
store the data I&rsquo;d be accumulating over time. I&rsquo;ve really enjoyed working with
<a href="https://pypi.org/project/SQLAlchemy/" target="_blank" rel="noopener"><code>SQLAlchemy</code></a> as the ORM mapper to interact with the database.</p>
<p>Data Validation with Pydantic: Before adding the scraped data to the database,
I implemented validation checks using
<a href="https://pypi.org/project/SQLAlchemy/" target="_blank" rel="noopener"><code>SQLAlchemy</code></a> as the ORM mapper to
interact with the database.</p>
<p>Data Validation with Pydantic: Before adding the scraped data to the database, I
implemented validation checks using
<a href="https://pypi.org/project/pydantic/" target="_blank" rel="noopener">Pydantic</a>. This ensured that only
high-quality, accurate data was stored.</p>
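<p>A sketch of how the two pieces can fit together, assuming SQLAlchemy 2.x and
Pydantic v2; the <code>Listing</code> schema is illustrative rather than the
project&rsquo;s actual one:</p>
<pre><code class="language-python">from pydantic import BaseModel, PositiveInt
from sqlalchemy import create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column

class Base(DeclarativeBase):
    pass

class Listing(Base):  # ORM model: one row per scraped property
    __tablename__ = "listings"
    id: Mapped[int] = mapped_column(primary_key=True)
    postal_code: Mapped[int]
    price: Mapped[int]

class ListingIn(BaseModel):  # validation gate in front of the database
    postal_code: PositiveInt
    price: PositiveInt

engine = create_engine("sqlite:///immoweb.db")
Base.metadata.create_all(engine)

raw = {"postal_code": 1000, "price": 425_000}
record = ListingIn(**raw)  # raises ValidationError on bad input
with Session(engine) as session:
    session.add(Listing(**record.model_dump()))
    session.commit()
</code></pre>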
<p>By focusing on these areas, I aimed to build a scraper that was not just a
one-off script but a robust data collection tool capable of supporting more
complex analyses and experiments.</p>
<h2 id="final-comments">Final comments</h2>
<p>In the next blog <a href="/a-scraper-that-scales-part-ii/">post</a> in the series, I will
go over the implementation details. For those interested, you can find the
current state of the project
<a href="https://github.com/roumail/immoweb-scraper/tree/v1.0.0" target="_blank" rel="noopener">here</a>.</p>


2 changes: 1 addition & 1 deletion a-scraper-that-scales-part-ii/index.html
@@ -1,6 +1,6 @@
<!DOCTYPE html>
<!-- This site was created with Wowchemy. https://www.wowchemy.com -->
<!-- Last Published: October 27, 2023 --><html lang="en-us" >
<!-- Last Published: October 31, 2023 --><html lang="en-us" >


<head>
6 changes: 3 additions & 3 deletions categories/index.html
@@ -1,6 +1,6 @@
<!DOCTYPE html>
<!-- This site was created with Wowchemy. https://www.wowchemy.com -->
<!-- Last Published: October 27, 2023 --><html lang="en-us" >
<!-- Last Published: October 31, 2023 --><html lang="en-us" >


<head>
@@ -324,7 +324,7 @@
<meta property="og:description" content="Project portfolio and blog website of Rohail Taimour for Machine Learning, Data science and personal musics about Drums and life" /><meta property="og:image" content="https://rohailtaimour.com/media/icon_hu0b7a4cb9992c9ac0e91bd28ffd38dd00_9727_512x512_fill_lanczos_center_3.png" /><meta property="og:locale" content="en-us" />


<meta property="og:updated_time" content="2023-10-26T18:24:31&#43;02:00" />
<meta property="og:updated_time" content="2023-10-31T13:26:24&#43;01:00" />



@@ -777,7 +777,7 @@ <h1>Blog Categories</h1>



Oct 26, 2023
Oct 31, 2023
</span>


2 changes: 1 addition & 1 deletion categories/index.xml
@@ -5,7 +5,7 @@
<link>https://rohailtaimour.com/categories/</link>
<atom:link href="https://rohailtaimour.com/categories/index.xml" rel="self" type="application/rss+xml" />
<description>Blog Categories</description>
<generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Thu, 26 Oct 2023 18:24:31 +0200</lastBuildDate>
<generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Tue, 31 Oct 2023 13:26:24 +0100</lastBuildDate>
<image>
<url>https://rohailtaimour.com/media/icon_hu0b7a4cb9992c9ac0e91bd28ffd38dd00_9727_512x512_fill_lanczos_center_3.png</url>
<title>Blog Categories</title>