better readme, small fix

fpgmaas · Jun 23, 2024 · 1656ad3
1 parent 95abfb8

Showing 3 changed files with 4 additions and 3 deletions.
diff --git a/README.md b/README.md
@@ -52,7 +52,8 @@ There are three methods to run the setup script, dependent on if you have a NVID
 - [Option 3: Using Docker without NVIDIA GPU and NVIDIA Container Toolkit](SETUP.md#option-3-using-docker-without-nvidia-gpu-and-nvidia-container-toolkit)
 
 > [!NOTE]
-> Although the dataset contains all packages on PyPI with more than 50 weekly downloads, by default only the top 40% of this dataset (those with more than approximately 250 downloads per week) are added to the vector database. To include packages with less weekly downloads in the database, you can increase the value of `FRAC_DATA_TO_INCLUDE` in `pypi_scout/config.py`.
+> The dataset contains approximately 100.000 packages on PyPI with more than 100 weekly downloads. To speed up local development,
+> you can lower the amount of packages that is processed locally by lowering the value of `FRAC_DATA_TO_INCLUDE` in `pypi_scout/config.py`.
 
 #### 3. **Run the Application**
 

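Review note on the README hunk above: the note says that lowering `FRAC_DATA_TO_INCLUDE` reduces how many packages are processed, and the removed text indicates that the most downloaded packages are the ones kept. A minimal sketch of that idea in plain Python, assuming the fraction is applied after ranking by weekly downloads. The helper `keep_top_fraction` and the sample rows are hypothetical and are not taken from `pypi_scout`.

```python
import math


def keep_top_fraction(packages, frac):
    """Return the most-downloaded `frac` share of `packages` (hypothetical helper)."""
    if not 0.0 < frac <= 1.0:
        raise ValueError("frac must be in the interval (0, 1]")
    # Rank from most to least downloaded, then keep the leading share.
    ranked = sorted(packages, key=lambda p: p["weekly_downloads"], reverse=True)
    n_keep = math.ceil(len(ranked) * frac)
    return ranked[:n_keep]


# Made-up rows for illustration only.
sample = [
    {"name": "alpha", "weekly_downloads": 120},
    {"name": "beta", "weekly_downloads": 90_000},
    {"name": "gamma", "weekly_downloads": 450},
    {"name": "delta", "weekly_downloads": 3_200},
    {"name": "epsilon", "weekly_downloads": 101},
]

subset = keep_top_fraction(sample, 0.4)
print([p["name"] for p in subset])
```

With a value of `1.0` every row is kept, which matches the README's framing that a lower value is mainly a convenience for local development.
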
diff --git a/frontend/app/components/InfoBox.tsx b/frontend/app/components/InfoBox.tsx
@@ -20,7 +20,7 @@ const InfoBox: React.FC<InfoBoxProps> = ({ infoBoxVisible }) => {
       <br />
       <p className="text-gray-100">
         Once you click search, your query will be matched against the summary
-        and the first part of the description of the ~50.000 most popular
+        and the first part of the description of the ~100.000 most popular
         packages on PyPI, which includes all packages with at least ~100
         downloads per week. The results are then scored based on their
         similarity to the query and their number of weekly downloads, and the

diff --git a/pypi_scout/data/raw_data_reader.py b/pypi_scout/data/raw_data_reader.py
@@ -21,7 +21,7 @@ def read(self):
             DataFrame: The processed dataframe.
         """
         df = pl.read_csv(self.raw_dataset)
-        df = df.with_columns(weekly_downloads=pl.col("number_of_downloads").round().cast(pl.Int32))
+        df = df.with_columns(weekly_downloads=pl.col("number_of_downloads").cast(pl.Int32))
         df = df.drop("number_of_downloads")
         df = df.unique(subset="name")
         df = df.filter(~(pl.col("description").is_null() & pl.col("summary").is_null()))
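Review note on the `raw_data_reader.py` hunk above: the change drops `.round()` before the cast to `pl.Int32`. The diff does not show the CSV schema, so whether this changes any values depends on the column type. If `number_of_downloads` were a floating-point column (an assumption), a float-to-integer cast would generally truncate toward zero instead of rounding to the nearest integer. The plain-Python sketch below only illustrates that difference; it does not call Polars, and the sample values are made up.

```python
# Hypothetical download counts; the fractional ones exist only to show the effect.
values = [249.7, 100.2, 99.5, 1000.0]

rounded_then_cast = [int(round(v)) for v in values]  # round first, then convert
cast_only = [int(v) for v in values]                 # int() truncates toward zero

for v, r, c in zip(values, rounded_then_cast, cast_only):
    print(f"{v:>7}: rounded={r:>5} truncated={c:>5}")
```

If the column already holds integers, both paths give the same result and the change is purely a simplification.
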