diff --git a/README.md b/README.md
index e50edb1..a09814f 100644
--- a/README.md
+++ b/README.md
@@ -52,7 +52,8 @@ There are three methods to run the setup script, dependent on if you have a NVID
 - [Option 3: Using Docker without NVIDIA GPU and NVIDIA Container Toolkit](SETUP.md#option-3-using-docker-without-nvidia-gpu-and-nvidia-container-toolkit)

 > [!NOTE]
-> Although the dataset contains all packages on PyPI with more than 50 weekly downloads, by default only the top 40% of this dataset (those with more than approximately 250 downloads per week) are added to the vector database. To include packages with less weekly downloads in the database, you can increase the value of `FRAC_DATA_TO_INCLUDE` in `pypi_scout/config.py`.
+> The dataset contains approximately 100,000 packages on PyPI with more than 100 weekly downloads. To speed up local development,
+> you can reduce the number of packages that are processed locally by lowering the value of `FRAC_DATA_TO_INCLUDE` in `pypi_scout/config.py`.

 #### 3. **Run the Application**

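Note (not part of the patch): the README note above points at `FRAC_DATA_TO_INCLUDE` as the knob for faster local runs. A minimal sketch of what lowering it looks like, assuming the setting is a plain module-level constant in `pypi_scout/config.py` (the real file may define it inside a config class, and 0.25 is a hypothetical value):

```python
# pypi_scout/config.py -- sketch, only the relevant setting shown.
# Fraction of the ~100,000-package dataset (ordered by weekly downloads)
# to embed and load into the vector database. 1.0 processes everything;
# lower values speed up local development at the cost of coverage.
FRAC_DATA_TO_INCLUDE = 0.25  # hypothetical: keep only the top 25%
```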
diff --git a/frontend/app/components/InfoBox.tsx b/frontend/app/components/InfoBox.tsx
index 3fe12a3..88c5cd4 100644
--- a/frontend/app/components/InfoBox.tsx
+++ b/frontend/app/components/InfoBox.tsx
@@ -20,7 +20,7 @@ const InfoBox: React.FC = ({ infoBoxVisible }) => {

           Once you click search, your query will be matched against the summary
-          and the first part of the description of the ~50.000 most popular
+          and the first part of the description of the ~100,000 most popular
           packages on PyPI, which includes all packages with at least ~100
           downloads per week. The results are then scored based on their
           similarity to the query and their number of weekly downloads, and the
diff --git a/pypi_scout/data/raw_data_reader.py b/pypi_scout/data/raw_data_reader.py
index 2184fd3..a3a6eef 100644
--- a/pypi_scout/data/raw_data_reader.py
+++ b/pypi_scout/data/raw_data_reader.py
@@ -21,7 +21,7 @@ def read(self):
         DataFrame: The processed dataframe.
         """
         df = pl.read_csv(self.raw_dataset)
-        df = df.with_columns(weekly_downloads=pl.col("number_of_downloads").round().cast(pl.Int32))
+        df = df.with_columns(weekly_downloads=pl.col("number_of_downloads").cast(pl.Int32))
         df = df.drop("number_of_downloads")
         df = df.unique(subset="name")
         df = df.filter(~(pl.col("description").is_null() & pl.col("summary").is_null()))
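Note (not part of the patch): the InfoBox copy above describes ranking as a blend of semantic similarity and weekly downloads. Purely as an illustration of that idea; the 0.5 weight, the log scaling, and `blended_score` itself are assumptions, not pypi_scout's actual scoring code:

```python
import math

def blended_score(similarity: float, weekly_downloads: int, weight: float = 0.5) -> float:
    """Hypothetical ranking: mix cosine similarity (0..1) with a
    log-scaled download count. Not the project's real formula."""
    # log1p compresses the heavy-tailed download counts; the divisor
    # normalizes so the most-downloaded packages land near 1.0.
    popularity = math.log1p(weekly_downloads) / math.log1p(10_000_000)
    return weight * similarity + (1 - weight) * popularity

print(blended_score(0.82, 5_000))  # popular-ish package, good match
print(blended_score(0.90, 120))    # niche package, better match
```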
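Note (not part of the patch): dropping `.round()` in `raw_data_reader.py` is only behavior-preserving if `number_of_downloads` already holds whole numbers, because Polars truncates toward zero when casting floats to integers. A small self-contained check of the difference (the sample values are made up):

```python
import polars as pl

df = pl.DataFrame({"number_of_downloads": [120.0, 99.6]})

# New behavior: cast alone truncates toward zero.
truncated = df.with_columns(weekly_downloads=pl.col("number_of_downloads").cast(pl.Int32))
# Old behavior: round first, then cast.
rounded = df.with_columns(weekly_downloads=pl.col("number_of_downloads").round().cast(pl.Int32))

print(truncated["weekly_downloads"].to_list())  # [120, 99]
print(rounded["weekly_downloads"].to_list())    # [120, 100]
```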