Skip to content

Latest commit

 

History

History
13 lines (10 loc) · 1.33 KB

README.md

File metadata and controls

13 lines (10 loc) · 1.33 KB

Text-Collection

The aim of this project is to read in a semi-structured file ('txt for-assignment-data-science.text'), and create a hash table. The current file contains 3 articles from the LA Times articles collection.

The hash-table's purpose is to count the number of times that a word appears in each article. For example, entering the word as a 'key' to the hash table will bring up a list of counters that are indexed in order of document number. The only thing to take into consideration is zero-indexing. Thus, to find out how many times 'and' would appear in document number 1, type 'hashtable['and'][0]'. This accesses the counter in the first element of the list.

The repository contains 3 relevant files:

  • TextCollect.py
    • This contains the class TextCollect() which contains methods that are used. Tags in the file e.g. '</p>' that mark the documents are removed. Punctuation is also removed by removing all non alphabetical or numerical values. Finally, words are also singularised and lower-cased to avoid multiple counts for similar words, e.g. 'cow', 'cows'.
  • txt-for-assignment-data-science.txt
    • This contains the raw semi-structured data.
  • Text Collection - Demo and Visualisation.ipynb
    • This contains the demonstration of how to use the class, and plots a histogram of the number of times a word appeared in all documents.