diff --git a/.DS_Store b/.DS_Store new file mode 100644 index 0000000..b17f688 Binary files /dev/null and b/.DS_Store differ diff --git a/README.md b/README.md new file mode 100644 index 0000000..efaba39 --- /dev/null +++ b/README.md @@ -0,0 +1,49 @@ +# How have people’s attitudes towards COVID-19 changed? +### Heather Chen + +#### 1. Research questions +My research project is a content analysis that tracks changes in people’s attitudes during the COVID-19 period. Specifically, using the Global Database of Events, Language and Tone (GDELT), it aims to answer the following three questions:
+**1) How has the volume of news about COVID-19 changed? Are there any breakpoints?**
+**2) Which countries report COVID-19 news most often? Can we see a pattern here?**
+**3) Do countries differ in their focus on COVID-19?**
+In the sections below, I first introduce the dataset used in this project. In the third section, I discuss the large-scale computing strategies that I employ and why they are necessary for the project. In the fourth section, I present general results that answer the research questions. In the last section, I briefly discuss the limitations of my project and how it could be improved.
+ +#### 2. Data and project description +The corpus used in this project is a GDELT sub-database called “covid19.onlinenewsgeo”. It extracts news about COVID-19 from across the world and is updated every 15 minutes. For each news article, the database contains the title, URL, country of origin, contextual text, and other information. The analysis below uses only a few of these fields.

+Given the huge volume of COVID-19 news, I select only news from January 7th to March 31st, 2020. On January 7th, the Chinese government officially reported (or acknowledged) the disease for the first time, so it is a reasonable starting point. Three months is long enough to track changes in attitudes. This 85-day window is therefore a practical choice given the limited computing power (discussed in detail in the fifth section).
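As a quick check on the date window, the sketch below enumerates the daily file keys the analysis expects. It assumes the `gdelt_data/bq-results-MMDD.csv` naming visible in the notebook's S3 listing and the year 2020; the function name `expected_keys` is hypothetical and does not appear in the notebooks.

```python
from datetime import date, timedelta


def expected_keys(start, end, prefix="gdelt_data/bq-results-"):
    """Return one CSV key per day from start to end, inclusive."""
    n_days = (end - start).days + 1
    return [
        f"{prefix}{(start + timedelta(days=i)).strftime('%m%d')}.csv"
        for i in range(n_days)
    ]


# Assumed window: January 7th to March 31st, 2020 (a leap year).
keys = expected_keys(date(2020, 1, 7), date(2020, 3, 31))
print(len(keys), keys[0], keys[-1])
```

Comparing this list against the keys actually present in the bucket is one way to spot a missing day before running the analysis.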
+ +#### 3. Parallelism or large-scale computing strategy +I mainly use two large-scale computing strategies in this project: Amazon S3 for data storage and PySpark on an EMR cluster for data analysis.
+**1) Amazon S3 bucket**
+I use an Amazon S3 bucket for data storage because the daily news data (in CSV format) is too large to save and process on a local machine. The 85-day corpus takes up over 50 GB of storage. Uploading the files to an S3 bucket also makes them available for later processing on the EMR cluster.
+**2) PySpark on EMR cluster**
+To do exploratory analysis and topic modeling on the corpus, I use PySpark on an EMR cluster. Exploratory data analysis involves selecting and filtering data by given criteria. Since this part does not require much computing power, I create an EMR cluster of 2 instances for it. Topic modeling, however, requires more complex machine learning methods, including Tokenizer, StopWordsRemover, TF-IDF, CountVectorizer, and Latent Dirichlet Allocation (LDA). I therefore create a separate 8-instance cluster for this part.
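The topic-modeling stages listed above could be chained roughly as follows. This is a sketch under stated assumptions, not the code from final_project2.ipynb: the column names, `vocab_size`, `max_iter`, and both function names are hypothetical. PySpark spells its stop-word stage `StopWordsRemover`, and TF-IDF is realized here as `CountVectorizer` followed by `IDF`.

```python
def build_topic_pipeline(text_col="ContextualText", num_topics=5,
                         vocab_size=5000, max_iter=10):
    """Build (but do not fit) a tokenize -> stop words -> TF-IDF -> LDA pipeline."""
    # Imported lazily so the helper below also works without a Spark installation.
    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import LDA
    from pyspark.ml.feature import CountVectorizer, IDF, StopWordsRemover, Tokenizer

    tokenizer = Tokenizer(inputCol=text_col, outputCol="tokens")
    remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
    counts = CountVectorizer(inputCol="filtered", outputCol="raw_features",
                             vocabSize=vocab_size)
    idf = IDF(inputCol="raw_features", outputCol="features")
    lda = LDA(k=num_topics, maxIter=max_iter, featuresCol="features")
    return Pipeline(stages=[tokenizer, remover, counts, idf, lda])


def topic_words(vocabulary, term_indices):
    """Map the term indices of one LDA topic back to vocabulary words."""
    return [vocabulary[i] for i in term_indices]


# Tiny stand-in vocabulary; a real one would come from the fitted CountVectorizer stage.
print(topic_words(["li", "wenliang", "media"], [2, 0]))
```

After fitting, `model.stages[2].vocabulary` would hold the vocabulary and `model.stages[-1].describeTopics(10)` the `termIndices` for each topic, which `topic_words` turns into readable words. `Tokenizer` expects a non-null string column, so rows with missing text would need to be dropped first.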
+ +#### 4. Results +In this section, I present the results of my project in the order of the research questions. final_project.ipynb contains most of the exploratory analysis; final_project2.ipynb contains all of the topic modeling.
+ +**1) How has the volume of news about COVID-19 changed?**
+![Figure 1.1](https://github.com/heathercchen/LSC_projects/graphs/1.1.jpg)
+![Figure 1.2](https://github.com/heathercchen/LSC_projects/graphs/1.2.jpg)
+To answer this question, I plot the data size against time and the number of news items against time. The two graphs show very similar trends. News about COVID-19 first surged around January 17th. A second surge occurred around March 5th. From then on, there are more than 600,000 news items about COVID-19 every day.
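One simple way to look for breakpoints is to flag days whose volume at least doubles relative to the previous day. The sketch below is an illustration rather than the method used in the notebook; the function name `find_jumps` and the doubling threshold are assumptions. The input values are the rounded file sizes for January 15-21 from the notebook's S3 listing.

```python
def find_jumps(dates, values, factor=2.0):
    """Return the dates whose value is at least `factor` times the previous day's value."""
    jumps = []
    for prev, cur, day in zip(values, values[1:], dates[1:]):
        if prev > 0 and cur >= factor * prev:
            jumps.append(day)
    return jumps


# Rounded daily file sizes (MB) for January 15-21, taken from the notebook's S3 listing.
dates = ["01-15", "01-16", "01-17", "01-18", "01-19", "01-20", "01-21"]
sizes_mb = [6, 10, 20, 17, 10, 41, 99]
print(find_jumps(dates, sizes_mb))
```

A day-over-day ratio is sensitive to weekend dips, so a rolling average or a proper change-point method may be a better fit for the full 85-day series.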
+ +**2) Which countries report COVID-19 news most often? Can we see a pattern here?**
+![Figure 2.1](https://github.com/lsc4ss-a20/final-project-heathercchen/blob/master/graphs/2.1.jpg)
+The graph above shows that different countries follow different trends in reporting COVID-19 news. Here I select only 5 countries: China, the United States, the United Kingdom, Brazil, and India. News from China peaks around the end of January, unlike any of the other countries shown. The U.S., like other Western countries, reports little news about the pandemic at the beginning. However, attention increases sharply around the beginning of March. (From my experience, this is when cases in the U.S. began to spread across the states.)
+ +**3) Do countries differ in their focus on COVID-19?**
+For topic modeling, I select only China and the U.S. as example countries. For each of the three months, I extract 5 topics and the top 10 words of each topic. Below are the results for China in January, February, and March.
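The per-country comparison comes down to finding when each country's daily count peaks. The sketch below shows that step on made-up numbers: the toy counts are hypothetical and are not taken from the GDELT data, and `peak_day` is an illustrative helper, not a function from the notebook. The country codes are the ones the notebook filters on; to my understanding GDELT location fields follow FIPS country codes, which is why China appears as "CH".

```python
# Country codes as used in the notebook (assumed to be FIPS codes, so "CH" is China).
COUNTRIES = {"CH": "China", "US": "United States", "UK": "United Kingdom",
             "BR": "Brazil", "IN": "India"}


def peak_day(dates, counts):
    """Return the (date, count) pair with the highest count; ties go to the earliest date."""
    best = max(range(len(counts)), key=lambda i: counts[i])
    return dates[best], counts[best]


# Hypothetical toy counts, for illustration only (not taken from the GDELT data).
dates = ["01-29", "01-30", "01-31", "02-01"]
toy_counts = {"CH": [120, 150, 140, 90], "US": [10, 12, 11, 15]}
for code, counts in toy_counts.items():
    print(COUNTRIES[code], peak_day(dates, counts))
```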
+![Figure 3.1.1](https://github.com/heathercchen/LSC_projects/graphs/3.1.1.jpg)
+![Figure 3.1.2](https://github.com/heathercchen/LSC_projects/graphs/3.1.2.jpg)
+![Figure 3.1.3](https://github.com/heathercchen/LSC_projects/graphs/3.1.3.jpg)
+The January results reveal no clear patterns. In February, the second topic contains words like “li”, “wenliang”, and “media”, which clearly point to the story of China’s whistle-blower Wenliang Li. His warnings about the virus were suppressed by officials, and his death brought nationwide sorrow, respect, and anger. Topics in March contain more information about people’s daily lives, such as lockdown, quarantine, and restrictions. The names of political leaders also appear in topic 3.
+![Figure 3.2.1](https://github.com/heathercchen/LSC_projects/graphs/3.2.1.jpg)
+![Figure 3.2.2](https://github.com/heathercchen/LSC_projects/graphs/3.2.2.jpg)
+![Figure 3.2.3](https://github.com/heathercchen/LSC_projects/graphs/3.2.3.jpg)
+Over the same period in the U.S., the media’s focus is somewhat different. In January, many place names appear, such as “alaska”, “chicago”, “san francisco”, and “los angeles”, as states in the U.S. begin to report their first cases. In February, we see similar place names, along with more political terms such as “trump”, “white house”, “president”, and “cdc”. In March, besides reports of new cases in several states, topic 1 contains words like “economic” and “package”, which are not found in any of the China topics.
+ +In summary, this simple topic modeling gives a clear sense of what the two countries were paying attention to during the three months. There are some similarities between their focuses, but we cannot conclude with certainty that either country prefers certain well-defined topics (e.g., that Americans focus more on economic topics regarding COVID-19).
+ +#### 5. Discussions (Further advancements for better results) +To some degree, the results of the content analysis capture what happened in real life.
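The observation that some words appear in one country's topics but not the other's can be checked mechanically with a set difference. The sketch below uses only the few March words quoted in the text, so the lists are partial; the full top-10 lists are in the figures. The function name `unique_terms` is hypothetical.

```python
def unique_terms(topics_a, topics_b):
    """Return words that appear in any topic of A but in no topic of B."""
    words_a = {w for topic in topics_a for w in topic}
    words_b = {w for topic in topics_b for w in topic}
    return sorted(words_a - words_b)


# A few words quoted in the text for March (partial lists, one inner list per topic).
us_march = [["economic", "package"]]
china_march = [["lockdown", "quarantine", "restrictions"]]
print(unique_terms(us_march, china_march))
```

With the complete top-10 lists for every topic, the same call would give a fuller picture of which terms are specific to each country in a given month.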

+ However, my project leaves a lot of room for improvement. Because accessing AWS through a VPN is slow (I am in China right now), I was not able to run on larger datasets or employ more complicated machine learning techniques. If time and computing power permit, I would apply further preprocessing to the corpus, including lemmatizing and normalizing the tokens.
diff --git a/final_project.ipynb b/final_project.ipynb new file mode 100644 index 0000000..cf1a735 --- /dev/null +++ b/final_project.ipynb @@ -0,0 +1,970 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Final Project \n", + "### Heather Chen\n", + "#### 1. Import the data uploaded on S3" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "spark" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "d42b8dc66c3248328e4a5ba7e8a99f23", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Starting Spark application\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session?
0 | application_1607702121291_0001 | pyspark | idle | Link | Link |
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "SparkSession available as 'spark'.\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collecting boto3\n", + " Downloading https://files.pythonhosted.org/packages/2f/f5/aeb4d65266f7712a627674bd19994cee3e1c66ff588adbc4db3fc0bbbf97/boto3-1.16.34-py2.py3-none-any.whl (129kB)\n", + "Collecting s3transfer<0.4.0,>=0.3.0 (from boto3)\n", + " Downloading https://files.pythonhosted.org/packages/69/79/e6afb3d8b0b4e96cefbdc690f741d7dd24547ff1f94240c997a26fa908d3/s3transfer-0.3.3-py2.py3-none-any.whl (69kB)\n", + "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.7/site-packages (from boto3)\n", + "Collecting botocore<1.20.0,>=1.19.34 (from boto3)\n", + " Downloading https://files.pythonhosted.org/packages/03/ef/e35e41d6e445f472ac8f4fca8dd22726d8c6dc19ab06317164a222d13599/botocore-1.19.34-py2.py3-none-any.whl (7.0MB)\n", + "Collecting python-dateutil<3.0.0,>=2.1 (from botocore<1.20.0,>=1.19.34->boto3)\n", + " Downloading https://files.pythonhosted.org/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl (227kB)\n", + "Collecting urllib3<1.27,>=1.25.4; 
python_version != \"3.4\" (from botocore<1.20.0,>=1.19.34->boto3)\n", + " Downloading https://files.pythonhosted.org/packages/f5/71/45d36a8df68f3ebb098d6861b2c017f3d094538c0fb98fa61d4dc43e69b9/urllib3-1.26.2-py2.py3-none-any.whl (136kB)\n", + "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.20.0,>=1.19.34->boto3)\n", + "Installing collected packages: python-dateutil, urllib3, botocore, s3transfer, boto3\n", + "Successfully installed boto3-1.16.34 botocore-1.19.34 python-dateutil-2.8.1 s3transfer-0.3.3 urllib3-1.26.2" + ] + } + ], + "source": [ + "sc.install_pypi_package(\"boto3\")" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "1aedc1b8069f4310a2300230c8832898", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2020-12-07 14:46:28+00:00 \t 4 MB\t gdelt_data/bq-results-0107.csv \n", + "\n", + "2020-12-07 14:45:53+00:00 \t 4 MB\t gdelt_data/bq-results-0108.csv \n", + "\n", + "2020-12-07 14:47:16+00:00 \t 9 MB\t gdelt_data/bq-results-0109.csv \n", + "\n", + "2020-12-07 14:46:30+00:00 \t 6 MB\t gdelt_data/bq-results-0110.csv \n", + "\n", + "2020-12-07 14:46:04+00:00 \t 6 MB\t gdelt_data/bq-results-0111.csv \n", + "\n", + "2020-12-07 14:49:53+00:00 \t 2 MB\t gdelt_data/bq-results-0112.csv \n", + "\n", + "2020-12-07 14:50:25+00:00 \t 7 MB\t gdelt_data/bq-results-0113.csv \n", + "\n", + 
"2020-12-07 14:50:13+00:00 \t 6 MB\t gdelt_data/bq-results-0114.csv \n", + "\n", + "2020-12-07 14:50:22+00:00 \t 6 MB\t gdelt_data/bq-results-0115.csv \n", + "\n", + "2020-12-07 14:51:03+00:00 \t 10 MB\t gdelt_data/bq-results-0116.csv \n", + "\n", + "2020-12-07 14:49:21+00:00 \t 20 MB\t gdelt_data/bq-results-0117.csv \n", + "\n", + "2020-12-07 14:50:40+00:00 \t 17 MB\t gdelt_data/bq-results-0118.csv \n", + "\n", + "2020-12-07 14:50:24+00:00 \t 10 MB\t gdelt_data/bq-results-0119.csv \n", + "\n", + "2020-12-07 14:49:22+00:00 \t 41 MB\t gdelt_data/bq-results-0120.csv \n", + "\n", + "2020-12-07 14:49:21+00:00 \t 99 MB\t gdelt_data/bq-results-0121.csv \n", + "\n", + "2020-12-07 15:48:44+00:00 \t 129 MB\t gdelt_data/bq-results-0122.csv \n", + "\n", + "2020-12-07 14:56:08+00:00 \t 151 MB\t gdelt_data/bq-results-0123.csv \n", + "\n", + "2020-12-07 14:56:08+00:00 \t 185 MB\t gdelt_data/bq-results-0124.csv \n", + "\n", + "2020-12-07 14:56:09+00:00 \t 148 MB\t gdelt_data/bq-results-0125.csv \n", + "\n", + "2020-12-07 14:56:09+00:00 \t 167 MB\t gdelt_data/bq-results-0126.csv \n", + "\n", + "2020-12-07 14:56:09+00:00 \t 203 MB\t gdelt_data/bq-results-0127.csv \n", + "\n", + "2020-12-07 15:48:44+00:00 \t 255 MB\t gdelt_data/bq-results-0128.csv \n", + "\n", + "2020-12-07 15:13:06+00:00 \t 296 MB\t gdelt_data/bq-results-0129.csv \n", + "\n", + "2020-12-07 15:13:06+00:00 \t 316 MB\t gdelt_data/bq-results-0130.csv \n", + "\n", + "2020-12-08 17:29:14+00:00 \t 324 MB\t gdelt_data/bq-results-0131.csv \n", + "\n", + "2020-12-07 15:13:06+00:00 \t 210 MB\t gdelt_data/bq-results-0201.csv \n", + "\n", + "2020-12-07 15:13:06+00:00 \t 225 MB\t gdelt_data/bq-results-0202.csv \n", + "\n", + "2020-12-07 15:13:07+00:00 \t 292 MB\t gdelt_data/bq-results-0203.csv \n", + "\n", + "2020-12-07 15:13:07+00:00 \t 274 MB\t gdelt_data/bq-results-0204.csv \n", + "\n", + "2020-12-07 15:48:44+00:00 \t 272 MB\t gdelt_data/bq-results-0205.csv \n", + "\n", + "2020-12-07 16:09:58+00:00 \t 254 MB\t 
gdelt_data/bq-results-0206.csv \n", + "\n", + "2020-12-07 15:48:44+00:00 \t 247 MB\t gdelt_data/bq-results-0207.csv \n", + "\n", + "2020-12-07 16:09:58+00:00 \t 161 MB\t gdelt_data/bq-results-0208.csv \n", + "\n", + "2020-12-07 16:02:50+00:00 \t 155 MB\t gdelt_data/bq-results-0209.csv \n", + "\n", + "2020-12-07 16:02:50+00:00 \t 210 MB\t gdelt_data/bq-results-0210.csv \n", + "\n", + "2020-12-07 16:09:58+00:00 \t 201 MB\t gdelt_data/bq-results-0211.csv \n", + "\n", + "2020-12-07 16:09:58+00:00 \t 193 MB\t gdelt_data/bq-results-0212.csv \n", + "\n", + "2020-12-07 16:09:58+00:00 \t 217 MB\t gdelt_data/bq-results-0213.csv \n", + "\n", + "2020-12-07 16:09:59+00:00 \t 177 MB\t gdelt_data/bq-results-0214.csv \n", + "\n", + "2020-12-07 16:09:59+00:00 \t 128 MB\t gdelt_data/bq-results-0215.csv \n", + "\n", + "2020-12-07 16:28:43+00:00 \t 130 MB\t gdelt_data/bq-results-0216.csv \n", + "\n", + "2020-12-07 16:28:43+00:00 \t 186 MB\t gdelt_data/bq-results-0217.csv \n", + "\n", + "2020-12-07 16:28:42+00:00 \t 209 MB\t gdelt_data/bq-results-0218.csv \n", + "\n", + "2020-12-07 16:28:42+00:00 \t 191 MB\t gdelt_data/bq-results-0219.csv \n", + "\n", + "2020-12-07 16:28:42+00:00 \t 198 MB\t gdelt_data/bq-results-0220.csv \n", + "\n", + "2020-12-07 16:28:43+00:00 \t 257 MB\t gdelt_data/bq-results-0221.csv \n", + "\n", + "2020-12-07 16:28:43+00:00 \t 155 MB\t gdelt_data/bq-results-0222.csv \n", + "\n", + "2020-12-07 16:28:42+00:00 \t 173 MB\t gdelt_data/bq-results-0223.csv \n", + "\n", + "2020-12-07 16:28:43+00:00 \t 274 MB\t gdelt_data/bq-results-0224.csv \n", + "\n", + "2020-12-07 16:28:43+00:00 \t 302 MB\t gdelt_data/bq-results-0225.csv \n", + "\n", + "2020-12-07 16:53:43+00:00 \t 333 MB\t gdelt_data/bq-results-0226.csv \n", + "\n", + "2020-12-07 16:53:43+00:00 \t 371 MB\t gdelt_data/bq-results-0227.csv \n", + "\n", + "2020-12-07 16:53:44+00:00 \t 367 MB\t gdelt_data/bq-results-0228.csv \n", + "\n", + "2020-12-07 16:53:43+00:00 \t 276 MB\t gdelt_data/bq-results-0229.csv \n", + "\n", 
+ "2020-12-07 16:53:43+00:00 \t 255 MB\t gdelt_data/bq-results-0301.csv \n", + "\n", + "2020-12-07 16:53:43+00:00 \t 405 MB\t gdelt_data/bq-results-0302.csv \n", + "\n", + "2020-12-07 16:53:43+00:00 \t 379 MB\t gdelt_data/bq-results-0303.csv \n", + "\n", + "2020-12-07 17:46:22+00:00 \t 389 MB\t gdelt_data/bq-results-0304.csv \n", + "\n", + "2020-12-07 17:18:38+00:00 \t 422 MB\t gdelt_data/bq-results-0305.csv \n", + "\n", + "2020-12-07 17:46:22+00:00 \t 424 MB\t gdelt_data/bq-results-0306.csv \n", + "\n", + "2020-12-07 17:46:22+00:00 \t 317 MB\t gdelt_data/bq-results-0307.csv \n", + "\n", + "2020-12-07 17:46:22+00:00 \t 333 MB\t gdelt_data/bq-results-0308.csv \n", + "\n", + "2020-12-07 17:18:38+00:00 \t 539 MB\t gdelt_data/bq-results-0309.csv \n", + "\n", + "2020-12-07 17:18:39+00:00 \t 626 MB\t gdelt_data/bq-results-0310.csv \n", + "\n", + "2020-12-08 11:13:06+00:00 \t 674 MB\t gdelt_data/bq-results-0311.csv \n", + "\n", + "2020-12-08 11:13:06+00:00 \t 773 MB\t gdelt_data/bq-results-0312.csv \n", + "\n", + "2020-12-08 11:13:06+00:00 \t 807 MB\t gdelt_data/bq-results-0313.csv \n", + "\n", + "2020-12-08 11:13:06+00:00 \t 583 MB\t gdelt_data/bq-results-0314.csv \n", + "\n", + "2020-12-08 11:13:06+00:00 \t 587 MB\t gdelt_data/bq-results-0315.csv \n", + "\n", + "2020-12-08 12:07:45+00:00 \t 754 MB\t gdelt_data/bq-results-0316.csv \n", + "\n", + "2020-12-08 12:07:47+00:00 \t 852 MB\t gdelt_data/bq-results-0317.csv \n", + "\n", + "2020-12-08 12:07:46+00:00 \t 918 MB\t gdelt_data/bq-results-0318.csv \n", + "\n", + "2020-12-08 12:07:46+00:00 \t 925 MB\t gdelt_data/bq-results-0319.csv \n", + "\n", + "2020-12-08 12:07:47+00:00 \t 865 MB\t gdelt_data/bq-results-0320.csv \n", + "\n", + "2020-12-08 14:11:26+00:00 \t 682 MB\t gdelt_data/bq-results-0321.csv \n", + "\n", + "2020-12-08 14:11:26+00:00 \t 677 MB\t gdelt_data/bq-results-0322.csv \n", + "\n", + "2020-12-08 13:30:22+00:00 \t 863 MB\t gdelt_data/bq-results-0323.csv \n", + "\n", + "2020-12-08 13:30:22+00:00 \t 891 MB\t 
gdelt_data/bq-results-0324.csv \n", + "\n", + "2020-12-08 14:11:26+00:00 \t 901 MB\t gdelt_data/bq-results-0325.csv \n", + "\n", + "2020-12-08 14:11:26+00:00 \t 886 MB\t gdelt_data/bq-results-0326.csv \n", + "\n", + "2020-12-08 15:06:24+00:00 \t 870 MB\t gdelt_data/bq-results-0327.csv \n", + "\n", + "2020-12-08 15:06:24+00:00 \t 603 MB\t gdelt_data/bq-results-0328.csv \n", + "\n", + "2020-12-08 15:06:24+00:00 \t 625 MB\t gdelt_data/bq-results-0329.csv \n", + "\n", + "2020-12-08 15:45:47+00:00 \t 801 MB\t gdelt_data/bq-results-0330.csv \n", + "\n", + "2020-12-08 15:45:47+00:00 \t 834 MB\t gdelt_data/bq-results-0331.csv" + ] + } + ], + "source": [ + "#Get the names and sizes of these data files using Amazon boto3\n", + "import boto3\n", + "\n", + "s3 = boto3.resource('s3')\n", + "bucket = 'aws-emr-resources-787469208957-us-east-1'\n", + "bucket_resource = s3.Bucket(bucket)\n", + "file_sizes = []\n", + "file_names = []\n", + "\n", + "for obj in bucket_resource.objects.all():\n", + " if 'gdelt_data' in obj.key and 'bq-results' in obj.key:\n", + " print(obj.last_modified,\"\\t\", round(obj.size * 1e-6), \"MB\\t\",\n", + " obj.key, \"\\n\")\n", + " file_sizes.append(obj.size * 1e-6)\n", + " file_names.append(obj.key)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "f540339ef8ea4014a3baed4a699af3ea", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "#Then read these data files to spark dataframe\n", + "df_list = []\n", 
+ "for filename in file_names:\n", + " df = spark.read.csv('s3://aws-emr-resources-787469208957-us-east-1/' + filename, header=True)\n", + " df_list.append(df)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "ac79297f28a041d2937d5ed2cc8631bf", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "root\n", + " |-- DateTime: string (nullable = true)\n", + " |-- URL: string (nullable = true)\n", + " |-- Title: string (nullable = true)\n", + " |-- SharingImage: string (nullable = true)\n", + " |-- LangCode: string (nullable = true)\n", + " |-- DocTone: string (nullable = true)\n", + " |-- DomainCountryCode: string (nullable = true)\n", + " |-- Location: string (nullable = true)\n", + " |-- Lat: string (nullable = true)\n", + " |-- Lon: string (nullable = true)\n", + " |-- CountryCode: string (nullable = true)\n", + " |-- Adm1Code: string (nullable = true)\n", + " |-- Adm2Code: string (nullable = true)\n", + " |-- GeoType: string (nullable = true)\n", + " |-- ContextualText: string (nullable = true)\n", + " |-- GeoCoord: string (nullable = true)" + ] + } + ], + "source": [ + "#Show the structure of these data files\n", + "df_list[0].printSchema()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 2. How the volume of news about covid-19 has changed?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "cb1d67f1801d4373a9d5a6cbcbe081ae", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collecting pandas\n", + " Downloading https://files.pythonhosted.org/packages/fd/70/e8eee0cbddf926bf51958c7d6a86bc69167c300fa2ba8e592330a2377d1b/pandas-1.1.5-cp37-cp37m-manylinux1_x86_64.whl (9.5MB)\n", + "Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib64/python3.7/site-packages (from pandas)\n", + "Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/site-packages (from pandas)\n", + "Requirement already satisfied: python-dateutil>=2.7.3 in /mnt/tmp/1607702376424-0/lib/python3.7/site-packages (from pandas)\n", + "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas)\n", + "Installing collected packages: pandas\n", + "Successfully installed pandas-1.1.5" + ] + } + ], + "source": [ + "#If looking at the size of these data files\n", + "#First import pandas\n", + "sc.install_pypi_package(\"pandas\")" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "43726ed0dafb40259979cb7526021f4d", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, 
+ { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collecting matplotlib\n", + " Downloading https://files.pythonhosted.org/packages/30/f2/10c822cb0ca5ebec58bd1892187bc3e3db64a867ac26531c6204663fc218/matplotlib-3.3.3-cp37-cp37m-manylinux1_x86_64.whl (11.6MB)\n", + "Requirement already satisfied: numpy>=1.15 in /usr/local/lib64/python3.7/site-packages (from matplotlib)\n", + "Requirement already satisfied: python-dateutil>=2.1 in /mnt/tmp/1607702376424-0/lib/python3.7/site-packages (from matplotlib)\n", + "Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 (from matplotlib)\n", + " Downloading https://files.pythonhosted.org/packages/8a/bb/488841f56197b13700afd5658fc279a2025a39e22449b7cf29864669b15d/pyparsing-2.4.7-py2.py3-none-any.whl (67kB)\n", + "Collecting pillow>=6.2.0 (from matplotlib)\n", + " Downloading https://files.pythonhosted.org/packages/af/fa/c1302a26d5e1a17fa8e10e43417b6cf038b0648c4b79fcf2302a4a0c5d30/Pillow-8.0.1-cp37-cp37m-manylinux1_x86_64.whl (2.2MB)\n", + "Collecting cycler>=0.10 (from matplotlib)\n", + " Downloading https://files.pythonhosted.org/packages/f7/d2/e07d3ebb2bd7af696440ce7e754c59dd546ffe1bbe732c8ab68b9c834e61/cycler-0.10.0-py2.py3-none-any.whl\n", + "Collecting kiwisolver>=1.0.1 (from matplotlib)\n", + " Downloading https://files.pythonhosted.org/packages/d2/46/231de802ade4225b76b96cffe419cf3ce52bbe92e3b092cf12db7d11c207/kiwisolver-1.3.1-cp37-cp37m-manylinux1_x86_64.whl (1.1MB)\n", + "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil>=2.1->matplotlib)\n", + "Installing collected packages: pyparsing, pillow, cycler, kiwisolver, matplotlib\n", + 
"Successfully installed cycler-0.10.0 kiwisolver-1.3.1 matplotlib-3.3.3 pillow-8.0.1 pyparsing-2.4.7" + ] + } + ], + "source": [ + "sc.install_pypi_package(\"matplotlib\")" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "5ff56bb60751498fa9ed5f3d91402460", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import pandas as pd\n", + "from datetime import datetime\n", + "import matplotlib.pyplot as plt\n", + "\n", + "date_range = pd.date_range(start=\"2020-01-07\",end=\"2020-03-31\")\n", + "plt.plot(date_range, file_sizes)\n", + "plt.title(\"The volume of covid-19 data from 01/07 to 03/31 using file sizes (in MBs)\")\n", + "plt.locator_params(axis='x', nbins=20)\n", + "plt.xticks(rotation='45', size='7.5')\n", + "plt.show()\n", + "%matplot plt" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "d94b068c73bc41c3a02289be198bbacd", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', 
description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "#If looking at the number of news within each datafile\n", + "num_news = []\n", + "for file in df_list:\n", + " num_news.append(file.count())\n", + "\n", + "plt.close()\n", + "plt.plot(date_range, num_news)\n", + "plt.title(\"The volume of covid-19 data from 01/07 to 03/31 using no. of news\")\n", + "plt.locator_params(axis='x', nbins=20)\n", + "plt.xticks(rotation='45', size='7.5')\n", + "plt.show()\n", + "%matplot plt" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 3. Which countries report covid-19 news most often?" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "9f190918961543ada1f6fa5fb7d3c648", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "#Get the count of countries within each datafile\n", + "country_df_list = []\n", + "no_countries = []\n", + "for file in df_list:\n", + " country_df = file.groupby('CountryCode').count()\n", + " country_df_list.append(country_df)\n", + " no_countries.append(country_df.count())" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": 
"12630d0dc71e4bcab13e1c3fd6bb15f5", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "#Plot the number of countries\n", + "plt.close()\n", + "plt.plot(date_range, no_countries)\n", + "plt.title(\"The no. of countries reporting covid-19 news from 01/07 to 03/31\")\n", + "plt.locator_params(axis='x', nbins=20)\n", + "plt.xticks(rotation='45', size='7.5')\n", + "plt.show()\n", + "%matplot plt" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "11d8a21702f94ba29384b6802626a032", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "#Now select some countries of interest to see their number of reporting news\n", + "#Here we select CH(China), US, UK, JP(Japan), BR(Brazil), and IN(India)\n", + "def get_num_news_for_country(country):\n", + " country_newsNo = []\n", + " for file in country_df_list:\n", + " try:\n", + " no_news = 
file.filter(file.CountryCode==country).collect()[0]['count']\n", + " country_newsNo.append(no_news)\n", + " except IndexError:\n", + " country_newsNo.append(0)\n", + " return country_newsNo" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "cf0325af747345538625cbb85a3a6efd", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "CH_newsNo = get_num_news_for_country('CH')\n", + "US_newsNo = get_num_news_for_country('US')\n", + "UK_newsNo = get_num_news_for_country('UK')\n", + "BR_newsNo = get_num_news_for_country('BR')\n", + "IN_newsNo = get_num_news_for_country('IN')" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "c708bdba0cc74814b53ed45e442a2b83", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "#Plot the number of news for each 
country\n", + "plt.close()\n", + "plt.plot(date_range, CH_newsNo, label='CH')\n", + "plt.plot(date_range, US_newsNo, label='US')\n", + "plt.plot(date_range, UK_newsNo, label='UK')\n", + "plt.plot(date_range, BR_newsNo, label='BR')\n", + "plt.plot(date_range, IN_newsNo, label='IN')\n", + "plt.legend()\n", + "plt.title(\"The no. of covid-19 news by country from 01/07 to 03/31\")\n", + "plt.locator_params(axis='x', nbins=20)\n", + "plt.xticks(rotation='45', size='7.5')\n", + "plt.show()\n", + "%matplot plt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "PySpark", + "language": "", + "name": "pysparkkernel" + }, + "language_info": { + "codemirror_mode": { + "name": "python", + "version": 2 + }, + "mimetype": "text/x-python", + "name": "pyspark", + "pygments_lexer": "python2" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/final_project2.ipynb b/final_project2.ipynb new file mode 100644 index 0000000..8f5df4a --- /dev/null +++ b/final_project2.ipynb @@ -0,0 +1,1405 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Final Project Notebook 2\n", + "### Heather Chen\n", + "#### 4. Topic modeling using PySpark" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "45c4cc7a7d04425c8c9f57dba506b93e", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Starting Spark application\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session?
0 | application_1607674079580_0001 | pyspark | idle | Link | Link |
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "SparkSession available as 'spark'.\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "" + ] + } + ], + "source": [ + "spark" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "0e7aa1f78b6d436a894526191f069911", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collecting boto3\n", + " Downloading https://files.pythonhosted.org/packages/2f/f5/aeb4d65266f7712a627674bd19994cee3e1c66ff588adbc4db3fc0bbbf97/boto3-1.16.34-py2.py3-none-any.whl (129kB)\n", + "Collecting s3transfer<0.4.0,>=0.3.0 (from boto3)\n", + " Downloading 
https://files.pythonhosted.org/packages/69/79/e6afb3d8b0b4e96cefbdc690f741d7dd24547ff1f94240c997a26fa908d3/s3transfer-0.3.3-py2.py3-none-any.whl (69kB)\n", + "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.7/site-packages (from boto3)\n", + "Collecting botocore<1.20.0,>=1.19.34 (from boto3)\n", + " Downloading https://files.pythonhosted.org/packages/03/ef/e35e41d6e445f472ac8f4fca8dd22726d8c6dc19ab06317164a222d13599/botocore-1.19.34-py2.py3-none-any.whl (7.0MB)\n", + "Collecting python-dateutil<3.0.0,>=2.1 (from botocore<1.20.0,>=1.19.34->boto3)\n", + " Downloading https://files.pythonhosted.org/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl (227kB)\n", + "Collecting urllib3<1.27,>=1.25.4; python_version != \"3.4\" (from botocore<1.20.0,>=1.19.34->boto3)\n", + " Downloading https://files.pythonhosted.org/packages/f5/71/45d36a8df68f3ebb098d6861b2c017f3d094538c0fb98fa61d4dc43e69b9/urllib3-1.26.2-py2.py3-none-any.whl (136kB)\n", + "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.20.0,>=1.19.34->boto3)\n", + "Installing collected packages: python-dateutil, urllib3, botocore, s3transfer, boto3\n", + "Successfully installed boto3-1.16.34 botocore-1.19.34 python-dateutil-2.8.1 s3transfer-0.3.3 urllib3-1.26.2" + ] + } + ], + "source": [ + "sc.install_pypi_package(\"boto3\")" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "e9a56b5777794dbdbf8e0a596cb9b5fe", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, 
bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2020-12-07 14:46:28+00:00 \t 4 MB\t gdelt_data/bq-results-0107.csv \n", + "\n", + "2020-12-07 14:45:53+00:00 \t 4 MB\t gdelt_data/bq-results-0108.csv \n", + "\n", + "2020-12-07 14:47:16+00:00 \t 9 MB\t gdelt_data/bq-results-0109.csv \n", + "\n", + "2020-12-07 14:46:30+00:00 \t 6 MB\t gdelt_data/bq-results-0110.csv \n", + "\n", + "2020-12-07 14:46:04+00:00 \t 6 MB\t gdelt_data/bq-results-0111.csv \n", + "\n", + "2020-12-07 14:49:53+00:00 \t 2 MB\t gdelt_data/bq-results-0112.csv \n", + "\n", + "2020-12-07 14:50:25+00:00 \t 7 MB\t gdelt_data/bq-results-0113.csv \n", + "\n", + "2020-12-07 14:50:13+00:00 \t 6 MB\t gdelt_data/bq-results-0114.csv \n", + "\n", + "2020-12-07 14:50:22+00:00 \t 6 MB\t gdelt_data/bq-results-0115.csv \n", + "\n", + "2020-12-07 14:51:03+00:00 \t 10 MB\t gdelt_data/bq-results-0116.csv \n", + "\n", + "2020-12-07 14:49:21+00:00 \t 20 MB\t gdelt_data/bq-results-0117.csv \n", + "\n", + "2020-12-07 14:50:40+00:00 \t 17 MB\t gdelt_data/bq-results-0118.csv \n", + "\n", + "2020-12-07 14:50:24+00:00 \t 10 MB\t gdelt_data/bq-results-0119.csv \n", + "\n", + "2020-12-07 14:49:22+00:00 \t 41 MB\t gdelt_data/bq-results-0120.csv \n", + "\n", + "2020-12-07 14:49:21+00:00 \t 99 MB\t gdelt_data/bq-results-0121.csv \n", + "\n", + "2020-12-07 15:48:44+00:00 \t 129 MB\t gdelt_data/bq-results-0122.csv \n", + "\n", + "2020-12-07 14:56:08+00:00 \t 151 MB\t gdelt_data/bq-results-0123.csv \n", + "\n", + "2020-12-07 14:56:08+00:00 \t 185 MB\t gdelt_data/bq-results-0124.csv \n", + "\n", + "2020-12-07 14:56:09+00:00 \t 148 MB\t gdelt_data/bq-results-0125.csv \n", + "\n", + "2020-12-07 14:56:09+00:00 \t 167 MB\t gdelt_data/bq-results-0126.csv \n", + "\n", + "2020-12-07 14:56:09+00:00 \t 203 MB\t gdelt_data/bq-results-0127.csv \n", + "\n", + "2020-12-07 
15:48:44+00:00 \t 255 MB\t gdelt_data/bq-results-0128.csv \n", + "\n", + "2020-12-07 15:13:06+00:00 \t 296 MB\t gdelt_data/bq-results-0129.csv \n", + "\n", + "2020-12-07 15:13:06+00:00 \t 316 MB\t gdelt_data/bq-results-0130.csv \n", + "\n", + "2020-12-08 17:29:14+00:00 \t 324 MB\t gdelt_data/bq-results-0131.csv \n", + "\n", + "2020-12-07 15:13:06+00:00 \t 210 MB\t gdelt_data/bq-results-0201.csv \n", + "\n", + "2020-12-07 15:13:06+00:00 \t 225 MB\t gdelt_data/bq-results-0202.csv \n", + "\n", + "2020-12-07 15:13:07+00:00 \t 292 MB\t gdelt_data/bq-results-0203.csv \n", + "\n", + "2020-12-07 15:13:07+00:00 \t 274 MB\t gdelt_data/bq-results-0204.csv \n", + "\n", + "2020-12-07 15:48:44+00:00 \t 272 MB\t gdelt_data/bq-results-0205.csv \n", + "\n", + "2020-12-07 16:09:58+00:00 \t 254 MB\t gdelt_data/bq-results-0206.csv \n", + "\n", + "2020-12-07 15:48:44+00:00 \t 247 MB\t gdelt_data/bq-results-0207.csv \n", + "\n", + "2020-12-07 16:09:58+00:00 \t 161 MB\t gdelt_data/bq-results-0208.csv \n", + "\n", + "2020-12-07 16:02:50+00:00 \t 155 MB\t gdelt_data/bq-results-0209.csv \n", + "\n", + "2020-12-07 16:02:50+00:00 \t 210 MB\t gdelt_data/bq-results-0210.csv \n", + "\n", + "2020-12-07 16:09:58+00:00 \t 201 MB\t gdelt_data/bq-results-0211.csv \n", + "\n", + "2020-12-07 16:09:58+00:00 \t 193 MB\t gdelt_data/bq-results-0212.csv \n", + "\n", + "2020-12-07 16:09:58+00:00 \t 217 MB\t gdelt_data/bq-results-0213.csv \n", + "\n", + "2020-12-07 16:09:59+00:00 \t 177 MB\t gdelt_data/bq-results-0214.csv \n", + "\n", + "2020-12-07 16:09:59+00:00 \t 128 MB\t gdelt_data/bq-results-0215.csv \n", + "\n", + "2020-12-07 16:28:43+00:00 \t 130 MB\t gdelt_data/bq-results-0216.csv \n", + "\n", + "2020-12-07 16:28:43+00:00 \t 186 MB\t gdelt_data/bq-results-0217.csv \n", + "\n", + "2020-12-07 16:28:42+00:00 \t 209 MB\t gdelt_data/bq-results-0218.csv \n", + "\n", + "2020-12-07 16:28:42+00:00 \t 191 MB\t gdelt_data/bq-results-0219.csv \n", + "\n", + "2020-12-07 16:28:42+00:00 \t 198 MB\t 
gdelt_data/bq-results-0220.csv \n", + "\n", + "2020-12-07 16:28:43+00:00 \t 257 MB\t gdelt_data/bq-results-0221.csv \n", + "\n", + "2020-12-07 16:28:43+00:00 \t 155 MB\t gdelt_data/bq-results-0222.csv \n", + "\n", + "2020-12-07 16:28:42+00:00 \t 173 MB\t gdelt_data/bq-results-0223.csv \n", + "\n", + "2020-12-07 16:28:43+00:00 \t 274 MB\t gdelt_data/bq-results-0224.csv \n", + "\n", + "2020-12-07 16:28:43+00:00 \t 302 MB\t gdelt_data/bq-results-0225.csv \n", + "\n", + "2020-12-07 16:53:43+00:00 \t 333 MB\t gdelt_data/bq-results-0226.csv \n", + "\n", + "2020-12-07 16:53:43+00:00 \t 371 MB\t gdelt_data/bq-results-0227.csv \n", + "\n", + "2020-12-07 16:53:44+00:00 \t 367 MB\t gdelt_data/bq-results-0228.csv \n", + "\n", + "2020-12-07 16:53:43+00:00 \t 276 MB\t gdelt_data/bq-results-0229.csv \n", + "\n", + "2020-12-07 16:53:43+00:00 \t 255 MB\t gdelt_data/bq-results-0301.csv \n", + "\n", + "2020-12-07 16:53:43+00:00 \t 405 MB\t gdelt_data/bq-results-0302.csv \n", + "\n", + "2020-12-07 16:53:43+00:00 \t 379 MB\t gdelt_data/bq-results-0303.csv \n", + "\n", + "2020-12-07 17:46:22+00:00 \t 389 MB\t gdelt_data/bq-results-0304.csv \n", + "\n", + "2020-12-07 17:18:38+00:00 \t 422 MB\t gdelt_data/bq-results-0305.csv \n", + "\n", + "2020-12-07 17:46:22+00:00 \t 424 MB\t gdelt_data/bq-results-0306.csv \n", + "\n", + "2020-12-07 17:46:22+00:00 \t 317 MB\t gdelt_data/bq-results-0307.csv \n", + "\n", + "2020-12-07 17:46:22+00:00 \t 333 MB\t gdelt_data/bq-results-0308.csv \n", + "\n", + "2020-12-07 17:18:38+00:00 \t 539 MB\t gdelt_data/bq-results-0309.csv \n", + "\n", + "2020-12-07 17:18:39+00:00 \t 626 MB\t gdelt_data/bq-results-0310.csv \n", + "\n", + "2020-12-08 11:13:06+00:00 \t 674 MB\t gdelt_data/bq-results-0311.csv \n", + "\n", + "2020-12-08 11:13:06+00:00 \t 773 MB\t gdelt_data/bq-results-0312.csv \n", + "\n", + "2020-12-08 11:13:06+00:00 \t 807 MB\t gdelt_data/bq-results-0313.csv \n", + "\n", + "2020-12-08 11:13:06+00:00 \t 583 MB\t gdelt_data/bq-results-0314.csv \n", + "\n", 
+ "2020-12-08 11:13:06+00:00 \t 587 MB\t gdelt_data/bq-results-0315.csv \n", + "\n", + "2020-12-08 12:07:45+00:00 \t 754 MB\t gdelt_data/bq-results-0316.csv \n", + "\n", + "2020-12-08 12:07:47+00:00 \t 852 MB\t gdelt_data/bq-results-0317.csv \n", + "\n", + "2020-12-08 12:07:46+00:00 \t 918 MB\t gdelt_data/bq-results-0318.csv \n", + "\n", + "2020-12-08 12:07:46+00:00 \t 925 MB\t gdelt_data/bq-results-0319.csv \n", + "\n", + "2020-12-08 12:07:47+00:00 \t 865 MB\t gdelt_data/bq-results-0320.csv \n", + "\n", + "2020-12-08 14:11:26+00:00 \t 682 MB\t gdelt_data/bq-results-0321.csv \n", + "\n", + "2020-12-08 14:11:26+00:00 \t 677 MB\t gdelt_data/bq-results-0322.csv \n", + "\n", + "2020-12-08 13:30:22+00:00 \t 863 MB\t gdelt_data/bq-results-0323.csv \n", + "\n", + "2020-12-08 13:30:22+00:00 \t 891 MB\t gdelt_data/bq-results-0324.csv \n", + "\n", + "2020-12-08 14:11:26+00:00 \t 901 MB\t gdelt_data/bq-results-0325.csv \n", + "\n", + "2020-12-08 14:11:26+00:00 \t 886 MB\t gdelt_data/bq-results-0326.csv \n", + "\n", + "2020-12-08 15:06:24+00:00 \t 870 MB\t gdelt_data/bq-results-0327.csv \n", + "\n", + "2020-12-08 15:06:24+00:00 \t 603 MB\t gdelt_data/bq-results-0328.csv \n", + "\n", + "2020-12-08 15:06:24+00:00 \t 625 MB\t gdelt_data/bq-results-0329.csv \n", + "\n", + "2020-12-08 15:45:47+00:00 \t 801 MB\t gdelt_data/bq-results-0330.csv \n", + "\n", + "2020-12-08 15:45:47+00:00 \t 834 MB\t gdelt_data/bq-results-0331.csv" + ] + } + ], + "source": [ + "#Get the names and sizes of these data files using Amazon boto3\n", + "import boto3\n", + "\n", + "s3 = boto3.resource('s3')\n", + "bucket = 'aws-emr-resources-787469208957-us-east-1'\n", + "bucket_resource = s3.Bucket(bucket)\n", + "file_sizes = []\n", + "file_names = []\n", + "\n", + "for obj in bucket_resource.objects.all():\n", + " if 'gdelt_data' in obj.key and 'bq-results' in obj.key:\n", + " print(obj.last_modified,\"\\t\", round(obj.size * 1e-6), \"MB\\t\",\n", + " obj.key, \"\\n\")\n", + " file_sizes.append(obj.size * 
1e-6)\n", + " file_names.append(obj.key)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "4208d15079884c238ac1a4dc926fd172", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "#Then read these data files into Spark dataframes\n", + "df_list = []\n", + "for filename in file_names:\n", + " df = spark.read.csv('s3://aws-emr-resources-787469208957-us-east-1/' + filename, header=True)\n", + " df_list.append(df)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "e9eb6e332aa1478e8620d0655dea42c0", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "#First divide these dataframes into January, February, and March\n", + "jan_dfs = df_list[0:25]\n", + "feb_dfs = df_list[25:54]\n", + "march_dfs = df_list[54:85]" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": 
"9034f7aa841846d7b3cf4f9c12fc0697", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "#Given the large volume of our data, we only select China and the U.S to see how their focused topics change over the\n", + "#three months\n", + "def collect_country_data(df_list, country):\n", + " indx = 0\n", + " for df in df_list:\n", + " df = df.filter(df.CountryCode==country)\n", + " if indx == 0:\n", + " ori_df = df\n", + " indx = indx + 1\n", + " continue\n", + " else:\n", + " curr_df = df\n", + " ori_df = ori_df.union(curr_df)\n", + " indx = indx + 1\n", + " return ori_df" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "a88520353ba54b478d00586c866fd6cb", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "#Collect data from China and from US\n", + "CH_jan_df = collect_country_data(jan_dfs, 'CH')\n", + "CH_feb_df = collect_country_data(feb_dfs, 'CH')\n", + "CH_march_df = collect_country_data(march_dfs, 'CH')\n", + "\n", + "US_jan_df = collect_country_data(jan_dfs, 'US')\n", + "US_feb_df = 
collect_country_data(feb_dfs, 'US')\n", + "US_march_df = collect_country_data(march_dfs, 'US')" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "d8124aceffc944f387e8842b27320d52", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "#Do basic data cleaning for these dataframes\n", + "#(Since our text data is already lowercased, we only apply basic regular-expression cleaning)\n", + "from pyspark.sql.functions import regexp_replace\n", + "from pyspark.ml.feature import Tokenizer\n", + "\n", + "def reg_clean_df(df):\n", + " # Remove punctuation and digits\n", + " wrangled = df.withColumn('text', regexp_replace(df.ContextualText, '[_():;,.!?\\\\-]', ' '))\n", + " wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, '[0-9]', ' '))\n", + "\n", + " # Merge multiple spaces and drop null values here\n", + " wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, ' +', ' '))\n", + " wrangled = wrangled.na.drop()\n", + " \n", + " return wrangled\n", + "\n", + "CH_jan_cleaned_df = reg_clean_df(CH_jan_df)\n", + "CH_feb_cleaned_df = reg_clean_df(CH_feb_df)\n", + "CH_mar_cleaned_df = reg_clean_df(CH_march_df)\n", + "\n", + "US_jan_cleaned_df = reg_clean_df(US_jan_df)\n", + "US_feb_cleaned_df = reg_clean_df(US_feb_df)\n", + "US_mar_cleaned_df = reg_clean_df(US_march_df)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + 
"application/vnd.jupyter.widget-view+json": { + "model_id": "eb8bb802c56142e6843758effb42f61e", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "#Define a function that tokenizes, removes stop words, and computes TF-IDF features before topic modeling\n", + "from pyspark.ml.feature import Tokenizer, StopWordsRemover, IDF, CountVectorizer\n", + "\n", + "def pre_analysis(df):\n", + " tokenized_df = Tokenizer(inputCol='text', outputCol='words').transform(df)\n", + " tokenized_df = tokenized_df.na.drop()\n", + " \n", + " # Remove stop words.\n", + " removeStopwords_df = StopWordsRemover(inputCol='words', outputCol='terms').transform(tokenized_df)\n", + "\n", + " tf_model = CountVectorizer(inputCol=\"terms\", outputCol=\"features\", vocabSize=40000, minDF=5).fit(removeStopwords_df)\n", + " vectorizer = tf_model.transform(removeStopwords_df)\n", + " \n", + " idfizer = IDF(inputCol='features', outputCol='tf_idf_features')\n", + " idf_model = idfizer.fit(vectorizer)\n", + " tfidf_result = idf_model.transform(vectorizer)\n", + " \n", + " #Return a tuple of the TF-IDF results and the fitted term-frequency (CountVectorizer) model\n", + " return tfidf_result, tf_model" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "c883e7c4154d40878ad1ab3ad9d0912a", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + 
"application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 
+ }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "#Do pre-analysis country by country, month by month\n", + "CH_jan_result_df, CH_jan_tf_model = pre_analysis(CH_jan_cleaned_df)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "c34d8657caeb442680a90a974b584a92", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+--------------------+\n", + "| tf_idf_features|\n", + "+--------------------+\n", + "|(26826,[0,1,3,19,...|\n", + "|(26826,[0,1,4,7,8...|\n", + "|(26826,[0,1,4,7,8...|\n", + "|(26826,[0,1,4,7,8...|\n", + "|(26826,[0,1,2,4,8...|\n", + "+--------------------+\n", + "only showing top 5 rows" + ] + } + ], + "source": [ + "#Show what these tf_idf features look like\n", + "CH_jan_result_df.select('tf_idf_features').show(5)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "15bbdac9111744bfbde14318d0e9999c", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + 
"text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "CH_feb_result_df, CH_feb_tf_model = pre_analysis(CH_feb_cleaned_df)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "d4a6f99fda844dd299963f898cdfd45c", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "CH_mar_result_df, CH_mar_tf_model = pre_analysis(CH_mar_cleaned_df)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "d0c17c108a034946a54e6e9fa161dd84", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "US_jan_result_df, US_jan_tf_model = pre_analysis(US_jan_cleaned_df)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": 
"a593af36f3894879b5f9aa9aae17a58e", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "US_feb_result_df, US_feb_tf_model = pre_analysis(US_feb_cleaned_df)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "74df5c49449843279c38b61c103f1586", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "US_mar_result_df, US_mar_tf_model = pre_analysis(US_mar_cleaned_df)" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "c0a4f999fd294d93927b564541021e07", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": 
{}, + "output_type": "display_data" + } + ], + "source": [ + "from pyspark.ml.feature import IDF\n", + "from pyspark.ml.clustering import LDA\n", + "import pyspark.sql.functions as F\n", + "import pyspark.sql.types as T\n", + "\n", + "def get_words(token_list):\n", + " return [vocab[token_id] for token_id in token_list]\n", + "\n", + "#Set the number of topics=5\n", + "num_topics = 5\n", + "max_iter = 10\n", + "num_top_words = 10" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "28883bae0f2f46ff90dc0e4b0bd4550e", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+-----+--------------------------------------------------------------------------------------------+\n", + "|topic| topicWords|\n", + "+-----+--------------------------------------------------------------------------------------------+\n", + "| 0| [flight, citizens, said, hubei, kingdom, province, travel, health, british, government]|\n", + "| 1| [cities, million, travel, flights, measures, government, hubei, people, city, beijing]|\n", + "| 2| [cases, human, new, confirmed, market, reported, people, year, sars, chinese]|\n", + "| 3| [case, confirmed, man, cases, health, said, woman, united, people, hospital]|\n", + "| 4|[hospital, airport, staff, coronavirus, international, health, new, medical, patients, sars]|\n", + 
"+-----+--------------------------------------------------------------------------------------------+" + ] + } + ], + "source": [ + "#Do topic modeling and get the topic words \n", + "#China January\n", + "lda = LDA(k=num_topics, maxIter=max_iter, featuresCol='tf_idf_features')\n", + "lda_model = lda.fit(CH_jan_result_df)\n", + " \n", + "vocab = CH_jan_tf_model.vocabulary\n", + "udf_to_words = F.udf(get_words, T.ArrayType(T.StringType()))\n", + " \n", + "CH_jan_topics = lda_model.describeTopics(num_top_words).withColumn('topicWords', udf_to_words(F.col('termIndices')))\n", + "CH_jan_topics.select('topic', 'topicWords').show(truncate=100)" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "fc0b7d2063484f9592f2cc4d1f8fb13f", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+-----+------------------------------------------------------------------------------------+\n", + "|topic| topicWords|\n", + "+-----+------------------------------------------------------------------------------------+\n", + "| 0| [new, said, beijing, people, outbreak, health, u, ship, china, coronavirus]|\n", + "| 1|[li, hospital, party, media, patients, january, wuhan, medical, officials, wenliang]|\n", + "| 2| [people, said, cities, city, cases, government, chinese, million, one, flight]|\n", + "| 3| [cases, two, united, u, quarantine, health, citizens, south, said, confirmed]|\n", + "| 4| [cases, 
deaths, new, hubei, confirmed, province, reported, number, mainland, novel]|\n", + "+-----+------------------------------------------------------------------------------------+" + ] + } + ], + "source": [ + "#China February\n", + "lda = LDA(k=num_topics, maxIter=max_iter, featuresCol='tf_idf_features')\n", + "lda_model = lda.fit(CH_feb_result_df)\n", + " \n", + "vocab = CH_feb_tf_model.vocabulary\n", + "udf_to_words = F.udf(get_words, T.ArrayType(T.StringType()))\n", + " \n", + "CH_feb_topics = lda_model.describeTopics(num_top_words).withColumn('topicWords', udf_to_words(F.col('termIndices')))\n", + "CH_feb_topics.select('topic', 'topicWords').show(truncate=100)" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "a480cd582a8244daaa8b64ec1aab15fd", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+-----+----------------------------------------------------------------------------------------+\n", + "|topic| topicWords|\n", + "+-----+----------------------------------------------------------------------------------------+\n", + "| 0| [people, said, city, virus, cases, beijing, chinese, province, new, hotel]|\n", + "| 1| [city, hubei, people, province, said, restrictions, outbreak, new, lockdown, beijing]|\n", + "| 2|[coronavirus, people, us, masks, hospital, said, quarantine, symptoms, beijing, wearing]|\n", + "| 3| [cases, president, chinese, new, xi, virus, trump, first, 
world, people]|\n", + "| 4| [cases, new, reported, read, hubei, imported, health, deaths, infections, confirmed]|\n", + "+-----+----------------------------------------------------------------------------------------+" + ] + } + ], + "source": [ + "#China March\n", + "lda = LDA(k=num_topics, maxIter=max_iter, featuresCol='tf_idf_features')\n", + "lda_model = lda.fit(CH_mar_result_df)\n", + " \n", + "vocab = CH_mar_tf_model.vocabulary\n", + "udf_to_words = F.udf(get_words, T.ArrayType(T.StringType()))\n", + " \n", + "CH_mar_topics = lda_model.describeTopics(num_top_words).withColumn('topicWords', udf_to_words(F.col('termIndices')))\n", + "CH_mar_topics.select('topic', 'topicWords').show(truncate=100)" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "a1f935fac6be489ab47124528d9e02a4", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+-----+---------------------------------------------------------------------------------------------+\n", + "|topic| topicWords|\n", + "+-----+---------------------------------------------------------------------------------------------+\n", + "| 0| [plane, base, flight, anchorage, alaska, passengers, nearby, citizens, wednesday, reserve]|\n", + "| 1| [flights, china, cases, airlines, coronavirus, confirmed, white, house, next, trump]|\n", + "| 2| [case, states, confirmed, reported, first, u, cases, woman, health, chicago]|\n", + "| 
3|[international, airport, airports, passengers, screening, san, francisco, angeles, los, year]|\n", + "| 4| [county, university, patient, orange, base, department, student, health, local, cdc]|\n", + "+-----+---------------------------------------------------------------------------------------------+" + ] + } + ], + "source": [ + "#U.S. January\n", + "lda = LDA(k=num_topics, maxIter=max_iter, featuresCol='tf_idf_features')\n", + "lda_model = lda.fit(US_jan_result_df)\n", + " \n", + "vocab = US_jan_tf_model.vocabulary\n", + "udf_to_words = F.udf(get_words, T.ArrayType(T.StringType()))\n", + " \n", + "US_jan_topics = lda_model.describeTopics(num_top_words).withColumn('topicWords', udf_to_words(F.col('termIndices')))\n", + "US_jan_topics.select('topic', 'topicWords').show(truncate=100)" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "a40f6c01a0d34e8e8ec59eccb445037a", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+-----+-------------------------------------------------------------------------------------------+\n", + "|topic| topicWords|\n", + "+-----+-------------------------------------------------------------------------------------------+\n", + "| 0| [house, white, trump, billion, county, president, outbreak, funding, health, new]|\n", + "| 1| [base, air, passengers, force, quarantine, san, travis, california, ship, evacuees]|\n", + "| 2| [county, trump, 
patient, health, officials, case, medical, cdc, school, sacramento]|\n", + "| 3| [county, cases, santa, woman, clara, health, officials, state, case, confirmed]|\n", + "| 4|[boston, china, washington, president, trump, massachusetts, coronavirus, said, also, city]|\n", + "+-----+-------------------------------------------------------------------------------------------+" + ] + } + ], + "source": [ + "#U.S. February\n", + "lda = LDA(k=num_topics, maxIter=max_iter, featuresCol='tf_idf_features')\n", + "lda_model = lda.fit(US_feb_result_df)\n", + " \n", + "vocab = US_feb_tf_model.vocabulary\n", + "udf_to_words = F.udf(get_words, T.ArrayType(T.StringType()))\n", + " \n", + "US_feb_topics = lda_model.describeTopics(num_top_words).withColumn('topicWords', udf_to_words(F.col('termIndices')))\n", + "US_feb_topics.select('topic', 'topicWords').show(truncate=100)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "6385fdf491c2491faaa2906a64531fae", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+-----+----------------------------------------------------------------------------+\n", + "|topic| topicWords|\n", + "+-----+----------------------------------------------------------------------------+\n", + "| 0| [county, cases, confirmed, march, positive, case, covid, health, state, m]|\n", + "| 1|[trump, house, white, president, county, package, u, economic, donald, said]|\n", + 
"| 2| [county, people, health, houston, city, order, chicago, said, cases, covid]|\n", + "| 3| [county, cases, ship, new, san, state, washington, cruise, virus, said]|\n", + "| 4| [new, school, seattle, york, said, city, home, march, county, schools]|\n", + "+-----+----------------------------------------------------------------------------+" + ] + } + ], + "source": [ + "#U.S. March\n", + "lda = LDA(k=num_topics, maxIter=max_iter, featuresCol='tf_idf_features')\n", + "lda_model = lda.fit(US_mar_result_df)\n", + " \n", + "vocab = US_mar_tf_model.vocabulary\n", + "udf_to_words = F.udf(get_words, T.ArrayType(T.StringType()))\n", + " \n", + "US_mar_topics = lda_model.describeTopics(num_top_words).withColumn('topicWords', udf_to_words(F.col('termIndices')))\n", + "US_mar_topics.select('topic', 'topicWords').show(truncate=100)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "PySpark", + "language": "", + "name": "pysparkkernel" + }, + "language_info": { + "codemirror_mode": { + "name": "python", + "version": 2 + }, + "mimetype": "text/x-python", + "name": "pyspark", + "pygments_lexer": "python2" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/graphs/.DS_Store b/graphs/.DS_Store new file mode 100644 index 0000000..4a2367a Binary files /dev/null and b/graphs/.DS_Store differ diff --git a/graphs/1.1.jpg b/graphs/1.1.jpg new file mode 100644 index 0000000..0a87480 Binary files /dev/null and b/graphs/1.1.jpg differ diff --git a/graphs/1.2.jpg b/graphs/1.2.jpg new file mode 100644 index 0000000..84ab9f5 Binary files /dev/null and b/graphs/1.2.jpg differ diff --git a/graphs/2.1.jpg b/graphs/2.1.jpg new file mode 100644 index 0000000..9b1836a Binary files /dev/null and b/graphs/2.1.jpg differ diff --git a/graphs/3.1.1.jpg b/graphs/3.1.1.jpg new file mode 100644 index 0000000..6e443f7 Binary files /dev/null and b/graphs/3.1.1.jpg differ 
diff --git a/graphs/3.1.2.jpg b/graphs/3.1.2.jpg new file mode 100644 index 0000000..fbeb225 Binary files /dev/null and b/graphs/3.1.2.jpg differ diff --git a/graphs/3.1.3.jpg b/graphs/3.1.3.jpg new file mode 100644 index 0000000..49eac2c Binary files /dev/null and b/graphs/3.1.3.jpg differ diff --git a/graphs/3.2.1.jpg b/graphs/3.2.1.jpg new file mode 100644 index 0000000..81234f9 Binary files /dev/null and b/graphs/3.2.1.jpg differ diff --git a/graphs/3.2.2.jpg b/graphs/3.2.2.jpg new file mode 100644 index 0000000..33d17d2 Binary files /dev/null and b/graphs/3.2.2.jpg differ diff --git a/graphs/3.2.3.jpg b/graphs/3.2.3.jpg new file mode 100644 index 0000000..e245d2c Binary files /dev/null and b/graphs/3.2.3.jpg differ