From 4543705db7b9d4176e3e869aa016cd92a02820d9 Mon Sep 17 00:00:00 2001
From: Kalle Westerling In this brief workshop we will be discussing the basics of research data, in terms of material, transformation, and presentation. We will also be focusing on the ethics of data cleaning and representation. If the rest of the course is practical, then this is a small detour to allow us to sit and think about what we are doing. Because everyone has a different approach to data and ethics, this workshop will also include multiple sites for discussions to help us think together as a group. \"Material or information on which an argument, theory, test or hypothesis, or another research output is based.\" \nQueensland University of Technology. Manual of Procedures and Policies. Section 2.8.3. http://www.mopp.qut.edu.au/D/D_02_08.jsp \n\"What constitutes such data will be determined by the community of interest through the process of peer review and program management. This may include, but is not limited to: data, publications, samples, physical collections, software and models\" \nMarieke Guy. http://www.slideshare.net/MariekeGuy/bridging-the-gap-between-researchers-and-research-data-management , #2 \n\"Units of information created in the course of research\" \nhttps://www.nsf.gov/bfa/dias/policy/dmpfaqs.jsp \n\"(i) Research data is defined as the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues.\" \nOMB-110, Subpart C, section 36, (d) (i), http://www.whitehouse.gov/omb/circulars_a110/ \n\"The short answer is that we can’t always trust empirical measures at face value: data is always biased, measurements always contain errors, systems always have confounders, and people always make assumptions.\" Angela Bassa. https://medium.com/@angebassa/data-alone-isnt-ground-truth-9e733079dfd4 \nIn summary, research data is: \nMaterial or information necessary to come to your conclusion. There are many ways to represent data, just as there are many sources of data. After processing our data, we turn it into a number of products. For example:\n* Non-digital text (lab books, field notebooks)\n* Digital texts or digital copies of text\n* Spreadsheets\n* Audio\n* Video\n* Computer Aided Design/CAD\n* Statistical analysis (SPSS, SAS)\n* Databases\n* Geographic Information Systems (GIS) and spatial data\n* Digital copies of images\n* Web files\n* Scientific sample collections\n* Matlab files & 3D Models\n* Metadata & Paradata\n* Data visualizations\n* Computer code\n* Standard operating procedures and protocols\n* Protein or genetic sequences\n* Artistic products\n* Curriculum materials\n* Collection of digital objects acquired and generated during research \nAdapted from: Georgia Tech–http://libguides.gatech.edu/content.php?pid=123776&sid=3067221 These are some (most!) of the shapes your research data might transform into. \n1. What are some forms of data you use in your work? \n2. What about forms of data that you produce as your output? Perhaps there are some forms that are typical of your field. \n3. Where do you usually get your data from? We begin without data. Then it is observed, or made, or imagined, or generated. After that, it goes through further transformations: Raw data is yet to be processed, meaning it has yet to be manipulated by a human or computer. Received or collected data could be in any number of formats, locations, etc.. 
It could be in any of the forms listed above. \nBut \"raw data\" is a relative term, inasmuch as when one person finishes processing data and presents it as a finished product, another person may take that product and work on it further, and for them that data is \"raw data\". As we think about data collection, we should also consider the labor involved in the process. Many researchers rely on Amazon Mechanical Turk (sometimes also referred to as MTurk) for data collection, often paying less than minimum wage for the task. Often the assumption made of these workers is someone who is retired, bored, and participating in online gig work for fun or to kill time. While this may be true for some, more than half of those surveyed in a Pew Research study cite that the income from this work is essential or important. Often, those who view the income from this work as essential or important are also from underserved communities. \nIn addition to being mindful of paying a fair wage to the workers on such platforms, this working environment also brings some further considerations to the data that is collected. Oftentimes, for workers to get close to minimum wage, they cannot afford to spend much time on each task, increasing potential errors in the collected data. Processing data puts it into a state more readily available for analysis, and makes the data legible. For instance, it could be rendered as structured data. This can also take many forms, e.g., a table. \nHere are a few you're likely to come across, all representing the same data: \nXML JSON CSV A small detour to discuss (the ethics of?) data formats. For accessibility, future-proofing, and preservation, keep your data in open, sustainable formats. A demonstration: \n1. Open this file in a text editor, and then in an app like Excel. This is a CSV, an open, text-only file format. \n2. Now do the same with this one. This is a proprietary format! \nSustainable formats are generally unencrypted, uncompressed, and follow an open standard. A small list:\n* ASCII\n* PDF \n* .csv\n* FLAC\n* TIFF\n* JPEG2000\n* MPEG-4\n* XML\n* RDF\n* .txt\n* .r How do you decide the formats to store your data when you transition from 'raw' to 'processed/transformed' data? What are some of your considerations? There are guidelines to the processing of data, sometimes referred to as Tidy Data.1 One manifestation of these rules: \n1. Each variable is in a column. \n2. Each observation is a row. \n3. Each value is a cell. \nLook back at our example of cats to see how they may or may not follow those guidelines. Important note: some data formats allow for more than one dimension of data! How might that complicate the concept of Tidy Data? 1Wickham, Hadley. \"Tidy Data\". Journal of Statistical Software. High quality data is measured in its validity, accuracy, completeness, consistency, and uniformity. \nProcessed data, even in a table, is going to be full of errors: \n1. Empty fields \n2. Multiple formats, such as \"yes\" or \"y\" or \"1\" for a positive response. \n3. Suspect answers, like a date of birth of 00/11/1234 \n4. Impossible negative numbers, like an age of \"-37\" \n5. Dubious outliers \n6. Duplicated rows \n7. And many more! \nCleaning data is the work of correcting the errors listed above, and moving towards high quality. This work can be done manually or programmatically. \nValidity \nMeasurements must be valid, in that they must conform to set constraints: \n1. The aforementioned \"yes\" or \"y\" or \"1\" should all be changed to one response. \n2. 
Certain fields cannot be empty, or the whole observation must be thrown out. \n3. Uniqueness, for instance no two people should have the same social security number. \nAccuracy \nMeasurements must be accurate, in that they must represent the correct values. While an observation may be valid, it might at the same time be inaccurate. 123 Fake Street is a valid, inaccurate street address. \nUnfortunately, accuracy is mostly achieved in the observation process. To be achieved in the cleaning process, an outside trusted source would have to be cross-referenced. \nCompleteness \nMeasurements must be complete, in that they must represent everything that might be known. This also is nearly impossible to achieve in the cleaning process! For instance, in a survey, it would be necessary to re-interview someone whose previous answer to a question was left blank. \nConsistency \nMeasurements must be consistent, in that different observations must not contradict each other. For instance, one person cannot be represented as both dead and still alive in different observations. \nUniformity \nMeasurements must be uniform, in that the same unit of measure must be used in all relevant measurements. If one person's height is listed in meters and another in feet, one measurement must be converted. How do we know when our data is cleaned enough? What happens to the data that is removed? What are we choosing to say about our datasets as we prepare them for analysis? Analysis can take many forms (just like the rest of this stuff!), but many techniques fall within a couple of categories: Techniques geared towards summarizing a data set, such as:\n* Mean\n* Median\n* Mode\n* Average\n* Standard deviation Techniques geared towards testing a hypothesis about a population, based on your data set, such as:\n* Extrapolation\n* P-Value calculation As we consider the types of analysis that we choose to apply to our data set, what are we representing and leaving out? How do we guide our decisions of interpretation with our choices of analyses? Are we comfortable with the intended use of our research? Are we comfortable with the unintended use of our research? What are potential misuses of our outputs? What can happen when we are trying to just go for the next big thing (tool/methods/algorithms) or just ran out of time and/or budget for our project? What Constitutes Research Data?
\nForms of Data
\nChallenge: Forms of Data
\nRaw
\nData and Labor
\nChallenge: Raw Data and Labor
\n\n
\nProcessed/Transformed
\n<Cats> \n <Cat> \n <firstName>Smally</firstName> <lastName>McTiny</lastName> \n </Cat> \n <Cat> \n <firstName>Kitty</firstName> <lastName>Kitty</lastName> \n </Cat> \n <Cat> \n <firstName>Foots</firstName> <lastName>Smith</lastName> \n </Cat> \n <Cat> \n <firstName>Tiger</firstName> <lastName>Jaws</lastName> \n </Cat> \n</Cats> \n
{\"Cats\":[ \n { \"firstName\":\"Smally\", \"lastName\":\"McTiny\" }, \n { \"firstName\":\"Kitty\", \"lastName\":\"Kitty\" }, \n { \"firstName\":\"Foots\", \"lastName\":\"Smith\" }, \n { \"firstName\":\"Tiger\", \"lastName\":\"Jaws\" } \n]} \n
First Name,Last Name\nSmally,McTiny\nKitty,Kitty\nFoots,Smith\nTiger,Jaws\n
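For example (a sketch in Python, not part of the original workshop), structured data like the CSV above can be read straight into rows and columns once it is saved to a file, here assumed to be named cats.csv: \nimport csv\nwith open('cats.csv', newline='') as f:\n    rows = list(csv.DictReader(f))  # one dictionary per cat, keyed by the column headers\nprint(rows[0])  # {'First Name': 'Smally', 'Last Name': 'McTiny'}\n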
The importance of using open data formats
\nChallenge: Processed/Transformed
\nTidy Data
\n{\"Cats\":[\n {\"Calico\":[\n { \"firstName\":\"Smally\", \"lastName\":\"McTiny\" },\n { \"firstName\":\"Kitty\", \"lastName\":\"Kitty\" }],\n \"Tortoiseshell\":[\n { \"firstName\":\"Foots\", \"lastName\":\"Smith\" }, \n { \"firstName\":\"Tiger\", \"lastName\":\"Jaws\" }]}]}\n
Cleaned
\nChallenge: When do we stop cleaning?
\nAnalyzed
\nDescriptive Analysis
\nInferential Analysis
\nChallenge: Analysis
\nVisualized
\nAdapted from Evergreen, Stephanie D. Effective data visualization: the right chart for the right data. Los Angeles: SAGE, 2017.\n
As we transform our results into visuals, we are also trying to tell a narrative about the data we collected. Data visualization can help us to decode information and share quickly and simply. What are we assuming when we choose to visually represent data in particular ways? How can data visualization mislead us?
", - "order": 3 - } - }, - { - "model": "lesson.lesson", - "pk": 1038, - "fields": { - "title": "Data Literacy and Ethics", - "created": "2020-07-09T16:41:00.650Z", - "updated": "2020-07-09T16:41:00.650Z", - "workshop": 152, - "text": "Throughout the workshop we have been thinking together through some of
\nthe potential ethical concerns that might crop up as we proceed with our
\nown projects. As we have discussed thus far, we hope that you see
\nthat questions of data and ethics are ongoing throughout the lifespan of
\nyour project(s) and don’t often come with easy answers.
\nIn this final activity, we would like for you to think about some of the
\npotential concerns that might come up in the scenario below and discuss
\nhow you might approach them:
\nYou are interested in looking at the reactions to the Democratic Party
\npresidential debates across time. You decided that you would use data
\nfrom Twitter to analyze the responses. After collecting your data, you
\nlearned that your data has information from users who were later banned
\nand included some tweets that were removed/deleted from the site.
\nData and ethics are contextually driven. As such, there isn’t always a
\nrisk-free approach. We often have to work through ethical dilemmas while
\nthinking through information that we may not have (what are the risks of
\ndoing/not doing this work?). We may be approaching a moment where the
\nquestion is no longer what we could do but what we should do.
", - "order": 4 - } - }, - { - "model": "frontmatter.learningobjective", - "pk": 890, - "fields": { - "frontmatter": 144, - "label": "Understand the stages of data analysis." - } - }, - { - "model": "frontmatter.learningobjective", - "pk": 891, - "fields": { - "frontmatter": 144, - "label": "Understand the beginning of cleaning/tidying data" - } - }, - { - "model": "frontmatter.learningobjective", - "pk": 892, - "fields": { - "frontmatter": 144, - "label": "Experience the difference between proprietary and open data formats." - } - }, - { - "model": "frontmatter.learningobjective", - "pk": 893, - "fields": { - "frontmatter": 144, - "label": "Become familiar with the specific requirements of \"high quality data.\"" - } - }, - { - "model": "frontmatter.learningobjective", - "pk": 894, - "fields": { - "frontmatter": 144, - "label": "Have an understanding of potential ethical concerns around working with different types of data and analysis." - } - }, - { - "model": "frontmatter.contributor", - "pk": 443, - "fields": { - "first_name": "Stephen", - "last_name": "Zweibel", - "role": null, - "url": null - } - }, - { - "model": "frontmatter.contributor", - "pk": 444, - "fields": { - "first_name": "Di", - "last_name": "Yoong", - "role": null, - "url": null - } - }, - { - "model": "frontmatter.contributor", - "pk": 445, - "fields": { - "first_name": "Ian", - "last_name": "Phillips", - "role": null, - "url": null - } - }, - { - "model": "library.reading", - "pk": 703, - "fields": { - "title": "Big? Smart? Clean? Messy? Data in the Humanities", - "url": "http://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities", - "annotation": "[Big? Smart? Clean? Messy? Data in the Humanities](http://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/)", - "zotero_item": null - } - }, - { - "model": "library.reading", - "pk": 704, - "fields": { - "title": "Bit By Bit: Social Research in Digital Age", - "url": "https://www.bitbybitbook.com/en/1st-ed/preface", - "annotation": "[Bit By Bit: Social Research in Digital Age](https://www.bitbybitbook.com/en/1st-ed/preface/)", - "zotero_item": null - } - }, - { - "model": "library.reading", - "pk": 705, - "fields": { - "title": "Ten Simple Rules for Responsible Big Data Research", - "url": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5373508", - "annotation": "[Ten Simple Rules for Responsible Big Data Research](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5373508/)", - "zotero_item": null - } - }, - { - "model": "library.project", - "pk": 352, - "fields": { - "title": "Data for Public Good", - "url": "https://dataforgood.commons.gc.cuny.edu", - "annotation": "[Data for Public Good](https://dataforgood.commons.gc.cuny.edu/): Graduate student fellows creates a semester-long collaborative project that makes public-interest dataset useful and informative to a public audience.", - "zotero_item": null - } - }, - { - "model": "library.project", - "pk": 353, - "fields": { - "title": "SAFElab", - "url": "https://safelab.socialwork.columbia.edu", - "annotation": "[SAFElab](https://safelab.socialwork.columbia.edu/): Uses computational and social work approaches to understand mechanisms of violence and how to prevent and intervene in violence that occur in neighbourhoods and on social media.", - "zotero_item": null - } - }, - { - "model": "library.tutorial", - "pk": 343, - "fields": { - "label": "Computational social science with R", - "url": "https://compsocialscience.github.io/summer-institute/curriculum#day_2", - 
"annotation": "[Computational social science with R](https://compsocialscience.github.io/summer-institute/curriculum#day_2) by the Summer Institutes in Computational Social Science", - "zotero_item": null - } - }, - { - "model": "library.tutorial", - "pk": 344, - "fields": { - "label": "SQLite Tutorial", - "url": "https://www.sqlitetutorial.net", - "annotation": "[SQLite Tutorial](https://www.sqlitetutorial.net/) by SQLiteTutorial", - "zotero_item": null - } - }, - { - "model": "library.reading", - "pk": 706, - "fields": { - "title": "data management presentation", - "url": "https://www.slideshare.net/MariekeGuy/bridging-the-gap-between-researchers-and-research-data-management", - "annotation": "Marieke Guy's [data management presentation](https://www.slideshare.net/MariekeGuy/bridging-the-gap-between-researchers-and-research-data-management)", - "zotero_item": null - } - }, - { - "model": "library.reading", - "pk": 707, - "fields": { - "title": "Management of Research Data", - "url": "http://www.mopp.qut.edu.au/D/D_02_08.jsp", - "annotation": "Queensland University of Technology's [Management of Research Data](http://www.mopp.qut.edu.au/D/D_02_08.jsp).", - "zotero_item": null - } - }, - { - "model": "library.reading", - "pk": 708, - "fields": { - "title": "Perspectives on Big Data, Ethics, and Society", - "url": "https://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society", - "annotation": "The Council for Big Data, Ethics, and Society's publication [Perspectives on Big Data, Ethics, and Society](https://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society/).", - "zotero_item": null - } - }, - { - "model": "workshop.workshop", - "pk": 153, - "fields": { - "name": "Text Analysis", - "slug": "text-analysis", - "created": "2020-07-09T16:41:02.101Z", - "updated": "2020-07-09T16:41:02.101Z", - "parent_backend": "Github", - "parent_repo": "DHRI-Curriculum/text-analysis", - "parent_branch": "v2.0-rafa-edits" - } - }, - { - "model": "frontmatter.frontmatter", - "pk": 145, - "fields": { - "workshop": 153, - "abstract": "Digital technologies have made vast amounts of text available to researchers, and this same technological moment has provided us with the capacity to analyze that text. The first step in that analysis is to transform texts designed for human consumption into a form a computer can analyze. Using Python and the Natural Language ToolKit (commonly called NLTK), this workshop introduces strategies to turn qualitative texts into quantitative objects. Through that process, we will present a variety of strategies for simple analysis of text-based data.", - "ethical_considerations": "['In working with massive amounts of text, it is natural to lose the original context. We must be aware of that and be careful when analizing it.', 'It is important to constantly question our assumptions and the indexes we are using. Numbers and graphs do not tell the story, our analysis does. We must be careful not to draw hasty and simplistic conclusions for things that are complex. 
Just because we found out that author A uses more unique words than author B, does it mean that A is a better writer than B?']", - "estimated_time": "10", - "projects": [ - 354, - 355, - 356 - ], - "resources": [], - "readings": [ - 709, - 710 - ], - "contributors": [ - 446, - 447, - 448, - 449, - 450, - 451, - 452, - 453, - 454 - ], - "prerequisites": [] - } - }, - { - "model": "praxis.praxis", - "pk": 131, - "fields": { - "discussion_questions": "['Content TBD']", - "next_steps": "[]", - "workshop": 153, - "further_readings": [ - 711, - 712 - ], - "more_projects": [], - "more_resources": [], - "tutorials": [ - 345, - 346, - 347 - ] - } - }, - { - "model": "lesson.lesson", - "pk": 1039, - "fields": { - "title": "Overview", - "created": "2020-07-09T16:41:02.110Z", - "updated": "2020-07-09T16:41:02.110Z", - "workshop": 153, - "text": "This tutorial will give a brief overview of the considerations and tools involved in basic text analysis with Python. By completing this tutorial, you will have a general sense of how to turn text into data using the Python package, NLTK. You will also be able to take publicly available text files and transform them into a corpus that you can perform your own analysis on. Finally, you will have some insight into the types of questions that can be addressed with text analysis.
\nIf you have not already installed the Anaconda distribution of Python 3, please do so.
\nYou will also need nltk
and matplotlib
to complete this tutorial. Both packages come installed with Anaconda. To check to be sure you have them, open a new Jupyter Notebook (or any IDE to run Python).
\nFind Anaconda Navigator on your computer (it should be located in the folder with your other applications), and from Anaconda Navigator's interface, launch a Jupyter Notebook.
\n
\nIt will open in the browser. All of the directories (folders) in your home directory will appear — we'll get to that later. For now, select New
>> Python3
in the upper right corner.
\n
\nA blank page with an empty box should appear.
\n
\nIn the box, type:
\nimport nltk\nimport matplotlib\n
Press Shift + Enter
to run the cell (or click run at the top of the page). Don't worry too much about what this is doing - that will be explained later in this tutorial. For now, we just want to make sure the packages we will need are installed.
\n
\nIf nothing happens, they are installed and you are ready to move on! If you get an error message, either you have a typo or they are not installed. If it is the latter, open the command line and type:
\nconda install nltk -y\nconda install matplotlib -y\n
Now we need to install the nltk corpus. This is very large and may take some time if you are on a weak connection.
\nIn the next cell, type:
\nnltk.download()\n
and run the cell.
\nThe NLTK downloader should appear. Please install all of the packages. If you are short on time, focus on \"book\" for this tutorial—you can download the other packages at another time for later use.
\nYours will look a little different, but it will have the same interface. Click on the 'all' option and then 'Download'. Once they all turn green, you can close the Downloader dialogue box.
\n
\nReturn to your Jupyter Notebook and type:
\nfrom nltk.book import *\n
A list of books should appear. If this happens, great! If not, return to the downloader to make sure everything is ok.
\nClose this Notebook without saving — the only purpose was to check if we have the appropriate packages installed.
", - "order": 1 - } - }, - { - "model": "lesson.lesson", - "pk": 1040, - "fields": { - "title": "Text as Data", - "created": "2020-07-09T16:41:02.172Z", - "updated": "2020-07-09T16:41:02.172Z", - "workshop": 153, - "text": "When we think of \"data,\" we often think of numbers, things that can be summarized, statisticized, and graphed. Rarely when I ask people \"what is data?\" do they respond \"Moby Dick.\" And yet, more and more, text is data. Whether it is Moby Dick, or every romance novel written since 1750, or today's newspaper or twitter feed, we are able to transform written (and spoken) language into data that can be quantified and visualized.
\nThe first step in gathering insights from texts is to create a corpus. A corpus is a collection of texts that are somehow related to each other. For example, the Corpus of Contemporary American English, Donald Trump's Tweets, text messages sent by bilingual young adults, digitized newspapers, or books in the public domain are all corpora. There are infinitely many corpora, and, sometimes, you will want to make your own—that is, one that best fits your research question.
\nThe route you take from here will depend on your research question. Let's say, for example, that you want to examine gender differences in writing style. Based on previous linguistic research, you hypothesize that male-identified authors use more definite determiners than female-identified authors. So you collect two corpora—one written by men, one written by women—and you count the number of thes, thiss, and thats compared to the number of as, ans, and ones. Maybe you find a difference, maybe you don't. We can already see that this is a relatively crude way of going about answering this question, but it is a start. (More likely, you'd use a supervised classification task, which you will learn about in the Machine Learning Tutorial.)
\nThere has been some research about how the linguistic complexity of written language in long-form pieces (i.e., books, articles, letters, etc.) has decreased over time. Simply put, people today use shorter sentences with fewer embedded clauses and complex tense constructions than people did in the past. (Note that this is not necessarily a bad or good thing.) Based on this research, we want to know if short-form platforms are emblematic of the change (we predict that they are based on our own experience with short-form platforms like email and Twitter). One way to do this would be to use Part-of-Speech tagging. Part-of-Speech (POS) tagging is a way to identify the category of words in a given text.
\nFor example, the sentence:
\n\n", - "order": 2 - } - }, - { - "model": "lesson.lesson", - "pk": 1041, - "fields": { - "title": "Cleaning and Normalizing", - "created": "2020-07-09T16:41:02.175Z", - "updated": "2020-07-09T16:41:02.175Z", - "workshop": 153, - "text": "I like the red bicycle.
\nhas one pronoun, one verb, one determiner, one adjective, and one noun.
\n(I : Pronoun), (like : Verb), (the : Determiner), (red : Adjective), (bicycle : Noun)
\nNLTK uses the Penn Tree Bank Tag Set. This is a very detailed tag list that goes far beyond just nouns, verbs, and adjectives, but gives insight into different types of nouns, prepositions, and verbs as well. Virtually all POS taggers will create a list of (word, POS) pairs. If newspaper articles have a higher ratio of function words (prepositions, auxiliaries, determiners, etc.) to semantic words (nouns, verbs, adjectives) than tweets do, then we have one piece of evidence supporting our hypothesis. It's important to note here that we must use either ratios or otherwise normalized data (in the sense that raw numbers will not work). Because of the way that language works (function words are often repeated, for example), a sample of 100 words will have a higher proportion of unique words than a sample of 1,000. Therefore, to compare different data types (articles vs. tweets), this fact should be taken into account.
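To see Penn Tree Bank tags in action, NLTK's built-in tagger can be run on a tokenized sentence. This is a quick sketch rather than part of the lesson, and it assumes the relevant NLTK resources (the punkt tokenizer and the averaged perceptron tagger) have already been downloaded: \nimport nltk\nnltk.pos_tag(nltk.word_tokenize(\"I like the red bicycle.\"))\n# typically returns [('I', 'PRP'), ('like', 'VBP'), ('the', 'DT'), ('red', 'JJ'), ('bicycle', 'NN'), ('.', '.')]\n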
\n
Generally, however, our questions are more about topics rather than writing style. So, once we have a corpus—whether that is one text or millions—we usually want to clean and normalize it. There are three terms we are going to need:
\n- Text normalization is the process of taking a list of words and transforming it into a more uniform sequence. Usually, this involves removing punctuation, making the words all the same case, removing stop words, and either stemming or lemmatizing the words. It can also include expanding abbreviations or matching misspellings (but these are advanced practices that we will not cover).
\nYou probably know what removing punctuation and capitalization refer to, but the other terms may be new:
\n- Stop words are words that appear frequently in a language, often adding grammatical structure, but little semantic content. There is no official list of stop words for any language, though there are some common, all-purpose lists built in to NLTK. However, different tasks require different lists. The purpose of removing stop words is to remove words that are so common that their meaning is diminished across a large number of texts.
\n- Stemming and lemmatizing both of these processes try to consolidate words like \"laughs\" and \"laughing\" to \"laugh\" since they all mean essentially the same thing, they are just inflected differently. So again, in an attempt to reduce the number of words, and get a realistic understanding of the meaning of a text, these words are collapsed. Stemming does this by cutting off the end (very fast), lemmatizing does this by looking up the dictionary form (very slow).
\nLanguage is messy, and created for and by people, not computers. There is a lot of grammatical information in a sentence that a computer cannot use. For example, I could say to you:
\n\n\nThe house is burning.
\nand you would understand me. You would also understand if I say
\nhouse burn.
\nThe first has more information about tense, and which house in particular, but the sentiment is the same either way.
\nIn going from the first sentence to the normalized words, we removed the stop words (the and is), and removed punctuation and case, and lemmatized what was left (burning becomes burn—though we might have stemmed this, it's impossible to tell from the example). This results in what is essentially a \"bag of words,\" or a corpus of words without any structure. Because normalizing your text reduces the number of words (and therefore the number of dimensions in your data), and keeps only the words that contribute meaning to the document, this cleaning is usually desirable.
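As a minimal sketch of that normalization in code (assuming the NLTK stopwords and WordNet resources are downloaded; the later sections walk through these steps one at a time), we might write: \nfrom nltk.corpus import stopwords\nfrom nltk.stem import WordNetLemmatizer\nwords = [\"the\", \"house\", \"is\", \"burning\"]\nstops = stopwords.words('english')\nlemmatizer = WordNetLemmatizer()\nprint([lemmatizer.lemmatize(w, pos='v') for w in words if w not in stops])  # ['house', 'burn']\n Here pos='v' (treat each word as a verb) is a shortcut so that \"burning\" maps to \"burn\"; a fuller pipeline would tag parts of speech first.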
\nAgain, this will be covered in more depth in the Machine Learning Tutorial, but for the time being, we just need to know that there are \"clean\" and \"dirty\" versions of text data. Sometimes our questions are about the clean data, but sometimes our questions are in the \"dirt.\"
\n
In the next section, we are going to go through a series of methods that come built-in to NLTK that allow us to turn our words into numbers and visualizations. This is just scratching the surface, but should give you an idea of what is possible beyond just counting words.
", - "order": 3 - } - }, - { - "model": "lesson.lesson", - "pk": 1042, - "fields": { - "title": "NLTK Methods with the NLTK Corpus", - "created": "2020-07-09T16:41:02.181Z", - "updated": "2020-07-09T16:41:02.181Z", - "workshop": 153, - "text": "All of the code for this section is in a Jupyter Notebook in the GitHub repository. I encourage you to follow along by retyping all of the code, but if you get lost, or want another reference, the code is there as well.
\nTo open the notebook, first create a projects
folder if you don't already have one by entering this command in your terminal:
mkdir -p ~/Desktop/projects\n
If you already have a projects folder, you can skip this step.
\nNext, clone the text analysis session repository into your projects folder by entering this command:
\ngit clone https://github.com/DHRI-Curriculum/text-analysis.git ~/Desktop/projects/text-analysis\n
Then move to the new directory:
\ncd ~/Desktop/projects/text-analysis\n
Now launch the Jupyter Notebook application by typing this into the terminal:
\njupyter notebook\n
If it's your first time opening the notebook, you may be prompted to enter a URL into your browser. Copy out the URL and paste it into the Firefox or Google Chrome search bar.
\nFinally, in the Jupyter Notebook file browser, find the notebook file and open it. It should be called TextAnalysis.ipynb
. You will use this file for reference in case you get stuck in the next few sections, so keep it open.
\nReturn to the Jupyter Home Tab in your Browser (or Launch the Jupyter Notebook again), and start a New Python3 Notebook using the New
button in the upper right corner.
\nEven though Jupyter Notebook doesn't force you to do so, it is very important to name your file, or you will end up later with a bunch of untitled files and you will have no idea what they are about. In the top left, click in the word Untitled
and give your file a name such as \"intro_nltk\".
\nIn the first blank cell, type the following to import the NLTK library:
\nimport nltk\n
Libraries are sets of instructions that Python can use to perform specialized functions. The Natural Language ToolKit (nltk
) is one such library. As the name suggests, its focus is on language processing.
\nWe will also need the matplotlib library later on, so import it now:
\nimport matplotlib\n
matplotlib
is a library for making graphs. In the middle of this tutorial, we are going to make a dispersion plot of words in our texts.
\nFinally, because of a quirk of Jupyter notebooks, we need to specify that matplotlib should display its graphs in the notebook (as opposed to in a separate window), so we type this command (this is technically a Jupyter command, not Python):
\n%matplotlib inline\n
All three of these commands can be written in the same cell and run all at once (Shift + Enter
) or in different cells.
\n
\nIf you don't see an error when you run the notebook—that is, if nothing happens—you can move on to the next step.
\nNext, we need to load all of the NLTK corpora into our program. Even though we downloaded them to our computer, we need to tell Python we want to use them.
\nfrom nltk.book import *\n
The pre-loaded NLTK texts should appear again. These are preformatted data sets. We will still have to do some minor processing, but having the data in this format saves us a few steps. At the end of this tutorial, we will make our own corpus. Each of these texts is a special type of Python object specific to NLTK (it isn't a string, list, or dictionary). Sometimes it will behave like a string, and sometimes like a list of words. How it is behaving is noted for each function as we try it out.
\n
\nLet's start by analyzing Moby Dick, which is text1
for NLTK.
The first function we will look at is concordance
. \"Concordance\" in this context means the characters on either side of the word. Our text is behaving like a string. As discussed in the Python tutorial LINK, Python does not evaluate strings, so it just counts the number of characters on either side. By default, this is 25 characters on either side of our target word (including spaces).
\nIn the Jupyter Notebook, type:
\ntext1.concordance(\"whale\")\n
The output shows us the 25 characters on either side of the word \"whale\" in Moby Dick. Let's try this with another word, \"love.\" Just replace the word \"whale\" with \"love,\" and we get the contexts in which Melville uses \"love\" in Moby Dick. concordance
is used (behind the scenes) for several other functions, including similar
and common_contexts
.
\nLet's now see which words appear in similar contexts as the word \"love.\" NLTK has a built-in function for this as well: similar
.
text1.similar(\"love\")\n
Behind the scenes, Python found all the contexts where the word \"love\" appears. It also finds similar environments, and then what words were common among the similar contexts. This gives a sense of what other words appear in similar contexts. This is somewhat interesting, but more interesting if we can compare it to something else. Let's take a look at another text. What about Sense and Sensibility? Let's see what words are similar to \"love\" in Jane Austen's writing. In the next cell, type:
\ntext2.similar(\"love\")\n
We can compare the two and see immediately that Melville and Austen use the word \"love\" differently.
\nLet's expand from novels for a minute and take a look at the NLTK Chat Corpus. In chats, text messages, and other digital communication platforms, \"lol\" is exceedingly common. We know it doesn't simply mean \"laughing out loud\"—maybe the similar
function can provide some insight into what it does mean.
text5.similar(\"lol\")\n
The resulting list is a lot of greetings, indicating that \"lol\" probably has more of a phatic function. Phatic language is language primarily for communicating social closeness. Phatic words stand in contrast to semantic words, which contribute meaning to the utterance.
\nIf you are interested in this type of analysis, take a look at the common_contexts
function in the NLTK book or in the NLTK docs.
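For instance, the opening chapter of the NLTK book compares the contexts that \"monstrous\" and \"very\" share in Sense and Sensibility: \ntext2.common_contexts([\"monstrous\", \"very\"])\n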
In many ways, concordance
and similar
are heightened word searches that tell us something about what is happening near the target words. Another metric we can use is to visualize where the words appear in the text. In the case of Moby Dick, we want to compare where \"whale\" and \"monster\" appear throughout the text. In this case, the text is functioning as a list of words, and will make a mark where each word appears, offset from the first word. We will pass this function a list of strings to plot. This will likely help us develop a visual of the story — where the whale goes from being a whale to being a monster to being a whale again. In the next cell, type:
text1.dispersion_plot([\"whale\", \"monster\"])\n
A graph should appear with a tick mark everywhere that \"whale\" appears and everywhere that \"monster\" appears. Knowing the story, we can interpret this graph and align it to what we know of how the narrative progresses. If we did not know the story, this could give us a picture of the narrative arc.
\nTry this with text2
, Sense and Sensibility. Some relevant words are \"marriage,\" \"love,\" \"home,\" \"mother,\" \"husband,\" \"sister,\" and \"wife.\" Pick a few to compare. You can compare an unlimited number, but it's easier to read a few at a time. (Note that in our prose the commas sit inside the quotation marks, but Python could not read that; to create a list, the commas must go outside the quotation marks.)
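One way to write that comparison (a suggestion, picking four of the words above) would be: \ntext2.dispersion_plot([\"marriage\", \"love\", \"husband\", \"wife\"])\n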
\nNLTK has many more functions built-in, but some of the most powerful functions are related to cleaning, part-of-speech tagging, and other stages in the text analysis pipeline (where the pipeline refers to the process of loading, cleaning, and analyzing text).
", - "order": 6 - } - }, - { - "model": "lesson.lesson", - "pk": 1045, - "fields": { - "title": "Built-In Python Functions", - "created": "2020-07-09T16:41:02.203Z", - "updated": "2020-07-09T16:41:02.203Z", - "workshop": 153, - "text": "We will now turn our attention away from the NLTK library and work with our text using the built-in Python functions—the ones that come included with the Python language, rather than the NLTK library.
\nFirst, let's find out how many times a given word appears in the corpus. In this case (and all cases going forward), our text will be treated as a list of words. Therefore, we will use the count
function. We could just as easily do this with a text editor, but performing this in Python allows us to save it to a variable and then utilize this statistic in other calculations (for example, if we want to know what percentage of words in a corpus are 'lol', we would need a count of the 'lol's). In the next cell, type:
text1.count(\"whale\")\n
We see that \"whale\" occurs 906 times, but that seems a little low. Let's check on \"Whale\" and see how often that appears:
\ntext1.count(\"Whale\")\n
\"Whale\" with a capital \"W\" appears 282 times. This is a problem for us—we actually want them to be collapsed into one word, since \"whale\" and \"Whale\" really are the same for our purposes. We will deal with that in a moment. For the time being, we will accept that we have two entries for \"whale.\"
\nThis gets at a distinction between type and token. \"Whale\" and \"whale\" are different types (as of now) because they do not match identically. Every instance of \"whale\" in the corpus is another token—it is an instance of the type, \"whale.\" Therefore, there are 906 tokens of \"whale\" in our corpus.
\nLet's fix this by making all of the words lowercase. We will make a new list of words, and call it \"text1_tokens\". We will fill this list with all the words in text1, but in their lowercase form. Python has a built-in function, lower()
that takes all letters and makes them lowercase. In this same step, we are going to do a kind of tricky move, and only keep the words that are alphabetical and pass over anything that is punctuation or numbers. There is a built-in function, isalpha()
, that will allow us to save only those words that are made of letters. If isalpha()
is true, we'll make the word lowercase, and keep the word. If not, we'll pass over it and move to the next one.
\nType the following code into a new cell in your notebook. Pay special attention to the indentation, which must appear as below. (Note that in Jupyter Notebook, indentation usually comes automatically. If not, make sure to type the space
key 4 times)
text1_tokens = []\nfor t in text1:\n if t.isalpha():\n t = t.lower()\n text1_tokens.append(t)\n
\nAnother way to perform the same action more tersely is to use what's called a list comprehension. A list comprehension is a shorter, faster way to write a for-loop. It is syntactically a little more difficult to read (for a human), but, in this case, it's much faster to process. Don't worry too much about understanding the syntax of list comprehensions right now. For every example, we will show both the for loop and list comprehension options.
\ntext1_tokens = [t.lower() for t in text1 if t.isalpha()]\n
Great! Now text1_tokens
is a list of all of the tokens in our corpus, with the punctuation removed, and all the words in lowercase.
\nNow we want to know how many words there are in our corpus—that is, how many tokens in total. Therefore, we want to ask, \"What is the length of that list of words?\" Python has a built-in len
function that allows you to find out the length of many types. Pass it a list, and it will tell you how many items are in the list. Pass it a string, and it will tell you how many characters are in the string. Pass it a dictionary, and it will tell you how many items are in the dictionary. In the next cell, type:
len(text1_tokens)\n
Just for comparison, check out how many words were in \"text1\"—before we removed the punctuation and the numbers.
\nlen(text1)\n
We see there are over 218,000 words in Moby Dick (including metadata). But this is the number of words total—we want to know the number of unique words. That is, we want to know how many types, not just how many tokens.
\nIn order to get unique words, rather than just all words in general, we will make a set from the list. A set
in Python work just like it would in math, it's all the unique values, with any duplicate items removed.
\nSo let's find out the length of our set. just like in math, we can also nest our functions. So, rather than saying x = set(text1_tokens)
and then finding the length of \"x\", we can do it all in one step.
len(set(text1_tokens))\n
Great! Now we can calculate the lexical density of Moby Dick. Statistical studies have shown that lexical density (the number of unique words per total words) is a good metric to approximate lexical diversity—the range of vocabulary an author uses. For our first pass at lexical density, we will simply divide the number of unique words by the total number of words:
\nlen(set(text1_tokens))/len(text1_tokens)\n
If we want to use this metric to compare texts, we immediately notice a problem. Lexical density is dependent upon the length of a text and therefore is strictly a comparative measure. It is possible to compare 100 words from one text to 100 words from another, but because language is finite and repetitive, it is not possible to compare 100 words from one to 200 words from another. Even with these restrictions, lexical density is a useful metric in grade level estimations, vocabulary use and genre classification, and a reasonable proxy for lexical diversity.
\nLet's take this constraint into account by working with only the first 10,000 words of our text. First we need to slice our list, returning the words in position 0 to position 9,999 (we'll actually write it as \"up to, but not including\" 10,000).
\ntext1_slice = text1_tokens[0:10000]\n
Now we can do the same calculation we did above:
\nlen(set(text1_slice)) / len(text1_slice)\n
This is a much higher number, though the number itself is arbitrary. When comparing different texts, this step is essential to get an accurate measure.
", - "order": 7 - } - }, - { - "model": "lesson.lesson", - "pk": 1046, - "fields": { - "title": "Making Your Own Corpus: Data Cleaning", - "created": "2020-07-09T16:41:02.232Z", - "updated": "2020-07-09T16:41:02.232Z", - "workshop": 153, - "text": "Thus far, we have been asking questions that take stopwords and grammatical features into account. For the most part, we want to exclude these features since they don't actually contribute very much semantic content to our models. Therefore, we will:
\n1. Remove capitalization and punctuation (we've already done this).
\n2. Remove stop words.
\n3. Lemmatize (or stem) our words, i.e. \"jumping\" and \"jumps\" become \"jump.\"
\nWe already completed step one, and are now working with our text1_tokens
. Remember, this variable, text1_tokens
, contains a list of strings that we will work with. We want to remove the stop words from that list. The NLTK library comes with fairly comprehensive lists of stop words for many languages. Stop words are function words that contribute very little semantic meaning and most often have grammatical functions. Usually, these are function words such as determiners, prepositions, auxiliaries, and others.
\nTo use NLTK's stop words, we need to import the list of words from the corpus. (We could have done this at the beginning of our program, and in more fully developed code, we would put it up there, but this works, too.) In the next cell, type:
\nfrom nltk.corpus import stopwords\n
We need to specify the English list, and save it into its own variable that we can use in the next step:
\nstops = stopwords.words('english')\n
Now let's take a look at those words:
\nprint(stops)\n
Now we want to go through all of the words in our text, and if that word is in the stop words list, remove it from our list. Otherwise, we want it to skip it. (The code below is VERY slow, so it may take some time to process). The way we can write this in Python is:
\ntext1_stops = []\nfor t in text1_tokens:\n if t not in stops:\n text1_stops.append(t)\n
A faster option, if you are feeling bold, would be using list comprehensions:
\ntext1_stops = [t for t in text1_tokens if t not in stops]\n
To check the result:
\nprint(text1_stops[:30])\n
Now that we removed our stop words, let's see how many words are left in our list:
\nlen(text1_stops)\n
You should get a much lower number.
\nFor reference, let's also check how many unique words there are. We will do this by making a set of words. Sets are the same in Python as they are in math, they are all of the unique words rather than all the words. So, if \"whale\" appears 200 times in the list of words, it will only appear once in the set.
\nlen(set(text1_stops))\n
Now that we've removed the stop words from our corpus, the next step is to stem or lemmatize the remaining words. This means that we will strip off the grammatical structure from the words. For example, cats —> cat
, and walked —> walk
. If that was all we had to do, we could stem the corpus and achieve the correct result, because stemming (as the name implies) really just means cutting off affixes to find the root (or the stem). Very quickly, however, this gets complicated, such as in the case of men —> man
and sang —> sing
. Lemmatization deals with this by looking up the word in a reference and finding the appropriate root (though note that this still is not entirely accurate). Lemmatization, therefore, takes a relatively long time, since each word must be looked up in a reference. NLTK comes with pre-built stemmers and lemmatizers.
\nWe will use the WordNet Lemmatizer from the NLTK Stem library, so let's import that now:
\nfrom nltk.stem import WordNetLemmatizer\n
Because of the way that it is written \"under the hood,\" an instance of the lemmatizer needs to be called. We know this from reading the docs.
\nwordnet_lemmatizer = WordNetLemmatizer()\n
Let's quickly see what lemmatizing does.
\nwordnet_lemmatizer.lemmatize(\"children\")\n
Now try this one:
\nwordnet_lemmatizer.lemmatize(\"better\")\n
It didn't work, but...
\nwordnet_lemmatizer.lemmatize(\"better\", pos='a')\n
... sometimes we can get better results if we define a specific part of speech(pos). \"a\" is for \"adjective\", as we learned here.
\nNow we will lemmatize the words in the list.
\ntext1_clean = []\nfor t in text1_stops:\n t_lem = wordnet_lemmatizer.lemmatize(t)\n text1_clean.append(t_lem)\n
And again, there is a faster version for you to use once you feel comfortable with list comprehensions:
\ntext1_clean = [wordnet_lemmatizer.lemmatize(t) for t in text1_stops]\n
Let's check now to see the length of our final, cleaned version of the data, and then check the unique set of words. Notice how we will use the print
function this time. Jupyter Notebook does print commands without the print
function, but it will only print one thing per cell (the last command), and we wanted to print two different things:
print(len(text1_clean))\nprint(len(set(text1_clean)))\n
If everything went right, you should have the same length as before, but a smaller number of unique words. That makes sense since we did not remove any word, we only changed some of them.
\nNow if we were to calculate lexical density, we would be looking at how many word stems with semantic content are represented in Moby Dick, which is a different question than the one in our first analysis of lexical density.
\nWhy don't you try that by yourself? Try to remember how to calculate lexical density without looking back first. It is ok if you have forgotten.
\nNow let's have a look at the words Melville uses in Moby Dick. We'd like to look at all of the types, but not necessarily all of the tokens. We will order this set so that it is in an order we can handle. In the next cell, type:
\nsorted(set(text1_clean))[:30]\n
Sorted
+ set
should give us a list of list of all the words in Moby Dick in alphabetical order, but we only want to see the first ones. Notice how there are some words we wouldn't have expected, such as 'abandon', 'abandoned', 'abandonedly', and 'abandonment'. This process is far from perfect, but it is useful. However, depending on your goal, a different process, like stemming
might be better.
The code to implement this and view the output is below:
\nfrom nltk.stem import PorterStemmer\nporter_stemmer = PorterStemmer()\n
The Porter is the most common Stemmer. Let's see what stemming does to words and compare it with lemmatizers:
\nprint(porter_stemmer.stem('berry'))\nprint(porter_stemmer.stem('berries'))\nprint(wordnet_lemmatizer.lemmatize(\"berry\"))\nprint(wordnet_lemmatizer.lemmatize(\"berries\"))\n
Stemmer doesn't look so good, right? But how about checking how stemmer handles some of the words that our lemmatized \"failed\" us?
\nprint(porter_stemmer.stem('abandon'))\nprint(porter_stemmer.stem('abandoned'))\nprint(porter_stemmer.stem('abandonedly'))\nprint(porter_stemmer.stem('abandonment'))\n
Still not perfect, but a bit better. So the question is, how to choose between stemming and lemmatizing? As many things in text analysis, that depends. The best way to go is experimenting, seeing the results and chosing the one that better fits your goals.
\nAs a general rule, stemming is faster while lemmatizing is more accurate (but not always, as we just saw). For academics, usually the choice goes for the latter.
\nAnyway, let's stem our text with the Porter Stemmer:
\nt1_porter = []\nfor t in text1_clean:\n t_stemmed = porter_stemmer.stem(t)\n t1_porter.append(t_stemmed)\n
Or, if we want a faster way:
\nt1_porter = [porter_stemmer.stem(t) for t in text1_clean]\n
And let's check the results:
\nprint(len(set(t1_porter)))\nprint(sorted(set(t1_porter))[:30])\n
A very different list of words is produced. This list is shorter than the list produced by the lemmatizer, but is also less accurate, and some of the words will completely change their meaning (like 'berry' becoming 'berri').
\nNow that we've seen some of the differences between both, we will proceed using our lemmatized corpus, which we saved as \"text1_clean\":
\nmy_dist = FreqDist(text1_clean)\n
If nothing happened, that is normal. Check to make sure it is there by calling for the type of the \"my_dist\" object.
\ntype(my_dist)\n
The result should say it is a nltk probability distribution (nltk.probability.FreqDist
). It doesn't matter too much right now what that is, only that it worked. We can now plot this with the matplotlib function, plot
. We want to plot the first 20 entries of the my_dist object.
my_dist.plot(20)\n
\nWe've made a nice image here, but it might be easier to comprehend as a list. Because this is a special probability distribution object we can call the most_common
on this, too. Let's find the twenty most common words:
my_dist.most_common(20)\n
What about if we are interested in a list of specific words—perhaps to identify texts that have biblical references. Let's make a (short) list of words that might suggest a biblical reference and see if they appear in Moby Dick. Set this list equal to a variable:
\nb_words = ['god', 'apostle', 'angel']\n
Then we will loop through the words in our cleaned corpus, and see if any of them are in our list of biblical words. We'll then save into another list just those words that appear in both.
\nmy_list = []\nfor word in b_words:\n if word in text1_clean:\n my_list.append(word)\n else:\n pass\n
And then we will print the results.
\nprint(my_list)\n
You can obviously do this with much larger lists and even compare entire novels if you wish, though it would take a while with this approach. You can use this to get similarity measures and answer related questions.
", - "order": 8 - } - }, - { - "model": "lesson.lesson", - "pk": 1047, - "fields": { - "title": "Make Your Own Corpus", - "created": "2020-07-09T16:41:02.244Z", - "updated": "2020-07-09T16:41:02.244Z", - "workshop": 153, - "text": "Now that we have seen and implemented a series of text analysis techniques, let's go to the Internet to find a new text. You could use something such as historic newspapers, or Supreme Court proceedings, or use any txt file on your computer. Here we will use Project Gutenberg. Project Gutenberg is an archive of public domain written works, available in a wide variety of formats, including .txt. You can download these to your computer or access them via the url. We'll use the url method. We found Don Quixote in the archive, and will work with that.
\nThe Python package, urllib, comes installed with Python, but is inactive by default, so we still need to import it to utilize the functions. Since we are only going to use the urlopen function, we will just import that one.
\nIn the next cell, type:
\nfrom urllib.request import urlopen\n
The urlopen
function allows your program to interact with files on the internet by opening them. It does not read them, however—they are just available to be read in the next line. This is the default behavior any time a file is opened and read by Python. One reason is that you might want to read a file in different ways. For example, if you have a really big file—think big data—you might want to read line-by-line rather than the whole thing at once.
\nNow let's specify which URL we are going to use. Though you might be able to find Don Quixote in the Project Gutenberg files, please type this in so that we are all using the same format (there are multiple .txt files on the site, one with utf-8 encoding, another with ascii encoding). We want the utf-8 encoded one. The difference between these is beyond the scope of this tutorial, but you can check out this introduction to character encoding from The World Wide Web Consortium (W3C) if you are interested.
\nSet the URL we want to a variable:
\nmy_url = \"http://www.gutenberg.org/files/996/996-0.txt\"\n
We still need to open the file and read the file. You will have to do this with files stored locally as well. (in which case, you would type the path to the file (i.e., data/texts/mytext.txt
) in place of my_url
)
file = urlopen(my_url)\nraw = file.read()\n
This file is in bytes, so we need to decode it into a string. In the next cell, type:
\ndon = raw.decode()\n
Now let's check on what kind of object we have in the \"don\" variable. Type:
\ntype(don)\n
This should be a string. Great! We have just read in our first file and now we are going to transform that string into a text that we can perform NLTK functions on. Since we already imported nltk at the beginning of our program, we don't need to import it again, we can just use its functions by specifying nltk
before the function. The first step is to tokenize the words, transforming the giant string into a list of words. A simple way to do this would be to split on spaces, and that would probably be fine, but we are going to use the NLTK tokenizer to ensure that edge cases are captured (i.e., \"don't\" is made into 2 words: \"do\" and \"n't\"). In the next cell, type:
don_tokens = nltk.word_tokenize(don)\n
You can check out the type of don_tokens
using the type()
function to make sure it worked—it should be a list. Let's see how many words there are in our novel:
len(don_tokens)\n
Since this is a list, we can look at any slice of it that we want. Let's inspect the first ten words:
\ndon_tokens[:10]\n
That looks like metadata—not what we want to analyze. We will strip this off before proceeding. If you were doing this to many texts, you would want to use Regular Expressions. Regular Expressions are an extremely powerful way to match text in a document. However, we are just using this text, so we could either guess, or cut and paste the text into a text reader and identify the position of the first content (i.e., how many words in is the first word). That is the route we are going to take. We found that the content begins at word 315, so let's make a slice of the text from word position 315 to the end.
\ndq_text = don_tokens[315:]\n
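\nIf you did want to automate this step across many texts, a Regular Expression could locate the boundary for you instead of counting words by hand. The sketch below is an illustration rather than part of the lesson; it assumes the file uses Project Gutenberg's usual \"*** START OF ...\" header line, whose exact wording varies between editions, and the variable name dq_text_auto is made up for the example.
\nimport re\n# look for the Project Gutenberg start-of-text marker (wording varies by edition)\nmatch = re.search(r\"[*]{3} START OF (THIS|THE) PROJECT GUTENBERG EBOOK\", don)\nif match:\n    dq_text_auto = nltk.word_tokenize(don[match.end():])\n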
Finally, if we want to use the NLTK specific functions:
\n- concordance
\n- similar
\n- dispersion_plot
\n- or others from the NLTK book
\nwe would have to make a specific NLTK Text
object.
dq_nltk_text = nltk.Text(dq_text)\n
If we wanted to use the built-in Python functions, we can just stick with our list of words in dq_text
. Since we've already covered all of those functions, we are going to move ahead with cleaning our text.
\nJust as we did earlier, we are going to remove the stopwords based on a list provided by NLTK, remove punctuation and capitalization, and lemmatize the words. You can do it one by one as we did before, and that is totally fine. You can also merge some of the steps as you see below.
\n1. Lowercase, remove punctuation and stopwords
\ndq_clean = []\nfor w in dq_text:\n if w.isalpha():\n if w.lower() not in stops:\n dq_clean.append(w.lower())\nprint(dq_clean[:50])\n
2. Lemmatize
\nfrom nltk.stem import WordNetLemmatizer\nwordnet_lemmatizer = WordNetLemmatizer()\ndq_lemmatized = []\nfor t in dq_clean:\n dq_lemmatized.append(wordnet_lemmatizer.lemmatize(t))\n
From here, you could perform all of the operations that we did after cleaning our text in the previous session. Instead, we will perform another type of analysis: part-of-speech (POS) tagging.
", - "order": 9 - } - }, - { - "model": "lesson.lesson", - "pk": 1048, - "fields": { - "title": "Part-of-Speech Tagging", - "created": "2020-07-09T16:41:02.251Z", - "updated": "2020-07-09T16:41:02.251Z", - "workshop": 153, - "text": "Note that we are going to use the pre-cleaned, dq_text
object for this section.
\nPOS tagging is going through a text and identifying which part of speech each word belongs to (i.e., Noun, Verb, or Adjective). Every word belongs to a part of speech, but some words can be confusing.
\n- Floyd is happy.
\n- Happy is a state of being.
\n- Happy has five letters.
\n- I'm going to Happy Cat tonight.
\nTherefore, part of speech is as much related to the word itself as its relationship to the words around it. A good part-of-speech tagger takes this into account, but there are some impossible cases as well:
\n- Wanda was entertaining last night.
\nPart-of-speech tagging can be done very simply, with a very small tag set, or in a very complex way, with a much more elaborate tag set. We are going to implement a compromise and use a medium-sized tag set, the Penn Tree Bank POS Tag Set.
\nThis is the tag set that is pre-loaded into NLTK. When we call the tagger, we expect it to return an object with the word and the tag associated. Because POS tagging is dependent upon the stop words, we have to use a text that includes the stop words. Therefore, we will go back to using the dq_text
object for this section. Let's try it out. Type:
dq_tagged = nltk.pos_tag(dq_text)\n
Let's inspect what we have:
\nprint(dq_tagged[:10])\n
This is a list of ordered tuples. (A tuple is like a list, but can't be changed once it is created.) Each element in the list is a pairing of (word, POS-tag)
. (Tuples are denoted with parentheses, rather than square brackets.) This is great, but it is very detailed. I would like to know how many Nouns, Verbs, and Adjectives I have.
\nFirst, I'll make an empty dictionary to hold my results. Then I will go through this list of tuples and count the number of times each tag appears. Every time I encounter a new tag, I'll add it to a dictionary and then increment by one every time I encounter that tag again. Let's see what that looks like in code:
\ntag_dict = {}\n# For every word/tag pair in my list,\nfor (word, tag) in dq_tagged:\n if tag in tag_dict:\n tag_dict[tag]+=1\n else:\n tag_dict[tag] = 1\n
Now let's see what we got:
\ntag_dict\n
This would be better with some order to it, but dictionaries don't keep their entries in any sorted order. When we google \"sort dictionaries python\" we find a solution in our great friend Stack Overflow. Even though we cannot sort a dictionary, we can get a representation of a dictionary that is sorted. Don't worry too much about understanding the following code, as it uses things we have not discussed and that are out of the scope of this course. It is useful to see how we can reuse pieces of code even when we don't fully understand them.
\nNow let's do it and find out what the most common tag is.
\ntag_dict_sorted = sorted(tag_dict.items(),\n reverse=True,\n key=lambda kv: kv[1])\nprint(tag_dict_sorted)\n
Now check out what we have. It looks like NN is the most common tag. We can look up what NN means in the Penn Tree Bank. Looks like NN is a Noun, singular or mass. Great! This information will likely help us with genre classification, or identifying the author of a text, or a variety of other functions.
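\nAs a small, hedged example of putting these tags to work (the variable name dq_nouns is just for illustration), you could pull out only the nouns (every Penn Tree Bank noun tag starts with NN) using a list comprehension:
\ndq_nouns = [word for (word, tag) in dq_tagged if tag.startswith(\"NN\")]\nprint(dq_nouns[:20])\n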
", - "order": 10 - } - }, - { - "model": "lesson.lesson", - "pk": 1049, - "fields": { - "title": "Conclusion", - "created": "2020-07-09T16:41:02.255Z", - "updated": "2020-07-09T16:41:02.255Z", - "workshop": 153, - "text": "At this point, you should have a familiarity with what is possible with text analysis, and some of the most important functions (i.e., cleaning and part-of-speech tagging). Yet, this tutorial has only scratched the surface of what is possible with text analysis and natural language processing. It is a rapidly growing field, if you are interested, be sure to work through the online NLTK Book as well as peruse the resources in the Zotero Library.
\nLet's compare the lexical density of Moby Dick with Sense and Sensibility. Make sure to:
\nThe command line is a text-based way of interacting with your computer. You may hear it called different names, such as the terminal, the shell, or bash. In practice, you can use these terms interchangeably. (If you're curious, though, you can read more about them in the glossary.) The shell we use (whether terminal, shell, or bash) is a program that accepts commands as text input and converts commands into appropriate operating system functions.
\nThe command line (of computers today) receives these commands as text that is typed in.
\nFor those of us comfortable reading and writing, the idea of \"text-based\" in the context of computers can seem a bit strange. As we start to get comfortable typing commands to the computer, it's important to distinguish \"text\" from word processed, desktop publishing (think Microsoft Word or Google Docs) in which we use software that displays what we want to produce without showing us the code the computer is reading to render the formatting. Plain text has the advantage of being manipulable in different contexts.
\nLet's take a quick moment to discuss text and text editors.
", - "order": 1 - } - }, - { - "model": "lesson.lesson", - "pk": 1051, - "fields": { - "title": "Text editors", - "created": "2020-07-09T16:41:03.584Z", - "updated": "2020-07-09T16:41:03.584Z", - "workshop": 154, - "text": "Before we explain which program we'll be using for editing text, we want to give a general sense of this \"text\" we keep mentioning. For those of us in the humanities, whether we follow literary theorists who read any object as a \"text\" or we dive into philology, paleography, codicology or any of the fields David Greetham lays out in Textual Scholarship, \"text\" has its specific meanings. As scholars working with computers, we need to be aware of the ways plain text and formatted text differ. Words on a screen may have hidden formatting. Many of us grew up using Microsoft Word and don't realize how much is going on behind the words shown on the screen. For the purposes of communicating with the computer and for easier movement between different programs, we need to use text without hidden formatting.
\n
\nUsers with visual disabilities, click here to dowload the Word file.
\nIf you ask the command line to read that file, this Word .docx file will look something like this
\n
\nUsers with visual disabilities, click here to dowload the text file.
\nWord documents which look like \"just words!\" are actually comprised of an archive of extensible markup language (XML) instructions that only Microsoft Word can read. Plain text files can be opened in a number of different editors and can be read within the command line.
\nFor the purposes of communicating with machines and between machines, we need characters to be as flexible as possible. Plain text include characters of readable material but not graphical representation.
\nAccording to the Unicode Standard,
\n\n\nPlain text is a pure sequence of character codes; plain Unicode-encoded text is therefore a sequence of Unicode character codes.
\nPlain text has two main properties in regard to rich text:
\nplain text is the underlying content stream to which formatting can be applied. Plain text is public, standardized, and universally readable.
\nPlain text shows its cards—if it's marked up, the markup will be human readable. Plain text can be moved between programs more fluidly and can respond to programmatic manipulations. Because it is not tied to a particular font or color or placement, plain text can be styled externally.
\nA counterpoint to plain text is rich text (sometimes denoted by the Microsoft rich text format .rtf file extension) or \"enriched text\" (sometimes seen as an option in email programs). In rich text files, plain text is elaborated with formatting specific to the program in which they are made.
\n
An important tool for programming and working in the command line is a text editor. A text editor is a program that allows you to edit plain text files, such as .txt, .csv, or .md. Text editors are not used to edit rich text documents, such as .docx or .rtf, and rich text editors should not be used to edit plain text files. This is because rich text editors will add many invisible special characters that will prevent programs from running and configuration files from being read correctly.
\nWhile it doesn't really matter which text editor you choose, you should try to become comfortable with at least one text editor.
\nChoosing a text editor has as much to do with personality as it does with functionality. Graphical user interfaces (GUIs), user options, and \"hackability\" vary from program to program.
\nFor our workshops, we will be using Visual Studio Code. Not only is Visual Studio Code free and open source, but it is also consistent across OSX, Windows, and Linux systems.
\nYou will have downloaded VS Code according to the instructions on the installations page. We won't be using the editor a lot in this tutorial, so don't worry about getting to know the editor now. In later workshops we will discuss syntax highlighting and version control, which Visual Studio Code supports. For now we will get back to working in the command line itself.
", - "order": 2 - } - }, - { - "model": "lesson.lesson", - "pk": 1052, - "fields": { - "title": "Why is the command line useful?", - "created": "2020-07-09T16:41:03.590Z", - "updated": "2020-07-09T16:41:03.590Z", - "workshop": 154, - "text": "Initially, for some of us, the command line can feel a bit unfamiliar. Why step away from a point-and-click workflow? By using the command line, we move into an environment where we have more minute control over each task we'd like the computer to perform. Instead of ordering your food in a restaurant, you're stepping into the kitchen. It's more work, but there are also more possibilities.
\nThe command line allows you to...
\n- Easily automate tasks such as creating, copying, and converting files.
\n- Set up your programming environment.
\n- Run programs you create.
\n- Access the (many) programs and utilities that do not have graphical equivalents.
\n- Control other computers remotely.
\nIn addition to being a useful tool in itself, the command line gives you access to a second set of programs and utilities and is a complement to learning programming.
\nWhat if all these cool possibilities seem a bit abstract to you right now? That's alright! On a very basic level, most uses of the command line are about showing information that the computer has, or modifying or making things (files, programs, etc.) on the computer.
\nIn the next section, we'll make this a little more clear by getting started with the command line.
", - "order": 3 - } - }, - { - "model": "lesson.lesson", - "pk": 1053, - "fields": { - "title": "Getting to the command line", - "created": "2020-07-09T16:41:03.594Z", - "updated": "2020-07-09T16:41:03.594Z", - "workshop": 154, - "text": "If you're using macOS:
\n1. Click the Spotlight Search button (the magnifying glass) in the top right of your desktop.
\n2. Type \"terminal\" into the bar that appears.
\n3. Select the first item that appears in the list.
\n4. When the Terminal pops up, you will likely see either a window with black text over white background or colored text over a black background.
\n
\nWhen you see the $
, you're in the right place. We call the $
the command prompt; the $
lets us know the computer is ready to receive a command.
\nYou can change the color of your Terminal or BashShell background and text by selecting Shell
from the top menu bar, then selecting a theme from the menu under New Window
.
\nBonus points: if you really want to get the groove of just typing instead of pointing and clicking, you can press \"Command (⌘)\" and the space bar at the same time to pull up Spotlight search, start typing \"Terminal,\" and then hit \"Enter\" to open a terminal window. This will pull up a terminal window without touching your mousepad. For super bonus points, try to navigate like this for the next fifteen minutes, or even the rest of this session—it is tricky and sometimes a bit tiring when you start, but you can really pick up speed when you practice!
\nWe won't be using Windows's own non-UNIX version of the command line. We installed Git Bash, following these instructions, so that we can work in the cross-platform Unix command line for this session.
\n1. Look for Git Bash in your programs menu and open.
\n2. If you can't find the git folder, just type \"git bash\" in the search box and select \"git bash\" when it appears.
\n3. Open the program.
\n4. When the terminal pops up, you will likely see either a window with black text over white background or colored text over a black background.You know you're in the right place when you see the $
.
$
$
, which we will refer to as the \"command prompt,\" is the place you type commands you wish the computer to execute. We will now learn some of the most common commands.
\nIn the next section, we'll learn how to navigate the filesystem in the command line.
", - "order": 4 - } - }, - { - "model": "lesson.lesson", - "pk": 1054, - "fields": { - "title": "Navigation", - "created": "2020-07-09T16:41:03.601Z", - "updated": "2020-07-09T16:41:03.602Z", - "workshop": 154, - "text": "Go slow at first and check your spelling!
\nOne of the biggest things you can do to make sure your code runs correctly and you can use the command line successfully is to make sure you check your spelling! Keep this in mind today, this week, and your whole life. If at first something doesn't work, check your spelling! Unlike in human reading, where letters operate simultaneously as atomistic symbols and as complex contingencies (check Johanna Drucker on the alphabet), in coding, each character has a discrete function including (especially!) spaces.
\nKeep in mind that the command line and file systems on macOS and Unix are usually pre-configured as cAsE-pReSeRvInG—so capitalizations also matter when typing commands and file and folder names.
\nAlso, while copying and pasting from this handy tutorial may be tempting to avoid spelling errors and other things, we encourage you not to! Typing out each command will help you remember them and how they work.
\nYou may also see your username to the left of the command prompt $
. Let's try our first command. Type the following and press the enter
key:
$ whoami\n
The whoami
command should print out your username. Congrats, you've executed your first command! This is a basic pattern of use in the command line: type a command, press enter
on your keyboard, and receive output.
OK, we're going to try another command. But first, let's make sure we understand some things about how your computer's filesystem works.
\nYour computer's files are organized in what's known as a hierarchical filesystem. That means there's a top level or \"root\" folder on your system. That folder has other folders in it, and those folders have folders in them, and so on. You can draw these relationships in a tree:
\nUsers\n|\n —— your-username\n |\n —— Applications\n —— Desktop\n —— Documents\n
The root or highest-level folder on macOS is just called /
. We won't need to go in there, though, since that's mostly just files for the operating system. On Windows, the root directory is usually called C:
(More on why C is default on Windows).
\nNote that we are using the word \"directory\" interchangeably with \"folder\"—they both refer to the same thing.
\nOK, let's try a command that tells us where we are in the filesystem:
\n$ pwd\n
You should get output like /Users/your-username
. That means you're in the your-username
directory in the Users
folder inside the /
or root directory. On Windows, your output would instead be C:/Users/your-username
. The folder you're in is called the working directory, and pwd
stands for \"print working directory.\"
\nThe command pwd
won't actually print anything except on your screen. This command is easier to grasp when we interpret \"print\" as \"display.\"
\nOK, we know where we are. But what if we want to know what files and folders are in the your-username
directory, a.k.a. the working directory?
\nTry entering:
\n$ ls\n
You should see a number of folders, probably including Documents
, Desktop
, and so on. You may also see some files. These are the contents of the current working directory. ls
will \"list\" the contents of the directory you are in.
\nWonder what's in the Desktop folder? Let's try navigating to it with the following command:
\n$ cd Desktop\n
The cd
command lets us \"change directory.\" (Make sure the \"D\" in \"Desktop\" is capitalized.) If the command was successful, you won't see any output. This is normal—often, the command line will succeed silently.
\nSo how do we know it worked? That's right, let's use our pwd
command again. We should get:
$ pwd\n/Users/your-username/Desktop\n
Now try ls
again to see what's on your desktop. These three commands—pwd
, ls
, and cd
—are the most commonly used in the terminal. Between them, you can orient yourself and move around.
\nBefore we move on, let's take a minute to navigate through our computer's file system using the command line.
\nIt's important to note that this is the same old information you can get by pointing and clicking displayed to you in a different way.
\nGo ahead and use pointing and clicking to navigate to your working directory—you can get there a few ways, but try starting from \"My Computer\" and clicking down from there. You'll notice that the folder names should match the ones that the command line spits out for you, since it's the same information! We're just using a different mode of navigation around your computer to see it.
\nSo far, we've only performed commands that give us information. Let's use a command that creates something on the computer.
\nFirst, make sure you're in the home directory:
\n$ pwd\n/Users/your-username\n
Let's move to the Desktop folder, or \"change directory\" with cd
:
cd Desktop\n
Once you've made sure you're in the Desktop folder with pwd
, let's try a new command:
touch foo.txt\n
If the command succeeds, you won't see any output. Now move the terminal window and look at your \"real\" desktop, the graphical one. See any differences? If the command was successful and you were in the right place, you should see an empty text file called \"foo.txt\" on the desktop. Pretty cool, right?
\nLet's say you liked that \"foo.txt\" file so much you'd like another! In the terminal window, press the \"up arrow\" on your keyboard. You'll notice this populates the line with the command that you just wrote. You can hit \"Enter\" to create another \"foo.txt,\" (note - touch
command will not overwrite your document nor will it add another document to the same directory, but it will update info about that file.) or you could use your left/right arrows to change the file name to \"foot.txt\" to create something different.
\nAs we start to write more complicated and longer commands in our terminal, the \"up arrow\" is a great shortcut so you don't have to spend lots of time typing.
\nOK, so we're going to be doing a lot of work during the Digital Research Institute. Let's create a project folder in our Desktop so that we can keep all our work in one place.
\nFirst, let's check to make sure we're still in the Desktop folder with pwd
:
$ pwd\n/Users/your-username/Desktop\n
Once you've double-checked you're in Desktop, we'll use the mkdir
or \"make directory\" command to make a folder called \"projects\":
mkdir projects\n
Now run ls
to see if a projects folder has appeared. Once you confirm that the projects folder was created successfully, cd
into it.
$ cd projects\n$ pwd\n/Users/your-username/Desktop/projects\n
foo.txt
file we created earlier.In this section, we'll create a text file that we can use as a cheat sheet. You can use it to keep track of all the awesome commands you're learning.
\nEcho
Instead of creating an empty file like we did with touch
, let's try creating a file with some text in it. But first, let's learn a new command: echo
$ echo \"Hello from the command line\"\nHello from the command line\n
>
)By default, the echo command just prints out the text we give it. Let's use it to create a file with some text in it:
\necho \"This is my cheat sheet\" > cheat-sheet.txt\n
Now let's check the contents of the directory:
\n$ pwd\n/Users/your-username/projects\n$ ls\ncheat-sheet.txt\n
OK, so the file has been created. But what was the >
in the command we used? On the command line, a >
is known as a \"redirect.\" It takes the output of a command and puts it in a file. Be careful, since it's possible to overwrite files with the >
command.
\nIf you want to add text to a file but not overwrite it, you can use the >>
command, known as the redirect and append command, instead. If there's already a file with text in it, this command can add text to the file without destroying and recreating it.
Cat
Let's check if there's any text in cheat-sheet.txt.
\ncat cheat-sheet.txt\nThis is my cheat sheet\n
As you can see, the cat
command prints the contents of a file to the screen. cat
stands for \"concatenate,\" because it can link strings of characters or files together from end to end.
Your cheat sheet is titled cheat-sheet.txt
instead of cheat sheet.txt
for a reason. Can you guess why?
\nTry to make a file titled cheat sheet.txt
and report to the class what happens.
\nNow imagine you're attempting to open a very important data file using the command line that is titled cheat sheet.txt
\nFor your digital best practices, we recommend making sure that file names contain no spaces—you can use creative capitalization, dashes, or underscores instead. Just keep in mind that the macOS and Unix file systems are usually pre-configured as cAsE-pReSeRvInG, which means that capitalization matters when you type commands to navigate between or do things to directories and files.
\nThe challenge for this section will be using a text editor, specifically Visual Studio Code (install guide here), to add some of the commands that we've learned to the newly created cheat sheet. Text editors are programs that allow you to edit plain text files, such as .txt, .py (Python scripts), and .csv (comma-separated values, also known as spreadsheet files). Remember not to use programs such as Microsoft Word to edit text files, since they add invisible characters that can cause problems.
\nSo far, you've learned a number of commands and one special symbol, the >
or redirect. Now we're going to learn another, the |
or \"pipe.\"
\nPipes let you take the output of one command and use it as the input for another.
\nLet's start with a simple example:
\n$ echo \"Hello from the command line\" | wc -w\n5\n
In this example, we take the output of the echo
command (\"Hello from the command line\") and pipe it to the wc
or word count command, adding a flag -w
for number of words. The result is the number of words in the text that we entered.
\nLet's try another. What if we wanted to put the commands in our cheat sheet in alphabetical order?
\nUse pwd
and cd
to make sure you're in the folder with your cheat sheet. Then try:
cat cheat-sheet.txt | sort\n
You should see the contents of the cheat sheet file with each line rearranged in alphabetical order. If you wanted to save this output, you could use a >
to print the output to a file, like this:
cat cheat-sheet.txt | sort > new-cheat-sheet.txt\n
So far the only text file we've been working with is our cheat sheet. Now, this is where the command line can be a very powerful tool: let's try working with a large text file, one that would be too large to work with by hand.
\nLet's download the data we're going to work with:
\nOur data set is a list of public domain items from the New York Public Library. It's in .csv format, which is a plain text spreadsheet format. CSV stands for \"comma separated values,\" and each field in the spreadsheet is separated with a comma. It's all still plain text, though, so we can manipulate the data using the command line.
\nOnce the file is downloaded, move it from your Downloads
folder to the projects
folder on your desktop—either through the command line, or drag and drop in the GUI. Since this is indeed a command line workshop, you should try the former!
\nTo move this file using the command line, you first need to navigate to your Downloads
folder where that file is saved. Then type the mv
command followed by the name of the file you want to move and then the file path to your projects
folder on your desktop, which is where you want to move that file to (note that ~
refers to your home folder):
mv nypl_items.csv ~/Desktop/projects/\n
You can then navigate to that projects
folder and use the ls
command to check that the file is now there.
Try using cat
to look at the data. You'll find it all goes by too fast to get any sense of it. (You can click Control
and C
on your keyboard to cancel the output if it's taking too long.)
\nInstead, let's use another tool, the less
command, to get the data one page at a time:
$ less nypl_items.csv\n...\n
less
gives you a paginated view of the data; it will show you contents of a file or the output from a command or string of commands, page by page.
\nTo view the file contents page by page, you may use the following keyboard shortcuts (that should work on Windows using Git Bash or on macOS terminal):
\nClick the f
key to view forward one page, or the b
key to view back one page.
\nOnce you're done, click the q
key to return to the command line.
\nLet's try two more commands for viewing the contents of a file:
\n$ head nypl_items.csv\n...\n$ tail nypl_items.csv\n...\n
These commands print out the very first (the \"head\") and very last (the \"tail\") sections of the file, respectively.
\nWhen you are navigating in the command line, typing folder and file names can seem to go against the promise of easier communication with your computer. Here comes tab
completion, stage right!
\nWhen you need to type out a file or folder name—for example, the name of that csv file we've been working with: nypl_items.csv—in the command line and want to move more quickly, you can just type out the beginning characters of that file name up until it's distinct in that folder and then click the tab
key. And voilà! Clicking that tab
key will complete the rest of that name for you, and it only works if that file or folder already exists within your working directory.
\nIn other words, anytime in the command line you can type as much of the file or folder name that is unique within that directory, and tab
complete the rest!
If all the text remaining in your terminal window is starting to overwhelm you, you have some options. You may type the clear
command into the command line, or click the command
and k
keys to clear the scrollback. In macOS terminal, clicking the command
and l
keys will clear the output from your most recent command.
We didn't tell you this before, but there are duplicate lines in our data! Two, to be exact. Before we try removing them, let's see how many entries are in our .csv file:
\n$ cat nypl_items.csv | wc -l\n100001\n
This tells us there are 100,001 lines in our file. The wc
tool stands for \"word count,\" but it can also count characters and lines in a file. We tell wc
to count lines by using the -l
flag. If we wanted to count characters, we could use wc -m
. Flags marked with hyphens, such as -l
or -m
, indicate options which belong to specific commands. See the glossary for more information about flags and options.
\nTo find and remove duplicate lines, we can use the uniq
command. Let's try it out:
$ cat nypl_items.csv | uniq | wc -l\n99999\n
OK, the count went down by two because the uniq
command removed the duplicate lines. But which lines were duplicated?
$ cat nypl_items.csv | uniq -d\n...\n
The uniq
command with the -d
flag prints out the lines that have duplicates.
\n
\nSo we've cleaned our data set, but how do we find entries that use a particular term?
\nLet's say I want to find all the entries in our data set that use the term \"Paris.\"
\nHere we can use the grep
command. grep
stands for \"global regular expression print.\" The grep
command processes text line by line and prints any lines which match a specified pattern. Regular expressions are infamously human-illegible commands that use character by character matching to return a pattern. grep
gives us access to the power of regular expressions as we search for text.
$ cat nypl_items.csv | grep -i \"paris\"\n...\n
This will print out all the lines that contain the word \"Paris.\" (The -i
flag makes the command ignore capitalization.) Let's use our wc -l
command to see how many lines that is:
$ cat nypl_items.csv | grep -i \"paris\" | wc -l\n191\n
Here we have asked cat
to read nypl_items.csv, take the output and pipe it into the grep -i
command, which will ignore capitalization and find all instances of the word \"paris.\" We then take the output of that grep
command and pipe it into the word count wc
command with the -l
lines option. The pipeline returns 191
letting us know that Paris (or paris) occurs on 191 lines of our data set.
You've made it through your introduction to the command line! By now, you have experienced some of the power of communicating with your computer using text commands. The basic steps you learned today will help as you move forward through the week—you'll work with the command line interface to set up your version control with git and you'll have your text editor open while writing python scripts and building basic websites with HTML and CSS.
\nNow is a good time to do a quick review!
\nIn this session, we learned:
\n- how to use touch
and echo
to create files
\n- how to use mkdir
to create folders
\n- how to navigate our file structure by cd
(change directory), pwd
(print working directory), and ls
(list)
\n- how to use redirects (>
) and pipes (|
) to create a pipeline
\n- how to explore a comma separated values (.csv) dataset using word and line counts, head
and tail
, and the concatenate command cat
\n- how to search text files using the grep
command
\nAnd we made a cheat sheet for reference!
\nWhen we started, we reviewed what text is—whether plain or enriched. We learned that text editors that don't fix formatting of font, color, and size, do allow for more flexible manipulation and multi-program use. If text is allowed to be a string of characters (and not specific characters chosen for their compliance with a designer's intention), that text can be fed through programs and altered with automated regularity. Text editors are different software than Bash (or Terminal), which is a text-based shell that allows you to interact directly with your operating system giving direct input and receiving output.
\nHaving a grasp of command line basics will not only make you more familiar with how your computer and basic programming work, but it will also give you access to tools and communities that will expand your research.
", - "order": 7 - } - }, - { - "model": "lesson.challenge", - "pk": 245, - "fields": { - "lesson": 1054, - "title": "", - "text": "Use the three commands you've just learned—pwd
, ls
and cd
—eight (8) times each. Go poking around your Photos folder, or see what's so special about that root /
directory. When you're done, come back to the home folder with
cd ~\n
(That's a tilde, on the top left of your keyboard.) One more command you might find useful is
\ncd ..\n
which will move you one directory up in the filesystem. That's a cd
with two periods after it.
Try and create a sub-folder and file on your own!
" - } - }, - { - "model": "lesson.challenge", - "pk": 247, - "fields": { - "lesson": 1056, - "title": "", - "text": "You could use the GUI to open your Visual Studio Code text editor—from your programs menu, via Finder or Applications or Launchpad in Mac OSX, or via the Windows button in Windows—and then click \"File\" and then \"Open\" from the drop-down menu and navigate to your Desktop folder and click to open the cheat-sheet.txt file.
\nOr, you can open that specific cheat-sheet.txt file in the Visual Studio Code text editor directly from the command line! Let's try that by using the code
command followed by the name of your file in the command line.
Once you've got your cheat sheet open in the Visual Studio Code text editor, type to add the commands we've learned so far to the file. Include descriptions about what each command does. Remember, this cheat sheet is for you. Write descriptions that make sense to you or take notes about questions.
\nSave the file.
\nOnce you're done, check the contents of the file on the command line with the cat
command followed by the name of your file.
Use the commands you've learned so far to create a new version of the nypl_items.csv
file with the duplicated lines removed. (Hint: redirects are your friend.)
Use the grep
command to explore our .csv file a bit. What areas are best covered by the data set?
Type pwd
to see where on your computer you are located
\nType cd name-of-your-folder
to enter a subfolder
\nType ls
to see the content of that folder
\nType cd ..
to leave that folder
\nType pwd
to make sure you are back to the folder where you wish to be
\nType cd ~
to go back to your home folder
\nType pwd
to make sure you are in the folder where you wish to be
\nType cd /
to go back to your root folder
\nType ls
to see the content of folder you are currently in
\nType pwd
to make sure you are in the folder where you wish to be
\nType cd name-of-your-folder
to enter a subfolder
Type pwd
to see where on your computer you are located. If you are not in the \"projects\" folder we just created, navigate to that folder using the commands you learned in the previous lesson
\nType mkdir name-of-your-subfolder
to create a subfolder
\nType cd name-of-your-folder
to navigate to that folder
\nType challenge.txt
to create a new text file
\nType ls
to check whether you created the file correctly
$ code cheat-sheet.txt\n
\n ```console
\n$ cat cheat-sheet.txt
\nMy Institute Cheat Sheet
ls
\nlists files and folders in a directory
\ncd ~
\nchange directory to home folder
\n...
\n```
\nType pwd
to see where on your computer you are located. If you are not in the \"projects\" folder we just created, navigate to that folder using the commands you learned in the previous lesson
\nType ls
to check whether the file nypl_items.csv
is in your projects folder
\nType cat nypl_items.csv | uniq -d > new_nypl_items.csv
to create a new version of the nypl_items.csv
file with the duplicated lines removed.
If you want to get a little more milage out of the grep command, refer to this tutorial on grep and regular expressions. Regular expressions (or regex) provide methods to search for text in more advanced ways, including specific wildcards, matching ranges of characters such as letters and numbers, and detecting features such as the beginning and end of lines. If you want to experiment with regular expressions in an easy-to-use environment, numerous regex test interfaces are available from a simple google search, such as RegExr, which includes a handy cheat sheet.
\nMost digital projects come to an end at some point, in one way or another. We either simply stop working on them, or we forget about them, or we move on to something else. Few digital projects have an end \"form\" in the way that we think of a monograph. We rarely think of digital scholarship in its \"done\" form, but sooner or later even if they're not \"finished\"—so to speak—at some point, these projects end.
\nDone can take many different shapes: \n* it can morph into something new;\n* it can be retired;\n* it can be archived in a repository;\n* it can be saved on some form of storage media;\n* it can run out of funding; \n* and sometimes you are done with it!
\nSo it's helpful to think about what you want \"done\" to look like before you begin, because then you always have a sense of what will make a satisfactory ending to the work you're about to embark on.
\n\n # Identifying Audiences, Constituencies, and Collaborators
\nProjects typically satisfy more than one audience's need. The key to identifying a well-defined audience is research and creating several narrow profiles
\nIf you are working on a project that is institutionally based (such as creating a platform, creating a resource, or building a teaching tool), you may have institutaional partners who have a stake in your project's success. It's a good idea to identify these folks and consider their interests and needs as well.
\nPossible stakeholders include: your library, colleagues, IT division, academic program, a center, or institute who shares your mission and/or goals.
\nExample of a \"stakeholder\":
Conducting an in-depth environmental scan and literature review early in the planning process is a critical step to see if there are existing projects that are similar to your own or that may accomplish similar goals to your potential project. Sometimes, the planning process stops after the scan because you find that someone has already done it! Typically, a scan is useful in articulating and justifying the \"need\" for your research OR to justify your choice of one technology in lieu of others. Performing an environmental scan early and reviewing and revising it periodically will go a long way to help you prove that your project fills a current need for an actual audience.
\nSuccessful project proposals demonstrate knowledge of the ecosystem of existing projects in your field, and the field's response to those projects. Scans often help organizations identify potential collaborators, national intitiatives, publications, articles, or professional organizations, which in turn can demonstraate a wider exigency for your project. Following a preliminary scan, you should be able to explain why your project is important to the field, what it provides that does not currently exist, and how your project can serve as a leader or example to other organizations in such a way that they can put your findings to new issue.
\nBelow are suggestions for finding similar projects and initiatives in and outside of your field:
\nThe key to the environmental scan is to see what a wider community is already up to. How does your project fit into the ongoing work of others in your field? What about in a related field that addresses a similar question from another perspective? Is someone already working on a similar question?
\n1. Brainstorm where you might go to look for digital projects in your field that use emerging or new forms of technology. Try to list 3 places you might look to see how others in your field are adapting their methods to use new digital tools.
\n2. What technologies/methods do most people use in your field, if any, for capturing, storing, exploring/analyzing, or displaying their data? Why do they tend to use it? Is there a reason why you want to use the same technologies as your colleagues? What are the benefits of doing things differently?
\n3. Does your project fill a need or stake new methodological ground? How do you know?
\n4. If there aren't any technologies that do exactly what you were hoping for, has anyone else run into this problem? How did they solve it? Will you need to create a new tools or make significant changes to an existing one to accomplish your goal?
\n5. Once you have gathered information about what is \"out there,\" what are the limits of what you are willing to change about your own project in response? How will you know if you have stretched beyond the core objectives of your own research project?
", - "order": 3 - } - }, - { - "model": "lesson.lesson", - "pk": 1060, - "fields": { - "title": "Resource Assessment", - "created": "2020-07-09T16:41:04.882Z", - "updated": "2020-07-09T16:41:04.882Z", - "workshop": 155, - "text": "The next step in our process is figuring out what resources you have available to you and what you still need in order to accomplish your project's objectives.
\nDo you have the dataset you need to do your project? Finding, cleaning, storing, managing changes in, and sharing your data is an often overlooked but extremely important part of designing your project. Successfully finding a good dataset means that you should keep in mind: Is the dataset the appropriate size and complexity to help address your project's goals? Finding, using, or creating a good dataset is a core part of your project's long-term success.
\nWhat data resources do you have at your disposal? What do you still need? What steps do you need to take during the course of your project in order to work with the dataset now that you have a general sense of what the data needs to look like if you are working with either textual or numeric data?
\nHave: basic knowledge of git and python and some nltk
\nNeed: I need a more powerful computer, to learn how to install and use Beautiful Soup, and to get help cleaning the data. I will also need to learn about the D3.js library.
\nLooking back at the Audiences worksheet, review which of your audiences were invested in your work. Who can you draw on for support? Consider the various roles that might be necessary for the project. Who will fill those roles? \n* design\n* maintenance and support\n* coding/programming\n* outreach / documentation\n* project management
\nOutreach can take many different forms, from presenting your research at conferences and through peer-reviewed scholarly publications, but also through blog posts, Twitter conversations, forums, and/or press releases. The key to a good outreach plan is to being earlier than you think is necessary, and give your work a public presence (via Tumblr, Twitter, website, etc). You can use your outreach contacts to ask for feedback and input as well as share challenges and difficult decisions. \n* Will you create a website for your project? \n* How will you share your work? \n* Will you publish in a traditional paper or in a less-traditional format? \n* Whom will you reach out to get the word out about your work? \n* Is there someone at your college who can help you to publicize your accomplishments? \n* Will you have a logo? Twitter account? Tumblr page? Why or why not? \n* Can you draw on your colleagues to help get the word out about your work? \n* What information could you share about your project at its earliest stages? \n* Does your project have a title?
\nYou will need to come up with a plan for how you are going to manage the \"data\" created by your project. Data management plans, now required by most funders, will ask for you to list all the types of data and metadata created over the duration of the project and then account for the various manners by which you will account for various versions, make the datasets available openly (if possible) and share your data.
\nSustainability plans require detailing what format files will be in and accounting for how those files and your data will continue to be accessible to you and/or to your audience or a general public long after the project's completion.
\nLibrarians are your allies in developing a sound data management and sustainability plan.
\nVery quickly, try to think of all the different types of data your project will involve. \n* Where will you store your data? \n* Is your software open source? \n* What is the likelihood that your files will remain usable? \n* How will you keep track of your data files? \n* Where will the data live after the project is over?
", - "order": 6 - } - }, - { - "model": "lesson.lesson", - "pk": 1063, - "fields": { - "title": "Effective Partnerships", - "created": "2020-07-09T16:41:04.893Z", - "updated": "2020-07-09T16:41:04.893Z", - "workshop": 155, - "text": "After brainstorming your project ideas and assessing your available resources, it is time to scope out potential partners to help fill in gaps and formalize relationships.
\nplease keep in mind that each project is different. This outline offers suggestions and lessons learned from successful and less successful collaborations. while each project is unique in the way responsibilities are shared, perhaps one universal attribute of successful partnerships is mutual respect. The most successful collaborations are characterized by a demonstrated respect for each partners's time, work, space, staff, or policies in words and actions.
\nOnce you know where you need help, start thinking about who you know who might have those skills, areas of expertise, resources, and interest. \n* Partnerships should be selected on the basis of specific strengths. \n* If you don't know someone who fits the bill, can someone you know introduce you to someone you would like to know? What are some ways of finding someone with skills you don't have if you don't know anyone with those skills?
\nWhen preparing a proposal, you will need mentors, collaborators, or other interested parties to write a strong letter of support for your project that will help your proposal stand out to the reviewers. Some funders want letters from all project participants.
\nIt is important to respect people’s time when asking them for a letter by showing that you’ve done your research and that you have some grant materials to share with them. Good letters demonstrate some knowledge of the project and recognition of its impact if funded.
\nFollow these steps when asking for a support letter and for specific types of assistance during the life of the grant, and you should receive a good letter in return.\n* One month before grant deadline, begin brainstorming candidates for letters of support and note which collaborators are required to submit letters of commitment and support. \n* Start asking supporters at least two weeks in advance of grant deadline, because they will also have deadlines and other work competing for their work hours. You may find some folks are on leave at the time you inquire, be sure to have back-ups on your list. \n* Email potential supporters, collaborators:
\n * State why, specifically, you are asking Person A for support;
\n * Be specific about what you are asking Person A to do over the scope of the grant, if anything, such as participate in 3 meetings, 2 phone calls over 18 months; or agree to review the project and provide feedback one month before official launch;
\n * Provide any information about compensation, especially when asking someone to participate (ie, there will be a modest honorarium to recognize the time you give to this project of $xxx);
\n 8 Tell supporters what exactly you need to complete the grant application, in what format, and by what date (ie, a 2-page CV in PDF and letter of support on letterhead by next Friday).\n* Attach materials that will be helpful for them when writing the letter.
\n * Provide a short project summary that includes the project goals, deliverables, and work plan from the grant proposal draft;
\n * Include a starter letter containing sample text that references that person’s or institution’s role and why they are supporting the project.
", - "order": 7 - } - }, - { - "model": "lesson.lesson", - "pk": 1064, - "fields": { - "title": "Finding Funding", - "created": "2020-07-09T16:41:04.898Z", - "updated": "2020-07-09T16:41:04.898Z", - "workshop": 155, - "text": "Now that you have started to form:\n* a more refined project idea;\n* a wider awareness of the ecosystem of existing projects in your field;\n* a sense of the national, local, or institutional demand for your project;\n* and a clearer sense of the resources at your disposal
\n... the next step is to find an appropriate funding source. Below you will find some suggestions as to where to begin the search for funding. As you look for possible funders, below are some guidelines for the process:
\n1. Check federal, state, and local grant-making agencies, and local foundations for possibility of grants.
\n * Federal agencies list all of their available grants on http://grants.gov.
\n * States also have opportunities for grants, such as state humanities councils.
\n * Private foundations are also possible areas to look. The following may prove useful:
\n * The Foundation Center: [http://foundationcenter.org] (http://foundationcenter.org)
\n * A Directory of State and Local Foundations:
\n [http://foundationcenter.org/getstarted/topical/sl_dir.html] (http://foundationcenter.org/getstarted/topical/sl_dir.html)
\n * The Council on Foundations Community Foundations List
\n http://www.conf.org/whoweserve/community/resources/index.cfm?navitemNumber=15626#locator
\n * The USDA offers a valuable Guide to Funding Resources [https://www.nal.usda.gov/ric/guide-to-funding-resources] (https://www.nal.usda.gov/ric/guide-to-funding-resources)
\n2. Check your institution’s eligibility for a potential grants before beginning the application process. Eligibility requirements and restrictions are often found in grant guidelines.
\n3. Review the types of projects this program funds, and consider how your project fits with the agency or foundation’s mission and strategic goals.
\n4. Review a potential grant program’s deadlines and requirements (including proposal requirements and format for submission).
\n5. Identify funding levels/maxes, and keep them close at hand as you develop your budget.
\n6. Jenny Furlong Director of the Office of Career Services will be here tomorrow, and she is an excellent resource for those interested in external fellowships.
\nFind one or two grant opportunites in your subject area. Consider also looking for fellowship opportunities.
\nWhat follows is a template for writing a short project proposal that, once developed, will position you to move forward with building partnerships with other institutions or for pursuing funding opportunities. Though this template does not directly reflect a specific grant narrative format, the short project proposal includes important project-development steps that can later form the basis for a wide variety of grant narratives.
\n150 word summary of project: (1 short paragraph)
\nStatement of the conditions that make the project necessary and beneficial for your key audiences (2-3 paragraphs).
\nA brief explanation that combines your environmental scan and your research goals. Why is what you are doing necessary and different in your field—and maybe to more than just scholars in your field. (4-5 paragraphs)
\nRough outline and project calendar that includes project design and evaluation, and possibly a communications plan, depending on the grant with major deliverables (bullet-pointed list of phases and duration):\n* Phase 1 (month/year - month/year):\n* Phase 2 (month/year - month/year):\n* Phase 3 (month/year - month/year):
\nDescription of the why the cooperating institutions and key personnel are well-suited to undertake this work (list of experience and responsibilities of each staff member, and institutional description).
\nIf applicable, describe how this project will live beyond the grant period. Will it continue to be accessible? How so? A data management plan might need to be specified here.
", - "order": 8 - } - }, - { - "model": "lesson.lesson", - "pk": 1065, - "fields": { - "title": "Presentation Template", - "created": "2020-07-09T16:41:04.901Z", - "updated": "2020-07-09T16:41:04.901Z", - "workshop": 155, - "text": "Name:
\nProgram:
\nProject title:
\n2 Sentence abstract:
\nWhat resources do you have now?
\nWhat have you learned this week that will help you?
\nWhat additional support will you need as you take your next steps?
", - "order": 9 - } - }, - { - "model": "lesson.lesson", - "pk": 1066, - "fields": { - "title": "Presentation", - "created": "2020-07-09T16:41:04.909Z", - "updated": "2020-07-09T16:41:04.909Z", - "workshop": 155, - "text": "2 Sentence abstract
\nMy project is going to make every installation seamless. It will make all of your Python dreams come true, your databases tidy, and your Git Hub happy.
\nWhat resources do you have now?
\nWhat have you learned this week that will help you?
\nWhat additional support will you need as you take your next steps?
\ngit add yourlastname.md\n
git commit -m \"my presentation file\"\n
git add images/myfile.jpg\n
git commit -m \"adding an image file\"\n
Let's begin by starting an \"interactive session\" session with Python. This means we will be using Python in the terminal, which is a special space that allows us to run little bits of Python, experimenting and exploring what it can do, without having to save it. Think of this interactive space as a playground. Later on, we will be working with Python in a more robust way, doing what we call saving and executing Python scripts.
\nFor now, though, let's start an interactive session with Python, which is accessed through the terminal.
\nOpen your terminal and type:
\n$ python\n
at the prompt. You should see something like this
\nPython 3.6.3 |Anaconda, Inc.| (default, Oct 13 2017, 12:02:49)\n[GCC 7.2.0] on Linux\nType \"help\", \"copyright\", \"credits\" or \"license\" for more information.\n>>>\n
Unlike the normal $
terminal prompt, the Python prompt looks like this:
>>>\n
These carrots are how you know that you have entered an interactive session with Python. Now you are interacting directly with Python, rather than in the regular terminal. Keep an eye on these carrots, as a common early source of confusion is entering terminal commands into the Python prompt or entering Python commands into the terminal.
\nLet's try a little math at the Python prompt. In the example below, type the text that appears after the Python prompt (the >>>
). The line below is the output that is returned. This will be a standard convention when giving examples using the Python prompt.
>>> 2 + 3\n5\n>>> 14 - 10\n4\n>>> 10 * 10\n100\n>>> 6 / 3\n2\n>>> 21 % 4\n1\n
The first four operations above are addition, subtraction, multiplication, and division, respectively. The last operation is modulo, or mod, which returns the remainder after division.
\nNote the way you interact with Python at the prompt. After entering an expression such as 2 + 3
, Python \"evaluates\" it to a simpler form, 5
, and then prints out the answer for you. This process is called the Read Eval Print Loop, or REPL. Reading takes commands from you, the input is evaluated or run, the result is printed out, and the prompt is shown again to wait for more input. The normal terminal (the one with the $
) is another example of a REPL.
\nThe REPL is useful for quick tests and, later, can be used for exploring and debugging your programs interactively. You might consider it a kind of playground for testing and experimenting with python expressions.
", - "order": 1 - } - }, - { - "model": "lesson.lesson", - "pk": 1068, - "fields": { - "title": "Types", - "created": "2020-07-09T16:41:12.846Z", - "updated": "2020-07-09T16:41:12.846Z", - "workshop": 156, - "text": "Types are classifications that let the computer know how a programmer intends to use a piece of data. You can just think of them as, well, types of data.
\nWe've already seen one type in the last section: the integer. In this section, we'll learn four more: the floating point number, the string, the boolean, and the list.
\nEnter these lines as you see them below:
\n>>> type(1)\n<class 'int'>\n>>> type(1.0)\n<class 'float'>\n>>> type(\"Hello there!\")\n<class 'str'>\n>>> type(True)\n<class 'bool'>\n>>> type([1, 2, 3])\n<class 'list'>\n
Each of these represents a different type:
\nInteger: 1
\nIntegers are whole numbers.
\nFloat: 1.0
\nFloats are numbers with decimals, and are treated a little differently than integers.
\nString: \"Hello there!\"
\nStrings are arbitrary sets of characters, such as letters and numbers. You can think of them as a way to store text.
\nBoolean: True
and False
\nBoolean is a fancy term for values representing \"true\" and \"false,\" or \"truthiness\" and \"falsiness.\"
\nList: [1, 2, 3]
\nA list is an ordered collection of values. You can put any type in a list: [\"hello\", \"goodbye\", \"see ya later\"]
is also a valid list.
\nDon't worry about trying to actively remember these types. We'll be working with each in turn in the following sections.
\ntype()
is a function. You can think of functions in Python in a few different ways:
\n1. A way of doing something in Python.
\n2. A way of saving some code for reuse.
\n3. A way of taking an input, transforming that input, and returning an output. The input goes in the parentheses ()
.
\nThese are all valid ways of thinking about functions. We'll be learning more about functions in later sections.
", - "order": 2 - } - }, - { - "model": "lesson.lesson", - "pk": 1069, - "fields": { - "title": "Variables", - "created": "2020-07-09T16:41:12.861Z", - "updated": "2020-07-09T16:41:12.861Z", - "workshop": 156, - "text": "A variable is a symbol that refers to an object, such as a string, integer, or list. If you're not already at the Python prompt, open your terminal and type python
at the $
. You're in the right place when you see >>>
.
\nTry these commands in order:
\n>>> x = 5\n>>> x\n5\n>>> x + 10\n15\n>>> y = \"hello\"\n>>> y\n'hello'\n>>> y + \" and goodbye\"\n'hello and goodbye'\n
As you can see above, the =
sign lets you assign symbols like x
and y
to data.
\nVariables can be longer words as well:
\n>>> books = ['Gender Trouble', 'Cruising Utopia','Living a\n>Feminist Life']\n>>> books\n['Gender Trouble', 'Cruising Utopia', 'Living a Feminist\n>Life']\n>>> type(books)\n<class 'list'>\n
Variables can have letters, numbers, and underscores, but should start with a letter.
\nIf you are curious about learning more about naming conventions for variables, you can check out the PEP8 style guide's section on Naming Conventions. PEP8 is the widely accepted guide for Python programmers everywhere.
", - "order": 3 - } - }, - { - "model": "lesson.lesson", - "pk": 1070, - "fields": { - "title": "Running scripts", - "created": "2020-07-09T16:41:12.880Z", - "updated": "2020-07-09T16:41:12.880Z", - "workshop": 156, - "text": "So far, you've interacted with Python one line at a time in the REPL. This is what we call the Interactive Mode, which is like a playground for experimenting and exploring different Python expressions, like 2 + 2
or type(\"some stuff\")
. The code that we write in the REPL is not saved after you exit, which means that this space is for running Python expressions and not for writing longer programs.
\nFor the rest of this session, we're going to expand beyond the REPL to write and execute longer programs. To do this, we will begin to work with text editor, where we write out longer Python scripts, and run those scripts from the terminal.
\nThis is a big move, so let's take it slow. The major change is that we will be working across two spaces, the terminal and the text editor, rather than just the terminal alone. We will be writing our scripts into the text editor, and using the terminal to run those scripts.
\nFirst, let's begin with the text editor. Open your text editor of choice (such as VS Code) and create a new file with this line:
\nprint(\"Hello world!\")\n
Save it with the name hello.py
to a known location, such as your desktop. Open your terminal and move to the desktop directory:
$ cd Desktop\n
Once you're in the folder with your hello.py
file, move to the terminal. Do not enter the Python Interactive Mode (the REPL), which is unecessary to run python scripts. Instead, lookout for the $
symbol that lets you know you're in the terminal. If you find yourself in the Interactive mode (>>>
), then exit it with control-D
. You should see the $
symbol, letting you know you're back in the terminal.
\nNow that you're in the terminal, type the following, and press enter:
\n$ python hello.py\nHello world!\n
You should see the text Hello world!
appear as output in the terminal window.
\nCongratulations! You've written your first script. That's kind of a big deal.
\nThere are a couple of important things to note here. First, it bears repeating that you are moving between two different spaces, the text editor and the terminal. You wrote your Python script in the text editor, and used the terminal to run the script. Second, within in the text editor, you included the print()
function because, unlike in the REPL, things aren't automatically printed out when writing scripts. When you're in the text editor, you always need to include the print()
function so that your output will appear in the terminal.
Fundamentally, Python programs are just text files. You can write them in any text editor, like VS Code or Notepad on Windows. When you pass the text file to Python, it runs the code in the file one line at a time. There's nothing special about .py
files—they're just regular text files. This makes them work well with command line tools like Git. The tools you've learned so far—the command line, Git, markdown, grep—are all designed to work well together, and the medium through which they all work is plain text.
Our usual response when seeing an error on a computer screen is a stress response. Our heart rate elevates and, if we cannot do what we were asking the computer to do, our frustration mounts. This is because many errors when interacting with programs are not useful or informative, and because we often have no capability to fix the program in front of us.
\nIn Python, errors are our friends. This might be hard to accept initially, but the errors you see when running Python scripts generally do a good job of pointing you to what's going wrong in your program. When you see an error in Python, therefore, try not to fall into the stress response you may be used to when interacting with your computer normally.
\nIn Python, there are two kinds of errors you will encounter frequently. One appears before the program runs, and the other appears during the execution of a program.
\nsyntax errors - When you ask Python to run a program or execute a line in the REPL, it will first check to see if the program is valid Python code—that is, that it follows the grammatical or syntactical rules of Python. If it doesn't, before the program even runs, you'll see a syntax error printed out to the screen.
\nIn this below example, the syntax error is a common one—mismatched single and double quotes, which is not allowed in Python. You can replicate the below error by opening the REPL (type python
in the command line) and entering the line after the >>>
prompt.
>>> print('This string has mismatched quotes. But Python will help us figure out this bug.\")\n File \"<stdin>\", line 1\n print('This string has mismatched quotes. But Python will help us figure out this bug.\")\n ^\nSyntaxError: EOL while scanning string literal\n
Note the caret (^
) underneath the mismatched quote, helpfully pointing out where the error lies. Similarly, if this error happened when running a script, Python would tell us the filename and the line number for the line on which the error occurs.
\nTraceback errors - These errors occur during the execution of a Python program when the program finds itself in an untenable state and must stop. Traceback errors are often logical inconsistencies in a program that is valid Python code. A common traceback error is referring to a variable that hasn't been defined, as below.
\n>>> print(not_a_variable)\nTraceback (most recent call last):\n File \"<stdin>\", line 1, in <module>\nNameError: name 'not_a_variable' is not defined\n
Traceback errors try to tell you a little about what happened in the program that caused the problem, including the category of error, such as NameError
or TypeError
.
Debugging is a fancy word for fixing problems with a program. Here are some common strategies for debugging a program when first learning Python:
\n- If the error is a syntax error, look at where the caret is pointing.
\n- If the error is a syntax error, pay attention to grammatical features such as quotes, parentheses, and indentation.
\n- If the error is a syntax error, consider reading the program, or the offending line, backward. It's surprising, but this often helps to detect the issue.
\n- If the error is a traceback error, first look at the line where the error occured, then consider the general category of error. What could have gone wrong?
\n- If the error is a name error (NameError), check your spelling.
\n- If the error is a traceback error, try copying the last line of the error and pasting it into Google. You'll often find a quick solution this way.
\n- If you changed the program and expect a different output, but are getting old output, you may not have saved the file. Go back and make sure the file has been correctly saved.
", - "order": 5 - } - }, - { - "model": "lesson.lesson", - "pk": 1072, - "fields": { - "title": "Lists and Loops", - "created": "2020-07-09T16:41:12.969Z", - "updated": "2020-07-09T16:41:12.969Z", - "workshop": 156, - "text": "\nRemember lists? They look like this:
\nbooks = ['Gender Trouble', 'Cruising Utopia', 'Living a Feminist Life']\n
For now, let's just create a list and print it out. In a text editor, our script will look like this:
\nbooks = ['Gender Trouble', 'Cruising Utopia', 'Living a Feminist Life']\nprint(books)\n
Save this to a new file called loop.py
and run it with python loop.py
. You should see the list printed out in the terminal.
\nSo far, we've only learned one function: type()
. Let's try out another:
books = ['Gender Trouble', 'Cruising Utopia', 'Living a Feminist Life']\n# print(books)\nlist_length = len(book)\nprint(list_length)\n
The len()
function returns the number of items in a list or the number of characters in a string.
\nNotice that, if you run the code above, you won't see the books
list printed out. That's because that line has become a comment. If you put a #
(hash or pound) at the beginning of a line, that line will be ignored.
A useful property of a list is the list index. This allows you to pick out an item from within the list by a number starting from zero:
\nprint(books[0]) # Gender Trouble\nprint(books[1]) # Cruising Utopia\n
Note that the first item in the list is item [0]. The second item is item [1]. That's because counting in Python, and in almost all programming languages, starts from 0.
\nYou can print out the last item in a list using negative numbers:
\nprint(books[-1]) # Living a Feminist Life\n
There are many things you can do with list indexing. Let's play around with slicing. Slicing consists of taking a section of a list, using the list index to pick out a range of list items. For example, you could take out the first two items of a list with a slice that begins with 0
and ends with 2
.
\nThe slice syntax consists of square brakets, start point and end point, and a colon to indicate the gap in between. This should print out the first two items of your list.
\nprint(books[0:2])\n
Note a couple of things. First, the start point is inclusive, meaning that Python will include the [0]
item in your range, and the end point is exclusive, so Python won't print the [2]
item. Instead, it will print everything up until that [2]
item.
\nFor ultimate brevity, you can also write this expression as:
\nprint(books[:2])\n
The empty value before the colon allows Python to assume the range starts at the first list item, at [0]
. You can also end the slice with :
, if you want the list range to include all subseuquent items until the end of the list. The example below will print everything from the second item to the end of the list.
print(books[1:])\n
With a list that contains three items total, list slicing might not seem very impressive right now. However, this will become a powerful tool once we get to Text Analysis and start to encounter lists that contain hundreds (or thousands!) of items.
\nWhat if we want to print out each item in the list separately? For that, we'll need something called a loop:
\nbooks = ['Gender Trouble', 'Cruising Utopia', 'Living a Feminist Life']\n# print(books)\nfor book in books:\n print(\"My favorite book is \" + book)\n
What's happening here? This kind of loop is called a \"for\" loop, and tells Python: \"for each item in the list, do something.\" Let's break it down:
\nfor <variable name> in <list name>:\n <do something>\n
Indented code like this is known as a \"code block.\" Python will run the <do something>
code in the code block once for each item in the list. You can also refer to <variable name>
in the <do something>
block.
\nYou can also perform more complicated operations. Let's tackle one in a challenge. But first, a note on naming variables.
\nIn this section, we've discussed books in the context of a list. Why do we use the variable name books
in this section for our list of book names? Why not just use the variable name x
, or perhaps f
?
\nWhile the computer might not care if our list of books is called x
, giving variables meaningful names makes a program considerably easier to read than it would be otherwise. Consider this for loop:
y = ['Gender Trouble', 'Cruising Utopia', 'Living a Feminist Life']\nfor x in y:\n print(x)\n
Which is easier to read, this for loop or the one used in the example?
\nWhen variable names accurately reflect what they represent, and are therefore meaningful, we call them \"semantic.\" Always try to create semantic variable names whenever possible.
", - "order": 6 - } - }, - { - "model": "lesson.lesson", - "pk": 1073, - "fields": { - "title": "Conditionals", - "created": "2020-07-09T16:41:12.987Z", - "updated": "2020-07-09T16:41:12.987Z", - "workshop": 156, - "text": "Conditionals allow programs to change their behavior based on whether some statement is true or false. Let's try this out by writing a script that will give different outputs based on the book titles:
\nrandom_book = \"The Undercommons\"\nif random_book == \"The Undercommons\":\n print(\"This is the correct book\")\nelse:\n print(\"I don't know which book it is! I'm just a little program...\")\n
In our first line, we set a variable random_book
to the string \"The Undercommons,\" representing a random book on our bookshelf. The if
statement checks whether the random book is set to the title \"The Undercommons.\" If it is, the code in the block beneath is executed, so the text \"This is the corrent book will be printed.
\nThe else
statement handles any inputs that aren't \"The Undercommons\"—the program merely prints out that it doesn't know what you should bring. Try this script out both with the variable set to \"The Undercommons\" and the variable set to some other value.
\nWhat if we want our program to handle other books, giving different messages for each one? Other cases after the first if
statement are handled with elif
:
random_book = \"The Undercommons\"\nif random_book == \"The Undercommons\":\n print(\"This is the correct book, well done!\")\nelif random_book == \"Gramophone, Film, Typewriter\":\n print(\"This is not the correct book. Please attempt with another title.\")\nelif random_book == \"Radiant Textuality\":\n print(\"Welp. Try again.\")\nelse:\n print(\"I don't know which book you're talking about! I'm just a little program...\")\n
You can add as many elif
statements as you need, meaning that conditionals in Python have one if
statement, any number of elif
statements, and one else
statement that catches any input not covered by if
or elif
. Over the next sections, we'll work on improving this little application, making it able to handle user input directly.
Note: If you're using Python 2.7, replace all input()
functions in the code below with raw_input()
. You can check your version by running python --version
in the command line.
\nPython allows you to take input directly from the user using the input
function. Let's use it to improve our book application by asking for the book before displaying the output.
random_book = input(\"Which book do you want to read today? \")\nif random_book == \"The Undercommons\":\n print(\"This is the correct book, well done!\")\nelif random_book == \"Gramophone, Film, Typewriter\":\n print(\"This is not the correct book. Please attempt with another name.\")\nelif random_book == \"Radiant Textuality\":\n print(\"Welp. Wrong book. Try again.\")\nelse:\n print(\"I don't know which book you're talking about! I'm just a little program...\")\n
When you run this program, Python should ask you for some input with the prompt \"Which book do you want to read today? \"
(The space before the second \"
makes the prompt look more tidy in the console.) It will then return some advice based on the input. Try running it now.
Okay. Let's make our little book application a little more robust. We are going to create a list of books (remember lists?) that we can then manipulate in all sorts of ways.
\nFirst, create a list with at least three books that are important to your research right now. Shorten the titles if need be. Let's call this list our library
. Remember the proper syntax for creating a list includes square brakets with quotations and commas separating the list items.
library = [\"Orlando\", \"Confessions of the Fox\", \"These Waves of Girls\"]\n
Next, let's sort our library
in alphabetical order. There's a handy method called sort()
for doing just this kind of thing. What's a method, you might ask? Well, methods are very similar to functions, and you'll remember that functions are ways of doing things, like print()
and type()
. Methods are also ways of doing things, but these things are attached to what we call objects in Python. Objects are part of object-oriented programming, and that's definitely not necessary to learn right now. Suffice it to say that methods are just like functions, that is, they are ways of doing things to your data.
\nTo sort the list, use the sort()
method on your list. It should look like this:
library = [\"Orlando\", \"Confessions of the Fox\", \"These Waves of Girls\"]\nlibrary.sort()\nprint(library)\n
What happened here? Let's take it line by line. First, we created a list library
with three items attached to it. Then, we applied the sort()
method to the library list. Finally, we printed the library
, which is now sorted in alphabetical order.
\nYou'll see that we have a couple of new things happening with symbols. First, the period (.
) which we call an operator in Python. The period operator is another part of object-oriented programming, and it basically means that we are applying a task to whatever precedes the period. In this case, we are applying the sort()
method to our library
list. It's kind of like attaching a function to our library
. Second, we have the parenthesis ()
after sort
. When you get more comfortable with programming, you'll see that you can use the parentheses to add what we call arguments that allows us to do more complex things to data. Let's see how an argument works with the append()
method.
\nWhat if we want to add items to the list? We can use the append()
method for that. Try:
library = [\"Orlando\", \"Confessions of the Fox\", \"These Waves of Girls\"]\nlibrary.append(\"La Frontera\")\nprint(library)\n
Here, we added \"La Frontera\"
as an argument to the append()
method, but putting it between the parenthesis. It basically means that we will be appending this specific title to the library list.
\nWhen you print library
, you should see your new book appear at the end of the list. Pretty cool, right? Go ahead and add a couple more books to your list.
\nWhat if you wanted to take out some of the books? We can use pop()
to remove the last item, or \"pop\" it off, from our list.
library = [\"Orlando\", \"Confessions of the Fox\", \"These Waves of Girls\", \"La Frontera\", \"Dawn\"]\nlibrary.pop()\nprint(library)\n
The last item that you added to your list should be missing from the library
when you print the list.
Our library app is working pretty well, but you may have noticed that it's case sensitive:
\nWhat do you want to do with your books today? \nSort\nI don't know what you want me to do!\n
How could we fix our program to handle cases like this? We could add a bunch of new elif
statements, like this:
...\nelif response == \"Sort books\":\n library.sort()\n print(library)\nelif response == \"SORT BOOKS\":\n library.sort()\n print(library)\n...\n
This is a lot of work, and it's a pretty ugly solution. If we wanted to add more cases to our program, we would have to write them in twice every time, and it still wouldn't fix inputs like Sort Books
. The best way to improve our program would be to convert the input to lower case before we send it to our if/else
block.
Even if you're a super rad Python programmer, you're not going to remember every function name or how to do things you might not have touched in awhile. One thing programmers get very good at is googling for answers. In fact, this is arguably the most important skill in modern-day programming. So let's use Google to find out how to convert strings to lower case.
\nLet's try the search term make string lowercase Python
:
\n
\nWhile Google searches change over time, some of your results likely come from a site called Stack Overflow. This is a questions and answers site for programmers that usually has strong answers to questions about Python.
\n
\nOn this Stack Overflow page, take a quick look at the question to make sure it's relevant to your problem. Then scroll down to the answers to find what we're looking for. You may also notice snarky debates -- another \"feature\" of Stack Overflow.
\nAccording to this answer, we can make a string lowercase by adding .lower()
to the end of it, like this:
>>> \"SORT BOOKS\".lower()\n'sort books'\n
OK, that seems to work, even if we don't really know what's going on with that dot. Let's incorporate this transformation into our weather app:
\nlibrary = [\"Orlando\", \"Confessions of the Fox\", \"These Waves of Girls\"]\nresponse = input(\"What do you want to do with your books today? \")\nresponse = response.lower()\nif response == \"sort books\":\n library.sort()\n print(library)\nelif response == \"add a book\":\n library.append(\"La Frontera\")\n print(library)\nelif response == \"remove a book\":\n library.pop()\n print(library)\nelse: \n print(\"I don't know what you want me to do!\")\n
This new script should handle any combination of upper or lowercase characters. The new second line sets the response variable to a new value, response.lower()
, which is a lowercase version of the original input.
\nThere's no shame in googling for answers! Error messages are especially useful to google when you run into them. Keep an eye out for Stack Overflow answers, as they tend to have useful examples. The official Python documentation will also frequently come up, but I would recommend avoiding it as a resource until you have more programming experience. It's a great resource, but the way information is presented can be confusing until you get the hang of reading documentation.
", - "order": 9 - } - }, - { - "model": "lesson.lesson", - "pk": 1076, - "fields": { - "title": "A Little Motivation", - "created": "2020-07-09T16:41:14.010Z", - "updated": "2020-07-09T16:41:14.010Z", - "workshop": 156, - "text": "Early on, we learned a bit about lists, which look like this:
\n['rose', 'violet', 'buttercup']\n
We're going to create a small application that will print a random motivational saying every time a user presses Enter
. Our first step will be to create a list of positive sayings:
motivational_phrases = [\n \"Importing modules is easy!\",\n \"Programming! Yay!\",\n \"You write lists like a pro!\",\n ]\n
Lists open with a square bracket [
, have items seperated with commas, and end with a square bracket ]
, like this:
[1, 2, 3, 4, 5]\n
Our positivity list above spreads the list out over multiple lines for greater readability, which is allowed in Python. Remember that you can change the strings in the list to whatever phrases you choose.
\nNow that we have our sayings, let's use it in conjunction with some functionality from a module that's built into Python: the random
module.
import random\nmotivational_phrases = [\n \"Importing modules is easy!\",\n \"Programming! Yay!\",\n \"You write lists like a pro!\",\n ]\nprint(random.choice(motivational_phrases))\n
The random.choice
function chooses a random item from a list and returns it. The .
syntax indicates that the function is coming from the random
library.
\n1. The real point of this section is to learn import
, which is where Python really starts to get interesting. Python comes with many libraries (importable collections of code), and you can install many more. Think of something you're interested in doing (statistics, text analysis, web scraping, quantitative analysis, processing Excel/PDF/image files) and search google \"\\
\n2. (optional) As with our weather app, this positive saying generator could be improved by making it so the program doesn't have to run again every time to get new output. Add a while loop for the final version. You can see a solution here.
", - "order": 10 - } - }, - { - "model": "lesson.lesson", - "pk": 1077, - "fields": { - "title": "Objects in Python", - "created": "2020-07-09T16:41:14.046Z", - "updated": "2020-07-09T16:41:14.046Z", - "workshop": 156, - "text": "Objects in Python (and other programming languages) are basically containers that can hold data and/or functions inside them. When a function is inside an object, we usually call the function a \"method.\" When data is inside an object, we usually call it an \"attribute.\" The terminology isn't that important, though. What we do need to know is that you can access these \"methods\" and \"attributes\" with a .
(a dot or period).
\nWhen we added lower case to our weather program, we briefly saw a method contained inside all string objects in Python—lower()
, which makes the string lower case.
>>> loud_greeting = \"HELLO!\"\n>>> loud_greeting.lower()\n'hello!'\n
Many, or most, objects in Python have methods that allow you to use them in different ways. As you move into using more advanced libraries, you'll find that understanding what methods are available becomes more important.
\nWhen you encounter an object, how can you learn its methods and atributes so you can use them? There are two main ways. The first, and likely the most practical, is to read the documentation of the library you're using.
\nHowever, you can also use the dir()
function, which will tell you which methods and attributes are available in an object.
\nLet's use the REPL for a moment—open it by typing python
at the command line.
>>> s = 'Hello, world!'\n>>> dir(s)\n['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__',\n...\n'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']\n
The above output shows all the methods and attributes available to Python strings that can be accessed using the dot (.
) syntax. When using dir()
, you'll mostly want to ignore the methods and attributes that have underscores around them. They mainly have to do with the internals of the Python language.
For a few minutes, practice moving in and out of Python's interactive mode (also known as the REPL). You can get out of Python by hitting Control-d
(or Control-z
if you're using Git Bash) or by typing exit()
, and you can get back in by typing python
at the $
prompt. Remember that you're in the REPL when you see >>>
, and you're in bash when you see the $
.
One \"operator\" (math symbol) we didn't learn is the exponent—you know, \"x raised to the power of...\" If you were Guido van Rossum, the creator of Python, how would you define this operator?
\nSo I just told you that variables shouldn't start with a number or an underscore. What does that even mean? Will your computer explode if you write 1_book = \"Gender Trouble\"
?
Only one way to find out. Try giving weird names to variables and see if you can learn a bit about the rules.
" - } - }, - { - "model": "lesson.challenge", - "pk": 250, - "fields": { - "lesson": 1070, - "title": "", - "text": "Rewrite your program so that you assign the message to a variable, then print the variable. This will make your program two lines instead of one. There's a fancy programmer word for rewriting your code without changing it's behavior—\"refactoring.\"
\n(optional) Are you already getting sick of typing python hello.py
again and again? Try typing !!
in the command line (the $
). This will run your last line of code again.
(even more optional) If you're on Windows and have a minute, try pressing the Windows button on your keyboard and searching for a program called IDLE
that comes with Python. It's a special editor (or IDE) that lets you run Python code from inside it. You might like it more than git bash.
Try to create as many errors as you can in the next few minutes. After getting your first two syntax errors, try instead to get traceback errors. Some areas to try include mathematical impossibilities and using math operations on types that do not support them.
" - } - }, - { - "model": "lesson.challenge", - "pk": 252, - "fields": { - "lesson": 1072, - "title": "", - "text": "prime_numbers = [2, 3, 5, 7, 11]\n
Write some code to print out the square of each of these numbers. Remember that the square of a number is that number times itself. The solution is below, but you're not allowed to look at it until you've tried to solve it yourself for 3.5 minutes. (Seriously! That's 210 seconds.)
\nThe square of 2 is 4.\nThe square of 3 is 9.\nThe square of 5 is 25.\nThe square of 7 is 49.\nThe square of 11 is 121.\n
Note: the \"f-string\" is a new string formatting method for Python 3. You can read more about this new string formatting method.
" - } - }, - { - "model": "lesson.challenge", - "pk": 253, - "fields": { - "lesson": 1073, - "title": "", - "text": "Add two more elif
statements to this program to make it better able to handle different potential books.
Remember the input()
function from the beginning of this lesson? This challenge uses that function to create a little library app. You will play around with the input button, asking the user what kinds of things they want to do with their library, and writing some code that does those things and prints out the results.
First, create a new file called library.py
. Save it to your current working folder.
Second, create a list of library
books, with at least three books (you can use the same ones as before).
library = ["Orlando", "Confessions of the Fox", "These Waves of Girls"]\n
Then, add an input statement that will save the user's response to a variable, like response
.
response = input("What do you want to do with your books today? ")\n
Now, create a conditional statement that matches the user's response to series of options for doing things to the library
list. You can include sort()
, append()
, and pop()
. I'll do the first one, sort()
, for you:
library = ["Orlando", "Confessions of the Fox", "These Waves of Girls"]\nresponse = input("What do you want to do with your books today? ")\nif response == "sort":\n library.sort()\n print(library)\nelse: \n print("I don't know what you want me to do!")\n
See how the order of statements build on each other toward the final product? First, we create a library of books. Then, we set the user's response about what to do with those books. Then, we create a conditional statement that matches the response to specific tasks. The first condition checks to see if the user wants to \"sort\" the books, then sorts them, then prints the final result.
\nAfter adding a few more conditions, test out your code! You should have a little library app that sorts, adds, and removes books from your list.
" - } - }, - { - "model": "lesson.challenge", - "pk": 255, - "fields": { - "lesson": 1075, - "title": "", - "text": "while
loops to get Python to repeat loops over and over again. This involves adding a while
statement to your libary app. The code should look like this, and it goes right after the library
list and before your input
statement.while True:\n
Make sure that everything under while True:
is indented (this is called a code block).
Once you get it to work, you can add more elif
statements to include more and more books on the list. Then, run the program, adding books, sorting them and removing them.
(optional) OK, I told you not to look at the Python documentation. But doesn't that make you really want to go look at the Python documentation? How bad could this \"documentation\" really be? What terrible secrets might it hold?
\nFine. Have a look at the Python documentation on built-in functions. Don't say I didn't warn you.
\nAs we've learned, libraries are Python code written by others that can be pulled into your program, allowing you to use that functionality. In this challenge, do a little research on Python libraries that might solve a problem for you or address a domain that you're interested in.
\nThe best way to find a Python library in a particular area is to do a Google search. For example, if you wanted to find Python libraries for dealing with cleaning up HTML files, you might search one of these:
\n\n\nworking with html python library
\nhtml parser python library
\n
In your research, you may also want to look at the libraries that come with Python. You can find a list of libraries in these libraries here.
\n" - } - }, - { - "model": "lesson.challenge", - "pk": 257, - "fields": { - "lesson": 1077, - "title": "", - "text": "You can also use dir()
to see what functions are available from Python libraries that you import. Try importing the random library again and see what you get when you enter dir(random)
.
Try entering other objects based on Python types we've already learned to the dir()
function. For example, you might try dir([1, 2, 3])
to see what methods are available when using lists.
**
. For example, the number 3
to the power of 2
would be expressed as 3**2
.There are a few rules regarding the way that you write the variable statement. This is because Python reads everything left to right, and needs things to be in a certain order.
\nFirst, you cannot use any numbers or special characters to start a variable name. So 1_book
, 1book
, or any variable that contains special characters @
, #
, $
, $
, etc, wouldn't be acceptable in Python. You must start the variable with a letter and avoid using special characters.
You can incorporate numbers after you've started with a letter. So book_1
or b1
is acceptable, though you cannot use special characters at any point in the variable name.
Second, you might also notice that variable syntax requires you to write the variable name first, followed by an equal sign =
, and then the value, which can be any data type. You cannot start the variable statement with the data value, because python always recognizes the first thing written as the thing to be assigned. The thing that comes after the =
is the data that becomes attached to the preceding variable.
hello.py
:greeting = "Hello World!"\nprint(greeting)\n
Then, making sure you're in the right directory, run python hello.py
in the terminal $
. You should see the following output:
$ python hello.py\nHello world!\n
Some examples of syntax errors include...
\nStarting the variable name with a special character.
\n>>> %greeting = "Hello World"\n File "<stdin>", line 1\n %greeting = "Hello World"\n ^\nSyntaxError: invalid syntax\n
Starting a variable by writing the data values before the variable.
\n>>> "hey there!" = greeting\n File "<stdin>", line 1\nSyntaxError: can't assign to literal\n
Including spaces in a variable.
\n>>> pleasant greeting = "Hello!"\n File "<stdin>", line 1\n pleasant greeting = "Hello!"\n ^\nSyntaxError: invalid syntax\n
Some examples of traceback errors include...
\nConcatenating data types, like strings and integers.
\n>>> greeting = "hello" + 1\nTraceback (most recent call last):\n File "<stdin>", line 1, in <module>\nTypeError: can only concatenate str (not "int") to str\n
Using Booleans (True
or False
) without capitalizing them.
>>> greeting = false\nTraceback (most recent call last):\n File "<stdin>", line 1, in <module>\nNameError: name 'false' is not defined\n>>> greeting = False\n>>> greeting\nFalse\n
prime_numbers = [2, 3, 5, 7, 11]\n\nfor num in prime_numbers:\n print(num * num)\n
prime_numbers= [2,3,5,7,11]\nfor num in prime_numbers:\n print(f"The square of {num} is {num * num}")\n
random_book = "The Undercommons"\n\nif random_book == "The Undercommons":\n print("This is the correct book, well done!")\nelif random_book == "Gramophone, Film, Typewriter":\n print("This is not the correct book. Please attempt with another title.")\nelif random_book == "Radiant Textuality":\n print("Welp. Try again.")\nelif random_book == "The New Jim Code":\n print("Bzzzzzt! Wrong Answer!)\nelif random_book == "Algorithmic Criticism":\n print("That's just wrong.")\nelse:\n print("I don't know which book you're talking about! I'm just a little program...")\n
library = ["Orlando", "Confessions of the Fox", "These Waves of Girls"]\nresponse = input("What do you want to do with your books today? ")\nif response == "sort books":\n library.sort()\n print(library)\nelif response == "add a book":\n library.append("La Frontera")\n print(library)\nelif response == "remove a book":\n library.pop()\n print(library)\nelse: \n print("I don't know what you want me to do!")\n
library = ["Orlando", "Confessions of the Fox", "These Waves of Girls"]\nwhile True:\n response = input("What do you want to do with your books today? ")\n if response == "sort books":\n library.sort()\n print(library)\n elif response == "add a book":\n library.append("La Frontera")\n print(library)\n elif response == "add another":\n library.append("Dawn")\n print(library)\n elif response == "more books":\n library.append("Frankenstein")\n print(library)\n elif response == "again":\n library.append("Nightwood")\n print(library)\n elif response == "remove a book":\n library.pop()\n print(library)\n else: \n print("I don't know what you want me to do!")\n
Websites seem like these magical things that appear when we open our web browser (i.e. Chrome, Firefox, Safari). We know that websites are hypertext, meaning that we can click between links, travelling from page to page until we find what we need. What may be less obvious about websites is that, fundamentally websites are plain text documents, usually written in HTML or another web-based markup language, such as XML or XHTML.
\nFun fact: Nearly 80% of all websites (whose markup language we know) use HTML.
\nHTML is a markup language used to write web-based documents. It enables us to provide web browsers with information about the content of a document. We can, for example, indicate that some part of our document is a paragraph, image, heading, or link. The browser uses this information when displaying the document for users.
\nHTML is a markup language, not a programming language. Programming languages are used to transform data, by creating scripts that organize an output of data based on a particular input of data. A markup language is used to control the presentation of data.
\nFor a practical example of this difference, we can think about tables. A programming language can help you search through a table, understand the kinds of data the table includes, find particular data points, or transform its content into other kinds of data, such as frequencies. A markup language would instead determine the content, layout, and visual presentation of the table.
\nFundamentally, then, a script or program is a set of instructions given to the computer. A document in a markup language determines how information is presented to a user.
\nNOTE - Markup vs Markdown: Markdown and HTML are both types of markup languages; Markdown is a play on words. Markup languages help format content.
\nCSS is usually used in conjunction with HTML. HTML tells the browser what the different parts of a document are. CSS tells the browser what the parts of the document should look like. It is essentially a set of rules that are applied when rendering an HTML document. Its name—Cascading Style Sheets—refers to the fact that there is an order of precedence in how the browswer applies CSS rules to a document. More specific rules overwrite less specific rules.
\nTogether, these languages can be used to write and style a website using a text editor (such as VS Code) directly from your computer. No internet access needed.
\nHowever, internet access is necessary if you plan on making your website available to the public. At the end of this workshop, we will briefly discuss how to get your website from your local computer onto the internet.
", - "order": 1 - } - }, - { - "model": "lesson.lesson", - "pk": 1079, - "fields": { - "title": "Opening Activity", - "created": "2020-07-09T16:41:17.713Z", - "updated": "2020-07-09T16:41:17.713Z", - "workshop": 157, - "text": "\n
A second tab should open in your browser displaying the underlying code of the page. This is the code that is used to make and render the page in your web browser.
\nIn this session, we are going to learn how to read and write this code, and render it in the browser on your local computer. At the end we will discuss the next steps for how you could host your new website, making it available for browsing by others via the internet.
", - "order": 2 - } - }, - { - "model": "lesson.lesson", - "pk": 1080, - "fields": { - "title": "Basic Template for HTML", - "created": "2020-07-09T16:41:17.744Z", - "updated": "2020-07-09T16:41:17.744Z", - "workshop": 157, - "text": "Below is a basic template for an empty HTML Document.
\n<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n</head>\n<body>\n</body>\n</html>\n
HTML documents start with a DOCTYPE
declaration that states what version of HTML is being used. This tells the browser how to read the code below it to render the page. If the webpage were written with a different markup language (i.e. XML, XHTML), it would tell you here.
\nAfter the DOCTYPE
, we see the start of the Root Element. EVERYTHING—all content—that you want presented on this page and all information about how you want that information to be organized and styled goes in the root element, and it is demarcated by <html>
and </html>
.
\nThe root element begins by indicating which language the document is written in; and in this basic template, en
tells us and the computer that we are writing in English.
\nWithin the root element of the basic template above, you'll notice the two main sections of all HTML documents: a head section (demarcated by <head>
and </head>
) and a body section (demarcated by <body>
and </body>
).
\nThe head section contains basic information about the file such as the title, keywords, authors, a short description, and so on. This is also where you will link to your CSS stylesheet which describes how you want the page styled—colors, fonts, size of text, and positioning of elements on the page.
\nThe body section contains the content of the page, including paragraphs, images, links, and more, and indicates how this content is to be structured on the page.
\nCreate a folder called htmlpractice
in your projects folder (~/Desktop/projects/htmlpractice
). If you haven't created a projects folder in an earlier session, you can create one now. Inside that folder, create a new text file and save it as index.html
.
\nLet's use the command line to create the new folder and file:
\n1. Open your terminal.
\n2. Navigate to your projects folder using this command:
\nbash\n cd ~/Desktop/projects
\n3. Create a new folder:
\nbash\n mkdir htmlpractice
\n4. Use your VS Code text editor to create a file called index.html
: code index.html
.
\n5. Paste the template above (starting with <!DOCTYPE html>
) into the new file.
\nThe index.html
file is your default homepage for the website we are creating. This is an industry standard, because web browsers tend to recognize the index.html
page as the opening page to the directory that is your website. See here for more explanation.
\nOnce you've created your new file, open it with a web browser using your graphical user interface:
\nOn macOS, click on the Finder in your dock (the apps at the bottom of the screen) and click on Desktop on the left. From there, navigate to projects
, then htmlpractice
. Alternately, you can click the projects folder icon on your Desktop and find it from there. If you're using a Mac and would prefer to use the command line, you can also type open index.html
from within your htmlpractice
folder.
\nOn Windows, click the projects
folder icon on your desktop. Navigate to projects
, then htmlpractice
. Double click the index.html
file. If it does not open in a browser, right click the index.html
icon and select \"Open with...\" from the menu. Select Firefox or Google Chrome from the app list that appears.
When you open the empty template, you'll see only a blank web page. Open your secondary menu (right click on Windows, hold control and click with macOS) and view the page source. How can you explain what happens when you open these text files?
\nWhen you \"View Page Source,\" you should see the code for the basic template.
\nNo content renders on the page, because there is no content in the template at this time.
", - "order": 3 - } - }, - { - "model": "lesson.lesson", - "pk": 1081, - "fields": { - "title": "Tags and Elements", - "created": "2020-07-09T16:41:17.758Z", - "updated": "2020-07-09T16:41:17.759Z", - "workshop": 157, - "text": "Tags and elements are the structuring components of html webpages.
\nElements identify the different parts of a page, such as paragraphs, headings, titles, body text, images and more. Elements are demarcated by tags which enclose the content of an element (ex. <head>
and </head>
are tags that denote the head element of your page).
\nTags demarcate elements in one of two ways. As with the paragraph element below, an element can have an opening and a closing tag, with the content in between.
\n<p>This is a paragraph.</p>\n<p>\n This is also a paragraph.\n</p>\n
Elements which have an opening and closing tag can have other elements inside them. Inside the paragraph element below is a strong element, which emphasizes the included text by making it bold.
\n<p>\n When I came home from school, I saw he had <strong>stolen</strong> my chocolate pudding.\n</p>\n
Other elements have self-closing tags as with the image element below. These tags are also called void tags.
\n<img src=\"image.jpeg\" />\n
These elements don't require a separate closing tag. Closing tags aren't needed because you wouldn't add content inside these elements. For example, it doesn't make sense to add any additional content inside an image.
\nBelow is HTML that adds alternative text to an image—or text that describes the image. This information added is an attribute—or something that modifies the default functionality of an element.
\n<img alt=\"This is an image\" src=\"image.jpeg\" />\n
Adding alternative text to an image, as was done in this example, is vitally important. That information makes the image more accessible to those viewing your site. For instance, users with poor vision who may not be able to see your image will still understand what it is and why it's there if you provide alternative text describing it.
\nIf you look back at the basic template in your index.html
file, you'll see that the main sections of your file have opening and closing tags. Each of these main elements will eventually hold many other elements, many of which will be the content of our website.
Paragraphs and headings are the main textual elements of the body of your webpages. Because these contain content that you want to organize and display on your webpage, these are entered in the body element.
\nThe <h1>
, <h2>
, <h3>
, etc. tags denote headings and subheadings, with <h1>
being the largest and <h6>
the smallest.
\nThe <p>
tags denote paragraphs, or blocks of text.
<!DOCTYPE html>\n <html lang=\"en\">\n <head>\n <title>A boring story</title>\n </head>\n <body>\n <h1>\n Cleaning my boiler\n </h1>\n <p>\n When I got to my basement that day, I knew that I just had to clean my boiler. It was just too dirty. Honestly, it was getting to be a hazard. So I got my wire brush and put on my most durable pair of boiler-cleaning overalls. It was going to be a long day.\n </p>\n </body>\n</html>\n
Note that the <title>
is in the <head>
element, which is where information about the webpage goes. The title doesn't appear on the page, but instead elsewhere in the browser when the page is displayed. For example, in Chrome, the title appears on the tab above the navbar.
\n
\nNote also that the elements and tags used in HTML have meaning. They provide information about the structure of a web page, showing how its parts work together. Those who make use of assistive technologies such as screen readers rely on this semantic information to navigate the page. Thus, it's important to use elements such as headers only when the information being marked calls for it. Making text large and/or bold for visual effect should be done using CSS. The Mozilla Developer Network has some good introductory information on semantic HTML.
\nUsing your text editor, add the following to your index.html
:
\n- Title
\n- Heading
\n- Paragraph
\nThen, re-save the file. Open it in your browser again or refresh the page if still opened.
\nWhat do you notice about how the information is organized in the webpage? In other words, where are the title, heading, and paragraph text?
\nThe heading should appear at the top of the page, followed by the paragraph text. The heading text should be larger. The title should appear in the browser window tab.
\n
", - "order": 5 - } - }, - { - "model": "lesson.lesson", - "pk": 1083, - "fields": { - "title": "Links", - "created": "2020-07-09T16:41:17.841Z", - "updated": "2020-07-09T16:41:17.841Z", - "workshop": 157, - "text": "Links are the foundation of the World Wide Web, and thus are an important component of most websites. Hyperlinking text enables users to move between the different webpages on your site (sometimes in the form of a menu or navigation bar), or connect to other resources or information on other websites.
\nThe <a>
tag, or anchor tag, creates a link to another document. You can use the <a>
tag to link to other documents or webpages you created for the same site or to documents located elsewhere on the web. You can also use it to link to a particular location on a page—we'll see an example of this in the section on classes and ids.
Relative links take the current page as an origin point and search for other files within the same folder or directory. This method is useful for creating links to pages within your own site.
\nThe following appears as a link to the about.html
page in the same folder as index.html
:
<a href=\"about.html\">About</a>\n
On your webpage it will appear as:
\n\n\n\nThis link is asking the browser to look for a file titled
\nabout.html
in the same folder. If a file namedabout.html
is not in the same folder, clicking the link will result in a404
(\"Page Not Found\") error.
An absolute link includes information that allows the browser to find resources on other websites. This information includes the site domain—such as google.com
—and often the protocol—such as http
or https
.
<a href=\"http://www.google.com\">Google</a>\n
On your webpage it will appear as:
\n\n\n\nThis pathway is directing your browser to look online for this text document at the URL address provided.
\n
Each example above includes an href
tag. The href
tag, short for hypertext reference, is an example of an attribute. Attributes offer secondary information about an element.
\nThe <a>
tag, or anchor tag, creates a link. The text within the <a>
and </a>
tags, the anchor text, is what a visitor to the site will see and can click on. The href=
attribute tells the browser where the user should be directed when they click the link.
\nThere is another technical difference between the two options above.
\nUse relative links when referring to pages on your own site. The main advantage of using relative links to pages on your site is that your site will not break if it is moved to a different folder or environment.
\nabout.html
in your htmlpractice
folder. Copy over the HTML from your index.html
file, but change the text in the <h1>
element to \"About.\"index.html
file, add a relative link leading to your About page.About page
to your index.html
page. In this link, call your index.html
page Home
(Reminder: index.html
is the default homepage)http:
) and also a domain (for example, cuny.edu
), such as http://cuny.edu/about
.When your pages are updated, you should be able to navigate from your Home page to your About page, and vice versa. You should also be able to navigate to the external web page.
", - "order": 6 - } - }, - { - "model": "lesson.lesson", - "pk": 1084, - "fields": { - "title": "Images", - "created": "2020-07-09T16:41:17.847Z", - "updated": "2020-07-09T16:41:17.847Z", - "workshop": 157, - "text": "Images are another important component of websites. Sometimes these just help bring your website to life, but other times they can help communicate information to users.
\nImages are created with the <img>
tag. Similar to the <a>
tag, <img>
requires an attribute, in this case src
. The src
attribute stands for \"source\" and communicates secondary information to your browser that identifies and locates the image. Unlike many other tags, the <img>
tag does not need to be closed, making it an example of a void tag.
\nThe following element pulls in an image located in the same folder as the .html
file:
<img src=\"scream.jpeg\" />\n
The same rules apply here as with the href
attribute: if the image is not located in the same folder as the document you are writing in, the browser won't find it. If the browser cannot find an image resource, you will see a broken image icon, such as this one from Chrome:
\n
\nNote: Some sites use a lot of images. When this is the case, it can be helpful to keep images in a separate folder within your site's structure. To enable the browser to find an image in that case, just add the directory in front of the file name. For example, if you have a folder named images in the same folder as your index.html file, and scream.jpeg is in that folder, you'd change the void tag above to <img src=\"/images/scream.jpeg\" />
.
As briefly noted earlier, alternative text, or alt text, is descriptive \"text associated with an image that serves the same purpose and conveys the same essential information as the image\" (see Wikipedia Manual of Style/Accessibility/Alternative Text for Images for more), and is important for ensuring content conveyed by images is accessible to all.
\nTo add alternative text to an image, you add an additional attribute, alt
followed by your descriptive text. For example:
<img src=\"filename.png\" alt=\"Text in these quotes describes the image\" />\n
For more information, see what the Social Security Administration has to say.
\nIf you're planning to use images that you did not take or make yourself, you'll need to use \"public domain\" or \"open license\" images.
\nThis guide by the OpenLab at City Tech includes more information on licensure and a list of places where you can find reuseable images.
\nDownload and save an image from the web, or move an image from your computer into the same folder as your index.html
file.
\nTip: Give the file a simple name. Also, the name cannot have spaces. A good practice is to use either dashes or underscores where there would otherwise be spaces. For example: this-is-an-image.jpg
or this_is_an_image.jpg
.
\nUsing the code above as a reference, add that image into your index.html
file, re-save the file, and re-open or refresh the page in your browser. Your image should now appear on the page.
As we’ve gone through the different components of creating a webpage, you likely have noticed some common conventions or industry standards for creating a webpage using HTML. Can you guess any of these?
\nHere are a few examples:
\n- Some tags are self-closing, while others require a closing tag. Self-closing tags are called void tags, and are generally self-closing because you wouldn't need or want to add another element within a tag. They also generally end with a forward slash (/) to mark the end of the tag.
\n- Use lower case. While HTML is not case sensitive, it makes scanning the code easier, and makes it look more consistent.
\n- Your code should be nested. This is not a technical necessity either — blank space has no meaning in HTML. However, nesting (indenting elements inside their parent elements) makes it easier to scan the code quickly, which is particularly helpful when you run into errors! A short example follows this list.
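\nA nested (indented) snippet might look like this, with each child element indented inside its parent so that a missing closing tag is easier to spot:
\n<html lang=\"en\">\n <body>\n  <p>\n   This paragraph sits inside the body, which sits inside the html element.\n  </p>\n </body>\n</html>\n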
", - "order": 8 - } - }, - { - "model": "lesson.lesson", - "pk": 1086, - "fields": { - "title": "Challenge: Create an Institute Website", - "created": "2020-07-09T16:41:17.856Z", - "updated": "2020-07-09T16:41:17.856Z", - "workshop": 157, - "text": "For this challenge, practice using the command line. If you need a reminder of which commands to use to create new folders and files, see here.
\nUsing the tags we've just reviewed, and two additional ones (see below), begin creating an introductory page for your future Institute.
\nIn your projects
folder on your desktop, create a new folder called website
. Create an index.html
file inside that folder. This will be the homepage or landing page of your site.
\nAdd HTML to your index.html
file. This page should include the following:
\n- Doctype
\n- Root element
\n- Head and a body
\n- Title for the page
\n- One heading
\n- One paragraph
\n- One image
\n- A menu or navigation bar that links to your Home and About pages
\nThink about the order of your content as you assemble the body of your page.
\nDon't worry about getting the content just right, as much as using this exercise to review the structure of a webpage, and practice creating a webpage.
\nHere are two additional tags that might come in handy in assembling your page:
\nTo make a list, you open and close it with the ul
tags, and each item is an enclosed li
tag:
<ul>\n <li> item 1 </li>\n <li> item 2 </li>\n <li> item 3 </li>\n</ul>\n
The HTML above will produce an unordered (bulleted) list. To create an ordered (numbered) list instead, just substitute <ol>
and </ol>
for <ul>
and </ul>
.
\n(This may come in handy when making your menu or navigation bar.)
\nTo make a line break or give space between different elements:
\n<br />\n
Finished early? Play around with other tags by referring to this HTML cheatsheet.
", - "order": 9 - } - }, - { - "model": "lesson.lesson", - "pk": 1087, - "fields": { - "title": "CSS Basics", - "created": "2020-07-09T16:41:17.859Z", - "updated": "2020-07-09T16:41:17.859Z", - "workshop": 157, - "text": "CSS stands for Cascading Style Sheets. This language works in coordination with HTML, but is its own language with its own rules and terminology. In contrast to HTML, which is responsible for the content of the page, CSS is responsible for the presentation of the page.
\nExamples of what CSS can help you determine include:
\n- What background color you want to use for the page or a paragraph.
\n- What font or font size you want for your headings or your normal text.
\n- How large you want the images, and whether you want them aligned center, left, or right.
\n- Where elements appear on the page.
\n- Whether elements are visible to a user or not.
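\nAs a small, hypothetical taste of what that looks like, the rule below would restyle every paragraph on a page (the color, size, and alignment values are arbitrary examples):
\np {\n background-color: lightyellow;\n font-size: 18px;\n text-align: center;\n}\n
The next lessons cover how rules like this are written and how they get attached to your HTML.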
", - "order": 10 - } - }, - { - "model": "lesson.lesson", - "pk": 1088, - "fields": { - "title": "Integrating CSS and HTML", - "created": "2020-07-09T16:41:17.879Z", - "updated": "2020-07-09T16:41:17.879Z", - "workshop": 157, - "text": "In order for CSS to inform the style of the content on the page, it must be integrated with your HTML. CSS can be integrated into your HTML in three ways: inline, internal, and external.
\nInline styling adds CSS directly into the HTML of a page to adjust the style of particular parts of a page.
\nFor example, if you want the text of your first paragraph to be red, but the text of your second paragraph to be blue:
\n<!DOCTYPE html>\n<html lang=\"en\">\n <head>\n <title>About</title>\n </head>\n <body>\n <p style=\"color: red\">\n Content of paragraph\n </p>\n <p style=\"color: blue\">\n Content of paragraph\n </p>\n </body>\n</html>\n
Internal styling also adds CSS directly into the HTML, but keeps it separate from the content code of the page by adding it into the head using the <style>
tag. When using internal styling you are providing styling rules for the entire page. For example, if you want all headings to be blue:
<!DOCTYPE html>\n<html lang=\"en\">\n <head>\n <title>About</title>\n <style>\n h1 {\n color: blue;\n }\n </style>\n </head>\n <body>\n <h1>\n Heading One\n </h1>\n <p>\n Content of paragraph\n </p>\n <h1>\n Heading Two\n </h1>\n <p>\n Content of paragraph\n </p>\n </body>\n</html>\n
External styling creates a completely separate document for your CSS that will be linked to your HTML in the head section of your HTML document using the code below. This separate document is called a stylesheet and should be named style.css
. This document must be stored in the same folder as the HTML document it is linked to.
<!DOCTYPE html>\n<html lang=\"en\">\n <head>\n <title>CSS Example</title>\n <link rel=\"stylesheet\" href=\"style.css\" />\n </head>\n <body>\n ...\n </body>\n</html>\n
It's best practice to use Option 3, external styling, for a number of reasons:
\n1. It helps us remember what each language focuses on: HTML is for content, CSS is for styling. (This is sometimes referred to as \"separation of concerns\")
\n2. It helps us maintain consistency across the various pages of our site; multiple HTML files can link to the same CSS file.
\n3. Because multiple HTML files can link to the same CSS file, it's not necessary to write the same CSS code multiple times. Once suffices.
\nOption 3, external styling, is preferred by most web developers because it's more manageable and because it lends itself to greater consistency across the entire site.
", - "order": 11 - } - }, - { - "model": "lesson.lesson", - "pk": 1089, - "fields": { - "title": "Rule Sets", - "created": "2020-07-09T16:41:17.890Z", - "updated": "2020-07-09T16:41:17.890Z", - "workshop": 157, - "text": "CSS is based on selectors and declarations, which together form rule sets (or just \"rules\"). Rule sets (included in a .css file) look like this:
\nh1 {\n color: orange;\n font-style: italic;\n}\np {\n font-family: sans-serif;\n font-style: normal;\n}\n#navbar {\n background-color: yellow;\n padding: 80px;\n}\n.intro {\n font-family: arial;\n background-color: grey;\n color: dark-grey;\n}\n
The first rule (which starts with the h1
selector) applies to all <h1>
tags on each page where your HTML document links to your stylesheet, and changes the font style and display of those headings.
\nThe lines within the curly braces (i.e. { }
) are called declarations, and they change the formatting of the elements in the HTML document. Each line in the declaration sets the value for a property and ends with a semicolon (;
).
\nNote the different syntax being used to select items for styling with rule sets. The bottom two selectors are used to apply rule sets to ids and classes. In general, adding classes and ids to HTML elements allows for more specific styling — more on these soon!
\nThe formatting of the text on your page should change accordingly. Your <h1>
should be orange and italic, for example.
\nWhat are some other rules you might set for different HTML elements? Do a quick Google search for a CSS rule that changes the appearance of your page, such as putting a border around an element.
", - "order": 12 - } - }, - { - "model": "lesson.lesson", - "pk": 1090, - "fields": { - "title": "Filtering", - "created": "2020-07-09T16:41:17.897Z", - "updated": "2020-07-09T16:41:17.897Z", - "workshop": 157, - "text": "Some of you may be wondering whether it matters what order you add the rule sets to your style.css
document. The answer is no. CSS has an automatic filtering function where the most specific rule in CSS always takes precedence.
\nSo if your stylesheet contained the following rule sets:
\np {\n color: green;\n}\np strong {\n color: red;\n}\n
...then the text of your paragraph would be green, but where the strong tag is found in the paragraph, the text would be bold and red. In other words, the more specific styling for the <strong>
text in your paragraph will override the less specific styling of the paragraph in general. This would occur regardless of the order these rule sets appear in the stylesheet.
\nThis rule also applies to how you integrate CSS into your HTML to style your content. For example, if you link to an external stylesheet, and you add inline or internal CSS into your HTML, the inline or internal CSS will override the rules set in the external stylesheet. Similarly, the inline CSS will override the internal CSS.
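\nAs a sketch of that behavior, suppose your external style.css contains this rule (the colors here are arbitrary examples):
\np {\n color: green;\n}\n
\n...and one paragraph in your HTML carries an inline style:
\n<p>This paragraph renders green.</p>\n<p style=\"color: blue\">This paragraph renders blue, because the inline rule overrides the stylesheet.</p>\n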
", - "order": 13 - } - }, - { - "model": "lesson.lesson", - "pk": 1091, - "fields": { - "title": "Classes and IDs", - "created": "2020-07-09T16:41:17.920Z", - "updated": "2020-07-09T16:41:17.920Z", - "workshop": 157, - "text": "Classes and IDs enable more fine-grained styling by allowing you to define your own selectors. The difference between classes and IDs is that IDs are unique, used to identify one specific element or part of an element, whereas classes are used to identify multiple instances of the same type of element.
\nBasically, if you're styling a part of your page that is unique, such as the navbar or footer, use an ID. If you're styling something that recurs in different places, like an info box or form field, use a class.
\nIncorporating classes and IDs into the styling of your document includes two steps:
\n1. Some HTML code that CSS selectors can refer back to must be added to your HTML document.
\n2. CSS rules that select that code must be added to your style sheet.
\nThe code for classes and IDs is written differently in CSS than it is in HTML.
\nIn HTML, classes and ids are added to the first part of a tag. Here's an example of what HTML code with classes and ids looks like:
\n<ul id=\"navbar\">\n <li>Home</li>\n <li>About</li>\n</ul>\n<h1 class=\"football\">Football teams</h1>\n<ul>\n <li class=\"football\" id=\"colts\">Indianapolis Colts</li>\n <li class=\"football\" id=\"packers\">Green Bay Packers</li>\n</ul>\n<h1 class=\"baseball\">Baseball teams</h1>\n<p>American League teams</p>\n<ul>\n <li class=\"baseball american\" id=\"twins\">Minnesota Twins</li>\n <li class=\"baseball american\" id=\"tigers\">Detroit Tigers</li>\n</ul>\n<p>National League teams</p>\n<ul>\n <li class=\"baseball national\" id=\"dodgers\">Los Angeles Dodgers</li>\n <li class=\"baseball national\" id=\"mets\">New York Mets</li>\n</ul>\n
Note that it's possible to assign more than one class to an element — just leave a blank space between the two classes, as in the baseball examples above.
\nBonus: ID selectors can be used to create links that can be used for navigation within a page. For example, to add a link to the page that takes the user directly to the line that reads \"New York Mets,\" we might write HTML like this: <a href=\"#mets\">Mets</a>
.
Class selectors in CSS are denoted with a period in front of the class name you're creating. Together with the #navbar ID selector from the HTML above, they look like this:
\n#navbar {\n padding: 80px;\n background-color: red;\n color: white;\n}\n.football {\n font-family: arial;\n background-color: lightgrey;\n color: blue;\n}\n.baseball {\n font-weight: bold;\n color: green;\n}\n.american {\n background-color: yellow;\n}\n
ID selectors look like this in the CSS—the name of the selector preceded by a hash mark (#), also known as a pound sign or octothorpe:
#navbar {\n background-color: yellow;\n padding: 80px;\n}\n
...and in the HTML they are incorporated into elements like this:
\n<ul id=\"navbar\">\n <li>Home</li>\n <li>About</li>\n</ul>\n
If you run into an error, be sure to check your punctuation. Oftentimes the problem is a typo, or overlooking a semi-colon, a period, etc. See the Troubleshooting section for more information on common issues.
", - "order": 14 - } - }, - { - "model": "lesson.lesson", - "pk": 1092, - "fields": { - "title": "Useful Properties", - "created": "2020-07-09T16:41:17.930Z", - "updated": "2020-07-09T16:41:17.930Z", - "workshop": 157, - "text": "Below is a list of useful properties that can be modified with CSS, compiled by Digital Fellow Patrick Smyth. There are also CSS cheatsheets available online.
\nDetermines text color. Can be a word or a hex value, like #FFFFFF:
\ncolor: blue;\ncolor: #000000;\n
Sets the background color of an element.
\nbackground-color: pink;\nbackground-color: #F601F6;\n
Aligns text to the left, center, or right.
\ntext-align: center;\n
The space between text and the \"box\" (<div>
) surrounding it.
padding: 10px;\npadding-right: 10px;\n
The space between an element's box and the next element (or the edge of the page).
\nmargin: 10px;\nmargin-top: 10px;\n
Sets the width or height of an element. Typically, don't set both of these.
\nwidth: 50%;\nheight: 40px;\n
Determines if text wraps around an element.
\nfloat: left;\n
Determines if an element is treated as a block or inline element. Can also be set to none
, which makes the element disappear.
display: inline;\ndisplay: block;\ndisplay: none;\n
Determines default styling on lists. Usually best to set it to none
.
list-style-type: none;\n
Sets the font. Usually best to copy this from Google Fonts or another web font repository.
\nfont-family: 'Lato', sans-serif;\n
Using the CSS basics we've just reviewed, and the list of properties found on the Properties page and online, give your website some styling.
\nI encourage you to use an external stylesheet with classes and IDs to style particular aspects of your site more specifically, but feel free to also play around with inline and internal styling if desired.
", - "order": 16 - } - }, - { - "model": "lesson.lesson", - "pk": 1094, - "fields": { - "title": "Troubleshooting", - "created": "2020-07-09T16:41:17.938Z", - "updated": "2020-07-09T16:41:17.938Z", - "workshop": 157, - "text": "It is common—especially in the beginning—that you'll add or amend something to/in your text editor, but it won't present when rendered by your browser.
\nYour first inclination should be to scan the text in your editor for errors. Nesting will help tremendously with this task. Oftentimes it is as little as forgetting a semicolon or closing tag.
\nAnother investigative tactic is to View Page Source on the page opened in the browser.
\nIf you think it is an error with the HTML, you'll be able to see it there.
\nIf you think it is an error with the CSS, then from the Page Source you'll need to click on the link for the style.css
page. The link to this page should be found in the head of your page. Don't see it? That may be the problem! If you do see it, open the link to see what CSS the browser is reading and applying to your HTML. It should match what you have in your text editor. If it looks like an earlier version of your style sheet, then maybe you need to re-save the document.
Through this workshop, you have learned the basics of two of the most commonly-used languages for building on the web: HTML and CSS.
\nHTML, or Hypertext Markup Language, organizes content on your page using elements denoted by tags (< >
). When rendered by your browser, these tags tell your browser that certain content is paragraph text, while other content is heading or title text, and so on. You can also use image (<img>
) and link or anchor (<a>
) tags to tell the browser to render an image on the page, or take the visitor to another page on your or another website. We also discussed some important conventions to consider when creating HTML documents, such as nesting.
\nCSS, or Cascading Style Sheets, allows for further styling of your website through the application of a series of rule sets that are applied to different aspects/elements of your site. In order for CSS to render on a webpage, it must be integrated with your html, which can happen in three ways: inline, internal, and external. CSS rules can be of varying specificity, and in particular, through creating classes and ids. We also discussed how the ordering of rule sets doesn't matter, because an important function of CSS is the way it filters and applies rules in accordance with the specificity of the rule.
\nThrough understanding these languages in combination with one another, you can also reframe your understanding of the web—not as poof! magic!, but as a series of intentionally styled, hyperlinked text documents, with each website representing a folder of documents.
\nWhile this is a good starting point, one important question remains: how can I get these text documents on the Internet so they can be accessed, and interacted with, and linked to by others?
", - "order": 18 - } - }, - { - "model": "lesson.lesson", - "pk": 1096, - "fields": { - "title": "Making your Website Public", - "created": "2020-07-09T16:41:18.039Z", - "updated": "2020-07-09T16:41:18.039Z", - "workshop": 157, - "text": "Great job! Now you have an amazing website, but it's stuck on your computer where no one else can find it or view it. How do you get your website onto the Internet so it can be shared?
\nTo get your site on the internet, you'll need hosting — that is, a remote computer that will stay on day in and day out to serve the website to visitors. In theory, you could host your website on your own computer, but in practice, it usually makes sense to purchase hosting from a hosting company or use a free service.
\nYou'll also need a way of getting your website to your host. That's where FTP, or File Transfer Protocol, comes in.
\nFTP is a protocol used to share files from your computer (a client) to another computer called a server, and back again over the Internet. This is something we do ALL THE TIME, but we refer to it as \"uploading\" and \"downloading.\"
\nNote: Though FTP stands for File Transfer Protocol, you are not really transferring or moving your files from your computer; instead they are copied to the server. Fear not!
\nIn order to transfer your website files (also called your website's directory) to a server you will need:
\n1. Access to the Internet.
\n2. An FTP Client.
\n3. A server that is connected to the Internet where you can send your files.
\nAssuming you all can manage accessing the internet on your own, let's focus on the latter two.
\nAn FTP client is software designed specifically for the purpose of sharing files between computers. There are widely-used, freely-available GUI applications (i.e., applications that use a graphic user interface, or the point-and-click, user-friendly software interfaces you are used to) that you can download for use, including Filezilla and Cyberduck. You can also run an FTP client program through the command line on most computers, though the process varies by operating system.
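\nAs a rough sketch, with the sftp program that comes with most operating systems, an upload session can be as simple as this (the address below is a placeholder; your hosting provider supplies the real login details):
\nsftp your-username@files.example.com\nput index.html\nput style.css\n
The first line opens a connection to the server; the put commands then copy individual files from your computer to it. Graphical clients like Filezilla and Cyberduck do the same thing with point-and-click windows.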
\nYou also need a server to transfer your files to, where they can be stored and shared on the Internet. This is what we call web hosting and there are multiple options here as well. The GCDI (CUNY Graduate Center Digital Initiatives) website contains a list of low-cost cloud hosting services for students. As long as your site is just plain HTML and CSS, it's also possible to host your website for free on services such as GitHub Pages.
\nCode School's Beginner's Guide to Web Development
\nWeb Development with Accessibility in Mind
\nYouTube Series: How to Build a Responsive Website from Start to Finish
", - "order": 20 - } - }, - { - "model": "lesson.challenge", - "pk": 258, - "fields": { - "lesson": 1088, - "title": "", - "text": "Create a stylesheet using the command line (following option 3, external styling, described above). In your index.html
document, link to your style sheet and re-save the file.
If you need a reminder on which commands to use to create your new stylesheet file, see here.
\nTo link your stylesheet with your index.html
file, insert the following code into the head element of that index.html
file:
<link rel="stylesheet" href="style.css" />\n
Copy and paste the CSS above into your style.css
file and re-save the file. Then open or refresh your index.html
file in your browser and see what happens.
Reminder: After creating a stylesheet, you must link it to all HTML documents that you want this styling to apply to. You can do so with the <link>
tag:
<link rel="stylesheet" type="text/css" href="style.css" />\n
This will tell your HTML document to apply the style rules from the text file named style.css
in the same folder.
Git is software used for version control—that is, tracking the state of files and changes you make to them over time. Git can be enabled in a folder, and then used to save the state of the contents in that folder at different points in the future, as designated by you. In the language of Git, a folder is called a repository. In the context of this workshop, it refers to a folder that is being tracked by Git. Using Git, you can view a log of the changes you've made to the files in a repository and compare changes over time. You can also revert back to previous versions, and create branches of a project to explore different futures. Git is also useful for collaboration, as a repository can be shared across computers, and its contents can be asynchronously developed and eventually merged with the main project.
\nGitHub is an online platform for hosting Git repositories. It functions for some, predominantly programmers, as a social network for sharing and collaborating on code-based projects. Users can share their own projects, as well as search for others, which they can then often work on and contribute to. Digital Humanists, librarians, and other academics are also finding ways Git and GitHub are useful in writing projects and teaching. GitHub also serves as a web-hosting platform, allowing users to create websites from their repositories.
\nMarkdown is a markup language for formatting text. Like HTML, you add markers to plain text to style and organize the text of a document.
\nIn HTML: \n<h1> Heading 1 </h1>\nIn Markdown:\n# Heading 1\n
Whereas you use HTML and CSS with WordPress, you use Markdown with GitHub. Markdown has fewer options for marking text than HTML. It was designed to be human-readable, meaning easy to write and edit.
\nThis file you are reading is written in markdown—here is what it looks like in its raw, unrendered form.
\nCompare that with this - the source code for the GitHub page, written in HTML: view-source:https://github.com/DHRI-Curriculum/git
\nMarkdown is also arguably more sustainable and accessible than formats like .docx
because of its simplicity and related ability to be read across multiple platforms. Use of Markdown is also supported by document-conversion tools like Pandoc that can change a markdown file to an epub with one command entered into your terminal.
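\nFor instance, assuming Pandoc is installed on your machine, a conversion like the one just described is a single command (the file names are only examples):
\npandoc syllabus.md -o syllabus.epub\n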
As we move forward, it's important to make sure we're firm on the distinctions between the three different tools outlined above.
\nGit is software that you use on your laptop, or your local computer/machine. The repository with your project's files is stored on your hard drive. You also edit the text files on your local machine using a plain text editor, which is another piece of software on your local machine, such as VS Code.
\nGitHub is a cloud-based platform that you access through your internet browser. Even though you physically are still in the same place, working on your laptop, you are no longer working on your local machine; you are on the Internet. This is a fundamentally different location than when you're working with your Git repository and editing and creating files in your plain text editor. With GitHub, you are uploading your repository - as described above - from your local machine to this platform on the Internet to be shared more broadly. You can also create private repositories if you want to use GitHub to back up a project.
\nMarkdown is the language used to format the plain text files in your Git-enabled repository. GitHub reads this language so that the markups made to the file are rendered when you view your file on the platform (i.e. #headers appears as headers, links are inserted).
", - "order": 1 - } - }, - { - "model": "lesson.lesson", - "pk": 1099, - "fields": { - "title": "What You Can Do with Git and GitHub", - "created": "2020-07-09T16:41:29.689Z", - "updated": "2020-07-09T16:41:29.689Z", - "workshop": 158, - "text": "A study of how Digital Humanists use GitHub, conducted by Lisa Spiro and Sean Morey Smith, found that a wide range of users, including professors, research staff, graduate students, IT staff, and librarians commonly used the site in their DH work. They used GitHub for a diverse range of activities, such as:
\n- Developing software
\n- Sharing data sets
\n- Creating websites
\n- Writing articles and books
\n- Collating online resources
\n- Keeping research notes
\n- Hosting syllabi and course materials
\nParticipants in the study said they found GitHub useful in their Digital Humanities work for several reasons. In particular, it facilitated:
\n- Sharing and backing up files on multiple computers
\n- Monitoring changes effectively
\n- Recovering from bugs or errors by going back in time before the error arose
\n- Using different branches for experiments and new directions
\n- Sharing and managing files with others—seeing who added what content and when
\nAs you can see across these sessions, we use GitHub to host workshop curricula. Hosting sessions on GitHub allows you (and anyone else interested in these topics!) to follow our repositories, and create your own version of the workshop based on our materials. This fosters open scholarship and knowledge sharing. It also facilitates attribution and citation by clearly tracking which content was created by whom, when it was added, and which projects or materials are derived from others.
\nIf you look just under the workshop title, DHRI-Curriculum/git
at the top of this page, you can see it is forked from pswee001/Git_DRI_Jan_2018
. That line shows that this particular repository is building on (\"forked from\") the curriculum for a session I presented at our January 2018 Institute. If you then look at that repository, you will see that it is in turn forked from previous sessions that were developed by other GC Digital Fellows for workshops in past years.
\nForking is a proper function of the GitHub platform. It supports collaboration by allowing you to copy someone else's repository to your own account on GitHub while maintaining a trail of attribution and derivation. This function is described in further detail in the final lesson of this workshop.
\nGit is also used to track changes (version control in Git parlance) in writing projects, especially when there are multiple authors working asynchronously. It can be an alternative to using track changes in Microsoft Word, or comments and edits in a Google Doc.
\nGit and GitHub - together or independently - support multi-author publishing. Like we have done with the DHRI curriculum, you can have a shared project folder that multiple people are working from asynchronously, even on the same parts if they wanted, and then those different offshoots can be carefully folded back into the master project. This entails the process of creating branches and merging.
\nGit and GitHub also help with attribution by tracking individual contributions throughout. Additional branches could be created by a singular author as well, allowing the writer to explore different ways forward. The version control feature also allows authors to easily return to and compare older drafts or retrieve sections previously discarded.
\nBranches and merging are important functionalities when using Git to collaborate, but they are also advanced and thus beyond the scope of this workshop. See the Resources section at the end of the workshop for more information.
\nHow did you initially come by the syllabus you use for your class, and did you develop it over time? Many professors borrow and adapt from each other, and most of us probably update our syllabi each semester, even if only a little bit.
\nPut your hand up if you have a folder somewhere that looks something like this. Or even multiple folders.
\n|\n--Documents\n |\n --syllabus.doc\n --syllabus2.doc \n --syllabusnew.doc \n --syllabusRevised.doc \n --syllabusFINAL.doc \n --syllabus?.doc \n
Ok, hands down.
\nConsider the following questions as well:
\n- Can you remember who you initially got this syllabus from?
\n- Do you know if there were other contributors along with or before them?
\n- Do you acknowledge prior contributors somewhere on your syllabus?
\n- Can you or others see what changes have been made to the syllabus over time?
\nIncreasingly we see that faculty are sharing their syllabi on GitHub. Some are even using GitHub Pages, which applies a user-friendly interface to their repository to make it easier to access and navigate for their students. This is because Git and GitHub make it easy to make contributors and changes to documents over time visible.
\nUsing Git and GitHub, you could fork that syllabus to your account, and download it - or clone it, as Git calls it - to your local machine to edit. After making changes to the files and reuploading them, or pushing them, to the repo (short for repository) on GitHub, someone else could compare the changes you made and see who the original or additional contributors were. They could also decide to continue the chain by copying or forking your version, or decide to return to the original repo and fork that version. Both Git and GitHub help with attribution here; tracking who changes and adds what is a key feature of both tools.
\nEven if you were only working with your own self-created syllabus, like we'll do later in this workshop, Git and GitHub can be useful for tracking your changes without the hassle of multiple files. From one file, you can use Git to compare your current version with older versions; you can also compare and share these different versions on GitHub.
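\nAs a sketch of what that comparison looks like once a file has at least two commits, both of the following are standard Git commands (the file name is just an example; we create a syllabus.md later in this workshop):
\ngit log --oneline syllabus.md\ngit diff HEAD~1 syllabus.md\n
The first lists each saved version of the file in shorthand; the second shows exactly what changed since the previous commit.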
\nCloning and pushing are proper functionalities of GitHub that describe how you communicate and share files between your local machine and the Internet. These are covered more in-depth in a later lesson in this workshop.
", - "order": 2 - } - }, - { - "model": "lesson.lesson", - "pk": 1100, - "fields": { - "title": "Review of the Command Line", - "created": "2020-07-09T16:41:29.700Z", - "updated": "2020-07-09T16:41:29.700Z", - "workshop": 158, - "text": "During this workshop, you'll be communicating with GitHub from our local machine via the command line (terminal, bash). It will be useful if you've taken the Command Line Workshop before continuing. This section reviews some of the basic commands that will also be used in this workshop.
\nIn addition to the command line, you'll be using your text editor and your browser. Before continuing, it's important that we clearly distinguish between these three different spaces or environments:
\n- Your plain text editor where you'll be writing your syllabus in Markdown is on your local machine.
\n- That syllabus is initially saved in a git-enabled repository on your local machine.
\n- Your browser is where you'll be uploading your repository to GitHub.
\n- Your terminal is where you'll be communicating with GitHub to send the repository and project files back and forth between the web and your hard drive.
\nBecause you'll be moving between these three spaces throughout the workshop, you may want to use (command + tab) or (ctrl + tab) to move quickly between the three windows on your desktop.
\nPress the space bar and the command key at the same time and type terminal
. Press Enter
.
Press the Windows button on your keyboard. When the search menu pops up, type git bash
and press Enter
.
In this session, we will be making a syllabus and using Git to keep track of our revisions. Let's create a Git project folder.
\ncd <directory-name> \n
will let you navigate inside a directory of your choosing.
\nType
\ncd Desktop\n
and hit Enter
. This will change your current working directory from /Users/<your-name>
to /Users/<your-name>/Desktop
.
\nTo check your current directory, type
\npwd\n
Try this now to make sure you're in your Desktop directory.
\nNow, use
\ncd ..\n
to go up one directory. In this case, this will take you back to your home directory.
\nPractice going back and forth between your Desktop and your home directory.
\nWhen finished, go to your Desktop folder and check that you're there with pwd
.
If you've worked through the command line session, you should see a projects
folder on your desktop. Navigate into it with
cd projects\n
If you don't have a projects folder on your desktop, create one with
\nmkdir projects\n
From Desktop
, navigate into your projects
folder. Then create a git-practice
folder with the below command:
mkdir git-practice\n
Enter the new git-practice folder with
cd git-practice\n
At this point, when you type pwd
, your folder structure should look like this:
/home/<username>/Desktop/projects/git-practice\n
Our first step in working with Git is letting the software know who we are so it can track our work and attribute our contributions. Through this section, you'll be checking your installation and configuring Git with your own name and information.
\nLet's make sure Git has been successfully installed. In your terminal, type
\ngit --version\n
If you see a version number, you're all set. If not, click here and install as you would any other software on your system.
\nNext, let's configure git so that it can identify who we are. This information is useful because it connects identifying information with the changes you make in your repository.
\nType the following into your command line, filling in the sections—below labeled \"John Doe\"—for your name and email (use quotations where you see them). This does not necessarily need to be the name and email you used to sign up for GitHub. Remember, these are different spaces and pieces of software.
\ngit config --global user.name \"John Doe\"\ngit config --global user.email johndoe@example.com\n
To check your set-up, use:
\ngit config --list\n
You'll get something that looks like this:
\nuser.name=Superstar Git User\nuser.email=gitsuperstar@gmail.com\n
The next step is to initialize the project folder that we want Git to track. When we initialize a folder, we are telling Git to pay attention to it. This only needs to happen once because what is actually happening through this process is Git is adding a hidden subfolder within your folder that houses the internal data structure required for version control.
\nFirst, use cd
, navigate to the git-practice
folder inside projects
. From your home directory, type:
\ncd Desktop/projects/git-practice\n
\nNext we're going to initialize our repository using the following command:
\ngit init\n
You should see output like this:
\nInitialized empty Git repository in /home/patrick/projects/git/.git/\n
Now Git is tracking our directory. Importantly, it has not done any versioning yet. There is no history of changes as of yet: 1) because there are no files and we haven't made any changes, 2) because we have to tell Git when to take a snapshot, which we go through in the next section. For now, Git knows this folder exists and is prepared to take a snapshot of the files when you tell it to.
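\nIf you want to see evidence of that hidden subfolder, listing hidden files in the directory should reveal it (ls -a is a standard command; the exact output will vary by system):
\nls -a\n
You should spot a .git entry alongside your own files. Don't edit anything inside it; Git manages that folder itself.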
\nBefore version control is useful, we'll have to create a text file for Git to track. For this session, the file we will track will be a course syllabus—we'll create that next.
\nTo create a plain text file, we're going to switch to our text editor, VS Code, to create and edit a file named syllabus.md
and save it to our 'git-practice' folder. If you have not installed VS Code, review the installation instructions here.
\nIn terminal, check to make sure you are in your git-practice
folder. (HINT: use 'pwd' to see what directory you are currently in) Next, type:
code syllabus.md\n
to open a syllabus.md
file in VS Code. You should see a window appear that looks similar to this:
\n
\nIf VS Code does not open when you use the code
command in your terminal, open it using the Start Menu on Windows or Spotlight Search on Mac OS as you would any other software. Then click File > Open File
and use the dialog to navigate to the /Users/<your-name>/Desktop/projects/git-practice
folder and create a syllabus.md
file there.
\nWe'll be typing our markdown into this file in the VS Code window. At any time, you can save your file by hitting Control-s
on Windows or ⌘-s
on Mac OS. Alternatively, you can click the File
menu on the top right, then select Save
from the dropdown menu.
\nSaving frequently is advised. When we get to the version control functionality of Git, only changes that are saved will be preserved when a version is created.
\nWe'll be using markdown to write a syllabus, and then using Git to track any changes we make to it. Markdown allows us to format textual features like headings, emphasis, links, and lists in a plain text file using a streamlined set of notations that humans can interpret without much training. Markdown files usually have a .md
extension.
\nIn markdown, we insert headings with a single hash mark like this:
\n# My Syllabus Heading\n
A sub-heading (H2) heading uses two hash marks like this:
\n## Readings\n
To provide emphasis, place asterisks around some text:
\n*This text will appear italicized.*\n**This text will appear bold.**\n
For emphasis, you need to mark where it should start and where it should end, so you need asterisks at the beginning and end of whatever text is being emphasized.
\nTo create a bulleted list, put a hyphen at the beginning of each list item:
\n- Reading one\n- Reading two\n- Reading three\n
To create a link, put the anchor text (the text you will see) in square brackets and the URL in parentheses. Don't put a space between them:
\nI teach at [The Graduate Center, CUNY](https://www.gc.cuny.edu).\n
Paragraphs of text are denoted by putting a blank line between them:
\nThis is a paragraph in markdown. It's separated from the paragraph below with a blank line. If you know HTML, it's kind of like the \\<p> tag. That means that there is a little space before and after the paragraph when it is rendered.\nThis is a second paragraph in markdown, which I'll use to tell you what I like about markdown. I like markdown because it looks pretty good, if minimal, whether you're looking at the rendered or unrendered version. It's like tidy HTML.\n
Git's primary function is version control, or to track a project as it exists at different points in time. Now that we have a file to track—our markdown syllabus—let's use Git to save the current state of the repository as it exists now.
\nIn Git, a commit is a snapshot of a repository that is entered into its permanent history. To commit a change to a repository, we take two steps:
\n1. Adding files to a \"staging area,\" meaning that we intend to commit them.
\n2. Finalizing the commit.
\nStaging a file or files is you telling Git, \"Hey! Pay attention to these files and the changes in them\".
\nMaking a commit is a lot like taking a photo. First, you have to decide who will be in the photo and arrange your friends or family in front of the camera (the staging process). Once everyone is present and ready, you take the picture, entering that moment into the permanent record (the commit process).
\nFirst, let's see what state Git is currently in. It's a good idea to use this command before and after doing anything in Git so you can always be on the same page as the computer.
\nMake sure you're in your /home/<your-name>/Desktop/projects/git-practice
directory using the pwd
command in the terminal. Once you're there, enter this command:
git status\n
You should see output like this:
\nOn branch master\nNo commits yet\nUntracked files:\n (use \"git add <file>...\" to include in what will be committed)\n syllabus.md\nnothing added to commit but untracked files present (use \"git add\" to track)\n
This means we've initialized our repository, but haven't made any commits yet. If you're instead getting a message that begins with the word fatal
when you use git status
, you may be in the wrong directory or perhaps you haven't run the git init
command on your directory yet.
\nLet's follow the recommendation in the status message above and use the add
command to stage files, making them ready to be committed.
\nType this command:
\ngit add syllabus.md\n
You should see no output from the command line, meaning that the above command succeeded. Let's run git status
again to check if things have changed. You should see output like this:
On branch master\nNo commits yet\nChanges to be committed:\n (use \"git rm --cached <file>...\" to unstage)\n new file: syllabus.md\n
The new file: syllabus.md
should be highlighted in green to show that it's ready for commit.
\nThis is Git telling you, \"Ok, I see the file(s) you're talking about.\"
\nNow that our files have been staged, let's commit them, making them part of the permanent record of the repository. Type:
\ngit commit -m \"Initial commit of syllabus file\"\n
The -m
flag provides a message along with the commit that will tell others—or remind a future version of yourself—what the commit was all about. Try not to type git commit
without the -m
flag—there's a note about this below.
\nAfter running the command, you should see output like this:
\n[master (root-commit) 8bb8306] Initial commit of syllabus file\n 1 file changed, 0 insertions(+), 0 deletions(-)\n create mode 100644 syllabus.md\n
This means you have successfully made your first commit in the repository—congratulations!
\nLet's check the state of our repository after the commit with git status
:
On branch master\nnothing to commit, working tree clean\n
This means that everything in the repository is successfully committed and up-to-date. If you edit your syllabus file or create a new file in the repository, the message you get with git status
will instead list files that have uncommitted changes.
\nLet's run one other command to see the effect our commit has had. Enter this command:
\ngit log\n
You should see output similar to this:
\ncommit 8bb8306c1392eed52d4407eb16867a49b49a46ac (HEAD -> master)\nAuthor: Patrick Smyth <patricksmyth01@gmail.com>\nDate: Sun May 20 16:03:39 2018 -0400\n Initial commit of syllabus file\n
This is the log of commits, comprising a history of your repository. There's only one commit here now, though. If you don't see a prompt (the $
) after running git log
, you may need to press the q
key (just the q
key by itself) to return to the command line.
The -m flag is useful for human purposes and technical purposes. For human purposes, the -m flag helps you keep track of the changes you're making. Version control is most useful when you can confidently return to a specific version. It can also help you be more structured in your approach to making changes - your notes to self are limited, so to make them clear you might make commits after specific tasks are completed, such as update readings for week 1 or added S.Noble reading. This can also make it easier to reverse a specific change in the future.
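\nFor instance, once a change to the syllabus has been staged, a later commit along those lines might look like this (the message text is just an example):
\ngit commit -m \"Update readings for week 1\"\n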
\nAlso, if you type git commit
by itself, git will open the command line's default text editor to allow you to enter the commit message. Unfortunately, the default text editor, vi
, requires some knowledge to use, and we don't teach it as part of our sessions.
\nIf you find yourself stuck in an unfamiliar screen in the command line after running git commit
, you're probably in vi
. Type this to leave that environment and return to the $
prompt:
:q\n
If you're ever stuck or \"trapped\" on the command line, try running through these common exit commands to return to the prompt:
\nControl-c\nControl-d\nq\n:q\n
Control-c
attempts to abort the current task and restore user control. Control-d
escapes the current shell environment—if you use it at the normal $
prompt, it will end the current command line session. q
is used to escape from specific utilities like less
. :q
first changes the mode in vi
, allowing you to enter the q
key to quit, so it's a command specific to vi
.
Now, you may want to back up or share that file on the Internet. Let's connect the directory you made on your local machine to GitHub, on the web.
\nRemember, GitHub is a service that allows you to host files, collaborate, and find the work of others. Once our syllabus is on GitHub, it will be publicly visible. Note that repositories on GitHub can also be private.
\nGo to GitHub in your browser and click the plus sign in the upper right hand corner.
\n
\nAfter clicking the plus button, select New repository
from the dropdown menu.
\n
\nAfter clicking New repository
, you'll have to enter some information, including a name and description for your repository.
\n
\n- Choose a name, such as git-practice
.
\n- Enter a description, such as Test syllabus for learning Git and GitHub
.
\n- Keep the Public — Anyone can see this repository
selector checked.
\n- Do not select Initialize this repository with a README
since you will be importing an existing repository from your computer.
\n- Click Create repository
.
\nYou should end up inside your newly created git-practice repo. The page will display a set of instructions that you can use to connect your GitHub repository to a local repository.
\nThe instructions we want consist of two lines underneath the heading ...or push an existing repository from the command line
. The hand in this screenshot points to where these directions are on the page:
\n
\nCopy out the first command and paste it in your terminal. It should look something like this:
\ngit remote add origin git@github.com:<username>/<repository-name>.git\n
You'll need the command copied from your new repo, since it will contain the correct URL.
\nNext, paste the second command. It will look exactly like this:
\ngit push -u origin master\n
After running this command, you should see output that looks like this:
\nTotal 4 (delta 3), reused 0 (delta 0)\nremote: Resolving deltas: 100% (3/3), completed with 3 local objects.\nTo github.com:<repo-name>/git.git\n 916998f..9779fa7 master -> master\n
If you see output like this, go back to your new repository page in the browser and click the Refresh
button. You should see your syllabus.md
file on GitHub!
We have covered the basic steps of creating a file and tracking changes within a file on your local machine and on GitHub.
\nThis has involved coordinating across three different environments, so let's go through that one more time. Note that this process is very slightly different. I'll highlight it when it comes up.
\nTo start, let's add something to our syllabus. Another week of materials or a new reading.
\nSave that file.
\nUse git add
via the command line to stage the file - tell Git what document you want it to pay attention to.
\nUse git commit
via the command line to save the changes you've just made as a snapshot or new version of your file. Remember to use the -m flag and include a message about the change you just made.
\nSo far, we have not done anything with GitHub or on the Internet. We have used Git, installed on our local machine, to save a version of file as it stands now. We could stop here if we only had an interest in using Git for version control. But if we also wanted to use GitHub to back up our files, or to share our project with a team or publicly, we want to upload, or push, that repository to GitHub on the Internet.
\nUse git push origin master
to push that file to your repository on GitHub. After refreshing the webpage, your file should appear online. The difference I noted above appears here. Note the absence of the -u
flag from the command. That flag was used the first time to establish the connection between the repository on your local machine and on GitHub. Now that that connection has been established, that flag is not needed.
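\nPut together, the round trip described above is only three commands once the file has been saved (the commit message here is just an example):
\ngit add syllabus.md\ngit commit -m \"Add another week of readings\"\ngit push origin master\n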
GitHub was built for sharing and collaborating on projects. A key advantage of the platform is that you can find lots of bits of software that do many different things - such as code for plugins for WordPress or Leaflet. Increasingly, you might find syllabi or open writing projects. If a project is public, you can save a copy of it to your local machine, work on it, save your amendments, and share it on your own GitHub account. Like we've already mentioned, GitHub usefully helps track attribution along the way.
\nCloning and forking are the basic functions of this capability. Each is explained below, followed by an example and an activity to illustrate it further.
\nCloning a repository means making a copy of a repository on GitHub, to download and work on locally - on your local machine. By entering the following code into your terminal, you can clone any public directory on GitHub:
\ngit clone <repository-url>\n
When you clone a repository from GitHub, the folder that shows up on your local machine comes built-in with a few things. First, Git is already present, so you don't need to initialize the folder. Also, the connection between your local copy and the online repository is already made, so git push origin master
will work (no -u flag needed).
\nFor practice, let's clone the repository for this workshop about Git and GitHub.
\nFirst, let's navigate back to your Desktop folder.
\ncd ~/Desktop\n
Remember that the ~ refers to your home directory. Now let's find the URL we need to clone the lesson.
\nFirst, follow this link to the main page of this lesson on Git and GitHub.
\nOn the main page, there should be a green Clone or download
button on the right side:
\n
\nClick the green button and you will see a box with highlighted text under a heading that says Clone with HTTPS
. If you instead see Cloning with SSH
, click the small link that says Use HTTPS
.
\nNow copy out the text in the box:
\n
\nNow that you have the text copied, go back to your terminal. Remember, you should be on the desktop.
\nType
\ngit clone <copied-url>\n
If the command is successful, the full Git lesson will be replicated on your local machine. You can type
\ncd git\n
to enter the lesson folder, since the lesson repository is simply called git
. Use the ls
command to take a look at the various files in the lesson folder.
\nCloning can be especially useful when you're joining a group project that is hosted on GitHub, and you want your changes to eventually be pushed and shared with that same repository.
\nBut maybe that is not possible or ideal. Maybe you don't want to contribute your changes to someone else's repository. Maybe you want to make a derivative of their folder for yourself, on your GitHub account, and make changes there.
\nForking is the step you could take before cloning to do this.
\nForking a repository means making a copy of someone else's repository on GitHub, and saving it to your account on GitHub. This function happens within GitHub, and has nothing to do with what is happening on your local machine.
\nFor example, go to the repository for this workshop on GitHub. Note the Fork
button in the upper right hand corner.
\nADD SCREENSHOT.
\nClicking that button copies, or forks, this repository to your account. In the upper left hand corner, it would then show your account name instead of DHRI-Curriculum, and just below that it would reference our account after the words forked from.
\nADD SCREENSHOT
\nYour local machine would come into play when you want to clone that repository so you can work on it locally. This also means that when you push those changes to GitHub, you would be pushing them to a forked repository associated with your own account.
\nYou might use this method if you were going to teach your own Git & GitHub workshop. You could use our repository as a base for getting started, and add more examples or change some language, clarify something further, or create a connection to another workshop you are giving, etc. This allows us to continue to use the workshop as we have it as well. Also, maybe at a later time, we want to merge some of your changes with ours. We can do that too by revisiting your version history.
", - "order": 8 - } - }, - { - "model": "lesson.challenge", - "pk": 261, - "fields": { - "lesson": 1102, - "title": "", - "text": "Use these five elements—headings, emphasis, lists, links, and paragraphs—to create a syllabus. Have a main heading that gives the course title (one #
), then subheadings for, at least, course info and readings. Use emphasis (*
) for book titles and try to get a list in there somewhere.
You can look at an example syllabus in raw text form here. When it's rendered by GitHub, it looks like this. When editing the markdown file in VS Code, it might look like this:
\n\nVS Code also has a preview feature for your markdown. Hit the preview button on the top right while editing your markdown file:
\n\nYou'll get two side-by-side panels. Your markdown file will be on the left, and your rendered preview will be on the right:
\n\nRemember to save your work with Control-s
on Windows or ⌘-s
on Mac OS.
Go through the process a few more times by adding additional readings and weeks of course material. Remember to commit changes intentionally so your commit messages make sense. Use git log
to review your changes.
Also try creating a new file and adding an assignment. Rewrite the assignment using Markdown, or edit and add in the markers. Go through the process of staging and committing that file, and pushing it to your repository on GitHub.
\nTest your understanding by thinking through the following questions:
\n- Do you need to push the file to GitHub each time you commit changes to the file, or can you make several commits to a file and push them all to GitHub at once?
\n- Do you need to use git init
after adding an assignment file to your folder?
\n- What about the -u flag in the git push origin master? Does this flag need to be used to add the assignment to your repository on GitHub?
" - } - }, - { - "model": "lesson.challenge", - "pk": 263, - "fields": { - "lesson": 1105, - "title": "", - "text": "You made it to the end of this workshop-congratulations! Now, practice your new skills:
\nThis is the about page.
" - } - }, - { - "model": "website.page", - "pk": 2, - "fields": { - "name": "Workshops", - "slug": "workshops", - "text": "This is the workshop page.
" - } - }, - { - "model": "website.page", - "pk": 3, - "fields": { - "name": "Library", - "slug": "library", - "text": "This is the library page.
" - } - } -] \ No newline at end of file +[{"model": "workshop.workshop", "pk": 159, "fields": {"name": "Data And Ethics", "slug": "data-and-ethics", "created": "2020-07-09T19:01:19.054Z", "updated": "2020-07-09T19:01:19.054Z", "parent_backend": "Github", "parent_repo": "DHRI-Curriculum/data-and-ethics", "parent_branch": "v2.0-di-edits"}}, {"model": "frontmatter.frontmatter", "pk": 151, "fields": {"workshop": 159, "abstract": "What is data? Nearly all digital work requires dealing with data. In this workshop we will be discussing the basics of research data, in terms of material, transformation, and presentation.", "ethical_considerations": "['Data and data analysis is [not free from bias](https://medium.com/@angebassa/data-alone-isnt-ground-truth-9e733079dfd4). There is no magic blackbox for which data emerges from and is contextually driven. As we think about the automation process of looking at \"big\" data, we have to be aware of [the biases that gets reproduced that is \"hidden.\"](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing)', 'De-identified information can be [reconstructed from piecemeal data](https://techscience.org/a/2015092903/)found across different sources. When we consider what we are doing with the data we have collected, we also need to think about the possible re-identification of our participants. ', 'Big data projects often times requiring sharing data sets across different individuals and teams. In addition, to ensure that our work is reproducable and accountable, we may also feel inclined to share the data collected. As such, figuring out [how to share such data](https://techscience.org/a/2015101601/) is crucial in the project planning stage.']", "estimated_time": 0, "projects": [373, 374], "resources": [], "readings": [741, 742, 743], "contributors": [471, 472, 473], "prerequisites": []}}, {"model": "praxis.praxis", "pk": 137, "fields": {"discussion_questions": "['What are some forms of data you use in your work? What about forms of data that you produce as your output? Perhaps there are some forms that are typical of your field? Where do you usually get your data from?', 'What is publically available data?', \"How do you decide the formats to store your data when you transition from 'raw' to 'processed/transformed' data? What are some of your considerations?\", 'How do we know when our data is cleaned enough? What happens to the data that is removed? What are we choosing to say about our dataset as we prepare them for analysis?', 'As we consider the types of analysis that we choose to apply onto our data set, what are we representing and leaving out? How do we guide our decisions of interpretation with our choices of analyses? Are we comfortable with the (un)intended use of our research? What are potential misuses of our outputs? ', 'What can happen when we are trying to just go for the next big thing (tool/methods/algorithms) or just ran out of time and/or budget for our project?', 'What are we assuming when we choose to visually represent data in particular ways? 
How can data visualization mislead us?']", "next_steps": "[]", "workshop": 159, "further_readings": [744, 745, 746], "more_projects": [], "more_resources": [], "tutorials": [362, 363]}}, {"model": "lesson.lesson", "pk": 1106, "fields": {"title": "Data is Foundational", "created": "2020-07-09T19:01:19.077Z", "updated": "2020-07-09T19:01:19.077Z", "workshop": 159, "text": "In this brief workshop we will be discussing the basics of research data, in terms of material, transformation, and presentation. We will also be focusing on the ethics of data cleaning and representation. Because everyone has a different approach to data and ethics, this workshop will also include multiple sites for discussions to help us think together as a group.
\n\"Material or information on which an argument, theory, test or hypothesis, or another research output is based.\"
\nQueensland University of Technology. Manual of Procedures and Policies. Section 2.8.3. http://www.mopp.qut.edu.au/D/D_02_08.jsp
\n\"What constitutes such data will be determined by the community of interest through the process of peer review and program management. This may include, but is not limited to: data, publications, samples, physical collections, software and models\"
\nMarieke Guy. http://www.slideshare.net/MariekeGuy/bridging-the-gap-between-researchers-and-research-data-management , #2
\n\"Units of information created in the course of research\"
\nhttps://www.nsf.gov/bfa/dias/policy/dmpfaqs.jsp
\n\"(i) Research data is defined as the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues.\"
\nOMB-110, Subpart C, section 36, (d) (i), http://www.whitehouse.gov/omb/circulars_a110/
\n\"The short answer is that we can\u2019t always trust empirical measures at face value: data is always biased, measurements always contain errors, systems always have confounders, and people always make assumptions.\" Angela Bassa. https://medium.com/@angebassa/data-alone-isnt-ground-truth-9e733079dfd4
\nIn summary, research data is:
\nMaterial or information necessary to come to your conclusion.
\nThere are many ways to represent data, just as there are many sources of data. After processing our data, we turn it into a number of products. For example:\n* Non-digital text (lab books, field notebooks)\n* Digital texts or digital copies of text\n* Spreadsheets\n* Audio\n* Video\n* Computer Aided Design/CAD\n* Statistical analysis (SPSS, SAS)\n* Databases\n* Geographic Information Systems (GIS) and spatial data\n* Digital copies of images\n* Web files\n* Scientific sample collections\n* Matlab files & 3D Models\n* Metadata & Paradata\n* Data visualizations\n* Computer code\n* Standard operating procedures and protocols\n* Protein or genetic sequences\n* Artistic products\n* Curriculum materials\n* Collection of digital objects acquired and generated during research
\nAdapted from: Georgia Tech\u2013http://libguides.gatech.edu/content.php?pid=123776&sid=3067221
\nThese are some (most!) of the shapes your research data might transform into.
\n1. What are some forms of data you use in your work?
\n2. What about forms of data that you produce as your output? Perhaps there are some forms that are typical of your field.
\n3. Where do you usually get your data from?
", "order": 1}}, {"model": "lesson.lesson", "pk": 1107, "fields": {"title": "Stages of Data", "created": "2020-07-09T19:01:19.113Z", "updated": "2020-07-09T19:01:19.113Z", "workshop": 159, "text": "We begin without data. Then it is observed, or made, or imagined, or generated. After that, it goes through further transformations:
\nRaw data is yet to be processed, meaning it has yet to be manipulated by a human or computer. Received or collected data could be in any number of formats, locations, etc. It could be in any of the forms listed above.
\nBut \"raw data\" is a relative term, inasmuch as when one person finishes processing data and presents it as a finished product, another person may take that product and work on it further, and for them that data is \"raw data\".
\nAs we think about data collection, we should also consider the labor involved in the process. Many researchers rely on Amazon Mechanical Turk (sometimes also referred to as MTurk) for data collection, often paying less than minimum wage for the task. The assumption is often that these workers are retired, bored, and doing online gig work for fun or to kill time. While this may be true for some, more than half of those surveyed in a Pew Research study report that the income from this work is essential or important to them. Often, those who view this income as essential or important also come from underserved communities.
\nIn addition to being mindful of paying a fair wage to the workers on such platforms, this working environment also raises further considerations about the data that is collected. Oftentimes, for workers to get close to minimum wage, they cannot afford to spend much time on each task, which increases the potential for errors in the collected data.
\nProcessing data puts it into a state more readily available for analysis and makes the data legible. For instance, it could be rendered as structured data. This can take many forms, e.g., a table.
\nHere are a few you're likely to come across, all representing the same data:
\nXML
\n<Cats> \n <Cat> \n <firstName>Smally</firstName> <lastName>McTiny</lastName> \n </Cat> \n <Cat> \n <firstName>Kitty</firstName> <lastName>Kitty</lastName> \n </Cat> \n <Cat> \n <firstName>Foots</firstName> <lastName>Smith</lastName> \n </Cat> \n <Cat> \n <firstName>Tiger</firstName> <lastName>Jaws</lastName> \n </Cat> \n</Cats> \n
JSON
\n{\"Cats\":[ \n { \"firstName\":\"Smally\", \"lastName\":\"McTiny\" }, \n { \"firstName\":\"Kitty\", \"lastName\":\"Kitty\" }, \n { \"firstName\":\"Foots\", \"lastName\":\"Smith\" }, \n { \"firstName\":\"Tiger\", \"lastName\":\"Jaws\" } \n]} \n
CSV
\nFirst Name,Last Name\nSmally,McTiny\nKitty,Kitty\nFoots,Smith\nTiger,Jaws\n
A small detour to discuss (the ethics of?) data formats. For accessibility, future-proofing, and preservation, keep your data in open, sustainable formats. A demonstration:
\n1. Open this file in a text editor, and then in an app like Excel. This is a CSV, an open, text-only, file format.
\n2. Now do the same with this one. This is a proprietary format!
\nSustainable formats are generally unencrypted, uncompressed, and follow an open standard. A small list:\n* ASCII\n* PDF \n* .csv\n* FLAC\n* TIFF\n* JPEG2000\n* MPEG-4\n* XML\n* RDF\n* .txt\n* .r
\nHow do you decide the formats to store your data when you transition from 'raw' to 'processed/transformed' data? What are some of your considerations?
\nThere are guidelines to the processing of data, sometimes referred to as Tidy Data.[1] One manifestation of these rules:
\n1. Each variable is in a column.
\n2. Each observation is a row.
\n3. Each value is a cell.
\nLook back at our example of cats to see how they may or may not follow those guidelines. Important note: some data formats allow for more than one dimension of data! How might that complicate the concept of Tidy Data?
\n{\"Cats\":[\n {\"Calico\":[\n { \"firstName\":\"Smally\", \"lastName\":\"McTiny\" },\n { \"firstName\":\"Kitty\", \"lastName\":\"Kitty\" }],\n \"Tortoiseshell\":[\n { \"firstName\":\"Foots\", \"lastName\":\"Smith\" }, \n { \"firstName\":\"Tiger\", \"lastName\":\"Jaws\" }]}]}\n
[1] Wickham, Hadley. \"Tidy Data\". Journal of Statistical Software.
", "order": 2}}, {"model": "lesson.lesson", "pk": 1108, "fields": {"title": "More Stages of Data", "created": "2020-07-09T19:01:19.118Z", "updated": "2020-07-09T19:01:19.118Z", "workshop": 159, "text": "High quality data is measured in its validity, accuracy, completeness, consistency, and uniformity.
\nProcessed data, even in a table, is going to be full of errors:
\n1. Empty fields
\n2. Multiple formats, such as \"yes\" or \"y\" or \"1\" for a positive response.
\n3. Suspect answers, like a date of birth of 00/11/1234
\n4. Impossible negative numbers, like an age of \"-37\"
\n5. Dubious outliers
\n6. Duplicated rows
\n7. And many more!
\nCleaning data is the work of correcting the errors listed above and moving towards high quality. This work can be done manually or programmatically.
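\nAs a minimal, hypothetical sketch of what programmatic cleaning can look like (written in Python; the field names and rules below are invented for illustration and are not part of this workshop's materials):
\nraw_rows = [\n    {'response': 'Yes', 'age': '37'},\n    {'response': 'y', 'age': '-37'},\n    {'response': '1', 'age': '62'},\n]\n\nclean_rows = []\nfor row in raw_rows:\n    # Collapse the different spellings of a positive response into one form.\n    response = 'yes' if row['response'].strip().lower() in ('yes', 'y', '1') else row['response']\n    age = int(row['age'])\n    if age < 0:  # discard observations that fail a validity check\n        continue\n    clean_rows.append({'response': response, 'age': age})\n\nprint(clean_rows)\n
The same decisions (collapse \"yes\"/\"y\"/\"1\" into one value, drop negative ages) could just as easily be made by hand in a spreadsheet; the point is that every cleaning step is a choice about what counts as valid data.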
\nValidity
\nMeasurements must be valid, in that they must conform to set constraints:
\n1. The aforementioned \"yes\" or \"y\" or \"1\" should all be changed to one response.
\n2. Certain fields cannot be empty, or the whole observation must be thrown out.
\n3. Uniqueness, for instance no two people should have the same social security number.
\nAccuracy
\nMeasurements must be accurate, in that they must represent the correct values. While an observation may be valid, it might at the same time be inaccurate: 123 Fake Street is a valid but inaccurate street address.
\nUnfortunately, accuracy is mostly achieved in the observation process. To improve accuracy during cleaning, an outside, trusted source would have to be cross-referenced.
\nCompleteness
\nMeasurements must be complete, in that they must represent everything that might be known. This also is nearly impossible to achieve in the cleaning process! For instance in a survey, it would be necessary to re-interview someone whose previous answer to a question was left blank.
\nConsistency
\nMeasurements must be consistent, in that different observations must not contradict each other. For instance, one person cannot be represented as both dead and still alive in different observations.
\nUniformity
\nMeasurements must be uniform, in that the same unit of measure must be used in all relevant measurements. If one person's height is listed in meters and another in feet, one measurement must be converted.
\nHow do we know when our data is cleaned enough? What happens to the data that is removed? What are we choosing to say about our dataset as we prepare them for analysis?
\nAnalysis can take many forms (just like the rest of this stuff!), but many techniques fall within a couple of categories:
\nTechniques geared towards summarizing a data set, such as:\n* Mean\n* Median\n* Mode\n* Average\n* Standard deviation
\nTechniques geared towards testing a hypothesis about a population, based on your data set, such as:\n* Extrapolation\n* P-Value calculation
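\nFor instance, several of the summary statistics above can be computed with Python's built-in statistics module (one option among many; the numbers below are a made-up sample):
\nimport statistics\n\nages = [23, 25, 25, 29, 31, 47]  # a made-up sample\n\nprint(statistics.mean(ages))    # mean (average)\nprint(statistics.median(ages))  # median\nprint(statistics.mode(ages))    # mode\nprint(statistics.stdev(ages))   # standard deviation\n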
\nAs we consider the types of analysis that we choose to apply onto our data set, what are we representing and leaving out? How do we guide our decisions of interpretation with our choices of analyses? Are we comfortable with the intended use of our research? Are we comfortable with the unintended use of our research? What are potential misuses of our outputs? What can happen when we are trying to just go for the next big thing (tool/methods/algorithms) or just ran out of time and/or budget for our project?
\n\nAdapted from Evergreen, Stephanie D. Effective Data Visualization: The Right Chart for the Right Data. Los Angeles: SAGE, 2017.
As we transform our results into visuals, we are also trying to tell a narrative about the data we collected. Data visualization can help us decode information and share it quickly and simply. What are we assuming when we choose to visually represent data in particular ways? How can data visualization mislead us?
", "order": 3}}, {"model": "lesson.lesson", "pk": 1109, "fields": {"title": "Data Literacy and Ethics", "created": "2020-07-09T19:01:19.122Z", "updated": "2020-07-09T19:01:19.122Z", "workshop": 159, "text": "Throughout the workshop we have been thinking together through some of
\nthe potential ethical concerns that might crop up as we proceed with our
\nown projects. Just as we have disucssed thus far, we hope that you see
\nthat data and ethics is an ongoing process throughout the lifespans of
\nyour project(s) and don\u2019t often come with easy answers.
\nIn this final activity, we would like for you to think about some of the
\npotential concerns that might come up in the scenario below and discuss
\nhow you might approach them:
\nYou are interested in looking at the reactions to the democratic party
\npresidential debates across time. You decided that you would use data
\nfrom twitter to analyze the responses. After collecting your data, you
\nlearned that your data has information from users who were later banned
\nand included some tweets that were removed/deleted from the site.
\nData and ethics are contextually driven. As such, there isn\u2019t always a
\nrisk-free approach. We often have to work through ethical dilemmas while
\nthinking through information that we may not have (what are the risks of
\ndoing/not doing this work?). We may be approaching a moment where the
\nquestion is no longer what we could do but what we should do.
", "order": 4}}, {"model": "frontmatter.learningobjective", "pk": 935, "fields": {"frontmatter": 151, "label": "Understand the stages of data analysis."}}, {"model": "frontmatter.learningobjective", "pk": 936, "fields": {"frontmatter": 151, "label": "Understand the beginning of cleaning/tidying data"}}, {"model": "frontmatter.learningobjective", "pk": 937, "fields": {"frontmatter": 151, "label": "Experience the difference between proprietary and open data formats."}}, {"model": "frontmatter.learningobjective", "pk": 938, "fields": {"frontmatter": 151, "label": "Become familiar with the specific requirements of \"high quality data.\""}}, {"model": "frontmatter.learningobjective", "pk": 939, "fields": {"frontmatter": 151, "label": "Have an understanding of potential ethical concerns around working with different types of data and analysis."}}, {"model": "frontmatter.contributor", "pk": 471, "fields": {"first_name": "Stephen", "last_name": "Zweibel", "role": null, "url": null}}, {"model": "frontmatter.contributor", "pk": 472, "fields": {"first_name": "Di", "last_name": "Yoong", "role": null, "url": null}}, {"model": "frontmatter.contributor", "pk": 473, "fields": {"first_name": "Ian", "last_name": "Phillips", "role": null, "url": null}}, {"model": "library.reading", "pk": 741, "fields": {"title": "Big? Smart? Clean? Messy? Data in the Humanities", "url": "http://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities", "annotation": "[Big? Smart? Clean? Messy? Data in the Humanities](http://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/)", "zotero_item": null}}, {"model": "library.reading", "pk": 742, "fields": {"title": "Bit By Bit: Social Research in Digital Age", "url": "https://www.bitbybitbook.com/en/1st-ed/preface", "annotation": "[Bit By Bit: Social Research in Digital Age](https://www.bitbybitbook.com/en/1st-ed/preface/)", "zotero_item": null}}, {"model": "library.reading", "pk": 743, "fields": {"title": "Ten Simple Rules for Responsible Big Data Research", "url": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5373508", "annotation": "[Ten Simple Rules for Responsible Big Data Research](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5373508/)", "zotero_item": null}}, {"model": "library.project", "pk": 373, "fields": {"title": "Data for Public Good", "url": "https://dataforgood.commons.gc.cuny.edu", "annotation": "[Data for Public Good](https://dataforgood.commons.gc.cuny.edu/): Graduate student fellows creates a semester-long collaborative project that makes public-interest dataset useful and informative to a public audience.", "zotero_item": null}}, {"model": "library.project", "pk": 374, "fields": {"title": "SAFElab", "url": "https://safelab.socialwork.columbia.edu", "annotation": "[SAFElab](https://safelab.socialwork.columbia.edu/): Uses computational and social work approaches to understand mechanisms of violence and how to prevent and intervene in violence that occur in neighbourhoods and on social media.", "zotero_item": null}}, {"model": "library.tutorial", "pk": 362, "fields": {"label": "Computational social science with R", "url": "https://compsocialscience.github.io/summer-institute/curriculum#day_2", "annotation": "[Computational social science with R](https://compsocialscience.github.io/summer-institute/curriculum#day_2) by the Summer Institutes in Computational Social Science", "zotero_item": null}}, {"model": "library.tutorial", "pk": 363, "fields": {"label": "SQLite Tutorial", "url": 
"https://www.sqlitetutorial.net", "annotation": "[SQLite Tutorial](https://www.sqlitetutorial.net/) by SQLiteTutorial", "zotero_item": null}}, {"model": "library.reading", "pk": 744, "fields": {"title": "data management presentation", "url": "https://www.slideshare.net/MariekeGuy/bridging-the-gap-between-researchers-and-research-data-management", "annotation": "Marieke Guy's [data management presentation](https://www.slideshare.net/MariekeGuy/bridging-the-gap-between-researchers-and-research-data-management)", "zotero_item": null}}, {"model": "library.reading", "pk": 745, "fields": {"title": "Management of Research Data", "url": "http://www.mopp.qut.edu.au/D/D_02_08.jsp", "annotation": "Queensland University of Technology's [Management of Research Data](http://www.mopp.qut.edu.au/D/D_02_08.jsp).", "zotero_item": null}}, {"model": "library.reading", "pk": 746, "fields": {"title": "Perspectives on Big Data, Ethics, and Society", "url": "https://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society", "annotation": "The Council for Big Data, Ethics, and Society's publication [Perspectives on Big Data, Ethics, and Society](https://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society/).", "zotero_item": null}}, {"model": "workshop.workshop", "pk": 160, "fields": {"name": "Text Analysis", "slug": "text-analysis", "created": "2020-07-09T19:01:21.212Z", "updated": "2020-07-09T19:01:21.212Z", "parent_backend": "Github", "parent_repo": "DHRI-Curriculum/text-analysis", "parent_branch": "v2.0-rafa-edits"}}, {"model": "frontmatter.frontmatter", "pk": 152, "fields": {"workshop": 160, "abstract": "Digital technologies have made vast amounts of text available to researchers, and this same technological moment has provided us with the capacity to analyze that text. The first step in that analysis is to transform texts designed for human consumption into a form a computer can analyze. Using Python and the Natural Language ToolKit (commonly called NLTK), this workshop introduces strategies to turn qualitative texts into quantitative objects. Through that process, we will present a variety of strategies for simple analysis of text-based data.", "ethical_considerations": "['In working with massive amounts of text, it is natural to lose the original context. We must be aware of that and be careful when analizing it.', 'It is important to constantly question our assumptions and the indexes we are using. Numbers and graphs do not tell the story, our analysis does. We must be careful not to draw hasty and simplistic conclusions for things that are complex. Just because we found out that author A uses more unique words than author B, does it mean that A is a better writer than B?']", "estimated_time": "10", "projects": [375, 376, 377], "resources": [], "readings": [747, 748], "contributors": [474, 475, 476, 477, 478, 479, 480, 481, 482], "prerequisites": []}}, {"model": "praxis.praxis", "pk": 138, "fields": {"discussion_questions": "['Content TBD']", "next_steps": "[]", "workshop": 160, "further_readings": [749, 750], "more_projects": [], "more_resources": [], "tutorials": [364, 365, 366]}}, {"model": "lesson.lesson", "pk": 1110, "fields": {"title": "Overview", "created": "2020-07-09T19:01:21.218Z", "updated": "2020-07-09T19:01:21.218Z", "workshop": 160, "text": "This tutorial will give a brief overview of the considerations and tools involved in basic text analysis with Python. 
By completing this tutorial, you will have a general sense of how to turn text into data using the Python package, NLTK. You will also be able to take publicly available text files and transform them into a corpus that you can perform your own analysis on. Finally, you will have some insight into the types of questions that can be addressed with text analysis.
\nIf you have not already installed the Anaconda distribution of Python 3, please do so.
\nYou will also need nltk
and matplotlib
to complete this tutorial. Both packages come installed with Anaconda. To check to be sure you have them, open a new Jupyter Notebook (or any IDE to run Python).
\nFind Anaconda Navigator on your computer (it should be located in the folder with your other applications), and from Acadonda Navigator's interface, launch a Jupyter Notebook.
\n
\nIt will open in the browser. All of the directories (folders) in your home directory will appear \u2014 we'll get to that later. For now, select New
>> Python3
in the upper right corner.
\n
\nA blank page with an empty box should appear.
\n
\nIn the box, type:
\nimport nltk\nimport matplotlib\n
Press Shift + Enter
to run the cell (or click run at the top of the page). Don't worry too much about what this is doing - that will be explained later in this tutorial. For now, we just want to make sure the packages we will need are installed.
\n
\nIf nothing happens, they are installed and you are ready to move on! If you get an error message, either you have a typo or they are not installed. If it is the latter, open the command line and type:
\nconda install nltk -y\nconda install matplotlib -y\n
Now we need to install the nltk corpus. This is very large and may take some time if you are on a weak connection.
\nIn the next cell, type:
\nnltk.download()\n
and run the cell.
\nThe NLTK downloader should appear. Please install all of the packages. If you are short on time, focus on \"book\" for this tutorial\u2014you can download the other packages at another time for later use.
\nYours will look a little different, but the same interface. Click on the 'all' option and then 'Download'. Once they all trun green, you can close the Downloader dialogue box.
\n
\nReturn to your Jupyter Notebook and type:
\nfrom nltk.book import *\n
A list of books should appear. If this happens, great! If not, return to the downloader to make sure everything is ok.
\nClose this Notebook without saving \u2014 the only purpose was to check if we have the appropriate packages installed.
", "order": 1}}, {"model": "lesson.lesson", "pk": 1111, "fields": {"title": "Text as Data", "created": "2020-07-09T19:01:21.224Z", "updated": "2020-07-09T19:01:21.224Z", "workshop": 160, "text": "When we think of \"data,\" we often think of numbers, things that can be summarized, statisticized, and graphed. Rarely when I ask people \"what is data?\" do they respond \"Moby Dick.\" And yet, more and more, text is data. Whether it is Moby Dick, or every romance novel written since 1750, or today's newspaper or twitter feed, we are able to transform written (and spoken) language into data that can be quantified and visualized.
\nThe first step in gathering insights from texts is to create a corpus. A corpus is a collection of texts that are somehow related to each other. For example, the Corpus of Contemporary American English, Donald Trump's Tweets, text messages sent by bilingual young adults, digitized newspapers, or books in the public domain are all corpora. There are infinitely many corpora, and, sometimes, you will want to make your own\u2014that is, one that best fits your research question.
\nThe route you take from here will depend on your research question. Let's say, for example, that you want to examine gender differences in writing style. Based on previous linguistic research, you hypothesize that male-identified authors use more definitives than female-identified. So you collect two corpora\u2014one written by men, one written by women\u2014and you count the number of thes, thiss, and thats compared to the number of as, ans, and ones. Maybe you find a difference, maybe you don't. We can already see that this is a relatively crude way of going about answering this question, but it is a start. (More likely, you'd use a supervised classification task, which you will learn about in the Machine Learning Tutorial.)
\nThere has been some research about how the linguistic complexity of written language in long-form pieces (i.e., books, articles, letters, etc.) has decreased over time. Simply put, people today use shorter sentences with fewer embedded clauses and complex tense constructions than people did in the past. (Note that this is not necessarily a bad or good thing.) Based on this research, we want to know if short-form platforms are emblematic of the change (we predict that they are based on our own experience with short-form platforms like email and Twitter). One way to do this would be to use Part-of-Speech tagging. Part-of-Speech (POS) tagging is a way to identify the category of words in a given text.
\nFor example, the sentence:
\n\n", "order": 2}}, {"model": "lesson.lesson", "pk": 1112, "fields": {"title": "Cleaning and Normalizing", "created": "2020-07-09T19:01:21.235Z", "updated": "2020-07-09T19:01:21.235Z", "workshop": 160, "text": "I like the red bicycle.
\nhas one pronoun, one verb, one determiner, one adjective, and one noun.
\n(I : Pronoun), (like : Verb), (the : Determiner), (red : Adjective), (bicycle : Noun)
\nNLTK uses the Penn Tree Bank Tag Set. This is a very detailed tag list that goes far beyond just nouns, verbs, and adjectives, but gives insight into different types of nouns, prepositions, and verbs as well. Virtually all POS taggers will create a list of (word, POS) pairs. If newspaper articles have a higher ratio of function words (prepositions, auxiliaries, determiners, etc.) to semantic words (nouns, verbs, adjectives), than tweets, then we have one piece of evidence supporting our hypothesis. It's important to note here that we must use either ratios or otherwise normalized data (in the sense that raw numbers will not work). Because of the way that language works (function words are often repeated, for example), a sample of 100 words will have more unique words than a sample of 1,000. Therefore, to compare different data types (articles vs. tweets), this fact should be taken into account.
\n
Generally, however, our questions are more about topics rather than writing style. So, once we have a corpus\u2014whether that is one text or millions\u2014we usually want to clean and normalize it. There are three terms we are going to need:
\n- Text normalization is the process of taking a list of words and transforming it into a more uniform sequence. Usually, this involves removing punctuation, making the words all the same case, removing stop words, and either stemming or lemmatizing the words. It can also include expanding abbreviations or matching misspellings (but these are advanced practices that we will not cover).
\nYou probably know what removing punctuation and capitalization refer to, but the other terms may be new:
\n- Stop words are words that appear frequently in a language, often adding grammatical structure, but little semantic content. There is no official list of stop words for any language, though there are some common, all-purpose lists built in to NLTK. However, different tasks require different lists. The purpose of removing stop words is to remove words that are so common that their meaning is diminished across a large number of texts.
\n- Stemming and lemmatizing both of these processes try to consolidate words like \"laughs\" and \"laughing\" to \"laugh\" since they all mean essentially the same thing, they are just inflected differently. So again, in an attempt to reduce the number of words, and get a realistic understanding of the meaning of a text, these words are collapsed. Stemming does this by cutting off the end (very fast), lemmatizing does this by looking up the dictionary form (very slow).
\nLanguage is messy, and created for and by people, not computers. There is a lot of grammatical information in a sentence that a computer cannot use. For example, I could say to you:
\n\n\nThe house is burning.
\nand you would understand me. You would also understand if I say
\nhouse burn.
\nThe first has more information about tense, and which house in particular, but the sentiment is the same either way.
\nIn going from the first sentence to the normalized words, we removed the stop words (the and is), and removed punctuation and case, and lemmatized what was left (burning becomes burn\u2014though we might have stemmed this, its impossible to tell from the example). This results in what is essentially a \"bag of words,\" or a corpus of words without any structure. Because normalizing your text reduces the number of words (and therefore the number of dimensions in your data), and keeps only the words that contribute meaning to the document, this cleaning is usually desirable.
\nAgain, this will be covered more in depth in the Machine Learning Tutorial, but for the time being, we just need to know that there is \"clean\" and \"dirty\" versions of text data. Sometimes our questions are about the clean data, but sometimes our questions are in the \"dirt.\"
\n
In the next section, we are going to go through a series of methods that come built-in to NLTK that allow us to turn our words into numbers and visualizations. This is just scratching the surface, but should give you an idea of what is possible beyond just counting words.
", "order": 3}}, {"model": "lesson.lesson", "pk": 1113, "fields": {"title": "NLTK Methods with the NLTK Corpus", "created": "2020-07-09T19:01:21.242Z", "updated": "2020-07-09T19:01:21.242Z", "workshop": 160, "text": "All of the code for this section is in a Jupyter Notebook in the GitHub repository. I encourage you to follow along by retyping all of the code, but if you get lost, or want another reference, the code is there as well.
\nTo open the notebook, first create a projects
folder if you don't already have one by entering this command in your terminal:
mkdir -p ~/Desktop/projects\n
If you already have a projects folder, you can skip this step.
\nNext, clone the text analysis session repository into your projects folder by entering this command:
\ngit clone https://github.com/DHRI-Curriculum/text-analysis.git ~/Desktop/projects/text-analysis\n
Then move to the new directory:
\ncd ~/Desktop/projects/text-analysis\n
Now launch the Jupyter Notebook application by typing this into the terminal:
\njupyter notebook\n
If it's your first time opening the notebook, you may be prompted to enter a URL into your browser. Copy out the URL and paste it into the Firefox or Google Chrome search bar.
\nFinally, in the Jupyter Notebook file browser, find the notebook file and open it. It should be called TextAnalysis.ipynb
. You will use this file for reference in case you get stuck in the next few sections, so keep it open.
\nReturn to the Jupyter Home Tab in your Browser (or Launch the Jupyter Notebook again), and start a New Python3 Notebook using the New
button in the upper right corner.
\nEven though Jupyter Notebook doesn't force you to do so, it is very important to name your file, or you will end up later with a bunch of untitled files and you will have no idea what they are about. In the top left, click in the word Untitled
and give your file a name such as \"intro_nltk\".
\nIn the first blank cell, type the following to import the NLTK library:
\nimport nltk\n
Libraries are sets of instructions that Python can use to perform specialized functions. The Natural Language ToolKit (nltk
) is one such library. As the name suggests, its focus is on language processing.
\nWe will also need the matplotlib library later on, so import it now:
\nimport matplotlib\n
matplotlib
is a library for making graphs. In the middle of this tutorial, we are going to make a dispersion plot of words in our texts.
\nFinally, because of a quirk of Jupyter notebooks, we need to specify that matplotlib should display its graphs in the notebook (as opposed to in a separate window), so we type this command (this is technically a Jupyter command, not Python):
\n%matplotlib inline\n
All three of these commands can be written in the same cell and run all at once (Shift + Enter
) or in different cells.
\n
\nIf you don't see an error when you run the notebook\u2014that is, if nothing happens\u2014you can move on to the next step.
\nNext, we need to load all of the NLTK corpora into our program. Even though we downloaded them to our computer, we need to tell Python we want to use them.
\nfrom nltk.book import *\n
The pre-loaded NLTK texts should appear again. These are preformatted data sets. We will still have to do some minor processing, but having the data in this format saves us a few steps. At the end of this tutorial, we will make our own corpus. This is a special type of python object specific to NLTK (it isn't a string, list, or dictionary). Sometimes it will behave like a string, and sometimes like a list of words. How it is behaving is noted for each function as we try it out.
\n
\nLet's start by analyzing Moby Dick, which is text1
for NLTK.
The first function we will look at is concordance
. \"Concordance\" in this context means the characters on either side of the word. Our text is behaving like a string. As discussed in the Python tutorial LINK, Python does not evaluate strings, so it just counts the number of characters on either side. By default, this is 25 characters on either side of our target word (including spaces).
\nIn the Jupyter Notebook, type:
\ntext1.concordance(\"whale\")\n
The output shows us the 25 characters on either side of the word \"whale\" in Moby Dick. Let's try this with another word, \"love.\" Just replace the word \"whale\" with \"love,\" and we get the contexts in which Melville uses \"love\" in Moby Dick. concordance
is used (behind the scenes) for several other functions, including similar
and common_contexts
.
\nLet's now see which words appear in similar contexts as the word \"love.\" NLTK has a built-in function for this as well: similar
.
text1.similar(\"love\")\n
Behind the scenes, Python found all the contexts where the word \"love\" appears. It also finds similar environments, and then what words were common among the similar contexts. This gives a sense of what other words appear in similar contexts. This is somewhat interesting, but more interesting if we can compare it to something else. Let's take a look at another text. What about Sense and Sensibility? Let's see what words are similar to \"love\" in Jane Austen's writing. In the next cell, type:
\ntext2.similar(\"love\")\n
We can compare the two and see immediately that Melville and Austen use the word \"love\" differently.
\nLet's expand from novels for a minute and take a look at the NLTK Chat Corpus. In chats, text messages, and other digital communication platforms, \"lol\" is exceedingly common. We know it doesn't simply mean \"laughing out loud\"\u2014maybe the similar
function can provide some insight into what it does mean.
text5.similar(\"lol\")\n
The resulting list is a lot of greetings, indicating that \"lol\" probably has more of a phatic function. Phatic language is language primarily for communicating social closeness. Phatic words stand in contrast to semantic words, which contribute meaning to the utterance.
\nIf you are interested in this type of analysis, take a look at the common_contexts
function in the NLTK book or in the NLTK docs.
In many ways, concordance
and similar
are heightened word searches that tell us something about what is happening near the target words. Another metric we can use is to visualize where the words appear in the text. In the case of Moby Dick, we want to compare where \"whale\" and \"monster\" appear throughout the text. In this case, the text is functioning as a list of words, and will make a mark where each word appears, offset from the first word. We will pass this function a list of strings to plot. This will likely help us develop a visual of the story \u2014 where the whale goes from being a whale to being a monster to being a whale again. In the next cell, type:
text1.dispersion_plot([\"whale\", \"monster\"])\n
A graph should appear with a tick mark everywhere that \"whale\" appears and everywhere that \"monster\" appears. Knowing the story, we can interpret this graph and align it to what we know of how the narrative progresses. If we did not know the story, this could give us a picture of the narrative arc.
\nTry this with text2
, Sense and Sensibility. Some relevant words are \"marriage,\" \"love,\" \"home,\" \"mother,\" \"husband,\" \"sister,\" and \"wife.\" Pick a few to compare. You can compare an unlimited number, but it's easier to read a few at a time. (Note that the comma in our writing here is inside the quotation mark but for Python, this would be unreadable and you would have to put commas outside of quotation marks to create a list.)
\nNLTK has many more functions built-in, but some of the most powerful functions are related to cleaning, part-of-speech tagging, and other stages in the text analysis pipeline (where the pipeline refers to the process of loading, cleaning, and analyzing text).
", "order": 6}}, {"model": "lesson.lesson", "pk": 1116, "fields": {"title": "Built-In Python Functions", "created": "2020-07-09T19:01:21.266Z", "updated": "2020-07-09T19:01:21.266Z", "workshop": 160, "text": "We will now turn our attention away from the NLTK library and work with our text using the built-in Python functions\u2014the ones that come included with the Python language, rather than the NLTK library.
\nFirst, let's find out how many times a given word appears in the corpus. In this case (and all cases going forward), our text will be treated as a list of words. Therefore, we will use the count
function. We could just as easily do this with a text editor, but performing this in Python allows us to save it to a variable and then utilize this statistic in other calculations (for example, if we want to know what percentage of words in a corpus are 'lol', we would need a count of the 'lol's). In the next cell, type:
text1.count(\"whale\")\n
We see that \"whale\" occurs 906 times, but that seems a little low. Let's check on \"Whale\" and see how often that appears:
\ntext1.count(\"Whale\")\n
\"Whale\" with a capital \"W\" appears 282 times. This is a problem for us\u2014we actually want them to be collapsed into one word, since \"whale\" and \"Whale\" really are the same for our purposes. We will deal with that in a moment. For the time being, we will accept that we have two entries for \"whale.\"
\nThis gets at a distinction between type and token. \"Whale\" and \"whale\" are different types (as of now) because they do not match identically. Every instance of \"whale\" in the corpus is another token\u2014it is an instance of the type, \"whale.\" Therefore, there are 906 tokens of \"whale\" in our corpus.
\nLet's fix this by making all of the words lowercase. We will make a new list of words, and call it \"text1_tokens\". We will fill this list with all the words in text1, but in their lowercase form. Python has a built-in function, lower()
that takes all letters and makes them lowercase. In this same step, we are going to do a kind of tricky move, and only keep the words that are alphabetical and pass over anything that is punctuation or numbers. There is a built-in function, isalpha()
, that will allow us to save only those words that are made of letters. If isalpha()
is true, we'll make the word lowercase, and keep the word. If not, we'll pass over it and move to the next one.
\nType the following code into a new cell in your notebook. Pay special attention to the indentation, which must appear as below. (Note that in Jupyter Notebook, indentation usually comes automatically. If not, make sure to type the space
key 4 times)
text1_tokens = []\nfor t in text1:\n if t.isalpha():\n t = t.lower()\n text1_tokens.append(t)\n
\nAnother way to perform the same action more tersely is to use what's called a list comprehension. A list comprehension is a shorter, faster way to write a for-loop. It is syntactically a little more difficult to read (for a human), but, in this case, it's much faster to process. Don't worry too much about understanding the syntax of list comprehensions right now. For every example, we will show both the for loop and list comprehension options.
\ntext1_tokens = [t.lower() for t in text1 if t.isalpha()]\n
Great! Now text1_tokens
is a list of all of the tokens in our corpus, with the punctuation removed, and all the words in lowercase.
\nNow we want to know how many words there are in our corpus\u2014that is, how many tokens in total. Therefore, we want to ask, \"What is the length of that list of words?\" Python has a built-in len
function that allows you to find out the length of many types. Pass it a list, and it will tell you how many items are in the list. Pass it a string, and it will tell you how many characters are in the string. Pass it a dictionary, and it will tell you how many items are in the dictionary. In the next cell, type:
len(text1_tokens)\n
Just for comparison, check out how many words were in \"text1\"\u2014before we removed the punctuation and the numbers.
\nlen(text1)\n
We see there are over 218,000 words in Moby Dick (including metadata). But this is the number of words total\u2014we want to know the number of unique words. That is, we want to know how many types, not just how many tokens.
\nIn order to get unique words, rather than just all words in general, we will make a set from the list. A set
in Python work just like it would in math, it's all the unique values, with any duplicate items removed.
\nSo let's find out the length of our set. just like in math, we can also nest our functions. So, rather than saying x = set(text1_tokens)
and then finding the length of \"x\", we can do it all in one step.
len(set(text1_tokens))\n
Great! Now we can calculate the lexical density of Moby Dick. Statistical studies have shown that lexical density (the number of unique words per total words) is a good metric to approximate lexical diversity\u2014the range of vocabulary an author uses. For our first pass at lexical density, we will simply divide the number of unique words by the total number of words:
\nlen(set(text1_tokens))/len(text1_tokens)\n
If we want to use this metric to compare texts, we immediately notice a problem. Lexical density is dependent upon the length of a text and therefore is strictly a comparative measure. It is possible to compare 100 words from one text to 100 words from another, but because language is finite and repetitive, it is not possible to compare 100 words from one to 200 words from another. Even with these restrictions, lexical density is a useful metric in grade level estimations, vocabulary use and genre classification, and a reasonable proxy for lexical diversity.
\nLet's take this constraint into account by working with only the first 10,000 words of our text. First we need to slice our list, returning the words in position 0 to position 9,999 (we'll actually write it as \"up to, but not including\" 10,000).
\ntext1_slice = text1_tokens[0:10000]\n
Now we can do the same calculation we did above:
\nlen(set(text1_slice)) / len(text1_slice)\n
This is a much higher number, though the number itself is arbitrary. When comparing different texts, this step is essential to get an accurate measure.
", "order": 7}}, {"model": "lesson.lesson", "pk": 1117, "fields": {"title": "Making Your Own Corpus: Data Cleaning", "created": "2020-07-09T19:01:21.295Z", "updated": "2020-07-09T19:01:21.295Z", "workshop": 160, "text": "Thus far, we have been asking questions that take stopwords and grammatical features into account. For the most part, we want to exclude these features since they don't actually contribute very much semantic content to our models. Therefore, we will:
\n1. Remove capitalization and punctuation (we've already done this).
\n2. Remove stop words.
\n3. Lemmatize (or stem) our words, i.e. \"jumping\" and \"jumps\" become \"jump.\"
\nWe already completed step one, and are now working with our text1_tokens
. Remember, this variable, text1_tokens
, contains a list of strings that we will work with. We want to remove the stop words from that list. The NLTK library comes with fairly comprehensive lists of stop words for many languages. Stop words are function words that contribute very little semantic meaning and most often have grammatical functions. Usually, these are function words such as determiners, prepositions, auxiliaries, and others.
\nTo use NLTK's stop words, we need to import the list of words from the corpus. (We could have done this at the beginning of our program, and in more fully developed code, we would put it up there, but this works, too.) In the next cell, type:
\nfrom nltk.corpus import stopwords\n
We need to specify the English list, and save it into its own variable that we can use in the next step:
\nstops = stopwords.words('english')\n
Now let's take a look at those words:
\nprint(stops)\n
Now we want to go through all of the words in our text, and if that word is in the stop words list, remove it from our list. Otherwise, we want it to skip it. (The code below is VERY slow, so it may take some time to process). The way we can write this in Python is:
\ntext1_stops = []\nfor t in text1_tokens:\n if t not in stops:\n text1_stops.append(t)\n
A faster option, if you are feeling bold, would be using list comprehensions:
\ntext1_stops = [t for t in text1_tokens if t not in stops]\n
To check the result:
\nprint(text1_stops[:30])\n
Now that we removed our stop words, let's see how many words are left in our list:
\nlen(text1_stops)\n
You should get a much lower number.
\nFor reference, let's also check how many unique words there are. We will do this by making a set of words. Sets are the same in Python as they are in math, they are all of the unique words rather than all the words. So, if \"whale\" appears 200 times in the list of words, it will only appear once in the set.
\nlen(set(text1_stops))\n
Now that we've removed the stop words from our corpus, the next step is to stem or lemmatize the remaining words. This means that we will strip off the grammatical structure from the words. For example, cats \u2014> cat
, and walked \u2014> walk
. If that was all we had to do, we could stem the corpus and achieve the correct result, because stemming (as the name implies) really just means cutting off affixes to find the root (or the stem). Very quickly, however, this gets complicated, such as in the case of men \u2014> man
and sang \u2014> sing
. Lemmatization deals with this by looking up the word in a reference and finding the appropriate root (though note that this still is not entirely accurate). Lemmatization, therefore, takes a relatively long time, since each word must be looked up in a reference. NLTK comes with pre-built stemmers and lemmatizers.
\nWe will use the WordNet Lemmatizer from the NLTK Stem library, so let's import that now:
\nfrom nltk.stem import WordNetLemmatizer\n
Because of the way that it is written \"under the hood,\" an instance of the lemmatizer needs to be called. We know this from reading the docs.
\nwordnet_lemmatizer = WordNetLemmatizer()\n
Let's quickly see what lemmatizing does.
\nwordnet_lemmatizer.lemmatize(\"children\")\n
Now try this one:
\nwordnet_lemmatizer.lemmatize(\"better\")\n
It didn't work, but...
\nwordnet_lemmatizer.lemmatize(\"better\", pos='a')\n
... sometimes we can get better results if we define a specific part of speech(pos). \"a\" is for \"adjective\", as we learned here.
\nNow we will lemmatize the words in the list.
\ntext1_clean = []\nfor t in text1_stops:\n t_lem = wordnet_lemmatizer.lemmatize(t)\n text1_clean.append(t_lem)\n
And again, there is a faster version for you to use once you feel comfortable with list comprehensions:
\ntext1_clean = [wordnet_lemmatizer.lemmatize(t) for t in text1_stops]\n
Let's check now to see the length of our final, cleaned version of the data, and then check the unique set of words. Notice how we will use the print
function this time. Jupyter Notebook does print commands without the print
function, but it will only print one thing per cell (the last command), and we wanted to print two different things:
print(len(text1_clean))\nprint(len(set(text1_clean)))\n
If everything went right, you should have the same length as before, but a smaller number of unique words. That makes sense since we did not remove any word, we only changed some of them.
\nNow if we were to calculate lexical density, we would be looking at how many word stems with semantic content are represented in Moby Dick, which is a different question than the one in our first analysis of lexical density.
\nWhy don't you try that by yourself? Try to remember how to calculate lexical density without looking back first. It is ok if you have forgotten.
\nNow let's have a look at the words Melville uses in Moby Dick. We'd like to look at all of the types, but not necessarily all of the tokens. We will order this set so that it is in an order we can handle. In the next cell, type:
\nsorted(set(text1_clean))[:30]\n
Sorted
+ set
should give us a list of list of all the words in Moby Dick in alphabetical order, but we only want to see the first ones. Notice how there are some words we wouldn't have expected, such as 'abandon', 'abandoned', 'abandonedly', and 'abandonment'. This process is far from perfect, but it is useful. However, depending on your goal, a different process, like stemming
might be better.
The code to implement this and view the output is below:
\nfrom nltk.stem import PorterStemmer\nporter_stemmer = PorterStemmer()\n
The Porter is the most common Stemmer. Let's see what stemming does to words and compare it with lemmatizers:
\nprint(porter_stemmer.stem('berry'))\nprint(porter_stemmer.stem('berries'))\nprint(wordnet_lemmatizer.lemmatize(\"berry\"))\nprint(wordnet_lemmatizer.lemmatize(\"berries\"))\n
Stemmer doesn't look so good, right? But how about checking how stemmer handles some of the words that our lemmatized \"failed\" us?
\nprint(porter_stemmer.stem('abandon'))\nprint(porter_stemmer.stem('abandoned'))\nprint(porter_stemmer.stem('abandonedly'))\nprint(porter_stemmer.stem('abandonment'))\n
Still not perfect, but a bit better. So the question is, how to choose between stemming and lemmatizing? As many things in text analysis, that depends. The best way to go is experimenting, seeing the results and chosing the one that better fits your goals.
\nAs a general rule, stemming is faster while lemmatizing is more accurate (but not always, as we just saw). For academics, usually the choice goes for the latter.
\nAnyway, let's stem our text with the Porter Stemmer:
\nt1_porter = []\nfor t in text1_clean:\n t_stemmed = porter_stemmer.stem(t)\n t1_porter.append(t_stemmed)\n
Or, if we want a faster way:
\nt1_porter = [porter_stemmer.stem(t) for t in text1_clean]\n
And let's check the results:
\nprint(len(set(t1_porter)))\nprint(sorted(set(t1_porter))[:30])\n
A very different list of words is produced. This list is shorter than the list produced by the lemmatizer, but is also less accurate, and some of the words will completely change their meaning (like 'berry' becoming 'berri').
\nNow that we've seen some of the differences between both, we will proceed using our lemmatized corpus, which we saved as \"text1_clean\":
\nmy_dist = FreqDist(text1_clean)\n
If nothing happened, that is normal. Check to make sure it is there by calling for the type of the \"my_dist\" object.
\ntype(my_dist)\n
The result should say it is a nltk probability distribution (nltk.probability.FreqDist
). It doesn't matter too much right now what that is, only that it worked. We can now plot this with the matplotlib function, plot
. We want to plot the first 20 entries of the my_dist object.
my_dist.plot(20)\n
\nWe've made a nice image here, but it might be easier to comprehend as a list. Because this is a special probability distribution object we can call the most_common
on this, too. Let's find the twenty most common words:
my_dist.most_common(20)\n
What about if we are interested in a list of specific words\u2014perhaps to identify texts that have biblical references. Let's make a (short) list of words that might suggest a biblical reference and see if they appear in Moby Dick. Set this list equal to a variable:
\nb_words = ['god', 'apostle', 'angel']\n
Then we will loop through the words in our cleaned corpus, and see if any of them are in our list of biblical words. We'll then save into another list just those words that appear in both.
\nmy_list = []\nfor word in b_words:\n if word in text1_clean:\n my_list.append(word)\n else:\n pass\n
And then we will print the results.
\nprint(my_list)\n
You can obviously do this with much larger lists and even compare entire novels if you wish, though it would take a while with this approach. You can use this to get similarity measures and answer related questions.
", "order": 8}}, {"model": "lesson.lesson", "pk": 1118, "fields": {"title": "Make Your Own Corpus", "created": "2020-07-09T19:01:21.307Z", "updated": "2020-07-09T19:01:21.308Z", "workshop": 160, "text": "Now that we have seen and implemented a series of text analysis techniques, let's go to the Internet to find a new text. You could use something such as historic newspapers, or Supreme Court proceedings, or use any txt file on your computer. Here we will use Project Gutenberg. Project Gutenberg is an archive of public domain written works, available in a wide variety of formats, including .txt. You can download these to your computer or access them via the url. We'll use the url method. We found Don Quixote in the archive, and will work with that.
\nThe Python package, urllib, comes installed with Python, but is inactive by default, so we still need to import it to utilize the functions. Since we are only going to use the urlopen function, we will just import that one.
\nIn the next cell, type:
\nfrom urllib.request import urlopen\n
The urlopen
function allows your program to interact with files on the internet by opening them. It does not read them, however\u2014they are just available to be read in the next line. This is the default behavior any time a file is opened and read by Python. One reason is that you might want to read a file in different ways. For example, if you have a really big file\u2014think big data\u2014you might want to read line-by-line rather than the whole thing at once.
\nNow let's specify which URL we are going to use. Though you might be able to find Don Quixote in the Project Gutenberg files, please type this in so that we are all using the same format (there are multiple .txt files on the site, one with utf-8 encoding, another with ascii encoding). We want the utf-8 encoded one. The difference between these is beyond the scope of this tutorial, but you can check out this introduction to character encoding from The World Wide Web Consortium (W3C) if you are interested.
\nSet the URL we want to a variable:
\nmy_url = \"http://www.gutenberg.org/files/996/996-0.txt\"\n
We still need to open the file and read the file. You will have to do this with files stored locally as well. (in which case, you would type the path to the file (i.e., data/texts/mytext.txt
) in place of my_url
)
file = urlopen(my_url)\nraw = file.read()\n
This file is in bytes, so we need to decode it into a string. In the next cell, type:
\ndon = raw.decode()\n
Now let's check on what kind of object we have in the \"don\" variable. Type:
\ntype(don)\n
This should be a string. Great! We have just read in our first file and now we are going to transform that string into a text that we can perform NLTK functions on. Since we already imported nltk at the beginning of our program, we don't need to import it again, we can just use its functions by specifying nltk
before the function. The first step is to tokenize the words, transforming the giant string into a list of words. A simple way to do this would be to split on spaces, and that would probably be fine, but we are going to use the NLTK tokenizer to ensure that edge cases are captured (i.e., \"don't\" is made into 2 words: \"do\" and \"n't\"). In the next cell, type:
don_tokens = nltk.word_tokenize(don)\n
You can check out the type of don_tokens
using the type()
function to make sure it worked\u2014it should be a list. Let's see how many words there are in our novel:
len(don_tokens)\n
Since this is a list, we can look at any slice of it that we want. Let's inspect the first ten words:
\ndon_tokens[:10]\n
That looks like metadata\u2014not what we want to analyze. We will strip this off before proceeding. If you were doing this to many texts, you would want to use Regular Expressions. Regular Expressions are an extremely powerful way to match text in a document. However, we are just using this text, so we could either guess, or cut and paste the text into a text reader and identify the position of the first content (i.e., how many words in is the first word). That is the route we are going to take. We found that the content begins at word 315, so let's make a slice of the text from word position 315 to the end.
\ndq_text = don_tokens[315:]\n
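\nAs an aside, if you were processing many Gutenberg texts, a regular expression could locate the start of the body for you instead of counting words by hand. A hedged sketch follows; the "*** START OF" marker is an assumption about how these files are usually laid out, so check it against your own copy:

```python
import re

# Look for the boilerplate line Project Gutenberg typically places
# before the actual text, e.g. "*** START OF ... ***".
match = re.search(r"\*\*\*\s*START OF.*?\*\*\*", don)
if match:
    body = don[match.end():]   # everything after the marker
    print(body[:200])          # peek at the first 200 characters
else:
    print("Marker not found; fall back to slicing by hand.")
```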
Finally, if we want to use the NLTK specific functions:
\n- concordance
\n- similar
\n- dispersion_plot
\n- or others from the NLTK book
\nwe would have to make a specific NLTK Text
object.
dq_nltk_text = nltk.Text(dq_text)\n
If we wanted to use the built-in Python functions, we can just stick with our list of words in dq_text
. Since we've already covered all of those functions, we are going to move ahead with cleaning our text.
\nJust as we did earlier, we are going to remove the stopwords based on a list provided by NLTK, remove punctuation, and capitalization, and lemmatize the words. You can do it one by one as we did before, and that is totally fine. You can also merge some of the steps as you see below.
\n1. Lowercase, remove punctuation and stopwords
\ndq_clean = []\nfor w in dq_text:\n if w.isalpha():\n if w.lower() not in stops:\n dq_clean.append(w.lower())\nprint(dq_clean[:50])\n
2. Lemmatize
\nfrom nltk.stem import WordNetLemmatizer\nwordnet_lemmatizer = WordNetLemmatizer()\ndq_lemmatized = []\nfor t in dq_clean:\n dq_lemmatized.append(wordnet_lemmatizer.lemmatize(t))\n
From here, you could perform all of the operations that we did after cleaning our text in the previous session. Instead, we will perform another type of analysis: part-of-speech (POS) tagging.
", "order": 9}}, {"model": "lesson.lesson", "pk": 1119, "fields": {"title": "Part-of-Speech Tagging", "created": "2020-07-09T19:01:21.314Z", "updated": "2020-07-09T19:01:21.314Z", "workshop": 160, "text": "Note that we are going to use the pre-cleaned, dq_text
object for this section.
\nPOS tagging is going through a text and identifying which part of speech each word belongs to (i.e., Noun, Verb, or Adjective). Every word belongs to a part of speech, but some words can be confusing.
\n- Floyd is happy.
\n- Happy is a state of being.
\n- Happy has five letters.
\n- I'm going to Happy Cat tonight.
\nTherefore, part of speech is as much related to the word itself as its relationship to the words around it. A good part-of-speech tagger takes this into account, but there are some impossible cases as well:
\n- Wanda was entertaining last night.
\nPart-of-speech tagging can be done very simply, with a very small tag set, or in a very complex way, with a much more elaborate tag set. We are going to take a middle path and use a tag set that is neither small nor large: the Penn Tree Bank POS Tag Set.
\nThis is the tag set that is pre-loaded into NLTK. When we call the tagger, we expect it to return an object with the word and the tag associated. Because POS tagging is dependent upon the stop words, we have to use a text that includes the stop words. Therefore, we will go back to using the dq_text
object for this section. Let's try it out. Type:
dq_tagged = nltk.pos_tag(dq_text)\n
Let's inspect what we have:
\nprint(dq_tagged[:10])\n
This is a list of ordered tuples. (A tuple is like a list, but can't be changed once it is created.) Each element in the list is a pairing of (word, POS-tag)
. (Tuples are denoted with parentheses, rather than square brackets.) This is great, but it is very detailed. I would like to know how many Nouns, Verbs, and Adjectives I have.
\nFirst, I'll make an empty dictionary to hold my results. Then I will go through this list of tuples and count the number of times each tag appears. Every time I encounter a new tag, I'll add it to a dictionary and then increment by one every time I encounter that tag again. Let's see what that looks like in code:
\ntag_dict = {}\n# For every word/tag pair in my list,\nfor (word, tag) in dq_tagged:\n if tag in tag_dict:\n tag_dict[tag]+=1\n else:\n tag_dict[tag] = 1\n
Now let's see what we got:
\ntag_dict\n
This would be better with some order to it, but dictionaries are made to be unordered. When we google \"sort dictionaries python\" we find a solution in our great friend stack overflow. Even though we cannot sort a dictionary, we can get a representation of a dictionary that is sorted. Don't worry too much about understanding the following code, as it uses things we have not discussed, and are out of the scope of this course. It is useful to see how we can reuse pieces of code even when we don't fully understand them.
\nNow let's do it and find out what the most common tag is.
\ntag_dict_sorted = sorted(tag_dict.items(),\n reverse=True,\n key=lambda kv: kv[1])\nprint(tag_dict_sorted)\n
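\nAs an aside, Python's standard library has a shortcut for exactly this count-then-sort pattern. This is not what we did above, just an equivalent alternative you may run into elsewhere; a small sketch:

```python
from collections import Counter

# Count the tags directly and ask for the ten most frequent ones.
tag_counts = Counter(tag for word, tag in dq_tagged)
print(tag_counts.most_common(10))
```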
Now check out what we have. It looks like NN is the most common tag. We can look up what NN means in the Penn Tree Bank. Looks like NN is a Noun, singular or mass. Great! This information will likely help us with genre classification, or identifying the author of a text, or a variety of other functions.
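\nIf you would rather look up a tag without leaving Python, NLTK also ships a small help utility for the Penn Treebank tag set. A sketch (you may need to download the "tagsets" resource the first time you use it):

```python
import nltk

nltk.download("tagsets")        # only needed once per machine
nltk.help.upenn_tagset("NN")    # prints the definition and examples for NN
```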
", "order": 10}}, {"model": "lesson.lesson", "pk": 1120, "fields": {"title": "Conclusion", "created": "2020-07-09T19:01:21.318Z", "updated": "2020-07-09T19:01:21.318Z", "workshop": 160, "text": "At this point, you should have a familiarity with what is possible with text analysis, and some of the most important functions (i.e., cleaning and part-of-speech tagging). Yet, this tutorial has only scratched the surface of what is possible with text analysis and natural language processing. It is a rapidly growing field, if you are interested, be sure to work through the online NLTK Book as well as peruse the resources in the Zotero Library.
\nLet's compare the lexical density of Moby Dick with Sense and Sensibility. Make sure to:
\nThe command line is a text-based way of interacting with your computer. You may hear it called different names, such as the terminal, the shell, or bash. In practice, you can use these terms interchangeably. (If you're curious, though, you can read more about them in the glossary.) The shell we use (whether terminal, shell, or bash) is a program that accepts commands as text input and converts commands into appropriate operating system functions.
\nThe command line (of computers today) receives these commands as text that is typed in.
\nFor those of us comfortable reading and writing, the idea of "text-based" in the context of computers can seem a bit strange. As we start to get comfortable typing commands to the computer, it's important to distinguish plain "text" from word-processed, desktop-published documents (think Microsoft Word or Google Docs), where the software displays what we want to produce without showing us the code the computer is reading to render the formatting. Plain text has the advantage of being manipulable in different contexts.
\nLet's take a quick moment to discuss text and text editors.
", "order": 1}}, {"model": "lesson.lesson", "pk": 1122, "fields": {"title": "Text editors", "created": "2020-07-09T19:01:23.009Z", "updated": "2020-07-09T19:01:23.009Z", "workshop": 161, "text": "Before we explain which program we'll be using for editing text, we want to give a general sense of this \"text\" we keep mentioning. For those of us in the humanities, whether we follow literary theorists who read any object as a \"text\" or we dive into philology, paleography, codicology or any of the fields David Greetham lays out in Textual Scholarship, \"text\" has its specific meanings. As scholars working with computers, we need to be aware of the ways plain text and formatted text differ. Words on a screen may have hidden formatting. Many of us grew up using Microsoft Word and don't realize how much is going on behind the words shown on the screen. For the purposes of communicating with the computer and for easier movement between different programs, we need to use text without hidden formatting.
\n
\nUsers with visual disabilities, click here to download the Word file.
\nIf you ask the command line to read that file, this Word .docx file will look something like this
\n
\nUsers with visual disabilities, click here to download the text file.
\nWord documents which look like \"just words!\" are actually comprised of an archive of extensible markup language (XML) instructions that only Microsoft Word can read. Plain text files can be opened in a number of different editors and can be read within the command line.
\nFor the purposes of communicating with machines and between machines, we need characters to be as flexible as possible. Plain text includes the characters of readable material but not their graphical representation.
\nAccording to the Unicode Standard,
\n\n\nPlain text is a pure sequence of character codes; plain Unicode-encoded text is therefore a sequence of Unicode character codes.
\nPlain text has two main properties in regard to rich text:
\nplain text is the underlying content stream to which formatting can be applied. Plain text is public, standardized, and universally readable.
\nPlain text shows its cards\u2014if it's marked up, the markup will be human readable. Plain text can be moved between programs more fluidly and can respond to programmatic manipulations. Because it is not tied to a particular font or color or placement, plain text can be styled externally.
\nA counterpoint to plain text is rich text (sometimes denoted by the Microsoft rich text format .rtf file extension) or \"enriched text\" (sometimes seen as an option in email programs). In rich text files, plain text is elaborated with formatting specific to the program in which they are made.
\n
An important tool for programming and working in the command line is a text editor. A text editor is a program that allows you to edit plain text files, such as .txt, .csv, or .md. Text editors are not used to edit rich text documents, such as .docx or .rtf, and rich text editors should not be used to edit plain text files. This is because rich text editors will add many invisible special characters that will prevent programs from running and configuration files from being read correctly.
\nWhile it doesn't really matter which text editor you choose, you should try to become comfortable with at least one text editor.
\nChoosing a text editor has as much to do with personality as it does with functionality. Graphical user interfaces (GUIs), user options, and \"hackability\" vary from program to program.
\nFor our workshops, we will be using Visual Studio Code. Not only is Visual Studio Code free and open source, but it is also consistent across OSX, Windows, and Linux systems.
\nYou will have downloaded VS Code according to the instructions on the installations page. We won't be using the editor a lot in this tutorial, so don't worry about getting to know the editor now. In later workshops we will discuss syntax highlighting and version control, which Visual Studio Code supports. For now we will get back to working in the command line itself.
", "order": 2}}, {"model": "lesson.lesson", "pk": 1123, "fields": {"title": "Why is the command line useful?", "created": "2020-07-09T19:01:23.012Z", "updated": "2020-07-09T19:01:23.012Z", "workshop": 161, "text": "Initially, for some of us, the command line can feel a bit unfamiliar. Why step away from a point-and-click workflow? By using the command line, we move into an environment where we have more minute control over each task we'd like the computer to perform. Instead of ordering your food in a restaurant, you're stepping into the kitchen. It's more work, but there are also more possibilities.
\nThe command line allows you to...
\n- Easily automate tasks such as creating, copying, and converting files.
\n- Set up your programming environment.
\n- Run programs you create.
\n- Access the (many) programs and utilities that do not have graphical equivalents.
\n- Control other computers remotely.
\nIn addition to being a useful tool in itself, the command line gives you access to a second set of programs and utilities and is a complement to learning programming.
\nWhat if all these cool possibilities seem a bit abstract to you right now? That's alright! On a very basic level, most uses of the command line are about showing information that the computer has, or modifying or making things (files, programs, etc.) on the computer.
\nIn the next section, we'll make this a little more clear by getting started with the command line.
", "order": 3}}, {"model": "lesson.lesson", "pk": 1124, "fields": {"title": "Getting to the command line", "created": "2020-07-09T19:01:23.016Z", "updated": "2020-07-09T19:01:23.016Z", "workshop": 161, "text": "If you're using macOS:
\n1. Click the Spotlight Search button (the magnifying glass) in the top right of your desktop.
\n2. Type \"terminal\" into the bar that appears.
\n3. Select the first item that appears in the list.
\n4. When the Terminal pops up, you will likely see either a window with black text over white background or colored text over a black background.
\n
\nWhen you see the $
, you're in the right place. We call the $
the command prompt; the $
lets us know the computer is ready to receive a command.
\nYou can change the color of your Terminal or Bash shell background and text by selecting Shell
from the top menu bar, then selecting a theme from the menu under New Window
.
\nBonus points: if you really want to get the groove of just typing instead of pointing and clicking, you can press \"Command (\u2318)\" and the space bar at the same time to pull up Spotlight search, start typing \"Terminal,\" and then hit \"Enter\" to open a terminal window. This will pull up a terminal window without touching your mousepad. For super bonus points, try to navigate like this for the next fifteen minutes, or even the rest of this session\u2014it is tricky and sometimes a bit tiring when you start, but you can really pick up speed when you practice!
\nIf you're using Windows: we won't be using Windows's own non-UNIX version of the command line. Instead, we installed Git Bash, following these instructions, so that we can work in the cross-platform Unix command line for this session.
\n1. Look for Git Bash in your programs menu and open.
\n2. If you can't find the git folder, just type \"git bash\" in the search box and select \"git bash\" when it appears.
\n3. Open the program.
\n4. When the terminal pops up, you will likely see either a window with black text over a white background or colored text over a black background. You know you're in the right place when you see the $
.
$
$
, which we will refer to as the \"command prompt,\" is the place you type commands you wish the computer to execute. We will now learn some of the most common commands.
\nIn the next section, we'll learn how to navigate the filesystem in the command line.
", "order": 4}}, {"model": "lesson.lesson", "pk": 1125, "fields": {"title": "Navigation", "created": "2020-07-09T19:01:23.025Z", "updated": "2020-07-09T19:01:23.025Z", "workshop": 161, "text": "Go slow at first and check your spelling!
\nOne of the biggest things you can do to make sure your code runs correctly and you can use the command line successfully is to make sure you check your spelling! Keep this in mind today, this week, and your whole life. If at first something doesn't work, check your spelling! Unlike in human reading, where letters operate simultaneously as atomistic symbols and as complex contingencies (check Johanna Drucker on the alphabet), in coding, each character has a discrete function including (especially!) spaces.
\nKeep in mind that the command line and file systems on macOS and Unix are usually pre-configured as cAsE-pReSeRvInG\u2014so capitalizations also matter when typing commands and file and folder names.
\nAlso, while copying and pasting from this handy tutorial may be tempting to avoid spelling errors and other things, we encourage you not to! Typing out each command will help you remember them and how they work.
\nYou may also see your username to the left of the command prompt $
. Let's try our first command. Type the following and press the enter
key:
$ whoami\n
The whoami
command should print out your username. Congrats, you've executed your first command! This is a basic pattern of use in the command line: type a command, press enter
on your keyboard, and receive output.
OK, we're going to try another command. But first, let's make sure we understand some things about how your computer's filesystem works.
\nYour computer's files are organized in what's known as a hierarchical filesystem. That means there's a top level or \"root\" folder on your system. That folder has other folders in it, and those folders have folders in them, and so on. You can draw these relationships in a tree:
\nUsers\n|\n \u2014\u2014 your-username\n |\n \u2014\u2014 Applications\n \u2014\u2014 Desktop\n \u2014\u2014 Documents\n
The root or highest-level folder on macOS is just called /
. We won't need to go in there, though, since that's mostly just files for the operating system. On Windows, the root directory is usually called C:
(More on why C is default on Windows).
\nNote that we are using the word \"directory\" interchangeably with \"folder\"\u2014they both refer to the same thing.
\nOK, let's try a command that tells us where we are in the filesystem:
\n$ pwd\n
You should get output like /Users/your-username
. That means you're in the your-username
directory in the Users
folder inside the /
or root directory. On Windows, your output would instead be C:/Users/your-username
. The folder you're in is called the working directory, and pwd
stands for \"print working directory.\"
\nThe command pwd
won't actually print anything except on your screen. This command is easier to grasp when we interpret \"print\" as \"display.\"
\nOK, we know where we are. But what if we want to know what files and folders are in the your-username
directory, a.k.a. the working directory?
\nTry entering:
\n$ ls\n
You should see a number of folders, probably including Documents
, Desktop
, and so on. You may also see some files. These are the contents of the current working directory. ls
will \"list\" the contents of the directory you are in.
\nWonder what's in the Desktop folder? Let's try navigating to it with the following command:
\n$ cd Desktop\n
The cd
command lets us \"change directory.\" (Make sure the \"D\" in \"Desktop\" is capitalized.) If the command was successful, you won't see any output. This is normal\u2014often, the command line will succeed silently.
\nSo how do we know it worked? That's right, let's use our pwd
command again. We should get:
$ pwd\n/Users/your-username/Desktop\n
Now try ls
again to see what's on your desktop. These three commands\u2014pwd
, ls
, and cd
\u2014are the most commonly used in the terminal. Between them, you can orient yourself and move around.
\nBefore we move on, let's take a minute to navigate through our computer's file system using the command line.
\nIt's important to note that this is the same old information you can get by pointing and clicking displayed to you in a different way.
\nGo ahead and use pointing and clicking to navigate to your working directory\u2014you can get there a few ways, but try starting from \"My Computer\" and clicking down from there. You'll notice that the folder names should match the ones that the command line spits out for you, since it's the same information! We're just using a different mode of navigation around your computer to see it.
\nSo far, we've only performed commands that give us information. Let's use a command that creates something on the computer.
\nFirst, make sure you're in the home directory:
\n$ pwd\n/Users/your-username\n
Let's move to the Desktop folder, or \"change directory\" with cd
:
cd Desktop\n
Once you've made sure you're in the Desktop folder with pwd
, let's try a new command:
touch foo.txt\n
If the command succeeds, you won't see any output. Now move the terminal window and look at your \"real\" desktop, the graphical one. See any differences? If the command was successful and you were in the right place, you should see an empty text file called \"foo.txt\" on the desktop. Pretty cool, right?
\nLet's say you liked that \"foo.txt\" file so much you'd like another! In the terminal window, press the \"up arrow\" on your keyboard. You'll notice this populates the line with the command that you just wrote. You can hit \"Enter\" to create another \"foo.txt,\" (note - touch
command will not overwrite your document nor will it add another document to the same directory, but it will update info about that file.) or you could use your left/right arrows to change the file name to \"foot.txt\" to create something different.
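\nIf you want to see that for yourself, the ls command's -l flag prints a long listing that includes each file's modification time, so you can compare it before and after running touch again:

```console
$ ls -l foo.txt
```

The timestamp shown in that listing should update each time you touch the file.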
\nAs we start to write more complicated and longer commands in our terminal, the \"up arrow\" is a great shortcut so you don't have to spend lots of time typing.
\nOK, so we're going to be doing a lot of work during the Digital Research Institute. Let's create a project folder in our Desktop so that we can keep all our work in one place.
\nFirst, let's check to make sure we're still in the Desktop folder with pwd
:
$ pwd\n/Users/your-username/Desktop\n
Once you've double-checked you're in Desktop, we'll use the mkdir
or \"make directory\" command to make a folder called \"projects\":
mkdir projects\n
Now run ls
to see if a projects folder has appeared. Once you confirm that the projects folder was created successfully, cd
into it.
$ cd projects\n$ pwd\n/Users/your-username/Desktop/projects\n
foo.txt
file we created earlier. In this section, we'll create a text file that we can use as a cheat sheet. You can use it to keep track of all the awesome commands you're learning.
\nEcho
Instead of creating an empty file like we did with touch
, let's try creating a file with some text in it. But first, let's learn a new command: echo
$ echo \"Hello from the command line\"\nHello from the command line\n
Redirects (>)
By default, the echo command just prints out the text we give it. Let's use it to create a file with some text in it:
\necho \"This is my cheat sheet\" > cheat-sheet.txt\n
Now let's check the contents of the directory:
\n$ pwd\n/Users/your-username/projects\n$ ls\ncheat-sheet.txt\n
OK, so the file has been created. But what was the >
in the command we used? On the command line, a >
is known as a \"redirect.\" It takes the output of a command and puts it in a file. Be careful, since it's possible to overwrite files with the >
command.
\nIf you want to add text to a file but not overwrite it, you can use the >>
command, known as the redirect and append command, instead. If there's already a file with text in it, this command can add text to the file without destroying and recreating it.
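\nFor example, you could append a second line to the cheat sheet you just created without wiping out what is already in it; something like:

```console
$ echo "the pwd command prints the working directory" >> cheat-sheet.txt
```

Running that line twice would leave two copies of the sentence in the file, since each run appends rather than overwrites.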
Cat
Let's check if there's any text in cheat-sheet.txt.
\ncat cheat-sheet.txt\nThis is my cheat sheet\n
As you can see, the cat
command prints the contents of a file to the screen. cat
stands for \"concatenate,\" because it can link strings of characters or files together from end to end.
Your cheat sheet is titled cheat-sheet.txt
instead of cheat sheet.txt
for a reason. Can you guess why?
\nTry to make a file titled cheat sheet.txt
and report to the class what happens.
\nNow imagine you're attempting to open a very important data file using the command line that is titled cheat sheet.txt
\nFor your digital best practices, we recommend making sure that file names contain no spaces\u2014you can use creative capitalization, dashes, or underscores instead. Just keep in mind that the macOS and Unix file systems are usually pre-configured as cAsE-pReSeRvInG, which means that capitalization matters when you type commands to navigate between or do things to directories and files.
\nThe challenge for this section will be using a text editor, specifically Visual Studio Code (install guide here), to add some of the commands that we've learned to the newly created cheat sheet. Text editors are programs that allow you to edit plain text files, such as .txt, .py (Python scripts), and .csv (comma-separated values, also known as spreadsheet files). Remember not to use programs such as Microsoft Word to edit text files, since they add invisible characters that can cause problems.
\nSo far, you've learned a number of commands and one special symbol, the >
or redirect. Now we're going to learn another, the |
or \"pipe.\"
\nPipes let you take the output of one command and use it as the input for another.
\nLet's start with a simple example:
\n$ echo \"Hello from the command line\" | wc -w\n5\n
In this example, we take the output of the echo
command (\"Hello from the command line\") and pipe it to the wc
or word count command, adding a flag -w
for number of words. The result is the number of words in the text that we entered.
\nLet's try another. What if we wanted to put the commands in our cheat sheet in alphabetical order?
\nUse pwd
and cd
to make sure you're in the folder with your cheat sheet. Then try:
cat cheat-sheet.txt | sort\n
You should see the contents of the cheat sheet file with each line rearranged in alphabetical order. If you wanted to save this output, you could use a >
to print the output to a file, like this:
cat cheat-sheet.txt | sort > new-cheat-sheet.txt\n
So far the only text file we've been working with is our cheat sheet. Now, this is where the command line can be a very powerful tool: let's try working with a large text file, one that would be too large to work with by hand.
\nLet's download the data we're going to work with:
\nOur data set is a list of public domain items from the New York Public Library. It's in .csv format, which is a plain text spreadsheet format. CSV stands for \"comma separated values,\" and each field in the spreadsheet is separated with a comma. It's all still plain text, though, so we can manipulate the data using the command line.
\nOnce the file is downloaded, move it from your Downloads
folder to the projects
folder on your desktop\u2014either through the command line, or drag and drop in the GUI. Since this is indeed a command line workshop, you should try the former!
\nTo move this file using the command line, you first need to navigate to your Downloads
folder where that file is saved. Then type the mv
command followed by the name of the file you want to move and then the file path to your projects
folder on your desktop, which is where you want to move that file to (note that ~
refers to your home folder):
mv nypl_items.csv ~/Desktop/projects/\n
You can then navigate to that projects
folder and use the ls
command to check that the file is now there.
Try using cat
to look at the data. You'll find it all goes by too fast to get any sense of it. (You can click Control
and C
on your keyboard to cancel the output if it's taking too long.)
\nInstead, let's use another tool, the less
command, to get the data one page at a time:
$ less nypl_items.csv\n...\n
less
gives you a paginated view of the data; it will show you contents of a file or the output from a command or string of commands, page by page.
\nTo view the file contents page by page, you may use the following keyboard shortcuts (that should work on Windows using Git Bash or on macOS terminal):
\nClick the f
key to view forward one page, or the b
key to view back one page.
\nOnce you're done, click the q
key to return to the command line.
\nLet's try two more commands for viewing the contents of a file:
\n$ head nypl_items.csv\n...\n$ tail nypl_items.csv\n...\n
These commands print out the very first (the \"head\") and very last (the \"tail\") sections of the file, respectively.
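\nBoth commands default to showing ten lines; if you want more or fewer, they accept a -n flag followed by a number. For example:

```console
$ head -n 3 nypl_items.csv
...
$ tail -n 3 nypl_items.csv
...
```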
\nWhen you are navigating in the command line, typing folder and file names can seem to go against the promise of easier communication with your computer. Here comes tab
completion, stage right!
\nWhen you need to type out a file or folder name\u2014for example, the name of that csv file we've been working with: nypl_items.csv\u2014in the command line and want to move more quickly, you can just type out the beginning characters of that file name up until it's distinct in that folder and then click the tab
key. And voil\u00e0! Clicking that tab
key will complete the rest of that name for you, and it only works if that file or folder already exists within your working directory.
\nIn other words, anytime in the command line you can type as much of the file or folder name that is unique within that directory, and tab
complete the rest!
If all the text remaining in your terminal window is starting to overwhelm you, you have some options. You may type the clear
command into the command line, or click the command
and k
keys to clear the scrollback. In macOS terminal, clicking the command
and l
keys will clear the output from your most recent command.
We didn't tell you this before, but there are duplicate lines in our data! Two, to be exact. Before we try removing them, let's see how many entries are in our .csv file:
\n$ cat nypl_items.csv | wc -l\n100001\n
This tells us there are 100,001 lines in our file. The wc
tool stands for \"word count,\" but it can also count characters and lines in a file. We tell wc
to count lines by using the -l
flag. If we wanted to count characters, we could use wc -m
. Flags marked with hyphens, such as -l
or -m
, indicate options which belong to specific commands. See the glossary for more information about flags and options.
\nTo find and remove duplicate lines, we can use the uniq
command. (Note that uniq only collapses duplicate lines that sit next to each other, so with data where duplicates might be scattered you would usually sort first and then pipe the result to uniq.) Let's try it out:
$ cat nypl_items.csv | uniq | wc -l\n99999\n
OK, the count went down by two because the uniq
command removed the duplicate lines. But which lines were duplicated?
$ cat nypl_items.csv | uniq -d\n...\n
The uniq
command with the -d
flag prints out the lines that have duplicates.
\n
\nSo we've cleaned our data set, but how do we find entries that use a particular term?
\nLet's say I want to find all the entries in our data set that use the term \"Paris.\"
\nHere we can use the grep
command. grep
stands for \"global regular expression print.\" The grep
command processes text line by line and prints any lines which match a specified pattern. Regular expressions are patterns, infamous for being hard for humans to read, that match text character by character. grep
gives us access to the power of regular expressions as we search for text.
$ cat nypl_items.csv | grep -i \"paris\"\n...\n
This will print out all the lines that contain the word \"Paris.\" (The -i
flag makes the command ignore capitalization.) Let's use our wc -l
command to see how many lines that is:
$ cat nypl_items.csv | grep -i \"paris\" | wc -l\n191\n
Here we have asked cat
to read nypl_items.csv, take the output and pipe it into the grep -i
command, which will ignore capitalization and find all instances of the word \"paris.\" We then take the output of that grep
command and pipe it into the word count wc
command with the -l
lines option. The pipeline returns 191
letting us know that Paris (or paris) occurs on 191 lines of our data set.
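\nAs an aside, grep can also count matching lines on its own with the -c flag, so the same number can be produced without the extra wc step:

```console
$ grep -ic "paris" nypl_items.csv
191
```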
You've made it through your introduction to the command line! By now, you have experienced some of the power of communicating with your computer using text commands. The basic steps you learned today will help as you move forward through the week\u2014you'll work with the command line interface to set up your version control with git and you'll have your text editor open while writing python scripts and building basic websites with HTML and CSS.
\nNow is a good time to do a quick review!
\nIn this session, we learned:
\n- how to use touch
and echo
to create files
\n- how to use mkdir
to create folders
\n- how to navigate our file structure by cd
(change directory), pwd
(print working directory), and ls
(list)
\n- how to use redirects (>
) and pipes (|
) to create a pipeline
\n- how to explore a comma separated values (.csv) dataset using word and line counts, head
and tail
, and the concatenate command cat
\n- how to search text files using the grep
command
\nAnd we made a cheat sheet for reference!
\nWhen we started, we reviewed what text is\u2014whether plain or enriched. We learned that text editors that don't fix the formatting of font, color, and size allow for more flexible manipulation and use across programs. If text is allowed to be a string of characters (and not specific characters chosen for their compliance with a designer's intention), that text can be fed through programs and altered with automated regularity. Text editors are different software from Bash (or Terminal), which is a text-based shell that allows you to interact directly with your operating system, giving it input and receiving output.
\nHaving a grasp of command line basics will not only make you more familiar with how your computer and basic programming work, but it will also give you access to tools and communities that will expand your research.
", "order": 7}}, {"model": "lesson.challenge", "pk": 265, "fields": {"lesson": 1125, "title": "", "text": "Use the three commands you've just learned\u2014pwd
, ls
and cd
\u2014eight (8) times each. Go poking around your Photos folder, or see what's so special about that root /
directory. When you're done, come back to the home folder with
cd ~\n
(That's a tilde, on the top left of your keyboard.) One more command you might find useful is
\ncd ..\n
which will move you one directory up in the filesystem. That's a cd
with two periods after it.
Try and create a sub-folder and file on your own!
"}}, {"model": "lesson.challenge", "pk": 267, "fields": {"lesson": 1127, "title": "", "text": "You could use the GUI to open your Visual Studio Code text editor\u2014from your programs menu, via Finder or Applications or Launchpad in Mac OSX, or via the Windows button in Windows\u2014and then click \"File\" and then \"Open\" from the drop-down menu and navigate to your Desktop folder and click to open the cheat-sheet.txt file.
\nOr, you can open that specific cheat-sheet.txt file in the Visual Studio Code text editor directly from the command line! Let's try that by using the code
command followed by the name of your file in the command line.
Once you've got your cheat sheet open in the Visual Studio Code text editor, type to add the commands we've learned so far to the file. Include descriptions about what each command does. Remember, this cheat sheet is for you. Write descriptions that make sense to you or take notes about questions.
\nSave the file.
\nOnce you're done, check the contents of the file on the command line with the cat
command followed by the name of your file.
Use the commands you've learned so far to create a new version of the nypl_items.csv
file with the duplicated lines removed. (Hint: redirects are your friend.)
Use the grep
command to explore our .csv file a bit. What areas are best covered by the data set?
Type pwd
to see where on your computer you are located
\nType cd name-of-your-folder
to enter a subfolder
\nType ls
to see the content of that folder
\nType cd ..
to leave that folder
\nType pwd
to make sure you are back to the folder where you wish to be
\nType cd ~
to go back to your home folder
\nType pwd
to make sure you are in the folder where you wish to be
\nType cd /
to go back to your root folder
\nType ls
to see the content of folder you are currently in
\nType pwd
to make sure you are in the folder where you wish to be
\nType cd name-of-your-folder
to enter a subfolder
Type pwd
to see where on your computer you are located. If you are not in the \"projects\" folder we just created, navigate to that folder using the commands you learned in the previous lesson
\nType mkdir name-of-your-subfolder
to create a subfolder
\nType cd name-of-your-folder
to navigate to that folder
\nType touch challenge.txt
to create a new text file
\nType ls
to check whether you created the file correctly
$ code cheat-sheet.txt\n
\n ```console
\n$ cat cheat-sheet.txt
\nMy Institute Cheat Sheet
ls
\nlists files and folders in a directory
\ncd ~
\nchange directory to home folder
\n...
\n```
\nType pwd
to see where on your computer you are located. If you are not in the \"projects\" folder we just created, navigate to that folder using the commands you learned in the previous lesson
\nType ls
to check whether the file nypl_items.csv
is in your projects folder
\nType cat nypl_items.csv | uniq > new_nypl_items.csv
to create a new version of the nypl_items.csv
file with the duplicated lines removed.
If you want to get a little more mileage out of the grep command, refer to this tutorial on grep and regular expressions. Regular expressions (or regex) provide methods to search for text in more advanced ways, including specific wildcards, matching ranges of characters such as letters and numbers, and detecting features such as the beginning and end of lines. If you want to experiment with regular expressions in an easy-to-use environment, numerous regex test interfaces are available from a simple Google search, such as RegExr, which includes a handy cheat sheet.
\nMost digital projects come to an end at some point, in one way or another. We either simply stop working on them, or we forget about them, or we move on to something else. Few digital projects have an end \"form\" in the way that we think of a monograph. We rarely think of digital scholarship in its \"done\" form, but sooner or later, even if they are never \"finished\" so to speak, these projects end.
\nDone can take many different shapes: \n* it can morph into something new;\n* it can be retired;\n* it can be archived in a repository;\n* it can be saved on some form of storage media;\n* it can run out of funding; \n* and sometimes you are done with it!
\nSo it's helpful to think about what you want \"done\" to look like before you begin, because then you always have a sense of what will make a satisfactory ending to the work you're about to embark on.
\n\n# Identifying Audiences, Constituencies, and Collaborators
\nProjects typically satisfy more than one audience's need. The key to identifying a well-defined audience is research and creating several narrow profiles.
\nIf you are working on a project that is institutionally based (such as creating a platform, creating a resource, or building a teaching tool), you may have institutional partners who have a stake in your project's success. It's a good idea to identify these folks and consider their interests and needs as well.
\nPossible stakeholders include: your library, colleagues, IT division, academic program, or a center or institute that shares your mission and/or goals.
\nExample of a \"stakeholder\":
Conducting an in-depth environmental scan and literature review early in the planning process is a critical step to see if there are existing projects that are similar to your own or that may accomplish similar goals to your potential project. Sometimes, the planning process stops after the scan because you find that someone has already done it! Typically, a scan is useful in articulating and justifying the \"need\" for your research OR to justify your choice of one technology in lieu of others. Performing an environmental scan early and reviewing and revising it periodically will go a long way to help you prove that your project fills a current need for an actual audience.
\nSuccessful project proposals demonstrate knowledge of the ecosystem of existing projects in your field, and the field's response to those projects. Scans often help organizations identify potential collaborators, national initiatives, publications, articles, or professional organizations, which in turn can demonstrate a wider exigency for your project. Following a preliminary scan, you should be able to explain why your project is important to the field, what it provides that does not currently exist, and how your project can serve as a leader or example to other organizations in such a way that they can put your findings to new use.
\nBelow are suggestions for finding similar projects and initiatives in and outside of your field:
\nThe key to the environmental scan is to see what a wider community is already up to. How does your project fit into the ongoing work of others in your field? What about in a related field that addresses a similar question from another perspective? Is someone already working on a similar question?
\n1. Brainstorm where you might go to look for digital projects in your field that use emerging or new forms of technology. Try to list 3 places you might look to see how others in your field are adapting their methods to use new digital tools.
\n2. What technologies/methods do most people use in your field, if any, for capturing, storing, exploring/analyzing, or displaying their data? Why do they tend to use it? Is there a reason why you want to use the same technologies as your colleagues? What are the benefits of doing things differently?
\n3. Does your project fill a need or stake new methodological ground? How do you know?
\n4. If there aren't any technologies that do exactly what you were hoping for, has anyone else run into this problem? How did they solve it? Will you need to create a new tool or make significant changes to an existing one to accomplish your goal?
\n5. Once you have gathered information about what is \"out there,\" what are the limits of what you are willing to change about your own project in response? How will you know if you have stretched beyond the core objectives of your own research project?
", "order": 3}}, {"model": "lesson.lesson", "pk": 1131, "fields": {"title": "Resource Assessment", "created": "2020-07-09T19:01:26.028Z", "updated": "2020-07-09T19:01:26.028Z", "workshop": 162, "text": "The next step in our process is figuring out what resources you have available to you and what you still need in order to accomplish your project's objectives.
\nDo you have the dataset you need to do your project? Finding, cleaning, storing, managing changes in, and sharing your data is an often overlooked but extremely important part of designing your project. Successfully finding a good dataset means that you should keep in mind: Is the dataset the appropriate size and complexity to help address your project's goals? Finding, using, or creating a good dataset is a core part of your project's long-term success.
\nWhat data resources do you have at your disposal? What do you still need? What steps do you need to take during the course of your project in order to work with the dataset now that you have a general sense of what the data needs to look like if you are working with either textual or numeric data?
\nHave: basic knowledge of git and python and some nltk
\nNeed: I need a more powerful computer, to learn how to install and use Beautiful Soup, and to get help cleaning the data. I will also need to learn about the D3.js library.
\nLooking back at the Audiences worksheet, review which of your audiences were invested in your work. Who can you draw on for support? Consider the various roles that might be necessary for the project. Who will fill those roles? \n* design\n* maintenance and support\n* coding/programming\n* outreach / documentation\n* project management
\nOutreach can take many different forms, from presenting your research at conferences and through peer-reviewed scholarly publications, but also through blog posts, Twitter conversations, forums, and/or press releases. The key to a good outreach plan is to being earlier than you think is necessary, and give your work a public presence (via Tumblr, Twitter, website, etc). You can use your outreach contacts to ask for feedback and input as well as share challenges and difficult decisions. \n* Will you create a website for your project? \n* How will you share your work? \n* Will you publish in a traditional paper or in a less-traditional format? \n* Whom will you reach out to get the word out about your work? \n* Is there someone at your college who can help you to publicize your accomplishments? \n* Will you have a logo? Twitter account? Tumblr page? Why or why not? \n* Can you draw on your colleagues to help get the word out about your work? \n* What information could you share about your project at its earliest stages? \n* Does your project have a title?
\nYou will need to come up with a plan for how you are going to manage the \"data\" created by your project. Data management plans, now required by most funders, will ask for you to list all the types of data and metadata created over the duration of the project and then account for the various manners by which you will account for various versions, make the datasets available openly (if possible) and share your data.
\nSustainability plans require detailing what format files will be in and accounting for how those files and your data will continue to be accessible to you and/or to your audience or a general public long after the project's completion.
\nLibrarians are your allies in developing a sound data management and sustainability plan.
\nVery quickly, try to think of all the different types of data your project will involve. \n* Where will you store your data? \n* Is your software open source? \n* What is the likelihood that your files will remain usable? \n* How will you keep track of your data files? \n* Where will the data live after the project is over?
", "order": 6}}, {"model": "lesson.lesson", "pk": 1134, "fields": {"title": "Effective Partnerships", "created": "2020-07-09T19:01:26.043Z", "updated": "2020-07-09T19:01:26.043Z", "workshop": 162, "text": "After brainstorming your project ideas and assessing your available resources, it is time to scope out potential partners to help fill in gaps and formalize relationships.
\nplease keep in mind that each project is different. This outline offers suggestions and lessons learned from successful and less successful collaborations. while each project is unique in the way responsibilities are shared, perhaps one universal attribute of successful partnerships is mutual respect. The most successful collaborations are characterized by a demonstrated respect for each partners's time, work, space, staff, or policies in words and actions.
\nOnce you know where you need help, start thinking about who you know who might have those skills, areas of expertise, resources, and interest. \n* Partnerships should be selected on the basis of specific strengths. \n* If you don't know someone who fits the bill, can someone you know introduce you to someone you would like to know? What are some ways of finding someone with skills you don't have if you don't know anyone with those skills?
\nWhen preparing a proposal, you will need mentors, collaborators, or other interested parties to write a strong letter of support for your project that will help your proposal stand out to the reviewers. Some funders want letters from all project participants.
\nIt is important to respect people\u2019s time when asking them for a letter by showing that you\u2019ve done your research and that you have some grant materials to share with them. Good letters demonstrate some knowledge of the project and recognition of its impact if funded.
\nFollow these steps when asking for a support letter and for specific types of assistance during the life of the grant, and you should receive a good letter in return.\n* One month before grant deadline, begin brainstorming candidates for letters of support and note which collaborators are required to submit letters of commitment and support. \n* Start asking supporters at least two weeks in advance of grant deadline, because they will also have deadlines and other work competing for their work hours. You may find some folks are on leave at the time you inquire, be sure to have back-ups on your list. \n* Email potential supporters, collaborators:
\n * State why, specifically, you are asking Person A for support;
\n * Be specific about what you are asking Person A to do over the scope of the grant, if anything, such as participate in 3 meetings, 2 phone calls over 18 months; or agree to review the project and provide feedback one month before official launch;
\n * Provide any information about compensation, especially when asking someone to participate (ie, there will be a modest honorarium to recognize the time you give to this project of $xxx);
\n 8 Tell supporters what exactly you need to complete the grant application, in what format, and by what date (ie, a 2-page CV in PDF and letter of support on letterhead by next Friday).\n* Attach materials that will be helpful for them when writing the letter.
\n * Provide a short project summary that includes the project goals, deliverables, and work plan from the grant proposal draft;
\n * Include a starter letter containing sample text that references that person\u2019s or institution\u2019s role and why they are supporting the project.
", "order": 7}}, {"model": "lesson.lesson", "pk": 1135, "fields": {"title": "Finding Funding", "created": "2020-07-09T19:01:26.047Z", "updated": "2020-07-09T19:01:26.047Z", "workshop": 162, "text": "Now that you have started to form:\n* a more refined project idea;\n* a wider awareness of the ecosystem of existing projects in your field;\n* a sense of the national, local, or institutional demand for your project;\n* and a clearer sense of the resources at your disposal
\n... the next step is to find an appropriate funding source. Below you will find some suggestions as to where to begin the search for funding. As you look for possible funders, below are some guidelines for the process:
\n1. Check federal, state, and local grant-making agencies, and local foundations for possibility of grants.
\n * Federal agencies list all of their available grants on http://grants.gov.
\n * States also have opportunities for grants, such as state humanities councils.
\n * Private foundations are also possible areas to look. The following may prove useful:
\n * The Foundation Center: [http://foundationcenter.org] (http://foundationcenter.org)
\n * A Directory of State and Local Foundations:
\n [http://foundationcenter.org/getstarted/topical/sl_dir.html] (http://foundationcenter.org/getstarted/topical/sl_dir.html)
\n * The Council on Foundations Community Foundations List
\n http://www.conf.org/whoweserve/community/resources/index.cfm?navitemNumber=15626#locator
\n * The USDA offers a valuable Guide to Funding Resources [https://www.nal.usda.gov/ric/guide-to-funding-resources] (https://www.nal.usda.gov/ric/guide-to-funding-resources)
\n2. Check your institution\u2019s eligibility for a potential grants before beginning the application process. Eligibility requirements and restrictions are often found in grant guidelines.
\n3. Review the types of projects this program funds, and consider how your project fits with the agency or foundation\u2019s mission and strategic goals.
\n4. Review a potential grant program\u2019s deadlines and requirements (including proposal requirements and format for submission).
\n5. Identify funding levels/maxes, and keep them close at hand as you develop your budget.
\n6. Jenny Furlong Director of the Office of Career Services will be here tomorrow, and she is an excellent resource for those interested in external fellowships.
\nFind one or two grant opportunites in your subject area. Consider also looking for fellowship opportunities.
\nWhat follows is a template for writing a short project proposal that, once developed, will position you to move forward with building partnerships with other institutions or for pursuing funding opportunities. Though this template does not directly reflect a specific grant narrative format, the short project proposal includes important project-development steps that can later form the basis for a wide variety of grant narratives.
\n150 word summary of project: (1 short paragraph)
\nStatement of the conditions that make the project necessary and beneficial for your key audiences (2-3 paragraphs).
\nA brief explanation that combines your environmental scan and your research goals. Why is what you are doing necessary and different in your field\u2014and maybe to more than just scholars in your field. (4-5 paragraphs)
\nRough outline and project calendar that includes project design and evaluation, and possibly a communications plan, depending on the grant with major deliverables (bullet-pointed list of phases and duration):\n* Phase 1 (month/year - month/year):\n* Phase 2 (month/year - month/year):\n* Phase 3 (month/year - month/year):
\nDescription of the why the cooperating institutions and key personnel are well-suited to undertake this work (list of experience and responsibilities of each staff member, and institutional description).
\nIf applicable, describe how this project will live beyond the grant period. Will it continue to be accessible? How so? A data management plan might need to be specified here.
", "order": 8}}, {"model": "lesson.lesson", "pk": 1136, "fields": {"title": "Presentation Template", "created": "2020-07-09T19:01:26.050Z", "updated": "2020-07-09T19:01:26.050Z", "workshop": 162, "text": "Name:
\nProgram:
\nProject title:
\n2 Sentence abstract:
\nWhat resources do you have now?
\nWhat have you learned this week that will help you?
\nWhat additional support will you need as you take your next steps?
", "order": 9}}, {"model": "lesson.lesson", "pk": 1137, "fields": {"title": "Presentation", "created": "2020-07-09T19:01:26.058Z", "updated": "2020-07-09T19:01:26.058Z", "workshop": 162, "text": "2 Sentence abstract
\nMy project is going to make every installation seamless. It will make all of your Python dreams come true, your databases tidy, and your Git Hub happy.
\nWhat resources do you have now?
\nWhat have you learned this week that will help you?
\nWhat additional support will you need as you take your next steps?
\ngit add yourlastname.md\n
git commit -m \"my presentation file\"\n
git add images/myfile.jpg\n
git commit -m \"adding an image file\"\n
This is the workshop page.
", "template": "workshop/workshop-list.html"}}, {"model": "website.page", "pk": 5, "fields": {"name": "About", "slug": "about", "text": "This is the about page.
", "template": "website/page.html"}}, {"model": "website.page", "pk": 6, "fields": {"name": "Library", "slug": "library", "text": "This is the library page.
", "template": "library/all-library-items.html"}}] \ No newline at end of file diff --git a/app/workshop/templates/library/all-library-items.html b/app/workshop/templates/library/all-library-items.html index 3576f6e0..5d33fb20 100644 --- a/app/workshop/templates/library/all-library-items.html +++ b/app/workshop/templates/library/all-library-items.html @@ -10,6 +10,18 @@This is the workshop page.
', + template = 'workshop/workshop-list.html' + ) + p.save() + collector['pages'].append(p) + + p = Page( + name = 'About', + slug = 'about', + text = 'This is the about page.
', + template = 'website/page.html' + ) + p.save() + collector['pages'].append(p) + + p = Page( + name = 'Library', + slug = 'library', + text = 'This is the library page.
', + template = 'library/all-library-items.html' + ) + p.save() + collector['pages'].append(p) + + fixtures.add(collector=collector) + fixtures.save() from dhri.setup import setup