Python NLTK 06270949

1. Problem 28 at the end of Chapter 2 
Use one of the predefined similarity measures to score the similarity of each of the following pairs of words. Rank the pairs in order of decreasing similarity: 
car-automobile, gem-jewel, journey-voyage, boy-lad, coast-shore, asylum-madhouse, magician-wizard, midday-noon, furnace-stove, food-fruit, bird-cock, bird-crane, tool-implement, brother-monk, lad-brother, crane-implement, journey-car, monk-oracle, cemetery-woodland, food-rooster, coast-hill, forest-graveyard, shore-woodland, monk-slave, coast-forest, lad-wizard, chord-smile, glass-magician, rooster-voyage, noon-string.
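
One possible way to approach the ranking is sketched below. It assumes WordNet path similarity as the "predefined similarity measure" and takes the best score over all noun senses of the two words; both choices are assumptions, not part of the problem statement. The names rank_pairs and wordnet_path_similarity are hypothetical helpers, and the WordNet function needs NLTK with the wordnet data installed.

```python
from itertools import product


def rank_pairs(pairs, similarity):
    """Score each (word1, word2) pair and sort by decreasing similarity."""
    scored = [(similarity(a, b), f"{a}-{b}") for a, b in pairs]
    return sorted(scored, key=lambda item: item[0], reverse=True)


def wordnet_path_similarity(word1, word2):
    """Best path similarity over all noun senses (an assumed scoring choice)."""
    # Imported here so rank_pairs can be used without NLTK installed.
    from nltk.corpus import wordnet as wn

    scores = [
        s1.path_similarity(s2)
        for s1, s2 in product(wn.synsets(word1, pos=wn.NOUN),
                              wn.synsets(word2, pos=wn.NOUN))
    ]
    # path_similarity may return None when no path connects two senses.
    scores = [s for s in scores if s is not None]
    return max(scores, default=0.0)
```

With NLTK available, a call such as rank_pairs([("car", "automobile"), ("noon", "string")], wordnet_path_similarity) would return the pairs ordered from most to least similar.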

2. Problem 6 at the end of Chapter 3 
Describe the class of strings matched by the following regular expressions. 
1. Strings that consist of one or more characters from a-z or A-Z.
2. Strings that consist of one character from A-Z followed by zero or more characters from a-z.
3. Strings that start with "p", contain at least two characters from "a", "e", "i", "o", "u", and end with "t".
4. Strings that start with one or more digits, followed by a backslash (\), and end with zero or more digits.
5. Strings that start with a non-vowel, continue with zero or more vowels, and end with a non-vowel.
6. Strings that start with one or more characters, followed by one or more non-whitespace characters.
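
A description like the ones above can be checked by running a pattern over a few sample strings. The sketch below does this for the first two descriptions only. The two patterns are reconstructed from those descriptions, since the exercise's own expressions are not reproduced in this text, and matches_whole is a hypothetical helper name.

```python
import re

# Patterns reconstructed from descriptions 1 and 2 (an assumption; the
# exercise's original expressions are not shown in this text).
LETTERS = r"[a-zA-Z]+"
CAPITALIZED = r"[A-Z][a-z]*"


def matches_whole(pattern, text):
    """Return True if the entire string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    for sample in ["Hello", "hello", "HeLLo", "A", ""]:
        print(repr(sample),
              matches_whole(LETTERS, sample),
              matches_whole(CAPITALIZED, sample))
```

Trying strings that should and should not match is a quick way to catch a description that is too broad or too narrow.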

3. Problem 8 at the end of Chapter 3 
Write a utility function that takes a URL as its argument, and returns the contents of the URL, with all HTML markup removed. Use from urllib import request and then request.urlopen('').read().decode('utf8') to access the contents of the URL.
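
A minimal sketch of such a utility follows, using only the standard library. The names strip_html and url_text are hypothetical, and skipping the contents of script and style elements is an added assumption about what counts as markup. BeautifulSoup would be a reasonable alternative to the HTMLParser subclass used here.

```python
from html.parser import HTMLParser
from urllib import request


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html):
    """Return the text of an HTML string with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def url_text(url):
    """Fetch a URL as described in the exercise and return its text without markup."""
    html = request.urlopen(url).read().decode("utf8")
    return strip_html(html)
```

Keeping the fetch and the markup removal in separate functions makes it possible to try strip_html on a literal string without a network connection.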


4. Problem 18 at the end of Chapter 3 
Read in some text from a corpus, tokenize it, and print the list of all wh-word types that occur. (wh-words in English are used in questions, relative clauses and exclamations: who, which, what, and so on.) Print them in order. Are any words duplicated in this list, because of the presence of case distinctions or punctuation? Write the bulk of the code as a method that takes a single argument, the file name of the corpus.
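
The sketch below shows one way this could look. It assumes the corpus is a plain-text file, uses a simple regular expression in place of an NLTK tokenizer, and relies on a hand-picked WH_WORDS set; all three are assumptions, and wh_word_types is a hypothetical name.

```python
import re

# Assumed list of wh-words; the exercise only names who, which and what.
WH_WORDS = {"who", "whom", "whose", "which", "what", "when", "where", "why", "how"}


def wh_word_types(path):
    """Return the sorted wh-word types found in the text file at `path`."""
    with open(path, encoding="utf8") as handle:
        text = handle.read()
    # Alphabetic runs only, so punctuation never sticks to a token.
    tokens = re.findall(r"[A-Za-z]+", text)
    # Case is kept on purpose so that "Who" and "who" show up as two types.
    return sorted({token for token in tokens if token.lower() in WH_WORDS})


def print_wh_words(path):
    """Print each wh-word type on its own line, in sorted order."""
    for word in wh_word_types(path):
        print(word)
```

Because the tokenizer drops punctuation, any duplicates in the output come from case distinctions alone; splitting on whitespace instead would also surface forms such as "what?".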

5. Problem 22 at the end of Chapter 3
Write code that accesses the Weather Underground website and retrieves the current temperature in Wilmington. Have your program open that URL and run BeautifulSoup to get the non-HTML text. Then find the pattern in that text that indicates where the current temperature is located on the page. With the regular expression findall method you should be able to find that string and then grab the group of characters corresponding to the temperature.
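
The findall step might look like the sketch below. It assumes the page text has already been extracted (for example with BeautifulSoup's get_text()) and that the temperature appears as a number followed by "°F". The real page layout is not known here, so the pattern and the sample text are illustrative only, and find_temperatures is a hypothetical name.

```python
import re

# Assumed format: an optionally signed number, optional spaces, then "°F".
TEMPERATURE_PATTERN = re.compile(r"(-?\d+(?:\.\d+)?)\s*°\s*F")


def find_temperatures(page_text):
    """Return every Fahrenheit reading found in the text, as floats."""
    # With a single capturing group, findall returns just the captured numbers.
    return [float(value) for value in TEMPERATURE_PATTERN.findall(page_text)]


if __name__ == "__main__":
    # Made-up sample text standing in for the extracted page contents.
    sample = "Wilmington Current conditions: 72.5 °F Feels like 75 °F"
    print(find_temperatures(sample))
```

If several readings match, deciding which one is the current temperature depends on the surrounding text of the actual page.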

6. Problem 29 at the end of Chapter 3. 
Readability measures are used to score the reading difficulty of a text, for the purposes of selecting texts of appropriate difficulty for language learners. Let us define μw to be the average number of letters per word, and μs to be the average number of words per sentence, in a given text. The Automated Readability Index (ARI) of the text is defined to be: 4.71 μw + 0.5 μs - 21.43. Compute the ARI score for various sections of the Brown Corpus, including section f (lore) and j (learned). Make use of the fact that nltk.corpus.brown.words() produces a sequence of words, while nltk.corpus.brown.sents() produces a sequence of sentences.
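
A minimal sketch of the computation follows. The function name ari and the wrapper brown_ari are hypothetical. The wrapper assumes NLTK and the Brown Corpus data are installed, and it counts every token returned by brown.words(), punctuation included, as a word.

```python
def ari(words, sents):
    """Automated Readability Index: 4.71 * mu_w + 0.5 * mu_s - 21.43."""
    mu_w = sum(len(word) for word in words) / len(words)  # letters per word
    mu_s = len(words) / len(sents)                        # words per sentence
    return 4.71 * mu_w + 0.5 * mu_s - 21.43


def brown_ari(category):
    """ARI for one Brown Corpus category, e.g. 'lore' or 'learned'."""
    # Imported here so ari() can be used without NLTK installed.
    from nltk.corpus import brown

    return ari(brown.words(categories=category), brown.sents(categories=category))


if __name__ == "__main__":
    for category in ("lore", "learned"):
        print(category, round(brown_ari(category), 2))
```

Filtering out punctuation tokens before calling ari would change both averages, so the scores depend on that choice.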

