Agenda
• Overview of the project
• Resources

CS172 Project: Crawling, Indexing, Ranking

Phase 1 Options
• Web data
  – You need to come up with your own crawling strategy
• Twitter data
  – You can use a third-party library for the Twitter Streaming API
  – Still needs some web crawling

Crawling
• The crawler maintains a Frontier of URLs to visit: getNext() returns the next URL to crawl, addAll(List<URLs>) adds newly discovered URLs.
• Starting from a seed such as www.cs.ucr.edu, for each URL taken from the Frontier (e.g. www.cs.ucr.edu/~vagelis):
  1. Download the contents of the page
  2. Parse the downloaded file to extract the links on the page
  3. Clean and normalize the extracted links
  4. Store the extracted links in the Frontier

1. Download File Contents

2. Parsing HTML to extract links
• The raw HTML is what you will see when you download a page. Notice the HTML tags.

2. Parsing HTML file
• Write your own parser. One suggestion: parse the HTML file as XML, using one of two parsing methods:
  – SAX (Simple API for XML)
  – DOM (Document Object Model)
• Or use an existing library:
  – JSoup (http://jsoup.org/). Can also be used to download the page.
  – HTML Parser (http://htmlparser.sourceforge.net/)

2. Parsing HTML file
• Things to think about
  – How do you handle malformed HTML? A browser can still display it, but how does your parser handle it?

3. Clean extracted URLs
• Some URL entries seen while crawling www.cs.ucr.edu:
  /intranet/
  /inventthefuture.html
  systems.engr.ucr.edu
  news/e-newsletter.html
  http://www.engr.ucr.edu/sendmail.html
  http://ucrcmsdev.ucr.edu/oucampus/de.jsp?user=D01002&site=cmsengr&path=%2Findex.html
  /faculty/
  /
  /about/
  #main
  http://www.pe.com/local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988#ssStory533104

3. Clean extracted URLs
• What to avoid:
• Parse only http links (avoid ftp, https, or any other protocol)
• Avoid duplicates
  – Bookmarks such as #main should be stripped off.
  – Self paths: /
• Avoid downloading PDFs or images
  – Example: /news/GraphenePublicationsIndex.pdf
  – It is OK to download them, but you cannot parse them.
• Take care of invalid characters in URLs
  – Space: www.cs.ucr.edu/vagelis hristidis
  – Ampersand: www.cs.ucr.edu/vagelis&hristidis
  – These characters should be encoded, or you will get a MalformedURLException.

Normalize Links Found on the Page
• Relative URLs have no host address.
• E.g., while crawling www.cs.ucr.edu/faculty you find URLs such as:
  – Case 1: /find_people.php
    A "/" at the beginning means the path starts from the root of the host (www.cs.ucr.edu in this case).
  – Case 2: all
    No "/" means the path is relative to the current path.
• Normalize them (respectively) to
  – www.cs.ucr.edu/find_people.php
  – www.cs.ucr.edu/faculty/all

Clean extracted URLs
• The different parts of a URL, e.g.
  http://www.pe.com:8080/local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-ucr-develops-sensory-detection-for-smartphones.ece?ssimg=532988#ssStory533
  are the Protocol, Port, Host, Path, Query, and Bookmark.

java.net.URL
• Has methods that separate the different parts of a URL:
  getProtocol: http
  getHost: www.pe.com
  getPort: -1
  getPath: /local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-
  getQuery: ssimg=532988
  getFile: /local-news/riverside-county/riverside/riverside-headlines-index/20120408-riverside-u

Normalizing with java.net.URL
• You can normalize URLs with simple string manipulation and methods from the java.net.URL class.
• A sketch for normalizing "Case 1" root-relative URLs follows below.
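Here is a minimal sketch of one way to do this, using java.net.URL's two-argument constructor to resolve a link against the page it was found on; it covers Case 1 and, as written, also Case 2. The class and method names (UrlNormalizer, normalize) are placeholders for illustration only.

import java.net.MalformedURLException;
import java.net.URL;

public class UrlNormalizer {

    /**
     * Normalizes a link found on the page "base".
     * Case 1: a root-relative link such as "/find_people.php" is resolved
     *         against the host of the base URL.
     * Case 2: a link with no leading "/" is resolved relative to the
     *         current path of the base URL.
     */
    public static String normalize(URL base, String link) throws MalformedURLException {
        // Strip bookmarks such as "#main" before resolving.
        int hash = link.indexOf('#');
        if (hash >= 0) {
            link = link.substring(0, hash);
        }
        // Resolve the (possibly relative) link against the base URL.
        URL resolved = new URL(base, link);
        return resolved.toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        // Note: the trailing slash on the base matters for Case 2 resolution.
        URL base = new URL("http://www.cs.ucr.edu/faculty/");
        System.out.println(normalize(base, "/find_people.php"));
        // -> http://www.cs.ucr.edu/find_people.php
        System.out.println(normalize(base, "all"));
        // -> http://www.cs.ucr.edu/faculty/all
    }
}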
Crawler Ethics
• Some websites don't want crawlers swarming all over them.
• Why?
  – Crawling increases the load on the server
  – Private websites
  – Dynamic websites
  – ...

Crawler Ethics
• How does the website tell you (the crawler) if and what is off limits?
• Two options:
  – Site-wide restrictions: robots.txt
  – Webpage-specific restrictions: meta tags

Crawler Ethics: robots.txt
• A file called "robots.txt" in the root directory of the website
• Example: http://www.about.com/robots.txt
• Format:
  User-Agent: <crawler name>
  Disallow: <don't-follow paths>
  Allow: <can-follow paths>

Crawler Ethics: robots.txt
• What should you do?
  – Before starting on a new website, check whether robots.txt exists.
  – If it does, download it and parse it for all inclusions and exclusions for the "generic crawler", i.e. User-Agent: *
  – Don't crawl anything in the exclusion list, including its sub-directories.

Crawler Ethics: Webpage-specific: Meta tags
• Some webpages have one of the following meta tag entries:
  <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
  <META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
  <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
• Options:
  – INDEX or NOINDEX
  – FOLLOW or NOFOLLOW

Twitter data collection
• Collect through the Twitter Streaming API
  – https://dev.twitter.com/docs/platform-objects/tweets, where you can check the data schema.
  – Rate limit: you get up to 1% of the whole Twitter traffic, which is about 4.3M tweets per day (about 2 GB).
  – You need a Twitter account for this. Check https://dev.twitter.com/

Third-party library
• Twitter4j for Java.
• You can find support for other languages as well.
• Well documented, with code examples, e.g. http://twitter4j.org/en/code-examples.html

Important Fields
• You should save at least the following fields:
  – Text
  – Timestamp
  – Geolocation
  – User of the tweet
  – Links

Crawl links in Tweets
• Tweets may contain links.
  – They may contain useful information, e.g. links to news articles.
• After collecting the tweets, use a separate process to crawl the links.
  – Crawling is slower, so you may not want to crawl a link right after you get the tweet.
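To make the Twitter side concrete, here is a minimal sketch using Twitter4j's streaming API. The OAuth credentials are placeholders for the keys from your own Twitter developer account, and the class name TweetCollector is illustrative only; the sketch just prints the required fields (text, timestamp, geolocation, user, links) instead of storing them, whereas in the project you would write them to a file or database and queue the links for a separate crawling process.

import twitter4j.*;
import twitter4j.conf.ConfigurationBuilder;

public class TweetCollector {
    public static void main(String[] args) {
        // OAuth credentials from your Twitter developer account (placeholders).
        ConfigurationBuilder cb = new ConfigurationBuilder()
                .setOAuthConsumerKey("CONSUMER_KEY")
                .setOAuthConsumerSecret("CONSUMER_SECRET")
                .setOAuthAccessToken("ACCESS_TOKEN")
                .setOAuthAccessTokenSecret("ACCESS_TOKEN_SECRET");

        TwitterStream stream = new TwitterStreamFactory(cb.build()).getInstance();

        stream.addListener(new StatusAdapter() {
            @Override
            public void onStatus(Status status) {
                // The fields the project asks you to keep.
                String text = status.getText();
                java.util.Date timestamp = status.getCreatedAt();
                GeoLocation geo = status.getGeoLocation();      // may be null
                String user = status.getUser().getScreenName();
                URLEntity[] links = status.getURLEntities();    // links to crawl later

                System.out.println(timestamp + " @" + user
                        + (geo != null ? " " + geo : "") + ": " + text);
                for (URLEntity link : links) {
                    // Save the expanded URL; crawl it in a separate process later.
                    System.out.println("  link: " + link.getExpandedURL());
                }
            }
        });

        // sample() taps the ~1% random sample of public tweets.
        stream.sample();
    }
}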