
How to Learn R


There are tons of resources to help you learn the different aspects of R, and as a beginner this can be overwhelming. It’s also a dynamic language and rapidly changing, so it’s important to keep up with the latest tools and technologies.

That’s why R-bloggers and DataCamp have worked together to bring you a learning path for R. Each section points you to relevant resources and tools to get you started and keep you engaged to continue learning. It’s a mix of materials ranging from documentation, online courses, books, and more.

Just like R, this learning path is a dynamic resource. We want to continually evolve and improve the resources to provide the best possible learning experience. So if you have suggestions for improvement please email tal.galili@gmail.com with your feedback.

Learning Path

Getting started:  The basics of R

Setting up your machine

R packages

Importing your data into R

Data Manipulation

Data Visualization

Data Science & Machine Learning with R

Reporting Results in R

Next steps

Getting started:  The basics of R


The best way to learn R is by doing. In case you are just getting started with R, this free introduction to R tutorial by DataCamp is a great resource, as is its successor Intermediate R programming (subscription required). Both courses teach you R programming and data science interactively, at your own pace, in the comfort of your browser. You get immediate feedback during exercises with helpful hints along the way so you don’t get stuck.

Another free online interactive learning tutorial for R is available on O’Reilly’s Code School website, called Try R. An offline interactive learning resource is swirl, an R package that makes it fun and easy to become an R programmer. You can take a swirl course by (i) installing the package in R, and (ii) selecting a course from the course library. If you want to start right away without needing to install anything, you can also opt for the online version of swirl.
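
For example, a minimal swirl session looks like the sketch below (course names may vary over time):

install.packages("swirl")   # one-time installation from CRAN
library(swirl)
swirl()                     # follow the prompts and pick a course from the course library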

There are also some very good MOOCs available on edX and Coursera that teach you the basics of R programming. On edX you can find Introduction to R Programming by Microsoft, an 8-hour course that focuses on the fundamentals and basic syntax of R. On Coursera there is the very popular R Programming course by Johns Hopkins. Both are highly recommended!

If you instead prefer to learn R via a written tutorial or book there is plenty of choice. There is the introduction to R manual by CRAN, as well as some very accessible books like Jared Lander’s R for Everyone or R in Action by Robert Kabacoff.

Setting up your machine

You can download a copy of R from the Comprehensive R Archive Network (CRAN). There are binaries available for Linux, Mac and Windows.

Once R is installed you can choose to either work with the basic R console, or with an integrated development environment (IDE). RStudio is by far the most popular IDE for R and supports debugging, workspace management, plotting and much more (make sure to check out the RStudio shortcuts).


Next to RStudio you also have Architect, an Eclipse-based IDE for R. If you prefer to work with a graphical user interface, you can have a look at R Commander (also known as Rcmdr) or Deducer.

R packages


R packages are the fuel that drives the growth and popularity of R. R packages are bundles of code, data, documentation, and tests that are easy to share with others. Before you can use a package, you will first have to install it. Some packages, like the base package, are automatically installed when you install R. Other packages, like the ggplot2 package, do not come bundled with the R installation and need to be installed separately.

Many (but not all) R packages are organized and available from CRAN, a network of servers around the world that store identical, up-to-date versions of code and documentation for R. You can easily install these packages from inside R using the install.packages() function. CRAN also maintains a set of Task Views that identify all the packages associated with a particular task, such as, for example, TimeSeries.
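
As a quick illustration, installing and loading a CRAN package takes just two calls (ggplot2 is used here only as an example):

install.packages("ggplot2")   # download and install the package from CRAN
library(ggplot2)              # load the package into the current session
update.packages()             # optionally, update all installed packages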

Next to CRAN you also have Bioconductor, which has packages for the analysis of high-throughput genomic data, as well as, for example, the GitHub and Bitbucket repositories of R package developers. You can easily install packages from these repositories using the devtools package.
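
A minimal sketch of installing a development version from GitHub with devtools (the "username/packagename" slug below is a placeholder, not a real repository):

install.packages("devtools")
library(devtools)
install_github("username/packagename")   # placeholder: replace with an actual user/repo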

Finding a package can be hard, but luckily you can easily search packages from CRAN, GitHub and Bioconductor using Rdocumentation or inside-R, or you can have a look at this quick list of useful R packages.

Finally, once you start working with R, you’ll quickly find out that R package dependencies can cause a lot of headaches. Once you are confronted with that issue, make sure to check out packrat (see video tutorial) or checkpoint. When you need to update R on Windows, you can use the updateR() function from the installr package.
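
For instance, updating R itself on Windows with installr is a short interactive session (a sketch; the function opens dialogs that guide you through the update):

install.packages("installr")
library(installr)
updateR()   # checks for a newer R release and walks you through the update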

Importing your data into R

The data you want to import into R can come in all sorts of formats: flat files, statistical software files, databases and web data.


Getting different types of data into R often requires a different approach. To learn more in general about how to get different data types into R, you can check out this online Importing Data into R tutorial (subscription required), this post on data importing, or this webinar by RStudio.

  • Flat files are typically simple text files that contain tabular data. The standard distribution of R provides functions such as read.table() and read.csv() from the utils package to import these flat files into R as a data frame. Dedicated packages are readr, a fast and very easy to use package that is less verbose than utils and multiple times faster (more information), and data.table, whose fread() function is great for importing and munging data into R (a short example follows this list).
  • Software packages such as SAS, Stata and SPSS use and produce their own file types. The haven package by Hadley Wickham can import SAS, Stata and SPSS data files into R and is very easy to use. Alternatively there is the foreign package, which is able to import not only SAS, Stata and SPSS files but also more exotic formats like Systat and Weka, and it can also export data again to various formats. (Tip: if you’re switching from SAS, SPSS or Stata to R, check out Bob Muenchen’s tutorial (subscription required).)
  • The packages used to connect to and import from a relational database depend on the type of database you want to connect to. For a MySQL database you will need the RMySQL package; others are, for example, the RPostgreSQL and ROracle packages. The R functions you can then use to access and manipulate the database are specified in another R package called DBI.
  • If you want to harvest web data using R, you need to connect R to resources online using APIs or through scraping with packages like rvest. To get started with all of this, there is a great resource freely available on the blog of Rolf Fredheim.
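
As a small illustration of the flat-file and statistical-software options above, the sketch below reads a hypothetical CSV file three ways and an SPSS file with haven; the file names "mydata.csv" and "survey.sav" are placeholders:

df1 <- read.csv("mydata.csv", stringsAsFactors = FALSE)   # base R (utils)

library(readr)
df2 <- read_csv("mydata.csv")                             # readr: fast and less verbose

library(data.table)
dt <- fread("mydata.csv")                                 # data.table::fread

library(haven)
sav <- read_spss("survey.sav")                            # SPSS file via haven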

Data Manipulation

Turning your raw data into well structured data is important for robust analysis, and to make data suitable for processing. R has many built-in functions for data processing, but they are not always that easy to use. Luckily, there are some great packages that can help you:

  • The tidyr package allows you to “tidy” your data. Tidy data is data where each column is a variable and each row an observation. As such, it turns your data into data that is easy to work with. Check this excellent resource on how you can tidy your data using tidyr.
  • If you want to do string manipulation, you should learn about the stringr package. The vignette is very understandable and full of useful examples to get you started.
  • dplyr is a great package when working with data-frame-like objects (in memory and out of memory). It combines speed with a very intuitive syntax (a small example follows this list). To learn more about dplyr you can take this data manipulation course (subscription required) and check out this handy cheat sheet.
  • When performing heavy data wrangling tasks, the data.table package should be your “go-to” package. It’s blazingly fast, and once you get the hang of its syntax you will find yourself using data.table all the time. Check this data analysis course (subscription required) to discover the ins and outs of data.table, and use this cheat sheet as a reference.
  • Chances are you’ll find yourself working with times and dates at some point. This can be a painful process, but luckily lubridate makes it a bit easier. Check its vignette to better understand how you can use lubridate in your day-to-day analysis.
  • Base R has limited functionality for handling time series data. Fortunately, there are packages like zoo, xts and quantmod. Take this tutorial by Eric Zivot to better understand how to use these packages and how to work with time series data in R.
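
To give a flavour of the dplyr syntax mentioned above, here is a small sketch that summarises the built-in mtcars dataset:

library(dplyr)

mtcars %>%
  group_by(cyl) %>%                 # group cars by number of cylinders
  summarise(mean_mpg = mean(mpg),   # average miles per gallon per group
            n = n()) %>%            # number of cars per group
  arrange(desc(mean_mpg))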

If you want a general overview of data manipulation with R, you can read more in the book Data Manipulation with R or see the Data Wrangling with R video by RStudio. In case you run into trouble handling your data frames, check 15 easy solutions to your data frame problems.

Data Visualization

One of the things that make R such a great tool is its data visualization capabilities. For performing visualizations in R, ggplot2 is probably the most well-known package and a must-learn for beginners! You can find all relevant information to get you started with ggplot2 on http://ggplot2.org/ and make sure to check out the cheatsheet and the upcoming book. Next to ggplot2, you also have packages such as ggvis for interactive web graphics (see tutorial (subscription required)), googleVis to interface with Google Charts (learn to re-create this TED talk), Plotly for R, and many more. See the task view for some hidden gems, and if you have some issues with plotting your data this post might help you out.
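
To get a feel for the ggplot2 grammar, a first plot can be as short as this sketch using the built-in mtcars data:

library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")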

In R there is a whole task view dedicated to handling spatial data that allows you to create beautiful maps.


To get started, look at a package such as ggmap, which allows you to visualize spatial data and models on top of static maps from sources such as Google Maps and OpenStreetMap. Alternatively you can start playing around with maptools, choroplethr, and the tmap package. If you need a great tutorial, take this Introduction to visualising spatial data in R.
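
A minimal sketch of the ggmap workflow is shown below; note that the underlying map services may nowadays require you to register an API key first, and the location and coordinates are only examples:

library(ggmap)

madrid <- get_map(location = "Madrid, Spain", zoom = 12)   # fetch a static base map
ggmap(madrid) +
  geom_point(aes(x = lon, y = lat),
             data = data.frame(lon = -3.70, lat = 40.42),  # an example coordinate
             colour = "red", size = 4)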

You’ll often see that visualizations in R make use of magnificent color schemes that fit the graph or map like a glove. If you want to achieve this for your own visualizations, dive into the RColorBrewer package and ColorBrewer.
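
For example, picking a ColorBrewer palette in R takes only a line or two:

library(RColorBrewer)

display.brewer.all()            # preview every available palette
cols <- brewer.pal(5, "Set2")   # take 5 colours from the "Set2" palette
barplot(rep(1, 5), col = cols)  # quick look at the chosen colours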

One of the latest visualization tools in R is HTML widgets. HTML widgets work just like R plots, but they create interactive web visualizations such as dynamic maps (leaflet), time-series charts (dygraphs), and interactive tables (DataTables). There are some very nice examples of HTML widgets in the wild, and solid documentation on how to create your own (not in a reading mood? Just watch this video).
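
As an illustration, a minimal leaflet widget takes only a few lines (the coordinates below are just an example):

library(leaflet)

leaflet() %>%
  addTiles() %>%                                             # default OpenStreetMap tiles
  addMarkers(lng = -0.1275, lat = 51.5072, popup = "London")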

If you want to get some inspiration on what visualization to create next, you can have a look at blogs dedicated to visualizations such as FlowingData.

Data Science & Machine Learning with R

There are many beginner resources on how to do data science with R. A list of available online courses:

Alternatively, if you prefer a good read:

Once you start doing some machine learning with R, you will quickly find yourself using packages such as caret, rpart and randomForest. Luckily, there are some great learning resources for these packages and machine learning in general. If you are just getting started, this guide will get you going in no time. Alternatively, you can have a look at the books Mastering Machine Learning with R and Machine Learning with R. If you are looking for some step-by-step tutorials that guide you through a real-life example, there is the Kaggle Machine Learning course, or you can have a look at Wiekvoet’s blog.

Reporting Results in R

R Markdown is an authoring format that enables easy creation of dynamic documents, presentations, and reports from R. It is a great tool for reporting your data analysis in a reproducible manner, thereby making the analysis more useful and understandable. R Markdown is based on knitr and pandoc. With R Markdown, R generates a final document that replaces the R code with its results. This document can be in HTML, Word, PDF, ioslides, etc. format. You can even create interactive R Markdown documents using Shiny. This 4-hour tutorial on Reporting with R Markdown (subscription required) gets you going with R Markdown, and in addition you can use this nice cheat sheet for future reference.
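
A minimal .Rmd skeleton, just for illustration, looks like this; rendering it with rmarkdown::render("report.Rmd") (the file name is a placeholder) produces the finished document:

---
title: "My analysis"
output: html_document
---

Some narrative text explaining the analysis.

```{r}
summary(cars)   # this chunk is run and its output is embedded in the report
plot(cars)
```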

Next to R Markdown, you should also make sure to check out Shiny. Shiny makes it incredibly easy to build interactive web applications with R. It allows you to turn your analysis into interactive web applications without needing to know HTML, CSS or JavaScript. RStudio maintains a great learning portal to get you started with Shiny, including this set of video tutorials (click on the essentials of Shiny Learning Roadmap). More advanced topics are available, as well as a great set of examples.
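
To give an idea of how little code an interactive app requires, here is a minimal sketch of a Shiny app with a slider that controls a histogram:

library(shiny)

ui <- fluidPage(
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$waiting, breaks = input$bins, col = "skyblue",
         main = "Waiting time between eruptions")
  })
}

shinyApp(ui = ui, server = server)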


Next steps

Once you become more fluent in writing R syntax (and consequently addicted to R), you will want to unlock more of its power (read: do some really nifty stuff). In that case make sure to check out Rcpp, an R package that makes it easier to integrate C++ code with R, or RevoScaleR (start the free tutorial).

After spending some time writing R code (and becoming an R addict), you’ll reach a point where you want to start writing your own R package. Hilary Parker from Etsy has written a short tutorial on how to create your first package, and if you’re really serious about it you need to read R Packages, an upcoming book by Hadley Wickham that is already available for free on the web.

If you want to start learning on the inner workings of R and improve your understanding of it, the best way to get you started is by reading Advanced R.

Finally, come visit us again at R-bloggers.com to read the latest news and tutorials from bloggers of the R community.


Integrating Python and R Part III: An Extended Example


(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

 

By Chris Musselle

This is the third post in a three-part series where I have explored the options available for including both R and Python in a data analysis pipeline. See post one for some reasons why you may wish to do this, and details of a general strategy involving flat files. Post two expands on this by showing how R or Python processes can call each other and pass arguments between them.

In this post I will be sharing a longer example using these approaches in an analysis we carried out at Mango as a proof of concept to cluster news articles. The pipeline involved the use of both R and Python at different stages, with a Python script being called from R to fetch the data, and the exploratory analysis being conducted in R.

Full implementation details can be found in the repository on GitHub here, though for brevity this article will focus on the core concepts, with the parts most relevant to R and Python integration discussed below.

Document Clustering

We were interested in the problem of document clustering of live published news articles, and specifically, wished to investigate times when multiple news websites were talking about the same content. As a first step towards this, we looked at sourcing live articles via RSS feeds, and used text mining methods to preprocess and cluster the articles based on their content.

Sourcing News Articles From RSS Feeds

There are some great Python tools out there for scraping and sourcing web data, and so for this task we used a combination of feedparser, requests, and BeautifulSoup to process the RSS feeds, fetch web content, and extract the parts we were interested in. The general code structure was as follows:

# fetch_RSS_feed.py

def get_articles(feed_url, json_filename='articles.json'):
    """Update JSON file with articles from RSS feed"""
    #
    # See github link for full function script
    #

if __name__ == '__main__':

    # Pass Arguments
    args = sys.argv[1:]
    feed_url = args[0]
    filepath = args[1]

    # Get the latest articles and append to the JSON file given
    get_articles(feed_url, filepath)

Here we can see that the get_articles function is defined to perform the bulk of the data sourcing tasks, and that the parameters passed to it are the positional arguments from the command line. Within get_articles, the URL, publication date, title and text content were then extracted for each article in the RSS feed and stored in a JSON file. For each article, the text content was made up of all HTML paragraph tags within the news article.

Sidenote: The if __name__ == "__main__": line may look strange to non-Python programmers, but this is a common way in Python scripts to control the sections of the code that are run when the whole script is executed, vs when the script is imported by another Python script. If the script is executed directly (as is the case when it is called from R later), the if statement evaluates to true and all code is run. If however, at some point in the future I wanted to reuse get_articles in another Python script, I could now import that function from this script without triggering the code within the if statement.

The above Python script was then executed from within R by defining the utility function shown below. Note that by using stdout=TRUE, any messages printed to stdout with print() in the Python code can be captured and displayed in the R console.

fetch_articles <- function(url, filepath) {

  command <- "python"
  path2script <- '"fetch_RSS_feed.py"'

  args <- c(url, filepath)
  allArgs <- c(path2script, args)

  output <- system2(command, args = allArgs, stdout = TRUE)
  print(output)
}

Loading Data into R

Once the data had been written to a JSON file, the next job was to get it into R to be used with the tm package for text mining. This proved a little trickier than first expected, however, as the tm package is mainly geared towards reading in documents from raw text files, or directories containing multiple text files. To convert the JSON file into the expected VCorpus object for tm I used the following:

load_json_file <- function(filepath) {

  # Load data from JSON (fromJSON() comes from a JSON package such as jsonlite or rjson)
  json_file <- file(filepath, "rb", encoding = "UTF-8")
  json_obj <- fromJSON(json_file)
  close(json_file)

  # Convert to a VCorpus for use with the tm package
  bbc_texts <- lapply(json_obj, FUN = function(x) x$text)
  df <- as.data.frame(bbc_texts)
  df <- t(df)
  articles <- VCorpus(DataframeSource(df))
  articles
}

Unicode Woes

One potential problem when manipulating text data from a variety of sources and passing it between languages is that you can easily get tripped up by character encoding errors en route. We found that by default Python was able to read in, process and write out the article content from the HTML sources, but R was struggling to decode certain characters that were written out to the resulting JSON file. This is due to the languages using or expecting a different character encoding by default.

To remedy this, you should be explicit about the encoding you are using when writing and reading a file, by specifying it when opening a file connection. This meant using the following in Python when writing out to a JSON file,

# Write updated file.
with open(json_filename, 'w', encoding='utf-8') as json_file:
    json.dump(JSON_articles, json_file, indent=4)

and on the R side opening the file connection was as follows:

# Load data from JSON
json_file <- file(filepath, "rb", encoding = "UTF-8")
json_obj <- fromJSON(json_file)
close(json_file)

Here the "UTF-8" Unicode encoding is chosen as it is a good default to use, and it is the most popular encoding used in HTML documents worldwide.

For more details on Unicode and ways of handling it in Python 2 and 3 see Ned Batchelder’s PyCon talk here.

Summary of Text Preprocessing and Analysis

The text preprocessing part of the analysis consisted of the following steps, which were all carried out using the tm package in R:

  • Tokenisation – Splitting text into words.
  • Punctuation and whitespace removal.
  • Conversion to lowercase.
  • Stemming – to consolidate different word endings.
  • Stopword removal – to ignore the most common and therefore least informative words.

Once cleaned and processed, the Term Frequency-Inverse Document Frequency (TF-IDF) statistic was calculated for the collection of articles. This statistic aims to provide a measure of how important each word is for a particular document, across a collection of documents. It is more sophisticated than just using the word frequencies themselves, as it takes into account that some words may naturally occur more frequently than others across all documents.
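
A sketch of how these steps typically look with tm, assuming articles is the VCorpus created earlier (the exact calls used in the project are in the linked repository):

library(tm)

articles <- tm_map(articles, content_transformer(tolower))   # lowercase
articles <- tm_map(articles, removePunctuation)              # punctuation removal
articles <- tm_map(articles, stripWhitespace)                # whitespace removal
articles <- tm_map(articles, removeWords, stopwords("en"))   # stopword removal
articles <- tm_map(articles, stemDocument)                   # stemming (needs SnowballC)

# Tokenisation happens when building the document-term matrix,
# here weighted by TF-IDF rather than raw term frequency
dtm <- DocumentTermMatrix(articles,
                          control = list(weighting = weightTfIdf))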

Finally a distance matrix was constructed based on the TF-IDF values and hierarchical clustering was performed. The results were then visualised as a dendrogram using the dendextend package in R.
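
A minimal sketch of that last step, assuming dtm is the TF-IDF weighted matrix from above (the original analysis coloured leaves by news source rather than by cluster):

library(dendextend)

d    <- dist(as.matrix(dtm))            # distance matrix between documents
hc   <- hclust(d, method = "ward.D2")   # hierarchical clustering
dend <- as.dendrogram(hc)
dend <- color_branches(dend, k = 3)     # colour branches into 3 groups for illustration
plot(dend)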

An example of the clusters formed from 475 articles published over the last 4 days is shown below where the leaf nodes are coloured according to their source, with blue corresponding to BBC News, green to The Guardian, and indigo to The Independent.

Article_Clustering_full

It is interesting here to see articles from the same news websites occasionally forming groups, suggesting that news websites often post multiple articles with similar content, which is plausible considering how news stories unfold over time.

What’s more interesting is finding clusters where multiple news websites are talking about similar things. Below is one such cluster with the article headlines displayed, which mostly relate to the recent flooding in Cumbria.

Article_subsection_flooding

Hierarchical clustering is often a useful step in exploratory data analysis, and this work gives some insight into what is possible with news article clustering from live RSS feeds. Future work will look to evaluate different clustering approaches in more detail by examining the quality of the clusters they produce.

Other Approaches

In this series we have focused on describing the simplest approach of using flat files as an intermediate storage medium between the two languages. However it is worth briefly mentioning several other options that are available, such as:

  • Using a database, such as SQLite, as the storage medium instead of flat files (a short sketch follows this list).
  • Passing the results of a script execution in memory instead of writing to an intermediate file.
  • Running two persistent R and Python processes at once, and passing data between them. Libraries such as rpy2 and rPython provide one such way of doing this.
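
For instance, the first option can be sketched with DBI and RSQLite; the table, file and data frame names here are hypothetical:

library(DBI)

con <- dbConnect(RSQLite::SQLite(), "articles.sqlite")        # hypothetical database file
dbWriteTable(con, "articles", articles_df, overwrite = TRUE)  # articles_df: a placeholder data frame
articles_back <- dbReadTable(con, "articles")                 # read it back on the other side
dbDisconnect(con)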

Each of these methods brings with it some additional pros and cons, and so the question of which is most suitable is often dependent on the application itself. As a first port of call though, using common flat file formats is a good place to start.

Summary

This post gave an extended example of how Mango have been using both Python and R to perform exploratory analysis around clustering news articles. We used the flat file air-gap strategy described in the first post in this series, and then automated the calling of Python from R by spawning a separate subprocess (described in the second post). As can be seen, with a bit of care around character encodings this provides a straightforward approach to “bridging the language gap”, and allows multiple skillsets to be utilised when performing a piece of analysis.


Online R courses at Udemy – for only $15 (“Christmas deal”)


tl;dr: $15 Christmas deal at Udemy (until the 24th)

For the next 3 days (until 12/24/2015, 6:00am PST), Udemy is offering readers of R-bloggers a special $15 deal (up to 97% off) on hundreds of their courses, including many on R programming, data science, machine learning, etc.

Click here to browse ALL (R and non-R) courses

(P.S.: if you are a company with a product you are willing to offer to R-bloggers readers at a good discount, please email me about it)

Advanced R courses: 

General R courses for “data science”: 


From Udemy:

We live in a new world where learning is not limited to the classroom or a book, but now on-demand, at your own pace, and on any device. Udemy is chock full of master courses and mini courses on everything from programming to photography, and we encourage you to take a look.

Their library of courses is quite extensive; you may also find interest in one of their other courses, ranging from writing and yoga to Excel, communication skills, app development, web design and more, still for $15 (up to 97% off).

 

Google scholar scraping with rvest package


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

In this post, I will show how to scrape Google Scholar. Particularly, we will use the rvest R package to scrape the Google Scholar account of my PhD advisor. We will see his coauthors, how many times they have been cited, and their affiliations. “rvest, inspired by libraries like beautiful soup, makes it easy to scrape (or harvest) data from html web pages”, wrote Hadley Wickham on the RStudio Blog. Since it is designed to work with magrittr, we can express complex operations as elegant pipelines composed of simple and easily understood pieces of code.

Load required libraries:

We will use ggplot2 to create plots.

library(rvest)
library(ggplot2)

How many times have his papers been cited

Let’s use SelectorGadget to find out which css selector matches the “cited by” column.

page <- read_html("https://scholar.google.com/citations?user=sTR9SIQAAAAJ&hl=en&oi=ao")

Specify the css selector in html_nodes() and extract the text with html_text(). Finally, change the string to numeric using as.numeric().

citations <- page %>% html_nodes("#gsc_a_b .gsc_a_c") %>% html_text() %>% as.numeric()

See the number of citations:

citations 
148 96 79 64 57 57 57 55 52 50 48 37 34 33 30 28 26 25 23 22 

Create a barplot of the number of citations:

barplot(citations, main="How many times has each paper been cited?", ylab='Number of citations', col="skyblue", xlab="")

Here is the plot:
barplot-gscholar

Coauthors, their affiliations and how many times they have been cited

My PhD advisor, Ben Zaitchik, is a really smart scientist. He not only has the skills to build networks and cooperate with other scientists, but also intelligence and patience.
Next, let’s see his coauthors, their affiliations and how many times they have been cited.
Similarly, we will use SelectorGadget to find out which css selector matches the co-authors.

page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")
Coauthors = page%>% html_nodes(css=".gsc_1usr_name a") %>% html_text()
Coauthors = as.data.frame(Coauthors)
names(Coauthors)='Coauthors'

Now let’s explore the coauthors:

head(Coauthors) 
                  Coauthors
1               Jason Evans
2             Mutlu Ozdogan
3            Rasmus Houborg
4          M. Tugrul Yilmaz
5 Joseph A. Santanello, Jr.
6              Seth Guikema

dim(Coauthors) 
[1] 27  1

As of today, he has published with 27 people.

How many times have his coauthors been cited?

page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")
citations = page%>% html_nodes(css = ".gsc_1usr_cby")%>%html_text()

citations 
 [1] "Cited by 2231"  "Cited by 1273"  "Cited by 816"   "Cited by 395"   "Cited by 652"   "Cited by 1531" 
 [7] "Cited by 674"   "Cited by 467"   "Cited by 7967"  "Cited by 3968"  "Cited by 2603"  "Cited by 3468" 
[13] "Cited by 3175"  "Cited by 121"   "Cited by 32"    "Cited by 469"   "Cited by 50"    "Cited by 11"   
[19] "Cited by 1187"  "Cited by 1450"  "Cited by 12407" "Cited by 1939"  "Cited by 9"     "Cited by 706"  
[25] "Cited by 336"   "Cited by 186"   "Cited by 192" 

Let’s extract the numeric characters only using global substitute.

citations = gsub('Cited by','', citations)

citations
 [1] " 2231"  " 1273"  " 816"   " 395"   " 652"   " 1531"  " 674"   " 467"   " 7967"  " 3968"  " 2603"  " 3468"  " 3175" 
[14] " 121"   " 32"    " 469"   " 50"    " 11"    " 1187"  " 1450"  " 12407" " 1939"  " 9"     " 706"   " 336"   " 186"  
[27] " 192"  

Change the strings to numeric and then to a data frame to make them easy to use with ggplot2:

citations = as.numeric(citations)
citations = as.data.frame(citations)

Affiliation of coauthors

page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")
affiliation = page %>% html_nodes(css = ".gsc_1usr_aff") %>% html_text()
affiliation = as.data.frame(affiliation)
names(affiliation)='Affiliation'

Now, let’s create a data frame that consists of coauthors, citations and affiliations:

cauthors=cbind(Coauthors, citations, affiliation)

cauthors 
                             Coauthors citations                                                                                  Affiliation
1                          Jason Evans      2231                                                               University of New South Wales
2                        Mutlu Ozdogan      1273    Assistant Professor of Environmental Science and Forest Ecology, University of Wisconsin
3                       Rasmus Houborg       816                    Research Scientist at King Abdullah University of Science and Technology
4                     M. Tugrul Yilmaz       395 Assistant Professor, Civil Engineering Department, Middle East Technical University, Turkey
5            Joseph A. Santanello, Jr.       652                                                  NASA-GSFC Hydrological Sciences Laboratory
.....

Re-order coauthors based on their citations

Let’s re-order coauthors based on their citations so as to make our plot in a decreasing order.

cauthors$Coauthors <- factor(cauthors$Coauthors, levels = cauthors$Coauthors[order(cauthors$citations, decreasing=F)])

ggplot(cauthors, aes(Coauthors, citations)) +
  geom_bar(stat="identity", fill="#ff8c1a", size=5) +
  theme(axis.title.y = element_blank()) + ylab("# of citations") +
  theme(plot.title = element_text(size = 18, colour="blue"),
        axis.text.y = element_text(colour="grey20", size=12)) +
  ggtitle('Citations of his coauthors') + coord_flip()

Here is the plot:
citation-gscholar-authors

He has published with scientists who have been cited more than 12000 times and with students like me who are just toddling.

Summary

In this post, we saw how to scrape Google Scholar. We scraped the account of my advisor and got data on the citations of his papers and on his coauthors, with their affiliations and how many times they have been cited.

As we have seen in this post, it is easy to scrape an HTML page using the rvest R package. It is also important to note that SelectorGadget is useful for finding out which css selector matches the data of interest.

If you have any question feel free to post a comment below.


Sentiment Analysis on Donald Trump using R and Tableau


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Recently, the presidential candidate Donald Trump has become controversial. In particular, he has faced strong criticism over his provocative call to temporarily bar Muslims from entering the US.
One of the many uses of social media analytics is sentiment analysis, where we evaluate whether posts on a specific issue are positive or negative.
We can integrate R and Tableau for text data mining in social media analytics, machine learning, predictive modeling, etc., by taking advantage of the numerous R packages and compelling Tableau visualizations.

In this post, let’s mine tweets and analyze their sentiment using R. We will use Tableau to visualize our results. We will see spatial-temporal distribution of tweets, cities and states with top number of tweets and we will also map the sentiment of the tweets. This will help us to see in which areas his comments are accepted as positive and where they are perceived as negative.

Load required packages:

library(twitteR)
library(ROAuth)
require(RCurl)
library(stringr)
library(tm)
library(ggmap)
library(dplyr)
library(plyr)
library(wordcloud)

Get Twitter authentication

All information below is obtained from a Twitter developer account. We will set the working directory to save our authentication.

key="hidden"
secret="hidden"
setwd("/text_mining_and_web_scraping")

download.file(url="http://curl.haxx.se/ca/cacert.pem",
              destfile="/text_mining_and_web_scraping/cacert.pem",
              method="auto")
authenticate <- OAuthFactory$new(consumerKey=key,
                                 consumerSecret=secret,
                                 requestURL="https://api.twitter.com/oauth/request_token",
                                 accessURL="https://api.twitter.com/oauth/access_token",
                                 authURL="https://api.twitter.com/oauth/authorize")
setup_twitter_oauth(key, secret)
save(authenticate, file="twitter authentication.Rdata")

Get sample tweets from various cities

Let’s scrape the most recent tweets from various cities across the US. Let’s request 2000 tweets from each city. We will need the latitude and longitude of each city.

N=2000  # tweets to request from each query
S=200  # radius in miles
lats=c(38.9,40.7,37.8,39,37.4,28,30,42.4,48,36,32.3,33.5,34.7,33.8,37.2,41.2,46.8,
       46.6,37.2,43,42.7,40.8,36.2,38.6,35.8,40.3,43.6,40.8,44.9,44.9)

lons=c(-77,-74,-122,-105.5,-122,-82.5,-98,-71,-122,-115,-86.3,-112,-92.3,-84.4,-93.3,
       -104.8,-100.8,-112, -93.3,-89,-84.5,-111.8,-86.8,-92.2,-78.6,-76.8,-116.2,-98.7,-123,-93)

#cities=DC,New York,San Francisco,Colorado,Mountain View,Tampa,Austin,Boston,
#       Seattle,Vegas,Montgomery,Phoenix,Little Rock,Atlanta,Springfield,
#       Cheyenne,Bismarck,Helena,Springfield,Madison,Lansing,Salt Lake City,Nashville,
#       Jefferson City,Raleigh,Harrisburg,Boise,Lincoln,Salem,St. Paul

donald=do.call(rbind,lapply(1:length(lats), function(i) searchTwitter('Donald+Trump',
              lang="en",n=N,resultType="recent",
              geocode=paste(lats[i],lons[i],paste0(S,"mi"),sep=","))))

Let’s get the latitude and longitude of each tweet, the tweet itself, how many times it was retweeted and favorited, the date and time it was tweeted, etc.

donaldlat=sapply(donald, function(x) as.numeric(x$getLatitude()))
donaldlat=sapply(donaldlat, function(z) ifelse(length(z)==0,NA,z))  

donaldlon=sapply(donald, function(x) as.numeric(x$getLongitude()))
donaldlon=sapply(donaldlon, function(z) ifelse(length(z)==0,NA,z))  

donalddate=lapply(donald, function(x) x$getCreated())
donalddate=sapply(donalddate,function(x) strftime(x, format="%Y-%m-%d %H:%M:%S",tz = "UTC"))

donaldtext=sapply(donald, function(x) x$getText())
donaldtext=unlist(donaldtext)

isretweet=sapply(donald, function(x) x$getIsRetweet())
retweeted=sapply(donald, function(x) x$getRetweeted())
retweetcount=sapply(donald, function(x) x$getRetweetCount())

favoritecount=sapply(donald, function(x) x$getFavoriteCount())
favorited=sapply(donald, function(x) x$getFavorited())

data=as.data.frame(cbind(tweet=donaldtext,date=donalddate,lat=donaldlat,lon=donaldlon,
                           isretweet=isretweet,retweeted=retweeted, retweetcount=retweetcount,favoritecount=favoritecount,favorited=favorited))

First, let’s create a word cloud of the tweets. A word cloud helps us to visualize the most common words in the tweets and to get a general feeling for the tweets.

# Create corpus
corpus=Corpus(VectorSource(data$tweet))

# Convert to lower-case
corpus=tm_map(corpus,tolower)

# Remove stopwords
corpus=tm_map(corpus,function(x) removeWords(x,stopwords()))

# convert corpus to a Plain Text Document
corpus=tm_map(corpus,PlainTextDocument)

col=brewer.pal(6,"Dark2")
wordcloud(corpus, min.freq=25, scale=c(5,2), rot.per=0.25,
          random.color=T, max.words=45, random.order=F, colors=col)

Here is the word cloud:
wordcloud_donald

We see from the word cloud that among the most frequent words in the tweets are ‘muslim’, ‘muslims’, ‘ban’. This suggests that most tweets were on Trump’s recent idea of temporarily banning Muslims from entering the US.

The dashboard below shows a time series of the number of tweets scraped. We can change the time unit between hour and day and the dashboard will change based on the selected time unit. The pattern of the number of tweets over time helps us to drill in and see how activities/campaigns are being perceived.

Here is the screenshot. (View it live in this link)
tableau-screenshot1

Getting address of tweets

Since some tweets do not have lat/lon values, we will remove them because we want geographic information to show the tweets and their attributes by state, city and zip code.

data=filter(data, !is.na(lat),!is.na(lon))
lonlat=select(data,lon,lat)

Let’s get the full address of each tweet location using the Google Maps API. The ggmap package is what enables us to get the street address, city, zipcode and state of the tweets using the longitude and latitude of the tweets. Since the Google Maps API does not allow more than 2500 queries per day, I used a couple of machines to reverse geocode the latitude/longitude information into a full address. However, I was not lucky enough to reverse geocode all of the tweets I scraped. So, in the following visualizations, I am showing only the percentage of the scraped tweets that I was able to reverse geocode.

result <- do.call(rbind, lapply(1:nrow(lonlat),
                     function(i) revgeocode(as.numeric(lonlat[i,1:2]))))

If we see some of the values of result, we see that it contains the full address of the locations where the tweets were posted.

result[1:5,]
     [,1]                                              
[1,] "1778 Woodglo Dr, Asheboro, NC 27205, USA"        
[2,] "1550 Missouri Valley Rd, Riverton, WY 82501, USA"
[3,] "118 S Main St, Ann Arbor, MI 48104, USA"         
[4,] "322 W 101st St, New York, NY 10025, USA"         
[5,] "322 W 101st St, New York, NY 10025, USA"

So, we will apply some regular expression and string manipulation to separate the city, zip code and state into different columns.

data2=lapply(result,  function(x) unlist(strsplit(x,",")))
address=sapply(data2,function(x) paste(x[1:3],collapse=''))
city=sapply(data2,function(x) x[2])
stzip=sapply(data2,function(x) x[3])
zipcode = as.numeric(str_extract(stzip,"[0-9]{5}"))   
state=str_extract(stzip,"[:alpha:]{2}")
data2=as.data.frame(list(address=address,city=city,zipcode=zipcode,state=state))

Concatenate data2 to data:

data=cbind(data,data2)

Some text cleaning:

tweet=data$tweet
tweet_list=lapply(tweet, function(x) iconv(x, "latin1", "ASCII", sub=""))
tweet_list=lapply(tweet_list, function(x) gsub("htt.*",' ',x))
tweet=unlist(tweet_list)
data$tweet=tweet

We will use lexicon based sentiment analysis. A list of positive and negative opinion words or sentiment words for English was downloaded from here.

positives= readLines("positivewords.txt")
negatives = readLines("negativewords.txt")

First, let’s have a wrapper function that calculates sentiment scores.

sentiment_scores = function(tweets, positive_words, negative_words, .progress='none'){
  scores = laply(tweets,
                 function(tweet, positive_words, negative_words){
                 tweet = gsub("[[:punct:]]", "", tweet)    # remove punctuation
                 tweet = gsub("[[:cntrl:]]", "", tweet)   # remove control characters
                 tweet = gsub('\\d+', '', tweet)          # remove digits
                
                 # Let's have error handling function when trying tolower
                 tryTolower = function(x){
                     # create missing value
                     y = NA
                     # tryCatch error
                     try_error = tryCatch(tolower(x), error=function(e) e)
                     # if not an error
                     if (!inherits(try_error, "error"))
                       y = tolower(x)
                     # result
                     return(y)
                   }
                   # use tryTolower with sapply
                   tweet = sapply(tweet, tryTolower)
                   # split sentence into words with str_split function from stringr package
                   word_list = str_split(tweet, "\\s+")
                   words = unlist(word_list)
                   # compare words to the dictionaries of positive & negative terms
                   positive.matches = match(words, positive_words)
                   negative.matches = match(words, negative_words)
                   # get the position of the matched term or NA
                   # we just want a TRUE/FALSE
                   positive_matches = !is.na(positive.matches)
                   negative_matches = !is.na(negative.matches)
                   # final score
                   score = sum(positive_matches) - sum(negative_matches)
                   return(score)
                 }, positive_words, negative_words, .progress=.progress )
  return(scores)
}

score = sentiment_scores(tweet, positives, negatives, .progress='text')
data$score=score

Let’s plot a histogram of the sentiment score:

hist(score, xlab=" ", main="Sentiment of sample tweets\nthat have Donald Trump in them",
     border="black", col="skyblue")

Here is the plot:
hist1_2_16

We see from the histogram that the sentiment is slightly positive. Using Tableau, we will see the spatial distribution of the sentiment scores.

Save the data as a CSV file and import it into Tableau

The map below shows the tweets that I was able to reverse geocode. The size is proportional to the number of favorites each tweet got. In the interactive map, we can hover each circle and read the tweet, the address it was tweeted from, and the date and time it was posted.

Here is the screenshot (View it live in this link)
by_retweets

Similarly, the dashboard below shows the tweets and the size is proportional to the number of times each tweet was retweeted.
Here is the screenshot (View it live in this link)
by_retweets

In the following three visualizations, top zip codes, cities and states by the number of tweets are shown. In the interactive map, we can change the number of zip codes, cities and states to display by using the scrollbars shown in each viz. These visualizations help us to see the distribution of the tweets by zip code, city and state.

By zip code
Here is the screenshot (View it live in this link)
top10zip

By city
Here is the screenshot (View it live in this link)
top15cities

By state
Here is the screenshot (View it live in this link)
top15zip

Sentiment of tweets

Sentiment analysis has myriad uses. For example, a company may investigate what customers like most about the company’s product, and what issues they are not satisfied with. When a company releases a new product, has the product been perceived positively or negatively? How does the sentiment of the customers vary across space and time? In this post, we are evaluating the sentiment of the tweets on Donald Trump that we scraped.

The viz below shows the sentiment score of the reverse geocoded tweets by state. We see that the tweets have the highest positive sentiment in NY, NC and TX.
Here is the screenshot (View it live in this link)
by_sentiment

Summary

In this post, we saw how to integrate R and Tableau for text mining, sentiment analysis and visualization. Using these tools together enables us to answer detailed questions.

We used a sample of the most recent tweets that mention Donald Trump, and since I was not able to reverse geocode all the tweets I scraped because of the constraint imposed by the Google Maps API, we just used about 6000 tweets. The average sentiment is slightly above zero. Some states show strong positive sentiment. However, statistically speaking, to make robust conclusions, mining an ample sample size is important.

The accuracy of our sentiment analysis depends on how fully the words in the tweets are covered by the lexicon. Moreover, since tweets may contain slang, jargon and colloquial words which may not be included in the lexicon, sentiment analysis needs careful evaluation.

This is enough for today. I hope you enjoyed it! If you have any questions or feedback, feel free to leave a comment.


100 “must read” R-bloggers’ posts for 2015


The site R-bloggers.com is now 6 years young. It strives to be an (unofficial) online news and tutorials website for the R community, written by over 600 bloggers who agreed to contribute their R articles to the website. In 2015, the site served almost 17.7 million pageviews to readers worldwide.

In celebration of R-bloggers’ 6th birth-month, here are the top 100 most read R posts written in 2015. Enjoy:

  1. How to Learn R
  2. How to Make a Histogram with Basic R
  3. How to Make a Histogram with ggplot2
  4. Choosing R or Python for data analysis? An infographic
  5. How to Get the Frequency Table of a Categorical Variable as a Data Frame in R
  6. How to perform a Logistic Regression in R
  7. A new interactive interface for learning R online, for free
  8. How to learn R: A flow chart
  9. Learn Statistics and R online from Harvard
  10. Twitter’s new R package for anomaly detection
  11. R 3.2.0 is released (+ using the installr package to upgrade in Windows OS)
  12. What’s the probability that a significant p-value indicates a true effect?
  13. Fitting a neural network in R; neuralnet package
  14. K-means clustering is not a free lunch
  15. Why you should learn R first for data science
  16. How to format your chart and axis titles in ggplot2
  17. Illustrated Guide to ROC and AUC
  18. The Single Most Important Skill for a Data Scientist
  19. A first look at Spark
  20. Change Point Detection in Time Series with R and Tableau
  21. Interactive visualizations with R – a minireview
  22. The leaflet package for online mapping in R
  23. Programmatically create interactive Powerpoint slides with R
  24. My New Favorite Statistics & Data Analysis Book Using R
  25. Dark themes for writing
  26. How to use SparkR within Rstudio?
  27. Shiny 0.12: Interactive Plots with ggplot2
  28. 15 Questions All R Users Have About Plots
  29. This R Data Import Tutorial Is Everything You Need
  30. R in Business Intelligence
  31. 5 New R Packages for Data Scientists
  32. Basic text string functions in R
  33. How to get your very own RStudio Server and Shiny Server with DigitalOcean
  34. Think Bayes: Bayesian Statistics Made Simple
  35. 2014 highlight: Statistical Learning course by Hastie & Tibshirani
  36. ggplot 2.0.0
  37. Machine Learning in R for beginners
  38. Top 77 R posts for 2014 (+R jobs)
  39. Introducing Radiant: A shiny interface for R
  40. Eight New Ideas From Data Visualization Experts
  41. Microsoft Launches Its First Free Online R Course on edX
  42. Imputing missing data with R; MICE package
  43. “Variable Importance Plot” and Variable Selection
  44. The Data Science Industry: Who Does What (Infographic)
  45. d3heatmap: Interactive heat maps
  46. R + ggplot2 Graph Catalog
  47. Time Series Graphs & Eleven Stunning Ways You Can Use Them
  48. Working with “large” datasets, with dplyr and data.table
  49. Why the Ban on P-Values? And What Now?
  50. Part 3a: Plotting with ggplot2
  51. Importing Data Into R – Part Two
  52. How-to go parallel in R – basics + tips
  53. RStudio v0.99 Preview: Graphviz and DiagrammeR
  54. Downloading Option Chain Data from Google Finance in R: An Update
  55. R: single plot with two different y-axes
  56. Generalised Linear Models in R
  57. Hypothesis Testing: Fishing for Trouble
  58. The advantages of using count() to get N-way frequency tables as data frames in R
  59. Playing with R, Shiny Dashboard and Google Analytics Data
  60. Benchmarking Random Forest Implementations
  61. Fuzzy String Matching – a survival skill to tackle unstructured information
  62. Make your R plots interactive
  63. R #6 in IEEE 2015 Top Programming Languages, Rising 3 Places
  64. How To Analyze Data: Seven Modern Remakes Of The Most Famous Graphs Ever Made
  65. dplyr 0.4.0
  66. Installing and Starting SparkR Locally on Windows OS and RStudio
  67. Making R Files Executable (under Windows)
  68. Evaluating Logistic Regression Models
  69. Awesome-R: A curated list of the best add-ons for R
  70. Introducing Distributed Data-structures in R
  71. SAS vs R? The right answer to the wrong question?
  72. But I Don’t Want to Be a Statistician!
  73. Get data out of excel and into R with readxl
  74. Interactive R Notebooks with Jupyter and SageMathCloud
  75. Learning R: Index of Online R Courses, October 2015
  76. R User Group Recap: Heatmaps and Using the caret Package
  77. R Tutorial on Reading and Importing Excel Files into R
  78. R 3.2.2 is released
  79. Wanted: A Perfect Scatterplot (with Marginals)
  80. KDD Cup 2015: The story of how I built hundreds of predictive models….And got so close, yet so far away from 1st place!
  81. Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance
  82. 10 Top Tips For Becoming A Better Coder!
  83. James Bond movies
  84. Modeling and Solving Linear Programming with R – Free book
  85. Scraping Web Pages With R
  86. Why you should start by learning data visualization and manipulation
  87. R tutorial on the Apply family of functions
  88. The relation between p-values and the probability H0 is true is not weak enough to ban p-values
  89. A Bayesian Model to Calculate Whether My Wife is Pregnant or Not
  90. First year books
  91. Using rvest to Scrape an HTML Table
  92. dplyr Tutorial: verbs + split-apply
  93. RStudio Clone for Python – Rodeo
  94. Time series outlier detection (a simple R function)
  95. Building Wordclouds in R
  96. Should you teach Python or R for data science?
  97. Free online data mining and machine learning courses by Stanford University
  98. Centering and Standardizing: Don’t Confuse Your Rows with Your Columns
  99. Network analysis with igraph
  100. Regression Models, It’s Not Only About Interpretation 

    (oh heck, why not include a few more posts…)

  101. magrittr: The best thing to have ever happened to R?
  102. How to Speak Data Science
  103. R vs Python: a Survival Analysis with Plotly
  104. 15 Easy Solutions To Your Data Frame Problems In R
  105. R for more powerful clustering
  106. Using the R MatchIt package for propensity score analysis
  107. Interactive charts in R
  108. R is the fastest-growing language on StackOverflow
  109. Hash Table Performance in R: Part I
  110. Review of ‘Advanced R’ by Hadley Wickham
  111. Plotting Time Series in R using Yahoo Finance data
  112. R: the Excel Connection
  113. Cohort Analysis with Heatmap
  114. Data Visualization cheatsheet, plus Spanish translations
  115. Back to basics: High quality plots using base R graphics
  116. 6 Machine Learning Visualizations made in Python and R
  117. An R tutorial for Microsoft Excel users
  118. Connecting R to Everything with IFTTT
  119. Data Manipulation with dplyr
  120. Correlation and Linear Regression
  121. Why has R, despite quirks, been so successful?
  122. Introducing shinyjs: perform common JavaScript operations in Shiny apps using plain R code
  123. R: How to Layout and Design an Infographic
  124. New package for image processing in R
  125. In-database R coming to SQL Server 2016
  126. Making waffle charts in R (with the new ‘waffle’ package)
  127. Revolution Analytics joins Microsoft
  128. Six Ways You Can Make Beautiful Graphs (Like Your Favorite Journalists)

 

p.s.: 2015 was also a great year for R-users.com, a job board site for R users. If you are an employer who is looking to hire people from the R community, please visit this link to post a new R job (it’s free, and registration takes less than 10 seconds). If you are a job seeker, please follow the links below to learn more and apply for your job of interest (or visit previous R jobs posts).

 


A Checkpoint Of Spanish Football League


(This article was first published on Ripples, and kindly contributed to R-bloggers)

I am an absolute beginner, but I am absolutely sane (Absolute Beginners, David Bowie)

Some time ago I wrote this post, where I correctly predicted the winner of the Spanish Football League several months before its ending. After thinking intensely about taking the risk of ruining my reputation by repeating the analysis, I said “no problem, Antonio, do it again: in the end you don’t have any reputation to keep”. So here we are.

From a technical point of view there are many differences between the two analyses. Now I use web scraping to download data, dplyr and pipes to do transformations, and interactive D3.js graphs to show results. I think my code is better now and it makes me happy.

As I did the other time, the Bradley-Terry model gives an indicator of the power of each team, called ability, which provides a natural mechanism for ranking teams. This is the evolution of the abilities of each team during the championship (the last season was played during the past weekend):

liga1_ability2

Although it is a bit messy, the graph shows two main groups of teams: on the one hand, Barcelona, Atletico de Madrid, Real Madrid and Villarreal; on the other hand, the rest. Let’s have a closer look at the evolution of the abilities of the top 4 teams:

liga2_ability2

While Barcelona, Atletico de Madrid and Real Madrid walk in parallel, Villarreal seems to be a bit stuck in the last few seasons; the gap between them and Real Madrid is increasing little by little. Maybe it is the Zidane effect. It is quite interesting to discover which teams are increasing their abilities: Malaga, Eibar and Getafe. They will probably finish the championship in a better position than they hold nowadays (Eibar could reach fifth position):

liga3_ability2

What about Villarreal? Will they move up some positions? I don’t think so. This plot shows their probability of beating any of the top 3:

liga4_villareal2

As you can see, the probability is decreasing significantly. And what about Barcelona? Will they win? It is a very difficult question. They are almost tied with Atletico de Madrid, and only 5 and 8 points above Real Madrid and Villarreal. But it seems Barcelona keep them at bay. This plot shows the evolution of the probability of Barcelona being beaten by Atletico, Real Madrid and Villarreal:

liga5_Barcelona2

All probabilities are under 50% and decreasing (I assumed a 2-0 score for Barcelona in the match against Sporting of season 16, which was postponed to next February 17th).
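
These probabilities all come from the Bradley-Terry model, in which the chance that one team beats another depends only on the difference between their estimated abilities; the prob_BT helper in the code below is a direct implementation of this (shown here with more descriptive argument names):

prob_BT <- function(ability_i, ability_j) {
  exp(ability_i - ability_j) / (1 + exp(ability_i - ability_j))   # P(team i beats team j)
}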

Data science is a profession for brave people so it is time to do some predictions. These are mine, ordered by likelihood:

  • Barcelona will win, followed by Atletico (2), Real Madrid (3), Villarreal (4) and Eibar (5)
  • Malaga and Getafe will go up some positions
  • Next year I will do the analysis again

Here you have the code:

library(rvest)
library(stringr)
library(BradleyTerry2)
library(dplyr)
library(reshape)
library(rCharts)
nseasons=20
results=data.frame()
# Scrape the results table for each matchday ("season" in the code) from marca.com
for (i in 1:nseasons)
{
  webpage=paste0("http://www.marca.com/estadisticas/futbol/primera/2015_16/jornada_",i,"/")
  read_html(webpage) %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table(header=FALSE, fill=TRUE) %>%
    mutate(X4=i) %>%
    rbind(results)->results
}
colnames(results)=c("home", "score", "visiting", "season")
results %>% 
  mutate(home     = iconv(home,     from="UTF8",to="ASCII//TRANSLIT"),
         visiting = iconv(visiting, from="UTF8",to="ASCII//TRANSLIT")) %>%
  #filter(grepl("-", score)) %>%
  mutate(score=replace(score, score=="18:30 - 17/02/2016", "0-2")) %>% # placeholder result for Barcelona's postponed match
  mutate(score_home     = as.numeric(str_split_fixed(score, "-", 2)[,1])) %>%
  mutate(score_visiting = as.numeric(str_split_fixed(score, "-", 2)[,2])) %>%
  mutate(points_home     =ifelse(score_home > score_visiting, 3, ifelse(score_home < score_visiting, 0, 1))) %>%
  mutate(points_visiting =ifelse(score_home > score_visiting, 0, ifelse(score_home < score_visiting, 3, 1))) -> data
# Bradley-Terry probability that a team with ability x beats a team with ability y
prob_BT=function(x, y) {exp(x-y) / (1 + exp(x-y))}
# Fit the Bradley-Terry model cumulatively up to each matchday and store the estimated abilities
BTabilities=data.frame()
for (i in 13:nseasons)
{
  data %>% filter(season<=i) %>%
    BTm(cbind(points_home, points_visiting), home, visiting, data=.) -> footballBTModel
  BTabilities(footballBTModel) %>%
  as.data.frame()  -> tmp 
  cbind(tmp, as.character(rownames(tmp)), i) %>% 
  mutate(ability=round(ability, digits = 2)) %>%
  rbind(BTabilities) -> BTabilities
}
colnames(BTabilities)=c("ability", "s.e.", "team", "season")
sort(unique(BTabilities[,"team"])) -> teams
# Compute pairwise win probabilities between all teams for each matchday
BTprobabilities=data.frame()
for (i in 13:nseasons)
{
  BTabilities[BTabilities$season==i,1] %>% outer( ., ., prob_BT) -> tmp
  colnames(tmp)=teams
  rownames(tmp)=teams  
  cbind(melt(tmp),i) %>% rbind(BTprobabilities) -> BTprobabilities
}
colnames(BTprobabilities)=c("team1", "team2", "probability", "season")
BTprobabilities %>% 
  filter(team1=="Villarreal") %>% 
  mutate(probability=round(probability, digits = 2)) %>%
  filter(team2 %in% c("R. Madrid", "Barcelona", "Atletico")) -> BTVillareal
BTprobabilities %>% 
  filter(team2=="Barcelona") %>% 
  mutate(probability=round(probability, digits = 2)) %>%
  filter(team1 %in% c("R. Madrid", "Villarreal", "Atletico")) -> BTBarcelona
AbilityPlot <- nPlot(
  ability ~ season, 
  data = BTabilities, 
  group = "team",
  type = "lineChart")
AbilityPlot$yAxis(axisLabel = "Estimated Ability", width = 62)
AbilityPlot$xAxis(axisLabel = "Season")
VillarealPlot <- nPlot(
  probability ~ season, 
  data = BTVillareal, 
  group = "team2",
  type = "lineChart")
VillarealPlot$yAxis(axisLabel = "Probability of beating", width = 62)
VillarealPlot$xAxis(axisLabel = "Season")
BarcelonaPlot <- nPlot(
  probability ~ season, 
  data = BTBarcelona, 
  group = "team1",
  type = "lineChart")
BarcelonaPlot$yAxis(axisLabel = "Probability of being beaten", width = 62)
BarcelonaPlot$xAxis(axisLabel = "Season")

To leave a comment for the author, please follow the link and comment on their blog: Ripples.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Materials for NYU Shortcourse “Data Science and Social Science”


(This article was first published on R – Bad Hessian, and kindly contributed to R-bloggers)

Pablo Barberá, Dan Cervone, and I prepared a short course at New York University on Data Science and Social Science, sponsored by several institutes at NYU. The course was intended as an introduction to R and basic data science tasks, including data visualization, social network analysis, textual analysis, web scraping, and APIs. The workshop is geared towards social scientists with little experience in R, but experience with other statistical packages.

You can download and tinker around with the materials on GitHub.

To leave a comment for the author, please follow the link and comment on their blog: R – Bad Hessian.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

R User Groups on GitHub


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

Quite a few times over the past few years I have highlighted presentations posted by R user groups on their websites and recommended these sites as a source of interesting material, but I had never thought to see what the user groups were doing on GitHub. As you might expect, many people who present at R user group meetings make their code available on GitHub. However, as best I can tell, only a few R user groups are maintaining GitHub sites under the user group name.

The Indy UseR Group is one that seems to be making very good use of their GitHub site. Here is the link to a very nice tutorial from Shankar Vaidyaraman on using the rvest package to do some web scraping with R. The following code, which scrapes the first page of Springer's Use R! series to produce a short list of books, comes from Shankar's simple example:

# load libraries
library(rvest)
library(dplyr)
library(stringr)
 
# link to Use R! titles at Springer site
useRlink = "http://www.springer.com/?SGWID=0-102-24-0-0&series=Use+R&sortOrder=relevance&searchType=ADVANCED_CDA&searchScope=editions&queryText=Use+R"
 
# Read the page
userPg = useRlink %>% read_html()
 
## Get info of books displayed on the page
booktitles = userPg %>% html_nodes(".productGraphic img") %>% html_attr("alt")
bookyr = userPg %>% html_nodes(xpath = "//span[contains(@class,'renditionDescription')]") %>% html_text()
bookauth = userPg %>% html_nodes("span[class = 'displayBlock']") %>% html_text()
bookprice = userPg %>% html_nodes(xpath = "//div[@class = 'bookListPriceContainer']//span[2]") %>% html_text()
pgdf = data.frame(title = booktitles, pubyr = bookyr, auth = bookauth, price = bookprice)
pgdf

 

Book_list

This plot, which shows a list of books ranked by number of downloads, comes from Shankar's extended recommender example.

Book_downloads

The Ann Arbor R User Group meetup site has done an exceptional job of creating an aesthetically pleasing and informative web property on their GitHub site.

AnnArbor_github

I am particularly impressed with the way they have integrated news, content and commentary into their "News" section. Scroll down the page and have a look at the care taken to describe and document the presentations made to the group. I found the introduction and slides for Bob Carpenter's RStan presentation very well done.

StanvsAlternatives

Other RUGs active on GitHub include:

If your R user group is on GitHub and I have not included you in my short list, please let me know about it. I think RUG GitHub sites have the potential to create a rich code-sharing experience among user groups. If you would like some help getting started with GitHub, have a look at the tutorials on the Murdoch University R User Group webpage.

 

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Craft httr calls cleverly with curlconverter


(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

When you visit a site like the LA Times’ NH Primary Live Results site and wish you had the data that they used to make the tables & visualizations on the site:

primary

Sometimes it’s as simple as opening up your browsers “Developer Tools” console and looking for XHR (XML HTTP Requests) calls:

XHR

You can actually see a preview of those requests (usually JSON):

Developer Tools preview of the JSON response

While you could go through all the headers and cookies and transcribe them into httr::GET or httr::POST requests, that’s tedious, especially when most browsers present an option to “Copy as cURL”. cURL is a command-line tool (with a corresponding systems programming library) that you can use to grab data from URIs. The RCurl and curl packages in R are built with the underlying library. The cURL command line captures all of the information necessary to replicate the request the browser made for a resource. The cURL command line for the URL that gets the Republican data is:

curl 'http://graphics.latimes.com/election-2016-31146-feed.json' \
  -H 'Pragma: no-cache' \
  -H 'DNT: 1' \
  -H 'Accept-Encoding: gzip, deflate, sdch' \
  -H 'X-Requested-With: XMLHttpRequest' \
  -H 'Accept-Language: en-US,en;q=0.8' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36' \
  -H 'Accept: */*' \
  -H 'Cache-Control: no-cache' \
  -H 'If-None-Match: "7b341d7181cbb9b72f483ae28e464dd7"' \
  -H 'Cookie: s_fid=79D97B8B22CA721F-2DD12ACE392FF3B2; s_cc=true' \
  -H 'Connection: keep-alive' \
  -H 'If-Modified-Since: Wed, 10 Feb 2016 16:40:15 GMT' \
  -H 'Referer: http://graphics.latimes.com/election-2016-new-hampshire-results/' \
  --compressed

While that’s easier than manual copy/paste transcription, these requests are uniform enough that there Has To Be A Better Way. And, now there is, with curlconverter.

The curlconverter package has (for the moment) two main functions:

  • straighten() : which returns a list with all of the necessary parts to craft an httr POST or GET call
  • make_req() : which actually returns a working httr call, pre-filled with all of the necessary information.

By default, either function reads from the clipboard (envision the workflow where you do the “Copy as cURL” then switch to R and type make_req() or req_params <- straighten()), but they can take in a vector of cURL command lines, too (NOTE: make_req() is currently limited to one while straighten() can handle as many as you want).

Let’s show what happens using election results cURL command line:

REP <- "curl 'http://graphics.latimes.com/election-2016-31146-feed.json' -H 'Pragma: no-cache' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'X-Requested-With: XMLHttpRequest' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36' -H 'Accept: */*' -H 'Cache-Control: no-cache'  -H 'Cookie: s_fid=79D97B8B22CA721F-2DD12ACE392FF3B2; s_cc=true' -H 'Connection: keep-alive' -H 'If-Modified-Since: Wed, 10 Feb 2016 16:40:15 GMT' -H 'Referer: http://graphics.latimes.com/election-2016-new-hampshire-results/' --compressed"
 
resp <- curlconverter::straighten(REP)
jsonlite::toJSON(resp, pretty=TRUE)
 
    ## [
    ##   {
    ##     "url": ["http://graphics.latimes.com/election-2016-31146-feed.json"],
    ##     "method": ["get"],
    ##     "headers": {
    ##       "Pragma": ["no-cache"],
    ##       "DNT": ["1"],
    ##       "Accept-Encoding": ["gzip, deflate, sdch"],
    ##       "X-Requested-With": ["XMLHttpRequest"],
    ##       "Accept-Language": ["en-US,en;q=0.8"],
    ##       "User-Agent": ["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36"],
    ##       "Accept": ["*/*"],
    ##       "Cache-Control": ["no-cache"],
    ##       "Connection": ["keep-alive"],
    ##       "If-Modified-Since": ["Wed, 10 Feb 2016 16:40:15 GMT"],
    ##       "Referer": ["http://graphics.latimes.com/election-2016-new-hampshire-results/"]
    ##     },
    ##     "cookies": {
    ##       "s_fid": ["79D97B8B22CA721F-2DD12ACE392FF3B2"],
    ##       "s_cc": ["true"]
    ##     },
    ##     "url_parts": {
    ##       "scheme": ["http"],
    ##       "hostname": ["graphics.latimes.com"],
    ##       "port": {},
    ##       "path": ["election-2016-31146-feed.json"],
    ##       "query": {},
    ##       "params": {},
    ##       "fragment": {},
    ##       "username": {},
    ##       "password": {}
    ##     }
    ##   }
    ## ]

You can then use the items in the returned list to make a GET request manually (but still somewhat tediously).
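
For instance, here is a rough sketch of what that manual assembly might look like (this is not part of curlconverter itself; it just reuses the resp object from the straighten() call above, whose field names you can see in the JSON dump):

library(httr)
library(jsonlite)

req <- resp[[1]]   # straighten() returns one element per cURL command line

manual_resp <- GET(
  url = req$url,
  do.call(add_headers, req$headers),                   # re-attach every header
  do.call(set_cookies, as.list(unlist(req$cookies)))   # re-attach the cookies
)

# the response body is JSON, so parse it into R structures
results <- fromJSON(content(manual_resp, as = "text"))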

curlconverter‘s make_req() will try to do this conversion for you automagically using httr‘s little used VERB() function. It’s easier to show than to tell:

curlconverter::make_req(REP)
VERB(verb = "GET", url = "http://graphics.latimes.com/election-2016-31146-feed.json", 
     add_headers(Pragma = "no-cache", 
                 DNT = "1", `Accept-Encoding` = "gzip, deflate, sdch", 
                 `X-Requested-With` = "XMLHttpRequest", 
                 `Accept-Language` = "en-US,en;q=0.8", 
                 `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36", 
                 Accept = "*/*", 
                 `Cache-Control` = "no-cache", 
                 Connection = "keep-alive", 
                 `If-Modified-Since` = "Wed, 10 Feb 2016 16:40:15 GMT", 
                 Referer = "http://graphics.latimes.com/election-2016-new-hampshire-results/"))

You probably don’t need all those headers, but you just need to delete what you don’t need vs trial-and-error build by hand. Try assigning the output of that function to a variable and inspecting what’s returned. I think you’ll find this is a big enhancement to your workflows (if you do alot of this “scraping without scraping”).

You can find the package on GitHub. It’s built with V8 and uses a modified version of the curlconverter Node module by Nick Carneiro.

It’s still in beta and could use some tyre kicking. Convos in the comments, issues or feature requests in GH (pls).

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Target Store Locations with rvest and ggmap


(This article was first published on R – Luke Stanke, and kindly contributed to R-bloggers)

I just finished developing a presentation for Target Analytics Network showcasing geospatial and mapping tools in R. I decided to use Target store locations as part of a case study in the presentation. The problem: I didn’t have any store location data, so I needed to get it from somewhere off the web. Since there are some great tools in R to get this information, mainly rvest for scraping and ggmap for geocoding, it wasn’t a problem. Instead of just doing the work, I thought I should share what this process looks like:

First, we can go to the Target website and find stores broken down by state.

Screen Shot 2016-02-15 at 4.14.41 PM

After finding this information, we can use the rvest package to scrape it. The URL is so nicely formatted that you can easily grab any state if you know the state’s mailing code, so we first set the state (Minnesota’s mailing code is MN) and then build the URL.

# Set the state.
state <- 'MN'

# Set the URL to borrow the data.
TargetURL <- paste0('http://www.target.com/store-locator/state-result?stateCode=', state)

Now that we have the URL, let’s grab the html from the webpage.

# Download the webpage.
TargetWebpage <-
  TargetURL %>%
  xml2::read_html()

Now we have to find the location of the table in the html code.

Screen Shot 2016-02-15 at 4.15.46 PM

Once we have found the html table, there are a number of ways we could extract the data from this location. I like to copy the XPath location. It’s a bit lazy, but for the purpose of this exercise it makes life easy.

Once we have the XPath location, it’s easy to extract the table from Target’s webpage. First we pipe the html through the html_nodes function, which isolates the html responsible for creating the store locations table. After that we can use html_table to parse the html table into an R list. We then use the data.frame function to turn the list into a data frame, and the select function from the dplyr library to select specific variables. One quirk of the extracted data is that the city, state, and zip code sit in a single column. It’s not really a problem for this exercise, but it’s maybe the perfectionist in me. Let’s use the separate function from the tidyr library to give city, state, and zipcode their own columns.

# Get all of the store locations.
TargetStores <-
  TargetWebpage %>%
  rvest::html_nodes(xpath = '//*[@id="stateresultstable"]/table') %>%
  rvest::html_table() %>%
  data.frame() %>%
  dplyr::select(`Store Name` = Store.Name, Address, `City/State/ZIP` = City.State.ZIP) %>%
  tidyr::separate(`City/State/ZIP`, into = c('City', 'Zipcode'), sep = paste0(', ', state)) %>%
  dplyr::mutate(State = state) %>%
  dplyr::as_data_frame()

Let’s get the coordinates for these stores; we can pass each store’s address through the geocode function which obtains the information from the Google Maps API — you can only geocode up to 2500 locations per day for free using the Google API.

# Geocode each store
# (the %<>% compound-assignment pipe comes from the magrittr package)
TargetStores %<>%
  dplyr::bind_cols(
    ggmap::geocode(
      paste0(
        TargetStores$`Store Name`, ', ',
        TargetStores$Address, ', ',
        TargetStores$City, ', ',
        TargetStores$State, ', ',
        TargetStores$Zipcode
      ),
      output = 'latlon',
      source = 'google'
    )
  )

Now that we have the data, let’s plot. In order to plot this data, we need to put it in a spatial data frame — we can do this using the SpatialPointsDataFrame and CRS functions from the sp package. We need to specify the coordinates, the underlying data, and the projections

# Make a spatial data frame
TargetStores <-
  sp::SpatialPointsDataFrame(
    coords = TargetStores %>% dplyr::select(lon, lat) %>% data.frame,
    data = TargetStores %>% data.frame,
    proj4string = sp::CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")
  )

Now that we have a spatial data frame, we can plot these points. I’m going to plot some other spatial data frames as well to add context for the Target store point data.

# Plot Target in Minnesota
plot(mnCounties, col = '#EAF6AE', lwd = .4, border = '#BEBF92', bg = '#F5FBDA')
plot(mnRoads, col = 'darkorange', lwd = .5, add = TRUE)
plot(mnRoads2, col = 'darkorange', lwd = .15, add = TRUE)
plot(mnRivers, lwd = .6, add = TRUE, col = '#13BACC')
plot(mnLakes, border = '#13BACC', lwd = .2, col = '#EAF6F9', add = TRUE)
plot(TargetStores, add = TRUE, col = scales::alpha('#E51836', .8), pch = 20, cex = .6)

Target Locations in Minnesota

Yes! We’ve done it. We’ve plotted Target stores in Minnesota. That’s cool and all, but really we haven’t done much with the data we just obtained. Stay tuned for the next post to see what else we can do with this data.

UPDATE: David Radcliffe of the Twin Cities R User group presented something similar using Walmart stores.

To leave a comment for the author, please follow the link and comment on their blog: R – Luke Stanke.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

The Gender of Big Data


(This article was first published on Ripples, and kindly contributed to R-bloggers)

When I grow up I want to be a dancer (Carmen, my beautiful daughter)

The presence of women in positions of responsibility inside Big Data companies is quite far from parity: while approximately 50% of the world population are women, only 7% of the CEOs of the Top 100 Big Data Companies are. Like it or not, technology seems to be a guy thing.

Big_Data_Gender
To do this experiment, I did some web scraping to download the list of big data companies from here. I also used a very interesting package called genderizeR, which makes gender predictions based on first names (more info here).

Here you have the code:

library(rvest)
library(stringr)
library(dplyr)
library(genderizeR)
library(ggplot2)
library(googleVis)
paste0("http://www.crn.com/slide-shows/data-center/300076704/2015-big-data-100-business-analytics.htm/pgno/0/", 1:45) %>%
c(., paste0("http://www.crn.com/slide-shows/data-center/300076709/2015-big-data-100-data-management.htm/pgno/0/",1:30)) %>%
c(., paste0("http://www.crn.com/slide-shows/data-center/300076740/2015-big-data-100-infrastructure-tools-and-services.htm/pgno/0/",1:25)) -> webpages
results=data.frame()
for(x in webpages)
{
read_html(x) %>% html_nodes("p:nth-child(1)") %>% .[[2]] %>% html_text() -> Company
read_html(x) %>% html_nodes("p:nth-child(2)") %>% .[[1]] %>% html_text() -> Executive
results=rbind(results, data.frame(Company, Executive))
}
results=data.frame(lapply(results, as.character), stringsAsFactors=FALSE)
results[74,]=c("Trifacta", "Top Executive: CEO Adam Wilson")
results %>% mutate(Name=gsub("Top|\\bExec\\S*|\\bCEO\\S*|President|Founder|and|Co-Founder|\\:", "", Executive)) %>%
mutate(Name=word(str_trim(Name))) -> results
results %>%
select(Name) %>%
findGivenNames() %>%
filter(probability > 0.9 & count > 15) %>%
as.data.frame() -> data
data %>% group_by(gender) %>% summarize(Total=n()) -> dat
doughnut=gvisPieChart(dat,
options=list(
width=450,
height=450,
legend="{ position: 'bottom', textStyle: {fontSize: 10}}",
chartArea="{left:25,top:50}",
title='TOP 100 BIG DATA COMPANIES 2015
Gender of CEOs',
colors="['red','blue']",
pieHole=0.5),
chartid="doughnut")
plot(doughnut)

To leave a comment for the author, please follow the link and comment on their blog: Ripples.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Web scraping with R & novel classification algorithms on unbalanced data


(This article was first published on BNOSAC - Belgium Network of Open Source Analytical Consultants, and kindly contributed to R-bloggers)

Tomorrow, the next RBelgium meeting will be held at the bnosac offices. This is the schedule.

Interested? Feel free to join the event. More info: http://www.meetup.com/RBelgium/events/228427510/

• 18h00-18h30: enter & meet other R users

• 18h30-19h00: Web scraping with R: live scraping products & prices of www.delhaize.be

• 19h15-20h00: State-of-the-art classification algorithms with unbalanced data. Package unbalanced: Racing for Unbalanced Methods Selection.

 

 

To leave a comment for the author, please follow the link and comment on their blog: BNOSAC - Belgium Network of Open Source Analytical Consultants.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Web scraping with R


(This article was first published on BNOSAC - Belgium Network of Open Source Analytical Consultants, and kindly contributed to R-bloggers)

For those of you who are interested in web scraping with R: enjoy the slides of our presentation on this topic from the last RBelgium meetup. The talk is about using rvest, RSelenium and our own package scrapeit.core, which makes deploying, logging and replaying your scrapes easier.

The slides below are in Flash, so make sure you don’t use an ad blocker, otherwise you won’t be able to view them.


 

If you are interested in scraping content from websites and feeding it into your analytical systems, let us know at bnosac.be/index.php/contact/mail-us so that we can set up a quick proof of concept to get your analytics rolling.

To leave a comment for the author, please follow the link and comment on their blog: BNOSAC - Belgium Network of Open Source Analytical Consultants.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Developing an R Tutorial shiny dashboard app


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Through this post, I would like to describe an R Tutorial Shiny app that I recently developed. You can access the app here. (Please open the app in Chrome, as some of the features may not work in IE. The app also includes a “ReadMe” introduction which provides a quick overview of how to use the app.)

The app provides a set of the most commonly performed data manipulation tasks and use cases, along with the R code/syntax for each use case, in a structured and easily navigable format. For people just starting off with R, this will hopefully be a useful tool to quickly figure out the code and syntax for their routine data analysis tasks.

In this post, I will not be getting into the basics of how to develop a Shiny app, since that is fairly well documented elsewhere, including in an article on DataScience+. What I will do, however, is focus on how this tutorial app was developed, specifically emphasizing some of its key components and features.

High Level App Overview

The basic structure of the app is fairly straightforward. For each topic a separate dataframe is created (I initially wrote the content in an Excel file, which I then imported as a dataframe in R). The individual topic datasets are then included in a list object.

The relevant R code for this is:

tutorial_set <- list(basic_operations = basic_operations,dplyr_tutorial = dplyr_tutorial,
                       loops_tutorial = loops_tutorial,model_tutorial = model_tutorial)

Depending on the user selection on the “Choose Topic” dropdown box, the relevant dataframe is extracted from the list object, with the following code:

selected_topic <- tutorial_set[[input$topic_select]]
# where topic_select is the inputId of the dropdown box with values similar to topic dataset names

Each dataset incorporates the relevant set of tasks, associated use cases and the underlying code and comments.

Enhance the app’s visual appeal using shinydashboard package

If you like Shiny, you will love it even more once you start using the shinydashboard package. This package, which is built on top of Shiny, can help you design visually stunning apps and dashboards. The tutorial app was not really meant to be a visual dashboard; the emphasis was on functionality, so I haven’t explored all the various themes, layouts, widgets etc. that this package provides. However, you can read an excellent overview of this package in RStudio’s posts here. Also, if you are familiar with Shiny, picking up shinydashboard should be a cakewalk. All that is required are some minor modifications to the UI side of your code (a minimal skeleton is sketched below) and then you are ready to go!
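
As a rough illustration, a generic shinydashboard skeleton might look like the following (this is not the tutorial app’s actual UI code; the input IDs echo the ones mentioned in this post, but the topic choices are placeholders):

library(shiny)
library(shinydashboard)

ui <- dashboardPage(
  dashboardHeader(title = "R Tutorial"),
  dashboardSidebar(
    # dropdown to pick a topic (choices here are placeholders)
    selectInput("topic_select", "Choose Topic",
                choices = c("basic_operations", "dplyr_tutorial"))
  ),
  dashboardBody(
    # datatable of use cases for the selected topic
    box(title = "Use Cases", width = 12,
        DT::dataTableOutput("use_cases"))
  )
)

server <- function(input, output) {
  # server logic (rendering the use-case table, etc.) goes here
}

shinyApp(ui, server)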

Interactive and dynamic datatables using the DT package

Rendering a dataset as an output is a fairly standard requirement while building an app. I had been using the default Shiny functions to render a datatable, till I came across the DT package. The package is basically an R interface to the JavaScript DataTables library, and you can read more about it here. Rendering datatables using the DT package can help provide a whole new level of interactivity to your app.

Datatables rendered using this package not only look better but also provide ways to capture user actions as they interact with the table. You can then program specific tasks to be triggered based on these user actions.

There is a whole range of different functionality that DT offers, but for the purpose of the tutorial app, I only needed to know which row the user clicks on (in the “use case” datatable). This can be done quite easily using the following code:

selected_index <- input$use_cases_row_last_clicked
# where use_cases is the inputId of the use case datatable and row_last_clicked returns the index of the selected row

Executing the underlying code and displaying the output

Once the index of the dataset row that the user clicks on is captured, we need to extract the relevant code from the tutorial dataset (which is fairly straightforward) and then execute that code (which requires a few, relatively less often used functions). The code for this section is given below:

#code_out is the output_id of the "Code Output" box on the app
output$code_out <- DT::renderDataTable({
     
    # Extract the formula from the dataset of the selected topic; the column is titled "formula"
    form <- as.character(selected_topic[selected_index, "formula"])
    
    # parse() parses the formula string as an expression, which is then evaluated using eval()
    # by default, parse expects the input to be a file, so text = form specifies that the input is a character string
    code_output <- eval(parse(text = form))

     
   # code output is then displayed as a datatable
   # The prefix "DT::" specifies that the datatable function from the DT package should be used rather than the deprecated Shiny version
   # selection = "single" specifies that the user can select only 1 row at a time
   # scrollX = TRUE displays a horizontal scroll bar and rownames = FALSE hides the rownames from the displayed output
   DT::datatable(data = code_output, selection = 'single', rownames = FALSE,
                      options = list(scrollX = TRUE))

      })

Rendering output using R Markdown

Once the index of the user-selected row is extracted and the underlying code is executed, the final step is to render the code and comments as HTML output (the output is displayed in the “Code and Comments” box of the app).
We use the R Markdown package to render the “code & comments” output in HTML format. (Getting up to speed on R Markdown, if you are not familiar with it, should not be a challenge at all. You can read about R Markdown here, and you can also refer to the cheat sheet, which provides a quick overview of the various formatting tags that you may need.)

For the purpose of the tutorial app, this is what was needed: generate an R Markdown document on the fly (depending on the user selection) and render the output as HTML, which can then be displayed in the app.

I had come across this app some time back, where the author had attempted something very similar, which I suitably modified for my tutorial app. Given below is the code for this section:

output$comment <- reactive({

        # Initialize a temp file
        t <- tempfile()
        selected_index <- as.numeric(input$use_cases_row_last_clicked)
 
        #Extract the comment from the selected topic dataset
        comment <- as.character(selected_topic[selected_index,"comment"])

        #cat command concatenates the output and prints the output to the temp file created
        cat(comment, file = t)
       
        # Use the R Markdown package's render function to render the file as an html document
        t <- render(
          input         = t,
          output_format = 'html_document')
 
        ## read the results, delete the temp file and return the output
        comment_html <- readLines(t)
        unlink(sub('.html$', '*', t))
        paste(comment_html, collapse = '\n')

      })

Hope you found the post useful. If you have any queries or questions, please feel free to comment below.

    Related Post

    1. Strategies to Speedup R Code
    2. Sentiment analysis with machine learning in R
    3. Sentiment Analysis on Donald Trump using R and Tableau
    4. Google scholar scraping with rvest package
    5. PubMed search Shiny App using RISmed

    To leave a comment for the author, please follow the link and comment on their blog: DataScience+.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Who Has the Best Fantasy Football Projections? 2016 Update


    (This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

    In prior posts, we demonstrated how to download projections from numerous sources, calculate custom projections for your league, and compare the accuracy of different sources of projections (2013, 2014, 2015).  In the latest version of our annual series, we hold the forecasters accountable and see who had the most and least accurate fantasy football projections over the last 4 years.

    The R Script

    You can download the R script for comparing the projections from different sources here.  You can download the historical projections and performance using our Projections tool.

    To compare the accuracy of the projections, we use the following metrics: R-squared (R2) and mean absolute scaled error (MASE).

    For a discussion of these metrics, see here and here.
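
    As a rough sketch of how these two metrics can be computed (illustrative only; in particular, the benchmark used to scale MASE here is an assumption for the sake of the example, not necessarily the one used in these analyses):

    # R-squared: squared correlation between projected and actual points
    r_squared <- function(actual, projected) {
      cor(actual, projected, use = "complete.obs")^2
    }

    # MASE: mean absolute error of the projection, scaled by the mean absolute
    # error of a naive benchmark projection (e.g., hypothetically, last season's points)
    mase <- function(actual, projected, benchmark) {
      mean(abs(actual - projected), na.rm = TRUE) /
        mean(abs(actual - benchmark), na.rm = TRUE)
    }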

    Whose Predictions Were the Best?

    The results are in the table below.  We compared the accuracy for projections of the following positions: QB, RB, WR, and TE.  The rows represent the different sources of predictions (e.g., ESPN, CBS) and the columns represent the different measures of accuracy for the last four years and the average across years.  The source with the best measure for each metric is in blue.
    Source: 2012 (R2, MASE), 2013 (R2, MASE), 2014 (R2, MASE), 2015 (R2, MASE), Average (R2, MASE). Sources that were not scored in every year show values only for the years available.
    Fantasy Football Analytics: Average .670 .545 .567 .635 .618 .577 .626 .553 .620 .578
    Fantasy Football Analytics: Robust Average .667 .549 .561 .636 .613 .581 .628 .554 .617 .580
    Fantasy Football Analytics: Weighted Average .626 .553
    CBS Average .637 .604 .479 .722 .575 .632 .500 .664 .548 .656
    EDS Football .554 .651 .584 .624 .569 .638
    ESPN .576 .669 .500 .705 .498 .723 .615 .585 .547 .671
    FantasySharks .529 .673
    FFtoday .661 .551 .550 .646 .530 .659 .546 .626 .572 .621
    FOX Sports .459 .720 .550 .677 .505 .699
    NFL.com .551 .650 .505 .709 .518 .692 .582 .632 .539 .671
    numberFire .486 .712 .560 .643 .523 .678
    RTSports .547 .670
    WalterFootball .472 .713 .431 .724 .452 .719
    Yahoo .547 .645 .635 .554 .591 .600
    Here is how the projections ranked over the last four years (based on MASE):
    1. Fantasy Football Analytics: Average (or Weighted Average)
    2. Fantasy Football Analytics: Robust Average
    3. Yahoo
    4. FFtoday
    5. EDS Football
    6. CBS Average
    7. RTSports
    8. ESPN
    9. NFL.com
    10. FantasySharks
    11. numberFire
    12. FOX Sports
    13. WalterFootball

    Notes: CBS estimates were averaged across Jamey Eisenberg and Dave Richard.  FantasyFootballNerd projections were not included because the full projections are subscription only.  We did not calculate the weighted average prior to 2015.  The accuracy estimates may differ slightly from those provided in prior years because a) we now use standard league scoring settings (you can see the league scoring settings we used here) and b) we are only examining the following positions: QB, RB, WR, and TE. The weights for the weighted average were based on historical accuracy (1-MASE).  For the analysts not included in the accuracy calculations, we calculated the average (1-MASE) value and subtracted 1/2 the standard deviation of (1-MASE).  The weights in the weighted average for 2015 were:

    CBS Average: .428
    EDS Football: .428
    ESPN: .383
    FantasyFootballNerd: .428
    FFToday: .482
    FOX Sports: .428
    NFL.com: .384
    numberFire: .404
    RTSports.com: .428
    WalterFootball: .428
    Yahoo Sports: .433

    Here is a scatterplot of our average projections in relation to players’ actual fantasy points scored in 2015:

    Accuracy 2015

     

    Interesting Observations

    1. Projections that combined multiple sources of projections (FFA Average, Weighted Average, Robust Average) were more accurate than all single sources of projections (e.g., CBS, NFL.com, ESPN) every year.  This is consistent with the wisdom of the crowd.
    2. The simple average (mean) was more accurate than the robust average.  The robust average gives extreme values less weight in the calculation of the average.  This suggests that outliers may reflect meaningful sources of variance (i.e., they may help capture a player’s ceiling/floor) and may not be bad projections (i.e., error/noise).
    3. The weighted average was about as accurate as the simple average.  Weights were based on historical accuracy.  If the best analysts are consistently more accurate than other analysts, the weighted average will likely outperform the mean.  If, on the other hand, analysts don’t reliably outperform each other, the mean might be more accurate.  (A toy illustration of these three kinds of averages appears after this list.)
    4. The FFA Average explained 57–67% of the variation in players’ actual performance.  That means that the projections are somewhat accurate but have much room for improvement in terms of prediction accuracy.  1/3 to 1/2 of the variance in actual points is unexplained by projections.  Nevertheless, the projections are likely more accurate than pre-season rankings.
    5. The R-squared of the FFA average projection was .67 in 2012, .57 in 2013, .62 in 2014, and .63 in 2015.  This suggests that players are more predictable in some years than others.
    6. There was little consistency in performance across time among sites that used single projections (CBS, NFL.com, ESPN). In 2012, CBS was the most accurate single source of projection but they were the least accurate in 2013.  Moreover, ESPN was among the least accurate in 2014, but they were among the most accurate in 2015.  This suggests that no single source reliably outperforms the others.  While some sites may do better than others in any given year (because of fairly random variability–i.e., chance), it is unlikely that they will continue to outperform the other sites.
    7. Projections were more accurate for some positions than others.  Projections were much more accurate for QBs and WRs than for RBs.  Projections were the least accurate for Ks, DBs, and DSTs.  For more info, see here.  Here is how positions ranked in accuracy of their projections (from most to least accurate):
      1. QB: R2 = .71
      2. WR: R2 = .57
      3. LB: R2 = .56
      4. TE: R2 = .54
      5. DL: R2 = .48
      6. RB: R2 = .47
      7. K: R2 = .38
      8. DB: R2 = .32
      9. DST: R2 = .15
    8. Projections over-estimated players’ performance by about 4–10 points every year across most positions (based on mean error).  It will be interesting to see if this pattern holds in future seasons.  If it does, we could account for this over-expectation in players’ projections.  In a future post, I hope to explore the types of players for whom this over-expectation occurs.
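
    As a toy illustration of the three kinds of averages discussed in points 2 and 3 above (the projections below are made up; the weights are the 2015 weights listed in the notes):

    # hypothetical projections for one player from four sources
    proj    <- c(CBS = 250, ESPN = 230, NFL.com = 215, Yahoo = 240)
    weights <- c(CBS = .428, ESPN = .383, NFL.com = .384, Yahoo = .433)

    mean(proj)                        # simple average
    mean(proj, trim = 0.25)           # one way to build a "robust" average (a trimmed mean)
    weighted.mean(proj, w = weights)  # weighted average, weights based on 1 - MASE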

    Conclusion

    Fantasy Football Analytics had the most accurate projections over the last four years.  Why?  We average across sources.  Combining sources of projections removes some of their individual judgment biases (error) and gives us a more accurate fantasy projection.  No single source (CBS, NFL.com, ESPN) reliably outperformed the others or the crowd, suggesting that differences between them are likely due in large part to chance.  In sum, crowd projections are more accurate than individuals’ judgments for fantasy football projections.  People often like to “go with their gut” when picking players.  That’s fine—fantasy football is a game.  Do what is fun for you.  But, crowd projections are the most reliably accurate of any source.  Do with that what you will!  But don’t take my word for it.  Examine the accuracy yourself with our Projections tool and see what you find.  And let us know if you find something interesting!

    The post Who Has the Best Fantasy Football Projections? 2016 Update appeared first on Fantasy Football Analytics.

    To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Performing SQL selects on R data frames


    (This article was first published on DataScience+, and kindly contributed to R-bloggers)

    For anyone who has a SQL background and wants to learn R, the sqldf package is very useful because it enables us to use SQL commands in R. Anyone with basic SQL skills can manipulate R data frames using those skills. You can read more about the sqldf package on CRAN.

    In this post, we will see how to perform joins and other queries to retrieve, sort and filter data in R using SQL. We will also see how to achieve the same tasks using non-SQL R commands. Currently, I am working with the FDA adverse events data. These datasets are ideal for a data scientist, or any data enthusiast, to work with because they contain almost every kind of data messiness: missing values, duplicates, compatibility problems between datasets created in different time periods, variable names and numbers of variables changing over time (e.g., sex in one dataset and gender in another, if_cod in one dataset and if_code in another), errors, etc.

    In this post, we will use FDA adverse events data, which are publicly available. The FDA adverse events datasets can also be downloaded in .csv format from the National Bureau of Economic Research. Since it is easier to automate the data download in R from the National Bureau of Economic Research, we will download various datasets from this website. I encourage you to try the R code, download data from the website, and explore the adverse events datasets.

    The adverse events datasets are created at a quarterly temporal resolution, and each quarter’s data includes demography information, drug/biologic information, adverse events, outcomes, diagnoses, etc.

    Let’s download some datasets and use SQL queries to join, sort and filter the various data frames.

    Load R Packages

    require(downloader)
    library(dplyr)
    library(sqldf)
    library(data.table)
    library(ggplot2)
    library(compare)
    library(plotrix)

    Basic error handling with tryCatch()

    We will use the function below to download the data. Since the datasets have a quarterly time resolution, there are four datasets per year for each data type, so the function automates the download. If a quarterly dataset is not yet available on the website (not yet released), the function prints an error message saying that the dataset was not found. We download the zipped data and unzip it.

    try.error = function(url)
    {
      try_error = tryCatch(download(url,dest="data.zip"), error=function(e) e)
      if (!inherits(try_error, "error")){
          download(url,dest="data.zip")
            unzip ("data.zip")
          }
        else if (inherits(try_error, "error")){
        cat(url,"not found\n")
          }
          }

    Download adverse events data

    The FDA adverse events data are available since 2004. In this post, let’s download data since 2013. We will check if there are data up to the present time and download what we can get.

    • Sys.time() gives the current date and time
    • the function year() from the data.table package gets the year from the current date and time data that we obtain from the Sys.time() function

    We are downloading demography, drug, diagnosis/indication, outcome and reaction (adverse event) data.

    year_start=2013
    year_last=year(Sys.time())
    for (i in year_start:year_last){
                j=c(1:4)
                for (m in j){
                url1<-paste0("http://www.nber.org/fda/faers/",i,"/demo",i,"q",m,".csv.zip")
                url2<-paste0("http://www.nber.org/fda/faers/",i,"/drug",i,"q",m,".csv.zip")
                url3<-paste0("http://www.nber.org/fda/faers/",i,"/reac",i,"q",m,".csv.zip")
                url4<-paste0("http://www.nber.org/fda/faers/",i,"/outc",i,"q",m,".csv.zip")
                url5<-paste0("http://www.nber.org/fda/faers/",i,"/indi",i,"q",m,".csv.zip")
               try.error(url1)
               try.error(url2)
               try.error(url3)
               try.error(url4)
               try.error(url5)     
                }
            }
    
    http://www.nber.org/fda/faers/2015/demo2015q4.csv.zip not found
    ...
    http://www.nber.org/fda/faers/2016/indi2016q4.csv.zip not found

    From the error messages shown above, at the time of writing this post, we see that the database has adverse events data up to the third quarter of 2015.

    • list.files() produces a character vector of the names of the files in my working directory.
    • I am using regular expressions to select each category of dataset. For example, ^demo.*.csv matches all files that start with demo and end with .csv.
    filenames <- list.files(pattern="^demo.*.csv", full.names=TRUE)
    cat('We have downloaded the following quarterly demography datasets')
    filenames

    We have downloaded the following quarterly demography datasets

    "./demo2012q1.csv" "./demo2012q2.csv" "./demo2012q3.csv" "./demo2012q4.csv" "./demo2013q1.csv" "./demo2013q2.csv" "./demo2013q3.csv" "./demo2013q4.csv" "./demo2014q1.csv" "./demo2014q2.csv" "./demo2014q3.csv" "./demo2014q4.csv" "./demo2015q1.csv" "./demo2015q2.csv" "./demo2015q3.csv" 

    Now let’s use fread() function from the data.table Package to read in the datasets. Let’s start with the demography data:

    demo=lapply(filenames,fread)

    Now, let’s change them to data frames and concatenate them to create one demography data

    demo_all=do.call(rbind,lapply(1:length(demo),function(i) select(as.data.frame(demo[i]),primaryid,caseid, age,age_cod,event_dt,sex,reporter_country)))
    dim(demo_all)
            3554979   7 

    We see that our demography data has more than 3.5 million rows.

    Now, lets’ merge all drug files

    filenames <- list.files(pattern="^drug.*.csv", full.names=TRUE)
    cat('We have downloaded the following quarterly drug datasets:\n')
    filenames
    drug=lapply(filenames,fread)
    cat('\n')
    cat('Variable names:\n')
    names(drug[[1]])
    drug_all=do.call(rbind,lapply(1:length(drug), function(i) select(as.data.frame(drug[i]),primaryid,caseid, drug_seq,drugname,route)))

    We have downloaded the following quarterly drug datasets:

    "./drug2012q1.csv" "./drug2012q2.csv" "./drug2012q3.csv" "./drug2012q4.csv" "./drug2013q1.csv" "./drug2013q2.csv" "./drug2013q3.csv" "./drug2013q4.csv" "./drug2014q1.csv" "./drug2014q2.csv" "./drug2014q3.csv" "./drug2014q4.csv" "./drug2015q1.csv" "./drug2015q2.csv" "./drug2015q3.csv"

    Variable names:

    "primaryid" "drug_seq" "role_cod" "drugname" "val_vbm" "route" "dose_vbm" "dechal" "rechal" "lot_num" "exp_dt" "exp_dt_num" "nda_num" 

    Merge all diagnoses/indications datasets

    filenames <- list.files(pattern="^indi.*.csv", full.names=TRUE)
    cat('We have downloaded the following quarterly diagnoses/indications datasets:\n')
    
    filenames
    
    indi=lapply(filenames,fread)
    
    cat('\n')
    cat('Variable names:\n')
    
    names(indi[[15]])
    
    indi_all=do.call(rbind,lapply(1:length(indi), function(i) select(as.data.frame(indi[i]),primaryid,caseid, indi_drug_seq,indi_pt)))
    

    We have downloaded the following quarterly diagnoses/indications datasets:

    "./indi2012q1.csv" "./indi2012q2.csv" "./indi2012q3.csv" "./indi2012q4.csv" "./indi2013q1.csv" "./indi2013q2.csv" "./indi2013q3.csv" "./indi2013q4.csv" "./indi2014q1.csv" "./indi2014q2.csv" "./indi2014q3.csv" "./indi2014q4.csv" "./indi2015q1.csv" "./indi2015q2.csv" "./indi2015q3.csv"

    Variable names:

    "primaryid" "caseid" "indi_drug_seq" "indi_pt" 

    Merge patient outcomes

    filenames <- list.files(pattern="^outc.*.csv", full.names=TRUE)
    cat('We have downloaded the following quarterly patient outcome datasets:\n')
    
    filenames
    outc_all=lapply(filenames,fread)
    
    cat('\n')
    cat('Variable names\n')
    
    names(outc_all[[1]])
    names(outc_all[[4]])
    colnames(outc_all[[4]])=c("primaryid", "caseid", "outc_cod")
    outc_all=do.call(rbind,lapply(1:length(outc_all), function(i) select(as.data.frame(outc_all[i]),primaryid,outc_cod)))

    We have downloaded the following quarterly patient outcome datasets:

    "./outc2012q1.csv" "./outc2012q2.csv" "./outc2012q3.csv" "./outc2012q4.csv" "./outc2013q1.csv" "./outc2013q2.csv" "./outc2013q3.csv" "./outc2013q4.csv" "./outc2014q1.csv" "./outc2014q2.csv" "./outc2014q3.csv" "./outc2014q4.csv" "./outc2015q1.csv" "./outc2015q2.csv" "./outc2015q3.csv" 

    Variable names

        "primaryid" "outc_cod" 
        "primaryid" "caseid" "outc_code" 

    Finally, merge reaction (adverse event) datasets

    filenames <- list.files(pattern="^reac.*.csv", full.names=TRUE)
    cat('We have downloaded the following quarterly reaction (adverse event) datasets:\n')
    
    filenames
    reac=lapply(filenames,fread)
    
    cat('\n')
    cat('Variable names:\n')
    names(reac[[3]])
    
    reac_all=do.call(rbind,lapply(1:length(reac), function(i) select(as.data.frame(reac[i]),primaryid,pt)))
    

    We have downloaded the following quarterly reaction (adverse event) datasets:

    "./reac2012q1.csv" "./reac2012q2.csv" "./reac2012q3.csv" "./reac2012q4.csv" "./reac2013q1.csv" "./reac2013q2.csv" "./reac2013q3.csv" "./reac2013q4.csv" "./reac2014q1.csv" "./reac2014q2.csv" "./reac2014q3.csv" "./reac2014q4.csv" "./reac2015q1.csv" "./reac2015q2.csv" "./reac2015q3.csv"

    Variable names:

      "primaryid" "pt" 

    Let’s see number of rows from each data type

    all=as.data.frame(list(Demography=nrow(demo_all),Drug=nrow(drug_all),
                       Indications=nrow(indi_all),Outcomes=nrow(outc_all),
                       Reactions=nrow(reac_all)))
    row.names(all)='Number of rows'
    all

    table1

    SQL queries

    Keep in mind that sqldf uses SQLite.

    COUNT

    #SQL
    sqldf("SELECT COUNT(primaryid)as 'Number of rows of Demography data'
    FROM demo_all;")

    table2

    # R
    nrow(demo_all)
    3554979 

    LIMIT

    #  SQL
    sqldf("SELECT *
    FROM demo_all 
    LIMIT 6;")

    table3

    #R
    head(demo_all,6)

    table4

    R1=head(demo_all,6)
    SQL1 =sqldf("SELECT *
    FROM demo_all 
    LIMIT 6;")
    all.equal(R1,SQL1)
    TRUE

    WHERE

    SQL2=sqldf("SELECT *
    FROM demo_all WHERE sex ='F';")
    R2 = filter(demo_all, sex=="F")
    identical(SQL2, R2)
    TRUE
    SQL3=sqldf("SELECT *
    FROM demo_all WHERE age BETWEEN 20 AND 25;")
    R3 = filter(demo_all, age >= 20 & age <= 25)
    identical(SQL3, R3)
    TRUE

    GROUP BY and ORDER BY

    #SQL
    sqldf("SELECT sex, COUNT(primaryid) as Total
    FROM demo_all
    WHERE sex IN ('F','M','NS','UNK')
    GROUP BY sex
    ORDER BY Total DESC ;")

    table53

    # R
    demo_all%>%filter(sex %in%c('F','M','NS','UNK'))%>%group_by(sex) %>%
            summarise(Total = n())%>%arrange(desc(Total))

    table53

    SQL3 = sqldf("SELECT sex, COUNT(primaryid) as Total
    FROM demo_all
    GROUP BY sex
    ORDER BY Total DESC ;")
    
    R3 = demo_all%>%group_by(sex) %>%
            summarise(Total = n())%>%arrange(desc(Total))
    
    compare(SQL3,R3, allowAll=TRUE)
    TRUE
      dropped attributes
    SQL=sqldf("SELECT sex, COUNT(primaryid) as Total
    FROM demo_all
    WHERE sex IN ('F','M','NS','UNK')
    GROUP BY sex
    ORDER BY Total DESC ;")
    SQL$Total=as.numeric(SQL$Total)
    pie3D(SQL$Total, labels = SQL$sex,explode=0.1,col=rainbow(4),
       main="Pie Chart of adverse event reports by gender",cex.lab=0.5, cex.axis=0.5, cex.main=1,labelcex=1)

    This is the plot:
    fig1

    Inner Join

    Let’s join drug data and indication data based on primary id and drug sequence
    First, let’s check the variable names to see how to merge the two datasets.

    names(indi_all)
    names(drug_all)
    
        "primaryid" "indi_drug_seq" "indi_pt" 
        "primaryid" "drug_seq" "drugname" "route" 
    
    names(indi_all)=c("primaryid", "drug_seq", "indi_pt" ) # so that both data frames use the same name (drug_seq)
    R4= merge(drug_all,indi_all, by = intersect(names(drug_all), names(indi_all)))
    R4=arrange(R4, primaryid,drug_seq,drugname,indi_pt)
    SQL4= sqldf("SELECT d.primaryid as primaryid, d.drug_seq as drug_seq, d.drugname as drugname,
                           d.route as route,i.indi_pt as indi_pt
                           FROM drug_all d
                           INNER JOIN indi_all i
                          ON d.primaryid= i.primaryid AND d.drug_seq=i.drug_seq
                          ORDER BY primaryid,drug_seq,drugname, i.indi_pt")
    compare(R4,SQL4,allowAll=TRUE)
    TRUE
    
    R5 = merge(reac_all,outc_all,by=intersect(names(reac_all), names(outc_all)))
    SQL5 =reac_outc_new4=sqldf("SELECT r.*, o.outc_cod as outc_cod
                         FROM reac_all r 
                         INNER JOIN outc_all o
                         ON r.primaryid=o.primaryid
                         ORDER BY r.primaryid,r.pt,o.outc_cod")
    
    compare(R5,SQL5,allowAll = TRUE)
    TRUE
    
    ggplot(sqldf('SELECT age, sex
                 FROM demo_all
                 WHERE age between 0 AND 100 AND sex IN ("F","M")
                 LIMIT 10000;'), aes(x=age, fill = sex))+ geom_density(alpha = 0.6)

    This is the plot:
    fig2

    ggplot(sqldf("SELECT d.age as age, o.outc_cod as outcome
                         FROM demo_all d
                         INNER JOIN outc_all o
                         ON d.primaryid=o.primaryid
                         WHERE d.age BETWEEN 20 AND 100
                         LIMIT 20000;"),aes(x=age, fill = outcome))+ geom_density(alpha = 0.6)

    This is the plot:
    fig3

    ggplot(sqldf("SELECT de.sex as sex, dr.route as route
                         FROM demo_all de
                         INNER JOIN drug_all dr
                         ON de.primaryid=dr.primaryid
                         WHERE de.sex IN ('M','F') AND dr.route IN ('ORAL','INTRAVENOUS','TOPICAL')
                         LIMIT 200000;"),aes(x=route, fill = sex))+ geom_bar(alpha=0.6)

    This is the plot:
    fig4

    ggplot(sqldf("SELECT d.sex as sex, o.outc_cod as outcome
                         FROM demo_all d
                         INNER JOIN outc_all o
                         ON d.primaryid=o.primaryid
                         WHERE d.age BETWEEN 20 AND 100 AND sex IN ('F','M')
                         LIMIT 20000;"),aes(x=outcome,fill=sex))+ geom_bar(alpha = 0.6)

    This is the plot:
    fig5

    demo1= demo_all[1:20000,]
    demo2=demo_all[20001:40000,]

    UNION ALL

    R6 <- rbind(demo1, demo2)
    SQL6 <- sqldf("SELECT  * FROM demo1 UNION ALL SELECT * FROM demo2;")
    compare(R6,SQL6, allowAll = TRUE)
    TRUE

    INTERSECT

    R7 <- semi_join(demo1, demo2)
    SQL7 <- sqldf("SELECT  * FROM demo1 INTERSECT SELECT * FROM demo2;")
    compare(R7,SQL7, allowAll = TRUE)
    TRUE

    EXCEPT

    R8 <- anti_join(demo1, demo2)
    SQL8 <- sqldf("SELECT  * FROM demo1 EXCEPT SELECT * FROM demo2;")
    compare(R8,SQL8, allowAll = TRUE)
    TRUE

    See you in my next post! If you have any questions or feedback, feel free to leave a comment.

      Related Post

      1. Developing an R Tutorial shiny dashboard app
      2. Strategies to Speedup R Code
      3. Sentiment analysis with machine learning in R
      4. Sentiment Analysis on Donald Trump using R and Tableau
      5. Google scholar scraping with rvest package

      To leave a comment for the author, please follow the link and comment on their blog: DataScience+.

      R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      On the growth of CRAN packages


      (This article was first published on Revolutions, and kindly contributed to R-bloggers)

      by Andrie de Vries

      Every once in a while somebody asks me how many packages are on CRAN. (More than 8,000 in April, 2016).  A year ago, in April 2015, there were ~6,200 packages on CRAN.

      This poses a second question: what is the historical growth of CRAN packages?

      One source of information is Bob Muenchen's blog post R Now Contains 150 Times as Many Commands as SAS, which contains this graphic showing packages from 2002 through 2014. (Bob fitted a quadratic curve through the data, which fits quite well, except that this model estimates too high in the very early years.)

      CRAN package data through 2014 by Bob Muenchen

      But where does this data come from? Bob's article references an earlier article by John Fox in the R Journal, Aspects of the Social Organization and Trajectory of the R Project. (This is a fascinating article, and I highly recommend you read it.) John Fox's analysis contains this graphic showing data from 2001 through 2009; he fits an exponential growth curve through the data, which again fits very well:

      CRAN package data through 2009 by John Fox

      I was particularly interested in trying to see if I can find the original source of the data. The original graphic contains a caption with references to the R source code on SVN, but I could only find the release dates of historical R releases, not the package counts.

      Next I put the search term "john fox 2009 cran package data" into my favourite search engine and came across the CRANpackages dataset in the Ecdat package. The Ecdat package contains data sets for econometrics, compiled by Spencer Graves.

      I promptly installed the package and inspected the data: 

      > library(Ecdat)
      
      > head(CRANpackages)
        Version       Date Packages            Source
      1     1.3 2001-06-21      110         John Fox 
      2     1.4 2001-12-17      129         John Fox 
      3     1.5 2002-05-29      162         John Fox 
      4     1.6 2002-10-01      163 John Fox, updated
      5     1.7 2003-05-27      219         John Fox 
      6     1.8 2003-11-16      273         John Fox 
      > tail(CRANpackages)
         Version       Date Packages         Source
      24    2.15 2012-07-07     4000       John Fox
      25    2.15 2012-11-01     4082 Spencer Graves
      26    2.15 2012-12-14     4210 Spencer Graves
      27    2.15 2013-10-28     4960 Spencer Graves
      28    2.15 2013-11-08     5000 Spencer Graves
      29     3.1 2014-04-13     5428 Spencer Graves

       This data is exactly what I was after, but what is the origin?

      > ?CRANpackages

      Data casually collected on the number of packages on the Comprehensive R Archive Network (CRAN) at different dates.

      So it seems this gets compiled and updated by hand, originally by John Fox, and more recently by Spencer Graves himself.

      Can we do better?

      This set me thinking. Can we do better and automate this process by scraping CRAN?

      This is in fact possible, and you can find the source data at CRAN for older, archived releases (R-1.7 in 2004 through R-2.10 in 2010) as well as more recent releases.

      However, you will have to scrape the dates from a list of package release dates for each historic release (you can find my code at the bottom of this blog).
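
      For illustration only, here is a minimal sketch of the idea for the current release (not the full script referenced above, which also walks the archived releases): count the packages on CRAN and pull publication dates from the standard "available packages by date" listing. The URL is the usual CRAN listing page and may of course change over time.

      library(rvest)

      # Number of packages currently on CRAN
      nrow(available.packages(contriburl = contrib.url("https://cloud.r-project.org")))

      # Publication dates from the "available packages by date" listing
      url <- "https://cran.r-project.org/web/packages/available_packages_by_date.html"
      pkgs_by_date <- read_html(url) %>%
        html_node("table") %>%
        html_table()
      head(pkgs_by_date)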

      The results

      I get the following result. Note that the rug marks indicate the release date and number of packages for each release. The data is linear, not log, but the rug marks give the illusion of a logarithmic scale.

      CRAN package data through 2016 by Andrie de Vries

      Caveat

      I took a few shortcuts in the analysis:

      • For each release, the actual data is a list of packages together with the publication date of each package. I took the "release date" to be the very last package publication date, so my estimate of each release date is biased late; the actual release would have occurred somewhat earlier.
      • I made no attempt to find the data prior to 2004.

      Further work

      The analysis would really benefit from fitting some curves through the data. Specifically, I would like to fit an exponential growth curve to see whether the contribution rate is steady, accelerating, or decelerating. Might an S-curve fit the data better?

      The plot itself needs additional labels for the dot releases.

      I hope to address these in a follow-up post.
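
      In the meantime, here is a rough sketch of the first item (my illustration, not part of the analysis above): fitting exponential growth on a log scale to the hand-collected counts in Ecdat::CRANpackages.

      library(Ecdat)

      d <- CRANpackages
      d$Date  <- as.Date(d$Date)
      d$years <- as.numeric(d$Date - min(d$Date)) / 365.25

      # log-linear fit: log(Packages) = a + b * years, i.e. exponential growth
      fit <- lm(log(Packages) ~ years, data = d)
      exp(coef(fit)["years"]) - 1   # implied growth rate per year

      plot(d$Date, d$Packages, log = "y", pch = 16,
           xlab = "Date", ylab = "Packages on CRAN (log scale)")
      lines(d$Date, exp(predict(fit)))

      Systematic curvature in the residuals around this line would be a first hint that an S-curve (for example a logistic model fitted with nls()) describes the data better.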

      The code and data

      To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


      The Hype Bubble Map for Dog Breeds


      (This article was first published on Ripples, and kindly contributed to R-bloggers)

      In the whole history of the world there is but one thing that money can not buy… to wit the wag of a dog’s tail (Josh Billings)

      In this post I combine several things:

      • Simple web scraping to read the list of companion dogs from Wikipedia. I love the rvest package for these things.
      • Google Trends queries to download the evolution of searches for each breed over the last 6 months. I use the gtrendsR package for this and it works quite well.
      • A dynamic Highcharts visualization using the awesome highcharter package.
      • A static ggplot visualization.

      The experiment is based on a simple idea: what people search for on the Internet reflects what people do. Can Google Trends be a useful tool for predicting which breeds will become fashionable in the future? To be honest, I don't really know, but I will make my own bet.

      What I have done is to extract the last 6 months of Google Trends data for this list of companion breeds. After some simple text mining, I divide the set of names into 5-element subsets, because the Google API doesn't allow searches with more than 5 items. The result of a query to Google Trends is a normalized time series, meaning the 0–100 values are relative, not absolute, measures: all the interest data for your keywords is divided by the highest point of interest over that date range. To make all 5-item result sets comparable, I always include the King Charles Spaniel breed in every search (a kind of undercover agent I use to compare search levels). The resulting number is my "Level", the Y-axis of the plot. I limit searches to cat="0-66", which restricts results to the Animals and Pets category. Thanks, Philippe, for your help on this point. I also restrict searches to the United States.

      There are several ways to obtain an aggregated trend indicator from a time series. My choice here was to apply a short moving average (order 2) to the interest-over-time series obtained from Google. Then I divide the weekly variations by the smoothed time series; the trend indicator is the mean of these values. To obtain a robust indicator, I remove outliers from the original time series. This is my X-axis.
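
      To make the indicator concrete, here is a toy illustration on a made-up weekly interest series (the real input comes from gtrendsR, as in the full code below):

      library(forecast)   # ma()
      library(outliers)   # rm.outlier()

      interest <- c(40, 42, 45, 43, 50, 48, 55, 60, 58, 62, 65, 70)  # hypothetical weekly values

      s     <- rm.outlier(interest, fill = TRUE)            # replace the most extreme value by the mean
      sm    <- ma(s, order = 2)                             # short moving average (order 2)
      trend <- mean(diff(sm) / head(sm, -1), na.rm = TRUE)  # mean relative weekly variation
      trend

      A positive value means the smoothed interest is, on average, still growing week over week; that is what ends up on the X-axis.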

      This is how dog breeds are arranged with respect to my Trend and Level indicators:

      (Plot: HypeBubbleGgplot)

      Inspired by Gartner’s Hype Cycle of Emerging Technologies I distinguish two sets of dog breeds:

      • Plateau of Productivity Breeds (successful breeds with a very high level indicator and positive trend): Golden Retriever, Pomeranian, Chihuahua, Collie and Shih Tzu.
      • Innovation Trigger Breeds (promising dog breeds with a very high trend indicator and low level): Mexican Hairless Dog, Keeshond, West Highland White Terrier and German Spitz.

      I recently discovered a wonderful package called highcharter, which allows you to create incredibly cool dynamic visualizations. I love it, and I could not resist using it to redo the previous plot with the look and feel of The Economist. This is a screenshot (reproduce it to play with its interactivity):

      (Screenshot: BubbleEconomist)
      And here comes my prediction: after analyzing the Innovation Trigger Breeds, my bet is that the Keeshond will increase in popularity in the near future. Don't you think it is lovely?

      (Photo: 640px-Little_Puppy_Keeshond, by Terri Brown, Flickr: IMG_4723, CC BY 2.0)

      Here is the code:

      library(gtrendsR)
      library(rvest)
      library(dplyr)
      library(stringr)
      library(forecast)
      library(outliers)
      library(highcharter)
      library(ggplot2)
      library(scales)
      
      #Webscraping
      x="https://en.wikipedia.org/wiki/Companion_dog"
      read_html(x) %>%
        html_nodes("ul:nth-child(19)") %>%
        html_text() %>%
        strsplit(., "\n") %>%
        unlist() -> breeds
      
      #Some simple cleansing
      breeds=iconv(breeds[breeds!= ""], "UTF-8")
      
      usr <- "YOUR GOOGLE ACCOUNT"
      psw <- "YOUR GOOGLE PASSWORD"
      gconnect(usr, psw)
      
      #Reference (undercover agent)
      ref="King Charles Spaniel"
      
      #Remove the undercover agent from the set of breeds
      breeds=setdiff(breeds, ref)
      
      #Subsets. Do not worry about warning message
      sub.breeds=split(breeds, 1:ceiling(length(breeds)/4))
      
      #Loop to obtain google trends of each 5-items subset
      results=list()
      for (i in 1:length(sub.breeds))
      {
        res <- gtrends(unlist(union(ref, sub.breeds[i])),
                       start_date = Sys.Date()-180,
                       cat="0-66",
                       geo="US")
        results[[i]]=res
      }
      
      #Loop to obtain trend and level indicator of each breed
      trends=data.frame(name=character(0), level=numeric(0), trend=numeric(0))
      for (i in 1:length(results))
      {
        df=results[[i]]$trend
        lr=mean(results[[i]]$trend[,3]/results[[1]]$trend[,3])
        for (j in 3:ncol(df))
        {
          s=rm.outlier(df[,j], fill = TRUE)
          t=mean(diff(ma(s, order=2))/ma(s, order=2), na.rm = T)
          l=mean(results[[i]]$trend[,j]/lr)
          trends=rbind(data.frame(name=colnames(df)[j], level=l, trend=t), trends)
        }
      }
      
      #Prepare data for visualization
      trends %>%
        group_by(name) %>%
        summarize(level=mean(level), trend=mean(trend*100)) %>%
        filter(level>0 & trend > -10 & level<500) %>%
        na.omit() %>%
        mutate(name=str_replace_all(name, ".US","")) %>%
        mutate(name=str_replace_all(name ,"[[:punct:]]"," ")) %>%
        rename(
          x = trend,
          y = level
        ) -> trends
      trends$y=(trends$y/max(trends$y))*100
      
      #Dinamic chart as The Economist
      highchart() %>%
        hc_title(text = "The Hype Bubble Map for Dog Breeds") %>%
        hc_subtitle(text = "According Last 6 Months of Google Searchings") %>%
        hc_xAxis(title = list(text = "Trend"), labels = list(format = "{value}%")) %>%
        hc_yAxis(title = list(text = "Level")) %>%
        hc_add_theme(hc_theme_economist()) %>%
        hc_add_series(data = list.parse3(trends), type = "bubble", showInLegend=FALSE, maxSize=40) %>%
        hc_tooltip(formatter = JS("function(){
                                  return ('<b>Trend: </b>' + Highcharts.numberFormat(this.x, 2) + '%' +
                                          '<br><b>Level: </b>' + Highcharts.numberFormat(this.y, 2) +
                                          '<br><b>Breed: </b>' + this.point.name)
                                  }"))
      
      #Static chart
      opts=theme(
        panel.background = element_rect(fill="gray98"),
        panel.border = element_rect(colour="black", fill=NA),
        axis.line = element_line(size = 0.5, colour = "black"),
        axis.ticks = element_line(colour="black"),
        panel.grid.major = element_line(colour="gray75", linetype = 2),
        panel.grid.minor = element_blank(),
        axis.text.y = element_text(colour="gray25", size=15),
        axis.text.x = element_text(colour="gray25", size=15),
        text = element_text(size=20),
        legend.key = element_blank(),
        legend.position = "none",
        legend.background = element_blank(),
        plot.title = element_text(size = 30))
      ggplot(trends, aes(x=x/100, y=y, label=name), guide=FALSE)+
        geom_point(colour="white", fill="darkorchid2", shape=21, alpha=.3, size=9)+
        scale_size_continuous(range=c(2,40))+
        scale_x_continuous(limits=c(-.02,.02), labels = percent)+
        scale_y_continuous(limits=c(0,100))+
        labs(title="The Hype Bubble Map for Dog Breeds",
             x="Trend",
             y="Level")+
        geom_text(data=subset(trends, x> .2 & y > 50), size=4, colour="gray25")+
        geom_text(data=subset(trends, x > .7), size=4, colour="gray25")+opts
      

      To leave a comment for the author, please follow the link and comment on their blog: Ripples.


      Web-Scraping JavaScript rendered Sites


      (This article was first published on Florian Teschner, and kindly contributed to R-bloggers)

      Gathering data from the web is one of the key tasks for generating data-driven insights into all kinds of topics.
      Thanks to the fantastic rvest R package, web scraping is pretty straightforward.
      It basically works like this: go to a website, find the right items using the SelectorGadget, and plug the element path into your R code.
      There are various great tutorials on how to do that (e.g. 1, 2).

      Increasingly, I see websites that (un)consciously make it hard to scrape their content by employing delayed JavaScript-based rendering. In these cases the simple rvest approach breaks down. Examples include the Bwin betting site and the German site Busliniensuche, a site for comparing bus travel providers on price, duration and schedule.

      As a side note, the German bus travel market was deregulated in 2013, so the market is still developing rapidly. I thought it would be interesting to analyse the basic market elements and compare bus travel with the established train provider Deutsche Bahn.

      As mentioned, the site employs some kind of delayed JavaScript rendering: it loads the page content with a delay. This would be fine if the website called a structured JSON endpoint, but that is not the case; rather, I believe they make it intentionally hard to gather their data in a structured form.

      How to scrape JS-rendered websites?

      One way to gather the data nonetheless is using a “headless” browser such as PhantomJS.
      “A headless browser is a web browser without a graphical user interface. Headless browsers provide automated control of a web page in an environment similar to popular web browsers” (Source: Wikipedia). In order to control PhantomJS from R we need two scripts; a) the PhantomJS file and b) a R file that manipulates and runs PhantomJS. Both files are included at the end of this post.

      The PhantomJS file has one parameter: the URL to be scraped, placed right at the beginning of the file.
      The headless browser loads the URL, waits 2500 milliseconds and saves the rendered page to disk (file name: 1.html).
      The R file changes the URL to the target site, runs the headless browser via a system call, and then works with the locally saved file in the usual rvest way.
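
      To make the two-file setup concrete, here is a minimal sketch of the pattern (not the exact files from the end of this post); the file names scrape.js and 1.html, the example URL and the commented selector are illustrative assumptions:

      library(rvest)

      url <- "http://www.busliniensuche.de/"   # example target site

      # PhantomJS script: open the URL, wait 2500 ms for the JavaScript to render,
      # then dump the rendered DOM to 1.html
      phantom_script <- sprintf("
      var page = require('webpage').create();
      page.open('%s', function () {
        setTimeout(function () {
          require('fs').write('1.html', page.content, 'w');
          phantom.exit();
        }, 2500);
      });", url)
      writeLines(phantom_script, "scrape.js")

      # Run the headless browser (assumes the phantomjs binary is on the PATH or in
      # the working directory), then read the saved file like any static page
      system("phantomjs scrape.js")
      page <- read_html("1.html")
      # page %>% html_nodes(".some-result") %>% html_text()   # hypothetical selector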

      So to get started on your own: download PhantomJS, place the executable in your R working directory, and adapt the source code accordingly.

      Happy Scraping!

      To leave a comment for the author, please follow the link and comment on their blog: Florian Teschner.
