
How to Learn R


There are tons of resources to help you learn the different aspects of R, and as a beginner this can be overwhelming. It’s also a dynamic language and rapidly changing, so it’s important to keep up with the latest tools and technologies.

That’s why R-bloggers and DataCamp have worked together to bring you a learning path for R. Each section points you to relevant resources and tools to get you started and keep you engaged to continue learning. It’s a mix of materials ranging from documentation, online courses, books, and more.

Just like R, this learning path is a dynamic resource. We want to continually evolve and improve the resources to provide the best possible learning experience. So if you have suggestions for improvement please email tal.galili@gmail.com with your feedback.

Learning Path

Getting started:  The basics of R

Setting up your machine

R packages

Importing your data into R

Data Manipulation

Data Visualization

Data Science & Machine Learning with R

Reporting Results in R

Next steps

Getting started:  The basics of R


The best way to learn R is by doing. In case you are just getting started with R, this free introduction to R tutorial by DataCamp is a great resource, as is its successor Intermediate R programming (subscription required). Both courses teach you R programming and data science interactively, at your own pace, in the comfort of your browser. You get immediate feedback during exercises with helpful hints along the way so you don’t get stuck.

Another free online interactive learning tutorial for R is available on O’Reilly’s Code School website, called Try R. An offline interactive learning resource is swirl, an R package that makes it fun and easy to become an R programmer. You can take a swirl course by (i) installing the package in R, and (ii) selecting a course from the course library. If you want to start right away without needing to install anything, you can also opt for the online version of swirl.
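
For example, a minimal swirl session looks like the sketch below (course names may vary over time):

install.packages("swirl")   # one-time installation from CRAN
library(swirl)
swirl()                     # follow the prompts and pick a course from the course library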

There are also some very good MOOCs available on edX and Coursera that teach you the basics of R programming. On edX you can find Introduction to R Programming by Microsoft, an 8-hour course that focuses on the fundamentals and basic syntax of R. On Coursera there is the very popular R Programming course by Johns Hopkins. Both are highly recommended!

If you instead prefer to learn R via a written tutorial or book there is plenty of choice. There is the introduction to R manual by CRAN, as well as some very accessible books like Jared Lander’s R for Everyone or R in Action by Robert Kabacoff.

Setting up your machine

You can download a copy of R from the Comprehensive R Archive Network (CRAN). There are binaries available for Linux, Mac and Windows.

Once R is installed you can choose to either work with the basic R console, or with an integrated development environment (IDE). RStudio is by far the most popular IDE for R and supports debugging, workspace management, plotting and much more (make sure to check out the RStudio shortcuts).


Next to RStudio you also have Architect, an Eclipse-based IDE for R. If you prefer to work with a graphical user interface, you can have a look at R Commander (also known as Rcmdr) or Deducer.

R packages


R packages are the fuel that drives the growth and popularity of R. R packages are bundles of code, data, documentation, and tests that are easy to share with others. Before you can use a package, you will first have to install it. Some packages, like the base package, are automatically installed when you install R. Other packages, like the ggplot2 package, do not come bundled with the R installation and need to be installed separately.

Many (but not all) R packages are organized and available from CRAN, a network of servers around the world that store identical, up-to-date versions of code and documentation for R. You can easily install these packages from inside R using the install.packages() function. CRAN also maintains a set of Task Views that identify all the packages associated with a particular task, such as, for example, TimeSeries.
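
As a quick illustration, installing and loading a CRAN package takes just two calls (ggplot2 is used here only as an example):

install.packages("ggplot2")   # download and install the package from CRAN
library(ggplot2)              # load the package into the current session
update.packages()             # optionally, update all installed packages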

Next to CRAN you also have Bioconductor, which has packages for the analysis of high-throughput genomic data, as well as, for example, the GitHub and Bitbucket repositories of R package developers. You can easily install packages from these repositories using the devtools package.
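
A minimal sketch of installing a development version from GitHub with devtools (the "username/packagename" slug below is a placeholder, not a real repository):

install.packages("devtools")
library(devtools)
install_github("username/packagename")   # placeholder: replace with an actual user/repo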

Finding a package can be hard, but luckily you can easily search packages from CRAN, GitHub and Bioconductor using Rdocumentation or inside-R, or you can have a look at this quick list of useful R packages.

Finally, once you start working with R, you’ll quickly find out that R package dependencies can cause a lot of headaches. Once you are confronted with that issue, make sure to check out packrat (see video tutorial) or checkpoint. When you need to update R on Windows, you can use the updateR() function from the installr package.
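
For instance, updating R itself on Windows with installr is a short interactive session (a sketch; the function opens dialogs that guide you through the update):

install.packages("installr")
library(installr)
updateR()   # checks for a newer R release and walks you through the update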

Importing your data into R

The data you want to import into R can come in all sorts of formats: flat files, statistical software files, databases and web data.


Getting different types of data into R often requires a different approach. To learn more in general about how to get different data types into R, you can check out this online Importing Data into R tutorial (subscription required), this post on data importing, or this webinar by RStudio.

  • Flat files are typically simple text files that contain tabular data. The standard distribution of R provides functions such as read.table() and read.csv() from the utils package to import these flat files into R as a data frame. Dedicated packages are readr, a fast and very easy to use package that is less verbose than utils and multiple times faster (more information), and data.table, whose fread() function is great for importing and munging data into R (a short example follows this list).
  • Software packages such as SAS, Stata and SPSS use and produce their own file types. The haven package by Hadley Wickham can import SAS, Stata and SPSS data files into R and is very easy to use. Alternatively there is the foreign package, which is able to import not only SAS, Stata and SPSS files but also more exotic formats like Systat and Weka, and it can also export data again to various formats. (Tip: if you’re switching from SAS, SPSS or Stata to R, check out Bob Muenchen’s tutorial (subscription required).)
  • The packages used to connect to and import from a relational database depend on the type of database you want to connect to. For a MySQL database you will need the RMySQL package; others are, for example, the RPostgreSQL and ROracle packages. The R functions you can then use to access and manipulate the database are specified in another R package called DBI.
  • If you want to harvest web data using R, you need to connect R to resources online using APIs or through scraping with packages like rvest. To get started with all of this, there is a great resource freely available on the blog of Rolf Fredheim.
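
As a small illustration of the flat-file and statistical-software options above, the sketch below reads a hypothetical CSV file three ways and an SPSS file with haven; the file names "mydata.csv" and "survey.sav" are placeholders:

df1 <- read.csv("mydata.csv", stringsAsFactors = FALSE)   # base R (utils)

library(readr)
df2 <- read_csv("mydata.csv")                             # readr: fast and less verbose

library(data.table)
dt <- fread("mydata.csv")                                 # data.table::fread

library(haven)
sav <- read_spss("survey.sav")                            # SPSS file via haven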

Data Manipulation

Turning your raw data into well structured data is important for robust analysis, and to make data suitable for processing. R has many built-in functions for data processing, but they are not always that easy to use. Luckily, there are some great packages that can help you:

  • The tidyr package allows you to “tidy” your data. Tidy data is data where each column is a variable and each row an observation. As such, it turns your data into data that is easy to work with. Check this excellent resource on how you can tidy your data using tidyr.
  • If you want to do string manipulation, you should learn about the stringr package. The vignette is very understandable and full of useful examples to get you started.
  • dplyr is a great package when working with data-frame-like objects (in memory and out of memory). It combines speed with a very intuitive syntax (a small example follows this list). To learn more about dplyr you can take this data manipulation course (subscription required) and check out this handy cheat sheet.
  • When performing heavy data wrangling tasks, the data.table package should be your “go-to” package. It’s blazingly fast, and once you get the hang of its syntax you will find yourself using data.table all the time. Check this data analysis course (subscription required) to discover the ins and outs of data.table, and use this cheat sheet as a reference.
  • Chances are you’ll find yourself working with times and dates at some point. This can be a painful process, but luckily lubridate makes it a bit easier. Check its vignette to better understand how you can use lubridate in your day-to-day analysis.
  • Base R has limited functionality for handling time series data. Fortunately, there are packages like zoo, xts and quantmod. Take this tutorial by Eric Zivot to better understand how to use these packages and how to work with time series data in R.
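
To give a flavour of the dplyr syntax mentioned above, here is a small sketch that summarises the built-in mtcars dataset:

library(dplyr)

mtcars %>%
  group_by(cyl) %>%                 # group cars by number of cylinders
  summarise(mean_mpg = mean(mpg),   # average miles per gallon per group
            n = n()) %>%            # number of cars per group
  arrange(desc(mean_mpg))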

If you want a general overview of data manipulation with R, you can read more in the book Data Manipulation with R or see the Data Wrangling with R video by RStudio. In case you run into trouble handling your data frames, check 15 easy solutions to your data frame problems.

Data Visualization

One of the things that make R such a great tool is its data visualization capabilities. For performing visualizations in R, ggplot2 is probably the most well-known package and a must-learn for beginners! You can find all relevant information to get you started with ggplot2 on http://ggplot2.org/ and make sure to check out the cheatsheet and the upcoming book. Next to ggplot2, you also have packages such as ggvis for interactive web graphics (see tutorial (subscription required)), googleVis to interface with Google Charts (learn to re-create this TED talk), Plotly for R, and many more. See the task view for some hidden gems, and if you have some issues with plotting your data this post might help you out.
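
To get a feel for the ggplot2 grammar, a first plot can be as short as this sketch using the built-in mtcars data:

library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")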

In R there is a whole task view dedicated to handling spatial data that allows you to create beautiful maps.


To get started, look at a package such as ggmap, which allows you to visualize spatial data and models on top of static maps from sources such as Google Maps and OpenStreetMap. Alternatively you can start playing around with maptools, choroplethr, and the tmap package. If you need a great tutorial, take this Introduction to visualising spatial data in R.
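
A minimal sketch of the ggmap workflow is shown below; note that the underlying map services may nowadays require you to register an API key first, and the location and coordinates are only examples:

library(ggmap)

madrid <- get_map(location = "Madrid, Spain", zoom = 12)   # fetch a static base map
ggmap(madrid) +
  geom_point(aes(x = lon, y = lat),
             data = data.frame(lon = -3.70, lat = 40.42),  # an example coordinate
             colour = "red", size = 4)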

You’ll often see that visualizations in R make use of magnificent color schemes that fit the graph or map like a glove. If you want to achieve this for your own visualizations, dive into the RColorBrewer package and ColorBrewer.
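
For example, picking a ColorBrewer palette in R takes only a line or two:

library(RColorBrewer)

display.brewer.all()            # preview every available palette
cols <- brewer.pal(5, "Set2")   # take 5 colours from the "Set2" palette
barplot(rep(1, 5), col = cols)  # quick look at the chosen colours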

One of the latest visualization tools in R is HTML widgets. HTML widgets work just like R plots, but they create interactive web visualizations such as dynamic maps (leaflet), time-series charts (dygraphs), and interactive tables (DataTables). There are some very nice examples of HTML widgets in the wild, and solid documentation on how to create your own (not in a reading mood? Just watch this video).
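
As an illustration, a minimal leaflet widget takes only a few lines (the coordinates below are just an example):

library(leaflet)

leaflet() %>%
  addTiles() %>%                                             # default OpenStreetMap tiles
  addMarkers(lng = -0.1275, lat = 51.5072, popup = "London")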

If you want to get some inspiration on what visualization to create next, you can have a look at blogs dedicated to visualizations such as FlowingData.

Data Science & Machine Learning with R

There are many beginner resources on how to do data science with R. A list of available online courses:

Alternatively, if you prefer a good read:

Once you start doing some machine learning with R, you will quickly find yourself using packages such as caret, rpart and randomForest. Luckily, there are some great learning resources for these packages and machine learning in general. If you are just getting started, this guide will get you going in no time. Alternatively, you can have a look at the books Mastering Machine Learning with R and Machine Learning with R. If you are looking for some step-by-step tutorials that guide you through a real-life example, there is the Kaggle Machine Learning course, or you can have a look at Wiekvoet’s blog.

Reporting Results in R

R Markdown is an authoring format that enables easy creation of dynamic documents, presentations, and reports from R. It is a great tool for reporting your data analysis in a reproducible manner, thereby making the analysis more useful and understandable. R Markdown is based on knitr and pandoc. With R Markdown, R generates a final document that replaces the R code with its results. This document can be in HTML, Word, PDF, ioslides, etc. format. You can even create interactive R Markdown documents using Shiny. This 4-hour tutorial on Reporting with R Markdown (subscription required) gets you going with R Markdown, and in addition you can use this nice cheat sheet for future reference.
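
A minimal .Rmd skeleton, just for illustration, looks like this; rendering it with rmarkdown::render("report.Rmd") (the file name is a placeholder) produces the finished document:

---
title: "My analysis"
output: html_document
---

Some narrative text explaining the analysis.

```{r}
summary(cars)   # this chunk is run and its output is embedded in the report
plot(cars)
```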

Next to R Markdown, you should also make sure to check out Shiny. Shiny makes it incredibly easy to build interactive web applications with R. It allows you to turn your analysis into interactive web applications without needing to know HTML, CSS or JavaScript. RStudio maintains a great learning portal to get you started with Shiny, including this set of video tutorials (click on the essentials of Shiny Learning Roadmap). More advanced topics are available, as well as a great set of examples.
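
To give an idea of how little code an interactive app requires, here is a minimal sketch of a Shiny app with a slider that controls a histogram:

library(shiny)

ui <- fluidPage(
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$waiting, breaks = input$bins, col = "skyblue",
         main = "Waiting time between eruptions")
  })
}

shinyApp(ui = ui, server = server)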


Next steps

Once you become more fluent in writing R syntax (and consequently addicted to R), you will want to unlock more of its power (read: do some really nifty stuff). In that case make sure to check out Rcpp, an R package that makes it easier to integrate C++ code with R, or RevoScaleR (start the free tutorial).

After spending some time writing R code (and becoming an R addict), you’ll reach a point where you want to start writing your own R package. Hilary Parker from Etsy has written a short tutorial on how to create your first package, and if you’re really serious about it you need to read R Packages, an upcoming book by Hadley Wickham that is already available for free on the web.

If you want to start learning on the inner workings of R and improve your understanding of it, the best way to get you started is by reading Advanced R.

Finally, come visit us again at R-bloggers.com to read the latest news and tutorials from bloggers of the R community.


Integrating Python and R Part III: An Extended Example


(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

 

By Chris Musselle

This is the third post in a three-part series where I have explored the options available for including both R and Python in a data analysis pipeline. See post one for some reasons why you may wish to do this, and details of a general strategy involving flat files. Post two expands on this by showing how R or Python processes can call each other and pass arguments between them.

In this post I will be sharing a longer example using these approaches in an analysis we carried out at Mango as a proof of concept to cluster news articles. The pipeline involved the use of both R and Python at different stages, with a Python script being called from R to fetch the data, and the exploratory analysis being conducted in R.

Full implementation details can be found in the repository on GitHub here, though for brevity this article will focus on the core concepts, with the parts most relevant to R and Python integration discussed below.

Document Clustering

We were interested in the problem of document clustering of live published news articles, and specifically, wished to investigate times when multiple news websites were talking about the same content. As a first step towards this, we looked at sourcing live articles via RSS feeds, and used text mining methods to preprocess and cluster the articles based on their content.

Sourcing News Articles From RSS Feeds

There are some great Python tools out there for scraping and sourcing web data, and so for this task we used a combination of feedparser, requests, and BeautifulSoup to process the RSS feeds, fetch web content, and extract the parts we were interested in. The general code structure was as follows:

# fetch_RSS_feed.py

def get_articles(feed_url, json_filename='articles.json'):
    """Update JSON file with articles from RSS feed"""
    #
    # See github link for full function script
    #

if __name__ == '__main__':

    # Pass Arguments
    args = sys.argv[1:]
    feed_url = args[0]
    filepath = args[1]

    # Get the latest articles and append to the JSON file given
    get_articles(feed_url, filepath)

Here we can see that the get_articles function is defined to perform the bulk of the data sourcing tasks, and that the parameters passed to it are the positional arguments from the command line. Within get_articles, the URL, publication date, title and text content were then extracted for each article in the RSS feed and stored in a JSON file. For each article, the text content was made up of all HTML paragraph tags within the news article.

Sidenote: The if __name__ == "__main__": line may look strange to non-Python programmers, but this is a common way in Python scripts to control the sections of the code that are run when the whole script is executed, vs when the script is imported by another Python script. If the script is executed directly (as is the case when it is called from R later), the if statement evaluates to true and all code is run. If however, at some point in the future I wanted to reuse get_articles in another Python script, I could now import that function from this script without triggering the code within the if statement.

The above Python script was then executed from within R by defining the utility function shown below. Note that by using stdout=TRUE, any messages printed to stdout with print() in the Python code can be captured and displayed in the R console.

fetch_articles <- function(url, filepath) {

  command <- "python"
  path2script <- '"fetch_RSS_feed.py"'

  args <- c(url, filepath)
  allArgs <- c(path2script, args)

  output <- system2(command, args = allArgs, stdout = TRUE)
  print(output)
}

Loading Data into R

Once the data had been written to a JSON file, the next job was to get it into R to be used with the tm package for text mining. This proved a little trickier than first expected, however, as the tm package is mainly geared towards reading in documents from raw text files, or directories containing multiple text files. To convert the JSON file into the expected VCorpus object for tm I used the following:

load_json_file <- function(filepath) {

  # Load data from JSON (fromJSON() comes from a JSON package such as jsonlite or rjson)
  json_file <- file(filepath, "rb", encoding = "UTF-8")
  json_obj <- fromJSON(json_file)
  close(json_file)

  # Convert to a VCorpus for use with the tm package
  bbc_texts <- lapply(json_obj, FUN = function(x) x$text)
  df <- as.data.frame(bbc_texts)
  df <- t(df)
  articles <- VCorpus(DataframeSource(df))
  articles
}

Unicode Woes

One potential problem when manipulating text data from a variety of sources and passing it between languages is that you can easily get tripped up by character encoding errors en route. We found that by default Python was able to read in, process and write out the article content from the HTML sources, but R was struggling to decode certain characters that were written out to the resulting JSON file. This is due to the languages using or expecting a different character encoding by default.

To remedy this, you should be explicit about the encoding you are using when writing and reading a file, by specifying it when opening a file connection. This meant using the following in Python when writing out to a JSON file,

# Write updated file.
with open(json_filename, 'w', encoding='utf-8') as json_file:
    json.dump(JSON_articles, json_file, indent=4)

and on the R side opening the file connection was as follows:

# Load data from JSON
json_file <- file(filepath, "rb", encoding = "UTF-8")
json_obj <- fromJSON(json_file)
close(json_file)

Here the "UTF-8" Unicode encoding is chosen as it is a good default to use, and it is the most popular encoding used in HTML documents worldwide.

For more details on Unicode and ways of handling it in Python 2 and 3 see Ned Batchelder’s PyCon talk here.

Summary of Text Preprocessing and Analysis

The text preprocessing part of the analysis consisted of the following steps, which were all carried out using the tm package in R:

  • Tokenisation – Splitting text into words.
  • Punctuation and whitespace removal.
  • Conversion to lowercase.
  • Stemming – to consolidate different word endings.
  • Stopword removal – to ignore the most common and therefore least informative words.

Once cleaned and processed, the Term Frequency-Inverse Document Frequency (TF-IDF) statistic was calculated for the collection of articles. This statistic aims to provide a measure of how important each word is for a particular document, across a collection of documents. It is more sophisticated than just using the word frequencies themselves, as it takes into account that some words may naturally occur more frequently than others across all documents.
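
A sketch of how these steps typically look with tm, assuming articles is the VCorpus created earlier (the exact calls used in the project are in the linked repository):

library(tm)

articles <- tm_map(articles, content_transformer(tolower))   # lowercase
articles <- tm_map(articles, removePunctuation)              # punctuation removal
articles <- tm_map(articles, stripWhitespace)                # whitespace removal
articles <- tm_map(articles, removeWords, stopwords("en"))   # stopword removal
articles <- tm_map(articles, stemDocument)                   # stemming (needs SnowballC)

# Tokenisation happens when building the document-term matrix,
# here weighted by TF-IDF rather than raw term frequency
dtm <- DocumentTermMatrix(articles,
                          control = list(weighting = weightTfIdf))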

Finally a distance matrix was constructed based on the TF-IDF values and hierarchical clustering was performed. The results were then visualised as a dendrogram using the dendextend package in R.
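
A minimal sketch of that last step, assuming dtm is the TF-IDF weighted matrix from above (the original analysis coloured leaves by news source rather than by cluster):

library(dendextend)

d    <- dist(as.matrix(dtm))            # distance matrix between documents
hc   <- hclust(d, method = "ward.D2")   # hierarchical clustering
dend <- as.dendrogram(hc)
dend <- color_branches(dend, k = 3)     # colour branches into 3 groups for illustration
plot(dend)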

An example of the clusters formed from 475 articles published over the last 4 days is shown below where the leaf nodes are coloured according to their source, with blue corresponding to BBC News, green to The Guardian, and indigo to The Independent.

Article_Clustering_full

It is interesting here to see articles from the same news websites occasionally forming groups, suggesting that news websites often post multiple articles with similar content, which is plausible considering how news stories unfold over time.

What’s more interesting is finding clusters where multiple news websites are talking about similar things. Below is one such cluster with the article headlines displayed, which mostly relate to the recent flooding in Cumbria.

Article_subsection_flooding

Hierarchical clustering is often a useful step in exploratory data analysis, and this work gives some insight into what is possible with news article clustering from live RSS feeds. Future work will look to evaluate different clustering approaches in more detail by examining the quality of the clusters they produce.

Other Approaches

In this series we have focused on describing the simplest approach of using flat files as an intermediate storage medium between the two languages. However it is worth briefly mentioning several other options that are available, such as:

  • Using a database, such as SQLite, as the storage medium instead of flat files (a short sketch follows this list).
  • Passing the results of a script execution in memory instead of writing to an intermediate file.
  • Running two persistent R and Python processes at once, and passing data between them. Libraries such as rpy2 and rPython provide one such way of doing this.
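
For instance, the first option can be sketched with DBI and RSQLite; the table, file and data frame names here are hypothetical:

library(DBI)

con <- dbConnect(RSQLite::SQLite(), "articles.sqlite")        # hypothetical database file
dbWriteTable(con, "articles", articles_df, overwrite = TRUE)  # articles_df: a placeholder data frame
articles_back <- dbReadTable(con, "articles")                 # read it back on the other side
dbDisconnect(con)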

Each of these methods brings with it some additional pros and cons, and so the question of which is most suitable is often dependent on the application itself. As a first port of call though, using common flat file formats is a good place to start.

Summary

This post gave an extended example of how Mango have been using both Python and R to perform exploratory analysis around clustering news articles. We used the flat file air-gap strategy described in the first post in this series, and then automated the calling of Python from R by spawning a separate subprocess (described in the second post). As can be seen, with a bit of care around character encodings this provides a straightforward approach to “bridging the language gap”, and allows multiple skillsets to be utilised when performing a piece of analysis.


Online R courses at Udemy – for only $15 (“Christmas deal”)


tl;dr: $15 Christmas deal at Udemy (until the 24th)

For the next 3 days (until 12/24/2015, 6:00am PST), Udemy is offering readers of R-bloggers a special $15 deal (up to 97% off) on hundreds of their courses, including many on R programming, data science, machine learning, etc.

Click here to browse ALL (R and non-R) courses

(P.S.: if you are a company with a product you are willing to offer to R-bloggers readers at a good discount, please email me about it)

Advanced R courses: 

General R courses for “data science”: 


From Udemy:

We live in a new world where learning is not limited to the classroom or a book, but now on-demand, at your own pace, and on any device. Udemy is chock full of master courses and mini courses on everything from programming to photography, and we encourage you to take a look.

Their library of courses is quite extensive; you may also find interest in one of their other courses, ranging from writing and yoga to Excel, communication skills, app development, web design and more, still for $15 (up to 97% off).

 

Google scholar scraping with rvest package


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

In this post, I will show how to scrape Google Scholar. Particularly, we will use the rvest R package to scrape the Google Scholar account of my PhD advisor. We will see his coauthors, how many times they have been cited, and their affiliations. “rvest, inspired by libraries like beautiful soup, makes it easy to scrape (or harvest) data from html web pages”, wrote Hadley Wickham on the RStudio Blog. Since it is designed to work with magrittr, we can express complex operations as elegant pipelines composed of simple and easily understood pieces of code.

Load required libraries:

We will use ggplot2 to create plots.

library(rvest)
library(ggplot2)

How many times have his papers been cited

Let’s use SelectorGadget to find out which css selector matches the “cited by” column.

page <- read_html("https://scholar.google.com/citations?user=sTR9SIQAAAAJ&hl=en&oi=ao")

Specify the css selector in html_nodes() and extract the text with html_text(). Finally, change the string to numeric using as.numeric().

citations <- page %>% html_nodes("#gsc_a_b .gsc_a_c") %>% html_text() %>% as.numeric()

See the number of citations:

citations 
148 96 79 64 57 57 57 55 52 50 48 37 34 33 30 28 26 25 23 22 

Create a barplot of the number of citations:

barplot(citations, main="How many times has each paper been cited?", ylab='Number of citations', col="skyblue", xlab="")

Here is the plot:
barplot-gscholar

Coauthors, their affiliations and how many times they have been cited

My PhD advisor, Ben Zaitchik, is a really smart scientist. He not only has the skills to build networks and cooperate with other scientists, but also intelligence and patience.
Next, let’s see his coauthors, their affiliations and how many times they have been cited.
Similarly, we will use SelectorGadget to find out which css selector matches the co-authors.

page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")
Coauthors = page%>% html_nodes(css=".gsc_1usr_name a") %>% html_text()
Coauthors = as.data.frame(Coauthors)
names(Coauthors)='Coauthors'

Now let’s explore the coauthors:

head(Coauthors) 
                  Coauthors
1               Jason Evans
2             Mutlu Ozdogan
3            Rasmus Houborg
4          M. Tugrul Yilmaz
5 Joseph A. Santanello, Jr.
6              Seth Guikema

dim(Coauthors) 
[1] 27  1

As of today, he has published with 27 people.

How many times have his coauthors been cited?

page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")
citations = page%>% html_nodes(css = ".gsc_1usr_cby")%>%html_text()

citations 
 [1] "Cited by 2231"  "Cited by 1273"  "Cited by 816"   "Cited by 395"   "Cited by 652"   "Cited by 1531" 
 [7] "Cited by 674"   "Cited by 467"   "Cited by 7967"  "Cited by 3968"  "Cited by 2603"  "Cited by 3468" 
[13] "Cited by 3175"  "Cited by 121"   "Cited by 32"    "Cited by 469"   "Cited by 50"    "Cited by 11"   
[19] "Cited by 1187"  "Cited by 1450"  "Cited by 12407" "Cited by 1939"  "Cited by 9"     "Cited by 706"  
[25] "Cited by 336"   "Cited by 186"   "Cited by 192" 

Let’s extract the numeric characters only using global substitute.

citations = gsub('Cited by','', citations)

citations
 [1] " 2231"  " 1273"  " 816"   " 395"   " 652"   " 1531"  " 674"   " 467"   " 7967"  " 3968"  " 2603"  " 3468"  " 3175" 
[14] " 121"   " 32"    " 469"   " 50"    " 11"    " 1187"  " 1450"  " 12407" " 1939"  " 9"     " 706"   " 336"   " 186"  
[27] " 192"  

Change the strings to numeric and then to a data frame to make them easy to use with ggplot2:

citations = as.numeric(citations)
citations = as.data.frame(citations)

Affiliation of coauthors

page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")
affiliation = page %>% html_nodes(css = ".gsc_1usr_aff") %>% html_text()
affiliation = as.data.frame(affiliation)
names(affiliation)='Affiliation'

Now, let’s create a data frame that consists of coauthors, citations and affiliations:

cauthors=cbind(Coauthors, citations, affiliation)

cauthors 
                             Coauthors citations                                                                                  Affiliation
1                          Jason Evans      2231                                                               University of New South Wales
2                        Mutlu Ozdogan      1273    Assistant Professor of Environmental Science and Forest Ecology, University of Wisconsin
3                       Rasmus Houborg       816                    Research Scientist at King Abdullah University of Science and Technology
4                     M. Tugrul Yilmaz       395 Assistant Professor, Civil Engineering Department, Middle East Technical University, Turkey
5            Joseph A. Santanello, Jr.       652                                                  NASA-GSFC Hydrological Sciences Laboratory
.....

Re-order coauthors based on their citations

Let’s re-order coauthors based on their citations so as to make our plot in a decreasing order.

cauthors$Coauthors <- factor(cauthors$Coauthors, levels = cauthors$Coauthors[order(cauthors$citations, decreasing=F)])

ggplot(cauthors, aes(Coauthors, citations)) +
  geom_bar(stat="identity", fill="#ff8c1a", size=5) +
  theme(axis.title.y = element_blank()) + ylab("# of citations") +
  theme(plot.title = element_text(size = 18, colour="blue"),
        axis.text.y = element_text(colour="grey20", size=12)) +
  ggtitle('Citations of his coauthors') + coord_flip()

Here is the plot:
citation-gscholar-authors

He has published with scientists who have been cited more than 12000 times and with students like me who are just toddling.

Summary

In this post, we saw how to scrape Google Scholar. We scraped the account of my advisor and got data on the citations of his papers and on his coauthors, with their affiliations and how many times they have been cited.

As we have seen in this post, it is easy to scrape an HTML page using the rvest R package. It is also important to note that SelectorGadget is useful for finding out which css selector matches the data of interest.

If you have any question feel free to post a comment below.


Sentiment Analysis on Donald Trump using R and Tableau


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Recently, the presidential candidate Donald Trump has become controversial. In particular, he has faced strong criticism over his provocative call to temporarily bar Muslims from entering the US.
One of the many uses of social media analytics is sentiment analysis, where we evaluate whether posts on a specific issue are positive or negative.
We can integrate R and Tableau for text data mining in social media analytics, machine learning, predictive modeling, etc., by taking advantage of the numerous R packages and compelling Tableau visualizations.

In this post, let’s mine tweets and analyze their sentiment using R. We will use Tableau to visualize our results. We will see spatial-temporal distribution of tweets, cities and states with top number of tweets and we will also map the sentiment of the tweets. This will help us to see in which areas his comments are accepted as positive and where they are perceived as negative.

Load required packages:

library(twitteR)
library(ROAuth)
require(RCurl)
library(stringr)
library(tm)
library(ggmap)
library(dplyr)
library(plyr)
library(wordcloud)

Get Twitter authentication

All information below is obtained from a Twitter developer account. We will set the working directory to save our authentication.

key="hidden"
secret="hidden"
setwd("/text_mining_and_web_scraping")

download.file(url="http://curl.haxx.se/ca/cacert.pem",
              destfile="/text_mining_and_web_scraping/cacert.pem",
              method="auto")
authenticate <- OAuthFactory$new(consumerKey=key,
                                 consumerSecret=secret,
                                 requestURL="https://api.twitter.com/oauth/request_token",
                                 accessURL="https://api.twitter.com/oauth/access_token",
                                 authURL="https://api.twitter.com/oauth/authorize")
setup_twitter_oauth(key, secret)
save(authenticate, file="twitter authentication.Rdata")

Get sample tweets from various cities

Let’s scrape the most recent tweets from various cities across the US. Let’s request 2000 tweets from each city. We will need the latitude and longitude of each city.

N=2000  # tweets to request from each query
S=200  # radius in miles
lats=c(38.9,40.7,37.8,39,37.4,28,30,42.4,48,36,32.3,33.5,34.7,33.8,37.2,41.2,46.8,
       46.6,37.2,43,42.7,40.8,36.2,38.6,35.8,40.3,43.6,40.8,44.9,44.9)

lons=c(-77,-74,-122,-105.5,-122,-82.5,-98,-71,-122,-115,-86.3,-112,-92.3,-84.4,-93.3,
       -104.8,-100.8,-112, -93.3,-89,-84.5,-111.8,-86.8,-92.2,-78.6,-76.8,-116.2,-98.7,-123,-93)

#cities=DC,New York,San Francisco,Colorado,Mountain View,Tampa,Austin,Boston,
#       Seattle,Vegas,Montgomery,Phoenix,Little Rock,Atlanta,Springfield,
#       Cheyenne,Bismarck,Helena,Springfield,Madison,Lansing,Salt Lake City,Nashville,
#       Jefferson City,Raleigh,Harrisburg,Boise,Lincoln,Salem,St. Paul

donald=do.call(rbind,lapply(1:length(lats), function(i) searchTwitter('Donald+Trump',
              lang="en",n=N,resultType="recent",
              geocode=paste(lats[i],lons[i],paste0(S,"mi"),sep=","))))

Let’s get the latitude and longitude of each tweet, the tweet itself, how many times it was retweeted and favorited, the date and time it was tweeted, etc.

donaldlat=sapply(donald, function(x) as.numeric(x$getLatitude()))
donaldlat=sapply(donaldlat, function(z) ifelse(length(z)==0,NA,z))  

donaldlon=sapply(donald, function(x) as.numeric(x$getLongitude()))
donaldlon=sapply(donaldlon, function(z) ifelse(length(z)==0,NA,z))  

donalddate=lapply(donald, function(x) x$getCreated())
donalddate=sapply(donalddate,function(x) strftime(x, format="%Y-%m-%d %H:%M:%S",tz = "UTC"))

donaldtext=sapply(donald, function(x) x$getText())
donaldtext=unlist(donaldtext)

isretweet=sapply(donald, function(x) x$getIsRetweet())
retweeted=sapply(donald, function(x) x$getRetweeted())
retweetcount=sapply(donald, function(x) x$getRetweetCount())

favoritecount=sapply(donald, function(x) x$getFavoriteCount())
favorited=sapply(donald, function(x) x$getFavorited())

data=as.data.frame(cbind(tweet=donaldtext,date=donalddate,lat=donaldlat,lon=donaldlon,
                           isretweet=isretweet,retweeted=retweeted, retweetcount=retweetcount,favoritecount=favoritecount,favorited=favorited))

First, let’s create a word cloud of the tweets. A word cloud helps us to visualize the most common words in the tweets and to get a general feeling for the tweets.

# Create corpus
corpus=Corpus(VectorSource(data$tweet))

# Convert to lower-case
corpus=tm_map(corpus,tolower)

# Remove stopwords
corpus=tm_map(corpus,function(x) removeWords(x,stopwords()))

# convert corpus to a Plain Text Document
corpus=tm_map(corpus,PlainTextDocument)

col=brewer.pal(6,"Dark2")
wordcloud(corpus, min.freq=25, scale=c(5,2), rot.per=0.25,
          random.color=T, max.words=45, random.order=F, colors=col)

Here is the word cloud:
wordcloud_donald

We see from the word cloud that among the most frequent words in the tweets are ‘muslim’, ‘muslims’, ‘ban’. This suggests that most tweets were on Trump’s recent idea of temporarily banning Muslims from entering the US.

The dashboard below shows a time series of the number of tweets scraped. We can change the time unit between hour and day and the dashboard will change based on the selected time unit. The pattern of the number of tweets over time helps us to drill in and see how activities/campaigns are being perceived.

Here is the screenshot. (View it live in this link)
tableau-screenshot1

Getting address of tweets

Since some tweets do not have lat/lon values, we will remove them because we want geographic information to show the tweets and their attributes by state, city and zip code.

data=filter(data, !is.na(lat),!is.na(lon))
lonlat=select(data,lon,lat)

Let’s get the full address of each tweet location using the Google Maps API. The ggmap package is what enables us to get the street address, city, zipcode and state of the tweets using the longitude and latitude of the tweets. Since the Google Maps API does not allow more than 2500 queries per day, I used a couple of machines to reverse geocode the latitude/longitude information into a full address. However, I was not lucky enough to reverse geocode all of the tweets I scraped. So, in the following visualizations, I am showing only the percentage of the scraped tweets that I was able to reverse geocode.

result <- do.call(rbind, lapply(1:nrow(lonlat),
                     function(i) revgeocode(as.numeric(lonlat[i,1:2]))))

If we see some of the values of result, we see that it contains the full address of the locations where the tweets were posted.

result[1:5,]
     [,1]                                              
[1,] "1778 Woodglo Dr, Asheboro, NC 27205, USA"        
[2,] "1550 Missouri Valley Rd, Riverton, WY 82501, USA"
[3,] "118 S Main St, Ann Arbor, MI 48104, USA"         
[4,] "322 W 101st St, New York, NY 10025, USA"         
[5,] "322 W 101st St, New York, NY 10025, USA"

So, we will apply some regular expression and string manipulation to separate the city, zip code and state into different columns.

data2=lapply(result,  function(x) unlist(strsplit(x,",")))
address=sapply(data2,function(x) paste(x[1:3],collapse=''))
city=sapply(data2,function(x) x[2])
stzip=sapply(data2,function(x) x[3])
zipcode = as.numeric(str_extract(stzip,"[0-9]{5}"))   
state=str_extract(stzip,"[:alpha:]{2}")
data2=as.data.frame(list(address=address,city=city,zipcode=zipcode,state=state))

Concatenate data2 to data:

data=cbind(data,data2)

Some text cleaning:

tweet=data$tweet
tweet_list=lapply(tweet, function(x) iconv(x, "latin1", "ASCII", sub=""))
tweet_list=lapply(tweet_list, function(x) gsub("htt.*",' ',x))
tweet=unlist(tweet_list)
data$tweet=tweet

We will use lexicon based sentiment analysis. A list of positive and negative opinion words or sentiment words for English was downloaded from here.

positives= readLines("positivewords.txt")
negatives = readLines("negativewords.txt")

First, let’s have a wrapper function that calculates sentiment scores.

sentiment_scores = function(tweets, positive_words, negative_words, .progress='none'){
  scores = laply(tweets,
                 function(tweet, positive_words, negative_words){
                 tweet = gsub("[[:punct:]]", "", tweet)    # remove punctuation
                 tweet = gsub("[[:cntrl:]]", "", tweet)   # remove control characters
                 tweet = gsub('\\d+', '', tweet)          # remove digits
                
                 # Let's have error handling function when trying tolower
                 tryTolower = function(x){
                     # create missing value
                     y = NA
                     # tryCatch error
                     try_error = tryCatch(tolower(x), error=function(e) e)
                     # if not an error
                     if (!inherits(try_error, "error"))
                       y = tolower(x)
                     # result
                     return(y)
                   }
                   # use tryTolower with sapply
                   tweet = sapply(tweet, tryTolower)
                   # split sentence into words with str_split function from stringr package
                   word_list = str_split(tweet, "\\s+")
                   words = unlist(word_list)
                   # compare words to the dictionaries of positive & negative terms
                   positive.matches = match(words, positive_words)
                   negative.matches = match(words, negative_words)
                   # get the position of the matched term or NA
                   # we just want a TRUE/FALSE
                   positive_matches = !is.na(positive.matches)
                   negative_matches = !is.na(negative.matches)
                   # final score
                   score = sum(positive_matches) - sum(negative_matches)
                   return(score)
                 }, positive_words, negative_words, .progress=.progress )
  return(scores)
}

score = sentiment_scores(tweet, positives, negatives, .progress='text')
data$score=score

Let’s plot a histogram of the sentiment score:

hist(score, xlab=" ", main="Sentiment of sample tweets\nthat have Donald Trump in them",
     border="black", col="skyblue")

Here is the plot:
hist1_2_16

We see from the histogram that the sentiment is slightly positive. Using Tableau, we will see the spatial distribution of the sentiment scores.

Save the data as a CSV file and import it into Tableau

The map below shows the tweets that I was able to reverse geocode. The size is proportional to the number of favorites each tweet got. In the interactive map, we can hover each circle and read the tweet, the address it was tweeted from, and the date and time it was posted.

Here is the screenshot (View it live in this link)
by_retweets

Similarly, the dashboard below shows the tweets and the size is proportional to the number of times each tweet was retweeted.
Here is the screenshot (View it live in this link)
by_retweets

In the following three visualizations, top zip codes, cities and states by the number of tweets are shown. In the interactive map, we can change the number of zip codes, cities and states to display by using the scrollbars shown in each viz. These visualizations help us to see the distribution of the tweets by zip code, city and state.

By zip code
Here is the screenshot (View it live in this link)
top10zip

By city
Here is the screenshot (View it live in this link)
top15cities

By state
Here is the screenshot (View it live in this link)
top15zip

Sentiment of tweets

Sentiment analysis has myriad uses. For example, a company may investigate what customers like most about the company’s product, and what issues they are not satisfied with. When a company releases a new product, has the product been perceived positively or negatively? How does the sentiment of the customers vary across space and time? In this post, we are evaluating the sentiment of the tweets on Donald Trump that we scraped.

The viz below shows the sentiment score of the reverse geocoded tweets by state. We see that the tweets have the highest positive sentiment in NY, NC and TX.
Here is the screenshot (View it live in this link)
by_sentiment

Summary

In this post, we saw how to integrate R and Tableau for text mining, sentiment analysis and visualization. Using these tools together enables us to answer detailed questions.

We used a sample of the most recent tweets that mention Donald Trump, and since I was not able to reverse geocode all the tweets I scraped because of the constraint imposed by the Google Maps API, we just used about 6000 tweets. The average sentiment is slightly above zero. Some states show strong positive sentiment. However, statistically speaking, to make robust conclusions, mining an ample sample size is important.

The accuracy of our sentiment analysis depends on how fully the words in the tweets are covered by the lexicon. Moreover, since tweets may contain slang, jargon and colloquial words which may not be included in the lexicon, sentiment analysis needs careful evaluation.

This is enough for today. I hope you enjoyed it! If you have any questions or feedback, feel free to leave a comment.


100 “must read” R-bloggers’ posts for 2015


The site R-bloggers.com is now 6 years young. It strives to be an (unofficial) online news and tutorials website for the R community, written by over 600 bloggers who agreed to contribute their R articles to the website. In 2015, the site served almost 17.7 million pageviews to readers worldwide.

In celebration of R-bloggers’ 6th birth-month, here are the top 100 most read R posts written in 2015. Enjoy:

  1. How to Learn R
  2. How to Make a Histogram with Basic R
  3. How to Make a Histogram with ggplot2
  4. Choosing R or Python for data analysis? An infographic
  5. How to Get the Frequency Table of a Categorical Variable as a Data Frame in R
  6. How to perform a Logistic Regression in R
  7. A new interactive interface for learning R online, for free
  8. How to learn R: A flow chart
  9. Learn Statistics and R online from Harvard
  10. Twitter’s new R package for anomaly detection
  11. R 3.2.0 is released (+ using the installr package to upgrade in Windows OS)
  12. What’s the probability that a significant p-value indicates a true effect?
  13. Fitting a neural network in R; neuralnet package
  14. K-means clustering is not a free lunch
  15. Why you should learn R first for data science
  16. How to format your chart and axis titles in ggplot2
  17. Illustrated Guide to ROC and AUC
  18. The Single Most Important Skill for a Data Scientist
  19. A first look at Spark
  20. Change Point Detection in Time Series with R and Tableau
  21. Interactive visualizations with R – a minireview
  22. The leaflet package for online mapping in R
  23. Programmatically create interactive Powerpoint slides with R
  24. My New Favorite Statistics & Data Analysis Book Using R
  25. Dark themes for writing
  26. How to use SparkR within Rstudio?
  27. Shiny 0.12: Interactive Plots with ggplot2
  28. 15 Questions All R Users Have About Plots
  29. This R Data Import Tutorial Is Everything You Need
  30. R in Business Intelligence
  31. 5 New R Packages for Data Scientists
  32. Basic text string functions in R
  33. How to get your very own RStudio Server and Shiny Server with DigitalOcean
  34. Think Bayes: Bayesian Statistics Made Simple
  35. 2014 highlight: Statistical Learning course by Hastie & Tibshirani
  36. ggplot 2.0.0
  37. Machine Learning in R for beginners
  38. Top 77 R posts for 2014 (+R jobs)
  39. Introducing Radiant: A shiny interface for R
  40. Eight New Ideas From Data Visualization Experts
  41. Microsoft Launches Its First Free Online R Course on edX
  42. Imputing missing data with R; MICE package
  43. “Variable Importance Plot” and Variable Selection
  44. The Data Science Industry: Who Does What (Infographic)
  45. d3heatmap: Interactive heat maps
  46. R + ggplot2 Graph Catalog
  47. Time Series Graphs & Eleven Stunning Ways You Can Use Them
  48. Working with “large” datasets, with dplyr and data.table
  49. Why the Ban on P-Values? And What Now?
  50. Part 3a: Plotting with ggplot2
  51. Importing Data Into R – Part Two
  52. How-to go parallel in R – basics + tips
  53. RStudio v0.99 Preview: Graphviz and DiagrammeR
  54. Downloading Option Chain Data from Google Finance in R: An Update
  55. R: single plot with two different y-axes
  56. Generalised Linear Models in R
  57. Hypothesis Testing: Fishing for Trouble
  58. The advantages of using count() to get N-way frequency tables as data frames in R
  59. Playing with R, Shiny Dashboard and Google Analytics Data
  60. Benchmarking Random Forest Implementations
  61. Fuzzy String Matching – a survival skill to tackle unstructured information
  62. Make your R plots interactive
  63. R #6 in IEEE 2015 Top Programming Languages, Rising 3 Places
  64. How To Analyze Data: Seven Modern Remakes Of The Most Famous Graphs Ever Made
  65. dplyr 0.4.0
  66. Installing and Starting SparkR Locally on Windows OS and RStudio
  67. Making R Files Executable (under Windows)
  68. Evaluating Logistic Regression Models
  69. Awesome-R: A curated list of the best add-ons for R
  70. Introducing Distributed Data-structures in R
  71. SAS vs R? The right answer to the wrong question?
  72. But I Don’t Want to Be a Statistician!
  73. Get data out of excel and into R with readxl
  74. Interactive R Notebooks with Jupyter and SageMathCloud
  75. Learning R: Index of Online R Courses, October 2015
  76. R User Group Recap: Heatmaps and Using the caret Package
  77. R Tutorial on Reading and Importing Excel Files into R
  78. R 3.2.2 is released
  79. Wanted: A Perfect Scatterplot (with Marginals)
  80. KDD Cup 2015: The story of how I built hundreds of predictive models….And got so close, yet so far away from 1st place!
  81. Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance
  82. 10 Top Tips For Becoming A Better Coder!
  83. James Bond movies
  84. Modeling and Solving Linear Programming with R – Free book
  85. Scraping Web Pages With R
  86. Why you should start by learning data visualization and manipulation
  87. R tutorial on the Apply family of functions
  88. The relation between p-values and the probability H0 is true is not weak enough to ban p-values
  89. A Bayesian Model to Calculate Whether My Wife is Pregnant or Not
  90. First year books
  91. Using rvest to Scrape an HTML Table
  92. dplyr Tutorial: verbs + split-apply
  93. RStudio Clone for Python – Rodeo
  94. Time series outlier detection (a simple R function)
  95. Building Wordclouds in R
  96. Should you teach Python or R for data science?
  97. Free online data mining and machine learning courses by Stanford University
  98. Centering and Standardizing: Don’t Confuse Your Rows with Your Columns
  99. Network analysis with igraph
  100. Regression Models, It’s Not Only About Interpretation 

    (oh heck, why not include a few more posts…)

  101. magrittr: The best thing to have ever happened to R?
  102. How to Speak Data Science
  103. R vs Python: a Survival Analysis with Plotly
  104. 15 Easy Solutions To Your Data Frame Problems In R
  105. R for more powerful clustering
  106. Using the R MatchIt package for propensity score analysis
  107. Interactive charts in R
  108. R is the fastest-growing language on StackOverflow
  109. Hash Table Performance in R: Part I
  110. Review of ‘Advanced R’ by Hadley Wickham
  111. Plotting Time Series in R using Yahoo Finance data
  112. R: the Excel Connection
  113. Cohort Analysis with Heatmap
  114. Data Visualization cheatsheet, plus Spanish translations
  115. Back to basics: High quality plots using base R graphics
  116. 6 Machine Learning Visualizations made in Python and R
  117. An R tutorial for Microsoft Excel users
  118. Connecting R to Everything with IFTTT
  119. Data Manipulation with dplyr
  120. Correlation and Linear Regression
  121. Why has R, despite quirks, been so successful?
  122. Introducing shinyjs: perform common JavaScript operations in Shiny apps using plain R code
  123. R: How to Layout and Design an Infographic
  124. New package for image processing in R
  125. In-database R coming to SQL Server 2016
  126. Making waffle charts in R (with the new ‘waffle’ package)
  127. Revolution Analytics joins Microsoft
  128. Six Ways You Can Make Beautiful Graphs (Like Your Favorite Journalists)

 

p.s.: 2015 was also a great year for R-users.com, a job board site for R users. If you are an employer who is looking to hire people from the R community, please visit this link to post a new R job (it’s free, and registration takes less than 10 seconds). If you are a job seeker, please follow the links below to learn more and apply for your job of interest (or visit previous R jobs posts).

 


A Checkpoint Of Spanish Football League


(This article was first published on Ripples, and kindly contributed to R-bloggers)

I am an absolute beginner, but I am absolutely sane (Absolute Beginners, David Bowie)

Some time ago I wrote this post, where I correctly predicted the winner of the Spanish Football League several months before its ending. After thinking intensely about taking the risk of ruining my reputation by repeating the analysis, I said “no problem, Antonio, do it again: in the end you don’t have any reputation to keep”. So here we are.

From a technical point of view there are many differences between the two analyses. Now I use web scraping to download data, dplyr and pipes to do transformations, and interactive D3.js graphs to show results. I think my code is better now and it makes me happy.

As I did the other time, the Bradley-Terry model gives an indicator of the power of each team, called ability, which provides a natural mechanism for ranking teams. This is the evolution of the abilities of each team during the championship (the last season was played during the past weekend):

liga1_ability2

Although it is a bit messy, the graph shows two main groups of teams: on the one hand, Barcelona, Atletico de Madrid, Real Madrid and Villarreal; on the other hand, the rest. Let’s have a closer look at the evolution of the abilities of the top 4 teams:

liga2_ability2

While Barcelona, Atletico de Madrid and Real Madrid walk in parallel, Villarreal seems to be a bit stuck in the last few seasons; the gap between them and Real Madrid is increasing little by little. Maybe it is the Zidane effect. It is quite interesting to discover which teams are increasing their abilities: Malaga, Eibar and Getafe. They will probably finish the championship in a better position than they hold nowadays (Eibar could reach fifth position):

liga3_ability2

What about Villarreal? Will they move up some positions? I don’t think so. This plot shows their probability of beating any of the top 3:

liga4_villareal2

As you can see, the probability is decreasing significantly. And what about Barcelona? Will they win? It is a very difficult question. They are almost tied with Atletico de Madrid, and only 5 and 8 points above Real Madrid and Villarreal. But it seems Barcelona keep them at bay. This plot shows the evolution of the probability of Barcelona being beaten by Atletico, Real Madrid and Villarreal:

liga5_Barcelona2

All probabilities are under 50% and decreasing (I assumed a 2-0 score for Barcelona in the match against Sporting of season 16, which was postponed to next February 17th).
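
These probabilities all come from the Bradley-Terry model, in which the chance that one team beats another depends only on the difference between their estimated abilities; the prob_BT helper in the code below is a direct implementation of this (shown here with more descriptive argument names):

prob_BT <- function(ability_i, ability_j) {
  exp(ability_i - ability_j) / (1 + exp(ability_i - ability_j))   # P(team i beats team j)
}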

Data science is a profession for brave people so it is time to do some predictions. These are mine, ordered by likelihood:

  • Barcelona will win, followed by Atletico (2), Real Madrid (3), Villarreal (4) and Eibar (5)
  • Malaga and Getafe will go up some positions
  • Next year I will do the analysis again

Here you have the code:

library(rvest)
library(stringr)
library(BradleyTerry2)
library(dplyr)
library(reshape)
library(rCharts)
nseasons=20
results=data.frame()
# Scrape the results table for each matchday ("season" in the code) from marca.com
for (i in 1:nseasons)
{
  webpage=paste0("http://www.marca.com/estadisticas/futbol/primera/2015_16/jornada_",i,"/")
  read_html(webpage) %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table(header=FALSE, fill=TRUE) %>%
    mutate(X4=i) %>%
    rbind(results)->results
}
colnames(results)=c("home", "score", "visiting", "season")
results %>% 
  mutate(home     = iconv(home,     from="UTF8",to="ASCII//TRANSLIT"),
         visiting = iconv(visiting, from="UTF8",to="ASCII//TRANSLIT")) %>%
  #filter(grepl("-", score)) %>%
  mutate(score=replace(score, score=="18:30 - 17/02/2016", "0-2")) %>% # placeholder result for Barcelona's postponed match
  mutate(score_home     = as.numeric(str_split_fixed(score, "-", 2)[,1])) %>%
  mutate(score_visiting = as.numeric(str_split_fixed(score, "-", 2)[,2])) %>%
  mutate(points_home     =ifelse(score_home > score_visiting, 3, ifelse(score_home < score_visiting, 0, 1))) %>%
  mutate(points_visiting =ifelse(score_home > score_visiting, 0, ifelse(score_home < score_visiting, 3, 1))) -> data
# Bradley-Terry probability that a team with ability x beats a team with ability y
prob_BT=function(x, y) {exp(x-y) / (1 + exp(x-y))}
# Fit the Bradley-Terry model cumulatively up to each matchday and store the estimated abilities
BTabilities=data.frame()
for (i in 13:nseasons)
{
  data %>% filter(season<=i) %>%
    BTm(cbind(points_home, points_visiting), home, visiting, data=.) -> footballBTModel
  BTabilities(footballBTModel) %>%
  as.data.frame()  -> tmp 
  cbind(tmp, as.character(rownames(tmp)), i) %>% 
  mutate(ability=round(ability, digits = 2)) %>%
  rbind(BTabilities) -> BTabilities
}
colnames(BTabilities)=c("ability", "s.e.", "team", "season")
sort(unique(BTabilities[,"team"])) -> teams
# Compute pairwise win probabilities between all teams for each matchday
BTprobabilities=data.frame()
for (i in 13:nseasons)
{
  BTabilities[BTabilities$season==i,1] %>% outer( ., ., prob_BT) -> tmp
  colnames(tmp)=teams
  rownames(tmp)=teams  
  cbind(melt(tmp),i) %>% rbind(BTprobabilities) -> BTprobabilities
}
colnames(BTprobabilities)=c("team1", "team2", "probability", "season")
BTprobabilities %>% 
  filter(team1=="Villarreal") %>% 
  mutate(probability=round(probability, digits = 2)) %>%
  filter(team2 %in% c("R. Madrid", "Barcelona", "Atletico")) -> BTVillareal
BTprobabilities %>% 
  filter(team2=="Barcelona") %>% 
  mutate(probability=round(probability, digits = 2)) %>%
  filter(team1 %in% c("R. Madrid", "Villarreal", "Atletico")) -> BTBarcelona
AbilityPlot <- nPlot(
  ability ~ season, 
  data = BTabilities, 
  group = "team",
  type = "lineChart")
AbilityPlot$yAxis(axisLabel = "Estimated Ability", width = 62)
AbilityPlot$xAxis(axisLabel = "Season")
VillarealPlot <- nPlot(
  probability ~ season, 
  data = BTVillareal, 
  group = "team2",
  type = "lineChart")
VillarealPlot$yAxis(axisLabel = "Probability of beating", width = 62)
VillarealPlot$xAxis(axisLabel = "Season")
BarcelonaPlot <- nPlot(
  probability ~ season, 
  data = BTBarcelona, 
  group = "team1",
  type = "lineChart")
BarcelonaPlot$yAxis(axisLabel = "Probability of being beaten", width = 62)
BarcelonaPlot$xAxis(axisLabel = "Season")

To leave a comment for the author, please follow the link and comment on their blog: Ripples.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Materials for NYU Shortcourse “Data Science and Social Science”


(This article was first published on R – Bad Hessian, and kindly contributed to R-bloggers)

Pablo Barberá, Dan Cervone, and I prepared a short course at New York University on Data Science and Social Science, sponsored by several institutes at NYU. The course was intended as an introduction to R and basic data science tasks, including data visualization, social network analysis, textual analysis, web scraping, and APIs. The workshop is geared towards social scientists with little experience in R, but experience with other statistical packages.

You can download and tinker around with the materials on GitHub.

To leave a comment for the author, please follow the link and comment on their blog: R – Bad Hessian.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

R User Groups on GitHub


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

Quite a few times over the past few years I have highlighted presentations posted by R user groups on their websites and recommended these sites as a source of interesting material, but I had never thought to see what the user groups were doing on GitHub. As you might expect, many people who present at R user group meetings make their code available on GitHub. However, as best I can tell, only a few R user groups are maintaining GitHub sites under the user group name.

The Indy UseR Group is one that seems to be making very good use of their GitHub site. Here is the link to a very nice tutorial from Shankar Vaidyaraman on using the rvest package to do some web scraping with R. The following code, which scrapes the first page of Springer's Use R! series to produce a short list of books, comes from Shankar's simple example:

# load libraries
library(rvest)
library(dplyr)
library(stringr)
 
# link to Use R! titles at Springer site
useRlink = "http://www.springer.com/?SGWID=0-102-24-0-0&series=Use+R&sortOrder=relevance&searchType=ADVANCED_CDA&searchScope=editions&queryText=Use+R"
 
# Read the page
userPg = useRlink %>% read_html()
 
## Get info of books displayed on the page
booktitles = userPg %>% html_nodes(".productGraphic img") %>% html_attr("alt")
bookyr = userPg %>% html_nodes(xpath = "//span[contains(@class,'renditionDescription')]") %>% html_text()
bookauth = userPg %>% html_nodes("span[class = 'displayBlock']") %>% html_text()
bookprice = userPg %>% html_nodes(xpath = "//div[@class = 'bookListPriceContainer']//span[2]") %>% html_text()
pgdf = data.frame(title = booktitles, pubyr = bookyr, auth = bookauth, price = bookprice)
pgdf

 

Book_list

This plot, which shows a list of books ranked by number of downloads, comes from Shankar's extended recommender example.

Book_downloads

The Ann Arbor R User Group meetup site has done an exceptional job of creating an aesthetically pleasing and informative web property on their GitHub site.

AnnArbor_github

I am particularly impressed with the way they have integrated news, content and commentary into their "News" section. Scroll down the page and have a look at the care taken to describe and document the presentations made to the group. I found the introduction and slides for Bob Carpenter's RStan presentation very well done.

StanvsAlternatives

Other RUGs active on GitHub include:

If your R user group is on GitHub and I have not included you in my short list, please let me know about it. I think RUG GitHub sites have the potential to create a rich code-sharing experience among user groups. If you would like some help getting started with GitHub, have a look at the tutorials on the Murdoch University R User Group webpage.

 

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Craft httr calls cleverly with curlconverter


(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

When you visit a site like the LA Times’ NH Primary Live Results site and wish you had the data that they used to make the tables & visualizations on the site:

primary

Sometimes it’s as simple as opening up your browsers “Developer Tools” console and looking for XHR (XML HTTP Requests) calls:

XHR

You can actually see a preview of those requests (usually JSON):

Developer Tools preview of the JSON response

While you could go through all the headers and cookies and transcribe them into httr::GET or httr::POST requests, that’s tedious, especially when most browsers present an option to “Copy as cURL”. cURL is a command-line tool (with a corresponding systems programming library) that you can use to grab data from URIs. The RCurl and curl packages in R are built with the underlying library. The cURL command line captures all of the information necessary to replicate the request the browser made for a resource. The cURL command line for the URL that gets the Republican data is:

curl 'http://graphics.latimes.com/election-2016-31146-feed.json' \
  -H 'Pragma: no-cache' \
  -H 'DNT: 1' \
  -H 'Accept-Encoding: gzip, deflate, sdch' \
  -H 'X-Requested-With: XMLHttpRequest' \
  -H 'Accept-Language: en-US,en;q=0.8' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36' \
  -H 'Accept: */*' \
  -H 'Cache-Control: no-cache' \
  -H 'If-None-Match: "7b341d7181cbb9b72f483ae28e464dd7"' \
  -H 'Cookie: s_fid=79D97B8B22CA721F-2DD12ACE392FF3B2; s_cc=true' \
  -H 'Connection: keep-alive' \
  -H 'If-Modified-Since: Wed, 10 Feb 2016 16:40:15 GMT' \
  -H 'Referer: http://graphics.latimes.com/election-2016-new-hampshire-results/' \
  --compressed

While that’s easier than manual copy/paste transcription, these requests are uniform enough that there Has To Be A Better Way. And, now there is, with curlconverter.

The curlconverter package has (for the moment) two main functions:

  • straighten() : which returns a list with all of the necessary parts to craft an httr POST or GET call
  • make_req() : which actually returns a working httr call, pre-filled with all of the necessary information.

By default, either function reads from the clipboard (envision the workflow where you do the “Copy as cURL” then switch to R and type make_req() or req_params <- straighten()), but they can take in a vector of cURL command lines, too (NOTE: make_req() is currently limited to one while straighten() can handle as many as you want).

Let’s show what happens using election results cURL command line:

REP <- "curl 'http://graphics.latimes.com/election-2016-31146-feed.json' -H 'Pragma: no-cache' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'X-Requested-With: XMLHttpRequest' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36' -H 'Accept: */*' -H 'Cache-Control: no-cache'  -H 'Cookie: s_fid=79D97B8B22CA721F-2DD12ACE392FF3B2; s_cc=true' -H 'Connection: keep-alive' -H 'If-Modified-Since: Wed, 10 Feb 2016 16:40:15 GMT' -H 'Referer: http://graphics.latimes.com/election-2016-new-hampshire-results/' --compressed"
 
resp <- curlconverter::straighten(REP)
jsonlite::toJSON(resp, pretty=TRUE)
 
    ## [
    ##   {
    ##     "url": ["http://graphics.latimes.com/election-2016-31146-feed.json"],
    ##     "method": ["get"],
    ##     "headers": {
    ##       "Pragma": ["no-cache"],
    ##       "DNT": ["1"],
    ##       "Accept-Encoding": ["gzip, deflate, sdch"],
    ##       "X-Requested-With": ["XMLHttpRequest"],
    ##       "Accept-Language": ["en-US,en;q=0.8"],
    ##       "User-Agent": ["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36"],
    ##       "Accept": ["*/*"],
    ##       "Cache-Control": ["no-cache"],
    ##       "Connection": ["keep-alive"],
    ##       "If-Modified-Since": ["Wed, 10 Feb 2016 16:40:15 GMT"],
    ##       "Referer": ["http://graphics.latimes.com/election-2016-new-hampshire-results/"]
    ##     },
    ##     "cookies": {
    ##       "s_fid": ["79D97B8B22CA721F-2DD12ACE392FF3B2"],
    ##       "s_cc": ["true"]
    ##     },
    ##     "url_parts": {
    ##       "scheme": ["http"],
    ##       "hostname": ["graphics.latimes.com"],
    ##       "port": {},
    ##       "path": ["election-2016-31146-feed.json"],
    ##       "query": {},
    ##       "params": {},
    ##       "fragment": {},
    ##       "username": {},
    ##       "password": {}
    ##     }
    ##   }
    ## ]

You can then use the items in the returned list to make a GET request manually (but still somewhat tediously).
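
For instance, here is a rough sketch of what that manual assembly might look like (this is not part of curlconverter itself; it just reuses the resp object from the straighten() call above, whose field names you can see in the JSON dump):

library(httr)
library(jsonlite)

req <- resp[[1]]   # straighten() returns one element per cURL command line

manual_resp <- GET(
  url = req$url,
  do.call(add_headers, req$headers),                   # re-attach every header
  do.call(set_cookies, as.list(unlist(req$cookies)))   # re-attach the cookies
)

# the response body is JSON, so parse it into R structures
results <- fromJSON(content(manual_resp, as = "text"))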

curlconverter‘s make_req() will try to do this conversion for you automagically using httr‘s little used VERB() function. It’s easier to show than to tell:

curlconverter::make_req(REP)
VERB(verb = "GET", url = "http://graphics.latimes.com/election-2016-31146-feed.json", 
     add_headers(Pragma = "no-cache", 
                 DNT = "1", `Accept-Encoding` = "gzip, deflate, sdch", 
                 `X-Requested-With` = "XMLHttpRequest", 
                 `Accept-Language` = "en-US,en;q=0.8", 
                 `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.39 Safari/537.36", 
                 Accept = "*/*", 
                 `Cache-Control` = "no-cache", 
                 Connection = "keep-alive", 
                 `If-Modified-Since` = "Wed, 10 Feb 2016 16:40:15 GMT", 
                 Referer = "http://graphics.latimes.com/election-2016-new-hampshire-results/"))

You probably don’t need all those headers, but you just need to delete what you don’t need vs trial-and-error build by hand. Try assigning the output of that function to a variable and inspecting what’s returned. I think you’ll find this is a big enhancement to your workflows (if you do alot of this “scraping without scraping”).

You can find the package on GitHub. It’s built with V8 and uses a modified version of the curlconverter Node module by Nick Carneiro.

It’s still in beta and could use some tyre kicking. Convos in the comments, issues or feature requests in GH (pls).

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Target Store Locations with rvest and ggmap


(This article was first published on R – Luke Stanke, and kindly contributed to R-bloggers)

I just finished developing a presentation for Target Analytics Network showcasing geospatial and mapping tools in R. I decided to use Target store locations as part of a case study in the presentation. The problem: I didn’t have any store location data, so I needed to get it from somewhere off the web. Since there are some great tools in R to get this information, mainly rvest for scraping and ggmap for geocoding, it wasn’t a problem. Instead of just doing the work, I thought I should share what this process looks like:

First, we can go to the Target website and find stores broken down by state.

Screen Shot 2016-02-15 at 4.14.41 PM

After finding this information, we can use the rvest package to scrape it. The URL is so nicely formatted that you can easily grab any state if you know the state’s mailing code, so we first set the state (Minnesota’s mailing code is MN) and then build the URL.

# Set the state.
state <- 'MN'

# Set the URL to borrow the data.
TargetURL <- paste0('http://www.target.com/store-locator/state-result?stateCode=', state)

Now that we have the URL, let’s grab the html from the webpage.

# Download the webpage.
TargetWebpage <-
  TargetURL %>%
  xml2::read_html()

Now we have to find the location of the table in the html code.

Screen Shot 2016-02-15 at 4.15.46 PM

Once we have found the html table, there are a number of ways we could extract the data from this location. I like to copy the XPath location. It’s a bit lazy, but for the purpose of this exercise it makes life easy.

Once we have the XPath location, it’s easy to extract the table from Target’s webpage. First we pipe the html through the html_nodes function, which isolates the html responsible for creating the store locations table. After that we can use html_table to parse the html table into an R list. We then use the data.frame function to turn the list into a data frame, and the select function from the dplyr library to select specific variables. One quirk of the extracted data is that the city, state, and zip code sit in a single column. It’s not really a problem for this exercise, but it’s maybe the perfectionist in me. Let’s use the separate function from the tidyr library to give city, state, and zipcode their own columns.

# Get all of the store locations.
TargetStores <-
  TargetWebpage %>%
  rvest::html_nodes(xpath = '//*[@id="stateresultstable"]/table') %>%
  rvest::html_table() %>%
  data.frame() %>%
  dplyr::select(`Store Name` = Store.Name, Address, `City/State/ZIP` = City.State.ZIP) %>%
  tidyr::separate(`City/State/ZIP`, into = c('City', 'Zipcode'), sep = paste0(', ', state)) %>%
  dplyr::mutate(State = state) %>%
  dplyr::as_data_frame()

Let’s get the coordinates for these stores; we can pass each store’s address through the geocode function which obtains the information from the Google Maps API — you can only geocode up to 2500 locations per day for free using the Google API.

# Geocode each store
# (the %<>% compound-assignment pipe comes from the magrittr package)
TargetStores %<>%
  dplyr::bind_cols(
    ggmap::geocode(
      paste0(
        TargetStores$`Store Name`, ', ',
        TargetStores$Address, ', ',
        TargetStores$City, ', ',
        TargetStores$State, ', ',
        TargetStores$Zipcode
      ),
      output = 'latlon',
      source = 'google'
    )
  )

Now that we have the data, let’s plot. In order to plot this data, we need to put it in a spatial data frame — we can do this using the SpatialPointsDataFrame and CRS functions from the sp package. We need to specify the coordinates, the underlying data, and the projections

# Make a spatial data frame
TargetStores <-
  sp::SpatialPointsDataFrame(
    coords = TargetStores %>% dplyr::select(lon, lat) %>% data.frame,
    data = TargetStores %>% data.frame,
    proj4string = sp::CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")
  )

Now that we have a spatial data frame, we can plot these points. I’m going to plot some other spatial data frames as well to add context for the Target store point data.

# Plot Target in Minnesota
plot(mnCounties, col = '#EAF6AE', lwd = .4, border = '#BEBF92', bg = '#F5FBDA')
plot(mnRoads, col = 'darkorange', lwd = .5, add = TRUE)
plot(mnRoads2, col = 'darkorange', lwd = .15, add = TRUE)
plot(mnRivers, lwd = .6, add = TRUE, col = '#13BACC')
plot(mnLakes, border = '#13BACC', lwd = .2, col = '#EAF6F9', add = TRUE)
plot(TargetStores, add = TRUE, col = scales::alpha('#E51836', .8), pch = 20, cex = .6)

Target Locations in Minnesota

Yes! We’ve done it. We’ve plotted Target stores in Minnesota. That’s cool and all, but really we haven’t done much with the data we just obtained. Stay tuned for the next post to see what else we can do with this data.

UPDATE: David Radcliffe of the Twin Cities R User group presented something similar using Walmart stores.

To leave a comment for the author, please follow the link and comment on their blog: R – Luke Stanke.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

The Gender of Big Data


(This article was first published on Ripples, and kindly contributed to R-bloggers)

When I grow up I want to be a dancer (Carmen, my beautiful daughter)

The presence of women in positions of responsibility inside Big Data companies is quite far from parity: while approximately 50% of the world population are women, only 7% of the CEOs of the Top 100 Big Data Companies are. Like it or not, technology seems to be a guy thing.

Big_Data_Gender
To do this experiment, I did some web scraping to download the list of big data companies from here. I also used a very interesting package called genderizeR, which makes gender predictions based on first names (more info here).

Here you have the code:

library(rvest)
library(stringr)
library(dplyr)
library(genderizeR)
library(ggplot2)
library(googleVis)
paste0("http://www.crn.com/slide-shows/data-center/300076704/2015-big-data-100-business-analytics.htm/pgno/0/", 1:45) %>%
c(., paste0("http://www.crn.com/slide-shows/data-center/300076709/2015-big-data-100-data-management.htm/pgno/0/",1:30)) %>%
c(., paste0("http://www.crn.com/slide-shows/data-center/300076740/2015-big-data-100-infrastructure-tools-and-services.htm/pgno/0/",1:25)) -> webpages
results=data.frame()
for(x in webpages)
{
read_html(x) %>% html_nodes("p:nth-child(1)") %>% .[[2]] %>% html_text() -> Company
read_html(x) %>% html_nodes("p:nth-child(2)") %>% .[[1]] %>% html_text() -> Executive
results=rbind(results, data.frame(Company, Executive))
}
results=data.frame(lapply(results, as.character), stringsAsFactors=FALSE)
results[74,]=c("Trifacta", "Top Executive: CEO Adam Wilson")
results %>% mutate(Name=gsub("Top|\\bExec\\S*|\\bCEO\\S*|President|Founder|and|Co-Founder|\\:", "", Executive)) %>%
mutate(Name=word(str_trim(Name))) -> results
results %>%
select(Name) %>%
findGivenNames() %>%
filter(probability > 0.9 & count > 15) %>%
as.data.frame() -> data
data %>% group_by(gender) %>% summarize(Total=n()) -> dat
doughnut=gvisPieChart(dat,
options=list(
width=450,
height=450,
legend="{ position: 'bottom', textStyle: {fontSize: 10}}",
chartArea="{left:25,top:50}",
title='TOP 100 BIG DATA COMPANIES 2015
Gender of CEOs',
colors="['red','blue']",
pieHole=0.5),
chartid="doughnut")
plot(doughnut)

To leave a comment for the author, please follow the link and comment on their blog: Ripples.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Web scraping with R & novel classification algorithms on unbalanced data


(This article was first published on BNOSAC - Belgium Network of Open Source Analytical Consultants, and kindly contributed to R-bloggers)

Tomorrow, the next RBelgium meeting will be held at the bnosac offices. This is the schedule.

Interested? Feel free to join the event. More info: http://www.meetup.com/RBelgium/events/228427510/

• 18h00-18h30: enter & meet other R users

• 18h30-19h00: Web scraping with R: live scraping products & prices of www.delhaize.be

• 19h15-20h00: State-of-the-art classification algorithms with unbalanced data. Package unbalanced: Racing for Unbalanced Methods Selection.

 

 

To leave a comment for the author, please follow the link and comment on their blog: BNOSAC - Belgium Network of Open Source Analytical Consultants.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Web scraping with R


(This article was first published on BNOSAC - Belgium Network of Open Source Analytical Consultants, and kindly contributed to R-bloggers)

For those of you who are interested in web scraping with R: enjoy the slides of our presentation on this topic from the last RBelgium meetup. The talk is about using rvest, RSelenium and our own package scrapeit.core, which makes deploying, logging and replaying your scrapes easier.

The slides below are in Flash, so make sure you don’t use an ad blocker, otherwise you won’t be able to view them.


 

If you are interested in scraping content from websites and feeding it into your analytical systems, let us know at bnosac.be/index.php/contact/mail-us so that we can set up a quick proof of concept to get your analytics rolling.

To leave a comment for the author, please follow the link and comment on their blog: BNOSAC - Belgium Network of Open Source Analytical Consultants.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Developing an R Tutorial shiny dashboard app


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Through this post, I would like to describe an R Tutorial Shiny app that I recently developed. You can access the app here. (Please open the app in Chrome, as some of the features may not work in IE. The app also includes a “ReadMe” introduction which provides a quick overview of how to use the app.)

The app provides a set of the most commonly performed data manipulation tasks and use cases, along with the R code/syntax for each use case, in a structured and easily navigable format. For people just starting off with R, this will hopefully be a useful tool to quickly figure out the code and syntax for their routine data analysis tasks.

In this post, I will not be getting into the basics of how to develop a Shiny app, since that is fairly well documented elsewhere, including in an article on DataScience+. What I will do, however, is focus on how this tutorial app was developed, specifically emphasizing some of its key components and features.

High Level App Overview

The basic structure of the app is fairly straightforward. For each topic a separate dataframe is created (I initially wrote the content in an Excel file, which I then imported as a dataframe in R). The individual topic datasets are then included in a list object.

The relevant R code for this is:

tutorial_set <- list(basic_operations = basic_operations,dplyr_tutorial = dplyr_tutorial,
                       loops_tutorial = loops_tutorial,model_tutorial = model_tutorial)

Depending on the user selection on the “Choose Topic” dropdown box, the relevant dataframe is extracted from the list object, with the following code:

selected_topic <- tutorial_set[[input$topic_select]]
# where topic_select is the inputId of the dropdown box with values similar to topic dataset names

Each dataset incorporates the relevant set of tasks, associated use cases and the underlying code and comments.

Enhance the app’s visual appeal using shinydashboard package

If you like Shiny, you will love it even more once you start using the shinydashboard package. This package, which is built on top of Shiny, can help you design visually stunning apps and dashboards. The tutorial app was not really meant to be a visual dashboard; the emphasis was on functionality, so I haven’t explored all the various themes, layouts, widgets etc. that this package provides. However, you can read an excellent overview of this package in RStudio’s posts here. Also, if you are familiar with Shiny, picking up shinydashboard should be a cakewalk. All that is required are some minor modifications to the UI side of your code (a minimal skeleton is sketched below) and then you are ready to go!
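
As a rough illustration, a generic shinydashboard skeleton might look like the following (this is not the tutorial app’s actual UI code; the input IDs echo the ones mentioned in this post, but the topic choices are placeholders):

library(shiny)
library(shinydashboard)

ui <- dashboardPage(
  dashboardHeader(title = "R Tutorial"),
  dashboardSidebar(
    # dropdown to pick a topic (choices here are placeholders)
    selectInput("topic_select", "Choose Topic",
                choices = c("basic_operations", "dplyr_tutorial"))
  ),
  dashboardBody(
    # datatable of use cases for the selected topic
    box(title = "Use Cases", width = 12,
        DT::dataTableOutput("use_cases"))
  )
)

server <- function(input, output) {
  # server logic (rendering the use-case table, etc.) goes here
}

shinyApp(ui, server)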

Interactive and dynamic datatables using the DT package

Rendering a dataset as an output is a fairly standard requirement while building an app. I had been using the default Shiny functions to render a datatable, till I came across the DT package. The package is basically an R interface to the JavaScript DataTables library, and you can read more about it here. Rendering datatables using the DT package can help provide a whole new level of interactivity to your app.

Datatables rendered using this package not only look better but also provide ways to capture user actions as they interact with the table. You can then program specific tasks to be triggered based on these user actions.

There is a whole range of different functionality that DT offers, but for the purpose of the tutorial app, I only needed to know which row the user clicks on (in the “use case” datatable). This can be done quite easily using the following code:

selected_index <- input$use_cases_row_last_clicked
# where use_cases is the inputId of the use case datatable and row_last_clicked returns the index of the selected row

Executing the underlying code and displaying the output

Once the index of the dataset row that the user clicks on is captured, we need to extract the relevant code from the tutorial dataset (which is fairly straightforward) and then execute that code (which requires a few, relatively less often used functions). The code for this section is given below:

#code_out is the output_id of the "Code Output" box on the app
output$code_out <- DT::renderDataTable({
     
    # Extract the formula from the dataset of the selected topic; the column is titled "formula"
    form <- as.character(selected_topic[selected_index, "formula"])
    
    # parse() parses the formula string as an expression, which is then evaluated using eval()
    # by default, parse expects the input to be a file, so text = form specifies that the input is a character string
    code_output <- eval(parse(text = form))

     
   # code output is then displayed as a datatable
   # The prefix "DT::" specifies that the datatable function from the DT package should be used rather than the deprecated Shiny version
   # selection = "single" specifies that the user can select only 1 row at a time
   # scrollX = TRUE displays a horizontal scroll bar and rownames = FALSE hides the rownames from the displayed output
   DT::datatable(data = code_output, selection = 'single', rownames = FALSE,
                      options = list(scrollX = TRUE))

      })

Rendering output using R Markdown

Once the index of the user-selected row is extracted and the underlying code is executed, the final step is to render the code and comments as HTML output (the output is displayed in the “Code and Comments” box of the app).
We use the R Markdown package to render the “code & comments” output in HTML format. (Getting up to speed on R Markdown, if you are not familiar with it, should not be a challenge at all. You can read about R Markdown here, and you can also refer to the cheat sheet, which provides a quick overview of the various formatting tags that you may need.)

For the purpose of the tutorial app, this is what was needed: generate an R Markdown document on the fly (depending on the user selection) and render the output as HTML, which can then be displayed in the app.

I had come across this app some time back, where the author had attempted something very similar, which I suitably modified for my tutorial app. Given below is the code for this section:

output$comment <- reactive({

        # Initialize a temp file
        t <- tempfile()
        selected_index <- as.numeric(input$use_cases_row_last_clicked)
 
        #Extract the comment from the selected topic dataset
        comment <- as.character(selected_topic[selected_index,"comment"])

        #cat command concatenates the output and prints the output to the temp file created
        cat(comment, file = t)
       
        # Use the R Markdown package's render function to render the file as an html document
        t <- render(
          input         = t,
          output_format = 'html_document')
 
        ## read the results, delete the temp file and return the output
        comment_html <- readLines(t)
        unlink(sub('.html$', '*', t))
        paste(comment_html, collapse = '\n')

      })

Hope you found the post useful. If you have any queries or questions, please feel free to comment below.

    Related Post

    1. Strategies to Speedup R Code
    2. Sentiment analysis with machine learning in R
    3. Sentiment Analysis on Donald Trump using R and Tableau
    4. Google scholar scraping with rvest package
    5. PubMed search Shiny App using RISmed

    To leave a comment for the author, please follow the link and comment on their blog: DataScience+.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Who Has the Best Fantasy Football Projections? 2016 Update


    (This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

    In prior posts, we demonstrated how to download projections from numerous sources, calculate custom projections for your league, and compare the accuracy of different sources of projections (2013, 2014, 2015).  In the latest version of our annual series, we hold the forecasters accountable and see who had the most and least accurate fantasy football projections over the last 4 years.

    The R Script

    You can download the R script for comparing the projections from different sources here.  You can download the historical projections and performance using our Projections tool.

    To compare the accuracy of the projections, we use the following metrics: R-squared (R2) and mean absolute scaled error (MASE).

    For a discussion of these metrics, see here and here.
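
    As a rough sketch of how these two metrics can be computed (illustrative only; in particular, the benchmark used to scale MASE here is an assumption for the sake of the example, not necessarily the one used in these analyses):

    # R-squared: squared correlation between projected and actual points
    r_squared <- function(actual, projected) {
      cor(actual, projected, use = "complete.obs")^2
    }

    # MASE: mean absolute error of the projection, scaled by the mean absolute
    # error of a naive benchmark projection (e.g., hypothetically, last season's points)
    mase <- function(actual, projected, benchmark) {
      mean(abs(actual - projected), na.rm = TRUE) /
        mean(abs(actual - benchmark), na.rm = TRUE)
    }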

    Whose Predictions Were the Best?

    The results are in the table below.  We compared the accuracy for projections of the following positions: QB, RB, WR, and TE.  The rows represent the different sources of predictions (e.g., ESPN, CBS) and the columns represent the different measures of accuracy for the last four years and the average across years.  The source with the best measure for each metric is in blue.
    Source: 2012 (R2, MASE), 2013 (R2, MASE), 2014 (R2, MASE), 2015 (R2, MASE), Average (R2, MASE). Sources that were not scored in every year show values only for the years available.
    Fantasy Football Analytics: Average .670 .545 .567 .635 .618 .577 .626 .553 .620 .578
    Fantasy Football Analytics: Robust Average .667 .549 .561 .636 .613 .581 .628 .554 .617 .580
    Fantasy Football Analytics: Weighted Average .626 .553
    CBS Average .637 .604 .479 .722 .575 .632 .500 .664 .548 .656
    EDS Football .554 .651 .584 .624 .569 .638
    ESPN .576 .669 .500 .705 .498 .723 .615 .585 .547 .671
    FantasySharks .529 .673
    FFtoday .661 .551 .550 .646 .530 .659 .546 .626 .572 .621
    FOX Sports .459 .720 .550 .677 .505 .699
    NFL.com .551 .650 .505 .709 .518 .692 .582 .632 .539 .671
    numberFire .486 .712 .560 .643 .523 .678
    RTSports .547 .670
    WalterFootball .472 .713 .431 .724 .452 .719
    Yahoo .547 .645 .635 .554 .591 .600
    Here is how the projections ranked over the last four years (based on MASE):
    1. Fantasy Football Analytics: Average (or Weighted Average)
    2. Fantasy Football Analytics: Robust Average
    3. Yahoo
    4. FFtoday
    5. EDS Football
    6. CBS Average
    7. RTSports
    8. ESPN
    9. NFL.com
    10. FantasySharks
    11. numberFire
    12. FOX Sports
    13. WalterFootball

    Notes: CBS estimates were averaged across Jamey Eisenberg and Dave Richard.  FantasyFootballNerd projections were not included because the full projections are subscription only.  We did not calculate the weighted average prior to 2015.  The accuracy estimates may differ slightly from those provided in prior years because a) we now use standard league scoring settings (you can see the league scoring settings we used here) and b) we are only examining the following positions: QB, RB, WR, and TE. The weights for the weighted average were based on historical accuracy (1-MASE).  For the analysts not included in the accuracy calculations, we calculated the average (1-MASE) value and subtracted 1/2 the standard deviation of (1-MASE).  The weights in the weighted average for 2015 were:

    CBS Average: .428
    EDS Football: .428
    ESPN: .383
    FantasyFootballNerd: .428
    FFToday: .482
    FOX Sports: .428
    NFL.com: .384
    numberFire: .404
    RTSports.com: .428
    WalterFootball: .428
    Yahoo Sports: .433

    Here is a scatterplot of our average projections in relation to players’ actual fantasy points scored in 2015:

    Accuracy 2015

     

    Interesting Observations

    1. Projections that combined multiple sources of projections (FFA Average, Weighted Average, Robust Average) were more accurate than all single sources of projections (e.g., CBS, NFL.com, ESPN) every year.  This is consistent with the wisdom of the crowd.
    2. The simple average (mean) was more accurate than the robust average.  The robust average gives extreme values less weight in the calculation of the average.  This suggests that outliers may reflect meaningful sources of variance (i.e., they may help capture a player’s ceiling/floor) and may not be bad projections (i.e., error/noise).
    3. The weighted average was about as accurate as the simple average.  Weights were based on historical accuracy.  If the best analysts are consistently more accurate than other analysts, the weighted average will likely outperform the mean.  If, on the other hand, analysts don’t reliably outperform each other, the mean might be more accurate.  (A toy illustration of these three kinds of averages appears after this list.)
    4. The FFA Average explained 57–67% of the variation in players’ actual performance.  That means that the projections are somewhat accurate but have much room for improvement in terms of prediction accuracy.  1/3 to 1/2 of the variance in actual points is unexplained by projections.  Nevertheless, the projections are likely more accurate than pre-season rankings.
    5. The R-squared of the FFA average projection was .67 in 2012, .57 in 2013, .62 in 2014, and .63 in 2015.  This suggests that players are more predictable in some years than others.
    6. There was little consistency in performance across time among sites that used single projections (CBS, NFL.com, ESPN). In 2012, CBS was the most accurate single source of projection but they were the least accurate in 2013.  Moreover, ESPN was among the least accurate in 2014, but they were among the most accurate in 2015.  This suggests that no single source reliably outperforms the others.  While some sites may do better than others in any given year (because of fairly random variability–i.e., chance), it is unlikely that they will continue to outperform the other sites.
    7. Projections were more accurate for some positions than others.  Projections were much more accurate for QBs and WRs than for RBs.  Projections were the least accurate for Ks, DBs, and DSTs.  For more info, see here.  Here is how positions ranked in accuracy of their projections (from most to least accurate):
      1. QB: R2 = .71
      2. WR: R2 = .57
      3. LB: R2 = .56
      4. TE: R2 = .54
      5. DL: R2 = .48
      6. RB: R2 = .47
      7. K: R2 = .38
      8. DB: R2 = .32
      9. DST: R2 = .15
    8. Projections over-estimated players’ performance by about 4–10 points every year across most positions (based on mean error).  It will be interesting to see if this pattern holds in future seasons.  If it does, we could account for this over-expectation in players’ projections.  In a future post, I hope to explore the types of players for whom this over-expectation occurs.
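
    As a toy illustration of the three kinds of averages discussed in points 2 and 3 above (the projections below are made up; the weights are the 2015 weights listed in the notes):

    # hypothetical projections for one player from four sources
    proj    <- c(CBS = 250, ESPN = 230, NFL.com = 215, Yahoo = 240)
    weights <- c(CBS = .428, ESPN = .383, NFL.com = .384, Yahoo = .433)

    mean(proj)                        # simple average
    mean(proj, trim = 0.25)           # one way to build a "robust" average (a trimmed mean)
    weighted.mean(proj, w = weights)  # weighted average, weights based on 1 - MASE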

    Conclusion

    Fantasy Football Analytics had the most accurate projections over the last four years.  Why?  We average across sources.  Combining sources of projections removes some of their individual judgment biases (error) and gives us a more accurate fantasy projection.  No single source (CBS, NFL.com, ESPN) reliably outperformed the others or the crowd, suggesting that differences between them are likely due in large part to chance.  In sum, crowd projections are more accurate than individuals’ judgments for fantasy football projections.  People often like to “go with their gut” when picking players.  That’s fine—fantasy football is a game.  Do what is fun for you.  But, crowd projections are the most reliably accurate of any source.  Do with that what you will!  But don’t take my word for it.  Examine the accuracy yourself with our Projections tool and see what you find.  And let us know if you find something interesting!

    The post Who Has the Best Fantasy Football Projections? 2016 Update appeared first on Fantasy Football Analytics.

    To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Performing SQL selects on R data frames


    (This article was first published on DataScience+, and kindly contributed to R-bloggers)

    For anyone who has a SQL background and wants to learn R, the sqldf package is very useful because it enables us to use SQL commands in R. Anyone with basic SQL skills can manipulate R data frames using those skills. You can read more about the sqldf package on CRAN.

    In this post, we will see how to perform joins and other queries to retrieve, sort and filter data in R using SQL. We will also see how to achieve the same tasks using non-SQL R commands. Currently, I am working with the FDA adverse events data. These datasets are ideal for a data scientist, or any data enthusiast, to work with because they contain almost every kind of data messiness: missing values, duplicates, compatibility problems between datasets created in different time periods, variable names and numbers of variables changing over time (e.g., sex in one dataset and gender in another, if_cod in one dataset and if_code in another), errors, etc.

    In this post, we will use FDA adverse events data, which are publicly available. The FDA adverse events datasets can also be downloaded in .csv format from the National Bureau of Economic Research. Since it is easier to automate the data download in R from the National Bureau of Economic Research, we will download various datasets from this website. I encourage you to try the R code, download data from the website, and explore the adverse events datasets.

    The adverse events datasets are created at a quarterly temporal resolution, and each quarter’s data includes demography information, drug/biologic information, adverse events, outcomes, diagnoses, etc.

    Let’s download some datasets and use SQL queries to join, sort and filter the various data frames.

    Load R Packages

    require(downloader)
    library(dplyr)
    library(sqldf)
    library(data.table)
    library(ggplot2)
    library(compare)
    library(plotrix)

    Basic error handling with tryCatch()

    We will use the function below to download the data. Since the datasets have a quarterly time resolution, there are four datasets per year for each data type, so the function automates the download. If a quarterly dataset is not yet available on the website (not yet released), the function prints an error message saying that the dataset was not found. We download the zipped data and unzip it.

    try.error = function(url)
    {
      try_error = tryCatch(download(url,dest="data.zip"), error=function(e) e)
      if (!inherits(try_error, "error")){
          download(url,dest="data.zip")
            unzip ("data.zip")
          }
        else if (inherits(try_error, "error")){
        cat(url,"not found\n")
          }
          }

    Download adverse events data

    The FDA adverse events data are available since 2004. In this post, let’s download data since 2013. We will check if there are data up to the present time and download what we can get.

    • Sys.time() gives the current date and time
    • the function year() from the data.table package gets the year from the current date and time data that we obtain from the Sys.time() function

    We are downloading demography, drug, diagnosis/indication, outcome and reaction (adverse event) data.

    year_start=2013
    year_last=year(Sys.time())
    for (i in year_start:year_last){
                j=c(1:4)
                for (m in j){
                url1<-paste0("http://www.nber.org/fda/faers/",i,"/demo",i,"q",m,".csv.zip")
                url2<-paste0("http://www.nber.org/fda/faers/",i,"/drug",i,"q",m,".csv.zip")
                url3<-paste0("http://www.nber.org/fda/faers/",i,"/reac",i,"q",m,".csv.zip")
                url4<-paste0("http://www.nber.org/fda/faers/",i,"/outc",i,"q",m,".csv.zip")
                url5<-paste0("http://www.nber.org/fda/faers/",i,"/indi",i,"q",m,".csv.zip")
               try.error(url1)
               try.error(url2)
               try.error(url3)
               try.error(url4)
               try.error(url5)     
                }
            }
    
    http://www.nber.org/fda/faers/2015/demo2015q4.csv.zip not found
    ...
    http://www.nber.org/fda/faers/2016/indi2016q4.csv.zip not found

    From the error messages shown above, at the time of writing this post, we see that the database has adverse events data up to the third quarter of 2015.

    • list.files() produces a character vector of the names of the files in my working directory.
    • I am using regular expressions to select each category of dataset. For example, ^demo.*.csv matches all files that start with demo and end with .csv.
    filenames <- list.files(pattern="^demo.*.csv", full.names=TRUE)
    cat('We have downloaded the following quarterly demography datasets')
    filenames

    We have downloaded the following quarterly demography datasets

    "./demo2012q1.csv" "./demo2012q2.csv" "./demo2012q3.csv" "./demo2012q4.csv" "./demo2013q1.csv" "./demo2013q2.csv" "./demo2013q3.csv" "./demo2013q4.csv" "./demo2014q1.csv" "./demo2014q2.csv" "./demo2014q3.csv" "./demo2014q4.csv" "./demo2015q1.csv" "./demo2015q2.csv" "./demo2015q3.csv" 

    Now let’s use fread() function from the data.table Package to read in the datasets. Let’s start with the demography data:

    demo=lapply(filenames,fread)

    Now, let’s change them to data frames and concatenate them to create one demography data

    demo_all=do.call(rbind,lapply(1:length(demo),function(i) select(as.data.frame(demo[i]),primaryid,caseid, age,age_cod,event_dt,sex,reporter_country)))
    dim(demo_all)
            3554979   7 

    We see that our demography data has more than 3.5 million rows.

    Now, lets’ merge all drug files

    filenames <- list.files(pattern="^drug.*.csv", full.names=TRUE)
    cat('We have downloaded the following quarterly drug datasets:\n')
    filenames
    drug=lapply(filenames,fread)
    cat('\n')
    cat('Variable names:\n')
    names(drug[[1]])
    drug_all=do.call(rbind,lapply(1:length(drug), function(i) select(as.data.frame(drug[i]),primaryid,caseid, drug_seq,drugname,route)))

    We have downloaded the following quarterly drug datasets:

    "./drug2012q1.csv" "./drug2012q2.csv" "./drug2012q3.csv" "./drug2012q4.csv" "./drug2013q1.csv" "./drug2013q2.csv" "./drug2013q3.csv" "./drug2013q4.csv" "./drug2014q1.csv" "./drug2014q2.csv" "./drug2014q3.csv" "./drug2014q4.csv" "./drug2015q1.csv" "./drug2015q2.csv" "./drug2015q3.csv"

    Variable names:

    "primaryid" "drug_seq" "role_cod" "drugname" "val_vbm" "route" "dose_vbm" "dechal" "rechal" "lot_num" "exp_dt" "exp_dt_num" "nda_num" 

    Merge all diagnoses/indications datasets

    filenames <- list.files(pattern="^indi.*.csv", full.names=TRUE)
    cat('We have downloaded the following quarterly diagnoses/indications datasets:\n')
    
    filenames
    
    indi=lapply(filenames,fread)
    
    cat('\n')
    cat('Variable names:\n')
    
    names(indi[[15]])
    
    indi_all=do.call(rbind,lapply(1:length(indi), function(i) select(as.data.frame(indi[i]),primaryid,caseid, indi_drug_seq,indi_pt)))
    

    We have downloaded the following quarterly diagnoses/indications datasets:

    "./indi2012q1.csv" "./indi2012q2.csv" "./indi2012q3.csv" "./indi2012q4.csv" "./indi2013q1.csv" "./indi2013q2.csv" "./indi2013q3.csv" "./indi2013q4.csv" "./indi2014q1.csv" "./indi2014q2.csv" "./indi2014q3.csv" "./indi2014q4.csv" "./indi2015q1.csv" "./indi2015q2.csv" "./indi2015q3.csv"

    Variable names:

    "primaryid" "caseid" "indi_drug_seq" "indi_pt" 

    Merge patient outcomes

    filenames <- list.files(pattern="^outc.*.csv", full.names=TRUE)
    cat('We have downloaded the following quarterly patient outcome datasets:\n')
    
    filenames
    outc_all=lapply(filenames,fread)
    
    cat('\n')
    cat('Variable names\n')
    
    names(outc_all[[1]])
    names(outc_all[[4]])
    colnames(outc_all[[4]])=c("primaryid", "caseid", "outc_cod")
    outc_all=do.call(rbind,lapply(1:length(outc_all), function(i) select(as.data.frame(outc_all[i]),primaryid,outc_cod)))

    We have downloaded the following quarterly patient outcome datasets:

    "./outc2012q1.csv" "./outc2012q2.csv" "./outc2012q3.csv" "./outc2012q4.csv" "./outc2013q1.csv" "./outc2013q2.csv" "./outc2013q3.csv" "./outc2013q4.csv" "./outc2014q1.csv" "./outc2014q2.csv" "./outc2014q3.csv" "./outc2014q4.csv" "./outc2015q1.csv" "./outc2015q2.csv" "./outc2015q3.csv" 

    Variable names

        "primaryid" "outc_cod" 
        "primaryid" "caseid" "outc_code" 

    Finally, merge reaction (adverse event) datasets

    filenames <- list.files(pattern="^reac.*.csv", full.names=TRUE)
    cat('We have downloaded the following quarterly reaction (adverse event) datasets:\n')
    
    filenames
    reac=lapply(filenames,fread)
    
    cat('\n')
    cat('Variable names:\n')
    names(reac[[3]])
    
    reac_all=do.call(rbind,lapply(1:length(reac), function(i) select(as.data.frame(reac[i]),primaryid,pt)))
    

    We have downloaded the following quarterly reaction (adverse event) datasets:

    "./reac2012q1.csv" "./reac2012q2.csv" "./reac2012q3.csv" "./reac2012q4.csv" "./reac2013q1.csv" "./reac2013q2.csv" "./reac2013q3.csv" "./reac2013q4.csv" "./reac2014q1.csv" "./reac2014q2.csv" "./reac2014q3.csv" "./reac2014q4.csv" "./reac2015q1.csv" "./reac2015q2.csv" "./reac2015q3.csv"

    Variable names:

      "primaryid" "pt" 

    Let’s see number of rows from each data type

    all=as.data.frame(list(Demography=nrow(demo_all),Drug=nrow(drug_all),
                       Indications=nrow(indi_all),Outcomes=nrow(outc_all),
                       Reactions=nrow(reac_all)))
    row.names(all)='Number of rows'
    all

    table1

    SQL queries

    Keep in mind that sqldf uses SQLite.

    COUNT

    #SQL
    sqldf("SELECT COUNT(primaryid)as 'Number of rows of Demography data'
    FROM demo_all;")

    table2

    # R
    nrow(demo_all)
    3554979 

    LIMIT

    #  SQL
    sqldf("SELECT *
    FROM demo_all 
    LIMIT 6;")

    table3

    #R
    head(demo_all,6)

    table4

    R1=head(demo_all,6)
    SQL1 =sqldf("SELECT *
    FROM demo_all 
    LIMIT 6;")
    all.equal(R1,SQL1)
    TRUE

    WHERE

    SQL2=sqldf("SELECT *
    FROM demo_all WHERE sex ='F';")
    R2 = filter(demo_all, sex=="F")
    identical(SQL2, R2)
    TRUE
    SQL3=sqldf("SELECT *
    FROM demo_all WHERE age BETWEEN 20 AND 25;")
    R3 = filter(demo_all, age >= 20 & age <= 25)
    identical(SQL3, R3)
    TRUE

    GROUP BY and ORDER BY

    #SQL
    sqldf("SELECT sex, COUNT(primaryid) as Total
    FROM demo_all
    WHERE sex IN ('F','M','NS','UNK')
    GROUP BY sex
    ORDER BY Total DESC ;")

    table53

    # R
    demo_all%>%filter(sex %in%c('F','M','NS','UNK'))%>%group_by(sex) %>%
            summarise(Total = n())%>%arrange(desc(Total))

    table53

    SQL3 = sqldf("SELECT sex, COUNT(primaryid) as Total
    FROM demo_all
    GROUP BY sex
    ORDER BY Total DESC ;")
    
    R3 = demo_all%>%group_by(sex) %>%
            summarise(Total = n())%>%arrange(desc(Total))
    
    compare(SQL3,R3, allowAll=TRUE)
    TRUE
      dropped attributes
    SQL=sqldf("SELECT sex, COUNT(primaryid) as Total
    FROM demo_all
    WHERE sex IN ('F','M','NS','UNK')
    GROUP BY sex
    ORDER BY Total DESC ;")
    SQL$Total=as.numeric(SQL$Total)
    pie3D(SQL$Total, labels = SQL$sex,explode=0.1,col=rainbow(4),
       main="Pie Chart of adverse event reports by gender",cex.lab=0.5, cex.axis=0.5, cex.main=1,labelcex=1)

    This is the plot:
    fig1

    Inner Join

    Let’s join drug data and indication data based on primary id and drug sequence
    First, let’s check the variable names to see how to merge the two datasets.

    names(indi_all)
    names(drug_all)
    
        "primaryid" "indi_drug_seq" "indi_pt" 
        "primaryid" "drug_seq" "drugname" "route" 
    
    names(indi_all)=c("primaryid", "drug_seq", "indi_pt" ) # so that both data frames use the same name (drug_seq)
    R4= merge(drug_all,indi_all, by = intersect(names(drug_all), names(indi_all)))
    R4=arrange(R4, primaryid,drug_seq,drugname,indi_pt)
    SQL4= sqldf("SELECT d.primaryid as primaryid, d.drug_seq as drug_seq, d.drugname as drugname,
                           d.route as route,i.indi_pt as indi_pt
                           FROM drug_all d
                           INNER JOIN indi_all i
                          ON d.primaryid= i.primaryid AND d.drug_seq=i.drug_seq
                          ORDER BY primaryid,drug_seq,drugname, i.indi_pt")
    compare(R4,SQL4,allowAll=TRUE)
    TRUE
    
    R5 = merge(reac_all,outc_all,by=intersect(names(reac_all), names(outc_all)))
    SQL5 =reac_outc_new4=sqldf("SELECT r.*, o.outc_cod as outc_cod
                         FROM reac_all r 
                         INNER JOIN outc_all o
                         ON r.primaryid=o.primaryid
                         ORDER BY r.primaryid,r.pt,o.outc_cod")
    
    compare(R5,SQL5,allowAll = TRUE)
    TRUE
    
    ggplot(sqldf('SELECT age, sex
                 FROM demo_all
                 WHERE age between 0 AND 100 AND sex IN ("F","M")
                 LIMIT 10000;'), aes(x=age, fill = sex))+ geom_density(alpha = 0.6)

    This is the plot:
    fig2

    ggplot(sqldf("SELECT d.age as age, o.outc_cod as outcome
                         FROM demo_all d
                         INNER JOIN outc_all o
                         ON d.primaryid=o.primaryid
                         WHERE d.age BETWEEN 20 AND 100
                         LIMIT 20000;"),aes(x=age, fill = outcome))+ geom_density(alpha = 0.6)

    This is the plot:
    fig3

    ggplot(sqldf("SELECT de.sex as sex, dr.route as route
                         FROM demo_all de
                         INNER JOIN drug_all dr
                         ON de.primaryid=dr.primaryid
                         WHERE de.sex IN ('M','F') AND dr.route IN ('ORAL','INTRAVENOUS','TOPICAL')
                         LIMIT 200000;"),aes(x=route, fill = sex))+ geom_bar(alpha=0.6)

    This is the plot:
    fig4

    ggplot(sqldf("SELECT d.sex as sex, o.outc_cod as outcome
                         FROM demo_all d
                         INNER JOIN outc_all o
                         ON d.primaryid=o.primaryid
                         WHERE d.age BETWEEN 20 AND 100 AND sex IN ('F','M')
                         LIMIT 20000;"),aes(x=outcome,fill=sex))+ geom_bar(alpha = 0.6)

    This is the plot:
    fig5

    demo1= demo_all[1:20000,]
    demo2=demo_all[20001:40000,]

    UNION ALL

    R6 <- rbind(demo1, demo2)
    SQL6 <- sqldf("SELECT  * FROM demo1 UNION ALL SELECT * FROM demo2;")
    compare(R6,SQL6, allowAll = TRUE)
    TRUE

    INTERSECT

    R7 <- semi_join(demo1, demo2)
    SQL7 <- sqldf("SELECT  * FROM demo1 INTERSECT SELECT * FROM demo2;")
    compare(R7,SQL7, allowAll = TRUE)
    TRUE

    EXCEPT

    R8 <- anti_join(demo1, demo2)
    SQL8 <- sqldf("SELECT  * FROM demo1 EXCEPT SELECT * FROM demo2;")
    compare(R8,SQL8, allowAll = TRUE)
    TRUE

    See you in my next post! If you have any questions or feedback, feel free to leave a comment.

      Related Post

      1. Developing an R Tutorial shiny dashboard app
      2. Strategies to Speedup R Code
      3. Sentiment analysis with machine learning in R
      4. Sentiment Analysis on Donald Trump using R and Tableau
      5. Google scholar scraping with rvest package

      To leave a comment for the author, please follow the link and comment on their blog: DataScience+.

      R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      On the growth of CRAN packages


      (This article was first published on Revolutions, and kindly contributed to R-bloggers)

      by Andrie de Vries

      Every once in a while somebody asks me how many packages are on CRAN. (More than 8,000 in April, 2016).  A year ago, in April 2015, there were ~6,200 packages on CRAN.

      This poses a second question: what is the historical growth of CRAN packages?

      One source of information is Bob Muenchen's blog post R Now Contains 150 Times as Many Commands as SAS, which contains this graphic showing packages from 2002 through 2014. (Bob fitted a quadratic curve through the data, which fits quite well, except that this model estimates too high in the very early years.)

      CRAN package data through 2014 by Bob Muenchen

      But where does this data come from? Bob's article references an earlier article by John Fox in the R Journal, Aspects of the Social Organization and Trajectory of the R Project. (This is a fascinating article, and I highly recommend you read it.) John Fox's analysis contains this graphic showing data from 2001 through 2009; he fits an exponential growth curve through the data, which again fits very well:

      CRAN package data through 2009 by John Fox

      I was particularly interested in trying to see if I can find the original source of the data. The original graphic contains a caption with references to the R source code on SVN, but I could only find the release dates of historical R releases, not the package counts.

      Next I put the search term "john fox 2009 cran package data" into my favourite search engine and came across the CRANpackages dataset in the Ecdat package. The Ecdat package contains data sets for econometrics, compiled by Spencer Graves.

      I promptly installed the package and inspected the data: 

      > library(Ecdat)
      
      > head(CRANpackages)
        Version       Date Packages            Source
      1     1.3 2001-06-21      110         John Fox 
      2     1.4 2001-12-17      129         John Fox 
      3     1.5 2002-05-29      162         John Fox 
      4     1.6 2002-10-01      163 John Fox, updated
      5     1.7 2003-05-27      219         John Fox 
      6     1.8 2003-11-16      273         John Fox 
      > tail(CRANpackages)
         Version       Date Packages         Source
      24    2.15 2012-07-07     4000       John Fox
      25    2.15 2012-11-01     4082 Spencer Graves
      26    2.15 2012-12-14     4210 Spencer Graves
      27    2.15 2013-10-28     4960 Spencer Graves
      28    2.15 2013-11-08     5000 Spencer Graves
      29     3.1 2014-04-13     5428 Spencer Graves

       This data is exactly what I was after, but what is the origin?

      > ?CRANpackages

      Data casually collected on the number of packages on the Comprehensive R Archive Network (CRAN) at different dates.

      So it seems this gets compiled and updated by hand, originally by John Fox, and more recently by Spencer Graves himself.

      Can we do better?

      This set me thinking. Can we do better and automate this process by scraping CRAN?

      This is in fact possible, and you can find the source data at CRAN for older, archived releases (R-1.7 in 2004 through R-2.10 in 2010) as well as more recent releases.

      However, you will have to scrape the dates from a list of package release dates for each historic release (you can find my code at the bottom of this blog).
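
      For illustration only, here is a minimal sketch of the idea for the current release (not the full script referenced above, which also walks the archived releases): count the packages on CRAN and pull publication dates from the standard "available packages by date" listing. The URL is the usual CRAN listing page and may of course change over time.

      library(rvest)

      # Number of packages currently on CRAN
      nrow(available.packages(contriburl = contrib.url("https://cloud.r-project.org")))

      # Publication dates from the "available packages by date" listing
      url <- "https://cran.r-project.org/web/packages/available_packages_by_date.html"
      pkgs_by_date <- read_html(url) %>%
        html_node("table") %>%
        html_table()
      head(pkgs_by_date)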

      The results

      I get the following result. Note that the rug marks indicate the release date and number of packages for each release. The data is linear, not log, but the rug marks give the illusion of a logarithmic scale.

      CRAN package data through 2016 by Andrie de Vries

      Caveat

      I took a few shortcuts in the analysis:

      • For each release, the actual data is a list of packages together with the publication date of each package. I took the "release date" to be the very last package publication date, so my estimate of each release date is biased late; the actual release would have occurred somewhat earlier.
      • I made no attempt to find the data prior to 2004.

      Further work

      The analysis would really benefit from fitting some curves through the data. Specifically, I would like to fit an exponential growth curve to see whether the contribution rate is steady, accelerating, or decelerating. Might an S-curve fit the data better?

      The plot itself needs additional labels for the dot releases.

      I hope to address these in a follow-up post.
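
      In the meantime, here is a rough sketch of the first item (my illustration, not part of the analysis above): fitting exponential growth on a log scale to the hand-collected counts in Ecdat::CRANpackages.

      library(Ecdat)

      d <- CRANpackages
      d$Date  <- as.Date(d$Date)
      d$years <- as.numeric(d$Date - min(d$Date)) / 365.25

      # log-linear fit: log(Packages) = a + b * years, i.e. exponential growth
      fit <- lm(log(Packages) ~ years, data = d)
      exp(coef(fit)["years"]) - 1   # implied growth rate per year

      plot(d$Date, d$Packages, log = "y", pch = 16,
           xlab = "Date", ylab = "Packages on CRAN (log scale)")
      lines(d$Date, exp(predict(fit)))

      Systematic curvature in the residuals around this line would be a first hint that an S-curve (for example a logistic model fitted with nls()) describes the data better.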

      The code and data

      To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


      The Hype Bubble Map for Dog Breeds


      (This article was first published on Ripples, and kindly contributed to R-bloggers)

      In the whole history of the world there is but one thing that money can not buy… to wit the wag of a dog’s tail (Josh Billings)

      In this post I combine several things:

      • Simple web scraping to read the list of companion dogs from Wikipedia. I love the rvest package for these things.
      • Google Trends queries to download the evolution of searches for each breed over the last 6 months. I use the gtrendsR package for this and it works quite well.
      • A dynamic Highcharts visualization using the awesome highcharter package.
      • A static ggplot visualization.

      The experiment is based on a simple idea: what people search for on the Internet reflects what people do. Can Google Trends be a useful tool for predicting which breeds will become fashionable in the future? To be honest, I don't really know, but I will make my own bet.

      What I have done is to extract the last 6 months of Google Trends data for this list of companion breeds. After some simple text mining, I divide the set of names into 5-element subsets, because the Google API doesn't allow searches with more than 5 items. The result of a query to Google Trends is a normalized time series, meaning the 0–100 values are relative, not absolute, measures: all the interest data for your keywords is divided by the highest point of interest over that date range. To make all 5-item result sets comparable, I always include the King Charles Spaniel breed in every search (a kind of undercover agent I use to compare search levels). The resulting number is my "Level", the Y-axis of the plot. I limit searches to cat="0-66", which restricts results to the Animals and Pets category. Thanks, Philippe, for your help on this point. I also restrict searches to the United States.

      There are several ways to obtain an aggregated trend indicator from a time series. My choice here was to apply a short moving average (order 2) to the interest-over-time series obtained from Google. Then I divide the weekly variations by the smoothed time series; the trend indicator is the mean of these values. To obtain a robust indicator, I remove outliers from the original time series. This is my X-axis.
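
      To make the indicator concrete, here is a toy illustration on a made-up weekly interest series (the real input comes from gtrendsR, as in the full code below):

      library(forecast)   # ma()
      library(outliers)   # rm.outlier()

      interest <- c(40, 42, 45, 43, 50, 48, 55, 60, 58, 62, 65, 70)  # hypothetical weekly values

      s     <- rm.outlier(interest, fill = TRUE)            # replace the most extreme value by the mean
      sm    <- ma(s, order = 2)                             # short moving average (order 2)
      trend <- mean(diff(sm) / head(sm, -1), na.rm = TRUE)  # mean relative weekly variation
      trend

      A positive value means the smoothed interest is, on average, still growing week over week; that is what ends up on the X-axis.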

      This is how dog breeds are arranged with respect to my Trend and Level indicators:

      (Plot: HypeBubbleGgplot)

      Inspired by Gartner’s Hype Cycle of Emerging Technologies I distinguish two sets of dog breeds:

      • Plateau of Productivity Breeds (successful breeds with a very high level indicator and positive trend): Golden Retriever, Pomeranian, Chihuahua, Collie and Shih Tzu.
      • Innovation Trigger Breeds (promising dog breeds with a very high trend indicator and low level): Mexican Hairless Dog, Keeshond, West Highland White Terrier and German Spitz.

      I recently discovered a wonderful package called highcharter, which allows you to create incredibly cool dynamic visualizations. I love it, and I could not resist using it to redo the previous plot with the look and feel of The Economist. This is a screenshot (reproduce it to play with its interactivity):

      (Screenshot: BubbleEconomist)
      And here comes my prediction: after analyzing the Innovation Trigger Breeds, my bet is that the Keeshond will increase in popularity in the near future. Don't you think it is lovely?

      (Photo: 640px-Little_Puppy_Keeshond, by Terri Brown, Flickr: IMG_4723, CC BY 2.0)

      Here is the code:

      library(gtrendsR)
      library(rvest)
      library(dplyr)
      library(stringr)
      library(forecast)
      library(outliers)
      library(highcharter)
      library(ggplot2)
      library(scales)
      
      #Webscraping
      x="https://en.wikipedia.org/wiki/Companion_dog"
      read_html(x) %>%
        html_nodes("ul:nth-child(19)") %>%
        html_text() %>%
        strsplit(., "\n") %>%
        unlist() -> breeds
      
      #Some simple cleansing
      breeds=iconv(breeds[breeds!= ""], "UTF-8")
      
      usr <- "YOUR GOOGLE ACCOUNT"
      psw <- "YOUR GOOGLE PASSWORD"
      gconnect(usr, psw)
      
      #Reference (undercover agent)
      ref="King Charles Spaniel"
      
      #Remove the undercover agent from the set of breeds
      breeds=setdiff(breeds, ref)
      
      #Subsets. Do not worry about warning message
      sub.breeds=split(breeds, 1:ceiling(length(breeds)/4))
      
      #Loop to obtain google trends of each 5-items subset
      results=list()
      for (i in 1:length(sub.breeds))
      {
        res <- gtrends(unlist(union(ref, sub.breeds[i])),
                       start_date = Sys.Date()-180,
                       cat="0-66",
                       geo="US")
        results[[i]]=res
      }
      
      #Loop to obtain trend and level indicator of each breed
      trends=data.frame(name=character(0), level=numeric(0), trend=numeric(0))
      for (i in 1:length(results))
      {
        df=results[[i]]$trend
        lr=mean(results[[i]]$trend[,3]/results[[1]]$trend[,3])
        for (j in 3:ncol(df))
        {
          s=rm.outlier(df[,j], fill = TRUE)
          t=mean(diff(ma(s, order=2))/ma(s, order=2), na.rm = T)
          l=mean(results[[i]]$trend[,j]/lr)
          trends=rbind(data.frame(name=colnames(df)[j], level=l, trend=t), trends)
        }
      }
      
      #Prepare data for visualization
      trends %>%
        group_by(name) %>%
        summarize(level=mean(level), trend=mean(trend*100)) %>%
        filter(level>0 & trend > -10 & level<500) %>%
        na.omit() %>%
        mutate(name=str_replace_all(name, ".US","")) %>%
        mutate(name=str_replace_all(name ,"[[:punct:]]"," ")) %>%
        rename(
          x = trend,
          y = level
        ) -> trends
      trends$y=(trends$y/max(trends$y))*100
      
      #Dinamic chart as The Economist
      highchart() %>%
        hc_title(text = "The Hype Bubble Map for Dog Breeds") %>%
        hc_subtitle(text = "According Last 6 Months of Google Searchings") %>%
        hc_xAxis(title = list(text = "Trend"), labels = list(format = "{value}%")) %>%
        hc_yAxis(title = list(text = "Level")) %>%
        hc_add_theme(hc_theme_economist()) %>%
        hc_add_series(data = list.parse3(trends), type = "bubble", showInLegend=FALSE, maxSize=40) %>%
        hc_tooltip(formatter = JS("function(){
                                  return ('<b>Trend: </b>' + Highcharts.numberFormat(this.x, 2) + '%' +
                                          '<br><b>Level: </b>' + Highcharts.numberFormat(this.y, 2) +
                                          '<br><b>Breed: </b>' + this.point.name)
                                  }"))
      
      #Static chart
      opts=theme(
        panel.background = element_rect(fill="gray98"),
        panel.border = element_rect(colour="black", fill=NA),
        axis.line = element_line(size = 0.5, colour = "black"),
        axis.ticks = element_line(colour="black"),
        panel.grid.major = element_line(colour="gray75", linetype = 2),
        panel.grid.minor = element_blank(),
        axis.text.y = element_text(colour="gray25", size=15),
        axis.text.x = element_text(colour="gray25", size=15),
        text = element_text(size=20),
        legend.key = element_blank(),
        legend.position = "none",
        legend.background = element_blank(),
        plot.title = element_text(size = 30))
      ggplot(trends, aes(x=x/100, y=y, label=name), guide=FALSE)+
        geom_point(colour="white", fill="darkorchid2", shape=21, alpha=.3, size=9)+
        scale_size_continuous(range=c(2,40))+
        scale_x_continuous(limits=c(-.02,.02), labels = percent)+
        scale_y_continuous(limits=c(0,100))+
        labs(title="The Hype Bubble Map for Dog Breeds",
             x="Trend",
             y="Level")+
        geom_text(data=subset(trends, x> .2 & y > 50), size=4, colour="gray25")+
        geom_text(data=subset(trends, x > .7), size=4, colour="gray25")+opts
      

      To leave a comment for the author, please follow the link and comment on their blog: Ripples.


      Web-Scraping JavaScript rendered Sites


      (This article was first published on Florian Teschner, and kindly contributed to R-bloggers)

      Gathering data from the web is one of the key tasks for generating data-driven insights into all kinds of topics.
      Thanks to the fantastic rvest R package, web scraping is pretty straightforward.
      It basically works like this: go to a website, find the right items using the SelectorGadget, and plug the element path into your R code.
      There are various great tutorials on how to do that (e.g. 1, 2).

      Increasingly, I see websites that (un)consciously make it hard to scrape their content by employing delayed JavaScript-based rendering. In these cases the simple rvest approach breaks down. Examples include the Bwin betting site and the German site Busliniensuche, a site for comparing bus travel providers on price, duration and schedule.

      As a side note, the German bus travel market was deregulated in 2013, so the market is still developing rapidly. I thought it would be interesting to analyse the basic market elements and compare bus travel with the established train provider Deutsche Bahn.

      As mentioned, the site employs some kind of delayed JavaScript rendering: it loads the page content with a delay. This would be fine if the website called a structured JSON endpoint, but that is not the case; rather, I believe they make it intentionally hard to gather their data in a structured form.

      How to scrape JS-rendered websites?

      One way to gather the data nonetheless is using a “headless” browser such as PhantomJS.
      “A headless browser is a web browser without a graphical user interface. Headless browsers provide automated control of a web page in an environment similar to popular web browsers” (Source: Wikipedia). In order to control PhantomJS from R we need two scripts; a) the PhantomJS file and b) a R file that manipulates and runs PhantomJS. Both files are included at the end of this post.

      The PhantomJS file has one parameter: the URL to be scraped, placed right at the beginning of the file.
      The headless browser loads the URL, waits 2500 milliseconds and saves the rendered page to disk (file name: 1.html).
      The R file changes the URL to the target site, runs the headless browser via a system call, and then works with the locally saved file in the usual rvest way.
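
      To make the two-file setup concrete, here is a minimal sketch of the pattern (not the exact files from the end of this post); the file names scrape.js and 1.html, the example URL and the commented selector are illustrative assumptions:

      library(rvest)

      url <- "http://www.busliniensuche.de/"   # example target site

      # PhantomJS script: open the URL, wait 2500 ms for the JavaScript to render,
      # then dump the rendered DOM to 1.html
      phantom_script <- sprintf("
      var page = require('webpage').create();
      page.open('%s', function () {
        setTimeout(function () {
          require('fs').write('1.html', page.content, 'w');
          phantom.exit();
        }, 2500);
      });", url)
      writeLines(phantom_script, "scrape.js")

      # Run the headless browser (assumes the phantomjs binary is on the PATH or in
      # the working directory), then read the saved file like any static page
      system("phantomjs scrape.js")
      page <- read_html("1.html")
      # page %>% html_nodes(".some-result") %>% html_text()   # hypothetical selector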

      So to get started on your own: download PhantomJS, place the executable in your R working directory, and adapt the source code accordingly.

      Happy Scraping!

      To leave a comment for the author, please follow the link and comment on their blog: Florian Teschner.
