Click here to close now.

Welcome!

Eclipse Authors: XebiaLabs Blog, Ken Fogel, Sematext Blog, Marcin Warpechowski, Trevor Parsons

Blog Feed Post

Diving into H2O

by Joseph Rickert One of the remarkable features of the R language is its adaptability. Motivated by R’s popularity and helped by R’s expressive power and transparency developers working on other platforms display what looks like inexhaustible creativity in providing seamless interfaces to software that complements R’s strengths. The H2O R package that connects to 0xdata’s H2O software (Apache 2.0 License) is an example of this kind of creativity. According to the 0xdata website, H2O is “The Open Source In-Memory, Prediction Engine for Big Data Science”. Indeed, H2O offers an impressive array of machine learning algorithms. The H2O R package provides functions for building GLM, GBM, Kmeans, Naive Bayes, Principal Components Analysis, Principal Components Regression, Random Forests and Deep Learning (multi-layer neural net models). Examples with timing information of running all of these models on fairly large data sets are available on the 0xdata website. Execution speeds are very impressive. In this post, I thought I would start a little slower and look at H2O from an R point of View. H2O is a Java Virtual Machine that is optimized for doing “in memory” processing of distributed, parallel machine learning algorithms on clusters. A “cluster” is a software construct that can be can be fired up on your laptop, on a server, or across the multiple nodes of a cluster of real machines, including computers that form a Hadoop cluster. According to the documentation a cluster’s “memory capacity is the sum across all H2O nodes in the cluster”. So, as I understand it, if you were to build a 16 node cluster of machines each having 64GB of DRAM, and you installed H2O everything then you could run the H2O machine learning algorithms using a terabyte of memory. Underneath the covers, the H2O JVM sits on an in-memory, non-persistent key-value (KV) store that uses a distributed JAVA memory model. The KV store holds state information, all results and the big data itself. H2O keeps the data in a heap. When the heap gets full, i.e. when you are working with more data than physical DRAM, H20 swaps to disk. (See Cliff Click’s blog for the details.) The main point here is that the data is not in R. R only has a pointer to the data, an S4 object containing the IP address, port and key name for the data sitting in H2O. The R H2O package communicates with the H2O JVM over a REST API. R sends RCurl commands and H2O sends back JSON responses. Data ingestion, however, does not happen via the REST API. Rather, an R user calls a function that causes the data to be directly parsed into the H2O KV store. The H2O R package provides several functions for doing this Including: h20.importFile() which imports and parses files from a local directory, h20.importURL() which imports and pareses files from a website, and h2o.importHDFS() which imports and parses HDFS files sitting on a Hadoop cluster. So much for the background: let’s get started with H2O. The first thing you need to do is to get Java running on your machine. If you don’t already have Java the default download ought to be just fine. Then fetch and install the H2O R package. Note that the h2o.jar executable is currently shipped with the h2o R package. The following code from the 0xdata website ran just fine from RStudio on my PC: # The following two commands remove any previously installed H2O packages for R. if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) } if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }   # Next, we download, install and initialize the H2O package for R. install.packages("h2o", repos=(c("http://s3.amazonaws.com/h2o-release/h2o/rel-kahan/5/R", getOption("repos"))))   library(h2o) localH2O = h2o.init()   # Finally, let's run a demo to see H2O at work. demo(h2o.glm) Created by Pretty R at inside-R.org Note that the function h20.init() uses the defaults to start up R on your local machine. Users can also provide parameters to specify an IP address and port number in order to connect to a remote instance of H20 running on a cluster. h2o.init(Xmx="10g") will start up the H2O KV store with 10GB of RAM. demo(h2o,glm) runs the glm demo to let you know that everything is working just fine. I will save examining the model for another time. Instead let's look at some other H2O functionality. The first thing to get straight with H2O is to be clear about when you are working in R and when you are working in the H2O JVM. The H2O R package implements several R functions that are wrappers to H2O native functions. "H2O supports an R-like language" (See a note on R) but sometimes things behave differently than an R programmer might expect. For example, the R code: y <- apply(iris[,1:4],2,sum)y produces the following result: sepal.length sepal.width petal.length petal.width 876.5    458.6 563.7  179.9 now, let's see how things work in h2o, code loads h2o package, starts a local instance of uploads iris data set into from r package and produces very r-like summary. library(h2o) # load library localh2o =h2o.init() initial locl instance # upload file instance iris.hex >

Read the original blog entry...

More Stories By David Smith

David Smith is Vice President of Marketing and Community at Revolution Analytics. He has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He continued his association with S-PLUS at Insightful (now TIBCO Spotfire) overseeing the product management of S-PLUS and other statistical and data mining products.<

David smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project. Today, he leads marketing for REvolution R, supports R communities worldwide, and is responsible for the Revolutions blog. Prior to joining Revolution Analytics, he served as vice president of product management at Zynchros, Inc. Follow him on twitter at @RevoDavid

@ThingsExpo Stories
SYS-CON Events announced today that Cisco, the worldwide leader in IT that transforms how people connect, communicate and collaborate, has been named “Gold Sponsor” of SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Cisco makes amazing things happen by connecting the unconnected. Cisco has shaped the future of the Internet by becoming the worldwide leader in transforming how people connect, communicate and collaborate. Cisco and our partners are building the platform for the Internet of Everything by connecting the...
Wearable technology was dominant at this year’s International Consumer Electronics Show (CES) , and MWC was no exception to this trend. New versions of favorites, such as the Samsung Gear (three new products were released: the Gear 2, the Gear 2 Neo and the Gear Fit), shared the limelight with new wearables like Pebble Time Steel (the new premium version of the company’s previously released smartwatch) and the LG Watch Urbane. The most dramatic difference at MWC was an emphasis on presenting wearables as fashion accessories and moving away from the original clunky technology associated with t...
Temasys has announced senior management additions to its team. Joining are David Holloway as Vice President of Commercial and Nadine Yap as Vice President of Product. Over the past 12 months Temasys has doubled in size as it adds new customers and expands the development of its Skylink platform. Skylink leads the charge to move WebRTC, traditionally seen as a desktop, browser based technology, to become a ubiquitous web communications technology on web and mobile, as well as Internet of Things compatible devices.
SYS-CON Events announced today that robomq.io will exhibit at SYS-CON's @ThingsExpo, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. robomq.io is an interoperable and composable platform that connects any device to any application. It helps systems integrators and the solution providers build new and innovative products and service for industries requiring monitoring or intelligence from devices and sensors.
Docker is an excellent platform for organizations interested in running microservices. It offers portability and consistency between development and production environments, quick provisioning times, and a simple way to isolate services. In his session at DevOps Summit at 16th Cloud Expo, Shannon Williams, co-founder of Rancher Labs, will walk through these and other benefits of using Docker to run microservices, and provide an overview of RancherOS, a minimalist distribution of Linux designed expressly to run Docker. He will also discuss Rancher, an orchestration and service discovery platf...
SYS-CON Events announced today that Akana, formerly SOA Software, has been named “Bronze Sponsor” of SYS-CON's 16th International Cloud Expo® New York, which will take place June 9-11, 2015, at the Javits Center in New York City, NY. Akana’s comprehensive suite of API Management, API Security, Integrated SOA Governance, and Cloud Integration solutions helps businesses accelerate digital transformation by securely extending their reach across multiple channels – mobile, cloud and Internet of Things. Akana enables enterprises to share data as APIs, connect and integrate applications, drive part...
SYS-CON Events announced today that Vitria Technology, Inc. will exhibit at SYS-CON’s @ThingsExpo, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Vitria will showcase the company’s new IoT Analytics Platform through live demonstrations at booth #330. Vitria’s IoT Analytics Platform, fully integrated and powered by an operational intelligence engine, enables customers to rapidly build and operationalize advanced analytics to deliver timely business outcomes for use cases across the industrial, enterprise, and consumer segments.
SYS-CON Events announced today that Solgenia will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY, and the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. Solgenia is the global market leader in Cloud Collaboration and Cloud Infrastructure software solutions. Designed to “Bridge the Gap” between Personal and Professional Social, Mobile and Cloud user experiences, our solutions help large and medium-sized organizations dr...
SYS-CON Events announced today that Liaison Technologies, a leading provider of data management and integration cloud services and solutions, has been named "Silver Sponsor" of SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York, NY. Liaison Technologies is a recognized market leader in providing cloud-enabled data integration and data management solutions to break down complex information barriers, enabling enterprises to make smarter decisions, faster.
Cloud is not a commodity. And no matter what you call it, computing doesn’t come out of the sky. It comes from physical hardware inside brick and mortar facilities connected by hundreds of miles of networking cable. And no two clouds are built the same way. SoftLayer gives you the highest performing cloud infrastructure available. One platform that takes data centers around the world that are full of the widest range of cloud computing options, and then integrates and automates everything. Join SoftLayer on June 9 at 16th Cloud Expo to learn about IBM Cloud's SoftLayer platform, explore se...
The WebRTC Summit 2014 New York, to be held June 9-11, 2015, at the Javits Center in New York, NY, announces that its Call for Papers is open. Topics include all aspects of improving IT delivery by eliminating waste through automated business models leveraging cloud technologies. WebRTC Summit is co-located with 16th International Cloud Expo, @ThingsExpo, Big Data Expo, and DevOps Summit.
@ThingsExpo has been named the Top 5 Most Influential M2M Brand by Onalytica in the ‘Machine to Machine: Top 100 Influencers and Brands.' Onalytica analyzed the online debate on M2M by looking at over 85,000 tweets to provide the most influential individuals and brands that drive the discussion. According to Onalytica the "analysis showed a very engaged community with a lot of interactive tweets. The M2M discussion seems to be more fragmented and driven by some of the major brands present in the M2M space. This really allows some room for influential individuals to create more high value inter...
The world's leading Cloud event, Cloud Expo has launched Microservices Journal on the SYS-CON.com portal, featuring over 19,000 original articles, news stories, features, and blog entries. DevOps Journal is focused on this critical enterprise IT topic in the world of cloud computing. Microservices Journal offers top articles, news stories, and blog posts from the world's well-known experts and guarantees better exposure for its authors than any other publication. Follow new article posts on Twitter at @MicroservicesE
SYS-CON Events announced today the IoT Bootcamp – Jumpstart Your IoT Strategy, being held June 9–10, 2015, in conjunction with 16th Cloud Expo and Internet of @ThingsExpo at the Javits Center in New York City. This is your chance to jumpstart your IoT strategy. Combined with real-world scenarios and use cases, the IoT Bootcamp is not just based on presentations but includes hands-on demos and walkthroughs. We will introduce you to a variety of Do-It-Yourself IoT platforms including Arduino, Raspberry Pi, BeagleBone, Spark and Intel Edison. You will also get an overview of cloud technologies s...
SYS-CON Events announced today that SafeLogic has been named “Bag Sponsor” of SYS-CON's 16th International Cloud Expo® New York, which will take place June 9-11, 2015, at the Javits Center in New York City, NY. SafeLogic provides security products for applications in mobile and server/appliance environments. SafeLogic’s flagship product CryptoComply is a FIPS 140-2 validated cryptographic engine designed to secure data on servers, workstations, appliances, mobile devices, and in the Cloud.
Containers and microservices have become topics of intense interest throughout the cloud developer and enterprise IT communities. Accordingly, attendees at the upcoming 16th Cloud Expo at the Javits Center in New York June 9-11 will find fresh new content in a new track called PaaS | Containers & Microservices Containers are not being considered for the first time by the cloud community, but a current era of re-consideration has pushed them to the top of the cloud agenda. With the launch of Docker's initial release in March of 2013, interest was revved up several notches. Then late last...
SOA Software has changed its name to Akana. With roots in Web Services and SOA Governance, Akana has established itself as a leader in API Management and is expanding into cloud integration as an alternative to the traditional heavyweight enterprise service bus (ESB). The company recently announced that it achieved more than 90% year-over-year growth. As Akana, the company now addresses the evolution and diversification of SOA, unifying security, management, and DevOps across SOA, APIs, microservices, and more.
After making a doctor’s appointment via your mobile device, you receive a calendar invite. The day of your appointment, you get a reminder with the doctor’s location and contact information. As you enter the doctor’s exam room, the medical team is equipped with the latest tablet containing your medical history – he or she makes real time updates to your medical file. At the end of your visit, you receive an electronic prescription to your preferred pharmacy and can schedule your next appointment.
The Open Compute Project is a collective effort by Facebook and a number of players in the datacenter industry to bring lessons learned from the social media giant's giant IT deployment to the rest of the world. Datacenters account for 3% of global electricity consumption – about the same as all of Switzerland or the Czech Republic -- according to people I met at the recent Open Compute Summit in San Jose. With increasing mobility at the edge of the cloud and vast new dataflows being predicted with the growth of the Internet of Things (and The Coming Age of Many Zettabytes) in the near...
GENBAND has announced that SageNet is leveraging the Nuvia platform to deliver Unified Communications as a Service (UCaaS) to its large base of retail and enterprise customers. Nuvia’s cloud-based solution provides SageNet’s customers with a full suite of business communications and collaboration tools. Two large national SageNet retail customers have recently signed up to deploy the Nuvia platform and the company will continue to sell the service to new and existing customers. Nuvia’s capabilities include HD voice, video, multimedia messaging, mobility, conferencing, Web collaboration, deskt...