What is ETL?

A very important step in building an enterprise architecture for data integration and analytics.

All about Data!

ETL stands for Extract, Transform, Load. ETL covers extracting data from external sources, transforming it to meet business needs, and loading it into the Data Warehouse (to understand the concept of a Data Warehouse, read the post “O que é o Data Warehouse?”, "What is a Data Warehouse?"). This article focuses on ETL for data warehousing, but you can also use ETL tools for any kind of data import, export, or transformation work aimed at other database environments or other needs.
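The three stages can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from the original post: the CSV content, the `orders` source table, the `dim_customer_sales` target table, and the use of in-memory SQLite databases as stand-ins for a real source system and a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source: customer records exported as CSV.
FLAT_FILE = """customer_id,name,country
1, Ana Souza ,br
2,John Smith,us
"""


def extract_flat_file(text):
    """Extract: read rows from a CSV flat file into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_relational(conn):
    """Extract: read order rows from a relational source database."""
    return conn.execute("SELECT customer_id, amount FROM orders").fetchall()


def transform(customers, orders):
    """Transform: trim names, upper-case country codes, total orders per customer."""
    totals = {}
    for customer_id, amount in orders:
        totals[customer_id] = totals.get(customer_id, 0.0) + amount
    rows = []
    for c in customers:
        cid = int(c["customer_id"])
        rows.append(
            (cid, c["name"].strip(), c["country"].strip().upper(), totals.get(cid, 0.0))
        )
    return rows


def load(warehouse, rows):
    """Load: write the transformed rows into the (hypothetical) warehouse table."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS dim_customer_sales ("
        "customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT, total_amount REAL)"
    )
    warehouse.executemany(
        "INSERT OR REPLACE INTO dim_customer_sales VALUES (?, ?, ?, ?)", rows
    )
    warehouse.commit()


def run_etl():
    """Run the whole pipeline against in-memory stand-ins and return the loaded rows."""
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    source.executemany(
        "INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (1, 5.5), (2, 20.0)]
    )
    warehouse = sqlite3.connect(":memory:")
    rows = transform(extract_flat_file(FLAT_FILE), extract_relational(source))
    load(warehouse, rows)
    return warehouse.execute(
        "SELECT * FROM dim_customer_sales ORDER BY customer_id"
    ).fetchall()


if __name__ == "__main__":
    for row in run_etl():
        print(row)
```

Keeping extract, transform, and load as separate functions means a source or target can be swapped without touching the business rules in `transform`. Dedicated ETL tools follow a similar separation at a much larger scale.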

Data warehouse projects consolidate data from different sources. Most of these sources tend to be relational databases or flat files, but other kinds of sources may exist as well. An ETL system needs to be able to communicate…

View original post (553 more words)


Favorite maps and graphs in historical criminology

Presenting statistics is history too.

Andrew Wheeler

I was reading Charles Booth’s Life and Labour of the People in London (available in full on Google Books) and stumbled across this gem of a connected dot plot (between pages 18 and 19; maybe it came as a fold-out in the book?).

(You will also get a surprise on the prior page: the hand of the scanner!) This reminded me I wanted to make a collection of my favorite historical examples of maps and graphs for criminology and criminal justice. If you read through Calvin Schmid’s Handbook of Graphical Presentation (available for free at the Internet Archive), you will see it was a royal pain to create such statistical graphics by hand before computers. It makes you appreciate the effort all that much more, and many of the good ones rival the quality of any graphic you can make on the computer.

Calvin Schmid himself has some of my favorite…

View original post (654 more words)

Using RMySQL from Ubuntu

A simple, instructive example.


  • Install MySQL

sudo apt-get install mysql-server

  • Check if Server is Running

sudo netstat -tap | grep mysql

I use the MySQL command line to check it

To connect

mysql -h localhost -u root -p

To see databases

mysql> show databases;

To see tables

mysql> show tables from mysql;

To quit mysql

mysql> \q


  • Install and load RMySQL from within R


  • I connect using this

mydb = dbConnect(MySQL(),
port = 8018,

  • I write SQL queries using this
> dbGetQuery(mydb, "select * from  servers")
[1] Server_name Host        Db          Username    Password   
[6] Port        Socket      Wrapper     Owner      
<0 rows> (or 0-length row.names)
> dbGetQuery(mydb, "select * from  db")
 [1] Host                  Db                   
 [3] User                  Select_priv          
 [5] Insert_priv           Update_priv          
 [7] Delete_priv           Create_priv          
 [9] Drop_priv             Grant_priv           
[11] References_priv       Index_priv           
[13] Alter_priv            Create_tmp_table_priv
[15] Lock_tables_priv      Create_view_priv     
[17] Show_view_priv        Create_routine_priv  
[19] Alter_routine_priv    Execute_priv         
[21] Event_priv            Trigger_priv         
<0 rows> (or 0-length row.names)






View original post

Tips for using R in production analytics environment

Tips are always welcome!


1) Read.csv is dead. Long live fread. Use fread from data.table to import data and get a speed-up factor of 5X in the data import phase itself. Ignore the data.table package and languish in hell.

2) Write.csv is boring. Write as a .Rda file. Use a .Rda file to get compression of up to 4X.

3) Use the new project mode from RStudio. This helps keep workflow management clean.

4) Use GUIs like Deducer / the KMggplot2 plugin for R Commander for great data viz right now. For people who want to use ggplot2 straight away.

5) Avoid duplicates, remove prior copies, and use gc(). Memory management is key to using R in production analytics.

6) Think object oriented. Forget other languages. Think slice and dice, using $ and [], and using apply versus for loops.

7) Use ? and ?? before you google and ask for help on Stack Overflow…

View original post (104 more words)

Scheduling R Tasks via Windows Task Scheduler

Scheduled generation of charts and other presentations.

TRinker's R Blog

This post will allow you to impress your boss with your strong work ethic by enabling Windows R users to schedule late-night tasks. Picture it: your boss gets an email at 1:30 in the morning with the latest company data as a beautiful report. I’m quite sure Linux and Mac users are able to do this rather easily via cron. Windows users can do this via the Task Scheduler. Users can also interface with the Task Scheduler via the command line.

As this is more process oriented, I have created a minimal example on GitHub and the following video rather than providing scripts in-text. All the scripts can be accessed via: https://github.com/trinker/Make_Task Users will need to fill in relevant information (e.g., paths, usernames, etc.) and download necessary libraries to run the scripts. The main point of this demonstration is to provide the reader (who is a Windows…

View original post (8 more words)