This presentation is an HTML5 website

Press key to advance.

Zoom in/out: Ctrl or Command + +/-

This slide uses styles/codes from Google's slides for HTML5

PHP and NoSQL

Dropping SQL and thinking in objects

César D. Rodas - crodas@php.net
PHPConference Brazil 2010
São Paulo - Brazil

 

These slides are Open Source (Apache License), feel free to reuse it as you wish.

About me
  • 1987 Born
  • 2003 PHP
  • 2005 PHP + Opensource
  • 2009 MongoDB
  • 2010 PHP + Opensource + MongoDB
$crodas ~= PHP + MongoDB + (NodeJS|C)

NoSQL?

NoSQL?

  • Set of databases without SQL ;-)
  • Usually with a different data behavior
  • No relation
  • This talk: MongoDB, Redis and Apache Solr

MongoDB

MongoDB

What is it?

  • Mongo(EN) != Mongo(PT)
  • Document oriented database
  • Fast, scalable and easy to use (devel-friendly)
  • Support indexes
  • Schemaless
  • Sharding
  • Replicated
  • Support nested-documents
  • PECL client (C driver for PHP)
$mongodb ~= MySQL (performance) + Object Relational Mapper - overhead
MongoDB

What is document?

It is an array/hash/object (without methods) that can be saved in the DB

// a document about me...
$crodas = array(
    '_id' => 'py-4290392',
    'name' => 'Cesar',
    'surname' => 'Rodas',
    'e-mail' => 'crodas@php.net',
    'works' => array(
        array('name' => 'PlumWillow', 'obj' => 'we are hiring'),
        array('name' => 'Some open source projects', 'obj' => 'I like it!'),
    )
);

// we save it
$db->users->save($crodas);

// we fetch it
$db->users->find(array('e-mail' => 'crodas@php.net'));

// or much better 
$db->users->find(array('works.name' => 'PlumWillow'));

MongoDB

Operations (Selects)

  • $gt, $lt, $gte, $lte, $eq, $neq: >, <, >=, <=, ==, !=
  • $in, $nin, $or, $min, $max, $all, $exists
  • $where: (and javascript code)
  • group()
  • distinct()
  • skip()
  • limit()
  • count()
MongoDB

Operations (Update)

  • $set, $unset
  • $push, $pull
  • $inc
  • findAndModify()

Is it better than my relational database?

No.

It is different.

MongoDB

Differences

  • Denormalize data is not bad
    • It makes queries simplest and fast
    • Disk-space is much cheaper than CPU
    • Much simpler to distribute data across multiple nodes
  • No CPU wasting doing ORM
  • No SQL injection
  • No Joins (they are evil!)
  • Batch processing
  • No CREATE TABLE and ALTER TABLE
MongoDB

Differences

MongoDB

Data structure

$user = array(
    'name' => 'crodas',
    'email' => 'crodas@php.net',
    'website' => 'http://cesarodas.com',
);

$category = array(
    'name' => 'Sport', 
    'description' => 'Everything about football (proper football)',
);

$comment = array(
    'news' => $news['_id'],
    'text' => 'I like MongoDB',
    'user' => $user['_id'],
    // duplicate properties from user object (to avoid joins)
    'name' => $user['name'],
    'website' => $user['website'],
);
MongoDB

Data structure

$news = array(
    'title' => 'My Talk about MongoDB',
    'content' => 'MongoDB rules, I like it, bla, bla... ',
    'author' => $user['_id'],

    // duplicate name
    'authorName' => $user['name'],

    // copy all categories (entire structure) here
    'categories' => array(
        $category[1], $category[2]
    ),

    // copy first 10 comments
    // (suppose we show 10 first comments, then we paginate it)
    'comments' => array(
        $comment[0], $comment[1], $comment[2] .. $comment[9]
    ),

    // comment users (reported by a dev at T!)
    'commentsUsers' => array(
        $userId => $userName
    ),

    // we update this for every new comment
    'totalComments' => $db->comments->count(array('news' => $news['_id'])),
);
MongoDB

Comparing

Get one news by its ID

MySQL

SELECT news.*, user.name FROM news INNER JOIN user ON user.id = news.author id
WHERE id = 1

SELECT category.category FROM category has news INNER JOIN category ON category
WHERE news id = 1

SELECT * FROM comments INNER JOIN user ON user.id = comments.user id
WHERE news id = 1

MongoDB

$mongo = new MongoDB;
$db = $mongo->database;

$news = $db->news->find(array('_id' => 1));
MongoDB

Comparing

Get 10-most commented news with more than 10 comments

MySQL

// Too bored to think in SQL right now

MongoDB

$query = array('totalComments' => array('$gt' => 10));
$news  = $db->news->find($query)->sort(array('totalComments' => -1))->limit(5);
MongoDB

Add a new comment

// Fetch news
$news = $db->news->findOne(array(
    '_id' => $_POST['news_id'],            
), array('totalComments' => true));

if (empty($news)) throw new Exception("News is not valid");

// Add new comment 
$comment = array(
    'user' => $_SESSION['user']['_id'],
    'name' => $_SESSION['user']['name'],
    'website' => $_SESSION['user']['website'],
    'content' => $_POST['comment'],
);
$db->comment->save($comment);

// Update totalComments
$update = array('$inc' => array('totalComments' => 1));
if ($news['totalComments'] < 10) {
    // save the name in a diff property to update it easily
    $update['$set'] = array('commentUsers.' . $comment['user'] => $comment['name']);
    // remove it to avoid du-du-duplicated values
    unset($comment['user'], $comment['name']);
    // cache the comments
    $update['$push'] = array('comments' => $comment);
}
$db->news->update(array('_id' => $news['_id']), $update);

What if the user changes its name?

MongoDB

Update cached properties

$update = array(
    '$set' => array('name' => $new_name)
);
$option = array('multiple' => true);

// update in comments collection
$db->comments->update(array('user' => $user_id), $update, $option);

$update = array('$set' => array('authorName' => $new_name));
$db->news->update(array('author'd => $user_id), $update, $option));

// update name on cached comments
$update = array('$set' => array('commentUsers.' . $user_id => $new_name));
$db->news->update(array(), $update, $option));

What about schema migration?

MongoDB

Update schema

// MongoDB (simple version – one node)
foreach ($db->news->find() as $news) {
    $query = array('_id' => $news['_id']);
    $update = array('$set' =>
        array('url' => get_url_from_title($news['title']))
    );
    $db->news->update($query, $update);
}
MongoDB

Update schema

// better approach (can run multiple instances safely)
while (true) {
    $news = $db->command(array(
        'findAndModify' => 'news',
        // where url doesn't exists (much better than $exists => false)
        'query' => array('url' => null),
            // set a new value for url, diff than null
            'update' => array('$set' => array('url' => ' ')),
    ));
    if ($news['ok'] != 1) break;

    $query = array('_id' => $news['value']['_id']):
    $update = array('$set' =>
        array('url' => get_url_from_title($news['value']['title']))
    );
    $db->news->update($query, $update);
}

Beyond SQL!

MongoDB

Store file

  • No need of any distributed File system
  • Each file has a checksum
  • Open data structure
  • You can extend it to support locking
  • You can save as much meta data as you want
  • Works in a sharded-environment
  • Native extension for NGINX
MongoDB

Storing file

$grid = $db->getGridFS();
$metadata = array(
    "whatever" => "metadata",
    "path" => "/foo",
    "download" => 0
);

$grid->storeFile($filename, $metadata);

// or
$grid->storeBytes($bytes, $metadata);

// Or (save $ FILE['foo'])
$grid->storeUpload('foo', $metadata);

MongoDB

Read file

$grid = $db->getGridFS();
$file = $grid->findOne(array('path' => '/foo'));

// update download
$update = array('$inc' => array('download' => 1));
$id = array(' id' => $file->file[' id']);
$grid->update($id, $update);

// print it
echo $file->Bytes();

// or (a bit better)
$tmp = '/tmp/apache/' . $file->file[' id'];
if (!is_file($tmp)) $file->write($tmp);
virtual($tmp);

Map/Reduce

MongoDB

Map/Reduce

  • Well known (Google’s) approach to process tons of data
  • Move the computation where the data is
  • Similar to GROUP BY, but more powerful (Javascript)
  • One function is applied to each news (map)
  • Map() produces 0 to N new values (key, value)
  • MongoDB sort everything by key
  • Reduce is called for each (key, values) and return 1 value
MongoDB

Map/Reduce

var map = function() {
    if (!this.categories) {
        return;
    }
    for (index in this.categories) {
        // return (several times) a new document
        // with id: category name, and value: 1
        emit(this.categories[index].name, 1);
    }
};
var reduce = function(key, values) {
    var count = 0;
    for (index in values) {
        count += values[index];
    }
    return count;
};
MongoDB

Replica sets

  • Advanced master/slave replication
  • Automatic failover and recovery
  • (optional) REST interface
  • A write is only truly committed once it has been replicated to a majority of members of the set
  • Perfect solution for apps. with high number of reads
MongoDB

Sharding

  • One logical collection, several physical collection
  • Horizontal scaling across multiple node
  • Data partition is done by using a partition key
  • Can be combined with replica sets
MongoDB

Sharding

// Shard key
{ x : 1}

// Targeted
db.foo.find( { x : 300 } ) 
db.foo.find( { x : 300, age : 40 } )    
db.foo.insert( <object> ) 
db.foo.update( { x : 100 }, <object> ) 
db.foo.remove( { x : 100 } ) 


// Global
db.foo.find( { age : 40 } ) 
db.foo.find() // sequential
db.foo.find(...).sort( { age : 1 } ) 
db.foo.update( { age : 40 }, <object> ) 
db.foo.remove( { age : 40 } )
MongoDB

Capped Collections

  • Fixed sized collections, high performance auto-FIFO age-out feature.
  • You may insert new objects in the capped collection.
  • The database does not allow deleting objects from a capped collection.
  • Logging, Caching, Auto Archiving.
Mongo ~= Documents + Replicated|Distributed + JS

MongoDB = the future of web-development

Redis

Redis

What is it?

  • Advanced key-value store
  • Similar to Memcached but:
    • Dataset is not volatile
    • Also lists, sets, and ordered sets
  • Atomic operations: inc, push, pull, add, remove
  • Server side operations: server side union, intersection, difference between sets, and so forth
  • Operation with keys
  • Extensible binary safe protocol
  • BSD license
Redis

When/how/where should you use it?

  • Memcached replacement
    • Cache database results
    • Cache static pages
    • Store and share sessions
  • Tracking traffic or user interaction (i.e Stats gathering)
  • A job queue system (like Gearman)
  • Anything that need high rates of read and/or writes per second

Apache Solr

Solr

What is it?

  • Fast opensource search platform
  • Written in Java (but it is fast! was it coded by Harry Poter?)
  • Speaks REST (json, xml and also PHP serialization)
  • Can index rich document (e.g., Word, PDF)
  • Reach query support (pretty similar to Google's syntax)
  • It has schemas, and support multiple values per property
  • Can be distributed to multiple machines (replication & sharding)
Solr

What makes it different?

  • Can return data not only the ID (as sphinxsearch does)
  • Extensible through plugins
  • Spellcheck plugin
  • Geo-search plugin
  • Easy to push data
Solr

Search examples

  • phpconference in brazil
  • "phpconference in brazil"
  • title:php
  • title:"php brazil"
  • type:bike AND price:[50 TO 100]
  • (type:bike OR (sport:football AND type:ball)) AND (price:[50 TO 100] OR price:[* TO 10])
  • tag:php AND tag:brazil
  • tag:php OR tag:phpbr
  • tag:php AND tag:brazil -type:blogpost

Conclusion

Use the web-tool for your job
(but believe me, MongoDB, Solr and Redis
are the best tool for web-development
)

Thanks!

Questions?

crodas@php.net

Questions?

crodas@php.net