This is a discussion on Help writing SQL statement in PHP script within the PHP Language forums, part of the PHP Programming Forums category.
On May 15, 12:27 am, Corey Jansen <ccj9...@gmail.com> wrote:
> Jerry's approach results in a "cartesian explosion."

Which is exactly the problem database normalization is designed to prevent.

If only Mr. Stuckle had listened to what 10 people told him already. Obstinacy is his best policy it seems :)

Yet another lesson in "Why You Should Use Proper Database Design."
|
|||
|
Corey Jansen wrote:
> I tried both queries, and the result is Jerry's method produces very
> strange results. The normalized approach posted by petersprc does give
> the expected result though.
>
> For a table containing a few thousand records with duplicates, Jerry's
> query returned 200 million rows (yes 200 million) after running for
> about 2 minutes. That's more rows than there were in the original table.
>

Those that can, do. Those that can't, teach. ;-)

If I had a tenner for every 'theoretically correct' approach that has resulted in huge software size, or machine overhead, or just plain not working..

> I copied the query directly into a test case.
>
> DROP PROCEDURE IF EXISTS setup;
>
> DELIMITER //
>
> CREATE PROCEDURE setup ()
> BEGIN
>   DECLARE i INT DEFAULT 0;
>   DROP TABLE IF EXISTS test;
>   CREATE TABLE test (entry_id int,
>     cat_id int);
>   WHILE i < 10000 DO
>     INSERT INTO test VALUES (2, 30),
>       (2, 35), (3, 30), (3, 35);
>     SET i = i + 1;
>   END WHILE;
> END;
>
> //
>
> DELIMITER ;
>
> CALL setup();
>
> DROP TABLE IF EXISTS result;
>
> CREATE TABLE result AS
>   SELECT a.entry_id
>   FROM test a
>   INNER JOIN test b
>     ON a.entry_id = b.entry_id
>   WHERE a.cat_id = 30
>     AND b.cat_id = 35;
>
> The output is:
>
> Query OK, 200000000 rows affected (2 min 6.35 sec)
> Records: 200000000  Duplicates: 0  Warnings: 0
>
> Jerry's approach results in a "cartesian explosion."

I'll remember that phrase...
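The 200 million figure in the quoted test case follows from simple arithmetic: each of the two entry_id values ends up with 10,000 rows in category 30 and 10,000 rows in category 35, so the self-join pairs them 10,000 x 10,000 times per entry_id, and 2 x 10,000^2 = 200,000,000. A scaled-down sketch of the same test is below. It assumes SQLite (through Python's sqlite3 module) as a stand-in for MySQL, and the helper name self_join_row_count is made up for illustration.

```python
import sqlite3


def self_join_row_count(copies):
    """Build the quoted test table with `copies` loop iterations.

    Returns (rows in the table, rows produced by the self-join).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test (entry_id INT, cat_id INT)")
    # Same four rows the stored procedure inserts on every iteration.
    rows = [(2, 30), (2, 35), (3, 30), (3, 35)] * copies
    conn.executemany("INSERT INTO test VALUES (?, ?)", rows)
    (total,) = conn.execute("SELECT COUNT(*) FROM test").fetchone()
    (joined,) = conn.execute(
        "SELECT COUNT(*) FROM test a "
        "INNER JOIN test b ON a.entry_id = b.entry_id "
        "WHERE a.cat_id = 30 AND b.cat_id = 35"
    ).fetchone()
    conn.close()
    return total, joined


for copies in (1, 10, 100):
    total, joined = self_join_row_count(copies)
    print(f"iterations={copies} table_rows={total} join_rows={joined}")
```

With 100 iterations the table holds 400 rows and the join returns 20,000, which is the same 2 x n^2 growth that yields 200,000,000 at 10,000 iterations.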
|
|||
|
On May 14, 7:21 pm, Mike Lahey <mikey6...@yahoo.com> wrote:
> Jerry Stuckle wrote:
>
> > No argument.
>
> > But that was an additional condition the poster required - not the
> > original op. And that's what makes it incorrect.
>
> Uniqueness is a consequence of the relationship the OP wanted to model.
> Best practice is to create an index, which is the correct solution, as
> has been pointed out several times.
>
> You should properly normalize your DB instead of working around a broken
> design as you're arguing for.

Amen. Any proposed solution that skips this step is incomplete. One shouldn't rely on a broken data model and expect to get good results.

> The OP wanted to indicate membership in a group. A membership relation
> does not contain duplicates.

Yes, by definition, a membership set has no dups. To take another example, it wouldn't be proper for a student to belong to the same class twice. (He could repeat the course, but that wouldn't be the same class would it.)

Using a flawed db design creates all sorts of inconsistencies which are better to avoid when developing robust systems.

Jerry's suggested query blows up when faced with duplicates, so you can see how easy it is to fall into this trap.
|
|||
|
Corey Jansen wrote:
> I tried both queries, and the result is Jerry's method produces very
> strange results. The normalized approach posted by petersprc does give
> the expected result though.
>
> For a table containing a few thousand records with duplicates, Jerry's
> query returned 200 million rows (yes 200 million) after running for
> about 2 minutes. That's more rows than there were in the original table.
>
> I copied the query directly into a test case.
>
> DROP PROCEDURE IF EXISTS setup;
>
> DELIMITER //
>
> CREATE PROCEDURE setup ()
> BEGIN
>   DECLARE i INT DEFAULT 0;
>   DROP TABLE IF EXISTS test;
>   CREATE TABLE test (entry_id int,
>     cat_id int);
>   WHILE i < 10000 DO
>     INSERT INTO test VALUES (2, 30),
>       (2, 35), (3, 30), (3, 35);
>     SET i = i + 1;
>   END WHILE;
> END;
>
> //
>
> DELIMITER ;
>
> CALL setup();
>
> DROP TABLE IF EXISTS result;
>
> CREATE TABLE result AS
>   SELECT a.entry_id
>   FROM test a
>   INNER JOIN test b
>     ON a.entry_id = b.entry_id
>   WHERE a.cat_id = 30
>     AND b.cat_id = 35;
>
> The output is:
>
> Query OK, 200000000 rows affected (2 min 6.35 sec)
> Records: 200000000  Duplicates: 0  Warnings: 0
>
> Jerry's approach results in a "cartesian explosion."
>

Then you have a broken database server. You need to report that as a bug to MySQL ASAP. A lot of people depend on self-join queries like this!

This works fine (sorry about the line wraps):

<?php

$link = mysql_connect('localhost', 'root', 'vps11131')
    or die("Can't connect: " . mysql_error());
$db = mysql_select_db('test');

// Clear table if it existed
mysql_query('DROP TABLE IF EXISTS test');
mysql_query('CREATE TABLE test (groupid INT NOT NULL, ' .
            'userid INT NOT NULL, PRIMARY KEY(groupid, userid))');

// Insert 10K rows of data
for ($i = 1; $i <= 100; $i++)
  for ($j = 1; $j <= 100; $j++)
    mysql_query("INSERT INTO test(groupid, userid) VALUES($i, $j)");

// Now let's get rid of some of the data so we have meaningful results
mysql_query('DELETE FROM test WHERE groupid=32 AND MOD(userid, 3) > 0');
mysql_query('DELETE FROM test WHERE groupid=38 AND MOD(userid, 4) > 0');

// Pull the matching data from the table
$result = mysql_query('SELECT a.userid AS userid ' .
                      'FROM test a ' .
                      'INNER JOIN test b ' .
                      'ON a.userid = b.userid ' .
                      'WHERE a.groupid = 32 ' .
                      'AND b.groupid = 35');
echo 'Rows found: ' . mysql_num_rows($result) . "\n";
while ($data = mysql_fetch_array($result))
  echo $data['userid'] . " ";
mysql_close();
?>

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================
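For readers without a MySQL server at hand, here is a rough port of the script above. It is a sketch under assumptions: SQLite via Python's sqlite3 replaces MySQL, the % operator replaces MOD(), and the helper members_of_both is a made-up name. The posted query compares groups 32 and 35, although the script thins out groups 32 and 38, so the sketch runs both pairs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test (groupid INT NOT NULL, userid INT NOT NULL, "
    "PRIMARY KEY (groupid, userid))"
)
# 100 groups x 100 users, one row per (group, user) pair.
conn.executemany(
    "INSERT INTO test (groupid, userid) VALUES (?, ?)",
    [(g, u) for g in range(1, 101) for u in range(1, 101)],
)
# Same thinning as the PHP script.
conn.execute("DELETE FROM test WHERE groupid = 32 AND userid % 3 > 0")
conn.execute("DELETE FROM test WHERE groupid = 38 AND userid % 4 > 0")


def members_of_both(first, second):
    """Self-join from the post, parameterised on the two group ids."""
    cursor = conn.execute(
        "SELECT a.userid FROM test a "
        "INNER JOIN test b ON a.userid = b.userid "
        "WHERE a.groupid = ? AND b.groupid = ? "
        "ORDER BY a.userid",
        (first, second),
    )
    return [userid for (userid,) in cursor]


as_posted = members_of_both(32, 35)  # the pair used in the posted query
thinned = members_of_both(32, 38)    # the two groups the script thinned
print(len(as_posted), len(thinned))
```

Because the primary key allows each (groupid, userid) pair only once, every user appears at most once in either result, so this table cannot reproduce the duplicate-row case from the quoted test.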
|
|||
|
vkayute@gmail.com wrote:
> On May 15, 12:27 am, Corey Jansen <ccj9...@gmail.com> wrote:
>> Jerry's approach results in a "cartesian explosion."
>
> Which is exactly the problem database normalization is designed to
> prevent.
>
> If only Mr. Stuckle had listened to what 10 people told him already.
> Obstinacy is his best policy it seems :)
>
> Yet another lesson in "Why You Should Use Proper Database Design."
>

I'm not arguing about proper database design. My only comment is it is IMPOSSIBLE to determine if the database is normalized or not from the given information. There could be one or more additional columns to determine uniqueness, for instance.

And people wonder why I send folks to comp.databases.mysql for MySQL questions - that's where the REAL experts hang out.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================
|
|||
|
The Natural Philosopher wrote:
> Corey Jansen wrote:
>> I tried both queries, and the result is Jerry's method produces very
>> strange results. The normalized approach posted by petersprc does give
>> the expected result though.
>>
>> For a table containing a few thousand records with duplicates,
>> Jerry's query returned 200 million rows (yes 200 million) after
>> running for about 2 minutes. That's more rows than there were in the
>> original table.
>>
>
> Those that can, do. Those that can't, teach. ;-)
>

And those who can't teach become philosophers.

> If I had a tenner for every 'theoretically correct' approach that has
> resulted in huge software size, or machine overhead, or just plain not
> working..
>
>

If I had a tenner for every good comment you made, I'd be broke. However, if I had ten cents for every stupid remark you made, I could retire.

>
>> I copied the query directly into a test case.
>>
>> DROP PROCEDURE IF EXISTS setup;
>>
>> DELIMITER //
>>
>> CREATE PROCEDURE setup ()
>> BEGIN
>>   DECLARE i INT DEFAULT 0;
>>   DROP TABLE IF EXISTS test;
>>   CREATE TABLE test (entry_id int,
>>     cat_id int);
>>   WHILE i < 10000 DO
>>     INSERT INTO test VALUES (2, 30),
>>       (2, 35), (3, 30), (3, 35);
>>     SET i = i + 1;
>>   END WHILE;
>> END;
>>
>> //
>>
>> DELIMITER ;
>>
>> CALL setup();
>>
>> DROP TABLE IF EXISTS result;
>>
>> CREATE TABLE result AS
>>   SELECT a.entry_id
>>   FROM test a
>>   INNER JOIN test b
>>     ON a.entry_id = b.entry_id
>>   WHERE a.cat_id = 30
>>     AND b.cat_id = 35;
>>
>> The output is:
>>
>> Query OK, 200000000 rows affected (2 min 6.35 sec)
>> Records: 200000000  Duplicates: 0  Warnings: 0
>>
>> Jerry's approach results in a "cartesian explosion."
>
> I'll remember that phrase...
>>
>

ROFLMAO. Never heard of a cartesian product?

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================
|
|||
|
vkayute@gmail.com wrote:
> On May 14, 7:21 pm, Mike Lahey <mikey6...@yahoo.com> wrote:
>> Jerry Stuckle wrote:
>>
>>> No argument.
>>> But that was an additional condition the poster required - not the
>>> original op. And that's what makes it incorrect.
>> Uniqueness is a consequence of the relationship the OP wanted to model.
>> Best practice is to create an index, which is the correct solution, as
>> has been pointed out several times.
>>
>> You should properly normalize your DB instead of working around a broken
>> design as you're arguing for.
>
> Amen. Any proposed solution that skips this step is incomplete. One
> shouldn't rely on a broken data model and expect to get good results.
>

No arguments. But based on the information given, we cannot say the database was not normalized.

>> The OP wanted to indicate membership in a group. A membership relation
>> does not contain duplicates.
>
> Yes, by definition, a membership set has no dups. To take another
> example, it wouldn't be proper for a student to belong to the same
> class twice. (He could repeat the course, but that wouldn't be the
> same class would it.)
>

It depends. For instance, you could have an additional column - privileges. Things like "read", "post", "upload" to determine the rights the user has.

> Using a flawed db design creates all sorts of inconsistencies which
> are better to avoid when developing robust systems.
>
> Jerry's suggested query blows up when faced with duplicates, so you
> can see how easy it is to fall into this trap.
>

My query does not blow up when there are duplicates. It works perfectly well. But Peter's fails in that case.

And people wonder why I refer MySQL questions to comp.databases.mysql - where the real experts hang out.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================
|
|||
|
On Thu, 15 May 2008 11:55:29 -0400, Jerry Stuckle
<jstucklex@attglobal.net> wrote:

>vkayute@gmail.com wrote:
>> On May 15, 12:27 am, Corey Jansen <ccj9...@gmail.com> wrote:
>>> Jerry's approach results in a "cartesian explosion."
>>
>> Which is exactly the problem database normalization is designed to
>> prevent.
>>
>> If only Mr. Stuckle had listened to what 10 people told him already.
>> Obstinacy is his best policy it seems :)
>>
>> Yet another lesson in "Why You Should Use Proper Database Design."
>>
>
>I'm not arguing about proper database design. My only comment is it is
>IMPOSSIBLE to determine if the database is normalized or not from the
>given information.

That doesn't mean that the relation can't be normalized first. That seems to be the critical point you're missing.

You seem to be arguing that it's better to build on a potentially flawed database design rather than get it right first, which is terrible advice.

> There could be one or more additional columns to determine uniqueness, for instance.
>
>And people wonder why I send folks to comp.databases.mysql for MySQL
>questions - that's where the REAL experts hang out.

This is a pointless hypothetical. If you have N columns, you can still maintain uniqueness across those columns. That doesn't require duplicate rows any more than the original problem which had only 2 columns.

Mitch
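The point about maintaining uniqueness across N columns can be illustrated with a composite key. This is only a sketch under assumptions: SQLite via Python's sqlite3 stands in for MySQL, and the table name membership and the third column privilege are hypothetical, loosely borrowed from the "privileges" idea raised earlier in the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical three-column design: uniqueness is declared across all three.
conn.execute(
    "CREATE TABLE membership ("
    "userid INT NOT NULL, groupid INT NOT NULL, privilege TEXT NOT NULL, "
    "PRIMARY KEY (userid, groupid, privilege))"
)
conn.execute("INSERT INTO membership VALUES (1, 1, 'read')")
# Same user and group, different privilege: a distinct key, so it is accepted.
conn.execute("INSERT INTO membership VALUES (1, 1, 'post')")

try:
    # Exact repeat of an existing row: the composite key rejects it.
    conn.execute("INSERT INTO membership VALUES (1, 1, 'read')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM membership").fetchone()[0]
print(duplicate_rejected, row_count)
```

Note that (userid, groupid) alone still repeats in such a table, so a query that looks only at those two columns has to allow for that.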
|
|||
|
On Thu, 15 May 2008 11:50:57 -0400, Jerry Stuckle
<jstucklex@attglobal.net> wrote:

>Corey Jansen wrote:
>>
>> Jerry's approach results in a "cartesian explosion."
>
>Then you have a broken database server. You need to report that as a
>bug to MySQL ASAP. A lot of people depend on self-join queries like this!

Not at all, this is a bug in your query. It produced the same result here. MySQL did exactly what you told it to do. You seem desperate to avoid acknowledging this, resorting even to making up fictitious MySQL bug reports.

The problem is you are self-joining using a condition that isn't unique and lacks a primary key reference. Sometimes this is what you want, but that is not the case in the original problem.

Let me spell it out for you. Let's say you have rows A through F that contain the following values:

A: (2, 30)
B: (2, 35)
C: (2, 30)
D: (2, 35)
E: (2, 30)
F: (2, 35)

There are only 6 rows in the table. Your query, however, will produce more than 6 matches. This is because rows A, C, and E can each be paired a total of 3 times. The result of the inner join is:

(A, B), (A, D), (A, F)
(C, B), (C, D), (C, F)
(E, B), (E, D), (E, F)

Now, here's how it looks in SQL:

-- Create the table with 6 rows --

DROP TABLE IF EXISTS test;

CREATE TABLE test (entry_id int, cat_id int);

INSERT INTO test (entry_id, cat_id) values
  (2, 30), (2, 35),
  (2, 30), (2, 35),
  (2, 30), (2, 35);

-- Run the query --

SELECT a.entry_id
FROM test a
INNER JOIN test b
  ON a.entry_id = b.entry_id
WHERE a.entry_id = b.entry_id
  AND a.cat_id = 30
  AND b.cat_id = 35;

The result of your query is:

9 rows in set (0.00 sec)

This gets worse as your table gets bigger. You end up with the "cartesian explosion" in the test case that you are denying exists.

>
>This works fine (sorry about the line wraps):
>
><?php
>
>$link = mysql_connect('localhost', 'root', 'vps11131') or die("Can't
>connect: " . mysql_error());
>$db = mysql_select_db('test');
>
>// Clear table if it existed
>mysql_query('DROP TABLE IF EXISTS test');
>mysql_query('CREATE TABLE test (groupid INT NOT NULL, ' .
>            'userid INT NOT NULL, PRIMARY KEY(groupid, userid))');

Your script doesn't test the same scenario at all. The table you created is guaranteed not to have any duplicates because you defined a PRIMARY KEY. This is exactly what you've been arguing against doing all this time, so you've basically demonstrated why uniqueness is a good thing.

Mitch
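The six-row walkthrough above can be checked side by side with a keyed version of the same table. This is a sketch under assumptions: SQLite via Python's sqlite3 stands in for MySQL, the helper run is a made-up name, and INSERT OR IGNORE is SQLite's spelling (MySQL's counterpart is INSERT IGNORE), used here only so the repeated rows are skipped instead of raising an error.

```python
import sqlite3

QUERY = (
    "SELECT a.entry_id FROM test a "
    "INNER JOIN test b ON a.entry_id = b.entry_id "
    "WHERE a.cat_id = 30 AND b.cat_id = 35"
)
# Rows A through F from the post.
ROWS = [(2, 30), (2, 35), (2, 30), (2, 35), (2, 30), (2, 35)]


def run(create_sql, insert_sql):
    """Create a fresh table, load the six rows, and run the self-join."""
    conn = sqlite3.connect(":memory:")
    conn.execute(create_sql)
    conn.executemany(insert_sql, ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


# No key: all six rows are stored, so 3 x 3 pairs match.
no_key = run(
    "CREATE TABLE test (entry_id INT, cat_id INT)",
    "INSERT INTO test VALUES (?, ?)",
)
# Composite key: repeats are ignored, leaving (2, 30) and (2, 35) only.
with_key = run(
    "CREATE TABLE test (entry_id INT, cat_id INT, "
    "PRIMARY KEY (entry_id, cat_id))",
    "INSERT OR IGNORE INTO test VALUES (?, ?)",
)
print(len(no_key), len(with_key))
```

The unkeyed table reproduces the 9 rows reported in the post, while the keyed table yields a single row for entry_id 2.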
|
|||
|
Mitch Sherman wrote:
> On Thu, 15 May 2008 11:55:29 -0400, Jerry Stuckle
> <jstucklex@attglobal.net> wrote:
>> vkayute@gmail.com wrote:
>>> On May 15, 12:27 am, Corey Jansen <ccj9...@gmail.com> wrote:
>>>> Jerry's approach results in a "cartesian explosion."
>>> Which is exactly the problem database normalization is designed to
>>> prevent.
>>>
>>> If only Mr. Stuckle had listened to what 10 people told him already.
>>> Obstinacy is his best policy it seems :)
>>>
>>> Yet another lesson in "Why You Should Use Proper Database Design."
>>>
>> I'm not arguing about proper database design. My only comment is it is
>> IMPOSSIBLE to determine if the database is normalized or not from the
>> given information.
>
> That doesn't mean that the relation can't be normalized first. That
> seems to be the critical point you're missing.
>

No, the critical point YOU'RE MISSING is that the table may be normalized - AND STILL HAVE DUPLICATES IN THESE COLUMNS. That is the critical point!

> You seem to be arguing that it's better to build on a potentially flawed
> database design rather than get it right first, which is terrible
> advice.
>

No, I'm not. There is nothing flawed about a design which has three columns (of which these are only two) determining the primary key (or other unique value).

>> There could be one or more additional columns to determine uniqueness, for instance.
>>
>> And people wonder why I send folks to comp.databases.mysql for MySQL
>> questions - that's where the REAL experts hang out.
>
> This is a pointless hypothetical. If you have N columns, you can still
> maintain uniqueness across those columns. That doesn't require
> duplicate rows any more than the original problem which had only 2
> columns.
>
> Mitch
>

No, it is not pointlessly hypothetical. It is very germane to this situation.

We do not have all of the information - the complete database design, usage, etc. The other column(s) may not be germane to the problem, so the original op did not list them. That is quite common - and correct - as it does not confuse the issue at hand with irrelevant data.

There may very well have been 2 columns - or 20 columns or even 200 columns. You don't know which is correct.

For instance, here's a table which could very well be the case:

userid  groupid  permission
1       1        read
1       1        write
1       1        delete
1       2        read
1       3        read

This is a commonly used design. The permission column is not pertinent to the original op's question - so it wouldn't be listed. But Peter's query will fail if it looks for someone who is a member of groups 1 and 2. The correct query works in this case just fine.

My God, I've never seen someone so insistent about making false assumptions about someone else's code - and so stubborn about sticking to a bad suggestion.

I really suggest you learn some more advanced sql - actually, the correct answer isn't even advanced level. I'm not sure it even makes intermediate level. The correct query works 100% of the time - whether there are duplicates or not.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================
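To see how the self-join behaves on the permission table above, here is a small sketch. It assumes SQLite via Python's sqlite3 in place of MySQL, and the table name perms is made up. Peter's query is not shown in this part of the thread, so the sketch does not try to reproduce it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE perms (userid INT, groupid INT, permission TEXT)")
# The five rows from the example table in the post.
conn.executemany(
    "INSERT INTO perms VALUES (?, ?, ?)",
    [
        (1, 1, "read"),
        (1, 1, "write"),
        (1, 1, "delete"),
        (1, 2, "read"),
        (1, 3, "read"),
    ],
)

# Users who belong to both group 1 and group 2, found via a self-join.
JOIN = (
    "FROM perms a INNER JOIN perms b ON a.userid = b.userid "
    "WHERE a.groupid = 1 AND b.groupid = 2"
)
plain = conn.execute("SELECT a.userid " + JOIN).fetchall()
distinct = conn.execute("SELECT DISTINCT a.userid " + JOIN).fetchall()
print(plain, distinct)
```

The join does find user 1, but once per matching pair of rows (three group-1 rows times one group-2 row), so a caller that wants each member once needs DISTINCT or an equivalent grouping.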