sql update – MYSQL use of CONCAT and SUBSTRING with http and https URLS in database

sql update – MYSQL use of CONCAT and SUBSTRING with http and https URLS in database

If all your links to be changed contain the substring .php? you can do this:

UPDATE content_temp SET link = SUBSTRING_INDEX(link, .php?, -1)
      WHERE link LIKE %.php?%

Pro tip: Its always a good idea to put a WHERE clause on an UPDATE operation: without it SQL updates all rows in the table. That can trash the table, and it can swamp the transaction log.

It looks like you have a total of four possible combinations were looking at.
The first can be either http or https, and the second can be either http or https. Four possible combinations:

http:/  http:/
http:/  https:/
https:/ http:/
https:/ https:/

A couple of suggestions:

Firstly, before we execute an UPDATE statement that were unsure that it is going to do what we intend it to do, we should write a SELECT statement to test the expressions. That lets us see the results of the expression, so we can verify that it is doing what we intend it to, under a variety of test conditions.

Secondly, it should be be possible to separate out the combinations (the four possibilities). If the changes we are applying is removing the first http/https leaving only one occurrence… then a modification we apply to one of the subsets wont make a row then move into a different subset. (I understood what I meant to say there, it probably came across garbled.)

If Ive got row like this

http 1   https 2 

and I change that with an update to be

https 2 

Then a subsequent run through the rows, thats not going to match a check for

https  https 

because theres only one occurrence of http in the string.


Lets assume that were only interested in rows where the link string contains two occurrences of http:// / https://

We could use a regular expression do some matching, or we can fudge it with some LIKE comparisons

SELECT t.link
     , t.link LIKE %http://%http://%   AS c1 
     , t.link LIKE %http://%https://%  AS c2 
     , t.link LIKE %https://%http://%  AS c3 
     , t.link LIKE %https://%https://% AS c4
  FROM (

SELECT http://somedomain.com/work-a-link.php?http://someotherdomain.com/thisisworthkeeping.php AS link
UNION ALL
SELECT http://somedomain.com/work-a-link.php?https://someotherdomain.com/thisisworthkeeping.php
UNION ALL
SELECT https://somedomain.com/work-a-link.php?http://someotherdomain.com/thisisworthkeeping.php
UNION ALL 
SELECT https://somedomain.com/work-a-link.php?https://someotherdomain.com/thisisworthkeeping.php

       ) t
 WHERE t.link LIKE %http%://%http%://%
   AND t.link NOT LIKE %http%://%http%://%http%://%  

returns something like this (with some string replacement in the link values to shorten them here) …

link                                                           c1  c2  c3  c4
-------------------------------------------------------------- --  --  --  --
http://somedomain.com/walp?http://someotherdomain.com/tiwkp     1   0   0   0
http://somedomain.com/walp?https://someotherdomain.com/tiwkp    0   1   0   0
https://somedomain.com/walp?http://someotherdomain.com/tiwkp    0   0   1   0
https://somedomain.com/walp?https://someotherdomain.com/tiwkp   0   0   0   1

Weve include a condition in the WHERE clause to filter out any URL where there are more than two occurrences of http[s]://. (Theres a little bit of fudging in there, the % wildcard could match any number of characters, so were not exactly checking for https:// and http:// … (again, we could implement a regular expression (REGEXP or RLIKE) comparison to be more exact.

But notice how weve managed to categorize the values link into c1, c2, c3, and c4.

Confirming that each link falls into a single category,
We can include an additional expression in the SELECT list

     , CASE 
       WHEN t.link LIKE %http://%http://%   THEN c1
       WHEN t.link LIKE %http://%https://%  THEN c2
       WHEN t.link LIKE %https://%http://%  THEN c3 
       WHEN t.link LIKE %https://%https://% THEN c4
       ELSE NULL
       END

If we find a matching WHEN condition, we return the THEN, and were done. (If there is an overlap, a link that falls into more than one category, we wont see it in this expression.)

link                                                            c   c1  c2  c3  c4
-------------------------------------------------------------- --  --  --  --  --
http://somedomain.com/walp?http://someotherdomain.com/tiwkp    c1   1   0   0   0
http://somedomain.com/walp?https://someotherdomain.com/tiwkp   c2   0   1   0   0
https://somedomain.com/walp?http://someotherdomain.com/tiwkp   c3   0   0   1   0
https://somedomain.com/walp?https://someotherdomain.com/tiwkp  c4   0   0   0   1

We can test with a variety of link values, various patterns, and verify that our matching categorization is working the way we intend.

(This technique of using a SELECT statement to test expressions, and especially functions were not familiar with… with SUBSTRING_INDEX, what gets returned when the search string isnt found? is the REPLACE function case sensitive? and so on.)

We can adjust and tweak, experiment and find the right combinations to get it working the way we want. Once we get that,

We can include yet another expression in the SELECT list. Well copy the one we just added, but this time, instead of returning a literal, we will include a different expression, one that exercises the SUBSTRING_INDEX and REPLACE functions.

Since were running a SELECT we know that were not going to mangle/munge the contents of the table. Were just testing some expressions, to see what they return.

     , CASE 
       WHEN t.link LIKE %http://%http://%    -- c1
       THEN CONCAT(http://, SUBSTRING_INDEX(t.link, http://, -1))

       WHEN t.link LIKE %http://%https://%   -- c2
       THEN CONCAT(https://, SUBSTRING_INDEX(t.link, https://, -1))

       WHEN t.link LIKE %https://%http://%   -- c3 
       THEN CONCAT(http://, SUBSTRING_INDEX(t.link, http://, -1))

       WHEN t.link LIKE %https://%https://%  -- c4
       THEN CONCAT(https://, SUBSTRING_INDEX(t.link, https://, -1))

       ELSE NULL
       END AS new_link

And pretty soon well be getting back a result like this…

link                                                           new_link 
-------------------------------------------------------------  ---------------------------------
http://somedomain.com/walp?http://someotherdomain.com/tiwkp    http://someotherdomain.com/tiwkp
http://somedomain.com/walp?https://someotherdomain.com/tiwkp   https://someotherdomain.com/tiwkp
https://somedomain.com/walp?http://someotherdomain.com/tiwkp   http://someotherdomain.com/tiwkp
https://somedomain.com/walp?https://someotherdomain.com/tiwkp  https://someotherdomain.com/tiwkp

Once we have an expression working that is returning a new_link value that we want to assign to the column, in place of the link value,

(and were running this against a test copy of the table first)

We can convert our SELECT statement into an UPDATE

replace SELECT ... FROM with UPDATE

and add a SET clause to assign the new_link expression to link

(replace the NULL in the ELSE with a reference to the column, so in the event we fall through all of the conditions were checking in the CASE , we wont be changing that row …)

UPDATE mytesttable t
   SET t.link
     = CASE 
       WHEN t.link LIKE %http://%http://%    -- c1
       THEN CONCAT(http://, SUBSTRING_INDEX(t.link, http://, -1))

       WHEN t.link LIKE %http://%https://%   -- c2
       THEN CONCAT(https://, SUBSTRING_INDEX(t.link, https://, -1))

       WHEN t.link LIKE %https://%http://%   -- c3 
       THEN CONCAT(http://, SUBSTRING_INDEX(t.link, http://, -1))

       WHEN t.link LIKE %https://%https://%  -- c4
       THEN CONCAT(https://, SUBSTRING_INDEX(t.link, https://, -1))

       ELSE t.link
       END
 WHERE t.link LIKE %http%://%http%://%  
   AND t.link NOT LIKE %http%://%http%://%http%://%  

But before we run the UPDATE, we should test our expressions with a variety of link values, including edge and corner cases.

And again note that those LIKE comparisons I wrote have the potential to match wonky rows we might not want to match e.g. http://BLAH http DERP :// flurb http://.

Looks like maybe we should also be checking for a question mark before the second occurrences of http.

We would get more accurate pattern matching using regular expressions, REGEXP (RLIKE) comparisons.


The theme remains the same… test expressions with SELECT first, and then run the UPDATE.

sql update – MYSQL use of CONCAT and SUBSTRING with http and https URLS in database

Leave a Reply

Your email address will not be published. Required fields are marked *